What reasoning model are you actually running in production?

Reddit r/AI_Agents News

Summary

A practitioner seeks real-world feedback on reasoning models like o3, Claude extended thinking, Gemini 2.5 Pro, and Ring 2.6 1T for production agent tasks, questioning the practical performance of Ring's dual-reasoning-effort modes versus benchmarks.

I need to pick a reasoning model for production agent work. The usual suspects are obvious (o3, Claude extended thinking, Gemini 2.5 Pro), but I'm also looking at Ring 2.6 1T, which has two reasoning effort modes: high for fast multi-step agent loops and xhigh for harder problems. The dual-mode approach appeals to me because not every agent call needs maximum reasoning depth. But I can't find much real-world feedback on it. The benchmarks exist (PinchBench 87.60, Tau2-Bench Telecom 95.32), but I don't trust benchmarks to tell me how it handles real multi-step agent tasks with messy intermediate states. How does the high/xhigh split work in practice? Is the speed difference noticeable? Does it stay stable on longer agent runs?
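For anyone who wants to measure this rather than trust the benchmarks, here is a minimal sketch of how the two questions above could be tested. It is not tied to any real SDK: `call_model` is a hypothetical callable you would wrap around whatever client you use, the `"high"`/`"xhigh"` strings just mirror the mode names in the post, and the routing rule in `pick_effort` (the `retries` and `needs_planning` fields) is an invented heuristic, not something the vendor documents.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: (prompt, effort) -> response text.
# Wrap your real client in a function of this shape; this is not a real SDK call.
ModelFn = Callable[[str, str], str]


def pick_effort(step: dict) -> str:
    """Assumed heuristic: escalate to "xhigh" only when a step looks hard.

    The "retries" and "needs_planning" keys are illustrative fields an agent
    loop might track; they are not part of any published API.
    """
    if step.get("retries", 0) >= 2 or step.get("needs_planning", False):
        return "xhigh"
    return "high"


def compare_latency(
    call_model: ModelFn,
    prompts: List[str],
    efforts: Sequence[str] = ("high", "xhigh"),
) -> Dict[str, float]:
    """Return the median wall-clock seconds per effort mode over the same prompts."""
    results: Dict[str, float] = {}
    for effort in efforts:
        timings = []
        for prompt in prompts:
            start = time.perf_counter()
            call_model(prompt, effort)
            timings.append(time.perf_counter() - start)
        results[effort] = statistics.median(timings)
    return results
```

Running `compare_latency` on prompts captured from your own agent traces (messy intermediate states included) gives a per-mode median you can compare directly, and `pick_effort` shows one way to keep most loop iterations on the cheaper mode. It says nothing about stability on long runs; for that you would still need to replay full multi-step sessions.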
