Better Models: Worse Tools
Summary
Newer Anthropic models such as Opus 4.8 and Sonnet 5 are worse than older models at using third-party editing tools (e.g., Pi's), likely because RL training on Claude Code's built-in edit tool leads them to invent extra fields in tool calls.
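To make the failure mode concrete, here is a minimal sketch of strict argument validation for an edit tool. The schema, the field names (`path`, `old_text`, `new_text`), and the invented `replace_all` field are all hypothetical and chosen for illustration; they are not taken from Pi or Claude Code.

```python
# Hypothetical edit-tool schema. Field names are illustrative assumptions,
# not the actual schema of Pi or of Claude Code's edit tool.
EDIT_TOOL_SCHEMA = {
    "required": {"path", "old_text", "new_text"},
    "optional": set(),
}


def validate_tool_args(schema, args):
    """Return a list of validation errors; an empty list means the call is valid."""
    provided = set(args)
    allowed = schema["required"] | schema["optional"]
    errors = []
    # Required fields the model left out.
    for name in sorted(schema["required"] - provided):
        errors.append(f"missing required field: {name}")
    # Fields the model supplied that the schema does not define.
    for name in sorted(provided - allowed):
        errors.append(f"unexpected field: {name}")
    return errors


# A well-formed call, and the same call with one invented extra field.
valid_call = {"path": "app.py", "old_text": "foo", "new_text": "bar"}
call_with_extra_field = dict(valid_call, replace_all=True)
```

Under these assumptions, `validate_tool_args(EDIT_TOOL_SCHEMA, valid_call)` returns an empty list, while the second call is rejected with an "unexpected field" error. A harness could instead drop unknown fields silently, but that risks hiding what the model intended.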
Cached at: 07/05/26, 10:28 AM
Similar Articles
Better Models: Worse Tools
Newer Claude models (Opus 4.8 and Sonnet 5) invent extra fields in tool invocation arguments, which causes validation failures and marks a regression from older models.
@yibie: Recommended article: Flask author Armin Ronacher, while tracking a Pi bug, discovered a troubling fact: the tool calling of the new Claude models (Opus 4.8, Sonnet 5) is getting worse rather than better. And he found the root cause: RL...
Flask author Armin Ronacher found that the tool-calling ability of the new Claude models (Opus 4.8, Sonnet 5) is degrading. The root cause is that RL post-training over-adapts the models to Claude Code's own tool schema, so they find it increasingly difficult to generate correct calls for alternative tool schemas. The article shows that models can get worse rather than better in specific tool-calling scenarios, an important caution for agent development.
What's new in Claude Sonnet 5
Anthropic released Claude Sonnet 5, a model with performance near Opus 4.8 at lower prices, but featuring a new tokenizer that increases token counts for English and code by ~30%, effectively raising costs.
@kapicode: I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the sa…
An evaluation of four LLMs (including Qwen, MiniMax, and GLM models), using Claude as the prompter for the Opencode agent tool, finds that a smaller local model (Qwen 27B on a 3090) outperforms a larger pruned model in coding quality and reliability.
Models Are Hitting Diminishing Returns Within Software Engineering
A distinguished engineer at a hyperscaler argues that AI models are hitting diminishing returns in software engineering tasks, as he finds little difference between Claude's Fable 5 and previous Opus models, and predicts local models will soon provide comparable value.