Better Models: Worse Tools

Simon Willison's Blog News

Summary

Newer Anthropic models such as Opus 4.8 and Sonnet 5 are worse than older models at using third-party edit tools (e.g., Pi's). The likely cause is that they were trained via RL to use Claude Code's built-in edit tool, so they sometimes invent extra fields in tool calls, which the harness then rejects.


Cached at: 07/05/26, 10:28 AM

# Better Models: Worse Tools

Source: [https://simonwillison.net/2026/Jul/4/better-models-worse-tools/](https://simonwillison.net/2026/Jul/4/better-models-worse-tools/)

4th July 2026 - Link Blog

**[Better Models: Worse Tools](https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/)**. Armin reports on a weird problem he ran into while hacking on Pi:

> The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested `edits[]` array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again. That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.

Armin theorizes that this is because more recent Anthropic models have been specifically trained (presumably via Reinforcement Learning) to better use the edit tools that are baked into Claude Code. This has the unfortunate effect that other coding harnesses, such as Pi, may find that their own custom edit tools are more likely to be used incorrectly.

Claude's edit tool [uses search and replace](https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#str-replace). OpenAI's Codex [uses an apply_patch mechanism instead](https://developers.openai.com/api/docs/guides/tools-apply-patch), and OpenAI have talked in the past about how their models are trained to use that tool effectively.

Does this mean third-party coding harnesses like Pi should implement multiple edit tools just so they can use the one with the best performance for the underlying model the user has selected?
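To make the failure mode concrete, here is a minimal sketch of the kind of strict argument validation that would reject a tool call carrying invented keys inside a nested `edits[]` array. It is not Pi's code: the field names (`path`, `old_text`, `new_text`), the example extra key `replace_all`, and the function `validate_edit_call` are all hypothetical, chosen only to illustrate the behavior described in the quote.

```python
from dataclasses import dataclass, field

# Hypothetical schema: these field names are illustrative assumptions,
# not the actual schema of Pi's edit tool.
ALLOWED_TOP_LEVEL = {"path", "edits"}
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}


@dataclass
class ValidationResult:
    ok: bool
    errors: list = field(default_factory=list)


def validate_edit_call(args: dict) -> ValidationResult:
    """Strictly validate edit-tool arguments, rejecting unknown keys."""
    errors = []

    # Unknown top-level keys are rejected rather than silently ignored.
    for key in sorted(set(args) - ALLOWED_TOP_LEVEL):
        errors.append(f"unexpected field '{key}'")

    if not isinstance(args.get("path"), str):
        errors.append("'path' must be a string")

    edits = args.get("edits")
    if not isinstance(edits, list):
        errors.append("'edits' must be an array")
        return ValidationResult(False, errors)

    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        # This is the case from the article: the edit is fine,
        # but the model added keys the schema does not define.
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected field '{key}'")
        for key in sorted(ALLOWED_EDIT_KEYS - set(edit)):
            errors.append(f"edits[{i}]: missing field '{key}'")

    return ValidationResult(not errors, errors)


if __name__ == "__main__":
    call = {
        "path": "app.py",
        "edits": [{"old_text": "foo", "new_text": "bar", "replace_all": True}],
    }
    print(validate_edit_call(call).errors)
```

In a harness shaped like this, the error list could be sent back to the model as the retry message. Whether to reject unknown keys or to drop them and apply the otherwise valid edit is a design choice the sketch leaves open.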
