Better Models: Worse Tools

Simon Willison's Blog 07/04/26, 10:53 PM News

model-behavior tool-use anthropic claude pi editing-tools regression

Summary

Newer Anthropic models such as Opus 4.8 and Sonnet 5 are worse than their older siblings at using third-party editing tools such as Pi's: they sometimes invent extra fields in tool calls, which then fail schema validation. Armin Ronacher theorizes that this is because the newer models were trained, presumably via reinforcement learning, to use the edit tool built into Claude Code.

Original Article

Cached at: 07/05/26, 10:28 AM

# Better Models: Worse Tools

Source: [https://simonwillison.net/2026/Jul/4/better-models-worse-tools/](https://simonwillison.net/2026/Jul/4/better-models-worse-tools/)

4th July 2026 - Link Blog

**[Better Models: Worse Tools](https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/)**. Armin reports on a weird problem he ran into while hacking on Pi:

> The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested `edits[]` array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again.
>
> That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.

Armin theorizes that this is because more recent Anthropic models have been specifically trained (presumably via Reinforcement Learning) to better use the edit tools that are baked into Claude Code. This has the unfortunate effect that other coding harnesses, such as Pi, may find that their own custom edit tools are more likely to be used incorrectly.

Claude's edit tool [uses search and replace](https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#str-replace). OpenAI's Codex [uses an apply_patch mechanism instead](https://developers.openai.com/api/docs/guides/tools-apply-patch), and OpenAI have talked in the past about how their models are trained to use that tool effectively.

Does this mean third-party coding harnesses like Pi should implement multiple edit tools just so they can use the one with the best performance for the underlying model the user has selected?
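To make the failure mode concrete, here is a minimal Python sketch of a harness that strictly validates a search-and-replace edit call with a nested `edits[]` array. It is not Pi's actual code or schema: the key names `path`, `old_text`, and `new_text`, the `ToolCallError` class, and the rule that each `old_text` must match exactly once are all assumptions made for illustration.

```python
# Hypothetical schema: these key names are illustrative, not Pi's real ones.
ALLOWED_TOP_LEVEL = {"path", "edits"}
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}


class ToolCallError(ValueError):
    """Raised when a tool call does not match the (hypothetical) schema."""


def validate_edit_call(args: dict) -> list[tuple[str, str]]:
    """Strictly validate an edit call; any unknown or missing key is an error."""
    extra = set(args) - ALLOWED_TOP_LEVEL
    if extra:
        raise ToolCallError(f"unknown top-level keys: {sorted(extra)}")
    missing = ALLOWED_TOP_LEVEL - set(args)
    if missing:
        raise ToolCallError(f"missing top-level keys: {sorted(missing)}")

    edits = args["edits"]
    if not isinstance(edits, list) or not edits:
        raise ToolCallError("edits must be a non-empty list")

    pairs = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            raise ToolCallError(f"edits[{i}] must be an object")
        # This is the check that an invented field would trip.
        extra = set(edit) - ALLOWED_EDIT_KEYS
        if extra:
            raise ToolCallError(f"edits[{i}]: unknown keys {sorted(extra)}")
        missing = ALLOWED_EDIT_KEYS - set(edit)
        if missing:
            raise ToolCallError(f"edits[{i}]: missing keys {sorted(missing)}")
        pairs.append((edit["old_text"], edit["new_text"]))
    return pairs


def apply_edits(text: str, edits: list[tuple[str, str]]) -> str:
    """Apply search-and-replace edits in order.

    Assumption of this sketch: each old_text must occur exactly once.
    """
    for old, new in edits:
        count = text.count(old)
        if count != 1:
            raise ToolCallError(f"expected one match for {old!r}, found {count}")
        text = text.replace(old, new, 1)
    return text
```

Under this sketch, a call whose `edits[0]` carries an extra key (say a made-up `reason` field, chosen here only as an example) raises `ToolCallError` even though its `old_text` and `new_text` are correct, which mirrors the reject-and-retry behaviour described in the quote.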
