Your agent keeps failing after you upgrade the model. Cursor's engineering notes explain why.

Reddit r/AI_Agents News

Summary

Cursor's engineering notes reveal that agent failures often stem from the harness (scaffolding) rather than the model itself, with different tool formats across providers causing silent errors and reliability issues.

Something I keep seeing in this sub and in my own builds: agent fails, swap the model, still fails the same way. Swap again. Cursor just published engineering notes on why that loop never works. They achieved a 10x reduction in tool errors for the same model by rewriting the harness. Same API calls, same weights, different scaffolding. A few things from those notes that changed how I think about debugging: The tool format problem explains a lot of silent failures. Claude models train on string replacement. OpenAI models train on patch-based diffs. Giving Claude a patch-format tool works, but forces real-time translation that burns reasoning capacity and generates more errors. Cursor builds separate harnesses per provider because of this. The compounding reliability math is worse than people expect. Five agents at 95% each chained together: 77.4% end-to-end success rate. One failure in four. Without checkpointing and rollback in the harness, that compounds silently. SWE-Bench Pro isolated this cleanly. Claude Opus 4.5 under a minimal standardised harness: 45.9%. Same model under Claude Code's custom harness: 55.4%. Same tasks, same model. Their conclusion: "Two years ago, the harness was a small piece of agent quality. Now it's the most important agent quality." For people actively building agents: where are the failures hiding when the model clearly isn't the problem?
Original Article

Similar Articles