Your agent keeps failing after you upgrade the model. Cursor's engineering notes explain why.

Reddit r/AI_Agents 05/22/26, 10:09 PM News

agent-failures harness cursor reliability engineering-notes scaffolding tool-format

Summary

Cursor's engineering notes reveal that agent failures often stem from the harness (scaffolding) rather than the model itself, with different tool formats across providers causing silent errors and reliability issues.

Something I keep seeing in this sub and in my own builds: the agent fails, you swap the model, and it still fails the same way. So you swap again. Cursor just published engineering notes on why that loop never works. They achieved a 10x reduction in tool errors for the same model by rewriting the harness. Same API calls, same weights, different scaffolding.

A few things from those notes that changed how I think about debugging:

The tool format problem explains a lot of silent failures. Claude models train on string replacement. OpenAI models train on patch-based diffs. Giving Claude a patch-format tool works, but it forces real-time translation that burns reasoning capacity and generates more errors. Cursor builds separate harnesses per provider because of this.

The compounding reliability math is worse than people expect. Five agents at 95% each, chained together, give 0.95^5 ≈ 77.4% end-to-end success. That is close to one failed run in four. Without checkpointing and rollback in the harness, that compounds silently.

SWE-Bench Pro isolated this cleanly. Claude Opus 4.5 under a minimal standardised harness: 45.9%. Same model under Claude Code's custom harness: 55.4%. Same tasks, same model.

Their conclusion: "Two years ago, the harness was a small piece of agent quality. Now it's the most important agent quality."

For people actively building agents: where are the failures hiding when the model clearly isn't the problem?
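For anyone who wants to check the compounding arithmetic, here is a minimal sketch. It assumes each step succeeds independently with the same probability, which real agent chains may not satisfy. The function names are made up for illustration, and the retry model (a checkpoint letting a failed step be re-attempted independently) is my own simplifying assumption, not something taken from Cursor's notes.

```python
def chain_success(step_success: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds,
    assuming the steps are independent."""
    return step_success ** steps


def with_retries(step_success: float, retries: int) -> float:
    """Per-step success if a checkpoint allows up to `retries` extra
    attempts, assuming each attempt is independent (hypothetical model)."""
    return 1 - (1 - step_success) ** (retries + 1)


# Five chained steps at 95% each, no recovery in the harness.
baseline = chain_success(0.95, 5)

# Same chain, but each step may be retried once from a checkpoint.
one_retry = chain_success(with_retries(0.95, 1), 5)

print(f"no retries:         {baseline:.1%}")
print(f"one retry per step: {one_retry:.1%}")
```

Under those assumptions the no-retry figure matches the 77.4% in the post, and a single retry per step lifts the chain to roughly 98.8%. Treat the second number as an upper-bound illustration, since real retries are rarely independent of the first failure.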

Similar Articles

The agent bug I thought was the model turned out to be the harness

Your harness is failing your agent but there's no benchmark to prove it

Your agent isn't failing because of the model, it's failing because nobody built a stop button

Your AI agent isn't broken. Your harness is. Here's the system that took mine from "liability" to shipping production code.

The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade
