Your coding agent didn't get worse. You just never measured the first version.

Reddit r/AI_Agents News

Summary

The article argues that perceived degradation in coding agents is often due to untracked changes in agent instances and configuration rather than the underlying model itself, highlighting a critical lack of baseline measurement in current AI agent workflows.

There's a pattern I keep seeing in agent discussions lately: someone reports their coding agent "got worse" over a few weeks. The replies split into two camps: "yes, model updates broke it" vs. "you're imagining it, the model is the same." Both camps are missing the actual thing.

The model probably is the same. But the agent instance you're running today is not the same as the one from six weeks ago: different context window contents, different session history, different harness configuration, small accumulated decisions that compound. Same model. Different behavior. And you have no baseline to compare against, because you never measured the first one.

This is the structural problem with how we're deploying coding agents right now: the model name is treated as the unit of measurement. "We use Claude Code" or "we switched to Codex," as if the model name tells you something about what that specific agent did in your monorepo over the last sprint. It doesn't. Two engineers running the same model on the same codebase, with different harness setups and different session patterns, are running different agents.

When one of those instances "gets worse," the right question is not "did the model change?" It's: what changed in this instance's behavior profile, and how would you know?

The engineers with the clearest picture of this are the ones keeping records at the instance level. Not "Claude Code is good at refactoring" but "this instance, on this codebase, over these 30 sessions, here is where it earned trust and here is where it didn't."

How are you currently tracking behavioral drift across agent sessions?
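For concreteness, here is a minimal sketch of what instance-level record keeping could look like. Everything in it is hypothetical and not part of any agent framework: the SessionRecord fields, the agent_sessions.jsonl log, and the drift_report helper are just one way to tie each session to a specific instance (model label, harness config, codebase) and its outcomes, so "it got worse" becomes a comparison against a stored baseline instead of a memory.

    import json
    import statistics
    from dataclasses import dataclass, asdict
    from pathlib import Path

    LOG = Path("agent_sessions.jsonl")  # hypothetical per-instance log, one JSON object per session

    @dataclass
    class SessionRecord:
        # Identify the instance, not just the model.
        model: str                # the label you were given, not a behavioral guarantee
        harness_config_hash: str  # hash of the harness/settings files in effect for this session
        repo: str                 # which codebase this instance was working in
        # Outcomes you actually care about on this codebase.
        tasks_attempted: int
        tasks_accepted: int       # changes kept without rework
        reverts: int              # changes you had to undo
        human_interventions: int  # times you stepped in mid-task

    def log_session(rec: SessionRecord) -> None:
        """Append one session's record to the instance log."""
        with LOG.open("a") as f:
            f.write(json.dumps(asdict(rec)) + "\n")

    def load_sessions() -> list[SessionRecord]:
        lines = LOG.read_text().splitlines() if LOG.exists() else []
        return [SessionRecord(**json.loads(line)) for line in lines if line.strip()]

    def acceptance_rates(records: list[SessionRecord]) -> list[float]:
        return [r.tasks_accepted / r.tasks_attempted for r in records if r.tasks_attempted]

    def drift_report(records: list[SessionRecord], baseline_n: int = 10) -> str:
        """Compare the most recent sessions against the first baseline_n sessions."""
        if len(records) < 2 * baseline_n:
            return "not enough sessions for a baseline yet"
        base = statistics.mean(acceptance_rates(records[:baseline_n]))
        recent = statistics.mean(acceptance_rates(records[-baseline_n:]))
        return f"baseline acceptance {base:.0%} -> recent {recent:.0%} ({recent - base:+.0%})"

Thirty sessions of records like this turn "this instance, on this codebase, over these 30 sessions" from a feeling into something you can query, and a drifting acceptance rate points you at the harness config hash that changed rather than at the model name.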

Similar Articles

I analyzed how 50+ AI teams debug production agent failures, and what I found surprised me

Reddit r/AI_Agents

Based on interviews with 50+ AI teams, the author highlights that production agent failures often stem from minor prompt or configuration issues rather than deep model problems. The article advocates for adopting software engineering practices like versioning, A/B testing, and experiment tracking to improve reliability.