Your agent gets dumber the longer a session runs

Reddit r/AI_Agents News

Summary

The article discusses how AI agent performance degrades over long sessions due to context window clutter from raw history, tool outputs, and repeated reasoning, and suggests solutions like summarizing old turns and trimming tool outputs to extend useful run length.

You give your agent a long task. The first handful of steps are clean, then somewhere past step ten it starts slipping. It re-runs a tool it already called, drops an instruction from the very top of the thread, and starts repeating its own reasoning back to itself. Same model that nailed step one. What changed is everything that piled into the context window by the time it slipped. Trace one of these runs and tag each step with how deep it is in tokens. The drop in quality usually lines up with the window filling up, and by that point it is packed with three things: The full raw history, re-injected every turn. The instructions from the start are still in there, buried under thousands of tokens of everything that happened since. Tool outputs dumped in whole. One search or one file read drops a giant JSON blob into context, and most of those fields never get read again. The agent's own reasoning, fed back and built on every turn, so an early wobble compounds as the run goes on. What held up for us was cutting the noise at the source: Summarize old turns once they are settled, so the decision stays and the raw back-and-forth drops out. Trim tool outputs down to the fields the agent actually reads, before they ever reach context. Pin the core instructions near the end of the window, where attention holds up best deep into a run. Across the runs we looked at, "the model can't handle long tasks" turned into "the model was drowning in transcript" far more often than not. Same model, much longer useful runs once the window stopped filling with noise. Curious how others deal with the slip deep in a run. Are you keeping the original instructions alive by summarizing old turns, re-pinning the system prompt, something else? And is anyone measuring the in-session quality drop directly, or only checking the final answer?
Original Article

Similar Articles

AI agents feel impressive until the workflow gets messy

Reddit r/AI_Agents

A reflection on AI agents: impressive for narrow supervised tasks but fragile and unreliable in long-running, messy workflows due to issues like session expiration, context drift, and silent failures.