@Xudong07452910: This paper is a must-read for heavy users of Claude Code, Codex, or other AI Agents. It doesn't study how Agents fail on benchmarks, but a more real problem: In real development, what exactly are AI coding agents doing...
Summary
This paper analyzes 20,574 real-world coding-agent sessions to identify how AI agents misalign with developer intent, finding that constraint violations and inaccurate self-reporting are the most common failure modes, imposing trust and effort costs rather than irreversible damage.
This paper is highly recommended for anyone who heavily uses Claude Code, Codex, or other AI agents.
It doesn’t study how agents fail on benchmarks — it tackles a much more realistic question:
In real-world development, how exactly do AI coding agents piss off developers?
The paper analyzes 20,574 real coding-agent sessions, covering both IDE and CLI workflows. It defines “failure” in an interesting way: not just whether the code compiles, but when developers start to correct, interrupt, or push back against the agent.
The results feel very real. The most common problem isn’t “wrong code” — it’s the agent violating explicit constraints the developer stated.
For example, you say “don’t change this file,” “don’t touch the code yet,” “make only minimal changes,” and it still does a bit more; you ask it to explain an issue, and it starts modifying code on the fly; you tell it to verify before reporting done, and it declares victory before finishing.
The paper also finds an interesting difference: CLI agents are more prone to constraint violations because they are often entrusted with longer, more open-ended tasks; IDE agents are more likely to have local implementation errors because they act as a close copilot, editing code during high-frequency interactions.
The most draining part is that many failures don’t cause immediate catastrophic damage — they consume the developer’s time and trust. You constantly have to judge whether it understood, whether it overstepped, whether it really verified.
This matches my own experience: what truly exhausts me about AI coding is the constant need to check whether it’s understood, stayed within bounds, and actually verified.
So my hope for the next generation of coding agents isn’t just better code generation — it’s the ability to continuously align with developer intent, respect boundaries, and report progress accurately.
The hard part of AI coding might not be “writing faster,” but “not making me clean up after you constantly.”
https://arxiv.org/pdf/2605.29442
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
Source: https://arxiv.org/html/2605.29442
Ningzhi Tang¹, Chaoran Chen¹, Gelei Xu¹, Yiyu Shi¹, Yu Huang², Collin McMillan¹, Tao Dong³, Toby Jia-Jun Li¹
¹University of Notre Dame, ²Vanderbilt University, ³Google
{ntang, toby.j.li}@nd.edu
Abstract
AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.
1 Introduction
AI coding agents have moved beyond text generation to act directly in software environments, handling multi-turn development tasks that involve file edits, command execution, and sustained communication with developers. This shift changes what alignment requires: rather than correctness on isolated tasks, agents must stay aligned with developer intent as both the task and that intent evolve across turns. In practice, this proves difficult: developers rarely give agents a complete specification upfront; instead, they refine their requests turn by turn, often changing direction as they see what the agent produces (Tang et al., 2026). The resulting friction is measurable: users interrupt agents mid-turn in 5% of interactions and push back against outputs in 41% of turns (Baumann et al., 2026).
However, systematic empirical characterizations of this friction remain limited. The closest existing work studies agent failures from within the agent itself. For example, Cemri et al. (2026) and Zhang et al. (2025) analyze execution traces on controlled benchmarks to classify where agents go wrong and which component in a pipeline is responsible. These analyses are rigorous on their own terms, but benchmark trajectories are generated under pre-specified tasks with no real developer in the loop. As a result, they cannot capture misalignment as developers experience it: not only whether an agent task succeeded or failed, but what form the divergence took, why it occurred, and how developers detected and corrected it across real sessions. Closing this gap requires interaction logs from naturalistic sessions rather than benchmark trajectories. Without them, efforts to improve coding agents lack empirical grounding for where and how alignment breaks down in practice.
To bridge this gap, we present, to the best of our knowledge, the first large-scale characterization of developer-agent misalignments in the wild. We define misalignment as observable breakdowns in developer-agent collaboration that surface through developer correction or pushback in conversational logs, and scope our analysis to two proximal alignment goals: instructions (what developers explicitly ask for) and intentions (what they actually want) (Shen et al., 2024). We analyze two complementary datasets of 20,574 real IDE and CLI coding-agent sessions across 1,639 repositories (Tang et al., 2026; Baumann et al., 2026), and develop an LLM-based extraction pipeline with a second-stage evidence filter that removes claims unsupported by the conversation, yielding 16,118 evidence-grounded episodes with a human-evaluated precision of 0.93. We characterize each episode along four axes (symptom, cause, outcome, and resolution) using an LLM judge validated against human experts (inter-rater agreement 0.83; LLM judge accuracy 0.81), and organize the analysis around four research questions spanning misalignment forms and causes (RQ1), outcomes and resolution patterns (RQ2), variation across IDE and CLI modalities (RQ3), and structural and temporal effects (RQ4).
We highlight the following findings. First, we identify seven recurring symptom categories and seven cause categories that characterize developer-agent misalignment, spanning how agents read the project, interpret developer intent, follow stated rules, bound their actions, implement and execute work, and report their progress. Second, 90.50% of episodes impose effort and trust costs rather than irreversible system damage; visible resolution occurs in only 9.33% of episodes, and 91.49% of these require explicit developer pushback. Third, misalignment differs systematically across modalities: CLI sessions are more prone to constraint violations, with damage extending to project and external state, whereas IDE sessions more often surface faulty implementations and underspecified instructions confined to task state. Finally, misalignment persists across adjacent sessions in the same repository; its overall rate declines over time, but constraint violations and inaccurate self-reporting grow in share, suggesting coding agents need improvements beyond implementation accuracy.
2 Related Work
2.1 Coding Agents in Developer Workflows
AI coding agents mark a clear shift from earlier code-generation tools. Unlike inline autocompletion or single-turn chat assistants, they combine language reasoning, tool use, and sub-agent invocation to operate autonomously within live codebases (Jimenez et al., 2024; Yang et al., 2024; Li et al., 2025). Agent sessions leave interaction traces in public repositories, making real-world usage increasingly observable. Two large-scale studies provide the initial empirical foundation. Tang et al. (2026) analyze 11,579 IDE sessions from Cursor and GitHub Copilot across 1,300 repositories, finding that developers rarely specify tasks upfront; instead, they refine requests progressively, redistribute cognitive work such as comprehension and validation to the agent, and actively manage agent behavior throughout a session. Baumann et al. (2026) extend this picture to CLI-based workflows, analyzing 6,000 sessions involving more than 355,000 tool calls and finding that only 44% of agent-written code survives into final commits. Together, these studies establish that developer-agent interaction is iterative, corrective, and marked by persistent friction. However, neither characterizes the forms that friction takes, where it originates, or how it is resolved.
2.2 Failure Analysis of Coding Agents
The currently dominant approach to understanding agentic failures analyzes agent-internal trajectories on predefined, controlled benchmarks. Cemri et al. (2026) introduce MAST, a taxonomy of 14 failure modes derived from 1,642 execution traces across five multi-agent frameworks. Zhang et al. (2025) extend this line of work by attributing failures to specific agents and steps within multi-agent pipelines, while related studies examine behavioral patterns in agent trajectories (Majgaonkar et al., 2025; Mehtiyev and Assunção, 2026). A separate line of work infers agent failures from downstream artifacts, such as whether agent-written code is accepted into software projects (Ehsani et al., 2026; Alam et al., 2026), surfacing useful signals about which kinds of agentic contributions succeed or fail. However, these studies primarily illuminate either how agents fail internally or how their outputs fare downstream, leaving the developer’s real-time corrective process unexamined.
2.3 Human-AI Alignment
Aligning AI agents with human intent has primarily been approached through training-time interventions. Reinforcement learning from human feedback (RLHF; Ouyang et al., 2022) optimizes model behavior against preference signals collected from human comparisons, while reinforcement learning with verifiable rewards (RLVR; Lambert et al., 2024; Guo et al., 2025) sidesteps human annotation by using programmatic outcomes, e.g., whether generated code passes tests, as supervision signals. More recent work extends these paradigms to multi-turn settings (Shani et al., 2024) and continual adaptation to evolving preferences (Shi et al., 2025). These approaches have driven substantial gains in model alignment, but the empirical structure of misalignment as it unfolds during real developer-agent interactions remains undercharacterized. Our work provides such an analysis to inform the design of more targeted reward signals and evaluation metrics for coding agents.
3 Methodology
We analyze developer-agent misalignment using two complementary datasets of real-world coding-agent sessions. All LLM-based pipeline stages (extraction, post-validation, annotation) use GPT-5.4 with temperature 0 to reduce sampling variance.[1]
[1] GPT-5.4 was the strongest model available under our access; it also outperformed two alternative frontier models we piloted on a held-out sample for the post-validation stage.
3.1 Datasets
Figure 1: Monthly session volume across the combined dataset, broken down by interaction modality. Vertical markers indicate the launch dates of major coding agents and data collection tools, as well as the scrape date.
The first dataset is from SpecStory,[2] which exports coding-agent chat histories as timestamped Markdown files under .specstory/history/. Following Tang et al. (2026), we queried the GitHub Code Search API and re-crawled all available exports on April 30, 2026, covering September 2024–April 2026. Unlike their IDE-focused analysis, we included CLI sessions to cover both interaction modalities. This yielded 14,789 sessions (2,588 CLI) across 1,441 repositories.
[2] SpecStory: https://specstory.com/
The second dataset, SWE-chat (Baumann et al., 2026), was collected via Entire.io,[3] a tool that logs CLI coding-agent sessions. It includes public checkpoint logs from developers on GitHub who opted in between January and April 2026, adding 5,785 sessions across 198 repositories.
[3] Entire.io: https://entire.io/
We verified that the two datasets contain no overlapping repositories by matching repository names. The combined dataset includes 20,574 sessions from 1,639 distinct repositories, with the monthly distribution shown in Figure 1. Each session consists of interleaved user prompts, agent responses, and tool-call traces (e.g., file edits, command executions). Table 1 summarizes the agent composition.[4]
[4] We do not analyze by model identity: SpecStory exports do not record it, and within SWE-chat, Claude-family models account for 94.9% of annotated responses, leaving insufficient variation for meaningful comparison.
Table 1: Agent composition of the combined dataset. Unknown entries reflect early SpecStory exports that did not record agent identity. Med. Turns reports the median number of user-authored messages per session.
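The overlap check and the combined totals described in this subsection can be sketched in a few lines of Python. This is only an illustration: the paper says repositories were matched by name but does not give the normalization rules, so the `normalize` helper and the example repository names below are assumptions. The session and repository counts are the ones reported in the text.

```python
def normalize(repo: str) -> str:
    """Assumed normalization: trim whitespace, lower-case, drop a trailing '.git'."""
    repo = repo.strip().lower()
    return repo[:-4] if repo.endswith(".git") else repo


def overlapping_repos(a, b):
    """Return the set of repository names that appear in both collections."""
    return {normalize(r) for r in a} & {normalize(r) for r in b}


# Hypothetical repository names, for illustration only.
specstory_repos = ["alice/webapp", "Bob/cli-tool.git"]
swe_chat_repos = ["carol/service", "dave/parser"]

overlap = overlapping_repos(specstory_repos, swe_chat_repos)

# With no overlap, the combined totals are plain sums of the reported counts.
total_sessions = 14_789 + 5_785   # SpecStory + SWE-chat sessions
total_repos = 1_441 + 198         # SpecStory + SWE-chat repositories
```

With an empty overlap, the sums reproduce the 20,574 sessions and 1,639 repositories stated above.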
3.2 Structured Misalignment Extraction
Scope.
We use misalignment to describe observable breakdowns between a developer and a coding agent. Drawing on the bidirectional human-AI alignment framework of Shen et al. (2024), we scope our analysis to the two most proximal alignment goals: instructions (what the developer explicitly instructs) and intentions (what the developer actually intends). We identify misalignment only when it becomes visible through subsequent developer correction or pushback in the logs. Latent misalignment visible only through private cognition or off-chat actions (e.g., silently rejecting output or editing code directly) is outside our scope. We exclude the remaining alignment goals (preferences, desires, interests, and values) because assessing them would require evidence that chat logs cannot reliably support.
Session preprocessing.
Raw session logs interleave user messages with tool-call traces and system callbacks, so we preprocess each session into a structured turn sequence. Subagent outputs are aggregated into their parent agent turns. Long agent turns are truncated with a head-tail strategy that preserves equal-length prefixes and suffixes. The character budget scales inversely with session length, from 5,000 characters per turn in short sessions to 500 in very long sessions, to keep the total context tractable. User turns are always preserved in full, as they anchor developer intent.
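A minimal sketch of the head-tail truncation described above, under stated assumptions: the text gives only the two endpoints of the per-turn budget (5,000 and 500 characters), so the linear interpolation, the `short` and `long` session-length thresholds, and the truncation marker are hypothetical choices made for illustration.

```python
def char_budget(n_turns: int, short: int = 10, long: int = 100,
                max_chars: int = 5000, min_chars: int = 500) -> int:
    """Per-turn character budget that shrinks as the session grows.

    The endpoints (5000, 500) come from the text; the thresholds and the
    linear interpolation between them are assumptions.
    """
    if n_turns <= short:
        return max_chars
    if n_turns >= long:
        return min_chars
    frac = (n_turns - short) / (long - short)
    return round(max_chars - frac * (max_chars - min_chars))


def head_tail(text: str, budget: int, marker: str = "\n[...truncated...]\n") -> str:
    """Keep an equal-length prefix and suffix; the marker is added on top of the budget."""
    if len(text) <= budget:
        return text
    half = budget // 2
    return text[:half] + marker + text[len(text) - half:]


def preprocess(turns):
    """turns: list of (role, text) pairs. User turns are always kept in full."""
    budget = char_budget(len(turns))
    return [(role, text if role == "user" else head_tail(text, budget))
            for role, text in turns]
```

For instance, `head_tail("a" * 10 + "b" * 10, 10, marker="|")` keeps five characters from each end and returns `"aaaaa|bbbbb"`.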
Extraction.
Misalignment episodes are identified using an LLM-based extractor that processes each session as a whole rather than turn by turn, because misalignment is inherently cross-turn and context-dependent: a developer’s pushback in turn 8 is interpretable only against the instruction in turn 3. The extractor induces episodes bottom-up rather than applying a prescriptive taxonomy, producing one structured record per breakdown.
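To make "one structured record per breakdown" concrete, here is a hypothetical record shape together with a crude grounding check. Neither is the paper's schema or its evidence filter: the field names are assumptions, and the string-matching `is_grounded` function is only a stand-in for the LLM-based post-validation stage. Turn indices are 0-based.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical episode record; field names are illustrative assumptions."""
    session_id: str
    instruction_turn: int   # index of the developer instruction the agent diverged from
    pushback_turn: int      # index of the later developer correction or pushback
    description: str        # free-text summary induced bottom-up by the extractor
    evidence_quote: str     # verbatim snippet from the pushback message


def is_grounded(ep: Episode, turns) -> bool:
    """Stand-in filter: the pushback must be a later user turn containing the quote."""
    if not ep.evidence_quote.strip():
        return False
    if not (0 <= ep.instruction_turn < ep.pushback_turn < len(turns)):
        return False
    role, text = turns[ep.pushback_turn]
    return role == "user" and ep.evidence_quote.lower() in text.lower()


# Toy session mirroring the turn-3 / turn-8 example in the text.
turns = [("agent", "ok")] * 9
turns[3] = ("user", "Do not modify config.yaml.")
turns[8] = ("user", "I told you not to touch config.yaml, revert it.")

grounded = Episode("s1", 3, 8, "Agent edited a file the developer had ruled out",
                   "not to touch config.yaml")
ungrounded = Episode("s1", 3, 8, "Claim with no support in the log",
                     "please delete everything")
```

The record keeps both turn indices because, as noted above, the pushback can only be interpreted against the earlier instruction.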
We apply four extraction principles. First, misalignment is defined only relative to developer messages, excluding agent-initiated actions without