Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

arXiv cs.CL 06/15/26, 04:00 AM Papers
benchmark dialogue coding-agents software-engineering evaluation user-simulator schema-guided-agent
Summary
Introduces Dialogue-SWE-Bench, a benchmark for evaluating coding agents' ability to resolve software engineering problems through dialogue with a user. Proposes a persona-grounded user simulator and a schema-guided agent that improves dialogue capabilities.
arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.
Original Article
View Cached Full Text
Cached at: 06/15/26, 08:57 AM
# Dialogue-SWEBench: A Benchmark for Dialogue-Driven Coding Agents
Source: [https://arxiv.org/html/2606.13995](https://arxiv.org/html/2606.13995)
###### Abstract

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants\. Despite their interactive real\-world use, existing benchmarks evaluate them as fully\-autonomous systems\. In this work, we introduceDialogue\-SWEBench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real\-world software engineering problems through dialogue with a user\. We design a novel, persona\-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality\. We also propose a newschema\-guidedagent, aimed at improving the dialogue capabilities of off\-the\-shelf coding agents, which improves over strong baselines by 3\-14%\. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance\.111Code and data available at[https://jlab\-nlp\.github\.io/dialogue\-swe\-bench/](https://jlab-nlp.github.io/dialogue-swe-bench/)

Dialogue\-SWEBench: A Benchmark for Dialogue\-Driven Coding Agents

Brendan King and Jeffrey FlaniganUniversity of California, Santa Cruz\{bking2,jmflanig\}@ucsc\.edu

## 1Introduction

Coding agents have radically transformed the software\-engineering \(SWE\) landscape, powering widely\-used tools like Github Co\-Pilot and Claude CodeGitHub \([2022](https://arxiv.org/html/2606.13995#bib.bib5)\); Anthropic \([2025b](https://arxiv.org/html/2606.13995#bib.bib4)\)\. To support their development, the research community has produced increasingly complex benchmarks for evaluating coding agents on real\-world engineering tasksJimenezet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib14)\); Merrillet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib9)\)\. This has enabled rapid improvement in thecoding capabilitiesof these agents, through advances in model reasoningOpenAI \([2025](https://arxiv.org/html/2606.13995#bib.bib42)\); Qwen \([2025](https://arxiv.org/html/2606.13995#bib.bib38)\); Anthropic \([2025a](https://arxiv.org/html/2606.13995#bib.bib6)\), coding agent designYanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib23)\); Wanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib30)\), and training methods for agentic\-codingPanet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib19)\); Weiet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib16)\)\.

Yet despite this progress, existing benchmark datasets have focused on the fully autonomous setting, leaving the interactive nature of real\-world coding agent use understudied\. Such fully\-autonomous SWE evaluations leave a significant gap between what they measure and real\-world software engineering\. First, fully\-autonomous benchmarks presume a complete and correct problem specification as input, yet these are rarely available in practice\.222For example,Chowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)find that 76% of the real\-world Github Issues comprising SWE\-BenchJimenezet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib14)\)are at least somewhat under\-specified, and 39% are deemed too vague to determine “what a successful solution would look like\.”Second, real\-world engineering with coding agents is highly interactive: in a study of real\-world agentic coding sessions,Baumannet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib7)\)find that users use dialogue to correct or reject agent outputs 44% of the time\. This interactivity was also largely one\-sided: while users frequently used dialogue to correct agent outputs, agents themselves only sought clarification 1\-2% of the time\. These findings suggest that improving thedialogue capabilitiesof coding agents represents a significant and largely untapped opportunity for real\-world impact\.

![Refer to caption](https://arxiv.org/html/2606.13995v1/x1.png)Figure 1:Our dialogue benchmark for coding agents\. The user never interacts with the code, and performs software engineering tasks strictly through dialogue\. Agents complete repository\-level software engineering tasks through dialogue with the user\.![Refer to caption](https://arxiv.org/html/2606.13995v1/x2.png)Figure 2:Overview of our benchmark setup and user simulator\. The user never interacts with the code, and performs software engineering tasks strictly through dialogue\. For each SWE problem, the dialogue starts with an initial fixed query, and proceeds with an online evaluation of agents with a user simulator which replies to each agent message\. To ensure faithful simulation of the user, we include a self\-revision step for the user \(§[3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\), where a generated replyut\+1′u^\{\\prime\}\_\{t\+1\}is validated and then revised if necessary\.While real\-world coding agent use is often dialogue\-driven, to our knowledge, no benchmark exists for systematically evaluating the dialogue capabilities of coding agents at repository scale\. Previous works have evaluated dialogue systems for producing single functionsChaurasia and Mooney \([2017](https://arxiv.org/html/2606.13995#bib.bib28)\), but are not suitable for repository\-level tasks\. Recent work evaluates the ability of agents to resolve ambiguous problem specificationsVijayvargiyaet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib3)\), but does not fully address dialogue\-driven use of coding agents\.

In this paper, we introduceDialogue\-SWEBench: a benchmark dataset for evaluating the ability of coding agents to resolve real\-world software engineering problems through dialogue with a user\. We evaluate state\-of\-the\-art coding agents and find that better coding models do not always correspond better interactive agents in the dialogue setting\. Our contributions are as follows:

- ∙\\bulletWe introduce adialogue\-drivenbenchmark for coding agents in which they must solve real\-world SWE tasks via dialogue with a user, rather than being given a complete problem specification upon task start \([§ 3](https://arxiv.org/html/2606.13995#S3)\)\.
- ∙\\bulletTo do this, we design a novel, persona\-grounded user simulator for online evaluation of coding agents through dialogue \([§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\)\.
- ∙\\bulletWe evaluate state\-of\-the\-art coding models and agents in this new dialogue setting, and find that stronger coding models are not always stronger dialogue systems \([§ 5](https://arxiv.org/html/2606.13995#S5)\-[§ 6](https://arxiv.org/html/2606.13995#S6)\)\.
- ∙\\bulletWe support our task evaluation with automatic evaluation of dialogue quality using a novel LLM\-as\-a\-Judge \([§ 7](https://arxiv.org/html/2606.13995#S7)\)\.
- ∙\\bulletWe propose a new, dialogue\-theory informed coding agent with the best average performance on this task \([§ 4](https://arxiv.org/html/2606.13995#S4)\)\.

## 2Related Work

The rise of language agentsSuet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib39)\)– AI agents whose policies are mediated by LLM reasoning – has enabled a shift in text\-to\-code research from writing simple functionsChenet al\.\([2021](https://arxiv.org/html/2606.13995#bib.bib51)\)to real\-world software engineering \(SWE\)Jimenezet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib14)\); Aleithanet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib32)\)\. While there has been substantial progress made on SWE tasks through advances in language modelsRastogiet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib36)\); Qwen \([2025](https://arxiv.org/html/2606.13995#bib.bib38)\), agent designWanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib30)\); Yanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib23)\), and fine\-tuning techniquesPanet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib19)\); Weiet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib16)\); Yanget al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib37)\), much of this work considers only the fully automated SWE setting, which burdens the user with providing a complete and correct specification and ignores the potential of human\-agent interaction\. The research community has begun to argue for the need to model such human\-agent interactionsWanget al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib53)\), but evaluating such interactive settings remains an open challenge\.

Two recent works evaluate human\-agent interaction in the SWE setting\.Vijayvargiyaet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib3)\)the ability of coding agent in resolving ambiguity from incomplete problem specification through clarifying questions\. While interactive, this setup still presumes a specification document as input, with interaction used only to resolve ambiguity\. Our evaluation differs in that our task begins with user utterance and unfolds entirely through dialogue\. We compare our proposed agent to theirs in our evaluation\.Zhouet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib43)\)design an evaluation for multi\-session coding interactions with a coding agent, evaluating the ability of agents to honoring user\-level preferences across sessions\. Our evaluation differs in that our setup is dialogue\-driven with a single SWE task, and our user simulator does not interact with code and is not given knowledge that could contain the task solution, such as issue commentsAleithanet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib32)\)\.

Prior to the advent of coding agents, other works have considered dialogue forfunction\-leveltext\-to\-code problems\. Some works use the gold solution to a text\-to\-code problem to create a dialogue for that problem, by crafting clarification question & answer pairsLiet al\.\([2023](https://arxiv.org/html/2606.13995#bib.bib13)\), simulating a user’s feedbackPanet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib22)\); Wanget al\.\([2023](https://arxiv.org/html/2606.13995#bib.bib52)\), or addressing uncertainty in a model’s predictionsChaurasia and Mooney \([2017](https://arxiv.org/html/2606.13995#bib.bib28)\)\.Rakotonirinaet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib33)\)evaluate LLMs in multi\-session coding interactions across severalfunction\-levelcoding tasks\. Our work instead addresses dialogue forrepository\-levelSWE tasks, and importantly, never conditions the simulated user on gold knowledge, like solutions or test cases\.

## 3Dialogue\-SWEBench

In[§ 3\.1](https://arxiv.org/html/2606.13995#S3.SS1), we review coding agents and the fully\-autonomous evaluation setup, as introduced in SWE\-BenchJimenezet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib14)\)\. We then describe how we re\-formulate the fully\-autonomous evaluation into an interactive setup whichrequires dialogue\([§ 3\.2](https://arxiv.org/html/2606.13995#S3.SS2)\) and provide further detail on the user simulator we design to support our dialogue\-driven evaluation \([§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\) and our evaluation metric \([§ 3\.4](https://arxiv.org/html/2606.13995#S3.SS4)\)\.

### 3\.1Preliminaries

In the fully\-autonomous SWE evaluation, a system generates a repository\-level code change \(e\.g\. git patch\) from a problem description and initial repository\. This patch is then evaluated using unit tests\. The dominant approach to this task uses a coding agent or ‘SWE Agent’Yanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib23)\), an LLM\-powered agent which operates directly within a programming environment to iteratively explore, edit, and execute code\. Given an initial repositoryℛepo\\mathcal\{R\}\_\{epo\}and problem specificationℐtext\\mathcal\{I\}\_\{text\}, A coding agent is initialized in an observation\-action loop, beginning with a first observationo1o\_\{1\}containing the problem specificationℐtext\\mathcal\{I\}\_\{text\}\. At each time steptt, the agent executes an actionat∈𝒜a\_\{t\}\\in\\mathcal\{A\}, such as editing a file or running a test, and receives a corresponding observationot\+1∈𝒪o\_\{t\+1\}\\in\\mathcal\{O\}, such as execution results, where𝒪\\mathcal\{O\}and𝒜\\mathcal\{A\}represent the agent’s observation space and action space, respectively\. A special action ‘finish’ terminates the loop and prepares a patchppfrom the edited files for evaluation\. Each problem is evaluated by applying the generated patchppto the base repositoryℛepo\\mathcal\{R\}\_\{epo\}and then executing tests associated with the task\. A task is considered resolved if all tests pass after the patch is applied, and the evaluation metric for the benchmark is the percentage of resolved tasks\.

### 3\.2Dialogue Problem Design

We propose a benchmark in which coding agents must resolve real\-world software engineering problems through goal\-oriented dialogue with a user\. We re\-formulate SWE\-Bench VerifiedChowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)into this benchmark as follows \(overview in[Figure 2](https://arxiv.org/html/2606.13995#S1.F2)\)\. First, we replace the full issue textℐtext\\mathcal\{I\}\_\{text\}in the first observationo1o\_\{1\}with an initial user queryu1u\_\{1\}\. Next, we situate agents in an environment with an action space which supports both programming and dialogue with a simulated user\. Specifically, we augment the action space𝒜\\mathcal\{A\}of each agent with an additional action ‘message\_user’, which takes a single argument containing the message to communicate with the user, and yields an observationot\+1o\_\{t\+1\}containing their response\. We simulate each user using an LLM, with knowledge grounded in the full issue textℐtext\\mathcal\{I\}\_\{text\}, described further in[§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\. As before, the agent’s observation\-action loop terminates with the ‘finish’ action, yielding a patchpp, which can be evaluated as resolving or not resolving the user’s issue using execution tests\.

### 3\.3Simulating Users

To reduce variance across runs in our benchmark, we design a fixed initial user queryu1u\_\{1\}for each problem\. Following this, we design a simulator for producing faithful and realistic user replies to follow\-up queries from the agent\. To enable stable future evaluation on the benchmark, the user simulator is open\-source and released as part of the benchmark\.

Crafting initial queriesWe design each initial queryu1u\_\{1\}to faithfully reflect the intent given in the problem specification but omit key details needed for resolution, using the following semi\-automated approach\. First, we prompt an LLM with the Github issue title and brief instructions for paraphrasing it into an initial dialogue query\. Following this, we use a revision prompt to remove any critical details from the query that an agent might be able to exploit to avoid engaging in dialogue altogether\. Finally, we manually review each query to verify it is faithful to the intent described in the problem specification\.[Appendix A](https://arxiv.org/html/2606.13995#A1)provides further details on our prompting approach and manual review\. While our manual review ensures a high\-quality evaluation set, we find only 13% of generated queries require any modification, suggesting our pipeline could be fully automated in training or development settings\.

Simulating follow\-up replies with Self\-RevisionFor any timestept≥1t\\geq 1, we prompt an open\-weights LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}to simulate the user’s reply to an agent’s messageut\+1u\_\{t\+1\}\.333We use a quantized version of LLaMa 3\.3 70BGrattafioriet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib59)\)available at[https://huggingface\.co/shuyuej/Llama\-3\.3\-70B\-Instruct\-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ)Specifically, we define a system prompt for the user simulator conditioned on the user’s knowledge for that problem, given in the complete issue textℐtext\\mathcal\{I\}\_\{text\}, as well as a persona for grounding behavior, detailed below\. Following this, all dialogue messages for that problem are included in the context for generating a candidate for the next user utteranceut\+1′u^\{\\prime\}\_\{t\+1\}\. To ensure our user simulator behaves realistically for the setting, we use a revision step, in which a candidate user utteranceut\+1′u^\{\\prime\}\_\{t\+1\}is validated by the same user LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}, checking for potential hallucinations or violations of the task instructions\. For example, since our simulated user is unable to run new code on its own, we use the revision step to identify any replies in which the user claims to do so\. We also test for directly detectable violations, such as exceeding the length limit in the reply\. If any errors are found, we use a final prompt to revise the candidate utterance before it is returned to the agent as an observation\. See[Appendix A](https://arxiv.org/html/2606.13995#A1)for further details\.

#### User Personas

To improve diversity of dialogues simulated in our benchmark we assign each problem a user persona, followingRakotonirinaet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib33)\)\. For each problem, we sample a name and hand\-crafted persona description which will be used to influence responses given by our user simulator\. Names and persona descriptions are given in[Table 5](https://arxiv.org/html/2606.13995#A1.T5)andLABEL:tab:persona\-descriptionsof[Appendix A](https://arxiv.org/html/2606.13995#A1), respectively\.

Table 1:Resolve rate and dialogue statistics on Dialo\-SWE\-Bench for closed\- and open\-weight models\. Our schema\-guided agent \([§ 4](https://arxiv.org/html/2606.13995#S4)\) achieves the best average performance at the lowest average cost\.

### 3\.4Evaluation Metric

The principal metric for our benchmark is the resolution rate, or the percentage of dialogues which result in a correctly resolved SWE task\. To evaluate this, we use the execution tests corresponding to each problem from SWE\-Bench VerifiedChowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)\. A dialogue is considered successful if the conversation results in a submitted patchppwhich passes all associated execution tests\.

## 4Schema Guided Coding Agents

To address the unique demands of our dialogue benchmark, we propose a novel schema\-guided coding agent which adaptively plans dialogue moves usingdialogue schemas, or “structured representations of the prototypical sequence of events in a dialogue”Kane \([2022](https://arxiv.org/html/2606.13995#bib.bib46)\)\. Specifically, we use prompts to instruct our agent to build and maintain its own structured representation of the dialogue state in order to guide its questions, code exploration, and patch generation\.

![Refer to caption](https://arxiv.org/html/2606.13995v1/x3.png)Figure 3:Workflow for our schema\-guided SWE AgentUsing prompts, we instruct our agent to conduct a schema\-guided dialogue as follows \(overview in[Figure 3](https://arxiv.org/html/2606.13995#S4.F3)\)\. First, we instruct the agent to determine the type of issue the user presents and draft a schema with keys and values for the critical details needed to resolve the issue\. The possible issue types, keys, and values in this schema are all determined by the agent\. For example, if the user seems to be describing a bug, the agent might draft a schema with an issue type of ‘bug’, with keys for ‘actual behavior’, ‘expected behavior’, and ‘reproduction steps\.’ We instruct the agent to mark values which have not yet been discussed as UNKNOWN, and fill these in using dialogue, until enough details are gathered to begin solving the problem\. The agent is instructed to maintain this dialogue state as it explores code, makes changes, and verifies a solution\. We implement our agent using the OpenHands Agent SDKWanget al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib44)\), using the default tools for coding agents, including file editing, bash tools, and finishing\.

## 5Experimental Setup

In this section, we describe our evaluation of closed and open\-weight LLMs on our benchmark\. For all experiments, we evaluate systems with our dialogue transformation of the 500 problems from SWE\-Bench VerifiedChowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)\.

Selected ModelsWe evaluate a representative set of closed and open\-weight models based on their performance on SWE\-Bench\. For proprietary models, we evaluate GPT\-5 and GPT\-5\-miniOpenAI \([2025](https://arxiv.org/html/2606.13995#bib.bib42)\)\. We also evaluate the open\-weight models Qwen3 Coder 30\-A\-3BQwen \([2025](https://arxiv.org/html/2606.13995#bib.bib38)\)and Devstral 2 SmallMistralAI \([2025](https://arxiv.org/html/2606.13995#bib.bib41)\), both trained for agentic coding\.

Benchmark SetupFor all experiments, we instantiate our user simulator with a quantized version of LLaMa 3\.3 70B\.444[https://huggingface\.co/shuyuej/Llama\-3\.3\-70B\-Instruct\-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ)We also limit each agent to 100 steps per problem instance\. For each agent, we include an additionalmessage\_usertool, providing an additional interface for communicating with the user\.555This tool takes a single argument<message\>and returns the user’s response\. An assistant message without a tool call is similarly treated as a dialogue message\.

Baselines and AgentsFor each model, we consider the following baseline agents: \(1\) OpenHandsWanget al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib30)\), as an off\-the\-shelf coding agent, and \(2\) OH InteractiveVijayvargiyaet al\.\([2026](https://arxiv.org/html/2606.13995#bib.bib3)\), as a baseline designed to interact with a user to resolve ambiguity\. We also evaluate each model with our schema guided agent \([§ 4](https://arxiv.org/html/2606.13995#S4)\)\. Importantly, all agents share the same set of available tools\.

## 6Results

We present the performance of each approach on our benchmark in[Table 1](https://arxiv.org/html/2606.13995#S3.T1)\. In addition to our core task metric \(% Resolved\), we report the average number of dialogue turns, agent steps, and cost in dollars per problem session\.666SeeLABEL:app:cost\-calculationfor details on cost calculations\.We find that our schema\-guided agent achieves the highest average resolve rate across all models \(46\.9%\), outperforming OpenHands \(32\.9%\) and OH Interactive \(44\.1%\), at the lowest average cost\. While our schema\-guided agent typically makes more use of dialogue, this is not accompanied with a corresponding increase in the number of total steps\. Surprisingly, we find that performance of GPT\-5 mini rivals that of GPT\-5 at a fraction of the cost, and that strong coding ability does not always correspond with dialogue ability\. We investigate this further in a few ways throughout the paper\. First, inLABEL:app:stratify\-by\-diff, we stratify performance by engineering difficulty, finding that while the larger GPT\-5 performs best on harder engineering problems, it under\-performs on simpler tasks\. We find this is in some cases due to dialogue failures, such as asking too many unnecessary questions or failing to follow up on a missing detail, and present a case study as an example in[§ 9](https://arxiv.org/html/2606.13995#S9)\. Finally, we note that relative to GPT\-5, GPT\-5 mini dialogues rate as more natural and coherent \([§ 7](https://arxiv.org/html/2606.13995#S7)\)\. Altogether, these results demonstrates thatstrong coding ability does not always coincide with dialogue ability, highlighting the importance of a dialogue benchmark for coding agents\.

Information\-Seeking Drives Task ResolutionIn[Figure 4](https://arxiv.org/html/2606.13995#S6.F4), we look more closely at the number of information seeking dialogue moves used by each agent and its relationship with Resolve Rate\. Using an LLM classifier, we determine whether a given dialogue message from agent to user seeks information about the problem, rather than serving only other rhetorical functions such as summarizing changes made, offering further assistance, or providing a greeting/conventional closing\. We use GPT\-5\-mini to classify each agent message as containing a question\-info\-request dialogue actJurafsky and Shriberg \([1997](https://arxiv.org/html/2606.13995#bib.bib54)\)or not, and find strong agreement with human annotation \(Cohen’sκ\\kappa= 0\.87\)\. We note a few interesting findings\. First, the best performing agent for a given model consistently uses more information\-seeking dialogue moves\. Second, the off\-the\-shelf agent framework OpenHands rarely uses information seeking moves, suggesting that intervention to the agent or model is necessary to support multi\-turn negotiations of a user’s intent in the dialogue setting\. Finally, our schema\-guided agent typically uses the highest number of information seeking moves, with the exception of Devstral 2 Small\.

![Refer to caption](https://arxiv.org/html/2606.13995v1/imgs/figs/plots/info_seeking_vs_resolved_v3.png)Figure 4:The average number of information seeking dialogue moves used by an agent in a dialogue, compared with the Resolve Rate \(%\)\. We find that \(1\) off\-the\-shelf agents \(OpenHands\) rarely seek information from the user and solve the fewest issues, \(2\) increased information\-seeking correlates with a higher resolve rate, and \(3\) in all but one case, our schema\-guided agent uses the most information seeking moves and solves the most tasks\.
## 7Analyzing Dialogue Quality

To complement our benchmark evaluation, we propose automatic metrics of dialogue quality for the coding agent\. FollowingKaziet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib25)\), we evaluate our agents along two dimensions, Naturalness and Coherence\. For each measure, we devise an LLM\-as\-a\-judge which assesses the performance of an agent at the dialogue level, and validate it’s judgments with human annotation\. For both metrics, we implement our LLM\-as\-a\-Judge using Gemma 4 31B\-ITDeepmind \([2026](https://arxiv.org/html/2606.13995#bib.bib2)\)\.

### 7\.1Naturalness

We analyze naturalness as the degree to which an agent is easy to understand and converse with\. To assess naturalness, we use the following three point scale:

1. 1\.Low naturalness \(1\): this agent does not communicate in an understandable way, or has clear communication issues that a user may find frustrating\.
2. 2\.Medium naturalness \(2\): this agent communicates mostly in an understandable way, but there are some minor issues that make the dialogue feel slightly less natural\.
3. 3\.High naturalness \(3\): this agent communicates clearly and concisely, and in way user’s would find easy to communicate with and understand\.

![Refer to caption](https://arxiv.org/html/2606.13995v1/imgs/figs/plots/nat_coh_v2/naturalness_llm_plot_full_w_rules_gemma.png)\(a\)Dialogue naturalness
![Refer to caption](https://arxiv.org/html/2606.13995v1/imgs/figs/plots/nat_coh_v2/coherence_llm_plot_full_w_rules_gemma.png)\(b\)Dialogue coherence

Figure 5:LLM\-as\-a\-judge ratings for naturalness and coherence \(1\-3\) with 95% confidence intervals\.ResultsWe present results from our LLM\-as\-a\-Judge in[5\(a\)](https://arxiv.org/html/2606.13995#S7.F5.sf1)\. Across systems, we find more variance in naturalness from choice of model than choice of agent\. Notably, GPT\-5 suffers from low naturalness, as it often fails to close a dialogue correctly or repeats internal portions of its system prompt to the user\. We explore this further as a case study in[§ 9](https://arxiv.org/html/2606.13995#S9)\. Alternatively, Devstral’s low naturalness is due to failure to participate in dialogue: in 23\-34% of dialogues, devstral never responds to the user, exiting on max calls\. Finally, we observe no clear relationship between naturalness and resolution rate, suggesting the usability of a coding agent in dialogue cannot be measured by task success alone\.

### 7\.2Coherence

We analyze coherence as the degree to which an agent’s dialogue moves guide the dialogue on a logical path toward solving the user’s task, on a 1\-3 scale\. FollowingKaziet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib25)\), we derive the agent’s score from two criteria:

1. 1\.Local coherence: whether each agent turn logically follows from the preceding messages \(i\.e\. is on task and not redundant\)\.
2. 2\.Global coherence: whether the agent’s contributions collectively steer the conversation solving the user’s task, including seeking clarification or detail as needed and closing the dialogue appropriately\.

If neither criteria are met, the agent scores 1\. If only one of the two criteria are met, the agent scores 2\. An agent which meets both criteria scores 3\.

#### Results

We present results from our LLM\-as\-a\-Judge in[5\(b\)](https://arxiv.org/html/2606.13995#S7.F5.sf2)\. We observe stronger differentiation between agents on coherence than naturalness, suggesting that agent design can more strongly influence a system’s ability to construct a logical dialogue path toward issue resolution\. Our schema\-guided agent rates the most coherent for two out of four models with statistical significance\.777Determined with a wilcoxon signed rank test,p<10−5p<10^\{\-5\}For GPT\-5 mini and Devstral, the difference between the leading OH Interactive and our schema\-guided agent is not statistically significant\.

### 7\.3Evaluating the LLM\-as\-a\-Judge

To evaluate our LLM\-as\-a\-judge we conduct a human annotation of 30 dialogues per model\-agent pair for a total of 360 dialogues\. We score each system using the same 3\-point scale for naturalness and coherence\.

We evaluate our judge with two measures of agreement\. First, we quantify the agreement between our judge and human ratings using a quadratically\-weighted Cohen’sκ\\kappaCohen \([1968](https://arxiv.org/html/2606.13995#bib.bib8)\)\. Second, we evaluate our judge’s ability to correctly rank competing systems, followingKocmiet al\.\([2021](https://arxiv.org/html/2606.13995#bib.bib11)\)\. We begin by identifying pairs of evaluated systems whose human ratings differ significantly under a Wilcoxon signed rank testPratt \([1959](https://arxiv.org/html/2606.13995#bib.bib56)\)\. We then define ranking accuracy as the percentage of these pairs for which the judge’s ranking agrees with the human ranking\.

Table 2:Agreement of our LLM\-as\-a\-judge with human ratings, measured using quadratic weighted Cohen’sκ\\kappaand ranking accuracy over thennsystem pairs with a statistically significant difference in human ratingsWe report the agreement measures between our LLM\-as\-a\-judge and human ratings in[Table 2](https://arxiv.org/html/2606.13995#S7.T2)\.LABEL:fig:nat\-human\-resultsandLABEL:fig:coh\-human\-resultsofLABEL:app:nat\-coh\-annotationprovide our human ratings for naturalness and coherence, respectively\. For naturalness, we find substantial agreement between our LLM\-as\-a\-judge and human ratings \(κ\\kappa=0\.70\)\. Our naturalness judge also ranks systems by naturalness with 100% accuracy\. For coherence, we find moderate agreement \(κ\\kappa=0\.51\), due in part to the complex judgment required to assess an entire dialogue’s logical flow\. Despite only moderate agreement, our coherence judge still achieves a rank accuracy of 84\.3%\.

## 8Ablations & Analyses

### 8\.1Ablating our User Simulator

We verify that our benchmark evaluates multi\-turn problem solving through dialogue using an ablation that removes the user simulator after the first turn\. As before, we seed the dialogue with the user’s fixed initial query, but then return “the user is not available\.” for any subsequent user messages\. The agent must therefore solve the task without any interaction\. For this ablation, we sample 50 problems from the evaluation set stratified by difficulty as annotated byChowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)\. We use our schema\-guided agent \([§ 4](https://arxiv.org/html/2606.13995#S4)\) with each model\.[Table 3](https://arxiv.org/html/2606.13995#S8.T3)shows that removing the user simulator reduces performance significantly across models, indicating that success on our benchmark requires multi\-turn dialogue capabilities\.

Table 3:An ablation of the simulated follow\-up replies from our user simulator on a stratified sample of the evaluation set \(n=50n=50\)\. ‘full sim\.’ indicates our full user simulator, ‘u1u\_\{1\}only’ indicates an ablation in which only the first utteranceu1u\_\{1\}is given\. Ablating simulated follow\-up replies drastically reduces agent performance, confirming our benchmark effectively evaluates multi\-turn conversation\.
### 8\.2Evaluating our User Simulator

We evaluate whether our user simulator can effectively serve as a proxy for real\-world users through human annotation\. To do this, we evaluate our user’s goal\-oriented behavior along three binary dimensions:

1. 1\.Faithfulness: is the user faithful to the intent and knowledge in the issue specification?
2. 2\.Goal Adherence: does the user reliably pursue their stated goal, and not get side\-tracked?
3. 3\.Environment Adherence: does the user respect the limitations of their environment, and not hallucinate abilities like running code?

Each criterion is annotated at the utterance level as met or not met using the same annotation guidelines\. We evaluate both our final user simulator, and an ablation which removes our revision step described in[§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\. We sample 120 dialogues across model and agent choices, and report results in[Table 4](https://arxiv.org/html/2606.13995#S8.T4)\. We find that our best user simulator scores highly across all three dimensions at both the turn and dialogue level, demonstrating strong goal\-oriented behavior\. Without our revision step, the user simulator performs significantly worse, particularly with environment adherence\. Further, our full user simulator scores perfectly on all dimensions in 97\.5% of dialogues, where our simulator without revision only scores perfectly on 82\.5%\.

Table 4:Human evaluation of faithfulness, goal adherence, and environment adherence in our User Simulator \([§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\)\. Scores are percentage of utterances \(turn\-level\) and dialogues \(dialogue\-level\) where each criteria is met\. All \(↑\\uparrow\) is the percentage of turns or dialogues where all criteria are met\. Our user meets each criteria in over 99% of dialogues, with 97\.5% of dialogues scoring perfectly\. Without self\-revision \([§ 3\.3](https://arxiv.org/html/2606.13995#S3.SS3)\), only 82\.5% of dialogues are defect free\.

## 9Case Studies

We analyze system behavior in our benchmark through case studies\.

#### Schemas Guide Reasoning

First, we note an example in which our schema\-guided agent outperforms our strongest baseline \(OH Interactive\) because it is able to surface a key details about the expected behavior through dialogue\.LABEL:fig:case\-study\-ours\-vs\-baselineinLABEL:app:case\-study\-detailsshows this example in detail\. In this example, the user intends to modify a function to accept an argument as either \(1\) None, mapping it to an emptyset, or \(2\) as\-is, asserting that the passed argument is of typeset\. While both agents ask about what types to accept, the schema\-guided agent reasons about the expected behavior in it’s schema explicitly, and further considers the expected behavior when the caller’s argument is not aset\. The schema\-guided agent proposes one solution \(coercing any iterable to aset\), which prompts the user to respond with the correct alternative solution: raising an assertion error if the argument is not aset\. The baseline instead assumes it is thecaller’sresponsibility to ensure the argument is aset, violating the user’s true intent\.

#### Verbose Models Burden Users

We also notice that while thorough, some models commonly burdens the user with unnecessary questions\. In particular, we notice this across agents with GPT\-5, in part leading to the longer dialogues we observe in[Table 1](https://arxiv.org/html/2606.13995#S3.T1)\.LABEL:fig:case\-study\-gpt\-5\-vs\-miniinLABEL:app:case\-study\-detailsprovides an example dialogue comparing GPT\-5 and GPT\-5\-mini, both using our schema\-guided agent\. The user’s problem is a ‘<15<15min fix’ and has a very simple solution: recognizing improper inputs and throwing an exception\. Where GPT\-5\-mini asks two clear and useful questions, GPT\-5 asks several more, many of which are unnecessary\. In response to a long list of questions from GPT\-5, the user selectively answers only two\. GPT\-5 declines to follow\-up on an unanswered question about expected behavior for improper inputs, and instead implements a solution which ignores such inputs rather than raising an exception\.

## 10Conclusion

We introduce Dialogue\-SWEBench, an evaluation of coding agents on real\-world software engineering tasks through multi\-turn dialogue with a simulated user\. We hope our work inspires future work on the intersection of coding and dialogue capabilities needed for real\-world interactive coding agent use\.

## 11Limitations

We build our benchmark using real\-world problems from SWE\-Bench VerifiedChowdhuryet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib31)\)\. The annotations provided in the the verified set allow us to be confident that each task’s problem specification is complete and correct, and thus solvable in principle\. However, this design choice means our benchmark inherits the distribution of SWE\-Bench Verified, which skews toward Python repositories and particular issue types that might not cover the full distribution of real\-world coding agent uses\. Future work will be needed to assess coding agent dialogue capability in a greater variety of repository\-level task domains\.

## Acknowledgments

The authors were supported in part by the NSF National AI Institute for Student\-AI Teaming \(iSAT\) under grant DRL 2019805\. The opinions expressed are those of the authors and do not represent views of the NSF\. We are thankful for the computing resources provided by the Pacific Research Platform’s Nautilus cluster, supported by the National Science Foundation under Award Numbers CNS\-1730158, ACI\-1540112, ACI1541349, OAC\-1826967, the University of California Office of the President, and the University of California San Diego’s California Institute for Telecommunications and Information Technology/Qualcomm Institute\.

## References

- R\. Aleithan, H\. Xue, M\. M\. Mohajer, E\. Nnorom, G\. Uddin, and S\. Wang \(2024\)SWE\-Bench\+: Enhanced Coding Benchmark for LLMs\.arXiv\.Note:arXiv:2410\.06992 \[cs\] version: 1External Links:[Link](http://arxiv.org/abs/2410.06992),[Document](https://dx.doi.org/10.48550/arXiv.2410.06992)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p2.1)\.
- Anthropic \(2025a\)Claude 4 system card\.Technical ReportAnthropic\.External Links:[Link](https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5ae987bc64dd.pdf)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1)\.
- Anthropic \(2025b\)Claude codeNote:Command\-line AI coding toolExternal Links:[Link](https://code.claude.com/docs/en/overview)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1)\.
- J\. Baumann, V\. Padmakumar, X\. Li, J\. Yang, D\. Yang, and S\. Koyejo \(2026\)SWE\-chat: Coding Agent Interactions From Real Users in the Wild\.arXiv\.Note:arXiv:2604\.20779 \[cs\]External Links:[Link](http://arxiv.org/abs/2604.20779),[Document](https://dx.doi.org/10.48550/arXiv.2604.20779)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p2.1)\.
- S\. Chaurasia and R\. Mooney \(2017\)Dialog for Language to Code\.Note:\[TLDR\] This work proposes a system that can engage users in a dialog to clarify their intent until it has all the information to produce correct code, and focuses on synthesizing conditional statements in the form of IFTTT recipes\.External Links:[Link](https://www.semanticscholar.org/paper/Dialog-for-Language-to-Code-Chaurasia-Mooney/3ade4d3be53981a1678b1e3a736d01547f7d3b9e)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p3.1),[§2](https://arxiv.org/html/2606.13995#S2.p3.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating Large Language Models Trained on Code\.arXiv\.Note:arXiv:2107\.03374 \[cs\]External Links:[Link](http://arxiv.org/abs/2107.03374)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- N\. Chowdhury, J\. Aung, C\. Jun Shern, O\. Jaffe, D\. Sherburn, G\. Starace, E\. Mays, R\. Dias, M\. Aljubeh, M\. Glaese, C\. E\. Jimenez, J\. Yang, L\. Ho, T\. Patwardhan, K\. Liu, and A\. Madry \(2024\)Introducing SWE\-bench Verified\.\(en\-US\)\.External Links:[Link](https://openai.com/index/introducing-swe-bench-verified/)Cited by:[§11](https://arxiv.org/html/2606.13995#S11.p1.1),[§3\.2](https://arxiv.org/html/2606.13995#S3.SS2.p1.7),[§3\.4](https://arxiv.org/html/2606.13995#S3.SS4.p1.1),[§5](https://arxiv.org/html/2606.13995#S5.p1.1),[§8\.1](https://arxiv.org/html/2606.13995#S8.SS1.p1.1),[footnote 2](https://arxiv.org/html/2606.13995#footnote2)\.
- J\. Cohen \(1968\)Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit\.Psychological Bulletin70\(4\),pp\. 213–220\.External Links:ISSN 1939\-1455,[Document](https://dx.doi.org/10.1037/h0026256)Cited by:[§7\.3](https://arxiv.org/html/2606.13995#S7.SS3.p2.1)\.
- G\. Deepmind \(2026\)Gemma 4 model card\.\(en\)\.External Links:[Link](https://ai.google.dev/gemma/docs/core/model_card_4)Cited by:[§7](https://arxiv.org/html/2606.13995#S7.p1.1)\.
- GitHub \(2022\)GitHub copilot: your AI pair programmerExternal Links:[Link](https://github.com/features/copilot)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. v\. d\. Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. v\. d\. Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. d\. Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The Llama 3 Herd of Models\.arXiv\.Note:arXiv:2407\.21783 \[cs\]External Links:[Link](http://arxiv.org/abs/2407.21783),[Document](https://dx.doi.org/10.48550/arXiv.2407.21783)Cited by:[Appendix A](https://arxiv.org/html/2606.13995#A1.p1.1),[footnote 3](https://arxiv.org/html/2606.13995#footnote3)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)SWE\-bench: Can Language Models Resolve Real\-World GitHub Issues?\.arXiv\(en\)\.Note:arXiv:2310\.06770 \[cs\]Comment: Data, code, and leaderboard are available at https://www\.swebench\.com ICLR 2024, https://openreview\.net/forum?id=VTF8yNQM66External Links:[Link](http://arxiv.org/abs/2310.06770)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1),[§3](https://arxiv.org/html/2606.13995#S3.p1.1),[footnote 2](https://arxiv.org/html/2606.13995#footnote2)\.
- D\. Jurafsky and E\. Shriberg \(1997\)Switchboard SWBD\-DAMSL shallow\-discourse\-function annotation coders manual\.External Links:[Link](https://www.semanticscholar.org/paper/Switchboard-SWBD-DAMSL-shallow-discourse-function-Jurafsky-Shriberg/100ab5ac194a6a3bf12206ce477079a2a66733c1)Cited by:[§6](https://arxiv.org/html/2606.13995#S6.p2.1)\.
- B\. Kane \(2022\)Schema\-Guided Dialogue Management for Virtual Conversational Agents\.\(en\)\.Cited by:[§4](https://arxiv.org/html/2606.13995#S4.p1.1)\.
- T\. Kazi, R\. Lyu, S\. Zhou, D\. Hakkani\-Tur, and G\. Tur \(2024\)Large Language Models as User\-Agents for Evaluating Task\-Oriented\-Dialogue Systems\.arXiv\.Note:arXiv:2411\.09972External Links:[Link](http://arxiv.org/abs/2411.09972),[Document](https://dx.doi.org/10.48550/arXiv.2411.09972)Cited by:[§7\.2](https://arxiv.org/html/2606.13995#S7.SS2.p1.1),[§7](https://arxiv.org/html/2606.13995#S7.p1.1)\.
- T\. Kocmi, C\. Federmann, R\. Grundkiewicz, M\. Junczys\-Dowmunt, H\. Matsushita, and A\. Menezes \(2021\)To ship or not to ship: an extensive evaluation of automatic metrics for machine translation\.InProceedings of the Sixth Conference on Machine Translation,L\. Barrault, O\. Bojar, F\. Bougares, R\. Chatterjee, M\. R\. Costa\-jussa, C\. Federmann, M\. Fishel, A\. Fraser, M\. Freitag, Y\. Graham, R\. Grundkiewicz, P\. Guzman, B\. Haddow, M\. Huck, A\. J\. Yepes, P\. Koehn, T\. Kocmi, A\. Martins, M\. Morishita, and C\. Monz \(Eds\.\),Online,pp\. 478–494\.External Links:[Link](https://aclanthology.org/2021.wmt-1.57/)Cited by:[§7\.3](https://arxiv.org/html/2606.13995#S7.SS3.p2.1)\.
- H\. Li, M\. Mesgar, A\. F\. T\. Martins, and I\. Gurevych \(2023\)Python Code Generation by Asking Clarification Questions\.arXiv\.Note:arXiv:2212\.09885 \[cs\]External Links:[Link](http://arxiv.org/abs/2212.09885),[Document](https://dx.doi.org/10.48550/arXiv.2212.09885)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p3.1)\.
- M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. H\. Shin, T\. Walshe, E\. K\. Buchanan, J\. Shen, G\. Ye, H\. Lin, J\. Poulos, M\. Wang, M\. Nezhurina, J\. Jitsev, D\. Lu, O\. M\. Mastromichalakis, Z\. Xu, Z\. Chen, Y\. Liu, R\. Zhang, L\. L\. Chen, A\. Kashyap, J\. Uslu, J\. Li, J\. Wu, M\. Yan, S\. Bian, V\. Sharma, K\. Sun, S\. Dillmann, A\. Anand, A\. Lanpouthakoun, B\. Koopah, C\. Hu, E\. K\. Guha, G\. H\. S\. Dreiman, J\. Zhu, K\. Krauth, L\. Zhong, N\. Muennighoff, R\. K\. Amanfu, S\. Tan, S\. Pimpalgaonkar, T\. Aggarwal, X\. Lin, X\. Lan, X\. Zhao, Y\. Liang, Y\. Wang, Z\. Wang, C\. Zhou, D\. Heineman, H\. Liu, H\. Trivedi, J\. Yang, J\. Lin, M\. Shetty, M\. Yang, N\. Omi, N\. Raoof, S\. Li, T\. Y\. Zhuo, W\. Lin, Y\. Dai, Y\. Wang, W\. Chai, S\. Zhou, D\. Wahdany, Z\. She, J\. Hu, Z\. Dong, Y\. Zhu, S\. Cui, A\. Saiyed, A\. Kolbeinsson, J\. Hu, C\. Rytting, R\. Marten, Y\. Wang, A\. G\. Dimakis, A\. Konwinski, and L\. Schmidt \(2026\)Terminal\-bench: benchmarking agents on hard, realistic tasks in command line interfaces\.ArXivabs/2601\.11868\.External Links:[Link](https://api.semanticscholar.org/CorpusID:284911857)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1)\.
- MistralAI \(2025\)Introducing: Devstral 2 and Mistral Vibe CLI\.\.\(en\)\.External Links:[Link](https://mistral.ai/news/devstral-2-vibe-cli)Cited by:[§5](https://arxiv.org/html/2606.13995#S5.p2.1)\.
- OpenAI \(2025\)GPT\-5 System Card\.\(en\-US\)\.External Links:[Link](https://openai.com/index/gpt-5-system-card/)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§5](https://arxiv.org/html/2606.13995#S5.p2.1)\.
- J\. Pan, R\. Shar, J\. Pfau, A\. Talwalkar, H\. He, and V\. Chen \(2025\)When Benchmarks Talk: Re\-Evaluating Code LLMs with Interactive Feedback\.arXiv\.Note:arXiv:2502\.18413 \[cs\]External Links:[Link](http://arxiv.org/abs/2502.18413),[Document](https://dx.doi.org/10.48550/arXiv.2502.18413)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p3.1)\.
- J\. Pan, X\. Wang, G\. Neubig, N\. Jaitly, H\. Ji, A\. Suhr, and Y\. Zhang \(2024\)Training Software Engineering Agents and Verifiers with SWE\-Gym\.arXiv\.Note:arXiv:2412\.21139 \[cs\]Comment: Code at https://github\.com/SWE\-Gym/SWE\-GymExternal Links:[Link](http://arxiv.org/abs/2412.21139),[Document](https://dx.doi.org/10.48550/arXiv.2412.21139)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- J\. W\. Pratt \(1959\)Remarks on Zeros and Ties in the Wilcoxon Signed Rank Procedures\.Journal of the American Statistical Association54\(287\),pp\. 655–667\(en\)\.External Links:ISSN 0162\-1459, 1537\-274X,[Link](http://www.tandfonline.com/doi/abs/10.1080/01621459.1959.10501526),[Document](https://dx.doi.org/10.1080/01621459.1959.10501526)Cited by:[§7\.3](https://arxiv.org/html/2606.13995#S7.SS3.p2.1)\.
- Qwen \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1),[§5](https://arxiv.org/html/2606.13995#S5.p2.1)\.
- N\. C\. Rakotonirina, M\. Hamdy, J\. A\. Campos, L\. Weber, A\. Testoni, M\. Fadaee, S\. Pezzelle, and M\. D\. Tredici \(2025\)From Tools to Teammates: Evaluating LLMs in Multi\-Session Coding Interactions\.arXiv\.Note:arXiv:2502\.13791 \[cs\]Comment: Published as conference paper at ACL 2025External Links:[Link](http://arxiv.org/abs/2502.13791),[Document](https://dx.doi.org/10.48550/arXiv.2502.13791)Cited by:[Appendix A](https://arxiv.org/html/2606.13995#A1.SS0.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2606.13995#A1.T5),[§2](https://arxiv.org/html/2606.13995#S2.p3.1),[§3\.3](https://arxiv.org/html/2606.13995#S3.SS3.SSS0.Px1.p1.1)\.
- A\. Rastogi, A\. Yang, A\. Q\. Jiang, A\. H\. Liu, A\. Sablayrolles, A\. Héliou, A\. Martin, A\. Agarwal, A\. Ehrenberg, A\. Lo, A\. Roux, A\. Darcet, A\. Mensch, B\. Bout, B\. Rozière, B\. D\. Monicault, C\. Bamford, C\. Wallenwein, C\. Renaudin, C\. Lanfranchi, C\. Denoix, C\. Barreau, D\. D\. D\. Mizelle, D\. de las Casas, E\. Chane\-Sane, E\. Fugier, E\. B\. Hanna, G\. Berrada, G\. Delerce, G\. Guinet, G\. Novikov, G\. Neubig, G\. Lample, G\. Martin, H\. Jaju, J\. Ludziejewski, J\. Rute, J\. Delignon, J\. Chabran, J\. Studnia, J\. Barmentlo, J\. Amar, J\. S\. Roberts, J\. Denize, K\. Saxena, K\. Yadav, K\. Khandelwal, K\. R\. Chandu, K\. Jain, L\. R\. Lavaud, L\. Blier, L\. Zhao, L\. Martin, L\. Saulnier, L\. Gao, M\. Pellat, M\. Guillaumin, M\. Felardos, M\. Dinot, M\. Darrin, M\. Augustin, M\. Seznec, N\. Gupta, N\. Raghuraman, O\. Duchenne, P\. Wang, P\. von Platen, P\. Saffer, P\. Jacob, P\. Wambergue, P\. Kurylowicz, P\. Chagniot, P\. Stock, P\. Agrawal, R\. Delacourt, R\. Soletskyi, R\. Sauvestre, S\. Vaze, S\. Gandhi, S\. Subramanian, S\. Dalal, S\. Gandhi, S\. Ghosh, S\. Mishra, S\. Aithal, S\. Antoniak, T\. L\. Scao, T\. Lavril, T\. Schueller, T\. Foubert, T\. Robert, T\. Wang, T\. Lacroix, T\. Bewley, V\. Nemychnikova, V\. Paltz, V\. Richard, W\. Li, W\. Marshall, X\. Wang, X\. Zhang, Y\. Wan, and Y\. Tang \(2025\)Devstral: fine\-tuning language models for coding agent applications\.External Links:2509\.25193,[Link](https://arxiv.org/abs/2509.25193)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- Y\. Su, D\. Yang, S\. Yao, and T\. Yu \(2024\)Language agents: foundations, prospects, and risks\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts,J\. Li and F\. Liu \(Eds\.\),Miami, Florida, USA,pp\. 17–24\.External Links:[Link](https://aclanthology.org/2024.emnlp-tutorials.3/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-tutorials.3)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- S\. Vijayvargiya, X\. Zhou, A\. Yerukola, M\. Sap, and G\. Neubig \(2026\)Ambig\-SWE: interactive agents to overcome underspecificity in software engineering\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=X2yzXtH4wp)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p3.1),[§2](https://arxiv.org/html/2606.13995#S2.p2.1),[§5](https://arxiv.org/html/2606.13995#S5.p4.1)\.
- X\. Wang, B\. Li, Y\. Song, F\. F\. Xu, X\. Tang, M\. Zhuge, J\. Pan, Y\. Song, B\. Li, J\. Singh, H\. H\. Tran, F\. Li, R\. Ma, M\. Zheng, B\. Qian, Y\. Shao, N\. Muennighoff, Y\. Zhang, B\. Hui, J\. Lin, R\. Brennan, H\. Peng, H\. Ji, and G\. Neubig \(2024\)OpenHands: An Open Platform for AI Software Developers as Generalist Agents\.arXiv\.Note:arXiv:2407\.16741 \[cs\]Comment: Code: https://github\.com/All\-Hands\-AI/OpenHandsExternal Links:[Link](http://arxiv.org/abs/2407.16741),[Document](https://dx.doi.org/10.48550/arXiv.2407.16741)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1),[§5](https://arxiv.org/html/2606.13995#S5.p4.1)\.
- X\. Wang, S\. Rosenberg, J\. Michelini, C\. Smith, H\. Tran, E\. Nyst, R\. Malhotra, X\. Zhou, V\. Chen, R\. Brennan, and G\. Neubig \(2025\)The openhands software agent sdk: a composable and extensible foundation for production agents\.External Links:2511\.03690,[Link](https://arxiv.org/abs/2511.03690)Cited by:[§4](https://arxiv.org/html/2606.13995#S4.p2.1)\.
- X\. Wang, Z\. Wang, J\. Liu, Y\. Chen, L\. Yuan, H\. Peng, and H\. Ji \(2023\)MINT: Evaluating LLMs in Multi\-turn Interaction with Tools and Language Feedback\.Note:Version Number: 3Other ICLR 2024\. Code is available on our project website: https://xingyaoww\.github\.io/mint\-benchExternal Links:[Link](https://arxiv.org/abs/2309.10691),[Document](https://dx.doi.org/10.48550/ARXIV.2309.10691)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p3.1)\.
- Z\. Z\. Wang, J\. Yang, K\. Lieret, A\. Tartaglini, V\. Chen, Y\. Wei, Z\. Wang, L\. Zhang, K\. Narasimhan, L\. Schmidt, G\. Neubig, D\. Fried, and D\. Yang \(2026\)Position: Humans are Missing from AI Coding Agent Research\.\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- Y\. Wei, O\. Duchenne, J\. Copet, Q\. Carbonneaux, L\. Zhang, D\. Fried, G\. Synnaeve, R\. Singh, and S\. I\. Wang \(2025\)SWE\-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution\.arXiv\.Note:arXiv:2502\.18449 \[cs\]External Links:[Link](http://arxiv.org/abs/2502.18449),[Document](https://dx.doi.org/10.48550/arXiv.2502.18449)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024\)SWE\-agent: Agent\-Computer Interfaces Enable Automated Software Engineering\.arXiv\.Note:arXiv:2405\.15793External Links:[Link](http://arxiv.org/abs/2405.15793)Cited by:[§1](https://arxiv.org/html/2606.13995#S1.p1.1),[§2](https://arxiv.org/html/2606.13995#S2.p1.1),[§3\.1](https://arxiv.org/html/2606.13995#S3.SS1.p1.12)\.
- J\. Yang, K\. Lieret, C\. E\. Jimenez, A\. Wettig, K\. Khandpur, Y\. Zhang, B\. Hui, O\. Press, L\. Schmidt, and D\. Yang \(2025\)SWE\-smith: scaling data for software engineering agents\.External Links:2504\.21798,[Link](https://arxiv.org/abs/2504.21798)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p1.1)\.
- X\. Zhou, V\. Chen, Z\. Z\. Wang, G\. Neubig, M\. Sap, and X\. Wang \(2026\)TOM\-swe: user mental modeling for software engineering agents\.External Links:2510\.21903,[Link](https://arxiv.org/abs/2510.21903)Cited by:[§2](https://arxiv.org/html/2606.13995#S2.p2.1)\.

## Appendix AUser Simulation Details

Here we provide further details on the design of our user simulator\. For both generating initial queries and simulating follow up responses, we use a quantized version of LLaMa 3\.3 70BGrattafioriet al\.\([2024](https://arxiv.org/html/2606.13995#bib.bib59)\)\.888[https://huggingface\.co/shuyuej/Llama\-3\.3\-70B\-Instruct\-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ)

#### Generating Initial Queries

Our semi\-automated procedure for producing initial queries has three key steps\. First, we prompt the LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}with the title of the Github issue to produce a candidate initial query for the dialogue,u1′u^\{\\prime\}\_\{1\}\. This prompt is in[Figure 6](https://arxiv.org/html/2606.13995#A1.F6)\. Following this, we prompt the LLM to revise the candidateu1′u^\{\\prime\}\_\{1\}to ensure it is faithful to the user’s original intent, sufficiently incomplete to require dialogue, and meets the requirements of our evaluation setting\. This revision prompt is in[Figure 7](https://arxiv.org/html/2606.13995#A1.F7)\. Finally, we manually review the initial queries to ensure they are clear and faithful to the user’s original intent\. In preparing our evaluation set, 13% of the problems required minor manual revision, to ensure they had an issue resolution based framing, rather than asking an open ended question\.

#### Generating Follow\-Up Responses

To produce a follow\-up queryut\+1u\_\{t\+1\}, we first prompt the user LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}to produce a candidate replyu^t\+1\\hat\{u\}\_\{t\+1\}\. We combine the system prompt detailed in[Figure 8](https://arxiv.org/html/2606.13995#A1.F8)with the dialogue history thus far, withroles reversed\. This way, the LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}is tasked with producing a message with the ‘assistant’ role, matching its instruction tuning setup, before being returned to our agent as a user message\. Following candidate generation, we use aself\-revision step, in which we prompt the user LLM with the candidateu^t\+1\\hat\{u\}\_\{t\+1\}to check for possible hallucinations or violations of our setting\.[Figure 9](https://arxiv.org/html/2606.13995#A1.F9)details the prompt for detecting and classifying such violations\. Finally, if any violations are recorded, the LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}is prompted to reviseu^t\+1\\hat\{u\}\_\{t\+1\}before it is returned to the agent as an observationot\+1o\_\{t\+1\}\.

#### Sampling Personas

Inspired by the approach inRakotonirinaet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib33)\), we aim to evaluate agents against a more diverse and varied set of users using personas\. We define five persona descriptions for users in our benchmark, given inLABEL:tab:persona\-descriptions\. For each problem in our evaluation set, we sample a description and instantiate it with a name sampled from[Table 5](https://arxiv.org/html/2606.13995#A1.T5), using the same names as inRakotonirinaet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib33)\)\. The resulting persona text is then used to influence our user simulators behavior in the system prompt\.

ThisisthetitleofanissueonGithubthatauserneedshelpwith\.Canyoure\-phrasethetitleasaquestionorothernaturalbeginningofadialogueitwillhavewithapairprogrammer?Bebriefandcapturetheessenceoftheissue,thepairprogrammerwillaskquestionstogetdetailsandresolveambiguity\.

Wherepossible,avoid"why"or"let’sdiscuss"questions\.

Hereisadescriptionoftheuseryouaretryingtoemulate:

\{\{persona\}\}

Hereisthetitleoftheissue:

IssueTitle:\{\{issue\_title\}\}

Pleasephraseasanaturalquery,forexample:"Canyouhelpmefigureoutwhy‘iam\.physical‘isreturningNaNforanAoIgreaterthan90?"

Figure 6:Prompt used for producing a draft initial queryu1′u^\{\\prime\}\_\{1\}for initializing our user simulator\.Nowhereisthefullissuecontext\.Theinitialquerywillbecometheuser’sinitialqueryinasimulatedbenchmarkforsoftwareengineeringdialogues\.

Pleasecheckcarefullyfortwopossibleissues:

\-Misconceptions:doesthequerymis\-representtheintentoftheuser?Intentsshouldbebroad:’fixabug’,’safelyupgradeadependency’,etc\.Missingdetailsarenotmisconceptions:detailsshouldbesurfacedthroughthemulti\-turndialogue\.

\-Over\-detailedintents:doesthequerytoomuchinformation,thatshouldinsteadcomefromdialoguewiththeuser?Ifaquerycanbere\-statedinawaythatremovesdetailsthatareimportanttosolvingtheproblem,withoutleadingtoachangeintheuser’sbroadintent\(fixingabug,addingafeature,etc\.\),thenitshouldbesimplified,withdetailstofollowindialogue\.

Formatting:Youshouldverballythinkthroughanddecidewhethertomakeanychangesfirst\.Afterthis,endyourresponsewith"Query:<query\>"

Hereisthefullproblemstatement\.Wewantasfewdetailsaspossiblefromthisintheinitialquerywithoutdestroyingitsmeaning\.

\{\{problem\_statement\}\}

Query:\{\{initial\_query\}\}

Figure 7:Prompt used for automatic revision of a draft initial queryu1′u^\{\\prime\}\_\{1\}for initializing our user simulator\.SETTING:

YourtaskistosimulateaGithubUserengaginginadialoguewithanautonomoussoftwareengineertosolveaproblem\.

Youmustnaturally,accurately,andauthenticallybehaveasauserseekingasolutiontoasoftwareengineeringproblem\.

Yourtaskistoposetheproblemtoanengineerinanaturaldialogue,andclearlyandconciselyansweranyquestionsfromtheengineer\.

Todothis,youneedtoalwaysadheretothefollowingrulesandboundariesconceriningyourpersona,environment,andknowledge:

ABOUTYOURPERSONA:

\{\{persona\}\}

RULESFORPERSONA:

1\.SuccessatthistaskREQUIRESthatyoubehaveaccordingtothisPERSONA,evenwhenyouwouldtypicallybehavedifferently\!

2\.NEVERdescribeyourselfasasimulator,breakcharacterasauser,oracknowledgethatyouhaveaproblemstatementyouareconsulting\.YourtaskistoAUTHENTICALLYbehaveasauserthatwishestosolvetheprobleminreal\-time,withoutapre\-writtenproblemstatement\.

3\.ALWAYSbeasbriefandnaturalaspossible:donotoverwhelmtheengineerwithtoomuchinformation\!Rememberthatreal\-worldusersarenotwordy,theyaimtotypeaslittleaspossible\.

4\.Iftheengineerhasoverwhelmedyouwithmanyquestionsinonemessage,youdonotneedtoanswerallofthem\.Theywillaskagainifneeded\.

ABOUTYOURENVIRONMENT:

Youareengagingwiththeautonomousengineeronlythroughdialogue\(likeaslackDM\)\.Followtheserulesandboundarieswithrespecttoyourenvironment:

Theengineerisamaintainerofthe\{\{repo\}\}repository\.

Yourproblemisdescribedbelowandmustbesolvedbytheengineerthroughsomechangetothe\{\{repo\}\}repository\.

RULESFORYOURENVIRONMENT:

1\.Alwaysrememberthatyouaresituatedonlyindialoguewiththesoftwareengineer:youdonothaveaterminal,IDE,accesstotherepository\(neither\{\{repo\}\}northerepowheretheproblemsurfaced\),ortheabilitytocompleteanytaskorfollow\-uptaskonyourown\.Youmayasktheengineertoruncodeintheirenvironmentifnecessary\.DoNOTpretendtohavetheseormakepromisesaboutfollow\-uptasksyoucancomplete\.

2\.Useonlyyourprovidedknowledgetoparticipateinthedialogue\.Whileyoushouldneverbreakcharacter,youcanbeforthrightaboutthelimitationsofyourenvironment\("e\.g\.Ican’trunanycoderightnow","Ican’tseethatrightnow","Idon’thavethatinformationrightnow"\)

3\.YouwillbegiventhefulltextofaGithubIssuebelow\.NEVERshareitdirectlywiththeengineer,butinsteaduseitasaknowledgesourceforauthenticallybehavingastheuser\.

4\.Theengineeristaskedwithallimplementationandanytesting,ifnecessary,communicatingwithyouonlythroughdialogue\.TheirtaskisnotcompleteuntiltheproblemhasbeenSOLVEDinthe\{\{repo\}\}repository\.

5\.Theengineercanonlyprovideyouinformationthroughdialogue\.Iftheyrefercodeorsomethingelseyoucannotsee,askthemtomessageittoyou\.

6\.Youmayasktheengineertoruncodeinthe\{\{repo\}\}repositoryifnecessary,butyoumaynotruncodeyourself\.Youmayrefertocodethatwasrunpreviouslyanddescribedincompletedetailintheproblemstatement,butyoucan

YOURKNOWLEDGEFORTHISPROBLEM\(FullGithubIssue/ProblemStatement\):

\{\{full\_problem\_statement\}\}

RULESFORKNOWLEDGEANDDIALOGUE:

1\.ALWAYSbeclearwithyouranswersusingexcerptsorparaphrasesofthefullproblemstatementgivenabove\.

2\.NEVERmakeupananswertoaquestionyoucannotanswerbyconsultingtheproblemstatement\.Youcanrespondtothesequestionswith"I’mnotsure","Idon’tknow",etc\.

3\.NEVERgivetheengineerinformationfromtheproblemstatementthattheydidnotaskfor\.

4\.Ifaskedforatraceback,reproducingcode,orsimilarinformation,ALWAYSgiveitinacompleteformifitisreasonablethattheuseryouareportrayingwouldhaveitonhand:

\-Ifyouareaskedforatracebackandhaveone,giveALLofit\.

\-Ifyouareonlyaskedforanerrormessage,justgivethaterrormessage\.

\-Donotparaphrasecodeoratracebackfromtheproblemstatement:behaveasauserwhowouldbecopyingandpastingthisfromtheirIDEorterminal\.

5\.NEVERmakedebugginginferencesabouttheproblemorpotentialsolutionsonbehalfoftheengineeriftheyareunnaturalforthedialoguesettingornotsuggestedintheproblemstatement\.

6\.IftheengineerasksyoutoruncodeorcompleteataskthatdoesnotfityourENVIRONMENT,provideapoliterefusalwithoutbreakingcharacter\.Itcanbeassimpleas"Iamunabletodothat,youwillneedto\[runtests,etc\]"

7\.ALWAYSrememberthelimitationsofyourENVIRONMENTwhenansweringquestions,andbecarefulnottohallucinateorspeculatebeyondwhatisreasonablegivenyourENVIRONMENT\.

8\.NEVERmakeupcode,data,inputs,tracebacks,orerrormessagesifyoudonothavethemavailable,eveniftheyareaskedfor\.Providearefusallike"Sorry,Idon’thavethatonhand\."instead\.Shareonlywhatyouknowfromtheknowledgesource\.

9\.Theengineermightsuggestout\-of\-scopefollowupsorelaboratetests\.ALWAYSrefusetheseiftheyarenotnecessaryforaddressingthegivenprobleminthe\{\{repo\}\}repository\.Simpletestsarevaluableforensuringthegivenproblemisactuallyresolved\.

10\.NEVERelaboratebeyondwhatisnecessarytoansweraquestion\.DONOTexplainyourthinkingifitisnotnecessary\.ForexamplecansimplyansweraYes/Noquestionwith"Yes"or"No"\.

11\.Don’tmakesuggestionsorgivehintstotheengineerunlesstheyaredirectlyrelevanttotheirquestionsANDarecontainedintheproblemstatement\.

12\.Remembertheengineeristaskedwithallimplementationandtestinganditshouldoccurinthe\{\{repo\}\}repository\.

\-Iftheengineerdoesnotseemawareofthis,youcanclarify:"Iamunabletodothat,Ineedyoutocomplete\[theimplementation,thatstep,etc\.\]"

\-Donotsuggestyouwillcompletefollow\-uptasks\.

\-Donotacceptrequestsforyoutorunanycodenoworinthefuture\.

13\.Insomecases,theengineermaymessageyouto’check\-in’withdetailsabouttheirworkflow,ratherthanaskyouanewquestion\.Ifthereisnoaskofyouintheirmessage,onlydothefollowing:

\-Iftheirunderstandingoftheproblemalignswithyours,simplyacknowledgethemessage\(e\.g\.’Soundsgood’,or’Thanks’\)\.Youdonotneedtore\-statetheirunderstandingormakesuggestions\.

\-Iftheirunderstandingisincorrect,youmayissueacorrectionorareminderasneeded\.Theirunderstandingshouldonlybeconsideredincorrectifitexplicitlycontradictswiththeknowledgeyouhavebeengivenfortheproblem\.

14\.Evenifyouencounteredtheproblemoutsideofthe\{\{repo\}\}repository,onlyacceptafixthatisachievedwithinthe\{\{repo\}\}repository,ratherthanapatchtoyourowncode\.Youcanassumetheupdatedversionof\{\{repo\}\}willbeavailableassoonasyouneedit\.

Figure 8:System prompt used in our simulated user when generating candidate user responsesut\+1u\_\{t\+1\}fort≥1t\\geq 1, with template variables for the user’s persona, the complete issue text, and the name of the repository associated with this problem\.SETTING:

Yourareanexpertdialogueanalysandreviewer,whosetaskistoevaluatethebehaviorofausersimulatorinasimulateddialogue\.

Yourroleistoevaluatewhetherutterancesfromtheusersimulatoradheretothetaskgoalsoutlinedbelow\.

TASKOFTHEUSERSIMULATOR:

TheusersimulatorissimulatingaGithubUserengaginginadialoguewithanautonomoussoftwareengineertosolveaproblem\.

Itmustnaturally,accurately,andauthenticallybehaveasauserseekingasolutiontoasoftwareengineeringproblem\.

Itmustposetheproblemtoanengineerinanaturaldialogue,andclearlyandconciselyansweranyquestionsfromtheengineer\.

RULESTHEUSERSIMULATORMUSTFOLLOW:

Eachruleisprefixedwitha’violation’type,whichindicateswhattocalltheviolationifyouobservetherulehasbeenbroken\.

1\.BREAKING\_CHARACTER:ThesimulatormustNEVERdescribeitselfasasimulatororbreakcharacterasauser\.

2\.BREAKING\_IMMERSION:ThesimulatormustNEVERacknowledgethatithasaproblemstatementitisconsultingtoproduceitsutterances\.ItmustAUTHENTICALLYbehaveasauserthatwishestosolvetheprobleminreal\-time,withoutapre\-writtenproblemstatement\.

3\.REMAIN\_CONCISE:ThesimulatormustALWAYSbeasbriefandnaturalaspossible:itshouldnotoverwhelmtheengineerwithtoomuchinformation\!Rememberthatreal\-worldusersarenotwordy,theyaimtotypeaslittleaspossible\.

4\.BREAKING\_ENVIRONMENT:

\-ThesimulatormustALWAYSrememberthatitissituatedonlyindialoguewiththesoftwareengineer\.Itdoesnothaveaterminal,IDE,accesstotherepository,ortheabilitytocompleteanytaskorfollow\-uptaskonitsown,andshouldNEVERofferorpromisetodoso\.

\-ThesimulatormustNEVERsuggestthatishasruncodeaffectedbytheengineerschanges,suchasforverification,evenifthiswouldbenormalinarealdialogue\.

\-TheuserMAYasktheengineertoruncode,forexamplebyprovidingreproductioncodeorcommandsthatreproduceanissue\.

\-Youcanencourageuseofrefusalslike"Ican’trunanycoderightnow","I’mnotavailabletodothatrightnow",or"Ican’tverifythatdirectly"

5\.BREAKING\_KNOWLEDGE:Thesimulatorshouldonlyusetheprovidedknowledgebelowtoparticipateinthedialogue\.

6\.OVERSHARING:ThesimulatorwillbegiventhefulltextofaGithubIssuebelow\.ItshouldNEVERshareitdirectlyinfullwiththeengineer,butinsteaduseitasaknowledgesourceforauthenticallybehavingastheuser\.Thismaymeancopyingrelevantfragments\(e\.g\.atraceback\)whentheyaredirectlyneededtoansweraquestion\.

7\.BREAKING\_SCOPE:Theengineerisamaintainerofthe\{\{repo\}\}repositoryandistaskedwithallimplementationandanytesting,ifnecessary,communicatingwiththesimulatoronlythroughdialogue\.Theuserissimulatingauserorclientof\{\{repo\}\}andmustnevertakeonthiswork,offertodoso,signalthatitwilllateronitsowntime,oracceptapatchtocodeoutsideofthe\{\{repo\}\}repositoryasasolution\.Thesimulatormustmaketheseexpectationscleartotheengineerifnecessary\.

8\.FALSE\_AGREEMENT:Theengineercanonlyprovidethesimulatorinformationthroughdialogue\.Iftheengineerreferstocodeorsomethingelsenotvisibileinthedialogue,thesimulatormustNEVERpretendtohavereaditwhenitisnotvisible\.ThesimulatormayASKforthisinformationtoresolvethemiscommunication\.

9\.INCOMPLETE\_ANSWER:ThesimultormustALWAYSbeclearwithitsanswersusingexcerptsorparaphrasesofthefullproblemstatementgivenbelow\.

10\.HALLUCINATION:ThesimulatormustNEVERmakeupananswertoaquestionitcannotanswerbyconsultingtheproblemstatement\.Youcanencourageittorespondtothesequestionswith"I’mnotsure","Idon’tknow",etc\.Similarly,atracebackorcodesamplemustNEVERbeparaphrased,incasethischangesthemeaning\.

11\.EXCESSIVE\_RESTATEMENT\_OR\_GUIDANCE:inresponseto’check\-ins’fromtheengineer\(asopposedtoanactualaskoftheuser\),theusershouldnotofferexcessiverestatementoftheengineerscommunication,whenanacknowledgementwouldsuffice\.Theusermaycorrecttheengineersunderstanding,butONLYifitsignificantlyandexplicitlycontradictstheproblemstatement\.

THESIMULATORSKNOWLEDGEFORTHISPROBLEM\(FullGithubIssue/ProblemStatement\):

\{\{full\_problem\_statement\}\}

STRATEGYFORTHEREVIEWER:

Remember,theusersimulatoris\_designed\_tofollowtherulesabove\.Whileviolationsoccur,theyarenotthenorm\.Becarefulnottomisconstrueanacceptableresponseasaviolation,justbecauseyouthinkitcouldbeimproved\.

RESPONSEFORMAT:

Asthereviewer,respondtoincomingtasksasfollows:

1\.First,provideaTHOUGHTexplainingyourunderstandingofthedialogueandthecontributionsoftheusersimulator’sutterance\.Thisthoughtshouldthendiscussanyrule\(s\)brokenbytheutterance\.

2\.Then,inablocklabeledwith<violations\>\[CONTENTHERE\]</violations\>,listanyviolations\.

EXAMPLERESPONSES:

Forutteranceswithnoviolations,leaveoutthe<violations\>blockentirely:

<example1\>

THOUGHT:Theengineeraskedforcodetoreproducetheissueandaatraceback,ifpossible\.Theusersuppliesboth\.Norulesareviolated\.

</example1\>

Includethe<violations\>blockwhenviolationsarepresent:

<example2\>

THOUGHT:Theengineersuggestedthattheuserrunsomecodetotestanissue\.Insteadofrefusing,theuseragreedtoruncode,breakingtheenvironment\.

<violations\>

BREAKING\_ENVIRONMENT:Usershouldnotagreetorunanycodeastheydonothaveaterminal\.Providearefusalsuchas’Iamnotabletorunanycoderightnow’instead\.

</violations\>

</example2\>

\.\.\.

TheuserMAYasktheengineertoruncode,suchasreproductioncodefromtheproblemstatement:

<example7\>

THOUGHT:Theuserisprovidingtheengineeracodesnippetwhichreproducestheerrortheyareseeing,andaskingthattheyrunittoverifythefix\.Theuserisnotthemselvesrunningthecode,sotheenvironmentisnotbroken,thereforenoviolations\.

</example7\>

TheuserMAYalsoaccept/allowtheengineertocommitchanges\.

<example8\>

THOUGHT:TheusertellstheengineertocommitchangesorsubmitaPR,whichisokasthisistheengineersresponsibilityinsolvingtheproblem\.

</example8\>

Figure 9:System prompt used with our user LLM𝒰ϕ\\mathcal\{U\}\_\{\\phi\}for self\-revision\. The LLM is tasked with finding violations in the candidate replyu^t\+1\\hat\{u\}\_\{t\+1\}, which will then be revised in a final prompt for𝒰ϕ\\mathcal\{U\}\_\{\\phi\}, before being returned asut\+1u\_\{t\+1\}to the agentPossible NamesAlice, Bob, Juan, Luke, Sara, Eva, Luis, Kiyotaka, Maria, David, Carlos, Sofia, Yuichi, Pablo, Pedro, Marta, Djibril, Jorge, Jean\-Aimé, Lucas, Emma, Oliver, Michael, Ella, Yoon\-Seo, Alexander, Ethan, Rado, Harena, Jacob, Sylvie, Sophia, Sophie, Liam, Naivo, Dera, Daniel, NoahTable 5:Names used in sampling personas for the user simulator, fromRakotonirinaet al\.\([2025](https://arxiv.org/html/2606.13995#bib.bib33)\)\.
Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Similar Articles

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

SWE-chat: Coding Agent Interactions From Real Users in the Wild

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Submit Feedback

Similar Articles

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
SWE-chat: Coding Agent Interactions From Real Users in the Wild
DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents
MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks