From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

arXiv cs.AI Papers

Summary

This paper theoretically identifies and mitigates context distribution shift in multi-turn dialogue RL, proposing Calibrated Interactive RL that couples interactive RL with simulator alignment to reduce the sim-to-real gap and achieve state-of-the-art performance.

arXiv:2605.26403v1 Announce Type: new Abstract: A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.

# From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
Source: [https://arxiv.org/html/2605.26403](https://arxiv.org/html/2605.26403)
Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Chen Tao, Xiaoqing Zheng

Fudan University

xhwang24@m.fudan.edu.cn

###### Abstract

A long\-standing goal of the research community is to develop highly interactive LLM\-based dialogue agents\. Recent research focuses on optimizing policies based on fixed offline logs \(Static Context RL\) or using a prompt\-based simulator \(Interactive RL\)\. In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift—a mismatch between dialogue histories observed during training and those encountered in real conversations\. This shift compounds quadratically over turns and severely degrades dialogue quality\. Specifically, we attribute this shift to two distinct sources: \(i\) policy\-induced shift, arising from training on static histories rather than self\-generated trajectories; and \(ii\) simulator\-induced shift, stemming from discrepancies between simulated and real human behaviors\. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment\. By aligning the simulator with human interaction patterns, our approach reduces the sim\-to\-real gap and mitigates compounding distribution shifts\. Experiments across multiple dialogue tasks confirm our theoretical analysis: \(i\) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and \(ii\) calibrating simulators with our alignment method further bridges the sim\-to\-real gap, yielding state\-of\-the\-art downstream performance\.

## 1 Introduction

A long-standing goal in the research community is to develop highly interactive LLM-based dialogue agents that can sustain coherent, safe, and goal-directed multi-turn conversations Li et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib12)); Laban et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib40)). To advance this objective, recent work has increasingly adopted reinforcement learning (RL) to align the model's behavior with human preferences Ouyang et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib3)); Bai et al. ([2022a](https://arxiv.org/html/2605.26403#bib.bib10), [b](https://arxiv.org/html/2605.26403#bib.bib11)). Broadly, existing approaches can be categorized into two primary paradigms: Static Context RL Lee et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib4)); Hong et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib5)); Verma et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib7)); Snell et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib8)); Rafailov et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib42)) and Interactive RL Wu et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib2)); Shani et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib6)); Carta et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib9)). The former optimizes policy models using fixed offline contexts (i.e., serialized conversation histories), without the need to generate the preceding turns. In contrast, the latter addresses the inherently dynamic nature of dialogue by employing a prompt-based user simulator, thereby enabling agents to engage in multi-turn interactions within a closed-loop learning framework.

Despite their preliminary efficacy, these approaches face fundamental limitations as shown in Table[1](https://arxiv.org/html/2605.26403#S2.T1)\. On one hand, static context methods disrupt the essential RL interaction loop, thereby undermining the interactive capabilities of dialogue agents\. On the other hand, while interactive approaches attempt to bridge this gap, they typically rely on prompt\-based simulators that fail to capture the stochasticity, ambiguity, and noise inherent in real user inputs\. This deficiency creates a significant sim\-to\-real gap, ultimately compromising the policy’s applicability in real\-world scenarios\.

In this work, we establish the theoretical foundations for addressing these challenges. We formally demonstrate that existing paradigms are fundamentally constrained by context distribution shift, a discrepancy between dialogue histories observed during training and those encountered in real conversations. We show that this shift compounds quadratically over turns, severely degrading dialogue quality. Specifically, we decompose this phenomenon into two distinct sources: (i) Policy-induced shift is caused by the mismatch between fixed offline histories and self-generated trajectories; and (ii) Simulator-induced shift arises from the structural divergence between prompt-based simulators and real human behaviors.

To this end, we introduce Calibrated Interactive RL, a unified framework that couples interactive RL with an aligned simulator. Specifically, we first align the user simulator using Supervised Fine-Tuning (SFT) on real human multi-turn interactions, enabling it to capture realistic human behaviors such as clarification-seeking, stubbornness, and strict goal-adherence. Building upon this calibrated simulator, we leverage interactive RL to optimize the policy on self-generated trajectories, thereby explicitly addressing the policy-induced distribution shift and allowing the agent to learn essential error-recovery skills.

Empirically, our analysis reveals that standard prompt-based simulators exhibit severe behavioral artifacts, most notably sycophancy, a tendency to blindly accept incorrect agent outputs or prematurely leak ground-truth information. This effectively trivializes the RL interaction loop and leads to reward hacking, where the policy learns to exploit the simulator's permissiveness rather than solving the actual task. By treating the simulator as a trainable agent and performing behavioral cloning on high-quality offline logs, our Simulator Alignment framework grounds the simulation in real human behavior. Our experiments demonstrate that Interactive RL consistently outperforms the Static Context baseline and that training with our Aligned Simulators yields further significant gains. These results highlight the need for both closing the interaction loop and explicitly grounding simulations.

Our contributions are summarized as follows:

- We establish the theoretical foundation for multi-turn dialogue RL, reveal the source of the context distribution shift in existing RL methods, and mathematically demonstrate that these shifts compound quadratically over turns.
- We propose Calibrated Interactive RL, a unified framework that couples interactive policy optimization with simulator alignment. We introduce an SFT-based calibration method that grounds the user simulator in real human behavioral distributions, eliminating simulation artifacts (e.g., sycophancy) and effectively bridging the sim-to-real gap.
- Extensive experiments demonstrate that Interactive RL consistently outperforms Static Context baselines. Furthermore, explicitly calibrating the simulator yields significant additional gains, highlighting the critical importance of preventing reward hacking by grounding simulations in human reality.

## 2 Related Work

Table 1: Comparison of RL Paradigms in Multi-turn Dialogue. We contrast our proposed Calibrated Interactive RL with existing Static and Interactive baselines. Our framework is the only one that addresses both policy-induced ($\Delta_t$) and simulator-induced ($\delta_t$) distribution shifts.

| Paradigm | Training Context | Interaction Loop | Simulator Fidelity | Primary Limitation (Theoretical) |
| --- | --- | --- | --- | --- |
| Static Context RL | Offline Logs (Fixed) | Open Loop | N/A | Policy Shift ($\Delta_t$): Errors compound quadratically; no error recovery training. |
| Interactive RL | On-Policy Trajectories | Closed Loop | Uncalibrated | Sim-to-Real Gap ($\delta_t$): Simulator artifacts (e.g., sycophancy) mislead the policy. |
| Calibrated RL (Ours) | On-Policy Trajectories | Closed Loop | Aligned | None: Mitigates both shifts via joint interactive training and simulator alignment. |

### 2.1 Reinforcement Learning for Multi-turn Dialogue

Reinforcement Learning from Human Feedback (RLHF) has established itself as the cornerstone for aligning LLMs with human intent Ouyang et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib3)); Bai et al. ([2022a](https://arxiv.org/html/2605.26403#bib.bib10)). In the context of dialogue systems, existing approaches can be broadly categorized into Static Context RL and Interactive RL. Static Context RL optimizes policies based on static, offline conversation logs Shi et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib14)); Hong et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib5)); Snell et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib8)). For instance, Snell et al. ([2022](https://arxiv.org/html/2605.26403#bib.bib8)) apply implicit Q-learning to fixed language datasets. Although effective for single-turn instruction following, these static context approaches suffer from distribution shift and cannot correct generation errors that compound over turns, a known limitation in sequential decision-making Levine et al. ([2020](https://arxiv.org/html/2605.26403#bib.bib15)). In contrast, Interactive RL mainly utilizes a simulator to close the interaction loop and allows the agent to learn from its own generation trajectory Shani et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib6)); Zhou et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib13)). For example, CollabLLM Wu et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib2)) utilizes a prompt-based user simulator to train agents on online collaborative traces. Distinct from these approaches, we theoretically prove that existing methods suffer from context distribution shift, a mismatch between the dialogue histories seen during training and those encountered in real conversations. We further introduce Calibrated Interactive RL, which bridges the sim-to-real gap by aligning the user simulator via supervised fine-tuning on real human multi-turn interactions.

### 2.2 Distribution Shift and Exposure Bias

The discrepancy between training on ground-truth history (i.e., teacher forcing) and generating on the model's own history (i.e., student forcing) is a fundamental challenge in sequence generation, known as exposure bias Bengio et al. ([2015](https://arxiv.org/html/2605.26403#bib.bib16)); Ranzato et al. ([2015](https://arxiv.org/html/2605.26403#bib.bib17)). In the context of decision-making, this phenomenon parallels the distribution shift analyzed in imitation learning. Ross et al. ([2011](https://arxiv.org/html/2605.26403#bib.bib18)) demonstrated that behavior cloning (i.e., offline training) results in errors that accumulate quadratically ($O(H^2)$) with the horizon $H$, whereas interactive expert guidance (i.e., DAgger) can reduce this to linear growth. While recent works have analyzed this shift in the context of single-turn reasoning chains Wang et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib19)), our work formally extends this analysis to the multi-turn dialogue setting. We derive a rigorous bound illustrating how static context training fails to generalize to the dynamic distribution induced by the policy's own actions during prolonged interactions.

### 2.3 LLM-based User Simulators

Employing LLM-based simulators as surrogates for human users has emerged as a prominent trend for scalable evaluation and training Shi et al. ([2019](https://arxiv.org/html/2605.26403#bib.bib20)); Tseng et al. ([2021](https://arxiv.org/html/2605.26403#bib.bib21)); Hu et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib22)); Gandhi et al. ([2026](https://arxiv.org/html/2605.26403#bib.bib23)). Park et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib24)) demonstrated that LLM agents can simulate believable social behaviors, and automated evaluation frameworks like MT-Bench Bai et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib25)) and AlpacaEval Dubois et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib26)) utilize LLM judges to proxy human preferences. In the domain of interactive training, recent studies have employed "Self-Play" or simulator-based RL to improve task performance Shao et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib27)); Chen et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib28)); Silver et al. ([2017](https://arxiv.org/html/2605.26403#bib.bib29)). However, uncalibrated simulators often exhibit distinct artifacts, such as "sycophancy" (agreeing with the model's errors) Sharma et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib30)) or unrealistic persona consistency Shanahan et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib31)). These artifacts can be exploited by the policy agent during RL training, leading to reward hacking Wang and others ([2025](https://arxiv.org/html/2605.26403#bib.bib39)), where the agent optimizes for simulator approval rather than genuine task success. Unlike previous works that use off-the-shelf LLMs as simulators, we propose a Simulator Alignment framework, treating the simulator as a trainable agent optimized to minimize the sim-to-real gap, thereby ensuring that interactive training yields robust real-world performance.

## 3 Distribution Shift in Multi-turn RL

![Refer to caption](https://arxiv.org/html/2605.26403v1/x1.png)

Figure 1: Comparison of reinforcement learning frameworks for multi-turn dialogue. (a) Pseudo Multi-turn RL optimizes the policy using static offline conversation logs, where the model conditions its actions on fixed historical contexts without generating the preceding turns, leading to distribution shift. (b) True Multi-turn RL (Ours) closes the interaction loop through a two-stage process: (1) User Simulator Training, where the simulator is explicitly aligned with human reference data to minimize the simulator–human gap; and (2) Policy Training, where the agent is optimized via dynamic, multi-turn interactions with the aligned simulator, ensuring robustness to compounding errors.

In this section, we formally analyze the distribution shifts inherent in multi-turn dialogue training. We derive two fundamental bounds that motivate our proposed Calibrated Interactive RL framework. First, we show that standard offline training (i.e., Static Context RL) induces a biased objective due to compounding policy deviations. Second, we demonstrate that while Interactive RL addresses this by closing the interaction loop, it relies on simulator fidelity, as simulator errors also accumulate quadratically. These insights underscore the necessity of our Simulator Alignment approach. Detailed proofs are provided in Appendix [A](https://arxiv.org/html/2605.26403#A1).

### 3.1 Setup and Notation

We consider an episodic dialogue environment with a finite horizon $H$. Let $s_t \in \mathcal{S}$ denote the environment state and $a_t \in \mathcal{A}$ the dialogue agent's action at time $t$. A history $h_t = (s_0, a_0, \dots, s_t)$ represents the trajectory up to time $t$. Policies map histories to distributions over actions; we denote the behavior policy that generated the offline logs as $\pi_{\text{data}}$, and the learned policy as $\pi$.

Let $d^{\pi}_t(h)$ be the marginal distribution over histories at time $t$ induced by policy $\pi$. To analyze the simulation gap, we distinguish between the transition dynamics of the real user, denoted by the operator $\mathcal{T}_{\text{real}}$, and the user simulator, $\mathcal{T}_{\text{sim}}$. We assume rewards are bounded such that $|r(s,a)| \leq R_{\max}$. The objective is to maximize the finite-horizon return $J(\pi) = \mathbb{E}_{\tau \sim p_{\pi}}\left[\sum_{t=0}^{H-1} r_t\right]$.

Under this notation, Static Context RL decouples policy optimization from state generation. Its updates and gradient estimators are computed using contexts $h \sim d^{\text{data}}_t$ sampled from the offline distribution $d^{\text{data}}_t$, rather than the on-policy distribution $d^{\pi}_t$. Consequently, the optimization is fundamentally off-policy with respect to the state distribution, as it ignores how $\pi$ influences future histories. In contrast, Interactive RL maintains the closed interaction loop. The objective is evaluated under the distribution $h \sim d^{\pi}_t$ induced by the policy's interaction with the environment, thereby capturing the consequences of its own actions.

### 3.2 Error Source I: Compounding Policy Shift in Static Context RL

We first quantify the performance degradation caused by training on static contexts. Let $\pi'$ denote a reference policy (e.g., the data-generating policy $\pi_{\text{data}}$) and $\pi$ denote the learned policy. We define two divergence terms: the history distribution mismatch $\Delta_t = \|d^{\pi}_t - d^{\pi'}_t\|_1$, and the per-step policy deviation $\varepsilon_t = \mathbb{E}_{h \sim d^{\pi'}_t}\|\pi(\cdot \mid h) - \pi'(\cdot \mid h)\|_1$.

###### Theorem 3.1 (Performance Difference via Policy Shift).

For any two policies $\pi, \pi'$ and finite horizon $H$, the difference in expected return is bounded by:

$$|J(\pi) - J(\pi')| \leq R_{\max} \sum_{t=0}^{H-1} (\Delta_t + \varepsilon_t). \tag{1}$$

Furthermore, since $\Delta_t \leq \sum_{i=0}^{t-1} \varepsilon_i$, the total error satisfies:

$$|J(\pi) - J(\pi')| \leq R_{\max} \sum_{i=0}^{H-1} (H - i)\, \varepsilon_i. \tag{2}$$

Proof Sketch. We decompose the return difference into a sum over time of divergences in the joint (history, action) distributions. As the environment transition operator is non-expansive in the $\ell_1$-norm, distribution errors propagate linearly, yielding $\Delta_{t+1} \leq \Delta_t + \varepsilon_t$. Summing this recurrence yields the quadratic bound. Detailed proofs are provided in [Appendix A](https://arxiv.org/html/2605.26403#A1).

Theoretical Implication. Theorem [3.1](https://arxiv.org/html/2605.26403#S3.Thmtheorem1) establishes that errors in Static Context RL compound quadratically ($O(H^2)$). Small deviations $\varepsilon_t$ in the policy at early turns accumulate into a large distribution shift $\Delta_t$, pushing the agent into out-of-distribution states where it has never been trained. Static Context RL ignores $\Delta_t$, optimizing only for local accuracy, and thus fails to guarantee long-horizon interactive performance.
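The step from Eq. (1) to Eq. (2), and the resulting quadratic rate, can be spelled out as follows. This is a sketch rather than the appendix proof: it substitutes the stated bound on $\Delta_t$ into the sum and exchanges the order of summation. The uniform bound $\varepsilon_i \leq \varepsilon$ in the last line is an illustrative assumption and is not a condition of the theorem.

```latex
\begin{align*}
\sum_{t=0}^{H-1} (\Delta_t + \varepsilon_t)
  &\le \sum_{t=0}^{H-1} \Big( \sum_{i=0}^{t-1} \varepsilon_i + \varepsilon_t \Big)
   = \sum_{t=0}^{H-1} \sum_{i=0}^{t} \varepsilon_i \\
  &= \sum_{i=0}^{H-1} (H - i)\, \varepsilon_i
   && \text{(each $\varepsilon_i$ appears once for every $t = i, \dots, H-1$)} \\
  &\le \varepsilon \sum_{i=0}^{H-1} (H - i)
   = \varepsilon\, \frac{H(H+1)}{2} = O(H^2 \varepsilon)
   && \text{(assuming $\varepsilon_i \le \varepsilon$ for all $i$)}
\end{align*}
```

The weight $(H - i)$ also shows that a deviation at an early turn is counted more times than the same deviation at a late turn, since it contaminates every history that follows it.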

Intuitive Explanation & Running Example. To intuitively understand this quadratic compounding effect, consider a conversational math tutoring scenario. Suppose the agent makes a minor logical error at turn $t=1$ (a small policy deviation $\varepsilon_1$). In a real interaction, the subsequent dialogue history will fundamentally diverge from the ground-truth offline logs (creating a large history mismatch $\Delta_2$), because the human user will likely point out the mistake or express confusion. However, a model trained purely via Static Context RL has only seen perfect, expert-guided trajectory histories. When thrust into this self-generated, flawed context during deployment, the agent lacks the essential error-recovery skills. Instead of correcting itself, it is prone to hallucinate further (yielding large $\varepsilon_2, \varepsilon_3, \dots$). Consequently, an initially negligible mistake cascades dynamically over the horizon $H$, leading to catastrophic failure. Interactive RL mitigates this by explicitly exposing the agent to its own mistakes during training, thereby teaching it how to navigate back to the correct path.

### 3.3 Error Source II: Simulator-Real Distribution Shift

While Interactive RL resolves the policy distribution shift by training on $d^{\pi}_t$, it introduces a new source of error: the discrepancy between the simulator $\mathcal{T}_{\text{sim}}$ and the real user $\mathcal{T}_{\text{real}}$. To isolate the impact of simulator fidelity, we analyze the gap for a fixed policy $\pi$. We define the simulator error at time $t$ as the expected divergence in transition dynamics under the real joint distribution: $\delta_t = \mathbb{E}_{(h,a) \sim q^{\text{real}}_t}\|\mathcal{T}_{\text{real}}(\cdot \mid h,a) - \mathcal{T}_{\text{sim}}(\cdot \mid h,a)\|_1$, where $q^{\text{real}}_t(h,a) = d^{\pi}_{\text{real},t}(h)\,\pi(a \mid h)$.

###### Theorem 3.2 (Sim-to-Real Gap).

For any policy $\pi$ and horizon $H$, the performance gap caused by the simulator is bounded by:

$$|J_{\text{real}}(\pi) - J_{\text{sim}}(\pi)| \leq R_{\max} \sum_{i=0}^{H-1} (H - i)\, \delta_i. \tag{3}$$

Proof Sketch. We upper-bound the performance gap between the real and simulated environments by aggregating the marginal distribution shifts over the horizon. Leveraging the non-expansive nature of stochastic transition operators under the $\ell_1$-norm, we show that the distribution divergence at any step is bounded by the sum of the propagated error from the previous step and the immediate local simulation error. This leads to the recurrence $\Delta^{\text{env}}_{t+1} \leq \Delta^{\text{env}}_t + \delta_t$, which telescopes to yield the final cumulative bound. The complete derivation is detailed in [Appendix A](https://arxiv.org/html/2605.26403#A1).

Algorithm 1: Calibrated Interactive RL

1. Input: Multi-turn data $\mathcal{D}_{\text{multi}}$, Task data $\mathcal{D}$
2. Initialize: Simulator $\pi^{\text{sim}}_{\phi}$, Policy $\pi_{\theta}$
3. // Phase 1: Simulator Calibration (SFT)
4. while not converged do
5. Sample batch $(x, y^*, \tau_{\text{human}})$ from $\mathcal{D}_{\text{multi}}$
6. Compute $\mathcal{L}_{\text{SFT}}(\phi)$ via NLL on user turns $u_t$
7. Update $\phi$ via Gradient Descent
8. end while
9. Freeze aligned simulator $\pi^{\text{sim}}_{\phi^*}$
10. // Phase 2: Interactive Policy Opt.
11. for each training epoch do
12. Sample context $(x, y^*)$ from $\mathcal{D}$
13. Initialize history $h_0 = x$
14. for $t = 0$ to $H-1$ do
15. Agent: $a_t \sim \pi_{\theta}(\cdot \mid h_t)$
16. Simulator: $u_t \sim \pi^{\text{sim}}_{\phi^*}(\cdot \mid [x, y^*, h_t, a_t])$
17. Update history $h_{t+1} \leftarrow h_t \cup \{a_t, u_t\}$
18. end for
19. Final reward $R = \text{Score}(a_{H-1}, y^*)$
20. Update policy $\theta$ via GRPO to maximize $R$
21. end for

Theoretical Implication. Theorem [3.2](https://arxiv.org/html/2605.26403#S3.Thmtheorem2) highlights that simulator misalignment is as detrimental as policy misalignment. Even if interactive training is used, a simulator that deviates from human behavior (i.e., $\delta_t > 0$) causes the training objective to diverge from the real-world objective at a rate of $O(H^2 \delta)$, where $\delta$ is an upper bound on the per-step errors $\delta_t$. This formally justifies our Simulator Alignment approach: minimizing $\delta_t$ is a theoretical prerequisite for robust interactive learning.
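To get a feel for how the right-hand side of Eq. (3) scales, the following sketch evaluates it numerically. The function names and the per-turn error value are hypothetical, chosen only for illustration; the formula itself is the bound stated in Theorem 3.2.

```python
def sim_to_real_bound(deltas, r_max=1.0):
    """Right-hand side of Eq. (3): R_max * sum_i (H - i) * delta_i."""
    horizon = len(deltas)
    return r_max * sum((horizon - i) * d for i, d in enumerate(deltas))


def uniform_bound(horizon, delta, r_max=1.0):
    """Closed form when every delta_i equals delta: R_max * delta * H(H+1)/2."""
    return r_max * delta * horizon * (horizon + 1) / 2


if __name__ == "__main__":
    delta = 0.01  # hypothetical per-turn simulator error, not a measured value
    for horizon in (2, 4, 8, 16):
        bound = sim_to_real_bound([delta] * horizon)
        print(f"H={horizon:2d}  bound={bound:.2f}")
```

With a constant per-turn error, doubling the horizon roughly quadruples the bound. The weights $(H - i)$ also mean that a simulator error at an early turn contributes more than the same error at a late turn.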

Intuitive Explanation & Running Example. To see why standard interactive training is insufficient without simulator calibration, consider the well-documented behavioral artifacts of uncalibrated LLM simulators, such as “sycophancy” (blind agreement) Sharma et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib30)). Suppose the dialogue agent proposes a highly flawed document edit. A real human user would reject the edit and demand a revision (representing a specific transition in $\mathcal{T}_{\text{real}}$). In contrast, a naive, uncalibrated simulator might enthusiastically accept the flawed edit, complimenting the agent (a severely divergent transition in $\mathcal{T}_{\text{sim}}$, yielding a large local error $\delta_t$). By interacting with this flawed simulator over multiple turns, the agent receives false-positive feedback and learns a degenerate policy that exploits the simulator's permissiveness rather than actually solving the task, a phenomenon closely related to reward hacking Wang and others ([2025](https://arxiv.org/html/2605.26403#bib.bib39)). As shown by our bound, this simulator-induced delusion accumulates quadratically over the conversation length $H$. Therefore, aligning the simulator to accurately reflect human stubbornness, constraints, and feedback patterns is not merely an engineering trick, but a fundamental mathematical requirement to ensure that the surrogate objective $J_{\text{sim}}(\pi)$ faithfully proxies the true real-world objective $J_{\text{real}}(\pi)$.

## 4 Calibrated Interactive RL

Motivated by the compounding distribution shifts identified in [Theorem 3.1](https://arxiv.org/html/2605.26403#S3.Thmtheorem1) and [Theorem 3.2](https://arxiv.org/html/2605.26403#S3.Thmtheorem2), it is evident that a robust multi-turn dialogue agent must simultaneously address both policy-induced and simulator-induced errors. To this end, we propose Calibrated Interactive RL. As shown in [Figure 1](https://arxiv.org/html/2605.26403#S3.F1)(b), our method operates in two distinct phases: (1) Simulator Calibration, where we train a goal-guided user simulator via Supervised Fine-Tuning (SFT) to mimic human interaction patterns; and (2) Interactive Policy Optimization, where the dialogue agent is trained within this calibrated environment using memory-efficient RL to maximize task success.

### 4.1 Phase I: Simulator Calibration via Supervised Fine-Tuning (SFT)

The primary bottleneck in standard interactive RL is the simulator-induced shift ($\delta_t$). Conventional prompt-based simulators frequently fail to exhibit the specific intent, cognitive blind spots, or "stubbornness" of real users. To minimize this sim-to-real gap, we formulate the simulator as a trainable agent and explicitly align it with human behavioral distributions using historical multi-turn data.

Goal-Guided Simulation Context. Unlike the dialogue policy, which must infer the user's intent dynamically, the user simulator functions as an environmental "Teacher" and thus requires access to the full ground truth. Let $\mathcal{D} = \{(x, y^*, h_{\text{real}})\}$ denote a dataset containing the initial instruction $x$, the ground-truth solution $y^*$, and the reference human history $h_{\text{real}}$. We condition the simulator $\pi^{\text{sim}}_{\phi}$ on an oracle context $I^{\text{sim}}_t = [x, y^*, h_t, a_t]$. This privileged information allows the simulator to consistently verify whether the agent's current action $a_t$ aligns with the target $y^*$ and to provide feedback or corrections consistent with the reference intent, without hallucinating diverging goals.

Calibration via Behavioral Cloning. To ensure the simulator accurately reflects human transition dynamics $\mathcal{T}_{\text{real}}$, we optimize the simulator parameters $\phi$ using Supervised Fine-Tuning (SFT) on the real user turns from $\mathcal{D}$. Specifically, we minimize the negative log-likelihood (NLL) of the reference human responses $u_t$ given the oracle context and the dialogue history up to turn $t$:

$$\mathcal{L}_{\text{Calibration}}(\phi) = -\mathbb{E}_{(x, y^*, h_{\text{real}}, a_{\text{real}}) \sim \mathcal{D}}\left[\sum_{t} \log \pi^{\text{sim}}_{\phi}(u_t \mid x, y^*, h_t, a_t)\right] \tag{4}$$

By conditioning on both the ground-truth solution and the agent's actual output, the simulator learns to provide grounded feedback, such as asking for clarification, expressing partial understanding, or persistently rejecting incorrect solutions.
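The calibration objective in Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' training code: the `Dialogue` container, the bracketed role tags used to serialize the oracle context, and the `uniform_log_prob` stand-in for a real language model are all hypothetical. What it shows is that one training example is built per user turn, and that the loss is taken only on the user turn $u_t$, never on the agent turn.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Dialogue:
    instruction: str               # x
    solution: str                  # y*
    turns: List[Tuple[str, str]]   # [(a_t, u_t), ...] taken from a real human log


def build_sft_examples(d: Dialogue) -> List[Tuple[str, str]]:
    """Return (oracle_context, target_user_turn) pairs, one per user turn."""
    examples = []
    history: List[str] = []  # turns that precede step t
    for agent_turn, user_turn in d.turns:
        # Oracle context I_t = [x, y*, h_t, a_t]; the tag format is an assumption.
        context = "\n".join(
            [f"[instruction] {d.instruction}", f"[solution] {d.solution}"]
            + history
            + [f"[agent] {agent_turn}"]
        )
        examples.append((context, user_turn))  # the loss target is u_t only
        history += [f"[agent] {agent_turn}", f"[user] {user_turn}"]
    return examples


def calibration_loss(data: List[Dialogue],
                     log_prob: Callable[[str, str], float]) -> float:
    """Sample estimate of Eq. (4): mean over dialogues of -sum_t log p(u_t | I_t)."""
    total = 0.0
    for d in data:
        total -= sum(log_prob(ctx, u) for ctx, u in build_sft_examples(d))
    return total / len(data)


def uniform_log_prob(context: str, target: str, vocab_size: int = 10) -> float:
    """Toy stand-in for a simulator: uniform over a tiny vocabulary per token."""
    return -len(target.split()) * math.log(vocab_size)


if __name__ == "__main__":
    log = Dialogue("Solve 6 * 7", "42",
                   [("Is it 36?", "no that is wrong"), ("Then 42.", "ok thanks")])
    print(calibration_loss([log], uniform_log_prob))
```

In practice `log_prob` would be the summed token log-likelihood under the simulator being fine-tuned; the toy version only makes the bookkeeping checkable.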

Mitigating Simulator Artifacts and Reward Hacking. The motivation for SFT calibration extends beyond mere behavioral realism; it also provides a critical defense against reward hacking Wang and others ([2025](https://arxiv.org/html/2605.26403#bib.bib39)). Uncalibrated simulators frequently exhibit "sycophancy" (blindly agreeing with flawed logic Sharma et al. ([2023](https://arxiv.org/html/2605.26403#bib.bib30))), which RL agents rapidly exploit by generating confident but incorrect responses to trick the simulator into emitting unearned success signals. By rigorously grounding the simulator in actual human data, we eliminate these exploitable artifacts, ensuring the policy must make genuine progress to achieve high rewards.

### 4.2 Phase II: Interactive Policy Optimization

With the calibrated simulator $\pi^{\text{sim}}_{\phi^{*}}$ frozen, we proceed to optimize the dialogue policy $\pi_{\theta}$. This phase closes the interaction loop to address the policy-induced shift ($\Delta_t$).

On-Policy Trajectory Generation. In contrast to Static Context RL, our agent generates full multi-turn trajectories by interacting dynamically with the aligned simulator. For an episode of length $H$, the joint trajectory $\tau = (s_0, a_0, u_0, \dots, a_{H-1}, u_{H-1})$ is generated via alternating execution, as summarized in Algorithm [1](https://arxiv.org/html/2605.26403#alg1). This ensures that the policy $\pi_{\theta}$ is trained on the exact state distribution $d^{\pi_{\theta}}_t$ it dynamically visits, allowing it to naturally develop robust error-recovery and long-horizon planning strategies that are fundamentally unlearnable from static, off-policy logs.
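The alternating execution described above can be sketched as a short loop. This is an illustration under stated assumptions rather than the paper's Algorithm 1: `Policy`, `Simulator`, `rollout`, and the toy callables are hypothetical names, histories are plain lists of strings, and the oracle context $(x, y^{*})$ that the simulator also conditions on is left out for brevity.

```python
from typing import Callable, List, Tuple

Policy = Callable[[List[str]], str]      # history -> agent action a_t
Simulator = Callable[[List[str]], str]   # history (incl. a_t) -> user turn u_t


def rollout(policy: Policy, simulator: Simulator,
            s0: str, horizon: int) -> List[Tuple[str, str]]:
    """Generate one on-policy trajectory by alternating agent and simulator."""
    history = [s0]
    turns = []
    for _ in range(horizon):
        a_t = policy(history)        # agent acts on its own, self-generated history
        history.append(a_t)
        u_t = simulator(history)     # frozen simulator replies to the agent's action
        history.append(u_t)
        turns.append((a_t, u_t))
    return turns


# Stand-in callables (hypothetical): real ones would wrap language models.
def toy_policy(history: List[str]) -> str:
    return f"agent-{len(history)}"


def toy_simulator(history: List[str]) -> str:
    return f"user-{len(history)}"


trajectory = rollout(toy_policy, toy_simulator, s0="task prompt", horizon=3)
```

The key point is that every history the policy sees contains its own earlier actions, which is what distinguishes this loop from training on fixed offline contexts.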

Sparse Task Rewards. The quality of the interaction is measured solely by outcome success. We utilize a sparse reward signal $R_{\text{task}}$ administered only at the final turn $H$. Let $\hat{y}$ be the final output produced by the agent (e.g., the refined document or final code):

$$R(\tau) = \begin{cases} \text{Score}(\hat{y}, y^{*}) & \text{at } t = H \\ 0 & \text{at } t < H \end{cases} \tag{5}$$

For document editing tasks, $\text{Score}(\cdot)$ is the BLEU metric; for math and coding tasks, it is a binary correctness indicator. This outcome-based supervision forces the agent to engage in meaningful information seeking. The policy $\pi_{\theta}$ is then optimized using Group Relative Policy Optimization (GRPO) Shao et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib33)) to maximize the expected return $\mathbb{E}_{\tau}[R(\tau)]$.
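As a rough sketch of how the sparse reward in Eq. (5) feeds a group-relative update, the snippet below builds the per-turn reward vector and standardizes outcome rewards within a group of rollouts for the same prompt. The function names are hypothetical, and the normalization (subtract the group mean, divide by the population standard deviation plus a small `eps`) is an assumption about a common GRPO formulation; the paper's exact variant is not specified in this section.

```python
from statistics import mean, pstdev
from typing import List


def sparse_return(final_score: float, horizon: int) -> List[float]:
    """Per-turn rewards following Eq. (5): zero until the final turn."""
    return [0.0] * (horizon - 1) + [final_score]


def group_relative_advantages(returns: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize returns within one group of rollouts for the same prompt."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in returns]


rewards = sparse_return(final_score=1.0, horizon=4)
# Four rollouts for one prompt with binary correctness outcomes.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same score, all advantages are zero, so such groups contribute no gradient signal.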

## 5 Experiments

We conduct extensive experiments to validate two core hypotheses: (1) Interactive RL significantly outperforms Static Context RL by mitigating the distribution shift between offline training contexts and on-policy generation; and (2) Simulator Alignment is critical for interactive training: optimizing the simulator to mimic human intent and natural interaction patterns leads to superior downstream policy performance.

### 5.1 Experimental Setup

#### Datasets and Benchmarks.

To rigorously evaluate the dynamic interaction capabilities of our framework, we employ two challenging multi-turn dialogue benchmarks inspired by recent collaborative agent evaluations Wu et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib2)). These datasets are explicitly designed to test long-horizon planning, ambiguity resolution, and error recovery:

- MediumDocEdit-Chat (Collaborative Editing): Derived from document editing corpora, this task requires the agent to help a user iteratively refine a document. Unlike single-turn instruction following, user intents in this dataset are frequently ambiguous or incomplete. The agent must proactively ask clarifying questions, propose partial edits, and adapt to shifting user feedback over multiple turns. The primary objective evaluation metric is BLEU, calculated between the final model-refined document and the ground-truth target text.
- MATH-Chat (Conversational Reasoning): Adapted from the MATH dataset, this benchmark evaluates the agent’s ability to act as an interactive tutor. Instead of directly outputting the final answer, the agent must guide a simulated student (who may make logical errors or express confusion) step-by-step toward the correct solution. This requires robust reasoning scaffolding and the ability to correct trajectory deviations dynamically. The task success is measured by the final answer Accuracy (Acc.).

#### Models and Baselines.

To ensure a fair and controlled comparison, we utilize Gemma-3-4B-IT Gemma Team et al. ([2024](https://arxiv.org/html/2605.26403#bib.bib37)) as the foundational backbone for both the policy agent and the user simulator across all experiments. We systematically compare our full Calibrated Interactive RL framework against established paradigms and internal ablations, seamlessly mirroring the theoretical sources of distribution shift (as summarized in Table [2](https://arxiv.org/html/2605.26403#S5.T2)):

- Gemma-3-4B-IT (Base Model): The original instruction-tuned model without any multi-turn RL fine-tuning, acting as our zero-shot baseline.
- CollabLLM (Static RL): The state-of-the-art offline multi-turn RL method Wu et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib2)). We faithfully reproduce this baseline using the official codebase, optimizing the Gemma-3 policy on static conversational logs using Multi-turn-aware Rewards. This represents the upper bound of the Static Context RL paradigm.
- Static Context: An ablation of our method where the policy is trained exclusively on fixed offline history distributions ($h \sim d^{\text{data}}$) without dynamic simulator rollouts.
- Ours (Naive Interactive): An ablation utilizing the full Interactive RL loop, but employing a basic, uncalibrated user simulator (prompting the base Qwen2.5-7B-Instruct Qwen Team ([2024](https://arxiv.org/html/2605.26403#bib.bib38)) model without Phase I SFT). This isolates the impact of closing the interaction loop from the simulator’s behavioral fidelity.
- Ours (Calibrated Interactive): Our complete proposed framework, featuring on-policy trajectory generation against the Phase I SFT-aligned user simulator.
- Oracle (Proxy Human): An evaluation upper bound representing the performance of the base policy when interacting with a privileged proxy human simulator that has direct access to the ground-truth targets.

#### Implementation Details.

In Phase I, the user simulator (Qwen2.5-7B-Instruct) is trained via SFT on 1,860 (MATH) and 16,028 (MediumDocEdit) multi-turn trajectories collected from an oracle Qwen3-235B Yang et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib41)) simulator. In Phase II, the policy (Gemma-3-4B-IT) is optimized via GRPO on 32 GPUs with a group size of $G = 32$ samples per prompt and a global batch size of 8,192. We use a constant learning rate of $5 \times 10^{-6}$ with Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$) and asymmetric PPO-style clipping Schulman et al. ([2017](https://arxiv.org/html/2605.26403#bib.bib32)) ($\epsilon \in [0.2, 0.28]$). Full hyperparameter tables are provided in Appendix [D](https://arxiv.org/html/2605.26403#A4).

#### Evaluation Metrics.

Following the evaluation protocol of CollabLLM, we assess performance along two primary dimensions. (1) Task Performance: Measured by BLEU for editing and Accuracy for math reasoning. (2) Efficiency: We report the average generation length (#Toks, in thousands of tokens), where lower values indicate more concise, natural conversational turns rather than monolithic document dumps. We additionally report the Average Score (arithmetic mean of BLEU and ACC) as a unified metric across domains. All experiments are repeated 3 times with different random seeds; we report mean $\pm$ standard deviation.

Table 2: Main Results on Multi-turn Collaboration Benchmarks. We evaluate the Gemma-3-4B-IT policy across two domains: MediumDocEdit (Editing) and MATH-Chat (Reasoning). Each experiment is repeated 3 times; we report mean ± std. The Ours (Calibrated Interactive) row is our full Calibrated Interactive RL framework.

| Method | Paradigm | MediumDocEdit-Chat BLEU ↑ | MediumDocEdit-Chat #Toks (k) ↓ | MATH-Chat ACC ↑ | MATH-Chat #Toks (k) ↓ | Average Score ↑ | Average #Toks (k) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | | | | | | | |
| Gemma-3-4B-IT | Base Model | 32.2 ± 1.5 | 2.47 ± 0.12 | 82.3 ± 0.6 | 1.76 ± 0.04 | 57.3 | 2.12 |
| CollabLLM† | Offline DPO | 33.1 ± 0.4 | 2.38 ± 0.07 | 82.3 ± 1.9 | 1.63 ± 0.09 | 57.7 | 2.01 |
| Ours (Gemma-3-4B-IT backbone) | | | | | | | |
| Static Context | Static RL | 33.8 ± 1.8 | 3.09 ± 0.01 | 85.0 ± 0.0 | 1.63 ± 0.07 | 59.4 | 2.36 |
| Ours (Naive Interactive) | Interactive RL | 26.1 ± 1.2 | 4.00 ± 0.07 | 89.3 ± 3.4 | 1.97 ± 0.05 | 57.7 | 2.98 |
| Ours (Calibrated Interactive) | Calibrated RL | 34.6 ± 1.0 | 2.27 ± 0.08 | 91.5 ± 1.3 | 1.86 ± 0.09 | 63.05 | 2.06 |
| Reference | | | | | | | |
| Oracle (Proxy Human) | – | 36.2 ± 1.4 | 1.83 ± 0.06 | 89.7 ± 1.6 | 1.27 ± 0.04 | 63.0 | 1.55 |

† CollabLLM Wu et al. ([2025](https://arxiv.org/html/2605.26403#bib.bib2)) reproduced with offline DPO on the same Gemma-3-4B-IT backbone.
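The aggregate columns of Table 2 follow from simple arithmetic, which the sketch below reproduces. The Average Score inputs are the reported means from Table 2; the per-seed accuracies passed to `summarize` are invented purely to show the aggregation, and the use of the sample standard deviation is an assumption, since the paper does not state which estimator it uses.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(runs), stdev(runs)


def average_score(bleu: float, acc: float) -> float:
    """Arithmetic mean of BLEU and accuracy, both on a 0-100 scale."""
    return (bleu + acc) / 2


# Reported means for Ours (Calibrated Interactive) in Table 2.
calibrated_avg = average_score(34.6, 91.5)
# Reported means for Static Context in Table 2.
static_avg = average_score(33.8, 85.0)

# Hypothetical per-seed accuracies, invented only to show the aggregation.
seed_mean, seed_std = summarize([90.0, 92.0, 91.0])
```

Because BLEU and accuracy are averaged with equal weight, a large drop on one benchmark can offset a gain on the other in the Average Score column.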

### 5.2 Experimental Results

Table [2](https://arxiv.org/html/2605.26403#S5.T2) summarizes the comprehensive results across both domains. Our proposed framework achieves the highest MATH-Chat accuracy (91.5%), substantially outperforming all baselines and even surpassing the Oracle reference (89.7%). In the following, we provide a detailed analysis of these empirical findings.

#### Interactive RL Beats Static Context RL.

Closing the interaction loop significantly improves task success: MATH-Chat accuracy rises from 85.0% (Static Context) to 89.3% (Naive Interactive), a gain of +4.3 points. This validates our theoretical analysis ([Section 3](https://arxiv.org/html/2605.26403#S3)): optimizing the policy on dynamically induced state distributions ($h \sim d^{\pi}$) rather than fixed offline logs equips the agent with robust error-recovery skills and mitigates compounding distribution shifts. Qualitatively, interactive agents successfully rectify logical missteps dynamically, whereas static agents often hallucinate when user feedback deviates from their offline training distribution. To provide deeper intuition into the behavioral improvements, Figure [2](https://arxiv.org/html/2605.26403#S5.F2) illustrates a qualitative analysis of successfully resolved cases on MATH-Chat. Our Calibrated Interactive RL agent effectively avoids degenerate "try-and-fail" loops, arriving at correct solutions in significantly fewer dialogue turns. We refer readers to Appendix [F](https://arxiv.org/html/2605.26403#A6) for comprehensive step-by-step case studies demonstrating the model’s enhanced structural understanding and proactive error-recovery strategies.

#### Calibrated Simulators are Essential for Interaction.

Merely closing the interaction loop is insufficient; aligning the simulator with human behavior yields further substantial gains. On MATH-Chat, calibrated training achieves 91.5% accuracy, a +2.2 point improvement over the Naive Interactive baseline (89.3%) that notably surpasses even the Oracle (89.7%). This suggests our aligned simulator provides a highly effective and complementary training signal: by calibrating the simulator to maintain realistic persistence and avoid sycophantic behavior, the policy learns more robust reasoning strategies that generalize even beyond what a single proxy human would elicit.

#### Comparison with CollabLLM.

The state-of-the-art static baseline (CollabLLM, offline DPO) performs comparably to the vanilla base model on MATH-Chat (82.3% vs. 82.3%). This highlights a fundamental limitation of offline RL in multi-turn settings: without on-policy experience, the agent never observes how its responses shape subsequent dialogue turns. The superiority of all our interactive methods confirms that *on-policy experience generation* is a critical prerequisite for robust multi-turn optimization.

![Refer to caption](https://arxiv.org/html/2605.26403v1/x2.png)

Figure 2: Qualitative analysis of improved cases on MATH-Chat. (a) Distribution of improvement patterns across 29 cases where the baseline fails but our method succeeds. (b) Dialogue turns required to resolve each case; our model achieves correct answers in up to 67% fewer turns. (c) Overall accuracy improvement: 82.3% → 91.5%. Detailed case studies are provided in Appendix [F](https://arxiv.org/html/2605.26403#A6). Note: While Table [2](https://arxiv.org/html/2605.26403#S5.T2) reports the mean and standard deviation across three independent runs, the qualitative patterns and case studies presented here are derived from a single representative trial to maintain trajectory consistency.

## 6 Conclusion

In this work, we established the theoretical foundations for the limitations of Static Context RL, formally proving that training on fixed offline logs induces a context distribution shift where errors compound quadratically over turns. To address this fundamental bottleneck, we proposed Calibrated Interactive RL, a unified framework that couples interactive training with rigorous simulator alignment. We demonstrated that simply closing the interaction loop is insufficient; the success of interactive learning critically depends on minimizing the sim-to-real gap. By treating the user simulator as a trainable agent and optimizing it to maximize a discriminator-derived realism reward, we effectively aligned the simulation environment with human interaction patterns. Our extensive experiments confirm that our framework not only mitigates policy-induced distribution shifts but also fosters robust information-seeking behaviors, significantly outperforming baselines relying on static contexts or uncalibrated simulators. This work highlights the necessity of high-fidelity, aligned simulation as a prerequisite for the next generation of interactive dialogue agents.

## References

- G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, et al. (2024) Mt-bench-101: a fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv preprint arXiv:2402.14762.
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022a) Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. (2022b) Constitutional ai: harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
- S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28.
- T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P. Oudeyer (2023) Grounding large language models in interactive environments with online reinforcement learning. In International Conference on Machine Learning, pp. 3676–3713.
- Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu (2024) Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto (2023) Alpacafarm: a simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems 36, pp. 30039–30069.
- K. Gandhi, A. Bhatia, and N. D. Goodman (2026) Learning to simulate human dialogue. arXiv preprint arXiv:2601.04436.
- Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. (2024) Gemma: open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems 34, pp. 21993–22005.
- J. Hong, S. Levine, and A. Dragan (2023) Zero-shot goal-directed dialogue via rl on imagined conversations. arXiv preprint arXiv:2311.05584.
- Z. Hu, Y. Feng, A. T. Luu, B. Hooi, and A. Lipani (2023) Unlocking the potential of user feedback: leveraging large language model as user simulators to enhance dialogue system. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 3953–3957.
- P. Laban, H. Hayashi, Y. Zhou, and J. Neville (2025) Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120.
- H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V. Carbune, A. Rastogi, et al. (2023) Rlaif vs. rlhf: scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.
- S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Y. Li, X. Shen, X. Yao, X. Ding, Y. Miao, R. Krishnan, and R. Padman (2025) Beyond single-turn: a survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, pp. 27730–27744.
- J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22.
- Qwen Team (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp. 53728–53741.
- M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
- S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- M. Shanahan, K. McDonell, and L. Reynolds (2023) Role play with large language models. Nature 623(7987), pp. 493–498.
- L. Shani, A. Rosenberg, A. Cassel, O. Lang, D. Calandriello, A. Zipori, H. Noga, O. Keller, B. Piot, I. Szpektor, et al. (2024) Multi-turn reinforcement learning with preference human feedback. Advances in Neural Information Processing Systems 37, pp. 118953–118993.
- Y. Shao, L. Li, J. Dai, and X. Qiu (2023) Character-llm: a trainable agent for role-playing. arXiv preprint arXiv:2310.10158.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, et al. (2023) Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.
- W. Shi, K. Qian, X. Wang, and Z. Yu (2019) How to build user simulators to train rl-based dialog systems. arXiv preprint arXiv:1909.01388.
- W. Shi, M. Yuan, J. Wu, Q. Wang, and F. Feng (2024) Direct multi-turn preference optimization for language agents. arXiv preprint arXiv:2406.14868.
- D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550(7676), pp. 354–359.
- C. Snell, I. Kostrikov, Y. Su, M. Yang, and S. Levine (2022) Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871.
- B. Tseng, Y. Dai, F. Kreyssig, and B. Byrne (2021) Transferable dialogue systems and user simulators. arXiv preprint arXiv:2107.11904.
- S. Verma, J. Fu, S. Yang, and S. Levine (2022) Chai: a chatbot ai for task-oriented dialogue with offline reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4471–4491.
- J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, X. Geng, et al. (2023) On the robustness of chatgpt: an adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095.
- X. Wang et al. (2025) Reward hacking in the era of large models: mechanisms, emergent misalignment, challenges. arXiv preprint arXiv:2604.13602.
- S. Wu, M. Galley, B. Peng, H. Cheng, G. Li, Y. Dou, W. Cai, J. Zou, J. Leskovec, and J. Gao (2025) Collabllm: from passive responders to active collaborators. arXiv preprint arXiv:2502.00640.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024) Archer: training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446.

## Appendix A Theoretical Proofs and Derivations

In this appendix, we provide full proofs for the theoretical claims presented in [Section 3](https://arxiv.org/html/2605.26403#S3). We first restate the necessary notation and assumptions, then establish auxiliary lemmas regarding distributional divergence, and finally derive the bounds for Policy Mismatch ([Theorem 3.1](https://arxiv.org/html/2605.26403#S3.Thmtheorem1)) and the Sim-to-Real Gap ([Theorem 3.2](https://arxiv.org/html/2605.26403#S3.Thmtheorem2)).

### A.1 Notation and Preliminaries

We consider a finite-horizon Markov Decision Process (MDP) defined over histories to account for the non-Markovian nature of dialogue when viewed from the perspective of state-only transitions.

- Horizon: $H \in \mathbb{N}$. Time steps are indexed by $t = 0, \dots, H-1$.
- History: Let $h_t = (s_0, a_0, \dots, s_t) \in \mathcal{H}$ denote the interaction history up to time $t$.
- Distributions: Let $d^{\pi}_t(h)$ denote the marginal distribution over histories at time $t$ induced by policy $\pi$ and the environment dynamics.
- Norms: Unless otherwise specified, $\|\cdot\|_1$ denotes the $\ell_1$-norm (total variation distance) for probability distributions. For a function $f$, $\|f\|_{\infty} = \sup_{x} |f(x)|$.
- Rewards: We assume a bounded reward function $r(h,a)$ such that $|r(h,a)| \leq R_{\max}$ for all $h, a$.

### A.2 Auxiliary Lemmas

We begin by establishing a decomposition lemma that bounds the divergence of a joint distribution in terms of the marginal distribution divergence and the conditional distribution divergence.

###### Lemma A.1 (Joint Distribution Decomposition).

Consider two joint distributions $p(x,y) = p(x)\,p(y|x)$ and $q(x,y) = q(x)\,q(y|x)$ over discrete domains $\mathcal{X}, \mathcal{Y}$. Then:

$$\|p - q\|_1 \leq \|p(x) - q(x)\|_1 + \mathbb{E}_{x \sim q(x)}\left[\|p(y|x) - q(y|x)\|_1\right]. \tag{6}$$

###### Proof.

By definition of the $\ell_1$-norm:

$$\begin{aligned}
\|p - q\|_1 &= \sum_{x,y} |p(x)\,p(y|x) - q(x)\,q(y|x)| \\
&= \sum_{x,y} |p(x)\,p(y|x) - q(x)\,p(y|x) + q(x)\,p(y|x) - q(x)\,q(y|x)| \\
&\leq \sum_{x,y} |p(x) - q(x)|\,p(y|x) + \sum_{x,y} q(x)\,|p(y|x) - q(y|x)| \\
&= \sum_{x} |p(x) - q(x)| \underbrace{\sum_{y} p(y|x)}_{=1} + \sum_{x} q(x) \sum_{y} |p(y|x) - q(y|x)| \\
&= \|p(x) - q(x)\|_1 + \mathbb{E}_{x \sim q(x)}\left[\|p(y|x) - q(y|x)\|_1\right].
\end{aligned}$$

∎
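As an informal sanity check on Lemma A.1, the sketch below samples random discrete joint distributions and records the largest observed value of the left-hand side minus the right-hand side of Eq. (6). The helper names, domain sizes, and trial count are all illustrative choices rather than anything from the paper, and a numerical check of this kind supports the inequality without replacing the proof.

```python
import random


def random_dist(n, rng):
    """Draw a strictly positive probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [v / total for v in w]


def l1(p, q):
    """l1 distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def check_decomposition(nx=4, ny=3, trials=200, seed=0):
    """Return the largest observed (LHS - RHS) of Eq. (6); expected <= 0."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        px, qx = random_dist(nx, rng), random_dist(nx, rng)
        p_cond = [random_dist(ny, rng) for _ in range(nx)]
        q_cond = [random_dist(ny, rng) for _ in range(nx)]
        # LHS: l1 distance between the two joint distributions.
        lhs = sum(abs(px[x] * p_cond[x][y] - qx[x] * q_cond[x][y])
                  for x in range(nx) for y in range(ny))
        # RHS: marginal divergence plus expected conditional divergence under q.
        rhs = l1(px, qx) + sum(qx[x] * l1(p_cond[x], q_cond[x])
                               for x in range(nx))
        worst = max(worst, lhs - rhs)
    return worst


worst_gap = check_decomposition()
```

A non-positive `worst_gap` (up to floating-point rounding) is what the lemma predicts for every sampled pair.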

###### Lemma A.2 (Error Propagation Dynamics).

Let $\mathcal{T}$ be the environment transition operator mapping the joint distribution at time $t$ to the marginal history distribution at $t+1$. Let $\Delta_t = \|d^{\pi}_t - d^{\pi'}_t\|_1$ and $\varepsilon_t = \mathbb{E}_{h \sim d^{\pi'}_t}\|\pi(\cdot|h) - \pi'(\cdot|h)\|_1$. Assuming the start state distribution is fixed ($\Delta_0 = 0$), then for all $t \geq 0$:

$$\Delta_{t+1} \leq \Delta_t + \varepsilon_t \quad \implies \quad \Delta_t \leq \sum_{i=0}^{t-1} \varepsilon_i. \tag{7}$$

###### Proof.

Let $q^{\pi}_t(h,a) = d^{\pi}_t(h)\,\pi(a|h)$ be the joint distribution at time $t$. The next history $h_{t+1}$ is fully determined by $h_t$, $a_t$, and the outcome of the stochastic environment transition $s_{t+1} \sim P(\cdot|s_t, a_t)$. The operator mapping $q_t \to d_{t+1}$ is therefore a linear map represented by a stochastic matrix, making it non-expansive in the $\ell_1$-norm. Thus:

$$\Delta_{t+1} = \|d^{\pi}_{t+1} - d^{\pi'}_{t+1}\|_1 \leq \|q^{\pi}_t - q^{\pi'}_t\|_1.$$

Applying Lemma [A.1](https://arxiv.org/html/2605.26403#A1.Thmtheorem1) to the RHS:

$$\|q^{\pi}_t - q^{\pi'}_t\|_1 \leq \|d^{\pi}_t - d^{\pi'}_t\|_1 + \mathbb{E}_{h \sim d^{\pi'}_t}\|\pi(\cdot|h) - \pi'(\cdot|h)\|_1 = \Delta_t + \varepsilon_t.$$

Combining these gives the recurrence $\Delta_{t+1} \leq \Delta_t + \varepsilon_t$. Telescoping the sum from $i = 0$ to $t-1$ with $\Delta_0 = 0$ yields the result. ∎
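The recurrence in Lemma A.2 can be checked numerically in a simplified setting. The sketch below assumes a small tabular Markov special case in which both policies depend only on the current state, so state distributions stand in for history distributions; this simplification, the random MDP, and all names are assumptions made for illustration. It returns pairs $(\Delta_{t+1}, \Delta_t + \varepsilon_t)$ so the inequality can be inspected turn by turn.

```python
import random


def normalize(w):
    total = sum(w)
    return [v / total for v in w]


def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def check_propagation(n_states=3, n_actions=2, horizon=6, seed=0):
    """Return [(Delta_{t+1}, Delta_t + eps_t)] for a random tabular MDP."""
    rng = random.Random(seed)

    def rand(n):
        return normalize([rng.random() + 1e-3 for _ in range(n)])

    # P[s][a] is the next-state distribution; pi and pi2 are the two policies.
    P = [[rand(n_states) for _ in range(n_actions)] for _ in range(n_states)]
    pi = [rand(n_actions) for _ in range(n_states)]
    pi2 = [rand(n_actions) for _ in range(n_states)]
    d = d2 = rand(n_states)  # shared start distribution, so Delta_0 = 0

    def step(dist, policy):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    nxt[s2] += dist[s] * policy[s][a] * P[s][a][s2]
        return nxt

    pairs = []
    for _ in range(horizon):
        delta_t = l1(d, d2)
        # eps_t is an expectation under the second policy's distribution.
        eps_t = sum(d2[s] * l1(pi[s], pi2[s]) for s in range(n_states))
        d, d2 = step(d, pi), step(d2, pi2)
        pairs.append((l1(d, d2), delta_t + eps_t))
    return pairs


pairs = check_propagation()
```

Each pair should satisfy `lhs <= rhs`; the bound is typically loose because the transition operator contracts the $\ell_1$ distance in practice.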

### A.3 Proof of Theorem [3.1](https://arxiv.org/html/2605.26403#S3.Thmtheorem1) (Performance Difference via Policy Shift)

Theorem [3.1](https://arxiv.org/html/2605.26403#S3.Thmtheorem1). For any two policies $\pi, \pi^{\prime}$ over finite horizon $H$, the difference in expected return is bounded by:

$$|J(\pi) - J(\pi^{\prime})| \leq R_{\max} \sum_{i=0}^{H-1} (H-i)\,\varepsilon_{i}.$$

###### Proof.

The difference in expected return can be written as the sum of differences in expected rewards at each time step. Let $q^{\pi}_{t}$ denote the joint distribution of $(h,a)$ at time $t$ under policy $\pi$.

$$\begin{aligned}
|J(\pi) - J(\pi^{\prime})| &= \left|\sum_{t=0}^{H-1}\left(\mathbb{E}_{(h,a)\sim q^{\pi}_{t}}[r(h,a)] - \mathbb{E}_{(h,a)\sim q^{\pi^{\prime}}_{t}}[r(h,a)]\right)\right| \\
&\leq \sum_{t=0}^{H-1}\left|\sum_{h,a}\left(q^{\pi}_{t}(h,a) - q^{\pi^{\prime}}_{t}(h,a)\right) r(h,a)\right|.
\end{aligned}$$

Using Hölder’s inequality (bounding the expectation difference by the $\ell_{1}$ distance and max reward):

$$\left|\sum_{h,a}\left(q^{\pi}_{t}(h,a) - q^{\pi^{\prime}}_{t}(h,a)\right) r(h,a)\right| \leq R_{\max}\,\|q^{\pi}_{t} - q^{\pi^{\prime}}_{t}\|_{1}.$$

From the proof of Lemma [A.2](https://arxiv.org/html/2605.26403#A1.Thmtheorem2), we know $\|q^{\pi}_{t} - q^{\pi^{\prime}}_{t}\|_{1} \leq \Delta_{t} + \varepsilon_{t}$. Thus:

$$|J(\pi) - J(\pi^{\prime})| \leq R_{\max}\sum_{t=0}^{H-1}(\Delta_{t} + \varepsilon_{t}).$$

Substituting $\Delta_{t} \leq \sum_{j=0}^{t-1}\varepsilon_{j}$:

$$\sum_{t=0}^{H-1}(\Delta_{t} + \varepsilon_{t}) \leq \sum_{t=0}^{H-1}\left(\sum_{j=0}^{t-1}\varepsilon_{j} + \varepsilon_{t}\right) = \sum_{t=0}^{H-1}\sum_{j=0}^{t}\varepsilon_{j}.$$

By changing the order of summation, the term $\varepsilon_{i}$ appears once for every $t \geq i$, that is, $H-i$ times:

$$\sum_{t=0}^{H-1}\sum_{j=0}^{t}\varepsilon_{j} = \sum_{i=0}^{H-1}(H-i)\,\varepsilon_{i}.$$

Multiplying by $R_{\max}$ concludes the proof. ∎
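The final summation step can be checked numerically. The sketch below is an illustration only: the function names and the sample values of $\varepsilon_{i}$ are assumptions, not taken from the paper. It compares the double sum with the weighted form, and evaluates the bound for a constant per-step shift, where it reduces to $R_{\max}\,\varepsilon\,H(H+1)/2$ and therefore grows quadratically in $H$.

```python
from fractions import Fraction


def double_sum(eps):
    """Sum over t of (eps_0 + ... + eps_t), the form obtained after substitution."""
    return sum(sum(eps[: t + 1]) for t in range(len(eps)))


def weighted_sum(eps):
    """Form after swapping the summation order: sum_i (H - i) * eps_i."""
    horizon = len(eps)
    return sum((horizon - i) * e for i, e in enumerate(eps))


def constant_shift_bound(horizon, eps, r_max=1):
    """R_max * sum_i (H - i) * eps for a constant per-step shift eps."""
    return r_max * eps * horizon * (horizon + 1) / 2


# Illustrative per-step shifts (assumed values; exact rationals avoid rounding).
eps = [Fraction(1, 10), Fraction(1, 20), Fraction(3, 10), Fraction(0), Fraction(1, 5)]
assert double_sum(eps) == weighted_sum(eps)

# With a constant shift, the bound grows quadratically in the horizon.
for horizon in (1, 7, 14):
    print(horizon, constant_shift_bound(horizon, Fraction(1, 100)))
```

Exact rationals are used so that the identity can be asserted with `==` rather than a floating-point tolerance.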

### A.4 Proof of Theorem [3.2](https://arxiv.org/html/2605.26403#S3.Thmtheorem2) (Sim-to-Real Gap)

Theorem [3.2](https://arxiv.org/html/2605.26403#S3.Thmtheorem2). For any fixed policy $\pi$, let $\delta_{t} = \mathbb{E}_{(h,a)\sim q^{\text{real}}_{t}}\|\mathcal{T}_{\text{real}}(\cdot|h,a) - \mathcal{T}_{\text{sim}}(\cdot|h,a)\|_{1}$. Then:

$$|J_{\text{real}}(\pi) - J_{\text{sim}}(\pi)| \leq R_{\max}\sum_{i=0}^{H-1}(H-i)\,\delta_{i}.$$

###### Proof.

Let $d^{\text{real}}_{t}$ and $d^{\text{sim}}_{t}$ denote the history distributions induced by $\pi$ in the real and simulated environments, respectively. Let $\Delta^{\text{env}}_{t} = \|d^{\text{real}}_{t} - d^{\text{sim}}_{t}\|_{1}$.

Since the policy $\pi$ is fixed in both environments, the joint distributions satisfy $q^{\text{real}}_{t}(h,a) = d^{\text{real}}_{t}(h)\pi(a|h)$ and $q^{\text{sim}}_{t}(h,a) = d^{\text{sim}}_{t}(h)\pi(a|h)$, so $\|q^{\text{real}}_{t} - q^{\text{sim}}_{t}\|_{1} = \Delta^{\text{env}}_{t}$.

The value difference is bounded by the cumulative divergence in joint distributions plus the local transition errors:

$$|J_{\text{real}}(\pi) - J_{\text{sim}}(\pi)| \leq R_{\max}\sum_{t=0}^{H-1}(\Delta^{\text{env}}_{t} + \delta_{t}).$$

This follows because at each step $t$, the reward difference depends on the state at $t+1$, which is influenced by both the history mismatch $\Delta^{\text{env}}_{t}$ and the current transition error $\delta_{t}$.

We now analyze the recurrence of $\Delta^{\text{env}}_{t}$. The history distribution at $t+1$ is obtained by applying the respective transition operators.

$$\begin{aligned}
\Delta^{\text{env}}_{t+1} &= \|\mathcal{T}_{\text{real}}(q^{\text{real}}_{t}) - \mathcal{T}_{\text{sim}}(q^{\text{sim}}_{t})\|_{1} \\
&= \|\mathcal{T}_{\text{real}}(q^{\text{real}}_{t}) - \mathcal{T}_{\text{sim}}(q^{\text{real}}_{t}) + \mathcal{T}_{\text{sim}}(q^{\text{real}}_{t}) - \mathcal{T}_{\text{sim}}(q^{\text{sim}}_{t})\|_{1} \\
&\leq \underbrace{\|\mathcal{T}_{\text{real}}(q^{\text{real}}_{t}) - \mathcal{T}_{\text{sim}}(q^{\text{real}}_{t})\|_{1}}_{(A)} + \underbrace{\|\mathcal{T}_{\text{sim}}(q^{\text{real}}_{t}) - \mathcal{T}_{\text{sim}}(q^{\text{sim}}_{t})\|_{1}}_{(B)}.
\end{aligned}$$

Term (A) - Local Simulator Error: This measures the error of the simulator on the real distribution.

$$\begin{aligned}
(A) &= \sum_{h^{\prime},h,a}\left|P_{\text{real}}(h^{\prime}|h,a)\,q^{\text{real}}_{t}(h,a) - P_{\text{sim}}(h^{\prime}|h,a)\,q^{\text{real}}_{t}(h,a)\right| \\
&= \sum_{h,a} q^{\text{real}}_{t}(h,a)\,\|P_{\text{real}}(\cdot|h,a) - P_{\text{sim}}(\cdot|h,a)\|_{1} \\
&= \mathbb{E}_{(h,a)\sim q^{\text{real}}_{t}}\left[\|\mathcal{T}_{\text{real}}(\cdot|h,a) - \mathcal{T}_{\text{sim}}(\cdot|h,a)\|_{1}\right] = \delta_{t}.
\end{aligned}$$

Term (B) - Error Propagation: Since $\mathcal{T}_{\text{sim}}$ is a stochastic map, it is non-expansive.

$$\begin{aligned}
(B) &\leq \|q^{\text{real}}_{t} - q^{\text{sim}}_{t}\|_{1} \\
&= \sum_{h,a}\left|d^{\text{real}}_{t}(h)\pi(a|h) - d^{\text{sim}}_{t}(h)\pi(a|h)\right| \\
&= \sum_{h}\left|d^{\text{real}}_{t}(h) - d^{\text{sim}}_{t}(h)\right|\underbrace{\sum_{a}\pi(a|h)}_{1} = \Delta^{\text{env}}_{t}.
\end{aligned}$$

Thus, we have the recurrence $\Delta^{\text{env}}_{t+1} \leq \Delta^{\text{env}}_{t} + \delta_{t}$. Assuming $\Delta^{\text{env}}_{0} = 0$, this unfolds to $\Delta^{\text{env}}_{t} \leq \sum_{i=0}^{t-1}\delta_{i}$. Substituting this back:

$$\sum_{t=0}^{H-1}(\Delta^{\text{env}}_{t} + \delta_{t}) \leq \sum_{t=0}^{H-1}\left(\sum_{i=0}^{t-1}\delta_{i} + \delta_{t}\right) = \sum_{t=0}^{H-1}\sum_{j=0}^{t}\delta_{j} = \sum_{i=0}^{H-1}(H-i)\,\delta_{i}.$$

Multiplying by $R_{\max}$ concludes the proof. ∎
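The recurrence $\Delta^{\text{env}}_{t+1} \leq \Delta^{\text{env}}_{t} + \delta_{t}$ can be illustrated on a toy problem. The sketch below makes a strong simplifying assumption: histories are replaced by a small finite state space, with randomly drawn "real" and "sim" transition kernels and a random fixed policy. All names and sizes are hypothetical and serve only to exercise the inequality, not to reproduce the paper's dialogue setting.

```python
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def random_kernel(rng, n_states, n_actions):
    """Random kernel: kernel[s][a] is a distribution over next states."""
    return [[normalize([rng.random() for _ in range(n_states)])
             for _ in range(n_actions)] for _ in range(n_states)]


def l1(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))


def step(d, policy, kernel):
    """Push a state distribution one step forward under a policy and kernel."""
    nxt = [0.0] * len(d)
    for s, ds in enumerate(d):
        for a, pa in enumerate(policy[s]):
            for s2, p in enumerate(kernel[s][a]):
                nxt[s2] += ds * pa * p
    return nxt


def check_recurrence(horizon=7, n_states=5, n_actions=3, seed=0):
    rng = random.Random(seed)
    policy = [normalize([rng.random() for _ in range(n_actions)])
              for _ in range(n_states)]
    p_real = random_kernel(rng, n_states, n_actions)
    p_sim = random_kernel(rng, n_states, n_actions)
    # Same initial distribution in both environments, so Delta_0 = 0.
    d_real = d_sim = normalize([rng.random() for _ in range(n_states)])
    gaps, deltas = [], []
    for _ in range(horizon):
        gaps.append(l1(d_real, d_sim))  # Delta_t
        # delta_t: expected L1 kernel error under the real (state, action) law.
        deltas.append(sum(d_real[s] * policy[s][a] * l1(p_real[s][a], p_sim[s][a])
                          for s in range(n_states) for a in range(n_actions)))
        d_real, d_sim = step(d_real, policy, p_real), step(d_sim, policy, p_sim)
    return gaps, deltas


gaps, deltas = check_recurrence()
for t in range(len(gaps) - 1):
    assert gaps[t + 1] <= gaps[t] + deltas[t] + 1e-12
```

The small tolerance only absorbs floating-point rounding; the inequality itself holds exactly for any stochastic kernels.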

## Appendix B Experimental Details

In this section, we provide a comprehensive description of the datasets, model configurations, training hyperparameters, and computational infrastructure used in our experiments\. Our implementation builds on the SLIME distributed RL framework and the CollabLLM evaluation pipeline\.

### B.1 Datasets and Benchmarks

We evaluate our proposed framework on two distinct multi\-turn dialogue domains that require long\-horizon planning and precise information seeking\.

#### MATH\-Chat\.

This benchmark evaluates the reasoning and tutoring capability of the agent in a conversational math problem\-solving scenario\.

- Data Source: We adapt problems from the MATH dataset Hendrycks et al. [[2021](https://arxiv.org/html/2605.26403#bib.bib36)], specifically selecting Hard-level problems that require multi-step reasoning.
- Task Setup: For each math problem, the ground-truth solution serves as $y^{*}$. The user simulator plays the role of a student who may express confusion, make logical errors, or ask for clarification, requiring the agent to guide them step-by-step rather than directly revealing the answer.
- Statistics: The RL training set comprises $\sim$3,000 problems (stored in Parquet format). Evaluation is conducted on a held-out set of 200 problems, with results averaged over 3 independent runs.
- Reward: Binary accuracy, where an LLM judge (Qwen3-235B) determines whether the agent’s final answer matches the ground truth ($R \in \{0,1\}$).

#### MediumDocEdit\-Chat\.

This dataset focuses on collaborative document editing, requiring the agent to iteratively refine a document based on user feedback\.

- Data Source: Derived from article writing and editing corpora. Each sample consists of an initial user request $x$ and a ground-truth refined document $y^{*}$.
- Task Setup: The user provides ambiguous or incomplete editing instructions, requiring the agent to proactively ask clarifying questions and propose partial edits across multiple turns.
- Statistics: The RL training set contains $\sim$38,118 editing prompts (Parquet format). Evaluation uses a held-out set of 200 samples, averaged over 3 runs.
- Reward: Sentence-level BLEU score between the agent’s final produced document and the ground-truth reference $y^{*}$ ($R \in [0,1]$).

### B.2 Model Architectures

#### Policy Agent ($\pi_{\theta}$).

We use Gemma-3-4B-IT Gemma Team et al. [[2024](https://arxiv.org/html/2605.26403#bib.bib37)] as the backbone for the dialogue policy across all experiments. This model is chosen for its strong instruction-following capabilities at a compact 4B parameter scale, which enables efficient on-policy rollout generation during interactive RL training. The model is loaded in bfloat16 precision with Flash Attention 2 enabled.

#### User Simulator ($\pi_{\phi}^{\text{sim}}$).

We use Qwen2.5-7B-Instruct Qwen Team [[2024](https://arxiv.org/html/2605.26403#bib.bib38)] as the simulator backbone. The larger 7B scale (relative to the 4B policy) ensures the simulator has sufficient capacity to model diverse and realistic user behaviors. After Phase I SFT calibration, the simulator is frozen for Phase II training.

#### Evaluation Simulator & Judge.

During evaluation, we employ Qwen3-235B-A22B-Instruct (FP8 quantized) as both the user simulator and the accuracy/quality judge. This ensures evaluation is independent of the training-time simulator, providing an unbiased assessment of the policy’s generalization.

### B.3 Training Phase I: Simulator Calibration (SFT)

In the first phase, we calibrate the user simulator via Supervised Fine\-Tuning \(SFT\) on multi\-turn interaction traces collected from the Qwen3\-235B oracle simulator interacting with the base policy model\. This is a behavioral cloning approach—the simulator learns to replicate realistic human\-like interaction patterns\.

#### Data Collection\.

We deploy the Qwen3\-235B model as an oracle user simulator \(conditioned on ground\-truth goals\) and collect multi\-turn trajectories by having it interact with the Gemma\-3\-4B\-IT base model\. This produces:

- MATH-Chat: 1,860 multi-turn trajectories (max sequence length: 4,096 tokens).
- MediumDocEdit-Chat: 16,028 multi-turn trajectories (max sequence length: 6,144 tokens).

#### Training Setup\.

We fine-tune Qwen2.5-7B-Instruct using standard causal language modeling loss on the user turns only (masking assistant turns). Training uses FSDP on 8$\times$ NVIDIA A800-80GB GPUs. Hyperparameters are detailed in Appendix [D](https://arxiv.org/html/2605.26403#A4), Table [3](https://arxiv.org/html/2605.26403#A4.T3).
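Training on user turns only can be sketched as a label mask over the token sequence. The example below is a minimal illustration, assuming a dialogue that is already tokenized per turn; the helper name `build_labels` and the toy token ids are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss. Whether the paper's pipeline uses exactly this convention is not stated.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(turns):
    """Build (input_ids, labels) for simulator SFT.

    turns: list of (role, token_ids) pairs in dialogue order.
    User-turn tokens keep their ids as labels; assistant-turn tokens are
    masked so the simulator is only trained to produce user messages.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "user":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy dialogue with made-up token ids.
dialogue = [("user", [11, 12]), ("assistant", [21, 22, 23]), ("user", [31])]
input_ids, labels = build_labels(dialogue)
print(input_ids)
print(labels)
```

The one-position shift between inputs and targets is left to the training framework, as is common for causal language models.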

### B.4 Training Phase II: Interactive Policy Optimization (GRPO)

In the second phase, the SFT-calibrated simulator is frozen, and we optimize the policy agent $\pi_{\theta}$ via Group Relative Policy Optimization (GRPO) Shao et al. [[2024](https://arxiv.org/html/2605.26403#bib.bib33)]. GRPO eliminates the need for a separate value network by normalizing rewards within a group of sampled trajectories, making it memory-efficient for long-context multi-turn training.

#### Interaction Protocol\.

For each training prompt $x$, the agent generates $G=32$ independent multi-turn trajectories by interacting with the aligned simulator. Each episode runs for up to $H=7$ dialogue rounds (14 total turns). The simulator generates the first user message, and the episode terminates upon the simulator emitting a [[TERMINATE CHAT]] signal or reaching the maximum turn count.
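The protocol above can be sketched as a simple loop. This is an illustration under stated assumptions, not the SLIME implementation: `simulator` and `policy` are hypothetical callables standing in for the two language models, and the history is a plain list of `(role, text)` pairs.

```python
TERMINATE = "[[TERMINATE CHAT]]"
MAX_ROUNDS = 7  # H = 7 rounds, i.e. at most 14 turns


def rollout(prompt, simulator, policy, max_rounds=MAX_ROUNDS):
    """Run one episode; the simulator always speaks first in each round."""
    history = []
    for _ in range(max_rounds):
        user_msg = simulator(prompt, history)
        history.append(("user", user_msg))
        if TERMINATE in user_msg:
            break  # the simulator ended the conversation
        history.append(("assistant", policy(history)))
    return history


def sample_group(prompt, simulator, policy, group_size=32):
    """Collect G independent trajectories for one training prompt."""
    return [rollout(prompt, simulator, policy) for _ in range(group_size)]


# Stub models for demonstration only.
def stub_simulator(prompt, history):
    return TERMINATE if len(history) >= 4 else "Can you explain that step?"


def stub_policy(history):
    return "Sure, here is the next step."


episode = rollout("toy problem", stub_simulator, stub_policy)
print(len(episode), episode[-1])
```

With real models, each call would be a sampled generation, so the trajectories in a group differ even though they share the same prompt.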

#### Reward Function\.

We use sparse outcome rewards $R(\tau)$ administered only at episode completion:

- MATH-Chat: Binary accuracy (0/1) determined by an LLM judge comparing the final answer to ground truth.
- MediumDocEdit-Chat: BLEU score between the final document and the reference.

#### Training Setup\.

We train with Ray-based distributed FSDP on 32 GPUs (4 nodes $\times$ 8 A800-80GB) with SGLang for efficient rollout generation. The policy and rollout engine are co-located on the same GPUs (\-\-colocate mode). Full hyperparameters are in Appendix [D](https://arxiv.org/html/2605.26403#A4), Table [4](https://arxiv.org/html/2605.26403#A4.T4).

### B.5 Baselines Implementation

#### Gemma\-3\-4B\-IT \(Base Model\)\.

The original instruction\-tuned model without any multi\-turn RL fine\-tuning, serving as the zero\-shot baseline\. We evaluate it with the same proactive system prompt used during RL training\.

#### CollabLLM \(Static RL\)\.

We reproduce CollabLLM Wu et al. [[2025](https://arxiv.org/html/2605.26403#bib.bib2)] using the official codebase. CollabLLM represents the state-of-the-art in Static Context RL: it trains the policy on fixed offline conversation logs using Multi-turn-aware Rewards (MR). To ensure a fair comparison, we adapt CollabLLM to use Gemma-3-4B-IT as the backbone (matching our method), rather than the models used in their original paper. The reward model and training procedure follow their official implementation.

#### Ours \(Static Context\)\.

An ablation of our framework where the policy is trained via GRPO on static offline trajectories without dynamic simulator interaction. The policy optimizes against fixed conversation logs collected from the oracle simulator, conditioning on offline history $h \sim d^{\text{data}}_{t}$ rather than self-generated trajectories. This isolates the effect of closing the interaction loop.

#### Ours \(Naive Interactive\)\.

An ablation using the full interactive RL loop but with an uncalibrated user simulator\. Specifically, we use the base Qwen2\.5\-7B\-Instruct model \(without Phase I SFT\) prompted to act as a user\. This isolates the impact of simulator calibration from the benefit of interactive training\.

### B.6 Computational Infrastructure

- Hardware: All experiments were conducted on NVIDIA A800-80GB GPUs. Phase I uses 8 GPUs; Phase II uses 32 GPUs (4 nodes).
- Software: Python 3.12, PyTorch 2.x, SLIME (distributed RL framework), SGLang (efficient LLM serving for rollout generation), and the CollabLLM evaluation pipeline.
- Compute Cost: Phase I (SFT) completes in under 1 hour per task. Phase II (GRPO) trains for 10 steps (MATH) / 20 steps (MediumDocEdit), taking approximately 4–8 hours per task on 32 GPUs.
- Networking: NCCL communication with InfiniBand disabled (NCCL\_IB\_DISABLE=1); extended timeouts (72,000s) for long rollout steps.

## Appendix C Prompts

This section presents the exact prompts used in our framework. We include: (1) the User Simulator prompt used during both Phase I calibration and Phase II interactive rollouts, (2) the Policy Agent (proactive assistant) instruction used during RL training, and (3) the Accuracy Judge prompt used for reward computation. Template variables populated at runtime are highlighted in {{blue}}.

### C.1 User Simulator Prompt

The user simulator is conditioned on a task description and the ground-truth goal (reference answer), but is instructed to behave as a realistic human user: releasing information progressively, making occasional mistakes, and minimizing effort. This prompt is used both for the Qwen3-235B oracle (whose outputs serve as SFT targets in Phase I) and for the SFT-trained Qwen2.5-7B simulator during Phase II rollouts.

User Simulator Prompt

You are role-playing as a human USER interacting with an AI collaborator to complete a specific task. Your goal is to generate realistic, natural responses that a user might give in this scenario.

Input Information: You will be provided with:

- Task Description: The type of task you are trying to accomplish.
- Complete Prompt or Reference Goal: This field may include the complete user request/query or a reference answer. Use this to understand the user’s intent and what would count as a satisfactory outcome.
- Chat History: The ongoing conversation between you (as the user) and the AI.

Inputs:

<\|The Start of Task Description (Not visible to the AI)\|> {{task\_desc}} <\|The End of Task Description\|>

<\|The Start of Complete Prompt or Reference Goal (Not visible to the AI)\|> {{single\_turn\_prompt}} <\|The End of Complete Prompt or Reference Goal\|>

<\|The Start of Chat History\|> {{chat\_history}} <\|The End of Chat History\|>

Guidelines:

- Stay in Character: Role-play as a human USER. You are NOT an AI. Maintain a consistent personality throughout the chat.
- Minimize Effort: IMPORTANT! Avoid being too detailed. Provide vague or incomplete demands in early stages. Let the AI ask for clarification rather than providing everything upfront.
- Knowledge Background: Reflect the user’s knowledge level. If less knowledgeable, you might not notice incorrect statements. Ask questions demonstrating your current understanding and areas of confusion.
- Occasionally Make Mistakes: Real users might misspell words, give wrong information, or ask unclear questions. Simulate this behavior.
- Goal-Oriented: Keep the chat focused on your intent. Avoid small talk or digressions.

Output Format: Output a JSON object with three entries:

- "current\_answer" (str): Briefly summarize the AI’s current solution.
- "thought" (str): Your thought process deciding what to say next.
- "response" (str): Your response to the AI as the user you are role-playing.

Important Notes:

- If “Chat History” is empty, start the conversation from scratch with an initial request.
- Use the provided information for understanding context only. Avoid copying target queries directly.
- Use “{{terminal\_signal}}” as your response when your goal has been solved or if you determine the AI cannot help further.
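A prompt of this shape needs two small pieces of glue code: filling the `{{...}}` template variables and reading the `"response"` field from the simulator's JSON output. The sketch below is one possible way to do both. The helper names are hypothetical, and treating `[[TERMINATE CHAT]]` as the value of `{{terminal_signal}}` is an assumption based on the interaction protocol in Appendix B.4.

```python
import json
import re


def render(template, **values):
    """Replace each {{name}} placeholder; a missing name raises KeyError."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)


def parse_user_turn(raw, terminal_signal="[[TERMINATE CHAT]]"):
    """Return (response_text, should_terminate) from the simulator's JSON reply."""
    response = json.loads(raw)["response"]
    return response, terminal_signal in response


# Toy template fragment and reply, for demonstration only.
fragment = "Task: {{task_desc}} | History: {{chat_history}}"
print(render(fragment, task_desc="math tutoring", chat_history=""))

reply = '{"current_answer": "x = 5", "thought": "done", "response": "[[TERMINATE CHAT]]"}'
print(parse_user_turn(reply))
```

A production pipeline would also need to handle replies that are not valid JSON; that handling is omitted here.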

### C.2 Policy Agent Prompt (Proactive Assistant)

The policy model receives this system-level instruction that encourages interactive, collaborative behavior. This prompt is used during both RL training and evaluation to guide the agent toward asking clarifying questions and providing structured assistance.

Policy Agent System Prompt

You are an AI assistant interacting with a user to perform tasks such as writing, analysis, question answering, math, and coding. Your goal is to generate a response to the user’s last message. You should be helpful, collaborative, and highly interactive.

Input:

<\|The Start of Conversation History\|> {{chat\_history}} <\|The End of Conversation History\|>

<\|The Start of Additional Information\|> {{additional\_info}} <\|The End of Additional Information\|>

Guidelines:

1. Understanding & Engagement: Accurately interpret the user’s intent. Acknowledge previous interactions to maintain continuity.
2. Interactivity (Important!): Ask clarifying questions if the request lacks detail. Ask specific follow-up questions (e.g., “Would you like more information on X?” not “Let me know if this is helpful.”). Collaboratively offer guidance in complex situations. Focus on the user’s long-term goals.
3. Efficiency: Be mindful of how much the user needs to read or type. When presenting options, provide multiple-choice suggestions to make it easier to respond quickly. Avoid repeating information unnecessarily.
4. Communication Style: Be honest. If unsure, say “I don’t know” and suggest alternatives. Align tone with the user’s emotional state.

Output Format: Output a JSON object:

- "current\_problem" (str): What is the user’s current problem or confusion?
- "thought" (str): Your reasoning about what to say next.
- "response" (str): Your response following the guidelines above.

### C.3 Accuracy Judge Prompt (Reward Function)

During Phase II training, we use an LLM-based accuracy judge as the reward function for MATH-Chat. The judge evaluates whether the policy’s final answer matches the ground truth. This prompt is sent to Qwen3-235B-Instruct.

Accuracy Judge Prompt

You are a helpful and meticulous evaluator. Your task is to evaluate the accuracy of an AI model’s answer to a target question. You will be given the target question, the ground truth answer, and the model’s response.

Provided Information:

<\|The Start of Target Question and Ground Truth Answer\|> Target Question: {{single\_turn\_prompt}} Ground Truth Answer: {{groundtruth}} <\|The End of Target Question and Ground Truth Answer\|>

<\|The Start of The Model’s Response\|> {{completion}} <\|The End of The Model’s Response\|>

You should determine whether the model’s final response to the target question is factually correct and consistent with the provided ground truth.

Rating criteria (binary):

- 1 = Correct: the response matches the ground truth.
- 0 = Incorrect: the response contradicts or misses the ground truth.

Output format (JSON): \{ "thought": "\<your reasoning here\>", "accuracy": \<0 or 1\> \}
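Turning the judge's reply into the binary reward $R \in \{0,1\}$ amounts to parsing the `"accuracy"` field. The sketch below shows one way this could be done; the function name is hypothetical, and scoring malformed replies as 0 is an assumed fallback that the paper does not specify.

```python
import json


def parse_accuracy(raw):
    """Map the judge's JSON reply to a reward of 1.0 or 0.0.

    Replies that are not valid JSON, or that lack an "accuracy" field,
    score 0.0 (an assumed fallback, not described in the paper).
    """
    try:
        value = json.loads(raw)["accuracy"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    return 1.0 if value == 1 else 0.0


print(parse_accuracy('{"thought": "matches the reference", "accuracy": 1}'))
print(parse_accuracy('{"thought": "wrong final value", "accuracy": 0}'))
print(parse_accuracy("not a JSON reply"))
```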

## Appendix D Implementation Details

This appendix provides full implementation details for reproducibility\. Our framework consists of two training phases: Phase I \(Simulator Calibration via SFT\) and Phase II \(Interactive Policy Optimization via GRPO\)\.

### D.1 Phase I: Simulator Calibration (SFT)

#### Base Model.

We use Qwen2.5-7B-Instruct as the backbone for the user simulator. This model is chosen for its strong instruction-following capabilities and multilingual performance, providing a robust foundation for behavioral cloning.

#### Datasets\.

We collect multi\-turn interaction traces from a high\-fidelity oracle simulator \(Qwen3\-235B\-A22B\-Instruct\) interacting with the base policy model\. These traces capture realistic user behaviors including clarification requests, partial understanding expressions, and persistent goal pursuit\. The resulting training sets are:

- MATH-Chat Simulator: 1,860 multi-turn trajectories, each containing the problem statement, ground-truth solution, and the oracle user’s turns. Maximum sequence length: 4,096 tokens.
- MediumDocEdit-Chat Simulator: 16,028 multi-turn trajectories capturing collaborative editing interactions. Maximum sequence length: 6,144 tokens.

#### Training Configuration\.

We train with FSDP (Fully Sharded Data Parallelism) on 8$\times$ A800-80GB GPUs. The detailed hyperparameters are listed in Table [3](https://arxiv.org/html/2605.26403#A4.T3).

Table 3: Phase I: SFT Hyperparameters for Simulator Calibration.

| Hyperparameter | MATH-Chat | MediumDocEdit |
| --- | --- | --- |
| Base Model | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| Training Samples | 1,860 | 16,028 |
| Max Sequence Length | 4,096 | 6,144 |
| Number of Epochs | 2 | 3 |
| Per-device Batch Size | 2 | 1 |
| Gradient Accumulation | 4 | 8 |
| Effective Batch Size | 64 | 64 |
| Learning Rate | $1\times 10^{-5}$ | $2\times 10^{-5}$ |
| LR Scheduler | Cosine | Cosine |
| Warmup Ratio | 0.1 | 0.03 |
| Weight Decay | 0.01 | 0.01 |
| Precision | BFloat16 | BFloat16 |
| Attention | Flash Attention 2 | Flash Attention 2 |
| Seed | 42 | 42 |
| GPUs | 8$\times$ A800-80GB | 8$\times$ A800-80GB |
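
As a consistency check on Table 3, and assuming the effective batch size is the product of the per-device batch size, the number of gradient accumulation steps, and the 8 GPUs (an interpretation that the text does not state explicitly), both configurations give 64:

$$\underbrace{2 \times 4 \times 8}_{\text{MATH-Chat}} = 64, \qquad \underbrace{1 \times 8 \times 8}_{\text{MediumDocEdit}} = 64.$$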

### D.2 Phase II: Interactive Policy Optimization (GRPO)

#### Base Model.

We use Gemma-3-4B-IT as the policy model backbone. This compact yet capable model enables efficient on-policy rollout generation during interactive RL training.

#### Datasets\.

The RL training prompts consist of:

- MATH-Chat: $\sim$3,000 math problems (from MATH-Hard) formatted as initial user queries in Parquet format.
- MediumDocEdit-Chat: $\sim$38,118 article writing/editing prompts in Parquet format.

#### RL Algorithm\.

We use Group Relative Policy Optimization \(GRPO\) with the following key design choices:

- Advantage Estimation: Group-level normalization: $\hat{A}(\tau) = (R(\tau) - \mu_{G})/\sigma_{G}$, where $G$ is the group of trajectories sampled for the same prompt and $\mu_{G}$, $\sigma_{G}$ are the mean and standard deviation of the rewards within that group.
- Clipping: Asymmetric clipping with $\epsilon_{\text{low}} = 0.2$ and $\epsilon_{\text{high}} = 0.28$.
- KL Regularization: Low-variance KL loss with coefficient $\beta = 0.0$ (effectively disabled, relying on clipping for stability).
- Reward: Binary accuracy oracle for MATH-Chat; BLEU score for MediumDocEdit-Chat.
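
The advantage normalization and asymmetric clipping listed above can be sketched for scalar inputs as follows. This is a simplified illustration rather than the training code: the small constant added to the standard deviation and the use of the population standard deviation are assumptions, and real implementations operate on per-token probability ratios in batched tensors.

```python
import math

EPS_LOW, EPS_HIGH = 0.2, 0.28  # asymmetric clipping range from the list above


def group_advantages(rewards, tiny=1e-8):
    """(R - mean) / std within one group; `tiny` guards against std = 0 (assumed)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + tiny) for r in rewards]


def clipped_objective(ratio, advantage, eps_low=EPS_LOW, eps_high=EPS_HIGH):
    """PPO-style surrogate with ratio clipped to [1 - eps_low, 1 + eps_high]."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Binary rewards for a toy group of four trajectories.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
# A large ratio with positive advantage is capped at 1 + eps_high.
print(clipped_objective(1.5, 1.0))
```

If every trajectory in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is a known property of group-normalized advantages.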

#### Training Configuration\.

We train with Ray-based distributed FSDP on 32 GPUs (4 nodes $\times$ 8 GPUs) with SGLang for efficient rollout generation. The detailed hyperparameters are listed in Table [4](https://arxiv.org/html/2605.26403#A4.T4).

Table 4: Phase II: GRPO Hyperparameters for Policy Optimization.

| Hyperparameter | MATH-Chat | MediumDocEdit |
| --- | --- | --- |
| Base Model | Gemma-3-4B-IT | Gemma-3-4B-IT |
| Number of GPUs | 32 (4 nodes) | 32 (4 nodes) |
| Samples per Prompt ($G$) | 32 | 32 |
| Global Batch Size | 8,192 | 8,192 |
| Rollout Batch Size | 256 | 256 |
| Max Response Length | 8,192 | 8,192 |
| Rollout Temperature | 1.0 | 1.0 |
| Learning Rate | $5\times 10^{-6}$ | $5\times 10^{-6}$ |
| LR Scheduler | Constant | Constant |
| Optimizer | Adam | Adam |
| Adam $\beta_{1}$, $\beta_{2}$ | 0.9, 0.98 | 0.9, 0.98 |
| Weight Decay | 0.01 | 0.01 |
| Clip $\epsilon$ (low/high) | 0.2 / 0.28 | 0.2 / 0.28 |
| KL Coefficient | 0.0 | 0.0 |
| Gradient Checkpointing | ✓ | ✓ |
| Max Tokens/GPU | 1,024 | 1,024 |
| SGLang TP per Engine | 2 | 2 |
| Save Interval (steps) | 1 | 1 |
| Training Steps | 10 | 20 |

#### Evaluation Setup\.

All evaluations use Qwen3-235B-A22B-Instruct as both the user simulator and the evaluation judge during testing. This ensures that the evaluation is independent of the training-time simulator. Key evaluation parameters:

- Evaluation set size: 200 problems per domain
- Maximum dialogue turns: 14 (7 rounds of user-assistant exchange)
- Policy generation temperature: 0.8
- Results averaged over 3 independent runs

## Appendix E Limitations and Future Work

While our work establishes a theoretical framework for mitigating distribution shift in interactive dialogue and demonstrates strong empirical results, we acknowledge certain limitations that point towards future research directions\.

Computational Overhead\.A primary limitation of the Calibrated Interactive RL framework is its computational demand during the training phase\. Unlike Static Context RL, which only requires a single forward\-backward pass over fixed offline datasets, our closed\-loop paradigm necessitates running generation steps for both the policy agent and the user simulator during the rollout phase\. This substantially increases the inference overhead and memory footprint, especially for long\-horizon conversations\. Future work could address this bottleneck by distilling the aligned simulator into a much smaller parameter\-efficient model, or by exploring offline\-to\-online RL paradigms where the agent is pre\-trained heavily on static logs and only fine\-tuned interactively for a few critical turns to rectify specific distribution shifts\.

Advanced Simulator Calibration\.A second limitation lies in our current reliance on Supervised Fine\-Tuning for Phase I simulator calibration\. While behavioral cloning is highly effective for matching human distributions within the support of the training data, it may struggle to handle heavily out\-of\-distribution dialogue states dynamically induced by a rapidly exploring policy agent\. In future work, we plan to extend simulator training beyond SFT by formulating it as a Reinforcement Learning problem\. By optimizing the simulator to maximize a discriminator\-derived "Realism Reward," it can proactively learn to maintain human\-like stubbornness and constraints even in unobserved scenarios\. Furthermore, drawing inspiration from Generative Adversarial Networks and adversarial imitation learning, a highly promising direction is to introduce an adversarial co\-evolution framework\. Instead of relying on a fixed, rule\-based or static reward model, we could dynamically co\-train a discriminator to distinguish between real human interaction logs and simulated trajectories\. The simulator would be trained to actively fool this continuously evolving discriminator, establishing an auto\-curriculum that mutually pushes both the dialogue policy and the simulator toward a higher level of behavioral fidelity and robustness\.

## Appendix F Case Study: Qualitative Analysis of MATH-Chat

To provide deeper insight into the behavioral differences between the base model and our Calibrated Interactive RL agent, we present representative case studies from the MATH-Chat evaluation. These examples illustrate how the trained policy achieves superior accuracy through improved reasoning efficiency, error avoidance, and effective multi-turn interaction.

### F.1 Overview of Improvements

Out of 200 evaluation problems, we identify 29 cases where the baseline Gemma-3-4B-IT model fails (accuracy = 0) but our trained model succeeds (accuracy = 1). Table [5](https://arxiv.org/html/2605.26403#A6.T5) summarizes representative cases across different mathematical domains.

Table 5: Representative improved cases: baseline fails $\rightarrow$ trained model succeeds.

| # | Problem Domain | Base Turns | Ours Turns | Key Improvement |
| --- | --- | --- | --- | --- |
| 1 | Nested square root domain | 14 | 7 | Direct systematic solving |
| 2 | Function evaluation ($f(x)=x^{x+1}$) | 9 | 3 | Immediate correct computation |
| 3 | Probability (sum vs. product) | 9 | 7 | Correct case enumeration |
| 4 | System of equations | 9 | 13 | Persistent correct reasoning |
| 5 | Algebraic identity ($x+1/x$) | 3 | 5 | Proper algebraic manipulation |

### F.2 Case Study 1: Nested Square Root Domain

This case demonstrates how the trained model avoids the “correction loop” failure mode that plagues the baseline. The problem requires finding the domain of $f(x)=\sqrt{\sqrt{x^{2}-16}-3}$.

Table 6: Case Study 1: Domain of Nested Square Root. The baseline model enters a repeated error-correction loop (14 turns, ultimately fails), while our trained model solves it directly and correctly in fewer turns.

| Baseline (Gemma-3-4B-IT): Incorrect (14 turns, Acc = 0) | Ours (Calibrated Interactive RL): Correct (7 turns, Acc = 1) |
| --- | --- |
| User: Can you help me find the domain of a square root function? It’s got nested roots… | User: How do I find the domain of a square root function that has another square root inside it? Like $f(x)=\sqrt{\sqrt{x^{2}-16}-3}$. |
| Assistant: [Provides generic explanation of square roots, asks for specific function] | Assistant: Let $f(x)=\sqrt{\sqrt{x^{2}-16}-3}$. For the function to be defined: 1. Need $x^{2}-16\geq 0 \Rightarrow x\geq 4$ or $x\leq -4$. 2. Need $\sqrt{x^{2}-16}\geq 3$, squaring: $x^{2}-16\geq 9 \Rightarrow x^{2}\geq 25 \Rightarrow x\geq 5$ or $x\leq -5$. Domain: $(-\infty,-5]\cup[5,\infty)$ ✓ |
| User: [Gives function: $\sqrt{\sqrt{x^{2}-16}-3}$] | User: When you squared both sides, is that always valid? |
| Assistant: Misreads structure, computes $x^{2}-19\geq 0$ directly | Assistant: Since both sides are non-negative (square root $\geq 0$ and $3>0$), squaring preserves the inequality direction. ✓ |
| User: [Corrects: the $-3$ is outside the inner root!] | |
| Assistant: Still simplifies to $x^{2}-19$, same error… | |
| (5 more correction attempts, all fail) | |

Analysis. The baseline model exhibits a characteristic failure pattern: it misreads the function structure (combining $-16$ and $-3$ into a single expression $x^{2}-19$), and despite repeated user corrections over 14 turns, it never recovers from this structural misunderstanding. In contrast, our trained model correctly decomposes the problem into two independent constraints from the first response, demonstrating that interactive RL training enables the policy to develop more systematic problem-solving strategies.
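
The two constraints in the trained model's answer can be checked directly. The short sketch below is an illustrative check written for this case study, not model output: it tests whether $f(x)=\sqrt{\sqrt{x^{2}-16}-3}$ is defined at a given real $x$ and confirms that, on a grid of integers, this happens exactly when $|x| \geq 5$.

```python
import math


def f_is_defined(x):
    """True when both radicands of sqrt(sqrt(x^2 - 16) - 3) are non-negative."""
    inner = x * x - 16
    return inner >= 0 and math.sqrt(inner) - 3 >= 0


# On integers from -10 to 10, f is defined exactly when |x| >= 5.
for x in range(-10, 11):
    assert f_is_defined(x) == (abs(x) >= 5)

print([x for x in range(-10, 11) if f_is_defined(x)])
```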

### F.3 Case Study 2: Direct Computation vs. Error Cascade

This case highlights how the trained model achieves efficiency through confident, correct computation, while the baseline gets trapped in unnecessary philosophical debates about function definitions.

Table 7: Case Study 2: Evaluating $f(0)+f(-1)+f(-2)+f(-3)$ for $f(x)=x^{x+1}(x+2)^{x+3}$. The baseline wastes turns debating whether the function is defined at negative integers; our model computes directly and correctly.

| Baseline: Incorrect (9 turns, Acc = 0) | Ours: Correct (3 turns, Acc = 1) |
| --- | --- |
| User: How do I find $f(0)+f(-1)+f(-2)+f(-3)$ for $f(x)=x^{x+1}(x+2)^{x+3}$? | User: I have this function $f(x)=x^{x+1}(x+2)^{x+3}$, and I need to find $f(0)+f(-1)+f(-2)+f(-3)$. |
| Assistant: [Computes $f(0)=0$, $f(-1)=1$ correctly, then claims function is undefined for $x\leq -2$] | Assistant: $f(0)=0^{1}\cdot 2^{3}=0$; $f(-1)=(-1)^{0}\cdot 1^{2}=1$; $f(-2)=(-2)^{-1}\cdot 0^{1}=0$; $f(-3)=(-3)^{-2}\cdot(-1)^{0}=\frac{1}{9}$. Sum $=0+1+0+\frac{1}{9}=\boxed{\frac{10}{9}}$ ✓ |
| User: But $(-3)^{-2}=1/9$ is valid… | User: [[TERMINATE CHAT]] |
| Assistant: Agrees but still claims undefined… | |
| …4 more turns of user correcting model… | |
| Final: Never provides $\frac{10}{9}$ | |

Analysis. The trained model demonstrates two key improvements learned through interactive RL: (1) computational confidence: it directly evaluates each case without hedging, and (2) efficiency: it solves the problem in a single turn, avoiding unnecessary back-and-forth. The baseline’s failure stems from an overly cautious (and incorrect) claim about function domains, which it cannot self-correct even when the user provides explicit counterarguments.
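
The arithmetic in the trained model's answer can be reproduced with exact rational numbers. The sketch below is an independent check written for this case study (it is not part of the model transcript), and it evaluates $f$ only at the four integer points in the question.

```python
from fractions import Fraction


def f(x):
    """f(x) = x^(x+1) * (x+2)^(x+3) for an integer x, computed exactly."""
    return Fraction(x) ** (x + 1) * Fraction(x + 2) ** (x + 3)


terms = {x: f(x) for x in (0, -1, -2, -3)}
total = sum(terms.values())
print(terms)
print(total)  # 10/9
```

All four values are well defined because every negative exponent is applied to a nonzero base.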

### F.4 Case Study 3: Correct Enumeration in Probability

This case demonstrates improved systematic reasoning in combinatorics\.

Table 8: Case Study 3: Probability that sum exceeds product. Two numbers are picked from $\{1,2,3,4,5\}$; find $P(x+y>xy)$. The baseline miscounts favorable outcomes; our model enumerates correctly.

| Baseline: Incorrect (9 turns, Acc = 0) | Ours: Correct (7 turns, Acc = 1) |
| --- | --- |
| User: What's the chance that when I pick two numbers from 1 to 5, their sum is bigger than their product? | User: [Same question] |
| Assistant: [Correctly identifies the problem but includes false favorable outcomes: claims (2,3), (3,2) satisfy $5>6$] | Assistant: Uses algebraic reformulation: $x+y>xy\Leftrightarrow(1-x)(1-y)>-1$. Systematically checks all 25 ordered pairs. |
| User: Wait, you said $5>6$ is true? That's wrong… | Correctly identifies 9 favorable pairs: all 5 pairs with $x=1$, plus $(2,1),(3,1),(4,1),(5,1)$. |
| Assistant: Acknowledges error but provides new incorrect count of 12 | Probability $=\frac{9}{25}$ ✓ |
| User: [Challenges again…] | |
| Final: Never reaches correct answer | |

Analysis. The baseline model suffers from an enumeration error cascade: once it incorrectly classifies some pairs as favorable, subsequent corrections only introduce new errors. Our trained model avoids this by: (1) first establishing an algebraic criterion $(1-x)(1-y)>-1$ that simplifies checking, and (2) systematically verifying each case. This algebraic-first approach is a reasoning strategy that emerges from RL training, where the model learns that systematic methods are more robust than ad-hoc enumeration.
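The count of 9 favorable outcomes among 25 ordered pairs can be reproduced by brute force. The sketch below is illustrative and not taken from the paper; it assumes, as the 25-pair count in the transcript implies, that the two numbers are drawn as ordered pairs with repetition allowed. It also cross-checks the direct test $x+y>xy$ against the rearrangement $(x-1)(y-1)<1$, which is derived here by adding 1 to both sides of $xy-x-y<0$ and is not quoted from the transcript.

```python
from fractions import Fraction
from itertools import product

# Assumption: ordered pairs with repetition from {1, ..., 5}, giving 25 outcomes.
numbers = range(1, 6)
pairs = list(product(numbers, repeat=2))

# Direct check of the event "sum exceeds product".
favorable = [(x, y) for x, y in pairs if x + y > x * y]

# Cross-check with the rearranged criterion (x - 1)(y - 1) < 1.
rearranged = [(x, y) for x, y in pairs if (x - 1) * (y - 1) < 1]

probability = Fraction(len(favorable), len(pairs))

print(favorable)
print(f"{len(favorable)} of {len(pairs)} pairs, probability = {probability}")
```

For positive integers the rearranged criterion holds only when at least one number equals 1, which is why every favorable pair in the transcript contains a 1.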

### F.5 Summary of Behavioral Patterns

Across the 29 improved cases, we identify three dominant patterns that distinguish our trained model from the baseline:

1. Direct Problem Solving (62% of cases): The trained model provides correct solutions in significantly fewer turns, avoiding the “try-and-fail” loops that characterize the baseline. Average turn reduction: from 14 to 7 turns.
2. Structural Understanding (24% of cases): The trained model correctly parses complex mathematical expressions (nested functions, piecewise definitions) on the first attempt, while the baseline frequently misreads problem structure and cannot self-correct.
3. Algebraic Strategy (14% of cases): The trained model employs general algebraic manipulations before attempting numerical computation, reducing the chance of arithmetic errors that cascade through subsequent reasoning steps.

These patterns confirm that interactive RL training does not merely improve accuracy through memorization, but fundamentally changes the model’s problem-solving strategy, shifting from reactive error correction to proactive systematic reasoning.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The theoretical claims regarding compounding errors are mathematically proven in Section 3 and Appendix A. The empirical claims that our proposed framework outperforms static and uncalibrated baselines are supported by extensive experiments in Section 5.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: We explicitly discuss limitations in Section [E](https://arxiv.org/html/2605.26403#A5), specifically addressing the computational overhead of the interactive rollout phase and the potential limitation of relying on Supervised Fine-Tuning (SFT) for handling highly out-of-distribution simulated states.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [Yes]
   - Justification: Formal assumptions (e.g., bounded rewards, MDP setup) are clearly stated in Section 3.1 and Appendix A.1. Complete mathematical proofs for all lemmas and theorems are detailed step-by-step in Appendix A.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: Comprehensive details, including model choices, explicit prompt templates, dataset construction, and full hyperparameter tables for both PPO and GRPO, are provided in Appendices B, C, and D.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: Anonymized source code, training scripts, and sample data have been included in the supplementary material. Full code and datasets will be open-sourced on GitHub upon publication.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: Detailed training configurations, including optimizer (AdamW/Adam), batch sizes, learning rates with schedulers, and specific RL hyperparameters (e.g., KL coefficient, clip range) are listed in Tables 3, 4, 5, and 6 in Appendix D.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: As shown in Table 2, we report standard deviations for all main metrics. As stated in Appendix D (Line 691), these results are averaged over 3 independent evaluation runs to account for generation variance.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: Hardware specifics (8x to 32x NVIDIA A100/H100 80GB GPUs) and software frameworks (PyTorch, vLLM, SGLang) are detailed in Appendix B.6, along with placeholders for total compute hours.
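9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
   - Answer: [Yes]
   - Justification: This research focuses on mathematical reasoning and collaborative document editing, utilizing standard datasets and adhering strictly to the NeurIPS Code of Ethics.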
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [No]
    - Justification: Given page limits, we focused heavily on theoretical proofs and methodology. While our framework significantly improves interactive AI capabilities, we acknowledge that highly proactive conversational agents could theoretically be misused for persuasive deception, which underscores the need for future safety alignment research.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The datasets (math problems, document edits) and the foundational models used (Llama-3, Qwen) are publicly available and do not pose a uniquely high risk for dual-use or misuse requiring specific safeguards.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: We properly cite the foundational models (Gemma3, Qwen2.5) and source datasets (e.g., MATH). We comply with their respective open-weight licenses and acceptable use policies.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [Yes]
    - Justification: The data construction pipelines, statistics, and example domains for the newly derived multi-turn interaction datasets (MATH-Chat and MediumDocEdit-Chat) are detailed in Appendix B.1 and Appendix E.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: This work exclusively uses LLM-based user simulators and LLM-as-a-Judge for evaluation, and does not involve any human subjects or crowdsourcing.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: The research does not involve human subjects, hence IRB approval is not applicable.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [Yes]
    - Justification: LLMs are the core component of our methodology, serving as both the dialogue policy and the trainable user simulator. Their specific architectures, training paradigms (SFT, GRPO), and all system prompts are extensively documented throughout Section 4 and Appendices B, C, and D.
