SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
Summary
The paper introduces SeDT, a training-free inference-time method that improves LLM reliability in multi-turn conversations by annotating conversation history with cumulative relevance scores from three signals, achieving up to +37.7% performance gains on the Lost-in-Conversation benchmark.
# Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
Source: [https://arxiv.org/html/2605.26788](https://arxiv.org/html/2605.26788)
Ramakrishna Vamsi Setti¹, Jagadeesh Rachapudi², Sachin Chaudhary³, Praful Hambarde², Amit Shukla²
¹Independent Researcher, ²Drone Lab, IIT Mandi, ³UPES, Dehradun
vamsisetti007@gmail.com, s23096@students.iitmandi.ac.in, sachin.chaudhary@ddn.upes.ac.in, {praful, amitshukla}@iitmandi.ac.in
###### Abstract
Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose on average $39\%$ of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as *Lost in Conversation*. Crucially, this collapse is almost entirely a *reliability* failure: best-case aptitude falls by only $16\%$, while unreliability more than doubles ($+112\%$). We argue that the root cause is structural: a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT (Sentence-transformer Decision-Transformer), a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary signals (semantic, lexical, and positional) and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark with three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains of up to $+37.7\%$ in mean performance $\bar{P}$ and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
## 1 Introduction
Ask a language model to write a function and it will succeed (Rachapudi et al., [2026c](https://arxiv.org/html/2605.26788#bib.bib31), [b](https://arxiv.org/html/2605.26788#bib.bib32), [a](https://arxiv.org/html/2605.26788#bib.bib33)). Give it the same problem one constraint at a time (the function name first, the input format next, the edge cases last) and it will quietly fall apart. Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)) document this failure at scale: 15 state-of-the-art (SoTA) LLMs evaluated on six tasks over 200,000 simulated conversations suffer an average performance drop of $39\%$ when the same instruction is revealed across turns rather than all at once. This is not a contrived setting. Analysis of large-scale real-world LLM conversations confirms that multi-turn, underspecified interaction is the norm rather than the exception (Zheng et al., [2023a](https://arxiv.org/html/2605.26788#bib.bib20)), and that users, particularly novice ones, rarely specify all requirements upfront (Herlihy et al., [2024](https://arxiv.org/html/2605.26788#bib.bib25)).
The cause is structural. A multi-turn conversation is concatenated into a flat context window in which every prior turn carries equal implicit weight. The model receives no signal about which turns specified critical constraints and which were conversational scaffolding. Transformer attention compounds this: models systematically neglect middle-context information even in single-turn settings (Liu et al., [2024](https://arxiv.org/html/2605.26788#bib.bib5)), and multi-turn conversations stack this bias across turns that the model cannot distinguish from noise. Four concrete failure modes follow directly: premature commitment to an answer before all constraints are revealed, over-reliance on incorrect intermediate responses, loss of middle-turn constraints in favor of the first and last turns, and verbose drift that introduces false assumptions (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)).
The offline reinforcement learning community faced a structurally identical problem. An agent learning from a flat replay buffer has no signal about which transitions were valuable. The Decision Transformer (Chen et al., [2021a](https://arxiv.org/html/2605.26788#bib.bib2)) resolved this by annotating each trajectory step with its return-to-go (RTG), telling the agent what mattered rather than leaving it to infer importance from a flat buffer. The same insight applies to multi-turn LLM inference, yet existing solutions fall short: finetuning requires curated multi-turn training data and weight modification, making it impractical at inference time, and a fully specified single-turn prompt cannot be assumed, because users reveal constraints incrementally not by choice but by necessity (Herlihy et al., [2024](https://arxiv.org/html/2605.26788#bib.bib25)). We import this parallel directly into inference-time prompt construction. A multi-turn conversation is a trajectory; each *shard*, an atomic piece of information revealed in a single turn, is a step; and the semantic relevance of that shard to the final output goal is the reward. Just as the Decision Transformer tells the agent which steps mattered, SeDT tells the model which turns matter.
To our knowledge, no prior work identifies equal implicit turn weighting in flat conversation contexts as the structural driver of multi-turn reliability collapse; prior work attributes the failure to intent misalignment (Liu et al., [2026](https://arxiv.org/html/2605.26788#bib.bib6)) or model unreliability (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)) rather than to the context representation itself.
We present SeDT (Sentence-transformer Decision-Transformer conditioning), a training-free inference-time method that resolves flat-context weighting by annotating each prior shard with a cumulative relevance score and presenting the full RTG-annotated history to the model at the final turn, changing no weights, requiring no training data, and discarding no context. On the Lost-in-Conversation benchmark (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)), SeDT consistently outperforms the sharded baseline across all three evaluated LLMs and all three tasks, with gains of up to $+37.7\%$ in mean performance $\bar{P}$ and simultaneous reductions in unreliability, confirming that the lost-in-conversation problem is at least partially one of context weighting, addressable at inference time without training. SeDT requires no task-specific data, no model modification, and at most one additional LLM call.
#### Contributions\.
Our main contributions are as follows:
- Problem identification: Flat-context turn weighting is identified as the structural root cause of multi-turn reliability collapse, with a formal parallel to Decision Transformer RTG conditioning.
- SeDT: A training-free inference-time method that requires zero model modification and zero training data.
- Three-signal relevance: A semantic, lexical, and positional relevance formulation that directly counteracts the four documented failure modes of multi-turn LLMs.
- RTG-grounded self-correction: A two-guard correction mechanism that introduces zero hurt cases while providing a conservative verification pathway.
## 2 Background
### 2.1 Lost in Conversation
Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)) document a systematic performance gap between single-turn, fully-specified interaction and multi-turn, underspecified interaction at scale. Their sharding framework decomposes a fully-specified instruction into atomic information shards revealed one per turn, enabling controlled comparison while holding task content constant. Running $n$ independent simulations per example at temperature $T = 1.0$ yields three metrics: average performance $\bar{P}$ (mean score), aptitude $A_{90}$ (90th percentile, capturing best-case capability), and unreliability $U$ (the gap between the 90th and 10th percentiles, where lower is better). The central finding is that the lost-in-conversation problem manifests itself primarily as an unreliability explosion ($+112\%$) rather than an aptitude collapse ($-16\%$). Therefore, a reliable solution must reduce $U$ along with improving $\bar{P}$.
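As a rough illustration of these three metrics, the sketch below computes $\bar{P}$, $A_{90}$, and $U$ from the scores of repeated runs on a single example. The function names, the sample scores, and the linear-interpolation percentile rule are assumptions made for this sketch; the benchmark's reference implementation may compute percentiles differently.

```python
from statistics import mean


def percentile(values, q):
    """q-th percentile using linear interpolation between ranks (assumed convention)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def conversation_metrics(scores):
    """Return (mean performance, aptitude A90, unreliability U) for one example."""
    p_bar = mean(scores)
    a90 = percentile(scores, 90)
    u = a90 - percentile(scores, 10)  # 90th minus 10th percentile; lower is better
    return p_bar, a90, u


# Hypothetical binary-correctness scores (0 or 100) from five simulated runs.
runs = [100, 100, 0, 100, 0]
p_bar, a90, u = conversation_metrics(runs)
print(p_bar, a90, u)
```

In this made-up case the model can solve the task (aptitude is at the maximum) but does so in only three of five runs, so the unreliability gap is as large as it can be. This is the pattern the section describes: aptitude holds up while reliability degrades.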
### 2.2 Related Work
#### Multi\-turn evaluation
A growing body of work evaluates LLMs in multi-turn settings through *episodic* benchmarks, in which each conversation turn introduces a self-contained subtask that can be evaluated independently (Zheng et al., [2023b](https://arxiv.org/html/2605.26788#bib.bib21); Bai et al., [2024](https://arxiv.org/html/2605.26788#bib.bib22); Kwan et al., [2024](https://arxiv.org/html/2605.26788#bib.bib23); Wang et al., [2023](https://arxiv.org/html/2605.26788#bib.bib24)). Although these benchmarks capture important capabilities such as refinement (iteratively improving a response based on user feedback) and tool use (calling external APIs or executing code across turns), they do not require the model to fuse underspecified information accumulated across turns and, as a result, systematically overestimate multi-turn performance (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)).
#### Attention bias and long\-context failures
Transformer models exhibit a well-documented U-shaped attention bias, disproportionately attending to tokens at the beginning and end of long contexts while neglecting middle content (Liu et al., [2024](https://arxiv.org/html/2605.26788#bib.bib5)). This phenomenon extends naturally to multi-turn conversation, where middle turns carrying critical constraints receive insufficient attention (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)). Static self-attention has been identified as the root cause of score dilution in long-context settings, motivating test-time approaches that address this at the query level (Bansal et al., [2025](https://arxiv.org/html/2605.26788#bib.bib7)).
#### Context management and recapitulation
Beyond evaluation, a separate line of work addresses the multi-turn problem by modifying how conversation history is presented to the model. One family of approaches uses recapitulation, a strategy in which prior user turns are restated verbatim in the current context to ensure that the model has access to all previous constraints. Standard recapitulation appends all prior shards as a restatement at the final turn, while snowball recapitulation grows this restatement cumulatively at every turn (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)). A subtler alternative is starting a new conversation for each sub-task, which empirically improves performance (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)) by resetting the flat context, but this discards conversational history entirely and is unavailable when constraints arrive incrementally from a real user. Finetuning resolves flat weighting by updating model weights on curated multi-turn data, but it requires task-specific corpora and retraining for every new model or domain, and it is unavailable at inference time.
#### Intent alignment and instruction following
When faced with incomplete or ambiguous questions, LLMs exhibit systematic response patterns. Some models hedge, producing vague or noncommittal answers that avoid committing to an incomplete specification. Others issue clarification requests, asking the user to provide missing information before proceeding. A third pattern is the premature direct response, where the model assumes the most likely interpretation and responds immediately, often incorrectly (Herlihy et al., [2024](https://arxiv.org/html/2605.26788#bib.bib25)). All three patterns reflect the same underlying problem: the model cannot determine user intent from an incomplete context. Improving multi-turn instruction following has been pursued through demonstration-guided training (Sun et al., [2024](https://arxiv.org/html/2605.26788#bib.bib26)) and through architectural frameworks that decouple intent inference from task execution by reconstructing a single-turn instruction from conversation history (Liu et al., [2026](https://arxiv.org/html/2605.26788#bib.bib6)).
#### Return\-to\-go conditioning
Sequence modeling approaches to offline reinforcement learning have demonstrated that conditioning on desired outcomes, rather than learning value functions, produces robust and controllable agent behavior (Chen et al., [2021a](https://arxiv.org/html/2605.26788#bib.bib2)). Dense sentence representations optimized for semantic similarity have since enabled goal-conditioned reasoning in language settings (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.26788#bib.bib3); Song et al., [2020](https://arxiv.org/html/2605.26788#bib.bib4)), providing the computational foundation for translating outcome-conditioned sequence modeling into prompt-level context management. SeDT builds on both lines of work.
## 3 Method: SeDT
### 3.1 The Formal Parallel
The structural analogy between the Decision Transformer (Chen et al., [2021a](https://arxiv.org/html/2605.26788#bib.bib2)) and SeDT is precise. In the Decision Transformer, a trajectory is a sequence of interactions between an agent and its environment, recorded as a series of steps. At each step $t$, the agent observes a state $s_t$ describing the current situation, takes an action $a_t$, and receives a signal $\hat{R}_t$ called the return-to-go, which represents the total reward the agent can still accumulate from that step onward to the end of the trajectory. By conditioning on these return-to-go values, the Decision Transformer teaches the agent to associate high-value steps with high future reward, taking the full trajectory $[(\hat{R}_1, s_1, a_1), \ldots, (\hat{R}_T, s_T, a_T)]$ as input to generate the next action. In SeDT, we draw a direct parallel to multi-turn conversation. The conversation history plays the role of the trajectory, each shard revealed in a single turn plays the role of a step, and the semantic relevance of that shard to the final output goal plays the role of reward. Just as the Decision Transformer annotates each step with how much future reward it is worth, SeDT annotates each shard with how much goal-relevant information is still to come from that turn onward, which we also call the return-to-go of that shard. The conversation history thus becomes $[(\hat{R}_1, \text{shard}_1), \ldots, (\hat{R}_T, \text{shard}_T)]$, conditioning the final answer on return-to-go values:
$$\begin{aligned} \text{DT:}\quad & [\hat{R}_1, s_1, a_1]\;[\hat{R}_2, s_2, a_2]\;\cdots \to a_T \\ \text{SeDT:}\quad & [\hat{R}_1, \text{shard}_1]\;[\hat{R}_2, \text{shard}_2]\;\cdots \to \hat{y}_T, \end{aligned}$$
where $\hat{R}_t = \sum_{t'=t}^{T-1} \mathrm{rel}(t')$ is the cumulative relevance from turn $t$ to the final turn, and $T$ is the total number of shards in the conversation. In the Decision Transformer, a high $\hat{R}_t$ tells the agent that much future reward is still achievable from this step. In SeDT, a high $\hat{R}_t$ tells the model that much goal-relevant information is still to come from this turn, and that it should carefully attend to the constraints that follow it. The parallel is not merely metaphorical; it directly motivates every design decision that follows. Figure [1](https://arxiv.org/html/2605.26788#S3.F1) shows the overview of SeDT.
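The return-to-go defined above is a suffix sum over per-shard relevance scores. A minimal sketch, assuming zero-indexed shards and made-up relevance values (the function name `return_to_go` is introduced here for illustration only):

```python
def return_to_go(rel):
    """Suffix sums: rtg[t] = rel[t] + rel[t + 1] + ... + rel[T - 1]."""
    rtg = [0.0] * len(rel)
    running = 0.0
    # Walk backwards so each step adds one more relevance term.
    for t in range(len(rel) - 1, -1, -1):
        running += rel[t]
        rtg[t] = running
    return rtg


# Hypothetical per-shard relevance scores for a four-shard conversation.
rel = [0.5, 0.25, 0.75, 0.125]
rtg = return_to_go(rel)
print(rtg)
```

Because relevance scores in this example are non-negative, the resulting sequence never increases from one turn to the next, so the final shard always carries the smallest value.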
Figure 1: Overview of the SeDT pipeline.
### 3.2 Goal-Based Anchor and Shard Embedding
The anchor $s^*$ is the expected output goal, which the model must produce at the final turn. We use task-typed anchors: “Calculate and give the final numerical answer” (math), “Return all required function calls with correct parameters” (actions), and “Write a complete Python function that solves the problem” (code). For each shard $s_t$ with intermediate response $r_t$, we embed $\mathrm{ST}(s_t \,\|\, r_t)$ if $|r_t| \leq 30$ characters and $\mathrm{ST}(s_t)$ otherwise, where $\mathrm{ST}(\cdot)$ is the sentence-transformer encoder and $\|$ denotes concatenation. This prevents verbose intermediate output from distorting the relevance estimate.
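The length rule for choosing what gets embedded can be sketched as follows. The helper name `text_to_embed` and the single-space separator are assumptions; the text specifies only the 30-character threshold and the concatenation.

```python
MAX_RESPONSE_CHARS = 30  # threshold stated in the text


def text_to_embed(shard, response=None):
    """Return shard + response for short responses, the shard alone otherwise."""
    if response is not None and len(response) <= MAX_RESPONSE_CHARS:
        # The separator is an assumption; only concatenation is specified.
        return f"{shard} {response}"
    return shard


# Hypothetical shard with a short and a long intermediate response.
shard = "The function is named add."
short = text_to_embed(shard, "OK.")
long_ = text_to_embed(shard, "Sure! Here is a first draft of the whole function ...")
print(short)
print(long_)
```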
### 3.3 Three-Signal Relevance Scoring
A shard may be semantically relevant to the goal without sharing surface keywords and vice versa\. To capture complementary dimensions of relevance, we combine three signals:
$$\mathrm{rel}(t) = \alpha \cdot \mathrm{sem}(t) + \beta \cdot \mathrm{lex}(t) + \gamma \cdot \mathrm{pos}(t), \tag{1}$$
with weights $\alpha = 0.6$, $\beta = 0.2$, $\gamma = 0.2$, chosen to reflect the relative discriminative power of each signal, as validated empirically by the signal contribution analysis in Section [5](https://arxiv.org/html/2605.26788#S5). Semantic similarity to the output goal carries the most weight because it captures goal relevance that surface overlap misses, while lexical and positional signals provide complementary corrections at lower weights.
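A small sketch of Eq. (1) shows how the weights trade the signals off against each other. The input values are made up, and the function name `relevance` is introduced for illustration only.

```python
ALPHA, BETA, GAMMA = 0.6, 0.2, 0.2  # weights stated in the text


def relevance(sem, lex, pos):
    """Weighted combination of the semantic, lexical, and positional signals."""
    return ALPHA * sem + BETA * lex + GAMMA * pos


# Two hypothetical shards with identical semantic and lexical scores:
# one sits mid-conversation (pos = 1.0), the other at an endpoint (pos = 0.5).
middle = relevance(sem=0.5, lex=0.1, pos=1.0)
endpoint = relevance(sem=0.5, lex=0.1, pos=0.5)
print(round(middle, 2), round(endpoint, 2))
```

With these inputs the positional term alone separates the two shards by 0.1, which is the size of the middle-turn correction that the weight $\gamma = 0.2$ allows.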
Signal 1 (Semantic): Deep alignment between the shard and the goal anchor, computed as $\mathrm{sem}(t) = \mathrm{cosim}(\mathrm{embed}(s^*), \mathrm{embed}_t)$, where $\mathrm{embed}(\cdot)$ denotes dense vector representations produced by the all-mpnet-base-v2 sentence transformer (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.26788#bib.bib3)) and $\mathrm{cosim}$ denotes cosine similarity between two vectors.
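The cosine similarity behind the semantic signal can be sketched in plain Python. The three-dimensional vectors are toy stand-ins for real sentence embeddings, which would come from the encoder named above, and the zero-vector fallback is an assumed convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the anchor and shard embeddings.
anchor_vec = [1.0, 0.0, 1.0]
shard_vec = [1.0, 1.0, 0.0]
sem = cosine_similarity(anchor_vec, shard_vec)
print(round(sem, 3))
```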
Signal 2 (Lexical): Surface keyword overlap between the goal anchor and the shard, computed as $\mathrm{lex}(t) = |W(s^*) \cap W(s_t)| \,/\, |W(s^*) \cup W(s_t)|$, where $W(s)$ denotes the set of unique words in the string $s$, $\cap$ is set intersection, and $\cup$ is set union. This is the Jaccard similarity between the word sets of the anchor and the shard, capturing surface-level overlap that semantic embeddings may miss when domain-specific keywords appear verbatim in both.
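The Jaccard overlap can be sketched as below. The tokenization rule (lower-casing and splitting on non-alphanumeric characters) is an assumption, since the text does not specify one, and the example shard is hypothetical.

```python
import re


def word_set(text):
    """Unique lower-cased word tokens (this tokenization rule is an assumption)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def lexical_overlap(anchor, shard):
    """Jaccard similarity between the word sets of the anchor and the shard."""
    a, s = word_set(anchor), word_set(shard)
    union = a | s
    return len(a & s) / len(union) if union else 0.0


anchor = "Calculate and give the final numerical answer"  # math anchor from the text
shard = "Give the total as a numerical value"  # hypothetical shard
lex = lexical_overlap(anchor, shard)
print(round(lex, 3))
```

Here three words ("give", "the", "numerical") are shared out of eleven distinct words, so the overlap is 3/11.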
Signal 3 (Positional): An inverted-U boost that counteracts both the suffix-sum bias of the RTG computation and the U-shaped positional attention of transformers (Liu et al., [2024](https://arxiv.org/html/2605.26788#bib.bib5); Vaswani et al., [2017](https://arxiv.org/html/2605.26788#bib.bib13)), calculated as $\mathrm{pos}(t) = 1 - |t - \mu| / (2\mu)$, where $t \in \{0, \ldots, T-1\}$ is the zero-indexed position of the current shard, $T$ is the total number of shards in the conversation, and $\mu = (T-1)/2$ is the midpoint of the shard indices. The intuition is straightforward: the formula assigns the highest score of $1.0$ to the middle turn and decreases symmetrically to $0.5$ at the first and last turns. This directly counteracts the transformer's natural tendency to over-attend to the beginning and end of the context while neglecting the middle, ensuring that constraints revealed in the middle of the conversation receive a relevance boost proportional to how far they are from the endpoints. This signal directly addresses the loss-of-middle-turns failure mode documented in Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)). A *minimum guarantee* for the last shard ensures that the final shard, which by the construction of RTG accumulates the lowest cumulative score, receives an RTG of at least the mean RTG across all shards: $\hat{R}_{T-1} \leftarrow \max\!\left(\hat{R}_{T-1},\; \frac{1}{T} \sum_t \hat{R}_t\right)$.
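The positional signal and the last-shard guarantee can be sketched together. The RTG values are made up, the function names are introduced for illustration, and the single-shard case (where $\mu = 0$ leaves the formula undefined) is handled by an assumed convention.

```python
def positional_signal(num_shards):
    """Inverted-U score: 1.0 at the midpoint, falling to 0.5 at both endpoints."""
    if num_shards == 1:
        return [1.0]  # assumed convention; the formula divides by zero when mu = 0
    mu = (num_shards - 1) / 2
    return [1 - abs(t - mu) / (2 * mu) for t in range(num_shards)]


def apply_last_shard_guarantee(rtg):
    """Raise the final RTG to at least the mean RTG over all shards."""
    adjusted = list(rtg)
    adjusted[-1] = max(adjusted[-1], sum(rtg) / len(rtg))
    return adjusted


pos = positional_signal(5)
# Hypothetical suffix-sum RTG values for a four-shard conversation.
rtg = apply_last_shard_guarantee([1.625, 1.125, 0.875, 0.125])
print(pos)
print(rtg)
```

In this example the final RTG rises from 0.125 to the mean of 0.9375, while the earlier values are left unchanged.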
Table [1](https://arxiv.org/html/2605.26788#S3.T1) illustrates the full RTG annotation on a real GSM8K example. Shards at or above the mean relevance are marked CONFIRMED; below-average shards are marked UNCERTAIN. In particular, shard $t = 4$ has the lowest raw relevance ($\mathrm{rel} = 0.12$), yet the last-shard guarantee raises its RTG from $0.12$ to $0.61$, preventing the model from discarding the final constraint.
Table 1: RTG annotation for a real GSM8K example with anchor “Calculate and give the final numerical answer”. † Raised by the last-shard minimum guarantee from raw $0.12$.
### 3.4 RTG-Conditioned Prompt and Confidence Labels
In the final turn, SeDT presents the complete history annotated with RTG and augments the system prompt with confidence labels: shards with $\mathrm{rel}(t) \geq \bar{\rho}$, where $\bar{\rho}$ is the mean relevance, are marked CONFIRMED; below-average shards are marked UNCERTAIN. This gives the model both a quantitative relevance signal and a categorical one, reinforcing the constraints to prioritize and directly targeting all four failure modes of Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)). *Premature commitment* is addressed by RTG reorienting attention toward the full goal-weighted history rather than the earliest shard. *Overreliance on incorrect intermediates* is reduced by UNCERTAIN labeling, which lowers the influence of low-relevance prior responses. *Loss of middle turns* is countered by Signal 3, which explicitly increases the relevance of middle turns. *Verbose drift* is prevented by excluding intermediate responses longer than 30 characters from embedding, which keeps noisy outputs from inflating relevance scores.
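The labeling rule and one possible rendering of the annotated history are sketched below. The bracketed line format is purely an assumption, since the text does not give the prompt template, and the shards and scores are made up.

```python
def confidence_labels(rel):
    """CONFIRMED if a shard's relevance is at or above the mean, else UNCERTAIN."""
    mean_rel = sum(rel) / len(rel)
    return ["CONFIRMED" if r >= mean_rel else "UNCERTAIN" for r in rel]


def annotate_history(shards, rtg, labels):
    """Render one annotated line per shard (this format is hypothetical)."""
    lines = [
        f"[turn {t} | RTG={r:.3f} | {label}] {shard}"
        for t, (shard, r, label) in enumerate(zip(shards, rtg, labels))
    ]
    return "\n".join(lines)


# Hypothetical shards with made-up relevance and RTG values.
shards = [
    "A baker sells muffins for 3 dollars each.",
    "She is open six days a week.",
    "She sells 40 muffins per day.",
    "How much does she earn in a week?",
]
rel = [0.5, 0.25, 0.75, 0.125]
rtg = [1.625, 1.125, 0.875, 0.125]
labels = confidence_labels(rel)
history = annotate_history(shards, rtg, labels)
print(history)
```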
### 3.5 RTG-Guided Self-Correction
When the final-turn RTG signal indicates a weak conversation, SeDT triggers a verification step. Naïve triggers such as $\hat{R}_{\text{last}} \leq \bar{R}$ are unsuitable because RTG is a suffix sum, making this condition trivially always true. Instead, we define a *weakness score*:
$$W = (-z_{\text{last}}) \times u_r, \tag{2}$$
where $z_{\text{last}} = (\mathrm{rel}(T-1) - \bar{\rho}) / \sigma_\rho$ is the normalized relevance of the final shard, $\bar{\rho}$ and $\sigma_\rho$ are the mean and standard deviation of the per-shard relevance scores, and $u_r$ is the fraction of UNCERTAIN shards. The adaptive threshold $\tau = 0.5 \times (1 - \bar{\rho})$ is reduced when the conversation is weak globally, allowing for a more aggressive correction. Self-correction occurs when $W > \tau$.
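The trigger in Eq. (2) can be sketched as follows. The use of the population standard deviation and the no-trigger fallback when all scores are equal are assumptions, and the relevance values are made up.

```python
from statistics import mean, pstdev


def weakness_trigger(rel):
    """Return (W, tau, triggered) for a list of per-shard relevance scores."""
    rho_bar = mean(rel)
    tau = 0.5 * (1 - rho_bar)
    sigma = pstdev(rel)  # population std is an assumption; the text does not say
    if sigma == 0.0:
        return 0.0, tau, False  # assumed: never trigger when all scores are equal
    z_last = (rel[-1] - rho_bar) / sigma
    uncertain_fraction = sum(r < rho_bar for r in rel) / len(rel)
    weakness = (-z_last) * uncertain_fraction
    return weakness, tau, weakness > tau


# Hypothetical conversations: one ends on a low-relevance shard, one does not.
w_weak, tau_weak, fired_weak = weakness_trigger([0.5, 0.25, 0.75, 0.125])
w_ok, tau_ok, fired_ok = weakness_trigger([0.25, 0.25, 0.75])
print(round(w_weak, 3), round(tau_weak, 3), fired_weak)
print(round(w_ok, 3), round(tau_ok, 3), fired_ok)
```

A final shard that scores above the mean makes $z_{\text{last}}$ positive and $W$ negative, so verification is skipped regardless of the threshold.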
The verifier's response replaces the initial answer only via *RTG-grounded acceptance*: the replacement is accepted if and only if $a_1 \neq a_v$ and $\mathrm{cov}(v, \mathcal{C}) > \mathrm{cov}(r_1, \mathcal{C})$, where $\mathcal{C}$ is the CONFIRMED shard set and coverage is defined as
$$\mathrm{cov}(y, \mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{s \in \mathcal{C}} \mathbf{1}\!\left[\exists\, w \in W_{>3}(s) : w \in W(y)\right], \tag{3}$$
that is, the fraction of confirmed shards for which at least one content word (longer than three characters) appears in the response $y$. This guard prevents a stochastic verifier from overwriting a correct answer simply by disagreeing.
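The coverage measure of Eq. (3) and the two acceptance guards can be sketched as below. The tokenization, the empty-set fallback, and the separation of extracted answers from full responses in the function signature are assumptions; the shards and responses are hypothetical.

```python
import re


def words(text):
    """Lower-cased word tokens; this tokenization rule is an assumption."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def coverage(response, confirmed_shards):
    """Fraction of CONFIRMED shards with a word of more than 3 characters in the response."""
    if not confirmed_shards:
        return 0.0  # assumed convention when no shard is CONFIRMED
    response_words = words(response)
    hits = 0
    for shard in confirmed_shards:
        content_words = {w for w in words(shard) if len(w) > 3}
        if content_words & response_words:
            hits += 1
    return hits / len(confirmed_shards)


def accept_replacement(answer_1, answer_v, response_1, response_v, confirmed_shards):
    """Both guards must hold: the answers differ and coverage strictly improves."""
    differs = answer_v != answer_1
    better = coverage(response_v, confirmed_shards) > coverage(response_1, confirmed_shards)
    return differs and better


# Hypothetical CONFIRMED shards and responses for a code task.
confirmed = [
    "The function must be named merge_sorted",
    "Return an empty list when both inputs are empty",
]
initial = "def merge(a, b): return sorted(a + b)"
verified = "def merge_sorted(a, b): return sorted(a + b)"
accepted = accept_replacement(initial, verified, initial, verified, confirmed)
print(coverage(initial, confirmed), coverage(verified, confirmed), accepted)
```

The example also shows how coarse the measure is: the second shard counts as covered merely because the word "return" appears in the response.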
Algorithm 1 SeDT: RTG-Conditioned Multi-Turn Inference
1: Input: shards $\{s_1, \dots, s_T\}$, intermediate responses $\{r_1, \dots, r_{T-1}\}$, goal anchor $s^*$, LLM $\mathcal{M}$, weights $\alpha = 0.6$, $\beta = 0.2$, $\gamma = 0.2$
2: Output: final answer $\hat{y}_T$
3: for $t = 0$ to $T-1$ do
4: $\text{text}_t \leftarrow s_t \,\|\, r_t$ if $|r_t| \leq 30$ else $s_t$
5: $\mathbf{e}_t \leftarrow \mathrm{ST}(\text{text}_t)$
6: end for
7: $\mathbf{e}^* \leftarrow \mathrm{ST}(s^*)$
8: for $t = 0$ to $T-1$ do
9: $\mathrm{sem}(t) \leftarrow \mathrm{cosim}(\mathbf{e}^*, \mathbf{e}_t)$
10: $\mathrm{lex}(t) \leftarrow |W(s^*) \cap W(s_t)| \,/\, |W(s^*) \cup W(s_t)|$
11: $\mu \leftarrow (T-1)/2$
12: $\mathrm{pos}(t) \leftarrow 1 - |t - \mu| \,/\, (2\mu)$
13: $\mathrm{rel}(t) \leftarrow \alpha \cdot \mathrm{sem}(t) + \beta \cdot \mathrm{lex}(t) + \gamma \cdot \mathrm{pos}(t)$
14: end for
15: for $t = T-1$ down to $0$ do
16: $\hat{R}_t \leftarrow \sum_{t'=t}^{T-1} \mathrm{rel}(t')$
17: end for
18: $\hat{R}_{T-1} \leftarrow \max\!\left(\hat{R}_{T-1},\; \tfrac{1}{T} \sum_t \hat{R}_t\right)$
19: $\bar{\rho} \leftarrow \tfrac{1}{T} \sum_t \mathrm{rel}(t)$
20: for $t = 0$ to $T-1$ do
21: $\ell_t \leftarrow$ CONFIRMED if $\mathrm{rel}(t) \geq \bar{\rho}$ else UNCERTAIN
22: end for
23: Build annotated history $\mathcal{H} \leftarrow \{[\hat{R}_t,\, \ell_t,\, s_t,\, r_t]\}_{t=0}^{T-1}$
24: $\hat{y}_1 \leftarrow \mathcal{M}(\mathcal{H},\, s_T)$
25: $\sigma_\rho \leftarrow \mathrm{std}(\{\mathrm{rel}(t)\})$, $u_r \leftarrow |\{t : \ell_t = \text{UNCERTAIN}\}| / T$
26: $z_{T-1} \leftarrow (\mathrm{rel}(T-1) - \bar{\rho}) \,/\, \sigma_\rho$
27: $W \leftarrow (-z_{T-1}) \times u_r$
28: $\tau \leftarrow 0.5 \times (1 - \bar{\rho})$
29: if $W > \tau$ then
30: $\hat{y}_v \leftarrow \mathcal{M}(\mathcal{H},\, s_T,\, \text{verify})$
31: $\mathcal{C} \leftarrow \{s_t : \ell_t = \text{CONFIRMED}\}$
32: if $\hat{y}_v \neq \hat{y}_1$ and $\mathrm{cov}(\hat{y}_v, \mathcal{C}) > \mathrm{cov}(\hat{y}_1, \mathcal{C})$ then
33: $\hat{y}_1 \leftarrow \hat{y}_v$
34: end if
35: end if
36: return $\hat{y}_1$
## 4 Experimental Setup
#### Dataset and tasks\.
We evaluate on the `microsoft/lost_in_conversation` HuggingFace dataset (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)), using all available sharded examples: actions (105 instructions from the Berkeley Function Calling Leaderboard), math / GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26788#bib.bib11)) (103 instructions), and code (Chen et al., [2021b](https://arxiv.org/html/2605.26788#bib.bib10); Jain et al., [2024](https://arxiv.org/html/2605.26788#bib.bib12)) (100 instructions; 45 HumanEval + 55 LiveCodeBench medium). These three tasks represent the binary-correctness subset of the benchmark, where the evaluation is unambiguous: functional accuracy for code, exact match for API calls, and numerical match for math.
#### Models\.
We evaluate three LLMs spanning three families: GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)) (OpenAI), Gemini 2.5 Flash (Team et al., [2023](https://arxiv.org/html/2605.26788#bib.bib19)) (Google), and Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)) (Meta), covering proprietary and open-weight models at different scales.
#### Evaluation protocol\.
Following Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)), we perform $n = 5$ independent simulations per example at temperature $T = 1.0$, with `max_tokens = 1000`. The metrics $\bar{P}$, $A$, and $U$ are computed per example from the $n$ run scores and averaged across the corpus. We terminate conversations early on a correct answer, preventing later incorrect turns from overwriting a correct intermediate answer; this is particularly consequential for code tasks, where partial specifications can yield valid partial solutions.
#### Sharded baseline\.
The Sharded baseline presents all prior shards as a plain multi-turn conversation with no relevance annotation. All other settings (the LLM, the system prompt, and the scorer) are identical to SeDT. Task-specific scorers follow the base paper exactly: flexible regex for math, AST-based evaluation for API calls, and test-case execution for code.
## 5 Results
Figure 2: Mean performance $\bar{P}$ (%) for Sharded (hatched) and SeDT (solid) across three tasks and three model families. SeDT consistently outperforms the sharded baseline on all nine model-task combinations.
### 5.1 Main Results
Table [2](https://arxiv.org/html/2605.26788#S5.T2) and Figure [2](https://arxiv.org/html/2605.26788#S5.F2) present the mean performance for all nine model-task combinations. SeDT consistently outperforms the flat-history sharded baseline on every model and every task.
In the actions task (105 examples), the gains range from $+11.2$ to $+37.7$. Gemini 2.5 Flash (Team et al., [2023](https://arxiv.org/html/2605.26788#bib.bib19)) records the largest single gain ($+37.7$) and the highest average gain across all three tasks ($+18.0$ points), reaching $78.1$, more than halfway back to its single-turn ceiling of $88.4$. GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)) and Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)) also improve substantially, by $+26.3$ and $+11.2$, respectively.
In math (103 examples, GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26788#bib.bib11))), all three models improve consistently: GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)) gains $+11.5$, Gemini 2.5 Flash (Team et al., [2023](https://arxiv.org/html/2605.26788#bib.bib19)) gains $+10.7$, and Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)) gains $+8.9$. In code (100 examples), the gains hold for all models: $+8.4$, $+5.6$, and $+10.2$ for GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)), Gemini 2.5 Flash (Team et al., [2023](https://arxiv.org/html/2605.26788#bib.bib19)), and Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)), respectively. The code gains are smaller in absolute terms, reflecting the inherent difficulty of LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2605.26788#bib.bib12)) medium problems, but SeDT improves consistently even in this challenging regime.
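The per-model average gains can be checked directly from the per-task gains reported in this subsection (actions, math, code). The following is only an arithmetic restatement of those numbers:

```latex
\begin{align*}
\text{Gemini 2.5 Flash:} \quad & \frac{37.7 + 10.7 + 5.6}{3} = \frac{54.0}{3} = 18.0 \\
\text{GPT-4o-mini:} \quad & \frac{26.3 + 11.5 + 8.4}{3} = \frac{46.2}{3} = 15.4 \\
\text{Llama 3.3-70B:} \quad & \frac{11.2 + 8.9 + 10.2}{3} = \frac{30.3}{3} = 10.1
\end{align*}
```

Gemini 2.5 Flash therefore has the highest average gain of the three models, consistent with the $+18.0$ points stated above.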
Table 2:Mean performanceP¯\\bar\{P\}\(%\) across three LLMs and three tasks\. Full†is the single\-turn upper bound fromLabanet al\.\([2025](https://arxiv.org/html/2605.26788#bib.bib1)\); Sharded is the flat\-history multi\-turn baseline; SeDT is our method\. Gain is the absolute improvement over Sharded\.
### 5.2 Reliability Analysis
Table [3](https://arxiv.org/html/2605.26788#S5.T3) presents aptitude $A$ and unreliability $U$ for all nine model-task combinations. SeDT improves $A$ on every combination, confirming that the gains reflect genuine improvement in best-case capability rather than noise reduction alone, and reduces $U$ in seven of nine, with the strongest reductions in math (Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)), $\Delta U = -7.6$) and actions (Gemini 2.5 Flash (Team et al., [2023](https://arxiv.org/html/2605.26788#bib.bib19)), $\Delta U = -9.6$). The two exceptions, Actions / Llama 3.3-70B (Grattafiori et al., [2024](https://arxiv.org/html/2605.26788#bib.bib14)) ($\Delta U = +0.2$) and Code / GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)) ($\Delta U = +2.6$), both occur where the Sharded $U$ is already the lowest within its task, leaving limited room for further reduction. This is consistent with SeDT tightening score distributions precisely where the reliability improvement called for by the lost-in-conversation literature (Laban et al., [2025](https://arxiv.org/html/2605.26788#bib.bib1)) is most needed.
Table 3: Aptitude $A$ (90th percentile) and unreliability $U$ (gap between the 90th and 10th percentiles, lower is better), computed per-example and averaged across the corpus following Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)).
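The metric definitions in the caption above can be sketched in a few lines. This is a minimal illustration, not the benchmark's reference implementation: the linear-interpolation percentile convention, the function names, and the example scores are all assumptions.

```python
from statistics import mean


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolating between order statistics (assumed convention)."""
    xs = sorted(values)
    if len(xs) == 1:
        return float(xs[0])
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def example_metrics(scores):
    """Mean performance, aptitude (p90), and unreliability (p90 - p10) for one example."""
    p90 = percentile(scores, 90)
    p10 = percentile(scores, 10)
    return mean(scores), p90, p90 - p10


def corpus_metrics(runs_per_example):
    """Average the per-example metrics across the corpus."""
    per_example = [example_metrics(scores) for scores in runs_per_example]
    p_bar = mean(m[0] for m in per_example)
    aptitude = mean(m[1] for m in per_example)
    unreliability = mean(m[2] for m in per_example)
    return p_bar, aptitude, unreliability


# Hypothetical scores (0-100) for two examples with five runs each.
runs = [
    [100, 100, 0, 100, 100],  # usually solved, one failed run
    [0, 0, 0, 0, 0],          # never solved
]
p_bar, A, U = corpus_metrics(runs)
print(p_bar, A, U)
```

With few runs per example, the 10th and 90th percentiles are interpolated from a handful of points, which is one reason the percentile gap can be a noisy estimate of the true spread.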
### 5.3 Signal Contribution Analysis
To understand which of the three relevance signals drives performance, Table [4](https://arxiv.org/html/2605.26788#S5.T4) reports $\bar{P}$, $A$, and $U$ on a representative subset of 25 GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26788#bib.bib11)) examples with GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)), ablating each signal in turn.
Each single-signal variant underperforms SeDT by 8.0 to 11.2 points in $\bar{P}$, confirming that no individual signal is sufficient. The semantic signal alone ($\bar{P} = 56.8$) performs worst despite receiving the highest weight in the full model, because cosine similarity without positional correction is vulnerable to suffix-sum bias. The position signal alone ($\bar{P} = 60.0$) provides the strongest single-signal baseline, consistent with middle-turn neglect being a primary failure mode. SeDT's weighted combination of all three signals ($\bar{P} = 68.0$) outperforms the best uniform weighting ($\bar{P} = 63.2$) by 4.8 points, suggesting that $\alpha = 0.6$, $\beta = 0.2$, $\gamma = 0.2$ reflects a meaningful prior on signal importance rather than an incidental hyperparameter choice.
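To make the weighting concrete, the sketch below combines three per-shard signals with the reported weights and converts the resulting rewards into a return-to-go score by suffix summation, as in Decision Transformer conditioning. The signal values, the function names, and the normalisation by the total are assumptions for illustration; the exact signal definitions are those given in the method section.

```python
ALPHA, BETA, GAMMA = 0.6, 0.2, 0.2  # weights reported in the text


def shard_relevance(semantic, lexical, positional):
    """Weighted combination of the three per-shard signals, each assumed to lie in [0, 1]."""
    return ALPHA * semantic + BETA * lexical + GAMMA * positional


def return_to_go(rewards):
    """Suffix sums: entry t is the total relevance from shard t to the end.

    Dividing by the overall total (an assumed normalisation) makes the first entry 1.0.
    """
    total = sum(rewards)
    suffix, running = [], 0.0
    for r in reversed(rewards):
        running += r
        suffix.append(running)
    suffix.reverse()
    return [x / total for x in suffix] if total > 0 else suffix


# Hypothetical (semantic, lexical, positional) signals for a four-shard conversation.
signals = [(0.9, 0.5, 1.0), (0.2, 0.1, 0.4), (0.8, 0.6, 0.4), (0.7, 0.3, 1.0)]
rewards = [shard_relevance(*s) for s in signals]
rtg = return_to_go(rewards)

for t, (r, g) in enumerate(zip(rewards, rtg), start=1):
    print(f"shard {t}: reward={r:.2f} rtg={g:.2f}")
```

Because a suffix sum accumulates every later reward, early shards always receive scores at least as large as later ones, which illustrates why a single signal without positional correction can be biased by the summation itself.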
To isolate whether the gains are due to the RTG prompt format or to the computed relevance signal, we include a Random RTG condition that replaces the computed scores with uniform random values in $[0,1]$ while keeping the prompt structure identical. Random RTG ($\bar{P} = 60.8$) outperforms the sharded baseline, suggesting that instruction-tuned models respond to numerical prefixes as a result of pretraining exposure (Ouyang et al., [2022](https://arxiv.org/html/2605.26788#bib.bib17); Wei et al., [2022](https://arxiv.org/html/2605.26788#bib.bib18)). SeDT ($\bar{P} = 68.0$) further outperforms Random RTG by 7.2 points in $\bar{P}$ and lowers $U$ by 13.6 points ($33.6 \to 20.0$), demonstrating that structured relevance adds substantial signal beyond the format alone.
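The control described above changes only the scores, not the prompt layout. A minimal sketch of that idea follows; the bracketed prefix format, the shard texts, the score values, and the function names are hypothetical and are not the paper's actual prompt template.

```python
import random


def annotate(shards, scores):
    """Prefix each shard with its score; the bracketed layout is a hypothetical format."""
    return "\n".join(
        f"[RTG={score:.2f}] [turn {i}] {text}"
        for i, (text, score) in enumerate(zip(shards, scores), start=1)
    )


def random_scores(n, seed=0):
    """Control condition: uniform random values in [0, 1), independent of shard content."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


shards = ["Compute the total cost.", "Each ticket is $12.", "There are 5 tickets."]
computed = [1.00, 0.55, 0.30]          # hypothetical relevance-derived scores
control = random_scores(len(shards))   # same layout, uninformative scores

prompt_sedt = annotate(shards, computed)
prompt_random = annotate(shards, control)
print(prompt_sedt)
print(prompt_random)
```

Keeping a single `annotate` function for both conditions ensures that any difference in outcomes is attributable to the scores rather than to formatting.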
Table 4: Signal contribution analysis on 25 GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26788#bib.bib11)) examples with GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)).
### 5.4 Self-Correction Mechanism Analysis
We evaluated the RTG-guided self-correction mechanism on 15 randomly drawn GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26788#bib.bib11)) examples (75 runs total, `random.seed(42)`) with GPT-4o-mini, recording the Guard 1 trigger and Guard 2 acceptance decisions for each run. The verifier triggers on 28.0% of runs (21/75), indicating that the condition $W > \tau$ fires selectively, in genuinely uncertain conversations. In most triggered cases the verifier confirms the original answer without replacement; when it disagrees, Guard 2 blocks replacement unless the verifier demonstrates strictly higher confirmed-shard coverage. Across all 75 runs, zero hurt cases are recorded, indicating that on this sample the mechanism introduces no errors while providing a conservative verification pathway.
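The two-guard decision logic described above can be sketched as follows. Here `w` is treated as an opaque scalar compared against the threshold `tau`, and `confirmed_shards`, `Candidate`, `self_correct`, and `run_verifier` are hypothetical names introduced for illustration; this is a sketch of the control flow only, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    answer: str
    confirmed_shards: int  # assumed coverage measure: shards the answer is confirmed to satisfy


def self_correct(original: Candidate, w: float, tau: float,
                 run_verifier: Callable[[Candidate], Candidate]) -> Candidate:
    """Return the answer to keep after applying both guards."""
    # Guard 1: invoke the verifier only when w exceeds the threshold.
    if w <= tau:
        return original
    verified = run_verifier(original)  # hypothetical callable wrapping one extra LLM call
    # The verifier agrees with the original answer: nothing to replace.
    if verified.answer == original.answer:
        return original
    # Guard 2: accept a replacement only with strictly higher confirmed-shard coverage.
    if verified.confirmed_shards > original.confirmed_shards:
        return verified
    return original


original = Candidate(answer="42", confirmed_shards=3)
kept = self_correct(original, w=0.9, tau=0.5,
                    run_verifier=lambda c: Candidate("41", 3))
print(kept.answer)  # equal coverage is not enough, so the original is kept
```

The strict inequality in Guard 2 is what makes the mechanism conservative: a disagreeing verifier with equal coverage never overrides the original answer.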
### 5.5 Why RTG Conditioning Works Without Training
A natural question is why a model responds meaningfully to return-to-go prefixes it was not explicitly trained on. Instruction-tuned LLMs are extensively exposed to relevance-annotated text during pretraining: code comments, document outlines, meeting agendas, and editorial markup all condition content on explicit importance signals, suggesting that RTG-style numerical prefixes fall within their learned distribution (Ouyang et al., [2022](https://arxiv.org/html/2605.26788#bib.bib17); Wei et al., [2022](https://arxiv.org/html/2605.26788#bib.bib18)). Additionally, the RTG-annotated history functions as an implicit in-context demonstration: the pattern $[\hat{R}_t, \ell_t, s_t]$ repeated across turns teaches the model that higher-scored shards carry more constraint weight, without any gradient signal (Brown et al., [2020](https://arxiv.org/html/2605.26788#bib.bib16)). Consistent with this, the two models with the highest average gains, Gemini 2.5 Flash ($+18.0$) and GPT-4o-mini ($+15.4$), are the most capable instruction-tuned models in our evaluation, while Llama 3.3-70B records the lowest average gain ($+10.1$).
### 5.6 Comparison with Prior Approaches
#### Recapitulation strategies
Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)) evaluate two recapitulation baselines on GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)): Recap, which includes a complete restatement of all prior shards as a final turn, and Snowball, which cumulatively restates shards at every turn. Both improve over the sharded baseline yet fall short of the single-turn ceiling, with Recap achieving an average $\bar{P}$ of 66.5 and Snowball 61.8 across four tasks. The per-task breakdowns for these baselines are not publicly available, which precludes a direct per-task comparison. SeDT achieves 68.7 on GPT-4o-mini across our three evaluated tasks, comparing favorably to both reported averages, while requiring no modification of the conversation structure and no assumption about which turn is final.
#### Intent reconstruction
Liu et al. ([2026](https://arxiv.org/html/2605.26788#bib.bib6)) propose a Mediator-Assistant framework that decouples intent inference from task execution, reconstructing ambiguous conversation history into a fully specified single-turn instruction before passing it to the assistant. The Mediator requires task-specific historical interaction data and invokes 2–3 additional LLM calls per inference. On GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)), it achieves $\bar{P}$ of 85.7, 77.7, and 66.9 on Actions, Math, and Code, respectively (Liu et al., [2026](https://arxiv.org/html/2605.26788#bib.bib6)). SeDT achieves 73.0, 74.4, and 58.8, remaining competitive on Math while requiring zero task-specific data and at most one additional LLM call. The two approaches are complementary: the Mediator addresses intent alignment through experience-driven reconstruction, while SeDT addresses context weighting through return-to-go conditioning.
## 6 Conclusion
We identified flat-context turn weighting as a structural root cause of multi-turn reliability collapse and proposed SeDT, a training-free inference-time method importing return-to-go conditioning from offline reinforcement learning. The key insight is that conversation shards map to trajectory steps and semantic relevance serves as reward. SeDT annotates each shard with a cumulative relevance score from three complementary signals and presents the annotated history at the final turn, changing no weights and requiring no training data. Across three LLMs and three tasks, SeDT outperforms the sharded baseline in all nine combinations, improving mean performance while reducing unreliability in seven. The self-correction mechanism introduces zero hurt cases.
## Limitations
Several design decisions and resource constraints scope the current evaluation. We use $n = 5$ runs per example rather than $n = 10$ as in Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)), following Liu et al. ([2026](https://arxiv.org/html/2605.26788#bib.bib6)), who also adopt $n = 5$ and demonstrate stable estimates at this sample size; the $U$ metric may nonetheless marginally underestimate the true variance. Due to computational constraints, we evaluated three of the six tasks in Laban et al. ([2025](https://arxiv.org/html/2605.26788#bib.bib1)), as the remaining tasks (Database, Data-to-Text, Summary) require SQL execution infrastructure or long-context pipelines that exceed our resource budget. Goal anchors are task-typed natural language strings that describe the expected output; they are analogous to system prompt design and can be specified with minimal effort for any new task, though automatic anchor generation remains future work. The signal weights $\alpha = 0.6$, $\beta = 0.2$, $\gamma = 0.2$ were validated only on 25 GSM8K examples with GPT-4o-mini (Achiam et al., [2023](https://arxiv.org/html/2605.26788#bib.bib15)); however, consistent gains across all nine model-task combinations with these fixed weights provide indirect evidence of robustness. A systematic weight sweep across all tasks and model families, a complete six-task evaluation, and automatic anchor generation are left to future work.
## E1 \- AI Assistant Disclosure
An AI writing assistant was used for paper-prose drafting, paragraph restructuring, and figure / table layout suggestions. The assistant was not used to generate or fabricate experimental numbers, dataset records, annotator labels, or finding interpretations. All empirical results, dataset construction scripts, and model selection decisions are author-generated and have been independently verified against the locked release. The specific assistant and version will be named in the camera-ready; anonymising the vendor here preserves double-blind review.
## Ethics Statement
This work uses publicly available datasets and API-based language models. No human subjects were involved and no new data were collected. The proposed method operates at inference time only and introduces no additional risks beyond those of the underlying models.
## References
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- MT-Bench-101: a fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7421–7454.
- R. Bansal, A. Zhang, R. Tiwari, L. Madaan, S. S. Duvvuri, D. Khatri, D. Brandfonbrener, D. Alvarez-Melis, P. Bhargava, M. S. Kale, et al. (2025) Let's (not) just put things in context: test-time training for long-context LLMs. arXiv preprint arXiv:2512.13898.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021a) Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems 34, pp. 15084–15097.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021b) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- C. Herlihy, J. Neville, T. Schnabel, and A. Swaminathan (2024) On overcoming miscalibrated conversational priors in LLM-based chatbots. arXiv preprint arXiv:2406.01633.
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024) LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- W. Kwan, X. Zeng, Y. Jiang, Y. Wang, L. Li, L. Shang, X. Jiang, Q. Liu, and K. Wong (2024) MT-Eval: a multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20153–20177.
- P. Laban, H. Hayashi, Y. Zhou, and J. Neville (2025) LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120.
- G. Liu, F. Zhu, R. Feng, C. Ma, S. Wang, and G. Meng (2026) Intent mismatch causes LLMs to get lost in multi-turn conversation. arXiv preprint arXiv:2602.07338.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- J. Rachapudi, P. Singh, R. Vatsi, P. Hambarde, and A. Shukla (2026a) RePAIR: interactive machine unlearning through prompt-aware model repair. arXiv preprint arXiv:2604.12820.
- J. Rachapudi, R. Vatsi, P. Hambarde, and A. Shukla (2026b) BID-LoRA: a parameter-efficient framework for continual learning and unlearning. arXiv preprint arXiv:2604.12686.
- J. Rachapudi, R. Vatsi, P. Singh, P. Hambarde, and A. Shukla (2026c) BackFlush: knowledge-free backdoor detection and elimination with watermark preservation in large language models. arXiv preprint arXiv:2605.12529.
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992.
- K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2020) MPNet: masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33, pp. 16857–16867.
- Y. Sun, C. Liu, K. Zhou, J. Huang, R. Song, W. X. Zhao, F. Zhang, D. Zhang, and K. Gai (2024) Parrot: enhancing multi-turn instruction following for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9729–9750.
- G. Team, R. Anil, S. Borgeaud, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. (2023) Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
- X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, and H. Ji (2023) MINT: evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.
- L. Zheng, W. Chiang, Y. Sheng, T. Li, S. Zhuang, Z. Wu, Y. Zhuang, Z. Li, Z. Lin, E. P. Xing, et al. (2023a) LMSYS-Chat-1M: a large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023b) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623.