Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning
Summary
This paper introduces satisfiable drift, a failure mode in which multi-turn reasoning systems silently violate prior commitments while their internal state stays logically consistent, and shows that it dominates contradiction among residual errors. The authors present DRIFT-Bench, a benchmark of 816 test problems, and find that after repair, 98-100% of residual errors are drift errors.
# Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning
Source: [https://arxiv.org/html/2605.23940](https://arxiv.org/html/2605.23940)
###### Abstract
How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system’s maintained state becomes unsatisfiable. We show that the dominant mode is instead *satisfiable drift*, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B–120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98–100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at [https://github.com/kaons-research/drift-bench](https://github.com/kaons-research/drift-bench).
## 1 Introduction
When an interactive assistant manages evolving structured state, it must honor every commitment it has already accepted while folding in new constraints. A scheduling tool that confirms “Bob is not on Tuesday” should never subsequently place Bob on Tuesday, yet current language models do exactly this with troubling regularity. What makes the failure especially dangerous is its subtlety. The system’s internal state remains logically consistent, no solver alarm fires, and the returned answer looks correct to every automated check that inspects only state consistency. We call this pattern *satisfiable drift*, and show that it accounts for the vast majority of residual errors even after structured repair feedback. Figure [2](https://arxiv.org/html/2605.23940#S1.F2) decomposes residual errors by channel: drift dominates across every model, while contradiction is near-invisible (Table [2](https://arxiv.org/html/2605.23940#S1.F2)).
[Figure 1: one bar per model, with legend Drift, Contradiction, Other. Labeled shares (drift / remainder): Qwen3-8B 100.0% / 0.0%; Qwen3-32B 98.1% / 1.9%; gpt-oss-20b 99.9% / 0.1%; gpt-oss-120b 99.9% / 0.1%.]

Figure 1: Residual error decomposition after MUS-Repair. Drift (answer violates a SAT ledger) accounts for 98–100% of residual errors; contradiction (red, at right) is near-invisible.
| Model | Best baseline Acc. | Best baseline Method | MUS-Repair Acc. | Drift % | Δ (pp) |
|---|---|---|---|---|---|
| Qwen3-8B | 28.2 | Direct | 30.0 | 100.0 | +1.8 |
| Qwen3-32B | 31.4 | CoT | 38.2 | 98.1 | +6.8 |
| gpt-oss-20b | 53.7 | Ledger | 68.7 | 99.9 | +15.0 |
| gpt-oss-120b | 54.0 | CoT | 62.7 | 99.9 | +8.7 |

Figure 2: Summary of main results. MUS-Repair outperforms the strongest non-MUS baseline in every setting. Drift % shows the share of residual errors from satisfiable drift rather than contradiction.
Existing evaluations collapse two fundamentally different failure modes into a single accuracy number (Wei et al., [2022](https://arxiv.org/html/2605.23940#bib.bib1); Yao et al., [2024](https://arxiv.org/html/2605.23940#bib.bib21); Madaan et al., [2023](https://arxiv.org/html/2605.23940#bib.bib8)). *Contradiction*, where the maintained state becomes unsatisfiable, is a state-level defect that formal methods can detect. *Satisfiable drift*, where the state is consistent but the assignment violates it, requires a second verification layer most systems lack. This paper separates the two with a solver-instrumented benchmark that checks both ledger satisfiability and assignment validity at every turn across 816 problems and four open-weight models (Table [2](https://arxiv.org/html/2605.23940#S1.F2)).
Findings. ❶ MUS-Repair is the strongest method in every setting, producing gains of +1.8 to +15.0 pp over the best non-MUS baseline, all of which survive paired tests after false-discovery correction. ❷ These gains do not eliminate the dominant failure mode. After structured feedback, 98–100% of remaining failures involve a consistent ledger with a violating assignment, while contradiction drops to near zero. Models stop contradicting themselves but keep forgetting prior commitments. ❸ The degradation with conversational depth is structural rather than a capacity bottleneck. Even gpt-oss-120b drops from 93% at turn one to 40% at turn ten; higher capability lifts the entire curve but does not flatten it.
Contributions. ❶ DRIFT-Bench, a solver-instrumented multi-turn benchmark covering three constraint domains (logic grid, scheduling, seating) with Z3-verified turn-level decomposition of contradiction and drift. ❷ A trigger-conditioned repair interface that routes unsatisfiable states through MUS localization and satisfiable assignment failures through policy diagnostics within a single retry loop. ❸ The first empirical demonstration that satisfiable drift dominates residual errors across all tested settings, arguing that contradiction and drift should be reported as separate evaluation metrics.
## 2 Related Work
#### Evaluation of multi-step reasoning.
Prompting strategies, search over intermediate traces, and tool-augmented agent architectures have produced substantial accuracy gains on reasoning benchmarks (Wei et al., [2022](https://arxiv.org/html/2605.23940#bib.bib1); Kojima et al., [2022](https://arxiv.org/html/2605.23940#bib.bib2); Wang et al., [2023](https://arxiv.org/html/2605.23940#bib.bib3); Yao et al., [2023a](https://arxiv.org/html/2605.23940#bib.bib5); Gao et al., [2023](https://arxiv.org/html/2605.23940#bib.bib6); Chen et al., [2023](https://arxiv.org/html/2605.23940#bib.bib7); Yao et al., [2023b](https://arxiv.org/html/2605.23940#bib.bib4); Hu et al., [2025](https://arxiv.org/html/2605.23940#bib.bib19); Han et al., [2025](https://arxiv.org/html/2605.23940#bib.bib20)). These advances primarily target single-turn performance or final-answer quality, and they have achieved impressive results in that scope. However, most evaluations do not instrument turn-level state validity under accumulated constraints. The COLLIE benchmark (Yao et al., [2024](https://arxiv.org/html/2605.23940#bib.bib21)) evaluates LLMs on constraint satisfaction, but it operates in a single-turn setting without multi-turn state tracking or failure-channel decomposition. Long-context and length-extrapolation studies document sensitivity to sequence length and position (Press et al., [2022](https://arxiv.org/html/2605.23940#bib.bib12); Liu et al., [2024](https://arxiv.org/html/2605.23940#bib.bib13)), yet they do not separate state inconsistency from assignment inconsistency conditional on a satisfiable state. Our benchmark is designed to fill this gap. Each turn is solver-verified for both ledger satisfiability and assignment validity.
#### Verifier\-guided repair and self\-correction\.
Iterative self-correction with verifier feedback produces strong aggregate improvements in mathematics and code (Cobbe et al., [2021](https://arxiv.org/html/2605.23940#bib.bib11); Lightman et al., [2024](https://arxiv.org/html/2605.23940#bib.bib10); Madaan et al., [2023](https://arxiv.org/html/2605.23940#bib.bib8); Shinn et al., [2023](https://arxiv.org/html/2605.23940#bib.bib9)). Tool-integrated reasoning systems, including systems that couple deterministic solvers with neural generation (Lyu et al., [2023](https://arxiv.org/html/2605.23940#bib.bib22); Lu et al., [2023](https://arxiv.org/html/2605.23940#bib.bib23)), improve single-turn accuracy, with Lyu et al. ([2023](https://arxiv.org/html/2605.23940#bib.bib22)) demonstrating that removing the deterministic external solver causes a 50-point accuracy drop on GSM8K. However, aggregate gains can obscure shifts in the composition of residual errors. Endpoint accuracy may improve substantially even as assignment-level drift remains unchanged or worsens, because the error types that repair eliminates are not necessarily the ones most visible to users. Recent work on the limits of LLM self-verification reaches a related conclusion. Stechly et al. ([2025](https://arxiv.org/html/2605.23940#bib.bib18)) show that when GPT-4 is tasked with both generating and critiquing its own answers, performance actually decreases, and that substantial gains require a sound external verifier regardless of critique richness. Our analysis extends this concern to interactive trajectories by decomposing residuals by operational failure type.
#### Formal methods in neural systems\.
Satisfiability solving and minimal unsatisfiable subset extraction are well-established tools in symbolic debugging and verification (de Moura and Bjørner, [2008](https://arxiv.org/html/2605.23940#bib.bib14); Liffiton and Sakallah, [2008](https://arxiv.org/html/2605.23940#bib.bib15); Belov and Marques-Silva, [2012](https://arxiv.org/html/2605.23940#bib.bib16); Biere et al., [2009](https://arxiv.org/html/2605.23940#bib.bib17)). A separate but related thread comes from task-oriented dialogue, where belief state updates track evolving user requirements (Young et al., [2013](https://arxiv.org/html/2605.23940#bib.bib24); Wu et al., [2019](https://arxiv.org/html/2605.23940#bib.bib25)). The ledger mechanism in our system draws on both traditions. It maintains formal constraint sets, as in symbolic verification, but updates them incrementally at each conversational turn, as in dialogue state tracking. Our contribution is adapting this combined toolbox to neural multi-turn traces through fixed turn-level solver instrumentation, trigger-conditioned repair routing, and paired inferential analysis over interactive trajectories.
## 3 Method
### 3.1 Notation and State Semantics
The multi-turn setting requires distinguishing between the raw model output and the structured state derived from it. We write $u_t$ for the user message at turn $t$, $a_t$ for the model’s response text, and $A_t$ for the structured assignment parsed from $a_t$ when parsing succeeds. The cumulative gold constraints are denoted by $\mathcal{C}_{1:t}$, extracted constraints by $\widehat{\mathcal{C}}_t$ (the model’s parse of new constraints at turn $t$), and ledger state by $L_t$. The predicate $\mathrm{SAT}(\cdot)$ indicates solver satisfiability; we write $\mathrm{UNSAT}(\cdot)$ for its negation.
Each problem is a turn sequence $\{u_t\}_{t=1}^{T}$ with cumulative gold constraints

$$\mathcal{C}_{1:t}=\bigcup_{\tau=1}^{t}\mathcal{C}_{\tau}^{\text{new}}.$$

Turn-level correctness is defined by constraint satisfaction rather than string match against a single witness assignment. The operational correctness predicate applies to the raw response and its parsed assignment:

$$\mathrm{Correct}(a_t)=\mathrm{Parse}(a_t)\ \land\ \mathrm{Complete}(A_t)\ \land\ \mathrm{Satisfies}(A_t,\mathcal{C}_{1:t}).$$
In implementation, `answer_correct` is obtained by checking satisfiability of $\mathcal{C}_{1:t}$ with the parsed $A_t$ injected as an assignment in Z3. This definition remains valid when multiple satisfying assignments exist. The same $\mathrm{Satisfies}$ predicate appears in drift diagnostics with $L_t$ replacing $\mathcal{C}_{1:t}$. The constraint set argument determines which notion of consistency is tested. We measure accuracy against gold cumulative constraints $\mathcal{C}_{1:t}$, while drift is a diagnostic defined against the model-maintained ledger $L_t$.
The distinction between ledger satisfiability and assignment validity is central to the paper. A turn can preserve $\mathrm{SAT}(L_t)$ while still violating active commitments through $\neg\mathrm{Satisfies}(A_t,L_t)$. This separation allows contradiction and drift to be measured as distinct channels rather than merged into a single error indicator. Formally, let $\Phi(A_t)$ denote the assignment constraints induced by the parsed answer. Then

$$\mathrm{Satisfies}(A_t,S)=\mathrm{SAT}(S\cup\Phi(A_t)).$$
The parser predicate $\mathrm{Parse}(a_t)$ is one only when the response is valid schema-conforming JSON for the domain. The completeness predicate $\mathrm{Complete}(A_t)$ is one only when each required entity is assigned exactly once. The ledger update is

$$\mathrm{Merge}(L_{t-1},\widehat{\mathcal{C}}_t)=L_{t-1}\cup\mathrm{Dedup}(\widehat{\mathcal{C}}_t),$$

where $\mathrm{Dedup}$ removes canonical duplicates before insertion.
These predicates partition turn outcomes into three categories. A turn is *consistent* when the ledger is satisfiable and the assignment respects it. When the ledger remains satisfiable but the assignment violates it, the turn exhibits *drift*. When the ledger itself becomes unsatisfiable, the turn exhibits *contradiction*. The critical distinction is that drift produces no solver alarm, making it invisible to any system that checks only state consistency. Figure [3](https://arxiv.org/html/2605.23940#S3.F3) illustrates all three outcomes on a four-turn scheduling trajectory where drift occurs at the final turn.
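The three-way partition can be sketched on the four-turn scheduling example from Figure 3. This is a minimal illustration, not the paper's implementation: it replaces Z3 with a brute-force search over a toy domain, and the names `is_sat`, `satisfies`, and `classify_turn` are hypothetical.

```python
from itertools import product

# Toy scheduling domain taken from the Figure 3 example.
PEOPLE = ("Alice", "Bob", "Carol")
DAYS = ("Mon", "Tue", "Wed")

def is_sat(constraints):
    """SAT(S): brute-force stand-in for a Z3 check (assumption, not the paper's code)."""
    for days in product(DAYS, repeat=len(PEOPLE)):
        candidate = dict(zip(PEOPLE, days))
        if all(holds(candidate) for holds in constraints):
            return True
    return False

def satisfies(assignment, constraints):
    """Satisfies(A, S) = SAT(S ∪ Φ(A)); for a complete A this is plain evaluation."""
    return all(holds(assignment) for holds in constraints)

def classify_turn(ledger, assignment):
    """Return 'contradiction', 'drift', or 'consistent' for one turn."""
    if not is_sat(ledger):
        return "contradiction"      # the ledger itself is UNSAT
    if not satisfies(assignment, ledger):
        return "drift"              # ledger is SAT, answer violates it
    return "consistent"

# Ledger after turn 4 of the example trajectory.
ledger = [
    lambda a: a["Alice"] == "Mon",      # turn 1
    lambda a: a["Bob"] == a["Alice"],   # turn 2
    lambda a: a["Carol"] == "Tue",      # turn 3
    lambda a: a["Bob"] != "Tue",        # turn 4
]
good = {"Alice": "Mon", "Bob": "Mon", "Carol": "Tue"}
drifted = {"Alice": "Mon", "Bob": "Tue", "Carol": "Tue"}
```

Calling `classify_turn(ledger, drifted)` reports drift even though `is_sat(ledger)` is true, which is exactly the case a state-only consistency check would miss.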
### 3.2 System Components
The evaluation system decomposes each turn into four stages, reflecting a deliberate separation of generation from verification. A generator $G$ produces the response $a_t$ given the current user message and prior ledger state. An extractor $E$ then parses the response alongside the user message to identify newly introduced constraints $\widehat{\mathcal{C}}_t$. These feed into a verifier $V$, which runs both solver-level satisfiability checks and policy-level checks on the parsed assignment. Finally, a repair policy $R$ examines the verifier output and decides whether to issue a retry with targeted feedback.
Algorithm 1: Turn processing with verification and optional repair.

1. Input: $u_t$, $L_{t-1}$, method $m$, turn $t$, repair budget $k$
2. Output: response $a'_t$, ledger $L_t$
3. $a_t \leftarrow G(u_t, L_{t-1})$
4. $\widehat{\mathcal{C}}_t \leftarrow E(u_t, a_t, t)$
5. $L_t \leftarrow L_{t-1} \cup \mathrm{Dedup}(\widehat{\mathcal{C}}_t)$
6. $(\mathrm{sat}_t,\, \mathcal{T}_t) \leftarrow V(L_t, a_t)$
7. if $m \neq \text{MUS-Repair}$ or $(\mathrm{sat}_t \land \mathcal{T}_t = \emptyset)$ then
8. return $(a_t, L_t)$
9. end if
10. $a'_t \leftarrow a_t$
11. for $i = 1$ to $k$ do
12. $\mathcal{U}_t \leftarrow \mathrm{MUS}(L_t)$ if $\neg\,\mathrm{sat}_t$ else $\emptyset$
13. $a'_t \leftarrow R\bigl(u_t, L_{t-1}, \mathrm{Render}(\mathcal{T}_t, \mathcal{U}_t)\bigr)$
14. $L_t \leftarrow L_{t-1} \cup \mathrm{Dedup}\bigl(E(u_t, a'_t, t)\bigr)$
15. $(\mathrm{sat}_t,\, \mathcal{T}_t) \leftarrow V(L_t, a'_t)$
16. if $\mathrm{sat}_t \land \mathcal{T}_t = \emptyset$ then break
17. end if
18. end for
19. return $(a'_t, L_t)$
The verifier combines solver-level satisfiability checks with policy-level checks on the parsed assignment, emitting a deterministic trigger code per failure type. The five codes are Answer-Ledger Conflict (ledger is SAT but the assignment violates it), Unsatisfiable Ledger (ledger is UNSAT), Incomplete Assignment (required entities missing), Answer Parse Failure (invalid JSON), and Constraint Extraction Failure (no constraints extracted). At runtime these codes route the repair decision; post-hoc they enable fine-grained failure decomposition. Algorithm [1](https://arxiv.org/html/2605.23940#alg1) shows how they integrate into the turn processing loop.
We evaluate four inference policies on this shared infrastructure: Direct, Chain-of-Thought, Ledger, and MUS-Repair. Figure [3](https://arxiv.org/html/2605.23940#S3.F3) compares their formal signatures on a scheduling example where drift occurs at the final turn. The repair step is active only for MUS-Repair and only when the verifier emits one or more failing triggers. Crucially, we hold the extractor and verifier fixed across all methods so that observed differences reflect reasoning and repair strategy, not variation in parsing or verification logic. This shared-infrastructure design isolates the comparison to the reasoning policy itself.
### 3.3 Minimal Unsatisfiable Subset for Repair
When $V$ detects an unsatisfiable ledger state, MUS-Repair computes a minimal unsatisfiable subset $\mathcal{U}_t\subseteq L_t$ such that

$$\mathrm{UNSAT}(\mathcal{U}_t)=1,\quad\forall\,\mathcal{U}'\subset\mathcal{U}_t,\ \mathrm{SAT}(\mathcal{U}')=1.$$

This subset is minimal in set inclusion and identifies a minimal committed constraint subset that is jointly inconsistent at turn $t$. The retry prompt receives trigger diagnostics and, for unsatisfiable states, the corresponding $\mathcal{U}_t$. The same retry channel is used for satisfiable assignment failures through policy triggers, so contradiction and drift are handled in one controlled repair interface. The repair feedback packet is

$$F_t=\begin{cases}(\mathcal{T}_t,\mathcal{U}_t),&\mathrm{UNSAT}(L_t)=1,\\ (\mathcal{T}_t,\emptyset),&\mathrm{SAT}(L_t)=1\ \land\ \mathcal{T}_t\neq\emptyset.\end{cases}$$

MUS is injected only for contradiction events, while satisfiable assignment failures are repaired with policy diagnostics and the prior ledger state.
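To make the minimality condition concrete, here is a sketch of deletion-based MUS extraction. The text does not say which extraction algorithm is used, so the deletion strategy is an assumption chosen for brevity; the brute-force `is_sat` again stands in for Z3, and all names are hypothetical.

```python
from itertools import product

def is_sat(constraints, variables, domain):
    """Brute-force SAT over named predicates; a stand-in for a solver call."""
    for values in product(domain, repeat=len(variables)):
        candidate = dict(zip(variables, values))
        if all(holds(candidate) for _, holds in constraints):
            return True
    return False

def deletion_mus(constraints, variables, domain):
    """Shrink an UNSAT list of (name, predicate) pairs to a minimal UNSAT subset."""
    if is_sat(constraints, variables, domain):
        raise ValueError("MUS is only defined for an unsatisfiable set")
    core = list(constraints)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if is_sat(trial, variables, domain):
            i += 1          # dropping constraint i restores SAT, so it is necessary
        else:
            core = trial    # still UNSAT without it, so it is redundant
    return core

PEOPLE = ("Alice", "Bob", "Carol")
DAYS = ("Mon", "Tue", "Wed")
unsat_ledger = [
    ("alice_mon", lambda a: a["Alice"] == "Mon"),
    ("bob_with_alice", lambda a: a["Bob"] == a["Alice"]),
    ("carol_tue", lambda a: a["Carol"] == "Tue"),
    ("bob_tue", lambda a: a["Bob"] == "Tue"),
]
core_names = [name for name, _ in deletion_mus(unsat_ledger, PEOPLE, DAYS)]
```

Here `carol_tue` is dropped because the conflict persists without it, while each of the three remaining constraints is needed to reproduce the inconsistency. That is the property the feedback packet relies on: the generator sees only the commitments that jointly conflict.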
### 3.4 Failure Channel Decomposition
The conceptual contradiction indicator is

$$I^{\text{unsat}}_t=\mathbf{1}[\mathrm{UNSAT}(L_t)].$$

The conceptual drift indicator is

$$I^{\text{drift}}_t=\mathbf{1}[\mathrm{SAT}(L_t)\land\neg\mathrm{Satisfies}(A_t,L_t)].$$

Current logs provide direct contradiction status and trigger-level diagnostics on MUS-Repair traces. We measure contradiction with `z3_sat` = 0 and drift with the Answer-Ledger Conflict trigger `answer_ledger_conflict`, which corresponds to a satisfiable ledger with a violating assignment. Parser and completeness failures are tracked by Answer Parse Failure, Incomplete Assignment, and Constraint Extraction Failure triggers.
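The two indicators can be turned into a residual-error breakdown with a few lines of code. The sketch below assumes a hypothetical per-turn log record with fields `correct`, `z3_sat`, and `triggers`; only `z3_sat` and `answer_ledger_conflict` are names taken from the text, and the choice of incorrect turns as the denominator is also an assumption.

```python
def residual_decomposition(records):
    """Percentage of incorrect turns attributed to each failure channel."""
    residual = [r for r in records if not r["correct"]]
    counts = {"drift": 0, "contradiction": 0, "other": 0}
    for r in residual:
        if r["z3_sat"] == 0:
            counts["contradiction"] += 1            # I_unsat: ledger is UNSAT
        elif "answer_ledger_conflict" in r["triggers"]:
            counts["drift"] += 1                    # I_drift: SAT ledger, violating answer
        else:
            counts["other"] += 1                    # parse, completeness, extraction
    total = len(residual)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}

# Hypothetical log: one correct turn, three drift turns, one contradiction turn.
toy_records = (
    [{"correct": True, "z3_sat": 1, "triggers": []}]
    + [{"correct": False, "z3_sat": 1, "triggers": ["answer_ledger_conflict"]}] * 3
    + [{"correct": False, "z3_sat": 0, "triggers": []}]
)
shares = residual_decomposition(toy_records)
```

Checking `z3_sat` first keeps the channels mutually exclusive, matching the definition of drift as conditional on a satisfiable ledger.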
Primary reporting uses turn-level accuracy as defined by $\mathrm{Correct}(a_t)$. The inference protocol is described in Section [4.3](https://arxiv.org/html/2605.23940#S4.SS3).
[Figure 3 panels. Left: scheduling example assigning Mon/Tue/Wed to Alice, Bob, and Carol. Turn 1: Alice on Monday. Turn 2: Bob with Alice. Turn 3: Carol on Tuesday. Turn 4: Bob $\neq$ Tuesday. Center: existing methods have no state tracking, no SAT verification, and no drift detection; MUS-Repair adds a constraint ledger $L_t$, a Z3 SAT check, and MUS-guided repair. Right: method signatures. Direct: $G: H_{1:t}\times u_t\rightarrow A_t$. Chain-of-Thought: $G: H_{1:t}\times u_t\rightarrow\{r_k\}_{k=1}^{K},\,A_t$. Ledger: $G: H_{1:t}\times u_t\times L_{t-1}\rightarrow A_t$ with $L_t\leftarrow L_{t-1}\cup E(u_t,a_t)$. MUS-Repair: $G: H_{1:t}\times u_t\times L_{t-1}\rightarrow A_t$, $V:\mathrm{SAT}(L_t)\wedge\mathrm{Satisfies}(A_t,L_t)$, $R:(\mathcal{T}_t,\mathcal{U}_t)\rightarrow A'_t$.]

Figure 3: Comparison of constraint reasoning approaches. Left: a four-turn scheduling trajectory where drift occurs at turn 4 (red). Center: properties of baselines vs. MUS-Repair. Right: formal method signatures with orange marking implicit context accumulation and teal marking explicit ledger state and solver verification. Only MUS-Repair detects the drift because it verifies both $\mathrm{SAT}(L_t)$ and $\mathrm{Satisfies}(A_t,L_t)$.
## 4 Experimental Setup
### 4.1 DRIFT-Bench
DRIFT-Bench problems are generated by a procedure that guarantees every gold interaction trajectory is satisfiable at each turn. For each domain $\mathcal{D}\in\{\texttt{logic\_grid},\texttt{scheduling},\texttt{seating}\}$, the generator samples entities, contextual framing, and one to three candidate constraints per turn. It accepts a candidate set only when the cumulative set remains satisfiable under Z3:

$$\mathrm{SAT}\!\left(\mathcal{C}_{1:t-1}\cup\widehat{\mathcal{C}}^{\text{cand}}_t\right)=1.$$
If the candidate set is unsatisfiable, the turn is resampled until acceptance or retry budget exhaustion\. This process ensures that every gold interaction trajectory is satisfiable at each turn\. Generation also removes duplicate constraints by canonical form before satisfiability checks, which prevents trivial repetition across turns\.
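The accept-or-resample loop described above can be sketched as rejection sampling. This is a toy illustration under assumptions: the constraint vocabulary (equality and inequality on a three-person scheduling domain), the brute-force `is_sat`, and the decision to stop early when the retry budget runs out are all hypothetical and not taken from the benchmark code.

```python
import random
from itertools import product

PEOPLE = ("Alice", "Bob", "Carol")
DAYS = ("Mon", "Tue", "Wed")

def is_sat(constraints):
    """Brute-force SAT over a dict of canonical form -> predicate."""
    for days in product(DAYS, repeat=len(PEOPLE)):
        a = dict(zip(PEOPLE, days))
        if all(holds(a) for holds in constraints.values()):
            return True
    return False

def sample_constraint(rng):
    """Return (canonical_form, predicate) for one random toy constraint."""
    person, day = rng.choice(PEOPLE), rng.choice(DAYS)
    if rng.random() < 0.5:
        return f"{person}=={day}", lambda a, p=person, d=day: a[p] == d
    return f"{person}!={day}", lambda a, p=person, d=day: a[p] != d

def generate_problem(num_turns, rng, max_retries=50):
    cumulative = {}     # canonical form -> predicate, so keys deduplicate
    turns = []
    for _ in range(num_turns):
        for _ in range(max_retries):
            candidates = dict(sample_constraint(rng) for _ in range(rng.randint(1, 3)))
            # Drop canonical duplicates before the satisfiability check.
            fresh = {k: v for k, v in candidates.items() if k not in cumulative}
            if fresh and is_sat({**cumulative, **fresh}):
                cumulative.update(fresh)
                turns.append(sorted(fresh))
                break
        else:
            break       # retry budget exhausted: stop with a shorter trajectory
    return turns, cumulative

turns, gold = generate_problem(4, random.Random(0))
```

Because every accepted turn keeps the cumulative set satisfiable, any prefix of the trajectory is satisfiable as well, which is the invariant the benchmark relies on.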
Gold correctness does not assume a unique assignment\. We verify by checking satisfiability of cumulative constraints conjoined with the parsed answer assignment, so a response is correct whenever it satisfies the active constraints, even if multiple assignments are valid\.
Each domain uses a fixed template that determines the structural parameters of generated problems\. Logic\-grid instances pair four entities with three categorical attributes, producing compact but combinatorially rich assignments\. Scheduling instances involve five to seven events assigned to temporal slots, with predicates such as ordering and simultaneity constraints\. Seating instances are the most spatially complex, placing six to eight participants around round or rectangular tables subject to adjacency, separation, and positional constraints\. Turn count is sampled between four and ten, with one to three new constraints introduced per turn\.
The final corpus has 1,020 problems with a fixed seed split of 816 test and 204 development instances. Table [1](https://arxiv.org/html/2605.23940#S4.T1) summarizes structural properties by domain; the Final column is the mean number of cumulative active constraints at the last turn.
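Table 1: DRIFT-Bench structure by domain.

| Domain | Split (test/dev/total) | Turns, mean [min, max] | Ent. | Vocab | Final |
|---|---|---|---|---|---|
| Logic-Grid | 272/68/340 | 6.89 [4, 10] | 4.00 | 4 | 11.57 |
| Scheduling | 272/68/340 | 7.06 [4, 10] | 5.92 | 6 | 12.83 |
| Seating | 272/68/340 | 6.97 [4, 10] | 7.01 | 7 | 11.20 |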
### 4.2 Evaluation Stack and Model Matrix
All methods run in a shared OpenAI-compatible serving stack with identical extraction, verification, and logging paths. The model matrix $\mathcal{M}$ contains Qwen3-8B, Qwen3-32B, gpt-oss-20b, and gpt-oss-120b, the method set is $\Pi=\{\text{Direct},\text{Chain-of-Thought},\text{Ledger},\text{MUS-Repair}\}$, and $\mathcal{P}$ is the 816-problem test split. The Qwen models are from the Qwen3 family (Yang and others, [2025](https://arxiv.org/html/2605.23940#bib.bib26)). The gpt-oss models are OpenAI’s open-weight releases at 20B and 120B parameters (OpenAI, [2025](https://arxiv.org/html/2605.23940#bib.bib27)), served locally through vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.23940#bib.bib28)) under the same stack as the Qwen runs. For gpt-oss evaluations we use deterministic decoding with temperature set to zero and default reasoning configuration, with paired comparisons run under fixed decoding controls within each model.
The full corpus sums to $|\mathcal{M}|\cdot|\Pi|\cdot\sum_{p\in\mathcal{P}}T_p=4\times 4\times 5{,}672=90{,}752$ turn evaluations. The gpt-oss-120b run is complete for all methods at 5,672 rows and 816 problems per method and is included in all main-text tables.
### 4.3 Inference Protocol and Robustness Checks
To ensure that reported accuracy differences are not artifacts of problem sampling, we construct 95% bootstrap confidence intervals at the problem level and assess pairwise significance using sign-permutation tests against Direct. We then apply Benjamini-Hochberg correction across all reported comparisons to control the false-discovery rate.
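The three statistical steps can be sketched with the standard library alone. This is a generic illustration of a percentile bootstrap, a two-sided sign-permutation test, and Benjamini-Hochberg adjustment, not the paper's analysis code; the resample counts, the seeds, and the input format (one paired accuracy difference per problem) are assumptions.

```python
import random

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def sign_permutation_p(diffs, n_perm=5000, seed=0):
    """Two-sided p-value: randomly flip the sign of each paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)    # add-one smoothing avoids p = 0

def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

In use, `diffs` would hold one value per test problem (method accuracy minus Direct accuracy on that problem), and the p-values from all method-model comparisons would be passed to `benjamini_hochberg` together.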
Prompt templates, JSON schema constraints, repair message format, and extraction prompts are documented in Appendix A\. Runtime controls are fixed across methods with temperature set to zero, maximum repair attempts set to two, maximum truncation retries set to two, and ledger token budget set to 3,000 unless a serving\-side safety clamp is required\.
One practical consideration is that the gpt\-oss models occasionally produce truncated responses, which could in principle affect accuracy estimates\. We verified that restricting analysis to non\-truncated responses preserves both the method ordering and the MUS\-Repair margins, indicating that truncation is not a systematic confound\.
## 5 Results
### 5.1 Primary Accuracy
Table [2](https://arxiv.org/html/2605.23940#S5.T2) presents the full results. MUS-Repair is the strongest method in every model setting, with gains over Direct ranging from +2.0 pp on Qwen3-8B to +16.2 pp on gpt-oss-20b. Every MUS-Repair comparison survives paired problem-level permutation tests after Benjamini-Hochberg correction ($q_{\mathrm{FDR}}<0.03$ in all cases), and the pattern holds when tested against each model’s strongest non-MUS comparator rather than Direct alone (Appendix Table [8](https://arxiv.org/html/2605.23940#A2.T8)). Structured repair helps the weakest model (Qwen3-8B, 30%) and the strongest (gpt-oss-20b, 69%) alike. This consistency across a wide capability range suggests that the benefit comes from the verification-and-retry mechanism itself rather than from model-specific artifacts. The other methods show mixed results. Ledger significantly hurts Qwen3-8B (−3.1 pp, $q=0.003$) and gpt-oss-120b (−2.3 pp, $q=0.022$), while CoT produces modest gains on Qwen3-32B and gpt-oss-120b but not on the other two models.
Table 2: Turn-level accuracy, paired inferential tests versus Direct ($n=816$), and depth retention. Highlighted (bold) rows show MUS-Repair. Retain % = turn-10 accuracy / turn-1 accuracy, measuring how well each method preserves performance as constraints accumulate.

| Model | Method | Acc. (%) | Δ (pp) | 95% CI (pp) | $q_{\mathrm{FDR}}$ | Retain % |
|---|---|---|---|---|---|---|
| Qwen3-8B | Direct (baseline) | 28.19 | — | — | — | 5.1 |
| Qwen3-8B | Chain-of-Thought | 27.91 | −0.19 | [−2.01, +1.63] | 0.8422 | 6.7 |
| Qwen3-8B | Ledger | 25.23 | −3.14 | [−4.99, −1.31] | 0.0034 | 5.7 |
| **Qwen3-8B** | **MUS-Repair** | **30.01** | **+2.03** | **[+0.32, +3.71]** | **0.0295** | **8.8** |
| Qwen3-32B | Direct (baseline) | 28.93 | — | — | — | 6.5 |
| Qwen3-32B | Chain-of-Thought | 31.44 | +2.54 | [+0.73, +4.38] | 0.0187 | 2.5 |
| Qwen3-32B | Ledger | 31.44 | +1.63 | [−0.93, +4.20] | 0.2204 | 23.6 |
| **Qwen3-32B** | **MUS-Repair** | **38.22** | **+9.03** | **[+6.98, +11.00]** | **0.0002** | **15.7** |
| gpt-oss-20b | Direct (baseline) | 51.80 | — | — | — | 23.1 |
| gpt-oss-20b | Chain-of-Thought | 50.35 | −1.39 | [−3.05, +0.27] | 0.1183 | 19.4 |
| gpt-oss-20b | Ledger | 53.70 | +1.91 | [+0.25, +3.60] | 0.0327 | 29.1 |
| **gpt-oss-20b** | **MUS-Repair** | **68.71** | **+16.20** | **[+14.52, +17.90]** | **0.0002** | **48.3** |
| gpt-oss-120b | Direct (baseline) | 52.12 | — | — | — | 30.2 |
| gpt-oss-120b | Chain-of-Thought | 53.95 | +2.04 | [+0.36, +3.70] | 0.0295 | 24.2 |
| gpt-oss-120b | Ledger | 50.02 | −2.29 | [−4.03, −0.59] | 0.0220 | 36.4 |
| **gpt-oss-120b** | **MUS-Repair** | **62.68** | **+10.05** | **[+8.40, +11.72]** | **0.0002** | **42.9** |
### 5.2 Capability Scaling of Repair Gains
Stronger models benefit more from structured repair. Raw accuracy gains are larger for more capable models, but comparing absolute improvements across models with different baselines can be misleading. A more informative measure is relative lift, which normalizes the MUS-Repair gain by each model’s best non-MUS baseline.

$$\rho_m=\frac{A_m^{\text{MUS}}-\max\!\left(A_m^{\text{Direct}},A_m^{\text{CoT}},A_m^{\text{Ledger}}\right)}{\max\!\left(A_m^{\text{Direct}},A_m^{\text{CoT}},A_m^{\text{Ledger}}\right)}.$$

Relative lift rises from 6.4% on Qwen3-8B to 27.9% on gpt-oss-20b before dropping to 16.2% on gpt-oss-120b. The non-monotonic drop at gpt-oss-120b resists simple scaling predictions, but the overall trajectory still shows that repair remains materially beneficial even at the highest capability level tested. One plausible explanation centers on instruction-following fidelity. The repair signal is a structured prompt containing trigger codes and a minimal unsatisfiable subset. Converting this signal into a corrected assignment requires precisely the kind of structured instruction following that improves with model capability. A model that cannot parse the signal treats the retry as noise.
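The relative-lift formula can be checked directly against the accuracies in Table 2. The sketch below is only a worked calculation: the dictionary layout and the short keys `CoT` and `MUS` are illustrative choices, while the numbers are copied from the table.

```python
# Turn-level accuracy (%) per model and method, copied from Table 2.
ACC = {
    "Qwen3-8B":     {"Direct": 28.19, "CoT": 27.91, "Ledger": 25.23, "MUS": 30.01},
    "Qwen3-32B":    {"Direct": 28.93, "CoT": 31.44, "Ledger": 31.44, "MUS": 38.22},
    "gpt-oss-20b":  {"Direct": 51.80, "CoT": 50.35, "Ledger": 53.70, "MUS": 68.71},
    "gpt-oss-120b": {"Direct": 52.12, "CoT": 53.95, "Ledger": 50.02, "MUS": 62.68},
}

def relative_lift(acc):
    """rho_m: MUS-Repair gain normalized by the best non-MUS baseline."""
    best = max(acc["Direct"], acc["CoT"], acc["Ledger"])
    return (acc["MUS"] - best) / best

lifts = {model: 100 * relative_lift(acc) for model, acc in ACC.items()}
for model, lift in lifts.items():
    print(f"{model}: {lift:.1f}%")
```

Using the two-decimal table values, the results land within about 0.1 percentage points of the figures quoted in the text; the small differences are consistent with the text having been computed from one-decimal accuracies.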
Ledger-only tracking, by contrast, is not uniformly positive. It hurts Qwen3-8B (−3.0 pp), is near-neutral on Qwen3-32B, helps gpt-oss-20b (+1.9 pp), and drops again on gpt-oss-120b (−2.1 pp). Explicit state tracking trades control benefits against the cost of occupying prompt context, and the balance shifts with model capability (Section [7](https://arxiv.org/html/2605.23940#S7)).
### 5.3 Depth Degradation

Every model shows steep accuracy decline with turn depth (Figure [4](https://arxiv.org/html/2605.23940#S5.F4)). The decline is dramatic across the full model range. Qwen3-8B drops from 72% at turn one to 6% at turn ten under MUS-Repair, and even gpt-oss-120b falls from 93% to 40% over the same span. Crucially, the shape of the decline is steep rather than gradual, consistent with the probability of violating at least one constraint growing combinatorially as the active set expands. Higher capability lifts the entire curve but does not flatten it. This pattern is reminiscent of the positional sensitivity documented by Liu et al. ([2024](https://arxiv.org/html/2605.23940#bib.bib13)) for long contexts, but here the degradation is temporal rather than positional. Long-horizon state maintenance appears to be a qualitatively harder problem that will likely require architectural support beyond pure scaling.

Figure 4: Per-turn accuracy curves. Left $2\times 2$ grid: one panel per model, each showing all four methods (colors encode *method*). Right panel: MUS-Repair across all four models (colors encode *model*). Shaded bands are 95% bootstrap intervals. Higher capability lifts the curve but does not flatten it.
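As a small worked example, the Retain % metric from Table 2 can be recomputed from the turn-1 and turn-10 figures quoted in this section. This is only a sketch: the helper name `retain_percent` is an assumption, and the inputs are the rounded percentages from the text, so the results land close to, but not exactly on, the 8.8 and 42.9 reported in Table 2.

```python
def retain_percent(turn1_acc, turn10_acc):
    """Retain % = turn-10 accuracy / turn-1 accuracy, expressed in percent."""
    if turn1_acc <= 0:
        raise ValueError("turn-1 accuracy must be positive")
    return 100.0 * turn10_acc / turn1_acc


# Rounded MUS-Repair endpoints quoted in Section 5.3: (turn 1, turn 10), in %.
ENDPOINTS = {"Qwen3-8B": (72.0, 6.0), "gpt-oss-120b": (93.0, 40.0)}

RETAIN = {model: retain_percent(t1, t10) for model, (t1, t10) in ENDPOINTS.items()}

for model, value in RETAIN.items():
    print(f"{model}: retains {value:.1f}% of turn-1 accuracy at turn 10")
```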
### 5.4 Domain Structure

Seating is the hardest domain for every model, and scheduling is the easiest for every model except Qwen3-8B, where logic-grid scores higher (43.1% versus 34.7%) (Table [3](https://arxiv.org/html/2605.23940#S5.T3)). The seating result holds from the smallest Qwen model to the largest gpt-oss run, indicating that the difficulty is inherent in the constraint topology rather than an artifact of any single model's weaknesses. Seating problems involve circular positional constraints (adjacency, separation, wrapping) that require globally consistent placement of 6–8 entities, whereas scheduling admits more localized solutions. The scheduling-seating gap is largest on gpt-oss-120b, where scheduling reaches 87.2% while seating remains at 36.3%.
Table 3: MUS-Repair domain-conditioned accuracy (%). Seating is consistently hardest; scheduling is easiest for every model except Qwen3-8B.

| Model | Logic-Grid (%) | Scheduling (%) | Seating (%) |
|---|---|---|---|
| Qwen3-8B | 43.1 | 34.7 | 12.1 |
| Qwen3-32B | 43.9 | 55.7 | 14.8 |
| gpt-oss-20b | 81.4 | 85.7 | 38.7 |
| gpt-oss-120b | 64.2 | 87.2 | 36.3 |
## 6 Failure Analysis

### 6.1 Contradiction and Drift Decomposition

Because MUS-Repair logs solver status and trigger codes at every retry, we can decompose residual errors into three distinct channels: contradiction, in which the ledger becomes unsatisfiable; drift, in which the ledger remains satisfiable but the assignment violates it; and formatting or extraction errors. We measure contradiction by `z3_sat = 0` and drift by the Answer-Ledger Conflict trigger, which fires when a satisfiable ledger accompanies a returned assignment that violates active constraints.
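The three-channel decomposition can be sketched as a two-stage check: first ask whether the ledger itself is satisfiable, then ask whether the returned assignment respects it. The code below is an illustration under stated assumptions, not the benchmark's implementation. Brute-force enumeration stands in for the Z3 call, constraints are plain Python predicates, and the names `ledger_is_satisfiable` and `classify_turn`, the toy ledger, and the order in which channels are tested are all hypothetical.

```python
from itertools import product


def ledger_is_satisfiable(constraints, entities, values):
    """Brute-force stand-in for a solver call: try every total assignment."""
    for combo in product(values, repeat=len(entities)):
        candidate = dict(zip(entities, combo))
        if all(check(candidate) for check in constraints):
            return True
    return False


def classify_turn(constraints, entities, values, answer):
    """Return 'other', 'contradiction', 'drift', or 'ok' for one turn."""
    if answer is None or set(answer) != set(entities):
        return "other"            # parse failure or incomplete assignment
    if not ledger_is_satisfiable(constraints, entities, values):
        return "contradiction"    # analogous to z3_sat = 0
    if not all(check(answer) for check in constraints):
        return "drift"            # satisfiable ledger, violating answer
    return "ok"


ENTITIES = ["Alice", "Bob"]
DAYS = ["Mon", "Tue"]
LEDGER = [
    lambda a: a["Bob"] != "Tue",        # "Bob is not on Tuesday"
    lambda a: a["Alice"] != a["Bob"],   # Alice and Bob on different days
]

GOOD = {"Alice": "Tue", "Bob": "Mon"}
DRIFTED = {"Alice": "Mon", "Bob": "Tue"}   # ledger is fine, answer is not
UNSAT_LEDGER = LEDGER + [lambda a: a["Bob"] == "Tue"]

RESULTS = [
    classify_turn(LEDGER, ENTITIES, DAYS, GOOD),
    classify_turn(LEDGER, ENTITIES, DAYS, DRIFTED),
    classify_turn(UNSAT_LEDGER, ENTITIES, DAYS, GOOD),
    classify_turn(LEDGER, ENTITIES, DAYS, {"Alice": "Mon"}),
]
print(RESULTS)
```

The point of the sketch is that a system running only the satisfiability stage would report the `DRIFTED` case as healthy, since the ledger has a model even though the answer is not one.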
Contradiction repair does not remove the dominant residual failure mode. This is the paper's central empirical finding. Drift accounts for 98.1% to 100.0% of residual errors across all settings (Table [4](https://arxiv.org/html/2605.23940#S6.T4)), while contradiction drops to 0.0–1.9%, with near-zero counts in the gpt-oss runs. The magnitude of this imbalance is best appreciated through raw counts. On Qwen3-8B, all 3,970 residual errors are drift, with zero contradiction events. On Qwen3-32B, which triggers unsatisfiable states more frequently due to aggressive constraint extraction, contradiction still accounts for only 66 of 3,504 residual errors (1.9%). The remaining errors, categorized as "Other," encompass parse failures, incomplete assignments, and extraction errors. These are comparatively rare and model-dependent. On gpt-oss-20b, Answer Parse Failure triggers fire 664 times and Constraint Extraction Failure triggers fire 559 times across repair retries (Table [5](https://arxiv.org/html/2605.23940#S6.T5)), but are absorbed by successful retries before becoming residual errors. Only 1 contradiction and 0 extraction failures remain in the residual set. The practical implication is that reducing unsatisfiable ledgers is necessary for reliability, but most remaining user-facing errors stem from assignments that violate a satisfiable maintained state.
Table 4: MUS-Repair failure channel decomposition over residual errors. The visual decomposition appears in Figure [2](https://arxiv.org/html/2605.23940#S1.F2).

| Model | Drift | UNSAT | Other | Drift (%) | UNSAT (%) | Other (%) |
|---|---|---|---|---|---|---|
| Qwen3-8B | 3970 | 0 | 0 | 100.0 | 0.0 | 0.0 |
| Qwen3-32B | 3438 | 66 | 0 | 98.1 | 1.9 | 0.0 |
| gpt-oss-20b | 1774 | 1 | 0 | 99.9 | 0.1 | 0.0 |
| gpt-oss-120b | 2115 | 2 | 0 | 99.9 | 0.1 | 0.0 |
### 6.2 Trigger Composition, Repair Outcomes, and Residual Overlap

Trigger event counts (Table [5](https://arxiv.org/html/2605.23940#S6.T5)) reinforce the decomposition. Answer-Ledger Conflict dominates every model, firing 12,089 times on Qwen3-8B and 5,300 times on gpt-oss-20b. Unsatisfiable Ledger, by contrast, concentrates in Qwen3-32B (212 events) and stays scarce elsewhere (≤4 events on the gpt-oss models). Note that Table [4](https://arxiv.org/html/2605.23940#S6.T4) counts final-row outcomes while Table [5](https://arxiv.org/html/2605.23940#S6.T5) counts trigger events across retries, so magnitudes differ by construction.
Not all trigger types respond equally to repair. Schema-completion failures, such as missing entities or malformed JSON, are relatively straightforward to fix with a retry prompt because the error is localized and well-defined. Assignment-level consistency failures, by contrast, require the model to simultaneously satisfy all active constraints while revising its answer, a much harder task. The data bear this out. On Qwen3-8B, post-repair accuracy reaches 65.5% for Incomplete Assignment triggers but only 4.0% for Answer-Ledger Conflict triggers. On Qwen3-32B, the same contrast is 69.8% versus 5.5%. This recoverability gap explains why drift persists as the dominant failure mode even after multiple repair attempts.
Table 5: Trigger event counts in MUS-Repair traces by model. Answer-Ledger Conflict dominates in every setting, while Unsatisfiable Ledger is concentrated in Qwen3-32B.

| Trigger | Qwen3-8B | Qwen3-32B | gpt-oss-20b | gpt-oss-120b |
|---|---|---|---|---|
| Answer-Ledger Conflict | 12,089 | 10,489 | 5300 | 6255 |
| Incomplete Assignment | 218 | 528 | 1036 | 31 |
| Answer Parse Failure | 0 | 2 | 664 | 810 |
| Constraint Extraction Failure | 0 | 2 | 559 | 36 |
| Unsatisfiable Ledger | 1 | 212 | 2 | 4 |

Residual error overlap across models is high rather than fragmented. Qwen3-8B covers 95.0% of gpt-oss-20b residual MUS-Repair errors, and Qwen3-32B covers 92.3%. This shared residual set points to common hard regions of the benchmark rather than disjoint model-specific failure pockets. The problems that resist repair for one model tend to resist it for all of them, which suggests the difficulty is intrinsic to the constraint structure rather than tied to any particular model's weaknesses.
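The coverage figures above are set-overlap statistics over residual error rows. The sketch below shows one way to compute overlap, Jaccard similarity, and per-model shares; the function name `overlap_metrics` is hypothetical and the row identifiers are synthetic placeholders. The set sizes (3,970 and 1,775 residual rows, 1,687 shared) are taken from Table 4 and the pairwise-overlap table in Appendix B, under the assumption that both tables count the same residual rows.

```python
def overlap_metrics(errors_a, errors_b):
    """Overlap, Jaccard, and per-model shares for two sets of error-row ids."""
    shared = len(errors_a & errors_b)
    union = len(errors_a | errors_b)
    return {
        "overlap": shared,
        "jaccard": shared / union if union else 0.0,
        "share_a": shared / len(errors_a) if errors_a else 0.0,
        "share_b": shared / len(errors_b) if errors_b else 0.0,
    }


# Synthetic row ids arranged to reproduce the reported counts:
# |A| = 3970 (Qwen3-8B), |B| = 1775 (gpt-oss-20b), |A ∩ B| = 1687.
QWEN_8B = set(range(0, 3970))
GPT_20B = set(range(3970 - 1687, 3970 - 1687 + 1775))

METRICS = overlap_metrics(QWEN_8B, GPT_20B)
print({name: round(value, 3) for name, value in METRICS.items()})
```

Here `share_b` is the fraction of gpt-oss-20b residual errors that Qwen3-8B also gets wrong, which is the quantity behind the 95.0% coverage figure.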
## 7 Discussion
The central finding is not that MUS\-Repair works, but what it leaves behind\. After contradiction\-aware repair, the residual error mass concentrates in satisfiable drift rather than in unsatisfiable states\. This asymmetry has consequences for system design, scaling expectations, and evaluation methodology\.
#### Drift dominance in deployed systems\.
A system that ships satisfiability checks as its primary reliability gate will miss the majority of user-visible failures in this benchmark family. The failure mode is insidious because it evades every standard check. The internal ledger remains consistent, the solver raises no alarm, and the returned answer nonetheless violates a commitment the user already accepted. Unlike contradiction, which at least signals that something has gone wrong, drift produces confident answers that pass every automated check inspecting only state consistency. For scheduling or resource allocation assistants, this means a user who asks "remind me of the constraints so far" receives a valid summary while the assignment silently breaks one of those same constraints. Detecting this class of error requires a second verification layer that explicitly checks the assignment against the maintained state. This architectural requirement parallels the finding of Stechly et al. ([2025](https://arxiv.org/html/2605.23940#bib.bib18)) that sound external verification is necessary regardless of critique sophistication.
#### Why stronger models benefit more from symbolic feedback\.
Relative MUS\-Repair lift rises from 6\.4% on Qwen3\-8B to 27\.9% on gpt\-oss\-20b\. We see two complementary mechanisms at work, both supported by the trigger data\. The first is baseline error composition\. On Qwen3\-8B, Answer\-Ledger Conflict accounts for 12,089 of 12,308 total triggers \(98%\), meaning nearly all repair attempts target drift, a failure type that resists retry\. On gpt\-oss\-20b, the trigger mix is more diverse \(5,300 drift, 1,036 incomplete, 664 parse\), giving the repair loop a broader surface of recoverable errors\. The second mechanism is instruction\-following fidelity\. The repair signal is a structured prompt containing trigger codes, violated constraints, and a minimal unsatisfiable subset\. Converting this signal into a corrected assignment requires the kind of structured instruction following that improves with model capability; a model that cannot parse the signal treats the retry as noise\. The trigger data bear this out indirectly\. Post\-repair accuracy on Answer\-Ledger Conflict triggers rises from 4\.0% on Qwen3\-8B to 33\.3% on gpt\-oss\-20b, a factor\-of\-eight improvement, suggesting that the larger model is better at acting on the structured feedback even for the hardest failure type\. The non\-monotonic drop at gpt\-oss\-120b complicates this picture\. One possibility is that the largest model’s implicit state tracking is already strong enough that the marginal value of explicit MUS feedback diminishes, even though absolute accuracy still improves\.
#### Depth collapse as accumulation\.
The depth curves in Figure[4](https://arxiv.org/html/2605.23940#S5.F4)present perhaps the most challenging finding for scaling\-based solutions\. The steep decline, rather than gradual erosion, suggests that each new constraint does not simply add a fixed probability of error\. Instead, the probability of violating at least one constraint grows combinatorially with the active set, creating a qualitatively harder problem at each successive turn\. This framing suggests that flattening the depth curve will require mechanisms that scale sublinearly with constraint count, such as hierarchical state abstractions or incremental re\-verification, rather than relying on raw model capacity alone\.
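One way to make the accumulation argument concrete is to compare two toy failure models. If each of $n$ active constraints failed independently with probability $p$, the chance of at least one violation would be $1-(1-p)^n$, which never exceeds $np$ and so grows at most linearly. A steeper curve needs interactions between constraints. The sketch below uses a purely hypothetical pairwise model, in which each of the $\binom{n}{2}$ constraint pairs can fail with probability $q$. Neither model is fitted to the benchmark data, and the rate of 0.02 is an arbitrary illustration.

```python
from math import comb


def p_violation_independent(p, n):
    """P(at least one of n constraints fails), each failing independently with prob. p."""
    return 1.0 - (1.0 - p) ** n


def p_violation_pairwise(q, n):
    """Same quantity when each of the C(n, 2) constraint pairs fails with prob. q."""
    return 1.0 - (1.0 - q) ** comb(n, 2)


RATE = 0.02  # hypothetical per-constraint and per-pair failure rate

CURVE = [
    (n, p_violation_independent(RATE, n), p_violation_pairwise(RATE, n))
    for n in range(1, 11)
]

for n, independent, pairwise in CURVE:
    print(f"n={n:2d}  independent={independent:.3f}  pairwise={pairwise:.3f}")
```

Under these assumptions the pairwise curve overtakes the independent one once the active set exceeds three constraints, which is the qualitative shape the depth curves suggest.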
#### Ledger tracking and context competition\.
Explicit ledger injection helps gpt\-oss\-20b but hurts Qwen3\-8B, and the benefit declines again at gpt\-oss\-120b\. The likely mechanism is context competition\. Serializing the ledger \(up to 3,000 tokens\) consumes prompt budget that a smaller model needs for reasoning\. A larger model absorbs the overhead and uses the explicit state productively\. The renewed decline at gpt\-oss\-120b suggests that above a capability threshold the model’s implicit tracking is competitive with explicit injection, so the added context cost outweighs the control benefit\.
#### Toward drift\-targeted repair\.
The current repair loop is contradiction-oriented, identifying minimal unsatisfiable subsets and feeding them back. Drift, by contrast, receives only generic policy diagnostics without localizing which constraints are violated or which entities are misplaced. Closing this gap requires localizing the violated constraints,

$$
\mathcal{V}_t = \bigl\{\, c \in L_t : \neg\,\mathrm{SAT}\bigl(\{c\} \cup \Phi(A_t)\bigr) \,\bigr\},
$$

which mirrors the MUS definition structurally: MUS localizes a contradictory subset of the ledger, while $\mathcal{V}_t$ identifies ledger constraints violated by the returned assignment.
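A minimal sketch of computing $\mathcal{V}_t$ follows. It assumes that $\Phi(A_t)$ pins every variable to its assigned value, so that for a complete assignment the test $\neg\mathrm{SAT}(\{c\} \cup \Phi(A_t))$ reduces to evaluating $c$ on $A_t$; a partial assignment would need a real solver call instead. The `Constraint` record, its field names, and the toy ledger are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Assignment = Dict[str, str]


@dataclass(frozen=True)
class Constraint:
    cid: str                               # ledger identifier
    text: str                              # natural-language form
    turn: int                              # turn at which it was committed
    holds: Callable[[Assignment], bool]    # True if the assignment satisfies it


def violated_constraints(ledger: List[Constraint],
                         answer: Assignment) -> List[Constraint]:
    """Return V_t: every ledger constraint the complete assignment breaks."""
    return [c for c in ledger if not c.holds(answer)]


LEDGER = [
    Constraint("c1", "Bob is not on Tuesday", 1, lambda a: a["Bob"] != "Tue"),
    Constraint("c2", "Alice and Bob are on different days", 2,
               lambda a: a["Alice"] != a["Bob"]),
    Constraint("c3", "Carol is on Monday", 3, lambda a: a["Carol"] == "Mon"),
]

ANSWER = {"Alice": "Mon", "Bob": "Tue", "Carol": "Mon"}

VIOLATED = violated_constraints(LEDGER, ANSWER)
print([(c.cid, c.turn) for c in VIOLATED])
```

Returning the constraint records rather than a boolean lets a repair controller name the exact commitment and the turn it came from, which is the localization the generic diagnostics lack.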
#### Beyond solver\-structured domains\.
Our benchmark uses formal constraint sets because they enable sound verification, but satisfiable drift is not specific to constraint satisfaction\. Any multi\-turn system that maintains evolving commitments can exhibit it\. A travel\-planning assistant might confirm “no flights on Sunday” and later propose a Sunday itinerary; a code\-editing agent might acknowledge a variable rename and subsequently reference the old name\. In these open\-domain settings, detecting drift would require extracting implicit constraints from natural language, likely through entailment\-based commitment tracking rather than SAT solving\. The core diagnostic question remains the same: is the system’s internal state valid, and does the output respect that state? We expect drift to dominate in open\-domain settings as well, because the underlying cause, forgetting prior commitments while maintaining a coherent narrative, is a property of how language models process sequential context rather than an artifact of the constraint format\.
#### Evaluation implications\.
The distinction between state\-level and assignment\-level failure extends beyond our benchmark\. Dialogue state trackers, collaborative document editors, and iterative code generators all maintain evolving commitments across turns\. We suggest that multi\-turn evaluations in these domains similarly decompose errors by whether the system’s internal state became invalid or whether the output simply failed to respect a valid state\. Reporting the two channels separately would prevent the pattern we document here, where progress on one failure type masks stagnation on the other, and would give practitioners a clearer picture of where reliability investments should be directed\.
Taken together, these results argue that contradiction detection, though necessary, is not sufficient for reliable multi\-turn systems\. A second verification layer that checks the returned assignment against the maintained state is needed to catch the dominant failure mode\. Reporting contradiction and drift as separate evaluation channels, rather than merging them into a single accuracy number, would give practitioners a clearer picture of where residual risk concentrates\.
## 8 Limitations

Several scope limitations bear on the generalizability of our findings. The study evaluates four open-weight models from two families but does not include closed-weight frontier systems or specialist fine-tuned variants, either of which might exhibit different drift-to-contradiction ratios. Failure-channel decomposition relies on the solver-state and trigger logs that MUS-Repair produces at each retry. We do not have equivalent per-turn logging for the non-repair methods, which limits fully symmetric cross-method comparison of failure channels. Post-hoc instrumentation of non-repair traces with the same solver checks would enable symmetric decomposition. We leave this to follow-up work. The benchmark covers three solver-structured domains, and whether the drift-dominance finding transfers to open-domain dialogue, where constraints are implicit and verification is harder, remains an open question. Finally, we evaluate a single repair routing design without ablating over trigger definitions, retry budgets, or alternative repair controllers. Different routing policies might shift the balance between contradiction and drift in the residual error distribution.
## 9 Conclusion
This paper introduced a solver\-instrumented multi\-turn benchmark that cleanly separates two failure modes, contradiction and satisfiable drift, and used it to evaluate four reasoning methods across four open\-weight models\. MUS\-Repair produces significant gains in every setting after false\-discovery correction, but the errors that survive are overwhelmingly drift\. Models rarely contradict themselves after structured feedback, but they still forget prior commitments\. This forgetting compounds with conversational depth, and accuracy declines steeply even on the strongest model, suggesting that long\-horizon state maintenance remains an open challenge regardless of scale\.
These findings point to a concrete gap in current evaluation practice\. Solver\-level contradiction checks are necessary but insufficient\. Reliable multi\-turn systems must also validate that the returned assignment respects the maintained state\. Reporting contradiction and drift as separate channels, rather than merging them into a single accuracy number, exposes where the real residual risk lies\.
## References
- A. Belov and J. Marques-Silva (2012) MUSer2: an efficient MUS extractor. Journal on Satisfiability, Boolean Modelling and Computation 8(3–4), pp. 123–128. External Links: [Link](https://doi.org/10.3233/SAT190094) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
- A. Biere, M. Heule, H. van Maaren, and T. Walsh (Eds.) (2009) Handbook of satisfiability. Frontiers in Artificial Intelligence and Applications, Vol. 185, IOS Press. External Links: [Link](https://ebooks.iospress.nl/volume/handbook-of-satisfiability) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
- W. Chen, X. Ma, X. Wang, and W. W. Cohen (2023) Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research. External Links: [Link](https://mlanthology.org/tmlr/2023/chen2023tmlr-program/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. External Links: [Link](https://arxiv.org/abs/2110.14168) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- L. de Moura and N. Bjørner (2008) Z3: an efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science, Vol. 4963, pp. 337–340. External Links: [Link](https://doi.org/10.1007/978-3-540-78800-3_24) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023) PAL: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 202, pp. 10764–10799. External Links: [Link](https://proceedings.mlr.press/v202/gao23f.html) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- J. Han, W. Buntine, and E. Shareghi (2025) VerifiAgent: a unified verification agent in language model reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 16410–16431. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.891/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- M. Hu, T. Chen, Q. Chen, Y. Mu, W. Shao, and P. Luo (2025) HiAgent: hierarchical working memory management for solving long-horizon agent tasks with large language model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 32779–32798. External Links: [Link](https://aclanthology.org/2025.acl-long.1575/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, Vol. 35, pp. 22199–22213. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626. External Links: [Link](https://doi.org/10.1145/3600006.3613165) Cited by: [§4.2](https://arxiv.org/html/2605.23940#S4.SS2.p1.3).
- M. H. Liffiton and K. A. Sakallah (2008) Algorithms for computing minimal unsatisfiable subsets of constraints. Journal of Automated Reasoning 40(1), pp. 1–33. External Links: [Link](https://doi.org/10.1007/s10817-007-9084-z) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let's verify step by step. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=v8L0pN6EOi) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173. External Links: [Link](https://aclanthology.org/2024.tacl-1.9/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1), [§5.3](https://arxiv.org/html/2605.23940#S5.SS3.p1.1).
- P. Lu, B. Peng, H. Cheng, M. Galley, K. Chang, Y. N. Wu, S. Zhu, and J. Gao (2023) Chameleon: plug-and-play compositional reasoning with large language models. In Advances in Neural Information Processing Systems, Vol. 36, pp. 43447–43478. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/871ed095b734818cfba48db6aeb25a62-Abstract-Conference.html) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch (2023) Faithful chain-of-thought reasoning. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 305–329. External Links: [Link](https://aclanthology.org/2023.ijcnlp-main.20/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, Vol. 36, pp. 46534–46594. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) Cited by: [§1](https://arxiv.org/html/2605.23940#S1.p2.1), [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- OpenAI (2025) gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. External Links: [Link](https://arxiv.org/abs/2508.10925) Cited by: [§4.2](https://arxiv.org/html/2605.23940#S4.SS2.p1.3).
- O. Press, N. A. Smith, and M. Lewis (2022) Train short, test long: attention with linear biases enables input length extrapolation. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=R8sQPpGCv0) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 36, pp. 8634–8652. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1).
- K. Stechly, K. Valmeekam, and S. Kambhampati (2025) On the self-verification limitations of large language models on reasoning and planning tasks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=4O0v4s3IzY) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px2.p1.1), [§7](https://arxiv.org/html/2605.23940#S7.SS0.SSS0.Px1.p1.1).
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, pp. 24824–24837. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html) Cited by: [§1](https://arxiv.org/html/2605.23940#S1.p2.1), [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- C. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung (2019) Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 808–819. External Links: [Link](https://aclanthology.org/P19-1078/) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
- A. Yang et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. External Links: [Link](https://arxiv.org/abs/2505.09388) Cited by: [§4.2](https://arxiv.org/html/2605.23940#S4.SS2.p1.3).
- S. Yao, H. Chen, A. W. Hanjie, R. Yang, and K. R. Narasimhan (2024) COLLIE: systematic construction of constrained text generation tasks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=kxgSlyirUZ) Cited by: [§1](https://arxiv.org/html/2605.23940#S1.p2.1), [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023a) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, Vol. 36, pp. 11809–11822. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023b) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=WE_vluYUL-X) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px1.p1.1).
- S. Young, M. Gašić, B. Thomson, and J. D. Williams (2013) POMDP-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101(5), pp. 1160–1179. External Links: [Link](https://doi.org/10.1109/JPROC.2012.2225812) Cited by: [§2](https://arxiv.org/html/2605.23940#S2.SS0.SSS0.Px3.p1.1).
## Appendix A Prompts and Message Schemas
Subsections A\.1, A\.2, and A\.6 give the verbatim prompt strings\. Subsections A\.3, A\.4, and A\.5 describe the schema of assembled messages whose contents vary by domain, turn, and ledger state; we document the field structure rather than literal text\.
### A.1 Main-Turn System Prompts

System Prompt (`direct`): You solve multi-turn logical constraint satisfaction problems. Track all prior commitments and keep the final assignment consistent with every active constraint. Return only the final JSON solution.

System Prompt (`cot`): You solve multi-turn logical constraint satisfaction problems. Track all prior commitments and reason briefly before answering. Output at most 3 short bullets, then the final JSON solution.

System Prompt (`ledger_only`): You solve a multi-turn logical constraint problem with an explicit ledger. Treat the ledger as committed state and keep your answer consistent with it and the new turn constraints. Return only JSON.

System Prompt (`mus_repair`): You solve a multi-turn logical constraint problem with formal consistency checks and targeted repair signals. Use the ledger as committed state, address any repair signal directly, and provide a complete consistent solution. Return only JSON.
### A.2 Additional System Prompts for Extraction and Retries

Answer Retry System Prompt: You are a strict JSON formatter. Output one valid JSON object only. No markdown or prose.

Constraint Extraction System Prompt: You extract formal constraints from an assistant answer. Extract only constraints introduced in the latest user turn. Do not restate full solution assignments unless they directly encode one allowed constraint.
### A.3 Main User Message Schema

User Message Schema

- Problem setup (first turn only): include `Domain: <domain>` and `Entities: <entity list>` together with domain-specific context such as seat labels, time slots, or logic-grid categories.
- State block (ledger methods only): include `Current ledger: <serialized constraints>`.
- Current turn block: include `New constraints from user: <latest user message>`, optional `Repair signal: <trigger codes + optional MUS subset>`, and a domain-specific JSON schema hint.
### A.4 Repair Signal Schema

Repair Signal Schema (`REPAIR REQUIRED`): Detected issue classes are provided as `<trigger code> : <detail>`. An optional MUS subset is provided as `<constraint id> : "<constraint text>" (turn <t>)`. The required action is to return a revised JSON solution that resolves all listed issues.
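The schema above can be rendered with a few lines of string assembly. The following is a sketch of one plausible layout, not the verbatim template: the function name `build_repair_signal`, the one-item-per-line arrangement, the example trigger code `UNSAT_LEDGER`, and the wording of the closing instruction are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Trigger = Tuple[str, str]        # (trigger code, detail)
MusItem = Tuple[str, str, int]   # (constraint id, constraint text, turn)


def build_repair_signal(triggers: List[Trigger],
                        mus_subset: Optional[List[MusItem]] = None) -> str:
    """Render a repair signal following the field order described in A.4."""
    lines = ["REPAIR REQUIRED"]
    # One line per detected issue class: "<trigger code> : <detail>".
    lines += [f"{code} : {detail}" for code, detail in triggers]
    # Optional MUS subset: '<constraint id> : "<constraint text>" (turn <t>)'.
    for cid, text, turn in mus_subset or []:
        lines.append(f'{cid} : "{text}" (turn {turn})')
    # Hypothetical wording of the required action.
    lines.append("Return a revised JSON solution that resolves all listed issues.")
    return "\n".join(lines)


SIGNAL = build_repair_signal(
    [("UNSAT_LEDGER", "ledger has no satisfying assignment")],
    [("c3", "Bob is not on Tuesday", 2), ("c7", "Bob is on Tuesday", 5)],
)
print(SIGNAL)
```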
### A.5 Constraint Extraction Schema

Constraint Extraction Schema

- System instruction: extract only constraints introduced in the latest user turn from the assistant answer.
- User block fields: include `Domain`, `Source turn`, and `Entities`; include `Latest user message` and `Assistant response`; specify the domain-specific allowed constraint vocabulary; enforce strict rules on type names, argument order, and turn locality; and require JSON output with key `"constraints"`.
### A.6 Retry Prompt Templates

Answer Reformat Retry Prompt

- System: strict JSON formatter, one valid JSON object only.
- User: reformat the response as JSON only; preserve required entities as keys and add no commentary.

Truncation Retry Prompt: Previous answer was clipped by token limit. Retry with one compact JSON object only: no bullets, no prose, no analysis.
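To show how these retry templates might be selected, the sketch below triages one raw model answer. It is a hypothetical routing rule, not the paper's controller: the function name `triage_answer`, the returned labels, and the assumption that the caller already knows whether the output hit the token limit are all illustrative.

```python
import json
from typing import Iterable


def triage_answer(raw_text: str, required_entities: Iterable[str],
                  truncated: bool = False) -> str:
    """Pick a hypothetical next step for one raw model answer."""
    if truncated:
        return "truncation_retry"        # output was clipped by the token limit
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return "reformat_retry"          # prose or markdown around the JSON
    if not isinstance(parsed, dict):
        return "reformat_retry"          # valid JSON, but not a single object
    if set(required_entities) - set(parsed):
        return "incomplete_assignment"   # some entities are missing as keys
    return "accept"


ENTITIES = ["Alice", "Bob"]
DECISIONS = [
    triage_answer('{"Alice": "Mon", "Bob": "Tue"}', ENTITIES),
    triage_answer('Sure! {"Alice": "Mon"}', ENTITIES),
    triage_answer('{"Alice": "Mon"}', ENTITIES),
    triage_answer('{"Alice": "Mo', ENTITIES, truncated=True),
]
print(DECISIONS)
```

An accepted answer here is only well-formed; it would still need the ledger consistency check before being counted as correct.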
## Appendix BSupplementary Tables
Table 6: Domain-conditioned turn-level accuracy.

| Model | Domain | Direct (%) | CoT (%) | Ledger (%) | MUS-Repair (%) |
|---|---|---|---|---|---|
| Qwen3-8B | Logic-Grid | 39.7 | 37.5 | 36.2 | 43.1 |
| Qwen3-8B | Scheduling | 35.0 | 35.1 | 29.1 | 34.7 |
| Qwen3-8B | Seating | 9.8 | 11.0 | 10.3 | 12.1 |
| Qwen3-32B | Logic-Grid | 35.6 | 38.6 | 24.0 | 43.9 |
| Qwen3-32B | Scheduling | 41.1 | 44.8 | 56.9 | 55.7 |
| Qwen3-32B | Seating | 9.9 | 10.7 | 13.0 | 14.8 |
| gpt-oss-20b | Logic-Grid | 57.3 | 56.1 | 60.8 | 81.4 |
| gpt-oss-20b | Scheduling | 64.4 | 64.6 | 68.8 | 85.7 |
| gpt-oss-20b | Seating | 33.5 | 30.1 | 31.2 | 38.7 |
| gpt-oss-120b | Logic-Grid | 53.6 | 56.2 | 49.1 | 64.2 |
| gpt-oss-120b | Scheduling | 73.7 | 72.3 | 72.8 | 87.2 |
| gpt-oss-120b | Seating | 28.7 | 33.1 | 27.9 | 36.3 |

Table 7: Truncation robustness check.

| Model | Method | Trunc. (%) | Acc. All (%) | Acc. Non-trunc. (%) |
|---|---|---|---|---|
| gpt-oss-120b | Direct | 0.37 | 52.1 | 52.3 |
| gpt-oss-120b | Chain-of-Thought | 0.23 | 53.9 | 54.1 |
| gpt-oss-120b | Ledger | 0.14 | 50.0 | 50.1 |
| gpt-oss-120b | MUS-Repair | 0.11 | 62.7 | 62.7 |
| gpt-oss-20b | Direct | 1.75 | 51.8 | 52.7 |
| gpt-oss-20b | Chain-of-Thought | 1.53 | 50.4 | 51.1 |
| gpt-oss-20b | Ledger | 1.06 | 53.7 | 54.3 |
| gpt-oss-20b | MUS-Repair | 0.83 | 68.7 | 69.3 |
| Qwen3-32B | Direct | 0.00 | 28.9 | 28.9 |
| Qwen3-32B | Chain-of-Thought | 0.02 | 31.4 | 31.4 |
| Qwen3-32B | Ledger | 0.00 | 31.4 | 31.4 |
| Qwen3-32B | MUS-Repair | 0.00 | 38.2 | 38.2 |
| Qwen3-8B | Direct | 0.00 | 28.2 | 28.2 |
| Qwen3-8B | Chain-of-Thought | 0.00 | 27.9 | 27.9 |
| Qwen3-8B | Ledger | 0.00 | 25.2 | 25.2 |
| Qwen3-8B | MUS-Repair | 0.00 | 30.0 | 30.0 |

Table 8: Paired MUS-Repair tests against the strongest non-MUS comparator per model.

| Model | Comparator | Δ Acc. (pp) | 95% CI (pp) | p | q_FDR |
|---|---|---|---|---|---|
| gpt-oss-120b | Chain-of-Thought | 8.01 | [+6.19, +9.83] | <0.0001 | <0.0001 |
| gpt-oss-20b | Ledger | 14.29 | [+12.70, +15.93] | <0.0001 | <0.0001 |
| Qwen3-32B | Chain-of-Thought | 6.49 | [+4.34, +8.66] | <0.0001 | <0.0001 |
| Qwen3-8B | Direct | 2.03 | [+0.34, +3.76] | 0.0178 | 0.0178 |

Table 9: Post-repair outcomes by trigger code in MUS traces.

| Model | Trigger | Rows | Repair Acc. (%) | Repair SAT (%) |
|---|---|---|---|---|
| Qwen3-8B | Answer-Ledger Conflict | 4131 | 4.0 | 100.0 |
| Qwen3-8B | Incomplete Assignment | 139 | 65.5 | 100.0 |
| Qwen3-8B | Unsatisfiable Ledger | 1 | 0.0 | 100.0 |
| Qwen3-32B | Answer-Ledger Conflict | 3643 | 5.5 | 99.6 |
| Qwen3-32B | Incomplete Assignment | 275 | 69.8 | 100.0 |
| Qwen3-32B | Unsatisfiable Ledger | 87 | 6.9 | 17.2 |
| Qwen3-32B | Constraint Extraction Failure | 2 | 0.0 | 100.0 |
| gpt-oss-20b | Answer-Ledger Conflict | 2402 | 33.3 | 100.0 |
| gpt-oss-20b | Incomplete Assignment | 735 | 33.3 | 99.9 |
| gpt-oss-20b | Answer Parse Failure | 449 | 23.4 | 100.0 |
| gpt-oss-20b | Constraint Extraction Failure | 429 | 22.6 | 100.0 |
| gpt-oss-120b | Answer-Ledger Conflict | 2471 | 21.2 | 99.9 |
| gpt-oss-120b | Answer Parse Failure | 416 | 18.5 | 99.8 |
| gpt-oss-120b | Constraint Extraction Failure | 32 | 12.5 | 96.9 |
| gpt-oss-120b | Incomplete Assignment | 30 | 26.7 | 100.0 |

Table 10: Pairwise overlap of MUS error rows across models.

| Model A | Model B | Overlap | Jaccard | Share A | Share B |
|---|---|---|---|---|---|
| Qwen3-8B | Qwen3-32B | 3143 | 0.726 | 0.792 | 0.897 |
| Qwen3-8B | gpt-oss-20b | 1687 | 0.416 | 0.425 | 0.950 |
| Qwen3-8B | gpt-oss-120b | 1945 | 0.470 | 0.490 | 0.919 |
| Qwen3-32B | gpt-oss-20b | 1638 | 0.450 | 0.467 | 0.923 |
| Qwen3-32B | gpt-oss-120b | 1892 | 0.507 | 0.540 | 0.894 |
| gpt-oss-20b | gpt-oss-120b | 1412 | 0.569 | 0.795 | 0.667 |
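The columns of Table 10 can be cross-checked against each other. The sketch below assumes (the appendix does not define the columns) that Share A is the overlap divided by the number of error rows of Model A, and Share B likewise for Model B. Under that assumption the Jaccard index follows from the two shares alone, because |A ∪ B| / |A ∩ B| = 1/Share A + 1/Share B − 1. The function name `jaccard_from_shares` is illustrative only.

```python
def jaccard_from_shares(share_a: float, share_b: float) -> float:
    """Jaccard index implied by the two overlap shares.

    Assumes share_a = |A & B| / |A| and share_b = |A & B| / |B|, so that
    |A | B| / |A & B| = 1/share_a + 1/share_b - 1.
    """
    return 1.0 / (1.0 / share_a + 1.0 / share_b - 1.0)


# (Model A, Model B, reported Jaccard, Share A, Share B), copied from Table 10.
TABLE_10 = [
    ("Qwen3-8B", "Qwen3-32B", 0.726, 0.792, 0.897),
    ("Qwen3-8B", "gpt-oss-20b", 0.416, 0.425, 0.950),
    ("Qwen3-8B", "gpt-oss-120b", 0.470, 0.490, 0.919),
    ("Qwen3-32B", "gpt-oss-20b", 0.450, 0.467, 0.923),
    ("Qwen3-32B", "gpt-oss-120b", 0.507, 0.540, 0.894),
    ("gpt-oss-20b", "gpt-oss-120b", 0.569, 0.795, 0.667),
]

for model_a, model_b, reported, share_a, share_b in TABLE_10:
    implied = jaccard_from_shares(share_a, share_b)
    print(f"{model_a} vs {model_b}: reported {reported:.3f}, implied {implied:.3f}")
```

With the shares rounded to three decimals, the implied and reported Jaccard values agree to within 0.002 on every row, which supports this reading of the columns.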
## Appendix C Example Transcripts
Three Qwen3-8B transcripts, one per domain, comparing Direct and MUS-Repair turn by turn. ✓ marks a satisfying assignment, ✗ a violation.
### C.1 Transcript A: Scheduling (scheduling_249)
Setup. Six activities (Sync, Testing, Meeting, QA, Planning, Design) must be assigned start times and durations. Constraints accumulate over four turns.
Table 11: Turn-by-turn outputs for scheduling_249. Direct 1/4, MUS-Repair 4/4.

| Turn | New constraints | Direct answer | MUS-Repair answer |
|---|---|---|---|
| 1 | QA must start between slots 1–2. | QA → 1 ✓ | QA → 2 ✓ |
| 2 | Testing ≠ Design (simult.); QA duration = 3; Design → slot 9. | QA dur = 3, Design → 9, but Testing → 5, Design → 9 (Testing dur 2 ⇒ 5–6) ✗† | QA dur = 3, Design → 9, Testing → 4 ✓ |
| 3 | Testing duration = 3. | Testing dur = 3, Design → 9 dur = 2, Meeting → 9 dur = 2 ✗ | Testing → 4 dur = 3; Design → 9; QA → 2 dur = 3 ✓ |
| 4 | Testing → slot 7; Planning → slot 5. | Testing → 7 dur = 3, Design → 9 dur = 2 (overlap 9–10) ✗ | Testing → 7 dur = 3, Planning → 5, Design → 9 ✓ |

† Direct assigns default duration = 2 to several activities, violating the implicit single-slot default.
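The footnote's duration rule can be sketched as a small check. This is an illustrative reading of the footnote, not the benchmark's validator: it assumes that an activity with no accepted duration constraint must keep a duration of one slot, and the names `DEFAULT_DURATION` and `duration_violations` are hypothetical. The inputs are the durations shown in the Direct column of Table 11.

```python
DEFAULT_DURATION = 1  # assumed meaning of the "implicit single-slot default"


def duration_violations(answer_durations, explicit_durations):
    """Return activities whose duration disagrees with the constraints so far.

    answer_durations:   activity -> duration used in the returned answer
    explicit_durations: activity -> duration fixed by an accepted constraint
    Activities without an explicit constraint must keep DEFAULT_DURATION.
    """
    return sorted(
        activity
        for activity, duration in answer_durations.items()
        if duration != explicit_durations.get(activity, DEFAULT_DURATION)
    )


# Turn 2: only "QA duration = 3" has been accepted.
turn2_explicit = {"QA": 3}
direct_turn2 = {"QA": 3, "Testing": 2}
print(duration_violations(direct_turn2, turn2_explicit))  # ['Testing']

# Turn 3 adds "Testing duration = 3".
turn3_explicit = {"QA": 3, "Testing": 3}
direct_turn3 = {"Testing": 3, "Design": 2, "Meeting": 2}
print(duration_violations(direct_turn3, turn3_explicit))  # ['Design', 'Meeting']
```

Keeping explicit durations in a separate mapping lets the check fall back to the default for any activity that no accepted constraint has mentioned.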
### C.2 Transcript B: Logic Grid (logic_grid_021)
Setup. Four people (Blake, Drew, Avery, Finley) are each assigned a unique value in three categories: color (Red/Blue/Green/Yellow), pet (Cat/Dog/Bird/Fish), and profession (Doctor/Artist/Teacher/Chef).
Table 12: Turn-by-turn outputs for logic_grid_021. Direct 0/5, MUS-Repair 5/5.

| Turn | New constraints | Direct answer | MUS-Repair answer |
|---|---|---|---|
| 1 | Finley.pet < Drew.pet; Finley.pet ≠ Avery.pet. | Finley → Bird, Drew → Dog (Bird < Dog: wrong order) ✗ | Finley → Cat, Drew → Dog ✓ |
| 2 | Blake.color < Finley.color. | (identical to turn 1) ✗ | Blake → Red, Finley → Yellow ✓ |
| 3 | Drew.pet ≠ Finley.pet; Avery.pet ≠ Drew.pet; Avery.prof ≠ Finley.prof. | (identical to turn 1) ✗ | Drew → Dog, Avery → Bird, Finley → Cat ✓ |
| 4 | Drew → Chef. | Drew → Chef, Finley → Chef (duplicate: two Chefs) ✗ | Drew → Chef, Finley → Artist ✓ |
| 5 | Blake.color ≠ Drew.color; Drew ≠ Bird. | (same as turn 4, still two Chefs) ✗ | Drew → Dog (not Bird); Blake → Red, Drew → Blue ✓ |
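Two of the violations in Table 12 can be reproduced with a few lines of code. This is a minimal sketch, not the benchmark's checker. It assumes that `<` compares positions in the order each category is listed in the Setup (Cat, Dog, Bird, Fish for pets), which is consistent with the ✓ and ✗ marks in turns 1 and 2 but is not stated in this appendix. The helper names are hypothetical.

```python
from collections import Counter

# Assumption: "<" compares positions in the order the Setup lists the values.
PETS = ["Cat", "Dog", "Bird", "Fish"]


def ordered_before(assignment, first, second, order):
    """True if first's value precedes second's value in the category order."""
    return order.index(assignment[first]) < order.index(assignment[second])


def duplicated_values(assignment):
    """Values given to more than one person, which breaks the uniqueness rule."""
    counts = Counter(assignment.values())
    return sorted(value for value, count in counts.items() if count > 1)


# Turn 1, constraint Finley.pet < Drew.pet.
direct_pets = {"Finley": "Bird", "Drew": "Dog"}
repair_pets = {"Finley": "Cat", "Drew": "Dog"}
print(ordered_before(direct_pets, "Finley", "Drew", PETS))  # False
print(ordered_before(repair_pets, "Finley", "Drew", PETS))  # True

# Turn 4, each profession may be used once.
direct_profs = {"Drew": "Chef", "Finley": "Chef"}
repair_profs = {"Drew": "Chef", "Finley": "Artist"}
print(duplicated_values(direct_profs))  # ['Chef']
print(duplicated_values(repair_profs))  # []
```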
### C.3 Transcript C: Seating (seating_062)
Setup. Seven people (Diana, Ruby, Tina, Noah, Charlie, Frank, Karen) sit around a round table with positions 1–7. Constraints include fixed positions, adjacency prohibitions, and separation requirements.
Table 13: Turn-by-turn outputs for seating_062. Direct 1/4, MUS-Repair 3/4.

| Turn | New constraints | Direct answer | MUS-Repair answer |
|---|---|---|---|
| 1 | Karen → pos. 3; Karen not adjacent to Ruby. | Karen → 3, Ruby → 5 ✓ | Karen → 3, Ruby → 5 ✓ |
| 2 | Charlie not adjacent to Frank. | Charlie → 6, Frank → 7 (adjacent: violates new constraint) ✗ | Charlie → 7, Frank → 4 (separated by 3 positions) ✓ |
| 3 | Karen–Noah ≥ 1 apart; Tina–Frank ≥ 2 apart. | Frank → 8 (invalid: only 7 seats) ✗ | Noah → 7, Tina → 2, Frank → 6 ✓ |
| 4 | Frank not adj. Ruby; Noah not adj. Charlie; Diana → pos. 6. | Karen → 4 (drift: violates turn-1 at_position(Karen,3)) ✗ | Diana → 6, but Ruby → 5 adj. Frank → 1 (wraps) ✗ |
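The drift in turn 4 of the Direct column is only visible when the returned answer is checked against every accepted constraint, not just the newest ones. The sketch below illustrates that answer-versus-ledger check on excerpts from Table 13. It is a simplified illustration under stated assumptions: the ledger encoding, the `violations` function, and the `seat_in_range` label are hypothetical, only two constraint kinds are modeled, partial answers skip constraints that mention unassigned people, and no solver is used to confirm that the ledger itself is satisfiable.

```python
N_SEATS = 7  # round table with positions 1..7


def circular_distance(p, q, n=N_SEATS):
    """Number of steps between two seats on a round table of n seats."""
    d = abs(p - q) % n
    return min(d, n - d)


def violations(ledger, answer, n=N_SEATS):
    """Return every accepted constraint that the (possibly partial) answer breaks.

    ledger: list of (turn, kind, args) entries accumulated over the dialogue
    answer: person -> seat
    """
    broken = []
    for person, seat in answer.items():
        if not 1 <= seat <= n:
            broken.append((None, "seat_in_range", (person, seat)))
    for turn, kind, args in ledger:
        if kind == "at_position":
            person, seat = args
            if person in answer and answer[person] != seat:
                broken.append((turn, kind, args))
        elif kind == "not_adjacent":
            a, b = args
            if a in answer and b in answer:
                if circular_distance(answer[a], answer[b], n) == 1:
                    broken.append((turn, kind, args))
    return broken


# Constraints accepted in turns 1 and 2 of seating_062.
LEDGER = [
    (1, "at_position", ("Karen", 3)),
    (1, "not_adjacent", ("Karen", "Ruby")),
    (2, "not_adjacent", ("Charlie", "Frank")),
]

print(violations(LEDGER, {"Karen": 3, "Ruby": 5}))     # turn 1: no violation
print(violations(LEDGER, {"Charlie": 6, "Frank": 7}))  # turn 2: breaks the new constraint
print(violations(LEDGER, {"Frank": 8}))                # turn 3: seat out of range
print(violations(LEDGER, {"Karen": 4}))                # turn 4: breaks a turn-1 commitment
```

A violation whose turn number is earlier than the current turn, as in the last call, matches the drift pattern: the ledger still contains a consistent set of constraints, but the answer no longer respects one of them.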