Recall Isn't Enough: Bounding Commitments in Personalized Language Systems


Summary

Introduces Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) to prevent runtime control failures in personalized language systems, where a system makes incorrect commitments despite having relevant context. CBEA+LCV reaches zero failures within validator scope at 0.49–0.60 availability, whereas raw and long-context baselines with the same LCV gate reach zero only at 0.003–0.092 availability.

arXiv:2605.16712v1 Announce Type: new Abstract: Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.

Source: [https://arxiv.org/html/2605.16712](https://arxiv.org/html/2605.16712)

## 1 Introduction

Long\-horizon personalization must preserve confirmed user constraints while adapting to noisy, evolving state\. Fluent responses can still drop hard constraints, overweight recent context, lose rare witnesses, or continue when no response is compatible with the contract\. We call these*runtime control failures*: errors that occur after evidence is collected but before a commitment is safely realized\. The standard alternative—stuffing the raw user history into the prompt—asks the model to rediscover which facts matter at every turn\. The same sentence may be reinterpreted across turns, hard constraints flatten into soft preferences, and rare evidence disappears unless the runtime keeps an explicit obligation model\.

This failure mode is distinct from ordinary memory failure\. A system may retrieve the relevant sentence and still make the wrong commitment if it treats a soft hint as a hard constraint, ignores a required witness, or fails to recognize that the feasible set is empty\. Conversely, a system may be fluent and useful while remaining outside any claim we can verify\. The central question is therefore not only what context to show a model, but what commitments the runtime is allowed to realize after that context is selected\.

We address this with runtime control bounded by contracts: runtime decisions are constrained by explicitly confirmed structured commitments covered by validators\. Clarification targets missing information with high downstream impact; compilation converts confirmed information into hard predicates, evidence units, mutable state, and required coverage; CBEA selects bounded evidence; LCV filters structured commitments before prose; and infeasible states route to repair, abstention, or recontract rather than to new commitments\.

The paper deliberately separates two boundaries\. Inside the compiled contract, the runtime can validate structured commitments against explicit predicates and coverage requirements\. Outside that boundary, uncompiled facts remain ordinary context: the method can miss them, and we measure that exclusion rather than hiding it\. This is why the evaluation reports both validator\-covered control failures and a shadow\-oracle diagnostic over visible facts that were not compiled into the contract\.

The paper makes four contributions:

1. CBEA, a budgeted evidence activation objective over local relevance, contract coverage, tail witness retention, consequence debt, and overpersonalization penalties;
2. LCV, a lexicographic validation rule whose invariant is attached to structured commitments covered by validators rather than unrestricted prose;
3. a matched evaluation across nine variants and 360 fixtures, covering hard constraints, coverage, witnesses, infeasible continuation, consequences, and repair, plus a long-history payload diagnostic built from prompts selected by CBEA;
4. an audit of 90 cases by model judges over six fidelity dimensions, combined with privacy-protected production aggregates as diagnostic evidence rather than causal outcome evidence.

## 2 Related Work and Positioning

#### Long horizon\.

Recent benchmarks already study long-horizon personalization and contextual preference inference. HorizonBench tracks evolving preferences over long simulated histories Li et al. ([2026b](https://arxiv.org/html/2605.16712#bib.bib6)); CUPID evaluates contextualized alignment from interaction histories Kim et al. ([2025](https://arxiv.org/html/2605.16712#bib.bib7)). Our target is not preference inference alone, but runtime control after evidence exists.

#### Memory\.

Memory systems address recall under context pressure. TiMem proposes temporal hierarchical memory consolidation Li et al. ([2026a](https://arxiv.org/html/2605.16712#bib.bib8)), while OP-Bench shows that memory can be misused through irrelevance, repetition, and sycophancy Hu et al. ([2026](https://arxiv.org/html/2605.16712#bib.bib9)). We also include a selector-level diagnostic against a classic MMR selector Carbonell and Goldstein ([1998](https://arxiv.org/html/2605.16712#bib.bib10)); unlike CBEA, MMR has no typed coverage, tail witness, or consequence debt terms. We therefore treat evidence activation as a constrained runtime decision before validation and realization.

#### Clarification\.

Prior work studies clarification for preference elicitation Montazeralghaem et al. ([2025](https://arxiv.org/html/2605.16712#bib.bib1)); Zhang et al. ([2024](https://arxiv.org/html/2605.16712#bib.bib2)), profile compression Shi et al. ([2025](https://arxiv.org/html/2605.16712#bib.bib3)); Qiu et al. ([2025](https://arxiv.org/html/2605.16712#bib.bib4)), and long-context limits Du et al. ([2025](https://arxiv.org/html/2605.16712#bib.bib5)). These motivate moving runtime truth from raw profile text to explicit artifacts.

#### Prompt compression and RAG\.

Compression and retrieval baselines reduce context cost, but they do not by themselves define which constraints are hard, which evidence is mandatory, or what the system must do when no feasible answer exists\. Our comparison therefore includes raw history, summary, dense retrieval, long\-context prompting, and tool/memory baselines, but the main claim concerns the control interface placed after evidence selection: a structured commitment is either covered by validators or outside the guarantee\.

#### Verification\.

Contract-grounded planning separates retrieval, typed constraints, verification, repair, and abstention Agarwal ([2026](https://arxiv.org/html/2605.16712#bib.bib11)). Our setting differs: constraints are confirmed by the user, state evolves, evidence is incomplete, and validated commitments are realized through language. We borrow the boundary that safety claims apply only to structured commitments covered by validators.

#### Position\.

Our closest comparison is not personalization memory alone, but systems that separate retrieval, constraint representation, validation, and repair\. Unlike previous work, our contribution is their integration into a runtime control method for noisy long\-horizon personalized language systems\.

## 3 Formal Runtime Model

The paper's central object is a runtime over structured commitments. Let $x_{0:t}$ be a sequence of noisy user observations, including forms, labels, free text, and previous turns. The runtime maintains a confirmed hard contract $h_t$, mutable state $u_t$, a shared evidence pool $E_t$, a required-coverage set $R_t$, and a structured action space $\mathcal{A}$.

###### Definition 1 (Confirmed hard contract).

A hard contract is a set

$$h_t = \{(\phi_j, \pi_j)\}_{j=1}^{m_t},$$

where each $\phi_j : \mathcal{A} \rightarrow \{0,1\}$ is a machine-checkable hard predicate and $\pi_j$ is a provenance record. Only explicitly confirmed predicates belong to $h_t$. Inferred or unresolved constraints remain outside the immutable contract.

###### Definition 2 (Evidence pool).

The evidence pool is a finite set

$$E_t = \{e_i = (z_i, d_i, \rho_i, \tau_i, \kappa_i)\}_{i=1}^{n_t},$$

where $z_i$ is an evidence unit, $d_i$ its dimension, $\rho_i$ provenance metadata, $\tau_i$ a tail witness indicator, and $\kappa_i$ an activation cost. Selected evidence units, not raw profile text, are the operational interface to generation.

###### Definition 3 (Required coverage set).

The runtime derives a set

$$R_t = \Gamma(h_t, u_t, c_t)$$

of required evidence dimensions for the current turn or result. Requirements may come from confirmed hard predicates, required evidence fields, consequence debt, local scene obligations, or no-feasible checks. Let $M_t \in \{0,1\}^{|E_t| \times |R_t|}$ be an evidence requirement coverage matrix, where $M_{ir} = 1$ means evidence unit $e_i$ covers requirement $r$.
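The three definitions above can be pictured as plain data structures. The following is a minimal Python sketch, not the paper's implementation: the class names, the field names, and the `covers` field (a hypothetical stand-in for however the runtime derives $M_t$) are all assumptions made for illustration, as is the convention that a predicate returning `True` means the action satisfies it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Action = Dict[str, object]  # hypothetical representation of a structured commitment


@dataclass(frozen=True)
class HardPredicate:
    """One (phi_j, pi_j) pair of the confirmed hard contract h_t."""
    name: str
    check: Callable[[Action], bool]  # phi_j; assumed convention: True = satisfied
    provenance: str                  # pi_j: where the confirmation came from


@dataclass(frozen=True)
class EvidenceUnit:
    """One e_i = (z_i, d_i, rho_i, tau_i, kappa_i) of the evidence pool E_t."""
    text: str               # z_i
    dimension: str          # d_i
    provenance: str         # rho_i
    is_tail_witness: bool   # tau_i
    cost: int               # kappa_i
    covers: FrozenSet[str] = frozenset()  # assumed: requirement ids this unit covers


def coverage_matrix(pool: List[EvidenceUnit], required: List[str]) -> List[List[int]]:
    """Build M_t, where M_t[i][r] = 1 when evidence unit i covers requirement r."""
    return [[1 if r in e.covers else 0 for r in required] for e in pool]


# Illustrative data only; none of these values come from the paper.
pool = [
    EvidenceUnit("no departures before 9am", "schedule", "form", False, 2,
                 frozenset({"schedule"})),
    EvidenceUnit("allergy mentioned once in turn 3", "health", "free text", True, 1,
                 frozenset({"health"})),
]
required = ["schedule", "health", "budget"]
M = coverage_matrix(pool, required)
uncovered = [r for k, r in enumerate(required) if not any(row[k] for row in M)]
```

In this toy pool the `budget` requirement has no covering unit, which is the kind of gap the coverage term and the later validation step are meant to surface.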

#### CBEA\.

CBEA selects evidence through a budgeted objective rather than top-$k$ retrieval. For a candidate evidence subset $Z \subseteq E_t$, define

$$\begin{aligned} J_t(Z) = {} & \lambda_r\,\mathrm{Rel}(Z, c_t) + \lambda_c\,\mathrm{Cov}(Z, R_t) \\ & + \lambda_w\,\mathrm{Tail}(Z) + \lambda_d\,\mathrm{Debt}(Z, u_t) \\ & - \lambda_o\,\mathrm{Over}(Z, c_t), \end{aligned}$$

where $\mathrm{Rel}$ measures local relevance, $\mathrm{Cov}$ measures requirement coverage, $\mathrm{Tail}$ rewards retention of rare but decisive tail witnesses, $\mathrm{Debt}$ rewards evidence needed for downstream obligations, and $\mathrm{Over}$ penalizes irrelevant or intrusive personalization. Requirement coverage is computed from $M_t$ using a short coverage indicator:

$$\begin{aligned} \mathrm{Cov}(Z, R_t) &= \sum_{r \in R_t} w_r\,\eta_r(Z), \\ \eta_r(Z) &= \mathbf{1}[\exists e_i \in Z : M_{ir} = 1]. \end{aligned}$$

CBEA chooses

$$Z_t^{\star} \in \operatorname*{arg\,max}_{Z \subseteq E_t} J_t(Z) \quad \text{s.t.} \quad \sum_{e_i \in Z} \kappa_i \leq \mathcal{B}_t.$$

In implementation, this objective can be solved approximately with greedy budgeted coverage. A reserved budget $\mathcal{B}_t = \mathcal{B}_t^{main} + \mathcal{B}_t^{tail}$ lets the runtime protect required tail witnesses separately from ordinary evidence with high relevance. The corresponding ablations remove coverage, tail reservation, or consequence debt terms.
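To make the greedy approximation concrete, here is a hedged Python sketch of budgeted selection with a reserved tail budget. It is one possible reading, not the authors' selector: the `Unit` fields, the additive forms of Rel, Tail, Debt, and Over, the gain-per-cost rule, and the choice not to roll unused tail budget into the main budget are all assumptions. Only the weights $(1, 2, 2, 1, 1)$ are taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Unit:
    """Hypothetical evidence unit; field names are illustrative."""
    uid: str
    cost: int                      # kappa_i, assumed strictly positive
    relevance: float = 0.0
    covers: FrozenSet[str] = frozenset()
    tail: bool = False
    debt: bool = False
    overuse: float = 0.0


# Fixed design priors (lambda_r, lambda_c, lambda_w, lambda_d, lambda_o).
WEIGHTS = {"r": 1.0, "c": 2.0, "w": 2.0, "d": 1.0, "o": 1.0}


def objective(Z: List[Unit], req_weights: Dict[str, float]) -> float:
    """J_t(Z) with assumed additive Rel, Tail, Debt, and Over terms."""
    covered = set().union(*(u.covers for u in Z))
    cov = sum(w for r, w in req_weights.items() if r in covered)
    return (WEIGHTS["r"] * sum(u.relevance for u in Z)
            + WEIGHTS["c"] * cov
            + WEIGHTS["w"] * sum(u.tail for u in Z)
            + WEIGHTS["d"] * sum(u.debt for u in Z)
            - WEIGHTS["o"] * sum(u.overuse for u in Z))


def greedy_fill(pool: List[Unit], chosen: List[Unit], budget: int,
                req_weights: Dict[str, float]) -> None:
    """Append units to `chosen` by best positive marginal gain per unit cost."""
    remaining = budget
    while True:
        base = objective(chosen, req_weights)
        best, best_ratio = None, 0.0
        for u in pool:
            if u in chosen or u.cost > remaining:
                continue
            ratio = (objective(chosen + [u], req_weights) - base) / u.cost
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:
            return
        chosen.append(best)
        remaining -= best.cost


def cbea_select(pool: List[Unit], req_weights: Dict[str, float],
                main_budget: int, tail_budget: int) -> List[Unit]:
    """Spend the reserved tail budget on tail witnesses first, then the main budget."""
    chosen: List[Unit] = []
    greedy_fill([u for u in pool if u.tail], chosen, tail_budget, req_weights)
    greedy_fill(pool, chosen, main_budget, req_weights)
    return chosen


# Illustrative pool; values are invented for the example.
pool = [
    Unit("a", cost=2, relevance=0.9, covers=frozenset({"schedule"})),
    Unit("b", cost=2, relevance=0.8, covers=frozenset({"schedule"})),
    Unit("c", cost=1, relevance=0.1, covers=frozenset({"health"}), tail=True),
    Unit("d", cost=2, relevance=0.7, overuse=1.0),
]
selected = cbea_select(pool, {"schedule": 1.0, "health": 1.0},
                       main_budget=2, tail_budget=1)
```

In this toy case the low-relevance tail witness `c` is kept because the tail reserve is spent before ordinary relevance competes for the main budget, and the redundant unit `b` adds no new coverage.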

#### LCV\.

A candidate generator constructs a finite set $A_t \subseteq \mathcal{A}$ of structured commitments from $Z_t^{\star}$. For each candidate, LCV computes a violation vector

$$\nu_t(a) = \big(\nu_h(a, h_t),\ \nu_c(a, Z_t^{\star}, R_t),\ \nu_0(a),\ -S(a; u_t, Z_t^{\star}, c_t)\big),$$

where $\nu_h$ counts hard predicate failures covered by validators, $\nu_c$ counts missing required evidence coverage, $\nu_0$ indicates commitment emission when the runtime diagnoses no feasible candidate covered by validators, and $S$ is a soft utility score over mutable state and local context. LCV ranks candidates by lexicographic minimization:

$$\tilde{a}_t \in \operatorname*{arg\,lexmin}_{a \in A_t} \nu_t(a).$$

For readability, write $\bar{\nu}_{h,t}(a) = \nu_h(a, h_t)$ and $\bar{\nu}_{c,t}(a) = \nu_c(a, Z_t^{\star}, R_t)$. The feasible set is

$$\mathcal{F}_t = \{a \in A_t : \bar{\nu}_{h,t}(a) = 0,\ \bar{\nu}_{c,t}(a) = 0\}.$$

If $\mathcal{F}_t \neq \varnothing$, the runtime emits

$$a_t^{\star} \in \operatorname*{arg\,max}_{a \in \mathcal{F}_t} S(a; u_t, Z_t^{\star}, c_t),$$

realizes it and updates mutable state:

$$y_t = G_{\psi}(a_t^{\star}, u_t, Z_t^{\star}, c_t), \quad u_{t+1} = U(u_t, a_t^{\star}, x_{t+1}).$$

If $\mathcal{F}_t = \varnothing$, the runtime must not emit a commitment as if it were feasible. LCV diagnoses an infeasibility reason such as missing evidence, contract conflict, unsupported commitment, or validator failure. Let $\delta_t$ denote this reason. The runtime calls a restricted repair operator

$$o_t = \Omega(\delta_t, h_t, u_t, Z_t^{\star}, c_t),$$

where $o_t$ is repair, abstention, fallback, or explicit recontract. In our implementation, missing evidence maps to clarification, contract conflict to recontract, and unsupported or validator-failing candidates to abstention or fallback. These acts do not carry commitments and are scored separately from valid structured commitments. Appendix [6](https://arxiv.org/html/2605.16712#A6) gives an illustrative LCV repair routing sketch.
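The validation and routing rules above can be sketched in a few lines of Python, since tuples compare lexicographically. This is an illustrative sketch under stated assumptions, not the paper's validator: `Candidate`, `diagnose`, and the string labels are hypothetical, every candidate is assumed to be a commitment (so $\nu_0$ is 1 exactly when the feasible set is empty), and the diagnosis heuristic is a deliberate simplification. The reason-to-route mapping follows the sentence in the text.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    """Hypothetical structured commitment with precomputed validator outputs."""
    name: str
    hard_failures: int      # nu_h: failed confirmed hard predicates
    missing_coverage: int   # nu_c: required evidence dimensions not covered
    soft_utility: float     # S: soft utility score


# Routing as described in the text: reason delta_t -> non-committing act.
REPAIR_ROUTES = {
    "missing_evidence": "clarification",
    "contract_conflict": "recontract",
    "unsupported_commitment": "abstention_or_fallback",
    "validator_failure": "abstention_or_fallback",
}


def violation_vector(c: Candidate, no_feasible: bool) -> Tuple[int, int, int, float]:
    """nu_t(a) = (nu_h, nu_c, nu_0, -S)."""
    return (c.hard_failures, c.missing_coverage, int(no_feasible), -c.soft_utility)


def diagnose(candidates: List[Candidate]) -> str:
    """Simplified, assumed heuristic for the infeasibility reason delta_t."""
    if not candidates:
        return "unsupported_commitment"
    if all(c.hard_failures > 0 for c in candidates):
        return "contract_conflict"
    return "missing_evidence"


def lcv_step(candidates: List[Candidate]) -> Tuple[str, Optional[str]]:
    """Return ("commit", name) or (repair route, None) without a commitment."""
    feasible = [c for c in candidates
                if c.hard_failures == 0 and c.missing_coverage == 0]
    if feasible:
        # min() over tuples is arg-lexmin; feasible candidates sort first,
        # so the winner is also the arg-max of S over the feasible set.
        best = min(candidates, key=lambda c: violation_vector(c, False))
        return ("commit", best.name)
    return (REPAIR_ROUTES[diagnose(candidates)], None)


# Illustrative candidates; names and scores are invented.
decision = lcv_step([
    Candidate("cheap_red_eye", hard_failures=1, missing_coverage=0, soft_utility=0.9),
    Candidate("morning_train", hard_failures=0, missing_coverage=0, soft_utility=0.6),
    Candidate("late_flight", hard_failures=0, missing_coverage=1, soft_utility=0.8),
])
```

Here the highest-utility candidate loses because a hard failure is compared before soft utility, which is the ordering the violation vector encodes.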

###### Assumption 1 (Structured emission covered by validators).

The validator exactly covers the confirmed predicates in $h_t$ for parseable structured commitments. Repair, abstention, and recontract acts do not carry commitments: they may ask for missing information, decline to continue, or request explicit contract correction, but they must not smuggle a new commitment as advice. Timeout, no output, parse failure, partial output, and blank output are not counted as safe emissions; they are system-level failures or unevaluable states with explicit denominators.

###### Property 1 (Emission boundary covered by validators).

Under exact validator coverage for predicates in $h_t$, any emitted structured commitment $a_t^{\star}$ satisfies all confirmed hard predicates in $h_t$. When $\mathcal{F}_t = \varnothing$, no structured commitment is emitted; the runtime is restricted to repair, abstention, or explicit recontract. This invariant is distinct from the oracle-level violation metrics used in the benchmark.

This property is a claim boundary, not a theorem over all natural-language implications. The guarantee is attached to $a_t^{\star}$, and the surface realizer $G_{\psi}$ remains a separately evaluated layer because prose can introduce unsupported implications even when the underlying commitment is valid.

Table 1: The paper is organized around runtime objects. Evaluation metrics are derived from these objects rather than from generic fluency or engagement.

## 4 CBEA and LCV Runtime Algorithm

Figure [1](https://arxiv.org/html/2605.16712#S4.F1) shows the method. The model is not asked to rediscover the user contract from raw background text on every turn; the runtime compiles constraints, activates evidence, validates commitments, and only then realizes language.

[Figure 1 diagram. Node labels: noisy user observations $x_{0:t}$; bounded clarification; compile $h_t, E_t, u_t, R_t, M_t$; CBEA budgeted evidence $Z_t = Z^{main} \cup Z^{tail}$; generate structured candidates $A_t$; LCV (hard → coverage → no-feasible → soft); decision "valid commitment?"; on yes, realize $y_t = G_{\psi}(a_t^{\star})$; on no, $\Omega(\delta_t)$: repair, abstain, fallback, recontract; update mutable state $u_{t+1}$. Annotation: CBEA objective = relevance + coverage + tail + consequence - overuse.]

Figure 1: Runtime control with validator gates, CBEA, and LCV. Evidence is activated under budget, commitments are validated before prose realization, and infeasible states route to repair, abstention, fallback, or recontract.

#### Compilation.

Clarification is bounded information acquisition, not open interviewing. Compilation converts confirmed answers into $h_t$, extracts evidence units into $E_t$, initializes $u_t$, derives $R_t$, and populates $M_t$, keeping inferred soft preferences outside the hard contract.

#### Activation.

Evidence activation is constrained selection, not unconditional memory retrieval. CBEA adds coverage of $R_t$, tail witness reservation, consequence debt evidence, and overpersonalization penalties to local relevance. Turn activation prioritizes latency; result activation prioritizes synthesis and consequence preservation.

#### Validation.

The runtime validates structured commitments before language. Hard predicates are lexicographically prior to evidence coverage, and both precede soft utility. More evidence cannot compensate for violating an explicitly confirmed boundary. If every candidate violates $h_t$ or lacks required coverage, the runtime forbids commitments that claim feasibility and routes to $\Omega(\delta_t)$.

#### Realization\.

A valid commitment can still be phrased ambiguously or with unsupported implications, so surface realization remains a language generation problem evaluated separately \(§[6](https://arxiv.org/html/2605.16712#S7)\)\.

#### Implementation\.

CBEA is not a learned preference model. In our experiments, $R_t$ and $M_t$ are constructed from fixture schemas and runtime contract rules: confirmed hard predicates induce mandatory requirements, fixture witnesses mark covered requirements, and consequence debt obligations follow from confirmed state transitions. Oracle labels $(h^{\star}, E^{\star}, \mathcal{F}^{\star})$ are used only for scoring. They are not inputs to clarification, contract compilation, CBEA selection, candidate generation, LCV validation, repair routing, or prose realization. CBEA weights are fixed design priors $(\lambda_r, \lambda_c, \lambda_w, \lambda_d, \lambda_o) = (1, 2, 2, 1, 1)$, not learned, tuned on a development set, or tuned after the fact for metric maximization; they encode the priority that coverage and tail witnesses dominate ordinary relevance. Candidates share a schema (commitment type, evidence ids, claimed predicates, required slots, repair status, surface text); validators cover hard contract consistency, required evidence coverage, infeasible emission, and repair reason validity. These choices make the method auditable but also bound the claim: results test this fixed instantiation, not an optimized learned selector. Appendix [6](https://arxiv.org/html/2605.16712#A2) lists the reporting schema.

## 5 Evaluation and Benchmark Protocol

The evaluation follows the formal objects in Table [1](https://arxiv.org/html/2605.16712#S3.T1). We use narrative decision support as a hostile stress test because it combines noisy self-report, delayed consequences, household or social spillovers, long free-text background, and recovery after drift. The benchmark does not claim to measure real-world decision quality.

### 5.1 Fixture Construction

Fixtures that originate from real-user data are synthetic or composite cases built from aggregate failure patterns, never raw histories. The key transformation is from observed failure families to controlled fixtures with known evaluator labels: confirmed predicates $h^{\star}$, evidence witnesses $E^{\star}$, expected feasible set $\mathcal{F}^{\star}$, and allowed repair or abstention behavior. The construction pipeline (aggregate failure patterns → abstract stress mechanism → synthetic/composite profile rewrite → structured labels → controlled offline fixture) is detailed in Appendix [6](https://arxiv.org/html/2605.16712#A3.SS0.SSS0.Px2), which also reports the exported fixture buckets and stress surface mapping.

The fixtures cover seven stress surfaces: false hardening, hidden exception, witness drop, infeasible continuation, consequence debt, overpersonalization, and surface mismatch. Six are exported as primary fixture buckets in the matched manifest; overpersonalization is retained as an activation penalty rather than a separate bucket (Appendix [6](https://arxiv.org/html/2605.16712#A3.SS0.SSS0.Px2)).

### 5.2 Metrics

Let $\mathcal{D}$ be a controlled benchmark set, $\mathcal{D}_{emit}$ the subset with parseable emitted structured commitments, and $\mathcal{D}_0$ the subset whose oracle feasible set is empty. Let $q_f = 1$ when fixture $f$ emits a parseable structured commitment and $q_f = 0$ otherwise. Structured output availability is

$$\mathrm{Avail} = \frac{1}{|\mathcal{D}|} \sum_{f \in \mathcal{D}} q_f.$$

Let $v_f = 1$ when $\hat{a}_f$ violates at least one oracle predicate in $h_f^{\star}$, and $v_f = 0$ otherwise. Oracle hard constraint violation rate is reported over parseable emitted commitments:

$$\mathrm{OHCVR} = \frac{1}{|\mathcal{D}_{emit}|} \sum_{f \in \mathcal{D}_{emit}} v_f.$$

This oracle-level metric is stricter than the invariant covered by validators above: it can count hidden, uncompiled, or incorrectly realized fixture predicates that were not certified by $V_h$. Let $n_f = 1$ when the emitted control act carries a commitment in an infeasible fixture, and $n_f = 0$ for repair, abstention, or recontract. The infeasible emission rate is

$$\mathrm{NFER} = \frac{1}{|\mathcal{D}_0|} \sum_{f \in \mathcal{D}_0} n_f,$$

where a commitment output in this setting is an infeasible emission failure. Evidence coverage failure and consequence continuity failure are computed analogously from fixture labels. Repair correctness is computed over $\mathcal{D}_{repair}$, the subset of fixtures whose evaluator label expects repair, abstention, or recontract behavior (including but not limited to $\mathcal{D}_0$); it scores 1 when the emitted control act matches the expected repair category without making a commitment.

Surface realization is evaluated separately in the blinded model judge audit, where judges score surface coherence from rubric labels rather than an automatic proxy. The surface mismatch fixture bucket still contributes its structured contract, coverage, witness, consequence, infeasible emission, and repair labels to the automatic metrics in Table [2](https://arxiv.org/html/2605.16712#S6.T2); only surface coherence itself is not scored by an automatic proxy.

Metrics requiring a structured commitment are reported on $\mathcal{D}_{emit}$ with denominator shown. Aggregate system pass rates include timeout, no output, parse failure, partial output, blank output, and unrepaired error states as failures rather than silently removing them (Appendix [6](https://arxiv.org/html/2605.16712#A3.SS0.SSS0.Px1)).
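As a worked illustration of how the three rates use different denominators, the following Python sketch computes Avail, OHCVR, and NFER from per-fixture rows. The `Row` record and its field names are assumptions for illustration, and the sketch treats every parseable emission as a commitment-carrying act; it is not the released metric summarizer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    """Hypothetical per-fixture result record."""
    emitted: bool                    # q_f: parseable structured commitment emitted
    violates_oracle: bool = False    # v_f: meaningful only when emitted
    oracle_infeasible: bool = False  # fixture belongs to D_0


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by an empty denominator."""
    return numerator / denominator if denominator else None


def avail(rows: List[Row]) -> Optional[float]:
    # Denominator is all of D, so timeouts and parse failures count against Avail.
    return rate(sum(r.emitted for r in rows), len(rows))


def ohcvr(rows: List[Row]) -> Optional[float]:
    # Denominator is D_emit only.
    emitted = [r for r in rows if r.emitted]
    return rate(sum(r.violates_oracle for r in emitted), len(emitted))


def nfer(rows: List[Row]) -> Optional[float]:
    # Denominator is D_0; any commitment emitted there is a failure.
    infeasible = [r for r in rows if r.oracle_infeasible]
    return rate(sum(r.emitted for r in infeasible), len(infeasible))


# Illustrative rows only; not data from the paper.
rows = [
    Row(emitted=True),
    Row(emitted=True, violates_oracle=True),
    Row(emitted=False, oracle_infeasible=True),  # routed to repair or abstention
    Row(emitted=True, oracle_infeasible=True),   # infeasible emission
]
summary = {"Avail": avail(rows), "OHCVR": ohcvr(rows), "NFER": nfer(rows)}
```

Keeping the denominators explicit is the point of the sketch: a system can lower OHCVR simply by emitting less, which only shows up when Avail is reported alongside it.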

### 5.3 Baselines and Ablations

The run includes two isolating variants: validator-only separates validation and repair from evidence activation, and runtime without CBEA keeps structured state and repair while replacing CBEA with evidence selection without tail reserve. We use these rows as targeted component diagnostics for the activation and validation interfaces. Appendix [6](https://arxiv.org/html/2605.16712#A2), Table [6](https://arxiv.org/html/2605.16712#A2.T5) summarizes the nine comparison interfaces.

#### Model reporting\.

Controlled comparisons report model identifiers, decoding settings, context/generation budgets, retrieval budgets, tool availability, and memory availability\. Production traffic is not used for model comparison because deployed provider/model mix changes over time and is only operational context\.

#### Reproducibility\.

We provide an anonymized artifact that preserves privacy and includes the fixture generator, 360-fixture benchmark, runtime harness with CBEA selector and LCV validators, released model output result tables, metric summarizers, paired and case-cluster bootstrap scripts, selector-level MMR diagnostic, backend sensitivity configs, and privacy boundary checks. It supports reproduction of Table [2](https://arxiv.org/html/2605.16712#S6.T2), the long-history payload diagnostic, targeted ablations, backend sensitivity operating points, and model judge bootstrap summaries; it contains no raw user histories or production text, and production evidence is released only as aggregate counts.
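For readers unfamiliar with paired bootstrap comparisons of the kind the artifact mentions, a generic percentile version can be sketched as follows. This is a textbook-style sketch, not the released script: the function name, the resample count, the seed, and the percentile interval are assumptions, and the case-cluster variant is not shown.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(a: List[float], b: List[float], n_boot: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile interval for mean(a - b), resampling fixtures in matched pairs."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and aligned by fixture")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        # Resample fixture indices with replacement; both systems share the draw.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Illustrative per-fixture failure indicators for two hypothetical systems.
system_a = [1, 0, 1, 1, 0, 1, 0, 1]
system_b = [0, 0, 1, 0, 0, 1, 0, 0]
interval = paired_bootstrap_ci(system_a, system_b)
```

Resampling fixture indices rather than individual scores keeps the two systems matched on the same fixtures, which is what makes the comparison paired.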

## 6 Controlled Offline Experiments

We ran 360 synthetic and composite fixtures across the nine variants (Appendix [6](https://arxiv.org/html/2605.16712#A2), Table [6](https://arxiv.org/html/2605.16712#A2.T5)). All non-oracle variants used MiniMax-M2.7 in a matched offline run with temperature 0.2, a 2200-token cap, up to three JSON parse retries, and a 180-second timeout. The final run contains 3,240 attempted and evaluable rows. Oracle evidence is deterministic and included only as a reference point.
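The decoding settings above can be captured in a small configuration object with a parse-retry loop. The sketch below is hypothetical: `RunConfig`, `structured_call`, and the `generate` callable are invented names, the retry count is read as one initial attempt plus up to three retries, and timeout enforcement is left to the backend callable rather than implemented here.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RunConfig:
    """Settings reported for the matched offline run."""
    temperature: float = 0.2
    max_tokens: int = 2200
    json_parse_retries: int = 3
    timeout_seconds: int = 180


def structured_call(generate: Callable[[str, RunConfig], str], prompt: str,
                    cfg: RunConfig = RunConfig()) -> Optional[dict]:
    """Call a backend until it returns a JSON object or the retries run out.

    `generate` is a hypothetical backend wrapper that is expected to honor
    cfg.temperature, cfg.max_tokens, and cfg.timeout_seconds itself.
    """
    for _ in range(1 + cfg.json_parse_retries):  # assumed: initial try + retries
        raw = generate(prompt, cfg)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None  # counted as a parse failure, not silently dropped


# Fake backend for illustration: two malformed replies, then a valid object.
replies = iter(["not json", "{broken", '{"commitment_type": "plan"}'])
result = structured_call(lambda prompt, cfg: next(replies), "example prompt")
```

Returning `None` rather than raising keeps the failed row available to the caller, so it can be recorded in the attempted-run denominator.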

Table 2: Matched offline operating points over 360 synthetic/composite fixtures (MiniMax-M2.7). Att.: attempted runs; Inv.: invalid; Struct.: structured commitment availability; Repair: $\mathcal{D}_{repair}$ correctness; Cost: average prompt cost units. Appendix [6](https://arxiv.org/html/2605.16712#A8) reports OHCVR/ECF/Wit./Cons./NFER for all nine methods; Appendix [6](https://arxiv.org/html/2605.16712#A12) reports the uncompiled context boundary.

Table [2](https://arxiv.org/html/2605.16712#S6.T2) and Figure [2](https://arxiv.org/html/2605.16712#S6.F2) show a shift in operating point, not broad dominance. CBEA+LCV emits structured commitments on 0.5000 of fixtures with zero measured failures on covered surfaces; raw prompt stuffing emits on 0.7833 but has failures across every covered surface. The remaining CBEA+LCV fixtures route to repair, abstention, or recontract without making new commitments, with $\mathcal{D}_{repair}$ correctness $= 1.0000$. Raw remains competitive on availability because the fixtures expose relevant evidence directly. The runtime contribution is to make constraint, coverage, witness, and consequence behavior *measurable and enforceable* on emitted commitments.

This operating point is intentionally conservative. A system that emits on more cases can look better under a simple availability metric while silently violating hard predicates or losing required evidence. A system that refuses too often can also drive failure rates to zero without solving the activation problem. We therefore read Table [2](https://arxiv.org/html/2605.16712#S6.T2) together with Figure [2](https://arxiv.org/html/2605.16712#S6.F2): the target is not the upper-right corner, but a point where emitted commitments remain covered and unrepaired cases are explicitly routed.

[Figure 2 plot. x-axis: structured commitment availability (ticks 0.4 to 0.8); y-axis: mean control failure rate (ticks 0.0 to 0.6); labeled points: Oracle, Raw, CBEA+LCV; an arrow marks the "better" direction.]

Figure 2: Operating point view of Table [2](https://arxiv.org/html/2605.16712#S6.T2). Raw prompt stuffing emits more commitments but has higher control failures on covered surfaces; CBEA+LCV emits fewer commitments but reaches the zero-failure operating point within the validator scope. Other baselines (orange: with LCV gates; gray: ungated) are shown as context, not as a universal dominance claim. Dashed arrow: design shift raw → CBEA+LCV.

Property 1 explains why LCV can suppress infeasible commitments or commitments that violate validators; it does not ensure that the generator sees the right witnesses or consequence obligations. Those surfaces depend on activation: validator-only and runtime without CBEA still lose coverage, witnesses, and consequences (Appendix [6](https://arxiv.org/html/2605.16712#A8)). Thus validation alone can refuse infeasible cases, but CBEA supplies the evidence state that makes validation useful.

Targeted ablations (Appendix [6](https://arxiv.org/html/2605.16712#A7)) isolate validation, repair, and coverage/tail activation; a selector diagnostic (Appendix [6](https://arxiv.org/html/2605.16712#A9)) shows CBEA retains 0.9970 of typed control evidence versus 0.6960 for MMR. Backend sensitivity and the Hy3-preview output budget diagnostic are summarized below; horizon complexity and long-history payload diagnostics reuse the same harness.
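For context on the MMR comparison, the classic selector of Carbonell and Goldstein greedily trades query relevance against redundancy with already selected items. The Python sketch below uses Jaccard similarity over token sets and $\lambda = 0.7$; both are assumptions chosen to keep the example self-contained and do not reflect the configuration of the paper's diagnostic.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap; an assumed stand-in for any similarity function."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query: Set[str], docs: List[Set[str]], k: int,
               lam: float = 0.7) -> List[int]:
    """Greedy MMR: lam * sim(doc, query) - (1 - lam) * max sim(doc, selected)."""
    selected: List[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((jaccard(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * jaccard(docs[i], query) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative evidence snippets as token sets; index 3 plays the rare witness.
query = {"morning", "travel", "plan"}
docs = [
    {"morning", "travel", "train"},
    {"morning", "travel", "train", "fast"},
    {"plan", "budget"},
    {"allergy"},
]
picked = mmr_select(query, docs, k=2)
```

Because the score contains only relevance and redundancy, the rare `allergy` item has no way to be retained unless it happens to resemble the query, which is the gap the typed coverage and tail terms in CBEA are meant to close.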

[Figure: bar chart of attempted-run availability (y-axis ticks 0.0 to 0.6) for Raw+L, Long+L, and CBEA+L; the extracted bar labels read 0.003, 0.006, and 0.506.]

Liu et al. [2023](https://arxiv.org/html/2605.16712#bib.bib12); Zheng et al. [2023](https://arxiv.org/html/2605.16712#bib.bib13); Gu et al. [2024](https://arxiv.org/html/2605.16712#bib.bib14); Ye et al. [2024](https://arxiv.org/html/2605.16712#bib.bib15)

[6](https://arxiv.org/html/2605.16712#A13.T21)[6](https://arxiv.org/html/2605.16712#A13.T22)- Contract2Plan: verified contract\-grounded retrieval\-augmented optimization for BOM\-aware procurement and multi\-echelon inventory planning\.External Links:2601\.06164,[Link](https://arxiv.org/abs/2601.06164)Cited by:[§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px5.p1.1)\.
- J\. Carbonell and J\. Goldstein \(1998\) The use of MMR, diversity\-based reranking for reordering documents and producing summaries\. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 335–336\. External Links: [Document](https://dx.doi.org/10.1145/290941.291025), [Link](https://doi.org/10.1145/290941.291025) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Du, M\. Tian, S\. Ronanki, S\. Rongali, S\. B\. Bodapati, A\. Galstyan, A\. Wells, R\. Schwartz, E\. A\. Huerta, and H\. Peng \(2025\) Context length alone hurts LLM performance despite perfect retrieval\. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, pp\. 23281–23298\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1264), [Link](https://aclanthology.org/2025.findings-emnlp.1264/) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Y\. Wang, W\. Gao, L\. Ni, and J\. Guo \(2024\) A survey on LLM\-as\-a\-judge\. External Links: 2411\.15594, [Link](https://arxiv.org/abs/2411.15594) Cited by: [§6](https://arxiv.org/html/2605.16712#S7.p1.1)\.
- Y\. Hu, Z\. Long, J\. Guo, X\. Sui, X\. Fu, W\. Zhao, Y\. Zhao, and B\. Qin \(2026\) OP\-Bench: benchmarking over\-personalization for memory\-augmented personalized conversational agents\. External Links: 2601\.13722, [Link](https://arxiv.org/abs/2601.13722) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px2.p1.1)\.
- T\. S\. Kim, Y\. Lee, Y\. Park, J\. Kim, Y\. Kim, and J\. Kim \(2025\) CUPID: evaluating personalized and contextualized alignment of LLMs from interactions\. Note: Accepted to COLM 2025\. External Links: 2508\.01674, [Link](https://arxiv.org/abs/2508.01674) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Li, X\. Yu, Z\. Ni, Y\. Zeng, Y\. Xu, Z\. Zhang, X\. Li, J\. Sang, X\. Duan, X\. Wang, C\. Liu, and J\. Tan \(2026a\) TiMem: temporal\-hierarchical memory consolidation for long\-horizon conversational agents\. External Links: 2601\.02845, [Link](https://arxiv.org/abs/2601.02845) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px2.p1.1)\.
- S\. S\. Li, B\. Paranjape, K\. Oktar, Z\. Ma, G\. Zhou, L\. Guan, N\. Zhang, S\. Park, L\. Chen, D\. Yang, Y\. Tsvetkov, and A\. Celikyilmaz \(2026b\) HorizonBench: long\-horizon personalization with evolving preferences\. External Links: 2604\.17283, [Link](https://arxiv.org/abs/2604.17283) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\) G\-Eval: NLG evaluation using GPT\-4 with better human alignment\. External Links: 2303\.16634, [Link](https://arxiv.org/abs/2303.16634) Cited by: [§6](https://arxiv.org/html/2605.16712#S7.p1.1)\.
- A\. Montazeralghaem, G\. Tennenholtz, C\. Boutilier, and O\. Meshi \(2025\) Asking clarifying questions for preference elicitation with large language models\. External Links: 2510\.12015, [Link](https://research.google/pubs/asking-clarifying-questions-for-preference-elicitation-with-large-language-models/) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Qiu, X\. Zhao, Y\. Zhang, Y\. Bai, W\. Wang, H\. Cheng, F\. Feng, and T\. Chua \(2025\) Measuring what makes you unique: difference\-aware user modeling for enhancing LLM personalization\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 21258–21277\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1095), [Link](https://aclanthology.org/2025.findings-acl.1095/) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Shi, W\. Xu, Z\. Zeqi, X\. Zi, Q\. Wu, and M\. Xu \(2025\) PersonaX: a recommendation agent\-oriented user modeling framework for long behavior sequence\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 5764–5787\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.300), [Link](https://aclanthology.org/2025.findings-acl.300/) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Ye, Y\. Wang, Y\. Huang, D\. Chen, Q\. Zhang, N\. Moniz, T\. Gao, W\. Geyer, C\. Huang, P\. Chen, N\. V\. Chawla, and X\. Zhang \(2024\) Justice or prejudice? quantifying biases in LLM\-as\-a\-judge\. External Links: 2410\.02736, [Link](https://arxiv.org/abs/2410.02736) Cited by: [§6](https://arxiv.org/html/2605.16712#S7.p1.1)\.
- S\. Zhang, D\. Yu, H\. Sharma, H\. Zhong, Z\. Liu, Z\. Yang, S\. Wang, H\. Hassan, and Z\. Wang \(2024\) Self\-exploring language models: active preference elicitation for online alignment\. External Links: 2405\.19332, [Link](https://arxiv.org/abs/2405.19332) Cited by: [§2](https://arxiv.org/html/2605.16712#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\. External Links: 2306\.05685, [Link](https://arxiv.org/abs/2306.05685) Cited by: [§6](https://arxiv.org/html/2605.16712#S7.p1.1)\.

