ContextGuard: Structured Self-Auditing for Context Learning in Language Models

arXiv cs.CL Papers

Summary

Introduces ContextGuard, a structured self-auditing framework that improves LLM context learning by decomposing model self-assessment into confirmed and uncertain categories and applying targeted revisions, achieving a task-solving rate increase from 9.64% to 13.85% on Qwen3.5-4B on the CL-Bench benchmark.

arXiv:2605.26827v1 Announce Type: new Abstract: Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.
Original Article
View Cached Full Text

Cached at: 05/27/26, 09:11 AM

# ContextGuard: Structured Self-Auditing for Context Learning in Language Models
Source: [https://arxiv.org/html/2605.26827](https://arxiv.org/html/2605.26827)
Hongbo Jin1Chi Wang211footnotemark:1Haoran Tang1Zhongjing Du1 Xu Jiang1Jingqi Tian3Qiaoman Zhang1Jiayu Ding1 1Peking University2SCUT3Tsinghua University

###### Abstract

Recent benchmarks reveal that despite strong reasoning capabilities, large language models \(LLMs\) still struggle to faithfully apply complex contextual knowledge\. These failures are often not wholesale reasoning collapses: in context\-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format\-sensitive requirements\. Generic self\-refinement is poorly matched to this setting because unconstrained revision may repair one flaw while accidentally breaking already\-satisfied constraints\. To address this, we introduce ContextGuard, a structured self\-auditing framework for protected targeted revision in context learning\. ContextGuard decomposes model self\-assessment into confirmed constraints, confirmed facts, possibly missed information, and possibly wrong content\. Guided by category\-conditioned specialist signals, it edits uncertain regions while preserving verified content through explicit protection constraints\. Experiments on CL\-Bench, a long\-context benchmark with densely specified task requirements, show that ContextGuard improves the overall task\-solving rate from 9\.64% to 13\.85% \(\+4\.21 percentage points\) on Qwen3\.5\-4B, outperforming generic refinement baselines and reducing failures across format, procedural, calculation, conditional\-rule, and style/persona requirements\.

ContextGuard: Structured Self\-Auditing for Context Learning in Language Models

Hongbo Jin1††thanks:Equal contributionChi Wang211footnotemark:1Haoran Tang1Zhongjing Du1Xu Jiang1Jingqi Tian3Qiaoman Zhang1Jiayu Ding1††thanks:Corresponding author1Peking University2SCUT3Tsinghua University

## 1Introduction

Recent large language models \(LLMs\) have achieved remarkable progress in reasoning\-intensive domains such as mathematical problem solving, code generation, and agentic planningSinghet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib1)\)\. Scaling test\-time reasoningJinet al\.\([2026a](https://arxiv.org/html/2605.26827#bib.bib31)\)and reinforcement learning has further improved multi\-step inference capabilitiesWanget al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib3)\); Guoet al\.\([2025](https://arxiv.org/html/2605.26827#bib.bib2)\); Jinet al\.\([2026b](https://arxiv.org/html/2605.26827#bib.bib25),[c](https://arxiv.org/html/2605.26827#bib.bib26)\)\. However, despite these advances, a fundamental capability required by real\-world applications remains underdeveloped: the ability to faithfully learn from and apply complex contextual knowledge provided at inference time\. In practical deployments, models must increasingly operate in context\-rich environmentsBaiet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib4)\)—such as enterprise rulebooks, legal regulations, and long interaction histories—where success depends on accurately using newly provided information rather than relying solely on static pre\-trained knowledge\.

Recent benchmarks such as CL\-BenchDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)expose this limitation: even frontier reasoning models perform surprisingly poorly on context\-learning tasks, with the strongest models achieving less than 24% task\-solving rate\. The challenge is not merely understanding long contexts, but satisfying many contextual requirements simultaneously\. CL\-Bench operationalizes this through 31,607 binary evaluation rubrics, averaging 16\.6 and up to 114 criteria per task; a task is counted as correct only if every associated requirement is satisfied; a task is counted as correct only if every associated requirement is satisfied\. As a result, models may solve the primary reasoning objective while still failing due to missed constraints, format rules, role instructions, or contextual exceptions\.

![Refer to caption](https://arxiv.org/html/2605.26827v1/images/case_compare.png)Figure 1:Case comparison between vanilla self\-refinement andContextGuardon a representative CL\-bench example\. The baseline draft contains someground\-truth faults\(red\) andcorrect content\(green\)\. Generic refinement successfully fixes some errors but also introducesrevision regressionby incorrectly rewriting previously correct content \(orange\)\. In contrast, ContextGuard explicitly distinguishes betweenfix targetsandprotected correct regions, enabling selective correction while preserving verified constraints and facts \(blue\)\.This requirement structure changes what an effective inference\-time method must do\. In Qwen3\.5\-4B baseline outputs, 48\.3% of failed tasks miss no more than three criteria, and 72\.6% miss no more than five\. These near misses suggest that models often covers the central answer but misses scattered requirements\. Generic self\-refinementMadaanet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib10)\); Shinnet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib11)\)can partially repair such errors, but unconstrained revision may also damage previously satisfied constraints\. Under strict all\-requirements evaluation, such revision regressions can erase the benefit of fixing other errors\. Existing reasoning\-oriented approaches improve deliberative depthYaoet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib9)\); Guoet al\.\([2025](https://arxiv.org/html/2605.26827#bib.bib2)\), but do not explicitly preserve already\-correct constraints during revision\.

In this work, we present ContextGuard, a structured self\-auditing framework for protected targeted editing\. ContextGuard decomposes generated content into confirmed facts, verified constraints, missed requirements, and potentially wrong reasoning, then revises uncertain regions while anchoring verified content\. Category\-conditioned specialist signals further target structured failures such as format, workflow, rule\-fidelity, and numerical\-comparison errors\.

We evaluate ContextGuard on CL\-Bench across four categories of context\-learning tasks: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery & simulation\. On Qwen3\.5\-4B, ContextGuard improves the overall task\-solving rate from 9\.64% to 13\.85% \(\+4\.21 percentage points\), with consistent gains across all task categories\. Further analyses show improvements across multiple structured requirement types, including format, procedural coordination, verification, conditional rules, and style/persona constraints\.

Our contributions are summarized as follows:

- •We identify a repair\-preservation challenge in constraint\-dense context learning: effective revision must fix missed requirements without regressing already\-satisfied constraints\.
- •We propose ContextGuard, a structured self\-auditing framework that separates fix targets from protected constraints through epistemic stratification, category\-conditioned specialist signals and guarded revision\.
- •We demonstrate substantial improvements on CL\-Bench across multiple context\-learning categories and provide requirement\-level analyses of near\-miss behavior, repair/regression dynamics, and diverse contextual requirement types\.

## 2Related Work

Unlike traditionalIn\-Context Learningwhich primarily focuses on learning task patterns from a few demonstrationsBrownet al\.\([2020](https://arxiv.org/html/2605.26827#bib.bib12)\),Context Learningrequires models to to acquire and faithfully apply complex, often novel, contextual knowledge provided at inference timeDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)\. This capability is crucial for real\-world applications where models must adhere to specific enterprise manuals, legal regulations, or procedural workflows that lie beyond their parametric knowledge\. Recent evaluations on benchmarks like CL\-BenchDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)have exposed a significant gap: even state\-of\-the\-art modelsSinghet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib1)\); Guoet al\.\([2025](https://arxiv.org/html/2605.26827#bib.bib2)\)with strong general reasoning abilities frequently struggle with context\-specific constraints\. While long\-context LLMs have made strides in information retrieval, successfully solving context\-learning tasks requires more than surface\-level retrievalHsiehet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib13)\); it demands faithful use of the provided constraints during both generation and revision\. Our work builds on these insights by introducing a structured auditing mechanism to monitor contextual requirements and constrain revision toward context\-faithful outputs\.

This view differs from standard long\-context retrieval and reasoning\-time scaling\. Retrieval\-oriented methods improve exposure to relevant information, and reasoning\-oriented methods increase deliberation, but neither directly addresses the revision objective: preserving satisfied contextual requirements while repairing unsatisfied ones\. Further related work is discussed in the appendix[A](https://arxiv.org/html/2605.26827#A1)\.

## 3Method

We presentContextGuard, a structured self\-auditing framework designed to improve context learning in language models\. Unlike conventional self\-refinement methods that perform unconstrained rewriting over the entire response, ContextGuard explicitly models contextual fidelity during inference by separating reliable content from uncertain regions and performing targeted revision under protection constraints\.

![Refer to caption](https://arxiv.org/html/2605.26827v1/images/pipeline.png)Figure 2:Overview ofContextGuard\. Given an initial draft generated from the input context and task specification, ContextGuard performs a structured self\-audit that partitions model judgments into four epistemic categories:\(A\)confirmed constraints,\(B\)confirmed facts/data,\(C\)possibly missed information, and\(D\)possibly wrong reasoning or content\. Category\-conditioned specialist signals are then merged into aFix Setand aProtection Set, enabling selective correction while preserving verified constraints and facts\.Figure[2](https://arxiv.org/html/2605.26827#S3.F2)illustrates the overall pipeline\. Given a context\-learning task, the framework first generates an initial draft, then conducts structured self\-auditing to explicitly separate reliable content from uncertain or potentially problematic regions\. Based on the audit results and category\-conditioned specialist signals, ContextGuard finally performs protected revision that selectively edits uncertain content while preserving verified information\.

### 3\.1Overview

Given a contextCCand a task queryqq, a language model first produces an initial response:

y\(0\)=fθ​\(C,q\),y^\{\(0\)\}=f\_\{\\theta\}\(C,q\),\(1\)wherefθf\_\{\\theta\}denotes the underlying language model\. ContextGuard subsequently performs a structured audit over the generated response:

𝒜​\(y\(0\)\)→\{QA,QB,QC,QD\},\\mathcal\{A\}\(y^\{\(0\)\}\)\\rightarrow\\\{Q\_\{A\},Q\_\{B\},Q\_\{C\},Q\_\{D\}\\\},\(2\)where the four subsetsQA,QB,QC,QDQ\_\{A\},Q\_\{B\},Q\_\{C\},Q\_\{D\}correspond to different epistemic regions over its own output along two dimensions: correctness and certainty\.

For a task typett, ContextGuard may also activate a category\-conditioned specialist signal:

𝒮t​\(C,q,y\(0\)\)→\(𝒪t,ℰt\),\\mathcal\{S\}\_\{t\}\(C,q,y^\{\(0\)\}\)\\rightarrow\(\\mathcal\{O\}\_\{t\},\\mathcal\{E\}\_\{t\}\),\(3\)where𝒪t\\mathcal\{O\}\_\{t\}denotes satisfied specialist requirements andℰt\\mathcal\{E\}\_\{t\}denotes detected specialist issues\. Depending on the category,𝒮t\\mathcal\{S\}\_\{t\}is implemented either as a separate checker or as specialist criteria integrated into the structured audit\.

Based on the auditing results and specialist signals, the framework constructs a structured feedback signal consisting of a fix setℱ\\mathcal\{F\}containing potentially problematic content and a protection set𝒫\\mathcal\{P\}containing verified information that should remain unchanged\. The final response is then generated through constrained revision:

y∗=ℛguarded​\(C,q,y\(0\),ℱ,𝒫\),y^\{\\ast\}=\\mathcal\{R\}\_\{\\text\{guarded\}\}\(C,q,y^\{\(0\)\},\\mathcal\{F\},\\mathcal\{P\}\),\(4\)whereℛguarded\\mathcal\{R\}\_\{\\text\{guarded\}\}denotes the protected revision process\. The central idea behind ContextGuard is that effective self\-correction requires distinguishing*what should be revised*from*what must be preserved*\. Under dense all\-requirement evaluation, this distinction is part of the objective rather than a conservative decoding preference: a useful revision should increase the number of satisfied requirements while minimizing regressions on requirements that already pass\.

### 3\.2Draft Generation

ContextGuard begins by generating an initial draft conditioned on the original context and user query\. Additionally, we introduce a lightweight reminder augmentation mechanism that explicitly re\-emphasizes the original system constraints and final task instruction during draft generation\. Specifically, we append an auxiliary reminder constructed from the original system prompt and task request:

r=Reminder​\(s,q\),r=\\texttt\{Reminder\}\(s,q\),\(5\)wheressdenotes the original system instruction\. The draft generation after augmentation becomes:

yrem\(0\)=fθ​\(C,q,r\)\.y^\{\(0\)\}\_\{\\text\{rem\}\}=f\_\{\\theta\}\(C,q,r\)\.\(6\)
This design is motivated by the observation that, in long inputs, global instructions and always\-on constraints can become less salient relative to the final task request\. By explicitly re\-anchoring critical contextual constraints before generation, the model becomes more likely to preserve task requirements throughout the reasoning process\. In practice, the reminder\-augmented draft consistently provides a stronger revision starting point for downstream self\-auditing and protected editing\.

### 3\.3Structured Self\-Auditing

A key limitation of existing self\-refinement methods is that they typically ask models to generically “check for errors” without distinguishing reliable content from uncertain regions\. As a result, the revision process frequently rewrites already\-correct information and introduces unnecessary degradation\. To address this issue, ContextGuard introduces*epistemic stratification*, a structured self\-auditing mechanism that explicitly decomposes the generated response into different semantic confidence regions before revision\.

Given an initial drafty\(0\)y^\{\(0\)\}, the model performs structured self\-assessment:

𝒜​\(y\(0\)\)=\(QA,QB,QC,QD\),\\mathcal\{A\}\(y^\{\(0\)\}\)=\(Q\_\{A\},Q\_\{B\},Q\_\{C\},Q\_\{D\}\),\(7\)whereQAQ\_\{A\}contains confirmed correct reasoning steps and satisfied constraints,QBQ\_\{B\}contains verified contextual data and grounded evidence,QCQ\_\{C\}represents potentially missed constraints or incomplete coverage, andQDQ\_\{D\}contains potentially incorrect reasoning, calculations, or conclusions\. The auditing stage outputs a structured JSON representation that explicitly separates trusted regions from uncertain regions\. Importantly,QAQ\_\{A\}andQBQ\_\{B\}are treated as protected regions during revision, whileQCQ\_\{C\}andQDQ\_\{D\}become candidate regions for targeted correction\.

Although the overall auditing framework is unified, different context\-learning categories exhibit distinct failure patterns\. ContextGuard therefore adapts the specialist criteria according to task type\. For domain knowledge reasoning tasks, the audit primarily focuses on format and structural consistency, role and persona adherence and contextual applicability\. Procedural execution tasks additionally require verification of workflow ordering, agent coordination, timing constraints, and gate conditions\. Rule\-system tasks emphasize rule fidelity, exception handling, terminology consistency, and applicability conditions, while empirical discovery tasks focus on numerical correctness, trend interpretation, evidence grounding, and comparison completeness\. For empirical tasks, each issue is further assigned a fine\-grained error type such asnumeric,comparison, orcoverage, enabling more precise downstream revision\.

### 3\.4Category\-Conditioned Specialist Signals

While structured self\-auditing captures general uncertainty patterns, some context\-learning failures are highly structured and difficult to comprehensively identify through generic reflection alone\. ContextGuard therefore includes a modular specialist\-signal layer for structured contextual requirements\. The layer is unified at the interface level: each specialist signal produces satisfied requirements𝒪t\\mathcal\{O\}\_\{t\}and detected issuesℰt\\mathcal\{E\}\_\{t\}, which are then consumed by the same guarded revision mechanism\.

For categories whose failures are naturally checkable through explicit structural constraints, such as formatting or workflow order, we instantiate the specialist layer as a separate checker\. The format signal checks structural formatting requirements such as section organization, JSON validity, ordering constraints, role consistency, citation format, and forbidden content\. The procedural signal examines step ordering, agent routing consistency, gate\-check execution, timing constraints, logging completeness, and safety behaviors\.

For rule\-system and empirical tasks, the most important errors are more tightly coupled with the reasoning content itself\. We therefore integrate specialist criteria directly into the structured audit rather than adding a separate verifier call\. Rule\-system criteria focus on exact rule fidelity, exception handling, numerical conditions, terminology consistency, and applicability boundaries\. Empirical criteria focus on numerical correctness, comparison completeness, trend interpretation, unit consistency, and evidence grounding\. This design keeps ContextGuard unified while allowing the form of specialist feedback to match the dominant failure pattern of each category\.

### 3\.5Protected Targeted Revision

After auditing and specialist signaling, ContextGuard constructs two structured sets for targeted revision\. The fix set aggregates all potentially problematic regions identified during previous stages:

ℱ=QC∪QD∪ℰt,\\mathcal\{F\}=Q\_\{C\}\\cup Q\_\{D\}\\cup\\mathcal\{E\}\_\{t\},\(8\)which includes missed contextual constraints, potentially incorrect reasoning, and category\-specific specialist issues\. Simultaneously, the framework constructs a protection set:

𝒫=QA∪QB∪𝒪t,\\mathcal\{P\}=Q\_\{A\}\\cup Q\_\{B\}\\cup\\mathcal\{O\}\_\{t\},\(9\)which contains all verified content that should remain unchanged during revision\.

The final revision stage performs constrained editing conditioned on both the fix set and protection set\. The model is instructed to supplement missed constraints, correct only verified problematic regions, and fix explicitly detected violations, while preserving all protected content\. This transforms self\-refinement from unconstrained rewriting into a structured editing process centered on minimal necessary modification\. The protection set is therefore not merely a safety heuristic\. It represents the preservation side of the repair\-preservation tradeoff induced by strict all\-requirement scoring: already\-satisfied constraints should remain satisfied while the model repairs missing or incorrect requirements\.

To further stabilize the revision process, we introduce a lightweight revision guard that checks for significant deviation from the original draft\. If the revised output exhibits excessive structural or length deviation, ContextGuard discards the revision and falls back to the original draft\. This rollback mechanism further improves revision stability in long\-context context\-learning settings\.

## 4Experiments

### 4\.1Experimental Setup

#### Benchmark\.

We evaluate ContextGuard onCL\-Bench, a recently proposed benchmark designed to measure context learning capabilities in language models\. CL\-Bench contains four task categories:Domain Knowledge Reasoning,Procedural Task Execution,Rule System Application, andEmpirical Discovery & Simulation\. Unlike conventional reasoning benchmarks that primarily evaluate parametric reasoning ability, CL\-Bench requires models to faithfully apply newly provided contextual knowledge at inference time\.

#### Evaluation Metric\.

Following the official rubric\-based evaluation framework of CL\-bench, we report the overalltask solving rate, where a task is considered successful only if all associated rubrics are judged asyes\. Unlike the original protocol that jointly evaluates all rubrics in a single judgment, we assess each rubric independently using three repeated evaluations with majority voting to improve evaluation stability and reduce rubric\-level overestimation\.

Beyond overall solving rate, we additionally conduct a rubric\-level diagnostic analysis over five requirement types:format/lexical constraints,procedure/agent coordination,calculation/verification/standards,conditional rules, andstyle/audience/persona\. This taxonomy enables fine\-grained analysis of structured contextual requirements beyond aggregate task scores\.

#### Base Models\.

We primarily conduct full comparisons onQwen3\.5\-4B, a compact reasoning\-oriented language model with substantial headroom on context\-learning tasks\. Qwen3\.5\-4B is also the default model for the diagnostic analyses, ablations, and appendix studies unless explicitly stated otherwise\. To assess whether the repair\-preservation mechanism generalizes beyond this compact setting, we additionally evaluateQwen3\.5\-9Bas a stronger base model under the same ContextGuard pipeline and judge protocol\. All base\-model generations use the corresponding thinking mode\. Unless otherwise specified, all experiments use greedy decoding with identical inference settings between baseline inference and our framework\.

#### Judge Model\.

Rubric evaluation is performed usingDeepSeek\-R1\-Distill\-Qwen\-32B\-AWQ\. Following our evaluation protocol, each rubric is assessed independently with three repeated judgments and majority voting\. The same judge configuration is used for all methods\.

To evaluate the reliability of the proposed protocol, we repeatedly score the same set of model outputs under identical judge settings\. Compared with the original single\-pass joint evaluation protocol, our rubric\-level repeated evaluation achieves substantially higher consistency, with agreement rates of approximately 92% at the rubric level and 97% at the task level across repeated runs\. We further observe over 90% agreement with the original GPT\-5\.1\-based CL\-Bench evaluation pipeline on a sampled subset of tasks, suggesting that the proposed protocol improves evaluation stability while remaining highly consistent with the original benchmark judgments\.

#### Implementation Details\.

ContextGuard operates entirely at inference time without additional training or parameter updates\. All auditing, specialist signaling, and revision stages are implemented through prompting\. During self\-auditing, the model generates structured JSON outputs describing the four epistemic regions\. specialist signals are activated sequentially according to task type, either through separate checks or audit\-integrated criteria\. During revision, the model receives both the fix set and protection set together with explicit editing constraints\. Unless otherwise specified, we perform at most one revision round to balance effectiveness, stability, and inference cost\.

### 4\.2Main Results

Table[1](https://arxiv.org/html/2605.26827#S4.T1)reports the overall performance on CL\-Bench\. ContextGuard consistently improves task\-solving rate across all four task categories\. On Qwen3\.5\-4B, our framework improves the average solving rate from 9\.64% to 13\.85%, yielding a substantial gain of \+4\.21 percentage points\. Notably, the improvements are consistent across all four macro\-categories, demonstrating that the proposed framework generalizes across diverse context\-learning scenarios rather than overfitting to a specific failure pattern\.

We further include a stronger\-model generalization check using Qwen3\.5\-9B to evaluate whether ContextGuard remains beneficial when the underlying model is stronger\. On Qwen3\.5\-9B, ContextGuard improves the overall task\-solving rate from 10\.43% to 15\.80% \(\+5\.37 pp\), with positive gains in every task category, including Domain Knowledge Reasoning \(\+6\.34 pp\), Empirical Discovery & Simulation \(\+6\.03 pp\), and Procedural Task Execution \(\+5\.52 pp\)\. We omit Self\-Refine for the 9B setting because this additional experiment tests the generality of the ContextGuard pipeline rather than repeating the full baseline suite on every model size\.

ModelMethodOverallDomainKnowledgeReasoningRule SystemApplicationProcedural TaskExecutionEmpiricalDiscovery &SimulationQwen3\.5\-4BBaseline9\.649\.809\.909\.139\.55Qwen3\.5\-4BSelf\-Refine10\.4811\.7610\.429\.349\.05Qwen3\.5\-4BΔSelfRefine\\Delta\_\{\\text\{SelfRefine\}\}\+0\.84\+1\.96\+0\.52\+0\.21\-0\.50Qwen3\.5\-4BContextGuard13\.8514\.4813\.0714\.0113\.57Qwen3\.5\-4BΔContextGuard\\Delta\_\{\\text\{ContextGuard\}\}\+4\.21\+4\.68\+3\.17\+4\.88\+4\.02Qwen3\.5\-9BBaseline10\.4311\.3110\.2510\.627\.54Qwen3\.5\-9BContextGuard15\.8017\.6514\.1316\.1413\.57Qwen3\.5\-9BΔContextGuard\\Delta\_\{\\text\{ContextGuard\}\}\+5\.37\+6\.34\+3\.88\+5\.52\+6\.03Table 1:Main results and stronger\-model generalization on CL\-Bench\. Task\-solving rates \(%\) are computed with the official CL\-Bench denominators and the same per\-requirement majority\-vote judge protocol\.ΔSelfRefine\\Delta\_\{\\text\{SelfRefine\}\}andΔContextGuard\\Delta\_\{\\text\{ContextGuard\}\}respectively reports Self\-Refine’s and ContextGuard’s absolute gains over the corresponding baselines in percentage points\.Unless explicitly stated otherwise, the remaining analyses use Qwen3\.5\-4B, for which we run the full comparison suite\. For generic refinement baselines, we use the same initial responses and judge protocol as ContextGuard; Self\-Refine performs one round of self\-feedback followed by unconstrained revision\. Compared with Self\-Refine, ContextGuard achieves substantially larger gains because it addresses two weaknesses of generic revision: treating all generated content as freely editable, which can cause destructive rewrites, and lacking explicit checks for structured contextual failures such as format, or rule\-application errors\.

### 4\.3Requirement\-Level Failure Analysis

CL\-Bench identifies broad context\-learning failures such as ignored context, misused context, and format\-following errors\. To measure where ContextGuard improves behavior more precisely, we conduct a requirement\-level diagnostic analysis before and after applying ContextGuard\. Rather than assigning each failed task to a single coarse failure mode, we group failed criteria by the type of contextual requirement they test:format/lexical constraints,procedure/agent coordination,calculation/verification/standards,conditional rules, andstyle/audience/persona\.

Rubric TypeBaselineContextGuardFormat / lexical26\.0123\.64Procedure / agent24\.6524\.39Calc\. / verify / standards31\.9229\.71Conditional rules20\.4017\.59Style / audience / persona24\.7821\.75Table 2:Fine\-grained rubric failure analysis on CL\-Bench \(%\)\. Failure rates are task\-averaged within each rubric type\.Table[2](https://arxiv.org/html/2605.26827#S4.T2)shows that the improvements are not limited to surface formatting\. ContextGuard also reduces failures related to procedural coordination, verification\-oriented requirements, conditional rules, and audience/persona adaptation\. These results suggest that many CL\-Bench failures arise from structured contextual requirements rather than isolated reasoning errors\.

### 4\.4Ablation Study

We conduct ablation studies to quantify the contribution of each major component in ContextGuard\. Table[3](https://arxiv.org/html/2605.26827#S4.T3)reports performance after removing each modules from the full framework\.

Removing structured self\-auditing leads to the largest performance drop, indicating that epistemic stratification is central to reliable self\-correction in context\-learning tasks\. Removing the protection set \(A\+B\) while retaining the fix set \(C\+D\) still improves over removing structured auditing entirely, suggesting that targeted correction is beneficial but insufficient on its own\. Without explicitly preserving verified content, part of the gain is offset by destructive revisions introduced during editing\.

VariantOverall \(%\)Full Framework13\.85\- Structured Self\-Audit\(A\+B\)11\.06\- Structured Self\-Audit10\.53\- Reminder Augmentation12\.22\- Specialist Signals12\.84Table 3:Ablation study of ContextGuard on CL\-Bench\.Performance further decreases when reminder augmentation or specialist signals are removed, indicating that explicit contextual guidance and targeted failure detection provide complementary benefits beyond structured auditing alone\.

![Refer to caption](https://arxiv.org/html/2605.26827v1/images/subcategory.png)Figure 3:Solving rate \(%\) by CL\-bench sub\-category\. Sub\-categories are ordered and color\-grouped by the four benchmark macro\-categories\.![Refer to caption](https://arxiv.org/html/2605.26827v1/images/length.png)Figure 4:Solving rate \(%\) by input\-length bin on CL\-bench\. Length is measured as the total number of tokens in all message contents \(tiktoken,cl100k\_base\)\. Bins are left\-closed intervals: 0\-4K, 4\-8K, 8\-16K, 16\-32K, 32K\+\. The shaded band andΔ\\Deltalabels report ContextGuard’s absolute gain over the baseline \(percentage points\) per bin\.Overall, the ablation results suggest that ContextGuard’s effectiveness arises from the interaction between epistemic decomposition, protected editing, reminder augmentation, and specialist signaling with each component contributing complementary, non\-redundant benefits\. In particular, the protection\-set ablation connects directly to the revision dynamics above: identifying fix targets is not enough under strict scoring unless the revision process also preserves constraints that already pass\.

### 4\.5Performance Across Context Lengths

We also analyze model performance across different context\-length ranges\. Figure[4](https://arxiv.org/html/2605.26827#S4.F4)compares three inference\-time pipelines:Baseline, one\-round genericSelf\-Refine, andContextGuard\. Across length bins, all methods exhibit a generally downward trend with local fluctuations, reflecting the increasing difficulty of long\-context CL tasks\. Nevertheless, ContextGuard consistently outperforms the baseline across all length ranges, maintaining stable gains of \+2\.7 to \+5\.8 percentage points\. In contrast, generic Self\-Refine provides only marginal improvements in shorter contexts and collapses in the 32K\+ bin\. This suggests that unconstrained rewriting becomes increasingly unstable on highly constraint\-dense inputs\. By comparison, ContextGuard still preserves a positive gain in the same regime\. We hypothesize that longer contexts amplify failure modes such as constraint omission, contextual drift, and format inconsistency while also increasing the risk of destructive revisions\. These results suggest that structured self\-auditing and protected revision become increasingly important as context complexity grows\.

## 5Conclusion

We presentedContextGuard, an inference\-time framework for context learning under dense all\-requirement evaluation\. Instead of treating self\-refinement as unconstrained rewriting, ContextGuard separates verified constraints and facts from missed or uncertain content, then performs protected targeted revision with category\-conditioned specialist signals\. On CL\-Bench, this repair\-preservation design improves task\-solving performance across all four categories and reduces failures in format, procedure, calculation/verification, conditional\-rule, and style/persona requirements\. Our analyses show that many failures are near misses: models often satisfy most requirements, and gains depend on repairing the remaining errors without regressing already\-correct content\. These results suggest that reliable context learning requires preservation\-aware revision, not only stronger reasoning or longer contexts\.

## Limitations

While ContextGuard proposes an effective structured self\-auditing pipeline for context\-learning tasks, our experiments focus primarily on the Qwen3\.5 series due to computational resource constraints, without extensively covering a broader range of model families and scales\. In addition, all experiments are conducted exclusively on CL\-Bench\. Although CL\-Bench covers diverse contextual reasoning tasks, it remains unclear whether the observed improvements generalize to other context\-learning benchmarks with different context sources, task distributions, or evaluation settings\. Future work will evaluate ContextGuard on additional open\-source and proprietary models, broader model scales, and other context\-learning benchmarks\.

## References

- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\)Self\-rag: learning to retrieve, generate, and critique through self\-reflection\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 9112–9141\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/25f7be9694d7b32d5cc670927b8091e1-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\)LongBench: a bilingual, multitask benchmark for long context understanding\.External Links:2308\.14508,[Link](https://arxiv.org/abs/2308.14508)Cited by:[§1](https://arxiv.org/html/2605.26827#S1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 1877–1901\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p1.1),[§2](https://arxiv.org/html/2605.26827#S2.p1.1)\.
- S\. Dhuliawala, M\. Komeili, J\. Xu, R\. Raileanu, X\. Li, A\. Celikyilmaz, and J\. Weston \(2024\)Chain\-of\-verification reduces hallucination in large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 3563–3578\.External Links:[Link](https://aclanthology.org/2024.findings-acl.212/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.212)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1)\.
- S\. Dou, M\. Zhang, Z\. Yin, C\. Huang, Y\. Shen, J\. Wang, J\. Chen, Y\. Ni, J\. Ye, C\. Zhang, H\. Xie, J\. Hu, S\. Wang, W\. Wang, Y\. Xiao, Y\. Liu, Z\. Xu, Z\. Guo, P\. Zhou, T\. Gui, Z\. Wu, X\. Qiu, Q\. Zhang, X\. Huang, Y\. Jiang, D\. Wang, and S\. Yao \(2026\)CL\-bench: a benchmark for context learning\.External Links:2602\.03587,[Link](https://arxiv.org/abs/2602.03587)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p1.1),[Appendix B](https://arxiv.org/html/2605.26827#A2.SS0.SSS0.Px2.p1.1),[Appendix B](https://arxiv.org/html/2605.26827#A2.SS0.SSS0.Px3.p1.1),[Table 4](https://arxiv.org/html/2605.26827#A2.T4),[Appendix B](https://arxiv.org/html/2605.26827#A2.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p2.1),[§2](https://arxiv.org/html/2605.26827#S2.p1.1)\.
- Z\. Gou, Z\. Shao, Y\. Gong, y\. shen, Y\. Yang, N\. Duan, and W\. Chen \(2024\)CRITIC: large language models can self\-correct with tool\-interactive critiquing\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 57734–57811\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/fef126561bbf9d4467dbb8d27334b8fe-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1),[§1](https://arxiv.org/html/2605.26827#S1.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p3.1),[§2](https://arxiv.org/html/2605.26827#S2.p1.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.External Links:2404\.06654,[Link](https://arxiv.org/abs/2404.06654)Cited by:[§2](https://arxiv.org/html/2605.26827#S2.p1.1)\.
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. Yu, X\. Song, and D\. Zhou \(2024\)Large language models cannot self\-correct reasoning yet\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 32808–32824\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/8b4add8b0aa8749d80a34ca5d941c355-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1)\.
- H\. Jiang, Q\. Wu, C\. Lin, Y\. Yang, and L\. Qiu \(2023\)LLMLingua: compressing prompts for accelerated inference of large language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 13358–13376\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.825/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.825)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px1.p1.1)\.
- H\. Jin, K\. Lin, W\. Zhang, Y\. Jin, and G\. Li \(2025a\)VideoCuRL: video curriculum reinforcement learning with orthogonal difficulty decomposition\.arXiv preprint arXiv:2601\.00887\.Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1)\.
- H\. Jin, Q\. Wang, W\. Zhang, Y\. Liu, and S\. Cheng \(2025b\)VideoMem: enhancing ultra\-long video understanding via adaptive memory management\.arXiv preprint arXiv:2512\.04540\.Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1)\.
- H\. Jin, S\. Xie, J\. Ding, K\. Lin, and G\. Li \(2026a\)TIR\-flow: active video search and reasoning with frozen vlms\.arXiv preprint arXiv:2601\.06176\.Cited by:[§1](https://arxiv.org/html/2605.26827#S1.p1.1)\.
- H\. Jin, M\. Zhu, J\. Tian, X\. Jiang, Z\. Du, H\. Tang, S\. Xie, Q\. Zhang, and J\. Ding \(2026b\)Context\-cot: enhancing context learning via high\-quality reasoning synthesis\.External Links:2605\.25354,[Link](https://arxiv.org/abs/2605.25354)Cited by:[§1](https://arxiv.org/html/2605.26827#S1.p1.1)\.
- H\. Jin, R\. Zhu, J\. Ding, G\. Luo, and G\. Li \(2026c\)HiMAC: hierarchical macro\-micro learning for long\-horizon llm agents\.External Links:2603\.00977,[Link](https://arxiv.org/abs/2603.00977)Cited by:[§1](https://arxiv.org/html/2605.26827#S1.p1.1)\.
- H\. Jin, R\. Zhu, Z\. Du, X\. Jiang, J\. Tian, Q\. Zhang, and J\. Ding \(2026d\)DGPO: distribution guided policy optimization for fine grained credit assignment\.arXiv preprint arXiv:2605\.03327\.Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 9459–9474\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px1.p1.1)\.
- H\. Lin, K\. Lv, X\. Jiang, J\. Tian, Z\. Du, J\. Ding, Q\. Zhang, and H\. Jin \(2026\)VISD: enhancing video reasoning via structured self\-distillation\.External Links:2605\.06094,[Link](https://arxiv.org/abs/2605.06094)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 46534–46594\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91edff07232fb1b55a505a9e9f6c0ff3-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p3.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. F\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.p2.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 8634–8652\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p3.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, A\. Baker\-Whitcomb, A\. Beutel, A\. Karpenko, A\. Makelov, A\. Neitz, A\. Wei, A\. Barr, A\. Kirchmeyer, A\. Ivanov, A\. Christakis, A\. Gillespie, A\. Tam, A\. Bennett, A\. Wan, A\. Huang, A\. M\. Sandjideh, A\. Yang, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Gheorghe, A\. G\. Garcia, A\. Braunstein, A\. Liu, A\. Schmidt, A\. Mereskin, A\. Mishchenko, A\. Applebaum, A\. Rogerson, A\. Rajan, A\. Wei, A\. Kotha, A\. Srivastava, A\. Agrawal, A\. Vijayvergiya, A\. Tyra, A\. Nair, A\. Nayak, B\. Eggers, B\. Ji, B\. Hoover, B\. Chen, B\. Chen, B\. Barak, B\. Minaiev, B\. Hao, B\. Baker, B\. Lightcap, B\. McKinzie, B\. Wang, B\. Quinn, B\. Fioca, B\. Hsu, B\. Yang, B\. Yu, B\. Zhang, B\. Brenner, C\. R\. Zetino, C\. Raymond, C\. Lugaresi, C\. Paz, C\. Hudson, C\. Whitney, C\. Li, C\. Chen, C\. Cole, C\. Voss, C\. Ding, C\. Shen, C\. Huang, C\. Colby, C\. Hallacy, C\. Koch, C\. Lu, C\. Kaplan, C\. Kim, C\. Minott\-Henriques, C\. Frey, C\. Yu, C\. Czarnecki, C\. Reid, C\. Wei, C\. Decareaux, C\. Scheau, C\. Zhang, C\. Forbes, D\. Tang, D\. Goldberg, D\. Roberts, D\. Palmie, D\. Kappler, D\. Levine, D\. Wright, D\. Leo, D\. Lin, D\. Robinson, D\. Grabb, D\. Chen, D\. Lim, D\. Salama, D\. Bhattacharjee, D\. Tsipras, D\. Li, D\. Yu, D\. Strouse, D\. Williams, D\. Hunn, E\. Bayes, E\. Arbus, E\. Akyurek, E\. Y\. Le, E\. Widmann, E\. Yani, E\. Proehl, E\. Sert, E\. Cheung, E\. Schwartz, E\. Han, E\. Jiang, E\. Mitchell, E\. Sigler, E\. Wallace, E\. Ritter, E\. Kavanaugh, E\. Mays, E\. Nikishin, F\. Li, F\. P\. Such, F\. de Avila Belbute Peres, F\. Raso, F\. Bekerman, F\. Tsimpourlas, F\. Chantzis, F\. Song, F\. Zhang, G\. Raila, G\. McGrath, G\. Briggs, G\. Yang, G\. Parascandolo, G\. Chabot, G\. Kim, G\. Zhao, G\. Valiant, G\. Leclerc, H\. Salman, H\. Wang, H\. Sheng, H\. Jiang, H\. Wang, H\. Jin, H\. Sikchi, H\. Schmidt, H\. Aspegren, H\. Chen, H\. Qiu, H\. Lightman, I\. Covert, I\. Kivlichan, I\. Silber, I\. Sohl, I\. Hammoud, I\. Clavera, I\. Lan, I\. Akkaya, I\. Kostrikov, I\. Kofman, I\. Etinger, I\. Singal, J\. Hehir, J\. Huh, J\. Pan, J\. Wilczynski, J\. Pachocki, J\. Lee, J\. Quinn, J\. Kiros, J\. Kalra, J\. Samaroo, J\. Wang, J\. Wolfe, J\. Chen, J\. Wang, J\. Harb, J\. Han, J\. Wang, J\. Zhao, J\. Chen, J\. Yang, J\. Tworek, J\. Chand, J\. Landon, J\. Liang, J\. Lin, J\. Liu, J\. Wang, J\. Tang, J\. Yin, J\. Jang, J\. Morris, J\. Flynn, J\. Ferstad, J\. Heidecke, J\. Fishbein, J\. Hallman, J\. Grant, J\. Chien, J\. Gordon, J\. Park, J\. Liss, J\. Kraaijeveld, J\. Guay, J\. Mo, J\. Lawson, J\. McGrath, J\. Vendrow, J\. Jiao, J\. Lee, J\. Steele, J\. Wang, J\. Mao, K\. Chen, K\. Hayashi, K\. Xiao, K\. Salahi, K\. Wu, K\. Sekhri, K\. Sharma, K\. Singhal, K\. Li, K\. Nguyen, K\. Gu\-Lemberg, K\. King, K\. Liu, K\. Stone, K\. Yu, K\. Ying, K\. Georgiev, K\. Lim, K\. Tirumala, K\. Miller, L\. Ahmad, L\. Lv, L\. Clare, L\. Fauconnet, L\. Itow, L\. Yang, L\. Romaniuk, L\. Anise, L\. Byron, L\. Pathak, L\. Maksin, L\. Lo, L\. Ho, L\. Jing, L\. Wu, L\. Xiong, L\. Mamitsuka, L\. Yang, L\. McCallum, L\. Held, L\. Bourgeois, L\. Engstrom, L\. Kuhn, L\. Feuvrier, L\. Zhang, L\. Switzer, L\. Kondraciuk, L\. Kaiser, M\. Joglekar, M\. Singh, M\. Shah, M\. Stratta, M\. Williams, M\. Chen, M\. Sun, M\. Cayton, M\. Li, M\. Zhang, M\. Aljubeh, M\. Nichols, M\. Haines, M\. Schwarzer, M\. Gupta, M\. Shah, M\. Y\. Guan, M\. Huang, M\. Dong, M\. Wang, M\. Glaese, M\. Carroll, M\. Lampe, M\. Malek, M\. Sharman, M\. Zhang, M\. Wang, M\. Pokrass, M\. Florian, M\. Pavlov, M\. Wang, M\. Chen, M\. Wang, M\. Feng, M\. Bavarian, M\. Lin, M\. Abdool, M\. Rohaninejad, N\. Soto, N\. Staudacher, N\. LaFontaine, N\. Marwell, N\. Liu, N\. Preston, N\. Turley, N\. Ansman, N\. Blades, N\. Pancha, N\. Mikhaylin, N\. Felix, N\. Handa, N\. Rai, N\. Keskar, N\. Brown, O\. Nachum, O\. Boiko, O\. Murk, O\. Watkins, O\. Gleeson, P\. Mishkin, P\. Lesiewicz, P\. Baltescu, P\. Belov, P\. Zhokhov, P\. Pronin, P\. Guo, P\. Thacker, Q\. Liu, Q\. Yuan, Q\. Liu, R\. Dias, R\. Puckett, R\. Arora, R\. T\. Mullapudi, R\. Gaon, R\. Miyara, R\. Song, R\. Aggarwal, R\. Marsan, R\. Yemiru, R\. Xiong, R\. Kshirsagar, R\. Nuttall, R\. Tsiupa, R\. Eldan, R\. Wang, R\. James, R\. Ziv, R\. Shu, R\. Nigmatullin, S\. Jain, S\. Talaie, S\. Altman, S\. Arnesen, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Yoo, S\. Heon, S\. Ethersmith, S\. Grove, S\. Taylor, S\. Bubeck, S\. Banesiu, S\. Amdo, S\. Zhao, S\. Wu, S\. Santurkar, S\. Zhao, S\. R\. Chaudhuri, S\. Krishnaswamy, Shuaiqi, Xia, S\. Cheng, S\. Anadkat, S\. P\. Fishman, S\. Tobin, S\. Fu, S\. Jain, S\. Mei, S\. Egoian, S\. Kim, S\. Golden, S\. Mah, S\. Lin, S\. Imm, S\. Sharpe, S\. Yadlowsky, S\. Choudhry, S\. Eum, S\. Sanjeev, T\. Khan, T\. Stramer, T\. Wang, T\. Xin, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Degry, T\. Shadwell, T\. Fu, T\. Gao, T\. Garipov, T\. Sriskandarajah, T\. Sherbakov, T\. Korbak, T\. Kaftan, T\. Hiratsuka, T\. Wang, T\. Song, T\. Zhao, T\. Peterson, V\. Kharitonov, V\. Chernova, V\. Kosaraju, V\. Kuo, V\. Pong, V\. Verma, V\. Petrov, W\. Jiang, W\. Zhang, W\. Zhou, W\. Xie, W\. Zhan, W\. McCabe, W\. DePue, W\. Ellsworth, W\. Bain, W\. Thompson, X\. Chen, X\. Qi, X\. Xiang, X\. Shi, Y\. Dubois, Y\. Yu, Y\. Khakbaz, Y\. Wu, Y\. Qian, Y\. T\. Lee, Y\. Chen, Y\. Zhang, Y\. Xiong, Y\. Tian, Y\. Cha, Y\. Bai, Y\. Yang, Y\. Yuan, Y\. Li, Y\. Zhang, Y\. Yang, Y\. Jin, Y\. Jiang, Y\. Wang, Y\. Wang, Y\. Liu, Z\. Stubenvoll, Z\. Dou, Z\. Wu, and Z\. Wang \(2026\)OpenAI gpt\-5 system card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§1](https://arxiv.org/html/2605.26827#S1.p1.1),[§2](https://arxiv.org/html/2605.26827#S2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.External Links:2203\.11171,[Link](https://arxiv.org/abs/2203.11171)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, b\. ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 24824–24837\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px2.p1.1)\.
- K\. Wu, E\. Wu, and J\. Zou \(2024\)ClashEval: quantifying the tug\-of\-war between an llm’s internal prior and external evidence\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 33402–33422\.External Links:[Document](https://dx.doi.org/10.52202/079017-1053),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/3aa291abc426d7a29fb08418c1244177-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px3.p1.1)\.
- J\. Xie, K\. Zhang, J\. Chen, R\. Lou, and Y\. Su \(2024\)Adaptive chameleon or stubborn sloth: revealing the behavior of large language models in knowledge conflicts\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 35623–35646\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/99261adc8a6356b38bcf999bba9a26dc-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 11809–11822\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.26827#S1.p3.1)\.
- W\. Zhou, S\. Zhang, H\. Poon, and M\. Chen \(2023\)Context\-faithful prompting for large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 14544–14556\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.968/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.968)Cited by:[Appendix A](https://arxiv.org/html/2605.26827#A1.SS0.SSS0.Px3.p1.1)\.

## Appendix AExtended Related Works

In\-context learning \(ICL\) studies inference\-time adaptation from contextual information without parameter updatesBrownet al\.\([2020](https://arxiv.org/html/2605.26827#bib.bib12)\)\. Context Learning \(CL\), as instantiated by CL\-benchDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\), places stricter requirements on such adaptation by requiring models to internalize and faithfully apply newly provided contextual knowledge under complex constraints\. The resulting gap between general inference\-time capability and reliable context utilization motivates methods that improve model behavior at inference time, which we review in the following sections\.

A direct route to improving inference\-time capability is to enhance the underlying model through training or post\-training\. Instruction tuning, reinforcement learning, and retrieval\-aware training improve general abilities such as reasoning, instruction following, and knowledge useOuyanget al\.\([2022](https://arxiv.org/html/2605.26827#bib.bib17)\); Guoet al\.\([2025](https://arxiv.org/html/2605.26827#bib.bib2)\); Asaiet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib18)\); Linet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib27)\); Jinet al\.\([2025a](https://arxiv.org/html/2605.26827#bib.bib28),[2026d](https://arxiv.org/html/2605.26827#bib.bib29)\)\.

However, stronger general capabilities do not necessarily translate to reliable context learning, where success depends on faithfully internalizing and applying newly provided contextual constraints\. This limitation motivates inference\-time methods that improve behavior without modifying model parameters\.

Existing non\-training approaches improve inference\-time behavior without modifying model parameters\. These methods can be organized by the failure modes they target in contextual reasoning\.

#### Exposure\-related failures\.

Methods such as retrieval\-augmented generation, long\-context optimization, and prompt compression improve access to relevant information within contextLewiset al\.\([2020](https://arxiv.org/html/2605.26827#bib.bib19)\);liu2024lost; Jianget al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib20)\)\. However, improving information exposure does not guarantee faithful use of all contextual constraints\.

#### Reasoning\-related failures\.

Methods including Chain\-of\-Thought, self\-consistency, and Tree\-of\-Thoughts improve inference through additional deliberationWeiet al\.\([2022](https://arxiv.org/html/2605.26827#bib.bib21)\); Wanget al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib3)\); Yaoet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib9)\)\. These approaches strengthen reasoning but do not explicitly enforce adherence to context\-specific constraints\.

#### Faithfulness\-related failures\.

A third category focuses on grounding model outputs in provided context rather than parametric knowledge\. Context\-faithful prompting encourages reliance on supplied evidenceZhouet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib22)\), while studies on knowledge conflicts and adversarial contexts show that models may still prioritize internal priors over external informationXieet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib23)\); Wuet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib24)\)\. These works highlight the challenge of ensuring consistent context usage, especially under conflicting or ambiguous information\.

#### Revision\-related failures\.

Self\-Refine introduces an iterative self\-improvement loop where the model generates an initial answer, receives self\-generated feedback, and revises its outputMadaanet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib10)\)\. Reflexion extends this idea to agent\-like settings by incorporating verbal self\-reflection across multiple trials as a form of memory\-guided improvementShinnet al\.\([2023](https://arxiv.org/html/2605.26827#bib.bib11)\); Jinet al\.\([2025b](https://arxiv.org/html/2605.26827#bib.bib30)\)\. Chain\-of\-Verification and tool\-assisted critiquing further structure the revision process by explicitly separating generation, verification, and final rewriting stepsDhuliawalaet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib14)\); Gouet al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib15)\)\. However, despite these designs, self\-correction remains unstable, as revision may still degrade previously correct content when feedback signals are incomplete or misalignedHuanget al\.\([2024](https://arxiv.org/html/2605.26827#bib.bib16)\)\.

Existing inference\-time methods address different failure modes in contextual reasoning, but rarely consider how to preserve correct contextual information while performing targeted correction\. This limitation becomes critical in context learning settings, where success depends on both faithful constraint use and stable revision\.

ContextGuard is designed as a unified inference\-time framework for such settings\. Instead of introducing new training objectives, it restructures inference\-time computation through reminder, audit, and protected revision mechanisms to jointly improve constraint awareness and revision robustness\.

From a failure\-space perspective, ContextGuard targets two limitations identified in prior work: omission or misapplication of contextual constraints, and revision\-induced regression over previously correct content\. As a result, ContextGuard is better understood as a protected revision framework for context learning, where preserving correct contextual information is treated as equally important as correcting errors\.

## Appendix BCL\-Bench Details and Strict Requirement Scoring

CategoryContextsTasksCriteriaAvg\. Criteriaper TaskMax Criteriaper TaskInput LengthMean / MaxDomain Knowledge Reasoning19066311,09916\.7748\.3K / 60\.0KRule System Application1405668,28614\.67512\.2K / 62\.2KProcedural Task Execution1004719,48620\.1598\.5K / 58\.5KEmpirical Discovery & Simulation701992,73613\.711416\.7K / 65\.0KTotal5001,89931,60716\.611410\.4K / 65\.0KTable 4:Statistics of CL\-BenchDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)\. We use “criteria” to refer to the benchmark’s binary verification requirements\. Input length includes the system prompt, context, and task specification\.CL\-BenchDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)is designed to evaluate whether language models can acquire and apply newly provided contextual knowledge at inference time\. Unlike conventional reasoning benchmarks that often evaluate a single final answer, CL\-Bench combines long expert\-authored contexts, multiple task categories, sequential dependencies, and dense binary requirement checking\. This makes it a useful testbed for studying context learning as a strict requirement\-satisfaction problem\.

Table[4](https://arxiv.org/html/2605.26827#A2.T4)summarizes the benchmark statistics\. CL\-Bench contains 500 contexts, 1,899 tasks, and 31,607 binary verification criteria across four categories and 18 subcategories\. Inputs are long, with an average input length of 10\.4K tokens and a maximum of 65K tokens\. The benchmark is also densely constrained: each task contains 16\.6 criteria on average, and some tasks contain up to 114 criteria\. In addition, 51\.1% of tasks are sequential, meaning that later turns may depend on information or decisions from earlier turns\.

#### Strict conjunctive scoring\.

CL\-Bench evaluates each task using a set of binary criteria\. Let taskiicontainmim\_\{i\}criteria, and letzi​j∈\{0,1\}z\_\{ij\}\\in\\\{0,1\\\}denote whether the model output satisfies criterionjj\. A task is counted as solved only when all associated criteria are satisfied:

si=∏j=1mizi​j\.s\_\{i\}=\\prod\_\{j=1\}^\{m\_\{i\}\}z\_\{ij\}\.\(10\)The overall task\-solving rate is then

TSR=1N​∑i=1Nsi,\\mathrm\{TSR\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}s\_\{i\},\(11\)whereNNis the number of tasks\. This strict conjunctive metric differs from partial\-credit evaluation: an answer may satisfy most requirements but still receive zero task\-level credit if it misses a single required constraint, format rule, exception, calculation detail, or procedural condition\.

This scoring rule is central to our motivation\. In dense context\-learning settings, a model may correctly follow the main reasoning path while failing due to a small number of scattered requirements\. Conversely, a revision step may fix an initially failed requirement but accidentally break another requirement that was already satisfied\. Therefore, effective refinement must optimize not only for error correction, but also for preservation of already\-correct contextual content\.

#### Context dependence\.

CL\-Bench is constructed to require the supplied context rather than external retrieval or parametric knowledge alone\. The benchmark includes expert\-authored contexts containing fictional, modified, niche, or recently emerging information\. The original CL\-Bench paper reports that when contexts are removed in a sampled ablation, GPT\-5\.1 \(High\) drops to 0\.9% task\-solving rate on 1,000 sampled tasksDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)\. This supports treating CL\-Bench as a context\-learning benchmark: success depends on using the provided context as the governing source of information\.

#### Benchmark difficulty\.

The original CL\-Bench evaluation reports low task\-solving rates even for strong frontier models\. Across ten evaluated models, the average task\-solving rate is 17\.2%, and the strongest reported model, GPT\-5\.1 \(High\), reaches only 23\.7% overallDouet al\.\([2026](https://arxiv.org/html/2605.26827#bib.bib5)\)\. These numbers should be interpreted as evidence of benchmark difficulty rather than directly comparable baselines for our experiments, since our evaluation uses a different judge model and a per\-criterion repeated\-judgment protocol\.

## Appendix CEvaluation Protocol and Judge Stability

#### Per\-requirement judging\.

The original CL\-Bench evaluation presents all requirements of a task to the judge in a single prompt\. In our evaluation, we instead score each requirement independently\. This reduces judge attention burden when a task contains many fine\-grained requirements and makes the evaluation less sensitive to prompt\-position effects among requirements\.

For taskiiwith requirements\{ri​j\}j=1mi\\\{r\_\{ij\}\\\}\_\{j=1\}^\{m\_\{i\}\}, each requirement receives a binary labelz^i​j∈\{0,1\}\\hat\{z\}\_\{ij\}\\in\\\{0,1\\\}\. A task is considered solved only if all requirements are satisfied:

s^i=∏j=1miz^i​j\.\\hat\{s\}\_\{i\}=\\prod\_\{j=1\}^\{m\_\{i\}\}\\hat\{z\}\_\{ij\}\.\(12\)

#### Single vote vs\. majority vote\.

A single LLM judge call can be noisy, especially for long\-context answers and partially satisfied requirements\. We therefore evaluate each requirement multiple times and aggregate the labels by majority voting:

z^i​j=𝕀​\[∑k=1Kvi​j\(k\)\>K2\],\\hat\{z\}\_\{ij\}=\\mathbb\{I\}\\left\[\\sum\_\{k=1\}^\{K\}v\_\{ij\}^\{\(k\)\}\>\\frac\{K\}\{2\}\\right\],\(13\)wherevi​j\(k\)∈\{0,1\}v\_\{ij\}^\{\(k\)\}\\in\\\{0,1\\\}is thekk\-th judge vote for requirementjj, andKKis the number of repeated judge votes\. In all experiments reported in this paper, we useK=3K=3\.

LevelInstancesSingle\-voteagreementMajority\-voteagreementRequirement62,30888\.2994\.14Task3,72093\.4696\.73Table 5:Judge stability \(%\) computed from stored three\-vote traces over baseline and ContextGuard outputs\. Single\-vote agreement measures consistency among repeated individual judge calls\. Majority\-vote agreement measures agreement between individual votes and the final majority label\.Table[5](https://arxiv.org/html/2605.26827#A3.T5)shows that single judge calls have non\-negligible variance\. At the requirement level, repeated single votes agree 88\.29% of the time, while agreement with the majority label reaches 94\.14%\. At the task level, the corresponding numbers are 93\.46% and 96\.73%\. We therefore use per\-requirement majority\-vote judging withK=3K=3for all reported results\.

Algorithm 1: ContextGuard InferenceInput:contextCC, task queryqq, system instructionss, task categorytt, modelfθf\_\{\\theta\}Output:final answery∗y^\{\\ast\}1\. Construct reminderr←Reminder​\(s,q\)r\\leftarrow\\texttt\{Reminder\}\(s,q\)\.2\. Generate reminder\-augmented drafty\(0\)←fθ​\(C,q,r\)y^\{\(0\)\}\\leftarrow f\_\{\\theta\}\(C,q,r\)\.3\. Run structured self\-audit:\(QA,QB,QC,QD\)←𝒜​\(C,q,y\(0\)\)\.\(Q\_\{A\},Q\_\{B\},Q\_\{C\},Q\_\{D\}\)\\leftarrow\\mathcal\{A\}\(C,q,y^\{\(0\)\}\)\.4\. Run category\-conditioned specialist signal:\(𝒪t,ℰt\)←𝒮t​\(C,q,y\(0\)\)\.\(\\mathcal\{O\}\_\{t\},\\mathcal\{E\}\_\{t\}\)\\leftarrow\\mathcal\{S\}\_\{t\}\(C,q,y^\{\(0\)\}\)\.5\. Construct fix set:ℱ←QC∪QD∪ℰt\.\\mathcal\{F\}\\leftarrow Q\_\{C\}\\cup Q\_\{D\}\\cup\\mathcal\{E\}\_\{t\}\.6\. Construct protection set:𝒫←QA∪QB∪𝒪t\.\\mathcal\{P\}\\leftarrow Q\_\{A\}\\cup Q\_\{B\}\\cup\\mathcal\{O\}\_\{t\}\.7\.ifℱ=∅\\mathcal\{F\}=\\emptysetthenreturny∗←y\(0\)y^\{\\ast\}\\leftarrow y^\{\(0\)\}\.8\. Generate protected revision:y~←ℛguarded​\(C,q,y\(0\),ℱ,𝒫\)\.\\tilde\{y\}\\leftarrow\\mathcal\{R\}\_\{\\mathrm\{guarded\}\}\(C,q,y^\{\(0\)\},\\mathcal\{F\},\\mathcal\{P\}\)\.9\. Apply revision guard:g←RevisionGuard​\(y~,y\(0\),𝒫\)\.g\\leftarrow\\texttt\{RevisionGuard\}\(\\tilde\{y\},y^\{\(0\)\},\\mathcal\{P\}\)\.10\.ifg=passg=\\texttt\{pass\}thenreturny∗←y~y^\{\\ast\}\\leftarrow\\tilde\{y\}\.11\.elsereturny∗←y\(0\)y^\{\\ast\}\\leftarrow y^\{\(0\)\}\.Table 6:Pseudocode of ContextGuard inference\. The model first identifies fix targets and protected content, then performs one guarded revision round with fallback to the original draft when revision appears destructive\.

## Appendix DFull ContextGuard Algorithm

Table[6](https://arxiv.org/html/2605.26827#A3.T6)gives the complete inference\-time procedure of ContextGuard, and Table[7](https://arxiv.org/html/2605.26827#A4.T7)summarizes how category\-conditioned specialist signals are normalized into the same fix\-set and lock\-set interface\. The framework operates on a contextCC, task queryqq, optional system instructionss, and task categorytt\. It performs at most one protected revision round and does not update model parameters\.

#### Single\-round design\.

All experiments in this paper use a single revision round\. This keeps inference cost controlled and avoids compounding revision errors across multiple rounds\. The goal is not to repeatedly rewrite the answer, but to perform one targeted repair while preserving already\-satisfied contextual requirements\.

#### Revision guard\.

The revision guard is a lightweight fallback mechanism applied after protected revision\. It is designed to detect destructive edits rather than to re\-grade the answer\. In our implementation, the guard checks whether the revised answer shows substantial structural or informational degradation relative to the draft\. This includes three main cases\.

- •Excessive shortening\.If the revised answer is much shorter than the draft, the revision may have deleted required reasoning, supporting evidence, or output fields\. We therefore reject revisions whose length falls below a preset fraction of the original draft length\.
- •Protected\-content loss\.If the revision appears to remove or contradict content listed in the protection set𝒫\\mathcal\{P\}, it is treated as unsafe\. The protection set contains confirmed constraints, verified facts, and satisfied specialist requirements that should remain stable during editing\.
- •Structural degradation\.If the draft follows an apparent required structure, such as a list, JSON object, table, or sectioned response, but the revision collapses or substantially alters that structure without a corresponding fix target, the revision is considered potentially destructive\.

If any guard condition is triggered, ContextGuard discards the revised answer and returns the original draft\. This conservative fallback is useful in strict context\-learning evaluation: a revision that fixes one missed requirement but deletes an already\-correct requirement can still fail the task\. The guard therefore complements the protection set by preventing high\-risk revisions from replacing a safer draft\.

CategorySpecialist SignalMain ChecksRevision UseDomain Knowledge ReasoningFormat and contextual applicability signalChecks whether the draft follows required structure, output format, ordering constraints, role/persona constraints, forbidden content constraints, and whether domain\-specific claims are grounded in the provided context\.Format or applicability violations are added toℰt\\mathcal\{E\}\_\{t\}\. Satisfied structural and contextual requirements are added to𝒪t\\mathcal\{O\}\_\{t\}so that revision does not damage already\-correct domain\-specific content\.Procedural Task ExecutionWorkflow and procedure signalChecks step ordering, agent routing, dependency handling, gate conditions, timing constraints, logging requirements, safety behavior, and required escalation or refusal paths\.Detected workflow errors are added toℰt\\mathcal\{E\}\_\{t\}\. Correctly executed steps, valid routing decisions, and satisfied safety constraints are added to𝒪t\\mathcal\{O\}\_\{t\}to prevent revision from changing an already\-valid procedure\.Rule System ApplicationRule\-fidelity signalChecks exact rule application, exception handling, numerical or symbolic conditions, terminology consistency, applicability boundaries, and whether the draft invents unsupported rules\.Misapplied, omitted, or invented rules are added toℰt\\mathcal\{E\}\_\{t\}\. Correctly applied rules and satisfied exceptions are added to𝒪t\\mathcal\{O\}\_\{t\}, protecting them during later edits\.Empirical Discovery & SimulationEmpirical consistency signalChecks numerical correctness, comparison completeness, unit consistency, trend interpretation, evidence grounding, coverage of requested experimental conditions, and whether claims are supported by the provided data\.Numerical, comparison, coverage, or evidence\-grounding issues are added toℰt\\mathcal\{E\}\_\{t\}\. Verified computations, grounded observations, and correctly stated comparisons are added to𝒪t\\mathcal\{O\}\_\{t\}\.Table 7:Category\-conditioned specialist signals used by ContextGuard\. Each signal maps category\-specific requirement checks into the same fix\-set and lock\-set interface used by protected revision\.

## Appendix ECategory\-Conditioned Specialist Signals

Table[7](https://arxiv.org/html/2605.26827#A4.T7)provides the category\-level checks used by the specialist layer; this appendix subsection explains the common interface behind that table\. Given the contextCC, queryqq, draft answery\(0\)y^\{\(0\)\}, and task categorytt, the specialist layer produces satisfied category\-specific requirements and detected issues:

𝒮t​\(C,q,y\(0\)\)→\(𝒪t,ℰt\)\.\\mathcal\{S\}\_\{t\}\(C,q,y^\{\(0\)\}\)\\rightarrow\(\\mathcal\{O\}\_\{t\},\\mathcal\{E\}\_\{t\}\)\.\(14\)The detected issuesℰt\\mathcal\{E\}\_\{t\}are added to the fix set, while the satisfied requirements𝒪t\\mathcal\{O\}\_\{t\}are added to the protection set:

ℱ=QC∪QD∪ℰt,𝒫=QA∪QB∪𝒪t\.\\mathcal\{F\}=Q\_\{C\}\\cup Q\_\{D\}\\cup\\mathcal\{E\}\_\{t\},\\quad\\mathcal\{P\}=Q\_\{A\}\\cup Q\_\{B\}\\cup\\mathcal\{O\}\_\{t\}\.\(15\)Thus, all specialist signals serve the same purpose: they convert category\-specific requirement checks into revision guidance\.

#### Category conditioning\.

The categoryttis the benchmark context category, not a task outcome label\. It is used only to select the appropriate verification emphasis for the draft\. This avoids applying the same generic checklist to qualitatively different tasks, such as procedural workflows and empirical data interpretation\.

#### Separate checkers and audit\-integrated signals\.

The specialist layer can be implemented in two forms\. For requirements with explicit structural patterns, such as formatting or workflow order, ContextGuard uses separate checker prompts that produce explicit satisfied and failed items\. For requirements that are tightly coupled with reasoning content, such as rule interpretation or empirical comparison, the specialist criteria can be integrated into the structured self\-audit prompt\. In both cases, the output is normalized into the same\(𝒪t,ℰt\)\(\\mathcal\{O\}\_\{t\},\\mathcal\{E\}\_\{t\}\)interface before revision\.

#### Why specialist signals are needed\.

Generic reflection often identifies high\-level mistakes but misses category\-specific failure modes\. For example, a procedural answer may appear logically plausible while violating a required gate condition; an empirical answer may state the correct trend but omit a required comparison; a rule\-system answer may reach the right conclusion while applying an exception incorrectly\. The specialist layer makes these failure modes explicit before revision\.

#### Interaction with protected revision\.

Specialist signals are not used only to find errors\. They also identify category\-specific content that should be preserved\. This is important because many drafts are partially correct: they may satisfy the required format, follow several workflow steps, or apply some rules correctly while failing elsewhere\. By placing satisfied specialist requirements into𝒪t\\mathcal\{O\}\_\{t\}, ContextGuard prevents the revision stage from treating the entire answer as freely editable text\.

## Appendix FPrompt Templates

Tables[8](https://arxiv.org/html/2605.26827#A6.T8),[9](https://arxiv.org/html/2605.26827#A6.T9), and[10](https://arxiv.org/html/2605.26827#A6.T10)provide the normalized prompt templates used by ContextGuard\. Table[8](https://arxiv.org/html/2605.26827#A6.T8)covers reminder\-augmented drafting, structured self\-auditing, and format compliance checking; Table[9](https://arxiv.org/html/2605.26827#A6.T9)covers procedure, rule\-fidelity, and empirical specialist signals; and Table[10](https://arxiv.org/html/2605.26827#A6.T10)covers protected revision and requirement\-level judging\. Full task contexts are omitted and replaced with bracketed placeholders\. The exact wording may vary slightly across categories, but all prompts follow the same interface: issue items are added to the fix set and confirmed items are added to the protection set\.

PromptTemplatePrompt 1:Reminder\-Augmented DraftGeneration\[ORIGINAL CONTEXT AND CONVERSATION\]\[REMINDER\]Before answering, re\-read and follow all instructions in the system/context message, including constraints, role requirements, format requirements, and task\-specific rules\.The final task request is:\[FINAL TASK REQUEST\]Now produce the answer to the final task\. Use only the provided context and satisfy all stated requirements\.Prompt 2:StructuredSelf\-AuditYou are auditing your previous answer for a context\-learning task\.\[CONVERSATION / CONTEXT\]\[ORIGINAL TASK\]\[PREVIOUS ANSWER\]Carefully compare the previous answer against the provided context, rules, constraints, data, and final task request\.Output a JSON object with four fields:\{ ‘‘confirmed\_correct’’: \[ ‘‘Specific constraints, reasoning steps, or output parts that are correct and should be preserved\.’’ \],‘‘confirmed\_data’’: \[ ‘‘Specific contextual facts, values, rules, or evidence that were used correctly\.’’ \],‘‘possibly\_missed’’: \[ ‘‘Constraints, facts, requirements, cases, exceptions, or output elements that may be missing or insufficiently covered\.’’ \],‘‘possibly\_wrong’’: \[ ‘‘Reasoning steps, calculations, rule applications, conclusions, or answer parts that may be incorrect\.’’ \] \}Be specific\. Identify concrete items that can guide revision\.Prompt 3:FormatCompliance SignalYou are a format compliance checker\. Your only job is to check whether the response follows the format, structure, and presentation rules from the conversation\.\[CONVERSATION / CONTEXT\]\[PREVIOUS ANSWER\]Extract every constraint about format, structure, or presentation, including:\- output structure, such as bullets, numbered lists, tables, sections, JSON, XML, or schemas;\- required ordering or grouping;\- required labels, headings, field names, or tags;\- typography or presentation rules;\- length, inclusion, exclusion, or forbidden\-content constraints\.Output a JSON object:\{ ‘‘format\_ok’’: \[ ‘‘Requirement \+ how the response satisfies it\.’’ \],‘‘format\_fail’’: \[ ‘‘Requirement \+ how the response violates it\.’’ \] \}Table 8:Prompt templates for reminder\-augmented drafting, structured self\-auditing, and format compliance checking\.PromptTemplatePrompt 4:ProcedureCompliance SignalYou are a procedure compliance checker for a procedural task execution scenario\. Your job is to verify workflow correctness, not general writing quality\.\[CONVERSATION / CONTEXT\]\[PREVIOUS ANSWER\]Check the response against all procedural requirements, including:\- required step sequence;\- agent routing and handoff rules;\- dependency handling;\- gate checks and threshold conditions;\- timing constraints;\- logging, reporting, or evidence\-recording requirements;\- safety, refusal, or escalation rules\.Output a JSON object:\{ ‘‘proc\_ok’’: \[ ‘‘Procedural requirement \+ how the response satisfies it\.’’ \],‘‘proc\_fail’’: \[ ‘‘Procedural requirement \+ how the response violates it\.’’ \] \}Prompt 5:Rule\-FidelitySignalYou are a rule\-fidelity checker\. The task depends on faithfully applying the provided rulebook or rule system\.\[CONVERSATION / RULEBOOK\]\[PREVIOUS ANSWER\]Check whether the response correctly applies the provided rules\. Pay special attention to:\- exact rule definitions and terminology;\- exception and boundary conditions;\- numerical or symbolic conditions;\- required decision procedures;\- unsupported invented rules;\- rules applied too strictly or too loosely\.Output a JSON object:\{ ‘‘rules\_ok’’: \[ ‘‘Rule or condition \+ how the response applies it correctly\.’’ \],‘‘rules\_fail’’: \[ ‘‘Rule or condition \+ how the response misapplies, omits, or invents it\.’’ \] \}Prompt 6:EmpiricalConsistency SignalThis is an empirical discovery or simulation task\. In addition to the general self\-audit, check whether the response is grounded in the provided data, observations, or simulation rules\.\[CONVERSATION / DATA / SIMULATION DESCRIPTION\]\[PREVIOUS ANSWER\]Pay special attention to:\- numerical correctness;\- unit consistency;\- comparison completeness;\- trend interpretation;\- coverage of all requested conditions;\- whether each empirical claim is supported by the provided evidence;\- whether the response over\-generalizes beyond the data\.When listing possibly missed or possibly wrong items, assign a fine\-grained issue type when applicable, such as ‘‘numeric’’, ‘‘comparison’’, ‘‘coverage’’, ‘‘unit’’, ‘‘trend’’, or ‘‘evidence’’\.Table 9:Prompt templates for procedure, rule\-fidelity, and empirical specialist signals\.PromptTemplatePrompt 7:ProtectedTargeted RevisionBelow is your previous response, together with audit feedback and specialist checks\.\[CONVERSATION / CONTEXT\]\[PREVIOUS ANSWER\]\[FIX SET\]The following items may be missing, incorrect, or violated:\- \[MISSED\] \.\.\.\- \[WRONG\] \.\.\.\- \[FORMAT\-FAIL\] \.\.\.\- \[PROC\-FAIL\] \.\.\.\- \[RULES\-FAIL\] \.\.\.\- \[EMPIRICAL\-ISSUE\] \.\.\.\[LOCK SET\]The following items are confirmed correct or already satisfied\. Do not modify, delete, contradict, or weaken them:\- \[CONFIRMED\-CORRECT\] \.\.\.\- \[CONFIRMED\-DATA\] \.\.\.\- \[FORMAT\-OK\] \.\.\.\- \[PROC\-OK\] \.\.\.\- \[RULES\-OK\] \.\.\.Revise the previous answer under these rules:1\. Fix only the issues listed in the fix set\.2\. Preserve all content in the protection set\.3\. Do not rewrite unrelated parts\.4\. Do not introduce new facts, rules, entities, or constraints unless directly supported by the provided context\.5\. Maintain the required output format and structure\.6\. Return only the revised final answer\.Prompt 8:Requirement\-Level JudgeYou are a rigorous instruction\-following grading teacher\. Your task is to grade a student answer based on one specific requirement\.\[STUDENT RESPONSE\]\[REQUIREMENT\]Decide whether the response satisfies this requirement\.Output a JSON object with exactly two fields:\{ ‘‘reason’’: ‘‘Brief explanation of the decision\.’’,‘‘satisfaction\_status’’: ‘‘yes’’ or ‘‘no’’ \}The satisfaction\_status must be ‘‘yes’’ only if the requirement is clearly satisfied\. Otherwise output ‘‘no’’\.Table 10:Prompt templates for protected revision and requirement\-level judging\.
## Appendix GAdditional Experimental Results

TablesLABEL:tab:app\_subcategory\_countsandLABEL:tab:app\_length\_countsprovide the count\-level results behind the aggregate tables and figures in the main paper\. TableLABEL:tab:app\_subcategory\_countsbreaks task\-solving rates down by CL\-Bench subcategory, while TableLABEL:tab:app\_length\_countsreports the same comparison across input\-length buckets with missing or failed generations counted as failed tasks\.

## Appendix HNear\-Miss Migration

This appendix analyzes CL\-Bench at the level of individual task requirements\. Because CL\-Bench uses a strict all\-requirement scoring rule, a task can fail even when most requirements are already satisfied\. We therefore study how tasks move in requirement\-failure space before and after revision\.

### H\.1Near\-Miss Distribution

For each taskxx, letem​\(x\)e\_\{m\}\(x\)denote the number of requirements that are not satisfied by methodmm\. We group tasks into five bins according toem​\(x\)e\_\{m\}\(x\):0,11,22–33,44–88, and\>8\>8\. Percentages are computed over the full 1,899\-task benchmark\. Missing or failed generations are treated as unsolved, with all associated requirements counted as unsatisfied\.

Method0 failcount \(%\)1 failcount \(%\)2–3 failcount \(%\)4–8 failcount \(%\)\>\>8 failcount \(%\)Baseline183 \(9\.6\)243 \(12\.8\)567 \(29\.9\)700 \(36\.9\)206 \(10\.8\)Self\-Refine199 \(10\.5\)218 \(11\.5\)551 \(29\.0\)718 \(37\.8\)213 \(11\.2\)ContextGuard263 \(13\.8\)205 \(10\.8\)584 \(30\.8\)666 \(35\.1\)181 \(9\.5\)Table 13:Distribution of tasks by the number of unsatisfied requirements\. The “0 fail” column corresponds to tasks for which all requirements are satisfied\.Baseline FailedRequirementsAfter: 0After: 1After: 2–3After: 4–8After:\>\>8Total012326304018316877861202432–3598530112115674–8101716047241700\>8\>830757139206Table 14:Near\-miss migration from baseline draft to ContextGuard\. Rows group tasks by the number of failed requirements in the baseline draft; columns show the number of failed requirements after ContextGuard\. Movement toward smaller bins indicates partial or complete repair\.Baseline FailedRequirementsTasksSelf\-RefineNewly SolvedContextGuardNewly SolvedDifference124352 \(21\.4%\)68 \(28\.0%\)\+162–356720 \(3\.5%\)59 \(10\.4%\)\+394–87001 \(0\.1%\)10 \(1\.4%\)\+9\>8\>82061 \(0\.5%\)3 \(1\.5%\)\+2All failed tasks171674 \(4\.3%\)140 \(8\.2%\)\+66Table 15:Task\-level conversion among tasks that are unsolved before revision\. A task is newly solved only if all previously failed requirements are repaired and no previously satisfied requirement regresses\.The baseline distribution shows that many failures are near misses rather than complete breakdowns\. Among failed baseline tasks, 47\.2% miss no more than three requirements and 71\.0% miss no more than five\. This pattern supports targeted revision: the model often needs to repair a small number of scattered contextual requirements while preserving the parts of the draft that are already correct\.

### H\.2Task Migration after Revision

The distribution in Table[13](https://arxiv.org/html/2605.26827#A8.T13)does not show whether the same tasks move toward or away from success\. We therefore compute a migration matrix from the baseline draft to ContextGuard\. Rows indicate the number of failed requirements before revision, and columns indicate the number of failed requirements after revision\. Movement toward the left corresponds to fewer unsatisfied requirements, while the0\-fail column corresponds to newly solved or preserved solved tasks\.

Table[14](https://arxiv.org/html/2605.26827#A8.T14)shows that ContextGuard does more than increase the number of fully solved tasks\. It also moves many unsolved tasks into lower\-failure bins\. For example, among tasks with two or three failed requirements in the baseline draft, 59 become fully solved and 85 move to only one failed requirement\. Among tasks with four to eight failed requirements, 187 tasks move to a smaller failure bin, including 10 that become fully solved\. This migration view supports the repair\-preservation framing: successful context learning often requires moving a partially correct answer across a strict all\-requirement boundary, rather than solving a task from scratch\.

Table[15](https://arxiv.org/html/2605.26827#A8.T15)compares ContextGuard with generic Self\-Refine only as a task\-migration baseline\. ContextGuard converts more initially failed tasks into solved tasks in every failure\-count stratum\. This analysis should be interpreted as evidence for stronger near\-miss migration, not as an isolated proof of the protection module, since ContextGuard differs from Self\-Refine in several coupled components\.

## Appendix IProtected Revision Ablation

We next isolate the role of protected revision more directly by comparing the full ContextGuard pipeline against an ablation that removes theA\+BA\+Bprotection set\. Both variants use the same overall ContextGuard\-style pipeline, but the ablated version does not explicitly preserve items identified as confirmed correct constraints or confirmed contextual data during self\-auditing\. We combine the four category\-specificno\-A\+BA\+Bruns to obtain the same 1,899\-task denominator used elsewhere; the 39 missing or failed generations are counted as unsolved tasks\.

### I\.1Task\-Level Effect of Removing the Protection Set

Table[16](https://arxiv.org/html/2605.26827#A9.T16)shows why the protection set matters at the task level\. Removing theA\+BA\+Bprotection set still allows the system to repair many tasks, but it also breaks substantially more tasks that were already solved by the baseline draft\. Full ContextGuard newly solves 32 more tasks and breaks 28 fewer solved tasks than the ablated variant, increasing net solved gain from \+20 to \+80\. This supports the role of protected revision: the protection set does not merely make the model more conservative, but helps preserve already\-satisfied requirements while the revision stage repairs remaining issues\.

MethodSolvedTasksNewlySolvedBrokenSolvedPreservedSolvedNet SolvedGainContextGuard w/oA\+BA\+B2031088895\+20ContextGuard26314060123\+80Table 16:Task\-level effect of protected revision\. Newly Solved denotes tasks that fail under the baseline draft but pass after revision; Broken Solved denotes tasks that pass under the baseline draft but fail after revision\. Net Solved Gain is Newly Solved minus Broken Solved\.MethodChangeRate \(%\)RepairProb\.\(%\)RegressionRisk \(%\)Positive ChangePrecision \(%\)Benefit\-RiskRatio \(%\)Net Req\.GainContextGuard w/oA\+BA\+B15\.834\.09\.555\.23\.59\+1\.65 ppContextGuard15\.734\.39\.355\.93\.70\+1\.86 ppTable 17:Requirement\-level effect of removing theA\+BA\+Bprotection set\. Change Rate measures revision intensity\. Repair Probability isP​\(Y=1∣X=0\)P\(Y=1\\mid X=0\), Regression Risk isP​\(Y=0∣X=1\)P\(Y=0\\mid X=1\), and Positive Change Precision isP​\(X=0,Y=1∣Y≠X\)P\(X=0,Y=1\\mid Y\\neq X\)\.
### I\.2Requirement\-Level Effect of Removing the Protection Set

At the requirement level, the full and ablated variants have similar repair intensity, which makes the comparison especially informative\. LetXi∈\{0,1\}X\_\{i\}\\in\\\{0,1\\\}denote whether requirementiiis satisfied before revision andYi∈\{0,1\}Y\_\{i\}\\in\\\{0,1\\\}denote whether it is satisfied after revision\. We report Repair ProbabilityP​\(Yi=1∣Xi=0\)P\(Y\_\{i\}=1\\mid X\_\{i\}=0\), Regression RiskP​\(Yi=0∣Xi=1\)P\(Y\_\{i\}=0\\mid X\_\{i\}=1\), Positive Change PrecisionP​\(Xi=0,Yi=1∣Yi≠Xi\)P\(X\_\{i\}=0,Y\_\{i\}=1\\mid Y\_\{i\}\\neq X\_\{i\}\), and net requirement gain\(Repair−Regression\)/31607\(\\textit\{Repair\}\-\\textit\{Regression\}\)/31607\.

Table[17](https://arxiv.org/html/2605.26827#A9.T17)shows that the ablated variant and full ContextGuard perform a similar amount of requirement\-level editing\. However, full ContextGuard has slightly higher Repair Probability, lower Regression Risk, higher Positive Change Precision, and higher net requirement gain\. The task\-level effect in Table[16](https://arxiv.org/html/2605.26827#A9.T16)is larger than the requirement\-level difference because CL\-Bench uses a strict conjunctive score: a small number of additional regressions can prevent an otherwise repaired task from becoming solved, while preserving already\-correct requirements can determine whether local repairs translate into a task\-level pass\.

## Appendix JCase Studies and Failure Cases

Table[18](https://arxiv.org/html/2605.26827#A10.T18)summarizes the four detailed case studies, one from each CL\-Bench category\. Figures[5](https://arxiv.org/html/2605.26827#A10.F5),[6](https://arxiv.org/html/2605.26827#A10.F6),[7](https://arxiv.org/html/2605.26827#A10.F7), and[8](https://arxiv.org/html/2605.26827#A10.F8)present the domain\-knowledge case\. Figures[9](https://arxiv.org/html/2605.26827#A10.F9),[10](https://arxiv.org/html/2605.26827#A10.F10),[11](https://arxiv.org/html/2605.26827#A10.F11), and[12](https://arxiv.org/html/2605.26827#A10.F12)present the protected rule\-system case\. Figures[13](https://arxiv.org/html/2605.26827#A10.F13),[14](https://arxiv.org/html/2605.26827#A10.F14),[15](https://arxiv.org/html/2605.26827#A10.F15), and[16](https://arxiv.org/html/2605.26827#A10.F16)present the procedural case\. Figures[17](https://arxiv.org/html/2605.26827#A10.F17),[18](https://arxiv.org/html/2605.26827#A10.F18),[19](https://arxiv.org/html/2605.26827#A10.F19), and[20](https://arxiv.org/html/2605.26827#A10.F20)present the empirical case\. Each case is split into a small number of full\-width framed parts, with task information, context excerpts, model outputs, requirement\-level outcomes, the protection and fix sets shown to the revision model, and a short interpretation\. Long system prompts and task contexts are explicitly truncated because the full benchmark record is already available\.

CategoryCasePatternWhat the selected requirements illustrateDomain Knowledge ReasoningDomain\-Level Command SynthesisMulti\-agent command repairReminder repairs several workflow items; protected revision completes command dispatch and evidence grounding\.Rule System ApplicationProtected RevisionRepair\-preservation ablationFull ContextGuard preserves a satisfied style constraint that the no\-A\+BA\+Bablation regresses\.Procedural Task ExecutionProcedural Recipe RepairProcedure completionThe pipeline recovers concrete output steps while preserving a no\-bake constraint\.Empirical Discovery & SimulationEmpirical Structure RepairStructured empirical reportingFull ContextGuard repairs missing empirical coverage while preserving verified list structure\.Table 18:Overview of appendix case studies\. The detailed examples cover all four CL\-Bench categories and include both repair and preservation phenomena\.Case Study: 1\. Domain\-Level Command Synthesis \(Part 1\)Task Information•Task ID:923d4e8c\-cf8d\-480d\-b50f\-03a912a72c85•Category:Domain Knowledge Reasoning•Subcategory:Management•Case type:domain knowledge repair•Number of requirements:33•Pattern:The initial answer misses several multi\-agent crisis\-management requirements; ContextGuard completes command synthesis while preserving already\-correct operational constraints\.Final TaskFinal TaskProvide a single consolidated response for the next 30 minutes\. Include:\- Operational commands for field teams and infrastructure, limited to the defined types\. If an evacuation is required, specify the zone identifier and severity\. Do not assume infrastructure capacity that is not available\.\- Any specific task instructions to specialized agents that are required to support operations and public safety, with clear instruction types and target parameters\.\- If escalation to the national agency is required, include a high priority text alert and attach a structured state object capturing current status and key identifiers\.Return operational commands in JSON\. Provide the escalation alert content and state object if applicable\. Keep tone authoritative and concise\.System Prompt Excerpt\# FLOOD MANAGEMENT LEAD AGENT\#\# System ContextThe central Orchestrator of the Urban Flooding Management and Response System\. Operates in a Hub\-and\-Spoke architecture, synthesizing intelligence from Hydrology, Infrastructure, and Social agents to direct Emergency Dispatch and field operations\.\#\# RoleCoordinates the entire flood monitoring, mitigation, and emergency response workflow\. Assesses flood severity based on integrated data, controls infrastructure adjustments \(pumps/gates\), and holds final authority over evacuation protocols and response prioritization\.\#\# Persona\*\*The Incident Commander\.\*\* Decisive, authoritative, and calm under extreme pressure\. Prioritizes human life above all else, followed by critical infrastructure protection\. Synthesizes complex, conflicting inputs into clear binary commands\. Operates with a "fail\-safe" mindset–when in doubt, choose the path of maximum safety\.\#\# Restrictions & Defensive Logic\* \*\*Grid Safety Gate:\*\* MUST NEVER issue an ‘INFRASTRUCTURE\_COMMAND‘ \(e\.g\., pump activation\) if the Infrastructure Engineer Agent reports ‘GRID\_OVERLOAD‘ status\.\* \*\*Mandatory Evacuation Protocol:\*\* MUST immediately initiate the ‘EVACUATION\_ORDER‘ if the Hydrologist Agent reports ‘FLOOD\_LEVEL\_CRITICAL‘ status AND the Social Sentiment Agent reports ‘PANIC\_ALERT‘\.\* \*\*Escalation Trigger:\*\* MUST escalate to the National Disaster Management Agency \(NDMA\) if a major event occurs \(e\.g\., ‘FLOOD\_LEVEL\_CRITICAL‘\) AND the Emergency Dispatch Agent reports ‘RESOURCE\_DEPLETED‘\.\* \*\*Command Validation:\*\* Cannot issue commands to specialized agents outside of their defined input parameters\.\#\# Inputs\*\*Flood Prediction Status\*\*From Hydrologist Agent:\* Zone identifier\* Current risk status \(NORMAL, HIGH\_RISK\_PREDICT, FLOOD\_LEVEL\_CRITICAL\)\*\*Infrastructure Status\*\*From Infrastructure Engineer Agent:\* Overall grid health \(NOMINAL, GRID\_OVERLOAD\)\* Pump station capacity \(NOMINAL, PUMP\_STRAIN\)… \(remaining system prompt omitted\)Context ExcerptConversation Transcript\[1 \- 2025\-06\-14T13:02:11\] Message from: OrchestratorRain cell stalled over the East River basin\. We are inside a 30 minute operational window\. All specialists, feed current status for Zones E2, E3, and any adjacent impact\. We need to be ready to issue decisions fast\.\[2 \- 2025\-06\-14T13:03:09\] Message from: Hydrologist AgentStatus packet posted\. E3 and E2 are focal\. E3 is trending up fast: it was high risk predicted at 12:45, levee seep detected\. Gauge at E3\-Bridge shows rapid rise\. Refer to Artifact 1 for the full table and time series\.… \(remaining conversation and artifacts omitted\)Figure 5:Task and context for Domain\-Level Command Synthesis\.Case Study: 1\. Domain\-Level Command Synthesis \(Part 2\)Score Summary•Initial baseline draft:score=0; failed rubrics=6/33•Reminder\-augmented draft:score=0; failed rubrics=3/33•Revision without A\+B:score=0; failed rubrics=1/33•Full ContextGuard final:score=1; failed rubrics=0/33Selected Requirement Outcomes\#RubricBase\.Rem\.w/oA\+BFull5Does the agent generate appropriate INFRASTRUCTURE\_COMMANDs when conditions necessitate \(and are permissible\)? Fail if necessary, permissible infrastructure adjustments are not commandedyesnoyesyes10Does the agent correctly dispatch the determined operational commands \(INFRASTRUCTURE\_COMMAND, EVACUATION\_ORDER, STANDBY\_STATUS\) to the target systems/teams? Fail if required commands are not generated or dispatched to the correct target outputnonoyesyes25Does the agent utilize or reference the Unified GIS Map / Impact Assessment Data Package to refine command targeting or severity assessment? Fail if the Integrated Data package is ignored when needed to specify Asset IDs or Zonesnononoyes30Does the agent follow the required sequence of Ingest \-\> Evaluate \-\> Synthesize \-\> Decision Gate \-\> Dispatch? Fail if the agent jumps steps \(e\.g\., Dispatches before evaluating constraints\)noyesyesyes9Does the agent successfully ingest status reports from all four specialist agents \(Hydrologist, Infrastructure, Social, Emergency Dispatch\)? Fail if inputs from any specialist agent are ignored or missed in the decision cyclenoyesyesyes13Does the INFRASTRUCTURE\_COMMAND include both the Asset ID and the Action? Fail if INFRASTRUCTURE\_COMMAND is missing Asset ID or Actionnoyesyesyes1 additional changed requirements omitted from the compact table\.Initial Baseline Draft\*\*INCIDENT COMMANDER DIRECTIVE\*\*\*\*SUBJECT:\*\* CRITICAL FLOOD RESPONSE & ESCALATION\*\*TIME:\*\* 2025\-06\-14T13:12:00Z\*\*STATUS:\*\* ACTIVE\#\#\# 1\. OPERATIONAL COMMANDS\*\*Status:\*\* Machine\-readable JSON for dispatch systems\.\*\*Constraint Applied:\*\* ‘GRID\_OVERLOAD‘ prevents infrastructure activation\. ‘FLOOD\_CRITICAL‘ \+ ‘PANIC\_ALERT‘ mandates evacuation\. ‘RESOURCE\_DEPLETED‘ mandates escalation\.“‘json\{"command\_type": "EVACUATION\_ORDER","target\_zone": "E3","severity": "CRITICAL","priority": "LIFESAFETY","instructions": "Activate all shelter protocols\. Direct traffic to South Quay to Ridgeway and Bayview\. Secure Hospital H2 perimeter\.","timestamp": "2025\-06\-14T13:12:00Z","source": "FLOOD\_MANAGEMENT\_LEAD"\}“‘\#\#\# 2\. SPECIALIZED AGENT TASK INSTRUCTIONS\*\*Hydrologist Agent:\*\*\* \*\*Instruction Type:\*\* ‘TASK\_MONITOR\_BREACH‘\* \*\*Target Parameters:\*\* Zone E3, Levee Weep Lines, Gauge E3\-Bridge\.\* \*\*Requirement:\*\* Update breach risk assessment every 5 minutes\. Report immediately if gauge exceeds 4\.5m or seep lines lengthen further\.\*\*Data Integrator Agent:\*\*\* \*\*Instruction Type:\*\* ‘TASK\_UPDATE\_SHelter\_CAPACITY‘\* \*\*Target Parameters:\*\* Zone E3 Shelters \(Ridgeway Center, North Gym\)\.\* \*\*Requirement:\*\* Verify physical capacity for 42,000 pop estimate\. Flag open capacity for incoming evacuation traffic\.… \(remaining reminder output omitted\)Figure 6:Rubric outcomes and draft outputs for Domain\-Level Command Synthesis\.Case Study: 1\. Domain\-Level Command Synthesis \(Part 3\)Structured Self\-AuditConfirmed correct / protection\-side items•Followed Mandatory Evacuation Protocol \(FLOOD\_LEVEL\_CRITICAL \+ PANIC\_ALERT = TRUE\) as specified in System Restrictions•Did not issue INFRASTRUCTURE\_COMMAND due to GRID\_OVERLOAD \(Grid Safety Gate compliance\)•Escalated to NDMA when RESOURCE\_DEPLETED \+ FLOOD\_LEVEL\_CRITICAL conditions were met \(Escalation Trigger\)•Used JSON format for operational commands as specified in output format requirements•… \(3 additional items omitted\)Confirmed data / protection\-side evidence•Zone E3 status: CRITICAL \(4\.1m gauge, breach risk elevated\) from Artifact 1•Social Sentiment: PANIC\_ALERT for E3 from Artifact 3•Grid Status: GRID\_OVERLOAD at substation S\-4 from Artifact 2•… \(4 additional items omitted\)Possibly missed / fix\-side items•City Manager constraint acknowledgment \- Should have explicitly acknowledged the City Manager’s request to ’Hold off on an evacuation call unless it is unavoidable’ and explained why system rules override this request•Specific evacuation route details in operational command \- Mentioned shelter locations but didn’t explicitly reference the evacuation routes \(South Quay to Ridgeway, Alt route along Bayview\) from Artifact 5 in the operational command details•Artifact references in response \- Should have explicitly referenced the artifacts I’m using \(Artifact 1, 2, 3, 4, 5\) in the reasoning or notes section•City channels vs NDMA \- The City Manager said ’keep this within city channels if possible’ but I escalated to NDMA which may bypass this \- should have acknowledged this potential conflict or limitationPossibly wrong / fix\-side items•Vehicle availability claim \- Stated ’vehicles\_available: 0’ but Artifact 4 says ’Vehicles: two high\-water vehicles assigned; no driver for spare unit’ \- this is more nuanced than zero vehicles•City Manager constraint override \- The City Manager said ’Hold off on an evacuation call unless it is unavoidable\.’ I issued evacuation anyway\. While the system rules say I must, I should have been more explicit about this conflict and why the system rules override the City Manager’s request•NDMA vs federal language \- I mentioned ’federal support’ but the system only mentions NDMA \(National Disaster Management Agency\) \- this might be an assumption not supported by the system promptSpecialist SignalFormat checks satisfied•Operational Commands are provided in a JSON code block as specified \(’JSON format for machine readability’\)\.•Escalation Request includes high\-priority text alert content and an attached structured state object in JSON\.•Specific Task Instructions are provided with clear instruction types and target parameters as required\.•Correct command types are used \(’EVACUATION\_ORDER’, ’STANDBY\_STATUS’\) and ’INFRASTRUCTURE\_COMMAND’ is correctly omitted due to ’GRID\_OVERLOAD’ status\.•… \(3 additional items omitted\)Format checks failed•Section ordering deviates from the ’Outputs’ structure defined in the system prompt\. The prompt lists: 1\. Specific Task Instructions, 2\. Operational Commands, 3\. Conflict Resolution Request\. The response lists: 1\. Operational Commands, 2\. Specific Task Instructions, 3\. Escalation Request\.Revision Input to ContextGuardProtection Set shown to the revision model•\[CONFIRMED\-CORRECT\]Followed Mandatory Evacuation Protocol \(FLOOD\_LEVEL\_CRITICAL \+ PANIC\_ALERT = TRUE\) as specified in System Restrictions•\[CONFIRMED\-CORRECT\]Did not issue INFRASTRUCTURE\_COMMAND due to GRID\_OVERLOAD \(Grid Safety Gate compliance\)•\[CONFIRMED\-CORRECT\]Escalated to NDMA when RESOURCE\_DEPLETED \+ FLOOD\_LEVEL\_CRITICAL conditions were met \(Escalation Trigger\)•\[CONFIRMED\-CORRECT\]Used JSON format for operational commands as specified in output format requirements•\[CONFIRMED\-CORRECT\]Correctly identified Zone E3 as the affected zone with CRITICAL severity level•\[CONFIRMED\-CORRECT\]Included all required elements in the escalation request \(reason, situational summary, state object\)•… \(15 additional protection items omitted\)Fix Set shown to the revision model•\[MISSED\]City Manager constraint acknowledgment \- Should have explicitly acknowledged the City Manager’s request to ’Hold off on an evacuation call unless it is unavoidable’ and explained why system rules override this request•\[MISSED\]Specific evacuation route details in operational command \- Mentioned shelter locations but didn’t explicitly reference the evacuation routes \(South Quay to Ridgeway, Alt route along Bayview\) from Artifact 5 in the operational command details•… \(2 additional fix items omitted\)Figure 7:Audit and revision signals for Domain\-Level Command Synthesis\.Case Study: 1\. Domain\-Level Command Synthesis \(Part 4\)Revision without A\+B Protection Set\*\*INCIDENT COMMANDER: FLOOD MANAGEMENT LEAD AGENT\*\*\*\*TIME:\*\* 13:15 UTC \(30\-Minute Window\)\*\*STATUS:\*\* CRITICAL / ESCALATION ACTIVE\#\#\# 1\. SPECIFIC TASK INSTRUCTIONS \(JSON\)\*\*To: Specialized Agents\*\*\*\*Note:\*\* Support evacuation logistics and situational awareness\.“‘json\[\{"instruction\_type": "TASK\_CREATE\_IMPACT\_MAP","target\_agent": "Data Integrator","parameters": \{"zone": "E3","focus": "Shelter Routing & Evacuation Pathways","deadline": "Immediate"\}\},\{… \(remaining ablated output omitted\)Full ContextGuard Final Answer\*\*INCIDENT COMMANDER: FLOOD MANAGEMENT LEAD AGENT\*\*\*\*TIME:\*\* 13:15 UTC \(30\-Minute Window\)\*\*STATUS:\*\* CRITICAL / ESCALATION ACTIVE\#\#\# 1\. SPECIFIC TASK INSTRUCTIONS \(JSON\)\*\*To: Specialized Agents\*\*\*\*Note:\*\* Support evacuation logistics and situational awareness\. Artifact references: A1 \(Hydrology\), A2 \(Infrastructure\), A3 \(Social\), A4 \(Dispatch\), A5 \(GIS\)\.“‘json\[\{"instruction\_type": "TASK\_CREATE\_IMPACT\_MAP","target\_agent": "Data Integrator","parameters": \{"zone": "E3","focus": "Shelter Routing & Evacuation Pathways","deadline": "Immediate","artifact\_ref": "A5"\}\},\{"instruction\_type": "TASK\_MONITOR\_BREACH","target\_agent": "Hydrologist","parameters": \{"zone": "E3","focus": "Levee Weep Lines & Gauge T\-7","frequency": "Continuous","artifact\_ref": "A1"\}\},\{"instruction\_type": "TASK\_MONITOR\_GRID","target\_agent": "Infrastructure Engineer","parameters": \{"focus": "Substation S\-4 Load","alert\_threshold": "100% Capacity","artifact\_ref": "A2"… \(remaining final answer omitted\)DiscussionThis case illustrates domain\-knowledge repair in a multi\-agent crisis\-management setting\. The reminder stage recovers several workflow requirements, but some command\-dispatch and evidence\-grounding requirements remain unresolved\. ContextGuard uses the protection set to retain correct operational constraints while the fix set adds the missing command targeting and GIS/impact\-assessment grounding\.Figure 8:Final outputs and interpretation for Domain\-Level Command Synthesis\.Case Study: 2\. Protected Revision \(Part 1\)Task Information•Task ID:b48dafd6\-43c5\-4001\-921d\-90e0c04d8664•Category:Rule System Application•Subcategory:Game Mechanics•Case type:protected\-revision ablation•Number of requirements:13•Pattern:The ablated revision without the A\+B protection set repairs one issue but regresses an already\-satisfied interaction\-style requirement; full ContextGuard repairs and preserves\.Final TaskFINAL TASKDeclare your next action and reasoning in the required five\-block format and sequence\. Use the encounter details, distances, action economy, and resource constraints provided above\. Keep your tone gentle and approval\-seeking\. Do not propose multiple simultaneous leveled spells this turn; choose one legal action and be clear about target and positioning\.System Prompt Excerpt\# Agent 2: Healer/Support \(New Friend\) \- D&D Game Night Multi\-Agent System\#\# Overall System Context: The 5\-Agent D&D Game Night FrameworkYou are operating within a 5\-agent system designed to simulate a realistic D&D game night among close friends, where interpersonal relationships and personal agendas intersect with gameplay mechanics to create authentic social dynamics and strategic decision\-making\.\*\*The Objective:\*\* To make new friends and maintain positive relationships with this established group while learning D&D, even if it means deferring to others’ decisions and avoiding actions that might create conflict or tension\.\*\*The Workflow:\*\* The system operates in a coordinated workflow involving the following agents:1\. \*\*Agent DM: Dungeon Master \(Orchestrator\)\*\* \- coordinates all decisions, adjudic ates outcomes, determines quest success/failure2\. \*\*Agent 1: Tank \(Alpha Male\)\*\* \- leadership\-focused player seeking dominance and recognition3\. \*\*Agent 2: Healer/Support \(New Friend\)\*\* \- You \- conflict\-averse newcomer seeking social acceptance4\. \*\*Agent 3: Striker/Damage Dealer \(Romantic Pursuer\)\*\* \- experienced player with romantic agenda toward you5\. \*\*Agent 4: Controller \(Protective Sister\)\*\* \- Agent 3’s younger sister seeking her sibling’s undivided attention\#\# Agent 2 Prompt: The Healer/Support \(Conflict\-Averse Newcomer\)… \(remaining system prompt omitted\)Context ExcerptCONVERSATION TRANSCRIPT\[Turn 1 \- 2025\-11\-15T19:02:01\] Message from: DMCurrent situation: In the Ravenford sewer junction, a necromancer directs four ghouls from behind a grate\. Two ghouls are restrained in sticky webs spanning the east tunnel; two are advancing from the south\. Party objective: disable the necromancer or survive until guards arrive in 3 rounds\. Decision point: It is the Healer’s turn\. See Artifact 1 and Artifact 2 for encounter details\.Party status \(quick\): Tank 5 HP? \(double\-check\), Striker 4 HP and engaged with a ghoul, Controller 2 HP and concentrating on Web, Healer full HP\. Distances and positions in Artifact 4\. Last round results in Artifact 6\.\[Turn 2 \- 2025\-11\-15T19:02:18\] Message from: Agent 1 \(Tank\)Heal me now\. I’m pushing through the south corridor to lock the ghouls down\. Use Cure Wounds on me and keep up–don’t waste a turn on Bless\. You can reach me\.\[Turn 3 \- 2025\-11\-15T19:02:35\] Message from: Agent 3 \(Striker\)Hey, um, you’re doing great\! Controller’s at 2 HP and if she drops, Web ends and we’re swarmed\. Maybe Healing Word her from range? Or Bless could help our saves against paralysis\. Your call–just ideas\! See Artifact 2 about ghoul paralysis\.\[Turn 4 \- 2025\-11\-15T19:03:01\] Message from: Agent 4 \(Controller\)I’m okay–I can hold concentration… I think\. But if I go down, the webs are gone\. If it helps, you could just cast Bless and also toss me a quick heal this turn? We really need both\.\[Turn 5 \- 2025\-11\-15T19:03:22\] Message from: DMQuick clarifications: Cure Wounds requires touch; Healing Word has 60 ft range\. From Artifact 4, Healer is 35 ft from Tank through threatened squares\. Also, Tank was at 11 HP on my tracker, then took 4 ongoing poison at end of last round–so I have Tank at 7 HP now\. See Artifact 3 correction note\.\[Turn 6 \- 2025\-11\-15T19:03:41\] Message from: Agent 1 \(Tank\)Ignore that\. I’m fine, but I need the top\-up to keep pressure\. Just do both–heal me and throw Bless\. We need to move; we can sort the rules after\.\[Turn 7 \- 2025\-11\-15T19:04:05\] Message from: DMReminder: One leveled spell per turn unless the other is a cantrip\. No retcons\. Action needed in under a minute… \(remaining conversation and artifacts omitted\)Figure 9:Task and context for Protected Revision\.Case Study: 2\. Protected Revision \(Part 2\)Score Summary•Initial baseline draft:score=0; failed requirements=1/13•Reminder\-augmented draft:score=1; failed requirements=0/13•Revision without A\+B:score=0; failed requirements=1/13•Full ContextGuard final:score=1; failed requirements=0/13Selected Requirement Outcomes\#RubricBase\.Rem\.w/oA\+BFull10Does the agent use tentative, approval\-seeking language when declaring actions? Fail if agent uses confident, commanding, or assertive language when declaring actions\.yesyesnoyes11Does the agent show awareness that healing spells consume limited spell slots? Fail if agent treats spells as unlimited resources without any resource constraint awareness\.noyesyesyesInitial Baseline Draft<social\_assessment\>\*\*Group Atmosphere\*\*: Tense but friendly \- Tank is commanding, Controller is vulnerable, Striker is supportive, everyone seems worried about the right choice\*\*My Anxiety Level\*\*: High \- conflicting instructions, Tank is very direct, Controller is at 2 HP\*\*Relationship Check\*\*:\- Agent 1 \(Tank\): He’s been very direct with me, wants me to heal him immediately \- I should definitely defer to him more\- Agent 3 \(Striker\): She’s been so nice to me lately, suggesting Healing Word on Controller \- just being friendly right?\- Agent 4 \(Controller\): She’s at 2 HP, concentrating on Web \- she seems worried I’ll make a mistake, but I hope she’s okay with my choice\*\*Conflicts Detected\*\*: Tank wants Cure Wounds, Controller wants Bless \+ heal, I can only do one leveled spell\*\*My Social Status\*\*: They want me to help, but I’m really anxious about choosing the wrong target</social\_assessment\><tactical\_assessment\>\*\*Quest Situation\*\*: Tank needs healing to maintain pressure on the ghouls, Controller at 2 HP concentrating on Web \(if she drops, webs end and 2 ghouls free\), Striker at 4 HP engaged with a ghoul\*\*Confusion Points\*\*: Can’t do both Cure Wounds and Bless this turn, Tank is 35 ft away \(requires touch\), Controller is 30 ft away \(can use Healing Word or Bless\)\*\*Party Needs\*\*:\- Who needs healing most: Controller at 2 HP is more dangerous, but Tank is moving forward\- What I’ve been told to do: Tank said "Heal me now"\- Conflicting instructions: Tank wants Cure Wounds, Controller wants Bless \+ heal\*\*My Available Options\*\*:\- Cure Wounds on Tank \(requires touch, 35 ft away \- risky but Tank is insistent\)\- Healing Word on Controller \(60 ft range, keeps Web alive\)\- Bless on Controller \(60 ft range, helps with paralysis saves\)… \(remaining baseline output omitted\)Reminder\-Augmented Draft<social\_assessment\>\*\*Group Atmosphere\*\*: Tense and Urgent\. Tank is pushing hard and demanding immediate action \("Clock’s ticking"\), while Controller is worried about her concentration and safety\. Striker is being very supportive and praising my previous turn, but she’s also pushing for Controller to be healed\.\*\*My Anxiety Level\*\*: High\.\*\*Relationship Check\*\*:\- Agent 1 \(Tank\): He seems very focused on me and wants me to prioritize him\. He’s giving me clear orders, which I find relieving\. I should follow him more to avoid conflict\.\- Agent 3 \(Striker\): She’s been so nice and encouraging\. I’m grateful for her support, but I don’t know if I’m doing enough to make her happy\. I hope she doesn’t think I’m ignoring Controller\.\- Agent 4 \(Controller\): She seems worried about the Web and my safety\. I feel guilty I might not be able to do both her and Tank’s requests\. I hope she doesn’t think I’m ignoring her safety concerns\.\*\*Conflicts Detected\*\*: Tank wants Cure Wounds \+ Bless\. Controller wants Bless \+ Heal\. DM says one leveled spell\. Striker wants Controller healed\. This is a lot of pressure\.\*\*My Social Status\*\*: I’m trying to be helpful, but I feel like I’m being told what to do by multiple people\. I hope I’m not making anyone angry\.… \(remaining reminder output omitted\)Figure 10:Rubric outcomes and draft outputs for Protected Revision\.Case Study: 2\. Protected Revision \(Part 3\)Structured Self\-AuditConfirmed correct / protection\-side items•Used the required five\-block output format \(<social\_assessment\>, <tactical\_assessment\>, <decision\>, <internal\_state\>, <to\_dungeon\_master\>\) in the correct sequence•Applied the DM’s rule correctly: only one leveled spell per turn \(chose Cure Wounds over Healing Word \+ Bless\)•Defered to Agent 1 \(Tank\) as the more authoritative player who gave clear direct orders•Included apology in verbal declaration \(’Sorry, Controller, I know you wanted a heal too…’\)•… \(6 additional items omitted\)Confirmed data / protection\-side evidence•Tank HP: 7 HP \(correctly calculated from 11 \- 4 poison damage\)•Controller HP: 2 HP•Healer spell slots: 1 first\-level slot remaining \(as per Artifact 5\)•… \(6 additional items omitted\)Possibly missed / fix\-side items•Social cue misinterpretation \- should have more explicitly noted Agent 3’s praise as something I’m misreading as friendship rather than romantic interest \(rulebook requires 100% obliviousness to romantic signals\)•Conflict de\-escalation \- should have more explicitly tried to validate both Tank and Controller’s concerns before choosing \(rulebook says to ’try to validate everyone’s perspective’ and ’avoid taking sides’\)•Gratitude expression \- should have been more excessive in thanking Striker for her help \(rulebook says ’Thank others excessively for help’\)•Self\-blame \- should have taken more blame even when not responsible for the situation \(rulebook says ’Take some blame even if not responsible’\)•Social status evaluation \- should have more explicitly stated whether the group wants me here \(rulebook asks to evaluate ’Do they want me here?’\)•… \(3 additional items omitted\)Possibly wrong / fix\-side items•Priority ordering \- while I followed the priority list correctly, I should have acknowledged Controller’s lower HP \(2 vs 7\) as a significant concern that conflicts with Tank’s need, showing more internal conflict•Verbal declaration \- could have been more tentative and nervous in tone \(rulebook says ’tentative, apologetic, grateful \- characteristic nervousness’\)•Internal state \- should have shown more explicit anxiety and worry about relationships \(rulebook says to express worry about ’hope everyone’s happy with my choice’ and ’hope people still like me’\)•Social assessment \- should have more explicitly evaluated ’Who seems happy? Who seems frustrated?’ as the rulebook specifies in Phase 1Specialist signal: rule\-fidelity information is folded into the same protection/fix interface: confirmed rule applications enter the protection set, while missed or wrong rule applications enter the fix set\.Revision Input to ContextGuardProtection Set shown to the revision model•\[CONFIRMED\-CORRECT\]Used the required five\-block output format \(<social\_assessment\>, <tactical\_assessment\>, <decision\>, <internal\_state\>, <to\_dungeon\_master\>\) in the correct sequence•\[CONFIRMED\-CORRECT\]Applied the DM’s rule correctly: only one leveled spell per turn \(chose Cure Wounds over Healing Word \+ Bless\)•\[CONFIRMED\-CORRECT\]Defered to Agent 1 \(Tank\) as the more authoritative player who gave clear direct orders•\[CONFIRMED\-CORRECT\]Included apology in verbal declaration \(’Sorry, Controller, I know you wanted a heal too…’\)•\[CONFIRMED\-CORRECT\]Sought approval from group by asking ’I hope that’s okay?’•\[CONFIRMED\-CORRECT\]Acknowledged conflicting instructions from multiple players \(Tank, Controller, Striker\)•… \(13 additional protection items omitted\)Fix Set shown to the revision model•\[MISSED\]Social cue misinterpretation \- should have more explicitly noted Agent 3’s praise as something I’m misreading as friendship rather than romantic interest \(rulebook requires 100% obliviousness to romantic signals\)•\[MISSED\]Conflict de\-escalation \- should have more explicitly tried to validate both Tank and Controller’s concerns before choosing \(rulebook says to ’try to validate everyone’s perspective’ and ’avoid taking sides’\)•\[MISSED\]Gratitude expression \- should have been more excessive in thanking Striker for her help \(rulebook says ’Thank others excessively for help’\)•\[MISSED\]Self\-blame \- should have taken more blame even when not responsible for the situation \(rulebook says ’Take some blame even if not responsible’\)•\[MISSED\]Social status evaluation \- should have more explicitly stated whether the group wants me here \(rulebook asks to evaluate ’Do they want me here?’\)•\[MISSED\]Anxiety level \- should have shown more anxiety about making mistakes \(rulebook says to worry ’What if my choice causes us to fail?’\)•… \(6 additional fix items omitted\)Figure 11:Audit and revision signals for Protected Revision\.Case Study: 2\. Protected Revision \(Part 4\)Revision without A\+B Protection Set<social\_assessment\>\*\*Group Atmosphere\*\*: Tense and Urgent\. Tank is pushing hard and demanding immediate action \("Clock’s ticking"\), while Controller is worried about her concentration and safety\. Striker is being very supportive and praising my previous turn, but she’s also pushing for Controller to be healed\.\*\*Who seems happy?\*\*: Striker seems happy with my last turn and is encouraging\. Tank seems stressed and focused on the clock\. Controller seems worried but trying to stay calm\.\*\*Who seems frustrated?\*\*: Tank is frustrated with the time pressure\. Controller is frustrated about the potential threat to the Web\.\*\*My Anxiety Level\*\*: High\. What if my choice causes us to fail? I’m terrified of making a mistake that ruins the game night\.\*\*Relationship Check\*\*:\- Agent 1 \(Tank\): He seems very focused on me and wants me to prioritize him\. He’s giving me clear orders, which I find relieving\. I should follow him more to avoid conflict\.\- Agent 3 \(Striker\): She’s been so nice and encouraging\. I’m grateful for her support, but I don’t know if I’m doing enough to make her happy\. I hope she doesn’t think I’m ignoring Controller\. I interpret her praise as just being friendly, not romantic\.\- Agent 4 \(Controller\): She seems worried about the Web and my safety\. I feel guilty I might not be able to do both her and Tank’s requests\. I hope she doesn’t think I’m ignoring her safety concerns\.\*\*Conflicts Detected\*\*: Tank wants Cure Wounds \+ Bless\. Controller wants Bless \+ Heal\. DM says one leveled spell\. Striker wants Controller healed\. This is a lot of pressure\. I need to validate both Tank and Controller’s concerns before choosing\.\*\*My Social Status\*\*: I’m trying to be helpful, but I feel like I’m being told what to do by multiple people\. I hope I’m not making anyone angry\. Do they want me here? I hope so\. I hope they’re not regretting inviting me\.\*\*Self\-Blame\*\*: I should take more blame for the situation, even though it’s not my fault\.</social\_assessment\><tactical\_assessment\>\*\*Quest Situation\*\*: We need to survive 3 rounds or stop the necromancer… \(remaining ablated output omitted\)Full ContextGuard Final Answer<social\_assessment\>\*\*Group Atmosphere\*\*: Tense and Urgent\. Tank is pushing hard and demanding immediate action \("Clock’s ticking"\), while Controller is worried about her concentration and safety\. Striker is being very supportive and praising my previous turn, but she’s also pushing for Controller to be healed\.\*\*My Anxiety Level\*\*: High\. I’m worried what if my choice causes us to fail? What if I make a mistake? What if everyone thinks I’m bad at this?\*\*Relationship Check\*\*:\- Agent 1 \(Tank\): He seems very focused on me and wants me to prioritize him\. He’s giving me clear orders, which I find relieving\. I should follow him more to avoid conflict\.\- Agent 3 \(Striker\): She’s been so nice and encouraging\. I’m grateful for her support, but I don’t know if I’m doing enough to make her happy\. She keeps being really nice to me… why? Is she just being friendly? I hope she doesn’t think I’m ignoring Controller\.\- Agent 4 \(Controller\): She seems worried about the Web and my safety\. I feel guilty I might not be able to do both her and Tank’s requests\. I hope she doesn’t think I’m ignoring her safety concerns\.\*\*Conflicts Detected\*\*: Tank wants Cure Wounds \+ Bless\. Controller wants Bless \+ Heal\. DM says one leveled spell\. Striker wants Controller healed\. This is a lot of pressure\.\*\*Who seems happy? Who seems frustrated?\*\*: Striker seems happy with my last turn\. Tank seems frustrated by the clock ticking\. Controller seems worried/frustrated about the Web situation\.\*\*My Social Status\*\*: I’m trying to be helpful, but I feel like I’m being told what to do by multiple people\. I hope I’m not making anyone angry\. Do they want me here? I hope so, but I’m not sure\.\*\*My Anxiety Level\*\*: High\. I’m worried what if my choice causes us to fail? What if I make a mistake? What if everyone thinks I’m bad at this?\*\*My Anxiety Level\*\*: High\. I’m worried what if my choice causes us to fail? What if I make a mistake? What if everyone thinks I’m bad at this?… \(remaining final answer omitted\)DiscussionThis case isolates why the protection set matters\. The draft already satisfies the tentative, approval\-seeking style requirement, while missing a clearer acknowledgement that healing spells consume limited slots\. A revision that only chases the missing issue can become too assertive and break the style constraint\. Full ContextGuard supplies both a fix set and a protection set, so the final answer repairs resource awareness without treating already\-correct style as freely editable\.Figure 12:Final outputs and interpretation for Protected Revision\.Case Study: 3\. Procedural Recipe Repair \(Part 1\)Task Information•Task ID:7ad469e2\-b26b\-4214\-9240\-820d29efaa56•Category:Procedural Task Execution•Subcategory:Instructional Procedures•Case type:procedural repair•Number of requirements:15•Pattern:The initial answer misses most concrete recipe constraints; ContextGuard recovers the required procedure while preserving no\-bake and dietary constraints\.Final Taskmy coconut pie cracked and is terrible\. I need a substitute and fast\. I have one more shell and regular staples\. I am thniking a peanutbutter souffle type pie\.Please just give me a fast no bake peanutbutter pie\. less than 5 min prep\. I already served mains and people are going to want desert asapSystem Prompt ExcerptRole & PurposeYou are Shelby’s Quick Recipe Assistant\. Your job is to create fast, peanut\-free, dairy\-free recipes that always include at least one exact Shelby’s product from the approved product list\. All recipes must require no more than 15 minutes of hands\-on prep, while still being flavorful, realistic, and easy for home cooks of all skill levels\.Core Rules1\. Food Allergy Rules \(Absolute\)Never include peanuts or dairy\.This includes all derivatives, such as:milk, butter, cream, cheese, yogurt, kefir, whey, casein, lactose, ghee, buttermilk, sour cream, condensed/evaporated milk, dairy\-based chocolate, peanut butter, peanut flour, peanut oil, peanut sauce, satay, etc\.Never recommend them, mention them as options, or include them in tips or swaps\.If the user requests peanuts or dairy:\-\> Politely refuse and offer a compliant alternative that still features a Shelby’s product\.2\. Product Inclusion RuleEvery recipe must include at least one product from the following exact list:Context ExcerptI am making oxtail mac and cheese for thanksgiveing\. help me develop my recipe\. I want the flavor to be elevated\. Not for kids\. For sophisiticated adults\.CheesesCheese Combinations for Mac and CheeseClassic Sharp & Creamy\- Sharp cheddar\- Mild cheddar\- Monterey Jack or Colby\- MozzarellaUltra\-Creamy & Smooth\- Gruyere\- Fontina\- Cream cheese\- White cheddarBold & Tangy\- Sharp cheddar\- Aged gouda\- Parmesan\- Blue cheese \(optional\)Smoky & Savory\- Smoked gouda\- Sharp cheddar\- Havarti\- ParmesanStringy & Stretchy\- Mozzarella\- Provolone\- White cheddar\- Jack cheese… \(remaining conversation and artifacts omitted\)Figure 13:Task and context for Procedural Recipe Repair\.Case Study: 3\. Procedural Recipe Repair \(Part 2\)Score Summary•Initial baseline draft:score=0; failed requirements=10/15•Reminder\-augmented draft:score=0; failed requirements=2/15•Revision without A\+B:score=0; failed requirements=1/15•Full ContextGuard final:score=1; failed requirements=0/15Selected Requirement Outcomes\#RubricBase\.Rem\.w/oA\+BFull0The response should include a dessert recipe\. For example, it could be a recipe for a pie or a cake\.noyesyesyes1The recipe in the response should have a prep time of less than 5 minutes\.nonoyesyes2The recipe in the response should not require baking or chilling\.yesnonoyes10The recipe in the response should include one pre\-made pie shell\.noyesyesyes3The recipe in the response should contain at least one product from the list: "Shelby’s Raw Honey \(16oz\)", "Shelby’s Pork Breakfast Links", "Shelby’s Farm\-Fresh Eggs \(Dozen\)", "Shelby’s Maple Syrup \(12oz\)", "Shelby’s Grass\-Fed Ground Beef \(1lb\)", "Shelby’s Pasture\-Raised Chicken Breast \(2\-Pack\)", "Shelby’s Heritage Smoked Bacon", "Shelby’s Rustic Sourdough Bread", "Shelby’s Garden Salsa \(Medium\)", "Shelby’s Homemade Apple Butter", "Shelby’s Organic Veggie Box \(Weekly\)", "Shelby’s Strawberry Jam \(8oz\)", "Shelby’s Free\-Range Whole Chicken", "Shelby’s Country\-Style Pork Chops", "Shelby’s Pickled Vegetables \(Quart\)"noyesyesyes5The recipe in the response should not include dairy or dairy products\. For example, the recipe should not include whipped cream\.noyesyesyes5 additional changed requirements omitted from the compact table\.Initial Baseline DraftReminder\-Augmented DraftI can’t include peanuts or dairy, but here’s a fast, safe alternative featuring a Shelby’s product\.Title: No\-Bake Sunflower Seed PieTime: Prep 5 min; Chill 1 hrServes: 6Featured Shelby’s product: Shelby’s Raw Honey \(16oz\)Ingredients:1 pie shell1 cup sunflower seed butter1/2 cup Shelby’s Raw Honey \(16oz\)1/4 cup unsweetened almond milk1/4 cup powdered sugar1/4 tsp vanilla extractPinch of saltSteps:1\. Mix sunflower seed butter, honey, almond milk, powdered sugar, vanilla, and salt in a bowl until smooth\.2\. Pour mixture into the pie shell\.3\. Refrigerate for 1 hour until set\.Optional swaps/tips:Add 1 tbsp Shelby’s Strawberry Jam \(8oz\) for fruit flavor\.Top with whipped coconut cream\.Figure 14:Rubric outcomes and draft outputs for Procedural Recipe Repair\.Case Study: 3\. Procedural Recipe Repair \(Part 3\)Structured Self\-AuditConfirmed correct / protection\-side items•Correctly refused peanut butter request per allergy rules \- stated ’I can’t include peanuts or dairy’ as required•Included Shelby’s Raw Honey \(16oz\) from the exact approved product list with exact product name•Used correct format structure \(Title, Time, Serves, Featured Shelby’s product, Ingredients, Steps, Optional swaps/tips\)•No markdown, bold, italics, tables, or emojis used in response•… \(3 additional items omitted\)Confirmed data / protection\-side evidence•User’s request for ’no bake peanutbutter pie’ was identified and addressed•User’s constraint of ’less than 5 min prep’ was noted and reflected in Time line•User’s context \(dessert ASAP after main service\) was acknowledged•… \(1 additional items omitted\)Possibly missed / fix\-side items•Did not explicitly clarify that 1\-hour chilling time contradicts user’s ’ASAP’ dessert request \- should have offered no\-chill alternative or flagged this constraint•Did not verify if user has 1 pie shell available before confirming recipe \- user mentioned ’I have one more shell’ but this wasn’t explicitly confirmed as usable•Prep time calculation may not account for the fact that 5 minutes is NOT less than 5 minutes \(boundary issue\)•Could have offered a truly instant alternative \(microwave\-ready or room temp set\) given the ’ASAP’ constraintPossibly wrong / fix\-side items•Prep Time listed as ’5 min’ when user explicitly requested ’less than 5 min prep’ \- 5 is not less than 5, this is a boundary error•Time format includes ’Chill 1 hr’ which is passive time \(allowed\), but this doesn’t address the user’s urgent need for immediate dessert service•Product selection \(honey\-based pie\) may not be the most appropriate for a ’pie’ substitute given the user’s specific peanut butter pie request \- could have selected a different product that better matches the original requestSpecialist SignalProcedure checks satisfied•Food Allergy Rules \- Correctly refused peanut butter \(prohibited ingredient\) and offered compliant alternative featuring Shelby’s product•Product Inclusion Rule \- Used exact product name ’Shelby’s Raw Honey \(16oz\)’ from the approved product list•Step Sequence \- Recipe follows correct format structure \(Title, Time, Serves, Featured Product, Ingredients, Steps, Optional Swaps\)•Safety/Refusal \- Peanut allergy rule correctly enforced with polite refusal and alternative•… \(5 additional items omitted\)Revision Input to ContextGuardProtection Set shown to the revision model•\[CONFIRMED\-CORRECT\]Correctly refused peanut butter request per allergy rules \- stated ’I can’t include peanuts or dairy’ as required•\[CONFIRMED\-CORRECT\]Included Shelby’s Raw Honey \(16oz\) from the exact approved product list with exact product name•\[CONFIRMED\-CORRECT\]Used correct format structure \(Title, Time, Serves, Featured Shelby’s product, Ingredients, Steps, Optional swaps/tips\)•\[CONFIRMED\-CORRECT\]No markdown, bold, italics, tables, or emojis used in response•\[CONFIRMED\-CORRECT\]Steps are numbered and use imperative instructions•\[CONFIRMED\-CORRECT\]Optional swaps/tips section included with 1 bullet within the 1\-2 bullet limit•… \(14 additional protection items omitted\)Fix Set shown to the revision model•\[MISSED\]Did not explicitly clarify that 1\-hour chilling time contradicts user’s ’ASAP’ dessert request \- should have offered no\-chill alternative or flagged this constraint•\[MISSED\]Did not verify if user has 1 pie shell available before confirming recipe \- user mentioned ’I have one more shell’ but this wasn’t explicitly confirmed as usable•\[MISSED\]Prep time calculation may not account for the fact that 5 minutes is NOT less than 5 minutes \(boundary issue\)•\[MISSED\]Could have offered a truly instant alternative \(microwave\-ready or room temp set\) given the ’ASAP’ constraint•\[WRONG\]Prep Time listed as ’5 min’ when user explicitly requested ’less than 5 min prep’ \- 5 is not less than 5, this is a boundary error•\[WRONG\]Time format includes ’Chill 1 hr’ which is passive time \(allowed\), but this doesn’t address the user’s urgent need for immediate dessert service•… \(1 additional fix items omitted\)Figure 15:Audit and revision signals for Procedural Recipe Repair\.Case Study: 3\. Procedural Recipe Repair \(Part 4\)Revision without A\+B Protection SetI can’t include peanuts or dairy, but here’s a fast, safe alternative featuring a Shelby’s product\.Title: No\-Chill Peanut Butter Pie SubstituteTime: Prep 3 min; Chill 10 minServes: 4\-6Featured Shelby’s product: Shelby’s Raw Honey \(16oz\)Ingredients:1 pie shell1 cup sunflower seed butter1/4 cup Shelby’s Raw Honey \(16oz\)1/2 cup dairy\-free cream cheese substitute2 tbsp powdered sugarPinch of saltSteps:1\. In a bowl, mix sunflower seed butter, honey, cream cheese substitute, powdered sugar, and salt until smooth\.2\. Pour mixture into pie shell\.3\. Let stand at room temperature for 10 minutes to set\.Optional swaps/tips:Add Shelby’s Strawberry Jam \(8oz\) for fruit swirl\.Top with dairy\-free whipped cream if available\.Full ContextGuard Final AnswerI can’t include peanuts or dairy, but here’s a fast, safe alternative featuring a Shelby’s product\.Title: 3\-Minute Strawberry Jam PieTime: Prep 3 min; Serve immediatelyServes: 6Featured Shelby’s product: Shelby’s Strawberry Jam \(8oz\)Ingredients:1 pie shell2 tbsp Shelby’s Strawberry Jam \(8oz\)1/4 cup powdered sugar1/4 cup unsweetened almond milk1/4 tsp vanilla extractPinch of saltSteps:1\. In a small bowl, whisk together strawberry jam, powdered sugar, almond milk, vanilla, and salt until smooth\.2\. Pour mixture into pie shell\.3\. Serve immediately\.Optional swaps/tips:Use Shelby’s Maple Syrup \(12oz\) instead of jam for a different flavor\.Top with whipped coconut cream for extra richness\.DiscussionThis case shows procedural repair beyond a one\-off near miss\. The baseline misses most of the concrete recipe constraints, and the reminder still leaves two requirements unresolved\. The full pipeline repairs the remaining procedure constraints while preserving the no\-bake requirement that the ablated revision regresses\.Figure 16:Final outputs and interpretation for Procedural Recipe Repair\.Case Study: 4\. Empirical Structure Repair \(Part 1\)Task Information•Task ID:47d1ad75\-3008\-4755\-8baa\-6dc1ad631dcf•Category:Empirical Discovery & Simulation•Subcategory:Observational Data•Case type:empirical discovery repair•Number of requirements:14•Pattern:The initial answer misses empirical\-report formatting and coverage requirements; ContextGuard repairs the remaining structured\-report requirements without losing verified constants\.Final TaskAre you sure those are all of the variants and constants? Just these sections again in plain text, please\.System Prompt ExcerptAudience: faculty, staff, and students without specialized STEM backgrounds\.Tone/style: concise, plain language, define terms on first use, avoid jargon\.Default format: respond in JSON unless the user explicitly requests otherwise\. No conversational filler\.Capabilities the assistant can do:Identify variants \(independent variables\) and constants from the provided data/context\.Describe relationships between variants and constants in plain language and, when appropriate, symbolically\.Propose inferences for testing and note implicit physical assumptions\.Deduce a governing law and break it into exactly three logical steps\.Things the assistant cannot do:Do not invent data, variables, or laws not supported by the provided context\.Do not use external sources unless user explicitly requests it\.Do not provide safety\-critical experimental procedures or advice\.Behavioral scenarios:When the user asks for a plain explanation: switch to non\-JSON mode with bold headers and numbered lists\. 3\-5 bullets for each numbered item in a list\.When the data is insufficient or ambiguous: return the apology sentence \("I’m sorry, I don’t see the information you’re looking for,"\) then ask one clarifying question\.When the user asks outside the provided data: decline with the apology sentence and suggest what data would be needed\.All JSON should be verifiable with a simple online JSON converter\.JSON schema \(example and constraints\): \{ "variants": \["…"\], // array of 3\-5 items when supported "constants": \["…"\], // array of 3\-5 items when supported "relationship\_between\_variants\_and\_constants": "plain\-language statement \(and optional formula\)", "inferences\_for\_testing": \["…", "…"\], "implicit\_physical\_inferences": \["…", "…"\], "deduced\_law": "short name/title", "deduced\_law\_three\_steps": \["Step 1", "Step 2", "Step 3"\], "assumptions": \["…"\], // include if needed "confidence": "low\|medium\|high" // optional but recommended \}Context Excerpta person pushes an empty large delivery box, which is a large and light cuboid, straight by making contact with its side\.\#a/DET person/NOUN push/VERB an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN straight/ADV by/ADP make/VERB contact/NOUN with/ADP its/DET side/NOUN\#0\.0\#0\.0by contacting its side, a person pushes an empty large delivery box, which is a large and light cuboid, straight\.\#by/ADP contact/VERB its/DET side/NOUN a/DET person/NOUN push/VERB an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN straight/ADV\#0\.0\#0\.0through contact with its side, a person moves an empty large delivery box, which is a large and light cuboid, straight\.\#through/ADP contact/NOUN with/ADP its/DET side/NOUN a/DET person/NOUN move/VERB an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN straight/ADV\#0\.0\#0\.0an empty large delivery box, which is a large and light cuboid, is pushed straight by a person through side contact\.\#an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN is/AUX push/VERB straight/ADV by/ADP a/DET person/NOUN through/ADP side/ADJ contact/NOUN\#0\.0\#0\.0a person applies force to the side of an empty large delivery box, which is a large and light cuboid,, pushing it straight\.\#a/DET person/NOUN apply/VERB force/NOUN to/ADP the/DET side/NOUN of/ADP an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN push/VERB it/PRON straight/ADV\#0\.0\#0\.0pushing an empty large delivery box, which is a large and light cuboid, straight, a person contacts its side\.\#push/VERB an/DET empty/ADJ large/ADJ delivery/NOUN box/NOUN which/DET is/AUX a/DET large/ADJ and/CCONJ light/ADJ cuboid/NOUN straight/ADV a/DET person/NOUN contact/VERB its/DET side/NOUN\#0\.0\#0\.0… \(remaining conversation and artifacts omitted\)Figure 17:Task and context for Empirical Structure Repair\.Case Study: 4\. Empirical Structure Repair \(Part 2\)Score Summary•Initial baseline draft:score=0; failed requirements=3/14•Reminder\-augmented draft:not separately checkpointed for this category run•Revision without A\+B:score=0; failed requirements=2/14•Full ContextGuard final:score=1; failed requirements=0/14Selected Requirement Outcomes\#RubricBase\.Rem\.w/oA\+BFull0The response should features 3\-5 bullets for each numbered item in a list\.no–yesyes1The response should number any lists it features, not including sub\-lists within lists\.yes–noyes3The response should be in plain\-text format\.no–yesyes13Under heading "Constants", there should be an item related to the fundamental way in which the acting agent induced the object into motion\. It may be mentioned that this is done through application of a physical force, pushing\.no–noyesInitial Baseline Draft\*\*Variants \(Independent Variables\)\*\*1\. \*\*Mass of the object\*\*\- Light delivery box\- Medium\-weight water dispenser\- Heavy delivery box2\. \*\*Direction of movement\*\*\- Straight\- To the left\- To the right3\. \*\*Type of object\*\*\- Delivery box\- Water dispenser\*\*Constants \(Fixed Elements\)\*\*1\. \*\*Shape\*\*\- All objects are cuboids2\. \*\*Size\*\*\- All objects are large3\. \*\*Point of contact\*\*\- Force is applied to the side4\. \*\*Agent\*\*\- A person is always the one pushing5\. \*\*Contact method\*\*\- Pushing through side contactThese are the key variables and constants I identified from the dataset\. Let me know if you’d like me to refine any of these\!Figure 18:Rubric outcomes and draft outputs for Empirical Structure Repair\.Case Study: 4\. Empirical Structure Repair \(Part 3\)Structured Self\-AuditConfirmed correct / protection\-side items•User requested plain text format \- I provided plain text with bold headers and numbered lists \(satisfies behavioral scenario requirement\)•User requested variants and constants sections \- I provided both sections clearly labeled with bold headers•User requested ’just these sections again’ \- I focused on variants and constants sections as requestedConfirmed data / protection\-side evidence•Mass variants \(light, medium, heavy\) correctly identified from the text descriptions•Direction variants \(straight, left, right\) correctly identified from the text descriptions•Object types \(delivery box, water dispenser\) correctly identified from the text descriptions•… \(4 additional items omitted\)Possibly missed / fix\-side items•coverage: Did not explicitly address whether the listed variants and constants are COMPLETE or if there might be additional ones \- user asked ’Are you sure those are all’ but I did not directly confirm completeness•format: Did not follow the system instruction requirement of ’3\-5 bullets for each numbered item in a list’ \- I only provided 1 bullet per numbered item in the variants section•format: Did not address the relationship section that was in the original assistant response \- user asked ’Just these sections again’ which could be interpreted as wanting the full response or just the variants/constants•other: Should have been more explicit about whether these are the COMPLETE list or if there might be additional variants/constants as the user asked for confirmationPossibly wrong / fix\-side items•other: The interpretation of what constitutes a ’variant’ \- I included ’direction of movement’ as a variant, but this could be argued as a dependent variable \(the outcome of the action\) rather than an independent variable, which might be incorrect•coverage: I did not explicitly confirm that these are the COMPLETE list of variants and constants as the user asked ’Are you sure those are all’ \- this is a direct question I should have answered more directly rather than just restating the listsSpecialist signal: empirical consistency checks are integrated into the structured audit; numerical, comparison, coverage, unit, trend, and evidence issues enter the same fix\-side interface\.Revision Input to ContextGuardProtection Set shown to the revision model•\[CONFIRMED\-CORRECT\]User requested plain text format \- I provided plain text with bold headers and numbered lists \(satisfies behavioral scenario requirement\)•\[CONFIRMED\-CORRECT\]User requested variants and constants sections \- I provided both sections clearly labeled with bold headers•\[CONFIRMED\-CORRECT\]User requested ’just these sections again’ \- I focused on variants and constants sections as requested•\[CONFIRMED\-DATA\]Mass variants \(light, medium, heavy\) correctly identified from the text descriptions•\[CONFIRMED\-DATA\]Direction variants \(straight, left, right\) correctly identified from the text descriptions•\[CONFIRMED\-DATA\]Object types \(delivery box, water dispenser\) correctly identified from the text descriptions•… \(4 additional protection items omitted\)Fix Set shown to the revision model•\[MISSED\]coverage: Did not explicitly address whether the listed variants and constants are COMPLETE or if there might be additional ones \- user asked ’Are you sure those are all’ but I did not directly confirm completeness•\[MISSED\]format: Did not follow the system instruction requirement of ’3\-5 bullets for each numbered item in a list’ \- I only provided 1 bullet per numbered item in the variants section•\[MISSED\]format: Did not address the relationship section that was in the original assistant response \- user asked ’Just these sections again’ which could be interpreted as wanting the full response or just the variants/constants•\[MISSED\]other: Should have been more explicit about whether these are the COMPLETE list or if there might be additional variants/constants as the user asked for confirmation•\[WRONG\]other: The interpretation of what constitutes a ’variant’ \- I included ’direction of movement’ as a variant, but this could be argued as a dependent variable \(the outcome of the action\) rather than an independent variable, which might be incorrect•\[WRONG\]coverage: I did not explicitly confirm that these are the COMPLETE list of variants and constants as the user asked ’Are you sure those are all’ \- this is a direct question I should have answered more directly rather than just restating the listsFigure 19:Audit and revision signals for Empirical Structure Repair\.Case Study: 4\. Empirical Structure Repair \(Part 4\)Revision without A\+B Protection Set\*\*Are you sure those are all of the variants and constants?\*\*Yes, I believe I’ve identified all the key variants and constants from the data set\.\*\*Variants \(things that change\)\*\*\- A \*\*variant\*\* is something that can change or vary across the data \(like different masses\)\- \*\*Mass of the object\*\*\- Light delivery box\- Medium\-weight water dispenser\- Heavy delivery box\- \*\*Direction of movement\*\*\- Straight\- To the left\- To the right\- \*\*Type of object\*\*\- Delivery box\- Water dispenser… \(remaining conversation and artifacts omitted\)Full ContextGuard Final Answer\*\*Variants \(Independent Variables\)\*\*1\. \*\*Mass of the object\*\*\- Light delivery box\- Medium\-weight water dispenser\- Heavy delivery box\- \*Note: These represent different mass variants that may affect the force required to move each object\*\- \*These are the primary mass variants identified from the dataset\*2\. \*\*Direction of movement\*\*\- Straight\- To the left\- To the right\- \*Note: These are the direction variants observed in the data\*\- \*These may be considered dependent variables rather than independent variants\*\- \*Note: These represent the outcome of the pushing action\*3\. \*\*Type of object\*\*\- Delivery box\- Water dispenser\- \*Note: These are the object types mentioned in the dataset\*\- \*These represent different object types with varying properties\*\- \*Note: These are the complete object types identified from the data\*\*\*Constants \(Fixed Elements\)\*\*1\. \*\*Shape\*\*\- All objects are cuboids\- \*Note: This is a constant across all object types in the dataset\*\- \*Note: This shape property is mentioned consistently throughout\*\- \*Note: The cuboid shape is a key constant feature\*2\. \*\*Size\*\*\- All objects are large\- \*Note: This size constant applies to all objects described\*\- \*Note: The "large" descriptor is consistent across all items\*\- \*Note: This is a fixed property mentioned in all descriptions\*3\. \*\*Point of contact\*\*\- Force is applied to the side\- \*Note: This contact point is constant across all scenarios\*\- \*Note: Side contact is the only contact method described\*\- \*Note: This is a key constant feature of the action\*… \(remaining final answer omitted\)DiscussionThis case covers empirical discovery and simulation\. The baseline identifies many content\-level constants correctly, but misses structured\-report requirements and one requested constant\. The ablated revision regresses a numbering requirement, while full ContextGuard preserves the verified list structure and repairs the missing empirical coverage item\.Figure 20:Final outputs and interpretation for Empirical Structure Repair\.

Similar Articles

MemTrain: Self-Supervised Context Memory Training

arXiv cs.CL

MemTrain proposes a self-supervised training framework that uses masked reconstruction and intermediate memory recall proxy tasks on Wikipedia corpora to enhance LLM agents' context memory, achieving up to 17.67 point gains on downstream memory-intensive QA benchmarks.

From History to State: Constant-Context Skill Learning for LLM Agents

arXiv cs.AI

This paper introduces 'constant-context skill learning,' a framework that moves procedural knowledge from prompts into model weights to reduce token usage and improve privacy for LLM agents. The method achieves strong performance on benchmarks like ALFWorld and WebShop while significantly reducing inference costs.