Context Is Not Control, a source-boundary eval for LLMs
Summary
This paper introduces 'Context Is Not Control', an evaluation benchmark for source-boundary failures in how LLMs use controlled, text-mediated evidence. The repository includes lite replication packages for open-weight and frontier API models.
Cached at: 05/13/26, 08:19 PM
rjsabouhi/context-is-not-control
Source: https://github.com/rjsabouhi/context-is-not-control
Context Is Not Control
This repository contains the public working manuscript and cleaned replication artifacts for:
Context Is Not Control: Source-Boundary Failures in Controlled Text-Mediated Evidence Use
Contents
paper/: working manuscript PDF
replication_packages/: cleaned open-weight and frontier/API lite replication packages
release_materials/: citation metadata, release notes, and Zenodo metadata
Replication packages
Two cleaned lite packages are included:
context_is_not_control_open_weight_replication_package_v0_2_LITE_no_raw_or_heavy_outputs.zip
context_is_not_control_frontier_api_replication_package_v0_2_LITE_no_raw_outputs.zip
Full archives with raw/heavy model-output files are preserved separately.
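To see what a lite package contains before unpacking it, the archive can be inspected with Python's standard zipfile module. This is a minimal sketch, not part of the repository: the helper names list_package_contents and extract_package are hypothetical, and the internal layout of the archives is not documented here, so the code makes no assumptions about it.

```python
import zipfile
from pathlib import Path


def list_package_contents(zip_path):
    """Return the sorted member names of a replication-package archive.

    Hypothetical helper for illustration; not provided by the repository.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def extract_package(zip_path, dest_dir):
    """Extract the archive into dest_dir and return the destination path.

    Hypothetical helper for illustration; creates dest_dir if needed.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest
```

Assuming the archives sit directly under replication_packages/ in a local clone (an assumption about the layout), a call such as list_package_contents("replication_packages/context_is_not_control_frontier_api_replication_package_v0_2_LITE_no_raw_outputs.zip") would list the files in the frontier/API package without extracting anything.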
Status
Public working manuscript / preprint draft. Results and artifact organization may be updated in later versions.
Similar Articles
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
Proposes Complementary Self-Distillation (SelfCI) to improve contextual integrity in LLMs by balancing utility and privacy. Evaluated on CI-RL and PrivacyLens benchmarks across multiple models.
ContextGuard: Structured Self-Auditing for Context Learning in Language Models
Introduces ContextGuard, a structured self-auditing framework that improves LLM context learning by decomposing model self-assessment into confirmed and uncertain categories and applying targeted revisions, achieving a task-solving rate increase from 9.64% to 13.85% on Qwen3.5-4B on the CL-Bench benchmark.
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
This paper proposes a three-regime framework to resolve empirical contradictions in how LLMs handle conflict between training knowledge and new documents, validated across five major models. It distinguishes between parametric strength and uniqueness and demonstrates how task framing and evidence coherence significantly impact model behavior.