Context Is Not Control, a source-boundary eval for LLMs

Reddit r/LocalLLaMA Papers

Summary

A paper introducing 'Context Is Not Control', an evaluation benchmark for assessing source-boundary failures in LLMs' use of controlled text-mediated evidence. Includes replication packages for open-weight and frontier API models.

I’ve released a short paper / eval write-up called Context Is Not Control. The core idea is simple, LLMs don’t only fail because they lack context, they also fail when they treat the wrong context as controlling evidence. A retrieved document, prior message, user framing, fake authority claim, stale policy, or injected instruction - all can enter the context window, but not everything in context should be allowed to govern the answer. That distinction is a source-boundary problem. The paper focuses on cases where a model sees multiple pieces of text but has to preserve the difference between: \* evidence \* user framing \* quoted material \* source text \* instruction-like contamination \* unsupported claims \* authoritative-looking but invalid context So the question, “did the model have enough context?”, instead becomes, “did the model correctly identify which context was admissible as evidence?” I think this is especially relevant to local/open model evaluation because it is a failure mode that can be tested across a spectrum of context formats. It’s also not dependent on frontier-model access. The paper is not claiming to solve hallucination. It makes a narrower argument that a lot of hallucination / compliance / misgrounding behavior can be reframed as a failure to preserve source boundaries under contextual pressure. Open to critique. I’m especially interested in where the framing breaks.
Original Article
View Cached Full Text

Cached at: 05/13/26, 08:19 PM

rjsabouhi/context-is-not-control

Source: https://github.com/rjsabouhi/context-is-not-control

Context Is Not Control

DOI License: CC BY 4.0

This repository contains the public working manuscript and cleaned replication artifacts for:

Context Is Not Control: Source-Boundary Failures in Controlled Text-Mediated Evidence Use

Contents

  • paper/ — working manuscript PDF
  • replication_packages/ — cleaned open-weight and frontier/API lite replication packages
  • release_materials/ — citation metadata, release notes, and Zenodo metadata

Replication packages

Two cleaned lite packages are included:

  1. context_is_not_control_open_weight_replication_package_v0_2_LITE_no_raw_or_heavy_outputs.zip
  2. context_is_not_control_frontier_api_replication_package_v0_2_LITE_no_raw_outputs.zip

Full archives with raw/heavy model-output files are preserved separately.

Status

Public working manuscript / preprint draft. Results and artifact organization may be updated in later versions.

Similar Articles

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv cs.CL

This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv cs.CL

This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

arXiv cs.CL

Introduces ContextGuard, a structured self-auditing framework that improves LLM context learning by decomposing model self-assessment into confirmed and uncertain categories and applying targeted revisions, achieving a task-solving rate increase from 9.64% to 13.85% on Qwen3.5-4B on the CL-Bench benchmark.