Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

arXiv cs.CL 06/16/26, 04:00 AM Papers
Summary
This paper proposes Telegraph English, a readable symbolic format for context compression that outperforms matched-budget baselines on multi-hop QA datasets, preserving entity content more densely.
arXiv:2606.14875v1 Announce Type: new Abstract: We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, 2Wiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random subsampling) on every dataset, with gains of 13 to 20 F1 points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.
# Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget
Source: [https://arxiv.org/html/2606.14875](https://arxiv.org/html/2606.14875)
- Sisong Bei, Independent Researcher, qurining@gmail.com
- Mikhail L. Arbuzov, Independent Researcher, mike.arbuzov54@gmail.com
- Ziwei Dong, Independent Researcher, ziwei.dong@alumni.emory.edu
- Dmitri Kalaev, Independent Researcher, kalaevdr@gmail.com
- Alexey Shvets, Palo Alto Networks, ashvets@paloaltonetworks.com

###### Abstract

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, 2Wiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random subsampling) on every dataset, with gains of 13 to 20 F1 points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

## 1 Introduction

Small language models face a tension on multi-hop question answering: retrieved natural-language (NL) context is expensive in tokens and error-prone in long passages, while short retrieval discards the bridge entities a reader needs for multi-step reasoning. Prior context-compression work attacks this tension through token-level scoring (Jiang et al., [2023](https://arxiv.org/html/2606.14875#bib.bib4); Pan et al., [2024](https://arxiv.org/html/2606.14875#bib.bib5)), hidden-state summarisation (Mu et al., [2023](https://arxiv.org/html/2606.14875#bib.bib7); Chevalier et al., [2023](https://arxiv.org/html/2606.14875#bib.bib8)), or task-aware abstractive summarisation (Xu et al., [2024](https://arxiv.org/html/2606.14875#bib.bib10)), but each commits to selective retention or latent-space summary rather than re-expression in surface text.

We study a different move. A small encoder rewrites the retrieved NL passage into a readable, rule-governed symbolic format we call *Telegraph English* (TE), and a small consumer reads TE in place of NL. TE preserves entities verbatim and replaces connective NL tissue with pipe-separated symbolic operators. The encoder prompt is fixed and the encoder is frozen; the consumer needs no fine-tuning.

This work builds on Telegraph English, a readable symbolic compression format introduced in Arbuzov et al. ([2026a](https://arxiv.org/html/2606.14875#bib.bib2)), and is motivated by the error-accumulation analysis in Arbuzov et al. ([2025](https://arxiv.org/html/2606.14875#bib.bib1)) showing that LLM errors concentrate at a sparse subset of key tokens. The theoretical framework for patch-local reliability engineering (Arbuzov et al., [2026b](https://arxiv.org/html/2606.14875#bib.bib3)) provides the broader context.

We test TE on three multi-hop benchmarks (MuSiQue, 2Wiki, HotpotQA) against three matched-budget controls: character-level density matching, end-truncation, and random subsampling. TE wins on every dataset and every control, with paired-bootstrap 95% confidence intervals strictly positive and gains of $+13.6$ to $+20.2$ F1 points (Fig. [1](https://arxiv.org/html/2606.14875#S5.F1)). TE also beats a coherent prose summary produced by the same encoder at the same token budget on the hardest dataset (MuSiQue, $+11.94$ pp). The matched-budget controls rule out character-density manipulation, NL-tail dispensability, and random-token sufficiency as alternative explanations for TE’s advantage.

We pre-registered a stronger depth-interaction hypothesis: that TE’s advantage over NL would grow with the reasoning depth of the question. This hypothesis is null. All four within-dataset interaction slopes are direction-consistent with the prediction but none is statistically significant (FDR-corrected $p>0.41$, $I^{2}=0\%$). A minimum-detectable-effect-size analysis bounds the design to ruling out widenings of roughly 4–5 F1 points across the hop-2-to-hop-4 range; weaker effects cannot be distinguished from zero at our sample size.

#### Contributions\.

- A matched-budget compression mechanism for multi-hop QA: at the same token budget, TE beats three trivial-compression controls and a coherent-prose summariser, supporting the reading that TE compresses where natural language carries redundancy.
- A pre-registered null on depth-dependent widening, paired with a minimum-detectable-effect-size analysis that bounds the scope of the matched-budget gains to constant offsets rather than depth-scaling advantage.

## 2 Related Work

Context compression for language models has emerged as a response to the growing cost of long retrieval\-augmented contexts\. Our contribution differs from prior work in mechanism: TE is a surface\-text re\-expression produced by a frozen encoder with a fixed prompt, read by a frozen consumer\. We organise prior work into four families and contrast TE with each\.

#### Token\-level scoring\.

LLMLingua (Jiang et al., [2023](https://arxiv.org/html/2606.14875#bib.bib4)) and LLMLingua-2 (Pan et al., [2024](https://arxiv.org/html/2606.14875#bib.bib5)) score individual tokens for retention or deletion via a small model trained on information-preservation proxies. LongLLMLingua (Jiang et al., [2024](https://arxiv.org/html/2606.14875#bib.bib6)) extends scoring to document-level saliency. Our primary baseline from this family is LLMLingua-2 at rate-50. The mechanism is selective retention: the output is a subsequence of the input. TE’s mechanism is re-expression: the encoder rewrites content into an entity-preserving format in which bridge entities are kept verbatim and connective tissue is replaced by symbolic operators. Token-level scoring cannot produce that reformatting at matched budget.

#### Hidden\-state compression\.

GIST (Mu et al., [2023](https://arxiv.org/html/2606.14875#bib.bib7)) compresses context into soft tokens at the hidden-state layer. AutoCompressor (Chevalier et al., [2023](https://arxiv.org/html/2606.14875#bib.bib8)) compresses long contexts into summary vectors. CEPE (Yen et al., [2024](https://arxiv.org/html/2606.14875#bib.bib9)) extends cross-attention to a compressed summary of retrieved passages. All three operate in the consumer’s latent space and require consumer-side training. TE operates in surface text and needs no consumer-side training, a practical distinction for small-model deployment where consumer-side retraining is costly and auditability matters.

#### Task\-aware abstractive summarisation\.

RECOMP (Xu et al., [2024](https://arxiv.org/html/2606.14875#bib.bib10)) compresses retrieved passages via a learned abstractive summariser fine-tuned on the downstream QA task. CompAct (Yoon et al., [2024](https://arxiv.org/html/2606.14875#bib.bib11)) uses a task-conditioned encoder tuned on QA supervision. In both, the encoder’s output is natural language and the encoder is fine-tuned on the task. TE’s encoder is task-agnostic, runs a frozen prompt, and produces symbolic-structural output. The contrast is again mechanism rather than ratio.

#### Symbolic intermediates for reasoning\.

Prior work has used symbolic intermediates in the forward direction, where the consumer produces symbolic output: chain-of-thought (Wei et al., [2022](https://arxiv.org/html/2606.14875#bib.bib19)), program-aided language models (Gao et al., [2023](https://arxiv.org/html/2606.14875#bib.bib20)), and structured scratchpads (Nye et al., [2021](https://arxiv.org/html/2606.14875#bib.bib21)). We use a symbolic intermediate in the input direction: the consumer reads symbolic context in lieu of NL prose. The Telegraph English format itself was introduced by Arbuzov et al. ([2026a](https://arxiv.org/html/2606.14875#bib.bib2)) as a structured symbolic rewriting target for prompt compression; the present paper evaluates it as a tokenizer-aware, encoder-produced context representation against matched-budget controls on multi-hop QA.

#### Positioning\.

TE is re-expression, not retention or latent compression. The matched-budget controls in this paper isolate the re-expression mechanism from three trivial alternatives, and the coherent-prose comparator further isolates it from generic abstractive summarisation at the same budget.

## 3 Method

### 3.1 Telegraph English

Telegraph English (TE) is a context representation produced by an encoder language model (Claude Sonnet 4.6 via AWS Bedrock batch inference) with a fixed, task-agnostic prompt. The encoder rewrites the retrieved NL passage into a sequence of pipe-separated symbolic clauses in which entities are preserved verbatim and connective NL tissue is replaced by short @-prefixed operators. The prompt asks for output that is compatible with the tokenizer of the consumer model (Qwen-3.5-9B), so that the token budget measured at the consumer matches what the encoder writes. A representative pre/post pair from MuSiQue:

> NL: “Barack Obama was born in Honolulu, Hawaii. Honolulu is the capital of the state of Hawaii. Hawaii is a state in the United States.”
>
> TE: “Barack Obama @born Honolulu | Honolulu @capital_of Hawaii | Hawaii @state_in United States.”

The full encoder prompt and additional examples are in Appendix [C](https://arxiv.org/html/2606.14875#A3).
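
To make the clause structure of the example above concrete, the following is a minimal sketch of how a TE string could be split back into triples. It assumes every clause has the shape `<subject> @<operator> <object>`, which matches the single example shown here but may not cover the full format defined in the encoder prompt. The names `Triple` and `parse_te` are hypothetical and are not part of the paper.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    operator: str
    obj: str


def parse_te(te_text: str) -> list[Triple]:
    """Split a TE string into (subject, operator, object) triples.

    Assumption: each pipe-separated clause looks like
    "<subject> @<operator> <object>". Clauses without an @-operator
    raise ValueError rather than being guessed at.
    """
    triples = []
    # Drop the trailing period, then split on the clause separator.
    for clause in te_text.strip().rstrip(".").split("|"):
        clause = clause.strip()
        if not clause:
            continue
        subject, sep, rest = clause.partition(" @")
        if not sep:
            raise ValueError(f"clause has no @-operator: {clause!r}")
        operator, _, obj = rest.partition(" ")
        triples.append(Triple(subject.strip(), operator, obj.strip()))
    return triples


example = (
    "Barack Obama @born Honolulu | Honolulu @capital_of Hawaii | "
    "Hawaii @state_in United States."
)
triples = parse_te(example)
for t in triples:
    print(t)
```

In this sketch the entities survive verbatim as `subject` and `obj`, and each operator is a single token-like string, which is the property the section describes.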

### 3.2 Primary hypothesis: depth interaction

We pre-registered (Appendix [H](https://arxiv.org/html/2606.14875#A8)) a primary hypothesis that TE’s advantage over NL grows with the reasoning depth of the question. Operationally, we model question-level correctness as a function of representation (NL, TE, or LLMLingua-2), centered hop count, and their interaction, fit per dataset as a binomial generalised linear model with cluster-robust standard errors by question. The primary test is the sign and significance of the representation $\times$ hop\_count interaction for TE: a negative slope means TE’s edge over NL grows with hop count. We pool per-dataset slopes across MuSiQue and 2Wiki by random-effects meta-analysis and apply false-discovery-rate correction across the primary interaction family. HotpotQA is excluded from the regression because all its questions are 2-hop. Full estimating equations, the heterogeneity branch rule, and the random-effects specification are in Appendix [A](https://arxiv.org/html/2606.14875#A1).

### 3.3 Auxiliary mechanism: matched-budget controls

We separately pre-registered an auxiliary mechanism observation (Appendix [H](https://arxiv.org/html/2606.14875#A8)): TE performs semantic-preserving compression only where natural language carries redundancy to strip. The test is whether TE outperforms three trivial-compression controls at matched per-row token budget. Each control is computed against TE’s per-row qwen-token count and rules out a specific alternative explanation:

- Character-density. The NL passage is rescaled at the character level so that its qwen-token footprint matches TE’s. Rules out the hypothesis that TE’s gain comes from per-row character-density manipulation.
- End-truncation. The NL passage is truncated from the end at the qwen-token boundary so it has TE’s per-row token count. Rules out the hypothesis that the NL tail is dispensable.
- Random subsampling. A fixed-seed uniform subsample of qwen-token positions is drawn from the NL passage, sized to TE’s budget. Rules out the hypothesis that any random subset of NL tokens would suffice.

The direction of the controls is pre-registered (TE should beat all three); the descriptive “9 of 9 confidence intervals strictly positive” summary is post-hoc. Full per-row specifications are in Appendix [D](https://arxiv.org/html/2606.14875#A4).
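
Two of the three controls above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the per-row specification from Appendix D: whitespace-separated words stand in for qwen tokens, and the function names `end_truncate` and `random_subsample` are hypothetical. The character-density control is omitted because its rescaling procedure is not described in this section.

```python
import random


def end_truncate(tokens: list[str], budget: int) -> list[str]:
    """Keep the first `budget` tokens and drop the tail."""
    return tokens[:budget]


def random_subsample(tokens: list[str], budget: int, seed: int = 0) -> list[str]:
    """Keep a fixed-seed uniform sample of token positions, in original order."""
    if budget >= len(tokens):
        return list(tokens)
    rng = random.Random(seed)
    kept_positions = sorted(rng.sample(range(len(tokens)), budget))
    return [tokens[i] for i in kept_positions]


# Assumption: whitespace tokens approximate the consumer tokenizer.
nl_tokens = (
    "Barack Obama was born in Honolulu , Hawaii . "
    "Honolulu is the capital of the state of Hawaii ."
).split()
te_tokens = "Barack Obama @born Honolulu | Honolulu @capital_of Hawaii".split()

# Each control is sized to the token count of the TE row.
budget = len(te_tokens)
truncated = end_truncate(nl_tokens, budget)
subsampled = random_subsample(nl_tokens, budget, seed=0)

print(budget, truncated)
print(budget, subsampled)
```

Both controls return a subsequence of the NL tokens, so neither can rewrite content; that is the contrast with TE that the controls are designed to expose.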

### 3.4 Coherent-prose comparator

A natural follow\-up question is whether TE’s advantage holds against a coherent\-prose summary at the same budget rather than against trivial controls\. We run the same encoder under a free\-prose summary prompt and post\-truncate each summary to TE’s per\-row token budget\. This matched\-budget contrast isolates representation format from compression ratio: encoder, consumer, and budget are held fixed; only the surface form of the compressed passage changes\.

## 4 Experimental Setup

### 4.1 Datasets and consumer model

We evaluate on three multi-hop QA benchmarks with distinct depth profiles: MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.14875#bib.bib12)) ($n=2{,}417$ questions; hops $\in\{2,3,4\}$), 2Wiki (Ho et al., [2020](https://arxiv.org/html/2606.14875#bib.bib13)) ($n=1{,}500$; balanced 500 per hop level), and HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.14875#bib.bib14)) ($n=1{,}000$; all 2-hop). The consumer is Qwen-3.5-9B (base model; HF eager bf16, greedy decoding) throughout. The encoder for TE is Claude Sonnet 4.6 via AWS Bedrock batch inference. Retrieved passages are the gold-plus-distractor contexts released with each benchmark; NL, TE, and LLMLingua-2 all read the same passage set per question.

### 4.2 Prompt and answer extraction

All representations share a single neutral consumer prompt (Appendix [B](https://arxiv.org/html/2606.14875#A2)). The answer is extracted from the consumer’s final-line output by regex and evaluated against the gold answer with token-level F1, with binary correctness at $\mathrm{F1}\geq 0.5$.
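
As a rough illustration of the scoring step, the sketch below computes a token-level F1 over normalised multisets and thresholds it at 0.5. The normalisation (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here; the paper's exact normaliser and its answer-extraction regex are not given in this section. The names `normalize`, `token_f1`, and `is_correct` are hypothetical.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over multisets."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_correct(prediction: str, gold: str, threshold: float = 0.5) -> bool:
    """Binary correctness at F1 >= threshold, as described in the text."""
    return token_f1(prediction, gold) >= threshold


short_f1 = token_f1("Honolulu, Hawaii", "Honolulu")
verbose_f1 = token_f1(
    "The answer is probably Honolulu in the state of Hawaii", "Honolulu"
)
print(short_f1, is_correct("Honolulu, Hawaii", "Honolulu"))
print(verbose_f1, is_correct(
    "The answer is probably Honolulu in the state of Hawaii", "Honolulu"
))
```

The second call shows why verbosity matters for this metric: a longer response that still contains the gold answer has full recall but low precision, so it falls below the 0.5 threshold.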

### 4.3 Baselines

- NL. Full retrieved passages, unmodified. The standard no-compression baseline.
- LLMLingua-2 at rate-50 (Pan et al., [2024](https://arxiv.org/html/2606.14875#bib.bib5)). The primary learned-token-scoring baseline.
- Three matched-budget controls (character-density, end-truncation, random subsampling), each sized to TE’s per-row qwen-token count. Defined in §[3.3](https://arxiv.org/html/2606.14875#S3.SS3) and Appendix [D](https://arxiv.org/html/2606.14875#A4).
- Coherent-prose summary produced by the same encoder and post-truncated to TE’s per-row token budget (§[3.4](https://arxiv.org/html/2606.14875#S3.SS4)).

### 4.4 Statistical protocol

We use paired-bootstrap 95% confidence intervals over questions ($n_{\text{boot}}=10{,}000$, seed 0) for landmark and matched-budget contrasts. The pre-registered depth-interaction tests use a binomial GLM with cluster-robust standard errors by question, pooled across datasets by random-effects meta-analysis, with false-discovery-rate correction across the primary interaction family (Benjamini and Hochberg, [1995](https://arxiv.org/html/2606.14875#bib.bib15); DerSimonian and Laird, [1986](https://arxiv.org/html/2606.14875#bib.bib16)). Full statistical specifications, the pre-committed heterogeneity rule, and the sensitivity check against a random-intercept fit are in Appendix [A](https://arxiv.org/html/2606.14875#A1).
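
The paired bootstrap over questions can be sketched as follows. This is a generic percentile-interval implementation under assumptions, not the authors' code: the interval type is assumed to be percentile-based, the function name `paired_bootstrap_ci` is hypothetical, and the scores are synthetic values generated only to exercise the function.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, seed=0, alpha=0.05):
    """Percentile CI for mean(scores_a - scores_b), resampling questions.

    Pairing is preserved by resampling per-question differences, so each
    bootstrap replicate draws the same questions for both systems.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)  # sample questions with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic per-question F1 scores (NOT results from the paper).
toy_rng = random.Random(1)
te_scores = [toy_rng.random() for _ in range(200)]
control_scores = [max(0.0, s - 0.15) for s in te_scores]

point, lo, hi = paired_bootstrap_ci(te_scores, control_scores, n_boot=2_000)
print(f"mean diff = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A contrast is reported as "strictly positive" in the sense used here when the lower bound `lo` is above zero.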

### 4.5 Pre-registered protocol

We filed a pre-registration prior to data collection. The primary depth-interaction hypothesis, the matched-budget mechanism observation, the FDR correction family, the random-effects pooling specification, the heterogeneity branch rule, and the landmark kill-gate tests are all pre-registered; the protocol and amendment chain are reproduced in Appendix [H](https://arxiv.org/html/2606.14875#A8).

### 4.6 Cross-architecture replication

To check that the matched\-budget mechanism is not specific to one consumer family, we re\-run the three matched\-budget controls on a second consumer, Mistral\-7B\-Instruct\-v0\.3, on MuSiQue\. Long\-context NL rows do not fit at 24 GB on this consumer, so the cross\-architecture claim rests on the matched\-budget controls \(where TE and the controls have similar lengths\) rather than on the full\-NL landmark; details in §[6\.1](https://arxiv.org/html/2606.14875#S6.SS1)\.

## 5 Results

### 5.1 Matched-budget mechanism: TE beats every trivial control

Table [1](https://arxiv.org/html/2606.14875#S5.T1) reports paired-bootstrap 95% confidence intervals for TE versus the three matched-budget controls on each of MuSiQue, 2Wiki, and HotpotQA. All nine intervals exclude zero from above, with point estimates ranging from $+13.6$ to $+20.2$ percentage points (Fig. [1](https://arxiv.org/html/2606.14875#S5.F1)).

Table 1: Matched-budget mechanism. At matched per-row qwen-token budget, TE outperforms all three trivial controls on all three datasets. Paired bootstrap over questions, $n_{\text{boot}}=10{,}000$, seed 0.

![Figure 1](https://arxiv.org/html/2606.14875v1/x1.png)

Figure 1: Matched-budget mechanism. TE beats character-density, end-truncation, and random-subsampling controls on every dataset at matched per-row token budget. Error bars: 95% paired-bootstrap CI.

The three controls jointly rule out three specific alternative explanations for TE’s matched-budget advantage. Character-density rules out per-row character manipulation. End-truncation rules out the hypothesis that the NL tail is dispensable. Random subsampling rules out the hypothesis that any size-matched random subset of NL tokens would suffice. We read the result as supporting the pre-registered mechanism: TE compresses where natural language carries redundancy, and trivial alternatives that strip surface tokens but do not re-express content lose the bridge entities a multi-hop reader needs.

### 5.2 Landmark comparisons at full budget

Table [2](https://arxiv.org/html/2606.14875#S5.T2) reports paired-bootstrap 95% CIs for TE against full-budget NL and against LLMLingua-2 at rate-50. On MuSiQue, the deepest-hop dataset, TE beats NL by $+4.75$ pp and LLMLingua-2 by $+7.67$ pp (both $p<0.001$). On 2Wiki the TE–NL gap is $+1.53$ pp with a CI that narrowly spans zero. On HotpotQA the sign flips: TE loses to NL by $-2.34$ pp ($p<0.01$).

Table 2: Landmark paired comparisons. Full-NL and LLMLingua-2 rate-50 budget regimes. Paired bootstrap over questions, $n_{\text{boot}}=10{,}000$, seed 0. Per-dataset $n$: MuSiQue 2,417; 2Wiki 1,500; HotpotQA 1,000.

The between-dataset ordering (positive and significant on MuSiQue, positive but null on 2Wiki, negative and significant on HotpotQA) is the empirical pattern that motivates the post-hoc moderator analysis in Appendix [I](https://arxiv.org/html/2606.14875#A9); we report it there because at $n=3$ datasets we cannot rule out unobserved dataset-construction factors.

### 5.3 Coherent-prose comparator at matched budget

A natural concern with the matched-budget mechanism is whether TE’s advantage persists against a coherent-prose summary at the same budget, rather than against trivial controls that corrupt surface structure. We run Claude Sonnet 4.6 as a coherent-summary encoder on the same three datasets and post-truncate each summary to TE’s per-row qwen-token budget. Table [3](https://arxiv.org/html/2606.14875#S5.T3) reports the comparison.

Table 3: Coherent-prose comparator at matched per-row qwen-token budget. Paired bootstrap over questions, $n_{\text{boot}}=10{,}000$, seed 0. Per-dataset $n$: MuSiQue 2,417; 2Wiki 1,500; HotpotQA 1,000.

On MuSiQue, the deepest-hop dataset, TE outperforms the matched-budget coherent prose by $+11.94$ pp with the CI strictly positive. On 2Wiki and HotpotQA the difference is null. Coherent prose itself loses to full NL on MuSiQue and HotpotQA, suggesting that matched-budget truncation of coherent summaries is a costly operation when the budget is tight: at MuSiQue’s budget, the truncated summary retains roughly half of TE’s named entities, while the untruncated summary at $1.59\times$ the budget retains $78\%$ (Appendix [J](https://arxiv.org/html/2606.14875#A10)).

### 5.4 Depth interaction is null

We fit the pre-registered depth-interaction model per dataset, pool MuSiQue and 2Wiki by random-effects meta-analysis, and apply false-discovery-rate correction across the family. Table [4](https://arxiv.org/html/2606.14875#S5.T4) reports the four within-dataset slopes plus their meta-analytic pool. None is statistically significant: FDR-adjusted $p$-values range from $0.41$ to $0.92$ and the pooled meta slopes are indistinguishable from zero ($I^{2}=0\%$ on both interaction terms, so the pre-committed heterogeneity rule did not fire). All four point estimates are negative, which is direction-consistent with the pre-registered prediction, but at these $p$-values the direction-consistency is indistinguishable from noise and should not be read as weak supporting evidence. Section [6.2](https://arxiv.org/html/2606.14875#S6.SS2) quantifies this with a minimum-detectable-effect-size analysis.
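
For readers unfamiliar with the pooling step, the sketch below implements the standard DerSimonian and Laird random-effects estimator together with the $I^{2}$ heterogeneity statistic. The exact specification used in the paper is in its Appendix A and may differ in detail. The function name `dersimonian_laird` is hypothetical, and the slopes and standard errors passed in are made-up illustration values, not the paper's estimates.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Pool per-dataset estimates with DerSimonian-Laird random effects.

    Returns (pooled, pooled_se, tau2, i2_percent).
    """
    k = len(estimates)
    w = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviation from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1.0 / (se**2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    pooled_se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, tau2, i2


# Illustrative inputs only: two per-dataset interaction slopes (log-odds per hop).
pooled, pooled_se, tau2, i2 = dersimonian_laird([-0.10, -0.04], [0.08, 0.06])
print(f"pooled = {pooled:.4f}, SE = {pooled_se:.4f}, tau^2 = {tau2}, I^2 = {i2}%")

# A deliberately heterogeneous pair, to show I^2 moving away from zero.
_, _, tau2_het, i2_het = dersimonian_laird([0.0, 0.5], [0.1, 0.1])
print(f"tau^2 = {tau2_het:.3f}, I^2 = {i2_het:.1f}%")
```

When Cochran's $Q$ is no larger than its degrees of freedom, both $\tau^{2}$ and $I^{2}$ are truncated to zero and the random-effects pool coincides with the fixed-effect pool, which is the situation an $I^{2}=0\%$ result describes.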

Table 4: Depth-interaction regression. All four within-dataset slopes are direction-consistent with the pre-registered negative prediction but non-significant after FDR correction; pooled estimates are indistinguishable from zero, $I^{2}=0\%$.

Figure [2](https://arxiv.org/html/2606.14875#S5.F2) visualises the null: the TE–NL gap is flat or non-monotone across hop levels within both MuSiQue and 2Wiki, not rising with depth as the pre-registered mechanism would predict.

![Figure 2](https://arxiv.org/html/2606.14875v1/x2.png)

Figure 2: Depth-interaction null. Per-hop F1 for NL, TE, and LLMLingua-2 within MuSiQue and 2Wiki, with 95% bootstrap CI shading. The TE–NL gap is flat across hop counts, not growing with depth. HotpotQA is excluded because all questions are 2-hop.

### 5.5 Methodological note

Our pilot swept three consumer-prompt variants. Under one variant (role-prompted), TE’s MuSiQue F1 varied by tens of points across seeds relative to the default prompt because verbose responses interact with the normalised-multiset F1 scorer in a way that is orthogonal to context compression. Main-body numbers use the neutral default prompt throughout; the full prompt-by-metric table is in Appendix [F](https://arxiv.org/html/2606.14875#A6).

## 6 Discussion

### 6.1 What the matched-budget result shows

The matched-budget mechanism result is the central finding of this paper. At the same per-row qwen-token budget, TE outperforms character-density, end-truncation, and random-subsampling controls by $+13.6$ to $+20.2$ percentage points across three datasets. The three controls cover three mechanistically distinct trivial-compression strategies (compress by character deletion, drop the NL tail, drop random NL tokens), and none preserves the bridge entities a multi-hop reader needs. TE preserves bridge entities verbatim and replaces connective NL tissue with symbolic operators, a re-expression of the same semantic content at the same budget. We read this as evidence that TE compresses where natural language carries redundancy: the matched-budget gain is the benefit of that operation rather than an artefact of the compression strategies the controls implement.

The coherent-prose comparator sharpens the reading. Coherent prose at matched budget retains roughly half of TE’s named entities on MuSiQue, while a longer untruncated coherent summary at $1.59\times$ the budget retains $78\%$. The lost entities are bridge entities the matched-budget truncation drops, and that loss is a property of coherent prose at tight budgets, not an artefact of the encoder. TE holds entity content more densely than either NL or coherent prose because pipe-separated triples reserve every token for entity-bearing content that articles, auxiliaries, and connectives would otherwise consume. We name this the density argument: at matched budget, the relevant axis is entities-per-token, and TE’s representation is the denser one.
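
The density argument can be illustrated on the Obama example from §3.1. The sketch below is a toy calculation under assumptions: entities are matched by case-insensitive substring search against a hand-written entity list, and whitespace words stand in for qwen tokens. The paper's entity-coverage procedure (Appendix J) may differ, and the function names `entity_coverage` and `entities_per_token` are hypothetical.

```python
def entity_coverage(text: str, entities: list[str]) -> float:
    """Fraction of reference entities that appear verbatim in `text`."""
    lowered = text.lower()
    found = [e for e in entities if e.lower() in lowered]
    return len(found) / len(entities)


def entities_per_token(text: str, entities: list[str]) -> float:
    """Distinct reference entities found, divided by whitespace-token count."""
    n_tokens = len(text.split())
    return entity_coverage(text, entities) * len(entities) / n_tokens


entities = ["Barack Obama", "Honolulu", "Hawaii", "United States"]
nl = (
    "Barack Obama was born in Honolulu, Hawaii. Honolulu is the capital of "
    "the state of Hawaii. Hawaii is a state in the United States."
)
te = (
    "Barack Obama @born Honolulu | Honolulu @capital_of Hawaii | "
    "Hawaii @state_in United States."
)

# Truncate the NL passage to TE's (whitespace) token budget.
budget = len(te.split())
nl_truncated = " ".join(nl.split()[:budget])

print("NL        ", entity_coverage(nl, entities), entities_per_token(nl, entities))
print("TE        ", entity_coverage(te, entities), entities_per_token(te, entities))
print("NL @budget", entity_coverage(nl_truncated, entities))
```

On this single passage, TE keeps all four entities in 13 whitespace tokens, while NL truncated to the same budget loses the final entity; the example is only meant to show what the entities-per-token axis measures, not to reproduce the reported coverage figures.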

The matched-budget mechanism replicates on a second consumer family. Re-running the three controls on Mistral-7B-Instruct-v0.3 on MuSiQue yields paired-bootstrap 95% CIs strictly positive on all three controls ($+8.62$ / $+11.08$ / $+10.62$ pp), direction-consistent with Qwen-3.5-9B at the smaller magnitude expected for a 7B consumer. The full-NL landmark is not a valid baseline on Mistral because long-context NL rows do not fit at 24 GB and the surviving subset is biased toward shorter (easier) questions; the cross-architecture claim therefore rests on the matched-budget controls only.

### 6.2 Why the null is informative

The depth-interaction hypothesis predicted that TE’s advantage over NL would grow with hop count within multi-hop datasets. All four within-dataset slopes came out direction-consistent but non-significant, and the pooled estimates are indistinguishable from zero. Two readings are available. The depth-dependent effect may exist but our sample is too small to detect it. Or the depth-dependent mechanism may simply be absent: TE’s advantage may be a constant offset over compressed-NL equivalents rather than a depth-dependent widening over full NL.

A minimum-detectable-effect-size analysis bounds the interpretation. The pooled meta slope is $-0.018$ log-odds per hop with cluster-robust standard error $0.034$; at $80\%$ power and $\alpha=0.05$ two-sided, the minimum detectable slope is roughly $0.095$ log-odds per hop. Translated to F1 at our consumer’s MuSiQue baseline, that corresponds to a TE–NL widening of about 4–5 percentage points across the hop-2-to-hop-4 range. The design therefore rules out widenings of that magnitude with $\approx 80\%$ power but cannot distinguish smaller effects from no effect. The null is informative against a strong-form depth-dependent advantage and uninformative about a weak-form one. What our data do support is that the TE–NL gap does not grow monotonically with hop count within MuSiQue or 2Wiki at any appreciable magnitude (Fig. [2](https://arxiv.org/html/2606.14875#S5.F2)).
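
The minimum detectable slope quoted above is consistent with the usual normal-approximation formula, $\text{MDE} = (z_{1-\alpha/2} + z_{\text{power}})\cdot \text{SE}$. The sketch below applies that formula to the reported standard error; it is an assumption that this is the formula the authors used, and the function name `minimum_detectable_effect` is hypothetical. The translation from log-odds to F1 points depends on the consumer's baseline accuracy, which is not given in this section, so it is not attempted here.

```python
from statistics import NormalDist


def minimum_detectable_effect(std_error: float, alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """Smallest true effect detectable at the given two-sided alpha and power,
    under a normal approximation to the estimator's sampling distribution."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * std_error


# Standard error of the pooled slope, as reported in the text.
mde = minimum_detectable_effect(0.034)
print(f"MDE = {mde:.4f} log-odds per hop")
```

With the reported standard error of 0.034, this gives approximately 0.095 log-odds per hop, matching the figure in the text.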

### 6.3 Implications and encoder provenance

Our experiments use Claude Sonnet 4.6 as the TE encoder, deliberately chosen as a strong frontier model so that the consumer-reading question is not confounded with translator capacity. A natural follow-up is whether a much smaller encoder, given domain-matched training, could substitute. A pilot we describe in Appendix [L](https://arxiv.org/html/2606.14875#A12) fine-tunes Qwen-3.5-0.8B on $5{,}000$ entity-preserving rows with oracle bridge-entity conditioning, evaluates on the held-out MuSiQue test shard, and beats both LLMLingua-2 and the Sonnet TE encoder at roughly $10\%$ of the qwen-token budget. The pilot is single-dataset and uses oracle conditioning, so it is an upper bound on encoder substitutability rather than a deployable alternative; the general translator-capacity question remains open.

## 7 Conclusion

We tested Telegraph English, a readable symbolic re-expression of retrieved passages, as a matched-budget alternative to natural language for multi-hop question answering with small language models. Across three benchmarks, TE outperforms three trivial-compression controls and a coherent-prose summariser at the same per-row token budget, with all nine matched-budget confidence intervals strictly positive. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth, and a minimum-detectable-effect-size analysis bounds the design to ruling out widenings of 4–5 F1 points across hop levels. We interpret these results as evidence that the operative property of TE is entity density per token, and the natural follow-up is whether smaller encoders, evaluated without oracle entity conditioning, can preserve that density.

## 8 Limitations and Broader Impact

### 8.1 Limitations

#### Between-dataset gradient is $n=3$.

The post-hoc reading of the between-dataset TE–NL gradient as correlating with dataset NL ceiling rests on three datasets. NL-ceiling-proximity is the only monotone candidate moderator at this $n$; unobserved dataset-construction factors (bridge-entity retrievability, distractor-passage content) correlated with NL ceiling cannot be ruled out. A confirmatory replication would require a fourth dataset where NL ceiling varies independently of dataset construction, and we report the gradient as an exploratory observation only (Appendix [I](https://arxiv.org/html/2606.14875#A9)).

#### Cross\-architecture coverage is mechanism\-only\.

The matched\-budget mechanism is tested on two consumer families \(Qwen\-3\.5\-9B and Mistral\-7B\-Instruct\-v0\.3\) and is direction\-consistent on both \(§[6\.1](https://arxiv.org/html/2606.14875#S6.SS1)\)\. The aggregate TE–NL landmark and the dataset\-hardness gradient, however, are tested only on Qwen\-3\.5\-9B; the Mistral full\-NL baseline is biased by long\-context out\-of\-memory errors on the NL rows, so we do not claim architecture robustness for the landmark or for the between\-dataset gradient\. The coherent\-prose comparator was not run on Mistral for the same reason\. Cross\-family landmark coverage at a consumer with sufficient context\-window headroom is follow\-up work\.

#### Single encoder model and frozen prompt\.

TE’s encoder (Claude Sonnet 4.6 via AWS Bedrock batch inference) and the encoder prompt are fixed. We do not vary encoder capacity, encoder family, or prompt phrasing. A dedicated analysis of encoder sensitivity is future work; the present paper is about whether a frozen tokenizer-aware encoder can serve as a matched-budget alternative to NL, not about the landscape of such encoders.

#### Coherent prose loses entity coverage at matched budget\.

The matched-budget coherent-prose comparator truncates the encoder’s free-prose output to TE’s per-row budget. On a 10-row MuSiQue sample this drops entity coverage from $0.782$ (raw, $1.59\times$ budget) to $0.497$ (matched budget). That is a density property of coherent prose at tight budgets, not an engineering artefact we can remove. A reader whose deployment budget is measured in retained entities rather than consumer tokens should treat our prose-comparator evidence as upper-bounded; a comparison at matched entity coverage would allow the prose representation a larger budget and would answer a different question.

#### TE\-at\-larger\-budget counterfactual untested\.

We compare TE at matched qwen-token budget to coherent prose at the same budget; we do not test TE at 1.59× budget against coherent prose at 1.59× budget. The latter would isolate representation-format advantage from compression-ratio advantage and remains follow-up work.

#### Prompt\-template control\.

A control on MuSiQue (sub-sample $n=500$) under the role-prompted consumer prompt confirms the methodological-artefact reading of §[5.5](https://arxiv.org/html/2606.14875#S5.SS5): NL F1 (7.61%) and TE F1 (7.14%) collapse together under that prompt, so the collapse is driven by the prompt template and is not specific to TE. Under the explicit-reasoning prompt on the same sub-sample, TE F1 (34.28%) substantially exceeds NL F1 (3.12%), suggesting TE is more prompt-robust than raw NL under that variant. We flag this as suggestive only because the sub-sample is MuSiQue at $n=500$.

#### Metric brittleness\.

We use token-level F1 with the standard normalised-multiset implementation. F1 is sensitive to response verbosity (the role-prompt artefact in §[5.5](https://arxiv.org/html/2606.14875#S5.SS5) is a manifestation). We report exact-match (EM) alongside F1 on MuSiQue in Appendix [G](https://arxiv.org/html/2606.14875#A7); EM and F1 deltas track the same direction across all five comparators, which lends confidence that the depth-interaction null and the matched-budget mechanism are not F1-scorer artefacts. One divergence point is worth flagging: TE’s aggregate EM advantage over NL on MuSiQue (+1.08 pp, 95% CI [−0.58, +2.69]) is substantially smaller than its F1 advantage (+5.24 pp, 95% CI [+3.59, +6.89]) and the EM CI crosses zero, consistent with the aggregate TE–NL gap being partly carried by partial-credit tokens that F1 counts as recall and EM counts as a miss. A full LLM-judge sensitivity sweep remains future work.
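The F1/EM divergence described above can be reproduced on a toy answer. The sketch below is a minimal multiset token-F1 and exact-match scorer; the SQuAD-style normalisation (lowercasing, punctuation and article removal) is an assumption about what "standard normalised-multiset" means here, and the example strings are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall over token multisets."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# The prediction contains the gold entity plus one extra token.
print(token_f1("Honolulu, Hawaii", "Honolulu"), exact_match("Honolulu, Hawaii", "Honolulu"))
```

A prediction that names the correct entity alongside extra tokens earns partial F1 credit but zero EM, and longer responses dilute precision further, which is the verbosity sensitivity noted above.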

#### The depth\-interaction null is bounded, not erased\.

The minimum-detectable-effect-size analysis in §[6.2](https://arxiv.org/html/2606.14875#S6.SS2) shows our pooled design was powered for per-hop interaction slopes $\gtrsim 0.095$ log-odds/hop and underpowered for slopes below that. Readers should treat the null as informative against strong-form depth-dependent widening (slopes $\gtrsim$ 4–5 pp across the hop-2-to-hop-4 range) and uninformative about weak-form widening. Closing the gap would require either substantially larger $n$ per dataset or a narrower prediction.
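To see how a slope of 0.095 log-odds/hop relates to the 4–5 pp range quoted above, the sketch below converts a log-odds shift into a percentage-point change on the probability scale. The baseline correctness rates (0.3 to 0.5) and the reading of "hop-2-to-hop-4" as two hops of accumulated slope are assumptions for illustration, not values taken from the paper's power analysis.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def widening_pp(slope_per_hop, hops, baseline_p):
    """Percentage-point change implied by a log-odds shift of slope_per_hop * hops."""
    shifted = sigmoid(logit(baseline_p) + slope_per_hop * hops)
    return 100.0 * (shifted - baseline_p)


# Assumed baseline correctness rates; two hops separate hop-2 from hop-4.
for baseline in (0.3, 0.4, 0.5):
    print(baseline, round(widening_pp(0.095, 2, baseline), 2))
```

The conversion depends on the baseline rate, so a single log-odds bound corresponds to a band of percentage-point effects rather than one number.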

#### Passage\-set scope\.

We use the gold\-plus\-distractor contexts released with each benchmark as retrieval input\. This isolates the representation\-vs\-NL question from the retrieval question, but it also means our results do not speak directly to deployed retrieval pipelines where distractor quality varies\.

### 8.2 Broader impact

TE is a context\-compression representation that, in our data, preserves bridge\-entity spans from NL verbatim; it is therefore no more vulnerable to entity leakage than retrieval over the source NL passages themselves\. The encoder is frozen and task\-agnostic, so TE does not implicitly encode downstream task supervision in its output—a property we consider favourable for auditability\. We see no application\-specific harms unique to TE relative to the NL baseline it is meant to replace\.

## References

- M. L. Arbuzov, S. Bei, Z. Dong, D. Kalaev, and A. A. Shvets (2026a). Semantic prompt compression via structured symbolic rewriting. arXiv preprint arXiv:2605.04426.
- M. L. Arbuzov, S. Bei, Z. Dong, D. Kalaev, and A. Shvets (2025). Beyond exponential decay: rethinking error accumulation in large language models. arXiv preprint arXiv:2505.24187.
- M. L. Arbuzov, L. Mosbacker, S. Bei, Z. Dong, D. Kalaev, and A. Shvets (2026b). The architecture of errors: from universal impossibility to patch-local LLM reliability. arXiv preprint arXiv:2605.30628.
- Y. Benjamini and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), pp. 289–300.
- A. Chevalier, A. Wettig, A. Ajith, and D. Chen (2023). Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- R. DerSimonian and N. Laird (1986). Meta-analysis in clinical trials. Controlled Clinical Trials 7(3), pp. 177–188.
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023). PAL: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning.
- X. Ho, A. Duong Nguyen, S. Sugawara, and A. Aizawa (2020). Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics.
- H. Jiang, Q. Wu, C. Lin, Y. Yang, and L. Qiu (2023). LLMLingua: compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- H. Jiang, Q. Wu, X. Luo, D. Li, C. Lin, Y. Yang, and L. Qiu (2024). LongLLMLingua: accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
- J. Mu, X. L. Li, and N. D. Goodman (2023). Learning to compress prompts with gist tokens. In Advances in Neural Information Processing Systems 36.
- M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and A. Odena (2021). Show your work: scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
- Z. Pan, Q. Wu, H. Jiang, M. Xia, X. Luo, J. Zhang, Q. Lin, V. Rühle, Y. Yang, C. Lin, H. V. Zhao, L. Qiu, and D. Zhang (2024). LLMLingua-2: data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024.
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022). MuSiQue: multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics 10.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35.
- F. Xu, W. Shi, and E. Choi (2024). RECOMP: improving retrieval-augmented LMs with context compression and selective augmentation. In Proceedings of the Twelfth International Conference on Learning Representations.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018). HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- H. Yen, T. Gao, and D. Chen (2024). Long-context language modeling with parallel context encoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
- C. Yoon, T. Lee, H. Hwang, M. Jeong, and J. Kang (2024). CompAct: compressing retrieved documents actively for question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

## Appendix A Statistical protocol

The pre-registered primary hypothesis predicts that TE’s advantage over NL grows with reasoning depth, operationalised as a negative `representation` $\times$ `hop_count` interaction within multi-hop datasets. Let $y_i \in \{0,1\}$ be the correctness of the consumer’s answer to question $i$ under representation $r_i \in \{\text{NL}, \mathrm{TE}, \text{LLMLingua-2}\}$ at hop count $h_i \in \{2,3,4\}$, with $y_i = 1$ iff token-level F1 against the gold reference is $\geq 0.5$. We fit, per dataset, a binomial GLM with logit link

$$\mathrm{logit}\;\mathbb{P}(y_i=1)=\alpha+\beta_r r_i+\beta_h\,\text{hop}_c+\gamma_{r,h}\,(r_i\cdot\text{hop}_c) \tag{1}$$

where $\text{hop}_c = h_i - \bar{h}$ centers hop count at its dataset mean and cluster-robust standard errors by `question_id` serve as a practical proxy for the pre-registered `glmer + (1|question_id)` random-intercept specification (the two specifications agree on all inferential conclusions; sensitivity analysis in Appendix [E](https://arxiv.org/html/2606.14875#A5)). The pre-registered prediction is $\gamma_{\mathrm{TE},h} < 0$. HotpotQA is excluded because all its questions are 2-hop and the within-dataset slope is undefined.
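The covariates entering Eq. (1) can be assembled as in the sketch below, which binarises F1 at 0.5, centers hop count at the dataset mean, and builds representation-by-hop interaction columns with NL as the reference level. The record layout, the column names, and the dummy coding are assumptions for illustration; fitting the GLM and computing cluster-robust standard errors are left to a statistics package.

```python
REPRESENTATIONS = ("NL", "TE", "LLMLingua-2")  # NL is the assumed reference level


def build_design_rows(records, threshold=0.5):
    """Turn per-answer records into rows with y, hop_c, dummies, and interactions."""
    mean_hop = sum(r["hop_count"] for r in records) / len(records)
    rows = []
    for r in records:
        hop_c = r["hop_count"] - mean_hop  # centered hop count
        row = {
            "question_id": r["question_id"],  # clustering key
            "y": 1 if r["f1"] >= threshold else 0,
            "hop_c": hop_c,
        }
        for rep in REPRESENTATIONS[1:]:
            indicator = 1 if r["representation"] == rep else 0
            row[f"is_{rep}"] = indicator
            row[f"is_{rep}_x_hop_c"] = indicator * hop_c
        rows.append(row)
    return rows


# Hypothetical records: two questions, each answered under NL and TE.
records = [
    {"question_id": "q1", "representation": "NL", "hop_count": 2, "f1": 0.40},
    {"question_id": "q1", "representation": "TE", "hop_count": 2, "f1": 0.67},
    {"question_id": "q2", "representation": "NL", "hop_count": 4, "f1": 0.50},
    {"question_id": "q2", "representation": "TE", "hop_count": 4, "f1": 0.20},
]
rows = build_design_rows(records)
```

Centering makes the representation main effect interpretable at the average hop depth, while the interaction column carries the slope that the pre-registered prediction concerns.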

#### Meta\-analysis and multiple\-comparison correction\.

Per-dataset interaction slopes are pooled across MuSiQue and 2Wiki by random-effects meta-analysis (DerSimonian–Laird; DerSimonian and Laird, [1986](https://arxiv.org/html/2606.14875#bib.bib16)), yielding a pooled point estimate, 95% Wald CI, and $I^2$ heterogeneity statistic. False-discovery-rate correction (Benjamini and Hochberg, [1995](https://arxiv.org/html/2606.14875#bib.bib15)) is applied across three tests per interaction term ($\{\text{MuSiQue}, \text{2Wiki}, \text{meta}\}$).
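For readers unfamiliar with the correction step, the sketch below implements the Benjamini–Hochberg step-up adjustment for a family of three tests. The p-values are hypothetical placeholders, not results from the paper.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical family of three tests: MuSiQue, 2Wiki, meta.
family = {"MuSiQue": 0.20, "2Wiki": 0.04, "meta": 0.03}
adjusted = dict(zip(family, benjamini_hochberg(list(family.values()))))
print(adjusted)
```

With a family of only three tests the adjustment is mild, but it still prevents a single nominally significant slope from being read as confirmatory.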

#### Pre\-committed heterogeneity rule\.

Before fitting, we committed to a heterogeneity branch rule: if $I^2 > 75\%$ on a pooled slope, pooling is not interpretable and we fall back to per-dataset inference. The rule did not fire in our sample ($I^2 = 0\%$ on both pooled interaction slopes).
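The pooling step and the branch rule can be sketched together. The code below is a textbook DerSimonian–Laird random-effects estimator with the $I^2$ statistic and a 75% threshold check; the slope estimates and variances are hypothetical, and the function names are illustrative rather than taken from the study's code.

```python
import math


def dersimonian_laird(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * ti for wi, ti in zip(w, estimates)) / sum(w)
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, estimates))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * ti for wi, ti in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),  # Wald interval
        "tau2": tau2,
        "i2": i2,
    }


def inference_branch(result, threshold=75.0):
    """Apply the pre-committed rule: abandon pooling when I^2 exceeds the threshold."""
    return "per-dataset inference" if result["i2"] > threshold else "pooled inference"


# Hypothetical per-dataset slopes and sampling variances (two datasets).
result = dersimonian_laird([-0.02, 0.01], [0.0025, 0.0036])
print(round(result["pooled"], 4), result["i2"], inference_branch(result))
```

When Cochran's Q falls below its degrees of freedom, both $\tau^2$ and $I^2$ are truncated at zero and the random-effects estimate coincides with the fixed-effect one, which is the situation an $I^2$ of 0% describes.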

#### Paired\-bootstrap\.

For landmark and matched-budget contrasts, paired-bootstrap 95% confidence intervals resample question ids with $n_{\text{boot}} = 10{,}000$ and seed 0, using the percentile bracket on the bootstrap distribution.
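A minimal sketch of the paired percentile bootstrap described above follows. It resamples question-level score differences with replacement, which is equivalent to resampling question ids when both systems are scored on the same questions. The per-question scores are hypothetical, and the exact percentile indexing is one common convention rather than a confirmed detail of the study's implementation.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, seed=0, alpha=0.05):
    """Mean paired difference (a - b) and its percentile bootstrap interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample questions with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lower, upper)


# Hypothetical per-question F1 for two representations on the same five questions.
te_scores = [0.6, 0.7, 0.5, 0.9, 0.8]
nl_scores = [0.5, 0.4, 0.5, 0.6, 0.7]
delta, (lo, hi) = paired_bootstrap_ci(te_scores, nl_scores)
print(round(delta, 3), round(lo, 3), round(hi, 3))
```

Pairing by question removes between-question difficulty variance from the interval, which is why the contrasts are reported on paired rows.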

## Appendix B Consumer prompt

The default consumer prompt used for all main\-body numbers is reproduced verbatim below, with \{context\} and \{question\} placeholders substituted at runtime:

> Answer the question using only the context\. Return the answer on the final line\. Context: \{context\} Question: \{question\} Answer:

## Appendix C Telegraph English encoder prompt and example

TE is produced by Claude Sonnet 4.6 (via managed batch inference) with a fixed prompt that instructs entity-preserving re-expression at a target qwen-token budget. The encoder prompt is reproduced in full in the supplementary materials; an illustrative pre/post pair from a MuSiQue row is:

> NL: “Barack Obama was born in Honolulu, Hawaii. Honolulu is the capital of the state of Hawaii. Hawaii is a state in the United States.”
>
> TE: “Barack Obama @born Honolulu | Honolulu @capital_of Hawaii | Hawaii @state_in United States.”

TE preserves the entity spans (“Barack Obama”, “Honolulu”, “Hawaii”, “United States”) verbatim and rewrites connective NL tissue into pipe-separated symbolic clauses with `@`-prefixed operators.
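The clause structure of the example can be made concrete with a small parser. The sketch below splits a TE string on pipes and on the first `@`-prefixed operator in each clause; this grammar is inferred from the single illustrative pair above and is an assumption, since the full TE format may allow forms not shown here.

```python
def parse_te(te_text):
    """Split a TE string into (subject, operator, object) triples."""
    triples = []
    for clause in te_text.strip().rstrip(".").split("|"):
        clause = clause.strip()
        if not clause:
            continue
        subject, separator, rest = clause.partition(" @")
        if not separator:
            raise ValueError(f"no @-operator in clause: {clause!r}")
        operator, _, obj = rest.partition(" ")
        triples.append((subject.strip(), operator, obj.strip()))
    return triples


example = "Barack Obama @born Honolulu | Honolulu @capital_of Hawaii | Hawaii @state_in United States."
for triple in parse_te(example):
    print(triple)
```

Because entity spans survive verbatim, the subject and object fields can be matched directly against the source passage, which is what makes the format auditable by a human reader.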

## Appendix D A6/A7/A8 specification

All three trivial-compression controls are operationalised per row against TE’s per-row qwen-token count. Denote by $B_i$ the qwen-token count of TE’s passage for question $i$.

#### A6 \(char\-density\)\.

We rescale the NL passage by a character-level density transform: drop every $k$-th character with $k$ chosen per row so that the post-transform qwen-token count matches $B_i$. Whitespace is preserved to keep the output human-readable.
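A sketch of the A6 transform under stated assumptions: `toy_token_count` is a hypothetical stand-in for the qwen tokenizer (roughly four characters per token), and the search picks the $k$ whose output lands closest to the budget, since an exact match may not exist for every row. How the study resolves such ties is not specified here.

```python
import math


def toy_token_count(text):
    """Hypothetical stand-in for the qwen tokenizer: about 4 characters per token."""
    return sum(math.ceil(len(word) / 4) for word in text.split())


def drop_every_kth(text, k):
    """Drop every k-th non-whitespace character; whitespace is always kept."""
    kept, seen = [], 0
    for ch in text:
        if ch.isspace():
            kept.append(ch)
            continue
        seen += 1
        if seen % k != 0:
            kept.append(ch)
    return "".join(kept)


def char_density_compress(text, budget, count_tokens=toy_token_count, max_k=50):
    """Return (k, output) for the k in [2, max_k] whose token count is closest to budget."""
    best = None
    for k in range(2, max_k + 1):
        candidate = drop_every_kth(text, k)
        gap = abs(count_tokens(candidate) - budget)
        if best is None or gap < best[0]:
            best = (gap, k, candidate)
    return best[1], best[2]


passage = "Barack Obama was born in Honolulu, Hawaii."
k, compressed = char_density_compress(passage, budget=8)
print(k, repr(compressed))
```

Smaller $k$ removes more characters, so the search trades readability against budget; the output damages entity spans indiscriminately, which is why this control isolates token count from content preservation.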

#### A7 \(NL end\-truncation\)\.

We truncate the NL passage from its end at the qwen-token boundary so that the truncated passage has $B_i$ qwen tokens. Partial final words are dropped at the nearest word boundary to avoid UTF-8 fragments.
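The truncation and word-boundary rule can be sketched as follows. `toy_tokenize` is a hypothetical subword tokenizer (four-character chunks, with word-initial pieces carrying a leading space) standing in for the qwen tokenizer, so the cut points are illustrative only.

```python
def toy_tokenize(text):
    """Hypothetical subword tokenizer: 4-character chunks; word starts carry a space."""
    pieces = []
    for i, word in enumerate(text.split()):
        chunks = [word[j:j + 4] for j in range(0, len(word), 4)]
        if i > 0:
            chunks[0] = " " + chunks[0]
        pieces.extend(chunks)
    return pieces


def end_truncate(text, budget, tokenize=toy_tokenize):
    """Keep the first `budget` tokens, then drop a partial final word if one remains."""
    pieces = tokenize(text)
    if len(pieces) <= budget:
        return text
    kept = pieces[:budget]
    # If the first discarded piece continues a word, the last kept word is partial.
    if not pieces[budget].startswith(" "):
        while kept and not kept[-1].startswith(" "):
            kept.pop()  # remove continuation pieces of the partial word
        if kept:
            kept.pop()  # remove the word-initial piece of the partial word
    return "".join(kept)


passage = "Barack Obama was born in Honolulu"
print(repr(end_truncate(passage, 8)))
print(repr(end_truncate(passage, 3)))
```

Dropping the partial word can leave the output slightly under budget; the alternative of keeping a fragment would hand the consumer a broken entity span.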

#### A8 \(NL random\-subset\)\.

We sample $B_i$ qwen-token positions uniformly without replacement from the NL passage (fixed seed 42) and concatenate the selected tokens with a single space between segments.
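A minimal sketch of the A8 control, assuming whitespace tokens as a stand-in for qwen tokens and assuming the sampled positions are emitted in their original passage order (the ordering is not stated above):

```python
import random


def random_subset(text, budget, seed=42, tokenize=str.split):
    """Sample `budget` token positions without replacement and join them with spaces."""
    tokens = tokenize(text)
    if budget >= len(tokens):
        return " ".join(tokens)
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(tokens)), budget))  # keep passage order
    return " ".join(tokens[i] for i in positions)


passage = "Barack Obama was born in Honolulu , the capital of the state of Hawaii"
print(random_subset(passage, 6))
```

Because positions are drawn uniformly, multi-token entity names are usually broken apart, which is the property that distinguishes this control from an entity-preserving rewrite at the same budget.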

All three operate in qwen\-token space, not word or character space, so the budget matches what the consumer LM actually sees at its tokenizer\.

## Appendix E Cluster-robust GLM vs. glmer sensitivity

Our primary $\text{C1}'$ specification uses a binomial GLM with cluster-robust standard errors by `question_id` as a practical proxy for the pre-registered `glmer(correct ~ representation * hop_c + (1|question_id))` random-intercept specification. We re-fit the full `glmer` specification on both MuSiQue and 2Wiki as a sensitivity analysis; the estimated `representation` $\times$ `hop_c` slopes agree with the cluster-robust GLM to three decimal places on MuSiQue and two decimal places on 2Wiki, and the inferential conclusion (all four slopes direction-consistent, none significant after FDR-BH) is unchanged.

## Appendix F Prompt $\times$ F1-metric artefact

Wave-1a pilot numbers on MuSiQue under three consumer prompts (`default`, `explicit_reasoning`, `role_prompted`) are tabulated in Table [5](https://arxiv.org/html/2606.14875#A6.T5) below. Under `role_prompted`, TE’s F1 varies by up to 18.3 pp relative to `default` on the same rows, driven by response-length interaction with the normalised-multiset F1 scorer.

Table 5: Prompt-format sensitivity on MuSiQue (Wave-1a pilot). The `role_prompted` row shows a large negative swing for TE driven by verbose-response F1 deflation, not by an intrinsic compression loss. Main-body numbers throughout the paper use `default`.

## Appendix G EM alongside F1 on MuSiQue

Token-level F1 with the standard normalised-multiset implementation is sensitive to response verbosity (Appendix [F](https://arxiv.org/html/2606.14875#A6)). To check that the $\text{C1}'$ null and the C2 confirmation are not F1-scorer artefacts, we re-score the same MuSiQue predictions with exact-match (EM) and report paired-bootstrap 95% CIs on each representation’s EM delta versus NL at matched budget. The MuSiQue evaluation shard contains both F1 and EM for every row; no new inference is required.

Table 6: F1 and EM on MuSiQue ($n = 2{,}417$ questions paired) with paired-bootstrap 95% CIs on the delta vs NL ($n_{\text{boot}} = 10{,}000$, seed 0). EM and F1 deltas are direction-consistent across all five comparators. Note that TE’s EM advantage over NL is small (+1.08 pp) and its 95% CI crosses zero, whereas its F1 advantage (+5.24 pp) has a CI strictly above zero. The three trivial-compression controls (A6/A7/A8) show large negative deltas on both metrics with CIs well below zero, and LLMLingua-2 is negative on both metrics with CIs excluding zero.

The direction-consistency across F1 and EM lends confidence that the C2 mechanism finding (A6/A7/A8 all far below TE at matched budget) is not an F1-scorer artefact. On the positive TE $>$ NL comparison, the EM CI that crosses zero is informative: it suggests TE’s aggregate advantage on MuSiQue is at least partly carried by partial-credit rewards where TE emits the correct bridge entity alongside additional tokens that F1 counts as recall but EM counts as a miss. This is consistent with the compression-where-redundancy interpretation in §[6.1](https://arxiv.org/html/2606.14875#S6.SS1) (which is about mechanism at matched budget, not about the size of the aggregate F1 advantage). A full LLM-judge sensitivity sweep remains future work (§[8.1](https://arxiv.org/html/2606.14875#S8.SS1)).

## Appendix H Pre-registration protocol and amendments

Four pre-commitments jointly prevent auxiliary-to-primary promotion after the null primary: FDR-BH correction across {MuSiQue, 2Wiki, meta}; a heterogeneity rule ($I^2 > 75\%$, which did not fire); an analytic MDES bound reported alongside the null; and explicit hypothesis-generating-only labelling of the between-dataset moderator.

The pre-registration was filed prior to data collection (DOI and filing date omitted from the submission version for reviewer anonymization; restored in the camera-ready). The primary $\text{C1}'$ hypothesis, the auxiliary C2 mechanism observation (§11.7 obs 2), the FDR-BH correction family, the DerSimonian–Laird pooling specification, the pre-committed $I^2 > 75\%$ heterogeneity rule, and the landmark kill-gate tests K-F1-1, K-TA-1, and K-$\gamma$-1 are all pre-registered. Two in-repo amendments were filed during the study:

- §10 amendment: documented a default-prompt bug discovered in the Wave-1a pilot and specified the remedial Wave-1b rerun at matched LLMLingua-2 budget.
- §11 amendment: filed C2 (compression-where-redundancy, §11.7 obs 2) as a separately pre-registered auxiliary mechanism, prior to the Wave-2 cloud run that produced the A6/A7/A8 data reported in Table [1](https://arxiv.org/html/2606.14875#S5.T1).

A full amendment log with SHA\-stamped timestamps is included in the supplementary materials\.

## Appendix I Post-hoc observation: between-dataset gradient

This appendix expands the brief main-body pointer at the end of §[5.4](https://arxiv.org/html/2606.14875#S5.SS4) into the full detail of the between-dataset TE−NL gradient. We report the gradient as an *exploratory* observation only, not as a contribution.

The between-dataset ordering in Table [2](https://arxiv.org/html/2606.14875#S5.T2) (MuSiQue +4.75 significant, 2Wiki +1.53 null, HotpotQA −2.34 significant negative) is direction-consistent with the pre-registered P2b verbatim:

> “On HotpotQA (predominantly 1–2 hop), TE does *not* beat NL: $\Delta F1(\text{NL}-\text{TE}) > 0$. This validates that the interaction is emergent with depth, not a global TE-wins effect.” (pre-registered document, §4)

The direction is pre-registered; the specific moderator that *explains* the gradient is not. In our $n=3$ datasets the ordering correlates monotonically with dataset NL F1 (MuSiQue 38.0, 2Wiki 56.0, HotpotQA 69.7). Hop-based moderators (modal hop count, mean hop depth, 4-hop share) do *not* have the monotone shape required: modal hop counts are 2/no-mode/2 respectively, mean hop depths are 2.65/3.0/2.0, and 4-hop shares are 17%/33%/0%. NL-ceiling-proximity is therefore the only monotone candidate moderator at $n=3$. We report it as a post-hoc observational pattern only: unobserved dataset-construction factors (e.g., bridge-entity retrievability, distractor-passage content) correlated with NL ceiling cannot be ruled out. Figure [3](https://arxiv.org/html/2606.14875#A9.F3) shows the three points with no fitted line and an explicit $n=3$ caveat.
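The monotonicity claims above can be checked mechanically from the numbers quoted in this paragraph. The sketch below sorts the three datasets by each candidate moderator and tests whether the TE−NL deltas are then strictly monotone; the helper names are illustrative, and modal hop count is omitted because 2Wiki has no single mode.

```python
def is_monotone(values):
    """True if the sequence is strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def gradient_is_monotone_in(moderator, delta):
    """Sort datasets by the moderator and test whether the deltas are monotone."""
    ordered = sorted(moderator, key=moderator.get)
    return is_monotone([delta[name] for name in ordered])


# Values quoted in the text above.
delta_f1 = {"MuSiQue": 4.75, "2Wiki": 1.53, "HotpotQA": -2.34}
nl_f1 = {"MuSiQue": 38.0, "2Wiki": 56.0, "HotpotQA": 69.7}
mean_hop = {"MuSiQue": 2.65, "2Wiki": 3.0, "HotpotQA": 2.0}
four_hop_share = {"MuSiQue": 17, "2Wiki": 33, "HotpotQA": 0}

for name, moderator in [("NL F1", nl_f1), ("mean hop", mean_hop), ("4-hop share", four_hop_share)]:
    print(name, gradient_is_monotone_in(moderator, delta_f1))
```

With three points, a monotone ordering has a substantial chance of arising under no relationship at all, which is why the pattern is reported as hypothesis-generating only.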

![Refer to caption](https://arxiv.org/html/2606.14875v1/x3.png)

Figure 3: Post-hoc NL-ceiling-proximity observation ($n=3$ datasets, hypothesis-generating only). Between-dataset TE−NL $\Delta$F1 (pp) plotted against dataset NL F1 (%). With only three datasets any monotonic moderator fits similarly; we therefore show no fitted line, no $R^2$, and no coefficient. The ordering is consistent with, but does not confirm, an NL-ceiling-proximity reading; unobserved dataset-construction factors correlated with NL ceiling cannot be ruled out at $n=3$. Error bars: 95% paired-bootstrap CI on $\Delta$F1.

At $n=3$ datasets, NL-ceiling-proximity is the only monotone candidate moderator; unobserved dataset-construction factors (e.g., bridge-entity retrievability, distractor-passage content) cannot be ruled out. We therefore do not treat the gradient as a finding of this paper. A confirmatory replication with a fourth dataset where NL ceiling varies independently of dataset construction is the minimum prerequisite for any claim; we reserve that for follow-up work.

#### Hop distribution per dataset (reported for completeness, not a candidate moderator at $n=3$).

MuSiQue: 2-hop 1,252 (51.8%), 3-hop 760 (31.4%), 4-hop 405 (16.8%); modal 2, mean 2.65. 2Wiki: 2-hop 500, 3-hop 500, 4-hop 500 (balanced); no single mode, mean 3.0. HotpotQA: 2-hop 1,000 (100%); modal 2, mean 2.0. None of modal hop, mean hop, or 4-hop share is monotone with the between-dataset TE−NL gradient.
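The summary statistics above follow directly from the per-hop counts. A short sketch that recomputes them (the function name and return layout are illustrative):

```python
def hop_summary(counts):
    """Total, mean hop depth, percentage shares, and unique mode (None if tied)."""
    total = sum(counts.values())
    mean = sum(hop * c for hop, c in counts.items()) / total
    shares = {hop: 100.0 * c / total for hop, c in counts.items()}
    top = max(counts.values())
    modes = [hop for hop, c in counts.items() if c == top]
    return {
        "n": total,
        "mean": mean,
        "shares": shares,
        "mode": modes[0] if len(modes) == 1 else None,
    }


# Per-hop question counts as reported above.
datasets = {
    "MuSiQue": {2: 1252, 3: 760, 4: 405},
    "2Wiki": {2: 500, 3: 500, 4: 500},
    "HotpotQA": {2: 1000},
}
for name, counts in datasets.items():
    summary = hop_summary(counts)
    print(name, summary["n"], round(summary["mean"], 2), summary["mode"])
```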

## Appendix J F3 coherent-prose comparator: entity coverage and landmark contrasts

F3 evaluates Claude Sonnet 4.6 as a frontier coherent-summary encoder on the same three datasets, post-truncated to each per-row TE qwen-token budget (A7-style), so F3 is evaluated at the same budget as TE on identical consumer rows. Entity-coverage QC on a 10-row random MuSiQue sample (seed 0): at matched TE qwen-token budget F3 retains 0.497 of TE’s named entities, versus 0.782 for F3 raw (pre-truncation) at 1.59× TE’s budget and 0.824 for LLMLingua-2 rate-50 at 2.29× TE’s character budget (Table [7](https://arxiv.org/html/2606.14875#A10.T7)); the drop 0.782 → 0.497 is produced entirely by matched-budget truncation, not by the encoder. Comparing F3 vs A7 on MuSiQue (both truncated to TE’s per-row qwen-token budget, differing only in summarization quality) gives F3 − A7 ≈ +7 pp: coherent summarization adds value over raw truncation even after both lose entity coverage, so the TE − F3 contrast in Table [3](https://arxiv.org/html/2606.14875#S5.T3) isolates the density advantage from the summarization-quality advantage. Paired-bootstrap 95% CIs: $n_{\text{boot}} = 10{,}000$, seed 0, paired by `question_id`; per-dataset $n$: MuSiQue 2,417; 2Wiki 1,500; HotpotQA 1,000.

Table 7: Entity-coverage QC on a 10-row random MuSiQue sample (seed 0). † TE is the reference by construction. F3 at matched qwen-token budget retains about half of TE’s named entities; the ∼30-pp gap from F3 raw is produced by matched-budget truncation, not by the Sonnet encoder. LLMLingua-2’s higher coverage is measured at a different (more permissive) budget regime.

## Appendix K Honest negatives: F2-struct and $\gamma$-rescue

Two pre-registered kill gates fired. F2-struct (a structured NL reformulation) does not improve over NL on any dataset ($\Delta$ = −6.6, −13.4, −15.7 pp on MuSiQue, 2Wiki, HotpotQA respectively); F2-struct is cut per pre-registration. K-$\gamma$-1, a hop-3 rescue frame intended to outperform TE on 3-hop questions in a revise-and-review pass, fails: combined MuSiQue + 2Wiki hop-3 $\Delta(\gamma - \mathrm{TE})$ = −0.15 pp, 95% CI [−0.78, +0.48], $n = 1{,}260$. $\gamma$ is cut. Both negatives are useful: F2-struct falsifies a “surface structure alone suffices” hypothesis, and K-$\gamma$-1 falsifies a specific “deeper revision pass rescues the hard hops” hypothesis.

## Appendix L Follow-up pilot: small-LM encoder on MuSiQue with oracle entity conditioning

#### Motivation\.

§[6.3](https://arxiv.org/html/2606.14875#S6.SS3) chose Claude Sonnet 4.6 as the TE encoder to isolate the consumer-reading question from translator-capacity confounds. A reviewer may reasonably ask whether a much smaller encoder, given domain-matched entity-preserving training, could substitute. This pilot is a single-dataset *proof-of-concept* for encoder substitutability; it is not a full replication of the main-experiment matrix and does not close the general translator-capacity question.

#### Setup\.

We fine-tuned Qwen-3.5-0.8B (base) via LoRA on 5,000 entity-preserving Wikipedia QA rows drawn from the MuSiQue and HotpotQA training splits (2,500 each). Each training row pairs a question, a retrieval passage, the answer, and the decomposition’s *bridge entities* (the set of surface forms the decomposition labels as appearing in intermediate hops). The encoder is conditioned at both train *and inference* on the gold bridge-entity set as a system-prompt side channel; the supervised target is a Telegraph-English paraphrase of the passage produced by Sonnet 4.6 under a bridge-entity-aware prompt. At evaluation we re-translate the held-out MuSiQue $n = 2{,}417$ test shard through the fine-tuned encoder under the same oracle conditioning, then send the output to the main Qwen-3.5-9B consumer using the same default prompt, decoder setting, and F1 scorer as the main experiments.

#### Results\.

Evaluating all four conditions on the same $n = 2{,}417$ held-out items and pairing on question id:

- E10-v2 (retrained 0.8B): F1 48.04%
- TE (Sonnet 4.6, same-job): F1 42.40%
- NL baseline: F1 37.50%
- LLMLingua-2 (rate-50, matched to TE budget): F1 35.07%

Paired bootstrap over questions ($n_{\text{boot}} = 10{,}000$, seed 0):

- E10-v2 − LLMLingua-2: +12.97 pp, 95% CI [+11.00, +14.92]
- E10-v2 − TE: +5.64 pp, 95% CI [+3.76, +7.54]
- E10-v2 − NL: +10.54 pp, 95% CI [+8.55, +12.53]

All three CIs are strictly positive. The same-job TE F1 of 42.40% differs by 0.3 pp from the main-experiment TE F1 of 42.74%, consistent with deterministic-decode batch-order shifts across consumer-eval runs.

#### Scope limits\.

Three limitations bound the reading of these numbers\.

*(i) Single-dataset.* The pilot tests only MuSiQue; the main paper’s claims span three datasets. Running the pilot on 2Wiki and HotpotQA is mechanical but was deprioritised as the MuSiQue result already suffices to address the “frontier-in, frontier-out” framing concern on the dataset where TE shows its largest margin.

*(ii) Budget mismatch.* The retrained 0.8B output sits at ∼10% of TE’s qwen-token budget (median 126 vs. 1251 per row; only 9/2417 rows exceed TE’s budget and are truncated). The +5.64 pp lift over TE therefore reads as an extreme-density data point, not matched-budget dominance. A matched-budget probe (letting E10-v2 generate up to TE’s per-row budget) would disambiguate *extreme-density is sufficient* from *E10-v2 happens to work at low budget*; we do not run that probe here.

*(iii) Oracle conditioning at inference.* The decomposition’s bridge-entity set is used as prompting input to the encoder at inference time. In a deployment setting these labels are not available without either human annotation or a separate entity-extraction model; the pilot thus demonstrates an *upper bound* on encoder substitutability, not a deployable alternative. A non-oracle control (conditioning on NER-predicted bridge entities, or on entities recovered from the consumer’s first-pass guess) is the natural follow-up.

#### What the pilot does and does not establish\.

Subject to limits (i)–(iii), the pilot establishes that on MuSiQue translator capacity is *not the binding constraint* on our mechanism and density claims: a domain-matched 0.8B encoder with oracle entity conditioning reaches consumer F1 above LLMLingua-2 and above the frontier TE encoder at a fraction of the token budget. It does *not* establish that translator capacity is unimportant across 2Wiki or HotpotQA, that small-LM encoders match TE without oracle conditioning, or that the density property transfers to matched-budget generation. The general translator-capacity question remains open and is the subject of concurrent work.

#### Artifacts\.

The following pilot artifacts accompany this paper as supplementary material: the 5,000-row entity-preserving training data; the Sonnet-generated Telegraph-English targets; the LoRA fine-tune config and training-loss curves; the re-translated MuSiQue $n = 2{,}417$ test shard produced by the fine-tuned encoder; the consumer-eval shard, per-row F1/EM outputs, and paired-bootstrap JSON with all three contrasts.