AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

arXiv cs.AI Papers

Summary

AGORA introduces an inference-free step-level prompt compressor for LLM agents that avoids the 'action-grammar destruction' failure mode of token-level compressors. It retains ≥75% uncompressed performance in 8 of 9 environments across backbones, using a structural parser, an always-keep floor, and a learned relevance scorer.

arXiv:2605.26596v1 Announce Type: new Abstract: The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward <= 0.05 despite 1.3-13.3x realized compression. We name and characterize this failure mode as action-grammar destruction -- the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels (~2ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.


# Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
Source: [https://arxiv.org/html/2605.26596](https://arxiv.org/html/2605.26596)
###### Abstract

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward $\textit{mr} \leq 0.05$ despite $1.3$–$13.3\times$ realized compression. We name and characterize this failure mode as *action-grammar destruction*: the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those that self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels ($\sim 2$ ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining $\geq 75\%$ uncompressed performance in 8 of 9 cells (with the lone exception at $73\%$); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of $1.0$–$11.5\times$ adaptive end-to-end compression from a single fixed keep ratio.

Code: https://github.com/ranranrannervous/agoracompression

## 1 Introduction

LLM agents accumulate trajectories that grow with task horizon: by step 30 on WebShop, the in-context history can exceed 100k tokens. Compressing it is a practical necessity, but compressing the prompt of an LLM agent is not structurally identical to compressing the prompt of a single-turn LM. Agents are notoriously sensitive to surface format: AgentBench (Liu et al. [2024b](https://arxiv.org/html/2605.26596#bib.bib17)) names *Invalid Format* and *Invalid Action* as primary termination categories, meaning-preserving perturbations alone yield up to 76 accuracy-point swings on ICL tasks (Sclar et al. [2024](https://arxiv.org/html/2605.26596#bib.bib23)), and restricting LLM output format significantly degrades reasoning (Tam et al. [2024](https://arxiv.org/html/2605.26596#bib.bib26)). Compressing the prompt of a format-sensitive agent therefore carries a risk that compressing general LM context does not.

Two design families dominate. *Token-level extractive* compressors (Jiang et al. [2023](https://arxiv.org/html/2605.26596#bib.bib7), [2024](https://arxiv.org/html/2605.26596#bib.bib8); Li et al. [2023b](https://arxiv.org/html/2605.26596#bib.bib12); Pan et al. [2024](https://arxiv.org/html/2605.26596#bib.bib21)) rank tokens by self-information or perplexity and drop those below a budget. They are cheap, and have been shown to deliver up to $20\times$ compression with minimal loss on QA and summarization. *LLM-based abstractive* compressors invoke a secondary model per step (HiAgent (Hu et al. [2025](https://arxiv.org/html/2605.26596#bib.bib6)), ACON (Kang et al. [2025](https://arxiv.org/html/2605.26596#bib.bib9))) or per window (AgentDiet (Xiao et al. [2025](https://arxiv.org/html/2605.26596#bib.bib32))), preserving quality at the cost of 8–126k extra tokens per task. The token-level family has not been systematically audited on agents. We do so, and find it fails not by a margin but *structurally*.

Across 17 (env, backbone, method) cells spanning Selective Context (Li et al. [2023b](https://arxiv.org/html/2605.26596#bib.bib12)) (6 cells) and LLMLingua-2 (Pan et al. [2024](https://arxiv.org/html/2605.26596#bib.bib21)) (11 cells, off-shelf plus a $4\times$-retrained variant), every cell collapses to $\textit{mr} \leq 0.05$ despite $1.3$–$13.3\times$ realized compression. The mechanism is action-grammar destruction: tokens carrying action semantics (`search[men's shoes size 10]`, `click[Buy Now]`, identifiers, brackets, action verbs) are precisely those that self-information ranks lowest, because they are predictable from natural-text context. The compressor works as designed; the residual is grammatical English but consists of ungrammatical agent commands that the environment rejects. The two LLMLingua-2 variants (off-shelf and retrained at the $4\times$ target on agent data) fail identically, ruling out hyperparameter remedies.
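The mechanism can be illustrated with a toy sketch. The scores below are a hand-written, hypothetical stand-in in which brackets and action verbs score lowest; they are not real self-information values from a language model. The `ACTION_RE` pattern is likewise an assumed WebShop-style grammar chosen for illustration, not the environment's actual validator.

```python
import re

# Assumed WebShop-style action grammar: verb[argument]. Illustrative only.
ACTION_RE = re.compile(r"^(search|click)\[[^\[\]]+\]$")

def is_valid_action(text: str) -> bool:
    return ACTION_RE.match(text) is not None

def tokenize(text: str) -> list:
    # Words and individual bracket characters become separate tokens.
    return re.findall(r"[\[\]]|[^\s\[\]]+", text)

def detokenize(tokens: list) -> str:
    text = " ".join(tokens)
    return text.replace(" [ ", "[").replace("[ ", "[").replace(" ]", "]")

# Hypothetical scores: predictable grammar tokens get low "self-information".
LOW_INFO = {"[", "]", "search", "click"}

def toy_score(token: str) -> float:
    return 0.1 if token in LOW_INFO else 1.0

def drop_lowest(tokens: list, keep_ratio: float) -> list:
    # Keep the top `keep_ratio` fraction of tokens by score, in original order.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: toy_score(tokens[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [t for i, t in enumerate(tokens) if i in keep]

action = "search[men's shoes size 10]"
compressed = detokenize(drop_lowest(tokenize(action), keep_ratio=0.6))
print(is_valid_action(action), repr(compressed), is_valid_action(compressed))
```

Under these assumed scores the surviving text is still readable English, but the verb and brackets are gone, so the toy grammar check rejects it.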

The diagnosis sharpens the design space: a compressor for agent prompts must operate at *step granularity*, keeping each retained (action, observation) pair verbatim rather than pruning inside it. Existing step-level methods achieve this through a per-step LLM call; Lindenbauer et al. ([2025](https://arxiv.org/html/2605.26596#bib.bib15)) recently show that on coding agents, even a simple inference-free observation-masking rule matches LLM summarization, suggesting the inference-free corner deserves more attention. We push it further. We introduce AGORA (Adapter-Grounded Observation-action Retention Architecture), which combines a structural parser that segments a prompt into role-typed blocks and (action, observation) step pairs, an always-keep floor pinning the system, task, current observation, and last $K_{\text{recent}} = 2$ steps, and a 125M-parameter RoBERTa relevance scorer trained on counterfactual next-action-change labels that allocates the remaining char budget. The whole pass is one $\sim 2$ ms forward pass per step, with no API call. AGORA is not proposed as a uniformly superior compressor: LLM-based methods undercut it on \$/task in 7–8 of 9 cells, and its cost advantage concentrates on long trajectories paired with expensive backbones.

Our contributions are as follows\.

**A new failure mode for compression on agents.** We identify and name *action-grammar destruction*, the structural reason token-level extractive compressors fail on agents, manifesting as collapse to $\textit{mr} \leq 0.05$ in every cell of a 17-cell audit across two independent token-level method families.

**AGORA, an inference-free step-level compressor.** We propose a hierarchical compressor matching the diagnosis: a structural parser, an always-keep floor pinning format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels. AGORA occupies the inference-free corner of step-level design space at $\sim 2$ ms per step and zero per-step LLM toll.

**Component attribution: which design choice buys what.** AGORA is the only compared method retaining $\geq 75\%$ uncompressed performance in 8 of 9 cells (the lone exception at $73\%$). A four-way ablation (Table [3](https://arxiv.org/html/2605.26596#S4.T3)) isolates the always-keep floor as the dominant quality lever and the learned scorer as the source of adaptive end-to-end compression ($1.0$–$11.5\times$) from a single fixed keep ratio.

## 2 Related Work

### 2.1 Token-level Prompt Compression

The dominant inference-free compressors rank individual tokens or sentences by an information criterion and drop the lowest-scoring units. Selective Context (Li et al. [2023b](https://arxiv.org/html/2605.26596#bib.bib12)) scores with self-information; LLMLingua (Jiang et al. [2023](https://arxiv.org/html/2605.26596#bib.bib7)) and LongLLMLingua (Jiang et al. [2024](https://arxiv.org/html/2605.26596#bib.bib8)) use perplexity, the latter conditioned on the query; LLMLingua-2 (Pan et al. [2024](https://arxiv.org/html/2605.26596#bib.bib21)) reframes compression as token classification with a BERT-class encoder distilled from GPT-4; RECOMP (Xu, Shi, and Choi [2024](https://arxiv.org/html/2605.26596#bib.bib33)) carries the same idea to RAG sentence selection with a downstream-task contrastive objective. The standard taxonomy of Li et al. ([2025](https://arxiv.org/html/2605.26596#bib.bib14)) groups these as *hard-prompt* methods alongside the soft-prompt family discussed below, but its coverage is single-turn and does not currently distinguish a step-level agent-trajectory cell. All published evaluations are on QA, summarization, or long-context tasks; our §[4.5](https://arxiv.org/html/2605.26596#S4.SS5) audit shows the rank-and-delete paradigm structurally fails on agents because the tokens that carry action semantics happen to be exactly those that self-information ranks lowest. AGORA departs from this family on three primitives: the unit of decision is a whole step (action+observation pair) rather than a token; the training signal is counterfactual next-action change rather than extractive faithfulness; and because each retained step is kept verbatim, brackets and identifiers inside action grammar are preserved by construction rather than by hoping a token scorer learns to keep them.

### 2.2 Soft-prompt and KV-cache Compression

A second family modifies the LM rather than its input: gist tokens (Mu, Li, and Goodman [2023](https://arxiv.org/html/2605.26596#bib.bib19)), in-context autoencoders (Ge et al. [2023](https://arxiv.org/html/2605.26596#bib.bib5)), and auto-compressors (Chevalier et al. [2023](https://arxiv.org/html/2605.26596#bib.bib2)) learn soft prompts that summarize the input, while KV-cache eviction (Xiao et al. [2024](https://arxiv.org/html/2605.26596#bib.bib31); Zhang et al. [2023](https://arxiv.org/html/2605.26596#bib.bib36); Li et al. [2024](https://arxiv.org/html/2605.26596#bib.bib13)) compresses the attention cache directly. Both require white-box backbone access and produce non-natural-language artifacts, which puts them out of scope for our setting: frozen black-box agent backbones with no LM modification.

### 2.3 Agent Context and Memory Management

The closest neighbours to AGORA compress agent trajectories step by step. HiAgent (Hu et al. [2025](https://arxiv.org/html/2605.26596#bib.bib6)) has the backbone formulate subgoals and then summarize prior observations within each subgoal scope; ACON (Kang et al. [2025](https://arxiv.org/html/2605.26596#bib.bib9)) compresses observations and interaction histories through a natural-language guideline iteratively optimized against failure traces; AgentDiet (Xiao et al. [2025](https://arxiv.org/html/2605.26596#bib.bib32)) runs a GPT-5-mini reflection module over sliding windows on coding agents; MemGPT (Packer et al. [2023](https://arxiv.org/html/2605.26596#bib.bib20)) pages information between a fixed main context and external storage via LLM-issued function calls. All four invoke an LLM at compression time, paying the per-step token toll AGORA avoids.

The closest *inference-free* precedent is Lindenbauer et al. ([2025](https://arxiv.org/html/2605.26596#bib.bib15)), who show on SWE-bench that a fixed observation-masking rule matches LLM summarization at roughly half the cost. Their rule is single-domain and has no learned component; AGORA can be read as generalizing it: structural retention (role-block parser + $K_{\text{recent}}$-recent always-keep) plays the same quality-floor role, while the learned scorer extends realized compression beyond what that floor alone achieves. A separate body of work addresses cross-trial rather than within-trial memory, including A-Mem (Xu et al. [2026](https://arxiv.org/html/2605.26596#bib.bib34)), Reflexion (Shinn et al. [2023](https://arxiv.org/html/2605.26596#bib.bib24)), ExpeL (Zhao et al. [2024](https://arxiv.org/html/2605.26596#bib.bib37)), and Voyager (Wang et al. [2023](https://arxiv.org/html/2605.26596#bib.bib28)), and is orthogonal to the working-memory compression problem we target. Using downstream-task utility as a training signal also has coarser-grained precedents in UDR (Li et al. [2023a](https://arxiv.org/html/2605.26596#bib.bib11)), LLM-R (Wang, Yang, and Wei [2024](https://arxiv.org/html/2605.26596#bib.bib29)), and RECOMP (Xu, Shi, and Choi [2024](https://arxiv.org/html/2605.26596#bib.bib33)); AGORA's counterfactual labels (*does removing this past step change the next action?*) specialize that idea to step-level agent context.

### 2.4 Action-format Brittleness and Structural-over-semantic Findings

Two background literatures sit behind our design. The first documents the format sensitivity of LMs and agents: format matters more than ground-truth labels in ICL (Min et al. [2022](https://arxiv.org/html/2605.26596#bib.bib18)), meaning-preserving perturbations swing accuracy by up to 76 points (Sclar et al. [2024](https://arxiv.org/html/2605.26596#bib.bib23)), template choice can reduce strong models to random-guess level (Voronov, Wolf, and Ryabinin [2024](https://arxiv.org/html/2605.26596#bib.bib27)), output-format restrictions degrade reasoning (Tam et al. [2024](https://arxiv.org/html/2605.26596#bib.bib26)), and position effects on retrieval are U-shaped (Liu et al. [2024a](https://arxiv.org/html/2605.26596#bib.bib16)), the last of which directly motivates our $K_{\text{recent}}$ recency window. On the agent side, AgentBench (Liu et al. [2024b](https://arxiv.org/html/2605.26596#bib.bib17)) names *Invalid Format* and *Invalid Action* as primary termination categories, and the multi-agent failure taxonomy of Cemri et al. ([2026](https://arxiv.org/html/2605.26596#bib.bib1)) attributes $38.1\%$ of $1{,}242$ annotated traces to specification issues. The second is a structural-over-semantic pattern in LM context engineering: random documents can improve RAG accuracy by up to $35\%$ while semantically related-but-non-answer-bearing documents do the most damage (Cuconasu et al. [2024](https://arxiv.org/html/2605.26596#bib.bib4)), and multi-criteria reranking beats relevance-maximization alone (LeVine and Varjavand [2025](https://arxiv.org/html/2605.26596#bib.bib10)). Both echo the classical relevance-vs.-utility distinction from information retrieval (Cooper [1971](https://arxiv.org/html/2605.26596#bib.bib3); Saracevic [1975](https://arxiv.org/html/2605.26596#bib.bib22)). AGORA fits naturally into this pattern: token-level compressors optimize an informativeness axis that is mis-specified for agents, and the binding constraint turns out to be structural preservation of action grammar.

## 3 Method

![Refer to caption](https://arxiv.org/html/2605.26596v1/x1.png)

Figure 1: Overview of AGORA. At each step $t$, the agent's trajectory $\tau_t$ is parsed into role-typed blocks ($c_{\text{sys}}$: system, $c_{\text{task}}$: task, $\{s_i\}$: past (action, observation) steps, $c_{\text{now}}$: current observation, $c_{\text{pend}}$: pending assistant) and compressed by three sub-stages. Layer 1 (Quality Floor) unconditionally retains $\{c_{\text{sys}}, c_{\text{task}}, c_{\text{now}}, c_{\text{pend}}\}$, the last $K_{\text{recent}} = 2$ steps, and any past step the scorer flags with $p_i > \theta_{\text{hi}} = 0.9$. Layer 2 (Relevance Scorer) is a 125M-parameter RoBERTa that outputs $p_i = P(\text{critical} \mid c_{\text{now}})$. Layer 3 (Greedy Char-Budget Fill) ranks remaining past steps by $p_i$ and fills the residual char budget $B = \rho\,|\tau_t|$ ($\rho = 0.25$). The compressed context $\tilde{\tau}_t$ (target char budget $\rho = 0.25$; realized end-to-end compression varies $1.0$–$11.5\times$ across cells, see §[4.3](https://arxiv.org/html/2605.26596#S4.SS3)) is then sent to the frozen backbone; the dashed return path marks the high-confidence override from Layer 2 back into Layer 1.

### 3.1 Problem Formulation

At step $t$, an LLM agent has trajectory $\tau_t = (c_0, a_1, o_1, \ldots, a_{t-1}, o_{t-1}, o_t)$, where $c_0$ is the fixed system+task block and $(a_i, o_i)$ is the $i$-th action–observation pair. We insert a compressor $C$ between the environment and the backbone $\pi$, applied at every step:

$$\tilde{\tau}_t = C(\tau_t), \qquad |\tilde{\tau}_t| \,\leq\, \rho\,|\tau_t|, \tag{1}$$

with $|\cdot|$ measured in characters and $\rho \in (0, 1]$. Two constraints distinguish agent compression from prompt compression in general. $C$ must be *step-level*: surviving fragments must contain syntactically complete action calls, or the environment rejects them (§[4.5](https://arxiv.org/html/2605.26596#S4.SS5)). It must also be *inference-free*, since the per-step LLM toll is precisely what we want to avoid.

### 3.2 Approach Overview

AGORA comprises three components (Figure [1](https://arxiv.org/html/2605.26596#S3.F1)). A *structural parser* segments $\tau_t$ into role-typed blocks and groups them into past step pairs. An *always-keep floor* unconditionally pins format-critical and recency-critical content. A learned *step-relevance scorer* ranks the remaining past steps against the current observation, and a greedy procedure fills the residual char budget. Each retained step is kept verbatim, so action grammar is preserved by construction. The whole pass is a single 125M-parameter forward pass, with no API call, no LM modification, and no white-box access to the agent backbone.

### 3.3 Structural Parser and Always-Keep Floor

Eval-time prompts follow the role-marker convention `[SYSTEM]` … `[USER]` … `[ASSISTANT]` …. A single regex over role markers yields an ordered block list, which a deterministic state machine groups into five role-typed components: a system block $c_{\text{sys}}$, the initial task instruction $c_{\text{task}}$, a sequence of past steps $\{s_i = (a_i, o_i)\}_{i=1}^{N}$ (each an `[ASSISTANT]` + `[USER]` pair), the current observation $c_{\text{now}}$, and an optional pending `[ASSISTANT]` block $c_{\text{pend}}$. The parser is task-agnostic at this level.
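A minimal sketch of such a parser follows. It is not the authors' implementation: the class and function names are hypothetical, and the state machine assumes the layout of $\tau_t$ from §3.1, in which the current observation arrives as its own `[USER]` block after the last complete `[ASSISTANT]` + `[USER]` pair.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ROLE_RE = re.compile(r"\[(SYSTEM|USER|ASSISTANT)\]")

@dataclass
class ParsedPrompt:
    c_sys: str = ""
    c_task: str = ""
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (action, observation)
    c_now: str = ""
    c_pend: Optional[str] = None

def split_blocks(prompt: str) -> List[Tuple[str, str]]:
    # With one capturing group, re.split returns [prefix, role, text, role, text, ...].
    parts = ROLE_RE.split(prompt)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

def parse_prompt(prompt: str) -> ParsedPrompt:
    blocks = split_blocks(prompt)
    out = ParsedPrompt()
    i = 0
    if i < len(blocks) and blocks[i][0] == "SYSTEM":
        out.c_sys = blocks[i][1]
        i += 1
    if i < len(blocks) and blocks[i][0] == "USER":
        out.c_task = blocks[i][1]
        i += 1
    pending = None  # an [ASSISTANT] block still waiting for its observation
    for role, text in blocks[i:]:
        if role == "ASSISTANT":
            pending = text
        elif role == "USER":
            if pending is not None:
                out.steps.append((pending, text))  # complete past step
                pending = None
            else:
                out.c_now = text  # unpaired [USER] block: assumed current observation
    out.c_pend = pending  # trailing unpaired [ASSISTANT] block, if any
    return out

demo = ("[SYSTEM] You are an agent. [USER] Buy shoes. "
        "[ASSISTANT] search[shoes] [USER] results page "
        "[USER] item page [ASSISTANT] Thought:")
parsed = parse_prompt(demo)
```

Other prompt layouts (for example, one where the last paired `[USER]` block is the current observation) would need a different grouping rule.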

The first compression layer is a deterministic floor:

$$\mathcal{K} \,=\, \{c_{\text{sys}}, c_{\text{task}}, c_{\text{now}}, c_{\text{pend}}\} \,\cup\, \{s_i\}_{i=N-K_{\text{recent}}+1}^{N} \,\cup\, \{s_j : p_j > \theta_{\text{hi}}\}, \tag{2}$$

with $K_{\text{recent}} = 2$ and $\theta_{\text{hi}} = 0.9$. The three sets are, respectively, format-critical blocks whose removal consistently breaks rollouts; the $K_{\text{recent}}$ most recent steps, motivated by the U-shaped position effect of Liu et al. ([2024a](https://arxiv.org/html/2605.26596#bib.bib16)) and kept verbatim rather than at a token-budget rate; and a high-confidence override from the scorer below that lets the floor absorb past steps the relevance model is near-certain about, independent of recency. §[4.4](https://arxiv.org/html/2605.26596#S4.SS4) (column −Scorer) shows that this floor alone, without the learned scorer, retains $\textit{mr} = 0.356$ averaged over the 9 cells ($\Delta\textit{mr} = -0.031$ vs full AGORA-M2), so the floor carries the bulk of the quality retention and the scorer's incremental quality contribution is modest; the scorer's primary effect is on realized compression range (§[4.3](https://arxiv.org/html/2605.26596#S4.SS3)).
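The step-level part of the floor can be sketched in a few lines. The function name and the 0-based indexing are assumptions made for illustration; the fixed blocks $c_{\text{sys}}, c_{\text{task}}, c_{\text{now}}, c_{\text{pend}}$ are always kept and are therefore not modeled here.

```python
from typing import List, Sequence

def floor_step_indices(scores: Sequence[float],
                       k_recent: int = 2,
                       theta_hi: float = 0.9) -> List[int]:
    """Return 0-based indices of past steps pinned by the always-keep floor.

    scores[i] is the scorer's p_i for past step i (chronological order).
    """
    n = len(scores)
    recent = set(range(max(0, n - k_recent), n))          # last K_recent steps
    confident = {j for j, p in enumerate(scores) if p > theta_hi}  # scorer override
    return sorted(recent | confident)

pinned = floor_step_indices([0.95, 0.10, 0.20, 0.30, 0.50, 0.05])
```

In this example the two most recent steps are pinned by recency, and the first step is pinned by the high-confidence override even though it is the oldest.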

### 3.4 Step Relevance Scorer and Greedy Budget Fill

A RoBERTa-base classifier $\phi$ (125M parameters) scores each past step against the current observation:

$$p_i \,=\, \big[\mathrm{softmax}\big(\phi(c_{\text{now}}^{\leq 1500},\; [a_i; o_i]^{\leq 1500})\big)\big]_1, \tag{3}$$

where $(\cdot)^{\leq 1500}$ truncates each side at 1500 characters to fit RoBERTa's 512-subword window, and the subscript selects $P(\text{label} = \text{critical})$. The model is a single shared instance across all environments and backbones, with no per-cell tuning.
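The scoring rule can be sketched without committing to a particular model library. In the sketch below, `logits_fn` is a hypothetical stand-in for the RoBERTa classifier $\phi$ (any callable that maps a text pair to two logits), and joining the action and observation with a newline is an assumption; the source only specifies the concatenation $[a_i; o_i]$.

```python
import math
from typing import Callable, List, Sequence

MAX_CHARS = 1500  # per-side character truncation from Eq. (3)

def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score_step(c_now: str, action: str, observation: str,
               logits_fn: Callable[[str, str], Sequence[float]]) -> float:
    """Return p_i = P(critical) for one past step."""
    left = c_now[:MAX_CHARS]
    right = (action + "\n" + observation)[:MAX_CHARS]  # separator is assumed
    return softmax(logits_fn(left, right))[1]  # index 1 = "critical" class

# Stub classifier for illustration: records input sizes, returns fixed logits.
seen = {}
def stub_logits(left: str, right: str) -> List[float]:
    seen["left"], seen["right"] = len(left), len(right)
    return [0.0, math.log(3.0)]

p = score_step("x" * 2000, "click[Buy Now]", "y" * 3000, stub_logits)
```

With a real classifier, `logits_fn` would tokenize the pair and run one forward pass; the truncation and class selection stay the same.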

Given budget $B = \lfloor \rho\,|\tau_t| \rfloor$ already partly spent by $\mathcal{K}$, Algorithm [1](https://arxiv.org/html/2605.26596#alg1) fills the remainder by inserting non-floor past steps in descending $p_i$ order, skipping any that would exceed budget. Surviving steps are reinserted in chronological order; consecutive dropped steps are replaced by a single "[… $N$ step(s) elided …]" marker.

**Algorithm 1** Greedy char-budget fill (single step).

**Input:** past steps $\{s_i\}_{i=1}^{N}$, scores $\{p_i\}$, floor $\mathcal{K}$, budget $B$

**Output:** compressed trajectory $\tilde{\tau}_t$

1. $\mathcal{S} \leftarrow \mathcal{K}$
2. $B_u \leftarrow \sum_{x \in \mathcal{K}} |x|$
3. $\mathcal{R} \leftarrow \{i : s_i \notin \mathcal{K}\}$ sorted by $p_i$ descending
4. **for** $i \in \mathcal{R}$ **do**
5. **if** $B_u + |s_i| \leq B$ **then**
6. $\mathcal{S} \leftarrow \mathcal{S} \cup \{s_i\}$
7. $B_u \leftarrow B_u + |s_i|$
8. **end if**
9. **end for**
10. assemble $\tilde{\tau}_t$ from $\mathcal{S}$ in chronological order
11. **return** $\tilde{\tau}_t$
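Algorithm 1 translates fairly directly into Python. The sketch below is an illustration under stated assumptions rather than the released code: steps are (action, observation) string pairs, `fixed_chars` is a hypothetical parameter holding the character cost of the fixed floor blocks, and the elision marker uses ASCII dots.

```python
from typing import List, Sequence, Tuple

Step = Tuple[str, str]  # (action, observation)

def greedy_fill(steps: Sequence[Step], scores: Sequence[float],
                floor_idx: Sequence[int], fixed_chars: int, budget: int) -> List[int]:
    """Return chronological indices of past steps kept under the char budget."""
    def size(i: int) -> int:
        return len(steps[i][0]) + len(steps[i][1])

    kept = set(floor_idx)                                  # S <- K
    used = fixed_chars + sum(size(i) for i in kept)        # B_u <- sum of |x|
    rest = sorted((i for i in range(len(steps)) if i not in kept),
                  key=lambda i: scores[i], reverse=True)   # R, by p_i descending
    for i in rest:
        if used + size(i) <= budget:                       # skip steps that overflow
            kept.add(i)
            used += size(i)
    return sorted(kept)

def assemble(steps: Sequence[Step], kept: Sequence[int]) -> List[str]:
    """Emit kept steps verbatim; collapse each run of dropped steps into one marker."""
    keep, out, gap = set(kept), [], 0
    for i, (action, obs) in enumerate(steps):
        if i in keep:
            if gap:
                out.append(f"[... {gap} step(s) elided ...]")
                gap = 0
            out.extend([action, obs])
        else:
            gap += 1
    if gap:
        out.append(f"[... {gap} step(s) elided ...]")
    return out

steps = [("a0", "o0" * 10), ("a1", "o1"), ("a2", "o2"), ("a3", "o3")]
kept = greedy_fill(steps, scores=[0.8, 0.3, 0.5, 0.1],
                   floor_idx=[3], fixed_chars=10, budget=20)
lines = assemble(steps, kept)
```

Here the highest-scoring step is too long to fit and is skipped, while a shorter, lower-scoring step is admitted. This is the "skip any that would exceed budget" behaviour, as opposed to stopping at the first overflow. If the floor alone already exceeds the budget, the loop adds nothing and the floor is returned unchanged.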

### 3.5 Counterfactual Training

The scorer is trained on labels that operationalize *"does this past step matter for the current decision?"* For each $(c_{\text{now}}, s_i)$ pair, we re-query the agent backbone $K = 8$ times at sampling temperature $T = 1.0$, half with the full trajectory and half with $s_i$ removed, canonicalize each sampled next action, and set

$$y_i \,=\, \frac{1}{K/2} \sum_{k=1}^{K/2} \mathbb{I}\!\left[\, \text{canon}(a^{\star}_k) \neq \text{canon}(a'_k) \,\right] \,\in\, [0, 1], \tag{4}$$

the fraction of paired rollouts whose canonicalized next action differs. Multi-rollout sampling ($K = 8$) absorbs decoder stochasticity that a single-shot binary $y_i \in \{0, 1\}$ would conflate with true criticality; §[4.4](https://arxiv.org/html/2605.26596#S4.SS4) (column Hard-CF) shows that replacing the soft $y_i$ with the binary indicator changes mr by $-0.059$ on average. Labels are generated on $21{,}523$ $(c_{\text{now}}, s_i)$ pairs from $1{,}244$ training trajectories spanning the three evaluation environments, with task initializations strictly disjoint from the evaluation tasks (Appendix [A](https://arxiv.org/html/2605.26596#A1)). The scorer is trained with a two-phase LP-FT schedule (5 epochs of linear probing with the backbone frozen and only the classification head trained, followed by 4 epochs of full fine-tuning) under soft binary cross-entropy, AdamW, batch size 16, and $10\%$ warmup + cosine decay; a trajectory-level train/validation split ensures no past step appears in both partitions. Full hyperparameters are in Appendix [B](https://arxiv.org/html/2605.26596#A2).
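As a worked illustration of Eq. (4), the sketch below computes the soft label from paired rollouts and the soft binary cross-entropy for a single prediction. The `canon` function is a hypothetical canonicalizer (lowercasing and whitespace collapsing); the source does not specify the actual canonicalization rules.

```python
import math
from typing import Sequence

def canon(action: str) -> str:
    # Hypothetical canonicalizer: lowercase and collapse whitespace.
    return " ".join(action.lower().split())

def soft_label(full_actions: Sequence[str], ablated_actions: Sequence[str]) -> float:
    """Fraction of paired rollouts whose canonical next action differs (Eq. 4).

    full_actions[k] is sampled with the full trajectory, ablated_actions[k]
    with step s_i removed; each list has K/2 entries.
    """
    if not full_actions or len(full_actions) != len(ablated_actions):
        raise ValueError("need equally many full and ablated rollouts")
    diffs = sum(canon(a) != canon(b) for a, b in zip(full_actions, ablated_actions))
    return diffs / len(full_actions)

def soft_bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy against a soft target y in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

full = ["click[Buy Now]", "click[buy now]", "search[shoes]", "click[Back]"]
ablated = ["click[Buy  Now]", "search[shoes]", "search[shoes]", "click[Next]"]
y = soft_label(full, ablated)  # K = 8 rollouts in total, K/2 = 4 pairs
loss = soft_bce(0.5, y)
```

Two of the four pairs differ after canonicalization, so the soft label is 0.5 rather than a hard 0 or 1.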

## 4 Experiments

Table 1: Main results across all 9 (env, backbone) cells. Top of each cell: mr (task-averaged reward, $n = 30$) with retention $\textit{mr}_{\text{method}} / \textit{mr}_{\text{No-Comp}}$ in parentheses; bottom: $95\%$ bootstrap CI on mr. Bold = best inference-free method per row; ⋆ = retention $> 100\%$; † = per-step LLM call. Column groups: inference-free (No-Comp through SelCtx) and LLM call per step† (HiAgent, AgentDiet, ACON). Summary block: Pass-$T$ rows count cells clearing retention threshold $T$ (Point). Eff.× range gives per-method min–max realized end-to-end compression; LongLingua's near-$1\times$ and Truncate's $\sim 1.5\times$ are silent-compression failures (Table [2](https://arxiv.org/html/2605.26596#S4.T2)).

| Env | Bb. | No-Comp | AGORA | Trunc | Floor-K2 | ObsMask | LongLingua | SelCtx | HiAgent† | AgentDiet† | ACON† |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALFWorld | qwen3.5-flash | .767 (-) | **.767 (100)** | **.767 (100)** | .600 (78) | .300 (64) | .467 (100) | .033 (7) | .400 (86) | .333 (71) | .400 (86) |
| | | [.60, .90] | [.60, .90] | [.60, .90] | [.43, .77] | [.13, .47] | [.30, .63] | [.00, .10] | [.23, .57] | [.17, .50] | [.23, .57] |
| ALFWorld | gpt-4o-mini | .300 (-) | **.333 (111)⋆** | .267 (89) | .233 (78) | .100 (43) | .100 (43) | .000 (0) | .100 (43) | .167 (71) | .167 (71) |
| | | [.13, .47] | [.17, .50] | [.13, .43] | [.10, .40] | [.00, .23] | [.00, .23] | [.00, .00] | [.00, .23] | [.07, .30] | [.03, .30] |
| ALFWorld | gpt-5-mini | .767 (-) | **.633 (83)** | **.633 (83)** | .500 (65) | .267 (53) | .467 (93) | .000 (0) | .400 (80) | .267 (53) | .233 (47) |
| | | [.60, .90] | [.47, .80] | [.47, .80] | [.33, .67] | [.13, .43] | [.30, .63] | [.00, .00] | [.23, .57] | [.13, .43] | [.10, .40] |
| ScienceWorld | qwen3.5-flash | .230 (-) | .210 (91) | **.285 (124)⋆** | .147 (64) | .161 (82) | .201 (103)⋆ | .114 (58) | .164 (84) | .147 (75) | .146 (75) |
| | | [.15, .32] | [.13, .31] | [.18, .40] | [.10, .20] | [.10, .23] | [.14, .28] | [.07, .16] | [.11, .23] | [.08, .23] | [.08, .23] |
| ScienceWorld | gpt-4o-mini | .173 (-) | .155 (89) | .156 (90) | .164 (95) | **.154 (117)⋆** | .144 (110)⋆ | .091 (69) | .146 (111)⋆ | .129 (98) | .132 (100) |
| | | [.11, .25] | [.11, .20] | [.11, .21] | [.11, .23] | [.09, .23] | [.09, .21] | [.06, .13] | [.08, .23] | [.08, .19] | [.08, .20] |
| ScienceWorld | gpt-5-mini | .298 (-) | .217 (73) | **.319 (107)⋆** | .188 (63) | .215 (93) | .229 (99) | .114 (49) | .206 (89) | .168 (72) | .172 (74) |
| | | [.21, .40] | [.15, .30] | [.22, .43] | [.12, .27] | [.14, .31] | [.16, .31] | [.07, .16] | [.12, .31] | [.10, .25] | [.11, .25] |
| WebShop | qwen3.5-flash | .499 (-) | .468 (94) | .538 (108)⋆ | **.561 (112)⋆** | .594 (120)⋆ | .372 (75) | .000 (0) | .484 (98) | .567 (114)⋆ | .533 (107)⋆ |
| | | [.33, .67] | [.30, .63] | [.37, .70] | [.41, .71] | [.44, .75] | [.23, .52] | [.00, .00] | [.33, .64] | [.43, .71] | [.37, .70] |
| WebShop | gpt-4o-mini | .293 (-) | .266 (91) | .267 (91) | **.301 (103)⋆** | .312 (223)⋆ | .113 (81) | .000 (0) | .278 (198)⋆ | .213 (152)⋆ | .227 (162)⋆ |
| | | [.16, .44] | [.13, .41] | [.15, .39] | [.16, .45] | [.20, .43] | [.03, .21] | [.00, .00] | [.16, .41] | [.09, .35] | [.12, .35] |
| WebShop | gpt-5-mini | .479 (-) | .431 (90) | .359 (75) | **.509 (106)⋆** | .507 (90) | .623 (111)⋆ | .000 (0) | .408 (73) | .637 (114)⋆ | .033 (6) |
| | | [.32, .64] | [.26, .60] | [.21, .52] | [.34, .68] | [.35, .67] | [.47, .77] | [.00, .00] | [.25, .57] | [.49, .77] | [.00, .10] |
| Pass retention ≥ 75% (Point) | | | **8/9** | 9/9 | 6/9 | 6/9 | 8/9 | 0/9 | 7/9 | 5/9 | 5/9 |
| Realized Eff.× range (9 cells) | | | 1.0–11.5 | ∼1.5 | 1.9–6.1 | 1.3–4.8 | **0.9–1.6** | 1.3–4.7 | 2.4–8.3 | 2.6–7.9 | 3.0–14.0 |

### 4.1 Experimental Setup

#### Environments and backbones\.

We evaluate AGORA on three text-based agent benchmarks, ALFWorld (Shridhar et al. [2020](https://arxiv.org/html/2605.26596#bib.bib25)), ScienceWorld (Wang et al. [2022](https://arxiv.org/html/2605.26596#bib.bib30)), and WebShop (Yao et al. [2022](https://arxiv.org/html/2605.26596#bib.bib35)), paired with three production LLMs: qwen3.5-flash, gpt-4o-mini, and gpt-5-mini. This yields $3 \times 3 = 9$ (environment, backbone) cells; we hold out 30 evaluation tasks per environment, strictly disjoint from the $1{,}244$ trajectories used to generate AGORA's training labels (Appendix [A](https://arxiv.org/html/2605.26596#A1)).

#### Protocol\.

All rollouts use temperature 0 and an episode budget of 30 steps with fixed initialization seeds. Task initializations are shared across compression conditions within each cell. The AGORA scorer is a single RoBERTa-base (125M parameters) classifier with $K_{\text{recent}} = 2$ force-kept steps and target keep ratio $0.25$, held fixed across all 9 cells (no per-cell tuning). Training labels are derived from $K = 8$ multi-rollout counterfactual sampling on the $1{,}244$ training trajectories; the full label-generation protocol and hyperparameter values are in Appendix [B](https://arxiv.org/html/2605.26596#A2).

#### Baselines\.

We compare AGORA against ten methods, grouped by whether the compressor itself issues an LLM call. *Inference-free*: No-Comp (uncompressed; quality ceiling), Random-Step (uniform random selection at the target keep ratio), Truncate-2048 (hard last-$N$-token truncation), Floor-K2 (structural heuristic that keeps only the last 2 past steps, without learned scoring), ObsMask (Lindenbauer et al. [2025](https://arxiv.org/html/2605.26596#bib.bib15)), LongLLMLingua (Jiang et al. [2024](https://arxiv.org/html/2605.26596#bib.bib8)), and Selective Context (Li et al. [2023b](https://arxiv.org/html/2605.26596#bib.bib12)). *LLM-based*: HiAgent (Hu et al. [2025](https://arxiv.org/html/2605.26596#bib.bib6)), AgentDiet (Xiao et al. [2025](https://arxiv.org/html/2605.26596#bib.bib32)), and ACON (Kang et al. [2025](https://arxiv.org/html/2605.26596#bib.bib9)). All budgeted methods target the same 4× nominal compression ratio (Jiang et al. [2023](https://arxiv.org/html/2605.26596#bib.bib7)).

#### Metrics\.

We report four per-cell numbers: *mean reward* (*mr*, $n=30$), *retention* ($\textit{mr}_{\text{method}}/\textit{mr}_{\text{No-Comp}}$), *effective compression ratio* (end-to-end token reduction from API logs), and *cost* (¥/task end-to-end, including any auxiliary compressor LLM calls). 95% bootstrap CIs use 10k resamples (Appendix [B](https://arxiv.org/html/2605.26596#A2)); significance is assessed with paired Wilcoxon signed-rank tests under Holm–Bonferroni correction.
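The retention and confidence-interval definitions above can be illustrated with a minimal sketch. It assumes per-task rewards are available as a plain list; the function names `retention` and `bootstrap_ci`, the fixed `seed`, and the sample rewards are hypothetical and are not taken from the released harness.

```python
import random
from statistics import mean


def retention(mr_method: float, mr_no_comp: float) -> float:
    """Retention = mr_method / mr_No-Comp; undefined for a zero baseline."""
    if mr_no_comp == 0:
        raise ValueError("retention is undefined when the No-Comp reward is 0")
    return mr_method / mr_no_comp


def bootstrap_ci(rewards, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-method bootstrap CI for the mean of per-task rewards."""
    rng = random.Random(seed)
    n = len(rewards)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(rewards, k=n)) for _ in range(n_resamples))
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lo, hi


if __name__ == "__main__":
    # Made-up rewards for one cell (n = 30 tasks), for illustration only.
    rewards = [1.0] * 10 + [0.0] * 20
    print(retention(0.333, 0.300))  # mr values quoted in Section 4.3
    print(bootstrap_ci(rewards))
```

With only 30 tasks per cell the resampled means spread widely, which is one way to see why the intervals reported in the paper are broad.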

Table 2: Realized compression ratio and per-task cost for the 5 working methods across 9 cells (LongLLMLingua and Selective Context omitted as paradigm-fail, Table [1](https://arxiv.org/html/2605.26596#S4.T1)). Eff.×: end-to-end token compression vs No-Comp. \$/task: per-task spend (USD), inclusive of any per-step compressor LLM tokens. AGORA columns shaded. †: per-step LLM call. Per-method Eff.× ranges across the 9 cells: AGORA 1.0–11.5×, ObsMask 1.3–4.8×, HiAgent 2.4–8.3×, AgentDiet 2.6–7.9×, ACON 3.0–14.0×. LLM-based methods spend 8–126k meta-tokens per task on their own compressor call (inference-free methods spend 0). LLM-based competitors beat AGORA on \$/task in 8/9 cells each; ObsMask wins 4/9.

Table 3: Component ablation across all 9 cells. Each variant disables one design choice in AGORA-M22 and reuses the same deploy protocol ($n=30$ tasks/cell, $\rho=0.25$, $K_{\text{recent}}=2$ when applicable, identical task initializations, seed 47). Top of each cell: *mr*; bottom: 95% bootstrap CI. Bold: best inference-free ablation per row. ∗: paired Wilcoxon vs AGORA-M22 has $p<0.05$ on the 30 paired tasks. Shaded column = AGORA-M22 (the deployed model). Mean and $\Delta$ rows aggregate across the 9 cells.

### 4.2 Main Results: Quality Retention

![Refer to caption](https://arxiv.org/html/2605.26596v1/x2.png)

Figure 2: (A) Cost–quality Pareto frontier across the 9 cells (per-env facets; dashed = per-env Pareto, excluding No-Comp and Selective Context). AGORA (deep blue) leads on the inference-free frontier at the 4× operating point, dominating Floor-K2 at matched compression and Pareto-improving over Truncate-2048 (which only realizes ∼1.5×). (B) Realized end-to-end Eff.× per method (log $y$, ordered by AGORA's ratio). The shaded band is AGORA's adaptive 1.0–11.5× end-to-end range emerging from a single fixed scorer keep-ratio $\rho=0.25$.

At the 4× compression target, AGORA achieves average *mr* = 0.387 (≈ 92% retention vs No-Comp), outperforming structural Floor-K2 (*mr* = 0.356, 85%) by +7 pp at matched compression. Truncate-2048's apparently higher *mr* = 0.399 reflects only ≈ 1.5× realized compression (§[4.3](https://arxiv.org/html/2605.26596#S4.SS3)), so it is not Pareto-comparable at the 4× operating point. AGORA clears ≥ 75% retention in 8/9 cells (the exception is ScienceWorld × gpt-5-mini at 73%); no other inference-free method at 4× achieves this. We focus on retention because the deployment question is how many cells a compressor keeps at or above 75% of uncompressed performance at the target budget, not absolute improvement.

At $n=30$ tasks per cell, 95% CIs are wide (1–2 cells clear any threshold by CI-lower across all methods); we therefore report point estimates.

The two token-level methods fail in opposite ways. LongLLMLingua fails to compress: set to 4× but realized at 0.9–1.6×, so its competitive retention reflects an essentially uncompressed input rather than a working method. Selective Context fails to act: it does hit 1.3–4.7× real compression, yet collapses the agent on every cell, because token-level self-information selection is structurally incompatible with agent action grammar (e.g., dropping tokens from `search[men's shoes size 10]` yields ungrammatical WebShop actions). We replicate this paradigm failure across 17 cells in §[4.5](https://arxiv.org/html/2605.26596#S4.SS5).

Figure [2](https://arxiv.org/html/2605.26596#S4.F2)(A) places AGORA on the inference-free Pareto frontier in every env facet; (B) shows that a single fixed $\rho=0.25$ produces a 1.0–11.5× end-to-end range across cells, a region LLM-based methods reach only with per-step LLM overhead (Table [2](https://arxiv.org/html/2605.26596#S4.T2)).

### 4.3 Compression Ratio and Cost

Table [2](https://arxiv.org/html/2605.26596#S4.T2) decomposes per-task cost into realized end-to-end Eff.× and \$/task (cost formula, pricing constants, and the per-task Eff.× definition are in Appendix [B](https://arxiv.org/html/2605.26596#A2.SS0.SSS0.Px7)). Although every budgeted method was targeted at the same nominal 4×, realized ratios diverge sharply: AGORA spans 1.0–11.5× with no method-side adjustment, adapting to the trajectory regime (peak on ScienceWorld × gpt-5-mini, where past observations dominate the input; minimum on the short ALFWorld × qwen3.5-flash cell, where the structural floor already covers most of the context). ACON reaches a higher *maximum* (3.0–14.0×), but pays 14–94k meta-tokens per task for it; ObsMask's position-only rule yields a narrower 1.3–4.8×. What distinguishes AGORA is the combination of a wide adaptive range *at zero meta-overhead*, which we attribute in §[4.4](https://arxiv.org/html/2605.26596#S4.SS4) to the learned scorer treating the keep-ratio as a soft prior rather than a hard budget.

The three LLM-based methods each beat AGORA on \$/task in 7–8 of 9 cells: their per-step LLM toll is small relative to the savings from aggressive compression. AGORA is therefore not a universal cost winner. Its differentiator is *zero meta-token overhead*: every other competitive method consumes 8–126k extra tokens per task on its own compressor LLM call, and AGORA consumes zero. The deployments where this matters are exactly those where retention is the binding constraint and the per-step LLM toll cannot be absorbed. ALFWorld × gpt-4o-mini is illustrative: AGORA actually *exceeds* No-Comp quality (*mr* = 0.333 vs 0.300, 111% retention) at 1.7× compression with zero compressor overhead, while every LLM-based competitor sits at *mr* ≤ 0.167 on this cell. At the other extreme (long-trajectory WebShop, where aggressive LLM-based compression yields large savings), ACON or AgentDiet are reasonable choices; our results should not be read as universal advocacy for inference-free compression. We return to this characterization in §[5](https://arxiv.org/html/2605.26596#S5).

### 4.4 Component Ablation

Table [3](https://arxiv.org/html/2605.26596#S4.T3) isolates the contribution of each design choice in AGORA-M22 by disabling one at a time and re-deploying on the same 9-cell, 30-task grid (paired by task initialization, seed 47). −Floor disables the $K_{\text{recent}}=2$ structural always-keep window; Hard-CF replaces the $K=8$ multi-rollout soft labels with single-shot binary ones; −Scorer keeps only the floor and the parser (no learned step relevance); Rand-Step replaces the scorer with uniform random scores at the same target $\rho=0.25$; Trunc-2048 is a token-level baseline that ignores step structure entirely.

The decomposition shows a consistent hierarchy of contributions. **Removing the floor is the costliest single change** (−0.088 *mr* on average, 4/9 cells significant), because the floor underwrites the basic "the last two steps are always present" invariant that downstream action-selection assumes. **Multi-rollout soft labels matter more than the scorer's structural choice** (−0.059 for Hard-CF vs −0.031 for −Scorer): replacing $K=8$ soft labels with $K=1$ hard binary labels loses signal that single-shot sampling treats as label noise. **Removing the scorer entirely costs only −0.031 on average**: structurally, −Scorer is Floor-K2 (Table [1](https://arxiv.org/html/2605.26596#S4.T1)), so the scorer's contribution is primarily *realized compression range* (1.0–11.5×), not *mr*. **Random scoring is catastrophic** (−0.344, 9/9 cells significant), confirming that the scorer's signal (even at modest mean Δ*mr*) is the load-bearing component once the floor is in place.

Trunc-2048's apparent parity (+0.012 *mr*) is bought with ≈ 1.5× realized compression, far below the 4× target and AGORA-M22's 1.0–11.5× range, so it is not Pareto-comparable here. The ablation hierarchy is sharpest on ALFWorld and weakest on WebShop, where the structural floor is already so competitive that −Scorer occasionally improves *mr* (e.g., 0.561 vs 0.468 on WebShop × qwen3.5-flash, mirroring Table [1](https://arxiv.org/html/2605.26596#S4.T1), where Floor-K2 wins this cell); the scorer's value concentrates on shorter trajectories where per-step relevance signal is strongest.

### 4.5 Paradigm Failure of Token-Level Compression

Token-level extractive compression, the dominant paradigm for general LLM context reduction, is structurally inappropriate for agent settings. We provide the first systematic evidence across 17 (environment, backbone, method) cells spanning two independent token-level method families. Selective Context (Li et al. [2023b](https://arxiv.org/html/2605.26596#bib.bib12)) fails on all 6/6 cells of our main grid. LLMLingua-2 (Pan et al. [2024](https://arxiv.org/html/2605.26596#bib.bib21)) fails on all 11/11 cells of an extended audit (off-the-shelf with `rate=0.25` on 6 cells, plus a 4×-retrained variant on 5 cells; both variants collapse identically). Aggregated, 17/17 cells collapse to *mr* ≤ 0.05 despite achieving genuine 1.3–13.3× realized compression. The compressor works as advertised at the token level; the agent does not.

The failure mechanism is action-grammar destruction. Token-level methods rank by self-information or perplexity, but the tokens that carry action semantics in agent trajectories (item identifiers, bracket delimiters, action verbs, prepositional connectors) are precisely the tokens that language-model statistics score lowest, because they are highly predictable from natural-text context. The compressor reliably removes them; the residual text is grammatical English but ungrammatical as agent commands, such as `search men's shoes 10` or `click[]`, which the environment rejects regardless of task content. This rules out hyperparameter remedies (both LLMLingua-2 variants fail identically) and task-specific artifacts (failure is task-independent; Appendix [A](https://arxiv.org/html/2605.26596#A1) shows no phase-B contamination on this audit). The per-cell breakdown of all 17 cells is given in Appendix [C](https://arxiv.org/html/2605.26596#A3).
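To make the rejection concrete, the following minimal sketch checks candidate strings against a WebShop-style action grammar. The regular expression is an assumption for illustration (a `search` or `click` verb followed by a non-empty bracketed argument); it is not the environment's actual parser, and `is_valid_action` is a hypothetical helper.

```python
import re

# Hypothetical WebShop-style grammar: an action verb followed by a
# non-empty bracketed argument, e.g. search[...] or click[...].
ACTION_RE = re.compile(r"(search|click)\[[^\[\]]+\]")


def is_valid_action(action: str) -> bool:
    """Return True if the whole string matches the assumed action grammar."""
    return ACTION_RE.fullmatch(action.strip()) is not None


for candidate in ["search[men's shoes size 10]", "search men's shoes 10", "click[]"]:
    print(candidate, "->", is_valid_action(candidate))
```

Under this assumed grammar the intact action passes, while the two compressed residues quoted in the text fail because the brackets or their contents were dropped.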

This paradigm-level diagnosis motivates the design space AGORA occupies: *step-level* compression (operating on intact actions and observations rather than tokens), with an *inference-free* implementation (no per-step LLM call). Within that design space, AGORA's two budget knobs ($\rho=0.25$, $K_{\text{recent}}=2$) were fixed across all 9 cells before evaluation, inherited from training-time defaults; no per-cell tuning was performed, so the cost–retention frontier reported in Table [2](https://arxiv.org/html/2605.26596#S4.T2) is the one practitioners see at deploy time.

## 5 Conclusion

This paper diagnoses why general token-level prompt compression fails on LLM agents (*action-grammar destruction*: tokens carrying action semantics are exactly those that self-information ranks lowest) and proposes AGORA, an inference-free step-level alternative whose quality guarantee sits on a structural floor and whose learned scorer supplies adaptive end-to-end compression. We expect this attribution to generalize to richer agent settings where action grammars are stricter still.

Limitations. $n=30$ per cell gives wide 95% CIs (point estimates only); criticality is partly backbone-specific.

## Acknowledgments

This work was funded by AI Agent Technologies \(Hong Kong\) Limited\.

## References

- Cemri, M.; Pan, M. Z.; Yang, S.; Agrawal, L. A.; Chopra, B.; Tiwari, R.; Keutzer, K.; Parameswaran, A.; Klein, D.; Ramchandran, K.; et al. 2026. Why do multi-agent llm systems fail? *Advances in Neural Information Processing Systems*, 38.
- Chevalier, A.; Wettig, A.; Ajith, A.; and Chen, D. 2023. Adapting language models to compress contexts. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, 3829–3846.
- Cooper, W. S. 1971. A definition of relevance for information retrieval. *Information storage and retrieval*, 7(1): 19–37.
- Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; and Silvestri, F. 2024. The power of noise: Redefining retrieval for rag systems. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*, 719–729.
- Ge, T.; Hu, J.; Wang, L.; Wang, X.; Chen, S.-Q.; and Wei, F. 2023. In-context autoencoder for context compression in a large language model. *arXiv preprint arXiv:2307.06945*.
- Hu, M.; Chen, T.; Chen, Q.; Mu, Y.; Shao, W.; and Luo, P. 2025. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 32779–32798.
- Jiang, H.; Wu, Q.; Lin, C.-Y.; Yang, Y.; and Qiu, L. 2023. Llmlingua: Compressing prompts for accelerated inference of large language models. In *Proceedings of the 2023 conference on empirical methods in natural language processing*, 13358–13376.
- Jiang, H.; Wu, Q.; Luo, X.; Li, D.; Lin, C.-Y.; Yang, Y.; and Qiu, L. 2024. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 1658–1677.
- Kang, M.; Chen, W.-N.; Han, D.; Inan, H. A.; Wutschitz, L.; Chen, Y.; Sim, R.; and Rajmohan, S. 2025. Acon: Optimizing context compression for long-horizon llm agents. *arXiv preprint arXiv:2510.00615*.
- LeVine, W.; and Varjavand, B. 2025. Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking. *arXiv preprint arXiv:2504.07104*.
- Li, X.; Lv, K.; Yan, H.; Lin, T.; Zhu, W.; Ni, Y.; Xie, G.; Wang, X.; and Qiu, X. 2023a. Unified demonstration retriever for in-context learning. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 4644–4668.
- Li, Y.; Dong, B.; Guerin, F.; and Lin, C. 2023b. Compressing context to enhance inference efficiency of large language models. In *Proceedings of the 2023 conference on empirical methods in natural language processing*, 6342–6353.
- Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; and Chen, D. 2024. Snapkv: Llm knows what you are looking for before generation. *Advances in Neural Information Processing Systems*, 37: 22947–22970.
- Li, Z.; Liu, Y.; Su, Y.; and Collier, N. 2025. Prompt compression for large language models: A survey. In *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, 7182–7195.
- Lindenbauer, T.; Slinko, I.; Felder, L.; Bogomolov, E.; and Zharov, Y. 2025. The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management. *arXiv preprint arXiv:2508.21433*.
- Liu, N. F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; and Liang, P. 2024a. Lost in the middle: How language models use long contexts. *Transactions of the association for computational linguistics*, 12: 157–173.
- Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. 2024b. Agentbench: Evaluating llms as agents. In *International Conference on Learning Representations*, volume 2024, 52989–53046.
- Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; and Zettlemoyer, L. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, 11048–11064.
- Mu, J.; Li, X.; and Goodman, N. 2023. Learning to compress prompts with gist tokens. *Advances in Neural Information Processing Systems*, 36: 19327–19352.
- Packer, C.; Fang, V.; Patil, S. G.; Lin, K.; Wooders, S.; and Gonzalez, J. E. 2023. MemGPT: towards LLMs as operating systems. *arXiv preprint arXiv:2310.08560*.
- Pan, Z.; Wu, Q.; Jiang, H.; Xia, M.; Luo, X.; Zhang, J.; Lin, Q.; Rühle, V.; Yang, Y.; Lin, C.-Y.; et al. 2024. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In *Findings of the Association for Computational Linguistics: ACL 2024*, 963–981.
- Saracevic, T. 1975. Relevance: A review of and a framework for the thinking on the notion in information science. *Journal of the American Society for information science*, 26(6): 321–343.
- Sclar, M.; Choi, Y.; Tsvetkov, Y.; and Suhr, A. 2024. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. In *International Conference on Learning Representations*, volume 2024, 25055–25083.
- Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; and Yao, S. 2023. Reflexion: Language agents with verbal reinforcement learning. *Advances in neural information processing systems*, 36: 8634–8652.
- Shridhar, M.; Yuan, X.; Côté, M.-A.; Bisk, Y.; Trischler, A.; and Hausknecht, M. 2020. Alfworld: Aligning text and embodied environments for interactive learning. *arXiv preprint arXiv:2010.03768*.
- Tam, Z. R.; Wu, C.-K.; Tsai, Y.-L.; Lin, C.-Y.; Lee, H.-y.; and Chen, Y.-N. 2024. Let me speak freely? a study on the impact of format restrictions on performance of large language models. *arXiv preprint arXiv:2408.02442*.
- Voronov, A.; Wolf, L.; and Ryabinin, M. 2024. Mind your format: Towards consistent evaluation of in-context learning improvements. In *Findings of the Association for Computational Linguistics: ACL 2024*, 6287–6310.
- Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023. Voyager: An open-ended embodied agent with large language models. *arXiv preprint arXiv:2305.16291*.
- Wang, L.; Yang, N.; and Wei, F. 2024. Learning to retrieve in-context examples for large language models. In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)*, 1752–1767.
- Wang, R.; Jansen, P.; Côté, M.-A.; and Ammanabrolu, P. 2022. Scienceworld: Is your agent smarter than a 5th grader? In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, 11279–11298.
- Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2024. Efficient streaming language models with attention sinks. In *International Conference on Learning Representations*, volume 2024, 21875–21895.
- Xiao, Y.-A.; Gao, P.; Peng, C.; and Xiong, Y. 2025. Reducing Cost of LLM Agents with Trajectory Reduction. *arXiv preprint arXiv:2509.23586*.
- Xu, F.; Shi, W.; and Choi, E. 2024. Recomp: Improving retrieval-augmented lms with context compression and selective augmentation. In *International Conference on Learning Representations*, volume 2024, 43478–43502.
- Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2026. A-mem: Agentic memory for llm agents. *Advances in Neural Information Processing Systems*, 38: 17577–17604.
- Yao, S.; Chen, H.; Yang, J.; and Narasimhan, K. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. *Advances in Neural Information Processing Systems*, 35: 20744–20757.
- Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Re, C.; Barrett, C.; et al. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. *Advances in Neural Information Processing Systems*, 36: 34661–34710.
- Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.-J.; and Huang, G. 2024. Expel: Llm agents are experiential learners. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, 19632–19642.

## Appendix A Data Integrity and Bug Disclosure

Every reported cell is validated by four automated checks before admission to the aggregate\.

#### \(i\) Task\-ID disjointness\.

Each cell's 30 evaluation task IDs share no overlap with the 1,244 trajectory IDs used to generate training labels. The disjoint split is precomputed (`eval_split_v2_disjoint.json`); the auditor recomputes the intersection at run time and blocks the run on any non-empty intersection.

#### \(ii\) Trajectory\-hash uniqueness\.

Within each chunk of 5 tasks, full trajectories (system prompt, observations, actions, rewards) are hashed; duplicate hashes flag environment-adapter bugs that ignore the task ID and replay the same internal state.

#### \(iii\) Compression\-ratio drift\.

The realized end-to-end Eff.× of each cell is compared against its nominal target (4× for budgeted methods); cells outside the 0.5×–2× band are surfaced rather than suppressed.

#### \(iv\) Action\-loop and cutoff detection\.

The auditor counts the longest run of identical consecutive actions (≥ 5 flagged as a loop) and the rate of mid-token response cutoffs (flagged as API-budget failures). Both flags are reported with the cell, not removed.
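Checks (i) and (iv) are simple enough to sketch directly. The helper names below (`assert_disjoint`, `longest_action_run`, `is_action_loop`) are hypothetical and only illustrate the logic described above; the released auditor may be organized differently.

```python
from itertools import groupby


def assert_disjoint(eval_ids, train_ids):
    """Check (i): block the run if any evaluation task ID was used for training."""
    overlap = set(eval_ids) & set(train_ids)
    if overlap:
        raise RuntimeError(f"eval/train overlap: {sorted(overlap)}")


def longest_action_run(actions):
    """Length of the longest run of identical consecutive actions."""
    return max((sum(1 for _ in group) for _, group in groupby(actions)), default=0)


def is_action_loop(actions, threshold=5):
    """Check (iv): a run of at least `threshold` identical actions is a loop."""
    return longest_action_run(actions) >= threshold
```

A flagged loop would be reported alongside the cell rather than used to drop it, matching the policy stated above.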

## Appendix B Hyperparameters and Prompts

#### Compression\.

The scorer is `roberta-base` (125M parameters, 12 layers, hidden size 768), with pair inputs formatted as `(current_obs, past_step)` and truncated to 512 tokens. AGORA uses force-keep window $K_{\text{recent}}=2$ and target keep ratio $\rho=0.25$ (soft prior, not a hard budget). Per-environment atom-parser rules are in the released code repository.

#### Scorer training\.

The scorer is trained once on 21,523 counterfactual (anchor, past-step) labels derived from 1,244 trajectories across the three environments. Labels are *soft* ($\in[0,1]$): for each $(c_{\text{now}}, s_i)$ pair we re-query the backbone $K=8$ times at temperature $T=1.0$ with and without $s_i$, and set $y_i$ to the fraction of paired rollouts whose canonicalized next action differs (§[3.5](https://arxiv.org/html/2605.26596#S3.SS5)). Multi-rollout sampling absorbs decoder stochasticity that would otherwise inject noise into a single-shot binary label. Training follows a two-phase LP-FT schedule: 5 epochs of linear probing (backbone frozen, head only, AdamW lr $2\mathrm{e}{-4}$) followed by 4 epochs of full fine-tuning (AdamW lr $1\mathrm{e}{-5}$), with batch size 16, 10% warmup + cosine decay, soft binary cross-entropy loss, validation split 0.2, and seed 42. Trajectory-level train/val splits prevent any past step from appearing in both partitions.
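The soft-label construction and loss can be sketched as follows. This is a minimal illustration under stated assumptions: rollouts are paired by index, `canonicalize` is a stand-in (lowercasing and whitespace collapsing) for whatever canonicalization §3.5 defines, and the function names are hypothetical.

```python
import math


def canonicalize(action: str) -> str:
    """Assumed canonicalization: lowercase and collapse whitespace."""
    return " ".join(action.lower().split())


def soft_label(actions_with, actions_without):
    """Fraction of paired rollouts whose canonicalized next action differs."""
    if not actions_with or len(actions_with) != len(actions_without):
        raise ValueError("need two non-empty rollout lists of equal length")
    differs = sum(
        canonicalize(a) != canonicalize(b)
        for a, b in zip(actions_with, actions_without)
    )
    return differs / len(actions_with)


def soft_bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy against a soft target y in [0, 1]."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# K = 8 paired rollouts; the next action changes in 2 of them -> label 0.25.
with_step = ["click[buy now]"] * 8
without_step = ["click[buy now]"] * 6 + ["search[shoes]"] * 2
print(soft_label(with_step, without_step))
```

A single-shot binary label would map this same step to either 0 or 1 depending on which rollout happened to be sampled, which is the noise the multi-rollout scheme is meant to absorb.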

#### Evaluation\.

All rollouts use temperature 0 and a 30-step episode budget with a fixed initialization seed; within each cell the same task initializations are reused across compression conditions. Statistical protocol: bootstrap 95% percentile-method CIs (10k resamples per cell), paired Wilcoxon signed-rank tests on the 30 paired task scores ($\alpha=0.05$), and Holm–Bonferroni correction across the 63 pairwise tests in the main grid.
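For readers unfamiliar with the Holm–Bonferroni step-down procedure, a minimal sketch is given below. It takes raw p-values (for example, from the paired Wilcoxon tests) and returns one reject/retain decision per test; the function name and the example p-values are illustrative only.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject


print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

In the example, the two smallest p-values pass their thresholds (0.0125 and about 0.0167), the third (0.03 against 0.025) does not, and the procedure stops there.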

#### Compression context format\.

At step $t$, past steps $1,\ldots,t-1$ are either kept verbatim or replaced by an elision marker recording how many consecutive steps were dropped. The full context delivered to the agent is the system prompt + the rendered (possibly compressed) trajectory + the current observation. For a 6-step WebShop rollout where the scorer keeps step 2 and the $K_{\text{recent}}=2$ window force-keeps steps 5–6:

**WebShop, step 7 context**

- \[Step 1\] (elided)
- \[Step 2\] Action: `search[long sleeve men's country xx-large]` Obs: Page 1, 50 results, B00O30JLDK \$10.52–\$40.5 …
- \[Steps 3–4\] (2 step(s) elided)
- \[Step 5\] Action: `click[country]` Obs: color = country selected; size options listed …
- \[Step 6\] Action: `click[xx-large]` Obs: size = xx-large selected; Buy Now visible …

A parallel ScienceWorld example, where the scorer keeps an earlier mass-measurement step but elides intermediate setup actions:

**ScienceWorld, step 9 context**

- \[Step 1\] (elided)
- \[Step 2\] Action: `pick up metal block from cupboard` Obs: holding metal block, 22°C, 50 g.
- \[Steps 3–6\] (4 step(s) elided)
- \[Step 7\] Action: `put metal block in hot water` Obs: block now in beaker; water temp 85°C.
- \[Step 8\] Action: `wait 30 seconds` Obs: 30 seconds elapsed; block temp now 74°C.

Elided spans are merged into a single marker (e.g., `[Steps 3–6]`) to keep token overhead minimal; the scorer never sees raw token spans, only fully formed atoms.
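The rendering rule described above (kept steps verbatim, consecutive dropped steps merged into one marker) can be sketched as follows. The function `render_history`, its argument layout, and the use of an ASCII hyphen in the range marker are assumptions made for illustration, not the released implementation.

```python
def render_history(steps, keep):
    """Render past steps, merging consecutive dropped steps into one marker.

    steps: list of (action, observation) tuples for steps 1..t-1.
    keep:  set of 1-based step indices retained by the floor or the scorer.
    """
    lines = []
    i, n = 1, len(steps)
    while i <= n:
        if i in keep:
            action, obs = steps[i - 1]
            lines.append(f"[Step {i}] Action: {action} Obs: {obs}")
            i += 1
            continue
        # Extend the elided span while the next step is also dropped.
        j = i
        while j + 1 <= n and (j + 1) not in keep:
            j += 1
        if j == i:
            lines.append(f"[Step {i}] (elided)")
        else:
            lines.append(f"[Steps {i}-{j}] ({j - i + 1} step(s) elided)")
        i = j + 1
    return lines


# Six past steps; the scorer keeps step 2 and the recency floor keeps 5 and 6.
demo = [(f"a{k}", f"o{k}") for k in range(1, 7)]
print("\n".join(render_history(demo, keep={2, 5, 6})))
```

Merging spans means a long stretch of dropped steps costs a single short marker regardless of how many steps it covers.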

#### Atom parser examples\.

The parser converts a raw multi\-line step emitted by the environment into a single self\-contained atom that the scorer can score in isolation\. Per\-environment rules differ in how observations are tokenized; the three rule sets handle structural noise \(chain\-of\-thought, multi\-clause observations, separator tokens\) before the scorer ever sees an atom\.

**ALFWorld step 3**

- Raw: Thought: I should find an apple to put in the fridge. Action: `go to countertop 1` Observation: On the countertop 1, you see a apple 1, a bread 1, a knife 1, and a peppershaker 1.
- Atom: **action** go to countertop 1 **obs** countertop 1: apple 1, bread 1, knife 1, peppershaker 1

**WebShop step 5**

- Raw: Thought: I picked color and size; now buy. Action: `click[Buy Now]` Observation: Thank you for shopping with us! \[SEP\] Your code: \[SEP\] None \[SEP\] Purchased \[SEP\] asin \[SEP\] B00O30JLDK …
- Atom: **action** click\[Buy Now\] **obs** purchased asin=B00O30JLDK; color=country; size=xx-large

**ScienceWorld step 7**

- Raw: Thought: Need to measure temperature on the cooled sample. Action: `use thermometer in inventory on metal block` Observation: The thermometer reads 22 degrees Celsius. The block is at room temperature.
- Atom: **action** use thermometer on metal block **obs** temperature(metal block) = 22C; state = room_temperature

The parser drops the chain\-of\-thought \(recoverable from the action\), normalizes the observation into a flat key–value summary, and tags the action/observation halves so the scorer can attend to them separately\.

#### Scorer training samples\.

Each training example is a triple `(anchor, past_atom, label)` where `anchor` is the current step's observation, `past_atom` is a candidate atom from earlier in the same trajectory, and `label` ∈ {0, 1} indicates whether dropping that past atom changes the backbone's next-action prediction. A positive example (critical past step):

**label = 1: dropping the past step changes the next action**

- anchor (step 12, obs): You are at the workbench. The metal block is at 42°C.
- past_atom (step 4): **action** pick up metal block; **obs** holding metal block (room temp, 22°C).
- counterfactual: removing step 4 from history causes the backbone to issue `look at metal block` (exploratory) instead of `place metal block in cold bath` (correct).

A negative example \(past step is non\-critical\):

**label = 0: dropping the past step does not change behaviour**

- anchor (step 12, obs): You are at the workbench. The metal block is at 42°C.
- past_atom (step 6): **action** look at thermometer; **obs** thermometer is on the workbench, 20°C.
- counterfactual: removing step 6 from history leaves the next action `place metal block in cold bath` unchanged; the thermometer inspection was incidental.

#### Cost and compression accounting\.

Per-task cost in Table [2](https://arxiv.org/html/2605.26596#S4.T2) and Figure [2](https://arxiv.org/html/2605.26596#S4.F2) is computed end-to-end (main backbone tokens plus any per-step compressor LLM tokens) as

$$\text{\$/task} \;=\; \frac{T^{\text{in}} \cdot p^{\text{in}} \,+\, T^{\text{out}} \cdot p^{\text{out}}}{10^{6}},$$

where $T^{\text{in}}, T^{\text{out}}$ are mean input/output token counts per task and $p^{\text{in}}, p^{\text{out}}$ are the production API list prices in effect at submission time. We use the following per-backbone rates (¥ per million tokens, input / output): qwen3.5-flash 0.158 / 1.58; gpt-4o-mini 0.75 / 3.0; gpt-5-mini 1.25 / 10.0. USD figures in Figure [2](https://arxiv.org/html/2605.26596#S4.F2) convert at 7 ¥/USD. Effective compression ratio is reported as a per-task mean:

$$\text{Eff.}\times_{\text{cell}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{T^{\text{nc}}_{\text{in},i}}{T^{\text{m}}_{\text{in},i}},$$

paired by `task_id` between the method of interest (m) and the No-Comp reference (nc). We use the per-task mean rather than the aggregate ratio $\sum_{i}T^{\text{nc}}_{i}/\sum_{i}T^{\text{m}}_{i}$ because aggregation is dominated by a handful of long-trajectory tasks (typically WebShop × gpt-5-mini) and washes out the cell-level adaptivity that motivates Figure [2](https://arxiv.org/html/2605.26596#S4.F2)B; per-task averaging treats each task as an equally weighted evaluation event, which matches how a deployment operator reads "how much did we compress on average for a typical episode."
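The two accounting formulas, and the difference between the per-task mean and the aggregate ratio, can be illustrated with a short sketch. The function names and the token counts are made up for illustration; only the formulas and the gpt-4o-mini rate (0.75 / 3.0 per million tokens) come from the text above.

```python
def cost_per_task(t_in, t_out, p_in, p_out):
    """Per-task cost: (T_in * p_in + T_out * p_out) / 1e6, prices per million tokens."""
    return (t_in * p_in + t_out * p_out) / 1e6


def eff_per_task_mean(nc_tokens, m_tokens):
    """Mean over tasks of T_nc / T_m, paired by task_id (dict keys)."""
    task_ids = nc_tokens.keys() & m_tokens.keys()
    return sum(nc_tokens[t] / m_tokens[t] for t in task_ids) / len(task_ids)


def eff_aggregate(nc_tokens, m_tokens):
    """Aggregate ratio sum(T_nc) / sum(T_m) over the same paired tasks."""
    task_ids = nc_tokens.keys() & m_tokens.keys()
    return sum(nc_tokens[t] for t in task_ids) / sum(m_tokens[t] for t in task_ids)


# Two made-up tasks: a short one (no compression) and a long one (10x).
no_comp = {"short": 1_000, "long": 100_000}
method = {"short": 1_000, "long": 10_000}
print(eff_per_task_mean(no_comp, method))  # 5.5: each task weighted equally
print(eff_aggregate(no_comp, method))      # ~9.18: dominated by the long task
print(cost_per_task(1_000_000, 100_000, 0.75, 3.0))
```

The toy numbers show the effect described above: the aggregate ratio is pulled toward the long task, while the per-task mean reflects both episodes equally.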

#### Prompts and code release\.

Full system/user prompts per \(environment, backbone\) pair, the trained scorer checkpoints, the atom\-parser rules, and the evaluation harness are available in the released code repository \(linked after the abstract\)\.

## Appendix C Paradigm-Failure Cells: Per-cell Detail

Table [4](https://arxiv.org/html/2605.26596#A3.T4) expands the 17-cell paradigm-failure aggregate of §[4.5](https://arxiv.org/html/2605.26596#S4.SS5) into per-cell evidence. Three patterns substantiate the paradigm-level reading. First, within-family consistency: all 6 Selective Context cells collapse to $\textit{mr} \leq 0.033$ with 1.3–4.8$\times$ realized compression, and all 11 LLMLingua-2 cells (off-shelf and retrained 4$\times$ variants combined) collapse to $\textit{mr} \leq 0.001$ with 2.9–13.3$\times$ realized compression. Second, backbone-invariance: even gpt-5-mini fails at $\textit{mr} = 0$ on every token-level cell, ruling out a "stronger backbone tolerates token-level noise" counter-explanation. Third, action-grammar dependence: failure is sharpest on WebShop, where the action grammar is most rigid. There, WebShop $\times$ qwen3.5-flash $\times$ lingua2_retrained_4x attains 13.3$\times$ compression but 0.000 reward, the cleanest case of "compressor works as designed, agent rejects the output". The pattern is not a tuning artifact (covers two checkpoints of LLMLingua-2 including one retrained at the agent compression target), not a task artifact (failure is task-independent), and not a backbone artifact (covers four backbones from qwen3.5-flash to claude).

Table 4: Per-cell breakdown of the 17 token-level paradigm-failure cells aggregated in §[4.5](https://arxiv.org/html/2605.26596#S4.SS5). end2end ratio is $\text{tok}_{\text{method}}/\text{tok}_{\text{No-Comp}}$ (smaller = more aggressive compression). Every listed cell achieves true $\geq 2\times$ compression yet collapses mr to $\leq 0.05$, ruling out the alternative explanation that the compressor silently no-ops.

| Env | Backbone | Method variant | Eff.$\times$ | mr | end2end ratio |
|---|---|---|---|---|---|
| *Selective Context (6 cells)* | | | | | |
| ALFWorld | qwen3.5-flash | selctx_4x | 4.8 | 0.033 | 0.21 |
| ALFWorld | gpt-4o-mini | selctx_4x | 4.8 | 0.000 | 0.21 |
| ALFWorld | gpt-5-mini | selctx_4x | 3.2 | 0.000 | 0.31 |
| WebShop | qwen3.5-flash | selctx_4x | 1.3 | 0.000 | 0.78 |
| WebShop | gpt-4o-mini | selctx_4x | 2.6 | 0.000 | 0.39 |
| WebShop | gpt-5-mini | selctx_4x | 3.7 | 0.000 | 0.27 |
| *LLMLingua-2, off-shelf rate=0.25 (6 cells)* | | | | | |
| ALFWorld | qwen3.5-flash | lingua2_rate0.25 | 2.9 | 0.000 | 0.340 |
| ALFWorld | gpt-4o-mini | lingua2_rate0.25 | 4.0 | 0.000 | 0.247 |
| ALFWorld | gpt-5-mini | lingua2_rate0.25 | 2.9 | 0.000 | 0.345 |
| ALFWorld | claude | lingua2_rate0.25 | 3.7 | 0.000 | 0.272 |
| WebShop | gpt-5-mini | lingua2_rate0.25 | 5.6 | 0.000 | 0.180 |
| WebShop | claude | lingua2_rate0.25 | 8.3 | 0.001 | 0.121 |
| *LLMLingua-2, retrained 4$\times$ (5 cells)* | | | | | |
| ALFWorld | gpt-4o-mini | lingua2_retrained_4x | 3.4 | 0.000 | 0.296 |
| ALFWorld | claude | lingua2_retrained_4x | 3.8 | 0.000 | 0.265 |
| WebShop | qwen3.5-flash | lingua2_retrained_4x | **13.3** | 0.000 | 0.075 |
| WebShop | gpt-5-mini | lingua2_retrained_4x | 3.6 | 0.000 | 0.276 |
| WebShop | claude | lingua2_retrained_4x | 3.1 | 0.000 | 0.326 |

## Appendix D Cross-method Step-Overlap (Jaccard)

Table 5: Cross-method step-overlap (Jaccard) per cell, averaged over all compression calls. AGORA-Rand compares AGORA's kept-step set against Rand-Step's on identical (task, step) pairs; AGORA-OM compares against the deterministic last-$K{=}2$ positional mask used by ObsMask; Rand-OM compares Rand-Step against the same positional mask. The Mean row averages across the 9 cells. Both AGORA and Rand-Step unconditionally retain the $K_{\text{recent}}{=}2$ most recent positions, which sets a floor on overlap with ObsMask.

Two patterns are worth noting. First, the mean overlap $\mathrm{J}(\text{AGORA},\text{OM})=0.514$ is consistently higher than $\mathrm{J}(\text{AGORA},\text{Rand-Step})=0.473$: AGORA's kept-step set is more similar to a positional last-$K$ mask than to a uniform-random selection, even though Rand-Step shares the same $K_{\text{recent}}{=}2$ force-keep floor as AGORA. This is consistent with the component ablation in §[4.4](https://arxiv.org/html/2605.26596#S4.SS4): removing the structural floor costs $-0.088\,\textit{mr}$ on average, the largest single hit among all ablations, while removing the scorer costs only $-0.031$, so the floor accounts for most of the kept-step composition. Second, the cell with the highest $\mathrm{J}(\text{AGORA},\text{OM})=0.845$ is WebShop $\times$ gpt-5-mini, where AGORA and the positional baseline keep nearly identical step sets despite different scoring rules. This is further evidence that the scorer's value on long-trajectory cells is in *adaptive compression budget* rather than in selecting wildly different past steps.

Data computed by parsing the `[ASSISTANT]` blocks in the compressed user message of each `raw__*.jsonl` entry and text-matching them against the per-step `raw_response` field of the full trajectory; no scorer re-inference or additional LLM calls were needed.
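As a rough illustration of the overlap metric reported in Table 5, the following Python sketch computes the Jaccard index between kept-step sets and shows how a shared last-$K$ floor lifts the overlap. The helper names (`jaccard`, `last_k_mask`) and the kept-step sets are hypothetical and are not taken from the released harness.

```python
from typing import Iterable, Set


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of kept step indices."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def last_k_mask(n_steps: int, k: int = 2) -> Set[int]:
    """Positional baseline: keep only the k most recent step indices."""
    return set(range(max(0, n_steps - k), n_steps))


# Hypothetical compression call on a 10-step history (indices 0..9).
n_steps = 10
obs_mask_kept = last_k_mask(n_steps, k=2)   # {8, 9}
agora_kept = {3, 8, 9}                      # floor {8, 9} plus one scored step
rand_kept = {1, 8, 9}                       # floor {8, 9} plus one random step

j_agora_om = jaccard(agora_kept, obs_mask_kept)    # 2 shared / 3 in union
j_agora_rand = jaccard(agora_kept, rand_kept)      # 2 shared / 4 in union
j_rand_om = jaccard(rand_kept, obs_mask_kept)      # 2 shared / 3 in union

print(f"J(AGORA, OM)   = {j_agora_om:.3f}")
print(f"J(AGORA, Rand) = {j_agora_rand:.3f}")
print(f"J(Rand, OM)    = {j_rand_om:.3f}")
```

Because every method in this toy call keeps the two most recent positions, no pairwise overlap can fall to zero, which is the floor effect noted in the Table 5 caption. Per-cell figures would then be the mean of such values over all compression calls.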
