Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Summary

Introduces CICL, a decision-aware context layer that selects and compresses evidence for tool-using LLM agents by treating context as a decision-time intervention, using counterfactual-inspired scoring and typed memory cards under a token budget. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-plus reranking of BM25 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790; controlled diagnostics show action-criticality, while RepoBench-R summaries still beat cards.

arXiv:2606.08151v1 Announce Type: new Abstract: Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.

# Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
Source: [https://arxiv.org/html/2606.08151](https://arxiv.org/html/2606.08151)
Alibaba Group, China

guanhan.gxy@alibaba-inc.com, zhaoqianyang.zqy@alibaba-inc.com, finaldreamer@qq.com

Corresponding author: Xinyu Guan.

###### Abstract

Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.

## 1 Introduction

Tool-using LLM agents can interleave reasoning and acting \[[38](https://arxiv.org/html/2606.08151#bib.bib12)\], learn to use tools and revise behaviour from feedback \[[28](https://arxiv.org/html/2606.08151#bib.bib13),[29](https://arxiv.org/html/2606.08151#bib.bib15)\], and accumulate reusable skills in open-ended environments \[[32](https://arxiv.org/html/2606.08151#bib.bib14)\]. In coding systems such as SWE-agent \[[37](https://arxiv.org/html/2606.08151#bib.bib19)\], these agents are increasingly limited by the quality of their context window rather than by model scale alone. A repository issue may hinge on one failing test, one invariant, or one file-level constraint. Recent coding-agent context benchmarks make this bottleneck explicit \[[21](https://arxiv.org/html/2606.08151#bib.bib2),[40](https://arxiv.org/html/2606.08151#bib.bib1)\]. Longer prompts do not guarantee that this evidence is found, retained, or surfaced at the moment of action. This turns context selection into a decision problem: the useful context is not necessarily the nearest text under a retrieval score, but the evidence that changes what the agent is prepared to do next.

CICL operationalises this idea by treating each candidate context unit as a decision-time intervention. The selector estimates whether adding the unit would shift the next action, improve the expected outcome, be necessary for success, or introduce negative transfer, and then packs the highest-utility evidence under a token budget. Selected units are rewritten as decision-aware memory cards with trigger, evidence, action hint, failure-if-ignored, and scope fields. This separation between judgment schema and judging model is important in practice: a costly frontier annotator, a local Qwen judge, a lightweight ranker, or a provider-specific code model can be used depending on budget, privacy, and deployment constraints, while the measured decision signal remains comparable. Figure [1](https://arxiv.org/html/2606.08151#S1.F1) gives an overview of this pipeline.

[Figure 1 diagram: Instance evidence (task, files, tests, traces, rules, memory); Context graph (link evidence, detect conflicts, keep scope); Decision utility engine (action shift, success gain, need signal, risk / cost); Memory cards (trigger, evidence, action hint + scope); Budgeted agent (pack context, choose action, log trace); Judge router (Opus, Qwen API, Codex, Qwen-QLoRA); Evidence ledger (supported / limited / deferred).]

Figure 1: CICL pipeline. The framework turns instance evidence into a graph, routes judge signals through a decision-utility engine, and packs memory cards for a budgeted agent. Opus, direct Qwen API calls, Codex, and a trained Qwen-QLoRA model used as an optional local surrogate are reported separately, so model-dependent signals are not collapsed into one score.

The paper studies two linked questions. First, does decision-aware utility give a useful ranking signal for agent context? Second, does the signal survive judge substitution, rather than collapsing into a provider-specific prompt? We lead with the most externally meaningful result: on the open-source SWE-bench Verified file-retrieval benchmark, direct Qwen3.6-plus judgments over top-50 candidates improve BM25 on hit@1 and MRR@10. The remaining experiments explain why the signal is plausible, when compression helps, and where strong simple baselines still expose a boundary.

##### Contributions\.

\(1\) We frame agent context selection as a decision\-time intervention and formalise a four\-component utility for action\-critical evidence\. \(2\) We introduce decision\-aware memory cards and a graph\-based assembly pipeline for packing action\-oriented context\. \(3\) We evaluate the same judgment schema across Opus\-assisted annotations, lightweight rankers, Qwen\-family judges, and a Codex/GPT\-5\.5 provider check\. \(4\) We provide controlled diagnostics identifying where the framework works, where open\-source retrieval improves, and where baselines remain stronger\.

## 2 Related Work

##### Agents, benchmarks, and context budgets\.

Tool-using agents show that LLMs can call tools, reuse experience, and coordinate multi-step workflows. AutoGen and ChatDev extend this pattern to collaborative software development \[[35](https://arxiv.org/html/2606.08151#bib.bib16),[26](https://arxiv.org/html/2606.08151#bib.bib17)\]. Memory benchmarks show that final success is too coarse for agent state \[[6](https://arxiv.org/html/2606.08151#bib.bib10)\]; incremental and self-evolving settings sharpen this concern \[[8](https://arxiv.org/html/2606.08151#bib.bib9),[34](https://arxiv.org/html/2606.08151#bib.bib11)\]. In code, public repair and repository benchmarks provide realistic settings, while Agentless and OpenHands mark practical endpoints from non-agent repair to open coding-agent infrastructure \[[36](https://arxiv.org/html/2606.08151#bib.bib20),[33](https://arxiv.org/html/2606.08151#bib.bib21)\]. CICL studies a narrower layer shared by these systems: before the next action, which retrieved evidence would actually change the decision?

##### Retrieval and long\-context selection\.

Sparse and dense retrieval remain the default route for placing evidence in an agent prompt. Lexical and supervised dense methods supply strong baselines; Contriever and FAISS cover unsupervised retrieval and vector search \[[11](https://arxiv.org/html/2606.08151#bib.bib26),[16](https://arxiv.org/html/2606.08151#bib.bib27)\]. RAG conditions generation on retrieved evidence \[[20](https://arxiv.org/html/2606.08151#bib.bib29)\]; Atlas and HyDE add retrieval-scale pretraining and hypothetical evidence \[[12](https://arxiv.org/html/2606.08151#bib.bib31),[5](https://arxiv.org/html/2606.08151#bib.bib28)\]; Self-RAG adds self-critique to retrieval-conditioned generation \[[1](https://arxiv.org/html/2606.08151#bib.bib30)\]. Long-context work adds a caution: more tokens do not ensure that salient facts are used when they appear in distracting positions \[[23](https://arxiv.org/html/2606.08151#bib.bib32),[2](https://arxiv.org/html/2606.08151#bib.bib33)\]. CICL keeps retrieval as candidate generation, then scores whether each candidate changes the intended action.

##### Memory, compression, and contribution\-aware diagnostics\.

Agent-memory and context-learning methods reuse context as agent state; closest in motivation is AutoContext \[[3](https://arxiv.org/html/2606.08151#bib.bib3)\]. ACE and ACON provide nearby agent-adaptation baselines \[[39](https://arxiv.org/html/2606.08151#bib.bib4),[17](https://arxiv.org/html/2606.08151#bib.bib5)\]. Prompt-compression methods reduce length by estimating token- or sentence-level utility. CICL instead preserves typed trigger, evidence, action, failure, and scope fields, and asks whether a unit still supports a decision after compression. The closest contribution-aware comparisons are causal-memory selection and RepoShapley \[[30](https://arxiv.org/html/2606.08151#bib.bib6),[9](https://arxiv.org/html/2606.08151#bib.bib7)\]. CICL uses comparative supervision and parameter-efficient adaptation only as diagnostic machinery rather than a standalone coding-agent claim.

## 3 Method

### 3.1 Decision-Aware Context Selection

Consider an agent policy $\pi$ acting on tasks $x\in\mathcal{X}$. At each decision step the agent receives a context block $C\subseteq\mathcal{U}$, where $\mathcal{U}$ is the pool of candidate units produced by the instance graph. Unlike BM25 \[[27](https://arxiv.org/html/2606.08151#bib.bib23)\] or dense selectors such as DPR and ColBERT \[[18](https://arxiv.org/html/2606.08151#bib.bib24),[19](https://arxiv.org/html/2606.08151#bib.bib25)\], CICL treats selected evidence as an intervention. A relevance selector solves $C^{\mathrm{rel}}=\arg\max_{\mathrm{tok}(C)\leq B}\sum_{c\in C}\mathrm{sim}(c,x)$ under token budget $B$. CICL replaces $\mathrm{sim}$ with a decision-time utility $U(c,x)$ that estimates whether $c$ would change the action distribution induced by $\pi$.

### 3.2 Counterfactual-Inspired Utility

Let $C^{-}$ denote the context assembled before considering candidate $c$, and let $C^{+}=C^{-}\cup\{c\}$. With $\pi(a\mid x,C)$ denoting the next-action distribution under context $C$, CICL decomposes utility into four components:

$$
\begin{array}{rcl}
\Delta_{\mathrm{act}}(c,x) &=& \mathbb{E}\big[\,\mathbf{1}\{\arg\max_{a}\pi(a\mid x,C^{+})\neq\arg\max_{a}\pi(a\mid x,C^{-})\}\,\big]\\
\Delta_{\mathrm{out}}(c,x) &=& \mathbb{E}\big[\,V(x,C^{+})-V(x,C^{-})\,\big]\\
N(c,x) &=& \Pr\big[\,\mathrm{success}(x,C^{+})=1\wedge\mathrm{success}(x,C^{-})=0\,\big]\\
R(c,x) &=& \Pr\big[\,c\text{ induces negative transfer on }x\,\big].
\end{array}
\tag{1}
$$

Here, $V$ denotes an expected success score. The expectations are operationalised by the evaluator used in each instantiation: deterministic simulator probes, a provider judgment, or lightweight-ranker predictions under the same reference context. The aggregate utility is a fixed linear aggregation:

$$
U(c,x)=\alpha\,\Delta_{\mathrm{act}}(c,x)+\beta\,\Delta_{\mathrm{out}}(c,x)+\gamma\,N(c,x)-\lambda\,R(c,x).
\tag{2}
$$

We instantiate the four components in three ways: (i) a deterministic proxy suitable for ablations and unit testing; (ii) a Claude-Opus 4.7 counterfactual annotation assistant that produces a structured eight-field judgment per (task, candidate) pair; and (iii) a 25-dimensional pairwise linear ranker trained on these Opus-assisted diagnostic annotations. The annotations are provider-generated and disclosed as such, not human gold labels or a released judge replica. Eq. [3](https://arxiv.org/html/2606.08151#S3.E3) is the operational version used for judge-style judgments: it plugs the judge fields into the same signed utility components and adds a bounded cost penalty. We keep this aggregation fixed across Opus, Qwen, and Codex/GPT-5.5 judge runs to prevent ablation drift; deterministic proxy ablations use the same component signs with model-free estimates:

$$
s=0.34\,\Delta_{\mathrm{act}}+0.26\,N+0.28\,\Delta_{\mathrm{out}}-0.22\,R-0.08\,\mathrm{cost}.
\tag{3}
$$

In the implementation, $\mathrm{cost}=\min(1,\mathrm{tok}(c)/1000+0.2R)$ for LLM-judged units, matching the released scorer. The coefficients in Eq. [3](https://arxiv.org/html/2606.08151#S3.E3) are pre-set heuristic weights, not learned from test labels. Component-removal ablations provide the current sensitivity evidence; equal-weight, random-weight, and learned-weight sweeps remain future work. Table [1](https://arxiv.org/html/2606.08151#S3.T1) lists the judge output fields. We stress that “causal” here denotes counterfactual-inspired utility estimation rather than formal causal identification: the expectations above are not identified by any randomised intervention, and we defer a detailed discussion of this boundary to Section [7](https://arxiv.org/html/2606.08151#S7).
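As an illustration of Eq. 3 and the bounded cost term, the following minimal sketch computes the judge-style score for one unit. The class and function names are hypothetical and are not taken from the released scorer; the sketch also assumes that the four utility fields are already normalised to [0, 1], which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class JudgeFields:
    """Hypothetical container for the four utility fields of one judgment."""
    action_shift: float    # Delta_act
    necessity: float       # N
    outcome_uplift: float  # Delta_out
    risk: float            # R, negative-transfer risk


def bounded_cost(tokens: int, risk: float) -> float:
    """cost = min(1, tok(c)/1000 + 0.2*R), as stated for LLM-judged units."""
    return min(1.0, tokens / 1000.0 + 0.2 * risk)


def judge_score(f: JudgeFields, tokens: int) -> float:
    """Eq. 3: s = 0.34*Delta_act + 0.26*N + 0.28*Delta_out - 0.22*R - 0.08*cost."""
    cost = bounded_cost(tokens, f.risk)
    return (0.34 * f.action_shift
            + 0.26 * f.necessity
            + 0.28 * f.outcome_uplift
            - 0.22 * f.risk
            - 0.08 * cost)


if __name__ == "__main__":
    decisive = JudgeFields(action_shift=0.9, necessity=0.8, outcome_uplift=0.7, risk=0.1)
    distractor = JudgeFields(action_shift=0.1, necessity=0.0, outcome_uplift=0.05, risk=0.6)
    print(round(judge_score(decisive, tokens=200), 4))    # short, decisive unit
    print(round(judge_score(distractor, tokens=900), 4))  # long, risky unit
```

Because risk enters both directly and through the cost term, a long and risky unit can receive a negative score even when it shifts the action slightly.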

Table 1: Eight-field judge schema used by Opus and Qwen. The four utility fields feed Eq. [3](https://arxiv.org/html/2606.08151#S3.E3); confidence is retained for audit and diagnostics, and token cost comes from context metadata.

### 3.3 Instance Context Graph

For each repository or environment instance, CICL constructs a graph whose nodes correspond to files, symbols, task memories, rules, failures, and strategy records\. Edges capture containment, similarity, conflict, precondition, and task\-memory relations\. The graph supports both lexical and structural retrieval together with one\-hop neighbour expansion, enabling recovery of decision\-relevant context even when lexical overlap with the query is weak\. Each node is a context unit annotated with an identifier, instance id, type, source, content, token cost, and confidence score\. Importantly, the graph never requires gold context identifiers for selection; gold ids appear only during offline evaluation and in oracle baselines\. We audit all method\-facing artifacts for gold\-label leakage and include the audit script in the reproducibility package\.
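The graph description above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the released implementation: the names `ContextUnit` and `ContextGraph`, the whitespace-token overlap used for lexical seeding, and the treatment of all edges as undirected are hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextUnit:
    """One graph node, carrying the annotations listed in the text."""
    unit_id: str
    instance_id: str
    unit_type: str   # e.g. "file", "symbol", "rule", "failure"
    source: str
    content: str
    token_cost: int
    confidence: float


class ContextGraph:
    def __init__(self):
        self.units = {}
        self.edges = defaultdict(set)  # unit_id -> {(relation, neighbour_id)}

    def add_unit(self, unit):
        self.units[unit.unit_id] = unit

    def add_edge(self, a, b, relation):
        # Assumption: edges are stored in both directions for expansion.
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))

    def lexical_seeds(self, query, k=5):
        """Rank units by token overlap with the query (a stand-in for BM25)."""
        q = set(query.lower().split())
        scored = []
        for unit in self.units.values():
            overlap = len(q & set(unit.content.lower().split()))
            if overlap:
                scored.append((overlap, unit.unit_id))
        scored.sort(key=lambda t: (-t[0], t[1]))
        return [uid for _, uid in scored[:k]]

    def expand_one_hop(self, seeds):
        """Append direct neighbours of the seeds, keeping first-seen order."""
        out, seen = list(seeds), set(seeds)
        for seed in seeds:
            for _, neighbour in sorted(self.edges[seed]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    out.append(neighbour)
        return out


if __name__ == "__main__":
    g = ContextGraph()
    g.add_unit(ContextUnit("t1", "inst", "failure", "tests",
                           "test_timeout fails with retry error", 12, 0.9))
    g.add_unit(ContextUnit("r1", "inst", "rule", "docs",
                           "backoff must be capped at thirty seconds", 10, 0.8))
    g.add_edge("t1", "r1", "precondition")
    seeds = g.lexical_seeds("retry error in test_timeout")
    print(g.expand_one_hop(seeds))
```

In the usage example the rule `r1` shares no tokens with the query, yet it is recovered through its precondition edge to the failing test, which is the situation the one-hop expansion is meant to cover. No gold identifiers appear anywhere in the selection path.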

### 3.4 Decision-Aware Memory Cards

CICL compiles selected units into compact memory cards with five mandatory fields: *trigger* (when to consult), *evidence* (supporting clue), *action hint* (next-action verb), *failure-if-ignored* (risk if skipped), and *scope* (applicable boundary). A diagnostic *causal score* ($U(c,x)$) is added for ordering. The format prioritises decision usefulness over exhaustive semantic fidelity. Generic prompt compression often optimises token- or sentence-level importance, as in LLMLingua and LongLLMLingua \[[13](https://arxiv.org/html/2606.08151#bib.bib34),[14](https://arxiv.org/html/2606.08151#bib.bib35)\], or filters by salience \[[22](https://arxiv.org/html/2606.08151#bib.bib36)\]; CICL instead stores typed decision fields. A deterministic structural audit checks required-field completeness, action-verb presence, compression ratio, and absence of placeholder text.
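A memory card and its structural audit might look as follows. This is a sketch only: the field names mirror the five mandatory fields and the causal score named above, while the verb list, the placeholder strings, and the token counts passed to the audit are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class MemoryCard:
    trigger: str             # when to consult
    evidence: str            # supporting clue
    action_hint: str         # next-action verb
    failure_if_ignored: str  # risk if skipped
    scope: str               # applicable boundary
    causal_score: float = 0.0  # diagnostic U(c, x), used for ordering


# Hypothetical vocabularies; the released audit may use different lists.
ACTION_VERBS = {"run", "edit", "open", "check", "add", "remove", "inspect", "update"}
PLACEHOLDERS = ("todo", "tbd", "<placeholder>", "lorem ipsum")


def audit_card(card, source_tokens, card_tokens):
    """Deterministic structural checks; says nothing about semantic faithfulness."""
    names = [f.name for f in fields(card) if f.name != "causal_score"]
    values = {name: getattr(card, name).strip() for name in names}
    joined = " ".join(values.values()).lower()
    hint_words = values["action_hint"].lower().split()
    return {
        "fields_complete": all(values.values()),
        "has_action_verb": any(w in ACTION_VERBS for w in hint_words),
        "no_placeholder": not any(p in joined for p in PLACEHOLDERS),
        "token_ratio": card_tokens / source_tokens if source_tokens else float("inf"),
    }


if __name__ == "__main__":
    card = MemoryCard(
        trigger="Before editing the retry module",
        evidence="test_timeout fails when backoff exceeds 30 s",
        action_hint="Run test_timeout first",
        failure_if_ignored="Patch may break the retry cap",
        scope="retry module only",
        causal_score=0.67,
    )
    print(audit_card(card, source_tokens=100, card_tokens=40))
```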

### 3.5 Budget-Aware Assembly

At inference time, CICL retrieves candidate units, expands graph neighbours, scores candidates viaUU, and packs the highest\-utility evidence under a fixed token budget\. We distinguish post\-selection compression, which first selects ids and then compresses their text, from pre\-budget compression, which changes candidate costs before packing and can therefore change the selected ids\. We report the two modes separately to avoid conflating compression gains with changes in selection\.
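The difference between the two compression modes can be made concrete with a small sketch. It assumes a greedy packer that takes units in descending utility while they fit; the text only says that the highest-utility evidence is packed under a fixed budget, so the greedy rule, the dictionary keys, and the function names are illustrative assumptions.

```python
def pack(units, budget, cost_key):
    """Greedy packing: highest utility first, skipping units that do not fit."""
    chosen, used = [], 0
    for unit in sorted(units, key=lambda u: (-u["utility"], u["id"])):
        if used + unit[cost_key] <= budget:
            chosen.append(unit["id"])
            used += unit[cost_key]
    return chosen, used


def post_selection_compression(units, budget):
    """Select ids on raw token cost, then compress the selected text."""
    ids, _ = pack(units, budget, "tokens")
    by_id = {u["id"]: u for u in units}
    return ids, sum(by_id[i]["card_tokens"] for i in ids)


def pre_budget_compression(units, budget):
    """Compress first, so card costs drive packing and ids may change."""
    return pack(units, budget, "card_tokens")


if __name__ == "__main__":
    units = [
        {"id": "a", "utility": 0.9, "tokens": 80, "card_tokens": 40},
        {"id": "b", "utility": 0.7, "tokens": 70, "card_tokens": 35},
        {"id": "c", "utility": 0.5, "tokens": 60, "card_tokens": 30},
    ]
    print(post_selection_compression(units, budget=120))  # same ids, fewer tokens
    print(pre_budget_compression(units, budget=120))      # more ids fit
```

With these toy numbers, post-selection compression keeps only unit `a` and reduces its cost, while pre-budget compression admits all three units. Reporting the modes separately keeps a token saving from being mistaken for a selection change.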

### 3.6 Opus-Assisted Annotation and an Open-Weights Surrogate Judge

Claude-Opus 4.7 is used to assist a human-designed context-utility annotation workflow. These diagnostic annotations train small CICL context rankers rather than a reproduced teacher: a 25-feature pairwise linear ranker and a two-layer MLP. All features are available at inference time, including proxy utility components, retrieval and lexical-overlap scores, unit type, token cost, confidence, history success, and graph signals such as conflict degree and source/task overlap. Rankers are trained from within-task positive–negative candidate pairs, following preference-learning style supervision \[[25](https://arxiv.org/html/2606.08151#bib.bib37),[31](https://arxiv.org/html/2606.08151#bib.bib38)\], and are then used only for context ordering. The cleaned linear ranker reaches 0.939 pairwise accuracy on the post-leakage-fix v1 suite; an earlier 0.996 leaky run is excluded because it used provider-side llm\_\* features unavailable at inference.
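Training from within-task positive–negative pairs can be sketched with a pairwise logistic objective, which minimises log(1 + exp(-w·(x_pos - x_neg))) plus an L2 penalty. The trainer below is a toy illustration and not the public script: the function names are hypothetical, the default hyperparameters simply echo the values reported in the implementation details, and the example uses 2 features rather than 25.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_linear(pairs, dim, epochs=80, lr=0.08, l2=5e-4, seed=0):
    """SGD on the pairwise logistic loss over (positive, negative) feature pairs."""
    rng = random.Random(seed)
    pairs = list(pairs)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for pos, neg in pairs:
            diff = [p - n for p, n in zip(pos, neg)]
            margin = dot(w, diff)
            # d/dw log(1 + exp(-margin)) = -diff / (1 + exp(margin)); clamp for safety.
            grad_scale = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            w = [wi + lr * (grad_scale * di - l2 * wi) for wi, di in zip(w, diff)]
    return w


def pairwise_accuracy(w, pairs):
    """Fraction of pairs in which the positive outscores the negative."""
    correct = sum(1 for pos, neg in pairs if dot(w, pos) > dot(w, neg))
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy pairs: feature 0 is a helpful signal, feature 1 a misleading one.
    toy_pairs = [
        ([1.0, 0.2], [0.0, 0.3]),
        ([0.8, 0.5], [0.1, 0.5]),
        ([0.9, 0.1], [0.2, 0.9]),
    ]
    weights = train_pairwise_linear(toy_pairs, dim=2)
    print(weights, pairwise_accuracy(weights, toy_pairs))
```

Pairwise accuracy measured this way is an ordering metric within a task; as the results later show, a high value does not by itself guarantee better downstream context F1.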

As a local open-weights alternative we fine-tune Qwen3.5-9B with QLoRA on two 16 GB V100 GPUs for the same eight-field schema, following LoRA/QLoRA adaptation \[[7](https://arxiv.org/html/2606.08151#bib.bib39),[4](https://arxiv.org/html/2606.08151#bib.bib40)\]. We treat it as an agreement surrogate, not an Opus replacement. Training uses the released v1 Opus-assisted SFT split (1,400 examples; 1,256 train and 144 validation examples, split by task); evaluation measures selection-level agreement via top-$k$ Jaccard and Spearman $\rho$ on 710 candidates from 25 base tasks. This keeps deterministic, Opus-assisted, lightweight-ranker, and Qwen-surrogate scores auditable as separate components.
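The two agreement measures named above can be computed per task as in the following sketch. The function names and the tie-breaking by unit id are assumptions made for illustration; Spearman's rho is computed here as the Pearson correlation of average ranks, which is the standard definition in the presence of ties.

```python
import math


def topk_jaccard(scores_a, scores_b, k=5):
    """Jaccard overlap of the top-k unit ids under two judges' scores."""
    def top(scores):
        return set(sorted(scores, key=lambda uid: (-scores[uid], uid))[:k])
    a, b = top(scores_a), top(scores_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def _average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    opus = {"u1": 0.9, "u2": 0.8, "u3": 0.1, "u4": 0.5}
    qwen = {"u1": 0.7, "u2": 0.2, "u3": 0.3, "u4": 0.9}
    ids = sorted(opus)
    print(topk_jaccard(opus, qwen, k=2))
    print(spearman([opus[i] for i in ids], [qwen[i] for i in ids]))
```

Top-k Jaccard looks only at which units would be selected, whereas Spearman's rho is sensitive to the full ordering, so the two can diverge when judges agree on the head of the list but not on the tail.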

## 4 Experimental Setup

### 4.1 Tasks and Benchmarks

Table [2](https://arxiv.org/html/2606.08151#S4.T2) gives the minimal data map. SWE-bench Verified is the main real-code retrieval benchmark \[[15](https://arxiv.org/html/2606.08151#bib.bib18)\]; synthetic suites isolate mechanism; RepoBench-R tests compression \[[24](https://arxiv.org/html/2606.08151#bib.bib8)\]. CodeSearchNet motivates the broader semantic code-search setting that these repository-level retrieval tasks inherit \[[10](https://arxiv.org/html/2606.08151#bib.bib22)\].

Table 2: Dataset summary with scale, use, and evaluation scope.

### 4.2 Methods Compared

We compare CICL and CICL\_Distilled with NoContext, FullContext, VanillaRAG, GraphMemory, SummaryMemory, SelfGeneratedExamples, AutoContextKG, and OracleGoldContext\. The last uses gold ids only as an upper bound and is never employed as a selector elsewhere\. We additionally report seven CICL ablations removing individual scoring components\.

### 4.3 Metrics

For each task we report simulated success rate, context precision/recall/F1 against gold ids, mean reciprocal rank (MRR) of the first gold id in the selection, average tokens consumed, and average tool calls. In settings where paired task-level deltas are available, compression-suite experiments additionally employ paired bootstrap tests with 2,000 resamples for delta metrics and report 95% confidence intervals; bootstrap pairs are matched per task id. The executable pilots further report patch success and harmful-selection rates.
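Context precision, recall, and F1 against gold ids are set-overlap metrics over the selected unit ids. A minimal sketch, with a hypothetical function name and made-up ids:

```python
def context_prf(selected, gold):
    """Set-based precision, recall, and F1 of selected unit ids against gold ids."""
    sel, g = set(selected), set(gold)
    true_pos = len(sel & g)
    precision = true_pos / len(sel) if sel else 0.0
    recall = true_pos / len(g) if g else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Three units selected, one of the two gold units recovered.
    print(context_prf(["u3", "u1", "u7"], {"u1", "u2"}))
```

Because precision falls as more non-gold units are admitted, these metrics penalise over-selection, which is relevant to the budget sensitivity reported later.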

### 4.4 Implementation Details

The deterministic simulator, the heuristic CICL ranker, the AutoContextKG selector, and the compression pipeline are implemented in pure Python with fixed random seeds. The Opus-assisted annotation pipeline queries Claude-Opus 4.7 with provider-supported decoding settings using a counterfactual prompt template that enforces the eight-field JSON schema. The synthetic linear rankers are trained for 80 epochs with the pairwise logistic trainer (learning rate 0.08, L2 $5\times 10^{-4}$ in the public script); the real-code pilot uses the same trainer for 100 epochs. The MLP rankers use the same cleaned examples but optimise a BPR pairwise loss for 120 epochs with AdamW (learning rate $10^{-3}$, batch size 256, weight decay $10^{-4}$, dropout 0.1). The Qwen3.5-9B QLoRA judge is trained on two 16 GB V100 GPUs and uses rank $r=8$, $\alpha=16$, dropout 0.05, target modules q\_proj, k\_proj, v\_proj, o\_proj, effective batch size 16 (batch 1 $\times$ 16 gradient accumulation steps), one epoch, $\mathrm{lr}=2\times 10^{-4}$, bf16 disabled (V100 compatibility), and 4-bit NF4 base-weight quantisation. Local generation, training, and evaluation scripts use fixed seeds. Decoding used deterministic or provider-default settings where supported, and all external annotation runs keep archived JSON outputs; because hosted LLM backends can drift, reported numbers are tied to the released artifacts rather than to re-querying the provider. The Codex/GPT-5.5 path uses the same code-retrieval prompt and eight-field schema on the first five SWE-bench Verified instances; it is reported as a small provider check, not as a full retrieval benchmark.

## 5 Results

### 5.1 Open-Source File Retrieval Benchmark

We begin with the open benchmark most likely to matter for coding agents: SWE-bench Verified file-level retrieval. This setting does not measure patch success, but it asks whether the selector can place the target file high enough for a downstream agent to act. On 50 instances (Table [3](https://arxiv.org/html/2606.08151#S5.T3)), BM25/HybridRAG reaches hit@1 0.58 and MRR@10 0.634. The older deterministic causal proxy is weaker than BM25, but direct Qwen3.6-plus judgments over the top-50 file pool raise hit@1 to 0.78 and MRR@10 to 0.790, with 2,500/2,500 parseable judgments. The result is the paper’s main positive real-code finding: decision-aware judging becomes a strong reranking signal when the judge can read the candidate evidence well. Figure [2](https://arxiv.org/html/2606.08151#S5.F2) places this open benchmark next to the controlled mechanism and compression trends.
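The reranking setting described here can be sketched as follows: a first-stage retriever supplies an ordered candidate list, a judge assigns each candidate a score, and hit@1 and MRR@10 are computed on the reordered list. The file names and scores are invented for illustration, and the tie-breaking by original BM25 position is an assumption rather than a documented detail.

```python
def rerank(candidates, judge_scores):
    """Reorder candidates by judge score (descending), ties keep BM25 order."""
    order = sorted(range(len(candidates)), key=lambda i: (-judge_scores[i], i))
    return [candidates[i] for i in order]


def hit_at_k(ranked, gold, k=1):
    """1.0 if any gold file appears in the top k, else 0.0."""
    return float(any(c in gold for c in ranked[:k]))


def mrr_at_k(ranked, gold, k=10):
    """Reciprocal rank of the first gold file within the top k, else 0.0."""
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    bm25_order = ["a.py", "b.py", "c.py"]  # hypothetical first-stage ranking
    scores = [0.1, 0.9, 0.4]               # hypothetical judge utilities
    gold = {"b.py"}
    reranked = rerank(bm25_order, scores)
    print(hit_at_k(bm25_order, gold), mrr_at_k(bm25_order, gold))
    print(hit_at_k(reranked, gold), mrr_at_k(reranked, gold))
```

Benchmark-level hit@1 and MRR@10 are then the means of these per-instance values. Note that a reranker of this kind cannot recover a gold file that the first stage leaves outside the candidate pool.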

[Figure 2 panels: A. SWE-bench Verified MRR@10 (Det. proxy 0.314, BM25 0.634, Qwen3.6+ 0.790); B. CICL budget sensitivity (series v1, v3); C. RepoBench-R success (series Summary, Cards, Raw).]

Figure 2: Compact result overview. Panel A foregrounds the open-source SWE-bench Verified retrieval result; Panels B and C show the main controlled budget trend and RepoBench-R compression boundary.

Table 3: SWE-bench Verified file-level retrieval (50 instances), sorted by non-oracle MRR@10. Direct Qwen3.6-plus judgments re-rank top-50 candidates; the metric is retrieval quality and not a patch result.

We also ran the same prompt and schema through Codex/GPT-5.5 on the first five SWE-bench Verified instances (250 candidate judgments). All judgments parsed; on this five-instance slice, CausalRerank, CausalHybridRerank, BM25, and HybridRAG all obtain hit@1 and MRR@10 of 0.60. We use this only to check that the schema can run through another provider; the 50-instance Qwen3.6-plus run is the main retrieval comparison.

### 5.2 Evidence Map

The remaining results explain why the real-code gain is plausible and where it does not yet transfer. Table [4](https://arxiv.org/html/2606.08151#S5.T4) ranks the evidence by claim strength so that controlled positives, compression findings, and negative boundaries are visible in one place.

Table 4: Evidence ranking by claim strength, including negative evidence and applicability boundaries.

### 5.3 Ablation

Tables [5](https://arxiv.org/html/2606.08151#S5.T5) and [6](https://arxiv.org/html/2606.08151#S5.T6) report the two ablation diagnostics most directly tied to the selection mechanism: removing selected evidence and varying the token budget. The clearest controlled signal is the causal-removal diagnostic. Removing the highest-scoring unit collapses v3 context F1 from 0.245 to 0.000 and MRR from 0.980 to 0.000; random removal is much less destructive (F1 0.205). The supplement adds the same removal check on v1 and a compression-order split.
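The logic of the removal diagnostic can be illustrated on a single toy task. In this sketch the top-utility unit is also the only gold unit, so deleting it drives F1 to zero, whereas deleting a unit chosen uniformly at random hurts far less on average. The unit ids, scores, and function names are hypothetical, and the random-removal baseline is computed as an exact average over all single-unit deletions rather than by sampling.

```python
def f1_against_gold(selected, gold):
    sel, g = set(selected), set(gold)
    true_pos = len(sel & g)
    if not sel or not g or true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(sel), true_pos / len(g)
    return 2 * precision * recall / (precision + recall)


def removal_diagnostic(unit_scores, gold):
    """Compare F1 for the full selection, top-unit removal, and average random removal."""
    ranked = sorted(unit_scores, key=lambda uid: (-unit_scores[uid], uid))
    without_each = [
        f1_against_gold([u for u in ranked if u != dropped], gold)
        for dropped in ranked
    ]
    return {
        "full": f1_against_gold(ranked, gold),
        "remove_top": f1_against_gold(ranked[1:], gold),
        "remove_random_mean": sum(without_each) / len(without_each),
    }


if __name__ == "__main__":
    scores = {"gold_unit": 0.9, "d1": 0.4, "d2": 0.3, "d3": 0.2}
    print(removal_diagnostic(scores, gold={"gold_unit"}))
```

The gap between top-unit removal and average random removal is what the diagnostic reads as action-criticality of the highest-scoring unit.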

Table 5: Causal removal ablation on synthetic v3 (250 tasks, budget 400). Removing the highest-utility unit collapses F1 and MRR to zero.

The second controlled pattern is an inverted-U over token budgets (Table [6](https://arxiv.org/html/2606.08151#S5.T6)). CICL peaks at budget 120 on both suites (v1 0.620, v3 0.425) and then decays as larger budgets admit more distractors. On v3, VanillaRAG remains stronger, so the budget result is a limited positive result rather than evidence of dominance.

Table 6: Context F1 across token budgets (250 tasks each). Rows are sorted by non-oracle peak strength. CICL peaks at budget 120 on both suites; the inverted-U is driven by precision loss at larger budgets. Bold marks each row’s peak.

### 5.4 Main Synthetic Ranking Results

Table [7](https://arxiv.org/html/2606.08151#S5.T7) reports the standard budget-120 comparison, sorted by average non-oracle F1 across the two suites. On v1, CICL is below AutoContextKG but above VanillaRAG and GraphMemory. On v3, semantic distractors change the ordering: VanillaRAG has the best F1 and AutoContextKG the best MRR, while CICL remains competitive but no longer dominates.

Table 7: Main 250-task evaluation at budget 120, sorted by average non-oracle F1 across v1/v3. v1 uses structural distractors; v3 uses semantic distractors. Bold = best non-oracle per metric.

### 5.5 RepoBench-R Compression Suite

Table [8](https://arxiv.org/html/2606.08151#S5.T8) reports the 100-task RepoBench-R Python compression suite. At budget 120, causal-card compression improves over raw selection in success ($0.02\to 0.06$) and context recall ($+0.04$, 95% CI $[0.01, 0.08]$, $p=0.016$ by matched paired bootstrap with 2,000 resamples). Generic extractive summarisation is nevertheless stronger at the same budget (0.11 success; summary-vs-card recall delta $+0.05$, 95% CI $[0.01, 0.10]$, $p=0.009$). In selected-then-compressed mode, causal cards save 44.93 tokens per query over raw selection (95% CI $[41.05, 48.74]$, $p<10^{-3}$) with identical selected ids. The supported result is therefore specific: cards can save tokens and improve raw budget fitting, but generic summaries are the stronger baseline on this starter slice.
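A matched paired bootstrap of the kind reported here can be sketched as below. Per-task deltas are resampled with replacement, the 95% interval is read from the percentiles of the resampled means, and a two-sided p-value is taken from the share of resampled means on either side of zero. The percentile interval and this particular p-value convention are assumptions; the text does not specify which variant the released scripts use.

```python
import random


def paired_bootstrap(metric_a, metric_b, resamples=2000, seed=0):
    """Bootstrap the mean per-task delta (a - b); lists must be aligned by task id."""
    if len(metric_a) != len(metric_b):
        raise ValueError("metric lists must be matched per task id")
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(deltas)
    observed = sum(deltas) / n
    means = []
    for _ in range(resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * resamples)]
    high = means[min(resamples - 1, int(0.975 * resamples))]
    share_le_zero = sum(m <= 0 for m in means) / resamples
    share_ge_zero = sum(m >= 0 for m in means) / resamples
    p_value = min(1.0, 2 * min(share_le_zero, share_ge_zero))
    return observed, (low, high), p_value


if __name__ == "__main__":
    # Hypothetical per-task recall for two methods on the same six tasks.
    cards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
    raw = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
    print(paired_bootstrap(cards, raw))
```

Resampling deltas rather than the two metric lists independently is what makes the test paired: task-level difficulty cancels within each pair.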

Table 8: RepoBench-R 100-task compression sweep (means across 100 tasks). Within each budget, rows are sorted by success and then F1.

### 5.6 Opus-Assisted Rankers and Qwen Selection Agreement

We collected 1,400 Opus-4.7-assisted counterfactual judgments on the v1 base tasks and trained both lightweight rankers. The linear 25-dimensional ranker achieves pairwise accuracy 0.939 but remains below the heuristic on downstream F1. The MLP ranker improves over linear on both suites (v3 F1 $0.241\to 0.334$; Table [7](https://arxiv.org/html/2606.08151#S5.T7)); pairwise accuracies are 0.939/0.797 for the linear ranker and 0.945/0.816 for the MLP on v1/v3. The learned ranker still does not replace the hand-designed ranker.

On a supplementary real-code pilot, Opus 4.6 assists annotation for 200 RepoBench-R candidates from 50 training tasks; the resulting linear ranker reaches F1 0.027 and success 0.060 on 50 heldout tasks ($\Delta$F1 $=0.027$ vs. heuristic CICL, 95% CI $[0.000, 0.060]$, $p=0.042$; Oracle F1 0.580). A heldout budget sweep in the supplement keeps the signal small, so this is preliminary evidence rather than mature real-code retrieval training.

The Qwen3.5-9B QLoRA judge was evaluated on 710 (task, candidate) pairs from 25 base tasks (Table [9](https://arxiv.org/html/2606.08151#S5.T9)). This is a selection-agreement diagnostic on the synthetic Opus-assisted label distribution, not a held-out task-generalisation test. JSON parse rate is 1.00; field-level MAE against Opus stays below 0.06 on all numeric fields. Top-5 selection Jaccard averages 0.592 and Spearman $\rho$ reaches 0.379, sufficient for validation-pool use but too noisy for unattended deployment.

Table 9: Qwen3.5-9B QLoRA judge vs. Opus 4.7 (710 candidates, 25 tasks).

### 5.7 Executable Checks and Harmful-Context Stress

A two-task toy patch-and-test pilot verifies that selected context can be passed through an executable patch loop: CICL, VanillaRAG, and AutoContextKG reach patch success 1.00, while NoContext reaches 0.00. Three Astropy patch-generation checks in the supplement test real-code patch formatting but are not official SWE-bench passes. In harmful-context stress tests, CICL often selects stale units (70.8% on v1; 68.8% on v3), but ranks the gold unit ahead of them (harmful-before-gold $=0.00$). The signal is ordering under stale evidence, not full harmful-context avoidance.
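One plausible reading of the two stress-test quantities is sketched below. The text does not give exact definitions, so this is an assumption: a task counts as a stale selection if any stale unit appears in the ranked selection, and as harmful-before-gold if the first stale unit is ranked ahead of the first gold unit (or if no gold unit is selected at all). The function names and unit ids are hypothetical.

```python
def stale_flags(ranked, gold, stale):
    """Return (stale_selected, harmful_before_gold) for one task's ranked selection."""
    first_gold = next((i for i, u in enumerate(ranked) if u in gold), None)
    first_stale = next((i for i, u in enumerate(ranked) if u in stale), None)
    stale_selected = first_stale is not None
    harmful_before_gold = stale_selected and (
        first_gold is None or first_stale < first_gold
    )
    return stale_selected, harmful_before_gold


def stress_rates(tasks):
    """Average the per-task flags over (ranked, gold, stale) triples."""
    flags = [stale_flags(r, g, s) for r, g, s in tasks]
    n = len(flags)
    return (sum(f[0] for f in flags) / n, sum(f[1] for f in flags) / n)


if __name__ == "__main__":
    tasks = [
        (["g1", "s1", "d1"], {"g1"}, {"s1"}),  # stale selected, gold ranked first
        (["g2", "d2"], {"g2"}, {"s2"}),        # no stale unit selected
    ]
    print(stress_rates(tasks))
```

Under this reading, a high stale-selection rate combined with a zero harmful-before-gold rate corresponds to the pattern reported above: stale evidence enters the context but is ordered behind the gold unit.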

### 5.8 Memory-Card Structural Audit

A deterministic audit checks required fields, action hints, compression success, and placeholder absence. On RepoBench-R, the rates are: fields 1.000, hints 1.000, compression 0.948, and token ratio 0.486. This checks structure only, not semantic faithfulness.

## 6 Discussion

The results support decision\-aware utility as a scoped ranking signal\. On SWE\-bench Verified, the schema is weak under a deterministic proxy but becomes a strong file reranker with Qwen3\.6\-plus\. The synthetic removal and budget studies explain why: high\-utility evidence can be necessary, while extra context can add distractors\.

The negative findings sharpen the contribution\. RepoBench\-R shows that memory cards save tokens, but generic summaries preserve lexical details useful for code completion\. Together with the v3 and deterministic SWE\-bench failures, this keeps the claim bounded: CICL measures and exploits decision\-critical evidence, but the judge and compression format must match the retrieval setting\.
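
The removal and budget studies discussed in this section can be pictured with a toy sketch: score each unit on the four decision signals, pack greedily under a token budget, then delete the top-utility unit and pack again. The weights, the linear form of `utility`, the greedy packer, and all names below are assumptions for illustration; they are not the paper's scoring function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    uid: str
    tokens: int
    action_shift: float
    outcome_uplift: float
    necessity: float
    negative_transfer: float


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {
    "action_shift": 0.3,
    "outcome_uplift": 0.3,
    "necessity": 0.4,
    "negative_transfer": 0.5,
}


def utility(unit):
    """Assumed linear utility: three benefit terms minus a risk penalty."""
    return (
        WEIGHTS["action_shift"] * unit.action_shift
        + WEIGHTS["outcome_uplift"] * unit.outcome_uplift
        + WEIGHTS["necessity"] * unit.necessity
        - WEIGHTS["negative_transfer"] * unit.negative_transfer
    )


def pack(units, budget):
    """Greedy packing: highest utility first, skipping units that do not fit."""
    chosen, used = [], 0
    for unit in sorted(units, key=utility, reverse=True):
        if utility(unit) <= 0:
            break  # remaining units are expected to hurt more than help
        if used + unit.tokens <= budget:
            chosen.append(unit)
            used += unit.tokens
    return chosen


def remove_top_unit(units):
    """Ablation helper: drop the single highest-utility unit."""
    if not units:
        return []
    top = max(units, key=utility)
    return [u for u in units if u.uid != top.uid]
```

Comparing downstream F1 for `pack(units, 120)` against `pack(remove_top_unit(units), 120)` is the shape of the removal diagnostic: if the top unit is genuinely necessary, the second selection cannot recover it however much budget remains.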

## 7 Limitations

- Causal wording. The score is counterfactual-inspired, not identified by randomised intervention.
- Evaluation scope. We report toy and real-code patch-generation checks, but no official SWE-bench patch success.
- Stronger baselines remain. Summarisation wins on RepoBench-R, VanillaRAG dominates on v3, and the Qwen SWE-bench result is file-level only.
- Scale and surrogates. Diagnostics cover 25–250 tasks; the real-code pilot is weak; Qwen agreement is synthetic-distribution only.

## 8 Conclusion

CICL reframes context selection for tool\-using LLM agents as a decision\-aware problem\. Useful context is not merely similar to the task; it is evidence that can change the agent’s next action, raise the expected outcome, become necessary for success, or introduce negative transfer\. By scoring these effects and packing selected evidence into typed memory cards, CICL gives researchers a measurable notion of decision\-critical context and gives builders a practical layer for auditing context failures, targeting expensive judge calls, and compressing long prompts without treating all retrieved text as equally valuable\.

Across the main experiments and supplementary diagnostics, the signal is useful but deliberately bounded. On SWE-bench Verified file-level retrieval, Qwen3.6-plus reranking improves BM25 from 0.58 to 0.78 hit@1 and from 0.634 to 0.790 MRR@10. Controlled removal explains why the ranking is not only cosmetic: deleting the top-utility unit collapses synthetic v3 F1 from 0.245 to 0.000. The appendix further stress-tests the protocol: Qwen-QLoRA produces parseable judgments for all 710 agreement candidates, a 200-label real-code Opus-assisted pilot yields a small heldout signal, and a three-instance patch-generation smoke test validates the retrieval-to-patch plumbing without claiming official SWE-bench success. The negative results are equally informative: RepoBench-R shows that memory cards can lose lexical detail to generic summaries, and compact rankers do not yet replace the heuristic selector.

CICL is therefore best viewed as a decision\-aware layer for measuring, ranking, and compressing agent context, rather than as a standalone coding agent\. Future work should scale real\-code annotations, learn adaptive utility weights, combine structured cards with lexical snippets, and evaluate the selector inside end\-to\-end agent rollouts\.

#### Reproducibility Statement\.

The codebase, evaluation harness, evidence log, and Qwen-QLoRA configuration are available at [https://github.com/stephen-guan-researcher/CICL](https://github.com/stephen-guan-researcher/CICL). A Qwen3.5-9B QLoRA adapter-only release is available at [https://huggingface.co/XinyuGuan/CICL](https://huggingface.co/XinyuGuan/CICL); it must be loaded with the Qwen/Qwen3.5-9B base model and is not a standalone judge model. A supplementary appendix reports additional diagnostics omitted from the page-limited main text.

#### Disclosure of Interests.

The authors have no competing interests to declare that are relevant to the content of this article\.
