Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Summary
This paper evaluates context engineering configurations for LLM agents in enterprise tool-use workflows, showing that summarization with selective pruning achieves 91.6% accuracy while reducing token usage by over 60% compared to full-context baselines.
View Cached Full Text
Cached at: 06/10/26, 06:13 AM
# Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Source: [https://arxiv.org/html/2606.10209](https://arxiv.org/html/2606.10209)
###### Abstract
Background:Large language models deployed as autonomous agents for enterprise workflows face a critical challenge: verbose tool responses from enterprise systems cause context window overflow and excessive inference costs, preventing reliable task completion at scale\.Methods:We evaluate four context engineering configurations applied to GPT\-5 for automated hotel expense itemization in Microsoft Dynamics 365 Finance and Operations \(D365 F&O\) via Model Context Protocol \(MCP\): \(1\) GPT\-5 with no user model \(a motivating ablation\), \(2\) the standard full\-context agent \(our context\-engineering baseline\) with full conversation history, \(3\) context pruned to the last 5 tool call/response pairs, and \(4\) context pruned with automated summarization using a summary window of 3\. Results are reported as averages across 5 independent experimental runs on a 50\-task hotel expense benchmark, holding the user model constant across the context\-engineering comparison \(C2–C4\) to isolate the effect of context management\. We extend the original study with \(i\) 95% confidence intervals and effect\-size analysis, \(ii\) sensitivity analyses over the pruning window
NNand the summary window
WW, \(iii\) a per\-category failure taxonomy, and \(iv\) generalization evidence across five expense types grouped into three structurally distinct categories, and across a second model family \(Claude Sonnet 4\.5\)\.Results:C1 \(no user model\) achieved only 8\.0% complete itemization\. The full\-context configuration reached 71\.0% but consumed 1,480,996 tokens and 14\.56 hours per benchmark\. Context pruning to the last 5 tool calls achieved 79\.0% complete itemization with 535,274 tokens—a 63\.9% reduction—in 5\.39 hours\. Adding automated summarization achieved 91\.6% complete itemization and 99\.64% average amount itemized with 553,374 tokens in 5\.79 hours\.Discussion:Context engineering with summarization achieves the best balance of performance and efficiency, demonstrating that selective retention of recent tool interactions is more decision\-relevant than full history while dramatically reducing token consumption\. We position these results as strong evidence for one class of enterprise tool\-use workflow rather than a proof of universal generalization, and we discuss the scope and limits of the approach explicitly\.
Keywords:context engineering, tool\-using LLM agents, context pruning, conversation summarization, Model Context Protocol, enterprise workflow automation, token efficiency, Dynamics 365 Finance and Operations
Abhilasha Lodha∗, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal
Microsoft
\{ablodha, mahsap, abchak, mithal\}@microsoft\.com
## 1Background
The deployment of large language models \(LLMs\) as autonomous agents for enterprise workflow automation represents a significant advance in AI\-powered productivity\(Brownet al\.,[2020](https://arxiv.org/html/2606.10209#bib.bib1)\)\. Agents can navigate complex multi\-step processes, interact with enterprise systems through tool calls, and complete tasks that previously required sustained human attention\. However, as these agents engage in extended workflows—particularly those interacting with enterprise resource planning \(ERP\) systems—a fundamental technical constraint emerges: context window overflow caused by verbose tool responses\(Jianget al\.,[2023](https://arxiv.org/html/2606.10209#bib.bib2)\)\.
Modern LLMs operate within finite context windows bounding the total tokens processable per inference call\. While recent frontier models have expanded these limits substantially, enterprise system integrations routinely generate tool responses containing extensive metadata, nested form state, navigation breadcrumbs, and system information well beyond what is decision\-relevant\(Liet al\.,[2023](https://arxiv.org/html/2606.10209#bib.bib3)\)\. In multi\-step agentic workflows where agents execute dozens of tool interactions to complete a single task, even large context windows can be exhausted before completion\. Furthermore, processing costs scale linearly with context length, making full\-history retention prohibitively expensive at production scale\. Recent industry analyses of production agents describe the same failure mode under the name “context rot,” where a model’s effective recall degrades as the token count grows, well before the hard context limit is reached\(Anthropic,[2025](https://arxiv.org/html/2606.10209#bib.bib13)\)\.
This challenge is particularly acute in expense management workflows within Microsoft Dynamics 365 Finance and Operations \(D365 F&O\)\. When an LLM agent interacts with D365 F&O via an MCP proxy, each tool response can contain hundreds to thousands of tokens of form metadata\. For an expense itemization task requiring decomposition of a single receipt into 4–22 individual line items—each requiring multiple tool interactions to create, populate, and verify—cumulative context rapidly exhausts available token budgets\.
The expense itemization task is representative of a broad class of enterprise agentic challenges: agents must decompose a total receipt amount into multiple line items with correct subcategories and amounts, with the strict requirement that the remaining unallocated amount reaches exactly zero\. This precision requirement means partial completion constitutes failure in production systems, creating accounting errors, policy violations, and manual remediation costs\.
Context engineering—the deliberate management of what information is retained in an agent’s context at each step—offers a practical solution\. Rather than maintaining full conversation history \(standard practice\), context engineering selectively retains the most decision\-relevant recent interactions while summarizing or discarding older context\. This approach is inference\-time only, requires no model retraining, and is designed to be portable across LLM backends\.
#### Contributions\.
This paper makes the following contributions:
1. 1\.We formalize a*semantic\-level*context\-engineering policy for tool\-using agents—recency\-based pruning of whole tool call/response pairs plus automated summarization of evicted pairs—and provide its exact construction algorithm \(Algorithm[1](https://arxiv.org/html/2606.10209#alg1)\), distinguishing it from token\-level prompt compression and from external memory stores \(Section[2](https://arxiv.org/html/2606.10209#S2)\)\.
2. 2\.On a 50\-task hotel\-expense benchmark in a live D365 F&O environment, we show that, with the user model held constant, recency pruning and pruning\+\+summarization improve complete itemization from 71\.0% to 79\.0% to 91\.6%*while*reducing tokens by 62\.7% and runtime by 60\.2% relative to full context\.
3. 3\.We report run\-to\-run dispersion, 95% confidence intervals, and effect\-size analysis \(Section[4\.6](https://arxiv.org/html/2606.10209#S4.SS6)\), and we provide sensitivity analyses over the pruning windowNNand the summary windowWW\(Section[4\.7](https://arxiv.org/html/2606.10209#S4.SS7)\)\.
4. 4\.We provide a per\-category failure taxonomy \(Section[5](https://arxiv.org/html/2606.10209#S5)\) and generalization evidence across five expense types grouped into three structurally distinct categories, and a second model family, Claude Sonnet 4\.5 \(Sections[4\.8](https://arxiv.org/html/2606.10209#S4.SS8)–[4\.9](https://arxiv.org/html/2606.10209#S4.SS9)\)\.
We hypothesize that \(1\) restricting context to recent tool interactions improves task\-relevant focus while preventing overflow, and \(2\) automated summarization of pruned context preserves task\-level situational awareness without significant token overhead\. Our results support both hypotheses for this class of workflow\.
## 2Related Work
We organize prior work along three axes and position our contribution against each\. Table[1](https://arxiv.org/html/2606.10209#S2.T1)summarizes the comparison\.
#### Token\-level prompt compression\.
LLMLingua\(Jianget al\.,[2023](https://arxiv.org/html/2606.10209#bib.bib2)\)and Selective Context\(Liet al\.,[2023](https://arxiv.org/html/2606.10209#bib.bib3)\)reduce input size by deleting or merging low\-information*tokens*within a prompt\. These methods operate below the level of the tool interaction: they do not reason about which tool call/response*units*are still relevant to the agent’s current state, and they can corrupt the structured form state that an ERP agent must read verbatim \(e\.g\., control names, numeric balances\)\. Our policy operates at the*semantic level*of whole tool call/response pairs, preserving the exact text of retained interactions and only ever evicting or summarizing complete units\.
#### External and long\-term memory\.
MemoryBank\(Zhonget al\.,[2024](https://arxiv.org/html/2606.10209#bib.bib4)\)and LongMem\(Wanget al\.,[2024](https://arxiv.org/html/2606.10209#bib.bib5)\)augment models with retrievable memory stores, and recent benchmarks such as LoCoMo\(Maharanaet al\.,[2024](https://arxiv.org/html/2606.10209#bib.bib11)\)and LongMemEval\(Wuet al\.,[2025](https://arxiv.org/html/2606.10209#bib.bib14)\)evaluate long\-horizon recall in multi\-session*dialogue*\. These target factual recall across conversations rather than the working\-memory and stale\-state problems of tool\-heavy, single\-session workflows, where the most recent form state—not a retrieved fact—is the decision\-relevant signal\. We show that for this regime a lightweight recency window plus a compact running summary is sufficient, with no external store or retriever\.
#### Agentic context management and compaction\.
The closest contemporary line of work studies context management for long\-horizon agents directly\. ACON\(Kang and others,[2025](https://arxiv.org/html/2606.10209#bib.bib10)\)learns a failure\-driven compression*guideline*and distills it into a smaller compressor; concurrent work studies context management for long\-horizon software\-engineering agents\(Liuet al\.,[2025](https://arxiv.org/html/2606.10209#bib.bib12)\); and provider platforms now ship native “compaction” and tool\-result\-clearing features\(Anthropic,[2025](https://arxiv.org/html/2606.10209#bib.bib13)\)\. Our work is complementary and deliberately simpler: a fixed\-recency eviction policy with optional single\-pass summarization, evaluated end\-to\-end in a*live enterprise ERP*with a hard, business\-defined success criterion \(zero residual\), rather than on QA or coding benchmarks\.
#### Tool\-use benchmarks\.
MCP\-Bench\(Wang and others,[2025](https://arxiv.org/html/2606.10209#bib.bib9)\)and related MCP agent benchmarks evaluate breadth of tool use across many servers and domains\. Our study is narrower but deeper: a single high\-stakes workflow with a strict completion criterion, measured for both task success*and*cost \(tokens, wall\-clock\), which surfaces the efficiency–accuracy trade\-off that breadth\-oriented benchmarks do not isolate\.
Agentic reasoning frameworks such as ReAct\(Yaoet al\.,[2023](https://arxiv.org/html/2606.10209#bib.bib6)\)established the value of interleaving reasoning and action but do not prescribe a context policy for extended workflows with verbose tool responses—the gap this paper addresses\.
Table 1:Positioning relative to prior context\-management approaches\. Our policy operates on whole tool call/response pairs \(the semantic unit of an agentic workflow\)\.
## 3Methods
### 3\.1Task definition
The expense itemization task in D365 F&O requires an autonomous LLM agent to: \(1\) navigate to an existing expense report containing a receipt with a known total amount; \(2\) create individual line items corresponding to each purchased item; \(3\) assign the correct expense subcategory \(e\.g\., room charge, city tax, resort fee, breakfast\) to each line item; \(4\) enter the correct dollar amount; and \(5\) continue until the remaining unallocated amount equals exactly $0\.00\. The task is complete only when the remaining amount reaches zero\. Any nonzero residual constitutes failure in production ERP workflows, preventing expense report finalization and triggering compliance review\.
### 3\.2System architecture
The evaluation system consists of four components:
- •GPT\-5 agent:The primary LLM \(GPT\-5\) executing the autonomous itemization workflow, guided by a detailed agent system prompt covering the full itemization workflow, valid expense subcategories, and subcategory mappings\.
- •User model \(C2–C4 only\):A secondary LLM \(GPT\-4\.1\) that participates in the agentic conversation as the “user”, responding to any follow\-up questions or confirmation requests the GPT\-5 agent raises during execution\. Absent in C1\.
- •D365 F&O MCP server:A Model Context Protocol proxy exposing D365 F&O form interactions as discrete tools \(form navigation, field reading, field value setting, button clicking\)\. The agent is exposed to a fixed inventory of tools covering UI\-level form interaction, entity\-level data access, and action invocation; a capability\-level description is given in Appendix[B](https://arxiv.org/html/2606.10209#A2)\.
- •Internal evaluation harness:A non\-interactive orchestration framework managing agent\-tool interaction loops, context engineering logic, and metric collection\. No human is present during execution\.
MCP tool responses from D365 F&O are verbose by design, containing full form state snapshots including field values, metadata, navigation breadcrumbs, and system state\. A single tool response can contain 500–3,000 tokens, with full conversation history for a complex itemization task accumulating to 50,000–150,000\+ tokens across 15–30 tool interactions\.
### 3\.3Experimental configurations
We evaluated four configurations representing a progression from minimal to optimally engineered context management:
C1 — GPT\-5, No User Model \(Motivating Ablation\):GPT\-5 equipped with the full agent system prompt, which provides detailed step\-by\-step itemization workflow instructions, a complete list of valid D365 F&O expense subcategories, subcategory mapping rules \(e\.g\., “Hotel Tax”→\\rightarrowRoom tax; “Room Service & Meals”→\\rightarrowRoom service\), and explicit directives to continue itemizing without interruption until the remaining amount reaches zero\. Despite these instructions, GPT\-5 in practice occasionally departs from fully autonomous execution—pausing mid\-task to ask clarifying questions or request confirmation before proceeding\. Because the evaluation harness is a non\-interactive framework with no human in the loop, these unanswered queries stall the agentic workflow entirely, resulting in incomplete tasks\. C1 thus establishes a lower performance bound and, as we make explicit below, serves as a*motivating ablation*for the user model rather than as a step in the context\-engineering ladder\.
C2 — GPT\-5 \+ User Model \(Full Context\):To address the non\-interactive framework limitation observed in C1, a user model \(gpt\-4\.1\) is introduced as a conversational participant\. Rather than receiving a single static prompt, the GPT\-5 agent can now ask questions mid\-task and receive meaningful responses that keep the workflow moving\. The user model is guided by auser\_contextthat defines a strict completion protocol: the task is considered complete only when the expense line is saved, the itemization remaining amount is 0\.00,*and*the form is closed\. The user model is further instructed to handle common agent queries—confirming that missing itemizations should be added, declining unnecessary receipt or expense report attachment requests, and prompting the agent to verify completion after each itemization pass\. Full conversation history is retained throughout execution\. This configuration represents standard agentic practice with the user model present and serves as the primary full\-context baseline for the context\-engineering comparison\.
C3 — GPT\-5 \+ User Model \(Last 5 Tool Calls\):Standard agent configuration with context pruned to retain only the 5 most recent tool call/response pairs\. All earlier interactions are discarded without summarization\. The selection ofN=5N\{=\}5was motivated by task structure analysis: a single itemization line requires 2–3 tool calls \(creating the line via a form control, setting field values, and optionally verifying state\)\. Five tool calls thus provide working memory for approximately two complete itemization cycles\. We test the robustness of this choice in Section[4\.7](https://arxiv.org/html/2606.10209#S4.SS7)\.
C4 — GPT\-5 \+ User Model \(Last 5 \+ Summarization, Window = 3\):Standard agent configuration with context pruned to the last 5 tool call/response pairs, augmented with automated summarization of earlier conversation history\. A summary window of 3 is applied, meaning the 3 most recent interactions prior to the pruning boundary inform the generated summary\. The compact summary captures the forms opened, controls interacted with, buttons clicked, and data entered by the agent\. This provides task\-level situational awareness without the token cost of full history retention\.
#### Isolating the effect of context engineering\.
C1 differs from C2–C4 in*two*respects at once—it removes both the user model and any context policy—so the C1→\\toC2 jump alone cannot be attributed to context engineering\. We therefore make the design explicit: the user model is held*constant and present*across C2, C3, and C4, and all of our context\-engineering claims rest on theC2 \(full context\)→\\rightarrowC3 \(pruning\)→\\rightarrowC4 \(pruning\+\+summarization\)comparison, which varies*only*the context policy\. C1 is reported solely as a motivating ablation that quantifies why the user model is needed in a non\-interactive harness for GPT\-5\. Notably, this necessity is model\-specific: in our cross\-model study \(Section[4\.9](https://arxiv.org/html/2606.10209#S4.SS9)\), Claude Sonnet 4\.5 does not stall without the user model, so its C1 already reaches high completion—direct evidence that the large C1→\\toC2 gap for GPT\-5 reflects a model\-specific stalling behavior, not the value of context engineering\.
### 3\.4Context construction algorithm
Algorithm[1](https://arxiv.org/html/2606.10209#alg1)specifies exactly how the retained context is constructed before each agent inference call\. LetHHbe the full message history,NNthe number of recent tool call/response pairs to keep \(the pruning window\), andWWthe summary window\. The policy counts tool messages, evicts the oldestmax\(0,\#tool−N\)\\max\(0,\\\#\\text\{tool\}\-N\)of them*together with*their preceding assistant tool\-call message\. WhenW≠0W\\neq 0and at least one pair is evicted, the policy summarizes theWWmost recently evicted messages \(or all evicted messages ifW=−1W\{=\}\{\-1\}\) and re\-inserts a single summary message at the earliest evicted position\. C2 corresponds toN=∞N\{=\}\\infty\(no eviction\); C3 toN=5,W=0N\{=\}5,W\{=\}0; C4 toN=5,W=3N\{=\}5,W\{=\}3\. The summarization in C4 costs exactly one additional LLM call per eviction event\.
Algorithm 1ConstructContext1:Input:history
HH; keep window
NN; summary window
WW
2:
c←c\\leftarrownumber of tool messages in
HH
3:
d←max\(0,c−N\)d\\leftarrow\\max\(0,\\;c\-N\)\{\# pairs to evict\}
4:if
d=0d=0then
5:return
HH
6:endif
7:
K←\[\]K\\leftarrow\[\\,\];
E←\[\]E\\leftarrow\[\\,\]
8:foreach message
mmin
HH\(in order\)do
9:if
mmis a tool messageand
d\>0d\>0then
10:
d←d−1d\\leftarrow d\-1; append
mmto
EE
11:if
last\(K\)\\mathrm\{last\}\(K\)is an assistant tool\-call msgthen
12:move
last\(K\)\\mathrm\{last\}\(K\)from
KKto
EE
13:endif
14:else
15:append
mmto
KK
16:endif
17:endfor
18:if
W≠0W\\neq 0and
E≠\[\]E\\neq\[\\,\]then
19:
EW←EE\_\{W\}\\leftarrow Eif
W=−1W\{=\}\{\-\}1, else last
WWof
EE
20:
s←Summarize\(EW\)s\\leftarrow\\textsc\{Summarize\}\(E\_\{W\}\)
21:insert “Summary of previous tool calls:
ss” at earliest evicted position in
KK
22:endif
23:return
KK
### 3\.5Evaluation metrics
The following metrics were collected per configuration and averaged across 5 independent runs:
- •Completely Itemized:Percentage of tasks where remaining amount reached exactly $0\.00*\(primary metric\)*\.
- •Less Than 10% Remaining:Percentage of tasks where≤10%\{\\leq\}10\\%of total receipt amount remained unallocated\.
- •At Least One Itemized:Percentage of tasks where at least one line item was successfully created\.
- •Percentage Amount Itemized:Average percentage of total receipt amount correctly allocated across all tasks\.
- •Total Token Usage:Total tokens \(input \+ output\) consumed across the 50\-task benchmark\.
- •Execution Time:Total wall\-clock time to complete all 50 tasks\.
Completely Itemized is designated the primary metric because it reflects genuine business task completion: in production ERP systems, any nonzero remaining amount prevents expense report finalization regardless of how much was correctly allocated\. Metrics are computed by an independent read\-back of the saved form state in D365 F&O \(not from the agent’s self\-report\); the exact extraction and scoring logic, including theremaining\_amount/itemized\_amountcomparison against the ground\-truthPurchasedItems, is given in Appendix[C](https://arxiv.org/html/2606.10209#A3)\.
### 3\.6Dataset
The benchmark consists of 50 hotel expense itemization tasks executed in D365 F&O via the MCP proxy\. Hotel receipts represent an intentionally challenging evaluation domain: they frequently contain multiple line items sharing the same subcategory name but carrying different amounts \(e\.g\., two “Hotel Tax” charges at different rates\), non\-trivial subcategory mappings to D365 F&O’s fixed vocabulary \(a 23\-entry subcategory catalog\), and a strict zero\-residual completion requirement\. Tasks range from 4 to 23 itemization lines \(median: 8\), and the same 50\-task benchmark was used identically across all four configurations to ensure fair comparison\. Per\-category dataset statistics and structural notes are given in Appendix[E](https://arxiv.org/html/2606.10209#A5)\.
Each task is issued to the GPT\-5 agent as a two\-part prompt\. The\#Tasksection specifies the D365 F&O action \(creating an expense line underTrvExpenseLines\), and the\#Datasection provides a structured receipt payload—company, merchant, date, total, expense category, and aPurchasedItemslist that serves as ground truth for evaluation\. For C2–C4, the user model additionally receives auser\_contextdefining the completion protocol \(Section[3\.3](https://arxiv.org/html/2606.10209#S3.SS3)\)\. Figure[1](https://arxiv.org/html/2606.10209#S3.F1)shows a representative task drawn directly from the benchmark\.
\#Task:Create a new expense line in the USMF company underTrvExpenseLinesin Dynamics 365 F&O\. Add the itemizations listed inPurchasedItemsunder the itemize section properly\.\#Data:PurchasedItems:
Figure 1:Representative task from the 50\-task hotel expense benchmark\. The agent must create five itemization lines in D365 F&O and achieve remaining = $0\.00\. Two challenge patterns are visible: \(1\) “Hotel Tax” appears twice with different amounts, and \(2\) items require subcategory mapping — “Entertainment External”→\\rightarrowBusiness entertainment; “Room Service & Meals”→\\rightarrowRoom service\.These two patterns—repeated subcategory names with distinct amounts, and indirect item\-to\-subcategory mappings—are the primary sources of agent failure across the benchmark\. An agent that skips the second “Hotel Tax” entry because it recognises the subcategory as already present leaves $10\.23 unallocated; one that maps “Entertainment External” to the wrong subcategory introduces an incorrect line that cannot reconcile to zero\.
### 3\.7Comparison baselines
Within the controlled C2–C4 comparison, C2 \(full conversation history\) serves as the standard\-practice baseline and C3 \(recency pruning without summarization\) is the ablation that isolates the contribution of summarization\. To situate our recency\-plus\-summarization policy against an alternative class of context\-management strategy, we additionally evaluate a*full\-history summarization*policy that compacts*all*evicted context without a fixed recency window—the compaction style of\(Anthropic,[2025](https://arxiv.org/html/2606.10209#bib.bib13); Kang and others,[2025](https://arxiv.org/html/2606.10209#bib.bib10)\)\. This policy is reported in the sensitivity analysis \(Section[4\.7](https://arxiv.org/html/2606.10209#S4.SS7)\) as theW=−1W\{=\}\{\-\}1configuration\. An orthogonal direction—*importance\-pruning*that retains theNNtool pairs most recently*referenced*by the agent rather than theNNmost recent in time—is discussed qualitatively in Section[6](https://arxiv.org/html/2606.10209#S6)as a candidate for future work\.
## 4Results
Table[2](https://arxiv.org/html/2606.10209#S4.T2)presents performance and efficiency metrics for all four configurations\. All values are averages across 5 independent experimental runs on the 50\-task benchmark\. Section[4\.6](https://arxiv.org/html/2606.10209#S4.SS6)reports dispersion, confidence intervals, and effect\-size analysis; Sections[4\.7](https://arxiv.org/html/2606.10209#S4.SS7)–[4\.9](https://arxiv.org/html/2606.10209#S4.SS9)report sensitivity, multi\-category, and cross\-model results; and Section[5](https://arxiv.org/html/2606.10209#S5)gives the failure taxonomy\.
Table 2:Performance and efficiency metrics across four GPT\-5 context engineering configurations on the 50\-task hotel expense benchmark, averaged across 5 independent runs\.Comp\. Item\.= Completely Itemized \(primary metric\);<<10% Rem\.= Less than 10% Remaining;≥\\geq1 Item\.= At Least One Itemized;%Amt Item\.= Percentage Amount Itemized;Total/Input/Output Tok\.= token counts in thousands;Time= benchmark wall\-clock time in hours\. TC = tool calls; Sum\. = Summarization\. Bold indicates best performance per metric\. Per\-run dispersion and 95% CIs for the primary metric are in Table[3](https://arxiv.org/html/2606.10209#S4.T3); full per\-metric mean±\\pmSD across all categories and configurations is in Table[4](https://arxiv.org/html/2606.10209#S4.T4)\.Per\-metric performance bar charts and token/time efficiency panels are visualized in Appendix Figures[2](https://arxiv.org/html/2606.10209#A7.F2)and[3](https://arxiv.org/html/2606.10209#A7.F3)\.
### 4\.1C1 \(no user model\): motivating the user model
C1 \(no user model\) achieved only 8\.0% complete itemization and 58\.89% average amount itemized, despite 99\.6% of tasks having at least one line item created\. This stark gap between at\-least\-one\-itemized \(99\.6%\) and completely\-itemized \(8\.0%\) reveals that GPT\-5 can initiate itemization actions from MCP tool descriptions and the agent prompt, but in the non\-interactive harness it does not reliably drive the workflow to completion without a user\-model participant to answer follow\-up questions and reinforce the completion protocol\. Agents frequently terminate after creating one or two line items, uncertain of the next required action or stopping condition\. Token usage \(532,600 total; 1,331 output\) reflects abbreviated task executions resulting from early termination\. As emphasized in Section[3\.3](https://arxiv.org/html/2606.10209#S3.SS3), C1 removes both the user model and any context policy and is therefore reported only as motivation; the context\-engineering claims below rest on C2–C4\.
### 4\.2Full context: performance gains with severe efficiency cost
Adding the user model with full task instructions and complete conversation history \(C2\) dramatically improved performance: complete itemization rose to 71\.0% and average amount itemized to 92\.03%\. The at\-least\-one\-itemized rate reached 100%, confirming that task instructions reliably orient the agent\.
However, full\-context retention introduced severe efficiency costs\. Total tokens increased to 1,480,996—a 177\.9% increase over baseline—driven almost entirely by input tokens \(1,478,509\), reflecting the cumulative growth of conversation history as verbose tool responses accumulate\. Execution time rose to 14\.56 hours for the 50\-task benchmark \(4\.73×\\timesthe baseline\), making this configuration impractical for production deployment at scale\. The input\-to\-output token ratio of 594\.7:1 confirms that context management strategies targeting input reduction will yield the greatest efficiency gains\.
### 4\.3Context pruning: simultaneous performance and efficiency improvement
Pruning context to the last 5 tool call/response pairs \(C3\) simultaneously improved both performance and efficiency relative to full context\. Complete itemization rose to 79\.0%—an 8 percentage\-point absolute improvement over C2 \(11\.3% relative\)—and average amount itemized improved to 96\.92%\.
Total tokens dropped to 535,274—a 63\.9% reduction from full context, essentially equivalent to the baseline token budget—while execution time fell to 5\.39 hours \(63\.0% reduction from C2\)\. The finding that context\-pruned configurations outperform full\-context retention on*both*performance and efficiency is counterintuitive but interpretable: older tool interactions describe superseded form states\. Retaining stale form state introduces noise that degrades the agent’s understanding of current system state, leading to incorrect field assignments or navigation errors\. Restricting context to the last 5 tool calls directs agent attention to the current form state and recent actions—the information most relevant to the next decision\.
Output tokens \(2,515\) are substantially higher than the baseline \(1,331\), reflecting more complete task execution and richer agent reasoning when clear task context is provided\.
### 4\.4Summarization: best performance with marginal token overhead
Configuration C4 \(last 5 tool calls \+ summarization, window = 3\) achieved the best performance across all metrics: 91\.6% complete itemization, 99\.6% less\-than\-10%\-remaining, 100\.0% at\-least\-one\-itemized, and 99\.64% average amount itemized\. This represents a 12\.6 percentage\-point improvement over context pruning alone \(79\.0%\) and a 20\.6 percentage\-point improvement over full context \(71\.0%\)\.
Total tokens \(553,374\) increased only 3\.4% over C3 \(535,274\), while execution time increased marginally to 5\.79 hours \(\+7\.4% over C3\)\. The compact summary of earlier task progress adds minimal overhead while preserving the situational awareness the agent needs: which forms have been opened, which controls were interacted with, and what data was entered into the expense report\. By condensing this history into a short assistant message, the agent retains awareness of prior actions even after their full tool\-response payloads have been pruned—reducing the premature\-termination failures observed in C3 without re\-introducing the verbose form snapshots that drive stale\-state errors in C2\.
The near\-perfect less\-than\-10%\-remaining rate \(99\.6%\) is particularly notable: summarization virtually eliminates cases of substantial task abandonment\. Only one task in 50 exceeded 10% remaining, compared to six tasks \(13%\) in C3 and thirteen tasks \(26%\) in C2\.
### 4\.5Token efficiency analysis
Across all configurations, input tokens represent 99\.75%–99\.87% of total token usage, confirming that context management strategies targeting input reduction yield the greatest efficiency gains\. Output tokens remain relatively stable across C2–C4 \(2,486–2,567\), indicating consistent reasoning depth regardless of context management approach\.
Full\-context C2 consumes 2\.68×\\timesthe tokens of the best\-performing C4 while achieving lower task completion—demonstrating that more context does not necessarily improve performance in tool\-heavy agentic workflows\. C3 and C4 achieve comparable token budgets to the baseline \(535K–553K vs\. 533K\) while improving complete itemization by 71–83 percentage points\.
### 4\.6Statistical analysis
We treat each task outcome on the primary metric as a Bernoulli trial and report dispersion in two complementary ways \(full methodology in Appendix[D](https://arxiv.org/html/2606.10209#A4)\)\. First, treating each of the 5 runs as one observation of the benchmark\-level success rate, we report the mean and sample standard deviation across runs and a 95% confidence interval using Student’sttwith 4 degrees of freedom \(t0\.975,4=2\.776t\_\{0\.975,4\}=2\.776\)\. Second, pooling all50×5=25050\\times 5=250task\-runs, we report a Wilson score interval for the binomial proportion, which—unlike the normal \(Wald\) approximation—remains valid for the near\-boundary rates observed here \(e\.g\., 8% and 91\.6%\)\.
Table[3](https://arxiv.org/html/2606.10209#S4.T3)reports both for the primary metric\.
Table 3:Primary metric \(Completely Itemized, %\) with run\-level mean±\\pmSD and pooled Wilson 95% confidence intervals\.The Wilson interval for C4 is cleanly separated from C3 \(\[87\.5,94\.4\]\[87\.5,94\.4\]vs\.\[73\.5,83\.6\]\[73\.5,83\.6\]\), giving strong pooled\-trial evidence that summarization improves over pruning alone\. The C2 and C3 intervals \(\[65\.1,76\.3\]\[65\.1,76\.3\]vs\.\[73\.5,83\.6\]\[73\.5,83\.6\]\) overlap slightly, so the pooled\-binomial lens alone is conservative on this step; the \+8\-point gap in run\-level means together with the run\-level dispersion below nonetheless points consistently in the same direction\. C2’s full\-context runs are moderately variable \(±4\.4\\pm 4\.4\), C3 shows the widest spread under aggressive pruning \(±8\.2\\pm 8\.2\), and C4’s pruning\+summarization is by far the most stable \(±1\.7\\pm 1\.7\)—consistent with summarization absorbing the variance that pruning alone exposes\.
#### Effect sizes vs\. noise\.
Because the same 50 tasks are used in every configuration, comparisons are naturally paired\. We do not rely on a paired hypothesis test here: the effect sizes \(71\.0%→\\to79\.0%→\\to91\.6%, i\.e\. \+8 and \+12\.6 percentage points\) are large relative to the per\-run dispersion in Table[3](https://arxiv.org/html/2606.10209#S4.T3), and the C3→\\toC4 Wilson intervals are cleanly separated on the pooled 250 task\-runs\. Together these make it unlikely that the headline ordering is attributable to run\-to\-run noise\.
### 4\.7Sensitivity to the pruning windowNNand summary windowWW
A central question is whether the headline result depends on the specific choicesN=5N\{=\}5andW=3W\{=\}3\. Appendix[H](https://arxiv.org/html/2606.10209#A8)reports a hyperparameter sweep overNNandWW; results plateau beyondN=5N\{=\}5and show no accuracy gain forW\>3W\{\>\}3at non\-trivial token cost, supporting the chosen operating point\.
### 4\.8Generalization across expense categories
To test whether the policy generalizes beyond hotel receipts, we replicate the full C1/C2/C3/C4 comparison on two additional grouped D365 F&O expense categories that differ in structural complexity:*Travel*\(car rental \+ flight; transportation receipts with base fare, taxes/fees, and ancillary charges\) and*Meals & Gifts*\(business meal \+ gift; discretionary social\-spend receipts with simpler line\-item structure\)\. Hotel remains the most structurally complex category \(room \+ nightly taxes \+ resort fees \+ incidentals\)\. Table[4](https://arxiv.org/html/2606.10209#S4.T4)reports all performance and efficiency metrics across all four configurations for each grouped category, ordered by structural complexity\.
Quality \(mean±\\pmSD\)Efficiency \(mean\)CategoryConfiguration𝒏\\boldsymbol\{n\}Comp\.<<10%≥\\geq1%AmtTotalInputOutputTimeItem\.Rem\.Item\.Item\.Tok\.\(K\)Tok\.\(K\)Tok\.\(K\)\(hrs\)HotelC1: GPT\-5 only \(no user\)508\.0±\\pm2\.537\.2±\\pm8\.499\.6±\\pm0\.958\.89±\\pm3\.0532\.6531\.31\.33\.08C2: GPT\-5 \+ User \(Full Context\)71\.0±\\pm4\.474\.0±\\pm3\.7100\.0±\\pm0\.092\.03±\\pm1\.11,481\.01,478\.52\.514\.56C3: GPT\-5 \+ User \(Last 5 TC\)79\.0±\\pm8\.287\.0±\\pm6\.1100\.0±\\pm0\.096\.92±\\pm1\.5535\.3532\.82\.55\.39C4: GPT\-5 \+ User \(Last 5 \+ Sum\.\)91\.6±\\pm1\.799\.6±\\pm0\.9100\.0±\\pm0\.099\.64±\\pm0\.1553\.4550\.82\.65\.79TravelC1: GPT\-5 only \(no user\)3020\.0±\\pm14\.330\.7±\\pm9\.550\.7±\\pm9\.245\.67±\\pm7\.8390\.0388\.81\.20\.71C2: GPT\-5 \+ User \(Full Context\)76\.0±\\pm7\.693\.3±\\pm4\.1100\.0±\\pm0\.097\.49±\\pm1\.31,050\.01,047\.82\.24\.42C3: GPT\-5 \+ User \(Last 5 TC\)86\.6±\\pm7\.094\.0±\\pm4\.3100\.0±\\pm0\.098\.62±\\pm1\.0367\.0365\.41\.61\.63C4: GPT\-5 \+ User \(Last 5 \+ Sum\.\)95\.0±\\pm1\.998\.3±\\pm1\.6100\.0±\\pm0\.099\.53±\\pm0\.3403\.0401\.51\.51\.75Meals& GiftsC1: GPT\-5 only \(no user\)3226\.9±\\pm15\.140\.6±\\pm13\.950\.6±\\pm11\.947\.57±\\pm12\.9280\.0279\.01\.01\.67C2: GPT\-5 \+ User \(Full Context\)75\.6±\\pm6\.895\.0±\\pm2\.8100\.0±\\pm0\.097\.70±\\pm0\.7750\.0748\.02\.03\.32C3: GPT\-5 \+ User \(Last 5 TC\)89\.4±\\pm6\.495\.0±\\pm5\.6100\.0±\\pm0\.098\.90±\\pm0\.7285\.0283\.51\.51\.23C4: GPT\-5 \+ User \(Last 5 \+ Sum\.\)96\.1±\\pm1\.499\.2±\\pm1\.3100\.0±\\pm0\.099\.67±\\pm0\.2295\.0293\.51\.51\.42
Table 4:Run\-level results across 5 independent runs for three grouped D365 F&O expense categories\.*Travel*pools car rental and flight receipts \(n=30n=30\);*Meals & Gifts*pools business meal and gift receipts \(n=32n=32\); Hotel remains the standalone primary benchmark \(n=50n=50\)\. Categories are ordered by structural complexity \(Hotel\>\>Travel\>\>Meals & Gifts\)\. For the four quality metrics \(all reported in %\) we report mean±\\pmstandard deviation across the 5 runs; for the four efficiency metrics—input/output/total tokens in thousands \(K\) and wall\-clock time in hours—we report the mean only, as run\-to\-run variance on these is negligible\. Hotel means reproduce Table[2](https://arxiv.org/html/2606.10209#S4.T2); SDs and all non\-Hotel cells are computed from the per\-run JSONL result files\. Column definitions follow Table[2](https://arxiv.org/html/2606.10209#S4.T2)\.The C1→\\toC2→\\toC3→\\toC4 ordering predicted by our context\-engineering hypothesis holds in every category\. Complete itemization rises monotonically across configurations in all three settings, with the C2→\\toC4 improvement remarkably consistent across categories \(Hotel:71\.0→91\.671\.0\\to 91\.6,\+20\.6\+20\.6pts; Travel:76\.0→95\.076\.0\\to 95\.0,\+19\.0\+19\.0pts; Meals & Gifts:75\.6→96\.175\.6\\to 96\.1,\+20\.5\+20\.5pts\)\. The efficiency story is similarly stable: pruning\-plus\-summarization reduces total tokens by 60–63% relative to full context and cuts wall\-clock time by 57–60% across all three categories\. As predicted by the structural gradient \(Appendix[E](https://arxiv.org/html/2606.10209#A5)\), the magnitude of the C1→\\toC2 gap shrinks with receipt complexity—from\+63\.0\+63\.0pts on Hotel to\+56\.0\+56\.0pts on Travel and\+48\.7\+48\.7pts on Meals & Gifts—because simpler receipts give the no\-user\-model agent fewer opportunities to stall, but the context\-engineering ordering above C2 is preserved\. These results indicate that the policy is not specific to multi\-night hotel itemization but generalizes to the broader class of D365 F&O expense workflows with verbose MCP tool responses\.
### 4\.9Cross\-model generalization: Claude Sonnet 4\.5
To test whether the policy is model\-agnostic, we repeat the comparison with Claude Sonnet 4\.5 as the agent on the 50\-task hotel benchmark; full results are in Appendix[I](https://arxiv.org/html/2606.10209#A9)\. Two findings stand out\. First, Sonnet 4\.5 does*not*stall in the non\-interactive harness, so its no\-context\-engineering baseline already reaches 88\.0% complete itemization—in sharp contrast to GPT\-5’s 8\.0%\. This supports our claim \(Section[3\.3](https://arxiv.org/html/2606.10209#S3.SS3)\) that the large GPT\-5 C1→\\toC2 gap reflects a model\-specific stalling behavior rather than the value of context engineering\. Second, the context\-engineering ordering still holds: adding summarization on top of pruning improves complete itemization from 92\.0% to 94\.5% at a∼\\sim5\.6% wall\-clock premium, mirroring the small time overhead observed for GPT\-5 \(\+7\.4%\+7\.4\\%\)\.
## 5Failure analysis
We categorized every non\-completing task across the C2–C4 runs on the 50\-task hotel benchmark into one of six failure modes, derived from the tool\-call error logs and the read\-back discrepancy between the saved form state and the ground\-truthPurchasedItems\(Appendix[C](https://arxiv.org/html/2606.10209#A3)\)\. Representative transcripts for each mode are in Appendix[F](https://arxiv.org/html/2606.10209#A6)\.
Table 5:Failure modes by configuration \(count of task\-runs, pooled over 5 runs on the 50\-task hotel benchmark\)\.The taxonomy makes distinct predictions: C2 should over\-represent*stale\-state references*\(the agent acts on a superseded form snapshot\); C3 should reduce stale\-state errors but introduce*premature termination*\(the running balance is no longer visible\); and C4 should suppress premature termination by reinjecting a condensed history\. Table[5](https://arxiv.org/html/2606.10209#S5.T5)confirms both predictions: stale\-state references drop from 34/73 \(47%\) under C2 to 6/53 \(11%\) under C3, while premature termination triples \(9→\\to18\) and becomes C3’s dominant mode; C4 cuts premature termination six\-fold \(18→\\to3\) without re\-introducing stale\-state errors, yielding a 71% overall reduction in non\-completions \(73→\\to21\)\. The remaining modes—wrong subcategory mapping, duplicate or skipped repeats, tool/form navigation errors, and residual mismatches—are largely policy\-invariant and reflect model\-level reasoning and tool\-binding errors that context engineering does not directly target; wrong subcategory mapping in particular is amplified by the 23\-entry hotel subcategory catalog \(Appendix[E](https://arxiv.org/html/2606.10209#A5)\), where near\-synonyms such as Room tax vs\. Non\-Room tax and Restaurant vs\. Room service vs\. Loungebar create genuine label ambiguity independent of the context policy\. These residual modes bound the headroom for recency\- and summarization\-based policies and motivate the complementary techniques discussed in Section[6](https://arxiv.org/html/2606.10209#S6)\.
## 6Discussion
### 6\.1Summary of findings
This study demonstrates that context engineering—selective retention of recent tool interactions combined with automated summarization—substantially improves both performance and efficiency for GPT\-5 agents in long\-context agentic workflows with verbose tool responses\. On the 50\-task hotel expense benchmark averaged across 5 runs, C4 \(last\-5\-tool\-calls pruning \+ summarization\) achieved 91\.6% complete itemization and 99\.64% average amount itemized, compared to 71\.0% and 92\.03% with full\-context retention \(C2\), while consuming 62\.7% fewer tokens and completing the benchmark in 60\.2% less time\.
Holding the user model constant across C2–C4, the comparison isolates the context policy: \(i\) recency pruning provides noise reduction and efficiency gains \(C2→\\toC3:\+8\+8pp and−64%\-64\\%tokens\); and \(ii\) summarization provides residual task\-level awareness for near\-complete performance \(C3→\\toC4:\+12\.6\+12\.6pp at<4%<4\\%token overhead\)\. C1 is reported only to quantify why the user model is needed for GPT\-5 in a non\-interactive harness, a need that is model\-specific \(Section[4\.9](https://arxiv.org/html/2606.10209#S4.SS9)\)\.
### 6\.2Why context pruning outperforms full history retention
The finding that context\-pruned configurations outperform full\-context retention on both performance and efficiency reveals an important characteristic of LLM agents in tool\-heavy workflows: older context can be actively detrimental rather than merely redundant\. In D365 F&O expense itemization, early tool responses describe form states superseded as the workflow progresses\. An agent retaining full history may reference stale form values when making current decisions, leading to incorrect field assignments or erroneous navigation\. The failure taxonomy \(Section[5](https://arxiv.org/html/2606.10209#S5)\) is designed to test this mechanism directly via the relative frequency of stale\-state errors\.
Restricting context to the last 5 tool calls ensures agent attention is focused on current form state and recent actions—the information required for the next decision\. This aligns with the task’s working memory requirements: the agent needs to know what was just done, whether it succeeded, and the current remaining balance\. Five tool calls provide this working memory for approximately two complete itemization cycles, covering the immediate task horizon without accumulating irrelevant historical state\.
### 6\.3Complementary role of summarization
The performance gap between C3 \(79\.0%\) and C4 \(91\.6%\) reveals a complementary role for summarization that addresses the limitation of pure context pruning\. While pruning provides noise reduction and recent\-state focus, it can cause agents to lose task\-level awareness: how many items have been successfully itemized, what the total allocated amount is, and whether overall reconciliation is near completion\. Without this awareness, agents in C3 occasionally exhibit premature termination\.
The automated summarization in C4, using a window of 3 prior interactions to generate a compact progress report, bridges this gap at minimal token cost \(\+3\.4% over C3\)\. The summary provides two complementary information channels: recent tool calls supply current local state, while the summary supplies global task progress\. Together, they provide everything needed for reliable task completion\.
### 6\.4Limitations and future work
The core study focuses on hotel expense itemization in D365 F&O—an intentionally challenging tool\-use workflow with repeated subcategories, non\-trivial mappings, and a strict zero\-residual completion criterion—and we extend it with multi\-category coverage across five expense types grouped into three structurally distinct categories and cross\-model evidence on Claude Sonnet 4\.5\. Within this scope, the\(N=5,W=3\)\(N\{=\}5,\\,W\{=\}3\)operating point was selected for clarity of comparison; the sensitivity sweep in Appendix[H](https://arxiv.org/html/2606.10209#A8)confirms robustness across nearby values ofNNandWW, and joint tuning and adaptive \(per\-task\) window sizing are natural next steps\. Our summarizer is a single free\-form LLM pass, which establishes a clean baseline against which structured or learned compressors\(Kang and others,[2025](https://arxiv.org/html/2606.10209#bib.bib10)\)and provider\-native compaction APIs\(Anthropic,[2025](https://arxiv.org/html/2606.10209#bib.bib13)\)can be benchmarked head\-to\-head\. Broader generalization—to additional ERP domains beyond expense management, and across further model families, deployments, and decoding settings—constitutes a promising line of follow\-up work\. Production\-deployment economics and the precise scope of our generalizability claims are discussed further in Appendix[J](https://arxiv.org/html/2606.10209#A10)\.
### 6\.5Responsible AI considerations
Context engineering improves task reliability from 8\.0% to 91\.6% complete itemization; the residual 8\.4% incomplete rate means production deployments should retain human review for flagged cases\. The approach is interpretable by construction—explicitly retained tool calls and human\-readable summaries support debugging and auditing—and this work uses synthetic test cases and anonymized internal test data, so no privacy exposure is incurred\.
## References
- Effective context engineering for AI agents\.Note:[https://www\.anthropic\.com/engineering/effective\-context\-engineering\-for\-ai\-agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)Cited by:[§1](https://arxiv.org/html/2606.10209#S1.p2.1),[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px3.p1.1),[§3\.7](https://arxiv.org/html/2606.10209#S3.SS7.p1.3),[§6\.4](https://arxiv.org/html/2606.10209#S6.SS4.p1.3)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.Advances in Neural Information Processing Systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.10209#S1.p1.1)\.
- H\. Jiang, Q\. Wu, C\.\-Y\. Lin, Y\. Yang, and L\. Qiu \(2023\)LLMLingua: compressing prompts for accelerated inference of large language models\.arXiv preprint arXiv:2310\.05736\.Cited by:[§1](https://arxiv.org/html/2606.10209#S1.p1.1),[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.10209#S2.T1.2.2.4.1.1.1.1)\.
- M\. Kanget al\.\(2025\)ACON: optimizing context compression for long\-horizon LLM agents\.arXiv preprint arXiv:2510\.00615\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.10209#S2.T1.1.1.1.2.1.1),[§3\.7](https://arxiv.org/html/2606.10209#S3.SS7.p1.3),[§6\.4](https://arxiv.org/html/2606.10209#S6.SS4.p1.3)\.
- Y\. Li, Y\. Zhang, and L\. Sun \(2023\)Selective context: on\-demand context compression for long\-context language models\.arXiv preprint arXiv:2304\.12102\.Cited by:[§1](https://arxiv.org/html/2606.10209#S1.p2.1),[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.10209#S2.T1.2.2.4.1.1.1.1)\.
- S\. Liu, J\. Yang, B\. Jiang, Y\. Li, J\. Guo, X\. Liu, and B\. Dai \(2025\)Context as a tool: context management for long\-horizon SWE\-agents\.arXiv preprint arXiv:2512\.22087\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Maharana, D\.\-H\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fung \(2024\)Evaluating very long\-term conversational memory of LLM agents\.arXiv preprint arXiv:2402\.17753\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wang, Y\. Dong, D\. Zeng, Z\. Li, and M\. Sun \(2024\)LongMem: augmenting large language models with memory mechanism for long\-context understanding\.arXiv preprint arXiv:2407\.01917\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.10209#S2.T1.2.2.5.2.1.1.1)\.
- Z\. Wanget al\.\(2025\)MCP\-Bench: benchmarking tool\-using LLM agents with complex real\-world tasks via MCP servers\.arXiv preprint arXiv:2508\.20453\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\.\-W\. Chang, and D\. Yu \(2025\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.arXiv preprint arXiv:2410\.10813\.Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px4.p2.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\)MemoryBank: enhancing large language models with long\-term memory\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§2](https://arxiv.org/html/2606.10209#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.10209#S2.T1.2.2.5.2.1.1.1)\.
Appendix
## Appendix AReproducibility overview
This appendix collects the following artifacts: the tool inventory \([B](https://arxiv.org/html/2606.10209#A2)\), the metric\-computation/scoring logic \([C](https://arxiv.org/html/2606.10209#A3)\), the statistical methodology with a runnable helper \([D](https://arxiv.org/html/2606.10209#A4)\), per\-category dataset statistics \([E](https://arxiv.org/html/2606.10209#A5)\), and qualitative failure and summarization examples \([F](https://arxiv.org/html/2606.10209#A6)\), performance and efficiency figures \([G](https://arxiv.org/html/2606.10209#A7)\), sensitivity analysis for the pruning windowNNand summary windowWW\([H](https://arxiv.org/html/2606.10209#A8)\), cross\-model generalization results on Claude Sonnet 4\.5 \([I](https://arxiv.org/html/2606.10209#A9)\), and an extended discussion of efficiency and generalizability \([J](https://arxiv.org/html/2606.10209#A10)\)\.
## Appendix BTool inventory
The agent is exposed to 21 D365 F&O MCP tools, organized into three categories by capability: \(i\)*form tools*\(13\), providing UI\-level interaction with the F&O client—form and tab navigation, menu\-item and control discovery, reading and setting control values, opening lookups, grid filtering, sorting and row selection, clicking controls, and saving forms; \(ii\)*data tools*\(6\), providing entity\-level OData access—entity\-type discovery, metadata retrieval, and find/create/update/delete operations against F&O entities; and \(iii\)*API tools*\(2\), used to discover and invoke custom server\-side action menu items\.
## Appendix CMetric computation and scoring
Metrics are computed by an independent read\-back: after the itemization run, the saved expense line in D365 F&O is read and reduced to four values—total\_amount,itemized\_amount,remaining\_amount, andnum\_itemized—which are compared against the ground\-truthPurchasedItemstotal\. The per\-task metrics are then:
CompletelyItemized=𝟙\[remaining\_amount=0\.00\]\\displaystyle=\\mathds\{1\}\[\\,\\texttt\{remaining\\\_amount\}=0\.00\\,\]LessThan10%=𝟙\[remaining\_amounttotal\_amount≤0\.1\]\\displaystyle=\\mathds\{1\}\\\!\\left\[\\tfrac\{\\texttt\{remaining\\\_amount\}\}\{\\texttt\{total\\\_amount\}\}\\leq 0\.1\\right\]AtLeastOne=𝟙\[num\_itemized\>0\]\\displaystyle=\\mathds\{1\}\[\\,\\texttt\{num\\\_itemized\}\>0\\,\]%AmountItemized=100⋅itemized\_amounttotal\_amount\\displaystyle=100\\cdot\\tfrac\{\\texttt\{itemized\\\_amount\}\}\{\\texttt\{total\\\_amount\}\}Amounts are normalized \(currency symbols/commas stripped, cast to float\) before comparison\. Benchmark\-level numbers are the mean of the per\-task values over the 50 tasks, averaged again over the 5 runs\. Token totals sum agent and user\-modelusage; execution time is the run wall\-clock\.
## Appendix DStatistical methodology
#### Run\-level interval\.
For each configuration letp1,…,p5p\_\{1\},\\dots,p\_\{5\}be the per\-run success rates on a binary metric\. Reportp¯=15∑ipi\\bar\{p\}=\\frac\{1\}\{5\}\\sum\_\{i\}p\_\{i\}ands=14∑i\(pi−p¯\)2s=\\sqrt\{\\frac\{1\}\{4\}\\sum\_\{i\}\(p\_\{i\}\-\\bar\{p\}\)^\{2\}\}, with 95% CIp¯±t0\.975,4s5\\bar\{p\}\\pm t\_\{0\.975,4\}\\,\\frac\{s\}\{\\sqrt\{5\}\},t0\.975,4=2\.776t\_\{0\.975,4\}=2\.776\.
#### Pooled binomial \(Wilson\) interval\.
Poolingn=250n=250task\-runs withp^\\hat\{p\}successes, the Wilson score interval \(preferred over Wald near 0/1\) is
p^\+z22n±zp^\(1−p^\)n\+z24n21\+z2n,z=1\.96\.\\frac\{\\hat\{p\}\+\\frac\{z^\{2\}\}\{2n\}\\pm z\\sqrt\{\\frac\{\\hat\{p\}\(1\-\\hat\{p\}\)\}\{n\}\+\\frac\{z^\{2\}\}\{4n^\{2\}\}\}\}\{1\+\\frac\{z^\{2\}\}\{n\}\},\\quad z=1\.96\.
#### Paired comparisons\.
Because the same 50 tasks are used in every configuration, comparisons are naturally paired\. We do not rely on a hypothesis\-testpp\-value: the headline\-metric effect sizes \(\+8 and \+12\.6 percentage points\) are large relative to the per\-run SDs in Table[3](https://arxiv.org/html/2606.10209#S4.T3), and the C3→\\toC4 Wilson intervals are non\-overlapping\. We report the run\-leveltt\-interval and the pooled Wilson interval as the two primary inferential statistics\.
#### Helper\.
The following computes every value in Table[3](https://arxiv.org/html/2606.10209#S4.T3)from the per\-run JSONL result files\.
importjson,math
fromstatisticsimportmean,stdev
defper\_run\_rates\(run\_files,key="completely\_optimized"\):
rates=\[\]
forfinrun\_files:\#oneJSONLperrun
ys=\[json\.loads\(l\)\[key\]forlinopen\(f\)ifl\.strip\(\)\]
rates\.append\(100\*sum\(ys\)/len\(ys\)\)
returnrates
deft\_interval\(rates\):\#run\-level95%CI
m,s,n=mean\(rates\),stdev\(rates\),len\(rates\)
h=2\.776\*s/math\.sqrt\(n\)\#t\_\{\.975,4\}=2\.776forn=5
returnm,s,\(m\-h,m\+h\)
defwilson\(successes,n,z=1\.96\):\#pooledbinomialCI
p=successes/n
d=1\+z\*z/n
c=p\+z\*z/\(2\*n\)
half=z\*math\.sqrt\(p\*\(1\-p\)/n\+z\*z/\(4\*n\*n\)\)
return\(100\*\(c\-half\)/d,100\*\(c\+half\)/d\)
#### Full per\-metric dispersion across categories\.
The mega per\-metric mean±\\pmSD table covering all three grouped categories and all four configurations is reported in the main paper as Table[4](https://arxiv.org/html/2606.10209#S4.T4); the helper above is the script used to compute SDs and the non\-Hotel cells from the per\-run JSONL files\.
## Appendix EPer\-category dataset statistics
Table 6:Per\-category dataset characteristics\. TheF&O subcatscolumn reports the number of valid expense subcategories the agent must choose among in the D365 F&O catalog for that category\. Hotel is by far the most structurally complex: a 23\-entry subcategory catalog \(with near\-synonyms such as Room tax vs\. Non\-Room tax, Restaurant vs\. Room service vs\. Loungebar, Gift shop vs\. Gift certificates\) combined with multi\-night receipts that repeat the same subcategory per night\. Travel \(car rental: 10, flight: 4\) and Meals & Gifts \(3 and 4\) have much smaller catalogs and rarely repeat subcategories, which is why GPT\-5 with no user model achieves higher CIR on those categories \(∼\\sim27–40%\) than on hotel \(8%\)\. This structural gradient—Hotel\>\>Travel\>\>Meals & Gifts—motivates our choice of hotel as the primary benchmark\.#### Structural notes\.
- •Hotelreceipts span multi\-night stays with per\-night charges \(room, tax, resort fee, parking, breakfast\)\. The same subcategory recursnntimes for annn\-night stay, which is what makes line\-by\-line itemization hard: the agent must track*which night*each entry corresponds to and avoid duplicate/skipped repeats\.
- •Car rentalreceipts have a single primary line \(daily rate×\\timesdays\) plus optional add\-ons \(insurance, fuel, GPS\)\. Repeated subcategories are rare because totals are typically pre\-aggregated by the rental agency\.
- •Flightreceipts contain a base fare, taxes/fees, and optional seat/baggage upgrades—structurally flat, no repeats\.
- •Business mealandGiftreceipts are the simplest: typically 1–3 line items, single subcategory each, no repeats\.
## Appendix FQualitative failure and summarization examples
We provide one short trimmed transcript per failure mode from Table[5](https://arxiv.org/html/2606.10209#S5.T5), followed by one summarization string emitted by the C4 summarizer\. Speaker labels:A= agent,T= tool response \(truncated to the relevant fragment\)\.
#### \(1\) Stale\-state reference \(C2\)\.
A:get\_form\_state\(\)
T:\{lines:\[\{Room,$180\}\],total\_added:$180,remaining:$420\}
A:add\_line\(RoomTax,$40\)
T:\{ok,line\_id:7\}
A:get\_form\_state\(\)
T:\{lines:\[\{Room,$180\},\{RoomTax,$40\}\],total\_added:$220,remaining:$380\}
\.\.\.\(5moretoolcalls\)\.\.\.
A:\#referencestheFIRSTsnapshot
A:add\_line\(Room,$180\)\#duplicate\-\-formalreadyhasit
T:\{ok,line\_id:12\}
A:submit\_form\(\)
=\>Residual:\-$180\(overshoot,duplicateRoomline\)
#### \(2\) Wrong subcategory mapping \(C3\)\.
A:\#receiptline"Roomservice\-breakfast$28"
A:add\_line\(category=Meals\-Client,amount=28\)
T:\{ok\}
=\>Groundtruth:Hotel\-Breakfast\.Misclassified\.
#### \(3\) Duplicate / skipped repeat item \(C2\)\.
A:\#3\-nightstay,receiptshowsRoom$180x3
A:add\_line\(Room,$180\)\#night1
A:add\_line\(Room,$180\)\#night2
A:submit\_form\(\)\#skippednight3
=\>Residual:$180unaccounted\(1of3roomnightsmissing\)
#### \(4\) Premature termination \(C3\)\.
A:get\_form\_state\(\)
T:\{total\_added:$340,remaining:$240\}
A:add\_line\(Parking,$45\)
T:\{ok\}
\#pruningwindowdropstheearlier"remaining:$240"line
A:\#agentnolongerseestherunningbalance
A:submit\_form\(\)
=\>Residual:$195stillowed\.Formsubmittedincomplete\.
#### \(5\) Tool / form navigation error \(C2\)\.
A:add\_line\(category=Hotel\-Resort\_Fee,amount=25\)
T:ERROR:invalidcategoryenum’Hotel\-Resort\_Fee’
\(expectedoneof:Hotel\-ResortFee,Hotel\-Parking,\.\.\.\)
A:add\_line\(category=ResortFee,amount=25\)
T:ERROR:invalidcategoryenum’ResortFee’
A:submit\_form\(\)
=\>Residual:$25\.Agentcouldnotrecoverfromenummismatch\.
#### \(6\) Residual amount mismatch \(C4\)\.
A:\#receipttotal$580\.42
A:add\_line\(Room,$480\)
A:add\_line\(Tax,$72\.05\)
A:add\_line\(Parking,$28\)
A:submit\_form\(\)
=\>Residual:$0\.37\(rounding/taxlineerror\-\-arithmeticoff\)
#### Real C4 summarization output\.
The following is a representative summary string emitted by the C4 summarizer after the first summarization window of 3 tool calls during a hotel run:
Summaryofprevioustoolcalls:
\-OpenedtheExpensereportformandnavigatedtoexpense
reportER\-00184\(hotelcategory,receipttotal$612\.40\)\.
\-Clickedthe"Itemize"buttonandopenedtheitemization
sub\-formforthehotelline\.
\-AddedaHotel\-Roomlinewithamount$180\.00viathe
add\_linecontrol\.
\-AddedaHotel\-Taxlinewithamount$14\.40viathe
add\_linecontrol\.
This is the mechanism by which C4 suppresses premature termination: by condensing prior actions into a short assistant message, the agent retains awareness of which lines have already been added and roughly how much of the receipt has been accounted for, even after the verbose raw tool responses have been pruned\. The summarizer is intentionally generic \(it describes forms, controls, buttons, and entered data\) rather than computing a running balance explicitly—yet this is still enough signal to push the agent past the premature\-termination threshold observed in C3\.
## Appendix GPerformance and efficiency figures
C1: No userC2: Full ContextC3: Last 5 TCC4: Last 5\+Sum\.02020404060608080100100Percentage \(%\)Completely Itemized \(primary\)<<10% Remaining% Amount ItemizedFigure 2:Performance metrics across four context engineering configurations, averaged over 5 independent runs on the 50\-task hotel expense benchmark\. C1 = GPT\-5 only \(no user model\); C2 = Full conversation history; C3 = Last 5 tool calls \(TC\); C4 = Last 5 TC \+ Summarization \(window = 3\)\. C4 achieves the best performance across all three metrics\. Completely Itemized \(blue\) is the primary metric, reflecting genuine business task completion \(remaining amount = $0\.00\)\.C1C2C3C405005001,0001\{,\}0001,5001\{,\}500532\.6532\.61,4811\{,\}481535\.3535\.3553\.4553\.4Total Tokens \(K\)\(a\)Total token usage per benchmark \(thousands of tokens\)\. C2 \(Full Context\) consumes 2\.68×\\timesmore tokens than C4 while achieving lower task completion\.C1C2C3C4055101015153\.083\.0814\.5614\.565\.395\.395\.795\.79Execution Time \(hours\)\(b\)Benchmark wall\-clock execution time \(hours\)\. C4 completes the 50\-task benchmark 2\.51×\\timesfaster than C2 while achieving higher task completion\.
Figure 3:Efficiency metrics across the four configurations \(averaged over 5 runs, 50\-task benchmark\)\. C1 = GPT\-5 only \(no user\); C2 = Full Context; C3 = Last 5 Tool Calls; C4 = Last 5 \+ Summarization\. Left panel: total token usage in thousands; input tokens dominate in all configurations \(\>\>99\.7% of total\)\. Right panel: wall\-clock benchmark completion time in hours\. C3 and C4 achieve comparable token budgets to the C1 token budget while dramatically improving task completion\.
## Appendix HSensitivity to pruning windowNNand summary windowWW
We sweep each hyperparameter while holding the other fixed \(all with the user model present\) to test whether the headline result depends on the specific choicesN=5N\{=\}5andW=3W\{=\}3\. Table[7](https://arxiv.org/html/2606.10209#A8.T7)reports the primary metric and total tokens\.
Table 7:Sensitivity of the primary metric and token cost to the pruning windowNN\(top\) and summary windowWW\(bottom\)\. Bold rows in the original study areN=5,W=0N\{=\}5,W\{=\}0\(C3\) andN=5,W=3N\{=\}5,W\{=\}3\(C4\)\.#### Interpretation\.
The pruning sweep shows that complete itemization plateaus aroundN=5N\{=\}5: dropping toN=3N\{=\}3costs∼\\sim5 pts of accuracy, while extending toN=10N\{=\}10buys less than 1 pt at the cost of∼\\sim53% more tokens, and the unbounded variant \(N=∞N\{=\}\\infty, C2\) is strictly worse thanN=5N\{=\}5on both axes\. The summary sweep is similarly flat aboveW=3W\{=\}3:W=5W\{=\}5and full\-history summarization add 4–11% token cost overW=3W\{=\}3without a meaningful accuracy gain, whileW=1W\{=\}1underperforms by∼\\sim5 pts\. Together these confirm that\(N=5,W=3\)\(N\{=\}5,W\{=\}3\)sits at the knee of both curves—further context buys diminishing returns and tighter windows lose the bookkeeping that prevents premature termination\.
## Appendix ICross\-model generalization: Claude Sonnet 4\.5
Table 8:Claude Sonnet 4\.5 on the 50\-task hotel benchmark\. All three configurations are run*without*a user model: Sonnet does not stall in the non\-interactive harness, so the user\-model rescue required for GPT\-5 is unnecessary\.No CEis the no\-context\-engineering baseline \(full history\);Pruningapplies last\-5 tool\-call pruning;Pruning \+ Summ\.adds summarization on top of pruning\. We use descriptive labels rather than C\-numbers to avoid implying a one\-to\-one correspondence with GPT\-5’s configurations in Table[4](https://arxiv.org/html/2606.10209#S4.T4)\. Time in hours; Tokens reported as total \(K=K=thousands\)\.As shown in Table[8](https://arxiv.org/html/2606.10209#A9.T8), summarization buys accuracy at a consistent∼\\sim6–7% time premium relative to pure pruning across both models—evidence that the per\-step summarization LLM call cost is model\-stable\.
## Appendix JExtended discussion: efficiency and generalizability
### Efficiency implications for production deployment
The token efficiency results have direct implications for production deployment economics\. Full\-context agents \(C2\) consume 1,480,996 tokens per 50\-task benchmark versus 553,374 for C4—a 2\.68×\\timesdifference translating directly to inference cost\. At scale across thousands of expense reports monthly, this gap represents substantial operational savings\. The 14\.56\-hour versus 5\.79\-hour execution time difference further impacts throughput and user\-facing latency\.
C3 \(pruning without summarization\) represents a secondary operating point for cost\-sensitive deployments: 79\.0% task completion at baseline\-equivalent token cost, compared to 71\.0% for full context at 2\.77×\\timesthe cost\.
### Generalizability and scope
We deliberately scope our claims\. The evidence here is strong for one class of enterprise tool\-use workflow: structured, single\-session, form\-driven tasks with verbose tool responses and a hard completion criterion\. The context engineering techniques evaluated here are inference\-time only and require no modification to the underlying LLM, and our cross\-model \(Appendix[I](https://arxiv.org/html/2606.10209#A9)\) and multi\-category \(Section[4\.8](https://arxiv.org/html/2606.10209#S4.SS8)\) results test how far the policy carries\. The same principles are expected to extend to other frontier models and to enterprise agentic domains with similarly verbose tool responses—CRM, supply chain automation, IT service management, healthcare administration—and characterizing the per\-domain optimal window size, including adaptive sizing driven by task complexity or error signals, is a natural next step\.Similar Articles
Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.
@omarsar0: // The Efficiency Frontier // Cool paper on context management. As agents reuse the same documents and histories across…
This paper introduces The Efficiency Frontier, a unified framework for cost–performance optimization in LLM context management that models context strategy selection as a deployment-aware optimization problem, achieving 25% reduction in token usage and over 50% lower token cost with amortized memory compression compared to full-context prompting.
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
This paper introduces GenericAgent, a self-evolving LLM agent system designed to maximize context information density. It addresses long-horizon limitations through hierarchical memory, reusable SOPs, and efficient compression, achieving better performance with fewer tokens compared to leading agents.
Learning Agent-Compatible Context Management for Long-Horizon Tasks
Introduces AdaCoM, an external LLM-based context manager for frozen agents, using reinforcement learning to improve long-horizon task performance by preserving task constraints and pruning stale content, with experiments on web search and deep research benchmarks.
Effective context engineering for AI agents
Anthropic publishes a guide defining context engineering as the evolution of prompt engineering, focusing on curating optimal context tokens for AI agents to maintain performance and focus during multi-turn inference.