ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning


Summary

ActiveMem introduces a distributed active memory system that decouples agent memory from the core LLM reasoning process, achieving state-of-the-art accuracy on long-horizon tasks with significantly reduced overhead.

arXiv:2606.10532v1 Announce Type: new Abstract: Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

Cached at: 06/10/26, 06:15 AM

# ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning
Source: [https://arxiv.org/html/2606.10532](https://arxiv.org/html/2606.10532)
Yunhan Jiang¹,², Wenbin Duan¹,², Shasha Guo¹\*, Liang Pang¹\*, Xiaoqian Sun¹, Huawei Shen¹

¹State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; ²University of Chinese Academy of Sciences

jiangyunhan20@mails.ucas.ac.cn, {duanwenbin25e, guoshasha, pangliang, sunxiaoqian, shenhuawei}@ict.ac.cn

###### Abstract

Memory is essential for enabling large language model \(LLM\) agents to handle long\-horizon reasoning tasks\. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context\. This design imposes a fundamental trade\-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss\. Seeking a better trade\-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex \(executive control\) and the hippocampus \(memory management\), suggesting that such a trade\-off need not be inherent, but may instead stem from centralized memory organization\. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process\. Specifically, a high\-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task\. Experiments on BrowseComp\-Plus and GAIA show that ActiveMem achieves state\-of\-the\-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long\-horizon reasoning\.

\*Corresponding authors: Shasha Guo and Liang Pang.

## 1 Introduction

LLM agents have demonstrated remarkable capabilities in executing long-horizon reasoning tasks through sustained, multi-step interactions (Yao et al., [2023](https://arxiv.org/html/2606.10532#bib.bib5); Nakano et al., [2021](https://arxiv.org/html/2606.10532#bib.bib16); Wang et al., [2024](https://arxiv.org/html/2606.10532#bib.bib30)). However, within these intricate workflows, the continuous expansion of interaction contexts makes working memory management a critical bottleneck (Zhang et al., [2025b](https://arxiv.org/html/2606.10532#bib.bib29); Hu et al., [2025b](https://arxiv.org/html/2606.10532#bib.bib32)). An effective working memory must therefore selectively retain task-relevant information and anchor the model's attention on pivotal tokens while compressing the active context window, thereby enabling agents to navigate complex, long-horizon tasks successfully.

![Refer to caption](https://arxiv.org/html/2606.10532v1/x1.png)

Figure 1: ActiveMem outperforms both modern centralized memory agents and vanilla ReAct LLMs in LLM-as-a-Judge accuracy while achieving substantially lower computational cost.

![Refer to caption](https://arxiv.org/html/2606.10532v1/x2.png)

Figure 2: Comparison between (a) Coupled Centralized Memory and (b) our proposed Decoupled Distributed Memory (ActiveMem). In the centralized paradigm, existing approaches manage context growth by selectively retaining memories or compressing them into step-level summaries, trading information completeness for a bounded context. ActiveMem takes a different path: evidence is routed to parallel Memorizers that produce distilled semantic gists, stored persistently in Memory Shards and coordinated by an Operator, enabling the Planner to reason over a consistently compact context without discarding the underlying information.

Despite this necessity, most current reasoning systems follow a centralized architecture where memory is tightly bound to a single core reasoner. In ReAct-style agents, for instance, retrieved information and intermediate trajectories continuously accumulate within the same model context window (Yao et al., [2023](https://arxiv.org/html/2606.10532#bib.bib5)). As reasoning chains extend, this centralization inevitably triggers severe context overload (Levy et al., [2024](https://arxiv.org/html/2606.10532#bib.bib3); An et al., [2024](https://arxiv.org/html/2606.10532#bib.bib4)) and the lost-in-the-middle phenomenon (Liu et al., [2024](https://arxiv.org/html/2606.10532#bib.bib17); Shi et al., [2023](https://arxiv.org/html/2606.10532#bib.bib1)), thereby undermining reasoning performance. To mitigate this issue, contemporary approaches introduce various context compression mechanisms (Sun et al., [2025](https://arxiv.org/html/2606.10532#bib.bib8); Ye et al., [2025](https://arxiv.org/html/2606.10532#bib.bib9); Qian et al., [2026](https://arxiv.org/html/2606.10532#bib.bib13); Wu et al., [2025](https://arxiv.org/html/2606.10532#bib.bib39); Zhou et al., [2025](https://arxiv.org/html/2606.10532#bib.bib38); Zhang et al., [2025a](https://arxiv.org/html/2606.10532#bib.bib42); Mei et al., [2025](https://arxiv.org/html/2606.10532#bib.bib33)). However, these strategies invariably incur permanent information loss, either by dropping older memories entirely or by compressing them into coarse step-level summaries. This leaves memory content irreversibly degraded and unrecoverable for subsequent reasoning steps (Figure [2](https://arxiv.org/html/2606.10532#S1.F2)(a)). This dilemma exposes a fundamental limitation of centralized memory designs: memory storage and reasoning computation are tightly coupled, creating an inherent trade-off between trajectory scaling and memory fidelity.

Drawing inspiration from human cognitive systems, we argue that this seemingly inherent trade-off stems fundamentally from the limitations of centralized memory organization. The human brain circumvents this bottleneck through the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management). The prefrontal cortex (PFC), serving as the master executive controller, issues top-down executive signals to guide retrieval rather than acting as a massive repository for detailed memory content (Lara and Wallis, [2015](https://arxiv.org/html/2606.10532#bib.bib18)). Complementarily, the hippocampus executes parallel pattern completion, integrating these executive signals to reactivate and distribute holistic, abstract information across the neocortex (Horner et al., [2015](https://arxiv.org/html/2606.10532#bib.bib23)), frequently trading granular episodic details for structured, distilled semantic gists (Hindy et al., [2026](https://arxiv.org/html/2606.10532#bib.bib19)). This biological mechanism suggests a promising design paradigm: structurally decoupling memory systems from reasoning processes.

Motivated by the above insight, we introduce ActiveMem, a heterogeneous framework that implements a decoupled and distributed active memory architecture to overcome the limitations of centralized paradigms. Specifically, ActiveMem consists of two primary modules: a high-level Planner and a Distributed Memory System. The Planner handles reasoning and top-down query generation, remaining exclusively focused on executing core reasoning chains over compact context windows. Complementing this, the Distributed Memory System replaces monolithic context buffers with a parallelized and sharded architecture that is inherently lightweight and active. This architecture comprises three tightly coordinated components: (1) Memorizers, which exploit the inherent parallelizability of information processing to concurrently process retrieved documents and extract distilled semantic gists; (2) Memory Shards, which actively partition, persistently store, and consolidate these gists across localized nodes throughout the task lifecycle; and (3) an Operator, which dynamically orchestrates proactive routing and semantic reuse across the entire shard network. As illustrated in Figure [2](https://arxiv.org/html/2606.10532#S1.F2)(b), this collaborative design keeps the Planner's context horizon bounded and clean while preserving document-level insights, thereby substantially mitigating the trade-off between trajectory scaling and memory fidelity.

Contributions. (1) Neuroscience-inspired cognitive decoupling. We propose a decoupled memory-reasoning paradigm inspired by the functional synergy between the prefrontal cortex and hippocampus. This architecture frees the centralized reasoning core by segregating executive control from distributed memory consolidation. (2) The ActiveMem framework. We introduce ActiveMem, a heterogeneous framework that materializes this paradigm into a distributed and active memory architecture. It empowers Memorizers to concurrently process retrieved documents and synthesize consolidated semantic gists, which are dynamically maintained across localized shards throughout the task lifecycle. (3) Superior accuracy with lower computational cost. ActiveMem achieves the highest LasJ accuracy among nine baselines while incurring substantially lower computational cost in terms of PFLOPs, as shown in Figure [1](https://arxiv.org/html/2606.10532#S1.F1).

## 2 Related Work

Memory management has become a central challenge in LLM agents. Existing centralized approaches can be grouped into three forms. The first is vanilla centralized memory, where raw trajectories or retrieved documents are directly fed back into one model context, as in ReAct-style agents (Yao et al., [2023](https://arxiv.org/html/2606.10532#bib.bib5)). The second targets long-term dialogue management: methods such as A-MEM (Xu et al., [2025](https://arxiv.org/html/2606.10532#bib.bib6)), Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2606.10532#bib.bib7)), MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.10532#bib.bib34)), MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2606.10532#bib.bib35)), Memory-R1 (Yan et al., [2025](https://arxiv.org/html/2606.10532#bib.bib40)), and Mem-α (Wang et al., [2025](https://arxiv.org/html/2606.10532#bib.bib41)) focus on preserving user interaction histories across extended conversations rather than supporting tool-augmented long-horizon reasoning. The third addresses long-horizon reasoning more directly, either through compression-based strategies that summarize past trajectories or raw documents (Sun et al., [2025](https://arxiv.org/html/2606.10532#bib.bib8); Ye et al., [2025](https://arxiv.org/html/2606.10532#bib.bib9); Yu et al., [2025](https://arxiv.org/html/2606.10532#bib.bib10); Zhang et al., [2025a](https://arxiv.org/html/2606.10532#bib.bib42)), or through structured organization over working memory (Hu et al., [2025a](https://arxiv.org/html/2606.10532#bib.bib12); Qian et al., [2026](https://arxiv.org/html/2606.10532#bib.bib13)). These methods improve how agents manage working memory and advance the ability to handle long-horizon tasks, but memory and reasoning remain tightly coupled within one central reasoning node, creating a trade-off between retaining sufficient detail and maintaining efficient inference.

A smaller but growing line of work explores distributed memory across multiple agents or modules. MIRIX (Wang and Chen, [2025](https://arxiv.org/html/2606.10532#bib.bib15)) introduces a modular multi-agent memory system that assigns memory consolidation to specialized controllers organized by memory type, such as episodic, semantic, and procedural memory. However, this form of distribution follows a memory taxonomy rather than the evolving information needs of the reasoning process. Its memory modules therefore do not actively distill task-relevant information under Planner-issued sub-queries, limiting their effectiveness for long-horizon reasoning tasks where relevant evidence must be dynamically interpreted and selectively surfaced to a central reasoner.

Our work differs from both lines of research\. Centralized methods keep memory and reasoning tightly coupled within a single model context\. Methods that compress or truncate this context can reduce overload, but may discard information needed for later reasoning\. Existing distributed methods, like MIRIX, organize memory by memory type rather than by the information needs of the reasoning process, which limits their ability to support long\-horizon tasks that require dynamic document selection\. ActiveMem addresses this limitation by decoupling memory formation from reasoning\. Parallel Memorizers process raw documents under queries issued by the Planner and produce query\-conditioned memory summaries, which are stored in persistent Memory Shards and selectively returned to the Planner when needed\.

## 3 Method

We introduce ActiveMem, a distributed memory framework that decouples memory management from high-level reasoning. The Planner generates retrieval queries and integrates the returned semantic gists to guide subsequent reasoning or produce the final answer. The Distributed Memory System consists of persistent Memory Shards, lightweight Memorizers, and an Operator: Memorizers process retrieved documents in parallel and produce distilled semantic gists; Memory Shards actively accumulate and consolidate these gists throughout the task; and the Operator coordinates routing, memory reuse, and consolidation across shards. Figure [3](https://arxiv.org/html/2606.10532#S3.F3) illustrates the overall architecture of ActiveMem.

![Refer to caption](https://arxiv.org/html/2606.10532v1/x3.png)

Figure 3: Overview of our ActiveMem framework. The Planner issues retrieval queries $\mathcal{Q}_t$ to recall documents from an external corpus. Each document is paired with its query to form a memory task and routed to the appropriate shard. For repeated documents, the Operator checks semantic similarity to a prior task: if similar, the stored gist is returned directly; otherwise, the Memorizer distills a new gist for the Planner while the Operator consolidates it into the shard asynchronously. This decouples memory from reasoning: the memory system continuously accumulates gists in persistent shards, while the Planner operates over a clean, compact distilled context.

### 3.1 Planner

The Planner maintains a compact reasoning state $s_t = (x, h_t, m_{t-1})$, where $x$ is the original question, $h_t$ records the trimmed interaction history, and $m_{t-1}$ contains the distilled memory returned from the previous step. To keep the reasoning context bounded, ActiveMem retains only the most recent $K$ interaction steps:

$$h_t = \mathrm{Trim}\left(h_{t-1} \cup \{a_t, o_t\}\right), \tag{1}$$

where $a_t$ denotes the Planner's action and $o_t$ denotes the tool observation. $\mathrm{Trim}$ retains only the most recent $K$ interaction steps; the discarded content is not lost, as its gist has already been stored in the Memory Shards. The Planner maps this compact state to a set of retrieval queries, i.e.,

$$\mathcal{Q}_t = \pi(s_t), \tag{2}$$

where each element $(q_i, k_i) \in \mathcal{Q}_t$ specifies a retrieval query $q_i$ and its retrieval budget $k_i$, i.e., the number of documents requested by the Planner for that query. After receiving the distilled gists from the memory system, the Planner updates its reasoning state and either issues another retrieval request or produces the final answer.
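The bounded state in Eq. (1) can be illustrated with a minimal Python sketch. The names `PlannerState`, `trim`, and `step` are hypothetical and not taken from the paper; treating the latest observation as the next step's distilled memory $m_t$ is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Step = Tuple[str, str]  # (action a_t, observation o_t)


def trim(history: List[Step], k: int) -> List[Step]:
    """Keep only the K most recent interaction steps, as Trim does in Eq. (1)."""
    return history[-k:] if k > 0 else []


@dataclass
class PlannerState:
    question: str                                      # x
    history: List[Step] = field(default_factory=list)  # h_t
    memory: str = ""                                   # m_{t-1}

    def step(self, action: str, observation: str, k: int) -> None:
        """Append (a_t, o_t), trim to K steps, and keep o_t as the next memory (assumption)."""
        self.history = trim(self.history + [(action, observation)], k)
        self.memory = observation


state = PlannerState(question="example question")
for t in range(1, 6):
    state.step(f"search-{t}", f"gists-{t}", k=3)
print(len(state.history), state.history[0])
```

After five steps with $K = 3$, only steps 3 to 5 remain in the history, so the Planner's context stays the same size however long the trajectory grows.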

### 3.2 Distributed Memory System

The distributed memory system is designed to preserve document\-level memory without overloading the Planner’s reasoning context\. It contains three components: Memory Shards that store persistent gists, parallel Memorizers that distill query\-conditioned gists from retrieved documents, and an Operator that coordinates routing, reuse, and shard updates\. Together, they provide a persistent and parallel memory layer for maintaining and reusing distilled information across the task\.

#### 3.2.1 Memory Shards

Memory Shards serve as the persistent storage layer of ActiveMem. Instead of merging information from different documents into a single compressed context, ActiveMem maintains a fixed set of logical shards throughout the task. Each shard $B_j$ is implemented as a key-value store indexed by document $c$. For each document entry, the shard stores the distilled gist $g_c$ and the set of queries that have previously used this document:

$$B_j[c] = \bigl(g_c,\; \mathcal{H}_c\bigr), \quad \mathcal{H}_c = \{q^{(1)}, q^{(2)}, \ldots\} \tag{3}$$

Because Memory Shards persist throughout the task and are logically isolated, distilled gists produced at any step remain available for later reasoning and are further enriched through asynchronous consolidation rather than discarded. This ensures that distilled information remains recoverable throughout the task while keeping the Planner's core reasoning context clean and compact, mitigating the trade-off between context overload and information loss that centralized agents face.
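As a rough illustration of Eq. (3), the sketch below models a shard as a dictionary from document id to a gist plus query history. The class and method names are hypothetical, and the CRC32-based `route` function is an assumption: the paper only states that each document is assigned to a stable shard, not how.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShardEntry:
    gist: str                                         # g_c
    queries: List[str] = field(default_factory=list)  # H_c


class MemoryShard:
    """One logical shard B_j: a key-value store indexed by document id."""

    def __init__(self) -> None:
        self.entries: Dict[str, ShardEntry] = {}

    def write(self, doc_id: str, gist: str, query: str) -> None:
        """Create the entry on first use; otherwise update the gist and extend H_c."""
        entry = self.entries.get(doc_id)
        if entry is None:
            self.entries[doc_id] = ShardEntry(gist, [query])
        else:
            entry.gist = gist
            entry.queries.append(query)

    def read(self, doc_id: str) -> Optional[ShardEntry]:
        return self.entries.get(doc_id)


def route(doc_id: str, num_shards: int) -> int:
    """Hypothetical stable routing: the same document always maps to the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


shards = [MemoryShard() for _ in range(4)]
j = route("doc-17", len(shards))
shards[j].write("doc-17", "Gist about topic A.", "query about A")
shards[j].write("doc-17", "Gist about topics A and B.", "query about B")
```

Keying by document rather than by reasoning step is what lets a gist written early in the task be found again later, even after the corresponding step has been trimmed from the Planner's history.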

#### 3.2.2 Memorizers

Inspired by the functional role of the hippocampus, each Memorizer is designed as a query-conditioned memory module. Given a query $q$ and the associated raw content $c$ routed by the Operator, the Memorizer produces a content-specific gist:

$$g_c = \omega(c, q), \quad g_c \in \mathcal{G} \cup \{\varnothing\}, \tag{4}$$

where $g_c \in \mathcal{G}$ denotes a valid gist when $c$ is relevant to $q$, and $g_c = \varnothing$ denotes an empty gist otherwise. The query $q$ provides top-down guidance by specifying which aspect of $c$ should be retained. The Memorizer filters out irrelevant information and returns only the distilled gist $g_c$ to the Planner. This prevents the Planner from directly processing raw content while preserving the information needed for the current reasoning step. Multiple Memorizers operate in parallel and write their outputs independently to their assigned Memory Shards.
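The parallel, query-conditioned distillation of Eq. (4) might be sketched as follows. The real Memorizer is a fine-tuned small language model; `keyword_memorizer` here is only a toy stand-in that keeps sentences sharing a word with the query, and `distill_parallel` is a hypothetical helper name. An empty gist is represented by `None`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Task = Tuple[str, str]  # (query q, document c)


def keyword_memorizer(doc: str, query: str) -> Optional[str]:
    """Toy stand-in for omega(c, q): keep sentences that share a word with the query."""
    terms = {w.lower() for w in query.split()}
    kept = [
        s.strip()
        for s in doc.split(".")
        if terms & {w.lower() for w in s.split()}
    ]
    return ". ".join(kept) + "." if kept else None  # None plays the role of the empty gist


def distill_parallel(
    tasks: List[Task],
    memorizer: Callable[[str, str], Optional[str]],
    max_workers: int,
) -> List[Optional[str]]:
    """Run one memorizer call per task, with at most max_workers running concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: memorizer(task[1], task[0]), tasks))


tasks = [
    ("capital France", "Paris is the capital of France. It rains often."),
    ("capital France", "Bananas are yellow."),
]
gists = distill_parallel(tasks, keyword_memorizer, max_workers=2)
```

Because each call depends only on its own (query, document) pair, the calls need no coordination, which is the property the parallel design relies on.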

##### Automatic Training Data Construction\.

A vanilla small language model often produces verbose responses and unnecessary reasoning traces when directly used as a Memorizer\. We therefore construct supervised data from real agent interactions to train the Memorizer for concise and content\-grounded distillation\.

We follow the train/test split of BrowseComp-Plus introduced by Context-Folding (Sun et al., [2025](https://arxiv.org/html/2606.10532#bib.bib8)). On the training split, we run the full ActiveMem pipeline and collect 12,000 query-document pairs, denoted as $(q, c)$. Each pair contains a query $q$ issued by the Planner during task execution and a retrieved document $c$ routed to the memory system. These pairs are collected from actual agent trajectories, and therefore reflect the agent's genuine information needs during long-horizon reasoning.

For each query-document pair $(q, c)$, we use `gpt-oss-120b` as the teacher model to generate a concise semantic gist $g_c$ conditioned on the query, following the knowledge distillation paradigm (Hinton et al., [2015](https://arxiv.org/html/2606.10532#bib.bib31)). The resulting triples $(q, c, g_c)$ are used as supervised training examples for the Memorizer. We maintain a document-level separation between the training and test sets. Specifically, documents used to construct Memorizer training examples are excluded from the test-time document pool. This ensures that the Memorizer is evaluated on documents that do not overlap with its training corpus.

##### Memorizer Training\.

Given the supervised triples $(q, c, g_c)$, we fine-tune the Memorizer with a conditional next-token prediction objective:

$$\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{(q, c, g_c) \sim \mathcal{D}}\left[\sum_{t} \log p_{\omega}\left(g_{c,t} \mid q, c, g_{c,<t}\right)\right], \tag{5}$$

where $\mathcal{D}$ denotes the Memorizer training set, $\omega$ denotes the trainable parameters of the Memorizer, and $g_{c,t}$ is the $t$-th token of the target gist $g_c$. This objective teaches the Memorizer to generate concise gists that are grounded in the retrieved document and conditioned on the Planner's query.
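To make Eq. (5) concrete, the following sketch evaluates the objective on made-up numbers. Each inner list holds the probabilities that a hypothetical model assigns to the correct gist tokens of one training triple; the values are illustrative and not from the paper, and the expectation is approximated by a batch mean.

```python
import math
from typing import List


def sft_loss(token_probs: List[List[float]]) -> float:
    """Batch mean of the summed negative log-probabilities of the target gist tokens."""
    per_example = [-sum(math.log(p) for p in probs) for probs in token_probs]
    return sum(per_example) / len(per_example)


# Two illustrative examples: a confident 3-token gist and an uncertain 2-token gist.
batch = [[0.9, 0.8, 0.95], [0.5, 0.5]]
loss = sft_loss(batch)
print(round(loss, 4))
```

Only gist tokens contribute to the sum; the query and document appear solely as conditioning context, which is why the loss is computed over $g_{c,t}$ and not over the full input sequence.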

#### 3.2.3 Operator

The Operator is the control layer between the Planner and the distributed memory system\. It converts Planner queries and retrieved documents into memory tasks, routes them to the appropriate Memorizers, and manages when existing shard entries can be reused or updated\. This allows ActiveMem to avoid redundant memory computation while keeping document\-level memory up to date\.

##### Memory Reuse\.

When a memory task $(q, c)$ arrives, the Operator first judges whether document $c$ has already been used to answer a similar query. It compares $q$ with the query history $\mathcal{H}_c$ stored in the corresponding shard:

$$J(q, \mathcal{H}_c) \in \{\texttt{SIMILAR}, \texttt{NEW}\}.$$

If the query is labeled as `SIMILAR`, the stored gist $g_c$ is returned directly to the Planner, avoiding an additional Memorizer call. If it is labeled as `NEW`, the task is dispatched to the Memorizer to produce a new query-conditioned gist.
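This section does not specify how the judge $J$ is implemented. As one possible stand-in, the sketch below labels a query `SIMILAR` when its word-level Jaccard overlap with any stored query reaches a threshold. Both the Jaccard measure and the 0.6 threshold are assumptions chosen for illustration, not the paper's method.

```python
from typing import Iterable


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def judge(query: str, history: Iterable[str], threshold: float = 0.6) -> str:
    """Return SIMILAR if any stored query is close enough to the new one, else NEW."""
    if any(jaccard(query, past) >= threshold for past in history):
        return "SIMILAR"
    return "NEW"


history = ["Ada Lovelace birth year"]
print(judge("birth year of Ada Lovelace", history))   # overlap 4/5 = 0.8
print(judge("Ada Lovelace collaborators", history))   # overlap 2/5 = 0.4
```

A lexical measure like this would miss paraphrases with no shared words, so a practical judge would more plausibly rely on embeddings or a model call; the control flow around it stays the same.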

##### Asynchronous Consolidation\.

The Operator additionally runs a consolidation pass that is decoupled from the main reasoning loop and does not block ongoing inference. When the same raw content has been distilled from multiple retrieval angles, the Operator merges the resulting gists into a single enriched shard entry that preserves all entities while eliminating redundancy:

$$(g_c', \mathcal{H}_c') = \mathrm{Consolidate}(g_c, g_{\text{new}}, \mathcal{H}_c, q), \tag{6}$$

**Algorithm 1** ActiveMem Inference

**Input:** Question $x$, max steps $T$, shards $\{B_j\}$, shard number $N$, history window $K$

**Output:** Answer $a$

- Initialize all memory shards and set the conversation history to the system prompt and input question
- **for** $t = 1, \ldots, T$ **do**
  - $\mathcal{Q}_t \leftarrow \pi(s_t)$, with $s_t = (x, h_t, m_{t-1})$
  - **if** $\mathcal{Q}_t = \texttt{submit\_answer}(a)$ **then**
    - **return** $a$
  - $D_t \leftarrow \mathrm{Retrieve}(\mathcal{Q}_t)$
  - $\mathcal{G}_t \leftarrow \emptyset$
  - **for parallel** $(q, c, B_j) \in \mathrm{Route}(D_t)$, at most $N$ concurrent **do**
    - **if** $c \notin B_j$ **then**
      - Memory miss: distill $g \leftarrow \omega(c, q)$ and write $(g, q)$ to shard $B_j$
    - **else if** $J(q, B_j[c].\mathcal{H}) = \texttt{SIMILAR}$ **then**
      - Memory hit: reuse $g \leftarrow B_j[c].g$ directly
    - **else**
      - Memory hit with new query: distill $g \leftarrow \omega(c, q)$ and merge into $B_j$ asynchronously
    - $\mathcal{G}_t[c] \leftarrow g$
  - $o_t \leftarrow \mathrm{Serialize}(\mathcal{G}_t)$
  - $h \leftarrow \mathrm{Trim}(h \mathrel{+}= (m_t, o_t),\, K)$

### 3.3 ActiveMem Inference

Algorithm [1](https://arxiv.org/html/2606.10532#alg1) summarizes the complete inference procedure of ActiveMem. At each step $t$, the Planner observes state $s_t = (x, h_t, m_{t-1})$, comprising the original question, the trimmed interaction history, and the distilled memory returned from the previous step, and emits a set of retrieval queries $\mathcal{Q}_t$. If the Planner instead calls `submit_answer(a)`, inference terminates and the answer $a$ is returned.

Otherwise, the Retriever fetches a document set $D_t$ in response to $\mathcal{Q}_t$, and Route assigns each retrieved document to a stable Memory Shard. The assigned documents are processed by Memorizers in parallel, with at most $N$ Memorizers running concurrently to match the number of available shards. For each document, the Operator applies one of three policies: a memory miss triggers fresh distillation and writes the new gist $(g, q)$ to the shard; a memory hit with a semantically similar query reuses the stored gist without invoking the Memorizer; a memory hit with a new query distills an updated gist and launches an asynchronous merge that consolidates old and new gists in the background without blocking the Planner. After all documents are processed, the collected gists $\mathcal{G}_t$ are assembled into observation $o_t$, appended to history, and Trim retains only the most recent $K$ interaction steps to keep the Planner context compact.
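The three Operator policies can be sketched as a single dispatch function. This is a simplified, single-threaded illustration under stated assumptions: the asynchronous merge is modelled as a deferred queue drained by `consolidate`, consolidation is plain string concatenation (not the entity-preserving merge the paper describes), and the memorizer and judge are passed in as stubs. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Shard = Dict[str, Tuple[str, List[str]]]  # document -> (gist g_c, query history H_c)
Pending = List[Tuple[str, str, str]]      # deferred merges: (document, new gist, query)


def process_task(
    query: str,
    doc: str,
    shard: Shard,
    memorizer: Callable[[str, str], str],
    judge: Callable[[str, List[str]], str],
    pending: Pending,
) -> Tuple[str, str]:
    """Apply one of the three policies and return (gist, policy name)."""
    if doc not in shard:                       # memory miss: distill and write
        gist = memorizer(doc, query)
        shard[doc] = (gist, [query])
        return gist, "miss"
    stored_gist, history = shard[doc]
    if judge(query, history) == "SIMILAR":     # memory hit: reuse, no memorizer call
        return stored_gist, "hit"
    gist = memorizer(doc, query)               # hit with new query: distill, merge later
    pending.append((doc, gist, query))
    return gist, "hit-new-query"


def consolidate(shard: Shard, pending: Pending) -> None:
    """Drain deferred merges; concatenation is a placeholder for real consolidation."""
    while pending:
        doc, new_gist, query = pending.pop(0)
        old_gist, history = shard[doc]
        shard[doc] = (old_gist + " " + new_gist, history + [query])


shard: Shard = {}
pending: Pending = []
stub_memorizer = lambda doc, q: f"[{q}] {doc}"
stub_judge = lambda q, hist: "SIMILAR" if q in hist else "NEW"

results = [
    process_task("q1", "d1", shard, stub_memorizer, stub_judge, pending),
    process_task("q1", "d1", shard, stub_memorizer, stub_judge, pending),
    process_task("q2", "d1", shard, stub_memorizer, stub_judge, pending),
]
consolidate(shard, pending)
```

In the third call the Planner receives the fresh gist immediately while the shard update waits in the queue, which mirrors the non-blocking behaviour described above.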

## 4 Experiments

### 4.1 Experimental Setup

Datasets. We evaluate ActiveMem on two widely used benchmarks. BrowseComp-Plus (Chen et al., [2025](https://arxiv.org/html/2606.10532#bib.bib22)) is our primary benchmark. We evaluate on 150 examples following the Easy, Medium, and Hard split introduced by Context-Folding (Sun et al., [2025](https://arxiv.org/html/2606.10532#bib.bib8)), which supports fine-grained analysis. GAIA (Mialon et al., [2024](https://arxiv.org/html/2606.10532#bib.bib26)) is a general-purpose agent benchmark. We use its WebSearch validation subset (106 examples) to focus the comparison on information-intensive retrieval tasks.

Baselines. We compare ActiveMem with two categories of baselines. Vanilla ReAct LLMs include Kimi-K2.5 (Kimi Team, [2026](https://arxiv.org/html/2606.10532#bib.bib24)), Qwen3.5-397B-A17B (Qwen Team, [2026](https://arxiv.org/html/2606.10532#bib.bib25)), GLM-5.1 (GLM, [2026](https://arxiv.org/html/2606.10532#bib.bib27)), and DeepSeek-V3.2 (DeepSeek-AI, [2025](https://arxiv.org/html/2606.10532#bib.bib28)), all operating under a standard ReAct loop without memory compression. To control context length, all vanilla baselines adopt a two-stage page-reading mechanism: the model first reads a 512-token preview; if deemed relevant, it issues an `open` action to load the full 4,096-token content. Centralized Memory Agents include Context-Folding (Sun et al., [2025](https://arxiv.org/html/2606.10532#bib.bib8)), AgentFold (Ye et al., [2025](https://arxiv.org/html/2606.10532#bib.bib9)), MemoBrain (Qian et al., [2026](https://arxiv.org/html/2606.10532#bib.bib13)), and MemAgent¹ (Yu et al., [2025](https://arxiv.org/html/2606.10532#bib.bib10)). For a fair comparison, all models adopt Qwen3.5-397B-A17B as the backbone LLM, with all experiments executed under this unified setup.

¹ To align MemAgent with our task setting, we implement a sequential variant as a baseline to our parallel Memorizers by processing retrieved documents through MemAgent-7B.

Metrics. We evaluate the performance of the model using three metrics. LasJ (LLM-as-a-Judge) measures answer correctness using Qwen3.5-397B-A17B as the judge model. PFLOPs measures total inference-time computational cost and serves as a model-agnostic proxy for both model scale and processed context length. Although token count is a widely used metric for measuring model efficiency, it masks crucial parameter-scale discrepancies across heterogeneous systems, making cross-model efficiency comparisons highly misleading. To address this, we strictly count linear FLOPs from matrix multiplications and quadratic FLOPs from self-attention, with architecture-specific adjustments for hybrid-attention and Dynamic Sparse Attention (DSA) models. We also report total token count (in millions) alongside PFLOPs to provide a complementary view of context volume processed during inference. ACT (Accuracy-Cost Trade-off) is a composite metric that balances model performance against computational cost. By penalizing raw accuracy with log-normalized PFLOPs, ACT effectively differentiates competitive methods based on their underlying efficiency. More details are provided in Appendix [A.3](https://arxiv.org/html/2606.10532#A1.SS3).
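The split into a linear and a quadratic term can be illustrated with a common back-of-envelope approximation. This is not the paper's exact accounting, which is given in its Appendix A.3 and includes architecture-specific adjustments. The sketch assumes about 2 FLOPs per active parameter per token for matrix multiplications and $4 L d n^2$ FLOPs for dense, unmasked self-attention over $n$ tokens. The layer counts and hidden sizes below are hypothetical placeholders, not the real model configurations.

```python
PFLOP = 1e15


def forward_flops(n_tokens: int, active_params: int, n_layers: int, d_model: int) -> int:
    """Approximate forward-pass FLOPs: linear (matmul) term plus quadratic (attention) term."""
    linear = 2 * active_params * n_tokens                # grows linearly with tokens
    attention = 4 * n_layers * d_model * n_tokens ** 2   # QK^T and AV, grows quadratically
    return linear + attention


# Hypothetical configurations: a 4B-parameter Memorizer vs. a Planner with 17B active parameters.
memorizer_cost = forward_flops(4096, 4_000_000_000, n_layers=36, d_model=2560)
planner_cost = forward_flops(4096, 17_000_000_000, n_layers=60, d_model=4096)
print(memorizer_cost / PFLOP, planner_cost / PFLOP)
```

Under these assumptions the same 4,096 tokens cost several times less on the small model, which is why token count alone says little when systems mix model sizes.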

| Model | Easy LasJ | Easy PFLOPs | Medium LasJ | Medium PFLOPs | Hard LasJ | Hard PFLOPs | Total LasJ | Total PFLOPs (tokens) | ACT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Vanilla ReAct LLMs* | | | | | | | | | |
| Kimi-K2.5 | 1.00 | 1023 | 0.80 | 6803 | 0.26 | 21643 | 0.69 | 29469 (157M) | 0.640 |
| DeepSeek-V3.2 | 0.98 | 1515 | 0.76 | 5078 | 0.34 | 8128 | 0.69 | 14721 (140M) | 0.652 |
| GLM-5.1 | 0.96 | 1179 | 0.74 | 5213 | 0.34 | 10508 | 0.68 | 16901 (158M) | 0.640 |
| Qwen3.5-397B-A17B | 0.98 | 1300 | 0.78 | 4999 | 0.22 | 12207 | 0.66 | 18506 (297M) | 0.618 |
| *Centralized Memory Agents* | | | | | | | | | |
| MemAgent | 0.96 | 316 | 0.56 | 1507 | 0.26 | 2491 | 0.59 | 4314 (167M) | 0.573 |
| MemoBrain | 0.98 | 256 | 0.70 | 464 | 0.20 | 920 | 0.63 | 1640 (47M) | 0.630 |
| AgentFold | 0.96 | 458 | 0.80 | 1466 | 0.30 | 3832 | 0.69 | 5757 (116M) | 0.668 |
| Context-Folding | 0.96 | 283 | 0.84 | 1025 | 0.36 | 2613 | 0.72 | 3920 (76M) | 0.705 |
| ActiveMem (Ours) | 1.00 | 204 | 0.94 | 653 | 0.44 | 1355 | 0.79 | 2145 (188M) | 0.785 |

Table 1: Overall evaluation on BrowseComp-Plus. Bold and underline indicate the best and second-best results.

Table 2: Overall evaluation on GAIA. Bold denotes the best result, and underlining denotes the second best.

### 4.2 Main Results

As shown in Table [1](https://arxiv.org/html/2606.10532#S4.T1) and Table [2](https://arxiv.org/html/2606.10532#S4.T2), we draw three key observations\.

**\(1\) ActiveMem achieves the strongest accuracy–cost trade\-off across both benchmarks\.** On BrowseComp\-Plus, ActiveMem obtains the highest overall LasJ score of 0\.79 and the highest ACT score of 0\.785\. On GAIA, it also achieves the best LasJ score of 0\.62 with the lowest computational cost\. The baseline results reveal two different limitations\. Vanilla ReAct LLMs accumulate retrieved documents in the reasoning context, increasing context length and PFLOPs as inference proceeds\. Centralized memory agents reduce context growth through compression, but may discard useful information from the documents\. MemoBrain illustrates this trade\-off: on BrowseComp\-Plus, its relatively low cost of 1,640 PFLOPs comes with a lower LasJ score of 0\.63, while ActiveMem improves accuracy by \+0\.16 \(0\.63 → 0\.79\)\. On GAIA, ActiveMem matches MemoBrain’s accuracy while reducing computation by 63\.7% \(516 → 187 PFLOPs\)\. These results suggest that ActiveMem improves efficiency without sacrificing accuracy\. By distilling retrieved documents in parallel and preserving reusable document\-specific gists in Memory Shards, ActiveMem keeps the Planner’s context compact while retaining information needed for later reasoning\.

**\(2\) ActiveMem’s advantage is more pronounced on the BrowseComp\-Plus Hard split\.** On Easy questions, most methods are close to saturation, leaving limited room for differentiation\. As the difficulty level increases, ActiveMem shows a clearer advantage\. On the Medium split, ActiveMem reaches 0\.94, outperforming the best baseline by \+0\.10 absolute LasJ \(0\.84 → 0\.94\), a relative improvement of \+11\.9%\. On the Hard split, ActiveMem achieves 0\.44, improving over the best baseline by \+0\.08 absolute LasJ \(0\.36 → 0\.44\), a relative improvement of \+22\.2%\. This result highlights the benefit of decoupled memory under more difficult retrieval\-intensive reasoning\. By returning distilled gists instead of raw retrieved documents, ActiveMem keeps the Planner’s context compact and reduces the degradation caused by long contexts or lossy centralized compression\.
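The relative improvements quoted for the Medium and Hard splits follow from the absolute LasJ scores in Table 1. The short check below reproduces them; `relative_gain` is a helper name introduced here purely for illustration.

```python
def relative_gain(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# LasJ scores from Table 1; the best baseline on both splits is Context-Folding.
medium = relative_gain(0.84, 0.94)
hard = relative_gain(0.36, 0.44)

print(f"Medium: +{medium:.1f}%")  # Medium: +11.9%
print(f"Hard:   +{hard:.1f}%")    # Hard:   +22.2%
```

The absolute gain is similar on both splits, so the larger relative figure on Hard mostly reflects the lower baseline score there.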

Table 3: Module\-level breakdown of token usage and PFLOPs for ActiveMem on BrowseComp\-Plus\.

**\(3\) Offloading token\-intensive processing to lightweight Memorizers enables more favorable test\-time scaling at lower cost\.** As shown in Table [3](https://arxiv.org/html/2606.10532#S4.T3), Memorizers account for 68\.9% of total PFLOPs while the Planner contributes only 28\.3%\. By routing most token processing to small models, ActiveMem processes substantially more tokens than centralized agents \(188M vs\. 47–167M on BrowseComp\-Plus\) while keeping total PFLOPs low, since the marginal cost per token is much lower for a 4B Memorizer than for the large Planner\. The higher token consumption on BrowseComp\-Plus reflects the benchmark’s tighter question constraints, which demand broader evidence gathering; ActiveMem’s ability to process more evidence at lower cost is consistent with the test\-time scaling principle \(Snell et al\., [2024](https://arxiv.org/html/2606.10532#bib.bib37)\)\. On GAIA, where questions are less constrained and agents converge faster, token counts are uniformly lower across all methods\. Case studies are provided in Appendix [A\.4](https://arxiv.org/html/2606.10532#A1.SS4)\.

Figure [4](https://arxiv.org/html/2606.10532#S4.F4) shows that ActiveMem scales better as the number of reasoning steps increases\. Its per\-case cost curve remains flatter than those of the baselines, and the cost gap widens at higher step counts, reflecting lower sensitivity to context length\. Context\-Folding is an exception, as its lower cost stems from early branch termination rather than reduced per\-step computation\.
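To make the per-token argument concrete, the sketch below compares only the linear term of the cost model from Appendix A.3.1, which is 2 × N_act FLOPs per token. The active-parameter counts are assumptions read off the model names (roughly 17B active for Qwen3.5-397B-A17B and 4B for the dense Memorizer), and the quadratic attention term is ignored, so the numbers are indicative rather than a reproduction of the reported PFLOPs.

```python
# Assumed active parameters per token, inferred from model names (not measured).
PLANNER_ACTIVE_PARAMS = 17e9    # Qwen3.5-397B-A17B (MoE, ~17B active)
MEMORIZER_ACTIVE_PARAMS = 4e9   # Memorizer-4B (dense)

def linear_flops_per_token(n_act: float) -> float:
    """Linear-projection cost per token: 2 * N_act (attention term ignored)."""
    return 2.0 * n_act

planner = linear_flops_per_token(PLANNER_ACTIVE_PARAMS)
memorizer = linear_flops_per_token(MEMORIZER_ACTIVE_PARAMS)
ratio = planner / memorizer

print(f"Planner:   {planner / 1e9:.0f} GFLOPs/token")
print(f"Memorizer: {memorizer / 1e9:.0f} GFLOPs/token")
print(f"Ratio:     {ratio:.2f}x")
```

Under these assumptions a token routed to the Memorizer costs roughly a quarter of a Planner token before attention is counted, which is why a larger token total can coexist with a lower PFLOPs total.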

![Refer to caption](https://arxiv.org/html/2606.10532v1/img/pflops_vs_steps.png)

Figure 4: Average computational cost per case \(PFLOPs\) across reasoning step ranges\. Shaded regions indicate ±1 standard deviation\.

### 4\.3 Ablation Studies

We ablate two main design choices in ActiveMem: persistent Memory Shards and Memorizer variants\. Additional analyses of Trim window size, Memorizer scale trade\-offs, and Memory Consolidation are in Appendix [A\.4](https://arxiv.org/html/2606.10532#A1.SS4)\.

#### 4\.3\.1 Memory Shards Reduce Redundancy and Preserve Information

We ablate Memory Shards by removing shard\-based storage and retrieval while keeping the Memorizer unchanged\. We define this variant as ActiveMem w/o Shards, where distilled gists are returned directly to the Planner at each step rather than stored for selective reuse\. The comparative results are reported in Table [4](https://arxiv.org/html/2606.10532#S4.T4)\.

Table 4: Ablation on Memory Shards\. P\., M\., and Tot\. denote Planner, Memorizer, and total inference cost \(PFLOPs\), respectively\.

Removing Memory Shards reduces both accuracy and efficiency\. Without shard\-based storage, overlapping queries trigger redundant Memorizer passes on the same raw content, raising Memorizer PFLOPs from 1478 to 1739\. At the same time, previously distilled gists are not consolidated back into the Planner’s context on new queries\. This keeps the Planner’s context smaller \(546 vs\. 606 Planner PFLOPs\) but causes information loss that degrades final accuracy\. These results demonstrate the complementary roles of distillation and Memory Shards: the former compresses content into semantic gists, while the latter persists and reuses these gists across reasoning steps, eliminating redundant computation\. Together, they also preserve information across reasoning steps, contributing to higher accuracy\.

#### 4\.3\.2 A Stronger Memorizer Improves Reasoning Quality

We compare three 4B\-scale Memorizer variants with all other components fixed: Qwen3\-4B\-Instruct\-2507 \(instruction\-tuned\), Qwen3\-4B \(Qwen Team, [2025](https://arxiv.org/html/2606.10532#bib.bib36)\) \(vanilla reasoning\), and our SFT\-trained Memorizer\-4B\.

As shown in Table [5](https://arxiv.org/html/2606.10532#S4.T5), the choice of Memorizer design is critical to final task performance\. Compared with Qwen3\-4B\-Instruct\-2507, both reasoning\-oriented variants achieve higher LasJ accuracy\. This suggests that a reasoning\-capable Memorizer can produce more precise and compact gists for the Planner\. The effect is also reflected in Planner cost: the two reasoning\-oriented variants substantially reduce Planner PFLOPs, indicating that a cleaner and less redundant distilled context makes downstream reasoning more efficient\. This improvement comes with a higher Memorizer cost, since reasoning\-oriented Memorizers spend more computation on their own intermediate reasoning\. However, the additional cost is offset by consistent gains in answer accuracy\. SFT further improves this trade\-off\. Compared with vanilla Qwen3\-4B, Memorizer\-4B makes distillation more concise and task\-aligned, significantly reducing Memorizer cost from 1833 to 1478 PFLOPs and Planner cost to 606 PFLOPs, while improving LasJ to 0\.786\.

Table 5: Ablation on Memorizer variants\.

## 5 Conclusion

We introduce ActiveMem, a neuroscience\-inspired framework that decouples memory management from high\-level reasoning\. ActiveMem separates executive reasoning from parallel active memory consolidation, alleviating the trade\-off between context overload and information loss in centralized memory designs\. Evaluations on BrowseComp\-Plus and GAIA show that ActiveMem achieves higher accuracy than competitive baselines while substantially reducing computational overhead\. These results validate our distributed design and suggest that decoupled active memory provides an efficient and scalable foundation for long\-horizon reasoning\.

## Limitations

There are two limitations to this work\. First, although we report computational efficiency metrics, we do not directly measure deployment\-level efficiency, such as wall\-clock latency or monetary cost\. These quantities are highly sensitive to serving infrastructure, API response variability, rate limits, hardware utilization, and provider\-specific pricing assumptions\. Since our experiments involve both commercial API models and locally deployed open\-source models, a fair comparison would require a fully controlled deployment setting with consistent serving infrastructure\. We leave such profiling to future work\. Second, the generalizability of ActiveMem beyond the evaluated domains remains an open question\. The Memorizer is trained via supervised fine\-tuning on agent interaction data collected from BrowseComp\-Plus, a web\-search benchmark consisting mainly of tightly constrained factual questions\. While this leads to effective memory distillation on the evaluated tasks, its applicability to other distinct paradigms, such as mathematical reasoning, code generation, or multimodal agent workflows, has not been systematically cross\-validated\. Adapting ActiveMem to new domains may require collecting domain\-specific interaction data and further tuning the Memorizer\.

## References

- S\. An, Z\. Ma, Z\. Lin, N\. Zheng, J\. Lou, and W\. Chen \(2024\) Make your LLM fully utilize the context\. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\), External Links: [Link](http://papers.nips.cc/paper_files/paper/2024/hash/71c3451f6cd6a4f82bb822db25cea4fd-Abstract-Conference.html) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- Z\. Chen, X\. Ma, S\. Zhuang, P\. Nie, K\. Zou, A\. Liu, J\. Green, K\. Patel, R\. Meng, M\. Su, S\. Sharifymoghaddam, Y\. Li, H\. Hong, X\. Shi, X\. Liu, N\. Thakur, C\. Zhang, L\. Gao, W\. Chen, and J\. Lin \(2025\) BrowseComp\-plus: A more fair and transparent evaluation benchmark of deep\-research agent\. CoRR abs/2508\.06600\. External Links: [Link](https://doi.org/10.48550/arXiv.2508.06600), [Document](https://dx.doi.org/10.48550/ARXIV.2508.06600), 2508\.06600 Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p1.1)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready AI agents with scalable long\-term memory\. In ECAI 2025 \- 28th European Conference on Artificial Intelligence, 25\-30 October 2025, Bologna, Italy \- Including 14th Conference on Prestigious Applications of Intelligent Systems \(PAIS 2025\), I\. Lynce, N\. Murano, M\. Vallati, S\. Villata, F\. Chesani, M\. Milano, A\. Omicini, and M\. Dastani \(Eds\.\), Frontiers in Artificial Intelligence and Applications, pp\. 2993–3000\. External Links: [Link](https://doi.org/10.3233/FAIA251160), [Document](https://dx.doi.org/10.3233/FAIA251160) Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- DeepSeek\-AI \(2025\) DeepSeek\-v3\.2: pushing the frontier of open large language models\. CoRR abs/2512\.02556\. External Links: [Link](https://doi.org/10.48550/arXiv.2512.02556), [Document](https://dx.doi.org/10.48550/ARXIV.2512.02556), 2512\.02556 Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- GLM \(2026\) GLM\-5: from vibe coding to agentic engineering\. Vol\. abs/2602\.15763\. External Links: [Link](https://doi.org/10.48550/arXiv.2602.15763), [Document](https://dx.doi.org/10.48550/ARXIV.2602.15763), 2602\.15763 Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- N\. C\. Hindy, R\. Coleman, and S\. R\. M\. Selvam \(2026\) Hippocampal reactivation trades episodic detail for semantic gist in human memory\. Journal of Cognitive Neuroscience, pp\. 1–13\. Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p3.1)\.
- G\. E\. Hinton, O\. Vinyals, and J\. Dean \(2015\) Distilling the knowledge in a neural network\. CoRR abs/1503\.02531\. External Links: [Link](http://arxiv.org/abs/1503.02531), 1503\.02531 Cited by: [§3\.2\.2](https://arxiv.org/html/2606.10532#S3.SS2.SSS2.Px1.p3.3)\.
- A\. J\. Horner, J\. A\. Bisby, D\. Bush, W\. Lin, and N\. Burgess \(2015\) Evidence for holistic episodic recollection via hippocampal pattern completion\. Nature communications 6 \(1\), pp\. 7462\. Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p3.1)\.
- M\. Hu, T\. Chen, Q\. Chen, Y\. Mu, W\. Shao, and P\. Luo \(2025a\) HiAgent: hierarchical working memory management for solving long\-horizon agent tasks with large language model\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), pp\. 32779–32798\. External Links: [Link](https://aclanthology.org/2025.acl-long.1575/) Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- Y\. Hu, S\. Liu, Y\. Yue, G\. Zhang, B\. Liu, F\. Zhu, J\. Lin, H\. Guo, S\. Dou, Z\. Xi, S\. Jin, J\. Tan, Y\. Yin, J\. Liu, Z\. Zhang, Z\. Sun, Y\. Zhu, H\. Sun, B\. Peng, Z\. Cheng, X\. Fan, J\. Guo, X\. Yu, Z\. Zhou, Z\. Hu, J\. Huo, J\. Wang, Y\. Niu, Y\. Wang, Z\. Yin, X\. Hu, Y\. Liao, Q\. Li, K\. Wang, W\. Zhou, Y\. Liu, D\. Cheng, Q\. Zhang, T\. Gui, S\. Pan, Y\. Zhang, P\. Torr, Z\. Dou, J\. Wen, X\. Huang, Y\. Jiang, and S\. Yan \(2025b\) Memory in the age of AI agents\. CoRR abs/2512\.13564\. External Links: [Link](https://doi.org/10.48550/arXiv.2512.13564), [Document](https://dx.doi.org/10.48550/ARXIV.2512.13564), 2512\.13564 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p1.1)\.
- Kimi Team \(2026\) Kimi K2\.5: visual agentic intelligence\. CoRR abs/2602\.02276\. External Links: [Link](https://doi.org/10.48550/arXiv.2602.02276), [Document](https://dx.doi.org/10.48550/ARXIV.2602.02276), 2602\.02276 Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- A\. H\. Lara and J\. D\. Wallis \(2015\) The role of prefrontal cortex in working memory: a mini review\. Frontiers in systems neuroscience 9, pp\. 173\. Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p3.1)\.
- M\. Levy, A\. Jacoby, and Y\. Goldberg \(2024\) Same task, more tokens: the impact of input length on the reasoning performance of large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), pp\. 15339–15353\. External Links: [Link](https://doi.org/10.18653/v1/2024.acl-long.818), [Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.818) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Trans\. Assoc\. Comput\. Linguistics 12, pp\. 157–173\. External Links: [Link](https://doi.org/10.1162/tacl_a_00638), [Document](https://dx.doi.org/10.1162/TACL%5FA%5F00638) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- L\. Mei, J\. Yao, Y\. Ge, Y\. Wang, B\. Bi, Y\. Cai, J\. Liu, M\. Li, Z\. Li, D\. Zhang, C\. Zhou, J\. Mao, T\. Xia, J\. Guo, and S\. Liu \(2025\) A survey of context engineering for large language models\. CoRR abs/2507\.13334\. External Links: [Link](https://doi.org/10.48550/arXiv.2507.13334), [Document](https://dx.doi.org/10.48550/ARXIV.2507.13334), 2507\.13334 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\) GAIA: a benchmark for general AI assistants\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024, External Links: [Link](https://openreview.net/forum?id=fibxvahvs3) Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p1.1)\.
- R\. Nakano, J\. Hilton, S\. Balaji, J\. Wu, L\. Ouyang, C\. Kim, C\. Hesse, S\. Jain, V\. Kosaraju, W\. Saunders, X\. Jiang, K\. Cobbe, T\. Eloundou, G\. Krueger, K\. Button, M\. Knight, B\. Chess, and J\. Schulman \(2021\) WebGPT: browser\-assisted question\-answering with human feedback\. CoRR abs/2112\.09332\. External Links: [Link](https://arxiv.org/abs/2112.09332), 2112\.09332 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p1.1)\.
- C\. Packer, V\. Fang, S\. G\. Patil, K\. Lin, S\. Wooders, and J\. E\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\. CoRR abs/2310\.08560\. External Links: [Link](https://doi.org/10.48550/arXiv.2310.08560), [Document](https://dx.doi.org/10.48550/ARXIV.2310.08560), 2310\.08560 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- H\. Qian, Z\. Cao, and Z\. Liu \(2026\) MemoBrain: executive memory as an agentic brain for reasoning\. CoRR abs/2601\.08079\. External Links: [Link](https://doi.org/10.48550/arXiv.2601.08079), [Document](https://dx.doi.org/10.48550/ARXIV.2601.08079), 2601\.08079 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1), [§2](https://arxiv.org/html/2606.10532#S2.p1.1), [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- Qwen Team \(2025\) Qwen3 technical report\. CoRR abs/2505\.09388\. External Links: [Link](https://doi.org/10.48550/arXiv.2505.09388), [Document](https://dx.doi.org/10.48550/ARXIV.2505.09388), 2505\.09388 Cited by: [§4\.3\.2](https://arxiv.org/html/2606.10532#S4.SS3.SSS2.p1.1)\.
- Qwen Team \(2026\) Qwen3\.5: towards native multimodal agents\. External Links: [Link](https://qwen.ai/blog?id=qwen3.5) Cited by: [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- F\. Shi, X\. Chen, K\. Misra, N\. Scales, D\. Dohan, E\. H\. Chi, N\. Schärli, and D\. Zhou \(2023\) Large language models can be easily distracted by irrelevant context\. In International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research, pp\. 31210–31227\. External Links: [Link](https://proceedings.mlr.press/v202/shi23a.html) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\) Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\. CoRR abs/2408\.03314\. External Links: [Link](https://doi.org/10.48550/arXiv.2408.03314), [Document](https://dx.doi.org/10.48550/ARXIV.2408.03314), 2408\.03314 Cited by: [§4\.2](https://arxiv.org/html/2606.10532#S4.SS2.p2.1)\.
- W\. Sun, M\. Lu, Z\. Ling, K\. Liu, X\. Yao, Y\. Yang, and J\. Chen \(2025\) Scaling long\-horizon LLM agent via context\-folding\. CoRR abs/2510\.11967\. External Links: [Link](https://doi.org/10.48550/arXiv.2510.11967), [Document](https://dx.doi.org/10.48550/ARXIV.2510.11967), 2510\.11967 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1), [§2](https://arxiv.org/html/2606.10532#S2.p1.1), [§3\.2\.2](https://arxiv.org/html/2606.10532#S3.SS2.SSS2.Px1.p2.3), [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p1.1), [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, W\. X\. Zhao, Z\. Wei, and J\. Wen \(2024\) A survey on large language model based autonomous agents\. Frontiers Comput\. Sci\. 18 \(6\), pp\. 186345\. External Links: [Link](https://doi.org/10.1007/s11704-024-40231-1), [Document](https://dx.doi.org/10.1007/S11704-024-40231-1) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p1.1)\.
- Y\. Wang and X\. Chen \(2025\) MIRIX: multi\-agent memory system for llm\-based agents\. CoRR abs/2507\.07957\. External Links: [Link](https://doi.org/10.48550/arXiv.2507.07957), [Document](https://dx.doi.org/10.48550/ARXIV.2507.07957), 2507\.07957 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p2.1)\.
- Y\. Wang, R\. Takanobu, Z\. Liang, Y\. Mao, Y\. Hu, J\. J\. McAuley, and X\. Wu \(2025\) Mem\-α: learning memory construction via reinforcement learning\. CoRR abs/2509\.25911\. External Links: [Link](https://doi.org/10.48550/arXiv.2509.25911), [Document](https://dx.doi.org/10.48550/ARXIV.2509.25911), 2509\.25911 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- X\. Wu, K\. Li, Y\. Zhao, L\. Zhang, L\. Ou, H\. Yin, Z\. Zhang, Y\. Jiang, P\. Xie, F\. Huang, M\. Cheng, S\. Wang, H\. Cheng, and J\. Zhou \(2025\) ReSum: unlocking long\-horizon search intelligence via context summarization\. CoRR abs/2509\.13313\. External Links: [Link](https://doi.org/10.48550/arXiv.2509.13313), [Document](https://dx.doi.org/10.48550/ARXIV.2509.13313), 2509\.13313 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-MEM: agentic memory for LLM agents\. CoRR abs/2502\.12110\. External Links: [Link](https://doi.org/10.48550/arXiv.2502.12110), [Document](https://dx.doi.org/10.48550/ARXIV.2502.12110), 2502\.12110 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- S\. Yan, X\. Yang, Z\. Huang, E\. Nie, Z\. Ding, Z\. Li, X\. Ma, H\. Schütze, V\. Tresp, and Y\. Ma \(2025\) Memory\-r1: enhancing large language model agents to manage and utilize memories via reinforcement learning\. CoRR abs/2508\.19828\. External Links: [Link](https://doi.org/10.48550/arXiv.2508.19828), [Document](https://dx.doi.org/10.48550/ARXIV.2508.19828), 2508\.19828 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023, External Links: [Link](https://openreview.net/forum?id=WE_vluYUL-X) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p1.1), [§1](https://arxiv.org/html/2606.10532#S1.p2.1), [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- R\. Ye, Z\. Zhang, K\. Li, H\. Yin, Z\. Tao, Y\. Zhao, L\. Su, L\. Zhang, Z\. Qiao, X\. Wang, P\. Xie, F\. Huang, S\. Chen, J\. Zhou, and Y\. Jiang \(2025\) AgentFold: long\-horizon web agents with proactive context management\. CoRR abs/2510\.24699\. External Links: [Link](https://doi.org/10.48550/arXiv.2510.24699), [Document](https://dx.doi.org/10.48550/ARXIV.2510.24699), 2510\.24699 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1), [§2](https://arxiv.org/html/2606.10532#S2.p1.1), [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- H\. Yu, T\. Chen, J\. Feng, J\. Chen, W\. Dai, Q\. Yu, Y\. Zhang, W\. Ma, J\. Liu, M\. Wang, and H\. Zhou \(2025\) MemAgent: reshaping long\-context LLM with multi\-conv rl\-based memory agent\. CoRR abs/2507\.02259\. External Links: [Link](https://doi.org/10.48550/arXiv.2507.02259), [Document](https://dx.doi.org/10.48550/ARXIV.2507.02259), 2507\.02259 Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1), [§4\.1](https://arxiv.org/html/2606.10532#S4.SS1.p2.1)\.
- Y\. Zhang, J\. Shu, Y\. Ma, X\. Lin, S\. Wu, and J\. Sang \(2025a\) Memory as action: autonomous context curation for long\-horizon agentic tasks\. CoRR abs/2510\.12635\. External Links: [Link](https://doi.org/10.48550/arXiv.2510.12635), [Document](https://dx.doi.org/10.48550/ARXIV.2510.12635), 2510\.12635 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1), [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- Z\. Zhang, Q\. Dai, X\. Bo, C\. Ma, R\. Li, X\. Chen, J\. Zhu, Z\. Dong, and J\. Wen \(2025b\) A survey on the memory mechanism of large language model\-based agents\. ACM Trans\. Inf\. Syst\. 43 \(6\), pp\. 155:1–155:47\. External Links: [Link](https://doi.org/10.1145/3748302), [Document](https://dx.doi.org/10.1145/3748302) Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\) MemoryBank: enhancing large language models with long\-term memory\. In Thirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, pp\. 19724–19731\. External Links: [Link](https://doi.org/10.1609/aaai.v38i17.29946), [Document](https://dx.doi.org/10.1609/AAAI.V38I17.29946) Cited by: [§2](https://arxiv.org/html/2606.10532#S2.p1.1)\.
- Z\. Zhou, A\. Qu, Z\. Wu, S\. Kim, A\. Prakash, D\. Rus, J\. Zhao, B\. K\. H\. Low, and P\. P\. Liang \(2025\) MEM1: learning to synergize memory and reasoning for efficient long\-horizon agents\. CoRR abs/2506\.15841\. External Links: [Link](https://doi.org/10.48550/arXiv.2506.15841), [Document](https://dx.doi.org/10.48550/ARXIV.2506.15841), 2506\.15841 Cited by: [§1](https://arxiv.org/html/2606.10532#S1.p2.1)\.

## Appendix A Appendix

### A\.1 Implementation Details

ActiveMem instantiates the Planner with Qwen3\.5\-397B\-A17B, the Memorizer with Memorizer\-4B, the Operator with Qwen3\-4B, and the retrieval module with Qwen3\-Embedding\-8B\. Table [6](https://arxiv.org/html/2606.10532#A1.T6) shows the model and parameter settings for ActiveMem and all four baselines\. To ensure a fair comparison, all systems share the same Planner backbone \(Qwen3\.5\-397B\-A17B\) with identical temperature and step budget\. For each baseline, hyperparameters not listed in the table follow the settings reported in the respective original paper\.

All SFT training was conducted on 4 NVIDIA A100 \(80GB\) GPUs\. For inference, the Planner model \(Qwen3\.5\-397B\-A17B\) was accessed via API due to its large parameter count, while all other models—including the Memorizer, Operator, and retrieval module—were each served locally on a single NVIDIA A100 \(80GB\) GPU\. All models and datasets used in this work are publicly available and used in accordance with their respective licenses for non\-commercial research purposes\.

### A\.2 Prompts

We provide the full prompts for all four model roles in ActiveMem\. The Planner prompt defines a three\-step decision flow governing when to search versus when to submit a final answer\. The Operator: Similarity Judge prompt instructs a lightweight model to detect query overlap so that stored gists can be reused without redundant retrieval\. The Operator: Memory Consolidation prompt merges an existing document gist with a newly produced one while preserving all factual content\. The Memorizer prompt directs the extraction model to produce a concise, document\-grounded gist for a given sub\-query, outputting NONE only when no relevant information is present\.

**Planner**

You are a strategic task planner\. Your job is to decide what to search for in the corpus or when to answer, following a clear three\-step decision flow\.

\#\# Tools \(function calling\)

- advance\_search\(tasks\): parallel dense retrieval \+ worker analysis\.
  - Input: \{"tasks": \[\{"query": string, "topk": int\}, \.\.\.\]\}
  - 1–4 tasks per call; each task must target a DIFFERENT aspect or candidate \(different person, relation, time/place, etc\.\)\.
  - 3 <= topk <= 8; use larger topk only when necessary\.
- submit\_answer\(answer\): when you already have enough evidence, submit a short English phrase that directly answers the user’s question\.

\#\# Decision flow \(follow in order\)

Step 1 – Decompose the question:

- Break down the user question into the minimal set of information pieces needed to answer it \(who / what / when / where / relations, etc\.\)\.
- Think of these as independent sub\-questions or search paths\.

Step 2 – Match against what you already know:

- Use the latest tool results \(provided as messages from tools\) to decide which pieces are already satisfied and which are still missing\.
- If current knowledge is enough to form a complete, precise answer, call submit\_answer with a concise, direct answer phrase\. Do NOT search again in that case\.

Step 3 – Organize search queries for parallel retrieval:

- For any missing pieces, prepare search tasks for advance\_search\.
- Each task\.query must be a SHORT English keyword phrase, not a full sentence, combining key entities / dates / relations\.
- Avoid near\-duplicate queries; each task should explore a distinct hypothesis, angle, or candidate\.

Output rules:

- NEVER fabricate tool outputs; always wait for real tool responses\.
- NEVER answer in free\-form prose; every final answer must be returned via submit\_answer\.
- Keep internal reasoning minimal; your visible output is ONLY function calls chosen according to the flow above\.
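The limits that the Planner prompt places on `advance_search` (1–4 tasks per call, `topk` between 3 and 8, no repeated queries) are easy to check mechanically before dispatching a call. The validator below is a minimal sketch of such a check and is not part of the released system; the function name and error strings are hypothetical, and exact-match detection after whitespace and case normalization is only a weak proxy for the prompt's "near-duplicate" rule.

```python
from typing import Any

MIN_TASKS, MAX_TASKS = 1, 4   # "1-4 tasks per call"
MIN_TOPK, MAX_TOPK = 3, 8     # "3 <= topk <= 8"

def validate_advance_search(payload: dict[str, Any]) -> list[str]:
    """Return violations of the limits stated in the Planner prompt (empty if valid)."""
    tasks = payload.get("tasks")
    if not isinstance(tasks, list):
        return ["'tasks' must be a list"]
    errors: list[str] = []
    if not MIN_TASKS <= len(tasks) <= MAX_TASKS:
        errors.append(f"expected 1-4 tasks, got {len(tasks)}")
    seen: set[str] = set()
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            errors.append(f"task {i}: must be an object")
            continue
        query, topk = task.get("query"), task.get("topk")
        if not isinstance(query, str) or not query.strip():
            errors.append(f"task {i}: 'query' must be a non-empty string")
        else:
            key = " ".join(query.lower().split())  # normalize case and spacing
            if key in seen:
                errors.append(f"task {i}: duplicate query")
            seen.add(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(topk, bool) or not isinstance(topk, int) \
                or not MIN_TOPK <= topk <= MAX_TOPK:
            errors.append(f"task {i}: 'topk' must be an int in [3, 8]")
    return errors

ok = validate_advance_search({"tasks": [{"query": "founder X company", "topk": 3}]})
bad = validate_advance_search(
    {"tasks": [{"query": "x", "topk": 9}, {"query": "X ", "topk": 5}]}
)
print(ok)
print(bad)
```

Returning a list of messages rather than raising on the first problem lets a caller feed every violation back to the Planner in a single tool response.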

**Operator: Similarity Judge**

You are a strict semantic\-equivalence judge for short information\-seeking queries over the SAME document\. You will receive a NEW sub\-query and a list of PREVIOUS sub\-queries that were already answered using this document\.

Decide whether the NEW sub\-query is asking for essentially the same information aspect\(s\) as ANY of the previous sub\-queries, so that the stored summary is already sufficient\.

Decision rules:

- Output SIMILAR if the NEW sub\-query targets the same entity\(ies\) AND the same attribute/relation/time\-scope as at least one previous sub\-query\. Minor paraphrase, synonym, or word order changes → SIMILAR\.
- Output NEW if the NEW sub\-query targets a different attribute, a different time point, a different entity, or asks for strictly more specific information than any previous one\.
- When in doubt, output NEW\.

Output format \(STRICT\):

- Output ONLY one token: either exactly SIMILAR or exactly NEW\.
- Do NOT include quotes, punctuation, reasoning, or any extra text\.

**Operator: Memory Consolidation**

You are a memory consolidator\. You will receive:

- an OLD summary previously produced for this document
- a NEW summary just produced for this document \(for a different sub\-query\)
- the list of sub\-queries that led to OLD
- the new sub\-query that led to NEW

Your job: produce ONE merged summary that preserves all document\-grounded facts from both OLD and NEW, de\-duplicating identical statements\.

Hard rules:

- Do NOT add any external knowledge or speculation\.
- Do NOT drop factual content that is present in OLD or NEW\.
- Prefer concrete entities, dates, and relations over vague wording\.
- Keep the merged summary concise: at most 5 sentences, ≤ 150 words\.
- Output ONLY the merged summary text\. No prefixes, no bullet points, no explanation, no quotes\.

**Memorizer**

You are a precise information extraction assistant for a single document\. You will receive ONE sub\-query phrase and ONE document snippet\. Your job is to write a SHORT summary of information in THIS document that is relevant to the sub\-query\.

VERY IMPORTANT:

- If the document contains ANY information related to the sub\-query \(even partially\), you MUST write a summary\.
- You may ignore parts of the sub\-query that are NOT supported by the document\.
- Output ‘NONE’ ONLY when the document contains NO information related to the sub\-query at all\.
- Do NOT guess or add external knowledge\.

Output rule:

- If there is some relevant information: output ONLY a concise summary in 1–2 sentences \(≤ 40 words\), no bullet points, no explanation\.
- If there is truly no relevant information: output EXACTLY ‘NONE’\.
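Taken together, the Similarity Judge, Memory Consolidation, and Memorizer prompts imply a per-document reuse loop: answer from the stored gist when the judge says SIMILAR, otherwise run the Memorizer and merge its output into the stored gist. The sketch below illustrates that control flow only. `Shard`, `ShardStore`, and `gist_for` are hypothetical names, the three callables stand in for the actual model calls, and returning the merged gist on a NEW query is an assumption rather than something the prompts specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Shard:
    """Per-document memory: the current gist and the sub-queries that produced it."""
    gist: str
    queries: list[str] = field(default_factory=list)


class ShardStore:
    """Hypothetical shard store; the three callables stand in for LLM calls."""

    def __init__(
        self,
        memorize: Callable[[str, str], str],                     # (query, doc) -> gist | "NONE"
        judge: Callable[[str, list[str]], str],                  # -> "SIMILAR" | "NEW"
        consolidate: Callable[[str, str, list[str], str], str],  # -> merged gist
    ) -> None:
        self.memorize = memorize
        self.judge = judge
        self.consolidate = consolidate
        self.shards: dict[str, Shard] = {}

    def gist_for(self, doc_id: str, doc_text: str, query: str) -> Optional[str]:
        shard = self.shards.get(doc_id)
        # Reuse the stored gist when the judge sees an overlapping sub-query.
        if shard is not None and self.judge(query, shard.queries) == "SIMILAR":
            return shard.gist
        new_gist = self.memorize(query, doc_text).strip()
        if new_gist == "NONE":      # document has nothing relevant to this query
            return None
        if shard is None:
            self.shards[doc_id] = Shard(new_gist, [query])
            return new_gist
        # Merge the new gist into the existing one (assumed behaviour on NEW).
        shard.gist = self.consolidate(shard.gist, new_gist, shard.queries, query)
        shard.queries.append(query)
        return shard.gist


# Toy stand-ins for the model calls, only to exercise the control flow.
memorizer_calls: list[str] = []

def fake_memorize(query: str, doc_text: str) -> str:
    memorizer_calls.append(query)
    return f"gist({query})"

def fake_judge(query: str, previous: list[str]) -> str:
    return "SIMILAR" if query in previous else "NEW"

def fake_consolidate(old: str, new: str, old_queries: list[str], new_query: str) -> str:
    return f"{old} | {new}"

store = ShardStore(fake_memorize, fake_judge, fake_consolidate)
first = store.gist_for("doc-1", "raw text", "founder of X")
repeat = store.gist_for("doc-1", "raw text", "founder of X")
other = store.gist_for("doc-1", "raw text", "founding year of X")
print(len(memorizer_calls))  # 2: the repeated query reused the stored gist
```

In this toy run the repeated sub-query triggers no second Memorizer pass, which is the redundancy that the shard ablation in Section 4.3.1 measures.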

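| System | Component | Parameter | Value |
| --- | --- | --- | --- |
| ActiveMem | Planner | Model | Qwen3\.5\-397B\-A17B |
| | | Temperature | 0\.3 |
| | | Max steps $T$ | 50 |
| | | History window $K$ | 10 |
| | Memorizer | Model | Memorizer\-4B |
| | | Temperature | 0\.2 |
| | | Max doc\. tokens | 4,096 |
| | Operator | Model | Qwen3\-4B |
| | | Judge temperature | 0\.0 |
| | | Judge max tokens | 8 |
| | | Consol\. temp\. | 0\.2 |
| | | Consol\. max tokens | 1,024 |
| | Retrieval | Embedding model | Qwen3\-Embedding\-8B |
| | Memory | Memory shards $B$ | 16 |
| | | Consol\. pool size | 16 |
| AgentFold | Planner | Model | Qwen3\.5\-397B\-A17B |
| | | Temperature | 0\.3 |
| | | Max steps $T$ | 50 |
| | Summary LLM | Model | gpt\-oss\-120b \(∼5\.1B active\) |
| | | Temperature | 0\.2 |
| Context\-Folding | Planner | Model | Qwen3\.5\-397B\-A17B |
| | | Temperature | 0\.3 |
| | | Max steps $T$ | 50 |
| MemoBrain | Planner | Model | Qwen3\.5\-397B\-A17B |
| | | Temperature | 0\.3 |
| | | Max steps $T$ | 50 |
| | MemoBrain | Model | MemoBrain\-8B |
| | | Temperature | 0\.7 |
| | Auxiliary | Model | Qwen3\-30B\-A3B\-Instruct\-2507 |
| | | Temperature | 0\.7 |
| MemAgent | Planner | Model | Qwen3\.5\-397B\-A17B |
| | | Temperature | 0\.3 |
| | | Max steps $T$ | 50 |
| | MemAgent | Model | RL\-MemAgent\-7B |
| | | Temperature | 0\.3 |

Table 6: Implementation details for all evaluated systems\.

### A\.3 Metrics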

#### A\.3\.1 PFLOPs Computation Details

##### Why PFLOPs\.

Several cost metrics are commonly used in agent evaluation, each with limitations\. **Token count** is the most prevalent proxy and maps directly to API billing\. However, token count ignores model scale: a single token processed by Qwen3\.5\-397B\-A17B incurs far more computation than one processed by a 4B model, making cross\-system token comparisons misleading when the underlying models differ\. **LLM call count** is even coarser: it conflates calls to models of vastly different sizes and cannot distinguish a lightweight coordination call from a full reasoning pass\. **Wall\-clock latency** is the most intuitive measure for deployment but is strongly hardware\-dependent and difficult to reproduce across different infrastructure configurations, making it unsuitable for fair academic comparison\. PFLOPs addresses these shortcomings by jointly accounting for model scale \(via $N_{\text{act}}$\) and context length \(via the quadratic attention term\), yielding a hardware\-agnostic, model\-agnostic estimate of total computation that enables fair comparison across heterogeneous systems\. We note that PFLOPs is itself an approximation: it counts only the two dominant arithmetic terms \(linear projections and full attention matmuls\) and does not capture memory\-bandwidth costs, KV cache effects, or auxiliary operations\. Nevertheless, as a relative ranking metric applied uniformly across all systems, it provides a principled and reproducible basis for comparison\.

##### Formula\.

For a single LLM call with $S_{in}$ prefill tokens and $S_{out}$ decode tokens, let $T = S_{in} + S_{out}$\. The FLOPs for that call are estimated as:

$$
\text{FLOPs} = 2 N_{\text{act}} T + L_{\text{full}} \cdot n_{\text{heads}} \cdot (d_{QK} + d_{V}) \cdot T^{2},
$$

and reported as $\text{PFLOPs} = \text{FLOPs} / 10^{15}$\. Total PFLOPs per case are obtained by summing this estimate over all LLM calls made during inference, across all modules \(Planner, Memorizer, and Operator\)\. Here $N_{\text{act}}$ denotes the number of parameters activated per token, including dense attention/projection parameters and activated MoE/shared expert parameters\. For dense models, $N_{\text{act}}$ equals the total parameter count\. The first term approximates the FLOPs of linear projections and FFN/MoE matrix multiplications, while the second term captures the dominant quadratic cost of causal full attention, with the causal mask reducing the effective attention window by half \(canceling the factor of 2 from the matrix multiply\)\.

For MLA models, we use $d_{QK} = d_{\text{nope}} + d_{\text{rope}}$, yielding an architecture-normalized estimate rather than an exact kernel-level FLOP count. For hybrid-attention models, only the $L_{\text{full}}$ full-attention layers contribute to the quadratic term; linear-attention layers are treated as sequence-linear costs and are absorbed into the linear-time approximation.
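The per-call estimate and the per-case sum described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the names `ModelSpec`, `call_flops`, and `total_pflops` are hypothetical, and the architecture constants of the `toy` model are placeholders rather than values taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Architecture constants used by the estimate (hypothetical container)."""
    n_act: float   # parameters activated per token (N_act)
    l_full: int    # number of full-attention layers (L_full)
    n_heads: int   # attention heads per layer
    d_qk: int      # per-head query/key dimension
    d_v: int       # per-head value dimension


def call_flops(spec: ModelSpec, s_in: int, s_out: int) -> float:
    """FLOPs for one call: 2*N_act*T + L_full*n_heads*(d_QK + d_V)*T^2."""
    t = s_in + s_out
    linear = 2.0 * spec.n_act * t
    attention = spec.l_full * spec.n_heads * (spec.d_qk + spec.d_v) * t ** 2
    return linear + attention


def total_pflops(calls: Iterable[Tuple[ModelSpec, int, int]]) -> float:
    """Sum the estimate over (spec, s_in, s_out) calls and convert to PFLOPs."""
    return sum(call_flops(spec, s_in, s_out) for spec, s_in, s_out in calls) / 1e15


# Placeholder dense model: illustrative numbers only, not from the paper.
toy = ModelSpec(n_act=4e9, l_full=36, n_heads=32, d_qk=128, d_v=128)
print(total_pflops([(toy, 8000, 500), (toy, 12000, 300)]))
```

Because the quadratic term grows with $T^{2}$, doubling the context of a call more than doubles its estimated cost, which is why long Planner contexts dominate the totals discussed later.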

For models with Dynamic Sparse Attention \(DSA\), the attention term is replaced by:

$$\text{Attn}_{\text{DSA}} = C_{\text{idx}} \cdot T^{2} + C_{\text{main}} \cdot T \cdot \min(T, k_{\text{top}}),$$

where $C_{\text{idx}} = L \cdot n_{\text{idx}} \cdot d_{\text{idx}}$ denotes the cost coefficient of the lightweight indexer, $C_{\text{main}} = L \cdot n_{\text{heads}} \cdot (d_{QK} + d_{V})$ denotes the cost coefficient of the main sparse attention over the selected candidates, and $k_{\text{top}}$ is the number of KV candidates selected by the indexer ($k_{\text{top}} = 2048$ for both DeepSeek-V3.2 and GLM-5.1).

Footnote 2: The formula counts only the two dominant FLOPs terms. Auxiliary operations (RMSNorm, SiLU activations, MoE routing, embedding lookup) are not separately itemized, as they are not expected to change the relative ranking across systems dominated by linear projections and attention matmuls. KV cache effects (e.g., reduced attention cost from prefix caching or paged attention) are likewise excluded: such behavior is highly infrastructure-dependent and difficult to estimate reliably across different deployment configurations. We apply the same formula uniformly to all systems to ensure a fair and reproducible comparison.
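The sparse-attention term can be sketched in the same way. The function name `dsa_attention_flops` and the argument values in the example are assumptions made for illustration; only the formula itself and the default $k_{\text{top}} = 2048$ come from the text.

```python
def dsa_attention_flops(
    t: int,
    n_layers: int,
    n_idx: int,
    d_idx: int,
    n_heads: int,
    d_qk: int,
    d_v: int,
    k_top: int = 2048,
) -> int:
    """Attn_DSA = C_idx * T^2 + C_main * T * min(T, k_top)."""
    c_idx = n_layers * n_idx * d_idx             # lightweight indexer coefficient
    c_main = n_layers * n_heads * (d_qk + d_v)   # main sparse-attention coefficient
    return c_idx * t ** 2 + c_main * t * min(t, k_top)


# Placeholder dimensions (not from the paper): compare a short and a long context.
for t in (1024, 65536):
    print(t, dsa_attention_flops(t, n_layers=60, n_idx=4, d_idx=64,
                                 n_heads=64, d_qk=192, d_v=128))
```

Once $T$ exceeds $k_{\text{top}}$, the main term grows linearly in $T$, so only the indexer term remains quadratic.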

#### A.3.2 ACT: Accuracy–Cost Tradeoff

A single accuracy metric cannot distinguish between a system that achieves high accuracy at modest cost and one that achieves the same accuracy by spending orders of magnitude more computation\. Conversely, a pure efficiency metric favors systems that answer cheaply at the expense of correctness\. ACT \(Accuracy–Cost Tradeoff\) is designed to capture both dimensions in a single scalar that reflects the accuracy–efficiency operating point of each system\.

##### Cost normalization\.

Raw PFLOPs values span more than two orders of magnitude across systems (e.g., from ${\sim}200$ to ${\sim}30{,}000$ PFLOPs in our BrowseComp-Plus results). A linear normalization would give disproportionate weight to the few high-cost outliers. We therefore apply log-normalization:

$$\text{CostNorm}_{i} = \frac{\log(\text{PFLOPs}_{i}) - \min_{j} \log(\text{PFLOPs}_{j})}{\max_{j} \log(\text{PFLOPs}_{j}) - \min_{j} \log(\text{PFLOPs}_{j})},$$

which maps all systems to $[0, 1]$ and treats equal multiplicative differences in cost as equal differences on the normalized scale. This is a natural choice, since computational cost tends to scale multiplicatively with context length and model size.
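A minimal sketch of the log-normalization follows. The function name `cost_norm` and the three example systems are hypothetical; the handling of the degenerate case where all systems have equal cost is an assumption, since the text does not address it.

```python
import math
from typing import Dict


def cost_norm(pflops: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize log(PFLOPs) across the systems of one benchmark."""
    logs = {name: math.log(value) for name, value in pflops.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        # Assumption: with identical costs, no system is penalized.
        return {name: 0.0 for name in logs}
    return {name: (x - lo) / (hi - lo) for name, x in logs.items()}


# Illustrative costs only: each system is 10x more expensive than the previous one.
print(cost_norm({"A": 200.0, "B": 2000.0, "C": 20000.0}))
```

In this example the middle system lands near the midpoint of the scale even though its raw cost is much closer to the cheapest system, which is the effect of treating multiplicative gaps as equal steps.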

##### ACT definition\.

ACT is defined as a penalized accuracy:

$$\text{ACT}_{i} = \text{LasJ}_{i} - \alpha \cdot \text{CostNorm}_{i},$$

where $\alpha = 0.05$ is the penalty weight. LasJ accuracy is the primary objective, and a system that achieves the highest accuracy should still rank highly even at elevated cost. The role of the cost term is to break near-ties in accuracy in favor of the more efficient system, and to flag systems whose marginal accuracy gain does not justify the additional computation. At $\alpha = 0.05$, the most expensive system ($\text{CostNorm} = 1$) incurs a maximum penalty of 5 percentage points, which is approximately the minimum meaningful accuracy gap observed in our benchmarks. CostNorm is computed separately within each benchmark so that ACT values are directly comparable across rows within a table but should not be compared across tables.
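The tie-breaking role of the cost term can be seen in a short sketch. The function name `act_score` and all accuracy and cost values below are hypothetical, chosen only to show a near-tie in accuracy being resolved in favor of the cheaper system.

```python
def act_score(lasj: float, cost_norm: float, alpha: float = 0.05) -> float:
    """ACT = LasJ - alpha * CostNorm, with LasJ and CostNorm both in [0, 1]."""
    return lasj - alpha * cost_norm


# Illustrative systems (not results from the paper).
systems = {
    "expensive": {"lasj": 0.78, "cost_norm": 1.0},
    "efficient": {"lasj": 0.77, "cost_norm": 0.2},
    "cheap_weak": {"lasj": 0.60, "cost_norm": 0.0},
}

ranking = sorted(systems, key=lambda name: act_score(**systems[name]), reverse=True)
print(ranking)
```

The one-point accuracy lead of the expensive system is outweighed by its cost penalty, while the cheapest system still ranks last because the penalty is capped at $\alpha$.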

##### Interpretation\.

A system with higher ACT achieves a better accuracy–efficiency operating point: either higher accuracy at comparable cost, or comparable accuracy at substantially lower cost\. Because normalization is benchmark\-relative, adding or removing a system can shift all CostNorm values; ACT should therefore be interpreted in the context of the full comparison set rather than as an absolute score\.

### A.4 Additional Analysis

#### A.4.1 Trim Window Size Balances Planner Context and Efficiency

The Trim mechanism controls how many recent interaction steps are retained in the Planner’s context, with the window size denoted by $k$. To assess its effect, we vary $k \in \{5, 10, 15\}$ and compare these settings with a no-Trim variant, ActiveMem w/o Trim, as shown in Table [7](https://arxiv.org/html/2606.10532#A1.T7).

Table 7: Ablation on Trim window size.

We observe that $k=10$ provides the best accuracy–efficiency trade-off. When $k$ is too small (i.e., $k=5$), the Planner receives insufficient recent context and must compensate by issuing more retrieval queries, substantially increasing Memorizer load (1718.79 PFLOPs) and degrading accuracy. When $k$ is too large (i.e., $k=15$) or absent entirely, Planner cost grows sharply as it processes longer and noisier context windows, without a corresponding accuracy gain. These results confirm that $k=10$ is a well-calibrated operating point and that the Trim mechanism is robust within a moderate range around this value.
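The windowing behavior described above can be illustrated with a bounded buffer. This is only a sketch of the idea of keeping the $k$ most recent steps: the class name `TrimmedHistory` and the string representation of steps are hypothetical, and the actual Trim mechanism may differ in what it retains and how it interacts with stored gists.

```python
from collections import deque
from typing import Deque, List


class TrimmedHistory:
    """Keeps only the k most recent interaction steps (hypothetical helper)."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen drops the oldest entry automatically when full.
        self._steps: Deque[str] = deque(maxlen=k)

    def add(self, step: str) -> None:
        self._steps.append(step)

    def context(self) -> List[str]:
        """Steps currently visible to the Planner, oldest first."""
        return list(self._steps)


history = TrimmedHistory(k=10)
for i in range(25):
    history.add(f"step-{i}")
print(history.context())
```

With this structure the Planner context stays bounded by $k$ regardless of trajectory length, which is the property the ablation varies.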

#### A.4.2 Memorizer Scale Trades Accuracy Against Efficiency

We vary the Memorizer across three parameter scales (0.6B, 1.7B, and 4B) while holding all other components fixed, in order to characterize how model size shapes the accuracy–efficiency trade-off. Table [8](https://arxiv.org/html/2606.10532#A1.T8) reports the results on BrowseComp-Plus.

Table 8: Ablation on Memorizer scale on BrowseComp-Plus (LasJ).

The results reveal a non-trivial interaction between Memorizer scale and Planner efficiency. While reducing Memorizer scale substantially lowers Memorizer cost (from 1,478 PFLOPs for Memorizer-4B to 371 PFLOPs for Memorizer-0.6B), Planner cost moves in the opposite direction, rising from 606 to 744 PFLOPs. This counter-intuitive pattern stems from a capability gap: a weaker 0.6B Memorizer fails to reliably capture critical information during distillation, so the gists it produces are incomplete or misleading. The Planner, receiving lower-quality evidence, is forced into more rounds of suboptimal reasoning (issuing additional retrieval queries, re-examining previously processed content, and converging more slowly), which drives up its own cost. The final LasJ of 0.560 confirms that this additional Planner effort fails to compensate for the information loss at the distillation stage.

Viewed across all three scales, the Planner cost decreases monotonically as Memorizer scale increases (744 → 642 → 606 PFLOPs), indicating that a stronger Memorizer produces more precise and complete gists, reducing the Planner’s reasoning burden at every step. This suggests that investing compute in a more capable Memorizer yields compounding returns: lower Planner cost, higher accuracy, and a more predictable overall computation profile. Memorizer-0.6B reduces total PFLOPs by roughly 44% relative to Memorizer-4B, but at the cost of a 22.6 percentage point accuracy drop, a disproportionate exchange that makes aggressive scale reduction an unfavorable operating point in practice.

#### A.4.3 Memory Consolidation Keeps Planner Context Compact and Clean

To isolate the effect of the Operator’s asynchronous consolidation mechanism, we evaluate ActiveMem (w/o Conso.), a variant in which newly produced gists are appended directly to the existing shard content without any merging or deduplication. Table [9](https://arxiv.org/html/2606.10532#A1.T9) compares this variant against the full ActiveMem on BrowseComp-Plus.

Table 9: Ablation on memory consolidation on BrowseComp-Plus (LasJ).

Without consolidation, Memory Shards grow monotonically as each newly produced gist is appended verbatim to the existing content. Over the course of a long-horizon task, this causes the gist loaded back into the Planner’s context to become increasingly verbose and redundant. Two consequences follow. First, Planner PFLOPs rise notably (606 → 768, +26.5%), since the Planner must attend over a longer and noisier memory context at each reasoning step, a direct manifestation of the quadratic attention cost that consolidation is designed to suppress. Second, the accumulated noise degrades reasoning quality: the Planner receives overlapping or conflicting statements across appended gists, making it harder to identify the most task-relevant signals, which leads to a 6.6 percentage point drop in LasJ (0.786 → 0.720).

These results highlight the dual role of consolidation in ActiveMem\. Beyond simply reducing memory footprint, consolidation actively distills accumulated evidence into a compact, non\-redundant form, ensuring that the Planner always operates over a clean and bounded context regardless of how many documents have been processed\. This property is precisely what allows ActiveMem to maintain both high accuracy and low Planner cost as task length increases\.
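The contrast between the two variants can be illustrated with a deliberately crude stand-in. The sketch below assumes that consolidation only drops gists whose normalized text already exists in the shard; the Operator described in the paper merges and deduplicates content in ways this toy does not model, and the function names and example gists are hypothetical.

```python
from typing import List


def _normalize(gist: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(gist.lower().split())


def append_only(shard: List[str], gist: str) -> List[str]:
    """The w/o-consolidation behavior: every gist is appended verbatim."""
    return shard + [gist]


def consolidate(shard: List[str], gist: str) -> List[str]:
    """Toy consolidation (assumption): skip a gist that is already present."""
    seen = {_normalize(g) for g in shard}
    return shard if _normalize(gist) in seen else shard + [gist]


# Illustrative gists: the second repeats the first with different casing/spacing.
gists = ["Venue X seats 250 guests.", "venue x  seats 250 guests.", "Printer Y opened in 1986."]
raw: List[str] = []
merged: List[str] = []
for g in gists:
    raw = append_only(raw, g)
    merged = consolidate(merged, g)
print(len(raw), len(merged))
```

Even this simple rule keeps the shard from growing with repeated evidence, which is the mechanism behind the lower Planner cost reported for the full system.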

#### A.4.4 Memory Hit Rate as a Search Saturation Indicator

Table 10: Memory hit rate (%) by step range and difficulty level.

The three difficulty levels exhibit qualitatively different memory hit trajectories, as shown in Table [10](https://arxiv.org/html/2606.10532#A1.T10). For easy cases, the hit rate peaks at steps 7–10 ($\approx 7.5\%$) and subsequently declines to 5–6%. Easy cases average only 8.8 steps, so cases that reach step 16+ are a self-selected subset that failed to find an answer early. At this point, having exhausted the small cluster of documents relevant to a straightforward question, the agent is forced to issue increasingly peripheral queries; these queries return documents far from the previously visited region, producing more first-seen documents rather than memory hits, and the hit rate consequently falls. For medium cases, the hit rate rises monotonically from 2.0% to 11.2% with no reversal, reflecting an agent that has sufficient steps to saturate its local retrieval space while repeatedly circling nearby documents. Hard cases also rise monotonically but tell a more nuanced story.

![Refer to caption](https://arxiv.org/html/2606.10532v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.10532v1/x5.png)

Figure 5: Case study traces for a BrowseComp-Plus instance (top) and a GAIA instance (bottom). Each box shows the query issued, the Memorizer-produced gist or Memory Hit result, and the resulting Reasoning Context passed to the Planner. The BrowseComp-Plus question involves multiple interrelated constraints and requires a broader, longer search trajectory to satisfy all of them simultaneously. The GAIA question has fewer constraints and converges to the answer in far fewer steps.

The natural hypothesis is that harder questions cause the agent to fail to find key information, fall back on repeated searches, and thereby accumulate higher memory hits. While this mechanism is valid, Table [10](https://arxiv.org/html/2606.10532#A1.T10) reveals an important qualification: hard cases start slightly above medium at steps 1–10, yet fall below medium thereafter. This is because the inherent nature of hard questions prevents the agent from accumulating a dense cluster of retrieved documents around any single topic: the search space is too broad and the relevant signals too sparse for repeated queries to converge on the same small set of documents. As a result, hard-case agents operate within a wide but diffuse retrieval space, only occasionally re-encountering the same documents, which keeps the hit rate from rising as steeply as in medium cases where the agent circles more tightly around a narrower topic area.

These patterns together suggest that memory hit rate can serve as a practical search saturation indicator. A low hit rate in early steps reflects an effective exploration phase in which the agent is discovering new documents at each turn. A rising hit rate in later steps signals that the locally reachable document set is approaching exhaustion and the agent has begun cycling within already-visited content. When the per-step hit rate exceeds approximately 10–12% for several consecutive steps, continued search is unlikely to yield further marginal benefit, and an early-stopping criterion could be triggered accordingly. We leave the formal design and evaluation of such a mechanism to future work.
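One possible shape for such a criterion is sketched below. Since the paper leaves the mechanism to future work, everything here is an assumption: the function name `should_stop`, the 10% threshold (the lower end of the range mentioned above), and a patience of three consecutive steps are illustrative choices, not evaluated settings.

```python
from typing import Sequence


def should_stop(hit_rates: Sequence[float], threshold: float = 0.10, patience: int = 3) -> bool:
    """Return True once the last `patience` per-step hit rates all exceed `threshold`.

    `hit_rates` holds one value per step, expressed as a fraction in [0, 1].
    """
    if patience <= 0 or len(hit_rates) < patience:
        return False
    return all(rate > threshold for rate in hit_rates[-patience:])


# Illustrative trajectory: exploration first, then saturation.
trajectory = [0.02, 0.03, 0.05, 0.08, 0.11, 0.12, 0.13]
for step in range(1, len(trajectory) + 1):
    if should_stop(trajectory[:step]):
        print(f"stop after step {step}")
        break
```

Requiring several consecutive steps above the threshold avoids stopping on a single noisy spike, at the price of a few extra retrieval steps before termination.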

#### A.4.5 Case Study

We present two representative cases from BrowseComp\-Plus and GAIA to illustrate the working pipeline of ActiveMem\. GAIA questions typically involve fewer and more directly verifiable constraints, allowing ActiveMem to converge to an answer within a small number of steps\. BrowseComp\-Plus questions are substantially harder: they involve multiple interrelated constraints that must be jointly satisfied, requiring longer reasoning trajectories and a larger number of retrieved documents before sufficient evidence accumulates\.

##### BrowseComp\-Plus\.

Figure [5](https://arxiv.org/html/2606.10532#A1.F5) (top) traces the trajectory for a question asking for the designer of wedding invitations from a 2014 ceremony, subject to four constraints: the DJ was recognized as top state wedding DJ in a 2023 article, the venue accommodates up to 250 guests, the photographer was a husband-and-wife team, and the invitation printer was founded in the mid-1980s.

At Step 1, two parallel queries initialize two shards: doc4059 (Charlotte NC top DJ list including Split Second Sound) and doc67764 (a husband-and-wife photo team). Steps 2–10 explore multiple false leads (Steve Foltz lacks 2023 recognition; Mike & Samantha’s wedding has no verifiable top-state DJ; Laura Lonsdale works on birthday cards), and none satisfies all four constraints simultaneously. The pivot occurs at Step 11, when doc76272 is retrieved and distilled into: “A wedding at Champagne Manor featured DJ Split Second Sound and invitations by Demi Mabry,” linking the DJ, venue, and answer candidate in one gist. At Step 12, the Similarity Judge returns `SIMILAR` for doc4059, triggering a Memory Hit that confirms Split Second Sound’s top-state recognition without re-invoking the Memorizer; for doc76272 the Judge returns `NEW`, so the Memorizer re-distills and the Operator consolidates the result into Shard #14. At Step 13, a fresh retrieval confirms Copy Express opened in 1986, and a Memory Hit on doc67764 confirms Rach Loves Troy as the husband-and-wife photographer. All four constraints are satisfied, and the Planner submits Demi Mabry at Step 14.

##### GAIA\.

Figure [5](https://arxiv.org/html/2606.10532#A1.F5) (bottom) traces a two-hop question: what does “R” stand for in the three core policies of the content type violated in the Legume Wikipedia page logs before December 2022?

At Step 1, the query “Legume Wikipedia page 2022 violation public logs” retrieves a page indicating that the violation concerned Wikipedia’s core content policies, identifying the content type needed for the second hop. At Step 2, the query “Wikipedia three core content policies R stands for” retrieves two independent documents that both identify “R” as “Research” (No Original Research). The Planner submits “Research” after a brief verification pass, completing the trajectory in three steps with two fresh retrievals and no Memory Hits.

### A.5 The Use of AI Assistants

In this paper, AI assistants \(ChatGPT and Claude\) were used exclusively for language polishing, including grammar correction, phrasing, and stylistic refinement\. They were not used to generate scientific content such as research ideas, methods, experiments, or related work\. No confidential, personal, or proprietary information was shared with the models\. The authors take full responsibility for the scientific content, which was entirely authored and verified by the authors\.
