S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

arXiv cs.CL Papers

Summary

S3Mem proposes a structured spatiotemporal scene-event memory framework for long-horizon interactive question answering. It combines anchor-sensitive retrieval with a token-budget-aware evidence interface and outperforms standard RAG on all four evaluated environments.

arXiv:2605.28831v1 Announce Type: new Abstract: Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.

# Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Source: [https://arxiv.org/html/2605.28831](https://arxiv.org/html/2605.28831)

Encheng Su¹, Jinouwen Zhang³, Jianyu Wu², Qiucheng Yu⁴, Chen Tang⁵, Pengze Li⁶, Lintao Wang⁷, Yizhou Wang⁵†, Xinzhu Ma⁸, Shixiang Tang⁵, Aoran Wang³†

¹University of Science and Technology of China, ²Shanghai Jiao Tong University, ³Shanghai AI Laboratory, ⁴City University of Hong Kong, ⁵The Chinese University of Hong Kong, ⁶Fudan University, ⁷The University of Sydney, ⁸Beihang University

†Corresponding authors

###### Abstract

Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the *trajectory-to-answer interface* of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3Mem, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3Mem writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3Mem is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3Mem on two internal headline environments (Crafter, Jericho) and two out-of-family environments (ScienceWorld, ALFWorld). Under a shared frozen answer-time protocol, S3Mem consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on ScienceWorld while using dramatically fewer evidence tokens. Three adapted recent baselines (A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted) improve over Vanilla RAG in several settings, but none matches S3Mem's overall accuracy–efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy–efficiency frontier for long-horizon interactive QA than more generic memory interfaces.

Keywords: episodic memory, long-horizon agents, interactive question answering, structured retrieval, evidence interface

## 1 Introduction

LLM/VLM agents have made substantial progress in text-based environments [[8](https://arxiv.org/html/2605.28831#bib.bib21), [21](https://arxiv.org/html/2605.28831#bib.bib23)], visually grounded interaction [[7](https://arxiv.org/html/2605.28831#bib.bib20)], and multi-step decision making [[26](https://arxiv.org/html/2605.28831#bib.bib30)]. Yet even strong agents remain brittle when asked about events that occurred much earlier in an episode. This weakness is especially visible in *long-horizon interactive question answering*, where a system must not only remember past observations, but also recover a usable evidence chain across objects, events, relations, state changes, and temporal order.

A common practice is to externalize trajectory history as text chunks or short summaries, then retrieve the top-$k$ snippets with retrieval-augmented generation (RAG) [[13](https://arxiv.org/html/2605.28831#bib.bib1)]. This approach is simple and often effective in short contexts, but it breaks down when the question depends on repeated-event disambiguation, temporal offsets, spatial reasoning, or multi-hop state tracking. The problem is not merely that relevant evidence exists far away in the trajectory. Rather, generic memory interfaces weaken the trajectory-to-answer transfer in three recurrent ways. First, writing trajectories as flat text erases structure that later reasoning depends on. Second, retrieval based primarily on surface similarity often returns snippets that are locally relevant but globally chain-incomplete. Third, the answer stage receives long unstructured fragments rather than a compact, query-aligned evidence set, forcing the model to reconstruct the chain itself.

We therefore argue that long-horizon interactive QA should be treated not merely as a text-retrieval problem, but as an *episodic evidence-construction* problem over agent-generated trajectories. A useful memory interface for this setting should satisfy three properties: (i) it should write trajectories into a structured representation aligned with the interaction process; (ii) it should retrieve evidence using question-conditioned anchors such as target steps, repeated occurrences, and state transitions; and (iii) it should expose a compact but sufficient evidence interface under a strict token budget. Our central claim is deliberately narrow: under the current frozen answer-time protocol, the key to long-horizon interactive QA is not exposing more history, but exposing the *right structured evidence* from history.

Motivated by this view, we propose S3Mem, a structured scene-event episodic memory framework for long-horizon interactive QA. Instead of treating interaction history as generic text, S3Mem writes each interaction step as a structured *memory unit* containing scene, event, state, and temporal context. At query time, it performs *anchor-sensitive retrieval* that explicitly recovers target-step anchors, occurrence anchors, and state-transition anchors. It then constructs a compact, *token-budget-aware evidence interface* rather than exposing all retrieved evidence verbatim. In this sense, S3Mem should be understood as a structured evidence harness: its role is to convert long trajectories into the smallest query-aligned evidence chain that still supports answer-time inference.

This framing also clarifies the limitation of competing baselines. The main weakness of plain-text RAG, graph-only retrieval, and recent generic memory neighbors is not that they lack memory altogether, but that their memory interfaces generalize weakly under long-horizon, budgeted QA. They may store large amounts of history or compress it efficiently, but they do not consistently preserve the decisive anchor-bearing step together with the minimal local chain required by the question. As a result, generic baselines face a tradeoff: exposing more history can improve accuracy only at very high token cost, whereas generic shrinking reduces tokens without recovering the same chain-complete evidence. Our objective is therefore not to build a general agent operating system, but to improve the trajectory-to-answer interface for episodic QA.

We evaluate S3Mem on two internal headline environments, Crafter [[7](https://arxiv.org/html/2605.28831#bib.bib20)] and Jericho [[8](https://arxiv.org/html/2605.28831#bib.bib21)], and two out-of-family environments, ScienceWorld [[23](https://arxiv.org/html/2605.28831#bib.bib22)] and ALFWorld [[21](https://arxiv.org/html/2605.28831#bib.bib23)]. Across all four environments, S3Mem consistently outperforms Vanilla RAG and surpasses the Graph-NoReader baseline on Crafter, Jericho, and ALFWorld, while matching it on ScienceWorld at dramatically lower evidence cost. The external evidence is intentionally interpreted more narrowly than the internal headline. On ScienceWorld, S3Mem mainly demonstrates *efficiency generalization*: it preserves near-ceiling accuracy while greatly reducing evidence tokens. On ALFWorld, it demonstrates a stronger *accuracy and efficiency* gain. At the same time, Full-History controls show that exposing more history can still improve accuracy when token cost is ignored, so the strongest supported claim of this paper is an *accuracy–efficiency frontier* claim rather than unconditional accuracy optimality.

We further sharpen this interpretation through three complementary analyses. First, we compare against three recent neighboring memory systems (A-MEM, MemoryOS, and LightMem [[25](https://arxiv.org/html/2605.28831#bib.bib15), [12](https://arxiv.org/html/2605.28831#bib.bib16), [4](https://arxiv.org/html/2605.28831#bib.bib18)]), each adapted to the same trajectory-grounded episodic-QA setting. Second, we test the strongest alternative explanation for the gain, namely that any sufficiently aggressive generic compressor might recover a similar frontier, through Full-History, Summarize-then-Answer, and RTK-style generic-compression controls. Third, we examine robustness under trajectory-source shift through a four-actor Crafter rollout study, and include a bonus out-of-family transfer test on ATM-Bench [[27](https://arxiv.org/html/2605.28831#bib.bib26)].

Our contributions are threefold:

1. Method contribution: We introduce S3Mem, a structured scene-event episodic memory interface that unifies structured writing, anchor-sensitive retrieval, and token-budget-aware evidence exposure for long-horizon interactive QA.
2. Empirical contribution: We establish a stronger accuracy–efficiency frontier on the internal headline environments (Crafter, Jericho), and show complementary out-of-family evidence through efficiency generalization on ScienceWorld and accuracy+efficiency generalization on ALFWorld.
3. Interpretation contribution: We bound the paper's claim through a control-and-robustness suite (recent neighboring baselines, context-length and generic-compression controls, write-side ablations, answerer-fairness diagnostics, and rollout robustness). The suite shows that the main gain is consistent with a structured evidence harness under the current frozen answer-time protocol, while the strongest remaining non-generality lies in answer-time consumption.

## 2 Related Work

#### Plain-text retrieval and graph-based retrieval.

Retrieval-augmented generation (RAG) augments language models with retrieved passages from an external knowledge source [[13](https://arxiv.org/html/2605.28831#bib.bib1), [6](https://arxiv.org/html/2605.28831#bib.bib2), [11](https://arxiv.org/html/2605.28831#bib.bib3)]. A common adaptation to long-context QA is to flatten history into text chunks and retrieve the top-$k$ snippets at answer time. Graph-based retrieval improves on flat chunking by organizing information into graph structures that better support compositional queries and relation-aware reasoning [[3](https://arxiv.org/html/2605.28831#bib.bib4), [9](https://arxiv.org/html/2605.28831#bib.bib5), [1](https://arxiv.org/html/2605.28831#bib.bib19)]. These approaches are effective for static corpora, document collections, and explicit knowledge graphs. However, long-horizon interactive QA poses a different challenge: the evidence is distributed across temporally separated observations, actions, revisits, and state transitions within an agent-generated trajectory. In this setting, the core question is not merely whether relevant snippets or nodes can be retrieved, but whether the retrieved evidence preserves the *scene–event–state–temporal* chain required by the question under a strict answer-time budget. Our work builds on this retrieval literature but targets a different failure mode: the weakness of generic retrieval interfaces when the knowledge source is the agent's own trajectory rather than a static text corpus.

#### Structured and hierarchical memory for agents.

A growing literature studies memory systems for LLM- and VLM-based agents, including natural-language memory streams, self-reflection, note-like memory organization, and hierarchical memory management [[19](https://arxiv.org/html/2605.28831#bib.bib6), [20](https://arxiv.org/html/2605.28831#bib.bib7), [18](https://arxiv.org/html/2605.28831#bib.bib8), [22](https://arxiv.org/html/2605.28831#bib.bib9), [25](https://arxiv.org/html/2605.28831#bib.bib15), [12](https://arxiv.org/html/2605.28831#bib.bib16), [15](https://arxiv.org/html/2605.28831#bib.bib17), [4](https://arxiv.org/html/2605.28831#bib.bib18)]. These systems make important progress toward persistent memory, but they usually target broader goals such as conversational continuity, general agent infrastructure, long-term memory management across heterogeneous tasks, or continual adaptation over extended agent lifetimes. Our setting is narrower: we study *environment-grounded long-horizon QA* over agent-generated trajectories, where the main challenge is to preserve and expose the decisive evidence chain for a downstream question. This makes our method closer to a *structured evidence harness* than to a general-purpose agent operating system. Within this broader landscape, A-MEM, MemoryOS, and LightMem are especially relevant because they represent three neighboring design families (note-link organization, hierarchical memory management, and efficiency-oriented memory construction) that remain adaptable to our setting and therefore form the most informative recent runnable neighbors in our experiments.

#### Benchmarking long-horizon agent memory.

Recent benchmarks increasingly focus on evaluating long-term memory rather than proposing a memory method directly, including long-horizon conversational memory and interactive agent-memory benchmarks [[16](https://arxiv.org/html/2605.28831#bib.bib24), [24](https://arxiv.org/html/2605.28831#bib.bib28), [14](https://arxiv.org/html/2605.28831#bib.bib25), [10](https://arxiv.org/html/2605.28831#bib.bib29), [27](https://arxiv.org/html/2605.28831#bib.bib26), [17](https://arxiv.org/html/2605.28831#bib.bib27)]. These benchmarks are valuable because they show that current agents remain weak on long-horizon memory tasks and expose recurring failure modes for naive long-context prompting or simple retrieval. They help establish *that* long-horizon memory is difficult and where current systems break. However, they do not by themselves answer the method question we focus on here: *what should be written, retrieved, and exposed as evidence under a token budget when the knowledge source is the agent's own trajectory?* Our work complements this benchmark literature by studying the memory-representation and evidence-interface problem directly, rather than introducing a new benchmark family.

#### Programmatic reasoning as analysis support.

Program-aided reasoning methods translate natural-language questions into executable programs and use structured execution to improve reasoning reliability [[5](https://arxiv.org/html/2605.28831#bib.bib31), [2](https://arxiv.org/html/2605.28831#bib.bib32)]. In our work, parser/executor tools are not the main method family. Instead, they serve as *boundary diagnostics*: they help localize how much of the remaining non-generality lies in answer-time consumption rather than in the memory interface itself. We therefore do not position S3Mem as a semantic-parsing system. The core contribution of this paper remains a structured memory-and-evidence interface for long-horizon interactive QA.

## 3 Method

### 3.1 Task Setup and Overview

We study long-horizon interactive question answering over agent-generated trajectories. Let an episode trajectory be

$$\tau = (o_1, a_1, \ldots, o_T, a_T),$$

where $o_t$ is the observation at step $t$ and $a_t$ is the action taken. Given a question $q$ about this trajectory, the system must predict an answer $y$. Questions may involve single-step lookup, repeated-event disambiguation, temporal offsets, state-chain reasoning, inventory changes, location visits, and multi-hop transitions anchored on earlier events.

Unlike open-domain QA, the knowledge source is not an external corpus but the agent's own interaction history. We therefore treat the task as an *episodic evidence-construction* problem: the system first writes the trajectory into memory, then retrieves and exposes the subset of evidence needed to answer the question. We evaluate methods along two dimensions: (i) *answer accuracy* (exact match, EM), and (ii) *evidence efficiency* (average evidence token cost). The second dimension is central to the problem setting: a method that improves only by re-inserting large amounts of raw history is not a strong memory interface for long-horizon agents. Accordingly, the strongest claim of this paper is an *accuracy–efficiency frontier* claim under the current frozen answer-time protocol, rather than unconditional accuracy optimality.
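The two evaluation axes can be illustrated with a minimal sketch. The answer normalization and the whitespace token counter below are assumptions made for illustration; this section does not specify the tokenizer or the normalization rule used in the paper, and the record keys are hypothetical.

```python
from statistics import mean

def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(prediction: str, gold: str) -> bool:
    # EM compares the normalized strings for equality.
    return normalize(prediction) == normalize(gold)

def count_tokens(evidence: str) -> int:
    # Stand-in tokenizer (assumption): whitespace split.
    return len(evidence.split())

def evaluate(records):
    """records: list of dicts with hypothetical keys
    'prediction', 'gold', and 'evidence' (the text shown to the answerer)."""
    em = mean(1.0 if exact_match(r["prediction"], r["gold"]) else 0.0
              for r in records)
    cost = mean(count_tokens(r["evidence"]) for r in records)
    return {"em": em, "avg_evidence_tokens": cost}

records = [
    {"prediction": "Iron Pickaxe", "gold": "iron pickaxe",
     "evidence": "step 3 craft iron pickaxe"},
    {"prediction": "kitchen", "gold": "cellar",
     "evidence": "step 9 go north"},
]
result = evaluate(records)
```

Reporting both numbers together is what makes a frontier comparison possible: a method that raises EM only by increasing `avg_evidence_tokens` moves along the frontier rather than improving it.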

This formulation also clarifies the weakness of generic baselines. Facts such as "the item was obtained after the second visit to a location" or "the state change happened two steps after a particular anchor event" are distributed across multiple trajectory steps and are easily fragmented by flat chunking, shallow retrieval, or generic compression. Our objective is therefore to construct a memory interface $\mathcal{M}$ that preserves the scene–event–state–temporal structure required for answer-time inference while remaining effective under a strict evidence budget.

S3Mem consists of four stages: (1) structured scene-event write, (2) anchor-sensitive retrieval, (3) token-budget-aware evidence interface, and (4) answer-time layer and protocol boundary. Compared with standard RAG, S3Mem does not merely replace the retriever. It changes the intermediate interface between trajectories and final answers by explicitly representing, routing, and packing episodic evidence. In this sense, S3Mem is best understood as a *structured evidence harness* for long-horizon interactive QA. Figure [1](https://arxiv.org/html/2605.28831#S3.F1) illustrates the overall pipeline. Additional algorithmic details, environment-specific schema choices, and anchor-extraction rules are deferred to Appendix [B](https://arxiv.org/html/2605.28831#A2), Appendix [B.2](https://arxiv.org/html/2605.28831#A2.SS2), and Appendix [B.3](https://arxiv.org/html/2605.28831#A2.SS3).

![Refer to caption](https://arxiv.org/html/2605.28831v1/x1.png)

Figure 1: Overview of the S3Mem pipeline. Agent trajectories are written as structured memory units, retrieved through anchor-sensitive retrieval, packed into a compact evidence interface under a token budget $B$, and then consumed by the frozen answer-time layer.

### 3.2 Structured Scene-Event Write

Each step or local segment of a trajectory is written as a structured *memory unit* containing fields such as: step index, action, objects, events, relations, state or inventory, location or spatial context, and a short raw summary. The exact schema varies across environments. In Crafter, memory units emphasize objects, locations, inventory, and local environmental state. In Jericho, ScienceWorld, and ALFWorld, they emphasize observations, actions, locations, item gains, and state transitions. Across all environments, however, the same design principle holds: interaction history is written as *structured scene-event units* rather than flattened into generic text. A full environment-by-environment schema mapping is reported in Appendix [B.2](https://arxiv.org/html/2605.28831#A2.SS2).

This design is important because the downstream question often depends on *joint* access to scene, event, state, and temporal information. Across all four environments, full scene-event writing consistently outperforms the `event_only`, `object_only`, and `plain_chunk` variants (Table [3](https://arxiv.org/html/2605.28831#S5.T3)), indicating that the useful memory signal lies not in any single field alone, but in the joint representation the fields provide.

Formally, we write the memory at step $t$ as

$$m_t = \bigl(t,\; a_t,\; O_t,\; E_t,\; R_t,\; S_t,\; L_t,\; u_t\bigr),$$

where $a_t$ is the action, $O_t$ is the set of salient objects, $E_t$ is the local event type or event set, $R_t$ is the relation set, $S_t$ is the local state or inventory snapshot, $L_t$ is the location or spatial context, and $u_t$ is a short raw summary. The exact schema is environment-specific, but the representation always preserves scene, event, state, and temporal context jointly. This is the first point where S3Mem departs from generic memory baselines: the write stage already exposes the fields later needed for evidence routing.
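As a rough illustration, the memory unit $m_t$ can be pictured as a typed record with one field per component. The class name, field names, and the `parsed` dictionary format below are hypothetical; the paper's environment-specific schemas are given in its Appendix B.2 and are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryUnit:
    step: int                           # t
    action: str                         # a_t
    objects: frozenset = frozenset()    # O_t: salient objects
    events: frozenset = frozenset()     # E_t: local event types
    relations: frozenset = frozenset()  # R_t: e.g. ("agent", "near", "tree")
    state: tuple = ()                   # S_t: sorted (item, count) inventory pairs
    location: str = ""                  # L_t: location or spatial context
    summary: str = ""                   # u_t: short raw summary

def write_step(t: int, action: str, parsed: dict) -> MemoryUnit:
    """Build one unit from the output of an assumed environment-specific
    parser; missing fields fall back to empty values."""
    return MemoryUnit(
        step=t,
        action=action,
        objects=frozenset(parsed.get("objects", ())),
        events=frozenset(parsed.get("events", ())),
        relations=frozenset(parsed.get("relations", ())),
        state=tuple(sorted(parsed.get("inventory", {}).items())),
        location=parsed.get("location", ""),
        summary=parsed.get("summary", ""),
    )

unit = write_step(
    3,
    "collect wood",
    {"objects": ["tree"], "events": ["gain_item"],
     "inventory": {"wood": 1}, "location": "forest",
     "summary": "collected wood from a tree"},
)
```

Keeping the fields separate, rather than concatenating them into one string, is what lets a later retrieval stage filter on event type or location directly.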

### 3.3 Anchor-Sensitive Retrieval

Many long-horizon questions cannot be answered by retrieving the most textually similar snippet. Questions such as *"What happened two steps after the second visit to the lab bench?"* or *"Where was the agent after obtaining the item?"* contain explicit or implicit *anchors*. If retrieval is driven only by surface similarity, the system may collect fragments that are individually relevant but still insufficient as a complete evidence chain.

S3Mem therefore performs *anchor-sensitive retrieval*. In addition to standard lexical or dense similarity, it explicitly recovers three types of question-conditioned anchors:

- Target-step anchors: steps directly referenced by the question.
- Occurrence anchors: first, second, third, or last occurrence of an event.
- State-transition anchors: item gains, observation changes, or location visits that serve as chain entry points.

At a coarse level, the system extracts cues from the question (target object, triggering event, queried field, occurrence index, and temporal offset) and uses them to construct a candidate set more likely to contain the full evidence chain. Unless otherwise noted, anchor extraction is rule/template-driven rather than delegated to a learned question parser. This keeps the retrieval stage more inspectable and makes the main source of gain easier to localize: the key signals come from question-derived anchors and explicit memory fields, not from benchmark-specific answer heuristics. The full extraction rules, default fallbacks, and environment-specific variants are listed in Appendix [B.3](https://arxiv.org/html/2605.28831#A2.SS3).
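A toy sketch of rule/template-driven cue extraction is shown below. The regular expressions, the ordinal table, and the event names (`visit`, `gain_item`) are invented for illustration and cover only the two example questions from this section; they are not the paper's actual rules, which are listed in its Appendix B.3.

```python
import re
from typing import NamedTuple, Optional

ORDINALS = {"first": 1, "second": 2, "third": 3, "last": -1}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Hypothetical templates: (pattern capturing the target, event name).
EVENT_TEMPLATES = [
    (re.compile(r"visit to the ([\w ]+?)\??$"), "visit"),
    (re.compile(r"obtaining the ([\w ]+?)\??$"), "gain_item"),
]

class AnchorCues(NamedTuple):
    target: Optional[str]         # target object or entity
    event: Optional[str]          # triggering event
    queried_field: Optional[str]  # field the question asks for
    occurrence: Optional[int]     # 1, 2, 3, or -1 for "last"
    offset: Optional[int]         # signed temporal offset in steps

def extract_cues(question: str) -> AnchorCues:
    q = question.strip().lower()
    if q.startswith("where"):
        queried_field = "location"
    elif q.startswith("what happened"):
        queried_field = "event"
    else:
        queried_field = None

    occurrence = next(
        (k for word, k in ORDINALS.items() if re.search(rf"\b{word}\b", q)),
        None,
    )

    offset = None
    m = re.search(r"\b(\w+) steps? (after|before)\b", q)
    if m:
        word = m.group(1)
        n = NUMBER_WORDS.get(word, int(word) if word.isdigit() else None)
        if n is not None:
            offset = n if m.group(2) == "after" else -n

    target = event = None
    for pattern, name in EVENT_TEMPLATES:
        hit = pattern.search(q)
        if hit:
            target, event = hit.group(1), name
            break
    return AnchorCues(target, event, queried_field, occurrence, offset)

cues = extract_cues(
    "What happened two steps after the second visit to the lab bench?"
)
```

Because every cue comes from an explicit pattern, a failed extraction can be traced to a specific rule, which is the inspectability argument made above.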

We denote the extracted anchor tuple by

$$A(q) = \bigl(o^{\ast},\; e^{\ast},\; f^{\ast},\; k^{\ast},\; \Delta^{\ast}\bigr),$$

where $o^{\ast}$ denotes the target object or entity, $e^{\ast}$ the triggering event, $f^{\ast}$ the queried field, $k^{\ast}$ the occurrence anchor (for example first, second, or last), and $\Delta^{\ast}$ the temporal offset when present. Retrieval then proceeds in two stages: an initial text-based candidate search, followed by anchor-aware expansion and reranking. At a high level, reranking prioritizes units with high

$$s(m, q) = s_{\mathrm{text}}(m, q) + \lambda_a\, s_{\mathrm{anchor}}(m, A(q)) + \lambda_c\, s_{\mathrm{chain}}(m, A(q)),$$

where $s_{\mathrm{text}}$ measures lexical or dense similarity, $s_{\mathrm{anchor}}$ measures direct anchor compatibility, and $s_{\mathrm{chain}}$ rewards units that help preserve a local evidence chain around the anchor. We use this expression as an abstract description of the ordering policy rather than as a claim that one fixed numeric parameterization is universal across all environments. What is fixed is the ranking principle: text relevance alone is insufficient, so units that preserve anchor compatibility and local chain support are promoted. Concrete ranking features, tie-break order, and environment-specific details are reported in Appendix [B.4](https://arxiv.org/html/2605.28831#A2.SS4).
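To make the ordering principle concrete, here is a small sketch of the score $s(m, q)$ on toy data. All three feature functions are stand-ins chosen for illustration: Jaccard word overlap for $s_{\mathrm{text}}$, a 0/1 indicator of the selected anchor step for $s_{\mathrm{anchor}}$, and a 0/1 window test for $s_{\mathrm{chain}}$. The weights `lam_a` and `lam_c` and the dictionary layout are likewise assumptions; the paper's concrete features are in its Appendix B.4.

```python
def s_text(unit, question):
    # Stand-in for lexical/dense similarity: Jaccard word overlap.
    q = set(question.lower().split())
    u = set(unit["summary"].lower().split())
    return len(q & u) / len(q | u) if q | u else 0.0

def pick_occurrence(steps, k):
    # Select the k-th matching step (k = -1 means "last"); None keeps all.
    if k is None or not steps:
        return steps
    idx = k - 1 if k > 0 else k
    return [steps[idx]] if -len(steps) <= idx < len(steps) else []

def rerank(units, question, anchor, lam_a=1.0, lam_c=0.5):
    matches = sorted(
        u["step"] for u in units
        if anchor["event"] in u["events"] and anchor["target"] in u["objects"]
    )
    anchors = pick_occurrence(matches, anchor["occurrence"])
    off = anchor["offset"] or 0
    scored = []
    for u in units:
        s_anchor = 1.0 if u["step"] in anchors else 0.0
        s_chain = 1.0 if any(
            min(a, a + off) <= u["step"] <= max(a, a + off) for a in anchors
        ) else 0.0
        total = s_text(u, question) + lam_a * s_anchor + lam_c * s_chain
        scored.append((total, u["step"]))
    # Higher score first; ties broken by earlier step (an assumed rule).
    return [step for _, step in sorted(scored, key=lambda x: (-x[0], x[1]))]

def unit(step, summary, events=(), objects=()):
    return {"step": step, "summary": summary,
            "events": set(events), "objects": set(objects)}

units = [
    unit(1, "visit lab bench", ["visit"], ["lab bench"]),
    unit(2, "go to hallway"),
    unit(3, "visit lab bench", ["visit"], ["lab bench"]),
    unit(4, "pick up beaker", ["gain_item"], ["beaker"]),
    unit(5, "pour water"),
    unit(6, "visit lab bench", ["visit"], ["lab bench"]),
]
question = "what happened two steps after the second visit to the lab bench"
anchor = {"target": "lab bench", "event": "visit",
          "occurrence": 2, "offset": 2}
ranking = rerank(units, question, anchor)
```

On this toy trajectory, text similarity alone cannot separate the three visits to the lab bench (steps 1, 3, and 6 have identical summaries). The anchor and chain terms promote the second visit and the two steps that follow it, which is the chain the question actually needs.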

### 3.4 Token-Budget-Aware Evidence Interface

Good retrieval alone is not sufficient if the answer stage still receives long unstructured fragments. S3Mem therefore constructs a compact evidence interface under a fixed token budget $B$. The packing strategy prioritizes: (i) steps directly aligned with the query anchor, (ii) local context around target events, objects, or locations, and (iii) the minimal neighborhoods needed to preserve the decisive evidence chain.

This stage is not generic summarization. Its goal is to retain the smallest structured subset of trajectory evidence sufficient for answer-time inference. This component is central to the paper's efficiency claim: in multiple environments, S3Mem uses far fewer evidence tokens than both Vanilla RAG and Graph-NoReader while maintaining or improving accuracy under the current frozen answer-time protocol.

We describe the packing stage as a budgeted evidence-selection problem:

$$\max_{\mathcal{E} \subseteq \mathcal{C}_q} F(\mathcal{E}; q) \quad \text{s.t.} \quad \mathrm{TokenCost}(\mathcal{E}) \leq B,$$

where $\mathcal{C}_q$ is the reranked candidate set and $F(\mathcal{E}; q)$ favors three properties: preserving anchor steps, preserving the minimal local chain around those anchors, and avoiding redundant units that consume budget without adding evidence support. In practice, S3Mem uses a greedy construction procedure that follows this priority order rather than solving a global optimization problem exactly.

More concretely, the packer traverses the reranked candidate list and adds a unit only if it satisfies at least one of three conditions: (1) it is itself an anchor-bearing step, (2) it lies in the minimal neighborhood needed to support an anchor-linked transition or temporal offset, or (3) it supplies a required state-change fact not already covered by previously selected units. Units that are lexically similar but chain-redundant are skipped. Formal tie-break and stopping rules are reported in Appendix [B.5](https://arxiv.org/html/2605.28831#A2.SS5).
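The three admission conditions can be sketched as a greedy loop over the reranked candidates. This is an illustrative approximation, not the paper's packer: the unit layout, the `required_facts` set, the whitespace token count, and the choice to skip (rather than stop at) a unit that does not fit the budget are all assumptions, since the formal stopping rules are deferred to Appendix B.5.

```python
def pack_evidence(candidates, anchor_steps, offset, required_facts, budget):
    """Greedy budgeted selection over candidates given in reranked order.

    Each candidate is a dict with hypothetical keys 'step', 'text', 'facts'.
    Returns the selected units in chronological order and the tokens used.
    """
    # Steps between each anchor and the step its temporal offset points to.
    window = {
        s
        for a in anchor_steps
        for s in range(min(a, a + offset), max(a, a + offset) + 1)
    }
    selected, covered, used = [], set(), 0
    for u in candidates:
        is_anchor = u["step"] in anchor_steps                   # condition (1)
        in_chain = u["step"] in window                          # condition (2)
        new_facts = (set(u["facts"]) & required_facts) - covered  # condition (3)
        if not (is_anchor or in_chain or new_facts):
            continue  # lexically similar but chain-redundant
        cost = len(u["text"].split())  # stand-in token count (assumption)
        if used + cost > budget:
            continue  # assumed rule: skip units that do not fit
        selected.append(u)
        used += cost
        covered |= set(u["facts"]) & required_facts
    return sorted(selected, key=lambda u: u["step"]), used

def cand(step, text, facts=()):
    return {"step": step, "text": text, "facts": list(facts)}

# Candidates in reranked order (anchor step first, then its chain).
candidates = [
    cand(3, "visit lab bench"),
    cand(4, "pick up beaker", ["has_beaker"]),
    cand(5, "pour water"),
    cand(1, "visit lab bench"),
    cand(6, "visit lab bench"),
    cand(2, "go to hallway", ["door_open"]),
]
packed, used = pack_evidence(
    candidates, anchor_steps={3}, offset=2,
    required_facts={"door_open"}, budget=12,
)
```

In this example the other two visits to the lab bench are dropped despite their high lexical similarity, while step 2 is admitted only because it supplies a required fact that nothing else covers.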

This stage is what turns S3Mem from a better retriever into a stronger evidence harness. The packer preserves anchor steps first, then local chain support, then minimal state-transition completion, rather than performing generic free-form compression. As a result, the method does not simply expose *less* history than generic baselines; it exposes a smaller but more query-complete evidence set.

### 3.5 Answer-Time Layer and Protocol Boundary

Our main evaluation uses the current frozen end-to-end answer-time protocol. Under this protocol, S3Mem should be understood as a *structured memory-interface method*: the paper's main claim is that structured write, anchor-sensitive retrieval, and budget-aware evidence exposure improve the trajectory-to-answer interface under the current QA setup. However, we also compare three answer-time protocols (`current`, `generic`, and `gold_executor`) in a fairness analysis (Section [5.1](https://arxiv.org/html/2605.28831#S5.SS1)). These comparisons help identify which gains come from the memory core and which still depend on answer-time consumption.

This boundary is central to the paper's framing. The fairness analysis is intended to sharpen interpretation rather than to prove full protocol independence. Accordingly, our strongest supported claim is deliberately narrow: S3Mem provides a stronger structured evidence harness under the current frozen answer-time protocol than more generic memory interfaces. The strongest remaining source of non-generality lies in the answer-time layer rather than in the write-side representation alone.

## 4 Experimental Setup

### 4.1 Evaluation Scope and Protocol

We evaluate S3Mem on four environments: Crafter [[7](https://arxiv.org/html/2605.28831#bib.bib20)], Jericho [[8](https://arxiv.org/html/2605.28831#bib.bib21)], ScienceWorld [[23](https://arxiv.org/html/2605.28831#bib.bib22)], and ALFWorld [[21](https://arxiv.org/html/2605.28831#bib.bib23)]. We assign them two distinct evaluative roles. Crafter and Jericho serve as the *internal headline environments*: they define the main tables and support the paper's strongest in-family claims. ScienceWorld and ALFWorld serve as *out-of-family generalization environments*: they test whether the same structured evidence interface remains useful beyond the internal benchmark family. The two external environments are intentionally not treated as interchangeable benchmark copies. Instead, ScienceWorld is used primarily for *efficiency generalization*, whereas ALFWorld is used for the stronger test of *accuracy+efficiency generalization*.

A central design choice of the evaluation is to separate two questions that a reviewer might otherwise conflate. The first question is an *end-to-end question*: under the current frozen QA stack, does a method provide a stronger accuracy–efficiency frontier? The second is a *memory-side question*: how much of that gain is attributable to the quality of the memory interface rather than to the answer-time layer that consumes it? Our experimental design reflects this distinction explicitly.

For the first question, all headline tables use the same frozen end-to-end answer-time protocol within each environment. These tables are intended to support an end-to-end frontier claim under the current deployment setting. They are *not* intended, by themselves, to prove that every gain comes solely from the memory core. For the second question, we therefore include separate diagnostics that hold the answer-time layer fixed or otherwise narrow the comparison boundary, including answerer-fairness analysis, write-side ablations, and matched-budget / reduced-breadth controls. These analyses are used to localize which part of the gain is due to structured memory construction and evidence routing, and which part still depends on answer-time consumption.

This separation is important for the paper's claim boundary. We do *not* replace the headline with the `generic` or `gold_executor` protocols, because those are boundary diagnostics rather than replacement task definitions. Accordingly, the strongest supported claim of the paper is deliberately narrow: under the current frozen answer-time protocol, S3Mem provides a stronger *end-to-end* accuracy–efficiency frontier than more generic memory interfaces. The stronger claim about *memory quality* is supported more cautiously and rests on the combination of shared-answerer comparisons, write-side ablations, and controlled budget / retrieval-breadth analyses rather than on the headline table alone.

Appendix [C](https://arxiv.org/html/2605.28831#A3) reports the benchmark construction pipeline, including rollout provenance, processed trajectory generation, question construction, family labeling, and sample filtering\. Within each environment, all compared methods share the *same* processed trajectories and the *same* generated questions\. This matters for interpretation: the comparison is between memory interfaces under a shared benchmark slice, not between rollout sources or final\-answer backbones\.

### 4\.2 Environments and Question Families

Table 1: Evaluation environments and their assigned roles\. The external environments are intentionally used for different evaluative purposes rather than treated as interchangeable benchmark copies\.

| Environment | Role | Modality | Primary memory stress |
| --- | --- | --- | --- |
| Crafter \(n=1895\) | Internal headline | Visual survival | Spatial displacement, event interval, multi\-hop state chains, logical aggregation |
| Jericho \(n=402\) | Internal headline | Text adventure | Temporal offset, location reasoning, acquisition\-to\-outcome chains, inventory aggregation |
| ScienceWorld \(n=242\) | External efficiency generalization | Science\-grounded interaction | Step observation, location visits, inventory tracking, action counting, state\-chain extensions |
| ALFWorld \(n=329\) | External accuracy\+efficiency generalization | Embodied household interaction | Location chains, gain\-item chains, second/last\-occurrence anchors, multi\-hop state chains |

The four environments therefore play complementary roles rather than serving as interchangeable benchmark copies\. Crafter stresses visually grounded spatial and event memory, Jericho stresses text\-event and inventory chains, ScienceWorld asks whether aggressive evidence compression can remain accurate in an out\-of\-family setting, and ALFWorld is the external environment where a genuine out\-of\-family accuracy gap is expected to appear if structured episodic evidence transfer is real\. Appendix [C](https://arxiv.org/html/2605.28831#A3) gives the benchmark\-specific question families and data\-construction details for all four environments\.

### 4\.3 Compared Methods

Our core comparison uses four methods that instantiate four different trajectory\-to\-answer interfaces for the same task:

- No\-Memory: the model receives only the question and the current observation, with no externalized episodic memory\.
- Vanilla RAG: interaction history is flattened into plain\-text chunks and retrieved via standard similarity\-based RAG\.
- Graph\-NoReader: a graph\-based memory baseline that organizes trajectory information into graph nodes and edges and queries that structure without a reader specialized for the memory representation\.
- S3Mem: our structured scene\-event episodic memory framework, which writes trajectories as structured memory units, retrieves anchor\-sensitive candidates, and exposes a budgeted evidence interface\.

This comparison isolates four qualitatively different memory interfaces: no externalized episodic memory, plain\-text chunk memory, graph\-structured memory, and our structured scene\-event evidence harness\. We retain the name Graph\-NoReader because the baseline truly uses graph\-structured storage and graph\-level retrieval, while *not* introducing a dedicated reader specialized for that memory representation\. Its role is therefore not to weaken graph memory, but to provide a strong graph\-structured control under the same frozen answer\-time protocol\.

To reduce the gap between older interface\-level controls and the recent agent\-memory literature, we additionally evaluate three runnable recent neighbors: an A\-MEM\-inspired note\-link baseline \[[25](https://arxiv.org/html/2605.28831#bib.bib15)\], a MemoryOS\-adapted hierarchical\-memory baseline \[[12](https://arxiv.org/html/2605.28831#bib.bib16)\], and a LightMem\-adapted efficiency\-oriented memory baseline \[[4](https://arxiv.org/html/2605.28831#bib.bib18)\]\. These adapted baselines preserve, respectively, note\-link memory organization, hierarchical memory management, and efficiency\-oriented memory construction, while excluding S3Mem\-specific anchor\-aware retrieval and packing logic\. We do *not* treat every recent memory paper as a drop\-in runnable baseline: some recent works are primarily benchmarks, some target conversational persistence, and others focus on general agent infrastructure rather than trajectory\-grounded episodic QA\.

A further fairness distinction is necessary for the reviewer\-facing interpretation\. The headline comparison is an *end\-to\-end frozen\-configuration comparison*, not a claim that every method uses identical retrieval breadth or identical internal hyperparameters\. In particular, the strongest S3Mem setting on the internal headline environments uses a larger retrieval breadth than some generic baselines \(for example, top\_k = 32 on Crafter and top\_k = 24 on Jericho, whereas several baseline comparisons use top\_k = 16; Appendix [E\.2](https://arxiv.org/html/2605.28831#A5.SS2)\)\. We therefore do *not* present the headline table as a strictly parameter\-matched proof that memory quality alone explains the entire gain\.

Instead, the paper uses a layered comparison strategy\. The headline tables answer the practical question of which method provides the strongest end\-to\-end frontier under each method’s best frozen validated setting\. Claims specifically about the contribution of the memory interface are then supported by additional controls: \(i\) answerer\-fixed fairness analysis, which reduces answer\-time confounding; \(ii\) write\-side ablations, which isolate the contribution of structured memory construction; and \(iii\) matched\-budget or reduced\-breadth controls, which test whether the gain survives once generic baselines are given comparable answer\-time budgets or retrieval constraints\. This is why the recent\-neighbor comparison and the control suite should be read together rather than as one single leaderboard\.

Figure [2](https://arxiv.org/html/2605.28831#S4.F2) summarizes the comparison design at a reviewer\-facing level\. The main point is that the paper no longer contrasts S3Mem only against an older control chain: it also includes recent runnable neighbors, direct alternative\-explanation controls, and a focused trajectory\-source robustness study\.

![Refer to caption](https://arxiv.org/html/2605.28831v1/x2.png)

Figure 2: Reviewer\-facing comparison map\. The paper uses a layered comparison design: a core interface\-control chain \(No\-Memory, Vanilla RAG, Graph\-NoReader, S3Mem\), three runnable recent neighbors \(A\-MEM\-inspired, MemoryOS\-adapted, and LightMem\-adapted\), and targeted diagnostics for alternative explanations, including answerer\-fairness analysis, context\-length controls, RTK\-style generic compression, write\-side ablations, and rollout robustness\.

The key fairness rule that does hold throughout is that all adapted baselines are evaluated on shared trajectories, shared question sets, and the same frozen answer\-time protocol within each comparison block\. Appendix [E](https://arxiv.org/html/2605.28831#A5) reports the adaptation boundary for each recent baseline, including what was preserved from the original design family, what was explicitly disallowed, and which hyperparameters were tuned\. Appendix [E\.2](https://arxiv.org/html/2605.28831#A5.SS2) further discloses the tuning regime so that the recent\-neighbor comparison is transparent about what is, and is not, strictly parameter\-matched\.

### 4\.4 Metrics and Statistical Reporting

We report exact match \(EM\) as the primary accuracy metric and average evidence token cost as the primary efficiency metric\. The token metric is measured on the evidence interface actually exposed to the answer\-time layer\. This metric is central to the paper’s framing: if a method improves only by re\-inserting much larger amounts of raw history, then it is not a strong trajectory\-to\-answer interface for long\-horizon QA\. This is also why Full\-History is treated as a control rather than as a competing memory method: it is informative about what can be gained by exposing more history, but it does not by itself solve the evidence\-selection problem\.
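
As an illustration of the two headline metrics, the following minimal Python sketch computes exact match and average evidence token cost over a handful of QA records\. The `QARecord` type, the `normalize` rule \(lowercasing and whitespace collapsing\), and the toy records are assumptions made for this example; the answer normalization and tokenizer actually used are not specified in this section\.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    prediction: str
    gold: str
    evidence_tokens: int  # tokens in the evidence exposed to the answer-time layer


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_match(records: list[QARecord]) -> float:
    # Fraction of records whose normalized prediction equals the normalized gold answer.
    hits = sum(normalize(r.prediction) == normalize(r.gold) for r in records)
    return hits / len(records)


def avg_evidence_tokens(records: list[QARecord]) -> float:
    # Mean evidence size per question, measured on what the answerer actually sees.
    return sum(r.evidence_tokens for r in records) / len(records)


# Toy records for illustration only; these are not data from the paper.
records = [
    QARecord("Iron Pickaxe", "iron pickaxe", 120),
    QARecord("kitchen", "cellar", 180),
    QARecord("3", "3", 150),
    QARecord("north  of the table", "north of the table", 150),
]
em = exact_match(records)
tokens = avg_evidence_tokens(records)
print(f"EM={em:.2f}, avg evidence tokens={tokens:.1f}")
```

Here three of the four toy predictions match after normalization, so EM is 0\.75, and the mean evidence cost is 150 tokens\.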

The metric interpretation also follows the claim boundary introduced above\. Headline EM improvements support an *end\-to\-end* claim under the frozen protocol\. They do not, by themselves, prove that the gain comes entirely from better memory construction\. For that reason, we treat token cost, write\-side ablations, answerer\-fixed comparisons, and matched\-budget controls as part of the evidence for the memory\-side interpretation, rather than as auxiliary details\.

For the internal headline environments, we additionally report bootstrap confidence intervals and paired bootstrap differences\. These statistics serve a deliberately narrow purpose: they test whether the S3Mem \> Graph\-NoReader gap is stable, rather than merely a favorable point estimate\. In appendix bootstrap tables, the EM column reports the bootstrap\-estimated center, so it can differ slightly from the frozen headline score in the main table\. Appendix [A](https://arxiv.org/html/2605.28831#A1) also reports additional statistical support for recent\-neighbor comparisons and robustness sweeps\.
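
A paired bootstrap difference of the kind described above can be sketched as follows\. This is a generic percentile bootstrap over per\-question correctness, not necessarily the authors' exact procedure: the function name `paired_bootstrap`, the number of resamples, the seed, and the toy correctness vectors are all assumptions\.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for EM(A) - EM(B), resampling the same questions for both."""
    assert len(correct_a) == len(correct_b)
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Draw question indices with replacement; both methods share the draw.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(a - b for a, b in zip(correct_a, correct_b)) / n
    return point, lo, hi


# Toy per-question correctness (1 = exact match); not data from the paper.
method_a = [1] * 70 + [0] * 30
method_b = [1] * 55 + [0] * 45
point, lo, hi = paired_bootstrap(method_a, method_b)
print(f"EM difference: {point:.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Because both methods are resampled on the same question indices, the interval reflects the paired difference; an interval that excludes zero would indicate that the gap is stable under resampling rather than a favorable point estimate\.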

### 4\.5 Runtime, Answer\-Time, and Rollout Disclosure

All headline results use the same frozen answer\-time protocol within each environment, so the main paper compares *trajectory interfaces* rather than sampled foundation\-model backbones\. The shared current answerer is an in\-repo heuristic answer layer rather than an external closed model; the generic answerer is a degraded control mode of the same code path; and gold\_executor uses program conversion and execution rather than open\-ended answer generation\. We treat generic and gold\_executor as boundary diagnostics, not as replacement task definitions\.

Table [2](https://arxiv.org/html/2605.28831#S4.T2) summarizes the answer\-time protocols\. This distinction is important because the current answer\-time layer is not a neutral black box\. It consumes the evidence exposed by the upstream memory method and therefore remains part of the end\-to\-end system being evaluated\. That is precisely why the paper separates *headline end\-to\-end results* from *memory\-side diagnostics* rather than claiming that the main table alone proves full protocol\-independent memory superiority\. Appendix [D](https://arxiv.org/html/2605.28831#A4) provides the full runtime/model card and answer\-time disclosure\.

Table 2: Runtime and answer\-time disclosure summary\. The main comparison is between trajectory interfaces under a frozen answer\-time protocol, not between different sampled foundation\-model backbones\.

| Protocol | Main implementation | Role in the paper |
| --- | --- | --- |
| current | In\-repo shared current answerer | Frozen headline answer\-time protocol |
| generic | Generic/degraded mode of the same answerer path | Fairness and protocol\-boundary diagnostic |
| gold\_executor | Program converter \+ executor | Boundary diagnostic that bypasses generative answering |

Trajectory provenance also differs by environment and should not be conflated with the method comparison itself\. Crafter uses real rollout artifacts from a Qwen3\-VL\-8B\-Instruct family actor, Jericho uses real rollout artifacts from a GPT\-5\.4 family actor with a fixed history setting, ScienceWorld uses a lightweight valid\-random collector, and ALFWorld uses a handcoded expert collector\. The Crafter robustness study additionally uses four matched\-condition rollout actor families\. Within each environment, however, all compared methods share the *same* processed trajectories and the *same* generated questions\. We therefore separate answer\-time protocol reproducibility from rollout provenance reproducibility, and treat the main fairness question as one of memory representation and evidence interface rather than rollout actor quality\.

## 5 Analysis

### 5\.1 Answerer\-Fairness Analysis

This subsection is a *boundary diagnostic* rather than a replacement headline\. A natural reviewer concern is that the main gain could reflect the answer\-time interface more than the underlying memory quality\. We therefore replace the default answer\-time protocol with two alternatives: a generic answerer, which consumes the exposed evidence with minimal task\-specific structure, and a gold\_executor, which executes converted programs against the trajectory\. Within each protocol, all compared methods share the same answer\-time layer, so the comparison helps separate memory\-side evidence quality from answer\-time consumption\.

The resulting pattern is deliberately interpreted as a claim\-boundary result rather than as a second leaderboard\. Under generic, all methods collapse into roughly the same low range, indicating that a plain generic consumer is not yet a mature replacement protocol for this task\. Under gold\_executor, the ordering does not fully reproduce the headline ranking, showing that the current end\-to\-end gain is not simply the same memory ordering reappearing under every downstream consumer\. These diagnostics therefore do *not* overturn the main result, but they do narrow its interpretation: the current paper cannot claim full protocol\-independent superiority\.

At the same time, the fairness analysis does not reduce the result to a pure answerer trick\. The structured\-memory advantage remains real under the frozen headline protocol, and the write\-side analyses below show that changing the memory representation alone materially changes performance\. The fairest reading is therefore a bounded one: the current evidence supports a genuine memory\-side gain, but the strongest remaining source of non\-generality lies in the answer\-time layer rather than in the write representation alone\. The detailed answerer\-fairness table is reported separately and should be read as boundary evidence rather than as a replacement headline\.

### 5\.2 Write\-Side Ablation

This subsection tests whether the gain comes from the write representation itself rather than only from retrieval or answer\-time logic\. Table [3](https://arxiv.org/html/2605.28831#S5.T3) shows write\-side ablations across all four environments\. Within each controlled ablation run, all variants share the same downstream retrieval\-and\-packing backend; only the write mode changes\. The table therefore isolates the contribution of the write representation itself rather than conflating it with answer\-time changes\.

Table 3: Write\-side ablation across four environments\. Within each controlled ablation run, all variants use the same retrieval and evidence interface; only the write mode differs\. For the external environments, the “full scene\-event” row is the ablation reference run and need not numerically match the frozen headline score in Table [4](https://arxiv.org/html/2605.28831#S5.T4)\.

| Write Mode | Crafter | Jericho | ScienceWorld | ALFWorld |
| --- | --- | --- | --- | --- |
| Full scene\-event | 0\.7251 | 0\.9303 | 0\.9959 | 0\.9234 |
| event\_only | 0\.4760 | 0\.5398 | 0\.4115 | 0\.4061 |
| object\_only | 0\.6047 | 0\.5075 | 0\.4948 | 0\.4061 |
| plain\_chunk | 0\.5277 | 0\.6791 | 0\.6094 | 0\.5670 |

Full scene\-event writing consistently outperforms event\_only, object\_only, and plain\_chunk variants\. This is important because it shows that the gain is not reducible to event logging, object logging, or slightly better chunk retrieval\. Even plain\_chunk, which retains more raw text than the single\-field variants, remains well below full scene\-event writing\. The main methodological implication is therefore straightforward: what matters is not merely retaining more textual content, but retaining the right joint structure over scene, event, state, and temporal context\.

This ablation is also central to the paper’s claim boundary\. The fairness analysis shows that the final system is still coupled to the answer\-time layer, but Table [3](https://arxiv.org/html/2605.28831#S5.T3) shows that the memory\-side representation itself is genuinely doing work\. In other words, the paper does *not* establish that structured memory alone is sufficient under any arbitrary answerer, but it does show that structured write is a real source of the observed gain under the current protocol\.
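
The margin in Table 3 can be made concrete with a short calculation\. The sketch below only re\-reads the table values and introduces no new results: it finds the strongest ablated write mode in each environment and its gap to full scene\-event writing\. Variable names are illustrative\.

```python
# Values copied from Table 3 (write-side ablation).
table3 = {
    "Full scene-event": {"Crafter": 0.7251, "Jericho": 0.9303, "ScienceWorld": 0.9959, "ALFWorld": 0.9234},
    "event_only": {"Crafter": 0.4760, "Jericho": 0.5398, "ScienceWorld": 0.4115, "ALFWorld": 0.4061},
    "object_only": {"Crafter": 0.6047, "Jericho": 0.5075, "ScienceWorld": 0.4948, "ALFWorld": 0.4061},
    "plain_chunk": {"Crafter": 0.5277, "Jericho": 0.6791, "ScienceWorld": 0.6094, "ALFWorld": 0.5670},
}
full = table3["Full scene-event"]

gaps = {}
for env in full:
    # Strongest ablated variant in this environment.
    best_mode, best_score = max(
        ((mode, scores[env]) for mode, scores in table3.items() if mode != "Full scene-event"),
        key=lambda pair: pair[1],
    )
    gaps[env] = (best_mode, round(full[env] - best_score, 4))

for env, (mode, gap) in gaps.items():
    print(f"{env}: best ablation = {mode}, gap to full = {gap:.4f}")
```

With the table values, the strongest ablation is object\_only on Crafter \(gap 0\.1204\) and plain\_chunk elsewhere \(gaps 0\.2512, 0\.3865, and 0\.3564 on Jericho, ScienceWorld, and ALFWorld\)\.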

### 5\.3 Budget\-Matched Generic Compression

This subsection addresses a different alternative explanation: perhaps S3Mem’s efficiency comes mostly from aggressive generic compression rather than from a better memory interface\. We therefore run a controlled RTK\-style compression validation in which the answerer, tokenizer, split, and token accounting are fixed, and the generic compressor is restricted to operations such as filtering, grouping, truncation, and deduplication\. It is explicitly not allowed to use scene\-event fields, anchor recovery, occurrence reasoning, target\-step reasoning, or environment\-specific heuristics\.
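
To make the restriction concrete, here is a minimal sketch of a generic compressor limited to filtering, deduplication, and truncation under a token budget\. It is a hypothetical stand\-in, not the RTK\-style implementation used in the paper: the function `generic_compress`, the whitespace\-based token count, and the toy history are assumptions\.

```python
def generic_compress(chunks: list[str], budget_tokens: int) -> list[str]:
    """Structure-blind compression: drop empties and duplicates, then fill the budget in order.

    Tokens are approximated by whitespace splitting (an assumption for this sketch).
    """
    seen = set()
    kept = []
    used = 0
    for chunk in chunks:
        key = " ".join(chunk.lower().split())
        if not key or key in seen:
            continue  # filtering of empty chunks and deduplication
        seen.add(key)
        remaining = budget_tokens - used
        if remaining <= 0:
            break  # budget exhausted; later chunks are never considered
        words = chunk.split()
        if len(words) > remaining:
            words = words[:remaining]  # truncation to fit the budget
        kept.append(" ".join(words))
        used += len(words)
    return kept


# Toy history for illustration only.
history = [
    "step 3: picked up the key",
    "step 3: picked up the key",
    "step 4: walked north to the hall",
    "step 9: unlocked the door with the key",
]
compressed = generic_compress(history, budget_tokens=10)
print(compressed)
```

On this toy history the budget is exhausted before the step 9 chunk is reached, so the unlocking event is dropped even though the output stays within budget\. That is the kind of anchor loss a structure\-blind compressor can incur\.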

The matched\-budget view rules out the simplest “generic shrinking” explanation\. On Crafter, S3Mem remains stable in the low\-token regime where the RTK\-style baselines lose substantial accuracy\. On Jericho, the separation is much stronger: at nearly identical token cost, Graph\-NoReader \+ RTK\-style collapses while S3Mem remains near its headline accuracy\. This is not a minor implementation detail; it is evidence that low\-budget answering depends on preserving the decisive anchor and its local supporting chain rather than on generic shrinking alone\.

The correct interpretation should also remain bounded\. This control does *not* show that compression is unimportant\. On the contrary, compression is essential\. What it shows is that *generic* compression is insufficient: a low\-token answer\-time interface is useful only if it preserves the decisive anchor, the minimal supporting neighborhood, and the chain\-completing state facts\. Appendix [H](https://arxiv.org/html/2605.28831#A8) provides the full controlled tables, family breakdowns, and additional diagnostic discussion\.

### 5\.4 Four\-Actor Rollout Robustness

This subsection tests whether the main story is overly tied to one frozen Crafter rollout source\. We therefore run a matched\-condition Crafter robustness study in which the environment, seeds, step budget, QA construction protocol, and final shared answerer are held fixed while only the rollout actor family changes\. The four actor families are Qwen3\-VL\-235B\-A22B\-Instruct, GPT\-5\.4, GLM\-4\.6V, and Doubao\-1\.5\-Vision\-Pro\. Detailed per\-actor tables are reported in Appendix [I](https://arxiv.org/html/2605.28831#A9)\.

Table 4: Crafter four\-actor rollout robustness, aggregated across the four rollout actor families\. Avg\. Tokens is averaged within each actor and then across actors\.

| Method | Avg\. EM | Avg\. Tokens | Min EM | Max EM |
| --- | --- | --- | --- | --- |
| No\-Memory | 0\.3225 | 462\.73 | 0\.3105 | 0\.3278 |
| Vanilla RAG | 0\.4643 | 788\.46 | 0\.4258 | 0\.4957 |
| A\-MEM\-inspired | 0\.5270 | 142\.53 | 0\.4643 | 0\.5840 |
| MemoryOS\-adapted | 0\.5123 | 143\.92 | 0\.4835 | 0\.5372 |
| LightMem\-adapted | 0\.5637 | 111\.94 | 0\.5385 | 0\.6198 |
| Graph\-NoReader | 0\.5742 | 1920\.48 | 0\.5357 | 0\.6068 |
| S3Mem | 0\.5930 | 148\.64 | 0\.5440 | 0\.6182 |

This robustness study supports a deliberately moderate claim\. The local ranking is *not* perfectly rigid: LightMem slightly exceeds S3Mem on some actor\-induced trajectory families\. However, S3Mem remains in the top\-performing group under all four actor families, achieves the best average EM across actors, and retains a low\-token regime far below Graph\-NoReader\. Actor\-induced trajectory shift can therefore change the fine\-grained ordering among the strongest low\-token methods without overturning the broader conclusion that structured episodic evidence yields a substantially better accuracy–efficiency tradeoff than Vanilla RAG and high\-token graph retrieval\.

The right reading of this result is therefore narrower and more defensible than “the exact top method never changes\.” The broader frontier story is substantially more stable than the exact local ranking\. In particular, actor\-induced trajectory shift can reorder the strongest low\-token methods, but it does not erase the main contrast between structured/efficient episodic memory and much higher\-cost alternatives\.
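
One way to read the accuracy–efficiency frontier in Table 4 is as a Pareto front over \(Avg\. EM, Avg\. Tokens\)\. Treating the frontier as a Pareto front is our interpretation rather than a definition given in the paper; the sketch below only re\-reads the table values\.

```python
# (Avg. EM, Avg. Tokens) copied from Table 4.
table4 = {
    "No-Memory": (0.3225, 462.73),
    "Vanilla RAG": (0.4643, 788.46),
    "A-MEM-inspired": (0.5270, 142.53),
    "MemoryOS-adapted": (0.5123, 143.92),
    "LightMem-adapted": (0.5637, 111.94),
    "Graph-NoReader": (0.5742, 1920.48),
    "S3Mem": (0.5930, 148.64),
}


def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on at least one."""
    em_a, tok_a = a
    em_b, tok_b = b
    return em_a >= em_b and tok_a <= tok_b and (em_a > em_b or tok_a < tok_b)


# A method is on the front if no other method dominates it.
frontier = sorted(
    name
    for name, point in table4.items()
    if not any(dominates(other, point) for other in table4.values())
)
print(frontier)
```

Under this reading only LightMem\-adapted and S3Mem are non\-dominated, which is consistent with the observation that the strongest low\-token methods can trade places, while Graph\-NoReader is dominated because of its much higher token cost\.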

### 5\.5 Qualitative Failure Analysis

To go beyond aggregate metrics, we analyze both concrete examples and the broader failure pattern behind them\. Across the main internal comparisons, S3Mem’s advantage is most often associated with preserving the decisive anchor step together with the minimal neighboring evidence needed to close the chain\. When Vanilla RAG fails, it often retrieves semantically related but chain\-incomplete snippets\. When Graph\-NoReader fails, it more often retrieves globally relevant structure but loses the exact local ordering or state\-transition support needed for the final answer\. The dominant residual failure mode is therefore not system\-wide collapse, but *selective evidence miss*: the correct evidence is nearby or partially present, but the decisive anchor, offset\-supporting neighborhood, or transition\-completing state fact is not retained in the final interface\.

Figure [3](https://arxiv.org/html/2605.28831#S5.F3) illustrates representative qualitative cases and the corresponding evidence\-selection pattern\. The visual takeaway is consistent with the method story: S3Mem’s main advantage is not generic verbosity, but a higher probability of preserving the anchor step together with the smallest neighboring evidence set needed to close the chain\. This is also why generic summary\-style controls and RTK\-style shrinking fail on the hardest families: they can retain locally relevant content while still dropping the exact evidence link that closes the reasoning chain\. Appendix [K](https://arxiv.org/html/2605.28831#A11) provides the corresponding detailed cases and supporting discussion\.

![Refer to caption](https://arxiv.org/html/2605.28831v1/result.png)

Figure 3: Representative qualitative cases\. Across failure modes, Vanilla RAG often retrieves semantically related but chain\-incomplete snippets, while Graph\-NoReader more often preserves globally relevant structure but misses the exact local ordering or transition\-supporting fact\. S3Mem succeeds more often by preserving the decisive anchor step together with the minimal neighboring evidence needed to close the chain\.

### 5\.6 Bonus Out\-of\-Family Stress Test: ATM\-Bench

As a final bonus stress test, we adapt our evaluation stack to ATM\-Bench \[[17](https://arxiv.org/html/2605.28831#bib.bib27)\], a personalized long\-horizon memory QA benchmark built from *emails*, *images*, and *videos* rather than from interactive environment rollouts\. We intentionally do *not* treat ATM\-Bench as a fifth headline environment\. Its task distribution is materially different from our main setting: there is no acting policy, no environment transition loop, and no native step\-by\-step interaction trace\. Instead, ATM\-Bench is closer to a heterogeneous personal archive in which a question must be answered from a long stream of timestamped personal memory items\. We therefore use it as a *bonus out\-of\-family transfer test* rather than as core evidence for the paper’s main claim\. Appendix [C\.6](https://arxiv.org/html/2605.28831#A3.SS6) and Appendix [E\.6](https://arxiv.org/html/2605.28831#A5.SS6) describe the adaptation in detail\.

To make ATM\-Bench compatible with our memory\-method comparison, we first convert the official QA plus evidence identifiers into a pseudo\-trajectory: emails are treated as timestamped text items, images and videos are first itemized through the benchmark’s official batch\_results route, and all available evidence items are then serialized into one ordered personal\-memory stream\. This adaptation is deliberately conservative\. It preserves the benchmark’s answer keys and evidence ids, while allowing No\-Memory, Vanilla RAG, Graph\-NoReader, and S3Mem to consume the same processed personal\-memory stream under a shared answer\-time protocol\.
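
The serialization step can be sketched as follows\. This is a hypothetical illustration of ordering heterogeneous items into one stream: the `MemoryItem` fields, the tie\-break on item id, and the output dictionary layout are assumptions, and the format of the benchmark's batch\_results output is not modeled\.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MemoryItem:
    item_id: str        # evidence id, preserved unchanged
    modality: str       # "email", "image", or "video"
    timestamp: datetime
    text: str           # email text, or an itemized description of an image/video


def to_pseudo_trajectory(items: list[MemoryItem]) -> list[dict]:
    # Order by time (ties broken by id) and assign consecutive step indices.
    ordered = sorted(items, key=lambda it: (it.timestamp, it.item_id))
    return [
        {
            "step": step,
            "id": it.item_id,
            "modality": it.modality,
            "time": it.timestamp.isoformat(),
            "text": it.text,
        }
        for step, it in enumerate(ordered)
    ]


# Toy archive for illustration only.
items = [
    MemoryItem("vid_7", "video", datetime(2024, 3, 2, 18, 30), "clip of a birthday dinner"),
    MemoryItem("mail_1", "email", datetime(2024, 3, 1, 9, 0), "Dinner booked for Saturday"),
    MemoryItem("img_4", "image", datetime(2024, 3, 2, 18, 5), "photo of a restaurant menu"),
]
stream = to_pseudo_trajectory(items)
for entry in stream:
    print(entry["step"], entry["id"], entry["time"])
```

Keeping the evidence id on every step means that any compared memory method can be scored against the same answer keys and evidence ids after the conversion\.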

Table [5](https://arxiv.org/html/2605.28831#S5.T5) reports the largest fully frozen current\-covered slice with completed core\-method results at write time, ATM\-Bench subset431\. The most important pattern is not raw leaderboard dominance but transfer behavior under severe distribution shift\. Although ATM\-Bench is much more document\-like and email\-dominated than our interactive benchmarks, S3Mem remains competitive while using dramatically fewer evidence tokens than the strongest plain\-text and graph baselines\. This suggests that the same structured evidence harness remains useful when the memory source changes from interaction trajectories to a timestamped personal archive, even though the absolute ranking is less favorable to S3Mem than on Crafter, Jericho, or ALFWorld\.

Table 5: Bonus out\-of\-family stress test on ATM\-Bench subset431\. This benchmark is used as a transfer/efficiency stress test rather than as a fifth headline environment, because it is archive\-grounded rather than interaction\-grounded\.

| Method | EM | Avg\. Evidence Tokens |
| --- | --- | --- |
| Vanilla RAG | 0\.6821 | 2962\.01 |
| Graph\-NoReader | 0\.6961 | 3672\.11 |
| S3Mem | 0\.6613 | 751\.09 |

The right conclusion from this table is therefore narrow\. ATM\-Bench should not be read as evidence that S3Mem universally dominates every long\-memory benchmark\. Instead, it shows that even on a substantially different personalized\-memory QA distribution, S3Mem does not collapse: on subset431 it trails the strongest baseline by about 3\.5 EM points \(0\.6613 vs\. 0\.6961\) while reducing evidence tokens by roughly 75–80%\. We view this as a useful generalization signal for efficiency\-sensitive transfer, not as a replacement for the paper’s main interaction\-grounded headline\.
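
The token\-reduction range can be checked directly from Table 5\. The sketch below assumes the table values as read here \(EM to four decimals, followed by average evidence tokens\) and computes the reduction of S3Mem relative to each baseline; variable names are illustrative\.

```python
# (EM, Avg. Evidence Tokens) as read from Table 5.
table5 = {
    "Vanilla RAG": (0.6821, 2962.01),
    "Graph-NoReader": (0.6961, 3672.11),
    "S3Mem": (0.6613, 751.09),
}
s3_em, s3_tokens = table5["S3Mem"]

reduction_pct = {}
em_gap = {}
for name, (em, tokens) in table5.items():
    if name == "S3Mem":
        continue
    # Relative token saving of S3Mem against this baseline, in percent.
    reduction_pct[name] = 100 * (1 - s3_tokens / tokens)
    # How far S3Mem trails this baseline in EM.
    em_gap[name] = em - s3_em

for name in reduction_pct:
    print(f"vs {name}: tokens -{reduction_pct[name]:.1f}%, EM gap {em_gap[name]:.4f}")
```

This gives reductions of about 74\.6% against Vanilla RAG and 79\.5% against Graph\-NoReader, with EM gaps of 0\.0208 and 0\.0348 respectively\.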

## 6 Limitations

Our study has several limitations\. First, the answer\-time layer is not yet fully general, as shown by the fairness analysis, so the paper should not be read as establishing full protocol independence\. Second, although the external benchmark coverage is stronger than a single\-family evaluation, it remains modest in scale and focuses on curated episodic\-QA slices rather than complete benchmark families\. Third, the main evaluation targets long\-horizon memory QA rather than end\-to\-end acting policies\. Fourth, the headline comparisons are frozen end\-to\-end comparisons rather than strictly parameter\-matched proofs of memory quality alone; some settings, such as retrieval breadth, are disclosed and interpreted through follow\-up controls rather than hidden inside a single leaderboard claim\. Fifth, parser/executor analysis should be interpreted only as boundary support rather than as a second main method family\.

Taken together, these limitations mean that the current contribution is best understood as a *structured evidence harness* result with bounded empirical scope: it shows that structured write and anchor\-sensitive evidence routing improve the trajectory\-to\-answer interface under the current frozen protocol, but it does not establish universal end\-to\-end agent superiority under arbitrary downstream consumers\.

## 7 Conclusion

We revisit a simple but under\-addressed question: *how should an agent remember its own interaction history for long\-horizon QA?* Our results suggest that the key is not exposing more history, but exposing the right structured evidence from history\. By writing trajectories into structured scene\-event memory, retrieving anchor\-sensitive evidence, and exposing a compact token\-budget\-aware evidence interface, S3Mem improves both answer quality and evidence efficiency under the current frozen answer\-time protocol across internal and external environments\.

The enlarged comparison suite sharpens the interpretation of this result\. Recent baselines such as A\-MEM, MemoryOS, and especially LightMem are meaningful neighbors, but they do not match S3Mem’s overall frontier\. Full\-History can push accuracy higher only at very high token cost, while generic summaries and RTK\-style compression save tokens but fail to preserve the hardest evidence chains\. The fairness diagnostics, in turn, show that the main unresolved boundary still lies in the answer\-time layer rather than in the structured memory core\. The rollout\-robustness analysis further suggests that fine\-grained top\-method rankings can shift across trajectory sources, but the broader frontier advantage of structured episodic evidence remains substantially more stable\.

These findings support a broader but still bounded view: when the knowledge source is an agent’s own trajectory, long\-term memory should not be treated as generic text retrieval, but as *structured episodic evidence construction*\. Under the current frozen answer\-time protocol, this design yields a stronger end\-to\-end accuracy–efficiency frontier, and the supporting diagnostics indicate that a substantial part of that gain comes from the quality of the structured evidence harness rather than from generic memory storage alone\.

## References

- \[1\] \(2024\) AriGraph: learning knowledge graph world models with episodic memory for LLM agents\. CoRR abs/2407\.04363\.
- \[2\] W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\) Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\. Transactions on Machine Learning Research\.
- \[3\] D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\) From local to global: a graph RAG approach to query\-focused summarization\. arXiv preprint arXiv:2404\.16130\.
- \[4\] J\. Fang, X\. Deng, H\. Xu, Z\. Jiang, Y\. Tang, Z\. Xu, S\. Deng, Y\. Yao, M\. Wang, S\. Qiao, H\. Chen, and N\. Zhang \(2025\) LightMem: lightweight and efficient memory\-augmented generation\. arXiv preprint arXiv:2510\.18866\.
- \[5\] L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\) PAL: program\-aided language models\. In International Conference on Machine Learning\.
- \[6\] K\. Guu, K\. Lee, Z\. Tung, P\. Pasupat, and M\. Chang \(2020\) REALM: retrieval\-augmented language model pre\-training\. In International Conference on Machine Learning\.
- \[7\] D\. Hafner \(2022\) Benchmarking the spectrum of agent capabilities\. In International Conference on Learning Representations\.
- \[8\] M\. Hausknecht, P\. Ammanabrolu, M\. Côté, and X\. Yuan \(2020\) Interactive fiction games: a colossal adventure\. In AAAI Conference on Artificial Intelligence\.
- \[9\] X\. He, Y\. Tian, Y\. Sun, N\. V\. Chawla, T\. Laurent, Y\. LeCun, X\. Bresson, and B\. Hooi \(2024\) G\-Retriever: retrieval\-augmented generation for textual graph understanding and question answering\. In Advances in Neural Information Processing Systems\.
- \[10\] Z\. He, Y\. Wang, C\. Zhi, Y\. Hu, T\. Chen, L\. Yin, Z\. Chen, T\. A\. Wu, S\. Ouyang, Z\. Wang, J\. Pei, J\. McAuley, Y\. Choi, and A\. Pentland \(2026\) MemoryArena: benchmarking agent memory in interdependent multi\-session agentic tasks\. arXiv preprint arXiv:2602\.16313\.
- \[11\] G\. Izacard, P\. Lewis, M\. Lomeli, L\. Hosseini, F\. Petroni, T\. Schick, J\. Dwivedi\-Yu, A\. Joulin, S\. Riedel, and E\. Grave \(2023\) Atlas: few\-shot learning with retrieval augmented language models\. Journal of Machine Learning Research 24 \(251\), pp\. 1–43\.
- \[12\] J\. Kang, M\. Ji, Z\. Zhao, and T\. Bai \(2025\) Memory OS of AI agent\. arXiv preprint arXiv:2506\.06326\.
- \[13\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In Advances in Neural Information Processing Systems, Vol\. 33\.
- \[14\] X\. Li, Z\. Zhu, S\. Liu, Y\. Ma, Y\. Zang, Y\. Cao, and A\. Sun \(2026\) EMemBench: interactive benchmarking of episodic memory for VLM agents\. arXiv preprint arXiv:2601\.16690\.
- \[15\] Z\. Lu, D\. Li, Y\. Shi, B\. Wang, L\. Wang, and B\. Hu \(2026\) Structured episodic event memory\. arXiv preprint arXiv:2601\.06411\.
- \[16\] A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of LLM agents\. arXiv preprint arXiv:2402\.17753\.
- \[17\] J\. Mei, J\. Chen, G\. Yang, X\. Hou, M\. Li, and B\. Byrne \(2026\) According to me: long\-term personalized referential memory QA\. arXiv preprint arXiv:2603\.01990\.
- \[18\] C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards LLMs as operating systems\. arXiv preprint arXiv:2310\.08560\.
- \[19\] J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In ACM Symposium on User Interface Software and Technology\.
- \[20\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.28831#S2.SS0.SSS0.Px2.p1.1)\.
- \[21\]M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht\(2021\)ALFWorld: aligning text and embodied environments for interactive learning\.InInternational Conference on Learning Representations,Cited by:[§C\.5](https://arxiv.org/html/2605.28831#A3.SS5.p1.1),[§1](https://arxiv.org/html/2605.28831#S1.p1.1),[§1](https://arxiv.org/html/2605.28831#S1.p6.1),[§4\.1](https://arxiv.org/html/2605.28831#S4.SS1.p1.1)\.
- \[22\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\(2023\)Voyager: an open\-ended embodied agent with large language models\.arXiv preprint arXiv:2305\.16291\.Cited by:[§2](https://arxiv.org/html/2605.28831#S2.SS0.SSS0.Px2.p1.1)\.
- \[23\]R\. Wang, P\. Jansen, M\. Côté, and P\. Ammanabrolu\(2022\)ScienceWorld: is your agent smarter than a 5th grader?\.InConference on Empirical Methods in Natural Language Processing,Cited by:[§C\.4](https://arxiv.org/html/2605.28831#A3.SS4.p1.1),[§1](https://arxiv.org/html/2605.28831#S1.p6.1),[§4\.1](https://arxiv.org/html/2605.28831#S4.SS1.p1.1)\.
- \[24\]D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu\(2024\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.arXiv preprint arXiv:2410\.10813\.Cited by:[§2](https://arxiv.org/html/2605.28831#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\]W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang\(2025\)A\-MEM: agentic memory for LLM agents\.arXiv preprint arXiv:2502\.12110\.Cited by:[§1](https://arxiv.org/html/2605.28831#S1.p7.1),[§2](https://arxiv.org/html/2605.28831#S2.SS0.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2605.28831#S4.SS3.p3.1)\.
- \[26\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.28831#S1.p1.1)\.
- \[27\]Y\. Zhao, B\. Yuan, J\. Huang, H\. Yuan, Z\. Yu, H\. Xu, L\. Hu, A\. Shankarampeta, Z\. Huang, W\. Ni, Y\. Tian, and J\. Zhao\(2026\)AMA\-Bench: evaluating long\-horizon memory for agentic applications\.arXiv preprint arXiv:2602\.22769\.Cited by:[§1](https://arxiv.org/html/2605.28831#S1.p7.1),[§2](https://arxiv.org/html/2605.28831#S2.SS0.SSS0.Px3.p1.1)\.

## Appendix Roadmap

This appendix is organized as a supporting document for the main paper rather than as a second narrative\. Each section is intended to support a specific claim in the main text\. The roadmap is structured as follows:

| Appendix Section | Content Description |
| --- | --- |
| App. [A](https://arxiv.org/html/2605.28831#A1) | Statistical support for the internal headline. |
| App. [B](https://arxiv.org/html/2605.28831#A2)–[B.5](https://arxiv.org/html/2605.28831#A2.SS5) | Algorithmic details of S3Mem, including schema mapping, anchor extraction, reranking, and packing. |
| App. [C](https://arxiv.org/html/2605.28831#A3) | Benchmark construction, processed trajectories, question-generation pipeline, and ATM-Bench conversion/subset construction. |
| App. [D](https://arxiv.org/html/2605.28831#A4) | Runtime, answer-time, and rollout disclosure. |
| App. [E](https://arxiv.org/html/2605.28831#A5)–[E.2](https://arxiv.org/html/2605.28831#A5.SS2) | Recent-baseline adaptation details, including ATM-Bench adapted baselines, and tuning budget disclosure. |
| App. [F](https://arxiv.org/html/2605.28831#A6), [G](https://arxiv.org/html/2605.28831#A7), [H](https://arxiv.org/html/2605.28831#A8) | Context controls, component ablations, and generic-compression validation. |
| App. [I](https://arxiv.org/html/2605.28831#A9) | Four-actor Crafter rollout robustness. |
| App. [J](https://arxiv.org/html/2605.28831#A10)–[L](https://arxiv.org/html/2605.28831#A12) | Efficiency addendum, qualitative cases, and parser/executor boundary support. |
| App. [M](https://arxiv.org/html/2605.28831#A13) | Post-freeze strengthening notes. |

## Appendix A Statistical Support

This appendix section supports the internal headline claims in the main text and does not replace the main result tables\. The EM column for bootstrap summaries reports the bootstrap\-estimated center, which may differ slightly from the frozen headline point estimate in the main tables\.

Table 6: Statistical support for the internal headline environments. Panel A reports bootstrap 95% confidence intervals; Panel B reports paired bootstrap differences for key comparisons.

Panel A: Bootstrap confidence intervals

| Method | Crafter EM | Crafter 95% CI | Jericho EM | Jericho 95% CI |
| --- | --- | --- | --- | --- |
| No-Memory | 0.3831 | [0.3620, 0.4058] | 0.2289 | [0.1891, 0.2711] |
| Vanilla RAG | 0.6158 | [0.5947, 0.6375] | 0.6990 | [0.6567, 0.7438] |
| Graph-NoReader | 0.6992 | [0.6802, 0.7203] | 0.8881 | [0.8557, 0.9179] |
| S3Mem | 0.7203 | [0.7008, 0.7409] | 0.9303 | [0.9030, 0.9527] |

Panel B: Paired bootstrap differences

| Comparison | Crafter | Jericho |
| --- | --- | --- |
| S3Mem − Graph-NoReader | +0.0211 [+0.0132, +0.0290] | +0.0423 [+0.0199, +0.0672] |
| S3Mem − Vanilla RAG | +0.1045 [+0.0902, +0.1198] | +0.2313 [+0.1891, +0.2711] |
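The intervals in Table 6 are bootstrap estimates. As a minimal sketch of how such numbers can be produced, the code below computes a percentile bootstrap interval and a paired bootstrap difference from per-question 0/1 exact-match indicators. The function names, the resample count, the seed, and the choice of the question as the resampling unit are all assumptions made for illustration; the toy scores are hypothetical and are not the paper's data.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over `scores`; returns (center, lower, upper) for the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample questions with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    center = sum(means) / n_resamples  # bootstrap-estimated center
    return center, lower, upper

def paired_bootstrap_diff(scores_a, scores_b, **kwargs):
    """Paired bootstrap of mean(a) - mean(b): both methods are scored on the same questions,
    so resampling the per-question differences keeps the pairs aligned."""
    assert len(scores_a) == len(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, **kwargs)

# Toy per-question exact-match indicators (hypothetical).
method_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
method_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
center, lower, upper = paired_bootstrap_diff(method_a, method_b)
print(f"diff = {center:+.4f} [{lower:+.4f}, {upper:+.4f}]")
```

Because the center is the mean over resamples rather than the raw sample mean, it can differ slightly from the point estimate, which is consistent with the note above about the EM column.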

## Appendix B Algorithmic Details of S3Mem

This appendix section supports the method description in the main text and does not redefine the main claim\. Its purpose is to make the write–retrieve–pack pipeline more reproducible and to separate general method principles from environment\-specific implementation details\.

### B\.1 Algorithmic Overview

### B\.2 Environment\-Specific Memory Schema Mapping

The main text describes the write representation abstractly. Table [7](https://arxiv.org/html/2605.28831#A2.T7) makes the environment-specific schema more concrete.

Table 7: Environment-specific memory schema mapping. All environments preserve scene, event, state, and temporal context jointly, but the emphasized fields vary by benchmark.

| Environment | Scene fields | Event / action fields | State fields | Additional notes |
| --- | --- | --- | --- | --- |
| Crafter | location, nearby objects, spatial relations | action, event type, interaction target | inventory, local environmental state | Strongest emphasis on spatial displacement and item/state chains |
| Jericho | room/location, visible objects | action, trigger event, text-event transition | inventory, door/object status | Strongest emphasis on location reasoning and temporal offsets |
| ScienceWorld | observation scene, room/location | action, observation-triggered event, item interaction | inventory, state-change summary | Science-grounded state transitions and action counting |
| ALFWorld | room/location, visible household objects | action, gain-item / use-item event | inventory, object state, short transition summary | Strong emphasis on location chains, occurrence anchors, and multi-hop state chains |

### B\.3 Anchor Extraction Rules

Anchor extraction is rule/template driven unless otherwise noted. Its goal is not to perform a full semantic parse, but to recover the small set of cues needed for anchor-sensitive retrieval. Table [8](https://arxiv.org/html/2605.28831#A2.T8) summarizes the main fields.

Table 8: Anchor extraction fields and default interpretation rules.

| Anchor field | Example cue in the question | Default interpretation |
| --- | --- | --- |
| Target entity / object | “lab bench”, “alarm clock”, “brass key” | Primary object/location/entity of interest |
| Trigger event | “visit”, “obtain”, “unlock”, “see” | Event that introduces the evidence chain |
| Queried field | “what happened”, “where was the agent”, “what item” | Output type to be recovered at answer time |
| Occurrence anchor | “first”, “second”, “last” | Which event occurrence should be selected |
| Temporal offset | “one step after”, “two steps later” | Offset neighborhood around the selected anchor step |
| Fallback | no explicit occurrence/offset marker | Use the strongest direct anchor-compatible step and its minimal neighborhood |

The exact templates vary slightly by benchmark because question families differ, but the same abstract anchor tuple is used throughout the paper.
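To make the anchor tuple concrete, here is a minimal rule-based sketch that recovers the Table 8 fields from a question string. The class name `AnchorTuple`, the function `extract_anchor`, the cue vocabularies, and the regular expressions are illustrative assumptions; the actual templates are benchmark-specific and are not reproduced here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

# Illustrative cue vocabularies (assumed; seeded from the examples in Table 8).
TRIGGER_EVENTS = ("visit", "obtain", "unlock", "see")
QUERIED_FIELDS = (("where", "location"), ("what item", "item"), ("what happened", "event"))
OCCURRENCE_WORDS = {"first": 1, "second": 2, "third": 3, "last": -1}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

@dataclass
class AnchorTuple:
    target: Optional[str]
    trigger_event: Optional[str]
    queried_field: Optional[str]
    occurrence: Optional[int]  # 1-based index, -1 for "last", None if no cue
    offset: int                # signed step offset, 0 if no cue

def extract_anchor(question: str, known_entities: list[str]) -> AnchorTuple:
    q = question.lower()
    target = next((e for e in known_entities if e.lower() in q), None)
    trigger = next((t for t in TRIGGER_EVENTS if re.search(rf"\b{t}", q)), None)
    queried = next((field for cue, field in QUERIED_FIELDS if cue in q), None)
    occurrence = next(
        (v for w, v in OCCURRENCE_WORDS.items() if re.search(rf"\b{w}\b", q)), None
    )
    offset = 0
    m = re.search(r"\b(\w+) steps? (after|later|before|earlier)\b", q)
    if m and (m.group(1) in NUMBER_WORDS or m.group(1).isdigit()):
        n = NUMBER_WORDS.get(m.group(1)) or int(m.group(1))
        offset = n if m.group(2) in ("after", "later") else -n
    # Fallback rule: with occurrence=None and offset=0, a downstream retriever would
    # use the strongest direct anchor-compatible step and its minimal neighborhood.
    return AnchorTuple(target, trigger, queried, occurrence, offset)

anchor = extract_anchor(
    "What item did the agent hold two steps after it unlocked the door "
    "with the brass key for the second time?",
    ["lab bench", "alarm clock", "brass key"],
)
print(anchor)
```

The sketch deliberately stops short of a semantic parse: it only looks for the handful of cues that decide which step to anchor on and how far to move from it.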

### B\.4 Retrieval and Reranking Features

The main text gives an abstract score decomposition. Here we make the reranking policy more explicit. Table [9](https://arxiv.org/html/2605.28831#A2.T9) lists the main feature groups.

Table 9: Retrieval and reranking features used by S3Mem.

| Feature group | Role in reranking |
| --- | --- |
| Text relevance | Rewards lexical/dense similarity between the question and memory-unit summary/fields |
| Anchor compatibility | Rewards units whose objects, events, or locations match the extracted anchor tuple |
| Occurrence compatibility | Rewards units consistent with first/second/last-event cues when such cues are present |
| Local neighborhood support | Rewards units adjacent to anchor-bearing steps when they help complete a temporal or state chain |
| State-transition support | Rewards units that supply missing transition facts such as gain-item, unlock, or location-change evidence |
| Tie-break | Prefer anchor-bearing units, then local chain-support units, then smaller nonredundant support units |

The paper should not be read as claiming one universal set of learned weights. The more accurate description is that S3Mem follows a fixed *ordering principle*: text relevance alone is insufficient, so anchor compatibility and local chain support are explicitly promoted.
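One way to picture this ordering principle is an additive score whose structural bonuses can outrank a purely lexical match, followed by the tie-break order from Table 9. The sketch below is illustrative only: the `Candidate` fields, the bonus values in `BONUS`, and the function names are assumptions, not the weights or code used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    unit_id: str
    text_relevance: float     # lexical/dense similarity, assumed to lie in [0, 1]
    anchor_match: bool        # objects/events/locations match the anchor tuple
    occurrence_match: bool    # consistent with a first/second/last cue
    chain_support: bool       # adjacent to an anchor-bearing step
    transition_support: bool  # supplies a missing state-transition fact
    n_tokens: int

# Hypothetical bonus values, chosen only so that structure can outrank text relevance.
BONUS = {
    "anchor_match": 1.0,
    "occurrence_match": 0.5,
    "chain_support": 0.5,
    "transition_support": 0.25,
}

def score(c: Candidate) -> float:
    return c.text_relevance + sum(w for name, w in BONUS.items() if getattr(c, name))

def rerank(candidates):
    # Higher score first; ties prefer anchor-bearing units, then chain-support
    # units, then smaller units.
    return sorted(
        candidates,
        key=lambda c: (-score(c), not c.anchor_match, not c.chain_support, c.n_tokens),
    )

candidates = [
    Candidate("s12", 0.9, False, False, False, False, 40),  # lexically close, off-anchor
    Candidate("s47", 0.4, True, True, False, False, 30),    # the anchor step itself
    Candidate("s48", 0.3, False, False, True, True, 25),    # neighbor completing the chain
]
print([c.unit_id for c in rerank(candidates)])
```

In this toy case the lexically strongest unit ends up last, which is the behavior the ordering principle is meant to produce when text similarity and anchor evidence disagree.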

### B\.5 Packing Policy and Stopping Rules

The evidence packer is greedy and budget\-aware\. Its purpose is not generic summarization, but preserving the smallest structured subset of evidence sufficient for answer\-time inference\. The packer follows four rules:

1. Anchor-first rule: preserve the decisive anchor step whenever one is identified.
2. Neighborhood rule: preserve the minimal local neighborhood needed to support a temporal offset, occurrence disambiguation, or state transition.
3. Coverage-before-redundancy rule: add a unit only if it contributes a missing fact or closes a partial chain.
4. Budget stop rule: stop when no additional nonredundant support unit can be added without exceeding the token budget.

The main tie-break order is: anchor-bearing units $\succ$ required local support $\succ$ state-transition completion $\succ$ other weakly relevant units. This makes the evidence interface more stable than generic truncation or free-form compression.
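The four rules and the tie-break order above can be sketched as a single greedy pass. This is a minimal illustration under stated assumptions: the `Unit` fields, the integer `tier` encoding of the tie-break order, and the representation of "facts" as string sets are hypothetical, and in this sketch the anchor unit is still subject to the token budget.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    unit_id: str
    n_tokens: int
    facts: frozenset  # facts or chain links this unit contributes (hypothetical encoding)
    tier: int         # 0 anchor, 1 required local support, 2 state-transition completion, 3 other

def pack(units, token_budget):
    """Greedy, budget-aware packing that follows the tie-break order."""
    selected, covered, used = [], set(), 0
    for u in sorted(units, key=lambda u: (u.tier, u.n_tokens)):
        # Coverage-before-redundancy: skip support units that add no new fact.
        if u.tier != 0 and u.facts <= covered:
            continue
        # Budget rule: skip units that do not fit; a smaller later unit may still fit,
        # so the loop ends only when no remaining nonredundant unit can be added.
        if used + u.n_tokens > token_budget:
            continue
        selected.append(u)
        covered |= u.facts
        used += u.n_tokens
    return selected, used

units = [
    Unit("a", 40, frozenset({"unlock@47"}), 0),     # decisive anchor step
    Unit("b", 30, frozenset({"location@49"}), 1),   # offset neighborhood
    Unit("c", 35, frozenset({"location@49"}), 1),   # redundant with b
    Unit("d", 50, frozenset({"gain_key@30"}), 2),   # would exceed the budget
    Unit("e", 20, frozenset({"room_desc@12"}), 3),  # weakly relevant but fits
]
selected, used = pack(units, token_budget=100)
print([u.unit_id for u in selected], used)
```

Sorting by tier before size is what distinguishes this from generic truncation: a long anchor unit is considered before any short but weakly relevant one.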

## Appendix CData Construction and Benchmark Pipeline

This appendix section supports the benchmark and provenance discussion in the main text and does not introduce a new benchmark contribution\. Its purpose is to document how raw rollouts become processed trajectories, how processed trajectories become QA examples, and how sample counts arise in the final evaluation slices\.

### C\.1 Shared Construction Pipeline

Across all four environments, the data pipeline follows the same high\-level structure:

1. Obtain raw rollout trajectories from the environment-specific actor or collector.
2. Convert raw rollout logs into a processed trajectory format that normalizes observations, actions, locations, events, state facts, and question-facing metadata.
3. Build question–answer pairs from the processed trajectory using benchmark-specific template/rule-based construction.
4. Assign family labels from generator-side fields such as `type`, `category`, `template`, or `metadata["answer_type"]`.
5. Filter invalid or degenerate examples using generator-side validity constraints; unless otherwise noted, the main paper does not apply manual per-question cleaning.
6. Freeze the resulting processed trajectories and generated questions for all method comparisons within that environment.
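Step 4 of the pipeline above can be sketched as a simple field lookup. The lookup priority, the `"unlabeled"` fallback, and the function name `family_label` are assumptions for illustration; the text only states which generator-side fields may carry the label.

```python
def family_label(question: dict) -> str:
    """Return a family label from generator-side fields (assumed priority order)."""
    for key in ("type", "category", "template"):
        value = question.get(key)
        if value:
            return str(value)
    answer_type = (question.get("metadata") or {}).get("answer_type")
    return str(answer_type) if answer_type else "unlabeled"

# Hypothetical generated questions.
print(family_label({"type": "temporal_offset", "category": "time"}))
print(family_label({"metadata": {"answer_type": "location"}}))
```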

The fairness boundary is important. All compared methods consume the same processed trajectory set and the same generated questions within each environment. In the main evaluation, baselines do *not* operate directly on raw rollout logs. Instead, they consume processed trajectories, which are then converted inside the runner into a unified memory object representation before answer-time inference. The shared current answerer receives the question together with the selected memories and, when available, structured evidence fields. This makes the main comparison one of memory interfaces rather than raw-data sources or final-answer backbones.

### C\.2 Crafter Construction

Crafter \[[7](https://arxiv.org/html/2605.28831#bib.bib20)\] uses real rollout artifacts from the EMemBench Crafter pipeline. The main headline evaluation slice uses a Qwen3-VL-8B-Instruct family rollout actor and contains 24 episodes / 1895 questions. Episode identities indicate a seed range from `seed1` to `seed24`, with a fixed step budget of 180 steps per episode. The episode-id field `180steps_10rounds` should be interpreted as 180 steps with `history_turns = 10`, rather than as a multi-episode round count.
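As a small illustration of that reading, the following sketch parses the `180steps_10rounds` field into a step budget and a history setting. The function name and the returned key names are assumptions; only the field value and its interpretation come from the text.

```python
import re

def parse_episode_field(field: str) -> dict:
    """Interpret '<N>steps_<M>rounds' as N steps with history_turns = M."""
    m = re.fullmatch(r"(\d+)steps_(\d+)rounds", field)
    if m is None:
        raise ValueError(f"unrecognized episode-id field: {field!r}")
    return {"steps": int(m.group(1)), "history_turns": int(m.group(2))}

print(parse_episode_field("180steps_10rounds"))
```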

The reported slice is Crafter. In the current generator, the slice restriction is *not* episode filtering, family balancing, or manual difficulty annotation. Instead, it is a question-construction scope truncation: generated questions and answers are restricted to the first 50 steps of the episode prefix.

QA construction is fully automatic and uses the EMemBench Crafter generator over real rollout logs\. The generator reads the rollout log, map/discovered\-map state, inventory, terrain, and anchor events, then constructs trajectory\-grounded questions using template/rule\-based logic\. Question families include step lookup, occurrence reasoning, state query, terrain/spatial reasoning, inventory reasoning, craft feasibility, temporal reasoning, event ordering, counting, multi\-hop reasoning, and induction\. For the headline slice, we do not apply additional manual question cleaning or global post\-hoc ambiguity filtering beyond generator\-side validity constraints\.

### C\.3 Jericho Construction

Jericho \[[8](https://arxiv.org/html/2605.28831#bib.bib21)\] uses real rollout artifacts from a GPT-5.4 family actor with a fixed history setting. The reported evaluation slice is Jericho with $n=402$ questions. The processed trajectory preserves room/location transitions, inventory changes, action traces, and state-change facts such as unlocking or acquiring an item. Question families emphasize temporal offsets, location reasoning, acquisition-to-outcome chains, and inventory aggregation. As with Crafter, some original rollout-generation hyperparameters are not fully recoverable from archived runs, but the processed evaluation slice is frozen and shared across all memory methods.

### C\.4 ScienceWorld Construction

ScienceWorld \[[23](https://arxiv.org/html/2605.28831#bib.bib22)\] is generated from a lightweight valid-random collector. The reported slice contains $n=242$ questions. The processed trajectory preserves step observations, location changes, action traces, inventory tracking, and short state-transition summaries. Question construction emphasizes step observation, location visits, inventory tracking, action counting, and state-chain extensions. In the paper body, ScienceWorld is used primarily as an *efficiency generalization* environment rather than as the strongest external accuracy test.

### C\.5 ALFWorld Construction

ALFWorld \[[21](https://arxiv.org/html/2605.28831#bib.bib23)\] is generated from a handcoded expert text-rollout collector. The reported slice contains $n=329$ questions. The processed trajectory preserves location transitions, item gains, use-item actions, and state changes over household objects. Question construction emphasizes location chains, gain-item chains, second/last-occurrence anchors, longer temporal chains, and multi-hop state-chain reasoning. In the paper body, ALFWorld is the strongest external *accuracy+efficiency* generalization result.

### C.6 ATM-Bench Construction and Adaptation

ATM-Bench is fundamentally different from the four main environments because it is *not* an interactive rollout benchmark. There is no acting policy, no environment control loop, and no native trajectory generated by an agent. Instead, the benchmark provides question–answer pairs together with ground-truth evidence identifiers over a long-term personal memory archive containing emails, images, and videos. For this reason, we do not interpret ATM-Bench as a fifth in-family environment. We interpret it as an out-of-family personalized-memory stress test.

To make ATM-Bench runnable under our unified memory-interface comparison, we apply the following deterministic conversion pipeline:

1. Read the official questions, answers, and evidence ids.
2. Read the email archive and the image/video memory items produced by the benchmark’s official `batch_results` route.
3. Convert each evidence item into a timestamped pseudo-step.
4. Sort the pseudo-steps into one ordered personal-memory stream.
5. Attach the benchmark questions to that single ordered stream to form a pseudo-trajectory episode consumable by the same evaluation runner used elsewhere in the paper.
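The conversion steps above can be sketched as follows. The item schema (`id`, `modality`, `timestamp`, `text`), the `PseudoStep` class, the ISO-8601 timestamp format, and the tie-break on evidence id are all assumptions for illustration; the benchmark's real file formats are not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PseudoStep:
    step: int
    timestamp: datetime
    evidence_id: str
    modality: str  # "email", "image", or "video"
    text: str

def build_pseudo_trajectory(items, questions):
    """Serialize archive items into one time-ordered stream and attach the questions."""
    parsed = [(datetime.fromisoformat(it["timestamp"]), it) for it in items]
    # Sort by timestamp, breaking ties by evidence id so the order is deterministic.
    parsed.sort(key=lambda pair: (pair[0], pair[1]["id"]))
    steps = [
        PseudoStep(i, ts, it["id"], it["modality"], it["text"])
        for i, (ts, it) in enumerate(parsed)
    ]
    return {"steps": steps, "questions": list(questions)}

# Hypothetical archive items and question.
items = [
    {"id": "email_002", "modality": "email", "timestamp": "2024-03-02T08:30:00", "text": "Invoice follow-up"},
    {"id": "img_001", "modality": "image", "timestamp": "2024-03-01T12:00:00", "text": "Photo of a receipt"},
    {"id": "email_001", "modality": "email", "timestamp": "2024-03-01T09:00:00", "text": "Invoice sent"},
]
questions = [{"qid": "q1", "evidence_ids": ["email_001", "img_001"]}]
episode = build_pseudo_trajectory(items, questions)
print([s.evidence_id for s in episode["steps"]])
```

The questions and evidence ids pass through unchanged; only the archive is reordered into a single stream.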

This design preserves the benchmark’s original answers and evidence ids while allowing the same evaluation harness to compare Vanilla RAG, Graph-NoReader, S3Mem, and the adapted recent baselines. The crucial methodological point is that ATM-Bench still does *not* become an interactive benchmark after conversion: the pseudo-trajectory is only a uniform serialization layer over a personal archive, not a recovered acting trace.

The official multimodal route was also adapted conservatively. For official compatibility checks, we validated a Qwen-family route based on `qwen3-235b + batch_results`. The raw Qwen-VL route remained unstable during our runs, so the reproducible path is the benchmark’s official `batch_results` setting, where image and video items are first itemized into text memory entries and then consumed by the downstream QA agent. On this official route, Oracle on subset87 achieves EM $=0.7241$ at $588.6$ average total tokens, while MMRAG on subset20 achieves EM $=0.8000$ at $2472.3$ average total tokens. These official baselines are not the main paper headline, but they confirm that the benchmark-side Qwen path itself is functional and that our later unified comparison is not built on a broken input pipeline.

### C.7 ATM-Bench Current-Covered Subsets and Why They Exist

The ATM-Bench experiments use several subsets, but these subsets are *not* arbitrary random samples. They arise from two deterministic needs: functional validation and current-covered evaluation. The small slices (subset5 and subset20) were used only to validate that the official Qwen route, evaluator, and answer-format handling were working. The larger slices are *coverage checkpoints*: at each time point, we rebuild the largest question subset whose evidence ids are fully covered by the currently available email archive plus image/video `batch_results`. This is why the numbers increase from 87 to 417, 431, 444, and 452 as image itemization coverage expands.
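The coverage-checkpoint rule is a deterministic filter rather than a sampling step, which the sketch below illustrates. The field name `evidence_ids`, the function name, and the toy ids are hypothetical.

```python
def covered_subset(questions, available_ids):
    """Largest question subset whose evidence ids are all currently available."""
    available = set(available_ids)
    return [q for q in questions if set(q["evidence_ids"]) <= available]

# Hypothetical questions and two successive coverage states.
questions = [
    {"qid": "q1", "evidence_ids": ["email_001"]},
    {"qid": "q2", "evidence_ids": ["email_001", "img_001"]},
    {"qid": "q3", "evidence_ids": ["vid_001"]},
]
early = covered_subset(questions, {"email_001"})
later = covered_subset(questions, {"email_001", "img_001"})
print(len(early), len(later))
```

Because coverage only grows, each later checkpoint contains every earlier one, so the subset size can only increase as more images and videos are itemized.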

Table 10: ATM-Bench subset construction rationale. These subsets are deterministic coverage checkpoints or functional validation slices, not random samples.

| Subset | Primary role | Questions | Construction note |
| --- | --- | --- | --- |
| subset5 | Smoke / protocol validation | 5 | Minimal slice for first functional Qwen-route and evaluator validation |
| subset20 | Small official baseline probe | 20 | First nontrivial Oracle/MMRAG validation slice |
| subset87 | First stable unified-method slice | 87 | First frozen slice where official baselines and our unified runner both completed reliably |
| subset417 | First large formal slice | 417 | First current-covered large slice after broader image/video itemization and the ATM seed-injection fix |
| subset431 | Larger frozen formal slice | 431 | Additional image-only questions added; this is the largest fully frozen slice with completed core-method results at write time |
| subset444 | Ongoing larger slice | 444 | Partial-only results available at write time |
| subset452 | Latest prepared slice | 452 | Largest current-covered slice prepared at write time, but not yet fully finalized |

Two further caveats matter for interpretation. First, the larger slices are still highly email-dominated: for example, subset452 contains 6845 pseudo-steps, of which 6742 are emails, 90 are images, and 13 are videos. Second, the change from subset417 to subset431 should *not* be read as a pure scale-up curve. That change combines three effects: (i) more image-only questions enter the covered subset, (ii) the benchmark distribution shifts slightly as new image evidence becomes available, and (iii) the Vanilla RAG / Graph-NoReader configurations on the later slice use wider retrieval breadth and larger generic budgets than on the earlier slice. We therefore treat the larger ATM-Bench subsets as a coverage-growth stress test, not as a perfectly controlled monotonic scaling law.

### C\.8 Question Construction and Family Labeling

The exact template sets vary by benchmark, but all QA examples are trajectory\-grounded and answerable from the processed trajectory alone\. Question families are assigned according to the dominant reasoning pattern required to recover the answer, including: single\-step lookup, repeated\-event disambiguation, temporal offset, spatial reasoning, inventory/state\-chain reasoning, and multi\-hop transitions\. Where multiple patterns overlap, the family label is assigned by the most restrictive reasoning requirement\. The appendix family tables should therefore be interpreted as reasoning\-family summaries rather than as mutually exclusive ontological classes\.

## Appendix D Runtime / Model / Rollout Disclosure

This appendix section supports the reproducibility boundary behind the main headline tables and does not replace the main method comparison. For reproducibility, we separate *answer-time protocol disclosure* from *trajectory provenance disclosure*.

### D\.1 Answer\-Time Protocol Disclosure

Table 11: Answer-time protocol disclosure.

| Protocol | Main implementation | Reproducibility note |
| --- | --- | --- |
| `current` | In-repo shared current answerer | Deterministic headline protocol; not an external sampled foundation model |
| `generic` | Generic/degraded mode of the same answerer path | Control protocol; should not replace the headline evaluation |
| `gold_executor` | Program converter + executor | Program-grounded execution rather than open-ended answer generation |

For all three protocols, we do not use temperature, top-$p$, generation retries, or max-new-token sampling settings, because the answer stage is deterministic rather than free-form sampled generation. This matters for interpretation: the paper is not a hidden foundation-model leaderboard under different prompting schemes.

Concretely, the shared current answerer does not operate as an unconstrained generative backbone\. It receives the question together with the selected memories and, when available, structured evidence fields produced by the upstream memory method\. In the current implementation, the answerer heavily relies on ordered raw\-observation traces recovered from memory metadata, which are then consumed under the frozen answer\-time protocol\. This is precisely why the paper frames its main contribution as a structured memory\-interface result under the current frozen answer\-time protocol, rather than as a protocol\-independent end\-to\-end answerer result\.

### D\.2 Trajectory Provenance Disclosure

Table 12: Trajectory provenance by environment. Within each environment, all memory methods share the same processed trajectories and questions.

| Environment | Provenance | Disclosure boundary |
| --- | --- | --- |
| Crafter | Real rollout artifacts from a Qwen3-VL-8B-Instruct family actor | Actor family and run identity are recoverable; full rollout sampling parameters are not fully preserved locally |
| Jericho | Real rollout artifacts from a GPT-5.4 family actor with a fixed history setting | Actor family and history setting are recoverable; full API-time sampling configuration is not completely recoverable |
| ScienceWorld | Lightweight valid-random rollout collector | Locally reproducible scripted rollout path |
| ALFWorld | Handcoded expert text rollout collector | Locally reproducible expert-policy rollout path |

This separation clarifies the reproducibility claim boundary. The frozen memory-method evaluation protocol is reproducible from the released processed trajectories and questions, whereas the original Crafter/Jericho rollout-generation calls are only partially recoverable at the level of actor family and run identity.

## Appendix E Recent Baseline Adaptation Details

This appendix section supports the recent-neighbor summary in the main text and does not redefine the paper’s headline comparison. We intentionally do *not* claim a full reproduction of any official end-to-end system. Instead, each baseline is adapted to the same trajectory-grounded episodic-QA setting and the same frozen shared-answerer protocol.

### E\.1 Adaptation Boundary

Table 13: Adaptation boundary for the recent baselines.

| Baseline | Preserved memory intuition | Frozen comparison boundary |
| --- | --- | --- |
| A-MEM-inspired | Note-style memory items, lightweight keywords/tags, note links, link-aware retrieval | No external-LLM note analysis; no anchor-sensitive retrieval or S3Mem-specific seed injection |
| MemoryOS-adapted | Hierarchical memory management and staged memory organization | No S3Mem-style anchor routing; same frozen answerer and trajectory/question sets |
| LightMem-adapted | Efficiency-oriented memory construction in a low-token regime | No S3Mem-specific anchor logic; same frozen answerer and trajectory/question sets |

This boundary is deliberate. The baselines preserve the core memory-organization intuition of each recent system while avoiding hidden reuse of S3Mem-specific anchor logic or unfair answerer changes.

### E\.2 Adaptation Tuning Budget

Table 14: Adaptation tuning-budget disclosure for the recent baselines.

| Baseline | Tuning split | Main tuned factors | Disclosure note |
| --- | --- | --- | --- |
| A-MEM-inspired | Frozen internal development slice | note granularity, link threshold, retrieval breadth | Same processed trajectories and shared answerer as all other baselines |
| MemoryOS-adapted | Frozen internal development slice | hierarchy depth, memory aggregation granularity, retrieval breadth | No S3Mem-specific anchor routing permitted |
| LightMem-adapted | Frozen internal development slice | compression granularity, memory grouping, retrieval breadth | Evaluated as the strongest low-token recent neighbor under the same shared-answerer boundary |

The purpose of this table is not to claim that every baseline receives an identical official reproduction budget. Rather, it makes the comparison boundary explicit: all recent baselines are tuned only within the shared trajectory-grounded QA setting and are not allowed to inherit S3Mem-specific anchor logic.

For the final paper-facing recent-baseline comparisons, the adapted recent baselines are run under a shared low-token comparison regime with `top_k = 16`, `short_window = 4`, and `token_budget = 192`. In addition, the MemoryOS-adapted baseline uses `segment_size = 8`, while the LightMem-adapted baseline uses `segment_size = 10`. We explicitly note that the main S3Mem configuration is not strictly parameter-matched to this recent-baseline regime: the strongest S3Mem setting uses a larger retrieval breadth on the internal headline environments (Crafter: `top_k = 32`; Jericho: `top_k = 24`). For this reason, the recent-neighbor comparison should be read together with the matched-budget controls, RTK-style compression validation, and Full-History controls, rather than as a single isolated fairness claim.
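For reference, the settings stated in this paragraph can be collected into one configuration sketch. The dictionary layout and the helper name `baseline_config` are assumptions; only the parameter names and values come from the text, and S3Mem parameters other than `top_k` are not listed because the paragraph does not state them.

```python
# Shared low-token regime for the adapted recent baselines (values from the text).
SHARED_RECENT_BASELINE_REGIME = {"top_k": 16, "short_window": 4, "token_budget": 192}

# Per-baseline additions (values from the text; the layout is hypothetical).
BASELINE_OVERRIDES = {
    "A-MEM-inspired": {},
    "MemoryOS-adapted": {"segment_size": 8},
    "LightMem-adapted": {"segment_size": 10},
}

# Retrieval breadth of the strongest S3Mem setting on the headline environments.
S3MEM_TOP_K = {"Crafter": 32, "Jericho": 24}

def baseline_config(name: str) -> dict:
    """Merge the shared regime with a baseline's own settings."""
    return {**SHARED_RECENT_BASELINE_REGIME, **BASELINE_OVERRIDES[name]}

print(baseline_config("MemoryOS-adapted"))
```

Laying the values side by side makes the mismatch explicit: the baselines retrieve 16 units while S3Mem retrieves 24 or 32 on the headline environments.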

### E\.3 A\-MEM\-Inspired Adaptation Details

The A-MEM-inspired baseline preserves the note-link intuition of lightweight note items, keyword-style tags, and explicit note-to-note links. The memory items are constructed from the same processed trajectories used by the rest of the paper, and the final answer stage is still the frozen shared current answerer. We do not import external note-analysis prompting or any S3Mem-specific anchor injection into the A-MEM-inspired retrieval path.

### E\.4 MemoryOS\-Adapted Baseline Details

The MemoryOS-adapted baseline preserves the hierarchical-memory intuition of staged storage and memory-management structure. Its role in the paper is to test whether hierarchical organization by itself is sufficient once the answer-time layer, question set, and trajectories are frozen. The adaptation therefore retains a hierarchical write/retrieval structure but does not add S3Mem-style occurrence-anchor recovery, target-step recovery, or specialized evidence packing.

### E\.5 LightMem\-Adapted Baseline Details

The LightMem-adapted baseline preserves the efficiency-oriented intuition of building a leaner memory representation aimed at low-token answering. Among the three recent baselines, it is the strongest empirical neighbor in our setting, especially on Crafter and ALFWorld. That is precisely why it matters for the paper: if a recent efficiency-oriented memory system already matched the S3Mem frontier, our claim would need to be narrower.

### E.6 ATM-Bench Adapted Baselines and Comparison Boundary

For ATM-Bench, we use the same overarching fairness rule as in the main paper: all compared methods consume the same processed pseudo-trajectory, the same questions, and the same frozen Qwen-family answer-time configuration within each comparison block. The current stable route uses `qwen3-235b` as the answer-time backbone over the benchmark’s `batch_results`-based textual memory stream.

The core methods preserve their usual roles under this conversion: No-Memory sees only the question, Vanilla RAG indexes the pseudo-trajectory as plain text, Graph-NoReader uses graph-structured storage over the same pseudo-steps, and S3Mem uses structured write, anchor-sensitive retrieval, and budgeted evidence packing over the pseudo-trajectory. For S3Mem specifically, we explicitly disable the Crafter/Jericho-style seed-step injection on ATM-Bench because those interaction-specific seeds polluted retrieval on archive-style personal memory questions. This bug fix is part of the reason why the larger ATM-Bench formal slices should be interpreted only after the corrected runs.

The recent adapted baselines are implemented as controlled design-family translations rather than as end-to-end re-executions of the original external systems. The A-MEM-inspired variant preserves note-style items, lightweight keyword tagging, and note-link retrieval. The MemoryOS-adapted variant preserves a hierarchical memory-management intuition with staged grouping over the pseudo-trajectory. The LightMem-adapted variant preserves efficiency-oriented compressed memory construction with short per-step summaries and lightweight segment topics. None of these adapted baselines is allowed to reuse S3Mem-specific anchor recovery, cue-line packing, or S3Mem-specific evidence-selection heuristics.

Table 15: ATM-Bench adapted-baseline comparison on subset87. This slice is used for the first stable cross-method adaptation check because it is the earliest slice on which all adapted baselines, the official Qwen route, and the unified runner completed reliably.

| Method | EM | Avg. Evidence Tokens |
| --- | --- | --- |
| No-Memory | 0.0345 | 0.00 |
| Vanilla RAG | 0.7241 | 1453.87 |
| Graph-NoReader | 0.7126 | 1751.64 |
| A-MEM-inspired | 0.0460 | 229.38 |
| MemoryOS-adapted | 0.0345 | 1066.80 |
| LightMem-adapted | 0.0345 | 316.40 |
| S3Mem (top_k=24, budget=768) | 0.7126 | 743.62 |

This table serves two purposes. First, it shows that ATM-Bench can be integrated into our unified method-comparison framework rather than only into the benchmark's official Oracle/MMRAG chain. Second, it shows that the recent adapted baselines do *not* dominate under this adaptation: their EM stays at or near the No-Memory level (0.0345 to 0.0460), so their role is as meaningful neighboring controls, not as new headline winners. On the same slice, a conservative answer-refinement pass can raise S3Mem to EM = 0.7471 and both Vanilla RAG and Graph-NoReader to EM = 0.7356, but we do not use that post-processed variant as the main ATM-Bench headline in the paper body.
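The EM values in Table 15 are easier to compare as question counts. The sketch below assumes, from the slice name alone, that subset87 contains 87 questions; the paper does not state this, and the helper `implied_correct` is illustrative rather than part of the paper's tooling.

```python
# Assumption: "subset87" holds 87 questions (inferred from the slice name,
# not stated in the paper). EM values are copied from Table 15 and from the
# answer-refinement note that follows it.
N_QUESTIONS = 87

def implied_correct(em: float, n: int = N_QUESTIONS) -> int:
    """Return the whole number of correct answers consistent with a 4-decimal EM."""
    k = round(em * n)
    if abs(k / n - em) > 6e-5:  # 4-decimal rounding leaves at most 5e-5 of slack
        raise ValueError(f"EM {em} is not a multiple of 1/{n}")
    return k

reported = {
    "No-Memory": 0.0345,
    "Vanilla RAG": 0.7241,
    "Graph-NoReader": 0.7126,
    "A-MEM-inspired": 0.0460,
    "MemoryOS-adapted": 0.0345,
    "LightMem-adapted": 0.0345,
    "S3Mem": 0.7126,
}
refined = {"S3Mem": 0.7471, "Vanilla RAG": 0.7356, "Graph-NoReader": 0.7356}

counts = {method: implied_correct(em) for method, em in reported.items()}
refined_counts = {method: implied_correct(em) for method, em in refined.items()}
```

Under that assumption, every reported EM is an exact multiple of 1/87, and the gap between Vanilla RAG and S3Mem on this slice corresponds to a single question, so small EM differences here should be read at that granularity.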

### E\.7 Why Not Every Recent Work Is Run as a Baseline

Not every recent memory paper is a drop-in baseline for this setting. Some recent works are primarily benchmarks, some target conversational long-term memory, and others focus on general agent infrastructure rather than trajectory-grounded episodic QA. Our experimental design therefore uses a layered comparison: the core interface-control chain (No-Memory, Vanilla RAG, Graph-NoReader, S3Mem), three runnable recent neighbors (A-MEM, MemoryOS, and LightMem), and several alternative-explanation controls (Full-History, Summarize-then-Answer, and RTK-style compression). This is a more honest comparison strategy than pretending that all recent memory papers can be re-executed fairly under the same frozen answer-time protocol.

## Appendix F Context-Length Family Breakdown

This appendix section supports the context-length control argument in the main text and does not replace the main headline results. The summary-level context-length comparison is already included in Table [4](https://arxiv.org/html/2605.28831#S5.T4); this appendix focuses on the family-level failure pattern.

Table 16: Family-level breakdown for the context-length controls. Numbers in parentheses are family sample counts.

| Environment | Family | Summarize-then-Answer | Full-History |
| --- | --- | --- | --- |
| Crafter | Single-Hop (304) | 0.2632 | 0.7566 |
| Crafter | Adversarial (531) | 0.9925 | 0.9680 |
| Crafter | Multi-Hop (76) | 0.1579 | 0.6447 |
| Crafter | Inducing (275) | 0.4255 | 0.5018 |
| Crafter | Spatial (318) | 0.1698 | 0.7201 |
| Crafter | Time (190) | 0.0316 | 0.7684 |
| Crafter | Logical (201) | 0.2587 | 0.8408 |
| Jericho | Single-Hop (90) | 0.1556 | 1.0000 |
| Jericho | Multi-Hop (80) | 0.0500 | 1.0000 |
| Jericho | Inducing (40) | 0.1000 | 1.0000 |
| Jericho | Spatial (55) | 0.0727 | 1.0000 |
| Jericho | Time (47) | 0.2340 | 1.0000 |
| Jericho | Logical (30) | 0.1333 | 1.0000 |
| Jericho | Adversarial (60) | 1.0000 | 0.9833 |

The family view clarifies why generic summary compression is not a sufficient explanation for S3Mem. The summary baseline remains relatively strong only on easier or highly lexical families such as the adversarial subset. It collapses exactly on the chain-heavy families that need anchor preservation and local ordering, including Crafter Multi-Hop / Spatial / Time and Jericho Multi-Hop / Spatial / Logical / Time. This is the same qualitative region where S3Mem is designed to help.

## Appendix G Component Ablation

This appendix section localizes which parts of the pipeline matter most once the main headline is fixed\. Because these variants modify candidate coverage, anchor recovery, and compression separately, they identify which part of the pipeline is responsible for the gain\.

Table 17: Component ablation on Crafter and Jericho. "no_seed" removes anchor recovery; "no_compress" removes evidence compression; "top_k16" narrows retrieval breadth; "budget96" tightens the token budget.

| Variant | Crafter EM | Crafter Avg. Tokens | Jericho EM | Jericho Avg. Tokens |
| --- | --- | --- | --- | --- |
| Graph-NoReader (reference) | 0.6992 | 1679.07 | 0.8881 | 1728.89 |
| S3Mem (full) | 0.7203 | 141.05 | 0.9303 | 173.76 |
| top_k16 | 0.6992 | 138.41 | 0.8881 | 166.42 |
| budget96 | 0.7203 | 91.92 | 0.9303 | 103.16 |
| top_k16+budget96 | 0.6992 | 91.47 | 0.8881 | 101.42 |
| no_seed | 0.6201 | 133.76 | 0.8085 | 167.80 |
| no_compress | 0.7203 | 2602.60 | 0.9303 | 2504.90 |

Four observations are most important. First, anchor recovery is the main accuracy driver: no_seed causes the largest drop, showing that seed-step and occurrence-anchor recovery are core components rather than optional heuristics. Second, candidate coverage still matters: top_k16 collapses S3Mem back to the Graph-NoReader level, so the method does not work by compression alone. Third, the system is already near a strong budget–quality knee: budget96 preserves full EM while lowering token cost further. Fourth, compression buys efficiency rather than hidden extra accuracy: no_compress leaves EM unchanged but increases token cost by more than 18× on Crafter (2602.60 vs. 141.05 tokens) and by about 14× on Jericho (2504.90 vs. 173.76 tokens).
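The four observations can be checked directly against the Table 17 numbers. The following is a minimal sketch that recomputes each variant's EM change and token-cost ratio relative to the full S3Mem configuration; the dictionary layout and the function name `compare_to_full` are illustrative choices, not part of the paper's code.

```python
# (EM, average evidence tokens) copied from Table 17.
ablation = {
    "Crafter": {
        "full": (0.7203, 141.05),
        "top_k16": (0.6992, 138.41),
        "budget96": (0.7203, 91.92),
        "no_seed": (0.6201, 133.76),
        "no_compress": (0.7203, 2602.60),
    },
    "Jericho": {
        "full": (0.9303, 173.76),
        "top_k16": (0.8881, 166.42),
        "budget96": (0.9303, 103.16),
        "no_seed": (0.8085, 167.80),
        "no_compress": (0.9303, 2504.90),
    },
}

def compare_to_full(env: str) -> dict:
    """Map each variant to (EM change, token-cost ratio) versus the full system."""
    full_em, full_tokens = ablation[env]["full"]
    return {
        name: (round(em - full_em, 4), round(tokens / full_tokens, 2))
        for name, (em, tokens) in ablation[env].items()
        if name != "full"
    }

for env in ablation:
    for name, (d_em, ratio) in compare_to_full(env).items():
        print(f"{env:8s} {name:12s} dEM={d_em:+.4f} tokens x{ratio}")
```

Read this way, no_seed costs roughly 0.10 EM on Crafter and 0.12 EM on Jericho at nearly unchanged token cost, while no_compress changes only the token column.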

## Appendix H Controlled Generic Compression Validation

This appendix section supports the generic-compression control claim in the main text and does not present RTK-style compression as a competing memory method. We add a controlled RTK-style validation to test a specific alternative explanation for S3Mem's token efficiency: perhaps the gain comes only from generic answer-time compression rather than from a structured memory interface. The control freezes the answerer, tokenizer, evaluation split, EM metric, and token accounting pipeline. The RTK-style compressor is allowed to perform only generic text operations (filtering, grouping, truncation, and deduplication) and is explicitly forbidden from using scene-event parsing, anchor recovery, occurrence reasoning, target-step reasoning, or environment-specific heuristics. Because this validation re-exports all methods through a unified evidence-serialization pipeline, the resulting point estimates differ slightly from the frozen headline tables in the main text; its role is controlled explanation rather than headline replacement.

### H\.1 Internal Controlled Comparison

Table 18: Controlled RTK-style compression comparison on the two internal environments. "Reduction" is measured relative to the raw evidence text exposed by the corresponding uncompressed baseline pipeline.

| Method | Crafter EM | Crafter Tok. | Crafter Red. | Jericho EM | Jericho Tok. | Jericho Red. |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla RAG | 0.6158 | 581.89 | 0.00% | 0.6990 | 823.81 | 0.00% |
| Vanilla RAG + RTK-style | 0.5441 | 88.87 | 84.73% | 0.3060 | 179.64 | 78.19% |
| Graph-NoReader | 0.7045 | 1678.91 | 0.00% | 0.8905 | 1734.91 | 0.00% |
| Graph-NoReader + RTK-style | 0.5509 | 131.24 | 92.18% | 0.2861 | 167.10 | 90.37% |
| S3Mem | 0.7251 | 140.99 | 95.76% | 0.9502 | 165.05 | 94.89% |

The key result is not that generic compression fails to save tokens. It does save tokens, often aggressively. The key result is that generic compression cannot preserve the right evidence chain. On Crafter, both RTK-style baselines compress into the same broad budget regime as S3Mem but lose substantial accuracy. On Jericho, the contrast is much stronger: Graph-NoReader + RTK-style reaches nearly the same token level as S3Mem (167.10 vs. 165.05) yet collapses to 0.2861 EM, while S3Mem remains at 0.9502.
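For the RTK-style rows, the Reduction column follows from the token columns by taking the uncompressed row of the same pipeline as the raw reference. The sketch below reproduces those four percentages; the function name `reduction_pct` is ours, and the S3Mem row is left out because the table does not list the size of its raw reference.

```python
def reduction_pct(raw_tokens: float, compressed_tokens: float) -> float:
    """Percentage of evidence tokens removed relative to the raw evidence."""
    return 100.0 * (1.0 - compressed_tokens / raw_tokens)

# (raw tokens, compressed tokens) copied from Table 18.
rtk_rows = {
    ("Crafter", "Vanilla RAG + RTK-style"): (581.89, 88.87),
    ("Crafter", "Graph-NoReader + RTK-style"): (1678.91, 131.24),
    ("Jericho", "Vanilla RAG + RTK-style"): (823.81, 179.64),
    ("Jericho", "Graph-NoReader + RTK-style"): (1734.91, 167.10),
}

reductions = {
    key: round(reduction_pct(raw, compressed), 2)
    for key, (raw, compressed) in rtk_rows.items()
}
```

The recomputed values agree with the table to two decimals, which confirms that the reduction is measured per pipeline rather than against a single shared baseline.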

### H\.2 Hardest\-Family Breakdown

Table 19: Hardest-family comparison under the RTK-style validation. Numbers in parentheses are family sample counts.

| Environment | Family | Graph-NoReader | Graph-NoReader + RTK-style | S3Mem |
| --- | --- | --- | --- | --- |
| Crafter | Logical / aggregation | 0.5672 (476) | 0.4391 (476) | 0.6029 (476) |
| Crafter | Multi-hop | 0.5395 (76) | 0.0658 (76) | 0.5658 (76) |
| Crafter | Spatial | 0.6038 (318) | 0.3711 (318) | 0.6384 (318) |
| Crafter | Time | 0.4579 (190) | 0.0684 (190) | 0.5158 (190) |
| Jericho | inventory / acquisition chain | 0.6500 (20) | 0.4500 (20) | 0.9000 (20) |
| Jericho | location chain | 0.9250 (40) | 0.1250 (40) | 1.0000 (40) |
| Jericho | temporal_interval | 0.9333 (30) | 0.1000 (30) | 1.0000 (30) |
| Jericho | temporal_offset | 1.0000 (80) | 0.0375 (80) | 1.0000 (80) |

The largest collapses occur exactly on the families that most require anchor preservation, local ordering, and chain-complete context: Crafter multi-hop, time, spatial, and logical aggregation; Jericho temporal offset, temporal interval, location chain, and inventory-acquisition chain. The stress test therefore supports a sharper claim than "compression hurts a bit": generic compression systematically destroys the evidence structure needed by the hardest episodic reasoning families.
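To see where the collapse is concentrated, the per-family EM lost when RTK-style compression is applied on top of Graph-NoReader can be ranked. This is a small sketch over the Table 19 values; the tuple layout and the name `ranked_drops` are illustrative.

```python
# (environment, family, Graph-NoReader EM, Graph-NoReader + RTK-style EM),
# copied from Table 19.
families = [
    ("Crafter", "Logical / aggregation", 0.5672, 0.4391),
    ("Crafter", "Multi-hop", 0.5395, 0.0658),
    ("Crafter", "Spatial", 0.6038, 0.3711),
    ("Crafter", "Time", 0.4579, 0.0684),
    ("Jericho", "inventory / acquisition chain", 0.6500, 0.4500),
    ("Jericho", "location chain", 0.9250, 0.1250),
    ("Jericho", "temporal_interval", 0.9333, 0.1000),
    ("Jericho", "temporal_offset", 1.0000, 0.0375),
]

# Absolute EM lost to generic compression, largest first.
ranked_drops = sorted(
    ((env, fam, round(raw - rtk, 4)) for env, fam, raw, rtk in families),
    key=lambda row: row[2],
    reverse=True,
)

for env, fam, drop in ranked_drops:
    print(f"{env:8s} {fam:30s} -{drop:.4f}")
```

The three largest drops are all Jericho chain or temporal families (temporal_offset, temporal_interval, location chain), followed by Crafter multi-hop.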

## Appendix I Crafter Four-Actor Rollout Robustness

This appendix section supports the trajectory\-source robustness claim in the main text and does not replace the internal headline\. Its purpose is narrower: to test whether the main memory\-method story survives trajectory\-distribution shift induced by heterogeneous rollout actor families while keeping the environment, seeds, step budget, QA construction protocol, and final shared answerer fixed\.

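Table 20: Crafter four-actor rollout robustness, per-actor results.

| Rollout actor | Method | EM | Avg. Tokens |
| --- | --- | --- | --- |
| GPT-5.4 | No-Memory | 0.3242 | 486.48 |
| GPT-5.4 | Vanilla RAG | 0.4258 | 715.86 |
| GPT-5.4 | A-MEM-inspired | 0.4643 | 142.76 |
| GPT-5.4 | MemoryOS-adapted | 0.4835 | 144.00 |
| GPT-5.4 | LightMem-adapted | 0.5467 | 112.00 |
| GPT-5.4 | Graph-NoReader | 0.5357 | 2091.76 |
| GPT-5.4 | S3Mem | 0.5440 | 138.10 |
| Qwen3-VL-235B A22B-Instruct | No-Memory | 0.3105 | 445.31 |
| Qwen3-VL-235B A22B-Instruct | Vanilla RAG | 0.4957 | 924.32 |
| Qwen3-VL-235B A22B-Instruct | A-MEM-inspired | 0.5413 | 142.40 |
| Qwen3-VL-235B A22B-Instruct | MemoryOS-adapted | 0.5071 | 144.00 |
| Qwen3-VL-235B A22B-Instruct | LightMem-adapted | 0.5499 | 112.00 |
| Qwen3-VL-235B A22B-Instruct | Graph-NoReader | 0.5812 | 1760.33 |
| Qwen3-VL-235B A22B-Instruct | S3Mem | 0.6011 | 150.72 |
| GLM-4.6V | No-Memory | 0.3276 | 383.87 |
| GLM-4.6V | Vanilla RAG | 0.4701 | 749.64 |
| GLM-4.6V | A-MEM-inspired | 0.5185 | 142.47 |
| GLM-4.6V | MemoryOS-adapted | 0.5214 | 143.69 |
| GLM-4.6V | LightMem-adapted | 0.5385 | 111.76 |
| GLM-4.6V | Graph-NoReader | 0.6068 | 1830.54 |
| GLM-4.6V | S3Mem | 0.6182 | 154.46 |
| Doubao-1.5 Vision-Pro | No-Memory | 0.3278 | 535.24 |
| Doubao-1.5 Vision-Pro | Vanilla RAG | 0.4656 | 764.01 |
| Doubao-1.5 Vision-Pro | A-MEM-inspired | 0.5840 | 142.48 |
| Doubao-1.5 Vision-Pro | MemoryOS-adapted | 0.5372 | 144.00 |
| Doubao-1.5 Vision-Pro | LightMem-adapted | 0.6198 | 112.00 |
| Doubao-1.5 Vision-Pro | Graph-NoReader | 0.5730 | 1999.28 |
| Doubao-1.5 Vision-Pro | S3Mem | 0.6088 | 151.27 |

Two conclusions matter most. First, actor-induced trajectory shift does change the local top-method ordering: LightMem slightly exceeds S3Mem on the GPT-5.4 and Doubao-1.5 Vision-Pro rollouts (0.5467 vs. 0.5440 and 0.6198 vs. 0.6088), while S3Mem leads on the other two. Second, the broader efficiency story is much more stable than the exact ranking: Graph-NoReader remains extremely expensive under all four actor families, while S3Mem and LightMem remain in a much lower-token regime with substantially stronger EM than Vanilla RAG.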
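Both conclusions can be read off the per-actor numbers mechanically. The sketch below uses four of the seven methods from Table 20 and reports, for each rollout actor, the best EM among those four and the token-cost ratio of Graph-NoReader to S3Mem; the variable names are illustrative.

```python
# (EM, average tokens) copied from Table 20 for four of the seven methods.
results = {
    "GPT-5.4": {
        "Vanilla RAG": (0.4258, 715.86),
        "LightMem-adapted": (0.5467, 112.00),
        "Graph-NoReader": (0.5357, 2091.76),
        "S3Mem": (0.5440, 138.10),
    },
    "Qwen3-VL-235B A22B-Instruct": {
        "Vanilla RAG": (0.4957, 924.32),
        "LightMem-adapted": (0.5499, 112.00),
        "Graph-NoReader": (0.5812, 1760.33),
        "S3Mem": (0.6011, 150.72),
    },
    "GLM-4.6V": {
        "Vanilla RAG": (0.4701, 749.64),
        "LightMem-adapted": (0.5385, 111.76),
        "Graph-NoReader": (0.6068, 1830.54),
        "S3Mem": (0.6182, 154.46),
    },
    "Doubao-1.5 Vision-Pro": {
        "Vanilla RAG": (0.4656, 764.01),
        "LightMem-adapted": (0.6198, 112.00),
        "Graph-NoReader": (0.5730, 1999.28),
        "S3Mem": (0.6088, 151.27),
    },
}

# Best EM among the listed methods, per rollout actor.
top_method = {
    actor: max(rows, key=lambda m: rows[m][0]) for actor, rows in results.items()
}

# How many times more evidence tokens Graph-NoReader uses than S3Mem.
graph_vs_s3mem_tokens = {
    actor: round(rows["Graph-NoReader"][1] / rows["S3Mem"][1], 2)
    for actor, rows in results.items()
}

# S3Mem's EM margin over Vanilla RAG.
s3mem_gain_over_rag = {
    actor: round(rows["S3Mem"][0] - rows["Vanilla RAG"][0], 4)
    for actor, rows in results.items()
}
```

The top method alternates between LightMem-adapted and S3Mem across actors, whereas the token ratio stays above 11× and S3Mem's EM margin over Vanilla RAG stays above 0.10 for every actor.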

## Appendix J System Efficiency Addendum

This appendix section supports the "efficiency beyond tokens" statement in the main text and does not redefine the paper's primary efficiency metric. Token cost remains the primary efficiency metric because it directly measures the evidence interface seen by the answer-time layer. We additionally measure system-level quantities to check whether S3Mem hides a large implementation cost elsewhere.

Table 21: System-efficiency addendum beyond token cost.

| Env | Method | Ans. ms | Build ms | Parsed KB/ep. | Store KB/ep. |
| --- | --- | --- | --- | --- | --- |
| Crafter | Graph-NoReader | 52.73 | 321.88 | 1150.41 | 781.98 |
| Crafter | S3Mem | 56.27 | 322.03 | 1150.41 | 781.98 |
| Jericho | Graph-NoReader | 19.88 | 91.84 | 313.20 | 236.76 |
| Jericho | S3Mem | 20.67 | 91.14 | 313.20 | 236.76 |

These measurements do not suggest any hidden systems-regime shift. The main efficiency story therefore remains the same as in the paper body: S3Mem yields a better evidence interface and a better EM–token tradeoff without introducing a catastrophic answer-time or memory-build overhead.

## Appendix K Qualitative Case Studies

This appendix section supports the qualitative failure\-analysis paragraph in the main text and does not add a separate narrative headline\. We summarize three representative cases that are especially well aligned with the paper’s claims\.

#### Crafter occurrence anchor.

The question asks for the first step whose reason mentions "wood." Graph-NoReader answers correctly on the raw evidence, but Graph-NoReader plus RTK-style compression predicts a later step because the compressor merges away the decisive first occurrence. S3Mem keeps the correct early anchor and answers correctly. This is the cleanest internal example of why occurrence-sensitive retrieval and packing matter.

#### Jericho temporal-offset chain.

The question asks for the action executed one step after the first rod-related event. RTK-style compression keeps the rod mention itself but loses the adjacent action-supporting context, leading to a "not answerable" failure. S3Mem preserves the local two-step neighborhood around the trigger event and answers correctly. This case is representative of the temporal-offset and location-chain families where generic compression collapses most sharply.
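The failure pattern in this case can be illustrated with a toy trajectory. The following is a hypothetical sketch, not the S3Mem or RTK-style implementation: the step format, the example actions, and both function names are assumptions made for illustration. It contrasts a generic keyword filter with a selection that keeps the first matching step together with the step after it.

```python
from typing import List, Tuple

Step = Tuple[int, str]  # (step index, action text); a hypothetical trajectory format

def keyword_filter(steps: List[Step], term: str) -> List[Step]:
    """Generic filtering: keep only the steps that mention the term."""
    return [step for step in steps if term in step[1].lower()]

def anchor_neighborhood(steps: List[Step], term: str, offset: int = 1) -> List[Step]:
    """Keep the first step that mentions the term plus the next `offset` steps."""
    for i, (_, text) in enumerate(steps):
        if term in text.lower():
            return steps[i : i + offset + 1]
    return []

# Made-up trajectory for illustration only.
trajectory: List[Step] = [
    (0, "look"),
    (1, "take rod"),
    (2, "go north"),
    (3, "wave rod"),
    (4, "go south"),
]

filtered = keyword_filter(trajectory, "rod")
neighborhood = anchor_neighborhood(trajectory, "rod", offset=1)
answer = neighborhood[-1][1] if len(neighborhood) == 2 else None
```

In this toy example the keyword filter keeps both rod mentions but drops step 2, so the action one step after the first rod event is no longer present in the evidence, while the anchored selection retains exactly the two steps the question needs.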

#### ALFWorld second-occurrence chain.

The question asks for the action executed one step after the second time the agent obtained an alarm clock. The graph baseline retrieves many superficially relevant steps but misses the exact second-occurrence neighborhood. S3Mem keeps the compact two-step evidence chain and answers correctly. This external case is important because it shows that the same anchor-preserving pattern is not confined to the internal benchmark family.

## Appendix L Parser/Executor Boundary Support

This appendix section supports the boundary analysis around the answer-time layer and does not reframe the paper as a parser/executor method. The parser/executor analysis serves as a risk-closing side thread rather than a primary method contribution. On held-out question families, the learned factorized parser achieves: aggregation execution EM = 0.9027, temporal_interval execution EM = 0.9104, and temporal_offset (v4) execution EM = 0.8765. These results show that the least general part of the current system lies primarily in the answer-time layer, and that once supervision noise is removed, programmatic routes can recover most of the hardest-family performance.

## Appendix M Post-Freeze Strengthening

This appendix section records two post\-freeze support notes and does not alter the main claim boundary used in the paper body\.

#### Retrieval last-mile on Crafter.

A limited retrieval sweep on Crafter improves S3Mem from 0.7251 to 0.7435 with only a marginal token-cost increase (140.99 → 144.02), indicating that the environment remains partially retrieval-limited and that low-cost gains are still available.

#### Hardest\-family risk closing\.

On the hardest parser-side family (temporal_offset), the learned factorized parser reaches 0.8765 full held-out execution EM in the v4 setting, approaching the gold-program ceiling of 0.9187. This supports the interpretation that the remaining non-generality is concentrated in a narrow answer-time branch rather than in the memory core.
