Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents
Summary
Researchers from University of Toronto and Vector Institute propose Segment Tree Memory (SegTreeMem), a memory architecture for long-horizon conversational agents that preserves temporal order using a hierarchical segment tree structure for both online construction and retrieval. Experiments across three datasets show nearly 20% improvement in LLM-judge accuracy over non-temporal tree baselines.
View Cached Full Text
Cached at: 06/05/26, 02:15 AM
# Temporal Order Matters for Agentic Memory: Segment Trees for Memory Construction and Retrieval
Source: [https://arxiv.org/html/2606.04555](https://arxiv.org/html/2606.04555)
Yifan Simon Liu1,Liam Gallagher1,Faeze Moradi Kalarde1, Jiazhou Liang1,Armin Toroghi1,Scott Sanner1,2 1University of Toronto 2Vector Institute for Artificial Intelligence yifanliu\.liu@mail\.utoronto\.ca
###### Abstract
Long\-horizon conversational agents need to interact with users through evolving events, tasks, and goals, where utterances within a period often center on a shared topic before the conversation shifts to a new context\. Such histories can be naturally represented as a temporally ordered segment tree, where sequential utterances compose into higher\-level segments\. However, existing tree\-based memory systems mostly focus on topically structured hierarchies and often ignore temporal sequencing, leaving two key design questions for temporally ordered memory underexplored: \(i\) how to update a temporally ordered tree online as new utterances arrive, and \(ii\) how to exploit the resulting temporal hierarchy along with topical structure during retrieval\. We introduce Segment Tree Memory \(SegTreeMem\), a memory architecture that represents conversation history as a Segment Tree over utterances while preserving temporal order\. To support online updates,SegTreeMemincrementally incorporates new utterances into the Segment Tree while maintaining temporally contiguous interaction segments\. For retrieval,SegTreeMempropagates relevance signals through the tree, enabling the model to integrate information across related parts of the conversation and identify query\-relevant context more effectively\. Our experiments show that across three datasets and two LLM backbones,SegTreeMemimproves LLM\-judge accuracy by nearly 20% over non\-temporal tree baselines and other strong memory baselines\. These results support the view that long\-horizon conversational agents benefit from memory indexes that are*both*temporal and hierarchical\.
## 1Introduction
Long\-horizon conversational agents interact with users over multi\-turn dialogues of sequential user and agent utterances\[[24](https://arxiv.org/html/2606.04555#bib.bib24),[40](https://arxiv.org/html/2606.04555#bib.bib40),[10](https://arxiv.org/html/2606.04555#bib.bib10)\]\. To respond accurately to a query, the agent must retain and leverage relevant information from earlier utterances\. This necessitates a memory system that supports efficientonline constructionas utterances arrive to retain the full history, along with aretrieval mechanismthat accesses query\-relevant information to support response generation\[[30](https://arxiv.org/html/2606.04555#bib.bib30),[28](https://arxiv.org/html/2606.04555#bib.bib28),[5](https://arxiv.org/html/2606.04555#bib.bib5)\]\.
Figure 1:Memory tree representations\. Semantic trees may group non\-consecutive utterances, ignoring temporal order \(top\), whileSegTreeMempreserves temporal order by assigning each node a contiguous span of utterances \(bottom\)\.In real\-world interactions, conversations naturally organize into topically coherent and temporally ordered segments, where consecutive utterances share the same subject before transitioning to a new topic\[[7](https://arxiv.org/html/2606.04555#bib.bib7),[12](https://arxiv.org/html/2606.04555#bib.bib12)\]\. This implies that the relevance of a past utterance to a current query cannot be determined from its content alone, as utterances within a segment are semantically interrelated and their meaning depends on surrounding utterances within the same segment\. A memory system that does not account for temporal order risks retrieving semantically similar yet contextually misaligned information\. This makes tree\-based memory structures a natural choice, as they can organize conversations into hierarchically nested, temporally ordered segments while preserving local contextual dependencies\.
However, existing tree\-based memory systems tend to focus on topically structured hierarchies and often place less emphasis on temporal sequencing, leaving open two key design questions for temporally ordered tree memory: \(i\)How should the tree be constructed online to index incoming utterances in real\-time while preserving topical coherence and temporal order?\(ii\)How should retrieval exploit both topical and temporal organization in the resulting tree structure?
To preserve the temporal order of utterances in memory, we use a Segment Tree\[[6](https://arxiv.org/html/2606.04555#bib.bib6)\], where each node represents a contiguous span of utterances, and internal nodes recursively partition their spans into ordered, non\-overlapping sub\-spans\. Building on the Segment Tree structure, we introduce Segment Tree Memory \(SegTreeMemshown in[Figure˜1](https://arxiv.org/html/2606.04555#S1.F1)\)\. Specifically, our contributions are as follows:
- •SegTreeMeminserts each new utterance by updating only a small set of nodes along the rightmost frontier, supporting real\-time tree construction while preserving temporal order and coherent node representations \([Figure˜2](https://arxiv.org/html/2606.04555#S4.F2)\)\.
- •SegTreeMememploys a structure\-aware retrieval mechanism that uses the tree as a temporal and hierarchical memory index and formulates retrieval as relevance propagation over the tree via a transition matrix, combining local semantic matching with hierarchical topical structure and temporal dependencies \([Figure˜3](https://arxiv.org/html/2606.04555#S4.F3)\)\.
- •We empirically evaluateSegTreeMemon three memory datasets and two LLM backbones, showing consistent accuracy improvements over existing baselines and demonstrating the benefit of temporal and hierarchical memory indexes for conversational agents\.
## 2Problem Definition
We consider a conversational agent that interacts with a user through a sequence of utterances\. Letxtx\_\{t\}denote thett\-th utterance\. The conversation history afterttutterances isXt=\(x1,…,xt\)X\_\{t\}=\(x\_\{1\},\\ldots,x\_\{t\}\)\.
A long\-horizon agent maintains an external memory state throughout the conversation\. LetMtM\_\{t\}denote the memory state after processing the firstttutterances\. In this work, we focus on two components of memory: online memory construction and the retrieval from memory, which are described below\.
##### Online Memory Construction\.
When a new utterancext\+1x\_\{t\+1\}arrives, the memory state is updated asMt\+1=𝒰\(Mt,xt\+1\)M\_\{t\+1\}=\\mathcal\{U\}\(M\_\{t\},x\_\{t\+1\}\), where𝒰\\mathcal\{U\}denotes the online update operator\. An effective𝒰\\mathcal\{U\}should satisfy:
1. \(i\)Incremental update:It incorporatesxt\+1x\_\{t\+1\}intoMtM\_\{t\}without rebuilding memory from scratch\.
2. \(ii\)Efficient update scope:Let𝒞t\\mathcal\{C\}\_\{t\}andΔt\\Delta\_\{t\}denote the memory units inspected and modified during update\. For real\-time interaction,\|𝒞t\|,\|Δt\|≪\|Mt\|\|\\mathcal\{C\}\_\{t\}\|,\|\\Delta\_\{t\}\|\\ll\|M\_\{t\}\|, ideally sublinear in\|Mt\|\|M\_\{t\}\|\.
3. \(iii\)Structural and temporal preservation:The updated memoryMt\+1M\_\{t\+1\}should preserve structural organization and conversation order\. Memory units may be grouped or abstracted, but not reordered, i\.e\.,i<j⇒xii<j\\Rightarrow x\_\{i\}precedesxjx\_\{j\}inMt\+1M\_\{t\+1\}\.
##### Retrieval from Memory\.
Given a user queryqq, the agent retrieves evidenceO=ℛ\(q,Mt\)O=\\mathcal\{R\}\(q,M\_\{t\}\)from the current memory state, whereℛ\\mathcal\{R\}is the retrieval operator\. An effectiveℛ\\mathcal\{R\}should satisfy:
1. \(i\)Topical relevance:LetR\(q,m\)R\(q,m\)denote the topical relevance between queryqqand memory unitmm\. Retrieval should favor memory units with highR\(q,m\)R\(q,m\)\.
2. \(ii\)Temporal and structural context:A topical match may not be sufficient alone\. Let𝒩t\(m\)\\mathcal\{N\}\_\{t\}\(m\)denote the structurally and temporally admissible neighborhood ofmminMtM\_\{t\}\. Retrieval should use topical matches as memory entry points and𝒩t\(m\)\\mathcal\{N\}\_\{t\}\(m\)to expand the evidence\.
## 3Related Work
##### Memory construction\.
RAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]and LATTICE\[[8](https://arxiv.org/html/2606.04555#bib.bib8)\]build semantic abstraction trees through clustering\. MemWalker\[[4](https://arxiv.org/html/2606.04555#bib.bib4)\]preserves document order by grouping adjacent segments into fixed\-size units\. MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]builds a semantic tree by inserting new memories into the most similar branch\. Among these methods, only MemTree explicitly supports online memory update\. However, its similarity\-based insertion prioritizes topical relatedness and does not preserve utterance order in the tree structure\. Conversely, MemWalker preserves temporal order but is designed for static contexts rather than online updates\.SegTreeMemsupports both\.
##### Retrieval from memory\.
Existing retrieval methods vary in whether and how they utilize the tree structure during retrieval\. RAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]and MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]perform*collapsed retrieval*, where all tree nodes are treated as independent candidates\. Although this approach supports topical relevance, it does not explicitly use structural or temporal context during retrieval\. MemWalker\[[4](https://arxiv.org/html/2606.04555#bib.bib4)\]and LATTICE\[[8](https://arxiv.org/html/2606.04555#bib.bib8)\]perform*LLM\-guided traversal*, where an LLM navigates the tree hierarchy\. Although these methods leverage the top\-down tree structure, they do not exploit other tree\-induced relations or temporal order during retrieval\. In contrast, the retrieval mechanism inSegTreeMemexploits both the tree structure and temporal order to identify additional relevant context\.
\\rowcolorheaderpurple\\cellcolorwhiteConstructionRetrieval\\rowcolorheaderpurple\\cellcolorwhiteBuildOnlinePreserves orderRetrievalUses hierarchyUses orderRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]Clustering×\\times×\\timesCollapsed×\\times×\\timesMemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]Semantic insertion✓\\checkmark×\\timesCollapsed×\\times×\\timesMemWalker\[[4](https://arxiv.org/html/2606.04555#bib.bib4)\]Fixed ordered grouping×\\times✓\\checkmarkLLM\-guided traversal✓\\checkmark✓\\checkmarkLATTICE\[[8](https://arxiv.org/html/2606.04555#bib.bib8)\]Clustering×\\times×\\timesLLM\-guided traversal✓\\checkmark×\\times\\rowcolorlightpurpleSegTreeMem\(Ours\)Segment Tree✓\\checkmark✓\\checkmarkRelevance propagation✓\\checkmark✓\\checkmark
Table 1:Comparison of tree\-based memory methods through the desiderata in[Section˜2](https://arxiv.org/html/2606.04555#S2)\. UnderConstruction,Onlineindicates whether the method defines an online update rule for incorporating a new utterance into an existing memory state, andPreserves orderindicates whether construction explicitly preserves conversation order\. UnderRetrieval,Uses hierarchyindicates whether retrieval uses the tree structure rather than treating nodes as independent candidates, andUses orderindicates whether retrieval can exploit temporal order when selecting evidence\.
## 4Methodology
A Segment Tree\[[6](https://arxiv.org/html/2606.04555#bib.bib6)\]provides a principled data structure that satisfies the desiderata of online memory construction in[Section˜2](https://arxiv.org/html/2606.04555#S2)\. It maintains a hierarchical decomposition over contiguous time intervals, where \(i\) incorporatingxt\+1x\_\{t\+1\}reduces to attaching a rightmost leaf and updating nodes along a single root\-to\-leaf path, avoiding global reconstruction; \(ii\) the structure restricts inspection and modification toO\(log\|Mt\|\)O\(\\log\|M\_\{t\}\|\)nodes under balanced tree; \(iii\) each node encodes a contiguous segment, and the left\-to\-right leaf order follows chronology, preserving temporal consistency\.
### 4\.1Conversation Segment Tree
Given the temporally ordered utterance sequenceX=\(x1,…,xT\)X=\(x\_\{1\},\\dots,x\_\{T\}\), wherextx\_\{t\}denotes thett\-th utterance, we representXXas a rooted Segment TreeT=\(V,E,ρ\)T=\(V,E,\\rho\)to preserve temporal order for later retrieval \([Figure 1](https://arxiv.org/html/2606.04555#S1.F1), bottom\)\. Here,VVis the set of nodes,EEis the set of parent\-child edges, andρ∈V\\rho\\in Vis the root node\.
Each nodev∈Vv\\in Vis associated with a contiguous interval of the conversation history, denoted byI\(v\)=\[l\(v\),r\(v\)\]I\(v\)=\\big\[l\(v\),r\(v\)\\big\], wherel\(v\)l\(v\)andr\(v\)r\(v\)are the indices of the first and last utterances in the conversation interval, respectively, with1≤l\(v\)≤r\(v\)≤T1\\leq l\(v\)\\leq r\(v\)\\leq T\. Thus,vvrepresents the subsequence\(xl\(v\),xl\(v\)\+1,…,xr\(v\)\)\\big\(x\_\{l\(v\)\},x\_\{l\(v\)\+1\},\\dots,x\_\{r\(v\)\}\\big\)in the conversation history\. The tree comprises three types of nodes\.
##### Root node\.
The root nodeρ\\rhois associated with the entire conversation history, i\.e\.,I\(ρ\)=\[1,T\]I\(\\rho\)=\[1,T\]\.
##### Internal nodes\.
Each internal nodevvis associated with an interval equal to the union of the intervals of its child nodes\. Let𝖼𝗁\(v\)=\(c1,…,cm\)\\mathsf\{ch\}\(v\)=\(c\_\{1\},\\dots,c\_\{m\}\)denote the ordered children of nodevv, indexed from left to right\. These children have pairwise disjoint and consecutive intervals, i\.e\.,
r\(cj\)\+1=l\(cj\+1\),1≤j≤m−1\.\\displaystyle r\(c\_\{j\}\)\+1=l\(c\_\{j\+1\}\),\\qquad 1\\leq j\\leq m\-1\.\(1\)Then, the interval associated with nodevvisI\(v\)=⋃j=1m\[l\(cj\),r\(cj\)\]=\[l\(c1\),r\(cm\)\],I\(v\)=\\bigcup\_\{j=1\}^\{m\}\\big\[l\(c\_\{j\}\),r\(c\_\{j\}\)\\big\]=\\big\[l\(c\_\{1\}\),r\(c\_\{m\}\)\\big\],where the last equality follows from the fact that intervals associated with consecutive child nodes are adjacent and non\-overlapping\. Therefore, the interval associated with every internal node is partitioned into consecutive non\-overlapping sub\-intervals, each corresponding to a child node\.
##### Leaf nodes\.
Each leaf node represents a single utterance\. Thus, for every leaf nodevv,I\(v\)=\[t,t\]I\(v\)=\[t,t\]\.
##### Node annotations\.
Each nodevvcarries an annotationA\(v\)A\(v\), which is an LLM\-generated summary of the conversation spanI\(v\)I\(v\)\.
### 4\.2Online Segment Tree Construction
In the agent setting, memory must be constructed online as the conversation unfolds\. GivenXt=\(x1,…,xt\)X\_\{t\}=\(x\_\{1\},\\dots,x\_\{t\}\), letTt=\(Vt,Et,ρt\)T\_\{t\}=\(V\_\{t\},E\_\{t\},\\rho\_\{t\}\)be the Segment Tree of the firstttutterances\. When a new utterancext\+1x\_\{t\+1\}arrives, our goal is to updateTtT\_\{t\}toTt\+1T\_\{t\+1\}while preserving the Segment Tree properties\.
##### Rightmost frontier\.
Let the rightmost frontier ofTtT\_\{t\}be denoted byFtF\_\{t\}, defined as the ordered collection of nodes whose associated intervals have right endpoint equal tott, i\.e\.,
Ft=𝖥𝗋𝗈𝗇𝗍𝗂𝖾𝗋\(Tt\)=\(f1,…,fm\),\{f1,…,fm\}=\{v∈Vt:r\(v\)=t\}\.\\displaystyle F\_\{t\}=\\operatorname\{\\mathsf\{Frontier\}\}\(T\_\{t\}\)=\(f\_\{1\},\\ldots,f\_\{m\}\),\\qquad\\\{f\_\{1\},\\ldots,f\_\{m\}\\\}=\\\{v\\in V\_\{t\}:r\(v\)=t\\\}\.\(2\)The frontier nodes are ordered from lower to higher levels:f1f\_\{1\}is the rightmost leaf andfmf\_\{m\}is the root\. Under the levelled\-tree invariant,FtF\_\{t\}contains exactly one active node per level\. Sincext\+1x\_\{t\+1\}has interval\[t\+1,t\+1\]\[t\+1,t\+1\], it can only attach to a node whose interval ends attt\. Thus, the admissible insertion candidates are precisely the non\-leaf nodes inFtF\_\{t\}, i\.e\.,Ft∖\{f1\}F\_\{t\}\\setminus\\\{f\_\{1\}\\\}, giving the update local scope rather than requiring a scan over all tree nodes\.
Figure 2:Online Segment Tree update\. For a new utterancext\+1x\_\{t\+1\}, the compatibility model selects a frontier nodev3v\_\{3\}, to which a subtree with leafxt\+1x\_\{t\+1\}is attached\.
##### Online tree update via rightmost frontier\.
To select an attachment node, we use a compatibility model, which either returns the non\-leaf frontier node most compatible with the incoming utterancext\+1x\_\{t\+1\}, or indicates that no compatible frontier node exists\. If a frontier node is selected, letfℓ∗f\_\{\\ell^\{\*\}\}, whereℓ∗∈\{2,…,m\}\\ell^\{\*\}\\in\\\{2,\\ldots,m\\\}, denote the selected node\. We create a chain ofℓ∗−2\\ell^\{\*\}\-2internal nodes above the new leafzt\+1z\_\{t\+1\}, each with interval\[t\+1,t\+1\]\[t\+1,t\+1\], and attach the chain top, orzt\+1z\_\{t\+1\}itself whenℓ∗=2\\ell^\{\*\}=2, as the rightmost child offℓ∗f\_\{\\ell^\{\*\}\}\. We then update the right endpoints and annotations offℓ∗f\_\{\\ell^\{\*\}\}and its ancestors\. If no frontier node is compatible, we create a new rootfm\+1f\_\{m\+1\}whose leftmost child is the previous root ofTtT\_\{t\}\. We also create a chain ofm−1m\-1internal nodes abovezt\+1z\_\{t\+1\}, each with interval\[t\+1,t\+1\]\[t\+1,t\+1\], and attach the chain top, orzt\+1z\_\{t\+1\}itself whenm=1m=1, tofm\+1f\_\{m\+1\}\. The interval offm\+1f\_\{m\+1\}is then updated to\[1,t\+1\]\[1,t\+1\], and its annotation is created accordingly\. The algorithm is visualized in[Figure˜2](https://arxiv.org/html/2606.04555#S4.F2)\.
##### Design choices for memory construction\.
The design choices for online memory construction center on the compatibility model, which determines where a new utterance is inserted\. We consider three forms of compatibility\. \(i\)*Cosine similarity*selects the frontier node whose annotation embedding is most similar to the new utterance\. \(ii\)*Pointwise LLM*\[[17](https://arxiv.org/html/2606.04555#bib.bib17),[18](https://arxiv.org/html/2606.04555#bib.bib18),[20](https://arxiv.org/html/2606.04555#bib.bib20),[14](https://arxiv.org/html/2606.04555#bib.bib14)\]independently scores each frontier node annotation against the new utterance and selects the most compatible node\. \(iii\)*Batch LLM*\[[23](https://arxiv.org/html/2606.04555#bib.bib23),[35](https://arxiv.org/html/2606.04555#bib.bib35),[42](https://arxiv.org/html/2606.04555#bib.bib42)\]presents all frontier node annotations with the new utterance to an LLM, which jointly compares candidates and selects the most compatible node\.
Figure 3:Two score propagation policies: top: top\-down, where scores propagate from parents to children; bottom: bottom\-up, where scores propagate from children to parents\. The matrices illustrate the corresponding propagation operators\. Blue arrays denote normalized initial local scores, purple arrays denote scores after one propagation step, and gray arrays denote the final scores obtained by combining the initial and propagated scores withα=0\.1\\alpha=0\.1\.
### 4\.3Retrieval from the Memory Tree
Our retrieval operator follows a simple principle: identify memory units relevant to the query, then expand around them to gather context\. This is motivated by the fact that relevant information typically appears within coherent temporal spans rather than in isolation\. Accordingly, the operator satisfies the retrieval desiderata in[Section˜2](https://arxiv.org/html/2606.04555#S2): \(i\) it assigns initial scores that give higher values to more relevant nodes, selecting them as entry points; \(ii\) it propagates these scores over the tree, allowing high\-scoring nodes to expand into their structurally and temporally admissible neighborhoods\.
#### 4\.3\.1Structure\-Aware Score Propagation
To incorporate the hierarchical structure of the memory tree, we propagate relevance scores through structurally meaningful connections using a matrix\-based multi\-hop formulation, inspired by classical graph\-based ranking methods such as PageRank\[[29](https://arxiv.org/html/2606.04555#bib.bib29),[11](https://arxiv.org/html/2606.04555#bib.bib11),[37](https://arxiv.org/html/2606.04555#bib.bib37)\]\.
Formally, a propagation policyPPinduces a sparse propagation matrixWP∈\[0,1\]\|V\|×\|V\|W\_\{P\}\\in\[0,1\]^\{\|V\|\\times\|V\|\}, where each entry\[WP\]uv=Pr\(u∣v,q\)\[W\_\{P\}\]\_\{uv\}=\\Pr\(u\\mid v,q\)gives the probability of propagating relevance from nodevvto nodeuugiven queryqq\. Thus,\[WP\]uv\[W\_\{P\}\]\_\{uv\}captures the contribution of nodevvto nodeuuin a single propagation step, while repeated applications ofWPW\_\{P\}enable multi\-hop aggregation over the tree structure\.
Two canonical propagation policies are shown in[Figure˜3](https://arxiv.org/html/2606.04555#S4.F3): \(i\)*top\-down*, which distributes relevance from a node to its children, and \(ii\)*bottom\-up*, which propagates relevance from a node to its parent\.
##### Initial score normalization\.
Given a queryqq, retrieval aims to identify nodes whose associated conversation spans are relevant for answeringqq\. Each nodev∈Vv\\in Vis represented by its annotationA\(v\)A\(v\), from which we compute a local relevance scorerq\(v\)=R\(q,A\(v\)\)∈ℝ\.r\_\{q\}\(v\)=R\(q,A\(v\)\)\\in\\mathbb\{R\}\.We normalize these scores to obtain an initial query distribution assq\(0\)\(v\)=rq\(v\)∑u∈Vrq\(u\)\.s\_\{q\}^\{\(0\)\}\(v\)=\\frac\{r\_\{q\}\(v\)\}\{\\sum\_\{u\\in V\}r\_\{q\}\(u\)\}\.
This normalization ensures thatsq\(0\)s\_\{q\}^\{\(0\)\}forms a valid probability distribution, which stabilizes subsequent matrix\-based propagation and prevents scale explosion or vanishing across multiple hops\. Selecting nodes directly based onsq\(0\)s\_\{q\}^\{\(0\)\}recovers*collapsed\-tree*retrieval\[[31](https://arxiv.org/html/2606.04555#bib.bib31),[34](https://arxiv.org/html/2606.04555#bib.bib34)\], which ignores the tree structure and treats nodes independently\.
##### Score propagation\.
Starting from the initial distributionsq\(0\)∈ℝ\|V\|s\_\{q\}^\{\(0\)\}\\in\\mathbb\{R\}^\{\|V\|\}, we define thekk\-hop structural distribution via iterative propagation as
sq\(k\)=WPsq\(k−1\)=WPksq\(0\),k=1,…,H\.\\displaystyle s\_\{q\}^\{\(k\)\}=W\_\{P\}s\_\{q\}^\{\(k\-1\)\}=W\_\{P\}^\{k\}s\_\{q\}^\{\(0\)\},\\qquad k=1,\\ldots,H\.\(3\)Each multiplication byWPW\_\{P\}corresponds to one structural transition along the tree\.
##### Final retrieval score\.
We define the final structure\-aware retrieval score as a finite\-horizon combination of the local and propagated distributions:
s~q=∑k=0Hαksq\(k\)∑k=0Hαk,0≤α<1,\\displaystyle\\tilde\{s\}\_\{q\}=\\frac\{\\sum\_\{k=0\}^\{H\}\\alpha^\{k\}s\_\{q\}^\{\(k\)\}\}\{\\sum\_\{k=0\}^\{H\}\\alpha^\{k\}\},\\qquad 0\\leq\\alpha<1,\(4\)
whereα\\alphacontrols the relative contribution of longer propagation paths\. The resulting vectors~q∈ℝ\|V\|\\tilde\{s\}\_\{q\}\\in\\mathbb\{R\}^\{\|V\|\}assigns a structure\-aware relevance score to each node\. The retriever then selects the top\-KKnodes with the largest values ins~q∈ℝ\|V\|\\tilde\{s\}\_\{q\}\\in\\mathbb\{R\}^\{\|V\|\}\.
##### Design choices for retrieval\.
We consider the following design choices for the retrieval mechanism: the relevance scorerR\(⋅,⋅\)R\(\\cdot,\\cdot\), the propagation policyP∈\{no\-propagation,top\-down,bottom\-up\}P\\in\\\{\\mathrm\{no\\text\{\-\}propagation\},\\mathrm\{top\\text\{\-\}down\},\\mathrm\{bottom\\text\{\-\}up\}\\\}, the induced transition matrixWPW\_\{P\}, the propagation horizonHH, and the decay factorα\\alpha\.
## 5Experiments
### 5\.1Research Questions
RQ1: Does temporal tree construction improve final performance?We compare Segment Tree construction with non\-temporal similarity clustering trees under a temporal\-order permutation experiment with matched retrieval settings to evaluate whether preserving temporal order in the constructed memory tree improves final performance\.
RQ2: Which online construction strategy gives the best performance\-cost tradeoff?We compare cosine similarity, pointwise LLM, and batch LLM as compatibility models for online Segment Tree construction, and evaluate their final performance together with tree construction latency and cost\.
RQ3: How do tree\-aware retrieval policies affect retrieval quality?We compare no\-propagation, top\-down, and bottom\-up retrieval policies, and study how propagation horizonHHand decay factorα\\alphacontribute to each policy’s final performance\.
RQ4: How does fullSegTreeMemperform against existing memory baselines?We compare the fullSegTreeMemsystem against flat retrieval, tree\-based memory, and structured memory baselines to evaluate its end\-to\-end conversational memory performance\.
### 5\.2Datasets and Evaluation
We evaluate on three long\-horizon memory benchmarks:LoCoMowith 10 conversation groups and 1,986 queries\[[24](https://arxiv.org/html/2606.04555#bib.bib24)\],LongMemEval\-MABwith 5 context groups and 300 queries\[[40](https://arxiv.org/html/2606.04555#bib.bib40),[10](https://arxiv.org/html/2606.04555#bib.bib10)\], andRealMemwith 10 personas and 1,415 queries\[[3](https://arxiv.org/html/2606.04555#bib.bib3)\]\. LongMemEval\-MAB denotes the MemoryAgentBench reformulation of LongMemEval\(S\), where memory is built over an extended dialogue and queried repeatedly\. We usetext\-embedding\-3\-smallfor all embedding\-based retrieval and similarity computations\. We report LLM\-judge accuracy and token\-level F1\. For LLM\-dependent components, including LLM\-based memory construction and answer generation, we evaluate two backbones:gpt\-5\.4\-mini\[[27](https://arxiv.org/html/2606.04555#bib.bib27)\]andqwen\-3\.5\-flash\[[1](https://arxiv.org/html/2606.04555#bib.bib1)\]\. Answer correctness is judged bygpt\-4o\-mini\[[25](https://arxiv.org/html/2606.04555#bib.bib25)\]\. All methods use the same evidence budgetK=10K=10\. We only reportgpt\-5\.4\-miniresults in the main paper\. Fullqwen\-3\.5\-flashresults are provided in[AppendixD\.1](https://arxiv.org/html/2606.04555#A4.SS1)and category\-level breakdowns are provided in[AppendixD\.2](https://arxiv.org/html/2606.04555#A4.SS2)\.
### 5\.3Baselines and Configurations
We compare against three groups of external baselines\.Flat retrievalbaselines include BM25 and Dense retrieval\[[32](https://arxiv.org/html/2606.04555#bib.bib32),[26](https://arxiv.org/html/2606.04555#bib.bib26)\]\.Tree\-basedbaselines include RAPTOR and MemTree\[[34](https://arxiv.org/html/2606.04555#bib.bib34),[31](https://arxiv.org/html/2606.04555#bib.bib31)\]\.Structured memorybaselines include A\-MEM, Mem0, and HippoRAG\[[41](https://arxiv.org/html/2606.04555#bib.bib41),[5](https://arxiv.org/html/2606.04555#bib.bib5),[9](https://arxiv.org/html/2606.04555#bib.bib9)\]\. We do not include MemWalker and LATTICE as direct baselines because they do not provide readily reproducible implementations for our online conversational setting\[[4](https://arxiv.org/html/2606.04555#bib.bib4),[8](https://arxiv.org/html/2606.04555#bib.bib8)\]\. See[AppendixE](https://arxiv.org/html/2606.04555#A5)for full implementation details\.
##### SegTreeMemvariants\.
\\rowcolorlightpurpleDesign choiceValue space\\rowcolorlightpurple\!45Online constructionCompatibility modelCCcosine similarity, pointwise LLM, batch LLM\\rowcolorlightpurple\!45RetrievalLocal relevanceRRnormalized cosine similarityPropagation policyPPno\-propagation, top\-down, bottom\-upTransition matrixWPW\_\{P\}binary structural weights, column\-normalizedDecay factorα\\alpha\{\.05,\.10,\.20,\.30,\.50,\.70,\.95\}\\\{\.05,\.10,\.20,\.30,\.50,\.70,\.95\\\}Propagation horizonHH\{0,1,2,3\}\\\{0,1,2,3\\\}
Table 2:Experimental setting ofSegTreeMem\.111Anonymized implementation and reproduction instructions are available at[https://anonymous\.4open\.science/r/SegTreeMem\-B325/README\.md](https://anonymous.4open.science/r/SegTreeMem-B325/README.md)\.We evaluateSegTreeMemby varying online construction and tree\-aware retrieval settings, as summarized in[Section˜5\.3](https://arxiv.org/html/2606.04555#S5.SS3.SSS0.Px1)\. For online construction, we compare three compatibility models: cosine similarity, pointwise LLM, and batch LLM\. For retrieval, we compute node\-query relevance withtext\-embedding\-3\-small, normalize the local scores into an initial query distribution, and compare three propagation policiesPP: no\-propagation, top\-down, and bottom\-up\. In our experiments,WPW\_\{P\}is set by assigning binary raw weights to the allowed source\-target transitions induced byPP\([Figure˜3](https://arxiv.org/html/2606.04555#S4.F3)\), and then column\-normalizing these weights\.
### 5\.4Results
RQ1: Does temporal tree construction improve final performance?
Figure 4:Controlled comparison of tree construction strategies\. We compare a non\-temporal similarity clustering tree with Segment Tree variants under matched retrieval settings\. BU and TD report the best bottom\-up and top\-down configurations\. Segment Tree construction consistently improves over the non\-temporal baseline, showing the benefit of preserving temporal order for memory construction\.##### Temporal trees improve final performance over non\-temporal baselines\.
In[Figure˜4](https://arxiv.org/html/2606.04555#S5.F4), we compare Segment Tree with a non\-temporal similarity clustering tree\. See[AppendixF](https://arxiv.org/html/2606.04555#A6)for the implementation of this baseline\. We control the node\-level summarization prompt and other implementation details, so the only difference is the online update algorithm\. Across the three datasets, Segment Tree variants consistently outperform the non\-temporal tree, and applying the same propagation policies to the non\-temporal tree does not close the gap\. This suggests that preserving temporal order during construction yields a stronger memory representation for conversational retrieval\.
##### Temporal tree construction improves internal\-node coherence\.
To investigate why temporal tree construction improves over the non\-temporal similarity clustering tree, we conduct a qualitative analysis of tree instances \([AppendixC](https://arxiv.org/html/2606.04555#A3)\)\. We find that Segment Tree internal nodes tend to be more semantically coherent and fluent because they summarize temporally ordered utterance spans\. In contrast, similarity clustering can group utterances from distant parts of a conversation under the same internal node, which may lead to less coherent summaries\.
\\rowcolorlightpurpleTreeOrderLoCoMoLMERealMemNon\-temp\.Orig\.0\.6030\.5950\.574Perm\.0\.5830\.5760\.523Δ\\Delta\-0\.020\-0\.019\-0\.051SegTreeMemOrig\.0\.6390\.6300\.580Perm\.0\.5380\.5230\.436Δ\\Delta\-0\.111\-0\.107\-0\.144Table 3:Temporal\-order permutation experiment under batch\-LLM construction and no\-propagation retrieval\. Scores are LLM\-judge accuracy\.
##### Temporal permutation confirms the contribution of temporal order\.
To test whether temporal order is actually used by the constructed memory, we conduct a temporal\-order permutation experiment by swapping 30% of randomly selected turn pairs before memory construction and rebuilding each tree from the permuted sequence\. We use batch\-LLM construction with no\-propagation retrieval\. As shown in[Table˜3](https://arxiv.org/html/2606.04555#S5.T3), permutation only mildly affects the non\-temporal tree but causes a much larger performance drop forSegTreeMem\. This indicates that preserving temporal order during construction contributes directly toSegTreeMem’s performance\.
RQ2: Which online construction strategy gives the best performance\-cost tradeoff?
##### Batch LLM gives the best construction quality\.
RQ2 compares the compatibility models used during online Segment Tree construction\. As shown in[Figure˜4](https://arxiv.org/html/2606.04555#S5.F4), batch LLM is consistently strong under no\-propagation and also receives larger gains from tree\-aware retrieval\. This suggests that batch LLM produces coherent memory nodes that perform well without propagation and yields a tree structure that can be better exploited by propagation\. This is consistent with findings in LLM\-based ranking that listwise comparison can better support relative judgments among candidates than independent pointwise scoring\[[23](https://arxiv.org/html/2606.04555#bib.bib23),[35](https://arxiv.org/html/2606.04555#bib.bib35)\]\.
##### Cosine provides the better quality\-cost tradeoff\.
Construction quality should be interpreted together with online cost \([Figure˜5](https://arxiv.org/html/2606.04555#S5.F5)\)\. Batch LLM gives the strongest quality, but it requires LLM\-based frontier scoring\. In contrast, cosine similarity uses no LLM calls for frontier matching, and the runtime analysis shows roughly10×10\\timeslower token usage and about5×5\\timeslower construction latency than batch LLM in the online construction trace\. Cosine similarity therefore provides the better quality\-cost trade\-off as it remains competitive in accuracy while better satisfying the real\-time update requirement\. Thus, batch LLM is preferable when construction quality is prioritized, while cosine similarity is the practical choice when latency or token budget is the main constraint\.
RQ3: How do tree\-aware retrieval policies affect retrieval quality?
Figure 5:Construction and retrieval efficiency as memory grows\. We report per\-utterance construction time, per\-utterance retrieval latency, and per\-utterance token usage for flat retrieval baselines, structured memory baselines, andSegTreeMemconstruction variants\.SegTreeMemremains efficient among structured memory methods, with cosine construction providing the lowest\-cost variant and LLM\-based construction remaining competitive in runtime and token usage\.Figure 6:Effect of propagation policy and decay factor on retrieval quality\. The dashed line denotes no\-propagation retrieval, corresponding toH=0H=0\. Curves show bottom\-up and top\-down propagation asα\\alphavaries, with performance averaged overH∈\{1,2,3\}H\\in\\\{1,2,3\\\}\. All results use batch LLM tree construction\. Moderate propagation often improves retrieval, while the stronger direction depends on the temporal granularity required by each dataset\.Figure 7:Retrieved\-node level distribution as a function of the propagation setting, using batch\-LLMSegTreeMemtrees and a fixed propagation horizonH=2H=2\. Stacked bars show the fraction of retrieved nodes at each dataset\-specific temporal level, and the dashed line shows the corresponding accuracy\. Bottom\-up propagation increasingly shifts retrieval toward broader internal nodes asα\\alphagrows, whereas top\-down propagation remains more concentrated on exchange\-level evidence\.
##### Retrieval behavior is dataset\-dependent\.
[Figure˜6](https://arxiv.org/html/2606.04555#S5.F6)shows that the effect of tree\-aware retrieval depends on the dataset, propagation policyPP, horizonHH, and decay factorα\\alpha\. No single propagation configuration dominates across all settings\. Top\-down propagation gives the strongest results on LoCoMo and LongMemEval\-MAB, while RealMem shows stronger gains from bottom\-up propagation\.
##### Propagation direction changes temporal granularity\.
To explain the dataset\-specific effects, we analyze how different propagation policies change the distribution of retrieved node levels in[AppendixD\.3](https://arxiv.org/html/2606.04555#A4.SS3)\. The analysis shows that different propagation policies shift the temporal granularity of the retrieved evidence\. Top\-down retrieval selects more specific spans, while bottom\-up retrieval increases the selection of internal nodes that cover broader temporal contexts\. This helps explain the dataset\-specific behavior in[Figure˜6](https://arxiv.org/html/2606.04555#S5.F6)\. LoCoMo contains many detail\-oriented questions, where retrieving specific utterance spans is more useful\. RealMem is more project\-oriented, where answers often depend on the broader state of an evolving interaction\. In real applications, this suggests that the propagation policy should be chosen according to the temporal granularity required by the task\.
##### Decay factorα\\alphaand horizonHHtune probability allocation\.
[Figure˜6](https://arxiv.org/html/2606.04555#S5.F6)shows that the best propagation setting differs across datasets and policies\. Moderate propagation often improves over no\-propagation, while very largeα\\alphacan sharply reduce accuracy\. To better understand the effects ofα\\alphaandHH, we provide fullHH\-sweep results and analyze retrieved node levels under different propagation policies and\(α,H\)\(\\alpha,H\)settings in[AppendixD\.3](https://arxiv.org/html/2606.04555#A4.SS3)\. The analysis shows thatα\\alphaandHHdetermine how far retrieval moves from the initial topical entry distribution and how strongly the retrieved evidence follows the corresponding policy direction\. Largerα\\alphaand largerHHpush retrieval further along that direction\. Thus, the best\(α,H\)\(\\alpha,H\)setting depends on the temporal granularity required by the task and how far retrieval should move from the initial distribution\.
RQ4: How does fullSegTreeMemperform against existing memory baselines?
\\rowcolorheaderpurpleMethodLoCoMoLongMemEval\-MABRealMem\\rowcolorheaderpurpleLLMF1LLMF1LLMF1Retrieval\-only memory baselinesBM25\[[33](https://arxiv.org/html/2606.04555#bib.bib33)\]0\.469\[0\.447, 0\.491\]0\.184\[0\.177, 0\.192\]0\.497\[0\.440, 0\.553\]0\.169\[0\.150, 0\.189\]0\.498\[0\.472, 0\.524\]0\.317\[0\.312, 0\.322\]Dense\[[13](https://arxiv.org/html/2606.04555#bib.bib13)\]0\.490\[0\.468, 0\.512\]0\.188\[0\.180, 0\.196\]0\.520\[0\.463, 0\.577\]0\.168\[0\.149, 0\.187\]0\.524\[0\.498, 0\.550\]0\.316\[0\.311, 0\.321\]Graph\-structured memory baselinesA\-MEM\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\]0\.608\[0\.586, 0\.629\]0\.231\[0\.223, 0\.239\]0\.553\[0\.497, 0\.610\]0\.166\[0\.148, 0\.184\]0\.515\[0\.488, 0\.541\]0\.308\[0\.303, 0\.314\]Mem0\[[5](https://arxiv.org/html/2606.04555#bib.bib5)\]0\.454\[0\.432, 0\.476\]0\.187\[0\.179, 0\.194\]0\.537\[0\.480, 0\.593\]0\.174\[0\.156, 0\.191\]0\.450\[0\.424, 0\.476\]0\.314\[0\.310, 0\.318\]HippoRAG\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\]0\.570\[0\.548, 0\.592\]0\.218\[0\.210, 0\.226\]0\.560\[0\.504, 0\.616\]0\.172\[0\.153, 0\.190\]0\.531\[0\.505, 0\.557\]0\.310\[0\.304, 0\.315\]Tree\-structured memory baselinesRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]0\.593\[0\.571, 0\.614\]0\.223\[0\.213, 0\.232\]0\.523\[0\.467, 0\.580\]0\.166\[0\.148, 0\.185\]0\.579\[0\.553, 0\.604\]0\.312\[0\.306, 0\.318\]MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]0\.483\[0\.461, 0\.505\]0\.151\[0\.144, 0\.157\]0\.483\[0\.427, 0\.540\]0\.157\[0\.139, 0\.175\]0\.536\[0\.510, 0\.562\]0\.310\[0\.304, 0\.315\]Proposed method\\rowcolorlightpurpleSegTreeMem\-TD0\.636\[0\.615, 0\.657\]0\.308\[0\.302, 0\.314\]0\.633\[0\.577, 0\.686\]0\.204\[0\.188, 0\.220\]0\.608\[0\.583, 0\.634\]0\.356\[0\.348, 0\.363\]\\rowcolorlightpurpleSegTreeMem\-BU0\.639\[0\.618, 0\.660\]0\.301\[0\.295, 0\.307\]0\.647\[0\.591, 0\.699\]0\.197\[0\.181, 0\.213\]0\.621\[0\.595, 0\.645\]0\.356\[0\.348, 0\.363\]
Table 4:Main answer\-quality results across memory benchmarks\. Each reported score includes its 95% confidence interval in brackets\. ForSegTreeMem, TD and BU use the same fixed representative setting \(batch LLM construction,α=0\.10\\alpha=0\.10,H=2H=2\), with top\-down and bottom\-up propagation respectively\.SegTreeMemachieves the strongest overall performance across benchmarks, showing that temporally ordered tree memory improves long\-horizon agentic memory\.[Table˜4](https://arxiv.org/html/2606.04555#S5.T4)shows that the fullSegTreeMemsystem achieves the best LLM\-judge accuracy across the evaluated benchmarks, improving over the strongest external baseline nearly20%20\\%\. The comparison supports the central claim of this work: long\-horizon conversational agents benefit from memory indexes that are both temporal and hierarchical\. Online Segment Tree construction creates temporally ordered memory units, while tree\-aware retrieval uses this structure to recover relevant context at the appropriate granularity\.[Figure˜5](https://arxiv.org/html/2606.04555#S5.F5)shows thatSegTreeMemis also computationally competitive\. Per\-utterance construction and retrieval is the fastest among methods that build any structured memory\. Token cost ranges from negligible for Cosine to comparable with existing structured baselines for the Pointwise and Batch variants\.
## 6Conclusion
We introducedSegTreeMem, a memory architecture that organizes long\-horizon conversation history as an online Segment Tree over utterances and retrieves evidence through tree\-aware propagation\. The central takeaway is that conversational agents benefit from memory indexes that are both temporal and hierarchical: temporal order helps construct coherent memory states, while the tree structure lets retrieval select context at the appropriate granularity\. Our results support temporal and hierarchical indexing as a practical foundation for long\-horizon conversational memory\.
##### Limitation and future work\.
Our current implementation leaves several practical extensions for future work\. First, propagation uses fixed policies and horizons across queries, although different questions may require different retrieval granularity\. Query\-dependent policy selection could choose among no propagation, top\-down retrieval, and bottom\-up retrieval\. Second, the current system focuses on text\-based conversational memory\. Extending the same temporal ordering principle to multimodal observations, tool calls, and action traces is an important direction for long\-horizon agents\. Third, future systems could add incremental maintenance operations, such as local subtree rebuilding, periodic compression, or duplicate cleanup, to keep the memory index efficient and stable over very long deployments\.
## References
- Alibaba Cloud \[2026\]Alibaba Cloud\.Qwen3\.5\-Flash model documentation\.[https://www\.alibabacloud\.com/help/en/model\-studio/getting\-started/models](https://www.alibabacloud.com/help/en/model-studio/getting-started/models), 2026\.Alibaba Cloud Model Studio documentation forqwen3\.5\-flash; snapshotqwen3\.5\-flash\-2026\-02\-23\.
- Bai et al\. \[2024\]Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al\.Longbench: A bilingual, multitask benchmark for long context understanding\.In*Proceedings of the 62nd annual meeting of the association for computational linguistics \(volume 1: Long papers\)*, pages 3119–3137, 2024\.
- Bian et al\. \[2026\]Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen\.Realmem: Benchmarking llms in real\-world memory\-driven interaction\.*arXiv preprint arXiv:2601\.06966*, 2026\.
- Chen et al\. \[2023\]Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz\.Walking down the memory maze: Beyond context limit through interactive reading\.*arXiv preprint arXiv:2310\.05029*, 2023\.
- Chhikara et al\. \[2025\]Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\.Mem0: Building production\-ready ai agents with scalable long\-term memory\.*arXiv preprint arXiv:2504\.19413*, 2025\.Includes Mem0 and graph\-enhanced Mem0g variants\.
- de Berg et al\. \[2008\]Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars\.*Computational Geometry: Algorithms and Applications*\.Springer, 3 edition, 2008\.
- Galley et al\. \[2003\]Michel Galley, Kathleen R\. McKeown, Eric Fosler\-Lussier, and Hongyan Jing\.Discourse segmentation of multi\-party conversation\.In*Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics*, pages 562–569, Sapporo, Japan, July 2003\. Association for Computational Linguistics\.doi:10\.3115/1075096\.1075167\.URL[https://aclanthology\.org/P03\-1071/](https://aclanthology.org/P03-1071/)\.
- Gupta et al\. \[2025\]Nilesh Gupta, Wei\-Cheng Chang, Ngot Bui, Cho\-Jui Hsieh, and Inderjit S\. Dhillon\.LLM\-guided hierarchical retrieval\.*arXiv preprint arXiv:2510\.13217*, 2025\.
- Gutiérrez et al\. \[2024\]Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su\.Hipporag: Neurobiologically inspired long\-term memory for large language models\.*arXiv preprint arXiv:2405\.14831*, 2024\.
- Hu et al\. \[2025\]Yuanzhe Hu, Yu Wang, and Julian McAuley\.Evaluating memory in llm agents via incremental multi\-turn interactions\.*arXiv preprint arXiv:2507\.05257*, 2025\.
- Jeh and Widom \[2003\]Glen Jeh and Jennifer Widom\.Scaling personalized web search\.In*Proceedings of the 12th International Conference on World Wide Web*, pages 271–279\. ACM, 2003\.
- Joty et al\. \[2013\]Shafiq Joty, Giuseppe Carenini, and Raymond T\. Ng\.Topic segmentation and labeling in asynchronous conversations\.*Journal of Artificial Intelligence Research*, 47:521–573, 2013\.doi:10\.1613/jair\.3940\.URL[https://doi\.org/10\.1613/jair\.3940](https://doi.org/10.1613/jair.3940)\.
- Karpukhin et al\. \[2020\]Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen\-tau Yih\.Dense passage retrieval for open\-domain question answering\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 6769–6781, Online, November 2020\. Association for Computational Linguistics\.doi:10\.18653/v1/2020\.emnlp\-main\.550\.URL[https://aclanthology\.org/2020\.emnlp\-main\.550/](https://aclanthology.org/2020.emnlp-main.550/)\.
- Kim et al\. \[2026\]Junyoung Kim, Anton Korikov, Jiazhou Liang, Justin Cui, Yifan Simon Liu, Qianfeng Wen, Mark Zhao, and Scott Sanner\.Bayesian active learning with gaussian processes guided by llm relevance scoring for dense passage retrieval\.*arXiv preprint arXiv:2604\.17906*, 2026\.
- Liang et al\. \[2026a\]Jiazhou Liang, Yifan Simon Liu, David Guo, Minqi Sun, Yilun Jiang, and Scott Sanner\.Evaluating scene\-based in\-situ item labeling for immersive conversational recommendation\.*arXiv preprint arXiv:2604\.09698*, 2026a\.
- Liang et al\. \[2026b\]Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, and Scott Sanner\.Goal\-oriented reasoning for rag\-based memory in conversational agentic llm systems\.*arXiv preprint arXiv:2605\.12213*, 2026b\.
- Liu et al\. \[2023a\]Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu\.G\-eval: Nlg evaluation using gpt\-4 with better human alignment\.In*Proceedings of the 2023 conference on empirical methods in natural language processing*, pages 2511–2522, 2023a\.
- Liu et al\. \[2023b\]Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu\.G\-eval: Nlg evaluation using gpt\-4 with better human alignment\.In*Proceedings of the 2023 conference on empirical methods in natural language processing*, pages 2511–2522, 2023b\.
- Liu et al\. \[2025a\]Yifan Liu, Gelila Tilahun, Xinxiang Gao, Qianfeng Wen, and Michael Gervers\.A comparative study of static and contextual embeddings for analyzing semantic changes in medieval latin charters\.In*Proceedings of the First Workshop on Language Models for Low\-Resource Languages*, pages 182–192, 2025a\.
- Liu et al\. \[2025b\]Yifan Liu, Qianfeng Wen, Jiazhou Liang, Mark Zhao, Justin Cui, Anton Korikov, Armin Toroghi, Junyoung Kim, and Scott Sanner\.Multimodal item scoring for natural language recommendation via gaussian process regression with llm relevance judgments\.*arXiv preprint arXiv:2510\.22023*, 2025b\.
- Liu et al\. \[2025c\]Yifan Liu, Qianfeng Wen, Mark Zhao, Jiazhou Liang, and Scott Sanner\.Ma\-dpr: Manifold\-aware distance metrics for dense passage retrieval\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 31073–31091, 2025c\.
- Liu et al\. \[2026\]Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, and Scott Sanner\.Semantic xpath: Structured agentic memory access for conversational ai\.*arXiv preprint arXiv:2603\.01160*, 2026\.
- Ma et al\. \[2023\]Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin\.Zero\-shot listwise document reranking with a large language model\.*CoRR*, abs/2305\.02156, 2023\.doi:10\.48550/arXiv\.2305\.02156\.
- Maharana et al\. \[2024\]Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\.Evaluating very long\-term conversational memory of llm agents\.*arXiv preprint arXiv:2402\.17753*, 2024\.
- OpenAI \[2024a\]OpenAI\.GPT\-4o mini model documentation\.[https://platform\.openai\.com/docs/models/gpt\-4o\-mini](https://platform.openai.com/docs/models/gpt-4o-mini), 2024a\.Official OpenAI API documentation forgpt\-4o\-mini; snapshotgpt\-4o\-mini\-2024\-07\-18\.
- OpenAI \[2024b\]OpenAI\.New embedding models and api updates\.[https://openai\.com/index/new\-embedding\-models\-and\-api\-updates/](https://openai.com/index/new-embedding-models-and-api-updates/), 2024b\.Introducestext\-embedding\-3\-small\.
- OpenAI \[2026\]OpenAI\.Gpt\-5\.4 mini model documentation\.[https://developers\.openai\.com/api/docs/models/gpt\-5\.4\-mini](https://developers.openai.com/api/docs/models/gpt-5.4-mini), 2026\.Official OpenAI API documentation forgpt\-5\.4\-mini\.
- Packer et al\. \[2023\]Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\.Memgpt: Towards llms as operating systems\.*arXiv preprint arXiv:2310\.08560*, 2023\.
- Page et al\. \[1999\]Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd\.The pagerank citation ranking: Bringing order to the web\.Technical report, Stanford InfoLab, 1999\.
- Park et al\. \[2023\]Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S\. Bernstein\.Generative agents: Interactive simulacra of human behavior\.In*Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology*, UIST ’23, 2023\.
- Rezazadeh et al\. \[2024\]Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao\.From isolated conversations to hierarchical schemas: Dynamic tree memory representation for llms\.*arXiv preprint arXiv:2410\.14052*, 2024\.
- Robertson and Zaragoza \[2009\]Stephen Robertson and Hugo Zaragoza\.The probabilistic relevance framework: Bm25 and beyond\.*Foundations and Trends in Information Retrieval*, 3\(4\):333–389, 2009\.doi:10\.1561/1500000019\.
- Robertson et al\. \[1994\]Stephen E\. Robertson, Steve Walker, Susan Jones, Micheline Hancock\-Beaulieu, and Mike Gatford\.Okapi at trec\-3\.In*Text Retrieval Conference*, 1994\.
- Sarthi et al\. \[2024\]Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D\. Manning\.RAPTOR: Recursive abstractive processing for tree\-organized retrieval\.*arXiv preprint arXiv:2401\.18059*, 2024\.
- Sun et al\. \[2023\]Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren\.Is ChatGPT good at search? investigating large language models as re\-ranking agents\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 14918–14937, Singapore, December 2023\. Association for Computational Linguistics\.doi:10\.18653/v1/2023\.emnlp\-main\.923\.URL[https://aclanthology\.org/2023\.emnlp\-main\.923/](https://aclanthology.org/2023.emnlp-main.923/)\.
- Tang and Yang \[2024\]Yixuan Tang and Yi Yang\.Multihop\-rag: Benchmarking retrieval\-augmented generation for multi\-hop queries\.*arXiv preprint arXiv:2401\.15391*, 2024\.
- Tong et al\. \[2006\]Hanghang Tong, Christos Faloutsos, and Jia\-Yu Pan\.Fast random walk with restart and its applications\.In*Proceedings of the Sixth IEEE International Conference on Data Mining*, pages 613–622, 2006\.
- Wen et al\. \[2024\]Qianfeng Wen, Yifan Liu, Joshua Zhang, George Saad, Anton Korikov, Yury Sambale, and Scott Sanner\.Elaborative subtopic query reformulation for broad and indirect queries in travel destination recommendation\.*arXiv preprint arXiv:2410\.01598*, 2024\.
- Wen et al\. \[2025\]Qianfeng Wen, Yifan Liu, Justin Cui, Joshua Zhang, Anton Korikov, George\-Kirollos Saad, and Scott Sanner\.A simple but effective elaborative query reformulation approach for natural language recommendation\.*arXiv preprint arXiv:2510\.02656*, 2025\.
- Wu et al\. \[2024\]Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai\-Wei Chang, and Dong Yu\.Longmemeval: Benchmarking chat assistants on long\-term interactive memory\.*arXiv preprint arXiv:2410\.10813*, 2024\.
- Xu et al\. \[2025\]Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\.A\-mem: Agentic memory for llm agents\.*arXiv preprint arXiv:2502\.12110*, 2025\.
- Zhang et al\. \[2023\]Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, and Jimmy Lin\.Rank\-without\-GPT: Building GPT\-independent listwise rerankers on open\-source large language models\.*CoRR*, abs/2312\.02969, 2023\.doi:10\.48550/arXiv\.2312\.02969\.
- Zhong et al\. \[2024\]Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang\.Memorybank: Enhancing large language models with long\-term memory\.*Proceedings of the AAAI Conference on Artificial Intelligence*, 38\(17\), 2024\.
## Appendix AAlgorithmic Details
This subsection provides pseudocode for the twoSegTreeMemoperations\. Node annotations are denoted byA\(v\)A\(v\), node intervals byI\(v\)I\(v\), and the admissible retrieval set by𝒢\\mathcal\{G\}\.
Algorithm 1Online Segment Tree Update1:Input:current tree
Tt=\(Vt,Et,ρt\)T\_\{t\}=\(V\_\{t\},E\_\{t\},\\rho\_\{t\}\), incoming utterance
xt\+1x\_\{t\+1\}, compatibility model
CC, annotation model
SS
2:Output:updated tree
Tt\+1T\_\{t\+1\}
3:Create a new leaf
zt\+1z\_\{t\+1\}with
I\(zt\+1\)=\[t\+1,t\+1\]I\(z\_\{t\+1\}\)=\[t\+1,t\+1\]and
A\(zt\+1\)=xt\+1A\(z\_\{t\+1\}\)=x\_\{t\+1\}
4:if
Tt=∅T\_\{t\}=\\emptysetthen
5:returnthe single\-node tree rooted at
zt\+1z\_\{t\+1\}
6:endif
7:
Ft=\(f1,…,fm\)←𝖥𝗋𝗈𝗇𝗍𝗂𝖾𝗋\(Tt\)F\_\{t\}=\(f\_\{1\},\\ldots,f\_\{m\}\)\\leftarrow\\operatorname\{\\mathsf\{Frontier\}\}\(T\_\{t\}\), ordered from leaf to root\.
8:
ℓ⋆←C\(xt\+1,Ft∖\{f1\}\)\\ell^\{\\star\}\\leftarrow C\(x\_\{t\+1\},F\_\{t\}\\setminus\\\{f\_\{1\}\\\}\), where
ℓ⋆∈\{2,…,m\}∪\{NONE\}\\ell^\{\\star\}\\in\\\{2,\\ldots,m\\\}\\cup\\\{\\texttt\{NONE\}\\\}
9:if
ℓ⋆≠NONE\\ell^\{\\star\}\\neq\\texttt\{NONE\}then
10:Create a rightmost chain
ZZof
ℓ⋆−2\\ell^\{\\star\}\-2internal nodes above
zt\+1z\_\{t\+1\}, each with interval
\[t\+1,t\+1\]\[t\+1,t\+1\]
11:Attach the chain top \(or
zt\+1z\_\{t\+1\}itself, if
ℓ⋆=2\\ell^\{\\star\}=2\) as the rightmost child of
fℓ⋆f\_\{\\ell^\{\\star\}\}
12:
U←Z∪\{fℓ⋆\}∪𝖺𝗇𝖼\(fℓ⋆\)U\\leftarrow Z\\cup\\\{f\_\{\\ell^\{\\star\}\}\\\}\\cup\\operatorname\{\\mathsf\{anc\}\}\(f\_\{\\ell^\{\\star\}\}\)
13:else
14:Create a new root
ρt\+1\\rho\_\{t\+1\}whose leftmost child is
ρt\\rho\_\{t\}
15:Create a rightmost chain
ZZof
m−1m\-1internal nodes above
zt\+1z\_\{t\+1\}, each with interval
\[t\+1,t\+1\]\[t\+1,t\+1\]
16:Attach the chain top \(or
zt\+1z\_\{t\+1\}itself, if
m=1m=1\) as the rightmost child of
ρt\+1\\rho\_\{t\+1\}
17:
U←Z∪\{ρt\+1\}U\\leftarrow Z\\cup\\\{\\rho\_\{t\+1\}\\\}
18:endif
19:foreach node
u∈Uu\\in Uin bottom\-up orderdo
20:
I\(u\)←\[minc∈𝖼𝗁\(u\)l\(c\),maxc∈𝖼𝗁\(u\)r\(c\)\]I\(u\)\\leftarrow\\bigl\[\\min\_\{c\\in\\mathsf\{ch\}\(u\)\}l\(c\),\\,\\max\_\{c\\in\\mathsf\{ch\}\(u\)\}r\(c\)\\bigr\]
21:
A\(u\)←S\(\(A\(c\)\)c∈𝖼𝗁\(u\)\)A\(u\)\\leftarrow S\\bigl\(\(A\(c\)\)\_\{c\\in\\mathsf\{ch\}\(u\)\}\\bigr\)
22:endfor
23:return
Tt\+1T\_\{t\+1\}
##### Insertion invariant\.
The update in[Algorithm˜1](https://arxiv.org/html/2606.04555#alg1)never reorders existing leaves\. The incoming utterance is always introduced as the new rightmost leaf, and only nodes on the affected right frontier are modified\. Therefore every internal node continues to summarize a contiguous interval, while the update scope remains localized to the selected frontier segment and its ancestors\. TheNONEbranch is used when the compatibility model finds no coherent continuation among the active frontier nodes\. In that case the previous tree and the new rightmost chain are combined under a new root\.
Algorithm 2Structure\-Aware Retrieval1:Input:memory tree
T=\(V,E\)T=\(V,E\), query
qq, admissible set
𝒢⊆V\\mathcal\{G\}\\subseteq V, local relevance model
RR, propagation matrix
WPW\_\{P\}, horizon
HH, decay
α∈\[0,1\)\\alpha\\in\[0,1\), evidence budget
KK
2:Output:retrieved node set
𝒱^K\\widehat\{\\mathcal\{V\}\}\_\{K\}
3:foreach node
v∈Vv\\in Vdo
4:
rq\(v\)←R\(q,A\(v\)\)r\_\{q\}\(v\)\\leftarrow R\(q,A\(v\)\)//Relevance score computation
5:endfor
6:
sq\(0\)\(v\)←rq\(v\)/∑u∈Vrq\(u\)s\_\{q\}^\{\(0\)\}\(v\)\\leftarrow r\_\{q\}\(v\)\\big/\\sum\_\{u\\in V\}r\_\{q\}\(u\)for all
v∈Vv\\in V//Initial score normalization
7:
s~q←sq\(0\)\\tilde\{s\}\_\{q\}\\leftarrow s\_\{q\}^\{\(0\)\}
8:
sqprev←sq\(0\)s\_\{q\}^\{\\mathrm\{prev\}\}\\leftarrow s\_\{q\}^\{\(0\)\}
9:for
k=1,…,Hk=1,\\ldots,Hdo
10:
sq\(k\)←WPsqprevs\_\{q\}^\{\(k\)\}\\leftarrow W\_\{P\}\\\>s\_\{q\}^\{\\mathrm\{prev\}\}// Score propagation
11:
s~q←s~q\+αksq\(k\)\\tilde\{s\}\_\{q\}\\leftarrow\\tilde\{s\}\_\{q\}\+\\alpha^\{k\}s\_\{q\}^\{\(k\)\}// Propagated score accumulation
12:
sqprev←sq\(k\)s\_\{q\}^\{\\mathrm\{prev\}\}\\leftarrow s\_\{q\}^\{\(k\)\}
13:endfor
14:
𝒱^K←TopKv∈𝒢s~q\(v\)\\widehat\{\\mathcal\{V\}\}\_\{K\}\\leftarrow\\operatorname\{TopK\}\_\{v\\in\\mathcal\{G\}\}\\tilde\{s\}\_\{q\}\(v\)// Retrieving nodes with high final scores
15:return
𝒱^K\\widehat\{\\mathcal\{V\}\}\_\{K\}
##### Retrieval invariant\.
[Algorithm˜2](https://arxiv.org/html/2606.04555#alg2)first scores every node independently using its annotation, then mixes the local score distribution with propagated distributions obtained by applyingWPW\_\{P\}for up toHHsteps\. SettingH=0H=0recovers collapsed retrieval\. For top\-down propagation, probability mass moves from broader summaries toward child spans\. For bottom\-up propagation, mass moves from specific matches toward their enclosing temporal context\. The final Top\-KKselection is restricted to𝒢\\mathcal\{G\}, which lets the same scoring routine support either internal\-node retrieval or leaf\-only evidence selection\.
## Appendix BExtended Related Work
### B\.1Long\-Horizon Memory Benchmarks
Long\-horizon conversational memory benchmarks evaluate whether an agent can retain, update, and retrieve information accumulated across extended interactions\. This setting differs from standard long\-context RAG\[[2](https://arxiv.org/html/2606.04555#bib.bib2),[36](https://arxiv.org/html/2606.04555#bib.bib36),[38](https://arxiv.org/html/2606.04555#bib.bib38),[39](https://arxiv.org/html/2606.04555#bib.bib39),[15](https://arxiv.org/html/2606.04555#bib.bib15)\]: the memory state is built over a sequence of user–agent interactions, and queries may require resolving information across temporally separated turns, updated facts, evolving preferences, or task state\.
We evaluateSegTreeMemon three benchmarks that directly target this setting:
- •LoCoMoevaluates very long\-term conversational memory across multi\-session dialogues, with questions that require retrieving facts, preferences, and events from extended interaction histories\[[24](https://arxiv.org/html/2606.04555#bib.bib24)\]\.
- •LongMemEval\-MABis based on the MemoryAgentBench reformulation of LongMemEval, where memory is built incrementally over an extended dialogue and queried repeatedly throughout the interaction\[[40](https://arxiv.org/html/2606.04555#bib.bib40),[10](https://arxiv.org/html/2606.04555#bib.bib10)\]\. This setting is especially relevant to online memory construction because the system must maintain a memory state as new utterances arrive rather than process a fixed document once\.
- •RealMemevaluates memory\-driven interaction in more realistic user\-facing scenarios, where questions may depend on evolving project state, user preferences, or prior task context\[[3](https://arxiv.org/html/2606.04555#bib.bib3)\]\.
These benchmarks are well matched to our goal because they evaluate memory as an external state maintained over interaction histories\. Other long\-context benchmarks, including synthetic needle\-retrieval tasks, long\-document QA, and book\-level reading comprehension, are useful for measuring whether a model can attend over long input sequences\. However, they do not directly evaluate the central setting of this paper: online construction of a persistent conversational memory index and retrieval from that index as the interaction evolves\. We therefore focus our experiments on benchmarks where memory construction and retrieval are both part of the task\.
### B\.2Flat Retrieval Methods
##### Memory construction\.
Flat retrieval methods store past context as an unordered collection of independent memory records\. In sparse retrieval, records are indexed by lexical matching signals such as BM25\[[33](https://arxiv.org/html/2606.04555#bib.bib33)\]\. In dense retrieval, records are embedded into a vector space and indexed for nearest neighbor search\[[13](https://arxiv.org/html/2606.04555#bib.bib13)\]\. These methods are simple, scalable, and often strong baselines for retrieval\-augmented generation\. However, their construction step does not explicitly model the hierarchical or temporal structure of a conversation\. Each memory unit is usually treated as an independent item, even when its meaning depends on surrounding utterances\.
##### Retrieval\.
At retrieval time, flat methods rank memory records independently by lexical or semantic similarity to the query\. This supports topical matching but does not explicitly use temporal or structural context\. For example, if a query matches one utterance in a longer coherent span, flat retrieval does not automatically recover nearby utterances, enclosing summaries, or other parts of the same interaction segment\. In our terminology, flat retrieval satisfies topical relevance but does not directly exploit the admissible temporal or structural neighborhood around a match\.
### B\.3Graph\-Based and Agentic Memory Methods
##### Memory construction\.
Graph\-based and agentic memory systems enrich flat memory by extracting, linking, updating, or consolidating memory records over time\. Generative Agents store memories in a stream and periodically synthesize higher\-level reflections\[[30](https://arxiv.org/html/2606.04555#bib.bib30)\]\.MemoryBank incorporates importance weighting and forgetting mechanisms for long\-term user memory\[[43](https://arxiv.org/html/2606.04555#bib.bib43)\]\. MemGPT organizes memory across tiers and lets the agent manage memory through operations analogous to paging\[[28](https://arxiv.org/html/2606.04555#bib.bib28)\]\. More recent agentic memory systems such as A\-MEM and Mem0 emphasize explicit memory operations, including extraction, updating, linking, and consolidation\[[41](https://arxiv.org/html/2606.04555#bib.bib41),[5](https://arxiv.org/html/2606.04555#bib.bib5)\]\. Goal\-oriented memory reasoning further studies how an agent can decompose a user utterance into subgoals and retrieve memory targeted to those reasoning needs\[[16](https://arxiv.org/html/2606.04555#bib.bib16)\]\.
These systems are designed for persistent agent memory, but their memory organization is often driven by learned or heuristic estimates of memory importance, extracted entities, or agentic update decisions rather than by a deterministic temporal constraint\.
##### Retrieval\.
Graph\-structured retrieval methods use links between memory records to expand beyond an initial topical match\[[21](https://arxiv.org/html/2606.04555#bib.bib21),[22](https://arxiv.org/html/2606.04555#bib.bib22)\]\. This idea is related to classical graph ranking methods such as PageRank, personalized PageRank, and random walks with restart\[[29](https://arxiv.org/html/2606.04555#bib.bib29),[11](https://arxiv.org/html/2606.04555#bib.bib11),[37](https://arxiv.org/html/2606.04555#bib.bib37)\]\. HippoRAG applies this perspective to LLM memory by constructing a corpus\-derived knowledge graph and using personalized PageRank to obtain a structure\-aware relevance signal\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\]\. A\-MEM also uses explicit memory links to support associative retrieval\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\]\.
Our retrieval method shares the high\-level idea that relevance can be propagated through structure\. The key difference is the underlying structure\. Graph\-based methods typically propagate over entity, fact, or memory\-record links\. In contrast,SegTreeMempropagates over a temporal hierarchy whose nodes correspond to contiguous conversational spans\. This makes propagation directly interpretable as moving between local utterances, enclosing temporal contexts, and related neighboring spans\.
### B\.4Tree\-Based Memory Methods
##### Memory construction\.
Tree\-based memory methods organize text into hierarchical abstractions\. RAPTOR constructs a recursive abstraction tree by clustering text chunks and summarizing clusters\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]\. LATTICE similarly builds a semantic hierarchy for LLM\-guided retrieval\[[8](https://arxiv.org/html/2606.04555#bib.bib8)\]\. MemWalker preserves document order by grouping adjacent segments into fixed\-size units\[[4](https://arxiv.org/html/2606.04555#bib.bib4)\]\. MemTree is designed for conversational memory and supports online updates by inserting new memories into the most semantically similar branch\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]\.
These methods differ in whether they support online construction and whether they preserve order\. RAPTOR and LATTICE build semantic abstraction trees through clustering, which can group non\-adjacent segments and therefore does not guarantee temporal contiguity\. MemWalker preserves order but is designed for static contexts rather than online conversational updates\. MemTree supports online insertion, but its similarity\-based insertion rule prioritizes topical relatedness and does not preserve utterance order in the tree structure\.SegTreeMemis designed to satisfy both requirements: it updates memory online while maintaining a tree over temporally contiguous utterance spans\.
##### Retrieval\.
Existing tree\-based retrieval methods generally follow one of two strategies\. RAPTOR and MemTree perform*collapsed retrieval*: they flatten the tree into a set of candidate nodes and rank all nodes independently\[[34](https://arxiv.org/html/2606.04555#bib.bib34),[31](https://arxiv.org/html/2606.04555#bib.bib31)\]\. This allows retrieval at multiple abstraction levels, but it does not explicitly use the tree edges during retrieval\. MemWalker and LATTICE instead perform*LLM\-guided traversal*, where an LLM moves from coarse nodes to finer\-grained children to select relevant branches\[[4](https://arxiv.org/html/2606.04555#bib.bib4),[8](https://arxiv.org/html/2606.04555#bib.bib8)\]\. This uses the top\-down hierarchy, but it does not directly exploit other tree\-induced relations such as parent context, sibling context, or temporal neighborhoods around an initially relevant match\.
SegTreeMemdiffers from both approaches\. Rather than flattening the tree or performing a single top\-down traversal, it treats retrieval as finite\-horizon relevance propagation over the memory tree\. This lets the retriever begin from topical matches and then move through structurally meaningful relations, such as from a detailed utterance to its enclosing context or from a broad summary to its child spans\. Because every node inSegTreeMemcorresponds to a contiguous utterance interval, these structural moves also have a temporal interpretation\.
### B\.5Conversational Segmentation and Segment Trees
Conversational segmentation studies how to identify coherent stretches of dialogue and topic boundaries in interaction histories\[[19](https://arxiv.org/html/2606.04555#bib.bib19)\]\. Prior work has studied discourse segmentation in multi\-party conversation\[[7](https://arxiv.org/html/2606.04555#bib.bib7)\]and topic segmentation in asynchronous conversations\[[12](https://arxiv.org/html/2606.04555#bib.bib12)\]\. These methods typically produce a flat sequence of boundaries as a preprocessing step\.
The compatibility model inSegTreeMemperforms a related decision, but in an online and hierarchical memory construction setting\. Each incoming utterance is compared against the rightmost frontier of the current tree, where different frontier nodes correspond to different temporal granularities\. The update therefore decides not only whether the utterance continues the current topic, but also at what level of temporal abstraction it should be attached\.
Segment trees originate as data structures for range queries over intervals\[[6](https://arxiv.org/html/2606.04555#bib.bib6)\]\. We adapt this classical structure to conversational memory by associating each node with an LLM\-generated annotation of a contiguous utterance span and by defining an online rightmost\-frontier update rule\. To our knowledge,SegTreeMemis the first LLM\-agent memory system to use a Segment Tree as the primary memory index for online construction and structure\-aware retrieval over long\-horizon conversations\.
## Appendix CTree Structure Analysis
### C\.1Cross\-Dataset Quantitative Statistics
DatasetMethodNodesMax depthMean depthDepth std\.BranchingCalls/leafLoCoMoSegTreeMemBatch3173\.803\.090\.248\.911\.95SegTreeMemCosine3524\.003\.460\.605\.100\.96SegTreeMemPointwise3153\.403\.010\.209\.233\.97MemTree4745\.903\.820\.922\.453\.67RAPTOR2902\.202\.130\.0834\.37–LongMemEvalSegTreeMemBatch7804\.003\.120\.335\.701\.88SegTreeMemCosine8334\.003\.360\.484\.380\.93SegTreeMemPointwise7974\.003\.200\.405\.204\.13MemTree11026\.803\.861\.112\.404\.13RAPTOR7035\.204\.230\.7012\.02–RealMemSegTreeMemBatch9474\.003\.460\.493\.811\.94SegTreeMemCosine8653\.002\.880\.325\.260\.94SegTreeMemPointwise8353\.002\.990\.086\.553\.93MemTree11906\.903\.651\.202\.424\.09RAPTOR7344\.403\.660\.6521\.54–
Table 5:Cross\-dataset tree\-shape statistics, averaged over conversations in each dataset\. LoCoMo has 10 conversations with 281 leaves on average, LongMemEval has 5 conversations with 643 leaves on average, and RealMem has 10 conversations with 698 leaves on average\. Nodes denotes the per\-tree node count\. Max depth, mean depth, and depth std\. summarize leaf\-depth statistics\. Branching denotes the mean branching factor of internal nodes\. Calls/leaf denotes LLM construction calls per leaf\.RAPTORis omitted from Calls/leaf because its recursive clustering procedure lacks an online per\-leaf analogue\.[Table˜5](https://arxiv.org/html/2606.04555#A3.T5)shows thatSegTreeMemproduces well\-behaved trees in practice\. Maximum leaf depth stays at33–44on conversations of281281–698698utterances, far below theO\(T\)O\(T\)worst\-case ceiling of the degenerate cascading regime \([AppendixC\.3](https://arxiv.org/html/2606.04555#A3.SS3)\), and depth standard deviations are roughly halfMemTree’s across most settings\.
Branching and cost placeSegTreeMembetween the two baselines on both axes\.RAPTORforms broad clustered levels \(branching up to3434\) andMemTreeforms narrow semantic hierarchies \(≈2\.4\\approx 2\.4\);SegTreeMemsits in between at44–99\. Construction cost shows the same pattern: cosine mode \(≈0\.95\\approx 0\.95calls/leaf\) is cheaper than any baseline, batch LLM \(≈1\.9\\approx 1\.9\) adds one batched frontier\-selection call per insertion, and sequential LLM \(≈4\\approx 4\) matchesMemTree’s budget while producing shallower, more uniform trees\.
### C\.2LoCoMoconv\-47Tree Visualization
#### C\.2\.1Tree Comparison
Figure 8:Whole\-tree visualizations ofSegTreeMem\(top\),MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]\(middle\), andRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]\(bottom\) on LoCoMoconv\-47, which spans329329utterances across3131dialogues\. Each leaf \(filled circle\) is one utterance; internal nodes are drawn as outlined circles\. Both leaves and internal nodes are colored by source dialogue index\.conv\-47spans 31 sessions of conversation over approximately seven calendar months\. The conversation is between two friends, James and John, and weaves together six recurring topics: gaming, programming, dogs, charity, travel, and romance\. These topics are revisited across many distantly spaced sessions\. The resulting tree structures forSegTreeMem,MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\], andRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]are shown in[Figure˜8](https://arxiv.org/html/2606.04555#A3.F8)\.
### C\.3Adversarial Cases
We analyze two adversarial input patterns forSegTreeMem’s online construction algorithm\.
#### C\.3\.1Maximally topic\-switching input
The online insertion algorithm selects an attachment point from the rightmost frontier, or creates a new root when no frontier node is compatible with the incoming utterance\. In the adversarial limit where every new utterance is judged incompatible with every frontier node, every insertion takes the new\-root branch\. AfterTTutterances, the tree becomes a degenerate cascading segment tree \([Figure˜9](https://arxiv.org/html/2606.04555#A3.F9)\): each NONE step creates a new root whose left subtree is the entire previous treeTtT\_\{t\}and whose right subtree is a freshly created chain ofm−1m\-1internal nodes terminating in the new leafzt\+1z\_\{t\+1\}\. The resulting tree has heightO\(T\)O\(T\)andO\(T2\)O\(T^\{2\}\)total nodes\.
Figure 9:Degenerate cascading segment tree produced by maximally topic\-switching input\. Each new utterance triggers a new root with a fresh right\-side chain whose length grows with the height of the existing tree\.##### Consequences\.
This structure has two distinct costs\. The first is storage: total node count grows quadratically inTT, and theO\(T2\)O\(T^\{2\}\)internal nodes each require an annotation pass throughSS, making this regime strictly more expensive than maintaining a flat list of utterances\. The second is representational\. Each NONE step creates two kinds of internal nodes, both representationally weak: the new root summarizes a long, incoherent prefix ending at the latest utterance, while them−1m\-1chain nodes abovezt\+1z\_\{t\+1\}all carry the singleton interval\[t\+1,t\+1\]\[t\+1,t\+1\]and therefore summarize a single utterance through redundant copies of one annotation\. Neither type captures focused conversational episodes\. Structure\-aware propagation is correspondingly weakened: relevance signals between older and newer leaves must pass through high\-level ancestors that do not correspond to meaningful temporal segments\. Local leaf scoring still retrieves individual utterances, but the tree structure itself contributes little useful context\.
##### When this arises\.
This case is a constructed worst case rather than a realistic operating regime\. It requires every incoming utterance to be judged incompatible with every active temporal context on the frontier, including the most recent leaf\. Natural conversation does not satisfy this condition: adjacent turns within a conversation almost always share some local discourse context\. We do not observe degenerate cascading behavior on any benchmark in this work; we describe it here to delimit the structural guarantees of the construction algorithm rather than to characterize typical inputs\. The pattern would require either \(i\) an extremely strict compatibility threshold or \(ii\) a synthetically constructed stream in which each utterance is drawn from a domain disjoint from every preceding utterance\.
#### C\.3\.2Long\-range topical recurrence
A second adversarial pattern occurs when the conversation repeatedly returns to a previous topic after a long intervening span on other topics\. BecauseSegTreeMeminserts new utterances only through the rightmost frontier, the later occurrence of the topic is placed according to its current temporal context rather than being merged retroactively with the earlier occurrence\. As a result, two utterances about the same topic can lie in disjoint subtrees whose lowest common ancestor covers a broad intervening interval \([Figure˜10](https://arxiv.org/html/2606.04555#A3.F10)\)\.
Figure 10:Long\-range topical recurrence\. Two leaves on the same topic \(purple\) are separated by an intervening span of utterances on other topics \(blue\)\. Although the temporal segment\-tree invariant is preserved, the shortest tree path between the two topical mentions passes through a high common ancestor\. Finite\-horizon propagation may therefore fail to connect the two mentions\.##### Consequences\.
The temporal segment\-tree invariant still holds: every internal node summarizes a contiguous span of the conversation\. However, the recurrent topic is split across multiple temporal regions, and no narrow internal node summarizes all occurrences of that topic together\. Since structure\-aware retrieval propagates scores for only a finite horizonHH, a relevance signal at one occurrence cannot necessarily reach the other occurrence when their tree distance exceedsHH\. IncreasingHHcan reduce this issue, but also risks spreading relevance into unrelated intervening content through high\-level ancestors\.
##### When this arises\.
Unlike maximally topic\-switching input, this pattern is common in long\-horizon interaction\. Users naturally return to ongoing projects, people, preferences, locations, and periodic activities after many unrelated utterances\. The adversarial aspect is not the recurrence itself, but the combination of recurrence with enough intervening content that the related mentions become structurally distant in the temporal tree\.
##### Mitigation\.
The current retrieval framework partially mitigates this case through local scoring and finite\-horizon propagation\. Collapsed retrieval is robust when both recurrent mentions have annotations that independently match the query, since it scores nodes without relying on tree distance\. Bottom\-up propagation can also promote a relevant leaf to its local temporal context, helping retrieve surrounding evidence from the same episode\. However, pure temporal propagation does not guarantee that a single high\-scoring occurrence will activate a distant recurrence of the same topic\. This is a genuine limitation of a purely temporal segment tree, and suggests a natural extension: augment the temporal tree with semantic cross\-links between distant but related nodes, while preserving the segment tree as the primary chronological scaffold\.
### C\.4QA Case Analyses
We provide qualitative QA cases to illustrate how temporal tree construction and score propagation affect downstream retrieval\. We first show two successful cases from the late\-2022 portion ofconv\-47, whereSegTreeMemkeeps the gold utterances and their nearby context within the same local temporal segment\. We then analyze a representative failure ofSegTreeMem, followed by two cases that illustrate how score propagation changes the retrieved evidence\.
##### Successful QA cases\.
The first two cases illustrate two ways that temporal segmentation helpsSegTreeMemretrieve the required evidence\. In Q1, a narrow parent summary preserves a local detail that is lost when gaming\-related utterances are aggregated more broadly\. In Q2, a parent summary packages information from adjacent sibling utterances, allowing retrieval to recover a short answer even when the exact answer\-bearing utterance is not retrieved directly\.
- •SM1, Local\-detail preservation\.When several adjacent utterances discuss the same event,SegTreeMemcan summarize them within a narrow temporal segment\. This keeps specific details attached to their local context rather than smoothing them into a broad topic summary\.
- •SM2, Summary\-mediated sibling evidence\.When the answer is contained in a short neighboring utterance, the parent summary can carry that detail into the retrieved evidence even if retrieval lands on an adjacent utterance or parent node from the same local temporal segment\.
##### Q1 \| SM1: narrow temporal summaries preserve local details\.
Question\.What did John suggest James practice before playing FIFA 23 together?Gold answer\.Control with a gamepad and timing\.Relevant utterances\(5 Nov 2022\)\.User 1:Great idea\! I hope it’s easy to control\.User 2:Not at all, all you need is a gamepad and a sense of timing\.User 1:Great\! Well, I’ll go train\!
SegTreeMem\.The depth\-3 internal node covering the 5 November dialogue encodes the answer in its summary: “…John advised him to practice first …using a gamepad and good timing\.” Because the FIFA\-related utterances are grouped within a narrow temporal segment, the gamepad/timing fact is preserved in the summary seen by the retriever\.
MemTreeandRAPTOR\.MemTree’s retrieval is dominated by broad video\-game clusters that mix gaming utterances from different dates\. These clusters recover the topic of FIFA/gaming, but their summaries do not preserve the specific gamepad/timing detail from 5 November\.RAPTORretrieves nearby FIFA evidence, including the utterance where John says James should “practice a little first,” but not the answer\-bearing detail about using a gamepad and having a sense of timing\. Its generated answer is therefore underspecified: James should “practice a little first\.”
This case shows that temporal segmentation can preserve the granularity needed for exact\-answer questions\. The baselines retrieve the correct broad topic, but the retrieved evidence does not contain the specific detail required by the question\.
Figure 11:Q1 ground\-truth answer\-bearing utterance and ancestor nodes in the LoCoMoconv\-47SegTreeMemtree\.
##### Q2 \| SM2: parent summaries recover adjacent sibling details\.
Question\.What is the name of John’s cousin’s dog?Gold answer\.Luna\.Relevant utterances\(7 Nov 2022\)\.User 1:This pup is so adorable\! What’s their name?User 2:Their name is Luna\.
SegTreeMem\.The depth\-3 internal node covering the 7 November dialogue ends its summary with “…including John’s cousin’s dog Luna,” so the parent summary already contains the answer\. The answer is therefore available through the local parent node even when retrieval does not land directly on the short answer\-bearing utterance, “Their name is Luna\.”
MemTreeandRAPTOR\.Both baselines retrieve nearby 7 November dog\-related evidence in which John shares a photo of his cousin’s dog, but the retrieved evidence does not name the dog\. The exact name appears in an adjacent utterance, and the retrieved evidence does not package that sibling fact with the retrieved node\. Thus, the baselines reach the right local neighborhood but miss the short answer\-bearing detail\.
Together, Q1 and Q2 show two complementary benefits of temporal segmentation\. First, narrow temporal summaries can preserve local details that may be smoothed away by broad semantic aggregation, as in Q1\. Second, parent summaries can package information from adjacent sibling utterances, as in Q2\. The latter case is especially important for short answers such as names, where the answer\-bearing utterance may be semantically close to the retrieved evidence but absent from the exact node text shown to the answer model\.
##### Failure mode ofSegTreeMem: missed details in internal node summaries\.
AlthoughSegTreeMemimproves retrieval by preserving chronological structure, its summaries can still compress away exact local details\. This is most visible when the question requires an answer such as a date or name\. Score propagation can sometimes recover such details by moving relevance from a relevant parent summary to its children, but it does not guarantee that the answer\-bearing utterance receives enough mass to enter the evidence budget\.
##### Q3: Internal summaries can omit exact local details\.
Question\.When did John resume playing drums in his adulthood?Gold answer\.February 2022\.
SegTreeMem\.Retrieval reaches the correct local temporal region, but the highest\-ranked evidence from that region is not the answer\-bearing utterance\. The retrieved parent summary mentions that John resumed hobbies, but it does not preserve the specific drum\-playing fact needed to answer the question\. Thus, the temporal tree locates the right region of the conversation, while the retrieved summary is not specific enough for literal date recall\.
MemTreeandRAPTOR\.MemTree’s leaf\-level retrieval lands directly on the literal drum mention\.RAPTORretrieves related drum evidence, including a March utterance about John playing drums and a later utterance about having played drums when he was younger, but it does not surface the exact February resume\-date evidence\. This illustrates that literal recall can depend sensitively on whether the answer\-bearing utterance itself enters the evidence budget\.
This case illustrates a limitation ofSegTreeMem: answer quality still depends on whether the relevant detail is preserved in the retrieved node annotations\. This motivates either more detail\-preserving summarization prompts or a fallback to exact utterance\-level evidence when the query appears to require literal recall\.
##### Score propagation\.
We next inspect two cases where score propagation changes the ranking or composition of the retrieved evidence relative to collapsed retrieval\. These examples illustrate the intended behavior of propagation: moving score toward nodes whose local temporal context also matches the query\.
##### Q4: Propagation improves evidence selection\.
Question\.What is John planning to do after receiving Samantha’s phone number?Gold answer\.Call her\.
Under collapsed retrieval, an unrelated programming/video\-card utterance is ranked slightly above a Samantha\-related utterance\. With structure\-aware propagation, the Samantha\-related evidence is promoted because its parent and ancestor nodes also describe the same local Samantha\-related temporal context\. This small ranking change moves the answer\-bearing context above an unrelated near\-tie\.
SegTreeMemwith propagation\.Propagation raises the score of the Samantha\-related utterance by incorporating support from its local parent context\. The resulting evidence supports the answer that John plans to call Samantha\.
Collapsed retrieval\.Without propagation, the retrieval order leaves an unrelated programming/video\-card utterance above the Samantha\-related evidence\. The answer model therefore produces a confused response instead of directly answering the intended event\.
This case shows the intended use of propagation: the local relevance score identifies candidate utterances, while the tree context helps distinguish an utterance embedded in the relevant temporal segment from a near\-tie embedded in an unrelated segment\. It also illustrates that this is a propagation\-specific effect rather than a general failure of collapsed retrieval:RAPTOR’s collapsed retrieval surfaces the literal Samantha phone\-number evidence and answers this question correctly\.
##### Q5: Propagation can introduce distractors\.
Question\.Which of James’s family members have visited him in the last year?Gold answer\.Mother and sister\.
This case shows the precision/recall tradeoff between propagation policies\. A broader top\-down propagation policy admits a cousin\-related utterance because it shares family vocabulary with the query\. However, that extra evidence introduces a distractor, and the answer model incorrectly adds cousins to the final answer\. A tighter bottom\-up policy keeps retrieval concentrated around the directly relevant temporal regions and answers with only mother and sister\.
Bottom\-up propagation\.The bottom\-up policy keeps relevance concentrated around the utterances and parent nodes that directly support the visit facts\. The retrieved evidence remains focused on the mother and sister mentions, producing the correct answer\.
Top\-down propagation\.The top\-down policy admits an incorrect cousin\-related utterance\. Although this evidence is topically related to family, it is not part of the gold answer, and the answer model over\-includes cousins\.
Together, Q4 and Q5 show that propagation is useful when structural context agrees with the query, but can hurt when nearby or descendant nodes contain semantically related distractors\. This supports the main experimental finding that propagation should be used with limited horizons and moderate strength: it is a retrieval\-time correction that can improve evidence selection, not a substitute for local node relevance\.
## Appendix DAdditional Results
### D\.1Qwen Results
##### Qwen backbone results\.
[Table˜6](https://arxiv.org/html/2606.04555#A4.T6)reports the same main comparison when LLM\-dependent memory construction and answer generation useqwen3\.5\-flash\.SegTreeMemremains the strongest method under this backbone on all three benchmarks\. The no\-propagation variant already improves over the strongest external baseline in primary score on LoCoMo, LongMemEval\-MAB, and RealMem, showing that the temporally ordered tree structure itself contributes beyond the answer model\. The Best variant further improves over no propagation by2\.42\.4points on LoCoMo,5\.15\.1points on LongMemEval\-MAB, and4\.24\.2points on RealMem, indicating that structure\-aware propagation continues to provide useful context expansion under a different LLM backbone\.
##### Effect of propagation under Qwen\.
The gains from Best over no propagation are not uniform across metrics\. On LoCoMo and RealMem, the Best variant improves both the primary judge score and token\-level F1, suggesting that propagation helps retrieve evidence that is both more judge\-sufficient and more lexically aligned with the reference answers\. On LongMemEval\-MAB, the primary accuracy gain is large, while F1 changes only modestly\. This pattern is consistent with LongMemEval\-MAB’s answer format: many questions are judged by whether the answer state is correct, and the token\-level overlap metric is less sensitive to whether the retrieved evidence supports the correct update or temporal relation\. Thus, the Qwen results support the same conclusion as the main results: propagation is most useful as a mechanism for selecting the appropriate temporal context, not merely as a way to increase lexical overlap\.
#### D\.1\.1Propagation Sweep under Qwen
##### Propagation remains beneficial under Qwen\.
[Figure˜12](https://arxiv.org/html/2606.04555#A4.F12)shows that structure\-aware retrieval also improvesSegTreeMemwhen LLM\-dependent memory construction and answer generation useqwen3\.5\-flash\. The no\-propagation baseline is already strong, but both propagation directions produce additional gains on at least one setting for each benchmark\. This supports the same conclusion as the main GPT\-backed results: the tree structure is useful by itself, and propagation further improves retrieval by changing how relevance is allocated across the temporal hierarchy\.
##### The best propagation direction is benchmark\-dependent\.
The Qwen sweep again shows that there is no universally optimal propagation policy\. On LoCoMo, top\-down propagation gives a stable improvement at moderate decay, while bottom\-up propagation is useful only at small decay and becomes unstable when too much mass is pushed upward\. On LongMemEval\-MAB, both directions improve over no propagation, with top\-down propagation giving the largest gain at a moderate decay factor\. On RealMem, bottom\-up propagation gives the strongest result, consistent with the intuition that RealMem often requires recovering broader project\- or task\-level state rather than only a single answer\-bearing exchange\.
##### Decay controls the granularity–drift tradeoff\.
The shape of the curves illustrates the role ofα\\alpha\. Small and moderate values ofα\\alphaallow retrieval to incorporate structurally related context while keeping the initial semantic match anchored\. Large values can move too much probability mass away from the answer\-bearing node\. This effect is most visible for bottom\-up propagation on LoCoMo, where aggressive upward expansion substantially degrades accuracy\. The same pattern appears more mildly on LongMemEval\-MAB, where bottom\-up propagation improves at small decay but falls below the best top\-down setting as decay increases\.
##### Backbone robustness\.
Together with[Table˜6](https://arxiv.org/html/2606.04555#A4.T6), this sweep shows that the propagation behavior is not an artifact of the GPT\-family backbone\. The absolute scores differ across answer models, and Qwen has a distinct category\-level failure profile, especially on LoCoMo adversarial questions\. Nevertheless, the same high\-level retrieval pattern persists: useful propagation is task\- and granularity\-dependent\. Detail\-oriented settings favor top\-down or weak propagation, whereas state\-oriented settings such as RealMem benefit more from bottom\-up context expansion\.
\\rowcoloraddheaderpurpleMethodVariantLoCoMoLongMemEval\-MABRealMem\\rowcoloraddheaderpurpleLLMF1LLMF1LLMF1Retrieval\-only memory baselinesBM25\[[33](https://arxiv.org/html/2606.04555#bib.bib33)\]Default0\.469\[0\.447, 0\.491\]0\.157\[0\.150, 0\.164\]0\.542\[0\.485, 0\.598\]0\.137\[0\.121, 0\.154\]0\.611\[0\.585, 0\.636\]0\.359\[0\.353, 0\.364\]Dense\[[13](https://arxiv.org/html/2606.04555#bib.bib13)\]Default0\.477\[0\.455, 0\.499\]0\.153\[0\.147, 0\.160\]0\.590\[0\.534, 0\.646\]0\.140\[0\.125, 0\.156\]0\.631\[0\.606, 0\.656\]0\.363\[0\.357, 0\.368\]Graph\-structured memory baselinesA\-MEM\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\]Default0\.496\[0\.474, 0\.518\]0\.155\[0\.149, 0\.162\]0\.599\[0\.543, 0\.654\]0\.139\[0\.123, 0\.154\]0\.635\[0\.610, 0\.660\]0\.361\[0\.355, 0\.366\]Mem0\[[5](https://arxiv.org/html/2606.04555#bib.bib5)\]Default0\.354\[0\.332, 0\.375\]0\.128\[0\.122, 0\.133\]0\.517\[0\.460, 0\.573\]0\.147\[0\.132, 0\.162\]0\.486\[0\.460, 0\.512\]0\.351\[0\.346, 0\.356\]HippoRAG\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\]Default0\.470\[0\.448, 0\.492\]0\.151\[0\.145, 0\.158\]0\.612\[0\.557, 0\.667\]0\.142\[0\.126, 0\.159\]0\.629\[0\.603, 0\.654\]0\.359\[0\.353, 0\.364\]Tree\-structured memory baselinesRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]Default0\.518\[0\.496, 0\.540\]0\.146\[0\.140, 0\.152\]0\.593\[0\.538, 0\.649\]0\.140\[0\.124, 0\.156\]0\.618\[0\.592, 0\.643\]0\.362\[0\.356, 0\.367\]MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]Default0\.369\[0\.347, 0\.390\]0\.118\[0\.112, 0\.123\]0\.543\[0\.487, 0\.600\]0\.137\[0\.121, 0\.152\]0\.597\[0\.571, 0\.622\]0\.360\[0\.355, 0\.365\]Proposed tree\-structured memory method\\rowcoloraddlightpurpleSegTreeMemNo\-Propagation0\.525\[0\.503, 0\.547\]0\.161\[0\.154, 0\.167\]0\.669\[0\.618, 0\.720\]0\.128\[0\.112, 0\.143\]0\.659\[0\.634, 0\.683\]0\.359\[0\.354, 0\.365\]\\rowcoloraddlightpurpleSegTreeMemBest0\.549\[0\.527, 0\.571\]0\.168\[0\.162, 0\.175\]0\.720\[0\.669, 0\.771\]0\.144\[0\.129, 0\.159\]0\.701\[0\.677, 0\.724\]0\.364\[0\.358, 0\.369\]
Table 6:Qwen answer\-quality results across long\-horizon memory benchmarks\. LLM denotes LLM\-judge accuracy and F1 denotes token\-level F1\. Each reported score includes its 95% confidence interval in brackets\. ForSegTreeMem, Best denotes the best propagation setting selected by LLM\-judge accuracy from the retrieval sweep\. All LLM\-dependent memory construction and answer generation useqwen3\.5\-flash\. Bold values mark the best score in each metric column\.Figure 12:Effect of propagation direction and decay factor under theqwen3\.5\-flashbackbone\. The dashed horizontal line denotes no\-propagation retrieval\. BU denotes bottom\-up propagation and TD denotes top\-down propagation\. Stars mark the best setting within each propagation direction, and percentage annotations report the improvement over the no\-propagation setting\. The plotted score is LLM\-judge accuracy for LoCoMo and LongMemEval\-MAB, and perfect\-rate for RealMem\.
### D\.2Question\-Type Breakdown
#### D\.2\.1LoCoMo Breakdown
[Table˜7](https://arxiv.org/html/2606.04555#A4.T7)shows thatSegTreeMemobtains the best aggregate LoCoMo accuracy under bothgpt\-5\.4\-miniandqwen3\.5\-flash, but the category\-level behavior differs substantially between the two backbones\. Undergpt\-5\.4\-mini,SegTreeMemBest leads on single\-hop, multi\-hop, temporal, open\-domain, and overall accuracy\. The only category where it does not lead is adversarial QA, where A\-MEM, RAPTOR, and HippoRAG remain stronger\. This indicates that temporal hierarchy and propagation are especially helpful for locating answer\-bearing evidence across ordinary single\-hop, multi\-hop, and temporal questions, while adversarial questions additionally require robust rejection of unsupported premises\.
##### Backbone sensitivity on LoCoMo\.
Theqwen3\.5\-flashLoCoMo results show a different error profile\. The largest degradation is concentrated in adversarial questions: all structured methods drop sharply on this category, and the strongest adversarial score under Qwen comes from BM25 rather than from a structured memory method\. Despite this adversarial weakness,SegTreeMemBest remains the best aggregate method because it retains strong performance on multi\-hop, temporal, and open\-domain questions\. This suggests that the retrieval structure continues to help recover relevant evidence, but the answer backbone affects whether the model correctly refuses or qualifies answers when the question contains a misleading premise\.
\\rowcoloraddheaderpurpleMethodgpt\-5\.4\-miniqwen3\.5\-flash\\rowcoloraddheaderpurpleSingle\-hopMulti\-hopTemporalOpen\-domainAdversarialAllSingle\-hopMulti\-hopTemporalOpen\-domainAdversarialAllRetrieval\-only memory baselinesBM25\[[33](https://arxiv.org/html/2606.04555#bib.bib33)\]0\.2520\.3890\.3540\.7340\.1910\.4690\.2910\.4420\.3020\.7070\.1860\.469Dense\[[13](https://arxiv.org/html/2606.04555#bib.bib13)\]0\.3400\.4210\.3850\.7500\.1680\.4900\.3830\.4330\.3540\.7170\.1430\.477Graph\-structured memory baselinesA\-MEM\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\]0\.4220\.4240\.4170\.7790\.5760\.6080\.4290\.4700\.3850\.7370\.1260\.496Mem0\[[5](https://arxiv.org/html/2606.04555#bib.bib5)\]0\.1910\.3240\.3230\.5970\.4710\.4540\.2060\.4050\.2600\.5070\.1410\.353HippoRAG\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\]0\.3720\.4020\.3750\.7250\.5650\.5700\.3940\.4170\.3750\.7110\.1210\.470Tree\-structured memory baselinesRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]0\.4820\.4420\.3960\.7670\.5670\.5930\.5670\.4210\.3750\.7690\.1140\.518MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]0\.2910\.3270\.4060\.6300\.4550\.4830\.2940\.2830\.3120\.5480\.1500\.369Proposed tree\-structured memory method\\rowcoloraddlightpurpleSegTreeMemNo\-Propagation0\.4700\.5540\.3850\.8790\.4100\.6390\.4450\.5190\.2880\.7830\.0670\.525\\rowcoloraddlightpurpleSegTreeMemBest0\.5260\.5800\.5420\.8940\.4200\.6680\.5140\.5580\.4580\.8110\.0900\.549
Table 7:LoCoMo question\-type LLM\-judge accuracy under two answer and LLM\-dependent memory\-construction backbones\. The five LoCoMo categories are single\-hop, multi\-hop, temporal, open\-domain, and adversarial\. ForSegTreeMem, Identity denotes no propagation, and Best denotes the best propagation setting under the corresponding backbone\. Bold values mark the best score in each category within each backbone\.
#### D\.2\.2LongMemEval\-MAB Breakdown
[Table˜8](https://arxiv.org/html/2606.04555#A4.T8)shows a cleaner pattern than LoCoMo\. Under bothgpt\-5\.4\-miniandqwen3\.5\-flash,SegTreeMemBest achieves the highest aggregate accuracy\. The strongest gains appear in multi\-session, single\-session preference, and temporal\-reasoning questions, which are precisely the categories that require maintaining state across turns or resolving information through temporal context\. By contrast, the single\-session assistant category is nearly saturated for many methods, and the single\-session user and knowledge\-update categories are sometimes led by external baselines\. Therefore, the aggregate improvement is not driven by easy single\-session cases; it comes primarily from categories where temporal memory organization is expected to matter\.
##### Propagation improves state\-tracking categories\.
Comparing identity/no\-propagation with Best shows that propagation is most useful for categories that require moving between a local evidence match and a broader memory state\. On LongMemEval\-MAB, Best improves over no propagation on multi\-session and temporal\-reasoning questions under both backbones\. The same trend appears for preference questions, where the answer often depends on recovering an accumulated or updated user state rather than a single isolated utterance\. These category\-level gains explain why the aggregate Best row improves over the identity row even when some single\-session categories are already near ceiling\.
##### Backbone effects on LongMemEval\-MAB\.
Theqwen3\.5\-flashbackbone raises LongMemEval\-MAB accuracy for most methods relative to the GPT\-family runs, but the relative conclusion remains unchanged:SegTreeMemBest is the strongest aggregate method, and its advantage is largest on the temporally demanding categories\. This is important because the Qwen setting changes both LLM\-dependent construction and answer generation\. The fact that the same category\-level pattern persists suggests that the gains are not an artifact of a particular answer model, but reflect the interaction between temporally ordered memory construction and structure\-aware retrieval\.
\\rowcoloraddheaderpurpleMethodgpt\-5\.4\-miniqwen3\.5\-flash\\rowcoloraddheaderpurpleKnow\.MultiAsst\.Pref\.UserTemp\.AllKnow\.MultiAsst\.Pref\.UserTemp\.AllRetrieval\-only memory baselinesBM25\[[33](https://arxiv.org/html/2606.04555#bib.bib33)\]0\.5780\.2530\.9670\.3670\.8890\.3200\.4970\.6440\.3071\.0000\.4000\.9330\.3600\.542Dense\[[13](https://arxiv.org/html/2606.04555#bib.bib13)\]0\.6890\.3070\.9670\.4670\.8890\.2530\.5200\.7560\.3331\.0000\.6330\.8890\.3870\.590Graph\-structured memory baselinesA\-MEM\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\]0\.7110\.3870\.9670\.6000\.8220\.2800\.5530\.7330\.4271\.0000\.6330\.9110\.3330\.599Mem0\[[5](https://arxiv.org/html/2606.04555#bib.bib5)\]0\.7560\.4000\.7670\.4330\.8890\.2800\.5370\.7110\.3730\.8000\.3330\.8890\.2800\.517HippoRAG\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\]0\.7780\.3730\.9670\.5330\.8670\.2800\.5600\.8220\.3731\.0000\.7000\.8890\.3730\.612Tree\-structured memory baselinesRAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\]0\.6440\.3330\.9670\.5330\.8670\.2530\.5230\.8000\.3471\.0000\.6670\.8890\.3470\.593MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\]0\.5560\.2271\.0000\.6330\.8220\.2270\.4830\.6440\.3200\.9670\.6330\.8890\.2930\.543Proposed tree\-structured memory method\\rowcoloraddlightpurpleSegTreeMemNo\-Propagation0\.6320\.4371\.0000\.6520\.8210\.5360\.6300\.6290\.4061\.0000\.6540\.7960\.5450\.669\\rowcoloraddlightpurpleSegTreeMemBest0\.7180\.4880\.9730\.7020\.8290\.5950\.6700\.7780\.5201\.0000\.7670\.8670\.6670\.720
Table 8:LongMemEval\-MAB question\-type LLM\-judge accuracy under two answer and LLM\-dependent memory\-construction backbones\. Columns are knowledge update \(Know\.\), multi\-session \(Multi\), single\-session assistant \(Asst\.\), single\-session preference \(Pref\.\), single\-session user \(User\), temporal reasoning \(Temp\.\), and aggregate accuracy \(All\)\. ForSegTreeMem, Identity denotes no propagation, and Best denotes the best propagation setting under the corresponding backbone\. Bold values mark the best score in each category within each backbone\.
### D\.3Additional Propagation Analysis
To complement the propagation sweep in[Figure˜6](https://arxiv.org/html/2606.04555#S5.F6), we provide two additional diagnostics for structure\-aware retrieval\. First, we analyze how accuracy changes jointly with the propagation horizonHHand decay factorα\\alpha\. Second, we inspect how propagation changes the temporal level of the retrieved evidence\. Together, these analyses clarify why propagation helps in some settings but hurts in others: the propagation policy does not simply add more context, but changes the granularity at which evidence is retrieved from the temporal hierarchy\.
Figure 13:Accuracy as a function of the decay factorα\\alphaand propagation horizonHH, using batch\-LLMSegTreeMemtrees\. The dashed horizontal line denotes the no\-propagation baselineH=0H=0\. Solid curves report finite\-horizon bottom\-up and top\-down propagation withH∈\{1,2,3\}H\\in\\\{1,2,3\\\}over the gridα∈\{0\.05,0\.10,0\.20,0\.30,0\.50,0\.70,0\.95\}\\alpha\\in\\\{0\.05,0\.10,0\.20,0\.30,0\.50,0\.70,0\.95\\\}\. The plotted score is LLM\-judge accuracy for LongMemEval\-MAB and LoCoMo, and perfect\-rate for RealMem\.##### Propagation is most useful at short horizons\.
[Figure˜13](https://arxiv.org/html/2606.04555#A4.F13)shows that one or two propagation hops are usually sufficient to capture the useful structural context\. Increasing the horizon beyond this range rarely gives a consistent gain and can make retrieval less stable, especially when paired with a large decay factor\. This supports the finite\-horizon design used in the main experiments: relevant context is often near the initial semantic match in the tree, either as a parent summary, a child span, or a nearby hierarchical neighbor\. Long\-range propagation can move too much probability mass away from the answer\-bearing region\.
##### Decay controls the amount of structural drift\.
The curves also show thatα\\alphais the main knob controlling the tradeoff between semantic anchoring and structural expansion\. Small and moderateα\\alphavalues allow the retriever to incorporate nearby hierarchical context while preserving the original query match\. Very largeα\\alphavalues can overweight propagated mass and cause retrieval drift\. This is most visible for bottom\-up propagation on LoCoMo, where increasingα\\alphasteadily lowers accuracy across horizons\. In contrast, LongMemEval\-MAB and RealMem tolerate moderate upward expansion better, because many of their questions benefit from recovering broader state or temporally aggregated context\.
##### The best direction depends on the benchmark\.
The horizon sweep confirms that there is no universally best propagation direction\. LoCoMo benefits more from top\-down propagation at moderate decay, which is consistent with its many detail\-oriented questions: broad summaries can identify the relevant episode, but the answer often requires retrieving a more specific descendant\. RealMem shows the opposite tendency\. Its strongest gains come from bottom\-up propagation, suggesting that local matches are often useful entry points but the final answer depends on broader project\- or task\-level state\. LongMemEval\-MAB lies between these regimes: both directions can improve over identity retrieval, but performance is sensitive to the chosen decay and horizon\.
Figure 14:Retrieved\-node level distribution as a function of the propagation setting, using batch\-LLMSegTreeMemtrees and a fixed propagation horizonH=2H=2\. The leftmost bar in each panel is the identity baseline\. Stacked bars show the fraction of retrieved nodes at each dataset\-specific temporal level, and the black line shows the corresponding accuracy or perfect\-rate score\. Bottom\-up propagation increasingly shifts retrieval toward broader internal nodes asα\\alphagrows, whereas top\-down propagation remains more concentrated on exchange\-level evidence\.
##### Top\-down propagation remains exchange\-oriented\.
[Figure˜14](https://arxiv.org/html/2606.04555#A4.F14)shows that top\-down propagation keeps most retrieved evidence close to the leaves\. Even asα\\alphaincreases, the retrieved set remains dominated by exchange\-level nodes, with only a moderate increase in broader temporal spans at intermediate settings\. This matches the semantics of top\-down propagation: relevance that initially matches a higher\-level summary is pushed downward toward finer\-grained descendants\. The resulting retrieval behavior is well suited to questions that require localized details, which helps explain why top\-down propagation is competitive on LoCoMo and LongMemEval\-MAB\.
##### Bottom\-up propagation expands toward broader temporal context\.
Bottom\-up propagation exhibits the complementary pattern\. Asα\\alphaincreases, retrieved evidence shifts from exchange\-level nodes toward higher\-level temporal units, such as subsessions and sessions in LongMemEval\-MAB and LoCoMo, or broader episode\-like segments in RealMem\. This behavior reflects the intended role of bottom\-up propagation: a highly relevant local match can promote its enclosing temporal context, allowing retrieval to recover the broader state surrounding the matched exchange\. This broader context is helpful on RealMem, but it can hurt on LoCoMo when the question requires a narrow answer\-bearing exchange\.
##### Granularity explains the dataset\-dependent sweep\.
The two diagnostics together explain the propagation behavior in[Figure˜6](https://arxiv.org/html/2606.04555#S5.F6)\. On LoCoMo, stronger bottom\-up propagation moves retrieval away from exchange\-level evidence and is accompanied by a clear drop in accuracy, indicating that many questions require narrow, detail\-preserving evidence\. On RealMem, moderate bottom\-up propagation improves performance while also increasing the share of broader internal nodes, suggesting that project\- or task\-level context is often useful\. LongMemEval\-MAB lies between these extremes: a small amount of structural expansion can help, but aggressive propagation can dilute answer\-bearing local evidence\.
Overall, these results support the interpretation in[Section˜5\.4](https://arxiv.org/html/2606.04555#S5.SS4): tree\-aware retrieval is best understood as a*granularity\-control mechanism*\. The propagation direction determines whether retrieval expands toward broader context or finer detail, while the decay factorα\\alphaand horizonHHcontrol the strength and distance of that expansion\. The best setting therefore depends on the temporal granularity required by the downstream QA task rather than on a universally optimal propagation policy\.
## Appendix EImplementation Details
We evaluate all external baselines through their official implementations, adapted to the same long\-horizon conversational\-memory benchmark interface\. This interface standardizes the visible memory prefix, query set, final evidence budget, answer\-generation prompt, and judging protocol across methods\. Unless otherwise stated, each method returns at mostK=10K=10evidence items per query\. For GPT\-backed runs, LLM\-dependent memory construction and answer generation usegpt\-5\.4\-mini\. For Qwen\-backed runs, they useqwen3\.5\-flashwith thinking disabled\. All embedding\-dependent components usetext\-embedding\-3\-small\. The judge is fixed togpt\-4o\-miniacross methods\.
##### BM25\.
BM25 is the lexical retrieval\-only baseline\. It indexes the same exchange\-level memory units as the other baselines and ranks visible memories with BM25 usingk1=1\.5k\_\{1\}=1\.5andb=0\.75b=0\.75\. It does not use LLM\-based construction, embedding retrieval, memory updates, graph construction, or hierarchical summarization\. The retrieved evidence budget is fixed toK=10K=10\.
##### Dense retrieval\.
Dense retrieval is the flat embedding baseline\. It embeds both the query and each visible exchange\-level memory unit withtext\-embedding\-3\-smalland ranks memory units by cosine similarity\. The method does not build an explicit memory structure and does not use an LLM during memory construction\. It returns the topK=10K=10exchange\-level memory units to the shared answer\-generation prompt\.
##### RAPTOR\.
For RAPTOR\[[34](https://arxiv.org/html/2606.04555#bib.bib34)\], we use the official recursive clustering\-and\-summarization interface\. The initial leaves are the benchmark memory units\. Node embeddings usetext\-embedding\-3\-small; cluster summaries are generated by the LLM backbone used in the corresponding experimental regime\. The RAPTOR configuration uses reduction dimension33, cluster threshold0\.50\.5, and maximum summary input length35003500tokens\. The final retrieved evidence budget isK=10K=10\.
##### MemTree\.
For MemTree\[[31](https://arxiv.org/html/2606.04555#bib.bib31)\], we use the official online semantic\-tree interface\. Memory units are inserted incrementally using the method’s semantic routing rule\. The configuration sets the base insertion threshold to0\.450\.45, the threshold growth rate to1\.01\.0, maximum depth to88, and enables leaf expansion\. Internal aggregation uses the LLM backbone of the corresponding run, and embeddings usetext\-embedding\-3\-small\. Retrieval uses the collapsed\-tree mode with retrieval threshold0\.100\.10, returning the topK=10K=10nodes\.
##### A\-MEM\.
For A\-MEM\[[41](https://arxiv.org/html/2606.04555#bib.bib41)\], we use the official note\-based memory interface\. Each memory unit is converted into a structured memory note with LLM\-generated JSON metadata\. The text used for note embedding concatenates the note content, context, keywords, and tags\. The configuration uses22nearest neighbors for note linking, enables memory evolution, sets the evolution threshold to100100, and uses link\-score threshold0\.750\.75\. Retrieval uses A\-MEM’s semantic\-plus\-links mode and returns the topK=10K=10memory notes\.
##### Mem0\.
For Mem0\[[5](https://arxiv.org/html/2606.04555#bib.bib5)\], we use the official base Mem0 memory interface\. We do not report the graph\-enhanced Mem0 variant in the main comparison\. The configuration uses additive single\-pass memory extraction, normalized\-text MD5 hashing for deduplication, and entity linking\. LLM\-dependent extraction uses the LLM backbone of the corresponding run, and memory embeddings usetext\-embedding\-3\-small\. Retrieval combines semantic, BM25, and entity signals with configured weights0\.700\.70,0\.200\.20, and0\.100\.10, respectively, and uses a minimum overfetch of6060\. The final evidence budget isK=10K=10\.
##### HippoRAG\.
For HippoRAG\[[9](https://arxiv.org/html/2606.04555#bib.bib9)\], we use the official graph\-retrieval interface\. LLM\-dependent information extraction, query processing, and fact filtering use the backbone of the corresponding experimental regime, while embedding\-dependent components usetext\-embedding\-3\-small\. The configuration sets the graph\-retrieval damping factor to0\.50\.5, synonymy threshold to0\.80\.8, query\-linking top\-kkto55, maximum filtered facts to44, and passage\-node weight to0\.050\.05\. The topK=10K=10retrieved passages are passed to the shared answer\-generation prompt\.
##### Controlled deviations from the original baseline settings\.
The intentional deviations from some original baseline papers are the shared benchmark substitutions required for controlled comparison: the LLM backbone, embedding model, visible\-memory interface, final evidence budget, answer prompt, and judge are fixed across methods\. In particular, all LLM\-dependent baseline components use eithergpt\-5\.4\-miniorqwen3\.5\-flashaccording to the experimental regime, all embedding\-dependent components usetext\-embedding\-3\-small, and all reported runs use final evidence budgetK=10K=10\. All other method\-specific parameters follow the official implementation interface and defaults, with the explicit configuration values reported above\.
## Appendix FNon\-Temporal Similarity\-Clustering Baseline
To isolate the role of temporal construction, we implement a non\-temporal similarity\-clustering tree as a controlled baseline\. This baseline uses the same memory units, annotation prompts, embedding model, retrieval scorer, propagation settings, and evidence budget asSegTreeMem\. Its only intended difference fromSegTreeMemis the online update rule: instead of inserting each new utterance through the rightmost temporal frontier, it assigns the utterance to the most similar existing semantic cluster\.
At timett, the baseline maintains a clustering tree over the observed utterance prefixXt=\(x1,…,xt\)X\_\{t\}=\(x\_\{1\},\\ldots,x\_\{t\}\)\. Each utterancexix\_\{i\}is represented as a leaf node\. The baseline is initialized by creating the first lowest\-level cluster from the first leaf utterance\. Let𝒞t\\mathcal\{C\}\_\{t\}denote the set of current lowest\-level internal cluster nodes, which serve as the assignment candidates for the next utterance\. For each clusterc∈𝒞tc\\in\\mathcal\{C\}\_\{t\}, letL\(c\)L\(c\)denote the set of leaf utterances undercc\. The cluster is represented by the mean embedding of its assigned utterances,
μ\(c\)=1\|L\(c\)\|∑xi∈L\(c\)e\(xi\),\\mu\(c\)=\\frac\{1\}\{\|L\(c\)\|\}\\sum\_\{x\_\{i\}\\in L\(c\)\}e\(x\_\{i\}\),wheree\(⋅\)e\(\\cdot\)is the same embedding function used by the dense retrieval pipeline\.
When a new utterancext\+1x\_\{t\+1\}arrives, the baseline compares it against all current lowest\-level clusters and identifies the nearest cluster,
c⋆=argmaxc∈𝒞tcos\(e\(xt\+1\),μ\(c\)\)\.c^\{\\star\}=\\arg\\max\_\{c\\in\\mathcal\{C\}\_\{t\}\}\\cos\\bigl\(e\(x\_\{t\+1\}\),\\mu\(c\)\\bigr\)\.If
cos\(e\(xt\+1\),μ\(c⋆\)\)≥τnt,τnt=0\.45,\\cos\\bigl\(e\(x\_\{t\+1\}\),\\mu\(c^\{\\star\}\)\\bigr\)\\geq\\tau\_\{\\mathrm\{nt\}\},\\qquad\\tau\_\{\\mathrm\{nt\}\}=0\.45,the new utterance is inserted as a new leaf child ofc⋆c^\{\\star\}\. Otherwise, the baseline creates a new lowest\-level internal cluster containing only the new leaf\. Thus, the number of active clusters is not fixed in advance, and new clusters are created online when the incoming utterance is not sufficiently similar to any existing cluster\. Existing leaves are not reassigned during this online update\. After insertion, the embedding of the affected cluster and the embeddings of its ancestors are recomputed, and their node annotations are regenerated using the same summarization prompts asSegTreeMem\.
This baseline is non\-temporal because cluster membership is determined by semantic similarity rather than by utterance order\. Consequently, utterances from distant parts of a conversation may be grouped under the same internal node, even when they do not form a contiguous conversational episode\. This makes the baseline useful for testing the structural hypothesis of the paper: whether a temporally ordered hierarchy provides a better memory state than a semantic clustering tree under matched retrieval conditions\.[Table˜9](https://arxiv.org/html/2606.04555#A6.T9)summarizes the main differences between this non\-temporal baseline andSegTreeMem\.
ComponentSegTreeMemNon\-temporal baselineMemory unitIndividual utteranceSameConstruction candidatesRightmost temporal frontierCurrent lowest\-level semantic clusters𝒞t\\mathcal\{C\}\_\{t\}Assignment criterionFrontier compatibility with the incoming utteranceCosine similarity to cluster mean embeddings with thresholdτnt=0\.45\\tau\_\{\\mathrm\{nt\}\}=0\.45New\-cluster creationNew temporal branch when no frontier node is compatibleNew semantic cluster when nearest\-cluster similarity is belowτnt\\tau\_\{\\mathrm\{nt\}\}Temporal constraintEvery internal node covers a contiguous spanNo contiguity constraintLeaf reassignmentExisting leaves are never reorderedExisting leaves are not reassignedAnnotation modelSame LLM summarization promptsSame promptsRetrievalSame local scorer, propagation grid, and evidence budgetSame settingsTable 9:Controlled comparison between temporal Segment Tree construction and the non\-temporal similarity\-clustering baseline\. The baseline is designed to test whether preserving chronological contiguity during construction improves the memory representation, while keeping retrieval and annotation choices fixed\.
## Appendix GPrompt Templates
### G\.1SegTreeMemConstruction Prompts
SegTreeMemuses LLMs only for the LLM\-based construction variants and for internal\-node annotation\. Retrieval and dataset loading do not invoke LLMs\. Specifically, the implementation uses three prompt templates: pointwise LLM compatibility, batch LLM compatibility, and internal\-node annotation\.
##### Pointwise LLM compatibility\.
The pointwise construction variant evaluates one admissible rightmost\-frontier candidate at a time\. The model decides whether the incoming subtree should be merged into the candidate or whether construction should continue upward along the frontier\.
`System prompt User prompt`
`Batch LLM compatibility\. The batch construction variant evaluates the admissible rightmost frontier in a single LLM call\. The frontier candidates are ordered from lower to higher level\. If there are mm admissible candidates, the valid merge labels are dynamically generated as MERGE\_1, …\\ldots, MERGE\_m\. The model returns exactly one merge label or SPLIT\. A label MERGE\_i selects candidate ii as the attachment node, while SPLIT indicates that no supplied frontier candidate is compatible with the incoming subtree\. System prompt User prompt Internal\-node annotation\. After construction creates or updates an internal node, SegTreeMem regenerates the node annotation from its ordered child segments\. The child segments are rendered in chronological order\. The annotation prompt is shared across construction variants\. System prompt User prompt G\.2 LoCoMo Prompts For LoCoMo, retrieved evidence is inserted into the following prompt before answer generation\. Answer generaiton We evaluate generated LoCoMo answers using the following rubric\. The judge receives the question, question category, gold answer, and candidate answer, and returns an integer score from 0 to 3\. LLM judge prompt G\.3 LongMemEval\-MAB Prompts For LongMemEval\-MAB, retrieved evidence is inserted into the following answer\-generation prompt\. Answer generation G\.4 RealMem Prompts For RealMem, we use the official answer\-generation prompt from the RealMemBench evaluation code\. Answer generation RealMem evaluates answer quality by checking consistency between the candidate answer and the user\-related memory\. LLM judge prompt Appendix H Limitations and Broader Impacts Limitations\. Agentic memory systems remain sensitive to the quality of their stored representations, retrieval mechanisms, and summarization procedures\. Errors introduced during memory construction can persist over long horizons, causing agents to retrieve incomplete, outdated, or misleading context\. More broadly, current memory systems still lack robust guarantees for reliability, privacy preservation, and under noisy, adversarial, or rapidly changing user histories\. Broader Impacts\. Long\-term agentic memory has the potential to make AI assistants more useful by supporting continuity across extended interactions, personalized help, and better handling of evolving user goals\. Persistent memory also creates risks around privacy, data retention, and over\-personalization\. Users should remain in control over what is stored, retrieved, edited, and forgotten, and careful analysis should be done to ensure memory improves user outcomes without amplifying harmful or sensitive inferences\. NeurIPS Paper Checklist 1\. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: \[Yes\] Justification: Our abstract and introduction accurately summarize our main contributions: a temporally ordered segment\-tree memory representation, online rightmost\-frontier construction, and structure\-aware relevance propagation\. We evaluate the empirical claims across LoCoMo, LongMemEval\-MAB, and RealMem in Section˜5\. 2\. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: \[Yes\] Justification: We discuss limitations in Appendix C\.3 and Appendix C\.4, including maximally topic\-switching conversations, long\-range topical recurrence, and cases where internal node summaries miss exact details\. We also discuss cases where propagation can dilute local evidence and broader limitations of long\-term memory systems in Appendix H\. 3\. Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof? Answer: \[N/A\] Justification: We formalize the memory representation, online update rule, and retrieval operator, but we do not present theoretical theorems requiring formal proofs\. 4\. Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)? Answer: \[Yes\] Justification: We provide the information needed to reproduce the main experimental results\. Sections 5\.2–5\.4 specify the datasets, baselines, evaluation protocol, model choices, metrics, and main experimental settings\. Appendix E and Appendix G provide implementation details and prompt templates for construction, retrieval, answer generation, and evaluation\. 5\. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: \[Yes\] Justification: We provide anonymized code in the supplemental material, including instructions for reproducing the main experiments\. We use publicly available benchmark datasets and document the dataset choices, baselines, model settings, retrieval configuration, and prompt templates in Section˜5 and Appendix E and Appendix G\. 6\. Experimental setting/details Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results? Answer: \[Yes\] Justification: We specify the datasets, baselines, evaluation metrics, answer\-generation models, judge model, and main experimental configuration in Section˜5\. We provide additional details on construction, retrieval, answer generation, and evaluation prompts in Appendix E and Appendix G\. 7\. Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: \[Yes\] Justification: We report 95% confidence intervals for the main LLM\-judge accuracy and token\-level F1 results in Table˜4\. We compute these intervals over evaluation examples, so they quantify uncertainty due to finite benchmark size rather than variation across model training runs\. 8\. Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments? Answer: \[Yes\] Justification: We describe the compute setting in Appendix E\. Our experiments use pretrained API models and local orchestration rather than model training or finetuning\. We identify the main computational costs, including embedding memory nodes, LLM\-based node annotation, LLM\-based compatibility scoring, answer generation, and LLM\-as\-a\-judge evaluation\. 9\. Code of ethics Answer: \[Yes\] Justification: We use public or benchmark conversational\-memory datasets and evaluate retrieval and answer generation methods\. We have reviewed the NeurIPS Code of Ethics and do not identify deviations from it\. 10\. Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: \[Yes\] Justification: We discuss both positive and negative broader impacts in Appendix H\. We note that long\-term memory can improve continuity, personalization, and support for evolving user goals, while also creating risks around privacy, data retention, over\-personalization, and harmful or sensitive inferences\. 11\. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)? Answer: \[N/A\] Justification: We do not release a pretrained model, scraped dataset, or other high\-risk artifact\. We evaluate memory construction and retrieval methods on existing benchmarks\. 12\. Licenses for existing assets Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: \[Yes\] Justification: We cite the original sources for the datasets, baselines, pretrained models, and evaluation assets used in our experiments\. We do not redistribute the original benchmark datasets or pretrained models; users should obtain those assets from their original sources and comply with the corresponding licenses and terms of use\. Our supplemental code documents dependencies and reproduction instructions\. 13\. New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: \[Yes\] Justification: We release anonymized supplemental code as a new asset\. We do not release a new dataset or pretrained model\. 14\. Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)? Answer: \[N/A\] Justification: Not applicable 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained? Answer: \[N/A\] Justification: Not applicable 16\. Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigor, or originality of the research, declaration is not required\. Answer: \[Yes\] Justification: We use LLMs as core components of both our method and evaluation\. In the method, we use LLMs to generate node annotations and, in the LLM\-based construction variants, to make compatibility judgments during online tree construction\. In the experiments, we use LLMs for answer generation and LLM\-as\-a\-judge evaluation\. We describe these uses in Section˜4 and Section˜5 and provide the corresponding prompts in Appendix G\.`Similar Articles
MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing
MemForest proposes a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing, achieving 6x higher throughput on benchmarks.
Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents
This paper proposes MERIT, a dynamic multi-horizon memory retrieval framework for interactive text-to-SQL agents that uses episode-level and turn-level memory with learned retrieval policies optimized via reinforcement learning and a process reward model for dense rewards. Experiments on BIRD-Interact and Spider2-Snow show that MERIT outperforms static and single-horizon dynamic baselines in success rate while requiring fewer interaction turns.
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Mem0 introduces a scalable memory-centric architecture using graph-based representations to improve long-term conversational coherence in LLMs, significantly reducing latency and token costs while outperforming existing memory systems.
Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents
This paper presents HORMA, a hierarchical organize-and-retrieve memory agent that organizes agent experiences into a file-system-like structure for efficient retrieval, improving performance on long-horizon tasks while reducing token usage.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Memanto introduces a typed semantic memory system using a schema, conflict resolution, and Moorcheh's information-theoretic retrieval engine, achieving state-of-the-art results on LongMemEval and LoCoMo benchmarks with zero ingestion cost and sub-90ms latency.