G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv cs.CL Papers

Summary

G-Long proposes a graph-enhanced memory management framework for long-term dialogue agents, using a fine-tuned small language model for structured triplet extraction and associative retrieval, achieving state-of-the-art performance in response generation and memory retrieval with reduced computational overhead.

arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce the novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.
Original Article
View Cached Full Text

Cached at: 06/12/26, 08:51 AM

# G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents
Source: [https://arxiv.org/html/2606.13115](https://arxiv.org/html/2606.13115)
Minjun Choi Yoonjin Jang Sangwon Youn Youngjoong Ko Sungkyunkwan University, Suwon, Republic of Korea \{alswns078, yoonjinjang98, mikeyoun2000\}@gmail\.com,yjko@skku\.edu

###### Abstract

While Large Language Models \(LLMs\) have advanced open\-domain dialogue systems, maintaining long\-term consistency remains a challenge due to inherent limitations in long\-context reasoning and the inefficiency of processing extensive raw text\. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency\. To address these limitations, we proposeG\-Long, a graph\-enhanced framework that utilizes a fine\-tuned small Language Model \(sLM\) for structured triplet extraction and associative retrieval, significantly reducing operational costs\. Furthermore, we introduce the novel attention\-aware importance scoring mechanism that leverages the intrinsic cross\-attention signals of a T5 summarizer to identify salient memories\. Extensive experiments across diverse benchmarks demonstrate that G\-Long achieves state\-of\-the\-art performance in both response generation and memory retrieval, yielding performance gains of up to 9\.8% in response quality on MSC and 40\.8% in retrieval recall on LME, while significantly minimizing computational overhead\.111The code is available at[https://anonymous\.4open\.science/r/G\-Long\-8E5E/](https://anonymous.4open.science/r/G-Long-8E5E/)

G\-Long: Graph\-Enhanced Memory Management for Efficient Long\-Term Dialogue Agents

Minjun Choi Yoonjin Jang Sangwon Youn Youngjoong Ko††thanks:Corresponding author\.Sungkyunkwan University, Suwon, Republic of Korea\{alswns078, yoonjinjang98, mikeyoun2000\}@gmail\.com,yjko@skku\.edu

![Refer to caption](https://arxiv.org/html/2606.13115v1/figure/figure1_final.png)Figure 1:Comparison of long\-term memory paradigms\. Existing unstructured text\-based memory banks relying on heavy LLMs\(Left\)\. The proposed structured graph\-based memory bank \(G\-Long\) utilizing a local sLM\.\(Right\)\.## 1Introduction

Recent advancements in Large Language Models \(LLMs\) have revolutionized open\-domain dialogue systems, enabling agents to generate fluent and context\-aware responses\. However, establishing long\-term relationships with users requiresLong\-term Memoryto retain persona consistency and recall past events across extended sessionsXuet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib5),[b](https://arxiv.org/html/2606.13115#bib.bib36)\); Onget al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib11)\)\. Without this capability, agents fail to maintain contextual continuity and consequently treat each session as an isolated interaction\. This leads to disjointed and impersonal user experiences\.

To achieve high\-quality generation in long\-term dialogues, memory systems must simultaneously overcome challenges in retrieval precision and computational efficiency\. As shown in Figure[1](https://arxiv.org/html/2606.13115#S0.F1)\(Left\), existing frameworks commonly utilize large\-scale LLMs for memory management to maintain an unstructured memory bank via paragraph summaries \(Tanet al\.,[2025](https://arxiv.org/html/2606.13115#bib.bib3); Zhonget al\.,[2024](https://arxiv.org/html/2606.13115#bib.bib2); Luet al\.,[2023](https://arxiv.org/html/2606.13115#bib.bib38); Chenet al\.,[2025](https://arxiv.org/html/2606.13115#bib.bib39)\)\. These conventional frameworks inherently have the following limitations\. First, text\-based summaries easily lose critical details over time\. When complex long\-term conversations are compressed into plain paragraphs, the meaningful relationships between entities become blurred\. For example, specific context such as “deep\-sea fishing” is generalized into abstract terms like “\[fishing\],” which leads standard dense retrieval methods to struggle with tracing exact facts across past conversations and impairs response generation\. Second, unstructured verbose paragraph summaries directly into large LLMs for memory operations leads to massive token consumption and computational overhead, making the system highly inefficient\.

To overcome these limitations, we advocate for a shift from unstructured text storage to a graph\-based memory bank, as illustrated in Figure[1](https://arxiv.org/html/2606.13115#S0.F1)\(Right\)\. By explicitly modeling entities and their relations into discrete triplets with precise importance scores \(e\.g\., ranking\(Article, topic, Deep\-sea fishing\)at0\.98\), the graph\-based paradigm mitigates retrieval ambiguity and stores rich contexts in a compact representation\. Furthermore, this structural compactness allows the framework to inject highly condensed and relevant contexts into the LLM when compared with verbose text paragraphs, which can reduce prompt token overhead and inference costs\. Therefore, we proposeG\-Long, a resource\-efficient and graph\-enhanced framework for scalable and precise long\-term dialogue memory management\. Extensive evaluations across MSC, CC, LoCoMo, and LME benchmarks demonstrates that G\-Long achieves the state\-of\-the\-art performance, driven by three key contributions\.

- •First, apersona\-driven graph structureestablishes a fine\-grained and triplet\-based memory bank optimized for long\-term dialogues\. By capturing core relational patterns aligned with conversational personas \(e\.g\., viaPersonaExt\), G\-Long resolves critical problems from conversational noise and paragraph blur, yielding a 9\.8% improvement in response quality \(BLEU\-2\) on the MSC dataset\.
- •Second, theassociative retrieval mechanismintegrated with anattention\-aware importance scoring modulemaximizes retrieval precision\. G\-Long navigates topological connections via structured subgraph expansion while filtering retrieved triplets based on attention\-driven importance signals\. This approach bridges the gap between surface\-level similarity and multi\-hop reasoning, yielding a 40\.8% improvement in retrieval performance \(Recall@3\) on the LME benchmark\.
- •Finally, aresource\-efficient framework designminimizes overall operational costs\. By offloading memory construction to a fine\-tuned small Language Model \(sLM\) and feeding compact graph triplets instead of long paragraphs into the LLM, G\-Long eliminates prohibitive LLM\-API dependencies\. This reduces memory maintenance costs by4\.9×4\.9\\timesand token consumption by 63\.0% without sacrificing generation quality\.

![Refer to caption](https://arxiv.org/html/2606.13115v1/figure/figure_revision.png)Figure 2:Overview of the G\-Long framework\.
## 2Related Work

##### Long\-term Dialogue Systems

Significant efforts have been made to construct high\-quality benchmarks for evaluating long\-term dialogue consistency\. Representative datasets includeMSCXuet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib5)\)andCCJanget al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib6)\)for open\-domain dialogue\. Furthermore,LongMemEvalWuet al\.\([2025a](https://arxiv.org/html/2606.13115#bib.bib7)\)andLoCoMoMaharanaet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib31)\)have been introduced to rigorously evaluate specific memory retention capabilities, whileCareCallBaeet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib10)\)extends the scope to the healthcare domain\.

While the most intuitive approach to handling long\-term dialogue is directly concatenating the entire dialogue history, this incurs prohibitive computational costs and degrades performance due to the "lost\-in\-the\-middle" phenomenonLiuet al\.\([2023a](https://arxiv.org/html/2606.13115#bib.bib13)\)\. Early studiesXuet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib5)\); Baeet al\.\([2022b](https://arxiv.org/html/2606.13115#bib.bib37)\)relied on simple summarization to compress context, which often sacrifices detail for brevity\. To address this, recent approaches have shifted towards structured external memory systems that leverage granular abstraction for precise retentionZhanget al\.\([2023a](https://arxiv.org/html/2606.13115#bib.bib12)\); Onget al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib11)\)\. For instance,MemoryBankZhonget al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib2)\)imitates human forgetting to preserve salient information efficiently\. To enhance retrieval precision,FraComKeet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib9)\)andRMMTanet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib3)\)focus on granular restructuring by decomposing histories into proposition\-level units or semantic topics\. Furthermore,LD\-AgentLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\)tackles the complexity of long\-term interaction by separating memory into independent event and persona modules to track narrative trajectories and user characteristics distinctively\.

Despite these advances, heavy reliance on LLMs for memory management limits scalability, and unstructured representations hinder precise retrieval and informative response generation\.

##### Memory management for Long\-term Dialogue Agents

A parallel line of work has investigated structured memory representations that convert dialogue into discrete semantic units\. The most common substrate is the \(subject, relation, object\) tripletTrisedyaet al\.\([2019](https://arxiv.org/html/2606.13115#bib.bib40)\)\.Vossenet al\.,[2024](https://arxiv.org/html/2606.13115#bib.bib15)andLiet al\.,[2024](https://arxiv.org/html/2606.13115#bib.bib18)extract open\-domain relations to enrich dialogue understanding, whileWuet al\.,[2020](https://arxiv.org/html/2606.13115#bib.bib17)andGenReWanget al\.\([2022](https://arxiv.org/html/2606.13115#bib.bib16)\)target attribute\-level triplets for building dynamic user profiles\.PAEDZhuet al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib4)\)introduces a contrastive learning model for generalized zero\-shot persona attribute extraction, andPapalucaet al\.,[2023](https://arxiv.org/html/2606.13115#bib.bib14)andDenget al\.,[2024](https://arxiv.org/html/2606.13115#bib.bib41)show that LLMs can construct knowledge graphs from text in zero\-shot settings\. These methods establish triplets as a viable substrate for dialogue memory, but they rely on heavy LLM calls for extraction and treat the resulting triplets as static, with no mechanism for evaluating their relative importance to retrieval\.

Once such structured units are obtained, the next step is to organize them into graphs\. Recent work in this direction—includingGraphRAGEdgeet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib28)\),HippoRAGGutiérrezet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib32)\), andLinearRAGZhuanget al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib33)\)—focuses on synthesizing global insights or navigating complex relations within a static document corpus\.Heet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib48)\); Mavromatis and Karypis\([2025](https://arxiv.org/html/2606.13115#bib.bib49)\); Huet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib50)\)This static assumption creates two structural mismatches with long\-term dialogue: \(i\) Named Entity Recognition is unreliable on conversational utterances that often contain pronouns and ellipsis, and \(ii\) reflecting an evolving persona would require rebuilding the graph after every session\. Beyond graph\-based organization, other external memory architectures likeLongMemWanget al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib35)\)utilize specialized key–value side\-networks; however, such designs are incompatible with our plug\-and\-play design over frozen black\-box APIs\.

Closer to our setting, recent work has begun adapting graph\-based memory to the dialogue domain itself\. WhereasFraComKeet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib9)\)operates at the proposition level,SGMemWuet al\.\([2025b](https://arxiv.org/html/2606.13115#bib.bib34)\)instead constructs sentence\-level graphs across turn\-, round\-, and session\-level chunks while combining retrieved raw dialogue with LLM\-generated summaries, facts, and insights\. Yet both still inherit the limitations they aim to address: proposition\-based decomposition frequently yields uninformative coreference fragments \(e\.g\., "it", "he"\) that dilute the retrieval space, and SGMem’s strongest variants reintroduce the very LLM\-API dependency that should be eliminated\.

## 3Methodology

### 3\.1Framework Overview

Figure[2](https://arxiv.org/html/2606.13115#S1.F2)illustrates the overall architecture of our proposed framework,G\-Long\. To handle the long\-term memory of the dialogues, we adopt graph\-based approaches to long\-term memory bank management\. The proposed framework consists of four components: \(1\)efficient memory construction, \(2\)graph\-based memory bank, \(3\)associative memory retrieval, and \(4\)response generation\. We describe the details of each component in the following subsections\.

### 3\.2Efficient Memory Construction

To transform an unstructured dialogue into a structured graph representation, we utilize thetriplet extraction module, which extracts triplets from each utterance within a conversation, and theattention\-aware importance scoring module, which assigns importance scores to each extracted triplet\.

##### Triplet Extraction Module

In this module, we employ a fine\-tuned sLM to parse raw utterances into structured facts as\(s​u​b​j​e​c​t,r​e​l​a​t​i​o​n,o​b​j​e​c​t\)\(subject,relation,object\)triplets\. To achieve this, the sLM undergoes instruction tuning on a dialogue\-triplet dataset \(e\.g\.,PersonaExtZhuet al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib4)\)\)\. This training process enables the model to effectively extract schema\-compliant triplets and generate a consistent set of relations\. Training hyperparameters and data preprocessing details are provided in Appendix[B](https://arxiv.org/html/2606.13115#A2)\.

##### Attention\-Aware Importance Scoring Module

To quantify triplet saliency without external LLM costs, we leverage cross\-attention maps from a T5 summarizer finetuned onSAMSum\(Raffelet al\.,[2023](https://arxiv.org/html/2606.13115#bib.bib29); Gliwaet al\.,[2019](https://arxiv.org/html/2606.13115#bib.bib20)\)\. Specifically, we concatenate all extracted triplets into a single input sequenceX=\[x1,…,xi,…,xN\]X=\[x\_\{1\},\.\.\.,x\_\{i\},\.\.\.,x\_\{N\}\]to generate a summary sequenceY=\[y1,…,yj,…,yM\]Y=\[y\_\{1\},\.\.\.,y\_\{j\},\.\.\.,y\_\{M\}\], withxix\_\{i\}andyjy\_\{j\}constituting individual tokens\. We first average cross\-attention weightsAi,jA\_\{i,j\}across allHHheads of the last decoder layer:

Ai,j=1H​∑h=1HAttnh​\(xi,yj\)\.A\_\{i,j\}=\\frac\{1\}\{H\}\\sum\_\{h=1\}^\{H\}\\mathrm\{Attn\}\_\{h\}\(x\_\{i\},y\_\{j\}\)\.\(1\)The raw saliency scoreS​\(tk\)S\(t\_\{k\}\)for triplettkt\_\{k\}is then computed by aggregating weights over theMMsummary tokens and the triplet’s token span:

S​\(tk\)=∑xi∈Tokens​\(tk\)∑j=1MAi,j\.S\(t\_\{k\}\)=\\sum\_\{x\_\{i\}\\in\\mathrm\{Tokens\}\(t\_\{k\}\)\}\\sum\_\{j=1\}^\{M\}A\_\{i,j\}\.\(2\)Finally, we normalize these scores by the session maximum to yield the relative weightwk∈\(0,1\]w\_\{k\}\\in\(0,1\]assigned to graph edges \(Section[3\.3](https://arxiv.org/html/2606.13115#S3.SS3)\) for reranking \(Section[3\.4](https://arxiv.org/html/2606.13115#S3.SS4)\):

wk=S​\(tk\)maxk′⁡S​\(tk′\)\.w\_\{k\}=\\frac\{S\(t\_\{k\}\)\}\{\\max\_\{k^\{\\prime\}\}S\(t\_\{k^\{\\prime\}\}\)\}\.\(3\)

### 3\.3Graph\-Based Memory Bank

To effectively store and associate the extracted knowledge, we construct the graph\-based memory bank, formally denoted as𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)\. Unlike simple key\-value stores, our architecture adopts the dual\-storage strategy that combinesdense vector indexing for entity nodesandattribute\-rich edge storage\.

##### Dense Vector Indexing for Entity Nodes

The set of nodes𝒱\\mathcal\{V\}contains unique entities, which consist ofs​u​b​j​e​c​t​ssubjectsando​b​j​e​c​t​sobjects, identified in the dialogue\. For the semantic entry point search described in Section[3\.4](https://arxiv.org/html/2606.13115#S3.SS4)\(Anchor Node Identification\), each entity nodev∈𝒱v\\in\\mathcal\{V\}is embedded into a high\-dimensional vector space using a pre\-trained encoder \(e\.g\.,MiniLMWanget al\.\([2020](https://arxiv.org/html/2606.13115#bib.bib22)\)\) and stored in a vector database \(e\.g\.,ChromaDB222[https://www\.trychroma\.com/](https://www.trychroma.com/)\)\. This embedding\-based indexing allows the system to identify relevant entities even when the user’s query uses synonymous or paraphrased terms, to overcome the limitations of exact string matching\.

##### Attribute\-Rich Edge Storage

The set of edgesℰ\\mathcal\{E\}contains directed relations between entities\. In our work, a multi\-directed graph is employed to accommodate overlapping interactions and specially, G\-Long represents edges as attribute\-rich containers formally denoted ases→o=\{r,w,τc​r​e​a​t​e,τa​c​c​e​s​s\}e\_\{s\\rightarrow o\}=\\\{r,w,\\tau\_\{create\},\\tau\_\{access\}\\\}unlike standard architectures;rris the relation text extracted from the dialogue \(e\.g\.,location\),w∈\(0,1\]w\\in\(0,1\]is the normalized importance score derived from Eq\.\(3\), andτ\\taurepresents the timestamps for memory creation and the latest access\. By embedding these temporal and significant attributes directly into the graph structure, G\-Long enables the retrieval mechanism to dynamically balance semantic relevance with recency during inference\.

### 3\.4Associative Memory Retrieval

G\-Long implements the memory retrieval mechanism to extract contextually relevant and significant memories\. The process is designed as a coarse\-to\-fine pipeline consisting of two primary phases: \(1\)retrieval candidate expansionand \(2\)two\-stage hybrid reranking\.

##### Retrieval Candidate Expansion

To initiate the search, we first identifyAnchor Nodesin the graph𝒢\\mathcal\{G\}, which serve as the essential starting points for graph traversal\. TheT​o​p−NTop\-Nentities, which are most semantically similar to the current user utteranceutu\_\{t\}, are selected as anchor nodes from the shared embedding space of entity nodes by usingsim⁡\(𝐄​\(ut\),𝐄​\(v\)\)\\operatorname\{sim\}\(\\mathbf\{E\}\(u\_\{t\}\),\\mathbf\{E\}\(v\)\);𝐄​\(⋅\)\\mathbf\{E\}\(\\cdot\)is a dense encoder andsim⁡\(⋅\)\\operatorname\{sim\}\(\\cdot\)denotes cosine similarity\. Starting from these anchor nodes, we perform a multi\-hop subgraph expansion to gather connected triplets withinLL\-hops\. This expansion effectively constructs a candidate triplet set𝒯c​a​n​d\\mathcal\{T\}\_\{cand\}by capturing associative memories structurally linked to the query concepts\.

##### Two\-Stage Hybrid Reranking

To filter noise and prioritize essential information, we first perform a coarse semantic ranking by retaining the Top\-NNtriplets𝒯s​e​m\\mathcal\{T\}\_\{sem\}from𝒯c​a​n​d\\mathcal\{T\}\_\{cand\}based on their similarity toutu\_\{t\}\. Subsequently, we refine this ranking by leveraging the rich edge attributes stored in the memory bank \(Section[3\.3](https://arxiv.org/html/2606.13115#S3.SS3.SSS0.Px2)\)\. The final scoreSf​i​n​a​lS\_\{final\}is computed as:

Sf​i​n​a​l​\(t\)=β1⋅Ss​e​m​\(t\)\+β2⋅w\+β3⋅e−λ​Δ​τS\_\{final\}\(t\)=\\beta\_\{1\}\\cdot S\_\{sem\}\(t\)\+\\beta\_\{2\}\\cdot w\+\\beta\_\{3\}\\cdot e^\{\-\\lambda\\Delta\\tau\}\(4\)wherewwis the triplet importance andΔ​τ=τc​u​r​r​e​n​t−τa​c​c​e​s​s\\Delta\\tau=\\tau\_\{current\}\-\\tau\_\{access\}denotes the time elapsed after the memory was last accessed, withτc​u​r​r​e​n​t\\tau\_\{current\}denoting the timestamp of the current dialogue session\. Following the recency weighting strategy inParket al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib27)\), we utilize the exponential decay factorλ\\lambdato prioritize recent interactions\. Based on this score, the Top\-KKtriplets𝒯r​e​t​r​i​e​v​e​d\\mathcal\{T\}\_\{retrieved\}are selected to form the final context for response generation\.

### 3\.5Response Generation

To generate system responses, we serialize the retrieved triplets𝒯r​e​t​r​i​e​v​e​d\\mathcal\{T\}\_\{retrieved\}into a textual sequenceℳL​o​n​g=⨁i=1K⟨si,ri,oi⟩\\mathcal\{M\}\_\{Long\}=\\bigoplus\_\{i=1\}^\{K\}\\langle s\_\{i\},r\_\{i\},o\_\{i\}\\ranglewhere⊕\\oplusdenotes string concatenation of the subject \(sis\_\{i\}\), relation \(rir\_\{i\}\), and object \(oio\_\{i\}\)\. We then construct the final prompt𝒫\\mathcal\{P\}by concatenating the system instructionℐ\\mathcal\{I\},ℳl​o​n​g\\mathcal\{M\}\_\{long\}, and the current dialogue history𝒞c​u​r​r​e​n​t\\mathcal\{C\}\_\{current\}to generate the responseRR:

𝒫=\[ℐ;ℳL​o​n​g;𝒞c​u​r​r​e​n​t\],R=LLM​\(𝒫\)\\mathcal\{P\}=\[\\mathcal\{I\};\\mathcal\{M\}\_\{Long\};\\mathcal\{C\}\_\{current\}\],\\quad R=\\text\{LLM\}\(\\mathcal\{P\}\)
Table 1:Performance comparison on MSC and CC datasets across sessions \(N=500N=500, as in following prior work\)\.

## 4Experiments

### 4\.1Experimental Setup

#### 4\.1\.1Datasets

To investigate the effectiveness of G\-Long in long\-term dialogue scenarios, we conduct experiments on the test sets of two representative multi\-session datasets: theMulti\-Session Chat \(MSC\)datasetXuet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib5)\)and theConversation Chronicles \(CC\)datasetJanget al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib6)\)\. To further assess the scalability of our framework across exceptionally long contexts, we incorporate theLoCoModatasetMaharanaet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib31)\)\. Additionally, we utilize theLongMemEval \(LME\)datasetWuet al\.\([2025a](https://arxiv.org/html/2606.13115#bib.bib7)\), a large\-scale long\-term conversational benchmark, to directly measure retrieval performance and evaluate the intrinsic accuracy of the retrieved memory\. Dataset details are provided in Appendix[A](https://arxiv.org/html/2606.13115#A1)\.

#### 4\.1\.2Baselines

To benchmark the performance of G\-Long, we compare it against six competitive baselines categorized by their memory storage and structural properties: Unstructured & Text\-based Memory Frameworks \(No History,Long Context,MemoryBankZhonget al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib2)\),LD\-AgentLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\)\) and Structured & Graph\-based Frameworks \(FraComKeet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib9)\),HippoRAGGutiérrezet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib32)\)\)\. To systematically evaluate the performance across tasks, all baselines are evaluated on both memory retrieval and dialogue generation, with the sole exception of HippoRAG, which is evaluated strictly on the retrieval task due to its design as a specialized retriever without native end\-to\-end generation prompts for long\-term dialogue\. To ensure a rigorous evaluation that isolates the efficacy of the memory mechanism itself, we standardize the backbone model for all components togpt\-4o\-mini\. Detailed configurations and implementation details for G\-Long and all baselines are provided in Appendix[B](https://arxiv.org/html/2606.13115#A2)and[C](https://arxiv.org/html/2606.13115#A3), while comprehensive results demonstrating the generalization capability across various other backbone LLMs are detailed in Appendix[H](https://arxiv.org/html/2606.13115#A8)\.

#### 4\.1\.3Metrics

We employ a multi\-faceted evaluation strategy comprising both automatic and human evaluations\. For response generation tasks \(MSC, CC\), we utilize standardnn\-gram metrics \(BLEUPapineniet al\.\([2002](https://arxiv.org/html/2606.13115#bib.bib24)\),ROUGE\-LLin\([2004](https://arxiv.org/html/2606.13115#bib.bib25)\)\) to measure lexical overlap, alongsideG\-EvalLiuet al\.\([2023b](https://arxiv.org/html/2606.13115#bib.bib21)\)to measure coherence, engagingness, and fluency\. For the long\-term memory QA task \(LME\), we adoptRecallandMRRto evaluate retrieval coverage andAccuracyto assess answer correctness\. Finally, we conductHuman Evaluationto qualitatively validate long\-term consistency\. Detailed definitions and evaluation protocols are provided in Appendix[D](https://arxiv.org/html/2606.13115#A4)\.

Table 2:LLM\-based evaluation \(G\-Eval\) results on MSC and CC dataset \(N=100\)\.

### 4\.2Main Results on Response Generation

Table[1](https://arxiv.org/html/2606.13115#S3.T1)shows that G\-Long achieves state\-of\-the\-art performance on both the MSC and CC datasets, consistently outperforming strong baselines like MemoryBank and LD\-Agent\. On average, G\-Long surpasses LD\-Agent by 9\.8% in BLEU\-2 and 3\.8% in ROUGE\-L on the MSC dataset\. These gains are even more pronounced on the deeper CC dataset, where G\-Long improves BLEU\-2 by 8\.9% over LD\-Agent and 24\.6% over MemoryBank\.

Notably, G\-Long maintains this superiority across all individual dialogue sessions \(Sessions 2–5\)\. This consistency validates that our graph\-enhanced memory management and importance\-aware retrieval effectively capture context and maintain coherence regardless of session depth\. A qualitative case study detailing these improvements is provided in Appendix[J](https://arxiv.org/html/2606.13115#A10)\.

![Refer to caption](https://arxiv.org/html/2606.13115v1/figure/figure3_humaneval.png)Figure 3:Human evaluation results on MSC and CC dataset \(N=50\)\.##### Qualitative Evaluation

To complement automatic n\-gram metrics, we employed both LLM\-based evaluation and human evaluation\. As shown in Table[2](https://arxiv.org/html/2606.13115#S4.T2), G\-Long consistently achieves the highest average scores across both MSC and CC datasets, with a particularly substantial margin in engagingness—indicating that retrieving structured and pertinent facts allows the agent to generate richer and more interesting responses than the baselines\.

These findings were further validated through a pairwise human evaluation on 50 randomly sampled instances\. As shown in Figure[3](https://arxiv.org/html/2606.13115#S4.F3), G\-Long was consistently preferred over all baselines, with win rates of 64%, 52%, and 54% against LD\-Agent, MemoryBank, and Long Context, respectively—confirming that our graph\-enhanced memory management aligns with human preferences for contextually accurate and engaging dialogue\. Full evaluation protocol is provided in Appendix[E](https://arxiv.org/html/2606.13115#A5)\.

### 4\.3Scalability Assessment on Modern Benchmarks \(LoCoMo\)

To evaluate scalability on more recent, larger\-scale benchmarks, we tested G\-Long on LoCoMo \(avg\. 27 sessions per sample\)\. Due to computational constraints, we evaluated 50 representative samples, which nonetheless involve cumulative turns and tokens exceeding standard MSC setups\. As shown in Table[3](https://arxiv.org/html/2606.13115#S4.T3), G\-Long consistently outperforms LD\-Agent across all lexical metrics\. This demonstrates that our graph\-enhanced memory and two\-stage retrieval effectively generalize to extremely long\-term, diverse dialogue scenarios\.

Table 3:Generation Performance on the Large\-Scale LoCoMo Dataset \(N=50\)\.
### 4\.4Analysis of Memory Retrieval Accuracy

To validate the intrinsic quality of the retrieved memory, we evaluate retrieval performance using the LME dataset\. Table[4](https://arxiv.org/html/2606.13115#S4.T4)presents the Recall@kkand Accuracy@kkscores \(fork=3,5k=3,5\), which quantify the model’s ability to successfully retrieve the ground\-truth utterance containing the necessary evidence within the top\-kkcandidates\.

As shown in Table[4](https://arxiv.org/html/2606.13115#S4.T4), G\-Long significantly outperforms both baselines, LD\-Agent and MemoryBank, across all reported metrics\. Notably, G\-Long achieves a substantial performance leap in the strictest metrics, recording a Recall@3 of \(0\.6286\) and an Accuracy@3 of \(0\.5783\)\. This corresponds to a relative improvement of approximately 40\.8% and 30\.2% respectively, compared to the strongest baseline, LD\-Agent \(0\.4462 and 0\.4440\)\.

This demonstrates the structural advantage of our approach\. Unlike baselines, which often retrieve semantically similar but irrelevant sentences \(noise\), G\-Long’s graph structure enables precise navigation to the target triplets\. Furthermore, the high Recall@5 score \(0\.7220\) confirms that our system reliably preserves critical long\-term information within its candidate pool\.

Table 4:Retrieval and QA performance comparison on the LME dataset \(N=500\)\.
### 4\.5Systematic Comparison on Diverse Memory Architectures

To evaluate the structural superiority of G\-Long, we conducted retrieval experiments on a subset of the LME dataset \(N=50N=50\)\. This evaluation focuses on how different memory representations handle the inherent noise and sparsity of conversational data\.

G\-Long vs\. Proposition\-level Retrieval:As shown in Table[5](https://arxiv.org/html/2606.13115#S4.T5), FraCom’s performance is significantly hampered by the prevalence of coreferences \(e\.g\., “it”, “he”\) in dialogue, which often leads to uninformative proposition units\. Even after filtering common coreferences \(FraCom \+ Removed Coreference\), G\-Long maintains a substantial lead\. This suggests that our structured triplet representation, optimized via our fine\-tuned sLM, captures the core semantic intent of dialogue more effectively than simple proposition\-level decomposition\.

G\-Long vs\. GraphRAG Architectures:We further compare G\-Long with HippoRAG to isolate the effectiveness of our search mechanism\. HippoRAG’s reliance on NER and Personalized PageRank \(PPR\) struggles with conversational contexts where explicit entities are scarce\. To rigorously verify the search component, we applied HippoRAG’s PPR on the triplets extracted by G\-Long \(HippoRAG \+ G\-Long Triplet\)\. While the high\-quality triplets improved its performance, G\-Long’s native subgraph expansion still prevailed\. This empirically proves that our subgraph search, which prioritizes immediate semantic neighbors, is more suitable for dialogue retrieval than general graph\-based ranking algorithms\.

Table 5:Comparison of Retrieval Performance against Diverse Memory Architectures on the LME Dataset \(N=50\)\.MethodMainMemoryTotal Cost\(Tokens\)InputGen\.\($/1K\)MSCLong Context28,058004\.21MemoryBank13,74214,6452,2955\.63LDA15,76711,3525,9107\.61G\-Long \(Ours\)10,380001\.56CCLong Context20,327003\.05MemoryBank13,70911,8922,1164\.73LDA13,41210,1444,4816\.22G\-Long \(Ours\)9,230001\.38

Table 6:Comparison of token efficiency and estimated API costs on MSC and CC datasets\. \(N=500\)
### 4\.6Token Efficiency and API Cost Analysis

We evaluate G\-Long’s token consumption through response generation \("Main"\) and memory management \("Memory"\), with costs based ongpt\-4o\-mini\(Table[6](https://arxiv.org/html/2606.13115#S4.T6)\)\.

##### Response Generation Efficiency

G\-Long requires only 10,380 tokens on the MSC dataset, achieving a 63\.0% reduction over the Long Context baseline \(28,058\) and outperforming the summary\-based LD\-Agent \(15,767\)\. This highlights that our structured triplet representation delivers high\-density information far more efficiently than verbose natural language summaries\.

##### Elimination of API\-Based Memory Management

By employing a fine\-tuned small Language Model \(sLM\) for triplet extraction, G\-Long offloads memory maintenance to local resources, eliminating iterative API costs\. Consequently, G\-Long \($1\.56\) is 4\.9x more cost\-effective than LD\-Agent \($7\.61\), confirming its robust computational and financial efficiency\.

### 4\.7Ablation Study

In this section, we investigate the impact of key design choices in G\-Long\. While a detailed analysis of the importance scoring module’s noise\-filtering effect is provided in Appendix[I](https://arxiv.org/html/2606.13115#A9), we focus here on the most critical hyperparameter: the subgraph expansion depth \(LL\)\.

#### 4\.7\.1Sensitivity to Hop Count and Coverage

As shown in Table[7](https://arxiv.org/html/2606.13115#S4.T7), extending the subgraph expansion toL=2L=2leads to a notable degradation in performance\. This confirms that traversing high\-degree hub nodes \(e\.g\., “User”\) introduces “Semantic Drift,” where semantically irrelevant noise enters the candidate pool\. Figure[4](https://arxiv.org/html/2606.13115#S4.F4)further validates that our 1\-hop search space provides exceptional coverage; G\-Long achieves a Recall@20 of 0\.9321 and a significantly higher MRR@20 \(0\.674\) compared to Memory Bank \(0\.517\) and LD\-Agent \(0\.495\)\. The rapid ascent of G\-Long’s recall curve indicates that relevant triplets are densely clustered near the anchor nodes, justifying our choice ofL=1L=1for achieving both high retrieval accuracy and computational efficiency\.

Table 7:Sensitivity Analysis of Hop Count \(LL\) on the LME Dataset \(N=500\)\.![Refer to caption](https://arxiv.org/html/2606.13115v1/figure/figure_retrieval_performance.jpg)Figure 4:Comparative analysis of retrieval Recall and MRR across varying candidate sizes \(KK\) on the LME dataset \(N=500\)\.

## 5Conclusion

In this paper, we introducedG\-Long, the resource\-efficient graph\-enhanced long\-term dialogue framework designed to overcome the limitations of unstructured memory representations and the high computational costs inherent in long\-term dialogue systems\. By leveraging the fine\-tuned sLM and the attention\-aware importance scoring module, G\-Long constructs the structured knowledge Graph that captures essential context while filtering out noise from raw dialogue history\.

Crucially, G\-Long eliminates prohibitive API overhead, providing a cost\-effective alternative to existing competitive frameworks while achieving SOTA results across MSC, CC, and the large\-scale LoCoMo dataset\. Furthermore, G\-Long’s superior retrieval performance on LME—outperforming general\-purpose graph architectures—validates that our framework is uniquely optimized for the nuanced dynamics of long\-term dialogue\. Consequently, G\-Long serves as a scalable, plug\-and\-play solution for maintaining consistent and engaging agents in resource\-constrained environments\.

## Limitations

While G\-Long demonstrates significant improvements in long\-term memory retrieval and operational efficiency, we acknowledge certain limitations inherent to our design choices\.

First, the conversion of unstructured dialogue into structured triplets entails a trade\-off between information density and semantic granularity\. While triplets efficiently capture core facts \(e\.g\.,\(User, like, Hiking\)\), highly subtle emotional undertones or stylistic nuances present in raw text may be abstracted away during this discretization process\. Although our experiments show that this does not hinder factual response generation, future iterations could explore hybrid storage mechanisms that link graph nodes to their original text segments to preserve richer stylistic contexts\.

Second, as a pipeline framework, the overall quality of the memory bank depends on the robustness of the Triplet Extraction Module\. While our instruction\-tuned sLM demonstrates high accuracy on standard benchmarks, extreme ambiguities or complex coreferences in user utterances could potentially introduce noise into the graph\. Currently, our attention\-aware importance scoring mechanism mitigates this by assigning low weights to trivial or noisy triplets, but developing a self\-correcting mechanism that verifies factual consistency within the graph remains a promising direction for future work\.

Furthermore, while the graph\-based representation effectively preserves factual consistency, our qualitative error analysis reveals two inherent structural boundaries \(detailed in Appendix[J\.3](https://arxiv.org/html/2606.13115#A10.SS3)\)\. First, sparse queries containing unresolved anaphora \(e\.g\., demonstrative pronouns without explicit entities\) can lead to referential disconnect, causing the retriever to fetch irrelevant nodes\. Second, we observed instances of semantic drift where the retriever over\-prioritizes literal keyword overlaps \(e\.g\., specific color descriptors\) at the expense of global conversational intent\. These findings suggest that future iterations could benefit from a hybrid approach that more tightly couples graph nodes with dynamic contextual resolution\.

## Ethical Considerations

Our experiments utilized anonymized public datasets to ensure data privacy\. Beyond standard compliance, G\-Long offers a distinct ethical advantage through its structured knowledge graph\. Unlike unstructured vector stores, the explicit nature of graph triplets enables transparent interpretability and precise control, allowing users or administrators to identify and delete specific memories without retraining the model\. This capability directly supports the "Right to be Forgotten" and promotes responsible data management\. Regarding potential risks, our framework relies on generative models for both memory extraction \(sLM\) and response generation \(backbone LLM\)\. Consequently, there is a possibility that incorrect triplets may be extracted, or the final model may exhibit hallucinations or biases inherent in pre\-trained models\. Therefore, we emphasize the necessity of human oversight and fairness\-aware safeguards when deploying such systems in sensitive environments\.

## References

- S\. Abnar and W\. Zuidema \(2020\)Quantifying attention flow in transformers\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4190–4197\.External Links:[Link](https://aclanthology.org/2020.acl-main.385/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.385)Cited by:[Appendix I](https://arxiv.org/html/2606.13115#A9.p2.1)\.
- S\. Bae, D\. Kwak, S\. Kang, M\. Y\. Lee, S\. Kim, Y\. Jeong, H\. Kim, S\. Lee, W\. Park, and N\. Sung \(2022a\)Keep me updated\! memory management in long\-term conversations\.External Links:2210\.08750,[Link](https://arxiv.org/abs/2210.08750)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Bae, D\. Kwak, S\. Kim, D\. Ham, S\. Kang, S\. Lee, and W\. Park \(2022b\)Building a role specified open\-domain dialogue system leveraging large\-scale language models\.External Links:2205\.00176,[Link](https://arxiv.org/abs/2205.00176)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1)\.
- N\. Chen, H\. Li, J\. Chang, J\. Huang, B\. Wang, and J\. Li \(2025\)Compress to impress: unleashing the potential of compressive memory in real\-world long\-term conversations\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 755–773\.External Links:[Link](https://aclanthology.org/2025.coling-main.51/)Cited by:[§1](https://arxiv.org/html/2606.13115#S1.p2.1.1)\.
- K\. Clark, U\. Khandelwal, O\. Levy, and C\. D\. Manning \(2019\)What does BERT look at? an analysis of BERT’s attention\.InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,T\. Linzen, G\. Chrupała, Y\. Belinkov, and D\. Hupkes \(Eds\.\),Florence, Italy,pp\. 276–286\.External Links:[Link](https://aclanthology.org/W19-4828/),[Document](https://dx.doi.org/10.18653/v1/W19-4828)Cited by:[Appendix I](https://arxiv.org/html/2606.13115#A9.p2.1)\.
- S\. Deng, Y\. Ma, N\. Zhang, Y\. Cao, and B\. Hooi \(2024\)Information extraction in low\-resource scenarios: survey and perspective\.External Links:2202\.08063,[Link](https://arxiv.org/abs/2202.08063)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)QLORA: efficient finetuning of quantized llms\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[Appendix B](https://arxiv.org/html/2606.13115#A2.SS0.SSS0.Px1.p1.3)\.
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, D\. Metropolitansky, R\. O\. Ness, and J\. Larson \(2025\)From local to global: a graph rag approach to query\-focused summarization\.External Links:2404\.16130,[Link](https://arxiv.org/abs/2404.16130)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.
- B\. Gliwa, I\. Mochol, M\. Biesek, and A\. Wawer \(2019\)SAMSum corpus: a human\-annotated dialogue dataset for abstractive summarization\.InProceedings of the 2nd Workshop on New Frontiers in Summarization,L\. Wang, J\. C\. K\. Cheung, G\. Carenini, and F\. Liu \(Eds\.\),Hong Kong, China,pp\. 70–79\.External Links:[Link](https://aclanthology.org/D19-5409/),[Document](https://dx.doi.org/10.18653/v1/D19-5409)Cited by:[Appendix B](https://arxiv.org/html/2606.13115#A2.SS0.SSS0.Px1.p3.1),[§3\.2](https://arxiv.org/html/2606.13115#S3.SS2.SSS0.Px2.p1.6)\.
- B\. J\. Gutiérrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\)HippoRAG: neurobiologically inspired long\-term memory for large language models\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=hkujvAPVsg)Cited by:[6th item](https://arxiv.org/html/2606.13115#A3.I1.i6.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1),[§4\.1\.2](https://arxiv.org/html/2606.13115#S4.SS1.SSS2.p1.1)\.
- X\. He, Y\. Tian, Y\. Sun, N\. V\. Chawla, T\. Laurent, Y\. LeCun, X\. Bresson, and B\. Hooi \(2024\)G\-retriever: retrieval\-augmented generation for textual graph understanding and question answering\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=MPJ3oXtTZl)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[Appendix B](https://arxiv.org/html/2606.13115#A2.SS0.SSS0.Px1.p1.3)\.
- Y\. Hu, Z\. Lei, Z\. Zhang, B\. Pan, C\. Ling, and L\. Zhao \(2025\)GRAG: graph retrieval\-augmented generation\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 4145–4157\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.232/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.232),ISBN 979\-8\-89176\-195\-7Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.
- J\. Jang, M\. Boo, and H\. Kim \(2023\)Conversation chronicles: towards diverse temporal and relational dynamics in multi\-session conversations\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 13584–13606\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.838/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.838)Cited by:[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.13115#S4.SS1.SSS1.p1.1)\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 6769–6781\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.550/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.550)Cited by:[3rd item](https://arxiv.org/html/2606.13115#A3.I1.i3.p1.1)\.
- C\. Ke, Y\. Du, B\. Liang, Y\. Xiang, L\. Gui, Z\. Li, B\. Wang, Y\. Yu, H\. Wang, K\. Wong, and R\. Xu \(2025\)Flexibly utilize memory for long\-term conversation via a fragment\-then\-compose framework\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 21119–21136\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1069/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1069),ISBN 979\-8\-89176\-332\-6Cited by:[5th item](https://arxiv.org/html/2606.13115#A3.I1.i5.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p3.1),[§4\.1\.2](https://arxiv.org/html/2606.13115#S4.SS1.SSS2.p1.1)\.
- G\. Li, Z\. Xu, Z\. Shang, J\. Liu, K\. Ji, and Y\. Guo \(2024\)Empirical analysis of dialogue relation extraction with large language models\.External Links:2404\.17802,[Link](https://arxiv.org/abs/2404.17802)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Li, C\. Yang, A\. Zhang, Y\. Deng, X\. Wang, and T\. Chua \(2025\)Hello again\! LLM\-powered personalized agent for long\-term dialogue\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 5259–5276\.External Links:[Link](https://aclanthology.org/2025.naacl-long.272/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.272),ISBN 979\-8\-89176\-189\-6Cited by:[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px2.p1.1),[Appendix K](https://arxiv.org/html/2606.13115#A11.SS0.SSS0.Px1.p1.4),[4th item](https://arxiv.org/html/2606.13115#A3.I1.i4.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1),[§4\.1\.2](https://arxiv.org/html/2606.13115#S4.SS1.SSS2.p1.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§4\.1\.3](https://arxiv.org/html/2606.13115#S4.SS1.SSS3.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2023a\)Lost in the middle: how language models use long contexts\.External Links:2307\.03172,[Link](https://arxiv.org/abs/2307.03172)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023b\)G\-eval: NLG evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 2511–2522\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.153/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153)Cited by:[Appendix D](https://arxiv.org/html/2606.13115#A4.SS0.SSS0.Px1.p1.1),[§4\.1\.3](https://arxiv.org/html/2606.13115#S4.SS1.SSS3.p1.1)\.
- J\. Lu, S\. An, M\. Lin, G\. Pergola, Y\. He, D\. Yin, X\. Sun, and Y\. Wu \(2023\)MemoChat: tuning llms to use memos for consistent long\-range open\-domain conversation\.External Links:2308\.08239,[Link](https://arxiv.org/abs/2308.08239)Cited by:[§1](https://arxiv.org/html/2606.13115#S1.p2.1.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of llm agents\.\.Cited by:[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.13115#S4.SS1.SSS1.p1.1)\.
- C\. Mavromatis and G\. Karypis \(2025\)GNN\-RAG: graph neural retrieval for efficient large language model reasoning on knowledge graphs\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 16682–16699\.External Links:[Link](https://aclanthology.org/2025.findings-acl.856/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.856),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.
- K\. T\. Ong, N\. Kim, M\. Gwak, H\. Chae, T\. Kwon, Y\. Jo, S\. Hwang, D\. Lee, and J\. Yeo \(2025\)Towards lifelong dialogue agents via timeline\-based memory management\.External Links:2406\.10996,[Link](https://arxiv.org/abs/2406.10996)Cited by:[§1](https://arxiv.org/html/2606.13115#S1.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1)\.
- A\. Papaluca, D\. Krefl, S\. M\. Rodriguez, A\. Lensky, and H\. Suominen \(2023\)Zero\- and few\-shots knowledge graph triplet extraction with large language models\.External Links:2312\.01954,[Link](https://arxiv.org/abs/2312.01954)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)BLEU: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting on Association for Computational Linguistics,ACL ’02,USA,pp\. 311–318\.External Links:[Link](https://doi.org/10.3115/1073083.1073135),[Document](https://dx.doi.org/10.3115/1073083.1073135)Cited by:[§4\.1\.3](https://arxiv.org/html/2606.13115#S4.SS1.SSS3.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.External Links:2304\.03442,[Link](https://arxiv.org/abs/2304.03442)Cited by:[§3\.4](https://arxiv.org/html/2606.13115#S3.SS4.SSS0.Px2.p1.11)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2023\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.External Links:1910\.10683,[Link](https://arxiv.org/abs/1910.10683)Cited by:[Appendix B](https://arxiv.org/html/2606.13115#A2.SS0.SSS0.Px1.p3.1),[§3\.2](https://arxiv.org/html/2606.13115#S3.SS2.SSS0.Px2.p1.6)\.
- Z\. Tan, J\. Yan, I\. Hsu, R\. Han, Z\. Wang, L\. T\. Le, Y\. Song, Y\. Chen, H\. Palangi, G\. Lee, A\. Iyer, T\. Chen, H\. Liu, C\. Lee, and T\. Pfister \(2025\)In prospect and retrospect: reflective memory management for long\-term personalized dialogue agents\.External Links:2503\.08026,[Link](https://arxiv.org/abs/2503.08026)Cited by:[§1](https://arxiv.org/html/2606.13115#S1.p2.1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1)\.
- B\. D\. Trisedya, G\. Weikum, J\. Qi, and R\. Zhang \(2019\)Neural relation extraction for knowledge base enrichment\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 229–240\.External Links:[Link](https://aclanthology.org/P19-1023/),[Document](https://dx.doi.org/10.18653/v1/P19-1023)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Vig and Y\. Belinkov \(2019\)Analyzing the structure of attention in a transformer language model\.InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,T\. Linzen, G\. Chrupała, Y\. Belinkov, and D\. Hupkes \(Eds\.\),Florence, Italy,pp\. 63–76\.External Links:[Link](https://aclanthology.org/W19-4808/),[Document](https://dx.doi.org/10.18653/v1/W19-4808)Cited by:[Appendix I](https://arxiv.org/html/2606.13115#A9.p2.1)\.
- P\. Vossen, S\. B\. Santamaría, L\. Bajčetić, and T\. Belluci \(2024\)Extracting triples from dialogues for conversational social agents\.External Links:2412\.18364,[Link](https://arxiv.org/abs/2412.18364)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Wang, L\. Dong, H\. Cheng, X\. Liu, X\. Yan, J\. Gao, and F\. Wei \(2023\)Augmenting language models with long\-term memory\.External Links:2306\.07174,[Link](https://arxiv.org/abs/2306.07174)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.
- W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou \(2020\)MINILM: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\.InProceedings of the 34th International Conference on Neural Information Processing Systems,NIPS ’20,Red Hook, NY, USA\.External Links:ISBN 9781713829546Cited by:[§3\.3](https://arxiv.org/html/2606.13115#S3.SS3.SSS0.Px1.p1.4.7)\.
- Z\. Wang, X\. Zhou, R\. Koncel\-Kedziorski, A\. Marin, and F\. Xia \(2022\)Extracting and inferring personal attributes from dialogue\.InProceedings of the 4th Workshop on NLP for Conversational AI,B\. Liu, A\. Papangelis, S\. Ultes, A\. Rastogi, Y\. Chen, G\. Spithourakis, E\. Nouri, and W\. Shi \(Eds\.\),Dublin, Ireland,pp\. 58–69\.External Links:[Link](https://aclanthology.org/2022.nlp4convai-1.6/),[Document](https://dx.doi.org/10.18653/v1/2022.nlp4convai-1.6)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Wu, A\. Madotto, Z\. Lin, P\. Xu, and P\. Fung \(2020\)Getting to know you: user attribute extraction from dialogues\.InProceedings of the Twelfth Language Resources and Evaluation Conference,N\. Calzolari, F\. Béchet, P\. Blache, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\),Marseille, France,pp\. 581–589\(eng\)\.External Links:[Link](https://aclanthology.org/2020.lrec-1.73/),ISBN 979\-10\-95546\-34\-4Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025a\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.External Links:2410\.10813,[Link](https://arxiv.org/abs/2410.10813)Cited by:[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.13115#S4.SS1.SSS1.p1.1)\.
- Y\. Wu, Y\. Zhang, S\. Liang, and Y\. Liu \(2025b\)SGMem: sentence graph memory for long\-term conversational agents\.External Links:2509\.21212,[Link](https://arxiv.org/abs/2509.21212)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p3.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.External Links:2309\.17453,[Link](https://arxiv.org/abs/2309.17453)Cited by:[Appendix I](https://arxiv.org/html/2606.13115#A9.p2.1)\.
- J\. Xu, A\. Szlam, and J\. Weston \(2022a\)Beyond goldfish memory: long\-term open\-domain conversation\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 5180–5197\.External Links:[Link](https://aclanthology.org/2022.acl-long.356/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.356)Cited by:[Appendix A](https://arxiv.org/html/2606.13115#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13115#S1.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1),[§4\.1\.1](https://arxiv.org/html/2606.13115#S4.SS1.SSS1.p1.1)\.
- X\. Xu, Z\. Gou, W\. Wu, Z\. Niu, H\. Wu, H\. Wang, and S\. Wang \(2022b\)Long time no see\! open\-domain conversation with long\-term persona memory\.External Links:2203\.05797,[Link](https://arxiv.org/abs/2203.05797)Cited by:[§1](https://arxiv.org/html/2606.13115#S1.p1.1)\.
- Q\. Zhang, J\. Naradowsky, and Y\. Miyao \(2023a\)Mind the gap between conversations for improved long\-term dialogue generation\.External Links:2310\.15415,[Link](https://arxiv.org/abs/2310.15415)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1)\.
- Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett, Z\. Wang, and B\. Chen \(2023b\)H2o: heavy\-hitter oracle for efficient generative inference of large language models\.External Links:2306\.14048,[Link](https://arxiv.org/abs/2306.14048)Cited by:[Appendix I](https://arxiv.org/html/2606.13115#A9.p2.1)\.
- T\. Zhang\*, V\. Kishore\*, F\. Wu\*, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with bert\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkeHuCVFDr)Cited by:[Appendix D](https://arxiv.org/html/2606.13115#A4.SS0.SSS0.Px1.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\)MemoryBank: enhancing large language models with long\-term memory\.Proceedings of the AAAI Conference on Artificial Intelligence38\(17\),pp\. 19724–19731\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/29946),[Document](https://dx.doi.org/10.1609/aaai.v38i17.29946)Cited by:[3rd item](https://arxiv.org/html/2606.13115#A3.I1.i3.p1.1),[§1](https://arxiv.org/html/2606.13115#S1.p2.1.1),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px1.p2.1),[§4\.1\.2](https://arxiv.org/html/2606.13115#S4.SS1.SSS2.p1.1)\.
- L\. Zhu, W\. Li, R\. Mao, V\. Pandelea, and E\. Cambria \(2023\)PAED: zero\-shot persona attribute extraction in dialogues\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 9771–9787\.External Links:[Link](https://aclanthology.org/2023.acl-long.544/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.544)Cited by:[Appendix B](https://arxiv.org/html/2606.13115#A2.SS0.SSS0.Px1.p1.3),[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.13115#S3.SS2.SSS0.Px1.p1.1)\.
- L\. Zhuang, S\. Chen, Y\. Xiao, H\. Zhou, Y\. Zhang, H\. Chen, Q\. Zhang, and X\. Huang \(2025\)LinearRAG: linear graph retrieval augmented generation on large\-scale corpora\.External Links:2510\.10114,[Link](https://arxiv.org/abs/2510.10114)Cited by:[§2](https://arxiv.org/html/2606.13115#S2.SS0.SSS0.Px2.p2.1)\.

## Appendix ADataset Details

To comprehensively evaluate the effectiveness of G\-Long, we employ three benchmark datasets tailored for multi\-session and long\-term conversational memory modeling\.

##### Multi\-Session Chat \(MSC\)

MSCXuet al\.\([2022a](https://arxiv.org/html/2606.13115#bib.bib5)\)is a large\-scale multi\-session dialogue benchmark\. For our main experiments, following previous workLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\), we randomly sampled 500 dialogue sessions from the test split to manage computational costs while maintaining statistical significance\. Each sample consists of five temporally separated sessions between the same two interlocutors \(avg\. 43 turns\), designed to evaluate the retention of user\-specific information such as preferences and past experiences\.

##### Conversation Chronicles \(CC\)

CCJanget al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib6)\)focuses on chronological consistency and evolving personal narratives\. Similarly, followingLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\), we used a random subset of 500 test samples for evaluation\. Each sample comprises five sessions \(avg\. 58 turns\) with rich temporal dependencies, requiring models to track changes in user states and relationships over time\.

##### LoCoMo

LoCoMoMaharanaet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib31)\)is a sophisticated long\-term dialogue dataset designed to evaluate memory retention over extended temporal horizons\. Unlike MSC and CC, LoCoMo contains an exceptionally high number of sessions \(approximately 27 sessions per sample\), providing a highly challenging and realistic environment for testing the scalability of long\-context reasoning\.

##### LongMemEval \(LME\)

To measure intrinsic retrieval accuracy, we use the LME datasetWuet al\.\([2025a](https://arxiv.org/html/2606.13115#bib.bib7)\)\. We utilized theLongMemEval\_Ssubset \(approx\. 48 sessions, 10 turns each\), where a post\-conversation QA task targets specific facts mentioned throughout the history\. This allows for a precise assessment of memory correctness independent of generation fluency\.

## Appendix BImplementation Details

We implemented G\-Long using thePyTorchframework and theHuggingFace Transformerslibrary\.

##### Module Configurations

For theTriplet Extraction Module\(Section[3\.2](https://arxiv.org/html/2606.13115#S3.SS2.SSS0.Px1)\), we primarily utilizedQwen\-3\-8B333[https://huggingface\.co/Qwen/Qwen3\-8B](https://huggingface.co/Qwen/Qwen3-8B)instruction\-tuned on thePersonaExtZhuet al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib4)\)dataset\. This corpus, consisting of dialogue sessions paired with structured persona triplets, provides ground\-truth supervision for learning precise extraction schemas\. To ensure training efficiency, we employed QLoRADettmerset al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib23)\)with 4\-bit quantization\. We configured the LoRAHuet al\.\([2021](https://arxiv.org/html/2606.13115#bib.bib42)\)adapter with a rank ofr=32r=32andα=32\\alpha=32, targeting all linear projection layers to maximize representational power\. The model was fine\-tuned for 10 epochs with a learning rate of2×10−42\\times 10^\{\-4\}and a batch size of 16\. For the inference phase, we leveraged thevLLMengine with greedy decoding \(temperature 0\.0\) to guarantee high\-throughput and deterministic generation of structured knowledge\. To guide the model’s generation, we utilized a specific system instruction emphasizing structured output\. Below is the exact prompt template and a representative training instance used in our experiments\.

System InstructionYou are an expert in Knowledge Graph Construction\. Extract \(subject, relation, object\) triplets from the text and output them in JSON format\.

User Input \(Dialogue\)Extract triplets from the following dialogue: "Speaker A: I love playing tennis on weekends\." "Speaker B: Me too, it’s my favorite hobby\."

Assistant Output \(Target JSON\)\[\{"subject": "Speaker A", "relation": "hobby", "object": "tennis"\}, \{"subject": "Speaker B", "relation": "hobby", "object": "tennis"\}\]

For theAttention\-Aware Importance Scoring Module, we leveraged aT5\-smallRaffelet al\.\([2023](https://arxiv.org/html/2606.13115#bib.bib29)\)checkpoint that was fine\-tuned on theSAMSumcorpusGliwaet al\.\([2019](https://arxiv.org/html/2606.13115#bib.bib20)\)\. Since this model is optimized for abstractive dialogue summarization, its cross\-attention mechanisms implicitly capture the semantic saliency of utterances\. We utilize these cross\-attention maps from the last decoder layer to quantify the importance of extracted triplets\. We leveraged a T5\-small checkpoint444[https://huggingface\.co/chanifrusydi/t5\-dialogue\-summarization](https://huggingface.co/chanifrusydi/t5-dialogue-summarization)fine\-tuned on theSamsumdataset\.

##### Generation Settings

To ensure reproducibility and deterministic evaluation, we standardized the generation process across G\-Long and all baseline models\. We utilizedgpt\-4o\-minias the backbone LLM for response generation and auxiliary memory management tasks such as summarizing, setting the decoding temperature to 0\.0 to minimize randomness in the outputs\.

##### Retrieval Settings

We implemented the underlying graph structure using theNetworkX555[https://networkx\.org/en/](https://networkx.org/en/)library\. For the dense embedding, we utilized theall\-MiniLM\-L6\-v2sentence\-transformer encoder model, which was managed within aChromaDBvector store for efficient querying\.

Regarding the hyperparameters discussed in Section[3\.4](https://arxiv.org/html/2606.13115#S3.SS4), we empirically set the number of anchor nodes toNa​n​c=5N\_\{anc\}=5\. For subgraph expansion, we adopted a 1\-hop neighborhood strategy \(L=1L=1\) to prioritize retrieval efficiency while maintaining sufficient local context, as validated by the ablation analysis in Section[4\.7](https://arxiv.org/html/2606.13115#S4.SS7)\. In the Two\-Stage Hybrid Reranking phase, we first filtered the topN=5N=5triplets based on semantic similarity\. Subsequently, for importance\-aware reranking, the weighting coefficients were set toβ1=0\.5\\beta\_\{1\}=0\.5\(semantic\),β2=0\.3\\beta\_\{2\}=0\.3\(importance\), andβ3=0\.2\\beta\_\{3\}=0\.2\(recency\)\. The time\-decay factor was set toλ=10−7\\lambda=10^\{\-7\}based on validation performance\. Finally, the topK=3K=3triplets were selected and concatenated with the dialogue context for generation\.

##### Hardware

All experiments were conducted on a single NVIDIA RTX PRO 6000 Blackwell Max\-Q Edition \(96GB\)\.

Table 8:Additional performance comparison on MSC and CC datasets across sessions \(N=500\)\.Table 9:Additional Experimental Results on MSC Dataset \(N=100\)\.

## Appendix CBaseline Details

We provide detailed specifications of the baselines used in our comparative analysis to evaluate the effectiveness of preserving and utilizing long\-term memory:

- •No History:A standard conversational setup that generates responses based solely on the current user utterance, without access to any past dialogue history\. This serves as a lower bound for performance\.
- •Long Context:A baseline that concatenates the full raw dialogue history into the prompt, constrained only by the LLM’s context window\. This represents the capability of standard LLMs to handle long contexts without explicit memory management\.
- •MemoryBankZhonget al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib2)\): A summarization\-based approach utilizing a discrete memory bank\. It employs an Ebbinghaus forgetting curve for memory management and uses a dense retriever \(DPR\)Karpukhinet al\.\([2020](https://arxiv.org/html/2606.13115#bib.bib8)\)for memory access\.
- •LD\-AgentLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\): A personalized long\-term dialogue framework integrating a Persona Extractor and Session Summarizer\. It relies on fixed conversation databases with retrieval modulation strategies, such as keyword matching, to access relevant history and maintain user consistency across extended sessions\.
- •FraComKeet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib9)\): A recently proposed framework that decomposes dialogue history into proposition\-level units\. It facilitates granular memory retrieval and allows for flexible context composition during response generation\.
- •HippoRAGGutiérrezet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib32)\): A neurobiologically inspired GraphRAG system that leverages LLM\-based Named Entity Recognition \(NER\) and triplet extraction to construct a knowledge graph, using Personalized PageRank \(PPR\) to retrieve relevant contexts\.

## Appendix DEvaluation Metrics Details

Depending on the nature of the datasets, we utilize distinct metrics for response generation and memory retrieval tasks\.

##### Response Generation Metrics \(MSC, CC\)

We utilize standardnn\-gram based metrics to measure the lexical similarity between generated responses and the ground truth\. Specifically, we reportBLEU\-2andBLEU\-3to evaluate phrase\-level overlap, andROUGE\-Lto capture sentence\-level structure via the longest common subsequence\. Additional results for BLEU\-1, BLEU\-4, and BERTScoreZhang\*et al\.\([2020](https://arxiv.org/html/2606.13115#bib.bib26)\)are reported in Appendix[F](https://arxiv.org/html/2606.13115#A6)\. To overcome the limitations of static metrics, we also employG\-EvalLiuet al\.\([2023b](https://arxiv.org/html/2606.13115#bib.bib21)\), an LLM\-based framework, to assess coherence, engagingness, and fluency\.

##### Memory QA Metrics \(LME\)

For the LME benchmark, designed as a QA task to probe long\-term memory, we adopt the following primary metrics:

- •Recall \(Retrieval Coverage\):Measures whether theground\-truth evidence utterancecontaining the answer is successfully retrieved within the top\-kkmemory candidates\.
- •MRR \(Ranking Precision\):Evaluates the retrieval quality by calculating the reciprocal rank of the first relevant ground\-truth memory triplet, effectively measuring the system’s capability to prioritize essential evidence at higher ranks as the candidate size scales\.
- •Accuracy \(QA Correctness\):Measures whether the final response correctly provides the exact answer\. We utilizegpt\-4o\-minito verify if the generated response matches the ground truth based on the retrieved memories\.

## Appendix EDetailed Human Evaluation Protocol

This section provides a comprehensive breakdown of our human evaluation process to ensure transparency and reproducibility\.

##### Recruitment and Compensation

We recruited three independent graduate\-level researchers specializing in NLP, all proficient in English\. Annotators were not affiliated with the model development and had no prior exposure to the system outputs\. All annotators were compensated at our institution’s standard hourly rate to ensure fair labor practices\.

##### Task Design and Blind Setup

To eliminate bias, we adopted a fully blind protocol where model identities were anonymized \(presented as “Model A” vs\. “Model B”\) and the presentation order was randomized\. Annotators performed pairwise side\-by\-side comparisons under a forced binary choice \(1 for win, 0 for loss\)\. For each instance, they were presented with the full long\-term dialogue history \(Sessions 1–4\) followed by candidate responses for Session 5\. They were asked to judge which response better reflects the user’s established persona and provides a more fluent, engaging, and coherent follow\-up\.

##### Evaluation Criteria

Annotators judged each pair holistically along three dimensions: \(1\)Fluency: linguistic quality and naturalness; \(2\)Engagingness: proactivity and empathetic tone; and \(3\)Coherence: logical relevance and long\-term factual consistency with the established history\.

##### Sample Size and Annotation Effort

We evaluated 50 randomly sampled instances per baseline pairing\. While this number may appear modest, each instance spans five sessions, averaging 51 turns \(≈\\approx2,500 total turns per pairing\)\. To make a reliable judgment, annotators must read the entire dialogue history, requiring 15–20 minutes per instance\. We therefore prioritized evaluation depth—essential for genuinely assessing long\-term consistency—over the breadth of shallow, snippet\-level judgments\.

##### Statistical Significance

To assess whether G\-Long’s win rates differ meaningfully from chance \(0\.5\), we conducted a one\-sample t\-test\. The aggregated win rates \(64% vs\. LD\-Agent, 52% vs\. MemoryBank, 54% vs\. Long Context\) yieldedp=0\.058p=0\.058\. Although slightly above the 0\.05 threshold due to the time\-constrained sample size, the consistent trend across all independent baselines and the substantial 14\-point margin against LD\-Agent support G\-Long’s superiority\. We view this as complementary evidence to our G\-Eval results\.

##### Data Consent and Ethics

Annotators were informed that data would be used for academic research only\. No personally identifiable information \(PII\) was collected\. The study involved low\-risk tasks on public, anonymized datasets and followed the ethical research guidelines of our institution\.

## Appendix FAdditional Generation Performance Comparison

Table[8](https://arxiv.org/html/2606.13115#A2.T8)presents the extended evaluation results usingBLEU\-1,BLEU\-4, andBERTScoreon both MSC and CC datasets\. Consistent with the main findings in Section[4\.2](https://arxiv.org/html/2606.13115#S4.SS2), G\-Long demonstrates robust superiority over all baseline models, including competitive frameworks like MemoryBank and LD\-Agent\.

Across both datasets, G\-Long achieves the highest scores in all metrics\. Notably, the significant margins inBLEU\-4indicate that our graph\-based memory facilitates more precise lexical usage compared to summarization\-based approaches\. Furthermore, the superiorBERTScoreperformance confirms that G\-Long effectively captures semantic relevance, ensuring that generated responses are not only fluent but also contextually aligned with the long\-term dialogue history\. These results further validate the generalizability and effectiveness of our proposed framework across diverse evaluation criteria\.

Additionally, as shown in Table[9](https://arxiv.org/html/2606.13115#A2.T9), regarding the evaluation of FraCom, the reported scores represent the average performance calculated on a per\-session basis\. It should be noted that due to the prohibitive API costs associated with FraCom’s operational characteristics, its evaluation was restricted to a subset of 100 samples\.

## Appendix GTriplet Extraction Performance Comparison

We further investigate whether replacing the computationally expensive LLM\-based extractor with a fine\-tuned sLM compromises the system’s performance\. Table[10](https://arxiv.org/html/2606.13115#A7.T10)presents a comparative analysis on the MSC dataset \(N=50N=50\)\. The results demonstrate that the sLM extractor does not incur any performance degradation compared to the LLM\-based counterpart\. In fact, it achieves a slight improvement in BLEU\-3 \(1\.88 vs\. 1\.78\) while maintaining identical scores for BLEU\-2 and ROUGE\-L\. We attribute this unexpected gain to the structural consistency of the fine\-tuned model\. Unlike general\-purpose LLMs, which may exhibit variance in formatting, the fine\-tuned sLM produces highly standardized triplets\. This consistency is likely to reduce noise in the constructed graph, enabling the generator to utilize retrieved contexts more effectively\. This confirms that our sLM\-based design achieves operational efficiency without sacrificing generation quality\.

Table 10:Performance comparison on the MSC dataset \(N=50\) while using the LLM extractor and the sLM extractor\.Table 11:Effect of different LLMs on MSC and CC datasets \(N=100\)\.Table 12:Effect of different sLMs’ triplet extraction on MSC and CC datasets \(N=100\)\.Table 13:Ablation study on the effect of the Triplet Importance Scoring Module \(N=500\)\.
## Appendix HRobustness on Different Models

To verify the generalization capability and robustness of G\-Long, we evaluate our framework across different backbone LLMs for response generation and sLMs for triplet extraction\.

##### Impact of Different LLM Backbones

Table[11](https://arxiv.org/html/2606.13115#A7.T11)compares performance across three distinct LLMs: gpt\-4o\-mini, ChatGLM3\-6B, and Gemini\-1\.5\-flash\-lite\. G\-Long consistently outperforms the strong baseline \(LD\-Agent\) across all backbone models on both the MSC and CC datasets\. This suggests that our framework is model\-agnostic, effectively enhancing long\-context understanding regardless of the specific model characteristics\.

##### Effect of Different SLM Extractors

We further investigate robustness by varying the triplet extraction module with different open\-source instruction\-tuned sLMs \(Qwen3\-8B and Llama3\.1\-8B\-Instruct\)\. As shown in Table[12](https://arxiv.org/html/2606.13115#A7.T12), G\-Long maintains high performance regardless of the underlying sLM\. This indicates the framework’s high modularity, allowing flexible integration with various open\-source models for knowledge extraction\.

## Appendix IDetailed Analysis of Importance Scoring

This section provides an in\-depth ablation study to validate the effectiveness of our attention\-aware importance scoring module within the G\-Long framework\.

Our design choice to leverage cross\-attention as a saliency signal is grounded in a substantial body of prior work\. Studies on transformer interpretability have demonstrated that attention weights encode meaningful syntactic and semantic signalsClarket al\.\([2019](https://arxiv.org/html/2606.13115#bib.bib43)\); Vig and Belinkov\([2019](https://arxiv.org/html/2606.13115#bib.bib44)\), and can be quantitatively correlated as saliency measuresAbnar and Zuidema\([2020](https://arxiv.org/html/2606.13115#bib.bib45)\)\. Building on this insight, recent approaches have successfully leveraged attention scores for efficient memory and token managementZhanget al\.\([2023b](https://arxiv.org/html/2606.13115#bib.bib51)\); Xiaoet al\.\([2024](https://arxiv.org/html/2606.13115#bib.bib47)\)\. These findings collectively support our hypothesis that the cross\-attention maps of a fine\-tuned summarizer can serve as a robust, intrinsic proxy for the semantic importance of extracted triplets\.

To rigorously assess its impact, we conduct two separate evaluations: intrinsic retrieval accuracy and response generation quality\.

##### Intrinsic Retrieval Performance

We first evaluate the module’s capability as a re\-ranker\. Since the initial candidate pool \(size 5\) is retrieved based on semantic similarity, we focus on the Top\-3 metrics to observe the refinement\. As shown in Table[13](https://arxiv.org/html/2606.13115#A7.T13), the inclusion of importance scores improved Recall@3 \(0\.6179→\\rightarrow0\.6286\) and Accuracy@3 \(0\.5720→\\rightarrow0\.5783\), confirming its ability to prioritize critical evidence from the retrieved candidates\.

Table 14:Impact of Importance\-Aware Reranking on Generation Quality \(N=100\)\.
##### Response Generation Impact

The primary value of our scoring mechanism lies in its noise\-filtering capability\. By providing only the highest\-density information \(Top\-3 filtered\) to the generator, we avoid the distraction caused by trivial or irrelevant triplets found in the full candidate pool \(Top\-5 unfiltered\)\. Table[14](https://arxiv.org/html/2606.13115#A9.T14)demonstrates that G\-Long \(Top\-3\) consistently achieves higher BLEU and ROUGE\-L scores compared to the No\-Rerank \(Top\-5\) baseline across both MSC and CC datasets\. These results empirically support that importance scoring is essential for achieving both high\-fidelity response generation and computational efficiency via reduced token consumption\.

## Appendix JCase Study

### J\.1Qualitative Analysis on Dialogue Generation

Figure[5](https://arxiv.org/html/2606.13115#A11.F5)presents a qualitative comparison between G\-Long and the Long Context baseline\. This comparison highlights the efficacy of providing structured knowledge \(triplets\) over simply feeding raw, extensive dialogue history\. We examine two critical aspects of long\-term dialogue quality:Persona ConsistencyandContext\-Aware Empathy\.

##### Case 1: Persona Consistency

In the "Freelancer" scenario, G\-Long retrieves distinct facts \(occupation and pet\) and synthesizes them into a causal chain:"Freelance flexibility→\\rightarrowwork from home→\\rightarrowtime with Bosco\."Unlike the Long Context baseline, which only captures surface\-level associations, G\-Long constructs grounded narratives that respect the logical flow of the persona’s lifestyle rather than merely repeating historical facts\.

##### Case 2: Context\-Aware Empathy

In the "Fishing Article" scenario, G\-Long demonstrates nuanced empathy\. While Long Context offers generic encouragement \("great accomplishment"\), G\-Long explicitly leverages the user’s past hobby \("enjoys fishing"\) to validate their ability \("passion will shine through"\)\. This capability to bridge current emotions with verified past interests fosters a deeper connection than standard supportive phrases\.

Notably, G\-Long’s comprehensive responses stem not from verbosity, but from the ability to fully articulate retrieved structured knowledge, ensuring interactions are both logically grounded and conversationally engaging\.

### J\.2Empirical Evidence for Retrieval Behavior

To address the concern regarding the empirical evidence for retrieval behavior, we provide a detailed qualitative analysis using the MSC Dataset \(Figure[6](https://arxiv.org/html/2606.13115#A11.F6)\)\. These results demonstrate how G\-Long’s structured retrieval mechanism effectively captures and utilizes long\-term memory compared to traditional summarization\-based baselines\.

When handling a complex inquiry, G\-Long accurately captures the core entities of the dialogue and extracts independent, highly relevant facts prioritized by rank\. This provides zero\-noise information that is ready for immediate contextual use\. In contrast, LDA over\-compresses past conversations into a single summary sentence, resulting in critical information loss where specific entities essential for answering the inquiry \(e\.g\., "grill", "steak", "Alaska"\) are completely omitted\. Furthermore, the Memory Bank baseline returns lengthy, narrative\-style summaries mixed with metadata\. Although it contains some of the target information, the excessive amount of irrelevant text causes severe context overload, making it difficult for the language model to pinpoint the necessary facts\.

### J\.3Error Analysis on Memory Retrieval

To provide a transparent view of the structural boundaries of the proposed graph\-based memory representation, we conducted an extended qualitative analysis on failure cases that occurred outside the standard benchmark evaluation\. Our analysis identifies two primary failure modes when handling nuanced or non\-factual conversational content, as illustrated in Figure[7](https://arxiv.org/html/2606.13115#A11.F7)\.

These findings suggest that while graph\-based structures are highly efficient for factual retention, future work should explore tighter integration between graph nodes and dynamic context resolution to mitigate such referential and lexical sensitivities\.

## Appendix KPrompt Configuration

We present the complete prompt templates used in our experiments in Figure[8](https://arxiv.org/html/2606.13115#A11.F8)\.

##### Response Generation \(Top\)

To ensure a fair comparison, we adopted the prompt structure from LD\-AgentLiet al\.\([2025](https://arxiv.org/html/2606.13115#bib.bib1)\)\. However, as G\-Long does not employ a dedicated persona module, we removed the static<<USER\_TRAITS\>\>section\. Instead, we unified all retrieved triplets within the<<MEMORY\>\>block, allowing the model to dynamically generate responses based on structured knowledge\.

##### G\-Eval for Response Quality \(Bottom\)

For automatic evaluation, we utilizedgpt\-4oas an expert judge\. The prompt instructs the model to evaluate the target session \(Session 5\) while considering the long\-term context \(Sessions 1\-4\)\. The evaluation is based on three criteria \(Coherence, Engagingness, and Fluency\)

Case 1: Persona Consistency \- "Freelancer" scenario\[Dialogue Context\]Speaker 1 asks about the work\-life balance of being a freelance programmer\.User \(Speaker 1\):“Does being a freelance programmer allow you to stay home with Bosco?”\[Retrieved Memory\]•\(Speaker 2, occupation, freelance programmer\)•\(Speaker 2, work\_style, work from home\)•\(Speaker 2, has\_pet, dog named Bosco\)G\-Long \(Ours\):Absolutely\! Being afreelance programmergives me the flexibility towork from home, which means I can spend a lot of time withBosco\. It’s great because I can take breaks to play with him or take him for walks whenever I need a mental break from coding\. How about you? Do you have any pets?LongContext:Yes, it definitely does\! I love being able to work from home and spend time with Bosco\. It’s the best of both worlds\!

Case 2: Context\-Aware Empathy \- "Fishing Article" scenario\[Dialogue Context\]Speaker 1 is nervous about sharing their first article\.User \(Speaker 1\):“I finished up writing my fishing article\. I want to send it to you to read over, I’m nervous about actually writing my first article\.”\[Retrieved Memory\]•\(Speaker 1, hobby, sport fishing\)•\(Speaker 1, current\_activity, writing first article\)•\(Speaker 1, emotion, nervous\)G\-Long \(Ours\):That’s fantastic\! I’m really excited toread your article\. It’s completely normal to feelnervous, especially with your first one, but remember that you’ve got agreat passion for fishing, and that will shine through in your writing\. Go ahead and send it over whenever you’re ready\!LongContext:I’d be happy to read it\! Don’t be nervous; it’s a great accomplishment\. I’m excited to see your perspective on fishing\!

Figure 5:Qualitative comparison between G\-Long and LongContext\. Bold text indicates specific information directly integrated from the retrieved triplets, demonstrating G\-Long’s ability to maintain persona consistency \(Case 1\) and utilize user traits for context\-aware empathy \(Case 2\)\.Empirical Retrieval Analysis: G\-Long vs\. Baselines \(MSC Dataset\)\[Query \(Inquiry\)\]“Good thinking\. Not purchasing a treadmill now will save you some money, and you’ll be more motivated to go to the gym since you won’t have exercise equipment at home\. What other things do you like to grill on sunny days other than steak? Do you get some good salmon up in Alaska?”\[Expected Response\]“Oh, I love salmon, and it is pretty good here\. Steak is about the only thing I really like to grill though\. I know you don’t eat meat, but do you eat seafood?”1\. G\-Long \(Proposed\)Characteristics:Accurately captures core entities and extracts independent, highly relevant facts prioritized by rank\. It provides zero\-noise information that is ready for immediate contextual use\.•\[Rank 1\]\(Active 13\.0 days 1\.0 hours 2\.0 minutes ago\) Speaker 2’s family originates from Alaska\.•\[Rank 2\]\(Active 13\.0 days 1\.0 hours 2\.0 minutes ago\) SPEAKER1 likes grilled food\.•\[Rank 3\]\(Active 13\.0 days 1\.0 hours 2\.0 minutes ago\) SPEAKER2’s favorite food is steak\.2\. LDA \(Baseline\)Characteristics:Over\-compresses past conversations into a single summary sentence\. While it captures the general topic, all specific entities \(grill, steak, Alaska\) essential for answering the inquiry are completely lost \(Information Loss\)\.•\[Retrieved Memory\]"13\.0 days 1\.0 hours 3\.0 minutes ago, SPEAKER\_1 owns a jeep; SPEAKER\_2 prefers walking\. They discuss work, hobbies, and their cold home regions\.\."3\. Memory Bank \(Baseline\)Characteristics:Returns lengthy, narrative\-style summaries mixed with metadata\. The excessive amount of irrelevant text causes severe context overload\.•\[Memory 1\]"Overall Summary: On January 1, 2023, a conversation took place… \(omitted\) …their culinary preferences \(the AI likes grilling meat; MSC\_User\_1 prefers grilled vegetables\)\. They shared geographical backgrounds—AI’s family from Alaska…"•\[Memory 2\]"On January 2, 2023, there was no dialogue content provided for summarization\."•\[Memory 3\]"On December 30, 2025, the conversation centered on running preferences\. MSC\_User\_1 preferred controlled environments like gyms with tracks or treadmills…"Figure 6:Qualitative comparison of retrieval behavior\. G\-Long precisely extracts specific, high\-priority facts \(e\.g\., Alaska, grill, steak\) without introducing noise\. In contrast, LDA suffers from severe information loss, and Memory Bank introduces context overload through lengthy, unfocused summaries\.Case 1: Referential Disconnect via Unresolved Anaphora\[User Query\]:“What was wrong withit?”\[Retrieved Memory \(G\-Long\)\]•\[Rank 1\] \(John, hobby, running\)•\[Rank 2\] \(Speaker1, has, car\)•\[Rank 3\] \(Speaker1, likes, grilled food\)Analysis:This case demonstrates a “referential disconnect” where the system fails to resolve the entity represented by the demonstrative pronoun “it” from the preceding dialogue turns\. Since the isolated query lacks explicit entity mentions, the retriever performs a blind search in the embedding space, fetching irrelevant, noisy triplets that share no semantic connection with the actual topic under discussion\.

Case 2: Semantic Drift via Lexical Over\-sensitivity\[User Query\]:“Thanks for the invite\! I do love painting barns, especially theold redones\. They’re just so charming\.”\[Retrieved Memory \(G\-Long\)\]•\[Rank 1\] \(John, is\_currently\_on, job hunt\)•\[Rank 2\] \(Speaker1, has,ginger hair\)•\[Rank 3\] \(The person, dislikes,the past\)Analysis:We observed instances of “semantic drift” where the retriever over\-prioritizes fragmented, literal keyword associations\. In this example, the model incorrectly mapped the color descriptor “red” to “ginger hair” and the temporal adjective “old” to the concept of “the past\.” This reveals a boundary in how current dense embedding spaces distinguish between disparate domains that share common descriptive modifiers\.

Figure 7:Qualitative error analysis of G\-Long’s memory retrieval mechanism\. The examples highlight structural limitations such as referential disconnect due to unresolved anaphora \(Case 1\) and semantic drift caused by lexical over\-sensitivity \(Case 2\)\.\(Top\) G\-Long Response Generation PromptSystem Prompt:As a communication expert with outstanding communication habits, you embody the role of \{Agent Name\} throughout the following dialogues\.User Prompt:<<CONTEXT\>\>Drawing from your recent conversation with \{User Name\}:\{Recent Dialogue Context\}<<MEMORY\>\>The memories linked to the ongoing conversation are:s0→r0o0s\_\{0\}\\xrightarrow\{r\_\{0\}\}o\_\{0\}s1→r1o1s\_\{1\}\\xrightarrow\{r\_\{1\}\}o\_\{1\}…sk→rkoks\_\{k\}\\xrightarrow\{r\_\{k\}\}o\_\{k\}Now, please role\-play as \{Agent Name\} to continue the dialogue between \{Agent Name\} and \{User Name\}\.\{User Name\} said: \{Current Inquiry\}Please respond to \{User Name\}’s statement:

\(Bottom\) G\-Eval Prompt for Response Quality \(gpt\-4o\)System Prompt:You are an expert judge evaluating a long\-term dialogue system\. The dialogue consists of alternating turns between a ’User’ and an ’Assistant’\. Your task is to evaluate the quality of the responses generated by the ’Assistant’ in the \[Target Session \(Session 5\)\], while using the \[Conversation History \(Sessions 1\-4\)\] as background context\.User Prompt:\[Conversation History \(Sessions 1\-4\)\]:This is the past dialogue history\. The Assistant should remember the information shared here\.\{Context History\}\[Target Session \(Session 5\)\]:This is the current session to be evaluated\. Focus ONLY on the lines labeled ’Assistant’\. Do not evaluate the ’User’ lines\.\{Target Session\}Evaluation Criteria \(1\-5 scale\):1\.Coherence: Does the response logically follow the input? Is it consistent with Sessions 1\-4 \(no contradictions/hallucinations\)?2\.Engagingness: Is the Assistant interesting? Does it provide specific responses rather than generic ones?3\.Fluency: Is the language natural and grammatically correct?Output Format:Output ONLY the scores in the exact format below\. Do not provide any reasoning or explanation\.Coherence: \[Score\]Engagingness: \[Score\]Fluency: \[Score\]

Figure 8:Full prompt templates used in our experiments\.\(Top\)The response generation prompt adapted from LD\-Agent, integrating retrieved triplets\.\(Bottom\)The G\-Eval prompt used for automatic evaluation with gpt\-4o\.

Similar Articles

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

arXiv cs.CL

This paper introduces a framework for synthesizing long-term medical dialogue datasets using LLMs, and creates MediLongChat with three benchmark tasks to evaluate healthcare agents' memory and reasoning capabilities. Experiments show that even state-of-the-art LLMs struggle with these tasks.