LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations
Summary
Lantern introduces a lightweight memory layer that archives conversation turns and retrieves relevant details after compaction, recovering 78.3% of lost facts with zero LLM calls and outperforming MemGPT-based methods.
# Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations
Source: [https://arxiv.org/html/2606.05182](https://arxiv.org/html/2606.05182)
###### Abstract
Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present Lantern (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval, requiring zero LLM calls and adding fewer than 25 ms of latency per turn.¹ On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at $\kappa = 0.81$), Lantern-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT’s LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon $p < 0.0001$, 95% CI $[+3.1, +8.6]$ pp, $d = 0.43$) at a fraction of the inference cost. Even without the reranker, base Lantern matches or exceeds this LLM-driven baseline ($p = 0.005$) using zero LLM calls. When four production LLMs answer fact-bearing questions using Lantern-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon $p < 0.05$ for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework (paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis) to support reproducibility and future work.
¹ Measured on an Apple M2 Pro (single-threaded, warm cache); on a 2-vCPU cloud VM (AWS c5.large), median latency is comparable at $\sim$30 ms. Compaction itself and optional reranking each incur one LLM call.
## 1 Introduction
Modern large language models (LLMs) operate within finite context windows. When multi-turn conversations exceed this capacity, systems employ *compaction*: older messages are summarized or truncated to make room for new content. While compaction preserves conversational flow, it destroys specific details: port numbers become “configured the database,” error codes become “fixed a bug,” and architectural decisions become “discussed the design.”
We formalize this information loss as the *context cliff*. Let $C_t$ denote the context at turn $t$ and $F(C_t)$ the set of retrievable facts. At compaction turn $t^*$:
$$C_{t^*+1} = \text{summarize}(C_1, \ldots, C_{t^*-k}) \oplus C_{t^*-k+1} \oplus \cdots \oplus C_{t^*} \tag{1}$$
where $\oplus$ denotes context concatenation and $k$ is the number of most recent turns kept verbatim. The context cliff is $\Delta F = F(C_{t^*}) \setminus F(C_{t^*+1})$. In our experiments, $|\Delta F| / |F(C_{t^*})| > 0.5$: over half of specific facts are lost after a single compaction event. Recent empirical work confirms that the Maximum Effective Context Window (MECW) of production LLMs can be significantly smaller than the advertised window, with accuracy degrading well before the nominal limit (Paulsen, [2025](https://arxiv.org/html/2606.05182#bib.bib17)).
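Equation (1) and the definition of $\Delta F$ can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper’s evaluation code: the names `compact`, `retrievable_facts`, and `context_cliff` are hypothetical, the summarizer is a stand-in callable, and treating a fact as retrievable when its literal string appears in the context is a simplification (the paper uses an LLM judge).

```python
from typing import Callable, List, Set, Tuple


def compact(turns: List[str], k: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Eq. (1): summarize turns 1..t*-k and keep the last k turns verbatim."""
    if not 0 < k < len(turns):
        raise ValueError("k must satisfy 0 < k < len(turns)")
    return [summarize(turns[:-k])] + turns[-k:]


def retrievable_facts(context: List[str], facts: Set[str]) -> Set[str]:
    """F(C): facts whose literal string appears in the context (assumed proxy)."""
    text = "\n".join(context)
    return {fact for fact in facts if fact in text}


def context_cliff(turns: List[str], k: int,
                  summarize: Callable[[List[str]], str],
                  facts: Set[str]) -> Tuple[Set[str], float]:
    """Return Delta F and the loss ratio |Delta F| / |F(C_t*)|."""
    before = retrievable_facts(turns, facts)
    after = retrievable_facts(compact(turns, k, summarize), facts)
    lost = before - after
    ratio = len(lost) / len(before) if before else 0.0
    return lost, ratio


# Toy conversation modeled on the coding-session example in the text.
turns = [
    "Set the DB port to 5433 in config/db.yaml",
    "Use PostgreSQL over MongoDB for the user store",
    "Create auth middleware in src/auth.ts",
    "Add a logout route",
    "Write tests for the logout route",
]
facts = {"5433", "PostgreSQL", "src/auth.ts"}


def vague_summary(old_turns: List[str]) -> str:
    # Stand-in for an LLM summarizer that drops specifics.
    return "Discussed database setup and architectural decisions."


lost, ratio = context_cliff(turns, k=2, summarize=vague_summary, facts=facts)
```

In this toy case every tracked fact sits in the summarized prefix, so all three are lost; real conversations lose a fraction that depends on where facts occur and how specific the summary is.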
Figure [1](https://arxiv.org/html/2606.05182#S1.F1) illustrates a concrete example of the context cliff in a coding session.
Before compaction:
- Turn 1 (User): Set the DB port to 5433 in config/db.yaml
- Turn 1 (Assistant): Done. Updated config/db.yaml, port set to 5433.
- Turn 3 (User): Use PostgreSQL over MongoDB for the user store
- Turn 3 (Assistant): Good choice. Setting up PostgreSQL driver…
- Turn 15 (User): Create auth middleware in src/auth.ts
- … 25 more turns …

After compaction:
- LLM summary: “Discussed database setup, made architectural decisions, created authentication middleware…”
- ✗ Port 5433: lost
- ✗ PostgreSQL decision: lost
- ✗ File path src/auth.ts: lost
- ✗ Tool calls & file refs: lost
- Recent turns 38–40 (kept): only the last few messages survive.

Lantern restores:
- ✓ DB port = 5433 | config/db.yaml
- ✓ Chose PostgreSQL over MongoDB
- ✓ Auth middleware | src/auth.ts
- ✓ Tool calls: write_file, run_cmd

Figure 1: The context cliff in practice. Left: a coding conversation with specific, recoverable facts (highlighted). Right top: after compaction, early turns are replaced by a vague summary and specific facts are destroyed. Right bottom: Lantern restores the lost details from its archival store via hybrid retrieval.

This problem affects every extended LLM interaction. Coding assistants lose configuration values and architectural decisions. Support agents forget customer details mentioned early in a session. Research assistants lose citations and numerical results from earlier analysis.
Existing approaches each address parts of this problem but fall short individually. Sliding windows preserve only recent context. RAG systems (Lewis et al., [2020](https://arxiv.org/html/2606.05182#bib.bib8)) retrieve from static documents rather than live conversation history. Summarization inherently loses specificity. MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.05182#bib.bib15)) introduces explicit memory paging but relies on the LLM itself to decide what to archive, incurring latency and cost.
We present Lantern, a compaction-aware memory system that combines proactive extractive archival with hybrid retrieval via Reciprocal Rank Fusion (Cormack et al., [2009](https://arxiv.org/html/2606.05182#bib.bib5)) into a pipeline that requires zero LLM calls during archival and base restoration. The key insight is that LLM-driven fact extraction, the dominant paradigm in conversational memory systems, is unnecessary: a well-designed extractive archival pipeline that fuses multiple retrieval signals can match or exceed LLM-driven approaches at orders-of-magnitude lower cost. An optional confidence-decay mechanism for multi-session curation is evaluated in Appendix [D](https://arxiv.org/html/2606.05182#A4).
Our contributions are as follows. (1) We demonstrate that Lantern-Rerank recovers 78.3% of facts lost to compaction, significantly outperforming MemGPT-Faithful (72.4%; $p < 0.0001$, $d = 0.43$, 95% CI $[+3.1, +8.6]$ pp). Even without the reranker, base Lantern (76.3%) outperforms this LLM-driven baseline ($p = 0.005$) while requiring zero LLM calls, establishing that extraction-free archival with hybrid retrieval is a cost-effective alternative to LLM-driven memory. (2) We show that the recovered context is broadly useful: four production LLMs improve their accuracy by 8.4 pp on average when answering questions with Lantern-restored context ($p < 0.05$ for each model), and we characterize a coverage–coherence trade-off between base retrieval (quality 4.42/5) and reranking (4.11/5). (3) We release a rigorous evaluation framework (1,894 human-validated facts across 94 real conversations) with failure analysis, fact-type stratification, paired statistical tests, and compaction robustness analysis, establishing a benchmark for future compaction-aware memory research.
## 2 Related Work
#### Context window extension.
Architectures such as RoPE (Su et al., [2021](https://arxiv.org/html/2606.05182#bib.bib20)), ALiBi (Press et al., [2022](https://arxiv.org/html/2606.05182#bib.bib18)), and Longformer (Beltagy et al., [2020](https://arxiv.org/html/2606.05182#bib.bib1)) support longer sequences, but do not address information loss when the window is exceeded. Liu et al. ([2024](https://arxiv.org/html/2606.05182#bib.bib11)) show that LLMs underutilize information in the middle of long contexts. Ring Attention (Liu et al., [2023](https://arxiv.org/html/2606.05182#bib.bib10)) is a distributed attention algorithm that partitions the sequence across devices to enable near-infinite sequence lengths at the systems level; however, it does not address the *semantic* loss that occurs when earlier turns become diluted. Infini-attention (Munkhdalai et al., [2024](https://arxiv.org/html/2606.05182#bib.bib14)) integrates a compressive memory directly into the attention mechanism but requires model retraining, limiting its applicability to API-served LLMs. StreamingLLM (Xiao et al., [2024b](https://arxiv.org/html/2606.05182#bib.bib25)) maintains attention sinks to enable stable streaming inference, while SnapKV (Li et al., [2024](https://arxiv.org/html/2606.05182#bib.bib9)) compresses key-value caches; both address efficiency rather than information preservation. InfLLM (Xiao et al., [2024a](https://arxiv.org/html/2606.05182#bib.bib24)) provides training-free context extrapolation via an efficient context memory. These findings collectively motivate application-level memory as a practical and complementary alternative to architectural extensions.
#### Retrieval\-augmented generation\.
RAG (Lewis et al., [2020](https://arxiv.org/html/2606.05182#bib.bib8)) and RETRO (Borgeaud et al., [2022](https://arxiv.org/html/2606.05182#bib.bib2)) augment LLMs with document retrieval. These systems are designed for static knowledge bases and do not handle the temporal, evolving nature of live conversations. HippoRAG (Gutiérrez et al., [2024](https://arxiv.org/html/2606.05182#bib.bib7)) draws on neurobiological principles for long-term memory in LLMs but targets knowledge graph construction rather than conversational fact recovery.
#### Memory\-augmented agents\.
MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.05182#bib.bib15)) introduces OS-style memory paging for LLM agents, delegating archival decisions to the LLM. Park et al. ([2023](https://arxiv.org/html/2606.05182#bib.bib16)) implement reflection-based memory for generative agents. Zhang et al. ([2024](https://arxiv.org/html/2606.05182#bib.bib26)) survey memory mechanisms in LLM agents, identifying a gap in systematic evaluation of context persistence. Wang et al. ([2024](https://arxiv.org/html/2606.05182#bib.bib22)) and Modarressi et al. ([2024](https://arxiv.org/html/2606.05182#bib.bib13)) explore model-level read-write memory. Larimar (Das et al., [2024](https://arxiv.org/html/2606.05182#bib.bib6)) introduces episodic memory control for LLMs via external memory modules. The CoALA framework (Sumers et al., [2024](https://arxiv.org/html/2606.05182#bib.bib21)) proposes cognitive architectures for language agents with structured memory components.
#### Conversational memory benchmarks\.
LongMemEval (Wu et al., [2024](https://arxiv.org/html/2606.05182#bib.bib23)) benchmarks chat assistants on long-term interactive memory across sessions. LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2606.05182#bib.bib12)) provides a dataset for evaluating long conversational memory. Lantern’s evaluation framework complements these by focusing specifically on *within-session* fact recovery after compaction events.
Lantern differs from prior work along three axes. Unlike truncation and sliding windows, it archives context *before* compaction. Unlike standard RAG, it indexes live conversation turns rather than static documents. Unlike MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.05182#bib.bib15)), which delegates archival decisions to the LLM (incurring latency and cost), Lantern’s archival and base retrieval require zero LLM calls; the LLM is invoked only for the optional reranking step and for compaction itself.
## 3 Method
Lantern operates as middleware between the application and the LLM. It observes every conversation turn, maintains a persistent SQLite store, and injects restored context after compaction events. The system has two core phases, *Archive* and *Restore*, with an optional *Reinforce* phase for multi-session curation (Figure [2](https://arxiv.org/html/2606.05182#S3.F2)).
[Figure 2 diagram labels. ➊ Archive (every turn, zero LLM calls): Conversation Turn from the LLM runtime → Chunk Turn (user, assistant, tool calls) → Summarise (extractive, $\leq$ 1200 chars) → Tag & Classify (episodic / semantic / procedural) → Embed (MiniLM-L6-v2, 384-d) → write. Memory Store: SQLite, WAL mode, FTS5 index, dedup hash, $c_0 = 0.5$, $\sigma_0 = 0.5$. ➋ Restore (on compaction event): fetch → Semantic (cosine), Full-Text (FTS5), Keyword (Jaccard), Importance ($R \cdot F \cdot D \cdot c \cdot \sigma$) → RRF Fusion ($\sum 1/(60 + \mathrm{rank})$) → MMR Diversity ($\lambda = 0.7$) → Budget Pack ($B = 6\text{k}$ chars) → inject context. ➌ Reinforce (self-curation loop): Boost $c + 0.15$, Decay $c - 0.02$, Prune $c < 0.15$; restored IDs update $c$, $\sigma$.]
Figure 2: Lantern system architecture. Archive (blue): every turn is chunked, summarised, tagged, and embedded with zero LLM calls. Memory Store (teal): WAL-mode SQLite with FTS5 index, deduplication, and per-entry confidence $c$ and EMA success rate $\sigma$. Restore (orange): on compaction, four parallel retrieval signals merge via Reciprocal Rank Fusion, diversify via MMR ($\lambda = 0.7$), and pack into a 6,000-char budget. Reinforce (green): retrieved entries are boosted, non-retrieved entries decay, and stale entries are pruned, closing a self-curation loop.
### 3.1 Proactive Archival
On each turn, Lantern performs five operations with zero LLM calls:
#### 1\. Chunking\.
User and assistant messages are grouped into turn pairs along with tool\-call metadata and file paths\.
#### 2\. Extractive summarization\.
A summary is produced deterministically: up to 500 characters from each message plus tool and file references, truncated to 1200 characters\.
#### 3\. Embedding\.
The summary is encoded using a sentence transformer (all-MiniLM-L6-v2, 384 dimensions; Reimers and Gurevych, [2019](https://arxiv.org/html/2606.05182#bib.bib19)).
#### 4\. Tag and type extraction\.
Tags \(e\.g\., file paths, error codes, function names\) are extracted via pattern matching\. Each turn is classified into a memory type \(episodic, semantic, or procedural\) to support downstream filtering\.
#### 5\. Storage\.
The entry is written to SQLite \(WAL mode, FTS5 full\-text index\) with metadata: confidence score \(initialized to 0\.5\), access count, timestamps, tags, and memory type\.
Per-turn archival cost: zero LLM API calls, $<$25 ms latency, $\sim$2 KB storage. (Compaction itself is performed by the host LLM runtime and is not part of Lantern’s archival pipeline.)
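The deterministic parts of the archival steps (extractive summarization and pattern-based tagging) can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the 500- and 1200-character limits come from the text, while the function names, the way tool and file references are appended, and the regular expressions in `TAG_PATTERNS` are assumptions made for the example.

```python
import re
from typing import List, Sequence

PER_MESSAGE_CHARS = 500    # per-message cap stated in the text
SUMMARY_MAX_CHARS = 1200   # overall cap stated in the text

# Hypothetical tag patterns; the paper does not list its actual rules.
TAG_PATTERNS = {
    "file_path": re.compile(r"(?:[\w.-]+/)+[\w-]+\.\w+"),
    "error_code": re.compile(r"\bE[A-Z]{5,}\b"),
    "function": re.compile(r"\b[A-Za-z_]\w*\(\)"),
}


def extractive_summary(user_msg: str, assistant_msg: str,
                       tools: Sequence[str] = (),
                       files: Sequence[str] = ()) -> str:
    """Build a summary by truncation only, with no LLM call."""
    parts = [user_msg[:PER_MESSAGE_CHARS], assistant_msg[:PER_MESSAGE_CHARS]]
    if tools:
        parts.append("tools: " + ", ".join(tools))
    if files:
        parts.append("files: " + ", ".join(files))
    return "\n".join(parts)[:SUMMARY_MAX_CHARS]


def extract_tags(text: str) -> List[str]:
    """Collect pattern matches (file paths, error codes, function names)."""
    tags = set()
    for pattern in TAG_PATTERNS.values():
        tags.update(pattern.findall(text))
    return sorted(tags)


summary = extractive_summary(
    "Set the DB port to 5433 in config/db.yaml",
    "Done. Updated config/db.yaml, port set to 5433.",
    tools=["write_file"],
    files=["config/db.yaml"],
)
tags = extract_tags(summary)
```

Because every step is string slicing or regex matching, the per-turn cost stays independent of any model call, which is the property the archival phase relies on.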
### 3.2 Hybrid Retrieval and Restoration
When compaction is detected, Lantern restores context within a character budget $B$. Retrieval combines four ranked lists fused via Reciprocal Rank Fusion (RRF; Cormack et al., [2009](https://arxiv.org/html/2606.05182#bib.bib5)):
#### Semantic similarity\.
Cosine similarity between the query embedding and stored entry embeddings\.
#### Full\-text search\.
SQLite FTS5 ranking over entry summaries and content\.
#### Keyword overlap\.
Jaccard\-like overlap between query terms and entry lookup hints \(tags, file paths, tool names\)\.
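As a rough illustration of the keyword signal, the sketch below scores a query against an entry’s lookup hints with plain Jaccard similarity. The text only says “Jaccard-like,” so the exact formula, the `tokenize` helper, and its character class are assumptions.

```python
import re
from typing import Iterable, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split into terms, keeping path-like tokens intact."""
    return set(re.findall(r"[a-z0-9_./-]+", text.lower()))


def jaccard(query_terms: Iterable[str], hints: Iterable[str]) -> float:
    """|Q ∩ H| / |Q ∪ H|, or 0.0 when either side is empty."""
    q, h = set(query_terms), set(hints)
    if not q or not h:
        return 0.0
    return len(q & h) / len(q | h)


query = tokenize("What port is in config/db.yaml?")
hints = {"config/db.yaml", "write_file", "port"}
score = jaccard(query, hints)  # 2 shared terms out of 6 distinct terms
```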
#### Importance scoring\.
Each entry is scored by:
$$I(e) = R(e) \cdot F(e) \cdot D(e) \cdot c_e \cdot \sigma_e \tag{2}$$
where $R(e) = \exp(-0.693 \cdot \Delta t / T_{1/2})$ is recency (half-life $T_{1/2} = 7$ days), $F(e) = \log_2(a_e + 1) + 1$ is frequency (with $a_e$ the entry’s access count), $D(e)$ is richness (bonuses for tool calls and file references), $c_e$ is confidence, and $\sigma_e$ is the EMA success rate.
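Equation (2) translates directly into code. In the sketch below the recency and frequency terms follow the formulas in the text; the richness term is an assumption, since the text says only that tool calls and file references earn bonuses, so the `bonus=0.25` value and the additive form are hypothetical.

```python
import math

HALF_LIFE_DAYS = 7.0  # T_{1/2} from the text


def recency(age_days: float) -> float:
    """R(e) = exp(-0.693 * dt / T_half); 0.693 approximates ln 2."""
    return math.exp(-0.693 * age_days / HALF_LIFE_DAYS)


def frequency(access_count: int) -> float:
    """F(e) = log2(a_e + 1) + 1, so a never-accessed entry scores 1."""
    return math.log2(access_count + 1) + 1


def richness(has_tool_calls: bool, has_file_refs: bool,
             bonus: float = 0.25) -> float:
    """D(e): hypothetical additive bonuses (the exact values are not given)."""
    return 1.0 + bonus * has_tool_calls + bonus * has_file_refs


def importance(age_days: float, access_count: int,
               has_tool_calls: bool, has_file_refs: bool,
               confidence: float, success_rate: float) -> float:
    """I(e) = R * F * D * c * sigma, as in Eq. (2)."""
    return (recency(age_days) * frequency(access_count)
            * richness(has_tool_calls, has_file_refs)
            * confidence * success_rate)
```

With the initial values $c = 0.5$ and $\sigma = 0.5$, a fresh, never-accessed entry without tool or file references scores 0.25, and its recency factor halves after seven days.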
The four ranked lists are fused with RRF constant $k = 60$:
$$\text{RRF}(e) = \sum_{L \in \mathcal{L}} \frac{1}{k + \text{rank}_L(e)} \tag{3}$$
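A minimal sketch of Eq. (3) follows. It assumes 1-based ranks and that an entry absent from a list contributes nothing for that list; the function name `rrf_fuse` and the tie-break on entry ID are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(ranked_lists: Sequence[Sequence[str]],
             k: int = 60) -> List[Tuple[str, float]]:
    """Sum 1 / (k + rank) over every list in which an entry appears."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, entry_id in enumerate(ranking, start=1):
            scores[entry_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by entry ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse([
    ["a", "b", "c"],  # semantic similarity
    ["b", "a"],       # full-text search
    ["b", "c"],       # keyword overlap
    ["c", "a", "b"],  # importance
])
order = [entry_id for entry_id, _ in fused]
```

Entry `b` wins here because it appears in all four lists, which shows how RRF rewards agreement across signals rather than a single high rank.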
Maximal Marginal Relevance (MMR; Carbonell and Goldstein, [1998](https://arxiv.org/html/2606.05182#bib.bib3)) is applied to the fused ranking to promote diversity before packing entries into the budget.
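The diversification and packing steps might look like the sketch below. It uses the standard greedy MMR rule with $\lambda = 0.7$ and a 6,000-character budget from the text; everything else is an assumption, including relevance scores normalized to $[0, 1]$, cosine similarity as the redundancy measure, and the policy of skipping entries that do not fit.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(candidates: Sequence[str], relevance: Dict[str, float],
               vectors: Dict[str, Sequence[float]], lam: float = 0.7,
               limit: Optional[int] = None) -> List[str]:
    """Greedy MMR: trade relevance against similarity to already-picked items."""
    remaining = list(candidates)
    selected: List[str] = []
    while remaining and (limit is None or len(selected) < limit):
        def mmr_score(cand: str) -> float:
            redundancy = max(
                (cosine(vectors[cand], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[cand] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def pack_budget(ordered_ids: Sequence[str], texts: Dict[str, str],
                budget_chars: int = 6000) -> List[str]:
    """Add entries in order, skipping any that would exceed the budget."""
    packed, used = [], 0
    for entry_id in ordered_ids:
        size = len(texts[entry_id])
        if used + size <= budget_chars:
            packed.append(entry_id)
            used += size
    return packed


vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
relevance = {"a": 1.0, "b": 0.9, "c": 0.5}
picked = mmr_select(["a", "b", "c"], relevance, vectors)
```

In the example, `b` is a near-duplicate of `a`, so the less relevant but novel `c` is picked second.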
#### Optional reranking \(Lantern\-Rerank\)\.
After RRF fusion and MMR, an optional single LLM call reranks the top candidates based on relevance to the compaction context. This adds one LLM call at restore time ($\sim$200 ms) and improves fact recovery by an additional 2 percentage points, though at a slight cost to context coherence (§[5](https://arxiv.org/html/2606.05182#S5)).
#### Optional: Confidence\-Decay Reinforcement\.
After each restore cycle, retrieved entries receive a confidence boost ($\alpha = 0.15$) while non-retrieved entries decay ($\beta = 0.02$) toward a floor ($\gamma = 0.1$); entries that remain at the floor with low success rates are pruned. This mechanism is designed for multi-session deployments where the store accumulates entries over time and stale facts need culling. In our single-session evaluation it contributes only +1.7 pp (Appendix [D](https://arxiv.org/html/2606.05182#A4)), so the headline results rely entirely on the Archive and Restore phases described above.
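A sketch of one reinforcement cycle is given below. The additive boost, decay, and floor values are taken from the text; the cap at 1.0, the `min_success` pruning threshold, and the function names are assumptions, because the text does not define “low success rates” numerically.

```python
from typing import Dict, Set

ALPHA = 0.15  # boost for retrieved entries (from the text)
BETA = 0.02   # decay for non-retrieved entries (from the text)
FLOOR = 0.1   # confidence floor gamma (from the text)


def reinforce(confidence: Dict[str, float],
              retrieved_ids: Set[str]) -> Dict[str, float]:
    """Boost retrieved entries and decay the rest toward the floor."""
    updated = {}
    for entry_id, c in confidence.items():
        if entry_id in retrieved_ids:
            updated[entry_id] = min(1.0, c + ALPHA)  # cap at 1.0 is assumed
        else:
            updated[entry_id] = max(FLOOR, c - BETA)
    return updated


def prune(confidence: Dict[str, float], success_rate: Dict[str, float],
          min_success: float = 0.2) -> Dict[str, float]:
    """Drop entries sitting at the floor whose success rate is low.

    min_success is a hypothetical threshold chosen for illustration.
    """
    return {
        entry_id: c
        for entry_id, c in confidence.items()
        if not (c <= FLOOR and success_rate.get(entry_id, 0.0) < min_success)
    }


after_cycle = reinforce({"a": 0.5, "b": 0.5, "c": 0.11}, retrieved_ids={"a"})
```

Starting from the initial confidence of 0.5, an entry that is never retrieved needs 20 cycles of 0.02 decay to reach the 0.1 floor, which is why the mechanism matters mainly in long-lived, multi-session stores.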
## 4 Experimental Setup
### 4.1 Data
We evaluate on 100 real multi-turn conversations randomly sampled from the ShareGPT corpus, a publicly available collection of human–LLM interactions, filtered for at least 20 turns to ensure meaningful compaction events. A heuristic topic classification shows the sample spans diverse domains: coding (43%), creative writing (16%), general knowledge (9%), data analysis (8%), business (4%), and other/mixed (20%). The dataset is not narrowly technical: 57% of conversations involve non-coding topics. The 20-turn minimum favors longer, more structured conversations; shorter interactions are underrepresented. After excluding 6 conversations with zero extractable facts, the evaluation set contains 94 conversations comprising 1,894 ground-truth facts. Table [1](https://arxiv.org/html/2606.05182#S4.T1) summarizes the dataset statistics per experiment.
Table 1: Dataset statistics per experiment. $N$ = conversations evaluated, Facts = total ground-truth facts.
### 4.2 Ground Truth Extraction
We extract ground\-truth facts using a two\-stage LLM pipeline: \(1\) a lightweight model identifies specific, verifiable facts \(configuration values, decisions, error codes, function names, commands, entity references\); \(2\) for each fact, a probe question and expected answer are generated\. Facts are validated by checking that the expected answer appears in the source conversation; hallucinated facts are discarded\.
### 4.3 Compaction
For the primary experiments, compaction uses LLM\-driven abstractive summarization at 50% of the conversation length\. All messages before the compaction point \(minus a 4\-message recency window\) are replaced with an LLM\-generated summary\. Experiment 7 additionally tests extractive and sliding\-window compaction strategies at 30%, 50%, and 70% compaction points to evaluate robustness\.
### 4.4 Evaluation Metrics
- Recovery rate: fraction of ground-truth facts recoverable from the restored context, evaluated via an LLM judge that determines whether each fact’s answer is semantically present in the context.
- Quality score: an LLM judge rates the overall quality of restored context on a 1–5 scale.
- Live accuracy: fraction of probe questions that production LLMs answer correctly with and without Lantern context.
#### Human validation of LLM judge\.
To calibrate trust in our automated evaluation, we conducted a human annotation study on a random sample of 100 fact-recovery judgments spanning all methods. Two annotators independently assessed whether each fact’s expected answer was semantically present in the restored context. Inter-annotator agreement was substantial (Cohen’s $\kappa = 0.78$). The LLM judge agreed with the majority human label on 91 of 100 cases (91% agreement, $\kappa = 0.81$ against the human consensus). Of the 9 disagreements, 6 were borderline cases where the fact was partially present; the remaining 3 were false negatives on semantically equivalent paraphrases. This suggests the automated evaluation is a reliable proxy for human judgment, with a slight conservative bias.
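For readers unfamiliar with the agreement statistic, the sketch below computes Cohen’s $\kappa$ from two label sequences: observed agreement corrected for the agreement expected by chance. The labels in the example are synthetic and are not the study’s annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Synthetic example: 90% raw agreement on balanced binary labels.
rater_a = [1] * 50 + [0] * 50
rater_b = [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5
kappa = cohens_kappa(rater_a, rater_b)
```

The example shows why $\kappa$ is lower than raw agreement: with balanced labels, half of the agreement would be expected by chance alone.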
### 4.5 Baselines
- Summarization: extractive summaries sorted newest-first within budget.
- Neural RAG: neural (MiniLM) embeddings with cosine similarity retrieval.
- MemGPT-Faithful: a controlled reimplementation of the core archival pipeline described in MemGPT (Packer et al., [2023](https://arxiv.org/html/2606.05182#bib.bib15)), built from their open-source code: (i) *LLM-driven fact extraction* from each turn batch (5 turns per batch, 500 chars per summary, 2048 max tokens); (ii) *LLM-formulated multi-query search* where the model generates 3 diverse search queries; and (iii) *neural retrieval* with max-score fusion across queries. This baseline is deliberately controlled: it shares the same embedding model, character budget, and evaluation pipeline as all other methods, isolating the effect of the archival and retrieval strategy. MemGPT also includes a self-directed memory paging loop during generation; we discuss the scope of our reimplementation in §[8](https://arxiv.org/html/2606.05182#S8).

All methods receive the same character budget (6,000 characters) and the same post-compaction context. Lantern and MemGPT-Faithful additionally receive the full pre-compaction history for archival.
## 5 Results
### 5.1 Experiment 1: Fact Recovery
Table [2](https://arxiv.org/html/2606.05182#S5.T2) presents recovery rates on $N = 94$ real conversations comprising 1,894 ground-truth facts. Lantern-Rerank recovers 78.3% of facts, nearly 6 percentage points above MemGPT-Faithful (72.4%), using a single LLM call versus MemGPT-Faithful’s 21 calls per 100-turn conversation. Base Lantern (76.3%), which uses *zero* LLM calls, still outperforms the LLM-driven baseline.
Table 2: Fact recovery rate on $N = 94$ conversations (1,894 facts). All methods receive a 6,000-character budget.
Figure [3](https://arxiv.org/html/2606.05182#S5.F3) visualizes these results. All comparisons are paired: every method is evaluated on the same 94 conversations with the same compaction and budget. Wilcoxon signed-rank tests confirm that Lantern-Rerank significantly outperforms MemGPT-Faithful ($p < 0.0001$; paired bootstrap 95% CI: $[+3.1, +8.6]$ pp; Cohen’s $d = 0.43$, medium effect). Base Lantern also significantly outperforms MemGPT-Faithful ($p = 0.005$; CI: $[+0.8, +7.0]$ pp; $d = 0.26$) at zero LLM cost. The incremental gain from reranking (+1.9 pp, $p = 0.10$) is small, reinforcing our finding that the retrieval pipeline already produces a strong candidate set (§[7](https://arxiv.org/html/2606.05182#S7)). All three archival methods (Lantern, Lantern-Rerank, and MemGPT-Faithful) significantly outperform Neural RAG and Summarization ($p < 0.001$; $d > 0.5$, large effects); the 15 pp gap between Lantern and Neural RAG ($d = 0.81$) quantifies the value of hybrid retrieval over pure semantic search. Full test results are in Appendix [B](https://arxiv.org/html/2606.05182#A2).
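The paired bootstrap interval and effect size reported above can be approximated with the standard library alone. The sketch below is one common formulation and may differ from the authors’ procedure: it resamples per-conversation differences with replacement, takes percentile bounds, and computes Cohen’s $d$ as the mean difference divided by the standard deviation of the differences (the paper does not state which variant of $d$ it uses). The recovery rates are invented for the example.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(a: Sequence[float], b: Sequence[float]) -> List[float]:
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    return [x - y for x, y in zip(a, b)]


def paired_bootstrap_ci(a: Sequence[float], b: Sequence[float],
                        n_resamples: int = 10_000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean paired difference a - b."""
    diffs = paired_differences(a, b)
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def cohens_d_paired(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean difference over the standard deviation of the differences."""
    diffs = paired_differences(a, b)
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Invented per-conversation recovery rates for two methods.
method_a = [0.80, 0.70, 0.90, 0.75, 0.85, 0.60]
method_b = [0.70, 0.65, 0.80, 0.70, 0.80, 0.60]
ci_low, ci_high = paired_bootstrap_ci(method_a, method_b, n_resamples=2000)
effect = cohens_d_paired(method_a, method_b)
```

Pairing matters because per-conversation difficulty varies widely; resampling the differences removes that shared variance from the comparison.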
Figure 3: Fact recovery rates across methods ($N = 94$, 1,894 facts). Error bars show $\pm$1 standard deviation across conversations; paired 95% bootstrap CIs and Wilcoxon $p$-values for all pairwise comparisons are reported in Appendix [B](https://arxiv.org/html/2606.05182#A2), Table [12](https://arxiv.org/html/2606.05182#A2.T12). The dashed line marks the MemGPT-Faithful baseline.
### 5.2 Experiment 2: Live Cross-Model Evaluation
We evaluate whether Lantern’s restored context helps LLMs answer questions in practice. For $N = 50$ conversations (137 probe questions), each question is posed to four production LLMs with and without Lantern restoration. Table [3](https://arxiv.org/html/2606.05182#S5.T3) shows results.
Table 3: Live cross-model evaluation ($N = 50$ conversations, 137 questions per model). $\Delta$ = improvement with Lantern. Wilcoxon $p$-values are paired on per-conversation accuracy.
Lantern improves accuracy by 8.4 percentage points on average across all 548 questions (Figure [4](https://arxiv.org/html/2606.05182#S5.F4)). Every model improves significantly: Wilcoxon $p = 0.010$ for Claude Sonnet 4.5, $p = 0.018$ for Gemini 2.5 Flash, $p = 0.020$ for GPT-4o Mini, and $p = 0.046$ for GPT-5 Nano (Table [3](https://arxiv.org/html/2606.05182#S5.T3)). Sign tests confirm the consistency: each model shows 11–15 conversations improving versus only 3–5 declining, with the remainder tied. The gains are architecturally diverse, spanning two OpenAI models, one Google model, and one Anthropic model, which indicates that Lantern’s restored context is broadly useful rather than tuned to any particular model’s behavior.
Figure 4: Live LLM accuracy with and without Lantern-restored context ($N = 50$ conversations). Annotations show the per-model accuracy gain. The dotted line separates individual models from the aggregate average.
### 5.3 Experiment 3: LLM-Judged Context Quality
An LLM judge rates the quality of restored context on a 1–5 scale across $N = 39$ conversations (107 scored items per method; Table [4](https://arxiv.org/html/2606.05182#S5.T4)).
Table 4: LLM-judged context quality (1–5 scale, $N = 39$ conversations).
Lantern scores 4.42/5, a 0.50-point lead over MemGPT-Faithful (3.92). The quality gap is notably larger than the recovery-rate gap, suggesting that Lantern’s hybrid retrieval selects more coherent and contextually relevant entries. Lantern (without reranking) scores higher than Lantern-Rerank (4.42 vs. 4.11) on quality despite lower fact recovery (76.3% vs. 78.3%). This reveals a *coverage–coherence trade-off*: the reranker packs more facts into the budget at the cost of less coherent context. Practitioners should choose between the two variants based on whether completeness or readability is the priority.
### 5.4 Experiment 4: Fact-Type Stratification
To assess whether Lantern’s advantage is driven by lexically matchable facts (e.g., config values, commands) rather than semantically complex ones (e.g., decisions, entities), we stratify recovery by fact type across 1,657 individual fact evaluations (Table [5](https://arxiv.org/html/2606.05182#S5.T5)).
Table 5: Recovery rate stratified by fact type. The five largest categories are shown; full breakdown across all 13 categories is in Appendix [C](https://arxiv.org/html/2606.05182#A3), Table [13](https://arxiv.org/html/2606.05182#A3.T13).
Lantern’s advantage over MemGPT-Faithful is largest for code-related facts (+16.1 pp) and commands (+10.5 pp), which benefit from keyword and FTS matching of file paths, function names, and shell commands. On decision-type facts, which require semantic understanding, MemGPT-Faithful actually edges out base Lantern by 0.8 pp (though Lantern-Rerank recovers this gap, reaching 82.0% vs. 78.7%). On the smaller *Goal* category ($n = 21$), MemGPT-Faithful leads base Lantern by 9.5 pp, and on *Problem* ($n = 6$) the reranker collapses from 66.7% to 16.7%, a reminder that single-call reranking can overfit to lexical cues and mis-order semantically similar candidates in small-sample categories (see Appendix [C](https://arxiv.org/html/2606.05182#A3)). This pattern is consistent with our expectation: the hybrid retrieval pipeline’s primary advantage over LLM-driven extraction comes from capturing surface-level specifics that LLM summarization discards, while purely semantic categories remain a harder problem where LLM-based approaches are competitive. We note this as an explicit caveat: Lantern’s aggregate advantage is partly driven by lexically matchable fact types, and the reranker variant is not uniformly better across categories.
### 5.5 Experiment 5: Embedding Comparison
We compare four embedding strategies across 46 conversations and 1,718 retrieval probes (Table [6](https://arxiv.org/html/2606.05182#S5.T6)).
Table 6: Embedding model comparison. Recall@$k$ measures whether the correct entry appears in the top-$k$ retrieved results.
Neural embeddings (MiniLM, MPNet) substantially outperform non-neural alternatives at Recall@5, but the gap narrows at Recall@10. MiniLM and MPNet perform comparably, validating our default choice of the lighter model (384-d vs. 768-d).
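Recall@$k$ as defined in the caption reduces to a hit count over probes. The sketch below assumes one gold entry per probe; the function name and the toy data are illustrative.

```python
from typing import List, Sequence


def recall_at_k(retrieved: Sequence[Sequence[str]],
                gold: Sequence[str], k: int) -> float:
    """Fraction of probes whose gold entry appears in the top-k results."""
    if len(retrieved) != len(gold):
        raise ValueError("need one ranked list per gold entry")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, target in zip(retrieved, gold)
               if target in ranked[:k])
    return hits / len(gold)


# Three toy probes: ranked results and the single correct entry for each.
ranked_results: List[List[str]] = [
    ["a", "b", "c"],
    ["x", "y", "z"],
    ["m", "n", "o"],
]
gold_entries = ["b", "z", "q"]
```

Because a larger $k$ can only add hits, Recall@10 is never below Recall@5, which is why weaker embeddings appear to catch up as $k$ grows.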
### 5.6 Experiment 6: Hyperparameter Sensitivity
We sweep three key hyperparameters across 46 conversations (Table [7](https://arxiv.org/html/2606.05182#S5.T7)).
Table 7: Hyperparameter sensitivity. Recovery rate at different budget sizes, RRF constants, and decay rates.
Recovery improves sharply from 2,000 to 6,000 characters, then plateaus (Figure [5](https://arxiv.org/html/2606.05182#S5.F5)). This confirms the default budget of 6,000 characters as a practical operating point. The RRF constant and decay rate show no meaningful sensitivity within the tested ranges, indicating that the hybrid retrieval mechanism is robust to these settings.
Figure 5: Recovery rate as a function of restore budget ($N = 46$). Performance saturates around 6,000–8,000 characters, confirming the default operating point.
### 5.7 Experiment 7: Compaction Strategy Robustness
We evaluate robustness across three compaction strategies (extractive, abstractive, sliding window) at three compaction points (30%, 50%, 70%) on 50 conversations (Table [8](https://arxiv.org/html/2606.05182#S5.T8)).
Table 8: Recovery rate across compaction strategies (averaged over three compaction points). Lantern’s advantage holds for extractive and abstractive; all methods converge under sliding window.
Lantern maintains its advantage under both extractive and abstractive compaction, with mean recovery $\Delta < 1\,\text{pp}$ between strategies. Under sliding-window compaction (which retains recent messages verbatim rather than summarizing), all methods converge because there is no summary to degrade retrieval quality. This confirms that Lantern’s proactive archival is most valuable precisely when compaction is lossy.
#### Confidence\-decay ablation\.
An optional confidence-decay mechanism boosts entries that are retrieved and prunes stale entries over time. In an 8-event multi-compaction simulation, decay provides a small but significant improvement (+1.7 pp, $p < 0.001$). However, in the single-session setting that dominates our primary evaluation, the effect is minimal. We report the full ablation in Appendix [D](https://arxiv.org/html/2606.05182#A4); we do not include confidence decay in the headline contributions.
## 6 Analysis
#### Why hybrid retrieval matters.
The gap between Lantern (76.3%) and Neural RAG (61.3%) is attributable to the hybrid retrieval pipeline. Semantic similarity alone misses facts that share few surface-level features with the query but are topically relevant. Full-text search catches exact keyword matches that embedding models may not prioritize. RRF fusion allows each signal to compensate for the others’ blind spots.
#### The value of proactive archival\.
Neural RAG and Summarization only index post-compaction messages. Lantern’s proactive archival preserves the pre-compaction history, enabling recovery of facts from early turns that were destroyed by compaction.
#### Coverage–coherence trade\-off in reranking\.
Lantern-Rerank uses a single LLM call to reorder retrieved candidates, gaining +2.0 pp in recovery over base Lantern but scoring 0.31 points lower on coherence (4.11 vs. 4.42). This reveals a fundamental trade-off: the reranker optimizes for fact density, packing more recoverable facts into the budget at the cost of narrative flow. The practical implication is that deployments can choose their operating point: base Lantern when coherence matters (tutoring, support), Lantern-Rerank when fact completeness is paramount (coding, debugging). Notably, MemGPT-Faithful uses *many* more LLM calls (extraction + query formulation) yet trails Lantern-Rerank by 5.9 pp ($p < 0.0001$) on recovery. This suggests that *where* the LLM call is spent matters more than *how many* are used: a single well-placed reranking call over a strong candidate set outperforms many upstream extraction calls.
#### Cost–recovery Pareto improvement\.
Lantern’s archival phase requires zero LLM calls\. MemGPT\-Faithful, by contrast, invokes the LLM once per 5\-turn batch for extraction plus one query\-formulation call per compaction: for a 100\-turn conversation, this amounts to 21 additional LLM invocations per session\. At typical 2026 API pricing \(on the order of $0\.15–$0\.60 per million input tokens for compact models\), MemGPT\-Faithful incurs roughly an order of magnitude more per\-session cost than base Lantern, while recovering 5\.9 pp *fewer* facts than Lantern\-Rerank \(Table [12](https://arxiv.org/html/2606.05182#A2.T12)\)\. Even without the reranker, base Lantern outperforms MemGPT\-Faithful by 4\.0 pp at zero LLM cost\. This is the central practical result: Lantern achieves better fact recovery at lower cost and lower latency, a strict Pareto improvement on the cost–recovery frontier\. Table [9](https://arxiv.org/html/2606.05182#S6.T9) provides a latency breakdown\.
Table 9: Latency and cost breakdown for Lantern operations\. Measured on Apple M2 Pro, single\-threaded, warm cache\.
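The call count quoted above follows from simple arithmetic, sketched here. The function name is hypothetical, and rounding a partial final batch up to a full extraction call is an assumption (it does not affect the 100-turn case). The compaction call itself is common to every method and is left out.

```python
import math


def memgpt_faithful_calls(turns: int, batch_size: int = 5, compactions: int = 1) -> int:
    """Additional LLM calls: one extraction call per batch of turns,
    plus one query-formulation call per compaction."""
    extraction_calls = math.ceil(turns / batch_size)  # assumed: partial batch still costs a call
    return extraction_calls + compactions


LANTERN_BASE_CALLS = 0    # archival and retrieval use no LLM
LANTERN_RERANK_CALLS = 1  # one reranking call per restore

calls = memgpt_faithful_calls(turns=100)
print(calls)  # 21
```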
#### Budget saturation\.
Experiment 6 reveals that recovery plateaus around 6,000–8,000 characters\. Beyond this point, additional budget yields diminishing returns because the hybrid retrieval already surfaces the most relevant entries\. This suggests a natural operating point where memory overhead remains modest\.
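To make the role of the restore budget concrete, here is a minimal sketch of packing ranked entries into a character budget. The default of 6,000 characters is the value from Appendix A; the greedy skip-if-too-large policy and the name `pack_to_budget` are assumptions for illustration, not the paper's documented packing strategy.

```python
def pack_to_budget(ranked_entries: list[str], budget_chars: int = 6000) -> list[str]:
    """Greedily keep ranked entries while the running character total fits the budget."""
    kept: list[str] = []
    used = 0
    for text in ranked_entries:
        if used + len(text) > budget_chars:
            continue  # does not fit; a shorter, lower-ranked entry may still fit
        kept.append(text)
        used += len(text)
    return kept


# Hypothetical entries, already sorted by retrieval rank.
entries = ["a" * 2500, "b" * 3000, "c" * 1000, "d" * 400]
restored = pack_to_budget(entries)
print([len(e) for e in restored])  # [2500, 3000, 400]
```

Once the highest-ranked entries already fit, raising the budget only admits lower-ranked entries, which is one way to read the plateau reported above.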
#### Failure analysis: where does the remaining 25% come from?
Of the 1,657 individual fact evaluations for Lantern, 420 \(25\.3%\) are missed\. Table [10](https://arxiv.org/html/2606.05182#S6.T10) decomposes these failures by fact type, revealing two distinct failure modes\. First, *ranking failures*: Lantern\-Rerank recovers 59 of the 420 facts that base Lantern misses, confirming that these facts *were* archived and retrieved but ranked below the budget cutoff\. Second, *coverage failures*: the remaining 361 facts were not surfaced by the retrieval pipeline at all, indicating either archival gaps or query\-fact mismatch\.
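The decomposition can be checked directly from the counts reported above. The variable names are illustrative, and the two shares of misses are derived here rather than quoted from the paper.

```python
total_evals = 1657      # individual fact evaluations for base Lantern
missed = 420            # facts base Lantern fails to recover
ranking_failures = 59   # missed facts that Lantern-Rerank recovers

coverage_failures = missed - ranking_failures
miss_rate = missed / total_evals
ranking_share = ranking_failures / missed
coverage_share = coverage_failures / missed

print(f"miss rate: {miss_rate:.1%}")                                            # 25.3%
print(f"ranking failures: {ranking_failures} ({ranking_share:.1%} of misses)")   # 59 (14.0%)
print(f"coverage failures: {coverage_failures} ({coverage_share:.1%} of misses)")  # 361 (86.0%)
```

Roughly one miss in seven is a ranking failure; the rest never reach the candidate set.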
Config facts show the highest miss rate among common types \(30\.8% vs\. 20\.1% for code and 21\.1% for commands\), suggesting that configuration values—which often lack distinctive keywords—fall into a blind spot between semantic and keyword retrieval\. Code and command facts, by contrast, are lexically distinctive and benefit strongly from FTS5 and keyword matching\. This analysis directly supports the claim that archival coverage—not ranking—is the binding constraint \(§[7](https://arxiv.org/html/2606.05182#S7)\), and points to specific avenues for improvement: richer archival representations for config\-style facts, and cross\-turn coreference resolution for entity facts\.
Table 10: Failure breakdown for Lantern by fact type\. High miss rates on rare types should be interpreted cautiously due to small $N$\.
## 7 Discussion
#### Complementarity with architectural approaches\.
Context\-window challenges are being addressed at three distinct levels: distributed systems \(Ring Attention, Liu et al\. \([2023](https://arxiv.org/html/2606.05182#bib.bib10)\)\), model architecture \(Infini\-attention, Munkhdalai et al\. \([2024](https://arxiv.org/html/2606.05182#bib.bib14)\)\), and application middleware \(Lantern\)\. These approaches compose rather than compete\. Architectural extensions expand the raw capacity of the context window, but effective utilization degrades well before the nominal limit \(Paulsen, [2025](https://arxiv.org/html/2606.05182#bib.bib17); Liu et al\., [2024](https://arxiv.org/html/2606.05182#bib.bib11)\)\. Lantern addresses the complementary problem of *what to put* in that window after compaction\. Crucially, Lantern requires no model fine\-tuning and no access to model internals, making it immediately deployable with any API\-served LLM\.
#### Archival coverage as the binding constraint\.
The recovery gap between base Lantern \(76\.3%\) and Lantern\-Rerank \(78\.3%\) is only 2\.0 pp \($p = 0.10$\), which tells us something important: the hybrid retrieval pipeline already produces a high\-quality candidate set, and the reranker has limited room to improve ranking\. Our failure analysis \(§[6](https://arxiv.org/html/2606.05182#S6)\) confirms this directly: of the 420 facts Lantern misses, only 59 are recovered by reranking; the remaining 361 were never surfaced in the candidate set\. The bottleneck is *what gets archived*, not *how it gets ranked*\. This points future work toward richer archival strategies \(e\.g\., graph\-based linking, cross\-turn coreference resolution\) rather than more sophisticated ranking\.
#### Broader applicability\.
Lantern is model\-agnostic and operates as middleware, making it compatible with any LLM runtime\. The SQLite backend requires no infrastructure beyond the application process\. Potential applications include coding assistants, customer support agents, tutoring systems, and any multi\-turn LLM deployment where session continuity matters\.
#### Future work\.
Several directions remain open\. First, adaptive budget sizing \(dynamically adjusting the restore budget based on context window utilization\) could further improve recovery\. Second, graph\-based memory linking, where entries are connected by causal or topical relationships, may help recover clusters of related facts\. Third, integration with production LLM runtimes would enable real\-world deployment studies\. Fourth, multi\-session memory persistence, where facts from session $A$ inform session $B$, is an important production scenario not covered here\.
## 8 Limitations
1. LLM\-as\-judge evaluation\. Ground\-truth facts are LLM\-extracted, and recovery is LLM\-judged\. We mitigate this with human validation \($\kappa = 0.81$, 100\-sample audit\), which confirms judge reliability\. Our fact\-type stratification further shows that Lantern’s advantage varies by type: it is largest for lexically matchable facts \(code, commands\) and smallest for paraphrase\-heavy facts \(decisions\), consistent with the hybrid retrieval design rather than a judge artifact\.
2. Single dataset\. We evaluate on ShareGPT, a topically diverse corpus \(43% coding, 57% non\-coding\)\. Extending to LoCoMo \(Maharana et al\., [2024](https://arxiv.org/html/2606.05182#bib.bib12)\) or LongMemEval \(Wu et al\., [2024](https://arxiv.org/html/2606.05182#bib.bib23)\) is a natural next step\.
3. Embedding model\. All experiments use all\-MiniLM\-L6\-v2\. Lantern’s embedding is a pluggable component, and we expect modern high\-capacity embeddings \(BGE\-large, Nomic\-embed, text\-embedding\-3\) to further improve performance, particularly for the semantic retrieval signal\.
4. Single\-session scope\. We evaluate within individual conversations; multi\-session persistence is an important production scenario for future work\. The confidence\-decay mechanism \(Appendix [D](https://arxiv.org/html/2606.05182#A4)\) is designed for this setting\.
5. Baseline scope\. Our MemGPT\-Faithful baseline implements the core archival pipeline from Packer et al\. \([2023](https://arxiv.org/html/2606.05182#bib.bib15)\) \(LLM extraction, multi\-query search, neural retrieval\) under controlled conditions \(same embedding, budget, evaluation pipeline\)\. It does not replicate MemGPT’s full self\-directed paging loop during generation\. This design choice ensures a fair comparison: the same inputs, the same budget, the same judge\. Systems like the production Letta framework, Mem0 \(Chhablani et al\., [2024](https://arxiv.org/html/2606.05182#bib.bib4)\), and Zep use different models, budgets, and infrastructure, making controlled comparison difficult; integrating them is an important direction for future work\.
## 9 Conclusion
We presented Lantern, a compaction\-aware memory layer that recovers facts lost when LLM conversations are compacted\. Lantern\-Rerank recovers 78\.3% of verifiable facts on 94 real conversations \(1,894 facts\), significantly outperforming an LLM\-driven archival baseline \($p < 0.0001$, $d = 0.43$\) while requiring an order of magnitude fewer LLM calls per session\. Even without reranking, base Lantern outperforms the LLM\-driven approach \($p = 0.005$\) at zero inference cost and under 25 ms latency\. Across four production LLMs, Lantern\-restored context improves answer accuracy by 8\.4 percentage points \($p < 0.05$ for every model tested\), demonstrating that the benefit transfers across architectures\. Our failure analysis reveals that the remaining recovery gap is primarily an archival\-coverage problem \(facts not preserved during compaction\) rather than a retrieval\-ranking problem, pointing to clear avenues for future improvement\. As LLM\-powered applications move toward longer, multi\-session interactions, compaction\-aware memory becomes essential infrastructure\. Lantern provides both a practical system and an open evaluation framework for this problem\.
## Ethics Statement
We evaluate on publicly available conversation data \(ShareGPT\)\. No private user data was collected\. The system stores conversation content locally; deployment requires appropriate data retention policies and user consent\.
## Reproducibility Statement
All experiments use a single embedding model \(all\-MiniLM\-L6\-v2\) and a single LLM judge model \(GPT\-5 Nano\)\. Hyperparameters are listed in Appendix [A](https://arxiv.org/html/2606.05182#A1); dataset statistics per experiment are in Table [1](https://arxiv.org/html/2606.05182#S4.T1)\. The codebase, evaluation framework, pre\-extracted ground truth, and scripts to reproduce all tables will be released at [https://github\.com/\[redacted\]/lantern](https://github.com/%5Bredacted%5D/lantern) upon publication\.
## References
- Beltagy et al\. \[2020\] Iz Beltagy, Matthew E Peters, and Arman Cohan\. Longformer: The long\-document transformer\. *arXiv preprint arXiv:2004\.05150*, 2020\.
- Borgeaud et al\. \[2022\] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm van den Driessche, Jean\-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al\. Improving language models by retrieving from trillions of tokens\. In *International Conference on Machine Learning \(ICML\)*, pages 2206–2240, 2022\.
- Carbonell and Goldstein \[1998\] Jaime Carbonell and Jade Goldstein\. The use of MMR, diversity\-based reranking for reordering documents and producing summaries\. In *Proceedings of the 21st Annual International ACM SIGIR Conference*, pages 335–336, 1998\.
- Chhablani et al\. \[2024\] Gunjan Chhablani, Singh Taranjeet, and Deshraj Khare\. Mem0: The memory layer for AI applications\. [https://github\.com/mem0ai/mem0](https://github.com/mem0ai/mem0), 2024\.
- Cormack et al\. \[2009\] Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher\. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods\. In *Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 758–759, 2009\.
- Das et al\. \[2024\] Payel Das et al\. Larimar: Large language models with episodic memory control\. In *International Conference on Machine Learning \(ICML\)*, 2024\.
- Gutiérrez et al\. \[2024\] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su\. HippoRAG: Neurobiologically inspired long\-term memory for large language models\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- Lewis et al\. \[2020\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 33, pages 9459–9474, 2020\.
- Li et al\. \[2024\] Yuhong Li et al\. SnapKV: LLM knows what you are looking for before generation\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- Liu et al\. \[2023\] Hao Liu, Matei Zaharia, and Pieter Abbeel\. Ring attention with blockwise transformers for near\-infinite context\. *arXiv preprint arXiv:2310\.01889*, 2023\.
- Liu et al\. \[2024\] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. Lost in the middle: How language models use long contexts\. *Transactions of the Association for Computational Linguistics*, 12:157–173, 2024\.
- Maharana et al\. \[2024\] Adyasha Maharana et al\. LOCOMO: A dataset for long conversational memory\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\.
- Modarressi et al\. \[2024\] Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze\. MemLLM: Finetuning LLMs to use an explicit read\-write memory\. *arXiv preprint arXiv:2404\.11672*, 2024\.
- Munkhdalai et al\. \[2024\] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal\. Leave no context behind: Efficient infinite context transformers with Infini\-attention\. *arXiv preprint arXiv:2404\.07143*, 2024\.
- Packer et al\. \[2023\] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez\. MemGPT: Towards LLMs as operating systems\. *arXiv preprint arXiv:2310\.08560*, 2023\.
- Park et al\. \[2023\] Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein\. Generative agents: Interactive simulacra of human behavior\. In *Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\)*, 2023\.
- Paulsen \[2025\] Norman Paulsen\. Context is what you need: The maximum effective context window for real world limits of LLMs\. *arXiv preprint arXiv:2509\.21361*, 2025\.
- Press et al\. \[2022\] Ofir Press, Noah A Smith, and Mike Lewis\. Train short, test long: Attention with linear biases enables input length generalization\. *arXiv preprint arXiv:2108\.12409*, 2022\.
- Reimers and Gurevych \[2019\] Nils Reimers and Iryna Gurevych\. Sentence\-BERT: Sentence embeddings using Siamese BERT\-networks\. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 3982–3992, 2019\.
- Su et al\. \[2021\] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu\. RoFormer: Enhanced transformer with rotary position embedding\. *arXiv preprint arXiv:2104\.09864*, 2021\.
- Sumers et al\. \[2024\] Theodore R Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L Griffiths\. Cognitive architectures for language agents\. *Transactions on Machine Learning Research \(TMLR\)*, 2024\.
- Wang et al\. \[2024\] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei\. Augmenting language models with long\-term memory\. *arXiv preprint arXiv:2306\.07174*, 2024\.
- Wu et al\. \[2024\] Di Wu et al\. LongMemEval: Benchmarking chat assistants on long\-term interactive memory\. *arXiv preprint arXiv:2410\.10813*, 2024\.
- Xiao et al\. \[2024a\] Chaojun Xiao et al\. InfLLM: Training\-free long\-context extrapolation for LLMs with an efficient context memory\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024a\.
- Xiao et al\. \[2024b\] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis\. Efficient streaming language models with attention sinks\. In *International Conference on Learning Representations \(ICLR\)*, 2024b\.
- Zhang et al\. \[2024\] Zeyu Zhang, Xiaohe Zhang, Tong Wang, Daijia Zhu, Yuanchun Lu, and Zhuosheng Li\. A survey on the memory mechanism of large language model based agents\. *arXiv preprint arXiv:2404\.13501*, 2024\.
## Appendix A Hyperparameters

| Parameter | Symbol | Value |
| --- | --- | --- |
| Confidence boost | $\alpha$ | 0\.15 |
| Confidence decay | $\beta$ | 0\.02/round |
| Confidence floor | $\gamma$ | 0\.1 |
| Min\. confidence for retrieval | \- | 0\.15 |
| Recency half\-life | $T_{1/2}$ | 7 days |
| Restore budget | $B$ | 6000 chars |
| Summary max length | \- | 1200 chars |
| RRF constant | $k$ | 60 |
| MMR diversity $\lambda$ | \- | 0\.7 |
| Embedding model | \- | all\-MiniLM\-L6\-v2 |
| Embedding dimension | \- | 384 |
| Initial confidence | \- | 0\.5 |
| EMA smoothing \(success rate\) | \- | 0\.1 |
| Tool richness bonus | \- | 0\.5 |
| File richness bonus | \- | 0\.3 |

Table 11: Hyperparameters used in all experiments\.
## Appendix B Statistical Testing
All comparisons use paired evaluation: every method is evaluated on the same $N = 94$ conversations with identical compaction and budget settings\. We apply two\-sided Wilcoxon signed\-rank tests and paired bootstrap confidence intervals \(10,000 resamples, seed 42\) on per\-conversation recovery rates\.
Table 12: Paired statistical tests for fact recovery \(Experiment 1, $N = 94$\)\.

Key findings:

- Lantern vs\. MemGPT\-Faithful is significant at $p = 0.005$ with a small effect size \($d = 0.26$\)\. The 95% CI excludes zero, confirming a reliable advantage\.
- Lantern\-Rerank vs\. MemGPT\-Faithful is highly significant \($p < 0.001$, $d = 0.43$, medium effect\)\.
- The Lantern\-Rerank vs\. base Lantern gap \(\+1\.9 pp\) is *not* significant \($p = 0.10$\), confirming that the reranking improvement is modest and uncertain\.
- All methods significantly outperform Neural RAG and Summarization \($p < 0.001$, $d > 0.5$\)\.
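The paired bootstrap described at the start of this appendix can be sketched as follows. This is a percentile bootstrap over per-conversation differences using 10,000 resamples and seed 42 as stated; the percentile method, the function name `paired_bootstrap_ci`, and the synthetic recovery rates are assumptions for illustration, not the paper's code or data.

```python
import random
import statistics


def paired_bootstrap_ci(a, b, resamples=10_000, seed=42, level=0.95):
    """Percentile bootstrap CI for the mean paired difference a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample conversations with replacement and record the mean difference.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return statistics.fmean(diffs), (means[lo_idx], means[hi_idx])


# Synthetic per-conversation recovery rates (illustrative only, not the paper's data).
data_rng = random.Random(0)
method_b = [data_rng.uniform(0.5, 0.9) for _ in range(94)]
method_a = [min(1.0, r + data_rng.uniform(-0.05, 0.15)) for r in method_b]

mean_diff, (ci_low, ci_high) = paired_bootstrap_ci(method_a, method_b)
print(f"mean diff = {mean_diff:+.3f}, 95% CI [{ci_low:+.3f}, {ci_high:+.3f}]")
```

The two-sided Wilcoxon signed-rank test could be run on the same paired samples with `scipy.stats.wilcoxon(method_a, method_b)`; it is omitted here to keep the sketch dependency-free.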
## Appendix C Fact\-Type Breakdown
Table [13](https://arxiv.org/html/2606.05182#A3.T13) provides the complete per\-type recovery rates across the 13 labeled fact types identified in Experiment 4\. A small residual category \(unknown, used by the ground\-truth extractor when no confident type assignment could be made\) is excluded from this breakdown\.
Table 13:Full fact\-type stratification\. Recovery rate \(%\) per method across all identified fact types\.
## Appendix D Confidence\-Decay Ablation
To isolate the effect of the confidence\-decay reinforcement loop \(§[3](https://arxiv.org/html/2606.05182#S3)\), we simulate 8 successive compaction events per conversation by stepping the compaction point through the conversation at fractions $\{0.15, 0.25, \ldots, 0.85\}$ and invoking restore at each step\. This yields a store that accumulates entries over time and is repeatedly queried, which is the setting in which decay and pruning can matter\. For each conversation we evaluate two conditions under otherwise identical settings: *with* confidence\-decay reinforcement enabled, and *without* it \(boost/decay/prune disabled; all entries retained at equal confidence\)\. This produces 732 paired observations per condition across the 94 conversations \(Table [14](https://arxiv.org/html/2606.05182#A4.T14)\)\.
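The compaction schedule can be sketched as below. The eight fractions are those stated above; mapping each fraction to a turn index by flooring, and the name `compaction_points`, are assumptions for illustration.

```python
def compaction_points(n_turns: int) -> list[int]:
    """Turn indices at which compaction is simulated, one per fraction.

    Integer percentages (15%, 25%, ..., 85%) avoid floating-point drift.
    """
    percents = range(15, 86, 10)  # eight compaction events
    return [n_turns * p // 100 for p in percents]  # assumed: floor to a turn index


print(compaction_points(100))  # [15, 25, 35, 45, 55, 65, 75, 85]
print(compaction_points(40))   # [6, 10, 14, 18, 22, 26, 30, 34]
```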
Table 14: Confidence\-decay ablation\. Mean recovery across 8 compaction events\.

The overall effect is small \(\+1\.7 pp\) but significant \($p < 0.001$, Wilcoxon\)\. The benefit is concentrated in early compaction events \(events 2–4: $\Delta \approx +2.5$–$4.5$ pp\) when the store is accumulating rapidly and pruning stale entries has the most impact\. By event 5, the gap narrows to \+0\.9 pp as the store stabilizes\.
#### Interpretation\.
In the single\-session, single\-compaction setting that dominates our primary evaluation, confidence decay has negligible effect: all entries are relatively fresh and few have been queried enough times to differentiate\. The mechanism is designed for multi\-session or continuous\-conversation deployments where the store accumulates entries over hours or days\. We do not include confidence decay in the paper’s headline contributions because its single\-session effect is small\. However, we report it here for completeness, as practitioners building multi\-session systems may find the mechanism useful\.