@dair_ai: // Memory as a Model // The paper augments any LLM with a separate trained memory model that stores, retrieves, and int…


Summary

MeMo introduces a modular memory model that augments any LLM to store, retrieve, and integrate new knowledge without retraining or catastrophic forgetting. It outperforms RAG-based methods on benchmarks like BrowseComp-Plus, NarrativeQA, and MuSiQue.

// Memory as a Model //

The paper augments any LLM with a separate trained memory model that stores, retrieves, and integrates facts on its behalf.

It decouples memory updates from base-model weight updates. It achieves continual-learning robustness without catastrophic forgetting, which is a property that RAG fails to deliver.

A vector store is a database with a learned encoder bolted on. MeMo is a learned subsystem with explicit interfaces. That distinction matters, as agents need to be able to ingest fresh knowledge weekly without retraining or vector-DB churn.

At its core, the position here is that memory in agents should be modular, learned, and gated, not a context-window hack.

Paper: https://arxiv.org/abs/2605.15156

Learn to build effective AI agents in our academy: https://academy.dair.ai

Cached at: 05/20/26, 10:36 PM


MeMo: Memory as a Model

Source: https://arxiv.org/html/2605.15156

Ryan Wei Heng Quek (1,2,3,4), Sanghyuk Lee (5,6,7,*), Alfred Wei Lun Leong (4,8,*), Arun Verma (9,*,†), Alok Prakash (9), Nancy F. Chen (3), Bryan Kian Hsiang Low (1,2,4,9), Daniela Rus (7,9), Armando Solar-Lezama (7,9)

1 Institute of Data Science, National University of Singapore, Singapore
2 Integrative Sciences and Engineering Programme, NUSGS, Singapore
3 Agency for Science, Technology, Research (A*STAR), Singapore
4 Department of Computer Science, National University of Singapore, Singapore
5 University of Tokyo, Japan
6 Liquid AI, USA
7 CSAIL, Massachusetts Institute of Technology, USA
8 AI Singapore
9 Singapore-MIT Alliance for Research and Technology Centre, Singapore

Abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated Memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM’s weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks (kojima2023largelanguagemodelszeroshot; ArXiv23_zhao2023survey; survey-llms-code-generation). Despite their successes, these models are effectively frozen for extended periods after pretraining (xu2024knowledgeconflictsllmssurvey) until subsequent updates, causing their pretrained knowledge to become increasingly outdated as the world evolves. For applications that require up-to-date (cheng2024dateddatatracingknowledge; kasai2024realtimeqawhatsanswer) or domain-specific (singhal2022largelanguagemodelsencode; wu2023bloomberggptlargelanguagemodel) knowledge, this dependence on static knowledge presents a fundamental architectural limitation (lewis2021retrievalaugmentedgenerationknowledgeintensivenlp; kandpal2023largelanguagemodelsstruggle). Retraining is a natural solution but remains prohibitively expensive at modern scales (wu2022sustainable), motivating the need for an efficient mechanism to integrate new external knowledge into LLMs without full retraining.

Figure 1: Overview of the training and inference pipeline of MeMo. During Memory model training (left), a frozen Generator model transforms a target corpus into a reflection QA dataset via fact extraction, consolidation, verification, entity surfacing, and cross-document synthesis, which is then used to train a dedicated Memory model. During inference (right), the frozen Executive model answers complex user queries by querying the Memory model through a structured multi-turn protocol: it decomposes the input into simpler, targeted sub-queries, retrieves intermediate responses from the Memory model, and reasons over them to produce a final answer to the user’s query.

Existing methods for integrating new knowledge into LLMs fall into three categories. (1) Non-parametric methods retrieve relevant information from an external store at inference time via lexical (bm25), dense (nvembedv2), or graph-based retrievers (lewis2020retrieval; graphrag; gutierrez2024hipporag; gutierrez2025rag), before incorporating it through in-context learning (incontextlearning; dong-etal-2024-survey-in-context-learning). However, these methods are constrained by limited context windows and struggle to synthesize cross-document relationships when relevant information is distributed across multiple documents (tang2024multihop; lin2025optimizingmultihopdocumentretrieval). (2) Parametric methods internalize knowledge directly into model parameters via continual pretraining (ke2023continual) or fine-tuning (ouyang2022training; wang2023self; chung2024scaling) on the target corpus directly. While effective, they are computationally expensive, prone to catastrophic forgetting (luo2025empiricalstudycatastrophicforgetting), and tend to memorize training distributions rather than acquire transferable knowledge, limiting generalization to unseen queries (chu2025sft). (3) Latent memory methods (chevalier2023autocompressor; mu2023gist; ge2024icae; zhang2026memgen) compress knowledge into soft tokens or other model-specific representations, but suffer from representation coupling: the memory is tightly bound to the specific model used to produce these representations, limiting transferability across LLMs.

We introduce MeMo (Memory as a Model), a modular framework where a dedicated Memory model is trained on new knowledge, and an Executive model retrieves relevant information from the Memory model at inference time via targeted sub-queries and then reasons over the retrieved information to respond to user queries. MeMo combines the complementary strengths of the three paradigms above while mitigating their individual limitations. Like the non-parametric methods, it is able to leverage off-the-shelf frontier models unchanged by separating the memory from the reasoning model; it shares with the parametric methods the ability to internalize knowledge in model parameters, and it shares the benefits of a compact, queryable memory artifact with latent memory methods. As a result, MeMo offers the following advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting by keeping the Executive model parameters unchanged, (d) it does not require access to the Executive model’s weights or output logits, enabling plug-and-play integration with both open and proprietary LLMs, and (e) its retrieval cost is independent of corpus size at inference time due to the fixed size of the Memory model. However, designing MeMo to comprehensively capture cross-document relationships during training while accurately answering arbitrary queries at inference time introduces two key challenges, which we outline below and address with novel methods.

(1) Training the Memory model. A core challenge in the Memory model is ensuring it can accurately answer diverse, unseen queries at inference time, including those requiring cross-document reasoning and long-context understanding. A natural approach is to train directly on the raw corpus using standard data augmentation techniques such as paraphrasing (li2022data; chen2023empirical; allen2024physics), additional sampling of generated QA pairs (alberti2019synthetic; puri2020training), or targeted gap-filling, where the model identifies and completes missing knowledge from the corpus (feng2024don; jie2024self). However, these approaches fail to consolidate related facts into the compositional representations necessary for robust generalization to unseen queries (chu2025sft). With this challenge in mind, we design a novel five-step data synthesis pipeline guided by a Generator model that distills the corpus into a question–answer (QA) dataset of reflections: compositional representations that expose underlying corpus knowledge under diverse query variations (illustrated in Fig. 1 (left), with details in Section 4.1). We train the Memory model on the synthesized reflection QA dataset via supervised fine-tuning (see Section 4.2), enabling the Memory model to capture more complex, cross-document relationships and compositional structure than retrieval-based methods.

(2) Querying the Memory model. At inference time, complex or compositional queries often require multi-step reasoning and aggregation of information across multiple documents. Naively querying the Memory model via single-turn or unstructured multi-turn interactions fails to reliably retrieve the knowledge required to answer such queries. To address this, we design a three-stage inference pipeline in which the Executive model queries and retrieves information from the Memory model via a structured multi-turn protocol, decomposing complex user queries into targeted sub-queries that align with the shared reflection interface (illustrated in Fig. 1 (right), with more details in Section 4.4). Unlike retrieval-based methods, this approach incurs retrieval cost independent of corpus size and is robust to retrieval noise (see Section 5.2). Crucially, because MeMo treats the Executive model as a black box and does not access its weights, gradients, or output logits, it supports plug-and-play integration with any LLM, including both open and proprietary closed-source models.

Our method is guided by a single design principle: reflections, corpus-derived structures that require no knowledge of future queries, yet naturally serve as the precise interface through which any query can access the underlying corpus without ever observing it directly. During training, the Memory model internalizes these reflections; the Executive model retrieves relevant knowledge through targeted sub-queries at inference time. Building on the challenges outlined above and the methods proposed to address them, we summarize the key contributions of this paper as follows:

  • Novel data synthesis pipeline. We propose a five-step data synthesis pipeline that uses a Generator model, an LLM that may be the same as or smaller than the Executive model, to distill a target corpus into reflections, enabling a dedicated Memory model to internalize knowledge in compositional forms that capture more complex cross-document relationships and generalize robustly to diverse, unseen query variations at inference time (see Sections 4.1 and 4.2).
  • Structured multi-turn protocol. We introduce a structured multi-turn protocol that systematically decomposes complex queries into targeted sub-queries aligned with the shared reflection interface. The protocol supports plug-and-play integration with any arbitrary LLM, including proprietary closed-source LLMs, and has retrieval cost independent of corpus size (see Section 4.4).
  • Empirical validation. We evaluate MeMo on BrowseComp-Plus, NarrativeQA, and MuSiQue, demonstrating strong performance against both parametric and non-parametric baselines. We further empirically validate MeMo’s robustness to retrieval noise (see Section 5).

2 Related Work

Non-parametric methods. Non-parametric alternatives (bm25; nvembedv2; gutierrez2025rag) avoid parameter updates entirely, instead supplying new knowledge at inference time. In particular, in-context learning (ICL) (incontextlearning; dong-etal-2024-survey-in-context-learning) inserts relevant knowledge directly into the prompt, avoiding catastrophic forgetting. However, ICL scales poorly with increasing context length: the computational cost of autoregressive generation (vaswani2023attentionneed) leads to substantial token overhead and inference latency as the knowledge base grows (gelada2025scalingcontextrequiresrethinking), and even explicitly long-context models exhibit significant performance degradation as context length increases (liu2024lost; hsieh2024ruler). Retrieval-augmented generation (RAG) (lewis2020retrieval; graphrag; gutierrez2024hipporag; gutierrez2025rag) addresses this scalability bottleneck by selectively retrieving only the relevant chunks of knowledge at inference time. However, RAG systems are highly sensitive to retrieval noise (powerofnoise), where irrelevant or misleading passages substantially degrade generation quality (liu2026tacklinginherentdifficultynoise; zhang2026understanding). In addition, RAG systems often struggle to reason over complex cross-document dependencies (tang2024multihop), as they lack robust mechanisms for synthesizing information that is distributed across multiple chunks or a large corpus (lin2025optimizingmultihopdocumentretrieval).

Parametric methods. Existing post-training approaches, such as continual pretraining on new corpora (ke2023continual; sun2020ernie-cpt) or supervised fine-tuning (SFT) on curated instruction data (ouyang2022training; wang2023self; chung2024scaling), attempt to address this limitation by incorporating new knowledge into LLMs during post-training. While conceptually straightforward, these parametric methods often suffer from catastrophic forgetting, whereby adaptation to newly observed knowledge degrades previously acquired knowledge and learned capabilities (luo2025empiricalstudycatastrophicforgetting; learningwithoutforgetting; harmon2025mappingposttrainingforgettinglanguage) and erodes safety alignment learned during LLM post-training (qi2024fine). In addition, the scale of modern LLMs makes frequent fine-tuning computationally expensive (zhang2023dissecting; xia2024understanding), and fine-tuning is often infeasible for proprietary, closed-source models (manchanda2025opensourceadvantagelarge), substantially limiting the practicality of parametric methods in real-world, large-scale applications.

Latent memory methods. Another approach to storing knowledge is via compressed latent representations, which lie between non-parametric retrieval and fully parametric methods. Context compression techniques such as AutoCompressor (chevalier2023autocompressor), Gist tokens (mu2023gist), and ICAE (ge2024icae) encode knowledge into compact soft tokens prepended at inference, reducing ICL token overhead without discarding information. However, these representations are tightly coupled to the encoder and cannot be consumed by other model families, limiting compatibility with black-box LLMs. Similarly, recurrent-state models (gu2023mamba; sun2023retnet) and nearest-neighbor memory methods such as Memorizing Transformers (wu2022memorizing) and $k$NN-LM (khandelwal2020generalization) rely on model-specific representations or architectures, preventing post hoc use with pretrained LLMs. Although Memory Decoder (cao2025memory) is a plug-and-play pretrained memory module that integrates without modifying model parameters, it is limited to architectures sharing a common tokenizer, enabling reuse only within this subset. The core limitation of these methods is representation coupling: latent memory is inseparable from the model that produces it. In contrast, MeMo allows a plug-and-play integration with any LLM, including closed-source models.

Table 1: A comparison of desirable properties across different memory paradigms, showing that MeMo satisfies them through its modular memory construction and memory-augmented reasoning.

| Methods | Frozen base LLM | No retrieval index | Black-box compatible | No catastrophic forgetting | Constant-size memory | Cross-LLM transferable |
|---|---|---|---|---|---|---|
| Non-parametric (RAG, ICL) | ✓ | × | ✓ | ✓ | × | ✓ |
| Parametric (CPT, SFT) | × | ✓ | × | × | × | × |
| Latent memory (AutoCompressor, Gist, ICAE) | ✓ | ✓ | × | ✓ | ✓ | × |
| MeMo (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

3 Preliminaries

Problem setting. Let $\mathcal{M}_{\theta}$ denote a large language model with frozen parameters $\theta \in \mathbb{R}^{p}$, pretrained on a corpus $\mathcal{D}_{\text{pre}}$. We treat $\mathcal{M}_{\theta}$ as a conditional distribution that maps a prompt $x$ to a response $\mathcal{M}_{\theta}(x)$, and assume only black-box access; in particular, $\mathcal{M}_{\theta}$ may be either a white-box model or a closed-source model accessed via API. Let $\mathcal{D} = \{d_{1}, \ldots, d_{N}\}$ denote a target corpus of $N$ documents containing knowledge that $\mathcal{M}_{\theta}$ cannot reliably recall [1]. Let $\mathcal{Q}$ be a set of queries, each $q \in \mathcal{Q}$ associated with a ground-truth answer $a^{\star}(q)$ and a set of supporting documents $\mathcal{S}(q) \subseteq \mathcal{D}$. Note that $\mathcal{S}(q)$ is a theoretical construct used to characterize query complexity.

[1] We do not assume $\mathcal{D}$ is disjoint from $\mathcal{D}_{\text{pre}}$, as training data is rarely disclosed by model providers. A document is considered effectively absent from $\mathcal{M}_{\theta}$’s knowledge if the model fails to answer questions grounded in it, either because it never appeared in $\mathcal{D}_{\text{pre}}$ or because the training process was insufficient to retain it. For more information, refer to Appendix I.

Knowledge integration mechanism. A knowledge integration mechanism is a pair $(\Phi, f)$, where $\Phi$ maps the corpus to a representation $\mathcal{K} \doteq \Phi(\mathcal{D})$ and $f$ combines $\mathcal{K}$ with $\mathcal{M}_{\theta}$ at inference to produce responses $f(\mathcal{M}_{\theta}, \mathcal{K}, q)$. We formalize the goal as follows.

Definition 1 (Knowledge Integration Problem).

Given a frozen model $\mathcal{M}_{\theta}$ and target corpus $\mathcal{D}$, find a mechanism $(\Phi, f)$ such that, without modifying $\theta$, for all $q \in \mathcal{Q}$,

$$\mathbb{P}\left\{ f(\mathcal{M}_{\theta}, \Phi(\mathcal{D}), q) = a^{\star}(q) \right\} = 1.$$

Existing approaches. Existing methods differ in their choice of $(\Phi, f)$. ICL sets $\mathcal{K} = \mathcal{D}$ and $f(\mathcal{M}_{\theta}, \mathcal{K}, q) = \mathcal{M}_{\theta}([\mathcal{D}; q])$, i.e., appending the corpus directly to the prompt. RAG constructs $\mathcal{K}$ as a retrieval index and defines $f$ to retrieve a subset $\hat{\mathcal{S}} \subseteq \mathcal{D}$ before passing $[\hat{\mathcal{S}}; q]$ to $\mathcal{M}_{\theta}$. Fine-tuning sets $\mathcal{K} = \emptyset$ and $f = \mathcal{M}_{\theta^{\prime}}$, where $\theta^{\prime}$ is obtained by updating $\theta$ on $\mathcal{D}$. In contrast, MeMo defines $\mathcal{K}$ as the parameters of a small, dedicated Memory model $\mathcal{M}_{\varphi}$ with $\varphi \ll \theta$, trained on a reflection QA dataset derived from $\mathcal{D}$, and queried by a frozen Executive model $\mathcal{M}_{\theta}$ at inference time. Table 1 summarizes how these paradigms compare across desirable properties.
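To make the $(\Phi, f)$ view concrete, the following is a minimal Python sketch of three of these mechanisms over a black-box model. It is illustrative only: the function names, the word-overlap retriever, and the single memory call in `memo_answer` are assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # black-box model: prompt in, response out


def words(text: str) -> set:
    """Lowercased word set, used by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))


def icl_answer(llm: LLM, corpus: List[str], query: str) -> str:
    """ICL: K = D, so the whole corpus is placed in the prompt."""
    return llm("\n".join(corpus + [query]))


def rag_answer(llm: LLM, corpus: List[str], query: str, top_k: int = 1) -> str:
    """RAG: retrieve a subset S_hat of D, then prompt with [S_hat; q].

    Word-overlap ranking is a stand-in for a real lexical or dense retriever.
    """
    ranked = sorted(corpus, key=lambda d: len(words(d) & words(query)), reverse=True)
    return llm("\n".join(ranked[:top_k] + [query]))


def memo_answer(llm: LLM, memory: LLM, query: str) -> str:
    """MeMo-style: K lives in a second model; the executive only sees its reply.

    A single memory call is a simplification of the multi-turn protocol.
    """
    return llm(memory(query) + "\n" + query)
```

Note that in the last function the prompt length depends only on the memory model's reply, not on the size of the corpus.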

4 MeMo: Memory as a Model

MeMo addresses the knowledge integration problem (Def. 1) through two components: a frozen model $\mathcal{M}_{\theta}$ (Executive model), which handles reasoning and responds to user queries, and a Memory model $\mathcal{M}_{\varphi}$, which is trained to encode knowledge in its parameters from a target corpus $\mathcal{D}$. Our pipeline operates in two phases: (i) a training phase that constructs the Memory model from $\mathcal{D}$, and (ii) an inference phase in which the Executive model queries and retrieves information from the Memory model to answer knowledge-intensive questions (see Sections 4.1, 4.2 and 4.4).

4.1 Data Synthesis Pipeline

Given a corpus of documents $\mathcal{D}$, our objective in the data generation process is to construct a reflection QA dataset $\mathcal{Q}_{\text{final}}$ that captures both single-document facts and cross-document relationships. This process is driven by a Generator model $\mathcal{M}_{\text{gen}}$ and proceeds through five steps, as summarized in Algorithm 1 and illustrated in Fig. 1: (1) fact extraction from raw documents, (2) consolidation of redundant or overlapping information, (3) verification and rewriting to ensure correctness and clarity, (4) entity surfacing to explicitly represent key entities, and (5) cross-document synthesis to integrate evidence across the corpus. Importantly, no document identifiers or watermarks are embedded in the generated QA pairs at any step, preventing the Memory model from exploiting shortcut signals during evaluation.

Algorithm 1: Reflection QA Dataset Generation Pipeline from Target Corpus

Require: Corpus $\mathcal{D}$, generator $\mathcal{M}_{\text{gen}}$, document groups $\mathcal{G} = \{G_{1}, \ldots, G_{k}\}$ with $G_{i} \subseteq \mathcal{D}$

$\mathcal{Q}_{\text{final}} \leftarrow \emptyset$
for all document $d \in \mathcal{D}$ do
    $C \leftarrow \mathrm{Chunk}(d)$    ⊳ Segment into chunks
    $\mathcal{Q}_{\text{ver}}^{d} \leftarrow \emptyset$
    for all chunk $c \in C$ do
        $\mathcal{Q}_{\text{dir}}, \mathcal{Q}_{\text{indir}} \leftarrow \mathcal{M}_{\text{gen}}(c)$    ⊳ Step 1: Direct and indirect extraction
        $\mathcal{Q}_{\text{raw}} \leftarrow \mathcal{Q}_{\text{dir}} \cup \mathcal{Q}_{\text{indir}}$    ⊳ Step 2a: Merge direct and indirect
        $\mathcal{Q}_{\text{mrg}} \leftarrow \mathcal{M}_{\text{gen}}(\mathcal{Q}_{\text{raw}})$    ⊳ Step 2b: Consolidate related pairs
        $\mathcal{Q}_{\text{con}} \leftarrow \mathcal{Q}_{\text{raw}} \cup \mathcal{Q}_{\text{mrg}}$    ⊳ Step 2c: Full merge set
        $\mathcal{Q}_{\text{ver}} \leftarrow \mathcal{M}_{\text{gen}}(\mathcal{Q}_{\text{con}}, c)$    ⊳ Step 3: Verify self-containment; rewrite or discard
        $\mathcal{Q}_{\text{ver}}^{d} \leftarrow \mathcal{Q}_{\text{ver}}^{d} \cup \mathcal{Q}_{\text{ver}}$
    end for
    $\mathcal{Q}_{\text{ent}}^{d} \leftarrow \mathcal{M}_{\text{gen}}(\mathcal{Q}_{\text{ver}}^{d})$    ⊳ Step 4: Entity-surfacing pairs
    $\mathcal{Q}_{\text{final}} \leftarrow \mathcal{Q}_{\text{final}} \cup \mathcal{Q}_{\text{ver}}^{d} \cup \mathcal{Q}_{\text{ent}}^{d}$
end for
for all $G_{i} \in \mathcal{G}$ do
    $\mathcal{Q}_{\text{cross}} \leftarrow \mathcal{M}_{\text{gen}}\Bigl(\bigcup_{d \in G_{i}} \bigl(\mathcal{Q}_{\text{ver}}^{d} \cup \mathcal{Q}_{\text{ent}}^{d}\bigr)\Bigr)$    ⊳ Step 5: Cross-document synthesis
    $\mathcal{Q}_{\text{final}} \leftarrow \mathcal{Q}_{\text{final}} \cup \mathcal{Q}_{\text{cross}}$
end for
return $\mathcal{Q}_{\text{final}}$
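The control flow of Algorithm 1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the generator interface (`extract`, `consolidate`, `verify`, `surface_entities`, `synthesize`), the `chunk` callable, and `StubGenerator` are hypothetical names introduced here, and lists stand in for the set unions in the pseudocode, so duplicates are not removed.

```python
from typing import Callable, Dict, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def build_reflection_dataset(
    corpus: Dict[str, str],           # doc id -> text; ids are bookkeeping only
    groups: List[List[str]],          # pre-defined groups of doc ids
    gen,                              # hypothetical generator interface
    chunk: Callable[[str], List[str]],
) -> List[QA]:
    q_final: List[QA] = []
    per_doc: Dict[str, List[QA]] = {}
    for doc_id, text in corpus.items():
        q_ver_d: List[QA] = []
        for c in chunk(text):
            q_dir, q_indir = gen.extract(c)        # Step 1: direct + indirect
            q_raw = q_dir + q_indir                # Step 2a
            q_mrg = gen.consolidate(q_raw)         # Step 2b
            q_con = q_raw + q_mrg                  # Step 2c
            q_ver_d += gen.verify(q_con, c)        # Step 3: rewrite or discard
        q_ent_d = gen.surface_entities(q_ver_d)    # Step 4
        per_doc[doc_id] = q_ver_d + q_ent_d
        q_final += per_doc[doc_id]
    for group in groups:
        pooled = [qa for doc_id in group for qa in per_doc[doc_id]]
        q_final += gen.synthesize(pooled)          # Step 5
    return q_final


class StubGenerator:
    """Toy stand-in for the Generator model; a real one would prompt an LLM."""

    def extract(self, c: str):
        direct = [(f"What is stated in '{c}'?", c)]
        indirect = [(f"What follows from '{c}'?", f"inference from {c}")]
        return direct, indirect

    def consolidate(self, raw: List[QA]) -> List[QA]:
        return [("; ".join(q for q, _ in raw), "; ".join(a for _, a in raw))]

    def verify(self, con: List[QA], c: str) -> List[QA]:
        return list(con)  # stub: treat every pair as self-contained

    def surface_entities(self, ver_d: List[QA]) -> List[QA]:
        return [("Which entity do these facts describe?", "entity")] if ver_d else []

    def synthesize(self, pooled: List[QA]) -> List[QA]:
        return [("What links these documents?", "shared entity")] if pooled else []
```

Document ids are used only to look up per-document pairs for Step 5 and never appear in the generated questions or answers, in line with the requirement that no document identifiers are embedded in the QA pairs.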

Step 1: Fact extraction. Each document $d \in \mathcal{D}$ is segmented into chunks $C$, where each chunk corresponds either to an entire document or to a contiguous segment of a longer document. For each chunk, $\mathcal{M}_{\text{gen}}$ performs two parallel extraction processes: direct extraction, which captures explicitly stated facts (producing $\mathcal{Q}_{\text{dir}}$), and indirect extraction, which targets inferred or synthesized information beyond the surface text (producing $\mathcal{Q}_{\text{indir}}$). This dual extraction process ensures that both factual recall and inferential reasoning are represented in the training signal for the Memory model.

Step 2: Consolidation. The Generator model $\mathcal{M}_{\text{gen}}$ consolidates $\mathcal{Q}_{\text{dir}} \cup \mathcal{Q}_{\text{indir}}$ by identifying QA pairs that share a common underlying context (such as entity, time period, or relationship type) and combining them into QA pairs that encompass multiple facts, denoted $\mathcal{Q}_{\text{mrg}}$. This merging process produces training instances that require integrating multiple facts within the same contextual chunk, going beyond single-fact question answering pairs. The synthesized QA pairs are subsequently unified with the original sets to form the consolidated dataset $\mathcal{Q}_{\text{con}} = \mathcal{Q}_{\text{dir}} \cup \mathcal{Q}_{\text{indir}} \cup \mathcal{Q}_{\text{mrg}}$.

Step 3: Verification and rewriting. Each QA pair in $\mathcal{Q}_{\text{con}}$ is evaluated for self-containment by $\mathcal{M}_{\text{gen}}$, i.e., whether it can be fully understood and correctly answered in isolation, without access to the source chunk. Common failure modes include unresolved pronouns (e.g., “What did they propose?”) and implicit references (e.g., “As noted in the above table…”). Non-self-contained QA pairs are rewritten by $\mathcal{M}_{\text{gen}}$ using the source chunk $c$ as context; QA pairs that remain ambiguous after rewriting are discarded. This check-and-rewrite procedure yields the verified set $\mathcal{Q}_{\text{ver}}$, a set of QA pairs that can be used as training examples without access to the source chunk.

Step 4: Entity surfacing. For each named entity in $\mathcal{Q}_{\text{ver}}$, $\mathcal{M}_{\text{gen}}$ generates a set of entity-surfacing QA pairs in which the question encodes the entity’s attributes and relationships (including connections to other named entities) and the answer reveals its identity. Facts about each entity are aggregated across all QA pairs within the chunk prior to generation, enabling the integration and composition of information from multiple source pairs. Questions are generated at varying levels of complexity, ranging from single-fact to multi-fact queries. These pairs, denoted $\mathcal{Q}_{\text{ent}}$, aim to mitigate the reversal curse (berglund2023reversal; allen2023physics32) by training the Memory model to infer entities from indirect or partially specified descriptions. This capability supports the entity identification turn at inference time (Section 4.4).

Step 5: Cross-document synthesis. The final step operates over pre-defined document groups $\mathcal{G} = \{G_{1}, \ldots, G_{k}\}$, where chunks within each group $G_{i}$ are topically related. Such groups arise naturally, for example, when a large document is segmented into chunks (forming a single group) or from human-provided labels. For each group $G_{i}$, $\mathcal{M}_{\text{gen}}$ is provided with the entity-surfacing pairs $\{\mathcal{Q}_{\text{ent}}^{d} : d \in G_{i}\}$ from all member documents and identifies two types of cross-document connections:

  • Converging clues: multiple documents provide complementary facts about the same entity, which together enable its identification.
  • Parallel properties: different entities across documents share a common attribute or role, enabling comparative and analogical reasoning.

Both types yield QA pairs with support size $s(q) > 1$ (Section 3), directly targeting the cross-document synthesis objective. The final dataset is $\mathcal{Q}_{\text{final}} = \mathcal{Q}_{\text{ver}} \cup \mathcal{Q}_{\text{ent}} \cup \mathcal{Q}_{\text{cross}}$, which collectively captures self-contained, entity-centric, and cross-document reflections for training the Memory model.

Ablations of the pipeline design are presented in Appendix E.

4.2 Training the Memory model

Given $\mathcal{Q}_{\text{final}}$, the Memory model is trained via supervised fine-tuning to map questions directly to answers without access to source documents at inference time. The Memory model is initialized from a small pretrained language model, substantially smaller than the Executive model (e.g., 1.5B vs. 32B parameters), and optimized by minimizing the next-token prediction loss over answer tokens only:

$$\mathcal{L}(\varphi) = -\sum_{(q_{i}, a_{i}) \in \mathcal{Q}_{\text{final}}} \sum_{t=1}^{|a_{i}|} \log \mathcal{M}_{\varphi}\left(a_{i}^{(t)} \,\middle|\, q_{i}, a_{i}^{(1:t-1)}\right).$$

Conditioning only on the question and preceding answer tokens, and never on source documents, forces the Memory model to internalize knowledge parametrically rather than rely on copying from retrieved context. This constitutes a key distinction from RAG-based readers: at inference time, the Memory model generates answers solely from its internalized parametric knowledge, without access to any external corpus. Further details on hyperparameter choices and training paradigms (full SFT vs. LoRA) are provided in Appendix F and Appendix O, respectively.
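As an illustration of what "loss over answer tokens only" means in practice, here is a small pure-Python sketch for a single QA pair. The helper names, the `-100` ignore value (borrowed from a common deep-learning convention), and the assumption that `logits[t]` is already aligned to predict `labels[t]` are choices made for this example; the paper's actual training code is not shown in this section.

```python
import math
from typing import List, Sequence, Tuple

IGNORE = -100  # assumed marker for positions excluded from the loss


def build_labels(question_ids: List[int], answer_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate question and answer tokens; mask the question positions."""
    input_ids = question_ids + answer_ids
    labels = [IGNORE] * len(question_ids) + answer_ids
    return input_ids, labels


def masked_nll(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Sum of -log p(label) over unmasked positions.

    Assumes logits[t] is the score vector used to predict labels[t],
    i.e. the usual one-step shift has already been applied.
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        if y == IGNORE:
            continue  # question tokens contribute nothing to the loss
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
        total += log_z - scores[y]
    return total
```

With uniform scores over a vocabulary of size $V$, each answer token contributes $\log V$, so a two-token answer yields $2 \log V$ regardless of how long the question is.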

4.3 Continual Knowledge Integration via Model Merging

A practical desideratum of any knowledge integration system is the ability to incorporate new corpora incrementally without retraining on or rebuilding from all previously ingested sources. For parametric models, integrating new knowledge typically requires retraining on the union of all observed corpora, a cost that grows prohibitively with the number of sources. In contrast, non-parametric systems such as knowledge graphs and vector databases support efficient incremental updates. We explore model merging (yang2024model) as an approach to close this gap for parametric models. Model merging aims to preserve knowledge from multiple sources without requiring joint training on their union, by combining $K$ Memory models, each trained independently on a distinct corpus, into a single model.

Continual knowledge integration. Let $\{\mathcal{D}_{1}, \dots, \mathcal{D}_{K}\}$ be a collection of pairwise disjoint target corpora. For each corpus $\mathcal{D}_{i}$, we generate a reflection QA dataset $\mathcal{Q}^{(i)}_{\text{final}}$ (Section 4.1) and train a corresponding Memory model $\mathcal{M}_{\varphi_{i}}$ via SFT (Section 4.2), initializing all $K$ models from the same pretrained base $\mathcal{M}_{\varphi_{0}}$. We define the task vector for $\mathcal{D}_{i}$ as $\tau_{i} = \varphi_{i} - \varphi_{0}$, capturing the parametric shift induced by training on $\mathcal{D}_{i}$ alone. The merged Memory model is then obtained as

$$\varphi_{\text{merged}} = \mathrm{Merge}\bigl(\varphi_{0}, \{\tau_{i}\}_{i=1}^{K}; \Theta\bigr),$$

where $\Theta$ denotes method-specific hyperparameters (e.g., merging coefficients, sparsification densities). We discuss alternative merging methods and their respective limitations in Appendix H.
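One simple way to instantiate $\mathrm{Merge}$ is a scaled sum of task vectors, $\varphi_{0} + \lambda \sum_{i} \tau_{i}$, often called task arithmetic. The sketch below uses that rule purely as an example; the merging method actually used in the paper is deferred to its appendix, and the parameter dictionaries, function names, and the single coefficient `coeff` (playing the role of $\Theta$) are assumptions for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flat list of weights


def task_vector(phi_i: Params, phi_0: Params) -> Params:
    """tau_i = phi_i - phi_0, computed per parameter tensor."""
    return {k: [a - b for a, b in zip(phi_i[k], phi_0[k])] for k in phi_0}


def merge(phi_0: Params, task_vectors: List[Params], coeff: float = 1.0) -> Params:
    """Assumed merge rule: phi_0 + coeff * sum_i tau_i (plain task arithmetic)."""
    merged = {k: list(v) for k, v in phi_0.items()}  # copy so the base is untouched
    for tau in task_vectors:
        for k, delta in tau.items():
            merged[k] = [m + coeff * d for m, d in zip(merged[k], delta)]
    return merged
```

Under this rule, adding a new corpus only requires training one more Memory model from the shared base and folding its task vector into the merge; sparsification-based methods would add a pruning step before the sum.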

4.4 Inference-Time Integration

At inference time, the Executive model queries and retrieves information from the Memory model through a structured multi-turn protocol, with the Executive model treating the Memory model as an external knowledge oracle. The pipeline has three sequential stages, each designed to progressively improve the likelihood of producing a correct final answer, as illustrated in Fig. 1 (right). Each stage utilizes distinct prompts, sampling temperatures and independent budgets to control the number of interactions between the Executive model and the Memory model.

Stage 1: Grounding. Given a query $q$, Executive model decomposes it into a set of atomic, clue-probing sub-questions $\{q_{1}^{\prime},\ldots,q_{K}^{\prime}\}$, where each sub-question targets a single identifying constraint in $q$, and $K$ is adaptively determined by Executive model. The Memory model answers each sub-question independently, without shared context, producing grounding responses $\{m_{1},\ldots,m_{K}\}$. These responses draw on Memory model's parametric knowledge to provide additional contextual grounding for subsequent interactions in the later stages.

Stage 2: Entity identification. Using the grounding responses as context, Executive model iteratively narrows a set of candidate entities by issuing targeted follow-up sub-queries to Memory model across multiple interactions. This process continues until Executive model converges on a single entity $e^{\star}$ or the stage budget is exhausted. If no candidates are identified, Stage 3 is skipped and Executive model synthesizes a final answer from the grounding responses alone. This stage leverages Memory model's training on the entity-surfacing QA pairs $\mathcal{Q}_{\text{ent}}$ (Section 4.1).

Stage 3: Answer seeking and synthesis. Conditioned on the identified entity $e^{\star}$, Executive model queries Memory model for additional supporting facts through targeted follow-up questions. Once sufficient evidence is gathered, or the stage budget is exhausted, Executive model synthesizes the accumulated responses into a final answer:

$$\hat{a} = \mathcal{M}_{\theta}\bigl(q,\, \{m_{k}\}_{k=1}^{K},\, e^{\star},\, m_{\text{seek}}\bigr).$$

Notably, the Memory model responses $m_{k}$ and $m_{\text{seek}}$ are compact natural-language snippets whose lengths are independent of the corpus size, ensuring constant-time inference. As all interactions with $\mathcal{M}_{\theta}$ occur through its input–output interface, MeMo remains fully compatible with black-box Executive models, including proprietary APIs, without requiring access to internal parameters. For full implementation details, refer to Appendix J and the supplementary materials.
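The control flow of the three stages can be sketched as follows. This is a hedged illustration of the protocol as described in this section, not the authors' implementation: the `ToyExecutive` class, its method names, the `run_protocol` function, and the budget defaults are all hypothetical, and both models are replaced by rule-based stubs so the example runs without any LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyExecutive:
    """Hypothetical rule-based stand-in for a black-box Executive model."""

    def decompose(self, query: str) -> List[str]:
        # Stage 1: emit atomic, clue-probing sub-questions.
        return ["Which novel features a white whale?"]

    def entity_step(self, query: str, context: List[str]) -> Tuple[Optional[str], Optional[str]]:
        # Stage 2: return (entity, next_question); entity is set once converged.
        for snippet in context:
            if "Moby-Dick" in snippet:
                return "Moby-Dick", None
        return None, "Name the novel about a white whale."

    def seek_step(self, query: str, entity: str, evidence: List[str]) -> Optional[str]:
        # Stage 3: return a follow-up question, or None once evidence suffices.
        return None if evidence else f"Who narrates {entity}?"

    def synthesize(self, query: str, grounding: List[str],
                   entity: Optional[str], evidence: List[str]) -> str:
        if evidence:
            return evidence[-1]
        return grounding[-1] if grounding else "unknown"


def run_protocol(query: str, executive: ToyExecutive, ask_memory: Callable[[str], str],
                 entity_budget: int = 3, seek_budget: int = 3) -> str:
    # Stage 1: each sub-question is answered independently, without shared context.
    grounding = [ask_memory(q) for q in executive.decompose(query)]

    # Stage 2: narrow candidates until one entity remains or the budget runs out.
    context = list(grounding)
    entity: Optional[str] = None
    for _ in range(entity_budget):
        entity, question = executive.entity_step(query, context)
        if entity is not None or question is None:
            break
        context.append(ask_memory(question))
    if entity is None:
        # No entity identified: skip Stage 3 and answer from grounding alone.
        return executive.synthesize(query, grounding, None, [])

    # Stage 3: gather supporting facts, then synthesize the final answer.
    evidence: List[str] = []
    for _ in range(seek_budget):
        question = executive.seek_step(query, entity, evidence)
        if question is None:
            break
        evidence.append(ask_memory(question))
    return executive.synthesize(query, grounding, entity, evidence)


# Toy Memory model: a lookup table that returns short natural-language snippets.
memory: Dict[str, str] = {
    "Which novel features a white whale?": "Moby-Dick features a white whale.",
    "Who narrates Moby-Dick?": "Ishmael narrates Moby-Dick.",
}
answer = run_protocol(
    "Who narrates the novel about a white whale?",
    ToyExecutive(),
    lambda q: memory.get(q, "I do not know."),
)
print(answer)  # Ishmael narrates Moby-Dick.
```

Only strings cross the boundary between the two models, which is why the Executive side can be a proprietary API: nothing here needs access to weights or hidden states.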

5 Experiments

Datasets. We evaluate MeMo on three knowledge-intensive benchmarks. BrowseComp-Plus (chen2025browsecompplusfairtransparentevaluation) is a deep-research benchmark requiring multi-hop, multi-document retrieval and reasoning; we filter non-English instances with LangDetect (danilak2021langdetect), sample 300 questions, and pair each question's evidence documents with an equal number of negative documents (footnote 2), yielding 3,541 documents in total. NarrativeQA (kovcisky2018narrativeqa) tests discourse understanding over long documents such as books and movie scripts; we use 293 questions across 10 documents (footnote 3). MuSiQue (trivedi2022musiquemultihopquestionssinglehop) requires composing 2–4 reasoning steps across multiple Wikipedia paragraphs; we use 1,000 questions and construct the target corpus following the same procedure as for BrowseComp-Plus, yielding 5,296 documents. Further details are in Appendix D; datasets and code are in the supplementary materials.

Footnote 2: BrowseComp-Plus and MuSiQue provide annotations of gold (correct), evidence (supporting), and negative (distractor) documents. Gold documents are a subset of the evidence documents.

Footnote 3: We follow HippoRAG2 and evaluate on 10 such documents from the NarrativeQA validation split (294 questions); one duplicate is removed for consistency.

Baselines. We compare MeMo against four baselines: BM25 (bm25) (lexical retrieval), NV-Embed-V2 (nvembedv2) (dense retrieval), HippoRAG2 (gutierrez2025rag) (graph-based RAG, state-of-the-art), and Cartridges (eyuboglu2025cartridges) (a trained KV-cache loaded onto Executive model at inference; the closest existing parametric baseline to MeMo). Newer methods exist (chevalier2023autocompressor; cao2025memorydecoderpretrainedplugandplay) but typically require white-box access to Executive model and are therefore not directly comparable. We additionally include Perfect Retrieval as an empirical upper bound, where Executive model receives exclusively the evidence documents in context (incontextlearning). Retrieval baselines use top-$k$ retrieval with $k=9$ and adaptive backoff: reducing $k$ progressively until the retrieved context fits Executive model's context window.
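The adaptive backoff used for the retrieval baselines can be sketched as below. The text does not specify the step size or the tokenizer, so this sketch assumes that $k$ is decremented by one at a time and that tokens are counted by whitespace splitting; `fit_top_k` and its parameters are hypothetical names.

```python
from typing import Callable, List


def fit_top_k(
    ranked_docs: List[str],
    max_tokens: int,
    k: int = 9,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Return the largest prefix of at most k ranked documents that fits the budget.

    Assumption: k shrinks by one per step; the real step size is not given.
    """
    k = min(k, len(ranked_docs))
    while k > 0:
        selected = ranked_docs[:k]
        if sum(count_tokens(doc) for doc in selected) <= max_tokens:
            return selected
        k -= 1  # back off: drop the lowest-ranked document and retry
    return []


docs = ["a b c", "d e", "f g h i"]  # already sorted by retrieval score
print(fit_top_k(docs, max_tokens=5))  # ['a b c', 'd e']
```

Dropping from the tail keeps the highest-ranked documents, which matches the intent of a top-$k$ cut-off.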

Implementation and evaluation.

(a) Data generation. We use Qwen2.5-32B-Instruct (qwen2025qwen25technicalreport) as the Generator model, served via vLLM (kwon2023efficientmemorymanagementlarge) with YaRN RoPE scaling (su2023roformerenhancedtransformerrotary) to support a 131K-token context window.

(b) Training. We train Memory model initialized from Qwen2.5-14B-Instruct for 3 epochs with fused AdamW (loshchilov2017decoupled) and DeepSpeed 2 (rajbhandari2020zero) at learning rate $2\times10^{-5}$; full hyperparameters are in Appendix F.

(c) Evaluation. We instantiate Executive model with either Qwen2.5-32B-Instruct or Gemini-3-Flash (google2025gemini3flash) to evaluate the same trained Memory model across models of varying reasoning capability; both models have minimal prior knowledge of the evaluation datasets (Appendix I). Executive model queries Memory model through the multi-turn protocol described in Section 4.4. We report binary accuracy judged by Gemini-2.5-Flash-Lite (comanici2025gemini25pushingfrontier) via DeepEval (Ip_deepeval_2026), as mean ± standard deviation over three runs for Qwen2.5-32B-Instruct and a single run for Gemini-3-Flash.

(d) Continual integration. For the model-merging experiment (Section 5.5), we partition NarrativeQA into two pairwise-disjoint subsets ($K=2$, ∼640k QA pairs each), SFT a separate Qwen2.5-14B-Instruct Memory model on each, and sweep six merging methods at three densities (14 configurations total).
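The reporting convention in (c), binary accuracy per run followed by mean ± standard deviation over three runs, can be illustrated with a short sketch. The judgments below are made-up toy values rather than results from the paper, and the use of the sample standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev
from typing import List


def accuracy(judgments: List[bool]) -> float:
    """Binary accuracy in percent: share of answers the judge marked correct."""
    return 100.0 * sum(judgments) / len(judgments)


# Toy judge outputs for three independent runs over the same four questions.
runs = [
    [True, False, True, True],
    [True, False, False, True],
    [True, True, True, True],
]
per_run = [accuracy(run) for run in runs]
summary = f"{mean(per_run):.2f} ± {stdev(per_run):.2f}"
print(per_run)  # [75.0, 50.0, 100.0]
print(summary)  # 75.00 ± 25.00
```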

5.1 Experimental results

MeMo achieves strong performance across benchmarks. As shown in Table 2, MeMo consistently outperforms all baselines on NarrativeQA and MuSiQue across both Executive models. On NarrativeQA, the most challenging benchmark (Appendix I), MeMo achieves 26.85% with Qwen2.5-32B-Instruct and 53.58% with Gemini-3-Flash, substantially surpassing all baselines. This is notable: NarrativeQA requires reasoning over long passages with complex connections, where retrieval-based methods are constrained by context windows and struggle to synthesize information across long documents; MeMo instead captures these connections via reflections during training and retrieves them through its multi-turn protocol at inference. The same trend holds on MuSiQue, where MeMo achieves 48.30% and 60.20% respectively, outperforming baselines that struggle with multi-hop reasoning across independently retrieved passages. On BrowseComp-Plus, MeMo leads with Gemini-3-Flash (66.67%) and remains competitive with Qwen2.5-32B-Instruct (54.22%, narrowly trailing HippoRAG2's 56.11%). This gap reflects BrowseComp-Plus's nature: its answers are absent from Executive model's parametric knowledge (Appendix I), making direct access to evidence documents especially valuable and favoring retrieval methods that pass raw documents to Executive model.

Table 2: Accuracy (%) on BrowseComp-Plus, NarrativeQA, and MuSiQue under two Executive models: Qwen2.5-32B-Instruct (Qwen2.5-32B-I) and Gemini-3-Flash (Gemini-3-F). Bold values indicate the best result in each column, excluding Perfect Retrieval. MeMo uses Qwen2.5-14B-Instruct as Memory model, and results are reported at the best training epoch. ⋆ Perfect Retrieval represents an empirical upper bound.

| Method | BrowseComp-Plus, Qwen2.5-32B-I | BrowseComp-Plus, Gemini-3-F | NarrativeQA, Qwen2.5-32B-I | NarrativeQA, Gemini-3-F | MuSiQue, Qwen2.5-32B-I | MuSiQue, Gemini-3-F |
|---|---|---|---|---|---|---|
| Perfect Retrieval⋆ | 79.67 ± 1.45 | 88.33 | 51.42 ± 0.52 | 60.41 | 62.83 ± 0.90 | 73.00 |
| BM25 | 1.11 ± 0.69 | 27.00 | 10.24 ± 0.34 | 14.33 | 20.00 ± 0.30 | 23.20 |
| NV-Embed-V2 | 50.67 ± 0.33 | 57.00 | 20.59 ± 0.86 | 26.62 | 37.47 ± 0.15 | 46.60 |
| HippoRAG2 (footnote 4) | **56.11 ± 0.51** | 66.33 | 21.39 ± 0.20 | 23.21 | 42.17 ± 0.12 | 57.00 |
| Cartridges (footnote 5) | 0.00 ± 0.00 | - | 3.75 ± 0.11 | - | 8.57 ± 0.40 | - |
| MeMo | 54.22 ± 0.84 | **66.67** | **26.85 ± 0.39** | **53.58** | **48.30 ± 1.25** | **60.20** |

Footnote 4: These results differ from the original paper (gutierrez2025rag), which uses Llama3.3-70B-Instruct instead of Qwen2.5-32B-Instruct.

Footnote 5: Cartridges requires white-box access to Executive model as well; its results for Gemini-3-Flash are therefore omitted.

MeMo supports plug-and-play integration. Across the three benchmarks, MeMo consistently achieves higher performance when paired with a more capable Executive model (Gemini-3-Flash): switching from Qwen2.5-32B-Instruct to Gemini-3-Flash yields gains of 12.45, 26.73, and 11.90 pp on BrowseComp-Plus, NarrativeQA, and MuSiQue respectively. This demonstrates that MeMo can be trained once with a weaker Generator model and seamlessly paired with any LLM at inference, including proprietary models such as Gemini-3-Flash. This plug-and-play capability allows MeMo to directly leverage state-of-the-art models without any additional training or overhead.
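The reported gains follow directly from the MeMo row of Table 2. The short check below recomputes them as differences in percentage points; the dictionary layout and variable names are illustrative only.

```python
# MeMo accuracy (%) from Table 2: (Qwen2.5-32B-Instruct, Gemini-3-Flash).
memo_accuracy = {
    "BrowseComp-Plus": (54.22, 66.67),
    "NarrativeQA": (26.85, 53.58),
    "MuSiQue": (48.30, 60.20),
}

# Gain in percentage points when swapping the Executive model.
gains = {
    name: round(gemini - qwen, 2)
    for name, (qwen, gemini) in memo_accuracy.items()
}
print(gains)  # {'BrowseComp-Plus': 12.45, 'NarrativeQA': 26.73, 'MuSiQue': 11.9}
```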

5.2 Ablation on the amount of noise for the dataset

Table 3: Accuracy (%) on BrowseComp-Plus and MuSiQue with Qwen2.5-32B-Instruct as Executive model. MeMo results are based on Qwen2.5-14B-Instruct and reported at the best training epoch. $N = N_{\text{evidence}}^{\text{dataset}}$ denotes the number of ground-truth evidence documents in the corpus; column headers indicate the number of additional negative (distractor) documents added, as a multiple of $N$. $\Delta$ denotes accuracy difference (pp) compared to $0\times N$.

| Method | Dataset | $0\times N$ Acc. (%) | $1\times N$ Acc. (%) | $\Delta$ |
|---|---|---|---|---|
| NV-Embed-V2 | BrowseComp-Plus | 56.89 ± 0.51 | 50.67 ± 0.33 | ↓ 6.22 |
| NV-Embed-V2 | MuSiQue | 42.30 ± 0.53 | 37.47 ± 0.15 | ↓ 4.83 |
| HippoRAG2 | BrowseComp-Plus | 62.33 ± 1.15 | 56.11 ± 0.51 | ↓ 6.22 |
| HippoRAG2 | MuSiQue | 47.33 ± 0.74 | 42.17 ± 0.12 | ↓ 5.16 |
| MeMo | BrowseComp-Plus | 53.67 ± 1.15 | 54.22 ± 0.84 | ↑ 0.55 |
| MeMo | MuSiQue | 50.07 ± 0.81 | 48.30 ± 1.25 | ↓ 1.77 |

We investigate the robustness of MeMo against two strong retrieval-based baselines, NV-Embed-V2 and HippoRAG2, under increasing levels of retrieval noise, controlled by varying the number of negative (distractor) documents added to the target corpus as a multiple of the total number of ground-truth evidence documents in each dataset ($N_{\text{evidence}}^{\text{dataset}} = 1{,}775$ for BrowseComp-Plus and $N_{\text{evidence}}^{\text{dataset}} = 2{,}648$ for MuSiQue). The datasets used throughout this paper (detailed in Appendix D) correspond to a ratio of $1\times N_{\text{evidence}}^{\text{dataset}}$; we additionally evaluate at ratio $0\times N_{\text{evidence}}^{\text{dataset}}$ (no distractors) as an idealized noise-free reference to isolate the effect of distractors.

Results in Table 3 demonstrate that retrieval-based methods exhibit pronounced sensitivity to noise. Both NV-Embed-V2 and HippoRAG2 suffer drops of up to 6.22 pp on BrowseComp-Plus and up to 5.16 pp on MuSiQue when scaling from $0\times N$ to $1\times N$, confirming that these systems struggle to filter irrelevant documents under realistic corpus conditions. In contrast, MeMo maintains stable performance across both benchmarks, with a marginal improvement of 0.55 pp on BrowseComp-Plus and a decline of only 1.77 pp on MuSiQue, both within one standard deviation, demonstrating that MeMo is robust to increasing retrieval noise. We attribute this robustness to MeMo's design: despite being trained on a corpus containing negative documents, Memory model provides more precise information to Executive model's sub-queries than direct document retrieval. Additional analysis of performance degradation in retrieval-based methods is provided in Appendix L.

5.3 Ablation on Memory model size

We investigate how the size of Memory model affects downstream task performance by comparing models of 1.5B and 14B parameters in the Qwen2.5 family. Implementation details are provided in Appendix M. Results in Table 4 show a consistent positive scaling trend: larger Memory models yield improved performance across all benchmarks and Executive models. However, the results also show that a stronger Executive model reasoning capability modulates this gap non-uniformly across tasks: the performance difference between Memory model sizes widens for NarrativeQA but shrinks for BrowseComp-Plus and MuSiQue. This suggests that the interaction between Executive model reasoning capability and Memory model size is task-dependent.

Table 4: Ablation on Memory model size within the Qwen2.5 family. Bold results indicate best performing results in the column.

| Memory model | BrowseComp-Plus, Qwen2.5-32B | BrowseComp-Plus, Gemini-3-Flash | NarrativeQA, Qwen2.5-32B | NarrativeQA, Gemini-3-Flash | MuSiQue, Qwen2.5-32B | MuSiQue, Gemini-3-Flash |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 44.11 ± 2.22 | 61.00 | 24.00 ± 0.20 | 47.44 | 42.90 ± 1.39 | 59.70 |
| Qwen2.5-14B-Instruct | **54.22 ± 0.84** | **66.67** | **26.85 ± 0.39** | **53.58** | **50.07 ± 0.81** | **60.20** |

5.4 Ablation on Memory model family

We investigate whether the choice of Memory model family affects performance by comparing three models of similar parameter scale (∼1–2B) but distinct architectures and pretraining lineages: Qwen2.5-1.5B-Instruct (qwen2025qwen25technicalreport), Gemma3-1B-IT (gemmateam2025gemma3technicalreport), and LFM2.5-1.2B-Instruct (amini2025lfm2). Implementation details are provided in Appendix N. Results in Table 5 show that MeMo performance is largely robust to the choice of Memory model architecture, demonstrating that the framework is not sensitive to the specific pretraining lineage of Memory model at similar parameter scale, and that the parametric knowledge compression induced by our training procedure generalizes across diverse model families.

Table 5: Ablation across Memory models at similar parameter scales (∼1–2B). Bold results indicate best performing results in the column.

| Memory model | BrowseComp-Plus, Qwen2.5-32B-I | BrowseComp-Plus, Gemini-3-F | NarrativeQA, Qwen2.5-32B-I | NarrativeQA, Gemini-3-F | MuSiQue, Qwen2.5-32B-I | MuSiQue, Gemini-3-F |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | **44.11 ± 2.22** | **61.00** | **24.00 ± 0.20** | 47.44 | 42.90 ± 1.39 | **59.70** |
| Gemma3-1B-IT | 41.67 ± 2.03 | 59.00 | 22.30 ± 2.47 | **48.81** | 41.17 ± 1.20 | 56.20 |
| LFM2.5-1.2B-Instruct | 37.33 ± 1.86 | 59.67 | 21.96 ± 1.97 | 46.42 | **45.23 ± 2.49** | 58.30 |

5.5 Continual integration via model merging

We test the streaming-update scenario described in Section 4.2 on NarrativeQA, comparing model merging against full retraining of Memory model on the union of both subsets when the second arrives. Of the 14 sweep configurations (see Table 12, Appendix H), we report the top-performing one, TIES (yadav2023ties) at $\rho=0.3$, in the main paper. Let $X$ and $Y$ denote the SFT cost on each subset alone; cost scales approximately linearly with the number of QA pairs, so training on the union costs $X+Y$. Cumulative compute across the two arrivals is then $X+Y$ for merging versus $X+(X+Y)$ for full retraining.

Table 6: Model merging vs. full retraining on NarrativeQA. Memory model = Qwen2.5-14B-Instruct. Merge-TIES ($\rho=0.3$) is the best of 14 configurations swept (Table 12, Appendix H). Cumulative compute is reported in 8×H100 GPU-hours for $K=2$ subsets of ∼640k reflection QA pairs each. $\Delta$ denotes accuracy difference (pp) relative to full retraining.

| Method | Cumulative compute (8×H100 GPU-h) | Qwen2.5-32B-I Acc. (%) | $\Delta$ | Gemini-3-F Acc. (%) | $\Delta$ |
|---|---|---|---|---|---|
| Full retrain ($X+(X+Y)$) | ≈ 72 h | **26.85 ± 0.39** | - | **53.58** | - |
| Merge-TIES ($\rho=0.3$, $X+Y$) | ≈ 48 h | 15.81 ± 0.39 | ↓ 11.04 | 34.47 | ↓ 19.11 |

Merging cuts compute by 33% at $K=2$, with widening returns at scale. As reported in Table 6, the full-retrain baseline incurs $X+(X+Y)\approx 72$ GPU-hours of cumulative compute, while merging accumulates only $X+Y\approx 48$ GPU-hours, a 33% reduction (Fig. 2). The gap widens with $K$: under the same per-corpus cost, merging scales as $\Theta(K)$ while full retraining scales as $\Theta(K^{2})$, yielding a $5.5\times$ saving at $K=10$ (240 vs. 1,320 GPU-hours).
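The scaling claim can be checked with a few lines of arithmetic. The sketch assumes equal per-corpus cost, which is what the reported figures imply ($X+Y\approx 48$ and $X+(X+Y)\approx 72$ give $X\approx Y\approx 24$ GPU-hours); the function names are illustrative. Merging trains each corpus once, while full retraining re-trains on the growing union at every arrival, so its cumulative cost is $c\,(1+2+\dots+K) = c\,K(K+1)/2$.

```python
def merging_cost(k: int, per_corpus: float = 24.0) -> float:
    """Cumulative GPU-hours when each of k corpora is trained once, then merged."""
    return per_corpus * k


def retraining_cost(k: int, per_corpus: float = 24.0) -> float:
    """Cumulative GPU-hours when the union is re-trained at each of k arrivals."""
    return per_corpus * k * (k + 1) / 2


for k in (2, 10):
    merge, retrain = merging_cost(k), retraining_cost(k)
    print(k, merge, retrain, round(retrain / merge, 2), round(1 - merge / retrain, 2))
# 2 48.0 72.0 1.5 0.33
# 10 240.0 1320.0 5.5 0.82
```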

Merging trades a measurable accuracy gap for the compute saving, but still beats retrieval under Gemini-3-Flash. Merge-TIES ($\rho=0.3$) trails the full-retrain Memory model by 11.0 pp under Qwen2.5-32B-Instruct and 19.1 pp under Gemini-3-Flash (Table 6); across the full 14-configuration sweep, accuracy ranges from 7.85% (SLERP, worst) to 15.81% (TIES, best), shown in Fig. 2. Despite this gap, the merged Memory model paired with Gemini-3-Flash (34.47%) still outperforms every retrieval baseline with reported Gemini-3-Flash results (BM25, NV-Embed-V2, HippoRAG2; see Table 2) on NarrativeQA. With Qwen2.5-32B-Instruct (15.81%), it exceeds BM25 and Cartridges but falls below NV-Embed-V2 (20.59%) and HippoRAG2 (21.39%). This indicates that even a substantially cheaper merging procedure can preserve part of MeMo's advantage over retrieval-based approaches when paired with a stronger Executive model. TIES and DARE-Linear at $\rho=0.3$ dominate the sweep, suggesting that aggressive sparsification combined with sign-conflict resolution is the most reliable merging recipe in this regime.
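For readers unfamiliar with TIES, the sketch below shows its three steps as commonly described (trim each task vector to its largest-magnitude entries, elect a sign per parameter, then average only the entries that agree with the elected sign) on flat toy vectors. It is not the authors' merging code: the function names and the scaling coefficient `lam` are assumptions, `density` is assumed to play the role of $\rho$, and the example uses a density of 0.5 rather than 0.3 so that the four-element vectors stay readable.

```python
from typing import List


def trim(tau: List[float], density: float) -> List[float]:
    """Keep the top `density` fraction of entries by magnitude; zero the rest.

    Entries tied with the threshold are all kept, so slightly more than the
    requested fraction may survive.
    """
    k = max(1, round(density * len(tau)))
    threshold = sorted((abs(v) for v in tau), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in tau]


def ties_merge(phi_0: List[float], taus: List[List[float]],
               density: float = 0.3, lam: float = 1.0) -> List[float]:
    """Sketch of TIES: trim, elect sign, disjoint mean, then add to the base."""
    trimmed = [trim(tau, density) for tau in taus]
    merged = []
    for j, base in enumerate(phi_0):
        column = [tau[j] for tau in trimmed]
        # Elect the sign carrying the larger total magnitude across models.
        positive = sum(column) >= 0
        # Disjoint mean: average only non-zero entries matching the elected sign.
        agreeing = [v for v in column if v != 0.0 and (v > 0) == positive]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base + lam * delta)
    return merged


phi_0 = [0.0, 0.0, 0.0, 0.0]
tau_1 = [1.0, -0.125, 0.5, 0.0]     # shift from training on subset 1
tau_2 = [-0.5, 0.75, -0.25, 0.125]  # shift from training on subset 2
merged = ties_merge(phi_0, [tau_1, tau_2], density=0.5)
print(merged)  # [1.0, 0.75, 0.5, 0.0]
```

In the first coordinate the two models disagree in sign; the larger-magnitude positive shift wins and the negative one is discarded instead of being averaged in, which is the sign-conflict resolution referred to above.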

6 Conclusion

We introduced MeMo, a modular framework for integrating updated or domain-specific knowledge into LLMs via a Memory model trained on a synthesized reflection QA dataset. MeMo addresses key limitations of existing methods: it bypasses context constraints and weak cross-document reasoning in retrieval-based approaches, avoids costly and brittle parametric updates (including catastrophic forgetting), and removes representation coupling in latent memory methods. Its core components are a data synthesis pipeline capturing explicit facts and implicit relationships, and a multi-turn inference protocol that decomposes complex queries into targeted sub-queries for desired information retrieval from the memory model. While MeMo demonstrates strong performance, it has limitations regarding training cost, evaluation scope, and the capacity of Memory model to scale with corpus size (see Appendix B). Empirically, MeMo outperforms strong baselines across diverse benchmarks. It also provides a scalable pathway for knowledge integration, supporting efficient updates and plug-and-play deployment with both open-source and proprietary LLMs. Future work includes more efficient memory construction, extensions to dynamic corpora, and tighter coordination between the Executive model and Memory model. We view MeMo (Memory as a Model) as a promising foundation for more flexible, updatable, and knowledge-aware AI systems.

References

  • [1] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv:2205.11916, 2023.
  • [2] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv:2303.18223, 2023.
  • [3] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghoon Kim. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, 2026.
  • [4] Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey. arXiv:2403.08319, 2024.
  • [5] Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. Dated data: Tracing knowledge cutoffs in large language models. arXiv:2403.12958, 2024.
  • [6] Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now? arXiv:2207.13332, 2024.
  • [7] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge. arXiv:2212.13138, 2022.
  • [8] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv:2303.17564, 2023.
  • [9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv:2005.11401, 2021.
  • [10] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. arXiv:2211.08411, 2023.
  • [11] Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. In Proc. MLSys, pages 795–813, 2022.
  • [12] Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proc. SIGIR, 1994.
  • [13] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv:2405.17428, 2024.
  • [14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proc. NeurIPS, pages 9459–9474, 2020.
  • [15] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv:2404.16130, 2024.
  • [16] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. In Proc. NeurIPS, pages 59532–59569, 2024.
  • [17] Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models. In Proc. ICML, 2025.
  • [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, pages 1877–1901, 2020.
  • [19] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proc. EMNLP, 2024.
  • [20] Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv:2401.15391, 2024.
  • [21] Jiaen Lin, Jingyu Liu, and Yingbo Liu. Optimizing multi-hop document retrieval through intermediate representations. arXiv:2503.04796, 2025.
  • [22] Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. arXiv:2302.03241, 2023.
  • [23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, 2022.
  • [24] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proc. ACL, pages 13484–13508, 2023.
  • [25] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, pages 1–53, 2024.
  • [26] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747, 2025.
  • [27] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv:2501.17161, 2025.
  • [28] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proc. EMNLP, 2023.
  • [29] Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Proc. NeurIPS, 2023.
  • [30] Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. In Proc. ICLR, 2024.
  • [31] Guibin Zhang, Muxin Fu, and Shuicheng YAN. Memgen: Weaving generative latent memory for self-evolving agents. In Proc. ICLR, 2026.
  • [32] Bohan Li, Yutai Hou, and Wanxiang Che. Data augmentation approaches in natural language processing: A survey. AI Open, pages 71–90, 2022.
  • [33] Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in nlp. Transactions of the Association for Computational Linguistics, pages 191–211, 2023.
  • [34] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: part 3.1, knowledge storage and extraction. In Proc. ICML, pages 1067–1077, 2024.
  • [35] Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic qa corpora generation with roundtrip consistency. In Proc. ACL, pages 6168–6173, 2019.
  • [36] Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. Training question answering models from synthetic data. In Proc. EMNLP, pages 5811–5826, 2020.
  • [37] Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In Proc. ACL, pages 14664–14690, 2024.
  • [38] Yeo Wei Jie, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. Self-training large language models through knowledge detection. In Proc. EMNLP Findings, pages 15033–15045, 2024.
  • [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, 2017.
  • [40] Carles Gelada, Jacob Buckman, Sean Zhang, and Txus Bach. Scaling context requires rethinking attention. arXiv:2507.04239, 2025.
  • [41] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
  • [42] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In Proc. COLM, 2024.
  • [43] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. In Proc. SIGIR, 2024.
  • [44] Jingyu Liu, Jiaen Lin, and Yong Liu. Tackling the inherent difficulty of noise filtering in rag. arXiv:2601.01896, 2026.
  • [45] Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, and Bryan Kian Hsiang Low. Understanding the relationship between prompts and response uncertainty in large language models. In Proc. ACL Findings, 2026.
  • [46] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. In Proc. AAAI, 2020.
  • [47] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
  • [48] Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. Mapping post-training forgetting in language models at scale. arXiv:2510.17776, 2025.
  • [49] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. ICLR, 2024.
  • [50] Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, et al. Dissecting the runtime performance of the training, fine-tuning, and inference of large language models. arXiv:2311.03687, 2023.
  • [51] Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Callie Hao, and Nishil Talati. Understanding the performance and estimating the cost of llm fine-tuning. In Proc. IISWC, 2024.
  • [52] Jiya Manchanda, Laura Boettcher, Matheus Westphalen, and Jasser Jasser. The open source advantage in large language models (llms). arXiv:2412.12004, 2025.
  • [53] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023.
  • [54] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv:2307.08621, 2023.
  • [55] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In Proc. ICLR, 2022.
  • [56] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In Proc. ICLR, 2020.
  • [57] Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin. Memory decoder: A pretrained, plug-and-play memory for large language models. arXiv:2508.09874, 2025.
  • [58] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv:2309.12288, 2023.
  • [59] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. arXiv:2309.14402, 2023.
  • [60] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 2024.
  • [61] Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent. arXiv:2508.06600, 2025.
  • [62] Michal Danilák. langdetect. https://github.com/Mimino666/langdetect, 2021.
  • [63] Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, pages 317–328, 2018.
  • [64] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. arXiv:2108.00573, 2022.
  • [65] Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, et al. Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv:2506.06266, 2025.
  • [66] Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin. Memory decoder: A pretrained, plug-and-play memory for large language models. arXiv:2508.09874, 2025.
  • [67] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. arXiv:2412.15115, 2025.
  • [68] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. arXiv:2309.06180, 2023.
  • [69] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv:2104.09864, 2023.
  • [70] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
  • [71] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020.
  • [72] Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, December 2025.
  • [73] Gheorghe Comanici, Eric Bieber, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025.
  • [74] Jeffrey Ip and Kritin Vongthongsri. deepeval. https://github.com/confident-ai/deepeval, 2025.
  • [75]Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. 
Choquette-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini, Jean Pouget-Abadie, Jetha Chan, Joe Stanton, John Wieting, Jonathan Lai, Jordi Orbay, Joseph Fernandez, Josh Newlan, Ju yeong Ji, Jyotinder Singh, Kat Black, Kathy Yu, Kevin Hui, Kiran Vodrahalli, Klaus Greff, Linhai Qiu, Marcella Valentine, Marina Coelho, Marvin Ritter, Matt Hoffman, Matthew Watson, Mayank Chaturvedi, Michael Moynihan, Min Ma, Nabila Babar, Natasha Noy, Nathan Byrd, Nick Roy, Nikola Momchev, Nilay Chauhan, Noveen Sachdeva, Oskar Bunyan, Pankil Botarda, Paul Caron, Paul Kishan Rubenstein, Phil Culliton, Philipp Schmid, Pier Giuseppe Sessa, Pingmei Xu, Piotr Stanczyk, Pouya Tafti, Rakesh Shivanna, Renjie Wu, Renke Pan, Reza Rokni, Rob Willoughby, Rohith Vallu, Ryan Mullins, Sammy Jerome, Sara Smoot, Sertan Girgin, Shariq Iqbal, Shashir Reddy, Shruti Sheth, Siim Põder, Sijal Bhatnagar, Sindhu Raghuram Panyam, Sivan Eiger, Susan Zhang, Tianqi Liu, Trevor Yacovone, Tyler Liechty, Uday Kalra, Utku Evci, Vedant Misra, Vincent Roseberry, Vlad Feinberg, Vlad Kolesnikov, Woohyun Han, Woosuk Kwon, Xi Chen, Yinlam Chow, Yuvein Zhu, Zichuan Wei, Zoltan Egyed, Victor Cotruta, Minh Giang, Phoebe Kirk, Anand Rao, Kat Black, Nabila Babar, Jessica Lo, Erica Moreira, Luiz Gustavo Martins, Omar Sanseviero, Lucas Gonzalez, Zach Gleicher, Tris Warkentin, Vahab Mirrokni, Evan Senter, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, Yossi Matias, D. 
Sculley, Slav Petrov, Noah Fiedel, Noam Shazeer, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Jean-Baptiste Alayrac, Rohan Anil, Dmitry Lepikhin, Sebastian Borgeaud, Olivier Bachem, Armand Joulin, Alek Andreev, Cassidy Hardin, Robert Dadashi, and Léonard Hussenot. Gemma 3 technical report. arXiv:2503.19786, 2025.
  • [76] Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, et al. LFM2 technical report. arXiv:2511.23404, 2025.
  • [77] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. In Proc. NeurIPS, 2023.
  • [78] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
  • [79] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124, 2024.
  • [80] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Fine-tuning or retrieval? comparing knowledge injection in llms. In Proc. EMNLP, pages 237–250, 2024.
  • [81] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey. arXiv:2402.01364, 2024.
  • [82] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proc. ICML, pages 23965–23998, 2022.
  • [83] Ken Shoemake. Animating rotation with quaternion curves. In Proc. SIGGRAPH, pages 245–254, 1985.
  • [84] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In Proc. ICLR, 2023.
  • [85] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Proc. ICML, 2024.
  • [86] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In Proc. ICLR, 2022.

Appendix A Impact statement

MeMo advances the ability of LLMs to internalize knowledge over large, domain-specific corpora without requiring access to model weights, lowering the barrier for deploying capable AI systems in knowledge-intensive domains such as law, medicine, and scientific research. By enabling plug-and-play integration with any LLM, including proprietary models, MeMo democratizes access to powerful knowledge integration capabilities that would otherwise require significant computational resources or white-box model access. At the same time, this accessibility introduces dual-use concerns, as the same capability that enables beneficial applications could be used to internalize misinformation, proprietary data without authorization, or harmful content at scale. Additionally, as MeMo reduces reliance on explicit retrieval, it may obscure the provenance of retrieved information, making it harder to attribute the sources underlying a model’s responses. We encourage future work to investigate attribution mechanisms and access controls for memory-based systems, and urge practitioners to carefully consider the nature of the documents used to train Memory model.

Appendix B Limitations

MeMo incurs an upfront training cost for each new corpus, and performance may vary across domains, document types, or LLM families beyond those covered in our experiments. Furthermore, the performance of MeMo is inherently bounded by the representational capacity of Memory model to internalize the target corpus. Although our experiments do not reveal clear signs that Memory model has reached its capacity limit, we hypothesize that sufficiently large or information-dense corpora will exceed what a fixed-size Memory model can correctly compress and represent.

Appendix C Future work

We outline several directions for future work. The data generation pipeline is computationally expensive, with Step 5 in Algorithm 1 scaling quadratically at $O(k \cdot C^{2} \cdot Q^{2})$, and reducing this cost remains an open problem. A systematic evaluation of chunking strategies and their associated tradeoffs (Appendix D) is likewise an open direction. On the training side, scaling Memory model with corpus size and developing more effective model merging strategies for reducing per-corpus training costs (Section 5.5) are promising directions. Other post-training methods such as Reinforcement Learning [78] have also been shown to be effective in improving model task performance [79], and applying such methods to Memory model training warrants future investigation.

LoRA configurations better suited to specific architectures, including per-architecture tuning of rank and learning rate, also warrant further investigation (Appendix O). Finally, a more systematic study of the interaction between Executive model reasoning capability and Memory model size (Section 5.3), as well as the optimal interaction budget at each stage and Executive model selection (Section J.2), are other promising future directions.

Appendix D Preparation of datasets

Corpus construction. Extending from our description in Section 5, we distinguish between two types of documents: evidence documents, which contain information relevant to answering a given question, and negative documents, which are irrelevant and serve as noise. For BrowseComp-Plus, we used 1,775 unique evidence documents and 1,766 unique negative documents (after removal of non-English documents), yielding 3,541 documents in total. For MuSiQue, we used 2,648 documents for each of the evidence and negative documents, yielding 5,296 documents in total. NarrativeQA does not have negative documents.

Footnote 6: Note that for BrowseComp-Plus, the gold documents are a subset of the evidence documents.

Chunking strategy. As shown in Table 7, NarrativeQA full documents span the 32,769–131,072 token range with a median length of 65,925 tokens, reflecting the long-form nature of the source novels. Processing such documents without chunking risks reduced coverage of extractable QA pairs in Step 1 of Algorithm 1, as attention quality is known to deteriorate over longer contexts [42]. We therefore chunk NarrativeQA documents using a fixed sliding window of 6,400 words with a 640-word overlap (10% overlap ratio). This yields 75 chunks, 96% of which fall in the 4,097–16,384 token range, with a median group size of 7 chunks per document as shown in Table 8. Unlike NarrativeQA, MuSiQue documents are compact, with 99.70% falling below 512 tokens, and each MuSiQue document is treated as a single chunk.
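The sliding-window chunking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `chunk_words`, whitespace-based word splitting, and the handling of the final partial window are all assumptions, since the text only specifies the window size (6,400 words) and the overlap (640 words).

```python
def chunk_words(text, window=6400, overlap=640):
    """Split text into overlapping windows of words.

    Assumptions (not from the paper): words are whitespace-delimited,
    and the last window is kept even if it is shorter than `window`.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    stride = window - overlap  # 5,760 words with the reported settings
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Toy example with a small window so the overlap is easy to inspect.
toy = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy, window=4, overlap=1)
```

With the toy settings, consecutive chunks share exactly one word, mirroring the 10% overlap ratio of the reported configuration at a smaller scale.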

Table 7: Token length distribution across corpora at the chunk level, where $n$ represents the total number of individual chunks processed by Algorithm 1. Each entry reflects the token count of a single text chunk. Statistics for NarrativeQA are reported before and after chunking.

| Token Range | BrowseComp-Plus ($n=3{,}541$) | NarrativeQA Full Docs ($n=10$) | NarrativeQA Chunks ($n=75$) | MuSiQue ($n=5{,}296$) |
|---|---|---|---|---|
| 0–512 | 606 (17.11%) | 0 (0.00%) | 0 (0.00%) | 5,280 (99.70%) |
| 513–1,024 | 591 (16.69%) | 0 (0.00%) | 0 (0.00%) | 16 (0.30%) |
| 1,025–2,048 | 746 (21.07%) | 0 (0.00%) | 1 (1.33%) | 0 (0.00%) |
| 2,049–4,096 | 598 (16.89%) | 0 (0.00%) | 2 (2.67%) | 0 (0.00%) |
| 4,097–8,192 | 428 (12.09%) | 0 (0.00%) | 36 (48.00%) | 0 (0.00%) |
| 8,193–16,384 | 323 (9.12%) | 0 (0.00%) | 36 (48.00%) | 0 (0.00%) |
| 16,385–32,768 | 145 (4.09%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) |
| 32,769–65,536 | 56 (1.58%) | 5 (50.00%) | 0 (0.00%) | 0 (0.00%) |
| 65,537–131,072 | 20 (0.56%) | 5 (50.00%) | 0 (0.00%) | 0 (0.00%) |
| >131,072 | 28 (0.79%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) |
| Min tokens | 14 | 32,804 | 1,943 | 23 |
| Median tokens | 1,756 | 65,925 | 8,158 | 105 |
| Mean tokens | 7,192 | 66,324 | 8,713 | 123 |
| $p_{95}$ tokens | 20,330 | 119,267 | 11,266 | 270 |
| Max tokens | 1,235,897 | 119,267 | 12,104 | 828 |

Table 8: Distribution of document group sizes across datasets, where group size denotes the number of chunks associated with a single question or document. For BrowseComp-Plus and MuSiQue, each question is associated with a subset of chunks drawn from the corpus, and group size represents the number of chunks per question. For NarrativeQA, each subset of chunks is derived from the original document used for multiple questions, and group size represents the number of chunks per document.

| Document Group Size Range | BrowseComp-Plus ($n_{\text{group}}=300$) | NarrativeQA Chunks ($n_{\text{group}}=10$) | MuSiQue ($n_{\text{group}}=1{,}000$) |
|---|---|---|---|
| 0–2 | 2 (0.67%) | 0 (0.00%) | 0 (0.00%) |
| 3–4 | 14 (4.67%) | 3 (30.00%) | 518 (51.80%) |
| 5–8 | 78 (26.00%) | 4 (40.00%) | 482 (48.20%) |
| 9–16 | 159 (53.00%) | 3 (30.00%) | 0 (0.00%) |
| >16 | 47 (15.67%) | 0 (0.00%) | 0 (0.00%) |
| Min group size | 2 | 3 | 4 |
| Median group size | 12 | 7 | 4 |
| Mean group size | 11.8 | 7.5 | 5.3 |
| $p_{95}$ group size | 20 | 16 | 8 |
| Max group size | 23 | 16 | 8 |

Each BrowseComp-Plus document is also treated as a single chunk. The time complexity of Step 5 in Algorithm 1 is $O(k \cdot C^{2} \cdot Q^{2})$, where $k = n_{\text{group}}$ is the number of groups, $C = |G_{i}|$ is the number of participating chunks per group, and $Q = \bar{Q}_{i}$ is the average number of QA pairs extracted per chunk. Since chunking increases $C$, pipeline costs at Step 5 scale quadratically as the number of chunks per group increases. Given that only 2.93% of BrowseComp-Plus documents exceed 32,768 tokens, the majority of documents fit within a single chunk, making the cost of chunking difficult to justify. We therefore opted against chunking in favor of lower pipeline cost, and leave a systematic evaluation of chunking strategies and related tradeoffs to future work.
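To make the quadratic dependence on $C$ concrete, the sketch below evaluates the stated cost model $k \cdot C^{2} \cdot Q^{2}$ for two group sizes. It is only an illustration under simplifying assumptions: the function name `step5_cost` is hypothetical, every group is treated as having the same $C$ and $Q$, and $Q = 10$ is a placeholder value that does not come from the paper. The values $k = 300$ and $C = 12$ are the BrowseComp-Plus group count and median group size from Table 8.

```python
def step5_cost(num_groups, chunks_per_group, qa_pairs_per_chunk):
    """Cost model k * C^2 * Q^2 in abstract units (not GPU-hours)."""
    return num_groups * chunks_per_group ** 2 * qa_pairs_per_chunk ** 2


# k and C from Table 8 (BrowseComp-Plus); Q = 10 is an illustrative placeholder.
baseline = step5_cost(num_groups=300, chunks_per_group=12, qa_pairs_per_chunk=10)
doubled = step5_cost(num_groups=300, chunks_per_group=24, qa_pairs_per_chunk=10)

# Doubling the chunks per group quadruples the modeled Step 5 cost.
ratio = doubled / baseline
```

Because $Q$ cancels in the ratio, the fourfold increase holds for any placeholder value of $Q$, which is why chunking (or adding negative documents) is costly under this model.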

Subset selection of negative documents. We include only a subset of negative documents for BrowseComp-Plus and MuSiQue due to computational constraints arising from the quadratic scaling of Step 5. As reported in Table 8, BrowseComp-Plus currently has a mean group size of 11.8 and a maximum of 23, while MuSiQue has a mean group size of 5.3 and a maximum of 8. Incorporating all available negative documents, which average 78 per question (up to 197) for BrowseComp-Plus and 17 per question (up to 18) for MuSiQue, would increase the group size substantially. Given the quadratic dependence on $C$ in Step 5, this would result in a prohibitive increase in pipeline cost for BrowseComp-Plus ($k=300$) and MuSiQue ($k=1{,}000$). Hence, we opted to include only up to $N_{\text{evidence}}^{\text{dataset}}$ negative documents for each question in the corpus.

Appendix E Discussion on steps in data generation pipeline

E.1 Ablation of data synthesis steps

We experiment with the data generation pipeline to show the importance of each step. We perform a leave-one-out (LOO) ablation for each step of data synthesis and train the model on the synthesized QA pairs generated. Results are reported in Table 9 on the NarrativeQA and MuSiQue datasets using Qwen2.5-32B-Instruct as the Executive model and Qwen2.5-1.5B-Instruct as the Memory model.

Table 9: LOO ablation accuracy at best performing Qwen2.5-1.5B-Instruct epoch across datasets. Data ratio indicates the number of QA pairs retained relative to the baseline. For each step removed, the Qwen2.5-1.5B-Instruct was retrained, and we report the mean ± std. dev. over 3 runs at the same training epoch as the baseline.

| Ablation | NarrativeQA Data Ratio | NarrativeQA Accuracy (%) | MuSiQue Data Ratio | MuSiQue Accuracy (%) |
|---|---|---|---|---|
| Baseline (all steps) | 1.000× | 24.00 ± 0.20 | 1.000× | 42.90 ± 1.25 |
| Step 1a removed | 0.434× | 20.48 ± 0.90 | 0.381× | 30.00 ± 0.17 |
| Step 1b removed | 0.598× | 22.98 ± 1.04 | 0.651× | 37.33 ± 0.25 |
| Step 2 removed | 0.739× | 24.69 ± 1.10 | 0.621× | 37.10 ± 1.76 |
| Step 3 removed | 2.078× | 28.90 ± 0.86 | 1.128× | 41.70 ± 0.78 |
| Step 4 removed | 0.378× | 23.21 ± 1.56 | 0.501× | 39.10 ± 0.02 |
| Step 5 removed | 0.002× | 6.37 ± 0.39 | 0.195× | 24.17 ± 0.25 |

Step 5 (Cross-document synthesis) is the most critical component of the pipeline. Its removal causes accuracy to collapse to 6.37% and 24.17% on NarrativeQA and MuSiQue respectively, against baseline scores of 24.00% and 42.90%, accompanied by a near-total loss of training data (0.002× and 0.195× retention). As described in Section 4.1, Step 5 enables cross-document synthesis where $\mathcal{M}_{\text{gen}}$ constructs $\mathcal{Q}_{\text{cross}}$ pairs spanning inter-document connections and cross-chunk connections within a single long document, making it the dominant source of training pairs in $\mathcal{Q}_{\text{final}}$ and directly targeting the multi-source synthesis objective central to both benchmarks.

An interesting anomaly arises with Step 2 and Step 3, where their removal does not consistently hurt performance and improves accuracy on NarrativeQA. Step 2 merges related QA pairs from a single document chunk into multi-fact questions by identifying commonalities such as shared entities, overlapping time periods, and sequential events. For MuSiQue, these commonalities reflect genuine knowledge relationships that directly resemble the multi-hop factual reasoning the benchmark evaluates, such that removing Step 2 eliminates a large fraction of useful training pairs, leading to a drop in accuracy from 42.90% to 37.10%. For NarrativeQA, however, the same consolidation patterns operate on superficial narrative co-occurrences rather than meaningful knowledge relationships. The predominant commonality categories are event or scene groupings that NarrativeQA does not evaluate, and entity co-occurrence patterns that are trivially satisfied given the pervasive presence of central characters across scenes. Removing Step 2 eliminates these low-quality pairs, leading to the marginal accuracy improvement from 24.00% to 24.69%.

Removing Step 3 retains more data than the baseline (2.078× and 1.128× for NarrativeQA and MuSiQue respectively, per Table 9), yet the effect on performance diverges. For MuSiQue, performance drops from 42.90% to 41.70%, whereas for NarrativeQA, performance improves from 24.00% to 28.90%. Step 3 applies a self-containment filter that rewrites or discards pairs whose questions cannot be understood without access to the source chunk. For MuSiQue, violations are predominantly localized and shallow, making them amenable to filtering; the proposed filter effectively identifies and removes defective pairs. For NarrativeQA, long-form narrative text frequently contains pronouns and temporal references that span many paragraphs, which are structural features of the domain rather than fixable defects. This causes the rewriting loop to substitute in unrelated content and corrupt the pairs produced by earlier steps. Removing Step 3 for NarrativeQA therefore avoids this domain-induced corruption and retains the original pairs intact, explaining both the data retention ratio increase and the accuracy improvement. This suggests that Step 3 is most beneficial when applied to domains where self-containment violations are well-defined and resolvable.

The remaining steps follow a consistent trend: removing Step 1a, Step 1b, or Step 4 reduces both data volume and accuracy across both datasets, confirming that each step contributes a distinct and meaningful role to the final training corpus quality.

E.2 Additional steps considered but excluded

Three additional steps were considered but ultimately excluded from the pipeline. These include paraphrasing [80], increasing the number of sampling trials at Step 1 of Algorithm 1, and a targeted fill whereby $\mathcal{M}_{\text{gen}}$ reviews the generated QA pairs and rewrites them to incorporate additional missed information. Paraphrasing was excluded as the scale of generated pairs already provides sufficient coverage ($\approx$600k–1.6M across the three datasets, see Table 11), and the potential gains were outweighed by the additional computational overhead. Increasing sampling trials proved unreliable, as additional trials did not consistently extract facts that the initial pass had failed to extract. The targeted fill similarly offered limited gains: appending the existing QA pairs as context to prompt a revision only lengthens the context when the model had already failed to extract a fact from the original chunk, likely further exacerbating attention degradation over long inputs [42] and making retrieval of relevant information less reliable at inference time.

Appendix F Memory model hyperparameter settings

Training was conducted on H100 and H200 GPUs using the hyperparameter settings reported in Table 10. The effective batch size for each dataset is summarized in Table 11.

Table 10: Memory model SFT Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Fused AdamW |
| Gradient checkpointing | True |
| Learning rate (LR) | $2\times 10^{-5}$ |
| Num of Training epochs | 3 |
| LR scheduler type | Constant with warmup |
| Warmup ratio | 0.05 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Max sequence length | 8096 |
| Precision | BF16 |
| Attention implementation | Flash Attention 2 |

Table 11: Effective batch sizes and number of QA pairs used. NarrativeQA.1 and NarrativeQA.2 are independent subsets partitioned from the original that were used for model merging.

| Dataset | Target Num of Questions | Num of QA Pairs | Effective Batch Size |
|---|---|---|---|
| BrowseComp-Plus | 300 | 1,639,995 | 512 |
| NarrativeQA | 293 | 1,276,676 | 512 |
| NarrativeQA.1 | 146 | 635,009 | 256 |
| NarrativeQA.2 | 147 | 641,667 | 256 |
| MuSiQue | 1,000 | 664,762 | 256 |
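As a rough illustration of how the settings in Tables 10 and 11 translate into a training schedule, the sketch below derives the number of optimizer steps and a constant-with-warmup learning rate. Several details are assumptions rather than reported facts: the warmup is taken to be linear from zero, the warmup length is the floor of `warmup_ratio × total_steps`, the last partial batch of each epoch is kept, and the helper names `steps_per_epoch` and `lr_at_step` are hypothetical.

```python
import math


def steps_per_epoch(num_examples, effective_batch_size):
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    return math.ceil(num_examples / effective_batch_size)


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.05):
    """Constant-with-warmup schedule (linear warmup shape is an assumption)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# MuSiQue row of Table 11: 664,762 QA pairs, effective batch size 256, 3 epochs.
musique_steps = 3 * steps_per_epoch(664_762, 256)
```

Under these assumptions, roughly the first 5% of the MuSiQue steps ramp the learning rate up, after which it stays at $2\times 10^{-5}$ for the remainder of training.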

Appendix G Compute resources

All experiments were conducted using NVIDIA H200 GPUs. We report computational cost in GPU-hours.

Data generation.

Generating the full reflection dataset for BrowseComp-Plus, NarrativeQA, and MuSiQue took approximately 240, 200, and 150 GPU-hours respectively.

Training.

A single Memory model (Qwen2.5-14B-Instruct) training run on BrowseComp-Plus, NarrativeQA, and MuSiQue took approximately 180, 150, and 90 GPU-hours, respectively.

Appendix H Model training discussion

We considered three training paradigms: CPT, SFT, and LoRA-based SFT. CPT was excluded as it risks degrading instruction-following capability [81], which is critical for downstream QA evaluation. Full SFT was selected as it directly optimizes for the target task while preserving alignment [23]. LoRA-based SFT serves as a parameter-efficient alternative, and we include a comparison of these training methods in Appendix O.

Model merging targets the practical streaming setting in which new corpora arrive over time and Memory model must continually integrate them. Retraining Memory model from scratch on the union of all observed corpora is the natural baseline but quickly becomes prohibitive at scale, since its cost grows with the cumulative corpus size. Model merging instead trains a separate Memory model on each new corpus and combines it with the existing model in parameter space, so the cost of each update scales only with the size of the new corpus rather than the entire history. This decoupling comes at a measurable accuracy cost relative to full retraining, which we quantify in Fig. 2. We assume the corpora to be merged are pairwise disjoint.

H.1 Model merging

Merging methods. We consider the following methods, all of which produce $\varphi_{\text{merged}}$ without ever training on $\mathcal{D}_{1}\cup\dots\cup\mathcal{D}_{K}$:

  • Linear merging [82] computes a weighted sum of task vectors: $\varphi_{\text{merged}}=\varphi_{0}+\sum_{i=1}^{K}\lambda_{i}\tau_{i}$, where $\lambda_{i}>0$ are merging coefficients.
  • SLERP [83] interpolates between two task vectors along the unit sphere, preserving their magnitudes: $\varphi_{\text{merged}}=\varphi_{0}+\mathrm{SLERP}(\tau_{1},\tau_{2};\,t)$, with $t\in[0,1]$ controlling the interpolation factor.
  • Task arithmetic [84] adds task vectors directly without further processing, recovering linear merging as a special case with uniform $\lambda_{i}$.
  • TIES [77] resolves interference among task vectors before summation by (i) trimming each $\tau_{i}$ to its top-$\rho$ fraction of largest-magnitude entries, (ii) electing a sign at each coordinate by magnitude-weighted majority vote, and (iii) disjoint-merging only the entries that agree with the elected sign.
  • DARE [85] sparsifies each task vector by randomly dropping a fraction $1-\rho$ of its entries and rescaling the survivors by $1/\rho$ to preserve expected magnitude, before linear merging.
  • DARE-TIES [85] combines DARE-style stochastic sparsification with TIES sign-conflict resolution, retaining the diversity of random dropout while filtering out conflicting updates.
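The merging rules listed above can be sketched on flat parameter lists as follows. This is an illustrative toy, not the authors' pipeline: all function names are hypothetical, parameters are represented as plain Python lists rather than model tensors, and the TIES disjoint merge averages the agreeing entries, which follows our reading of the TIES paper rather than anything stated in this appendix. SLERP and DARE-TIES are omitted for brevity.

```python
import random


def task_vector(finetuned, base):
    """tau_i = phi_i - phi_0 on flattened parameters."""
    return [f - b for f, b in zip(finetuned, base)]


def linear_merge(base, task_vectors, weights):
    """phi_merged = phi_0 + sum_i lambda_i * tau_i."""
    merged = list(base)
    for lam, tau in zip(weights, task_vectors):
        for j, value in enumerate(tau):
            merged[j] += lam * value
    return merged


def trim_top_fraction(tau, rho):
    """TIES (i): keep the top-rho fraction of entries by magnitude."""
    keep = max(1, round(rho * len(tau)))
    order = sorted(range(len(tau)), key=lambda j: abs(tau[j]), reverse=True)
    kept = set(order[:keep])
    return [value if j in kept else 0.0 for j, value in enumerate(tau)]


def ties_merge(base, task_vectors, rho):
    """TIES (ii)-(iii): elect a sign per coordinate, merge agreeing entries."""
    trimmed = [trim_top_fraction(tau, rho) for tau in task_vectors]
    merged = list(base)
    for j in range(len(base)):
        column = [tau[j] for tau in trimmed]
        total = sum(column)  # sign of the sum = magnitude-weighted vote
        if total == 0:
            continue  # nothing survived trimming, or the vote is tied
        sign = 1.0 if total > 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged[j] += sum(agreeing) / len(agreeing)  # assumed: mean of agreeing
    return merged


def dare_sparsify(tau, rho, rng):
    """DARE: keep each entry with probability rho and rescale it by 1/rho."""
    return [value / rho if rng.random() < rho else 0.0 for value in tau]


base = [0.0, 0.0, 0.0, 0.0]
tau_1 = task_vector([1.0, -2.0, 0.1, 0.0], base)
tau_2 = task_vector([1.0, 3.0, -0.1, 0.5], base)
linear = linear_merge(base, [tau_1, tau_2], [1.0, 1.0])
ties = ties_merge(base, [tau_1, tau_2], rho=0.5)
dare_taus = [dare_sparsify(t, 0.5, random.Random(0)) for t in (tau_1, tau_2)]
dare = linear_merge(base, dare_taus, [1.0, 1.0])
```

In the toy example, the second coordinate shows the difference between the rules: linear merging sums the conflicting updates (-2.0 and 3.0), whereas TIES keeps only the update that agrees with the elected positive sign.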

Avoiding catastrophic forgetting. Because no individual Memory model $\mathcal{M}_{\varphi_{i}}$ is ever fine-tuned on another corpus’ data, model merging cannot induce the kind of distributional interference that drives catastrophic forgetting in sequential fine-tuning [26]. Knowledge from each corpus is preserved within its own task vector $\tau_{i}$, and conflicts between task vectors are addressed at merge time via the methods above rather than during gradient updates.

Scalability. When a new corpus $\mathcal{D}_{K+1}$ arrives, we train an auxiliary model $\mathcal{M}_{\varphi_{K+1}}$ on its reflection QA dataset, derive $\tau_{K+1}$, and re-merge in $\mathcal{O}(1)$ additional cost relative to the full collection. This enables modular, plug-and-play integration over a continuous stream of disjoint knowledge sources, unlike retraining from scratch on $\bigcup_{i}\mathcal{D}_{i}$, which scales linearly with the cumulative corpus size.

Inference. The merged Memory model is queried identically to a single-corpus Memory model via the structured multi-turn protocol described in Section 4.4. Because merging operates entirely in parameter space and produces a model with the same architecture and interface as $\mathcal{M}_{\varphi_{0}}$, it inherits the plug-and-play property of MeMo without requiring changes to the Executive model or the inference protocol. Importantly, the Executive model queries a single merged Memory model at inference rather than dispatching across $K$ separate per-corpus Memory models, keeping the multi-turn retrieval pipeline unchanged regardless of how many corpora have been integrated.

Procedure. For our experiments we partition NarrativeQA into two pairwise-disjoint subsets, NarrativeQA.1 and NarrativeQA.2, of $\sim$640k reflection QA pairs each. Each subset is used to fine-tune an independent Memory model from the same Qwen2.5-14B-Instruct base via SFT for 3 epochs, producing $\mathcal{M}_{\varphi_{1}}$ and $\mathcal{M}_{\varphi_{2}}$ at SFT costs of $X$ and $Y$ GPU-hours, respectively (each is $\approx 24$ GPU-hours on 8×H100; full-retrain on the union NarrativeQA.1 $\cup$ NarrativeQA.2 costs $X+Y\approx 48$ GPU-hours by linear scaling). We evaluate every saved checkpoint of each run on the held-out NarrativeQA evaluation set and select the best-performing checkpoint per subset; the corresponding task vectors $\tau_{1}$ and $\tau_{2}$ are the inputs to the merging step. We then sweep all six merging methods listed above (Linear, Task arithmetic, SLERP, TIES, DARE, DARE-TIES). TIES, DARE, and DARE-TIES are each run at three sparsification densities $\rho\in\{0.3,0.5,0.7\}$, SLERP at three interpolation factors $t\in\{0.3,0.5,0.7\}$, and Linear and Task arithmetic, which have no hyperparameter, once each, giving $4\times 3+2=14$ merged Memory model configurations in total. Each configuration is evaluated on NarrativeQA with Qwen2.5-32B-Instruct as Executive model (mean ± std over 3 runs). The configuration that we report in Section 5 as Merge-TIES is the best of the sweep (TIES with $\rho=0.3$).

Figure 2: Cost–accuracy trade-off on NarrativeQA when a second corpus arrives ($K=2$, Memory model = Qwen2.5-14B-Instruct, 8×H100). Cumulative training cost is shown on the $x$-axis (one Qwen-14B SFT run takes $\approx 24$ GPU-hours on a 640k-QA-pair corpus). Merging trains Memory model only on the new corpus, costing $X+Y\approx 48$ GPU-hours, while full retraining re-runs on the union, costing $X+(X+Y)\approx 72$ GPU-hours, a 33% saving. Merge-TIES ($\rho=0.3$) trails full retraining by 11.0 pp with Qwen2.5-32B-Instruct and 19.1 pp with Gemini-3-Flash as Executive model, but still outperforms all retrieval baselines (BM25, NV-Embed-V2, HippoRAG2, Cartridges). The vertical $\updownarrow$ at the merge cost shows the worst-to-best range across the 14 merge configurations swept (Table 12). Perfect Retrieval is shown as the upper bound.

Results. A single SFT run consumes $\approx 24$ GPU-hours on 8×H100; after two arrivals, full retraining incurs $X+(X+Y)=72$ GPU-hours of cumulative compute, whereas merging accumulates only $X+Y=48$ GPU-hours, a 33% reduction (Fig. 2). The asymptotic gap widens with $K$: under the same per-corpus cost, merging scales as $\Theta(K)$ while full retraining scales as $\Theta(K^{2})$, yielding a $5.5\times$ saving at $K=10$ (240 vs. 1,320 GPU-hours). On accuracy, Merge-TIES ($\rho=0.3$) trails full retraining by 11.0 pp with Qwen2.5-32B-Instruct as Executive model (15.81% vs. 26.85%) and by 19.1 pp with Gemini-3-Flash (34.47% vs. 53.58%), placing the merged Memory model below the union-retrained Memory model but above every retrieval baseline. The full per-method sweep is reported in Table 12: TIES ($\rho=0.3$) and DARE-Linear ($\rho=0.3$) lead at 15.81% and 15.47% respectively, while SLERP ($t=0.5$) is the worst configuration at 7.85%. The pattern across families suggests that aggressive sparsification at low $\rho$ paired with sign-conflict resolution (TIES, DARE-Linear) is the most reliable merging recipe in this regime. These results confirm the predicted compute–accuracy trade-off: merging recovers most of Memory model’s headroom over retrieval methods at substantially lower cumulative cost.
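The cumulative-cost figures quoted above follow from a simple model, sketched here under the paper's stated assumption of a uniform per-corpus SFT cost of about 24 GPU-hours. The function names `merging_cost` and `retraining_cost` are hypothetical; the model assumes that merging trains once per arriving corpus, while full retraining at the $i$-th arrival re-trains on all $i$ corpora seen so far.

```python
def merging_cost(num_corpora, hours_per_corpus=24):
    """One SFT run per arriving corpus: Theta(K) cumulative GPU-hours."""
    return hours_per_corpus * num_corpora


def retraining_cost(num_corpora, hours_per_corpus=24):
    """Retrain on the union at every arrival: Theta(K^2) cumulative GPU-hours."""
    return sum(hours_per_corpus * i for i in range(1, num_corpora + 1))


# K = 2: 48 vs. 72 GPU-hours; K = 10: 240 vs. 1,320 GPU-hours.
saving_at_2 = 1 - merging_cost(2) / retraining_cost(2)
speedup_at_10 = retraining_cost(10) / merging_cost(10)
```

The closed form for retraining is $24 \cdot K(K+1)/2$, so the ratio between the two costs is $(K+1)/2$, which gives $5.5\times$ at $K=10$.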

Table 12: Sweep of all 14 merge configurations on NarrativeQA. Two Memory models (Qwen2.5-14B-Instruct) are independently SFT-trained on the disjoint NarrativeQA.1 and NarrativeQA.2 subsets; each subset’s best-performing checkpoint provides the task vector entering the merge. Executive model = Qwen2.5-32B-Instruct; results are mean ± std. dev. over 3 runs. Best merge in bold; full-retrain accuracy ($26.85\pm 0.39$) is shown for reference. Hyperparameter conventions: $t\in[0,1]$ is the SLERP interpolation factor along the unit sphere connecting the two task vectors ($t=0$ recovers Memory model on NarrativeQA.1, $t=1$ recovers Memory model on NarrativeQA.2, $t=0.5$ is the geodesic midpoint); $\rho\in(0,1]$ is the sparsification density, i.e., the fraction of largest-magnitude task-vector entries kept (TIES) or the keep probability for random-drop sparsification (DARE, DARE-TIES). Linear and Task arithmetic merge with uniform weights ($\lambda_{i}=1$) and have no hyperparameter.

| Method family | Hyperparameter | Accuracy (%) |
|---|---|---|
| Linear | none | 11.60 ± 1.02 |
| Task arithmetic | none | 12.74 ± 1.75 |
| SLERP | $t=0.3$ | 11.60 ± 2.24 |
| SLERP | $t=0.5$ | 7.85 ± 1.71 |
| SLERP | $t=0.7$ | 11.60 ± 2.13 |
| TIES | $\rho=0.3$ | **15.81 ± 0.39** |
| TIES | $\rho=0.5$ | 12.17 ± 1.94 |
| TIES | $\rho=0.7$ | 12.06 ± 2.58 |
| DARE-Linear | $\rho=0.3$ | 15.47 ± 0.79 |
| DARE-Linear | $\rho=0.5$ | 9.78 ± 1.20 |
| DARE-Linear | $\rho=0.7$ | 13.65 ± 2.08 |
| DARE-TIES | $\rho=0.3$ | 11.72 ± 0.52 |
| DARE-TIES | $\rho=0.5$ | 12.97 ± 1.23 |
| DARE-TIES | $\rho=0.7$ | 11.04 ± 1.20 |

Appendix IValidating evaluation dataset suitability

Table 13: Performance gap between no context and perfect retrieval across datasets and Executive models.

| Executive model | Setting | BrowseComp-Plus | NarrativeQA | MuSiQue |
|---|---|---|---|---|
| Qwen2.5-32B-Instruct | No Context | 0.00 ± 0.00 | 5.35 ± 0.20 | 17.03 ± 0.40 |
| Qwen2.5-32B-Instruct | Perfect Retrieval | 79.67 ± 1.45 | 51.42 ± 0.52 | 62.83 ± 0.90 |
| Gemini-3-Flash | No Context | 1.33 | 26.62 | 41.80 |
| Gemini-3-Flash | Perfect Retrieval | 88.33 | 60.41 | 73.00 |

To assess whether the evaluation datasets are suitable for the Executive model, and whether the Executive model has memorized answers from its training data, we evaluate performance both without any context (No Context) and with evidence documents provided (Perfect Retrieval). The latter serves as an empirical upper bound that assumes perfect retrieval of relevant documents.

As shown in Table 13, the large disparity in performance between No Context and Perfect Retrieval confirms that these datasets require access to evidence documents to achieve correct answers, validating their suitability for evaluating MeMo.

Unsurprisingly, MuSiQue yields the highest No Context scores, as its Wikipedia-grounded questions fall within models' parametric knowledge. NarrativeQA proves most challenging, as it achieves the lowest Perfect Retrieval scores across both Executive models, reflecting the demand for careful reasoning over full-length books and movie scripts. BrowseComp-Plus yields the largest disparity between No Context and Perfect Retrieval, with near-zero No Context performance but strong recovery when evidence documents are provided.

These findings confirm that the Executive model heavily relies on evidence documents across all three datasets to perform well. MuSiQue tests multi-hop factual reasoning where parametric knowledge provides partial signals, NarrativeQA tests narrative comprehension that remains challenging even with perfect context, and BrowseComp-Plus tests the ability to exploit retrieved documents for facts otherwise entirely inaccessible to the model.

Appendix JEvaluation details

J.1Implementation details

The current temperature settings are described in Table 14. Stage 1 has a budget of only 1 interaction, Stage 2 has a budget of 7 interactions, and Stage 3 has a budget of 8 interactions.

Table 14: Temperature configuration of each stage from Section 4.4.

| Stage | Model | Temperature value | Intent |
|---|---|---|---|
| Evaluation Stage 1: Grounding | Executive model | 0.4 | Moderate exploration to generate diverse but focused sub-questions |
| Evaluation Stage 1: Grounding | Memory model | 0.1 | Near-deterministic to ensure stable, consistent grounding answers |
| Evaluation Stage 2: Entity identification | Executive model | 0.4 | Moderate exploration to identify varied candidate entities without excess noise |
| Evaluation Stage 2: Entity identification | Memory model | 0.1 | Near-deterministic to produce reliable entity-targeted answers |
| Evaluation Stage 3: Answer seeking | Executive model | 1.0 | High exploration to maximally diversify sub-questions once the entity is confirmed |
| Evaluation Stage 3: Answer seeking | Memory model | 0.3 | Slightly relaxed determinism to allow nuanced answers while remaining consistent |
| Final synthesis | Executive model | 0.3 | Low temperature to produce a consistent final answer |
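For readers wiring up a similar pipeline, the stage budgets and temperatures above fit naturally into a small configuration table. The sketch below only restates the values from Table 14 and the budgets given in the text; the class, field, and key names are hypothetical, and the final synthesis step is assumed to have no interaction budget of its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    budget: Optional[int]                # max interactions; None if not applicable
    executive_temperature: float
    memory_temperature: Optional[float]  # None when the Memory model is not queried


# Values copied from Table 14 and the stage budgets stated in Appendix J.1.
STAGES = {
    "grounding": StageConfig(1, 0.4, 0.1),
    "entity_identification": StageConfig(7, 0.4, 0.1),
    "answer_seeking": StageConfig(8, 1.0, 0.3),
    "final_synthesis": StageConfig(None, 0.3, None),
}


def temperature(stage: str, role: str) -> float:
    """Look up the sampling temperature for `role` ('executive' or 'memory')."""
    cfg = STAGES[stage]
    value = cfg.executive_temperature if role == "executive" else cfg.memory_temperature
    if value is None:
        raise ValueError(f"{role} model is not used in stage {stage!r}")
    return value


print(temperature("answer_seeking", "executive"))
```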

Table 15: Helper functions for Stages 2 and 3 of the evaluation pipeline.

| Function | Stage | Intent |
|---|---|---|
| Track uncertain answer streaks | Stage 2 | Maintains a running tally of how many unanswerable questions each candidate entity has accumulated across Stage 2, allowing the Executive model to progressively prioritize candidates that the Memory model consistently cannot corroborate |
| Select the best candidate | Stage 2 | Fallback bridge from Stage 2 to Stage 3 when entity pinning ends without a confirmed entity. Selects the highest Executive model-ranked candidate, with ties broken by the order in which the Memory model produced the candidates |
| Entity pivot correction | Stage 3 | Allows the pipeline to self-correct mid-Stage 3 if the Stage 2 entity proves incorrect. When the Executive model nominates a different entity, the confirmed entity is overwritten and marked as unconfirmed so subsequent turns are aware it was not pinned through the full Stage 2 process |

Beyond what is described in Section 4.4, there are additional helper functions that help manage failure modes across Stage 2 and Stage 3. Within Stage 2, the uncertain answer streak tracker is called at the start of every entity-pinning interaction and its output is passed directly into the entity-pinning prompt, giving the Executive model a live view of which candidates the Memory model has repeatedly failed to corroborate. This allows the Executive model to continuously re-rank and prune the candidate pool as evidence accumulates. When Stage 2 concludes without a confirmed entity, either because the Executive model explicitly exhausts its options or the interaction budget is reached, the best candidate selector acts as the bridge into Stage 3 by returning the top-ranked candidate. In cases where multiple candidates share the highest rank, the first candidate in the order produced by the Executive model is selected. In both cases, the downstream Stage 3 prompt is informed of whether the entity was formally confirmed or merely a best guess. Finally, if Stage 3 reveals that the Stage 2 entity was incorrect due to persistent Memory model failures, the entity pivot mechanism allows the Executive model to nominate a replacement entity mid-stage. The confirmed entity is then overwritten and marked as unconfirmed, ensuring subsequent stages treat it with appropriate uncertainty rather than the confidence of a fully pinned entity.
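The three helpers described above amount to a small piece of state shared between Stage 2 and Stage 3. The following sketch shows one way that state could be organised; every name in it is hypothetical, the numeric `ranks` (higher is better) stand in for whatever ranking the Executive model emits, and ties are broken by the order in which candidates were recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityState:
    candidates: List[str] = field(default_factory=list)      # in production order
    uncertain_counts: Dict[str, int] = field(default_factory=dict)
    entity: Optional[str] = None
    confirmed: bool = False


def track_uncertain(state: EntityState, candidate: str, was_uncertain: bool) -> None:
    """Stage 2: tally unanswerable questions accumulated per candidate."""
    if candidate not in state.candidates:
        state.candidates.append(candidate)
    state.uncertain_counts.setdefault(candidate, 0)
    if was_uncertain:
        state.uncertain_counts[candidate] += 1


def select_best_candidate(state: EntityState, ranks: Dict[str, float]) -> Optional[str]:
    """Stage 2 -> 3 fallback: pick the top-ranked candidate as a best guess."""
    if not state.candidates:
        return None
    # max() returns the first maximal item, so ties fall back to production order.
    state.entity = max(state.candidates, key=lambda c: ranks.get(c, float("-inf")))
    state.confirmed = False  # a best guess, not a pinned entity
    return state.entity


def pivot_entity(state: EntityState, new_entity: str) -> None:
    """Stage 3: overwrite the working entity and mark it as unconfirmed."""
    state.entity = new_entity
    state.confirmed = False


state = EntityState()
for name, uncertain in [("A", True), ("B", False), ("C", False), ("A", True)]:
    track_uncertain(state, name, uncertain)
print(state.uncertain_counts, select_best_candidate(state, {"A": 1, "B": 3, "C": 3}))
```

Keeping `confirmed` as an explicit flag is what lets a downstream prompt distinguish a pinned entity from a fallback or a mid-stage pivot.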

J.2Ablations on evaluation setup

To justify our structured multi-turn evaluation design, we compare against two baselines: a single-turn setup and an unstructured multi-turn setup. In both cases, the same trained Memory model is used and the Executive model is held fixed. Results are reported in Table 16.

Table 16: MeMo accuracy results with Qwen2.5-32B-Instruct as Executive model and Qwen2.5-14B-Instruct as Memory model across evaluation setups. The best performing epoch was used in comparison across all 3 setups, with mean ± std. dev. reported across 3 runs. Bold results indicate best performing results in the column.

| Evaluation setup | BrowseComp-Plus accuracy | NarrativeQA accuracy | MuSiQue accuracy |
|---|---|---|---|
| Single-turn evaluation | 32.56 ± 1.58 | 24.80 ± 0.20 | 37.57 ± 1.15 |
| Unstructured multi-turn evaluation (15 turns) | 47.33 ± 0.88 | 26.73 ± 2.17 | 40.13 ± 1.12 |
| Unstructured multi-turn evaluation (50 turns) | 48.67 ± 1.00 | 27.19 ± 0.71 | 40.57 ± 0.31 |
| Structured multi-turn evaluation (7 entity identification turns + 8 answer-seeking turns) | **54.22 ± 0.84** | 26.39 ± 1.75 | **48.30 ± 1.25** |
| Structured multi-turn evaluation (7 entity identification turns + 15 answer-seeking turns) | 51.44 ± 2.41 | **27.76 ± 0.20** | 47.57 ± 0.95 |

In a single-turn interaction, the Executive model first determines whether the question requires external memory retrieval, and if so, decomposes it into a set of sub-questions (Stage 1, Section 4.4) and poses them all simultaneously to the Memory model. The Memory model responds to each sub-question independently, and responses indicating uncertainty are discarded before the remaining answers are passed to the Executive model for final synthesis. This design requires the Executive model to commit to its full sub-question set before observing any responses, preventing it from reformulating uninformative queries, following up on answers that introduce new candidate entities, or correcting retrievals that are incomplete, contradictory, or anchored to the wrong entity. This fundamental limitation is reflected in its consistently lowest performance across all three datasets (Table 16).

A natural extension of the single-turn setting is an unstructured multi-turn interaction, where the Executive model examines the responses from the Memory model and decides whether sufficient information has been gathered, or whether additional retrieval rounds are needed (Stage 3, Section 4.4). In this setting, the Executive model is presented with the full history of question-answer pairs and prompted to either synthesize a final answer or generate a new batch of sub-questions targeting remaining gaps, repeating for up to $T$ interactions. While iterative retrieval yields clear improvements over the single-turn baseline, performance plateaus quickly when increasing from 15 to 50 interactions (47.33 ± 0.88 to 48.67 ± 1.00 on BrowseComp-Plus, 26.73 ± 2.17 to 27.19 ± 0.71 on NarrativeQA, and 40.13 ± 1.12 to 40.57 ± 0.31 on MuSiQue), suggesting that iterative retrieval alone is insufficient.
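The unstructured multi-turn baseline is essentially a bounded loop around the two models. A minimal sketch follows, with the models replaced by plain callables; `propose`, `ask_memory`, and `synthesize` are hypothetical stand-ins for the Executive and Memory model calls, and an empty batch of sub-questions is assumed to signal that the Executive model is ready to answer.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (sub-question, Memory model answer) pairs


def unstructured_multi_turn(
    question: str,
    propose: Callable[[str, History], List[str]],   # Executive: next sub-questions
    ask_memory: Callable[[str], str],               # Memory model: answer one query
    synthesize: Callable[[str, History], str],      # Executive: final answer
    max_turns: int = 15,
) -> str:
    history: History = []
    for _ in range(max_turns):
        sub_questions = propose(question, history)
        if not sub_questions:          # Executive judges the evidence sufficient
            break
        for sq in sub_questions:
            history.append((sq, ask_memory(sq)))
    return synthesize(question, history)


# Toy stand-ins so the loop can be exercised without any model.
def toy_propose(question, history):
    return [f"sub-question {len(history) + 1}"] if len(history) < 2 else []


answer = unstructured_multi_turn(
    "Who wrote the book?",
    toy_propose,
    ask_memory=lambda sq: f"answer to {sq}",
    synthesize=lambda q, h: f"{len(h)} facts gathered",
)
print(answer)
```

Nothing in this loop tracks which entity the questions are about, which is the gap the structured setup is designed to close.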

The structured multi-turn setup (see Sections 4.4 and J.1) outperforms the unstructured multi-turn baseline, with 8 answer-seeking interactions achieving the strongest overall performance for BrowseComp-Plus and MuSiQue. This is consistent with the expectation that explicit entity identification is well-suited to the multi-hop reasoning demands of these datasets. On NarrativeQA, which tests discourse understanding over long documents, the unstructured 15- and 50-interaction baselines (26.73 ± 2.17 and 27.19 ± 0.71) initially outperform the structured setup with 8 answer-seeking interactions (26.39 ± 1.75). Inference logs indicate that the Executive model rarely utilizes the entity identification stage on NarrativeQA, likely because its questions are less reliant on resolving specific entities. Consequently, the fixed entity identification budget effectively reduces the number of available answer-seeking interactions compared to the unstructured baselines. Increasing the answer-seeking budget to 15 interactions recovers this gap, with NarrativeQA reaching 27.76 ± 0.20, surpassing both unstructured baselines. We hypothesize that this is because the additional answer-seeking interactions continue to surface useful signal without the risk of entity drift or state corruption that compounds in open-domain multi-hop settings.

Unlike NarrativeQA, structured entity identification and state tracking in BrowseComp-Plus and MuSiQue introduce sensitivity to error accumulation as the number of answer-seeking interactions increases beyond the optimal budget. Additional interactions increase the risk of erroneous Memory model responses corrupting the known-facts state, and provide more opportunities for the Executive model to commit to an incorrect intermediate entity via the entity pivot correction helper (Table 15). Furthermore, more interactions dilute the correct signal with potentially incorrect answers at the final synthesis stage. These failure modes are partly a function of the reasoning capability of the Executive model, as structured state maintenance demands strong in-context reasoning to accurately track entities and avoid premature entity commitment. Corroborating this, we observe in Table 2 that a stronger reasoning model as Executive model yields improved performance when paired with the same Memory model, suggesting that these failure modes can be mitigated by scaling the reasoning capability of the Executive model.

The stage budget used in our experiments was selected without systematic tuning, and alternative settings may yield similar performance with greater token efficiency. We therefore leave a systematic study of the optimal interaction budget and Executive model selection as future work.

Appendix KDiscussion on number of training epochs

From Figs. 3, 4 and 5, we observe that additional training epochs do not consistently improve accuracy: peak performance for most Memory models occurs at epoch 2, with marginal gains or mild regression thereafter. We attribute the early saturation and subsequent regression to overfitting on the SFT corpus, which exhibits substantial lexical overlap across steps by design, as later steps are derived from earlier ones (Algorithm 1). To quantify this lexical overlap, we compute the lossless compression ratio of the combined QA text across all steps for each dataset by extracting all question and answer strings, concatenating them into a single text corpus, and applying gzip compression at maximum level (compression level 9). The compression ratio is defined as the ratio of the original text size to the compressed size.

Figure 3: BrowseComp-Plus accuracy (%) vs. training epoch (Full SFT) for each MeMo model size and model family. Lines show the mean over 3 runs, and the shaded band shows ± std. dev. for Qwen2.5-32B-Instruct runs.

Figure 4: NarrativeQA accuracy (%) vs. training epoch (Full SFT) for each MeMo model size and model family. Lines show the mean over 3 runs, and the shaded band shows ± std. dev. for Qwen2.5-32B-Instruct runs.

Figure 5: MuSiQue accuracy (%) vs. training epoch (Full SFT) for each MeMo model size and model family. Lines show the mean over 3 runs, and the shaded band shows ± std. dev. for Qwen2.5-32B-Instruct runs.

BrowseComp-Plus (1,639,995 pairs) achieves a ratio of 5.80× (82.8% savings), MuSiQue (664,762 pairs) achieves 7.03× (85.8% savings), and NarrativeQA (1,276,676 pairs) achieves 5.45× (81.7% savings), indicating substantial lexical overlap within each dataset. We note that compression ratio captures lexical overlap only, and semantic diversity across QA pairs may remain higher, as each step targets distinct reasoning operations ranging from direct fact extraction to cross-document synthesis (Algorithm 1), which is consistent with the impact of removing Step 5 (see Appendix E).
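The compression-ratio measurement is straightforward to reproduce with Python's standard `gzip` module. The sketch below follows the procedure as described (concatenate all question and answer strings, compress at level 9, divide original size by compressed size); the newline separator and UTF-8 encoding are assumptions, since the text does not specify how the strings are joined.

```python
import gzip
from typing import Iterable, Tuple


def compression_ratio(qa_pairs: Iterable[Tuple[str, str]]) -> float:
    """Original size divided by gzip-compressed size of the concatenated QA text."""
    text = "\n".join(f"{q}\n{a}" for q, a in qa_pairs).encode("utf-8")
    compressed = gzip.compress(text, compresslevel=9)
    return len(text) / len(compressed)


def savings_percent(ratio: float) -> float:
    """Space saved by compression, as a percentage of the original size."""
    return 100.0 * (1.0 - 1.0 / ratio)


# A highly repetitive toy corpus compresses far better than natural text would.
toy_pairs = [("What is the capital of France?", "Paris")] * 200
print(f"toy ratio: {compression_ratio(toy_pairs):.1f}x")

# The reported savings follow directly from the reported ratios.
for name, ratio in [("BrowseComp-Plus", 5.80), ("MuSiQue", 7.03), ("NarrativeQA", 5.45)]:
    print(f"{name}: {ratio:.2f}x -> {savings_percent(ratio):.1f}% savings")
```

The savings figure is just $1 - 1/\text{ratio}$, so the two numbers reported per dataset carry the same information.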

Appendix LPerformance degradation of retrieval-based methods with increasing noise

Table 17: Accuracy (%) on BrowseComp-Plus and MuSiQue with Qwen2.5-32B-Instruct as Executive model. MeMo results are based on Qwen2.5-14B-Instruct and reported at the best training epoch. $N$ is the number of evidence documents present in the target corpus. $\Delta$ denotes accuracy difference (pp) compared to $0N$.

| Method | Dataset | $0N$ Acc. (%) | $1N$ Acc. (%) | $1N$ $\Delta$ | $2N$ Acc. (%) | $2N$ $\Delta$ |
|---|---|---|---|---|---|---|
| NV-Embed-V2 | BrowseComp-Plus | 56.89 ± 0.51 | 50.67 ± 0.33 | ↓ 6.22 | 49.44 ± 0.19 | ↓ 7.45 |
| NV-Embed-V2 | MuSiQue | 42.30 ± 0.53 | 37.47 ± 0.15 | ↓ 4.83 | 33.03 ± 1.10 | ↓ 9.27 |
| HippoRAG2 | BrowseComp-Plus | 62.33 ± 1.15 | 56.11 ± 0.51 | ↓ 6.22 | 50.78 ± 1.35 | ↓ 11.55 |
| HippoRAG2 | MuSiQue | 47.33 ± 0.74 | 42.17 ± 0.12 | ↓ 5.16 | 41.70 ± 0.69 | ↓ 5.63 |

Table 17 reports the performance of two retrieval-based baselines (NV-Embed-V2 and HippoRAG2) under increasing retrieval noise. Both methods degrade monotonically as noise increases, confirming their susceptibility to irrelevant documents. The degradation is most severe for HippoRAG2 on BrowseComp-Plus, which drops 11.55 pp from 0N to 2N, and for NV-Embed-V2 on MuSiQue, which drops 9.27 pp over the same range. Notably, even a single negative document per evidence document (1N) causes substantial drops of up to 6.22 pp for both methods on BrowseComp-Plus, suggesting that retrieval-based methods are extremely sensitive to noisy retrieval settings.
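The $\Delta$ column and the monotonicity claim can be checked directly from the mean accuracies in Table 17. The short script below does so; it uses only the reported means (standard deviations are ignored), and the variable and function names are illustrative.

```python
# Mean accuracy (%) at noise levels 0N, 1N, 2N, copied from Table 17.
accuracy = {
    ("NV-Embed-V2", "BrowseComp-Plus"): [56.89, 50.67, 49.44],
    ("NV-Embed-V2", "MuSiQue"): [42.30, 37.47, 33.03],
    ("HippoRAG2", "BrowseComp-Plus"): [62.33, 56.11, 50.78],
    ("HippoRAG2", "MuSiQue"): [47.33, 42.17, 41.70],
}


def drops(series):
    """Accuracy lost (pp) at each noise level relative to the 0N setting."""
    return [round(series[0] - value, 2) for value in series[1:]]


def is_strictly_decreasing(series):
    return all(a > b for a, b in zip(series, series[1:]))


for (method, dataset), series in accuracy.items():
    print(method, dataset, drops(series), is_strictly_decreasing(series))
```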

Appendix MAblation onMemorymodel size

Both the Qwen2.5-1.5B-Instruct and Qwen2.5-14B-Instruct Memory models are trained on the same QA dataset generated by the Generator model (Qwen2.5-32B-Instruct) under the training settings described in Appendix F. Each Memory model is evaluated using Qwen2.5-32B-Instruct and Gemini-3-Flash as the Executive model.

Appendix NAblation onMemorymodel family

Each Memory model is trained on the same QA dataset generated by the Generator model (Qwen2.5-32B-Instruct) and evaluated using Qwen2.5-32B-Instruct and Gemini-3-Flash as Executive model. Notably, while Qwen2.5-1.5B-Instruct and Gemma3-1B-IT are based on standard transformer architectures, LFM2.5-1.2B-Instruct adopts a hybrid architecture combining state-space convolution with transformer attention blocks, thereby providing a broader test of the Memory model across diverse model designs. These models are trained with the same training settings as in Appendix F, except that Gemma3-1B-IT uses eager attention during training instead of Flash Attention 2.

Appendix OComparison between full SFT and LoRA

We train all models using LoRA [86] applied to the attention and feed-forward projection layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. The general LoRA configuration is summarised in Table 18, with model-specific rank and scaling settings reported in Table 19. All remaining training hyperparameters follow Table 10, and per-dataset batch sizes are given in Table 11.

Table 18: LoRA-specific training configuration. All other parameters are the same as those in Table 10.

| Parameter | Value |
|---|---|
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA dropout | 0.05 |
| Bias | None |
| Learning rate | $2\times 10^{-4}$ |

Table 19: Model-specific LoRA configuration.

| Model | Size | LoRA rank | LoRA alpha | Trainable params |
|---|---|---|---|---|
| LFM2.5-1.2B-Instruct | 1.2B | 8 | 16 | 6.1M (0.41%) |
| Gemma3-1B-IT | 1B | 8 | 16 | 6.6M (0.65%) |
| Qwen2.5-1.5B-Instruct | 1.5B | 8 | 16 | 9.2M (0.60%) |
| Qwen2.5-14B-Instruct | 14B | 16 | 32 | 68.8M (0.47%) |

Table 20: Ablation on LoRA vs. Full SFT training across all Memory models, evaluated with Qwen2.5-32B-Instruct as Executive model. All results are mean ± std. dev. over 3 runs. Bold results indicate best performing results in the column.

| Memory model | BrowseComp-Plus (LoRA) | BrowseComp-Plus (Full SFT) | NarrativeQA (LoRA) | NarrativeQA (Full SFT) | MuSiQue (LoRA) | MuSiQue (Full SFT) |
|---|---|---|---|---|---|---|
| Gemma3-1B-IT | 25.22 ± 1.39 | 41.67 ± 2.03 | 21.62 ± 0.86 | 22.30 ± 2.47 | 26.17 ± 1.10 | 41.17 ± 1.20 |
| LFM2.5-1.2B-Instruct | 0.78 ± 0.19 | 37.33 ± 1.86 | 5.69 ± 0.71 | 21.96 ± 1.97 | 7.50 ± 0.26 | 45.23 ± 2.49 |
| Qwen2.5-1.5B-Instruct | 29.78 ± 0.51 | 44.11 ± 2.22 | 21.84 ± 0.34 | 24.00 ± 0.20 | 31.53 ± 0.55 | 42.90 ± 1.39 |
| Qwen2.5-14B-Instruct | **48.78 ± 1.02** | **54.22 ± 0.84** | **23.78 ± 0.52** | **26.85 ± 0.39** | **43.94 ± 0.97** | **50.07 ± 0.81** |

The notably poor LoRA performance of LFM2.5-1.2B-Instruct can be attributed to its hybrid convolution–attention architecture, which differs from the standard transformer models in our evaluation. Following the LFM2 architecture [76], LFM2.5-1.2B-Instruct consists of 16 layers: 6 grouped-query attention (GQA) blocks (at indices $\{2,5,8,10,12,14\}$) interleaved with 10 short-range LIV convolution (ShortConv) blocks [76]. Crucially, the LFM2 attention output projection is named out_proj (rather than o_proj) and its SwiGLU MLP uses w1/w3/w2 (rather than gate_proj/up_proj/down_proj), while the ShortConv blocks expose their own in_proj and out_proj layers. A LoRA configuration targeting the standard Llama-family module names therefore adapts only a strict subset of the projections that exist in LFM2.5, leaving the remainder frozen. The result is 6.1M trainable parameters (0.41% of total), disproportionately low given the model size and below our target of $\sim 0.5\%$. The rank $r=8$ was kept fixed across all sub-2B models for a controlled comparison; in retrospect, this penalises LFM2.5-1.2B-Instruct due to its architectural mismatch with the standard Llama-style target set.

Furthermore, the 10 ShortConv blocks, which handle the bulk of the model's local feature extraction, and the SwiGLU MLPs attached to every block remain entirely unadapted under standard LoRA targeting, severely limiting the adapter's ability to shift the model's behaviour. As shown in Table 20, the large performance gap between LoRA and Full SFT confirms that the model is capable of learning the task when all parameters are updated.

Future work could explore LoRA configurations better suited to this architecture: targeting the LFM2-specific module names (out_proj, w1, w3, w2) alongside the ShortConv projections (in_proj, out_proj), as well as tuning the rank and learning rate per architecture rather than holding them fixed across families for the controlled comparison reported here.
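The naming mismatch can be illustrated without loading any model. The sketch below applies a simple rule (a module is adapted when the last component of its dotted path appears in the target set), which approximates how adapter libraries usually resolve target names, to a hand-written list of module paths. The paths are illustrative assumptions rather than names read from the actual checkpoint: out_proj, w1/w3/w2, and in_proj come from the description above, while the q_proj/k_proj/v_proj names and the `self_attn`, `conv`, and `feed_forward` prefixes are guesses.

```python
STANDARD_TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"}
LFM2_TARGETS = {"q_proj", "k_proj", "v_proj", "out_proj", "in_proj",
                "w1", "w2", "w3"}

# Hypothetical module paths for one attention block and one ShortConv block.
MODULES = [
    "layers.2.self_attn.q_proj", "layers.2.self_attn.k_proj",
    "layers.2.self_attn.v_proj", "layers.2.self_attn.out_proj",
    "layers.2.feed_forward.w1", "layers.2.feed_forward.w3",
    "layers.2.feed_forward.w2",
    "layers.0.conv.in_proj", "layers.0.conv.out_proj",
    "layers.0.feed_forward.w1", "layers.0.feed_forward.w3",
    "layers.0.feed_forward.w2",
]


def matched_modules(module_names, targets):
    """Return the modules whose final path component is one of `targets`."""
    return [name for name in module_names if name.rsplit(".", 1)[-1] in targets]


standard = matched_modules(MODULES, STANDARD_TARGETS)
adjusted = matched_modules(MODULES, LFM2_TARGETS)
print(f"standard targets adapt {len(standard)}/{len(MODULES)} modules: {standard}")
print(f"LFM2-aware targets adapt {len(adjusted)}/{len(MODULES)} modules")
```

Under these assumed names, the standard target list reaches only the query, key, and value projections of the attention block and nothing in the ShortConv block, which matches the qualitative picture of an under-adapted model.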
