Decoupled Mixture-of-Experts for Parametric Knowledge Injection
Summary
Decoupled Mixture-of-Experts (DMoE) proposes a modular architecture for parametric knowledge injection, decoupling experts and router from the base model to enable efficient auto-regressive inference and mitigate catastrophic forgetting.
# Decoupled Mixture-of-Experts for Parametric Knowledge Injection
Source: [https://arxiv.org/html/2606.14243](https://arxiv.org/html/2606.14243)
Baoqing Yue\*, Weihang Su\*, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

Department of Computer Science and Technology, Tsinghua University

\*Equal contribution\.
###### Abstract
Knowledge injection aims to equip large language models \(LLMs\) with external, domain\-specific, or time\-sensitive knowledge\. Existing approaches typically face a trade\-off between flexibility and integration: retrieval\-augmented generation keeps knowledge outside the model but only provides prompt\-level augmentation, whereas post\-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates\. In this paper, we propose Decoupled Mixture\-of\-Experts \(DMoE\), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model\. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty\-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation\. To support efficient auto\-regressive inference, DMoE attaches experts only to the final\-layer feed\-forward network, preserving KV\-cache reuse while enabling parameter\-level knowledge augmentation\. Experiments on knowledge\-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter\-based baselines\.
Figure 1: Comparison of knowledge injection paradigms\. RAG injects knowledge as external context, while post\-training modifies shared model parameters and may introduce conflict or forgetting\. DMoE decouples knowledge modules from the base model, enabling modular and efficient parametric integration\.

## 1 Introduction
Large Language Models \(LLMs\) have demonstrated strong generalization capabilities across a wide range of tasks Brown et al\. \([2020](https://arxiv.org/html/2606.14243#bib.bib1)\); Chowdhery et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib2)\)\. However, their parametric knowledge is inevitably static after pre\-training\. As a result, LLMs often fail on domain\-specific or time\-sensitive queries, producing hallucinated or outdated responses at inference time Song et al\. \([2025](https://arxiv.org/html/2606.14243#bib.bib6)\); Xu et al\. \([2024](https://arxiv.org/html/2606.14243#bib.bib7)\)\. This limitation has motivated growing interest in knowledge injection, which aims to equip LLMs with external knowledge during inference or through post\-training methods Ovadia et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib9)\); Lauscher et al\. \([2020](https://arxiv.org/html/2606.14243#bib.bib10)\); Ai et al\. \([2025](https://arxiv.org/html/2606.14243#bib.bib39)\)\.
As illustrated in Figure [1](https://arxiv.org/html/2606.14243#S0.F1), existing knowledge injection methods can be broadly grouped into retrieval\-based and post\-training\-based paradigms\. Retrieval\-Augmented Generation \(RAG\) keeps knowledge outside the model and dynamically augments the input with retrieved documents Borgeaud et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib11)\); Lewis et al\. \([2020](https://arxiv.org/html/2606.14243#bib.bib13)\)\. This design makes knowledge easy to update, since the retrieval corpus can be modified without changing model parameters\. However, the injected knowledge remains at the prompt level: it is only exposed to the model as additional context, rather than being integrated into the model’s parameter space\. Consequently, RAG provides flexible but relatively shallow knowledge augmentation, and its inference efficiency can be limited by repeated retrieval and long\-context processing\.
In contrast, post\-training\-based methods, including Supervised Fine\-Tuning \(SFT\) Wang et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib18)\); Mishra et al\. \([2021](https://arxiv.org/html/2606.14243#bib.bib19)\) and parameter\-efficient variants such as LoRA Hu et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib55)\), encode new knowledge directly into model parameters\. Although this enables deeper parameter\-level integration, the injected knowledge is still written into a shared parameter space that already stores diverse pretrained knowledge\. When knowledge is continuously updated or expanded, such shared updates may interfere with existing capabilities, introduce knowledge conflict, or require repeated re\-training as the external corpus changes\. Thus, while post\-training\-based approaches integrate knowledge more deeply than RAG, they often sacrifice modularity, update efficiency, and knowledge isolation\.
This trade\-off reveals a deeper architectural bottleneck\. Most existing LLMs organize knowledge either as external prompt context or as entangled dense parameters\. Neither form provides an explicit mechanism for isolating heterogeneous knowledge, routing to relevant knowledge modules, or incrementally expanding the model without perturbing unrelated knowledge\. A desirable knowledge injection architecture should therefore satisfy three requirements: it should integrate knowledge at the parameter level, keep injected knowledge modular and independently updatable, and preserve efficient auto\-regressive inference\.
To this end, we propose Decoupled Mixture\-of\-Experts \(DMoE\), a modular architecture for parametric knowledge injection\. Inspired by the Mixture\-of\-Experts \(MoE\) principle of conditional computation and expert specialization Jacobs et al\. \([1991](https://arxiv.org/html/2606.14243#bib.bib60)\); Shazeer et al\. \([2017](https://arxiv.org/html/2606.14243#bib.bib61)\); Fedus et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib62)\), DMoE differs from conventional MoE architectures in that both the experts and the router are decoupled from the base model\. Given an external knowledge corpus, DMoE partitions the corpus into knowledge units and constructs lightweight expert modules while keeping the base model unchanged\. These experts are stored outside the dense backbone and can be added, removed, or updated independently\. During inference, a lightweight uncertainty\-aware router estimates whether the current query requires external expert support and activates relevant experts only when necessary\.
A key design goal of DMoE is to support efficient autoregressive generation\. Naively attaching dynamically selected experts to multiple transformer layers can make cache reuse difficult when the active expert set changes after the prefix has been processed, since the cached key\-value states would no longer correspond to the hidden states produced under the newly activated experts\. DMoE avoids this issue by attaching experts after the attention computation of the final transformer layer, specifically to the final\-layer feed\-forward network\. Because expert activation does not modify the representations from which earlier\-layer key\-value caches are computed, DMoE can reuse cached attention states during autoregressive generation\. As a result, DMoE enables parameter\-level knowledge augmentation while substantially reducing inference overhead compared with dynamic retrieval or multi\-layer expert injection strategies\.
We evaluate DMoE on a suite of knowledge\-intensive benchmarks, focusing on both answer quality and inference efficiency\. Experimental results show that DMoE consistently improves over dense\-model baselines and remains competitive with retrieval\- and adapter\-based knowledge injection methods\. At the same time, DMoE substantially reduces inference overhead by preserving KV\-cache reuse\. Further ablation studies validate the main architectural choices: decoupled experts are better suited for knowledge injection than conventional coupled MoE variants, uncertainty\-aware routing remains robust across triggering thresholds, and final\-layer FFN attachment yields the best effectiveness–efficiency trade\-off\.
In summary, our contributions are as follows:
- We propose DMoE, which decouples knowledge experts and the router from the base model, enabling modular and independently updatable knowledge injection\.
- DMoE uses uncertainty\-aware routing to selectively activate relevant experts and attaches them to the final\-layer feed\-forward network, preserving KV\-cache reuse during generation\.
- Experimental results show that DMoE improves answer quality over dense baselines while achieving competitive performance with substantially lower inference overhead\.
## 2 Related Work
### 2\.1 Retrieval\-Augmented Generation
Retrieval\-Augmented Generation \(RAG\) injects external knowledge into LLMs by retrieving relevant information from external repositories and conditioning generation on the retrieved context Lewis et al\. \([2020](https://arxiv.org/html/2606.14243#bib.bib13)\); Dong et al\. \([2025](https://arxiv.org/html/2606.14243#bib.bib28)\); Tu et al\. \([2025](https://arxiv.org/html/2606.14243#bib.bib49)\); Su et al\. \([2025b](https://arxiv.org/html/2606.14243#bib.bib54), [f](https://arxiv.org/html/2606.14243#bib.bib38)\)\. By keeping knowledge outside model parameters, RAG provides a flexible mechanism for improving factual grounding, mitigating hallucinations Wang et al\. \([2026](https://arxiv.org/html/2606.14243#bib.bib29)\); Su et al\. \([2025d](https://arxiv.org/html/2606.14243#bib.bib30), [2024e](https://arxiv.org/html/2606.14243#bib.bib31)\); Wang et al\. \([2025c](https://arxiv.org/html/2606.14243#bib.bib50)\), supporting knowledge updates Wang et al\. \([2025b](https://arxiv.org/html/2606.14243#bib.bib51), [a](https://arxiv.org/html/2606.14243#bib.bib32)\), and adapting LLMs to specialized domains without full model retraining Su et al\. \([2024b](https://arxiv.org/html/2606.14243#bib.bib34), [2025g](https://arxiv.org/html/2606.14243#bib.bib33), [2025a](https://arxiv.org/html/2606.14243#bib.bib35), [2026a](https://arxiv.org/html/2606.14243#bib.bib43)\)\. Most RAG systems follow a retrieval\-then\-read pipeline, where a search module retrieves documents from a large\-scale corpus Robertson et al\. \([2009](https://arxiv.org/html/2606.14243#bib.bib45)\); Su et al\. \([2024a](https://arxiv.org/html/2606.14243#bib.bib46)\); Fang et al\. \([2024](https://arxiv.org/html/2606.14243#bib.bib47)\) and the generator uses them as additional input context\.
Recent extensions further explore dynamic RAG Jiang et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib52)\); Su et al\. \([2024d](https://arxiv.org/html/2606.14243#bib.bib53), [c](https://arxiv.org/html/2606.14243#bib.bib44)\), graph\-based RAG Edge et al\. \([2024](https://arxiv.org/html/2606.14243#bib.bib36)\), parametric RAG Su et al\. \([2025e](https://arxiv.org/html/2606.14243#bib.bib72), [c](https://arxiv.org/html/2606.14243#bib.bib40), [2026a](https://arxiv.org/html/2606.14243#bib.bib43), [2026c](https://arxiv.org/html/2606.14243#bib.bib37)\), and agentic RAG Jin et al\. \([2025](https://arxiv.org/html/2606.14243#bib.bib42)\); Su et al\. \([2026b](https://arxiv.org/html/2606.14243#bib.bib48)\)\. Despite these advances, RAG primarily injects knowledge at the prompt level: the retrieved evidence is exposed to the model as external context rather than integrated into its parameter space\. As a result, RAG remains highly updatable but provides relatively shallow knowledge incorporation, and its inference efficiency can be limited by repeated retrieval and long\-context processing\.
### 2\.2 Post\-training Based Knowledge Injection
Post\-training methods inject knowledge by further optimizing model parameters using data from external sources\. Supervised Fine\-Tuning \(SFT\) Mishra et al\. \([2021](https://arxiv.org/html/2606.14243#bib.bib19)\); Ouyang et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib20)\); Taori et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib21)\) commonly trains the model on synthetic or human\-annotated instruction–response pairs, enabling deeper parameter\-level incorporation of new knowledge than context\-only augmentation\. However, directly updating the base model’s shared parameters may interfere with previously acquired knowledge, leading to catastrophic forgetting and knowledge conflict Goodfellow et al\. \([2013](https://arxiv.org/html/2606.14243#bib.bib56)\); Kemker et al\. \([2018](https://arxiv.org/html/2606.14243#bib.bib57)\)\. To reduce training cost and limit parameter interference, parameter\-efficient fine\-tuning \(PEFT\) methods Houlsby et al\. \([2019](https://arxiv.org/html/2606.14243#bib.bib73)\); Han et al\. \([2024](https://arxiv.org/html/2606.14243#bib.bib74)\), such as LoRA Hu et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib55)\), prompt tuning Lester et al\. \([2021](https://arxiv.org/html/2606.14243#bib.bib75)\), and prefix tuning Li and Liang \([2021](https://arxiv.org/html/2606.14243#bib.bib76)\), freeze most base parameters and train only a small set of additional parameters\. Although PEFT substantially improves update efficiency, standard PEFT methods are typically trained as task\- or domain\-level adapters and do not by themselves provide fine\-grained knowledge isolation, expert\-level routing, or independent updates for individual knowledge units\. DMoE builds on the efficiency of lightweight parameter modules while organizing them as decoupled, retrievable experts for modular knowledge injection\.
### 2\.3 Previous Explorations in Decoupling MoE
Several works have attempted to decouple components of MoE to improve efficiency or training stability\. Read\-ME Cai et al\. \([2024](https://arxiv.org/html/2606.14243#bib.bib83)\) introduces a pre\-gating router partially decoupled from the MoE backbone to enable expert\-aware batching and caching\. EvoMoE Nie et al\. \([2021](https://arxiv.org/html/2606.14243#bib.bib84)\) decouples expert training from sparse gating through a progressive dense\-to\-sparse evolution scheme\. StableMoE Dai et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib85)\) distills and then freezes a router decoupled from the backbone to stabilize routing\. DeMo Wang et al\. \([2025d](https://arxiv.org/html/2606.14243#bib.bib86)\) designs a feature\-level decoupled MoE for multi\-modal object re\-identification, focusing on modality\-specific feature weighting\. Unlike previous works that decouple either the router or the training phases, our approach *fully* decouples both experts and router from the base model, isolating each unit of knowledge and enabling a scalable routing mechanism\.
## 3 Methodology
Figure 2: Comparison of architectures between the dense model, traditional Mixture\-of\-Experts \(MoE\), and the proposed Decoupled Mixture\-of\-Experts \(DMoE\)\. In the MoE architecture, the feed\-forward layers of a dense model are replaced with a coupled router\-expert network\. In contrast, DMoE decouples both the router and experts from the base model\. During inference, DMoE retrieves and updates relevant experts when knowledge injection is required, enabling adaptive and non\-destructive knowledge integration\.

### 3\.1 Preliminaries and Task Formulation
We first characterize the effect of knowledge injection through its impact on the model’s predictive uncertainty, following prior uncertainty\-aware retrieval and routing methods Jiang et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib52)\); Su et al\. \([2024d](https://arxiv.org/html/2606.14243#bib.bib53)\)\. At each decoding step $t$, the model produces a softmax distribution $p_t$ over the vocabulary\. We define the token uncertainty \(TU\) at step $t$ as the entropy of this distribution:

$$\mathrm{TU}_t = -\sum_{v} p_t(v) \log p_t(v). \tag{1}$$

During auto\-regressive generation, the model does not have access to the correct next token and thus cannot directly assess whether its prediction is accurate or whether additional knowledge is required\. TU therefore serves as a practical inference\-time signal that reflects how uncertain the model is about its next\-token distribution: a larger TU indicates that the model lacks sufficient internal knowledge for the current input and is more likely to benefit from external knowledge injection\.
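As a minimal sketch of Eq\. \(1\), the snippet below computes TU from a list of raw logits\. The helper names `softmax` and `token_uncertainty` are illustrative, and the natural logarithm is an assumption, since the text does not state the log base\.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_uncertainty(logits):
    """Entropy of the next-token distribution, as in Eq. (1).

    Assumption: natural log (nats). Zero-probability entries are skipped,
    following the convention 0 * log 0 = 0.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked distribution gives a TU close to 0; a uniform distribution over
# a 4-token vocabulary gives the maximum value, log(4).
confident = token_uncertainty([10.0, 0.0, 0.0, 0.0])
uncertain = token_uncertainty([0.0, 0.0, 0.0, 0.0])
```

Under this reading, TU is bounded above by the log of the vocabulary size, which gives a rough sense of scale for any triggering threshold applied to it\.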
Formally, let $\mathcal{K}=\{\mathcal{K}_1,\mathcal{K}_2,\dots,\mathcal{K}_N\}$ denote a set of external knowledge units, each representing a semantically coherent piece of information \(e\.g\., a fact, a document, or a reasoning pattern\)\. The objective of knowledge injection is to construct an augmented model $\theta^{\prime}$ by incorporating a subset of knowledge $\mathcal{K}^{*}\subseteq\mathcal{K}$, so as to reduce TU on knowledge\-relevant samples $\mathcal{D}_{\mathrm{know}}$:

$$\theta^{\prime} = \mathcal{A}(\theta, \mathcal{K}^{*}), \tag{2}$$

$$\mathcal{K}^{*} = \arg\min_{\mathcal{S}\subseteq\mathcal{K}} \mathbb{E}_{x\sim\mathcal{D}_{\mathrm{know}}}\left[\mathrm{TU}(x;\mathcal{A}(\theta,\mathcal{S}))\right], \tag{3}$$

where $\mathcal{A}(\cdot)$ denotes a knowledge augmentation mechanism\. Existing approaches differ in how they realize this objective\. In RAG settings, the model parameters remain unchanged, and external knowledge $\mathcal{K}^{*}$ is injected through the input context, resulting in a shallow form of knowledge injection\. In contrast, parameter\-level knowledge injection methods such as post\-training modify $\theta$ directly to encode $\mathcal{K}^{*}$, enabling the model to internalize new knowledge\.
### 3\.2 Model Architecture
As illustrated in Figure [2](https://arxiv.org/html/2606.14243#S3.F2), the DMoE architecture consists of three main components: a frozen base model, a router, and a large set of experts\. Unlike conventional MoE architectures that replace the feed\-forward network \(FFN\) layers of a dense model with a jointly trained gating network and multiple experts, DMoE decouples both router and experts from the base model\.
#### 3\.2\.1 Experts
##### Expert architecture\.
In DMoE, each expert represents a semantic unit of knowledge and is realized in practice as a lightweight adapter \(e\.g\., LoRA Hu et al\. \([2022](https://arxiv.org/html/2606.14243#bib.bib55)\) or other PEFT variants\) attached to the FFN layers, which are known to store factual and semantic knowledge Geva et al\. \([2021b](https://arxiv.org/html/2606.14243#bib.bib93)\); Nanda et al\. \([2023](https://arxiv.org/html/2606.14243#bib.bib94)\); Yu and Ananiadou \([2024](https://arxiv.org/html/2606.14243#bib.bib95)\)\. To maintain practical KV\-cache reuse during decoding, we attach experts only to the final\-layer FFN \(Figure [3](https://arxiv.org/html/2606.14243#S3.F3)\); a detailed analysis of this design choice is provided in Section [5\.3\.1](https://arxiv.org/html/2606.14243#S5.SS3.SSS1)\.
##### Expert construction\.
For each knowledge unit $\mathcal{K}_i$, we construct an expert by training a lightweight parameter update $\Delta\theta_i$ on data derived from that unit\. Following PRAG Su et al\. \([2025e](https://arxiv.org/html/2606.14243#bib.bib72)\), we first augment each document into instruction\-style training instances and then optimize a PEFT module while keeping the base model frozen\. This produces a collection of knowledge\-specific experts $\{\Delta\theta_i\}_{i=1}^{N}$, where each expert is associated with both a parameter update and a text surrogate used for routing\. In this work, we use this construction pipeline as a transparent instantiation of DMoE, while our main focus is on how independently trained experts are organized, routed, and integrated during inference\.
Figure 3: Illustration of expert placement in DMoE\. Experts are attached only to the FFN of the final Transformer layer so that the hidden states feeding earlier attention blocks remain unchanged\. Placing experts at intermediate layers would modify representations consumed by subsequent attention modules, preventing efficient reuse of KV\-cache during decoding\.
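The cache argument behind this placement can be illustrated with a deliberately simplified toy\. The sketch below is not the paper's model: each layer is reduced to a scalar residual FFN, attention mixing across tokens is omitted, and `forward` and `ffn_deltas` are hypothetical names\. It only tracks the hidden state that each layer's attention would project into keys and values\.

```python
def forward(x, ffn_deltas, num_layers=4):
    """Return (attention inputs per layer, final output) for a toy stack.

    Toy layer: h_{l+1} = h_l + w_l * h_l, with w_l = 0.5 + delta_l.
    ffn_deltas maps a layer index to an extra weight, standing in for an
    expert update on that layer's FFN (an assumption for illustration).
    """
    h = x
    attn_inputs = []
    for layer in range(num_layers):
        attn_inputs.append(h)  # keys/values of this layer are projections of h
        weight = 0.5 + ffn_deltas.get(layer, 0.0)
        h = h + weight * h     # FFN output becomes the next layer's input
    return attn_inputs, h

base_kv, base_out = forward(1.0, {})
last_kv, last_out = forward(1.0, {3: 0.1})  # expert on the final FFN only
mid_kv, mid_out = forward(1.0, {1: 0.1})    # expert on an intermediate FFN

# Final-layer expert: every attention input is unchanged, only the output moves.
cache_valid_last = last_kv == base_kv
# Intermediate expert: attention inputs of all later layers change.
cache_valid_mid = mid_kv == base_kv
```

In this toy, the final\-layer expert changes the output while leaving every quantity a KV\-cache would store untouched, whereas the intermediate expert alters the inputs of layers 2 and 3 and would therefore invalidate their cached entries\.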
#### 3\.2\.2 Router
##### Triggering\.
Routing is activated only when the current model exhibits insufficient knowledge of the input\. Since our goal is to reduce TU, there is no need to involve additional experts when the model is already confident\. At each decoding step $t$, we trigger routing only when TU is above a given threshold:

$$\text{Trigger}_t = \mathbb{I}\big[\mathrm{TU}_t > \tau\big], \tag{4}$$

where $\tau$ is a given threshold\. When $\text{Trigger}_t=1$, the router deactivates outdated experts and activates relevant experts; otherwise, decoding proceeds with the current model\.
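The triggering rule in Eq\. \(4\) can be sketched as a small loop\. This is an illustration under simplifying assumptions: the TU values are supplied as a precomputed list \(in a real decoder they depend on the currently active experts\), and `route` is a hypothetical callback that returns expert identifiers for a step\.

```python
def decode_with_triggering(tu_per_step, route, tau=2.0):
    """Return the active expert set in effect after each decoding step.

    tu_per_step: TU values per step (hypothetical, precomputed for the demo).
    route: callable mapping a step index to a list of expert ids (assumed).
    tau: triggering threshold; the comparison is strict, as in Eq. (4).
    """
    active = []
    history = []
    for t, tu in enumerate(tu_per_step):
        if tu > tau:           # Trigger_t = 1: swap in freshly routed experts
            active = route(t)
        history.append(list(active))  # otherwise keep the current set
    return history

history = decode_with_triggering(
    [0.5, 2.5, 1.0, 2.0, 3.1],
    route=lambda t: [f"E{t}"],
)
```

With these inputs, routing fires at steps 1 and 4 only; step 3 has TU equal to the threshold and does not trigger because the inequality is strict\.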
##### Routing\.
To make knowledge injection both effective and scalable, the router should \(i\) select the experts that are most likely to reduce token uncertainty \(TU\) under the current context, while \(ii\) remaining fully *decoupled* from the base model so that experts can be added, removed, or updated without retraining any neural router\. We therefore adopt a *BM25\-based lexical router* implemented via an external inverted index\.
Each expert $E_i$ is trained on a knowledge unit $\mathcal{K}_i$ and is associated with a text surrogate $\mathcal{D}_i$ \(the document plus its augmentations used in expert training; see Section 4\.1\.3\)\. At decoding step $t$, we form a routing query $q_t$ from the task input and the current generation context\. When a high\-TU position triggers routing, $q_t$ is constructed from the generated prefix up to, but excluding, that position, so the retrieved experts are conditioned on exactly the context already represented in the KV cache\. We then retrieve the experts whose associated texts best match $q_t$ under BM25:

$$E_{\text{sel}} = \operatorname{Top\text{-}k}_{i\in[N]} \mathrm{BM25}(q_t, \mathcal{D}_i). \tag{5}$$

This design is training\-free, incrementally updatable \(adding an expert only inserts $\mathcal{D}_i$ into the index\), and avoids introducing an additional neural encoder that would partially re\-couple routing with model parameters\. Finally, we activate the selected experts by composing them with the frozen base model as temporary effective parameters for the current decoding state:

$$\theta^{\mathrm{eff}}_t = \theta + \sum_{E_i\in E_{\text{sel}}} \Delta\theta_i. \tag{6}$$

Here, $\theta$ denotes the unchanged base\-model parameters, while $\theta^{\mathrm{eff}}_t$ denotes the temporary parameter composition used for decoding under the currently active experts\. Generation resumes from the triggering token, and this active expert set is kept until the next triggering event, avoiding redundant expert swaps at every decoding step\.
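Eqs\. \(5\) and \(6\) can be sketched together in plain Python\. The class `BM25Router` and the function `compose` are hypothetical names; the scoring uses a common Okapi BM25 variant with whitespace tokenization and parameters `k1=1.2`, `b=0.75`, all of which are assumptions rather than the configuration of the external index used in the paper\. Parameters are flattened to short lists of floats purely for illustration\.

```python
import math
from collections import Counter

class BM25Router:
    """Toy lexical router over expert text surrogates D_i (illustrative)."""

    def __init__(self, surrogates, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = {eid: Counter(text.lower().split())
                     for eid, text in surrogates.items()}
        self.lengths = {eid: sum(tf.values()) for eid, tf in self.docs.items()}
        self.avg_len = sum(self.lengths.values()) / len(self.docs)
        # Document frequency: number of surrogates containing each term.
        self.df = Counter(term for tf in self.docs.values() for term in tf)

    def score(self, query, eid):
        tf, n_docs = self.docs[eid], len(self.docs)
        norm = self.k1 * (1 - self.b + self.b * self.lengths[eid] / self.avg_len)
        total = 0.0
        for term in set(query.lower().split()):
            if term in tf:
                idf = math.log(1 + (n_docs - self.df[term] + 0.5) / (self.df[term] + 0.5))
                total += idf * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def top_k(self, query, k=3):
        """Eq. (5): the k expert ids with the highest BM25 score."""
        ranked = sorted(self.docs, key=lambda eid: self.score(query, eid), reverse=True)
        return ranked[:k]

def compose(theta, deltas, selected):
    """Eq. (6): theta_eff = theta + sum of selected deltas; theta is not mutated."""
    eff = list(theta)
    for eid in selected:
        eff = [w + d for w, d in zip(eff, deltas[eid])]
    return eff

surrogates = {
    "E_paris": "paris is the capital of france",
    "E_tokyo": "tokyo is the capital of japan",
    "E_nile": "the nile is a river in africa",
}
deltas = {"E_paris": [0.1, 0.0], "E_tokyo": [0.0, -0.5], "E_nile": [0.3, 0.3]}
theta = [1.0, 2.0]

router = BM25Router(surrogates)
selected = router.top_k("what is the capital of france", k=2)
theta_eff = compose(theta, deltas, selected)
```

Because `compose` builds a new list, the base parameters stay untouched and a different expert set can be swapped in at the next triggering event\. Adding an expert in this toy only requires a new surrogate entry and its delta; no router parameters are trained\.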
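Table 1: Main results on four knowledge\-intensive benchmarks\. We report EM/F1 or ACC across datasets and base models\. Best values are in bold\.

| Base Model | Method | CWQ EM | CWQ F1 | HotpotQA EM | HotpotQA F1 | Quasar\-T EM | Quasar\-T F1 | StrategyQA ACC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama3\.2\-1B | Basic\-RAG | 0\.1633 | 0\.2384 | 0\.1700 | 0\.2463 | 0\.2800 | 0\.3513 | 0\.4333 |
| Llama3\.2\-1B | FLARE | 0\.2400 | 0\.3154 | 0\.0733 | 0\.1303 | 0\.1867 | 0\.2190 | 0\.5367 |
| Llama3\.2\-1B | PRAG | **0\.2500** | 0\.3284 | 0\.0733 | 0\.1427 | 0\.2200 | 0\.2514 | 0\.5600 |
| Llama3\.2\-1B | SFT\-LoRA | 0\.2167 | 0\.3092 | 0\.0767 | 0\.1325 | 0\.2133 | 0\.2481 | 0\.5533 |
| Llama3\.2\-1B | DMoE \(Ours\) | 0\.2467 | **0\.3479** | **0\.1800** | **0\.2553** | **0\.3133** | **0\.3658** | **0\.5667** |
| Qwen2\.5\-1\.5B | Basic\-RAG | 0\.1233 | 0\.1888 | **0\.1567** | **0\.2446** | 0\.1833 | 0\.2549 | 0\.3667 |
| Qwen2\.5\-1\.5B | FLARE | 0\.1200 | 0\.1806 | 0\.1000 | 0\.1717 | 0\.1733 | 0\.2242 | 0\.2500 |
| Qwen2\.5\-1\.5B | PRAG | **0\.2167** | 0\.3050 | 0\.1033 | 0\.1583 | 0\.1733 | 0\.2312 | 0\.5833 |
| Qwen2\.5\-1\.5B | SFT\-LoRA | 0\.1967 | 0\.3092 | 0\.1033 | 0\.1595 | 0\.1767 | 0\.2292 | 0\.5733 |
| Qwen2\.5\-1\.5B | DMoE \(Ours\) | **0\.2167** | **0\.3285** | 0\.1533 | 0\.2375 | **0\.2333** | **0\.2795** | **0\.6133** |
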
## 4 Experimental Setup
### 4\.1 Benchmarks and Metrics
We conduct comprehensive evaluations on a diverse suite of knowledge\-intensive benchmarks, each representing a distinct aspect of the knowledge injection problem\. HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2606.14243#bib.bib96)\) focuses on multi\-hop reasoning, where the model must integrate evidence from multiple documents to answer complex questions that require reasoning across facts\. ComplexWebQuestions \(Talmor and Berant, [2018](https://arxiv.org/html/2606.14243#bib.bib97)\) extends this setting to an open\-domain environment, challenging the model to perform compositional reasoning and inject relevant knowledge from large\-scale web content\. Quasar\-T \(Dhingra et al\., [2017](https://arxiv.org/html/2606.14243#bib.bib98)\) further evaluates open\-domain retrieval and reasoning by requiring models to locate and synthesize factual knowledge from a large corpus to answer trivia\-style questions\. StrategyQA \(Geva et al\., [2021a](https://arxiv.org/html/2606.14243#bib.bib99)\) assesses implicit multi\-hop reasoning, where the required knowledge is not explicitly stated in the question and must be inferred or retrieved based on world knowledge\.
For evaluation, we primarily assess the effectiveness of knowledge injection methods in the main results\. Effectiveness is measured using EM and F1 \(ACC for StrategyQA\), which reflect how well a model integrates and utilizes injected knowledge\. We analyze efficiency separately in Section [5\.2](https://arxiv.org/html/2606.14243#S5.SS2) using the average inference time and average GPU memory consumption per sample\.
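For reference, EM and token\-level F1 are commonly computed as in the sketch below\. The normalization steps \(lowercasing, stripping punctuation and English articles\) follow a widely used QA convention and are an assumption here; the text does not specify the exact scoring script, and the function names are illustrative\.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Paris, France", "Paris")
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, which is why a method can lead on F1 while trailing on EM for the same dataset\.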
### 4\.2 Baselines
We compare DMoE against representative baselines spanning the major paradigms of knowledge injection: Basic\-RAG, FLARE, PRAG, and SFT\-LoRA, all built on Llama\-3\.2\-1B\-Instruct and Qwen2\.5\-1\.5B\-Instruct\. For an equitable comparison, all retrieval\-based methods share the same retrieval corpus of 27,613 passage\-level documents, the same retriever \(Elasticsearch BM25\), the same retrieval budget \($k=3$\), and greedy decoding; retrieval queries are truncated to 128 tokens, and the corpus is not additionally chunked because each item is already a short single\-topic passage of approximately 100 words\. Basic\-RAG performs one\-shot retrieval and appends the retrieved passages to the prompt, FLARE retrieves during decoding and therefore cannot reuse the KV\-cache, PRAG combines retrieved passages with a corresponding parametric adapter using the same augmentation\-and\-adapter recipe as our expert construction pipeline, and SFT\-LoRA trains a single LoRA adapter on supervision merged across the full corpus\.
### 4\.3 Implementation Details
Before training experts, each base model undergoes a one\-epoch held\-out format\-stabilization warmup\. Concretely, warmup uses the same task and system instructions as the main evaluation, but only evaluation\-disjoint instances, and is intended to stabilize output formatting consistency rather than inject factual or benchmark\-specific knowledge\. This stage is retained to match the PRAG recipe and maintain a controlled comparison; its impact is analyzed in Section [A\.1](https://arxiv.org/html/2606.14243#A1.SS1)\.
The external knowledge corpus used to construct experts is drawn from the Wikipedia corpus used in DPR Karpukhin et al\. \([2020](https://arxiv.org/html/2606.14243#bib.bib100)\)\. This pool is used by all retrieval\-based baselines and by DMoE, so comparisons are controlled under the same accessible knowledge source\. Importantly, corpus construction and expert training do not use answer strings, gold supporting\-document annotations, or any manually labeled rationales\. Each document is augmented only from its own passage text by: \(1\) a single paraphrased rewrite that preserves its factual content, and \(2\) three question\-answer pairs generated from the document\. These augmentations enrich the supervision signals for expert training and help each expert better capture the underlying knowledge unit\. For each knowledge unit, we train a dedicated expert implemented as a LoRA adapter with rank 4, scaling factor $\alpha=16$, and learning rate $1\times 10^{-5}$\. Experts are trained for one epoch and inserted only into the final\-layer feed\-forward network to ensure full compatibility with KV\-cache reuse during autoregressive decoding\. During inference, we set the triggering threshold to $\tau=2.0$, use greedy decoding, and activate the top\-$k$ experts \($k=3$ by default\)\. We use BM25 to construct expert and context representations\. All efficiency measurements \(inference time and GPU memory usage\) are conducted on a single NVIDIA A100 80GB GPU\.
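To make the LoRA hyperparameters concrete, the sketch below forms the weight update of one expert using the standard LoRA parameterization, $\Delta W = (\alpha/r)\,BA$, with $r=4$ and $\alpha=16$ as stated above\. The matrix shapes and values are toy assumptions, and `matmul` and `lora_delta` are illustrative helpers rather than part of any library\.

```python
def matmul(a, b):
    """Plain-Python product of an (m x r) and an (r x n) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_delta(B, A, alpha=16.0, rank=4):
    """Dense update Delta W = (alpha / rank) * B @ A for one expert."""
    scale = alpha / rank  # 16 / 4 = 4 with the settings in the text
    return [[scale * v for v in row] for row in matmul(B, A)]

# Toy shapes (assumed): d_out = 2, d_in = 3, rank = 4.
B = [[1, 0, 0, 0],
     [0, 1, 0, 0]]
A = [[1, 2, 3],
     [4, 5, 6],
     [0, 0, 0],
     [0, 0, 0]]
delta_w = lora_delta(B, A)

# Each expert stores rank * (d_in + d_out) numbers instead of d_in * d_out.
stored_params = 4 * (3 + 2)
```

For realistic FFN dimensions the stored factors are far smaller than the dense matrix they update, which is what keeps a large pool of per\-document experts tractable to store and swap\.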
## 5 Experimental Results
### 5\.1 Overall Results
Table [1](https://arxiv.org/html/2606.14243#S3.T1) summarizes overall effectiveness across all baselines\. We set the test corpus size to 300 for all benchmarks\. Table [1](https://arxiv.org/html/2606.14243#S3.T1) shows a general advantage for DMoE\. Across the 14 reported effectiveness metrics, DMoE obtains the best or tied\-best score on 11 metrics, including the strongest F1 on CWQ for both base models, both Quasar\-T metrics for both base models, and StrategyQA ACC for both base models\. This pattern suggests that DMoE is usually the strongest method in the reported comparison\. The exceptions are concentrated rather than scattered\. On CWQ, PRAG slightly exceeds DMoE in EM for Llama3\.2\-1B and ties it for Qwen2\.5\-1\.5B, although DMoE has the higher CWQ F1 in both cases\. On HotpotQA with Qwen2\.5\-1\.5B, Basic\-RAG is marginally better than DMoE on both EM and F1, while DMoE is best on both HotpotQA metrics with Llama3\.2\-1B\. Thus, Table [1](https://arxiv.org/html/2606.14243#S3.T1) supports a measured conclusion: DMoE is broadly competitive and most often best among the evaluated methods, but the results do not establish dominance on every dataset, base model, or metric\.
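The count of 11 out of 14 can be checked directly from the values reported in Table 1\. The snippet below transcribes those values and counts the metrics on which DMoE matches the column maximum; the dictionary layout and function name are illustrative\.

```python
METRICS = ["CWQ EM", "CWQ F1", "HotpotQA EM", "HotpotQA F1",
           "Quasar-T EM", "Quasar-T F1", "StrategyQA ACC"]

# Scores transcribed from Table 1, in the order given by METRICS.
TABLE_1 = {
    "Llama3.2-1B": {
        "Basic-RAG": [0.1633, 0.2384, 0.1700, 0.2463, 0.2800, 0.3513, 0.4333],
        "FLARE":     [0.2400, 0.3154, 0.0733, 0.1303, 0.1867, 0.2190, 0.5367],
        "PRAG":      [0.2500, 0.3284, 0.0733, 0.1427, 0.2200, 0.2514, 0.5600],
        "SFT-LoRA":  [0.2167, 0.3092, 0.0767, 0.1325, 0.2133, 0.2481, 0.5533],
        "DMoE":      [0.2467, 0.3479, 0.1800, 0.2553, 0.3133, 0.3658, 0.5667],
    },
    "Qwen2.5-1.5B": {
        "Basic-RAG": [0.1233, 0.1888, 0.1567, 0.2446, 0.1833, 0.2549, 0.3667],
        "FLARE":     [0.1200, 0.1806, 0.1000, 0.1717, 0.1733, 0.2242, 0.2500],
        "PRAG":      [0.2167, 0.3050, 0.1033, 0.1583, 0.1733, 0.2312, 0.5833],
        "SFT-LoRA":  [0.1967, 0.3092, 0.1033, 0.1595, 0.1767, 0.2292, 0.5733],
        "DMoE":      [0.2167, 0.3285, 0.1533, 0.2375, 0.2333, 0.2795, 0.6133],
    },
}

def dmoe_best_or_tied(table):
    """List (base model, metric) pairs where DMoE equals the best score."""
    wins = []
    for model, rows in table.items():
        for i, metric in enumerate(METRICS):
            best = max(scores[i] for scores in rows.values())
            if rows["DMoE"][i] == best:
                wins.append((model, metric))
    return wins

wins = dmoe_best_or_tied(TABLE_1)
total = len(TABLE_1) * len(METRICS)
```

The three misses are CWQ EM with Llama3\.2\-1B and both HotpotQA metrics with Qwen2\.5\-1\.5B, matching the exceptions described above; CWQ EM with Qwen2\.5\-1\.5B counts as a tie with PRAG\.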
### 5\.2 Efficiency Analysis
Table 2: Efficiency analysis\. We report all available average inference latency and GPU memory measurements per sample\. Lower is better, and best values within each base\-model block are in bold\.

As shown in Table [2](https://arxiv.org/html/2606.14243#S5.T2), static baselines such as Basic\-RAG, PRAG, and SFT\-LoRA can incur lower per\-sample cost because they do not modify retrieval or routing decisions during decoding\. Among methods that adapt generation online, however, DMoE is roughly $3\times$ faster than FLARE while also using substantially less memory\. The core reason is KV\-cache reuse\. FLARE\-style dynamic RAG interleaves generation with retrieval and prompt rewriting, which grows the context over time and invalidates cache reuse; consequently, attention over an expanding prefix is repeatedly recomputed, inflating both latency and memory\.
In contrast, DMoE keeps the prompt state stable and performs adaptation by conditionally activating lightweight final\-layer experts\. Since experts are confined to the last FFN layer, the KV\-cache remains fully valid throughout decoding, largely avoiding the context\-growth penalty and redundant attention computation\. This cache\-safe design translates into consistent time and memory advantages in practice: DMoE reduces GPU memory usage by roughly $1.6\times$–$1.9\times$ compared to FLARE\.
Figure 4: KV\-cache reuse substantially improves autoregressive inference efficiency\. We report the inference latency \(top\) and GPU memory footprint \(bottom\) for two base models \(Llama\-3\.2\-1B and Qwen2\.5\-1\.5B\) across four benchmarks, comparing Reuse KV\-cache vs\. No KV\-cache reuse\. KV\-cache reuse consistently reduces both time and memory across all datasets\.

Figure [4](https://arxiv.org/html/2606.14243#S5.F4) further confirms the centrality of KV caching: across all benchmarks and both base models, enabling KV caching yields a $1.3\times$–$5.1\times$ latency speedup and a $1.2\times$–$2.5\times$ reduction in GPU memory usage, motivating cache\-compatible expert placement\.
Table 4: DMoE vs\. a coupled MoE backbone\. We compare DMoE against a finetuned MoE model \(OLMoE\-1B\-7B\) under SFT\-LoRA\. DMoE achieves higher effectiveness \(EM/F1/ACC\) while using markedly lower inference time and GPU memory\. Higher is better for EM/F1/ACC; lower is better for Time/GPU\. Best values are in bold\.

Table 3: Performance \(EM\) when reusing KV\-cache while introducing experts into different portions of FFN layers\. Here, $x\%$ denotes inserting experts into the final $x\%$ of FFN layers, while Only\-last inserts experts only into the final\-layer FFN\. Only\-last preserves KV\-cache compatibility and achieves the best overall accuracy across datasets\.
### 5\.3 Ablation Study
To better understand the contribution of individual architectural components in DMoE, the main paper focuses on two architectural questions: \(1\) where experts should be placed to preserve KV\-cache compatibility, and \(2\) whether DMoE’s gains can be explained by simply using a conventional MoE backbone\. Additional ablations, including warmup, sensitivity to router top\-$k$, TU\-triggered routing analyses, LoRA hyperparameter sensitivity, and retriever choice, are reported in Appendix [A](https://arxiv.org/html/2606.14243#A1)\.
#### 5\.3\.1 Impact of Expert Layer Placement
A practical requirement for DMoE is that expert activation must preserve KV\-cache reuse during auto\-regressive decoding\. In a standard transformer decoder, each layer $\ell$ computes

$$h_{\ell+1}=\mathcal{F}_{\ell}(h_{\ell}),\tag{7}$$

where $h_{\ell}$ is the hidden state, and the keys and values for layer $\ell$’s attention are linear projections of $h_{\ell}$\. Introducing an expert at any intermediate layer $k$ modifies the hidden state to

$$\tilde{h}_{k+1}=h_{k+1}+\Delta(h_{k}),\tag{8}$$

which propagates forward to all subsequent layers $j>k$\. Because the KV\-cache stores $K_{j}^{\text{cached}}=W_{K}h_{j}$ and $V_{j}^{\text{cached}}=W_{V}h_{j}$ computed from the unmodified trajectory, any modification $\tilde{h}_{j}\neq h_{j}$ requires recomputing

$$K_{j}^{\text{new}}=W_{K}\tilde{h}_{j},\qquad V_{j}^{\text{new}}=W_{V}\tilde{h}_{j},\tag{9}$$

breaking KV\-cache compatibility and causing substantial latency overhead\.

This constraint is avoided only when experts are applied at the final FFN layer\. In a decoder\-only model with $L$ layers, the output of the last FFN \($h_{L+1}$\) is projected directly to logits and does not feed into any further attention computation\. Thus, placing experts exclusively at $k=L$ guarantees

$$\forall j\leq L,\quad\tilde{h}_{j}=h_{j},\tag{10}$$

ensuring that all cached keys and values remain valid\.
Table [3](https://arxiv.org/html/2606.14243#S5.T3) empirically confirms this analysis\. When experts are inserted into earlier or multiple layers, performance drops sharply due to KV\-cache mismatch\. In contrast, attaching experts only to the last FFN layer is generally competitive and often best\. These findings demonstrate a key property of DMoE: minimally intrusive expert placement at the final layer preserves fast decoding while still enabling effective knowledge injection\. Together with adaptive triggering, this design provides an effective and efficient mechanism for integrating new knowledge during inference\.
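The argument in Eqs\. \(7\)–\(10\) can be checked on a toy model\. The sketch below assumes a scalar hidden state, a made\-up layer function $\mathcal{F}_{\ell}(h)=\tanh(w_{\ell}h)$, and a made\-up expert update $\Delta(h)=\delta h$; none of these are the paper's actual layers or experts\. It only shows which cached keys $W_{K}h_{j}$ change when an expert is attached at layer $k$\.

```python
import math


def hidden_states(h1, weights, expert_layer=None, delta=0.5):
    """Return [h_1, ..., h_{L+1}] for the toy stack h_{l+1} = tanh(w_l * h_l)."""
    states = [h1]
    for layer, w in enumerate(weights, start=1):
        h_in = states[-1]
        h_out = math.tanh(w * h_in)
        if layer == expert_layer:
            h_out += delta * h_in  # Eq. (8): add the toy expert update Delta(h_k)
        states.append(h_out)
    return states


def cached_keys(states, w_k=2.0):
    """Keys K_j = W_K * h_j for the layer inputs h_1..h_L (h_{L+1} feeds no attention)."""
    return [w_k * h for h in states[:-1]]


weights = [0.9, -0.7, 1.1, 0.5]  # hypothetical 4-layer toy model
base = hidden_states(1.0, weights)
last = hidden_states(1.0, weights, expert_layer=len(weights))
mid = hidden_states(1.0, weights, expert_layer=2)

print("final-layer expert keeps cache valid:", cached_keys(last) == cached_keys(base))
print("intermediate expert keeps cache valid:", cached_keys(mid) == cached_keys(base))
print("final-layer expert changes output:", last[-1] != base[-1])
```

In this toy setting, the final\-layer expert alters only $h_{L+1}$, so every cached key is unchanged, whereas an expert at layer 2 alters $h_{3}$ and everything downstream of it\.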
#### 5\.3\.2 DMoE vs\. Conventional MoE Backbones
To clarify whether our gains are simply an artifact of adopting an MoE\-style backbone, we include a coupled MoE model as a structural control\. Specifically, we evaluate `allenai/OLMoE-1B-7B-0125-Instruct` and adapt it with the same SFT\-LoRA recipe used in our dense baselines\. We choose it because its activated parameter scale is comparable to DMoE, i\.e\., both systems perform sparse conditional computation with a similar amount of per\-token active capacity\. This design makes the comparison informative: it helps factor out the confounding effect of increasing total parameters, and tests whether the improvements come from DMoE’s architectural decoupling rather than parameter count\.

Table [4](https://arxiv.org/html/2606.14243#S5.T4) shows that DMoE consistently outperforms the coupled MoE backbone on effectiveness while being substantially more efficient\. On CWQ, HotpotQA, and Quasar\-T, both DMoE variants achieve higher EM/F1 than OLMoE under SFT\-LoRA, and DMoE also matches or exceeds OLMoE on StrategyQA accuracy\. More importantly, the system costs differ dramatically: DMoE reduces average inference time from 20\.0 s \(OLMoE\) to 2\.7–3\.9 s \(roughly $5.1\times$–$7.4\times$ faster\), and cuts GPU memory from 26\.1 GB to 7\.2–8\.3 GB \(roughly $3.1\times$–$3.6\times$ less\)\.

These results support our main design rationale for knowledge injection\. A conventional MoE backbone offers conditional computation, but its experts are heavy and must remain resident during inference, and its router/expert parameters are coupled to the base model, which makes knowledge updates harder to localize and incurs a high runtime footprint\. In contrast, DMoE treats experts as external, lightweight knowledge modules and invokes them only when needed, while keeping the base model intact\. Therefore, DMoE’s gains cannot be attributed to merely increasing model capacity; they arise from a decoupled, cache\-safe modularization that, in our experiments, yields a better effectiveness\-efficiency trade\-off than simply switching to a standard MoE architecture\.
## 6 Conclusion
We presented DMoE, a decoupled mixture\-of\-experts architecture for modular parametric knowledge injection\. Unlike conventional retrieval or post\-training based methods, DMoE keeps both knowledge experts and the router external to the frozen base model, allowing knowledge units to be added, removed, or updated independently\. During inference, DMoE uses uncertainty\-aware routing to selectively activate relevant experts and attaches them only to the final\-layer feed\-forward network, preserving KV\-cache reuse during autoregressive generation\. Experiments on knowledge\-intensive benchmarks show that DMoE consistently improves over dense baselines and achieves competitive performance against retrieval\- and adapter\-based methods, while substantially reducing the overhead of dynamic knowledge augmentation\. These results suggest that decoupled, cache\-compatible expert modularization is a promising direction for scalable and updateable knowledge injection in LLMs\.
## References
- Q\. Ai, Y\. Tang, C\. Wang, J\. Long, W\. Su, and Y\. Liu \(2025\)MemoryBench: a benchmark for memory and continual learning in llm systems\.arXiv preprint arXiv:2510\.17281\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- S\. Borgeaud, A\. Mensch, J\. Hoffmann, T\. Cai, E\. Rutherford, K\. Millican, G\. B\. Van Den Driessche, J\. Lespiau, B\. Damoc, A\. Clark,et al\.\(2022\)Improving language models by retrieving from trillions of tokens\.InInternational conference on machine learning,pp\. 2206–2240\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p2.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- R\. Cai, Y\. Ro, G\. Kim, P\. Wang, B\. E\. Bejnordi, A\. Akella, and Z\. Wang \(2024\)Read\-me: refactorizing llms as router\-decoupled mixture of experts with system co\-design\.arXiv preprint arXiv:2410\.19123\.Cited by:[§2\.3](https://arxiv.org/html/2606.14243#S2.SS3.p1.1)\.
- A\. Chowdhery, S\. Narang, J\. Devlin, M\. Bosma, G\. Mishra, A\. Roberts, P\. Barham, H\. W\. Chung, C\. Sutton, S\. Gehrmann,et al\.\(2023\)Palm: scaling language modeling with pathways\.Journal of Machine Learning Research24\(240\),pp\. 1–113\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- D\. Dai, L\. Dong, S\. Ma, B\. Zheng, Z\. Sui, B\. Chang, and F\. Wei \(2022\)Stablemoe: stable routing strategy for mixture of experts\.arXiv preprint arXiv:2204\.08396\.Cited by:[§2\.3](https://arxiv.org/html/2606.14243#S2.SS3.p1.1)\.
- B\. Dhingra, K\. Mazaitis, and W\. W\. Cohen \(2017\)Quasar: datasets for question answering by search and reading\.arXiv preprint arXiv:1707\.03904\.Cited by:[§4\.1](https://arxiv.org/html/2606.14243#S4.SS1.p1.1)\.
- Q\. Dong, Q\. Ai, H\. Wang, Y\. Liu, H\. Li, W\. Su, Y\. Liu, T\. Chua, and S\. Ma \(2025\)Decoupling knowledge and context: an efficient and effective retrieval augmented generation framework via cross attention\.InProceedings of the ACM on Web Conference 2025,pp\. 4386–4395\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\)From local to global: a graph rag approach to query\-focused summarization\.arXiv preprint arXiv:2404\.16130\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- Y\. Fang, J\. Zhan, Q\. Ai, J\. Mao, W\. Su, J\. Chen, and Y\. Liu \(2024\)Scaling laws for dense retrieval\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1339–1349\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.Journal of Machine Learning Research23\(120\),pp\. 1–39\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p5.1)\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021a\)Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies\.Transactions of the Association for Computational Linguistics9,pp\. 346–361\.Cited by:[§4\.1](https://arxiv.org/html/2606.14243#S4.SS1.p1.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021b\)Transformer feed\-forward layers are key\-value memories\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 5484–5495\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.14243#S3.SS2.SSS1.Px1.p1.1)\.
- I\. J\. Goodfellow, M\. Mirza, D\. Xiao, A\. Courville, and Y\. Bengio \(2013\)An empirical investigation of catastrophic forgetting in gradient\-based neural networks\.arXiv preprint arXiv:1312\.6211\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- Z\. Han, C\. Gao, J\. Liu, J\. Zhang, and S\. Q\. Zhang \(2024\)Parameter\-efficient fine\-tuning for large models: a comprehensive survey\.arXiv preprint arXiv:2403\.14608\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly \(2019\)Parameter\-efficient transfer learning for nlp\.InInternational conference on machine learning,pp\. 2790–2799\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen, et al\. \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2606.14243#S1.p3.1), [§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1), [§3\.2\.1](https://arxiv.org/html/2606.14243#S3.SS2.SSS1.Px1.p1.1)\.
- R\. A\. Jacobs, M\. I\. Jordan, S\. J\. Nowlan, and G\. E\. Hinton \(1991\)Adaptive mixtures of local experts\.Neural computation3\(1\),pp\. 79–87\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p5.1)\.
- Z\. Jiang, F\. F\. Xu, L\. Gao, Z\. Sun, Q\. Liu, J\. Dwivedi\-Yu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)Active retrieval augmented generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 7969–7992\.Cited by:[§C\.2](https://arxiv.org/html/2606.14243#A3.SS2.p1.1),[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.14243#S3.SS1.p1.3)\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.arXiv preprint arXiv:2503\.09516\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. S\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\) Dense passage retrieval for open\-domain question answering\. In EMNLP \(1\), pp\. 6769–6781\. Cited by: [§4\.3](https://arxiv.org/html/2606.14243#S4.SS3.p2.5)\.
- R\. Kemker, M\. McClure, A\. Abitino, T\. Hayes, and C\. Kanan \(2018\)Measuring catastrophic forgetting in neural networks\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- A\. Lauscher, O\. Majewska, L\. F\. Ribeiro, I\. Gurevych, N\. Rozanov, and G\. Glavaš \(2020\)Common sense or world knowledge? investigating adapter\-based knowledge injection into pretrained transformers\.arXiv preprint arXiv:2005\.11787\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.arXiv preprint arXiv:2104\.08691\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.arXiv preprint arXiv:2101\.00190\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- S\. Mishra, D\. Khashabi, C\. Baral, and H\. Hajishirzi \(2021\)Cross\-task generalization via natural language crowdsourcing instructions\.arXiv preprint arXiv:2104\.08773\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- N\. Nanda, S\. Rajamanoharan, J\. Kramar, and R\. Shah \(2023\)Fact finding: attempting to reverse\-engineer factual recall on the neuron level\.InAlignment Forum,pp\. 6\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.14243#S3.SS2.SSS1.Px1.p1.1)\.
- X\. Nie, X\. Miao, S\. Cao, L\. Ma, Q\. Liu, J\. Xue, Y\. Miao, Y\. Liu, Z\. Yang, and B\. Cui \(2021\)Evomoe: an evolutional mixture\-of\-experts training framework via dense\-to\-sparse gate\.arXiv preprint arXiv:2112\.14397\.Cited by:[§2\.3](https://arxiv.org/html/2606.14243#S2.SS3.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- O\. Ovadia, M\. Brief, M\. Mishaeli, and O\. Elisha \(2023\)Fine\-tuning or retrieval? comparing knowledge injection in llms\.arXiv preprint arXiv:2312\.05934\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- S\. Robertson, H\. Zaragoza,et al\.\(2009\)The probabilistic relevance framework: bm25 and beyond\.Foundations and Trends® in Information Retrieval3\(4\),pp\. 333–389\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.arXiv preprint arXiv:1701\.06538\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p5.1)\.
- Z\. Song, B\. Yan, Y\. Liu, M\. Fang, M\. Li, R\. Yan, and X\. Chen \(2025\)Injecting domain\-specific knowledge into large language models: a comprehensive survey\.arXiv preprint arXiv:2502\.10708\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- W\. Su, Q\. Ai, X\. Li, J\. Chen, Y\. Liu, X\. Wu, and S\. Hou \(2024a\)Wikiformer: pre\-training with structured information of wikipedia for ad\-hoc retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19026–19034\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Q\. Ai, Y\. Wu, A\. Xie, C\. Wang, Y\. Ma, H\. Li, Z\. Wu, Y\. Liu, and M\. Zhang \(2025a\)Pre\-training for legal case retrieval based on inter\-case distinctions\.ACM Transactions on Information Systems43\(5\),pp\. 1–27\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Q\. Ai, J\. Zhan, Q\. Dong, and Y\. Liu \(2025b\)Dynamic and parametric retrieval\-augmented generation\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 4118–4121\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, X\. Chen, Y\. Wu, Q\. Ai, and Y\. Liu \(2026a\)Enhancing judgment document generation via agentic legal information collection and rubric\-guided optimization\.arXiv preprint arXiv:2605\.02011\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Q\. Dong, Q\. Ai, and Y\. Liu \(2025c\)SIGIR\-ap 2025 tutorial proposal: dynamic and parametric retrieval\-augmented generation\.In3rd International ACM SIGIR Conference on Information Retrieval in the Asia Pacific,Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Y\. Hu, A\. Xie, Q\. Ai, Q\. Bing, N\. Zheng, Y\. Liu, W\. Shen, and Y\. Liu \(2024b\)STARD: a Chinese statute retrieval dataset derived from real\-life queries by non\-professionals\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 10658–10671\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.625/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.625)Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, J\. Long, Q\. Ai, Y\. Tang, C\. Wang, Y\. Tu, and Y\. Liu \(2026b\)Skill retrieval augmentation for agentic ai\.arXiv preprint arXiv:2604\.24594\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, J\. Long, C\. Wang, S\. Lin, J\. Xu, Z\. Ye, Q\. Ai, and Y\. Liu \(2025d\)Towards unification of hallucination detection and fact verification for large language models\.arXiv preprint arXiv:2512\.02772\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Y\. Tang, Q\. Ai, C\. Wang, Z\. Wu, and Y\. Liu \(2024c\)Mitigating entity\-level hallucination in large language models\.InProceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region,pp\. 23–31\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, Y\. Tang, Q\. Ai, Z\. Wu, and Y\. Liu \(2024d\)DRAGIN: dynamic retrieval augmented generation based on the information needs of large language models\.arXiv preprint arXiv:2403\.10081\.Cited by:[§C\.2](https://arxiv.org/html/2606.14243#A3.SS2.p1.1),[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.14243#S3.SS1.p1.3)\.
- W\. Su, Y\. Tang, Q\. Ai, J\. Yan, C\. Wang, H\. Wang, Z\. Ye, Y\. Zhou, and Y\. Liu \(2025e\)Parametric retrieval augmented generation\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1240–1250\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1),[§3\.2\.1](https://arxiv.org/html/2606.14243#S3.SS2.SSS1.Px2.p1.3)\.
- W\. Su, C\. Wang, Q\. Ai, Y\. Hu, Z\. Wu, Y\. Zhou, and Y\. Liu \(2024e\)Unsupervised real\-time hallucination detection based on the internal states of large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 14379–14391\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, A\. Xie, Q\. Ai, J\. Long, X\. Chen, J\. Mao, Z\. Ye, and Y\. Liu \(2025f\)Surge: a benchmark and evaluation framework for scientific survey generation\.arXiv preprint arXiv:2508\.15658\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, B\. Yue, Q\. Ai, Y\. Hu, J\. Li, C\. Wang, K\. Zhang, Y\. Wu, and Y\. Liu \(2025g\)JuDGE: benchmarking judgment document generation for chinese legal system\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR ’25\), July 13–18, 2025, Padua, Italy,External Links:[Document](https://dx.doi.org/10.1145/3726302.3730295)Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- W\. Su, H\. Zhang, Q\. Ai, and Y\. Liu \(2026c\)Decoupling knowledge and task subspaces for composable parametric retrieval augmented generation\.arXiv preprint arXiv:2604\.26768\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- A\. Talmor and J\. Berant \(2018\)The web as a knowledge\-base for answering complex questions\.arXiv preprint arXiv:1803\.06643\.Cited by:[§4\.1](https://arxiv.org/html/2606.14243#S4.SS1.p1.1)\.
- R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\) Alpaca: a strong, replicable instruction\-following model\. Stanford Center for Research on Foundation Models\. https://crfm\.stanford\.edu/2023/03/13/alpaca\.html\. Cited by: [§2\.2](https://arxiv.org/html/2606.14243#S2.SS2.p1.1)\.
- Y\. Tu, W\. Su, Y\. Zhou, Y\. Liu, and Q\. Ai \(2025\)Robust fine\-tuning for retrieval augmented generation against retrieval defects\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1272–1282\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- C\. Wang, W\. Su, Q\. Ai, and Y\. Liu \(2026\)Joint evaluation of answer and reasoning consistency for hallucination detection in large reasoning models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 33377–33385\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- C\. Wang, W\. Su, Q\. Ai, Y\. Tang, and Y\. Liu \(2025a\)Knowledge editing through chain\-of\-thought\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 10684–10704\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- C\. Wang, W\. Su, Q\. Ai, Y\. Zhou, and Y\. Liu \(2025b\)Decoupling reasoning and knowledge injection for in\-context knowledge editing\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 24543–24562\.Cited by:[§2\.1](https://arxiv.org/html/2606.14243#S2.SS1.p1.1)\.
- Y\. Wang, S\. Mishra, P\. Alipoormolabashi, Y\. Kordi, A\. Mirzaei, A\. Arunkumar, A\. Ashok, A\. S\. Dhanasekaran, A\. Naik, D\. Stap,et al\.\(2022\)Super\-naturalinstructions: generalization via declarative instructions on 1600\+ nlp tasks\.arXiv preprint arXiv:2204\.07705\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p3.1)\.
- Y\. Wang, Y\. Liu, A\. Zheng, and P\. Zhang \(2025d\)Decoupled feature\-based mixture of experts for multi\-modal object re\-identification\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 8141–8149\.Cited by:[§2\.3](https://arxiv.org/html/2606.14243#S2.SS3.p1.1)\.
- Z\. Xu, S\. Jain, and M\. Kankanhalli \(2024\)Hallucination is inevitable: an innate limitation of large language models\.arXiv preprint arXiv:2401\.11817\.Cited by:[§1](https://arxiv.org/html/2606.14243#S1.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2369–2380\.Cited by:[§4\.1](https://arxiv.org/html/2606.14243#S4.SS1.p1.1)\.
- Z\. Yu and S\. Ananiadou \(2024\)Neuron\-level knowledge attribution in large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 3267–3280\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.14243#S3.SS2.SSS1.Px1.p1.1)\.
- J\. Zhou, T\. Lu, S\. Mishra, S\. Brahma, S\. Basu, Y\. Luan, D\. Zhou, and L\. Hou \(2023\)Instruction\-following evaluation for large language models\.arXiv preprint arXiv:2311\.07911\.Cited by:[§A\.1](https://arxiv.org/html/2606.14243#A1.SS1.p1.4)\.
## Appendix A Additional Ablation Studies
### A\.1 Warmup: Definition and Impact
Since the focus of this work is the DMoE architecture itself, we do not include dedicated instruction\-following benchmarks such as IFEval \(Zhou et al\., [2023](https://arxiv.org/html/2606.14243#bib.bib101)\) to directly measure the effect of warmup on instruction following\. Instead, we assess its impact indirectly through downstream performance on four QA benchmarks\. Warmup yields modest gains of about $+1.0$ EM on CWQ, $+0.3$ EM on HotpotQA, and $+1.3$ EM on Quasar\-T, but slightly lowers StrategyQA by about $1.0$ ACC\. This mixed pattern suggests that warmup does not provide a uniform performance benefit and does not indicate a broad degradation in instruction following\. The main setting retains warmup to match the PRAG recipe and maintain a controlled comparison\.
### A\.2 Router: Impact of Top\-$k$ Experts
Figure 5: Impact of the number of activated experts\. Varying top\-$k$ across the tested range yields nearly unchanged downstream performance on all four benchmarks, indicating that DMoE is robust to this routing hyperparameter\.

Figure [5](https://arxiv.org/html/2606.14243#A1.F5) studies sensitivity to the number of activated experts while keeping the remaining setup fixed\. The curves remain nearly flat across CWQ, HotpotQA, Quasar\-T, and StrategyQA, showing that DMoE is highly robust to the choice of top\-$k$\. This behavior suggests that the router identifies a sufficiently relevant expert set even with a small activation budget, while increasing the number of activated experts brings only marginal changes in downstream quality\. Therefore, the default choice $k=3$ provides a simple and stable operating point without requiring careful tuning\.
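To make the role of top\-$k$ concrete, the sketch below ranks knowledge units with a plain Okapi BM25 scorer and returns the indices of the $k$ best\-matching units, each of which would correspond to one expert\. This is an illustrative reimplementation, not the paper's router: the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, the toy passages, and the function names are all assumptions\.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25 (whitespace tokens)."""
    docs = [p.lower().split() for p in passages]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this passage
            idf = math.log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def route_top_k(query, passages, k=3):
    """Return indices of the k highest-scoring passages, i.e. the experts to activate."""
    scores = bm25_scores(query, passages)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy knowledge units, one hypothetical expert per passage.
passages = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
    "bananas are yellow",
]
selected = route_top_k("capital of france", passages, k=2)
print(selected)
```

Raising `k` only appends lower\-ranked units to the selection, which is one plausible reason why downstream quality changes little once the most relevant experts are already included\.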
### A\.3 Router: TU\-Triggered Routing Analysis
This section analyzes token uncertainty \(TU\) as the trigger signal used by the router\. We focus on three questions: whether TU is aligned with downstream performance, how the trigger and router components contribute separately, and how sensitive DMoE is to the triggering threshold\.
#### A\.3\.1 TU and Downstream Performance
Table 5: DMoE accuracy across token\-uncertainty bins\. We group 300 evaluation examples into five entropy bins using the base model’s token uncertainty \(TU\) before expert activation and report DMoE performance within each bin\. Higher TU consistently corresponds to lower EM/F1, supporting TU as a useful, though imperfect, proxy for when the model lacks sufficient prior knowledge\.

Table [5](https://arxiv.org/html/2606.14243#A1.T5) sorts 300 evaluation examples by the base model’s token uncertainty before expert activation\. Low\-entropy bins correspond to cases where the base model is already relatively confident, whereas high\-entropy bins correspond to cases where the model has more limited prior knowledge\. The degradation is monotonic: EM drops from 0\.4167 in the lowest\-entropy bin to 0\.1333 in the highest\-entropy bin, and F1 drops from 0\.5593 to 0\.2085\. This trend indicates that TU is aligned with knowledge difficulty and is therefore a reasonable signal for deciding when to trigger expert assistance\.
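The binning procedure can be sketched as follows, assuming equal\-size bins over examples sorted by TU\. This assumption is consistent with the reported values, since 300 examples in five bins give 60 per bin, and $25/60\approx 0.4167$ and $8/60\approx 0.1333$\. The function name and the toy inputs are hypothetical and are not results from the paper\.

```python
def mean_metric_per_bin(tu_values, metric_values, n_bins=5):
    """Sort examples by TU, split into equal-size bins, and average the metric per bin."""
    order = sorted(range(len(tu_values)), key=lambda i: tu_values[i])
    size = len(order) // n_bins
    means = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder when the split is not exact.
        end = (b + 1) * size if b < n_bins - 1 else len(order)
        idx = order[start:end]
        means.append(sum(metric_values[i] for i in idx) / len(idx))
    return means


# Made-up TU scores and exact-match outcomes (1 = correct), for illustration only.
toy_tu = [0.1, 0.9, 0.2, 0.8, 0.5, 0.4]
toy_em = [1, 0, 1, 0, 1, 0]
per_bin_em = mean_metric_per_bin(toy_tu, toy_em, n_bins=3)
print(per_bin_em)
```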
#### A\.3\.2 Component Ablation: Trigger and Router
Table 6: Component ablation of trigger and router choices\. We disentangle when to activate experts from which experts to select\. The full TU\-triggered BM25 router performs best, showing that DMoE’s gains are not explained by indiscriminate expert activation\.

Table [6](https://arxiv.org/html/2606.14243#A1.T6) further disentangles the triggering decision from expert selection\. The full setting, which combines TU\-based triggering with the BM25 router, achieves the best EM and F1\. Replacing TU with a random trigger hurts performance, keeping TU but replacing BM25 with a random router also degrades F1, and always triggering experts collapses substantially\. These results show that the gains are not driven by indiscriminate expert activation: both deciding when to activate experts and selecting which experts to activate matter\. Finally, the oracle\-style expert router is only comparable to BM25, suggesting that BM25 is already a strong practical router in this parametric\-expert pipeline rather than the dominant bottleneck\.
#### A\.3\.3 Impact of Triggering Thresholds
Figure 6: Impact of triggering thresholds on effectiveness and inference speed\. We vary the TU threshold for expert activation\. Larger thresholds reduce triggering frequency and improve inference speed, while downstream performance remains comparatively stable across a broad threshold range\.

Figure [6](https://arxiv.org/html/2606.14243#A1.F6) studies the effect of the TU triggering threshold\. Increasing the threshold makes expert activation more selective, which reduces routing frequency and improves inference speed\. At the same time, downstream metrics remain relatively stable across the tested range, indicating that DMoE does not require a narrowly tuned threshold to maintain performance\. This behavior suggests that the TU criterion captures a useful region where expert activation is most needed, while allowing practitioners to adjust the threshold to trade off effectiveness and efficiency\.
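A minimal sketch of threshold\-based triggering is given below\. It assumes that TU is the Shannon entropy of the next\-token distribution \(the analysis above refers to entropy bins, but the exact definition is given in the method section\), and the function names, toy distributions, and threshold values are hypothetical\.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_trigger(probs, threshold):
    """Activate experts only when the base model's uncertainty exceeds the threshold."""
    return token_entropy(probs) > threshold


# Toy next-token distributions for three decoding steps (made up for illustration).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four tokens
    [0.60, 0.30, 0.05, 0.05],  # in between
]

for threshold in (0.1, 0.5, 1.0, 1.5):
    n_triggered = sum(should_trigger(p, threshold) for p in steps)
    print(f"threshold={threshold}: experts triggered on {n_triggered} of {len(steps)} steps")
```

Because the trigger is a simple comparison, raising the threshold can only keep or reduce the number of triggered steps, which matches the speed trend described above\.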
### A\.4 Router: Impact of Retriever Choice
Table 7: Impact of retriever choice for routing\. We compare the BM25 router used in the main experiments with SGPT\-based dense retrieval\. SGPT improves StrategyQA but does not consistently improve the other benchmarks, while using more GPU memory\.

Table [7](https://arxiv.org/html/2606.14243#A1.T7) studies whether replacing the default BM25 lexical router with SGPT dense retrieval improves expert selection\. SGPT yields higher accuracy on StrategyQA and slightly higher F1 on HotpotQA, but it underperforms BM25 on CWQ and Quasar\-T\. The efficiency profile is also mixed: SGPT is modestly faster in these runs, but consistently consumes about $0.6$ GB more GPU memory because dense retrieval introduces additional neural embedding computation and storage\. Since SGPT does not provide a stable effectiveness gain and increases memory usage, we use BM25 as the default router in the main experiments\.
### A\.5 Experts: Impact of LoRA Rank and Scaling Factor
Table 8: Impact of LoRA rank and scaling factor\. We vary the expert LoRA configuration $(r,\alpha)$ while keeping the remaining setup fixed\. Higher rank or larger scaling does not yield consistent gains across benchmarks\.

Table [8](https://arxiv.org/html/2606.14243#A1.T8) evaluates the sensitivity of DMoE experts to LoRA rank $r$ and scaling factor $\alpha$\. The default configuration $(4,16)$ achieves the best result on HotpotQA, CWQ, and Quasar\-T, while StrategyQA slightly favors $(8,32)$\. Overall, increasing rank or scaling factor does not produce a consistent improvement, suggesting that the expert adapters do not require large capacity to capture the targeted knowledge units used in our setting\. We therefore use $(4,16)$ as the default configuration because it provides the strongest overall trade\-off between effectiveness and parameter efficiency\.
## Appendix B Scalability and Cost Analysis
### B\.1 Expert Count and Storage Footprint
In DMoE, the number of experts is not a fixed architectural hyperparameter as in standard MoE models\. It is determined by how external knowledge is partitioned into knowledge units\. In our experiments, we instantiate one expert for each passage\-level knowledge unit, resulting in 27,613 corpus\-aligned experts for the Llama\-3\.2\-1B expert bank\. This design makes the expert bank naturally aligned with the retrieval corpus, but it is not the only possible granularity: in deployments with stronger storage constraints, multiple passages can be grouped into a document\-level or cluster\-level expert; conversely, finer units can be used when routing precision is more important\.
Each expert is lightweight\. With the default LoRA configuration, one expert contains 122,880 trainable parameters, corresponding to approximately 481 KiB on disk\. The full 27,613\-expert bank occupies about 13\.08 GiB of disk storage\. Importantly, this bank size does not translate into GPU memory pressure during inference\. DMoE keeps the expert bank on disk and loads only the top\-$k$ selected experts at triggered decoding steps; all inactive experts remain outside GPU memory\. Therefore, increasing the expert bank mainly affects disk footprint and index management, while per\-step GPU memory depends on the small number of active experts rather than the total number of stored experts\.
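The per-expert parameter count can be reproduced with a short calculation. The sketch below assumes the publicly documented Llama-3.2-1B dimensions (hidden size 2048, FFN intermediate size 8192), LoRA applied to the three projections of the final-layer FFN, and float32 storage. None of these is stated in this appendix, so they are assumptions that happen to match the reported 122,880 figure. The value of `TOP_K` is likewise hypothetical.

```python
# Assumed shapes (not stated in this appendix): Llama-3.2-1B hidden size 2048,
# FFN intermediate size 8192, LoRA on the gate, up, and down projections.
HIDDEN, INTERMEDIATE, RANK = 2048, 8192, 4
BYTES_PER_PARAM = 4      # assumes float32 checkpoints
NUM_EXPERTS = 27_613     # bank size reported in the text
TOP_K = 3                # hypothetical number of active experts per step


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


ffn_shapes = [
    (HIDDEN, INTERMEDIATE),   # gate projection
    (HIDDEN, INTERMEDIATE),   # up projection
    (INTERMEDIATE, HIDDEN),   # down projection
]
params_per_expert = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in ffn_shapes)
kib_per_expert = params_per_expert * BYTES_PER_PARAM / 1024
bank_gib = NUM_EXPERTS * kib_per_expert / 1024**2
active_mib = TOP_K * kib_per_expert / 1024

print(f"parameters per expert: {params_per_expert:,}")
print(f"raw weights per expert: {kib_per_expert:.0f} KiB")
print(f"raw weights, full bank: {bank_gib:.2f} GiB")
print(f"raw weights, {TOP_K} active experts: {active_mib:.2f} MiB")
```

These are raw-weight sizes only. The on-disk figures reported above are slightly larger, presumably because each checkpoint also carries file metadata and configuration. The last line illustrates why the active set, not the bank, determines GPU memory.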
### B\.2 Training and Update Cost
DMoE also differs from SFT\-LoRA in how updates are performed\. SFT\-LoRA trains a single adapter on a merged corpus\-scale dataset, so adding or modifying knowledge typically requires re\-merging the training data and then retraining, or continuing to train, the shared adapter\. In contrast, DMoE trains one small LoRA expert per knowledge unit independently\. Existing expert checkpoints are skipped automatically, so when the corpus grows, only experts for newly added knowledge units need to be trained\.
In our current implementation, training one new passage\-level expert takes about 10 seconds on a single NVIDIA A100 GPU\. Training is embarrassingly parallel because experts are independent: the 27,613\-expert bank corresponds to roughly 76\.7 A100 GPU\-hours under this per\-expert timing, and wall\-clock time can be reduced by distributing experts across multiple GPUs\. Updating obsolete knowledge is similarly local\. One can delete the corresponding adapter directory and remove the document from the BM25 index, without retraining the backbone or unrelated experts\.
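The update workflow described above can be sketched as follows. The directory layout (one adapter directory per knowledge unit), the function names, and the placeholder training routine are hypothetical; only the skip-if-exists behavior, the deletion of an adapter directory, and the 10-second per-expert timing come from the text. Removing the corresponding document from the BM25 index is a separate step that is not shown.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

SECONDS_PER_EXPERT = 10  # approximate per-expert training time reported above


def train_missing_experts(
    units: Dict[str, str],
    bank_dir: Path,
    train_fn: Callable[[str, Path], None],
) -> List[str]:
    """Train experts only for units that have no adapter directory yet."""
    trained = []
    for unit_id, passage in units.items():
        expert_dir = bank_dir / unit_id
        if expert_dir.exists():
            continue  # existing checkpoint: skip
        train_fn(passage, expert_dir)
        trained.append(unit_id)
    return trained


def remove_expert(unit_id: str, bank_dir: Path) -> None:
    """Delete an obsolete expert; the retrieval index must be updated separately."""
    shutil.rmtree(bank_dir / unit_id)


def placeholder_train(passage: str, expert_dir: Path) -> None:
    # Stand-in for real LoRA training: it only writes a marker file.
    expert_dir.mkdir(parents=True)
    (expert_dir / "adapter.txt").write_text(passage)


def gpu_hours(num_experts: int, seconds_per_expert: float = SECONDS_PER_EXPERT) -> float:
    return num_experts * seconds_per_expert / 3600


with tempfile.TemporaryDirectory() as tmp:
    bank = Path(tmp)
    first = train_missing_experts({"p1": "a", "p2": "b"}, bank, placeholder_train)
    second = train_missing_experts({"p1": "a", "p2": "b", "p3": "c"}, bank, placeholder_train)
    remove_expert("p1", bank)
    remaining = sorted(p.name for p in bank.iterdir())

print(first, second, remaining)
print(f"{gpu_hours(27_613):.1f} GPU-hours for the full bank")
```

The second call trains only the newly added unit, and the cost estimate is a straight multiplication, so it scales linearly with the number of new knowledge units.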
### B\.3 Scalability Trade\-off
The main scalability trade\-off is therefore not the number of active parameters at inference time, but the granularity of the expert bank\. A larger bank provides more fine\-grained routing targets but increases disk usage and index size; a smaller bank reduces storage and management overhead but may merge heterogeneous knowledge into the same expert\. This trade\-off is controllable because knowledge units can be defined at different granularities before expert construction\.
We consider two related but distinct dimensions of expert capacity\. First, the capacity of each individual expert is controlled by the LoRA rank and scaling factor; as shown in Appendix [A\.5](https://arxiv.org/html/2606.14243#A1.SS5), this effect is non\-monotonic, and the default $(r=4, \alpha=16)$ setting is competitive or best on three of the four datasets\. Second, the total expert\-bank size is controlled by how many knowledge units are represented as separate experts\. Table [9](https://arxiv.org/html/2606.14243#A2.T9) reports the latter ablation\.
Table 9: Expert\-bank scaling ablation\. We reduce the expert bank to different fractions of the full corpus\-aligned bank and report downstream performance\. The full bank is not uniformly best, indicating that DMoE is not brittle to the exact expert count\.

The results show reasonably stable performance across substantial reductions in bank size\. HotpotQA slightly improves at 1/10 of the bank, StrategyQA is best at 1/5, while CWQ and Quasar\-T favor the full bank or tie with it\. This pattern indicates that DMoE does not require a narrowly tuned expert count to remain functional\.
## Appendix C Prompt Design
This appendix lists the prompt templates used for document augmentation and evaluation\. The augmentation prompts are used to construct document\-specific parametric knowledge representations, while the evaluation prompt follows the few\-shot reasoning format used in our experiments\.
### C\.1 Prompts for Document Augmentation
To construct document\-specific parametric knowledge representations, we employ two augmentation techniques: multi\-style rewriting and QA generation\.
##### Document Rewriting
We utilize rewriting to diversify surface forms while preserving semantic content\.
Prompt 1: Document Rewriting

Rewrite the following passage\. While keeping the entities, proper nouns, and key details such as names, locations, and terminology intact, create a new version of the text that expresses the same ideas in a different way\. Make sure the revised passage is distinct from the original one, but preserves the core meaning and relevant information\.

Passage: passage
##### QA Pair Generation
This prompt is used to generate multiple QA pairs from a passage for downstream training\.
Prompt 2: Question\-Answer Generation

I will provide a passage of text, and you need to generate three different questions based on the content of this passage\. Each question should be answerable using the information provided in the passage\. Additionally, please provide an appropriate answer for each question derived from the passage\.

You need to generate the question and answer in the following format:

\[\{"question": "What is the capital of France?", "answer": "Paris", "full\_answer": "The capital of France is Paris\."\},\]

This list should have at least three elements\. You only need to output this list in the above format\.

Passage: passage
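Because the format example in Prompt 2 ends with a trailing comma before the closing bracket, responses that imitate it are not strict JSON. The following is a minimal sketch of a lenient parser for such responses. The function name, the required-key check, and the trailing-comma cleanup are assumptions for illustration, not the authors' post-processing.

```python
import json
import re
from typing import Dict, List

REQUIRED_KEYS = {"question", "answer", "full_answer"}


def parse_qa_output(raw: str) -> List[Dict[str, str]]:
    """Extract a list of QA dicts from a model response, tolerating `},]`."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    snippet = raw[start : end + 1]
    # Drop commas that directly precede a closing bracket or brace.
    # Caveat: this would also alter such a sequence inside a quoted string.
    snippet = re.sub(r",\s*([\]}])", r"\1", snippet)
    try:
        items = json.loads(snippet)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [
        item for item in items
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]


response = (
    'Here is the list: [{"question": "What is the capital of France?", '
    '"answer": "Paris", "full_answer": "The capital of France is Paris."},]'
)
pairs = parse_qa_output(response)
print(pairs[0]["answer"])
```

Entries missing any of the three keys are dropped rather than repaired, so downstream training only sees complete QA pairs.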
### C\.2 Prompt for HotpotQA
Following prior work such as FLARE \(Jiang et al\., [2023](https://arxiv.org/html/2606.14243#bib.bib52)\) and DRAGIN \(Su et al\., [2024d](https://arxiv.org/html/2606.14243#bib.bib53)\), we adopt a similar few\-shot prompting strategy\. All compared methods use the same prompt format for a fair evaluation\.
We use reasoning exemplars to guide the model toward concise multi\-hop answers\.
Prompt 3: Few\-Shot Prompt for HotpotQA

You should reference the knowledge provided below and combine it with your own knowledge to answer the question\. Please follow the format of the example I provided above\.

Question: Jeremy Theobald and Christopher Nolan share what profession?
Answer: Jeremy Theobald is an actor and producer\. Christopher Nolan is a director, producer, and screenwriter\. Therefore, they both share the profession of being a producer\. So the answer is producer\.

Question: Were Lonny and Allure both founded in the 1990s?
Answer: Lonny \(magazine\) was founded in 2009\. Allure \(magazine\) was founded in 1991\. Thus, of the two, only Allure was founded in 1990s\. So the answer is no\.

… \[additional examples truncated for brevity\] …

Question: Were Scott Derrickson and Ed Wood of the same nationality?