MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

MoshiRAG combines a compact full-duplex speech language model with asynchronous retrieval-augmented generation to improve factuality while maintaining real-time interactivity. The approach leverages natural temporal gaps in conversation to retrieve external knowledge without disrupting the natural flow of dialogue.

arXiv:2604.12928v2 Announce Type: replace Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.


Cached at: 04/20/26, 08:32 AM

# Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Source: https://arxiv.org/html/2604.12928

Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez

###### Abstract

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

Keywords: Full-Duplex, Retrieval Augmented Generation, Voice Assistant, Factuality, Speech Language Model, Moshi

## 1 Introduction

Building voice interfaces for artificial intelligence (AI) systems capable of assisting humans across a wide range of scenarios has long been central to visions of future technology. A user-friendly voice interface should create a natural conversation experience, allowing users to communicate with AI systems as if they were speaking to a real human assistant.
Earlier approaches typically combined multiple components – such as automatic speech recognition (ASR), text-based dialogue management, and text-to-speech (TTS) synthesis – and optimized them for conversational use cases (Seneff et al., 1998; Levin et al., 2000; Bohus and Rudnicky, 2009). More recent research has shifted toward end-to-end approaches to avoid the loss of information such as prosody, rhythm, and intonation that speech-to-text conversion introduces, while also reducing latency and friction caused by cascaded pipelines (Zhang et al., 2023; Nachmani et al., 2024; Xie and Wu, 2024; Fang et al., 2025a; Zeng et al., 2024). Among modern frameworks, full-duplex models (Défossez et al., 2024; Yu et al., 2025) are distinguished by their ability to “listen while speaking,” in contrast to turn-based methods that process speech in large chunks (e.g., sentences) and allow transitioning between listening and speaking states only after each chunk is completed (see Figure 1). The capability to concurrently receive speech inputs and generate responses enables full-duplex models to react more promptly to user inputs (Zhang et al., 2025; Chen et al., 2025a) and to better model the complex interactivity of real-world conversation (Veluri et al., 2024; Yu et al., 2025; Roy et al., 2026). However, the full-duplex approach also introduces unique challenges such as the need for real-time speech processing and generation. Meanwhile, recent studies indicate that native audio models struggle more than text models with tasks requiring factuality, such as question answering (Wang et al., 2025a). This reduced factuality is at least in part due to the much smaller amounts of speech data than text data (in terms of number of words) available for training.

Figure 1: Illustration of turn-based models versus full-duplex models. The former must explicitly switch between speaking and listening states, while the latter can concurrently speak and listen.

To address the challenge of improving factuality while maintaining interactivity, we propose MoshiRAG, the first full-duplex voice model equipped with retrieval-augmented generation (RAG) capability, built as an extension of the full-duplex speech LM Moshi (Défossez et al., 2024). While RAG has become a widely adopted technique for enhancing the factuality of large language models (LLMs) (Lewis et al., 2020), its integration into full-duplex voice systems remains largely unexplored due to the strict real-time constraints imposed by continuous speech interaction. We tackle this challenge by exploiting the natural temporal gap between the onset of a spoken response and the emergence of its key informational content (the “keyword delay” in Figure 2). Leveraging this observation, we design specialized fine-tuning data that trains Moshi to predict a retrieval trigger signal when the user poses knowledge-intensive queries. This signal asynchronously invokes an information retrieval system to generate reference documents relevant to the conversation context. The retrieved information is then incorporated into the response generation process before the key content is reached. We design the RAG mechanism so as to guarantee that the entire retrieval process completes within two seconds – shorter than the keyword delay of many existing speech LMs (see Table 1). In addition to improving factuality without compromising interactivity, MoshiRAG is retrieval-back-end agnostic, enabling seamless integration of different retrieval methods – such as LLM-based retrievers or search engines – as long as they can provide textual references within a reasonable time. This design offers flexibility and extensibility for future improvements. Experimental results demonstrate that MoshiRAG significantly improves the factuality of Moshi on question answering (QA) benchmarks while maintaining good interactivity in speech conversation as measured by full-duplex benchmarks (Lin et al., 2025b, a). 
We further show that performance can be enhanced at inference time by simply switching to more powerful retrieval back ends without retraining the base model. Finally, we demonstrate that MoshiRAG generalizes well to previously unseen mathematical reasoning tasks, which are challenging for both the original Moshi and other speech LMs. This can be viewed as an early exploration of the tool-use capabilities of full-duplex models, where Moshi effectively leverages an LLM as an external tool to solve mathematical tasks. Our results suggest the broader potential for enabling general tool use in full-duplex models and demonstrate the promise of building more powerful, reliable, and user-friendly voice AI assistants by combining real-time interactive voice interfaces with more capable problem-solving mechanisms.

## 2 Related Work

Since dGSLM (Nguyen et al., 2023) initiated research on end-to-end multi-speaker conversational modeling (Veluri et al., 2024; Wang et al., 2025b), duplex models have emerged as an increasingly prominent direction. To jointly model user and system speech, one line of work adopts time-multiplexing approaches (Zhang et al., 2025; Chen et al., 2025a; Mai and Carson-Berndsen, 2025), in which the model alternates between processing fixed-duration chunks of user input and generating responses of the same duration. In contrast, models with a dual-channel architecture like Moshi (Défossez et al., 2024; Yu et al., 2025; Hu et al., 2025; Yao et al., 2025; Roy et al., 2026) enable high frame-rate, simultaneous modeling of input and output speech streams. To improve the factuality of speech dialogue models, recent works have incorporated RAG (Min et al., 2025; Rackauckas and Hirschberg, 2025; Chen et al., 2025b; Feng et al., 2025). The concurrent work Stream RAG (Arora et al., 2025) is particularly related, as it similarly exploits temporal gaps in spoken conversations to perform information retrieval.
However, existing approaches are designed for non-full-duplex settings and do not address the strict timing constraints in real-time full-duplex conversations. Moreover, while prior methods retrieve information from fixed, pre-indexed corpora, we extend this paradigm to open-domain QA by retrieving information directly from the web. Beyond RAG, alternative approaches such as chain-of-thought reasoning for audio and speech models (Zhifei et al., 2025; Ma et al., 2025; Chiang et al., 2025b, a; Shih et al., 2025) have also been explored; these techniques are complementary to our framework and could be naturally combined in future work.

## 3 System Design

The MoshiRAG framework is built upon Moshi (Défossez et al., 2024). To integrate external information into Moshi’s response generation, we first analyze the timing constraints in human-machine speech conversations. Based on this analysis, we propose a framework consisting of a full-duplex front end and an asynchronous retrieval back end that operate in parallel, enabling the model to maintain interactivity while incorporating externally retrieved knowledge in real time.

Figure 2: Different types of delays in human-machine conversations. End-to-end keyword delay (E2EKD) measures the time between the end of the user’s question and the most informative word in the response. Retrieval delay measures how long it takes for the back end to provide relevant information.

### 3.1 Timing Constraints

Below, we introduce some terminology related to latency in human-machine conversation (illustrated in Figure 2):

- **Time-to-first-audio-token (TTFAT)**: the audio-domain counterpart of the commonly used time-to-first-token (TTFT) metric for LLMs. We define TTFAT as the delay between the end of a user’s utterance and the moment the model generates the first audio token of its response.¹
- **Keyword delay**: time interval from the beginning of the model’s spoken response to the point at which the key content (i.e., a keyword that directly answers the user’s query, if any) first appears. See Section 5.2 for details.
- **End-to-end keyword delay (E2EKD)**: the total time from the end of the user’s query to the moment the keyword is mentioned in the model’s response. By definition, E2EKD is the sum of TTFAT and keyword delay.
- **Retrieval delay**: the time from the prediction of a retrieval trigger to the completion of the retrieval process.

¹ This definition focuses on content generation latency and excludes the time for token-to-waveform conversion, e.g., the codec or vocoder, which is orthogonal to the scope of this work.

E2EKD is a critical perceptual metric, as it determines how quickly meaningful information is delivered to the user. For retrieval-augmented systems, assuming that the retrieval is not triggered before the user query finishes, the retrieval delay must be shorter than the E2EKD in order for the retrieved information to be integrated into the response in time. Our preliminary analysis shows that the E2EKD of existing speech LMs often exceeds 3 seconds (see Table 1). Accordingly, we target for MoshiRAG a retrieval delay of no more than 2 seconds during both data construction and model training, ensuring that external knowledge can be effectively integrated without compromising real-time interaction quality.
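
The relationships among these latency terms can be written down as a small Python sketch. It is only an illustration of the definitions above, not code from the paper: the class name `TurnTiming`, its field names, and the example timestamps are all hypothetical, and the 2-second default simply mirrors the retrieval-delay target stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnTiming:
    """Timestamps (seconds, one shared clock) for a single exchange.

    All names here are illustrative assumptions, not the paper's code.
    """
    user_end: float           # end of the user's utterance
    first_audio_token: float  # model generates its first audio token
    keyword: float            # keyword first appears in the response
    trigger: float            # retrieval trigger is predicted
    retrieval_done: float     # retrieval result becomes available

    @property
    def ttfat(self) -> float:
        return self.first_audio_token - self.user_end

    @property
    def keyword_delay(self) -> float:
        return self.keyword - self.first_audio_token

    @property
    def e2ekd(self) -> float:
        # By definition, E2EKD = TTFAT + keyword delay.
        return self.ttfat + self.keyword_delay

    @property
    def retrieval_delay(self) -> float:
        return self.retrieval_done - self.trigger

    def within_budget(self, budget_s: float = 2.0) -> bool:
        # The section targets a retrieval delay of at most 2 seconds.
        return self.retrieval_delay <= budget_s

    def arrives_before_keyword(self) -> bool:
        # The retrieved text is only useful if it is available before the
        # keyword is spoken. When the trigger fires after the user stops
        # speaking, this implies retrieval_delay < E2EKD.
        return self.trigger >= self.user_end and self.retrieval_done < self.keyword


if __name__ == "__main__":
    # Made-up timestamps, chosen only to exercise the definitions.
    t = TurnTiming(user_end=10.0, first_audio_token=10.5, keyword=13.2,
                   trigger=10.2, retrieval_done=12.0)
    print(f"TTFAT={t.ttfat:.2f}s  E2EKD={t.e2ekd:.2f}s  "
          f"retrieval={t.retrieval_delay:.2f}s  in time={t.arrives_before_keyword()}")
```

The `arrives_before_keyword` check is slightly stricter than comparing the two delays, because it also accounts for when the trigger fires relative to the end of the user's utterance.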
### 3.2 System Overview

In this paper, we define the **front end** as the modules that directly receive or generate audio to communicate with the user in real time, while the **back end** consists of components that do not directly interact with the user. For example, in a traditional cascaded ASR–dialogue–TTS system, the ASR and TTS modules are front-end components by this definition, whereas the text-based dialogue management system belongs to the back end. To optimize user experience, the front end must provide immediate feedback and reactions to user inputs. In contrast, the back end can prioritize factuality and reasoning, such as planning dialogue flow, selecting correct information, or managing topics, and benefits from greater time flexibility since it does not operate under strict real-time constraints.

In this work, we use the original Moshi model (with minor modifications) as the full-duplex front end, while an asynchronous information retrieval system operates in parallel as the back end. Since most information retrieval systems are text-based, an additional streaming ASR model is used to transcribe user speech into text for retrieval purposes.² This ASR model directly receives speech inputs and thus, by definition, is part of the front end. Figure 3 provides a conceptual overview of the system. The lack of synchronization between the front end and back end allows the system to effectively “think while listening and speaking,” similar to human cognitive abilities.

² Although it is possible to build the transcription functionality into the main Moshi model, we use a separate ASR model to minimize training efforts.

Figure 3: Illustration of the front-end and back-end components in MoshiRAG. When the model needs external information, it outputs a ⟨ret⟩ token. The conversation transcript is sent to the back end which operates asynchronously. Once ready, the result is injected into Moshi which then adapts its response with no interruption.
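
The front-end/back-end split described above can be mimicked with a toy `asyncio` loop. This is a rough sketch under heavy assumptions, not the MoshiRAG implementation: the function names, the `"<ret>"` string standing in for the ⟨ret⟩ token, the `[inject: ...]` marker standing in for injection into the model's context, and the sleep-based timings are all hypothetical. The retriever is passed in as a plain async callable to reflect the idea that any back end returning text within the time budget could be swapped in.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Optional

RET = "<ret>"  # stand-in for the retrieval trigger token (assumption)
Retriever = Callable[[str], Awaitable[str]]  # transcript -> reference text


async def toy_retriever(transcript: str) -> str:
    """Hypothetical back end; a real one might call an LLM or a search engine."""
    await asyncio.sleep(0.01)  # simulated retrieval latency
    return f"reference for: {transcript}"


async def run_front_end(tokens: Iterable[str], transcript: str,
                        retriever: Retriever, budget_s: float = 2.0,
                        frame_s: float = 0.05) -> List[str]:
    """Emit tokens without ever blocking on the back end."""
    emitted: List[str] = []
    pending: Optional[asyncio.Task] = None
    for tok in tokens:
        if tok == RET and pending is None:
            # Launch retrieval in the background and keep "speaking".
            pending = asyncio.create_task(
                asyncio.wait_for(retriever(transcript), timeout=budget_s))
            continue
        if pending is not None and pending.done():
            # Result is ready: record it at this point in the stream.
            try:
                emitted.append(f"[inject: {pending.result()}]")
            except asyncio.TimeoutError:
                emitted.append("[inject: none]")  # budget exceeded
            pending = None
        emitted.append(tok)
        await asyncio.sleep(frame_s)  # one generation step; lets the back end run
    if pending is not None and not pending.done():
        pending.cancel()  # response ended before retrieval finished
    return emitted


if __name__ == "__main__":
    stream = ["Sure,", RET, "the", "capital", "is", "Paris."]
    print(asyncio.run(run_front_end(stream, "capital of France?", toy_retriever)))
```

In this toy version the filler tokens emitted after the trigger play the role of the keyword delay: they buy time for the background task to finish before the informative part of the response is reached.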

During a speech conversation, the front-end Moshi takes user speech tokens encoded by the Mimi codec encoder (Défossez et al., 2024) as input and autoregressively predicts both textual transcriptions (with padding tokens inserted) and corresponding speech tokens for the model’s response in separate channels. The


Similar Articles

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

RAG-Anything: All-in-One RAG Framework

LightRAG: Simple and Fast Retrieval-Augmented Generation

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation
