Tag
ParaBridge is an on-policy self-distillation method that bridges the gap between paralinguistic perception and dialogue behavior in speech language models, significantly improving safety and empathy without external rewards.
This paper investigates whether factual recall mechanisms learned in text-based language models transfer to the speech modality in multimodal speech-language models. Using causal mediation analysis on SpiritLM, it finds that these mechanisms carry over only partially, highlighting differences between text and speech processing.
The Awesome-SpeechLM-Survey repository on GitHub systematically organizes the research lineage of speech language models, including classification frameworks, representative models, training datasets, and evaluation benchmarks. It serves as a knowledge map for understanding the field.
MoshiRAG combines a compact full-duplex speech language model with asynchronous retrieval-augmented generation to improve factuality while maintaining real-time interactivity. The approach uses the temporal gaps that occur naturally in conversation to retrieve external knowledge without disrupting the flow of dialogue.
MTR-DuplexBench is a benchmark for evaluating full-duplex speech language models in multi-round conversations. It addresses challenges such as blurred turn boundaries and context inconsistency, and assesses conversational features, dialogue quality, instruction following, and safety.