WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Summary
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
Source: https://huggingface.co/papers/2604.14932
Abstract
Spoken dialogue models still face expressiveness challenges even with end-to-end approaches, but a modality-aware adaptive post-training method using constrained preference updates and explicit anchoring improves both semantic quality and speech expressiveness.
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
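To make the recipe concrete, below is a minimal sketch of how its three ingredients could fit together: a DPO-style preference term masked to semantic-channel tokens, a negative log-likelihood anchor on acoustic tokens, and a gate computed from rollout reward margins that suppresses unreliable preference gradients. All tensor names, the padded-sequence layout, and the margin-based gating rule are illustrative assumptions, not the paper's released implementation.

# Hypothetical sketch of a modality-aware hybrid post-training loss.
# Assumes chosen/rejected rollouts are padded to a common length T and
# that each token is labeled as semantic (1) or acoustic (0).
import torch
import torch.nn.functional as F

def hybrid_post_training_loss(
    policy_logps_chosen,    # (B, T) per-token log-probs of chosen rollout
    policy_logps_rejected,  # (B, T) per-token log-probs of rejected rollout
    ref_logps_chosen,       # (B, T) same quantities under a frozen reference
    ref_logps_rejected,
    semantic_mask,          # (B, T) 1.0 for semantic tokens, 0.0 for acoustic
    anchor_logps,           # (B, T) policy log-probs of trusted acoustic targets
    reward_margins,         # (B,) rollout reward gap between chosen and rejected
    beta=0.1,
    margin_threshold=0.05,
):
    # 1) DPO-style preference loss, restricted to the semantic channel.
    sem = semantic_mask
    pi_gap = ((policy_logps_chosen - policy_logps_rejected) * sem).sum(-1)
    ref_gap = ((ref_logps_chosen - ref_logps_rejected) * sem).sum(-1)
    pref_loss = -F.logsigmoid(beta * (pi_gap - ref_gap))

    # 2) Explicit anchoring on acoustic tokens: keep the policy close to a
    #    trusted acoustic distribution via per-token negative log-likelihood.
    aco = 1.0 - semantic_mask
    anchor_loss = -(anchor_logps * aco).sum(-1) / aco.sum(-1).clamp(min=1.0)

    # 3) Adaptive mixture from rollout statistics: drop preference gradients
    #    when the reward margin is too small to be trustworthy.
    gate = (reward_margins.abs() > margin_threshold).float()
    return (gate * pref_loss + anchor_loss).mean()

if __name__ == "__main__":
    # Toy usage with random stand-in tensors (batch of 2, length 8).
    B, T = 2, 8
    rand_logps = lambda: -torch.rand(B, T)        # plausible log-prob values
    mask = (torch.rand(B, T) > 0.5).float()       # random semantic/acoustic split
    loss = hybrid_post_training_loss(
        rand_logps(), rand_logps(), rand_logps(), rand_logps(),
        mask, rand_logps(), torch.randn(B))
    print(loss.item())

Gating on the reward margin is just one plausible reading of "regulating the mixture from rollout statistics"; a gate based on reward variance or judge agreement across rollouts would fit the abstract's description equally well.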
Get this paper in your agent:
hf papers read 2604.14932
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub)
easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.
WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
WildFeedback is a novel framework that leverages in-situ user feedback from actual LLM conversations to automatically create preference datasets for aligning language models with human preferences, addressing scalability and bias issues in traditional annotation-based alignment methods.
VoxMind: An End-to-End Agentic Spoken Dialogue System
VoxMind introduces an end-to-end spoken dialogue system with agentic capabilities, using a "Think-before-Speak" mechanism and dynamic tool management to boost task completion from 34.88% to 74.57% while preserving conversational quality.
Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
Researchers from KTH Royal Institute of Technology propose a two-stage framework that fine-tunes LLMs on dialogue transcripts and uses contrastive learning to create joint embeddings for aligning backchannel signals with conversational context, demonstrating improved context-backchannel retrieval compared to previous methods.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.