WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Summary
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
Source: https://huggingface.co/papers/2604.14932
Abstract
Spoken dialogue models still face expressiveness challenges even with end-to-end approaches, but a modality-aware adaptive post-training method using constrained preference updates and explicit anchoring improves both semantic quality and speech expressiveness.
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
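To make the recipe concrete, below is a minimal sketch of how its three ingredients could fit together: a DPO-style preference term masked to semantic-channel tokens, a negative log-likelihood anchor on acoustic tokens, and a gate computed from rollout reward margins that suppresses unreliable preference gradients. All tensor names, the padded-sequence layout, and the margin-based gating rule are illustrative assumptions, not the paper's released implementation.

# Hypothetical sketch of a modality-aware hybrid post-training loss.
# Assumes chosen/rejected rollouts are padded to a common length T and
# that each token is labeled as semantic (1) or acoustic (0).
import torch
import torch.nn.functional as F

def hybrid_post_training_loss(
    policy_logps_chosen,    # (B, T) per-token log-probs of chosen rollout
    policy_logps_rejected,  # (B, T) per-token log-probs of rejected rollout
    ref_logps_chosen,       # (B, T) same quantities under a frozen reference
    ref_logps_rejected,
    semantic_mask,          # (B, T) 1.0 for semantic tokens, 0.0 for acoustic
    anchor_logps,           # (B, T) policy log-probs of trusted acoustic targets
    reward_margins,         # (B,) rollout reward gap between chosen and rejected
    beta=0.1,
    margin_threshold=0.05,
):
    # 1) DPO-style preference loss, restricted to the semantic channel.
    sem = semantic_mask
    pi_gap = ((policy_logps_chosen - policy_logps_rejected) * sem).sum(-1)
    ref_gap = ((ref_logps_chosen - ref_logps_rejected) * sem).sum(-1)
    pref_loss = -F.logsigmoid(beta * (pi_gap - ref_gap))

    # 2) Explicit anchoring on acoustic tokens: keep the policy close to a
    #    trusted acoustic distribution via per-token negative log-likelihood.
    aco = 1.0 - semantic_mask
    anchor_loss = -(anchor_logps * aco).sum(-1) / aco.sum(-1).clamp(min=1.0)

    # 3) Adaptive mixture from rollout statistics: drop preference gradients
    #    when the reward margin is too small to be trustworthy.
    gate = (reward_margins.abs() > margin_threshold).float()
    return (gate * pref_loss + anchor_loss).mean()

if __name__ == "__main__":
    # Toy usage with random stand-in tensors (batch of 2, length 8).
    B, T = 2, 8
    rand_logps = lambda: -torch.rand(B, T)        # plausible log-prob values
    mask = (torch.rand(B, T) > 0.5).float()       # random semantic/acoustic split
    loss = hybrid_post_training_loss(
        rand_logps(), rand_logps(), rand_logps(), rand_logps(),
        mask, rand_logps(), torch.randn(B))
    print(loss.item())

Gating on the reward margin is just one plausible reading of "regulating the mixture from rollout statistics"; a gate based on reward variance or judge agreement across rollouts would fit the abstract's description equally well.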
Get this paper in your agent:
hf papers read 2604.14932
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub)
easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.
WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
WildFeedback is a novel framework that leverages in-situ user feedback from actual LLM conversations to automatically create preference datasets for aligning language models with human preferences, addressing scalability and bias issues in traditional annotation-based alignment methods.
VoxMind: An End-to-End Agentic Spoken Dialogue System
VoxMind introduces an end-to-end spoken dialogue system with agentic capabilities, using a "Think-before-Speak" mechanism and dynamic tool management to boost task completion from 34.88% to 74.57% while preserving conversational quality.
Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
Researchers from KTH Royal Institute of Technology propose a two-stage framework that fine-tunes LLMs on dialogue transcripts and uses contrastive learning to create joint embeddings for aligning backchannel signals with conversational context, demonstrating improved context-backchannel retrieval compared to previous methods.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.