Learning task-specific subspaces via interventional post-training of speech foundation models
Summary
This paper proposes a post-training refinement approach that uses interventional contrastive learning to map the entangled representations of speech foundation models into separate content and speaker subspaces. Evaluated on speaker verification and keyword spotting, the method improves out-of-domain speaker verification and shows evidence that speaker and content information are separated across the learned subspaces.
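The paper's abstract mentions an interventional dataset and a multi-part contrastive loss, but this page gives no formulation. The following is only a minimal sketch of what a two-part contrastive objective over content and speaker subspaces might look like. Everything in it is an assumption made for illustration: the use of InfoNCE, the linear projections `W_content` and `W_speaker`, the pairing scheme (`x_same_content` standing in for an utterance whose speaker was changed, `x_same_speaker` for one whose content was changed), the temperature, and the random stand-in embeddings. It is not the authors' method.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchors, positives, temperature=0.1):
    # Row i of `positives` is the positive for row i of `anchors`;
    # all other rows in the batch serve as negatives.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                        # shape (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def two_part_loss(x, x_same_content, x_same_speaker,
                  W_content, W_speaker, temperature=0.1):
    # Hypothetical two-part objective: one contrastive term per subspace.
    # Content term: pairs that share content should be close in the content subspace.
    content_term = info_nce(x @ W_content, x_same_content @ W_content, temperature)
    # Speaker term: pairs that share a speaker should be close in the speaker subspace.
    speaker_term = info_nce(x @ W_speaker, x_same_speaker @ W_speaker, temperature)
    return content_term + speaker_term

# Random stand-ins for foundation-model embeddings (assumption: batch 8, dim 32).
rng = np.random.default_rng(0)
B, D, K = 8, 32, 4
x = rng.normal(size=(B, D))
x_same_content = x + 0.1 * rng.normal(size=(B, D))  # placeholder for a speaker intervention
x_same_speaker = x + 0.1 * rng.normal(size=(B, D))  # placeholder for a content intervention
W_content = rng.normal(size=(D, K))
W_speaker = rng.normal(size=(D, K))

loss = two_part_loss(x, x_same_content, x_same_speaker, W_content, W_speaker)
print(loss)
```

In a real training setup the two projections would be learned by gradient descent and the paired inputs would come from the interventional dataset; the sketch only evaluates the loss once to show how the terms combine.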
# Learning task-specific subspaces via interventional post-training of speech foundation models

Source: [https://arxiv.org/abs/2606.17967](https://arxiv.org/abs/2606.17967)

[View PDF](https://arxiv.org/pdf/2606.17967)

> Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.

## Submission history

From: Jack Cox

**[v1]** Tue, 16 Jun 2026 14:18:20 UTC (39 KB)
Similar Articles
Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
Researchers from KTH Royal Institute of Technology propose a two-stage framework that fine-tunes LLMs on dialogue transcripts and uses contrastive learning to create joint embeddings for aligning backchannel signals with conversational context, demonstrating improved context-backchannel retrieval compared to previous methods.
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
Selective Capability Unlearning in End-to-End Spoken Language Understanding
Proposes BindingSubspace (BSU), a representation-level framework that isolates and attenuates intent-conditioned directions in end-to-end spoken language understanding models to prevent capability persistence, where suppressing an intent still allows slot generation under forced prefixes. The method reduces forced-prefix recoverability while preserving retained performance on SLU benchmarks.
Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition
Proposes a POI-aware contrastive training framework using LLM-generated near-misses to improve ASR robustness at code-switching regions, achieving consistent error reductions on two benchmarks.
Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
This paper identifies a Rank-1 Subspace phenomenon in LLM pre-training trajectories and proposes Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss, achieving consistent zero-shot accuracy gains across GPT-2 and LLaMA families up to 2B parameters.