Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Hugging Face Daily Papers

Summary

This paper applies sparse autoencoders to the CosyVoice3 text-to-speech language model, discovering interpretable features that can be steered to control attributes like laughter, speaker gender, and speech rate while preserving content.

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

Cached at: 06/10/26, 09:43 AM

Source: https://huggingface.co/papers/2606.10029

Abstract

Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes.

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
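
The abstract names BatchTopK as the sparsity mechanism but gives no hyperparameters. As a rough illustration of the general idea only (not the authors' code), the sketch below keeps the k × batch_size largest activations across a whole batch instead of the top k per token, so individual tokens may keep more or fewer than k latents. The function name `batch_topk`, the toy pre-activations, and the choice k=1 are all assumptions made for this example.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive activations across the batch.

    pre_acts: list of rows (one per token), each a list of SAE latent
    pre-activations. Everything outside the batch-level budget is zeroed.
    """
    budget = k * len(pre_acts)
    # ReLU first, so only positive activations compete for the budget.
    acts = [[v if v > 0.0 else 0.0 for v in row] for row in pre_acts]
    # Rank every (value, row, column) entry in the batch together.
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row) if v > 0.0),
        key=lambda t: t[0],
        reverse=True,
    )
    sparse = [[0.0] * len(row) for row in acts]
    for v, i, j in ranked[:budget]:
        sparse[i][j] = v
    return sparse


# Toy batch of two tokens with four latents each (made-up numbers).
pre_acts = [
    [0.9, 0.1, 0.8, -0.3],
    [0.2, 0.05, -0.1, 0.0],
]
sparse = batch_topk(pre_acts, k=1)
print(sparse)
```

With k=1 and two tokens the budget is two activations, and both go to the first token here, which is exactly the flexibility that distinguishes a batch-level budget from a per-token one.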

Community

Paper submitter

Bringing SAEs to text-to-speech models!

Currently, control over TTS models such as CosyVoice3 is limited to prompts or predefined tags. We found that model generations can be precisely edited by steering SAE features.
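
The comment does not say exactly how the steering intervention is applied. One common form, shown here purely as an assumption, clamps a single SAE feature to a target activation and adds only the resulting change back into the residual stream, which amounts to shifting the hidden state along that feature's decoder direction. The names `steer` and `laugh_direction` and all numbers are hypothetical.

```python
def steer(hidden, decoder_row, current_act, target_act):
    """Shift one residual-stream vector along a single SAE decoder direction.

    Clamping the feature from current_act to target_act changes the SAE
    reconstruction by (target_act - current_act) * decoder_row. Adding only
    that change leaves the part of hidden the SAE does not explain untouched.
    """
    delta = target_act - current_act
    return [h + delta * d for h, d in zip(hidden, decoder_row)]


hidden = [1.0, 2.0, 3.0]            # toy residual-stream vector
laugh_direction = [0.0, 1.0, 0.0]   # hypothetical decoder row for one feature
steered = steer(hidden, laugh_direction, current_act=0.5, target_act=4.5)
print(steered)
```

In a real model this shift would be applied at the hooked layer for every generated speech token; the target activation is the knob that trades off attribute strength against fidelity to the original content.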

We also analyze these features: some are audio-only, others activate only on text, and some activate on both text and audio. Additionally, we introduce an auto-interp pipeline that covers all three kinds.
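
The split into audio-only, text-only, and mixed features can be pictured as a routing step that looks at where a feature fires within a sequence. The following is a minimal sketch under stated assumptions, not the pipeline from the paper: the function `label_modality`, the firing threshold, and the 10% tolerance (`min_share`) are invented for illustration.

```python
def label_modality(activations, is_speech, threshold=0.0, min_share=0.1):
    """Label one feature by the token type on which it fires.

    activations: per-token activation of the feature over a sequence.
    is_speech:   parallel booleans, True for generated speech tokens and
                 False for text-prefix tokens.
    """
    pairs = list(zip(activations, is_speech))
    text_fires = sum(1 for a, s in pairs if a > threshold and not s)
    speech_fires = sum(1 for a, s in pairs if a > threshold and s)
    total = text_fires + speech_fires
    if total == 0:
        return "inactive"
    speech_share = speech_fires / total
    if speech_share >= 1.0 - min_share:
        return "audio-only"
    if speech_share <= min_share:
        return "text-only"
    return "both"


# Two text-prefix tokens followed by three speech tokens (toy layout).
is_speech = [False, False, True, True, True]
print(label_modality([0.0, 0.0, 0.7, 0.0, 1.2], is_speech))
print(label_modality([0.4, 0.9, 0.0, 0.0, 0.0], is_speech))
print(label_modality([0.4, 0.0, 0.3, 0.0, 0.5], is_speech))
```

A label of this kind would then decide what evidence an explainer sees for the feature: text-prefix context, short speech clips, or both.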

We plan to publish the SAE weights and code soon!

Similar Articles

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

arXiv cs.AI

This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual, multimodal, and include safety-relevant concepts like deception and sycophancy, with causal influence on model outputs.