MisoLabs/MisoTTS
Summary
Miso Labs releases Miso TTS 8B, a text-to-speech model based on the Sesame CSM architecture with a Llama 3.2-style backbone, designed for high-quality conversational speech generation and voice continuation.
Cached at: 06/05/26, 08:09 AM
Source: https://huggingface.co/MisoLabs/MisoTTS
Miso TTS 8B
Model Introduction
Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.
The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.
Quickstart
To run the model, use the inference code in our public repository, or try our demo at misolabs.ai.
Model Summary
| Item | Value |
| --- | --- |
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |
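For reference, the numeric values in the summary can be collected in a small configuration object. This is only an illustrative sketch: the class name `MisoTTSConfig` and its field names are hypothetical and are not taken from the repository's inference code. Only the values come from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values listed in the model summary."""

    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048

    @property
    def decoder_steps_per_frame(self) -> int:
        # Codebook 0 comes from the backbone, so the audio decoder
        # only has to fill in the remaining codebooks of each frame.
        return self.audio_num_codebooks - 1


config = MisoTTSConfig()
print(config.decoder_steps_per_frame)  # prints 31
```

The derived property reflects the split described in the architecture notes: one code per frame from the backbone, the other 31 from the audio decoder.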
Architecture
Miso TTS 8B uses two transformer components:
- A large backbone transformer that consumes text/audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.
Codebook 0 is predicted from the backbone hidden state, while codebooks 1 through 31 are predicted by the audio decoder autoregressively in codebook depth.
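The control flow of this two-stage scheme can be sketched with toy stand-ins. The functions `backbone_step` and `decoder_step` below are hypothetical placeholders that return random codes, not the actual transformers, and the sketch assumes every codebook shares the same 2,051-entry audio vocabulary. It only illustrates the order in which codes are produced, not how Miso TTS computes them.

```python
import random

NUM_CODEBOOKS = 32   # audio codebooks per frame (from the model summary)
AUDIO_VOCAB = 2051   # audio vocabulary size (assumed shared by all codebooks)


def backbone_step(context, rng):
    """Toy stand-in for the backbone: returns a hidden state and codebook 0."""
    hidden = [rng.random() for _ in range(8)]  # placeholder hidden state
    code0 = rng.randrange(AUDIO_VOCAB)
    return hidden, code0


def decoder_step(hidden, codes_so_far, rng):
    """Toy stand-in for the audio decoder: next code given the earlier ones."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(context, rng):
    # Stage 1: the backbone sees all previous frames and yields codebook 0.
    hidden, code0 = backbone_step(context, rng)
    codes = [code0]
    # Stage 2: the decoder fills codebooks 1..31, one at a time,
    # each conditioned on the codes already chosen for this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(num_frames, seed=0):
    rng = random.Random(seed)
    context = []  # previously generated frames act as backbone context
    for _ in range(num_frames):
        context.append(generate_frame(context, rng))
    return context


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))  # prints "3 32"
```

In a real system the resulting frames of 32 codes would be passed to the Mimi tokenizer's decoder to reconstruct a waveform; that step is omitted here.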
Links
- Website: misolabs.ai
- Hugging Face: MisoLabs/MisoTTS
- GitHub: MisoLabsAI
- X: @MisoLabsAI