MisoLabs/MisoTTS

Hugging Face Models Trending 05/21/26, 12:06 AM Models

text-to-speech open-source conversational-ai audio-generation llama huggingface

Summary

Miso Labs releases Miso TTS 8B, a text-to-speech model based on the Sesame CSM architecture with a Llama 3.2-style backbone, designed for high-quality conversational speech generation and voice continuation.

Task: text-to-speech
Tags: pytorch, safetensors, text-to-speech, speech-synthesis, voice, audio, sesame, mimi, llama, license:other, region:us

Cached at: 06/05/26, 08:09 AM

Source: https://huggingface.co/MisoLabs/MisoTTS

Miso TTS 8B

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.
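
As a rough illustration of "text and optional audio context", the sketch below flattens prompt turns and the text to be spoken into one sequence of fixed-width rows. The layout (32 audio columns plus one text column, with zeros as padding) is an assumption modelled on how CSM-style models are commonly fed; the model card does not confirm it, and `Segment` and `build_rows` are hypothetical names, not part of the Miso TTS code.

```python
from dataclasses import dataclass
from typing import List

NUM_CODEBOOKS = 32  # "Audio codebooks" value from the model summary


@dataclass
class Segment:
    """One prompt turn (hypothetical type): text tokens plus optional Mimi codes."""
    text_tokens: List[int]
    audio_frames: List[List[int]]  # each inner list holds NUM_CODEBOOKS codes


def build_rows(context: List[Segment], target_text: List[int]) -> List[List[int]]:
    """Flatten context turns and the text to speak into rows of width 33.

    Assumed layout: columns 0..31 carry audio codes, column 32 carries a
    text token. Unused columns are zero-filled here; a real implementation
    would more likely track them with a separate mask.
    """
    rows: List[List[int]] = []
    for seg in context + [Segment(target_text, [])]:
        for tok in seg.text_tokens:
            rows.append([0] * NUM_CODEBOOKS + [tok])
        for frame in seg.audio_frames:
            if len(frame) != NUM_CODEBOOKS:
                raise ValueError("each audio frame needs one code per codebook")
            rows.append(list(frame) + [0])
    return rows
```

With an empty `context` the rows contain only the target text, which corresponds to plain text-to-speech; adding a transcribed audio prompt as a `Segment` corresponds to voice continuation.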

Quickstart

To run the model, use the inference code in our public repository, or try our demo at misolabs.ai.

Model Summary

Model: Miso TTS 8B
Organization: Miso Labs
Task: Text-to-speech
Architecture: Sesame-style CSM
Backbone: llama-8B
Audio decoder: llama-300M
Text vocabulary: 128,256
Audio vocabulary: 2,051
Audio codebooks: 32
Audio tokenizer: Mimi
Max sequence length: 2,048
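
The summary values can be collected into a small configuration object for back-of-the-envelope sizing. This is a sketch only: the class and field names are hypothetical, the 12.5 Hz frame rate is Mimi's published rate rather than a figure from this model card, and the duration estimate assumes every sequence position holds one audio frame, so it is an upper bound.

```python
from dataclasses import dataclass

# Assumption: Mimi's published frame rate; not stated in the model card.
MIMI_FRAME_RATE_HZ = 12.5


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values listed in the model summary."""
    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048


def codes_per_second(cfg: MisoTTSConfig, frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Audio codes emitted per second of speech: one code per codebook per frame."""
    return cfg.audio_num_codebooks * frame_rate_hz


def max_audio_seconds(cfg: MisoTTSConfig, frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Upper bound on audio that fits in one sequence if every position is an audio frame."""
    return cfg.max_seq_len / frame_rate_hz
```

Under these assumptions a full 2,048-position sequence covers at most roughly 164 seconds of audio. Text tokens and prompt audio share the same budget, so the usable generation length is shorter in practice.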

Architecture

Miso TTS 8B uses two transformer components:

A large backbone transformer that consumes text/audio-frame embeddings.
A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.

Codebook 0 is predicted from the backbone hidden state. Codebooks 1 through 31 are then predicted by the audio decoder, one at a time in order of codebook depth, each conditioned on the codebooks already predicted for that frame.
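
The control flow described above can be sketched as a toy loop. Nothing here is the actual Miso TTS code: `backbone_hidden`, `_pick`, and `generate_frame` are hypothetical stand-ins that use a hash in place of real transformer layers and sampling, purely to show which component produces which codebook and what each prediction is conditioned on.

```python
import hashlib
from typing import List, Sequence

NUM_CODEBOOKS = 32   # from the model summary
AUDIO_VOCAB = 2051   # from the model summary


def _pick(*parts) -> int:
    """Deterministic stand-in for 'compute logits, then sample one audio code'."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:4], "big") % AUDIO_VOCAB


def backbone_hidden(history: Sequence[Sequence[int]]) -> int:
    """Stand-in for the large backbone: summarise all previous rows into one state."""
    return _pick("backbone", [list(row) for row in history])


def generate_frame(history: Sequence[Sequence[int]]) -> List[int]:
    """Produce one frame of NUM_CODEBOOKS codes in two stages."""
    h = backbone_hidden(history)
    codes = [_pick("codebook0_head", h)]  # stage 1: codebook 0 from the backbone state
    for k in range(1, NUM_CODEBOOKS):
        # stage 2: the small decoder sees the backbone state and the lower codebooks
        codes.append(_pick("audio_decoder", h, tuple(codes), k))
    return codes


def generate(history: Sequence[Sequence[int]], num_frames: int) -> List[List[int]]:
    """Autoregress over time: each new frame is appended to the backbone's input."""
    frames: List[List[int]] = []
    for _ in range(num_frames):
        frames.append(generate_frame(list(history) + frames))
    return frames
```

The sketch makes the two nested loops visible: the backbone runs once per frame over the whole history, while the smaller decoder runs 31 short steps inside each frame.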

Links

Website: misolabs.ai
Hugging Face: MisoLabs/MisoTTS
GitHub: MisoLabsAI
X: @MisoLabsAI
