MisoLabs/MisoTTS


Summary

Miso Labs releases Miso TTS 8B, a text-to-speech model based on the Sesame CSM architecture with a Llama 3.2-style backbone, designed for high-quality conversational speech generation and voice continuation.

Task: text-to-speech
Tags: pytorch, safetensors, text-to-speech, speech-synthesis, voice, audio, sesame, mimi, llama, license:other, region:us

Cached at: 06/05/26, 08:09 AM

MisoLabs/MisoTTS · Hugging Face

Source: https://huggingface.co/MisoLabs/MisoTTS

Miso TTS 8B

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.


Quickstart

To run the model, use the inference code in our public repository, or try the demo at misolabs.ai.

Model Summary

Item                 Value
Model                Miso TTS 8B
Organization         Miso Labs
Task                 Text-to-speech
Architecture         Sesame-style CSM
Backbone             llama-8B
Audio decoder        llama-300M
Text vocabulary      128,256
Audio vocabulary     2,051
Audio codebooks      32
Audio tokenizer      Mimi
Max sequence length  2,048
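For quick reference, the values in the summary above can be collected in a small configuration object. This is only an illustrative sketch: the class name `MisoTTSConfig` and its field names are hypothetical and are not taken from the Miso TTS repository; only the values come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container; field names are illustrative, values are from the summary."""
    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048

    def codes_per_frame(self) -> int:
        # One code per codebook makes up a single audio frame.
        return self.audio_num_codebooks

    def audio_logits_per_frame(self) -> int:
        # Total number of logits produced across all codebooks for one frame.
        return self.audio_num_codebooks * self.audio_vocab_size


cfg = MisoTTSConfig()
print(cfg.codes_per_frame())         # 32
print(cfg.audio_logits_per_frame())  # 65632
```

Each generated frame therefore carries 32 codes, each drawn from an audio vocabulary of 2,051 entries.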

Architecture

Miso TTS 8B uses two transformer components:

  • A large backbone transformer that consumes text/audio-frame embeddings.
  • A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.

Codebook 0 is predicted from the backbone hidden state, while codebooks 1 through 31 are predicted by the audio decoder, autoregressively along the codebook depth within each frame.
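The control flow described above can be sketched with toy stand-ins. This is not the Miso TTS inference code: `backbone_step`, `codebook0_head`, and `decoder_step` are hypothetical placeholders that return random values instead of running transformers, and the sketch only illustrates the order in which the 32 codes of a frame are produced.

```python
import random

NUM_CODEBOOKS = 32   # from the model summary
AUDIO_VOCAB = 2051   # from the model summary


def backbone_step(frames, rng):
    """Hypothetical stand-in for the backbone: one hidden state for the next frame."""
    return rng.random()


def codebook0_head(hidden, rng):
    """Hypothetical stand-in: sample codebook 0 from the backbone hidden state."""
    return rng.randrange(AUDIO_VOCAB)


def decoder_step(hidden, prev_codes, rng):
    """Hypothetical stand-in: sample the next codebook given the codes so far."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(frames, rng):
    hidden = backbone_step(frames, rng)        # one backbone pass per frame
    codes = [codebook0_head(hidden, rng)]      # codebook 0
    for _ in range(1, NUM_CODEBOOKS):          # codebooks 1..31, in depth order
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(num_frames, seed=0):
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        # Each new frame is generated with the earlier frames as context.
        frames.append(generate_frame(frames, rng))
    return frames


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))  # 3 32
```

In this sketch the expensive backbone runs once per frame, while the smaller decoder runs 31 times per frame, which is the usual motivation for keeping the depth-wise decoder small.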


Links
