@lmsysorg: SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! This is an open 48 kHz stereo TTS mode…
Summary
MOSS-TTS-Local Transformer v1.5 is an open-source 48 kHz stereo TTS model with zero-shot voice cloning, native streaming, and support for 31 languages, built on a Qwen3-4B backbone and served via SGLang-Omni.
Cached at: 06/18/26, 02:16 PM
SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! This is an open 48 kHz stereo TTS model built on a Qwen3-4B backbone.
- Zero-shot voice cloning + native streaming at 48 kHz stereo
- 31 languages, trained on ~4M hours of speech
- Duration control + explicit pause markup + long-form up to 10 min
- 5.976 req/s non-streaming at RTF 0.644, 1.75% WER (SeedTTS English, 2× GPU)
- Three-stage pipeline: reference encoding → AR engine → streaming vocoder, with frame-level CUDA Graphs
Cookbook: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html… Run it now with SGLang-Omni!
MOSS-TTS-Local — SGLang
Source: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html
MOSS-TTS-Local#
MOSS-TTS-Local-Transformer-v1.5 is a text-to-speech model from MOSI.AI and the OpenMOSS team. It generates native 48 kHz stereo speech with MOSS-Audio-Tokenizer-v2 and supports zero-shot voice cloning from reference audio, reference-less synthesis, long-form speech generation, streaming, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. The model supports 31 languages, accepts language tags to guide multilingual generation, and supports inline pause markers such as [pause 3.2s] for explicit prosody control.

Architecturally, MOSS-TTS-Local-Transformer-v1.5 is the local-transformer counterpart to the delay-pattern MOSS-TTS-v1.5. Instead of staggering RVQ streams across time, the Qwen3-4B backbone emits a global latent for each aligned audio frame, and a lightweight frame-local transformer expands that latent into a fixed 12-codebook RVQ block. In SGLang-Omni it runs as a preprocessing → tts_engine → vocoder pipeline served through the OpenAI-compatible /v1/audio/speech endpoint.
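To make the contrast between the two layouts concrete, here is a toy illustration in plain Python. It is only a sketch: the function names, the PAD value, and the 3-codebook example are hypothetical (the model itself uses 12 codebooks), and nothing here is taken from either model's code. The delay layout shown is the commonly used scheme in which codebook k is shifted right by k steps.

```python
PAD = -1  # hypothetical placeholder meaning "no code at this step"


def delay_layout(codes):
    """Stagger K codebook rows across time: row k is shifted right by k steps.

    `codes` is a list of K rows, each holding T frame codes.
    The result has K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    layout = []
    for k, row in enumerate(codes):
        right_pad = total_steps - num_frames - k
        layout.append([PAD] * k + list(row) + [PAD] * right_pad)
    return layout


def frame_local_layout(codes):
    """Group codes by frame: each block holds all K codes of one aligned frame."""
    return [list(block) for block in zip(*codes)]


# Toy input: K = 3 codebooks, T = 3 frames.
codes = [
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
]

for row in delay_layout(codes):
    print(row)
for block in frame_local_layout(codes):
    print(block)
```

In the delay layout a frame's codes are spread over several steps, whereas the frame-local layout keeps every codebook of a frame together in one block, which is the shape the section above describes for this model.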
Prerequisites#
Install sglang-omni by following the Installation guide, then download the model (public, no token required):
hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64 (data-URI) reference audio additionally requires soundfile (uv pip install soundfile).
Server Configuration#
The default layout puts the AR backbone and the codec/vocoder on the same GPU:
sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
  --port 8000
A matching config file is available at examples/configs/moss_tts_local.yaml.
Synthesizing Speech#
Basic Speech#
MOSS-TTS-Local can synthesize speech without a reference clip:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "SGLang-Omni is a great project!"}' \
  --output output.wav
Voice Cloning#
Provide a reference clip when you want voice cloning. The references field accepts audio_path (a local path, HTTP URL, or base64 data URI) and text (the transcript of that clip). Supplying the transcript materially improves cloning quality.
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "SGLang-Omni is a great project!",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav
ref_audio and ref_text are accepted as shorthand for references[0].audio_path and references[0].text.
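The equivalence between the shorthand and the full form can be sketched as a small client-side helper. This is an illustration of the mapping described above, not the server's implementation; the function name normalize_references is hypothetical, and the choices to let an explicit references list take precedence and to reject a lone ref_text are assumptions.

```python
def normalize_references(payload):
    """Return a copy of `payload` with ref_audio/ref_text folded into references[0]."""
    out = {k: v for k, v in payload.items() if k not in ("ref_audio", "ref_text")}

    # Assumption: an explicit `references` list wins over the shorthand fields.
    if "references" in payload:
        out["references"] = [dict(ref) for ref in payload["references"]]
        return out

    if "ref_audio" in payload:
        reference = {"audio_path": payload["ref_audio"]}
        if "ref_text" in payload:
            reference["text"] = payload["ref_text"]
        out["references"] = [reference]
    elif "ref_text" in payload:
        # Assumption: a transcript without a clip is treated as a caller error.
        raise ValueError("ref_text was given without ref_audio")
    return out


shorthand = {
    "input": "SGLang-Omni is a great project!",
    "ref_audio": "/data/prompt.wav",  # hypothetical local path
    "ref_text": "Transcript of the reference clip.",
}
print(normalize_references(shorthand))
```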
Python#
import requests

# Clone the voice of the reference clip using the shorthand fields.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Get the trust fund to the bank early.",
        "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
        "ref_text": "We asked over twenty different people, and they all said it was his.",
    },
)
resp.raise_for_status()  # fail loudly on HTTP errors

# The response body is the synthesized audio.
with open("output.wav", "wb") as f:
    f.write(resp.content)
Reference Audio Sources#
audio_path / ref_audio may be a local filesystem path readable by the server, an HTTP(S) URL, or a base64 data URI (data:audio/wav;base64,<...>, decoded with soundfile):
import base64

import requests

# Download the reference clip and wrap it in a base64 data URI.
reference_url = "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
reference_resp = requests.get(reference_url)
reference_resp.raise_for_status()
ref_audio = (
    "data:audio/wav;base64,"
    + base64.b64encode(reference_resp.content).decode("ascii")
)

# Send the data URI in place of a path or URL.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "SGLang-Omni is a great project!",
        "ref_audio": ref_audio,
        "ref_text": "Transcript of the reference clip.",
    },
)
resp.raise_for_status()
with open("output_data_uri.wav", "wb") as f:
    f.write(resp.content)
Reference encodes are cached (LRU) and coalesced into batched codec calls, so resending the same reference clip skips re-encoding.
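The caching half of that sentence can be pictured with a minimal LRU sketch. This is not SGLang-Omni's implementation: the class name ReferenceCache, the choice to key entries by a SHA-256 of the audio bytes, and the encode_fn callback are all assumptions for illustration, and the coalescing of misses into batched codec calls is not modeled here.

```python
import hashlib
from collections import OrderedDict


class ReferenceCache:
    """Tiny LRU cache keyed by a hash of the reference audio bytes."""

    def __init__(self, capacity, encode_fn):
        self.capacity = capacity
        self.encode_fn = encode_fn      # stand-in for the expensive codec call
        self._entries = OrderedDict()   # oldest entry first
        self.misses = 0

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            # Hit: mark as most recently used and skip re-encoding.
            self._entries.move_to_end(key)
            return self._entries[key]

        # Miss: encode, store, and evict the least recently used entry if full.
        self.misses += 1
        encoded = self.encode_fn(audio_bytes)
        self._entries[key] = encoded
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return encoded


# Usage with a dummy "encoder" that just returns the byte length.
cache = ReferenceCache(capacity=2, encode_fn=len)
cache.get(b"clip-a")
cache.get(b"clip-a")  # second call is served from the cache
print(cache.misses)
```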
Streaming#
Set "stream": true, "response_format": "pcm", and "stream_format": "audio" to receive raw 48 kHz mono PCM chunks in real time. Pipe the stream through ffmpeg when you want a playable WAV file:
curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "stream": true,
    "response_format": "pcm",
    "stream_format": "audio"
  }' \
  | ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
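The same stream can be consumed from Python without ffmpeg. The sketch below assumes the chunks are 16-bit little-endian mono PCM at 48 kHz, which is inferred from the ffmpeg flags above (-f s16le -ar 48000 -ac 1) rather than confirmed separately. The helper names are hypothetical, and stream_speech_to_wav needs a running server, so only the offline helper is exercised here.

```python
import io
import wave

SAMPLE_RATE = 48_000  # Hz, as stated for the PCM stream
CHANNELS = 1          # mono, matching `-ac 1` in the ffmpeg command
SAMPLE_WIDTH = 2      # bytes per sample, matching `-f s16le`


def write_pcm_chunks_to_wav(chunks, dest):
    """Wrap raw PCM chunks in a WAV container.

    `dest` is a file path or a writable, seekable binary file object.
    Returns the number of PCM bytes written.
    """
    written = 0
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            if chunk:  # ignore empty chunks
                wav.writeframes(chunk)
                written += len(chunk)
    return written


def stream_speech_to_wav(text, dest, url="http://localhost:8000/v1/audio/speech"):
    """Request a PCM stream and save it as WAV (requires a running server)."""
    import requests  # third-party; imported lazily so the helper above stays stdlib-only

    payload = {
        "input": text,
        "stream": True,
        "response_format": "pcm",
        "stream_format": "audio",
    }
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        return write_pcm_chunks_to_wav(resp.iter_content(chunk_size=4096), dest)


# Offline check: two 10 ms chunks (480 samples each) written to memory.
buffer = io.BytesIO()
written = write_pcm_chunks_to_wav([b"\x00\x00" * 480, b"\x01\x00" * 480], buffer)
```

Writing the header through the wave module keeps the sample rate and channel count in one place, so a change in the stream format only needs the three constants updated.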
Duration Control#
MOSS-TTS-Local conditions on a target duration token count (codec frames; a larger count yields longer audio). Set it with an inline ${token:N} prefix on input (stripped before synthesis), or with a token_count (alias duration_tokens) parameter. The count must be a positive integer.
{"input": "${token:150}A sentence with an explicit duration target.", "ref_audio": "..."}
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "${token:150}A sentence with an explicit duration target.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his."
  }' \
  --output output_duration_tokens.wav
If omitted, the model picks the duration itself.
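The prefix convention described in this section can be mirrored on the client for validation before a request is sent. This is a sketch of the documented behavior, not the server's parser: the function name and regular expression are hypothetical, and details such as whether whitespace is tolerated around the prefix are unknown.

```python
import re

# Matches a leading ${token:N} where N is a run of digits.
_TOKEN_PREFIX = re.compile(r"^\$\{token:(\d+)\}")


def split_duration_prefix(text):
    """Return (token_count or None, text with the prefix removed)."""
    match = _TOKEN_PREFIX.match(text)
    if match is None:
        return None, text
    count = int(match.group(1))
    if count <= 0:
        raise ValueError("token count must be a positive integer")
    return count, text[match.end():]


print(split_duration_prefix("${token:150}A sentence with an explicit duration target."))
print(split_duration_prefix("No duration target here."))
```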
Text Markup, Style, and Language#
Inline text markup that the model understands (for example [pause Xs], pinyin, and IPA) is passed through unchanged. An optional instructions field carries a free-text style directive, and an optional language hint biases the target language (omit it to let the model infer from the text):
{
  "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
  "ref_audio": "...", "ref_text": "...",
  "language": "Chinese"
}
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "language": "Chinese",
    "instructions": "Use a natural conversational style."
  }' \
  --output output_markup.wav
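When pauses are inserted programmatically, a small helper keeps the markup consistent. The sketch below only reproduces the [pause Xs] form shown in this section; the function names are hypothetical, and whether the model accepts whole-second values without a decimal (for example [pause 2s]) is an assumption that has not been checked.

```python
def pause(seconds):
    """Format an inline pause marker such as "[pause 0.5s]"."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # :g drops trailing zeros, so 0.50 becomes 0.5 and 2.0 becomes 2.
    return f"[pause {seconds:g}s]"


def join_with_pauses(segments, seconds):
    """Join text segments with the same pause marker between each pair."""
    marker = f" {pause(seconds)} "
    return marker.join(segment.strip() for segment in segments)


text = join_with_pauses(["今天天气不错", "就该出去晒晒太阳。"], 0.5)
print(text)
```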
Generation Parameters#
The two default values reflect the model’s separate sampling channels: the text channel is the per-frame continue/stop head and the audio channel is the RVQ codebooks. A single temperature, top_p, or top_k in the request applies to both.
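A request that overrides the sampling settings might be assembled as follows. The field names temperature, top_p, and top_k come from the paragraph above; the helper name and the range checks are assumptions (generic sanity bounds, not limits documented for this server). The example values are the ones MOSI.AI is quoted as suggesting later on this page.

```python
def with_sampling(payload, temperature=None, top_p=None, top_k=None):
    """Return a copy of `payload` with optional sampling overrides added."""
    out = dict(payload)
    if temperature is not None:
        if temperature <= 0:  # assumed bound
            raise ValueError("temperature must be positive")
        out["temperature"] = float(temperature)
    if top_p is not None:
        if not 0 < top_p <= 1:  # assumed bound
            raise ValueError("top_p must be in (0, 1]")
        out["top_p"] = float(top_p)
    if top_k is not None:
        if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
            raise ValueError("top_k must be a positive integer")
        out["top_k"] = top_k
    return out


request_body = with_sampling(
    {"input": "SGLang-Omni is a great project!"},
    temperature=0.6,
    top_p=0.95,
    top_k=25,
)
print(request_body)
```

Per the paragraph above, a single value set this way applies to both the text and the audio channel.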
Seed Reproducibility#
A fixed seed is reproducible at any concurrency: each token’s sampling depends only on its own seed and position, never on its batch neighbours.
- Reproducibility holds for a fixed server configuration and hardware; backbone floating-point non-determinism (different batch shapes, GPUs, or kernels) can still shift the sampled tokens across deployments.
- seed must be a non-negative integer; negative or non-integer values are rejected.
- Without a seed, each request draws a fresh random seed and is not reproducible across runs.
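The idea that sampling depends only on a request's own seed and position can be sketched with a toy sampler. This is not SGLang-Omni's sampler: the standard-library random module stands in for the GPU sampling kernel, the SHA-256 derivation of the per-step random state is an assumption, and all names are hypothetical. It only demonstrates why batch neighbours do not matter under such a scheme when the probabilities themselves are identical.

```python
import hashlib
import random


def validate_seed(seed):
    """Accept only non-negative integers, mirroring the rule stated above."""
    if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
        raise ValueError("seed must be a non-negative integer")
    return seed


def sample_token(probs, seed, position):
    """Draw an index from `probs` using randomness derived only from (seed, position)."""
    digest = hashlib.sha256(f"{seed}:{position}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def decode_batch(batch, steps, probs):
    """Decode every (name, seed) request in the batch for `steps` positions."""
    return {
        name: [sample_token(probs, validate_seed(seed), t) for t in range(steps)]
        for name, seed in batch
    }


probs = [0.1, 0.2, 0.3, 0.4]
alone = decode_batch([("a", 7)], steps=5, probs=probs)
crowded = decode_batch([("b", 1), ("a", 7), ("c", 99)], steps=5, probs=probs)
print(alone["a"] == crowded["a"])
```

In a real deployment the probabilities come from the backbone, so floating-point differences across hardware or kernels can still change the outcome, as the first bullet above notes.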
Benchmarking#
MOSS-TTS-Local clones from each prompt (--ref-format references) and estimates a per-sample duration with --token-count auto. Run at --max-concurrency 16.
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 --port 8000 \
  --ref-format references \
  --token-count auto \
  --output-dir results/moss_tts_en \
  --lang en --max-concurrency 16
Use --lang zh for the Chinese split. See benchmarks/README.md for the full workflow.
Evaluation Benchmarks#
Seed-TTS-Eval Reference Performance#
Seed-TTS-Eval full set (EN = 1088, ZH = 2020) on 2× H100, concurrency 16, --token-count auto. These are reference inference-performance numbers reported in PR #728: reproducible references, not CI thresholds.
Multilingual Voice Clone#
We evaluate MOSS-TTS-Local-Transformer-v1.5 on public multilingual TTS suites and internal voice-cloning stress sets, covering multilingual synthesis, speaker similarity, and hard speaker-stability cases.
WER (↓) and SIM (↑) are macro-averaged and reported in percentage points. N/A means the benchmark is speaker-similarity only and does not report WER.
These results were measured with audio sampling parameters temperature=1.7, top_p=0.8, and top_k=25. In tests from MOSI.AI, temperature=0.6, top_p=0.95, top_k=25, and audio_repetition_penalty=1.2 may produce better quality.
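For readers unfamiliar with the term, macro-averaging gives every group equal weight regardless of how many samples it contains. The sketch below contrasts it with a sample-weighted (micro) average. It assumes the groups are languages, which this page does not state explicitly, and the numbers are toy placeholders chosen for easy arithmetic, not results from any benchmark.

```python
def macro_average(scores):
    """Unweighted mean of per-group scores: every group counts once."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores.values()) / len(scores)


def micro_average(scores, sample_counts):
    """Mean weighted by the number of samples in each group."""
    total = sum(sample_counts[group] for group in scores)
    weighted = sum(scores[group] * sample_counts[group] for group in scores)
    return weighted / total


# Toy placeholder values (percent), NOT benchmark results.
toy_wer = {"lang_a": 2.0, "lang_b": 4.0, "lang_c": 6.0}
toy_counts = {"lang_a": 100, "lang_b": 100, "lang_c": 800}

print(macro_average(toy_wer))              # each language weighted equally
print(micro_average(toy_wer, toy_counts))  # dominated by the largest group
```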
Known Limitations#
- **Voice cloning depends on the reference.** Omit the reference for non-cloned speech; provide the transcript (text / ref_text) for the best speaker similarity when cloning.
- **Rare runaway generation.** A small fraction of utterances can loop and generate up to max_new_tokens; setting a token_count (or lowering max_new_tokens) bounds the output.
- **Duration is a hint.** ${token:N} / token_count steers length but is not an exact clip duration.
- **Reproducibility is hardware-bound.** A fixed seed reproduces only on the same server configuration and GPU; see Seed Reproducibility.
OpenMOSS (@Open_MOSS): 🤗 MOSS-TTS-Local Transformer v1.5 is now open source.
Built with a pure autoregressive Audio Tokenizer + LLM paradigm:
- MOSS-Audio-Tokenizer-v2, 2B params
- Qwen3-4B backbone
- Native 48 kHz stereo audio
- Streaming output with theoretical sub-100 ms TTFT
- Zero-shot voice cloning