@lmsysorg: SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! This is an open 48 kHz stereo TTS mode…

X AI KOLs Timeline 06/18/26, 05:44 AM Models

tts text-to-speech voice-cloning streaming open-source qwen3 multilingual

Summary

MOSS-TTS-Local Transformer v1.5 is an open-source 48 kHz stereo TTS model with zero-shot voice cloning, native streaming, and support for 31 languages, built on a Qwen3-4B backbone and served via SGLang-Omni.

SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! This is an open 48 kHz stereo TTS model built on a Qwen3-4B backbone.

Zero-shot voice cloning + native streaming at 48 kHz stereo
31 languages, trained on ~4M hours of speech
Duration control + explicit pause markup + long-form up to 10 min
5.976 req/s non-streaming at RTF 0.644, 1.75% WER (SeedTTS English, 2× GPU)
Three-stage pipeline: reference encoding → AR engine → streaming vocoder, with frame-level CUDA Graphs

Cookbook: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html…
Run it now with SGLang-Omni!

Cached at: 06/18/26, 02:16 PM

MOSS-TTS-Local — SGLang

Source: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html

MOSS-TTS-Local

MOSS-TTS-Local-Transformer-v1.5 is a text-to-speech model from MOSI.AI and the OpenMOSS team. It generates native 48 kHz stereo speech with MOSS-Audio-Tokenizer-v2 and supports zero-shot voice cloning from reference audio, reference-less synthesis, long-form speech generation, streaming, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. The model supports 31 languages, accepts language tags to guide multilingual generation, and supports inline pause markers such as `[pause 3.2s]` for explicit prosody control.

Figure: MOSS-TTS-Local architecture

Architecturally, MOSS-TTS-Local-Transformer-v1.5 is the local-transformer counterpart to the delay-pattern MOSS-TTS-v1.5. Instead of staggering RVQ streams across time, the Qwen3-4B backbone emits a global latent for each aligned audio frame, and a lightweight frame-local transformer expands that latent into a fixed 12-codebook RVQ block. In SGLang-Omni it runs as a `preprocessing → tts_engine → vocoder` pipeline served through the OpenAI-compatible `/v1/audio/speech` endpoint.
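
The difference between the two layouts can be sketched with a toy index grid. This is only an illustration of token scheduling, not the model's code: the function names are hypothetical, and the one-step lag per codebook is the generic delay pattern, assumed here rather than confirmed for MOSS-TTS-v1.5.

```python
NUM_CODEBOOKS = 12  # codebooks per RVQ block, as stated in the text


def delay_pattern_layout(num_frames, num_codebooks=NUM_CODEBOOKS):
    """Return grid[step][k]: the frame whose codebook k is emitted at that step.

    Assumption: codebook k lags k steps behind codebook 0, so slots at the
    start and end of the sequence stay empty (None).
    """
    steps = num_frames + num_codebooks - 1
    return [
        [step - k if 0 <= step - k < num_frames else None
         for k in range(num_codebooks)]
        for step in range(steps)
    ]


def frame_local_layout(num_frames, num_codebooks=NUM_CODEBOOKS):
    """Return grid[step][k] when every step emits all codebooks of one frame."""
    return [[frame] * num_codebooks for frame in range(num_frames)]


if __name__ == "__main__":
    print(len(delay_pattern_layout(5)))  # backbone steps with staggering
    print(len(frame_local_layout(5)))    # backbone steps with frame-local blocks
```

In this toy model the staggered layout needs `num_frames + num_codebooks - 1` backbone steps, while the frame-local layout needs exactly `num_frames`, with the per-frame expansion handled by the smaller local transformer.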

Prerequisites

Install `sglang-omni` by following the Installation guide, then download the model (public, no token required):

hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5

The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64 (data-URI) reference audio additionally requires `soundfile` (`uv pip install soundfile`).

Server Configuration

The default layout puts the AR backbone and the codec/vocoder on the same GPU:

sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
  --port 8000

A matching config file is available at `examples/configs/moss_tts_local.yaml`.

Synthesizing Speech

Basic Speech

MOSS-TTS-Local can synthesize speech without a reference clip:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "SGLang-Omni is a great project!"}' \
  --output output.wav

Voice Cloning

Provide a reference clip when you want voice cloning. The `references` field accepts `audio_path` (a local path, HTTP URL, or base64 data URI) and `text` (the transcript of that clip). Supplying the transcript materially improves cloning quality.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "SGLang-Omni is a great project!",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav

`ref_audio` and `ref_text` are accepted as shorthand for `references[0].audio_path` and `references[0].text`.
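
The equivalence between the two request shapes can be made explicit with a small client-side helper. This is a sketch: `to_references` is a hypothetical name, and rejecting a payload that mixes both forms is a client policy assumed here, not documented server behavior.

```python
def to_references(payload):
    """Return a copy of `payload` that uses the explicit `references` form."""
    body = dict(payload)
    ref_audio = body.pop("ref_audio", None)
    ref_text = body.pop("ref_text", None)
    if ref_audio is None and ref_text is None:
        return body  # nothing to expand (no reference, or already explicit)
    if "references" in body:
        # Assumption: mixing both forms is treated as a client error here.
        raise ValueError("use either ref_audio/ref_text or references, not both")
    if ref_audio is None:
        raise ValueError("ref_text was given without ref_audio")
    reference = {"audio_path": ref_audio}
    if ref_text is not None:
        reference["text"] = ref_text
    body["references"] = [reference]
    return body


if __name__ == "__main__":
    print(to_references({"input": "Hello", "ref_audio": "clip.wav", "ref_text": "Hi"}))
```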

Python

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Get the trust fund to the bank early.",
        "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
        "ref_text": "We asked over twenty different people, and they all said it was his.",
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)

Reference Audio Sources

`audio_path` / `ref_audio` may be a local filesystem path readable by the server, an HTTP(S) URL, or a base64 data URI (`data:audio/wav;base64,<...>`, decoded with `soundfile`):

import base64
import requests

reference_url = "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
reference_resp = requests.get(reference_url)
reference_resp.raise_for_status()
ref_audio = (
    "data:audio/wav;base64,"
    + base64.b64encode(reference_resp.content).decode("ascii")
)

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "SGLang-Omni is a great project!",
        "ref_audio": ref_audio,
        "ref_text": "Transcript of the reference clip.",
    },
)
resp.raise_for_status()
with open("output_data_uri.wav", "wb") as f:
    f.write(resp.content)

Reference encodes are cached (LRU) and coalesced into batched codec calls, so resending the same reference clip skips re-encoding.
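
The caching idea can be sketched as a content-keyed LRU. This is not SGLang-Omni's implementation: `ReferenceCache`, its `encode` callback, and the SHA-256 key are all assumptions for illustration, and the batching/coalescing of codec calls is not modeled.

```python
import hashlib
from collections import OrderedDict


class ReferenceCache:
    """Toy LRU cache mapping reference-audio bytes to their encoded form."""

    def __init__(self, encode, capacity=128):
        self._encode = encode          # hypothetical codec-encoder callable
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> encoded reference, oldest first
        self.misses = 0

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        encoded = self._encode(audio_bytes)
        self._entries[key] = encoded
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return encoded


if __name__ == "__main__":
    cache = ReferenceCache(encode=len, capacity=2)
    cache.get(b"clip-a")
    cache.get(b"clip-a")  # second call is served from the cache
    print(cache.misses)
```

Keying on the audio content rather than the URL or path means the same clip sent as a data URI and as a file would share an entry in this sketch; whether the server does the same is not stated in the source.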

Streaming

Set `"stream": true`, `"response_format": "pcm"`, and `"stream_format": "audio"` to receive raw 48 kHz mono PCM chunks in real time. Pipe the stream through `ffmpeg` when you want a playable WAV file:

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "stream": true,
    "response_format": "pcm",
    "stream_format": "audio"
  }' \
  | ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
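
A Python client can do the same without `ffmpeg` by wrapping the chunks in a WAV container as they arrive. This is a minimal sketch using only the standard library; `save_pcm_stream` and `stream_speech` are hypothetical helper names, and the 16-bit little-endian sample format is assumed from the `-f s16le` flag in the command above.

```python
import json
import urllib.request
import wave


def save_pcm_stream(chunks, path, sample_rate=48000, channels=1):
    """Write raw s16le PCM byte chunks to a WAV file; return the bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples, matching ffmpeg's s16le
        wav.setframerate(sample_rate)
        for chunk in chunks:
            if chunk:
                wav.writeframes(chunk)
                total += len(chunk)
    return total


def stream_speech(text, path, url="http://localhost:8000/v1/audio/speech"):
    """POST a streaming request and save the PCM chunks as they arrive."""
    body = json.dumps({
        "input": text,
        "stream": True,
        "response_format": "pcm",
        "stream_format": "audio",
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        # Read fixed-size blocks until the server closes the stream.
        return save_pcm_stream(iter(lambda: resp.read(4096), b""), path)


if __name__ == "__main__":
    print(stream_speech("Get the trust fund to the bank early.", "output_stream.wav"))
```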

Duration Control

MOSS-TTS-Local conditions on a target duration token count (codec frames; a larger count yields longer audio). Set it with an inline `${token:N}` prefix on `input` (stripped before synthesis), or with a `token_count` (alias `duration_tokens`) parameter. The count must be a positive integer.

{"input": "${token:150}A sentence with an explicit duration target.", "ref_audio": "..."}

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "${token:150}A sentence with an explicit duration target.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his."
  }' \
  --output output_duration_tokens.wav

If omitted, the model picks the duration itself.
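
The prefix convention can be mirrored on the client, for example to validate inputs before sending them. This is a sketch of one plausible parse, not the server's parser: `split_duration_prefix` is a hypothetical helper, and details such as whitespace tolerance are assumptions.

```python
import re

# Matches a leading ${token:N} marker, where N is a run of digits.
_TOKEN_PREFIX = re.compile(r"^\$\{token:(\d+)\}")


def split_duration_prefix(text):
    """Return (token_count or None, text with the prefix stripped)."""
    match = _TOKEN_PREFIX.match(text)
    if match is None:
        return None, text
    count = int(match.group(1))
    if count <= 0:
        raise ValueError("token count must be a positive integer")
    return count, text[match.end():]


if __name__ == "__main__":
    print(split_duration_prefix("${token:150}A sentence with an explicit duration target."))
```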

Text Markup, Style, and Language

Inline text markup that the model understands (for example `[pause Xs]`, pinyin, and IPA) is passed through unchanged. An optional `instructions` field carries a free-text style directive, and an optional `language` hint biases the target language (omit it to let the model infer from the text):

{
  "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
  "ref_audio": "...", "ref_text": "...",
  "language": "Chinese"
}

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "language": "Chinese",
    "instructions": "Use a natural conversational style."
  }' \
  --output output_markup.wav
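
When the text is assembled programmatically, the pause markers can be inserted by a small helper. This is an illustrative sketch; `join_with_pauses` is a hypothetical name, and the marker spelling simply follows the `[pause Xs]` form shown above.

```python
def join_with_pauses(segments, seconds):
    """Join non-empty text segments with an explicit pause marker between them."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    marker = f" [pause {seconds:g}s] "
    return marker.join(s.strip() for s in segments if s.strip())


if __name__ == "__main__":
    print(join_with_pauses(["今天天气不错", "就该出去晒晒太阳。"], 0.5))
```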

Generation Parameters

The two default values reflect the model’s separate sampling channels: the `text` channel is the per-frame continue/stop head and the `audio` channel is the RVQ codebooks. A single `temperature`, `top_p`, or `top_k` in the request applies to both.

Seed Reproducibility

A fixed `seed` is reproducible at any concurrency: each token’s sampling depends only on its own seed and position, never on its batch neighbours.

Reproducibility holds for a fixed server configuration and hardware; backbone floating-point non-determinism (different batch shapes, GPUs, or kernels) can still shift the sampled tokens across deployments.
`seed` must be a non-negative integer; negative or non-integer values are rejected.
Without a `seed`, each request draws a fresh random seed and is not reproducible across runs.
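
Why per-request seeding survives batching can be shown with a toy sampler whose randomness is derived only from `(seed, position)`. This is not SGLang-Omni's sampler: the hashing scheme and the function names are assumptions chosen to make the property easy to check.

```python
import hashlib
import random


def sample_index(probs, seed, position):
    """Draw one index from `probs` using randomness derived only from (seed, position)."""
    digest = hashlib.sha256(f"{seed}:{position}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_batch(batch):
    """Sample each (probs, seed, position) row independently of its neighbours."""
    return [sample_index(probs, seed, pos) for probs, seed, pos in batch]


if __name__ == "__main__":
    row = ([0.1, 0.2, 0.7], 42, 3)
    alone = sample_batch([row])[0]
    crowded = sample_batch([([0.5, 0.5], 7, 0), row, ([1.0], 9, 9)])[1]
    print(alone == crowded)
```

The sketch holds the probabilities fixed. In a real deployment the probabilities themselves can shift with batch shape or hardware, which is the floating-point caveat described above.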

Benchmarking

MOSS-TTS-Local clones from each prompt (`--ref-format references`) and estimates a per-sample duration with `--token-count auto`. Run at `--max-concurrency 16`.

python -m benchmarks.eval.benchmark_tts_seedtts \
    --meta zhaochenyang20/seed-tts-eval-arrow \
    --model OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 --port 8000 \
    --ref-format references \
    --token-count auto \
    --output-dir results/moss_tts_en \
    --lang en --max-concurrency 16

Use `--lang zh` for the Chinese split. See `benchmarks/README.md` for the full workflow.

Evaluation Benchmarks

Seed-TTS-Eval Reference Performance

Seed-TTS-Eval full set (EN = 1088, ZH = 2020) on 2× H100, concurrency 16, `--token-count auto`. These are reference inference-performance numbers reported in PR #728: reproducible references, not CI thresholds.
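
For reading such numbers, the real-time factor (RTF) is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The helper below is a sketch of that standard definition; how the benchmark script aggregates RTF across requests is not specified here, and the example figures are illustrative rather than measured.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative only: 6.44 s of compute for a 10 s clip.
    print(real_time_factor(6.44, 10.0))
```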

Multilingual Voice Clone

We evaluate MOSS-TTS-Local-Transformer-v1.5 on public multilingual TTS suites and internal voice-cloning stress sets, covering multilingual synthesis, speaker similarity, and hard speaker-stability cases.

WER (↓) and SIM (↑) are macro-averaged and reported in percentage points. N/A means the benchmark is speaker-similarity only and does not report WER.

These results were measured with audio sampling parameters `temperature=1.7`, `top_p=0.8`, and `top_k=25`. In tests from MOSI.AI, `temperature=0.6`, `top_p=0.95`, `top_k=25`, and `audio_repetition_penalty=1.2` may produce better quality.
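
A request carrying the MOSI.AI-suggested sampling values might be built as follows. This is a sketch with a hypothetical helper name. It sets only `temperature`, `top_p`, and `top_k`, which the Generation Parameters section describes as request fields applying to both channels. `audio_repetition_penalty` is left out because the text does not confirm it as a request field.

```python
def build_speech_payload(text, temperature=0.6, top_p=0.95, top_k=25, **extra):
    """Build a /v1/audio/speech JSON body with basic sampling-range checks."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    payload = {
        "input": text,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    }
    payload.update(extra)  # e.g. ref_audio, ref_text, seed
    return payload


if __name__ == "__main__":
    print(build_speech_payload("SGLang-Omni is a great project!", seed=1234))
```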

Known Limitations

**Voice cloning depends on the reference.** Omit the reference for non-cloned speech; provide the transcript (`text` / `ref_text`) for the best speaker similarity when cloning.
**Rare runaway generation.** A small fraction of utterances can loop and generate up to `max_new_tokens`; setting a `token_count` (or lowering `max_new_tokens`) bounds the output.
**Duration is a hint.** `${token:N}` / `token_count` steers length but is not an exact clip duration.
**Reproducibility is hardware-bound.** A fixed `seed` reproduces only on the same server configuration and GPU; see Seed Reproducibility.

OpenMOSS (@Open_MOSS): 🤗 MOSS-TTS-Local Transformer v1.5 is now open source.

Built with a pure autoregressive Audio Tokenizer + LLM paradigm:

> MOSS-Audio-Tokenizer-v2, 2B params
> Qwen3-4B backbone
> Native 48 kHz stereo audio
> Streaming output with theoretical sub-100 ms TTFT
> Zero-shot voice cloning