@lmsysorg: SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! It is an open-source 48 kHz stereo TTS model…

X AI KOLs Timeline 2026/06/18 05:44 Models

tts text-to-speech voice-cloning streaming open-source qwen3 multilingual

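Summary

MOSS-TTS-Local Transformer v1.5 is an open-source 48 kHz stereo TTS model with zero-shot voice cloning, native streaming, and support for 31 languages. It is built on a Qwen3-4B backbone and served through SGLang-Omni.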

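SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! It is an open-source 48 kHz stereo TTS model built on a Qwen3-4B backbone.

Zero-shot voice cloning + native streaming at 48 kHz stereo
31 languages, trained on roughly 4 million hours of speech
Duration control + explicit pause markers + long-form generation up to 10 minutes
Non-streaming: 5.976 req/s, RTF 0.644, WER 1.75% (SeedTTS English, 2× GPU)
Three-stage pipeline: reference encoding → AR engine → streaming vocoder, with frame-level CUDA Graphs

Cookbook: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html… Run it with SGLang-Omni now!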

Cached: 2026/06/18 14:16

MOSS-TTS-Local — SGLang

Source: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html

MOSS-TTS-Local# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#moss-tts-local)

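MOSS-TTS-Local-Transformer-v1.5 (https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5) is a text-to-speech model developed by MOSI.AI and the OpenMOSS team. It uses MOSS-Audio-Tokenizer-v2 (https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-v2) to generate native 48 kHz stereo speech, and it supports zero-shot voice cloning from reference audio, reference-free synthesis, long-form speech generation, streaming output, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. The model covers 31 languages, accepts language tags to guide multilingual generation, and supports inline pause markers such as [pause 3.2s] for explicit prosody control.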

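MOSS-TTS-Local architecture

Architecturally, MOSS-TTS-Local-Transformer-v1.5 is the local-transformer counterpart of the delay-pattern MOSS-TTS-v1.5 (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts.html). Rather than interleaving RVQ streams in time, the Qwen3-4B backbone emits one global latent per aligned audio frame, and a lightweight frame-local transformer then expands that latent into a fixed 12-codebook RVQ block. In SGLang-Omni it runs as a preprocessing → tts_engine → vocoder pipeline, served through the OpenAI-compatible /v1/audio/speech endpoint.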
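
The difference between the two layouts can be pictured with a toy index grid. This is a rough sketch rather than the model's code: it assumes the common delay-pattern convention in which codebook k lags the first codebook by k steps, and both function names are made up for illustration. Only the block size of 12 codebooks comes from the description above.

```python
NUM_CODEBOOKS = 12  # from the text: a fixed 12-codebook RVQ block per frame
PAD = None          # marks a slot that carries no code at that step


def delay_pattern_layout(num_frames, num_codebooks=NUM_CODEBOOKS):
    """Assumed delay pattern: at each step, codebook k holds frame (step - k)."""
    steps = num_frames + num_codebooks - 1
    layout = []
    for step in range(steps):
        row = []
        for k in range(num_codebooks):
            frame = step - k
            row.append((frame, k) if 0 <= frame < num_frames else PAD)
        layout.append(row)
    return layout


def local_block_layout(num_frames, num_codebooks=NUM_CODEBOOKS):
    """Frame-aligned layout: one step per frame, all codebooks for that frame."""
    return [[(frame, k) for k in range(num_codebooks)]
            for frame in range(num_frames)]


if __name__ == "__main__":
    frames = 5
    print("delay-pattern steps:", len(delay_pattern_layout(frames)))
    print("frame-aligned steps:", len(local_block_layout(frames)))
```

Under these assumptions the delay pattern needs num_frames + num_codebooks - 1 steps with padded edges, while the frame-aligned layout needs exactly one step per frame and every row is a complete block.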

Prerequisites# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#prerequisites)

Install sglang-omni by following the installation guide (https://sgl-project.github.io/sglang-omni/get_started/installation.html), then download the model (public, no token required):

hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5

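The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64 (data URI) reference audio additionally requires soundfile (uv pip install soundfile).

Server Configuration# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#server-configuration)

The default layout places the AR backbone and the codec/vocoder on the same GPU: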

sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
  --port 8000

The corresponding config file is at examples/configs/moss_tts_local.yaml.

Synthesizing Speech# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#synthesizing-speech)

Basic Speech# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#basic-speech)

MOSS-TTS-Local can synthesize speech without a reference audio clip:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "SGLang-Omni is a great project!"}' \
  --output output.wav

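Voice Cloning# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#voice-cloning)

Supply a reference audio clip when voice cloning is needed. The references field accepts audio_path (a local path, an HTTP URL, or a base64 data URI) and text (the transcript of that clip). Providing the transcript noticeably improves cloning quality.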

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "SGLang-Omni is a great project!",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav

ref_audio and ref_text are shorthand for references[0].audio_path and references[0].text.
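
The mapping between the shorthand and the full form can be written out as a small client-side helper. This is only an illustrative sketch: expand_reference_shorthand is a hypothetical name, not part of SGLang-Omni, and rejecting a payload that carries both forms is a choice made here because the precedence is not stated above.

```python
def expand_reference_shorthand(payload):
    """Return a copy of payload with ref_audio/ref_text moved into references[0]."""
    out = dict(payload)
    audio = out.pop("ref_audio", None)
    text = out.pop("ref_text", None)
    if audio is None and text is None:
        return out  # nothing to expand
    if "references" in out:
        # Assumption of this sketch: refuse ambiguous input instead of guessing.
        raise ValueError("use either the shorthand or 'references', not both")
    reference = {}
    if audio is not None:
        reference["audio_path"] = audio
    if text is not None:
        reference["text"] = text
    out["references"] = [reference]
    return out


if __name__ == "__main__":
    print(expand_reference_shorthand({
        "input": "SGLang-Omni is a great project!",
        "ref_audio": "clip.wav",
        "ref_text": "Transcript of the reference clip.",
    }))
```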

Python# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#python)

```python
import requests

# Clone the voice of the reference clip and save the synthesized audio.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Get the trust fund to the bank early.",
        "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
        "ref_text": "We asked over twenty different people, and they all said it was his.",
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```

Reference Audio Sources# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#reference-audio-sources)

audio_path/ref_audio can be a local filesystem path readable by the server, an HTTP(S) URL, or a base64 data URI (data:audio/wav;base64,<...>, decoded with soundfile):

```python
import base64

import requests

# Download the reference clip and embed it as a base64 data URI.
reference_url = "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
reference_resp = requests.get(reference_url)
reference_resp.raise_for_status()
ref_audio = (
    "data:audio/wav;base64,"
    + base64.b64encode(reference_resp.content).decode("ascii")
)

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "SGLang-Omni is a great project!",
        "ref_audio": ref_audio,
        "ref_text": "Transcript of the reference clip.",
    },
)
resp.raise_for_status()
with open("output_data_uri.wav", "wb") as f:
    f.write(resp.content)
```

Reference encodings are cached (LRU) and coalesced into batched codec calls, so resending the same reference clip skips re-encoding.
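
To make the caching behavior concrete, here is a minimal LRU sketch keyed by a hash of the audio bytes. It is not the server's implementation: ReferenceCache, encode_fn, and max_entries are hypothetical names, the cache key is an assumption, and the batching of codec calls is left out entirely.

```python
import hashlib
from collections import OrderedDict


class ReferenceCache:
    """Least-recently-used cache for encoded reference clips (illustrative only)."""

    def __init__(self, encode_fn, max_entries=128):
        self._encode_fn = encode_fn      # stand-in for the real codec encoder
        self._max_entries = max_entries
        self._entries = OrderedDict()    # key -> encoded reference, oldest first
        self.misses = 0

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        self.misses += 1
        encoded = self._encode_fn(audio_bytes)
        self._entries[key] = encoded
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return encoded


if __name__ == "__main__":
    cache = ReferenceCache(encode_fn=len, max_entries=2)
    cache.get(b"clip-a")
    cache.get(b"clip-a")  # second call is served from the cache
    print("encoder calls:", cache.misses)
```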

Streaming# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#streaming)

Set "stream": true, "response_format": "pcm", and "stream_format": "audio" to receive raw 48 kHz mono PCM chunks in real time. Pipe the stream through ffmpeg to produce a playable WAV file:

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "stream": true,
    "response_format": "pcm",
    "stream_format": "audio"
  }' \
  | ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
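
Where ffmpeg is not available, the same wrapping step can be sketched with Python's standard wave module. The sample rate, channel count, and 16-bit little-endian format below are taken from the ffmpeg flags above; write_pcm_chunks_as_wav is a hypothetical helper name, and the sketch assumes the chunks together contain a whole number of 16-bit samples.

```python
import wave

SAMPLE_RATE = 48000  # from the text: 48 kHz
CHANNELS = 1         # from the text: mono PCM
SAMPLE_WIDTH = 2     # bytes per sample for s16le, matching the ffmpeg flags


def write_pcm_chunks_as_wav(chunks, wav_path):
    """Wrap raw s16le PCM byte chunks in a WAV container; return frames written."""
    total_bytes = 0
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)  # WAV stores 16-bit PCM little-endian as-is
            total_bytes += len(chunk)
    return total_bytes // (SAMPLE_WIDTH * CHANNELS)


if __name__ == "__main__":
    # 0.1 s of silence stands in for chunks read from the streaming response.
    silence = [b"\x00\x00" * 2400, b"\x00\x00" * 2400]
    print(write_pcm_chunks_as_wav(silence, "output_stream.wav"), "frames written")
```

With the requests library, the chunks could come from a streaming response, for example resp.iter_content(chunk_size=4096) on a request made with stream=True.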

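Duration Control# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#duration-control)

MOSS-TTS-Local conditions on a target token count (codec frames; a larger count yields longer audio). Set it either with an inline ${token:N} prefix on input (stripped before synthesis) or with the token_count parameter (alias duration_tokens). The count must be a positive integer.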

{"input": "${token:150}A sentence with an explicit duration target.", "ref_audio": "..."}

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "${token:150}A sentence with an explicit duration target.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his."
  }' \
  --output output_duration_tokens.wav

If omitted, the model picks the duration automatically.
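
The inline prefix format can be built and checked on the client before a request is sent. The helpers below are a sketch that mirrors the documented rules (a ${token:N} prefix, N a positive integer); the function names are hypothetical and the regular expression is an assumption, not the server's parser.

```python
import re

# Matches a leading ${token:N} prefix as described in the text.
_TOKEN_PREFIX = re.compile(r"^\$\{token:(\d+)\}")


def with_token_prefix(text, token_count):
    """Prepend the inline duration prefix after validating the count."""
    if (isinstance(token_count, bool) or not isinstance(token_count, int)
            or token_count <= 0):
        raise ValueError("token_count must be a positive integer")
    return "${token:%d}%s" % (token_count, text)


def split_token_prefix(text):
    """Return (token_count or None, text with any leading prefix removed)."""
    match = _TOKEN_PREFIX.match(text)
    if match is None:
        return None, text
    count = int(match.group(1))
    if count <= 0:
        raise ValueError("token_count must be a positive integer")
    return count, text[match.end():]


if __name__ == "__main__":
    prefixed = with_token_prefix("A sentence with an explicit duration target.", 150)
    print(prefixed)
    print(split_token_prefix(prefixed))
```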

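Text Markup, Style, and Language# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#text-markup-style-and-language)

Inline text markup the model understands (for example [pause Xs], Pinyin, and IPA) is passed through unchanged. The optional instructions field carries a free-text style instruction, and the optional language hint biases generation toward a target language (omit it to let the model infer the language from the text):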

{ "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。", "ref_audio": "...", "ref_text": "...", "language": "Chinese" }

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "language": "Chinese",
    "instructions": "采用自然的对话风格。"
  }' \
  --output output_markup.wav
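
When input text is assembled programmatically, the pause markers can be generated rather than typed by hand. This is a small sketch with hypothetical helper names; it assumes the marker is written exactly as in the examples above ([pause 0.5s], [pause 3.2s]) and makes no claim about which other spellings the model accepts.

```python
def pause(seconds):
    """Format an inline pause marker such as [pause 0.5s]."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:g}s]"


def join_with_pauses(segments, seconds):
    """Join text segments with the same pause marker between each pair."""
    return f" {pause(seconds)} ".join(segments)


if __name__ == "__main__":
    print(join_with_pauses(["First sentence.", "Second sentence."], 0.5))
```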

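Generation Parameters# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#generation-parameters)

The two sets of defaults reflect the model's separate sampling channels: the text channel is the per-frame continue/stop head, and the audio channel is the RVQ codebooks. A single temperature, top_p, or top_k in a request applies to both.

Seed Reproducibility# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#seed-reproducibility)

A fixed seed is reproducible at any concurrency: each token's sampling depends only on its own seed and position, not on its batch neighbors.

Reproducibility holds only for a fixed server configuration and hardware: floating-point non-determinism in the backbone (different batch shapes, GPUs, or kernels) can still shift sampled tokens across deployments.
seed must be a non-negative integer; negative or non-integer values are rejected.
Without a seed, each request draws a fresh random seed and is not reproducible across runs.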
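
The batch-independence property can be illustrated with a toy sampler whose random state is derived only from (seed, position). This is a sketch of the idea, not SGLang-Omni's sampler: the hash-based derivation and all function names are assumptions, and a uniform draw stands in for sampling from real model logits.

```python
import hashlib
import random


def validate_seed(seed):
    """Accept only non-negative integers, as the rules above require."""
    if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
        raise ValueError("seed must be a non-negative integer")
    return seed


def sample_token(seed, position, vocab_size):
    """Draw one token id from an RNG that depends only on (seed, position)."""
    digest = hashlib.sha256(f"{seed}:{position}".encode("ascii")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.randrange(vocab_size)  # uniform stand-in for real logits


def sample_batch(seeds, position, vocab_size):
    """Sample one token per request; no request sees another request's RNG."""
    return [sample_token(validate_seed(s), position, vocab_size) for s in seeds]


if __name__ == "__main__":
    alone = [sample_token(7, p, 1024) for p in range(5)]
    batched = [sample_batch([3, 7, 11], p, 1024)[1] for p in range(5)]
    print(alone == batched)
```

Because no state is shared across requests, the request with seed 7 draws the same tokens whether it runs alone or alongside others.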

Benchmarking# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#benchmarking)

MOSS-TTS-Local clones from each prompt (--ref-format references) and estimates a per-sample duration with --token-count auto. Run with --max-concurrency 16:

python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 --port 8000 \
  --ref-format references \
  --token-count auto \
  --output-dir results/moss_tts_en \
  --lang en --max-concurrency 16

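Use --lang zh for the Chinese split. See benchmarks/README.md for the full workflow.

Evaluation Benchmarks# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#evaluation-benchmarks)

Seed-TTS-Eval Reference Performance# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#seed-tts-eval-reference-performance)

Full Seed-TTS-Eval set (EN = 1088, ZH = 2020) on 2× H100 at concurrency 16 with --token-count auto. These are the reference inference performance figures reported in PR #728: a reproducible reference, not a CI threshold.

Multilingual Voice Clone# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#multilingual-voice-clone)

We evaluate MOSS-TTS-Local-Transformer-v1.5 on public multilingual TTS suites and an internal voice-cloning stress-test set, covering multilingual synthesis, speaker similarity, and hard speaker-stability cases.

WER (↓) and SIM (↑) are macro-averaged percentages. N/A means the benchmark measures speaker similarity only and does not report WER.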
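
For readers unfamiliar with the term, macro-averaging can be sketched as follows. The sketch assumes the averaging unit is the language, with each language weighted equally regardless of how many utterances it has; the source does not spell this out. The function name and the numbers in the usage example are made up for illustration and are not benchmark results.

```python
def macro_average(per_language_scores):
    """Mean of per-language means; entries with no scores are skipped."""
    language_means = [
        sum(scores) / len(scores)
        for scores in per_language_scores.values()
        if scores  # an empty or missing entry (metric not measured) is skipped
    ]
    if not language_means:
        return None  # nothing measured, so there is nothing to report
    return sum(language_means) / len(language_means)


if __name__ == "__main__":
    # Illustrative values only.
    print(macro_average({"en": [1.0, 3.0], "zh": [4.0]}))
```

In the usage example, a plain average over all three utterances would weight English twice as heavily as Chinese, whereas the macro-average treats the two languages equally.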

These results were measured with audio sampling parameters temperature=1.7, top_p=0.8, and top_k=25. According to MOSI.AI's testing, temperature=0.6, top_p=0.95, top_k=25, and audio_repetition_penalty=1.2 may yield better quality.

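Known Limitations# (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#known-limitations)

Voice-cloning quality depends on the reference audio. Omit the reference for non-cloned synthesis; when cloning, supply the transcript (text/ref_text) for the best speaker similarity.
Rare runaway generation. A small number of utterances may loop and generate up to max_new_tokens; setting token_count (or lowering max_new_tokens) bounds the output length.
Duration is a hint only. ${token:N}/token_count steers the length but does not give an exact clip duration.
Reproducibility is hardware-bound. A fixed seed reproduces only on the same server configuration and GPU; see Seed Reproducibility (https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html#seed-reproducibility).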

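OpenMOSS (@Open_MOSS): 🤗 MOSS-TTS-Local Transformer v1.5 is now open source.

Built on a pure autoregressive audio tokenizer + LLM paradigm:

> MOSS-Audio-Tokenizer-v2, 2B parameters
> Qwen3-4B backbone
> Native 48 kHz stereo audio
> Streaming output, theoretical TTFT under 100 ms
> Zero-shot voice cloning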