k2-fsa/OmniVoice


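Summary

OmniVoice is a massively multilingual zero-shot text-to-speech model that supports more than 600 languages. It is built on a diffusion language model architecture and offers fast inference and voice cloning.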

Task: text-to-speech

Tags: omnivoice, safetensors, zero-shot, multilingual, voice-cloning, voice-design, text-to-speech, aae, aal, aao, ab, abb, abn, abr, abs, abv, acm, acw, acx, adf, adx, ady, aeb, aec, af, afb, afo, ahl, ahs, ajg, aju, ala, aln, alo, am, amu, an, anc, ank, anp, anw, aom, apc, apd, arb, arq, ars, ary, arz, as, ast, avl, awo, ayl, ayp, az, ba, bag, bas, bax, bba, bbj, bbl, bbu, bce, bci, bcs, bcy, bda, bde, bdm, be, beb, bew, bfd, bft, bg, bgp, bhb, bhh, bho, bhp, bhr, bjj, bjk, bjn, bjt, bkh, bkm, bky, bmm, bmq, bn, bnm, bnn, bns, bo, bou, bqg, br, bra, brh, bri, brx, bs, bsh, bsj, bsk, btm, btv, bug, bum, buo, bux, bwr, bxf, byc, bys, byv, byx, bzc, bzw, ca, ccg, ceb, cen, cfa, cgg, chq, cjk, ckb, ckl, ckr, cky, cnh, cpy, cs, cte, ctl, cut, cux, cv, cy, da, dag, dar, dav, dbd, dcc, de, deg, dgh, dgo, dje, dmk, dml, dru, dty, dua, dv, dyu, dzg, ebr, ebu, ego, eiv, eko, ekr, el, elm, en, eo, es, esu, et, eto, ets, etu, eu, ewo, ext, eyo, fa, fan, fat, ff, ffm, fi, fia, fil, fip, fkk, fmp, fr, fub, fuc, fue, fuf, fuh, fui, fuq, fuv, fy, ga, gbm, gbr, gby, gcc, gdf, gej, ges, ggg, gid, gig, giz, gjk, gju, gl, glw, gn, gol, gom, gsl, gu, gui, gur, guz, gv, gwc, gwe, gwt, gya, gyz, ha, hah, hao, haw, haz, hbb, he, hem, hi, hia, hkk, hla, hno, hoj, hr, hsb, ht, hu, hue, hul, hux, hwo, hy, hz, ia, ibb, id, ida, idu, ig, ijc, ijn, ik, ikw, is, ish, iso, it, its, itw, itz, ja, jal, jax, jgo, jmx, jns, jqr, juk, juo, jv, ka, kab, kai, kaj, kam, kbd, kbl, kbt, kcq, kdh, kea, keu, kfe, kfk, kfp, khg, khw, kj, kjc, kjk, kk, kln, kls, km, kmr, kmy, kn, kna, knn, ko, kol, koo, kpo, kqo, ks, ksd, ksf, kto, kuh, kvx, kw, kwm, kxp, ky, kyx, lag, lb, lcm, ldb, lg, lij, lir, lkb, lla, ln, lnu, lo, loa, lrk, lss, lt, ltg, lto, lua, luo, lus, lv, lwg, mab, maf, mai, mau, max, mbo, mcf, mcn, mcx, mdd, mde, mdf, mek, mer, meu, mfm, mfn, mfo, mfv, mgg, mgi, mhk, mhr, mi, mig, miu, mk, mkf, mki, ml, mlq, mn, mne, mni, mqy, mr, mrj, mrr, mrt, ms, mse, msh, msw, mt, mtr, mtu, mtx, mua, mug, mui, mve, mvy, mxs, mxu, mxy, my, myv, mzl, nal, nan, nap, nb, nbh, ncf, nco, ncx, ndi, ng, ngi, nhg, nhi, nhn, nhq, nja, nl, nla, nlv, nmg, nmz, nn, nnh, no, noe, npi, nso, ny, nyu, oc, odk, odu, ogo, om, orc, oru, ory, os, pa, pbs, pbt, pbu, pcm, pex, phl, phr, pip, piy, pko, pl, plk, plt, pmq, pms, pmy, pnb, poc, poe, pow, prq, ps, pst, pt, pua, pwn, qug, qum, qup, qur, qus, quv, qux, quy, qva, qvi, qvj, qvl, qwa, qws, qxa, qxp, qxt, qxu, qxw, rag, rm, ro, rob, rof, roo, rth, ru, rup, rw, sa, sah, sat, sau, say, sbn, sc, scl, scn, sd, sei, shu, si, sip, siw, sjr, sk, skg, skr, sl, sn, snc, snk, so, sol, sps, sq, sr, src, sro, ssi, ste, sua, sv, sva, sw, szy, ta, tan, tar, tay, tbf, tcf, tcy, tdn, tdx, te, tg, tgc, th, the, thq, thr, thv, ti, tig, tio, tk, tkg, tkt, tli, tlp, tn, tok, tpl, tpz, tqp, tr, trp, trq, trv, trw, tt, ttj, ttr, ttu, tui, tul, tuq, tuv, tuy, tvo, tvu, tw, twu, txs, txy, udl, ug, uk, uki, umb, ur, ush, uz, uzn, vai, var, ver, vi, vmc, vmj, vmm, vmp, vmz, vot, vro, wbl, wci, weo, wes, wja, wji, wo, wof, xh, xhe, xka, xmf, xmv, xmw, xpe, xti, xtu, yaq, yav, yay, ydd, ydg, yer, yes, yi, yo, yue, zga, zgh, zh, zoc, zoh, zor, zpv, zpy, ztg, ztn, ztp, zts, ztu, zu, zza, arxiv:2604.00688, base_model:Qwen/Qwen3-0.6B, base_model:finetune:Qwen/Qwen3-0.6B, license:apache-2.0, region:us

Cached: 2026/05/08 09:09

k2-fsa/OmniVoice · Hugging Face

Source: https://huggingface.co/k2-fsa/OmniVoice

OmniVoice

  • Hugging Face model (https://huggingface.co/k2-fsa/OmniVoice)
  • Hugging Face Space (https://huggingface.co/spaces/k2-fsa/OmniVoice)
  • GitHub code (https://github.com/k2-fsa/OmniVoice)
  • Open in Colab (https://colab.research.google.com/github/k2-fsa/OmniVoice/blob/master/docs/OmniVoice.ipynb)

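OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model that supports more than 600 languages. Built on a novel diffusion language model-style architecture, it generates high-quality speech at very high inference speed and supports both voice cloning and voice design.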

  • **Paper:** OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models (https://huggingface.co/papers/2604.00688)
  • **Repository:** GitHub (https://github.com/k2-fsa/OmniVoice)
  • **Demo:** Hugging Face Space (https://huggingface.co/spaces/k2-fsa/OmniVoice)
  • **Colab:** Google Colab notebook (https://colab.research.google.com/github/k2-fsa/OmniVoice/blob/master/docs/OmniVoice.ipynb)

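Key Features

  • 600+ languages supported: the broadest language coverage among zero-shot TTS models.
  • Voice cloning: state-of-the-art cloning quality from a short reference audio clip.
  • Voice design: control the voice through assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
  • Fine-grained control: non-verbal symbols (e.g., [laughter]) and pronunciation correction via pinyin or phonemes.
  • Fast inference: RTF as low as 0.025 (40x faster than real time).
  • Diffusion language model-style architecture: a clean, streamlined, and scalable design that delivers both quality and speed.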
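
The relationship between the RTF figure and the "40x faster than real time" claim in the list above can be checked with a little arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio). The function names and the timing numbers are illustrative only; they are not part of the omnivoice library and are not measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """An RTF below 1 means faster than real time; the speedup is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: 0.25 s of compute for 10 s of audio.
rtf = real_time_factor(0.25, 10.0)
print(f"RTF = {rtf:.3f}, speedup = {speedup_over_real_time(rtf):.0f}x")
```

Under this definition, an RTF of 0.025 means that one second of audio takes about 25 ms to generate.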

Usage

First, install the omnivoice library:

We recommend using a fresh virtual environment (e.g., conda, venv, etc.) to avoid conflicts.

Step 1: Install PyTorch

NVIDIA GPU

Install the PyTorch build that matches your CUDA version, for example:

    pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

For other versions, see the official PyTorch website (https://pytorch.org/get-started/locally/).

Apple Silicon

    pip install torch==2.8.0 torchaudio==2.8.0

Step 2: Install OmniVoice

pip install omnivoice
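
After installation, it can be useful to confirm which accelerator PyTorch actually sees before loading the model. The helper below is a minimal sketch: `pick_device` is a hypothetical name, not an omnivoice function, and the source only shows `"cuda:0"` being used, so whether the `"mps"` or `"cpu"` strings are accepted by OmniVoice is an assumption to verify against the GitHub repository.

```python
def pick_device() -> str:
    """Return a device string, preferring CUDA, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment yet.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps is only present in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed PyTorch wheel probably does not match the local CUDA version.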

Python API

You can use OmniVoice for zero-shot voice cloning as follows:

    from omnivoice import OmniVoice
    import soundfile as sf
    import torch

    # Load the model
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16,
    )

    # Generate audio
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
        ref_text="Transcription of the reference audio.",
    )
    # audio is a list of np.ndarray, each of shape (T,), sampled at 24 kHz.

    sf.write("out.wav", audio[0], 24000)
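
If soundfile is not available, a mono waveform can also be saved with Python's standard `wave` module. The sketch below assumes the samples are floats in the range [-1, 1], which the source does not state, and uses a synthetic sine tone as a stand-in for `audio[0]`. `write_wav_16bit` is a hypothetical helper, not part of omnivoice.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # output sample rate stated in the model card


def write_wav_16bit(path: str, samples, sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float samples to a 16-bit PCM WAV file; return the frame count."""
    frames = bytearray()
    for s in samples:
        # Assumption: samples lie in [-1, 1]; clip anything outside that range.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // 2


# A one-second 440 Hz tone stands in for a generated waveform such as audio[0].
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
n_frames = write_wav_16bit("tone.wav", tone)
print(f"wrote {n_frames} frames ({n_frames / SAMPLE_RATE:.2f} s)")
```

Dividing the number of samples by 24000 gives the clip duration in seconds.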

For more generation modes (e.g., voice design), features (e.g., non-verbal symbols, pronunciation correction), and complete usage instructions, see our GitHub repository (https://github.com/k2-fsa/OmniVoice).

Discussion & Communication

You can start a discussion directly on GitHub Issues (https://github.com/k2-fsa/OmniVoice/issues).

You can also scan the QR code to join our WeChat group or follow our WeChat official account.

Citation

    @article{zhu2026omnivoice,
      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2604.00688},
      year={2026}
    }

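Disclaimer

Users are strictly prohibited from using this model for unauthorized voice cloning, voice impersonation, fraud, scams, or any other illegal or unethical activity. All users must ensure full compliance with applicable local laws, regulations, and ethical standards. The developers assume no liability for any misuse of this model. They advocate responsible AI development and use, and encourage the community to uphold safety and ethical principles in AI research and applications.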
