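@uniswap12: Microsoft open-sourced a voice AI: 60 minutes of audio transcribed in one pass, and it can even handle 4 people talking at once. VibeVoice, open-sourced by Microsoft, 24.8k stars, and I only found out about it today. For one-click speech-to-text I have always used Whisper, but it often times out on long meeting recordings, and multi-speaker recognition…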

X AI KOLs Timeline Tools

Summary

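Microsoft has open-sourced VibeVoice, a voice AI framework that supports single-pass transcription of 60-minute audio, speaker diarization, and timestamping, and also offers multi-speaker TTS synthesis. It is built on Qwen2.5, comes with a lightweight 0.5B real-time version, and has 24.8k stars on GitHub.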


Cached: 2026/06/05 02:21

Microsoft open-sourced a voice AI: 60 minutes of audio transcribed in one pass, and it can even handle 4 people talking at once

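VibeVoice, open-sourced by Microsoft, 24.8k stars, and I only found out about it today. For one-click recording-to-text I have always used Whisper, but it often times out on long meeting recordings and makes quite a few mistakes when several people are speaking.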

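VibeVoice directly supports 60 minutes of continuous audio, with built-in speaker diarization and timestamps, and it can tell who said what even when four people are talking at once. What surprised me most is the TTS side: 4 voices synthesized together, 90 minutes of coherent output, and the voices stay consistent throughout.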

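Anyone making audiobooks or podcasts should find this interesting. Multi-speaker synthesis used to suffer from voices drifting between the start and the end, and this solves that. Under the hood it is Qwen2.5 plus a dedicated continuous speech tokenizer, and there is also a lightweight 0.5B version with 300 ms latency that can be plugged straight into a conversational AI for real-time voice interaction, with no need for a separate third-party TTS service.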

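I'm thinking of wiring the ASR part into my own meeting-notes tool. If it really runs reliably, automatically producing minutes with speaker labels after every meeting would be a huge efficiency boost.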

Open-source repo: https://github.com/microsoft/VibeVoice…

#AI #AIAgent


microsoft/VibeVoice

Source: https://github.com/microsoft/VibeVoice

🎙️ VibeVoice: Open-Source Frontier Voice AI

Project Page | Hugging Face | TTS Report | ASR Report | Colab | ASR Playground


📰 News

2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground.

2025-12-16: 📣 We added experimental speakers to VibeVoice‑Realtime‑0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. Try it. More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice‑Realtime‑0.5B, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.

2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. It was accepted as an Oral at ICLR 2026! 🔥

Overview

VibeVoice is a family of open-source frontier voice AI models that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
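As a rough illustration of why the 7.5 Hz frame rate matters for long audio, the sketch below counts how many tokenizer frames a recording of a given length produces. The 50 Hz comparison rate is an assumption chosen purely for contrast, not a figure from this README, and the count ignores text tokens and any other sequence overhead.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames produced for audio of the given length."""
    return round(duration_minutes * 60 * frame_rate_hz)


VIBEVOICE_RATE_HZ = 7.5     # frame rate stated in the overview
COMPARISON_RATE_HZ = 50.0   # hypothetical higher-rate tokenizer, for contrast only

for minutes in (10, 60, 90):
    low = frames_for(minutes, VIBEVOICE_RATE_HZ)
    high = frames_for(minutes, COMPARISON_RATE_HZ)
    print(f"{minutes:>3} min: {low:>6} frames at 7.5 Hz vs {high:>7} frames at 50 Hz")
```

Under these assumptions, 60 minutes of audio comes to 27,000 frames at 7.5 Hz and 90 minutes to 40,500, which is why hour-scale audio can plausibly fit in a single long-context sequence.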

For more information, demos, and examples, please visit our Project Page.

Model | Weight | Quick Try
VibeVoice-ASR-7B | HF Link | Playground
VibeVoice-TTS-1.5B | HF Link | Disabled
VibeVoice-Realtime-0.5B | HF Link | Colab

Models

1. 📖 VibeVoice-ASR - Long-form Speech Recognition

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords.

  • 🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.

  • 👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

  • 📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
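To make the "Who, When, What" output concrete, here is a minimal sketch of how a structured transcript could be turned into speaker-labelled meeting notes. The `Segment` fields and the helper names (`merge_turns`, `format_timestamp`, `to_minutes`) are hypothetical and chosen for illustration. They are not VibeVoice-ASR's actual output schema, so real model output would first need to be mapped into this shape.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str   # who (hypothetical label such as "Speaker 1")
    start: float   # when: start time in seconds
    end: float     # when: end time in seconds
    text: str      # what


def merge_turns(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments from the same speaker into a single turn."""
    turns: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start,
                                max(last.end, seg.end),
                                f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def to_minutes(segments: List[Segment]) -> str:
    """Produce one '[time] speaker: text' line per merged turn."""
    return "\n".join(
        f"[{format_timestamp(t.start)}] {t.speaker}: {t.text}"
        for t in merge_turns(segments)
    )


demo = [
    Segment("Speaker 1", 0.0, 2.0, "Hello"),
    Segment("Speaker 1", 2.0, 4.0, "everyone"),
    Segment("Speaker 2", 65.0, 66.0, "Hi"),
]
print(to_minutes(demo))
```

Merging adjacent same-speaker segments keeps the notes readable when a model emits many short segments for one long turn.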

📖 Documentation | 🤗 Hugging Face | 🎮 Playground | 🛠️ Finetuning | 📊 Paper

Evaluation figures (images not shown): DER, cpWER, tcpWER
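The metric names above are standard in multi-speaker ASR evaluation. As a rough sketch under the usual definition, cpWER (concatenated minimum-permutation WER) concatenates each speaker's words, tries every mapping between reference and hypothesis speakers, and reports the lowest word error rate. The function names below are illustrative, the brute-force permutation search only suits a handful of speakers, and this is not the evaluation code behind the reported results.

```python
from itertools import permutations
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker: List[str], hyp_by_speaker: List[str]) -> float:
    """cpWER over per-speaker concatenated transcripts (one string per speaker)."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    # Pad the shorter side with empty speakers so every mapping is one-to-one.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcript is empty")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


reference = ["hello there", "good morning all"]
hypothesis = ["good morning", "hello there"]   # speaker labels swapped, one word missed
print(cp_wer(reference, hypothesis))
```

Because the best speaker mapping is chosen automatically, a system is not penalized for naming speakers differently from the reference, only for attributing words to the wrong person or getting the words wrong.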

https://github.com/user-attachments/assets/acde5602-dc17-4314-9e3b-c630bc84aefa


2. 🎙️ VibeVoice-TTS - Long-form Multi-speaker TTS

Best for: Long-form conversational audio, podcasts, multi-speaker dialogues

  • ⏱️ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.

  • 👥 Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.

  • 🎭 Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.

  • 🌐 Multi-lingual Support: Supports English, Chinese and other languages.
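When preparing dialogue text for a model with a fixed speaker limit, it can help to validate the script up front. The sketch below assumes a hypothetical `Speaker N: text` line format, which is an illustration and not a format confirmed by this README, and enforces the 4-speaker limit listed above.

```python
import re
from typing import List, Tuple

MAX_SPEAKERS = 4  # limit stated in the feature list
# Hypothetical line format: "Speaker <number>: <text>"
LINE_RE = re.compile(r"^\s*(Speaker\s+\d+)\s*:\s*(.+?)\s*$")


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Parse a dialogue script into (speaker, text) turns and check the speaker count."""
    turns: List[Tuple[str, str]] = []
    for lineno, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are allowed between turns
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text'")
        speaker = " ".join(match.group(1).split())  # normalize inner whitespace
        turns.append((speaker, match.group(2)))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, at most {MAX_SPEAKERS} supported")
    return turns


print(parse_script("Speaker 1: Welcome to the show.\n\nSpeaker 2: Glad to be here."))
```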

📖 Documentation | 🤗 Hugging Face | 📊 Paper

VibeVoice Results

English

https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784

Chinese

https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f

Cross-Lingual

https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722

Spontaneous Singing

https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730

Long Conversation with 4 people

https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727


3. ⚡ VibeVoice-Streaming - Real-time Streaming TTS

VibeVoice-Realtime is a lightweight real‑time text-to-speech model supporting streaming text input and robust long-form speech generation.

  • Parameter size: 0.5B (deployment-friendly)
  • Real-time TTS (~300 milliseconds first audible latency)
  • Streaming text input
  • Robust long-form speech generation (~10 minutes)
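First audible latency is the time from sending text to receiving the first playable audio chunk. The sketch below shows one way to measure it for any streaming source. `fake_tts_stream` is a hypothetical stand-in that sleeps and yields placeholder bytes; it is not the VibeVoice-Realtime API, and a real integration would replace it with the model's own streaming iterator.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text_chunks: Iterable[str], delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical streaming TTS: yields one placeholder audio chunk per text chunk."""
    for chunk in text_chunks:
        time.sleep(delay_s)                # simulate synthesis time
        yield b"\x00" * (320 * len(chunk))  # placeholder audio bytes


def measure_stream(audio_stream: Iterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in audio_stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio")
    return first_latency, count


latency, chunks = measure_stream(fake_tts_stream(["Hello,", " world."]))
print(f"first chunk after {latency * 1000:.0f} ms, {chunks} chunks total")
```

Measuring at the consumer side captures queueing and transport delays as well as synthesis time, which is what a listener actually experiences.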

📖 Documentation | 🤗 Hugging Face | 🚀 Colab

https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc


Contributing

Please see CONTRIBUTING.md for detailed contribution guidelines.

⚠️ Risks and Limitations

While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways.

Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.


Similar Articles

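@denziideng: Found another "next-level" AI voice cloning tool... I thought CosyVoice, which I shared earlier and which clones a voice from 3 seconds of audio, was already scary enough, but today's find is even worse: after training on just 1 minute of my own recorded voice, it reproduced my timbre, tone, emotion, breathing, and pauses, as if it were possessed by my soul! Alibaba DAMO Academy's C…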

X AI KOLs Timeline

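GPT-SoVITS is an open-source AI voice cloning tool that supports high-fidelity zero-shot (5 seconds of audio) and few-shot (1 minute of training) voice cloning as well as cross-lingual inference, and ships with a complete WebUI toolchain. With 57.8k stars on GitHub, it has become a leading open-source project in voice cloning.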

VibeVoice Technical Report

Papers with Code Trending

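VibeVoice is a new model from Microsoft that uses Next-Token Diffusion and a highly efficient continuous speech tokenizer to generate long-form multi-speaker speech. It achieves excellent fidelity and compression, and supports generating up to 90 minutes of multi-speaker audio.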

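@Honcia13: Open-source TTS competition has gone wild! A new weapon for scam compounds? Tsinghua's OpenBMB just released VoxCPM2: 20 billion parameters + 2 million hours of multilingual training data, 48 kHz studio-grade audio quality! The most striking part: it uses no tokenizer at all and runs diffusion autoregression directly in a continuous latent space…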

X AI KOLs Timeline

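Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 20 billion parameters. It supports tokenizer-free diffusion autoregressive generation in a continuous latent space, and offers 48 kHz studio-grade audio quality along with strong voice cloning and voice design capabilities.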

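@MaxForAI: If you are building a voice agent, you should try this project. A team from Nanyang Technological University, the National University of Singapore, and Shanghai AI Lab has released Mega-ASR. This fully open-source ASR is built on Qwen3-ASR and aims to break the long-standing bottleneck in ASR performance in noisy, reverberant, or otherwise degraded real-world environments…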

X AI KOLs Timeline

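Nanyang Technological University, the National University of Singapore, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to a 30% relative reduction in word error rate in noisy real-world environments, and at only 1.7B parameters it runs efficiently on consumer hardware.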