@FeitengLi: Actually, these problems can be well solved: 1. Ditch whisper, switch to an ASR model. Qwen3-ASR is great with few hallucinations, and there are other ASR options. Whisper has many hallucinations and requires 30s segments. Qwen3-ASR gets more accurate with longer audio, supporting up to 20…
Summary
Recommends using Qwen3-ASR instead of Whisper to reduce hallucinations, using LattifAI tools for precise audio-text alignment and subtitle generation, and introducing their own OmniVAD-Kit project for voice activity detection.
Cached at: 05/16/26, 03:10 AM
element | `` with `itunes:timing="Word"` |

#### Speaker Colors

The `speaker_color` option colorizes speaker names in ASS output (works with both karaoke and non-karaoke modes):

| Value | Behavior |
|-------|----------|
| `""` (default) | No speaker coloring |
| `"auto"` | Assigns from a built-in 10-color palette |
| `"#RRGGBB"` | Single color for all speakers |
| `"#RRGGBB,#00BFFF,…"` | Comma-separated list, one per speaker (cycles if more speakers than colors) |

```bash
# Auto-color speakers in ASS output
lai caption convert input.json output.ass \
    render.include_speaker_in_text=true \
    ass.speaker_color=auto

# Custom single color
lai caption convert input.json output.ass \
    render.include_speaker_in_text=true \
    ass.speaker_color="#1387C0"
```

#### Karaoke Color Schemes

Use `ass.karaoke_color_scheme` to apply a predefined color scheme for karaoke ASS output. Each scheme sets `primary_color`, `secondary_color`, `outline_color`, and `back_color`.

12 schemes available: `azure-gold`, `sakura-purple`, `mint-ocean`, `gardenia-green`, `sunset-warm`, `prussian-elegant`, `burgundy-classic`, `langgan-spring`, `mars-teal`, `spring-field`, `navy-pink`, `apricot-dark`

```bash
# Karaoke with color scheme + auto speaker colors
lai caption convert input.json output.ass \
    ass.karaoke_effect=sweep \
    ass.karaoke_color_scheme=azure-gold \
    ass.speaker_color=auto
```

### VTT Format (YouTube VTT Support)

The VTT format handler supports both standard WebVTT and YouTube VTT with word-level timestamps.
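For orientation, YouTube-style VTT carries word-level timing as inline `<HH:MM:SS.mmm>` tags inside the cue text. The following is a minimal standalone sketch of pulling word start times out of one cue line. It is not lattifai's parser: `parse_word_tags` is a hypothetical helper written for illustration, and it assumes every word is preceded by its own timestamp tag.

```python
import re

# Matches an inline timestamp tag such as <00:00:00.500>
TAG = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>")


def parse_word_tags(cue_text):
    """Return (start_seconds, word) pairs from one cue line (hypothetical helper)."""
    # Splitting on a pattern with four capture groups yields:
    # [text, hh, mm, ss, ms, text, hh, mm, ss, ms, text, ...]
    parts = TAG.split(cue_text)
    words = []
    for i in range(1, len(parts), 5):
        hours, minutes, seconds, millis = (int(x) for x in parts[i:i + 4])
        start = hours * 3600 + minutes * 60 + seconds + millis / 1000
        word = parts[i + 4].strip()
        if word:
            words.append((start, word))
    return words


print(parse_word_tags("<00:00:00.000> Hello<00:00:00.500> world"))
# [(0.0, 'Hello'), (0.5, 'world')]
```

Text that appears before the first tag is ignored in this sketch; a real parser would also need the cue's own start and end times to bound the first and last word.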
**Reading**: VTT automatically detects YouTube VTT format (with inline timestamp tags) and extracts word-level alignment data:

```
WEBVTT

00:00:00.000 --> 00:00:02.000
<00:00:00.000> Hello<00:00:00.500> world
```

**Writing**: Use `render.word_level=True` to output YouTube VTT style with word timestamps:

```python
from lattifai.data import Caption
from lattifai.caption.config import ASSConfig, RenderConfig

caption = Caption.read("input.vtt")
caption.write(
    "output.ass",
    format_config=ASSConfig(karaoke_effect="sweep"),
    render=RenderConfig(word_level=True),
)
```

```bash
# CLI: Convert to VTT with word-level timestamps
lai caption convert input.json output.vtt \
    render.word_level=true
```

### Transcription Language Support

#### Gemini Models (100+ Languages)

**Models**: `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`, `gemini-3-pro-preview`, `gemini-3-flash-preview`, `gemini-3.1-pro-preview`

English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, and 70+ more.
> Requires Gemini API key from [Google AI Studio](https://aistudio.google.com/apikey)

#### NVIDIA Parakeet (24 European Languages)

**Model**: `nvidia/parakeet-tdt-0.6b-v3`

| Region | Languages |
|--------|-----------|
| Western Europe | English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl) |
| Nordic | Danish (da), Swedish (sv), Norwegian (no), Finnish (fi) |
| Eastern Europe | Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru) |
| Others | Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el) |

#### Alibaba SenseVoice (5 Asian Languages)

**Model**: `iic/SenseVoiceSmall`

Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)

#### FunAudioLLM Fun-ASR-Nano (31 Languages)

**Models**: [`FunAudioLLM/Fun-ASR-Nano-2512`](https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512), [`FunAudioLLM/Fun-ASR-MLT-Nano-2512`](https://huggingface.co/FunAudioLLM/Fun-ASR-MLT-Nano-2512)

800M parameter end-to-end ASR model from Tongyi Lab, excelling at far-field, high-noise, dialect/accent, and music lyric recognition.
| Region | Languages |
|--------|-----------|
| East Asia | Chinese (+ 7 dialects, 26 accents), Japanese, Korean, Cantonese |
| Southeast Asia | Vietnamese, Indonesian, Thai, Malay, Filipino |
| South Asia | Hindi |
| Middle East | Arabic |
| Europe | English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish |

```bash
# Use ModelScope (default for China)
lai transcribe run audio.wav output.srt \
    transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
    transcription.model_hub=modelscope

# Use HuggingFace
lai transcribe run audio.wav output.srt \
    transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
    transcription.model_hub=huggingface
```

#### vLLM/SGLang (Any ASR Model)

Any ASR model served via [vLLM](https://docs.vllm.ai) or [SGLang](https://sgl-project.github.io/) with an OpenAI-compatible API.

**Supported models and limitations:**

| Model | Audio tok/s | Max Audio | API Mode | Batch | Notes |
|-------|-------------|-----------|----------|-------|-------|
| [Qwen3-ASR](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (0.6B/1.7B) | 25 | auto | transcriptions | Yes | Best for zh/en/ja/ko |
| [Whisper](https://huggingface.co/openai/whisper-large-v3-turbo) | 50 | **30s** | transcriptions | Yes | Fixed 30s context window |
| [Voxtral](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) | 12.5 | auto | transcriptions | Yes | European languages |
| [Voxtral Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) | 12.5 | auto | realtime | Yes | WebSocket, <500ms latency |
| [Ultravox](https://huggingface.co/fixie-ai/ultravox-v0_5) | 6.25 | auto | transcriptions | Yes | Confirmed in vLLM source |
| [Gemma-3n](https://huggingface.co/google/gemma-3n-E4B-it) | 6.25 | **30s** | chat (auto) | **No** | Not a dedicated ASR model (~3x Whisper WER), [30s encoder limit](https://huggingface.co/google/gemma-3n-E4B-it/discussions/37), no concurrent requests |

- **Max Audio**: "auto" = estimated from `max_model_len`; bold values are hard encoder limits
- **Batch**: Whether `batch_size>1` concurrent requests are supported
- **API Mode**: `transcriptions` is the default; general-purpose LLMs auto-switch to `chat`

**API modes:**

| Mode | Endpoint | Use Case |
|------|----------|----------|
| `transcriptions` (default) | `/v1/audio/transcriptions` | Dedicated ASR models (Qwen3-ASR, Whisper, GLM-ASR, etc.) |
| `chat` | `/v1/chat/completions` | General-purpose LLMs (Gemma-3n, etc.); auto-selected for non-ASR models |
| `realtime` | `/v1/realtime` (WebSocket) | Voxtral Realtime |

```bash
# 1. Install vLLM with audio support (requires CUDA GPU)
pip install vllm "vllm[audio]"

# 2. Start vLLM server on a Linux GPU machine (auto-downloads the model)
vllm serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8081
# Other models:
#   vllm serve openai/whisper-large-v3-turbo
#   vllm serve google/gemma-3n-E4B-it --max-model-len 32000 --enforce-eager

# 3. Transcribe (default: transcriptions mode)
lai transcribe run audio.wav output.srt \
    transcription.model_name=Qwen/Qwen3-ASR-1.7B \
    transcription.api_base_url=http://localhost:8081/v1

# Batch mode for faster processing (4 concurrent requests)
lai transcribe run audio.wav output.srt \
    transcription.model_name=Qwen/Qwen3-ASR-1.7B \
    transcription.api_base_url=http://localhost:8081/v1 \
    transcription.batch_size=4

# General-purpose LLM (auto-switches to chat mode with ASR system prompt)
lai transcribe run audio.wav output.srt \
    transcription.model_name=google/gemma-3n-E4B-it \
    transcription.api_base_url=http://localhost:8084/v1 \
    transcription.language=zh

# Voxtral Realtime (streaming WebSocket, <500ms latency)
# Server: VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
#   --host 0.0.0.0 --port 8086 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
lai transcribe run audio.wav output.srt \
    transcription.model_name=mistralai/Voxtral-Mini-4B-Realtime-2602 \
    transcription.api_base_url=http://localhost:8086/v1 \
    transcription.api_mode=realtime
```

---

## Roadmap

Visit [lattifai.com/roadmap](https://lattifai.com/roadmap) for updates.

| Date | Release | Features |
|------|---------|----------|
| **Oct 2025** | Lattice-1-Alpha | ✅ English forced alignment, multi-format support |
| **Nov 2025** | Lattice-1 | ✅ EN+ZH+DE, speaker diarization, multi-model transcription |
| **Q2 2026** | Lattice-2 | ✅ Streaming mode, 🔮 40+ languages, real-time alignment |

---

## Development

```bash
git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python

# Using uv (recommended, auto-configures extra index)
uv sync && source .venv/bin/activate

# Or pip (requires extra-index-url for lattifai-core)
pip install -e ".[all,dev]" --extra-index-url https://lattifai.github.io/pypi/simple/

# Run tests
pytest

# Install pre-commit hooks
pre-commit install
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make changes and add tests
4. Run `pytest` and `pre-commit run --all-files`
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

---

## Support

- Issues: [GitHub Issues](https://github.com/lattifai/lattifai-python/issues)
- Discord: [Join our community](https://discord.gg/kvF4WsBRK8)

## License

Apache License 2.0
Similar Articles
@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
@XieZhifei14110: Stop using Whisper for ASR ! open sourcing Mega-ASR — the first full-scenario SOTA industrial-grade ASR model, built fo…
Open sourcing Mega-ASR, a full-scenario SOTA industrial-grade ASR model designed for challenging audio conditions like far-field and noise, outperforming existing open and closed models by 10-30% on real-world benchmarks.
@denziideng: Another AI voice cloning 'dimensional reduction attack'... The CosyVoice I shared before can clone in 3 seconds, which I thought was already scary enough. But today's tool is even more lethal — after casually recording 1 minute of my own voice for training, it directly replicates tone, mannerisms, emotions, breathing, and pauses. It's almost like the soul of the original person possessed it! C...
GPT-SoVITS is an open-source AI voice cloning tool that supports zero-shot (5-second voice) and few-shot (1-minute training) high-fidelity voice cloning, cross-lingual inference, and comes with a complete WebUI toolchain. It has garnered 57.8k stars on GitHub, becoming the leading open-source project in the voice cloning field.
Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
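@uniswap12: Microsoft has open-sourced a speech AI that transcribes 60 minutes of long audio in a single pass and can handle four people speaking at once. VibeVoice, open-sourced by Microsoft, 24.8k stars; I only found out about it today. For one-click recording-to-text I had always used Whisper, but it often times out on long meeting recordings, and multi-speaker recognition…
Microsoft has open-sourced the speech AI framework VibeVoice, which supports single-pass transcription of 60-minute audio, multi-speaker separation, and timestamp annotation, and also provides multi-role TTS synthesis. It is built on Qwen2.5, comes with a lightweight 0.5B real-time version, and has earned 24.8k stars on GitHub.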