@FeitengLi: Actually, these problems can be well solved: 1. Ditch whisper, switch to an ASR model. Qwen3-ASR is great with few hallucinations, and there are other ASR options. Whisper has many hallucinations and requires 30s segments. Qwen3-ASR gets more accurate with longer audio, supporting up to 20…

X AI KOLs Timeline 05/15/26, 10:43 AM Tools

asr speech-recognition forced-alignment vad open-source qwen3-asr lattifai

Summary

Recommends using Qwen3-ASR instead of Whisper to reduce hallucinations, using LattifAI tools for precise audio-text alignment and subtitle generation, and introducing their own OmniVAD-Kit project for voice activity detection.

Actually, these problems can all be solved:

1. Ditch Whisper and switch to another ASR model. Qwen3-ASR is great, with few hallucinations, and there are other ASR options. Whisper hallucinates a lot and requires 30s segments. Qwen3-ASR gets more accurate with longer audio and supports up to 20 minutes.
2. For text timestamps, also ditch Whisper, as it's not very accurate. Qwen/Qwen3-ForcedAligner-0.6B can be used, but tests show that beyond 180s its timestamps become chaotic. You can use @LattifAI_HQ (https://github.com/lattifai/lattifai-python…), which stays accurate even on 4 hours of audio. Check out the karaoke subtitles at https://lattifai.com/zh/podcasts/PoJ1vTdHpks…, which are very accurate. There is also a skill, https://github.com/lattifai/lattifai-skills.git…, and speaker diarization and naming are handled well too.
3. For VAD segmentation, I recommend my own project, https://github.com/lifeiteng/OmniVAD-Kit…. Top accuracy.
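To make the length limits mentioned above concrete (30s for Whisper, roughly 180s for Qwen3-ForcedAligner), here is a minimal sketch of packing VAD speech segments into chunks that stay under a model's limit. Everything in it is an assumption for illustration: `pack_segments` is a hypothetical helper, the segment list is made-up data, and none of it reflects the actual OmniVAD-Kit or LattifAI APIs.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def pack_segments(segments: List[Segment], max_chunk_s: float) -> List[Segment]:
    """Greedily merge consecutive speech segments into chunks of at most max_chunk_s."""
    chunks: List[Segment] = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= max_chunk_s:
            # Extending the current chunk keeps it within the limit.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Start a new chunk. A single segment longer than the limit is
            # kept as-is here; a real pipeline would need to split it.
            chunks.append((start, end))
    return chunks


# Hypothetical VAD output, in seconds.
speech = [(0.0, 12.0), (13.5, 28.0), (31.0, 55.0), (60.0, 170.0), (175.0, 200.0)]

whisper_chunks = pack_segments(speech, 30.0)    # 30s limit
aligner_chunks = pack_segments(speech, 180.0)   # ~180s limit

print(whisper_chunks)  # [(0.0, 28.0), (31.0, 55.0), (60.0, 170.0), (175.0, 200.0)]
print(aligner_chunks)  # [(0.0, 170.0), (175.0, 200.0)]
```

Cutting only at VAD boundaries avoids splitting in the middle of a word, which is the reason to segment with a VAD rather than with fixed windows.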

Cached at: 05/16/26, 03:10 AM


element | `` with `itunes:timing="Word"` |

#### Speaker Colors

The `speaker_color` option colorizes speaker names in ASS output (works with both karaoke and non-karaoke modes):

| Value | Behavior |
|-------|----------|
| `""` (default) | No speaker coloring |
| `"auto"` | Assigns from a built-in 10-color palette |
| `"#RRGGBB"` | Single color for all speakers |
| `"#RRGGBB,#00BFFF,…"` | Comma-separated list, one per speaker (cycles if more speakers than colors) |

```bash
# Auto-color speakers in ASS output
lai caption convert input.json output.ass \
  render.include_speaker_in_text=true \
  ass.speaker_color=auto

# Custom single color
lai caption convert input.json output.ass \
  render.include_speaker_in_text=true \
  ass.speaker_color="#1387C0"
```

#### Karaoke Color Schemes

Use `ass.karaoke_color_scheme` to apply a predefined color scheme for karaoke ASS output. Each scheme sets `primary_color`, `secondary_color`, `outline_color`, and `back_color`.

12 schemes available: `azure-gold`, `sakura-purple`, `mint-ocean`, `gardenia-green`, `sunset-warm`, `prussian-elegant`, `burgundy-classic`, `langgan-spring`, `mars-teal`, `spring-field`, `navy-pink`, `apricot-dark`

```bash
# Karaoke with color scheme + auto speaker colors
lai caption convert input.json output.ass \
  ass.karaoke_effect=sweep \
  ass.karaoke_color_scheme=azure-gold \
  ass.speaker_color=auto
```

### VTT Format (YouTube VTT Support)

The VTT format handler supports both standard WebVTT and YouTube VTT with word-level timestamps.
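As a rough illustration of what word-level data looks like once parsed, the sketch below extracts `(start_seconds, word)` pairs from a cue body that carries inline `<HH:MM:SS.mmm>` tags. It uses only the standard library. `parse_word_timestamps` is a hypothetical helper, not part of the lattifai API, and it assumes a simplified tag layout rather than every variant YouTube emits.

```python
import re
from typing import List, Tuple

# One inline timestamp tag followed by the text up to the next tag.
_WORD = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>([^<]*)")


def parse_word_timestamps(cue_text: str) -> List[Tuple[float, str]]:
    """Return (start_seconds, word) pairs from a cue body with inline tags."""
    pairs: List[Tuple[float, str]] = []
    for h, m, s, ms, text in _WORD.findall(cue_text):
        word = text.strip()
        if word:  # skip tags that are not followed by any text
            start = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
            pairs.append((start, word))
    return pairs


cue = "<00:00:00.000> Hello<00:00:00.500> world"
print(parse_word_timestamps(cue))  # [(0.0, 'Hello'), (0.5, 'world')]
```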
**Reading**: VTT automatically detects YouTube VTT format (with inline tags) and extracts word-level alignment data:

```
WEBVTT

00:00:00.000 --> 00:00:02.000
<00:00:00.000> Hello<00:00:00.500> world
```

**Writing**: Use `render.word_level=True` to output YouTube VTT style with word timestamps:

```python
from lattifai.data import Caption
from lattifai.caption.config import ASSConfig, RenderConfig

caption = Caption.read("input.vtt")
caption.write(
    "output.ass",
    format_config=ASSConfig(karaoke_effect="sweep"),
    render=RenderConfig(word_level=True),
)
```

```bash
# CLI: Convert to VTT with word-level timestamps
lai caption convert input.json output.vtt \
  render.word_level=true
```

### Transcription Language Support

#### Gemini Models (100+ Languages)

**Models**: `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`, `gemini-3-pro-preview`, `gemini-3-flash-preview`, `gemini-3.1-pro-preview`

English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, and 70+ more.
> Requires Gemini API key from Google AI Studio (https://aistudio.google.com/apikey)

#### NVIDIA Parakeet (24 European Languages)

**Model**: `nvidia/parakeet-tdt-0.6b-v3`

| Region | Languages |
|--------|-----------|
| Western Europe | English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl) |
| Nordic | Danish (da), Swedish (sv), Norwegian (no), Finnish (fi) |
| Eastern Europe | Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru) |
| Others | Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el) |

#### Alibaba SenseVoice (5 Asian Languages)

**Model**: `iic/SenseVoiceSmall`

Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)

#### FunAudioLLM Fun-ASR-Nano (31 Languages)

**Models**: `FunAudioLLM/Fun-ASR-Nano-2512` (https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512), `FunAudioLLM/Fun-ASR-MLT-Nano-2512` (https://huggingface.co/FunAudioLLM/Fun-ASR-MLT-Nano-2512)

800M parameter end-to-end ASR model from Tongyi Lab, excelling at far-field, high-noise, dialect/accent, and music lyric recognition.
| Region | Languages |
|--------|-----------|
| East Asia | Chinese (+ 7 dialects, 26 accents), Japanese, Korean, Cantonese |
| Southeast Asia | Vietnamese, Indonesian, Thai, Malay, Filipino |
| South Asia | Hindi |
| Middle East | Arabic |
| Europe | English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish |

```bash
# Use ModelScope (default for China)
lai transcribe run audio.wav output.srt \
  transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
  transcription.model_hub=modelscope

# Use HuggingFace
lai transcribe run audio.wav output.srt \
  transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
  transcription.model_hub=huggingface
```

#### vLLM/SGLang (Any ASR Model)

Any ASR model served via vLLM (https://docs.vllm.ai) or SGLang (https://sgl-project.github.io/) with an OpenAI-compatible API.

**Supported models and limitations:**

| Model | Audio tok/s | Max Audio | API Mode | Batch | Notes |
|-------|-------------|-----------|----------|-------|-------|
| Qwen3-ASR (https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (0.6B/1.7B) | 25 | auto | transcriptions | Yes | Best for zh/en/ja/ko |
| Whisper (https://huggingface.co/openai/whisper-large-v3-turbo) | 50 | **30s** | transcriptions | Yes | Fixed 30s context window |
| Voxtral (https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) | 12.5 | auto | transcriptions | Yes | European languages |
| Voxtral Realtime (https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) | 12.5 | auto | realtime | Yes | WebSocket, <500ms latency |
| Ultravox (https://huggingface.co/fixie-ai/ultravox-v0_5) | 6.25 | auto | transcriptions | Yes | Confirmed in vLLM source |
| Gemma-3n (https://huggingface.co/google/gemma-3n-E4B-it) | 6.25 | **30s** | chat (auto) | **No** | Not a dedicated ASR model (~3x Whisper WER), 30s encoder limit (https://huggingface.co/google/gemma-3n-E4B-it/discussions/37), no concurrent requests |

- **Max Audio**: "auto" = estimated from `max_model_len`; bold values are hard encoder limits
- **Batch**: Whether `batch_size>1` concurrent requests are supported
- **API Mode**: `transcriptions` is the default; general-purpose LLMs auto-switch to `chat`

**API modes:**

| Mode | Endpoint | Use Case |
|------|----------|----------|
| `transcriptions` (default) | `/v1/audio/transcriptions` | Dedicated ASR models (Qwen3-ASR, Whisper, GLM-ASR, etc.) |
| `chat` | `/v1/chat/completions` | General-purpose LLMs (Gemma-3n, etc.); auto-selected for non-ASR models |
| `realtime` | `/v1/realtime` (WebSocket) | Voxtral Realtime |

```bash
# 1. Install vLLM with audio support (requires CUDA GPU)
pip install vllm "vllm[audio]"

# 2. Start vLLM server on a Linux GPU machine (auto-downloads the model)
vllm serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8081

# Other models:
# vllm serve openai/whisper-large-v3-turbo
# vllm serve google/gemma-3n-E4B-it --max-model-len 32000 --enforce-eager

# 3. Transcribe (default: transcriptions mode)
lai transcribe run audio.wav output.srt \
  transcription.model_name=Qwen/Qwen3-ASR-1.7B \
  transcription.api_base_url=http://localhost:8081/v1

# Batch mode for faster processing (4 concurrent requests)
lai transcribe run audio.wav output.srt \
  transcription.model_name=Qwen/Qwen3-ASR-1.7B \
  transcription.api_base_url=http://localhost:8081/v1 \
  transcription.batch_size=4

# General-purpose LLM (auto-switches to chat mode with ASR system prompt)
lai transcribe run audio.wav output.srt \
  transcription.model_name=google/gemma-3n-E4B-it \
  transcription.api_base_url=http://localhost:8084/v1 \
  transcription.language=zh

# Voxtral Realtime (streaming WebSocket, <500ms latency)
# Server: VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
#   --host 0.0.0.0 --port 8086 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
lai transcribe run audio.wav output.srt \
  transcription.model_name=mistralai/Voxtral-Mini-4B-Realtime-2602 \
  transcription.api_base_url=http://localhost:8086/v1 \
  transcription.api_mode=realtime
```

---

## Roadmap

Visit lattifai.com/roadmap (https://lattifai.com/roadmap) for updates.

| Date | Release | Features |
|------|---------|----------|
| **Oct 2025** | Lattice-1-Alpha | ✅ English forced alignment, multi-format support |
| **Nov 2025** | Lattice-1 | ✅ EN+ZH+DE, speaker diarization, multi-model transcription |
| **Q2 2026** | Lattice-2 | ✅ Streaming mode, 🔮 40+ languages, real-time alignment |

---

## Development

```bash
git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python

# Using uv (recommended, auto-configures extra index)
uv sync && source .venv/bin/activate

# Or pip (requires extra-index-url for lattifai-core)
pip install -e ".[all,dev]" --extra-index-url https://lattifai.github.io/pypi/simple/

# Run tests
pytest

# Install pre-commit hooks
pre-commit install
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make changes and add tests
4. Run `pytest` and `pre-commit run --all-files`
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

---

## Support

- Issues: GitHub Issues (https://github.com/lattifai/lattifai-python/issues)
- Discord: Join our community (https://discord.gg/kvF4WsBRK8)

## License

Apache License 2.0

Similar Articles

@XieZhifei14110: Stop using Whisper for ASR! Open-sourcing Mega-ASR, the first full-scenario SOTA industrial-grade ASR model, built fo…

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

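@uniswap12: Microsoft has open-sourced a speech AI that transcribes 60 minutes of audio in one pass and can handle 4 people talking at once. VibeVoice, open-sourced by Microsoft, 24.8k stars; I only found out about it today. For one-click recording-to-text I had always used Whisper, but it often times out on long meeting recordings, and multi-speaker recognition…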

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

@denziideng: Another AI voice cloning 'dimensional reduction attack'... The CosyVoice I shared before can clone in 3 seconds, which I thought was already scary enough. But today's tool is even more lethal — after casually recording 1 minute of my own voice for training, it directly replicates tone, mannerisms, emotions, breathing, and pauses. It's almost like the soul of the original person possessed it! C...

Microsoft has open-sourced VibeVoice, a speech AI framework that supports one-pass transcription of 60-minute audio, multi-speaker separation, and timestamp annotation, and also provides multi-role TTS synthesis. It is built on Qwen2.5, comes with a lightweight 0.5B real-time version, and has 24.8k stars on GitHub.