@FeitengLi: 其实这些问题都能很好的解决了 1. 扔掉 whisper,换 ASR 模型,Qwen3-ASR 就很不错幻觉很少、也有一些别的ASR选择,whisper 幻觉多也要求 30s片段,Qwen3-ASR 塞更长的音频识别越准确,最大支持 20…
摘要
推荐使用Qwen3-ASR替代Whisper以减少幻觉,使用LattifAI工具进行精确的音文本对齐和字幕生成,并介绍自己的OmniVAD-Kit项目用于语音活动检测。
查看缓存全文
缓存时间: 2026/05/16 03:10
其实这些问题都能很好的解决了 1. 扔掉 whisper,换 ASR 模型,Qwen3-ASR 就很不错幻觉很少、也有一些别的ASR选择,whisper 幻觉多也要求 30s片段,Qwen3-ASR 塞更长的音频识别越准确,最大支持 20 分钟; 2. 文字时间轴 也扔掉 whisper 不是很准, 虽然 Qwen/Qwen3-ForcedAligner-0.6B 也能用,但实际测试超过 180s 就时间轴就混乱不堪,可以用 @LattifAI_HQ https://github.com/lattifai/lattifai-python… 4 小时都轻松准确,https://lattifai.com/zh/podcasts/PoJ1vTdHpks… 可以到这里看看 卡拉 OK 字幕都十分准确,也有 skill https://github.com/lattifai/lattifai-skills.git… speaker diarization 和 naming 也解决的很好了 3. VAD 切片我推荐自己的项目 https://github.com/lifeiteng/OmniVAD-Kit… 准确性 Top
lattifai/lattifai-python
Source: https://github.com/lattifai/lattifai-python
🌐 Official Website | 🖥️ GitHub | 🤗 Model | 📑 Blog |
LattifAI: Precision Alignment, Infinite Possibilities
Advanced forced alignment and subtitle generation powered by 🤗 Lattice-1 model.
Table of Contents
- Features
- Installation
- Quick Start
- CLI Reference
- Python SDK
- Advanced Features
- Text Processing
- Supported Formats & Languages
- Roadmap
- Development
Features
| Feature | Description |
|---|---|
| Forced Alignment | Word-level and segment-level audio-text synchronization powered by Lattice-1 |
| Multi-Model Transcription | Gemini, Parakeet, SenseVoice, Fun-ASR, Qwen3-ASR, Whisper, and any vLLM/SGLang-served model |
| Speaker Diarization | Multi-speaker identification with label preservation |
| Caption Translation | LLM-powered translation with terminology consistency and bilingual output |
| Streaming Mode | Process audio up to 20 hours with minimal memory |
| Universal Format Support | 30+ caption/subtitle formats |
Alignment Models
| Model | Links | Languages | Description |
|---|---|---|---|
| Lattice-1 | 🤗 HF • 🤖 MS | English, Chinese, German | Production model with mixed-language alignment support |
| Lattice-1-Alpha | 🤗 HF • 🤖 MS | English | Initial release with English forced alignment |
Model Hub: Models can be downloaded from huggingface (default) or modelscope (recommended for users in China):
# Use ModelScope (faster in China)
lai alignment align audio.wav caption.srt output.srt alignment.model_hub=modelscope
from lattifai.client import LattifAI
from lattifai.config import AlignmentConfig
client = LattifAI(alignment_config=AlignmentConfig(model_hub="modelscope"))
Installation
Requires Python 3.10 – 3.14
Using uv (Recommended)
uv is a fast Python package manager (10-100x faster than pip).
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
As a CLI tool (recommended for most users):
# Install globally — lai command available everywhere
uv tool install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/
# Quick test without installing
uvx --from lattifai --extra-index-url https://lattifai.github.io/pypi/simple/ lai --help
As a project dependency (for Python SDK usage):
# Add to an existing project
uv add "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/
Using pip
# Full installation (recommended)
pip install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/
Configure pip globally (optional, to avoid --extra-index-url each time):
# Add to ~/.pip/pip.conf (Linux/macOS) or %APPDATA%\pip\pip.ini (Windows)
[global]
extra-index-url = https://lattifai.github.io/pypi/simple/
Installation Options
| Extra | Includes |
|---|---|
| (base) | Forced alignment, Gemini transcription, YouTube, captions |
transcription | Local ASR models (Parakeet, SenseVoice, Fun-ASR) |
diarization | Speaker diarization (NeMo, pyannote) |
translation | LLM-powered caption translation (OpenAI-compatible) |
event | Audio event detection |
all | Base + transcription + diarization + translation + event |
Note: Base installation includes alignment, Gemini transcription, and YouTube. Use [all] for local ASR models and all optional features.
Caption Format Support
Caption/subtitle format parsing is provided by lattifai-captions, a separate package supporting 30+ formats (SRT, VTT, ASS, TTML, TextGrid, NLE formats, etc.). It is automatically installed with lattifai.
API Keys
LattifAI API Key (Required) - Get your free key at lattifai.com/dashboard/api-keys, or try instantly with lai auth trial.
Gemini API Key (Optional) - For transcription with Gemini models, get key at aistudio.google.com/apikey
Configuration Priority
Keys and URLs are resolved in this order (first match wins):
- Environment variable —
export LATTIFAI_API_KEY=lf_xxx - CLI session (
~/.lattifai/config.toml) — written bylai auth login/lai auth trial, device-bound obfuscated storage .envfile — auto-discovered from current directory upward
# Option 1: Environment variable
export LATTIFAI_API_KEY="lf_your_api_key_here"
export GEMINI_API_KEY="your_gemini_api_key_here"
# Option 2: CLI login (opens browser, stores key securely)
lai auth login
# Option 3: Free trial (no sign-up, 120 minutes)
lai auth trial
# Option 4: .env file in project root
cat > .env <<EOF
LATTIFAI_API_KEY=lf_your_api_key_here
LATTIFAI_BASE_URL=https://api.lattifai.com/v1
GEMINI_API_KEY=your_gemini_api_key_here
EOF
The same resolution order applies to LATTIFAI_BASE_URL and LATTIFAI_SITE_URL.
Quick Start
Command Line
# Align audio with subtitle
lai alignment align audio.wav subtitle.srt output.srt
# YouTube video
lai youtube align "https://youtube.com/watch?v=VIDEO_ID"
# Start local browser playground (4 tabs)
lai serve run
Python SDK
from lattifai.client import LattifAI
client = LattifAI()
caption = client.alignment(
input_media="audio.wav",
input_caption="subtitle.srt",
output_caption_path="aligned.srt",
)
CLI Reference
| Command | Description | Example |
|---|---|---|
lai alignment align | Align audio/video with caption | lai alignment align audio.wav caption.srt output.srt |
lai youtube align | Download & align YouTube | lai youtube align "https://youtube.com/watch?v=ID" |
lai transcribe run | Transcribe audio/video | lai transcribe run audio.wav output.srt |
lai transcribe align | Transcribe and align | lai transcribe align audio.wav output.srt |
lai translate caption | Translate captions | lai translate caption input.srt output.srt translation.target_lang=zh |
lai caption convert | Convert caption formats | lai caption convert input.srt output.vtt |
lai caption shift | Shift timestamps | lai caption shift input.srt output.srt 2.0 |
lai serve run | Start local web UI playground | lai serve run |
lai doctor | Run environment diagnostics | lai doctor |
lai update | Update to latest version | lai update or lai update --force |
lai config | Manage API keys & settings | lai config set lattifai_api_key lf_xxx |
Common Options
# Device selection
alignment.device=cuda # cuda, mps, cpu
# Caption options
caption.split_sentence=true # Smart sentence splitting
caption.word_level=true # Word-level timestamps
# Streaming for long audio
media.streaming_chunk_secs=300
# Channel selection
media.channel_selector=left # left, right, average, or index
Transcription Models
LattifAI supports a wide range of ASR models — from cloud APIs to local inference to self-hosted servers:
| Model | Type | Languages | Install Extra |
|---|---|---|---|
| Gemini 2.5 Pro/Flash | Cloud API | 100+ | (base) |
| NVIDIA Parakeet | Local | 24 (European) | [transcription] |
| SenseVoice | Local | zh, en, ja, ko, yue | [transcription] |
| Fun-ASR-Nano | Local | 31 (incl. zh dialects) | [transcription] |
| Fun-ASR-MLT-Nano | Local | 31 (incl. zh dialects) | [transcription] |
| Qwen3-ASR | Local / vLLM/SGLang | 52 (30 lang + 22 zh dialects) | [transcription] |
| Whisper | vLLM/SGLang | 99 | — |
| Voxtral | vLLM/SGLang | 13 (European) | — |
| Voxtral Realtime | vLLM (realtime) | 13 (European) | — |
| Gemma-3n | vLLM (chat) | 140+ | — ⚠️ |
⚠️ Gemma-3n is a general-purpose multimodal LLM, not a dedicated ASR model. It has a hard 30s audio encoder limit, ~3x higher WER than Whisper, and weaker multilingual transcription. Best suited for transcription + downstream understanding (summarization, translation) rather than pure ASR accuracy.
# Gemini (cloud API, requires GEMINI_API_KEY)
transcription.model_name=gemini-2.5-pro
# Local models (requires [transcription] extra)
transcription.model_name=nvidia/parakeet-tdt-0.6b-v3
transcription.model_name=iic/SenseVoiceSmall
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512
transcription.model_name=Qwen/Qwen3-ASR-1.7B
# vLLM/SGLang-served models (requires a running vLLM server)
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1
lai transcribe run
Transcribe audio/video files or YouTube URLs to generate timestamped captions.
# Local file
lai transcribe run audio.wav output.srt
# YouTube URL
lai transcribe run "https://youtube.com/watch?v=VIDEO_ID" output_dir=./output
# With model selection
lai transcribe run audio.wav output.srt \
transcription.model_name=gemini-2.5-pro \
transcription.device=cuda
Parameters:
input: Path to audio/video file or YouTube URLoutput_caption: Output caption file path (for local files)output_dir: Output directory (for YouTube URLs, defaults to current directory)channel_selector: Audio channel -average(default),left,right, or channel index
lai transcribe align
Transcribe and align in a single step - produces precisely aligned captions.
# Basic usage
lai transcribe align audio.wav output.srt
# With options
lai transcribe align audio.wav output.srt \
transcription.model_name=nvidia/parakeet-tdt-0.6b-v3 \
alignment.device=cuda \
caption.split_sentence=true \
caption.word_level=true
lai translate caption
Translate caption files to any target language using LLM providers (Gemini, OpenAI-compatible).
Three translation modes with increasing quality:
| Mode | Pipeline | LLM Calls | Use Case |
|---|---|---|---|
quick | Translate | ~1x | Quick draft, informal review |
normal | Analyze → Translate | ~2x | Default — terminology-consistent, context-aware |
refined | Analyze → Translate → Review → Revise | ~3x | Publication-quality professional subtitles |
What each stage does:
- Analyze (
normal/refined): Scans source text to identify domain, terminology, speaker style, and tone. Extracts a glossary of key terms with recommended translations, ensuring consistency across all segments (e.g., “forced alignment” → “强制对齐” everywhere). - Translate: Batch-translates segments with context windows (surrounding lines for coherence). In
quickmode, uses only the raw text. Innormal/refined, the translation prompt includes the analysis results and glossary. - Review (
refinedonly): A separate reviewer pass compares each translation against the original, checking for mistranslations, omissions, tone shifts, and glossary violations. Outputs per-segment critiques. - Revise (
refinedonly): Applies reviewer feedback to produce a polished final version. All intermediate artifacts (analysis, prompts, drafts, critiques, revisions) can be saved withsave_artifacts=true.
# Basic (default: normal mode, bilingual, target=zh)
lai translate caption input.srt output.srt
# Quick mode to English
lai translate caption input.srt output.srt \
translation.target_lang=en \
translation.mode=quick
# Refined mode with artifacts saved
lai translate caption input.srt output.srt \
translation.target_lang=ja \
translation.mode=refined \
translation.save_artifacts=true
# Bilingual output with translation on top
lai translate caption input.srt output.srt \
translation.target_lang=zh \
caption.translation_first=true
# OpenAI-compatible API (local or third-party)
lai translate caption input.srt output.srt \
translation.llm.provider=openai \
translation.llm.api_base_url=http://localhost:8000/v1 \
translation.llm.model=qwen3
# With custom glossary
lai translate caption input.srt output.srt \
translation.glossary_file=glossary.yaml
TranslationConfig Options:
| Option | Default | Description |
|---|---|---|
target_lang | zh | Target language code (see supported languages) |
source_lang | auto | Source language (auto-detected if not set) |
approach | rewrite | rewrite: natural expression, idiom adaptation; translate: accuracy, source fidelity |
mode | normal | Translation mode: quick, normal, refined |
bilingual | true | Output bilingual captions (original + translation) |
style | technical | Style hint: storytelling, formal, casual, technical |
llm.model | gemini-3-flash-preview | LLM model name |
llm.provider | gemini | LLM provider: gemini or openai |
llm.api_base_url | — | Base URL for OpenAI-compatible endpoint (vLLM, SGLang, Ollama) |
batch_size | 30 | Segments per API call |
max_concurrent | 5 | Max concurrent batch requests |
glossary_file | — | Path to custom glossary (YAML or Markdown) |
save_artifacts | false | Save intermediate files (analysis, prompts, critiques, revisions) |
Translation Language Support
55+ languages supported. Common codes:
| Region | Languages |
|---|---|
| East Asian | zh Chinese (Simplified), zh-TW Traditional, ja Japanese, ko Korean |
| South/SE Asian | hi Hindi, bn Bengali, th Thai, vi Vietnamese, id Indonesian, ms Malay |
| Western European | en English, es Spanish, fr French, de German, pt Portuguese, it Italian, nl Dutch |
| Northern European | sv Swedish, da Danish, no Norwegian, fi Finnish |
| Eastern European | ru Russian, uk Ukrainian, pl Polish, cs Czech, ro Romanian, hu Hungarian |
| Middle Eastern | ar Arabic, fa Persian, he Hebrew, tr Turkish |
Full list: lattifai.languages.SUPPORTED_LANGUAGES
Translation approach inspired by 宝玉’s AI translation methodology.
Python SDK
Configuration Objects
from lattifai.client import LattifAI
from lattifai.config import (
ClientConfig,
AlignmentConfig,
CaptionConfig,
CaptionInputConfig,
DiarizationConfig,
MediaConfig,
RenderConfig,
)
client = LattifAI(
client_config=ClientConfig(api_key="lf_xxx", timeout=60.0),
alignment_config=AlignmentConfig(device="cuda"),
caption_config=CaptionConfig(
input=CaptionInputConfig(split_sentence=True),
render=RenderConfig(word_level=True),
),
)
caption = client.alignment(
input_media="audio.wav",
input_caption="subtitle.srt",
output_caption_path="output.json",
)
# Access results
for segment in caption.supervisions:
print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
YouTube Processing
caption = client.youtube(
url="https://youtube.com/watch?v=VIDEO_ID",
output_dir="./downloads",
output_caption_path="aligned.srt",
)
CaptionConfig Options
| Sub-config | Option | Default | Description |
|---|---|---|---|
input | split_sentence | False | Smart sentence splitting, separates non-speech elements |
input | normalize_text | True | Clean HTML entities and special characters |
input | source_lang | None | Source language code (e.g., "en", "zh") |
render | word_level | False | Include word-level timestamps in output |
render | include_speaker_in_text | True | Include speaker labels in text output |
render | translation_first | False | Place translation above original in bilingual output |
ass | speaker_color | "" | Speaker name color in ASS output: "" (off), "auto" (10-color palette), "#RRGGBB", or comma-separated list |
from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, CaptionInputConfig, RenderConfig
client = LattifAI(
caption_config=CaptionConfig(
input=CaptionInputConfig(split_sentence=True, normalize_text=True),
render=RenderConfig(word_level=True, include_speaker_in_text=False),
)
)
Advanced Features
Streaming Mode (Long Audio)
Process audio up to 20 hours with minimal memory:
caption = client.alignment(
input_media="long_audio.wav",
input_caption="subtitle.srt",
streaming_chunk_secs=300.0, # 5-minute chunks
)
Word-Level Alignment
from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, RenderConfig
client = LattifAI(caption_config=CaptionConfig(render=RenderConfig(word_level=True)))
caption = client.alignment(
input_media="audio.wav",
input_caption="subtitle.srt",
output_caption_path="output.json", # JSON preserves word-level data
)
Speaker Diarization
Automatically identify and label different speakers in audio.
Capabilities:
- Multi-Speaker Detection: Automatically detect speaker changes
- Smart Labeling: Assign labels (SPEAKER_00, SPEAKER_01, etc.)
- Label Preservation: Maintain existing speaker names from input captions
- Gemini Integration: Extract speaker names from transcription context
Label Handling:
- Without existing labels → Generic labels (SPEAKER_00, SPEAKER_01)
- With existing labels (
[Alice],>> Bob:,SPEAKER_01:) → Preserved during alignment - Gemini transcription → Names extracted from context (e.g., “Hi, I’m Alice” →
Alice)
from lattifai.client import LattifAI
from lattifai.config import DiarizationConfig
client = LattifAI(
diarization_config=DiarizationConfig(
enabled=True,
device="cuda",
min_speakers=2,
max_speakers=4,
)
)
caption = client.alignment(...)
for segment in caption.supervisions:
print(f"[{segment.speaker}] {segment.text}")
LLM Speaker Name Inference:
When speakers remain as SPEAKER_XX after acoustic diarization, enable LLM inference to identify real names from dialogue content:
DiarizationConfig(
enabled=True,
infer_speakers=True, # Use LLM to infer speaker names
)
# Pass context as a per-call parameter to speaker_diarization()
client.speaker_diarization(
input_media=audio,
caption=caption,
output_caption_path="output.srt",
speaker_context="podcast, host is Alice, guest is Bob", # Optional hint
)
DiarizationConfig Options:
| Option | Default | Description |
|---|---|---|
enabled | False | Enable speaker diarization |
device | auto | cpu, cuda, mps, or auto |
num_speakers | — | Exact number of speakers (overrides min/max) |
min_speakers | — | Minimum speakers to detect |
max_speakers | — | Maximum speakers to detect |
infer_speakers | False | Use LLM to infer real names from dialogue |
CLI:
lai alignment align audio.wav subtitle.srt output.srt \
diarization.enabled=true \
diarization.device=cuda
# With LLM speaker name inference
lai alignment align audio.wav subtitle.srt output.srt \
diarization.enabled=true \
diarization.infer_speakers=true
# Diarize subcommand with speaker context
lai diarize run audio.wav subtitle.srt output.srt \
--context "interview with Dr. Smith"
Data Flow
Input Media → AudioLoader → Aligner → (Diarizer) → Caption
↑
Input Caption → Reader → Tokenizer
Text Processing
The tokenizer handles various text patterns for forced alignment.
Bracket/Caption Handling
Visual captions and annotations in brackets are treated specially - they get two pronunciation paths so the aligner can choose:
- Silence path - skip when content doesn’t appear in audio
- Inner text pronunciation - match if someone actually says the words
| Bracket Type | Symbol | Example | Alignment Behavior |
|---|---|---|---|
| Half-width square | [] | [APPLAUSE] | Skip or match “applause” |
| Half-width paren | () | (music) | Skip or match “music” |
| Full-width square | 【】 | 【笑声】 | Skip or match “笑声” |
| Full-width paren | () | (音乐) | Skip or match “音乐” |
| Angle brackets | <> | <intro> | Skip or match “intro” |
| Book title marks | 《》 | 《开场白》 | Skip or match “开场白” |
This allows proper handling of:
- Visual descriptions:
[Barret adjusts the camera and smiles]→ skipped if not spoken - Sound effects:
[APPLAUSE],(music)→ matched if audible - Chinese annotations:
【笑声】,(鼓掌)→ flexible alignment
Multilingual Text
| Pattern | Handling | Example |
|---|---|---|
| CJK characters | Split individually | 你好 → ["你", "好"] |
| Latin words | Grouped with accents | Kühlschrank → ["Kühlschrank"] |
| Contractions | Kept together | I'm, don't, we'll |
| Punctuation | Attached to words | Hello, world! |
Speaker Labels
Recognized speaker patterns are preserved during alignment:
| Format | Example | Output |
|---|---|---|
| Arrow prefix | >> Alice: or >> Alice: | [Alice] |
| LattifAI format | [SPEAKER_01]: | [SPEAKER_01] |
| Uppercase name | SPEAKER NAME: | [SPEAKER NAME] |
Supported Formats & Languages
Media Formats
| Type | Formats |
|---|---|
| Audio | WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, AIFF, and more |
| Video | MP4, MKV, MOV, WEBM, AVI, and more |
| Caption | SRT, VTT, ASS, SSA, SRV3, JSON, TextGrid, TSV, CSV, LRC, TTML, and more |
Note: Caption format handling is provided by lattifai-captions, which is automatically installed as a dependency. For standalone caption processing without alignment features, install
pip install lattifai-captions.
JSON Format
JSON is the most flexible format for storing caption data with full word-level timing support:
[
{
"text": "Hello beautiful world",
"start": 0.0,
"end": 2.5,
"speaker": "Speaker 1",
"words": [
{"word": "Hello", "start": 0.0, "end": 0.5},
{"word": "beautiful", "start": 0.6, "end": 1.4},
{"word": "world", "start": 1.5, "end": 2.5}
]
}
]
Features:
- Word-level timestamps preserved in
wordsarray - Round-trip compatible (read/write without data loss)
- Optional
speakerfield for multi-speaker content
Word-Level and Karaoke Output
| Format | word_level=True | word_level=True + karaoke_effect |
|---|---|---|
| JSON | Includes words array | Same as word_level=True |
| SRT | One word per segment | One word per segment |
| VTT | One word per segment | YouTube VTT style: <00:00:00.000><c> word</c> |
| ASS | One word per segment | {\kf} karaoke tags (sweep effect) |
| LRC | One word per line | Enhanced <timestamp> tags |
| TTML | One word per <p> element | <span> with itunes:timing="Word" |
Speaker Colors
The speaker_color option colorizes speaker names in ASS output (works with both karaoke and non-karaoke modes):
| Value | Behavior |
|---|---|
"" (default) | No speaker coloring |
"auto" | Assigns from a built-in 10-color palette |
"#RRGGBB" | Single color for all speakers |
"#RRGGBB,#00BFFF,..." | Comma-separated list, one per speaker (cycles if more speakers than colors) |

# Auto-color speakers in ASS output
lai caption convert input.json output.ass \
render.include_speaker_in_text=true \
ass.speaker_color=auto
# Custom single color
lai caption convert input.json output.ass \
render.include_speaker_in_text=true \
ass.speaker_color="#1387C0"
Karaoke Color Schemes
Use ass.karaoke_color_scheme to apply a predefined color scheme for karaoke ASS output. Each scheme sets primary_color, secondary_color, outline_color, and back_color.
12 schemes available: azure-gold, sakura-purple, mint-ocean, gardenia-green, sunset-warm, prussian-elegant, burgundy-classic, langgan-spring, mars-teal, spring-field, navy-pink, apricot-dark

# Karaoke with color scheme + auto speaker colors
lai caption convert input.json output.ass \
ass.karaoke_effect=sweep \
ass.karaoke_color_scheme=azure-gold \
ass.speaker_color=auto
VTT Format (YouTube VTT Support)
The VTT format handler supports both standard WebVTT and YouTube VTT with word-level timestamps.
Reading: VTT automatically detects YouTube VTT format (with <timestamp><c> tags) and extracts word-level alignment data:
WEBVTT
00:00:00.000 --> 00:00:02.000
<00:00:00.000><c> Hello</c><00:00:00.500><c> world</c>
Writing: Use render.word_level=True to output YouTube VTT style with word timestamps:
from lattifai.data import Caption
from lattifai.caption.config import ASSConfig, RenderConfig
caption = Caption.read("input.vtt")
caption.write(
"output.ass",
format_config=ASSConfig(karaoke_effect="sweep"),
render=RenderConfig(word_level=True),
)
# CLI: Convert to VTT with word-level timestamps
lai caption convert input.json output.vtt \
render.word_level=true
Transcription Language Support
Gemini Models (100+ Languages)
Models: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-3-pro-preview, gemini-3-flash-preview, gemini-3.1-pro-preview
English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, and 70+ more.
Requires Gemini API key from Google AI Studio
NVIDIA Parakeet (24 European Languages)
Model: nvidia/parakeet-tdt-0.6b-v3
| Region | Languages |
|---|---|
| Western Europe | English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl) |
| Nordic | Danish (da), Swedish (sv), Norwegian (no), Finnish (fi) |
| Eastern Europe | Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru) |
| Others | Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el) |
Alibaba SenseVoice (5 Asian Languages)
Model: iic/SenseVoiceSmall
Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)
FunAudioLLM Fun-ASR-Nano (31 Languages)
Models: FunAudioLLM/Fun-ASR-Nano-2512, FunAudioLLM/Fun-ASR-MLT-Nano-2512
800M parameter end-to-end ASR model from Tongyi Lab, excelling at far-field, high-noise, dialect/accent, and music lyric recognition.
| Region | Languages |
|---|---|
| East Asia | Chinese (+ 7 dialects, 26 accents), Japanese, Korean, Cantonese |
| Southeast Asia | Vietnamese, Indonesian, Thai, Malay, Filipino |
| South Asia | Hindi |
| Middle East | Arabic |
| Europe | English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish |
# Use ModelScope (default for China)
lai transcribe run audio.wav output.srt \
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
transcription.model_hub=modelscope
# Use HuggingFace
lai transcribe run audio.wav output.srt \
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
transcription.model_hub=huggingface
vLLM/SGLang (Any ASR Model)
Any ASR model served via vLLM or SGLang with an OpenAI-compatible API.
Supported models and limitations:
| Model | Audio tok/s | Max Audio | API Mode | Batch | Notes |
|---|---|---|---|---|---|
| Qwen3-ASR (0.6B/1.7B) | 25 | auto | transcriptions | Yes | Best for zh/en/ja/ko |
| Whisper | 50 | 30s | transcriptions | Yes | Fixed 30s context window |
| Voxtral | 12.5 | auto | transcriptions | Yes | European languages |
| Voxtral Realtime | 12.5 | auto | realtime | Yes | WebSocket, <500ms latency |
| Ultravox | 6.25 | auto | transcriptions | Yes | Confirmed in vLLM source |
| Gemma-3n | 6.25 | 30s | chat (auto) | No | Not a dedicated ASR model (~3x Whisper WER), 30s encoder limit, no concurrent requests |
- Max Audio: “auto” = estimated from
max_model_len; bold values are hard encoder limits - Batch: Whether
batch_size>1concurrent requests are supported - API Mode:
transcriptionsis the default; general-purpose LLMs auto-switch tochat
API modes:
| Mode | Endpoint | Use Case |
|---|---|---|
transcriptions (default) | /v1/audio/transcriptions | Dedicated ASR models (Qwen3-ASR, Whisper, GLM-ASR, etc.) |
chat | /v1/chat/completions | General-purpose LLMs (Gemma-3n, etc.) — auto-selected for non-ASR models |
realtime | /v1/realtime (WebSocket) | Voxtral Realtime |
# 1. Install vLLM with audio support (requires CUDA GPU)
pip install vllm "vllm[audio]"
# 2. Start vLLM server on a Linux GPU machine (auto-downloads the model)
vllm serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8081
# Other models:
# vllm serve openai/whisper-large-v3-turbo
# vllm serve google/gemma-3n-E4B-it --max-model-len 32000 --enforce-eager
# 3. Transcribe (default: transcriptions mode)
lai transcribe run audio.wav output.srt \
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1
# Batch mode for faster processing (4 concurrent requests)
lai transcribe run audio.wav output.srt \
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1 \
transcription.batch_size=4
# General-purpose LLM (auto-switches to chat mode with ASR system prompt)
lai transcribe run audio.wav output.srt \
transcription.model_name=google/gemma-3n-E4B-it \
transcription.api_base_url=http://localhost:8084/v1 \
transcription.language=zh
# Voxtral Realtime (streaming WebSocket, <500ms latency)
# Server: VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
# --host 0.0.0.0 --port 8086 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
lai transcribe run audio.wav output.srt \
transcription.model_name=mistralai/Voxtral-Mini-4B-Realtime-2602 \
transcription.api_base_url=http://localhost:8086/v1 \
transcription.api_mode=realtime
Roadmap
Visit lattifai.com/roadmap for updates.
| Date | Release | Features |
|---|---|---|
| Oct 2025 | Lattice-1-Alpha | ✅ English forced alignment, multi-format support |
| Nov 2025 | Lattice-1 | ✅ EN+ZH+DE, speaker diarization, multi-model transcription |
| Q2 2026 | Lattice-2 | ✅ Streaming mode, 🔮 40+ languages, real-time alignment |
Development
git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python
# Using uv (recommended, auto-configures extra index)
uv sync && source .venv/bin/activate
# Or pip (requires extra-index-url for lattifai-core)
pip install -e ".[all,dev]" --extra-index-url https://lattifai.github.io/pypi/simple/
# Run tests
pytest
# Install pre-commit hooks
pre-commit install
Contributing
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Make changes and add tests
- Run
pytestandpre-commit run --all-files - Commit your changes (
git commit -m 'Add amazing feature') - Push to branch (
git push origin feature/amazing-feature) - Open a Pull Request
Support
- Issues: GitHub Issues
- Discord: Join our community
License
Apache License 2.0
相似文章
@MaxForAI: 如果你在做语音Agent,你应该试一下这个项目 来自南洋理工、新国立和上海 AI Lab的团队发布了:Mega-ASR 这个完全开源的ASR基于 Qwen3-ASR构建,目的是打破长期困扰ASR的在嘈杂、混响或其他受损现实环境中表现的瓶颈…
南洋理工、新国立和上海 AI Lab 联合发布 Mega-ASR,一个基于 Qwen3-ASR 构建的完全开源 ASR 模型,通过 Voices-in-the-Wild-2M 数据集和渐进式声学到语义优化,在真实世界嘈杂环境中实现最高 30% 的相对词错误率下降,且仅 1.7B 参数可在消费级硬件高效推理。
@XieZhifei14110: 别再使用Whisper做语音识别了!开源Mega-ASR——首个全场景SOTA工业级ASR模型,专为……
开源Mega-ASR,一个全场景SOTA工业级ASR模型,专为远场、噪声等复杂音频环境设计,在真实世界基准测试中比现有开源和闭源模型性能高出10-30%。
@denziideng: 又发现一个AI语音克隆“降维打击”…… 之前分享的 CosyVoice 3秒可克隆,觉得已经够吓人了,结果今天这个更要命,随便录了1分钟自己的声音训练后,它直接把声线、语气、情感、呼吸、停顿全部复刻,简直像本人灵魂附体! 阿里达摩院的 C…
GPT-SoVITS 是一款开源 AI 语音克隆工具,支持零样本(5秒声音)和少样本(1分钟训练)高保真声音克隆,跨语言推理,并自带完整 WebUI 工具链,在 GitHub 上已获 57.8k 星,成为语音克隆领域的领先开源项目。
转录儿童语音:ASR性能与获取可靠的正字法转写
这篇论文评估了九种ASR模型(Whisper、Parakeet、Wav2Vec2)在荷兰语儿童语音数据集JASMIN和DART上的表现,发现微调后的Whisper-medium取得了最佳性能(在JASMIN上WER为5.54%,在DART上为70.37%)。它还提出了一种选择方法,能够以高精度自动识别发音正确的录音片段,从而减少人工验证的需求。
@uniswap12: 微软开源了一个语音 AI,60 分钟长音频一次转写,4 个人同时说话都能搞定 VibeVoice,微软开源,24.8k star,今天才知道这个。录音一键转文字这件事,我之前一直用 Whisper,但它处理长会议录音经常超时,多人说话识别…
微软开源了语音AI框架VibeVoice,支持60分钟长音频一次性转写、多说话人分离和时间戳标注,同时提供多角色TTS合成能力,底层基于Qwen2.5并配有0.5B轻量实时版本,已在GitHub获得24.8k星标。