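@FeitengLi: These problems can all be solved well now. 1. Drop Whisper and switch ASR models: Qwen3-ASR is quite good and hallucinates very little, and there are other ASR options too. Whisper hallucinates a lot and requires 30 s segments, whereas Qwen3-ASR gets more accurate the longer the audio it is given, up to a maximum of 20…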

X AI KOLs Timeline 2026/05/15 10:43 Tools

asr speech-recognition forced-alignment vad open-source qwen3-asr lattifai

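Summary

Recommends replacing Whisper with Qwen3-ASR to reduce hallucinations, using the LattifAI tools for precise audio-text alignment and subtitle generation, and introduces the author's own OmniVAD-Kit project for voice activity detection.

These problems can all be solved well now.
1. Drop Whisper and switch ASR models. Qwen3-ASR is quite good and hallucinates very little, and there are other ASR options as well. Whisper hallucinates a lot and also requires 30 s segments, whereas Qwen3-ASR becomes more accurate the longer the audio it is given, up to a maximum of 20 minutes.
2. For text timestamps, drop Whisper as well, since it is not very accurate. Qwen/Qwen3-ForcedAligner-0.6B also works, but in practical tests the timeline falls apart once the audio exceeds 180 s. Use @LattifAI_HQ https://github.com/lattifai/lattifai-python… instead, which handles 4 hours easily and accurately. See https://lattifai.com/zh/podcasts/PoJ1vTdHpks… where even the karaoke subtitles are very accurate. There is also a skill at https://github.com/lattifai/lattifai-skills.git… and speaker diarization and naming are well solved too.
3. For VAD slicing I recommend my own project https://github.com/lifeiteng/OmniVAD-Kit… with top accuracy.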

Cached at: 2026/05/16 03:10

lattifai/lattifai-python

Source: https://github.com/lattifai/lattifai-python

LattifAI: Precision Alignment, Infinite Possibilities

Advanced forced alignment and subtitle generation powered by 🤗 Lattice-1 model.

Features
Installation
Quick Start
CLI Reference
- Translation
Python SDK
Advanced Features
Text Processing
Supported Formats & Languages
Roadmap
Development

Features

Feature	Description
Forced Alignment	Word-level and segment-level audio-text synchronization powered by Lattice-1
Multi-Model Transcription	Gemini, Parakeet, SenseVoice, Fun-ASR, Qwen3-ASR, Whisper, and any vLLM/SGLang-served model
Speaker Diarization	Multi-speaker identification with label preservation
Caption Translation	LLM-powered translation with terminology consistency and bilingual output
Streaming Mode	Process audio up to 20 hours with minimal memory
Universal Format Support	30+ caption/subtitle formats

Alignment Models

Model	Links	Languages	Description
Lattice-1	🤗 HF • 🤖 MS	English, Chinese, German	Production model with mixed-language alignment support
Lattice-1-Alpha	🤗 HF • 🤖 MS	English	Initial release with English forced alignment

Model Hub: Models can be downloaded from huggingface (default) or modelscope (recommended for users in China):

# Use ModelScope (faster in China)
lai alignment align audio.wav caption.srt output.srt alignment.model_hub=modelscope

from lattifai.client import LattifAI
from lattifai.config import AlignmentConfig

client = LattifAI(alignment_config=AlignmentConfig(model_hub="modelscope"))

Installation

Requires Python 3.10 – 3.14

Using uv (Recommended)

uv is a fast Python package manager (10-100x faster than pip).

# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

As a CLI tool (recommended for most users):

# Install globally — lai command available everywhere
uv tool install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/

# Quick test without installing
uvx --from lattifai --extra-index-url https://lattifai.github.io/pypi/simple/ lai --help

As a project dependency (for Python SDK usage):

# Add to an existing project
uv add "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/

Using pip

# Full installation (recommended)
pip install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/

Configure pip globally (optional, to avoid --extra-index-url each time):

# Add to ~/.pip/pip.conf (Linux/macOS) or %APPDATA%\pip\pip.ini (Windows)
[global]
extra-index-url = https://lattifai.github.io/pypi/simple/

Installation Options

Extra	Includes
(base)	Forced alignment, Gemini transcription, YouTube, captions
`transcription`	Local ASR models (Parakeet, SenseVoice, Fun-ASR)
`diarization`	Speaker diarization (NeMo, pyannote)
`translation`	LLM-powered caption translation (OpenAI-compatible)
`event`	Audio event detection
`all`	Base + transcription + diarization + translation + event

Note: Base installation includes alignment, Gemini transcription, and YouTube. Use [all] for local ASR models and all optional features.

Caption Format Support

Caption/subtitle format parsing is provided by lattifai-captions, a separate package supporting 30+ formats (SRT, VTT, ASS, TTML, TextGrid, NLE formats, etc.). It is automatically installed with lattifai.

API Keys

LattifAI API Key (Required) - Get your free key at lattifai.com/dashboard/api-keys, or try instantly with lai auth trial.

Gemini API Key (Optional) - For transcription with Gemini models, get key at aistudio.google.com/apikey

Configuration Priority

Keys and URLs are resolved in this order (first match wins):

1. Environment variable: export LATTIFAI_API_KEY=lf_xxx
2. CLI session (~/.lattifai/config.toml): written by lai auth login / lai auth trial, device-bound obfuscated storage
3. .env file: auto-discovered from current directory upward

# Option 1: Environment variable
export LATTIFAI_API_KEY="lf_your_api_key_here"
export GEMINI_API_KEY="your_gemini_api_key_here"

# Option 2: CLI login (opens browser, stores key securely)
lai auth login

# Option 3: Free trial (no sign-up, 120 minutes)
lai auth trial

# Option 4: .env file in project root
cat > .env <<EOF
LATTIFAI_API_KEY=lf_your_api_key_here
LATTIFAI_BASE_URL=https://api.lattifai.com/v1
GEMINI_API_KEY=your_gemini_api_key_here
EOF

The same resolution order applies to LATTIFAI_BASE_URL and LATTIFAI_SITE_URL.
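
The "first match wins" order can be sketched as a lookup over three sources. This is a minimal illustration, not LattifAI's actual loader: the function name `resolve_setting`, the plain dictionaries standing in for the environment, the CLI session file, and the .env file, and the choice to treat empty strings as unset are all assumptions.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    env: Mapping[str, str],
    cli_session: Mapping[str, str],
    dotenv: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty value for `name`, checking sources in priority order."""
    for source in (env, cli_session, dotenv):  # first match wins
        value = source.get(name)
        if value:  # assumption: an empty string counts as "not set"
            return value
    return None


# Placeholder values standing in for the three sources.
env = {"LATTIFAI_API_KEY": "lf_from_env"}
session = {
    "LATTIFAI_API_KEY": "lf_from_login",
    "LATTIFAI_BASE_URL": "https://api.lattifai.com/v1",
}
dotenv = {"LATTIFAI_API_KEY": "lf_from_dotenv"}

print(resolve_setting("LATTIFAI_API_KEY", env, session, dotenv))
print(resolve_setting("LATTIFAI_BASE_URL", env, session, dotenv))
```

Here the key comes from the environment even though the other two sources also define it, while the base URL falls through to the CLI session because the environment does not set it.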

Quick Start

Command Line

# Align audio with subtitle
lai alignment align audio.wav subtitle.srt output.srt

# YouTube video
lai youtube align "https://youtube.com/watch?v=VIDEO_ID"

# Start local browser playground (4 tabs)
lai serve run

Python SDK

from lattifai.client import LattifAI

client = LattifAI()
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="aligned.srt",
)

CLI Reference

Command	Description	Example
`lai alignment align`	Align audio/video with caption	`lai alignment align audio.wav caption.srt output.srt`
`lai youtube align`	Download & align YouTube	`lai youtube align "https://youtube.com/watch?v=ID"`
`lai transcribe run`	Transcribe audio/video	`lai transcribe run audio.wav output.srt`
`lai transcribe align`	Transcribe and align	`lai transcribe align audio.wav output.srt`
`lai translate caption`	Translate captions	`lai translate caption input.srt output.srt translation.target_lang=zh`
`lai caption convert`	Convert caption formats	`lai caption convert input.srt output.vtt`
`lai caption shift`	Shift timestamps	`lai caption shift input.srt output.srt 2.0`
`lai serve run`	Start local web UI playground	`lai serve run`
`lai doctor`	Run environment diagnostics	`lai doctor`
`lai update`	Update to latest version	`lai update` or `lai update --force`
`lai config`	Manage API keys & settings	`lai config set lattifai_api_key lf_xxx`

Common Options

# Device selection
alignment.device=cuda          # cuda, mps, cpu

# Caption options
caption.split_sentence=true    # Smart sentence splitting
caption.word_level=true        # Word-level timestamps

# Streaming for long audio
media.streaming_chunk_secs=300

# Channel selection
media.channel_selector=left    # left, right, average, or index
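
The dotted `section.option=value` overrides above map naturally onto nested configuration. The sketch below shows one way such arguments could be folded into a nested dictionary; `parse_overrides` and `_coerce` are hypothetical helpers written for illustration and say nothing about how the `lai` CLI parses its arguments internally.

```python
from typing import Any, Dict, List


def _coerce(raw: str) -> Any:
    """Turn 'true'/'false' into bool and numeric strings into numbers."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Fold 'a.b=value' strings into a nested dictionary."""
    config: Dict[str, Any] = {}
    for arg in args:
        dotted_key, _, raw = arg.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = _coerce(raw)
    return config


overrides = parse_overrides([
    "alignment.device=cuda",
    "caption.word_level=true",
    "media.streaming_chunk_secs=300",
])
print(overrides)
```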

Transcription Models

LattifAI supports a wide range of ASR models — from cloud APIs to local inference to self-hosted servers:

Model	Type	Languages	Install Extra
Gemini 2.5 Pro/Flash	Cloud API	100+	(base)
NVIDIA Parakeet	Local	24 (European)	`[transcription]`
SenseVoice	Local	zh, en, ja, ko, yue	`[transcription]`
Fun-ASR-Nano	Local	31 (incl. zh dialects)	`[transcription]`
Fun-ASR-MLT-Nano	Local	31 (incl. zh dialects)	`[transcription]`
Qwen3-ASR	Local / vLLM/SGLang	52 (30 lang + 22 zh dialects)	`[transcription]`
Whisper	vLLM/SGLang	99	—
Voxtral	vLLM/SGLang	13 (European)	—
Voxtral Realtime	vLLM (realtime)	13 (European)	—
Gemma-3n	vLLM (chat)	140+	— ⚠️

⚠️ Gemma-3n is a general-purpose multimodal LLM, not a dedicated ASR model. It has a hard 30s audio encoder limit, ~3x higher WER than Whisper, and weaker multilingual transcription. Best suited for transcription + downstream understanding (summarization, translation) rather than pure ASR accuracy.

# Gemini (cloud API, requires GEMINI_API_KEY)
transcription.model_name=gemini-2.5-pro

# Local models (requires [transcription] extra)
transcription.model_name=nvidia/parakeet-tdt-0.6b-v3
transcription.model_name=iic/SenseVoiceSmall
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512
transcription.model_name=Qwen/Qwen3-ASR-1.7B

# vLLM/SGLang-served models (requires a running vLLM server)
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
    transcription.api_base_url=http://localhost:8081/v1

lai transcribe run

Transcribe audio/video files or YouTube URLs to generate timestamped captions.

# Local file
lai transcribe run audio.wav output.srt

# YouTube URL
lai transcribe run "https://youtube.com/watch?v=VIDEO_ID" output_dir=./output

# With model selection
lai transcribe run audio.wav output.srt \
    transcription.model_name=gemini-2.5-pro \
    transcription.device=cuda

Parameters:

input: Path to audio/video file or YouTube URL
output_caption: Output caption file path (for local files)
output_dir: Output directory (for YouTube URLs, defaults to current directory)
channel_selector: Audio channel - average (default), left, right, or channel index

lai transcribe align

Transcribe and align in a single step - produces precisely aligned captions.

# Basic usage
lai transcribe align audio.wav output.srt

# With options
lai transcribe align audio.wav output.srt \
    transcription.model_name=nvidia/parakeet-tdt-0.6b-v3 \
    alignment.device=cuda \
    caption.split_sentence=true \
    caption.word_level=true

lai translate caption

Translate caption files to any target language using LLM providers (Gemini, OpenAI-compatible).

Three translation modes with increasing quality:

Mode	Pipeline	LLM Calls	Use Case
`quick`	Translate	~1x	Quick draft, informal review
`normal`	Analyze → Translate	~2x	Default — terminology-consistent, context-aware
`refined`	Analyze → Translate → Review → Revise	~3x	Publication-quality professional subtitles

What each stage does:

Analyze (normal/refined): Scans source text to identify domain, terminology, speaker style, and tone. Extracts a glossary of key terms with recommended translations, ensuring consistency across all segments (e.g., “forced alignment” → “强制对齐” everywhere).
Translate: Batch-translates segments with context windows (surrounding lines for coherence). In quick mode, uses only the raw text. In normal/refined, the translation prompt includes the analysis results and glossary.
Review (refined only): A separate reviewer pass compares each translation against the original, checking for mistranslations, omissions, tone shifts, and glossary violations. Outputs per-segment critiques.
Revise (refined only): Applies reviewer feedback to produce a polished final version. All intermediate artifacts (analysis, prompts, drafts, critiques, revisions) can be saved with save_artifacts=true.
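
The relationship between modes and stages can be written down as a small lookup, together with the batching implied by the `batch_size` option (default 30). This is only a sketch of the structure described above: `STAGES` and `plan_batches` are illustrative names, and the real pipeline also passes context windows and glossary data that are not modelled here.

```python
from typing import Dict, List

# Stage order per mode, as listed in the mode table above.
STAGES: Dict[str, List[str]] = {
    "quick": ["translate"],
    "normal": ["analyze", "translate"],
    "refined": ["analyze", "translate", "review", "revise"],
}


def plan_batches(num_segments: int, batch_size: int = 30) -> List[range]:
    """Split segment indices into consecutive batches of at most batch_size."""
    return [
        range(start, min(start + batch_size, num_segments))
        for start in range(0, num_segments, batch_size)
    ]


for mode, stages in STAGES.items():
    print(mode, "->", " -> ".join(stages))

# 65 segments with the default batch size give batches of 30, 30, and 5.
print([len(batch) for batch in plan_batches(65)])
```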

# Basic (default: normal mode, bilingual, target=zh)
lai translate caption input.srt output.srt

# Quick mode to English
lai translate caption input.srt output.srt \
    translation.target_lang=en \
    translation.mode=quick

# Refined mode with artifacts saved
lai translate caption input.srt output.srt \
    translation.target_lang=ja \
    translation.mode=refined \
    translation.save_artifacts=true

# Bilingual output with translation on top
lai translate caption input.srt output.srt \
    translation.target_lang=zh \
    caption.translation_first=true

# OpenAI-compatible API (local or third-party)
lai translate caption input.srt output.srt \
    translation.llm.provider=openai \
    translation.llm.api_base_url=http://localhost:8000/v1 \
    translation.llm.model=qwen3

# With custom glossary
lai translate caption input.srt output.srt \
    translation.glossary_file=glossary.yaml

TranslationConfig Options:

Option	Default	Description
`target_lang`	`zh`	Target language code (see supported languages)
`source_lang`	auto	Source language (auto-detected if not set)
`approach`	`rewrite`	`rewrite`: natural expression, idiom adaptation; `translate`: accuracy, source fidelity
`mode`	`normal`	Translation mode: `quick`, `normal`, `refined`
`bilingual`	`true`	Output bilingual captions (original + translation)
`style`	`technical`	Style hint: `storytelling`, `formal`, `casual`, `technical`
`llm.model`	`gemini-3-flash-preview`	LLM model name
`llm.provider`	`gemini`	LLM provider: `gemini` or `openai`
`llm.api_base_url`	—	Base URL for OpenAI-compatible endpoint (vLLM, SGLang, Ollama)
`batch_size`	`30`	Segments per API call
`max_concurrent`	`5`	Max concurrent batch requests
`glossary_file`	—	Path to custom glossary (YAML or Markdown)
`save_artifacts`	`false`	Save intermediate files (analysis, prompts, critiques, revisions)

Translation Language Support

55+ languages supported. Common codes:

Region	Languages
East Asian	`zh` Chinese (Simplified), `zh-TW` Traditional, `ja` Japanese, `ko` Korean
South/SE Asian	`hi` Hindi, `bn` Bengali, `th` Thai, `vi` Vietnamese, `id` Indonesian, `ms` Malay
Western European	`en` English, `es` Spanish, `fr` French, `de` German, `pt` Portuguese, `it` Italian, `nl` Dutch
Northern European	`sv` Swedish, `da` Danish, `no` Norwegian, `fi` Finnish
Eastern European	`ru` Russian, `uk` Ukrainian, `pl` Polish, `cs` Czech, `ro` Romanian, `hu` Hungarian
Middle Eastern	`ar` Arabic, `fa` Persian, `he` Hebrew, `tr` Turkish

Full list: lattifai.languages.SUPPORTED_LANGUAGES

Translation approach inspired by 宝玉’s AI translation methodology.

Python SDK

Configuration Objects

from lattifai.client import LattifAI
from lattifai.config import (
    ClientConfig,
    AlignmentConfig,
    CaptionConfig,
    CaptionInputConfig,
    DiarizationConfig,
    MediaConfig,
    RenderConfig,
)

client = LattifAI(
    client_config=ClientConfig(api_key="lf_xxx", timeout=60.0),
    alignment_config=AlignmentConfig(device="cuda"),
    caption_config=CaptionConfig(
        input=CaptionInputConfig(split_sentence=True),
        render=RenderConfig(word_level=True),
    ),
)

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",
)

# Access results
for segment in caption.supervisions:
    print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")

YouTube Processing

caption = client.youtube(
    url="https://youtube.com/watch?v=VIDEO_ID",
    output_dir="./downloads",
    output_caption_path="aligned.srt",
)

CaptionConfig Options

Sub-config	Option	Default	Description
`input`	`split_sentence`	`False`	Smart sentence splitting, separates non-speech elements
`input`	`normalize_text`	`True`	Clean HTML entities and special characters
`input`	`source_lang`	`None`	Source language code (e.g., `"en"`, `"zh"`)
`render`	`word_level`	`False`	Include word-level timestamps in output
`render`	`include_speaker_in_text`	`True`	Include speaker labels in text output
`render`	`translation_first`	`False`	Place translation above original in bilingual output
`ass`	`speaker_color`	`""`	Speaker name color in ASS output: `""` (off), `"auto"` (10-color palette), `"#RRGGBB"`, or comma-separated list

from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, CaptionInputConfig, RenderConfig

client = LattifAI(
    caption_config=CaptionConfig(
        input=CaptionInputConfig(split_sentence=True, normalize_text=True),
        render=RenderConfig(word_level=True, include_speaker_in_text=False),
    )
)

Advanced Features

Streaming Mode (Long Audio)

Process audio up to 20 hours with minimal memory:

caption = client.alignment(
    input_media="long_audio.wav",
    input_caption="subtitle.srt",
    streaming_chunk_secs=300.0,  # 5-minute chunks
)
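
To get a feel for what `streaming_chunk_secs=300.0` means for a long recording, the chunk boundaries can be computed directly. This is arithmetic only, under the assumption of back-to-back chunks with no overlap; how LattifAI actually handles chunk boundaries is not described in this document, and `chunk_bounds` is a hypothetical helper.

```python
import math
from typing import Iterator, Tuple


def chunk_bounds(
    duration_secs: float, chunk_secs: float = 300.0
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) pairs covering the audio in fixed-size chunks."""
    if chunk_secs <= 0:
        raise ValueError("chunk_secs must be positive")
    for index in range(math.ceil(duration_secs / chunk_secs)):
        start = index * chunk_secs
        yield start, min(start + chunk_secs, duration_secs)


# A 20-hour recording in 5-minute chunks.
print(len(list(chunk_bounds(20 * 3600))))

# A short file whose last chunk is partial.
print(list(chunk_bounds(650.0)))
```

At 5-minute chunks, a 20-hour file works out to 240 chunks, so memory use depends on the chunk size rather than on the total duration.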

Word-Level Alignment

from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, RenderConfig

client = LattifAI(caption_config=CaptionConfig(render=RenderConfig(word_level=True)))
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",  # JSON preserves word-level data
)

Speaker Diarization

Automatically identify and label different speakers in audio.

Capabilities:

Multi-Speaker Detection: Automatically detect speaker changes
Smart Labeling: Assign labels (SPEAKER_00, SPEAKER_01, etc.)
Label Preservation: Maintain existing speaker names from input captions
Gemini Integration: Extract speaker names from transcription context

Label Handling:

Without existing labels → Generic labels (SPEAKER_00, SPEAKER_01)
With existing labels ([Alice], >> Bob:, SPEAKER_01:) → Preserved during alignment
Gemini transcription → Names extracted from context (e.g., “Hi, I’m Alice” → Alice)

from lattifai.client import LattifAI
from lattifai.config import DiarizationConfig

client = LattifAI(
    diarization_config=DiarizationConfig(
        enabled=True,
        device="cuda",
        min_speakers=2,
        max_speakers=4,
    )
)
caption = client.alignment(...)

for segment in caption.supervisions:
    print(f"[{segment.speaker}] {segment.text}")

LLM Speaker Name Inference:

When speakers remain as SPEAKER_XX after acoustic diarization, enable LLM inference to identify real names from dialogue content:

DiarizationConfig(
    enabled=True,
    infer_speakers=True,              # Use LLM to infer speaker names
)

# Pass context as a per-call parameter to speaker_diarization()
client.speaker_diarization(
    input_media=audio,
    caption=caption,
    output_caption_path="output.srt",
    speaker_context="podcast, host is Alice, guest is Bob",  # Optional hint
)

DiarizationConfig Options:

Option	Default	Description
`enabled`	`False`	Enable speaker diarization
`device`	`auto`	`cpu`, `cuda`, `mps`, or `auto`
`num_speakers`	—	Exact number of speakers (overrides min/max)
`min_speakers`	—	Minimum speakers to detect
`max_speakers`	—	Maximum speakers to detect
`infer_speakers`	`False`	Use LLM to infer real names from dialogue

CLI:

lai alignment align audio.wav subtitle.srt output.srt \
    diarization.enabled=true \
    diarization.device=cuda

# With LLM speaker name inference
lai alignment align audio.wav subtitle.srt output.srt \
    diarization.enabled=true \
    diarization.infer_speakers=true

# Diarize subcommand with speaker context
lai diarize run audio.wav subtitle.srt output.srt \
    --context "interview with Dr. Smith"

Data Flow

Input Media → AudioLoader → Aligner → (Diarizer) → Caption
                              ↑
Input Caption → Reader → Tokenizer

Text Processing

The tokenizer handles various text patterns for forced alignment.

Bracket/Caption Handling

Visual captions and annotations in brackets are treated specially - they get two pronunciation paths so the aligner can choose:

Silence path - skip when content doesn’t appear in audio
Inner text pronunciation - match if someone actually says the words

Bracket Type	Symbol	Example	Alignment Behavior
Half-width square	`[]`	`[APPLAUSE]`	Skip or match “applause”
Half-width paren	`()`	`(music)`	Skip or match “music”
Full-width square	`【】`	`【笑声】`	Skip or match “笑声”
Full-width paren	`（）`	`（音乐）`	Skip or match “音乐”
Angle brackets	`<>`	`<intro>`	Skip or match “intro”
Book title marks	`《》`	`《开场白》`	Skip or match “开场白”

This allows proper handling of:

Visual descriptions: [Barret adjusts the camera and smiles] → skipped if not spoken
Sound effects: [APPLAUSE], (music) → matched if audible
Chinese annotations: 【笑声】, （鼓掌） → flexible alignment
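
The two-path idea can be illustrated with a small helper that returns the candidate pronunciations for a token. This is a sketch, not the Lattice-1 tokenizer: `alignment_paths` is a hypothetical name, `None` is used here to stand for the silence path, and lowercasing the inner text is an assumption based on the "match “applause”" wording in the table.

```python
import re
from typing import List, Optional

# Opening and closing symbols from the bracket table above.
BRACKET_PAIRS = [
    ("[", "]"), ("(", ")"), ("【", "】"), ("（", "）"), ("<", ">"), ("《", "》"),
]

_BRACKETED = re.compile(
    "|".join(
        f"{re.escape(o)}([^{re.escape(c)}]*){re.escape(c)}" for o, c in BRACKET_PAIRS
    )
)


def alignment_paths(token: str) -> List[Optional[str]]:
    """Return candidate pronunciations; None stands for the silence (skip) path."""
    match = _BRACKETED.fullmatch(token.strip())
    if match is None:
        return [token]  # ordinary text has a single path
    inner = next(group for group in match.groups() if group is not None)
    return [None, inner.strip().lower()]


print(alignment_paths("[APPLAUSE]"))
print(alignment_paths("【笑声】"))
print(alignment_paths("hello"))
```

An aligner given both paths can score each against the audio and keep whichever fits, which is how a purely visual description can be skipped while an audible one is matched.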

Multilingual Text

Pattern	Handling	Example
CJK characters	Split individually	`你好` → `["你", "好"]`
Latin words	Grouped with accents	`Kühlschrank` → `["Kühlschrank"]`
Contractions	Kept together	`I'm`, `don't`, `we'll`
Punctuation	Attached to words	`Hello,` `world!`
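
The splitting rules in the table can be approximated with one regular expression. This is a rough sketch rather than the library's tokenizer: `tokenize` is a hypothetical function, and "CJK" is narrowed here to the basic ideograph range U+4E00 to U+9FFF as a simplifying assumption, so kana, hangul, and full-width punctuation are not treated specially.

```python
import re
from typing import List

_CJK = r"\u4e00-\u9fff"  # assumption: basic CJK Unified Ideographs only

# Either one CJK character, or a run of non-space, non-CJK characters.
# The second branch keeps accents, apostrophes, and trailing punctuation
# attached to the word they belong to.
_TOKEN = re.compile(rf"[{_CJK}]|[^\s{_CJK}]+")


def tokenize(text: str) -> List[str]:
    """Split CJK characters individually and keep Latin words whole."""
    return _TOKEN.findall(text)


print(tokenize("你好"))
print(tokenize("Kühlschrank"))
print(tokenize("I'm sure we'll say Hello, world!"))
```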

Speaker Labels

Recognized speaker patterns are preserved during alignment:

Format	Example	Output
Arrow prefix	`>> Alice:` or `&gt;&gt; Alice:`	`[Alice]`
LattifAI format	`[SPEAKER_01]:`	`[SPEAKER_01]`
Uppercase name	`SPEAKER NAME:`	`[SPEAKER NAME]`
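
Recognizing these prefixes is mostly a pattern-matching task. The sketch below covers the three formats from the table with regular expressions; `split_speaker` and the exact patterns are assumptions made for illustration and may be stricter or looser than what LattifAI accepts.

```python
import re
from typing import Optional, Tuple

_PATTERNS = [
    re.compile(r"^>>\s*(?P<name>[^:]+):\s*"),        # arrow prefix: >> Alice:
    re.compile(r"^\[(?P<name>[^\]]+)\]:\s*"),        # bracketed: [SPEAKER_01]:
    re.compile(r"^(?P<name>[A-Z][A-Z0-9_ ]*):\s*"),  # uppercase: SPEAKER NAME:
]


def split_speaker(line: str) -> Tuple[Optional[str], str]:
    """Return (speaker, remaining text); speaker is None if no label is found."""
    for pattern in _PATTERNS:
        match = pattern.match(line)
        if match:
            return match.group("name").strip(), line[match.end():]
    return None, line


for sample in (
    ">> Alice: Hello there",
    "[SPEAKER_01]: Hi",
    "SPEAKER NAME: Welcome back",
    "Note: not a speaker label",
):
    print(split_speaker(sample))
```

The uppercase rule deliberately rejects mixed-case prefixes such as "Note:" so that ordinary sentences containing a colon are not mistaken for speaker turns.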

Supported Formats & Languages

Media Formats

Type	Formats
Audio	WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, AIFF, and more
Video	MP4, MKV, MOV, WEBM, AVI, and more
Caption	SRT, VTT, ASS, SSA, SRV3, JSON, TextGrid, TSV, CSV, LRC, TTML, and more

Note: Caption format handling is provided by lattifai-captions, which is automatically installed as a dependency. For standalone caption processing without alignment features, install pip install lattifai-captions.

JSON Format

JSON is the most flexible format for storing caption data with full word-level timing support:

[
    {
        "text": "Hello beautiful world",
        "start": 0.0,
        "end": 2.5,
        "speaker": "Speaker 1",
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.5},
            {"word": "beautiful", "start": 0.6, "end": 1.4},
            {"word": "world", "start": 1.5, "end": 2.5}
        ]
    }
]

Features:

Word-level timestamps preserved in words array
Round-trip compatible (read/write without data loss)
Optional speaker field for multi-speaker content
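
Because the JSON layout is plain data, it can be inspected with the standard library alone. The sketch below loads the sample shown above and runs a few sanity checks on the timings; the checks themselves (words in order, non-overlapping, inside the segment) are assumptions about what well-formed output looks like, not guarantees stated by LattifAI, and `find_timing_problems` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

SAMPLE = """
[
    {
        "text": "Hello beautiful world",
        "start": 0.0,
        "end": 2.5,
        "speaker": "Speaker 1",
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.5},
            {"word": "beautiful", "start": 0.6, "end": 1.4},
            {"word": "world", "start": 1.5, "end": 2.5}
        ]
    }
]
"""


def find_timing_problems(segment: Dict[str, Any]) -> List[str]:
    """Return human-readable descriptions of suspicious timings in one segment."""
    problems: List[str] = []
    if segment["end"] < segment["start"]:
        problems.append("segment ends before it starts")
    previous_end = segment["start"]
    for word in segment.get("words", []):
        if word["start"] < previous_end:
            problems.append(f"{word['word']!r} starts before the previous word ends")
        if word["end"] < word["start"]:
            problems.append(f"{word['word']!r} ends before it starts")
        if word["end"] > segment["end"]:
            problems.append(f"{word['word']!r} runs past the segment end")
        previous_end = word["end"]
    return problems


segments = json.loads(SAMPLE)
for segment in segments:
    # "speaker" is optional, so fall back to a placeholder when it is absent.
    print(segment.get("speaker", "?"), find_timing_problems(segment))
```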

Word-Level and Karaoke Output

Format	`word_level=True`	`word_level=True` + `karaoke_effect`
JSON	Includes `words` array	Same as word_level=True
SRT	One word per segment	One word per segment
VTT	One word per segment	YouTube VTT style: `<00:00:00.000><c> word</c>`
ASS	One word per segment	`{\kf}` karaoke tags (sweep effect)
LRC	One word per line	Enhanced `<timestamp>` tags
TTML	One word per `<p>` element	`<span>` with `itunes:timing="Word"`
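
For the ASS row, `{\kf}` tags take a duration in centiseconds for each syllable or word. The sketch below derives those durations from word timestamps in the JSON layout; it is a simplified illustration of the tag format, not the output LattifAI produces. `to_ass_karaoke` is a hypothetical helper, and letting each sweep run until the next word starts (so pauses do not accumulate as drift) is a design assumption.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def to_ass_karaoke(words: List[Word]) -> str:
    """Build ASS karaoke text with one {\\kf<centiseconds>} tag per word."""
    parts = []
    for index, word in enumerate(words):
        # Run each sweep until the next word starts so pauses do not cause drift.
        if index + 1 < len(words):
            stop = words[index + 1]["start"]
        else:
            stop = word["end"]
        centiseconds = round((stop - word["start"]) * 100)
        parts.append(f"{{\\kf{centiseconds}}}{word['word']}")
    return " ".join(parts)


words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "beautiful", "start": 0.6, "end": 1.4},
    {"word": "world", "start": 1.5, "end": 2.5},
]
print(to_ass_karaoke(words))
```

The result belongs in the text field of an ASS Dialogue line whose start time equals the first word's start.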

Speaker Colors

The speaker_color option colorizes speaker names in ASS output (works with both karaoke and non-karaoke modes):

Value	Behavior
`""` (default)	No speaker coloring
`"auto"`	Assigns from a built-in 10-color palette
`"#RRGGBB"`	Single color for all speakers
`"#RRGGBB,#00BFFF,..."`	Comma-separated list, one per speaker (cycles if more speakers than colors)

# Auto-color speakers in ASS output
lai caption convert input.json output.ass \
    render.include_speaker_in_text=true \
    ass.speaker_color=auto

# Custom single color
lai caption convert input.json output.ass \
    render.include_speaker_in_text=true \
    ass.speaker_color="#1387C0"

Karaoke Color Schemes

Use ass.karaoke_color_scheme to apply a predefined color scheme for karaoke ASS output. Each scheme sets primary_color, secondary_color, outline_color, and back_color.

12 schemes available: azure-gold, sakura-purple, mint-ocean, gardenia-green, sunset-warm, prussian-elegant, burgundy-classic, langgan-spring, mars-teal, spring-field, navy-pink, apricot-dark

# Karaoke with color scheme + auto speaker colors
lai caption convert input.json output.ass \
    ass.karaoke_effect=sweep \
    ass.karaoke_color_scheme=azure-gold \
    ass.speaker_color=auto

VTT Format (YouTube VTT Support)

The VTT format handler supports both standard WebVTT and YouTube VTT with word-level timestamps.

Reading: VTT automatically detects YouTube VTT format (with <timestamp><c> tags) and extracts word-level alignment data:

WEBVTT

00:00:00.000 --> 00:00:02.000
<00:00:00.000><c> Hello</c><00:00:00.500><c> world</c>
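
The inline tags in the cue above are simple enough to parse by hand, which helps when checking what a file contains. This sketch extracts each word with its start time from one cue line; `parse_youtube_vtt_line` is a hypothetical helper and is not the parser used by lattifai-captions.

```python
import re
from typing import List, Tuple

# Matches <HH:MM:SS.mmm><c> word</c> and captures the time parts and the word.
_WORD_TAG = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})><c>\s*(.*?)</c>")


def parse_youtube_vtt_line(line: str) -> List[Tuple[str, float]]:
    """Return (word, start_seconds) pairs found in one YouTube-style cue line."""
    words = []
    for hours, minutes, seconds, millis, text in _WORD_TAG.findall(line):
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
        words.append((text, start))
    return words


cue = "<00:00:00.000><c> Hello</c><00:00:00.500><c> world</c>"
print(parse_youtube_vtt_line(cue))
```

Only start times are carried inline; a word's end time has to be inferred from the next word's start or from the cue's own end timestamp.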

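Writing: Set render.word_level=True to write word-level timestamps. The SDK example below reads a VTT file and writes ASS with a karaoke sweep effect, and the CLI example converts JSON to word-level VTT: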

from lattifai.data import Caption
from lattifai.caption.config import ASSConfig, RenderConfig

caption = Caption.read("input.vtt")
caption.write(
    "output.ass",
    format_config=ASSConfig(karaoke_effect="sweep"),
    render=RenderConfig(word_level=True),
)

# CLI: Convert to VTT with word-level timestamps
lai caption convert input.json output.vtt \
    render.word_level=true

Transcription Language Support

Gemini Models (100+ Languages)

Models: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-3-pro-preview, gemini-3-flash-preview, gemini-3.1-pro-preview

English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, and 70+ more.

Requires Gemini API key from Google AI Studio

NVIDIA Parakeet (24 European Languages)

Model: nvidia/parakeet-tdt-0.6b-v3

Region	Languages
Western Europe	English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl)
Nordic	Danish (da), Swedish (sv), Norwegian (no), Finnish (fi)
Eastern Europe	Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru)
Others	Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el)

Alibaba SenseVoice (5 Asian Languages)

Model: iic/SenseVoiceSmall

Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)

FunAudioLLM Fun-ASR-Nano (31 Languages)

Models: FunAudioLLM/Fun-ASR-Nano-2512, FunAudioLLM/Fun-ASR-MLT-Nano-2512

800M parameter end-to-end ASR model from Tongyi Lab, excelling at far-field, high-noise, dialect/accent, and music lyric recognition.

Region	Languages
East Asia	Chinese (+ 7 dialects, 26 accents), Japanese, Korean, Cantonese
Southeast Asia	Vietnamese, Indonesian, Thai, Malay, Filipino
South Asia	Hindi
Middle East	Arabic
Europe	English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish

# Use ModelScope (default for China)
lai transcribe run audio.wav output.srt \
    transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
    transcription.model_hub=modelscope

# Use HuggingFace
lai transcribe run audio.wav output.srt \
    transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
    transcription.model_hub=huggingface

vLLM/SGLang (Any ASR Model)

Any ASR model served via vLLM or SGLang with an OpenAI-compatible API.

Supported models and limitations:

Model	Audio tok/s	Max Audio	API Mode	Batch	Notes
Qwen3-ASR (0.6B/1.7B)	25	auto	transcriptions	Yes	Best for zh/en/ja/ko
Whisper	50	30s	transcriptions	Yes	Fixed 30s context window
Voxtral	12.5	auto	transcriptions	Yes	European languages
Voxtral Realtime	12.5	auto	realtime	Yes	WebSocket, <500ms latency
Ultravox	6.25	auto	transcriptions	Yes	Confirmed in vLLM source
Gemma-3n	6.25	30s	chat (auto)	No	Not a dedicated ASR model (~3x Whisper WER), 30s encoder limit, no concurrent requests

Max Audio: “auto” = estimated from max_model_len; fixed values (30s) are hard encoder limits
Batch: Whether batch_size>1 concurrent requests are supported
API Mode: transcriptions is the default; general-purpose LLMs auto-switch to chat
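
One plausible reading of "estimated from max_model_len" is a token budget divided by the audio token rate. The sketch below works through that arithmetic. The formula, the `reserved_tokens` parameter, and the function name are assumptions, since this document does not spell out how the estimate is computed. The token rates come from the table above, and 32000 is the `--max-model-len` used in the Gemma-3n serve command in the setup example below; applying it to Qwen3-ASR here is purely illustrative.

```python
from typing import Optional


def estimate_max_audio_secs(
    max_model_len: int,
    audio_tok_per_sec: float,
    reserved_tokens: int = 0,
    hard_limit_secs: Optional[float] = None,
) -> float:
    """Estimate how many seconds of audio fit in the model's context window."""
    budget_secs = max(max_model_len - reserved_tokens, 0) / audio_tok_per_sec
    if hard_limit_secs is not None:
        # A fixed encoder limit wins over the token budget.
        return min(budget_secs, hard_limit_secs)
    return budget_secs


# Qwen3-ASR at 25 audio tokens per second with a 32000-token context.
print(estimate_max_audio_secs(32000, 25.0))

# Gemma-3n at 6.25 tokens per second, capped by its 30 s encoder limit.
print(estimate_max_audio_secs(32000, 6.25, hard_limit_secs=30.0))
```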

API modes:

Mode	Endpoint	Use Case
`transcriptions` (default)	`/v1/audio/transcriptions`	Dedicated ASR models (Qwen3-ASR, Whisper, GLM-ASR, etc.)
`chat`	`/v1/chat/completions`	General-purpose LLMs (Gemma-3n, etc.) — auto-selected for non-ASR models
`realtime`	`/v1/realtime` (WebSocket)	Voxtral Realtime

# 1. Install vLLM with audio support (requires CUDA GPU)
pip install vllm "vllm[audio]"

# 2. Start vLLM server on a Linux GPU machine (auto-downloads the model)
vllm serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8081
# Other models:
#   vllm serve openai/whisper-large-v3-turbo
#   vllm serve google/gemma-3n-E4B-it --max-model-len 32000 --enforce-eager

# 3. Transcribe (default: transcriptions mode)
lai transcribe run audio.wav output.srt \
    transcription.model_name=Qwen/Qwen3-ASR-1.7B \
    transcription.api_base_url=http://localhost:8081/v1

# Batch mode for faster processing (4 concurrent requests)
lai transcribe run audio.wav output.srt \
    transcription.model_name=Qwen/Qwen3-ASR-1.7B \
    transcription.api_base_url=http://localhost:8081/v1 \
    transcription.batch_size=4

# General-purpose LLM (auto-switches to chat mode with ASR system prompt)
lai transcribe run audio.wav output.srt \
    transcription.model_name=google/gemma-3n-E4B-it \
    transcription.api_base_url=http://localhost:8084/v1 \
    transcription.language=zh

# Voxtral Realtime (streaming WebSocket, <500ms latency)
# Server: VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
#   --host 0.0.0.0 --port 8086 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
lai transcribe run audio.wav output.srt \
    transcription.model_name=mistralai/Voxtral-Mini-4B-Realtime-2602 \
    transcription.api_base_url=http://localhost:8086/v1 \
    transcription.api_mode=realtime

Roadmap

Visit lattifai.com/roadmap for updates.

Date	Release	Features
Oct 2025	Lattice-1-Alpha	✅ English forced alignment, multi-format support
Nov 2025	Lattice-1	✅ EN+ZH+DE, speaker diarization, multi-model transcription
Q2 2026	Lattice-2	✅ Streaming mode, 🔮 40+ languages, real-time alignment

Development

git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python

# Using uv (recommended, auto-configures extra index)
uv sync && source .venv/bin/activate

# Or pip (requires extra-index-url for lattifai-core)
pip install -e ".[all,dev]" --extra-index-url https://lattifai.github.io/pypi/simple/

# Run tests
pytest

# Install pre-commit hooks
pre-commit install

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make changes and add tests
Run pytest and pre-commit run --all-files
Commit your changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

Support

Issues: GitHub Issues
Discord: Join our community

License

Apache License 2.0

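Similar articles

@MaxForAI: If you are building a voice agent, you should try this project. A team from Nanyang Technological University, the National University of Singapore, and Shanghai AI Lab has released Mega-ASR. This fully open-source ASR is built on Qwen3-ASR and aims to break the bottleneck that has long troubled ASR in noisy, reverberant, or otherwise degraded real-world environments…

X AI KOLs Timeline

Nanyang Technological University, the National University of Singapore, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to a 30% relative reduction in word error rate in noisy real-world environments, and at only 1.7B parameters it runs efficiently on consumer hardware.

@XieZhifei14110: Stop using Whisper for speech recognition! Open-source Mega-ASR, the first all-scenario SOTA industrial-grade ASR model, built for……

X AI KOLs Timeline

Open-source Mega-ASR, an all-scenario SOTA industrial-grade ASR model designed for far-field, noisy, and other complex audio environments, outperforms existing open-source and closed-source models by 10-30% on real-world benchmarks.

@denziideng: Found another overwhelming leap in AI voice cloning… CosyVoice, which I shared earlier, clones a voice in 3 seconds, and I thought that was scary enough. Today's find is even more striking: after training on just 1 minute of my own recorded voice, it reproduced the timbre, tone, emotion, breathing, and pauses, as if it were me in person! Alibaba DAMO Academy's C…

X AI KOLs Timeline

GPT-SoVITS is an open-source AI voice cloning tool that supports zero-shot (5 seconds of audio) and few-shot (1 minute of training) high-fidelity voice cloning and cross-lingual inference. It ships with a complete WebUI toolchain and has earned 57.8k stars on GitHub, making it a leading open-source project in voice cloning.

Transcribing children's speech: ASR performance and obtaining reliable orthographic transcriptions

arXiv cs.CL

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on the Dutch children's speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium performs best (WER of 5.54% on JASMIN and 70.37% on DART). It also proposes a selection method that automatically identifies correctly pronounced recording segments with high precision, reducing the need for manual verification.

@uniswap12: Microsoft open-sourced a speech AI that transcribes 60 minutes of audio in one pass and copes with 4 people talking at once. VibeVoice, open-sourced by Microsoft, 24.8k stars, and I only learned about it today. For one-click recording-to-text I had always used Whisper, but it often times out on long meeting recordings, and multi-speaker recognition…

X AI KOLs Timeline

Microsoft open-sourced the speech AI framework VibeVoice, which supports one-pass transcription of 60-minute audio, multi-speaker separation, and timestamp annotation, and also provides multi-role TTS synthesis. It is built on Qwen2.5, comes with a lightweight 0.5B real-time version, and has earned 24.8k stars on GitHub.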
