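@XAMTO_AI: If you don't bookmark this open-source tool now, you'll regret it later: automatic video dubbing and translation with support for 33 languages, plus the ability to ask questions about the video content. I came across a gem on GitHub called Violin. It is fully open source, and what it does sounds a little wild: you drop in a video and it automatically recognizes the speech, …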

X AI KOLs Timeline 2026/06/12 01:08 Tools

open-source video-dubbing translation ai-tool github voice-synthesis multilingual

Summary

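Violin is an open-source tool for automatic video dubbing and translation. It supports 33 languages, integrates models such as Whisper and DeepSeek, and provides one-step speech recognition, translation, dubbed-audio synthesis, and in-video Q&A.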

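If you don't bookmark this open-source tool now, you'll regret it later: automatic video dubbing and translation with support for 33 languages, plus the ability to ask questions about the video content.

I came across a gem on GitHub called Violin. It is fully open source, and what it does sounds a little wild: you drop in a video and it automatically recognizes the speech, translates it, synthesizes a voice-over in the target language, and mixes it back into the video with the timeline fully aligned. It also outputs SRT subtitles along the way. The whole flow is end to end, with no manual steps.

Under the hood, Whisper Large v3 handles speech recognition, DeepSeek V4 Pro handles translation, Cartesia Sonic 3 synthesizes the voice-over, and ffmpeg does the pre- and post-processing. The pipeline is cleanly designed.

A few features stood out to me:

33 target languages are supported, 16 of which come with handpicked native-speaker voices from Cartesia Sonic 3 and ElevenLabs, so the result does not sound robotic.

In-video Q&A: once dubbing is done, you can ask a question about any moment in the video, and it answers based on nearby subtitles and sampled frames. This design went beyond what I expected.

Natural-language voice selection: you describe the style of voice you want, and an LLM picks one from the voice catalog, so you don't have to audition them one by one.

Six style presets: standard, kids, academic, casual, storyteller, and news. Each preset comes with speech rate and emotion already tuned, ready to use.

Pluggable architecture: the transcription, translation, and TTS stages can each be swapped, mixing Together, OpenAI, and ElevenLabs freely. Everything is configured in a single YAML file, so people who don't want to touch code can use it too.

GitHub: https://github.com/shang-zhu/violin… Live demo: https://violin-ai.com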

shang-zhu/violin

Source: https://github.com/shang-zhu/violin

🎻 Violin

Open-source Video Translation Skill.

🌐 Live demo · 📝 Blog post · 📜 MIT License

Upload a video. Violin transcribes the speech, translates it, synthesizes a native-sounding voice-over in the target language, and remuxes it back into the video — fully aligned, with optional SRT subtitles.

Available as a CLI, a FastAPI web app, and a Claude Code skill.

✨ Features

33 target languages with handpicked native-speaker voices for the 16 most-used ones (Cartesia Sonic 3 + ElevenLabs)
In-video Q&A — ask questions about any moment in the dubbed video; answers use nearby subtitles plus sampled frames
Natural-language voice picker — describe the voice you want, an LLM picks from the catalog
6 style profiles (experimental) — standard / kids / academic / casual / storyteller / news
Pluggable stack — Together / OpenAI / ElevenLabs interchangeable for every stage, one YAML
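The in-video Q&A feature is described as answering from nearby subtitles plus sampled frames. The following is a minimal sketch of how such context might be gathered, not Violin's actual implementation. The `(start, end, text)` subtitle shape, both function names, the 15-second window, and the three-frame sample are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical subtitle shape: (start_sec, end_sec, text).
Subtitle = Tuple[float, float, str]


def nearby_subtitles(subs: List[Subtitle], t: float, window: float = 15.0) -> List[str]:
    """Return the text of subtitles that overlap [t - window, t + window]."""
    lo, hi = t - window, t + window
    return [text for start, end, text in subs if end >= lo and start <= hi]


def sample_frame_times(t: float, duration: float, count: int = 3, spread: float = 2.0) -> List[float]:
    """Pick `count` timestamps centred on t, `spread` seconds apart, clamped to the video."""
    offsets = [(i - (count - 1) / 2) * spread for i in range(count)]
    return [min(max(t + off, 0.0), duration) for off in offsets]


subs = [(0.0, 4.0, "Welcome."), (20.0, 24.0, "Gradient descent."), (60.0, 65.0, "Summary.")]
print(nearby_subtitles(subs, t=22.0))            # ['Gradient descent.']
print(sample_frame_times(22.0, duration=70.0))   # [20.0, 22.0, 24.0]
```

The selected subtitle text and the frames extracted at those timestamps would then be passed to a multimodal model together with the user's question.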

🚀 Quick start

Try it without installing anything

The live demo runs at https://www.violin-ai.com — drop a short clip in, get a dubbed video out in a few minutes.

Run locally

Requires Python 3.10+ and ffmpeg on PATH.

curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if you don't have it
uv tool install violin                            # recommended — faster, isolated
# or: pip install violin                          # if you'd rather install into your current Python env

export TOGETHER_API_KEY=...                       # get one at https://api.together.ai (add to ~/.zshrc to persist)

Three ways to use it:

1. CLI — translate one file:

violin lecture.mp4 lecture_zh.mp4 --language Chinese

2. Web app — full REST API + browser UI:

violin-api
# → http://127.0.0.1:8000           (browser UI)
# → http://127.0.0.1:8000/docs      (interactive API docs)

3. Claude Code skill — invoke from any Claude Code session:

violin --install-skill          # one-time: copies the skill into ~/.claude/skills/
claude
> please use the violin skill to translate path/to/video.mp4 into Chinese

Run from source (for hacking on the pipeline)

git clone https://github.com/shang-zhu/violin.git
cd violin
uv sync
cp .env.example .env             # then fill in TOGETHER_API_KEY
uv run main.py lecture.mp4 lecture_zh.mp4 --language Chinese

To use the violin / violin-api commands globally while having edits to your local source take effect immediately, install in editable mode:

uv tool uninstall violin     # if you've installed the PyPI version
uv tool install --editable .

After this, violin / violin-api run from your local checkout — edit any file and the next invocation picks it up; no rebuild needed. To switch back to PyPI: uv tool uninstall violin && uv tool install violin.

📝 To Do List

[ ] Support voice cloning.
[ ] Lip-sync generation.

🎬 How Violin works

Video
  │
  ├─ ffmpeg ─────────────────────► Extract audio (16 kHz WAV)
  │
  ├─ Whisper Large v3 ────────────► Word-level timestamps → sentence segments
  │
  ├─ LLM (DeepSeek V4 Pro by default) ──► Translate each segment, respecting style profile
  │
  ├─ TTS (Cartesia Sonic 3 by default) ─► Synthesize dubbed audio per segment
  │
  └─ ffmpeg ─────────────────────► Speed-align video to dubbed audio,
                                    concat with freeze-frame fallback,
                                    single-pass AAC encode the audio track,
                                    write output mp4 + optional SRT
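Two steps in the diagram above are easy to illustrate in isolation: turning word-level timestamps into sentence segments, and writing those segments as SRT. This is a rough sketch under stated assumptions, not the project's code. The `(text, start, end)` word shape and the rule of splitting on sentence-final punctuation are assumptions; the `HH:MM:SS,mmm` timestamp is the standard SRT layout.

```python
from typing import List, Tuple

# Hypothetical word shape: (text, start_sec, end_sec).
Word = Tuple[str, float, float]
Segment = Tuple[float, float, str]

SENTENCE_END = (".", "!", "?", "。", "！", "？")


def group_into_sentences(words: List[Word]) -> List[Segment]:
    """Merge word-level timestamps into (start, end, text) sentence segments."""
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        if word[0].endswith(SENTENCE_END):
            segments.append((current[0][1], current[-1][2], " ".join(w[0] for w in current)))
            current = []
    if current:  # trailing words without closing punctuation
        segments.append((current[0][1], current[-1][2], " ".join(w[0] for w in current)))
    return segments


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n".join(cues)


words = [("Hello", 0.0, 0.4), ("world.", 0.5, 0.9), ("Next", 1.2, 1.5), ("one", 1.6, 1.9)]
segments = group_into_sentences(words)
print(to_srt(segments))
```

In the real pipeline each segment would be translated and synthesized before the cue text is written; the sketch only shows the timing bookkeeping.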

⚙️ Configuration

Override any default by writing your own YAML and passing it with --config my.yaml — only the keys you want to change need to appear; values deep-merge with the built-in defaults.
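To make the deep-merge behaviour concrete, here is a minimal sketch of a recursive dictionary merge. It illustrates the general technique only; the `deep_merge` name and the exact merge rules Violin applies (for example, how lists are handled) are assumptions.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested keys in `override` replace those in `defaults`."""
    merged = deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = deepcopy(value)  # scalars and lists are replaced wholesale
    return merged


defaults = {
    "models": {
        "transcription": {"provider": "together", "model": "openai/whisper-large-v3"},
        "tts": {"provider": "together", "model": "cartesia/sonic-3"},
    }
}
# An override file that only mentions the TTS stage.
override = {"models": {"tts": {"provider": "elevenlabs", "model": "eleven_v3"}}}

config = deep_merge(defaults, override)
print(config["models"]["tts"]["provider"])            # elevenlabs
print(config["models"]["transcription"]["provider"])  # together
```

Because only the overridden leaf keys change, an override file can stay a few lines long even when the default config grows.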

Switch providers

# config/default.yaml — pick the stack you want
models:
  transcription:
    provider: together                  # together | openai
    model: openai/whisper-large-v3      # together → openai/whisper-large-v3 | openai → whisper-1
  translation:
    provider: together                  # together | openai
    model: deepseek-ai/DeepSeek-V4-Pro  # together → deepseek-ai/DeepSeek-V4-Pro | openai → gpt-5.5
  tts:
    provider: together                  # together | elevenlabs | openai
    model: cartesia/sonic-3             # together → cartesia/sonic-3 | elevenlabs → eleven_v3 | openai → tts-1-hd

Production overrides

A starter config/prod.yaml is included for public deployments. It adds upload limits, serializes jobs, and caps ffmpeg concurrency. The included Dockerfile + docker-compose.yml + Caddyfile are how the live demo is hosted — docker compose up -d --build after filling .env is enough to put a copy of Violin behind auto-HTTPS on any Docker host.

Environment variables

Variable	When required	Description
`TOGETHER_API_KEY`	Recommended — covers every stage with the default config	Together AI API key
`OPENAI_API_KEY`	Any stage uses `provider: openai`	Covers `whisper-1`, GPT models, and `tts-1`
`ELEVENLABS_API_KEY`	TTS uses `provider: elevenlabs`	ElevenLabs API key
`CORS_ORIGINS`	Optional	Comma-separated allowed origins (default: `*`)

You only need keys for the providers you actually pick. Pure-OpenAI deployments (all stages on openai) work too — OPENAI_API_KEY alone is enough. Same idea for ElevenLabs.
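The rule that you only need keys for the providers you pick can be checked before starting a job. The sketch below is a hypothetical pre-flight helper, not part of Violin: the function names are made up, and the provider-to-variable mapping is taken from the table above.

```python
from typing import Dict, Mapping, Set

# Provider name -> environment variable, as listed in the table above.
PROVIDER_KEYS = {
    "together": "TOGETHER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def required_keys(models: Dict[str, Dict[str, str]]) -> Set[str]:
    """Collect the API-key variables needed by the providers used in `models`."""
    return {PROVIDER_KEYS[stage["provider"]] for stage in models.values()}


def missing_keys(models: Dict[str, Dict[str, str]], env: Mapping[str, str]) -> Set[str]:
    """Return the required variables that are unset or empty in `env`."""
    return {key for key in required_keys(models) if not env.get(key)}


models = {
    "transcription": {"provider": "together"},
    "translation": {"provider": "openai"},
    "tts": {"provider": "elevenlabs"},
}
print(sorted(missing_keys(models, {"TOGETHER_API_KEY": "set"})))
# ['ELEVENLABS_API_KEY', 'OPENAI_API_KEY']
```

In practice you would pass `os.environ` as `env` and the `models` section of the merged config.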

🎭 Style profiles

Six built-in profiles tune both the translation LLM prompt and the TTS delivery. Use --style <name> on the CLI or pass style in API requests.

Style	Tone	TTS speed	Emotion
`standard`	Faithful translation, natural voice	1.0×	—
`kids`	Rewritten for a 7-year-old, plain language	1.0×	excited
`academic`	Formal register, preserves jargon and honorifics	0.95×	calm
`casual`	Spoken slang, contractions, friendly	1.1×	content
`storyteller`	Vivid, dramatic narration	0.9×	enthusiastic
`news`	Concise, declarative, broadcast-style	1.0×	neutral

Add your own by editing prompts/styles.yaml.

See all available styles: violin --style list.
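For readers scripting around the style profiles, the table above can be modelled as a small lookup. This is an illustrative sketch only: the `StyleProfile` class and `get_style` helper are hypothetical, and the real definitions live in prompts/styles.yaml, whose schema is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StyleProfile:
    tone: str
    tts_speed: float         # multiplier applied to TTS delivery
    emotion: Optional[str]   # None where the table shows no emotion


# Values copied from the style table above.
STYLES: Dict[str, StyleProfile] = {
    "standard": StyleProfile("Faithful translation, natural voice", 1.0, None),
    "kids": StyleProfile("Rewritten for a 7-year-old, plain language", 1.0, "excited"),
    "academic": StyleProfile("Formal register, preserves jargon and honorifics", 0.95, "calm"),
    "casual": StyleProfile("Spoken slang, contractions, friendly", 1.1, "content"),
    "storyteller": StyleProfile("Vivid, dramatic narration", 0.9, "enthusiastic"),
    "news": StyleProfile("Concise, declarative, broadcast-style", 1.0, "neutral"),
}


def get_style(name: str = "standard") -> StyleProfile:
    """Look up a profile by name, with a readable error for unknown names."""
    try:
        return STYLES[name]
    except KeyError:
        raise ValueError(f"unknown style {name!r}; choose from {sorted(STYLES)}") from None


print(get_style("academic").tts_speed)  # 0.95
print(get_style("casual").emotion)      # content
```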

💻 CLI usage

Examples use the PyPI-installed violin command. If you’re running from a git checkout, substitute uv run main.py for violin (and uv run run_api.py for violin-api).

# Basic
violin lecture.mp4 lecture_es.mp4 --language Spanish

# Pick a style
violin talk.mp4 talk_zh.mp4 --language Chinese --style kids

# Pick a specific voice
violin lecture.mp4 lecture_fr.mp4 --language French --voice "french narrator man"

# Skip SRT
violin lecture.mp4 lecture_ja.mp4 --language Japanese --no-subtitles

# Full replacement (no original audio underneath)
violin lecture.mp4 lecture_ko.mp4 --language Korean --no-voiceover

# Custom config (e.g. switch to OpenAI/ElevenLabs)
violin lecture.mp4 lecture_it.mp4 --language Italian --config config/other_api.yaml

CLI flags

Flag	Default	Description
`--language` / `-l`	(required)	Target language name (e.g. `Spanish`, `Japanese`)
`--voice` / `-v`	auto	TTS voice. Defaults to the primary native voice for the target language
`--source-language`	`auto-detect`	Source language hint for translation
`--no-subtitles`	off	Skip SRT generation
`--voiceover` / `--no-voiceover`	voiceover on	Keep original audio underneath the dub, or full replacement
`--style` / `-s`	`standard`	Style profile name. Use `--style list` to see all
`--config` / `-c`	`config/default.yaml`	Path to a YAML override file
`--timings-out`	off	Write per-step wall-clock timings + cost as JSON
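When dubbing one video into several languages, it can help to build the command line programmatically. The sketch below is a hypothetical wrapper that uses only the flags documented in the table above; the `build_violin_command` name and its parameters are assumptions, and it prints the commands rather than running them.

```python
import shlex
from typing import List, Optional


def build_violin_command(
    src: str,
    dst: str,
    language: str,
    style: str = "standard",
    voice: Optional[str] = None,
    subtitles: bool = True,
    voiceover: bool = True,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the violin CLI from documented flags."""
    cmd = ["violin", src, dst, "--language", language, "--style", style]
    if voice:
        cmd += ["--voice", voice]
    if not subtitles:
        cmd.append("--no-subtitles")
    if not voiceover:
        cmd.append("--no-voiceover")
    if config:
        cmd += ["--config", config]
    return cmd


for language, suffix in [("Spanish", "es"), ("Japanese", "ja")]:
    cmd = build_violin_command("lecture.mp4", f"lecture_{suffix}.mp4", language, style="academic")
    print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting problems with voice names that contain spaces.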

🛰️ Web app & REST API

violin-api                              # default dev mode
violin-api --host 0.0.0.0 --port 8080   # bind everywhere
violin-api --config config/prod.yaml    # production overrides (requires a git checkout for config/prod.yaml)

Core flow: POST /jobs to start, GET /jobs/{id} to poll, GET /jobs/{id}/video and /srt to download, POST /jobs/{id}/chat for in-video Q&A. Full list with request/response schemas at /docs.

Example

# Submit (the upload field name "file" is assumed here; confirm it at /docs)
JOB=$(curl -s -X POST http://localhost:8000/jobs \
  -F "file=@lecture.mp4" \
  -F "language=Spanish" \
  -F "style=academic" | jq -r .id)

# Poll
curl -s http://localhost:8000/jobs/$JOB | jq '{status, progress}'

# Download
curl -OJ http://localhost:8000/jobs/$JOB/video
curl -OJ http://localhost:8000/jobs/$JOB/srt
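The same submit-then-poll flow can be scripted in Python. The sketch below covers only the polling step and rests on several assumptions: the terminal status values `done` and `failed` are guesses (the curl example only shows that `status` and `progress` fields exist), and both function names are hypothetical. Check /docs for the actual response schema.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Tuple


def fetch_job(job_id: str, base_url: str = "http://localhost:8000") -> Dict:
    """GET /jobs/{id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], Dict] = fetch_job,
    done_states: Tuple[str, ...] = ("done", "failed"),  # assumed status values
    interval: float = 5.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reports a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in done_states:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        sleep(interval)


# Usage against a running violin-api instance:
#   job = wait_for_job("<job id returned by POST /jobs>")
#   print(job["status"], job.get("progress"))
```

Injecting `fetch` and `sleep` keeps the loop testable without a server and makes it easy to swap in a different HTTP client.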

Job data lives under jobs/{id}/. Set api.job_ttl_hours to auto-delete jobs older than N hours (default 0 = disabled; config/prod.yaml uses 24h for the public demo).
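A TTL-based cleanup like the one `api.job_ttl_hours` enables can be sketched as a scan over the jobs directory. This is an illustration under assumptions, not Violin's implementation: using each directory's modification time as the job's age, and the `expired_jobs` name, are both hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def expired_jobs(jobs_dir: Path, ttl_hours: float, now: Optional[float] = None) -> List[Path]:
    """Return job directories older than ttl_hours; a TTL of 0 disables cleanup."""
    if ttl_hours <= 0:
        return []
    cutoff = (time.time() if now is None else now) - ttl_hours * 3600
    return sorted(p for p in jobs_dir.iterdir() if p.is_dir() and p.stat().st_mtime < cutoff)


# Demo in a throwaway directory: one job aged 48 hours, one fresh job.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "old").mkdir()
    (root / "new").mkdir()
    two_days_ago = time.time() - 48 * 3600
    os.utime(root / "old", (two_days_ago, two_days_ago))
    stale = [p.name for p in expired_jobs(root, ttl_hours=24)]

print(stale)  # ['old']
```

A real cleanup task would then remove each returned directory, for example with `shutil.rmtree`.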

🌍 Supported languages

Violin supports 33 target languages. The 16 below ship with handpicked native-speaker voices for each provider; the rest fall back to the English voice catalog (which is multilingual under both Cartesia Sonic 3 and ElevenLabs eleven_v3).

Ordered by native-speaker population.

Language	Cartesia native voice (M / F)	ElevenLabs native voice (M / F)
Chinese	chinese commercial man / chinese female conversational	Lin / Lingyue
Spanish	spanish narrator man / spanish narrator lady	Carlos / Valeria
English	tutorial man / helpful woman	Adam / Sarah
Hindi	hindi narrator man / hindi narrator woman	Yatin / Madhusmita
Arabic	middle eastern woman	Faris / Haneen
Portuguese	friendly brazilian man / pleasant brazilian lady	Medeiros / Luna
Russian	russian narrator man 1 / russian narrator woman	Ivo / Xenia
Japanese	japanese male conversational / japanese woman conversational	Shohei / Maiko
Turkish	turkish narrator man / turkish calm man	Sinan / Aura
German	german reporter man / german conversational woman	Daniel / Sina
Korean	korean narrator man / korean calm woman	Joon-ho / Soo
French	french narrator man / french narrator lady	Lior / Virginie
Italian	italian narrator man / italian narrator woman	Raffaele / Chiara
Polish	polish confident man / polish narrator woman	Gregor / Jola
Dutch	dutch confident man / dutch man	Ronald / Jolanda
Swedish	swedish narrator man / swedish calm lady	Andreas / Louise

The 17 fallback languages (using the English voice catalog), also ordered by native speakers: Vietnamese, Tamil, Indonesian, Malay, Ukrainian, Romanian, Thai, Greek, Hungarian, Catalan, Czech, Bulgarian, Danish, Slovak, Croatian, Finnish, Norwegian.
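The native-versus-fallback rule can be expressed as a short lookup. The sketch below is illustrative only: it copies a subset of the Cartesia voices from the table above, and it assumes that the male voice in the first column is the "primary" one, which the source does not state. The `pick_voice` name is hypothetical.

```python
# Subset of the Cartesia native voices from the table above (first-listed voice per language).
NATIVE_VOICES = {
    "Chinese": "chinese commercial man",
    "Spanish": "spanish narrator man",
    "English": "tutorial man",
    "French": "french narrator man",
}

# The 17 languages that fall back to the English voice catalog.
FALLBACK_LANGUAGES = {
    "Vietnamese", "Tamil", "Indonesian", "Malay", "Ukrainian", "Romanian",
    "Thai", "Greek", "Hungarian", "Catalan", "Czech", "Bulgarian",
    "Danish", "Slovak", "Croatian", "Finnish", "Norwegian",
}


def pick_voice(language: str) -> str:
    """Return a native voice if one is listed, else the English fallback voice."""
    if language in NATIVE_VOICES:
        return NATIVE_VOICES[language]
    if language in FALLBACK_LANGUAGES:
        return NATIVE_VOICES["English"]
    raise ValueError(f"unsupported or unlisted language: {language}")


print(pick_voice("French"))  # french narrator man
print(pick_voice("Thai"))    # tutorial man
```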

🤝 Contributing

PRs welcome. Got questions or hit a bug? Email the maintainer or open an issue.

⚠️ Disclaimer

This is a personal open-source project, not a Together AI product. Users are responsible for ensuring they have the right to download and translate any content they process. Designed for Creative Commons, public domain, your own recordings, and other content you have permission to use.

📜 License

MIT — use it freely, including commercially.

🙏 Acknowledgements

Built on top of Together AI, Whisper, Cartesia Sonic 3, ElevenLabs, FastAPI, and ffmpeg.
