@XAMTO_AI: If you don't bookmark this open-source tool now, you'll regret it later — automatic video dubbing and translation, supports 33 languages at once, and can even answer questions about video content. Found a gem on GitHub called Violin, fully open-source, what it does is a bit unbelievable: you drop a video in, it automatically recognizes speech, …
Summary
Violin is an open-source automatic video dubbing and translation tool that supports 33 languages, integrates models like Whisper and DeepSeek, and provides one-click speech recognition, translation, dubbing synthesis, and in-video Q&A functionality.
Cached at: 06/12/26, 08:58 AM
If you don’t bookmark this open-source tool now, you’ll definitely regret it later — automatic video dubbing and translation, supporting 33 languages at once, and you can even ask questions about the video content. I found a treasure on GitHub called Violin. It’s completely open-source, and what it does is pretty wild: you drop a video in, it automatically recognizes the speech, translates it, synthesizes voiceover in the target language, and seamlessly merges it back into the video — with perfect timing alignment, and it even outputs SRT subtitles on the side. The entire pipeline is fully automated — no manual intervention needed.
Under the hood, it runs Whisper Large v3 for speech recognition, DeepSeek V4 Pro for translation, Cartesia Sonic 3 for voice synthesis, and ffmpeg for pre/post processing. The whole pipeline is cleanly designed.
A few features I found interesting:
- Supports 33 target languages, with handpicked native-speaker voices for 16 of them (Cartesia Sonic 3 + ElevenLabs), sounding natural, not robotic.
- In-video Q&A: after dubbing, you can ask questions about any moment in the video, and it answers based on nearby subtitles and sampled frames — this design exceeded expectations.
- Natural-language voice selection: describe the voice style you want, and the LLM automatically picks from the voice library — no need to try them one by one.
- Six style presets: Standard, Kids, Academic, Casual, Storyteller, News — each preset adjusts speaking speed and emotion, ready to use.
- Pluggable architecture: transcription, translation, and TTS stages are interchangeable — Together, OpenAI, ElevenLabs can be mixed and matched, all configured via a single YAML file; even non-coders can use it.
GitHub: https://github.com/shang-zhu/violin…
Online demo: https://violin-ai.com
✨ Features
- 33 target languages with handpicked native-speaker voices for the 16 most-used ones (Cartesia Sonic 3 + ElevenLabs)
- In-video Q&A — ask questions about any moment in the dubbed video; answers use nearby subtitles plus sampled frames
- Natural-language voice picker — describe the voice you want, an LLM picks from the catalog
- 6 style profiles (experimental) — standard / kids / academic / casual / storyteller / news
- Pluggable stack — Together / OpenAI / ElevenLabs interchangeable for every stage, one YAML
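The in-video Q&A feature is described as answering from nearby subtitles plus sampled frames. A minimal Python sketch of that selection step follows. The `Cue` type, the 30-second window, and the frame count are illustrative assumptions, not Violin's actual code.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue (hypothetical type, for illustration only)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def nearby_cues(cues: list[Cue], at: float, window: float = 30.0) -> list[Cue]:
    """Return the cues that overlap the interval [at - window, at + window]."""
    lo, hi = at - window, at + window
    return [c for c in cues if c.end >= lo and c.start <= hi]


def frame_times(at: float, window: float = 30.0, count: int = 3) -> list[float]:
    """Evenly spaced timestamps around `at`, clamped to the start of the video."""
    if count == 1:
        return [max(0.0, at)]
    step = 2 * window / (count - 1)
    return [max(0.0, at - window + i * step) for i in range(count)]


cues = [Cue(0, 5, "intro"), Cue(40, 45, "topic"), Cue(200, 205, "outro")]
context = nearby_cues(cues, at=60)   # subtitle text to hand to the LLM
samples = frame_times(at=60)         # timestamps at which to grab frames
```

The selected cue text and the frames sampled at those timestamps would then be passed to a multimodal model together with the user's question.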
🚀 Quick start
Try it without installing anything
The live demo runs at https://violin-ai.com: drop a short clip in, get a dubbed video out in a few minutes.
Run locally
Requires Python 3.10+ and ffmpeg on PATH.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if you don't have it
uv tool install violin                            # recommended: faster, isolated
# or: pip install violin                          # if you'd rather install into your current Python env
export TOGETHER_API_KEY=…                         # get one at https://api.together.ai (add to ~/.zshrc to persist)
```
Three ways to use it:
1. CLI — translate one file:
```bash
violin lecture.mp4 lecture_zh.mp4 --language Chinese
```
2. Web app — full REST API + browser UI:
```bash
violin-api   # → http://127.0.0.1:8000      (browser UI)
             # → http://127.0.0.1:8000/docs (interactive API docs)
```
3. Claude Code skill — invoke from any Claude Code session:
```bash
violin --install-skill   # one-time: copies the skill into ~/.claude/skills/
claude
# then, inside the Claude Code session:
# > please use the violin skill to translate path/to/video.mp4 into Chinese
```
Run from source (for hacking on the pipeline)
```bash
git clone https://github.com/shang-zhu/violin.git
cd violin
uv sync
cp .env.example .env   # then fill in TOGETHER_API_KEY
uv run main.py lecture.mp4 lecture_zh.mp4 --language Chinese
```
To use the violin / violin-api commands globally while edits to your local source reflect immediately, install editable:
```bash
uv tool uninstall violin       # if you've installed the PyPI version
uv tool install --editable .
```
After this, violin / violin-api run from your local checkout — edit any file and the next invocation picks it up; no rebuild needed.
To switch back to PyPI: uv tool uninstall violin && uv tool install violin.
📝 To Do List
- [-] support voice cloning.
- [-] lip sync generation.
🎬 How Violin works
```
Video
  │
  ├─ ffmpeg ────────────────────────────► Extract audio (16 kHz WAV)
  │
  ├─ Whisper Large v3 ──────────────────► Word-level timestamps → sentence segments
  │
  ├─ LLM (DeepSeek V4 Pro by default) ──► Translate each segment, respecting style profile
  │
  ├─ TTS (Cartesia Sonic 3 by default) ─► Synthesize dubbed audio per segment
  │
  └─ ffmpeg ────────────────────────────► Speed-align video to dubbed audio,
                                          concat with freeze-frame fallback,
                                          single-pass AAC encode the audio track,
                                          write output mp4 + optional SRT
```
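To make the "word-level timestamps → sentence segments" step and the SRT output more concrete, here is a minimal Python sketch. The `Word` and `Segment` types, the punctuation rule, and the 0.8-second pause threshold are assumptions chosen for illustration; Violin's own segmentation logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    text: str
    start: float
    end: float


def group_into_sentences(words: list[Word], max_gap: float = 0.8) -> list[Segment]:
    """Close a segment at sentence-ending punctuation or at a long pause."""
    segments: list[Segment] = []
    current: list[Word] = []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        ends_sentence = word.text.endswith((".", "?", "!"))
        long_pause = not is_last and words[i + 1].start - word.end > max_gap
        if is_last or ends_sentence or long_pause:
            segments.append(Segment(
                text=" ".join(w.text for w in current),
                start=current[0].start,
                end=current[-1].end,
            ))
            current = []
    return segments


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used by SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{n}\n{srt_timestamp(s.start)} --> {srt_timestamp(s.end)}\n{s.text}\n"
        for n, s in enumerate(segments, start=1)
    ]
    return "\n".join(blocks)


words = [
    Word("Hello", 0.0, 0.4), Word("everyone.", 0.45, 1.0),
    Word("Welcome", 1.3, 1.7), Word("back", 1.75, 2.0),
    Word("today", 3.5, 3.9),
]
segments = group_into_sentences(words)
print(to_srt(segments))
```

Each resulting segment carries its own start and end time, which is what lets the later stages translate, synthesize, and align one segment at a time.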
⚙️ Configuration
Override any default by writing your own YAML and passing it with --config my.yaml — only the keys you want to change need to appear; values deep-merge with the built-in defaults (https://github.com/shang-zhu/violin/blob/main/config/default.yaml).
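The deep-merge behavior can be sketched in a few lines of Python. This is an illustration of the general technique, not Violin's implementation; the dictionaries stand in for parsed YAML, and their layout is assumed from the provider example in this README.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(defaults: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict with `override` layered on top of `defaults`.

    Nested mappings are merged key by key; any other value (scalar, list)
    in `override` replaces the default outright.
    """
    merged = deepcopy(dict(defaults))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for the built-in defaults (layout assumed for illustration).
DEFAULTS = {
    "models": {
        "translation": {"provider": "together", "model": "deepseek-ai/DeepSeek-V4-Pro"},
        "tts": {"provider": "together", "model": "cartesia/sonic-3"},
    }
}
# Equivalent of a my.yaml that only switches the TTS stage.
OVERRIDE = {"models": {"tts": {"provider": "elevenlabs", "model": "eleven_v3"}}}

config = deep_merge(DEFAULTS, OVERRIDE)
print(config["models"]["tts"])
print(config["models"]["translation"])
```

Only the `tts` keys change; the translation settings are inherited from the defaults, which is why an override file can stay short.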
Switch providers
```yaml
# config/default.yaml: pick the stack you want
models:
  transcription:
    provider: together                  # together | openai
    model: openai/whisper-large-v3      # together → openai/whisper-large-v3 | openai → whisper-1
  translation:
    provider: together                  # together | openai
    model: deepseek-ai/DeepSeek-V4-Pro  # together → deepseek-ai/DeepSeek-V4-Pro | openai → gpt-5.5
  tts:
    provider: together                  # together | elevenlabs | openai
    model: cartesia/sonic-3             # together → cartesia/sonic-3 | elevenlabs → eleven_v3 | openai → tts-1-hd
```
Production overrides
A starter config/prod.yaml is included for public deployments. It adds upload limits, serializes jobs, and caps ffmpeg concurrency.
The included Dockerfile + docker-compose.yml + Caddyfile are how the live demo is hosted — docker compose up -d --build after filling .env is enough to put a copy of Violin behind auto-HTTPS on any Docker host.
Environment variables
| Variable | When required | Description |
|---|---|---|
| `TOGETHER_API_KEY` | Recommended: covers every stage with the default config | Together AI API key |
| `OPENAI_API_KEY` | Any stage uses `provider: openai` | Covers whisper-1, GPT models, and tts-1 |
| `ELEVENLABS_API_KEY` | TTS uses `provider: elevenlabs` | ElevenLabs API key |
| `CORS_ORIGINS` | Optional | Comma-separated allowed origins (default: `*`) |
You only need keys for the providers you actually pick. Pure-OpenAI deployments (all stages on `openai`) work too: `OPENAI_API_KEY` alone is enough. The same applies to ElevenLabs.
🎭 Style profiles
Six built-in profiles tune both the translation LLM prompt and the TTS delivery. Use --style <name> on the CLI or pass style in API requests.
| Style | Tone | TTS speed | Emotion |
|---|---|---|---|
| `standard` | Faithful translation, natural voice | 1.0× | none |
| `kids` | Rewritten for a 7-year-old, plain language | 1.0× | excited |
| `academic` | Formal register, preserves jargon and honorifics | 0.95× | calm |
| `casual` | Spoken slang, contractions, friendly | 1.1× | content |
| `storyteller` | Vivid, dramatic narration | 0.9× | enthusiastic |
| `news` | Concise, declarative, broadcast-style | 1.0× | neutral |
Add your own by editing prompts/styles.yaml. See all available styles: violin --style list.
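As a rough illustration of how a style name could map to TTS settings, the table above can be expressed as data in Python. The `StyleProfile` class and the `tts_options` helper are hypothetical, and the real schema of prompts/styles.yaml is not shown here, so treat this as a sketch only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StyleProfile:
    tone: str                # guidance for the translation prompt
    speed: float             # TTS speed multiplier
    emotion: Optional[str]   # TTS emotion, or None when the table lists none


# Values copied from the style table; the structure itself is an assumption.
STYLES = {
    "standard": StyleProfile("Faithful translation, natural voice", 1.0, None),
    "kids": StyleProfile("Rewritten for a 7-year-old, plain language", 1.0, "excited"),
    "academic": StyleProfile("Formal register, preserves jargon and honorifics", 0.95, "calm"),
    "casual": StyleProfile("Spoken slang, contractions, friendly", 1.1, "content"),
    "storyteller": StyleProfile("Vivid, dramatic narration", 0.9, "enthusiastic"),
    "news": StyleProfile("Concise, declarative, broadcast-style", 1.0, "neutral"),
}


def tts_options(style: str = "standard") -> dict:
    """Return hypothetical TTS keyword arguments for a style name."""
    try:
        profile = STYLES[style]
    except KeyError:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(STYLES)}") from None
    options: dict = {"speed": profile.speed}
    if profile.emotion is not None:
        options["emotion"] = profile.emotion
    return options


print(tts_options("storyteller"))
```

Keeping the tone text and the delivery settings in one record mirrors the README's point that a profile tunes both the translation prompt and the TTS delivery.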
💻 CLI usage
Examples use the PyPI-installed `violin` command. If you're running from a git checkout, substitute `uv run main.py` for `violin` (and `uv run run_api.py` for `violin-api`).
```bash
# Basic
violin lecture.mp4 lecture_es.mp4 --language Spanish

# Pick a style
violin talk.mp4 talk_zh.mp4 --language Chinese --style kids

# Pick a specific voice
violin lecture.mp4 lecture_fr.mp4 --language French --voice "french narrator man"

# Skip SRT
violin lecture.mp4 lecture_ja.mp4 --language Japanese --no-subtitles

# Full replacement (no original audio underneath)
violin lecture.mp4 lecture_ko.mp4 --language Korean --no-voiceover

# Custom config (e.g. switch to OpenAI/ElevenLabs)
violin lecture.mp4 lecture_it.mp4 --language Italian --config config/other_api.yaml
```
CLI flags
| Flag | Default | Description |
|---|---|---|
| `--language` / `-l` | (required) | Target language name (e.g. Spanish, Japanese) |
| `--voice` / `-v` | auto | TTS voice. Defaults to the primary native voice for the target language |
| `--source-language` | auto-detect | Source language hint for translation |
| `--no-subtitles` | off | Skip SRT generation |
| `--voiceover` / `--no-voiceover` | voiceover on | Keep original audio underneath the dub, or full replacement |
| `--style` / `-s` | standard | Style profile name. Use `--style list` to see all |
| `--config` / `-c` | config/default.yaml | Path to a YAML override file |
| `--timings-out` | off | Write per-step wall-clock timings + cost as JSON |
🛰️ Web app & REST API
```bash
violin-api                              # default dev mode
violin-api --host 0.0.0.0 --port 8080   # bind everywhere
violin-api --config config/prod.yaml    # production overrides (requires a git checkout for config/prod.yaml)
```
Core flow: POST /jobs to start, GET /jobs/{id} to poll, GET /jobs/{id}/video and /srt to download, POST /jobs/{id}/chat for in-video Q&A. Full list with request/response schemas at /docs.
Example
```bash
# Submit
JOB=$(curl -s -X POST http://localhost:8000/jobs \
  -F "[email protected]" \
  -F "language=Spanish" \
  -F "style=academic" | jq -r .id)

# Poll
curl -s http://localhost:8000/jobs/$JOB | jq '{status, progress}'

# Download
curl -OJ http://localhost:8000/jobs/$JOB/video
curl -OJ http://localhost:8000/jobs/$JOB/srt
```
Job data lives under jobs/{id}/. Set api.job_ttl_hours to auto-delete jobs older than N hours (default 0 = disabled; config/prod.yaml uses 24h for the public demo).
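The TTL rule can be sketched as a small cleanup routine in Python. This is a hypothetical illustration that assumes job age is judged by the directory's modification time; Violin may track job age differently.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def expired_jobs(jobs_root: Path, ttl_hours: float, now: Optional[float] = None) -> list[Path]:
    """List job directories last modified more than `ttl_hours` ago.

    A TTL of 0 disables cleanup, matching the documented default.
    """
    if ttl_hours <= 0 or not jobs_root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - ttl_hours * 3600
    return sorted(
        p for p in jobs_root.iterdir()
        if p.is_dir() and p.stat().st_mtime < cutoff
    )


def purge_expired(jobs_root: Path, ttl_hours: float) -> int:
    """Delete expired job directories and return how many were removed."""
    stale = expired_jobs(jobs_root, ttl_hours)
    for job_dir in stale:
        shutil.rmtree(job_dir, ignore_errors=True)
    return len(stale)
```

With `ttl_hours=24`, as in the public demo's config, a periodic call such as `purge_expired(Path("jobs"), 24)` would remove any job directory untouched for more than a day.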
🌍 Supported languages
Violin supports 33 target languages. The 16 below ship with handpicked native-speaker voices for each provider; the rest fall back to the English voice catalog (which is multilingual under both Cartesia Sonic 3 and ElevenLabs eleven_v3). Ordered by native-speaker population.
| Language | Cartesia native voice (M / F) | ElevenLabs native voice (M / F) |
|---|---|---|
| Chinese | chinese commercial man / chinese female conversational | Lin / Lingyue |
| Spanish | spanish narrator man / spanish narrator lady | Carlos / Valeria |
| English | tutorial man / helpful woman | Adam / Sarah |
| Hindi | hindi narrator man / hindi narrator woman | Yatin / Madhusmita |
| Arabic | middle eastern woman | Faris / Haneen |
| Portuguese | friendly brazilian man / pleasant brazilian lady | Medeiros / Luna |
| Russian | russian narrator man 1 / russian narrator woman | Ivo / Xenia |
| Japanese | japanese male conversational / japanese woman conversational | Shohei / Maiko |
| Turkish | turkish narrator man / turkish calm man | Sinan / Aura |
| German | german reporter man / german conversational woman | Daniel / Sina |
| Korean | korean narrator man / korean calm woman | Joon-ho / Soo |
| French | french narrator man / french narrator lady | Lior / Virginie |
| Italian | italian narrator man / italian narrator woman | Raffaele / Chiara |
| Polish | polish confident man / polish narrator woman | Gregor / Jola |
| Dutch | dutch confident man / dutch man | Ronald / Jolanda |
| Swedish | swedish narrator man / swedish calm lady | Andreas / Louise |
The 17 fallback languages (using the English voice catalog), also ordered by native speakers: Vietnamese, Tamil, Indonesian, Malay, Ukrainian, Romanian, Thai, Greek, Hungarian, Catalan, Czech, Bulgarian, Danish, Slovak, Croatian, Finnish, Norwegian.
🤝 Contributing
PRs welcome. Got questions or hit a bug? Email the maintainer or open an issue.
⚠️ Disclaimer
This is a personal open-source project, not a Together AI product. Users are responsible for ensuring they have the right to download and translate any content they process. Designed for Creative Commons, public domain, your own recordings, and other content you have permission to use.
📜 License
MIT (https://github.com/shang-zhu/violin/blob/main/LICENSE) — use it freely, including commercially.
🙏 Acknowledgements
Built on top of Together AI (https://together.ai), Whisper (https://github.com/openai/whisper), Cartesia Sonic 3 (https://cartesia.ai), ElevenLabs (https://elevenlabs.io), FastAPI (https://fastapi.tiangolo.com/), and ffmpeg (https://ffmpeg.org).