@XAMTO_AI: If you don't bookmark this open-source tool now, you'll regret it later — automatic video dubbing and translation, supports 33 languages at once, and can even answer questions about video content. Found a gem on GitHub called Violin, fully open-source, what it does is a bit unbelievable: you drop a video in, it automatically recognizes speech, …


Summary

Violin is an open-source automatic video dubbing and translation tool that supports 33 languages, integrates models like Whisper and DeepSeek, and provides one-click speech recognition, translation, dubbing synthesis, and in-video Q&A functionality.

If you don't bookmark this open-source tool now, you'll regret it later: automatic video dubbing and translation, support for 33 languages, and even Q&A about the video content. I found a gem on GitHub called Violin, and it is fully open-source. What it does is a bit unbelievable: you drop a video in, and it automatically recognizes the speech, translates it, synthesizes dubbing in the target language, then mixes it back into the video with aligned timestamps, and also outputs SRT subtitles. The entire process is end-to-end; you don't have to touch anything manually.

Under the hood, it uses Whisper Large v3 for speech recognition, DeepSeek V4 Pro for translation, Cartesia Sonic 3 for dubbing synthesis, and ffmpeg for pre- and post-processing. The whole pipeline is cleanly designed.

There are a few features I find quite interesting:

  • Supports 33 target languages, 16 of which come with curated native-speaker voices from Cartesia Sonic 3 and ElevenLabs. The voices do not sound robotic.
  • In-video Q&A: after dubbing, you can ask questions about any moment in the video, and it answers based on nearby subtitles and sampled frames. This design exceeded my expectations.
  • Natural-language voice selection: you describe the style of voice you want, and the LLM picks from the voice library, so there is no need to try each one.
  • Six preset styles: Standard, Kids, Academic, Casual, Storyteller, News. Each preset has speed and emotion tuned, so you can use it directly.
  • Pluggable architecture: you can swap out the transcription, translation, and TTS stages. Together, OpenAI, and ElevenLabs can be combined arbitrarily, configured with a single YAML file. Even those who don't want to touch code can use it.

GitHub: https://github.com/shang-zhu/violin…
Online demo: https://violin-ai.com

Cached at: 06/12/26, 08:58 AM

If you don’t bookmark this open-source tool now, you’ll definitely regret it later — automatic video dubbing and translation, supporting 33 languages at once, and you can even ask questions about the video content. I found a treasure on GitHub called Violin. It’s completely open-source, and what it does is pretty wild: you drop a video in, it automatically recognizes the speech, translates it, synthesizes voiceover in the target language, and seamlessly merges it back into the video — with perfect timing alignment, and it even outputs SRT subtitles on the side. The entire pipeline is fully automated — no manual intervention needed.
Under the hood, it runs Whisper Large v3 for speech recognition, DeepSeek V4 Pro for translation, Cartesia Sonic 3 for voice synthesis, and ffmpeg for pre/post processing. The whole pipeline is cleanly designed.

A few features I found interesting:

  • Supports 33 target languages, with handpicked native-speaker voices for 16 of them (Cartesia Sonic 3 + ElevenLabs), sounding natural, not robotic.
  • In-video Q&A: after dubbing, you can ask questions about any moment in the video, and it answers based on nearby subtitles and sampled frames — this design exceeded expectations.
  • Natural-language voice selection: describe the voice style you want, and the LLM automatically picks from the voice library — no need to try them one by one.
  • Six style presets: Standard, Kids, Academic, Casual, Storyteller, News — each preset adjusts speaking speed and emotion, ready to use.
  • Pluggable architecture: transcription, translation, and TTS stages are interchangeable — Together, OpenAI, ElevenLabs can be mixed and matched, all configured via a single YAML file; even non-coders can use it.

GitHub: https://github.com/shang-zhu/violin…
Online demo: https://violin-ai.com

✨ Features

  • 33 target languages with handpicked native-speaker voices for the 16 most-used ones (Cartesia Sonic 3 + ElevenLabs)
  • In-video Q&A — ask questions about any moment in the dubbed video; answers use nearby subtitles plus sampled frames
  • Natural-language voice picker — describe the voice you want, an LLM picks from the catalog
  • 6 style profiles (experimental) — standard / kids / academic / casual / storyteller / news
  • Pluggable stack — Together / OpenAI / ElevenLabs interchangeable for every stage, one YAML

🚀 Quick start

Try it without installing anything

The live demo runs at https://violin-ai.com: drop a short clip in, get a dubbed video out in a few minutes.

Run locally

Requires Python 3.10+ and ffmpeg on PATH.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if you don't have it
uv tool install violin                            # recommended: faster, isolated

# or: pip install violin                          # if you'd rather install into your current Python env

export TOGETHER_API_KEY=…   # get one at https://api.together.ai (add to ~/.zshrc to persist)
```

Three ways to use it:

1. CLI — translate one file:
```bash
violin lecture.mp4 lecture_zh.mp4 --language Chinese
```

2. Web app — full REST API + browser UI:
```bash
violin-api   # → http://127.0.0.1:8000       (browser UI)
             # → http://127.0.0.1:8000/docs  (interactive API docs)
```

3. Claude Code skill — invoke from any Claude Code session:
```bash
violin --install-skill   # one-time: copies the skill into ~/.claude/skills/
claude

# please use the violin skill to translate path/to/video.mp4 into Chinese
```

Run from source (for hacking on the pipeline)
```bash
git clone https://github.com/shang-zhu/violin.git
cd violin
uv sync
cp .env.example .env   # then fill in TOGETHER_API_KEY
uv run main.py lecture.mp4 lecture_zh.mp4 --language Chinese
```

To use the violin / violin-api commands globally while edits to your local source reflect immediately, install editable:
```bash
uv tool uninstall violin       # if you've installed the PyPI version
uv tool install --editable .
```

After this, violin / violin-api run from your local checkout — edit any file and the next invocation picks it up; no rebuild needed.
To switch back to PyPI: uv tool uninstall violin && uv tool install violin.


📝 To Do List

  • [-] support voice cloning.
  • [-] lip sync generation.

🎬 How Violin works

Video
 │
 ├─ ffmpeg ─────────────────────────────► Extract audio (16 kHz WAV)
 │
 ├─ Whisper Large v3 ───────────────────► Word-level timestamps → sentence segments
 │
 ├─ LLM (DeepSeek V4 Pro by default) ───► Translate each segment, respecting style profile
 │
 ├─ TTS (Cartesia Sonic 3 by default) ──► Synthesize dubbed audio per segment
 │
 └─ ffmpeg ─────────────────────────────► Speed-align video to dubbed audio, concat with
                                          freeze-frame fallback, single-pass AAC encode
                                          the audio track, write output mp4 + optional SRT
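
The "word-level timestamps → sentence segments" step and the optional SRT output can be illustrated with a small sketch. This is not Violin's actual code: the `Word` and `Segment` shapes, the function names, and the rule of closing a segment at sentence-ending punctuation are all assumptions made for illustration. Only the SRT timestamp layout (`HH:MM:SS,mmm`) is standard.

```python
from dataclasses import dataclass


@dataclass
class Word:  # hypothetical shape of one transcribed word
    text: str
    start: float  # seconds
    end: float  # seconds


@dataclass
class Segment:  # hypothetical shape of one sentence segment
    text: str
    start: float
    end: float


SENTENCE_END = (".", "?", "!")


def _close(words):
    """Build a segment spanning the first to the last word."""
    return Segment(" ".join(w.text for w in words), words[0].start, words[-1].end)


def group_words(words):
    """Group words into sentence segments (assumed rule: split after . ? !)."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith(SENTENCE_END):
            segments.append(_close(current))
            current = []
    if current:  # trailing words without final punctuation
        segments.append(_close(current))
    return segments


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{span}\n{seg.text}\n")
    return "\n".join(cues)
```

In a real pipeline the translated text would replace `seg.text` before the SRT is written, and the segment boundaries would also drive the per-segment TTS calls.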


⚙️ Configuration

Override any default by writing your own YAML and passing it with --config my.yaml — only the keys you want to change need to appear; values deep-merge with the built-in defaults (https://github.com/shang-zhu/violin/blob/main/config/default.yaml).
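
To make "deep-merge" concrete, here is a minimal sketch of how an override that names only a few keys could be layered over the defaults. The `deep_merge` function is hypothetical and is not taken from Violin's source; the dictionaries stand in for parsed YAML and reuse provider and model names from this README.

```python
import copy


def deep_merge(defaults, override):
    """Return a new dict where override wins and nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into sub-tables
        else:
            merged[key] = copy.deepcopy(value)  # scalars and lists replace wholesale
    return merged


# Stand-in for the built-in defaults (already parsed from YAML).
DEFAULTS = {
    "models": {
        "translation": {"provider": "together", "model": "deepseek-ai/DeepSeek-V4-Pro"},
        "tts": {"provider": "together", "model": "cartesia/sonic-3"},
    }
}

# Stand-in for a my.yaml that only switches the TTS stage.
OVERRIDE = {"models": {"tts": {"provider": "elevenlabs", "model": "eleven_v3"}}}

config = deep_merge(DEFAULTS, OVERRIDE)
```

After the merge, the translation stage keeps its default values while the TTS stage uses the override, which is why an override file can stay short.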

Switch providers

```yaml
# config/default.yaml: pick the stack you want
models:
  transcription:
    provider: together                  # together | openai
    model: openai/whisper-large-v3      # together → openai/whisper-large-v3 | openai → whisper-1
  translation:
    provider: together                  # together | openai
    model: deepseek-ai/DeepSeek-V4-Pro  # together → deepseek-ai/DeepSeek-V4-Pro | openai → gpt-5.5
  tts:
    provider: together                  # together | elevenlabs | openai
    model: cartesia/sonic-3             # together → cartesia/sonic-3 | elevenlabs → eleven_v3 | openai → tts-1-hd
```

Production overrides

A starter config/prod.yaml is included for public deployments. It adds upload limits, serializes jobs, and caps ffmpeg concurrency.
The included Dockerfile + docker-compose.yml + Caddyfile are how the live demo is hosted — docker compose up -d --build after filling .env is enough to put a copy of Violin behind auto-HTTPS on any Docker host.

Environment variables

| Variable | When required | Description |
|---|---|---|
| TOGETHER_API_KEY | Recommended; covers every stage with the default config | Together AI API key |
| OPENAI_API_KEY | Any stage uses provider: openai | Covers whisper-1, GPT models, and tts-1 |
| ELEVENLABS_API_KEY | TTS uses provider: elevenlabs | ElevenLabs API key |
| CORS_ORIGINS | Optional | Comma-separated allowed origins (default: *) |

You only need keys for the providers you actually pick. Pure-OpenAI deployments (all stages on openai) work too — OPENAI_API_KEY alone is enough. Same idea for ElevenLabs.
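
The rule "you only need keys for the providers you actually pick" can be sketched as a small pre-flight check. The helper names below are hypothetical and the check is not part of Violin as far as this README shows; the provider names and environment variable names are the ones listed above.

```python
import os

# Provider name → environment variable, as listed in this README.
PROVIDER_KEYS = {
    "together": "TOGETHER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def required_keys(models):
    """Collect the API key names needed by the providers chosen per stage."""
    return sorted({PROVIDER_KEYS[stage["provider"]] for stage in models.values()})


def missing_keys(models, env=None):
    """Return required key names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required_keys(models) if not env.get(name)]


# Example: default stack for transcription and translation, ElevenLabs for TTS.
models = {
    "transcription": {"provider": "together"},
    "translation": {"provider": "together"},
    "tts": {"provider": "elevenlabs"},
}
needed = required_keys(models)
```

Running `missing_keys(models)` before starting a job would surface a forgotten key early instead of failing partway through the pipeline.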


🎭 Style profiles

Six built-in profiles tune both the translation LLM prompt and the TTS delivery. Use --style <name> on the CLI or pass style in API requests.

| Style | Tone | TTS speed | Emotion |
|---|---|---|---|
| standard | Faithful translation, natural voice | 1.0× | |
| kids | Rewritten for a 7-year-old, plain language | 1.0× | excited |
| academic | Formal register, preserves jargon and honorifics | 0.95× | calm |
| casual | Spoken slang, contractions, friendly | 1.1× | content |
| storyteller | Vivid, dramatic narration | 0.9× | enthusiastic |
| news | Concise, declarative, broadcast-style | 1.0× | neutral |

Add your own by editing prompts/styles.yaml. See all available styles: violin --style list.


💻 CLI usage

Examples use the PyPI-installed violin command. If you’re running from a git checkout, substitute uv run main.py for violin (and uv run run_api.py for violin-api).

```bash
# Basic
violin lecture.mp4 lecture_es.mp4 --language Spanish

# Pick a style
violin talk.mp4 talk_zh.mp4 --language Chinese --style kids

# Pick a specific voice
violin lecture.mp4 lecture_fr.mp4 --language French --voice "french narrator man"

# Skip SRT
violin lecture.mp4 lecture_ja.mp4 --language Japanese --no-subtitles

# Full replacement (no original audio underneath)
violin lecture.mp4 lecture_ko.mp4 --language Korean --no-voiceover

# Custom config (e.g. switch to OpenAI/ElevenLabs)
violin lecture.mp4 lecture_it.mp4 --language Italian --config config/other_api.yaml
```

CLI flags

| Flag | Default | Description |
|---|---|---|
| --language / -l | (required) | Target language name (e.g. Spanish, Japanese) |
| --voice / -v | auto | TTS voice. Defaults to the primary native voice for the target language |
| --source-language | auto-detect | Source language hint for translation |
| --no-subtitles | off | Skip SRT generation |
| --voiceover / --no-voiceover | voiceover on | Keep original audio underneath the dub, or full replacement |
| --style / -s | standard | Style profile name. Use --style list to see all |
| --config / -c | config/default.yaml | Path to a YAML override file |
| --timings-out | off | Write per-step wall-clock timings + cost as JSON |

🛰️ Web app & REST API

```bash
violin-api                              # default dev mode
violin-api --host 0.0.0.0 --port 8080   # bind everywhere
violin-api --config config/prod.yaml    # production overrides (requires a git checkout for config/prod.yaml)
```

Core flow: POST /jobs to start, GET /jobs/{id} to poll, GET /jobs/{id}/video and /srt to download, POST /jobs/{id}/chat for in-video Q&A. Full list with request/response schemas at /docs.

Example

```bash
# Submit
# (the upload field was garbled in this copy; "file=@lecture.mp4" is an assumption, check /docs for the real field name)
JOB=$(curl -s -X POST http://localhost:8000/jobs \
  -F "file=@lecture.mp4" \
  -F "language=Spanish" \
  -F "style=academic" | jq -r .id)

# Poll
curl -s http://localhost:8000/jobs/$JOB | jq '{status, progress}'

# Download
curl -OJ http://localhost:8000/jobs/$JOB/video
curl -OJ http://localhost:8000/jobs/$JOB/srt
```
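
The same poll-then-download flow can be sketched in Python using only the standard library. The endpoint paths come from the core flow described above; everything else is an assumption for illustration, in particular the helper names and the set of terminal `status` values, which this README does not list (check /docs for the real ones).

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl example

# Assumed terminal states; the README does not document the actual values.
DONE_STATES = ("done", "completed", "failed", "error")


def job_urls(job_id, base_url=BASE_URL):
    """Build the per-job endpoint URLs named in the core flow."""
    job = f"{base_url}/jobs/{job_id}"
    return {"status": job, "video": f"{job}/video", "srt": f"{job}/srt", "chat": f"{job}/chat"}


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_json, interval=2.0, max_polls=300, sleep=time.sleep):
    """Poll GET /jobs/{id} until the status looks terminal, then return the job."""
    url = job_urls(job_id)["status"]
    for _ in range(max_polls):
        job = fetch(url)
        if job.get("status") in DONE_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server; once `wait_for_job` returns, `urllib.request.urlretrieve(job_urls(job_id)["video"], "out.mp4")` would download the result.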

Job data lives under jobs/{id}/. Set api.job_ttl_hours to auto-delete jobs older than N hours (default 0 = disabled; config/prod.yaml uses 24h for the public demo).
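
As a rough illustration of what a TTL setting like `api.job_ttl_hours` implies, the sketch below lists job directories older than N hours. It is an assumption-laden sketch, not Violin's cleanup code: the `expired_jobs` name is hypothetical, and using each directory's modification time as the job's age is a guess at how age might be measured.

```python
import time
from pathlib import Path


def expired_jobs(jobs_dir, ttl_hours, now=None):
    """List job directories older than ttl_hours (0 or less disables cleanup)."""
    if ttl_hours <= 0:
        return []
    root = Path(jobs_dir)
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - ttl_hours * 3600
    # Assumption: a job's age is the mtime of its jobs/{id}/ directory.
    return sorted(p for p in root.iterdir() if p.is_dir() and p.stat().st_mtime < cutoff)
```

A periodic task could pass each returned path to `shutil.rmtree`; keeping the selection separate from the deletion makes the policy easy to check before anything is removed.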


🌍 Supported languages

Violin supports 33 target languages. The 16 below ship with handpicked native-speaker voices for each provider; the rest fall back to the English voice catalog (which is multilingual under both Cartesia Sonic 3 and ElevenLabs eleven_v3). Ordered by native-speaker population.

| Language | Cartesia native voice (M / F) | ElevenLabs native voice (M / F) |
|---|---|---|
| Chinese | chinese commercial man / chinese female conversational | Lin / Lingyue |
| Spanish | spanish narrator man / spanish narrator lady | Carlos / Valeria |
| English | tutorial man / helpful woman | Adam / Sarah |
| Hindi | hindi narrator man / hindi narrator woman | Yatin / Madhusmita |
| Arabic | middle eastern woman | Faris / Haneen |
| Portuguese | friendly brazilian man / pleasant brazilian lady | Medeiros / Luna |
| Russian | russian narrator man 1 / russian narrator woman | Ivo / Xenia |
| Japanese | japanese male conversational / japanese woman conversational | Shohei / Maiko |
| Turkish | turkish narrator man / turkish calm man | Sinan / Aura |
| German | german reporter man / german conversational woman | Daniel / Sina |
| Korean | korean narrator man / korean calm woman | Joon-ho / Soo |
| French | french narrator man / french narrator lady | Lior / Virginie |
| Italian | italian narrator man / italian narrator woman | Raffaele / Chiara |
| Polish | polish confident man / polish narrator woman | Gregor / Jola |
| Dutch | dutch confident man / dutch man | Ronald / Jolanda |
| Swedish | swedish narrator man / swedish calm lady | Andreas / Louise |

The 17 fallback languages (using the English voice catalog), also ordered by native speakers: Vietnamese, Tamil, Indonesian, Malay, Ukrainian, Romanian, Thai, Greek, Hungarian, Catalan, Czech, Bulgarian, Danish, Slovak, Croatian, Finnish, Norwegian.


🤝 Contributing

PRs welcome. Got questions or hit a bug? Email **** or open an issue.


⚠️ Disclaimer

This is a personal open-source project, not a Together AI product. Users are responsible for ensuring they have the right to download and translate any content they process. Designed for Creative Commons, public domain, your own recordings, and other content you have permission to use.


📜 License

MIT (https://github.com/shang-zhu/violin/blob/main/LICENSE) — use it freely, including commercially.


🙏 Acknowledgements

Built on top of Together AI (https://together.ai), Whisper (https://github.com/openai/whisper), Cartesia Sonic 3 (https://cartesia.ai), ElevenLabs (https://elevenlabs.io), FastAPI (https://fastapi.tiangolo.com/), and ffmpeg (https://ffmpeg.org).
