Show HN: CPU-only transcription for YouTube, TikTok, X, Instagram videos
Summary
yapsnap is a command-line tool for transcribing video/audio from various sources (YouTube, TikTok, etc.) to plain text using only CPU, no GPU or cloud required. It leverages sherpa-onnx and yt-dlp for offline, fast transcription.
View Cached Full Text
Cached at: 05/20/26, 11:30 PM
kouhxp/yapsnap
Source: https://github.com/kouhxp/yapsnap
yapsnap
Snap any video URL or audio file into plaintext. No GPU. No cloud. One command.
yapsnap "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
That’s it. You get a .txt next to your shell, transcribed on your CPU, in less time than it took the video to play.
Why yapsnap
- ⚡ Fast on CPU. Streaming Zipformer transducer (Kroko English) chews through audio at several times realtime on a laptop. No CUDA. No M-series-only tricks. Plain old cores.
- 🌐 Any video URL, plus local files. YouTube. X. TikTok. Instagram Reels. Direct
.mp4/.mp3links. Or just point it at a file on disk. yt-dlp handles the fetch, ffmpeg handles the decode, the rest is yours. - 📴 Offline after first run. ~80 MB model downloads once to your cache and stays there. No API keys. No quotas. Your audio never leaves your machine.
- 🪶 One file, three deps.
sherpa-onnx,numpy,yt-dlp. The whole tool is a single Python module. - ⏱ Sentence-level timestamps when you want them.
--timestampsadds[MM:SS]per sentence using Kroko’s built-in punctuation. Timing stays correct even when you transcribe at 2x.
Quickstart
# 1. ffmpeg on PATH (one-time, per OS — see below)
# 2. Install
pip install .
# 3. Snap something
yapsnap https://www.tiktok.com/@user/video/7234567890123456789
yapsnap meeting.mp4 --timestamps
yapsnap podcast.mp3 -o ~/notes/episode.txt
The first run downloads the model (~80 MB). Every run after is offline.
What it handles
Any URL yt-dlp understands works. The big ones:
| Source | Example |
|---|---|
| YouTube | https://www.youtube.com/watch?v=... |
| YouTube Shorts | https://www.youtube.com/shorts/... |
| X / Twitter | https://x.com/user/status/.../video/1 |
| TikTok | https://www.tiktok.com/@user/video/... |
| Instagram Reels | https://www.instagram.com/reel/.../ |
| Direct media URL | https://example.com/clip.mp4 |
Plus any local file ffmpeg can decode: .mp3, .mp4, .m4a, .wav, .webm, .mov, .mkv, .aac, .opus, .ogg, .flac, and friends.
Install
1. ffmpeg
| OS | Command |
|---|---|
| macOS | brew install ffmpeg |
| Linux | sudo apt install ffmpeg or sudo dnf install ffmpeg |
| Windows | winget install ffmpeg or choco install ffmpeg |
2. yapsnap
pip install .
Installs two equivalent commands on your PATH: yapsnap (canonical) and transcribe (alias, for when the name slips your mind).
Usage
# Local file
yapsnap path/to/audio.mp3
# Any video URL
yapsnap "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Sentence-level timestamps
yapsnap input.mp4 --timestamps
# Custom output path
yapsnap input.mp4 -o ./transcripts/talk.txt
# Don't speed audio up before transcribing (default is 1.5x, pitch preserved)
yapsnap input.mp4 --speed 1.0
# Keep the downloaded audio (URL inputs only)
yapsnap "https://..." --keep-audio
Output
Plaintext, UTF-8. Default location is ./transcripts/ (created if missing) under the current working directory; override with -o. For URL inputs the filename is derived from the video ID (dQw4w9WgXcQ_transcript.txt, etc.).
Without --timestamps — one paragraph of recognized text:
Welcome to the show. Today we're talking about transcription. Let's get started.
With --timestamps — one sentence per line, timed against the original audio:
[00:00] Welcome to the show.
[00:03] Today we're talking about transcription.
[00:08] Let's get started.
Timestamps stay in original-audio time even at --speed 1.5 or higher.
Flags
| Flag | Description |
|---|---|
-o, --output | Output .txt path. Default: ./transcripts/<input>_transcript.txt. |
--timestamps | Emit [MM:SS] sentence. lines instead of a single paragraph. |
--speed | Pre-transcription speedup factor, pitch preserved. Default 1.5. |
--keep-audio | Keep the downloaded audio (URL inputs only). |
--model | Override the model directory. Also reads KROKO_MODEL env var. |
How it works
- Fetch. If the input is a URL,
yt-dlpgrabs the best audio-only stream to a temp directory. If it’s a local path, this step is skipped. - Decode.
ffmpegpipes the media into 16 kHz mono PCM. The optionalatempofilter speeds it up without raising pitch. - Recognize. A streaming Zipformer2 transducer (Kroko English, INT8 ONNX, ~80 MB) eats the PCM in chunks. CPU-only. Greedy decode.
- Format. Plain text by default. With
--timestamps, token timestamps are grouped on.!?into sentences and scaled back to original-audio time.
No frame is sent anywhere. No state is kept between runs except the cached model.
Model & cache
The default Kroko English model is downloaded on first run to:
- macOS —
~/Library/Caches/yapsnap/ - Linux —
$XDG_CACHE_HOME/yapsnap/(or~/.cache/yapsnap/) - Windows —
%LOCALAPPDATA%\yapsnap\
To use a different streaming transducer (other languages, larger Kroko variants, etc.), point --model at a directory containing encoder(.int8).onnx, decoder(.int8).onnx, joiner(.int8).onnx, and tokens.txt. Or set KROKO_MODEL in your environment.
Notes & limits
- The default model is English-only. For other languages, supply a matching sherpa-onnx streaming transducer via
--model. --speed 1.5shaves about a third off transcription time with minimal accuracy cost. Try2.0if you want it even faster, or1.0for noisy, mumbled, or fast-speech sources.- Some social-media URLs are geo-locked or login-walled;
yt-dlpwill say so explicitly. - This is a streaming model, so timestamps come from token positions in the recognized stream. They’re accurate enough for navigation, not for subtitling-grade alignment.
License
Apache-2.0 for this project. The Kroko model is distributed under its own license — see https://huggingface.co/Banafo/Kroko-ASR. Powered by sherpa-onnx and yt-dlp.
Similar Articles
Show HN: VidStudio, a browser based video editor that doesn't upload your files
VidStudio is a free browser-based video editor that processes files locally without uploading, offering resize, trim, compress, watermark, subtitle and multi-track editing tools.
Show HN: OpenBrief – Local-first video downloader/summarizer
OpenBrief is an open-source desktop app that lets users download videos, transcribe audio, generate grounded summaries, and chat with media content, all running locally on their machine.
Show HN: TikTok but for Scientific Papers
Papel is a new research-focused social platform that leverages AI-powered vector search and on-device RAG to help researchers discover, discuss, and quiz themselves on academic papers. It offers personalized feeds, local AI chat via Apple Intelligence or MLX, and gamified learning features.
yt-dlp/yt-dlp
yt-dlp is a feature-rich command-line audio/video downloader supporting thousands of sites, forked from youtube-dl.
@noahduck283: A tool that can download any YouTube video, cleanly remove vocals, transcribe, translate into 100+ languages, clone the original voice, and perform fully automatic dubbing. It takes less than 2 minutes. 100% runs locally. Free. Sews six top open-source models into a web page for "one-click download, vocal removal, transcription, translation, dubbing"...
Voice-Pro is a web tool that integrates six top open-source models (Whisper, Demucs, CosyVoice, F5-TTS, etc.), supporting YouTube video downloading, vocal removal, transcription, translation, voice cloning, and fully automatic dubbing. It takes less than 2 minutes, runs 100% locally, and is free.