@NFTCPS: Here’s a macOS terminal tool for anyone who struggles to follow English in meetings: microphone or meeting audio in, real-time captions plus Chinese translation out, no network needed, privacy lovers rejoice. It runs locally on the Apple Silicon GPU, with the English transcription fed directly to Hunyuan MT for Chinese translation. It can also distinguish who’s speaking…
Summary
Introduces livecaption, a command-line tool for real-time English transcription and Chinese translation on macOS. It runs locally on the Apple Silicon GPU, supports speaker diarization and two-pass correction, and needs no internet connection, so audio never leaves the machine.
Cached at: 06/17/26, 01:58 PM
Here’s a macOS terminal powerhouse for anyone who struggles with English in meetings, especially non-native speakers:
Microphone or meeting audio in, real-time captions plus Chinese translation out. Runs entirely offline – privacy lovers rejoice.
Leverages Apple Silicon GPU locally: English transcription is fed directly to Hunyuan MT for Chinese translation. Supports speaker diarization (labels S1, S2), records both microphone and speaker output simultaneously during meetings or livestreams. Two‑pass correction: low‑latency captions appear first, then the full sentence is reprocessed for error correction – a seamless experience.
The initial model download is around 3GB. Output can go to the terminal or be saved as Markdown. English learners, take note.
https://github.com/six-ddc/livecaption…
six-ddc/livecaption
Source: https://github.com/six-ddc/livecaption
livecaption
A command‑line tool for macOS that performs local real‑time English transcription + Chinese translation. No GUI – output goes to the terminal or a text file.
livecaption demo: real‑time transcription, S1/S2 speaker diarization (color‑coded), two‑pass correction (red strikethrough for discarded words, green for corrected ones), and per‑sentence Chinese translation
- ASR: Runs NVIDIA nemotron-3.5-asr-streaming-0.6b via mlx-audio (cache‑aware true streaming transducer) on Apple Silicon GPU / MLX; endpoint detection uses Silero VAD (MLX version).
- Speaker Diarization (enabled by default, disable with --no-diarize): NVIDIA Sortformer v2.1 streaming model, supports up to 4 speakers; sentences are split per speaker, labeled S1/S2/…, each independently translated, and speaker IDs remain stable throughout.
- Translation: Runs Hy-MT2-1.8B-8bit (Tencent Hunyuan MT Gen3) via mlx-lm on Apple Silicon GPU.
- Audio Sources: Microphone, system audio (Zoom/Teams/browser speaker output), or both as independent stereo streams.
Both ASR and translation run on the Apple GPU (unified memory). VAD blocks silence from entering the encoder, so GPU is only consumed when someone is speaking. Translation only processes finalized ASR sentences and does not back‑pressure the audio pipeline. Finalized sentences go through two‑pass correction: the real‑time caption uses low‑latency streaming decoding, and when the sentence ends, the entire sentence is re‑decoded with maximum look‑ahead. The final caption and translation both use the more accurate re‑decoded result.
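The "does not back‑pressure the audio pipeline" property can be illustrated with a small sketch. It assumes translation runs on a background thread fed by an unbounded queue, so handing over a finalized sentence never blocks the audio loop. The names `start_translation_worker`, `translate`, and `on_result` are hypothetical and are not taken from the livecaption source.

```python
import queue
import threading


def start_translation_worker(translate, on_result):
    """Run `translate` on a background thread fed by an unbounded queue.

    Hypothetical helper for illustration; not livecaption's actual code.
    """
    sentences = queue.Queue()  # unbounded, so put() returns immediately

    def worker():
        while True:
            item = sentences.get()
            if item is None:  # sentinel: stop the worker
                break
            speaker, text = item
            on_result(speaker, text, translate(text))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return sentences, thread


def fake_translate(text):
    # Stand-in for the real translation model call.
    return f"<zh>{text}</zh>"


results = []
sentences, thread = start_translation_worker(
    fake_translate, lambda spk, en, zh: results.append((spk, en, zh))
)
# The audio/ASR loop would call put() once per finalized (re-decoded) sentence.
sentences.put(("S1", "Hello everyone."))
sentences.put(("S2", "Let's get started."))
sentences.put(None)
thread.join()
```

With this shape, a slow translation only delays the translated line for that sentence. Captions keep flowing because the producer side never waits on the consumer.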
Prerequisites
- macOS 14.2+ (system audio capture uses a Core Audio process tap; on macOS Tahoe, 26.1 or later is recommended)
- Apple Silicon
- uv (https://docs.astral.sh/uv/)
- Only needed for system/both audio sources: Swift 5.9+ (to compile audiotee)
Installation
uv sync
# Only compile audiotee if you need to capture meeting/system audio:
bash scripts/build_audiotee.sh
The first run will automatically download from Hugging Face: ASR (~1.2GB), Silero VAD (tiny), Sortformer for diarization (~225MB, enabled by default, disable with --no-diarize), and the translation model (~2GB).
Usage
# Real‑time transcription + translation from microphone, output to terminal
uv run livecaption --source mic
# Transcribe meeting output (system audio), write to file
uv run livecaption --source system --out meeting.md
# Dual‑stream: transcribe both your microphone and the other party’s audio simultaneously
uv run livecaption --source both --out meeting.md
# Transcribe an audio file (for meeting review / end‑to‑end testing; wav/mp3/m4a, auto‑resampled, exits after completion)
uv run livecaption --source file --file recording.m4a --out recording.md
# Translation includes context from the previous 3 sentences (improves pronoun/term coherence); disable or adjust with:
uv run livecaption --source system --context 0
# Transcription only, no translation
uv run livecaption --source mic --no-translate
# Translate to Japanese, use a larger translation model
uv run livecaption --target-lang ja-jp --mt-model mlx-community/Hy-MT2-7B-4bit
# Non‑English meeting: specify source language (40 locales; if wrong, lists all valid values)
uv run livecaption --asr-lang de-de --target-lang zh-cn
# Capture only a specific app (first find its PID via `ps`/Activity Monitor)
uv run livecaption --source system --include-pid 12345
# List microphone devices
uv run livecaption --list-devices
# Theme: default is auto (detects terminal background brightness; falls back to high‑contrast if detection fails);
# explicitly specify light or dark if the default is hard to read:
uv run livecaption --theme light
# Monitor MLX unified memory usage (displayed as active/cache/peak in the terminal status bar; off by default, for diagnostics)
uv run livecaption --source mic --mem
In the terminal: the bottom gray line shows real‑time interim results; finalized sentences scroll upward as original text, followed by the translation (bold or colored depending on the theme). Speakers S1–S4 are each shown in a distinct color for easy identification.
If colors are unclear: the default --theme auto tries to detect background brightness from COLORFGBG, but most macOS terminals (Terminal/iTerm/VS Code) do not set this variable. When detection fails, it falls back to a safe scheme (“default foreground + bold”) that is readable on any background, but without translation‑specific colors. For colored translations, explicitly specify --theme light (light background, translation in deep cyan‑blue) or --theme dark (dark background, translation in bright cyan).
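The auto‑detection described above could look roughly like the following sketch. It assumes COLORFGBG holds semicolon‑separated color indices with the background last (for example `15;0`), and it classifies indices 0–6 and 8 as dark, a common heuristic rather than anything confirmed by the project. The function name `detect_theme` and the return value `"fallback"` are hypothetical.

```python
import os


def detect_theme(environ=None):
    """Guess 'light' or 'dark' from COLORFGBG; return 'fallback' if unknown.

    Hypothetical sketch; livecaption's real detection logic may differ.
    """
    env = os.environ if environ is None else environ
    parts = env.get("COLORFGBG", "").split(";")
    if len(parts) < 2:
        return "fallback"  # variable unset or malformed
    try:
        bg = int(parts[-1])  # last field is the background color index
    except ValueError:
        return "fallback"  # e.g. "default" instead of a number
    if bg == 7 or 9 <= bg <= 15:
        return "light"
    if 0 <= bg <= 8:
        return "dark"
    return "fallback"
```

Because most macOS terminals leave the variable unset, a function like this would return `"fallback"` in the common case, which matches the behavior the README describes.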
Permissions
- Microphone: First run triggers a system prompt – the permission is granted to your terminal app.
- System Audio (important): audiotee is a raw binary started as a Python subprocess, so macOS often does not display the authorization prompt. If you run --source system, see “Listening” but get no transcription at all (after ~8 seconds the program prints a silence warning), it is almost certainly because “Screen & System Audio Recording” permission is missing – Core Audio silently returns silence without error or prompt. To grant manually:
  - Open System Settings > Privacy & Security > Screen & System Audio Recording
  - macOS 15 (Sequoia) and later have two sub‑sections here: scroll down to the “System Audio Recording Only” sub‑section – not the “Screen & System Audio Recording” sub‑section at the top – and add your terminal app (Terminal / iTerm / VS Code, etc.) with the toggle enabled. If it’s not there, click + to add it manually (e.g. /System/Applications/Utilities/Terminal.app). audiotee only does audio tap, no screen capture, so adding to the wrong sub‑section still results in silence. (macOS 14 has only one list – no such distinction.)
  - Fully quit and restart the terminal app (TCC permission changes require a process restart), then try again.
  - If still not working, try running from macOS’s built‑in Terminal.app (not iTerm/VS Code) first – it more easily triggers the authorization prompt.
This “two sub‑sections” detail is not even in audiotee’s own README – it was only mentioned by the author in audiotee#7 (https://github.com/makeusabrew/audiotee/issues/7).
To verify if the permission is effective: play any sound and run uv run python scripts/diag_system_audio.py. Check if max |amplitude| > 0.
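The max |amplitude| check is easy to reason about: a blocked tap delivers buffers of exact zeros, so the peak absolute sample value stays at 0. The sketch below assumes the captured audio is raw 16‑bit signed PCM in native byte order. It is not the contents of scripts/diag_system_audio.py, and the function name is hypothetical.

```python
import array


def max_abs_amplitude(pcm_bytes):
    """Peak absolute sample value of raw 16-bit signed PCM (native byte order).

    Hypothetical helper; the project's diagnostic script may work differently.
    """
    usable = len(pcm_bytes) - (len(pcm_bytes) % 2)  # drop a trailing odd byte
    samples = array.array("h")
    samples.frombytes(pcm_bytes[:usable])
    return max((abs(s) for s in samples), default=0)


silence = bytes(3200)  # all-zero buffer, as a blocked tap would deliver
audible = array.array("h", [0, 100, -200, 50]).tobytes()
print(max_abs_amplitude(silence), max_abs_amplitude(audible))
```

A result of 0 while audio is clearly playing points to the missing permission, not to a quiet source.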
Design Rationale
nemotron-3.5-asr-streaming is the official multilingual successor to nemotron-speech-streaming-en (also cache‑aware true streaming, same 0.6B parameter budget) – it is not a sliding‑window approximation of an offline model. It runs via mlx-audio: MLX natively on Apple GPU (sherpa‑onnx is CPU‑only on Mac, CoreML does not fully cover streaming transducer operators). mlx-audio only provides a pull‑style interface; this project rewrote its streaming kernel as a push‑style real‑time stepper (asr.py). Endpoint detection is a semantic re‑implementation of Silero VAD following sherpa’s rule1/2/3. Default language is en-us (internally mapped to the model’s en-US key) to avoid auto‑detection jumping between languages. Language parameters accept both lowercase short codes and English language names, e.g., zh-cn / Chinese, ja-jp / Japanese, de-de / German.
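The language‑parameter handling described above can be sketched as a small normalizer. Only the en-us to en-US mapping is stated in the README. The other keys in `LOCALES`, the table itself, and the name `normalize_lang` are assumptions made for illustration.

```python
# Hypothetical subset of the locale table; the README mentions 40 locales.
# Only the en-us -> en-US mapping is confirmed; the other keys are assumed.
LOCALES = {
    "en-US": "English",
    "zh-CN": "Chinese",
    "ja-JP": "Japanese",
    "de-DE": "German",
}


def normalize_lang(value):
    """Map a lowercase short code or English language name to a model key."""
    wanted = value.strip().lower()
    for key, name in LOCALES.items():
        if wanted in (key.lower(), name.lower()):
            return key
    valid = ", ".join(key.lower() for key in LOCALES)
    raise ValueError(f"unknown language {value!r}; valid values: {valid}")
```

Raising an error that lists the valid values mirrors the behavior noted in the usage section, where a wrong locale prints all accepted ones.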
Known Risks & Workarounds
- If ASR streaming quality is unsatisfactory → increase ASR_ATT_CONTEXT (e.g. [56,13] gives best accuracy but refreshes every 1.12s, increasing partial latency).
- If ASR and translation compete for GPU and cause stuttering → use a smaller / lower‑precision translation model, disable speaker diarization with --no-diarize, or only transcribe with --no-translate so ASR has exclusive GPU access.
- If VRAM is tight → switch ASR to an 8‑bit quantized version: --asr-model mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit.
- If translation quality is insufficient → --mt-model mlx-community/Hy-MT2-7B-4bit (~4.2GB, more accurate).
- Occasional silence when tapping a specific app → tapping the entire system output is more stable (omit --include-pid).
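The 1.12s figure quoted for [56,13] is consistent with a simple calculation, sketched below. It assumes each encoder frame covers 80 ms and that a chunk spans the right context plus the current frame. The 80 ms frame duration is an assumption and is not stated in the README; the function name is hypothetical.

```python
FRAME_MS = 80  # assumed encoder frame duration; not stated in the README


def chunk_refresh_ms(att_context, frame_ms=FRAME_MS):
    """Interval between partial-caption refreshes for [left, right] context."""
    _left, right = att_context
    return (right + 1) * frame_ms  # look-ahead frames plus the current frame


print(chunk_refresh_ms([56, 13]))  # 1120 ms, i.e. the 1.12s quoted above
```

Under the same assumption, a smaller right context refreshes more often at the cost of less look‑ahead, which is the accuracy versus latency trade‑off the first bullet describes.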