@NFTCPS: Here’s a macOS terminal tool for anyone who has trouble following English in meetings: microphone or meeting audio in, real-time captions plus Chinese translation out. No network needed, so privacy lovers can rejoice. It runs locally on the Apple Silicon GPU, with the English transcription fed directly to Hunyuan MT for Chinese translation. It can also distinguish who’s speaking…

X AI KOLs Timeline 06/17/26, 07:27 AM Tools

macos real-time-transcription live-caption translation local-ai privacy apple-silicon

Summary

Introduces livecaption, a command-line tool for real-time English transcription and Chinese translation on macOS. It runs locally on the Apple Silicon GPU, supports speaker diarization and two-pass correction, and needs no internet connection, which preserves privacy.

Here’s a macOS terminal tool for anyone who has trouble following English in meetings: microphone or meeting audio in, real-time captions plus Chinese translation out. No network needed, so privacy lovers can rejoice. It runs locally on the Apple Silicon GPU, with the English transcription fed directly to Hunyuan MT for Chinese translation. It can also distinguish who’s speaking (labeled S1, S2) and capture both your mic and the speaker output during live meetings. Two-pass correction shows low-latency captions first, then re-decodes the full sentence to fix errors. The first run downloads 3GB+ of models, and output can go to the terminal or be saved as Markdown. If English is a struggle, grab it now. https://github.com/six-ddc/livecaption…

Cached at: 06/17/26, 01:58 PM

Here’s a macOS terminal powerhouse for anyone who struggles with English in meetings, especially non-native speakers:

Microphone or meeting audio in, real-time captions plus Chinese translation out. Runs entirely offline – privacy lovers rejoice.

Runs locally on the Apple Silicon GPU: English transcription is fed directly to Hunyuan MT for Chinese translation. Supports speaker diarization (labels S1, S2) and captures both microphone input and speaker output simultaneously during meetings or livestreams. Two‑pass correction: low‑latency captions appear first, then the full sentence is re‑decoded to correct errors.

The initial model download is around 3GB. Output can go to the terminal or be saved as Markdown. English learners, take note.

https://github.com/six-ddc/livecaption…

six-ddc/livecaption

Source: https://github.com/six-ddc/livecaption

livecaption

A command‑line tool for macOS that performs local real‑time English transcription + Chinese translation. No GUI – output goes to the terminal or a text file.

livecaption demo: real‑time transcription, S1/S2 speaker diarization (color‑coded), two‑pass correction (red strikethrough for discarded words, green for corrected ones), and per‑sentence Chinese translation

ASR: Runs NVIDIA nemotron-3.5-asr-streaming-0.6b via mlx-audio (cache‑aware true streaming transducer) on Apple Silicon GPU / MLX; endpoint detection uses Silero VAD (MLX version).
Speaker Diarization (enabled by default, disable with --no-diarize): NVIDIA Sortformer v2.1 streaming model, supports up to 4 speakers; sentences are split per speaker, labeled S1/S2/…, each independently translated, and speaker IDs remain stable throughout.
Translation: Runs Hy-MT2-1.8B-8bit (Tencent Hunyuan MT Gen3) via mlx-lm on Apple Silicon GPU.
Audio Sources: Microphone, system audio (Zoom/Teams/browser speaker output), or both as independent stereo streams.
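
As a rough illustration of "sentences are split per speaker", the sketch below groups timestamped words by whichever diarization segment covers them. The `Word` and `Segment` shapes and the `split_by_speaker` function are hypothetical and are not livecaption's or Sortformer's actual API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    speaker: int  # 0-based speaker index reported by the diarizer
    start: float
    end: float


def speaker_at(t, segments):
    """Return the speaker index whose segment covers time t, or None."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.speaker
    return None


def split_by_speaker(words, segments):
    """Group consecutive words from the same speaker into (label, text) runs."""
    runs = []
    for w in words:
        midpoint = (w.start + w.end) / 2
        spk = speaker_at(midpoint, segments)
        label = f"S{spk + 1}" if spk is not None else "S?"
        if runs and runs[-1][0] == label:
            runs[-1][1].append(w.text)
        else:
            runs.append((label, [w.text]))
    return [(label, " ".join(texts)) for label, texts in runs]
```

Each resulting run could then be translated on its own, which matches the "each independently translated" behavior described above.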

Both ASR and translation run on the Apple GPU (unified memory). VAD blocks silence from entering the encoder, so GPU is only consumed when someone is speaking. Translation only processes finalized ASR sentences and does not back‑pressure the audio pipeline. Finalized sentences go through two‑pass correction: the real‑time caption uses low‑latency streaming decoding, and when the sentence ends, the entire sentence is re‑decoded with maximum look‑ahead. The final caption and translation both use the more accurate re‑decoded result.
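
To make the two‑pass correction concrete, here is a minimal sketch of how a low‑latency caption could be compared with its re‑decoded version to decide which words to strike through and which to show as corrections. It uses Python's standard `difflib`; the function name `two_pass_diff` and the `keep`/`discard`/`correct` tags are assumptions for illustration, not the project's own code.

```python
import difflib


def two_pass_diff(streamed: str, redecoded: str):
    """Return (tag, words) operations turning the streamed caption into the re-decoded one."""
    a, b = streamed.split(), redecoded.split()
    ops = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("keep", a[i1:i2]))
            continue
        if i2 > i1:
            ops.append(("discard", a[i1:i2]))   # words only in the first pass
        if j2 > j1:
            ops.append(("correct", b[j1:j2]))   # words only in the second pass
    return ops
```

A renderer could then print `discard` words with strikethrough and `correct` words in a highlight color, while the translation step only ever sees the re‑decoded string.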

Prerequisites

macOS 14.2+ (system audio capture uses Core Audio process taps; on Tahoe, 26.1 or later is recommended)
Apple Silicon
uv (https://docs.astral.sh/uv/)
Swift 5.9+ (to compile audiotee), needed only for the system and both audio sources

Installation

uv sync
# Only compile audiotee if you need to capture meeting/system audio:
bash scripts/build_audiotee.sh

The first run will automatically download from Hugging Face: ASR (~1.2GB), Silero VAD (tiny), Sortformer for diarization (~225MB, enabled by default, disable with --no-diarize), and the translation model (~2GB).

Usage

# Real‑time transcription + translation from microphone, output to terminal
uv run livecaption --source mic

# Transcribe meeting output (system audio), write to file
uv run livecaption --source system --out meeting.md

# Dual‑stream: transcribe both your microphone and the other party’s audio simultaneously
uv run livecaption --source both --out meeting.md

# Transcribe an audio file (for meeting review / end‑to‑end testing; wav/mp3/m4a, auto‑resampled, exits after completion)
uv run livecaption --source file --file recording.m4a --out recording.md

# By default, translation includes the previous 3 sentences as context (improves pronoun/term coherence); set --context 0 to disable it, or another number to adjust it:
uv run livecaption --source system --context 0

# Transcription only, no translation
uv run livecaption --source mic --no-translate

# Translate to Japanese, use a larger translation model
uv run livecaption --target-lang ja-jp --mt-model mlx-community/Hy-MT2-7B-4bit

# Non‑English meeting: specify the source language (40 locales; an invalid value prints all valid ones)
uv run livecaption --asr-lang de-de --target-lang zh-cn

# Capture only a specific app (first find its PID via `ps`/Activity Monitor)
uv run livecaption --source system --include-pid 12345

# List microphone devices
uv run livecaption --list-devices

# Theme: default is auto (detects terminal background brightness; falls back to high‑contrast if detection fails);
# explicitly specify light or dark if the default is hard to read:
uv run livecaption --theme light

# Monitor MLX unified memory usage (displayed as active/cache/peak in the terminal status bar; off by default, for diagnostics)
uv run livecaption --source mic --mem
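
One way to picture the `--context` option is a small rolling window of previously finalized sentences that is attached to each translation request. The sketch below is only an assumption about the general idea; `ContextWindow` and the payload keys are hypothetical, and the actual prompt format sent to the translation model is not described here.

```python
from collections import deque


class ContextWindow:
    """Keep the last `size` finalized source sentences for the translator."""

    def __init__(self, size: int = 3):
        self.size = size
        # maxlen=0 is valid and yields an always-empty deque (context disabled).
        self._history = deque(maxlen=size)

    def request(self, sentence: str) -> dict:
        """Return the sentence plus its prior context, then remember the sentence."""
        payload = {"context": list(self._history), "text": sentence}
        self._history.append(sentence)
        return payload
```

With `size=0` every request carries an empty context, which corresponds to the behavior of `--context 0`.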

In the terminal: the bottom gray line shows real‑time interim results; finalized sentences scroll upward as original text, followed by the translation (bold or colored depending on the theme). Speakers S1–S4 are each shown in a distinct color for easy identification.

If colors are unclear: the default --theme auto tries to detect background brightness from COLORFGBG, but most macOS terminals (Terminal/iTerm/VS Code) do not set this variable. When detection fails, it falls back to a safe scheme (“default foreground + bold”) that is readable on any background, but without translation‑specific colors. For colored translations, explicitly specify --theme light (light background, translation in deep cyan‑blue) or --theme dark (dark background, translation in bright cyan).
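
The detection‑with‑fallback logic can be sketched as follows. `COLORFGBG` is typically a semicolon‑separated string whose last field is the background color index; the dark/light split used here (0 to 6 and 8 are dark, 7 and 9 to 15 are light) is a common heuristic and an assumption, as are the function name `detect_theme` and the `"safe"` return value. The tool's real implementation may differ.

```python
import os


def detect_theme(env=None) -> str:
    """Guess 'light' or 'dark' from COLORFGBG; return 'safe' if undetectable."""
    env = os.environ if env is None else env
    parts = env.get("COLORFGBG", "").split(";")
    if len(parts) < 2:
        return "safe"  # variable unset or malformed
    try:
        bg = int(parts[-1])
    except ValueError:
        return "safe"  # e.g. "default" in the background slot
    # Heuristic: ANSI colors 0-6 and 8 are dark, 7 and 9-15 are light.
    if 0 <= bg <= 6 or bg == 8:
        return "dark"
    if bg == 7 or 9 <= bg <= 15:
        return "light"
    return "safe"
```

Passing a plain dict as `env` makes the function easy to try without touching the real environment, for example `detect_theme({"COLORFGBG": "15;0"})`.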

Permissions

Microphone: First run triggers a system prompt – the permission is granted to your terminal app.
System Audio (important): audiotee is a raw binary started as a Python subprocess, so macOS often does not display the authorization prompt. If you run --source system, see “Listening” but get no transcription at all (after ~8 seconds the program prints a silence warning), it is almost certainly because “Screen & System Audio Recording” permission is missing – Core Audio silently returns silence without error or prompt. To grant manually:
1. Open System Settings > Privacy & Security > Screen & System Audio Recording
2. macOS 15 (Sequoia) and later have two sub‑sections here: scroll down to the “System Audio Recording Only” sub‑section – not the “Screen & System Audio Recording” sub‑section at the top – and add your terminal app (Terminal / iTerm / VS Code, etc.) with the toggle enabled. If it’s not there, click + to add it manually (e.g. /System/Applications/Utilities/Terminal.app). audiotee only does audio tap, no screen capture, so adding to the wrong sub‑section still results in silence. (macOS 14 has only one list – no such distinction.)
3. Fully quit and restart the terminal app (TCC permission changes require a process restart), then try again.
4. If still not working, try running from macOS’s built‑in Terminal.app (not iTerm/VS Code) first – it more easily triggers the authorization prompt.
This “two sub‑sections” detail is not even in audiotee’s own README – it was only mentioned by the author in audiotee#7 (https://github.com/makeusabrew/audiotee/issues/7).

To verify that the permission has taken effect, play any sound and run uv run python scripts/diag_system_audio.py, then check that max |amplitude| > 0.
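
The check itself is simple: if every captured sample is zero, Core Audio is handing back silence. The sketch below shows the idea for a buffer of 16‑bit signed PCM in native byte order. That sample format is an assumption, and `max_abs_amplitude` is a hypothetical helper, not the contents of scripts/diag_system_audio.py.

```python
import array


def max_abs_amplitude(pcm_bytes: bytes) -> int:
    """Peak absolute sample value of 16-bit signed PCM (native byte order)."""
    samples = array.array("h")
    usable = len(pcm_bytes) // 2 * 2  # ignore a trailing odd byte, if any
    samples.frombytes(pcm_bytes[:usable])
    return max((abs(s) for s in samples), default=0)
```

A result of 0 while audio is audibly playing points to the missing permission described above; any positive value means samples are getting through.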

Design Rationale

nemotron-3.5-asr-streaming is the official multilingual successor to nemotron-speech-streaming-en (also cache‑aware true streaming, with the same 0.6B parameter budget); it is not a sliding‑window approximation of an offline model. It runs via mlx-audio, which uses MLX natively on the Apple GPU (sherpa‑onnx is CPU‑only on Mac, and CoreML does not fully cover streaming transducer operators). mlx-audio only provides a pull‑style interface, so this project rewrote its streaming kernel as a push‑style real‑time stepper (asr.py). Endpoint detection combines Silero VAD with a re‑implementation of the semantics of sherpa’s rule1/2/3. The default language is en-us (internally mapped to the model’s en-US key) to keep auto‑detection from jumping between languages. Language parameters accept both lowercase short codes and English language names, e.g., zh-cn / Chinese, ja-jp / Japanese, de-de / German.
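
The difference between the two interface styles: with a pull‑style API the decoder iterates over an audio source it controls, whereas a push‑style stepper lets the capture callback hand over samples whenever they arrive. A minimal, generic sketch of the push side follows; `PushStepper` and `step_fn` are hypothetical names, and this is not the code in asr.py.

```python
class PushStepper:
    """Buffer pushed samples and run `step_fn` once per full chunk."""

    def __init__(self, chunk_size: int, step_fn):
        self.chunk_size = chunk_size
        self.step_fn = step_fn  # hypothetical: consumes one chunk, returns a partial result
        self._buffer = []

    def push(self, samples):
        """Accept any number of samples; return one output per completed chunk."""
        self._buffer.extend(samples)
        outputs = []
        while len(self._buffer) >= self.chunk_size:
            chunk = self._buffer[: self.chunk_size]
            del self._buffer[: self.chunk_size]
            outputs.append(self.step_fn(chunk))
        return outputs
```

Because leftover samples stay in the buffer between calls, the caller can push blocks of any size without aligning them to the model's chunk length.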

Known Risks & Workarounds

If ASR streaming quality is unsatisfactory → increase ASR_ATT_CONTEXT (e.g. [56,13] gives best accuracy but refreshes every 1.12s, increasing partial latency).
If ASR and translation compete for GPU and cause stuttering → use a smaller / lower‑precision translation model, disable speaker diarization with --no-diarize, or only transcribe with --no-translate so ASR has exclusive GPU access.
If unified memory is tight → switch ASR to an 8‑bit quantized version: --asr-model mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit.
If translation quality is insufficient → --mt-model mlx-community/Hy-MT2-7B-4bit (~4.2GB, more accurate).
Occasional silence when tapping a specific app → tapping the entire system output is more stable (omit --include-pid).
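
The 1.12s refresh interval quoted for ASR_ATT_CONTEXT = [56,13] can be reproduced with a small calculation, under two assumptions that this README does not state: the encoder emits one frame per 80 ms of audio, and each streaming step covers the right context plus one frame. The helper `chunk_seconds` is hypothetical.

```python
FRAME_MS = 80  # assumed encoder frame duration after subsampling


def chunk_seconds(att_context, frame_ms=FRAME_MS):
    """Seconds of audio per streaming step for att_context = [left, right]."""
    _left, right = att_context
    return (right + 1) * frame_ms / 1000
```

Under these assumptions `chunk_seconds([56, 13])` gives (13 + 1) × 80 ms = 1.12 s, which matches the figure above, and a smaller right context shortens the interval at the cost of look‑ahead.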