Gemma 4 audio with MLX

Simon Willison's Blog Tools

Summary

A practical guide to audio transcription on macOS using the Gemma 4 E2B model with MLX and mlx-vlm, including a `uv run` recipe and a demonstration of the workflow.



# Gemma 4 audio with MLX

Source: https://simonwillison.net/2026/Apr/12/mlx-audio/

12th April 2026

Thanks to a tip from Rahim Nathwani (https://twitter.com/RahimNathwani/status/2039961945613209852), here's a `uv run` recipe for transcribing an audio file on macOS using the 10.28 GB Gemma 4 E2B model (https://huggingface.co/google/gemma-4-E2B) with MLX and mlx-vlm (https://github.com/Blaizzy/mlx-vlm):

```
uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
```

I tried it on this 14 second `.wav` file (https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav) and it output the following:

> This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.

(That was supposed to be "This right here..." and "... how well that works", but I can hear why it misinterpreted that as "front" and "how that works".)
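Before passing a recording to a recipe like the one above, it can help to confirm the `.wav` file is in a format speech models commonly expect (16 kHz, mono, 16-bit; check the model card for the exact requirements). Here is a minimal sketch using only Python's stdlib `wave` module; the `file.wav` name matches the command above, and the test-tone writer is just a stand-in for a real recording:

```python
import math
import struct
import wave

def write_test_tone(path, seconds=1, rate=16000):
    """Write a short 16 kHz mono 440 Hz tone, standing in for a real recording."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / rate)))
            for t in range(rate * seconds)
        )
        w.writeframes(frames)

def audio_info(path):
    """Report channel count, sample rate, and duration of a .wav file."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "duration_s": w.getnframes() / w.getframerate(),
        }

write_test_tone("file.wav")
print(audio_info("file.wav"))
```

If `audio_info` reports a different sample rate or stereo channels, a tool like `ffmpeg -i input.m4a -ar 16000 -ac 1 file.wav` can resample before transcription.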

Similar Articles

New Gemma 4 MTP on MLX?

Reddit r/LocalLLaMA

Google released Multi Token Prediction drafters for Gemma 4 to accelerate inference via speculative decoding, but support for MLX is currently unconfirmed or unavailable.

Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

Hugging Face Models Trending

SuperGemma4-26B-Uncensored-MLX-4bit-v2 is a fine-tuned and quantized variant of Google's Gemma 4 26B optimized for Apple Silicon, offering improved performance on code, reasoning, and tool-use tasks while maintaining faster inference speeds compared to the stock baseline.

Trials and tribulations fine-tuning & deploying Gemma-4 [P]

Reddit r/MachineLearning

An ML team documents practical challenges encountered while fine-tuning and deploying Gemma-4, including incompatibilities with PEFT, SFTTrainer, DeepSpeed ZeRO-3, and lack of runtime LoRA serving support, along with workarounds for each issue.

google/gemma-4-31B-it-assistant

Hugging Face Models Trending

Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.