Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Reddit r/LocalLLaMA Models

Summary

The user reports that the Gemma 4 12B unified audio model stops attending to speech when the system prompt is large (~21k tokens), and asks for workarounds or explanations, noting the issue persists across vLLM, llama.cpp, and LiteRT-LM backends.

I'm trying to use **Gemma 4 12B** — the new encoder-free unified model (audio/vision/text in one) — for a one-pass **audio → response** voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single model (TTS still happens afterward). Works great with a **minimal prompt** — the model clearly hears and responds to the audio. But once the **text prompt gets large/dense** (mine is \~21k tokens: detailed instructions + tool definitions), it basically **stops attending to the audio** — replies as if the audio weren't there (generic/hallucinated) or only weakly transcribes. Trim the prompt back down and audio attention returns. Same behavior across three stacks, so it doesn't look stack-specific: \- **vLLM** (gemma4-unified image + pip install av), audio as base64 audio\_url \- **llama.cpp** (--mmproj, input\_audio content, chat\_template\_kwargs {enable\_thinking:false}) \- **LiteRT-LM** (gemma4-12b,gpu) Feels like an inherent attention/saturation limit when audio competes with a long dense text context. (Notably, **E4B** with a tiny prompt keeps audio attention fine — so I'm using it as a small audio front-end instead.) Questions for anyone who's tried: 1. Has anyone gotten **12B unified audio to reliably attend to speech with a big system prompt** (lots of instructions/tools)? 2. Known limitation of the unified arch, or a serving/config thing (audio placement in the sequence, attention settings, chat template, sampling)? 3. Workarounds — audio-first vs audio-last ordering, prompt structuring, attention/RoPE tweaks? Served on an NVIDIA GB10 (Blackwell).
Original Article

Similar Articles

Gemma 4 audio with MLX

Simon Willison's Blog

A practical guide for audio transcription on macOS using Gemma 4 E2B model with MLX and mlx-vlm, including a uv run recipe and demonstration of the workflow.