Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Reddit r/LocalLLaMA 06/10/26, 06:51 AM Models

gemma-4 audio-attention system-prompt large-context vllm llamacpp litert-lm

Summary

The user reports that the Gemma 4 12B unified audio model stops attending to speech when the system prompt is large (~21k tokens), and asks for workarounds or explanations, noting the issue persists across vLLM, llama.cpp, and LiteRT-LM backends.

I'm trying to use **Gemma 4 12B** — the new encoder-free unified model (audio/vision/text in one) — for a one-pass **audio → response** voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single model (TTS still happens afterward). Works great with a **minimal prompt** — the model clearly hears and responds to the audio. But once the **text prompt gets large/dense** (mine is \~21k tokens: detailed instructions + tool definitions), it basically **stops attending to the audio** — replies as if the audio weren't there (generic/hallucinated) or only weakly transcribes. Trim the prompt back down and audio attention returns. Same behavior across three stacks, so it doesn't look stack-specific: \- **vLLM** (gemma4-unified image + pip install av), audio as base64 audio\_url \- **llama.cpp** (--mmproj, input\_audio content, chat\_template\_kwargs {enable\_thinking:false}) \- **LiteRT-LM** (gemma4-12b,gpu) Feels like an inherent attention/saturation limit when audio competes with a long dense text context. (Notably, **E4B** with a tiny prompt keeps audio attention fine — so I'm using it as a small audio front-end instead.) Questions for anyone who's tried: 1. Has anyone gotten **12B unified audio to reliably attend to speech with a big system prompt** (lots of instructions/tools)? 2. Known limitation of the unified arch, or a serving/config thing (audio placement in the sequence, attention settings, chat template, sampling)? 3. Workarounds — audio-first vs audio-last ordering, prompt structuring, attention/RoPE tweaks? Served on an NVIDIA GB10 (Blackwell).

Original Article

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Similar Articles

Gemma 4 12B native encoder free voice input utilization suggest?

Gemma 4 audio with MLX

Gemma 4 26b a4b is genuinely the best model I have tried for language learning and scientific queries!

@_philschmid: We just launched a Gemma 4 12B! Our first mid-sized model with native audio inputs. Gemma 4 12 B is a unified, encoder-…

Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze

Submit Feedback

Similar Articles

Gemma 4 12B native encoder free voice input utilization suggest?

Gemma 4 26b a4b is genuinely the best model I have tried for language learning and scientific queries!

@_philschmid: We just launched a Gemma 4 12B! Our first mid-sized model with native audio inputs. Gemma 4 12 B is a unified, encoder-…
We just launched Gemma 4 12B, a mid-sized multimodal model with native audio inputs, requiring only 16GB memory and released under Apache 2.0.

Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze