Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?
Summary
The poster reports that the Gemma 4 12B unified audio model stops attending to speech input when the system prompt is large (~21k tokens). The issue persists across the vLLM, llama.cpp, and LiteRT-LM backends, and the poster asks for workarounds or explanations.
Similar Articles
Gemma 4 12B native encoder free voice input utilization suggest?
Discusses leveraging Gemma 4 12B's encoder-free architecture for native voice input, and asks for out-of-the-box solutions for low-latency streaming audio ingestion.
Gemma 4 audio with MLX
A practical guide to audio transcription on macOS using the Gemma 4 E2B model with MLX and mlx-vlm, including a uv run recipe and a demonstration of the workflow.
Gemma 4 26b a4b is genuinely the best model I have tried for language learning and scientific queries!
User reports that Gemma 4 26b outperforms Qwen 3.5/3.6 for language learning and scientific queries, despite being behind in coding tasks, and invites discussion on other non-coding use cases for small MoE models.
@_philschmid: We just launched a Gemma 4 12B! Our first mid-sized model with native audio inputs. Gemma 4 12 B is a unified, encoder-…
Announces the launch of Gemma 4 12B, a mid-sized multimodal model with native audio inputs that requires only 16GB of memory and is released under Apache 2.0.
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze
The author reports that the Gemma 4 12b QAT model suffers from a regression in tool calling and coding tasks compared to the standard Q5_K_L version, due to a bug involving control token misconfiguration. Despite high token speed, the model's inconsistent outputs make it unsuitable for agent workflows.