Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp
Summary
An experimental Jinja template for Gemma4 31B in llama.cpp that improves stability for multi-turn tool calls by fixing common thinking tag issues. Community feedback is welcome, but this is not recommended by Google.
Similar Articles
Gemma 4 Chat Template now has preserve thinking
Google's Gemma 4 31B IT model now has a chat template fix that preserves thinking and improves null handling, reasoning preservation, and input validation.
PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template
Gemma 4 12B has a known issue with tool calling and coding, but using a custom chat template in llama.cpp resolves the bugs. Users should compile llama.cpp from source and apply the fix before evaluating the model's coding ability.
[WIP] Gemma 4 MTP
llama.cpp is an open-source C/C++ library for efficient LLM inference on various hardware, supporting multiple quantization formats and GPU backends. This README details its features, installation, and recent updates including Hugging Face cache migration and multimodal support.
Gemma 4 2B handling structured JSON output + tool calling + reasoning traces correctly via Spring AI / LM Studio — including identifying a real Java bug in code review
User tested Gemma 4 2B running locally via LM Studio and Spring AI for structured JSON output, tool calling, and reasoning traces, finding it correctly identified a Java bug in code review and performed comparably to larger models.
google/gemma-4-26B-A4B-it-assistant
Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.