Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp

Reddit r/LocalLLaMA Tools

Summary

An experimental Jinja template for Gemma4 31B in llama.cpp that improves stability for multi-turn tool calls by fixing common thinking tag issues. Community feedback is welcome, but this is not recommended by Google.

[https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja](https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja)

Y'all are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent, I no longer hit the "forgot to close thinking tag", "forgot to open thinking", or "closed thinking too early" problems. It's more stable for multi-turn tool calls across multiple prompt turns. Disclaimer: this is NOT recommended by Google.
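
If you want to check your own transcripts for the first two failure modes, here is a minimal Python sketch. It is not part of the template. The function name `check_thinking_tags` and the default tag strings are placeholders I made up for illustration, not Gemma 4's actual delimiters, so pass in whatever your template really emits. It only catches structural problems (a missing open or a missing close); "closed too early" depends on the content and can't be detected this way.

```python
def check_thinking_tags(text, open_tag="<think>", close_tag="</think>"):
    """Classify thinking-tag balance in a single model response.

    open_tag / close_tag are hypothetical placeholders (assumptions),
    not Gemma 4's real tokens.
    """
    depth = 0  # number of thinking blocks currently open
    pos = 0
    while True:
        next_open = text.find(open_tag, pos)
        next_close = text.find(close_tag, pos)
        if next_open == -1 and next_close == -1:
            break  # no more tags to process
        # Handle whichever tag appears first in the remaining text.
        if next_close == -1 or (next_open != -1 and next_open < next_close):
            depth += 1
            pos = next_open + len(open_tag)
        else:
            if depth == 0:
                # A close tag with nothing open: the model forgot to open.
                return "missing_open"
            depth -= 1
            pos = next_close + len(close_tag)
    # Anything still open at the end was never closed.
    return "missing_close" if depth > 0 else "ok"


if __name__ == "__main__":
    for sample in ("<think>plan</think>answer", "<think>plan answer", "plan</think>answer"):
        print(repr(sample), "->", check_thinking_tags(sample))
```

Running something like this over each assistant turn in a long tool-calling session gives a rough count of how often the tags break before and after swapping templates.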