Tag
The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
Release of Gemma4-12B-QAT Uncensored Balanced, a fine-tuned uncensored model with a multi-token-prediction draft head for ~60% faster speculative decoding, optimized for llama.cpp and offering vision support.
A guide on using Colab-CLI to run the Gemma4 12B model on underpowered machines like a MacBook Air or Linux computer, leveraging the free version of Google Colab.
A live challenge is underway to accelerate inference of the Gemma 4 E4B model on a single A10G GPU, with a dashboard on Hugging Face tracking agent submissions.
User shares positive experience with Gemma4 QAT model, noting quality improvements and speed gains with MTP, and asks others for their experiences.
A user seeks clarification on the relation between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma4 model and the new QAT string in filenames.
A user reports that the QAT quantized variant of Gemma4 26B A4B performs worse on a chessboard SVG test compared to the non-QAT version, with unstable piece drawing despite using suggested settings.
Google's new Gemma 4 12B is a single decoder-only transformer with encoder-free multimodal input, achieving strong benchmarks while being small enough to run locally on a budget GPU. It is released under Apache 2.0 license.
A developer created a Chrome extension called Dobby that runs Google's Gemma4 (Gemini Nano) locally on PC without needing a GPU, requiring only Chrome and 16GB RAM. The extension provides a simple interface to interact with the model for tasks like spell checking or summarizing.
User benchmarks the APEX quantized version of Gemma4 26B A4B model on AMD RX 9060 XT, achieving 38 tps at 90k context with no quality degradation, finding it better than previous quantizations.
An experimental Jinja template for Gemma4 31B in llama.cpp that improves stability for multi-turn tool calls by fixing common thinking tag issues. Community feedback is welcome, but this is not recommended by Google.
Ollama now natively supports Codex, allowing you to run the OpenAI Codex App entirely free and locally without subscriptions, API keys, or data leaving your laptop.
HauhauCS releases Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced, a lossless uncensored variant of Gemma4 with 0/465 refusals after over a month of development, available in GGUF formats.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting MTP benefits vanish when acceptance drops below 50%.
A hands-on benchmark of four local LLMs—Qwen3.6-27B, Qwen3.6-35B, Qwen3.5-27B and Gemma 4—on a 20k-token architecture-writing task shows Qwen3.6-27B delivering the best overall balance of clarity, completeness and usefulness on an RTX 5090.