New Q3 importance-matrix (imatrix) quantizations have been added to the gemma-4-12B-coder-fable5-composer2.5 GGUF model, allowing the coding-focused fine-tune to run on GPUs with around 6GB of VRAM.
A user tested the community fine-tuned gemma-4-12B-coder against the Qwen3.6-35B-A3B MoE model on three programming tasks and found that the Gemma fine-tune performed poorly on complex stateful programs, while Qwen3.6-35B-A3B remained robust.
The author releases improved GGUF quantizations of Gemma 4 models (12B and 31B), produced with a more accurate quantization-aware training process that achieves lower KL divergence (KLD) and a higher same-top percentage than the stock quantizations.
This paper investigates whether the repetition loops that Gemma 4 models fall into on long factual enumeration tasks can be fixed by editing a single neuron. It finds that targeted weight edits to a small set of MLP neurons significantly reduce loop failures, though they do not completely eliminate doom looping in larger models.
A tweet highlights that the abliterated, NVFP4 quantized Gemma-4-12B model (7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, demonstrating significant efficiency gains.
A discussion of using Gemma 4 12B's encoder-free architecture for native voice input, in which the poster asks for out-of-the-box solutions for low-latency streaming audio ingestion.
A benchmark shows Diffusion Gemma is 4x faster than Gemma 4 but makes 6x more factual mistakes, especially on obscure topics, trading factual accuracy for smooth text generation.
A detailed tutorial on setting up a local coding agent on macOS using Gemma 4 with an MTP draft model and llama.cpp, achieving a ~24% speed improvement through speculative decoding.
Google's Gemma 4 12B model, released last week, has already surpassed 4 million downloads on HuggingFace, making it the most popular encoder-free VLM and the first general-purpose LLM with encoder-free audio input. The model balances size and performance, enabling local laptop use with multi-step reasoning and agentic workflows.
An open-source local AI dungeon app that uses Gemma 4 and FLUX for text and image generation, is fully private, and runs in under 8GB of RAM.
The Google Gemma team demonstrates real-time social robotics using Gemini Live on the Reachy Mini robot, showcasing both cloud and local inference with Gemma 4.
A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.
Gemma 4 now runs 2x faster with MTP GGUF format and can run locally on just 6GB RAM. The linked article explains how GGUF works, including quantization and memory mapping.
Unsloth AI announces that Gemma 4 runs 2x faster with MTP GGUFs, making it feasible for local coding agents on hardware like a MacBook Pro M1 Max at 72 tokens/s.
Unsloth releases a 2-bit quantized Gemma 4 12B model of only 4.66GB that runs locally, with capabilities such as autonomous online search and deep analysis described as similar to McKinsey consulting.
llmfan46 released a quadruple set of uncensored, fine-tuned and quantized Gemma-4 models on Hugging Face, including 12B, 26B-A4B, and 31B variants with QAT and GGUF formats.
DiffusionGemma is out; it is compute-bound and 4x faster than other Gemma 4 models, reaching 1k tok/s on an H100, and excels at coding tasks including 3D generation and front-end work.
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
A focused fine-tune of Gemma 4 12B for coding, distilled from chain-of-thought data (Composer 2.5 and Fable 5) and quantized to GGUF for local, offline use with minimal VRAM requirements.
A user reports that the Gemma 4 12B unified audio model stops attending to speech when the system prompt is large (~21k tokens) and asks for workarounds or explanations, noting that the issue persists across the vLLM, llama.cpp, and LiteRT-LM backends.
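
One of the items above compares quantizations by KLD and same-top percentage. As a rough illustration of what those two numbers measure, here is a minimal sketch that assumes you already have per-position logits from a reference model and from its quantized counterpart. The function names and the list-of-lists input format are hypothetical and are not taken from any of the linked posts or from a specific tool.

```python
import math


def softmax(logits):
    """Convert one position's logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_quantization(ref_logits, quant_logits):
    """Return (mean KLD, same-top percentage) over all positions.

    ref_logits and quant_logits are assumed to be equal-length lists,
    each entry holding the vocabulary logits for one token position.
    """
    klds = []
    same_top = 0
    for ref, quant in zip(ref_logits, quant_logits):
        klds.append(kl_divergence(softmax(ref), softmax(quant)))
        if argmax(ref) == argmax(quant):
            same_top += 1
    count = len(klds)
    return sum(klds) / count, 100.0 * same_top / count


if __name__ == "__main__":
    # Toy data: the second position disagrees on the top token.
    reference = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
    quantized = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
    mean_kld, same_top_pct = compare_quantization(reference, quantized)
    print(f"mean KLD: {mean_kld:.4f} nats, same-top: {same_top_pct:.1f}%")
```

A lower mean KLD means the quantized model's output distribution stays closer to the reference, and a higher same-top percentage means it more often agrees on the single most likely next token.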