gemma-4

#gemma-4

@Tono_Ken3: Added Q3 series to gemma-4-12B-coder-fable5-composer2.5-GGUF You might be able to try out the essence of Fable5 (as a t…

X AI KOLs Timeline ↗ · 2026-06-16 Cached

New Q3 quantizations added to the gemma-4-12B-coder-fable5-composer2.5 GGUF model, enabling the coding-focused fine-tune to run on GPUs with around 6GB VRAM using importance-matrix quantized versions.

@zhixianio: Finished testing, feeling quite surprised, not sure if I'm using it wrong. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

A user tested the community fine-tuned gemma-4-12B-coder against Qwen3.6-35B-A3B MoE on three programming tasks and found that the Gemma fine-tune performed poorly on complex stateful programs, while the Qwen model remained robust.

moar QAT stuff and hairy ticks

Reddit r/LocalLLaMA ↗ · 2026-06-15

The author releases improved GGUF quantized versions of Gemma 4 models (12B and 31B) using a more accurate quantization-aware training process that achieves lower KLD and higher same-top percentage than stock quantizations.
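The two metrics named in this summary can be illustrated with a minimal sketch. It assumes "KLD" means the mean KL divergence between the reference model's and the quantized model's next-token distributions, and "same-top percentage" means the share of positions where both pick the same most likely token. The function names and toy logits are hypothetical and are not taken from the post.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare(ref_logits, quant_logits):
    # Both arguments: one list of logits per token position (hypothetical layout).
    klds, same_top = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        klds.append(kl_divergence(p, q))
        if argmax(p) == argmax(q):
            same_top += 1
    n = len(klds)
    return sum(klds) / n, 100.0 * same_top / n

# Toy data: the second position disagrees on the top token.
ref = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
quant = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
mean_kld, same_top_pct = compare(ref, quant)
```

Lower mean KLD and a higher same-top percentage both indicate that the quantized model tracks the reference more closely.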

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

arXiv cs.LG ↗ · 2026-06-15 Cached

This paper investigates whether repetition loops in long factual enumeration tasks by Gemma 4 models can be fixed by editing a single neuron. It finds that targeted weight edits on a small set of MLP neurons can significantly reduce loop failures, though not completely eliminate doom looping in larger models.

@Tono_Ken3: I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work Ye…

X AI KOLs Timeline ↗ · 2026-06-14 Cached

A tweet highlights that the abliterated, NVFP4 quantized Gemma-4-12B model (7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, demonstrating significant efficiency gains.

Gemma 4 12B native encoder-free voice input: any suggestions for using it?

Reddit r/LocalLLaMA ↗ · 2026-06-14

Discusses leveraging Gemma 4 12B's encoder-free architecture for native voice input, seeking out-of-the-box solutions for low-latency streaming audio ingestion.

Diffusion Gemma is 4x faster, but makes 6x more mistakes!

Reddit r/LocalLLaMA ↗ · 2026-06-12

A benchmark shows Diffusion Gemma is 4x faster than Gemma 4 but makes 6x more factual mistakes, especially on obscure topics, trading factual accuracy for smooth text generation.

How to set up a local coding agent on macOS

Hacker News Top ↗ · 2026-06-12 Cached

A detailed tutorial on setting up a local coding agent on macOS using Gemma 4 with MTP draft model and llama.cpp, achieving ~24% speed improvement through speculative decoding.

@AndreasPSteiner: Released last week, and already more than 4M downloads on HuggingFace alone This makes Gemma 4 12B the most popular enc…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Google's Gemma 4 12B model, released last week, has already surpassed 4 million downloads on HuggingFace, making it the most popular encoder-free VLM and the first general-purpose LLM with encoder-free audio input. The model balances size and performance, enabling local laptop use with multi-step reasoning and agentic workflows.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS)

Reddit r/LocalLLaMA ↗ · 2026-06-12

An open-source local AI dungeon app that uses Gemma 4 and FLUX for text and image generation. It is fully private and runs in under 8GB of RAM.

@googlegemma: Real-time social robotics, from the cloud to your local device. Watch Ian from our DevX team use Gemini Live for a seam…

X AI KOLs Following ↗ · 2026-06-12 Cached

Google Gemma team demonstrates real-time social robotics using Gemini Live on the Reachy Mini robot, showcasing both cloud and local inference with Gemma 4.

Not All MTP Assistants Are Created Equal

Reddit r/LocalLLaMA ↗ · 2026-06-12

A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.

@amitiitbhu: Gemma 4 now runs 2x faster with MTP GGUFs! Run locally on just 6GB RAM. New Article: How does GGUF work? Read here: htt…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Gemma 4 now runs 2x faster with MTP GGUFs and can run locally in just 6GB of RAM. The linked article explains how GGUF works, including quantization and memory mapping.

@Freerunnering: This actually makes Gemma 4 26B-4A usable for a coding agent @ 72tk/s on my MacBook Pro M1 Max. This video is realtime,…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Unsloth AI announces that Gemma 4 runs 2x faster with MTP GGUFs, making it feasible for local coding agents on hardware like a MacBook Pro M1 Max at 72 tokens/s.

@VincentLogic: A 4.66 GB model actually runs at the level of a McKinsey consultant locally? Unsloth's latest 2-bit Gemma 4 12B is truly explosive. This isn't just chat – it directly transforms into a 'Super Agent' working autonomously: autonomously searching online citing 15+ sources, deeply distinguishing…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Unsloth releases a 2-bit quantized Gemma 4 12B model of only 4.66GB that runs locally. The poster claims it can work as an autonomous agent, searching online and citing 15+ sources, with analysis they compare to that of a McKinsey consultant.
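As a rough sanity check on the file size, the average bits per weight can be estimated from the two figures in the post. This sketch assumes "4.66 GB" means decimal gigabytes and that the model has exactly 12 billion parameters; neither is confirmed by the source, and the helper name is hypothetical.

```python
def bits_per_weight(file_size_gb, n_params_billion):
    # Assumes decimal gigabytes (1 GB = 1e9 bytes) and ignores file metadata.
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (n_params_billion * 1e9)

bpw = bits_per_weight(4.66, 12)
print(f"{bpw:.2f} bits per weight")  # prints "3.11 bits per weight"
```

An average above 2 bits would be consistent with some tensors being stored at higher precision than the headline bit width, though the post does not say how the file is laid out.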

Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics!

Reddit r/LocalLLaMA ↗ · 2026-06-11 Cached

llmfan46 released a quadruple set of uncensored, fine-tuned and quantized Gemma-4 models on Hugging Face, including 12B, 26B-A4B, and 31B variants with QAT and GGUF formats.

@mervenoyann: DiffusionGemma is out it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) also great on…

X AI KOLs Following ↗ · 2026-06-10 Cached

DiffusionGemma has been released. It is compute-bound, about 4x faster than other Gemma 4 models (1k tok/s on an H100), and is reported to do well on coding tasks, including 3D generation and front-end work.

@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …

X AI KOLs Timeline ↗ · 2026-06-10 Cached

Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Hugging Face Models Trending ↗ · 2026-06-10 Cached

A focused fine-tune of Gemma 4 12B for coding, distilled from chain-of-thought data (Composer 2.5 and Fable 5) and quantized to GGUF for local, offline use with minimal VRAM requirements.

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Reddit r/LocalLLaMA ↗ · 2026-06-10

The user reports that the Gemma 4 12B unified audio model stops attending to speech when the system prompt is large (~21k tokens), and asks for workarounds or explanations, noting the issue persists across vLLM, llama.cpp, and LiteRT-LM backends.
