@Tono_Ken3: Added Q3 series to gemma-4-12B-coder-fable5-composer2.5-GGUF You might be able to try out the essence of Fable5 (as a t…

X AI KOLs Timeline Models

Summary

New Q3 quantizations added to the gemma-4-12B-coder-fable5-composer2.5 GGUF model, enabling the coding-focused fine-tune to run on GPUs with around 6GB VRAM using importance-matrix quantized versions.

Added Q3 series to gemma-4-12B-coder-fable5-composer2.5-GGUF You might be able to try out the essence of Fable5 (as a teacher role) in coding even on a GPU with around 6GB VRAM
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:49 AM

Added Q3 series to gemma-4-12B-coder-fable5-composer2.5-GGUF You might be able to try out the essence of Fable5 (as a teacher role) in coding even on a GPU with around 6GB VRAM


sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF · Hugging Face

Source: https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF

https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%92%BB-gemma-4-12b-coder-fable5-%C3%97-composer25–imatrix-gguf-%E2%9C%A8💻 Gemma-4-12B-Coder (fable5 × composer2.5) —imatrix GGUF

https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#runs-anywhere-llamacpp-runs–amdvulkan-cpu-apple-nvidia-no-blackwell-no-mtp-just-gguf-%F0%9F%90%A7%F0%9F%8D%8E%F0%9F%AA%9FRuns anywhere llama.cpp runs —AMD/Vulkan, CPU, Apple, NVIDIA. No Blackwell, no MTP, just GGUF. 🐧🍎🪟

Importance-matrix (imatrix) quants ofyuxinlu1’s coding model, calibrated onreal Python coding dataso the low-bit builds keep their coding smarts. Text-only (a coding model — no vision baggage). 💚


https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%99%8F-credit🙏 Credit

Quants of**yuxinlu1/gemma\-4\-12B\-coder\-fable5\-composer2\.5\-v1— all thanks to@yuxinlu1for the model. ⭐ the original and watch it for a v2! The author’s recipe: a fine-tune ofgoogle/gemma\-4\-12B\-itonexecution-verifiedPython coding chains-of-thought (Composer 2.5 real CoT + a Fable 5 “second-attempt” set for the hard cases). Itthinks in Gemma’s native channel**, then writes clean, runnable code. De-refused; Python/algorithmic focus; English-centric.

Why this repo:the originals are static GGUF. These add animportance matrix(code-calibrated) soIQ4_XS / Q4_Kkeep more quality at low VRAM — the builds that fly forAMD/Vulkan and CPUfolks.


https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%93%A6-pick-your-quant-all-imatrix📦 Pick your quant (all imatrix)

QuantSizeVibe🟢Q3_K_S5.53 GBsmallest that works— for8 GB / 6 GBcards (leaves room for context). ~**91.7%**HumanEval[:12]🟢Q3_K_M****6.09 GBtinyandsharp —**100%**HumanEval[:15]🔵IQ4_XS****6.64 GBthe imatrix 4-bit sweet spot —**100%**HumanEval[:15]🔵Q4_K_M****7.38 GBbalanced (embeddings/output at Q6_K)⚪Q5_K_M****8.55 GBquality-first if you have the RAM/VRAM

💡8 GB VRAM (or 6 GB):grabQ3_K_S(5.5 GB) — it leaves headroom for context and still codes well. On theVulkanbackend (AMD) all of these fly. ⚠️Avoid IQ3 (i-quant 3-bit) for this modelIQ3\_XXS/IQ3\_Scollapseto gibberish here (gemma-4’s special attention layers don’t survive 3-bit i-quants). The**Q3\_K\_\*K-quants**stay coherent at the same size — that’s why the small tiers are Q3_K, not IQ3.


https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%9A%80-run-it-llamacpp–any-backend🚀 Run it (llama.cpp — any backend)

# build llama.cpp with your backend (Vulkan for AMD):  cmake -B build -DGGML_VULKAN=ON && cmake --build build
# grab one quant:
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF gemma-4-12B-coder-fable5-composer2.5-IQ4_XS.gguf --local-dir .

# chat server (OpenAI-compatible at http://localhost:8080)
./llama-server -m gemma-4-12B-coder-fable5-composer2.5-IQ4_XS.gguf \
  -ngl 99 --ctx-size 16384 -fa on --jinja \
  --temp 1.0 --top-p 0.95 --top-k 64 --host 0.0.0.0 --port 8080

⚠️ Needs arecent llama.cpp— this is thegemma4architecture (older builds won’t load it). 🧠Thinking is on by defaultvia the chat template (\-\-jinja). The model reasons through edge cases, then writes the code. For deterministic coding use\-\-temp 0.

https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%A6%99-ollama-one-line-straight-from-this-repo🦙 Ollama (one line, straight from this repo)

ollama run hf.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF:Q4_K_M

Pick any tag:Q3\_K\_S``Q3\_K\_M``IQ4\_XS``Q4\_K\_M``Q5\_K\_M.

“manifest not found”?You must includeboththehf\.co/prefixandan explicit quant tag. Without a tag, Ollama looks for:latest(which doesn’t exist here); withouthf\.co/, it searches Ollama’s own registry instead of this repo. The fix is just…\-GGUF:Q4\_K\_M.

Also works inLM Studio / Jan / KoboldCpp— import the GGUF, pick a quant, go. 🐾


https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%93%8A-how-good-is-it-greedy-pass1📊 How good is it? (greedy pass@1)

BenchmarkScoreHumanEval****90.2%(148/164)MBPP****85.7%(366/427) Strong at hard algorithms,bug-fixing & refactoring, and faithful open reasoning. Japanese prompts cause no measurable Python-quality drop.

⚠️One honest caveat:ontime-series / quant-financecode it can introduce alook-ahead bias(and its reasoning may state the right rule while the code does the opposite). Great algorithm/debug helper — butreview its pandas/numpy back-test codebefore trusting it.


https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%94%A7-quant-details🔧 Quant details

  • imatrixcomputed on acode-heavycalibration set (HumanEval + MBPP problems & solutions) so the importance matrix reflects real coding activations.
  • Source: the author’sQ8\_0GGUF (≈lossless). Text-onlygemma4(no vision/audio).
  • Higher tiers keep token-embeddings & output tensors atQ6\_K(K-quant default) for fidelity where it matters most; the Q3_K tiers trade a little there for size.
  • K-quants over i-quants here:gemma-4’s heterogeneous attention (head_dim 256 / 512 layers) survivesQ3\_K\_\*butcollapses underIQ3\_\*— verified, so the small tiers ship as Q3_K.

https://huggingface.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-GGUF#%F0%9F%93%9A-license–use📚 License & use

Gemma Terms of Use(derivatives must comply). De-refused / not safety-aligned — add your own guardrails. Best on Python/algorithmic tasks; double-check general facts and time-series code. Shared as-is.Quants & eval byLna-Lab; thanks to @yuxinlu1.🐾✨

Similar Articles

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Hugging Face Models Trending

A focused fine-tune of Gemma 4 12B for coding, distilled from chain-of-thought data (Composer 2.5 and Fable 5) and quantized to GGUF for local, offline use with minimal VRAM requirements.

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Reddit r/LocalLLaMA

A user compares Qwen3.6 35B-A3B and Gemma 4 26B-A4B-IT running locally on a 16GB VRAM GPU via LM Studio, finding Qwen3.6 produces more detailed outputs while both run at comparable speeds. The post is an informal community comparison using quantized models.

Gemma 4 26B-A4B GGUF Benchmarks

Reddit r/LocalLLaMA

Unsloth has released KL Divergence benchmarks for Gemma 4 26B-A4B GGUF quantizations, showing Unsloth GGUFs top 21 of 22 sizes on the Pareto frontier. They also introduced a new UD-IQ4_NL_XL quant fitting in 16GB VRAM and updated Q6_K and MLX quants for both Gemma 4 and Qwen3.6.