yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

Hugging Face Models Trending Models

Summary

A fine-tuned version of Gemma-4-12B, optimized for local coding and agentic tasks, achieving ~3.5x improvement over the base model on the tau2-bench telecom benchmark.

Task: text-generation Tags: gguf, gemma4, coding, agentic, terminal, tool-use, reasoning, thinking, llama.cpp, local-llm, text-generation, base_model:google/gemma-4-12B-it, base_model:quantized:google/gemma-4-12B-it, license:apache-2.0, endpoints_compatible, region:us, conversational
Original Article
View Cached Full Text

Cached at: 06/20/26, 02:19 PM

yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF · Hugging Face

Source: https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%92%BB%F0%9F%A4%96-gemma4-12b-v2–coding–agentic-edition-%E2%9C%A8💻🤖 Gemma4-12Bv2— Coding + Agentic Edition ✨

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%90%A3-tiny-footprint-big-brain–a-local-coding–tool-using-agent-for-everyone🐣 Tiny footprint, big brain — a localcoding & tool-using agentforeveryone

No matter your GPU. No matter your RAM.With~4.5 GBof VRAMorunified memory free, you can run your own private, offline codingagentright now. 🚀 v2 is the bigagenticupgrade — it reads, reasons,uses tools, and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%93%8A-the-headline–it-works-as-an-agent-tau2-bench📊 The headline — it works as an agent (tau2-bench)

v2 is built forcoding + agenticwork — writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal istau2-benchtelecom, an agentic tool-use benchmark whosediagnose → fix → verifyloop mirrors real terminal/debugging work:

tau2-benchtelecom· 20 tasks · local, same harness,all Q8_0scoreofficialgemma\-4\-12B\-it(base)~15%🟢Gemma4-12B v2 (this model)****~55% → Roughly3.5× higherthan the base model on technical-agentic tasks. 🎯Want the full storywhytelecom,howthe two models fail differently, the honest caveats, and the trade-offs (including general knowledge)?It’s all broken down further below. 👇


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%9A%80-announcements🚀 Announcements

📌 Hitting a problem? Please check my pinned discussion first.****~99% of issues are a client/sampler config, not the weights— and they have a quick fix there. For example: garbled orrepeating0000…output almost always meansno repetition penalty(setrep\_pen 1\.1,temp 1\.0); and leaked<\|tool\_call\>/<\|channel\>tokens mean your front-end isn’t parsing Gemma 4’snative tool format(use llama.cpp\-\-jinja). If your question isn’t covered,don’t hesitate to open a discussion— I read them and reply as fast as I can. 💬

**📦 No Q2_K this release.**I finished a Q2_K (imatrix) build, but it didn’t hold up under real stress-testing, so I’m holding it back —I only ship a quant once I’m confident it’s genuinely good.Smallest reliable option isQ3_K_M;Q4_K_Mis the recommended sweet spot. 🙏

🔮 v3 is already on the way.Honestly? EvenIdidn’t expect the post-training jump to bethislarge — so I’m pushing further. v3 keeps thecoding + agenticfocus and aims higher still. Stay tuned! 🎉

🐘 And a bigger sibling is coming — Qwen3.6-27B.I’ve also started fine-tuningQwen3.6-27Bwith the samecoding + agenticrecipe, for those of you whodohave the headroom and want more raw capability. But I haven’t forgotten what this project is about: a27B may be too heavyfor some of your GPUs / RAM. So this isnota replacement — I’m pushingv3 (this 12B line) in parallel, at the same time, and it willonly get stronger. 💪No matter your hardware, you’ll have a model that fits.💚


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%92%9A-a-personal-note–thank-you-and-a-few-honest-words-please-read💚 A personal note — thank you, and a few honest words (please read)

First, a huge thank-you for all the data and help you’ve shared.🙏 The bittersweet part: none of us saw it coming thatFable 5 would be retired— and only myowndataset holds Fable 5’sgenuine, self-authoredchain-of-thought. So for every dataset the community contributed, Irebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and theimprovement turned out really,reallyhuge(it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆

Second— I’ve tried toreply to every community comment, and I’ve openlyowned v1’s training problems. Truly, thank you: your feedback is what lets me improve. 💚

Because v1 hit**#1 trending**, it also attracted somebad words / trolling. I’ll say this gently but firmly:real criticism is always welcome here — pure insults are not.This is alocalmodel that lets anyone run a capable AI on tiny RAM/VRAM, atzero API costand fullyprivate; I even open-sourced thefull safetensors masterto study and build on. If something’s off,open a discussion about the actual problem— I genuinely want to hear it and I’ll act on it. But comments that areonlyinsults help no one, and I’ll remove them without hesitation. 🙏

Please remember:I’m one person— not a lab shipping an “open” model for marketing or to monetize later. I don’t advertise. I build this for you onmy own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built adynamic context-windowpass to keep the agent’sread-before-actsteps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through anentire Claude Max 20× plan(I keep a separate Pro for my own work), andv2 alone cost 40+ hours— even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%94%AC-the-benchmarks-in-detail-tau2-bench🔬 The benchmarks, in detail (tau2-bench)

I evaluated v2 ontau2-bench(an agentic tool-use benchmark). I didnotrun the whole suite — it’s very time-consuming — so I focused on the single domain that best matches what v2 is for.

Why tau2-benchtelecom?Telecom troubleshooting makes the agentdiagnose with read/inspect tools → pinpoint the issue → apply a fix → verify it— structurally thesame loopas real terminal/debugging work (check state → diagnose → fix → confirm). That’s exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain).

tau2-benchtelecom· 20 tasks · local, same harness,all Q8_0scoreofficialgemma\-4\-12B\-it(base)~15%🟢Gemma4-12B v2 (this model)****~55% → Roughly3.5× higherthan the base model on technical-agentic tasks. 🎯

Grounded, not made-up.Independently, a coding/terminalfabrication probe(tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2grounds before it actsjust like the base — itgrep/read/lsfirst, anddoesn’t make things up(0% fabrication, on par with the base model).

The interesting part —howthey fail.Thebase model gives up early: on this run it bailed to a human agent10 times(transfer\_to\_human) instead of finishing the fix.v2 keeps going— it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It’s not perfect yet: it stillflails a little sometimes(over-trying, retrying). And some of the remaining misses are actually abug in the benchmark’s own APN tool(it throws on inputs it should handle gracefully), not the model. To be clear:I will not patch the benchmark’s tools or leak its test questions just to inflate my score— I’d rather report an honest number and improve themodelitself.More training is coming in v3.🔧

Aboutretail(customer-service shopping):on tau2-benchretail, the base model scores a bit higher than v2.This is fully expected and by design.Retail is pure customer-service (look up a user, process an order) —notwhat this model is for. v2 is specialized forcoding / terminal / technical-agenticwork, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn’t it. Need alocal coding/agenticmodel? It is. 💚

Let’s keep it honest about scale.Today’sfrontiermodels — thinkmimo-v2.5-proorOpus 4.8— all land90%+on this telecom benchmark. They’re alsoenormous. For a12Bmodel, my roughguessis that v3 might top out somewhere around60–70%(emphasis onguess— I haven’t even started v3 yet). So let’s be clear-eyed: there’s still a real gap to the frontier. But keep the scale in mind —this is a 12B model running on your own machine, and narrowing that gap as much as possibleat this sizeis the whole point. 💪

And the trade-off — there’s no free lunch.I also ran a general-knowledge benchmark (MMLU-Pro), and v2 landsa little below the base modelthere. That’scompletely normal and expectedfor a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose**Claude Opus 4.6/4.8 distillation— or the originalgoogle/gemma\-4\-12B\-itbase. Need alocal coding/agentic**worker? That’s what v2 is tuned for.

🔬Methodology, honestly:these arelocal, same-harness, relativenumbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They arenotdirectly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runssystematically lowerthan published scores. Read them as**“v2 vs the base model under identical conditions”**, which is the comparison that actually matters here.


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%93%9A-whats-new-in-v2-training📚 What’s new in v2 (training)

v2 continues from the v1 coder and adds a bigagenticpush — the piece v1 was missing:

  • 🛠️ Agentic / terminal— realmulti-step tool-usetrajectories (read → reason → act → verify), in Gemma 4’s native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1’s “stops after the first step” behavior.
  • 💻 Coding— verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
  • 📚 General— a curated slice of reasoning/instruction data to keep broad competence.

All reasoning isdistilled CoT(see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%93%A6-pick-your-size-gguf-quants📦 Pick your size (GGUF quants)

QuantSizeVibe🟡Q3_K_M****5.7 GBgreat for 8 GB VRAM🔵Q4_K_M****6.87 GBthe sweet spot 👌 (recommended)🟣Q6_K****9.11 GBnear-lossless⚪Q8_0****11.8 GBbasically full quality

ℹ️NoQ2_Kthis release — it didn’t pass stress-testing yet (see Announcements). Smallest reliable quant =Q3_K_M.


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%9A%80-how-to-run-it🚀 How to run it

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#option-a–llamacpp-recommended-%F0%9F%A6%99Option A — llama.cpp (recommended) 🦙

⚠️ Needs arecent llama.cpp(this is thegemma4\_unifiedarchitecture — older builds won’t load it).

@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause
  • **🛠️ Agentic use:**pass your tools via the OpenAItoolsfield (works with\-\-jinja). v2 emits structured tool-calls in Gemma 4’s native protocol and is happy in agent loops (read/grep/edit/run, then verify).
  • **🖱️ One-click apps:**LM Studio / Jan / Ollama — import the GGUF, pick a quant, go.

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%A7%A0-thinking-mode🧠 Thinking mode

v2 thinks in Gemma’s native thought channel before answering (keepenable\_thinking=true, the default chat template handles it). Recommended sampling:temp 1\.0, top\_p 0\.95, top\_k 64; for coding you can also go greedy (temp 0).


https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%E2%9A%A0%EF%B8%8F-good-to-know⚠️ Good to know

  • **Specialized for coding / terminal / agentic.**General-knowledge facts/numbers should still be double-checked.
  • **Reduced refusals:**task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
  • English-centric.

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%F0%9F%93%9A-base–license📚 Base & License

  • License: Apache 2.0.Gemma 4 is released by Google underApache 2.0(unlike the older Gemma 1/2/3 terms), so this fine-tune isApache 2.0too — free to use, modify, and redistribute. 🎉
  • Base model:google/gemma\-4\-12B\-it.
  • Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF#%E2%9A%A1-speculative-decoding-mtp-draft–verified-build⚡ Speculative decoding (MTP draft) — verified build

TheMTP/folder ships the Gemma 4 multi-token-prediction draft (unsloth’s GGUF conversion of Google’s officialgemma\-4\-12B\-it\-assistant) for speculative decoding. Gemma 4 MTP is inllama.cpp mainline(PR #23398) — no fork needed — but thegemma4\-assistantloader isbuild-sensitive right now, so please use the exact build below:

  • ✅**Verified working: llama.cppb9553(commit9e3b928fd).**I reproduced it withgemma4\-v2\-Q8\_0+ theMTP\-Q8\_0draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking).Losslesseither way.
  • ⚠️Newer builds (e.g. b9702 / b9717) currently crashwhile loading the draft withinvalid vector subscript. This is anupstream regressionin thegemma4\-assistantloader path,nota problem with these GGUFs — the same files load fine on b9553. Stick withb9553until it’s fixed upstream.

Working command on b9553 (note the older flag names —\-\-model\-draft,not\-\-spec\-draft\-model):

llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja

ℹ️ TheGemma4Assistant requires ctx\_other to be set \(this is normal during memory fitting\)line is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give — still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller\-\-ctx\-sizeif you hit OOM.

Similar Articles

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Hugging Face Models Trending

A focused fine-tune of Gemma 4 12B for coding, distilled from chain-of-thought data (Composer 2.5 and Fable 5) and quantized to GGUF for local, offline use with minimal VRAM requirements.

Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

Hugging Face Models Trending

SuperGemma4-26B-Uncensored-MLX-4bit-v2 is a fine-tuned and quantized variant of Google's Gemma 4 26B optimized for Apple Silicon, offering improved performance on code, reasoning, and tool-use tasks while maintaining faster inference speeds compared to the stock baseline.

Jiunsong/supergemma4-26b-uncensored-gguf-v2

Hugging Face Models Trending

SuperGemma4-26B-Uncensored-Fast GGUF v2 is a quantized, locally-runnable variant of Google's Gemma-4-26B model optimized for Apple Silicon, offering faster inference speeds and less-censored chat behavior while maintaining practical performance on general tasks.

google/gemma-4-12B-it-qat-q4_0-gguf

Hugging Face Models Trending

Google DeepMind releases Gemma 4 models optimized with Quantization-Aware Training (QAT) in multiple formats including GGUF, enabling high quality with reduced memory requirements.