@Tono_Ken3: I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work Ye…
Summary
A tweet highlights that the abliterated, NVFP4-quantized Gemma-4-12B model (24 GB → 7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, with the model card reporting up to 118 tok/s single-stream.
Cached at: 06/15/26, 12:54 AM
I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work. Yeah, 12b can handle real work. It’s fast! https://huggingface.co/sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16…
sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 · Hugging Face
Source: https://huggingface.co/sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16
Huihui-gemma-4-12B-it-abliterated-NVFP4A16
NVFP4 (W4A16) quantization of huihui-ai/Huihui-gemma-4-12B-it-abliterated, the abliterated (uncensored) Gemma 4 12B unified model (text + vision + audio).
**24 GB → 7.7 GB.** Runs on a single 16 GB Blackwell GPU, or shards across several for higher throughput. Up to **118 tok/s single-stream** (TP=4 + MTP speculative decode) and **~1117 tok/s** aggregate.

| Property | Value |
|---|---|
| Base | huihui-ai/Huihui-gemma-4-12B-it-abliterated (abliterated google/gemma-4-12B-it) |
| Architecture | `Gemma4UnifiedForConditionalGeneration`: 12B dense, 48 layers, 131K ctx |
| Quantization | **NVFP4A16**: weights FP4 (group 16, FP8 scales), **activations BF16** |
| Format | compressed-tensors / nvfp4-pack-quantized (native vLLM) |
| Tool | llm-compressor |
| Size | 7.7 GB · Requires NVIDIA Blackwell (SM120) |
Weight-only FP4 (W4A16) keeps activations at BF16, so it is robust where full W4A4 NVFP4 collapses on this architecture.
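As a rough sanity check on the 24 GB → 7.7 GB figure, the sketch below estimates the weight footprint from the numbers in the table (FP4 weights, one FP8 scale per group of 16). It assumes exactly 12 billion parameters, all of them quantized, and decimal gigabytes; none of these assumptions come from the model card, so treat the result as an order-of-magnitude estimate rather than the author's accounting.

```python
# Rough weight-footprint estimate for NVFP4A16 (assumptions flagged inline).
PARAMS = 12e9      # assumption: "12B" taken as exactly 12 billion weights
GROUP_SIZE = 16    # from the model card: one scale per group of 16 weights
WEIGHT_BITS = 4    # FP4 weights
SCALE_BITS = 8     # one FP8 scale per group


def bits_per_weight(weight_bits: int, scale_bits: int, group_size: int) -> float:
    """Effective storage per weight, amortizing the per-group scale."""
    return weight_bits + scale_bits / group_size


def footprint_gb(params: float, bits: float) -> float:
    """Size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


bf16_gb = footprint_gb(PARAMS, 16)
nvfp4_gb = footprint_gb(PARAMS, bits_per_weight(WEIGHT_BITS, SCALE_BITS, GROUP_SIZE))
print(f"BF16: {bf16_gb:.1f} GB, NVFP4 weights only: {nvfp4_gb:.2f} GB")
```

The estimate lands below the published 7.7 GB. One plausible reason is that some components stay in BF16 (the card's notes mention the vision/audio embedders), but the card does not itemize the difference.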
Quickstart
Requires a Blackwell GPU (SM120 / RTX 50-series / GB10 / B100/B200), Docker with the NVIDIA runtime, and the `hf` CLI. Gemma 4 unified is brand new, so you need vLLM nightly (releases ≤ 0.22.1 lack the `Gemma4Unified` class).
# 1) Download this model (7.7 GB). For spec-decode, also grab the 0.4B MTP draft.
hf download sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 --local-dir ./model
hf download google/gemma-4-12B-it-assistant --local-dir ./draft # optional, for spec-decode
# 2a) Simplest — single GPU, no speculative decode
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b --max-model-len 65536 \
--gpu-memory-utilization 0.92 --trust-remote-code
Multi-GPU: read this if your box has no NVLink
On consumer/entry Blackwell (e.g. RTX PRO 2000) over plain PCIe there is no working GPU P2P, and vLLM tensor-parallel hangs unless you disable both NCCL P2P and vLLM’s custom all-reduce:
# NCCL_P2P_DISABLE=1: without this, startup hangs at NCCL init.
# --disable-custom-all-reduce: without this, the forward pass deadlocks.
# (Comments live up here because text after a trailing backslash breaks the line continuation.)
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
--max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
Maximum interactive speed: TP=4 + MTP speculative decode
Google ships a 0.4B MTP draft (google/gemma-4-12B-it-assistant). It nearly doubles single-stream throughput (lossless: the target verifies every token). Use `num_speculative_tokens: 3` (the stable optimum; k≥5 collapses acceptance) and `--kv-cache-dtype fp8` (NVFP4 KV would break the draft):
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
-v $PWD/model:/model:ro -v $PWD/draft:/draft:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b \
--tensor-parallel-size 4 --disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":3}' \
--max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
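To illustrate why speculative decoding can be described as lossless, here is a minimal sketch of one draft-then-verify step under greedy decoding. It is not vLLM's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 12B model and the 0.4B draft, and a real engine scores all k proposed positions in a single batched forward pass instead of calling the target in a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function: context -> token id


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 3) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy decoding)."""
    # 1) The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3) All k proposals matched: the target's pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 3) -> List[int]:
    """Generate n_tokens by repeating speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


# Hypothetical toy models: the draft agrees with the target except at some positions.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50
```

Every emitted token is one the target itself would have chosen, so the output matches plain greedy decoding with the target; the draft only changes how many tokens are confirmed per target pass. With sampling instead of greedy decoding, engines use a rejection-sampling variant that this sketch does not cover.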
Test it:
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
'{"model":"gemma4-12b","messages":[{"role":"user","content":"Explain the CAP theorem in one sentence."}]}'
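The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server started by one of the docker commands above is listening on localhost:8000 and serving the name `gemma4-12b`; the helper names `build_chat_request` and `ask` are illustrative, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port published by the docker commands above
MODEL_NAME = "gemma4-12b"           # matches --served-model-name


def build_chat_request(prompt: str, model: str = MODEL_NAME,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is up:
#   print(ask("Explain the CAP theorem in one sentence."))
```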
Flag cheat-sheet

| Flag / env | When | Why |
|---|---|---|
| `vllm/vllm-openai:nightly` | always | only nightly registers `Gemma4UnifiedForConditionalGeneration` |
| `--trust-remote-code` | always | new arch |
| `NCCL_P2P_DISABLE=1` (env) | TP > 1 on no-NVLink | else hangs at NCCL init |
| `--disable-custom-all-reduce` | TP > 1 on no-NVLink | else the forward deadlocks |
| `--ipc=host --shm-size 16gb` | TP > 1 (docker) | host-path NCCL needs shared memory |
| `--speculative-config '{"method":"mtp",…,"num_speculative_tokens":3}'` | interactive | ~1.6–1.7× single-stream |
| `--kv-cache-dtype fp8` | with spec-decode | nvfp4 KV collapses draft acceptance |
| `--max-num-seqs 4` (+ `--gpu-memory-utilization 0.95`) | single GPU, long ctx | frees KV room for up to `-c 32768` on 16 GB |
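The cheat-sheet rules can be encoded as a small helper that decides which environment variables and vLLM arguments a given setup needs. This is a sketch only: the function `vllm_launch_options` and its parameters are hypothetical, and it covers just the conditional flags listed in the table, not the docker options or the context-length settings.

```python
import json
from typing import Dict, List, Tuple


def vllm_launch_options(tp: int, has_nvlink: bool, spec_decode: bool,
                        draft_path: str = "/draft") -> Tuple[Dict[str, str], List[str]]:
    """Return (env, args) following the cheat-sheet's 'When' column."""
    env: Dict[str, str] = {}
    args: List[str] = ["--trust-remote-code"]  # "always" per the table

    if tp > 1:
        args += ["--tensor-parallel-size", str(tp)]
        if not has_nvlink:
            env["NCCL_P2P_DISABLE"] = "1"               # else hangs at NCCL init
            args.append("--disable-custom-all-reduce")  # else the forward deadlocks

    if spec_decode:
        cfg = {"method": "mtp", "model": draft_path, "num_speculative_tokens": 3}
        # FP8 KV cache goes with spec-decode: nvfp4 KV collapses draft acceptance.
        args += ["--kv-cache-dtype", "fp8", "--speculative-config", json.dumps(cfg)]

    return env, args


# Example: the 4-GPU, no-NVLink, MTP configuration from the quickstart.
env, args = vllm_launch_options(tp=4, has_nvlink=False, spec_decode=True)
```

Keeping the decision logic in one place makes it harder to forget one of the two no-NVLink workarounds, which the card says must be applied together.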
Benchmarks
Measured on 4× RTX PRO 2000 Blackwell (16 GB, SM120, 288 GB/s, PCIe, no NVLink), TP=4, `-c 65536`.
Single-stream decode (interactive), TP sweep, 1 request × 512 tok:

| TP | GPUs | no spec (tok/s) | + MTP, k=3 (tok/s) | MTP gain |
|---|---|---|---|---|
| 1 | 1 | 30.5 | 55.0 | 1.80× |
| 2 | 2 | 53.2 | 94.8 | 1.78× |
| 4 | 4 | 73.3 | 118.5 | 1.62× |

(TP=4 + MTP peaks at 121.0 with k=4, but k=3 is the stable optimum.) MTP gives a steady **~1.6–1.8×** at every TP. TP scaling is sub-linear on this no-NVLink box (host-memory all-reduce). Pick by what you have:

| goal | config | single-stream (tok/s) | GPUs freed |
|---|---|---|---|
| low-power, 1-GPU resident | TP=1 + MTP | 55 | 5 |
| balanced | TP=2 + MTP | 95 | 4 |
| fastest interactive | TP=4 + MTP | 118 | 2 |

Aggregate throughput (concurrency sweep, no spec-decode):

| concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| tok/s (`-c 65536`) | 73 | 145 | 274 | 487 | 796 | 1117 |
| tok/s (`-c 131072`) | 74 | 145 | 275 | 498 | 792 | 1100 |

64K and 128K context decode identically (sliding-window KV). Rule: use MTP spec-decode for low concurrency (≤8); turn it off for high-concurrency batch serving (it costs throughput once the batch saturates).
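The derived figures in this section (MTP gain, sub-linear TP scaling, per-request speed under load) can be recomputed from the published numbers. The sketch below uses only values quoted in the benchmark tables; the function names and the definition of scaling efficiency as speedup over TP=1 divided by GPU count are choices made here for illustration.

```python
# TP -> (no-spec tok/s, MTP k=3 tok/s), single-stream, from the TP sweep.
SINGLE_STREAM = {1: (30.5, 55.0), 2: (53.2, 94.8), 4: (73.3, 118.5)}
# concurrency -> aggregate tok/s at -c 65536, no spec-decode.
AGGREGATE_64K = {1: 73, 2: 145, 4: 274, 8: 487, 16: 796, 32: 1117}


def mtp_gain(tp: int) -> float:
    """Single-stream speedup from MTP speculative decoding at a given TP."""
    base, with_mtp = SINGLE_STREAM[tp]
    return with_mtp / base


def tp_efficiency(tp: int) -> float:
    """No-spec speedup over TP=1 divided by GPU count (1.0 would be linear scaling)."""
    return (SINGLE_STREAM[tp][0] / SINGLE_STREAM[1][0]) / tp


def per_request_tok_s(concurrency: int) -> float:
    """Average tok/s each request sees when the aggregate is shared evenly."""
    return AGGREGATE_64K[concurrency] / concurrency


for tp in SINGLE_STREAM:
    print(f"TP={tp}: MTP gain {mtp_gain(tp):.2f}x, scaling efficiency {tp_efficiency(tp):.2f}")
print(f"32 concurrent requests: {per_request_tok_s(32):.1f} tok/s each")
```

Under this definition TP=4 reaches about 60% of linear scaling, which matches the card's remark about host-memory all-reduce, and at concurrency 32 each request averages roughly 35 tok/s, well below the single-stream rates.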
Quality: measured vs BF16 base and an FP8 build (same huihui base)
Greedy side-by-side on EN / 繁體中文 / 日本語 / code / facts / reasoning traps:
- **Standard tasks: identical.** Facts (Chernobyl: April 1986, reactor 4), Traditional-Chinese & Japanese explanations, 17×23−100 = 291, 60 km / 45 min = 80 km/h, code: NVFP4 = FP8 = BF16 base, no collapse, no drift.
- **Hard reasoning traps (7 tested): a small, real W4A16 tax.** FP8 matched the BF16 base on every trap the base got right; NVFP4 slipped on ~1 of 7 (it answered a Barbara-type syllogism “Yes” where No is correct, plus one minor secondary-detail slip). One age word problem fails even on the BF16 base: a model limit, not a quant artifact.

Verdict: half the size and faster than FP8, at standard-task parity. Choose FP8 for maximum reasoning fidelity; choose this NVFP4A16 for the best size/speed at ~85–90% reasoning parity, the right default for most local-agent and chat workloads.
Notes
- Abliterated (uncensored). Use responsibly.
- NVFP4 is Blackwell-specific; it will not run on Ampere/Hopper.
- Multimodal vision/audio embedders kept in BF16.
Credits
- Base model & abliteration: huihui-ai
- Original model: Google DeepMind (Gemma 4)
- **Quantization & serving recipe:** Lna-Lab · Tooling: llm-compressor / vLLM
Support the Base Model Author (huihui-ai)
If you find the abliterated base useful, please support huihui-ai:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
Crown 👑 (@barackomaba): I’m seeing the light. The Gemma models are actually extremely good.
The 12b might be even better at Hermes than qwen3.6 35b.
My AMD Strix Halo gets 115 TPS+ with 26b QAT MTP
New quality tests run : https://t.co/UcZiLjFDcf https://t.co/Has9oJrz7Y
@usr_bin_roygbiv