@Tono_Ken3: I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work Ye…


Summary

A tweet highlights that the abliterated, NVFP4 quantized Gemma-4-12B model (7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, demonstrating significant efficiency gains.

I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work. Yeah, 12b can handle real work. It's fast! https://huggingface.co/sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16…

Cached at: 06/15/26, 12:54 AM


sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 · Hugging Face

Source: https://huggingface.co/sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16

Huihui-gemma-4-12B-it-abliterated-NVFP4A16

NVFP4 (W4A16) quantization of huihui-ai/Huihui-gemma-4-12B-it-abliterated, the abliterated (uncensored) Gemma 4 12B unified model (text + vision + audio).

24 GB → 7.7 GB. Runs on a single 16 GB Blackwell GPU, or shards across several for higher throughput. Up to 118 tok/s single-stream (TP=4 + MTP speculative decode) and ~1117 tok/s aggregate.

| Field | Value |
| --- | --- |
| Base | huihui-ai/Huihui-gemma-4-12B-it-abliterated (abliterated google/gemma-4-12B-it) |
| Architecture | Gemma4UnifiedForConditionalGeneration, 12B dense, 48 layers, 131K ctx |
| Quantization | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | compressed-tensors / nvfp4-pack-quantized (native vLLM) |
| Tool | llm-compressor |
| Size | 7.7 GB · Requires NVIDIA Blackwell (SM120) |

Weight-only FP4 (W4A16) keeps activations at BF16, so it is robust where full W4A4 NVFP4 collapses on this architecture.
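The quantization parameters above (FP4 weights, one FP8 scale per group of 16) allow a rough size estimate. The sketch below is a back-of-the-envelope calculation, not the model card's own method. It assumes "12B" means exactly 12e9 weights and that every weight is quantized, which the card does not state.

```python
# Back-of-the-envelope size estimate for weight-only NVFP4 (W4A16).
PARAMS = 12e9        # assumption: "12B" read as exactly 12e9 weights, all quantized
WEIGHT_BITS = 4      # FP4 weights (from the model card)
SCALE_BITS = 8       # one FP8 scale per group (from the model card)
GROUP_SIZE = 16      # group size (from the model card)


def effective_bits_per_weight(weight_bits, scale_bits, group_size):
    """Weight bits plus the per-group scale amortised over the group."""
    return weight_bits + scale_bits / group_size


def size_gb(params, bits_per_weight):
    """Storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


bits = effective_bits_per_weight(WEIGHT_BITS, SCALE_BITS, GROUP_SIZE)
bf16_gb = size_gb(PARAMS, 16)
nvfp4_gb = size_gb(PARAMS, bits)

print(f"effective bits/weight: {bits}")
print(f"BF16 estimate:  {bf16_gb:.2f} GB")
print(f"NVFP4 estimate: {nvfp4_gb:.2f} GB")
print(f"reported ratio 24 / 7.7 = {24 / 7.7:.2f}x")
```

Under these assumptions the estimate is 4.5 effective bits per weight and about 6.75 GB, somewhat below the reported 7.7 GB. Parts of the model kept at higher precision would plausibly account for the gap, but that attribution is a guess rather than something the card confirms.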


Quickstart

Requires a Blackwell GPU (SM120 / RTX 50-series / GB10 / B100/B200), Docker with the NVIDIA runtime, and the hf CLI. Gemma 4 unified is brand new, so you need vLLM nightly (releases ≤ 0.22.1 lack the Gemma4Unified class).

# 1) Download this model (7.7 GB). For spec-decode, also grab the 0.4B MTP draft.
hf download sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 --local-dir ./model
hf download google/gemma-4-12B-it-assistant --local-dir ./draft   # optional, for spec-decode

# 2a) Simplest — single GPU, no speculative decode
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b --max-model-len 65536 \
  --gpu-memory-utilization 0.92 --trust-remote-code

Multi-GPU: read this if your box has no NVLink

On consumer/entry Blackwell (e.g. RTX PRO 2000) over plain PCIe there is no working GPU P2P, and vLLM tensor-parallel hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:

# NCCL_P2P_DISABLE=1: without it, vLLM hangs at NCCL init.
# --disable-custom-all-reduce: without it, the forward pass deadlocks.
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code

Maximum interactive speed: TP=4 + MTP speculative decode

Google ships a 0.4B MTP draft (google/gemma-4-12B-it-assistant). It nearly doubles single-stream throughput (lossless: the target verifies every token). Use num_speculative_tokens: 3 (the stable optimum; k≥5 collapses acceptance) and --kv-cache-dtype fp8 (NVFP4 KV would break the draft):

docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro -v $PWD/draft:/draft:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 --disable-custom-all-reduce \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":3}' \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code

Test it:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
 '{"model":"gemma4-12b","messages":[{"role":"user","content":"Explain the CAP theorem in one sentence."}]}'
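The same request can be sent from Python using only the standard library. This is a minimal sketch: it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion JSON. The helper names `build_chat_request` and `ask` are illustrative, not part of any library.

```python
import json
import urllib.request


def build_chat_request(prompt, model="gemma4-12b", base_url="http://localhost:8000"):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.
    Assumes the response follows the OpenAI chat completion schema."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the container running, `print(ask("Explain the CAP theorem in one sentence."))` should print the model's answer.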

Flag cheat-sheet

| Flag / env | When | Why |
| --- | --- | --- |
| vllm/vllm-openai:nightly | always | only nightly registers Gemma4UnifiedForConditionalGeneration |
| --trust-remote-code | always | new arch |
| NCCL_P2P_DISABLE=1 (env) | TP > 1 on no-NVLink | else hangs at NCCL init |
| --disable-custom-all-reduce | TP > 1 on no-NVLink | else the forward deadlocks |
| --ipc=host --shm-size 16gb | TP > 1 (docker) | host-path NCCL needs shared memory |
| --speculative-config '{"method":"mtp",…,"num_speculative_tokens":3}' | interactive | ~1.6–1.7× single-stream |
| --kv-cache-dtype fp8 | with spec-decode | nvfp4 KV collapses draft acceptance |
| --max-num-seqs 4 (+ --gpu-memory-utilization 0.95) | single GPU, long ctx | frees KV room for up to -c 32768 on 16 GB |
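The cheat-sheet rules can be encoded in a small helper that assembles the docker command for a given setup. This is a sketch under stated assumptions: the function `vllm_docker_command` and its parameters are hypothetical, and it uses only the flags and values that appear in the commands above.

```python
import json
import os
import shlex


def vllm_docker_command(tp=1, has_nvlink=False, spec_decode=False,
                        model_dir="./model", draft_dir="./draft"):
    """Return the docker argv for the setups shown in this card (hypothetical helper)."""
    gpus = ",".join(str(i) for i in range(tp))
    cmd = ["docker", "run", "--rm", "--gpus", f'"device={gpus}"',
           "--ipc=host", "--shm-size", "16gb", "-p", "8000:8000"]
    no_p2p = tp > 1 and not has_nvlink
    if no_p2p:
        cmd += ["-e", "NCCL_P2P_DISABLE=1"]          # else hangs at NCCL init
    cmd += ["-v", f"{os.path.abspath(model_dir)}:/model:ro"]
    if spec_decode:
        cmd += ["-v", f"{os.path.abspath(draft_dir)}:/draft:ro"]
    cmd += ["vllm/vllm-openai:nightly",
            "--model", "/model", "--served-model-name", "gemma4-12b"]
    if tp > 1:
        cmd += ["--tensor-parallel-size", str(tp)]
    if no_p2p:
        cmd += ["--disable-custom-all-reduce"]       # else the forward deadlocks
    if spec_decode:
        spec = {"method": "mtp", "model": "/draft", "num_speculative_tokens": 3}
        cmd += ["--kv-cache-dtype", "fp8",
                "--speculative-config", json.dumps(spec)]
    # 0.92 for the single-GPU example, 0.85 for the multi-GPU examples.
    utilization = "0.92" if tp == 1 else "0.85"
    cmd += ["--max-model-len", "65536",
            "--gpu-memory-utilization", utilization, "--trust-remote-code"]
    return cmd


print(shlex.join(vllm_docker_command(tp=4, spec_decode=True)))
```

Returning an argv list instead of a string keeps quoting explicit, so the result can be passed to `subprocess.run` directly or printed with `shlex.join` for copy and paste.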


Benchmarks

Measured on 4× RTX PRO 2000 Blackwell (16 GB, SM120, 288 GB/s, PCIe, no NVLink), TP=4, -c 65536.

Single-stream decode (interactive) — TP sweep, 1 request × 512 tok:

| TP | GPUs | no spec (tok/s) | + MTP (k=3) (tok/s) | MTP gain |
| --- | --- | --- | --- | --- |
| 1 | 1 | 30.5 | 55.0 | 1.80× |
| 2 | 2 | 53.2 | 94.8 | 1.78× |
| 4 | 4 | 73.3 | 118.5 | 1.62× |

(TP=4 + MTP peaks at 121.0 with k=4, but k=3 is the stable optimum.) MTP gives a steady ~1.6–1.8× at every TP. TP scaling is sub-linear on this no-NVLink box (host-memory all-reduce). Pick by what you have:

| goal | config | single-stream (tok/s) | GPUs freed |
| --- | --- | --- | --- |
| low-power, 1-GPU resident | TP=1 + MTP | 55 | 5 |
| balanced | TP=2 + MTP | 95 | 4 |
| fastest interactive | TP=4 + MTP | 118 | 2 |

Aggregate throughput (concurrency sweep, no spec-decode):

| concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
| --- | --- | --- | --- | --- | --- | --- |
| tok/s (-c 65536) | 73 | 145 | 274 | 487 | 796 | 1117 |
| tok/s (-c 131072) | 74 | 145 | 275 | 498 | 792 | 1100 |

64K and 128K context decode identically (sliding-window KV). Rule: MTP spec-decode for low concurrency (≤8); turn it off for high-concurrency batch serving (it costs throughput once the batch saturates).
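The gain and scaling figures can be recomputed from the reported numbers. The sketch below uses only values from the benchmark tables. The "efficiency" metrics (throughput divided by ideal linear scaling) are an illustrative choice of mine, not something the model card reports.

```python
# (TP, tok/s without spec-decode, tok/s with MTP k=3), from the TP sweep
tp_sweep = [(1, 30.5, 55.0), (2, 53.2, 94.8), (4, 73.3, 118.5)]

# concurrency -> aggregate tok/s at -c 65536, from the concurrency sweep
aggregate = {1: 73, 2: 145, 4: 274, 8: 487, 16: 796, 32: 1117}


def mtp_gain(base, with_mtp):
    """Single-stream speed-up from MTP speculative decoding."""
    return with_mtp / base


def scaling_efficiency(n, throughput, single):
    """Measured throughput relative to perfect linear scaling from n=1."""
    return throughput / (n * single)


gains = [round(mtp_gain(base, mtp), 2) for _, base, mtp in tp_sweep]
tp_eff = {tp: round(scaling_efficiency(tp, base, tp_sweep[0][1]), 2)
          for tp, base, _ in tp_sweep}
batch_eff = {c: round(scaling_efficiency(c, t, aggregate[1]), 2)
             for c, t in aggregate.items()}

print("MTP gain per TP:", gains)
print("TP scaling efficiency:", tp_eff)
print("concurrency scaling efficiency:", batch_eff)
print("per-stream tok/s at 32 concurrent:", round(aggregate[32] / 32, 1))
```

The computed gains reproduce the table's 1.80×, 1.78×, and 1.62×. The TP efficiency falls to roughly 0.6 at TP=4, which matches the card's note that tensor-parallel scaling is sub-linear without NVLink.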

Quality: measured vs BF16 base and an FP8 build (same huihui base)

Greedy side-by-side on EN / 繁體中文 / 日本語 / code / facts / reasoning traps:

  • **Standard tasks: identical.** Facts (Chernobyl: April 1986, reactor 4), Traditional-Chinese & Japanese explanations, 17×23−100 = 291, 60 km / 45 min = 80 km/h, code: NVFP4 = FP8 = BF16 base, no collapse, no drift.
  • **Hard reasoning traps (7 tested): a small, real W4A16 tax.** FP8 matched the BF16 base on every trap the base got right; NVFP4 slipped on ~1 of 7 (it answered a Barbara-type syllogism “Yes” where No is correct, plus one minor secondary-detail slip). One age word problem fails even on the BF16 base, which is a model limit, not a quant artifact.

Verdict: half the size and faster than FP8, at standard-task parity. Choose FP8 for maximum reasoning fidelity; choose this NVFP4A16 for the best size/speed at ~85–90% reasoning parity, the right default for most local-agent and chat workloads.
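The arithmetic behind these checks is easy to verify. The sketch below confirms the two standard-task answers quoted above and shows one reading of the parity figure: if NVFP4 slipped on about 1 of 7 traps, 6 of 7 is about 86%. Deriving the "~85–90%" figure this way is an assumption; the card does not say how it was calculated.

```python
# Standard-task answers quoted in the quality section
product_check = 17 * 23 - 100          # expected 291
speed_kmh = 60 / (45 / 60)             # 60 km in 45 minutes, in km/h

# One reading of "~85-90% reasoning parity": 6 of 7 traps still correct
traps_tested = 7
traps_slipped = 1                      # "~1 of 7" in the card
parity = (traps_tested - traps_slipped) / traps_tested

print(product_check)
print(speed_kmh)
print(f"{parity:.1%}")
```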

Notes

  • Abliterated (uncensored). Use responsibly.
  • NVFP4 is Blackwell-specific; it will not run on Ampere/Hopper.
  • Multimodal vision/audio embedders kept in BF16.

Credits

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai:

Crown 👑 (@barackomaba): I’m seeing the light. The Gemma models are actually extremely good.

The 12b might be even better at Hermes than qwen3.6 35b.

My AMD Strix Halo gets 115 TPS+ with 26b QAT MTP

New quality tests run : https://t.co/UcZiLjFDcf https://t.co/Has9oJrz7Y

@usr_bin_roygbiv
