@SpaceTimeViking: I have one version that maintains BF16 Attention layers, and another mixed-precision quant with NVFP4 weights and FP8 At…

Summary

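A mixed-precision quantization of Google's Gemma-4-12B-it model that uses NVFP4 for the MLP weights and FP8 for the attention layers. The author reports a 25% smaller footprint than the NVFP4-MLP-only variant (the model card lists 9.3 GB vs 11.7 GB, about 20%) and higher throughput, at the same or better benchmark quality.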

@sakurayukiai I have one version that maintains BF16 attention layers, and another mixed-precision quant with NVFP4 weights and FP8 attention layers, saving 25% additional footprint vs NVFP4 MLP-only, all while maintaining the same or better output quality on benchmarks https://t.co/1ISwtdI1IU

Cached at: 06/08/26, 03:22 PM


AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8 · Hugging Face

Source: https://huggingface.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8

Gemma-4-12B-it AEON Abliterated: K=4 Biprojection (Mixed NVFP4 + FP8)

The smallest and fastest variant: 9.3 GB, 21 tok/s single-stream / 318 tok/s concurrent on a DGX Spark.

A mixed-precision quantization of our K=4 biprojection abliteration of google/gemma-4-12B-it: 4-bit NVFP4 on the MLP weights, 8-bit FP8 on the attention. Delivers the NVFP4-MLP-only sibling's reasoning quality (MMLU 76.8) at 20% smaller size and 34% faster single-stream, with stronger coding. Loads in vLLM with --quantization modelopt. **This is the maximum-density / maximum-throughput pick.** When peak reasoning quality matters, use the near-lossless FP8 sibling (13 GB).

Refusal behavior has been removed; the model responds to a wide range of prompts the base would decline. Operator-side safety is your responsibility — see the arbitration clause at the bottom.


🚀 QuickStart

Docker (recommended, DGX Spark / Blackwell)

# 1. Download
huggingface-cli download AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8 \
  --local-dir ./Gemma-4-12B-AEON-K4-NVFP4-FP8

# 2. Serve
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
  -v $(pwd)/Gemma-4-12B-AEON-K4-NVFP4-FP8:/model:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --served-model-name gemma12b \
    --quantization modelopt \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code

# 3. Call (OpenAI-compatible)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma12b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
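
The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server from step 2 is listening on localhost:8000 and serving the model as gemma12b; the helper names (build_chat_payload, extract_reply, chat) are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumption: endpoint and model name match the docker command above.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_payload(prompt, model="gemma12b", max_tokens=128):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull the assistant text out of a parsed chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, url=BASE_URL, timeout=60):
    """Send one prompt to the running server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Usage (needs the server to be up):
#   print(chat("Hello!"))
```

Splitting payload construction and response parsing from the network call keeps the two pure helpers easy to check without a GPU.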

Plain vLLM

pip install "vllm>=0.22.2" "nvidia-modelopt>=0.43" "transformers>=5.10"
vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8 \
  --quantization modelopt --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 16 \
  --gpu-memory-utilization 0.85 --trust-remote-code

⚠️ Needs vLLM ≥ 0.22.2 (for the Gemma4UnifiedForConditionalGeneration loader and the MIXED_PRECISION modelopt path) and a Blackwell GPU (DGX Spark GB10 sm_121a, B100/B200 sm_100, RTX 50-series sm_120) for the NVFP4 GEMM. The AEON vLLM Ultimate container ships the loader pre-built for sm_121a. On Hopper/Ampere the FP4 weights dequantize to BF16 (no speed benefit); use the FP8 sibling instead.
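
The hardware rule in the warning can be written as a small check on the CUDA compute capability. This is a rough sketch, not part of the release: pick_variant is a hypothetical helper, and it assumes that every Blackwell part reports a major version of 10 or higher (sm_100, sm_120, sm_121a) while Hopper reports 9.x and Ampere 8.x.

```python
def pick_variant(major, minor=0):
    """Suggest a variant from a CUDA compute capability (major, minor).

    Assumption: major >= 10 means Blackwell (sm_100, sm_120, sm_121a);
    9.x is Hopper and 8.x is Ampere. The minor version is accepted so a
    (major, minor) tuple can be unpacked directly, but it is not used.
    """
    if major >= 10:
        return "mixed NVFP4 + FP8"  # native NVFP4 GEMM is available
    return "FP8 sibling"  # FP4 weights would dequantize to BF16 here


# With PyTorch installed, the capability can be read from the device:
#   import torch
#   print(pick_variant(*torch.cuda.get_device_capability()))
```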

That’s it. Everything below is detail.


Why mixed precision: measured capability

The full-FP4 (W4A4) NVFP4 quant costs ~21pp on hard reasoning because the FP4 activations perturb precise multi-step logit propagation. This model sidesteps that: the bulk MLP weights go 4-bit NVFP4 (where most of the size lives), while the reasoning-sensitive attention stays at 8-bit FP8. The result keeps the 4-bit size/speed advantage without the W4A4 reasoning collapse.

All axes evaluated through the vLLM serving path, with identical prompts/settings for every model. MMLU is the balanced 285-question set (5 × all 57 subjects): a diverse measure, not the worst-case single-subject slice.

| Capability axis | BF16 (ref) | FP8 | **Mixed (this)** | NVFP4 MLP-only |
|---|---|---|---|---|
| MMLU (balanced, N=285) | 80.4% | 80.4% | **76.8%** | 76.8% |
| HumanEval syntactic (N=164) | 99.4% | 99.4% | **97.0%** | 96.3% |
| HumanEval functional (N=164) | 83.5% | 85.4% | **81.7%** | 76.2% |
| IFEval (N=50) | 90.0% | 90.0% | **90.0%** | 90.0% |

**vs the NVFP4-MLP-only sibling, this model matches MMLU and IFEval, and is better on coding (HumanEval functional +5.5pp), at 20% smaller size.** It does not reach FP8's reasoning (the shared NVFP4 MLP is the bottleneck); the FP8 sibling remains the quality pick.
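
The percentage-point comparisons can be reproduced from the reported scores. The sketch below copies the numbers from the capability table; delta_pp and correct_count are illustrative helpers, and the item counts are back-computed from the rounded percentages rather than taken from raw eval logs.

```python
# Scores in percent, copied from the capability table.
SCORES = {
    "MMLU": {"BF16": 80.4, "FP8": 80.4, "Mixed": 76.8, "NVFP4-MLP": 76.8},
    "HumanEval syntactic": {"BF16": 99.4, "FP8": 99.4, "Mixed": 97.0, "NVFP4-MLP": 96.3},
    "HumanEval functional": {"BF16": 83.5, "FP8": 85.4, "Mixed": 81.7, "NVFP4-MLP": 76.2},
    "IFEval": {"BF16": 90.0, "FP8": 90.0, "Mixed": 90.0, "NVFP4-MLP": 90.0},
}


def delta_pp(axis, model, baseline):
    """Difference in percentage points between two variants on one axis."""
    return round(SCORES[axis][model] - SCORES[axis][baseline], 1)


def correct_count(percent, n):
    """Approximate number of correct items implied by a rounded percentage."""
    return round(percent / 100 * n)


print(delta_pp("HumanEval functional", "Mixed", "NVFP4-MLP"))  # 5.5
print(delta_pp("MMLU", "Mixed", "BF16"))                        # -3.6
print(correct_count(81.7, 164))                                 # 134
```

With N=164, one HumanEval problem is worth about 0.6pp, so the 5.5pp functional gap corresponds to roughly nine problems.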

Throughput (DGX Spark GB10, FP8 KV cache, greedy)

| Metric | Mixed (this) | FP8 | NVFP4 MLP-only | BF16 |
|---|---|---|---|---|
| Size | 9.3 GB | 13 GB | 11.7 GB | 24 GB |
| Single-stream overall | 21.1 tok/s | 15.8 tok/s | 15.7 tok/s | 7.7 tok/s |
| Single-stream TTFT median | 110 ms | 143 ms | - | - |
| Concurrent ×16 aggregate | 318 tok/s | 226 tok/s | 254 tok/s | 144 tok/s |

Fastest variant by a clear margin: +34% single-stream and +25–41% concurrent vs the other usable quants, at the smallest footprint. On a memory-bandwidth-bound box like the Spark, the 9.3 GB footprint is what drives the single-stream win.
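
The quoted percentages follow directly from the throughput table, and a rough bandwidth estimate supports the memory-bound reading. The sketch below uses the table's figures; implied_gb_per_s is a back-of-envelope helper that assumes every weight byte is read once per generated token, which ignores KV-cache traffic and any caching effects.

```python
# Figures copied from the throughput table.
VARIANTS = {
    "Mixed": {"size_gb": 9.3, "single": 21.1, "concurrent": 318},
    "FP8": {"size_gb": 13.0, "single": 15.8, "concurrent": 226},
    "NVFP4-MLP": {"size_gb": 11.7, "single": 15.7, "concurrent": 254},
    "BF16": {"size_gb": 24.0, "single": 7.7, "concurrent": 144},
}


def pct_change(new, old):
    """Relative change of new versus old, in percent (one decimal)."""
    return round((new / old - 1) * 100, 1)


def implied_gb_per_s(name):
    """Weight bytes streamed per second if each token reads all weights once."""
    v = VARIANTS[name]
    return round(v["size_gb"] * v["single"])


mixed = VARIANTS["Mixed"]
for other in ("FP8", "NVFP4-MLP"):
    o = VARIANTS[other]
    print(other,
          pct_change(mixed["single"], o["single"]),
          pct_change(mixed["concurrent"], o["concurrent"]),
          pct_change(mixed["size_gb"], o["size_gb"]))
# FP8 33.5 40.7 -28.5
# NVFP4-MLP 34.4 25.2 -20.5

print({name: implied_gb_per_s(name) for name in VARIANTS})
# {'Mixed': 196, 'FP8': 205, 'NVFP4-MLP': 184, 'BF16': 185}
```

Under that assumption all four variants land in a similar 184 to 205 GB/s band, which is what one would expect if single-stream decode is limited by how fast weights can be streamed rather than by compute.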

Quantization methodology

| Property | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.44 |
| Format | MIXED_PRECISION: per-layer NVFP4 + FP8 |
| MLP (gate_proj / up_proj / down_proj) | NVFP4 (4-bit float E2M1, block size 16, E4M3 block scales) |
| Attention (q/k/v/o_proj) | FP8 (E4M3, per-tensor) |
| Calibration | 2048 × CNN/DailyMail validation @ 1024 tokens, native sm_121a |
| Model size | ~9.3 GB (from 23.9 GB BF16, a 61% reduction) |
| Runtime | vLLM --quantization modelopt (modelopt_mixed) via Gemma4UnifiedForConditionalGeneration |

vLLM dispatches each layer to its own kernel: FlashInferCutlassNvFp4LinearKernel for the MLP, FlashInferFP8ScaledMMLinearKernel for the attention.
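
To make the NVFP4 row concrete, here is a toy round trip for one 16-element block on the E2M1 value grid. It is a simplified illustration, not ModelOpt's implementation: the block scale is kept as a plain Python float instead of being rounded to E4M3, there is no calibration, and the function names are made up for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(block):
    """Quantize one block: pick a scale, then snap each value to the grid."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and grid values."""
    return [c * scale for c in codes]


def effective_bits_per_weight(value_bits=4, scale_bits=8, block=BLOCK_SIZE):
    """Storage cost per weight: 4-bit values plus one 8-bit scale per block."""
    return (value_bits * block + scale_bits) / block


scale, codes = quantize_block([6.0, -3.0, 0.5, 0.0] * 4)
print(scale, dequantize_block(scale, codes)[:4])  # 1.0 [6.0, -3.0, 0.5, 0.0]
print(effective_bits_per_weight())                # 4.5
```

The 4.5 bits per weight covers only the MLP tensors. The attention projections at 8 bits and the BF16 embeddings and lm_head bring the whole checkpoint to the reported ~9.3 GB.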

Kept at full BF16

lm_head, model.language_model.embed_tokens, model.embed_vision*, model.embed_audio*, model.vision_embedder*. (A small number of attention projections that ModelOpt's calibration left unquantized also remain BF16: higher precision, no downside.)

vLLM loader notes (for reproducers)

  1. Google's Gemma-4-12B is the encoder-free Gemma4UnifiedForConditionalGeneration. ModelOpt's HF export needs two touch-ups to load in vLLM: rename the vision keys to vLLM's vision_embedder.* layout, and add model.vision_embedder* to the quant ignore list. Both are scripted in make_vllm_ready.py (gemma4-nvfp4/).
  2. **The attention layers must use per-tensor FP8 (FP8_DEFAULT_CFG), not FP8_PER_CHANNEL_PER_TOKEN.** vLLM's ModelOptMixedPrecisionConfig only routes per-layer quant_algo ∈ {FP8, NVFP4, W4A16_NVFP4}; the per-channel/per-token variant exports as FP8_PER_CHANNEL_PER_TOKEN, which falls through to the unquantized path and fails to load. The full mixed recipe is in quantize_k4_nvfp4.py --recipe mixed_mlp_nvfp4_attn_fp8.
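
A pre-flight check for the second note might look like the sketch below. It is an assumption-laden illustration: it takes a plain mapping of layer name to quant_algo string (how that mapping is extracted from the exported quantization config is left out, since the file layout is not described here), and both find_unroutable_layers and the layer names are hypothetical.

```python
# The per-layer algorithms the loader note says vLLM's mixed-precision path routes.
SUPPORTED_ALGOS = {"FP8", "NVFP4", "W4A16_NVFP4"}


def find_unroutable_layers(layer_algos):
    """Return the layers whose quant_algo would fall through to the unquantized path."""
    return {
        name: algo
        for name, algo in layer_algos.items()
        if algo not in SUPPORTED_ALGOS
    }


# Hypothetical layer names, for illustration only.
example = {
    "layers.0.mlp.gate_proj": "NVFP4",
    "layers.0.self_attn.q_proj": "FP8",
    "layers.1.self_attn.q_proj": "FP8_PER_CHANNEL_PER_TOKEN",
}
print(find_unroutable_layers(example))
# {'layers.1.self_attn.q_proj': 'FP8_PER_CHANNEL_PER_TOKEN'}
```

Running a check like this right after export catches a wrong FP8 config before the much slower attempt to load the checkpoint in vLLM.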

Abliteration methodology (inherited from the BF16 base)

K=4 multi-direction norm-preserving biprojection (extends TrevorJS's recipe). Basis layers L24/L37/L39/L26 (top-K by SNR), o_proj + mlp.down_proj edited on 24/48 layers, scale=1.0. See the BF16 card for the full biprojection math + capability comparison vs base.

Behavior

  • Benign prompts: matches the NVFP4-MLP-only sibling (the capability table confirms it numerically).
  • Previously-refused prompts: full responses, usually after a brief disclaimer paragraph.
  • Tool calling via --enable-auto-tool-choice --tool-call-parser gemma4.
  • Multimodal vision path preserved (BF16).
  • KV cache: use --kv-cache-dtype fp8_e4m3 (the published default). Do not combine NVFP4 KV cache with speculative decoding; 4-bit KV collapses draft acceptance.
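
For the tool-calling bullet, a request body in the OpenAI-compatible format could be assembled as in the sketch below. It assumes the server was started with the two tool-calling flags listed above and serves the model as gemma12b; the get_weather tool and the build_tool_call_payload helper are hypothetical and exist only for illustration.

```python
import json


def build_tool_call_payload(prompt, model="gemma12b"):
    """Build a chat completion body that offers the model one callable tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


body = json.dumps(build_tool_call_payload("What is the weather in Oslo?"))
# POST `body` to http://localhost:8000/v1/chat/completions; when the model
# chooses to call the tool, the reply carries the call under
# choices[0].message.tool_calls instead of plain text content.
```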

Available formats

| Variant | Repo | Precision | Size | Pick when |
|---|---|---|---|---|
| FP8 | …-K4-FP8 | FP8 E4M3 | 13 GB | Quality matters: near-lossless, matches BF16 |
| Mixed (this) | …-K4-NVFP4-FP8 | NVFP4 MLP + FP8 attn | 9.3 GB | Smallest + fastest: MLP-only quality, 20% less size, 34% faster |
| NVFP4 MLP-only | …-K4-NVFP4 | NVFP4 4-bit MLP | 11.7 GB | Superseded by the Mixed variant above |
| BF16 | …-K4-BF16 | bfloat16 | 24 GB | Fine-tuning, non-Blackwell hardware |

Acknowledgements

TrevorJS (biprojection), p-e-w/heretic (abliteration framework), NVIDIA ModelOpt (NVFP4/FP8 toolkit + Gemma-4 reference recipes), AEON-7 (K-direction extension, mixed-precision recipe + vLLM loader fixes, capability eval).

License

Inherits the Gemma license.


Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

  1. **Sole Responsibility.** You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm (direct, indirect, consequential, foreseeable, or otherwise) that results from any of the above.
  2. **No Warranty.** This model is provided strictly “AS IS”, without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
  3. **Legal Compliance.** You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
  4. **Operational Safety Layer.** An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
  5. **Heightened Duty of Care.** The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating a base aligned model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
  6. **No Endorsement of Outputs.** The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state, not a statement of position by any human.
  7. **Arbitration.** Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
  8. **Indemnification.** You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
  9. **Severability.** If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
  10. **Acceptance.** Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model’s.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
