Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update

Reddit r/LocalLLaMA Models

Summary

Cyankiwi introduced an updated version of their AWQ 4-bit quantization method that jointly optimizes scales and quantization ranges, achieving lower KL divergence than existing methods on Llama-3 models.

In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent, i.e., the rounding error from one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. Our cyankiwi AWQ 26.05 update jointly fits scales and quantization ranges against a reconstruction objective.

We benchmarked cyankiwi AWQ 26.05 update against every major 4-bit method on Llama-3 as examples, measuring KL Divergence vs the BF16 baseline on GPQA Diamond responses.

Result: cyankiwi posts the lowest KLD on all three base models. Lower is better.

# Llama-3.2-3B-Instruct

|Quantized Model|Method|KLD|
|:-|:-|:-|
|**cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4**|**cyankiwi AWQ INT4**|**0.00510**|
|unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit|unsloth BNB NF4|0.00785|
|unsloth/Llama-3.2-3B-Instruct-bnb-4bit|BNB NF4|0.00896|
|nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4|AWQ INT4|0.01494|
|casperhansen/llama-3.2-3b-instruct-awq|AWQ INT4|0.02437|

# Llama-3.1-8B-Instruct

|Quantized Model|Method|KLD|
|:-|:-|:-|
|**cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4**|**cyankiwi AWQ INT4**|**0.00478**|
|RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16|GPTQ INT4|0.00729|
|unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit|unsloth BNB NF4|0.00769|
|unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit|BNB NF4|0.00835|
|RedHatAI/Llama-3.1-8B-Instruct-NVFP4|SmoothQuant NVFP4|0.01059|
|nvidia/Llama-3.1-8B-Instruct-NVFP4|NVFP4|0.01190|

# Llama-3.3-70B-Instruct

|Quantized Model|Method|KLD|
|:-|:-|:-|
|**cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4**|**cyankiwi AWQ INT4**|**0.02826**|
|unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit|unsloth BNB NF4|0.04444|
|casperhansen/llama-3.3-70b-instruct-awq|AWQ INT4|0.04859|
|unsloth/Llama-3.3-70B-Instruct-bnb-4bit|BNB NF4|0.06879|
|nvidia/Llama-3.3-70B-Instruct-NVFP4|NVFP4|0.08307|
|RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16|GPTQ INT4|0.09272|

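To make the "sequential vs joint" point concrete, here is a toy sketch in NumPy. It is not the cyankiwi implementation, which the post does not describe in detail. Everything here is an assumption for illustration: a single linear layer, symmetric per-output-row INT4 fake quantization, a scale exponent `alpha` applied to mean activation magnitude, a range shrink factor `clip`, and a plain grid search over both. The function names (`recon_error`, `sequential_search`, `joint_search`) are hypothetical.

```python
import numpy as np

def quantize_int4(w, clip):
    """Symmetric per-output-row 4-bit fake quantization; `clip` shrinks the range."""
    max_abs = np.abs(w).max(axis=1, keepdims=True) * clip
    step = np.maximum(max_abs, 1e-8) / 7.0        # integer levels in [-7, 7]
    return np.clip(np.round(w / step), -7, 7) * step

def recon_error(x, w, alpha, clip):
    """MSE between the full-precision output and the scaled, quantized output."""
    act = np.abs(x).mean(axis=0)                  # per-input-channel activation magnitude
    s = np.maximum(act, 1e-8) ** alpha            # assumed scale rule: s = act ** alpha
    s = s / np.sqrt(s.max() * s.min())            # keep scales centred around 1
    w_q = quantize_int4(w * s, clip) / s          # quantize scaled weights, fold scale back
    return float(np.mean((x @ w.T - x @ w_q.T) ** 2))

def sequential_search(x, w, alphas, clips):
    """Pick the scale first (with no clipping), then pick the range for that scale."""
    best_alpha = min(alphas, key=lambda a: recon_error(x, w, a, 1.0))
    best_clip = min(clips, key=lambda c: recon_error(x, w, best_alpha, c))
    return best_alpha, best_clip, recon_error(x, w, best_alpha, best_clip)

def joint_search(x, w, alphas, clips):
    """Search scale and range together against the same reconstruction error."""
    pairs = ((a, c) for a in alphas for c in clips)
    best_alpha, best_clip = min(pairs, key=lambda p: recon_error(x, w, p[0], p[1]))
    return best_alpha, best_clip, recon_error(x, w, best_alpha, best_clip)

# Synthetic data: a few input channels carry much larger activations than others.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64)) * rng.lognormal(sigma=1.0, size=64)
w = rng.normal(size=(32, 64))

alphas = np.linspace(0.0, 1.0, 11)
clips = np.linspace(0.5, 1.0, 11)

seq = sequential_search(x, w, alphas, clips)
joint = joint_search(x, w, alphas, clips)
print(f"sequential: alpha={seq[0]:.2f} clip={seq[1]:.2f} mse={seq[2]:.6f}")
print(f"joint:      alpha={joint[0]:.2f} clip={joint[1]:.2f} mse={joint[2]:.6f}")
```

Because the joint grid contains every pair the sequential search can reach, the joint error can never be higher on the same grid. Whether it is meaningfully lower depends on the layer and the data; a real method would presumably use a cheaper optimizer than an exhaustive grid.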
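For readers unfamiliar with the metric, the following is a minimal sketch of how a per-token KL divergence against a baseline could be computed from logits. The post does not state the exact procedure, so the direction of the divergence (baseline relative to quantized), the averaging over token positions, and the function name `mean_token_kld` are assumptions. In practice the two logit arrays would come from running the BF16 and quantized models over the same token sequences.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kld(ref_logits, quant_logits):
    """Mean over token positions of KL(P_ref || P_quant).

    Both inputs have shape (num_tokens, vocab_size) and must be aligned,
    i.e. row i of each array scores the same position of the same text.
    """
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    per_token = (np.exp(ref_logp) * (ref_logp - quant_logp)).sum(axis=-1)
    return float(per_token.mean())

# Synthetic stand-ins for model outputs: the "quantized" logits are the
# baseline logits plus small noise.
rng = np.random.default_rng(0)
ref = rng.normal(size=(128, 1000))
quant = ref + 0.05 * rng.normal(size=ref.shape)

print(f"mean KLD: {mean_token_kld(ref, quant):.6f}")
```

A value of zero means the quantized model reproduces the baseline's next-token distribution exactly at every position; larger values mean the distributions drift further apart.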
https://preview.redd.it/uicubbg6951h1.png?width=6400&format=png&auto=webp&s=2f7f1d4e46c9953f00c68518b3c2aa058fc34e32