2-bit QAT model releases
Summary
A discussion of the potential of 2-bit quantization-aware training (QAT) for larger MoE models, comparing such models with 4-bit QAT and ternary LLMs and considering whether they could feasibly run on consumer hardware.
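The consumer-hardware question comes down largely to weight storage, which scales linearly with bits per weight. The following is a minimal sketch of that arithmetic, not a measurement. The 100B parameter count is a hypothetical figure chosen for illustration, and the estimate ignores quantization scales, activations, and the KV cache. Ternary weights are counted at their information-theoretic minimum of log2(3) bits; real packing formats may use more.

```python
import math


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB.

    Counts only the packed weights: no scales, zero points,
    activations, or KV cache.
    """
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical total parameter count for a large MoE model (assumption).
NUM_PARAMS = 100e9

# Bits per weight for each scheme; ternary uses the log2(3) lower bound.
FORMATS = {
    "4-bit": 4.0,
    "2-bit": 2.0,
    "ternary": math.log2(3),
}


def footprint_table(num_params: float = NUM_PARAMS) -> dict:
    """Map each format name to its approximate weight footprint in GiB."""
    return {name: weight_memory_gib(num_params, bits) for name, bits in FORMATS.items()}


if __name__ == "__main__":
    for name, gib in footprint_table().items():
        print(f"{name:8s} {gib:6.1f} GiB")
```

Under these assumptions, going from 4-bit to 2-bit halves the weight footprint, while ternary saves only a little more on top of that.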
Similar Articles
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
Proposes LC-QAT, a 2-bit weight-only vector quantization-aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1% to 10% of the training data.
Does it make sense to use alternative quantizations of QAT models? [D]
A discussion of whether it is sensible to apply alternative quantization methods to quantization-aware trained (QAT) models such as Gemma-4, questioning whether Unsloth's benchmarks, which show performance closer to the QAT fine-tunes, make this beneficial or counterproductive.
Gemma 4 26B A4B IT QAT Comparison
A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval. The QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than the 4-bit one, which calls into question the superiority of QAT for this model.
K-Quantization and its Impact on Output Performance
This paper investigates how quantization levels from 2-bit to 8-bit affect the performance of eight large language models on reasoning, code comprehension, and reading comprehension tasks. Higher precision generally yields better performance, but aggressive quantization often retains acceptable accuracy, and larger models show greater resilience.
MTP and QAT - what is the relation?
A user seeks clarification on the relation between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma4 model and the new QAT string in filenames.