2-bit QAT model releases

Reddit r/LocalLLaMA 06/07/26, 07:38 PM News

Summary

A discussion of the potential of 2-bit Quantization-Aware Training (QAT) for larger MoE models, comparing such models with 4-bit QAT models and ternary LLMs, and considering whether they could run on consumer hardware.

So far, model releases that take advantage of Quantization-Aware Training (QAT) have focused on 4-bit. I'm curious what could be accomplished at 2-bit with a larger MoE model, from around 120B up to 400B parameters. Obviously the model could not approach 8/16-bit performance, but perhaps this could be a better alternative to training a ternary LLM (1.58-bit) from scratch.

At these sizes you could fit the model into consumer computers with 64/128 GB of RAM, and perhaps it could outperform a model about half the size (80B/235B) at 4-bit precision. I suspect the reason it wouldn't be tried is that tooling and coding might suffer too much. I'm thinking about it in the context of creative writing. In my experience, 2-bit can still perform. What do you think?

EDIT: I acknowledge that 4-bit QAT is likely the best way to get performance similar to the 8-bit/16-bit model. What I'm wondering is: how would a 4-bit 120B model compare to a 2-bit 240B QAT model? Could they perform similarly? We're noticing a trend towards bigger models. Could a QAT model bridge the gap left by the decline in mid-range models?
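To make the size comparison concrete, here is a rough back-of-the-envelope sketch of the weight memory for the configurations mentioned in the post. The helper name `weight_memory_gb` is hypothetical, and the estimate assumes weights only: it ignores quantization scales and zero points, any layers kept at higher precision, the KV cache, and activations, so real files and runtime usage would be larger.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no overhead for scales, embeddings, or higher-precision layers.
    """
    return params_billions * bits_per_weight / 8


# Configurations taken from the post: (label, parameters in billions, bits).
configs = [
    ("120B @ 2-bit", 120, 2),
    ("80B  @ 4-bit", 80, 4),
    ("400B @ 2-bit", 400, 2),
    ("235B @ 4-bit", 235, 4),
    ("240B @ 2-bit", 240, 2),
    ("120B @ 4-bit", 120, 4),
]

for label, params, bits in configs:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
```

Under these assumptions, a 2-bit 240B model and a 4-bit 120B model need the same weight storage (about 60 GB each), which is why the comparison in the EDIT is between equal memory budgets rather than equal parameter counts.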

Similar Articles

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

arXiv cs.CL

Proposes LC-QAT, a 2-bit weight-only vector quantization aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of training data.

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

arXiv cs.CL

CAT-Q introduces a post-training ternary quantization method for LLMs that uses learnable modulation and softened ternarization, achieving superior performance over BitNet 1.58-bit while using only 512 calibration samples and scaling to 235B parameters.

Does it make sense to use alternative quantizations of QAT models? [D]

Reddit r/MachineLearning

A discussion on whether it is sensible to use alternative quantization methods on quantization-aware trained (QAT) models like Gemma-4, questioning if unsloth's benchmarks showing closer performance to QAT fine-tunes are beneficial or counterproductive.

Gemma 4 26B A4B IT QAT Comparison

Reddit r/LocalLLaMA

A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval, finding that the QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than 4-bit, questioning the superiority of QAT for this model.

Rethinking Small VLM Quantization: From Component-Wise Analysis to Hardware-Aware Edge Deployment

arXiv cs.LG

This paper systematically evaluates component-wise quantization of small vision-language models on Jetson edge devices, finding that model architecture (MoE vs dense) significantly affects quantization sensitivity and that quantization errors are largely additive except along modality-alignment paths.