2-bit QAT model releases

Reddit r/LocalLLaMA News

Summary

A discussion on the potential of 2-bit Quantization Aware Training (QAT) for larger MoE models, comparing their performance to 4-bit QAT and ternary LLMs, and considering feasibility for consumer hardware.

So far, model releases that take advantage of Quantization Aware Training (QAT) have focused on 4-bit. I'm curious what could be accomplished at 2-bit with a larger MoE model, somewhere from 120B up to 400B parameters. Obviously the model could not approach 8/16-bit performance, but perhaps this could be a better alternative to training a ternary LLM (1.58-bit) from scratch. At these sizes you could fit the model into consumer computers with 64/128 GB of RAM, and perhaps it could outperform a model at about half the size (80B/235B) at 4-bit precision.

I suspect the reason it hasn't been tried is that tooling and coding performance might suffer too much. I'm thinking about it in the context of creative writing. In my experience 2-bit can still perform. What do you think?

EDIT: I acknowledge that 4-bit QAT is likely the best way to get performance similar to the 8-bit/16-bit model. What I'm wondering is: how would a 4-bit 120B model compare to a 2-bit 240B QAT model? Could they perform similarly? We're noticing a trend towards bigger models. Could a QAT model fill the gap left by the decline in mid-range models?
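For anyone who wants to check the memory arithmetic behind this question, here is a rough sketch. It only counts raw weight storage (parameters times bits per weight) and assumes 1 GB = 1e9 bytes. The function name `weight_memory_gb` is made up for illustration, and the numbers ignore quantization scales, KV cache, activations, and OS overhead, so real files and real RAM usage will be larger.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate raw weight storage in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Real quantized formats add per-block scales and often keep some tensors
    at higher precision, so treat this as a lower bound.
    """
    # params_billion * 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


# Model sizes and bit widths mentioned in the post.
configs = [
    ("120B @ 2-bit", 120, 2),
    ("240B @ 2-bit", 240, 2),
    ("400B @ 2-bit", 400, 2),
    ("80B  @ 4-bit", 80, 4),
    ("120B @ 4-bit", 120, 4),
    ("235B @ 4-bit", 235, 4),
]

for name, params_billion, bits in configs:
    size = weight_memory_gb(params_billion, bits)
    print(f"{name}: ~{size:.1f} GB of weights")
```

By this estimate a 2-bit 240B model and a 4-bit 120B model need the same raw weight storage, which is why the comparison is interesting: the open question is which one makes better use of that budget, not which one fits.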

Similar Articles

Does it make sense to use alternative quantizations of QAT models? [D]

Reddit r/MachineLearning

A discussion on whether it is sensible to use alternative quantization methods on quantization-aware trained (QAT) models like Gemma-4, questioning if unsloth's benchmarks showing closer performance to QAT fine-tunes are beneficial or counterproductive.

Gemma 4 26B A4B IT QAT Comparison

Reddit r/LocalLLaMA

A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval, finding that the QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than 4-bit, questioning the superiority of QAT for this model.

K-Quantization and its Impact on Output Performance

arXiv cs.CL

This paper investigates the impact of different quantization levels (2-bit to 8-bit) on the performance of eight large language models across reasoning, code comprehension, and reading comprehension tasks, finding that while higher precision generally yields better performance, aggressive quantization often retains acceptable accuracy, with larger models showing greater resilience.

MTP and QTA - what is the relation?

Reddit r/LocalLLaMA

A user seeks clarification on the relation between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma4 model and the new QAT string in filenames.