Tag
LiftQuant introduces a 'lift-then-project' mechanism enabling continuous (non-integer) bit-width quantization for LLMs, allowing precise fitting to hardware memory budgets. The framework compresses a 70B LLM to 2.4-bit to fit a 24GB GPU, outperforming state-of-the-art 2-bit models.
Introduces QAM-W, a joint 2D codebook quantization method for LLM weights using Hadamard rotation and activation-aware scaling, achieving near BF16 perplexity at 5–6 bits per weight and matching SmoothQuant W8A8 quality with 32% fewer weight bits.
This paper investigates smoothness degradation in extremely quantized Large Language Models, arguing that preserving smoothness is crucial for maintaining performance beyond numerical accuracy.
A user demonstrates successful local inference of a 27B parameter Qwen model across three GTX 1080 Ti GPUs, achieving approximately 28-30 tokens per second using TurboQuant optimization.
Researchers identify two distinct failure modes in aggressive LLM quantization—Signal Degradation and Computation Collapse—and show that training-free fixes only remedy the former, indicating structural reconstruction is needed for ultra-low-bit models.