Tag
Proposes LC-QAT, a quantization-aware training framework for 2-bit weight-only vector quantization of LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of training data.
UniSVQ proposes a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices, achieving state-of-the-art performance among scalar methods and matching vector methods with higher throughput.
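As a rough illustration of codewords written as an affine transform of an integer lattice, the sketch below builds a codebook c = A z + b with z ranging over {0, 1, 2, 3}^d. This is one reading of the summary, not UniSVQ's actual parameterization: the names `affine_lattice_codebook` and `quantize`, the choice d = 2, and the example values of `A` and `b` are all hypothetical. With four integer levels per coordinate, the index costs 2 bits per weight. A diagonal `A` reduces to ordinary per-coordinate scalar quantization, and a non-diagonal `A` couples the coordinates, which is the sense in which such a form could sit between scalar and vector quantization.

```python
from itertools import product


def affine_lattice_codebook(A, b, levels=4):
    """Enumerate codewords c = A z + b for every integer point z in {0..levels-1}^d.

    A is a d x d matrix (list of rows) and b is a length-d offset.
    Both are illustrative stand-ins for learned parameters.
    """
    d = len(b)
    codebook = []
    for z in product(range(levels), repeat=d):
        c = tuple(sum(A[i][j] * z[j] for j in range(d)) + b[i] for i in range(d))
        codebook.append(c)
    return codebook


def quantize(w, codebook):
    """Return the index of the codeword nearest to w in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((wi - ci) ** 2 for wi, ci in zip(w, codebook[k])),
    )


# Diagonal A: each coordinate gets its own scale and offset (scalar quantization).
A_diag = [[0.5, 0.0], [0.0, 0.5]]
b_diag = [-0.75, -0.75]
cb_scalar = affine_lattice_codebook(A_diag, b_diag)

# Non-diagonal A: the same integer lattice, sheared so coordinates are coupled.
A_full = [[0.5, 0.25], [0.0, 0.5]]
cb_vector = affine_lattice_codebook(A_full, b_diag)

w = (0.3, -0.8)
idx = quantize(w, cb_scalar)
print("scalar-style codeword:", cb_scalar[idx])
print("sheared-lattice codeword:", cb_vector[quantize(w, cb_vector)])
```

Both codebooks have 4^2 = 16 entries, so a 2-dimensional weight vector is stored as a 4-bit index in either case. Only the geometry of the codewords changes.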
LiftQuant introduces a 'lift-then-project' mechanism that enables continuous (non-integer) bit-width quantization for LLMs, so that a model can be fitted precisely to a hardware memory budget. The framework compresses a 70B-parameter LLM to 2.4 bits per weight so that it fits on a 24 GB GPU, outperforming state-of-the-art 2-bit models.
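The memory-budget claim can be checked with back-of-the-envelope arithmetic. The sketch below is not from the paper: it counts weight storage only, assumes 1 GB = 10^9 bytes, and ignores codebooks, scales, activations, and the KV cache. The helper names are hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def max_bits_for_budget(n_params: float, budget_gb: float) -> float:
    """Largest average bit-width whose weights alone fit in budget_gb."""
    return budget_gb * 1e9 * 8 / n_params


N_PARAMS = 70e9   # 70B parameters
BUDGET_GB = 24.0  # nominal GPU memory

for bits in (2.0, 2.4, 3.0, 4.0):
    size = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if size <= BUDGET_GB else "does not fit"
    print(f"{bits:.1f} bits/weight -> {size:.2f} GB ({verdict})")

print(f"upper bound: {max_bits_for_budget(N_PARAMS, BUDGET_GB):.2f} bits/weight")
```

Under these assumptions 2.4 bits per weight comes to about 21 GB, which leaves a few gigabytes of headroom on a 24 GB card, while an integer step up to 3 bits (about 26 GB) would not fit. That gap is what makes a fractional bit-width useful.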
This paper introduces inner-product-aware quantization methods that preserve inner products with unseen vectors. It develops fast, adaptive algorithms with provable guarantees and achieves a 2-10x speedup over prior ASQ methods.
Shard is a drop-in HuggingFace Cache that achieves 10x KV cache compression for Llama-3.1-8B by using PCA plus int4 quantization on K and Hadamard rotation plus vector quantization on V, without accuracy loss on benchmarks.
This paper introduces SDFlow, a similarity-driven flow matching framework for time series generation that addresses exposure bias in autoregressive models. It achieves state-of-the-art performance and inference speedups by operating in the frozen VQ latent space with low-rank manifold decomposition.