vector-quantization

#vector-quantization

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

arXiv cs.CL · 2026-06-10

Proposes LC-QAT, a 2-bit, weight-only, vector-quantization-aware training (QAT) framework for LLMs. It uses a learned affine mapping so the quantizer can be trained end to end, and it achieves state-of-the-art results using only 0.1%–10% of the training data.
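
For orientation, weight-only vector quantization stores one codebook index per group of d weights, so a codebook of K entries costs log2(K)/d bits per weight; d = 4 with K = 256, or d = 8 with K = 65536, both give 2 bits. The minimal sketch below encodes and decodes weights with a codebook built by an elementwise affine map of a fixed base codebook. The affine form and the names affine_codebook, vq_encode and vq_decode are illustrative assumptions, not LC-QAT's published formulation, and the training loop that would learn the mapping end to end is omitted.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def bits_per_weight(codebook_size: int, dim: int) -> float:
    # Each group of `dim` weights stores one index of log2(K) bits.
    return math.log2(codebook_size) / dim


def affine_codebook(base: Sequence[Vector], scale: Vector, offset: Vector) -> List[Vector]:
    # Hypothetical affine parameterization: c_k = scale * b_k + offset (elementwise).
    return [[s * b + o for b, s, o in zip(code, scale, offset)] for code in base]


def vq_encode(weights: Vector, codebook: List[Vector], dim: int) -> List[int]:
    # Split the weights into groups of `dim` and pick the nearest codeword (squared L2).
    if len(weights) % dim:
        raise ValueError("weight count must be a multiple of dim")
    indices = []
    for start in range(0, len(weights), dim):
        group = weights[start:start + dim]
        best = min(
            range(len(codebook)),
            key=lambda k: sum((g - c) ** 2 for g, c in zip(group, codebook[k])),
        )
        indices.append(best)
    return indices


def vq_decode(indices: List[int], codebook: List[Vector]) -> Vector:
    # Concatenate the selected codewords back into a flat weight vector.
    return [x for k in indices for x in codebook[k]]


if __name__ == "__main__":
    rng = random.Random(0)
    base = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]
    codebook = affine_codebook(base, scale=[0.02] * 4, offset=[0.0] * 4)
    weights = [rng.gauss(0.0, 0.02) for _ in range(16)]
    idx = vq_encode(weights, codebook, dim=4)
    print(bits_per_weight(256, 4), idx)
```

In a QAT setting the scale and offset (or a fuller affine map) would be the trainable parameters, while the discrete index choice would need a gradient workaround such as a straight-through estimator.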

UniSVQ: 2-bit Unified Scalar-Vector Quantization

arXiv cs.CL · 2026-06-10

UniSVQ proposes a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of an integer lattice. It achieves state-of-the-art performance among scalar methods and matches vector methods while delivering higher throughput.
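
To make the scalar/vector bridge concrete, the sketch below assumes codewords of the form c = A z + b, where z ranges over a small integer grid with 4 levels per coordinate (2 bits). When A is diagonal, the nearest codeword is found coordinate by coordinate by rounding and clamping, which is ordinary scalar quantization; a general A couples the coordinates and, in this naive version, requires searching the whole grid. The level set, the function names and the brute-force search are assumptions for illustration, not UniSVQ's actual parameterization or decoder.

```python
import itertools
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]

# Assumed 2-bit level set: 4 integer levels per coordinate.
LEVELS = (-2, -1, 0, 1)


def apply_affine(A: Matrix, z: Sequence[int], b: Vector) -> Vector:
    # Codeword c = A z + b.
    return [sum(a * zi for a, zi in zip(row, z)) + bi for row, bi in zip(A, b)]


def quantize_scalar(x: Vector, scale: Vector, b: Vector) -> Tuple[int, ...]:
    # Diagonal A with positive scales: the squared error separates per coordinate,
    # so rounding then clamping to the level range gives the nearest codeword.
    lo, hi = min(LEVELS), max(LEVELS)
    return tuple(max(lo, min(hi, round((xi - bi) / s))) for xi, s, bi in zip(x, scale, b))


def quantize_vector(x: Vector, A: Matrix, b: Vector) -> Tuple[int, ...]:
    # General A: coordinates interact, so search all |LEVELS|^d lattice points.
    def sq_err(z: Tuple[int, ...]) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, apply_affine(A, z, b)))

    return min(itertools.product(LEVELS, repeat=len(x)), key=sq_err)


if __name__ == "__main__":
    x = [0.3, -0.7, 5.0]
    diag = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
    print(quantize_scalar(x, [0.5] * 3, [0.0] * 3), quantize_vector(x, diag, [0.0] * 3))
```

The brute-force search costs 4^d distance evaluations, so it is only practical for tiny d; it is shown here purely to contrast with the per-coordinate rounding of the diagonal case.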

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

arXiv cs.LG · 2026-06-04

LiftQuant introduces a 'lift-then-project' mechanism that enables quantization at continuous (non-integer) bit widths, so a model can be sized precisely to a hardware memory budget. The framework compresses a 70B LLM to 2.4 bits per weight so that it fits on a 24 GB GPU, outperforming state-of-the-art 2-bit models.
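
The memory arithmetic behind the 24 GB example can be checked directly. The sketch below counts weight storage only (parameters × bits ÷ 8, in decimal gigabytes) and ignores codebooks, scales, activations and the KV cache, so real footprints are somewhat larger; the function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    # Weight storage only: params * bits / 8 bytes, in decimal GB (1e9 bytes).
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_for_budget(num_params: float, budget_gb: float) -> float:
    # Largest average bit width whose weights fit in budget_gb (no overhead counted).
    return budget_gb * 1e9 * 8 / num_params


if __name__ == "__main__":
    for bits in (2.0, 2.4, 3.0):
        print(f"70B @ {bits} bits/weight: {weight_memory_gb(70e9, bits):.2f} GB")
    print(f"Upper bound for 24 GB: {max_bits_for_budget(70e9, 24.0):.2f} bits/weight")
```

For 70e9 parameters this gives 17.5 GB at 2 bits, 21 GB at 2.4 bits and 26.25 GB at 3 bits. Three bits does not fit in 24 GB, while 2.4 bits leaves about 3 GB for everything else; that gap between integer bit widths is what a continuous bit width can fill.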

Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms

arXiv cs.LG · 2026-06-02

This paper introduces inner-product-aware quantization methods, which aim to preserve inner products with unseen vectors, and develops fast, adaptive algorithms with provable guarantees that achieve a 2–10× speedup over prior ASQ methods.
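
One common way to formalize "preserving inner products with unseen vectors" is to weight the quantization residual by the second moment of the query distribution: averaged over queries q, the squared error of q·x_hat against q·x equals r^T Sigma r, where r = x - x_hat and Sigma = E[q q^T]. The sketch below is a generic illustration of that objective, not the paper's algorithm: it estimates Sigma from sample queries and picks the codeword that minimizes the weighted error. All names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def second_moment(queries: List[Vector]) -> Matrix:
    # Sigma = (1/n) * sum_i q_i q_i^T, estimated from sample queries.
    n, d = len(queries), len(queries[0])
    return [[sum(q[i] * q[j] for q in queries) / n for j in range(d)] for i in range(d)]


def inner_product_error(x: Vector, x_hat: Vector, queries: List[Vector]) -> float:
    # Mean squared error of <q, x_hat> against <q, x> over the sample queries.
    return sum((dot(q, x) - dot(q, x_hat)) ** 2 for q in queries) / len(queries)


def weighted_error(x: Vector, x_hat: Vector, sigma: Matrix) -> float:
    # The same quantity as a quadratic form r^T Sigma r, with residual r = x - x_hat.
    r = [a - b for a, b in zip(x, x_hat)]
    return dot(r, [dot(row, r) for row in sigma])


def pick_codeword(x: Vector, codebook: List[Vector], sigma: Matrix) -> int:
    # Inner-product-aware choice: minimize the query-weighted error.
    return min(range(len(codebook)), key=lambda k: weighted_error(x, codebook[k], sigma))


def nearest_l2(x: Vector, codebook: List[Vector]) -> int:
    # Plain nearest-neighbour choice, for comparison.
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])))


if __name__ == "__main__":
    queries = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
    x = [1.0, 1.0]
    codebook = [[1.0, 0.0], [0.6, 1.0]]
    sigma = second_moment(queries)
    print(pick_codeword(x, codebook, sigma), nearest_l2(x, codebook))
```

With queries concentrated on the first axis, plain L2 prefers the second codeword (squared error 0.16 vs 1.0), while the inner-product-aware criterion prefers the first, whose residual is orthogonal to every query.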

Shard - getting to 10× KV cache compression

Reddit r/LocalLLaMA · 2026-05-26

Shard is a drop-in Hugging Face Cache implementation that achieves 10× KV cache compression for Llama-3.1-8B by applying PCA plus int4 quantization to the keys (K) and a Hadamard rotation plus vector quantization to the values (V), with no reported accuracy loss on the benchmarks tested.
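
Hadamard rotations are commonly applied before quantization to spread outlier energy across coordinates, and because the normalized transform is orthogonal it can be undone exactly after dequantization. The sketch below rotates a vector with an orthonormal fast Walsh-Hadamard transform, quantizes it, and rotates back. Shard quantizes V with vector quantization after the rotation; this sketch substitutes simple symmetric per-vector int4 scalar quantization for brevity and does not model the PCA step applied to K. Function names are hypothetical.

```python
import math
from typing import List, Tuple


def fwht(x: List[float]) -> List[float]:
    # Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    # With the 1/sqrt(n) normalization the transform is its own inverse.
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


def quantize_int4(x: List[float]) -> Tuple[List[int], float]:
    # Symmetric per-vector int4: integers in [-7, 7] plus one float scale.
    peak = max(abs(v) for v in x)
    scale = peak / 7 if peak > 0 else 1.0
    return [max(-7, min(7, round(v / scale))) for v in x], scale


def dequantize_int4(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


if __name__ == "__main__":
    v = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    rotated = fwht(v)                      # rotate before quantizing
    q, s = quantize_int4(rotated)
    restored = fwht(dequantize_int4(q, s))  # dequantize, then rotate back
    print(max(map(abs, rotated)), restored)
```

A single outlier such as [8, 0, 0, 0, 0, 0, 0, 0] becomes eight entries of magnitude 8/sqrt(8), about 2.83, so the quantization scale is no longer dictated by one coordinate.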

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

arXiv cs.AI · 2026-05-08

This paper introduces SDFlow, a similarity-driven flow matching framework for time series generation that addresses exposure bias in autoregressive models. It achieves state-of-the-art performance and faster inference by operating in a frozen VQ latent space with a low-rank manifold decomposition.
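
As background on the flow matching component, the sketch below shows the standard straight-line formulation: take a pair (x0, x1), form x_t = (1 - t) x0 + t x1, and regress a velocity model onto x1 - x0; sampling then integrates the learned velocity from t = 0 to 1 with Euler steps. Here x1 stands in for a continuous latent from a frozen VQ model (an assumption), and SDFlow's similarity-driven pairing and low-rank manifold decomposition are not modeled. The model callable and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
VelocityModel = Callable[[Vector, float], Vector]  # hypothetical v(x_t, t)


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    # Straight-line path: x_t = (1 - t) * x0 + t * x1.
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    # Regression target on that path: d x_t / dt = x1 - x0.
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model: VelocityModel, x0: Vector, x1: Vector, t: float) -> float:
    # Mean squared error between predicted and target velocity at time t.
    pred = model(interpolate(x0, x1, t), t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x0)


def euler_sample(model: VelocityModel, x0: Vector, steps: int = 10) -> Vector:
    # Integrate dx/dt = model(x, t) from t = 0 to t = 1 with fixed Euler steps.
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    x0, x1 = [0.0, 1.0], [2.0, -1.0]
    oracle = lambda x, t: target_velocity(x0, x1)  # ideal velocity for this one pair
    print(flow_matching_loss(oracle, x0, x1, t=0.3), euler_sample(oracle, x0))
```

With an oracle model that returns the exact constant velocity, the loss is zero and Euler integration lands on x1, since the path is a straight line; a trained network only approximates this field.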

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Papers with Code Trending · 2025-02-08

IndexTTS is a text-to-speech system that combines the XTTS and Tortoise models with hybrid character-pinyin modeling and optimized vector quantization, achieving more natural speech, controllable pronunciation, and faster inference than existing open-source TTS systems.
