low-bit

High-accuracy Low-Bit KV-Cache Quantization via Local Distribution Restoration

arXiv cs.LG ↗ · 5d ago Cached

This paper identifies that low-bit KV cache quantization degrades LLM accuracy due to structured local misranking of logits, and proposes DGAP, a method that restores the local distribution of top-K candidates, recovering RULER accuracy from 47.8% to 83.2% on Llama-3.1-8B with minimal overhead.

I ran Ternary-Bonsai-27B (2-bit) and Bonsai-27B (1-bit) on Terminal-Bench 2.0, in 8GB VRAM

Reddit r/LocalLLaMA ↗ · 5d ago

A user ran the 2-bit Ternary-Bonsai-27B and the 1-bit Bonsai-27B models on Terminal-Bench 2.0, fitting both within 8GB of VRAM.
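As a rough check on why a 27B-parameter model can fit in 8GB at these bit widths, the sketch below estimates weight storage only. It assumes exactly 27e9 parameters, all stored at the stated bit width, and ignores the KV cache, activations, quantization scales, and any layers kept at higher precision; `weight_memory_gb` is a hypothetical helper, not something from the post.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes, assuming every parameter
    uses exactly bits_per_weight bits (an idealized assumption)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter count: 27e9 (taken from the "27B" in the model names).
one_bit_gb = weight_memory_gb(27e9, 1)
two_bit_gb = weight_memory_gb(27e9, 2)

print(f"1-bit weights: {one_bit_gb:.3f} GB")
print(f"2-bit weights: {two_bit_gb:.3f} GB")
```

Under these assumptions both estimates fall below 8GB, though real memory use will be higher once runtime buffers are included.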

Saturation Makes Quantization Error Additive: A Coverage Model with a Certificate

arXiv cs.LG ↗ · 2026-07-15 Cached

This paper analyzes the structure of quantization loss in mixed-precision neural networks. It shows that saturation makes the loss additive per layer and proposes a coverage model that predicts configuration loss with few parameters, validated on large-scale models.

BitNet Text Embeddings

arXiv cs.CL ↗ · 2026-06-25 Cached

This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper proposes LC-QAT, a 2-bit weight-only, vector-quantization-aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of the training data.

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

arXiv cs.LG ↗ · 2026-06-03 Cached

This paper introduces Qift, a fixed no-zero two-bit weight quantization level set designed for Hadamard-rotated LLMs, achieving improved W2A4/KV4 inference by leveraging the near-zero-centered Gaussian-like distribution of rotated weights. Experiments on LLaMA-2-7B and LLaMA-3.1-8B show consistent perplexity gains over standard W2 quantization.

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

arXiv cs.LG ↗ · 2026-06-02 Cached

BitsMoE introduces a spectral-energy-guided bit allocation framework for quantizing Mixture-of-Experts LLMs, achieving substantial accuracy improvements and speedups under ultra-low-bit quantization.

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

arXiv cs.AI ↗ · 2026-05-27 Cached

This paper presents Tail-Aware HiFloat4, a W4A4 post-training quantization method for the Wan2.2 text-to-video diffusion model, which uses activation-tail-aware percentile calibration to mitigate outlier effects while preserving HiFloat4 arithmetic.

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv cs.LG ↗ · 2026-05-21 Cached

Quant.npu introduces a fully static quantization framework for mobile NPUs, using learnable parameters and rotation matrices to enable efficient low-bit LLM inference without runtime re-computation, achieving up to 15.1% latency reduction.

Measuring Maximum Activations in Open Large Language Models

arXiv cs.CL ↗ · 2026-05-18 Cached

This paper measures maximum activation magnitudes across 27 checkpoints from 8 open LLM families, finding significant variance across families, architectures, and training stages, with implications for low-bit quantization and deployment.

@antirez: Uploading a new HF imatrix GGUF for 2 bits: same name, different content with fixed down layer of shared experts (there…

X AI KOLs Following ↗ · 2026-05-11

A corrected 2-bit GGUF model file has been uploaded to Hugging Face after fixing a bug in the imatrix computation, leading to improved logits recall and reduced error.

Ternary Bonsai: Top Intelligence at 1.58 Bits

Hacker News Top ↗ · 2026-04-18

Ternary Bonsai uses ternary weights (-1, 0, 1), which need only about 1.58 bits per parameter, and is reported to achieve competitive performance, enabling deployment on extremely constrained devices.
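The 1.58-bit figure is the information content of a three-valued weight, log2(3) ≈ 1.585. The sketch below computes it and shows one simple way to get close to that in storage: five ternary weights fit in one byte because 3^5 = 243 ≤ 256, giving 1.6 bits per weight. The packing scheme and the names `pack_trits` and `unpack_trits` are illustrative assumptions, not the format Ternary Bonsai actually uses.

```python
import math

# Information content of one weight drawn from {-1, 0, 1}.
bits_per_trit = math.log2(3)


def pack_trits(trits):
    """Pack five weights from {-1, 0, 1} into one integer in [0, 242].

    Illustrative base-3 encoding; the first weight is the lowest digit.
    """
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected five values from {-1, 0, 1}")
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return value


def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary weights."""
    out = []
    for _ in range(5):
        out.append(byte % 3 - 1)
        byte //= 3
    return out


weights = [-1, 0, 1, 1, -1]
packed = pack_trits(weights)
print(f"log2(3) = {bits_per_trit:.3f} bits; packed storage = {8 / 5} bits per weight")
print(unpack_trits(packed) == weights)
```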
