Tag
This paper empirically investigates whether aligning the allocation cost with the output-space objective improves compressed model fidelity in ROCKET, a training-free LLM compression method. Results show a trade-off between accuracy and perplexity, with effects more pronounced at higher compression ratios.
Presents an end-to-end framework for LLM compression that jointly optimizes structural pruning and mixed-precision quantization, reporting significant perplexity reductions and speedups over state-of-the-art methods, especially at ultra-low bit precisions.
Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive performance on Llama 3.1 8B and Qwen3-8B benchmarks.
A community researcher shares a custom quantization recipe for Qwen3.6-27B that produces a smaller 30GB Q8 GGUF by keeping high-outlier sublayers in BF16, achieving better KLD and top-p metrics than Unsloth's 33GB Q8_K_XL variant.
Researchers from UiT and University of Oslo propose a differentiable NAS framework that jointly optimizes architectural configurations and mixed-precision quantization for LLM compression, achieving up to 1.4× faster inference or 6% higher accuracy across seven reasoning tasks compared to sequential NAS-then-quantization baselines.
This study reports a 'Smart Pruning Paradox': activation-aware pruning methods such as Wanda preserve perplexity but significantly amplify bias in large language models deployed on edge devices.
Tencent's AngelSlim team released Hy-MT1.5-1.8B-1.25bit, a machine translation model quantized to 1.25 bits that supports 33 languages and fits in 440MB for on-device use. It uses the Sherry quantization algorithm and is reported to achieve translation quality comparable to much larger models.