llm-compression

Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study

arXiv cs.CL · yesterday

This paper empirically investigates whether aligning the allocation cost with the output-space objective improves compressed model fidelity in ROCKET, a training-free LLM compression method. Results show a trade-off between accuracy and perplexity, with effects more pronounced at higher compression ratios.

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

arXiv cs.AI · 2026-06-09

Proposes an end-to-end framework for LLM compression that jointly optimizes structural pruning and mixed-precision quantization, reporting lower perplexity and faster inference than prior state-of-the-art methods, especially at ultra-low bit precisions.

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

arXiv cs.CL · 2026-06-08

Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive benchmark performance with Llama 3.1 8B and Qwen3-8B.

Qwen 3.6 27B 30GB Same top p: 98.358 ± 0.033 % vs UD Q8 K XL 33GB Same top p: 97.426 ± 0.041 %

Reddit r/LocalLLaMA · 2026-06-04

A community researcher shares a custom quantization recipe for Qwen3.6-27B that produces a smaller 30GB Q8 GGUF by keeping high-outlier sublayers in BF16, achieving better KLD and top-p metrics than Unsloth's 33GB Q8_K_XL variant.

LLM Compression with Jointly Optimizing Architectural and Quantization choices

arXiv cs.LG · 2026-06-04

Researchers from UiT and University of Oslo propose a differentiable NAS framework that jointly optimizes architectural configurations and mixed-precision quantization for LLM compression, achieving up to 1.4× faster inference or 6% higher accuracy across seven reasoning tasks compared to sequential NAS-then-quantization baselines.

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

arXiv cs.LG · 2026-05-12

This study reveals a 'Smart Pruning Paradox' where activation-aware pruning methods like Wanda preserve perplexity but significantly amplify bias in Large Language Models deployed on edge devices.

AngelSlim/Hy-MT1.5-1.8B-1.25bit

Hugging Face Models Trending · 2026-04-28

Tencent's AngelSlim team released Hy-MT1.5-1.8B-1.25bit, a highly compressed 1.25-bit machine translation model supporting 33 languages that fits in 440MB for on-device use. It uses the Sherry quantization algorithm and is reported to reach translation quality comparable to much larger models.
