Tag
This paper studies how pruning attention layers in LLMs affects explanation faithfulness and confidence calibration, finding that accuracy often remains high but interpretability and reliability degrade, highlighting a misalignment between model confidence, interpretability, and accuracy.
A new paper investigates whether it's better to prune a larger LLM or train a small LLM from scratch, finding that pruning provides more than just a good initialization.
Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.
This paper reveals a 'benchmark illusion' where pruned LLMs perform well on multiple-choice benchmarks but fail to answer the same questions in open generation, suggesting that compressed models should be tested on generative tasks rather than just recognition tasks.
This paper proposes an inter-layer perturbation-absorption perspective for layer-wise sparsity in LLMs, showing that layers exhibit heterogeneous responses to pruning perturbations and introducing an absorption-aware correction that improves existing pruning methods by reducing perplexity and boosting accuracy.
The MiniMax-M3 PRISM Dynamic-Quant recipe compresses a 428B-parameter model from ~450GB to 119GB using per-tensor sensitivity ranking, with plans to prune it further to 60-80GB for local deployment.
SpenseGPT introduces a practical one-shot pruning method for LLMs that enables both sparse and dense GEMMs during inference, improving efficiency.
This article is the middle part of the AI Engineering Landscape series, detailing core techniques such as inference optimization, model slimming (quantization, distillation, pruning, MoE), and speculative decoding, while reviewing the latest advances from hardware to the engineering stack.
Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 pp of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.
IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.
TENP proposes a structured pruning framework for Mixture-of-Experts LLMs that retains important experts and applies neuron pruning to less important ones, achieving high sparsity with minimal accuracy loss on Qwen and DeepSeek models.
This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.
Researchers from UiT and the University of Oslo propose a differentiable NAS framework that jointly optimizes architectural configurations and mixed-precision quantization for LLM compression, achieving up to 1.4× faster inference or 6% higher accuracy across seven reasoning tasks compared to sequential NAS-then-quantization baselines.
This paper presents a Marchenko-Pastur random matrix approach to pruning deep neural networks, offering theoretical guarantees and achieving strong accuracy retention with minimal fine-tuning on ImageNet for ViT and CNN architectures.
The author argues that AI agent memory should focus on pruning data rather than hoarding, drawing parallels to human memory types (sensory, short-term, long-term) and suggesting that modeling after human memory can reduce token usage while maintaining high-quality context.
A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.
NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.
Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.
This academic paper investigates the asymmetry between pruning and growth in structural plasticity for neural networks, showing that newborn units suffer from weaker gradient signals than incumbent units, and proposes interventions to improve integration.
Introduces Self-Pruned Key-Value Attention (SP-KV), a mechanism that learns to predict the future utility of key-value pairs to dynamically prune the KV cache, reducing memory usage and speeding up decoding by 3-10x with minimal performance degradation. The model and utility predictor are trained end-to-end using next-token prediction.