pruning

#pruning

Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

arXiv cs.LG · 5h ago

This paper studies how pruning attention layers in LLMs affects explanation faithfulness and confidence calibration, finding that accuracy often remains high but interpretability and reliability degrade, highlighting a misalignment between model confidence, interpretability, and accuracy.

@Zephyr271828: You want a strong small LLM. Would you start small — or inherit from something bigger? New paper: Small LLMs: Pruning v…

X AI KOLs Timeline · yesterday

A new paper investigates whether it's better to prune a larger LLM or train a small LLM from scratch, finding that pruning provides more than just a good initialization.

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv cs.LG · 2026-06-18

Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv cs.CL · 2026-06-17

This paper reveals a 'benchmark illusion' where pruned LLMs perform well on multiple-choice benchmarks but fail to answer the same questions in open generation, suggesting that compressed models should be tested on generative tasks rather than just recognition tasks.

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

arXiv cs.CL · 2026-06-16

This paper proposes an inter-layer perturbation-absorption perspective for layer-wise sparsity in LLMs, showing that layers exhibit heterogeneous responses to pruning perturbations and introducing an absorption-aware correction that improves existing pruning methods by reducing perplexity and boosting accuracy.

@Ex0byt: Open frontier intelligence, in your hands - the MiniMax-M3 PRISM Dynamic-Quant recipe is ready! 428B parameters compres…

X AI KOLs Timeline · 2026-06-14

The MiniMax-M3 PRISM Dynamic-Quant recipe compresses a 428B parameter model from ~450GB to 119GB using per-tensor sensitivity ranking, with plans to prune further to 60-80GB for local deployment.

@_akhaliq: SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

X AI KOLs Following · 2026-06-12

SpenseGPT introduces a practical one-shot pruning method for LLMs that enables both sparse and dense GEMMs during inference, improving efficiency.

@snowboat84: https://x.com/snowboat84/status/2065215177029787705

X AI KOLs Timeline · 2026-06-11

This article is the middle part of the AI Engineering Landscape series, detailing core techniques such as inference optimization, model slimming (quantization, distillation, pruning, MoE), and speculative decoding, while reviewing the latest advances from hardware to the engineering stack.

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

arXiv cs.LG · 2026-06-10

Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 pp of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv cs.LG · 2026-06-10

IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.

TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

arXiv cs.LG · 2026-06-10

TENP proposes a structured pruning framework for Mixture-of-Experts LLMs that retains important experts and applies neuron pruning to less important ones, achieving high sparsity with minimal accuracy loss on Qwen and DeepSeek models.

Less is MoE: Trimming Experts in Domain-Specialist Language Models

arXiv cs.LG · 2026-06-05

This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.

LLM Compression with Jointly Optimizing Architectural and Quantization choices

arXiv cs.LG · 2026-06-04

Researchers from UiT and University of Oslo propose a differentiable NAS framework that jointly optimizes architectural configurations and mixed-precision quantization for LLM compression, achieving up to 1.4× faster inference or 6% higher accuracy across seven reasoning tasks compared to sequential NAS-then-quantization baselines.

Pruning Deep Neural Networks via the Marchenko-Pastur Distribution

arXiv cs.LG · 2026-06-03

This paper presents a Marchenko-Pastur random matrix approach to pruning deep neural networks, offering theoretical guarantees and achieving strong accuracy retention with minimal fine-tuning on ImageNet for ViT and CNN architectures.

Agentic AI memory isn't a hoarding problem. It's a pruning problem.

Reddit r/AI_Agents · 2026-06-03

The author argues that AI agent memory should focus on pruning data rather than hoarding, drawing parallels to human memory types (sensory, short-term, long-term) and suggesting that modeling after human memory can reduce token usage while maintaining high-quality context.

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Hugging Face Daily Papers · 2026-05-27

A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.

@tom_doerr: Compresses deep learning models for faster inference https://github.com/NVIDIA/Model-Optimizer…

X AI KOLs Timeline · 2026-05-19

NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Hugging Face Daily Papers · 2026-05-19

Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.

On the Stability of Growth in Structural Plasticity

arXiv cs.LG · 2026-05-18

This academic paper investigates the asymmetry between pruning and growth in structural plasticity for neural networks, showing that newborn units suffer from weaker gradient signals than incumbent units, and proposes interventions to improve integration.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

arXiv cs.LG · 2026-05-15

Introduces Self-Pruned Key-Value Attention (SP-KV), a mechanism that learns to predict the future utility of key-value pairs and uses that prediction to prune the KV cache dynamically, reducing memory usage and speeding up decoding by 3-10x with minimal performance degradation. The model and utility predictor are trained end-to-end using next-token prediction.
