pruning

#pruning

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

arXiv cs.LG · 2026-05-14

This paper introduces Layer-wise Representation Dynamics (LRD), a framework with three measurement families to analyze how hidden states change across layers in language models. Applied to 31 models on 30 MTEB tasks, LRD reveals architectural differences and enables label-free model selection and inference-time layer pruning.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Hugging Face Daily Papers · 2026-05-09

This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Reddit r/MachineLearning · 2026-04-23

The author reports diminishing returns from FP16 + ONNX + pruning on a 162 MB transformer and asks which step to try next: quantization, distillation, low-rank factorization, or hardware-specific tricks.

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

arXiv cs.CL · 2026-04-21

Researchers introduce PIE, a framework native to cross-layer transcoders (CLTs) for efficient circuit discovery. It prunes features by attribution, achieving roughly 40× compression in feature selection while maintaining behavioral fidelity on the IOI and Doc-String tasks.

@HuggingPapers: Cut your losses in parallel reasoning STOP learns to prune doomed trajectories early by reading KV-cache states, cuttin…

X AI KOLs Timeline · 2026-04-21

The STOP method prunes doomed trajectories in parallel reasoning early by reading KV-cache states, cutting token usage by 70% and improving AIME/GPQA accuracy across 1.5B–20B models.

@0xSero: Finally GLM-5.1-505B-REAP-NVFP4 45 tokens/s decode 1350 tokens/s prefill 32% prune This was the hardest I ever worked t…

X AI KOLs Timeline · 2026-04-20

Developer @0xSero reports running a 32%-pruned, NVFP4-quantized variant of GLM-5.1-505B (GLM-5.1-505B-REAP-NVFP4) at 45 tokens/s decode and 1350 tokens/s prefill.
