pruning

#pruning

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Reddit r/MachineLearning ↗ · 2026-04-23

Author shares experience hitting diminishing returns with FP16 + ONNX + pruning on 162 MB transformer, seeks advice on next best steps among quantization, distillation, low-rank factorization, or hardware-specific tricks.

0 favorites 0 likes

#pruning

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.

0 favorites 0 likes

#pruning

@HuggingPapers: Cut your losses in parallel reasoning STOP learns to prune doomed trajectories early by reading KV-cache states, cuttin…

X AI KOLs Timeline ↗ · 2026-04-21 Cached

STOP method prunes doomed reasoning trajectories early via KV-cache states, cutting token usage 70% and boosting AIME/GPQA accuracy across 1.5B–20B models.

0 favorites 0 likes

#pruning

@0xSero: Finally GLM-5.1-505B-REAP-NVFP4 45 tokens/s decode 1350 tokens/s prefill 32% prune This was the hardest I ever worked t…

X AI KOLs Timeline ↗ · 2026-04-20 Cached

Developer @0xSero achieved high-performance inference on an optimized GLM-5.1-505B variant using NVFP4 quantization and 32% pruning, reaching 45 tokens/s decode and 1350 tokens/s prefill speeds.

0 favorites 0 likes

pruning

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

@HuggingPapers: Cut your losses in parallel reasoning STOP learns to prune doomed trajectories early by reading KV-cache states, cuttin…

@0xSero: Finally GLM-5.1-505B-REAP-NVFP4 45 tokens/s decode 1350 tokens/s prefill 32% prune This was the hardest I ever worked t…

Submit Feedback