Author shares experience hitting diminishing returns with FP16 + ONNX + pruning on a 162 MB transformer and asks which next step is most promising: quantization, distillation, low-rank factorization, or hardware-specific tricks.
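Of the options the author lists, low-rank factorization is the easiest to prototype offline: replace a weight matrix `W` with the product of two thin matrices obtained by truncated SVD. A minimal NumPy sketch (illustrative only, not from the thread; the helper name is hypothetical):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B, with A: m x rank and B: rank x n,
    via truncated SVD. Parameter count drops from m*n to rank*(m+n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Toy weight matrix with true rank 8, so a rank-8 factorization is near-lossless
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
A, B = low_rank_factorize(W, rank=8)

orig_params = W.size                 # 64*64 = 4096
lr_params = A.size + B.size          # 2*64*8 = 1024
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(orig_params, lr_params, err)
```

Real transformer weights are only approximately low-rank, so in practice the rank is tuned per layer against an accuracy budget rather than chosen from the true rank as in this toy.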
Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.
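The summary does not detail PIE's pruning rule, but the generic idea behind attribution-based pruning, namely ranking features by attribution magnitude and keeping only a small top fraction, can be sketched as follows (hypothetical helper, not PIE's actual code):

```python
import numpy as np

def prune_by_attribution(attributions, keep_ratio):
    """Return sorted indices of the top keep_ratio fraction of features,
    ranked by absolute attribution score."""
    n_keep = max(1, int(len(attributions) * keep_ratio))
    order = np.argsort(-np.abs(attributions))  # descending by |attribution|
    return np.sort(order[:n_keep])

# Eight features; keep the top 25% by |attribution|.
# A ~40x compression as reported would correspond to keep_ratio ≈ 1/40.
attr = np.array([0.02, -1.5, 0.0, 0.7, -0.01, 3.2, 0.1, -0.4])
kept = prune_by_attribution(attr, keep_ratio=0.25)
print(kept.tolist())
```

After pruning, behavioral fidelity would be checked by re-running the task (e.g. IOI) with only the kept features active and comparing outputs to the full model.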
STOP method prunes doomed reasoning trajectories early via KV-cache states, cutting token usage by 70% and boosting AIME/GPQA accuracy across 1.5B–20B models.
Developer @0xSero achieved high-performance inference on an optimized GLM-5.1-505B variant using NVFP4 quantization and 32% pruning, reaching decode and prefill speeds of 45 and 1,350 tokens/s respectively.