Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Reddit r/MachineLearning Papers

Summary

The author shares their experience of hitting diminishing returns with FP16 + ONNX + pruning on a 162 MB transformer and seeks advice on the best next step among quantization, distillation, low-rank factorization, and hardware-specific tricks.

Hi everyone, I’ve been working on optimizing a transformer-based neural network for both inference speed and model size, but I feel like I’ve hit a plateau and would appreciate some guidance.

So far I have:

- converted the weights to FP16 (about a 2× size reduction),
- exported and optimized with ONNX Runtime for inference speed,
- tried both unstructured and structured pruning, as well as ONNX graph optimizations.

None of these gave significant additional gains, and I’m still at roughly 162 MB per model. The next steps I’m considering (rough sketches of the first three follow below) are:

- low-rank factorization (SVD/LoRA-style compression),
- more aggressive quantization (INT8/INT4 via GPTQ, AWQ, or SmoothQuant),
- knowledge distillation into a smaller student model,
- hardware/runtime-specific optimizations such as TensorRT or FlashAttention.

I’m not sure which of these actually gives meaningful real-world improvements after FP16 + pruning. I’d really appreciate advice on what tends to work best in practice for transformer compression beyond what I’ve already tried, and on whether low-rank methods are actually effective post-training, or whether distillation/quantization is usually the only real win at this stage.
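
On the low-rank option: a post-training SVD factorization replaces one `out_features × in_features` weight matrix with two rank-`r` factors, which saves parameters whenever `r < (in·out)/(in+out)`. Below is a minimal PyTorch sketch; the layer size and rank are illustrative only, and in practice a short fine-tune afterwards is usually needed to recover accuracy:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear with two low-rank Linears via truncated SVD."""
    W = layer.weight.data                 # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # fold singular values into U
    V_r = Vh[:rank, :]                    # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: factorize a 1024x1024 projection down to rank 128
# (~4x fewer parameters for that layer; accuracy impact must be validated).
layer = nn.Linear(1024, 1024)
low_rank = factorize_linear(layer, rank=128)
```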
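
For quantization, since an ONNX export already exists, dynamic INT8 quantization via ONNX Runtime is often the lowest-effort next step: INT8 weights are half the width of FP16, so weight storage roughly halves again. A minimal sketch follows; the file paths are placeholders, and note that ORT's quantizer is normally run against the FP32 export rather than the FP16 one:

```python
# Dynamic INT8 quantization of an existing ONNX model.
# Weights are quantized offline; activations are quantized on the fly at
# inference time, so no calibration dataset is needed.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder path to the exported model
    model_output="model_int8.onnx",  # placeholder path for the quantized model
    weight_type=QuantType.QInt8,     # store weights as signed 8-bit integers
)
```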
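
And if distillation turns out to be the real win, its core is only a few lines: train a smaller student to match the teacher's temperature-softened logits alongside the ground-truth labels. A generic sketch of the standard Hinton-style loss; the temperature and mixing weight shown are typical defaults, not tuned values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Standard soft-target distillation loss (Hinton et al., 2015)."""
    # KL divergence between temperature-softened distributions; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```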

Similar Articles

Generative modeling with sparse transformers

OpenAI Blog

OpenAI introduces the Sparse Transformer, a deep neural network that improves the attention mechanism from O(N²) to O(N√N) complexity, enabling modeling of sequences 30x longer than previously possible across text, images, and audio. The model uses sparse attention patterns and checkpoint-based memory optimization to train networks up to 128 layers deep, achieving state-of-the-art performance across multiple domains.
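
For intuition, the O(N√N) cost comes from each position attending to only about 2√N others instead of all N, e.g. a local window plus strided "anchor" positions. The following is a minimal sketch of a mask in the spirit of those patterns, not OpenAI's released kernels:

```python
import torch

def strided_sparse_mask(n: int, stride: int) -> torch.Tensor:
    """Boolean (n, n) causal mask: attend to the last `stride` positions
    plus every position in a strided set of anchor columns. With
    stride ~ sqrt(n) this is O(n*sqrt(n)) entries vs O(n^2) dense."""
    i = torch.arange(n).unsqueeze(1)        # query positions
    j = torch.arange(n).unsqueeze(0)        # key positions
    causal = j <= i                         # no attending to the future
    local = (i - j) < stride                # recent local window
    summary = (j % stride) == (stride - 1)  # strided anchor columns
    return causal & (local | summary)

# Convert to an additive mask (-inf where attention is disallowed).
mask = strided_sparse_mask(n=16, stride=4)
```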

Consider running a bigger quant if possible

Reddit r/LocalLLaMA

A user reports that switching from a heavily compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy despite lower tok/s, and urges others to favor bigger quants when VRAM allows.

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

arXiv cs.CL

This paper proposes STOP (SuperTOken for Pruning), a systematic framework for pruning inefficient reasoning paths early in parallel reasoning with Large Reasoning Models. The method achieves superior efficiency and effectiveness across models from 1.5B to 20B parameters, boosting GPT-OSS-20B accuracy on AIME25 from 84% to 90% under fixed compute budgets.
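
The summary does not describe STOP's actual pruning criterion, but the general shape of early path pruning in parallel reasoning is easy to sketch: sample several partial chains, score them cheaply part-way through, and spend the remaining budget only on survivors. A generic illustration follows; `generate_step` and `partial_score` are hypothetical placeholders, and the scoring rule is an assumption, not the paper's method:

```python
from typing import Callable, List

def prune_paths_early(
    paths: List[str],
    generate_step: Callable[[str], str],    # hypothetical: extend a path by one chunk
    partial_score: Callable[[str], float],  # hypothetical: cheap quality estimate
    prune_at: int,                          # step at which to prune
    keep: int,                              # number of paths to keep
    total_steps: int,
) -> List[str]:
    """Generic early pruning for parallel reasoning: extend all paths to a
    checkpoint, keep only the top-`keep` by a cheap score, then finish those."""
    for step in range(total_steps):
        paths = [generate_step(p) for p in paths]
        if step == prune_at:
            paths = sorted(paths, key=partial_score, reverse=True)[:keep]
    return paths
```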

Transformer Math Explorer [P]

Reddit r/MachineLearning

This interactive tool visualizes the mathematical underpinnings of transformer models through dataflow graphs, covering architectures from GPT-2 to Qwen 3.6 and various attention mechanisms.