Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]
Summary
Author shares experience hitting diminishing returns with FP16 + ONNX + pruning on 162 MB transformer, seeks advice on next best steps among quantization, distillation, low-rank factorization, or hardware-specific tricks.
Similar Articles
Tiny Scale Is All I Can Spare To Play With Transformer
A student introduces Silia, a novel transformer architecture that combines attention and FFN into a unified operation to save parameters at scales ≤10M, achieving comparable performance to GPT-2 with fewer parameters despite limited compute resources.
@PyTorch: Model Optimization and Post-Training Quantization Model quantization is an effective method to reduce VRAM usage and im…
This post from NVIDIA explains how to use the NVIDIA Model Optimizer library to quantize a CLIP model to FP8 using post-training quantization, reducing VRAM usage and improving inference performance on consumer GPUs.
@tom_doerr: Compresses deep learning models for faster inference https://github.com/NVIDIA/Model-Optimizer…
NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.
Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime achieves 37% faster inference than HF Transformers bfloat16, while GGUF prioritizes memory efficiency.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.