Tag
This paper presents a cascaded multi-granularity pruning framework for deploying LLMs on Industrial IoT edge devices, achieving up to 13.8x compression with minimal accuracy loss on MHA+GELU architectures while exposing a collapse on GQA+SwiGLU designs.
This paper empirically compares obtaining small language models by pruning with training them from scratch, finding that pruning provides a strong advantage under limited token budgets but that the advantage diminishes as the training budget grows, especially with coarse pruning.