Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

arXiv cs.CL 06/26/26, 04:00 AM Papers

Summary

This paper presents a cascaded multi-granularity pruning framework for deploying LLMs on Industrial IoT edge devices, achieving up to 13.8x compression with minimal accuracy loss on MHA+GELU architectures while exposing a collapse on GQA+SwiGLU designs.

arXiv:2606.26861v1 Announce Type: new Abstract: Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios due to one-shot importance estimation, and their cross-architecture behavior remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition predicting whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis spanning 88M to 6.25B-parameter models, the framework extends achievable compression to 13.8 times on MHA+GELU architectures with 83.82% accuracy (+3.70 percentage points (pp) over the strongest baseline), while exposing a ~74pp accuracy collapse on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.
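
As a reading aid for the headline figures, the short Python sketch below converts them into more familiar terms. It assumes that "13.8 times" compression means the ratio of original to compressed model size, and that the latency and memory percentages are relative reductions from the uncompressed baseline. The abstract does not spell out either definition, and the function names are illustrative only.

```python
def retained_fraction(compression_ratio: float) -> float:
    """Fraction of the original size left after compressing by `compression_ratio`."""
    return 1.0 / compression_ratio


def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a relative reduction in percent into an improvement factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Figures quoted in the abstract (all "up to" values, so best case).
print(f"size retained at 13.8x: {retained_fraction(13.8):.1%}")        # 7.2%
print(f"latency speedup at -67.2%: {reduction_to_factor(67.2):.2f}x")  # 3.05x
print(f"memory factor at -62.5%: {reduction_to_factor(62.5):.2f}x")    # 2.67x
```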

Cached at: 06/26/26, 05:19 AM

# Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT
Source: [https://arxiv.org/abs/2606.26861](https://arxiv.org/abs/2606.26861)
[View PDF](https://arxiv.org/pdf/2606.26861)

> Abstract: Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios due to one-shot importance estimation, and their cross-architecture behavior remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition predicting whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis spanning 88M to 6.25B-parameter models, the framework extends achievable compression to 13.8 times on MHA+GELU architectures with 83.82% accuracy (+3.70 percentage points (pp) over the strongest baseline), while exposing a ~74pp accuracy collapse on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.
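
The coarse-to-fine cascade described in the abstract can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the model is a list of dictionaries, importance is plain weight magnitude, and `recover` is a hypothetical hook standing in for the low-rank recovery step (identity by default). All names are assumptions made for the example.

```python
from typing import Callable, Dict, List

Layer = Dict[str, List[float]]  # {"heads": [...], "channels": [...]}
Model = List[Layer]


def keep_top(scores: List[float], keep: int) -> List[int]:
    """Indices of the `keep` highest scores, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def layer_score(layer: Layer) -> float:
    # Placeholder importance: total magnitude of the layer's components.
    return sum(abs(w) for w in layer["heads"]) + sum(abs(w) for w in layer["channels"])


def prune_layers(model: Model, keep: int) -> Model:
    idx = keep_top([layer_score(layer) for layer in model], keep)
    return [model[i] for i in idx]


def prune_within(model: Model, key: str, keep: int) -> Model:
    """Keep the `keep` largest-magnitude entries of `key` in every layer."""
    pruned = []
    for layer in model:
        idx = keep_top([abs(w) for w in layer[key]], min(keep, len(layer[key])))
        new_layer = dict(layer)  # shallow copy so the input model is untouched
        new_layer[key] = [layer[key][i] for i in idx]
        pruned.append(new_layer)
    return pruned


def cascaded_prune(
    model: Model,
    keep_layers: int,
    keep_heads: int,
    keep_channels: int,
    recover: Callable[[Model], Model] = lambda m: m,
) -> Model:
    # Stage 1 (coarsest): drop whole layers, then run the recovery hook.
    model = recover(prune_layers(model, keep_layers))
    # Stage 2: head scores are computed on the model returned by stage 1.
    model = recover(prune_within(model, "heads", keep_heads))
    # Stage 3 (finest): channel scores are computed on the model from stage 2.
    model = recover(prune_within(model, "channels", keep_channels))
    return model


def size(model: Model) -> int:
    return sum(len(layer["heads"]) + len(layer["channels"]) for layer in model)


toy = [
    {"heads": [0.9, 0.1, 0.5, 0.3], "channels": [0.2, 0.8, 0.05, 0.6, 0.4, 0.7]},
    {"heads": [0.05, 0.02, 0.01, 0.03], "channels": [0.01, 0.02, 0.03, 0.01, 0.02, 0.01]},
    {"heads": [0.7, 0.6, 0.2, 0.4], "channels": [0.9, 0.1, 0.3, 0.5, 0.2, 0.6]},
]
pruned = cascaded_prune(toy, keep_layers=2, keep_heads=2, keep_channels=3)
print(size(toy), "->", size(pruned))  # 30 -> 10
```

Because each stage scores components on the model returned by the previous `recover` call, a real recovery step that updates weights would change the scores seen by later stages, which is the re-estimation the abstract refers to.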

## Submission history

From: Jinghan Wang \[[view email](https://arxiv.org/show-email/7a38e432/2606.26861)\]

**\[v1\]** Thu, 25 Jun 2026 10:44:48 UTC (1,559 KB)

Similar Articles

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

Efficient On-Device Diffusion LLM Inference with Mobile NPU

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Local LLM Inference Optimization: The Complete Guide

@_akhaliq: SpenseGPT Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
