BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

arXiv cs.LG Papers

Summary

BitsMoE introduces a spectral-energy-guided bit allocation framework for quantizing Mixture-of-Experts LLMs, achieving substantial accuracy improvements and speedups under ultra-low-bit quantization.

arXiv:2606.00079v1

Cached at: 06/02/26, 03:39 PM

# BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
Source: [https://arxiv.org/html/2606.00079](https://arxiv.org/html/2606.00079)
Jiayu Zhao$^{1,2}$, Zihan Teng$^{1,2}$, Minhao Fan$^{2}$, Tianrui Ma$^{2}$, Wentao Ren$^{3}$, Song Chen$^{1}$, Weichen Liu$^{2}$

$^{1}$School of Microelectronics, University of Science and Technology of China; $^{2}$College of Computing and Data Science, Nanyang Technological University; $^{3}$School of Electrical and Electronic Engineering, Nanyang Technological University

Work done during a visit to Nanyang Technological University. Correspondence to: Weichen Liu <liu@ntu.edu.sg>.

###### Abstract

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at [https://github.com/zjiayu064/BitsMoE](https://github.com/zjiayu064/BitsMoE).

## 1 Introduction

Recent progress in natural language processing has been largely driven by large language models (LLMs), among which Mixture-of-Experts (MoE) models [[5](https://arxiv.org/html/2606.00079#bib.bib1)] have emerged as an efficient sparse-scaling paradigm and achieved strong performance across diverse benchmarks [[23](https://arxiv.org/html/2606.00079#bib.bib5), [45](https://arxiv.org/html/2606.00079#bib.bib6), [11](https://arxiv.org/html/2606.00079#bib.bib10), [46](https://arxiv.org/html/2606.00079#bib.bib7)]. However, typical systems keep all experts memory-resident regardless of runtime activation, which makes the memory footprint a key deployment bottleneck. For example, Qwen3-30B-A3B-Base [[45](https://arxiv.org/html/2606.00079#bib.bib6)] activates only 3B parameters per token but still stores all 30B parameters. This gap between sparse computation and dense memory residency makes MoE deployment costly and motivates MoE LLM compression [[28](https://arxiv.org/html/2606.00079#bib.bib11)]. Existing methods mainly follow two paradigms, *pruning* and *quantization*, which reduce memory usage and inference cost from different perspectives.

Despite recent progress, existing MoE compression methods remain inadequate under aggressive compression. *Pruning-based methods* reduce model size by removing redundant experts or compressing expert weights [[16](https://arxiv.org/html/2606.00079#bib.bib13), [25](https://arxiv.org/html/2606.00079#bib.bib12), [47](https://arxiv.org/html/2606.00079#bib.bib17)], but hard structural pruning irreversibly discards capacity and limits flexibility under tight memory budgets. In contrast, *quantization-based methods* preserve the MoE architecture and routing mechanism by representing expert weights in low precision [[8](https://arxiv.org/html/2606.00079#bib.bib16), [21](https://arxiv.org/html/2606.00079#bib.bib19), [12](https://arxiv.org/html/2606.00079#bib.bib14), [9](https://arxiv.org/html/2606.00079#bib.bib44), [43](https://arxiv.org/html/2606.00079#bib.bib45), [48](https://arxiv.org/html/2606.00079#bib.bib46)]. However, existing methods usually allocate bit-widths at coarse granularities such as layers, experts, or linear blocks. Such coarse allocation fails to capture the intrinsic heterogeneity of MoE models and leads to severe degradation under ultra-low-bit quantization.

Although quantization preserves MoE capacity better than pruning, uniform ultra-low-bit quantization ignores the heterogeneous importance of expert weights. Under tight memory budgets, limited bits should therefore be allocated adaptively rather than uniformly, especially near 2 bits where existing MoE quantization methods degrade sharply. This degradation reflects a mismatch between coarse bit allocation and MoE structure: experts share input–output feature spaces and exhibit redundant cross-expert directions, whereas sensitivity differs markedly across fine-grained weight directions. Consequently, coarse allocation can over-compress shared or sensitive directions and waste bits on less important ones. This raises a fundamental question:

*How can MoE quantization use calibration data to identify heterogeneous importance and allocate bits at fine granularity under a fixed budget?*

![Refer to caption](https://arxiv.org/html/2606.00079v1/x1.png)

Figure 1: Overview of BitsMoE. Stage 1 (Section [3.2](https://arxiv.org/html/2606.00079#S3.SS2)): Each MoE layer is decomposed by SVD into a shared basis and expert-specific spectral factors. Stage 2 (Section [3.3](https://arxiv.org/html/2606.00079#S3.SS3)): Bit-widths are assigned to spectral components by an ILP under a fixed bit budget. Stage 3: During inference, inputs are projected onto the shared basis and quantized spectral factors are used to compute routed experts.

We address this question by formulating MoE quantization as fixed-budget bit allocation over spectral components. To define such allocation units, BitsMoE decomposes each MoE layer via SVD into a shared basis and expert-specific spectral factors. The shared basis is retained without quantization to preserve common cross-expert structure, while the expert-specific factors serve as fine-grained units for mixed-precision quantization. We then formulate an activation-aware reconstruction surrogate to estimate the loss induced by assigning each bit-width to each spectral component, and cast the resulting allocation problem as an integer linear program (ILP) that minimizes the estimated reconstruction loss under a fixed bit budget.

This design positions BitsMoE as a spectrum-wise mixed-precision framework rather than an SVD rank-reduction method or a coarse-grained MoE quantizer. As shown in Figure [1](https://arxiv.org/html/2606.00079#S1.F1), its shared spectral space preserves common cross-expert structure in an unquantized basis and exposes expert-specific spectral components as allocation units. Thus, BitsMoE differs from prior SVD-based MoE compressors [[25](https://arxiv.org/html/2606.00079#bib.bib12), [47](https://arxiv.org/html/2606.00079#bib.bib17), [16](https://arxiv.org/html/2606.00079#bib.bib13)], which primarily use decomposition to reduce rank and discard spectral components, and from prior ILP-based mixed-precision MoE methods [[12](https://arxiv.org/html/2606.00079#bib.bib14), [22](https://arxiv.org/html/2606.00079#bib.bib20)], which allocate bits at the layer, expert, or linear-block level. In contrast, BitsMoE allocates more bits to spectral components with larger activation-aware reconstruction costs. Detailed positioning is provided in Appendix [A](https://arxiv.org/html/2606.00079#A1).

Our contributions are summarized as follows:

1. Capacity-preserving spectral quantization. We propose a shared spectral parameterization for MoE layers that preserves cross-expert structure and treats expert-specific spectral components as fine-grained quantization units.
2. Importance-aligned bit allocation under a fixed budget. We cast MoE quantization as spectrum-wise bit allocation with an activation-aware reconstruction surrogate. The ILP allocates bits based on spectral energy, activation importance, and bit-dependent quantization distortion.
3. Accurate and efficient MoE deployment. We present BitsMoE, an end-to-end framework that integrates shared-basis decomposition, adaptive bit allocation, and efficient inference. Experiments on multiple MoE LLMs show that BitsMoE improves downstream accuracy and inference efficiency under ultra-low-bit quantization.

## 2 Related Work

### 2.1 Mixture-of-Experts Large Language Models

MoE architectures have become widely adopted in recent LLMs [[23](https://arxiv.org/html/2606.00079#bib.bib5), [27](https://arxiv.org/html/2606.00079#bib.bib21), [44](https://arxiv.org/html/2606.00079#bib.bib42), [31](https://arxiv.org/html/2606.00079#bib.bib43)]. By partitioning the network into multiple experts and routing each input to a sparse subset, MoE reduces per-token computation while improving scalability [[35](https://arxiv.org/html/2606.00079#bib.bib22), [13](https://arxiv.org/html/2606.00079#bib.bib15)]. For instance, Mixtral [[23](https://arxiv.org/html/2606.00079#bib.bib5)] replaces each feed-forward block with multiple experts and applies top-$k$ routing, activating only two experts per token while retaining large total capacity. Despite these advantages, MoE LLMs still suffer from a large parameter footprint due to expert replication [[18](https://arxiv.org/html/2606.00079#bib.bib23)]. Moreover, unbalanced routing induces expert-level redundancy and highly skewed expert utilization, which creates substantial disparities in expert importance and complicates effective compression [[29](https://arxiv.org/html/2606.00079#bib.bib24)].

### 2.2 MoE LLM Compression and Pruning

SVD-based low-rank decomposition has been widely used as a structured compression tool for dense LLMs [[20](https://arxiv.org/html/2606.00079#bib.bib37), [7](https://arxiv.org/html/2606.00079#bib.bib38), [49](https://arxiv.org/html/2606.00079#bib.bib39), [41](https://arxiv.org/html/2606.00079#bib.bib40)]. For MoE LLMs, recent methods further exploit expert-level redundancy through pruning and structured decomposition. MoE-I$^2$ [[47](https://arxiv.org/html/2606.00079#bib.bib17)] combines non-uniform inter-expert pruning with importance-aware intra-expert low-rank decomposition to compress MoE LLMs in a task-agnostic framework. MoE-SVD [[25](https://arxiv.org/html/2606.00079#bib.bib12)] selectively decomposes less sensitive expert layers and reduces cross-expert redundancy through frequency-guided V-matrix sharing and U-matrix trimming. D$^2$-MoE [[16](https://arxiv.org/html/2606.00079#bib.bib13)] decomposes expert weights into a Fisher-weighted shared base and expert-specific delta weights, where the shared base is compressed via semi-dynamic pruning and the delta weights are compressed via truncation-aware SVD.

### 2.3 Post-Training Quantization for MoE LLMs

Post-training quantization (PTQ) has become a widely used paradigm for compressing LLMs without retraining. In this work, we focus on scalar weight quantization, a representative PTQ family that has been extensively studied for LLM compression [[26](https://arxiv.org/html/2606.00079#bib.bib2), [42](https://arxiv.org/html/2606.00079#bib.bib36), [34](https://arxiv.org/html/2606.00079#bib.bib18), [2](https://arxiv.org/html/2606.00079#bib.bib41)]. Among these methods, GPTQ [[14](https://arxiv.org/html/2606.00079#bib.bib3)] uses Hessian-based error compensation for sequential weight quantization, while HQQ [[3](https://arxiv.org/html/2606.00079#bib.bib4)] formulates low-bit quantization as a calibration-free half-quadratic optimization problem.

For MoE LLMs, MoEQuant [[8](https://arxiv.org/html/2606.00079#bib.bib16)] improves PTQ by constructing expert-balanced calibration samples and incorporating token-expert affinities into the quantization process. MiLo [[21](https://arxiv.org/html/2606.00079#bib.bib19)] augments extremely quantized MoE models with adaptive low-rank compensators and efficient INT3 kernels to recover accuracy while improving inference efficiency. MxMoE [[12](https://arxiv.org/html/2606.00079#bib.bib14)] assigns bit-widths according to block sensitivity, expert activation patterns, and hardware constraints, and generates optimized Group GEMM kernels for efficient MoE inference.

## 3 Methodology

### 3.1 BitsMoE

We present BitsMoE, an efficient mixed-precision quantization framework for MoE LLMs. Its design is motivated by two properties of MoE expert weights under tight memory budgets. First, experts within the same MoE layer operate on shared input and output feature spaces, suggesting that cross-expert spectral redundancy can be captured by a shared basis rather than quantizing each expert independently. Second, spectral components differ in both reconstruction contribution and routing-conditioned importance, making uniform or coarse-grained bit-width allocation inefficient in the ultra-low-bit regime.

Accordingly, BitsMoE introduces two key designs. It first extracts a shared spectral basis across experts for each projection type, while representing each expert using normalized expert-specific spectral components. It then formulates spectrum-wise mixed-precision bit allocation as an ILP that minimizes an activation-aware reconstruction surrogate under a fixed bit budget. Figure [1](https://arxiv.org/html/2606.00079#S1.F1) provides an overview of the BitsMoE framework, and Table [6](https://arxiv.org/html/2606.00079#A2.T6) summarizes the notation used in this section. Sections [3.2](https://arxiv.org/html/2606.00079#S3.SS2) and [3.3](https://arxiv.org/html/2606.00079#S3.SS3) then present the shared-basis decomposition and ILP-based bit allocation in detail.

### 3.2 Shared-basis Spectral Decomposition

Within an MoE layer, all experts share the same input and output feature spaces but implement distinct parameterized linear transformations. Therefore, a shared basis for each projection type in the MoE layer can be obtained via SVD. We denote the projection types by $\mathcal{H} \coloneqq \{\mathtt{gate\_proj}, \mathtt{up\_proj}, \mathtt{down\_proj}\}$, where $\mathcal{H}_{\mathrm{in}} \coloneqq \{\mathtt{gate\_proj}, \mathtt{up\_proj}\}$ and $h_{\mathrm{dn}} \coloneqq \mathtt{down\_proj}$. For $h \in \mathcal{H}_{\mathrm{in}}$, we concatenate the expert weights along the output-channel dimension and decompose the concatenated matrix as

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)} \coloneqq \begin{bmatrix} \boldsymbol{W}_{1}^{(h)} \\ \vdots \\ \boldsymbol{W}_{E}^{(h)} \end{bmatrix} = \boldsymbol{U}_{\mathrm{cat}}^{(h)} \boldsymbol{\Sigma}^{(h)} \boldsymbol{\Phi}_{h}^{\top} = \widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)} \boldsymbol{\Phi}_{h}^{\top}, \qquad \widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)} \coloneqq \boldsymbol{U}_{\mathrm{cat}}^{(h)} \boldsymbol{\Sigma}^{(h)} = \begin{bmatrix} \widetilde{\boldsymbol{P}}_{1}^{(h)} \\ \vdots \\ \widetilde{\boldsymbol{P}}_{E}^{(h)} \end{bmatrix}. \tag{1}
$$

###### Definition 3.1 (Spectral component and energy matrix).

Let $\boldsymbol{\phi}_{h,k}$ be the $k$-th column of $\boldsymbol{\Phi}_{h}$, and let $\widetilde{\boldsymbol{p}}_{e,h,k} \coloneqq \widetilde{\boldsymbol{P}}_{e}^{(h)}[:,k]$. The corresponding shared-basis component is

$$
\widetilde{\boldsymbol{p}}_{e,h,k}\, \boldsymbol{\phi}_{h,k}^{\top}. \tag{2}
$$

Its spectral energy and the associated diagonal energy matrix are defined as

$$
\alpha_{e,h,k} \coloneqq \left\| \widetilde{\boldsymbol{p}}_{e,h,k} \right\|_{2}, \qquad \boldsymbol{A}_{e}^{(h)} \coloneqq \operatorname{diag}\!\left( \alpha_{e,h,1}, \ldots, \alpha_{e,h,n_{h}} \right). \tag{3}
$$

###### Definition 3.2 (Normalized expert-specific spectral matrix).

The normalized expert-specific spectral matrix is defined by

$$
\boldsymbol{P}_{e}^{(h)} \coloneqq \widetilde{\boldsymbol{P}}_{e}^{(h)} \left( \boldsymbol{A}_{e}^{(h)} \right)^{-1} = \left[ \boldsymbol{p}_{e,h,1}, \ldots, \boldsymbol{p}_{e,h,n_{h}} \right], \qquad \boldsymbol{p}_{e,h,k} \coloneqq \frac{\widetilde{\boldsymbol{p}}_{e,h,k}}{\alpha_{e,h,k}}. \tag{4}
$$

By Definitions [3.1](https://arxiv.org/html/2606.00079#S3.Thmdefinition1) and [3.2](https://arxiv.org/html/2606.00079#S3.Thmdefinition2), each column of $\boldsymbol{P}_{e}^{(h)}$ has unit $\ell_{2}$-norm. The expert weight can then be written as

$$
\boldsymbol{W}_{e}^{(h)} = \boldsymbol{P}_{e}^{(h)} \boldsymbol{A}_{e}^{(h)} \boldsymbol{\Phi}_{h}^{\top}, \qquad h \in \mathcal{H}_{\mathrm{in}}. \tag{5}
$$

For $h = h_{\mathrm{dn}}$, expert weights share the same output feature space, so we concatenate them along the input-channel dimension:

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)} \coloneqq \left[ \boldsymbol{W}_{1}^{(h)} \ \cdots \ \boldsymbol{W}_{E}^{(h)} \right] = \boldsymbol{\Phi}_{h} \widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)\top}, \qquad \widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)} = \begin{bmatrix} \widetilde{\boldsymbol{P}}_{1}^{(h)} \\ \vdots \\ \widetilde{\boldsymbol{P}}_{E}^{(h)} \end{bmatrix}. \tag{6}
$$

After the same column normalization of $\widetilde{\boldsymbol{P}}_{e}^{(h)}$, each down-projection expert is written as

$$
\boldsymbol{W}_{e}^{(h)} = \boldsymbol{\Phi}_{h} \boldsymbol{A}_{e}^{(h)} \boldsymbol{P}_{e}^{(h)\top}, \qquad h = h_{\mathrm{dn}}. \tag{7}
$$

Thus, across all projection types, $\boldsymbol{P}_{e}^{(h)}$ denotes the expert-specific normalized spectral matrix assigned mixed bit-widths, while $\boldsymbol{\Phi}_{h}$ denotes the shared basis retained without quantization.
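As a rough illustration of Eqs. (1) to (7), the following NumPy sketch builds a shared basis and normalized expert-specific factors for random toy experts, then checks that each expert is recovered exactly. The function name `shared_basis_decompose`, the toy sizes, and the use of a full (untruncated) SVD are assumptions made for this example; they are not taken from the released code.

```python
import numpy as np

def shared_basis_decompose(stacked, num_experts):
    """SVD of row-stacked experts -> shared basis Phi, per-expert P_e, energies alpha_e.

    Assumes every per-expert column energy is nonzero, so normalization is defined.
    """
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    Phi = Vt.T                                           # columns are phi_k
    P_tilde = np.split(U * S, num_experts, axis=0)       # per-expert blocks of U Sigma
    alphas = [np.linalg.norm(Pt, axis=0) for Pt in P_tilde]   # Eq. (3)
    Ps = [Pt / a for Pt, a in zip(P_tilde, alphas)]           # Eq. (4), unit columns
    return Phi, Ps, alphas

rng = np.random.default_rng(0)
E, d_ff, d_model = 4, 6, 8                               # toy sizes (assumed)
W_up = [rng.standard_normal((d_ff, d_model)) for _ in range(E)]
W_down = [rng.standard_normal((d_model, d_ff)) for _ in range(E)]

# gate/up: stack along the output-channel dimension, Eq. (1).
Phi_up, P_up, a_up = shared_basis_decompose(np.concatenate(W_up, axis=0), E)
# down: Eq. (6) is the transpose of the same stacking applied to W_e^T.
Phi_dn, P_dn, a_dn = shared_basis_decompose(
    np.concatenate([W.T for W in W_down], axis=0), E)

# Reconstruction errors for Eq. (5) and Eq. (7).
up_err = max(np.abs(W - (P * a) @ Phi_up.T).max()
             for W, P, a in zip(W_up, P_up, a_up))
dn_err = max(np.abs(W - (Phi_dn * a) @ P.T).max()
             for W, P, a in zip(W_down, P_dn, a_dn))
print(up_err, dn_err)
```

Both errors should sit at floating-point round-off, since no spectral component is discarded here; compression would come only from quantizing the columns of each `P_e`.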

### 3.3 Spectral Energy-Guided Adaptive Bit Allocation

#### Activation-aware reconstruction error.

We first consider the loss of a single expert for a fixed projection type $h \in \mathcal{H}_{\mathrm{in}}$, and omit the expert and projection indices for clarity. Let $\boldsymbol{P} = [\boldsymbol{p}_{1}, \ldots, \boldsymbol{p}_{n}]$, $\boldsymbol{A} = \operatorname{diag}(\alpha_{1}, \ldots, \alpha_{n})$, and $\boldsymbol{\Phi} = [\boldsymbol{\phi}_{1}, \ldots, \boldsymbol{\phi}_{n}]$, so that

$$
\boldsymbol{W} = \boldsymbol{P} \boldsymbol{A} \boldsymbol{\Phi}^{\top} = \sum_{k=1}^{n} \alpha_{k} \boldsymbol{p}_{k} \boldsymbol{\phi}_{k}^{\top}. \tag{8}
$$

Quantization is applied only to the expert-specific normalized spectral vectors:

$$
\widehat{\boldsymbol{p}}_{k} = Q_{b}(\boldsymbol{p}_{k}), \qquad \boldsymbol{\varepsilon}_{k}(b) \coloneqq \boldsymbol{p}_{k} - \widehat{\boldsymbol{p}}_{k}. \tag{9}
$$

Let $\widehat{\boldsymbol{P}} = [\widehat{\boldsymbol{p}}_{1}, \ldots, \widehat{\boldsymbol{p}}_{n}]$ and $\boldsymbol{E}_{P} \coloneqq \boldsymbol{P} - \widehat{\boldsymbol{P}} = [\boldsymbol{\varepsilon}_{1}, \ldots, \boldsymbol{\varepsilon}_{n}]$. The reconstructed weight and the induced weight perturbation are

$$
\widehat{\boldsymbol{W}} = \widehat{\boldsymbol{P}} \boldsymbol{A} \boldsymbol{\Phi}^{\top} = \sum_{k=1}^{n} \alpha_{k} \widehat{\boldsymbol{p}}_{k} \boldsymbol{\phi}_{k}^{\top}, \qquad \boldsymbol{\Delta} \coloneqq \boldsymbol{W} - \widehat{\boldsymbol{W}} = \boldsymbol{E}_{P} \boldsymbol{A} \boldsymbol{\Phi}^{\top} = \sum_{k=1}^{n} \alpha_{k} \boldsymbol{\varepsilon}_{k}(b) \boldsymbol{\phi}_{k}^{\top}. \tag{10}
$$
###### Lemma 3.1 (Spectrum-wise reconstruction error).

Under the shared-basis decomposition in Eq. ([8](https://arxiv.org/html/2606.00079#S3.E8)) and the reconstruction error definition in Eq. ([10](https://arxiv.org/html/2606.00079#S3.E10)), the routing-weighted reconstruction loss satisfies

$$
L(\widehat{\boldsymbol{W}}) \coloneqq \mathbb{E} \left\| (\boldsymbol{W} - \widehat{\boldsymbol{W}}) \boldsymbol{X}_{g} \right\|_{F}^{2} = \sum_{k=1}^{n} \sum_{l=1}^{n} \alpha_{k} \alpha_{l} \left( \boldsymbol{\phi}_{k}^{\top} \boldsymbol{H} \boldsymbol{\phi}_{l} \right) \mathbb{E}\left[ \boldsymbol{\varepsilon}_{k}^{\top} \boldsymbol{\varepsilon}_{l} \right], \tag{11}
$$

where $\boldsymbol{X}_{g} \coloneqq \boldsymbol{X} \operatorname{Diag}(\boldsymbol{g})^{1/2}$ and $\boldsymbol{H} \coloneqq \boldsymbol{X}_{g} \boldsymbol{X}_{g}^{\top} = \boldsymbol{X} \operatorname{Diag}(\boldsymbol{g}) \boldsymbol{X}^{\top}$, so $\boldsymbol{g}$ weights calibration activations according to the corresponding routing affinities.

###### Proof.

Let $\boldsymbol{\Delta} \coloneqq \boldsymbol{W} - \widehat{\boldsymbol{W}} = \sum_{k} \alpha_{k} \boldsymbol{\varepsilon}_{k} \boldsymbol{\phi}_{k}^{\top}$. Using the identity $\|\boldsymbol{M}\|_{F}^{2} = \operatorname{Tr}(\boldsymbol{M} \boldsymbol{M}^{\top})$ with $\boldsymbol{M} = \boldsymbol{\Delta} \boldsymbol{X}_{g}$, we obtain

$$
L(\widehat{\boldsymbol{W}}) = \mathbb{E}\left[ \operatorname{Tr}\left( \boldsymbol{\Delta} \boldsymbol{H} \boldsymbol{\Delta}^{\top} \right) \right] = \sum_{k,l} \alpha_{k} \alpha_{l} \left( \boldsymbol{\phi}_{k}^{\top} \boldsymbol{H} \boldsymbol{\phi}_{l} \right) \mathbb{E}\left[ \boldsymbol{\varepsilon}_{k}^{\top} \boldsymbol{\varepsilon}_{l} \right]. \tag{12}
$$

This gives the spectrum-wise reconstruction-error decomposition in Eq. ([11](https://arxiv.org/html/2606.00079#S3.E11)). ∎

To avoid cross-component interactions, which would make bit allocation a quadratic ILP, we adopt a diagonal approximation. We further assume that quantization errors associated with different spectral components are independent and zero-mean under symmetric quantization:

$$
\mathbb{E}\left[ \boldsymbol{\varepsilon}_{k}^{\top} \boldsymbol{\varepsilon}_{l} \right] \approx \mathbb{E}\left[ \boldsymbol{\varepsilon}_{k} \right]^{\top} \mathbb{E}\left[ \boldsymbol{\varepsilon}_{l} \right] \approx 0, \qquad \forall\, k \neq l. \tag{13}
$$

###### Corollary 3.1 (Additive spectrum-wise loss).

Under the uncorrelated-error assumption in Eq. ([13](https://arxiv.org/html/2606.00079#S3.E13)), the reconstruction loss reduces to

$$
L(\widehat{\boldsymbol{W}}) = \sum_{k} \alpha_{k}^{2} \left( \boldsymbol{\phi}_{k}^{\top} \boldsymbol{H} \boldsymbol{\phi}_{k} \right) \mathbb{E} \left\| \boldsymbol{\varepsilon}_{k} \right\|_{2}^{2} \approx \sum_{k} \alpha_{k}^{2} \beta_{k} \mathbb{E} \left\| \boldsymbol{\varepsilon}_{k} \right\|_{2}^{2}, \tag{14}
$$

where

$$
\beta_{k} \coloneqq \boldsymbol{\phi}_{k}^{\top} \boldsymbol{H} \boldsymbol{\phi}_{k}. \tag{15}
$$
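The identity in Lemma 3.1 and the diagonal term kept by Corollary 3.1 can be checked numerically for a single realization of the errors. In the sketch below, all inputs are random stand-ins and the variable names and toy sizes are assumptions for illustration only. With no expectation taken, the double sum of Eq. (11) matches the directly computed Frobenius loss, while the diagonal-only value of Eq. (14) generally differs from it.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, n, tokens = 5, 4, 4, 16               # toy sizes (assumed)

# Stand-ins: orthonormal shared basis, energies, error vectors, activations, affinities.
Phi, _ = np.linalg.qr(rng.standard_normal((d_in, n)))
alpha = rng.uniform(0.5, 2.0, size=n)
Eps = 0.01 * rng.standard_normal((d_out, n))       # column k plays the role of eps_k
X = rng.standard_normal((d_in, tokens))            # one calibration token per column
g = rng.uniform(0.0, 1.0, size=tokens)             # routing affinities

X_g = X * np.sqrt(g)                               # X Diag(g)^{1/2}
H = X_g @ X_g.T                                    # X Diag(g) X^T
Delta = (Eps * alpha) @ Phi.T                      # sum_k alpha_k eps_k phi_k^T, Eq. (10)

direct = np.linalg.norm(Delta @ X_g, "fro") ** 2   # left-hand side of Eq. (11)
G = Phi.T @ H @ Phi                                # G[k, l] = phi_k^T H phi_l
double_sum = np.sum(np.outer(alpha, alpha) * G * (Eps.T @ Eps))    # Eq. (11)

beta = np.diag(G)                                  # Eq. (15)
diagonal_only = np.sum(alpha**2 * beta * np.sum(Eps**2, axis=0))   # Eq. (14)
print(direct, double_sum, diagonal_only)
```

The gap between `double_sum` and `diagonal_only` is the cross-component contribution that Eq. (13) assumes vanishes in expectation.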

For $h = h_{\mathrm{dn}}$, the shared basis is associated with the activation-output feature space, while the expert-specific normalized spectral vectors remain on the activation-input side. Therefore, we write the perturbation as

$$
\boldsymbol{\Delta} = \sum_{k=1}^{n} \alpha_{k} \boldsymbol{\phi}_{k} \boldsymbol{\varepsilon}_{k}^{\top} = \boldsymbol{\Phi} \boldsymbol{A} \boldsymbol{E}_{P}^{\top}, \tag{16}
$$

where $\boldsymbol{\varepsilon}_{k}$ is the quantization error of the expert-specific vector $\boldsymbol{p}_{k}$. Using the orthonormality of the shared basis $\boldsymbol{\Phi}$, the activation-aware reconstruction loss becomes

$$
L(\widehat{\boldsymbol{W}}) = \mathbb{E}\left[ \operatorname{Tr}\left( \boldsymbol{A} \boldsymbol{E}_{P}^{\top} \boldsymbol{H} \boldsymbol{E}_{P} \boldsymbol{A} \right) \right] = \sum_{k=1}^{n} \alpha_{k}^{2}\, \mathbb{E}\left[ \boldsymbol{\varepsilon}_{k}^{\top} \boldsymbol{H} \boldsymbol{\varepsilon}_{k} \right]. \tag{17}
$$

Since Eq. ([17](https://arxiv.org/html/2606.00079#S3.E17)) depends on the direction of the quantization error, which is unknown before quantization, we use a tractable empirical surrogate based on the corresponding unquantized expert-specific spectral direction:

$$
\mathbb{E}\left[ \boldsymbol{\varepsilon}_{k}^{\top} \boldsymbol{H} \boldsymbol{\varepsilon}_{k} \right] \approx \beta_{k}\, \mathbb{E} \left\| \boldsymbol{\varepsilon}_{k} \right\|_{2}^{2}, \qquad \beta_{k} \coloneqq \boldsymbol{p}_{k}^{\top} \boldsymbol{H} \boldsymbol{p}_{k}. \tag{18}
$$

Therefore, for each expert and each projection in $\mathcal{H}$, the remaining derivation uses the unified additive loss

$$
L(\widehat{\boldsymbol{W}}) \approx \sum_{k=1}^{n} \alpha_{k}^{2} \beta_{k}^{\gamma}\, \mathbb{E} \left\| \boldsymbol{\varepsilon}_{k} \right\|_{2}^{2}, \tag{19}
$$

where $\boldsymbol{\varepsilon}_{k}$ denotes the quantization error of the expert-specific spectral vector $\boldsymbol{p}_{k}$, and $\gamma \in [0,1]$ smooths the activation-aware importance to prevent a few large $\beta_{k}$ values from dominating the bit-allocation objective.
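To make the role of $\gamma$ concrete, the sketch below computes $\beta_k$ as a column-wise quadratic form and then the per-component weight $\alpha_k^2 \beta_k^\gamma$ from Eq. (19). The helper names and the two-component toy values are hypothetical. The same `column_quadratic_form` helper would take the shared basis $\boldsymbol{\Phi}$ for gate/up projections (Eq. (15)) and the expert-specific matrix $\boldsymbol{P}$ for the down projection (Eq. (18)).

```python
import numpy as np

def column_quadratic_form(V, H):
    """Return v_k^T H v_k for every column v_k of V."""
    return np.einsum("ik,ij,jk->k", V, H, V)

def component_weights(alpha, beta, gamma):
    """Weight alpha_k^2 * beta_k^gamma that multiplies E||eps_k||^2 in Eq. (19)."""
    return alpha**2 * beta**gamma

# Two components with equal energy but a 100x gap in activation importance.
alpha = np.array([1.0, 1.0])
H = np.diag([1.0, 100.0])
V = np.eye(2)
beta = column_quadratic_form(V, H)        # [1, 100]

for gamma in (1.0, 0.5, 0.0):
    print(gamma, component_weights(alpha, beta, gamma))
```

In this toy case the weight ratio between the two components shrinks from 100 at $\gamma = 1$ to 10 at $\gamma = 0.5$ and to 1 at $\gamma = 0$, which is the smoothing effect described above.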

#### Piecewise reconstruction error for bit allocation.

We define a piecewise reconstruction-error surrogate for allocating bit-widths to expert-specific spectral vectors over $\mathcal{B} = \{16, 8, 6, 4, 3, 2, 1, 0\}$. For a single component $k$, let $\boldsymbol{\varepsilon}_{k}(b)$ denote the quantization-induced direction error at bit-width $b$. Its normalized distortion is measured as

$$
\mathcal{E}_{k}(b) \coloneqq \mathbb{E} \left\| \boldsymbol{\varepsilon}_{k}(b) \right\|_{2}^{2}. \tag{20}
$$

The surrogate is specified by bit-width regime.

###### Lemma 3.2 (High-bit distortion).

For $b \in \{6, 8, 16\}$, let $d$ denote the dimension of $\boldsymbol{p}_{k}$, and define

$$
\rho_{k} \coloneqq \|\boldsymbol{p}_{k}\|_{\infty}, \qquad \eta_{k} \coloneqq \frac{d \rho_{k}^{2}}{3}.
$$

The high-bit distortion is approximated as

$$
\mathcal{E}_{k}(b) \coloneqq \eta_{k} \exp(-\lambda b), \qquad \lambda = 2 \ln 2. \tag{21}
$$
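The form of Eq. (21) agrees with the standard uniform-noise model of a scalar quantizer: a step of $2\rho_k / 2^{b}$ gives a per-coordinate error variance of $\rho_k^{2}\, 4^{-b} / 3$, and summing over $d$ coordinates yields $\eta_k\, 2^{-2b}$. The sketch below compares this prediction with the measured error of a symmetric mid-rise quantizer on a random unit vector. The quantizer `uniform_quantize` is an assumed stand-in for $Q_b$, not the quantizer used in the paper, and Appendix B's proof is not reproduced here.

```python
import numpy as np

def uniform_quantize(p, bits):
    """Symmetric mid-rise uniform quantizer on [-rho, rho] with 2**bits levels."""
    rho = np.max(np.abs(p))
    step = 2.0 * rho / (2 ** bits)
    # Bin index of each entry, clipped so the largest entry stays in the top bin.
    idx = np.clip(np.floor(p / step), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (idx + 0.5) * step              # reconstruct at bin centers

rng = np.random.default_rng(0)
d = 4096                                   # toy dimension (assumed)
p = rng.standard_normal(d)
p /= np.linalg.norm(p)                     # unit l2-norm, as for columns of P

bits = 8
rho = np.max(np.abs(p))
eta = d * rho**2 / 3.0
predicted = eta * np.exp(-2.0 * np.log(2.0) * bits)       # Eq. (21)
measured = np.sum((p - uniform_quantize(p, bits)) ** 2)   # ||eps_k(b)||^2
ratio = measured / predicted
print(ratio)
```

At this bit-width the ratio should be close to one; the uniform-noise model becomes less reliable as the number of levels shrinks, which is consistent with the separate treatment of low bit-widths that follows.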

###### Lemma 3.3 (Low-bit empirical distortion).

For $b \in \{2, 3, 4\}$, define

$$
\mathcal{E}_{k}(b) \coloneqq \kappa_{b}, \tag{22}
$$

where $\kappa_{b}$ is a bit-dependent low-bit distortion coefficient estimated offline.

###### Lemma 3.4 (One-bit sign distortion).

For $b = 1$, define

$$
\widehat{\boldsymbol{p}}^{(1)}_{k} \coloneqq \frac{\operatorname{sign}(\boldsymbol{p}_{k})}{\sqrt{d}}, \qquad \cos\theta_{k} \coloneqq \boldsymbol{p}_{k}^{\top} \widehat{\boldsymbol{p}}^{(1)}_{k}.
$$

The one-bit distortion is defined by the angular mismatch

$$
\mathcal{E}_{k}(1) \coloneqq \sin^{2}\theta_{k}. \tag{23}
$$

###### Lemma 3.5 (Zero-bit eviction distortion).

For $b = 0$, the spectral vector is evicted and its normalized distortion is

$$
\mathcal{E}_{k}(0) \coloneqq 1. \tag{24}
$$

Detailed proofs of Lemmas [3.2](https://arxiv.org/html/2606.00079#S3.Thmlemma2)–[3.5](https://arxiv.org/html/2606.00079#S3.Thmlemma5) are provided in Appendix [B](https://arxiv.org/html/2606.00079#A2). Since the derivation of $\mathcal{E}$ is identical for different experts and projection types, restoring the indices $e$ and $h$ gives the piecewise distortion surrogate:

$$
\mathcal{E}_{e,h,k}(b) = \begin{cases} \eta_{e,h,k} \exp(-\lambda b), & b \in \{6, 8, 16\}, \\ \kappa_{b}, & b \in \{2, 3, 4\}, \\ \sin^{2}\theta_{e,h,k}, & b = 1, \\ 1, & b = 0. \end{cases} \tag{25}
$$
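The piecewise surrogate in Eq. (25) can be written as a small function of one normalized spectral vector and a candidate bit-width. The sketch below is a minimal version under stated assumptions: the function name `piecewise_distortion` is hypothetical, and the low-bit coefficients $\kappa_b$ must be supplied by the caller because the text only says they are estimated offline. The value passed in the usage lines is a made-up placeholder, not a number from the paper.

```python
import math
import numpy as np

LAMBDA = 2.0 * math.log(2.0)

def piecewise_distortion(p, bits, kappa):
    """Normalized distortion E_k(b) of Eq. (25) for a unit-norm vector p.

    kappa maps each low bit-width in {2, 3, 4} to its offline-estimated coefficient.
    """
    d = p.shape[0]
    if bits == 0:                               # eviction, Eq. (24)
        return 1.0
    if bits == 1:                               # sign quantization, Eq. (23)
        cos_theta = float(p @ np.sign(p)) / math.sqrt(d)
        return 1.0 - cos_theta**2               # sin^2(theta)
    if bits in (2, 3, 4):                       # empirical coefficient, Eq. (22)
        return kappa[bits]
    if bits in (6, 8, 16):                      # high-bit model, Eq. (21)
        rho = float(np.max(np.abs(p)))
        return d * rho**2 / 3.0 * math.exp(-LAMBDA * bits)
    raise ValueError(f"unsupported bit-width: {bits}")

flat = np.full(4, 0.5)                          # unit vector aligned with its sign pattern
spiky = np.array([1.0, 0.0, 0.0, 0.0])          # unit vector concentrated on one entry
placeholder_kappa = {3: 0.1}                    # hypothetical value, for illustration only
print(piecewise_distortion(flat, 1, placeholder_kappa),
      piecewise_distortion(spiky, 1, placeholder_kappa))
```

The two one-bit values illustrate the angular term: a vector whose entries all have equal magnitude is represented exactly by its sign pattern (distortion 0), whereas a vector concentrated on a single entry of a 4-dimensional space incurs $\sin^{2}\theta = 0.75$.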

#### Component\-wise ILP formulation\.

We uniformly allocate the bit budget across MoE layers and solve the bit-allocation problem independently for each projection type. For each component $(e,h,k)$, let $y_{e,h,k,b}\in\{0,1\}$ indicate whether bit-width $b$ is assigned to this component.

For each projection type $h$, let $\boldsymbol{Y}^{(h)}$ collect $y_{e,h,k,b}$, let $\boldsymbol{C}^{(h)}$ collect $C_{e,h,k,b}\coloneqq L_{e,h,k}(b)=\alpha_{e,h,k}^{2}\beta_{e,h,k}^{\gamma}\mathcal{E}_{e,h,k}(b)$, and let $\boldsymbol{\Omega}^{(h)}$ collect the normalized bit costs $\Omega_{e,h,k,b}\coloneqq b$. Since $B_{h}$ denotes the normalized component budget for projection type $h$, the projection-wise ILP can be written as

$$
\begin{aligned}
\min_{\boldsymbol{Y}^{(h)}}\quad & \left\langle\boldsymbol{Y}^{(h)},\boldsymbol{C}^{(h)}\right\rangle \\
\mathrm{s.t.}\quad & \left\langle\boldsymbol{Y}^{(h)},\boldsymbol{\Omega}^{(h)}\right\rangle\leq B_{h}, \\
& \sum_{b\in\mathcal{B}}y_{e,h,k,b}=1,\qquad\forall\,e\in[E],\ k\in[n_{h}], \\
& y_{e,h,k,b}\in\{0,1\},\qquad\forall\,e\in[E],\ k\in[n_{h}],\ b\in\mathcal{B}.
\end{aligned}
\tag{26}
$$

Here $\langle\cdot,\cdot\rangle$ denotes the tensor inner product over $(e,k,b)$ for projection type $h$. Eq. ([26](https://arxiv.org/html/2606.00079#S3.E26)) is solved independently for each projection type to obtain component-level mixed-precision assignments under the piecewise reconstruction-error surrogate. Appendix [B](https://arxiv.org/html/2606.00079#A2) provides the full ILP derivation.
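Because every component picks exactly one bit-width and the only coupling constraint is the total bit budget, Eq. (26) has the structure of a multiple-choice knapsack problem. The sketch below solves a toy instance with dynamic programming instead of Gurobi, purely to illustrate the formulation. The function name `allocate_bits`, the integer budget expressed in raw bits (the paper uses a normalized budget $B_{h}$), the per-component weights standing in for $\alpha^{2}\beta^{\gamma}$, and the toy distortion table are all assumptions made for this example.

```python
def allocate_bits(costs, budget, bit_choices):
    """Minimize the summed cost subject to sum(bits) <= budget.

    costs[i][b] is the cost C_i(b) of giving component i the bit-width b.
    Exactly one bit-width is chosen per component (multiple-choice knapsack).
    Returns (total_cost, [bit-width for each component]).
    """
    inf = float("inf")
    # best[j]: minimal cost of the components seen so far using exactly j bits.
    best = [0.0] + [inf] * budget
    picks = []  # picks[i][j]: bit-width chosen for component i at total j
    for comp in costs:
        new_best = [inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == inf:
                continue
            for b in bit_choices:
                total = used + b
                if total > budget:
                    continue
                cand = best[used] + comp[b]
                if cand < new_best[total]:
                    new_best[total] = cand
                    pick[total] = b
        best = new_best
        picks.append(pick)

    end = min(range(budget + 1), key=lambda j: best[j])
    if best[end] == inf:
        raise ValueError("no feasible allocation for this budget")

    # Walk backwards through the stored choices to recover the assignment.
    bits, j = [], end
    for pick in reversed(picks):
        bits.append(pick[j])
        j -= pick[j]
    bits.reverse()
    return best[end], bits


# Toy instance (all numbers hypothetical).
BIT_CHOICES = (0, 1, 2, 3, 4)
TOY_DISTORTION = {0: 1.0, 1: 0.5, 2: 0.3, 3: 0.1, 4: 0.03}
WEIGHTS = [10.0, 1.0, 0.1]  # stand-ins for alpha^2 * beta^gamma
costs = [{b: w * TOY_DISTORTION[b] for b in BIT_CHOICES} for w in WEIGHTS]

total, bits = allocate_bits(costs, budget=6, bit_choices=BIT_CHOICES)
print(bits, round(total, 2))
```

On this toy instance, a uniform 2-bit assignment would cost 3.33, whereas the optimal allocation under the same 6-bit budget is `[4, 2, 0]` with cost 0.70: bits move to the highest-weight component and the least important one is evicted.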

Table 1: Evaluation results for DeepSeek-V2-Lite, Qwen3-30B-A3B-Base, and Qwen3-Next-80B-A3B-Instruct at 2-bit and 3-bit settings. PPL ↓ is perplexity; the task columns and Avg. report accuracy ↑ (%).

| Model | Method | Bits | PPL ↓ | HellaS. | MathQA | MMLU | Openb. | WinoG. | GSM8K | HumanE. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | FP16 | 16 | 8.69 | 77.70 | 39.03 | 55.60 | 44.40 | 70.88 | 39.12 | 26.83 | 50.51 |
| | HQQ | 2 | 14.21 | 67.73 | 29.51 | 43.41 | 38.20 | 63.54 | 12.43 | 11.59 | 38.06 |
| | GPTQ | 2 | 17.78 | 61.44 | 25.09 | 27.72 | 35.80 | 59.98 | 2.96 | 0.00 | 30.43 |
| | MiLo | 2 | 13.87 | 69.42 | 30.82 | 41.80 | 37.20 | 65.59 | 11.37 | 8.54 | 37.82 |
| | MoEQuant | 2 | 11.83 | 66.25 | 32.19 | 46.29 | 39.60 | 69.85 | 15.85 | 12.80 | 40.40 |
| | BitsMoE | 2 | 12.20 | 69.96 | 33.37 | 46.41 | 39.20 | 68.82 | 15.47 | 14.02 | 41.04 |
| | HQQ | 3 | 9.25 | 76.83 | 36.45 | 53.16 | 44.40 | 70.88 | 32.15 | 21.34 | 47.89 |
| | GPTQ | 3 | 9.56 | 75.88 | 37.29 | 50.92 | 44.20 | 69.30 | 30.40 | 26.22 | 47.74 |
| | MiLo | 3 | 9.18 | 76.40 | 37.29 | 53.58 | 43.00 | 70.56 | 34.57 | 20.12 | 47.93 |
| | MoEQuant | 3 | 9.53 | 76.15 | 38.39 | 54.64 | 43.60 | 70.17 | 33.66 | 25.00 | 48.80 |
| | BitsMoE | 3 | 9.38 | 75.06 | 38.16 | 53.59 | 43.20 | 70.88 | 30.33 | 27.44 | 48.38 |
| Qwen3-30B-A3B-Base | FP16 | 16 | 10.24 | 81.35 | 60.03 | 78.77 | 45.00 | 72.85 | 83.47 | 56.10 | 68.22 |
| | HQQ | 2 | 23.65 | 63.05 | 33.17 | 48.82 | 36.40 | 60.85 | 22.67 | 11.59 | 39.51 |
| | GPTQ | 2 | 15.63 | 70.16 | 24.89 | 39.17 | 39.40 | 60.62 | 4.32 | 0.00 | 34.08 |
| | MiLo | 2 | 21.53 | 62.82 | 31.83 | 44.51 | 35.40 | 59.98 | 19.64 | 7.93 | 37.44 |
| | MoEQuant | 2 | 15.34 | 66.44 | 45.09 | 70.02 | 40.00 | 67.72 | 49.36 | 26.83 | 52.21 |
| | BitsMoE | 2 | 16.07 | 74.09 | 52.70 | 70.87 | 43.40 | 72.93 | 75.51 | 43.90 | 61.91 |
| | HQQ | 3 | 11.45 | 78.55 | 49.01 | 75.37 | 44.40 | 71.67 | 79.53 | 43.29 | 63.12 |
| | GPTQ | 3 | 10.90 | 79.92 | 54.07 | 75.83 | 43.40 | 72.14 | 79.45 | 38.41 | 63.32 |
| | MiLo | 3 | 11.11 | 79.81 | 57.05 | 76.45 | 41.80 | 70.64 | 82.64 | 56.10 | 66.36 |
| | MoEQuant | 3 | 10.40 | 79.55 | 57.62 | 79.97 | 43.80 | 71.35 | 80.82 | 53.05 | 66.59 |
| | BitsMoE | 3 | 11.82 | 79.24 | 60.17 | 76.98 | 44.80 | 74.19 | 85.37 | 50.61 | 67.34 |
| Qwen3-Next-80B-A3B-Instruct | FP16 | 16 | 10.31 | 82.72 | 63.85 | 84.53 | 44.20 | 76.80 | 77.18 | 95.73 | 75.00 |
| | HQQ | 2 | 12.13 | 78.73 | 50.95 | 79.68 | 43.80 | 70.40 | 66.49 | 91.46 | 68.79 |
| | GPTQ | 2 | 15.37 | 70.24 | 27.47 | 54.63 | 38.60 | 65.59 | 18.35 | 1.22 | 39.44 |
| | MiLo | 2 | 12.06 | 78.69 | 49.75 | 79.76 | 43.80 | 71.74 | 71.49 | 91.46 | 69.53 |
| | BitsMoE | 2 | 12.76 | 78.02 | 60.67 | 81.47 | 44.80 | 75.85 | 71.49 | 92.68 | 72.14 |
| | HQQ | 3 | 10.55 | 82.11 | 61.17 | 83.81 | 45.60 | 77.03 | 76.57 | 92.68 | 74.14 |
| | GPTQ | 3 | 10.83 | 81.42 | 59.40 | 82.52 | 44.20 | 76.09 | 76.65 | 92.07 | 73.19 |
| | MiLo | 3 | 10.51 | 82.17 | 61.34 | 83.49 | 45.20 | 76.09 | 76.19 | 92.68 | 73.88 |
| | BitsMoE | 3 | 10.76 | 80.93 | 62.78 | 83.94 | 44.80 | 76.40 | 75.97 | 94.51 | 74.19 |

## 4 Experiments

We evaluate BitsMoE under a unified post-training compression setting in which compression is applied exclusively to MoE layers, while all attention layers are retained in FP16. This configuration is shared by all baselines to ensure a fair comparison. All evaluation experiments are conducted on NVIDIA A100-PCIe-80GB GPUs, and the ILP problems are solved using the Gurobi Optimizer \[[17](https://arxiv.org/html/2606.00079#bib.bib33)\].

### 4.1 Experimental Setup

#### Models and Datasets.

We conduct experiments on DeepSeek-V2-Lite \[[27](https://arxiv.org/html/2606.00079#bib.bib21)\], Qwen3-30B-A3B-Base \[[45](https://arxiv.org/html/2606.00079#bib.bib6)\], Qwen3-Next-80B-A3B-Instruct \[[45](https://arxiv.org/html/2606.00079#bib.bib6),[46](https://arxiv.org/html/2606.00079#bib.bib7)\], Qwen1.5-MoE-A2.7B \[[4](https://arxiv.org/html/2606.00079#bib.bib8),[38](https://arxiv.org/html/2606.00079#bib.bib9)\], and Mixtral-8x7B-v0.1 \[[23](https://arxiv.org/html/2606.00079#bib.bib5)\]. Our evaluation covers both base and instruction-tuned models to demonstrate the effectiveness of our method. In addition to perplexity on C4 \[[32](https://arxiv.org/html/2606.00079#bib.bib25)\], we evaluate the proposed BitsMoE on a diverse suite of zero-shot tasks, including HellaSwag \[[50](https://arxiv.org/html/2606.00079#bib.bib26)\], MathQA \[[1](https://arxiv.org/html/2606.00079#bib.bib27)\], MMLU \[[19](https://arxiv.org/html/2606.00079#bib.bib28)\], OpenBookQA \[[30](https://arxiv.org/html/2606.00079#bib.bib29)\], and WinoGrande \[[33](https://arxiv.org/html/2606.00079#bib.bib30)\]. Furthermore, we evaluate BitsMoE using HumanEval \[[6](https://arxiv.org/html/2606.00079#bib.bib34)\] and GSM8K \[[10](https://arxiv.org/html/2606.00079#bib.bib35)\]. HumanEval evaluates code generation capabilities, while GSM8K assesses multi-step mathematical reasoning skills. We evaluate these seven tasks using the open-source tool lm-evaluation-harness (version 0.4.9.1) \[[37](https://arxiv.org/html/2606.00079#bib.bib31)\].

#### Baselines.

Our baselines include the representative LLM post-training quantization (PTQ) methods HQQ \[[3](https://arxiv.org/html/2606.00079#bib.bib4)\] and GPTQ \[[14](https://arxiv.org/html/2606.00079#bib.bib3)\], and the MoE-specific comparators MiLo \[[21](https://arxiv.org/html/2606.00079#bib.bib19)\] and MoEQuant \[[8](https://arxiv.org/html/2606.00079#bib.bib16)\]. All methods quantize only MoE expert weights with group size 128. GPTQ and BitsMoE are calibrated on 1024 C4 samples, MoEQuant is calibrated on EBSS, and HQQ and MiLo are calibration-free. MoEQuant is excluded for Qwen3-Next-80B-A3B-Instruct because its released implementation does not support the linear-attention/FlashLinearAttention forward path required to quantize this model.

### 4.2 Main Results

Table 2: Ultra-low-bit quantization results on Qwen3-30B-A3B-Base. GSM8K and Avg. report accuracy ↑ (%).

| Method | Bit | GSM8K | Avg. |
|---|---|---|---|
| HQQ | 2 | 22.67 | 39.51 |
| GPTQ | 2 | 4.32 | 34.08 |
| MoEQuant | 2 | 49.36 | 52.21 |
| MiLo | 2 | 19.64 | 37.44 |
| BitsMoE | 2.0 | 75.51 | 61.91 |
| | 1.8 | 69.14 | 56.23 |
| | 1.6 | 63.53 | 53.45 |
| | 1.4 | 52.62 | 47.46 |

As shown in Table [1](https://arxiv.org/html/2606.00079#S3.T1) and Figure [2](https://arxiv.org/html/2606.00079#S4.F2), BitsMoE consistently preserves downstream accuracy under 2-bit quantization across different MoE backbones. The gains are most pronounced on GSM8K and HumanEval, which indicates that BitsMoE better preserves reasoning and coding abilities under the ultra-low-bit regime. Although its PPL is not always the lowest, it remains comparable to strong baselines. These results indicate that fine-grained bit allocation over spectral components can better protect important weight directions, thereby reducing downstream degradation in ultra-low-bit MoE LLM quantization.

Table [2](https://arxiv.org/html/2606.00079#S4.T2) reports sub-2-bit results for BitsMoE on Qwen3-30B-A3B-Base, with average accuracy computed across seven tasks. At 1.4 bits, BitsMoE still reaches 52.62% on GSM8K, higher than every 2-bit baseline in the table (the strongest, MoEQuant, reaches 49.36%), which shows that the proposed allocation strategy remains effective under tighter bit budgets.
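The averages and the headline gap can be recomputed directly from the per-task numbers. The short sketch below does this for the 2-bit Qwen3-30B-A3B-Base rows of Table 1; the variable names are illustrative, and the only inputs are the values printed in the table. It reproduces the reported averages and the 27.83-point gain over GPTQ quoted in the abstract.

```python
# Per-task accuracies (%) for 2-bit Qwen3-30B-A3B-Base, copied from Table 1.
TASKS = ("HellaSwag", "MathQA", "MMLU", "OpenBookQA",
         "WinoGrande", "GSM8K", "HumanEval")
ACC = {
    "GPTQ":    (70.16, 24.89, 39.17, 39.40, 60.62, 4.32, 0.00),
    "BitsMoE": (74.09, 52.70, 70.87, 43.40, 72.93, 75.51, 43.90),
}


def average(values):
    """Unweighted mean over the seven tasks."""
    return sum(values) / len(values)


# Round to two decimals, matching the precision used in the table.
avg = {method: round(average(scores), 2) for method, scores in ACC.items()}
gap = round(avg["BitsMoE"] - avg["GPTQ"], 2)

print(avg)   # averages per method
print(gap)   # improvement in percentage points
```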

![Refer to caption](https://arxiv.org/html/2606.00079v1/x2.png) (a) Qwen1.5-MoE-A2.7B
![Refer to caption](https://arxiv.org/html/2606.00079v1/x3.png) (b) Mixtral-8×7B

Figure 2: Zero-shot accuracy (%) on seven benchmarks for (a) Qwen1.5-MoE-A2.7B and (b) Mixtral-8×7B under 2-bit and 3-bit quantization. Compared with GPTQ and MoEQuant, BitsMoE generally preserves stronger accuracy across tasks, especially in the 2-bit regime.

### 4.3 Ablation Study

We evaluate four ablation settings under the same effective 2\-bit budget to isolate the effects of basis sharing, FP16 shared\-basis retention, and adaptive bit allocation:

1. NS/UniBit: independent SVD without basis sharing. Each expert is decomposed separately. Only the top-$N$ spectral components are retained and uniformly quantized to 2 bits, while the remaining components are discarded.
2. QS/UniBit: shared-basis SVD with a quantized shared basis. The shared basis is uniformly quantized to 2 bits. Only the expert-specific components selected according to spectral energy are retained and uniformly quantized to 2 bits, while the remaining expert-specific components are discarded.
3. FS/UniBit: shared-basis SVD with an FP16 shared basis. The shared basis is kept in FP16. Only the expert-specific components selected according to spectral energy are retained and uniformly quantized to 2 bits, while the remaining expert-specific components are discarded.
4. FS/AdaBit: the full BitsMoE setting. The shared basis is kept in FP16, and adaptive bit-widths are assigned to expert-specific spectral components by the activation-aware ILP under the same equivalent 2-bit budget.

Table 3: Ablation summary under 2-bit quantization.

| Setting | DSV2-16B | QW3-30B | QW3-80B-I |
|---|---|---|---|
| NS/UniBit | 29.72 | 36.83 | 20.82 |
| QS/UniBit | 21.22 | 21.46 | 21.31 |
| FS/UniBit | 30.56 | 43.92 | 67.69 |
| FS/AdaBit | 41.04 | 61.91 | 72.14 |

Model abbreviations: QW1.5-14B = Qwen1.5-MoE-A2.7B, DSV2-16B = DeepSeek-V2-Lite, QW3-30B = Qwen3-30B-A3B-Base, MI-8x7B = Mixtral-8×7B-v0.1, and QW3-80B-I = Qwen3-Next-80B-A3B-Instruct.

*Note.* NS/QS/FS denote no shared basis, quantized shared basis, and FP16 shared basis; UniBit/AdaBit denote uniform/adaptive bit allocation.

Table 3 summarizes the four ablation settings and their average accuracy in the 2-bit setting. The comparison shows that a shared basis with quantization is insufficient: QS/UniBit performs poorly across all models (about 21% average accuracy on each), which indicates that the shared basis encodes common cross-expert information and should be retained without quantization. Under the same bit budget, preserving the shared basis in FP16 substantially improves average accuracy. FS/AdaBit outperforms FS/UniBit on all three models, which demonstrates the effectiveness of spectrum-wise bit allocation under ultra-low-bit quantization. Full results are reported in Appendix [C](https://arxiv.org/html/2606.00079#A3).

### 4.4 Efficiency Analysis

![Refer to caption](https://arxiv.org/html/2606.00079v1/x4.png) (a) 2-bit
![Refer to caption](https://arxiv.org/html/2606.00079v1/x5.png) (b) 3-bit

Figure 3: Time breakdown of the post-training quantization pipeline under 2-bit and 3-bit settings.

#### ILP Breakdown and Quantization Overhead.

Figure [3](https://arxiv.org/html/2606.00079#S4.F3) reports the end-to-end offline quantization overhead of BitsMoE. On NVIDIA A100-PCIe-80GB GPUs, BitsMoE requires substantially less offline quantization time than GPTQ. In both 2-bit and 3-bit settings, most BitsMoE overhead is due to calibration-statistics collection, while SVD decomposition and ILP solving contribute only marginally. Thus, the proposed spectrum-wise allocation introduces no significant optimization bottleneck. The speedup over GPTQ stems from a compact per-layer ILP formulation, which avoids the Hessian-based error compensation required by GPTQ’s sequential expert quantization.

#### Inference Efficiency\.

Table [4](https://arxiv.org/html/2606.00079#S4.T4) summarizes the online inference efficiency and memory footprint of BitsMoE. Since optimized GPTQ kernels such as Marlin \[[15](https://arxiv.org/html/2606.00079#bib.bib47)\] and ExLlamaV2 \[[40](https://arxiv.org/html/2606.00079#bib.bib48)\] are not applicable to the 2-bit GPTQ setting, we use the available GPTQ Triton backend \[[14](https://arxiv.org/html/2606.00079#bib.bib3),[39](https://arxiv.org/html/2606.00079#bib.bib49)\] for evaluation. On NVIDIA A6000 GPUs, BitsMoE improves online inference efficiency by increasing decoding throughput, reducing TTFT, and lowering the MoE-layer memory footprint under 2-bit quantization. Inference is measured with batch size 1, prefill length 256, and generation length 128. Although BitsMoE introduces a shared basis, its projection is computed once per MoE layer and reused across routed experts. During inference, packed expert-specific spectral factors are unpacked and dequantized inside GEMM kernels without reconstructing full weights, while experts are executed in parallel within each MoE layer.

Table 4: Inference efficiency and GPU memory footprint of MoE LLMs. Decode speed ↑ is measured in tokens/sec, and TTFT ↓ denotes time to first token (sec). GPU memory is in GB. Speedup (in parentheses) is computed relative to FP16.

| Model | Decode FP16 | Decode GPTQ | Decode BitsMoE | TTFT FP16 | TTFT GPTQ | TTFT BitsMoE | Mem. FP16 Total | Mem. FP16 Attn | Mem. FP16 MoE | Mem. BitsMoE MoE | MoE Saving |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DSV2-16B | 10.39 | 7.43 (0.71×) | 12.46 (1.20×) | 0.47 | 1.27 (0.37×) | 0.64 (0.73×) | 29.51 | 0.69 | 27.65 | 5.08 | 5.44× |
| QW3-30B | 3.07 | 3.25 (1.06×) | 5.71 (1.86×) | 2.35 | 2.94 (0.80×) | 1.51 (1.55×) | 56.95 | 1.69 | 54.00 | 8.58 | 6.29× |
| QW3-80B-I | 1.65 | 2.59 (1.57×) | 5.01 (3.04×) | 8.35 | 7.32 (1.14×) | 1.06 (7.90×) | 148.69 | 0.61 | 144.28 | 21.98 | 6.56× |
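The derived columns of Table 4 follow from simple ratios. As a hedged check using only values printed in the table (the dictionary layout and key names are illustrative), the sketch below recomputes the MoE memory saving as FP16 MoE memory divided by BitsMoE MoE memory, and the BitsMoE decode speedup relative to GPTQ. Ratios recomputed from the rounded table entries can differ in the last digit from the printed speedups, which were presumably computed from unrounded measurements.

```python
# Values copied from Table 4 (memory in GB, decode speed in tokens/sec).
ROWS = {
    "DSV2-16B":  {"fp16_moe_gb": 27.65,  "bitsmoe_moe_gb": 5.08,
                  "gptq_tps": 7.43, "bitsmoe_tps": 12.46},
    "QW3-30B":   {"fp16_moe_gb": 54.00,  "bitsmoe_moe_gb": 8.58,
                  "gptq_tps": 3.25, "bitsmoe_tps": 5.71},
    "QW3-80B-I": {"fp16_moe_gb": 144.28, "bitsmoe_moe_gb": 21.98,
                  "gptq_tps": 2.59, "bitsmoe_tps": 5.01},
}

report = {}
for model, row in ROWS.items():
    report[model] = {
        # MoE-layer memory saving: FP16 footprint over BitsMoE footprint.
        "moe_saving": round(row["fp16_moe_gb"] / row["bitsmoe_moe_gb"], 2),
        # Decode throughput of BitsMoE relative to the GPTQ Triton backend.
        "decode_vs_gptq": round(row["bitsmoe_tps"] / row["gptq_tps"], 2),
    }

for model, stats in report.items():
    print(model, stats)
```

For Qwen3-30B-A3B-Base, the decode ratio against GPTQ comes out at 1.76, which matches the decoding speedup over GPTQ stated in the abstract.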

## 5 Limitations

BitsMoE has several limitations. First, its spectrum-wise ILP optimizes a tractable local activation-aware reconstruction surrogate rather than the fully coupled reconstruction objective. Although the diagonal-error approximation and empirical down\_proj heuristic make allocation linear and efficient, higher-order interactions among spectral components are not explicitly modeled. Second, the target bit budget is assigned uniformly across layers and projection types. This simple design does not exploit heterogeneous sensitivity across layers and projections, which suggests adaptive high-level budget allocation as future work. Third, BitsMoE compresses only MoE expert weights, whereas attention layers, activations, and the KV cache remain unquantized. These components can be compressed by general-purpose quantization or KV-cache compression methods that are orthogonal to BitsMoE.

## 6 Conclusion

We present BitsMoE, a shared-basis mixed-precision quantization framework for ultra-low-bit MoE LLM compression. BitsMoE decomposes each MoE layer into a shared spectral basis and expert-specific spectral factors, retaining the shared basis without quantization while assigning mixed bit-widths to fine-grained expert-specific spectral components. By formulating spectrum-wise bit allocation as an activation-aware reconstruction surrogate and solving the resulting ILP under a fixed bit budget, BitsMoE allocates limited bits according to spectral energy, activation importance, and bit-dependent distortion. Experiments across multiple MoE backbones show that this design substantially reduces accuracy degradation in ultra-low-bit regimes, especially under 2-bit quantization, while also reducing MoE-layer memory footprint and improving inference efficiency. These results suggest that shared spectral structure and activation-aware bit allocation provide a useful direction for future research on fine-grained, structure-aware compression of sparse LLMs.

## Impact Statement

BitsMoE aims to reduce the memory footprint and inference cost of MoE large language models by compressing expert weights under ultra-low-bit budgets. Its positive impacts include lowering hardware barriers, reducing deployment costs, and improving the accessibility and energy efficiency of large-scale MoE inference. At the same time, more efficient MoE deployment may also lower the cost of using powerful language models for harmful applications, such as misinformation generation, automated spam, or privacy-invasive applications. Since BitsMoE does not modify the safety alignment or usage policies of the underlying models, compressed models may inherit the risks and limitations of the original models. We therefore encourage users to follow the licenses, usage policies, and safety guidelines of the original models and to evaluate compressed models under task-specific safety and reliability requirements before deployment.

## Acknowledgments and Disclosure of Funding

This work was partially supported by the Strategic Priority Research Program of the CAS under Grant XDB0660000, and in part by the National Natural Science Foundation of China under Grant 92473114\.

This work was partially supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 2 \(MOE\-T2EP20224\-0006\)\.

## References

- \[1\] (2019) Mathqa: towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pp. 2357–2367. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[2\] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024) QuaRot: outlier-free 4-bit inference in rotated LLMs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=dfqsW38v1X). Cited by: [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1).
- \[3\] H. Badri and A. Shaji (2023-11) Half-quadratic quantization of large machine learning models. External Links: [Link](https://dropbox.github.io/hqq_blog/). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.12.1.1.1), [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px2.p1.1).
- \[4\] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. (2023) Qwen technical report. arXiv preprint arXiv:2309.16609. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[5\] W. Cai, J. Jiang, F. Wang, J. Tang, S. Kim, and J. Huang (2025) A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering. Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1).
- \[6\] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[7\] P. Chen, H. Yu, I. S. Dhillon, and C. Hsieh (2021) DRONE: data-aware low-rank compression for large NLP models. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.). External Links: [Link](https://openreview.net/forum?id=sthiz9zeXGG). Cited by: [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[8\] Z. Chen, X. Hu, D. Yang, Z. Xu, XUCHEN, Z. Yuan, S. Zhou, and Jiangyong Yu (2025) MoEQuant: enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=0epuNvt5Dj). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.13.1.1.1), [§B.2](https://arxiv.org/html/2606.00079#A2.SS2.p2.6), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p2.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px2.p1.1).
- \[9\] M. N. R. Chowdhury, K. E. Maghraoui, H. Tsai, N. Wang, G. W. Burr, L. Liu, and M. Wang (2026) Efficient quantization of mixture-of-experts with theoretical generalization guarantees. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=yiMlVBAoQi). Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p2.1).
- \[10\] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[11\] DeepSeek-AI (2024) DeepSeek-v3 technical report. External Links: 2412.19437, [Link](https://arxiv.org/abs/2412.19437). Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1).
- \[12\] H. Duanmu, X. Li, Z. Yuan, S. Zheng, J. Duan, X. Zhang, and D. Lin (2025) MxMoE: mixed-precision quantization for moe with accuracy and performance co-design. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=pXoZLGMNDm). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.14.1.1.1), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§1](https://arxiv.org/html/2606.00079#S1.p5.1), [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p2.1).
- \[13\] W. Fedus, B. Zoph, and N. Shazeer (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), pp. 1–39. Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[14\] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022) Gptq: accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323. Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.11.1.1.1), [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px2.p1.1), [§4.4](https://arxiv.org/html/2606.00079#S4.SS4.SSS0.Px2.p1.1).
- \[15\] E. Frantar, R. L. Castro, J. Chen, T. Hoefler, and D. Alistarh (2025) Marlin: mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 239–251. Cited by: [§4.4](https://arxiv.org/html/2606.00079#S4.SS4.SSS0.Px2.p1.1).
- \[16\] H. Gu, W. Li, L. Li, Z. Qiyuan, M. G. Lee, S. Sun, W. Xue, and Y. Guo (2025) Delta decompression for moe-based LLMs compression. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=ziezViPoN1). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.9.1.1.1), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§1](https://arxiv.org/html/2606.00079#S1.p5.1), [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[17\] Gurobi Optimization, LLC (2024) Gurobi Optimizer Reference Manual. External Links: [Link](https://www.gurobi.com/). Cited by: [§4](https://arxiv.org/html/2606.00079#S4.p1.1).
- \[18\] S. He, L. Ding, D. Dong, B. Liu, F. Yu, and D. Tao (2023) Pad-net: an efficient framework for dynamic networks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14354–14366. Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[19\] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. URL https://arxiv.org/abs, pp. 20. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[20\] Y. Hsu, T. Hua, S. Chang, Q. Lou, Y. Shen, and H. Jin (2022) Language model compression with weighted low-rank factorization. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=uPv9Y3gmAI5). Cited by: [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[21\] B. Huang, Y. Yuan, Z. Shao, and M. Zhang (2025) MiLo: efficient quantized moe inference with mixture of low-rank compensators. In Eighth Conference on Machine Learning and Systems. External Links: [Link](https://openreview.net/forum?id=NXVXiJmhe1). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.15.1.1.1), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p2.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px2.p1.1).
- \[22\] W. Huang, Y. Liao, J. Liu, R. He, H. Tan, S. Zhang, H. Li, S. Liu, and X. Qi (2025) Mixture compressor for mixture-of-experts LLMs gains more. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=hheFYjOsWO). Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p5.1).
- \[23\] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. (2024) Mixtral of experts. arXiv preprint arXiv:2401.04088. Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1), [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[24\] E. Koehler, E. Brown, and S. J. Haneuse (2009) On the assessment of monte carlo error in simulation-based statistical analyses. The American Statistician 63(2), pp. 155–162. Cited by: [Appendix D](https://arxiv.org/html/2606.00079#A4.SS0.SSS0.Px3.p1.6).
- \[25\] W. Li, L. Li, H. Gu, Y. Huang, M. G. Lee, S. Sun, W. Xue, and Y. Guo (2025) MoE-SVD: structured mixture-of-experts LLMs compression via singular value decomposition. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=acJ3vdFljk). Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.10.1.1.1), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§1](https://arxiv.org/html/2606.00079#S1.p5.1), [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[26\] J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024) Awq: activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems 6, pp. 87–100. Cited by: [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1).
- \[27\] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[28\] J. Liu, P. Tang, W. Wang, Y. Ren, X. Hou, P. Heng, M. Guo, and C. Li (2024) A survey on inference optimization techniques for mixture of experts models. CoRR. Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1).
- \[29\] X. Lu, Q. Liu, Y. Xu, A. Zhou, S. Huang, B. Zhang, J. Yan, and H. Li (2024) Not all experts are equal: efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800. Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[30\] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[31\] N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, E. P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi (2025) OLMoe: open mixture-of-experts language models. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=xXTkbTBmqq). Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[32\] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), pp. 1–67. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[33\] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) Winogrande: an adversarial winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106. Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[34\] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024) OmniQuant: omnidirectionally calibrated quantization for large language models. In ICLR. External Links: [Link](https://openreview.net/forum?id=8Wuvhh0LYW). Cited by: [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1).
- \[35\] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[36\] Z. Su, Q. Li, H. Zhang, W. Ye, Q. Xue, Y. Qian, Y. Xie, N. Wong, and K. Yuan (2025) Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279. Cited by: [§B.5](https://arxiv.org/html/2606.00079#A2.SS5.p5.1).
- \[37\] EleutherAI/lm-evaluation-harness: v0.4.9.1. External Links: [Document](https://dx.doi.org/10.5281/zenodo.16737642), [Link](https://doi.org/10.5281/zenodo.16737642). Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[38\] Q. Team (2024-02) Qwen1.5-moe: matching 7b model performance with 1/3 activated parameters. External Links: [Link](https://qwenlm.github.io/blog/qwen-moe/). Cited by: [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[39\] P. Tillet, H. Kung, and D. Cox (2019) Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 10–19. Cited by: [§4.4](https://arxiv.org/html/2606.00079#S4.SS4.SSS0.Px2.p1.1).
- \[40\] turboderp-org (2023) ExLlamaV2. Note: [https://github.com/turboderp-org/exllamav2](https://github.com/turboderp-org/exllamav2). GitHub repository. Accessed: 2026-05-05. Cited by: [§4.4](https://arxiv.org/html/2606.00079#S4.SS4.SSS0.Px2.p1.1).
- \[41\] X. Wang, Y. Zheng, Z. Wan, and M. Zhang (2025) SVD-LLM: truncation-aware singular value decomposition for large language model compression. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=LNYIUouhdt). Cited by: [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[42\] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) SmoothQuant: accurate and efficient post-training quantization for large language models. In ICML, pp. 38087–38099. External Links: [Link](https://proceedings.mlr.press/v202/xiao23c.html). Cited by: [§2.3](https://arxiv.org/html/2606.00079#S2.SS3.p1.1).
- \[43\] Z. Xu, Z. Zhao, X. Hu, Z. Chen, and D. Yang (2026) KBVQ-moe: KLT-guided SVD with bias-corrected vector quantization for moe large language models. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=veFs5UfYq9). Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p2.1).
- \[44\] F. Xue, Z. Zheng, Y. Fu, J. Ni, Z. Zheng, W. Zhou, and Y. You (2024) OpenMoE: an early effort on open mixture-of-experts language models. In Forty-first International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=1YDeZU8Lt5). Cited by: [§2.1](https://arxiv.org/html/2606.00079#S2.SS1.p1.1).
- \[45\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[46\] A. Yang, B. Yu, C. Li, D. Liu, F. Huang, H. Huang, J. Jiang, J. Tu, J. Zhang, J. Zhou, J. Lin, K. Dang, K. Yang, L. Yu, M. Li, M. Sun, Q. Zhu, R. Men, T. He, W. Xu, W. Yin, W. Yu, X. Qiu, X. Ren, X. Yang, Y. Li, Z. Xu, and Z. Zhang (2025) Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383. Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p1.1), [§4.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1).
- \[47\] C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, Y. Duan, W. Jia, M. Yin, Y. Cheng, and B. Yuan (2024) MoE-i-squared: compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. arXiv preprint arXiv:2411.01016. Cited by: [Table 5](https://arxiv.org/html/2606.00079#A1.T5.7.7.8.1.1.1), [§1](https://arxiv.org/html/2606.00079#S1.p2.1), [§1](https://arxiv.org/html/2606.00079#S1.p5.1), [§2.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2).
- \[48\] X. Yin, X. Liu, T. Xia, B. Bao, V. Thangarasa, V. Manohararajah, E. Sather, and S. Q. Zhang (2026) CodeQuant: unified clustering and quantization for enhanced outlier smoothing in low-precision mixture-of-experts. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=ATpchFiBQi). Cited by: [§1](https://arxiv.org/html/2606.00079#S1.p2.1).
- \[49\]Z\. Yuan, Y\. Shang, Y\. Song, D\. Yang, Q\. Wu, Y\. Yan, and G\. Sun\(2025\)ASVD: activation\-aware singular value decomposition for compressing large language models\.External Links:[Link](https://openreview.net/forum?id=HyPofygOCT)Cited by:[§2\.2](https://arxiv.org/html/2606.00079#S2.SS2.p1.2)\.
- \[50\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\)Hellaswag: can a machine really finish your sentence?\.arXiv preprint arXiv:1905\.07830\.Cited by:[§4\.1](https://arxiv.org/html/2606.00079#S4.SS1.SSS0.Px1.p1.1)\.

## Appendix A Positioning BitsMoE Among MoE Compression Methods

Table 5: Positioning of BitsMoE relative to representative MoE compression paradigms. ✓ and ✗ indicate whether each feature is a primary design component under the corresponding column definition. The four middle columns are the design features.

| Representative methods | Core technique | Shared basis | Bit alloc. | Act. aware | MoE prior | Compression / allocation unit |
|---|---|---|---|---|---|---|
| MoE-I$^2$ \[[47](https://arxiv.org/html/2606.00079#bib.bib17)\] | Inter-expert pruning with intra-expert low-rank decomposition | ✗ | ✗ | ✗ | ✗ | Expert / intra-expert rank |
| D$^2$-MoE \[[16](https://arxiv.org/html/2606.00079#bib.bib13)\] | Fisher-weighted shared-base and expert-specific delta compression | ✗ | ✗ | ✗ | ✗ | Shared base / expert-specific delta rank |
| MoE-SVD \[[25](https://arxiv.org/html/2606.00079#bib.bib12)\] | Low-rank decomposition with factor sharing | ✓ | ✗ | ✓ | ✓ | Layer / rank / low-rank factor |
| GPTQ \[[14](https://arxiv.org/html/2606.00079#bib.bib3)\] | Hessian-based error-compensated PTQ | ✗ | ✗ | ✓ | ✗ | Original weight block / group |
| HQQ \[[3](https://arxiv.org/html/2606.00079#bib.bib4)\] | Calibration-free half-quadratic quantization | ✗ | ✗ | ✗ | ✗ | Original weight group |
| MoEQuant \[[8](https://arxiv.org/html/2606.00079#bib.bib16)\] | MoE-aware scalar quantization | ✗ | ✗ | ✓ | ✓ | Expert-wise weight group |
| MxMoE \[[12](https://arxiv.org/html/2606.00079#bib.bib14)\] | Mixed precision with kernel co-design | ✗ | ✓ | ✓ | ✓ | Linear block, e.g., gate_proj, up_proj, down_proj |
| MiLo \[[21](https://arxiv.org/html/2606.00079#bib.bib19)\] | Low-bit quantization with low-rank compensation | ✗ | ✗ | ✗ | ✓ | Compensator rank / layer–expert group |
| BitsMoE | Shared-basis spectrum-wise mixed precision | ✓ | ✓ | ✓ | ✓ | Spectral component under a shared basis |

*Note.* “Shared basis” denotes an explicitly retained common spectral basis used as the compression or quantization parameterization space. “Bit alloc.” denotes explicit bit-width assignment across units. “Act. aware” denotes the use of calibration activations or activation-derived statistics. “MoE prior” denotes explicit use of routing frequency, token–expert affinity, or expert-utilization imbalance.

This appendix positions BitsMoE relative to representative MoE compression paradigms. Existing methods typically reduce memory by pruning structure, truncating rank, quantizing weights in the original space, or compensating after quantization. BitsMoE instead changes the allocation space: a shared spectral basis is extracted, expert-specific spectral components are used as fine-grained quantization units, and bit-widths are assigned to these components by an activation-aware ILP under a fixed budget.

This design defines a different decision unit. Pruning and rank-compression methods make hard structural decisions over experts, ranks, or low-rank factors. Scalar PTQ methods preserve the architecture but quantize weight groups or channels in the original weight space. MoEQuant adapts scalar PTQ with expert-balanced calibration and token–expert affinity, but it does not allocate bits adaptively. MxMoE is closer to BitsMoE because both use mixed precision, but it assigns precision at the linear-block level. MiLo follows a quantize-then-compensate path, in which low-rank compensators restore information lost under extreme quantization. By contrast, BitsMoE allocates mixed precision over spectral components, which enables finer granularity and treats component eviction as a budget-aware allocation decision rather than a predefined structural truncation.

## Appendix B Detailed Derivation of the Error Model and ILP Formulation

This section provides a detailed derivation of the spectrum-wise reconstruction-error model and the resulting ILP formulation used in Section [3](https://arxiv.org/html/2606.00079#S3). Section [3](https://arxiv.org/html/2606.00079#S3) presents the method in a compact form, whereas this section expands the derivation step by step, from the shared-basis decomposition to the spectrum-wise error objective and the ILP-based mixed-precision bit allocation. The notation used throughout this section is summarized in Table [6](https://arxiv.org/html/2606.00079#A2.T6).

Table 6: Detailed notation used in Appendix [B](https://arxiv.org/html/2606.00079#A2).

| Category | Symbol | Meaning |
|---|---|---|
| Indices | $e\in[E],\ h\in\mathcal{H},\ k\in[n_{h}],\ b\in\mathcal{B}$ | Expert index, projection type, spectral-component index, and candidate bit-width. |
| Projection types | $\mathcal{H}$, $\mathcal{H}_{\mathrm{in}}$, $h_{\mathrm{dn}}$ | Projection set $\mathcal{H}\coloneqq\{\mathtt{gate\_proj},\mathtt{up\_proj},\mathtt{down\_proj}\}$. We use $\mathcal{H}_{\mathrm{in}}\coloneqq\{\mathtt{gate\_proj},\mathtt{up\_proj}\}$ and $h_{\mathrm{dn}}\coloneqq\mathtt{down\_proj}$. |
| Dimensions | $d_{h},\ n_{h}$ | Length of the expert-specific spectral vector and number of retained spectral components for projection type $h$. |
| Expert weights | $\boldsymbol{W}_{e}^{(h)}$, $\boldsymbol{W}_{\mathrm{cat}}^{(h)}$ | Expert weight matrix and the expert-concatenated matrix used to construct the layer-wise shared basis. |
| Shared-basis decomposition | $\widetilde{\boldsymbol{P}}_{e}^{(h)}$, $\boldsymbol{P}_{e}^{(h)}$, $\boldsymbol{A}_{e}^{(h)}$, $\boldsymbol{\Phi}_{h}$ | Singular-value-absorbed expert-specific spectral matrix, its column-normalized version, diagonal spectral-energy matrix, and shared basis. Only $\boldsymbol{P}_{e}^{(h)}$ is assigned mixed bit-widths; $\boldsymbol{\Phi}_{h}$ is kept unquantized. |
| Spectral directions | $\widetilde{\boldsymbol{p}}_{e,h,k}$, $\boldsymbol{p}_{e,h,k}$, $\boldsymbol{\phi}_{h,k}$ | Unnormalized expert-specific spectral vector, normalized direction with $\lVert\boldsymbol{p}_{e,h,k}\rVert_{2}=1$, and corresponding shared-basis direction. |
| Spectral energy | $\alpha_{e,h,k}$, $\boldsymbol{W}_{e,h,k}$ | Component energy and the associated rank-one spectral component. |
| Calibration statistics | $\boldsymbol{X}_{e,h}$, $\boldsymbol{g}_{e}$, $\boldsymbol{X}_{g,e,h}$, $\boldsymbol{H}_{e,h}$ | Routed calibration activations, routing weights, affinity-weighted activations, and the corresponding activation Gram matrix. |
| Component importance | $\beta_{e,h,k}$, $\gamma\in[0,1]$ | Activation-aware component importance and its smoothing exponent in the ILP objective. |
| Quantization error | $Q_{b}(\cdot;s_{k})$, $\widehat{\boldsymbol{p}}_{e,h,k}$, $\boldsymbol{\varepsilon}_{e,h,k}(b)$ | Symmetric uniform quantizer, quantized spectral vector, and vector-valued quantization error. |
| Direction distortion | $\mathcal{E}_{e,h,k}(b)$, $\rho_{e,h,k}$, $\eta_{e,h,k}$, $\kappa_{b}$, $\theta_{e,h,k}$ | Scalar distortion $\mathcal{E}_{e,h,k}(b)\coloneqq\mathbb{E}\lVert\boldsymbol{\varepsilon}_{e,h,k}(b)\rVert_{2}^{2}$. The remaining symbols are auxiliary coefficients in the piecewise distortion model. |
| Component cost | $L_{e,h,k}(b)$ | Reconstruction-loss surrogate for assigning $b$ bits to component $(e,h,k)$. |
| ILP variables | $y_{e,h,k,b}$, $\boldsymbol{Y}^{(h)}$, $\boldsymbol{C}^{(h)}$, $\boldsymbol{\Omega}^{(h)}$ | Binary bit-assignment variable and its projection-wise collections, with $C_{e,h,k,b}\coloneqq L_{e,h,k}(b)$ and $\Omega_{e,h,k,b}\coloneqq b$. |
| Bit budgets | $\mathfrak{b}_{\mathrm{eq}}$, $B$, $B_{h}^{\mathrm{bit}}$, $B_{h}$ | Target equivalent bit-width, layer-level remaining bit budget, projection-level physical bit budget, and normalized component budget used by the ILP. |

### B.1 Shared-basis Spectral Decomposition

Within an MoE layer, experts share the same feature spaces but implement different parameterized transformations. This motivates constructing a layer-wise shared spectral basis across experts. We denote the projection types by $\mathcal{H}\coloneqq\{\mathtt{gate\_proj},\mathtt{up\_proj},\mathtt{down\_proj}\}$, with $\mathcal{H}_{\mathrm{in}}\coloneqq\{\mathtt{gate\_proj},\mathtt{up\_proj}\}$ and $h_{\mathrm{dn}}\coloneqq\mathtt{down\_proj}$. The concatenation direction determines whether the shared basis is defined over the input or output feature space.

For $h\in\mathcal{H}_{\mathrm{in}}$, expert weights share the same input feature space. We concatenate expert weights along the output-channel dimension:

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)}\coloneqq\begin{bmatrix}\boldsymbol{W}_{1}^{(h)}\\ \vdots\\ \boldsymbol{W}_{E}^{(h)}\end{bmatrix}. \tag{27}
$$

We then compute the SVD of the concatenated matrix and merge the singular values into the left factor:

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)}=\boldsymbol{U}_{\mathrm{cat}}^{(h)}\boldsymbol{\Sigma}^{(h)}\boldsymbol{\Phi}_{h}^{\top}=\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}\boldsymbol{\Phi}_{h}^{\top},\qquad\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}\coloneqq\boldsymbol{U}_{\mathrm{cat}}^{(h)}\boldsymbol{\Sigma}^{(h)}. \tag{28}
$$

After merging the singular values, we partition the resulting matrix according to expert blocks:

$$
\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}=\begin{bmatrix}\widetilde{\boldsymbol{P}}_{1}^{(h)}\\ \vdots\\ \widetilde{\boldsymbol{P}}_{E}^{(h)}\end{bmatrix},\qquad\boldsymbol{W}_{e}^{(h)}=\widetilde{\boldsymbol{P}}_{e}^{(h)}\boldsymbol{\Phi}_{h}^{\top}. \tag{29}
$$

For $h=h_{\mathrm{dn}}$, expert weights share the same output feature space. We concatenate expert weights along the input-channel dimension:

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)}\coloneqq\begin{bmatrix}\boldsymbol{W}_{1}^{(h)}&\cdots&\boldsymbol{W}_{E}^{(h)}\end{bmatrix}. \tag{30}
$$

The corresponding SVD-and-absorption step is

$$
\boldsymbol{W}_{\mathrm{cat}}^{(h)}=\boldsymbol{\Phi}_{h}\boldsymbol{\Sigma}^{(h)}\boldsymbol{V}_{\mathrm{cat}}^{(h)\top}=\boldsymbol{\Phi}_{h}\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)\top},\qquad\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}\coloneqq\boldsymbol{V}_{\mathrm{cat}}^{(h)}\boldsymbol{\Sigma}^{(h)}. \tag{31}
$$

Partitioning $\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}$ according to expert input-channel blocks gives

$$
\widetilde{\boldsymbol{P}}_{\mathrm{cat}}^{(h)}=\begin{bmatrix}\widetilde{\boldsymbol{P}}_{1}^{(h)}\\ \vdots\\ \widetilde{\boldsymbol{P}}_{E}^{(h)}\end{bmatrix},\qquad\boldsymbol{W}_{e}^{(h)}=\boldsymbol{\Phi}_{h}\widetilde{\boldsymbol{P}}_{e}^{(h)\top}. \tag{32}
$$
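The two concatenation cases in Eqs. (27)–(32) can be sketched numerically as follows. This is a minimal NumPy illustration on random toy matrices, not the released BitsMoE code; the function names `shared_basis_in` and `shared_basis_down`, the toy shapes, and the assumption that all experts have identical shapes are ours.

```python
import numpy as np

def shared_basis_in(weights):
    """Shared-basis factors for gate_proj/up_proj-style weights (each d_out x d_in)."""
    w_cat = np.concatenate(weights, axis=0)               # stack along output channels, Eq. (27)
    u, s, vt = np.linalg.svd(w_cat, full_matrices=False)  # W_cat = U diag(s) V^T
    p_cat = u * s                                         # absorb singular values, Eq. (28)
    phi = vt.T                                            # shared basis, columns phi_k
    p_tilde = np.split(p_cat, len(weights), axis=0)       # per-expert row blocks, Eq. (29)
    return p_tilde, phi

def shared_basis_down(weights):
    """Shared-basis factors for down_proj-style weights (each d_out x d_in)."""
    w_cat = np.concatenate(weights, axis=1)               # stack along input channels, Eq. (30)
    u, s, vt = np.linalg.svd(w_cat, full_matrices=False)
    phi = u                                               # shared basis over the output space
    p_cat = vt.T * s                                      # V diag(s), Eq. (31)
    p_tilde = np.split(p_cat, len(weights), axis=0)       # per-expert row blocks, Eq. (32)
    return p_tilde, phi

rng = np.random.default_rng(0)
experts_in = [rng.standard_normal((6, 4)) for _ in range(3)]   # toy experts, hypothetical sizes
p_in, phi_in = shared_basis_in(experts_in)
err_in = max(np.abs(w - p @ phi_in.T).max() for w, p in zip(experts_in, p_in))

experts_dn = [rng.standard_normal((4, 6)) for _ in range(3)]
p_dn, phi_dn = shared_basis_down(experts_dn)
err_dn = max(np.abs(w - phi_dn @ p.T).max() for w, p in zip(experts_dn, p_dn))
```

Without rank truncation, both `err_in` and `err_dn` should sit at floating-point precision, reflecting that each expert weight is recovered from its own block and the shared basis.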
###### Definition B.1 (Spectral component and energy matrix).

Let $\boldsymbol{\phi}_{h,k}$ denote the $k$-th column of the shared basis $\boldsymbol{\Phi}_{h}$, and let $\widetilde{\boldsymbol{p}}_{e,h,k}\coloneqq\widetilde{\boldsymbol{P}}_{e}^{(h)}[:,k]$ denote the corresponding expert-specific spectral vector. We define its spectral energy as

$$
\alpha_{e,h,k}\coloneqq\left\|\widetilde{\boldsymbol{p}}_{e,h,k}\right\|_{2}. \tag{33}
$$

The component energies are collected into a diagonal matrix

$$
\boldsymbol{A}_{e}^{(h)}\coloneqq\operatorname{diag}(\alpha_{e,h,1},\ldots,\alpha_{e,h,n_{h}}). \tag{34}
$$

The corresponding rank-one spectral component is

$$
\boldsymbol{W}_{e,h,k}\coloneqq\begin{cases}\widetilde{\boldsymbol{p}}_{e,h,k}\boldsymbol{\phi}_{h,k}^{\top},&h\in\mathcal{H}_{\mathrm{in}},\\[3pt] \boldsymbol{\phi}_{h,k}\widetilde{\boldsymbol{p}}_{e,h,k}^{\top},&h=h_{\mathrm{dn}}.\end{cases} \tag{35}
$$

###### Definition B.2 (Normalized expert-specific spectral matrix).

To decouple component magnitude from direction, each column of $\widetilde{\boldsymbol{P}}_{e}^{(h)}$ is normalized by its spectral energy:

$$
\boldsymbol{P}_{e}^{(h)}\coloneqq\widetilde{\boldsymbol{P}}_{e}^{(h)}\left(\boldsymbol{A}_{e}^{(h)}\right)^{-1}=[\boldsymbol{p}_{e,h,1},\ldots,\boldsymbol{p}_{e,h,n_{h}}],\qquad\boldsymbol{p}_{e,h,k}\coloneqq\frac{\widetilde{\boldsymbol{p}}_{e,h,k}}{\alpha_{e,h,k}}. \tag{36}
$$

Thus, $\|\boldsymbol{p}_{e,h,k}\|_{2}=1$ for every component. The expert weight admits the unified normalized shared-basis form

$$
\boldsymbol{W}_{e}^{(h)}=\begin{cases}\boldsymbol{P}_{e}^{(h)}\boldsymbol{A}_{e}^{(h)}\boldsymbol{\Phi}_{h}^{\top}=\sum_{k=1}^{n_{h}}\alpha_{e,h,k}\boldsymbol{p}_{e,h,k}\boldsymbol{\phi}_{h,k}^{\top},&h\in\mathcal{H}_{\mathrm{in}},\\[5pt] \boldsymbol{\Phi}_{h}\boldsymbol{A}_{e}^{(h)}\boldsymbol{P}_{e}^{(h)\top}=\sum_{k=1}^{n_{h}}\alpha_{e,h,k}\boldsymbol{\phi}_{h,k}\boldsymbol{p}_{e,h,k}^{\top},&h=h_{\mathrm{dn}}.\end{cases} \tag{37}
$$

In this unified notation, $\boldsymbol{P}_{e}^{(h)}$ always denotes the expert-specific normalized spectral matrix assigned mixed bit-widths, whereas $\boldsymbol{\Phi}_{h}$ always denotes the shared basis retained without quantization.
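As an illustration of Definitions B.1 and B.2, the sketch below splits a singular-value-absorbed block into unit-norm directions and energies and rebuilds the expert weight from the rank-one sum in Eq. (37) for the $\mathcal{H}_{\mathrm{in}}$ case. The helper name `normalize_spectral`, the toy shapes, and the assumption that every energy is strictly positive are ours.

```python
import numpy as np

def normalize_spectral(p_tilde):
    """Split P~ (d x n) into unit-norm columns P and energies alpha, Eqs. (33) and (36)."""
    alpha = np.linalg.norm(p_tilde, axis=0)   # alpha_k = ||p~_k||_2
    p = p_tilde / alpha                       # assumes every alpha_k > 0
    return p, alpha

rng = np.random.default_rng(0)
experts = [rng.standard_normal((6, 4)) for _ in range(3)]   # toy gate/up-style experts
u, s, vt = np.linalg.svd(np.concatenate(experts, axis=0), full_matrices=False)
phi = vt.T                                                  # shared basis
p_tilde_blocks = np.split(u * s, len(experts), axis=0)      # per-expert P~ blocks

p, alpha = normalize_spectral(p_tilde_blocks[0])
# Eq. (37), h in H_in: W_e = sum_k alpha_k p_k phi_k^T
w_rebuilt = sum(alpha[k] * np.outer(p[:, k], phi[:, k]) for k in range(alpha.size))
```

Here `alpha` differs from expert to expert even though `phi` is shared, which is what lets the energies act as per-expert importance signals.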

### B.2 Activation-aware Reconstruction Loss

We first consider the loss of a single expert for $h\in\mathcal{H}_{\mathrm{in}}$. Let $\boldsymbol{P}=[\boldsymbol{p}_{1},\ldots,\boldsymbol{p}_{n}]$, $\boldsymbol{A}=\operatorname{diag}(\alpha_{1},\ldots,\alpha_{n})$, and $\boldsymbol{\Phi}=[\boldsymbol{\phi}_{1},\ldots,\boldsymbol{\phi}_{n}]$. Quantization is applied only to the expert-specific normalized spectral vectors:

$$
\widehat{\boldsymbol{p}}_{k}=Q_{b_{k}}(\boldsymbol{p}_{k}),\qquad\boldsymbol{\varepsilon}_{k}\coloneqq\boldsymbol{p}_{k}-\widehat{\boldsymbol{p}}_{k}. \tag{38}
$$

The reconstructed weight and the induced weight perturbation are

$$
\widehat{\boldsymbol{W}}=\sum_{k=1}^{n}\alpha_{k}\widehat{\boldsymbol{p}}_{k}\boldsymbol{\phi}_{k}^{\top},\qquad\boldsymbol{\Delta}\coloneqq\boldsymbol{W}-\widehat{\boldsymbol{W}}=\sum_{k=1}^{n}\alpha_{k}\boldsymbol{\varepsilon}_{k}\boldsymbol{\phi}_{k}^{\top}. \tag{39}
$$

Equivalently, if $\boldsymbol{E}_{P}\coloneqq\boldsymbol{P}-\widehat{\boldsymbol{P}}=[\boldsymbol{\varepsilon}_{1},\ldots,\boldsymbol{\varepsilon}_{n}]$, then $\boldsymbol{\Delta}=\boldsymbol{E}_{P}\boldsymbol{A}\boldsymbol{\Phi}^{\top}$.

###### Definition B.3 (Activation-aware reconstruction loss).

Given the input activation matrix $\boldsymbol{X}$ routed to this expert, the activation-output reconstruction loss is defined as

$$
L(\widehat{\boldsymbol{W}})\coloneqq\mathbb{E}\left\|(\boldsymbol{W}-\widehat{\boldsymbol{W}})\boldsymbol{X}_{g}\right\|_{F}^{2}. \tag{40}
$$

Following the affinity-guided calibration idea for MoEQuant \[[8](https://arxiv.org/html/2606.00079#bib.bib16)\], we incorporate token-expert routing affinity into the activation statistics by defining

$$
\boldsymbol{H}\coloneqq\boldsymbol{X}_{g}\boldsymbol{X}_{g}^{\top}=\boldsymbol{X}\operatorname{Diag}(\boldsymbol{g})\boldsymbol{X}^{\top}=\sum_{t=1}^{T}g_{t}\boldsymbol{x}_{t}\boldsymbol{x}_{t}^{\top}, \tag{41}
$$

where $\boldsymbol{X}=[\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{T}]$ contains the activations routed to this expert, $\boldsymbol{g}=[g_{1},\ldots,g_{T}]^{\top}$ contains the corresponding routing weights, and $\boldsymbol{X}_{g}\coloneqq\boldsymbol{X}\operatorname{Diag}(\boldsymbol{g})^{1/2}$. Since $g_{t}\geq 0$, $\boldsymbol{H}\succeq 0$. This matrix measures the routing-affinity-weighted activation distribution for this expert.
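The three equivalent forms of the affinity-weighted Gram matrix in Eq. (41) can be checked on toy data. The sketch below is illustrative only; the helper name `affinity_gram`, the column-per-token layout of `x`, and the random routing weights are assumptions.

```python
import numpy as np

def affinity_gram(x, g):
    """H = X Diag(g) X^T for activations x (d x T, one column per token) and weights g (T,)."""
    return (x * g) @ x.T                      # broadcasting scales column t by g_t

rng = np.random.default_rng(0)
num_tokens = 10
x = rng.standard_normal((4, num_tokens))      # toy routed activations
g = rng.uniform(0.0, 1.0, size=num_tokens)    # toy non-negative routing weights

h = affinity_gram(x, g)
x_g = x * np.sqrt(g)                          # X_g = X Diag(g)^{1/2}
h_from_xg = x_g @ x_g.T                       # first form in Eq. (41)
h_from_sum = sum(g[t] * np.outer(x[:, t], x[:, t]) for t in range(num_tokens))
min_eig = np.linalg.eigvalsh(h).min()         # non-negative up to round-off since g_t >= 0
```

Accumulating `h` as a running sum over calibration batches is one natural way to avoid storing all activations, since each token contributes an independent rank-one term.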

###### Lemma B.1 (Spectrum-wise reconstruction error).

For $h\in\mathcal{H}_{\mathrm{in}}$, under the shared-basis decomposition in Eq. ([37](https://arxiv.org/html/2606.00079#A2.E37)) and the perturbation in Eq. ([39](https://arxiv.org/html/2606.00079#A2.E39)), the reconstruction loss satisfies

$$
L(\widehat{\boldsymbol{W}})=\sum_{k=1}^{n}\sum_{l=1}^{n}\alpha_{k}\alpha_{l}\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}\right)\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{l}\right]. \tag{42}
$$

###### Proof.

Using $\|\boldsymbol{M}\|_{F}^{2}=\operatorname{Tr}(\boldsymbol{M}\boldsymbol{M}^{\top})$ for any matrix $\boldsymbol{M}$, together with $\boldsymbol{H}=\boldsymbol{X}_{g}\boldsymbol{X}_{g}^{\top}$, Eq. ([40](https://arxiv.org/html/2606.00079#A2.E40)) becomes

$$
L(\widehat{\boldsymbol{W}})=\mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{\Delta}\boldsymbol{H}\boldsymbol{\Delta}^{\top}\right)\right]. \tag{43}
$$

Substituting Eq. ([39](https://arxiv.org/html/2606.00079#A2.E39)) into Eq. ([43](https://arxiv.org/html/2606.00079#A2.E43)) gives

$$
L(\widehat{\boldsymbol{W}})=\sum_{k=1}^{n}\sum_{l=1}^{n}\alpha_{k}\alpha_{l}\mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{\varepsilon}_{k}\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}\boldsymbol{\varepsilon}_{l}^{\top}\right)\right]. \tag{44}
$$

The middle term $\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}$ is scalar, and $\operatorname{Tr}(\boldsymbol{a}\boldsymbol{b}^{\top})=\boldsymbol{b}^{\top}\boldsymbol{a}$. Therefore,

$$
\operatorname{Tr}\left(\boldsymbol{\varepsilon}_{k}\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}\boldsymbol{\varepsilon}_{l}^{\top}\right)=\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}\right)\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{l}. \tag{45}
$$

Substituting this identity into Eq. ([44](https://arxiv.org/html/2606.00079#A2.E44)) proves Eq. ([42](https://arxiv.org/html/2606.00079#A2.E42)). ∎
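Lemma B.1 is an algebraic identity, so it can be verified for any fixed error realization (dropping the expectation). The sketch below compares the direct loss $\|\boldsymbol{\Delta}\boldsymbol{X}_{g}\|_{F}^{2}$ with the double sum in Eq. (42) on random toy data. All sizes and the random stand-in for the quantization error are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, num_tokens = 5, 4, 4, 12

alpha = rng.uniform(0.5, 2.0, size=n)                     # toy spectral energies
phi, _ = np.linalg.qr(rng.standard_normal((d_in, n)))     # orthonormal shared basis
eps = 0.01 * rng.standard_normal((d_out, n))              # stand-in for E_P = P - P_hat
x = rng.standard_normal((d_in, num_tokens))
g = rng.uniform(size=num_tokens)
h = (x * g) @ x.T                                         # Eq. (41)

delta = (eps * alpha) @ phi.T                             # Eq. (39): E_P A Phi^T
loss_direct = np.linalg.norm(delta @ (x * np.sqrt(g)), "fro") ** 2

m = phi.T @ h @ phi                                       # m[k, l] = phi_k^T H phi_l
gram_eps = eps.T @ eps                                    # gram_eps[k, l] = eps_k^T eps_l
loss_spectral = float(np.sum(np.outer(alpha, alpha) * m * gram_eps))   # Eq. (42)
```

The two quantities agree to round-off, which confirms that all approximation in the later bit-allocation objective enters only through the treatment of the off-diagonal terms.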

To obtain an additive spectrum\-wise loss and avoid a quadratic integer program, we use the following diagonal component\-error approximation\.

###### Assumption B.1 (Diagonal component-error approximation).

For distinct spectral components $k\neq l$, the corresponding quantization errors are treated as approximately uncorrelated:

$$
\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{l}\right]\approx 0,\qquad\forall\,k\neq l. \tag{46}
$$

For $b\geq 1$, this approximation is supported by the standard symmetric-quantization model, under which separately scaled component-wise quantizers induce approximately zero-mean errors. Specifically,

$$
\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{l}\right]\approx\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}\right]^{\top}\mathbb{E}\left[\boldsymbol{\varepsilon}_{l}\right]\approx 0,\qquad\forall\,k\neq l. \tag{47}
$$

This approximation removes cross-component error terms and makes the spectrum-wise objective additive. For $b=0$, the same removal should be interpreted only as a tractable diagonal approximation, not as a consequence of zero-mean quantization error.

###### Corollary B.1 (Additive spectrum-wise loss).

Under Eq. ([47](https://arxiv.org/html/2606.00079#A2.E47)), the reconstruction loss for $h\in\mathcal{H}_{\mathrm{in}}$ reduces to

$$
L(\widehat{\boldsymbol{W}})\approx\sum_{k=1}^{n}\alpha_{k}^{2}\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k}\right)\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}\right\|_{2}^{2}=\sum_{k=1}^{n}\alpha_{k}^{2}\beta_{k}\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}\right\|_{2}^{2}, \tag{48}
$$

where

$$
\beta_{k}\coloneqq\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k}. \tag{49}
$$

We refer to $\beta_{k}$ as the *activation-aware importance*. Since $\boldsymbol{H}\succeq 0$, we have $\beta_{k}\geq 0$. The only approximation in Eq. (48) is the removal of the off-diagonal terms; the second relation is an equality by the definition of $\beta_{k}$.

###### Proof\.

Starting from Lemma [B.1](https://arxiv.org/html/2606.00079#A2.Thmlemma1), we split the double summation into diagonal and off-diagonal terms:

$$
\begin{aligned}
L(\widehat{\boldsymbol{W}})&=\sum_{k=1}^{n}\alpha_{k}^{2}\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k}\right)\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{k}\right]+\sum_{k\neq l}^{n}\alpha_{k}\alpha_{l}\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{l}\right)\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{\varepsilon}_{l}\right]\\
&\approx\sum_{k=1}^{n}\alpha_{k}^{2}\left(\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k}\right)\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}\right\|_{2}^{2},
\end{aligned} \tag{50}
$$

where the off-diagonal summation vanishes by Eq. ([47](https://arxiv.org/html/2606.00079#A2.E47)). This gives Eq. ([48](https://arxiv.org/html/2606.00079#A2.E48)). ∎

Equivalently, for $h\in\mathcal{H}_{\mathrm{in}}$, the activation-aware importance can be obtained by retaining the diagonal entries of the activation metric in the shared spectral basis:

$$
\boldsymbol{\beta}\coloneqq\operatorname{diag}\left(\boldsymbol{\Phi}^{\top}\boldsymbol{H}\boldsymbol{\Phi}\right),\qquad\beta_{k}=\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k}. \tag{51}
$$

This expression shows that bit allocation should prioritize spectral components with larger spectral energy $\alpha_{k}^{2}$, larger activation-aware importance $\beta_{k}$, and larger bit-dependent directional distortion.
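To give a feel for how well the additive form in Eq. (48) tracks the full double sum in Eq. (42), the sketch below computes $\boldsymbol{\beta}$ via Eq. (51) and compares the additive loss with a Monte Carlo average of the exact loss. The error model here is an assumption made only for this illustration: independent entries drawn uniformly from $[-s/2, s/2)$ with a hypothetical step `s`, so that $\mathbb{E}\|\boldsymbol{\varepsilon}_{k}\|_{2}^{2}=d\,s^{2}/12$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, num_tokens, s = 64, 8, 4, 32, 0.02

alpha = rng.uniform(0.5, 2.0, size=n)                     # toy spectral energies
phi, _ = np.linalg.qr(rng.standard_normal((d_in, n)))     # orthonormal shared basis
x = rng.standard_normal((d_in, num_tokens))
g = rng.uniform(size=num_tokens)
h = (x * g) @ x.T

m = phi.T @ h @ phi
beta = np.diag(m)                                         # Eq. (51)
# Eq. (48) with E||eps_k||^2 = d_out * s^2 / 12 under the assumed uniform error model
additive = float(np.sum(alpha**2 * beta) * d_out * s**2 / 12.0)

losses = []
for _ in range(2000):
    eps = rng.uniform(-s / 2, s / 2, size=(d_out, n))     # zero-mean, independent columns
    losses.append(np.sum(np.outer(alpha, alpha) * m * (eps.T @ eps)))   # Eq. (42), one draw
exact_mean = float(np.mean(losses))
rel_gap = abs(exact_mean - additive) / additive
```

Under this idealized model the cross-component terms average out, so `rel_gap` should be small. Real quantization errors need not be independent across components, which is why the text presents Eq. (46) as an assumption.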

For $h=h_{\mathrm{dn}}$, the shared basis is associated with the activation-output feature space, so the quantized expert-specific vectors are still denoted by $\boldsymbol{p}_{k}$, while the shared directions are $\boldsymbol{\phi}_{k}$. The perturbation is therefore

$$
\boldsymbol{\Delta}=\sum_{k=1}^{n}\alpha_{k}\boldsymbol{\phi}_{k}\boldsymbol{\varepsilon}_{k}^{\top}=\boldsymbol{\Phi}\boldsymbol{A}\boldsymbol{E}_{P}^{\top}, \tag{52}
$$

where $\boldsymbol{\varepsilon}_{k}$ is the quantization error of the expert-specific input-side spectral vector $\boldsymbol{p}_{k}$. Using the orthonormality of the shared basis $\boldsymbol{\Phi}$, the loss becomes

$$
L(\widehat{\boldsymbol{W}})=\mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{A}\boldsymbol{E}_{P}^{\top}\boldsymbol{H}\boldsymbol{E}_{P}\boldsymbol{A}\right)\right]=\sum_{k=1}^{n}\alpha_{k}^{2}\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{H}\boldsymbol{\varepsilon}_{k}\right]. \tag{53}
$$
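For completeness, the step from Eq. (52) to Eq. (53) can be spelled out as follows. This is our expansion rather than text from the paper; it relies only on $\|\boldsymbol{M}\|_{F}^{2}=\operatorname{Tr}(\boldsymbol{M}^{\top}\boldsymbol{M})$, the cyclic property of the trace, and $\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}=\boldsymbol{I}$. Unlike the $\mathcal{H}_{\mathrm{in}}$ case, no cross-component terms appear, so no diagonal approximation is needed at this stage.

```latex
\begin{aligned}
L(\widehat{\boldsymbol{W}})
  &= \mathbb{E}\left\|\boldsymbol{\Delta}\boldsymbol{X}_{g}\right\|_{F}^{2}
   = \mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{X}_{g}^{\top}\boldsymbol{\Delta}^{\top}\boldsymbol{\Delta}\boldsymbol{X}_{g}\right)\right] \\
  &= \mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{X}_{g}^{\top}\boldsymbol{E}_{P}\boldsymbol{A}\,\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\,\boldsymbol{A}\boldsymbol{E}_{P}^{\top}\boldsymbol{X}_{g}\right)\right]
   && \text{by Eq. (52)} \\
  &= \mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{A}\boldsymbol{E}_{P}^{\top}\boldsymbol{X}_{g}\boldsymbol{X}_{g}^{\top}\boldsymbol{E}_{P}\boldsymbol{A}\right)\right]
   && \text{since } \boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}=\boldsymbol{I} \\
  &= \mathbb{E}\left[\operatorname{Tr}\left(\boldsymbol{A}\boldsymbol{E}_{P}^{\top}\boldsymbol{H}\boldsymbol{E}_{P}\boldsymbol{A}\right)\right]
   = \sum_{k=1}^{n}\alpha_{k}^{2}\,\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{H}\boldsymbol{\varepsilon}_{k}\right],
\end{aligned}
```

where the last equality uses that $\boldsymbol{A}$ is diagonal, so the $k$-th diagonal entry of $\boldsymbol{A}\boldsymbol{E}_{P}^{\top}\boldsymbol{H}\boldsymbol{E}_{P}\boldsymbol{A}$ is $\alpha_{k}^{2}\,\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{H}\boldsymbol{\varepsilon}_{k}$.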
Since directly using Eq. ([53](https://arxiv.org/html/2606.00079#A2.E53)) would make the importance depend on the quantization-error direction, we use a tractable empirical surrogate based on the corresponding unquantized expert-specific spectral direction:

$$
\mathbb{E}\left[\boldsymbol{\varepsilon}_{k}^{\top}\boldsymbol{H}\boldsymbol{\varepsilon}_{k}\right]\approx\beta_{k}\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}\right\|_{2}^{2},\qquad\beta_{k}\coloneqq\boldsymbol{p}_{k}^{\top}\boldsymbol{H}\boldsymbol{p}_{k}. \tag{54}
$$

For $b\geq 1$, this is an empirical alignment heuristic rather than an isotropic-noise approximation; it is exact only under zero-bit eviction.
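The exactness claim for zero-bit eviction can be illustrated numerically. The sketch below assumes that eviction means the component is dropped entirely, i.e. $\widehat{\boldsymbol{p}}_{k}=\boldsymbol{0}$ and hence $\boldsymbol{\varepsilon}_{k}=\boldsymbol{p}_{k}$; the toy sizes and the random stand-in for a generic $b\geq 1$ error are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, num_tokens = 4, 6, 4, 20

phi, _ = np.linalg.qr(rng.standard_normal((d_out, n)))    # orthonormal shared basis (output side)
p = rng.standard_normal((d_in, n))
p /= np.linalg.norm(p, axis=0)                            # unit-norm expert-specific directions
alpha = rng.uniform(0.5, 2.0, size=n)
x = rng.standard_normal((d_in, num_tokens))               # toy down_proj input activations
g = rng.uniform(size=num_tokens)
h = (x * g) @ x.T
beta = np.einsum("ik,ij,jk->k", p, h, p)                  # beta_k = p_k^T H p_k, Eq. (54)

def exact_loss(eps):
    delta = phi @ (eps * alpha).T                         # Eq. (52): Phi A E_P^T
    return float(np.linalg.norm(delta @ (x * np.sqrt(g)), "fro") ** 2)

def surrogate_loss(eps):
    return float(np.sum(alpha**2 * beta * np.sum(eps**2, axis=0)))   # Eq. (54) surrogate

eps_evict = p.copy()                                      # b = 0: p_hat_k = 0, so eps_k = p_k
eps_small = 0.01 * rng.standard_normal((d_in, n))         # generic stand-in for b >= 1
eq53_small = float(np.sum(alpha**2 * np.einsum("ik,ij,jk->k", eps_small, h, eps_small)))
```

With `eps_evict`, `exact_loss` and `surrogate_loss` coincide because $\|\boldsymbol{p}_{k}\|_{2}=1$. With `eps_small`, `exact_loss` still matches the Eq. (53) value `eq53_small`, but the surrogate generally differs, consistent with its description as a heuristic for $b\geq 1$.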

For a single expert, the activation-aware importance is defined as

$$
\beta_{k}\coloneqq\begin{cases}\boldsymbol{\phi}_{k}^{\top}\boldsymbol{H}\boldsymbol{\phi}_{k},&h\in\mathcal{H}_{\mathrm{in}},\\[3pt] \boldsymbol{p}_{k}^{\top}\boldsymbol{H}\boldsymbol{p}_{k},&h=h_{\mathrm{dn}}.\end{cases} \tag{55}
$$

Therefore, for each expert and each projection in $\mathcal{H}$, the remaining derivation uses the unified additive loss

$$
L(\widehat{\boldsymbol{W}})\approx\sum_{k=1}^{n}\alpha_{k}^{2}\beta_{k}\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}\right\|_{2}^{2}, \tag{56}
$$

where $\boldsymbol{\varepsilon}_{k}$ denotes the quantization error of the expert-specific spectral vector $\boldsymbol{p}_{k}$.

### B.3 Piecewise Reconstruction Error for Bit Allocation

We now specify the bit-dependent normalized distortion term for candidate bit-widths $\mathcal{B}=\{16,8,6,4,3,2,1,0\}$. The candidate $b=16$ denotes an FP16 expert-specific spectral vector, which consumes 16 bits per element. Let $\boldsymbol{\varepsilon}_{k}(b)$ denote the direction error induced by assigning bit-width $b$ to $\boldsymbol{p}_{k}$. The normalized direction distortion is

$$
\mathcal{E}_{k}(b)\coloneqq\mathbb{E}\left\|\boldsymbol{\varepsilon}_{k}(b)\right\|_{2}^{2}. \tag{57}
$$

###### Lemma B.2 (High-bit distortion under the high-resolution approximation).

For $b\in\{6,8,16\}$, let $d$ denote the dimension of $\boldsymbol{p}_{k}$, and define

$$
\rho_{k}\coloneqq\|\boldsymbol{p}_{k}\|_{\infty},\qquad\eta_{k}\coloneqq\frac{d\rho_{k}^{2}}{3}. \tag{58}
$$

Under the high-resolution uniform-noise approximation for symmetric uniform quantization, the normalized direction distortion is approximated by

$$
\mathcal{E}_{k}(b)\approx\frac{d\rho_{k}^{2}}{3}\exp(-\lambda b)=\eta_{k}\exp(-\lambda b),\qquad\lambda\coloneqq 2\ln 2. \tag{59}
$$
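As a rough sanity check of Eq. (59), the sketch below quantizes a random unit-norm vector with a symmetric uniform quantizer and compares the measured squared error with $\eta_{k}\exp(-\lambda b)$. It uses only the Python standard library. The scale choice $s_{k}=\rho_{k}/q_{\max}$ follows the proof below; the Gaussian test vector, the dimension 4096, and the helper names are assumptions, and the FP16 candidate $b=16$ is not simulated.

```python
import math
import random

def quantize(p, b):
    """Symmetric uniform quantizer with one shared scale s = max|p_j| / q_max."""
    q_max, q_min = 2 ** (b - 1) - 1, -(2 ** (b - 1))
    s = max(abs(v) for v in p) / q_max
    return [s * min(max(round(v / s), q_min), q_max) for v in p]

def predicted_distortion(p, b):
    """eta_k * exp(-lambda * b) with eta_k = d * rho_k^2 / 3 and lambda = 2 ln 2."""
    rho = max(abs(v) for v in p)
    return len(p) * rho ** 2 / 3.0 * math.exp(-2.0 * math.log(2.0) * b)

random.seed(0)
raw = [random.gauss(0.0, 1.0) for _ in range(4096)]
norm = math.sqrt(sum(v * v for v in raw))
p = [v / norm for v in raw]                      # unit-norm stand-in for p_k

ratios = {}
for b in (6, 8):
    p_hat = quantize(p, b)
    empirical = sum((u - v) ** 2 for u, v in zip(p, p_hat))
    ratios[b] = empirical / predicted_distortion(p, b)   # near 1 when the model holds
```

The ratio is expected to sit slightly above 1 because the prediction replaces $q_{\max}=2^{b-1}-1$ with $2^{b-1}$, and that gap grows as $b$ shrinks. This is one reason the high-resolution model is reserved for the larger bit-widths.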

###### Proof\.

For the $k$-th expert-specific spectral vector of the corresponding expert weight, let $\boldsymbol{p}_{k}\in\mathbb{R}^{d}$. Symmetric uniform quantization is applied element-wise with a common scale. For each coordinate $j\in\{1,\ldots,d\}$, define

$$\begin{aligned}
Q_{b}(p_{k,j};s_{k})&\coloneqq s_{k}\cdot\operatorname{clamp}\left(\left\lfloor\frac{p_{k,j}}{s_{k}}\right\rceil,q_{\min},q_{\max}\right),\\
q_{\max}&=2^{b-1}-1,\qquad q_{\min}=-2^{b-1}.
\end{aligned}\tag{60}$$

The coordinate-wise quantization error is

$$\varepsilon_{k,j}(b)\coloneqq p_{k,j}-Q_{b}(p_{k,j};s_{k}).\tag{61}$$

Accordingly, the vector-level quantization error is

$$\boldsymbol{\varepsilon}_{k}(b)\coloneqq\boldsymbol{p}_{k}-Q_{b}(\boldsymbol{p}_{k};s_{k})=\left[\varepsilon_{k,1}(b),\ldots,\varepsilon_{k,d}(b)\right]^{\top}.\tag{62}$$

In the high-resolution regime, clipping is negligible and each scalar rounding error is approximated as uniformly distributed on $[-s_{k}/2,s_{k}/2)$. Therefore, for each coordinate $j$,

$$\mathbb{E}\!\left[\varepsilon_{k,j}(b)\right]=0,\qquad\mathbb{E}\!\left[\varepsilon_{k,j}^{2}(b)\right]=\frac{1}{s_{k}}\int_{-s_{k}/2}^{s_{k}/2}\varepsilon^{2}\,d\varepsilon=\frac{s_{k}^{2}}{12}.\tag{63}$$

Since the vector-level squared error is the sum of coordinate-wise squared errors, we have

$$\begin{aligned}
\mathbb{E}\!\left[\left\|\boldsymbol{\varepsilon}_{k}(b)\right\|_{2}^{2}\right]&=\mathbb{E}\!\left[\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b)\right]\\
&=\sum_{j=1}^{d}\mathbb{E}\!\left[\varepsilon_{k,j}^{2}(b)\right]=\frac{ds_{k}^{2}}{12}.
\end{aligned}\tag{64}$$

By the definition of $\rho_{k}$ in Eq. ([58](https://arxiv.org/html/2606.00079#A2.E58)), the coordinate-wise quantization scale is

$$s_{k}=\frac{\rho_{k}}{q_{\max}}.\tag{65}$$

Substituting Eq. ([65](https://arxiv.org/html/2606.00079#A2.E65)) into Eq. ([64](https://arxiv.org/html/2606.00079#A2.E64)) gives

$$\mathbb{E}\!\left[\left\|\boldsymbol{\varepsilon}_{k}(b)\right\|_{2}^{2}\right]=\frac{d\rho_{k}^{2}}{12q_{\max}^{2}}.\tag{66}$$

For sufficiently large bit-widths, $q_{\max}=2^{b-1}-1\approx 2^{b-1}$. Thus,

$$\begin{aligned}
\mathbb{E}\!\left[\left\|\boldsymbol{\varepsilon}_{k}(b)\right\|_{2}^{2}\right]&\approx\frac{d\rho_{k}^{2}}{12\cdot 2^{2(b-1)}}\\
&=\frac{d\rho_{k}^{2}}{3}\exp(-2b\ln 2).
\end{aligned}\tag{67}$$

Since $\lambda\coloneqq 2\ln 2$ and $\eta_{k}\coloneqq d\rho_{k}^{2}/3$, this gives Eq. ([59](https://arxiv.org/html/2606.00079#A2.E59)). ∎
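As a sanity check on Eq. (59), the following minimal Python sketch compares the closed-form surrogate with the measured error of a max-abs symmetric quantizer. The Gaussian test vector, the dimension 4096, and all function names are illustrative assumptions and are not taken from the BitsMoE code.

```python
import math
import random

def quantize_symmetric(p, b):
    """Symmetric uniform quantizer of Eq. (60) with scale s = max|p| / q_max (Eq. 65)."""
    q_max = 2 ** (b - 1) - 1
    q_min = -(2 ** (b - 1))
    s = max(abs(x) for x in p) / q_max
    return [s * min(max(round(x / s), q_min), q_max) for x in p]

def empirical_distortion(p, b):
    """Squared l2 error between p and its b-bit reconstruction."""
    q = quantize_symmetric(p, b)
    return sum((x - y) ** 2 for x, y in zip(p, q))

def high_res_distortion(p, b):
    """Closed-form surrogate eta_k * exp(-lambda * b) from Eq. (59)."""
    d = len(p)
    rho = max(abs(x) for x in p)
    eta = d * rho ** 2 / 3.0
    lam = 2.0 * math.log(2.0)
    return eta * math.exp(-lam * b)

def random_unit_vector(d, seed):
    """Hypothetical stand-in for a spectral vector: a normalized Gaussian draw."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

p = random_unit_vector(4096, seed=0)
for b in (6, 8):
    ratio = empirical_distortion(p, b) / high_res_distortion(p, b)
    print(f"b={b}: empirical / surrogate = {ratio:.3f}")
```

Because the derivation replaces $q_{\max}=2^{b-1}-1$ with $2^{b-1}$, the measured error should on average sit slightly above the surrogate, with the gap shrinking as $b$ grows.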

###### Lemma B.3 (Low-bit empirical distortion).

For $b\in\{2,3,4\}$, let $s_{k,b}^{\ast}$ denote the MSE-optimal quantization scale of the unit-norm spectral vector $\boldsymbol{p}_{k}$:

$$s_{k,b}^{\ast}\in\arg\min_{s_{k,b}>0}\left\|\boldsymbol{p}_{k}-Q_{b}(\boldsymbol{p}_{k};s_{k,b})\right\|_{2}^{2}.\tag{68}$$

For coordinate $j\in\{1,\ldots,d\}$, define the coordinate-wise quantization error as

$$\varepsilon_{k,j}(b;s_{k,b}^{\ast})\coloneqq p_{k,j}-Q_{b}(p_{k,j};s_{k,b}^{\ast}).\tag{69}$$

We define the component-specific relative distortion ratio of $\boldsymbol{p}_{k}$ as

$$\kappa_{k,b}\coloneqq\frac{\frac{1}{d}\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b;s_{k,b}^{\ast})}{\frac{1}{d}\sum_{j=1}^{d}p_{k,j}^{2}}.\tag{70}$$

Let $\mathcal{I}$ denote the set of spectral vectors used for coefficient estimation, where $i$ indexes its elements. The shared bit-dependent low-bit coefficient is estimated as

$$\kappa_{b}\coloneqq\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\kappa_{i,b}.\tag{71}$$

Under the empirical coefficient-sharing approximation, which assumes that low-bit relative distortions are sufficiently stable across spectral vectors in $\mathcal{I}$, the vector-level low-bit distortion is approximated by

$$\mathcal{E}_{k}(b)\approx\kappa_{b},\qquad b\in\{2,3,4\}.\tag{72}$$

###### Proof\.

For a scalar coordinate $p_{k,j}$, we use the symmetric uniform quantizer

$$Q_{b}(p_{k,j};s_{k,b})\coloneqq s_{k,b}\cdot\operatorname{clamp}\left(\left\lfloor\frac{p_{k,j}}{s_{k,b}}\right\rceil,q_{\min},q_{\max}\right),\tag{73}$$

where $q_{\max}=2^{b-1}-1$ and $q_{\min}=-2^{b-1}$. The vector quantizer $Q_{b}(\boldsymbol{p}_{k};s_{k,b})$ is applied elementwise. For each pair of spectral vector and bit-width, $s_{k,b}^{\ast}$ is chosen by directly minimizing the empirical vector-level distortion:

$$\mathcal{E}_{k}(b;s_{k,b})\coloneqq\left\|\boldsymbol{p}_{k}-Q_{b}(\boldsymbol{p}_{k};s_{k,b})\right\|_{2}^{2}=\sum_{j=1}^{d}\left(p_{k,j}-Q_{b}(p_{k,j};s_{k,b})\right)^{2}.\tag{74}$$

Thus,

$$\mathcal{E}_{k}(b)=\mathcal{E}_{k}(b;s_{k,b}^{\ast})=\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b;s_{k,b}^{\ast})=d\left(\frac{1}{d}\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b;s_{k,b}^{\ast})\right).\tag{75}$$

Since $\boldsymbol{p}_{k}$ is $\ell_{2}$-normalized, we have

$$\frac{1}{d}\sum_{j=1}^{d}p_{k,j}^{2}=\frac{1}{d}.\tag{76}$$

Combining Eq. ([70](https://arxiv.org/html/2606.00079#A2.E70)), Eq. ([75](https://arxiv.org/html/2606.00079#A2.E75)), and Eq. ([76](https://arxiv.org/html/2606.00079#A2.E76)) gives

$$\kappa_{k,b}=\frac{\frac{1}{d}\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b;s_{k,b}^{\ast})}{\frac{1}{d}\sum_{j=1}^{d}p_{k,j}^{2}}=d\left(\frac{1}{d}\sum_{j=1}^{d}\varepsilon_{k,j}^{2}(b;s_{k,b}^{\ast})\right)=\mathcal{E}_{k}(b).\tag{77}$$

The shared coefficient is defined as the average relative distortion over $\mathcal{I}$:

$$\kappa_{b}\coloneqq\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\kappa_{i,b}.\tag{78}$$

Under the empirical coefficient-sharing approximation, the shared coefficient is used as the low-bit distortion surrogate for each spectral component:

$$\mathcal{E}_{k}(b)\approx\kappa_{b},\qquad b\in\{2,3,4\}.\tag{79}$$

∎

The empirical stability of the shared coefficients $\kappa_{b}$ is further analyzed in Section [D](https://arxiv.org/html/2606.00079#A4).
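One way the shared coefficients of Eq. (71) could be estimated is sketched below in Python. The grid search over scales is an assumption made for illustration (the text only states that $s_{k,b}^{\ast}$ minimizes the empirical distortion), and the random Gaussian vectors are hypothetical stand-ins for real spectral vectors.

```python
import math
import random

def quantize(p, b, s):
    """Symmetric uniform quantizer of Eq. (73) with an explicit scale s."""
    q_max = 2 ** (b - 1) - 1
    q_min = -(2 ** (b - 1))
    return [s * min(max(round(x / s), q_min), q_max) for x in p]

def sq_error(p, b, s):
    """Empirical vector-level distortion of Eq. (74)."""
    return sum((x - y) ** 2 for x, y in zip(p, quantize(p, b, s)))

def kappa_single(p, b, grid=100):
    """Relative distortion of Eq. (70), with the scale found by a simple grid search."""
    rho = max(abs(x) for x in p)
    q_max = 2 ** (b - 1) - 1
    # Candidate scales are fractions of the max-abs scale rho / q_max,
    # so smaller candidates trade clipping error for finer resolution.
    best = min(sq_error(p, b, (i / grid) * rho / q_max) for i in range(1, grid + 1))
    return best / sum(x * x for x in p)

def kappa_shared(vectors, b):
    """Average of the per-vector ratios, as in Eq. (71)."""
    return sum(kappa_single(p, b) for p in vectors) / len(vectors)

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)
vectors = [random_unit_vector(256, rng) for _ in range(8)]
kappa = {b: kappa_shared(vectors, b) for b in (2, 3, 4)}
print(kappa)
```

On such synthetic inputs the estimates should decrease as the bit-width increases; the absolute values depend on the data and on the scale search, so they are not expected to reproduce the coefficients reported in the paper.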

###### Lemma B.4 (One-bit sign distortion).

For $b=1$, let $\boldsymbol{p}_{k}\in\mathbb{R}^{d}$ be a unit-normalized spectral vector. We adopt the symmetric 1-bit sign quantizer

$$Q_{1}(\boldsymbol{p}_{k};s_{k,1})\coloneqq s_{k,1}\operatorname{sign}(\boldsymbol{p}_{k}),\qquad s_{k,1}\coloneqq\frac{1}{d}\sum_{j=1}^{d}|p_{k,j}|.\tag{80}$$

Define the normalized sign direction

$$\boldsymbol{r}_{k}^{(1)}\coloneqq\frac{\operatorname{sign}(\boldsymbol{p}_{k})}{\sqrt{d}},\qquad\cos\theta_{k}\coloneqq\boldsymbol{p}_{k}^{\top}\boldsymbol{r}_{k}^{(1)},\tag{81}$$

where $\operatorname{sign}(\cdot)$ is applied elementwise with $\operatorname{sign}(0)=1$ so that $\boldsymbol{r}_{k}^{(1)}\in\{\pm 1/\sqrt{d}\}^{d}$ and $\|\boldsymbol{r}_{k}^{(1)}\|_{2}=1$. The normalized 1-bit distortion is

$$\mathcal{E}_{k}(1)=\sin^{2}\theta_{k}.\tag{82}$$

###### Proof\.

The 1-bit quantized vector in Eq. ([80](https://arxiv.org/html/2606.00079#A2.E80)) can be rewritten using the normalized sign direction as

$$Q_{1}(\boldsymbol{p}_{k};s_{k,1})=s_{k,1}\operatorname{sign}(\boldsymbol{p}_{k})=s_{k,1}\sqrt{d}\,\boldsymbol{r}_{k}^{(1)}.\tag{83}$$

By the definition of $\boldsymbol{r}_{k}^{(1)}$, its alignment with $\boldsymbol{p}_{k}$ is

$$\cos\theta_{k}=\boldsymbol{p}_{k}^{\top}\boldsymbol{r}_{k}^{(1)}=\frac{1}{\sqrt{d}}\sum_{j=1}^{d}p_{k,j}\operatorname{sign}(p_{k,j})=\frac{1}{\sqrt{d}}\sum_{j=1}^{d}|p_{k,j}|=s_{k,1}\sqrt{d}.\tag{84}$$

Therefore, the 1-bit reconstruction is equivalently

$$Q_{1}(\boldsymbol{p}_{k};s_{k,1})=\cos\theta_{k}\,\boldsymbol{r}_{k}^{(1)}.\tag{85}$$

Thus the 1-bit quantization error vector is

$$\boldsymbol{\varepsilon}_{k}(1)\coloneqq\boldsymbol{p}_{k}-Q_{1}(\boldsymbol{p}_{k};s_{k,1})=\boldsymbol{p}_{k}-\cos\theta_{k}\,\boldsymbol{r}_{k}^{(1)}.\tag{86}$$

Since $\|\boldsymbol{p}_{k}\|_{2}=\|\boldsymbol{r}_{k}^{(1)}\|_{2}=1$, we obtain

$$\begin{aligned}
\left\|\boldsymbol{\varepsilon}_{k}(1)\right\|_{2}^{2}&=\left\|\boldsymbol{p}_{k}-\cos\theta_{k}\,\boldsymbol{r}_{k}^{(1)}\right\|_{2}^{2}\\
&=\|\boldsymbol{p}_{k}\|_{2}^{2}+\cos^{2}\theta_{k}\|\boldsymbol{r}_{k}^{(1)}\|_{2}^{2}-2\cos\theta_{k}\,\boldsymbol{p}_{k}^{\top}\boldsymbol{r}_{k}^{(1)}\\
&=1+\cos^{2}\theta_{k}-2\cos^{2}\theta_{k}\\
&=1-\cos^{2}\theta_{k}=\sin^{2}\theta_{k}.
\end{aligned}\tag{87}$$

Hence the normalized 1-bit distortion is $\mathcal{E}_{k}(1)=\|\boldsymbol{\varepsilon}_{k}(1)\|_{2}^{2}=\sin^{2}\theta_{k}$. ∎
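The identity in Eq. (87) can be checked numerically. The short Python sketch below computes the 1-bit reconstruction error directly from Eq. (80) and compares it with $1-\cos^{2}\theta_{k}$ from Eq. (81). The random unit vector and the function name `one_bit_distortion` are illustrative assumptions.

```python
import math
import random

def one_bit_distortion(p):
    """Return (direct squared error, 1 - cos^2(theta)) for a unit-norm vector p."""
    d = len(p)
    s = sum(abs(x) for x in p) / d                 # scale s_{k,1} of Eq. (80)
    sign = [1.0 if x >= 0 else -1.0 for x in p]    # sign(0) = 1, as in Lemma B.4
    err = sum((x - s * sg) ** 2 for x, sg in zip(p, sign))
    cos_theta = sum(x * sg for x, sg in zip(p, sign)) / math.sqrt(d)  # Eq. (84)
    return err, 1.0 - cos_theta ** 2

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(1024)]
norm = math.sqrt(sum(x * x for x in v))
p = [x / norm for x in v]                          # unit normalization assumed by the lemma

err, sin2 = one_bit_distortion(p)
print(err, sin2)
```

The two returned values should agree up to floating-point error, but only when the input has unit $\ell_{2}$ norm, which is the normalization the lemma assumes.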

###### Lemma B.5 (Zero-bit eviction distortion).

For $b=0$, the spectral vector is evicted:

$$Q_{0}(\boldsymbol{p}_{k})\coloneqq\boldsymbol{0}.\tag{88}$$

Since $\boldsymbol{p}_{k}$ is unit-normalized, the normalized zero-bit distortion is

$$\mathcal{E}_{k}(0)=1.\tag{89}$$

###### Proof\.

When $b=0$, the corresponding spectral vector is discarded. Hence the zero-bit error vector is

$$\boldsymbol{\varepsilon}_{k}(0)\coloneqq\boldsymbol{p}_{k}-Q_{0}(\boldsymbol{p}_{k})=\boldsymbol{p}_{k}.\tag{90}$$

Taking the squared $\ell_{2}$ norm gives

$$\left\|\boldsymbol{\varepsilon}_{k}(0)\right\|_{2}^{2}=\left\|\boldsymbol{p}_{k}\right\|_{2}^{2}=1,\tag{91}$$

where the last equality follows from the unit normalization of the spectral vector. Therefore, the normalized distortion induced by zero-bit eviction is $\mathcal{E}_{k}(0)=1$. ∎

Since the derivation is identical for all $e$ and $h$, the full indices are restored by the substitution $\boldsymbol{p}_{k}\mapsto\boldsymbol{p}_{e,h,k}$, which yields the distortion $\mathcal{E}_{e,h,k}(b)$. The bit-dependent spectral-vector distortion used by BitsMoE, which follows from Lemmas [B.2](https://arxiv.org/html/2606.00079#A2.Thmlemma2)–[B.5](https://arxiv.org/html/2606.00079#A2.Thmlemma5), is

$$\mathcal{E}_{e,h,k}(b)=\begin{cases}\eta_{e,h,k}\exp(-\lambda b),&b\in\{6,8,16\},\\[3pt]\kappa_{b},&b\in\{2,3,4\},\\[3pt]\sin^{2}\theta_{e,h,k},&b=1,\\[3pt]1,&b=0.\end{cases}\tag{92}$$

### B.4 Component-wise Loss and ILP Formulation

We now combine the additive reconstruction loss in Corollary [B.1](https://arxiv.org/html/2606.00079#A2.Thmcorollary1) with the piecewise distortion surrogate in Eq. ([92](https://arxiv.org/html/2606.00079#A2.E92)).

###### Theorem B.1 (Smoothed component-wise reconstruction loss).

Let $\alpha_{e,h,k}$ denote the spectral energy and $\beta_{e,h,k}$ denote the activation-output importance of component $(e,h,k)$. With smoothing exponent $\gamma\in[0,1]$, the cost of assigning bit-width $b$ to this component is

$$L_{e,h,k}(b)\approx\alpha_{e,h,k}^{2}\beta_{e,h,k}^{\gamma}\mathcal{E}_{e,h,k}(b),\tag{93}$$

where $\mathcal{E}_{e,h,k}(b)$ is defined in Eq. ([92](https://arxiv.org/html/2606.00079#A2.E92)).

The exponent $\gamma$ smooths the activation-importance coefficient in the optimization objective, which prevents a few extremely large $\beta_{e,h,k}$ values from dominating bit allocation. In contrast to calibration-based PTQ methods such as GPTQ, which use calibration activations for Hessian-based error compensation, BitsMoE uses calibration data only to estimate activation-aware component importance. The smoothing exponent $\gamma$ therefore provides a simple knob for balancing activation awareness and calibration robustness.

Theorem [B.1](https://arxiv.org/html/2606.00079#A2.Thmtheorem1) gives the component-wise cost used in the ILP formulation of Section [3](https://arxiv.org/html/2606.00079#S3). The objective is driven by three factors: the spectral energy of the component, its activation-output importance, and the bit-dependent distortion induced by quantizing its expert-specific direction.

For each projection weight $h\in\mathcal{H}$, define the binary assignment variable

$$y_{e,h,k,b}\in\{0,1\},\qquad y_{e,h,k,b}=1\Longleftrightarrow\text{component }(e,h,k)\text{ is assigned }b\text{ bits}.\tag{94}$$

For each projection type $h$, we denote by $\boldsymbol{Y}^{(h)}$ the collection of binary variables $y_{e,h,k,b}$, by $\boldsymbol{C}^{(h)}$ the corresponding objective coefficients $C_{e,h,k,b}\coloneqq L_{e,h,k}(b)$, and by $\boldsymbol{\Omega}^{(h)}$ the corresponding normalized bit costs $\Omega_{e,h,k,b}\coloneqq b$. Since $B_{h}$ denotes the normalized component budget for projection type $h$, the projection-wise ILP can be written compactly as

$$\begin{aligned}
\min_{\boldsymbol{Y}^{(h)}}\quad&\left\langle\boldsymbol{Y}^{(h)},\boldsymbol{C}^{(h)}\right\rangle\\
\mathrm{s.t.}\quad&\left\langle\boldsymbol{Y}^{(h)},\boldsymbol{\Omega}^{(h)}\right\rangle\leq B_{h},\\
&\sum_{b\in\mathcal{B}}y_{e,h,k,b}=1,\qquad\forall\,e\in[E],\ k\in[n_{h}],\\
&y_{e,h,k,b}\in\{0,1\},\qquad\forall\,e\in[E],\ k\in[n_{h}],\ b\in\mathcal{B}.
\end{aligned}\tag{95}$$

Here, $\langle\cdot,\cdot\rangle$ denotes the tensor inner product over $(e,k,b)$ for the projection weight $h$.

Solving Eq. ([95](https://arxiv.org/html/2606.00079#A2.E95)) independently for each projection type produces component-level mixed-precision assignments under the proposed piecewise reconstruction-error surrogate.
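Eq. (95) has the structure of a multiple-choice knapsack problem: each component picks exactly one bit-width, and the integer bit costs share one budget. For toy sizes this can be solved exactly by dynamic programming, which the Python sketch below uses in place of an ILP solver (the solver used by BitsMoE is not specified in this section). The component values `alphas`, `betas`, `ETA`, `SIN2`, and `GAMMA` are made-up illustrative numbers; only the $\kappa_{b}$ values are taken from Table 9.

```python
import math

BITS = [16, 8, 6, 4, 3, 2, 1, 0]                        # candidate set B
KAPPA = {4: 0.01184786, 3: 0.04067890, 2: 0.14949200}   # shared coefficients from Table 9
LAMBDA = 2.0 * math.log(2.0)

def distortion(b, eta, sin2):
    """Piecewise surrogate of Eq. (92) for a single component."""
    if b in (6, 8, 16):
        return eta * math.exp(-LAMBDA * b)
    if b in KAPPA:
        return KAPPA[b]
    return sin2 if b == 1 else 1.0

def cost_row(alpha, beta, gamma, eta, sin2):
    """Costs L(b) of Eq. (93) for every candidate bit-width, in BITS order."""
    weight = alpha ** 2 * beta ** gamma
    return [weight * distortion(b, eta, sin2) for b in BITS]

def allocate_bits(costs, budget):
    """Exact DP for Eq. (95); relies on 0 being in BITS so a choice always exists."""
    dp = [0.0] * (budget + 1)          # dp[j]: best cost so far using at most j bits
    choices = []
    for row in costs:
        new_dp = [float("inf")] * (budget + 1)
        pick = [-1] * (budget + 1)
        for j in range(budget + 1):
            for t, b in enumerate(BITS):
                if b <= j and dp[j - b] + row[t] < new_dp[j]:
                    new_dp[j] = dp[j - b] + row[t]
                    pick[j] = t
        dp = new_dp
        choices.append(pick)
    # Backtrack from the full budget to recover one optimal assignment.
    j, assignment = budget, [0] * len(costs)
    for i in range(len(costs) - 1, -1, -1):
        b = BITS[choices[i][j]]
        assignment[i] = b
        j -= b
    return assignment, dp[budget]

# Hypothetical toy instance: 4 components, average budget of 2 bits each.
alphas = [3.0, 2.0, 1.0, 0.5]
betas = [1.0, 4.0, 1.0, 0.25]
GAMMA, ETA, SIN2 = 0.5, 4.5, 0.36
costs = [cost_row(a, bt, GAMMA, ETA, SIN2) for a, bt in zip(alphas, betas)]
assignment, total = allocate_bits(costs, budget=8)
print(assignment, total)
```

The dynamic program runs in time proportional to the number of components times the budget times $|\mathcal{B}|$, so it is meant only to make the objective and constraints concrete, not to replace a dedicated solver at model scale.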

### B.5 Equivalent Bit Budget for an MoE Layer

We describe how the target equivalent bit-width $\mathfrak{b}_{\mathrm{eq}}$ is converted into the bit budget used by the ILP solver. Consider one MoE layer with $E$ routed experts to be quantized. Each routed expert contains three projection matrices $\{\boldsymbol{W}_{e,h}\}_{h\in\mathcal{H}}$, where $\mathcal{H}=\{\mathrm{gate\_proj},\mathrm{up\_proj},\mathrm{down\_proj}\}$ and $\boldsymbol{W}_{e,h}\in\mathbb{R}^{m\times n}$. If these routed expert weights are stored in FP16, the corresponding storage is

$$M_{\mathrm{fp16}}=16\cdot 3Emn\quad\mathrm{bits}.\tag{96}$$

Under the shared-basis formulation, each projection type is associated with one layer-wise shared basis, which is retained in FP16. The shared-basis storage of this MoE layer is therefore

$$M_{\mathrm{share}}=16\cdot 3n^{2}\quad\mathrm{bits}.\tag{97}$$

We apply the same target equivalent bit-width $\mathfrak{b}_{\mathrm{eq}}$ to every MoE layer. For a given layer, the total equivalent storage budget for the three routed-expert projections is $3Emn\,\mathfrak{b}_{\mathrm{eq}}$ bits. After reserving the FP16 shared bases, the remaining bit budget assigned to the expert-specific spectral vectors is

$$B=3Emn\,\mathfrak{b}_{\mathrm{eq}}-16\cdot 3n^{2}\quad\mathrm{bits}.\tag{98}$$

Within each layer, this budget is split uniformly across the three projection types:

$$B_{h}^{\mathrm{bit}}=\frac{B}{3}=Emn\,\mathfrak{b}_{\mathrm{eq}}-16n^{2},\qquad h\in\mathcal{H}.\tag{99}$$

For each projection type $h$, the shared-basis decomposition produces $n$ expert-specific spectral vectors $\{\boldsymbol{p}_{e,h,k}\}_{k=1}^{n}$ for each expert, with $\boldsymbol{p}_{e,h,k}\in\mathbb{R}^{m}$. Assigning bit-width $b$ to $\boldsymbol{p}_{e,h,k}$ consumes $mb$ bits. We therefore normalize the projection-level bit budget by $m$ and obtain

$$B_{h}\coloneqq\left\lfloor\frac{B_{h}^{\mathrm{bit}}}{m}\right\rfloor=\left\lfloor En\,\mathfrak{b}_{\mathrm{eq}}-\frac{16n^{2}}{m}\right\rfloor.\tag{100}$$

The ILP for projection type $h$ then enforces

$$\sum_{e=1}^{E}\sum_{k=1}^{n}\sum_{b\in\mathcal{B}}b\,y_{e,h,k,b}\leq B_{h},\tag{101}$$

where $y_{e,h,k,b}\in\{0,1\}$ indicates whether spectral vector $\boldsymbol{p}_{e,h,k}$ is assigned bit-width $b$, with

$$\sum_{b\in\mathcal{B}}y_{e,h,k,b}=1,\qquad\forall e,\ h,\ k.\tag{102}$$
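The budget conversion in Eqs. (96)–(100) is plain arithmetic, and the Python sketch below walks through it. The function name `normalized_budget` and the layer shape used in the example call are hypothetical and do not correspond to any specific model in the paper.

```python
import math

def normalized_budget(E, m, n, b_eq):
    """Convert a target equivalent bit-width into the per-projection ILP budget B_h."""
    fp16_bits = 16 * 3 * E * m * n            # Eq. (96): FP16 storage of routed experts
    share_bits = 16 * 3 * n * n               # Eq. (97): three FP16 shared bases
    B = 3 * E * m * n * b_eq - share_bits     # Eq. (98): bits left for spectral vectors
    B_h_bit = B / 3                           # Eq. (99): uniform split over projections
    B_h = math.floor(B_h_bit / m)             # Eq. (100): budget in units of m bits
    return {
        "fp16_bits": fp16_bits,
        "share_bits": share_bits,
        "B": B,
        "B_h_bit": B_h_bit,
        "B_h": B_h,
        # Average bit-width left per spectral vector after paying for the shared basis.
        "avg_bits_per_vector": B_h / (E * n),
    }

# Hypothetical layer: 8 experts with 16 x 8 projection matrices at a 2-bit target.
result = normalized_budget(E=8, m=16, n=8, b_eq=2)
print(result)
```

The `avg_bits_per_vector` entry makes one consequence of Eq. (100) explicit: because the shared basis is kept in FP16, the average bit-width available to the expert-specific vectors is below $\mathfrak{b}_{\mathrm{eq}}$, and the gap grows with the ratio $n/(Em)$.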
As observed in [[36](https://arxiv.org/html/2606.00079#bib.bib32)], MoE LLMs usually contain only a few *super experts*, which can be critical to preserving model performance despite their rarity. For instance, Mixtral-8$\times$7B has only one such expert. We therefore exclude super experts from quantization. Shared experts are also kept unquantized, and this setting is applied to all baselines for a fair comparison.

### B.6 Sensitivity to the Smoothing Exponent

Table 7: Sensitivity of average accuracy to the smoothing exponent $\gamma$. The fixed $\gamma$ column reports the value used in the main experiments for each backbone. Model abbreviations: QW1.5-14B = Qwen1.5-MoE-A2.7B, DSV2-16B = DeepSeek-V2-Lite, QW3-30B = Qwen3-30B-A3B-Base, MI-8x7B = Mixtral-8$\times$7B-v0.1, and QW3-80B-I = Qwen3-Next-80B-A3B-Instruct.

| Model | Fixed $\gamma$ | $\gamma=1.0$ | $\gamma=0.7$ | $\gamma=0.5$ | $\gamma=0.2$ | Mean | Std. |
|---|---|---|---|---|---|---|---|
| *2-bit Avg. Accuracy (%)* | | | | | | | |
| QW1.5-14B | 0.7 | 47.35 | 47.72 | 46.84 | 46.35 | 47.07 | 0.52 |
| DSV2-16B | 0.2 | 41.08 | 40.82 | 40.60 | 41.04 | 40.89 | 0.19 |
| QW3-30B | 0.7 | 61.79 | 61.91 | 59.21 | 58.25 | 60.29 | 1.60 |
| MI-8x7B | 0.5 | 47.97 | 48.51 | 48.75 | 48.51 | 48.43 | 0.29 |
| QW3-80B-I | 0.2 | 71.84 | 71.76 | 71.99 | 72.14 | 71.93 | 0.15 |
| *3-bit Avg. Accuracy (%)* | | | | | | | |
| QW1.5-14B | 0.7 | 52.31 | 53.12 | 52.31 | 51.65 | 52.35 | 0.52 |
| DSV2-16B | 0.2 | 46.68 | 46.82 | 46.90 | 48.38 | 47.20 | 0.69 |
| QW3-30B | 0.7 | 66.88 | 67.34 | 67.19 | 66.07 | 66.87 | 0.49 |
| MI-8x7B | 0.5 | 57.51 | 57.81 | 58.19 | 57.90 | 57.85 | 0.24 |
| QW3-80B-I | 0.2 | 74.04 | 73.88 | 73.99 | 74.19 | 74.03 | 0.11 |

The smoothing exponent $\gamma$ controls the strength of the activation-output importance term in the component-wise loss:

$$L_{e,h,k}(b)=\alpha_{e,h,k}^{2}\beta_{e,h,k}^{\gamma}\mathcal{E}_{e,h,k}(b),\qquad\gamma\in[0,1].\tag{103}$$

A larger $\gamma$ gives more weight to the calibration-dependent activation statistic $\beta_{e,h,k}$, making the allocation more activation-aware. A smaller $\gamma$ weakens this calibration dependence and makes the objective closer to a purely spectral-energy-based reconstruction surrogate. Therefore, $\gamma$ controls the trade-off between activation-aware sensitivity and calibration robustness, and an appropriate balance is important for stable bit allocation.

We evaluate a predefined grid $\gamma\in\{0.2,0.5,0.7,1.0\}$ and report the full sensitivity results in Table 7. The $\gamma$ used in the main experiments is fixed per backbone and shared by the 2-bit and 3-bit settings instead of being tuned for each task or bit budget. Thus, the table serves as a robustness check for the smoothing exponent rather than a task-specific hyperparameter search. For most model–bit-width pairs, the standard deviation across the grid is below $0.70$ accuracy points, which suggests that $\gamma$ acts primarily as a smoothing hyperparameter rather than a brittle tuning knob. The main exception is Qwen3-30B-A3B-Base under 2-bit quantization (standard deviation 1.60), for which the method is more sensitive to the strength of activation-aware importance under aggressive compression.

## Appendix C Ablation Study

All four ablation settings use the same effective 2-bit MoE-layer budget. For uniform-bit settings, spectral components are ranked by spectral energy, the top-$N$ spectral components are retained and quantized uniformly to 2 bits, and the rest are discarded as zero-bit eviction. The value of $N$ is chosen such that the total storage matches the equivalent 2-bit budget, with the overhead of the corresponding shared basis and spectral factors accounted for.

1. NS/UniBit: independent SVD without basis sharing. Each expert is decomposed separately. Only the top-$N$ spectral components are retained and uniformly quantized to 2 bits, while the remaining components are discarded.
2. QS/UniBit: shared-basis SVD with a quantized shared basis. The shared basis is uniformly quantized to 2 bits. Only the expert-specific components selected according to spectral energy are retained and uniformly quantized to 2 bits, while the remaining expert-specific components are discarded.
3. FS/UniBit: shared-basis SVD with an FP16 shared basis. The shared basis is kept in FP16. Only the expert-specific components selected according to spectral energy are retained and uniformly quantized to 2 bits, while the remaining expert-specific components are discarded.
4. FS/AdaBit: the full BitsMoE setting. The shared basis is kept in FP16, and adaptive bit-widths are assigned to expert-specific spectral components by the activation-aware ILP under the same equivalent 2-bit budget.

We provide the full task-level ablation results under the 2-bit setting in Table [8](https://arxiv.org/html/2606.00079#A3.T8). The results complement the summarized ablation in Section [4.3](https://arxiv.org/html/2606.00079#S4.SS3) and report the accuracy on each downstream benchmark. Across all evaluated MoE backbones, FS/AdaBit consistently achieves the best average accuracy, demonstrating that both FP16 shared-basis preservation and adaptive spectrum-wise bit allocation are important for robust ultra-low-bit MoE quantization.

Table 8: Full ablation results under the 2-bit setting. We compare four settings: NS/UniBit, QS/UniBit, FS/UniBit, and FS/AdaBit. Here, NS denotes non-shared decomposition, QS denotes quantized shared-basis decomposition, and FS denotes FP16 shared-basis decomposition. All entries are accuracy in % (higher is better).

| Setting | HellaS. | MathQA | MMLU | Openb. | WinoG. | GSM8K | HumanE. | Avg. |
|---|---|---|---|---|---|---|---|---|
| *DeepSeek-V2-Lite* | | | | | | | | |
| NS/UniBit | 54.88 | 25.86 | 32.50 | 30.20 | 62.12 | 1.90 | 0.61 | 29.72 |
| QS/UniBit | 26.36 | 19.33 | 26.89 | 27.20 | 48.78 | 0.00 | 0.00 | 21.22 |
| FS/UniBit | 57.13 | 26.77 | 29.99 | 31.60 | 65.27 | 2.58 | 0.61 | 30.56 |
| FS/AdaBit | 69.96 | 33.37 | 46.41 | 39.20 | 68.82 | 15.47 | 14.02 | 41.04 |
| *Qwen3-30B-A3B-Base* | | | | | | | | |
| NS/UniBit | 52.52 | 32.93 | 49.27 | 33.20 | 61.09 | 14.78 | 14.02 | 36.83 |
| QS/UniBit | 26.64 | 21.01 | 22.95 | 30.40 | 49.25 | 0.00 | 0.00 | 21.46 |
| FS/UniBit | 64.67 | 40.60 | 57.19 | 35.80 | 64.88 | 42.46 | 1.83 | 43.92 |
| FS/AdaBit | 74.09 | 52.70 | 70.87 | 43.40 | 72.93 | 75.51 | 43.90 | 61.91 |
| *Qwen3-Next-80B-A3B-Instruct* | | | | | | | | |
| NS/UniBit | 25.04 | 20.57 | 22.95 | 27.60 | 49.57 | 0.00 | 0.00 | 20.82 |
| QS/UniBit | 26.62 | 21.10 | 22.95 | 28.40 | 50.12 | 0.00 | 0.00 | 21.31 |
| FS/UniBit | 72.43 | 53.47 | 77.11 | 41.80 | 73.16 | 68.08 | 87.80 | 67.69 |
| FS/AdaBit | 78.02 | 60.67 | 81.47 | 44.80 | 75.85 | 71.49 | 92.68 | 72.14 |

## Appendix D ILP Coefficient Calibration and Stability Analysis

#### Consistency of piecewise ILP coefficients.

Figure [4](https://arxiv.org/html/2606.00079#A4.F4) shows representative ILP loss coefficients for selected layers and experts in DeepSeek-V2-Lite. Across all cases, the coefficients remain consistently ordered by bit-width, indicating that the piecewise surrogate preserves a stable penalty hierarchy across precision regimes without introducing scale mismatch into the ILP objective.

#### Dispersion of $\kappa_{b}$ estimates.

For each layer $\ell$ and bit-width $b$, we treat the normalized spectral-vector quantization distortions across all projection types, experts, and spectral components as samples from an empirical component distribution. The bit-dependent coefficient is estimated as

$$\kappa_{b}=\frac{1}{E\sum_{h\in\mathcal{H}}n_{h}}\sum_{h\in\mathcal{H}}\sum_{e=1}^{E}\sum_{k=1}^{n_{h}}\left\|\boldsymbol{p}_{e,h,k}-Q_{b}(\boldsymbol{p}_{e,h,k})\right\|_{2}^{2},\tag{104}$$

where $E$ is the number of routed experts, $\mathcal{H}$ is the set of projection types, and $n_{h}$ is the number of spectral components for projection type $h$. We measure the relative layer-wise dispersion of these component-wise distortions using the coefficient of variation (CV):

$$\mathrm{CV}_{\ell,b}=\frac{s_{\ell,b}}{\left|\kappa_{b}\right|}\times 100\%,\tag{105}$$

where $s_{\ell,b}$ denotes the sample standard deviation of the component-wise distortions over all projection types, experts, and spectral components in layer $\ell$. Figure [5](https://arxiv.org/html/2606.00079#A4.F5) reports the layer-wise CV under 2/3/4-bit quantization on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B. The CV values are almost all below $15\%$, suggesting that the empirical low-bit distortion scale remains stable at the layer level.

#### rMCSE of $\kappa_{b}$ estimation.

The uncertainty of the averaged coefficient is further measured by the relative Monte Carlo standard error (rMCSE) [[24](https://arxiv.org/html/2606.00079#bib.bib50)]:

$$\mathrm{rMCSE}_{\ell,h,b}=\frac{s_{\ell,h,b}}{\sqrt{En_{h}}\,\left|\kappa_{b}\right|}\times 100\%.\tag{106}$$

Figure [6](https://arxiv.org/html/2606.00079#A4.F6) reports the layer-wise rMCSE of $\kappa_{b}$ under 2/3/4-bit quantization. For all evaluated models, the rMCSE is below $0.15\%$, indicating negligible relative uncertainty in the averaged bit-dependent coefficients. These results support using a shared empirical coefficient $\kappa_{b}$ as a stable low-bit distortion scale in the ILP objective, while component-specific magnitude and activation-aware importance are captured by $\alpha_{e,h,k}^{2}$ and $\beta_{e,h,k}^{\gamma}$.

Table [9](https://arxiv.org/html/2606.00079#A4.T9) further shows that the estimated coefficients preserve the expected ordering $\kappa_{2}>\kappa_{3}>\kappa_{4}$, assigning larger ILP penalties to lower bit-widths. This shared-coefficient design also reduces construction cost, since empirical distortions need not be explicitly computed for every candidate component in the large allocation space; instead, $\kappa_{b}$ is combined with the component-specific factors $\alpha_{e,h,k}^{2}$ and $\beta_{e,h,k}^{\gamma}$.
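The two stability statistics of Eqs. (105) and (106) can be computed from a flat list of component-wise distortions, as in the Python sketch below. It uses the standard-library `statistics.stdev` as the sample standard deviation; the function name `dispersion_stats` and the sample values are illustrative assumptions and are not measurements from the paper.

```python
import math
import statistics

def dispersion_stats(distortions):
    """Return (kappa, CV in %, rMCSE in %) for a list of component-wise distortions."""
    kappa = statistics.mean(distortions)          # averaged coefficient, as in Eq. (104)
    s = statistics.stdev(distortions)             # sample standard deviation
    cv = s / abs(kappa) * 100.0                   # Eq. (105)
    # Eq. (106): the standard error shrinks with the square root of the sample count.
    rmcse = s / (math.sqrt(len(distortions)) * abs(kappa)) * 100.0
    return kappa, cv, rmcse

# Hypothetical 2-bit distortions for a handful of spectral vectors.
samples = [0.14, 0.15, 0.16]
kappa, cv, rmcse = dispersion_stats(samples)
print(f"kappa={kappa:.4f}, CV={cv:.2f}%, rMCSE={rmcse:.2f}%")
```

Since rMCSE equals CV divided by the square root of the number of samples, a CV of several percent is compatible with a much smaller rMCSE once thousands of components are averaged.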

![Refer to caption](https://arxiv.org/html/2606.00079v1/x6.png) (a) Layer 1, Expert 0.
![Refer to caption](https://arxiv.org/html/2606.00079v1/x7.png) (b) Layer 9, Expert 21.
![Refer to caption](https://arxiv.org/html/2606.00079v1/x8.png) (c) Layer 17, Expert 42.
![Refer to caption](https://arxiv.org/html/2606.00079v1/x9.png) (d) Layer 23, Expert 63.

Figure 4: Representative ILP loss coefficients across bit-widths for different layers and experts in DeepSeek-V2-Lite.

![Refer to caption](https://arxiv.org/html/2606.00079v1/x10.png) (a) Qwen1.5-MoE-A2.7B
![Refer to caption](https://arxiv.org/html/2606.00079v1/x11.png) (b) DeepSeek-V2-Lite
![Refer to caption](https://arxiv.org/html/2606.00079v1/x12.png) (c) Mixtral-8$\times$7B

Figure 5: Layer-wise CV of the empirical $\kappa_{b}$ estimates on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B.

Table 9: Bit-dependent quantization-error coefficients $\kappa_{b}$ used in the ILP objective.

| Bit-width $b$ | 4-bit | 3-bit | 2-bit |
|---|---|---|---|
| $\kappa_{b}$ | 0.01184786 | 0.04067890 | 0.14949200 |

![Refer to caption](https://arxiv.org/html/2606.00079v1/x13.png) (a) DeepSeek-V2-Lite
![Refer to caption](https://arxiv.org/html/2606.00079v1/x14.png) (b) Qwen1.5-MoE-A2.7B
![Refer to caption](https://arxiv.org/html/2606.00079v1/x15.png) (c) Mixtral-8$\times$7B

Figure 6: Layer-wise rMCSE of the empirical $\kappa_{b}$ estimates on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B.
