MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs
Summary
This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.
Cached at: 06/17/26, 05:36 AM
# Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs
Source: [https://arxiv.org/html/2606.17118](https://arxiv.org/html/2606.17118)
Yuanteng Chen (1,2,3), Peisong Wang (1,2)†, Zhilei Liu (1,2), Nanxin Zeng (2), Yuantian Shao (1), Shiqiang Lang (3), Tao Liu (3), Chuangyi Li (1,2), Qinghao Hu (1,2), Gang Li (1,2), Jing Liu (1,2,3), Jian Cheng (1,2,3)†
(1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Zhongguancun Academy
†Corresponding author. {peisong.wang, jcheng}@nlpr.ia.ac.cn
###### Abstract
Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among post-training quantization (PTQ) methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet it suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation: (1) at the *cross-modal* level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the *intra-vision* level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains over prior methods at the extreme 2-bit setting. The code will be available on [GitHub](https://github.com/MingZwhy/MODE).
## 1 Introduction
Figure 1: Performance comparison on Qwen3-VL-30B-A3B-Instruct under 3-bit weight quantization (W3A16).
Multimodal Large Language Models (MLLMs) have achieved remarkable success across a wide range of vision-language tasks (Radford et al., [2021](https://arxiv.org/html/2606.17118#bib.bib26); Team et al., [2026](https://arxiv.org/html/2606.17118#bib.bib27)). By aligning visual perception with linguistic reasoning, they empower intelligent systems to perceive, understand, and interact with the visual world in a unified framework.
With the growing scale of MLLMs, the Mixture-of-Experts (MoE) architecture (Artetxe et al., [2022](https://arxiv.org/html/2606.17118#bib.bib28)) has become a popular solution to manage computational cost, activating only a sparse subset of experts per token to enable efficient parameter scaling while keeping training and inference FLOPs low (Fedus et al., [2022](https://arxiv.org/html/2606.17118#bib.bib29)). However, since all expert parameters must reside on GPUs before inference regardless of activation sparsity, deploying large-scale MoE-MLLMs (Lin et al., [2024](https://arxiv.org/html/2606.17118#bib.bib30)) like Qwen3-VL-30B-A3B-Instruct (Team, [2025](https://arxiv.org/html/2606.17118#bib.bib14)) still demands prohibitive memory, which severely limits their practical use in resource-constrained settings.
Post-Training Quantization (PTQ) offers a practical way to reduce memory footprint without retraining, yet existing PTQ methods target either dense MLLMs or MoE-LLMs, and neither transfers well to MoE-MLLMs. Specifically, PTQ methods designed for dense MLLMs (Li et al., [2024b](https://arxiv.org/html/2606.17118#bib.bib19); Guo et al., [2025](https://arxiv.org/html/2606.17118#bib.bib18)) recognize the modality gap between text and vision tokens, but treat the model as a monolithic whole and overlook the sparse activation and unequal expert contribution intrinsic to MoE architectures. Conversely, PTQ methods tailored for MoE (Zheng et al., [2026](https://arxiv.org/html/2606.17118#bib.bib22); Hu et al., [2025](https://arxiv.org/html/2606.17118#bib.bib20)) exploit these structural properties, but ignore the modality differences inherent to MLLMs.
Among PTQ methods for MoE, expert-level mixed-precision quantization (Huang et al., [2025](https://arxiv.org/html/2606.17118#bib.bib21)) has emerged as a prevailing paradigm, which measures expert importance through activation frequency and assigns higher bit-widths to more critical experts while aggressively compressing less important ones. However, when applied to MoE-MLLMs, these methods aggregate activation statistics across all tokens regardless of modality, rendering expert importance estimates unreliable and leading to notable performance degradation.
In our study, we identify two distinct levels of bias that arise when expert-level mixed-precision quantization is applied to MoE-MLLMs. At the cross-modal level, vision tokens vastly outnumber text tokens in typical multimodal inputs, so global frequency statistics are dominated by vision-side routing patterns and systematically undervalue experts critical for textual reasoning. At the intra-vision level, pervasive redundancy among vision tokens further distorts frequency-based importance, as many tokens carry near-duplicate information yet each still casts a router vote, inflating the apparent importance of certain experts while obscuring the contributions of others.
Motivated by these observations, we propose MODE, a modality-aware quantization framework tailored to MoE-MLLMs that jointly leverages the heterogeneous expert structure of MoE for differentiated precision allocation and accounts for the modality heterogeneity and intra-vision redundancy inherent in multimodal inputs. To resolve the cross-modal bias, MODE separately collects and normalizes expert selection frequencies for text and vision tokens, so that text-critical experts are no longer overshadowed. To address the intra-vision distortion, MODE identifies key vision tokens at each layer and computes vision-side frequencies solely from them, filtering out noise from redundant visual content. Beyond frequency alone, MODE further evaluates each expert's quantization sensitivity separately under text and key vision tokens, yielding a modality-decomposed importance metric. The resulting scores are integrated into an Integer Linear Programming (ILP) formulation that produces an optimal per-expert precision allocation under a given bit budget.
We evaluate MODE on three MoE-MLLM families across ten multimodal benchmarks under various low-bit weight-only quantization settings. As illustrated in Figure [1](https://arxiv.org/html/2606.17118#S1.F1), MODE consistently surpasses all compared PTQ methods, whether designed for dense MLLMs, MoE-LLMs, or MoE-MLLMs. In particular, at the W3A16 setting MODE incurs only 2.84% and 2.08% average accuracy degradation on Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct, respectively, making practical deployment of 30B-scale MoE-MLLMs on a single consumer-grade GPU feasible.
Figure 2: Cross-modal expert frequency bias in Qwen3-VL-30B-A3B-Instruct. (a) Per-expert selection frequency at Layers 12 and 33, decomposed into vision-token (blue) and text-token (red) contributions. (b) Quantization sensitivity (KL divergence between first output-token logits before and after 2-bit RTN quantization, log-scaled) versus overall selection frequency for all Layer-12 experts, colored by the text-preference score $r(e)$.
## 2 Related Work
MLLM Quantization. PTQ methods designed for MLLMs primarily focus on mitigating the distribution heterogeneity between vision and text modalities. MBQ (Li et al., [2024b](https://arxiv.org/html/2606.17118#bib.bib19)) recognizes that vision and language tokens exhibit different quantization sensitivities and introduces gradient-based sensitivity measures into calibration to balance reconstruction quality across modalities. SPEED-Q (Guo et al., [2025](https://arxiv.org/html/2606.17118#bib.bib18)) adopts a staged quantization strategy that first quantizes the vision encoder and re-calibrates the projection layer for modality alignment, then applies quantization-aware training to the language module.
MoE Quantization. The sparse activation nature of MoE architectures brings new challenges and opportunities to PTQ. MoEQuant (Hu et al., [2025](https://arxiv.org/html/2606.17118#bib.bib20)) addresses inter- and intra-expert activation imbalance through an expert-balanced calibration strategy. A more prevalent line of work exploits the heterogeneous expert contributions through mixed-precision quantization. MC-MoE (Huang et al., [2025](https://arxiv.org/html/2606.17118#bib.bib21)) leverages expert activation frequency as an importance indicator, assigning higher bit-widths to more critical experts and significantly outperforming uniform-precision baselines. MoQa (Zheng et al., [2026](https://arxiv.org/html/2606.17118#bib.bib22)) builds upon this paradigm by introducing a channel-level dynamic adjustment mechanism. More recently, VEQ (Qin et al., [2026](https://arxiv.org/html/2606.17118#bib.bib24)) takes a first step toward MoE-MLLMs by incorporating a modality-affinity-aware Hessian objective based on token–expert affinity to improve accuracy.
## 3 Motivation
Expert selection frequency, defined as the fraction of tokens routed to a given expert on a calibration set, is a natural and widely adopted importance proxy for mixed-precision quantization, as more frequently activated experts are considered more important in MoE. However, when transferred to MoE-MLLMs, we identify two distinct levels of bias that cause overall selection frequency to significantly deviate from true expert importance. We elaborate on each below, using Qwen3-VL-30B-A3B-Instruct as the representative model and 512 randomly sampled image-text pairs from ShareGPT4V (Chen et al., [2023](https://arxiv.org/html/2606.17118#bib.bib11)) as the calibration set.
### 3.1 Cross-Modal Expert Frequency Bias
In typical multimodal inputs, a single image is encoded into hundreds or even thousands of visual tokens while the accompanying text prompt remains relatively short, so vision tokens outnumber text tokens by a large margin. This imbalance is an intrinsic property of multimodal inference, and consequently the overall expert selection frequency is dominated by vision-side routing patterns, causing experts that are heavily selected by text tokens but receive little vision traffic to be systematically undervalued under a global frequency ranking. To illustrate this, Figure [2](https://arxiv.org/html/2606.17118#S1.F2)(a) decomposes the per-expert selection frequency at two representative MoE layers into vision-token (blue) and text-token (red) contributions. Clear variation across experts confirms that experts do differ substantially in importance, supporting the general premise of frequency-based estimation; however, the text-token contribution is nearly invisible at this scale, and experts like E56 and E85 in Layer 12, despite being among the most heavily selected by text tokens, fall well below the layer average in total frequency and thus become inconspicuous.
Beyond frequency statistics, we further reveal this bias from the perspective of actual quantization loss. Figure [2](https://arxiv.org/html/2606.17118#S1.F2)(b) plots each Layer-12 expert's quantization sensitivity, measured as the KL divergence between the first output-token logits before and after 2-bit RTN quantization (log-scaled, averaged over all calibration samples), against its log overall frequency, with redder color indicating experts more frequently selected by text tokens but less by vision tokens. At the global level, a broadly positive trend between sensitivity and frequency is visible, supporting frequency as a reasonable importance proxy in general. At the individual expert level, however, several strongly text-preferred experts emerge as pronounced outliers: E56 and E99 exhibit disproportionately high quantization sensitivity despite having only low-to-moderate overall frequencies, indicating that they should be protected as text-critical experts rather than being dismissed by their modest overall activation.
These observations motivate a straightforward design principle: text-token and vision-token frequencies should be normalized independently before being combined into a unified importance score, ensuring that experts critical to each modality receive appropriate protection.
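The masking effect described above can be reproduced on a toy routing trace. The sketch below is illustrative only: the four-expert layer, the token counts, and the helper name `normalized_frequency` are hypothetical, and the equal 1/2 weights follow the Text+Vision strategy used later in the concept verification rather than MODE's final metric.

```python
from collections import Counter

def normalized_frequency(routed_experts, num_experts):
    """Fraction of routing decisions that went to each expert (sums to 1)."""
    counts = Counter(routed_experts)
    total = len(routed_experts)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# Hypothetical routing trace for one layer with 4 experts:
# 900 vision-token selections versus 100 text-token selections.
num_experts = 4
vision_routes = [0] * 500 + [1] * 390 + [2] * 10
text_routes = [0] * 10 + [1] * 10 + [2] * 80

# Global frequency: every token counts equally, so vision dominates.
f_total = normalized_frequency(vision_routes + text_routes, num_experts)

# Modality-decomposed frequencies: each modality is normalized on its own.
f_text = normalized_frequency(text_routes, num_experts)
f_vision = normalized_frequency(vision_routes, num_experts)

# Equal-weight combination of the independently normalized frequencies.
balanced = [0.5 * t + 0.5 * v for t, v in zip(f_text, f_vision)]

for e in range(num_experts):
    print(f"expert {e}: global={f_total[e]:.3f} balanced={balanced[e]:.3f}")
```

In this toy trace, expert 2 receives 80% of the text traffic yet only 9% of the global traffic, so it ranks third under the global score and first under the balanced one.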
Figure 3: Intra-vision expert frequency bias. (a) $\bar{f}_{\mathrm{key}}$ is highly consistent across five calibration datasets. (b) Within a single dataset, $\bar{f}_{\mathrm{key}}$ and $\bar{f}_{\mathrm{red}}$ exhibit pronounced deviations across (layer, expert) positions, revealing a systematic key–redundant routing bias.
### 3.2 Intra-Vision Expert Frequency Bias
Recent work on MLLM token compression (Yang et al., [2026](https://arxiv.org/html/2606.17118#bib.bib32)) consistently shows that only a small subset of vision tokens carries the core visual semantics, while the vast majority are redundant and can be pruned with negligible performance loss. In the context of MoE-MLLMs, however, a more pertinent question is whether these key vision tokens correspond to a distinct and stable group of experts, namely those that are truly important for the visual modality. Our analysis gives an affirmative answer and further uncovers a second layer of frequency bias within the vision modality itself: key and redundant vision tokens systematically activate different experts, while the preference of key tokens remains stable across diverse data, as detailed below.
To distinguish key vision tokens from redundant ones, we follow the widely adopted attention-based criterion of SparseVLM (Zhang et al., [2025b](https://arxiv.org/html/2606.17118#bib.bib31)): at each layer, we rank vision tokens by the total attention they receive from all text tokens, which reliably indicates their relevance to the current query, and treat the top 20% as key tokens and the remainder as redundant. We apply this selection at every layer, so that the key-token set adapts to how the model attends to visual content at different depths. A detailed description is provided in Appendix [C](https://arxiv.org/html/2606.17118#A3).
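The selection rule can be sketched for a single layer as follows. This is a minimal illustration, not the procedure of Appendix C: the function name `select_key_tokens`, the text-by-vision attention layout (assumed already averaged over heads), and rounding the key count up with `ceil` are all assumptions.

```python
import math

def select_key_tokens(attn_text_to_vision, key_ratio=0.2):
    """Split vision-token indices into (key, redundant) for one layer.

    attn_text_to_vision[i][j] is the attention weight that text token i
    places on vision token j (hypothetical layout, head-averaged).
    """
    num_vision = len(attn_text_to_vision[0])
    # Total attention each vision token receives from all text tokens.
    received = [sum(row[j] for row in attn_text_to_vision)
                for j in range(num_vision)]
    # Assumption: keep at least one token and round the key count up.
    num_key = max(1, math.ceil(key_ratio * num_vision))
    ranked = sorted(range(num_vision), key=lambda j: received[j], reverse=True)
    return sorted(ranked[:num_key]), sorted(ranked[num_key:])

# Toy example: 2 text tokens attending to 5 vision tokens.
attn = [
    [0.05, 0.40, 0.05, 0.05, 0.45],
    [0.10, 0.50, 0.05, 0.05, 0.30],
]
key, redundant = select_key_tokens(attn)
print("key:", key, "redundant:", redundant)
```

Because the ranking is recomputed from each layer's own attention map, the same vision token may be a key token at one depth and redundant at another, which matches the per-layer selection described above.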
With key and redundant vision tokens identified at every layer, we collect their layer-wise normalized expert-selection frequencies, $\bar{f}_{\mathrm{key}}$ and $\bar{f}_{\mathrm{red}}$, on calibration datasets and analyze them from the following two complementary angles.
We first examine the cross-dataset behavior of $\bar{f}_{\mathrm{key}}$ via its pairwise cosine similarity across the five datasets (details in Appendix [D](https://arxiv.org/html/2606.17118#A4)). As shown in Figure [3](https://arxiv.org/html/2606.17118#S3.F3)(a), the scores are consistently high (all above 0.74), indicating that key vision tokens converge to a stable set of preferred experts regardless of data composition. This stability links key vision tokens to a group of vision-critical experts intrinsic to the model, rather than to dataset-specific artifacts.
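As a rough sketch of this stability check, pairwise cosine similarity between per-dataset frequency vectors can be computed as below. The dataset names and frequency values are placeholders, and treating each dataset's $\bar{f}_{\mathrm{key}}$ as one flat vector is an assumption; Appendix D may aggregate differently.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder key-token frequency vectors, one per calibration dataset.
f_key = {
    "dataset_a": [0.50, 0.30, 0.15, 0.05],
    "dataset_b": [0.45, 0.35, 0.10, 0.10],
    "dataset_c": [0.55, 0.25, 0.15, 0.05],
}

# Similarity for every unordered pair of datasets.
pairwise = {
    (a, b): cosine_similarity(f_key[a], f_key[b])
    for a, b in combinations(f_key, 2)
}
for pair, score in pairwise.items():
    print(pair, round(score, 3))
```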
We then zoom into a single calibration set (ShareGPT4V) to ask whether redundant tokens share the same preferences. Figure [3](https://arxiv.org/html/2606.17118#S3.F3)(b) visualizes $\bar{f}_{\mathrm{key}}-\bar{f}_{\mathrm{red}}$ over a representative subset of (layer, expert) positions (the full $48\times128$ grid is omitted for space), where a clear divergence emerges: certain experts are substantially more activated by key tokens, while others are disproportionately used by redundant ones. Since redundant tokens dominate the vision-token population, treating all vision tokens indiscriminately biases the estimated expert importance toward redundant-preferred experts, leaving those most relevant to meaningful visual content under-protected.
Combining the two findings, expert importance for the vision modality should be assessed through the lens of key vision tokens rather than all vision tokens.
Figure 4: Concept verification. Experts are progressively promoted from 2-bit to 3-bit under four ranking strategies (a)–(d); results on ChartQA and MMBench validate both dimensions of frequency bias.
Figure 5: Pipeline of our quantization method for MoE-MLLMs. Modality-wise frequency and sensitivity are profiled per expert, then combined in an ILP that allocates expert-level bit-widths under a target budget.
### 3.3 Concept Verification
To jointly validate the two dimensions of bias, we conduct a controlled mixed-precision experiment on Qwen3-VL-30B-A3B-Instruct. We fix all attention layers at 4-bit precision and quantize every MoE expert to 2 bits via RTN as the starting point. We then incrementally promote experts to 3 bits in steps of 8 per layer (out of 128), where the order of promotion is determined by a *simplified expert importance score* $s$ built from per-layer expert selection frequencies $\bar{f}_{x}$ (normalized over experts) on different token groups $x$. We compare four strategies for computing $s$, differing only in which token-group frequencies are used:
(a) Global: $s=\bar{f}_{\mathrm{total}}$, the selection frequency over all calibration tokens.
(b) Text+Vision: $s=\tfrac{1}{2}\bar{f}_{\mathrm{text}}+\tfrac{1}{2}\bar{f}_{\mathrm{vision}}$, balancing text and all-vision tokens.
(c) Text+Key: $s=\tfrac{1}{2}\bar{f}_{\mathrm{text}}+\tfrac{1}{2}\bar{f}_{\mathrm{key}}$, replacing all vision tokens with only key vision tokens.
(d) Text+Redundant: $s=\tfrac{1}{2}\bar{f}_{\mathrm{text}}+\tfrac{1}{2}\bar{f}_{\mathrm{red}}$, replacing all vision tokens with only redundant vision tokens.
Under each strategy, experts with the highest scores are promoted first. Comparing (a) vs. (b) isolates the effect of modality-balanced ranking (cross-modal bias), while comparing (c) and (d) against (b) isolates the effect of which vision tokens drive the ranking (intra-vision bias).
Figure [4](https://arxiv.org/html/2606.17118#S3.F4) reports the accuracy curves on ChartQA and MMBench. On both benchmarks, (b) substantially outperforms (a) at every protection level, confirming the importance of separately accounting for text and vision expert preferences. (c) further improves upon (b), while (d) falls slightly below (b), showing that key vision tokens provide a more faithful signal of vision-side expert importance than redundant ones. Together, these comparisons validate both dimensions of frequency bias.
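The promotion schedule used in this experiment can be sketched for one layer as follows. The six-expert layer, the frequency values, and the helper names are hypothetical, and the step size is shrunk from 8 to 2 so the toy example stays readable; only the scoring rules (c) and (d) and the promote-highest-first logic come from the text.

```python
def combine(f_a, f_b):
    """Equal-weight score s = 1/2 * f_a + 1/2 * f_b, as in strategies (b)-(d)."""
    return [0.5 * a + 0.5 * b for a, b in zip(f_a, f_b)]

def promotion_order(scores):
    """Expert indices sorted from highest to lowest importance score."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)

def bit_widths_at_step(scores, step, experts_per_step=8, low=2, high=3):
    """Bit-width of every expert in one layer after `step` promotion rounds."""
    promoted = set(promotion_order(scores)[: step * experts_per_step])
    return [high if e in promoted else low for e in range(len(scores))]

# Hypothetical per-layer normalized frequencies for a 6-expert layer.
f_text = [0.05, 0.05, 0.60, 0.10, 0.10, 0.10]
f_key = [0.40, 0.10, 0.10, 0.30, 0.05, 0.05]
f_red = [0.10, 0.50, 0.05, 0.05, 0.25, 0.05]

score_c = combine(f_text, f_key)  # strategy (c): Text+Key
score_d = combine(f_text, f_red)  # strategy (d): Text+Redundant

bits_c = bit_widths_at_step(score_c, step=1, experts_per_step=2)
bits_d = bit_widths_at_step(score_d, step=1, experts_per_step=2)
print("Text+Key      :", bits_c)
print("Text+Redundant:", bits_d)
```

Both strategies protect the text-heavy expert 2 first, but they disagree on the second slot: (c) promotes the expert favored by key tokens, whereas (d) promotes the one favored by redundant tokens. With 128 experts per layer and 8 promotions per step, each step raises the average expert bit-width by 8/128 = 0.0625 bits.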
## 4 Method
The concept verification in §[3.3](https://arxiv.org/html/2606.17118#S3.SS3) shows that modality-aware frequency ranking significantly outperforms the global frequency baseline in expert importance estimation. However, frequency reflects *how often* an expert is selected, which does not fully align with *how much* its quantization degrades model performance. As evidenced by the outliers in Figure [2](https://arxiv.org/html/2606.17118#S1.F2)(b), experts with similar frequencies can exhibit vastly different quantization sensitivities; moreover, even for equally important experts, different bit-widths yield different precision–efficiency trade-offs. To address these, we build upon modality-wise frequency and further introduce modality-wise quantization sensitivity to form a more comprehensive importance metric, then formulate the bit-width allocation as an integer linear programming (ILP) problem that jointly optimizes all expert assignments under a given bit budget. The overall pipeline is illustrated in Figure [5](https://arxiv.org/html/2606.17118#S3.F5).
#### Step 1: Modality-wise Expert Selection Frequency.
Following the analysis in §[3](https://arxiv.org/html/2606.17118#S3), we collect expert selection frequencies separately for text tokens and key vision tokens (the top 20% of vision tokens in each layer, ranked by the cumulative attention they receive from text tokens) on the calibration set. For each MoE layer, we count each expert's selections over all text tokens and all key vision tokens, respectively, and normalize the counts layer-wise so that each modality's distribution sums to one within every layer. Collecting these normalized frequencies across all layers gives $\bar{f}_{t}(e)$ and $\bar{f}_{v}(e)$ for every expert $e$ in the model.
#### Step 2: Modality-wise Quantization Sensitivity.
While frequency reflects which experts are important, quantization sensitivity measures how much each expert's quantization hurts model outputs. For each expert $e$ and candidate bit-width $b$, we quantize *only* expert $e$ to $b$ bits while keeping all other parameters at full precision, and run a forward pass on the calibration set. To disentangle per-modality effects, the quantized weights are applied exclusively to one modality at a time: for text-side sensitivity $\delta_{T}(e,b)$, only text tokens routed to expert $e$ use the quantized weights; for vision-side sensitivity $\delta_{V}(e,b)$, only key vision tokens do. Sensitivity is quantified as the KL divergence between the output logits of the original and partially quantized models, averaged over the calibration set:
$$\delta_{M}(e,b)=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}D_{\mathrm{KL}}\!\Big(p(x)\;\Big\|\;p^{(e\to b)}_{M}(x)\Big),\tag{1}$$
where $M\in\{T,V\}$ denotes the modality, $p(x)$ the full-precision output logits on sample $x$, and $p^{(e\to b)}_{M}(x)$ the logits when expert $e$ is quantized to $b$ bits with the quantized weights applied only to modality-$M$ tokens. This per-expert, per-modality isolation ensures each sensitivity score faithfully reflects the impact of quantizing expert $e$ on the corresponding modality.
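Eq. (1) can be sketched numerically as below, given logits that have already been collected from the full-precision and partially quantized models. The logit values are made up, and converting logits to distributions with a softmax before taking the KL divergence is an assumption about how "KL divergence between the output logits" is meant; running the actual forward passes is outside the scope of this sketch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """D_KL(p || q) with p, q the softmax distributions of the two logit vectors."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sensitivity(ref_per_sample, quant_per_sample):
    """Average KL over calibration samples, following the form of Eq. (1)."""
    pairs = list(zip(ref_per_sample, quant_per_sample))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Made-up logits for two calibration samples over a 3-token vocabulary.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
quant_mild = [[1.9, 0.6, -1.0], [0.1, 0.25, 0.3]]   # small perturbation
quant_harsh = [[0.5, 2.0, -1.0], [0.3, 0.2, 0.1]]   # top token flipped

print("mild :", sensitivity(ref, quant_mild))
print("harsh:", sensitivity(ref, quant_harsh))
```

In the full procedure this computation would be repeated for every (expert, bit-width, modality) triple, with `quant_per_sample` produced by a model in which only that expert's weights are quantized for tokens of that modality.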
#### Step 3: Bit-width Allocation via ILP.
Equipped with modality-wise frequencies and sensitivities, we formulate expert-level bit-width allocation as an ILP problem, a formulation also adopted by prior MoE quantization work but with modality-agnostic importance terms (Huang et al., [2025](https://arxiv.org/html/2606.17118#bib.bib21); Duanmu et al., [2025](https://arxiv.org/html/2606.17118#bib.bib23)). The resulting ILP can be solved within seconds for the entire model. We introduce a binary indicator $x_{e,b}\in\{0,1\}$ for each expert $e$ and candidate bit-width $b\in\mathcal{B}=\{1,2,3,4\}$, where $x_{e,b}=1$ iff expert $e$ is quantized to $b$ bits. The ILP is formulated as:
$$\min\sum_{e}\sum_{b\in\mathcal{B}}\Big[\bar{f}_{t}(e)\cdot\delta_{T}(e,b)+\bar{f}_{v}(e)\cdot\delta_{V}(e,b)\Big]\,x_{e,b},\tag{2}$$
$$\text{s.t.}\;\;\sum_{e}\sum_{b\in\mathcal{B}}b\cdot x_{e,b}=n\cdot k,\quad\sum_{b\in\mathcal{B}}x_{e,b}=1,\;\;\forall e,\tag{3}$$
where $n$ is the number of experts and $k$ is the target average bit-width. Each term $\bar{f}_{M}(e)\cdot\delta_{M}(e,b)$ (with $\bar{f}_{T}\equiv\bar{f}_{t}$ and $\bar{f}_{V}\equiv\bar{f}_{v}$) couples *how critical* expert $e$ is to modality $M$ with *how much* it degrades that modality at bit-width $b$, and summing over $M\in\{T,V\}$ yields a modality-balanced cost. Minimizing this cost under the budget constraint naturally drives the solver to assign higher precision to experts that are both frequently routed and highly sensitive, while compressing the rest aggressively.
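For intuition, the problem in Eqs. (2)–(3) has the structure of a multiple-choice knapsack with an equality budget, so small instances can be solved exactly without an ILP solver. The sketch below swaps in a dynamic program over the total bit count for that purpose; it is not the solver used in the paper, the three-expert frequencies and sensitivities are invented, and it assumes $n\cdot k$ is an integer.

```python
def allocate_bits(cost, bit_choices, total_bits):
    """Minimize sum_e cost[e][i_e] subject to sum_e bit_choices[i_e] == total_bits.

    cost[e][i] is the cost of giving expert e the bit-width bit_choices[i].
    Returns (optimal cost, bit-width per expert), or None if infeasible.
    """
    inf = float("inf")
    best = [inf] * (total_bits + 1)  # best[t]: min cost using exactly t bits
    best[0] = 0.0
    back = []                        # back[e][t]: choice index taken by expert e
    for expert_cost in cost:
        nxt = [inf] * (total_bits + 1)
        choice = [None] * (total_bits + 1)
        for used, base in enumerate(best):
            if base == inf:
                continue
            for i, b in enumerate(bit_choices):
                t = used + b
                if t <= total_bits and base + expert_cost[i] < nxt[t]:
                    nxt[t] = base + expert_cost[i]
                    choice[t] = i
        best = nxt
        back.append(choice)
    if best[total_bits] == inf:
        return None
    bits, t = [], total_bits
    for choice in reversed(back):    # walk the choices back from the budget
        i = choice[t]
        bits.append(bit_choices[i])
        t -= bit_choices[i]
    bits.reverse()
    return best[total_bits], bits

bit_choices = [1, 2, 3, 4]
f_t = [0.7, 0.2, 0.1]                # hypothetical text frequencies
f_v = [0.2, 0.2, 0.6]                # hypothetical key-vision frequencies
delta_T = [[4.0, 2.0, 0.5, 0.1], [1.0, 0.5, 0.2, 0.1], [1.0, 0.5, 0.2, 0.1]]
delta_V = [[1.0, 0.5, 0.2, 0.1], [1.0, 0.5, 0.2, 0.1], [3.0, 1.5, 0.4, 0.1]]

# Objective coefficients of Eq. (2): f_t * delta_T + f_v * delta_V.
cost = [
    [f_t[e] * delta_T[e][i] + f_v[e] * delta_V[e][i] for i in range(len(bit_choices))]
    for e in range(len(f_t))
]
n, k = len(cost), 2                  # three experts, 2-bit average budget
optimal_cost, bits = allocate_bits(cost, bit_choices, n * k)
print(bits, round(optimal_cost, 4))
```

In this toy instance the text-critical expert 0 is raised to 3 bits, expert 1 (unimportant to both modalities) drops to 1 bit, and the vision-critical expert 2 stays at 2 bits, which lowers the cost relative to the uniform 2-bit assignment. Pure-Python dynamic programming scales poorly to thousands of experts, so an off-the-shelf ILP solver is the more practical choice at model scale.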
## 5 Experiments
In this section, we first describe the experimental setup, then present comprehensive results demonstrating that MODE outperforms existing methods across different low-bit quantization settings when applied to MoE-MLLMs. Finally, we show that MODE is orthogonal to the recently popular rotation-based quantization techniques and can be combined with them to further reduce quantization loss.
### 5.1 Experimental Setup
#### Models and Benchmarks\.
We conduct experiments on Qwen3-VL-30B-A3B-Instruct, Kimi-VL-A3B-Instruct, and InternVL3.5-30B-A3B (Wang et al., [2025](https://arxiv.org/html/2606.17118#bib.bib16)) to cover diverse MoE-MLLM architectures and validate the generalizability of our approach. Results for InternVL3.5-30B-A3B are provided in Appendix [B](https://arxiv.org/html/2606.17118#A2) due to space constraints.
We evaluate on multiple multimodal benchmarks covering diverse capability dimensions, including ChartQA (Masry et al., [2022](https://arxiv.org/html/2606.17118#bib.bib1)), InfoVQA (Mathew et al., [2021](https://arxiv.org/html/2606.17118#bib.bib2)), and TextVQA (Singh et al., [2019](https://arxiv.org/html/2606.17118#bib.bib3)) for OCR-oriented visual question answering; VizWiz-VQA (Gurari et al., [2018](https://arxiv.org/html/2606.17118#bib.bib4)), MME-RealWorld (Zhang et al., [2025a](https://arxiv.org/html/2606.17118#bib.bib5)), and GQA (Hudson and Manning, [2019](https://arxiv.org/html/2606.17118#bib.bib6)) for real-world visual understanding; MMMU (Yue et al., [2024](https://arxiv.org/html/2606.17118#bib.bib8)), MMStar (Chen et al., [2024b](https://arxiv.org/html/2606.17118#bib.bib9)), and MMBench (Liu et al., [2024b](https://arxiv.org/html/2606.17118#bib.bib10)) for general multimodal reasoning; and POPE (Li et al., [2023](https://arxiv.org/html/2606.17118#bib.bib7)) for object hallucination.
#### Implementation Details\.
We perform weight-only quantization and report results at average bit-widths of 3-bit and 2-bit. Since the attention and router modules account for less than 3% of the total LLM parameters yet significantly impact overall accuracy, we fix them at 4-bit and adjust the expert bit-width budget accordingly to maintain the target average bit-width; we provide detailed justification and ablation in Appendix [G](https://arxiv.org/html/2606.17118#A7). In the main experiments, we do not quantize the vision encoder, as its parameter count is small relative to the LLM backbone and has negligible impact on the overall memory footprint, which also aligns with most prior MoE-MLLM quantization work. Nevertheless, we additionally report results with the vision encoder quantized in Appendix [F](https://arxiv.org/html/2606.17118#A6) for reference.
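One plausible reading of "adjust the expert bit-width budget accordingly" is a parameter-weighted average, sketched below. The formula, the function name `expert_bit_budget`, and the 3% protected fraction are assumptions for illustration (the text only states "less than 3%"); the exact accounting is in Appendix G.

```python
def expert_bit_budget(target_avg_bits, protected_fraction, protected_bits=4.0):
    """Average expert bit-width needed so the overall average hits the target.

    Solves  protected_fraction * protected_bits
          + (1 - protected_fraction) * expert_bits == target_avg_bits
    for expert_bits, weighting each module by its share of parameters.
    """
    remaining = 1.0 - protected_fraction
    return (target_avg_bits - protected_fraction * protected_bits) / remaining

# Assumed share of LLM parameters held by attention and router modules.
protected_fraction = 0.03
for target in (3.0, 2.0):
    budget = expert_bit_budget(target, protected_fraction)
    print(f"target {target:.0f}-bit average -> expert budget {budget:.4f} bits")
```

Under this reading, protecting a small fraction of parameters at 4-bit lowers the expert budget only slightly below the nominal target, which is consistent with the trade-off the paragraph describes.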
After the mixed-precision bit-widths are determined, we apply GPTQ for weight calibration. The calibration set consists of 512 text-image pairs from ShareGPT4V, which is the same set used for collecting expert routing frequencies. We adopt symmetric quantization with a group size of 128. All evaluations are conducted using the open-source lmms-eval (Li* et al., [2024](https://arxiv.org/html/2606.17118#bib.bib25)) framework.
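To make "symmetric quantization with a group size of 128" concrete, the sketch below applies plain round-to-nearest (RTN) quantization per group. It deliberately omits GPTQ's error-compensating weight updates, so it is a simpler stand-in rather than the calibration procedure used in the paper; the signed-integer range and max-abs scale are one common convention and are assumptions here.

```python
import math

def quantize_group_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of one weight group.

    Returns (integer codes, scale); the dequantized value is code * scale.
    """
    if bits < 2:
        raise ValueError("this signed-integer sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1       # e.g. 3 for 3-bit
    qmin = -(2 ** (bits - 1))        # e.g. -4 for 3-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale

def quantize_symmetric(weights, bits, group_size=128):
    """Quantize and dequantize a flat weight list, one scale per group."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group_symmetric(group, bits)
        out.extend(code * scale for code in codes)
    return out

# Synthetic weights: two groups of 128 values each.
weights = [math.sin(i) for i in range(256)]
dequant = quantize_symmetric(weights, bits=3)
max_err = max(abs(w - d) for w, d in zip(weights, dequant))
print("max reconstruction error at 3-bit:", max_err)
```

Each group stores its own scale, so an outlier weight only coarsens the grid of the 128 values that share its group rather than the whole tensor.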
Table 1: Main quantization results on Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct (Team et al., [2025](https://arxiv.org/html/2606.17118#bib.bib15)) across 10 multimodal benchmarks. The best results under each setting are shown in bold.

| Model | Avg-Bit | Method | MMMU | MMB | MMStar | MME-R | ChartQA | TextV | InfoV | GQA | VizWiz | POPE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | BF16 | – | 52.56 | 86.86 | 60.05 | 52.48 | 85.20 | 83.37 | 81.46 | 62.96 | 71.64 | 89.92 | 72.65 |
| | W3 | GPTQ | 44.00 | 76.63 | 45.75 | 45.29 | 66.16 | 76.58 | 66.23 | 57.73 | 62.63 | 88.79 | 62.98 |
| | W3 | MBQ | 45.08 | 77.24 | 47.48 | 45.71 | 68.16 | 77.98 | 68.42 | 58.44 | 64.52 | 88.85 | 64.19 |
| | W3 | Speed-Q | 46.11 | 78.65 | 49.21 | 46.46 | 73.20 | 78.25 | 71.37 | 58.92 | 65.11 | 88.82 | 65.61 |
| | W3 | MC-MoE | 47.56 | 80.01 | 52.01 | 47.88 | 79.94 | 77.93 | 73.65 | 57.30 | 65.55 | 89.21 | 67.10 |
| | W3 | MoEQuant | 46.05 | 78.21 | 48.74 | 46.03 | 71.40 | 77.05 | 70.81 | 58.72 | 64.71 | 88.69 | 65.04 |
| | W3 | VEQ-ME | 48.04 | 80.15 | 52.58 | 47.41 | 81.12 | 78.58 | 73.23 | 58.19 | 66.02 | 88.96 | 67.43 |
| | W3 | **MODE (ours)** | **50.33** | **82.56** | **55.68** | **50.77** | **83.00** | **80.65** | **76.62** | **60.21** | **68.82** | **89.44** | **69.81** (+2.38) |
| | W2 | GPTQ | 25.56 | 22.46 | 30.48 | 12.94 | 31.12 | 40.98 | 39.07 | 36.89 | 37.55 | 50.84 | 32.79 |
| | W2 | MBQ | 35.71 | 61.23 | 36.27 | 31.63 | 57.44 | 59.84 | 53.19 | 47.16 | 48.78 | 72.37 | 50.36 |
| | W2 | Speed-Q | 36.92 | 63.67 | 37.55 | 33.98 | 59.23 | 62.63 | 53.42 | 46.98 | 50.13 | 75.41 | 51.99 |
| | W2 | MC-MoE | 39.78 | 66.74 | 40.48 | 40.48 | 63.12 | 70.42 | 56.74 | 53.10 | 55.32 | 80.86 | 56.70 |
| | W2 | MoEQuant | 36.98 | 64.49 | 38.13 | 38.31 | 60.18 | 61.32 | 56.07 | 48.01 | 48.66 | 73.53 | 52.57 |
| | W2 | VEQ-ME | 41.11 | 71.68 | 42.08 | 38.97 | 67.16 | 72.38 | 59.07 | 54.89 | 59.95 | 85.56 | 59.29 |
| | W2 | **MODE (ours)** | **44.63** | **75.71** | **45.34** | **44.85** | **73.11** | **75.45** | **66.19** | **56.99** | **63.39** | **87.28** | **63.29** (+4.00) |
| Kimi-VL-A3B-Instruct | BF16 | – | 52.33 | 82.99 | 49.41 | 44.45 | 89.44 | 88.69 | 83.61 | 62.72 | 70.59 | 87.17 | 71.14 |
| | W3 | GPTQ | 45.67 | 75.04 | 43.14 | 37.27 | 82.04 | 82.05 | 69.51 | 56.05 | 62.07 | 86.06 | 63.89 |
| | W3 | MBQ | 46.02 | 76.21 | 44.88 | 38.52 | 83.13 | 81.73 | 72.69 | 56.86 | 63.91 | 86.73 | 65.07 |
| | W3 | Speed-Q | 45.99 | 76.55 | 45.57 | 39.21 | 82.97 | 83.01 | 74.27 | 57.33 | 63.89 | 86.32 | 65.51 |
| | W3 | MC-MoE | 46.67 | 78.13 | 46.14 | 40.02 | 84.26 | 85.06 | 75.53 | 59.05 | 64.42 | 86.53 | 66.58 |
| | W3 | MoEQuant | 46.31 | 76.39 | 45.22 | 39.45 | 83.77 | 82.93 | 74.33 | 56.93 | 63.56 | 86.84 | 65.57 |
| | W3 | VEQ-ME | 47.00 | 77.84 | 46.82 | 39.97 | 85.64 | 84.24 | 77.24 | 58.17 | 66.39 | 86.72 | 67.00 |
| | W3 | **MODE (ours)** | **48.51** | **80.84** | **48.66** | **41.59** | **87.26** | **86.24** | **79.87** | **61.11** | **69.42** | **87.10** | **69.06** (+2.06) |
| | W2 | GPTQ | 27.33 | 23.44 | 23.01 | 24.23 | 27.62 | 48.02 | 22.42 | 37.09 | 25.20 | 70.29 | 32.87 |
| | W2 | MBQ | 31.48 | 38.21 | 33.54 | 29.57 | 36.77 | 59.98 | 45.17 | 41.79 | 39.33 | 74.31 | 43.02 |
| | W2 | Speed-Q | 35.33 | 51.44 | 36.01 | 31.98 | 44.62 | 62.02 | 50.42 | 47.09 | 49.20 | 77.29 | 48.54 |
| | W2 | MC-MoE | 40.92 | 72.61 | 41.52 | 36.01 | 66.49 | 75.15 | 64.23 | 55.97 | 58.26 | 82.58 | 59.37 |
| | W2 | MoEQuant | 37.63 | 63.16 | 35.78 | 33.12 | 53.24 | 68.84 | 55.51 | 50.96 | 53.69 | 79.13 | 53.11 |
| | W2 | VEQ-ME | 39.78 | 72.16 | 42.13 | 35.52 | 62.24 | 74.06 | 64.94 | 55.45 | 57.04 | 83.09 | 58.64 |
| | W2 | **MODE (ours)** | **45.11** | **75.23** | **44.11** | **38.71** | **75.20** | **79.38** | **67.75** | **58.30** | **64.52** | **85.46** | **63.38** (+4.01) |
### 5\.2 Main Results
#### Baselines\.
In addition to the basic uniform quantization method GPTQ \(Frantar et al\., [2023](https://arxiv.org/html/2606.17118#bib.bib35)\), we compare against three categories of PTQ methods, designed for MLLMs, MoE\-LLMs, and MoE\-MLLMs, respectively:
- For MLLMs: MBQ \(Li et al\., [2024b](https://arxiv.org/html/2606.17118#bib.bib19)\) and Speed\-Q \(Guo et al\., [2025](https://arxiv.org/html/2606.17118#bib.bib18)\), both of which account for inter\-modality differences during quantization\.
- For MoE\-LLMs: MC\-MoE and MoEQuant \(Hu et al\., [2025](https://arxiv.org/html/2606.17118#bib.bib20)\), both tailored for MoE architectures; like our method, MC\-MoE employs mixed\-precision quantization across experts\.
- For MoE\-MLLMs: VEQ\-ME \(Qin et al\., [2026](https://arxiv.org/html/2606.17118#bib.bib24)\), the most recent GPTQ\-based PTQ method explicitly designed for MoE\-MLLMs\.
All baseline results are reproduced using their official code repositories\. For a fairer comparison with VEQ\-ME, whose originally reported results suffer over 10% average accuracy degradation at 3\-bit due to unprotected attention weights, we reproduce all its results with the same protection applied\.
#### Results\.
Table [1](https://arxiv.org/html/2606.17118#S5.T1) summarizes the quantization results on two representative MoE\-MLLMs across 10 multimodal benchmarks under 3\-bit and 2\-bit settings\. MLLM\-oriented methods such as MBQ and Speed\-Q consider modality\-level differences but do not explicitly model the structural characteristics of MoE, which leads to noticeable performance degradation\. This becomes more evident under 2\-bit quantization, where their average accuracy drops by over 20% from the BF16 baseline\. MoE\-targeted methods like MC\-MoE mitigate severe collapse through mixed\-precision bit allocation across experts, yet still incur a non\-trivial quantization loss, as they overlook the modality imbalance in expert usage inherent to MoE\-MLLMs\. VEQ\-ME achieves better results than both categories in most settings \(the exception being Kimi\-VL at 2\-bit, where MC\-MoE is slightly stronger\), though the improvements remain relatively limited\.
Our method jointly addresses the heterogeneous expert contributions inherent to the MoE architecture and the inter\-modality bias together with the intra\-visual redundancy characteristic of MLLMs, yielding consistent improvements across both models and bit\-width settings\. At 3\-bit, the average accuracy loss is kept within 2\.9% and 2\.1% on Qwen3\-VL and Kimi\-VL respectively, substantially improving practical usability\. At 2\-bit, our method outperforms the best competing approach by about 4% on average \(\+4\.00 on Qwen3\-VL and \+4\.01 on Kimi\-VL\), effectively narrowing the performance gap under extreme low\-bit quantization\.
### 5\.3 Compatibility with Rotation\-based Quantization
Rotation\-based quantization, exemplified by QuaRot \(Ashkboos et al\., [2024](https://arxiv.org/html/2606.17118#bib.bib36)\), has become a mainstream PTQ technique that applies random Hadamard rotations to suppress outliers in weights and activations, thereby reducing quantization error at its source\. Since our method focuses on mixed\-precision bit allocation along the expert dimension, it is orthogonal to rotation\-based approaches\. We therefore examine whether applying QuaRot on top of our method yields further gains\.
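As a toy illustration of why rotation helps \(this is not QuaRot's actual implementation, which operates on full transformer blocks\), the NumPy sketch below builds a randomized Hadamard matrix and checks two properties: the layer output is unchanged when a weight and its input are rotated consistently, and a single large outlier channel is spread across all input dimensions\. The matrix sizes, the outlier value, and the helper names `hadamard` and `random_hadamard` are assumptions made for this example\.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard(n, rng):
    """Orthogonal matrix H @ diag(s) / sqrt(n) with random signs s in {-1, +1}."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs / np.sqrt(n)  # broadcasting scales column j by signs[j]

rng = np.random.default_rng(0)
d_out, d_in = 16, 64
W = rng.standard_normal((d_out, d_in))
W[:, 5] = 50.0                  # hypothetical outlier input channel
x = rng.standard_normal(d_in)

Q = random_hadamard(d_in, rng)
W_rot = W @ Q                   # rotated weight
x_rot = Q.T @ x                 # input rotated consistently

print(np.allclose(W @ x, W_rot @ x_rot))      # the layer output is preserved
print(np.abs(W).max(), np.abs(W_rot).max())   # largest magnitude before / after rotation
```

Because the rotation is orthogonal, any quantizer applied to `W_rot` sees a flatter value distribution while the full\-precision function being approximated stays the same, which is why such rotations can be layered on top of a bit\-allocation scheme\.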
Table 2: Effect of combining rotation\-based quantization \(QuaRot\) with our method on Qwen3\-VL\-30B\-A3B\-Instruct\. “\+Rot” denotes applying Hadamard rotation on top of our mixed\-precision framework\.

| Method | Bits | MMMU | VizWiz | TextVQA |
| --- | --- | --- | --- | --- |
| Baseline \(BF16\) | 16 | 52\.56 | 71\.64 | 83\.37 |
| Ours | 3 | 50\.33 | 68\.82 | 80\.65 |
| Ours \(\+Rot\) | 3 | 50\.81 | 69\.17 | 81\.50 |
| Ours | 2 | 44\.63 | 63\.39 | 75\.45 |
| Ours \(\+Rot\) | 2 | 46\.49 | 66\.02 | 76\.94 |
As shown in Table [2](https://arxiv.org/html/2606.17118#S5.T2), integrating QuaRot consistently improves performance on top of our method\. The benefit is particularly pronounced under the extreme 2\-bit setting, where rotation yields an average improvement of around 2\.0% over the three reported benchmarks \(compared with about 0\.6% at 3\-bit\), confirming that the two techniques address complementary sources of quantization error and can be effectively combined\.
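The per\-benchmark gains behind this comparison can be recomputed directly from Table 2\. The scores below are copied from the table; the dictionary layout and the helper name `mean_gain` are only illustrative\.

```python
# Scores from Table 2, ordered as (MMMU, VizWiz, TextVQA).
table2 = {
    3: {"ours": (50.33, 68.82, 80.65), "rot": (50.81, 69.17, 81.50)},
    2: {"ours": (44.63, 63.39, 75.45), "rot": (46.49, 66.02, 76.94)},
}

def mean_gain(bits):
    """Average improvement of 'Ours (+Rot)' over 'Ours' at the given bit-width."""
    row = table2[bits]
    gains = [rot - ours for ours, rot in zip(row["ours"], row["rot"])]
    return sum(gains) / len(gains)

print(f"3-bit mean gain: {mean_gain(3):.2f}")  # 0.56
print(f"2-bit mean gain: {mean_gain(2):.2f}")  # 1.99
```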
### 5\.4 Robustness to Calibration Set
Our method collects expert activation frequencies under different modalities on a calibration set as one indicator of expert importance\. It is therefore necessary to verify that this signal is robust to the choice of calibration data\. We run our full pipeline with three different calibration sets, ShareGPT4V, Flickr30k \(Young et al\., [2014](https://arxiv.org/html/2606.17118#bib.bib12)\), and LLaVA\-Next \(Liu et al\., [2024a](https://arxiv.org/html/2606.17118#bib.bib13)\), and evaluate the resulting quantized models on the same downstream benchmarks\. As shown in Table [3](https://arxiv.org/html/2606.17118#S5.T3), switching the calibration set changes the average accuracy by at most 0\.69% \(from 66\.44 to 67\.13\), confirming that our method is not sensitive to the specific calibration data used\. The underlying reason is that MoE\-based MLLMs exhibit similar expert selection preferences across different datasets, as demonstrated in Section [3\.2](https://arxiv.org/html/2606.17118#S3.SS2), where key vision tokens converge to a stable set of preferred experts regardless of data composition\.
Table 3: Performance of our method at 3\-bit on Qwen3\-VL\-30B\-A3B\-Instruct with different calibration sets\.

| Calib\. Set | MMMU | VizWiz | InfoVQA | MMStar | ChartQA | Avg\. |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline \(BF16\) | 52\.56 | 71\.64 | 81\.46 | 60\.05 | 85\.20 | 70\.18 |
| ShareGPT4V | 50\.33 | 68\.82 | 76\.62 | 55\.68 | 83\.00 | 66\.89 |
| Flickr30k | 49\.61 | 68\.27 | 77\.04 | 54\.75 | 82\.54 | 66\.44 |
| LLaVA\-Next | 50\.78 | 68\.91 | 76\.56 | 56\.02 | 83\.36 | 67\.13 |
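The frequency signal whose robustness is tested here can be sketched as follows\. This is a minimal illustration, not the authors' implementation: it assumes the router's top\-k expert indices for one MoE layer and a boolean vision\-token mask have already been gathered from a calibration pass, and the function name `modality_frequencies` is hypothetical\.

```python
import numpy as np

def modality_frequencies(topk_idx, is_vision, num_experts):
    """Per-expert selection frequency, computed separately for text and vision tokens.

    topk_idx:  (num_tokens, k) expert ids chosen by the router for each token
    is_vision: (num_tokens,) boolean mask, True for vision tokens
    """
    topk_idx = np.asarray(topk_idx)
    is_vision = np.asarray(is_vision, dtype=bool)

    def freq(rows):
        if rows.shape[0] == 0:
            return np.zeros(num_experts)
        counts = np.bincount(rows.ravel(), minlength=num_experts)
        return counts / rows.shape[0]  # fraction of tokens that selected each expert

    return {
        "text": freq(topk_idx[~is_vision]),
        "vision": freq(topk_idx[is_vision]),
        "pooled": freq(topk_idx),
    }

# Toy case: five vision tokens route to experts (0, 1), one text token to (2, 3).
topk_idx = np.array([[0, 1]] * 5 + [[2, 3]])
is_vision = np.array([True] * 5 + [False])
freqs = modality_frequencies(topk_idx, is_vision, num_experts=4)
print(freqs["pooled"])  # experts 2 and 3 look rarely used
print(freqs["text"])    # yet every text token depends on them
```

In the toy case the pooled statistic hides experts 2 and 3 behind the more numerous vision tokens, which is the cross\-modal bias that decomposing the frequency by modality is meant to expose\.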
## 6 Deployment Efficiency
For real quantization, we use BitBLAS \(Wang et al\., [2024](https://arxiv.org/html/2606.17118#bib.bib37)\) to store quantized weights at varying bit\-widths and perform mixed\-precision BLAS operations on GPUs\. To demonstrate efficiency, we take Qwen3\-VL\-30B\-A3B\-Instruct quantized to an average of 3 bits as an example\. Figure [6](https://arxiv.org/html/2606.17118#S6.F6) reports its total weight memory, activation memory, and average accuracy across ten benchmarks, along with two dense MLLMs as references: LLaVA\-OneVision\-7B \(Li et al\., [2024a](https://arxiv.org/html/2606.17118#bib.bib17)\), which has a comparable total weight memory, and Qwen3\-VL\-2B\-Instruct, which has a comparable activation memory\.
After 3\-bit quantization, the total weight memory drops from 62 GB to around 14 GB, fitting comfortably within a single consumer\-grade GPU such as an RTX 4090 \(24 GB\)\. Compared with LLaVA\-OneVision\-7B at a similar total weight memory, the quantized model requires far less activation memory thanks to the sparse activation of MoE, while achieving higher accuracy\. Compared with Qwen3\-VL\-2B\-Instruct at a comparable activation memory, the quantized model surpasses it by a large margin in accuracy\. These results confirm that our method enables practical deployment of large\-scale MoE\-MLLMs on resource\-constrained hardware with only a minor performance drop\.
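The reported footprint can be sanity\-checked with back\-of\-the\-envelope arithmetic\. The sketch below infers a parameter count from the 62 GB BF16 figure \(an assumption, using decimal gigabytes\) and computes the size if every weight were stored at exactly 3 bits\. This idealized figure ignores quantization metadata such as scales and any tensors kept at higher precision, none of which is broken down in the text, so it should be read as a rough lower bound rather than a reproduction of the reported 14 GB\.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Size in decimal GB of num_params weights stored at bits_per_weight bits each."""
    return num_params * bits_per_weight / 8 / 1e9

BF16_GB = 62.0          # reported BF16 weight memory
REPORTED_W3_GB = 14.0   # reported weight memory after 3-bit quantization (approximate)
GPU_GB = 24.0           # RTX 4090 memory

# Inferred from the BF16 footprint at 16 bits per weight; not an official parameter count.
num_params = BF16_GB * 1e9 * 8 / 16
ideal_w3_gb = weight_memory_gb(num_params, 3)

print(f"inferred parameters: {num_params / 1e9:.1f}B")
print(f"idealized 3-bit size: {ideal_w3_gb:.3f} GB")
print(f"reported size fits on the GPU: {REPORTED_W3_GB < GPU_GB}")
```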
Figure 6: Deployment efficiency of Qwen3\-VL\-30B\-A3B\-Instruct quantized by MODE \(W3A16\) in terms of total and activation memory and accuracy retention\.
## 7 Conclusion
We propose MODE, a mixed\-precision expert quantization method for MoE\-MLLMs\. Unlike prior work that allocates bit\-widths based solely on expert activation statistics, MODE explicitly accounts for both the *inter\-modal bias* between textual and visual tokens and the *intra\-visual\-modal bias* across different visual semantics, enabling a more faithful characterization of expert importance in the multimodal setting\. Under a fixed bit budget, MODE efficiently allocates precision across experts, substantially mitigating the performance degradation of weight\-only quantization in low\-bit regimes\.
## Limitations
While our work demonstrates the effectiveness of MODE for quantizing MoE\-MLLMs, we acknowledge the following limitations and directions for future improvement\.
#### Scaling to larger MoE\-MLLMs\.
Our experiments are conducted on Qwen3\-VL\-30B\-A3B\-Instruct, Kimi\-VL\-A3B\-Instruct, and InternVL3\.5\-30B\-A3B, which are deliberately chosen to cover a diverse set of model architectures and thereby validate the generalization of our method\. However, due to computational resource constraints, we have not yet verified MODE on larger\-scale MoE\-MLLMs such as Qwen3\-VL\-235B\-A22B\. We leave the evaluation on such larger models to future work when more computational resources become available\.
#### Quantization of the vision module\.
As discussed in Appendix [F](https://arxiv.org/html/2606.17118#A6), our method is primarily designed for the LLM part of MoE\-MLLMs and does not incorporate a dedicated design for the vision module\. Although the vision module occupies only a small fraction of the total memory footprint and thus contributes marginally to the deployment cost, we leave a thorough investigation of vision\-module quantization in MoE\-MLLMs to future work in pursuit of a more comprehensive PTQ design\.
#### Deployment efficiency in modern inference frameworks\.
As reported in Section [6](https://arxiv.org/html/2606.17118#S6), for real quantization we leverage BitBLAS \(Wang et al\., [2024](https://arxiv.org/html/2606.17118#bib.bib37)\) to store quantized weights at varying bit\-widths and perform mixed\-precision BLAS operations on GPUs, which significantly reduces the memory footprint in practical deployment\. Nevertheless, the end\-to\-end inference speed of the quantized model still lags behind full\-precision models served by highly optimized inference frameworks such as vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2606.17118#bib.bib38)\)\. This is mainly because vLLM and similar frameworks incorporate extensive system\-level optimizations tailored to MoE architectures, including efficient KV cache management and fused kernels for MoE layers, while vLLM does not currently natively support integrating mixed\-precision quantization into its runtime\. We hope to integrate our mixed\-precision expert quantization scheme into high\-performance inference frameworks like vLLM in the future, so as to translate the memory savings into larger practical speedups\.
## Ethical Considerations
This work focuses on post\-training quantization for MoE\-based Multimodal Large Language Models, which is a model compression technique and does not involve the collection of new data or human subjects\. All models and benchmarks used in our experiments are publicly available and employed in accordance with their respective licenses and intended research use\. Since our method reduces the memory and hardware requirements for deploying MoE\-MLLMs, it may contribute positively by lowering the barrier to accessing such models and reducing energy consumption during inference\.
## References
- M\. Artetxe, S\. Bhosale, N\. Goyal, T\. Mihaylov, M\. Ott, S\. Shleifer, X\. V\. Lin, J\. Du, S\. Iyer, R\. Pasunuru, G\. Anantharaman, X\. Li, S\. Chen, H\. Akin, M\. Baines, L\. Martin, X\. Zhou, P\. S\. Koura, B\. O’Horo, J\. Wang, L\. Zettlemoyer, M\. Diab, Z\. Kozareva, and V\. Stoyanov \(2022\)Efficient large scale language modeling with mixtures of experts\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 11699–11732\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.804),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.804)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p2.1)\.
- S\. Ashkboos, A\. Mohtashami, M\. L\. Croci, B\. Li, P\. Cameron, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman \(2024\)QuaRot: outlier\-free 4\-bit inference in rotated llms\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 100213–100240\.External Links:[Document](https://dx.doi.org/10.52202/079017-3180),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/b5b939436789f76f08b9d0da5e81af7c-Paper-Conference.pdf)Cited by:[§5\.3](https://arxiv.org/html/2606.17118#S5.SS3.p1.1)\.
- L\. Chen, H\. Zhao, T\. Liu, S\. Bai, J\. Lin, C\. Zhou, and B\. Chang \(2024a\)An image is worth 1/2 tokens after layer 2: plug\-and\-play inference acceleration for large vision\-language models\.External Links:2403\.06764,[Link](https://arxiv.org/abs/2403.06764)Cited by:[§C\.1](https://arxiv.org/html/2606.17118#A3.SS1.p1.1)\.
- L\. Chen, J\. Li, X\. Dong, P\. Zhang, Y\. Zang, Z\. Chen, H\. Duan, J\. Wang, Y\. Qiao, D\. Lin,et al\.\(2024b\)Are we on the right way for evaluating large vision\-language models?\.arXiv preprint arXiv:2403\.20330\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- L\. Chen, J\. Li, X\. Dong, P\. Zhang, C\. He, J\. Wang, F\. Zhao, and D\. Lin \(2023\)ShareGPT4V: improving large multi\-modal models with better captions\.arXiv preprint arXiv:2311\.12793\.Cited by:[§3](https://arxiv.org/html/2606.17118#S3.p1.1)\.
- H\. Duanmu, X\. Li, Z\. Yuan, S\. Zheng, J\. Duan, X\. Zhang, and D\. Lin \(2025\)MxMoE: mixed\-precision quantization for moe with accuracy and performance co\-design\.External Links:2505\.05799,[Link](https://arxiv.org/abs/2505.05799)Cited by:[§4](https://arxiv.org/html/2606.17118#S4.SS0.SSS0.Px3.p1.6)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.External Links:2101\.03961,[Link](https://arxiv.org/abs/2101.03961)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p2.1)\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\)GPTQ: accurate post\-training quantization for generative pre\-trained transformers\.External Links:2210\.17323,[Link](https://arxiv.org/abs/2210.17323)Cited by:[§H\.1](https://arxiv.org/html/2606.17118#A8.SS1.p1.4),[§5\.2](https://arxiv.org/html/2606.17118#S5.SS2.SSS0.Px1.p1.1)\.
- T\. Guo, S\. Zhao, S\. Zhu, and C\. Ma \(2025\)SPEED\-q: staged processing with enhanced distillation towards efficient low\-bit on\-device vlm quantization\.External Links:2511\.08914Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p3.1),[§2](https://arxiv.org/html/2606.17118#S2.p1.1),[1st item](https://arxiv.org/html/2606.17118#S5.I1.i1.p1.1)\.
- D\. Gurari, Q\. Li, A\. J\. Stangl, A\. Guo, C\. Lin, K\. Grauman, J\. Luo, and J\. P\. Bigham \(2018\)Vizwiz grand challenge: answering visual questions from blind people\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 3608–3617\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- X\. Hu, Z\. Chen, D\. Yang, Z\. Xu, C\. Xu, Z\. Yuan, S\. Zhou, and J\. Yu \(2025\)MoEQuant: enhancing quantization for mixture\-of\-experts large language models via expert\-balanced sampling and affinity guidance\.External Links:2505\.03804,[Link](https://arxiv.org/abs/2505.03804)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p3.1),[§2](https://arxiv.org/html/2606.17118#S2.p2.1),[2nd item](https://arxiv.org/html/2606.17118#S5.I1.i2.p1.1)\.
- W\. Huang, Y\. Liao, J\. Liu, R\. He, H\. Tan, S\. Zhang, H\. Li, S\. Liu, and X\. Qi \(2025\)Mixture compressor for mixture\-of\-experts llms gains more\.External Links:2410\.06270,[Link](https://arxiv.org/abs/2410.06270)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p4.1),[§2](https://arxiv.org/html/2606.17118#S2.p2.1),[§4](https://arxiv.org/html/2606.17118#S4.SS0.SSS0.Px3.p1.6)\.
- D\. A\. Hudson and C\. D\. Manning \(2019\)Gqa: a new dataset for real\-world visual reasoning and compositional question answering\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 6700–6709\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.External Links:2309\.06180,[Link](https://arxiv.org/abs/2309.06180)Cited by:[Deployment efficiency in modern inference frameworks\.](https://arxiv.org/html/2606.17118#Sx1.SS0.SSS0.Px3.p1.1)\.
- B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu, and C\. Li \(2024a\)LLaVA\-onevision: easy visual task transfer\.External Links:2408\.03326,[Link](https://arxiv.org/abs/2408.03326)Cited by:[§6](https://arxiv.org/html/2606.17118#S6.p1.1)\.
- S\. Li, Y\. Hu, X\. Ning, X\. Liu, K\. Hong, X\. Jia, X\. Li, Y\. Yan, P\. Ran, G\. Dai, S\. Yan, H\. Yang, and Y\. Wang \(2024b\)MBQ: modality\-balanced quantization for large vision\-language models\.External Links:2412\.19509,[Link](https://arxiv.org/abs/2412.19509)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p3.1),[§2](https://arxiv.org/html/2606.17118#S2.p1.1),[1st item](https://arxiv.org/html/2606.17118#S5.I1.i1.p1.1)\.
- Y\. Li, Y\. Du, K\. Zhou, J\. Wang, W\. X\. Zhao, and J\. Wen \(2023\)Evaluating object hallucination in large vision\-language models\.arXiv preprint arXiv:2305\.10355\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- B\. Li\*, P\. Zhang\*, K\. Zhang\*, F\. Pu\*, X\. Du, Y\. Dong, H\. Liu, Y\. Zhang, G\. Zhang, C\. Li, and Z\. Liu \(2024\)LMMs\-eval: accelerating the development of large multimoal models\.Zenodo\.External Links:[Link](https://github.com/EvolvingLMMs-Lab/lmms-eval)Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px2.p2.1)\.
- B\. Lin, Z\. Tang, Y\. Ye, J\. Huang, J\. Zhang, Y\. Pang, P\. Jin, M\. Ning, J\. Luo, and L\. Yuan \(2024\)MoE\-llava: mixture of experts for large vision\-language models\.External Links:2401\.15947,[Link](https://arxiv.org/abs/2401.15947)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p2.1)\.
- H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024a\)LLaVA\-next: improved reasoning, ocr, and world knowledge\.External Links:[Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)Cited by:[§5\.4](https://arxiv.org/html/2606.17118#S5.SS4.p1.1)\.
- Y\. Liu, H\. Duan, Y\. Zhang, B\. Li, S\. Zhang, W\. Zhao, Y\. Yuan, J\. Wang, C\. He, Z\. Liu,et al\.\(2024b\)Mmbench: is your multi\-modal model an all\-around player?\.InEuropean conference on computer vision,pp\. 216–233\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- A\. Masry, D\. Long, J\. Q\. Tan, S\. Joty, and E\. Hoque \(2022\)ChartQA: a benchmark for question answering about charts with visual and logical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2022,Dublin, Ireland,pp\. 2263–2279\.External Links:[Link](https://aclanthology.org/2022.findings-acl.177),[Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.177)Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- M\. Mathew, V\. Bagal, R\. P\. Tito, D\. Karatzas, E\. Valveny, and C\. V\. Jawahar \(2021\)InfographicVQA\.CoRRabs/2104\.12756\.External Links:[Link](https://arxiv.org/abs/2104.12756),2104\.12756Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- G\. Qin, Z\. Li, Z\. Chen, W\. Zhang, L\. Kong, and Y\. Zhang \(2026\)VEQ: modality\-adaptive quantization for moe vision\-language models\.External Links:2602\.01037,[Link](https://arxiv.org/abs/2602.01037)Cited by:[§2](https://arxiv.org/html/2606.17118#S2.p2.1),[3rd item](https://arxiv.org/html/2606.17118#S5.I1.i3.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.External Links:[Link](https://proceedings.mlr.press/v139/radford21a.html)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p1.1)\.
- A\. Singh, V\. Natarajan, M\. Shah, Y\. Jiang, X\. Chen, D\. Batra, D\. Parikh, and M\. Rohrbach \(2019\)Towards vqa models that can read\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 8317–8326\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- X\. Tan, P\. Ye, C\. Tu, J\. Cao, Y\. Yang, L\. Zhang, D\. Zhou, and T\. Chen \(2025\)TokenCarve: information\-preserving visual token compression in multimodal large language models\.External Links:2503\.10501,[Link](https://arxiv.org/abs/2503.10501)Cited by:[§C\.1](https://arxiv.org/html/2606.17118#A3.SS1.p1.1)\.
- K\. Team, A\. Du, B\. Yin, B\. Xing, B\. Qu, B\. Wang, C\. Chen, C\. Zhang, C\. Du, C\. Wei, C\. Wang, D\. Zhang, D\. Du, D\. Wang, E\. Yuan, E\. Lu, F\. Li, F\. Sung, G\. Wei, G\. Lai, H\. Zhu, H\. Ding, H\. Hu, H\. Yang, H\. Zhang, H\. Wu, H\. Yao, H\. Lu, H\. Wang, H\. Gao, H\. Zheng, J\. Li, J\. Su, J\. Wang, J\. Deng, J\. Qiu, J\. Xie, J\. Wang, J\. Liu, J\. Yan, K\. Ouyang, L\. Chen, L\. Sui, L\. Yu, M\. Dong, M\. Dong, N\. Xu, P\. Cheng, Q\. Gu, R\. Zhou, S\. Liu, S\. Cao, T\. Yu, T\. Song, T\. Bai, W\. Song, W\. He, W\. Huang, W\. Xu, X\. Yuan, X\. Yao, X\. Wu, X\. Zu, X\. Zhou, X\. Wang, Y\. Charles, Y\. Zhong, Y\. Li, Y\. Hu, Y\. Chen, Y\. Wang, Y\. Liu, Y\. Miao, Y\. Qin, Y\. Chen, Y\. Bao, Y\. Wang, Y\. Kang, Y\. Liu, Y\. Du, Y\. Wu, Y\. Wang, Y\. Yan, Z\. Zhou, Z\. Li, Z\. Jiang, Z\. Zhang, Z\. Yang, Z\. Huang, Z\. Huang, Z\. Zhao, and Z\. Chen \(2025\)Kimi\-VL technical report\.External Links:2504\.07491,[Link](https://arxiv.org/abs/2504.07491)Cited by:[Table 1](https://arxiv.org/html/2606.17118#S5.T1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p2.1)\.
- V\. Team, W\. Hong, W\. Yu, X\. Gu, G\. Wang, G\. Gan, H\. Tang, J\. Cheng, J\. Qi, J\. Ji, L\. Pan, S\. Duan, W\. Wang, Y\. Wang, Y\. Cheng, Z\. He, Z\. Su, Z\. Yang, Z\. Pan, A\. Zeng, B\. Wang, B\. Chen, B\. Shi, C\. Pang, C\. Zhang, D\. Yin, F\. Yang, G\. Chen, H\. Li, J\. Zhu, J\. Chen, J\. Xu, J\. Xu, J\. Chen, J\. Lin, J\. Chen, J\. Wang, J\. Chen, L\. Lei, L\. Gong, L\. Pan, M\. Liu, M\. Xu, M\. Zhang, Q\. Zheng, R\. Lyu, S\. Tu, S\. Yang, S\. Meng, S\. Zhong, S\. Huang, S\. Zhao, S\. Xue, T\. Zhang, T\. Luo, T\. Hao, T\. Tong, W\. Jia, W\. Li, X\. Liu, X\. Zhang, X\. Lyu, X\. Zhang, X\. Fan, X\. Huang, Y\. Xue, Y\. Wang, Y\. Wang, Y\. Wang, Y\. An, Y\. Du, Y\. Huang, Y\. Niu, Y\. Shi, Y\. Wang, Y\. Wang, Y\. Yue, Y\. Li, Y\. Liu, Y\. Zhang, Y\. Wang, Y\. Zhang, Z\. Xue, Z\. Du, Z\. Hou, Z\. Wang, P\. Zhang, D\. Liu, B\. Xu, J\. Li, M\. Huang, Y\. Dong, and J\. Tang \(2026\)GLM\-4\.5v and glm\-4\.1v\-thinking: towards versatile multimodal reasoning with scalable reinforcement learning\.External Links:2507\.01006,[Link](https://arxiv.org/abs/2507.01006)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p1.1)\.
- L\. Wang, L\. Ma, S\. Cao, Q\. Zhang, J\. Xue, Y\. Shi, N\. Zheng, Z\. Miao, F\. Yang, T\. Cao, Y\. Yang, and M\. Yang \(2024\)Ladder: enabling efficient low\-precision deep learning computing through hardware\-aware tensor transformation\.In18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 24\),Santa Clara, CA,pp\. 307–323\.External Links:ISBN 978\-1\-939133\-40\-3,[Link](https://www.usenix.org/conference/osdi24/presentation/wang-lei)Cited by:[§6](https://arxiv.org/html/2606.17118#S6.p1.1),[Deployment efficiency in modern inference frameworks\.](https://arxiv.org/html/2606.17118#Sx1.SS0.SSS0.Px3.p1.1)\.
- W\. Wang, Z\. Gao, L\. Gu, H\. Pu, L\. Cui, X\. Wei, Z\. Liu, L\. Jing, S\. Ye, J\. Shao,et al\.\(2025\)InternVL3\.5: advancing open\-source multimodal models in versatility, reasoning, and efficiency\.arXiv preprint arXiv:2508\.18265\.Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p1.1)\.
- S\. Yang, Y\. Chen, Z\. Tian, C\. Wang, J\. Li, B\. Yu, and J\. Jia \(2026\)VisionZip: longer is better but not necessary in vision language models\.External Links:2412\.04467,[Link](https://arxiv.org/abs/2412.04467)Cited by:[§C\.1](https://arxiv.org/html/2606.17118#A3.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.17118#S3.SS2.p1.1)\.
- P\. Young, A\. Lai, M\. Hodosh, and J\. Hockenmaier \(2014\)From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions\.Transactions of the Association for Computational Linguistics2,pp\. 67–78\.Cited by:[§5\.4](https://arxiv.org/html/2606.17118#S5.SS4.p1.1)\.
- X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, C\. Wei, B\. Yu, R\. Yuan, R\. Sun, M\. Yin, B\. Zheng, Z\. Yang, Y\. Liu, W\. Huang, H\. Sun, Y\. Su, and W\. Chen \(2024\)MMMU: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\.InProceedings of CVPR,Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- Y\. Zhang, H\. Zhang, H\. Tian, C\. Fu, S\. Zhang, J\. Wu, F\. Li, K\. Wang, Q\. Wen, Z\. Zhang, L\. Wang, R\. Jin, and T\. Tan \(2025a\)MME\-realworld: could your multimodal llm challenge high\-resolution real\-world scenarios that are difficult for humans?\.External Links:2408\.13257,[Link](https://arxiv.org/abs/2408.13257)Cited by:[§5\.1](https://arxiv.org/html/2606.17118#S5.SS1.SSS0.Px1.p2.1)\.
- Y\. Zhang, C\. Fan, J\. Ma, W\. Zheng, T\. Huang, K\. Cheng, D\. Gudovskiy, T\. Okuno, Y\. Nakata, K\. Keutzer,et al\.\(2025b\)SparseVLM: visual token sparsification for efficient vision\-language model inference\.InInternational Conference on Machine Learning,Cited by:[§C\.1](https://arxiv.org/html/2606.17118#A3.SS1.p1.1),[Appendix C](https://arxiv.org/html/2606.17118#A3.p1.1),[§3\.2](https://arxiv.org/html/2606.17118#S3.SS2.p2.1)\.
- Z\. Zheng, X\. Cui, S\. Zheng, M\. Li, J\. Chen, Y\. Liang, and X\. Chen \(2026\)DynaMo: runtime switchable quantization for moe with cross\-dataset adaptation\.External Links:2503\.21135,[Link](https://arxiv.org/abs/2503.21135)Cited by:[§1](https://arxiv.org/html/2606.17118#S1.p3.1),[§2](https://arxiv.org/html/2606.17118#S2.p2.1)\.
## Appendix A Complete Per\-Layer Heatmap of Intra\-Vision Expert Frequency Bias
In Section [3\.2](https://arxiv.org/html/2606.17118#S3.SS2), due to space constraints, Figure [3](https://arxiv.org/html/2606.17118#S3.F3)\(b\) only visualizes $\bar{f}_{\mathrm{key}} - \bar{f}_{\mathrm{red}}$ for the first 16 layers and the first 48 experts\. However, Qwen3\-VL\-30B\-A3B\-Instruct contains 48 layers in total, with 128 experts per layer\. For completeness, we provide the full heatmap covering all layers and all experts in Figure [7](https://arxiv.org/html/2606.17118#A1.F7)\.
Figure 7: Complete per\-layer heatmap of $\bar{f}_{\mathrm{key}} - \bar{f}_{\mathrm{red}}$ on Qwen3\-VL\-30B\-A3B\-Instruct, covering all 48 layers and all 128 experts per layer\.

#### On the magnitude of frequency differences\.
A natural question is why the absolute values of $\bar{f}_{\mathrm{key}} - \bar{f}_{\mathrm{red}}$ in Figure [7](https://arxiv.org/html/2606.17118#A1.F7) \(and in Figure [3](https://arxiv.org/html/2606.17118#S3.F3)\(b\)\) appear noticeably smaller than the routing frequencies reported in Figure [2](https://arxiv.org/html/2606.17118#S1.F2) of Section [3\.1](https://arxiv.org/html/2606.17118#S3.SS1), where the maximum value can exceed $25\%$\. The discrepancy stems from a difference in normalization rather than a difference in the underlying routing behavior\.
In Qwen3\-VL\-30B\-A3B\-Instruct, each MoE layer contains $128$ experts, and each token activates the top\-$8$ experts\. Consequently, even under a perfectly uniform routing distribution, the expected selection frequency of any single expert is

$$\frac{8}{128} = 6.25\%.$$

The frequencies reported in Figure [2](https://arxiv.org/html/2606.17118#S1.F2) are *unnormalized* selection frequencies, so values around or above this $6.25\%$ baseline are expected, and peaks beyond $25\%$ indicate strong routing preferences\.
In contrast, the values in Figure [3](https://arxiv.org/html/2606.17118#S3.F3)\(b\) and Figure [7](https://arxiv.org/html/2606.17118#A1.F7) are computed from *normalized* per\-token routing distributions, where the activated experts of each token sum to $1$\. Under this normalization, the expected selection frequency of any single expert under uniform routing is

$$\frac{1}{128} \approx 0.78\%.$$

Relative to this baseline, a difference $\bar{f}_{\mathrm{key}} - \bar{f}_{\mathrm{red}}$ on the order of $8\%$ already represents a large routing\-frequency gap, roughly an order of magnitude above the uniform baseline, indicating a pronounced specialization of certain experts toward dominant \(key\) versus redundant tokens\.
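The two baselines can be reproduced numerically\. The sketch below simulates uniformly random top\-8 routing over 128 experts and computes both statistics; it assumes that the normalized variant gives each activated expert an equal weight of $1/8$ per token \(the actual per\-token weights may instead come from the router's gate values\), and the function name `routing_frequencies` is hypothetical\.

```python
import numpy as np

def routing_frequencies(topk_idx, num_experts):
    """Return (unnormalized, normalized) per-expert selection frequencies.

    unnormalized: fraction of tokens that activate each expert (sums to k)
    normalized:   each token spreads a total weight of 1 over its k experts (sums to 1)
    """
    num_tokens, k = topk_idx.shape
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    return counts / num_tokens, counts / (num_tokens * k)

rng = np.random.default_rng(0)
num_tokens, num_experts, k = 4096, 128, 8

# Uniform routing: every token picks k distinct experts uniformly at random.
topk_idx = np.argsort(rng.random((num_tokens, num_experts)), axis=1)[:, :k]

unnorm, norm = routing_frequencies(topk_idx, num_experts)
print(f"mean unnormalized frequency: {unnorm.mean():.4%}")  # 8/128 baseline
print(f"mean normalized frequency:   {norm.mean():.4%}")    # 1/128 baseline
```

The two statistics differ only by the constant factor $k = 8$, so a gap of about $8\%$ on the normalized scale corresponds to a much larger gap on the unnormalized scale used in Figure 2\.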
## Appendix B Additional Results on InternVL3\.5\-30B\-A3B
In the main paper, we report results on Qwen3\-VL\-30B\-A3B\-Instruct and Kimi\-VL\-A3B\-Instruct\. Due to space constraints, we defer the results on InternVL3\.5\-30B\-A3B to this appendix\. We follow the exact same evaluation protocol as in the main experiments, evaluating on the same 10 multimodal benchmarks under both average 3\-bit and 2\-bit settings\. As shown in Table [4](https://arxiv.org/html/2606.17118#A2.T4), our proposed MODE consistently outperforms both the mixed\-precision MoE quantization method MC\-MoE and the MoE\-MLLM\-tailored method VEQ\-MA at both bit\-widths\. In particular, under the 3\-bit setting, MODE limits the average accuracy degradation to within 3% relative to the BF16 baseline \(69\.25 vs\. 72\.15\), demonstrating the strong generalization of our method across different model families\. Under the more challenging 2\-bit setting, MODE still surpasses the strongest baseline VEQ\-MA by about 2 points on average, further confirming its robustness under extremely low\-bit quantization\.
Table 4: Quantization results on InternVL3\.5\-30B\-A3B across 10 multimodal benchmarks\. The best results under each setting are shown in **bold**\. The number in parentheses denotes the absolute average gain of MODE over the strongest baseline\.

| Model | Avg\-Bit | Method | MMMU | MMB | MMStar | MME\-R | ChartQA | TextV | InfoV | GQA | VizWiz | POPE | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL3\.5\-30B\-A3B | BF16 | – | 60\.11 | 86\.08 | 71\.25 | 49\.71 | 88\.12 | 78\.35 | 76\.27 | 62\.95 | 59\.23 | 89\.38 | 72\.15 |
| | W3 | GPTQ | 52\.56 | 81\.19 | 63\.19 | 44\.17 | 77\.60 | 70\.34 | 65\.44 | 56\.01 | 45\.23 | **89\.13** | 64\.49 |
| | W3 | MC\-MoE | 54\.56 | 81\.42 | 66\.57 | 47\.52 | 82\.00 | 72\.67 | 67\.68 | 57\.38 | 50\.28 | 88\.57 | 66\.87 |
| | W3 | VEQ\-MA | 55\.49 | 81\.12 | 67\.01 | 46\.05 | 83\.07 | 73\.13 | 69\.75 | 59\.21 | 53\.04 | 89\.08 | 67\.70 |
| | W3 | MODE \(ours\) | **57\.33** | **82\.56** | **68\.26** | **47\.97** | **85\.64** | **74\.45** | **71\.53** | **60\.15** | **55\.55** | 89\.03 | **69\.25** \(\+1\.55\) |
| | W2 | GPTQ | 22\.78 | 77\.32 | 31\.67 | 21\.68 | 40\.75 | 36\.11 | 25\.62 | 30\.30 | 30\.11 | 40\.18 | 35\.65 |
| | W2 | MC\-MoE | 44\.89 | 76\.43 | 57\.40 | 43\.79 | 70\.92 | 67\.94 | 60\.10 | 55\.63 | 44\.07 | 88\.18 | 60\.94 |
| | W2 | VEQ\-MA | 46\.11 | 75\.76 | 59\.76 | 42\.33 | 72\.36 | 67\.55 | 61\.68 | 56\.04 | 48\.72 | 88\.46 | 61\.88 |
| | W2 | MODE \(ours\) | **48\.67** | **78\.69** | **60\.97** | **45\.31** | **73\.68** | **69\.83** | **63\.65** | **58\.40** | **50\.87** | **88\.66** | **63\.87** \(\+2\.00\) |
## Appendix CDetails of Key Vision Token Selection
In this appendix, we provide a detailed description of the key vision token selection used in our intra\-vision expert frequency bias analysis\. We first review the attention\-based token importance criterion proposed by SparseVLM\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.17118#bib.bib31)\)and discuss why it is well\-motivated for identifying redundant vision tokens \(Appendix[C\.1](https://arxiv.org/html/2606.17118#A3.SS1)\)\. We then describe how we adapt this criterion to our calibration setting, with a key difference from prior pruning\-oriented usage \(Appendix[C\.2](https://arxiv.org/html/2606.17118#A3.SS2)\)\. Finally, we empirically verify that the tokens selected by our procedure indeed behave as “key” and “redundant” vision tokens \(Appendix[C\.3](https://arxiv.org/html/2606.17118#A3.SS3)\)\.
### C.1 Recap of SparseVLM’s Attention-based Criterion
A growing body of work has observed that vision tokens in VLMs/MLLMs are highly redundant: a large fraction of them can be pruned with negligible degradation in downstream performance, while substantially reducing the computation and memory cost of inference (Chen et al., [2024a](https://arxiv.org/html/2606.17118#bib.bib34); Yang et al., [2026](https://arxiv.org/html/2606.17118#bib.bib32); Tan et al., [2025](https://arxiv.org/html/2606.17118#bib.bib33)). Among these methods, SparseVLM (Zhang et al., [2025b](https://arxiv.org/html/2606.17118#bib.bib31)) stands out as a representative approach. Although many follow-up works have further improved upon it, SparseVLM remains attractive for its simplicity and effectiveness, requiring no additional modules or training. We therefore follow its core criterion in this work and adapt it to our needs.
The key idea of SparseVLM is to reuse the self-attention logits already computed inside the VLM decoder as a measure of how relevant each vision token is to the current text query. Concretely, let $\boldsymbol{A}\in\mathbb{R}^{L\times L}$ denote the self-attention matrix at a given layer, where $L$ is the total sequence length. Let $\mathbb{L}$ and $\mathbb{I}$ denote the index sets of language (text) tokens and image (vision) tokens, respectively, with $|\mathbb{L}|=L_t$ and $|\mathbb{I}|=L_v$. SparseVLM extracts the text-to-vision sub-block

$$\boldsymbol{P}=\boldsymbol{A}_{i,j},\quad (i,j)\in\mathbb{L}\times\mathbb{I},\qquad \boldsymbol{P}\in\mathbb{R}^{L_t\times L_v}, \tag{4}$$

and aggregates the attention each vision token receives from all text tokens:

$$\tilde{\boldsymbol{p}}=[\tilde{p}_1,\tilde{p}_2,\dots,\tilde{p}_{L_v}]=\frac{1}{L_t}\sum_{i=1}^{L_t}\boldsymbol{P}_i. \tag{5}$$

A larger value of $\tilde{p}_j$ indicates that vision token $j$ is more relevant to the current language query and thus more important to retain. Because $\boldsymbol{A}$ is already computed during the forward pass, this estimation is essentially free in terms of additional FLOPs.
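As an illustration of Eqs. (4) and (5), the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the attention matrix is given as a nested list of attention weights for one layer (already averaged over heads), and the function name `vision_token_scores` and the toy values are hypothetical.

```python
from typing import List, Sequence


def vision_token_scores(attn: Sequence[Sequence[float]],
                        text_idx: Sequence[int],
                        vision_idx: Sequence[int]) -> List[float]:
    """Mean attention each vision token receives from all text tokens (Eq. 5).

    attn[i][j] is the attention that query token i pays to key token j.
    Assumption: a single layer, with heads already averaged.
    """
    if not text_idx:
        raise ValueError("at least one text token is required")
    scores = []
    for j in vision_idx:
        # Column j of the text-to-vision sub-block P (Eq. 4), averaged over rows.
        column_sum = sum(attn[i][j] for i in text_idx)
        scores.append(column_sum / len(text_idx))
    return scores


# Toy sequence of 4 tokens: indices 0 and 1 are vision, 2 and 3 are text.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.2, 0.2, 0.0],
    [0.2, 0.2, 0.3, 0.3],
]
scores = vision_token_scores(attn, text_idx=[2, 3], vision_idx=[0, 1])
```

In this toy case the first vision token receives more text attention than the second, so it would rank higher under the criterion.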
### C.2 Our Adaptation: Per-Layer Adaptive Selection
The most important difference between our usage and the original pruning-oriented usage in SparseVLM lies in *when* and *how* the selection is applied. In SparseVLM and most subsequent token-pruning methods, the decision must be made at a relatively shallow layer (e.g., the second or third decoder layer): tokens deemed redundant are physically dropped and excluded from all subsequent layers’ computation, which is precisely how the speedup is achieved. However, a well-known issue with this design is that the attention distribution over vision tokens varies considerably across depths: a vision token that receives little attention in shallow layers may become highly attended in deeper layers, and vice versa. Committing to a one-shot pruning decision at a shallow layer therefore inevitably introduces selection bias.
In our pipeline, by contrast, the selection is used purely for *calibration*, not for actual pruning during inference: we only need to identify which vision tokens behave as key tokens at each layer, in order to compute the modality-wise expert selection frequency. This frees us from the constraint of making an early, layer-agnostic decision. Accordingly, we apply the SparseVLM criterion *independently at every layer*: at layer $\ell$, we compute $\tilde{\boldsymbol{p}}^{(\ell)}$ from that layer’s attention, rank vision tokens by $\tilde{\boldsymbol{p}}^{(\ell)}$, and treat the top $20\%$ as key vision tokens and the remaining $80\%$ as redundant vision tokens. The resulting key-token set is thus adaptive to how the model attends to visual content at different depths, avoiding the cross-layer bias that hampers shallow-layer pruning.
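The per-layer selection can be sketched as follows. This is a hedged illustration only: the helper names `split_key_redundant` and `select_per_layer` are hypothetical, the scores are made-up numbers, and the rounding rule for the number of key tokens is an assumption, since the text only specifies the 20%/80% split.

```python
from typing import List, Sequence, Set, Tuple


def split_key_redundant(scores: Sequence[float],
                        key_ratio: float = 0.2) -> Tuple[Set[int], Set[int]]:
    """Split the vision tokens of one layer into (key, redundant) index sets."""
    n = len(scores)
    # Assumption: round to the nearest integer and keep at least one key token.
    n_key = max(1, round(key_ratio * n)) if n else 0
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return set(order[:n_key]), set(order[n_key:])


def select_per_layer(layer_scores: Sequence[Sequence[float]],
                     key_ratio: float = 0.2) -> List[Tuple[Set[int], Set[int]]]:
    """Apply the criterion independently at every layer."""
    return [split_key_redundant(s, key_ratio) for s in layer_scores]


# Two layers, five vision tokens each; the most attended token changes with depth.
layer_scores = [
    [0.90, 0.10, 0.30, 0.20, 0.05],  # shallow layer: token 0 dominates
    [0.10, 0.20, 0.15, 0.70, 0.05],  # deeper layer: token 3 dominates
]
per_layer = select_per_layer(layer_scores)
```

Here the key set differs between the two layers, which is exactly the situation a one-shot shallow-layer decision would miss.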
### C.3 Empirical Verification of the Selected Key and Redundant Tokens
To verify that our per\-layer attention\-based selection indeed identifies vision tokens that are functionally “key” or “redundant” as intended, we conduct a controlled pruning experiment on Qwen3\-VL\-30B\-A3B\-Instruct using ChartQA\.
#### Protocol\.
At every decoder layer, we rank vision tokens by their importance score $\tilde{\boldsymbol{p}}^{(\ell)}$ as described in Appendix [C.2](https://arxiv.org/html/2606.17118#A3.SS2). We then prune vision tokens at increasing ratios from $5\%$ to $50\%$ in steps of $5\%$, under two opposite strategies: (i) *key-first pruning*, which removes the highest-ranked tokens first, and (ii) *redundant-first pruning*, which removes the lowest-ranked tokens first. We restrict the pruning ratio to at most $50\%$, making the two strategies cleanly comparable. For each pruning ratio, we evaluate the resulting model on ChartQA and plot accuracy versus pruning ratio as two curves (key-first vs. redundant-first).
Figure 8: Effect of pruning key vs. redundant vision tokens on ChartQA. Removing tokens identified as *key* (red) causes a rapid accuracy drop, whereas removing tokens identified as *redundant* (blue) leaves accuracy almost unchanged across the entire $5\%$–$50\%$ range.
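The two pruning orders in the protocol can be sketched for a single layer as below. This is an assumption-laden illustration: `tokens_to_prune` is a hypothetical helper, the scores are synthetic, and the rounding of the pruned count is not specified in the text. One reading of the $50\%$ cap is that, up to that ratio, the two strategies never remove the same token, which the final check illustrates.

```python
from typing import List, Sequence


def tokens_to_prune(scores: Sequence[float], ratio: float, strategy: str) -> List[int]:
    """Return the indices of the vision tokens removed at a given pruning ratio."""
    n = len(scores)
    n_prune = round(ratio * n)  # assumption: nearest-integer rounding
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    if strategy == "key_first":
        return order[:n_prune]          # highest-ranked tokens go first
    if strategy == "redundant_first":
        return order[n - n_prune:]      # lowest-ranked tokens go first
    raise ValueError(f"unknown strategy: {strategy}")


# Synthetic, pairwise distinct scores for 20 vision tokens.
scores = [((7 * j) % 20) / 20 for j in range(20)]
ratios = [k / 100 for k in range(5, 55, 5)]  # 5%, 10%, ..., 50%

plan = {
    r: (tokens_to_prune(scores, r, "key_first"),
        tokens_to_prune(scores, r, "redundant_first"))
    for r in ratios
}

# At the 50% cap the two removed sets are disjoint and together cover all tokens.
key_half, red_half = plan[0.5]
```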
#### Results\.
As shown in Figure [8](https://arxiv.org/html/2606.17118#A3.F8), the two strategies exhibit sharply contrasting trends. Under key-first pruning, accuracy degrades sharply even at very small pruning ratios, dropping from the no-pruning baseline of $85.20\%$ to roughly $66\%$ after pruning only $20\%$ of the vision tokens, which indicates that the tokens we identify as “key” indeed carry information that the model genuinely relies on. As the ratio further increases, the curve flattens and decreases more gradually, since the additionally removed tokens are progressively less critical. In contrast, under redundant-first pruning, accuracy stays essentially flat throughout the entire range, fluctuating only within roughly $84\%$–$85\%$ and remaining close to the baseline even when half of the vision tokens are discarded. This pronounced asymmetry confirms that our per-layer adaptive selection reliably separates key vision tokens from redundant ones, providing a solid basis for the subsequent analysis that links key vision tokens to critical vision experts.
## Appendix D Details of Expert Selection Frequency Collection and Cross-Dataset Similarity
In this appendix, we formally describe how expert selection frequencies are collected, normalized, and compared across calibration sets\. We first introduce the general collection procedure without distinguishing token types \(Appendix[D\.1](https://arxiv.org/html/2606.17118#A4.SS1)\), then specialize it to text, key vision, and redundant vision tokens with layer\-wise normalization \(Appendix[D\.2](https://arxiv.org/html/2606.17118#A4.SS2)\), and finally describe the cross\-dataset cosine similarity \(Appendix[D\.3](https://arxiv.org/html/2606.17118#A4.SS3)\)\.
### D.1 General Expert Selection Frequency
Consider an MoE model with $L$ MoE layers and $E$ experts per layer. For each token, the router at every layer activates a top-$k$ subset of experts. Given a calibration dataset $\mathcal{D}$, we run the model on it and, at each layer $\ell$, simply count how many tokens activate each expert $e$:

$$c^{(\ell)}_{e} \;=\; \#\{\,x\in\mathcal{D} : x \text{ selects expert } e \text{ at layer } \ell\,\}. \tag{6}$$

Stacking over $e=1,\dots,E$ gives a per-layer count vector $\boldsymbol{c}^{(\ell)}\in\mathbb{R}^{E}$.
### D.2 Per-Token-Type Frequency and Layer-wise Normalization
We further partition tokens by their role: text tokens (*text*), key vision tokens (*key*, top-$20\%$ identified per layer as in Appendix [C.2](https://arxiv.org/html/2606.17118#A3.SS2)), and redundant vision tokens (*red*, the remaining $80\%$). For each token type $m\in\{\textit{text},\,\textit{key},\,\textit{red}\}$, we restrict the counting in Eq. ([6](https://arxiv.org/html/2606.17118#A4.E6)) to tokens of type $m$ only, yielding a type-specific count $c^{(\ell)}_{m,e}$.
To make frequencies comparable across layers and across calibration sets of different sizes, we normalize $\boldsymbol{c}^{(\ell)}_{m}$ at each layer into a probability distribution over experts:

$$\bar{f}^{(\ell)}_{m,e} \;=\; \frac{c^{(\ell)}_{m,e}}{\sum_{e'=1}^{E} c^{(\ell)}_{m,e'}}, \tag{7}$$

so that $\sum_{e}\bar{f}^{(\ell)}_{m,e}=1$. Stacking over experts yields the normalized frequency vector $\bar{\boldsymbol{f}}^{(\ell)}_{m}\in\mathbb{R}^{E}$.
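The counting in Eq. (6) and the normalization in Eq. (7) can be illustrated with a small sketch for one layer. The routing records, the helper names `count_selections` and `normalize`, and the toy sizes (4 experts, top-2 routing) are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List, Sequence, Tuple


def count_selections(routing: Sequence[Tuple[str, Sequence[int]]],
                     num_experts: int) -> Dict[str, List[int]]:
    """Per-token-type expert counts for one layer (Eq. 6 restricted to type m)."""
    counts: Dict[str, List[int]] = {}
    for token_type, experts in routing:
        row = counts.setdefault(token_type, [0] * num_experts)
        for e in experts:
            row[e] += 1
    return counts


def normalize(counts: Sequence[int]) -> List[float]:
    """Turn a count vector into a distribution over experts (Eq. 7)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# One layer, 4 experts, top-2 routing; each record is (token type, selected experts).
routing = [
    ("text", [0, 1]), ("text", [0, 2]),
    ("key", [3, 1]),
    ("red", [2, 3]), ("red", [2, 3]), ("red", [2, 0]),
]
counts = count_selections(routing, num_experts=4)
freq = {m: normalize(c) for m, c in counts.items()}

# Undecomposed frequency: count all tokens together before normalizing.
total_counts = [sum(col) for col in zip(*counts.values())]
freq_total = normalize(total_counts)
```

In this toy layer, expert 0 is the most frequent choice for text tokens, yet the undecomposed frequency ranks expert 2 highest because redundant vision tokens outnumber the rest, which mirrors the bias the decomposition is meant to expose.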
#### Notation across sections\.
The frequency symbols used in the main text are all special cases of $\bar{\boldsymbol{f}}^{(\ell)}_{m}$, but with slightly different scopes depending on context:
- In Sec. [3.3](https://arxiv.org/html/2606.17118#S3.SS3), four frequencies appear: $\bar{f}_{\textit{total}}$, $\bar{f}_{\textit{text}}$, $\bar{f}_{\textit{key}}$, and $\bar{f}_{\textit{red}}$. Among them, $\bar{f}_{\textit{total}}$ is the overall normalized frequency obtained by counting all tokens (i.e., applying Eq. ([7](https://arxiv.org/html/2606.17118#A4.E7)) to the full token set without partitioning), while $\bar{f}_{\textit{text}}$, $\bar{f}_{\textit{key}}$, and $\bar{f}_{\textit{red}}$ are the normalized frequencies computed on the corresponding token subsets.
- In Sec. [4](https://arxiv.org/html/2606.17118#S4), since our analysis has already focused on text tokens and key vision tokens, we adopt the more compact notation $\bar{f}_{t}$ and $\bar{f}_{v}$ to denote the normalized frequencies of the two modalities. Specifically, $\bar{f}_{t}$ denotes the frequency computed on text tokens (identical to $\bar{f}_{\textit{text}}$ in Sec. [3.3](https://arxiv.org/html/2606.17118#S3.SS3)), and $\bar{f}_{v}$ denotes the frequency computed on key vision tokens only (identical to $\bar{f}_{\textit{key}}$); the redundant vision tokens are no longer involved.
### D.3 Cross-Dataset Cosine Similarity
To measure how similar the routing behavior is between two calibration sets $\mathcal{D}_a$ and $\mathcal{D}_b$, we use the normalized frequency vectors above. For a given token type $m$, we first compute the cosine similarity at every layer:

$$s^{(\ell)}_{m} \;=\; \cos\!\big(\bar{\boldsymbol{f}}^{(\ell)}_{m}(\mathcal{D}_a),\;\bar{\boldsymbol{f}}^{(\ell)}_{m}(\mathcal{D}_b)\big), \tag{8}$$

and then average across all $L$ MoE layers to obtain the final similarity:

$$S_{m}(\mathcal{D}_a,\mathcal{D}_b) \;=\; \frac{1}{L}\sum_{\ell=1}^{L} s^{(\ell)}_{m}. \tag{9}$$

Since each $\bar{\boldsymbol{f}}^{(\ell)}_{m}$ is a probability distribution with non-negative entries, every per-layer similarity satisfies $s^{(\ell)}_{m}\in[0,1]$, and consequently $S_{m}(\mathcal{D}_a,\mathcal{D}_b)\in[0,1]$. A value closer to $1$ indicates that, for token type $m$, the two datasets induce more consistent expert selection patterns throughout the network, supporting the cross-dataset transferability claim in the main text.
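Eqs. (8) and (9) amount to a per-layer cosine similarity followed by a mean over layers. The sketch below is illustrative only: the function names and the two-layer frequency vectors are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something stated in the text.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two frequency vectors (Eq. 8)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for an empty token type (assumption)
    return dot / (norm_u * norm_v)


def cross_dataset_similarity(freq_a: Sequence[Sequence[float]],
                             freq_b: Sequence[Sequence[float]]) -> float:
    """Average the per-layer cosine similarities over all MoE layers (Eq. 9)."""
    if len(freq_a) != len(freq_b) or not freq_a:
        raise ValueError("both datasets need the same, non-zero number of layers")
    sims = [cosine(fa, fb) for fa, fb in zip(freq_a, freq_b)]
    return sum(sims) / len(sims)


# Two layers, four experts; one list of per-layer distributions per dataset.
freq_a = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
freq_b = [[0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
similarity = cross_dataset_similarity(freq_a, freq_b)
```

The first layer matches exactly (similarity 1) and the second only partially (similarity 0.5), so the averaged score lands between the two.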
## Appendix E Performance on Language-Only Tasks
Table 5: Performance on language-only benchmarks. Our method preserves language capabilities with limited degradation at both bit-widths. Qwen3-VL-30B-A3B and Kimi-VL-A3B are short for Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct, respectively. ARC-E and ARC-C denote ARC-Easy and ARC-Challenge.

| Model | Setting | ARC-E | ARC-C | HellaSwag | LAMBADA | MMLU | PIQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B | Baseline (FP16) | 80.56 | 54.69 | 62.03 | 71.49 | 80.75 | 80.25 | 74.59 | 70.63 |
| Qwen3-VL-30B-A3B | W3A16 | 79.92 | 52.47 | 58.67 | 68.45 | 75.89 | 78.84 | 71.01 | 67.89 |
| Qwen3-VL-30B-A3B | W2A16 | 77.48 | 49.23 | 53.12 | 62.32 | 65.38 | 75.57 | 66.48 | 64.23 |
| Kimi-VL-A3B | Baseline (FP16) | 82.62 | 53.92 | 59.57 | 71.61 | 69.01 | 80.20 | 71.59 | 69.79 |
| Kimi-VL-A3B | W3A16 | 80.72 | 52.13 | 56.56 | 70.95 | 64.71 | 78.89 | 70.32 | 67.75 |
| Kimi-VL-A3B | W2A16 | 75.67 | 42.92 | 48.14 | 63.09 | 57.46 | 76.39 | 64.64 | 61.19 |
The core of our method is to jointly consider the cross\-modal expert importance bias between language and vision modalities and the intra\-vision expert importance bias between key and redundant vision tokens\. In the main experiments, we primarily focus on evaluating quantized MoE\-MLLMs on multimodal benchmarks, where our method significantly outperforms other PTQ methods designed for MLLMs, MoE\-LLMs, or MoE\-MLLMs\. Meanwhile, it is important to ensure that our quantization does not severely degrade the language\-only capabilities of MoE\-MLLMs\.
To this end, we evaluate our method on Qwen3\-VL\-30B\-A3B\-Instruct and Kimi\-VL\-A3B\-Instruct at both 2\-bit and 3\-bit average bit\-widths across seven widely used language\-only benchmarks: ARC\-Easy, ARC\-Challenge, HellaSwag, LAMBADA, MMLU, PIQA, and WinoGrande\. Results are reported in Table[5](https://arxiv.org/html/2606.17118#A5.T5)\.
As shown in Table [5](https://arxiv.org/html/2606.17118#A5.T5), at 3-bit average bit-width, the average score drops by approximately 3 points on Qwen3-VL-30B-A3B-Instruct (70.63 to 67.89) and 2 points on Kimi-VL-A3B-Instruct (69.79 to 67.75). These results confirm that our modality-aware mixed-precision quantization method effectively preserves the language-only capabilities of MoE-MLLMs while substantially improving multimodal performance as demonstrated in the main experiments.
## Appendix F Quantizing Both Vision Module and LLM
In the main experiments, we focus on quantizing the LLM part of MoE\-MLLMs while keeping the vision module \(i\.e\., the vision encoder and the multi\-modal projector, plus the merger module for Qwen3\-VL\-30B\-A3B\-Instruct\) at original precision\. We adopt this setting because the vision module accounts for only a negligible fraction of the total weight memory\. As reported in Table[6](https://arxiv.org/html/2606.17118#A6.T6), the vision module occupies merely around 1 GB of memory on both Qwen3\-VL\-30B\-A3B\-Instruct and Kimi\-VL\-A3B\-Instruct, which is less than 3% of the total model size\. Therefore, keeping the vision module at original precision does not noticeably affect the overall memory footprint\. Furthermore, our method specifically targets the mixed\-precision quantization of MoE experts in MoE\-MLLMs and is orthogonal to methods that quantize the vision module, such as Speed\-Q\.
Table 6: Weight memory footprint of the vision module versus the LLM part on Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct. The vision module accounts for less than 3% of the total weight memory on both models.

| Model | Vision (GB) | Vision (%) | LLM (GB) | LLM (%) | Total (GB) |
|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | 1.077 | 1.73% | 61.064 | 98.27% | 62.142 |
| Kimi-VL-A3B-Instruct | 0.895 | 2.73% | 31.920 | 97.27% | 32.815 |

Nevertheless, to provide more comprehensive results, we further evaluate the setting where both the vision module and the LLM part are quantized. For the LLM part, we apply our mixed-precision expert quantization method as in the main experiments, with average bit-widths of 3-bit and 2-bit. For the vision module, we simply apply GPTQ to quantize it to 4-bit. Evaluation is conducted on the same 10 multimodal benchmarks as in the main experiments, and results are summarized in Table [7](https://arxiv.org/html/2606.17118#A6.T7). In the table, the column “Vision Q.” indicates whether the vision module is additionally quantized to 4-bit ($\checkmark$) or kept at original precision ($\times$).

Table 7: Performance comparison between LLM-only quantization and joint quantization of both the vision module and the LLM. The “Vision Q.” column indicates whether the vision module is quantized to 4-bit using GPTQ ($\checkmark$) or kept at original precision ($\times$). Qwen3-VL-30B-A3B and Kimi-VL-A3B are short for Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct, respectively.

| Model | Setting | Vision Q. | ChartQA | MMBench | MME-R | MMMU | MMStar | POPE | GQA | TextVQA | InfoVQA | VizWiz | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B | Baseline (BF16) | $\times$ | 85.20 | 86.86 | 52.48 | 52.56 | 60.05 | 89.92 | 62.96 | 83.37 | 81.46 | 71.64 | 72.65 |
| Qwen3-VL-30B-A3B | MODE-W3 | $\times$ | 83.00 | 82.56 | 51.77 | 50.33 | 55.68 | 89.44 | 60.21 | 80.65 | 77.62 | 68.82 | 70.01 |
| Qwen3-VL-30B-A3B | MODE-W3 | $\checkmark$ | 81.92 | 80.59 | 48.40 | 46.56 | 52.41 | 89.04 | 59.47 | 80.09 | 77.13 | 68.00 | 68.36 |
| Qwen3-VL-30B-A3B | MODE-W2 | $\times$ | 73.11 | 75.71 | 44.85 | 44.63 | 45.34 | 87.28 | 56.99 | 75.45 | 66.19 | 63.39 | 63.29 |
| Qwen3-VL-30B-A3B | MODE-W2 | $\checkmark$ | 74.59 | 72.96 | 41.72 | 41.52 | 42.26 | 86.51 | 56.33 | 75.98 | 66.47 | 59.34 | 61.77 |
| Kimi-VL-A3B | Baseline (BF16) | $\times$ | 89.44 | 82.99 | 44.45 | 52.33 | 49.41 | 87.17 | 62.72 | 88.69 | 83.61 | 70.59 | 71.14 |
| Kimi-VL-A3B | MODE-W3 | $\times$ | 87.26 | 80.84 | 41.59 | 48.51 | 48.66 | 87.10 | 61.11 | 86.24 | 79.87 | 69.42 | 69.06 |
| Kimi-VL-A3B | MODE-W3 | $\checkmark$ | 86.78 | 80.50 | 42.16 | 48.31 | 46.51 | 87.37 | 60.39 | 85.65 | 78.80 | 69.02 | 68.55 |
| Kimi-VL-A3B | MODE-W2 | $\times$ | 75.20 | 75.23 | 38.71 | 45.11 | 44.11 | 85.46 | 58.30 | 79.38 | 67.75 | 64.52 | 63.38 |
| Kimi-VL-A3B | MODE-W2 | $\checkmark$ | 67.04 | 68.45 | 38.81 | 43.77 | 38.79 | 85.53 | 58.45 | 77.39 | 65.46 | 65.09 | 60.88 |
As shown in Table [7](https://arxiv.org/html/2606.17118#A6.T7), on Qwen3-VL-30B-A3B-Instruct, quantizing the vision module from 16-bit to 4-bit yields an additional memory saving of approximately $1.077\times 3/4\approx 0.81$ GB, while lowering the average accuracy by about 1.7 and 1.5 points under the 3-bit and 2-bit LLM settings, respectively (70.01 to 68.36 and 63.29 to 61.77). On Kimi-VL-A3B-Instruct, the joint quantization is fairly robust under the 3-bit setting with only a 0.5-point drop (69.06 to 68.55), but the degradation becomes more pronounced under the 2-bit setting, reaching 2.5 points (63.38 to 60.88). Although our method is not specifically designed for the vision module, we hope these results can serve as a useful reference for future works on jointly quantizing both modalities of MoE-MLLMs.
## Appendix G The Importance of Protecting Attention Module at 4-bit
In our experiments, we observe that for MoE\-MLLMs, although the attention module accounts for only a small fraction of the total parameters, it is critical to the overall performance, especially under extremely low\-bit quantization\. For instance, under 2\-bit quantization, quantizing the MoE layers to 2\-bit causes noticeable accuracy degradation but does not lead to model collapse, whereas directly quantizing the attention module to 2\-bit causes the model to collapse entirely\. In contrast, keeping the attention module at 4\-bit substantially mitigates the quantization loss\. We attribute this to the fact that in MoE\-MLLMs, experts in the MoE layers are sparsely activated, while the attention module operates on every token, making it far more sensitive to quantization noise\. While we do not regard this as a novel contribution, we find it highly valuable for practical implementation and deployment, and therefore discuss it in detail here\.
#### Parameter footprint of the attention module\.
We first show that the attention module occupies only a small fraction of the LLM parameters\. As reported in Table[8](https://arxiv.org/html/2606.17118#A7.T8), the attention module accounts for less than 3% of the LLM weight memory on both Qwen3\-VL\-30B\-A3B\-Instruct and Kimi\-VL\-A3B\-Instruct, while the MoE layers dominate the parameter budget at over 93%\.
Table 8: Weight memory footprint of the attention module versus the MoE layers in the LLM part of Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct. The attention module accounts for less than 3% of the LLM weight memory on both models, while the MoE layers dominate at over 93%.

| Model | Attention (GB) | Attention (%) | MoE Layers (GB) | MoE Layers (%) | LLM Total (GB) |
|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | 1.812 | 2.97% | 58.007 | 94.99% | 61.064 |
| Kimi-VL-A3B-Instruct | 0.743 | 2.33% | 29.696 | 93.03% | 31.920 |

Table 9: Comparison between fully quantizing the LLM to 2-bit (GPTQ-W2) and protecting the attention module at 4-bit while quantizing the MoE layers to 2-bit (GPTQ-W2-Attn4). Protecting the attention module at 4-bit substantially mitigates the quantization loss on both models.

| Model | Setting | ChartQA | MMBench | MME-R | MMMU | MMStar | POPE | GQA | TextVQA | InfoVQA | VizWiz | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | Baseline (BF16) | 85.20 | 86.86 | 52.48 | 52.56 | 60.05 | 89.92 | 62.96 | 83.37 | 81.46 | 71.64 | 72.65 |
| Qwen3-VL-30B-A3B-Instruct | GPTQ-W2 | 31.12 | 22.46 | 12.94 | 25.56 | 30.48 | 50.84 | 36.89 | 40.98 | 39.07 | 37.55 | 34.99 |
| Qwen3-VL-30B-A3B-Instruct | GPTQ-W2-Attn4 | 58.16 | 52.68 | 32.91 | 36.11 | 36.08 | 79.56 | 45.44 | 56.38 | 59.07 | 52.36 | 50.88 |
| Kimi-VL-A3B-Instruct | Baseline (BF16) | 89.44 | 82.99 | 44.45 | 52.33 | 49.41 | 87.17 | 62.72 | 88.69 | 83.61 | 70.59 | 71.14 |
| Kimi-VL-A3B-Instruct | GPTQ-W2 | 27.62 | 23.44 | 24.23 | 27.33 | 23.01 | 70.29 | 37.09 | 48.02 | 22.42 | 25.20 | 32.87 |
| Kimi-VL-A3B-Instruct | GPTQ-W2-Attn4 | 45.64 | 50.84 | 29.71 | 34.00 | 32.66 | 73.10 | 46.11 | 64.24 | 49.27 | 43.42 | 46.90 |
#### Effect of protecting attention at 4\-bit\.
To empirically verify the importance of protecting the attention module, we use GPTQ as a representative quantization method and compare two settings on Qwen3-VL-30B-A3B-Instruct and Kimi-VL-A3B-Instruct: (i) GPTQ-W2, where all components of the LLM (including both the attention module and the MoE layers) are uniformly quantized to 2-bit; and (ii) GPTQ-W2-Attn4, where only the MoE layers are quantized to 2-bit while the attention module is kept at 4-bit. Both settings are evaluated on the same 10 multimodal benchmarks as in the main experiments, and the results are summarized in Table [9](https://arxiv.org/html/2606.17118#A7.T9).
As shown in Table[9](https://arxiv.org/html/2606.17118#A7.T9), under the pure 2\-bit setting, Qwen3\-VL\-30B\-A3B\-Instruct exhibits clear performance collapse on multiple benchmarks including ChartQA, MME\-R, TextVQA, and InfoVQA, with accuracy on MME\-R dropping to as low as 12\.94\. Once the attention module is protected at 4\-bit, the performance is largely recovered, with the average accuracy improving from 34\.99 to 50\.88 \(a relative gain of over 45%\)\. A similar trend is observed on Kimi\-VL\-A3B\-Instruct, where the average accuracy increases from 32\.87 to 46\.90\. These results empirically confirm that protecting the attention module at 4\-bit is critical under extremely low\-bit quantization\.
Moreover, the additional memory cost of protecting the attention module is marginal. Taking Qwen3-VL-30B-A3B-Instruct as an example, the extra memory introduced by promoting attention from 2-bit to 4-bit can be compensated by reducing the bit-width of merely 8 of the 128 experts in each layer by 1 bit across all 48 layers (out of $128\times 48$ experts in total). Given that the experts are empirically far more robust to quantization than attention, protecting the attention module at 4-bit is a highly cost-effective design choice and is strongly recommended in practice.
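The budget argument can be checked with a short back-of-the-envelope calculation using the Qwen3-VL-30B-A3B-Instruct sizes from Table 8. The sketch rests on assumptions that are not stated in the text: the reported sizes correspond to 16-bit weights, the MoE-layer memory is spread evenly over the routed experts, and quantization metadata such as scales and zero points is ignored.

```python
# Sizes reported in Table 8 for Qwen3-VL-30B-A3B-Instruct.
ATTN_GB = 1.812
MOE_GB = 58.007
NUM_LAYERS = 48
EXPERTS_PER_LAYER = 128
STORED_BITS = 16  # assumption: Table 8 sizes are for 16-bit weights

# Extra memory from keeping attention at 4-bit instead of 2-bit.
extra_attention_gb = ATTN_GB * (4 - 2) / STORED_BITS

# Memory freed by lowering a single expert by one bit.
# Assumption: MoE memory is split evenly across all routed experts.
gb_per_expert_bit = MOE_GB / (NUM_LAYERS * EXPERTS_PER_LAYER) / STORED_BITS

experts_total = extra_attention_gb / gb_per_expert_bit
experts_per_layer = experts_total / NUM_LAYERS

print(f"extra attention memory: {extra_attention_gb:.4f} GB")
print(f"experts to lower by 1 bit: {experts_total:.1f} total, "
      f"{experts_per_layer:.2f} per layer")
```

Under these assumptions the count comes out at roughly 8 experts per layer, in line with the figure quoted above.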
## Appendix H Multi-GPU Parallel GPTQ for MoE-MLLMs
### H.1 Motivation and Design
Fine-grained MoE-MLLMs typically contain hundreds of routed experts per layer, each requiring independent GPTQ (Frantar et al., [2023](https://arxiv.org/html/2606.17118#bib.bib35)) quantization of its `gate_up_proj` and `down_proj` matrices. Conventional GPTQ pipelines traverse these experts strictly one after another within each MoE layer, and as the number of experts grows, the cumulative cost of this per-expert sequential traversal makes MoE-layer calibration the dominant bottleneck of the entire quantization procedure on fine-grained MoE models. Since the Hessian construction and weight update for different experts are, however, mutually independent, the expert-quantization stage is in fact an *embarrassingly parallel* workload. We therefore introduce an expert-level multi-GPU scheme that follows a “serial collection → parallel quantization → serial propagation” pattern: inter-layer ordering, attention quantization, calibration forward passes, activation/Hessian collection, and the post-quantization re-forward used for layer-wise error propagation all remain strictly serial; only the inner loop over experts within a single quantization stage is dispatched concurrently across GPUs.

This scheduling change leaves the GPTQ mathematics, the quantization order, and all inter-layer dependencies untouched, so the output is bit-identical to the serial baseline. Algorithmic equivalence is guaranteed by four invariants: (i) layer order is preserved, so layer $\ell+1$ always sees inputs produced by the fully quantized layer $\ell$; (ii) within a layer, `down_proj` quantization strictly follows the re-forward of the quantized `gate_up_proj`, satisfying GPTQ’s sequential-quantization requirement; (iii) experts within the same stage are mathematically independent; and (iv) attention submodules and routers are quantized exactly as in the serial baseline.
### H.2 Implementation
We take a single projection matrix of a single expert as the minimal task unit, yielding up to $2E$ independent tasks per MoE layer with $E$ experts. Tasks are assigned to $G$ GPUs in a round-robin manner, and each GPU processes its local queue sequentially while the $G$ queues run concurrently; multi-threading is sufficient to drive this parallelism, since CUDA kernels release the Python GIL during execution. Each worker independently transfers the assigned expert weight and its corresponding inputs to its target GPU, performs the GPTQ update locally, and offloads the quantized result back to CPU before releasing its GPU memory. The main thread, residing on a single primary device, retains responsibility for calibration, activation collection, and re-forward propagation between stages, so that quantized expert weights produced in parallel are gathered and reinstalled into the model before the next dependent stage begins. To bound peak memory, expert inputs and Hessians are cloned into per-task structures before dispatch and released immediately afterwards, and explicit cache cleanup and synchronization are performed at the end of each layer. The scheme is implemented as a drop-in replacement for the standard GPTQ entry point and is agnostic to the specific MoE weight organization adopted by different model families.
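The scheduling pattern described above (round-robin assignment, one sequential queue per GPU, queues running concurrently in threads, two dependent stages per layer) can be sketched without any GPU code. This is a minimal illustration under stated assumptions, not the authors' implementation: `round_robin`, `run_stage`, and `fake_quantize` are hypothetical names, and `fake_quantize` is a placeholder for moving an expert to its device, running the GPTQ update, and offloading the result.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Task = Tuple[int, str]  # (expert index, projection name)


def round_robin(tasks: Sequence, num_gpus: int) -> List[list]:
    """Assign task i to queue i mod G."""
    queues: List[list] = [[] for _ in range(num_gpus)]
    for i, task in enumerate(tasks):
        queues[i % num_gpus].append(task)
    return queues


def run_stage(tasks: Sequence[Task], num_gpus: int,
              quantize_fn: Callable[[Task, int], str]) -> Dict[Task, str]:
    """Run one quantization stage: queues run concurrently, each one sequentially."""
    queues = round_robin(tasks, num_gpus)

    def worker(gpu_id: int) -> Dict[Task, str]:
        # A real worker would transfer weights and inputs to its GPU here.
        return {task: quantize_fn(task, gpu_id) for task in queues[gpu_id]}

    results: Dict[Task, str] = {}
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        for partial in pool.map(worker, range(num_gpus)):
            results.update(partial)  # gather on the main thread
    return results


def fake_quantize(task: Task, gpu_id: int) -> str:
    """Placeholder for the per-expert GPTQ update (hypothetical)."""
    expert, proj = task
    return f"{proj}[{expert}]@gpu{gpu_id}"


E, G = 8, 3
stage1 = run_stage([(e, "gate_up_proj") for e in range(E)], G, fake_quantize)
# The main thread would reinstall the weights and re-forward the layer here,
# before the dependent down_proj stage starts.
stage2 = run_stage([(e, "down_proj") for e in range(E)], G, fake_quantize)
```

Keeping the two projections in separate stages preserves the ordering required by invariant (ii), while the tasks inside each stage remain independent.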
### H.3 Complexity and Speedup
Let $E$ be the number of routed experts per layer, $G$ the number of GPUs, and $t$ the GPTQ time of a single expert projection. The serial baseline costs $\mathcal{O}(E\cdot t)$ per stage, while our scheme reduces this to $\mathcal{O}(\lceil E/G\rceil\cdot t)$, giving an expert-stage speedup of up to $\min(E,G)$ (attained when $G$ divides $E$ or $G\ge E$). Denoting the wall-clock time of all serial components by $T_s$ and the serial cost of expert quantization by $T_p$, the overall speedup can be expected as:

$$\mathrm{Speedup} \;=\; \frac{T_s+T_p}{T_s+T_p/G},$$

which approaches $G$ when $T_p\gg T_s$, as is typical for fine-grained MoE-MLLMs with large $E$. In our experiments, Kimi-VL-A3B-Instruct contains $64$ routed experts per layer, while Qwen3-VL-30B-A3B-Instruct and InternVL3.5-30B-A3B each contain $128$ experts per layer, placing all three models well within the regime where the parallel speedup approaches $G$.
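The two speedup expressions can be evaluated directly. In the sketch below the function names are hypothetical and the timing values passed to `overall_speedup` are invented for illustration; they are not measurements from the paper.

```python
import math


def expert_stage_speedup(num_experts: int, num_gpus: int) -> float:
    """Speedup of one expert stage: E * t divided by ceil(E / G) * t."""
    return num_experts / math.ceil(num_experts / num_gpus)


def overall_speedup(t_serial: float, t_parallel: float, num_gpus: int) -> float:
    """Overall speedup (T_s + T_p) / (T_s + T_p / G)."""
    return (t_serial + t_parallel) / (t_serial + t_parallel / num_gpus)


# Expert counts from the text, with a hypothetical 8-GPU node.
for experts in (64, 128):
    print(experts, "experts:", expert_stage_speedup(experts, 8))

# When G does not divide E, the stage speedup falls short of min(E, G).
print("5 experts, 2 GPUs:", expert_stage_speedup(5, 2))

# Hypothetical timings: 1 hour of serial work, 9 hours of expert quantization.
print("overall:", overall_speedup(1.0, 9.0, 8))
```

With the hypothetical 1:9 split, the serial part already limits the overall gain to well below $G$, which shows why the formula only approaches $G$ when $T_p$ dominates $T_s$.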
## Appendix I LLM Usage
In the preparation of this manuscript, large language models were used to polish the writing, including grammar refinement, phrasing improvement, and minor stylistic edits.