On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

arXiv cs.LG 07/03/26, 04:00 AM Papers
mixture-of-experts pruning biomedical reliability hallucination domain-specific model-compression
Summary
This paper investigates the effects of domain-specific expert pruning on both utility and factual reliability of Mixture-of-Experts (MoE) models in the biomedical domain. It finds that moderate pruning preserves in-domain utility without immediate reliability loss, but extreme pruning increases hallucination risks, and generalization degrades rapidly in cross-domain settings.
arXiv:2607.01444v1 Announce Type: new
# On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
Source: [https://arxiv.org/html/2607.01444](https://arxiv.org/html/2607.01444)
Atsuki Yamaguchi (1, 2), Szymon Palucha (2), Léo Bijar (2), Aline Villavicencio (1, 3, 4), Nikolaos Aletras (1)

(1) University of Sheffield, United Kingdom; (2) AstraZeneca; (3) University of Exeter, United Kingdom; (4) Federal University of Rio Grande do Norte, Brazil

{ayamaguchi1, a.villavicencio, n.aletras}@sheffield.ac.uk

###### Abstract

Mixture-of-Experts (MoE) models offer inference speedups via selective activation but impose substantial memory requirements because the whole network must remain loaded. Structured expert pruning is a practical approach for reducing deployment costs in resource-constrained settings. However, prior studies primarily evaluate benchmark utility, leaving the effect of pruning on factual reliability underexplored, particularly in high-stakes domains such as biomedicine. In this paper, we investigate how domain-specific expert pruning affects both utility and reliability. We assess four MoE models, six pruning methods, and multiple pruning ratios across generation and classification tasks under in-domain (biomedical) and cross-domain settings. Results reveal that moderate pruning preserves in-domain utility without immediate reliability decline, although hallucination risks increase at extreme pruning ratios. When shifting to the general domain, both utility and reliability degrade rapidly. These findings indicate that safe compression depends heavily on the task and domain. Evaluating pruned MoE models solely on utility is inadequate for high-stakes deployment without reliability assessment. (Footnote 1: Our code is available at [https://github.com/gucci-j/moe-pruning-reliability](https://github.com/gucci-j/moe-pruning-reliability).)

## 1 Introduction

Modern large language models (LLMs) such as Qwen3.6 (Qwen Team, [2026](https://arxiv.org/html/2607.01444#bib.bib48)), GPT-OSS (OpenAI et al., [2025](https://arxiv.org/html/2607.01444#bib.bib45)), and Nemotron3 (NVIDIA et al., [2025](https://arxiv.org/html/2607.01444#bib.bib44)) follow a Mixture-of-Experts (MoE) architecture. During the forward pass, MoE models activate only a subset of the network by routing each token through specialized subnetworks called experts (Du et al., [2022](https://arxiv.org/html/2607.01444#bib.bib14); Cai et al., [2025](https://arxiv.org/html/2607.01444#bib.bib3)). They achieve strong performance while offering substantial inference speedups. However, this efficiency does not eliminate the memory overhead during deployment. At inference, all experts must remain loaded in memory, which results in substantially larger memory footprints than those of dense models with comparable active parameter counts.

![Refer to caption](https://arxiv.org/html/2607.01444v1/x1.png)

Figure 1: An example of hallucination in biomedical text summarization, where the text highlighted in red represents a term that inverts the original meaning of the source while remaining plausible.

A popular approach to mitigating this constraint is expert pruning, which eliminates redundant experts based on an informativeness (i.e., saliency) criterion estimated from a small calibration dataset (Chen et al., [2022](https://arxiv.org/html/2607.01444#bib.bib6); Muzio et al., [2024](https://arxiv.org/html/2607.01444#bib.bib43); Lu et al., [2024](https://arxiv.org/html/2607.01444#bib.bib40), inter alia). Among these approaches, domain-specific expert pruning has been proposed to tailor the compressed model for a particular field (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13)), such as biomedicine. However, previous studies primarily evaluate pruning methods on downstream benchmark performance, or utility. Consequently, they overlook the reliability of the resulting compressed models. This includes factual consistency, faithfulness, and hallucination prevention (Ji et al., [2023](https://arxiv.org/html/2607.01444#bib.bib28); Pal et al., [2023](https://arxiv.org/html/2607.01444#bib.bib46)). This oversight poses serious risks, particularly in high-stakes domains such as biomedicine, where factual errors can lead to critical real-world failures (Weidinger et al., [2022](https://arxiv.org/html/2607.01444#bib.bib55); Moor et al., [2023](https://arxiv.org/html/2607.01444#bib.bib42)).

Furthermore, benchmark utility does not always align with factual reliability. Chrysostomou et al. ([2024](https://arxiv.org/html/2607.01444#bib.bib10)) indicate that pruning dense models can result in reduced hallucinations. However, unlike dense model pruning, expert pruning completely removes entire weight matrices, fundamentally altering the model architecture. This forces the model to rely on fallback experts; these alternatives may generate contextually fluent text while silently injecting subtle factual inaccuracies (see Figure [1](https://arxiv.org/html/2607.01444#S1.F1)). Therefore, the relationship between utility and reliability in MoE pruning remains unclear.

In this paper, we investigate how domain-specific expert pruning affects both utility and reliability in high-stakes settings. Focusing on the biomedical domain, we evaluate four MoE models, six pruning methods, and multiple pruning ratios. Our evaluation spans generation and classification tasks, comparing in-domain behavior with complementary general-domain analysis. This framework examines whether pruning degrades performance, whether utility and reliability remain coupled, where degradation begins to appear, and how these outcomes vary across task types and domains. Our contributions are as follows:

- We provide the first systematic study of how expert pruning affects the factual reliability of MoE models in a high-stakes setting, with a primary focus on the biomedical domain.
- We characterize how utility and reliability change across pruning ratios, model families, and pruning strategies, showing that moderate in-domain pruning remains robust while degradation becomes more pronounced at extreme pruning ratios and under domain shift.
- We show that the safety of MoE compression is strongly task- and domain-dependent, and that evaluating pruned MoE models using utility alone is insufficient for high-stakes deployment without explicit reliability assessment.

## 2 Related Work

#### Reducing Memory Footprint in MoE Models\.

Efforts to reduce the memory footprint of MoE models span several approaches, including weight quantization (Li et al., [2024a](https://arxiv.org/html/2607.01444#bib.bib36); Huang et al., [2025](https://arxiv.org/html/2607.01444#bib.bib25); Chen et al., [2025c](https://arxiv.org/html/2607.01444#bib.bib8), inter alia), expert merging (He et al., [2023](https://arxiv.org/html/2607.01444#bib.bib21); Zhang et al., [2025](https://arxiv.org/html/2607.01444#bib.bib64); Zhou et al., [2025](https://arxiv.org/html/2607.01444#bib.bib67), inter alia), and expert pruning (Kim et al., [2021](https://arxiv.org/html/2607.01444#bib.bib31); Lu et al., [2024](https://arxiv.org/html/2607.01444#bib.bib40); Chen et al., [2025b](https://arxiv.org/html/2607.01444#bib.bib7); Bai et al., [2025](https://arxiv.org/html/2607.01444#bib.bib2)).

Quantization techniques minimize network storage requirements by reducing parameter bit-width. Examples include methods that vary bit-widths based on structural sensitivity (Li et al., [2024a](https://arxiv.org/html/2607.01444#bib.bib36); Huang et al., [2025](https://arxiv.org/html/2607.01444#bib.bib25)), optimize calibration (Chen et al., [2025c](https://arxiv.org/html/2607.01444#bib.bib8)), or employ extreme 1-bit compression (Yuan et al., [2023](https://arxiv.org/html/2607.01444#bib.bib62); Frantar and Alistarh, [2024](https://arxiv.org/html/2607.01444#bib.bib15)). Quantization reduces numerical precision rather than altering network topology, making it orthogonal to expert merging and pruning (Lu et al., [2024](https://arxiv.org/html/2607.01444#bib.bib40)).

Expert merging mitigates memory constraints by mathematically blending parameter matrices into a unified matrix (He et al., [2023](https://arxiv.org/html/2607.01444#bib.bib21); Li et al., [2024b](https://arxiv.org/html/2607.01444#bib.bib37); Chen et al., [2025a](https://arxiv.org/html/2607.01444#bib.bib4)), while expert pruning permanently removes a subset of redundant or low-saliency experts based on calibration data (Muzio et al., [2024](https://arxiv.org/html/2607.01444#bib.bib43); Hu et al., [2026](https://arxiv.org/html/2607.01444#bib.bib24); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34)). Expert merging introduces irreducible errors by eliminating the ability of the router to maintain fine-grained, independent control over experts (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34); Liu et al., [2026](https://arxiv.org/html/2607.01444#bib.bib39)). Conversely, expert pruning preserves the original functional topology and routing independence, demonstrating superior performance at high compression rates (Bai et al., [2025](https://arxiv.org/html/2607.01444#bib.bib2); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34)). Therefore, we focus exclusively on expert pruning.

#### Expert Pruning\.

Expert pruning methods are generally categorized into two paradigms: those that require subsequent fine-tuning and those that operate entirely without training (i.e., post-training or one-shot pruning) (Lu et al., [2024](https://arxiv.org/html/2607.01444#bib.bib40)). Methods requiring training remove experts based on routing statistics or regularization, subsequently applying gradient-based optimization to recover the resulting performance degradation (Kim et al., [2021](https://arxiv.org/html/2607.01444#bib.bib31); Chen et al., [2022](https://arxiv.org/html/2607.01444#bib.bib6); Muzio et al., [2024](https://arxiv.org/html/2607.01444#bib.bib43); Yang et al., [2024](https://arxiv.org/html/2607.01444#bib.bib60)).

In contrast, training-free expert pruning removes redundant experts using one-shot calibration data, bypassing parameter updates. Prominent examples estimate expert saliency through reconstruction loss (Lu et al., [2024](https://arxiv.org/html/2607.01444#bib.bib40)), domain-specific demonstrations (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13)), expert-selection frequency (Chen et al., [2025b](https://arxiv.org/html/2607.01444#bib.bib7)), or advanced functional criteria to preserve routing independence without gradient updates (Hu et al., [2026](https://arxiv.org/html/2607.01444#bib.bib24); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34); Liu et al., [2026](https://arxiv.org/html/2607.01444#bib.bib39)). We focus on this training-free paradigm because it provides a lightweight solution that facilitates the efficient on-device deployment of massive MoE architectures. This approach also enables isolated examination of pruning effects without confounding variables from fine-tuning.

#### Model Compression and Factual Reliability\.

The relationship between model compression and the factual reliability of LLMs has been extensively investigated across dense model pruning, model merging, and weight quantization. Chrysostomou et al. ([2024](https://arxiv.org/html/2607.01444#bib.bib10)) find that dense model pruning reduces hallucinations in abstractive summarization by forcing reliance on source documents. Other studies show that it degrades overall trustworthiness (Hong et al., [2024](https://arxiv.org/html/2607.01444#bib.bib23)) and disrupts internal activation features necessary for robust lie detection (Fu et al., [2025a](https://arxiv.org/html/2607.01444#bib.bib17)). Therefore, dense model pruning yields task-dependent outcomes. While studies on merged models remain scarce, Yang et al. ([2025b](https://arxiv.org/html/2607.01444#bib.bib61)) demonstrate its utility as a parameter-level conflict-resolution strategy to harmonize helpfulness, honesty, and harmlessness. For weight quantization, previous work (Hong et al., [2024](https://arxiv.org/html/2607.01444#bib.bib23); Singh and Sajjad, [2025](https://arxiv.org/html/2607.01444#bib.bib50)) demonstrates that moderate bit-width reduction generally preserves trustworthiness and internal calibration. However, quantized models become susceptible to deceptive prompts, even while retaining truthful internal representations (Fu et al., [2025b](https://arxiv.org/html/2607.01444#bib.bib18)), and exhibit degraded faithfulness when generating natural language self-explanations (Wang et al., [2026](https://arxiv.org/html/2607.01444#bib.bib54)). Consequently, weight quantization presents a delicate trade-off between numerical efficiency and semantic precision.

Despite extensive research across these methodologies, the impact of expert pruning on factual reliability remains unexplored. This work provides the first systematic exploration into how expert pruning affects the reliability of MoE models, offering critical insights for their safe deployment.

## 3 Training-free Domain-specific Expert Pruning

### 3.1 MoE Pruning Framework

Consider an MoE model with $L$ layers, where each layer $l \in \{1, \dots, L\}$ contains a set of $N$ experts, $\{E_1^l, \dots, E_N^l\}$. For a given pruning ratio $p$, the objective is to retain a subset of $M$ experts per layer, where $M = (1 - p)N$, and discard the remaining $N - M$ experts. In this work, we focus on domain-specific expert pruning (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13)), compressing an MoE model to preserve utility on a domain of interest (here, the biomedical domain).

To achieve this, the pruning process uses a calibration set $\mathcal{C}$ sampled from the domain of interest. For each sample in $\mathcal{C}$ with a sequence length of $T$ tokens, a saliency metric (defined in §[3.2](https://arxiv.org/html/2607.01444#S3.SS2)) evaluates expert importance. After calculating the average importance across $\mathcal{C}$, experts with the lowest scores are removed. During inference, the router is restricted to selecting from the remaining $M$ experts.
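The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `experts_to_keep`, the rounding of $M$, and the example scores are assumptions made for this sketch.

```python
def experts_to_keep(scores, pruning_ratio):
    """Return the sorted indices of the experts retained in one layer.

    scores: per-expert saliency values, already averaged over the calibration set.
    pruning_ratio: p in [0, 1); M = (1 - p) * N experts are retained.
    """
    n_experts = len(scores)
    # Rounding is an assumption for ratios that do not divide N evenly.
    n_keep = round((1 - pruning_ratio) * n_experts)
    # Rank experts from highest to lowest saliency and keep the top M.
    ranked = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical saliency scores for a layer with N = 8 experts.
layer_scores = [0.9, 0.1, 0.4, 0.7, 0.05, 0.3, 0.8, 0.2]
kept = experts_to_keep(layer_scores, pruning_ratio=0.5)
print(kept)  # [0, 2, 3, 6]
```

At inference, the router would then choose only among the indices in `kept`; how the router is masked depends on the model implementation and is not shown here.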

### 3.2 Saliency Metrics

To quantify the importance of each expert, we consider six distinct metrics ranging from random baselines to state\-of\-the\-art methods\. Our aim is to investigate how different importance metrics affect domain utility and factual reliability\. Specifically, we first use a stochastic baseline \(Random\) and then advance through a progression of data\-driven complexity: from static measures \(Frequency\), to dynamic activation\-based approaches, and finally to recent, context\-aware formulations\.

#### Random\.

The random pruning metric samples $M$ experts for retention from a uniform distribution. This approach serves as a weak baseline that excludes importance estimation and data-driven signals.

#### Frequency\.

This metric measures how often the router selects an expert (Chen et al., [2022](https://arxiv.org/html/2607.01444#bib.bib6)). It assigns equal weight to every activation event, regardless of the magnitude assigned by the routing algorithm. For an expert $E_i^l$, the score is the count of activations: $\sum_{t=1}^{T} \mathbb{I}(g_{i,t}^{l} > 0)$, where $\mathbb{I}(\cdot)$ is the indicator function that returns one for positive gating values and zero otherwise.

#### Gate\.

This metric assesses expert importance based on routing activation magnitude (Chen et al., [2022](https://arxiv.org/html/2607.01444#bib.bib6)). Unlike Frequency, which treats all activations equally, the gating metric identifies experts prioritized by the router. For an expert $E_i^l$ in layer $l$, the score is the sum of gating values across all tokens: $\sum_{t=1}^{T} g_{i,t}^{l}$, where $g_{i,t}^{l}$ represents the gating value of the $i$-th expert for the $t$-th token at layer $l$.
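To make the contrast between Frequency and Gate concrete, the sketch below computes both scores for two experts over a short sequence. The gating values are hypothetical, and a value of zero is assumed to mean that the router did not select the expert for that token.

```python
def frequency_score(gates):
    """Count the tokens for which the expert received a positive gating value."""
    return sum(1 for g in gates if g > 0)


def gate_score(gates):
    """Sum the gating values the router assigned to the expert."""
    return sum(gates)


# Hypothetical gating values for two experts over T = 4 tokens.
often_but_weak = [0.1, 0.1, 0.1, 0.0]
rare_but_strong = [0.9, 0.0, 0.8, 0.0]

# Frequency prefers the expert that is selected more often ...
print(frequency_score(often_but_weak), frequency_score(rare_but_strong))  # 3 2
# ... while Gate prefers the expert the router weights more heavily.
print(gate_score(rare_but_strong) > gate_score(often_but_weak))  # True
```

The two metrics can therefore rank the same pair of experts in opposite order, which is why they may select different experts for removal.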

#### EAN\.

Expert Activation Norm (Jaiswal et al., [2025](https://arxiv.org/html/2607.01444#bib.bib27)) assesses importance by accumulating the norms of intermediate activations produced by an expert. The saliency score for expert $E_i^l$ is the sum of the $L_2$ norms of its outputs across active tokens: $\sum_{t=1}^{T} \mathbb{I}(g_{i,t}^{l} > 0) \cdot \| E_{i,t}^{l}(h_t^l) \|_2$. Here, $E_{i,t}^{l}(h_t^l)$ denotes the output vector of the $i$-th expert for the hidden state $h_t^l$ at layer $l$. This method favors experts that produce high-magnitude transformations.
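As a rough illustration of the EAN formula, the following sketch accumulates output norms over the tokens for which an expert is active. The helper names and the toy vectors are assumptions for this example only.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def ean_score(gates, outputs):
    """Sum the L2 norms of the expert outputs over tokens with a positive gate."""
    return sum(l2_norm(out) for g, out in zip(gates, outputs) if g > 0)


# Hypothetical gating values and expert outputs for T = 3 tokens.
gates = [0.5, 0.0, 0.2]
outputs = [[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]

# Token 2 has a gate of zero, so its large output norm (10) is ignored.
print(ean_score(gates, outputs))  # 6.0
```

Note that the gating values only decide which tokens count; their magnitude does not enter the score, so a weakly routed expert with large outputs still scores highly.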

#### EASY\-EP\.

Expert Assessment with Simple Yet-effective scoring for Expert Pruning (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13)) optimizes importance estimation by coupling the output magnitude of an expert with the token-level contribution to representation shift. This metric posits that an expert is essential only when the output magnitude is large and the transformation induces a considerable reorientation of the hidden state. The saliency score is the product of output-aware importance $c_{i,t}^{l}$ and expert-level token contribution $s_t^l$, aggregated as $\sum_{t=1}^{T} c_{i,t}^{l} \cdot s_t^l$. The output-aware importance $c_{i,t}^{l}$ multiplies the routing gate value by the $L_2$ norm of the expert output, formulated as $c_{i,t}^{l} = g_{i,t}^{l} \| E_{i,t}^{l}(h_t^l) \|_2$. The token contribution $s_t^l$ is defined as $1 - \text{Sim}(h_t^l, \tilde{h}_t^l)$, where $\text{Sim}$ denotes cosine similarity between hidden representations preceding ($h_t^l$) and following ($\tilde{h}_t^l$) the expert module.
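The EASY-EP score can be sketched directly from the three formulas above. This is an illustrative reading of those formulas rather than the reference implementation; the function names and toy vectors are hypothetical, and hidden states are assumed to be non-zero so that cosine similarity is defined.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))


def easy_ep_score(gates, outputs, hidden_before, hidden_after):
    """Sum over tokens of c (gate times output norm) times s (1 - cosine similarity)."""
    score = 0.0
    for g, out, h, h_tilde in zip(gates, outputs, hidden_before, hidden_after):
        c = g * l2_norm(out)                     # output-aware importance
        s = 1.0 - cosine_similarity(h, h_tilde)  # token contribution
        score += c * s
    return score


# Hypothetical values for T = 2 tokens.
gates = [0.5, 0.5]
outputs = [[3.0, 4.0], [3.0, 4.0]]
hidden_before = [[1.0, 0.0], [1.0, 0.0]]
hidden_after = [[2.0, 0.0], [0.0, 1.0]]  # token 1 only rescaled, token 2 rotated

print(easy_ep_score(gates, outputs, hidden_before, hidden_after))  # 2.5
```

In this toy case the first token contributes nothing because the expert module only rescales the hidden state, while the second token contributes its full output-aware importance because the hidden state is rotated to an orthogonal direction.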

#### REAP\.

Router-weighted Expert Activation Pruning (Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34)) extends EAN by integrating gating values and normalizing by activation frequency. This metric isolates experts with the most substantial per-token contribution via the formulation:

$$\frac{\sum_{t=1}^{T} g_{i,t}^{l} \, \| E_{i,t}^{l}(h_t^l) \|_2}{\sum_{t=1}^{T} \mathbb{I}(g_{i,t}^{l} > 0)}.$$

This approach ensures that the importance score reflects the mean contribution per activation, identifying experts that maintain influence regardless of selection frequency.
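A minimal sketch of the REAP ratio follows, assuming toy inputs. Returning a score of zero for an expert that is never activated is an assumption of this sketch, since the formula is undefined in that case.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def reap_score(gates, outputs):
    """Gate-weighted output norm, averaged over the tokens that activate the expert."""
    n_active = sum(1 for g in gates if g > 0)
    if n_active == 0:
        # Assumption: an expert that is never selected receives the lowest score.
        return 0.0
    weighted = sum(g * l2_norm(out) for g, out in zip(gates, outputs))
    return weighted / n_active


# Hypothetical gating values and expert outputs for T = 3 tokens.
gates = [0.5, 0.0, 0.5]
outputs = [[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]

# (0.5 * 5 + 0 * 10 + 0.5 * 1) / 2 active tokens
print(reap_score(gates, outputs))  # 1.5
```

Because the sum is divided by the number of activations, an expert that is selected rarely but contributes strongly each time is not penalized for its low selection count.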

## 4 Experimental Setup

To investigate the empirical relationship between expert pruning and factual consistency, the evaluation framework comprises two distinct dimensions: utility, reflecting downstream task performance, and reliability, quantifying hallucination frequency. This joint analysis reveals whether configurations that maintain in-domain utility concurrently preserve reliability, or if reliability degrades more rapidly than utility as the pruning ratio increases. (Footnote 2: Please see Appendix [A](https://arxiv.org/html/2607.01444#A1) for implementation details.)

### 4.1 Models

We use four instruction-tuned MoE models: GPT-OSS 20B (OpenAI et al., [2025](https://arxiv.org/html/2607.01444#bib.bib45), GPT-OSS, 32 experts), Qwen3 30B Instruct 2507 (Yang et al., [2025a](https://arxiv.org/html/2607.01444#bib.bib59), Qwen3, 128 experts), Nemotron 3 Nano 30B (NVIDIA et al., [2025](https://arxiv.org/html/2607.01444#bib.bib44), Nemotron3, 128 experts), and Qwen3.6 35B (Qwen Team, [2026](https://arxiv.org/html/2607.01444#bib.bib48), Qwen3.6, 256 experts). This selection offers varying expert granularity and architectures such as the hybrid Mamba-Transformer Nemotron3.

### 4.2 MoE Pruning

We use MedINST (Han et al., [2024](https://arxiv.org/html/2607.01444#bib.bib20)) as the calibration set $\mathcal{C}$ for expert selection. MedINST is a comprehensive meta-dataset of biomedical instructions with diverse tasks, providing an ideal foundation to capture representative domain-specific activations. The calibration set contains 128 randomly sampled demonstrations from the training subset of MedINST, following the established practice in the pruning literature (Williams et al., [2025](https://arxiv.org/html/2607.01444#bib.bib56)). To mitigate sampling bias, we run three independent calibration experiments for each configuration. Furthermore, to systematically evaluate the impact of structural compression, we assess each pruning strategy across pruning ratios in 12.5% increments.
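To give a sense of what these ratios mean in practice, the sketch below applies $M = (1 - p)N$ to the expert counts listed in §4.1. Two points are assumptions of this example: that those counts are per MoE layer, and that the grid runs from 12.5% to 87.5% (the text states only the step size).

```python
from fractions import Fraction

# Expert counts as listed in Section 4.1 (assumed here to be per MoE layer).
EXPERT_COUNTS = {"GPT-OSS": 32, "Qwen3": 128, "Nemotron3": 128, "Qwen3.6": 256}

# Assumed grid: 12.5% steps from 12.5% to 87.5%.
PRUNING_RATIOS = [Fraction(k, 8) for k in range(1, 8)]


def retained_experts(n_experts, pruning_ratio):
    """Return M = (1 - p) * N, the number of experts kept per layer."""
    kept = (1 - pruning_ratio) * n_experts
    # Exact fractions make it easy to check that the ratio divides the pool evenly.
    assert kept.denominator == 1, "ratio does not divide the expert pool evenly"
    return int(kept)


for name, n in EXPERT_COUNTS.items():
    row = [retained_experts(n, p) for p in PRUNING_RATIOS]
    print(f"{name:10s} N={n:3d} kept per ratio: {row}")
```

Under these assumptions, a 50% ratio leaves 16 experts per layer for GPT-OSS but 128 for Qwen3.6, which illustrates how differently the same ratio constrains small and large expert pools.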

### 4.3 Utility

#### Generation Tasks.

We employ a MedINST evaluation subset requiring full-text generation. (Footnote 3: This subset does not overlap with the calibration set, ensuring a fair comparison.) The selected categories are summarization (SUM), machine translation (MT), question answering (QA), named entity recognition (NER), named entity disambiguation (NED), relation extraction (RE), coreference resolution (COREF), and event extraction (EE). We compute zero-shot ROUGE-L (Lin, [2004](https://arxiv.org/html/2607.01444#bib.bib38)) for SUM, chrF++ (Popović, [2017](https://arxiv.org/html/2607.01444#bib.bib47)) for MT, and F1 for all other tasks.

#### Classification Tasks\.

To complement the generative evaluation, we follow Williams et al. ([2026](https://arxiv.org/html/2607.01444#bib.bib57)) to assess discriminative utility using the MultiMedQA benchmark (Singhal et al., [2023](https://arxiv.org/html/2607.01444#bib.bib51)). The benchmark comprises several multiple-choice QA tasks: PubMedQA (Jin et al., [2019](https://arxiv.org/html/2607.01444#bib.bib30)), MedQA (Jin et al., [2021](https://arxiv.org/html/2607.01444#bib.bib29)), and relevant subsets from MMLU (anatomy, clinical knowledge, college medicine, medical genetics, professional medicine, and college biology) (Hendrycks et al., [2021](https://arxiv.org/html/2607.01444#bib.bib22)). We measure zero-shot accuracy for all tasks.

### 4.4 Reliability

#### Generation Tasks\.

We measure semantic consistency using the Multi-XScience (Lu et al., [2020](https://arxiv.org/html/2607.01444#bib.bib41)) and RCT (Wallace et al., [2021](https://arxiv.org/html/2607.01444#bib.bib53)) benchmarks. Multi-XScience, a part of MedINST, involves generating a related work section based on an abstract and reference articles. The RCT dataset contains randomized controlled trial reports and serves as a high-stakes test for medical summary accuracy.

We use an LLM-as-a-Judge framework (Zheng et al., [2023](https://arxiv.org/html/2607.01444#bib.bib65)) comprising three frontier models ([gpt-5.4-mini](https://developers.openai.com/api/docs/models/gpt-5.4-mini), [Claude Haiku 4.5](https://www.anthropic.com/claude-haiku-4-5-system-card), and [Gemini 3.1 Flash-Lite](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Flash-Lite-Model-Card.pdf)) to minimize inter-model variability. The evaluation includes two methods: absolute judgment, which classifies outputs as either faithful or hallucinated, and relative judgment (Lango and Dusek, [2023](https://arxiv.org/html/2607.01444#bib.bib33); Chrysostomou et al., [2024](https://arxiv.org/html/2607.01444#bib.bib10)), which compares summaries from source and pruned models across four criteria:

1. Hallucinations (↓): Frequency of content unsupported by the source document.
2. Omission (↓): Exclusion of critical source information.
3. Repetition (↓): Presence of redundant text.
4. Alignment (↑): Degree of semantic correspondence with the source document.

For relative judgment, we report the preference rate, calculated as the proportion of instances favoring the pruned model over the source model, where a rate above 0.5 denotes preference for the pruned model. We also compute standard metrics such as ROUGE-L and BERTScore (Zhang et al., [2020](https://arxiv.org/html/2607.01444#bib.bib63)) to contrast lexical and semantic overlap with these judgments, following Chrysostomou et al. ([2024](https://arxiv.org/html/2607.01444#bib.bib10)).
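The preference rate can be illustrated with a short sketch. The label strings and the example verdicts are hypothetical. The sketch also assumes that each instance yields a single binary verdict, since the text does not state how ties or disagreements among the three judges are resolved.

```python
def preference_rate(verdicts):
    """Proportion of instances in which the judge favored the pruned model.

    verdicts: list of labels, each either "pruned" or "source" (hypothetical labels).
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    favored_pruned = sum(1 for v in verdicts if v == "pruned")
    return favored_pruned / len(verdicts)


# Hypothetical verdicts for five summaries on one criterion.
verdicts = ["pruned", "source", "source", "pruned", "source"]
rate = preference_rate(verdicts)
print(rate)        # 0.4
print(rate > 0.5)  # False: the source model is preferred on this criterion
```

The rate is always read against the 0.5 threshold, meaning that more than half of the instances were judged in favor of the pruned model; what a favorable verdict means for each of the four criteria follows the judging protocol, which this sketch does not model.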

#### Classification Tasks\.

For discriminative reliability, we use three reasoning tasks from the Medical Domain Hallucination Test (Pal et al., [2023](https://arxiv.org/html/2607.01444#bib.bib46), MedHALT): the False Confidence Test (FCT), the Fake Questions Test (Fake), and the None of the Above Test (NOTA). The FCT provides a randomly suggested answer alongside the question to assess whether the model exhibits unwarranted certainty. The Fake task presents nonsensical medical questions to evaluate whether the model can recognize invalid queries. Finally, the NOTA task replaces the correct option with “None of the above” to test whether the model can reject incorrect information. Collectively, these tests determine whether the model resists hallucination when a false answer is proposed, a query is fundamentally flawed, or a correct solution is absent.

### 4.5 General-domain Analysis

To complement our domain\-specific evaluation, we monitor the impact of expert pruning on general\-domain benchmarks\. This analysis determines whether the relationship between utility and reliability in the biomedical domain persists across a broader context\.

#### Utility\.

We use four benchmarks: IFEval (Zhou et al., [2023](https://arxiv.org/html/2607.01444#bib.bib66)) zero-shot accuracy for instruction-following, GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2607.01444#bib.bib11)) five-shot exact match for math reasoning, HumanEval (Chen et al., [2021](https://arxiv.org/html/2607.01444#bib.bib5)) zero-shot pass@1 for coding, and MMLU three-shot accuracy for general reasoning.

#### Reliability\.

We use Multi-News+ (Choi et al., [2024](https://arxiv.org/html/2607.01444#bib.bib9)) to measure the ability of the model to summarize multiple documents for a given topic. We apply the same LLM-as-a-Judge protocol from §[4.4](https://arxiv.org/html/2607.01444#S4.SS4).

## 5 Results

### 5.1 Biomedical Domain Utility

Figure [2](https://arxiv.org/html/2607.01444#S5.F2) shows the relative performance retention across various compression levels.

![Refer to caption](https://arxiv.org/html/2607.01444v1/x2.png)

Figure 2: Downstream performance comparison across pruning ratios. Lines denote mean performance relative to the source baseline (100%), with shaded areas indicating standard deviation across three random seeds. Task-specific results at a 50% pruning ratio are available in Tables [7](https://arxiv.org/html/2607.01444#A3.T7) and [8](https://arxiv.org/html/2607.01444#A3.T8) in the Appendix.

![Refer to caption](https://arxiv.org/html/2607.01444v1/x3.png)

Figure 3: Biomedical reliability results across pruning ratios. Dashed lines denote baseline unpruned models for absolute scores, or the 0.5 preference threshold for relative comparisons. The average inter-annotator agreements (Fleiss' $\kappa$) are 0.52 for absolute judgments and 0.46 for relative judgments (moderate agreement). Tables [4](https://arxiv.org/html/2607.01444#A3.T4) and [5](https://arxiv.org/html/2607.01444#A3.T5) in the Appendix provide agreement breakdowns. Figure [7](https://arxiv.org/html/2607.01444#A3.F7) in the Appendix shows the corresponding summarization metrics.

#### Pruning Method.

We observe that expert pruning strategies influence in-domain performance retention. Random pruning causes severe performance degradation across all settings. At a 50% pruning ratio, a standard evaluation threshold in prior MoE pruning work (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34)), Random pruning leads to relative utility drops ranging from 18.8% for Qwen3.6 to 75.1% for GPT-OSS on generation tasks.
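The retention and drop figures in this section are relative to the unpruned baseline (100%), so a relative drop of 18.8% corresponds to 81.2% retention. The sketch below shows one way such numbers could be derived from per-seed scores. The scores are hypothetical, and the use of the sample standard deviation across the three seeds is an assumption.

```python
from statistics import mean, stdev


def retention_percent(pruned_score, baseline_score):
    """Pruned-model score expressed as a percentage of the unpruned baseline."""
    return 100.0 * pruned_score / baseline_score


def summarize(seed_scores, baseline_score):
    """Return (mean retention, spread across seeds, relative drop), all in percent."""
    retained = [retention_percent(s, baseline_score) for s in seed_scores]
    avg = mean(retained)
    # Sample standard deviation is an assumption; the source does not specify it.
    return avg, stdev(retained), 100.0 - avg


# Hypothetical task scores for three calibration seeds and an unpruned baseline.
avg, spread, drop = summarize([0.40, 0.42, 0.44], baseline_score=0.50)
print(f"retention {avg:.1f}% +/- {spread:.1f}, relative drop {drop:.1f}%")
# retention 84.0% +/- 4.0, relative drop 16.0%
```

Retention above 100%, such as the 106.8% reported for EASY-EP on classification tasks, simply means that the pruned model scored higher than the baseline on that benchmark.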

In contrast, non\-random strategies retain performance comparable to unpruned baselines\. Most of these strategies yield similar results, with a maximum variance of 5\.2% in generation tasks and 17\.2% in classification tasks\. However, EAN often lags behind other methods in classification tasks, underperforming the best data\-driven metric \(EASY\-EP\) by an average of 23\.7% across models, with the gap reaching 36\.0% in GPT\-OSS\.

This observation aligns with recent findings (Dong et al., [2025](https://arxiv.org/html/2607.01444#bib.bib13); Lasby et al., [2026](https://arxiv.org/html/2607.01444#bib.bib34)). Relying solely on raw activation norms forces EAN to overemphasize output scale instead of true token-level utility. Because EAN does not consider the gating confidence of the router ($g_{i,t}^{l}$ in REAP) or the representational shift ($s_t^l$ in EASY-EP), it retains experts that produce large but contextually unhelpful transformations. This generates suboptimal utility. Consequently, EAN experiences an earlier performance drop compared to other approaches at a 62.5% pruning ratio. Importantly, this earlier utility decline corresponds to an earlier onset of hallucination risks across models, as discussed in §[5.2](https://arxiv.org/html/2607.01444#S5.SS2).

#### Model Family\.

Pruning behavior also varies across expert granularities and model architectures\. Models with large expert pools, like Qwen3 and Qwen3\.6, tolerate a 50% pruning ratio well\. For Qwen3\.6, all data\-driven approaches achieve nearly 100% performance retention on generation tasks, and EASY\-EP even reaches 106\.8% on classification tasks\. In contrast, models with fewer experts, like GPT\-OSS, suffer greater degradation because each expert holds more network capacity\. This vulnerability is especially evident on classification tasks, where no configuration exceeds 91\.1%\.

Beyond expert count, architectural complexity introduces another vulnerability\. Nemotron3 models show a consistent performance decline, underperforming the baseline by at least 6\.3% \(REAP\)\. We hypothesize that expert pruning disrupts the balance within this hybrid architecture, which combines Transformer attention, Mamba state\-space layers, and MoE feed\-forward networks\. This instability is critical in classification tasks, which rely on precise internal state propagation to form accurate decision boundaries\.

#### Pruning Ratio\.

For generation tasks, performance remains robust up to a 50% pruning ratio\. Context\-aware methods like EASY\-EP and REAP often match or exceed the unpruned baseline at ratios up to 75%\. For instance, at a 75% pruning ratio, Qwen3\.6 under REAP retains 96\.7% of its baseline utility \(a negligible 3\.3% drop\)\. In contrast, classification tasks show a gradual performance decline before dropping severely\.
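The retention and drop figures quoted throughout this section are two views of the same ratio. As a minimal sketch, assuming retention is simply the pruned score divided by the unpruned baseline score (the raw scores below are hypothetical and chosen only to reproduce the 96\.7% / 3\.3% pairing):

```python
def relative_retention(pruned_score: float, baseline_score: float) -> float:
    """Pruned score as a percentage of the unpruned baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * pruned_score / baseline_score


def relative_drop(pruned_score: float, baseline_score: float) -> float:
    """Percentage of baseline utility lost after pruning."""
    return 100.0 - relative_retention(pruned_score, baseline_score)


# Hypothetical raw scores, not values reported in the paper.
baseline, pruned = 0.600, 0.5802
print(f"retention: {relative_retention(pruned, baseline):.1f}%")
print(f"drop:      {relative_drop(pruned, baseline):.1f}%")
```

Under this definition a retention above 100% (such as the 106\.8% reported for EASY\-EP on classification) simply means the pruned model scored higher than the baseline.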

![Refer to caption](https://arxiv.org/html/2607.01444v1/x4.png)

Figure 4: MedHALT evaluation results across pruning ratios and approaches\. Dashed lines represent unpruned baseline model performances\.

### 5\.2Biomedical Domain Reliability

#### Pruning Method\.

Most pruning methods preserve baseline reliability metrics up to moderate compression levels \(a 50% pruning ratio\) in\-domain\. However, we observe that Random and EAN pruning degrade earlier; Random pruning causes immediate degradation even under moderate compression\. On FCT classification, accuracy under Random pruning falls to near zero across all pruning ratios \(Figure[4](https://arxiv.org/html/2607.01444#S5.F4)\)\. On generation tasks \(Figure[3](https://arxiv.org/html/2607.01444#S5.F3)\), Random pruning causes severe omissions that standard summarization metrics often mask\. For instance, on RCT at a 37\.5% pruning ratio, Qwen3\.6 under Random pruning maintains a ROUGE\-L of \.135 and BERTScore of \.853 \(vs\. \.134 and \.853 for the baseline\)\. Yet, its relative alignment drops to \.329 due to high omissions \(\.790\)\. EAN also exhibits early reliability degradation under moderate pruning compared to other data\-driven approaches\. For example, on FCT, Qwen3 accuracy drops to 20\.7% under EAN at a 50% ratio, compared to 47\.2% under EASY\-EP\.

Observation 1: Coherence vs\. Completeness\. Random pruning can result in high omission and repetition rates even under moderate compression\. Standard summarization metrics fail to identify cases where summaries remain structurally fluent but omit key information, creating a false impression of stability\.

#### Model Family\.

Echoing the utility findings \(§[5\.1](https://arxiv.org/html/2607.01444#S5.SS1)\), models with large expert pools, such as Qwen3 and Qwen3\.6, exhibit remarkable stability up to moderate pruning ratios \(50%\)\. This robustness particularly holds under context\-aware pruning methods \(EASY\-EP, REAP\), where both generative and discriminative benchmarks remain stable and match baseline performance\.

In contrast, models with fewer experts show earlier degradation\. GPT\-OSS exhibits reliability degradation at a 50% pruning ratio\. Under EASY\-EP at this ratio, GPT\-OSS suffers from substantial factual consistency decay on both RCT and Multi\-XScience, despite minimal decline in both task\-specific and overall utility metrics\. For instance, on Multi\-XScience, the model suffers from high repetitions \(\.815\), alongside an increase in the absolute hallucination rate to 17\.3% \(compared to 10\.5% at the 37\.5% ratio\)\. However, task\-specific summarization metrics remain stable\. ROUGE\-L and BERTScore are \.102 and \.817 on Multi\-XScience \(vs\. \.100 and \.816 for the baseline\)\. Likewise, overall generation utility retains 96\.9% of baseline performance \(a minor 3\.1% drop\), demonstrating that factual reliability decay can occur even with minimal impact on standard task\-specific and overall utility measurements\.

Observation 2: MoE Capacity Bottleneck\. Under moderate pruning ratios \($\approx 50\%$\), models with smaller expert pools can exhibit factual reliability degradation earlier than models with larger expert pools\. This decline occurs before standard utility metrics begin to degrade\.

Nemotron3, while stable under low pruning ratios of 25% or lower, undergoes gradual reliability degradation at the moderate stage of 50%\. On RCT under EASY\-EP, it exhibits increased information omissions \(\.598\)\. On Multi\-XScience, it experiences a decline in relative alignment \(\.364\) accompanied by elevated relative hallucinations \(\.625\) and repetitions \(\.624\)\. Yet, summarization metrics on Multi\-XScience remain largely unaffected: ROUGE\-L is \.137 and BERTScore is \.838, compared to the baseline scores of \.148 and \.845\. This stability hides the factual consistency decay as the absolute hallucination rate rises to 44\.7% from 20\.1% for the source model\. However, overall generation utility successfully detects this degradation, underperforming the baseline by 7\.9% at 50% pruning\. This suggests that physical model constraints \(e\.g\., the hybrid MoE\-Mamba\-Attention topology of Nemotron3\) drive a coupled, parallel decay in overall utility and reliability, even when task\-specific metrics remain stable\.

#### Pruning Ratio\.

Extreme pruning \(62\.5% or higher\) increases reliability risks across all models, mirroring utility drops\. However, the threshold for reliability preservation depends on the task\. For instance, tasks like Fake, NOTA, and RCT often maintain baseline reliability even at 75% compression\. Conversely, tasks like FCT and Multi\-XScience exhibit earlier, gradual degradation\. We attribute this divergence to different reasoning demands\. Simple classification \(Fake and NOTA\) and single\-source extraction \(RCT\) may rely on broad semantic matching shared across many experts\. In contrast, precise fact retrieval \(FCT\) and multi\-source synthesis \(Multi\-XScience\) depend on specific experts, making them vulnerable to pruning\.

Crucially, extreme compression decouples utility and reliability on Multi\-XScience for GPT\-OSS, Qwen3, and Qwen3\.6\. For instance, under EASY\-EP at 75% pruning, the relative hallucination rate for Qwen3\.6 decreases to \.204 from \.507 at 50% pruning\. Yet, the corresponding absolute rate increases from 11\.9% to 17\.9%\. This apparent relative improvement is due to an artifact of summary collapse\. A summary length audit reveals that at 75% pruning, average lengths drop from baselines for Qwen3\.6 \(204\.9 to 97\.4 words\), Qwen3 \(367\.8 to 262\.7\), and GPT\-OSS \(382\.9 to 279\.8\)\. Consequently, relative omission rates surge above \.630 for these models\. The shorter the summary, the lower the absolute number of errors it will contain, lowering the probability of incurring a penalty during relative hallucination evaluation\. However, the absolute hallucination rate rises because the actual summary consists largely of ungrounded fabrications\. Standard summarization metrics mask this collapse; ROUGE\-L and BERTScore remain largely stable \(e\.g\., Qwen3\.6 records \.122 and \.833, compared to \.152 and \.846 at 50% pruning\)\.

Observation 3: Summary Collapse\. Extreme pruning artificially lowers relative hallucination rates while standard summarization metrics remain stable\. This stems from summary collapse; degenerate outputs omit critical information and make fewer assertions, rendering them less faithful to the source material\.

### 5\.3General\-domain Analysis

#### Utility\.

Figure[5](https://arxiv.org/html/2607.01444#S5.F5) presents the results on general\-domain benchmarks\. Across model families, expert pruning causes an immediate performance decline in the general domain, with overall utility dropping steadily as the pruning ratio increases\. This gradual decay contrasts with in\-domain utility, where performance remains highly robust up to 50%–75% pruning before collapsing suddenly\. GPT\-OSS is the least robust and shows an immediate average decline of 32\.4% even at 12\.5% pruning, probably due to its small expert pool\.

When comparing pruning methods, EAN occasionally beats others up to a 50% ratio, except for GPT\-OSS\. Because EAN selects experts by magnitude without domain\-specific weighting, it preserves general capabilities at the expense of in\-domain alignment\. This trade\-off becomes evident when examining expert selection overlap\. At 50% pruning, the overlap between EAN and other data\-driven methods is low, peaking at \.653 for Qwen3\.6 \(with the exception of Nemotron3 at \.714\)\. In contrast, pairwise overlap among the remaining methods is higher with a minimum of \.753\. Ultimately, this divergence occurs because domain\-specific calibration inherently removes general\-purpose experts that rarely activate during biomedical tasks\. Our multi\-domain calibration analysis in §[6](https://arxiv.org/html/2607.01444#S6) further corroborates this finding\.
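The overlap statistic can be made concrete with a short sketch. This section does not state how overlap is computed, so the version below is an assumption: at a fixed pruning ratio every method keeps the same number of experts, and overlap is taken as the fraction of one method's retained experts that the other method also retains. Expert indices and set names are hypothetical.

```python
def retained_overlap(retained_a: set, retained_b: set) -> float:
    """Fraction of experts kept by method A that method B also keeps."""
    if not retained_a:
        raise ValueError("retained_a must not be empty")
    return len(retained_a & retained_b) / len(retained_a)


# Hypothetical layer with 8 experts; each method keeps 4 (50% pruning).
kept_by_norm = {0, 1, 2, 3}   # stand-in for a norm-based selection
kept_by_gate = {2, 3, 4, 5}   # stand-in for a router-aware selection
kept_by_shift = {2, 3, 4, 6}  # stand-in for a shift-aware selection

print(retained_overlap(kept_by_norm, kept_by_gate))
print(retained_overlap(kept_by_gate, kept_by_shift))
```

When both sets have the same size this measure is symmetric; a Jaccard index would be a reasonable alternative if retained-set sizes differed.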

![Refer to caption](https://arxiv.org/html/2607.01444v1/x5.png)

Figure 5: General\-domain utility across pruning ratios\. Lines and shaded areas denote the mean performance relative to the unpruned source baseline \(100%\) and standard deviation, respectively\. Table[6](https://arxiv.org/html/2607.01444#A3.T6) in the Appendix provides task\-specific results at 50% pruning\.

![Refer to caption](https://arxiv.org/html/2607.01444v1/x6.png)

Figure 6: General\-domain reliability \(Multi\-News\+\) across pruning ratios\. Dashed lines denote unpruned baselines for absolute scores and the 0\.5 preference threshold for relative comparisons\. The inter\-annotator agreements \(Fleiss’ $\kappa$\) are 0\.58 for absolute judgments and 0\.44 for relative judgments \(moderate agreement\)\.

#### Reliability\.

Hallucination rates on Multi\-News\+ increase monotonically with pruning ratio \(Figure[6](https://arxiv.org/html/2607.01444#S5.F6)\)\. For non\-random methods, Spearman correlation coefficients for absolute and relative hallucination rates span $\rho=0.88$ to $0.93$ and $\rho=0.73$ to $0.87$, respectively\. Crucially, this degradation occurs much earlier than in the in\-domain setting\. Unlike in\-domain reliability, which remains largely stable through 50%–62\.5% pruning \(and up to 75% for Qwen3 and Qwen3\.6\) as observed in Figure[3](https://arxiv.org/html/2607.01444#S5.F3), Multi\-News\+ hallucination rates climb steadily\. For example, GPT\-OSS reaches an 11\.1% mean absolute hallucination rate at just 37\.5% pruning, diverging from a 1\.3% baseline\. Only Qwen3\.6 remains relatively robust until 62\.5% pruning, where it reaches 12\.2% \(baseline 2\.7%\)\.
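For readers who want to reproduce this kind of trend check, Spearman's $\rho$ is the Pearson correlation of rank-transformed values. The sketch below is a dependency-free illustration; how the paper pools models, methods, and seeds before computing $\rho$ is not stated here, and the hallucination rates in the example are placeholders rather than reported results.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only; not results from the paper.
pruning_ratio = [0.0, 12.5, 25.0, 37.5, 50.0, 62.5, 75.0]
hallucination_rate = [0.02, 0.03, 0.03, 0.06, 0.05, 0.11, 0.19]
print(round(spearman(pruning_ratio, hallucination_rate), 3))
```

A value below 1 here reflects the one tie and the one local reversal in the placeholder series, which is why a high but imperfect $\rho$ is compatible with a broadly increasing trend.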

By contrast, summarization metrics remain stable\. Both ROUGE\-L and BERTScore stay within 2% of the unpruned baseline for all models up to 62\.5% pruning \(Figure[7](https://arxiv.org/html/2607.01444#A3.F7) in the Appendix\)\. Because ROUGE\-L measures the longest common subsequence and ignores repetition, and BERTScore rewards topically plausible yet unfaithful text, both metrics obscure the underlying degradation of reliability\. This decoupling constitutes another form of summary collapse driven by domain shift\. At 62\.5% pruning, models tend to produce repetitive text with increasing relative repetition rates \(e\.g\., 77\.1% for Qwen3 under EASY\-EP\)\. This triggers a collapse in lexical diversity, measured via Type\-Token Ratio \(TTR\)\. For instance, from an unpruned baseline of \.765, the TTR of Qwen3 drops to \.731 at 62\.5% pruning and falls sharply to \.638 at 75%\. Except for Qwen3\.6, this repetition inflates mean summary length by up to 149 words without improving faithfulness\.
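Type\-Token Ratio is the number of distinct tokens divided by the total number of tokens, so repeated phrases pull it down. A minimal sketch follows; the tokenizer (lowercased alphanumeric word tokens) is an assumption, since the paper's tokenization for TTR is not specified in this section, and both example summaries are invented.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens, using lowercased word tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented summaries: one varied, one stuck in a repetition loop.
varied = "The trial enrolled adults and reported fewer adverse events."
looping = (
    "The trial reported events. The trial reported events. "
    "The trial reported events."
)

print(round(type_token_ratio(varied), 3))
print(round(type_token_ratio(looping), 3))
```

Because TTR also falls as texts get longer, comparisons are most meaningful between outputs of similar length or with a length-normalized variant.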

## 6Analysis

#### Correlation between LLM\-as\-a\-Judge and Human Evaluation\.

To validate the reliability of our evaluation framework, we conduct a manual annotation study on 30 samples from RCT and Multi\-XScience\. Three human evaluators annotate each sample for comparison against the judgments of our three LLM judges\. We stratify the reliability agreement \(Cohen’s $\kappa$\) by pruning ratio windows: Low \($p\leq 25.0\%$\), Moderate \($37.5\%\leq p\leq 50.0\%$\), and Extreme \($p\geq 62.5\%$\), as shown in Table[1](https://arxiv.org/html/2607.01444#S6.T1)\.

Table 1: Reliability agreement \(Cohen’s $\kappa$\) across expert pruning ratio windows\. $\kappa$ is computed pairwise for all pairs within a category \(3 pairs for Human–Human and LLM–LLM, and 9 pairs for Human–LLM\) and then averaged\. Relative agreement is pooled across questions \(Q1–Q4\) for each pair before computing the metrics\.

Under extreme pruning ratios, we find substantial agreement across all groups \(e\.g\., human\-LLM absolute agreement reaches 82\.1% with $\kappa=0.644$\)\. Severe quality degradation \(e\.g\., obvious hallucinations, omissions, and repetitions\) drives this straightforward evaluation for both humans and LLMs\. Under low and moderate pruning, however, the differences are subtle\. Relative preference exhibits substantial variance across both windows \($\kappa=0.238$ and $\kappa=-0.012$ for human\-LLM\)\. This variance reflects the inherent difficulty of choosing between outputs of similar quality before severe degradation occurs\. For absolute judgments, agreement is high during moderate compression \($\kappa=0.738$ for human\-LLM\) but diminishes under low compression \($\kappa=0.006$\), suggesting that LLM judges diverge from humans when evaluating near\-baseline outputs\. Overall, the strong alignment between human and LLM judges in detecting severe degradations justifies our automated evaluation setup for generation reliability assessment\.
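The pairwise averaging described in the Table 1 caption can be sketched as follows. This is an illustrative implementation of unweighted Cohen's $\kappa$ with within-group and cross-group averaging; the label lists are made up, and details such as how the paper pools Q1–Q4 before computing $\kappa$ are not modeled.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters):
    """Average kappa over all unordered pairs within one group of raters."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def mean_cross_kappa(group_a, group_b):
    """Average kappa over every (rater in A, rater in B) pair."""
    scores = [cohen_kappa(a, b) for a in group_a for b in group_b]
    return sum(scores) / len(scores)


# Made-up binary faithfulness labels (1 = unfaithful) for six items.
humans = [[1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0], [1, 0, 1, 1, 0, 1]]
judges = [[1, 0, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0], [1, 0, 0, 1, 0, 0]]

print("human-human:", round(mean_pairwise_kappa(humans), 3))
print("llm-llm:    ", round(mean_pairwise_kappa(judges), 3))
print("human-llm:  ", round(mean_cross_kappa(humans, judges), 3))
```

With three raters per group this yields 3 within-group pairs and 9 cross-group pairs, matching the pair counts in the caption. Because $\kappa$ corrects for chance agreement, raw agreement can be high while $\kappa$ stays near zero when one label dominates.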

Table 2: Utility and reliability metrics for Qwen3\.6 EASY\-EP pruned models, contrasting biomedical\-only calibration against dual\-domain calibration at 50% and 75% ratios, alongside the impact of 4\-bit GPTQ quantization\. Bold text denotes superior performance within each pairing; subscripts indicate standard deviation across three seeds\.

#### Multi\-domain Calibration\.

Given the substantial impact of calibration data domains on utility and reliability, we compare biomedical\-only calibration against dual\-domain calibration, which additionally uses general\-domain calibration data \(128 Dolci\-Instruct\-SFT\-No\-Tools samples; Team Olmo et al\.,[2025](https://arxiv.org/html/2607.01444#bib.bib52)\)\. Following Dong et al\. \([2025](https://arxiv.org/html/2607.01444#bib.bib13)\), we compute a saliency metric \(§[3\.2](https://arxiv.org/html/2607.01444#S3.SS2)\) for each domain and average them\. We analyze this calibration behavior at moderate \(50%\) and extreme \(75%\) pruning ratios using Qwen3\.6 with EASY\-EP, as this combination represents the most effective pruning configuration\.
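The averaging step can be sketched in a few lines. This is a simplified illustration under stated assumptions: saliency scores are already computed per expert for each domain, the average is unweighted with no per-domain normalization, and pruning keeps the top-scoring experts within a single layer. The expert names and scores are hypothetical.

```python
def average_saliency(per_domain_scores):
    """Average each expert's saliency across calibration domains."""
    experts = per_domain_scores[0].keys()
    return {
        e: sum(scores[e] for scores in per_domain_scores) / len(per_domain_scores)
        for e in experts
    }


def experts_to_keep(saliency, pruning_ratio):
    """Keep the highest-saliency experts after dropping the given fraction."""
    n_keep = len(saliency) - int(len(saliency) * pruning_ratio)
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    return set(ranked[:n_keep])


# Hypothetical saliency scores for four experts in one layer.
biomedical = {"e0": 0.9, "e1": 0.7, "e2": 0.1, "e3": 0.2}
general = {"e0": 0.3, "e1": 0.1, "e2": 0.8, "e3": 0.2}

print(sorted(experts_to_keep(biomedical, 0.5)))
print(sorted(experts_to_keep(average_saliency([biomedical, general]), 0.5)))
```

In this toy case dual\-domain averaging swaps a biomedical\-leaning expert for a general\-purpose one, which mirrors the utility trade\-off examined in the results that follow.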

As shown in Table[2](https://arxiv.org/html/2607.01444#S6.T2), dual\-domain calibration mitigates general utility degradation, boosting it from \.387 to \.597 at 75% pruning\. Yet, this improvement incurs a trade\-off in biomedical utility \(e\.g\., at 75% pruning, MultiMedQA drops from \.665 to \.563\)\. For reliability, dual\-domain calibration performs comparably to the biomedical\-only method at 50% pruning\. It maintains low absolute hallucination rates, degrading by a maximum of \.029 on Multi\-News\+ compared to the unpruned baseline, and yields similar pairwise preference win\-rates\. Conversely, under extreme 75% pruning, the biomedical\-only calibration is more robust\. Under dual calibration, absolute hallucination rates surge across tasks, doubling the rates of MedINST \(\.358 vs\. \.179 on Multi\-XScience\)\. Pairwise metrics confirm this degradation: the 75% dual\-pruned model exhibits higher relative hallucinations and repetitions\. This outcome suggests that while multi\-domain calibration helps balance general and specialized utility under moderate pruning, specialized in\-domain calibration is critical for preserving factual reliability during extreme compression\.

#### Quantization\.

While quantization is orthogonal to expert pruning \(§[2](https://arxiv.org/html/2607.01444#S2)\), we examine whether utility and reliability trends remain consistent in quantized models\. Specifically, we apply 4\-bit GPTQ \(Frantar et al\.,[2023](https://arxiv.org/html/2607.01444#bib.bib16)\) weight quantization to the Qwen3\.6 EASY\-EP 50% pruned model\. As Table[2](https://arxiv.org/html/2607.01444#S6.T2) indicates, we observe no substantial utility difference between the unquantized and quantized configurations: the quantized model retains a biomedical utility of \.543 on generation and \.829 on classification tasks \(compared to \.554 and \.838 for the unquantized counterpart\)\. It also preserves \.676 of general\-domain utility, a minor decrease from the unquantized score of \.689\.

For reliability, quantization generally preserves consistency on biomedical benchmarks\. For instance, it maintains absolute hallucination rates of \.017 on RCT and \.125 on Multi\-XScience, against unquantized rates of \.015 and \.121\. However, in the general domain, quantization increases absolute hallucinations \(\.078 vs\. \.031\) and elevates relative errors\. Hallucination, omission, and repetition rates rise to \.547, \.553, and \.620 respectively, while relative alignment drops from \.647 to \.410\. These results demonstrate that while expert pruning and post\-training weight quantization can combine to achieve additional compression without compounding in\-domain performance loss, quantization can increase reliability risks in cross\-domain settings\.

#### Qualitative Analysis\.

Table 3: Output error examples on RCT and LLM judge absolute evaluations \(1\.0: unfaithful, 0\.0: faithful\)\.

To investigate reliability degradation, we analyze RCT generated summaries \(Table[3](https://arxiv.org/html/2607.01444#S6.T3)\)\. We identify three primary failure modes: \(1\) Unit hallucinations, swapping measurement units \(e\.g\., Qwen3 generating mg/ml instead of ng/ml at 25% Gate\); \(2\) Numerical errors, introducing factual mistakes \(e\.g\., GPT\-OSS reporting 179 patients as 70\-9 at 75% Frequency\); and \(3\) Instruction leakage, repeating the task template under extreme pruning \(e\.g\., Qwen3\.6 at 87\.5% EAN\)\. Evaluating these outputs reveals the varying sensitivities of absolute LLM judges\. While Claude identifies most errors, Gemini and GPT overlook the unit swap in Case 1\. This error notably occurs at a low 25% pruning ratio while utility remains intact\. This establishes that evaluating pruned models on utility alone is often inadequate for high\-stakes deployment; direct reliability assessments are critical to detect factual failures before utility degrades\.

## 7Conclusion

This paper presents the first comprehensive study on the impact of expert pruning in MoE models on both benchmark utility and factual reliability within the high\-stakes biomedical domain\. Our findings highlight that utility is not always indicative of reliability\. During domain shift or extreme pruning, models experience summary collapse and an increase in hallucination rates before standard utility metrics register a decline\. In high\-stakes domains, this trade\-off between utility and reliability requires careful consideration through comprehensive in\-domain and general evaluation\. Finally, while multi\-domain calibration and quantization offer benefits, they introduce reliability trade\-offs under high compression or cross\-domain settings\.

## Acknowledgment

We would like to thank Mingzi Cao, Vynska Amalia Permadi, and Samuel Lewis\-Lim for their annotation support\. We also appreciate initial guidance on hallucination evaluation from Timothee Mickus\. AY is supported by the Engineering and Physical Sciences Research Council \(EPSRC\) \[grant number EP/W524360/1\] and the Japan Student Services Organization \(JASSO\) Student Exchange Support Program \(Graduate Scholarship for Degree Seeking Students\)\.

## References

- Ansel et al\. \(2024\)Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C\. K\. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, and Soumith Chintala\. 2024\.[PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation](https://doi.org/10.1145/3620665.3640366)\.In*Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2*, ASPLOS ’24, page 929–947, New York, NY, USA\. Association for Computing Machinery\.
- Bai et al\. \(2025\)Sikai Bai, Haoxi Li, Jie Zhang, Zicong Hong, and Song Guo\. 2025\.[DiEP: Adaptive mixture\-of\-experts compression through differentiable expert pruning](https://proceedings.neurips.cc/paper_files/paper/2025/file/511c7fd69db9f1ce7492a57285975849-Paper-Conference.pdf)\.In*Advances in Neural Information Processing Systems*, volume 38, pages 56090–56115\. Curran Associates, Inc\.
- Cai et al\. \(2025\)Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang\. 2025\.[A survey on mixture of experts in large language models](https://doi.org/10.1109/TKDE.2025.3554028)\.*IEEE Transactions on Knowledge and Data Engineering*, 37\(7\):3896–3915\.
- Chen et al\. \(2025a\)I\-Chun Chen, Hsu\-Shen Liu, Wei\-Fang Sun, Chen\-Hao Chao, Yen\-Chang Hsu, and Chun\-Yi Lee\. 2025a\.[Retraining\-free merging of sparse MoE via hierarchical clustering](https://openreview.net/forum?id=hslOzRxzXL)\.In*Proceedings of the Forty\-second International Conference on Machine Learning*\.
- Chen et al\. \(2021\)Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert\-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N\. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba\. 2021\.[Evaluating large language models trained on code](http://arxiv.org/abs/2107.03374)\.*arXiv preprint*, arXiv:2107\.03374\.
- Chen et al\. \(2022\)Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei\. 2022\.[Task\-specific expert pruning for sparse mixture\-of\-experts](http://arxiv.org/abs/2206.00277)\.*arXiv preprint*, arXiv:2206\.00277\.
- Chen et al\. \(2025b\)Yuanteng Chen, Yuantian Shao, Peisong Wang, and Jian Cheng\. 2025b\.[EAC\-MoE: Expert\-selection aware compressor for mixture\-of\-experts large language models](https://doi.org/10.18653/v1/2025.acl-long.633)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 12942–12963, Vienna, Austria\. Association for Computational Linguistics\.
- Chen et al\. \(2025c\)Zhixuan Chen, Xing Hu, Dawei Yang, Zukang Xu, Xu Chen, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu\. 2025c\.[MoEQuant: Enhancing quantization for mixture\-of\-experts large language models via expert\-balanced sampling and affinity guidance](https://proceedings.mlr.press/v267/chen25aa.html)\.In*Proceedings of the 42nd International Conference on Machine Learning*, volume 267 of*Proceedings of Machine Learning Research*, pages 8245–8260\. PMLR\.
- Choi et al\. \(2024\)Juhwan Choi, JungMin Yun, Kyohoon Jin, and YoungBin Kim\. 2024\.[Multi\-news\+: Cost\-efficient dataset cleansing via LLM\-based data annotation](https://doi.org/10.18653/v1/2024.emnlp-main.2)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 15–29, Miami, Florida, USA\. Association for Computational Linguistics\.
- Chrysostomou et al\. \(2024\)George Chrysostomou, Zhixue Zhao, Miles Williams, and Nikolaos Aletras\. 2024\.[Investigating hallucinations in pruned large language models for abstractive summarization](https://doi.org/10.1162/tacl_a_00695)\.*Transactions of the Association for Computational Linguistics*, 12:1163–1181\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. 2021\.[Training verifiers to solve math word problems](http://arxiv.org/abs/2110.14168)\.*arXiv preprint*, arXiv:2110\.14168\.
- Dao \(2024\)Tri Dao\. 2024\.[FlashAttention\-2: Faster attention with better parallelism and work partitioning](https://openreview.net/forum?id=mZn2Xyh9Ec)\.In*Proceedings of the Twelfth International Conference on Learning Representations*\.
- Dong et al\. \(2025\)Zican Dong, Han Peng, Peiyu Liu, Xin Zhao, Dong Wu, Feng Xiao, and Zhifeng Wang\. 2025\.[Domain\-specific pruning of large mixture\-of\-experts models with few\-shot demonstrations](https://proceedings.neurips.cc/paper_files/paper/2025/file/958c676eeca0d8b58d46714a4c5eb615-Paper-Conference.pdf)\.In*Advances in Neural Information Processing Systems*, volume 38, pages 103552–103577\. Curran Associates, Inc\.
- Du et al\. \(2022\)Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier\-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Cui\. 2022\.[GLaM: Efficient scaling of language models with mixture\-of\-experts](https://proceedings.mlr.press/v162/du22c.html)\.In*Proceedings of the 39th International Conference on Machine Learning*, volume 162 of*Proceedings of Machine Learning Research*, pages 5547–5569\. PMLR\.
- Frantar and Alistarh \(2024\)Elias Frantar and Dan Alistarh\. 2024\.[QMoE: Sub\-1\-bit compression of trillion parameter models](https://proceedings.mlsys.org/paper_files/paper/2024/file/c74b624843218d9b6713fcf299d6d5e4-Paper-Conference.pdf)\.In*Proceedings of Machine Learning and Systems*, volume 6, pages 439–451\.
- Frantar et al\. \(2023\)Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh\. 2023\.[OPTQ: Accurate quantization for generative pre\-trained transformers](https://openreview.net/forum?id=tcbBPnfwxS)\.In*Proceedings of the Eleventh International Conference on Learning Representations*\.
- Fu et al\. \(2025a\)Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, and Pan Li\. 2025a\.[Pruning weights but not truth: Safeguarding truthfulness while pruning LLMs](https://doi.org/10.18653/v1/2025.findings-emnlp.1130)\.In*Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 20750–20768, Suzhou, China\. Association for Computational Linguistics\.
- Fu et al\. \(2025b\)Yao Fu, Xianxuan Long, Runchao Li, Haotian Yu, Mu Sheng, Xiaotian Han, Yu Yin, and Pan Li\. 2025b\.[Quantized but deceptive? a multi\-dimensional truthfulness evaluation of quantized LLMs](https://doi.org/10.18653/v1/2025.emnlp-main.1548)\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 30435–30458, Suzhou, China\. Association for Computational Linguistics\.
- Gao et al\. \(2023\)Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou\. 2023\.A framework for few\-shot language model evaluation\.[https://zenodo\.org/records/10256836](https://zenodo.org/records/10256836)\.
- Han et al\. \(2024\)Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, and Qingyu Chen\. 2024\.[MedINST: Meta dataset of biomedical instructions](https://doi.org/10.18653/v1/2024.findings-emnlp.482)\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 8221–8240, Miami, Florida, USA\. Association for Computational Linguistics\.
- He et al\. \(2023\)Shwai He, Run\-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, and Dacheng Tao\. 2023\.[Merging experts into one: Improving computational efficiency of mixture of experts](https://doi.org/10.18653/v1/2023.emnlp-main.907)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 14685–14691, Singapore\. Association for Computational Linguistics\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\. 2021\.[Measuring massive multitask language understanding](https://openreview.net/forum?id=d7KBjmI3GmQ)\.In*Proceedings of the Ninth International Conference on Learning Representations*\.
- Hong et al\. \(2024\)Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R\. Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, and Bo Li\. 2024\.[Decoding compressed trust: Scrutinizing the trustworthiness of efficient LLMs under compression](https://proceedings.mlr.press/v235/hong24a.html)\.In*Proceedings of the 41st International Conference on Machine Learning*, volume 235 of*Proceedings of Machine Learning Research*, pages 18611–18633\. PMLR\.
- Hu et al\. \(2026\)Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, and Jiayin Wang\. 2026\.[Mosaic pruning: A hierarchical framework for generalizable pruning of mixture\-of\-experts models](https://doi.org/10.1609/aaai.v40i26.39341)\.*Proceedings of the AAAI Conference on Artificial Intelligence*, 40\(26\):21885–21893\.
- Huang et al\. \(2025\)Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, and XIAOJUAN QI\. 2025\.[Mixture compressor for mixture\-of\-experts LLMs gains more](https://openreview.net/forum?id=hheFYjOsWO)\.In*Proceedings of the Thirteenth International Conference on Learning Representations*\.
- Ip and Vongthongsri \(2026\)Jeffrey Ip and Kritin Vongthongsri\. 2026\.[deepeval](https://github.com/confident-ai/deepeval)\.GitHub repository\.
- Jaiswal et al\. \(2025\)Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, and Xianzhi Du\. 2025\.[Finding fantastic experts in moes: A unified study for expert dropping strategies and observations](http://arxiv.org/abs/2504.05586)\.*arXiv preprint*, arXiv:2504\.05586\.
- Ji et al\. \(2023\)Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung\. 2023\.[Survey of hallucination in natural language generation](https://doi.org/10.1145/3571730)\.*ACM Computing Surveys*, 55\(12\)\.
- Jin et al\. \(2021\)Di Jin, Eileen Pan, Nassim Oufattole, Wei\-Hung Weng, Hanyi Fang, and Peter Szolovits\. 2021\.[What disease does this patient have? a large\-scale open domain question answering dataset from medical exams](https://doi.org/10.3390/app11146421)\.*Applied Sciences*, 11\(14\)\.
- Jin et al\. \(2019\)Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu\. 2019\.[PubMedQA: A dataset for biomedical research question answering](https://doi.org/10.18653/v1/D19-1259)\.In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 2567–2577, Hong Kong, China\. Association for Computational Linguistics\.
- Kim et al\. \(2021\)Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla\. 2021\.[Scalable and efficient MoE training for multitask multilingual models](http://arxiv.org/abs/2109.10465)\.*arXiv preprint*, arXiv:2109\.10465\.
- Kwon et al\. \(2023\) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica\. 2023\. [Efficient memory management for large language model serving with PagedAttention](https://doi.org/10.1145/3600006.3613165)\. In *Proceedings of the 29th Symposium on Operating Systems Principles*, SOSP ’23, pages 611–626, New York, NY, USA\. Association for Computing Machinery\.
- Lango and Dusek \(2023\)Mateusz Lango and Ondrej Dusek\. 2023\.[Critic\-driven decoding for mitigating hallucinations in data\-to\-text generation](https://doi.org/10.18653/v1/2023.emnlp-main.172)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 2853–2862, Singapore\. Association for Computational Linguistics\.
- Lasby et al\. \(2026\)Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, and Vithursan Thangarasa\. 2026\.[REAP the experts: Why pruning prevails for one\-shot moe compression](https://openreview.net/forum?id=ukGxWd2aDG)\.In*Proceedings of the Fourteenth International Conference on Learning Representations*\.
- Lhoest et al\. \(2021\)Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan\-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf\. 2021\.[Datasets: A community library for natural language processing](https://doi.org/10.18653/v1/2021.emnlp-demo.21)\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 175–184, Online and Punta Cana, Dominican Republic\. Association for Computational Linguistics\.
- Li et al\. \(2024a\)Pingzhi Li, Xiaolong Jin, Zhen Tan, Yu Cheng, and Tianlong Chen\. 2024a\.[QuantMoE\-Bench: Examining post\-training quantization for mixture\-of\-experts](http://arxiv.org/abs/2406.08155)\.*arXiv preprint*, arXiv:2406\.08155\.
- Li et al\. \(2024b\)Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi\-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen\. 2024b\.[Merge, then compress: Demystify efficient SMoE with hints from its routing policy](https://openreview.net/forum?id=eFWG9Cy3WK)\.In*Proceedings of the Twelfth International Conference on Learning Representations*\.
- Lin \(2004\)Chin\-Yew Lin\. 2004\.[ROUGE: A package for automatic evaluation of summaries](https://aclanthology.org/W04-1013/)\.In*Text Summarization Branches Out*, pages 74–81, Barcelona, Spain\. Association for Computational Linguistics\.
- Liu et al\. \(2026\)Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, and Xin Yuan\. 2026\.[EvoESAP: Non\-uniform expert pruning for sparse moe](http://arxiv.org/abs/2603.06003)\.*arXiv preprint*, arXiv:2603\.06003\.
- Lu et al\. \(2024\)Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li\. 2024\.[Not all experts are equal: Efficient expert pruning and skipping for mixture\-of\-experts large language models](https://doi.org/10.18653/v1/2024.acl-long.334)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6159–6172, Bangkok, Thailand\. Association for Computational Linguistics\.
- Lu et al\. \(2020\)Yao Lu, Yue Dong, and Laurent Charlin\. 2020\.[Multi\-XScience: A large\-scale dataset for extreme multi\-document summarization of scientific articles](https://doi.org/10.18653/v1/2020.emnlp-main.648)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 8068–8074, Online\. Association for Computational Linguistics\.
- Moor et al\. \(2023\)Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M\. Krumholz, Jure Leskovec, Eric J\. Topol, and Pranav Rajpurkar\. 2023\.[Foundation models for generalist medical artificial intelligence](https://doi.org/10.1038/s41586-023-05881-4)\.*Nature*, 616\(7956\):259–265\.
- Muzio et al\. \(2024\)Alexandre Muzio, Alex Sun, and Churan He\. 2024\.[SEER\-MoE: Sparse expert efficiency through regularization for mixture\-of\-experts](http://arxiv.org/abs/2404.05089)\.*arXiv preprint*, arXiv:2404\.05089\.
- NVIDIA et al\. \(2025\)NVIDIA, Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Anjulie Agrusa, Ankur Verma, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng\-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Cyril Meurillon, Damon Mosk\-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Evgeny Tsykunov, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frank Sun, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt, Gargi Prasad, George Armstrong, Gerald Shen, Gorkem Batmaz, Grigor Nalbandyan, Haifeng Qian, Harsh Sharma, Hayley Ross, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huizi Mao, Huy C Nguyen, Huy Q Nguyen, Iain Cunningham, Ido Galil, Ido Shahaf, Igor Gitman, Ilya 
Loshchilov, Itamar Schen, Itay Levy, Ivan Moshkov, Izik Golan, Izzy Putterman, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jian Zhang, Jiaqi Zeng, Jie Lou, Jimmy Zhang, Jinhang Choi, Jining Huang, Joey Conway, Joey Guman, John Kamalu, Johnny Greco, Jonathan Cohen, Joseph Jennings, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kai Xu, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khushi Bhardwaj, Kirthi Shankar, Krishna C\. Puvvada, Krzysztof Pawelec, Kumar Anik, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Li Ding, Lizzie Wei, Lucas Liebenwein, Luis Vega, Maanu Grover, Maarten Van Segbroeck, Maer Rodrigues de Melo, Mahdi Nazemi, Makesh Narsimhan Sreedhar, Manoj Kilaru, Maor Ashkenazi, Marc Romeijn, Marcin Chochowski, Mark Cai, Markus Kliegl, Maryam Moosaei, Matt Kulka, Matvei Novikov, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Meredith Price, Michael Andersch, Michael Boone, Michael Evans, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Minseok Lee, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Natalie Hereth, Nave Assaf, Negar Habibi, Neta Zmora, Netanel Haber, Nicola Sessions, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nir Ailon, Nirmal Juluru, Nishant Sharma, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Ouye Xie, Parth Chadha, Pasha Shamis, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pinky Xu, Piotr Januszewski, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Qing Miao, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachit Garg, Ran El\-Yaniv, Ran Zilberstein, Rasoul Shafipour, Rich Harang, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Hesse, Roger Waleffe, Rohit Watve, Roi Koren, Ruoxi Zhang, Russell 
Hewett, Russell J\. Hewett, Ryan Prenger, Ryan Timbrook, Sadegh Mahdavi, Sahil Modi, Samuel Kriman, Sangkug Lim, Sanjay Kariyappa, Sanjeev Satheesh, Saori Kaji, Satish Pasumarthi, Saurav Muralidharan, Sean Narentharen, Sean Narenthiran, Seonmyeong Bak, Sergey Kashirsky, Seth Poulos, Shahar Mor, Shanmugam Ramasamy, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shiqing Fan, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Simeng Sun, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Sugam Dipak Devare, Sumeet Kumar Barua, Suseella Panguluri, Suyog Gupta, Sweta Priyadarshi, Syeda Nahida Akter, Tan Bui, Teodor\-Dumitru Ene, Terry Kong, Thanh Do, Tijmen Blankevoort, Tim Moon, Tom Balough, Tomer Asida, Tomer Bar Natan, Tomer Ronen, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Venkat Srinivasan, Venmugil Elango, Victor Cui, Vijay Korthikanti, Vinay Rao, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenfei Zhou, Will Jennings, William Zhang, Wojciech Prazuch, Xiaowei Ren, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi\-Fu Wu, Yian Zhang, Yigong Qin, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Subara, Yoshi Suhara, Yubo Gao, Zach Moshe, Zhen Dong, Zhongbo Zhu, Zihan Liu, Zijia Chen, and Zijie Yan\. 2025\.[NVIDIA Nemotron 3: Efficient and open intelligence](http://arxiv.org/abs/2512.20856)\.*arXiv preprint*, arXiv:2512\.20856\.
- OpenAI et al\. \(2025\)OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K\. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano\-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily, Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D\. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song, Dane Stuckey, Zhiqing Sun, Philippe Tillet, Sam Toizer, Foivos Tsimpourlas, Nikhil Vyas, Eric Wallace, Xin Wang, Miles Wang, Olivia Watkins, Kevin Weil, Amy Wendling, Kevin Whinnery, Cedric Whitney, Hannah Wong, Lin Yang, Yu Yang, Michihiro Yasunaga, Kristen Ying, Wojciech Zaremba, Wenting Zhan, Cyril Zhang, Brian Zhang, Eddie Zhang, and Shengjia Zhao\. 2025\.[gpt\-oss\-120b & gpt\-oss\-20b model card](http://arxiv.org/abs/2508.10925)\.*arXiv preprint*, arXiv:2508\.10925\.
- Pal et al\. \(2023\)Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu\. 2023\.[Med\-HALT: Medical domain hallucination test for large language models](https://doi.org/10.18653/v1/2023.conll-1.21)\.In*Proceedings of the 27th Conference on Computational Natural Language Learning \(CoNLL\)*, pages 314–334, Singapore\. Association for Computational Linguistics\.
- Popović \(2017\)Maja Popović\. 2017\.[chrF\+\+: words helping character n\-grams](https://doi.org/10.18653/v1/W17-4770)\.In*Proceedings of the Second Conference on Machine Translation*, pages 612–618, Copenhagen, Denmark\. Association for Computational Linguistics\.
- Qwen Team \(2026\)Qwen Team\. 2026\.[Qwen3\.6\-35B\-A3B: Agentic coding power, now open to all](https://qwen.ai/blog?id=qwen3.6-35b-a3b)\.Blog post\.
- Red Hat AI and vLLM Project \(2024\) Red Hat AI and vLLM Project\. 2024\. [LLM Compressor](https://github.com/vllm-project/llm-compressor)\. GitHub repository\.
- Singh and Sajjad \(2025\)Manpreet Singh and Hassan Sajjad\. 2025\.[Interpreting the effects of quantization on LLMs](https://doi.org/10.18653/v1/2025.ijcnlp-long.123)\.In*Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics*, pages 2267–2281, Mumbai, India\. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics\.
- Singhal et al\. \(2023\)Karan Singhal, Shekoofeh Azizi, Tao Tu, S\. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole\-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner\-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S\. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan\. 2023\.[Large language models encode clinical knowledge](https://doi.org/10.1038/s41586-023-06291-2)\.*Nature*, 620\(7972\):172–180\.
- Team Olmo et al\. \(2025\)Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V\. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A\. Smith, and Hannaneh Hajishirzi\. 2025\.[Olmo 3](http://arxiv.org/abs/2512.13961)\.*arXiv preprint*, arXiv:2512\.13961\.
- Wallace et al\. \(2021\)Byron C\. Wallace, Sayantan Saha, Frank Soboczenski, and Iain J\. Marshall\. 2021\.[Generating \(Factual?\) Narrative Summaries of RCTs: Experiments with Neural Multi\-Document Summarization](https://pmc.ncbi.nlm.nih.gov/articles/PMC8378607/)\.In*AMIA Summits on Translational Science Proceedings*\.
- Wang et al\. \(2026\)Qianli Wang, Nils Feldhus, Pepa Atanasova, Fedor Splitt, Simon Ostermann, Sebastian Möller, and Vera Schmitt\. 2026\.[Can large language models still explain themselves? investigating the impact of quantization on self\-explanations](http://arxiv.org/abs/2601.00282)\.*arXiv preprint*, arXiv:2601\.00282\.
- Weidinger et al\. \(2022\) Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po\-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving, and Iason Gabriel\. 2022\. [Taxonomy of risks posed by language models](https://doi.org/10.1145/3531146.3533088)\. In *Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency*, FAccT ’22, pages 214–229, New York, NY, USA\. Association for Computing Machinery\.
- Williams et al\. \(2025\)Miles Williams, George Chrysostomou, and Nikolaos Aletras\. 2025\.[Self\-calibration for language model quantization and pruning](https://doi.org/10.18653/v1/2025.naacl-long.509)\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 10149–10167, Albuquerque, New Mexico\. Association for Computational Linguistics\.
- Williams et al\. \(2026\)Miles Williams, George Chrysostomou, Vitor Amancio Jeronymo, and Nikolaos Aletras\. 2026\.[Compressing language models for specialized domains](https://doi.org/10.18653/v1/2026.eacl-long.347)\.In*Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 7393–7415, Rabat, Morocco\. Association for Computational Linguistics\.
- Wolf et al\. \(2020\)Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush\. 2020\.[Transformers: State\-of\-the\-art natural language processing](https://doi.org/10.18653/v1/2020.emnlp-demos.6)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 38–45, Online\. Association for Computational Linguistics\.
- Yang et al\. \(2025a\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu\. 2025a\.[Qwen3 technical report](http://arxiv.org/abs/2505.09388)\.*arXiv preprint*, arXiv:2505\.09388\.
- Yang et al\. \(2024\)Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, and Bo Yuan\. 2024\.[MoE\-i2: Compressing mixture of experts models through inter\-expert pruning and intra\-expert low\-rank decomposition](https://doi.org/10.18653/v1/2024.findings-emnlp.612)\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 10456–10466, Miami, Florida, USA\. Association for Computational Linguistics\.
- Yang et al\. \(2025b\)Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, and Kun Kuang\. 2025b\.[Mix data or merge models? balancing the helpfulness, honesty, and harmlessness of large language model via model merging](https://proceedings.neurips.cc/paper_files/paper/2025/file/a3948805bc5220206f5bfc465de0b1b7-Paper-Conference.pdf)\.In*Advances in Neural Information Processing Systems*, volume 38, pages 112467–112496\. Curran Associates, Inc\.
- Yuan et al\. \(2023\)Yuping Yuan, Zhao You, Shulin Feng, Dan Su, Yanchun Liang, Xiaohu Shi, and Dong Yu\. 2023\.[Compressed MoE ASR Model Based on Knowledge Distillation and Quantization](https://doi.org/10.21437/Interspeech.2023-2544)\.In*Proceedings of the 24th Annual Conference of the International Speech Communication Association*, pages 3337–3341\.
- Zhang et al\. \(2020\)Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q\. Weinberger, and Yoav Artzi\. 2020\.[BERTScore: Evaluating text generation with BERT](https://openreview.net/forum?id=SkeHuCVFDr)\.In*Proceedings of the Eighth International Conference on Learning Representations*\.
- Zhang et al\. \(2025\)Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, and Jianfeng Gao\. 2025\.[Diversifying the expert knowledge for task\-agnostic pruning in sparse mixture\-of\-experts](https://doi.org/10.18653/v1/2025.findings-acl.4)\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 86–102, Vienna, Austria\. Association for Computational Linguistics\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. 2023\.[Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena](https://openreview.net/forum?id=uccHPGDlao)\.In*Proceedings of the Thirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track*\.
- Zhou et al\. \(2023\)Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou\. 2023\.[Instruction\-following evaluation for large language models](http://arxiv.org/abs/2311.07911)\.*arXiv preprint*, arXiv:2311\.07911\.
- Zhou et al\. \(2025\)Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, and Hehe Fan\. 2025\.[Dropping experts, recombining neurons: Retraining\-free pruning for sparse mixture\-of\-experts LLMs](https://doi.org/10.18653/v1/2025.findings-emnlp.820)\.In*Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 15169–15186, Suzhou, China\. Association for Computational Linguistics\.

## Appendix A Implementation Details

#### Software\.

We utilize Hugging Face \(HF\) datasets \(Lhoest et al\., [2021](https://arxiv.org/html/2607.01444#bib.bib35), v3\.6\.0\) for data preprocessing, alongside HF transformers \(Wolf et al\., [2020](https://arxiv.org/html/2607.01444#bib.bib58), v5\.5\.4\), FlashAttention\-2 \(Dao, [2024](https://arxiv.org/html/2607.01444#bib.bib12), v2\.7\.4\), and PyTorch \(Ansel et al\., [2024](https://arxiv.org/html/2607.01444#bib.bib1), v2\.10\.0\) for the pruning framework\. We employ lm\-evaluation\-harness \(Gao et al\., [2023](https://arxiv.org/html/2607.01444#bib.bib19), v0\.4\.10\) for IFEval, GSM8K, HumanEval, and MMLU evaluation\. For MedHALT, we use the official repository \(Pal et al\., [2023](https://arxiv.org/html/2607.01444#bib.bib46), Commit: bd4408a; [https://github\.com/medhalt/medhalt/tree/bd4408a16aff36626e934aa5b012edd9fa7b6194](https://github.com/medhalt/medhalt/tree/bd4408a16aff36626e934aa5b012edd9fa7b6194)\)\. For HaluEval, we adopt the dataset from the official HF repository \([https://huggingface\.co/datasets/pminervini/HaluEval](https://huggingface.co/datasets/pminervini/HaluEval)\)\. For RCT, we obtain the dataset from the official repository \(Wallace et al\., [2021](https://arxiv.org/html/2607.01444#bib.bib53), Commit: de10c27; [https://github\.com/bwallace/RCT\-summarization\-data/tree/de10c2712873efa1733859f2d7113af60427d7b2](https://github.com/bwallace/RCT-summarization-data/tree/de10c2712873efa1733859f2d7113af60427d7b2)\)\. Similarly, for Multi\-News\+, we load the dataset from the official repository \(Choi et al\., [2024](https://arxiv.org/html/2607.01444#bib.bib9), Commit: e347c8e; [https://github\.com/c\-juhwan/multi\_news\_plus/tree/e347c8eedb78a09b4971bc0011a8865e11dfafdb](https://github.com/c-juhwan/multi_news_plus/tree/e347c8eedb78a09b4971bc0011a8865e11dfafdb)\)\. Due to API cost constraints during LLM\-as\-a\-Judge evaluation, we randomly subsample 50 documents from the Multi\-News\+ test set for our evaluation\.
For MedINST, we execute the official script \(Han et al\., [2024](https://arxiv.org/html/2607.01444#bib.bib20), Commit: dc13b2b; [https://github\.com/aialt/MedINST/blob/dc13b2be29cdaae01cddceb6e4134e7dc6df1134/evaluation\.py](https://github.com/aialt/MedINST/blob/dc13b2be29cdaae01cddceb6e4134e7dc6df1134/evaluation.py)\)\. For absolute judgment, we leverage the pre\-configured `HallucinationMetric` of DeepEval \(Ip and Vongthongsri, [2026](https://arxiv.org/html/2607.01444#bib.bib26), v3\.9\.6\)\. Inference runs on the vLLM engine \(Kwon et al\., [2023](https://arxiv.org/html/2607.01444#bib.bib32), v0\.19\.1\) for faster generation\. For post\-training weight quantization analysis, we apply 4\-bit GPTQ using the `llmcompressor` library \(Red Hat AI and vLLM Project, [2024](https://arxiv.org/html/2607.01444#bib.bib49), v0\.12\.0\)\.
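
The random subsampling of 50 Multi\-News\+ test documents mentioned above can be made reproducible with a seeded generator\. The following is a minimal sketch, not the authors' script: the seed value, the function name `subsample_documents`, and the placeholder document list are assumptions introduced for illustration\.

```python
import random


def subsample_documents(documents, k=50, seed=0):
    """Draw k documents uniformly at random without replacement.

    The seed is a hypothetical choice; the paper does not report one.
    """
    if k > len(documents):
        raise ValueError("k exceeds the number of available documents")
    rng = random.Random(seed)  # local generator, leaves global RNG state alone
    return rng.sample(documents, k)


# Placeholder stand-in for the Multi-News+ test set (illustrative only).
test_set = [f"doc-{i}" for i in range(1000)]
subset = subsample_documents(test_set)
```

Using a local `random.Random` instance keeps the draw independent of any other code that touches the global random state, so the same 50 documents are selected on every run\.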

#### Hardware\.

We use a single NVIDIA A100 80GB GPU with CUDA 12\.9 for experiments\. Notably, the expert pruning process completes within 775 seconds \(using the largest model: Qwen3\.6 with the context\-aware EASY\-EP method\) and introduces no substantial computational overhead\.

#### Prompt Templates\.

We prioritize official prompt templates to maintain standardized evaluation conditions, while using custom templates only as required\. We use the default prompt templates for MultiMedQA, IFEval, GSM8K, HumanEval, and MMLU provided by lm\-evaluation\-harness\. For MedINST, MedHALT, and HaluEval, we utilize their official prompt templates\. For summarization tasks, including RCT and Multi\-News\+, we apply the custom prompt templates listed in Appendix [B](https://arxiv.org/html/2607.01444#A2)\. For LLM\-as\-a\-Judge and human evaluation, we employ the off\-the\-shelf DeepEval template for absolute judgment and a custom prompt based on Chrysostomou et al\. \([2024](https://arxiv.org/html/2607.01444#bib.bib10)\) for relative judgment; both are documented in Appendix [B](https://arxiv.org/html/2607.01444#A2)\.

#### Hyperparameters\.

During evaluation, we restrict the maximum sequence length to 4,096 tokens and set the temperature parameter to 0\.0, except for Multi\-News\+, which uses a sequence length of 8,192 tokens\. We disable the thinking mode for Qwen3\.6 and Nemotron3 and assign a low reasoning effort setting for GPT\-OSS by default\.

For quantization, we employ a W4A16 quantization scheme \(4\-bit weights and 16\-bit activations\) targeting all linear layers in the model\. The language modeling head \(`lm_head`\) and the MoE gate/router layers \(`gate` and `router` parameters\) are excluded from quantization and kept in bfloat16 precision\. For calibration, we use the same calibration set as in pruning \(i\.e\., 128 randomly sampled instances from MedINST\)\.
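
The layer selection rule described above can be sketched as a name\-based filter\. This is an illustrative approximation only: the regular expressions and the example module names are assumptions about how a checkpoint might name its modules, and the sketch does not call the `llmcompressor` API\.

```python
import re

# Hypothetical name patterns for modules kept in bfloat16.
# Real checkpoints may use different module names.
IGNORE_PATTERNS = [
    r"(^|\.)lm_head$",  # language modeling head
    r"(^|\.)gate$",     # MoE gate
    r"(^|\.)router$",   # MoE router
]


def is_quantized(module_name, is_linear):
    """Return True if the module would receive 4-bit weights under this rule."""
    if not is_linear:
        return False  # only linear layers are targeted
    return not any(re.search(p, module_name) for p in IGNORE_PATTERNS)
```

Anchoring each pattern to the final name component matters here: an expert projection such as `gate_proj` is an ordinary linear layer and is still quantized, whereas a module named exactly `gate` is treated as the router and skipped\.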

#### Human Evaluation Protocol\.

To validate the LLM\-as\-a\-Judge evaluation framework, we conduct a manual annotation study to compare LLM judgments with human annotations\. We recruit three PhD student volunteers with experience in natural language processing and computer science\. Because the evaluation protocol relies strictly on textual grounding, i\.e\., verifying whether statements in the generated summary are logically entailed by the provided source documents, rather than requiring external clinical diagnostic knowledge, these annotators are highly qualified to evaluate semantic alignment and omission errors\. The human annotators are provided with the same instructions and prompt guidelines as the LLM judges \(documented in Appendix [B](https://arxiv.org/html/2607.01444#A2)\)\. Specifically, for absolute judgment, we ask the annotators to classify the summaries generated by the pruned models as either faithful or hallucinated\. For relative judgment, they compare the summaries from the unpruned baseline against the pruned models across four dimensions \(hallucinations, omissions, repetition, and semantic alignment\)\. Both evaluations are conducted using only the provided source documents\.
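
Agreement among a fixed number of raters on categorical labels such as faithful versus hallucinated is commonly summarized with Fleiss' $\kappa$, the statistic reported for the LLM judges in Appendix C\. The sketch below implements the standard formula in plain Python; the function name and the input layout \(one list of labels per item\) are assumptions for illustration and do not reproduce the authors' evaluation code\.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items each labeled by the same number of raters.

    ratings: list of items, each a list of category labels (one per rater).
    """
    if not ratings:
        raise ValueError("no items to score")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    totals = Counter()      # label counts pooled over all items
    agreement_sum = 0.0     # running sum of per-item agreement P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items                       # observed agreement
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate agreement below chance\.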

The human evaluation is conducted on a subset of 30 randomly selected documents from the RCT and Multi\-XScience datasets\. The sample distribution of these 30 evaluation cases across datasets, generator models, pruning methods, and pruning ratios is structured as follows:

- Dataset Balance: 15 samples from RCT and 15 samples from Multi\-XScience\.
- Generator Model Balance:
  - GPT\-OSS: 8 samples \(4 RCT, 4 Multi\-XScience\)
  - Qwen3: 7 samples \(4 RCT, 3 Multi\-XScience\)
  - Qwen3\.6: 7 samples \(3 RCT, 4 Multi\-XScience\)
  - Nemotron3: 8 samples \(4 RCT, 4 Multi\-XScience\)
- Pruning Method Balance: 5 samples per pruning method, distributed evenly across all six approaches \(Random, Frequency, Gate, EAN, EASY\-EP, and REAP\)\.
- Pruning Ratio Balance:
  - Ratio 0\.125: 4 samples
  - Ratio 0\.250: 4 samples
  - Ratio 0\.375: 5 samples
  - Ratio 0\.500: 4 samples
  - Ratio 0\.625: 4 samples
  - Ratio 0\.750: 5 samples
  - Ratio 0\.875: 4 samples

All pruning ratios within each individual \(dataset, generator model, pruning method\) group are guaranteed to be distinct for the selected samples to ensure a representative and unbiased distribution across the entire pruning range\.
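
The marginal counts listed above can be cross\-checked mechanically: each breakdown should sum to the 30 evaluation cases, and the per\-generator splits should reproduce the 15/15 dataset balance\. The short sketch below encodes the reported counts; the variable and function names are illustrative and not taken from the paper\.

```python
TOTAL = 30

by_dataset = {"RCT": 15, "Multi-XScience": 15}

# Per-generator counts as (RCT, Multi-XScience) pairs.
by_generator = {
    "GPT-OSS": (4, 4),
    "Qwen3": (4, 3),
    "Qwen3.6": (3, 4),
    "Nemotron3": (4, 4),
}

by_method = {
    name: 5
    for name in ("Random", "Frequency", "Gate", "EAN", "EASY-EP", "REAP")
}

by_ratio = {0.125: 4, 0.250: 4, 0.375: 5, 0.500: 4, 0.625: 4, 0.750: 5, 0.875: 4}


def check_balance():
    """Return a dict of named consistency checks over the reported counts."""
    rct = sum(r for r, _ in by_generator.values())
    mxs = sum(m for _, m in by_generator.values())
    return {
        "dataset_total": sum(by_dataset.values()) == TOTAL,
        "generator_total": rct + mxs == TOTAL,
        "generator_rct": rct == by_dataset["RCT"],
        "generator_mxs": mxs == by_dataset["Multi-XScience"],
        "method_total": sum(by_method.values()) == TOTAL,
        "ratio_total": sum(by_ratio.values()) == TOTAL,
    }


checks = check_balance()
```

All six checks hold for the counts reported in the list\.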

## Appendix B Prompt Templates

RCT prompt template:

You are a medical evidence summarization assistant\. Given a randomized controlled trial abstract, write a concise factual summary in 2\-4 sentences\. Do not include information that is not supported by the abstract\. Source documents: \{source\} Summary:

Multi\-News\+ prompt template:

You are a news summarization assistant\. Given documents from multiple news sources about the same event, write a concise factual summary in 2\-4 sentences\. Do not include information that is not supported by the source\. Source documents: \{source\} Summary:

Absolute judgment prompt template:

Given a list of factual alignments and contradictions, which highlights alignment/contradictions between the ‘actual output’ and ‘contexts’, use it to provide a reason for the hallucination score CONCISELY\. Note that the hallucination score ranges from 0 \- 1, and the lower the better\.

\*\*IMPORTANT: Please make sure to only return in JSON format, with the ‘reason’ key providing the reason\. Example JSON: \{\{"reason": "The score is <hallucination\_score\> because <your\_reason\>\."\}\}\*\*

Factual Alignments: \{factual\_alignments\} Contradictions: \{contradictions\} Hallucination Score: \{score\} JSON:

Relative judgment prompt template:

You are an impartial evaluator\. Compare Summary A vs Summary B against the Source Document\. Use ONLY the Source Document to judge support\. Do NOT use outside knowledge\. Be strict about factual support\.

Answer these questions: Q1\. Hallucinations: Which summary contains MORE hallucinations \(unsupported content\)? Q2\. Omission: Which summary is missing MORE crucial information from the document? Q3\. Repetition: Which summary contains MORE repetitive information? Q4\. Alignment: Which summary is MORE semantically aligned with the source document?

Return ONLY valid JSON with exactly these keys: \- q1\_hallucinations\_more: "A" or "B" \- q2\_omission\_more: "A" or "B" \- q3\_repetition\_more: "A" or "B" \- q4\_alignment\_more: "A" or "B"

Rules: \- You MUST choose ‘A’ or ‘B’ \(no ties\)\. \- Keep outputs to JSON only \(no markdown\)\.

Source Document: <<<DOC \{document\} DOC\>\>\> Summary A: <<<A \{summary\_a\} A\>\>\> Summary B: <<<B \{summary\_b\} B\>\>\>
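
Because the relative judgment template demands JSON with exactly four keys and no ties, judge outputs can be validated before they are counted\. The following is a minimal sketch of such a check, assuming the judge returns a raw JSON string; the function name `parse_relative_judgment` is hypothetical and the paper does not describe its own parsing code\.

```python
import json

# Keys required by the relative judgment template.
EXPECTED_KEYS = (
    "q1_hallucinations_more",
    "q2_omission_more",
    "q3_repetition_more",
    "q4_alignment_more",
)


def parse_relative_judgment(raw):
    """Parse a judge response and enforce the template's output rules.

    Raises ValueError for malformed JSON, missing or extra keys,
    or any answer other than "A" or "B".
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(EXPECTED_KEYS):
        raise ValueError("response must contain exactly the four expected keys")
    for key in EXPECTED_KEYS:
        if data[key] not in ("A", "B"):
            raise ValueError(f"{key} must be 'A' or 'B'")
    return data
```

Rejecting malformed responses explicitly, rather than coercing them, keeps tie\-like or off\-format answers from silently entering the agreement statistics\.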

## Appendix C Supplementary Results

Table 4: Inter\-annotator agreement \(Fleiss’ κ\) for absolute judgments among three LLMs\.

Table 5: Inter\-annotator agreement \(Fleiss’ κ\) for relative judgments among three LLMs\.

![Refer to caption](https://arxiv.org/html/2607.01444v1/x7.png)

Figure 7: Summarization performance across different pruning ratios and approaches\.

Table 6: Downstream performance on general\-domain benchmarks \(IFEval, GSM8K, HumanEval, and MMLU\) at 50% expert pruning ratio\. Scores that are better than Source are highlighted in green\. The best and second\-best approaches for each model family are in bold and underlined, respectively\. Subscripts denote standard deviation across three seeds\.

Entries are mean ± standard deviation; † marks a cell highlighted in green\.

| Model | Approach | SUM | MT | QA | NER | NED | RE | COREF | EE |
|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS | Source | .100 | .585 | .768 | .538 | .331 | .342 | .419 | .290 |
| GPT-OSS | Random | .066 ± .011 | .078 ± .042 | .216 ± .255 | .124 ± .025 | .035 ± .005 | .089 ± .059 | .008 ± .006 | .088 ± .039 |
| GPT-OSS | Frequency | .097 ± .000 | .536 ± .004 | .715 ± .008 | .505 ± .007 | .309 ± .009 | .345 ± .024 † | .521 ± .060 † | .274 ± .017 |
| GPT-OSS | Gate | .096 ± .002 | .531 ± .011 | .738 ± .005 | .507 ± .005 | .313 ± .001 | .358 ± .007 † | .492 ± .099 † | .283 ± .006 |
| GPT-OSS | EAN | .119 ± .001 † | .285 ± .022 | .636 ± .003 | .331 ± .007 | .162 ± .000 | .351 ± .010 † | .368 ± .044 | .138 ± .005 |
| GPT-OSS | EASY-EP | .102 ± .003 † | .498 ± .028 | .722 ± .009 | .517 ± .006 | .317 ± .009 | .316 ± .030 | .447 ± .045 † | .280 ± .016 |
| GPT-OSS | REAP | .103 ± .004 † | .485 ± .018 | .726 ± .006 | .516 ± .005 | .317 ± .019 | .317 ± .033 | .459 ± .050 † | .283 ± .015 |
| Nemotron3 | Source | .148 | .515 | .735 | .445 | .334 | .384 | .695 | .244 |
| Nemotron3 | Random | .130 ± .007 | .319 ± .013 | .580 ± .068 | .355 ± .009 | .183 ± .007 | .370 ± .016 | .648 ± .020 | .158 ± .031 |
| Nemotron3 | Frequency | .132 ± .001 | .305 ± .037 | .692 ± .005 | .431 ± .003 | .276 ± .004 | .354 ± .006 | .695 ± .045 † | .199 ± .004 |
| Nemotron3 | Gate | .136 ± .000 | .375 ± .015 | .686 ± .007 | .434 ± .001 | .286 ± .004 | .366 ± .008 | .698 ± .077 † | .228 ± .016 |
| Nemotron3 | EAN | .135 ± .002 | .441 ± .017 | .676 ± .008 | .422 ± .003 | .266 ± .004 | .353 ± .012 | .714 ± .050 † | .235 ± .010 |
| Nemotron3 | EASY-EP | .137 ± .003 | .365 ± .064 | .682 ± .006 | .439 ± .002 | .298 ± .002 | .356 ± .007 | .757 ± .012 † | .215 ± .002 |
| Nemotron3 | REAP | .139 ± .001 | .435 ± .052 | .684 ± .015 | .438 ± .002 | .296 ± .003 | .382 ± .015 | .751 ± .015 † | .236 ± .003 |
| Qwen3 | Source | .115 | .615 | .663 | .566 | .407 | .396 | .802 | .368 |
| Qwen3 | Random | .114 ± .019 | .371 ± .054 | .603 ± .035 | .366 ± .033 | .154 ± .031 | .366 ± .030 | .729 ± .049 | .202 ± .048 |
| Qwen3 | Frequency | .118 ± .002 † | .450 ± .024 | .597 ± .018 | .529 ± .013 | .347 ± .010 | .393 ± .025 | .747 ± .038 | .336 ± .018 |
| Qwen3 | Gate | .119 ± .001 † | .484 ± .003 | .665 ± .025 † | .545 ± .001 | .355 ± .012 | .394 ± .014 | .825 ± .003 † | .360 ± .003 |
| Qwen3 | EAN | .132 ± .002 † | .442 ± .019 | .663 ± .007 † | .555 ± .002 | .271 ± .004 | .370 ± .006 | .793 ± .016 | .355 ± .006 |
| Qwen3 | EASY-EP | .114 ± .002 | .499 ± .003 | .752 ± .006 † | .575 ± .003 † | .410 ± .001 † | .382 ± .020 | .823 ± .014 † | .368 ± .003 |
| Qwen3 | REAP | .116 ± .002 † | .488 ± .004 | .753 ± .002 † | .571 ± .001 † | .402 ± .006 | .377 ± .008 | .794 ± .017 | .374 ± .005 † |
| Qwen3.6 | Source | .154 | .624 | .566 | .714 | .427 | .479 | .899 | .304 |
| Qwen3.6 | Random | .133 ± .012 | .540 ± .009 | .614 ± .066 † | .519 ± .016 | .204 ± .038 | .430 ± .021 | .764 ± .099 | .218 ± .065 |
| Qwen3.6 | Frequency | .157 ± .001 † | .617 ± .003 | .760 ± .015 † | .702 ± .006 | .407 ± .012 | .475 ± .018 | .926 ± .018 † | .323 ± .007 † |
| Qwen3.6 | Gate | .154 ± .002 | .617 ± .002 | .653 ± .048 † | .704 ± .003 | .423 ± .004 | .494 ± .007 † | .928 ± .018 † | .322 ± .004 † |
| Qwen3.6 | EAN | .158 ± .003 † | .605 ± .009 | .795 ± .005 † | .690 ± .007 | .366 ± .009 | .449 ± .002 | .907 ± .004 † | .286 ± .019 |
| Qwen3.6 | EASY-EP | .152 ± .003 | .612 ± .007 | .797 ± .011 † | .700 ± .018 | .427 ± .018 † | .480 ± .017 † | .928 ± .014 † | .333 ± .010 † |
| Qwen3.6 | REAP | .154 ± .003 † | .617 ± .004 | .595 ± .032 † | .710 ± .002 | .437 ± .002 † | .499 ± .009 † | .887 ± .011 | .341 ± .009 † |

Table 7: Zero\-shot performance on unseen MedINST downstream tasks\. Scores that are better than Source are highlighted in green\. The best and second\-best adaptation approaches for each model family are indicated in bold and underlined, respectively\. Subscripts denote standard deviation across three seeds\.

Table 8: Downstream performance on biomedical multiple\-choice and QA datasets\. Scores that are better than Source are highlighted in green\. The best and second\-best adaptation approaches for each model family are indicated in bold and underlined, respectively\. Subscripts denote standard deviation across three seeds\.
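Tables 4 and 5 report agreement among three LLM judges as Fleiss’ κ. For readers who want to reproduce that statistic on their own judgments, the standard formula can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function name `fleiss_kappa` and the input layout (one list of categorical labels per item, one label per judge) are hypothetical, and how the paper discretizes the 0 to 1 absolute hallucination scores into categories is not specified in this appendix.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for categorical labels.

    `ratings` is a list of items; each item is a list holding one label
    per rater (e.g. ["A", "A", "B"] for three judges). Every item must
    have the same number of raters, and at least two.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    per_item = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the fraction of rater
    # pairs that chose the same category.
    p_bar = sum(
        (sum(c * c for c in counts.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for counts in per_item
    ) / n_items

    # Chance agreement from the overall category proportions.
    totals = Counter()
    for counts in per_item:
        totals.update(counts)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

For the relative judgments, each question yields one "A" or "B" label per judge, so the per-question labels from the three judges can be passed in directly as the items.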