TriageRA-CCF: Source-Side Clinical Confidence and Coverage Signals for Adaptive Rank Budgeting in Medical LLMs

arXiv cs.CL Papers

Summary

This paper proposes TriageRA-CCF, a method for adaptive rank budgeting in LoRA for medical question answering. It uses source-side signals (base-model confidence, clinical coverage, counterfactual proxy) to dynamically choose rank budgets, achieving modest accuracy gains on Qwen3-8B and Llama3.1-8B.

arXiv:2606.29375v1 Announce Type: new

Cached at: 06/30/26, 05:30 AM

# TriageRA-CCF: Source-Side Clinical Confidence and Coverage Signals for Adaptive Rank Budgeting in Medical LLMs
Source: [https://arxiv.org/html/2606.29375](https://arxiv.org/html/2606.29375)
11institutetext:College of Computer Science, Sichuan University, Chengdu, China
11email:jishucan@stu\.scu\.edu\.cn, guohongliang@scu\.edu\.cn22institutetext:Meta Emergence Laboratory
22email:huangyining1987@gmail\.com###### Abstract

Medical large language models are commonly adapted with a fixed low-rank budget, even though medical questions differ substantially in confidence, clinical coverage, and cross-domain difficulty. We study adaptive rank budgeting for parameter-efficient medical question answering: for each question, the adapter decides whether to activate a small, medium, or large subset of LoRA rank channels. The central challenge is that a naive adaptive budget router can collapse to unstable choices or spend capacity without improving shifted benchmarks. We propose TriageRA-CCF, a source-side teacher for adaptive rank-budgeted LoRA. It combines three signals computed only from source training data: base-model answer confidence, metadata-cell clinical coverage, and a counterfactual close-miss proxy. These signals supervise a straight-through budget router over active ranks $\{2,4,8\}$, together with budget-cost, entropy, and rank-balance regularization. Under a matched CMB-source training protocol, TriageRA-CCF achieves the best average accuracy among LoRA, DoRA, and MoELoRA baselines on both Qwen3-8B and Llama3.1-8B. The gains are modest and non-uniform across benchmarks: +0.21 average points over the strongest external baseline on Qwen3-8B and +0.16 on Llama3.1-8B. Component ablations show that confidence, coverage, and counterfactual signals all provide useful budget supervision, but their combination is not monotonically best on every backbone.

## 1 Introduction

Medical multiple\-choice question answering has become a practical stress test for clinical knowledge in large language models \(LLMs\)\. Benchmarks such as CMB, CMExam, MedQA, and MedMCQA cover heterogeneous medical specialties, professions, and reasoning operations\[[21](https://arxiv.org/html/2606.29375#bib.bib1),[11](https://arxiv.org/html/2606.29375#bib.bib2),[7](https://arxiv.org/html/2606.29375#bib.bib3),[15](https://arxiv.org/html/2606.29375#bib.bib4)\], while medical LLM studies show that strong general models still require careful adaptation and validation before clinical use\[[17](https://arxiv.org/html/2606.29375#bib.bib6),[18](https://arxiv.org/html/2606.29375#bib.bib7)\]\. A common adaptation recipe is to fine\-tune a fixed\-rank LoRA or a mixture of LoRA experts\. This is simple and efficient, but it treats every question as requiring the same amount of update capacity\.

This fixed\-capacity assumption is poorly matched to medical exam data\. Some questions are direct recall items whose answer option already has a large base\-model margin; aggressive adapter updates can add little and may even perturb a correct answer\. Other questions are ambiguous, underrepresented in the source training distribution, or close to being solved by the base model\. These cases may need more adapter capacity\. In other words, the relevant question is how much low\-rank adaptation capacity should be activated for each example, rather than assuming a single capacity level for all medical questions\.

We therefore study adaptive rank budgeting inside a single shared LoRA basis. Instead of learning a bank of full adapters, the model learns $A$ and $B$ and gates the rank channels in $B\operatorname{diag}(m(x))A$ per input. A separate budget router chooses $k(x)\in\{2,4,8\}$ active rank channels. The motivation is direct: easy and confident examples should spend a smaller budget, while uncertain, under-covered, or plausibly repairable examples should activate more rank capacity. However, a budget router trained only through answer loss is hard to stabilize. It may learn spurious budget choices, overuse large budgets, or fail to improve shifted benchmarks.

We propose TriageRA-CCF, where CCF denotes confidence, clinical coverage, and counterfactual proxy. The teacher is constructed entirely from source training examples. Confidence uses base-model option scoring: large margins and low entropy indicate low-budget examples, while low margins and high entropy suggest a higher budget. Clinical coverage counts coarse source metadata cells, such as specialty or profession crossed with a weak clinical-operation tag, motivated by the risk that rare medical operations are underrepresented. The counterfactual proxy marks close base-model misses, where the gold option is near the top even when the base answer is wrong, as examples for which extra adaptation capacity may plausibly help. Figure [1](https://arxiv.org/html/2606.29375#S1.F1) summarizes the design.

![Refer to caption](https://arxiv.org/html/2606.29375v1/x1.png)

Figure 1: TriageRA-CCF constructs source-side budget supervision from base-model confidence, counterfactual close-miss signals, and clinical coverage. The learned budget router activates a small, medium, or large subset of rank channels inside one shared LoRA basis.

Our contributions are:

- We formulate medical PEFT as input-conditioned active-rank budgeting, where each question selects how much LoRA rank capacity to use.
- We introduce a source-side CCF budget teacher that uses answer confidence, clinical metadata coverage, and close-miss counterfactual proxies without target benchmark labels.
- We implement the method as a straight-through adaptive rank-budgeted LoRA with budget-cost, entropy, and rank-balance regularization.
- We evaluate under a unified CMB-source protocol on Qwen3-8B and Llama3.1-8B, showing best average accuracy among compared external PEFT baselines while reporting non-uniform per-benchmark behavior and non-monotonic component ablations.

Table [1](https://arxiv.org/html/2606.29375#S1.T1) summarizes the design choices and the motivation behind each component. The table is included because the method is intentionally conservative: it does not add a large external verifier or a new expert bank, but instead uses simple source-side evidence to make the adaptive budget policy less arbitrary.

Table 1: Design motivations in TriageRA-CCF.

## 2 Related Work

#### Medical LLM evaluation\.

Medical QA benchmarks test both factual knowledge and clinical reasoning\. PubMedQA targets biomedical research question answering\[[8](https://arxiv.org/html/2606.29375#bib.bib5)\]; MedQA and MedMCQA collect exam\-style multiple\-choice questions in English\[[7](https://arxiv.org/html/2606.29375#bib.bib3),[15](https://arxiv.org/html/2606.29375#bib.bib4)\]; and CMB and CMExam provide broad Chinese medical examinations across specialties and professions\[[21](https://arxiv.org/html/2606.29375#bib.bib1),[11](https://arxiv.org/html/2606.29375#bib.bib2)\]\. Large medical LLM studies such as Med\-PaLM and Med\-PaLM 2 report strong progress but also emphasize calibration, uncertainty, and safety\-sensitive evaluation\[[17](https://arxiv.org/html/2606.29375#bib.bib6),[18](https://arxiv.org/html/2606.29375#bib.bib7)\]\. Our work does not propose a new benchmark; instead, it asks whether a medical adapter should spend the same low\-rank capacity on all benchmark items\.

#### Parameter\-efficient adaptation\.

LoRA freezes the base model and learns low\-rank update matrices\[[5](https://arxiv.org/html/2606.29375#bib.bib10)\]; QLoRA makes this practical for quantized LLM fine\-tuning\[[1](https://arxiv.org/html/2606.29375#bib.bib11)\]; and DoRA decomposes weight magnitude and direction for stronger low\-rank adaptation\[[13](https://arxiv.org/html/2606.29375#bib.bib12)\]\. Recent systems further adapt Qwen and Llama\-family backbones\[[23](https://arxiv.org/html/2606.29375#bib.bib8),[3](https://arxiv.org/html/2606.29375#bib.bib9)\]\. We compare against LoRA and DoRA baselines under the same source data and training length\. Unlike these fixed\-rank methods, TriageRA\-CCF changes the active rank budget per question while retaining a single shared adapter basis\.

#### Adaptive rank allocation\.

AdaLoRA allocates parameter budgets according to importance estimates of LoRA components\[[25](https://arxiv.org/html/2606.29375#bib.bib13)\]\. DyLoRA trains a range of ranks without retraining for each rank\[[19](https://arxiv.org/html/2606.29375#bib.bib14)\], and IncreLoRA incrementally allocates parameters across modules\[[24](https://arxiv.org/html/2606.29375#bib.bib15)\]\. These methods mainly address rank allocation across weights, modules, or deployment settings\. Our focus is complementary: the rank budget is conditioned on the current medical question and supervised by source\-side clinical uncertainty signals\.

#### Mixture and routing over LoRA experts\.

Sparse mixture\-of\-experts models route examples or tokens to a subset of experts\[[16](https://arxiv.org/html/2606.29375#bib.bib22),[2](https://arxiv.org/html/2606.29375#bib.bib23)\]\. LoRA\-based expert mixtures, including MoELoRA, MoLE, MixLoRA, and MING\-MOE, combine PEFT with expert routing\[[14](https://arxiv.org/html/2606.29375#bib.bib16),[22](https://arxiv.org/html/2606.29375#bib.bib20),[9](https://arxiv.org/html/2606.29375#bib.bib18),[10](https://arxiv.org/html/2606.29375#bib.bib19)\]\. Medical MOE\-LoRA variants use task\-motivated expert gates for multi\-task medical applications\[[12](https://arxiv.org/html/2606.29375#bib.bib17)\]; LoRAHub dynamically composes a library of LoRA modules\[[6](https://arxiv.org/html/2606.29375#bib.bib21)\]\. TriageRA\-CCF differs in two ways\. First, it does not train or select a bank of full LoRA experts; it varies active rank channels inside one adapter\. Second, its supervision is not a task ID or expert label, but a source\-side estimate of how much rank capacity the input should receive\.

#### Calibration and uncertainty\.

Calibration work shows that neural confidence is often misaligned with correctness\[[4](https://arxiv.org/html/2606.29375#bib.bib24)\], and test\-time adaptation methods can use entropy or unlabeled statistics to adjust models\[[20](https://arxiv.org/html/2606.29375#bib.bib25)\]\. We use confidence only as one source\-side teacher signal rather than as a deploy\-time guarantee\. The budget teacher is therefore deliberately combined with clinical coverage and close\-miss structure, and a budget cost discourages spending maximum capacity everywhere\.

## 3 Methodology

### 3.1 Problem Setting

Let $x$ be a medical multiple-choice question with options and answer $y$ in the source training set. A frozen base LLM with trainable adapter parameters is trained by supervised fine-tuning to predict the answer. Standard LoRA uses a fixed low-rank update

$$W(x)h = W_0 h + \frac{\alpha}{r} BAh, \tag{1}$$

where $W_0$ is frozen and $A, B$ are trainable. We instead use an adaptive rank-budgeted update

$$W(x)h = W_0 h + \frac{\alpha}{r} B\operatorname{diag}(m(x))Ah, \tag{2}$$

where $m(x)$ is a sparse nonnegative rank mask. The contribution of this paper is the source-supervised budget branch that determines the number of active entries in $m(x)$ for each input.
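As a rough illustration of Eq. (2), the sketch below applies a rank mask between the two LoRA factors using plain Python lists. The helper names (`matvec`, `masked_lora_forward`) and the toy matrices are assumptions made for this example and do not come from the paper's implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product for small illustrative shapes."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def masked_lora_forward(W0: Matrix, A: Matrix, B: Matrix,
                        mask: Vector, h: Vector, alpha: float) -> Vector:
    """Return W0 h + (alpha / r) * B diag(mask) A h, following Eq. (2)."""
    r = len(A)                                     # stored rank = number of rows of A
    z = matvec(A, h)                               # project h into the rank-r space
    z = [m_i * z_i for m_i, z_i in zip(mask, z)]   # gate each rank channel
    delta = matvec(B, z)                           # map back to the output space
    base = matvec(W0, h)                           # frozen base projection
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


# Toy shapes (hypothetical values): d_in = d_out = 2, stored rank r = 2.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 2.0]

full = masked_lora_forward(W0, A, B, [1.0, 1.0], h, alpha=2.0)  # both channels on
half = masked_lora_forward(W0, A, B, [1.0, 0.0], h, alpha=2.0)  # second channel off
none = masked_lora_forward(W0, A, B, [0.0, 0.0], h, alpha=2.0)  # base model only
```

With an all-ones mask the expression reduces to the standard LoRA update of Eq. (1), and with an all-zero mask it reduces to the frozen base projection, which is why the mask can be read as a per-input capacity dial.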

### 3.2 Rank-Budgeted Adapter

For each LoRA target module, rank logits are computed from the question representation:

$$s(x) = f_{\mathrm{rank}}(h_x), \tag{3}$$

where $h_x$ is a pooled hidden representation of the question. The individual rank channels are treated as latent capacity units. Clinical metadata is not an input to the rank logits; it enters only through the source-side coverage teacher described below.

Let $p_i(x)=\operatorname{softmax}(s(x))_i$ over $r$ rank channels. If the budget branch selects $k$, the forward pass keeps the top-$k$ rank probabilities and rescales their mass:

$$m_i^{(k)}(x)=\begin{cases}p_i(x)\cdot\tau_k(x), & i\in\operatorname{TopK}(p(x),k),\\ 0, & \text{otherwise},\end{cases} \tag{4}$$

where $\tau_k(x)$ normalizes the selected mass so that update magnitudes are comparable across budgets. A rank-balance penalty discourages all examples from relying on the same rank channels, and an entropy penalty sharpens the rank distribution after softmax.
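The top-$k$ masking in Eq. (4) can be sketched as follows. The text does not give the exact form of $\tau_k(x)$, so this example assumes the simplest choice, $\tau_k(x) = 1/\sum_{i\in\operatorname{TopK}} p_i(x)$, which makes every mask sum to one regardless of $k$. The function names and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [s / temperature for s in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def topk_mask(p: Sequence[float], k: int) -> List[float]:
    """Keep the k largest entries of p and rescale them to sum to 1.

    Assumption: tau_k(x) = 1 / (selected mass). The paper only states that
    tau_k makes update magnitudes comparable across budgets.
    """
    keep = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    tau = 1.0 / sum(p[i] for i in keep)
    mask = [0.0] * len(p)
    for i in keep:
        mask[i] = p[i] * tau
    return mask


# Hypothetical rank logits for r = 4 channels, with budget k = 2.
p = softmax([2.0, 0.5, 1.0, -1.0], temperature=0.5)
mask = topk_mask(p, k=2)   # channels 0 and 2 stay active, the rest are zeroed
```

Under this normalization a rank-2 mask and a rank-8 mask carry the same total weight, so a larger budget spreads the update over more channels instead of simply scaling it up.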

### 3.3 Adaptive Budget Router

The budget router chooses one of three active-rank budgets $\mathcal{K}=\{2,4,8\}$. We use a small discrete set because the training budget is short and the resulting policy is easy to audit: rank 2 is a conservative update, rank 4 is the middle operating point, and rank 8 allows more capacity while still activating only half of the stored rank-16 basis. The router's logits are computed from the same question representation:

$$b(x)=b_0+f_{\mathrm{budget}}(h_x). \tag{5}$$

The budget probabilities are $q(x)=\operatorname{softmax}(b(x)/T_b)$. During training, the selected budget is the straight-through argmax of $q(x)$, while gradients flow through the soft probabilities:

$$\tilde{q}(x)=\operatorname{onehot}(\arg\max q(x))-\operatorname{sg}(q(x))+q(x), \tag{6}$$

where $\operatorname{sg}$ denotes the stop-gradient operator. The final gate is a probability-weighted combination over the candidate top-$k$ masks:

$$m(x)=\sum_{j=1}^{|\mathcal{K}|}\tilde{q}_j(x)\,m^{(k_j)}(x). \tag{7}$$

A cost term penalizes the expected active budget,

$$\mathcal{L}_{\mathrm{cost}}=\mathbb{E}_x\left[\frac{\sum_j q_j(x)k_j}{\max(\mathcal{K})}\right], \tag{8}$$

which prevents a degenerate policy that always selects the largest rank budget.
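The forward behavior of Eqs. (5) to (8) can be sketched without an autograd framework. The code below computes only forward values: in a real training loop the one-hot choice would be written as `onehot - stop_gradient(q) + q` so that gradients follow the soft probabilities. All function names, logits, and candidate masks here are illustrative assumptions.

```python
import math
from typing import List, Sequence

BUDGETS = (2, 4, 8)   # candidate active ranks from the text


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature T_b."""
    scaled = [b / temperature for b in logits]
    peak = max(scaled)
    exps = [math.exp(b - peak) for b in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through_forward(q: Sequence[float]) -> List[float]:
    """Forward value of Eq. (6): a one-hot vector at argmax q."""
    best = max(range(len(q)), key=lambda j: q[j])
    return [1.0 if j == best else 0.0 for j in range(len(q))]


def combine_masks(q_tilde: Sequence[float],
                  masks: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (7): weight each candidate top-k mask by its budget weight."""
    r = len(masks[0])
    return [sum(w * m[i] for w, m in zip(q_tilde, masks)) for i in range(r)]


def budget_cost(q_batch: Sequence[Sequence[float]],
                budgets: Sequence[int] = BUDGETS) -> float:
    """Eq. (8): batch mean of expected budget divided by the largest budget."""
    per_example = [
        sum(q_j * k_j for q_j, k_j in zip(q, budgets)) / max(budgets)
        for q in q_batch
    ]
    return sum(per_example) / len(per_example)


# Hypothetical budget logits for one question, with T_b = 0.7 as in the main runs.
q = softmax([0.2, 1.0, -0.5], temperature=0.7)
q_tilde = straight_through_forward(q)            # selects the middle budget (k = 4)

# Hypothetical candidate masks over r = 4 channels for three budgets.
candidates = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
gate = combine_masks(q_tilde, candidates)        # equals the middle candidate mask
cost = budget_cost([q])
```

Because the cost is computed from the soft probabilities $q$ and not from the hard choice, it stays differentiable and ranges from 0.25 (all mass on rank 2) to 1.0 (all mass on rank 8).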

### 3.4 Source-Side CCF Budget Teacher

The teacher is built before adapter training and uses only source training examples. For each question, the frozen base model scores the answer options by next-token probabilities over the option letters. We also derive weak source metadata: a coarse specialty or profession tag $r_x$ when available, and a clinical-operation tag $o_x$ from deterministic keyword rules. These tags are used only to estimate source coverage. From the option scores and tags we derive three signals:

- Confidence. The top-1/top-2 margin, gold-option probability, and normalized option entropy. High confidence motivates a low budget; low margin or high entropy motivates a higher budget.
- Counterfactual proxy. If the base prediction is wrong but the gold option is ranked near the top, the example is marked as a close miss. The motivation is that extra adapter capacity is more likely to repair a close miss than an arbitrary wrong answer.
- Clinical coverage. We count source examples in each metadata cell $(r_x,o_x)$. Rare cells receive at least a medium budget prior, because under-covered clinical operations are more likely to require additional adaptation capacity.

These signals produce a teacher class $z(x)\in\{0,1,2\}$, corresponding to active ranks $2,4,8$, and a nonnegative weight $w(x)$. Concretely, high-budget labels are assigned when a close miss, strong uncertainty, or high entropy is observed; medium-budget labels are assigned for moderate uncertainty or rare clinical metadata cells; otherwise the example is labeled low budget. This rule is intentionally simple: the goal is not to learn a target benchmark oracle, but to provide a source-side inductive bias that stabilizes the budget router.
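The labeling rule described above can be sketched as a small deterministic function. The numeric thresholds in this example are placeholders chosen for illustration; the paper's actual cutoffs are not reproduced here. The sketch also assumes that the margin is a probability difference and that entropy is divided by the log of the number of options, and it omits the weight $w(x)$.

```python
import math
from typing import Dict, Sequence

# All thresholds below are hypothetical placeholders, not the paper's values.
LOW_MARGIN, MID_MARGIN = 0.10, 0.50
HIGH_ENTROPY, MID_ENTROPY = 0.80, 0.50
CLOSE_MISS_MAX_RANK = 1   # gold option must be at least second-best (0-based rank 1)


def confidence_signals(option_probs: Sequence[float], gold: int) -> Dict[str, float]:
    """Margin, normalized entropy, gold probability, and close-miss flag."""
    n = len(option_probs)
    ranked = sorted(range(n), key=lambda i: option_probs[i], reverse=True)
    margin = option_probs[ranked[0]] - option_probs[ranked[1]]
    entropy = -sum(p * math.log(p) for p in option_probs if p > 0) / math.log(n)
    wrong = ranked[0] != gold
    close_miss = wrong and ranked.index(gold) <= CLOSE_MISS_MAX_RANK
    return {
        "margin": margin,
        "entropy": entropy,
        "gold_prob": option_probs[gold],
        "close_miss": float(close_miss),
    }


def teacher_class(signals: Dict[str, float], rare_cell: bool) -> int:
    """Map signals to a class in {0, 1, 2}, i.e. active rank 2, 4, or 8."""
    if (signals["close_miss"] > 0
            or signals["margin"] < LOW_MARGIN
            or signals["entropy"] > HIGH_ENTROPY):
        return 2
    if (signals["margin"] < MID_MARGIN
            or signals["entropy"] > MID_ENTROPY
            or rare_cell):
        return 1
    return 0


# Hypothetical base-model option probabilities for two four-option questions.
easy = confidence_signals([0.90, 0.05, 0.03, 0.02], gold=0)    # confident and correct
close = confidence_signals([0.45, 0.40, 0.10, 0.05], gold=1)   # wrong, gold is second
```

With these placeholder thresholds, the confident example is labeled low budget unless its metadata cell is rare, while the close miss is labeled high budget.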

The source-only restriction is important. If target benchmark labels were used to label high-budget examples, the budget router could become a hidden target-domain selector rather than a deployable adaptation method. We therefore allow the teacher to use source answers, because supervised source fine-tuning already uses them, but we do not inspect CMExam, MedQA, or MedMCQA labels during teacher construction. This makes the setting closer to ordinary source-domain medical adaptation followed by out-of-domain evaluation. Table [2](https://arxiv.org/html/2606.29375#S3.T2) gives the deterministic rule used in all main runs.

Table 2: Source-side CCF teacher rule. Entropy is normalized by the number of answer options. The rare-cell threshold is the 45th percentile of non-empty source metadata-cell counts.
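The rare-cell threshold mentioned in the caption can be illustrated with a short counting sketch. It assumes a nearest-rank percentile and treats a cell as rare when its count is at or below the threshold; both choices, as well as the function name and the toy tags, are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple


def rare_metadata_cells(cells: Iterable[Hashable],
                        percentile: float = 45.0) -> Tuple[Set[Hashable], int]:
    """Return the set of rare cells and the count threshold.

    A Counter only contains cells that occur at least once, so the
    percentile is taken over non-empty cell counts.
    """
    counts = Counter(cells)
    ordered = sorted(counts.values())
    # Nearest-rank percentile (assumed convention).
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    threshold = ordered[rank - 1]
    rare = {cell for cell, c in counts.items() if c <= threshold}
    return rare, threshold


# Hypothetical (specialty, operation) tags for ten source questions.
source_cells = (
    [("internal", "diagnosis")] * 5
    + [("surgery", "treatment")] * 3
    + [("nursing", "dosage")]
    + [("pharmacy", "mechanism")]
)
rare, threshold = rare_metadata_cells(source_cells)
```

In this toy example the sorted counts are 1, 1, 3, 5, the 45th-percentile threshold is 1, and the two singleton cells are flagged as rare.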

### 3.5 Training Objective

The final training objective is

$$\mathcal{L}=\mathcal{L}_{\mathrm{ans}}+\beta\mathcal{L}_{\mathrm{balance}}+\eta\mathcal{L}_{\mathrm{entropy}}+\rho\mathcal{L}_{\mathrm{cost}}+\mu\mathcal{L}_{\mathrm{teacher}}, \tag{9}$$

where $\mathcal{L}_{\mathrm{ans}}$ is the answer-only supervised fine-tuning loss. The teacher loss is a weighted cross-entropy over the budget probabilities:

$$\mathcal{L}_{\mathrm{teacher}}=-\frac{\sum_x w(x)\log q_{z(x)}(x)}{\max(1,\sum_x w(x))}. \tag{10}$$

In the main runs, $\beta=0.001$, $\eta=0.01$, $\mu=0.2$, $T_b=0.7$, and $\rho$ is swept over $\{0,0.02,0.05\}$. The sweep is important because the same cost coefficient need not behave identically across backbones.
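A minimal sketch of Eqs. (9) and (10), assuming the individual loss terms have already been computed as plain floats. The function names and the example probabilities are hypothetical; the default coefficients are the ones reported for the main runs, with $\rho$ left as an explicit argument because it is swept.

```python
import math
from typing import Sequence


def teacher_loss(q_batch: Sequence[Sequence[float]],
                 z_batch: Sequence[int],
                 w_batch: Sequence[float]) -> float:
    """Eq. (10): weighted cross-entropy of the budget probabilities."""
    numerator = -sum(w * math.log(q[z]) for q, z, w in zip(q_batch, z_batch, w_batch))
    denominator = max(1.0, sum(w_batch))   # clamp avoids dividing by a tiny weight sum
    return numerator / denominator


def total_loss(l_ans: float, l_balance: float, l_entropy: float,
               l_cost: float, l_teacher: float, rho: float,
               beta: float = 0.001, eta: float = 0.01, mu: float = 0.2) -> float:
    """Eq. (9): answer loss plus the four weighted regularizers."""
    return l_ans + beta * l_balance + eta * l_entropy + rho * l_cost + mu * l_teacher


# Hypothetical budget probabilities and teacher labels for two questions.
q_batch = [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]
z_batch = [0, 2]

loss_unit = teacher_loss(q_batch, z_batch, [1.0, 1.0])    # weight sum 2 -> divide by 2
loss_small = teacher_loss(q_batch, z_batch, [0.2, 0.2])   # weight sum 0.4 -> divide by 1
```

The `max(1, ·)` clamp in the denominator means that a batch with small total teacher weight contributes a proportionally small loss instead of being renormalized up to full strength.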

## 4 Experiments

### 4.1 Datasets and Protocol

All methods use the same source training protocol: 4,200 CMB training examples, maximum sequence length 768, 400 update steps, micro-batch size 1, gradient accumulation 8, and answer-only supervised fine-tuning. Evaluation covers four benchmarks: CMB eval4149, CMExam full, MedQA test700, and MedMCQA val700. The teacher labels are constructed only from the CMB source training examples. Target benchmark labels are used only for final evaluation.
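The protocol numbers imply a fairly short training run. The arithmetic below assumes that "update steps" counts optimizer steps (each consuming one full accumulated batch); if the 400 steps instead counted micro-batches, the totals would be eight times smaller.

```python
# Values taken from the protocol description above.
train_examples = 4200
update_steps = 400
micro_batch_size = 1
grad_accumulation = 8

# Assumption: one update step = one optimizer step over an accumulated batch.
effective_batch = micro_batch_size * grad_accumulation   # sequences per update
sequences_seen = update_steps * effective_batch          # total sequences processed
epochs = sequences_seen / train_examples                 # fraction of the source set

print(effective_batch, sequences_seen, round(epochs, 3))
```

Under this reading, each method sees 3,200 sequences, or roughly three quarters of one pass over the 4,200 source examples.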

Table 3: Unified evaluation protocol.

### 4.2 Backbones and Baselines

We evaluate two instruction-tuned 8B backbones: Qwen3-8B [[23](https://arxiv.org/html/2606.29375#bib.bib8)] and Llama3.1-8B-Instruct [[3](https://arxiv.org/html/2606.29375#bib.bib9)]. TriageRA-CCF uses rank $r=16$, alpha 32, dropout 0.05, target modules q/k/v/o/gate/up/down, rank temperature 0.5, and adaptive choices $\{2,4,8\}$. We compare against three external PEFT baselines trained with the same source data and step count: LoRA with rank 24 and alpha 48, DoRA with rank 24 and alpha 48, and MoELoRA with four experts, top-2 routing, and rank 8 experts. This gives the baselines a larger fixed nominal rank or explicit expert routing, so the comparison does not rely on underpowered LoRA baselines.

Table 4: Main implementation settings. All methods use the same source examples, sequence length, step count, and effective batch size.

We report accuracy as the primary metric because all four benchmarks are multiple-choice QA tasks. We also report the average active rank for adaptive methods. Active rank is not a direct latency measurement, because wall-clock runtime also depends on implementation details and batching, but it is the relevant diagnostic for whether the budget router is actually changing the low-rank capacity used by each question.
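The average active rank diagnostic is a simple mean over the budgets the router selects at evaluation time. The sketch below, with a hypothetical list of per-question choices, also reports the share of each budget, since the same average can arise from very different mixtures.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

BUDGETS = (2, 4, 8)


def active_rank_stats(chosen: Sequence[int]) -> Tuple[float, Dict[int, float]]:
    """Return the mean active rank and the fraction of questions per budget."""
    counts = Counter(chosen)
    n = len(chosen)
    average = sum(chosen) / n
    shares = {k: counts.get(k, 0) / n for k in BUDGETS}
    return average, shares


# Hypothetical router choices for four evaluation questions.
average, shares = active_rank_stats([2, 2, 4, 8])
```

Here the mean is 4.0 even though only a quarter of the questions use rank 4, which is why the per-budget shares are worth inspecting alongside the average.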

## 5 Results and Analysis

### 5.1 Main Results

Table [5](https://arxiv.org/html/2606.29375#S5.T5) reports the main comparison, using the integrated CCF teacher as the representative method. On Qwen3-8B, TriageRA-CCF reaches 68.79% average accuracy, 0.21 points above LoRA, the external baseline with the best average accuracy on that backbone. On Llama3.1-8B, TriageRA-CCF reaches 58.51%, 0.16 points above MoELoRA, the external baseline with the best average accuracy on that backbone. The absolute gains are small. Figure [2](https://arxiv.org/html/2606.29375#S5.F2) therefore also reports per-benchmark deltas against the strongest external method on each dataset. The method improves on CMExam and MedQA for Qwen3-8B and on CMB and MedQA for Llama3.1-8B, but it does not dominate every benchmark. The component ablation in Table [7](https://arxiv.org/html/2606.29375#S5.T7) further clarifies the scope of the claim: the full CCF combination is not always the single best variant; rather, source-side uncertainty signals provide useful supervision for the rank-budget router.

Table 5: Main results under the matched CMB-source training protocol. "Gain" is computed against the strongest external baseline for the same backbone.

![Refer to caption](https://arxiv.org/html/2606.29375v1/x2.png)

Figure 2: Per-benchmark accuracy delta of TriageRA-CCF relative to the strongest external PEFT baseline. Dataset columns use the strongest external method on that benchmark; the Avg. column uses the strongest external method by average accuracy.

### 5.2 Budget Cost Sweep

Table [6](https://arxiv.org/html/2606.29375#S5.T6) shows the budget-cost sweep. Qwen3-8B is relatively flat: $\rho=0$ and $\rho=0.02$ tie on average, with $\rho=0.02$ serving as the representative nonzero-cost setting in Table [5](https://arxiv.org/html/2606.29375#S5.T5). Llama3.1-8B behaves differently: the strongest result appears at $\rho=0.05$, suggesting that this backbone benefits from stronger regularization of the budget router.

The active-rank diagnostics in Figure [3](https://arxiv.org/html/2606.29375#S5.F3) show that the learned policy is not simply choosing the minimum budget. Average active rank is around four for Qwen3-8B and around five for Llama3.1-8B, even though the available choices are two, four, and eight. The backbone difference is informative: the Llama budget branch spends more capacity on average, but its best result appears when the budget-cost coefficient is larger. This supports the interpretation that cost is not merely a compression term; it also regularizes a discrete policy that can otherwise overreact to uncertain source-side signals. Since no separate target-free validation split is reserved for selecting $\rho$, we treat this sweep as sensitivity analysis rather than as evidence of fully target-blind hyperparameter selection.

Table 6: Budget cost sweep for TriageRA-CCF. Active and effective rank are averages over the four evaluation benchmarks.

![Refer to caption](https://arxiv.org/html/2606.29375v1/x3.png)

Figure 3: Budget-cost sensitivity. Qwen3-8B is stable around $\rho\in\{0,0.02\}$, while Llama3.1-8B benefits from the stronger $\rho=0.05$ cost.

### 5.3 Teacher Signal Component Ablation

Table [7](https://arxiv.org/html/2606.29375#S5.T7) compares the source-side teacher variants against a naive adaptive budget router trained without teacher labels. The strongest positive evidence is that the teacher components are not disposable. On Qwen3-8B, confidence-only, coverage-only, counterfactual-only, and full CCF all improve average accuracy over the naive adaptive router; confidence-only reaches the highest Qwen average, while coverage-only gives the best MedMCQA score among the teacher variants. On Llama3.1-8B, confidence+coverage gives the highest average, and the full CCF teacher remains close behind with stronger MedQA than the naive adaptive router.

The ablation therefore supports a conservative interpretation, visualized in Figure [4](https://arxiv.org/html/2606.29375#S5.F4). Confidence, clinical coverage, and counterfactual close-miss signals capture different forms of clinical uncertainty and can guide active-rank budgets. However, combining all signals does not strictly add gains on every benchmark. Coverage is a particularly useful signal: rare metadata cells are not just noise, but a practical proxy for clinically under-covered source examples. We use full CCF as the representative integrated method because it combines all three motivations and is stable across backbones, while avoiding the stronger claim that all three components must be monotonically better than any subset.

Table 7: Teacher signal component ablation. All rows use the same CMB-source training protocol. The table shows that individual and combined source-side teacher signals are useful, but full CCF is not monotonically best on every backbone.

![Refer to caption](https://arxiv.org/html/2606.29375v1/x4.png)

Figure 4: Average accuracy change of teacher variants relative to the naive adaptive budget router. The signal components are useful, but the combined teacher is not strictly monotonic across backbones.

### 5.4 Per-Benchmark Observations

The gains are not uniform\. Qwen3\-8B benefits most on CMExam and MedQA relative to the strongest external baselines, while Llama3\.1\-8B benefits most on CMB and MedQA\. The Llama result is particularly useful because its strongest baseline is MoELoRA, a routed expert method, yet TriageRA\-CCF uses one shared rank basis\. MedMCQA remains difficult: TriageRA\-CCF is competitive but does not dominate every baseline on that dataset\. The component ablation reinforces the same message: teacher signals help most clearly on some datasets, especially MedQA, but source\-side clinical uncertainty is not a universal guarantee of improvement\.

## 6 Limitations and Ethical Considerations

#### Small margins\.

The improvements are modest and non\-uniform across individual benchmarks\. We therefore limit the empirical claim: source\-side teacher signals make adaptive rank budgeting competitive with strong PEFT and MoE\-LoRA baselines across two 8B backbones\. We do not make statistical significance claims from these single\-run results, so small differences should be interpreted cautiously\.

#### Hyperparameter selection\.

The budget-cost coefficient $\rho$ is reported as a sensitivity sweep rather than as a target-free model-selection procedure. A stricter protocol would reserve a source validation split for selecting $\rho$, the teacher coefficient $\mu$, and other regularization weights before evaluating shifted benchmarks.

#### Fixed\-source training\.

All experiments train on 4,200 CMB source examples\. The method has not yet been stress\-tested with larger source corpora, non\-exam clinical notes, or multilingual mixed\-source training\. Better source diversity may change the relative value of confidence, coverage, and counterfactual proxy signals\.

#### Weak clinical metadata\.

Specialty, profession, and operation tags are weak metadata rather than expert\-verified clinical annotations\. Coverage counts are therefore approximate\. This is acceptable for a source\-side budget prior, but the tags should not be interpreted as clinical explanations\.

#### Runtime accounting\.

We report active rank as a capacity diagnostic, not wall\-clock latency\. The budget router adds a small gating computation, and actual throughput depends on implementation details, batching, and whether sparse rank updates are fused efficiently\.

#### Clinical use\.

The evaluation is limited to multiple\-choice exams\. The system should not be used for diagnosis, treatment, or patient\-facing medical advice without separate clinical validation, uncertainty calibration, and human oversight\.

## 7 Conclusion

We presented TriageRA\-CCF, a source\-side teacher for adaptive rank budgeting in medical LLM adaptation\. The method keeps one shared LoRA basis and learns how many rank channels to activate for each question, supervised by confidence, clinical coverage, and counterfactual close\-miss signals from source training data\. Across Qwen3\-8B and Llama3\.1\-8B, TriageRA\-CCF achieves the best average accuracy among compared external LoRA, DoRA, and MoELoRA baselines under a matched CMB\-source protocol, though the gains are small and dataset\-dependent\. Component ablations show that the teacher signals are useful but not monotonically additive\. The results suggest that medical PEFT should not only decide what adaptation parameters to learn, but also how much low\-rank update capacity each input should spend\.

## References

- \[1\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. External Links: 2305\.14314, [Link](https://arxiv.org/abs/2305.14314) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px2.p1.1)\.
- \[2\] W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\. Journal of Machine Learning Research 23 \(120\), pp\. 1–39\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[3\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. External Links: 2407\.21783, [Link](https://arxiv.org/abs/2407.21783) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px2.p1.1), [§4\.2](https://arxiv.org/html/2606.29375#S4.SS2.p1.2)\.
- \[4\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 70, pp\. 1321–1330\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px5.p1.1)\.
- \[5\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\] C\. Huang, Q\. Liu, B\. Y\. Lin, T\. Pang, C\. Du, and M\. Lin \(2024\) LoraHub: efficient cross\-task generalization via dynamic LoRA composition\. In First Conference on Language Modeling\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[7\] D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2020\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. External Links: 2009\.13081, [Link](https://arxiv.org/abs/2009.13081) Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\] Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019\) PubMedQA: a dataset for biomedical research question answering\. External Links: 1909\.06146, [Link](https://arxiv.org/abs/1909.06146) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[9\] D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang, and M\. Tang \(2024\) MixLoRA: enhancing large language models fine\-tuning with LoRA\-based mixture of experts\. External Links: 2404\.15159, [Link](https://arxiv.org/abs/2404.15159) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[10\] Y\. Liao, S\. Jiang, Y\. Wang, and Y\. Wang \(2024\) MING\-MOE: enhancing medical multi\-task learning in large language models with sparse mixture of low\-rank adapter experts\. External Links: 2404\.09027, [Link](https://arxiv.org/abs/2404.09027) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[11\] J\. Liu, P\. Zhou, Y\. Hua, D\. Chong, Z\. Tian, A\. Liu, H\. Wang, C\. You, Z\. Guo, L\. Zhu, and M\. L\. Li \(2023\) Benchmarking large language models on CMExam – a comprehensive chinese medical exam dataset\. External Links: 2306\.03030, [Link](https://arxiv.org/abs/2306.03030) Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[12\] Q\. Liu, X\. Wu, X\. Zhao, Y\. Zhu, D\. Xu, F\. Tian, and Y\. Zheng \(2024\) When MOE meets LLMs: parameter efficient fine\-tuning for multi\-task medical applications\. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[13\] S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\) DoRA: weight\-decomposed low\-rank adaptation\. External Links: 2402\.09353, [Link](https://arxiv.org/abs/2402.09353) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px2.p1.1)\.
- \[14\] T\. Luo, J\. Lei, F\. Lei, W\. Liu, S\. He, J\. Zhao, and K\. Liu \(2024\) MoELoRA: contrastive learning guided mixture of experts on parameter\-efficient fine\-tuning for large language models\. External Links: 2402\.12851, [Link](https://arxiv.org/abs/2402.12851) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[15\] A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\) MedMCQA: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research, Vol\. 174, pp\. 248–260\. Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\] N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\) Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[17\] K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, et al\. \(2022\) Large language models encode clinical knowledge\. External Links: 2212\.13138, [Link](https://arxiv.org/abs/2212.13138) Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[18\] K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, L\. Hou, K\. Clark, S\. Pfohl, H\. Cole\-Lewis, D\. Neal, et al\. \(2023\) Towards expert\-level medical question answering with large language models\. External Links: 2305\.09617, [Link](https://arxiv.org/abs/2305.09617) Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[19\] M\. Valipour, M\. Rezagholizadeh, I\. Kobyzev, and A\. Ghodsi \(2023\) DyLoRA: parameter efficient tuning of pre\-trained models using dynamic search\-free low\-rank adaptation\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px3.p1.1)\.
- \[20\] D\. Wang, E\. Shelhamer, S\. Liu, B\. Olshausen, and T\. Darrell \(2021\) Tent: fully test\-time adaptation by entropy minimization\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px5.p1.1)\.
- \[21\] X\. Wang, G\. H\. Chen, D\. Song, Z\. Zhang, Z\. Chen, Q\. Xiao, F\. Jiang, J\. Li, X\. Wan, B\. Wang, and H\. Li \(2024\) CMB: a comprehensive medical benchmark in chinese\. External Links: 2308\.08833, [Link](https://arxiv.org/abs/2308.08833) Cited by: [§1](https://arxiv.org/html/2606.29375#S1.p1.1), [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px1.p1.1)\.
- \[22\] X\. Wu, S\. Huang, and F\. Wei \(2024\) Mixture of LoRA experts\. External Links: 2404\.13628, [Link](https://arxiv.org/abs/2404.13628) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px4.p1.1)\.
- \[23\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px2.p1.1), [§4\.2](https://arxiv.org/html/2606.29375#S4.SS2.p1.2)\.
- \[24\] F\. Zhang, L\. Li, J\. Chen, Z\. Jiang, B\. Wang, and Y\. Qian \(2023\) IncreLoRA: incremental parameter allocation method for parameter\-efficient fine\-tuning\. External Links: 2308\.12043, [Link](https://arxiv.org/abs/2308.12043) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\] Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\) AdaLoRA: adaptive budget allocation for parameter\-efficient fine\-tuning\. External Links: 2303\.10512, [Link](https://arxiv.org/abs/2303.10512) Cited by: [§2](https://arxiv.org/html/2606.29375#S2.SS0.SSS0.Px3.p1.1)\.
