Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
Summary
The paper proposes BiRG-LoRA, a rank-gated LoRA method for medical question answering that uses clinically structured priors to select sparse rank subsets, achieving 69.31% macro-average accuracy across four benchmarks while using fewer parameters than mixture-of-experts approaches.
# Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
Source: [https://arxiv.org/html/2606.31432](https://arxiv.org/html/2606.31432)
###### Abstract
Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a nursing-action item may require different low-rank updates, while some recall items should preserve the base model’s representation with only mild adapter intervention. We propose BiRG-LoRA, a single-adapter rank-gated LoRA method for medical question answering. BiRG-LoRA keeps one LoRA module per target layer but makes its rank dimension input-conditioned: for each question, a biaxial gate combines hidden semantic evidence with specialty/profession priors, clinical-operation priors, and their interaction to select a sparse top-$k$ subset of rank atoms. A scalar injection coefficient further controls the strength of the selected adapter update. Under a matched Qwen3-8B CMB-source protocol, BiRG-LoRA achieves the highest four-benchmark macro-average accuracy among trainable PEFT baselines and matched routing controls: 69.31% averaged over CMB, CMExam, MedQA, and MedMCQA. It improves over MoELoRA by 0.89 percentage points while using 28.1% fewer trainable parameters; a paired, benchmark-stratified bootstrap over final predictions gives a 95% confidence interval of [0.42, 1.37] for this macro-average gain. Basic controls show that BiRG-LoRA also improves over vanilla LoRA r16 and active-rank-matched LoRA r4 by 0.83 macro points, and an evaluation-time weak-axis perturbation check suggests that performance is not brittle to moderate tag noise. The results support a bounded claim: clinically structured rank allocation improves cross-benchmark medical QA under a matched single-seed protocol, while training-seed variance remains future work.
## I. Introduction
Medical large language models \(LLMs\) are commonly evaluated with exam\-style multiple\-choice benchmarks such as CMB, CMExam, MedQA, and MedMCQA\[[21](https://arxiv.org/html/2606.31432#bib.bib1),[11](https://arxiv.org/html/2606.31432#bib.bib2),[6](https://arxiv.org/html/2606.31432#bib.bib3),[15](https://arxiv.org/html/2606.31432#bib.bib4)\]\. These benchmarks mix heterogeneous medical specialties, professional roles, and reasoning operations\. A question about drug mechanism, a treatment decision, a public\-health policy item, and a nursing\-action item may all share the same answer format, but they stress different regions of medical knowledge\. This makes medical QA a difficult setting for ordinary parameter\-efficient fine\-tuning: a single fixed LoRA update can improve one subset while weakening another, whereas multi\-adapter mixtures add storage, routing, and calibration complexity\.
Recent LoRA routing methods address this problem by composing or selecting adapters\[[14](https://arxiv.org/html/2606.31432#bib.bib22),[17](https://arxiv.org/html/2606.31432#bib.bib23),[23](https://arxiv.org/html/2606.31432#bib.bib24),[9](https://arxiv.org/html/2606.31432#bib.bib25)\], or by training mixture\-of\-LoRA experts\[[13](https://arxiv.org/html/2606.31432#bib.bib18),[8](https://arxiv.org/html/2606.31432#bib.bib19),[10](https://arxiv.org/html/2606.31432#bib.bib20),[22](https://arxiv.org/html/2606.31432#bib.bib21)\]\. A more fine\-grained direction is rank\-wise routing, where the rank channels of a single LoRA are treated as small experts rather than training many complete adapters\[[27](https://arxiv.org/html/2606.31432#bib.bib26)\]\. This is attractive for medical QA because it promises sparse activation and parameter sharing\. However, a purely generic rank\-wise expert view misses a crucial point: in medicine, routing signals are not arbitrary task IDs\. They have meaningful structure\. Specialty/profession priors, clinical\-operation priors, and hidden semantic cues each describe different sources of adaptation conflict\.
We therefore ask: *can a single LoRA adapter use clinically meaningful axes to allocate its internal rank channels, while remaining competitive with multi-adapter and LoRA-MoE baselines?* We propose BiRG-LoRA, a biaxial rank-gated LoRA method. BiRG-LoRA stores one rank-$R$ adapter per target module but activates only $\kappa$ rank atoms per question. The rank gate combines four signals: a hidden-state branch, a specialty/profession branch, a clinical-operation branch, and an interaction branch. A scalar injection coefficient controls the strength of the selected low-rank update. Figure [1](https://arxiv.org/html/2606.31432#S1.F1) summarizes the design.
This formulation makes clinical structure part of the adapter itself\. Instead of assigning every question to the same fixed rank subspace, BiRG\-LoRA learns which rank atoms should be used for different combinations of content domain, clinical operation, and question semantics\. Under a matched Qwen3\-8B protocol with 4,200 CMB training examples and 400 update steps, BiRG\-LoRA reaches an average accuracy of 69\.31% across CMB, CMExam, MedQA, and MedMCQA\. It outperforms trainable MoELoRA by 0\.89 points while using 66\.59M rather than 92\.60M trainable parameters\. It also improves over vanilla LoRA controls, including rank\-16 LoRA and an active\-rank\-matched rank\-4 LoRA\. Paired bootstrap tests over final predictions support the main macro\-average gains, and evaluation\-time axis perturbation suggests that the method is not brittle to moderate noise in the rule\-derived clinical tags\.
Our contributions are:
- We introduce a biaxial rank-gated LoRA formulation for medical QA, where specialty/profession and clinical-operation priors jointly guide sparse rank activation inside one adapter.
- We add an input-conditioned injection coefficient that decides how strongly the selected adapter update should affect the base representation.
- We provide a matched evaluation against base Qwen3-8B, vanilla LoRA r16/r4/r24, MoELoRA, DoRA, a generic rank-wise LoRA control inspired by SMoRA, and adapter-library baselines including MedAdapter, Arrow, MeteoRA, and LoRA-Mixer style routing.
- We report paired, benchmark-stratified bootstrap tests and sign tests over final predictions, showing that the main macro-average gains over MoELoRA, vanilla LoRA, and a generic rank-wise control are unlikely to be explained by evaluation-set sampling noise.
- We report ablations and robustness checks for clinical-only, hidden-only, no-injection, top-$k$, orthogonality, axis contrast, and weak-axis noise, suggesting that the full biaxial design is not reducible to generic rank-wise activation or fragile metadata lookup.
Figure 1: BiRG-LoRA uses one shared LoRA basis and dynamically selects rank atoms. The gate combines hidden semantics, a specialty/profession axis, a clinical-operation axis, and their interaction. The selected top-$k$ diagonal mask and injection coefficient decide which rank atoms are active and how strongly the adapter modifies the frozen backbone.
## II. Related Work
Medical QA and medical LLM evaluation. PubMedQA, MedQA, MedMCQA, CMB, and CMExam provide complementary biomedical and exam-style QA settings across English and Chinese \[[7](https://arxiv.org/html/2606.31432#bib.bib5), [6](https://arxiv.org/html/2606.31432#bib.bib3), [15](https://arxiv.org/html/2606.31432#bib.bib4), [21](https://arxiv.org/html/2606.31432#bib.bib1), [11](https://arxiv.org/html/2606.31432#bib.bib2)\]. Medical LLM studies such as Med-PaLM and Med-PaLM 2 show strong progress but also emphasize calibration, uncertainty, and safety-sensitive validation \[[18](https://arxiv.org/html/2606.31432#bib.bib6), [19](https://arxiv.org/html/2606.31432#bib.bib7)\]. Our work does not propose a new benchmark; it studies how a medical adapter should allocate low-rank capacity when source and target benchmarks differ by language, specialty mix, and reasoning operation.
Parameter-efficient adaptation. LoRA freezes the base weights and learns low-rank update matrices \[[5](https://arxiv.org/html/2606.31432#bib.bib10)\]; QLoRA makes this practical for quantized LLMs \[[1](https://arxiv.org/html/2606.31432#bib.bib11)\]; and DoRA separates magnitude and direction updates for stronger low-rank adaptation \[[12](https://arxiv.org/html/2606.31432#bib.bib12)\]. Adaptive rank methods such as AdaLoRA, DyLoRA, and IncreLoRA allocate rank budgets across modules or deployment ranks \[[26](https://arxiv.org/html/2606.31432#bib.bib13), [20](https://arxiv.org/html/2606.31432#bib.bib14), [25](https://arxiv.org/html/2606.31432#bib.bib15)\]. BiRG-LoRA is different because the active rank subset is chosen per medical question, and the router is explicitly conditioned on clinical axes rather than only on weight importance or a global deployment budget.
Mixtures and routers for LoRA. Sparse MoE models route tokens or examples to a small subset of experts \[[16](https://arxiv.org/html/2606.31432#bib.bib16), [2](https://arxiv.org/html/2606.31432#bib.bib17)\]. LoRA-based MoE systems, including MoELoRA, MixLoRA, MING-MOE, and MoLE, train or combine multiple LoRA experts \[[13](https://arxiv.org/html/2606.31432#bib.bib18), [8](https://arxiv.org/html/2606.31432#bib.bib19), [10](https://arxiv.org/html/2606.31432#bib.bib20), [22](https://arxiv.org/html/2606.31432#bib.bib21)\]. Other methods reuse adapter libraries or route over generated candidates, including Arrow, MedAdapter, MeteoRA, and LoRA-Mixer \[[14](https://arxiv.org/html/2606.31432#bib.bib22), [17](https://arxiv.org/html/2606.31432#bib.bib23), [23](https://arxiv.org/html/2606.31432#bib.bib24), [9](https://arxiv.org/html/2606.31432#bib.bib25)\]. These systems motivate our baselines. The main difference is that BiRG-LoRA performs structured rank selection inside a single adapter rather than selecting among multiple complete adapter modules.
Rank-wise expert LoRA. SMoRA proposes that each LoRA rank can be treated as an independent expert and uses dynamic rank-wise activation for multi-task learning \[[27](https://arxiv.org/html/2606.31432#bib.bib26)\]. We build on the same broad direction but target a different problem. Table [I](https://arxiv.org/html/2606.31432#S2.T1) summarizes the distinction. SMoRA addresses generic multi-task LoRA conflicts; BiRG-LoRA addresses medical vertical adaptation conflicts, where the routing signal should reflect specialty, clinical operation, and hidden semantics. We also introduce an injection coefficient, because medical QA may require conservative use of base knowledge for some items and stronger adapter intervention for others.
TABLE I: Relation to prior rank-wise LoRA.
## III. Methodology
### III-A. Problem Setting
Let $x$ be a medical multiple-choice question with options and gold answer $y$ during source training. A frozen LLM with parameters $\Theta_0$ is adapted using LoRA modules in selected projection layers. We train on a source set $D_s$ and evaluate on multiple benchmarks $D_t$. Target benchmark labels are used only for final reporting, not for training, threshold selection, or router calibration.
For a target linear layer $\ell$, standard LoRA computes
$$z_\ell = W_\ell^0 h_\ell + \frac{\alpha}{R} B_\ell A_\ell h_\ell, \tag{1}$$
where $W_\ell^0$ is frozen, $A_\ell \in \mathbb{R}^{R \times d_{\mathrm{in}}}$, and $B_\ell \in \mathbb{R}^{d_{\mathrm{out}} \times R}$. Standard LoRA activates all $R$ rank channels for every question. BiRG-LoRA instead uses
$$z_\ell = W_\ell^0 h_\ell + \rho(x) \frac{\alpha}{R} B_\ell \operatorname{Diag}(\zeta(x)) A_\ell h_\ell, \tag{2}$$
where $\zeta(x) \in \{0,1\}^R$ is a sparse rank mask with $\|\zeta(x)\|_0 = \kappa$, and $\rho(x) \in [0,1]$ is an injection coefficient. In our main configuration, $R = 16$ and $\kappa = 4$.
Each rank channel corresponds to a rank-one atom:
$$\Delta_{\ell,j}(h_\ell) = B_{\ell,:,j} A_{\ell,j,:} h_\ell. \tag{3}$$
The method therefore compresses expert selection into the rank dimension of one LoRA adapter instead of selecting among many complete adapters.
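
As a concrete reading of Eqs. (2) and (3), the following minimal NumPy sketch applies a rank-gated update to a single layer and checks that the masked product equals the sum of the selected rank-one atoms. The dimensions, the random weights, the hand-picked mask, and the function names `birg_lora_forward` and `atom_sum` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def birg_lora_forward(h, W0, A, B, zeta, rho, alpha=32.0):
    """Eq. (2): z = W0 h + rho * (alpha / R) * B Diag(zeta) A h."""
    R = A.shape[0]
    # Multiplying A h elementwise by zeta is the same as applying Diag(zeta).
    low_rank = B @ (zeta * (A @ h))
    return W0 @ h + rho * (alpha / R) * low_rank

def atom_sum(h, A, B, zeta):
    """Same low-rank term written as a sum of active rank-one atoms, Eq. (3)."""
    active = np.flatnonzero(zeta)
    return sum(B[:, j] * (A[j, :] @ h) for j in active)

# Toy shapes (assumed); R = 16 and kappa = 4 follow the main configuration.
rng = np.random.default_rng(0)
d_in, d_out, R, kappa = 8, 6, 16, 4
h = rng.normal(size=d_in)
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(R, d_in))
B = rng.normal(size=(d_out, R))

zeta = np.zeros(R)
zeta[[1, 5, 9, 12]] = 1.0   # hypothetical mask with exactly kappa active atoms
rho = 0.75                  # hypothetical injection coefficient

z = birg_lora_forward(h, W0, A, B, zeta, rho)
z_atoms = W0 @ h + rho * (32.0 / R) * atom_sum(h, A, B, zeta)

# With an all-ones mask and rho = 1, Eq. (2) reduces to standard LoRA, Eq. (1).
z_full = birg_lora_forward(h, W0, A, B, np.ones(R), 1.0)
z_lora = W0 @ h + (32.0 / R) * (B @ (A @ h))
```

Both formulations give the same output, so the mask can be viewed either as a diagonal matrix or as a selection over rank-one atoms.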
TABLE II: Design motivations in BiRG-LoRA. Each component is tied to a medical adaptation concern and to an empirical check in Section [V](https://arxiv.org/html/2606.31432#S5).
### III-B. Biaxial Rank Gate
Medical questions differ along at least two clinically meaningful axes. The first is a specialty or profession axis $s(x)$, including clinical medicine, pharmacy, nursing, public health, medical technology, traditional Chinese medicine, and basic biomedicine. The second is a clinical-operation axis $o(x)$, including diagnosis, treatment, medication, test interpretation, mechanism, nursing action, public health, and knowledge recall. These weak labels are derived from existing metadata and deterministic keyword rules; they are used as structural priors rather than expert-certified explanations. At both training and evaluation time, $s(x)$ and $o(x)$ are derived only from the question stem, options, and publicly available non-answer metadata; gold answers and target benchmark labels are never used to construct routing features.
The rank gate produces logits
$$a(x) = a_h(h_x) + a_s(s(x)) + a_o(o(x)) + a_{so}(s(x), o(x)), \tag{4}$$
where $h_x$ is a pooled hidden representation. The hidden branch lets the model depart from coarse metadata when the question text demands it. The specialty and operation branches encode clinical priors, and the interaction branch captures combinations such as pharmacy-medication versus clinical-diagnosis.
The sparse mask is formed by a straight-through top-$\kappa$ operator:
$$p(x) = \operatorname{softmax}(a(x)/\tau), \quad \zeta(x) = \operatorname{STTopK}(p(x), \kappa). \tag{5}$$
The forward pass uses a hard top-$\kappa$ mask, while gradients flow through the soft probabilities. This makes rank selection auditable while preserving differentiability.
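
The forward computation of Eqs. (4) and (5) can be sketched as follows. The text does not specify the form of each branch, so this sketch assumes a linear map for the hidden branch and lookup tables for the specialty, operation, and interaction branches; all weights are random placeholders, and the names `gate_logits` and `top_k_mask` are hypothetical. The axis sizes (7 specialties, 8 operations) follow the lists above.

```python
import numpy as np

def softmax(a, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    s = a / tau
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def top_k_mask(p, kappa):
    """Hard 0/1 mask over the kappa largest entries of p (forward pass only)."""
    mask = np.zeros_like(p)
    mask[np.argsort(-p, kind="stable")[:kappa]] = 1.0
    return mask

def gate_logits(h_x, s_id, o_id, W_h, E_s, E_o, E_so):
    """Eq. (4) under the assumed branch forms: linear hidden branch plus lookups."""
    return W_h @ h_x + E_s[s_id] + E_o[o_id] + E_so[s_id, o_id]

rng = np.random.default_rng(0)
d, R, kappa, tau = 32, 16, 4, 0.5   # d is an assumed pooled hidden size
n_s, n_o = 7, 8                     # specialty and operation vocabularies

W_h = rng.normal(scale=0.1, size=(R, d))
E_s = rng.normal(size=(n_s, R))
E_o = rng.normal(size=(n_o, R))
E_so = rng.normal(size=(n_s, n_o, R))

h_x = rng.normal(size=d)
a = gate_logits(h_x, s_id=1, o_id=2, W_h=W_h, E_s=E_s, E_o=E_o, E_so=E_so)
p = softmax(a, tau)
zeta = top_k_mask(p, kappa)

# In an autodiff framework, a common straight-through construction is
# zeta_hard + p - stop_gradient(p): the forward value is the hard mask and the
# backward pass uses the gradient of p. NumPy has no autodiff, so only the
# forward pass is shown here.
```

Because the mask is a plain 0/1 vector, the active rank indices for any question can be logged and inspected directly.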
### III-C. Injection Coefficient
The scalar $\rho(x)$ controls update strength:
$$\rho(x) = \sigma(w_\rho^\top h_x + b_\rho). \tag{6}$$
The motivation is medical conservatism. Some questions can be answered from the base model’s broad knowledge, and aggressive adapter updates may perturb a correct answer. Other questions need stronger domain-specific intervention. In the current experiments, $\rho(x)$ is helpful but not yet a mature answer-level risk calibrator: its mean is around 0.75 in the full model, and the no-injection ablation is lower on average. We therefore describe it as a dynamic update scaler rather than a clinical safety guarantee.
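
For intuition about the scale of Eq. (6), the short sketch below evaluates the sigmoid for a toy input. The weight vector and bias are assumptions chosen only so that the output equals 0.75, the approximate mean reported above; they are not learned values from the paper.

```python
import numpy as np

def injection_coefficient(h_x, w_rho, b_rho):
    """Eq. (6): rho = sigmoid(w_rho^T h_x + b_rho), always strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-(w_rho @ h_x + b_rho)))

h_x = np.array([0.5, -1.0, 2.0])   # toy pooled representation (assumed)
w_rho = np.zeros(3)                # zero weights isolate the effect of the bias
b_rho = np.log(3.0)                # sigmoid(log 3) = 3 / (3 + 1) = 0.75

rho = injection_coefficient(h_x, w_rho, b_rho)
```

A coefficient of 0.75 scales the selected low-rank update to three quarters of its full strength, while values near 0 leave the frozen backbone output almost unchanged.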
### III-D. Training Objective
Training uses answer-only supervised fine-tuning on the source set. The loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{ans}} + \lambda_{\mathrm{ent}}\mathcal{L}_{\mathrm{ent}} + \lambda_{\mathrm{bal}}\mathcal{L}_{\mathrm{bal}} + \lambda_{\mathrm{orth}}\mathcal{L}_{\mathrm{orth}} + \lambda_{\mathrm{axis}}\mathcal{L}_{\mathrm{axis}}. \tag{7}$$
$\mathcal{L}_{\mathrm{ent}}$ keeps routing probabilities from becoming degenerate, $\mathcal{L}_{\mathrm{bal}}$ discourages global rank collapse, $\mathcal{L}_{\mathrm{orth}}$ penalizes similarity among rank atoms, and $\mathcal{L}_{\mathrm{axis}}$ separates specialty and operation prototype rank distributions. These terms are included because short medical SFT runs can otherwise learn a superficially sparse but semantically collapsed gate.
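
The text describes the purpose of each regularizer but not its exact formula. The sketch below shows one plausible instantiation of three of the terms, purely as an assumption: an entropy gap to the uniform distribution, a squared deviation of batch-mean rank usage from uniform, and a mean squared off-diagonal cosine similarity between rows of $A_\ell$. The axis-contrast term is omitted because its prototype construction is not specified. The default coefficients are the ones reported in the experimental protocol.

```python
import numpy as np

def entropy_term(P):
    """Assumed form: log(R) minus the mean per-question gate entropy.
    Zero when every gate distribution is uniform, larger as gates sharpen."""
    R = P.shape[1]
    H = -(P * np.log(P + 1e-12)).sum(axis=1)
    return np.log(R) - H.mean()

def balance_term(P):
    """Assumed form: squared deviation of batch-mean rank usage from 1/R."""
    R = P.shape[1]
    usage = P.mean(axis=0)
    return ((usage - 1.0 / R) ** 2).sum()

def orthogonality_term(A):
    """Assumed form: mean squared off-diagonal cosine similarity of rank atoms."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = A_n @ A_n.T
    off = G - np.diag(np.diag(G))
    R = A.shape[0]
    return (off ** 2).sum() / (R * (R - 1))

def total_loss(ans_loss, P, A, lam_ent=0.01, lam_bal=0.001, lam_orth=0.1):
    """Eq. (7) without the axis-contrast term, which is not sketched here."""
    return (ans_loss
            + lam_ent * entropy_term(P)
            + lam_bal * balance_term(P)
            + lam_orth * orthogonality_term(A))

# Two extreme gate batches over R = 16 ranks: fully uniform vs. fully collapsed.
R = 16
P_uniform = np.full((8, R), 1.0 / R)
P_collapsed = np.zeros((8, R))
P_collapsed[:, 0] = 1.0
```

Under these assumed forms, a batch in which every question routes all probability mass to the same rank is penalized by both the entropy and the balance terms, which matches the stated goal of avoiding a collapsed gate.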
## IV. Experiments
### IV-A. Datasets and Protocol
The main experiments use the public Qwen3-8B checkpoint \[[24](https://arxiv.org/html/2606.31432#bib.bib8)\]. The source training set is a route-balanced CMB subset with 4,200 examples, 600 examples per major specialty/profession route, disjoint from CMB eval4149. Evaluation uses CMB eval4149, CMExam full, MedQA test700, and MedMCQA val700. CMB and CMExam are Chinese medical exams; MedQA and MedMCQA are English medical QA benchmarks. This design tests both source-domain performance and cross-benchmark transfer from Chinese training data to English questions.
All main trainable methods use the same training budget unless otherwise stated: 400 update steps, effective batch size 8, learning rate $10^{-4}$, maximum length 768, and answer-only SFT. BiRG-LoRA uses LoRA rank $R = 16$, active top-$\kappa$ with $\kappa = 4$, alpha 32, rank temperature 0.5, rank entropy coefficient 0.01, rank-balance coefficient 0.001, orthogonality coefficient 0.1, and axis-contrast coefficient 0.5. Target modules are the q/k/v/o/gate/up/down projections. We use greedy decoding, the same answer-letter parser for all methods, and the same prompt template within each language setting.
### IV-B. Baselines
We compare against basic and strong baselines\. Basic controls include frozen Qwen3\-8B, vanilla LoRA r16/alpha32, active\-rank\-matched LoRA r4/alpha8, and parameter\-matched LoRA r24/alpha48\. These rows answer whether the proposed rank gate is better than simply using a fixed low\-rank adapter with similar stored rank, similar active rank, or similar trainable parameter count\.
Trainable structured baselines include MoELoRA with 4 experts, top\-2 routing, and rank\-8 experts; DoRA with rank 24 and alpha 48; and a generic rank\-wise LoRA control inspired by SMoRA\. The rank\-wise control uses the same rank\-16/top\-4 adapter but removes clinical rank routing, interaction terms, axis contrast, orthogonality, and injection gating\. Adapter\-library baselines include MedAdapter\-style candidate verification, Arrow LoRA\-weight routing, MeteoRA\-style soft MoE, and LoRA\-Mixer\-style hard\-soft routing\. These routed baselines use fixed Qwen3 prompts, the same generated adapter\-output pool, and source\-only calibration; target labels are used only after policies are fixed\.
### IV-C. Statistical Testing
Because the main margins are less than one percentage point on the four\-benchmark macro average, we report paired statistical tests over final predictions\. For each comparison, examples are paired by question identifier\. The primary test is a benchmark\-stratified paired bootstrap with 10,000 resamples: each bootstrap sample resamples examples within each benchmark and then averages the four benchmark\-level deltas\. This matches the paper’s primary metric, the equal\-weighted four\-benchmark macro average\. We also report a pooled paired comparison over all examples using wins/losses and a two\-sided sign test\. These tests address evaluation\-set sampling variability for a fixed checkpoint; they do not estimate training\-run variance across different seeds\.
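
The two tests described above can be sketched as follows. This is a minimal NumPy version under stated assumptions: predictions are given as 0/1 correctness arrays already aligned by question identifier, the confidence interval is a percentile interval, and the sign test is an exact binomial test over discordant pairs. The function names and the synthetic data are illustrative only.

```python
import numpy as np
from math import comb

def stratified_paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """correct_a / correct_b: dict benchmark -> 0/1 array, paired by question id.
    Returns (point estimate, lower, upper) of the macro-average delta in points."""
    rng = np.random.default_rng(seed)
    deltas = {k: np.asarray(correct_a[k], float) - np.asarray(correct_b[k], float)
              for k in correct_a}
    point = 100.0 * np.mean([d.mean() for d in deltas.values()])
    samples = np.empty(n_resamples)
    for i in range(n_resamples):
        per_bench = []
        for d in deltas.values():
            # Resample paired examples with replacement within each benchmark.
            idx = rng.integers(0, len(d), size=len(d))
            per_bench.append(d[idx].mean())
        # Equal-weighted average of the benchmark-level deltas.
        samples[i] = 100.0 * np.mean(per_bench)
    lo, hi = np.percentile(samples, [2.5, 97.5])
    return point, lo, hi

def two_sided_sign_test(wins, losses):
    """Exact two-sided binomial sign test over discordant pairs (ties dropped)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Synthetic illustration (not the paper's data): method A is never worse than B.
rng = np.random.default_rng(42)
sizes = {"bench_a": 400, "bench_b": 120}
preds_b = {k: rng.integers(0, 2, size=n) for k, n in sizes.items()}
preds_a = {k: np.maximum(v, rng.integers(0, 2, size=len(v)))
           for k, v in preds_b.items()}

point, lo, hi = stratified_paired_bootstrap(preds_a, preds_b, n_resamples=2000)
wins = sum(int(((preds_a[k] == 1) & (preds_b[k] == 0)).sum()) for k in sizes)
losses = sum(int(((preds_a[k] == 0) & (preds_b[k] == 1)).sum()) for k in sizes)
p_value = two_sided_sign_test(wins, losses)
```

Resampling within each benchmark keeps the bootstrap statistic aligned with the equal-weighted macro average, so a large benchmark cannot dominate the interval the way it does in the pooled sign test.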
### IV-D. Parameter Accounting
BiRG-LoRA has 66.59M trainable parameters. MoELoRA has 92.60M trainable parameters, so BiRG-LoRA uses 28.1% fewer trainable parameters. For scale, a 16-adapter rank-8 library would imply approximately 349.18M trainable adapter parameters; the single rank-gated adapter is therefore a 5.24$\times$ compression relative to that materialized library. The active capacity is also sparse: BiRG-LoRA stores 16 rank atoms and activates 4 per question, with empirical effective rank about 3.02. MoELoRA activates two rank-8 experts per question, or roughly 16 active low-rank directions.
This accounting is intentionally conservative\. MoELoRA is not an underpowered baseline: it has more trainable parameters and about four times the active low\-rank directions per question\. DoRA is also parameter\-matched to BiRG\-LoRA at 66\.87M trainable parameters, and LoRA r24 has 65\.47M trainable parameters, close to BiRG\-LoRA’s 66\.59M\. The comparison therefore asks whether structured sparse rank allocation can improve average transfer, rather than whether a larger adapter simply wins by capacity\.
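
The fixed-rank LoRA counts and the two headline ratios can be reproduced with a short calculation. The layer shapes below are an assumption about the Qwen3-8B architecture (hidden size 4096, 36 layers, 32 query heads and 8 key/value heads of dimension 128, MLP width 12288); they are not stated in the text, but they reproduce the reported 10.91M, 43.65M, and 65.47M figures for ranks 4, 16, and 24.

```python
# Assumed Qwen3-8B shapes (not given in the text).
HIDDEN, LAYERS = 4096, 36
Q_DIM, KV_DIM, MLP = 32 * 128, 8 * 128, 12288

# (d_in, d_out) for the seven target projections in each layer.
PROJECTIONS = {
    "q": (HIDDEN, Q_DIM),
    "k": (HIDDEN, KV_DIM),
    "v": (HIDDEN, KV_DIM),
    "o": (Q_DIM, HIDDEN),
    "gate": (HIDDEN, MLP),
    "up": (HIDDEN, MLP),
    "down": (MLP, HIDDEN),
}

def lora_params(rank):
    """Trainable parameters of plain LoRA: rank * (d_in + d_out) per projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return LAYERS * per_layer

# Reported trainable-parameter counts for the two main methods.
birg, moelora = 66.59e6, 92.60e6

reduction_pct = 100.0 * (moelora - birg) / moelora   # relative saving vs. MoELoRA
library = 16 * lora_params(8)                        # 16 adapters of rank 8
compression = library / birg                         # library size / BiRG-LoRA size

for r in (4, 16, 24):
    print(f"LoRA r{r}: {lora_params(r) / 1e6:.2f}M")
print(f"reduction vs MoELoRA: {reduction_pct:.1f}%")
print(f"16 x r8 library: {library / 1e6:.2f}M, compression: {compression:.2f}x")
```

The gap between BiRG-LoRA's 66.59M and the plain rank-16 count of 43.65M is not itemized in the text, so this sketch does not attempt to break it down.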
## V. Results and Analysis
### V-A. Main Comparison
Table [III](https://arxiv.org/html/2606.31432#S5.T3) reports the main Qwen3-8B CMB-source comparison, and Figure [2](https://arxiv.org/html/2606.31432#S5.F2) visualizes the average accuracy and parameter trade-off. BiRG-LoRA reaches the highest four-benchmark macro average, 69.31%. It outperforms MoELoRA by 0.89 average points while using fewer trainable parameters. It also improves over vanilla LoRA r16 and active-rank-matched LoRA r4 by 0.83 average points, and over parameter-matched LoRA r24 by 0.73 average points. These basic controls are important: they show that the gain is not explained by ordinary answer-only LoRA training, by using a smaller active rank, or by matching the trainable parameter count.
The per\-benchmark pattern explains why average accuracy is the primary claim\. On CMB, DoRA is strongest, suggesting that a parameter\-matched direction/magnitude decomposition can fit the Chinese source\-like distribution very well\. On CMExam, Arrow and MedAdapter are competitive because adapter\-library routing preserves multiple candidate behaviors and CMExam remains close to Chinese medical exam structure\. On MedQA and MedMCQA, BiRG\-LoRA is strongest among the compared rows, indicating that the sparse rank subspace learned from Chinese source data transfers to English medical questions better than the fixed\-rank and full\-expert alternatives in this protocol\. The generic rank\-wise control reaches 68\.40% average, below the full biaxial method, supporting the claim that rank\-wise activation alone is insufficient for the medical setting\.
TABLE III: Main Qwen3-8B results under matched CMB-source training. Accuracy is reported in percent. Adapter-library rows use source-only calibration over a shared generated-output pool; N/A means the trainable-parameter count is not directly comparable to single-adapter rows.

| Family | Method | Params | CMB | CMExam | MedQA | MedMCQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base model | Qwen3-8B, no SFT | 0 | 71.68 | 68.95 | 62.86 | 54.86 | 64.59 |
| Fixed LoRA | LoRA r4/alpha8 | 10.91M | 77.83 | 75.82 | 63.00 | 57.29 | 68.48 |
| Fixed LoRA | LoRA r16/alpha32 | 43.65M | 77.87 | 76.20 | 62.43 | 57.43 | 68.48 |
| Fixed LoRA | LoRA r24/alpha48 | 65.47M | 78.09 | 75.82 | 62.57 | 57.86 | 68.58 |
| Single adapter | BiRG-LoRA r16/top4 | 66.59M | 78.26 | 76.27 | 64.43 | 58.29 | 69.31 |
| Rank-wise control | Rank-wise gate without clinical priors | 64.88M | 78.36 | 75.82 | 63.14 | 56.29 | 68.40 |
| LoRA-MoE | MoELoRA 4e/top2/r8 | 92.60M | 77.95 | 75.32 | 63.14 | 57.29 | 68.42 |
| PEFT | DoRA r24/alpha48 | 66.87M | 78.69 | 75.94 | 62.00 | 56.71 | 68.34 |
| Adapter library | MedAdapter-style candidate verifier | N/A | 77.61 | 77.07 | 64.00 | 56.86 | 68.89 |
| Adapter library | Arrow LoRA-weight router | N/A | 76.69 | 77.30 | 61.57 | 56.57 | 68.03 |
| Adapter library | MeteoRA-style soft MoE | N/A | 76.67 | 75.33 | 61.57 | 56.57 | 67.54 |
| Adapter library | LoRA-Mixer-style hard-soft router | N/A | 77.03 | 75.52 | 62.00 | 57.14 | 67.92 |

Figure 2: Average accuracy and trainable-parameter comparison. BiRG-LoRA gives the strongest average among compared methods while using fewer trainable parameters than MoELoRA.
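
To make the primary metric concrete, the sketch below recomputes the equal-weighted macro average for a few rows of Table III and the per-benchmark differences against DoRA. The accuracies are copied from the table; the variable names are illustrative.

```python
import numpy as np

# Per-benchmark accuracies (CMB, CMExam, MedQA, MedMCQA) from Table III.
rows = {
    "BiRG-LoRA r16/top4": [78.26, 76.27, 64.43, 58.29],
    "MoELoRA 4e/top2/r8": [77.95, 75.32, 63.14, 57.29],
    "LoRA r24/alpha48": [78.09, 75.82, 62.57, 57.86],
    "DoRA r24/alpha48": [78.69, 75.94, 62.00, 56.71],
}

# Macro average: each benchmark counts equally, regardless of its size.
macro = {name: float(np.mean(acc)) for name, acc in rows.items()}
gain = {name: macro["BiRG-LoRA r16/top4"] - m for name, m in macro.items()}

# Per-benchmark difference against DoRA: negative on CMB, positive elsewhere.
delta_vs_dora = (np.array(rows["BiRG-LoRA r16/top4"])
                 - np.array(rows["DoRA r24/alpha48"]))

for name, g in gain.items():
    print(f"{name}: macro {macro[name]:.2f}, BiRG-LoRA gain {g:+.2f}")
print("delta vs DoRA per benchmark:", np.round(delta_vs_dora, 2))
```

The per-benchmark differences show how a positive macro-average gain can coexist with a loss on an individual benchmark, which is why the claims here are stated at the macro level.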
### V-B. Paired Statistical Evidence
Table [IV](https://arxiv.org/html/2606.31432#S5.T4) reports paired bootstrap tests over final predictions. The strongest evidence is against MoELoRA, the main trainable LoRA-MoE baseline: BiRG-LoRA improves the four-benchmark macro average by 0.89 points with a 95% confidence interval of [0.42, 1.37]. The parameter-matched LoRA r24 comparison is also positive, with a macro 95% confidence interval of [0.19, 1.28]. The comparison with the generic rank-wise control is similarly positive, supporting the role of clinical-axis-aware rank selection rather than generic rank activation alone.
The DoRA comparison is more nuanced\. Since our primary metric is the equal\-weighted four\-benchmark macro average, the stratified bootstrap supports a macro\-level advantage over DoRA\. However, the pooled instance\-level comparison is inconclusive because CMB and CMExam contain many more examples and DoRA is stronger on CMB\. We therefore describe the DoRA result as a macro\-average improvement rather than a uniformly significant per\-instance gain\.
TABLE IV: Paired statistical tests over final predictions. Macro CIs use benchmark-stratified paired bootstrap with 10,000 resamples. Pooled comparisons aggregate all paired examples and report a two-sided sign test over discordant pairs.
### V-C. Ablation Study
Table [V](https://arxiv.org/html/2606.31432#S5.T5) and Figure [3](https://arxiv.org/html/2606.31432#S5.F3) summarize ablations. Clinical-only gating averages 68.30%, hidden-only gating averages 68.82%, and the full model reaches 69.31%. The two axes are therefore complementary: clinical priors alone are too coarse, but hidden-only routing lacks the medical structure needed for robust transfer. Turning off the injection coefficient gives 68.58%, supporting the usefulness of dynamic update strength. Top-$k$ with $k = 2$ is too sparse, while $k = 8$ is close to but lower than $k = 4$, showing that more active rank is not automatically better.
TABLE V: Ablation study under Qwen3-8B CMB-source training.
Figure 3: Average accuracy drop relative to the full BiRG-LoRA configuration. The largest drops occur when removing clinical structure, injection scaling, or adequate rank capacity.
### V-D. Weak-Axis Robustness
The specialty and operation axes are weak structural priors, so a natural concern is whether the method depends on exact metadata. We therefore perturb both axes at evaluation time by randomly replacing the specialty/profession tag and the operation tag for 10%, 20%, or 30% of examples using one perturbation seed. Table [VI](https://arxiv.org/html/2606.31432#S5.T6) suggests that performance is not brittle: at 30% perturbation, the macro average is 69.19%, only 0.12 points below the clean setting. This does not make the tags expert explanations, but it argues against a simple lookup-table interpretation. The hidden branch and shared rank basis appear to absorb moderate axis noise.
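
A perturbation of this kind can be sketched in a few lines. The sketch assumes that each perturbed example receives a tag drawn uniformly from the other tags of the same axis, so every selected example actually changes; the text does not say whether the replacement may coincide with the original tag. The helper name `perturb_tags` and the toy tag list are hypothetical.

```python
import numpy as np

def perturb_tags(tags, vocabulary, rate, seed=0):
    """Replace a `rate` fraction of tags with a different tag from `vocabulary`."""
    rng = np.random.default_rng(seed)
    tags = list(tags)
    n_flip = int(round(rate * len(tags)))
    # Choose which examples to perturb, without replacement.
    for i in rng.choice(len(tags), size=n_flip, replace=False):
        alternatives = [t for t in vocabulary if t != tags[i]]
        tags[i] = alternatives[rng.integers(len(alternatives))]
    return tags

# Operation-axis vocabulary as listed in the method section.
operations = ["diagnosis", "treatment", "medication", "test interpretation",
              "mechanism", "nursing action", "public health", "knowledge recall"]

clean = [operations[i % len(operations)] for i in range(100)]  # toy tag sequence
noisy = perturb_tags(clean, operations, rate=0.30, seed=0)
n_changed = sum(c != n for c, n in zip(clean, noisy))
```

The same helper would be applied independently to the specialty/profession axis, and the perturbed tags would then be fed to the gate in place of the clean ones at evaluation time only.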
TABLE VI: Evaluation-time perturbation of specialty/profession and clinical-operation tags using one perturbation seed.
### V-E. Training Source and Backbone Transfer
Table [VII](https://arxiv.org/html/2606.31432#S5.T7) reports two robustness checks. First, when all methods switch from CMB-source training to CMExam-source training, BiRG-LoRA remains the best average method: 69.82% versus 69.54% for MoELoRA and 69.29% for DoRA. This shows that the result is not an artifact of one Chinese training source. Second, on Llama3.1-8B-Instruct \[[3](https://arxiv.org/html/2606.31432#bib.bib9)\], BiRG-LoRA again has the highest average, but the margin is small. We therefore treat the Llama result as supporting evidence rather than a strong standalone claim.
TABLE VII: Training-source and backbone transfer. All rows use 4,200 source examples and 400 update steps.
Figure 4: Average accuracy across transfer settings. The same trend is observed when switching the Chinese source benchmark and when replacing Qwen3-8B with Llama3.1-8B, although the second-backbone margin is modest.
### V-F. What the Results Support
The experiments support a bounded claim\. BiRG\-LoRA is not the best method on every single benchmark: DoRA is strongest on CMB in the main table, and Arrow or MedAdapter can be stronger on CMExam\. The robust claim is average cross\-benchmark performance under matched source training, with better parameter efficiency than MoELoRA and stronger transfer than fixed\-rank LoRA and a generic rank\-wise control\. Paired bootstrap tests support the macro\-average gains over MoELoRA, vanilla LoRA controls, and the generic rank\-wise control, but they should not be interpreted as evidence of training\-seed robustness\.
The Chinese\-to\-English transfer results are also notable\. Training on CMB still improves MedQA and MedMCQA relative to DoRA, MoELoRA, fixed LoRA, and the generic rank\-wise control\. This suggests that the rank gate is not merely memorizing Chinese benchmark tags; it learns reusable answer\-domain adaptation patterns\. At the same time, the Llama3\.1 margins warn against overstating model independence\.
## VI. Limitations and Ethical Considerations
The experiments use one training seed\. We therefore do not estimate training\-run variance caused by initialization, data order, or optimizer stochasticity\. To reduce the risk that small margins reflect evaluation\-set sampling noise, we report paired bootstrap confidence intervals and sign tests over final predictions\. These tests support the main macro\-average gains over MoELoRA, vanilla LoRA controls, and the generic rank\-wise control, but they do not replace multi\-seed training stability analysis\. Small margins, especially the Llama3\.1 differences, should therefore be interpreted cautiously\. The evaluation is limited to multiple\-choice benchmarks and does not establish clinical deployment safety\.
The specialty and operation labels are weak metadata and rule\-derived tags\. They are useful structural priors, but they should not be read as human\-validated clinical explanations\. Our axis\-noise check uses one perturbation seed and suggests that moderate random perturbation does not collapse performance, but it is not a substitute for expert\-reviewed tags or repeated perturbation trials\. Future work should test expert\-curated axes and more fine\-grained operation schemas\.
The injection coefficient improves average accuracy, but it is not a calibrated risk score\. Calibration of neural confidence is difficult\[[4](https://arxiv.org/html/2606.31432#bib.bib27)\], and medical deployment would require answer\-level uncertainty, verifier signals, and prospective validation\. Finally, the current implementation reports active rank rather than measured latency; sparse rank computation may require fused kernels for real speedups\.
## VII. Conclusion
We presented BiRG\-LoRA, a biaxial rank\-gated LoRA method for medical multiple\-choice QA\. The method keeps one shared LoRA basis, activates a sparse subset of rank atoms per question, and conditions rank selection on hidden semantics, specialty/profession priors, clinical\-operation priors, and their interaction\. Under our matched Qwen3\-8B source\-training protocol, BiRG\-LoRA obtains the highest macro\-average accuracy among strong PEFT, LoRA\-MoE, rank\-wise, vanilla LoRA, and adapter\-library baselines while using fewer trainable parameters than MoELoRA\. Paired bootstrap tests support the main fixed\-checkpoint gains, and weak\-axis perturbation suggests that the method is not fragile to moderate metadata noise\. The broader lesson is that medical PEFT should ask not only how many adapter parameters to train, but which clinically structured rank subspaces should be used for each question\.
## References
- \[1\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. External Links: 2305\.14314, [Link](https://arxiv.org/abs/2305.14314) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[2\] W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\. Journal of Machine Learning Research 23\(120\), pp\. 1–39\. Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[3\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. External Links: 2407\.21783, [Link](https://arxiv.org/abs/2407.21783) Cited by: [§V\-E](https://arxiv.org/html/2606.31432#S5.SS5.p1.1)\.
- \[4\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 70, pp\. 1321–1330\. Cited by: [§VI](https://arxiv.org/html/2606.31432#S6.p3.1)\.
- \[5\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[6\] D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2020\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. External Links: 2009\.13081, [Link](https://arxiv.org/abs/2009.13081) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p1.1), [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[7\] Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019\) PubMedQA: a dataset for biomedical research question answering\. External Links: 1909\.06146, [Link](https://arxiv.org/abs/1909.06146) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[8\] D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang, and M\. Tang \(2024\) MixLoRA: enhancing large language models fine\-tuning with LoRA\-based mixture of experts\. External Links: 2404\.15159, [Link](https://arxiv.org/abs/2404.15159) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[9\] W\. Li, Z\. Song, H\. Zhou, Y\. Zhang, J\. Yu, and W\. Yang \(2025\) LoRA\-Mixer: coordinate modular LoRA experts through serial attention routing\. External Links: 2507\.00029, [Link](https://arxiv.org/abs/2507.00029) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[10\] Y\. Liao, S\. Jiang, Y\. Wang, and Y\. Wang \(2024\) MING\-MOE: enhancing medical multi\-task learning in large language models with sparse mixture of low\-rank adapter experts\. External Links: 2404\.09027, [Link](https://arxiv.org/abs/2404.09027) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[11\] J\. Liu, P\. Zhou, Y\. Hua, D\. Chong, Z\. Tian, A\. Liu, H\. Wang, C\. You, Z\. Guo, L\. Zhu, and M\. L\. Li \(2023\) Benchmarking large language models on CMExam: a comprehensive chinese medical exam dataset\. External Links: 2306\.03030, [Link](https://arxiv.org/abs/2306.03030) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p1.1), [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[12\] S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\) DoRA: weight\-decomposed low\-rank adaptation\. External Links: 2402\.09353, [Link](https://arxiv.org/abs/2402.09353) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[13\] T\. Luo, J\. Lei, F\. Lei, W\. Liu, S\. He, J\. Zhao, and K\. Liu \(2024\) MoELoRA: contrastive learning guided mixture of experts on parameter\-efficient fine\-tuning for large language models\. External Links: 2402\.12851, [Link](https://arxiv.org/abs/2402.12851) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[14\] O\. Ostapenko, Z\. Su, E\. Ponti, L\. Charlin, N\. Le Roux, L\. Caccia, and A\. Sordoni \(2024\) Towards modular LLMs by building and reusing a library of LoRAs\. In Forty\-first International Conference on Machine Learning\. External Links: [Link](https://openreview.net/forum?id=0ZFWfeVsaD) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[15\] A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\) MedMCQA: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research, Vol\. 174, pp\. 248–260\. Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p1.1), [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[16\] N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\) Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\. In International Conference on Learning Representations\. Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[17\] W\. Shi, R\. Xu, Y\. Zhuang, Y\. Yu, H\. Sun, H\. Wu, C\. Yang, and M\. D\. Wang \(2024\) MedAdapter: efficient test\-time adaptation of large language models towards medical reasoning\. External Links: 2405\.03000, [Link](https://arxiv.org/abs/2405.03000) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[18\] K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, et al\. \(2022\) Large language models encode clinical knowledge\. External Links: 2212\.13138, [Link](https://arxiv.org/abs/2212.13138) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[19\] K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, L\. Hou, K\. Clark, S\. Pfohl, H\. Cole\-Lewis, et al\. \(2023\) Towards expert\-level medical question answering with large language models\. External Links: 2305\.09617, [Link](https://arxiv.org/abs/2305.09617) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[20\] M\. Valipour, M\. Rezagholizadeh, I\. Kobyzev, and A\. Ghodsi \(2023\) DyLoRA: parameter efficient tuning of pre\-trained models using dynamic search\-free low\-rank adaptation\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics\. Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[21\] X\. Wang, G\. H\. Chen, D\. Song, Z\. Zhang, Z\. Chen, Q\. Xiao, F\. Jiang, J\. Li, X\. Wan, B\. Wang, and H\. Li \(2024\) CMB: a comprehensive medical benchmark in chinese\. External Links: 2308\.08833, [Link](https://arxiv.org/abs/2308.08833) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p1.1), [§II](https://arxiv.org/html/2606.31432#S2.p1.1)\.
- \[22\] X\. Wu, S\. Huang, and F\. Wei \(2024\) Mixture of LoRA experts\. External Links: 2404\.13628, [Link](https://arxiv.org/abs/2404.13628) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[23\] J\. Xu, J\. Lai, and Y\. Huang \(2024\) MeteoRA: multiple\-tasks embedded LoRA for large language models\. External Links: 2405\.13053, [Link](https://arxiv.org/abs/2405.13053) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p3.1)\.
- \[24\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§IV\-A](https://arxiv.org/html/2606.31432#S4.SS1.p1.1)\.
- \[25\] F\. Zhang, L\. Li, J\. Chen, Z\. Jiang, B\. Wang, and Y\. Qian \(2023\) IncreLoRA: incremental parameter allocation method for parameter\-efficient fine\-tuning\. External Links: 2308\.12043, [Link](https://arxiv.org/abs/2308.12043) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[26\] Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\) AdaLoRA: adaptive budget allocation for parameter\-efficient fine\-tuning\. External Links: 2303\.10512, [Link](https://arxiv.org/abs/2303.10512) Cited by: [§II](https://arxiv.org/html/2606.31432#S2.p2.1)\.
- \[27\] Z\. Zhao, Y\. Zhou, Z\. Zhang, D\. Zhu, T\. Shen, Z\. Li, J\. Yang, X\. Wang, J\. Su, K\. Kuang, Z\. Wei, F\. Wu, and Y\. Cheng \(2025\) Each rank could be an expert: single\-ranked mixture of experts LoRA for multi\-task learning\. External Links: 2501\.15103, [Link](https://arxiv.org/abs/2501.15103) Cited by: [§I](https://arxiv.org/html/2606.31432#S1.p2.1), [§II](https://arxiv.org/html/2606.31432#S2.p4.1)\.