Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Summary
This paper proposes a Variance-Aware Reward Framework using GRPO to improve LLM performance on heart-focused medical question answering, achieving significant accuracy and F1 gains on a HealthBench subset.
View Cached Full Text
Cached at: 06/05/26, 08:04 AM
# Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Source: [https://arxiv.org/html/2606.05174](https://arxiv.org/html/2606.05174)
Arash AhmadiSchool of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USAIntelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering \(INQUIRE\) Laboratory, University of Oklahoma, Norman, OK, USAParisa Masnadi KhiabaniData Science and Analytics Institute, University of Oklahoma, Norman, OK, USAData Institute for Societal Challenges \(DISC\), University of Oklahoma, Norman, OK, USASarah SharifSchool of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USAIntelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering \(INQUIRE\) Laboratory, University of Oklahoma, Norman, OK, USACharles NicholsonSchool of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USAData Science and Analytics Institute, University of Oklahoma, Norman, OK, USADavid EbertMike BanadSchool of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USAIntelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering \(INQUIRE\) Laboratory, University of Oklahoma, Norman, OK, USACorrespondence:bana@ou\.edu
###### Abstract
Large Language Models \(LLMs\) have shown strong promise in healthcare applications\. Yet deploying general\-purpose models in real\-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on\-device use\. These challenges motivate the development of smaller, more efficient models that require robust post\-training strategies to ensure reliable medical reasoning\. In this work, we investigate Group Relative Policy Optimization \(GRPO\) for post\-training LLMs on heart\-focused medical question answering with rubric\-based supervision derived from RaR\-Medicine\. We propose a Variance\-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert\-style scoring with continuous analytical reward functions derived from criterion\-level rubric outcomes\. This formulation provides richer optimization signals for feedback that is sparse, multi\-criteria, and difficult to verify automatically, and enables more stable on\-policy reinforcement learning\. On a held\-out heart\-related subset of HealthBench, our best GRPO variant improves accuracy from 0\.362 to 0\.502 and F1 from 0\.532 to 0\.668 relative to the Qwen3\-14B base model, while remaining competitive with GPT\-OSS\-120B \(0\.508 accuracy, 0\.674 F1\)\. Our findings show that carefully designed rubric\-based rewards provide a practical strategy for improving heart\-focused medical question answering in LLMs, with potential to extend to other rubric\-based tasks\.
Artificial intelligence \(AI\) is increasingly shaping both medical research and clinical care, with applications spanning prediction, imaging, and language\-based analysis\. Predictive AI models are being used for risk stratification and outcome forecasting, while deep learning systems have achieved strong performance in image\-based tasks such as skin cancer classification, diabetic retinopathy detection, and breast cancer screening\[[39](https://arxiv.org/html/2606.05174#bib.bib1),[17](https://arxiv.org/html/2606.05174#bib.bib20),[6](https://arxiv.org/html/2606.05174#bib.bib6),[9](https://arxiv.org/html/2606.05174#bib.bib7),[22](https://arxiv.org/html/2606.05174#bib.bib8)\]\. In parallel, natural language processing methods are helping clinicians and researchers extract, organize, and interpret the growing volume of unstructured text in electronic health records and the biomedical literature\[[14](https://arxiv.org/html/2606.05174#bib.bib11),[15](https://arxiv.org/html/2606.05174#bib.bib12),[5](https://arxiv.org/html/2606.05174#bib.bib13),[20](https://arxiv.org/html/2606.05174#bib.bib14)\]\. Together, these advances highlight the expanding role of AI in modern healthcare and medical research, although broader real\-world deployment still depends on reliability, transparency, and the careful handling of sensitive clinical data\[[39](https://arxiv.org/html/2606.05174#bib.bib1),[38](https://arxiv.org/html/2606.05174#bib.bib2)\]\.
Large language models \(LLMs\) extend this trajectory because they can interpret natural language questions and produce free\-form explanations\. This capability suggests applications in patient triage, shared decision\-making, and clinician support\. General\-purpose LLMs still struggle with clinical specificity\. They can generate plausible narratives that omit contraindications, confuse differential diagnoses, or express unwarranted certainty\. Heart\-focused medical question answering illustrates this challenge because symptoms such as chest pain, dyspnea, and palpitations demand conservative guidance, appropriate uncertainty handling, and careful risk assessment\. The clinical importance of this domain is underscored by the Global Burden of Disease Study 2023\[[25](https://arxiv.org/html/2606.05174#bib.bib63)\], which reports that ischaemic heart disease and stroke have consistently ranked as the first and second leading causes of age\-standardised mortality worldwide from 1990 to 2023, with ischaemic heart disease alone remaining the top cause in every year except 2021, when COVID\-19 temporarily displaced it\. This persistent burden makes heart\-related clinical reasoning a high\-priority target for AI\-assisted decision support\.
Earlier clinical natural language processing systems relied on task\-specific pipelines that used entity extraction, retrieval, rule\-based heuristics, or supervised classifiers\. These approaches offer control and auditability, but they require extensive feature design and they often transfer poorly across institutions or note styles\. Recent open medical models such as Med\-Gemma show that domain adaptation can improve factuality and clinical relevance when compared with general baselines\[[30](https://arxiv.org/html/2606.05174#bib.bib16)\]\. CancerGPT showed that task\-adapted LLMs can support few\-shot biomedical inference in data\-limited settings\[[19](https://arxiv.org/html/2606.05174#bib.bib50)\]\. A fine\-tuned lightweight LLM for symptom\-based depression evaluation further demonstrated that clinically aligned adaptation can support symptom\-level assessment rather than only coarse screening\[[45](https://arxiv.org/html/2606.05174#bib.bib54)\]\. Work on lightweight disease\-diagnosis systems also highlights the importance of deployment\-aware design for resource\-constrained clinical environments\[[36](https://arxiv.org/html/2606.05174#bib.bib49)\]\. Taken together, these studies suggest that clinical utility often depends less on using the largest possible model than on tailoring models, data, and objectives to domain structure and deployment constraints\.
Recent medical LLM systems also show that grounding, retrieval, personalization, explainability, and auditable outputs are all essential for deployment\. Clinical entity augmented retrieval improves clinical information extraction by retrieving entity\-centered note spans rather than relying only on general semantic similarity, which can improve relevance while reducing unnecessary context\[[21](https://arxiv.org/html/2606.05174#bib.bib51)\]\. Retrieval\-augmented generation has likewise elevated the quality of local models in radiology contrast\-media consultation, supporting privacy\-preserving deployment when paired with curated knowledge and human oversight\[[41](https://arxiv.org/html/2606.05174#bib.bib58)\]\. EHR\-integrated patient education agents illustrate the promise of personalized LLM support while also underscoring the need for safety controls and clinician supervision in patient\-facing settings\[[13](https://arxiv.org/html/2606.05174#bib.bib57)\]\. KT\-LLM pushes this direction further through evidence\-grounded, policy\-aware, and auditable sequence\-text modeling\[[47](https://arxiv.org/html/2606.05174#bib.bib56)\]\. Holistic AI in medicine similarly emphasizes that explainability and performance should be improved together rather than treated as separate objectives\[[28](https://arxiv.org/html/2606.05174#bib.bib55)\]\. These studies show clear progress toward clinically grounded and trustworthy LLM systems, but they do not directly solve how to optimize a medical assistant against multi\-criterion clinical rubrics during post\-training\.
Domain adaptation often uses supervised fine\-tuning \(SFT\) on curated instruction or dialogue data\. SFT optimizes imitation of reference answers\. It can inherit annotation artifacts and it can encourage memorization of training examples, which is undesirable when the goal is robust clinical reasoning and safe generalization\. SFT also compresses multifactor clinical quality into a single target sequence\. Rubric\-based evaluation makes this limitation explicit because clinical answers must satisfy multiple criteria that cover correctness, safety, completeness, and appropriate communication\. A training method that can optimize directly against such criteria is therefore attractive for high\-stakes clinical assistants\.
Reinforcement learning \(RL\) provides a complementary paradigm because it optimizes behavior with respect to a reward signal rather than a fixed demonstration target\[[37](https://arxiv.org/html/2606.05174#bib.bib36)\]\. Early value\-based algorithms such as Q\-learning established fundamental principles for action\-value estimation\[[44](https://arxiv.org/html/2606.05174#bib.bib37)\]\. Subsequent methods such as SARSA and deep Q\-networks extended RL to stochastic control and high\-dimensional observations\[[43](https://arxiv.org/html/2606.05174#bib.bib39),[23](https://arxiv.org/html/2606.05174#bib.bib38)\]\. Double DQN and its subsequent modifications addressing moving\-target instability reduced overestimation bias and improved stability\[[40](https://arxiv.org/html/2606.05174#bib.bib40),[12](https://arxiv.org/html/2606.05174#bib.bib41)\]\. Policy\-gradient and actor\-critic methods also enabled scalable optimization for large action spaces\. Advantage actor\-critic provided a practical deep RL baseline\[[2](https://arxiv.org/html/2606.05174#bib.bib42)\]\. Trust region and proximal policy optimization improved robustness of policy updates\[[31](https://arxiv.org/html/2606.05174#bib.bib43),[32](https://arxiv.org/html/2606.05174#bib.bib44)\]\. RL has demonstrated that optimization beyond imitation can yield strategies that are not present in supervised targets\. AlphaGo highlighted this capability through self\-play and search\[[35](https://arxiv.org/html/2606.05174#bib.bib45)\]\. AlphaFold extended learning\-based optimization to scientific discovery and showed that structured objectives can unlock capabilities that exceed traditional heuristic pipelines in protein structure prediction\[[16](https://arxiv.org/html/2606.05174#bib.bib5)\]\. More recently, RL post\-training for language models has become a practical route to improve reasoning\. The DeepSeek\-Math and DeepSeek\-R1 projects show that RL can raise the performance of relatively small models on difficult reasoning tasks\[[34](https://arxiv.org/html/2606.05174#bib.bib3),[11](https://arxiv.org/html/2606.05174#bib.bib28)\]\.
Group Relative Policy Optimization \(GRPO\) is an RL algorithm designed for language model post\-training that avoids an explicit value function and uses group\-wise relative advantage estimates\[[34](https://arxiv.org/html/2606.05174#bib.bib3)\]\. This design reduces memory requirements when compared with methods that train a separate critic\. It also fits reward settings where relative ranking is more stable than absolute calibration\. GRPO first showed strong results in mathematical reasoning and has since been extended to a broader set of generative tasks\. Recent work shows that GRPO can improve code generation in underrepresented programming languages by integrating reasoning\-driven feedback into the optimization loop\[[27](https://arxiv.org/html/2606.05174#bib.bib59)\]\. Related GRPO\-based RL has also improved deep reasoning translation, suggesting that reward\-driven post\-training can transfer beyond domains with exact verifiers into more open\-ended generation settings\[[42](https://arxiv.org/html/2606.05174#bib.bib60)\]\. In medical AI, this trend is beginning to appear in multimodal settings\. MedVLM\-R1 applies reinforcement learning to medical vision\-language reasoning, encouraging explicit natural\-language reasoning and improving medical reasoning capability in radiological tasks\[[26](https://arxiv.org/html/2606.05174#bib.bib46)\]\. RARL similarly combines reinforcement learning, LoRA, and LLM\-as\-a\-judge style evaluation to improve medical vision\-language reasoning and generalization under limited data and hardware budgets\[[29](https://arxiv.org/html/2606.05174#bib.bib47)\]\. These studies strengthen the case for applying GRPO\-style post\-training to medical dialogue tasks where correctness is multi\-dimensional, rewards are sparse, and answer quality depends not only on factual accuracy but also on explanation quality, safety, and completeness\.
This paper develops a GRPO\-based framework for a heart\-focused medical assistant that answers free\-form questions while satisfying rubric\-defined clinical criteria\. Our training data come from RaR\-Medicine\. We filter the corpus to queries that are directly related to heart\-related problems through a dedicated classifier that uses either an LLM\-based decision rule or a high\-recall keyword filter\. We augment the filtered subset with synthetic reasoning traces generated by MedGemma\-27B\[[30](https://arxiv.org/html/2606.05174#bib.bib16)\]that encourage explicit intermediate explanations\. We use a structured output format that separates reasoning from final recommendations through dedicated tags, which aligns with prior work on eliciting multi\-step reasoning in language models\[[46](https://arxiv.org/html/2606.05174#bib.bib48)\]\. The model receives an initial supervised stage that teaches this format and stabilizes generation\. The main optimization stage applies GRPO to a disjoint subset of the same heart\-related training pool\. Reward computation follows the structure of HealthBench\-style rubrics\. Each prompt contains positive and negative criteria with associated point values\. Figure[1](https://arxiv.org/html/2606.05174#S1.F1)provides an overview of the full pipeline\.
A large judge model evaluates each criterion independently and returns a binary decision, which follows the broader LLM\-as\-a\-judge direction\[[48](https://arxiv.org/html/2606.05174#bib.bib26)\]and the rubrics\-as\-rewards framework\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]\. This criterion\-wise design is also consistent with recent medical evaluation work showing that expert\-grounded automated verification can scale assessment in specialty QA\[[7](https://arxiv.org/html/2606.05174#bib.bib52)\]and that rubric\-like LLM judging can align strongly with human evaluators in clinical summarization\[[4](https://arxiv.org/html/2606.05174#bib.bib53)\]\. Related medical RL studies such as RARL and broader GRPO\-based work such as DeepTrans also support the use of model\-based judging and structured reward criteria when exact verifiers are unavailable\[[29](https://arxiv.org/html/2606.05174#bib.bib47),[42](https://arxiv.org/html/2606.05174#bib.bib60)\]\. Criterion\-level scoring reduces brittleness that arises when a single model assigns an overall score to an entire response\. The raw rubric scores are then transformed into a scalar reward through a reward shaping function\. We study multiple shaping families that address sparse rewards, preserve strong incentives for fully correct and safe answers, and account for rubric complexity so that prompts with many criteria still provide meaningful gradients\. Training focuses on challenging prompts\. We exclude items with very small rubric sets because they yield trivial rewards and limited learning signals\. The final system adapts a 14B parameter base model with low\-rank adapters and quantized weights, which keeps post\-training feasible on academic hardware and supports privacy\-preserving local deployment\.
Experiments on a held\-out heart\-related subset of HealthBench show that GRPO post\-training improves rubric satisfaction when compared with the same backbone without RL\. The improvements appear across accuracy, F1, recall, and precision\. This work makes four contributions: \(1\) a rubric\-aligned GRPO pipeline for heart\-focused medical question answering that supports explicit reasoning traces and structured outputs; \(2\) a criterion\-wise judging and variance\-aware reward shaping strategy that reduces reward sparsity and improves learning under heterogeneous rubrics; \(3\) a data curation and filtering pipeline that isolates heart\-related questions and produces synthetic reasoning traces for instruction tuning; and \(4\) a comparative evaluation against strong baselines that examines the effect of scaling both the judge and the policy models\.
## 1Results
To evaluate the efficacy of rubric\-guided reinforcement learning for clinical reasoning, we designed a multi\-stage experimental framework focusing on cardiac medicine\. Our analysis progresses from data curation to model optimization and final evaluation\. We first established a specialized dataset of heart\-related inquiries by filtering and restructuring the RaR\-Medicine corpus that ensures high relevance and grading granularity\. Following a supervised initialization to stabilize reasoning formats, we applied Group Relative Policy Optimization \(GRPO\) using three distinct variance\-aware reward mechanisms\. The subsequent sections detail the characteristics of the curated dataset, the training dynamics of the reward functions, and the comparative performance of the post\-trained models on the held\-out HealthBench evaluation set\.
Figure 1:Fig\. 1 — Overview of the heart\-focused GRPO training and evaluation pipeline\.The pipeline begins with RaR\-Medicine data, which is filtered to heart\-related queries and split into supervised fine\-tuning \(SFT\) and reinforcement learning \(GRPO\) subsets\. SFT is first applied to initialize the base model with structured reasoning outputs\. During GRPO, the policy model generates multiple candidate responses per prompt, which are evaluated by an LLM\-based judge against prompt\-specific rubric criteria\. Each criterion is scored independently \(pass/fail\), and the resulting signals are aggregated into a scalar reward using variance\-aware reward functions, including hybrid and complexity\-aware formulations\. This reward is used to update the policy through group\-relative optimization, enabling stable learning from multi\-criteria clinical feedback\.### 1\.1Dataset curation and characteristics
#### 1\.1\.1Training set: RaR\-Medicine with heart\-related filtering
RaR\-Medicine\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]dataset provides training prompts, reference completions, and rubric annotations\. Each example contains a natural\-language question, a reference answer, and a rubric set that defines how a completion is graded\. The rubric fields are stored ascriterion\(text\),points\(signed scalar\), andtitle\(optional metadata\)\. The raw dataset is stored in Parquet splits, and we convert each sample to a JSONL record that contains a chat\-style prompt, a reference completion, and a rubric list in which each element containscriterionandpoints\. This conversion produces a schema that is compatible with rubric\-based evaluation\. Figure[2](https://arxiv.org/html/2606.05174#S1.F2)illustrates the rubric\-based supervision format used in our data pipeline\. For a medical queryqq, a candidate responseyyis evaluated against a prompt\-specific rubric𝒞\(q\)=\{\(cj,pj\)\}j=1m\\mathcal\{C\}\(q\)=\\\{\(c\_\{j\},p\_\{j\}\)\\\}\_\{j=1\}^\{m\}, where each criterioncjc\_\{j\}describes a clinically meaningful desired or undesired behavior andpjp\_\{j\}denotes its signed point value\. The total example\-level score is obtained by summing the points of all satisfied criteria\. In RaR\-Medicine, the reference completion provides the supervised target during fine\-tuning, while the same rubric structure is later used to evaluate generated responses and derive reward signals for reinforcement learning\.
Figure 2:Fig\. 2 — Rubric\-based supervision format used in our datasets\.A medical queryqqis paired with a candidate responseyy, which is then evaluated against a prompt\-specific rubric𝒞\(q\)=\{\(cj,pj\)\}j=1m\\mathcal\{C\}\(q\)=\\\{\(c\_\{j\},p\_\{j\}\)\\\}\_\{j=1\}^\{m\}\. Each satisfied criterion contributes its signed point value, and the total example\-level score is the sum over satisfied criteria\. The example shown is illustrative and is included to clarify how criterion\-level rubric annotations are converted into training and evaluation signals\.We restrict training to heart\-related samples\. A dedicated classifier assigns a binary labelheart\_relatedand auxiliary metadata that includes a theme category and keyword evidence\. The classifier queries an instruction\-following medical model and parses a structured decision from the model output\. We use MedGemma\[[33](https://arxiv.org/html/2606.05174#bib.bib34)\]in this role because it is specialized for medical instruction following and its compact footprint supports large\-scale preprocessing under limited computational resources\. As a result, we included asynthetic\_reasoningfield that contains a model\-generated reasoning trace aligned with the reference answer\.
Table[1](https://arxiv.org/html/2606.05174#S1.T1)shows an illustrative training instance after preprocessing\. The example follows the schema used in the JSONL files and it highlights how rubric criteria encode both required behaviors and prohibited behaviors\.
Table 1:Illustrative example of a rubric\-annotated instance in our training format\. The criteria and weights shown here are representative and are included to clarify the data schema\.
#### 1\.1\.2Train split usage
After filtering, the heart\-related training subset is shuffled with a fixed random seed and split into two disjoint halves\. One half is used for supervised fine\-tuning, and the other half is reserved for reinforcement learning with Group Relative Policy Optimization\. This split eliminates overlap between stages, and it isolates the effect of reward optimization\.
##### Dataset summary statistics\.
Figure[3](https://arxiv.org/html/2606.05174#S1.F3)summarizes the composition of the filtered heart\-related RaR\-Medicine subset, including overall theme frequency and stratification by split and source\. The heart\-related vs\. non\-heart\-related sample balance is shown in Supplementary Fig\. 1, and the question source distribution is presented in Supplementary Fig\. 2\. Additional descriptive statistics on rubric counts \(Supplementary Figs\. 3–4\), question and answer lengths \(Supplementary Figs\. 5–6, 8–9\), and rubric weight distributions \(Supplementary Fig\. 7\) are provided in the Supplementary Information\.
Figure 3:Fig\. 3 — Dataset composition of the filtered heart\-related RaR\-Medicine subset\.\(a\) Distribution of heart\-related themes \(excluding “Other” and themes with fewer than five records\)\. \(b\) Theme counts by dataset split \(train/test\)\. \(c\) Theme distribution stratified by question source\.
#### 1\.1\.3Evaluation set: HealthBench
We evaluate on HealthBench\[[1](https://arxiv.org/html/2606.05174#bib.bib32)\]and treat it as held out\. HealthBench contains 5,000 multi\-turn health conversations with physician\-written rubric criteria created by 262 physicians across 26 medical specialties\. We evaluate on a held\-out, non\-synthetic subset of HealthBench\. This benchmark is well suited for held\-out evaluation because it targets medical question answering and it provides standardized rubrics, while its scope remains computationally manageable for repeated metric computation during development\. The evaluation pipeline filters toheart\_related = YESand computes Accuracy, Precision, Recall, and F1 against available physician\-derived binary labels\. The reported results usen=500n=500heart\-related evaluation examples sampled with seed 42\.
### 1\.2Model performance on the held\-out HealthBench heart subset
Table[2](https://arxiv.org/html/2606.05174#S1.T2)reports the performance of all evaluated models on the held\-out heart\-related subset of HealthBench\. Extended multi\-metric comparisons, radar charts, and performance heatmaps are provided in Supplementary Figs\. 10–13\. Supplementary Video 1 shows the cumulative accuracy of all models as evaluation progresses sample by sample across the 500 held\-out prompts\. Among all systems, Kimi\-K2, which features approximately 1 trillion parameters, achieves the highest overall performance with an accuracy of 0\.570 and F1 score of 0\.726\. GPT\-OSS\-120B follows with an accuracy of 0\.508 and F1 score of 0\.674\.
Notably, our locally trained GRPO\-optimized Qwen3 variants achieve performance on par with the much larger GPT\-OSS\-120B\. The GRPO \(COMPLEXITY\) reward reaches an accuracy of 0\.502 and F1 score of 0\.668, while the GRPO \(HYBRID\) reward yields an accuracy of 0\.498 and F1 score of 0\.665\. These results demonstrate that variance\-aware reward shaping effectively improves model performance compared with the Qwen3\-14B Base model, which achieves an accuracy of 0\.362 and F1 score of 0\.532\. We note that while Kimi\-K2 achieves the highest overall performance, it is an open\-source model whose approximately 1 trillion total parameters far exceed the memory capacity of academic\-grade hardware such as the NVIDIA RTX 6000 PRO, making local training or serving infeasible\. Our GRPO variants, by contrast, are trained and served entirely on a single workstation GPU, demonstrating that rubric\-based RL can close a substantial fraction of the gap to frontier\-scale models under strict hardware constraints\.
Table 2:Table 2 — Model performance comparison on the held\-out heart\-related HealthBench subset \(n=500n=500, seed 42\)\.Models are sorted by Accuracy\. Variance\-aware GRPO reward functions substantially improve the Qwen3\-14B base model, while external frontier models such as Kimi\-K2 and GPT\-OSS\-120B achieve the highest overall performance\.Figure 4:Fig\. 4 — Main performance on the held\-out heart\-related HealthBench subset \(n=500n=500, seed 42\)\.\(a\) Accuracy comparison across all evaluated models with 95% confidence intervals\. Kimi\-K2 achieves the highest accuracy \(0\.570\), followed by GPT\-OSS\-120B and the GRPO\-trained Qwen3 variants\. \(b\) F1\-score comparison showing similar ranking trends to accuracy\. \(c\) Accuracy improvement \(Δ\\DeltaAccuracy\) relative to the Qwen3\-14B Base model\. Variance\-aware GRPO reward functions \(COMPLEXITY and HYBRID\) achieve the largest gains among local deployments\.Table[3](https://arxiv.org/html/2606.05174#S1.T3)reports improvements relative to the Qwen3\-14B Base model\. Figure[4](https://arxiv.org/html/2606.05174#S1.F4)provides a visual summary of benchmark performance, including accuracy with confidence intervals, F1\-score comparison, and relative accuracy improvement, highlighting the strongest gains for the GRPO \(COMPLEXITY\) and GRPO \(HYBRID\) variants among local models\.
Table 3:Table 3 — Performance improvements relative to the Qwen3\-14B Base model \(n=500n=500\)\.Variance\-aware reward functions \(COMPLEXITY and HYBRID\) produce substantially larger gains than the RaR\-based reward strategies\.
### 1\.3Ablations over reward shaping, judge scale, and training stability
Figure 5:Fig\. 5 — Pairwise McNemar significance and average response time\.Left: pairwise McNemar tests for statistical significance of prediction differences; cells report not significant \(n\.s\.\), significant \(\*,p<0\.05p<0\.05\), or highly significant \(\*\*,p<0\.01p<0\.01\)\. Right: mean response time \(in seconds\) for each evaluated model, characterizing the latency dimension of the performance\-deployment tradeoff\. Additional per\-model response time details are provided in Supplementary Fig\. 11\.Figure[5](https://arxiv.org/html/2606.05174#S1.F5)presents pairwise McNemar significance tests alongside mean response times across all evaluated models\. Supplementary Video 2 provides a per\-sample, per\-criterion visualization comparing the Base, GRPO \(Complexity\), GPT\-OSS\-120B, and MedGemma\-27B models which shows how each criterion is satisfied or missed across evaluation prompts\.
We additionally trained two GRPO variants using the reward aggregation strategies proposed in the original Rubrics as Rewards \(RaR\) framework\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]: RaR\-Explicit, which independently evaluates each rubric criterion via an LLM judge and aggregates binary satisfaction signals through a normalized weighted sum with fixed categorical weights \(Essential: 1\.0,Important: 0\.7,Optional: 0\.3,Pitfall: 0\.9\), and RaR\-Implicit, which passes all criteria holistically to the judge and elicits a single Likert score normalized to\[0,1\]\[0,1\]\. Both variants produced only modest improvements over the Qwen3\-14B base model—\+9\.4% and \+13\.8% relative accuracy, respectively far below the \+38\.7% and \+37\.6% gains achieved by the Complexity and Hybrid rewards \(Table[3](https://arxiv.org/html/2606.05174#S1.T3)\)\. The difference is statistically significant: pairwise McNemar tests confirm that both our variance\-aware rewards outperform RaR\-Explicit \(p<10−5p<10^\{\-5\}\) and RaR\-Implicit \(p<10−3p<10^\{\-3\}\)\.
We attribute the limited effectiveness of the RaR aggregation strategies in our setting to three factors\. First, the explicit aggregation relies on rigid, hand\-tuned categorical weights that impose a fixed importance hierarchy across all prompts; this one\-size\-fits\-all mapping cannot adapt to the heterogeneous complexity of cardiac medicine queries, where the relative salience of individual criteria varies substantially by clinical context\. Second, the implicit aggregation delegates the entire scoring decision to a single holistic LLM judgment, collapsing the multi\-dimensional rubric information into a coarse Likert score that discards granular criterion\-level signal and introduces judge\-level variance\. Third, neither RaR strategy accounts for rubric complexity, treating a prompt with five criteria identically to one with eighteen\. A base model can often satisfy all criteria on simple rubrics approximately out of the box, so these easy prompts contribute little discriminative training signal\. The harder and more informative case arises when rubric sets are large: satisfying seventeen out of eighteen criteria on a complex prompt represents a substantially greater achievement than a perfect score on a five\-criterion prompt, yet both RaR aggregation strategies assign comparable normalized rewards to both outcomes\. Our variance\-aware reward functions address this asymmetry explicitly\. The Complexity\-aware variant applies a logarithmic bonus that scales with rubric size, so that high satisfaction on demanding prompts produces a stronger reward signal and therefore a larger policy gradient\. This design amplifies learning from precisely the prompts where the base model struggles most, converting partial\-credit differences on complex rubrics into high\-value training signal that neither the fixed\-weight explicit aggregation nor the holistic implicit scoring can provide\.
Figure[6](https://arxiv.org/html/2606.05174#S1.F6)shows the mean LLM\-judge reward over 1000 GRPO training rounds for the Hybrid and Complexity reward functions\. Both curves display substantial per\-step variance, which is expected and structurally beneficial in GRPO: group\-relative advantage estimation normalizes rewards within each sampled group by subtracting the group mean and dividing by the group standard deviation\[[34](https://arxiv.org/html/2606.05174#bib.bib3)\], so that reward spread across completions is converted into differential advantage signals that drive policy improvement\. The exponential moving average \(EMA\) and the linear trend confirm steady reward improvement throughout training\. The±1σ\\pm 1\\sigmaband illustrates the per\-step reward variance that arises from stochastic prompt sampling and group\-level generation\. Notably, the Complexity reward operates in a higher absolute range, consistent with its rubric\-size scaling described above, while the Hybrid reward shows a tighter variance envelope\.
Figure 6:Fig\. 6 — Training reward dynamics for the Hybrid and Complexity reward functions\.Mean LLM\-judge reward over 1000 GRPO training rounds\. Light lines show the raw per\-step reward; the bold curve shows an exponential moving average \(EMA, span = 50\); the shaded band indicates±1σ\\pm 1\\sigmaaround the EMA; the dashed line shows the linear trend\. Both reward functions show steady improvement over the course of training\.
## 2Discussion
The transition from Supervised Fine\-Tuning to Reinforcement Learning in healthcare is often blocked by the difficulty of defining “correctness\.” Unlike similar reinforcement learning tasks in games like AlphaGo\[[8](https://arxiv.org/html/2606.05174#bib.bib65)\]or protein folding algorithms like AlphaFold\[[16](https://arxiv.org/html/2606.05174#bib.bib5)\], clinical diagnosis lacks a simulator\. This work highlights a subtle but critical failure mode when adapting techniques like Rubrics as Rewards \(RaR\)\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]to algorithms like GRPO: optimization algorithms relying on batch normalization are intolerant of sparse, binary rewards\.
Our findings echo challenges reported in recent digital medicine literature\[[38](https://arxiv.org/html/2606.05174#bib.bib2)\], where generalist models fail to capture nuance\. By implementing a soft reward signal that acknowledges partial correctness \(e\.g\., meeting 17 out of 18 criteria\), we not only stabilize training but also align the model’s incentives with the iterative nature of clinical reasoning\.
##### Role of supervised fine\-tuning in the training pipeline\.
Our pipeline applies supervised fine\-tuning \(SFT\) before GRPO\. The purpose of SFT is to serve as a format warm start: it teaches the model to emit the required reasoning and answer tags reliably, which is a prerequisite for downstream reward computation\. The base Qwen3\-14B model does not reliably produce the structured output format required for rubric evaluation without this initial stage\. Once the model can produce well\-formed outputs, GRPO introduces rubric\-aligned optimization that directly maximizes criterion satisfaction\. The performance gains reported in Table[2](https://arxiv.org/html/2606.05174#S1.T2)therefore reflect end\-to\-end pipeline effectiveness from base model through both training stages\.
##### Judge design and validation considerations\.
The LLM judge in our pipeline operates at the individual criterion level: for each rubric criterion, it receives the prompt context and the model completion and returns a binary “present” decision with a short justification\. This design is substantially narrower than holistic scoring setups that ask a single model to assign an overall quality grade to an entire response\. The judge uses low temperature, JSON\-constrained output, and a retry policy, all of which reduce noise and improve reproducibility\. The original Rubrics as Rewards framework\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]demonstrated that GPT\-4o\-mini can perform this criterion\-level matching task effectively; our pipeline uses GPT\-OSS\-120B as the judge model, which ranks higher on the Chatbot Arena leaderboard\[[3](https://arxiv.org/html/2606.05174#bib.bib64)\]\(Elo 1354 vs\. 1317 for GPT\-4o\-mini\), providing stronger overall capability for nuanced criteria\.
##### Deployment considerations and infrastructure\.
The final adapted policy model is designed for local deployment\. The 14B parameter model with 4\-bit quantization and LoRA adapters can be served on a single workstation GPU such as the NVIDIA RTX 6000 PRO, which supports privacy\-preserving inference without transmitting patient data to external services\. However, the training and evaluation pipeline uses a Groq\-hosted inference backend to serve the judge model \(GPT\-OSS\-120B\), which reduces criterion\-level judging latency from impractical to manageable levels in an academic setting\. We note that GPT\-OSS\-120B is an open\-source model that can in principle be deployed locally on sufficiently large hardware; the use of Groq is motivated purely by inference speed and academic resource constraints, not by model access restrictions\.
##### Scope and clinical framing\.
This work targets heart\-focused medical question answering as a clinically motivated testbed for rubric\-aligned reinforcement learning\. The heart\-focused scope is intentional: cardiovascular disease represents the leading cause of death globally\[[25](https://arxiv.org/html/2606.05174#bib.bib63)\], and the associated clinical reasoning demands conservative guidance, appropriate uncertainty handling, and careful risk assessment\. Recent work innpj Digital Medicinehas shown that narrow clinical scope is productive for evaluating AI systems, with published studies focusing on radiology contrast\-media consultation\[[41](https://arxiv.org/html/2606.05174#bib.bib58)\], symptom\-based depression scoring\[[45](https://arxiv.org/html/2606.05174#bib.bib54)\], and prostate\-cancer patient education\[[13](https://arxiv.org/html/2606.05174#bib.bib57)\]\. Our contribution is methodological: we demonstrate that variance\-aware reward shaping improves rubric satisfaction for a clinically grounded task, and we expect the reward design principles to transfer to other rubric\-based medical QA domains\.
##### Limitations and future work\.
Several directions remain for future investigation\. First, the current evaluation relies on automated rubric\-based judging; incorporating direct physician review of model outputs in a prospective setting could be interesting to explore\. Second, extending the pipeline beyond heart\-related medical QA to other clinical domains would help establish the generalizability of variance\-aware reward shaping\. Finally, while the policy model supports local deployment, reducing the computational cost of criterion\-level judging during training would improve accessibility for resource\-constrained research groups\.
## 3Methods
This section describes the datasets, preprocessing steps, model configuration, training procedure, and reward design used to post\-train a heart\-focused medical assistant with rubric\-based supervision\. The pipeline begins with converting rubric\-annotated medical prompts into a unified chat format, continues with a short supervised phase that stabilizes the response structure, and ends with reinforcement learning that optimizes a continuous variance\-aware reward derived from criterion\-level rubric judgments\.
### 3\.1Terminology and notation
A*prompt*is a user query and is denoted byq∈𝒬q\\in\\mathcal\{Q\}, where𝒬\\mathcal\{Q\}is the set of prompts\. A*completion*is the sequence of tokens generated by a language model in response toqqand is denoted byo=\(o1,…,o\|o\|\)o=\(o\_\{1\},\\dots,o\_\{\|o\|\}\), where eachoto\_\{t\}is a token from a tokenizer vocabulary𝒱\\mathcal\{V\}and\|o\|\|o\|is the completion length\.
A*policy model*is the language model interpreted as a stochastic decision rule over tokens\. The policy model is parameterized byθ\\thetaand defines a conditional distribution over completions through an autoregressive factorization,
πθ\(o∣q\)=∏t=1\|o\|πθ\(ot∣q,o<t\),\\pi\_\{\\theta\}\(o\\mid q\)=\\prod\_\{t=1\}^\{\|o\|\}\\pi\_\{\\theta\}\(o\_\{t\}\\mid q,o\_\{<t\}\),\(1\)whereo<t=\(o1,…,ot−1\)o\_\{<t\}=\(o\_\{1\},\\dots,o\_\{t\-1\}\)\.
Each promptqqis paired with a rubric set𝒞\(q\)=\{\(ck,wk\)\}k=1C\\mathcal\{C\}\(q\)=\\\{\(c\_\{k\},w\_\{k\}\)\\\}\_\{k=1\}^\{C\}, whereckc\_\{k\}is a natural\-language criterion,wk∈ℤw\_\{k\}\\in\\mathbb\{Z\}is its point value, andCCis the number of criteria\. Positive weights encode desirable properties that the completion should satisfy\. Negative weights encode undesirable properties that the completion should avoid\. A*judge model*evaluates a completion against each criterion and returns a binary decision\. A*reward function*maps the criterion\-level decisions into a scalar rewardr∈ℝr\\in\\mathbb\{R\}\.
### 3\.2Model and output format with parameter\-efficient adaptation
#### 3\.2\.1Base model and parameter\-efficient adaptation
The policy is initialized from Qwen3\-14B\-Base\. Training uses 4\-bit quantization to reduce memory usage and applies Low\-Rank Adaptation to a subset of attention and feed\-forward projection matrices\. For a weight matrixW0∈ℝdout×dinW\_\{0\}\\in\\mathbb\{R\}^\{d\_\{\\mathrm\{out\}\}\\times d\_\{\\mathrm\{in\}\}\}, Low\-Rank Adaptation parameterizes the adapted weight as
W=W0\+ΔW,ΔW=BA,W=W\_\{0\}\+\\Delta W,\\qquad\\Delta W=BA,\(2\)whereA∈ℝr×dinA\\in\\mathbb\{R\}^\{r\\times d\_\{\\mathrm\{in\}\}\},B∈ℝdout×rB\\in\\mathbb\{R\}^\{d\_\{\\mathrm\{out\}\}\\times r\}, andrris the Low\-Rank Adaptation rank\. The implementation setsr=16r=16and uses a Low\-Rank Adaptation scaling factorαLoRA=2r\\alpha\_\{\\mathrm\{LoRA\}\}=2r\. Adaptation is applied to the attention projections\{qproj,kproj,vproj,oproj\}\\\{q\_\{\\mathrm\{proj\}\},k\_\{\\mathrm\{proj\}\},v\_\{\\mathrm\{proj\}\},o\_\{\\mathrm\{proj\}\}\\\}and the feed\-forward projections\{gateproj,upproj,downproj\}\\\{\\mathrm\{gate\}\_\{\\mathrm\{proj\}\},\\mathrm\{up\}\_\{\\mathrm\{proj\}\},\\mathrm\{down\}\_\{\\mathrm\{proj\}\}\\\}\.
#### 3\.2\.2Response structure and chat template
The assistant response follows a two\-part format that separates a reasoning trace from the final answer\. The reasoning segment is delimited by<start\_working\_out\>and<end\_working\_out\>\. The solution segment is delimited by<SOLUTION\>and</SOLUTION\>\. A fixed system message instructs the model to produce outputs in this format, and the tokenizer chat template begins generation at the reasoning start marker\. This formatting is introduced during supervised fine\-tuning and preserved during reinforcement learning\.
### 3\.3Training procedure
#### 3\.3\.1Supervised Fine\-Tuning warm start
Supervised fine\-tuning provides a format warm start for reinforcement learning\. The base model does not reliably emit the required tags before training, which makes downstream parsing and reward computation ambiguous\. This stage teaches the model to place its reasoning trace and its final answer in the prescribed locations, and it establishes a stable response structure that is preserved during reinforcement learning\.
Each supervised example is converted into a chat transcript with a system message, a user message, and an assistant message\. The assistant message concatenates a reasoning trace and the reference completion inside the required tags\. The learning objective maximizes the likelihood of the reference completion under the policy model, conditioned on the prompt,
ℒSFT\(θ\)=−𝔼\(q,o⋆\)\[∑t=1\|o⋆\|logπθ\(ot⋆∣q,o<t⋆\)\]\.\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}\(\\theta\)=\-\\mathbb\{E\}\_\{\(q,o^\{\\star\}\)\}\\left\[\\sum\_\{t=1\}^\{\|o^\{\\star\}\|\}\\log\\pi\_\{\\theta\}\(o^\{\\star\}\_\{t\}\\mid q,o^\{\\star\}\_\{<t\}\)\\right\]\.\(3\)
Loss computation is restricted to the assistant response tokens\. Tokens that belong to the system and user messages receive a mask value of−100\-100so that they do not contribute to the gradient\. The supervised dataset is filtered by transcript length: examples whose tokenized transcripts exceed 90% of the maximum sequence length are dropped\.
#### 3\.3\.2Group Relative Policy Optimization post\-training
Reinforcement learning post\-training uses Group Relative Policy Optimization\[[34](https://arxiv.org/html/2606.05174#bib.bib3)\]\. GRPO is particularly suitable for rubric\-based medical question answering because it avoids training a separate value network and instead uses group\-wise relative rewards to estimate advantages\. A key implication of this design is that learning depends on within\-group reward variation: if all sampled completions for a prompt receive the same reward, the normalized advantages collapse toward zero and the policy receives little or no learning signal\. An effective reward function for GRPO should therefore preserve partial\-credit information and produce non\-trivial dispersion across completions of different quality\.
For each promptqq, the old policyπθold\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}samples a group ofGGoutputs\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\. The objective maximized by Group Relative Policy Optimization is
𝒥GRPO\(θ\)\\displaystyle\\mathcal\{J\}\_\{\\mathrm\{GRPO\}\}\(\\theta\)=𝔼q∼P\(Q\),\{oi\}i=1G∼πθold\(O∣q\)\[1G∑i=1G1\|oi\|∑t=1\|oi\|\(min\[ρi,t\(θ\)A^i,t,clip\(ρi,t\(θ\),1−ϵ,1\+ϵ\)A^i,t\]\\displaystyle=\\mathbb\{E\}\_\{q\\sim P\(Q\),\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(O\\mid q\)\}\\Bigg\[\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\frac\{1\}\{\|o\_\{i\}\|\}\\sum\_\{t=1\}^\{\|o\_\{i\}\|\}\\Bigg\(\\min\\Big\[\\rho\_\{i,t\}\(\\theta\)\\,\\widehat\{A\}\_\{i,t\},\\operatorname\{clip\}\\\!\\left\(\\rho\_\{i,t\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\\right\)\\widehat\{A\}\_\{i,t\}\\Big\]\(4\)−βKL𝔻KL\[πθ∥πref\]\)\],\\displaystyle\\hskip 70\.0pt\-\\beta\_\{\\mathrm\{KL\}\}\\,\\mathbb\{D\}\_\{\\mathrm\{KL\}\}\\\!\\left\[\\pi\_\{\\theta\}\\,\\\|\\,\\pi\_\{\\mathrm\{ref\}\}\\right\]\\Bigg\)\\Bigg\],whereϵ\>0\\epsilon\>0is the clipping parameter andβKL≥0\\beta\_\{\\mathrm\{KL\}\}\\geq 0is the coefficient of the Kullbac\-Leibler regularization term\[[18](https://arxiv.org/html/2606.05174#bib.bib33)\]\. The likelihood ratio is
ρi,t\(θ\)=πθ\(oi,t∣q,oi,<t\)πθold\(oi,t∣q,oi,<t\)\.\\rho\_\{i,t\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\(o\_\{i,t\}\\mid q,o\_\{i,<t\}\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(o\_\{i,t\}\\mid q,o\_\{i,<t\}\)\}\.\(5\)
The implementation uses outcome\-level supervision, so the advantage is constant across tokens within a completion:
A^i,t=r~ifor allt∈\{1,…,\|oi\|\}\.\\widehat\{A\}\_\{i,t\}=\\widetilde\{r\}\_\{i\}\\quad\\text\{for all \}t\\in\\\{1,\\dots,\|o\_\{i\}\|\\\}\.\(6\)The normalized group rewardr~i\\widetilde\{r\}\_\{i\}is computed from the raw rewards\{ri\}i=1G\\\{r\_\{i\}\\\}\_\{i=1\}^\{G\}as
r~i=ri−mean\(\{rj\}j=1G\)std\(\{rj\}j=1G\)\.\\widetilde\{r\}\_\{i\}=\\frac\{r\_\{i\}\-\\operatorname\{mean\}\(\\\{r\_\{j\}\\\}\_\{j=1\}^\{G\}\)\}\{\\operatorname\{std\}\(\\\{r\_\{j\}\\\}\_\{j=1\}^\{G\}\)\}\.\(7\)
The reinforcement learning dataset is filtered by rubric count and prompt length\. The implementation removes prompts whose rubric set contains fewer than four criteria, since short rubrics often yield trivial reward patterns and weaker group\-wise discrimination\. The implementation also computes the 90th percentile of prompt token lengths and retains prompts at or below this threshold\. The experiments useG=6G=6sampled completions per prompt and a maximum completion length of 1024 tokens\.
### 3\.4Rubric\-based reward computation
#### 3\.4\.1Reward design principles
The reward design is guided by the optimization requirements of GRPO\. Because GRPO normalizes rewards within a sampled group, a useful reward function should satisfy four properties\. First, it should produce non\-zero variance whenever completions differ in quality\. Second, it should be monotonic, so that better rubric performance receives higher reward\. Third, it should preserve information from rubric evaluation rather than collapse partial credit into a binary outcome\. Fourth, it should be complexity\-aware, since prompts with more criteria typically represent harder evaluation problems than prompts with only a few rubric items\.
#### 3\.4\.2Criterion\-level judging
Reward computation evaluates each rubric criterion independently with a separate judge model\. The judge model receives the prompt context and the model completion, and it returns a structured JSON decision with a binary fieldpresentand a short justification\. The judge is queried once per criterion\.
Letmk∈\{0,1\}m\_\{k\}\\in\\\{0,1\\\}indicate whether the judge marks criterionckc\_\{k\}as present in the completion\. The implementation aggregates positive and negative contributions separately\. The achieved positive score and the maximum possible positive score are
s\+=∑k:wk\>0wkmk,smax\+=∑k:wk\>0wk\.s^\{\+\}=\\sum\_\{k:w\_\{k\}\>0\}w\_\{k\}m\_\{k\},\\qquad s^\{\+\}\_\{\\max\}=\\sum\_\{k:w\_\{k\}\>0\}w\_\{k\}\.\(8\)The achieved negative magnitude and the maximum possible negative magnitude are
s−=∑k:wk<0\|wk\|mk,smax−=∑k:wk<0\|wk\|\.s^\{\-\}=\\sum\_\{k:w\_\{k\}<0\}\|w\_\{k\}\|m\_\{k\},\\qquad s^\{\-\}\_\{\\max\}=\\sum\_\{k:w\_\{k\}<0\}\|w\_\{k\}\|\.\(9\)
We define the normalized positive score and normalized negative ratio as
snorm=s\+max\(smax\+,1\),ρ=\{s−max\(smax−,1\)ifsmax−\>0,0ifsmax−=0\.s\_\{\\mathrm\{norm\}\}=\\frac\{s^\{\+\}\}\{\\max\(s^\{\+\}\_\{\\max\},1\)\},\\qquad\\rho=\\begin\{cases\}\\frac\{s^\{\-\}\}\{\\max\(s^\{\-\}\_\{\\max\},1\)\}&\\text\{if \}s^\{\-\}\_\{\\max\}\>0,\\\\ 0&\\text\{if \}s^\{\-\}\_\{\\max\}=0\.\\end\{cases\}\(10\)We also define two indicators for exact positive satisfaction and zero negative violations:
𝕀allpos=𝕀\[s\+≥smax\+\],𝕀noneg=𝕀\[s−=0\]\.\\mathbb\{I\}\_\{\\mathrm\{all\\ pos\}\}=\\mathbb\{I\}\[s^\{\+\}\\geq s^\{\+\}\_\{\\max\}\],\\qquad\\mathbb\{I\}\_\{\\mathrm\{no\\ neg\}\}=\\mathbb\{I\}\[s^\{\-\}=0\]\.\(11\)
#### 3\.4\.3General reward formulation
Letnc=\|𝒞\(q\)\|n\_\{c\}=\|\\mathcal\{C\}\(q\)\|denote the number of rubric criteria for promptqq, and letnmaxn\_\{\\max\}be the maximum rubric count in the training set\. In our data,nmax=25n\_\{\\max\}=25\. We define a general parametric reward family
r=rbase⋅s^α⋅\(1\+β⋅log\(1\+nc\)log\(1\+nmax\)\),r=r\_\{\\mathrm\{base\}\}\\cdot\\hat\{s\}^\{\\,\\alpha\}\\cdot\\left\(1\+\\beta\\cdot\\frac\{\\log\(1\+n\_\{c\}\)\}\{\\log\(1\+n\_\{\\max\}\)\}\\right\),\(12\)whererbaser\_\{\\mathrm\{base\}\}is a base reward scale,α\>0\\alpha\>0controls the curvature of the reward function, andβ≥0\\beta\\geq 0controls the strength of the complexity bonus\.
The effective scores^\\hat\{s\}incorporates both positive credit and negative\-criteria penalties:
s^=max\(0,snorm−λρ\),\\hat\{s\}=\\max\\\!\\left\(0,\\;s\_\{\\mathrm\{norm\}\}\-\\lambda\\rho\\right\),\(13\)whereλ≥0\\lambda\\geq 0is the penalty coefficient for negative criteria\. This formulation preserves partial credit, penalizes unsafe or undesirable content, and ensures that low\-quality completions do not receive inflated rewards\.
#### 3\.4\.4Hyperparameter selection and justification
The reward hyperparameters were selected as theory\-informed design constants rather than learned parameters\. Their purpose is to balance four requirements imposed by GRPO training: non\-trivial reward variance within each sampled group, monotonicity with respect to rubric quality, preservation of partial\-credit information, and modest awareness of rubric complexity\.
Both reward variants use a base reward scale of 20\. This choice keeps rewards in a numerically stable and interpretable range for GRPO while matching the scale used throughout the training implementation\. Within that shared range, the Hybrid reward allocates 15 points to a continuous base term and 5 points to a perfection bonus\. The 15\-point base ensures that partially correct responses still receive informative gradients, whereas the 5\-point bonus creates a clear but not overwhelming incentive for completions that satisfy all positive criteria and avoid all negative criteria\.
For the Hybrid penalty term, the code applies a maximum subtraction equal to 30% of the 15\-point base, which yields the coefficient 4\.5\. This penalty is strong enough to discourage unsafe or incomplete answers, but not so aggressive that most partially correct completions collapse to zero reward\. In the Complexity\-aware reward, the exponentα=1\.2\\alpha=1\.2was chosen to mildly sharpen preference for high\-quality completions relative to a linear mapping without making the reward too sparse for intermediate outputs\. This exponent therefore encourages movement from good to excellent responses while preserving useful gradients for partial progress\.
The complexity coefficientβ=0\.2\\beta=0\.2provides only a modest logarithmic bonus as the rubric size increases\. This acknowledges that prompts with more criteria are typically harder, but it keeps answer quality as the dominant signal\. The negative penalty strengthλ=0\.5\\lambda=0\.5reduces the effective score before exponentiation, so harmful outputs are penalized both directly and through the nonlinear transform, increasing separation among low\- and mid\-quality completions\. Finally, the criterion\-count normalization usesnmax=25n\_\{\\max\}=25, matching the rubric scale assumed by the implementation, so that complexity adjustments remain bounded and comparable across prompts\.
#### 3\.4\.5Reward variants
We implement and compare two reward variants derived from the general formulation\.
##### Complexity\-aware reward\.
The first variant directly instantiates Eq\.[12](https://arxiv.org/html/2606.05174#S3.E12)withrbase=20r\_\{\\mathrm\{base\}\}=20,α=1\.2\\alpha=1\.2,β=0\.2\\beta=0\.2,λ=0\.5\\lambda=0\.5, andnmax=25n\_\{\\max\}=25:
rcomplexity=20s^1\.2\(1\+0\.2log\(1\+nc\)log\(26\)\),r\_\{\\mathrm\{complexity\}\}=20\\,\\hat\{s\}^\{\\,1\.2\}\\left\(1\+0\.2\\frac\{\\log\(1\+n\_\{c\}\)\}\{\\log\(26\)\}\\right\),\(14\)where
s^=max\(0,snorm−0\.5ρ\)\.\\hat\{s\}=\\max\\\!\\left\(0,\\;s\_\{\\mathrm\{norm\}\}\-0\.5\\rho\\right\)\.\(15\)This reward emphasizes high scores through the power transform, incorporates an explicit bonus for prompts with larger rubric sets, and applies the negative\-criteria penalty before the nonlinear transformation so that harmful outputs are penalized more strongly\. In practice, the range is approximately\[0,25\]\[0,25\], with slight overshoot possible for highly complex prompts that achieve very strong rubric performance\.
##### Hybrid reward\.
The second variant separates reward into a continuous base component and a discrete perfection bonus:
rhybrid=max\(0,Bsnorm−0\.3Bρ\)\+P𝕀\[𝕀allpos=1∧𝕀noneg=1\],r\_\{\\mathrm\{hybrid\}\}=\\max\\\!\\left\(0,\\;B\\,s\_\{\\mathrm\{norm\}\}\-0\.3B\\,\\rho\\right\)\+P\\,\\mathbb\{I\}\\\!\\left\[\\mathbb\{I\}\_\{\\mathrm\{all\\ pos\}\}=1\\ \\wedge\\ \\mathbb\{I\}\_\{\\mathrm\{no\\ neg\}\}=1\\right\],\(16\)whereB=15B=15is the base component andP=5P=5is the perfection bonus\. This construction yields a linear, interpretable reward for partial success while creating a clear jump at complete positive satisfaction with no negative violations\. The proportional negative penalty reduces reward for harmful or incomplete responses without automatically forcing the reward to zero\. The output range is\[0,20\]\[0,20\]\.
The two variants emphasize different aspects of learning\. The Complexity\-aware reward is better aligned with prompts whose rubric sets are large and heterogeneous, while the Hybrid reward provides a simpler continuous signal together with a strong discrete incentive for flawless responses\. Both preserve partial credit and both avoid the information loss that would arise from binary pass/fail reward assignment\.
#### 3\.4\.6GRPO compatibility of the reward functions
Both proposed reward variants are designed to remain compatible with GRPO\. First, they produce continuous outputs, which makes non\-zero reward variance much more likely within a sampled group\. Second, they are monotonically non\-decreasing in rubric quality: higher positive scores and fewer negative violations lead to larger rewards\. Third, they preserve information from criterion\-level judgments rather than collapsing rubric outcomes into a single binary label\. Fourth, the Complexity\-aware variant explicitly incorporates rubric size through the logarithmic complexity bonus, while the Hybrid variant retains difficulty information indirectly through the normalized criterion\-level scores\. These properties make the reward signal more informative for policy optimization than sparse binary aggregation\.
### 3\.5Implementation details
All experiments use a maximum sequence length of 4096 tokens\. The base model is loaded with 4\-bit quantization, and training updates only Low\-Rank Adaptation parameters with rankr=16r=16on the attention projections and the feed\-forward projections\. The implementation uses gradient checkpointing to reduce activation memory and fixes the random seed to 3407 for data shuffling and sampling\.
##### Hardware and GPU utilization\.
All training runs are executed on a single NVIDIA RTX 6000 PRO \(Blackwell Workstation Edition\) GPU, which has a maximum power consumption of 600 W\. Figure[7](https://arxiv.org/html/2606.05174#S3.F7)shows the GPU power draw over the duration of each GRPO training run\. Both the Hybrid and Complexity reward runs sustain approximately 200–300 W on average during active training, with brief spikes reaching the full 600 W envelope during peak backward\-pass computation\. Each reward\-function training run takes approximately 26 hours\. The substantial time per run is driven by criterion\-level judging: each prompt contains dozens of rubric criteria, each requiring an independent LLM judge call, which dominates wall\-clock time even though the model update itself is lightweight\.
##### Inference backend for the judge model\.
To mitigate the latency bottleneck of criterion\-level judging, we use the Groq inference platform\[[24](https://arxiv.org/html/2606.05174#bib.bib62)\]to serve the judge model\. Groq provides substantially lower per\-call latency than local GPU inference, which is critical because each training step requires multiple independent judge evaluations per completion per criterion\. Without a fast inference backend, the cumulative judge latency would extend each 26\-hour training run by an impractical margin\. Using Groq also frees the local GPU entirely for policy model training and vLLM\-based generation\. We note that the choice of Groq is motivated purely by inference speed for the judge; models with very large parameter counts, such as Kimi\-K2 with approximately 1 trillion total parameters, cannot be served locally on the RTX 6000 PRO and are evaluated through their respective API endpoints\.
Figure 7:Fig\. 7 — GPU power consumption during GRPO training on the NVIDIA RTX 6000 PRO\.GPU power draw \(in watts\) over the full training duration for the Hybrid \(left\) and Complexity \(right\) reward functions\. Light lines show raw 15\-second power samples; the bold curve shows an exponential moving average \(EMA, span = 30\)\. The NVIDIA RTX 6000 PRO \(Blackwell Workstation Edition\) has a maximum rated power of 600 W; both runs sustain an average draw of 200–300 W, with periodic spikes during compute\-intensive backward passes\.Supervised fine\-tuning uses a per\-device batch size of 4 and gradient accumulation of 4, which yields an effective batch size of 16\. Optimization uses AdamW in 8\-bit mode with learning rate2×10−42\\times 10^\{\-4\}, weight decay10−310^\{\-3\}, and a linear learning rate schedule with 5 warmup steps\. Training runs for two epochs with a cap of 500 update steps, and it logs every step\.
Reinforcement learning post\-training uses one prompt per update step and samplesG=6G=6completions per prompt\. Optimization uses AdamW in 8\-bit mode with learning rate5×10−65\\times 10^\{\-6\}, weight decay10−210^\{\-2\}, and a linear learning rate schedule with warmup ratio 0\.1\. Sampling uses temperature 1\.0 and a decoding configuration that setsmin\_p=0\.1\\texttt\{min\\\_p\}=0\.1,top\_p=1\.0\\texttt\{top\\\_p\}=1\.0, andtop\_k=−1\\texttt\{top\\\_k\}=\-1\. The maximum completion length is 1024 tokens\. The training loop saves Low\-Rank Adaptation checkpoints at fixed step intervals so that reward variants can be compared under matched conditions\.
### 3\.6Evaluation methodology and metrics
We evaluate on HealthBench and treat it as held out\. HealthBench contains medical prompts and clinician\-derived quality signals and supports rubric\-based evaluation by associating prompts with explicit grading criteria\. In our pipeline, we merge the HealthBench prompt file with a separate rubric file usingprompt\_id\. We filter the evaluation set to heart\-related prompts and compute Accuracy, Precision, Recall, and F1 with respect to available physician\-provided binary labels\. All reported results are computed onn=500n=500heart\-related evaluation examples sampled with a fixed random seed of 42\.
### 3\.7Ethical considerations
This study uses publicly available benchmark datasets \(RaR\-Medicine and HealthBench\) that do not contain identifiable patient information\. RaR\-Medicine provides synthetic and curated medical question\-answer pairs with rubric annotations\. HealthBench provides physician\-authored rubrics grounded in realistic but synthetically generated health conversations\[[1](https://arxiv.org/html/2606.05174#bib.bib32)\]\. No human subjects were recruited, no identifiable patient data were collected or processed, and no clinical interventions were performed as part of this research\. The study therefore did not require Institutional Review Board \(IRB\) approval or informed consent\. All model outputs are intended for research evaluation and are not designed for direct clinical use without further validation and clinician oversight\.
## Data availability
The training data are derived from the publicly available RaR\-Medicine dataset\[[10](https://arxiv.org/html/2606.05174#bib.bib35)\]\. The evaluation data are derived from the publicly available HealthBench benchmark\[[1](https://arxiv.org/html/2606.05174#bib.bib32)\]\. The heart\-related filtered dataset, processed training splits, and evaluation configurations are available at[https://github\.com/INQUIRELAB/variance\-aware\-rubric\-rewards\-grpo](https://github.com/INQUIRELAB/variance-aware-rubric-rewards-grpo)\.
## Code availability
The training, evaluation, and reward computation code is available at[https://github\.com/INQUIRELAB/variance\-aware\-rubric\-rewards\-grpo](https://github.com/INQUIRELAB/variance-aware-rubric-rewards-grpo)\. Two supplementary videos are also provided in the repository: Supplementary Video 1 animates the cumulative accuracy of all evaluated models across the 500 held\-out heart\-related HealthBench samples, and Supplementary Video 2 visualizes per\-criterion satisfaction for the Base, GRPO \(Complexity\), GPT\-OSS\-120B, and MedGemma\-27B models on each evaluation prompt\.
## Acknowledgements
The authors acknowledge the use of the Groq inference platform for serving the judge model during training and evaluation, and the open\-source Unsloth framework for parameter\-efficient fine\-tuning\. Computational resources were provided by the Inquire Lab at the University of Oklahoma\.
## Author contributions
All authors contributed to the concept and outline of the manuscript\. A\.A\. and P\.M\. drafted the paper\. All authors participated in revising the manuscript and approved the completed version\. A\.A\. and P\.M\. are co\-first authors and contributed equally\.
## Competing interests
The authors declare no competing interests\.
## Funding declaration
Not applicable\.
## Additional information
## References
- \[1\]R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel,et al\.\(2025\)Healthbench: evaluating large language models towards improved human health\.arXiv preprint arXiv:2505\.08775\.Cited by:[§1\.1\.3](https://arxiv.org/html/2606.05174#S1.SS1.SSS3.p1.1),[§3\.7](https://arxiv.org/html/2606.05174#S3.SS7.p1.1),[Data availability](https://arxiv.org/html/2606.05174#Sx1.p1.1)\.
- \[2\]M\. Babaeizadeh, I\. Frosio, S\. Tyree, J\. Clemons, and J\. Kautz\(2016\)Reinforcement learning through asynchronous advantage actor\-critic on a gpu\.arXiv preprint arXiv:1611\.06256\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[3\]W\. Chiang, L\. Zheng, Y\. Sheng, A\. N\. Angelopoulos, T\. Li, D\. Li, B\. Zhu, H\. Zhang, M\. Jordan, J\. E\. Gonzalez,et al\.\(2024\)Chatbot arena: an open platform for evaluating llms by human preference\.InForty\-first International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px2.p1.1)\.
- \[4\]E\. Croxford, Y\. Gao, E\. First, N\. Pellegrino, M\. Schnier, J\. Caskey, M\. Oguss, G\. Wills, G\. Chen, D\. Dligach,et al\.\(2025\)Evaluating clinical ai summaries with large language models as judges\.npj Digital Medicine8\(1\),pp\. 640\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.
- \[5\]H\. Eguiaet al\.\(2024\)Clinical decision support and natural language processing in health care: a systematic review\.Journal of Medical Internet Research\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[6\]A\. Estevaet al\.\(2017\)Dermatologist\-level classification of skin cancer with deep neural networks\.Nature\.External Links:[Document](https://dx.doi.org/10.1038/nature21056)Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[7\]M\. Giuffrè, K\. You, Z\. Pang, S\. Kresevic, S\. Chung, R\. Chen, Y\. Ko, C\. Chan, T\. Saarinen, M\. Ajcevic,et al\.\(2025\)Expert of experts verification and alignment \(eval\) framework for large language models safety in gastroenterology\.NPJ Digital Medicine8\(1\),pp\. 242\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.
- \[8\]S\. R\. Granter, A\. H\. Beck, and D\. J\. Papke Jr\(2017\)AlphaGo, deep learning, and the future of the human microscopist\.Archives of pathology & laboratory medicine141\(5\),pp\. 619–621\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.p1.1)\.
- \[9\]V\. Gulshanet al\.\(2016\)Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs\.JAMA\.External Links:[Document](https://dx.doi.org/10.1001/jama.2016.17216)Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[10\]A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. Hendryx\(2025\)Rubrics as rewards: reinforcement learning beyond verifiable domains\.arXiv preprint arXiv:2507\.17746\.Cited by:[§1\.1\.1](https://arxiv.org/html/2606.05174#S1.SS1.SSS1.p1.5),[§1\.3](https://arxiv.org/html/2606.05174#S1.SS3.p2.3),[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.05174#S2.p1.1),[Data availability](https://arxiv.org/html/2606.05174#Sx1.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.
- \[11\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[12\]S\. Halat, M\. M\. Ebadzadeh, and K\. Amani\(2024\)Modified double\-dqn: addressing stability\.In2024 11th International Symposium on Telecommunications \(IST\),pp\. 697–702\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[13\]Y\. Hao, J\. Holmes, M\. R\. Waddle, B\. J\. Davis, N\. Y\. Yu, K\. S\. Vickers, H\. Preston, D\. Margolin, C\. E\. Löckenhoff, A\. Vashistha,et al\.\(2025\)Personalizing prostate cancer education for patients using an ehr\-integrated llm agent\.NPJ Digital Medicine8\(1\),pp\. 770\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px4.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p4.1)\.
- \[14\]K\. Huang, J\. Altosaar, and R\. Ranganath\(2019\)ClinicalBERT: modeling clinical notes and predicting hospital readmission\.arXiv preprint arXiv:1904\.05342\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[15\]A\. Jerfyet al\.\(2024\)The growing impact of natural language processing in public health and healthcare: a narrative review\.Frontiers in Public Health\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[16\]J\. Jumper, R\. Evans, A\. Pritzel, T\. Green, M\. Figurnov, O\. Ronneberger, K\. Tunyasuvunakool, R\. Bates, A\. Žídek, A\. Potapenko,et al\.\(2021\)Highly accurate protein structure prediction with alphafold\.Nature596\(7873\),pp\. 583–589\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[17\]M\. Khalifa and M\. Albadawy\(2024\)Artificial intelligence for clinical prediction: exploring key domains and essential functions\.Computer Methods and Programs in Biomedicine Update5,pp\. 100148\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[18\]S\. Kullback and R\. A\. Leibler\(1951\)On information and sufficiency\.The annals of mathematical statistics22\(1\),pp\. 79–86\.Cited by:[§3\.3\.2](https://arxiv.org/html/2606.05174#S3.SS3.SSS2.p2.6)\.
- \[19\]T\. Li, S\. Shetty, A\. Kamath, A\. Jaiswal, X\. Jiang, Y\. Ding, and Y\. Kim\(2024\)CancerGPT for few shot drug pair synergy prediction using large pretrained language models\.NPJ Digital Medicine7\(1\),pp\. 40\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p3.1)\.
- \[20\]L\. Liuet al\.\(2025\)Using natural language processing to extract information from clinical text for populating clinical registries: a review\.Journal of the American Medical Informatics Association\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[21\]I\. Lopez, A\. Swaminathan, K\. Vedula, S\. Narayanan, F\. N\. Haredasht, S\. Ma, A\. Liang, S\. Tate, M\. Maddali, R\. Gallo,et al\.\(2025\)Clinical entity augmented retrieval for clinical information extraction, npj digital medicine 8\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p4.1)\.
- \[22\]S\. M\. McKinneyet al\.\(2020\)International evaluation of an ai system for breast cancer screening\.Nature\.External Links:[Document](https://dx.doi.org/10.1038/s41586-019-1799-6)Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[23\]V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. Graves, I\. Antonoglou, D\. Wierstra, and M\. Riedmiller\(2013\)Playing atari with deep reinforcement learning\.arXiv preprint arXiv:1312\.5602\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[24\]S\. Moon, J\. Kim, J\. Kim, S\. Hong, J\. Cha, M\. Kim, S\. Lim, G\. Choi, D\. Seo, J\. Kim,et al\.\(2024\)A latency processing unit: a latency\-optimized and highly scalable processor for large language model inference\.IEEE Micro44\(6\),pp\. 17–33\.Cited by:[§3\.5](https://arxiv.org/html/2606.05174#S3.SS5.SSS0.Px2.p1.1)\.
- \[25\]M\. Naghavi, H\. H\. Kyu, M\. A\. Aalipour, H\. Aalruz, H\. S\. Ababneh, B\. J\. Abafita, U\. O\. Abaraogu, C\. Abbafati, M\. Abbasi, F\. Abbaspour,et al\.\(2025\)Global burden of 292 causes of death in 204 countries and territories and 660 subnational locations, 1990–2023: a systematic analysis for the global burden of disease study 2023\.The Lancet406\(10513\),pp\. 1811–1872\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px4.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p2.1)\.
- \[26\]J\. Pan, C\. Liu, J\. Wu, F\. Liu, J\. Zhu, H\. B\. Li, C\. Chen, C\. Ouyang, and D\. Rueckert\(2025\)Medvlm\-r1: incentivizing medical reasoning capability of vision\-language models \(vlms\) via reinforcement learning\.InInternational Conference on Medical Image Computing and Computer\-Assisted Intervention,pp\. 337–347\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p7.1)\.
- \[27\]F\. Pennino, B\. Raimondi, M\. Rondelli, A\. Gurioli, and M\. Gabbrielli\(2025\)From reasoning to code: grpo optimization for underrepresented languages\.arXiv preprint arXiv:2506\.11027\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p7.1)\.
- \[28\]P\. Petridis, G\. Margaritis, V\. Stoumpou, and D\. Bertsimas\(2026\)Holistic ai in medicine; improved performance and explainability\.npj Digital Medicine\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p4.1)\.
- \[29\]T\. Pham and C\. Ngo\(2025\)Rarl: improving medical vlm reasoning and generalization with reinforcement learning and lora under data and hardware constraints\.arXiv preprint arXiv:2506\.06600\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p7.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.
- \[30\]G\. Research and collaborators\(2025\)MedGemma technical report\.arXiv preprint arXiv:2507\.05201\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p3.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p8.1)\.
- \[31\]J\. Schulman, S\. Levine, P\. Abbeel, M\. Jordan, and P\. Moritz\(2015\)Trust region policy optimization\.InInternational conference on machine learning,pp\. 1889–1897\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[32\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[33\]A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau,et al\.\(2025\)Medgemma technical report\.arXiv preprint arXiv:2507\.05201\.Cited by:[§1\.1\.1](https://arxiv.org/html/2606.05174#S1.SS1.SSS1.p2.1)\.
- \[34\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1\.3](https://arxiv.org/html/2606.05174#S1.SS3.p4.1),[§3\.3\.2](https://arxiv.org/html/2606.05174#S3.SS3.SSS2.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p7.1)\.
- \[35\]D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton,et al\.\(2017\)Mastering the game of go without human knowledge\.nature550\(7676\),pp\. 354–359\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[36\]X\. Su, Q\. Mao, Z\. Wu, X\. Lin, S\. You, Y\. Liao, and C\. Xu\(2025\)Large language models driven neural architecture search for universal and lightweight disease diagnosis on histopathology slide images\.npj Digital Medicine8\(1\),pp\. 682\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p3.1)\.
- \[37\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[38\]A\. J\. Thirunavukarasu, D\. S\. W\. Ting, K\. Elangovan, L\. Gutierrez, T\. F\. Tan, and D\. S\. Ting\(2023\)Large language models in medicine\.Nature Medicine29,pp\. 1930–1940\.External Links:[Document](https://dx.doi.org/10.1038/s41591-023-02448-8)Cited by:[§2](https://arxiv.org/html/2606.05174#S2.p2.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[39\]E\. J\. Topol\(2019\)High\-performance medicine: the convergence of human and artificial intelligence\.Nature Medicine25\(1\),pp\. 44–56\.External Links:[Document](https://dx.doi.org/10.1038/s41591-018-0300-7)Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p1.1)\.
- \[40\]H\. Van Hasselt, A\. Guez, and D\. Silver\(2016\)Deep reinforcement learning with double q\-learning\.InProceedings of the AAAI conference on artificial intelligence,Vol\.30\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[41\]A\. Wada, Y\. Tanaka, M\. Nishizawa, A\. Yamamoto, T\. Akashi, A\. Hagiwara, Y\. Hayakawa, J\. Kikuta, K\. Shimoji, K\. Sano,et al\.\(2025\)Retrieval\-augmented generation elevates local llm quality in radiology contrast media consultation\.NPJ Digital Medicine8\(1\),pp\. 395\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px4.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p4.1)\.
- \[42\]J\. Wang, F\. Meng, and J\. Zhou\(2026\)Deeptrans: deep reasoning translation via reinforcement learning\.Transactions of the Association for Computational Linguistics14,pp\. 47–63\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p7.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.
- \[43\]Y\. Wang, T\. S\. Li, and C\. Lin\(2013\)Backward q\-learning: the combination of sarsa algorithm and q\-learning\.Engineering Applications of Artificial Intelligence26\(9\),pp\. 2184–2193\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[44\]C\. J\. Watkins and P\. Dayan\(1992\)Q\-learning\.Machine learning8\(3\),pp\. 279–292\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p6.1)\.
- \[45\]S\. Weber, N\. Deperrois, R\. Heun, L\. Frühschütz, A\. Monn, S\. Homan, A\. Häfliger, E\. Seifritz, T\. Kowatsch, M\. consortium Jäger Lena 8 Schultebraucks Katharina 9 Gershov Sapir 9 Mocellin Jacopo 1 4,et al\.\(2025\)Using a fine\-tuned large language model for symptom\-based depression evaluation\.npj Digital Medicine8\(1\),pp\. 598\.Cited by:[§2](https://arxiv.org/html/2606.05174#S2.SS0.SSS0.Px4.p1.1),[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p3.1)\.
- \[46\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p8.1)\.
- \[47\]H\. Zheng, Z\. Luo, K\. He, W\. Zhou, Z\. Kong, J\. Dong, Q\. Dai, and Q\. Sun\(2026\)KT\-llm: an evidence\-grounded and sequence text framework for auditable kidney transplant modeling\.npj Digital Medicine\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p4.1)\.
- \[48\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[Improving Heart\-Focused Medical Question Answering in LLMs via Variance\-Aware Rubric Rewards with GRPO](https://arxiv.org/html/2606.05174#p9.1)\.Similar Articles
ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
ARES proposes a framework for automatically constructing rubric-based RL data from pretraining documents, generating question-answer pairs and weighted rubrics to enable instance-level reward supervision for open-ended LLM responses, outperforming existing methods on multi-dimensional open-ended tasks.
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
Introduces OGCaReBench, a free-form retrieval benchmark for evaluating LLMs on clinical questions that require reasoning beyond standard guidelines. Experiments show that even the best model achieves only 56% accuracy, but retrieval augmentation boosts performance to 82%.
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.
@neural_avb: Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on fr…
Neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming 80% agreement with external judge LM and faster than F1/ROUGE/BertScore.