Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

arXiv cs.AI Papers

Summary

This paper demonstrates that supervised fine-tuning with synthetic rationale data consistently harms prediction performance for Alzheimer's disease detection compared to label-only fine-tuning, across many configurations and model families. The degradation persists despite high-quality rationales and is attributed to a conflict between narrative plausibility and discriminative optimization.

arXiv:2606.10279v1 Announce Type: new Abstract: Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

Cached at: 06/10/26, 06:13 AM

# Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction
Source: [https://arxiv.org/html/2606.10279](https://arxiv.org/html/2606.10279)
Buxin Su¹, Bingxuan Li², Cheng Qian², Yiwei Wang³, Jin Jin¹, Bingxin Zhao¹

¹University of Pennsylvania, ²University of Illinois Urbana-Champaign, ³University of California, Merced

subuxin@sas.upenn.edu, bl61@illinois.edu, chengq9@illinois.edu, yiweiwang2@ucmerced.edu, jin.jin@pennmedicine.upenn.edu, bxzhao@upenn.edu

## 1 Introduction

No rationale:

sex: Male
medical history: F10 alcohol-related disorder: 58.4; G40 epilepsy: 62.3; F32 depressive episode: 69.3; BMI_high: 70.0; Fluid intelligence low: 70.0; J06 acute upper respiratory infection: 73.6; I67 cerebrovascular disease: 73.1; E78 lipidaemias: 74.7
ADRD: `<THINK>` `</THINK>`
Prediction (0 or 1): 1

Free rationale:

Same disease history.
ADRD: `<THINK>` This male has multiple mid-to-late life factors that plausibly precede Alzheimer's or related dementia, including alcohol-related disorder, epilepsy, low fluid intelligence, slowed matching time, cerebrovascular disease, and later dyslipidaemia. The combination of cognitive testing abnormalities and cerebrovascular disease in the early 70s makes an ADRD code within the next five years plausible. `</THINK>`
Prediction (0 or 1): 1

Stepwise rationale:

Same disease history.
ADRD: `<THINK>` Step 1: Relevant ADRD-related events include I67 at 73.1, earlier G40 and F32, and cognitive testing abnormalities at 70.0. Step 2: Cerebrovascular disease in the early 70s, together with prior cognitive deficits, makes near-term ADRD risk high. Step 3: Progression to an ADRD-coded diagnosis within five years is therefore supported. `</THINK>`
Prediction (0 or 1): 1

Figure 1: Example training records constructed from one participant record. The three columns share the same disease history; only the ADRD response field changes across no-rationale, free-rationale, and stepwise-rationale.

Supervised fine-tuning with synthetic rationale data has become a widely used technique for improving the reasoning capabilities of language models in medicine (Chen et al., [2024](https://arxiv.org/html/2606.10279#bib.bib42); Yu et al., [2025](https://arxiv.org/html/2606.10279#bib.bib41); Kim et al., [2025](https://arxiv.org/html/2606.10279#bib.bib40)). The intuition is compelling: if a model learns not just what the answer is but why, it should generalize better, produce more interpretable outputs, and be easier to audit. This intuition has motivated a growing body of work using LLM-generated rationales to improve clinical diagnosis (Kwon et al., [2024](https://arxiv.org/html/2606.10279#bib.bib32)), reasoning-enhanced prediction from structured health data (Jiang et al., [2025](https://arxiv.org/html/2606.10279#bib.bib34); Cao et al., [2026](https://arxiv.org/html/2606.10279#bib.bib38)), multimodal clinical rationale generation (Niu et al., [2025](https://arxiv.org/html/2606.10279#bib.bib33)), and large-scale medical reasoning datasets (Sun et al., [2025](https://arxiv.org/html/2606.10279#bib.bib37)). A recurring finding is that rationale quality matters: filtering or selecting higher-quality rationales improves distillation into smaller models (Song et al., [2025](https://arxiv.org/html/2606.10279#bib.bib35)), and multi-task rationale objectives can strengthen prediction alongside explanation (Hasan et al., [2025](https://arxiv.org/html/2606.10279#bib.bib36)).

In this work, we ask whether this intuition holds in a real-world medical prediction setting that is specifically designed to challenge it. Our testbed is five-year ADRD prediction from longitudinal health histories, a task that is clinically important and epidemiologically well-motivated. Dementia is currently the seventh leading cause of death worldwide and a major cause of disability and dependency among older adults (World Health Organization, [2025](https://arxiv.org/html/2606.10279#bib.bib23)). It is a difficult prediction target: risk can accumulate through genetic, vascular, metabolic, psychiatric, cognitive, and lifestyle pathways, rather than through one defining precursor (Reitz et al., [2023](https://arxiv.org/html/2606.10279#bib.bib19); Rasmussen and Frikke-Schmidt, [2023](https://arxiv.org/html/2606.10279#bib.bib20)). This sparsity and heterogeneity make the task a precise stress test for rationale-based SFT. Our cohort contains 42,566 participants, including 8,802 ADRD cases and 33,764 matched controls, represented with 1,167 input features. Records are sparse, with only 17.7 observed features on average. The useful signal is not a single diagnosis or a fixed checklist: one future ADRD case may have vascular and metabolic history, another may have cognitive and psychiatric signals, and another may have a different mixture of weak risk factors.

Through systematic experiments, we find that rationale-based SFT fails: across a controlled sweep of 504 configurations, models trained to output only the final label substantially outperform models trained to produce free-form or stepwise rationales before the label (mean ROC-AUC 0.734 vs. 0.604 and 0.592). The gap persists across training set sizes and base models. The natural explanation is that the rationales are simply not good enough. We test rationale quality through two independent checks. First, few-shot experiments show that when the same style of rationale is used as a demonstration rather than a training target, it improves performance over zero-shot baselines, demonstrating that the rationales carry genuine discriminative signal. Second, human annotation confirms that the generated rationales are medically accurate and faithfully select patient-specific evidence from the record. The problem is not what the rationales contain, but what happens when a model is trained to reproduce them.

We conduct further in-depth analysis to find the root cause. The fact that the same rationale content degrades performance as a training target but improves it as a demonstration points to a structural incompatibility between rationale-based SFT and this prediction setting. A medically plausible rationale must tell a coherent story about why a patient's history is consistent with their label, emphasizing broad morbidity markers that read as clinically relevant. Discriminative fine-tuning, by contrast, requires learning which features distinguish future cases from matched controls in this specific cohort. These two objectives coincide when discriminative signal is concentrated in features that also anchor plausible narratives. They diverge when the signal is distributed across many patient-specific combinations of features. In such settings, training a model to reproduce plausible rationales redirects its optimization budget away from learning the discriminative boundary that actually separates cases from controls.

## 2 Experimental Design

### 2.1 Data and Task Formulation

We study five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. For each participant, the input contains prior events and risk factors available before an age-specific cutoff; the label is whether ADRD is recorded within the next five years. Full data-processing and matching details are in Appendix [A.1](https://arxiv.org/html/2606.10279#A1.SS1).

##### Prediction target.

ADRD onset is the first recorded occurrence of one of five ICD-10 code groups: F00, F01, F03, G30, or G31. The binary label is $Y=1$ for participants with an ADRD onset under this definition and $Y=0$ for controls without a recorded ADRD onset in the processed data. The input is a history of earlier events and risk factors aligned by age, not a static feature vector.
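As a rough illustration of the label definition above, the sketch below assigns the binary label from a list of age-stamped ICD-10 events. The function names, the `(code, age)` event layout, and the exact handling of the five-year window are assumptions made for illustration; the actual processing and matching are described in Appendix A.1 and may differ.

```python
# Hypothetical sketch of the five-year ADRD label; not the authors' pipeline.
ADRD_CODE_GROUPS = ("F00", "F01", "F03", "G30", "G31")  # code groups named in the text


def first_adrd_onset(events):
    """Return the earliest age with an ADRD code group, or None.

    `events` uses an assumed layout: a list of (icd10_code, age) pairs.
    """
    ages = [age for code, age in events if code.startswith(ADRD_CODE_GROUPS)]
    return min(ages) if ages else None


def adrd_label(events, cutoff_age, horizon_years=5.0):
    """Return 1 if the first ADRD onset falls within the horizon after the cutoff."""
    onset = first_adrd_onset(events)
    if onset is None:
        return 0
    # Onsets before the cutoff also return 0 here; a real pipeline would
    # more likely exclude such participants (assumption).
    return int(cutoff_age < onset <= cutoff_age + horizon_years)


events = [("I67", 73.1), ("E78", 74.7), ("G30", 77.2)]
print(adrd_label(events, cutoff_age=75.0))      # 1
print(adrd_label(events[:2], cutoff_age=75.0))  # 0
```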

##### Input representation.

Each participant is serialized as sex plus an age-stamped medical-history dictionary, as illustrated in Figure [1](https://arxiv.org/html/2606.10279#S1.F1). The final cohort contains 42,566 participants, including 8,802 ADRD cases and 33,764 matched controls. The structured input has 1,167 possible features: 1,102 ICD-10 first-occurrence disease features and 65 cognitive or lifestyle features. Records are sparse, with 17.7 observed features on average and a median of 15 (interquartile range 10–23). This sparsity makes generated rationales a demanding interface: a short explanation must select patient-specific evidence from many weak and incomplete signals.

### 2.2 Training With Labeled Targets

We compare SFT targets with and without generated rationales. Each model is trained on the same ADRD prediction task and the same structured health-history inputs. The only difference across the main data conditions is the response format used as the training target. We compare no-rationale, free-rationale, and stepwise-rationale targets, as shown in Figure [1](https://arxiv.org/html/2606.10279#S1.F1).
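The three target formats can be sketched as one template with an optional rationale, following the layout shown in Figure 1. The helper name `build_target` and the shortened rationale strings are hypothetical; the exact SFT prompts are given in Appendix C and are not reproduced here.

```python
# Hypothetical helper mirroring the three response formats in Figure 1.
def build_target(label, rationale=""):
    """Return the ADRD response field used as the SFT target.

    An empty `rationale` gives the no-rationale target; free-form or
    stepwise text gives the two rationale targets.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return f"<THINK>\n{rationale}</THINK>\nPrediction (0 or 1): {label}"


# Shortened placeholder rationales, loosely based on the Figure 1 example.
no_rationale = build_target(1)
free = build_target(1, "Cerebrovascular disease in the early 70s makes ADRD plausible.")
stepwise = build_target(1, "Step 1: I67 at 73.1. Step 2: Near-term risk is high.")

for target in (no_rationale, free, stepwise):
    print(repr(target))
```

In this sketch the label line is identical across formats, so the only thing that varies between conditions is the text inside the `<THINK>` block.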

The rationales used in the free-rationale and stepwise-rationale conditions are generated before SFT from the original training labels (Han et al., [2023](https://arxiv.org/html/2606.10279#bib.bib39)). The generator receives the structured patient record and the ground-truth ADRD-within-five-years label, and is instructed to use only evidence present in the record. Full generation prompts and SFT prompts are given in Appendix [C](https://arxiv.org/html/2606.10279#A3).

The controlled SFT grid crosses rationale format, training sample size, learning rate, base model, and decoding setting (Appendix Table [2](https://arxiv.org/html/2606.10279#A3.T2)). It contains 504 configurations: three target formats, three sample sizes, four learning rates, two base models, and seven decoding settings. The base models are Qwen3-8B and Qwen2.5-7B-Instruct. The decoding settings include greedy decoding and top-$k$ or top-$p$ sampling at temperatures 0.1, 0.5, and 1.0. Every configuration is evaluated on the same held-out test set of 853 individuals.
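The configuration count can be checked by enumerating the grid. In the sketch below, the sample sizes are the ones reported in Section 3.2, the learning-rate entries are placeholders because their values are not stated in this section, and the reading of "seven decoding settings" as greedy plus top-$k$ and top-$p$ at three temperatures each is an interpretation of the sentence above.

```python
from itertools import product

formats = ["no-rationale", "free-rationale", "stepwise-rationale"]
sample_sizes = ["1.5K", "3.8K", "15.3K"]           # sizes reported in Section 3.2
learning_rates = ["lr_1", "lr_2", "lr_3", "lr_4"]  # placeholders: values not given here
base_models = ["Qwen3-8B", "Qwen2.5-7B-Instruct"]
temperatures = [0.1, 0.5, 1.0]

# Greedy decoding plus top-k and top-p sampling at each temperature (assumed reading).
decoding = [("greedy", None)] + [
    (method, temp) for method in ("top-k", "top-p") for temp in temperatures
]

grid = list(product(formats, sample_sizes, learning_rates, base_models, decoding))
print(len(decoding), len(grid))  # 7 504
```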

For matched comparisons, we vary one factor at a time and hold the remaining factors fixed. We use paired $t$-tests for grid-average comparisons and a paired DeLong test for selected ROC-AUC comparisons between fixed configurations, following Appendix [B](https://arxiv.org/html/2606.10279#A2).
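For readers unfamiliar with the matched comparison, a minimal standard-library sketch of the paired $t$ statistic follows. The ROC-AUC values are toy numbers chosen for illustration, not results from the paper, and the function name is hypothetical. Obtaining a P value additionally requires the $t$ distribution, for example through a library routine such as `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy ROC-AUC values for four matched configurations (illustrative only).
no_rationale = [0.71, 0.74, 0.77, 0.73]
free_rationale = [0.60, 0.61, 0.62, 0.59]

t, dof = paired_t_statistic(no_rationale, free_rationale)
print(round(t, 2), dof)
```

Pairing matters here because each difference is taken between two configurations that share every other setting, which removes variance due to learning rate, sample size, and decoding.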

## 3 Experiment Results

This section evaluates whether the SFT target should include a generated rationale before the final ADRD risk probability. Across the controlled sweep, models trained to output only the final label or probability achieve higher ROC-AUC than models trained to output either free-form rationales or stepwise rationales. This pattern remains when we increase the training set size and when we use Qwen3-8B, a reasoning-oriented base model. We then report three additional checks from the SFT sweep: how performance changes with training-set size, how the two base models behave under each target format, and how decoding choices affect the trained models.

### 3.1 Label-Only Outperforms Rationale-based

![Refer to caption](https://arxiv.org/html/2606.10279v1/x1.png)
(a) Rationale, ROC-AUC

![Refer to caption](https://arxiv.org/html/2606.10279v1/x2.png)
(b) Base model, ROC-AUC

Figure 2: SFT ROC-AUC performance by rationale format and base model.

![Refer to caption](https://arxiv.org/html/2606.10279v1/x3.png)

Figure 3: Parameter-level SFT diagnostics for the additional summary insights. All panels use ROC-AUC as the main metric and summarize the full SFT sweep. Bars or points show mean ROC-AUC, and error bars show standard error across configurations. Panel A shows data scaling by rationale format, Panel B shows the base-model interaction with target format, and Panel C compares decoding settings.

##### Direct label targets without rationale are strongest.

Across matched SFT configurations, no-rationale is clearly strongest (Figure [2](https://arxiv.org/html/2606.10279#S3.F2)A). Mean ROC-AUC is 0.734 for no-rationale, compared with 0.604 for free-rationale and 0.592 for stepwise-rationale. Both rationale conditions are substantially worse than no-rationale (paired $t$-test, $P=7.26\times 10^{-52}$ for free-rationale and $P=6.51\times 10^{-57}$ for stepwise-rationale).

The same pattern appears beyond ROC-AUC (Appendix Figure [8](https://arxiv.org/html/2606.10279#A3.F8)C,E,G). No-rationale has higher mean PR-AUC than free-rationale (0.504 versus 0.313; $P=1.79\times 10^{-49}$) and stepwise-rationale (0.504 versus 0.306; $P=1.92\times 10^{-52}$). It also has higher mean F1 score than the two rationale formats (0.332 versus 0.284 and 0.291; $P=1.03\times 10^{-4}$ and $P=4.18\times 10^{-4}$). Mean recall follows the same ordering: 0.256 for no-rationale, 0.237 for free-rationale, and 0.228 for stepwise-rationale.

##### The gap remains after selecting the best configuration.

The best individual SFT configurations lead to the same conclusion. The best no-rationale configuration reaches ROC-AUC 0.849, while the best free-rationale and stepwise-rationale configurations reach ROC-AUC 0.698 and 0.693. Because the no-rationale median is 0.755, even the best observed rationale configurations fall below a typical no-rationale configuration.

##### A reasoning-oriented base model does not eliminate the degradation.

Using a reasoning-oriented base model does not remove the rationale degradation. Across 252 matched SFT pairs, Qwen3-8B has slightly lower mean ROC-AUC than Qwen2.5-7B-Instruct (0.640 versus 0.647; mean difference $-0.0077$; paired $t$-test, $P=0.0348$; Figure [2](https://arxiv.org/html/2606.10279#S3.F2)B). The absolute difference is small, but it goes in the opposite direction from the idea that a reasoning-oriented base model should handle rationales better.

The best-configuration comparison is not significant. The best Qwen3-8B SFT configuration reaches ROC-AUC 0.849, compared with 0.839 for the best Qwen2.5-7B-Instruct configuration (paired DeLong test, $P=0.235$). Appendix Figure [9](https://arxiv.org/html/2606.10279#A3.F9) shows the same lack of a practical advantage on PR-AUC, F1, and recall. Thus, the SFT degradation is not resolved by switching from a standard instruction model to a model with a stronger reasoning emphasis.

### 3.2 Analysis and Insights

![Refer to caption](https://arxiv.org/html/2606.10279v1/x4.png)

Figure 4: Feature-level error analysis for the best no-rationale and free-rationale SFT configurations, using the same validation examples and the same mapped feature representation. Panels A and B show post-hoc feature-score associations for the top mapped features in the no-rationale and free-rationale configurations, respectively. Panels C and D compare the prevalence of the same features among detected positives (blue) and missed positives (red).

##### Data scaling benefits label-only targets most.

Increasing the training sample size improves no-rationale SFT monotonically. Mean ROC-AUC rises from 0.707 at 1.5K examples to 0.725 at 3.8K and 0.771 at 15.3K; in matched comparisons, 15.3K improves over 1.5K by $+0.065$ ROC-AUC on average ($P=7.19\times 10^{-10}$). The other validation metrics follow the same scaling pattern for no-rationale SFT: PR-AUC rises from 0.451 to 0.568, and F1 rises from 0.265 to 0.424. Scaling is weaker for rationale targets: free-rationale improves by only $+0.030$ ROC-AUC from 1.5K to 15.3K, and stepwise-rationale has no significant ROC-AUC gain ($+0.0048$, $P=0.337$; Figure [3](https://arxiv.org/html/2606.10279#S3.F3)A).

##### Model choice interacts with rationale format.

Qwen3-8B improves performance when the target contains no rationale, but degrades performance when the target contains generated rationales. Under no-rationale SFT, Qwen3-8B has mean ROC-AUC 0.750 versus 0.719 for Qwen2.5-7B-Instruct (matched difference $+0.030$, $P=3.69\times 10^{-4}$). Under free-rationale SFT, Qwen3-8B drops to 0.587 versus 0.620 for Qwen2.5-7B-Instruct ($-0.033$, $P=9.91\times 10^{-14}$); under stepwise-rationale SFT, it is 0.582 versus 0.602 ($-0.020$, $P=5.54\times 10^{-8}$). PR-AUC shows the same interaction: Qwen3-8B is higher for no-rationale (0.529 versus 0.479), but lower for free-rationale (0.297 versus 0.328) and stepwise-rationale (0.296 versus 0.316; Figure [3](https://arxiv.org/html/2606.10279#S3.F3)B).

##### Decoding choice is a strong inference-time parameter.

The inference setting changes ROC-AUC substantially even after SFT. Averaged over all training settings, top-$p$ sampling has lower mean ROC-AUC than top-$k$ sampling or greedy decoding (0.610 versus 0.666 and 0.675). The effect is clearest for no-rationale SFT: greedy and all top-$k$ temperatures cluster near ROC-AUC 0.768, while top-$p$ at temperature 0.1 falls to 0.615 and top-$p$ at temperature 0.5 reaches only 0.691. PR-AUC supports the same interpretation, with no-rationale top-$p$ at temperature 0.1 reaching 0.394 compared with roughly 0.531 for greedy or top-$k$ decoding. Thus, stable decoding is especially important for preserving the gains from direct label-only fine-tuning (Figure [3](https://arxiv.org/html/2606.10279#S3.F3)C).

## 4 Discussion

The experiments establish a clear empirical pattern: generated rationales consistently hurt five-year ADRD prediction when used as SFT targets, and the degradation cannot be explained by poor rationale quality. This section examines the evidence for that claim from three angles. We first assess rationale quality directly, through human expert annotation and through a few-shot ablation that uses the same style of rationale as a demonstration rather than a training target. We then analyze the feature-level errors that separate the best no-rationale and free-rationale configurations. Finally, we synthesize these observations into a mechanistic account of why plausible rationales and discriminative fine-tuning are conflicting objectives in this setting.

### 4.1 Quality Analysis of Generated Rationales

#### 4.1.1 Human Expert Study

Table 1: Logic = logical coherence; Bio. = biomedical correctness; ADRD = ADRD relatedness; Fidelity = evidence fidelity and temporal grounding.

To verify that the observed SFT degradation is not simply an artifact of low-quality rationale generation, we invited a clinical expert to evaluate a stratified sample of the generated training rationales across two dimensions: *medical accuracy* (whether the clinical reasoning is factually correct and consistent with established ADRD risk factors) and *record fidelity* (whether the rationale draws only on evidence present in the structured patient record, without hallucinating symptoms, medications, or diagnoses).

The generated rationales score highly on both dimensions. The annotator confirmed that the rationales accurately interpret the ICD-10 codes present in each record, correctly invoke known ADRD risk pathways, and avoid fabricating patient-specific evidence. As shown in Table [1](https://arxiv.org/html/2606.10279#S4.T1), average scores were broadly comparable across sexes, with male samples scoring slightly higher on biomedical correctness (8.33 vs. 7.00) and fidelity (7.83 vs. 7.00), a difference driven largely by the lower scores on Sample 4. These findings rule out the most natural explanation for the SFT degradation: that the rationales are too noisy or inaccurate to serve as effective training targets. By the standards a clinician would apply, the rationales are of high quality.

#### 4.1.2 Rationales as Few-Shot Demonstrations

![Refer to caption](https://arxiv.org/html/2606.10279v1/x5.png)

Figure 5: Few-shot ablation in the training-free setting. Bars show metric means over matched decoding settings within each model, displayed as percentages. The three conditions are Zero-shot, Zero-shot with CoT, and Few-shot.

A second quality check asks a different question: if the rationales genuinely carry discriminative signal, they should improve performance when provided as demonstrations, even if they fail as training targets. We test this by prepending five fixed GPT-5.4-generated disease-prediction demonstrations to the query prompt, each pairing a structured health history with a `<THINK>` reasoning block and a final probability. No parameter updates are made; the only change relative to the Zero-shot with CoT baseline is the addition of worked examples.

Few-shot demonstrations improve substantially over both zero-shot baselines. Averaged across both models and all matched decoding settings, mean ROC-AUC rises from 0.552 (Zero-shot) and 0.563 (Zero-shot with CoT) to 0.648 (Few-shot; paired $t$-test, $P=6.89\times 10^{-9}$ and $P=7.57\times 10^{-7}$ respectively; Figure [5](https://arxiv.org/html/2606.10279#S4.F5)). In the deterministic Qwen3-8B comparison, few-shot reaches ROC-AUC 0.654, while Zero-shot with CoT reaches only 0.510. The score scale also shifts: few-shot assigns scores as high as 0.85 and produces three true-positive predictions at the 0.5 threshold, whereas Zero-shot with CoT never exceeds 0.35 and therefore makes no positive predictions at that threshold.

The feature-level analysis in Figure [6](https://arxiv.org/html/2606.10279#S4.F6) reveals the mechanism. Few-shot assigns large score increases to clinically specific ADRD-related histories: transient cerebral ischaemia, stroke, cerebral infarction, delirium, hypotension, and cerebrovascular disease (Figure [6](https://arxiv.org/html/2606.10279#S4.F6)A). The Zero-shot with CoT list, by contrast, is dominated by broad cardiovascular and metabolic markers such as hypertension, angina, diabetes, and chronic ischaemic heart disease (Figure [6](https://arxiv.org/html/2606.10279#S4.F6)B), with the largest feature-score increase reaching only 0.092. The demonstrations do not simply remove code-checking behavior: both prompt formats check for the absence of explicit ADRD ICD codes at similar rates. What changes is the specificity of the reasoning that follows. Few-shot rationales mention cognitive-test evidence in 99% of cases versus 70% for Zero-shot with CoT, neurologic or delirium signals in 80% versus 13%, and comorbidity caveats in 40% versus 1% (Figure [6](https://arxiv.org/html/2606.10279#S4.F6)C–D). The demonstrations teach the model to connect broad risk-factor language to more specific clinical signals and to distinguish between comorbidity as background noise and comorbidity as genuine ADRD evidence.

The few-shot result has a direct implication for interpreting the SFT experiments. The same style of generated rationale that degrades performance when used as a training target improves it when used as a demonstration. This dissociation confirms that the rationales encode genuine discriminative signal. The problem is not what the rationales contain but how the model is trained on them.

### 4.2 Error Analysis Across Rationale Formats

![Refer to caption](https://arxiv.org/html/2606.10279v1/x6.png)

Figure 6: Feature and generated-rationale analysis for deterministic Qwen3-8B few-shot and Zero-shot with CoT outputs. Panels A and B show feature-score associations for the mapped `medical history` features, using the same proxy as Figure [4](https://arxiv.org/html/2606.10279#S3.F4): the mean model score when a feature is present minus the mean score when it is absent. Features are shown when they appear in at least five examples. Panels C and D show the percentage of generated `<THINK>` rationales that mention each rationale cue, measured by keyword or phrase matches in the generated `<THINK>` text.

Having established that the rationales themselves are not the source of failure, we turn to what SFT training on them actually does to the model. We compare the best no-rationale and free-rationale configurations on the same 853 validation examples, using post-hoc feature-score associations as a conservative importance proxy: for each feature appearing in at least ten validation examples, importance is the mean model score when the feature is present minus the mean score when it is absent.
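The importance proxy described above can be sketched in a few lines. The function name, the representation of each record as a set of feature codes, and the toy scores are assumptions for illustration only; the paper's mapped feature representation is not specified here.

```python
from statistics import mean


def feature_score_association(records, scores, min_count=10):
    """Mean score with the feature present minus mean score with it absent.

    `records` is a list of sets of feature names and `scores` the matching
    model scores. Features seen in fewer than `min_count` records are skipped.
    """
    features = set().union(*records) if records else set()
    result = {}
    for feat in features:
        present = [s for r, s in zip(records, scores) if feat in r]
        absent = [s for r, s in zip(records, scores) if feat not in r]
        if len(present) >= min_count and absent:
            result[feat] = mean(present) - mean(absent)
    return result


# Toy data (illustrative only): three records, threshold lowered to 1.
records = [{"F05", "I67"}, {"I67"}, {"E78"}]
scores = [0.9, 0.7, 0.2]
assoc = feature_score_association(records, scores, min_count=1)
for feat in sorted(assoc, key=assoc.get, reverse=True):
    print(feat, round(assoc[feat], 3))
```

Because this is a marginal difference in means, it does not adjust for co-occurring features, which is one reason to read it as a conservative proxy and not as a causal attribution.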

The no-rationale model learns a tight, clinically coherent signal. Its strongest score-linked features are specific neurologic, vascular, and metabolic histories that separate future ADRD cases from matched controls in this validation set. F05 delirium raises the mean score by 0.556 (observed case rate: 83.3%); G20 Parkinson's disease raises it by 0.510 (72.7% case rate); I67 cerebrovascular disease raises it by 0.485 (58.3% case rate). The full top-feature list (hemiplegia, cerebral infarction, mineral metabolism disorders, epilepsy, diabetes, and stroke) is not merely a list of common diagnoses but a set of histories that are disproportionately represented among future ADRD cases relative to matched controls (Figure [4](https://arxiv.org/html/2606.10279#S3.F4)A,C).

The free-rationale model is broader. It still assigns elevated scores to some of the same ADRD-specific markers, including Parkinson's disease, delirium, epilepsy, and stroke. However, its top-feature list also includes H18 corneal disorders, H36 retinal disorders, F10 alcohol-related disorders, J84 interstitial lung disease, and K26 duodenal ulcer (Figure [4](https://arxiv.org/html/2606.10279#S3.F4)B). These features reflect overall health burden rather than near-term ADRD risk. The degradation is not a complete loss of risk information: the free-rationale top features still carry some signal, appearing more often among detected positives than missed positives (Figure [4](https://arxiv.org/html/2606.10279#S3.F4)D). But the model has learned to treat general morbidity as stronger ADRD evidence than it should, inflating false positives on controls.

The error counts make this concrete. Relative to the no-rationale configuration, free-rationale SFT produces more false positives (86 versus 26) and more false negatives (112 versus 98), yielding lower ROC-AUC (0.698 versus 0.849), lower PR-AUC (0.384 versus 0.702), lower precision (0.472 versus 0.778), and lower specificity (0.870 versus 0.961). Among the 119 examples where no-rationale is correct and free-rationale is wrong, two error patterns dominate. In one direction, 77 controls are misclassified as positive: the free-rationale model treats weak cognitive bins, high BMI, smoking, and dyslipidaemia as sufficient ADRD evidence even when the held-out label is negative. In the other direction, 42 ADRD cases are misclassified as negative: the model acknowledges hypertension, diabetes, depression, B12 anaemia, and cognitive-timing signals but discounts them because no explicit dementia ICD code appears in the observed history. Both error types reflect the same underlying shift: toward imitating the surface logic of the training rationales rather than learning the discriminative boundary that separates cases from controls.
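The reported error counts, precision, and specificity can be cross-checked against one another. The sketch below is a back-of-the-envelope consistency check, not the authors' evaluation code. It assumes all four quantities come from a single decision threshold on the full set of 853 examples, and it infers the true-positive count, and hence the numbers of cases and controls, from the reported no-rationale precision; none of those inferred counts are stated in the text.

```python
N_TOTAL = 853  # evaluation examples reported in the text


def infer_tp(fp, precision, digits=3):
    """Find integer TP counts whose precision rounds to the reported value."""
    return [tp for tp in range(1, N_TOTAL)
            if round(tp / (tp + fp), digits) == precision]


# No-rationale: FP = 26, FN = 98, precision 0.778 (all reported).
tp_candidates = infer_tp(fp=26, precision=0.778)
tp_no = tp_candidates[0]
positives = tp_no + 98           # inferred number of ADRD cases (not reported)
negatives = N_TOTAL - positives  # inferred number of controls (not reported)


def precision_specificity(fp, fn):
    """Recompute precision and specificity from error counts and inferred totals."""
    tp = positives - fn
    tn = negatives - fp
    return round(tp / (tp + fp), 3), round(tn / (tn + fp), 3)


print(tp_candidates, positives, negatives)   # [91] 189 664
print(precision_specificity(fp=26, fn=98))   # (0.778, 0.961) no-rationale
print(precision_specificity(fp=86, fn=112))  # (0.472, 0.87) free-rationale
```

Under these assumptions the recomputed values match the reported precision and specificity for both configurations, so the figures in the paragraph above are mutually consistent.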

### 4.3 Why Does SFT with Synthetic Rationales Degrade Model Performance?

The evidence from Sections [4.1.1](https://arxiv.org/html/2606.10279#S4.SS1.SSS1)–[4.2](https://arxiv.org/html/2606.10279#S4.SS2) points to a structural incompatibility between rationale-based SFT and this prediction setting, not a quality problem with the rationales themselves. We now make this argument explicit.

##### Plausibility and discriminability are different objectives\.

A medically plausible rationale for a given patient must tell a coherent story about why that patient’s history is consistent with the ground\-truth label\. This is a generative task: the rationale selects evidence that fits a narrative arc, weighs it against known risk pathways, and arrives at a conclusion that reads as reasonable to a clinician\. Discriminative fine\-tuning, by contrast, requires the model to learn which features distinguish future cases from matched controls in this specific cohort\. These two objectives coincide when the discriminative signal is concentrated in features that also anchor plausible narratives\. They diverge when the signal is distributed across many weak, patient\-specific combinations of features, as it is in our ADRD cohort\.

In the ADRD setting, a plausible narrative for a positive case must mention vascular disease, metabolic risk, cognitive decline, psychiatric history, or some combination thereof\. Many of these features are also present in controls, just in weaker or different combinations\. A rationale generator instructed to explain a positive label will therefore reasonably emphasize broad morbidity markers that are common in cases and that read as clinically relevant, but which are not strong discriminators against matched controls\. The SFT model that learns to reproduce such rationales internalizes this broader explanation pattern\. The feature analysis in Section [4\.2](https://arxiv.org/html/2606.10279#S4.SS2) shows exactly this: free\-rationale SFT assigns elevated scores to conditions like retinal disorders, interstitial lung disease, and duodenal ulcer that are mentioned in plausible rationales but are weak discriminators in this cohort\.

##### Rationale SFT redirects the optimization target\.

Label\-only SFT trains the model to place a decision boundary that separates the case and control distributions in the representation space of structured health histories\. The completion loss is directly coupled to the label: every gradient step asks whether the model assigns higher probability to the correct binary outcome\. Rationale SFT interleaves this signal with a much larger loss contribution from the generated text\. At typical rationale lengths of 50–150 tokens, the label token accounts for less than 2% of the completion\. The model therefore spends most of its optimization budget learning to reproduce the narrative, with the discriminative signal diluted to a small fraction of the training objective\. The result is a model that has been well\-trained to generate plausible medical reasoning and poorly trained to separate cases from controls\.
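The dilution can be checked with a short calculation\. The sketch below assumes a single label token and a loss averaged uniformly over completion tokens; both are simplifying assumptions, and `label_loss_share` is a hypothetical helper, not part of the training code\.

```python
def label_loss_share(rationale_tokens: int, label_tokens: int = 1) -> float:
    """Fraction of completion tokens that belong to the label.

    Under a uniformly averaged token-level loss (an assumption), this is
    also the label's share of the training objective.
    """
    if rationale_tokens < 0 or label_tokens <= 0:
        raise ValueError("need rationale_tokens >= 0 and label_tokens >= 1")
    return label_tokens / (rationale_tokens + label_tokens)


# Rationale lengths spanning the 50-150 token range quoted in the text.
for n in (50, 100, 150):
    print(f"{n:>3} rationale tokens -> label share {label_loss_share(n):.2%}")
```

Under these assumptions the label share stays below 2% across the 50–150 token range; template tokens such as the `<THINK>` tags are ignored here and would shrink it further\.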

##### Implications for rationale\-based approaches in clinical prediction\.

These results do not imply that clinical reasoning is uninformative for disease prediction, nor that chain\-of\-thought approaches are generally harmful\. The few\-shot experiments demonstrate that the same rationale content improves performance when it guides inference rather than training\. The limitation is specific to the use of generated rationales as SFT targets in settings where the discriminative signal is sparse and heterogeneous: here, the narrative plausibility of a rationale is at best weakly correlated with its discriminative informativeness, and optimizing for the former interferes with learning the latter\. In such settings, label\-only fine\-tuning is not a weaker approach, because it keeps the optimization target directly aligned with the prediction task\.

## 5 Conclusion

We studied whether supervised fine\-tuning on synthetic rationales improves language\-model performance for five\-year ADRD prediction\. Across 504 controlled configurations, rationale\-based SFT consistently underperforms label\-only fine\-tuning\. The failure is not due to poor rationale quality: the same rationale content helps as inference\-time demonstrations, and expert review confirms that the rationales are medically accurate and grounded\. Instead, rationale SFT appears to shift optimization toward reproducing narrative reasoning rather than learning the discriminative boundary between cases and controls\. These findings clarify when rationale supervision may help, when it may harm, and how it should be used in high\-stakes clinical prediction\.

## Limitations

This is a case study of a single disease and cohort; results may differ for diseases with clearer causal markers or denser clinical records\. The study evaluates 7–8B open models; larger proprietary models may behave differently\. The rationales are generated by strong LLMs and are not clinician\-written explanations\. The results do not show that all reasoning is uninformative; they show that generated rationales, as currently used for SFT, are unreliable in this setting\. The task is for research evaluation only and is not intended for clinical deployment\. Future work should examine whether label\-weighted loss functions, discriminative rationale filtering, or hybrid objectives can recover the benefits of rationale\-based training while preserving the discriminative signal that direct label supervision provides\.

## Broader Impact

This work studies clinical risk prediction, a setting where model errors can affect patient care if used outside appropriate research safeguards\. A key risk is misuse: models trained or prompted with plausible rationales may be treated as clinically interpretable even when their predictions are poorly calibrated or their explanations do not reflect discriminative evidence\. Such systems could produce false reassurance, unnecessary alarm, or spurious justification for downstream decisions\. The results should therefore not be used to deploy ADRD prediction tools or to support individual\-level clinical, insurance, or resource allocation decisions\. More broadly, our findings caution against using synthetic rationales as a shortcut for trustworthy explanation in high\-stakes domains\. Any future deployment of language\-model\-based clinical prediction should require independent validation, bias and calibration audits across patient subgroups, privacy\-preserving data governance, and human clinical oversight\.

## Use of LLMs

In this work, LLMs are used strictly for research support rather than as sources of substantive content\. Their use falls into two categories: \(i\) serving as the models that are trained and tested, and \(ii\) assisting with language refinement during paper writing\. For writing support, we used GPT\-5 solely to polish text \(improving coherence and grammar\), while all ideas, logic, results, and technical contributions originate from the authors\.

## References

- D\. Bamber \(1975\) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph\. Journal of Mathematical Psychology 12\(4\), pp\. 387–415\. [Document](https://dx.doi.org/10.1016/0022-2496%2875%2990001-2)\.
- Y\. Benjamini and Y\. Hochberg \(1995\) Controlling the false discovery rate: a practical and powerful approach to multiple testing\. Journal of the Royal Statistical Society: Series B \(Methodological\) 57\(1\), pp\. 289–300\. [Document](https://dx.doi.org/10.1111/j.2517-6161.1995.tb02031.x)\.
- C\. Bycroft, C\. Freeman, D\. Petkova, G\. Band, L\. T\. Elliott, K\. Sharp, A\. Motyer, D\. Vukcevic, O\. Delaneau, J\. O’Connell, et al\. \(2018\) The UK Biobank resource with deep phenotyping and genomic data\. Nature 562\(7726\), pp\. 203–209\.
- Y\. Cao, Y\. Chen, H\. Jiang, H\. Lee, and R\. T\. Tan \(2026\) ReMedi: reasoner for medical clinical prediction\. arXiv preprint arXiv:2605\.01474\.
- J\. Chen, Z\. Cai, K\. Ji, X\. Wang, W\. Liu, R\. Wang, J\. Hou, and B\. Wang \(2024\) HuatuoGPT\-o1, towards medical complex reasoning with LLMs\. arXiv preprint arXiv:2412\.18925\.
- J\. Davis and M\. Goadrich \(2006\) The relationship between precision\-recall and ROC curves\. In Proceedings of the 23rd International Conference on Machine Learning, pp\. 233–240\. [Document](https://dx.doi.org/10.1145/1143844.1143874)\.
- E\. R\. DeLong, D\. M\. DeLong, and D\. L\. Clarke\-Pearson \(1988\) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach\. Biometrics 44\(3\), pp\. 837–845\. [Document](https://dx.doi.org/10.2307/2531595)\.
- B\. Efron and R\. J\. Tibshirani \(1993\) An introduction to the bootstrap\. Monographs on Statistics and Applied Probability, Vol\. 57, Chapman and Hall, New York\. ISBN 0412042312\.
- T\. Fawcett \(2006\) An introduction to ROC analysis\. Pattern Recognition Letters 27\(8\), pp\. 861–874\. [Document](https://dx.doi.org/10.1016/j.patrec.2005.10.010)\.
- C\. Han, D\. W\. Kim, S\. Kim, S\. C\. You, S\. Bae, and D\. Yoon \(2023\) Large\-language\-model\-based 10\-year risk prediction of cardiovascular disease: insight from the UK Biobank data\. medRxiv, pp\. 2023–05\.
- J\. A\. Hanley and B\. J\. McNeil \(1982\) The meaning and use of the area under a receiver operating characteristic \(ROC\) curve\. Radiology 143\(1\), pp\. 29–36\. [Document](https://dx.doi.org/10.1148/radiology.143.1.7063747)\.
- H\. Hasan, H\. K\. Bashier, J\. Dai, M\. Kim, and R\. Goebel \(2025\) Reason2Decide: rationale\-driven multi\-task learning\. arXiv preprint arXiv:2512\.20074\.
- S\. Holm \(1979\) A simple sequentially rejective multiple test procedure\. Scandinavian Journal of Statistics 6\(2\), pp\. 65–70\. [Link](https://www.jstor.org/stable/4615733)\.
- P\. Jiang, C\. D\. Xiao, M\. Jiang, P\. Bhatia, T\. Kass\-Hout, J\. Sun, and J\. Han \(2025\) Reasoning\-enhanced healthcare predictions with knowledge graph community retrieval\. In International Conference on Learning Representations, Vol\. 2025, pp\. 81785–81830\.
- H\. Kim, H\. Hwang, J\. Lee, S\. Park, D\. Kim, T\. Lee, C\. Yoon, J\. Sohn, J\. Park, O\. Reykhart, et al\. \(2025\) Small language models learn enhanced reasoning skills from medical textbooks\. npj Digital Medicine 8\(1\), pp\. 240\.
- T\. Kwon, K\. T\. Ong, D\. Kang, S\. Moon, J\. R\. Lee, D\. Hwang, B\. Sohn, Y\. Sim, D\. Lee, and J\. Yeo \(2024\) Large language models are clinical reasoners: reasoning\-aware diagnosis framework with prompt\-generated rationales\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 18417–18425\.
- S\. Niu, J\. Ma, H\. Lin, L\. Bai, Z\. Wang, Y\. Xu, Y\. Song, and X\. Yang \(2025\) Knowledge\-augmented multimodal clinical rationale generation for disease diagnosis with small language models\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 11011–11024\. [Link](https://aclanthology.org/2025.acl-long.540/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.540), ISBN 979\-8\-89176\-251\-0\.
- I\. J\. Rasmussen and R\. Frikke\-Schmidt \(2023\) Modifiable cardiovascular risk factors and genetics for targeted prevention of dementia\. European Heart Journal 44\(28\), pp\. 2526–2543\. [Document](https://dx.doi.org/10.1093/eurheartj/ehad293)\.
- C\. Reitz, M\. A\. Pericak\-Vance, T\. Foroud, and R\. Mayeux \(2023\) A global view of the genetic basis of Alzheimer disease\. Nature Reviews Neurology 19, pp\. 261–277\. [Document](https://dx.doi.org/10.1038/s41582-023-00789-z)\.
- T\. Saito and M\. Rehmsmeier \(2015\) The precision\-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets\. PLOS ONE 10\(3\), pp\. e0118432\. [Document](https://dx.doi.org/10.1371/journal.pone.0118432)\.
- A\. Shmatko, A\. W\. Jung, K\. Gaurav, S\. Brunak, L\. H\. Mortensen, E\. Birney, T\. Fitzgerald, and M\. Gerstung \(2025\) Learning the natural history of human disease with generative transformers\. Nature 647, pp\. 248–256\. [Document](https://dx.doi.org/10.1038/s41586-025-09529-3)\.
- H\. Song, H\. Lee, J\. Shin, S\. Cho, C\. Ko, and J\. C\. Park \(2025\) Does rationale quality matter? enhancing mental disorder detection via selective reasoning distillation\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 21738–21756\. [Link](https://aclanthology.org/2025.findings-acl.1119/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1119), ISBN 979\-8\-89176\-256\-5\.
- Y\. Sun, X\. Qian, W\. Xu, H\. Zhang, C\. Xiao, L\. Li, D\. Zhao, W\. Huang, T\. Xu, Q\. Bai, et al\. \(2025\) ReasonMed: a 370k multi\-agent generated dataset for advancing medical reasoning\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 26457–26478\.
- C\. J\. van Rijsbergen \(1979\) Information retrieval\. 2nd edition, Butterworths, London\. ISBN 0408709294\.
- World Health Organization \(2025\) Dementia\. Note: Fact sheet, updated March 31, 2025\. [Link](https://www.who.int/news-room/fact-sheets/detail/dementia)\.
- H\. Yu, T\. Cheng, Y\. Wang, W\. He, Q\. Wang, Y\. Cheng, Y\. Zhang, R\. Feng, and X\. Zhang \(2025\) FineMedLM\-o1: enhancing medical knowledge reasoning ability of LLM from supervised fine\-tuning to test\-time training\. arXiv preprint arXiv:2501\.09213\.

## Appendix A Additional Details on Data Preliminaries

### A\.1 Data Processing

This subsection records the data\-processing choices used to construct the longitudinal ADRD prediction cohort\. The pipeline starts from participant\-level first\-occurrence disease tables and structured UK Biobank\-derived sources \(Bycroft et al\., [2018](https://arxiv.org/html/2606.10279#bib.bib31)\): ADRD case and control tables, cancer first\-occurrence records, cognitive test factors, lifestyle and environmental factors, APOE genotypes, and ADRD polygenic risk scores\. Participant identifiers are cleaned before merging\. For genotype and polygenic risk score sources, identifiers are mapped to the common participant index used across the linked UK Biobank release\.

##### Feature harmonization\.

The lifestyle and environment table is converted into age\-stamped categorical features\. For each group of assessment columns, entries are coerced to numeric values; negative values and values in $[0.1, 0.9]$ are set to missing\. The remaining values are pooled across assessment instances and discretized at the empirical 10th and 90th percentiles\. The middle category is removed for these features, leaving lower\-tail and upper\-tail indicators\. One\-hot indicators are then assigned the participant’s age at the corresponding UK Biobank assessment instance when an instance marker is present\. Field names are mapped to human\-readable titles using the accompanying UK Biobank metadata\.
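The tail discretization can be sketched as follows\. This is a minimal illustration, not the pipeline code: the helper names, the inclusive percentile convention, and the choice to count values equal to a cut point as tail values are all assumptions\.

```python
import statistics
from typing import List, Optional


def clean_value(x: Optional[float]) -> Optional[float]:
    """Treat negative values and values in [0.1, 0.9] as missing."""
    if x is None or x < 0 or 0.1 <= x <= 0.9:
        return None
    return x


def tail_indicators(values: List[Optional[float]]) -> List[Optional[str]]:
    """Label each pooled value 'low', 'high', or None (missing or middle)."""
    cleaned = [clean_value(v) for v in values]
    observed = [v for v in cleaned if v is not None]
    # Nine decile cut points; the first and last are the 10th and 90th
    # percentiles. The 'inclusive' method is an assumed convention.
    deciles = statistics.quantiles(observed, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    labels: List[Optional[str]] = []
    for v in cleaned:
        if v is None:
            labels.append(None)
        elif v <= p10:
            labels.append("low")
        elif v >= p90:
            labels.append("high")
        else:
            labels.append(None)  # middle category is dropped
    return labels
```

In the described pipeline each retained indicator would then carry the participant's age at the assessment instance; that step is omitted here\.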

Genetic features are processed separately before being merged into the lifestyle table\. APOE columns are collapsed into carrier indicators \(`e2_carrier`, `e3_carrier`, and `e4_carrier`\)\. The ADRD polygenic risk score is binned by quantiles with cut points $(0, 0.10, 0.30, 0.60, 0.90, 1.00)$, producing five ordered risk categories\. APOE and polygenic risk score categories are one\-hot encoded, with active indicators assigned age 0 and inactive indicators left missing\. The lifestyle table is the anchor table, and APOE and polygenic risk score features are merged by cleaned participant ID\.
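A minimal sketch of the quantile binning and age\-0 one\-hot encoding, assuming rank\-based empirical quantiles with right\-closed bins\. The function names and the `prs_bin_k` column names are hypothetical\.

```python
from bisect import bisect_left
from typing import Dict, List, Optional

# Interior cut points from the text; 0 and 1.00 are the outer edges.
QUANTILE_EDGES = (0.10, 0.30, 0.60, 0.90)


def prs_categories(scores: List[float]) -> List[int]:
    """Map each score to an ordered category 0..4 by empirical quantile rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    categories = [0] * n
    for rank, i in enumerate(order, start=1):
        q = rank / n  # empirical quantile in (0, 1]
        # Number of interior edges strictly below q gives the bin index,
        # so bins are (0, 0.10], (0.10, 0.30], ..., (0.90, 1.00].
        categories[i] = bisect_left(QUANTILE_EDGES, q)
    return categories


def one_hot_age_zero(category: int, n_categories: int = 5) -> Dict[str, Optional[int]]:
    """Active indicator gets age 0; inactive indicators stay missing (None)."""
    return {f"prs_bin_{k}": (0 if k == category else None) for k in range(n_categories)}
```

Ties between equal scores are broken by input order in this sketch; a production pipeline would need an explicit tie rule\.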

##### Case construction\.

The ADRD case table is merged with cancer first\-occurrence features, cognitive test features, and the merged lifestyle/genetic feature table on the intersection of available participant IDs\. The target column is normalized to `adrd` and treated as the age at first ADRD onset\. Feature columns are converted to numeric values\. Any feature timestamp later than $\texttt{adrd} - 1$ is removed, giving a one\-year offset between the last eligible record and diagnosis\. Cases are kept only if they have at least one retained feature in $(\texttt{adrd} - 5, \texttt{adrd} - 1]$ and if ADRD onset is at age 50 or older\. The resulting case table is carried forward to matching and serialization\.
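The case filter can be expressed compactly\. The sketch below assumes each participant is a mapping from feature name to event age; `build_case` is a hypothetical helper that mirrors the three rules above\.

```python
import math
from typing import Dict, Optional


def build_case(features: Dict[str, Optional[float]], adrd: float,
               min_onset_age: float = 50.0) -> Optional[Dict[str, float]]:
    """Apply the one-year offset and the eligibility rules for a case.

    Returns the retained features, or None if the case is excluded.
    """
    cutoff = adrd - 1  # nothing later than one year before onset is kept
    retained = {
        name: age for name, age in features.items()
        if age is not None and not math.isnan(age) and age <= cutoff
    }
    # At least one retained feature must fall in (adrd - 5, adrd - 1].
    in_window = any(adrd - 5 < age <= cutoff for age in retained.values())
    if adrd < min_onset_age or not in_window:
        return None
    return retained
```

For example, with onset at age 70 a record at age 69\.5 is dropped, and the case is kept only if some record lies after age 65 and at or before age 69\.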

##### Matched control construction\.

Controls are drawn from participants without an ADRD onset after merging the same feature sources on the intersection of available participant IDs\. The control table is aligned to the case schema and assigned a missing `adrd` value\. We match controls to cases on sex and observation\-age distributions, following the matched disease\-prediction design of Shmatko et al\. \([2025](https://arxiv.org/html/2606.10279#bib.bib21)\)\. The purpose of the matching is to make the label depend on pre\-diagnostic health history rather than on follow\-up length\. Without this step, a model could partially separate cases from controls by the endpoint of the observed record\.

For cases, the observation cutoff is tied to the onset age\. All features after one year before ADRD onset are removed, preventing the model from using diagnostic or near\-diagnostic information\. We keep a case only if at least one retained feature lies in the interval from five years to one year before onset, and we restrict to onsets at age 50 or older\. Thus, a positive example asks the model to predict ADRD within five years from the observed history while leaving at least a one\-year gap before the recorded onset\.

For controls, we compute each case’s latest retained feature age and form a histogram using five\-year age bins\. For each bin, we select up to four controls per case among controls with the same sex and at least one feature timestamp in that bin\. A cutoff age is drawn, with random seed 42, from observed control feature ages in the bin; features after that cutoff are removed\. Selected controls are removed from the pool before later bins are processed\. This procedure yields the 1:4 case\-control structure used in the experiments while keeping the observed age distribution comparable between cases and controls\.
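The control\-matching loop can be sketched as below\. Several details are assumptions made for illustration: the record layout, the bin boundaries \(multiples of five\), drawing the cutoff from the selected control's own feature ages in the bin, and the use of a single `random.Random(42)` stream\.

```python
import random
from collections import Counter
from typing import Dict, List

BIN_WIDTH = 5  # five-year age bins
RATIO = 4      # up to four controls per case


def age_bin(age: float) -> int:
    return int(age // BIN_WIDTH)


def match_controls(cases: List[dict], controls: List[dict], seed: int = 42) -> List[dict]:
    """cases/controls: dicts with keys 'id', 'sex', 'features' (name -> age).

    Returns truncated copies of the selected controls.
    """
    rng = random.Random(seed)
    # Histogram of cases by sex and bin of the latest retained feature age.
    demand = Counter((c["sex"], age_bin(max(c["features"].values()))) for c in cases)
    pool: Dict[str, dict] = {c["id"]: c for c in controls}
    matched = []
    for (sex, b), n_cases in sorted(demand.items()):
        lo, hi = b * BIN_WIDTH, (b + 1) * BIN_WIDTH
        eligible = [c for c in pool.values()
                    if c["sex"] == sex
                    and any(lo <= a < hi for a in c["features"].values())]
        rng.shuffle(eligible)
        for ctrl in eligible[: RATIO * n_cases]:
            ages_in_bin = sorted(a for a in ctrl["features"].values() if lo <= a < hi)
            cutoff = rng.choice(ages_in_bin)  # assumed: drawn per control
            kept = {k: a for k, a in ctrl["features"].items() if a <= cutoff}
            matched.append({"id": ctrl["id"], "sex": sex,
                            "cutoff": cutoff, "features": kept})
            del pool[ctrl["id"]]  # selected controls leave the pool
    return matched
```

Removing selected controls from the pool before later bins are processed is what keeps each control from being matched twice\.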

##### Train/validation split and serialization\.

Cases and controls are shuffled and split separately into 90% training and 10% validation partitions, using seed 42 for cases and seed 123 for controls\. Column names are mapped to full labels when a matching label is available, and duplicate names introduced by this mapping are disambiguated\. Each row is serialized with three fields: `sex`, an ordered `medical history` dictionary of non\-missing event\-age pairs sorted by age, and the binary `ADRD` label\. Rows with finite `adrd` values receive label 1; rows with missing `adrd` values receive label 0\. As a final quality filter, examples with fewer than three medical\-history entries are removed from the training and validation partitions\. This rule is applied consistently across the serialized examples used in the fine\-tuning experiments\.
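A minimal sketch of the row serialization and the three\-entry filter\. The field names follow the text; the function name and the choice to apply the filter inside the serializer are assumptions\.

```python
import math
from typing import Dict, Optional


def serialize_row(sex: str, features: Dict[str, Optional[float]],
                  adrd: Optional[float], min_entries: int = 3) -> Optional[dict]:
    """Build one serialized example, or None if it fails the quality filter."""
    observed = [(name, age) for name, age in features.items()
                if age is not None and not math.isnan(age)]
    if len(observed) < min_entries:
        return None
    # Dicts preserve insertion order, so sorting by age yields an ordered history.
    history = dict(sorted(observed, key=lambda item: item[1]))
    label = 1 if adrd is not None and math.isfinite(adrd) else 0
    return {"sex": sex, "medical history": history, "ADRD": label}
```

A finite onset age yields label 1 and a missing one yields label 0, matching the rule stated above\.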

## Appendix B Additional Details on Metrics

This section formalizes the evaluation metrics and hypothesis tests used in the controlled experiments\. We consider a binary prediction problem with test examples $\{(Y_i, S_i)\}_{i=1}^{n}$, where $Y_i \in \{0, 1\}$ is the observed label and $S_i \in \mathbb{R}$ is a scalar model score, with larger scores indicating higher predicted risk\. Let $\mathcal{P} = \{i : Y_i = 1\}$ and $\mathcal{N} = \{i : Y_i = 0\}$ denote the empirical positive and negative sets, with $m = |\mathcal{P}|$ and $n_0 = |\mathcal{N}|$\. For a threshold $\tau$, the induced classifier is $\widehat{Y}_i(\tau) = \mathbb{I}\{S_i \geq \tau\}$\. All threshold\-dependent metrics use thresholds selected on validation data and then fixed before evaluation on the held\-out test set\.

### B\.1 Evaluation Metrics

We report ROC\-AUC, F1 score, and recall using their standard definitions\. ROC\-AUC is the threshold\-free ranking metric $\Pr(S^{+} > S^{-}) + \frac{1}{2}\Pr(S^{+} = S^{-})$, estimated by the usual empirical Mann–Whitney statistic over positive–negative score pairs \(Hanley and McNeil, [1982](https://arxiv.org/html/2606.10279#bib.bib6); Bamber, [1975](https://arxiv.org/html/2606.10279#bib.bib5)\)\. At a fixed validation\-selected threshold $\tau$, F1 is the harmonic mean of precision and recall, following the standard F\-measure formulation \(van Rijsbergen, [1979](https://arxiv.org/html/2606.10279#bib.bib8)\), equivalently $2\,\mathrm{TP}(\tau) / (2\,\mathrm{TP}(\tau) + \mathrm{FP}(\tau) + \mathrm{FN}(\tau))$, and recall is $\mathrm{TP}(\tau) / (\mathrm{TP}(\tau) + \mathrm{FN}(\tau))$\. These metrics are included to summarize global discrimination, the thresholded precision–recall trade\-off, and sensitivity to positive cases, respectively \(Fawcett, [2006](https://arxiv.org/html/2606.10279#bib.bib7)\)\.
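These definitions translate directly into code\. The sketch below is a plain\-Python illustration with hypothetical function names; it uses the quadratic pairwise form of the Mann–Whitney statistic for clarity rather than speed\.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Empirical Pr(S+ > S-) + 0.5 * Pr(S+ = S-) over positive-negative pairs."""
    pos: List[float] = [s for y, s in zip(labels, scores) if y == 1]
    neg: List[float] = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    total = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
                for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


def f1_and_recall(labels: Sequence[int], scores: Sequence[float],
                  tau: float) -> Tuple[float, float]:
    """F1 and recall of the classifier I{S >= tau} at a fixed threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= tau)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= tau)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < tau)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return f1, recall
```

Returning 0 when a denominator is empty is a convention chosen for this sketch, not something specified in the text\.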

##### PR\-AUC\.

For threshold $\tau$, define

$$\widehat{P}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FP}(\tau)}, \qquad \widehat{R}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FN}(\tau)}.$$

The precision–recall area under the curve \(PR\-AUC\) is the area under precision as a function of recall as the threshold is swept over score values\. Empirically, after sorting examples by decreasing score and evaluating the right\-continuous step curve at the distinct recall values $0 = r_0 < r_1 < \cdots < r_K$, we use

$$\widehat{\mathrm{PR\text{-}AUC}} = \sum_{k=1}^{K} (r_k - r_{k-1})\, p_k,$$

where $p_k$ is the precision after including all predictions up to the $k$th recall level \(Davis and Goadrich, [2006](https://arxiv.org/html/2606.10279#bib.bib9)\)\. PR\-AUC is prevalence\-sensitive and is particularly informative when the positive class is rare, since it directly penalizes false positives among high\-scoring predictions \(Saito and Rehmsmeier, [2015](https://arxiv.org/html/2606.10279#bib.bib10)\)\.
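The step\-sum estimator can be sketched as follows\. The function name is hypothetical, and grouping tied scores into a single threshold step is an assumption about how ties are handled\.

```python
from typing import Sequence


def pr_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve."""
    m = sum(labels)
    if m == 0:
        raise ValueError("PR-AUC needs at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Include every example tied at this score before evaluating the step.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        recall = tp / m
        precision = tp / (tp + fp)
        # Thresholds that do not change recall contribute zero width.
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area
```

Only steps that raise recall add area, so the sum runs over the distinct recall levels as in the formula above\.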

### B\.2 Hypothesis Testing

For each controlled comparison, the null hypothesis is equality of the population\-level metric between two experimental conditions:

$$H_0 : \Delta = \theta_a - \theta_b = 0, \qquad H_1 : \Delta \neq 0.$$

All tests are paired because every condition in the grid is evaluated on the same held\-out examples, with all non\-target experimental factors controlled\.

##### Paired DeLong test for ROC\-AUC\.

For ROC\-AUC, let $\widehat{\Delta} = \widehat{\mathrm{AUC}}_a - \widehat{\mathrm{AUC}}_b$\. Because ROC\-AUC is a two\-sample U\-statistic, the paired DeLong test estimates the covariance of the two AUC estimates from their shared positive and negative examples \(DeLong et al\., [1988](https://arxiv.org/html/2606.10279#bib.bib11)\)\. Define

$$\phi(s_i, s_j) = \mathbb{I}(s_i > s_j) + \frac{1}{2}\,\mathbb{I}(s_i = s_j).$$

For condition $r \in \{a, b\}$, the positive\- and negative\-class placement values are

$$V_{r,i}^{(1)} = \frac{1}{n_0} \sum_{j \in \mathcal{N}} \phi(S_{r,i}, S_{r,j}), \quad i \in \mathcal{P}, \qquad V_{r,j}^{(0)} = \frac{1}{m} \sum_{i \in \mathcal{P}} \phi(S_{r,i}, S_{r,j}), \quad j \in \mathcal{N}.$$

Let $\widehat{\theta} = (\widehat{\mathrm{AUC}}_a, \widehat{\mathrm{AUC}}_b)^{\mathrm{\scriptscriptstyle T}}$ and estimate the paired DeLong covariance matrix by

$$\widehat{\Sigma}_{rs} = \frac{\widehat{\mathrm{Cov}}_{\mathcal{P}}(V_r^{(1)}, V_s^{(1)})}{m} + \frac{\widehat{\mathrm{Cov}}_{\mathcal{N}}(V_r^{(0)}, V_s^{(0)})}{n_0}, \qquad r, s \in \{a, b\}.$$

With contrast vector $c = (1, -1)$, the estimated variance of the AUC difference is $\widehat{\mathrm{Var}}(\widehat{\Delta}) = c^{\mathrm{\scriptscriptstyle T}} \widehat{\Sigma} c$\. We then form the Wald statistic

$$Z = \frac{\widehat{\Delta}}{\sqrt{\widehat{\mathrm{Var}}(\widehat{\Delta})}}.$$

Under $H_0$ and standard regularity conditions for U\-statistics, $Z$ is asymptotically standard normal, yielding the two\-sided $p$\-value

$$p_{\mathrm{DeLong}} = 2\{1 - \Phi(|Z|)\}.$$

This is the primary hypothesis test for ROC\-AUC differences in the controlled experiment table\.
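The paired DeLong computation can be sketched in plain Python\. This is an illustrative implementation of the formulas in this subsection with hypothetical function names; it uses the unbiased sample covariance and the quadratic pairwise form, so it is suited to small examples rather than large test sets\.

```python
import math
from statistics import NormalDist
from typing import List, Sequence, Tuple


def _phi(a: float, b: float) -> float:
    return 1.0 if a > b else 0.5 if a == b else 0.0


def _cov(x: List[float], y: List[float]) -> float:
    """Unbiased sample covariance (requires at least two observations)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def paired_delong(labels: Sequence[int], scores_a: Sequence[float],
                  scores_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (AUC_a - AUC_b, Z, two-sided p-value)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    m, n0 = len(pos), len(neg)
    v1, v0, auc = {}, {}, {}
    for name, s in (("a", scores_a), ("b", scores_b)):
        # Placement values for positives and negatives under condition `name`.
        v1[name] = [sum(_phi(s[i], s[j]) for j in neg) / n0 for i in pos]
        v0[name] = [sum(_phi(s[i], s[j]) for i in pos) / m for j in neg]
        auc[name] = sum(v1[name]) / m

    def sigma(r: str, s: str) -> float:
        return _cov(v1[r], v1[s]) / m + _cov(v0[r], v0[s]) / n0

    # c^T Sigma c with c = (1, -1).
    var = sigma("a", "a") - 2 * sigma("a", "b") + sigma("b", "b")
    if var <= 0:
        raise ValueError("estimated variance is not positive; Z is undefined")
    delta = auc["a"] - auc["b"]
    z = delta / math.sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return delta, z, p
```

Swapping the two conditions flips the sign of the difference and of $Z$ but leaves the two\-sided $p$\-value unchanged\.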

##### Paired bootstrap tests for PR\-AUC, F1 score, and recall\.

For PR\-AUC, F1 score, and recall, we use a paired nonparametric bootstrap at the patient/example level because these metrics are nonsmooth or nonlinear functionals of the empirical distribution, and F1 score and recall also depend on a fixed threshold \(Efron and Tibshirani, [1993](https://arxiv.org/html/2606.10279#bib.bib12)\)\. For bootstrap replicate $b = 1, \ldots, B$, draw indices $I_1^{(b)}, \ldots, I_n^{(b)}$ independently from the empirical distribution on $\{1, \ldots, n\}$, reuse the same resampled indices for all experimental conditions, and compute the paired contrast $\widehat{\Delta}^{(b)} = \widehat{\theta}^{(b)}_a - \widehat{\theta}^{(b)}_b$ on the resampled test set\. We report the bootstrap standard error, percentile confidence interval, and a two\-sided sign\-based bootstrap $p$\-value

$$p_{\mathrm{boot}} = 2 \min\left\{ \frac{1 + \sum_{b=1}^{B} \mathbb{I}(\widehat{\Delta}^{(b)} \leq 0)}{B + 1},\; \frac{1 + \sum_{b=1}^{B} \mathbb{I}(\widehat{\Delta}^{(b)} \geq 0)}{B + 1} \right\},$$

truncated at one if necessary\. This test asks whether the paired resampling distribution of the metric difference is centered away from zero, while sharing resampled indices preserves the correlation induced by common test examples\.
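A minimal sketch of the paired bootstrap, using recall at a fixed threshold as the example metric\. The function names, the default number of replicates, the seed, and the simple index rule for the percentile interval are all assumptions made for illustration\.

```python
import random
import statistics
from typing import Callable, Sequence


def recall_at(tau: float) -> Callable[[Sequence[int], Sequence[float]], float]:
    """Return a metric computing recall at the fixed threshold tau."""
    def metric(labels: Sequence[int], scores: Sequence[float]) -> float:
        positives = sum(labels)
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= tau)
        return tp / positives if positives else 0.0
    return metric


def paired_bootstrap(labels, scores_a, scores_b, metric,
                     n_boot: int = 1000, seed: int = 0) -> dict:
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    for _ in range(n_boot):
        # One set of resampled indices is shared by both conditions.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        deltas.append(metric(y, a) - metric(y, b))
    le = sum(1 for d in deltas if d <= 0)
    ge = sum(1 for d in deltas if d >= 0)
    p = min(1.0, 2 * min((1 + le) / (n_boot + 1), (1 + ge) / (n_boot + 1)))
    ordered = sorted(deltas)
    ci = (ordered[int(0.025 * (n_boot - 1))], ordered[int(0.975 * (n_boot - 1))])
    return {"se": statistics.stdev(deltas), "ci": ci, "p_value": p}
```

When the two conditions produce identical scores, every replicate difference is zero and the truncation at one applies\.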

##### Multiple comparisons\.

The full experiment grid contains many controlled comparisons\. When making claims across multiple factors, we interpret isolated small $p$\-values cautiously and report effect sizes together with uncertainty estimates\. If a single family of confirmatory comparisons is specified, the corresponding $p$\-values can be adjusted using Holm–Bonferroni control of the family\-wise error rate \(Holm, [1979](https://arxiv.org/html/2606.10279#bib.bib13)\) or Benjamini–Hochberg control of the false discovery rate \(Benjamini and Hochberg, [1995](https://arxiv.org/html/2606.10279#bib.bib14)\)\. In all cases, statistical significance is interpreted together with the magnitude and uncertainty of the estimated effect\.
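Both adjustments are short enough to sketch directly\. The functions below are illustrative implementations of the standard Holm step\-down and Benjamini–Hochberg step\-up adjusted $p$\-values; the function names are hypothetical\.

```python
from typing import List, Sequence


def holm_adjust(pvals: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values (family-wise error rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), enforce monotonicity.
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted
```

A comparison is then declared significant at level $\alpha$ when its adjusted $p$\-value is at most $\alpha$\.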

## Appendix C Training

Table 2: SFT controlled experiment grid\. Rows list factor levels rather than individual configurations; the SFT sweep crosses all listed data, training, inference, and metric factors, yielding 504 configurations\.

### C\.1 Additional Details

All SFT experiments used the same longitudinal ADRD prediction task and the same input representation\. Each example contained the individual’s sex and a `medical history` object, where each non\-missing event was represented as an `event: age` pair\. The output field, `ADRD`, stored the training target\. We compared three rationale formats\. In no\-rationale, the target contained an empty <THINK\> block followed by the binary label\. In free\-rationale, the empty block was replaced by a concise biomedical rationale\. In stepwise\-rationale, the generated rationale was constrained to three steps: identifying relevant events, aggregating risk, and concluding whether the record supported ADRD within five years\.

##### SFT prompt\.

The chat prompt used for SFT and for SFT evaluation was shared across all rationale conditions\. The user message inserted the structured record after the first line shown below\. The assistant completion was taken directly from the `ADRD` field of the corresponding training record\.

System:

This will not be used to make decisions about a patient\. This is for research purposes only\.

You are a healthcare risk assessment assistant\.

You will be given a single individual in JSON format containing \{\{event: age\}\} pairs\. Events may include sex, laboratory results, and past diseases/diagnoses encoded with ICD\-10 codes\.

Your task is to predict whether the individual will develop Alzheimer’s disease or related dementias \(ADRD\) within five years after the last recorded age\.

For this task, ADRD is defined by the following ICD\-10 codes: F00 \(Dementia in Alzheimer’s disease\), F01 \(Vascular dementia\), F03 \(Unspecified dementia\), G30 \(Alzheimer’s disease\), and G31 \(Other degenerative diseases of the nervous system, not elsewhere classified\)\.

You MUST follow the output format exactly and output nothing else\.

OUTPUT FORMAT:

<THINK\>

\{THINKING\_STEPS\}

</THINK\>

Prediction \(0 or 1\): \{FINAL\_LABEL\}

Where:

\- \{THINKING\_STEPS\} contains your step\-by\-step reasoning\.

\- \{FINAL\_LABEL\} is exactly one character: 0 or 1, where 1 indicates ADRD within 5 years and 0 indicates no ADRD within 5 years\.

User:

Here is the input for the individual:

\{INPUT\_JSON\}

Return the output in the required format\.

##### Rationale generation\.

The rationales used in free\-rationale and stepwise\-rationale were generated from the original training labels before SFT\. The reasoning generator was given the structured patient record, the last observed age, and the ground\-truth `ADRD_within_5y` label\. It was instructed to use only evidence present in the record and not to invent symptoms, imaging, medications, family history, genetic findings, or lifestyle factors\. Long records were capped at 200 events when constructing the reasoning\-generation prompt\. The default reasoning\-generation model was GPT\-5\.2 with medium reasoning effort and a 450\-token output limit\. The generated text was then wrapped as `<THINK> reasoning </THINK>` followed by `Prediction (0 or 1): label`\.
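The wrapping step, and the matching extraction of the label from a completion, can be sketched as follows\. The exact whitespace and line breaks of the target are assumptions, and `build_target` and `parse_prediction` are hypothetical helpers\.

```python
import re
from typing import Optional


def build_target(label: int, rationale: str = "") -> str:
    """Wrap a rationale (empty for no-rationale) and append the label line."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return f"<THINK>\n{rationale.strip()}\n</THINK>\nPrediction (0 or 1): {label}"


# Matches the final label line of a completion in the required format.
_PREDICTION = re.compile(r"Prediction\s*\(0 or 1\):\s*([01])\s*$")


def parse_prediction(completion: str) -> Optional[int]:
    """Return the predicted label, or None if the format is not followed."""
    match = _PREDICTION.search(completion.strip())
    return int(match.group(1)) if match else None
```

With an empty rationale the same function produces the no\-rationale target, so all three conditions share one output format\.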

For free\-rationale, the generator prompt was:

You are a clinical epidemiology assistant helping generate training rationales\.

You will be given a single individual in JSON format containing \{\{event: age\}\} pairs\. Events may include sex, laboratory results, and past diseases/diagnoses encoded with ICD\-10 codes\. Your task is to write a medically plausible reasoning narrative explaining why the ground\-truth label ADRD\_within\_5y is \{label\} for this individual, based ONLY on the provided events\.

Definition: ADRD is defined by the following ICD\-10 codes:

\- F00 \(Dementia in Alzheimer’s disease\)

\- F01 \(Vascular dementia\)

\- F03 \(Unspecified dementia\)

\- G30 \(Alzheimer’s disease\)

\- G31 \(Other degenerative diseases of the nervous system, not elsewhere classified\)

Inputs:

\- Patient record: JSON \{\{event: age\}\} pairs

\- Last observed age: \{last\_age\}

\- Ground\-truth ADRD\_within\_5y label: \{label\}

Instructions:

\- Use ONLY evidence present in the record; do NOT invent symptoms, imaging, medications, family history, genetics, or lifestyle\.

\- You may interpret an ICD\-10 code into a condition ONLY if you are highly confident; otherwise keep it as the code and reason generally\.

\- Focus on biomedical plausibility: age effects; vascular/metabolic risks \(e\.g\., hypertension/diabetes/dyslipidemia\); cerebrovascular disease; chronic kidney disease; depression; traumatic brain injury; sleep disorders; and other comorbidities if present in the record\.

\- If the label is 1, explain how the observed pattern could plausibly precede ADRD within about 5 years after the last age\.

\- If the label is 0, explain why the record lacks strong indicators of near\-term ADRD, or why risk appears lower, while acknowledging uncertainty\.

\- Keep it coherent and concise: 6\-8 sentences, one paragraph\. Avoid lists unless necessary\.

Output format constraints:

\- Output ONLY the reasoning narrative text\.

\- Do NOT output the label\.

\- Do NOT output "Prediction"\.

\- Do NOT include <THINK\> tags\.

\- Do NOT add any headings or extra metadata\.

Patient record \(JSON\):

\{patient\_json\}

For stepwise\-rationale, the generator prompt used the same task definition, but replaced the open narrative instruction with the following three\-step structure:

Subtasks \(write them as Step 1 / Step 2 / Step 3\):

Step 1: Identify relevant events in the record that correlate with ADRD risk using ONLY what appears\. If you are not confident interpreting an ICD\-10 code, keep it as the code and reason generally\.

Step 2: Aggregate risk: weigh age, sex \(if present\), and the identified events\. Explain why they increase/decrease near\-term ADRD risk\. If evidence is weak/ambiguous, explicitly note uncertainty\.

Step 3: Conclude whether ADRD is expected within 5 years after the last recorded age, and make the conclusion consistent with label=\{label\}\.

Strict rules:

\- Use ONLY evidence present in the JSON; do NOT invent symptoms, imaging, medications, family history, genetics, or lifestyle factors\.

\- Structure your output as exactly three sections labeled "Step 1:", "Step 2:", and "Step 3:"\.

\- Write 1\-2 sentences per step, so 3\-6 sentences total\.

\- Keep the reasoning coherent and biomedically plausible, explicitly tying statements to events/ages in the record\.

\- Output ONLY the three\-step thinking text, with no extra headers, no metadata, no label, and no <THINK\> tags\.

##### Fine\-tuning and evaluation\.

The SFT sweep used Qwen/Qwen3\-8B and Qwen/Qwen2\.5\-7B\-Instruct, the three rationale formats above, a cosine learning\-rate schedule, and learning rates of $5\times 10^{-5}$, $1.5\times 10^{-4}$, $2.5\times 10^{-4}$, and $3.5\times 10^{-4}$\. Training used QLoRA with 4\-bit NF4 quantization, LoRA rank 16, LoRA alpha 32, dropout 0\.05, target modules q\_proj, k\_proj, v\_proj, o\_proj, up\_proj, down\_proj, and gate\_proj, batch size one, gradient accumulation over eight steps, three epochs, warmup ratio 0\.03, maximum sequence length 2048, and completion\-only loss\. For SFT evaluation, we used the same held\-out validation records for all rationale conditions\. The score for ROC\-AUC and PR\-AUC was the sigmoid of the generated\-label logit difference, $\sigma(\ell_{1}-\ell_{0})$, at the generation step where the final 0 or 1 label was produced\. F1 score and recall used the parsed generated label when it was available\.
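As a minimal sketch of the scoring rule above, the function below computes $\sigma(\ell_{1}-\ell_{0})$ from the two label-token logits. The function name and the example logits are illustrative assumptions; extracting the logits from the model at the label-generation step is not shown.

```python
import math


def label_score(logit_1: float, logit_0: float) -> float:
    """Return sigmoid(logit_1 - logit_0), computed in a numerically stable way."""
    diff = logit_1 - logit_0
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


if __name__ == "__main__":
    # Hypothetical logits for the tokens "1" and "0" at the label position.
    l1, l0 = 2.0, 0.5
    score = label_score(l1, l0)
    # The same value is the softmax probability of "1" restricted to {"0", "1"}.
    two_way_softmax = math.exp(l1) / (math.exp(l1) + math.exp(l0))
    print(score, two_way_softmax)
```

The score depends only on the logit difference, so it ignores probability mass assigned to any other vocabulary token at that step.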

### C\.2 Additional Results

For rationale comparisons, the non\-ROC metrics followed the same overall direction as the main ROC\-AUC result\. No\-rationale had higher mean PR\-AUC than free\-rationale \(0\.504 versus 0\.313; paired $t$\-test, $P=1.79\times 10^{-49}$\) and stepwise\-rationale \(0\.504 versus 0\.306; $P=1.92\times 10^{-52}$\)\. No\-rationale also had higher mean F1 score than free\- and stepwise\-rationale \(0\.332 versus 0\.284 and 0\.291; $P=1.03\times 10^{-4}$ and $P=4.18\times 10^{-4}$\)\. Mean recall was higher for no\-rationale than for stepwise\-rationale \(0\.256 versus 0\.228; $P=3.86\times 10^{-3}$\), while the recall difference between no\-rationale and free\-rationale was smaller and not statistically significant \(0\.256 versus 0\.237; $P=0.068$\)\. Figure [8](https://arxiv.org/html/2606.10279#A3.F8) summarizes these rationale\-format comparisons across ROC\-AUC, PR\-AUC, F1 score, and recall, with the right\-column panels showing the same contrasts within each training sample size\.

Among the best SFT configurations selected by ROC\-AUC, the no\-rationale configuration also had better non\-ROC metrics than the best free\- and stepwise\-rationale configurations\. In the no\-rationale versus free\-rationale comparison, PR\-AUC was 0\.701 versus 0\.380 \($P=9.99\times 10^{-4}$\), F1 score was 0\.602 versus 0\.436 \($P=9.99\times 10^{-4}$\), and recall was 0\.489 versus 0\.408 \($P=0.085$\)\. In the no\-rationale versus stepwise\-rationale comparison, PR\-AUC was 0\.708 versus 0\.361, F1 score was 0\.607 versus 0\.353, and recall was 0\.495 versus 0\.293; all three differences favored no\-rationale \($P=9.99\times 10^{-4}$ for PR\-AUC, F1 score, and recall\)\. Figure [7](https://arxiv.org/html/2606.10279#A3.F7) gives two validation examples from the no\-rationale and free\-rationale SFT comparison, illustrating how generated rationales can over\-weight nonspecific risk in a control or under\-weight distributed risk in an ADRD case\.

Figure 7: Divergent validation examples from the best no\-rationale and free\-rationale SFT configurations\. No\-rationale SFT is correct in both cases, whereas free\-rationale SFT over\-weights nonspecific risk in a control and under\-weights distributed risk in an ADRD case\.

For the base\-model comparison, the non\-ROC metrics did not show a meaningful advantage for Qwen3\-8B over Qwen2\.5\-7B\-Instruct\. Across matched SFT settings, mean PR\-AUC was 0\.3737 for Qwen3\-8B and 0\.3744 for Qwen2\.5\-7B\-Instruct \($P=0.916$\), mean F1 score was 0\.3018 versus 0\.3031 \($P=0.874$\), and mean recall was 0\.235 versus 0\.245 \($P=0.152$\)\. Among the best SFT configurations selected by ROC\-AUC, Qwen3\-8B had PR\-AUC 0\.701, F1 score 0\.594, and recall 0\.471, compared with PR\-AUC 0\.679, F1 score 0\.563, and recall 0\.437 for Qwen2\.5\-7B\-Instruct; none of these paired bootstrap comparisons was statistically significant \($P=0.125$, $P=0.155$, and $P=0.169$\)\. Figure [9](https://arxiv.org/html/2606.10279#A3.F9) shows the corresponding base\-model distributions for PR\-AUC, F1 score, and recall, both marginally and within each training sample size\.
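The best-configuration comparisons above are described as paired bootstrap tests, but the exact procedure is not given in this section. The sketch below shows one common variant under stated assumptions: validation records are resampled with replacement, both configurations are scored on the same resample, and the p-value is the add-one-corrected fraction of resamples in which configuration A is not better than B. The function names, the use of recall as the metric, the resample count, and the toy predictions are all illustrative.

```python
import random


def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


def paired_bootstrap_p(y_true, pred_a, pred_b, metric, n_boot=1000, seed=0):
    """One-sided paired bootstrap p-value for 'A is better than B'.

    Assumption: add-one correction, p = (count + 1) / (n_boot + 1).
    """
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_boot):
        # The same resampled indices are used for both configurations (pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if metric(yt, a) - metric(yt, b) <= 0:
            not_better += 1
    return (not_better + 1) / (n_boot + 1)


if __name__ == "__main__":
    y = [1] * 10 + [0] * 10          # toy labels
    perfect = list(y)                # toy configuration A
    all_negative = [0] * 20          # toy configuration B
    print(paired_bootstrap_p(y, perfect, all_negative, recall))
```

Under this convention with 1,000 resamples, the smallest attainable value is $1/1001 \approx 9.99\times 10^{-4}$. The source does not state its resample count or p-value convention, so this is only an illustration of how such a floor can arise.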

![Refer to caption](https://arxiv.org/html/2606.10279v1/x7.png) \(a\) Rationale, ROC\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x8.png) \(b\) Rationale and sample size, ROC\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x9.png) \(c\) Rationale, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x10.png) \(d\) Rationale and sample size, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x11.png) \(e\) Rationale, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x12.png) \(f\) Rationale and sample size, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x13.png) \(g\) Rationale, Recall
![Refer to caption](https://arxiv.org/html/2606.10279v1/x14.png) \(h\) Rationale and sample size, Recall

Figure 8: SFT performance by rationale format and by rationale format crossed with training sample size\. Panel A repeats the rationale\-format ROC\-AUC summary from Figure [2](https://arxiv.org/html/2606.10279#S3.F2)A for comparison with the appendix sample\-size and non\-ROC panels\.

![Refer to caption](https://arxiv.org/html/2606.10279v1/x15.png) \(a\) Base model, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x16.png) \(b\) Base model and sample size, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x17.png) \(c\) Base model, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x18.png) \(d\) Base model and sample size, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x19.png) \(e\) Base model, Recall
![Refer to caption](https://arxiv.org/html/2606.10279v1/x20.png) \(f\) Base model and sample size, Recall

Figure 9: Additional SFT PR\-AUC, F1 score, and Recall by base model and by base model crossed with training sample size\. The interaction panels group sample size first and then base model\.

## Appendix D Training\-Free

Table 3: Training\-free zero\-shot controlled inference grid\. Rows list factor levels rather than unique configurations; the full grid crosses the prompt setting, base model, and decoding configuration\.

### D\.1 Additional Details

Training\-free experiments evaluated the same held\-out ADRD prediction task without parameter updates\. We evaluated Qwen2\.5\-7B\-Instruct \(qwen/qwen\-2\.5\-7b\-instruct\) and Qwen3\-8B \(qwen/qwen3\-8b\) through the OpenRouter chat\-completion API\. We report only the zero\-shot settings in the main analysis\. Few\-shot evaluations are reported separately and are excluded from the zero\-shot figures and tests\.

##### Evaluation set and decoding\.

For each evaluation configuration, the validation records were shuffled with seed 42, the first $20\%+1$ records were retained, and SLURM array jobs evaluated non\-overlapping chunks of 15 records\. Deterministic configurations used temperature 0\. Sampling configurations crossed two sampling methods, nucleus sampling with top\_p=0\.9 and top\-$k$ sampling with top\_k=50, with temperatures 0\.1, 0\.5, and 1\.0\. Each API call allowed up to 1024 generated tokens and was retried up to five times\. For Qwen3\-8B, the provider\-side reasoning flag was enabled in both prompt conditions; therefore, the zero\-shot with CoT condition in the paper refers to whether the visible output format asked for a <THINK\> block, not to the provider\-side flag\. The model output was parsed from the numeric value after Probability:\. If that string was absent, the parser also accepted a bare numeric output or the last non\-empty line when it was a number between 0 and 1\.
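The subset selection and chunking described above can be sketched as follows. This is an illustration under assumptions rather than the authors' script: Python's `random` module stands in for whatever shuffler was actually used, "20% + 1" is read as the floor of 20% of the record count plus one, and the function names are hypothetical.

```python
import random


def select_eval_subset(records, fraction=0.20, seed=42):
    """Shuffle with a fixed seed and keep the first floor(fraction * N) + 1 records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    keep = int(len(shuffled) * fraction) + 1  # assumption: floor, then add one
    return shuffled[:keep]


def chunk(records, size=15):
    """Split records into consecutive, non-overlapping chunks of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


if __name__ == "__main__":
    validation_ids = list(range(100))  # toy stand-in for validation records
    subset = select_eval_subset(validation_ids)
    chunks = chunk(subset)
    print(len(subset), [len(c) for c in chunks])
```

In a SLURM array job, each task would typically process `chunks[task_id]`, with `task_id` read from the `SLURM_ARRAY_TASK_ID` environment variable; fixing the seed keeps the chunks identical across tasks.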

##### Zero\-shot prompt\.

The zero\-shot prompt asked the model to return only a numeric ADRD probability:

System:

This will not be used to make decisions about a patient\. This is for research purposes only\.

You are a healthcare risk assessment assistant\.

You will be given a single individual in JSON format containing \{\{event:age\}\} pairs\. Events may include sex, laboratory results, and past diseases/diagnoses encoded with ICD\-10 codes\.

Your task is to predict whether the individual will develop Alzheimer’s disease or related dementias \(ADRD\) within five years after the last recorded age\.

For this task, ADRD is defined by the following ICD\-10 codes: F00 \(Dementia in Alzheimer’s disease\), F01 \(Vascular dementia\), F03 \(Unspecified dementia\), G30 \(Alzheimer’s disease\), and G31 \(Other degenerative diseases of the nervous system, not elsewhere classified\)\.

You MUST follow the output format exactly and output nothing else\.

OUTPUT FORMAT:

Probability: \{PROBABILITY\}

Where:

\- \{PROBABILITY\} is exactly one number between 0 and 1\.

User:

Here is the input for the individual:

\{INPUT\_JSON\}

Return the output in the required format\.

##### Zero\-shot with CoT prompt\.

The zero\-shot with CoT prompt used the same system and user content, but required the model to include step\-by\-step reasoning before the probability:

System:

This will not be used to make decisions about a patient\. This is for research purposes only\.

You are a healthcare risk assessment assistant\.

You will be given a single individual in JSON format containing \{\{event:age\}\} pairs\. Events may include sex, laboratory results, and past diseases/diagnoses encoded with ICD\-10 codes\.

Your task is to predict whether the individual will develop Alzheimer’s disease or related dementias \(ADRD\) within five years after the last recorded age\.

For this task, ADRD is defined by the following ICD\-10 codes: F00 \(Dementia in Alzheimer’s disease\), F01 \(Vascular dementia\), F03 \(Unspecified dementia\), G30 \(Alzheimer’s disease\), and G31 \(Other degenerative diseases of the nervous system, not elsewhere classified\)\.

You MUST follow the output format exactly and output nothing else\.

OUTPUT FORMAT:

<THINK\>

\{THINKING\_STEPS\}

</THINK\>

Probability: \{PROBABILITY\}

Where:

\- \{THINKING\_STEPS\} contains your step\-by\-step reasoning\.

\- \{PROBABILITY\} is exactly one number between 0 and 1\.

User:

Here is the input for the individual:

\{INPUT\_JSON\}

Return the output in the required format\.

The parsed probability was used directly as the score for ROC\-AUC and PR\-AUC\. For thresholded metrics, probabilities at least 0\.5 were counted as predicted ADRD cases\.
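A minimal sketch of the parsing and thresholding rules described in this appendix is given below. The function names and the number regex are assumptions. Two choices in particular are not specified in the text: the sketch takes the last `Probability:` occurrence (so a mention inside a `<THINK>` block does not shadow the final answer), and it applies the 0 to 1 range check to both fallback cases.

```python
import re

# Assumed number pattern: optional sign, decimal digits, optional exponent.
_NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"


def parse_probability(text: str):
    """Return the parsed probability, or None if nothing usable is found."""
    matches = re.findall(r"Probability:\s*(" + _NUMBER + r")", text)
    if matches:
        return float(matches[-1])  # assumption: the last occurrence wins
    # Fallback: a bare number, or the last non-empty line, if it lies in [0, 1].
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and re.fullmatch(_NUMBER, lines[-1]):
        value = float(lines[-1])
        if 0.0 <= value <= 1.0:
            return value
    return None


def predict_label(probability: float, threshold: float = 0.5) -> int:
    """Probabilities at or above the threshold count as predicted ADRD cases."""
    return int(probability >= threshold)


if __name__ == "__main__":
    print(parse_probability("<THINK>\nsteps\n</THINK>\nProbability: 0.37"))
    print(predict_label(0.5), predict_label(0.49))
```

Returning `None` for unparseable outputs leaves the handling of failed generations (retry, exclusion, or a default score) to the caller, since the source does not describe it here.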

### D\.2 Additional Results

For complementary zero\-shot metrics, zero\-shot with CoT did not provide a uniform improvement over the zero\-shot prompt\. Across matched model and decoding settings, mean PR\-AUC was lower with zero\-shot with CoT than with zero\-shot prompting \(0\.351 versus 0\.391; paired $t$\-test, $P=0.0134$\)\. At the fixed 0\.5 decision threshold, zero\-shot with CoT increased mean F1 score from 0\.005 to 0\.055 \($P=0.0084$\) and mean recall from 0\.0026 to 0\.0336 \($P=0.0084$\), although the absolute recall values remained low\. Figure [10](https://arxiv.org/html/2606.10279#A4.F10) reports these zero\-shot prompt and base\-model comparisons for ROC\-AUC, PR\-AUC, F1 score, and recall\.

The Qwen2\.5\-7B\-Instruct subgroup showed why the non\-ROC metrics are important for interpreting the zero\-shot with CoT effect\. Within Qwen2\.5\-7B\-Instruct, zero\-shot with CoT increased mean ROC\-AUC from 0\.538 to 0\.565 across matched decoding settings \(paired $t$\-test, $P=0.041$\), but decreased mean PR\-AUC from 0\.383 to 0\.338 \($P=0.0168$\)\. F1 score and recall moved in the opposite direction from PR\-AUC because zero\-shot with CoT produced more positive predictions at the fixed 0\.5 threshold: mean F1 score increased from 0\.010 to 0\.111 and mean recall increased from 0\.005 to 0\.067\. Thus, the Qwen2\.5\-7B\-Instruct result should not be interpreted as a uniform improvement from zero\-shot with CoT; it improved broad ranking by ROC\-AUC while reducing precision\-recall performance and shifting the fixed\-threshold operating point\.

For the base\-model comparison, Qwen3\-8B did not significantly improve PR\-AUC over Qwen2\.5\-7B\-Instruct across matched zero\-shot settings \(0\.381 versus 0\.361; paired $t$\-test, $P=0.209$\)\. Qwen3\-8B had lower F1 score and recall than Qwen2\.5\-7B\-Instruct \(F1 score 0\.000 versus 0\.060, $P=0.0023$; recall 0\.000 versus 0\.036, $P=0.0029$\)\. This thresholded limitation was driven by the score scale rather than by the complete absence of ranking signal: Qwen3\-8B produced only 0–2 predictions at or above 0\.5 per setting, all of them false positives, while all ADRD\-case scores stayed below 0\.5\. Figure [11](https://arxiv.org/html/2606.10279#A4.F11) breaks the zero\-shot results down by the crossing of base model and prompt format, showing that the apparent prompt effect depends on both the metric and the model\.
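The distinction between score scale and ranking signal can be illustrated with a toy example. The labels and scores below are invented for illustration and are not taken from the paper; the helper names are hypothetical. ROC-AUC is computed as the fraction of case/control pairs ranked correctly, which depends only on the ordering of scores, whereas recall at a fixed threshold depends on their absolute scale.

```python
def roc_auc(y_true, scores):
    """Probability that a random case outranks a random control (ties count half)."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def recall_at(y_true, scores, threshold=0.5):
    """Recall when scores at or above the threshold count as positive predictions."""
    case_scores = [s for t, s in zip(y_true, scores) if t == 1]
    return sum(s >= threshold for s in case_scores) / len(case_scores)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0]                  # toy labels: two cases, three controls
    scores = [0.30, 0.20, 0.10, 0.25, 0.05]   # toy scores, all below 0.5
    print(roc_auc(y_true, scores))            # ranking is mostly correct
    print(recall_at(y_true, scores))          # no case reaches the 0.5 threshold
```

In this toy setting five of the six case/control pairs are ordered correctly, yet recall at 0.5 is zero because every score sits below the threshold.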

![Refer to caption](https://arxiv.org/html/2606.10279v1/x21.png) \(a\) Prompt format, ROC\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x22.png) \(b\) Base model, ROC\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x23.png) \(c\) Prompt format, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x24.png) \(d\) Base model, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x25.png) \(e\) Prompt format, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x26.png) \(f\) Base model, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x27.png) \(g\) Prompt format, Recall
![Refer to caption](https://arxiv.org/html/2606.10279v1/x28.png) \(h\) Base model, Recall

Figure 10: Zero\-shot baseline performance by prompt format and by base model\. Panels A–B provide the ROC\-AUC reference points for the few\-shot ablation; Panels C–H show the complementary PR\-AUC, F1 score, and recall results\.

![Refer to caption](https://arxiv.org/html/2606.10279v1/x29.png) \(a\) Base model and prompt format, ROC\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x30.png) \(b\) Base model and prompt format, PR\-AUC
![Refer to caption](https://arxiv.org/html/2606.10279v1/x31.png) \(c\) Base model and prompt format, F1 score
![Refer to caption](https://arxiv.org/html/2606.10279v1/x32.png) \(d\) Base model and prompt format, Recall

Figure 11: Zero\-shot base\-model\-by\-prompt\-format interaction panels for ROC\-AUC, PR\-AUC, F1 score, and Recall\.
