
# RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings
Source: [https://arxiv.org/html/2604.20256](https://arxiv.org/html/2604.20256)
Wei Han¹, David Martinez¹, Anna Khanina³˒⁴˒⁵, Lawrence Cavedon¹, Karin Verspoor¹˒²˒³

¹School of Computing Technologies, RMIT University; ²School of Computing and Information Systems, The University of Melbourne; ³National Centre for Infections in Cancer, Melbourne; ⁴Department of Infectious Disease, Peter MacCallum Cancer Centre; ⁵Sir Peter MacCallum Department of Oncology, The University of Melbourne

###### Abstract

A common strategy in transfer learning is few-shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real-world clinical datasets show our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance compared to traditional methods. Our code is open-sourced on GitHub: [https://github.com/Wei-0808/RADS](https://github.com/Wei-0808/RADS).


## 1 Introduction

Maximizing the utility of limited data is a crucial focus of Natural Language Processing (NLP) research in domains such as clinical texts, where acquiring large amounts of gold-standard data may be difficult due to data restrictions and the relative rarity of many disease conditions. The high cost of annotation in such highly specialized domains further limits the availability of labeled data. Yet, the effectiveness of NLP techniques in healthcare heavily relies on the quality of annotated datasets, particularly because clinical data contains specialized symbols, abbreviations, and medical jargon Touvron et al. ([2023](https://arxiv.org/html/2604.20256#bib.bib4)); Liu et al. ([2024a](https://arxiv.org/html/2604.20256#bib.bib5)).

Transfer Learning (TL) Tan et al. ([2018](https://arxiv.org/html/2604.20256#bib.bib41)), in which knowledge learned from a task is reused to boost performance on a different but related (target) task, has shown effectiveness across various machine learning applications Weiss et al. ([2016](https://arxiv.org/html/2604.20256#bib.bib6)) and opens new avenues for addressing low-resource scenarios. Previous works have attempted to leverage pretrained embeddings Maimaiti et al. ([2021](https://arxiv.org/html/2604.20256#bib.bib42)) and few-shot examples Alyafeai et al. ([2020](https://arxiv.org/html/2604.20256#bib.bib43)) to facilitate transfer learning in NLP. However, when the target task offers very few labeled instances, these approaches may generate unreliable outputs. This is an especially acute problem in healthcare, where reliability is paramount.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/RL.png)

Figure 1: RL-based active sampling for transfer learning from source domain to target domain. Domain shift reduces zero-shot generalization from the source-trained model to the target domain. Our sample selection strategy uses RL to identify key samples from the target domain. By jointly fine-tuning on the selected target samples with the source data, the model achieves good performance on both domains.

Class imbalance Johnson and Khoshgoftaar ([2019](https://arxiv.org/html/2604.20256#bib.bib45)) is another challenge for low-resource settings. In clinical datasets, there is often a scarcity of positive cases due to the low prevalence of many conditions, making such instances both highly valuable and limited in number. At the same time, differences in data collection protocols can lead some disease datasets to contain a very high proportion of positive samples. These extreme disparities in class distribution further hinder the transferability of NLP models across clinical datasets.

Clinical documentation is heterogeneous, reflecting diverse investigations including CT and PET scans, or cytology and histopathology analysis. Although disease detection cues appear across these document types, their content structures, terminology, and linguistic expressions can vary greatly. CT and PET scan reports primarily emphasize imaging-based findings Townsend et al. ([2004](https://arxiv.org/html/2604.20256#bib.bib50)), whereas cytology and histopathology reports focus on cellular and tissue-level observations Jensen ([2021](https://arxiv.org/html/2604.20256#bib.bib51)).

Previous works have shown the efficacy of NLP techniques for disease detection from clinical reports. However, models fine-tuned on one report type show clear performance degradation when applied to another Han et al. ([2025](https://arxiv.org/html/2604.20256#bib.bib30)). While disease-related signals overlap to some extent across different document types, existing disease detection models still fall short of human performance in transferring knowledge between them. As the preparation of gold-standard annotated datasets for training is time-consuming, it is therefore important to explore effective knowledge transfer strategies from existing datasets to new but similar tasks. This not only improves annotation efficiency but also enhances the models' adaptability in dealing with variations in task settings.

In this work, we propose RADS (Reinforcement Adaptive Domain Sampling), a robust strategy for knowledge transfer between related but distinct sources. Following the active learning paradigm Fu et al. ([2013](https://arxiv.org/html/2604.20256#bib.bib56)), we enhance transfer learning by identifying and selecting the most relevant samples for few-shot fine-tuning, as shown in Figure [1](https://arxiv.org/html/2604.20256#S1.F1). First, we employ an RL-based agent to identify the most informative samples within the target dataset. These selected samples are then annotated by medical experts and incorporated into the fine-tuning process. By jointly fine-tuning the model on the source dataset and the newly annotated target samples, the model is able to preserve strong performance on the source domain while achieving improved generalization to the target domain. We evaluated this approach across multiple real-world clinical datasets. Experimental results show that our method improves both the adaptability and performance of disease detection between different sources. In the context of transfer learning, this technique offers a promising way to both reduce annotation effort and enhance model robustness in low-resource and class-imbalanced settings.

Our contributions are summarized as follows:

- This work addresses the challenges posed by low-resource and class-imbalance scenarios in disease detection across heterogeneous clinical report types from real-world clinical data sources.
- We propose RADS, a robust RL-based sample selection strategy tailored to scenarios with both data scarcity and class imbalance.
- Extensive experiments on several clinical datasets confirm that our transfer learning approach is more effective between similar but different sources, even under low-resource and class-imbalanced conditions.

## 2 Related Work

With high-quality annotated datasets, NLP methods have shown promising results in disease detection. Based on concept features relevant to diseases, dictionary-based detection approaches and classical machine learning have shown effective performance Rozova et al. ([2023a](https://arxiv.org/html/2604.20256#bib.bib11)); Martinez et al. ([2015](https://arxiv.org/html/2604.20256#bib.bib12)). Bag-of-words models have also been utilized, often combined with machine learning techniques to further enhance accuracy and scalability in disease detection Cury et al. ([2021](https://arxiv.org/html/2604.20256#bib.bib15)); López-Úbeda et al. ([2020](https://arxiv.org/html/2604.20256#bib.bib16)). Recently, large language models (LLMs), such as BioBERT Lee et al. ([2020](https://arxiv.org/html/2604.20256#bib.bib17)) and ClinicalBERT Huang et al. ([2019](https://arxiv.org/html/2604.20256#bib.bib18)), pre-trained on large biomedical corpora, have improved contextual understanding in clinical texts Consoli et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib19)); Han et al. ([2025](https://arxiv.org/html/2604.20256#bib.bib30)).

Low-resource settings remain challenging for NLP tasks. Few-shot fine-tuning Brown et al. ([2020](https://arxiv.org/html/2604.20256#bib.bib39)); Gu et al. ([2022](https://arxiv.org/html/2604.20256#bib.bib21)); Liu et al. ([2022](https://arxiv.org/html/2604.20256#bib.bib22)), where large pre-trained models are adapted using only a small number of labeled examples, has shown promising results. Selecting effective few-shot samples is critical, and active learning strategies such as uncertainty sampling Nguyen et al. ([2022](https://arxiv.org/html/2604.20256#bib.bib23)) and diversity sampling Yang et al. ([2015](https://arxiv.org/html/2604.20256#bib.bib24)) are often employed. However, these methods typically optimize a single metric and, under domain shift, tend to select distributional outliers rather than truly informative samples Gonsior et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib60)). Reinforcement Learning (RL) Fang et al. ([2017](https://arxiv.org/html/2604.20256#bib.bib25)); Liu et al. ([2024b](https://arxiv.org/html/2604.20256#bib.bib26)) offers a potential solution by optimizing more flexible and adaptive sample selection policies, thereby improving robustness in different contexts.

Class imbalance is an especially pressing issue in low-resource clinical NLP tasks Ghosh et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib33)). Data-level approaches, such as oversampling minority classes Hairani et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib34)) and undersampling majority classes Yang et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib36)), are typically used to balance class distributions. Algorithm-level methods, such as cost-sensitive learning Araf et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib37)) and focal loss adjustments Aljohani et al. ([2023](https://arxiv.org/html/2604.20256#bib.bib38)), aim to direct model attention towards underrepresented classes, thereby improving model performance in class-imbalanced settings.

## 3 Methodology

### 3.1 Problem Setup and Overview

We study low-resource and class-imbalanced transfer learning between heterogeneous clinical report datasets: a fully labeled source dataset $\mathcal{D}_s$ and an unlabeled target dataset $\mathcal{U}_t$. Although the two datasets (domains) share some similar clinical knowledge, distribution shift and differences in label distribution make direct transfer challenging.

We formulate cross-domain adaptation as a budgeted active learning problem: given an annotation budget $B \ll N_t$, where $N_t$ is the target pool size, our goal is to select a small but high-utility subset $\mathcal{Q} \subset \mathcal{U}_t$. The selected samples are then annotated and merged with $\mathcal{D}_s$ to form an expanded training set. Supervised fine-tuning on this final dataset transfers the knowledge effectively and improves model performance across both domains.

The overall framework of RADS is shown in Figure [2](https://arxiv.org/html/2604.20256#S3.F2). Our approach consists of three stages: (1) we train an active learner on $\mathcal{D}_s$ and compute informativeness signals for $\mathcal{U}_t$ via Monte-Carlo (MC) dropout; (2) we define a prior-aware utility that combines BALD-based mutual information Houlsby et al. ([2011](https://arxiv.org/html/2604.20256#bib.bib62)) with pseudo-label class weighting to explicitly control the quality of selected samples for transfer learning under severe class imbalance; and (3) we train a reinforcement learning sampler to select samples that maximize the prior-aware utility while discouraging redundant selections. The pseudocode for this part is provided in Appendix [A](https://arxiv.org/html/2604.20256#A1).

### 3.2 Active Learner

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/method.png)

Figure 2: RADS framework for RL-based active sampling under domain shift. The active learner is fine-tuned on the source domain, and MC dropout is used to score unlabeled target reports and construct informativeness signals (state). An RL sampler then selects a subset for annotation by maximizing the reward, producing a target set for joint fine-tuning with the source data.

We first fine-tune a lightweight classifier $f_\phi$ on the labeled source dataset $\mathcal{D}_s$. For each unlabeled target report in the training pool $x \in \mathcal{U}_t$, we estimate epistemic uncertainty via MC dropout Gal and Ghahramani ([2016](https://arxiv.org/html/2604.20256#bib.bib61)). Specifically, we keep dropout activated at inference time and perform $K$ stochastic forward passes. Each pass corresponds to sampling a dropout mask, yielding a sampled set of network weights $\mathbf{w}_k$ and a predictive distribution:

$$p_k(y \mid x) = \mathrm{softmax}\left(f_\phi(x; \mathbf{w}_k)\right) \tag{1}$$

Aggregating these $K$ stochastic predictions approximates the posterior predictive distribution. We compute the MC predictive mean as:

$$\bar{p}(y \mid x) = \frac{1}{K} \sum_{k=1}^{K} p_k(y \mid x) \tag{2}$$
In addition, we retain the mean log-probability vector $\bar{\ell}(x) = \log \bar{p}(\cdot \mid x)$, which serves as a representation for redundancy estimation in our RL-based sampler (Section [3.5](https://arxiv.org/html/2604.20256#S3.SS5)).
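The MC-dropout aggregation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the $K$ stochastic softmax outputs for one report have already been collected into an array, and the toy probabilities are invented for demonstration.

```python
import numpy as np

def mc_aggregate(probs):
    """Aggregate K stochastic MC-dropout passes (Eqs. 1-2).

    probs: array of shape (K, C) -- one softmax distribution per pass.
    Returns the MC predictive mean p_bar and the mean log-probability
    vector, which later serves as the redundancy representation.
    """
    p_bar = probs.mean(axis=0)      # Eq. (2): average over the K passes
    ell_bar = np.log(p_bar)         # mean log-probability vector
    return p_bar, ell_bar

# three stochastic passes over one report (binary task, C = 2)
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.3, 0.7]])
p_bar, ell_bar = mc_aggregate(probs)
```

Here the three passes disagree noticeably, so the averaged distribution is far less confident than any single pass; this spread is exactly what the BALD signal in the next section quantifies.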

Based on this active learner, we also define a pseudo label $\hat{y}(x) = \arg\max_y \bar{p}(y \mid x)$ and estimate the predicted target class prior:

$$\hat{\pi}_+ = \frac{1}{N_t} \sum_{x \in \mathcal{U}_t} \mathbb{1}\left[\hat{y}(x) = 1\right], \qquad \hat{\pi}_- = 1 - \hat{\pi}_+ \tag{3}$$

These priors let us correct selection bias when the pool is imbalanced or the source-trained model is miscalibrated on the target domain.

### 3.3 BALD Signal

To score informativeness for unlabeled target-domain samples, we use BALD, which quantifies the mutual information between the predicted label and the model parameters. Let $H(\cdot)$ denote entropy. For each $x \in \mathcal{U}_t$, we compute:

$$\mathrm{PE}(x) = H\left(\bar{p}(\cdot \mid x)\right), \tag{4}$$

$$\mathrm{EE}(x) = \frac{1}{K} \sum_{k=1}^{K} H\left(p_k(\cdot \mid x)\right), \tag{5}$$

$$\mathrm{MI}(x) = \mathrm{PE}(x) - \mathrm{EE}(x). \tag{6}$$
Here, $\mathrm{MI}(x)$ is the BALD score. It is large when the predictive distribution is uncertain overall (high $\mathrm{PE}$) while individual stochastic models are relatively confident but disagree with each other (low $\mathrm{EE}$). We normalize $\mathrm{MI}(x)$ to $[0, 1]$ over $\mathcal{U}_t$, denoted as $\widetilde{\mathrm{MI}}(x)$.

We treat samples with high $\widetilde{\mathrm{MI}}(x)$ as informative and assign them higher utility in our selection policy. Prioritizing these samples for annotation is expected to reduce the model's uncertainty and improve transfer to the target domain.
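The BALD decomposition of Eqs. (4)-(6) can be sketched as follows. This is an illustrative NumPy version under the assumption that the $K$ MC-dropout distributions for every pooled sample are stacked into one array; the two toy samples are constructed to contrast confident disagreement (high MI) with uniform agreement (MI near zero).

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis (nats); epsilon guards log(0)."""
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def bald_scores(probs_pool):
    """BALD decomposition (Eqs. 4-6) over a pool of unlabeled samples.

    probs_pool: shape (N, K, C) -- K MC-dropout distributions per sample.
    Returns PE, EE, MI, and MI min-max normalized to [0, 1] over the pool.
    """
    p_bar = probs_pool.mean(axis=1)          # (N, C) predictive means
    pe = entropy(p_bar)                      # Eq. (4): total uncertainty
    ee = entropy(probs_pool).mean(axis=1)    # Eq. (5): expected entropy
    mi = pe - ee                             # Eq. (6): BALD / disagreement
    mi_norm = (mi - mi.min()) / (mi.max() - mi.min() + 1e-12)
    return pe, ee, mi, mi_norm

# sample 0: passes confidently disagree -> high MI (epistemic uncertainty)
# sample 1: passes agree on a uniform guess -> MI near zero
pool = np.array([[[0.99, 0.01], [0.01, 0.99]],
                 [[0.50, 0.50], [0.50, 0.50]]])
pe, ee, mi, mi_norm = bald_scores(pool)
```

Both samples have the same predictive entropy, yet only the first carries disagreement between passes; BALD separates these two cases, which plain uncertainty sampling cannot.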

Table 1: Key attributes of the three datasets used in this study. P/N = positive/negative class proportions.
### 3.4 Prior-Aware Utility for Sample Selection

Selecting the top-$B$ uncertain samples can sometimes produce an extreme class skew. This often happens under domain shift and severe class imbalance, where the source-trained active learner may predict biased pseudo labels on the target domain. To control the selected class mixture, we introduce a prior-aware utility. We define class weights using the estimated prior:

$$w_+ = \frac{\rho}{\mathrm{clip}(\hat{\pi}_+)}, \qquad w_- = \frac{1 - \rho}{1 - \mathrm{clip}(\hat{\pi}_+)} \tag{7}$$

where $\mathrm{clip}(\cdot)$ clamps probabilities away from $\{0, 1\}$ for stability and $\rho$ is a hyperparameter that trades off class-balance control and informativeness. We then define the utility:

$$u(x) = \widetilde{\mathrm{MI}}(x) \cdot \begin{cases} w_+, & \hat{y}(x) = 1, \\ w_-, & \hat{y}(x) = 0. \end{cases} \tag{8}$$

This utility favors informative samples and shifts selection toward the desired class ratio.
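The prior-aware weighting of Eqs. (7)-(8) can be sketched as below. The defaults `rho=0.5` and `eps=0.05` are illustrative assumptions, not values from the paper, and the toy inputs are invented: two equally informative samples where the pseudo-positive one gets boosted because positives are rare in the pool.

```python
import numpy as np

def prior_aware_utility(mi_norm, pseudo_labels, pi_plus, rho=0.5, eps=0.05):
    """Prior-aware utility of Eqs. (7)-(8).

    mi_norm:       normalized BALD scores in [0, 1], shape (N,)
    pseudo_labels: hard pseudo labels y_hat from the active learner, shape (N,)
    pi_plus:       estimated positive-class prior on the target pool (Eq. 3)
    rho:           desired positive share among selections (hyperparameter)
    eps:           clipping margin keeping the prior away from {0, 1}
    """
    pi = np.clip(pi_plus, eps, 1.0 - eps)    # clip(.) in Eq. (7)
    w_pos = rho / pi                         # up-weights the rare class
    w_neg = (1.0 - rho) / (1.0 - pi)
    weights = np.where(pseudo_labels == 1, w_pos, w_neg)
    return mi_norm * weights                 # Eq. (8)

# equally informative samples (MI = 0.8) with pseudo labels 1 and 0,
# in a pool where only ~10% of samples look positive
u = prior_aware_utility(np.array([0.8, 0.8]), np.array([1, 0]), pi_plus=0.1)
```

With $\hat{\pi}_+ = 0.1$ and $\rho = 0.5$, the positive weight is $0.5 / 0.1 = 5$ versus $0.5 / 0.9 \approx 0.56$ for negatives, so the rare pseudo-positive sample dominates the utility ranking.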

### 3.5 RL-based Sample Selection Strategy

At each step $t$, the sampler agent observes the current candidate $x_t$ and decides whether to select or discard it. An episode ends when $B$ samples are selected or the pool is exhausted.

#### State.

For each candidate $x_t$, the state vector combines the active learner signals and a budget progress term:

$$s_t = \Big[\, \bar{\ell}(x_t);\; \mathrm{PE}(x_t);\; \mathrm{MI}(x_t);\; |S_t|/B \,\Big] \tag{9}$$

where $\bar{\ell}(x_t)$ is the mean log-probability vector computed from MC dropout; $\mathrm{PE}(x_t)$ is the predictive entropy; $\mathrm{MI}(x_t)$ is the BALD score; and $|S_t|/B$ indicates the fraction of the annotation budget already consumed, with $S_t$ denoting the set of selected samples so far. In our binary setting, $\bar{\ell}(x_t) \in \mathbb{R}^2$, hence the overall dimension of the state vector is 5.
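Assembling the state of Eq. (9) is a simple concatenation; the sketch below assumes the per-candidate signals are already computed, and the numeric inputs are placeholders.

```python
import numpy as np

def build_state(ell_bar, pe, mi, n_selected, budget):
    """Assemble the state vector s_t of Eq. (9).

    ell_bar:    mean log-probability vector (2 entries in the binary setting)
    pe, mi:     predictive entropy and BALD score of the candidate
    n_selected: |S_t|, number of samples selected so far
    budget:     annotation budget B
    """
    return np.concatenate([ell_bar, [pe, mi, n_selected / budget]])

# binary setting: 2 (log-probs) + 3 (PE, MI, budget fraction) = 5 dims
s = build_state(np.log([0.6, 0.4]), pe=0.67, mi=0.12, n_selected=2, budget=5)
```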

#### Reward.

Our reward encourages the agent to select samples that are both (i) informative for learning under class imbalance and (ii) non-redundant with respect to previously selected instances. Specifically, when the agent selects the current candidate ($a_t = 1$) and the budget is not yet exhausted ($|S_t| < B$), we define:

$$r_t = u(x_t) - \lambda \cdot \mathrm{Red}(x_t, S_t) \tag{10}$$

and set $r_t = 0$ otherwise. Here, $u(x_t)$ is the prior-aware utility (Section [3.4](https://arxiv.org/html/2604.20256#S3.SS4)) and $\lambda$ controls the strength of the diversity regularization.

To discourage selecting near-duplicate samples, we measure redundancy in the active learner's predictive representation space. For a candidate $x$ and the current selected set $S$, we first compute the distance to its nearest selected neighbor:

$$\delta(x, S) = \begin{cases} +\infty, & |S| = 0, \\ \min\limits_{x' \in S} \left\lVert \bar{\ell}(x) - \bar{\ell}(x') \right\rVert_2, & \text{otherwise.} \end{cases} \tag{11}$$

We then convert this distance into a bounded redundancy score:

$$\mathrm{Red}(x, S) = \begin{cases} 0, & |S| = 0, \\ \dfrac{1}{1 + \delta(x, S)}, & \text{otherwise.} \end{cases} \tag{12}$$

This definition yields a larger penalty when $x$ is very close to an existing selection (small $\delta$), and a smaller penalty when $x$ is far away (large $\delta$). As a result, the agent is encouraged to select diverse samples while still prioritizing high-utility ones.
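The redundancy penalty and per-step reward of Eqs. (10)-(12) can be sketched as below. This is an illustrative NumPy version; `lam=0.5` and the toy vectors are assumptions for demonstration, and the empty-set branch returns redundancy 0 directly rather than materializing the $+\infty$ distance of Eq. (11).

```python
import numpy as np

def redundancy(ell_x, selected_ells):
    """Bounded redundancy score Red(x, S) of Eqs. (11)-(12)."""
    if len(selected_ells) == 0:
        return 0.0                           # empty set: no penalty
    dists = np.linalg.norm(np.asarray(selected_ells) - ell_x, axis=1)
    delta = dists.min()                      # Eq. (11): nearest selected neighbor
    return 1.0 / (1.0 + delta)               # Eq. (12): bounded in (0, 1]

def step_reward(utility, ell_x, selected_ells, lam=0.5):
    """Per-step reward r_t of Eq. (10) when the agent selects x_t."""
    return utility - lam * redundancy(ell_x, selected_ells)

x = np.log([0.6, 0.4])
r_first = step_reward(0.9, x, [])    # nothing selected yet: r = u(x)
r_dup = step_reward(0.9, x, [x])     # exact duplicate: maximal penalty
```

An exact duplicate has $\delta = 0$ and so incurs the full penalty $\lambda$, while a far-away candidate is penalized only marginally.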

#### Dueling DQN Sampler Agent.

We learn a Q-function $Q_\theta(s, a)$ with a dueling DQN architecture Wang et al. ([2016](https://arxiv.org/html/2604.20256#bib.bib65)) and optimize it via the standard DQN objective Mnih et al. ([2015](https://arxiv.org/html/2604.20256#bib.bib52)). We maintain an experience replay buffer $\mathcal{B}$ and a target network $Q_{\theta^-}$. At each gradient step, we minimize the temporal-difference loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{B}} \Big[ \big( Q_\theta(s, a) - y \big)^2 \Big], \qquad y = r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s', a') \tag{13}$$

where $\gamma$ is the discount factor and $d \in \{0, 1\}$ indicates episode termination. We adopt $\epsilon$-greedy exploration with a decaying $\epsilon$ schedule, periodically synchronize $\theta^-$ with $\theta$, and finally use the learned policy $\pi(s) = \arg\max_a Q_\theta(s, a)$ to select $B$ samples from $\mathcal{U}_t$.
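The two core ingredients, the dueling value/advantage decomposition and the bootstrap target $y$ of Eq. (13), can be sketched as follows. This is a toy stand-in, not the paper's network: the value and advantage heads are single linear maps, and all weights and numbers are invented for illustration.

```python
import numpy as np

def dueling_q(features, w_v, w_a):
    """Dueling decomposition Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a').

    features: state features, shape (d,)
    w_v:      value-head weights, shape (d,)
    w_a:      advantage-head weights, shape (d, n_actions)
    Subtracting the mean advantage keeps V and A identifiable.
    """
    v = features @ w_v                       # scalar state value
    a = features @ w_a                       # per-action advantages
    return v + a - a.mean()

def td_target(r, q_next, gamma=0.99, done=False):
    """Bootstrap target y of Eq. (13); q_next = Q_{theta^-}(s', .)."""
    return r + gamma * (1.0 - float(done)) * q_next.max()

q = dueling_q(np.ones(2), w_v=np.ones(2),
              w_a=np.array([[1.0, 0.0],
                            [1.0, 0.0]]))
y = td_target(r=1.0, q_next=q, gamma=0.9)
```

For the sampler this means two actions (select vs. discard); at a terminal transition ($d = 1$) the target collapses to the immediate reward, exactly as the $(1 - d)$ factor in Eq. (13) prescribes.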

## 4 Experimental Setup

### 4.1 Benchmark Datasets

The CHIFIR and PIFIR datasets both concern Invasive Fungal Infection (IFI), but the vocabulary used varies between them. The cytology and histopathology reports of the CHIFIR dataset assess tissue or fluid samples and describe the microscopic visualization of fungal organisms. The PET-CT reports from PIFIR assess metabolic activity and discuss the anatomical and morphological features of fungal lesions via PET imaging.

To assess transfer beyond IFI and beyond pathology-style reports, we also include MIMIC-CXR, a corpus of chest X-ray reports. We construct a Pneumonia subset by selecting the top 3,000 reports that are labeled as pneumonia by CheXpert's Irvin et al. ([2019](https://arxiv.org/html/2604.20256#bib.bib64)) weak labels. Although PIFIR and MIMIC-CXR both consist of radiology reports, they still differ greatly in reporting style and clinical phrasing. Moreover, pneumonia and IFI reflect distinct clinical contexts, further increasing the domain shift. Figure [3](https://arxiv.org/html/2604.20256#S4.F3) shows differences in predominant clinical terms across the three datasets.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/CHIFIR.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/PIFIR.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/MIMIC_CXR.png)

Figure 3: Word clouds for the CHIFIR (left), PIFIR (middle), and MIMIC-CXR (right) datasets. Word size corresponds to term frequency.

All three datasets exhibit class imbalance. CHIFIR and MIMIC-CXR are dominated by negative cases, whereas PIFIR is dominated by positive cases. Throughout the paper we focus on transferring from other sources to PIFIR, and we provide results for the other directions in the Appendix.

Table [1](https://arxiv.org/html/2604.20256#S3.T1) summarizes the key characteristics of each dataset and highlights the challenges for transfer learning across them. Details of the dataset split are provided in Appendix [B](https://arxiv.org/html/2604.20256#A2).

### 4.2 Evaluation Metrics

We evaluate performance using accuracy, F1 score, precision, recall, and ROC\-AUC\. Class imbalance in benchmark datasets makes the F1 score particularly important\. Recall is also important, given that it is critical not to miss positive cases\.

Table 2: Performance comparison of ClinicalBERT under different transfer strategies. Zero-shot transfer refers to fine-tuning solely on the source dataset; we set this as our baseline. Full-shot transfer refers to jointly fine-tuning on both source and target datasets; we assume this represents the best possible transfer learning performance.
### 4.3 Baselines

We select the fine-tuned ClinicalBERT approach from previous work Han et al. ([2025](https://arxiv.org/html/2604.20256#bib.bib30)) as the baseline. Table [2](https://arxiv.org/html/2604.20256#S4.T2) shows the baseline results and reveals the challenges of knowledge transferability between these datasets. Models perform well when fine-tuned and evaluated on the same dataset. Without transfer learning, evaluation on a similar but still different dataset results in a clear performance drop. Although training on all datasets together can improve performance, it requires annotating all reports, which is labor-intensive. Full reproducibility details can be found in Appendix [B](https://arxiv.org/html/2604.20256#A2).

## 5 Experimental Results

### 5.1 Transfer Learning Performance

We compare our method with several other active learning approaches to analyze the impact of different sample selection methods on knowledge transfer performance.

1. Random Selection: Randomly selects samples from the unlabeled target domain. Each experiment is run five times to reduce variance and obtain more reliable results; we report the mean evaluation metrics over these five runs.
2. Uncertainty-based Selection Nguyen et al. ([2022](https://arxiv.org/html/2604.20256#bib.bib23)): Selects $k$ samples by predictive uncertainty (lowest confidence) from the active learner.
3. Diversity-based Selection Yuan et al. ([2020](https://arxiv.org/html/2604.20256#bib.bib57)): Selects the $k$ most diverse samples by calculating the cosine distance between each report embedding in the unlabeled dataset and the embeddings in the labeled dataset.
4. LM-DPP Selection Wang et al. ([2024](https://arxiv.org/html/2604.20256#bib.bib53)): Jointly models uncertainty and diversity using a Determinantal Point Process (DPP) kernel. Following the original work, we set the trade-off coefficient between uncertainty and diversity to 0.5 and select the subset of size $k$ that maximizes the DPP objective for annotation.
5. TAGCOS Selection Zhang et al. ([2025](https://arxiv.org/html/2604.20256#bib.bib66)): A task-agnostic selection baseline that selects $k$ samples according to its gradient-based selection criterion.
6. BatchBALD Selection Kirsch et al. ([2019](https://arxiv.org/html/2604.20256#bib.bib67)): Selects $k$ samples using a batch acquisition strategy that extends BALD by maximizing joint mutual information under MC dropout.

Knowledge Transfer from CHIFIR to PIFIR:

| Strategy | PIFIR Acc. | PIFIR F1 | PIFIR Prec. | PIFIR Rec. | PIFIR AUC | CHIFIR Acc. | CHIFIR F1 | CHIFIR Prec. | CHIFIR Rec. | CHIFIR AUC | ΔF1 | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 0.595 | 0.639 | 0.885 | 0.561 | 0.813 | 0.927 | 0.746 | 0.805 | 0.700 | 0.938 | – | – |
| Uncertainty | 0.524 | 0.545 | 0.923 | 0.387 | 0.830 | 0.942 | 0.824 | 0.778 | 0.875 | 0.977 | 0.278 | [-0.037, 0.530] |
| Diversity | 0.595 | 0.638 | 0.938 | 0.484 | 0.809 | 0.942 | 0.800 | 0.857 | 0.750 | 0.974 | 0.162 | [-0.167, 0.412] |
| LM-DPP | 0.571 | 0.609 | 0.933 | 0.452 | 0.839 | 0.904 | 0.615 | 0.800 | 0.500 | 0.977 | 0.007 | [-0.418, 0.319] |
| TAGCOS | 0.762 | 0.844 | 0.818 | 0.871 | 0.730 | 0.942 | 0.824 | 0.778 | 0.875 | 0.972 | -0.020 | [-0.310, 0.168] |
| BatchBALD | 0.738 | 0.849 | 0.738 | 1.000 | 0.783 | 0.885 | 0.500 | 0.750 | 0.375 | 0.946 | -0.349 | [-0.833, -0.019] |
| **RADS** | 0.810 | 0.871 | 0.871 | 0.871 | 0.833 | 0.923 | 0.750 | 0.750 | 0.750 | 0.977 | -0.121 | [-0.430, 0.100] |

Table 3: Transfer learning performance from CHIFIR to PIFIR with 5 samples selected in PIFIR under different sample selection strategies. ΔF1 = F1(CHIFIR) − F1(PIFIR). CI = Confidence Interval.

Knowledge Transfer from MIMIC-CXR to PIFIR:

| Strategy | PIFIR Acc. | PIFIR F1 | PIFIR Prec. | PIFIR Rec. | PIFIR AUC | MIMIC Acc. | MIMIC F1 | MIMIC Prec. | MIMIC Rec. | MIMIC AUC | ΔF1 | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 0.766 | 0.848 | 0.808 | 0.842 | 0.806 | 0.862 | 0.824 | 0.808 | 0.842 | 0.932 | – | – |
| Uncertainty | 0.738 | 0.831 | 0.794 | 0.871 | 0.636 | 0.848 | 0.810 | 0.780 | 0.842 | 0.916 | -0.021 | [-0.165, 0.126] |
| Diversity | 0.738 | 0.845 | 0.750 | 0.968 | 0.616 | 0.859 | 0.816 | 0.816 | 0.816 | 0.933 | -0.029 | [-0.157, 0.102] |
| LM-DPP | 0.738 | 0.831 | 0.794 | 0.871 | 0.636 | 0.848 | 0.810 | 0.780 | 0.842 | 0.916 | -0.021 | [-0.165, 0.126] |
| TAGCOS | 0.738 | 0.836 | 0.778 | 0.903 | 0.795 | 0.862 | 0.831 | 0.821 | 0.842 | 0.930 | -0.005 | [-0.131, 0.139] |
| BatchBALD | 0.762 | 0.848 | 0.800 | 0.903 | 0.827 | 0.889 | 0.861 | 0.829 | 0.895 | 0.949 | 0.012 | [-0.107, 0.141] |
| **RADS** | 0.810 | 0.882 | 0.811 | 0.968 | 0.880 | 0.869 | 0.840 | 0.791 | 0.895 | 0.921 | -0.043 | [-0.153, 0.072] |

Table 4: Transfer learning performance from MIMIC-CXR to PIFIR with 2 samples selected in PIFIR under different sample selection strategies. ΔF1 = F1(MIMIC-CXR) − F1(PIFIR). CI = Confidence Interval.

Table [3](https://arxiv.org/html/2604.20256#S5.T3) reports transfer learning results from CHIFIR to PIFIR. Uncertainty-, diversity-, and LM-DPP-based selection yield comparable or worse performance than random sampling. While TAGCOS and BatchBALD attain relatively high F1 scores on PIFIR, their ROC-AUC is noticeably lower. In contrast, our method RADS achieves the best performance on PIFIR while maintaining competitive performance on the source domain (CHIFIR). This indicates strong sample efficiency, requiring only $5/135 \approx 3.7\%$ of the target training set to obtain substantial transfer gains.

We further compare RADS with a prompt-guided LLM selection baseline Jeong et al. ([2025](https://arxiv.org/html/2604.20256#bib.bib69)), which uses an open-source medical LLM to score reports and then selects the top-k reports for annotation. We report the CHIFIR to PIFIR transfer results in Appendix [C](https://arxiv.org/html/2604.20256#A3). Although this baseline can retrieve some useful reports as the budget increases, it is highly unstable in the ultra-low-budget regime and remains less reliable than RADS overall.

Table [4](https://arxiv.org/html/2604.20256#S5.T4) reports transfer learning results from MIMIC-CXR to PIFIR. Other baselines provide limited gains and are often on par with or below random selection. RADS achieves better target performance (F1 on PIFIR = 0.882) and remains highly sample-efficient, requiring annotation of only $2/135 \approx 1.5\%$ of the target training set.

We also conduct transfer learning experiments from PIFIR to CHIFIR. The results show that our method still achieves the best performance with only 8 samples selected from the target dataset CHIFIR. A more detailed discussion appears in Appendix [D](https://arxiv.org/html/2604.20256#A4).

RADS consistently outperforms strong baselines, demonstrating superior sample\-efficient transfer performance\. More robustness analysis under imbalanced settings appears in Appendix[E](https://arxiv.org/html/2604.20256#A5)\.

### 5.2 Learning Curves under Varying Budgets

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/BB.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/TG.png)

Figure 4: Transfer from CHIFIR to PIFIR under baselines BatchBALD (left) and TAGCOS (right).

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/Our.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/F1.png)

Figure 5: Transfer from CHIFIR to PIFIR under our method RADS.

We analyze the effect of annotation budget on transfer performance. Figure [4](https://arxiv.org/html/2604.20256#S5.F4) shows the transfer performance from CHIFIR to PIFIR across budgets under two strong baselines. BatchBALD is highly unstable; a small change in budget can flip the model from almost perfect to completely broken. TAGCOS is more stable but still unreliable at small budgets. Figure [5](https://arxiv.org/html/2604.20256#S5.F5) shows the transfer performance from CHIFIR to PIFIR with our method. The left graph shows that from budget 5 onward, F1 on PIFIR stays around 0.85–0.87 and additional labels provide only marginal improvements, while CHIFIR performance remains stable. The right graph plots the domain gap ΔF1 versus budget with 95% confidence intervals. From budget = 4 onward, the gap is effectively closed (ΔF1 ≈ 0 with overlapping confidence intervals), suggesting the selected subset suffices to eliminate the transfer gap.

As MIMIC\-CXR is larger than CHIFIR, identifying informative target samples is less challenging for all models, even under low annotation budgets\. The transfer performance from MIMIC\-CXR to PIFIR across budgets is shown in Appendix[F](https://arxiv.org/html/2604.20256#A6)\.

### 5.3 Ablation Study

Table 5: Ablation results under the hard transfer setting from CHIFIR to PIFIR with a labeling budget of 5.

To evaluate the effectiveness of each component in RADS, we conduct ablation studies as shown in Table [5](https://arxiv.org/html/2604.20256#S5.T5). Replacing the RL sampler with a greedy selector (No RL) leads to a clear drop in performance. Although this variant optimizes the same objective, it lacks the sequential decision-making needed to balance exploration and redundancy control. Selecting samples by the BALD signal alone (MI Only) fails, implying that uncertainty-only criteria can favor noisy or out-of-distribution target examples under domain shift. Using the prior-aware utility (Utility Only) greatly improves results, confirming the benefit of class- and quality-aware selection, but remains below our method, highlighting the additional gains from discouraging redundant selections. Finally, RADS further improves over Utility Only, demonstrating that the RL sampler learns a non-redundant, globally optimized subset selection policy rather than relying on pointwise ranking.

### 5.4 Selected Sample Quality Analysis

We audit the quality of the selected target samples. Although RADS is trained and applied on the same unlabeled target pool, we note that the sampler never accesses target gold labels during optimization. Figure [6](https://arxiv.org/html/2604.20256#S5.F6) shows that in CHIFIR to PIFIR transfer, the source-trained pseudo labels are notably misaligned with the annotated labels, yet RADS still selects mostly true positives, which helps improve target performance. In MIMIC-CXR to PIFIR transfer, the pseudo-label ratio better matches the annotated composition, consistent with a smaller domain shift and improved calibration on the target domain.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/ratio.png)

Figure 6: Selected sample analysis under CHIFIR to PIFIR (left) and MIMIC-CXR to PIFIR (right) transfer. The black line shows the pseudo-positive ratio predicted by the source-trained active learner, and the bars report the numbers of true positives and true negatives after manual annotation (blue/yellow).

Figure [7](https://arxiv.org/html/2604.20256#S5.F7) visualizes the CHIFIR to PIFIR (left) and MIMIC-CXR to PIFIR (right) transfer, where our method selects 5 samples for adaptation. Before transfer learning, the decision boundary captures only the source-specific separation. After adding the selected target subset, the boundary rotates and shifts toward a direction that better reflects the class layout of both datasets.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/pcaa.png)

Figure 7: Transfer learning from CHIFIR to PIFIR (left) and MIMIC-CXR to PIFIR (right) with PCA projection of report embeddings. Red × markers denote the selected PIFIR subset S. The dashed black line shows the decision boundary learned from the source dataset only. The solid red line shows the boundary after augmenting with the selected samples. Source = CHIFIR / MIMIC-CXR dataset, Target = PIFIR dataset.
### 5.5 Efficiency and Budget-Aware Transfer

#### Runtime Efficiency

Our RL-based sampler in RADS is lightweight, requiring only a few seconds to produce a selection. Selecting 2 samples (MIMIC-CXR to PIFIR transfer) takes around 3 seconds, and selecting 5 samples (CHIFIR to PIFIR transfer) takes around 9 seconds. Given that annotating a single report takes about one minute, the selection overhead is negligible. Compared with full-shot transfer, which labels the entire target training set (135 reports), RADS achieves comparable target performance with only 2 or 5 annotated reports.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/Num.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/Num-m.png)

Figure 8: Number of selected PIFIR samples versus the annotation budget B when training on CHIFIR to PIFIR (left) and MIMIC-CXR to PIFIR (right).
#### Budget Utilization and Early Stopping

As our RL-based sampler encodes budget progress (|S_t|/B) in the state, it can also indicate how many samples are worth annotating during transfer. Figure [8](https://arxiv.org/html/2604.20256#S5.F8) shows the actual number of PIFIR samples selected as the budget increases. For CHIFIR to PIFIR transfer, the policy stops fully consuming the budget once B ≥ 8, consistent with our results showing good transfer performance already at B = 5. For MIMIC-CXR to PIFIR transfer, the policy no longer uses the full budget when B ≥ 5, aligning with the good target performance achieved with B = 2 labeled PIFIR samples. This pattern is practically useful: in active learning, knowing when to stop adding samples is as important as knowing which samples to add.

### 5.6 Transfer Gap between Datasets

To explain why model transfer from MIMIC\-CXR to PIFIR is easier than CHIFIR to PIFIR, we further analyze the distribution gap between these datasets\.

We quantify overlap between datasets with a shared unigram–bigram vocabulary (Elangovan et al., [2021](https://arxiv.org/html/2604.20256#bib.bib68)). Coverage of PIFIR-test n-grams is higher for MIMIC-CXR to PIFIR than for CHIFIR to PIFIR (0.193 vs. 0.115). The Jaccard similarity between source and target vocabularies is also higher for MIMIC-CXR to PIFIR than for CHIFIR to PIFIR (0.187 vs. 0.124). This suggests a smaller lexical domain shift between MIMIC-CXR and PIFIR, consistent with our empirical results when fine-tuning from MIMIC-CXR. More detailed differences are analyzed in Appendix [G](https://arxiv.org/html/2604.20256#A7).
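Both overlap statistics can be computed directly from tokenized reports. The sketch below assumes simple lowercasing and whitespace tokenization, which may differ from the exact preprocessing of Elangovan et al. (2021):

```python
def ngrams(tokens, n):
    """Set of n-grams (space-joined strings) from a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def vocab(docs):
    """Shared unigram-bigram vocabulary of a corpus."""
    v = set()
    for doc in docs:
        toks = doc.lower().split()
        v |= ngrams(toks, 1) | ngrams(toks, 2)
    return v

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two vocabularies."""
    return len(a & b) / len(a | b) if a | b else 0.0

def coverage(source_vocab, target_vocab):
    """Fraction of target n-grams that also appear in the source."""
    return len(target_vocab & source_vocab) / len(target_vocab) if target_vocab else 0.0
```

Higher coverage and Jaccard values between a source corpus and the PIFIR test vocabulary then correspond to a smaller lexical domain shift, as reported above.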

## 6 Conclusion

In this work, we studied transfer learning for disease detection under low-resource and class-imbalanced conditions. We proposed RADS, an RL-based sampler that jointly optimizes a prior-aware utility for class-mixture control and a diversity regularizer to avoid near-duplicate selections. Our approach improves model performance and adaptability across medical datasets compared to traditional sample selection strategies. We expect the approach to generalize to other transfer learning problems, not only in clinical NLP but also in other application domains, making wider validation a promising direction for future work.

## Limitations

Despite demonstrating promising results, our approach has several limitations. First, the effectiveness of our RL-based sample selection depends heavily on the feedback provided by the active learner, which places high demands on the quality of the original gold dataset. Second, our formulation controls class mixture only through predicted priors and does not explicitly incorporate richer clinical knowledge. Third, our experiments focus on binary clinical disease detection with relatively small target pools, and we have not yet validated RADS on larger-scale multi-class settings or on broader non-clinical transfer tasks. Fourth, more stable RL optimizers, improved uncertainty estimation, and better transfer-aligned validation strategies remain important directions for future work.

## Acknowledgments

This work was supported by the Australian Government through Medical Research Future Fund grant MRFCRI000188\. The authors also thank the EINSTEIN Study Group for their support and insightful discussions\.

## References

- N\. R\. Aljohani, A\. Fayoumi, and S\. Hassan \(2023\)A novel focal\-loss and class\-weight\-aware convolutional neural network for the classification of in\-text citations\.Journal of Information Science49\(1\),pp\. 79–92\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p3.1)\.
- Z\. Alyafeai, M\. S\. AlShaibani, and I\. Ahmad \(2020\)A survey on transfer learning in natural language processing\.arXiv preprint arXiv:2007\.04239\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p2.1)\.
- I\. Araf, A\. Idri, and I\. Chairi \(2024\)Cost\-sensitive learning for imbalanced medical data: a review\.Artificial Intelligence Review57\(4\),pp\. 80\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p3.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 1877–1901\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- S\. Consoli, P\. Markov, N\. I\. Stilianakis, L\. Bertolini, A\. P\. Gallardo, and M\. Ceresa \(2024\)Epidemic information extraction for event\-based surveillance using large language models\.InInternational Congress on Information and Communication Technology,pp\. 241–252\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- R\. C\. Cury, I\. Megyeri, T\. Lindsey, R\. Macedo, J\. Batlle, S\. Kim, B\. Baker, R\. Harris, and R\. H\. Clark \(2021\)Natural Language Processing and Machine Learning for Detection of Respiratory Illness by Chest CT Imaging and Tracking of COVID\-19 Pandemic in the United States\.Radiology: Cardiothoracic Imaging3\(1\),pp\. e200596\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- A\. Elangovan, J\. He, and K\. Verspoor \(2021\)Memorization vs\. generalization : quantifying data leakage in NLP performance evaluation\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,P\. Merlo, J\. Tiedemann, and R\. Tsarfaty \(Eds\.\),Online,pp\. 1325–1335\.External Links:[Link](https://aclanthology.org/2021.eacl-main.113/),[Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.113)Cited by:[§5\.6](https://arxiv.org/html/2604.20256#S5.SS6.p2.1)\.
- M\. Fang, Y\. Li, and T\. Cohn \(2017\)Learning how to active learn: a deep reinforcement learning approach\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,M\. Palmer, R\. Hwa, and S\. Riedel \(Eds\.\),Copenhagen, Denmark,pp\. 595–605\.External Links:[Link](https://aclanthology.org/D17-1063/),[Document](https://dx.doi.org/10.18653/v1/D17-1063)Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- Y\. Fu, X\. Zhu, and B\. Li \(2013\)A survey on instance selection for active learning\.Knowledge and information systems35\(2\),pp\. 249–283\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p6.1)\.
- Y\. Gal and Z\. Ghahramani \(2016\)Dropout as a bayesian approximation: representing model uncertainty in deep learning\.Ininternational conference on machine learning,pp\. 1050–1059\.Cited by:[§3\.2](https://arxiv.org/html/2604.20256#S3.SS2.p1.5)\.
- K\. Ghosh, C\. Bellinger, R\. Corizzo, P\. Branco, B\. Krawczyk, and N\. Japkowicz \(2024\)The class imbalance problem in deep learning\.Machine Learning113\(7\),pp\. 4845–4901\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p3.1)\.
- J\. Gonsior, C\. Falkenberg, S\. Magino, A\. Reusch, C\. Hartmann, M\. Thiele, and W\. Lehner \(2024\)Comparing and improving active learning uncertainty measures for transformer models by discarding outliers\.Information systems frontiers,pp\. 1–17\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- Y\. Gu, X\. Han, Z\. Liu, and M\. Huang \(2022\)PPT: pre\-trained prompt tuning for few\-shot learning\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8410–8423\.External Links:[Link](https://aclanthology.org/2022.acl-long.576/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.576)Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- H\. Hairani, T\. Widiyaningtyas, and D\. D\. Prasetya \(2024\)Addressing class imbalance of health data: a systematic literature review on modified synthetic minority oversampling technique \(smote\) strategies\.JOIV: International Journal on Informatics Visualization8\(3\),pp\. 1310–1318\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p3.1)\.
- W\. Han, D\. Martinez, V\. Rozova, L\. Cavedon, A\. Khanina, L\. J\. Worth, M\. A\. Slavin, K\. A\. Thursky, and K\. Verspoor \(2025\)Automated detection of invasive fungal infections in clinical reports using medical language models\.Studies in Health Technology and Informatics329,pp\. 1002–1007\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p5.1),[§2](https://arxiv.org/html/2604.20256#S2.p1.1),[§4\.3](https://arxiv.org/html/2604.20256#S4.SS3.p1.1)\.
- N\. Houlsby, F\. Huszár, Z\. Ghahramani, and M\. Lengyel \(2011\)Bayesian active learning for classification and preference learning\.arXiv preprint arXiv:1112\.5745\.Cited by:[§3\.1](https://arxiv.org/html/2604.20256#S3.SS1.p3.2)\.
- K\. Huang, J\. Altosaar, and R\. Ranganath \(2019\)ClinicalBERT: Modeling clinical notes and predicting hospital readmission\.CHIL 2020 Workshop\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- J\. Irvin, P\. Rajpurkar, M\. Ko, Y\. Yu, S\. Ciurea\-Ilcus, C\. Chute, H\. Marklund, B\. Haghgoo, R\. Ball, K\. Shpanskaya,et al\.\(2019\)Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison\.InProceedings of the AAAI conference on artificial intelligence,Vol\.33,pp\. 590–597\.Cited by:[§4\.1](https://arxiv.org/html/2604.20256#S4.SS1.p3.1)\.
- H\. E\. Jensen \(2021\)Histopathology in the diagnosis of invasive fungal diseases\.Current Fungal Infection Reports15\(1\),pp\. 23–31\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p4.1)\.
- D\. P\. Jeong, Z\. C\. Lipton, and P\. K\. Ravikumar \(2025\)LLM\-select: feature selection with large language models\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=16f7ea1N3p)Cited by:[Appendix C](https://arxiv.org/html/2604.20256#A3.p1.1),[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p3.1)\.
- A\. E\. Johnson, T\. J\. Pollard, S\. J\. Berkowitz, N\. R\. Greenbaum, M\. P\. Lungren, C\. Deng, R\. G\. Mark, and S\. Horng \(2019\)MIMIC\-CXR, a de\-identified publicly available database of chest radiographs with free\-text reports\.Scientific data6\(1\),pp\. 317\.Cited by:[§4\.1](https://arxiv.org/html/2604.20256#S4.SS1.p1.1)\.
- J\. M\. Johnson and T\. M\. Khoshgoftaar \(2019\)Survey on deep learning with class imbalance\.Journal of Big Data6\(1\),pp\. 1–54\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p3.1)\.
- A\. Kirsch, J\. Van Amersfoort, and Y\. Gal \(2019\)Batchbald: efficient and diverse batch acquisition for deep bayesian active learning\.Advances in neural information processing systems32\.Cited by:[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p1.5)\.
- J\. Lee, W\. Yoon, S\. Kim, D\. Kim, S\. Kim, C\. H\. So, and J\. Kang \(2020\)BioBERT: a pre\-trained biomedical language representation model for biomedical text mining\.Bioinformatics36\(4\),pp\. 1234–1240\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024a\)DeepSeek\-V3 Technical Report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p1.1)\.
- H\. Liu, D\. Tam, M\. Muqeeth, J\. Mohta, T\. Huang, M\. Bansal, and C\. A\. Raffel \(2022\)Few\-shot parameter\-efficient fine\-tuning is better and cheaper than in\-context learning\.Advances in Neural Information Processing Systems35,pp\. 1950–1965\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- Y\. Liu, H\. Wang, H\. Zhou, M\. Li, Y\. Hou, S\. Zhou, F\. Wang, R\. Hoetzlein, and R\. Zhang \(2024b\)A review of reinforcement learning for natural language processing and applications in healthcare\.Journal of the American Medical Informatics Association31\(10\),pp\. 2379–2393\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- P\. López\-Úbeda, M\. C\. Díaz\-Galiano, T\. Martín\-Noguerol, A\. Luna, L\. A\. Ureña\-López, and M\. T\. Martín\-Valdivia \(2020\)COVID\-19 detection in radiological text reports integrating entity recognition\.Computers in Biology and Medicine127,pp\. 104066\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- M\. Maimaiti, Y\. Liu, H\. Luan, and M\. Sun \(2021\)Enriching the transfer learning with pre\-trained lexicon embedding for low\-resource neural machine translation\.Tsinghua Science and Technology27\(1\),pp\. 150–163\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p2.1)\.
- D\. Martinez, M\. R\. Ananda\-Rajah, H\. Suominen, M\. A\. Slavin, K\. A\. Thursky, and L\. Cavedon \(2015\)Automatic detection of patients with invasive fungal disease from free\-text computed tomography \(CT\) scans\.Journal of Biomedical Informatics53,pp\. 251–260\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski,et al\.\(2015\)Human\-level control through deep reinforcement learning\.Nature518\(7540\),pp\. 529–533\.Cited by:[§3\.5](https://arxiv.org/html/2604.20256#S3.SS5.SSS0.Px3.p1.3)\.
- V\. Nguyen, M\. H\. Shaker, and E\. Hüllermeier \(2022\)How to measure uncertainty in uncertainty sampling for active learning\.Machine Learning111\(1\),pp\. 89–122\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1),[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p1.5)\.
- V\. Rozova, A\. Khanina, J\. Ong, R\. Alipour, L\. Worth, M\. Slavin, K\. Thursky, and K\. Verspoor \(2025\)PIFIR: PET\-CT Invasive Fungal Infection Reports\.PhysioNet\.Note:Version 1\.0\.0External Links:[Document](https://dx.doi.org/10.13026/d51v-j343),[Link](https://doi.org/10.13026/d51v-j343)Cited by:[§4\.1](https://arxiv.org/html/2604.20256#S4.SS1.p1.1)\.
- V\. Rozova, A\. Khanina, J\. C\. Teng, J\. S\. Teh, L\. J\. Worth, M\. A\. Slavin, K\. A\. Thursky, and K\. Verspoor \(2023a\)Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept\-level annotations\.Journal of Biomedical Informatics139,pp\. 104293\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p1.1)\.
- V. Rozova, A. Khanina, J. Teng, J. Teh, L. Worth, M. Slavin, K. Thursky, and K. Verspoor (2023b) CHIFIR: Cytology and Histopathology Invasive Fungal Infection Reports. PhysioNet. Note: Version 1.0.0. External Links: [Document](https://dx.doi.org/10.13026/fmj9-p237), [Link](https://doi.org/10.13026/fmj9-p237) Cited by: [§4.1](https://arxiv.org/html/2604.20256#S4.SS1.p1.1).
- C\. Tan, F\. Sun, T\. Kong, W\. Zhang, C\. Yang, and C\. Liu \(2018\)A survey on deep transfer learning\.InArtificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4\-7, 2018, Proceedings, Part III 27,pp\. 270–279\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p2.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p1.1)\.
- D\. W\. Townsend, J\. P\. Carney, J\. T\. Yap, and N\. C\. Hall \(2004\)PET/CT today and tomorrow\.Journal of Nuclear Medicine45\(1 suppl\),pp\. 4S–14S\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p4.1)\.
- P\. Wang, X\. Wang, C\. Lou, S\. Mao, P\. Xie, and Y\. Jiang \(2024\)Effective demonstration annotation for in\-context learning via language model\-based determinantal point process\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 1266–1280\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.74/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.74)Cited by:[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p1.5)\.
- Z\. Wang, T\. Schaul, M\. Hessel, H\. Hasselt, M\. Lanctot, and N\. Freitas \(2016\)Dueling network architectures for deep reinforcement learning\.InInternational conference on machine learning,pp\. 1995–2003\.Cited by:[§3\.5](https://arxiv.org/html/2604.20256#S3.SS5.SSS0.Px3.p1.3)\.
- K\. Weiss, T\. M\. Khoshgoftaar, and D\. Wang \(2016\)A survey of transfer learning\.Journal of Big Data3,pp\. 1–40\.Cited by:[§1](https://arxiv.org/html/2604.20256#S1.p2.1)\.
- C\. Yang, E\. A\. Fridgeirsson, J\. A\. Kors, J\. M\. Reps, and P\. R\. Rijnbeek \(2024\)Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data\.Journal of Big Data11\(1\),pp\. 7\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p3.1)\.
- Y\. Yang, Z\. Ma, F\. Nie, X\. Chang, and A\. G\. Hauptmann \(2015\)Multi\-class active learning by uncertainty sampling with diversity maximization\.International Journal of Computer Vision113,pp\. 113–127\.Cited by:[§2](https://arxiv.org/html/2604.20256#S2.p2.1)\.
- M\. Yuan, H\. Lin, and J\. Boyd\-Graber \(2020\)Cold\-start active learning through self\-supervised language modeling\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 7935–7948\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.637/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.637)Cited by:[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p1.5)\.
- J\. Zhang, Y\. Qin, R\. Pi, W\. Zhang, R\. Pan, and T\. Zhang \(2025\)TAGCOS: Task\-agnostic gradient clustered coreset selection for instruction tuning data\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 4671–4686\.Cited by:[§5\.1](https://arxiv.org/html/2604.20256#S5.SS1.p1.5)\.

## Appendix A Algorithm Pseudocode

We provide the pseudocode for the sample selection strategy of our approach below.

```
Algorithm 1: RL-based sample selection in RADS

Require: pool U_t, budget B, episodes N, utility u(·), diversity weight λ
Require: feature set F = { ℓ̄(x), PE(x), ~MI(x), |S|/B }, action set A = {0, 1}

 1: env ← RLSampleSelectionEnv(U_t, F, u, B, λ)
 2: Initialize online network Q_φ and target network Q̂_φ ← Q_φ
 3: Initialize replay buffer D ← ∅
 4: Initialize exploration rate ε
 5: for episode = 1 to N do
 6:     s ← env.reset(); done ← false
 7:     while not done do
 8:         a ← EpsGreedy(Q_φ, s, ε)
 9:         (s', r, done) ← env.step(a)
10:         Add (s, a, r, s', done) to D
11:         if |D| ≥ M then                  ▷ M is the minibatch size
12:             M_batch ← SampleMinibatch(D, M)
13:             UpdateNets(Q_φ, Q̂_φ, M_batch)
14:         end if
15:         s ← s'
16:     end while
17:     if episode mod K_upd = 0 then
18:         Q̂_φ ← Q_φ
19:     end if
20:     ε ← DecayEps(ε)
21: end for

Selection (greedy policy):
22: S ← ∅; s ← env.reset(); done ← false
23: while not done do
24:     id ← env.currentId()
25:     a ← argmax_{a' ∈ A} Q_φ(s, a')
26:     (s', _, done) ← env.step(a)
27:     if a = 1 then
28:         S ← S ∪ {id}
29:     end if
30:     s ← s'
31: end while
32: return S
```
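As a minimal illustration, the selection (greedy policy) phase of the algorithm can be sketched in Python with a stub Q-function in place of the trained dueling DQN. The `q_values` lookup and the in-order pool traversal are simplifications of the environment described above, not the paper's implementation:

```python
def greedy_select(pool, q_values, budget):
    """Greedy selection pass: visit the pool in order, take
    a = argmax over actions {0, 1} of Q(s, a), and keep ids with a = 1.

    q_values maps (sample_id, action) -> a scalar Q estimate; it stands in
    for the trained network evaluated on that step's state features.
    """
    selected = []
    for sample_id in pool:
        if len(selected) >= budget:  # spending the budget ends the episode
            break
        action = max((0, 1), key=lambda a: q_values[(sample_id, a)])
        if action == 1:
            selected.append(sample_id)
    return selected
```

Because the reject action can dominate for every remaining sample, the learned policy may stop accepting before the budget is exhausted, which is the behavior analyzed in the budget-utilization experiments.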

Table 6: Performance comparison of transfer learning from PIFIR to CHIFIR under zero-shot transfer and full-shot transfer.

Table 7: Transfer learning performance from PIFIR to CHIFIR with 8 samples selected in CHIFIR under different sample selection strategies. ΔF1 = F1(PIFIR) - F1(CHIFIR). CI = Confidence Interval. Columns 2-6: performance on CHIFIR (target); columns 7-11: performance on PIFIR (source).

| Strategy | Acc | F1 | Prec | Rec | AUC | Acc | F1 | Prec | Rec | AUC | ΔF1 | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 0.838 | 0.167 | 0.250 | 0.125 | 0.716 | 0.752 | 0.791 | 0.916 | 0.729 | 0.837 | – | – |
| Uncertainty | 0.788 | 0.267 | 0.286 | 0.250 | 0.656 | 0.905 | 0.938 | 0.909 | 0.968 | 0.880 | 0.671 | [0.387, 0.967] |
| Diversity | 0.154 | 0.267 | 0.154 | 1.000 | 0.591 | 0.738 | 0.849 | 0.738 | 1.000 | 0.883 | 0.584 | [0.409, 0.752] |
| LM-DPP | 0.154 | 0.267 | 0.154 | 1.000 | 0.489 | 0.738 | 0.849 | 0.738 | 1.000 | 0.815 | 0.583 | [0.409, 0.752] |
| TAGCOS | 0.769 | 0.143 | 0.167 | 0.125 | 0.545 | 0.929 | 0.954 | 0.912 | 1.000 | 0.827 | 0.811 | [0.529, 0.986] |
| BatchBALD | 0.846 | 0.000 | 0.000 | 0.000 | 0.486 | 0.714 | 0.793 | 0.852 | 0.742 | 0.742 | 0.793 | [0.654, 0.897] |
| **RADS** | **0.865** | **0.632** | **0.545** | 0.750 | **0.858** | 0.881 | 0.921 | 0.906 | 0.935 | 0.865 | **0.289** | [0.075, 0.599] |
## Appendix B Reproducibility

All experiments are conducted on a single NVIDIA A100 GPU. Key fine-tuning settings are: epochs = 15, learning rate = 2 × 10^-5, batch size = 8, max sequence length = 512, weight decay = 0.01, and early stopping with a patience of 3 epochs.

For uncertainty estimation, we use MC dropout with K = 10 stochastic forward passes. In RADS, we train a dueling DQN sampler for 300 episodes with ε-greedy exploration, decaying ε from 1.0 to 0.05 with a multiplicative factor of 0.995. We use an experience replay buffer of size 10000 and start network updates once at least one minibatch is available (batch size = 64). Both the online and target networks are optimized with Adam (learning rate = 10^-4) and discount factor γ = 0.95, and the target network is synchronized every 10 episodes. We set ρ = 0.9. The reward is defined as the prior-aware utility minus a diversity penalty computed from the nearest-neighbor distance in the mean log-probability space ℓ̄(x), with λ = 0.01.
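The ε schedule above is easy to write out explicitly. Under these exact constants (start 1.0, floor 0.05, factor 0.995), ε is still about 0.22 after the 300 training episodes; the floor would only be reached near episode 598:

```python
def epsilon_at(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Multiplicative epsilon-greedy decay with a floor, as described above."""
    return max(eps_min, eps_start * decay ** episode)
```

This means the sampler keeps a substantial amount of exploration throughout training, which is one way to read the stability results in the budget experiments.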

#### Dataset Split

Each dataset is split into training, validation, and test sets \(around 70%, 10%, and 20%\), preserving the original class balance\. Table[8](https://arxiv.org/html/2604.20256#A2.T8)shows the number of positive and negative samples in each split\.
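A class-preserving split of this kind can be sketched as follows; the fractions and random seed here are illustrative and not necessarily the exact values used to build the paper's splits:

```python
import random

def stratified_split(ids, labels, fracs=(0.7, 0.1, 0.2), seed=13):
    """Split ids into train/dev/test while preserving the class ratio.

    Pure-Python sketch of an ~70/10/20 stratified split: each class is
    shuffled and partitioned separately, so every split keeps roughly the
    original positive/negative balance.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in zip(ids, labels):
        by_class.setdefault(y, []).append(i)
    train, dev, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n = len(items)
        c1 = round(fracs[0] * n)
        c2 = c1 + round(fracs[1] * n)
        train += items[:c1]
        dev += items[c1:c2]
        test += items[c2:]
    return train, dev, test
```

With a 10%-positive pool of 100 reports, this yields 70/10/20 splits that each remain about 10% positive.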

Table 8: Class distribution for CHIFIR, MIMIC-CXR, and PIFIR across train, development, and test sets. P = number of positive reports, N = number of negative reports.

## Appendix C Prompt-guided LLM Selection Baseline

Table 9: Prompt-guided LLM selection results for CHIFIR to PIFIR transfer. Num is the number of selected PIFIR reports.

We compare RADS against a prompt-guided LLM selection baseline inspired by recent LLM-based data selection methods (Jeong et al., [2025](https://arxiv.org/html/2604.20256#bib.bib69)). In this baseline, we use an open-source medical LLM ([OpenBioLLM-8B](https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B)) to score each unlabeled target report according to its estimated usefulness for training the IFI classifier, and then select the top-k reports for annotation.

Table [9](https://arxiv.org/html/2604.20256#A3.T9) reports results for transfer from CHIFIR to PIFIR under different annotation budgets. We observe that the LLM-guided baseline is highly unstable in the ultra-low-budget regime: at budgets 1 and 4 it completely fails to recover positive target performance, and at budget 2 it yields only a marginal F1 improvement. Performance improves at larger budgets, suggesting that the LLM can identify some useful reports when more annotations are allowed. However, its overall behavior remains much less reliable than RADS in the low-budget regime that is central to our study.
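Once the LLM has scored each report, this baseline's selection step reduces to a top-k ranking. The score values below are placeholders, not actual OpenBioLLM outputs:

```python
def select_top_k(report_ids, scores, k):
    """Rank unlabeled reports by an LLM-assigned usefulness score, keep top-k.

    scores maps report id -> scalar usefulness estimate (here, dummy values
    standing in for the medical LLM's judgments).
    """
    return sorted(report_ids, key=lambda rid: scores[rid], reverse=True)[:k]
```

Because the ranking is pointwise and fixed, a single miscalibrated score at the top of the list can dominate the entire selection at budgets of 1 or 2, which is consistent with the instability observed above.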

## Appendix D Transfer Performance from PIFIR to CHIFIR

Table [6](https://arxiv.org/html/2604.20256#A1.T6) summarizes the baseline transfer performance. The zero-shot model trained on PIFIR exhibits severe performance degradation on CHIFIR, achieving very low accuracy and F1, indicating a substantial domain shift. In particular, the model tends to over-predict the positive class on CHIFIR (high recall but very low precision), which suggests that the decision boundary learned from PIFIR does not directly generalize to CHIFIR. When CHIFIR data are available for adaptation (full-shot), incorporating CHIFIR supervision mitigates this shift and improves target-domain performance, demonstrating that a small amount of target data is critical for reliable transfer.

We next evaluate whether our data selection strategy can maximize the benefit of limited target supervision. Table [7](https://arxiv.org/html/2604.20256#A1.T7) reports transfer learning results when only 8 CHIFIR samples are selected for fine-tuning under different sampling strategies. Across all baselines, we observe that other strategies are insufficient to bridge the transfer gap: they either fail to improve CHIFIR F1 or produce unstable behavior. In contrast, RADS achieves the strongest transfer performance on CHIFIR, yielding the best overall target metrics (Accuracy/F1/ROC-AUC) while maintaining high performance on the source domain. Importantly, RADS also produces the smallest transfer gap (ΔF1) among the compared methods, indicating that the selected CHIFIR samples lead to more effective adaptation without sacrificing the knowledge learned from PIFIR. The confidence interval of ΔF1 further suggests that RADS provides a more reliable and stable reduction of the transfer discrepancy compared to alternative selection strategies.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/C-P.png)

Figure 9: Concept-level KL divergence from CHIFIR to PIFIR.
## Appendix E: Robustness under Imbalanced Sampling

We evaluate the robustness of our sampling strategy under imbalanced sampling. For transfer learning from CHIFIR to PIFIR, we randomly select 5 samples from PIFIR with different positive-to-negative ratios. Each setting is repeated five times to obtain stable results. Figure [10](https://arxiv.org/html/2604.20256#A5.F10) shows the results. The best performance occurs when the positive-to-negative ratio is 1.00:0.00, which matches the class ratio selected by our method. This happens because CHIFIR has many more negative cases. Prioritizing positive target samples helps counter this imbalance and narrow the target-domain class distribution gap during transfer.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/1.png)

Figure 10: Class imbalance analysis of positive-to-negative sample ratios for CHIFIR to PIFIR transfer. Bars show mean values and black lines indicate variance.

Table 10: Top 10 terms with the highest TF–IDF scores in CHIFIR, PIFIR and MIMIC-CXR.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/JaccardSim.png)

Figure 11: Jaccard similarity heatmap between CHIFIR and PIFIR concepts.
## Appendix F: Learning Curves for MIMIC-CXR to CHIFIR Transfer under Varying Budgets

Figure [12](https://arxiv.org/html/2604.20256#A6.F12) shows the performance from MIMIC-CXR to PIFIR across budgets under two baselines, and Figure [13](https://arxiv.org/html/2604.20256#A6.F13) (left) shows our method's performance. MIMIC-CXR is larger and the zero-shot baseline is already strong, so the headroom for improvement is limited and our method performs similarly to the baselines. Figure [13](https://arxiv.org/html/2604.20256#A6.F13) (right) plots the domain gap ΔF1 against budget with 95% confidence intervals. Across budgets, the point estimates stay close to zero and the confidence intervals largely overlap, indicating a small residual gap and no clear separation between budgets in this ultra-low-resource setting.

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/BB-m.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/TG-m.png)

Figure 12: Transfer from MIMIC-CXR to PIFIR under baselines BatchBALD (left) and TAGCOS (right).

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/Our-m.png)

![Refer to caption](https://arxiv.org/html/2604.20256v1/figures/F1-m.png)

Figure 13: Transfer from MIMIC-CXR to PIFIR under our method RADS.
## Appendix G: Analysis of Differences in the CHIFIR, PIFIR, and MIMIC-CXR Datasets

CHIFIR contains 283 reports from 201 patients, with an average report length of 1,353 characters\. PIFIR contains 201 reports from 156 patients, with an average report length of 1,809 characters\. The MIMIC\-CXR subset contains 493 reports from 290 patients, with a shorter average report length of 677 characters\.

We first compute TF–IDF scores within each corpus and compare the top 10 highest-scoring terms between CHIFIR, PIFIR and MIMIC-CXR, as shown in Table [10](https://arxiv.org/html/2604.20256#A5.T10). CHIFIR is dominated by pathology- and specimen-centric language (e.g., cells, fluid, bronchial, biopsy, tissue, specimen), reflecting cytology/histopathology reporting that emphasizes sample type and microscopic description rather than imaging observations. PIFIR is characterized by PET-CT and metabolic-imaging terminology (e.g., uptake, FDG, PET, CT, activity), as well as systemic disease descriptors (e.g., marrow, disease), consistent with PET-driven assessment of metabolic activity and whole-body involvement. MIMIC-CXR is dominated by chest radiography vocabulary and common pulmonary findings (e.g., chest, pneumonia, pleural, effusion, pulmonary, lung), reflecting the focus of X-ray reports on thoracic anatomy and acute cardiopulmonary abnormalities. Overall, these TF–IDF profiles highlight substantial modality- and workflow-driven lexical shifts between the datasets, motivating domain-adaptive transfer methods that can operate under pronounced vocabulary mismatch. Figure [3](https://arxiv.org/html/2604.20256#S4.F3) also shows the transfer gap.
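As a rough sketch of this comparison, the following minimal pure-Python implementation ranks terms by summed TF–IDF within a corpus. It is our own illustration, not the paper's code: the function name `top_tfidf_terms`, the whitespace tokenization, and the toy reports are all assumptions standing in for the restricted-access clinical corpora.

```python
import math
from collections import Counter

def top_tfidf_terms(reports, k=10):
    """Rank terms by summed TF-IDF across a corpus (simple whitespace tokens)."""
    docs = [r.lower().split() for r in reports]
    n = len(docs)
    # Document frequency: number of reports containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n / df[term]) + 1.0  # +1 keeps ubiquitous terms non-zero
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(k)]

# Toy stand-ins for the clinical reports (illustrative only):
reports = [
    "fdg uptake on pet",
    "uptake and fdg activity",
    "chest xray normal",
]
print(top_tfidf_terms(reports, k=2))  # modality terms rank highest
```

A production pipeline would add stop-word removal and clinical tokenization, but the ranking logic is the same.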

For the CHIFIR and PIFIR datasets, expert annotators also provided span-level annotation of concepts relevant to disease detection. Concept annotations in the CHIFIR and PIFIR datasets are listed in Table [13](https://arxiv.org/html/2604.20256#A7.T13) and Table [12](https://arxiv.org/html/2604.20256#A7.T12). The CHIFIR dataset contains 1,155 concepts, and the PIFIR dataset contains 3,194 concepts. The two corpora serve different clinical niches. CHIFIR comes from cytology and histopathology notes and therefore focuses on microbiology terms such as FungalDescriptor and Stain. PIFIR is built from PET-CT reports and centres on imaging findings and risk factors, for example Abnormality_CT and Risk_factor.

Table 11: Representative (de-identified) report excerpts from each dataset.

To quantify overlap, we compute the Jaccard similarity between the concept vocabularies:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}\tag{14}$$
where $A$ and $B$ are the sets of surface forms in CHIFIR and PIFIR, respectively. Figure [11](https://arxiv.org/html/2604.20256#A5.F11) plots the resulting heatmap. Although both datasets include the classification terms *positive*, *equivocal* and *negative*, their lexical realizations share little common ground, so the Jaccard scores remain low.
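Eq. (14) can be checked with a few lines of Python; the two toy vocabularies below are hypothetical surface forms for illustration, not drawn from the actual annotations.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two concept vocabularies (Eq. 14)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical surface forms: the class labels overlap, the realizations do not.
chifir_terms = {"positive", "equivocal", "negative", "hyphae"}
pifir_terms = {"positive", "equivocal", "negative", "fdg avid"}
print(jaccard(chifir_terms, pifir_terms))  # 3 shared / 5 total = 0.6
```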

Table 12: Summary statistics for the IFI-related concepts in the PIFIR dataset.

Table 13: Summary statistics for the IFI-related concepts in the CHIFIR dataset.

To quantify directional lexical divergence, we compute the KL divergence at the concept level. For each concept, $P$ and $Q$ denote the distributions of surface forms in PIFIR and CHIFIR, respectively. Both are smoothed over the combined vocabulary $\mathcal{V}=\mathcal{V}_{\text{PIFIR}}\cup\mathcal{V}_{\text{CHIFIR}}$ with a small $\varepsilon$ to avoid zeros.

$$KL(P\,\|\,Q)=\sum_{v\in\mathcal{V}}P(v)\log\frac{P(v)}{Q(v)},\tag{15}$$

where

$$P(v)=\frac{\mathrm{count}_{\text{PIFIR}}(v)+\varepsilon}{\sum_{u\in\mathcal{V}}\mathrm{count}_{\text{PIFIR}}(u)+\varepsilon\,|\mathcal{V}|}.\tag{16}$$
We compute $\mathrm{KL}(\text{CHIFIR}\,\|\,\text{PIFIR})$ and visualize the results as a heatmap (Figure [9](https://arxiv.org/html/2604.20256#A4.F9)). In KL divergence, larger values indicate greater mismatch between the two datasets. Even for the shared classification terms (*positive*, *equivocal*, *negative*), the divergence values remain large. This suggests that the two datasets differ systematically in how their concepts are expressed, not simply in whether specific words occur.
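A minimal sketch of the smoothed KL computation in Eqs. (15)-(16) follows; the function name `smoothed_kl` and the toy surface-form counts are our own illustration, not the authors' implementation.

```python
import math

def smoothed_kl(counts_p, counts_q, eps=1e-6):
    """KL(P || Q) with epsilon smoothing over the combined vocabulary (Eqs. 15-16)."""
    vocab = set(counts_p) | set(counts_q)

    def prob(counts):
        # Eq. (16): add eps to every count, normalize over the union vocabulary.
        total = sum(counts.get(v, 0) for v in vocab) + eps * len(vocab)
        return {v: (counts.get(v, 0) + eps) / total for v in vocab}

    p, q = prob(counts_p), prob(counts_q)
    # Eq. (15): sum of P(v) * log(P(v) / Q(v)) over the vocabulary.
    return sum(p[v] * math.log(p[v] / q[v]) for v in vocab)

# Hypothetical surface-form counts for one shared concept (illustrative only):
pifir = {"fdg avid": 8, "hypermetabolic": 4}
chifir = {"fungal elements": 9, "hypermetabolic": 1}
print(smoothed_kl(chifir, pifir))  # large value: little overlap in realizations
```

Because KL divergence is asymmetric, $\mathrm{KL}(\text{CHIFIR}\,\|\,\text{PIFIR})$ and the reverse direction generally differ; the smoothing ensures both are finite even when a surface form appears in only one corpus.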

#### Representative Report Examples

To qualitatively illustrate domain- and modality-specific language, we provide representative (de-identified) report excerpts from each corpus in Table [11](https://arxiv.org/html/2604.20256#A7.T11).
