Effect of Demographic Bias on Skin Lesion Classification

arXiv cs.AI Papers

Summary

This paper investigates the impact of demographic bias (sex and age) on skin lesion classification using ResNet models, finding that sex biases stem from data imbalances while age biases consistently favor younger groups, and evaluating multi-task and adversarial learning mitigation strategies.

arXiv:2606.03214v1 Announce Type: new Abstract: In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

Cached at: 06/03/26, 09:43 AM

# Effect of Demographic Bias on Skin Lesion Classification
Source: [https://arxiv.org/html/2606.03214](https://arxiv.org/html/2606.03214)
MELBA ID: 2026:011; first page: 200; year: 2026; submitted: 2025\-04\-21; published: 2026\-05\-29

Special issue: Special issue on Fairness of AI in Medical Imaging \(FAIMI\)

Special issue editors: Veronika Cheplygina, Aasa Feragen, Andrew King, Ben Glocker, Enzo Ferrante, Eike Petersen, Esther Puyol\-Antón, Melanie Ganz\-Benjaminsen

Short heading: Demographic Bias in Skin Lesions

Authors: Raumanns, Schouten, Pluim and Cheplygina

Affiliations: \(1\) Fontys University of Applied Science, Venlo, The Netherlands; \(2\) Fontys University of Applied Science, Eindhoven, The Netherlands; \(3\) Eindhoven University of Technology, Eindhoven, The Netherlands; \(4\) IT University of Copenhagen, Denmark

Gerard Schouten \(2\), ORCID 0000\-0001\-7042\-2143; Veronika Cheplygina \(4\), ORCID 0000\-0003\-0176\-9324; Josien P\.W\. Pluim \(3\), ORCID 0000\-0001\-7327\-9178

###### Abstract

The influence of bias in datasets on the fairness of model predictions is a topic of ongoing research in various fields\. In this study, we evaluate the performance of skin lesion classification using ResNet\-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age\. We use a linear programming method to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects\. Three distinct learning strategies are evaluated: a single\-task model, a reinforcing multi\-task model, and an adversarial learning scheme\.

Our sex\-based analysis indicates that sex\-specific training datasets optimise model performance\. Notably, including male patients in the training data improved performance for the male subgroup, even in female\-majority cases\. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female\-majority datasets\. However, these strategies proved less effective in male\-majority settings, where models continued to perform better for males than females\. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations\.

Age\-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories\. Younger groups consistently achieve the highest performance, regardless of training data distribution\. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories\.

We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution\. These distinct mechanisms require targeted mitigation strategies\. Our work aims to advance equitable AI in medical imaging by addressing these specific sources of disparity\.

Additionally, cross\-dataset validation on two external datasets revealed that domain shifts notably affect performance and demographic bias patterns\.

The source code and models are available on GitHub: [https://github\.com/raumannsr/demographic\-fairness\-extended](https://github.com/raumannsr/demographic-fairness-extended)\.

###### keywords:

Skin lesions, Bias, Fairness, Multi\-task learning, Adversarial learning, Cross\-dataset analysis

###### doi:

10\.59275/j\.melba\.2026\-4156

volume: 2026

## 1 Introduction

Deep learning has shown many successes in the diagnosis of medical images, as demonstrated by several studies \(Saha et al\. \([2024](https://arxiv.org/html/2606.03214#bib.bib66)\); Esteva et al\. \([2017](https://arxiv.org/html/2606.03214#bib.bib21)\); Bejnordi et al\. \([2017](https://arxiv.org/html/2606.03214#bib.bib22)\)\)\. Despite the high overall performance, however, models can be biased against patients from different demographic groups, a concern highlighted in recent work \(Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\); Larrazabal et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1423)\); Gichoya et al\. \([2022b](https://arxiv.org/html/2606.03214#bib.bib1507)\)\)\. Bias and fairness have therefore become central research topics in medical imaging, with studies focusing, for example, on skin lesions \(Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\); Groh et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib33)\)\), chest radiographs \(Larrazabal et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1423)\)\) and brain magnetic resonance imaging \(Petersen et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib1679)\)\)\. Sensitive attributes commonly examined are age, sex, and race\. For the classification of skin lesions, the Fitzpatrick skin type is often studied \(Seth and Pai \([2024](https://arxiv.org/html/2606.03214#bib.bib67)\); Benčević et al\. \([2024](https://arxiv.org/html/2606.03214#bib.bib6)\); Groh et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib33)\); Wu et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib76)\)\)\.

While deep learning models continue to advance diagnostic capabilities, their fairness remains a significant concern because model performance is fundamentally tied to the quality and representativeness of the training data, as well as the model’s ability to mitigate any bias embedded in the training dataset\.

Although bias and fairness in AI for medical imaging have gained attention, prior studies have often examined individual demographic factors in isolation, typically within a single imaging modality or without systematic control over data distributions\. A comprehensive evaluation comparing how these demographic attributes, when systematically skewed, influence model performance across different learning strategies \(single\-task, reinforcing, and adversarial\) is lacking\. Moreover, the relative effectiveness of debiasing approaches across specific demographic subgroups, particularly under extreme distributional imbalances, remains unexplored\. Additionally, the utility of auxiliary demographic prediction heads as fairness indicators has not been systematically assessed\.

In this paper, we define dataset bias \(also known as representation bias\) strictly as demographic bias, meaning any systematic imbalance in age, sex, or other protected attributes within the training set\. Such imbalances lead to unbalanced learning and performance gaps between subgroups\. We examine both demographic and model bias, measuring how controlled skews in the training data affect performance, and testing multi\-task learning strategies designed to mitigate model bias\. Using a balanced test set, we quantify the degree to which demographic bias propagates to model bias and identify the most effective approaches for equitable skin\-lesion classification across age and sex groups\.

This manuscript substantially extends our FAIMI 2024 workshop paper \(Raumanns et al\. \([2025](https://arxiv.org/html/2606.03214#bib.bib61)\)\)\. The workshop paper evaluated five distributions of male/female patients \(sex demographics\) with three learning strategies \(one single\-task and two multi\-task models\)\. The evaluation focused on overall and subgroup\-specific performance to assess whether training data distribution biases manifested in results when tested on a balanced test set\.

Extending our FAIMI 2024 workshop paper, we present the following contributions:

1. We extend our linear programming \(LP\) method to control age subgroups in addition to sex, introducing five age groups and three skewed age distributions\.
2. We systematically evaluate two bias mitigation strategies \(reinforcing multi\-task and adversarial\) across various age and sex subgroups\. By presenting both overall and subgroup\-specific metrics, we determine how each strategy performs under different conditions\. This includes two new sex\-distribution scenarios, namely predominantly male and predominantly female patients, which enable a more granular evaluation of the models\.
3. Beyond the internal hold\-out validation, we extend our external validation from the prior study\. We introduce a new dermatoscopic skin\-lesion dataset in this work\. Alongside the retained smartphone dataset, the datasets facilitate testing across diverse geographical regions, acquisition methods, and demographic groups\.
4. We analyse the auxiliary age\-prediction head to assess its utility as a fairness indicator\.

## 2 Related work

We revisit prior studies on demographic bias and fairness in medical imaging, highlighting how earlier work has examined demographic disparities, bias mitigation techniques such as multi\-task and adversarial learning, and the limitations that motivate our more systematic analysis\.

##### Understanding representation bias

Demographic bias in medical imaging, referring to performance disparities across protected attributes \(such as biological sex, race, age, and skin tone\), has been extensively studied, revealing how these imbalances can cause unfair or discriminatory outcomes in healthcare\. Glocker et al\. showed that a widely used chest radiography foundation model actually encodes protected attributes, like biological sex and race, leading to statistically significant performance gaps across those subpopulations \(Glocker et al\. \([2023](https://arxiv.org/html/2606.03214#bib.bib30)\)\)\. Vaidya et al\. reported that deep learning pathology models exhibit racial bias, as demonstrated on large publicly available cancer imaging datasets \(Vaidya et al\. \([2024](https://arxiv.org/html/2606.03214#bib.bib73)\)\)\.

Demographic bias in machine learning manifests in various forms, with representation bias being particularly significant in healthcare\. Representation bias occurs when certain demographic groups are underrepresented in training data, leading to reduced model performance for these groups \(Larrazabal et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1423)\)\)\. This differs from bias caused by inherent anatomical or physiological differences between groups, though these can contribute to representation bias when they affect data collection, for example, clinical protocols that exclude pregnant patients for safety reasons \(Seyyed\-Kalantari et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib68)\)\)\.

Sies et al\. assessed a market\-approved skin cancer CNN and documented a male predominance in the training data\. Despite this imbalance, performance on a balanced test set showed no statistically significant sex\-related disparity, suggesting the extensive training set mitigated the imbalance effect \(Sies et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib70)\)\)\. Conversely, even with deliberately balanced datasets, intrinsic anatomical differences can still generate bias\. Klingenberg et al\. demonstrated this by showing that a CNN trained on a sex\-balanced MRI cohort for Alzheimer’s detection performed markedly better in female patients than in male patients, underscoring that demographic bias can arise from physiological factors rather than merely data imbalance \(Klingenberg et al\. \([2023](https://arxiv.org/html/2606.03214#bib.bib46)\)\)\.

Understanding how representation bias influences model performance is essential for building fair systems\. By identifying underrepresented populations or those with the poorest performance, targeted corrections can be applied to the dataset\. Moreover, understanding these effects provides insights for designing future datasets, allowing researchers to avoid similar problems early on\.

##### Role of demographics

The role of demographics in medical AI is multifaceted\. Some demographic variations reflect genuine biological differences that models should take into account; for example, patient characteristics such as age and sex significantly influence the predictive precision of health markers, such as blood pressure, in retinal image analysis \(Gerrits et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib26)\)\)\. Deep learning models can extract demographic characteristics, such as sex and age, directly from medical images, such as chest X‑rays, with high accuracy \(Gichoya et al\. \([2022a](https://arxiv.org/html/2606.03214#bib.bib29)\); Jones and Glocker \([2025](https://arxiv.org/html/2606.03214#bib.bib41)\)\)\. This capability offers applications in forensic investigations, aiding identification and uncovering novel anatomical landmarks for sex and age determination \(Yi et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib80)\)\)\.

However, it is crucial to distinguish these valid demographic correlations from problematic representation bias, which often relates to data collection practices rather than physiological differences\. Our research addresses this by deliberately building datasets with specific demographic imbalances, helping us determine whether performance disparities are due to true physiological factors or simply to data collection\.

##### Addressing bias

Research on fairness usually involves baseline studies that demonstrate bias between groups and/or suggest methods to enhance fairness\. These approaches mainly tackle representation bias through sampling or weighting strategies during training \(Groh et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib33)\)\)\. Alternatively, they implement architectural techniques that prevent models from depending on sensitive attributes, such as adversarial learning \(Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\)\)\. For example, Yang et al\. developed an adversarial framework to mitigate biases arising from hospital location and patient ethnicity \(Yang et al\. \([2023](https://arxiv.org/html/2606.03214#bib.bib79)\)\)\. Wu et al\. introduced FairPrune, which trims parameters based on their importance to both privileged and unprivileged groups \(Wu et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib76)\)\)\. Other methods focus on data augmentation\. Stanley et al\. proposed a synthetic bias framework for brain MRI\. They showed that simple sample reweighting effectively reduces hidden biases \(Stanley et al\. \([2024](https://arxiv.org/html/2606.03214#bib.bib71)\)\)\. Ktena et al\. demonstrated that diffusion\-generated synthetic images improve fairness across histopathology, chest X\-ray, and dermatology datasets \(Ktena et al\. \([2024](https://arxiv.org/html/2606.03214#bib.bib44)\)\)\.

Commonly used datasets for studying demographic bias in skin lesion classification include the ISIC skin lesion datasets \(Gutman et al\. \([2016](https://arxiv.org/html/2606.03214#bib.bib36)\); Codella et al\. \([2018](https://arxiv.org/html/2606.03214#bib.bib15), [2019](https://arxiv.org/html/2606.03214#bib.bib16)\); Tschandl et al\. \([2018](https://arxiv.org/html/2606.03214#bib.bib72)\); Combalia et al\. \([2019](https://arxiv.org/html/2606.03214#bib.bib17)\); Rotemberg et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib74)\)\) and Fitzpatrick\-17K \(Groh et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib33), [2022](https://arxiv.org/html/2606.03214#bib.bib34)\)\)\. However, researchers typically rely on pre\-provided data splits or stratify by a single demographic attribute \(e\.g\., male vs female\)\. Crucially, these methods often fail to control for the interplay between attributes, treating sex and age as independent variables rather than managing their joint distribution\. In contrast, our linear programming approach explicitly enforces constraints on both sex and age simultaneously, ensuring that specific subgroups \(such as older males or younger females\) are accurately represented according to the desired ratios\.

##### Bias mitigation approaches

Our current study builds on two crucial insights from medical imaging: multi\-task learning and shortcut learning \(Geirhos et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib25)\); Nauta et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib57)\)\)\. The reinforcing model uses multi\-task learning, which trains multiple related tasks simultaneously, aiming to enhance the model’s generalisability while also reducing bias in two ways\. Firstly, when the same hidden layers support multiple related tasks \(Ruder \([2017](https://arxiv.org/html/2606.03214#bib.bib1183)\)\), the network must learn features that work across various contexts\. Benefiting one task over another is not the goal, as this would diminish performance on other tasks\. Joint training of the tasks improves generalisability, as each task regularises the others \(Caruana \([1993](https://arxiv.org/html/2606.03214#bib.bib11)\)\)\. Secondly, training a multi\-task model requires more diverse data than a single\-task model; in addition to medical image data, demographics are also included in the training\. This learning approach exposes the model to a broader range of data, which may help reduce the influence of patterns that are prominent only in a subset of the data\. Having more data lets multi\-task models build stronger, more general features that work across several tasks, helping prevent overfitting \(Zhang and Yang \([2022](https://arxiv.org/html/2606.03214#bib.bib83)\)\)\. This suggests that incorporating an auxiliary task, specifically addressing potential bias factors such as age and sex, alongside the primary binary classification \(malignant or not\), could help mitigate bias\.

In addition to the standard multi\-task approach, our study also employs an adversarial model approach, a special variant of multi\-task learning\. As previously demonstrated \(Adeli et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib2)\); Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\)\), this strategy reduces output bias to some degree through adversarial training\. The model aims to minimise bias by decreasing the mutual information between learned features and the protected attribute, employing a negative\-squared Pearson correlation loss for age and binary cross\-entropy for sex\.
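The two loss terms named above can be written down directly. The following is a minimal pure-Python sketch of a negative squared Pearson correlation (for a continuous attribute such as age) and a mean binary cross-entropy (for a binary attribute such as sex). The function names are hypothetical, and the sketch does not show how the authors weight these terms or combine them with the diagnosis loss, since the text above does not specify that.

```python
import math

def neg_squared_pearson(pred, target):
    """Return -rho^2 between two equal-length sequences (0.0 if either is constant)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        # Correlation is undefined for a constant sequence; treat it as no correlation.
        return 0.0
    rho = cov / math.sqrt(var_p * var_t)
    return -(rho ** 2)

def binary_cross_entropy(prob, label, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(prob)
```

The correlation term ranges from -1 (predictions perfectly linearly related to the true attribute, in either direction) to 0 (no linear relationship), so it measures how much linear information about the attribute a predictor can recover.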

We use the auxiliary head’s performance as a diagnostic tool for the reinforcing model\. Moderate to high accuracy confirms that the demographic signal has been learned and that regularisation is active, serving as a direct indicator that the debiasing mechanism is functioning properly\.

Studies have explored different approaches to handling demographic attributes in model training\. Some use demographics within multi\-task learning settings \(Liu et al\. \([2019](https://arxiv.org/html/2606.03214#bib.bib49)\)\), where attributes reinforce diagnosis during optimisation\. This contrasts with more recent adversarial strategies \(Adeli et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib2)\); Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\)\) that specifically aim to reduce representation bias by preventing models from predicting sensitive attributes\. Additionally, representation bias can be confounded by correlations between demographics and imaging characteristics, leading to shortcut learning\. These characteristics include variations in imaging devices \(such as different scanner types or image acquisition protocols\) and technical artefacts such as surgical markers or medical instruments \(Willemink et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1386)\); Jiménez\-Sánchez et al\. \([2023](https://arxiv.org/html/2606.03214#bib.bib1649)\); Gichoya et al\. \([2022b](https://arxiv.org/html/2606.03214#bib.bib1507)\); Bissoto et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib9)\)\)\. For example, Bevan and Atapour\-Abarghouei specifically demonstrated how these technical artefacts can introduce bias in the classification of skin lesions, developing methods to identify and mitigate their impact \(Bevan and Atapour\-Abarghouei \([2023](https://arxiv.org/html/2606.03214#bib.bib7)\)\)\. In such cases, addressing representation bias requires considering multiple confounding factors, as balancing data for one demographic attribute may leave other sources of bias unaddressed\.

We aim to comprehensively evaluate AI‑based skin‑lesion classification models across demographic groups, with a particular focus on identifying and mitigating representation bias\. Whereas Sies et al\. examined a market\-approved CNN with its uncontrolled training set and considered only sex bias \(Sies et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib70)\)\), we deliberately construct subsets with exact sex and age ratios to study how the combination of demographic skews affects performance\.

## 3 Methods

To evaluate the impact of demographic imbalance in training data on skin\-lesion classification, we conducted two parallel experiments: one manipulating the distribution of patient sex and another modifying the age distribution\. In the sex\-based analysis, we created datasets with different male\-to\-female ratios\. In the age\-based analysis, we built datasets with skewed age profiles, favouring younger, older, or balanced age groups, while keeping a 1:1 sex ratio\. Both analyses followed the same methodological pipeline, using linear programming, with the only difference being the demographic attribute constrained during dataset creation\. We first describe the data collection and preprocessing steps, then outline the model architectures and evaluation methods\.

### 3\.1 Data

Table 1: Overview of the curated skin\-lesion datasets used in this study\.

We used three skin\-lesion datasets: a curated ISIC subset \(for training, validation, and internal testing\), plus PAD\-UFES\-20 and DERM7PT \(for external testing only\)\. The ISIC subset was derived from the full archive after preprocessing \(Section [3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1)\), with controlled demographic distributions for sex and age\. Figure [1](https://arxiv.org/html/2606.03214#S3.F1) illustrates representative samples\. Dermoscopic images \(ISIC and DERM7PT, respectively, in the left and middle panels\) show greater detail and subsurface structures, particularly with polarised dermoscopy, potentially improving diagnostic accuracy \(Kittler et al\. \([2002](https://arxiv.org/html/2606.03214#bib.bib45)\)\)\. Smartphone images \(PAD\-UFES\-20\) exhibit greater variation in lighting, angle, and background\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x1.jpg)\(a\)F\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/x2.jpg)\(b\)M\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/x3.jpg)\(c\)F\. benign
![Refer to caption](https://arxiv.org/html/2606.03214v1/x4.jpg)\(d\)M\. benign

![Refer to caption](https://arxiv.org/html/2606.03214v1/x5.jpg)\(e\)F\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/x6.jpg)\(f\)M\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/03_methods/figures/Nbl086.jpg)\(g\)F\. benign
![Refer to caption](https://arxiv.org/html/2606.03214v1/x7.jpg)\(h\)M\. benign

![Refer to caption](https://arxiv.org/html/2606.03214v1/03_methods/figures/PAT_636_1204_521.png)\(i\)F\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/03_methods/figures/PAT_265_406_276.png)\(j\)M\. malignant
![Refer to caption](https://arxiv.org/html/2606.03214v1/03_methods/figures/PAT_359_4450_86.png)\(k\)F\. benign
![Refer to caption](https://arxiv.org/html/2606.03214v1/03_methods/figures/PAT_868_1657_698.png)\(l\)M\. benign

Figure 1: Comparison of skin lesion images, with four representative lesions per panel\. The left panel shows ISIC dermoscopic images, the middle panel presents DERM7PT dermoscopic images, and the right panel displays PAD‑UFES‑20 smartphone‑captured images\. In each panel, the top row contains malignant lesions from a male \(M\.\) and a female \(F\.\) patient, while the bottom row shows benign lesions from male and female patients, illustrating the visual characteristics across the different sources\.

#### 3\.1\.1 Collection and preprocessing

##### ISIC\-based dataset

We used the ISIC archive’s gallery browser \(Gutman et al\. \([2016](https://arxiv.org/html/2606.03214#bib.bib36)\); Codella et al\. \([2018](https://arxiv.org/html/2606.03214#bib.bib15), [2019](https://arxiv.org/html/2606.03214#bib.bib16)\); Tschandl et al\. \([2018](https://arxiv.org/html/2606.03214#bib.bib72)\); Combalia et al\. \([2019](https://arxiv.org/html/2606.03214#bib.bib17)\); Rotemberg et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib74)\); [ISIC2024](https://arxiv.org/html/2606.03214#bib.bib40)\), which contained 81,155 dermoscopic images of skin lesions with associated age and sex metadata\. The archive was queried for dermoscopic images with diagnoses of "benign" or "malignant" in all age groups and both sexes, yielding 71,035 images \(62,439 benign, 8,596 malignant\)\. After data collection, we performed several preprocessing steps to ensure data quality\. First, we removed cases lacking age attribute values, leaving 70,843 lesions \(62,291 benign and 8,552 malignant\)\. We then removed duplicate images by comparing MD5 hash values, following Cassidy’s method \(Cassidy et al\. \([2022](https://arxiv.org/html/2606.03214#bib.bib12)\)\)\. After duplicate elimination, 69,982 lesions remained \(61,472 benign and 8,510 malignant\)\. Finally, we identified multiple images of the same patient \(multiplets\) using patient ID attributes and excluded them, resulting in 35,884 lesions \(28,810 benign and 7,074 malignant\)\. This removal reduces bias by preventing a single patient from disproportionately influencing the model and eliminates the risk of data leakage across train/validation/test splits\. Among the benign lesions, 13,207 were from female patients and 15,603 from male patients\. Among malignant lesions, 3,012 were from female patients and 4,062 from male patients\.
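The three curation steps described above (dropping records without age, removing MD5 duplicates, excluding multiplets) can be sketched as follows. This is an illustrative sketch, not the authors' code: the record schema (`image_bytes`, `age`, `patient_id`) and the function name are hypothetical. It also assumes one particular reading of "excluded them", namely that every patient with more than one image is dropped entirely; keeping a single image per patient is another possible reading that the text does not rule out.

```python
import hashlib
from collections import Counter

def curate_isic(records):
    """Apply the three curation steps to a list of record dicts.

    Each record is assumed (hypothetically) to carry the keys
    'image_bytes', 'age' and 'patient_id'.
    """
    # Step 1: drop records that have no age value.
    with_age = [r for r in records if r.get("age") is not None]

    # Step 2: keep the first record for each distinct MD5 digest of the image bytes.
    seen_digests = set()
    unique = []
    for r in with_age:
        digest = hashlib.md5(r["image_bytes"]).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            unique.append(r)

    # Step 3 (assumed reading): exclude every patient who contributes
    # more than one remaining image.
    per_patient = Counter(r["patient_id"] for r in unique)
    return [r for r in unique if per_patient[r["patient_id"]] == 1]
```

Running the duplicate check before the multiplet check matters under this reading: a patient whose only repeated image is a byte-identical duplicate stays in the dataset with one image.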

##### PAD\-UFES\-20\-based dataset

![Refer to caption](https://arxiv.org/html/2606.03214v1/x8.png)

Figure 2: Age distribution across sex and diagnosis type in the curated PAD\-UFES\-20 dataset, showing the breakdown between male and female patients with benign versus malignant skin lesions\. For a detailed breakdown of lesion counts, see Appendix [C](https://arxiv.org/html/2606.03214#A3)\.

For external validation, we used the PAD\-UFES\-20 dataset \(Pacheco et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1581)\)\), which comprises clinical skin\-lesion photographs taken with smartphones from patients in Brazil\. The original collection comprised 2,298 records\. To prepare the data for cross\-dataset validation, we performed a sequence of cleaning operations to ensure completeness, consistency, and nonredundancy\. First, we removed all entries lacking a sex label \(804 rows in total\), leaving a fully sex\-annotated cohort \(741 male and 753 female patients\)\. No records were missing age information, eliminating the need for further imputation\. Next, we excluded any lesion without an associated biopsy result, thereby ensuring that every sample used for model evaluation had a definitive pathological ground truth; this filtering reduced the set to 1,179 cases\. We then consolidated diagnostic labels into two broad categories: malignant \(Melanoma, Basal Cell Carcinoma, Squamous Cell Carcinoma\) and benign \(Actinic keratosis, Nevus, and Seborrheic keratosis\)\. Finally, we checked for duplicate entries representing the same patient\-lesion pair and found none\. The final curated dataset comprises a malignant subset of 432 male and 401 female patients, and a benign subset of 148 male and 198 female patients, resulting in a dataset of 1,179 unique records\. Figure [2](https://arxiv.org/html/2606.03214#S3.F2) illustrates the age distribution stratified by sex for both malignant and benign cases in the curated dataset\.
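As a rough illustration of the filtering and label consolidation just described, the sketch below keeps only sex-annotated, biopsy-confirmed rows and maps each diagnosis to a binary label. The row keys (`sex`, `biopsied`, `diagnosis`) and the function name are hypothetical and need not match the column names of the released dataset; the diagnosis names are the ones listed in the text.

```python
MALIGNANT = {"Melanoma", "Basal Cell Carcinoma", "Squamous Cell Carcinoma"}
BENIGN = {"Actinic keratosis", "Nevus", "Seborrheic keratosis"}

def curate_pad(rows):
    """Filter rows and attach a binary 'label' ('malignant' or 'benign')."""
    curated = []
    for row in rows:
        if row.get("sex") is None:
            continue  # no sex label
        if not row.get("biopsied"):
            continue  # no pathological ground truth
        diagnosis = row["diagnosis"]
        if diagnosis in MALIGNANT:
            label = "malignant"
        elif diagnosis in BENIGN:
            label = "benign"
        else:
            # Fail loudly rather than silently guessing a class.
            raise ValueError(f"unmapped diagnosis: {diagnosis!r}")
        curated.append({**row, "label": label})
    return curated
```

The counts reported in the text are internally consistent with this sequence: 2,298 - 804 = 1,494 = 741 + 753 sex-annotated records, and 432 + 401 + 148 + 198 = 1,179 records in the final set.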

##### DERM7PT\-based dataset

We used the publicly released dermoscopic collection \(Kawahara et al\. \([2018](https://arxiv.org/html/2606.03214#bib.bib42)\)\), comprising 1,011 cases originally curated for the Interactive Atlas of Dermoscopy by Argenziano et al\. \(Argenziano and others \([2000](https://arxiv.org/html/2606.03214#bib.bib4)\)\)\. Each case includes a dermoscopic image, a clinical image, patient metadata, and eight labels \(seven 7\-point checklist criteria plus diagnosis\)\. Sex metadata is available for all samples, though age information is absent\. We grouped diagnostic codes into two categories: benign lesions \(nevi, dermatofibromas, lentigines, melanoses, vascular lesions, and seborrheic keratoses\) and malignant lesions \(basal cell carcinoma and all melanoma subtypes, including in situ, invasive, and metastatic\)\. The malignant subset comprises 160 female and 134 male patients, while the benign subset includes 362 female and 355 male patients\. For analysis, we utilised only the dermoscopic images\.

Table [1](https://arxiv.org/html/2606.03214#S3.T1) provides an overview of the curated datasets used in this study, including their modalities, sample sizes, geographic origins, and demographic distributions\.

#### 3\.1\.2 Dataset creation

We developed a method \(Raumanns et al\. \([2025](https://arxiv.org/html/2606.03214#bib.bib61)\)\) to create diverse dataset compositions using linear programming \(LP\), a standard mathematical optimisation technique\. Our pipeline consists of two steps: \(1\) generating a demographically controlled subset using an LP model, and \(2\) splitting it into training, validation, and hold\-out test sets\. We applied this exact pipeline to every experiment, whether adjusting the male\-to\-female patient ratio or reshaping the age\-group distribution\. We chose LP over random sampling because LP exactly satisfies multiple demographic constraints \(sex, age, lesion diagnosis\) while selecting the largest possible subset meeting those ratios\. Random sampling approximates the target distribution but often discards or duplicates rare cases, and it cannot guarantee specific subgroup constraints \(e\.g\., dark\-skinned males aged 50–60 with malignant lesions\)\. The accurate and reproducible control provided by LP is therefore essential for rigorous bias and fairness analysis\. It can also be easily extended with additional constraints that random sampling cannot accommodate\. In what follows, we describe the specific steps used to create the datasets for each age\- and sex\-related experiment\.

##### Dataset composition using linear programming

The goal of the LP model is to maximise the number of instances of skin lesions within defined constraints, as we express below:

$$
\begin{aligned}
&\text{Find a vector } x \text{ (decision variables)} \\
&\text{that maximises } f = x_{1} \text{ (objective function)} \\
&\text{subject to } a_{i1}x_{1} + a_{i2}x_{2} + \dots + a_{in}x_{n} \leq b_{i} \text{ (constraints)} \\
&\qquad \text{for } i = 1, \dots, N_{i} \\
&\text{and } x_{j} \geq 0 \text{ (non-negativity constraints)} \\
&\qquad \text{for } j = 1, \dots, N_{j}
\end{aligned}
$$

The model has $N_j$ decision variables \($x_1, \dots, x_{N_j}$\) and $N_i$ constraints\. Each decision variable corresponds to specific categories \(e\.g\., benign lesions in female patients aged $>60$ years\)\. The objective function maximises the count of malignant instances $x_1$\. In the ISIC archive, there are fewer malignant instances than benign ones, and the goal is to achieve a balance between the two\. The constraints enforce bounds on individual groups and maintain inter\-group ratios\. Representative groups include all benign lesions, all females over 60 years old, and all males under 60 years of age\. A key constraint maintains class balance by ensuring an equal number of malignant and benign lesions \($x_1 - x_2 = 0$\)\. Non\-negativity constraints prohibit negative values for all decision variables\. The complete LP formulation is detailed in Appendix [A](https://arxiv.org/html/2606.03214#A1)\.
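
As an illustration only, a much-reduced LP of this kind can be written with SciPy's `linprog`. The sketch below is not the formulation in Appendix A: it assumes just four decision variables (malignant/benign by female/male), a single target female fraction, and made-up availability counts. The function name `compose_dataset` is hypothetical.

```python
from scipy.optimize import linprog


def compose_dataset(available, female_fraction):
    """Solve a toy LP: maximise malignant lesions under ratio constraints.

    `available` maps four (hypothetical) cells to upper bounds:
    mal_F, mal_M, ben_F, ben_M. Returns the continuous LP solution per cell.
    """
    cells = ["mal_F", "mal_M", "ben_F", "ben_M"]
    r = female_fraction

    # linprog minimises, so negate the objective (total malignant count).
    c = [-1.0, -1.0, 0.0, 0.0]

    # Equality constraints, each of the form a . x = 0.
    a_eq = [
        [1 - r, -r, 0.0, 0.0],   # malignant: females make up a fraction r
        [0.0, 0.0, 1 - r, -r],   # benign: females make up a fraction r
        [1.0, 1.0, -1.0, -1.0],  # class balance: malignant == benign
    ]
    b_eq = [0.0, 0.0, 0.0]

    # Non-negativity plus an upper bound set by the available lesions.
    bounds = [(0.0, float(available[name])) for name in cells]

    res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return dict(zip(cells, res.x))


# Made-up availability counts, not taken from the ISIC archive.
solution = compose_dataset(
    {"mal_F": 900, "mal_M": 1200, "ben_F": 5000, "ben_M": 5200},
    female_fraction=0.75,
)
```

With these toy numbers the scarce cell (malignant lesions in female patients) limits the size of the whole dataset. The LP solution is continuous, so a real pipeline would still need to round the values to whole lesion counts.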

Table 2: ISIC\-based datasets distributed amongst malignant, benign, male patient \(M\), and female patient \(F\) categories for both training and validation\. The bold value indicates the minimal malignant‑lesion count\.

Within the set constraints, the optimal solution maximises the number of malignant lesions and assigns a value to each decision variable\. To find this solution, we created a unique LP model for each dataset\. Table [2](https://arxiv.org/html/2606.03214#S3.T2) shows the result of the LP model for the different datasets\. We obtained the final solution for each distribution in four steps\. First, we solved the LP model to identify the optimal composition of a balanced test set while maximising the number of malignant lesions\. From this balanced set, we reserved one‑eighth as a hold‑out test set\. Second, we recalibrated the upper‑bound constraints using the lesion counts observed in the hold‑out set\. With these updated bounds, we re\-solved the LP model to derive the final solutions for the various distributions\. Third, after obtaining solutions from the LP model, we determined the minimum number of malignant instances across all datasets\. Fourth, we scaled each dataset proportionally to this minimum value, preserving demographic distributions while ensuring comparability\.
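
The third and fourth steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the cell names, the helper `scale_to_common_size`, and the counts are invented for the example and are not the values in Table 2. Flooring to whole lesions is also an assumption, since the rounding rule is not given here.

```python
def scale_to_common_size(solutions):
    """Scale every dataset so that all share the smallest malignant count.

    `solutions` maps a distribution name to its per-cell lesion counts.
    Integer floor division keeps the counts whole; ratios are preserved
    up to that rounding.
    """
    def malignant(cells):
        return cells["mal_F"] + cells["mal_M"]

    # Step 3: minimum number of malignant lesions over all datasets.
    target = min(malignant(cells) for cells in solutions.values())

    # Step 4: scale each dataset proportionally to that minimum.
    scaled = {}
    for name, cells in solutions.items():
        total = malignant(cells)
        scaled[name] = {k: v * target // total for k, v in cells.items()}
    return target, scaled


# Made-up LP outputs for two distributions.
lp_solutions = {
    "F50M50": {"mal_F": 1000, "mal_M": 1000, "ben_F": 1000, "ben_M": 1000},
    "F75M25": {"mal_F": 900, "mal_M": 300, "ben_F": 900, "ben_M": 300},
}
target, scaled = scale_to_common_size(lp_solutions)
```

After scaling, both toy datasets contain the same number of malignant lesions while keeping their own female-to-male ratio.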

##### Sex distribution analysis

We created seven distinct training and validation datasets to analyse sex\-related biases with varying patient ratios of female \(F\) to male \(M\)\. Each of the seven dataset instances was created using a distinct random seed\. For each seed, we first created a hold‑out test set; the remaining data were then shuffled and split strictly into an 80% training subset and a 20% validation subset while preserving the original demographic ratios\. There was no overlap of lesions between the training and validation sets for any given seed, preventing overlapping samples from inflating performance\. We maintained the same number of malignant and benign lesions in all datasets and balanced age distributions with the same numbers of patients below and above 60 years \(median age\) for each sex\.
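
A seeded, ratio-preserving 80/20 split of this kind might look as follows. The sketch covers only the training/validation step and assumes the hold-out test set has already been removed. The record keys (`id`, `diagnosis`, `sex`) and the function name are hypothetical, and age stratification is left out for brevity.

```python
import random
from collections import defaultdict


def stratified_split(records, seed, val_fraction=0.2):
    """Split records into train/validation, preserving group ratios.

    Shuffling and splitting happen inside every (diagnosis, sex) group,
    so both subsets keep the demographic ratios of the input.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["diagnosis"], rec["sex"])].append(rec)

    train, val = [], []
    for key in sorted(groups):  # fixed group order keeps the split reproducible
        members = groups[key]
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val


# Toy F75M25 pool: 30 female and 10 male lesions per diagnosis.
pool = [
    {"id": f"{diag}-{sex}-{i}", "diagnosis": diag, "sex": sex}
    for diag in ("malignant", "benign")
    for sex, n in (("F", 30), ("M", 10))
    for i in range(n)
]
train, val = stratified_split(pool, seed=1970)
```

Because every record lands in exactly one subset, no lesion can appear in both training and validation data for a given seed.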

The datasets consisted of an M100 set \(100% male patients\), an F100 set \(100% female patients\), an F95M5 set \(95% female, 5% male patients\), an F75M25 set \(75% female, 25% male patients\), an F50M50 set \(50% female, 50% male patients\), an F25M75 set \(25% female, 75% male patients\), an F5M95 set \(5% female, 95% male patients\), and a separate balanced test set that matches the distribution of the F50M50 \(equally‑split\) dataset\.

Figure [4](https://arxiv.org/html/2606.03214#S4.F4) illustrates the age distributions across the sex\-based training datasets and the balanced test set\.

##### Age distribution analysis

Similar to the sex‑distribution analysis, we built the training, validation, and test sets for the age‑focused experiments using an LP model \(see Appendix [B](https://arxiv.org/html/2606.03214#A2) for full details\)\. We defined three age‑distribution schemes, each spanning the same five age brackets\. Table [3](https://arxiv.org/html/2606.03214#S3.T3) shows the definition and proportions of the five age brackets \($A_1$–$A_5$\) for each of the three schemes\. In the YOUNGER scheme, the majority of samples come from the youngest age brackets, with the proportion gradually decreasing toward older age groups\. The BALANCED scheme allocates samples uniformly across all five brackets\. Finally, the OLDER scheme is the inverse of the younger‑skewed arrangement, concentrating most samples in the oldest age groups\. Each distribution enforces a strict 1:1 balance between malignant and benign lesions and a 1:1 balance between male and female patients\. For each scheme, we generated five independent instances using five different seeds\. For each seed, we divided the data into non\-overlapping training and validation sets and reserved a hold‑out test set with a uniform age category distribution \(based on the BALANCED scheme\)\. See Table [4](https://arxiv.org/html/2606.03214#S3.T4) for the count of skin lesions in each split\.

Table 3: Proportion of the five age groups \($A_1$–$A_5$\) across the three schemes \(YOUNGER, BALANCED, and OLDER\) for the ISIC\-based data\. The age brackets are defined as follows, where $a$ represents the patient’s age in years: $A_1 = \{0 \leq a \leq 50\}$, $A_2 = \{51 \leq a \leq 60\}$, $A_3 = \{61 \leq a \leq 70\}$, $A_4 = \{71 \leq a \leq 80\}$, and $A_5 = \{a \geq 81\}$\.

Table 4: Division of the ISIC\-based dataset into training and validation subsets for all three age‑distribution schemes\. The numbers shown in the table indicate the count of skin‑lesion images in each split\.

### 3\.2 Model

Using our carefully constructed datasets, we implemented three different architectures based on the ResNet50 model \(He et al\. \([2016](https://arxiv.org/html/2606.03214#bib.bib37)\)\)\. We selected ResNet50 for its proven performance in medical imaging and widespread adoption \(Xu et al\. \([2023](https://arxiv.org/html/2606.03214#bib.bib78)\)\), which enables meaningful comparison with other studies\. These architectures evaluate different approaches to handling demographic information:

##### The single\-task baseline model

The single\-task baseline model, enhanced with two fully connected layers, uses a sigmoid activation function and binary cross\-entropy loss\.

##### The multi\-task reinforcing model

The multi\-task “reinforcing” model, with three layers added to the convolutional base, produces two outputs: one for classification and another for the demographic attribute \(either sex or age, depending on the specific experiment\)\. Please note that we use the term reinforcing here in the sense of “strengthening influence”, not in the reinforcement learning \(RL\) sense\. When the attribute is sex \(binary: male/female\), we used a binary cross\-entropy loss \($L_c$\) and a sigmoid activation function for both heads\. For age, which is treated as a continuous variable, we replaced the binary loss of the demographic head with a mean\-squared error loss applied to the normalised age value, while the classification head continued to use its standard loss function\. Both heads \(primary and demographic\) receive equal weighting in the overall objective\. When we use the auxiliary head for age prediction, we first normalise the age labels to the unit interval \(\[0, 1\]\) using the minimum and maximum ages observed in the dataset\. During inference, we denormalise the sigmoid output back to the original age scale\. Because the admissible age range is fixed and known a priori, this normalisation\-denormalisation procedure preserves the semantic meaning of the prediction while keeping an identical model architecture for both auxiliary tasks\.
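
The age normalisation and the equally weighted objective can be sketched for a single sample as below. This is a framework-free illustration, not the Keras implementation used in the study: the age bounds `AGE_MIN` and `AGE_MAX` are placeholder values, and all function names are hypothetical.

```python
import math

# Placeholder bounds; the study uses the minimum and maximum ages
# observed in the dataset, which are not listed in this section.
AGE_MIN, AGE_MAX = 0.0, 90.0


def normalise_age(age, lo=AGE_MIN, hi=AGE_MAX):
    """Map an age in years to the unit interval [0, 1]."""
    return (age - lo) / (hi - lo)


def denormalise_age(value, lo=AGE_MIN, hi=AGE_MAX):
    """Map a sigmoid output in [0, 1] back to years."""
    return lo + value * (hi - lo)


def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def reinforcing_loss_age(p_malignant, y_malignant, age_pred_norm, age_true):
    """Equal-weight sum of the classification loss and the age MSE term."""
    l_c = binary_cross_entropy(p_malignant, y_malignant)
    l_age = (age_pred_norm - normalise_age(age_true)) ** 2
    return l_c + l_age


loss = reinforcing_loss_age(0.5, 1, age_pred_norm=0.5, age_true=45)
```

In a real training loop both terms would be averaged over a batch and differentiated by the deep learning framework; the sketch only shows how the two heads contribute to one objective.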

##### The multi\-task adversarial model

The multi\-task adversarial model was implemented following the methodology of Adeli et al\. and Abbasi\-Sureshjani et al\. \(Adeli et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib2)\); Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\)\), using a network with a shared feature encoder and two classifier heads\. One classifier targeted skin cancer classification; the other predicted confounders such as sex or age\. We used the ResNet architecture so that performance could be compared systematically with the baseline and reinforcing models\. We trained the skin‑cancer classifier and its encoder with a standard cross‑entropy loss \($L_c$\)\. The choice of bias‑predictor head loss \($L_{\text{bp}}$\) depended on the demographic variable under study: for age‑distribution experiments we employed an age bias predictor head whose loss was defined as the negative squared Pearson correlation coefficient, which suits protected attributes that are continuous or ordinal \(Adeli et al\. \([2021](https://arxiv.org/html/2606.03214#bib.bib2)\)\), while for sex‑distribution experiments we used a binary cross‑entropy loss as $L_{\text{bp}}$, reflecting the binary nature of the attribute\.

To make the encoded features less predictive of the demographic attribute, we adversarially adjusted the encoder using a third loss term \($L_{\text{br}}$\), with $\lambda$ governing the penalty for accurate demographic predictions: $L_{\text{br}} = \lambda L_{\text{bp}}$, following the practice of Abbasi\-Sureshjani and colleagues \(Abbasi\-Sureshjani et al\. \([2020](https://arxiv.org/html/2606.03214#bib.bib1417)\)\)\.
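
For the age experiments, the bias-predictor loss and the $\lambda$-weighted term can be written out as plain functions. This is a sketch under stated assumptions: it is framework-free rather than the PyTorch implementation used in the study, the function names are hypothetical, the zero-variance fallback is a choice made for the example, and the sign and value of $\lambda$ as well as the update schedule of the three losses are not specified in this section.

```python
def neg_squared_pearson(pred, target):
    """Negative squared Pearson correlation between two sequences.

    Returns a value in [-1, 0]; -1 means the predictions are perfectly
    linearly related to the protected attribute (e.g. age).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        # Correlation is undefined for a constant input; treating it as
        # zero is an assumption of this sketch.
        return 0.0
    return -(cov * cov) / (var_p * var_t)


def bias_removal_loss(l_bp, lam):
    """L_br = lambda * L_bp, as stated in the text."""
    return lam * l_bp


l_bp = neg_squared_pearson([0.1, 0.2, 0.3], [30, 50, 70])
```

Squaring the correlation makes the loss indifferent to its sign, so a predictor that is strongly anti-correlated with age is penalised in the same way as one that is strongly correlated.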

##### Model training parameters

We optimised the single\-task model using a grid search over random seeds, learning rates, and momentum values, selecting the combination that yielded the highest validation performance across all experiments\. These optimised parameters were then applied to the reinforcing and adversarial models for a fair comparison\. The optimal hyperparameters, selected via validation set performance, are as follows:

- Pre\-training: ImageNet
- Input size: $384 \times 384$ pixels
- Max epochs: 40
- Batch size: 20
- Learning rate: $2.0 \times 10^{-5}$

To mitigate overfitting and improve model robustness, we implemented an early stopping technique \(patience = 10 epochs\) and data augmentation techniques\. Following prior implementations in the literature, we implemented our baseline and reinforcing models in Keras with the TensorFlow backend \(Géron \([2022](https://arxiv.org/html/2606.03214#bib.bib27)\)\), while our adversarial model was implemented in PyTorch \(Paszke et al\. \([2019](https://arxiv.org/html/2606.03214#bib.bib59)\)\) to maintain consistency with existing adversarial learning frameworks\.
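
The interplay between the epoch cap and the patience setting can be illustrated with a small stand-alone function. It assumes that validation loss is the monitored quantity, which this section does not state, and the helper name `epochs_run` is hypothetical.

```python
def epochs_run(val_losses, patience=10, max_epochs=40):
    """Return how many epochs are trained before stopping.

    Training stops once the monitored validation loss has failed to
    improve for `patience` consecutive epochs, or at `max_epochs`.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            wait = 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Toy curve: improves for five epochs, then plateaus.
plateau = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.65] * 35
stopped_at = epochs_run(plateau)
```

With this toy curve the best epoch is the fifth, so ten further epochs without improvement end training well before the 40-epoch cap.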

## 4 Experiments

### 4\.1 Demographic bias evaluation

##### Defining test, training and validation bundles

We took steps to ensure that any observed performance differences between models are attributable to the models themselves and the underlying data distributions, rather than to inconsistencies in how we split the data\. By fixing a single random seed for each run, we use the same hold‑out test set across distributions, making cross\-distribution comparisons meaningful\. Importantly, we deliberately construct the hold‑out test set to be balanced across relevant subgroups \(such as age or sex\)\. This balanced test set removes bias toward any particular segment and allows us to isolate the effect of bias in the training data itself\. We acknowledge, however, that measuring “true” behaviour across the entire population would require a test set whose distribution matches the expected real‑world population; our balanced test set is chosen specifically to evaluate bias rather than to predict real‑life performance\. When we generate the training and validation partitions with the same seed, we ensure that these splits faithfully reflect the target distribution\. By repeating the entire process with several different seeds, we reduce the influence of random variation in the data\. The implementation is illustrated in Figure [3](https://arxiv.org/html/2606.03214#S4.F3)\. Our experimental workflow enables both within‑distribution and cross‑distribution model comparisons\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x9.png)

Figure 3: Experimental workflow \(age analysis; sex analysis is analogous\): for each distribution $D$, we build five independent bundles $B$, each split into mutually exclusive test, training, and validation sets with a fixed seed $S$\. The same test set is shared by all distributions for a given seed, and training/validation splits use the same seed\. The distribution properties of the training/validation splits are consistent with the distribution under test\. All three models are evaluated on this test set, allowing within‑ and cross‑distribution comparisons, and the process is repeated with multiple seeds to reduce random effects\.

##### Sex distribution

For a thorough evaluation, we created five bundles for each of the seven distributions \(F100, F95M5, F75M25, F50M50, F25M75, F5M95, and M100\), resulting in 35 bundles \(seven distributions $\times$ five seeds\)\. We assessed AUC overall and within male and female subgroups for each learning strategy and dataset combination\. Using three learning strategies on 35 bundles, we conducted 105 experiments to evaluate model performance within and across all seven distributions\. Notably, multi\-task models cannot unlearn constant protected attributes \(as in M100 and F100 experiments\)\. These edge cases serve to stress\-test the training pipeline’s stability when the demographic attribute does not vary\.
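
Overall and per-subgroup AUC can be computed with the pairwise (Mann–Whitney) definition, as in the sketch below. It is a plain-Python illustration with made-up scores; the function names are hypothetical, and the quadratic pairwise loop is chosen for readability rather than speed.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc(scores, labels, sexes):
    """AUC over all test samples and within each sex subgroup."""
    result = {"overall": auc(scores, labels)}
    for group in ("F", "M"):
        idx = [i for i, s in enumerate(sexes) if s == group]
        result[group] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return result


# Made-up predictions: label 1 = malignant, 0 = benign.
report = subgroup_auc(
    scores=[0.9, 0.2, 0.6, 0.7],
    labels=[1, 0, 1, 0],
    sexes=["F", "F", "M", "M"],
)
```

The toy example shows why subgroup reporting matters: a reasonable overall AUC can hide a subgroup in which the ranking is entirely wrong.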

##### Age distribution

In our age experiments, we conducted comprehensive evaluations using age\-stratified datasets\. We generated three distinct data distributions across five age categories \($A_1$–$A_5$\): YOUNGER \(predominantly young patients\), BALANCED \(evenly distributed age categories\), and OLDER \(predominantly elderly patients\)\. For each configuration, we evaluated the three model architectures\. Performance was measured using AUC scores, with results visualised across different age distributions and model architectures\. We conducted 45 experiments in total \(15 bundles and three learning strategies\) to evaluate model performance within and across the three distributions\.
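
The experiment counts follow directly from the grid of distributions, seeds, and learning strategies. The enumeration below is only a bookkeeping sketch: the seed values are placeholders, and `experiment_grid` is a hypothetical helper.

```python
from itertools import product

STRATEGIES = ["base", "reinforcing", "adversarial"]
SEX_DISTRIBUTIONS = ["F100", "F95M5", "F75M25", "F50M50", "F25M75", "F5M95", "M100"]
AGE_DISTRIBUTIONS = ["YOUNGER", "BALANCED", "OLDER"]
SEEDS = [0, 1, 2, 3, 4]  # placeholder values; five seeds per distribution


def experiment_grid(distributions, seeds=SEEDS, strategies=STRATEGIES):
    """One (distribution, seed, strategy) tuple per training run."""
    return list(product(distributions, seeds, strategies))


sex_runs = experiment_grid(SEX_DISTRIBUTIONS)  # 7 x 5 x 3 runs
age_runs = experiment_grid(AGE_DISTRIBUTIONS)  # 3 x 5 x 3 runs
```

Each (distribution, seed) pair corresponds to one bundle, so the sex analysis has 35 bundles and the age analysis has 15.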

### 4\.2 Reinforcing model: Auxiliary head analysis

We evaluate the auxiliary prediction head of the reinforcing model to serve two purposes: \(1\) assessing whether the multi\-task architecture successfully learns the demographic signal, and \(2\) providing a mechanistic explanation for bias mitigation in the primary task\. Specifically, if the auxiliary head fails to learn the demographics, the reinforcing model loses its regularisation effect, which may explain observed failures in bias mitigation\. An auxiliary\-head analysis was omitted for the adversarial model because the original network implementation did not provide demographic outputs\.

##### Evaluation of the sex‑prediction head

We evaluated the auxiliary sex‑prediction head of the multi‑task reinforcing model using two complementary metrics\. First, we computed the AUC to assess discriminative ability\. Second, we measured the Brier score \(Rufibach \([2010](https://arxiv.org/html/2606.03214#bib.bib64)\)\) separately for male and female patients\. We interpret the Brier score \($\beta$\) for binary classification as follows: strong calibration for $0 \leq \beta < 0.05$, moderate for $0.05 \leq \beta < 0.15$, weak for $0.15 \leq \beta < 0.25$, poor for $0.25 \leq \beta < 0.35$, and very weak for $\beta \geq 0.35$\. Lower Brier scores indicate tighter alignment between predicted probabilities and observed outcomes, whereas higher values reflect increasingly poor calibration\. We do not evaluate the two edge cases \(F100, M100\), whose training sets contain only one sex, because the auxiliary head cannot learn a useful decision boundary from them\.
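
The per-subgroup Brier score and the interpretation bands above translate into a short sketch. The example assumes the head outputs the probability of the patient being male, which is an arbitrary encoding chosen here; the predictions are made up and the function names are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def interpret_brier(beta):
    """Map a Brier score to the interpretation bands used in the text."""
    if beta < 0.05:
        return "strong"
    if beta < 0.15:
        return "moderate"
    if beta < 0.25:
        return "weak"
    if beta < 0.35:
        return "poor"
    return "very weak"


# Made-up outputs of a head trained on male-majority data.
# Encoding assumed for this sketch: label 1 = male, 0 = female.
beta_male = brier_score([0.9, 0.8], [1, 1])
beta_female = brier_score([0.7, 0.6], [0, 0])
bands = (interpret_brier(beta_male), interpret_brier(beta_female))
```

Computing the score separately for each sex is what exposes the asymmetry: a head that always leans towards the majority sex scores well on that subgroup and badly on the other.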

##### Evaluation of the age‑prediction head

For the auxiliary age‑prediction head of the multi\-task reinforcing model, we used mean absolute error \(MAE\) as our primary performance measure\. MAE computes the average absolute difference between the predicted age and the ground\-truth age across all test samples\. To disclose any systematic biases throughout the age spectrum, we compute MAE for the five age categories \($A_1$–$A_5$, see Table [3](https://arxiv.org/html/2606.03214#S3.T3)\)\. Furthermore, to assess the quality of the age predictions, we computed the Pearson correlation coefficient \($\rho$\) between the predicted ages and the ground\-truth ages\. This metric quantifies the linear relationship between the two variables and complements the MAE by indicating how well the model captures age trends\. We interpret the Pearson correlation coefficient as follows: strong correlation for $0.5 \leq \rho < 1$, moderate correlation for $0.3 \leq \rho < 0.5$, and weak correlation for $0 \leq \rho < 0.3$\. Only correlations with $p < 0.05$ were retained for further analyses\.
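
A per-bracket MAE and the correlation bands can be sketched as follows. The bracket boundaries come from Table 3 and assume whole-year ages; the predictions are made up, the function names are hypothetical, and returning `None` for values outside the stated bands is a choice made for this example.

```python
from collections import defaultdict


def age_bracket(age):
    """Assign an age in whole years to one of the brackets A1 to A5."""
    if age <= 50:
        return "A1"
    if age <= 60:
        return "A2"
    if age <= 70:
        return "A3"
    if age <= 80:
        return "A4"
    return "A5"


def mae_by_bracket(pred_ages, true_ages):
    """Mean absolute error per bracket, grouped by the true age."""
    errors = defaultdict(list)
    for pred, true in zip(pred_ages, true_ages):
        errors[age_bracket(true)].append(abs(pred - true))
    return {b: sum(e) / len(e) for b, e in sorted(errors.items())}


def interpret_rho(rho):
    """Map a Pearson coefficient to the bands used in the text."""
    if 0.5 <= rho < 1:
        return "strong"
    if 0.3 <= rho < 0.5:
        return "moderate"
    if 0 <= rho < 0.3:
        return "weak"
    return None  # outside the bands defined in the text


# Made-up predictions, one patient per bracket.
mae = mae_by_bracket([35, 50, 65, 80, 80], [30, 55, 65, 75, 85])
```

Grouping by the true age rather than the predicted age keeps each patient in the bracket they actually belong to, so a systematic error at one end of the age range shows up in that bracket's MAE.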

### 4\.3 Cross\-dataset evaluation

To validate our findings and assess generalisability, we performed additional experiments using two external skin‑lesion datasets: PAD‑UFES‑20 \(both sex and age\) and DERM7PT \(no age information\)\. Using the saved weights from our previously trained models \(base, reinforcement, and adversarial\), we evaluated them on the external datasets without any further fine\-tuning, using the same evaluation metrics\. This cross\-dataset validation approach provides insight into the robustness and transferability of our models across different patient populations and data collection contexts\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x10.png) \(a\) M100
![Refer to caption](https://arxiv.org/html/2606.03214v1/x11.png) \(b\) F25M75
![Refer to caption](https://arxiv.org/html/2606.03214v1/x12.png) \(c\) F50M50
![Refer to caption](https://arxiv.org/html/2606.03214v1/x13.png) \(d\) F75M25
![Refer to caption](https://arxiv.org/html/2606.03214v1/x14.png) \(e\) F100
![Refer to caption](https://arxiv.org/html/2606.03214v1/x15.png) \(f\) Balanced test set

Figure 4: Age distributions in the ISIC\-based datasets, ranging from M100 to F100 \(a to e\), with a balanced test set \(f\)\. The training sets \(3,528 records, approximately 80% of the total\) and corresponding validation sets \(880 records, approximately 20%\) span distributions from M100 to F100 and maintain similar population compositions\. The total of 4,412 records \($2 \times 2{,}206$ lesions\) reflects the inclusion of both malignant and benign cases, with the base value of 2,206 taken from Table [2](https://arxiv.org/html/2606.03214#S3.T2)\. Test sets contain 1,264 records\. The visualised distributions correspond to seed value 1970; distributions for other seeds are equivalent\.

### 4\.4 Exploratory performance assessment

We did not conduct formal statistical testing because our study is exploratory and the number of observations is small\. Applying p‑value–based tests to the model performances would yield unstable estimates that provide little trustworthy insight into whether any actual effect exists\. Consequently, we refrained from labelling results as “significant” or “non‑significant\.” As Amrhein et al\. highlighted, such dichotomous labelling often leads to misinterpretation, and non‑significant findings are frequently mistaken for evidence of no effect \(Amrhein et al\. \([2019](https://arxiv.org/html/2606.03214#bib.bib3)\)\)\. Instead, we present AUC values as boxplots and provide descriptive comparisons across subgroups, allowing readers to assess the magnitude and direction of any observed differences\.

## 5 Results

### 5\.1 Sex\-specific model assessment

Figure [5](https://arxiv.org/html/2606.03214#S5.F5) shows model performance across sex distributions using box plots of AUC for the three model types \(base, reinforcing, adversarial\)\. Figure [6](https://arxiv.org/html/2606.03214#S5.F6) shows the impact of dataset distributions on the three learning strategies, reporting AUC scores for both sexes\.

##### Comparable performance at model level

Our analysis in Figure [5](https://arxiv.org/html/2606.03214#S5.F5) demonstrates that all three learning strategies achieve similar levels of effectiveness, showing only slight performance variations\. Across the three model architectures, the AUC scores remain within a consistent range between 0\.79 and 0\.85\. Although there are minor variations between the approaches, none clearly outperforms the others\.

##### Sex\-specific training data yields better results

Figure [6](https://arxiv.org/html/2606.03214#S5.F6) shows that all models perform better for male patients in the male\-only \(M100\), predominantly male \(F5M95\), and lightly male\-skewed \(F25M75\) scenarios\. The reinforcing and base models show the most pronounced performance gap in the predominantly male dataset\. The base and reinforcing models show equal performance between subgroups in the balanced, lightly female\-skewed \(F75M25\), and predominantly female \(F95M5\) scenarios\. The base model shows better performance for female patients in the female\-only \(F100\) scenario, while the reinforcing and adversarial models show equal performance between male and female patients\. Thus, our models appear more attuned to male patients in mixed\-sex training sets, regardless of the percentage of female patients\. The best results for each sex are achieved when the model is trained exclusively on data from that sex\. We hypothesise that a model trained on a single‑sex dataset may specialise in those sex\-specific cues and often attains higher accuracy\. In mixed‑sex training, the network must accommodate both distributions, which can lead to a modest performance dip, typically favouring the more dominant signal\.

##### Base model reveals sex bias

We found substantial sex bias in the performance of the base model \(see Figure [6](https://arxiv.org/html/2606.03214#S5.F6)\)\. In the male\-only and predominantly male scenarios, we observed a substantial performance gap between male and female patients\. In the female\-only scenario, there is also a performance gap; however, it is less pronounced\. We found that the base model performed comparably for male and female patients across balanced, lightly skewed, and predominantly female experiments\. We assume the base model latches onto the most prevalent cues in the training set\. When the data consist of only one sex or are heavily skewed, it picks up on sex\-specific visual patterns \(e\.g\., hair density, skin texture\) that aid classification, performing well for the majority sex but poorly for the minority\. In a balanced or only mildly female\-skewed dataset, both sexes are well represented, forcing the model to rely on lesion\-intrinsic features for discrimination, leading to comparable accuracy for males and females\.

##### Reinforcement model partially successful in sex bias mitigation

When we trained the model on male‑majority data, we observed performance disparities between the sexes \(see Figure [6](https://arxiv.org/html/2606.03214#S5.F6)\)\. With balanced training data, the reinforcing model successfully mitigates sex\-based bias\. Notably, this same bias reduction effect is observed in female\-majority training sets as well\. We hypothesise that the reinforcing multi‑task model can reduce sex bias only when its auxiliary sex‑prediction head receives sufficiently informative female patient\-related signals\. In the only\-male and predominantly\-male scenarios, the encoder overfits to male\-specific cues because the auxiliary head lacks enough female examples to learn a meaningful discriminator\. Conversely, with balanced or mostly\-female settings, the auxiliary head can learn a reliable sex classifier; its loss then regularises the shared encoder towards sex\-invariant representations, thereby reducing bias\.

##### Adversarial model reduces sex bias in predominantly female training scenarios

The adversarial model reduces sex bias in scenarios with predominantly female patients but is less effective in other scenarios, often favouring male patients\. Its performance varies between experiments and datasets \(see Figure [6](https://arxiv.org/html/2606.03214#S5.F6)\)\. We suspect that complex anatomical confounders, such as variations in body hair distribution or skin texture, continue to act as strong proxies for sex in mixed populations, resisting the adversarial removal of these features\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x16.png)

Figure 5: Comparison of model performance across sex distributions using datasets generated from the curated ISIC dataset\. Box plots display AUC metrics for three model architectures \(base, reinforcing, and adversarial\) trained and validated on sex\-biased datasets \(M100, F5M95, F25M75, F50M50, F75M25, F95M5, F100\) and evaluated on a balanced dataset\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x17.png)

Figure 6: The AUC score varies based on data splits ranging from only male patients \(M100\) to only female patients \(F100\) in the ISIC dataset\. We show base, reinforcing and adversarial model performance for female and male patient subgroups\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x18.png)

Figure 7: Comparison of model performance across different age distributions using the curated ISIC dataset\. Box plots show the AUC metrics of three model architectures \(base, reinforcing, and adversarial\) trained and validated on age\-biased datasets \(YOUNGER, BALANCED, OLDER\) and evaluated on a balanced test set\.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x19.png)

Figure 8: Age\-stratified model evaluation \(using five age categories, $A_1$–$A_5$\) across datasets with varying age distributions based on the curated ISIC dataset\. The analysis compares AUC scores for three model architectures \(base, reinforcing, and adversarial\) using age\-biased training sets \(YOUNGER, BALANCED, OLDER\)\. Each model was evaluated on a balanced test set, showing performance variations across different age distributions\. The age brackets are defined as follows, where $a$ represents the patient’s age in years: $A_1 = \{0 \leq a \leq 50\}$, $A_2 = \{51 \leq a \leq 60\}$, $A_3 = \{61 \leq a \leq 70\}$, $A_4 = \{71 \leq a \leq 80\}$, and $A_5 = \{a \geq 81\}$\.

### 5\.2 Age\-specific model assessment

Figure [7](https://arxiv.org/html/2606.03214#S5.F7) presents a comparative analysis using box plots to illustrate AUC metrics across YOUNGER, BALANCED, and OLDER datasets\. For a more granular understanding, Figure [8](https://arxiv.org/html/2606.03214#S5.F8) provides an age\-stratified evaluation using five distinct age categories \($A_1$–$A_5$\), demonstrating how each model architecture performs when trained on differently distributed datasets and evaluated against a balanced test set\.

##### Comparable overall model performance

Looking at overall model performance across all experiments \(Figure [7](https://arxiv.org/html/2606.03214#S5.F7)\), the adversarial model shows the highest variance in performance across seeds compared to the other two strategies, particularly in the YOUNGER and OLDER cases\. All three model approaches demonstrate comparable baseline performance levels, with AUC scores falling within a 0\.06 range and both the base and reinforcing models showing a slight advantage over the adversarial model\.

##### Declining trend across categories

As shown in Figure [8](https://arxiv.org/html/2606.03214#S5.F8), all three distributions show a decreasing AUC trend, with the youngest age bracket achieving the highest AUCs and the oldest the lowest\. In the balanced distribution, the decreasing trend does not hold for the $A_4$ age bracket in either the baseline or the reinforcing model\. $A_4$ performance is higher in the balanced case than in the younger\-age distribution\. In the older age distribution, the models show a slight performance decrease for the $A_1$ bracket compared to the younger and balanced distributions\. This decline is most pronounced for the baseline and adversarial models\. We assume the performance drop with increasing age is due to the nature of older skin, which exhibits more heterogeneous visual traits \(such as wrinkles, pigment changes, and vascular alterations\) that mask lesion cues\.

##### Strong performance for younger age categories

Models trained on the balanced dataset show the highest AUC for age category $A_1$, but experience a small performance drop for age category $A_2$ compared to models trained on the YOUNGER dataset \(see Figure [8](https://arxiv.org/html/2606.03214#S5.F8)\)\. In contrast, the AUC values for the age categories $A_3$, $A_4$, and $A_5$ generally fall below 0\.85\. We hypothesise that the model excels in the youngest age group because young skin presents the most explicit lesion cues, with fewer wrinkles, pigment variations, or vascular artefacts to obscure the diagnostic signal, and the balanced training set supplies fewer examples of these clean patterns\.

##### Performance improvement of balanced models

Figure [8](https://arxiv.org/html/2606.03214#S5.F8) illustrates that the base and reinforcing models trained on a balanced dataset show improved performance for the $A_4$ age category compared to the same models trained on the younger dataset\. We assume the balanced set provides enough examples to force the encoder to learn features that work across age groups\. In the younger‑skewed data the model overfits to smooth, youthful textures, so its performance drops on the $A_4$ group\. Adding balanced age samples thus improves AUC for that category\.

### 5\.3 Sex\-prediction head evaluation

Table 5: Performance evaluation of the auxiliary sex\-prediction head within the reinforcing model\. Metrics include overall AUC and Brier scores \($\beta$\), computed separately for male and female patient subgroups\. Bold values denote optimal performance per column \(highest AUC, lowest Brier score\)\.

Table [5](https://arxiv.org/html/2606.03214#S5.T5) reports the auxiliary sex\-prediction head’s performance in the reinforcing model, including overall AUC and Brier scores for male and female subgroups\. Predictive performance was limited across all skewed distributions but reached reasonable accuracy in the balanced case \(AUC = 0\.732\), indicating that the encoder successfully learns sex\-related features when the data is balanced\. This aligns with the observed reduction in sex bias for the reinforcing model under balanced settings\. However, in skewed scenarios \(e\.g\., F5M95\), the auxiliary head’s performance becomes highly asymmetric: it predicts the majority sex with high confidence \(a strong Brier score\) but fails on the minority sex \(a very weak Brier score\)\. This inability to learn a robust sex discriminator explains why the reinforcing model cannot effectively regularise the encoder under skewed distributions, leading to persistent bias gaps \(see F5M95 in Figure [6](https://arxiv.org/html/2606.03214#S5.F6)\)\.

##### Highest AUC and weak Brier for the balanced training set

When we train the reinforcing model with equal numbers of male and female patients, the auxiliary sex-prediction head learns a discriminative representation and achieves the highest AUC among the experiments. Brier scores for the F50M50 case are weak for both female and male patients (see Table [5](https://arxiv.org/html/2606.03214#S5.T5)).

##### Strong Brier for the majority, very weak for the minority

When we train on a majority of male patients (F5M95 and F25M75), the Brier score for males is strong whereas that for females is very weak. When we train with a minority of male patients (F95M5 and F75M25), the opposite occurs (see Table [5](https://arxiv.org/html/2606.03214#S5.T5)).
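The subgroup metrics discussed in this subsection can be computed with a few lines of code. The following is a minimal sketch, not the authors' evaluation code: it assumes the head outputs a probability that the patient is male, that labels are coded 1 for male and 0 for female, and that the per-sex Brier score is the mean squared error restricted to patients of that true sex (lower is better). The function names `auc`, `brier_score` and `sex_head_report` are illustrative, and the input values are made up.

```python
from typing import Dict, Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sex_head_report(is_male: Sequence[int], p_male: Sequence[float]) -> Dict[str, float]:
    """Overall AUC plus one Brier score per true-sex subgroup.

    Assumes both sexes are present in the evaluation set.
    """
    male = [(y, p) for y, p in zip(is_male, p_male) if y == 1]
    female = [(y, p) for y, p in zip(is_male, p_male) if y == 0]
    return {
        "auc": auc(is_male, p_male),
        "brier_male": brier_score(*zip(*male)),
        "brier_female": brier_score(*zip(*female)),
    }


# Toy example (made-up values): a head that always outputs 0.9 for "male".
report = sex_head_report(is_male=[1, 1, 1, 0], p_male=[0.9, 0.9, 0.9, 0.9])
print(report)
```

In this toy case the head has a low Brier score on male patients and a high one on female patients while its AUC stays at chance level, which is the kind of majority/minority asymmetry described for the skewed training sets.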

### 5\.4Age\-prediction head evaluation

We evaluated the auxiliary age-prediction head by reporting both the Pearson correlation (Table [6](https://arxiv.org/html/2606.03214#S5.T6)) and the distribution of the mean absolute error (Figure [9](https://arxiv.org/html/2606.03214#S5.F9)). Table [7](https://arxiv.org/html/2606.03214#S5.T7) provides the overall MAE for the three training-bias configurations.

Table 6: Pearson correlation ($\rho$) between predicted and true ages for the reinforcing multi-task model. Correlation coefficients are reported for three training-bias configurations (younger-skewed, balanced, older-skewed) evaluated on a balanced test set. The table lists the overall $\rho$ as well as the $\rho$ for each age category. “-” indicates a value too low to be meaningfully reported.

In the BALANCED training configuration, the auxiliary head shows a moderate Pearson correlation ($\rho = 0.433$), indicating that the model recognises age patterns without relying on them as a dominant shortcut. This supports the hypothesis that balanced sampling forces the encoder to learn features that generalise across age groups. In contrast, the older-skewed training yields a strong correlation ($\rho = 0.710$) for the youngest cohort, suggesting the model relies heavily on shortcut-related age cues when the distribution is imbalanced.

##### Increasing correlation with older\-skewed training

The reinforcing model trained on the OLDER dataset exhibits a strong Pearson correlation between predicted and ground-truth ages. In contrast, models trained on the BALANCED and YOUNGER datasets show only moderate and weak correlations, with the BALANCED model achieving a slightly higher correlation than the YOUNGER model (Table [6](https://arxiv.org/html/2606.03214#S5.T6)).

##### Youngest cohort shows strongest correlation

The $A_1$ cohort correlates most strongly when the model is trained on older-skewed data (Table [6](https://arxiv.org/html/2606.03214#S5.T6)).

##### Middle group shows weakest correlation

Correlations for the middle age groups ($A_2$–$A_4$) are uniformly weak and are often marked with a dash (“-”) in Table [6](https://arxiv.org/html/2606.03214#S5.T6).

##### Oldest cohort shows practically no correlation

Only the balanced scheme shows a very weak correlation for the oldest cohort ($A_5$); for the other schemes, the correlation is marked with a dash (“-”) in Table [6](https://arxiv.org/html/2606.03214#S5.T6).
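As a minimal sketch of how the entries of a table like Table 6 could be computed, the snippet below evaluates Pearson's $\rho$ overall and within each age bracket. Bracket boundaries follow the definitions in the Figure 9 caption. The helper names, the assumption of integer ages, and the choice to return `None` when a coefficient is undefined (fewer than two samples or zero variance) are illustrative and are not taken from the paper, which uses “-” for values too low to report.

```python
import math
from typing import Dict, Optional, Sequence

BRACKETS = ("A1", "A2", "A3", "A4", "A5")


def age_bracket(age: float) -> str:
    """Map an age in years to its bracket (A1: 0-50, A2: 51-60, ..., A5: 81+)."""
    if age <= 50:
        return "A1"
    if age <= 60:
        return "A2"
    if age <= 70:
        return "A3"
    if age <= 80:
        return "A4"
    return "A5"


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def correlation_by_bracket(
    true_age: Sequence[float], pred_age: Sequence[float]
) -> Dict[str, Optional[float]]:
    """Overall rho plus one rho per bracket, grouping samples by true age."""
    result = {"overall": pearson(true_age, pred_age)}
    for name in BRACKETS:
        pairs = [(t, p) for t, p in zip(true_age, pred_age) if age_bracket(t) == name]
        result[name] = pearson([t for t, _ in pairs], [p for _, p in pairs])
    return result


# Toy example with made-up ages.
print(correlation_by_bracket([30, 40, 55, 65, 75, 85], [32, 42, 55, 65, 75, 85]))
```

Because most brackets span only about ten years, each per-bracket coefficient is computed over a restricted age range, which generally attenuates $\rho$ relative to the overall value; this may be worth keeping in mind when comparing rows of such a table.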

##### MAE of age categories

Table 7: Mean Absolute Error (MAE) of the auxiliary age-prediction head for the reinforcing multi-task model under three training-bias configurations.

We find that the model trained on a balanced age distribution yields the lowest overall average error (Table [7](https://arxiv.org/html/2606.03214#S5.T7)), whereas the younger-skewed and older-skewed configurations show higher MAE. For the youngest cohort of patients, we see relatively high MAE scores for all three models (Figure [9](https://arxiv.org/html/2606.03214#S5.F9)). The MAE scores of the youngest cohort increase as the number of patients in that cohort in the training set decreases. We observe the same pattern for the oldest patient cohort. However, in skewed training, the oldest cohort scores better when trained with mainly older patients than the youngest cohort does when trained with mostly younger patients.

The oldest age groups ($A_4$ and $A_5$) exhibit high MAE when the model is trained mostly on younger patients, while the youngest age groups ($A_1$ and $A_2$) show high MAE when trained mainly on older patients. The $A_2$ age cohort achieves its lowest MAE when we train with predominantly younger patients. The scores of the $A_3$ cohort change little across the models. The MAE of the $A_4$ cohort decreases as the proportion of $A_4$ patients relative to the total training population increases. The $A_4$ age cohort achieves its lowest MAE when we train the model with mostly older patients.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x20.png)
![Refer to caption](https://arxiv.org/html/2606.03214v1/x21.png)
![Refer to caption](https://arxiv.org/html/2606.03214v1/x22.png)
![Refer to caption](https://arxiv.org/html/2606.03214v1/x23.png)

Figure 9: The top figure shows the Mean Absolute Error (MAE) distributions for five age groups ($A_1$–$A_5$). Each plot depicts the MAE, calculated from predicted versus reference ages, under one of the three training biases (younger, balanced, older) using the reinforcing model. We evaluated all models on a balanced test set. The bottom three figures show the age distribution for the younger, balanced, and older training datasets (from left to right). The same colour legend that appears in the top panel (mapping $A_1$–$A_5$ to their respective colours) is reused for the three lower panels. The age brackets are defined as follows, where $a$ represents the patient’s age in years: $A_1 = \{0 \leq a \leq 50\}$, $A_2 = \{51 \leq a \leq 60\}$, $A_3 = \{61 \leq a \leq 70\}$, $A_4 = \{71 \leq a \leq 80\}$, and $A_5 = \{a \geq 81\}$.
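The per-bracket MAE summarised in the top panel can be computed as sketched below, assuming integer ages in years and the bracket boundaries defined in the caption. The names `UPPER_BOUNDS`, `NAMES` and `mae_by_bracket` are illustrative, the toy ages are made up, and brackets with no samples are simply omitted from the result.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence

UPPER_BOUNDS = [50, 60, 70, 80]  # inclusive upper age of A1..A4; A5 covers 81 and above
NAMES = ["A1", "A2", "A3", "A4", "A5"]


def mae_by_bracket(true_age: Sequence[float], pred_age: Sequence[float]) -> Dict[str, float]:
    """Mean absolute age error per bracket (grouped by true age) and overall."""
    errors: Dict[str, List[float]] = {}
    for t, p in zip(true_age, pred_age):
        # bisect_left returns the index of the first upper bound >= t,
        # so 50 maps to A1, 51 to A2, and anything above 80 to A5.
        name = NAMES[bisect_left(UPPER_BOUNDS, t)]
        errors.setdefault(name, []).append(abs(p - t))
    report = {name: sum(v) / len(v) for name, v in errors.items()}
    all_errors = [e for v in errors.values() for e in v]
    report["overall"] = sum(all_errors) / len(all_errors)
    return report


# Toy example with made-up ages.
print(mae_by_bracket(true_age=[30, 50, 51, 85], pred_age=[40, 50, 61, 80]))
```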

### 5\.5Cross\-dataset analysis

#### 5\.5\.1PAD\-UFES\-20

Figures [10](https://arxiv.org/html/2606.03214#S5.F10) and [11](https://arxiv.org/html/2606.03214#S5.F11) present our model performance evaluation on the PAD-UFES-20 dataset, examining age-stratified and sex-stratified distributions. While Figure [10](https://arxiv.org/html/2606.03214#S5.F10) breaks down performance across five age categories ($A_1$–$A_5$) for different age-biased training sets, Figure [11](https://arxiv.org/html/2606.03214#S5.F11) analyses sex-based performance variations across multiple male-female distribution ratios.

##### External validation shows performance drop

During external validation, model performance in both sex- and age-based evaluations showed notably lower metrics compared to internal validation scenarios (Figures [10](https://arxiv.org/html/2606.03214#S5.F10) and [11](https://arxiv.org/html/2606.03214#S5.F11)).

##### External validation shows best performance for younger age groups

For age-based models, we observe a different pattern in external validation compared to internal validation. The models trained on the YOUNGER, BALANCED, and OLDER cases show similar performance ranges across the three learning strategies for age categories $A_2$, $A_3$, $A_4$, and $A_5$. Across all configurations, models show notably better performance for the $A_1$ age category. The adversarial model also shows improved performance for the $A_2$ age category, achieving results comparable to those for the $A_1$ category (Figure [10](https://arxiv.org/html/2606.03214#S5.F10)).

##### Sex\-based validations show contrasting patterns

In internal validation, we observed an X pattern: as the percentage of male patients in the training and validation sets decreased, male patients’ performance declined, whereas female patients’ performance improved. In external validation, this X pattern is less pronounced. For adversarial models, we observed a different pattern (see Figure [11](https://arxiv.org/html/2606.03214#S5.F11)): in internal validation, the F75M25 adversarial models performed worse for female patients than for male patients, while the F100 adversarial models performed equally for male and female patients. In external validation, however, we observe a notable improvement in performance for female patients.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x24.png)

Figure 10: Age-stratified model evaluation showing AUC performance across different age categories ($A_1$–$A_5$) for three model architectures (base, reinforcing, and adversarial) trained and validated on age-biased ISIC datasets (YOUNGER, BALANCED, OLDER) and evaluated on the curated PAD-UFES-20 dataset. The analysis demonstrates how each model type performs across different age distributions. The age brackets are defined as follows, where $a$ represents the patient’s age in years: $A_1 = \{0 \leq a \leq 50\}$, $A_2 = \{51 \leq a \leq 60\}$, $A_3 = \{61 \leq a \leq 70\}$, $A_4 = \{71 \leq a \leq 80\}$, and $A_5 = \{a \geq 81\}$.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x25.png)

Figure 11: Sex-stratified model evaluation showing AUC performance across male and female categories for three model architectures (base, reinforcing, and adversarial) trained and validated on sex-biased ISIC datasets (M100, F5M95, F25M75, F50M50, F75M25, F95M5 and F100) and evaluated on the curated PAD-UFES-20 dataset.

#### 5\.5\.2DERM7PT

##### External validation shows performance drop

During external validation, we observed that the models showed considerably lower AUC metrics than in the internal validation experiments \(see Figure[12](https://arxiv.org/html/2606.03214#S5.F12)\)\. Nevertheless, we found that their performance remained higher than the sex‑based results obtained on PAD‑UFES‑20\.

##### Reduced subgroup performance gaps across external datasets

We observed that the performance of the male and female subgroups remained much closer together across the various training scenarios than in the ISIC-based experiments, and the gaps were also less pronounced than those seen with PAD-UFES-20. In other words, the sex ratio distribution in the training data had a markedly smaller effect on the performance gaps between subgroups for the DERM7PT validation.

![Refer to caption](https://arxiv.org/html/2606.03214v1/x26.png)

Figure 12: Sex-stratified model evaluation showing AUC performance across male and female categories for three model architectures (base, reinforcing, and adversarial) trained and validated on sex-biased ISIC datasets (M100, F5M95, F25M75, F50M50, F75M25, F95M5 and F100) and evaluated on the curated DERM7PT dataset.

## 6Discussion and conclusions

We investigated the effect of demographic bias on skin lesion classification performance using three ResNet-50-based CNN models, with a specific focus on variations in patient sex and age in the training data. Using linear programming to generate datasets with controlled demographic distributions, we evaluated three learning strategies: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Additionally, we performed cross-dataset validation to assess the models’ generalisation capabilities. Overall, the results highlight that sex-related performance gaps are largely driven by training set imbalances, whereas age-related declines persist even with balanced sampling, and that domain shifts (from dermoscopic to smartphone images) cause substantial drops in external validation accuracy. Bias patterns were not consistent across all datasets.

##### Sex-based analysis

In our sex-based analysis, we observed that sex-specific training data produced better results, though single-task models exhibited notable sex bias. The reinforcing approach was partially effective in reducing this bias. The adversarial model eliminated sex bias specifically in cases involving predominantly female distributions (F95M5). An unexpected result emerged regarding sex-related bias in ISIC-based datasets. When the patient cohort was male-dominated, models achieved higher accuracy on male patients. Conversely, in female-dominated cohorts, models did not show a comparable boost for female patients. This asymmetry suggests that models tend to overfit to male-specific cues. In contrast to ISIC-based datasets, PAD-UFES-20 and DERM7PT showed that the adversarial model performed better on female patients than on male patients in female-dominated cohorts.

The auxiliary sex\-prediction head heavily relies on data composition: it learns demographic signals under balanced distributions but struggles in skewed scenarios\. This shows that the reinforcing model cannot build a robust discriminator when the minority class is underrepresented, leading to the collapse of the regularisation signal\. Our Brier score analysis confirms this mechanism: in balanced datasets, the head makes well\-calibrated predictions for both sexes, whereas in skewed datasets, it becomes over\-confident for the majority and inaccurate for the minority\. Ultimately, effective debiasing depends on the auxiliary head’s ability to reliably learn the attribute\. When data scarcity prevents this, the reinforcing model overfits to the dominant group, and the regularisation mechanism fails\.

Our findings on sex-related bias contradict the conclusion of Sies et al. that *despite sex-related imbalances in open access training data, the diagnostic performance of the CNN tested showed no sex-related bias in the classification of skin lesions* (Sies et al. ([2022](https://arxiv.org/html/2606.03214#bib.bib70))). We hypothesise that dataset size contributes to these divergent findings. Whereas Sies et al. leveraged over 150,000 dermoscopic images, our smaller dataset may have rendered the CNN more vulnerable to sex-related biases. Larger datasets typically offer greater diversity across demographic groups, facilitating the learning of robust and generalisable features. Conversely, smaller datasets may amplify existing biases or result in overfitting to specific demographic characteristics.

As expected, the base model is sensitive to sex bias, likely driven by overfitting and various anatomical confounders in the training data. These include sex-specific variations in skin thickness, hair distribution, lesion location, and underlying vasculature, as well as differences in sun exposure patterns linked to sex-linked behavioural factors. While the reinforcing and adversarial models incorporate regularisation techniques to mitigate such bias, our experiments revealed limited success: bias correction was observed only in the adversarial model for female-only cohorts. This suggests that complex anatomical confounders continue to substantially influence model predictions in mixed-sex populations, resisting standard regularisation approaches.

##### Age-based analysis

In age-related experiments, we found comparable baseline performance across all three model approaches, with a clear decline across age categories: the age group $A_1$ consistently achieved the highest performance, while subsequent age groups showed progressively lower scores, regardless of the training data distribution used (YOUNGER, BALANCED or OLDER).

The moderate predictive power of the age head ($0.319 \leq \rho \leq 0.517$) suggests it may not strongly influence bias mitigation. As shown in Figure [8](https://arxiv.org/html/2606.03214#S5.F8), performance disparities remain evident across all training distributions. While the OLDER dataset shows smaller subgroup gaps than the YOUNGER and BALANCED cases, we cannot definitively attribute this to successful bias mitigation. Instead, it appears to result from the model exploiting age-related shortcuts. This is supported by the strong correlation for the youngest cohort ($A_1$, $\rho = 0.710$) under older-skewed training, suggesting the model may rely on age-specific cues.

A plausible explanation for the trend of increasing overall correlation from the younger-skewed to the older-skewed training cohort is that a model trained on a cohort dominated by older patients learns a more reliable mapping from image characteristics to chronological age. Ageing skin shows more significant structural alterations, such as wrinkles, skin texture degradation, increased density of pigmentary spots, and larger pigmentary spots (Flament et al. ([2019](https://arxiv.org/html/2606.03214#bib.bib24))).

Based on Pearson correlation and MAE analyses, the auxiliary age\-prediction head functions as a demographically sensitive regressor that performs optimally when age distributions are balanced and match the training data\. Fairness\-focused pipelines should therefore employ balanced age sampling or explicit re\-weighting and augmentation to mitigate systematic bias\. This is particularly evident in the MAE analysis, which shows substantially higher errors for underrepresented age groups\.
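One simple form of explicit re-weighting is inverse-frequency weighting, sketched below. This is an illustration under stated assumptions rather than a procedure used in the paper: each sample is weighted by $n / (k \cdot n_g)$, where $n$ is the number of samples, $k$ the number of age brackets present, and $n_g$ the size of the sample's bracket, so that every bracket contributes the same total weight to the loss. The function name and the bracket labels in the example are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inverse_frequency_weights(groups: Sequence[Hashable]) -> List[float]:
    """Per-sample weights that give every group the same total weight (n / k)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


# Toy example: three samples from a well-represented bracket, one from a rare one.
weights = inverse_frequency_weights(["A1", "A1", "A1", "A5"])
print(weights)
```

The weights sum to the number of samples, so the overall scale of a weighted loss stays comparable to the unweighted one. Re-weighting only rescales existing samples; it does not add new variation for the underrepresented bracket, which is why augmentation is listed as a separate option.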

Several factors likely contribute to the observed age\-related performance differences\. Natural changes in skin characteristics with age can influence lesion classification\. Even with balanced sampling, intrinsic skin conditions may differ between age groups, potentially affecting model performance\.

Within the youngest age group \(A1A\_\{1\}\), malignant cases show a concentration toward the end of the age range, close to the upper boundary of the group\. This contrasts with benign lesions, which are well\-represented throughout the group, including at the beginning\. This internal skew results in an unequal distribution within the category: an overrepresentation of benign samples at the beginning of the group and malignant samples at the end\. Such a subgroup distribution can influence model training and complicate the interpretation of the corresponding test results\. While we maintained class balance across age distributions through our LP constraints, we recognise that the natural prevalence of malignancy varies by age in clinical practice\.

The consistent decline in performance for older age categories initially suggests a need for more balanced age representation in training data\. However, since this pattern persists even with balanced training sets, it indicates that factors beyond mere data distribution, such as intrinsic biological or physiological differences, are driving these performance gaps\.

The discretisation of continuous variables such as age presents methodological challenges, given that age is inherently continuous\. While categorical variables like patient sex have natural groupings, age categorisation necessitates arbitrary cut\-off points that may not correspond to biologically or clinically meaningful boundaries\.

Adversarial learning is designed to reduce reliance on confounding features by training a discriminator to predict protected attributes, while updating the main model so that the discriminator’s predictions become less accurate. However, our results show that this approach succeeded only in certain data compositions, failing to mitigate bias in other scenarios. This discrepancy suggests that shortcut learning (the reliance on superficial correlations rather than clinically relevant features) remains a significant concern. For instance, body hair patterns, which dermatologists have noted can significantly interfere with dermatoscopic examinations (Fink et al. ([2020](https://arxiv.org/html/2606.03214#bib.bib23))), could be unintentionally used by the model as a proxy for sex or age classification, potentially affecting performance differences between male and female patients. Similarly, confounding factors such as skin colour and image artefacts may influence classification performance across demographic subgroups. Future research should systematically investigate these factors to determine whether adversarial learning can be adapted to address them more robustly across all demographic compositions. Furthermore, we will examine the bias mechanisms by analysing the auxiliary head’s behaviour in the adversarial setting.
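The mechanism described at the start of this paragraph is often written as a min-max objective, sketched below in plain Python. The sketch uses binary cross-entropy for both the lesion classifier and the attribute discriminator, plus a trade-off weight `lam`; these choices and all names are assumptions for illustration, and the exact loss used in the paper is not restated in this section and may differ.

```python
import math
from typing import Dict, Sequence


def bce(y_true: Sequence[int], p_pred: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def adversarial_losses(
    y_lesion: Sequence[int],
    p_lesion: Sequence[float],
    y_attr: Sequence[int],
    p_attr: Sequence[float],
    lam: float = 1.0,
) -> Dict[str, float]:
    """Return the two losses that are minimised in alternation.

    The discriminator minimises its own attribute-prediction loss.
    The encoder and classifier minimise the lesion loss minus lam times
    that same loss, so they are rewarded when the discriminator does badly.
    """
    task = bce(y_lesion, p_lesion)
    adv = bce(y_attr, p_attr)
    return {"discriminator": adv, "encoder": task - lam * adv}


# Toy example (made-up values): same lesion predictions, two discriminator states.
at_chance = adversarial_losses([1, 0], [0.9, 0.1], [1, 0], [0.5, 0.5])
confident = adversarial_losses([1, 0], [0.9, 0.1], [1, 0], [0.9, 0.1])
print(at_chance["encoder"] < confident["encoder"])
```

In the toy example the encoder objective is lower when the discriminator is at chance than when it predicts the protected attribute confidently. This also illustrates one possible failure mode: if the discriminator itself cannot learn the attribute for an underrepresented group, the adversarial term carries little usable signal for that group.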

We demonstrate that skewed distributions in training data cause performance disparities across sex and age groups. Consequently, when using a dataset with a certain level of skewness, one possible mitigation strategy is to rebalance the data through augmentation. However, in many practical scenarios, acquiring additional instances is infeasible in the short term. Under such constraints, introducing synthetically generated images may provide an alternative way to restore balance (Stanley et al. ([2024](https://arxiv.org/html/2606.03214#bib.bib71)), Kebaili et al. ([2023](https://arxiv.org/html/2606.03214#bib.bib43))). Nonetheless, it remains crucial to understand the root sources of bias, such as demographic representation, acquisition protocols, or annotation practices, so these factors can be explicitly considered during the synthesis process.

##### Cross\-dataset validation and future directions

In our cross-dataset validation, performance patterns differed by demographic subgroup: younger age groups generally performed better but showed a notable decline in performance during external validation. In the external validation of models trained on varying female-to-male ratios, we observed improved performance for female patients.

External-validation performance may be affected by numerous factors, including differences in training data, imaging equipment (such as smartphones versus dermoscopes), population demographics, and image-collection protocols. The PAD-UFES-20 set comprises digital camera photographs that differ markedly from the dermoscopic images used for training. Because we evaluated the models without any fine-tuning, we observed a sharp drop in accuracy for the PAD-UFES-20 set. We assume that the lack of adaptation contributed to this decline, although other factors may also play a role. DERM7PT showed lower AUC than internal validation but outperformed PAD-UFES-20, with smaller male/female performance gaps than ISIC-based experiments. This likely reflects modality consistency (dermoscopic-to-dermoscopic transfer reduces domain shift) and potentially less salient sex-related visual cues in DERM7PT images compared to ISIC.

Previous research has provided valuable information in this area. Bevan and Atapour-Abarghouei demonstrated improved generalisation through “unlearning” spurious variations in skin lesion imaging instruments (Bevan and Atapour-Abarghouei ([2023](https://arxiv.org/html/2606.03214#bib.bib7))). Daneshjou and colleagues emphasised the need to address skin tone bias in dermatology AI systems before deploying them to diverse populations (Daneshjou et al. ([2022](https://arxiv.org/html/2606.03214#bib.bib19))). These studies confirm that performance decline results from differences in image acquisition methods and varying population characteristics between datasets. Additional research is needed to investigate cross-dataset performance across different imaging modalities, such as comparing model robustness between dermoscopic, smartphone, and conventional digital camera images. We recommend that future studies separate modality effects from demographic bias. One way is to assemble paired datasets where the same lesions are photographed with dermoscopes, smartphones, and regular digital cameras. These “multi-view” benchmarks would let us measure performance loss caused by modality changes while keeping lesion identity and patient demographics constant.

##### Concluding remarks

In conclusion, our experiments show that imbalanced training data mainly cause sex\-related performance differences in skin\-lesion classification\. Conversely, age\-related performance gaps remain even when the training set is balanced\.

Our findings highlight the importance of understanding two key aspects when designing and implementing AI in medical imaging: explicit factors (such as imaging protocols, patient demographic data, dataset distributions, and sampling methods) and implicit factors (such as geographic differences, biological variations, and demographic imbalances). Specifically, our research concludes that even with balanced training sets, performance disparities between demographic groups persisted, indicating that the relationship between these factors is complex. Previous studies have shown that bias in medical imaging arises from multiple interconnected factors (Zong et al. ([2023](https://arxiv.org/html/2606.03214#bib.bib84))). Disentangling how these factors reinforce or oppose each other, as well as their impact on the performance of medical image applications, remains a challenging area for further research.

We hope that the methodologies and insights developed through our research can serve as building blocks to create more sophisticated and equitable healthcare AI systems that better serve diverse patient populations\.

## Acknowledgments

This work was supported by the Netherlands Organisation for Scientific Research, grant no\. 023\.014\.010\.

## Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects\.

## Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper\.

## References

- S. Abbasi-Sureshjani, R. Raumanns, B. E. Michels, G. Schouten, and V. Cheplygina (2020). Risk of training diagnostic algorithms on data with demographic bias. In MICCAI LABELS workshop, Lecture Notes in Computer Science, Vol. 12446, pp. 183–192.
- E. Adeli, Q. Zhao, A. Pfefferbaum, E. V. Sullivan, L. Fei-Fei, J. C. Niebles, and K. M. Pohl (2021). Representation learning with statistical independence to mitigate bias. IEEE Winter Conf Appl Comput Vis 2021, pp. 2512–2522.
- V. Amrhein, S. Greenland, and B. McShane (2019). Scientists rise up against statistical significance. Nature 567(7748), pp. 305–307.
- G. Argenziano et al. (2000). Interactive atlas of dermoscopy: a tutorial. Edra Medical Publishing and New Media, Milan, Italy. Note: Book and CD-ROM. ISBN 88-86457-30-8.
- B. E. Bejnordi, M. Veta, P. J. van Diest, B. van Ginneken, N. Karssemeijer, G. Litjens, J. A. van der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. (2017). Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), pp. 2199–2210.
- M. Benčević, M. Habijan, I. Galić, D. Babin, and A. Pižurica (2024). Understanding skin color bias in deep learning-based skin lesion segmentation. Comput. Methods Programs Biomed. 245, pp. 108044.
- P. J. Bevan and A. Atapour-Abarghouei (2023). Skin deep unlearning: artefact and instrument debiasing in the context of melanoma classification. arXiv preprint arXiv:2109.09818.
- A. Bissoto, E. Valle, and S. Avila (2020). Debiasing skin lesion datasets and models? not so fast. arXiv:2004.11457.
- R. A. Caruana (1993). Multitask learning: a knowledge-based source of inductive bias. In Machine Learning Proceedings 1993, pp. 41–48.
- B. Cassidy, C. Kendrick, A. Brodzicki, J. Jaworek-Korjakowska, and M. H. Yap (2022). Analysis of the ISIC image datasets: usage, benchmarks and recommendations. Med. Image Anal. 75, pp. 102305.
- N. C. F. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, and A. Halpern (2018). Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). arXiv:1710.05006.
- N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, H. Kittler, and A. Halpern (2019). Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (isic). arXiv:1902.03368.
- M. Combalia, N. C. F. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, C. Carrera, A. Barreiro, A. C. Halpern, S. Puig, and J. Malvehy (2019). BCN20000: dermoscopic lesions in the wild. arXiv:1908.02288.
- R. Daneshjou, K. Vodrahalli, R. A. Novoa, M. Jenkins, W. Liang, V. Rotemberg, J. Ko, S. M. Swetter, E. E. Bailey, O. Gevaert, P. Mukherjee, M. Phung, K. Yekrang, B. Fong, R. Sahasrabudhe, J. A. C. Allerup, U. Okata-Karigane, J. Zou, and A. S. Chiou (2022). Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8(32), pp. eabq6147.
- A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), pp. 115–118.
- C. Fink, L. Uhlmann, K. Vogt, R. Schneiderbauer, C. Menzer, F. Toberer, T. E. Schank, A. Enk, and H. A. Haenssle (2020). Physicians’ level of hindrance by body hair in dermatoscopy and clinical benefit of an automated hair removal algorithm. J. Dtsch. Dermatol. Ges. 18(1), pp. 27–32.
- F. Flament, D. Velleman, S. Yamamoto, A. Nicolas, K. Udodaira, S. Yamamoto, C. Morimoto, S. Belkebla, C. Negre, and C. Delaunay (2019). Clinical impacts of sun exposures on the faces and hands of japanese women of different ages. Int. J. Cosmet. Sci. 41(5), pp. 425–436.
- R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence 2(11), pp. 665–673.
- A. Géron (2022). Hands-On machine learning with Scikit-Learn, keras, and TensorFlow. O’Reilly Media, Inc.
- N. Gerrits, B. Elen, T. Van Craenendonck, D. Triantafyllidou, I. N. Petropoulos, R. A. Malik, and P. De Boever (2021). Publisher correction: age and sex affect deep learning prediction of cardiometabolic risk factors from retinal images. Sci. Rep. 11(1), pp. 1198.
- J. W. Gichoya, I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L. Chen, R. Correa, N. Dullerud, M. Ghassemi, S. Huang, P. Kuo, M. P. Lungren, L. J. Palmer, B. J. Price, S. Purkayastha, A. T. Pyrros, L. Oakden-Rayner, C. Okechukwu, L. Seyyed-Kalantari, H. Trivedi, R. Wang, Z. Zaiman, and H. Zhang (2022a). AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4(6), pp. e406–e414.
- J. W. Gichoya, I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L. Chen, R. Correa, N. Dullerud, M. Ghassemi, S. Huang, et al. (2022b). AI recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health 4(6), pp. e406–e414.
- B. Glocker, C. Jones, M. Roschewitz, and S. Winzeck (2023). Risk of bias in chest radiography deep learning foundation models. Radiol. Artif. Intell. 5(6), pp. e230060.
- M\. Groh, C\. Harris, R\. Daneshjou, O\. Badri, and A\. Koochek \(2022\)Towards transparency in dermatology image datasets with skin tone annotations by experts, crowds, and an algorithm\.Proc\. ACM Hum\.\-Comput\. Interact\.6\(CSCW2\),pp\. 1–26\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p2.1.1)\.
- M\. Groh, C\. Harris, L\. Soenksen, F\. Lau, R\. Han, A\. Kim, A\. Koochek, and O\. Badri \(2021\)Evaluating deep neural networks trained on clinical images in dermatology with the fitzpatrick 17k dataset\.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops \(CVPRW\),pp\. 1820–1828\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1),[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p1.1.1),[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p2.1.1)\.
- D\. Gutman, N\. C\. F\. Codella, E\. Celebi, B\. Helba, M\. Marchetti, N\. Mishra, and A\. Halpern \(2016\)Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging \(isbi\) 2016, hosted by the international skin imaging collaboration \(isic\)\.External Links:1605\.01397Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p2.1.1),[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px1.p1.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 770–778\.Cited by:[§3\.2](https://arxiv.org/html/2606.03214#S3.SS2.p1.1.1)\.
- \[28\]ISIC archive\.Note:[https://gallery\.isic\-archive\.com](https://gallery.isic-archive.com/)Accessed: 2024\-06\-07Cited by:[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px1.p1.1)\.
- A\. Jiménez\-Sánchez, D\. Juodelyte, B\. Chamberlain, and V\. Cheplygina \(2023\)Detecting shortcuts in medical images\-a case study in chest x\-rays\.In2023 IEEE 20th International Symposium on Biomedical Imaging \(ISBI\),pp\. 1–5\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p4.1.1)\.
- C\. Jones and B\. Glocker \(2025\)A primer on causal and statistical dataset biases for fair and robust image analysis\.arXiv \[cs\.LG\]\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px2.p1.1.1)\.
- J\. Kawahara, S\. Daneshvar, G\. Argenziano, and G\. Hamarneh \(2018\)7\-point checklist and skin lesion classification using multi\-task multi\-modal neural nets\.IEEE J\. Biomed\. Health Inform\.23\(2\),pp\. 538–546\(en\)\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px3.p1.1.1)\.
- A\. Kebaili, J\. Lapuyade\-Lahorgue, and S\. Ruan \(2023\)Deep learning approaches for data augmentation in medical imaging: a review\.J\. Imaging9\(4\),pp\. 81\(en\)\.Cited by:[§6](https://arxiv.org/html/2606.03214#S6.SS0.SSS0.Px2.p10.1)\.
- H\. Kittler, H\. Pehamberger, K\. Wolff, and M\. Binder \(2002\)Diagnostic accuracy of dermoscopy\.Lancet Oncol\.3\(3\),pp\. 159–165\(en\)\.Cited by:[§3\.1](https://arxiv.org/html/2606.03214#S3.SS1.p1.1.1)\.
- M\. Klingenberg, D\. Stark, F\. Eitel, C\. Budding, M\. Habes, K\. Ritter, and Alzheimer’s Disease Neuroimaging Initiative \(2023\)Higher performance for women than men in MRI\-based alzheimer’s disease detection\.Alzheimers\. Res\. Ther\.15\(1\),pp\. 84\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px1.p3.1)\.
- I\. Ktena, O\. Wiles, I\. Albuquerque, S\. Rebuffi, R\. Tanno, A\. G\. Roy, S\. Azizi, D\. Belgrave, P\. Kohli, T\. Cemgil, A\. Karthikesalingam, and S\. Gowal \(2024\)Generative models improve fairness of medical classifiers under distribution shifts\.Nat\. Med\.30\(4\),pp\. 1166–1173\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p1.1.1)\.
- A\. J\. Larrazabal, N\. Nieto, V\. Peterson, D\. H\. Milone, and E\. Ferrante \(2020\)Gender imbalance in medical imaging datasets produces biased classifiers for computer\-aided diagnosis\.Proceedings of the National Academy of Sciences117\(23\),pp\. 12592–12594\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1),[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px1.p2.1)\.
- M\. Liu, J\. Zhang, E\. Adeli, and D\. Shen \(2019\)Joint classification and regression via deep multi\-task multi\-channel learning for alzheimer’s disease diagnosis\.IEEE Trans\. Biomed\. Eng\.66\(5\),pp\. 1195–1206\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p4.1)\.
- M\. Nauta, R\. Walsh, A\. Dubowski, and C\. Seifert \(2021\)Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis\.Diagnostics \(Basel\)12\(1\) \(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p1.1)\.
- A\. G\. Pacheco, G\. R\. Lima, A\. S\. Salomao, B\. Krohling, I\. P\. Biral, G\. G\. de Angelo, F\. C\. Alves Jr, J\. G\. Esgario, A\. C\. Simora, P\. B\. Castro,et al\.\(2020\)PAD\-ufes\-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones\.Data in brief32,pp\. 106221\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px2.p1.1)\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. P\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, and Others \(2019\)An imperative style, high\-performance deep learning library\.Adv\. Neural Inf\. Process\. Syst\.32,pp\. 8026–8037\.Cited by:[§3\.2](https://arxiv.org/html/2606.03214#S3.SS2.SSS0.Px4.p2.1)\.
- E\. Petersen, A\. Feragen, M\. L\. da Costa Zemsch, A\. Henriksen, O\. E\. Wiese Christensen, M\. Ganz, and A\. D\. N\. Initiative \(2022\)Feature robustness and sex differences in medical imaging: a case study in mri\-based alzheimer’s disease detection\.InInternational Conference on Medical Image Computing and Computer\-Assisted Intervention,pp\. 88–98\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1.2)\.
- R\. Raumanns, G\. Schouten, J\. P\. W\. Pluim, and V\. Cheplygina \(2025\)Dataset distribution impacts model fairness: single vs\. multi\-task learning\.InEthics and Fairness in Medical Imaging,E\. Puyol\-Antón, G\. Zamzmi, A\. Feragen, A\. P\. King, V\. Cheplygina, M\. Ganz\-Benjaminsen, E\. Ferrante, B\. Glocker, E\. Petersen, J\. S\. H\. Baxter, I\. Rekik, and R\. Eagleson \(Eds\.\),Cham,pp\. 14–23\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p5.1),[§3\.1\.2](https://arxiv.org/html/2606.03214#S3.SS1.SSS2.p1.1.1)\.
- V\. Rotemberg, N\. Kurtansky, B\. Betz\-Stablein, L\. Caffery, E\. Chousakos, N\. Codella, M\. Combalia, S\. Dusza, P\. Guitera, D\. Gutman, A\. Halpern, B\. Helba, H\. Kittler, K\. Kose, S\. Langer, K\. Lioprys, J\. Malvehy, S\. Musthaq, J\. Nanda, O\. Reiter, G\. Shih, A\. Stratigos, P\. Tschandl, J\. Weber, and H\.P\. Soyer \(2021\)A patient\-centric dataset of images and metadata for identifying melanomas using clinical context\.Scientific Data; London8\(1\),pp\. s41597–021\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p2.1.1),[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px1.p1.1)\.
- S\. Ruder \(2017\)An overview of multi\-task learning in deep neural networks\.arXiv preprint arXiv:1706\.05098\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p1.1.1)\.
- K\. Rufibach \(2010\)Use of brier score to assess binary predictions\.J\. Clin\. Epidemiol\.63\(8\),pp\. 938–9; author reply 939\(en\)\.Cited by:[§4\.2](https://arxiv.org/html/2606.03214#S4.SS2.SSS0.Px1.p1.6)\.
- A\. Saha, J\.S\. Bosma, J\.J\. Twilt, B\. van Ginneken, A\. Bjartell, A\.R\. Padhani, D\. Bonekamp, G\. Villeirs, G\. Salomon, G\. Giannarini, J\. Kalpathy\-Cramer, J\. Barentsz, K\.H\. Maier\-Hein, M\. Rusu, O\. Rouvière, R\. van den Bergh, V\. Panebianco, V\. Kasivisvanathan, N\.A\. Obuchowski, D\. Yakar,et al\.\(2024\)Artificial intelligence and radiologists in prostate cancer detection on mri \(pi\-cai\): an international, paired, non\-inferiority, confirmatory study\.Lancet Oncol\.\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1)\.
- P\. Seth and A\. K\. Pai \(2024\)Does the fairness of your Pre\-Training hold up? examining the influence of Pre\-Training techniques on skin tone bias in skin lesion classification\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,pp\. 570–577\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1)\.
- L\. Seyyed\-Kalantari, H\. Zhang, M\. B\. A\. McDermott, I\. Y\. Chen, and M\. Ghassemi \(2021\)Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under\-served patient populations\.Nat\. Med\.27\(12\),pp\. 2176–2182\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px1.p2.1)\.
- K\. Sies, J\. K\. Winkler, C\. Fink, F\. Bardehle, F\. Toberer, T\. Buhl, A\. Enk, A\. Blum, W\. Stolz, A\. Rosenberger, and H\. A\. Haenssle \(2022\)Does sex matter? analysis of sex\-related differences in the diagnostic performance of a market\-approved convolutional neural network for skin cancer detection\.Eur\. J\. Cancer164,pp\. 88–94\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px1.p3.1),[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p5.1),[§6](https://arxiv.org/html/2606.03214#S6.SS0.SSS0.Px1.p3.1)\.
- E\. A\. M\. Stanley, R\. Souza, A\. J\. Winder, V\. Gulve, K\. Amador, M\. Wilms, and N\. D\. Forkert \(2024\)Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging\.J\. Am\. Med\. Inform\. Assoc\.31\(11\),pp\. 2613–2621\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p1.1.1),[§6](https://arxiv.org/html/2606.03214#S6.SS0.SSS0.Px2.p10.1)\.
- P\. Tschandl, C\. Rosendahl, and H\. Kittler \(2018\)The HAM10000 dataset, a large collection of multi\-source dermatoscopic images of common pigmented skin lesions\.Vol\.5\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p2.1.1),[§3\.1\.1](https://arxiv.org/html/2606.03214#S3.SS1.SSS1.Px1.p1.1)\.
- A\. Vaidya, R\. J\. Chen, D\. F\. K\. Williamson, A\. H\. Song, G\. Jaume, Y\. Yang, T\. Hartvigsen, E\. C\. Dyer, M\. Y\. Lu, J\. Lipkova, M\. Shaban, T\. Y\. Chen, and F\. Mahmood \(2024\)Demographic bias in misdiagnosis by computational pathology models\.Nat\. Med\.30\(4\),pp\. 1174–1190\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px1.p1.1)\.
- M\. J\. Willemink, W\. A\. Koszek, C\. Hardell, J\. Wu, D\. Fleischmann, H\. Harvey, L\. R\. Folio, R\. M\. Summers, D\. L\. Rubin, and M\. P\. Lungren \(2020\)Preparing medical imaging data for machine learning\.Radiology,pp\. 192224\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p4.1.1)\.
- Y\. Wu, D\. Zeng, X\. Xu, Y\. Shi, and J\. Hu \(2022\)FairPrune: achieving fairness through pruning for dermatological disease diagnosis\.InMedical Image Computing and Computer Assisted Intervention – MICCAI 2022,pp\. 743–753\.Cited by:[§1](https://arxiv.org/html/2606.03214#S1.p1.1),[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p1.1.1)\.
- W\. Xu, Y\. Fu, and D\. Zhu \(2023\)ResNet and its application to medical image processing: research progress and challenges\.Comput\. Methods Programs Biomed\.240\(107660\),pp\. 107660\(en\)\.Cited by:[§3\.2](https://arxiv.org/html/2606.03214#S3.SS2.p1.1)\.
- J\. Yang, A\. A\. S\. Soltan, D\. W\. Eyre, Y\. Yang, and D\. A\. Clifton \(2023\)An adversarial training framework for mitigating algorithmic biases in clinical machine learning\.NPJ Digit Med6\(1\),pp\. 55\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px3.p1.1.1)\.
- P\. H\. Yi, J\. Wei, T\. K\. Kim, J\. Shin, H\. I\. Sair, F\. K\. Hui, G\. D\. Hager, and C\. T\. Lin \(2021\)Radiology “forensics”: determination of age and sex from chest radiographs using deep learning\.Emerg\. Radiol\.28\(5\),pp\. 949–954\(en\)\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhang and Q\. Yang \(2022\)A survey on multi\-task learning\.IEEE Trans\. Knowl\. Data Eng\.34\(12\),pp\. 5586–5609\.Cited by:[§2](https://arxiv.org/html/2606.03214#S2.SS0.SSS0.Px4.p1.1.1)\.
- Y\. Zong, Y\. Yang, and T\. Hospedales \(2023\)MEDFAIR: benchmarking fairness for medical imaging\.arXiv \[cs\.LG\]\.Cited by:[§6](https://arxiv.org/html/2606.03214#S6.SS0.SSS0.Px4.p2.1)\.

## Appendix A Sex distribution LP model

### A\.1 Decision variables

### A\.2 Constraints

$$
\begin{align}
x_{1} - x_{2} &= 0 \tag{1}\\
r x_{4} - x_{3} &= 0 \tag{2}\\
s x_{8} - x_{7} &= 0 \tag{3}\\
t x_{10} - x_{9} &= 0 \tag{4}\\
x_{1} - x_{3} - x_{4} &= 0 \tag{5}\\
x_{3} - x_{7} - x_{8} &= 0 \tag{6}\\
x_{4} - x_{9} - x_{10} &= 0 \tag{7}\\
u x_{12} - x_{11} &= 0 \tag{8}\\
v x_{14} - x_{13} &= 0 \tag{9}\\
x_{2} - x_{5} - x_{6} &= 0 \tag{10}\\
x_{6} - x_{13} - x_{14} &= 0 \tag{11}\\
x_{5} - x_{11} - x_{12} &= 0 \tag{12}\\
w x_{6} - x_{5} &= 0 \tag{13}
\end{align}
$$

Eq\.[1](https://arxiv.org/html/2606.03214#A1.E1): \# of malignant records equals \# of benign records\.

Eq\.[2](https://arxiv.org/html/2606.03214#A1.E2): Ratio $r$ of malignant male \(M\) to female \(F\) patients\.

Eq\.[3](https://arxiv.org/html/2606.03214#A1.E3): Ratio $s$ of malignant M \(age $<60$\) to M \(age $\geq 60$\)\.

Eq\.[4](https://arxiv.org/html/2606.03214#A1.E4): Ratio $t$ of malignant F \(age $<60$\) to F \(age $\geq 60$\)\.

Eq\.[5](https://arxiv.org/html/2606.03214#A1.E5): \# of malignant lesions equals the sum of malignant M and F\.

Eq\.[6](https://arxiv.org/html/2606.03214#A1.E6): \# M with malignant lesions equals the sum over both age groups\.

Eq\.[7](https://arxiv.org/html/2606.03214#A1.E7): \# F with malignant lesions equals the sum over both age groups\.

Eq\.[8](https://arxiv.org/html/2606.03214#A1.E8): Ratio $u$ of benign M \(age $<60$\) to M \(age $\geq 60$\)\.

Eq\.[9](https://arxiv.org/html/2606.03214#A1.E9): Ratio $v$ of benign F \(age $<60$\) to F \(age $\geq 60$\)\.

Eq\.[10](https://arxiv.org/html/2606.03214#A1.E10): \# of benign lesions equals the sum of benign M and F\.

Eq\.[11](https://arxiv.org/html/2606.03214#A1.E11): \# F with benign lesions equals the sum over both age groups\.

Eq\.[12](https://arxiv.org/html/2606.03214#A1.E12): \# M with benign lesions equals the sum over both age groups\.

Eq\.[13](https://arxiv.org/html/2606.03214#A1.E13): Ratio $w$ of benign M to F\.
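To make the constraint system concrete, the following Python sketch propagates the six ratios through the hierarchy and checks that all thirteen equalities hold. It is an illustration under stated assumptions, not the authors' implementation. The meaning of each variable is inferred from the equation descriptions above: $x_1$/$x_2$ are malignant/benign totals, $x_3$/$x_4$ malignant M/F, $x_5$/$x_6$ benign M/F, and $x_7$ to $x_{14}$ the corresponding age $<60$ and age $\geq 60$ cells. The function names `split`, `sex_lp_solution`, and `residuals` are hypothetical, and the number of malignant records `x1` is taken as a given input rather than optimised.

```python
from fractions import Fraction


def split(total, ratio):
    """Split total into (first, second) with first = ratio * second."""
    ratio = Fraction(ratio)  # assumes ratio >= 0, so 1 + ratio is never zero
    second = Fraction(total) / (1 + ratio)
    return ratio * second, second


def sex_lp_solution(x1, r, s, t, u, v, w):
    """Fill x_1..x_14 from the malignant total x1 and the six ratios.

    Variable meanings are inferred from the equation descriptions,
    not taken from an explicit list in the paper.
    """
    x = {1: Fraction(x1)}
    x[2] = x[1]                     # Eq. 1: benign total equals malignant total
    x[3], x[4] = split(x[1], r)     # Eqs. 2 and 5: malignant M / F
    x[7], x[8] = split(x[3], s)     # Eqs. 3 and 6: malignant M, <60 / >=60
    x[9], x[10] = split(x[4], t)    # Eqs. 4 and 7: malignant F, <60 / >=60
    x[5], x[6] = split(x[2], w)     # Eqs. 10 and 13: benign M / F
    x[11], x[12] = split(x[5], u)   # Eqs. 8 and 12: benign M, <60 / >=60
    x[13], x[14] = split(x[6], v)   # Eqs. 9 and 11: benign F, <60 / >=60
    return x


def residuals(x, r, s, t, u, v, w):
    """Left-hand sides of Eqs. 1-13; all are zero for a feasible point."""
    return [
        x[1] - x[2],
        r * x[4] - x[3],
        s * x[8] - x[7],
        t * x[10] - x[9],
        x[1] - x[3] - x[4],
        x[3] - x[7] - x[8],
        x[4] - x[9] - x[10],
        u * x[12] - x[11],
        v * x[14] - x[13],
        x[2] - x[5] - x[6],
        x[6] - x[13] - x[14],
        x[5] - x[11] - x[12],
        w * x[6] - x[5],
    ]
```

For example, `sex_lp_solution(1200, r=2, s=1, t=1, u=1, v=1, w=2)` describes a 2:1 male-majority split with balanced age groups, giving 800 malignant male and 400 malignant female records. Exact fractions are used so that the residual check is not affected by floating-point rounding; in practice the counts would still need rounding to whole records.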

## Appendix B Age distribution LP model

### B\.1 Decision variables

The age brackets are defined as follows, where $a$ represents the patient’s age in years\.

- $x_{1}$: \# of $A_{1} = \{0 \leq a \leq 50\}$ instances
- $x_{2}$: \# of $A_{2} = \{51 \leq a \leq 60\}$ instances
- $x_{3}$: \# of $A_{3} = \{61 \leq a \leq 70\}$ instances
- $x_{4}$: \# of $A_{4} = \{71 \leq a \leq 80\}$ instances
- $x_{5}$: \# of $A_{5} = \{a \geq 81\}$ instances

### B\.2 Objective function

$$
\begin{aligned}
&\text{Find a vector } x \text{ (decision variables)}\\
&\text{that maximises } f = x_{1} \text{ (objective function)}\\
&\text{subject to } a_{i1}x_{1} + a_{i2}x_{2} + \dots + a_{i5}x_{5} \leq b_{i} \text{ (constraints)}\\
&\qquad \text{for } i = 1, \dots, 5\\
&\text{and } x_{j} \geq 0 \text{ (non-negativity constraints)}\\
&\qquad \text{for } j = 1, \dots, 5
\end{aligned}
$$

### B\.3 Constraints

$$
\begin{align}
(100 - p_{1})x_{1} - p_{1}x_{2} - p_{1}x_{3} - p_{1}x_{4} - p_{1}x_{5} &= 0 \tag{1}\\
-p_{2}x_{1} + (100 - p_{2})x_{2} - p_{2}x_{3} - p_{2}x_{4} - p_{2}x_{5} &= 0 \tag{2}\\
-p_{3}x_{1} - p_{3}x_{2} + (100 - p_{3})x_{3} - p_{3}x_{4} - p_{3}x_{5} &= 0 \tag{3}\\
-p_{4}x_{1} - p_{4}x_{2} - p_{4}x_{3} + (100 - p_{4})x_{4} - p_{4}x_{5} &= 0 \tag{4}\\
-p_{5}x_{1} - p_{5}x_{2} - p_{5}x_{3} - p_{5}x_{4} + (100 - p_{5})x_{5} &= 0 \tag{5}
\end{align}
$$

Eq\.[1](https://arxiv.org/html/2606.03214#A2.E1): Distribution for first category with percentage $p_{1}$\.

Eq\.[2](https://arxiv.org/html/2606.03214#A2.E2): Distribution for second category with percentage $p_{2}$\.

Eq\.[3](https://arxiv.org/html/2606.03214#A2.E3): Distribution for third category with percentage $p_{3}$\.

Eq\.[4](https://arxiv.org/html/2606.03214#A2.E4): Distribution for fourth category with percentage $p_{4}$\.

Eq\.[5](https://arxiv.org/html/2606.03214#A2.E5): Distribution for fifth category with percentage $p_{5}$\.
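Each equality above can be rearranged to $100\,x_{j} = p_{j}\,(x_{1} + \dots + x_{5})$, so every bracket is pinned to a fixed share of the total and maximising $x_{1}$ amounts to finding the largest feasible total. The Python sketch below illustrates this under one assumption that is not stated in the appendix: the upper bounds are taken to be per-bracket availability caps (the number of records on hand in each age bracket). It solves this special case in closed form rather than calling a general LP solver, and the names `max_age_balanced_counts`, `constraint_residuals`, and `available` are hypothetical.

```python
from fractions import Fraction


def max_age_balanced_counts(percentages, available):
    """Largest counts x_j that match the target percentages p_j.

    percentages: integer target share p_j (in percent) per age bracket.
    available:   ASSUMED cap on x_j, e.g. records on hand per bracket.
    """
    if len(percentages) != len(available):
        raise ValueError("need one percentage and one cap per bracket")
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    # Eqs. 1-5 are equivalent to x_j = p_j / 100 * N, where N is the total.
    # A cap x_j <= available_j therefore bounds N by 100 * available_j / p_j.
    total = min(
        Fraction(100 * cap, p)
        for p, cap in zip(percentages, available)
        if p > 0
    )
    # Round down to whole records; this can shift the shares slightly.
    return [int(Fraction(p, 100) * total) for p in percentages]


def constraint_residuals(counts, percentages):
    """Left-hand sides of Eqs. 1-5 for the given counts."""
    total = sum(counts)
    return [100 * x - p * total for x, p in zip(counts, percentages)]
```

With `percentages = [50, 20, 10, 10, 10]` and `available = [1000, 300, 400, 250, 100]`, the fifth bracket is the binding one and the result is `[500, 200, 100, 100, 100]`, for which all five residuals are zero. When the caps do not divide evenly, the rounded counts satisfy the equalities only approximately.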

## Appendix C PAD\-UFES\-20 demographics

Table 8: Age distribution \($A_{1}-A_{5}$\) of skin lesions across sex and diagnosis in the curated PAD\-UFES\-20 dataset, showing the breakdown between male and female patients\. The age brackets are defined as follows, where $a$ represents the patient’s age in years: $A_{1} = \{0 \leq a \leq 50\}$, $A_{2} = \{51 \leq a \leq 60\}$, $A_{3} = \{61 \leq a \leq 70\}$, $A_{4} = \{71 \leq a \leq 80\}$, and $A_{5} = \{a \geq 81\}$\.
