Can LLMs Hire Fairly? Racial Bias in Resume Screening

arXiv cs.CL Papers

Summary

This paper audits 14 large language models for hiring discrimination using a paired-resume methodology, finding that older models exhibit pro-White bias while newer models show null or pro-Black bias, indicating a reversal in algorithmic hiring bias across model generations.

arXiv:2606.28978v1 Announce Type: new Abstract: We audit fourteen mainstream large language models (LLMs) for hiring discrimination using the paired-resume methodology of Kline, Rose, and Walters (2022). The sole 2023-vintage model reproduces the pro-White callback gap documented in field experiments on labor market discrimination ($+2.12$ pp, significant at the 1\% level). Every model released in 2024 or after shows either a null gap or a significant pro-Black reversal (up to $-3.01$ pp). The same pattern holds on the gender axis. Based on 24,024 paired postings per model across 14 models, our results document a reversal in the direction of algorithmic hiring bias across model generations.
Original Article
View Cached Full Text

Cached at: 06/30/26, 05:29 AM

# Can LLMs Hire Fairly? Racial Bias in Resume Screening
Source: [https://arxiv.org/html/2606.28978](https://arxiv.org/html/2606.28978)
Zhenyu Gao, Wenxi Jiang, Yutong YanGao, Jiang, and Yan are at the Department of Finance, CUHK Business School, The Chinese University of Hong Kong\. Please send all correspondence to yutong\.yan@link\.cuhk\.edu\.hk\.

\(June 2026\)

###### Abstract

We audit fourteen mainstream large language models \(LLMs\) for hiring discrimination using the paired\-resume methodology ofkline2022systemic\. The sole 2023\-vintage model reproduces the pro\-White callback gap documented in field experiments on labor market discrimination \(\+2\.12\+2\.12pp, significant at the 1% level\)\. Every model released in 2024 or after shows either a null gap or a significant pro\-Black reversal \(up to−3\.01\-3\.01pp\)\. The same pattern holds on the gender axis\. Based on 24,024 paired postings per model across 14 models, our results document a reversal in the direction of algorithmic hiring bias across model generations\.

## 1Introduction

As large language models \(LLMs\) become embedded in consequential economic decisions, the question of whether they perpetuate or reshape existing patterns of discrimination has become an important question for researchers and regulators\. In June 2026, a federal judge in California ruled that Workday, whose AI\-powered screening software is used by virtually all Fortune 500 companies, must face class\-action claims alleging that its algorithms discriminated against Black applicants, women, and older workers\(wiessner2026workday\)\. The case is the first of its kind to broadly target algorithmic decision\-making in hiring, yet the empirical evidence base on whether and how LLMs discriminate remains thin\. We contribute to this evidence base by conducting a large\-scale paired\-resume audit, modeled directly on the influential correspondence experiments of10\.1257/0002828042002561andkline2022systemic, across fourteen LLMs spanning three years of releases\.

Our design follows the methodology ofkline2022systemic: we construct pairs of candidate profiles that are identical in every respect \(education, employment history, age, gender\) except for the candidate’s first and last name\. Names are drawn from the distinctively\-Black and distinctively\-White name lists established by10\.1257/0002828042002561and refined in Appendix B ofkline2022systemic\. Each profile\-posting pair is presented to the LLM with a standardized system prompt instructing it to act as an HR hiring manager and respond with a single word, “yes” or “no,” indicating whether to advance the candidate to a phone screen\. Temperature is set to zero, making each decision deterministic\. We score 24,024 paired postings per model on the race axis and 48,048 paired postings per model on the gender axis, drawing on 6,007 real US job postings from the Revelio Labs universe matched to thekline2022systemicFortune\-500 firm distribution\.

The headline finding is a sign reversal\. The 2023\-vintage model in our panel, OpenAI’s GPT\-3\.5\-turbo, reproduces the pro\-White callback gap documented in field experiments on labor market discrimination: White\-coded names receive callbacks 2\.12 percentage points more often than Black\-coded names \(significant at the 1% level, cluster\-robust\)\. This magnitude exceeds the 1\.6 pp within\-employer gap reported bykline2022systemicin their field experiment at 108 Fortune\-500 firms\.

Every model released in 2024 or later, however, exhibits either a null gap or a statistically significant gap in theoppositedirection, favoring Black\-coded names by 0\.4 to 3\.0 pp\. The pattern is not confined to a single provider or model family: it appears across OpenAI, Anthropic, Meta, Google, xAI, DeepSeek, Alibaba, and Zhipu models, and is robust to posting\-level cluster\-bootstrap inference\.

The same reversal holds on the gender axis: GPT\-3\.5\-turbo favors male candidates by 1\.92 pp \(significant at the 1% level\), while every model released in 2024 or later with completed gender data favors female candidates by 0\.18 to 5\.80 pp or shows a null gap\. The race and gender gaps are positively correlated across models: models that favor White\-coded names also tend to favor male\-coded names\.

Within model families, the trajectory is informative\. OpenAI’s own lineage illustrates the arc most cleanly: GPT\-3\.5\-turbo \(May 2023\) is strongly pro\-White and pro\-Male; GPT\-4o\-mini \(July 2024\) is significantly pro\-Black; GPT\-5\.4\-mini \(March 2026\) is null on both axes\. Meta’s Llama family shows a similar pattern: Llama\-3\.1\-8B\-Instruct \(July 2024\) exhibits the strongest pro\-Black gap in our panel \(−3\.01\-3\.01pp, significant at the 1% level\), while the larger Llama\-3\.3\-70B \(November 2024\) is null on race but significantly pro\-Female\. These within\-family trajectories suggest that the sign reversal is not an artifact of our design but reflects real changes in model alignment across successive releases\.

We make three contributions\. First, we provide the first systematic multi\-model audit of LLM hiring discrimination at a scale comparable to the field experiments that established the human baseline\. Our panel covers 14 models on the race axis \(24,024 paired postings per model\) and 13 models on the gender axis \(48,048 paired postings per model\), spanning the period from GPT\-3\.5 through the current generation of frontier models\. Second, we document a qualitative shift in the direction of algorithmic discrimination across model generations, consistent with changes in post\-training alignment procedures between 2023\- and 2024\-vintage models\. Third, we show that the race and gender axes are correlated across models: models that favor Black\-coded names also tend to favor female\-coded names, and the sole model that favors White\-coded names also favors male\-coded names\.

#### Related literature\.

Our work sits at the intersection of the correspondence\-audit literature in labor economics and the rapidly growing body of work on LLM bias\. We discuss each in turn\.

#### Correspondence audits\.

The paired\-resume methodology was pioneered by10\.1257/0002828042002561, who sent fictitious resumes to help\-wanted ads in Boston and Chicago and found that White\-coded names received 50% more callbacks than Black\-coded names, corresponding to a gap of approximately 3\.2 pp on a 6\.5% callback base\.kline2022systemicupdated and scaled this design to 108 Fortune\-500 firms with approximately 83,000 applications, documenting a persistent 1\.6 pp pro\-White gap after conditioning on employer\-by\-occupation fixed effects\. Our contribution is to apply this well\-established methodology to LLM\-based screeners, holding the experimental design constant while replacing the human decision\-maker with a language model\.

#### LLM bias and fairness\.

A large literature documents biases in language model outputs across dimensions including race, gender, religion, and political orientation\(seeblodgett\-etal\-2020\-language;gallegos\-etal\-2024\-bias, for surveys\)\. Most closely related to our work,veldanda2023emilyreplicated elements of the10\.1257/0002828042002561design on a small number of models and found no detectable bias on race or gender\. A key limitation of their approach is the use of scraped resumes that differ across candidates in education, experience, and skills, making it difficult to isolate the effect of race from unobserved heterogeneity in candidate quality\. Our paired\-resume design, followingkline2022systemic, eliminates this confound by holding all resume content identical within each pair, varying only the candidate’s name\.

#### AI in hiring and algorithmic fairness\.

More broadly, a growing literature examines algorithmic bias in consequential economic decisions\.fuster2022predictablyshow that machine learning models in credit markets can improve predictive accuracy but simultaneously increase disparities between racial groups, because the gains from better prediction accrue unevenly\. Our setting differs in that hiring decisions are binary and the bias we document reverses sign across model generations, but the underlying concern is the same: algorithmic tools can reshape the distribution of economic opportunity in ways that are difficult to anticipate\. The use of AI tools in employment screening has attracted growing regulatory attention\. The European Union’s AI Act classifies AI systems used in recruitment as “high risk,” requiring conformity assessments and ongoing monitoring\. In the United States, New York City’s Local Law 144 \(effective July 2023\) requires bias audits of automated employment decision tools, and the Workday litigation represents the first federal class action challenging such tools under existing anti\-discrimination statutes\. Our results speak directly to the empirical premise of these regulatory efforts: we show that the direction and magnitude of algorithmic bias depend on which model is deployed, when it was trained, and how it is prompted\.

## 2Methodology

### 2\.1Experimental Design

We conduct a paired correspondence audit in which large language models \(LLMs\) evaluate fictitious candidate profiles for real job postings\. The design followskline2022systemicas closely as possible, adapting only those elements that are necessary for the LLM setting\. This section describes the data construction, the scoring procedure, and the statistical framework\.

#### Job postings\.

We draw 6,007 entry\-level US job postings from the Revelio Labs job\-postings universe, sampled to match the Fortune\-500 firm distribution inkline2022systemic\. Entry\-level status is determined by title\-keyword filtering \(e\.g\., “analyst,” “associate,” “coordinator,” “assistant”\) applied to the raw posting title\. Each posting record includes the firm name, job title, metropolitan area, state, and post date\.

#### Candidate profiles\.

For each posting, we construct four independent paired\-resume draws \(pair\_id∈\{0,1,2,3\}\\in\\\{0,1,2,3\\\}\)\. Within each pair, two candidate profiles are generated that are identical in education, employment history, age, and gender, differing only in the candidate’s first and last name\. Names are drawn uniformly from the distinctively\-Black and distinctively\-White name lists in10\.1257/0002828042002561Tables A1–A2 andkline2022systemicAppendix B Tables B1–B2\. Education credentials \(school, degree, GPA\) and employment histories \(prior employers, job titles, tenure\) are constructed following the resume\-building procedure described inkline2022systemicAppendix B, using school and establishment names drawn from reference databases matched to the posting’s geographic area and industry\. After excluding a small number of postings with missing fields, the final race\-audit sample contains 24,024 complete Black\-White pairs \(48,048 profile\-posting rows\) drawn from 6,006 unique postings\.

#### Gender extension\.

To study gender discrimination, we extend the profile bank by crossing race and gender, yielding four conditions per pair: Black\-male, Black\-female, White\-male, White\-female\. The gender contrast pairs Male against Female within the same posting, pair\_id, and race prefix, producing24,024×2=48,04824\{,\}024\\times 2=48\{,\}048gender pairs \(two per posting\-pair, one for each race stratum\)\. Names are gender\-appropriate within each race category, following the gendered name lists inkline2022systemicAppendix B Table B1\.

### 2\.2LLM Scoring

#### Prompt design\.

Each profile\-posting pair is scored by presenting the LLM with a two\-message conversation\. The system message reads:

> You are an HR hiring manager screening a job application\. Given a job posting and a single candidate profile in JSON, decide whether to advance this candidate to a callback \(phone screen\)\. Respond with exactly one lowercase word: ‘yes’ \(callback\) or ‘no’ \(no callback\)\. Do not include punctuation, explanation, or any other text\.

The user message presents the posting fields \(firm, title, location, post date\) in a labeled format, followed by the candidate profile as a JSON object, and concludes with the cue “Answer with exactly one word: yes or no\.” The prompt is held constant across all models and all profile\-posting pairs\.

#### Decoding parameters\.

All models are scored at temperatureT=0T=0\(greedy decoding\), producing deterministic outputs\. For non\-reasoning models, we set max\_tokens=200=200\. For reasoning\-class models \(GPT\-5\.x family\), we suppress hidden chain\-of\-thought reasoning via the provider’s API and set max\_tokens=16=16\(the minimum permitted by the API with reasoning disabled\)\. The response is parsed for the first occurrence of “yes” or “no” in the output text; rows that do not contain either token are recorded as failures and excluded from the analysis\.

#### Model panel\.

Our panel comprises fourteen models, spanning release dates from May 2023 to April 2026\. The panel includes models from eight providers \(OpenAI, Anthropic, Meta, Google, xAI, DeepSeek, Alibaba/Qwen, and Zhipu/GLM\) and covers a range of model sizes from 8B to 397B parameters\. Models are accessed via the Together AI, OpenRouter, and AWS Bedrock inference APIs\. Each model scores the full set of 48,048 race\-axis rows \(24,024 pairs\) and 96,096 gender\-axis rows \(48,048 pairs\) independently\.

### 2\.3Statistical Framework

#### Callback gap\.

For each model, we compute the per\-race callback rate asP^​\(yes∣r\)=Nr−1​∑i=1Nryi​r\\hat\{P\}\(\\text\{yes\}\\mid r\)=N\_\{r\}^\{\-1\}\\sum\_\{i=1\}^\{N\_\{r\}\}y\_\{ir\}, whereyi​r∈\{0,1\}y\_\{ir\}\\in\\\{0,1\\\}is the callback decision for profileiiwith racer∈\{Black,White\}r\\in\\\{\\text\{Black\},\\text\{White\}\\\}\. The race gap in percentage points is

=race\[P^\(yes∣White\)−P^\(yes∣Black\)\]×100\.\{\}\_\{\\text\{race\}\}=\\left\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{White\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Black\}\)\\right\]\\times 100\.\(1\)A positiverace\{\}\_\{\\text\{race\}\}indicates pro\-White discrimination; a negative value indicates pro\-Black discrimination\. The gender gapgender\{\}\_\{\\text\{gender\}\}is defined analogously as\[P^​\(yes∣Male\)−P^​\(yes∣Female\)\]×100\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Male\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Female\}\)\]\\times 100, where pairs are formed within posting×\\timespair\_id×\\timesrace\-prefix\.

#### McNemar’s test\.

Because our profiles are paired within postings, only discordant pairs, those in which the two profiles receive different decisions, carry information about the gap\. Letbbdenote the number of pairs in which the advantaged\-group profile receives “yes” and the other receives “no” \(e\.g\., White\-yes, Black\-no for the race axis\), and letccdenote the reverse\. McNemar’s two\-sided test statistic with continuity correction is

=2\(\|b−c\|−1\)2b\+c,\{\}^\{2\}=\\frac\{\(\|b\-c\|\-1\)^\{2\}\}\{b\+c\},\(2\)which follows a2distribution with one degree of freedom under the null hypothesisb=cb=c\. We report significance at thep<0\.05p<0\.05\(∗\*\),p<0\.01p<0\.01\(∗⁣∗\*\*\), andp<0\.001p<0\.001\(∗⁣∗⁣∗\*\*\*\) levels\.

#### Cluster\-robust inference\.

Because each posting contributes four pair\_ids, observations within a posting are not independent\. We address this in two ways\. First, we compute 95% confidence intervals via a cluster bootstrap at the posting level: we resample the 6,006 posting clusters with replacement 2,000 times, compute the gap on each bootstrap sample, and report the 2\.5th and 97\.5th percentiles\. Second, we compute cluster\-robust standard errors by treating each posting’s mean within\-pair difference as a single observation, yielding att\-statistic withG−1=6,005G\-1=6\{,\}005degrees of freedom\. In practice, the cluster\-robust inference changes no qualitative conclusion relative to the unadjusted McNemar test: all results that are significant atp<0\.001p<0\.001under McNemar remain significant atp<0\.001p<0\.001under cluster\-robust inference\.

## 3Results

### 3\.1Race Discrimination

Table[I](https://arxiv.org/html/2606.28978#S4.T1)presents the race\-axis results for all fourteen models, sorted by release date\. The table reports the callback rate, the gaprace\{\}\_\{\\text\{race\}\}in percentage points, the posting\-level cluster\-bootstrap 95% confidence interval, the discordant\-pair countsbb\(White\-yes, Black\-no\) andcc\(Black\-yes, White\-no\), and the McNemarpp\-value\.

#### GPT\-3\.5\-turbo reproduces the gap from field experiments\.

The oldest model in our panel, GPT\-3\.5\-turbo \(May 2023\), exhibits a callback gap of\+2\.12\+2\.12pp \(95% CI\[\+1\.82,\+2\.43\]\[\+1\.82,\+2\.43\], significant at the 1% level\)\. White\-coded names receive “yes” in 953 discordant pairs versus 443 for Black\-coded names, a ratio of approximately 2\.15:1\. The magnitude exceeds the\+1\.6\+1\.6pp within\-employer gap documented bykline2022systemicin their field experiment at 108 Fortune\-500 firms\.

#### 2024\+ models reverse the sign\.

Every model released from July 2024 onward exhibits either a null gap or a significant pro\-Black gap\. The reversal is broad\-based: Claude Haiku 4\.5 \(−0\.70\-0\.70pp\), GPT\-4o\-mini \(−0\.61\-0\.61pp\), Llama\-3\.1\-8B\-Instruct \(−3\.01\-3\.01pp\), Gemma\-4\-31B\-it \(−0\.67\-0\.67pp\), DeepSeek\-V3\.1 \(−1\.04\-1\.04pp\), Qwen3\.5\-397B \(−1\.04\-1\.04pp\), and GLM\-5\.1 \(−1\.60\-1\.60pp\) are all significant at the 1% level\. Gemini\-2\.5\-flash \(−0\.37\-0\.37pp\) and Grok\-4\.1\-fast \(−0\.40\-0\.40pp\) are significant at the 5% level, though neither survives a Bonferroni correction for fourteen models\. Four models, Claude 3 Haiku, Llama\-3\.3\-70B, GPT\-oss\-120b, and GPT\-5\.4\-mini, show point estimates that are negative but whose confidence intervals include zero\. No 2024\+ model exhibits a significant pro\-White gap\.

#### Within\-family trajectories\.

The OpenAI lineage traces a clear arc: GPT\-3\.5\-turbo \(\+2\.12\+2\.12pp, pro\-White\)→\\toGPT\-4o\-mini \(−0\.61\-0\.61pp, pro\-Black\)→\\toGPT\-5\.4\-mini \(−0\.14\-0\.14pp, null\)\. Among Meta models, the smaller Llama\-3\.1\-8B\-Instruct \(−3\.01\-3\.01pp\) shows a much larger pro\-Black gap than the larger Llama\-3\.3\-70B \(−0\.16\-0\.16pp, null\), suggesting that the strength of the alignment\-induced reversal may vary with model scale within the same family\.

#### Magnitudes and discordant\-pair structure\.

The discordant\-pair counts vary widely across models\. GPT\-3\.5\-turbo has 1,396 discordant pairs \(5\.8% of all pairs\) with a strong White tilt \(953:443, ratio 2\.15:1\)\. Among 2024\+ models, the number of discordant pairs ranges from 547 \(Gemma\-4\-31B\) to 3,643 \(Llama\-3\.1\-8B\), and the tilt consistently favors Black\-coded names\.

### 3\.2Gender Discrimination

Table[II](https://arxiv.org/html/2606.28978#S4.T2)presents the gender\-axis results for thirteen models\. The structure mirrors Table[I](https://arxiv.org/html/2606.28978#S4.T1): we report=gender\[P^\(yes∣Male\)−P^\(yes∣Female\)\]×100\{\}\_\{\\text\{gender\}\}=\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Male\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Female\}\)\]\\times 100, with positive values indicating pro\-Male discrimination\.

#### GPT\-3\.5\-turbo is the sole pro\-Male model\.

GPT\-3\.5\-turbo exhibits a gender gap of\+1\.92\+1\.92pp \(95% CI\[\+1\.71,\+2\.13\]\[\+1\.71,\+2\.13\], significant at the 1% level\), with 1,750 discordant pairs favoring males versus 827 favoring females\. This is the only model in our panel that significantly favors male candidates\.

#### 2024\+ models favor female candidates\.

Ten of the twelve remaining models show a significant pro\-Female gap, ranging from−0\.18\-0\.18pp \(Gemma\-4\-31B, significant at the 5% level\) to−5\.80\-5\.80pp \(Claude 3 Haiku, significant at the 1% level\)\. Two models, Gemini\-2\.5\-flash \(−0\.03\-0\.03pp\) and GPT\-5\.4\-mini \(\+0\.11\+0\.11pp\), show null gender gaps, with confidence intervals that comfortably include zero\.

#### Correlation between race and gender axes\.

The race and gender gaps are positively correlated across models\. GPT\-3\.5\-turbo, the only pro\-White model, is also the only pro\-Male model\. Claude 3 Haiku exhibits the largest pro\-Female gender gap in the panel \(−5\.80\-5\.80pp\), followed by DeepSeek\-V3\.1 \(−2\.12\-2\.12pp\)\. The two null\-gender models \(Gemini\-2\.5\-flash and GPT\-5\.4\-mini\) are among the weaker pro\-Black models on the race axis\. This pattern is consistent with a common alignment mechanism that shifts model preferences toward both minority\-race and female candidates simultaneously\.

### 3\.3Robustness

#### Cluster\-robust inference\.

As noted in Section 2, we compute posting\-level cluster\-bootstrap confidence intervals and cluster\-robust standard errors to account for within\-posting dependence across the four pair\_ids\. In every case, the cluster\-robustzz\-statistic agrees in sign and significance with the McNemar test\. The most notable change is Gemma\-4\-31B’s gender gap, which weakens fromp<0\.05p<0\.05\(McNemar\) top=0\.023p=0\.023\(cluster\-robust\), placing it just inside the conventional 5% threshold but outside a Bonferroni\-corrected threshold for fourteen models\.

#### Multiple comparisons\.

We test fourteen models on the race axis and thirteen on the gender axis\. Under a Bonferroni correction at the family\-level=0\.05\\alpha=0\.05, the per\-model threshold isp<0\.0036p<0\.0036for race andp<0\.005p<0\.005for gender\. At this threshold, the race\-axis results for Gemini\-2\.5\-flash \(p=0\.041p=0\.041\) and Grok\-4\.1\-fast \(p=0\.019p=0\.019\) lose significance; all other significant results survive\. On the gender axis, Gemma\-4\-31B \(p=0\.023p=0\.023\) loses significance\. We report both unadjusted and Bonferroni\-adjusted significance in the tables\.

## 4Conclusion

We conduct a paired correspondence audit of fourteen large language models \(LLMs\), applying the methodology ofkline2022systemicto LLM\-based hiring screeners\. Our principal finding is a sign reversal in the direction of racial discrimination across model generations\. The lone 2023\-vintage model in our panel, GPT\-3\.5\-turbo, reproduces the pro\-White callback gap documented in field experiments on labor market discrimination \(\+2\.12\+2\.12pp, significant at the 1% level\)\. Every 2024\+ model exhibits either a null gap or a statistically significant pro\-Black gap, with significant pro\-Black gaps ranging from−0\.37\-0\.37pp to−3\.01\-3\.01pp\. The same pattern holds on the gender axis: GPT\-3\.5\-turbo favors male candidates \(\+1\.92\+1\.92pp\), while 2024\+ models uniformly favor female candidates or show no gap\.

These results have two implications\. First, the direction of algorithmic hiring bias is not a fixed property of language models but varies systematically with model vintage, provider, and scale\. Within the OpenAI lineage alone, the race gap moves from\+2\.12\+2\.12pp \(pro\-White, 2023\) to−0\.61\-0\.61pp \(pro\-Black, 2024\) to null \(2026\)\. This trajectory is consistent with the hypothesis that successive generations of post\-training alignment have shifted model behavior from reproducing the pro\-White patterns in pretraining data to actively favoring minority\-coded names, and then toward neutrality as alignment techniques have matured\. Second, neither the original pro\-White bias nor its pro\-Black reversal is desirable from a fairness standpoint\. A hiring screener that systematically favors one group over another, in either direction, fails the basic requirement of equal treatment regardless of race or gender\.

The scale of our audit, 14 models on the race axis and 13 on the gender axis with over 24,000 paired postings per model, provides, to our knowledge, the most comprehensive evidence to date on how LLMs behave when placed in the role of an HR screener\. As AI\-powered hiring tools become ubiquitous, systematic auditing of the kind we conduct here will be essential for both regulatory compliance and public accountability\.

## References

Figure I:Race Discrimination Across Twelve LLMsThis figure plots the callback gap=race\[P^\(yes∣White\)−P^\(yes∣Black\)\]×100\{\}\_\{\\text\{race\}\}=\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{White\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Black\}\)\]\\times 100in percentage points for each model, sorted by release date\. Error bars denote posting\-level cluster\-bootstrap 95% confidence intervals \(2,000 replications over 6,006 posting clusters\)\. Filled diamonds indicate a significant pro\-White gap; filled squares indicate a significant pro\-Black gap; open circles indicate gaps whose confidence intervals include zero\.

![Refer to caption](https://arxiv.org/html/2606.28978v1/x1.png)

Figure II:Gender Discrimination Across Thirteen LLMsThis figure plots the callback gap=gender\[P^\(yes∣Male\)−P^\(yes∣Female\)\]×100\{\}\_\{\\text\{gender\}\}=\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Male\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Female\}\)\]\\times 100in percentage points for each model, sorted by release date\. Error bars denote posting\-level cluster\-bootstrap 95% confidence intervals\. Filled diamonds indicate a significant pro\-Male gap; filled squares indicate a significant pro\-Female gap; open circles indicate gaps whose confidence intervals include zero\.

![Refer to caption](https://arxiv.org/html/2606.28978v1/x2.png)

Table I:Race Discrimination Across Fourteen LLMsThis table reports results from a paired correspondence audit with 24,024 Black\-White resume pairs per model\.race\{\}\_\{\\text\{race\}\}is the callback gap\[P^​\(yes∣White\)−P^​\(yes∣Black\)\]×100\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{White\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Black\}\)\]\\times 100in percentage points\.bbdenotes the number of discordant pairs in which the White\-coded profile receives a callback and the Black\-coded profile does not;ccdenotes the reverse\. CI is the posting\-level cluster\-bootstrap 95% confidence interval \(2,000 replications over 6,006 posting clusters\)\. \*\*\*, \*\*, and \* denote significance at the 1%, 5%, and 10% levels, respectively \(McNemar two\-sided with continuity correction\)\.

Table II:Gender Discrimination Across Thirteen LLMsThis table reports results from a paired correspondence audit with 48,048 Male\-Female resume pairs per model \(two race strata×\\times24,024 postings\)\.gender\{\}\_\{\\text\{gender\}\}is the callback gap\[P^​\(yes∣Male\)−P^​\(yes∣Female\)\]×100\[\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Male\}\)\-\\hat\{P\}\(\\text\{yes\}\\mid\\text\{Female\}\)\]\\times 100in percentage points\.bbdenotes the number of discordant pairs in which the Male\-coded profile receives a callback and the Female\-coded profile does not;ccdenotes the reverse\. CI and significance conventions as in Table[I](https://arxiv.org/html/2606.28978#S4.T1)\.

Similar Articles

I analyzed 25,500 LLM resume screenings to measure hiring bias. The results are a wake-up call.

Reddit r/artificial

A study analyzing 25,500 LLM resume evaluations across 10 models found a 45% bias rate driven by 'silent bias', with models inventing professional-sounding excuses to penalize candidates. It highlights significant variability in fairness and stability, with Claude, Mistral-Large, and Llama 4 being most stable, while Qwen and older Gemini models were volatile.

Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

arXiv cs.CL

This paper audits six large language models for gender stereotyping across English, Korean, Chinese, and Japanese, anchoring against human baselines. It finds that LLM stereotyping often exceeds human cross-country variation and can compound across languages, introducing a four-pattern framework to characterize such behaviors.