I analyzed 25,500 LLM resume screenings to measure hiring bias. The results are a wake-up call.

Reddit r/artificial 06/01/26, 01:46 PM News

llm-bias hiring-bias ai-audit resume-screening bias-detection ai-fairness

Summary

A study of 25,500 LLM resume evaluations across 10 models reports that an independent AI auditor flagged a 45% bias rate driven by 'silent bias': models inventing professional-sounding excuses to penalize candidates. It also reports a 6x difference in stability between systems, with the Claude models, Mistral-Large, and Llama 4 the most stable and Qwen and older Gemini models the most volatile.

Hey Reddit, I just published a study analyzing 25,500 LLM resume evaluations to measure hiring bias. I swapped minor identity and demographic variables on the exact same work history across 10 different models, and an independent AI auditor flagged a staggering 45% bias rate driven by "silent bias."

Instead of saying anything overtly offensive, models invent professional-sounding excuses to penalize candidates. For example, one model dropped its score after I changed the university to MIT, suddenly claiming the candidate's experience wasn't relevant despite praising that exact same experience on the baseline resume.

We also found a massive 6x difference in stability between systems. Qwen and older Gemini models were highly volatile, while the Claude models, Mistral-Large, and Llama 4 proved to be the most stable and fair.

Ultimately, AI screening tools are outputting highly subjective, unpredictable opinions driven by statistical noise rather than objective truth, which makes them a massive liability under regulations like the EU AI Act.

You can read the full write-up and explore our interactive data app here: [https://re-cinq.com/blog/ai-hiring-bias-25500-llm-evaluations](https://re-cinq.com/blog/ai-hiring-bias-25500-llm-evaluations)
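For readers who want to try the general idea themselves, here is a minimal sketch of a swap-one-variable audit. It is not the study's actual pipeline: the `Scorer` type, `make_variants`, `audit`, and `toy_scorer` are all hypothetical names, and `toy_scorer` is a deterministic stand-in with a deliberately planted penalty so the example runs without any model API. In a real audit the scorer would wrap a call to the model under test.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

Resume = Dict[str, str]
# Hypothetical interface: anything that takes a resume and returns a numeric score.
Scorer = Callable[[Resume], float]


def make_variants(baseline: Resume, field: str, values: List[str]) -> List[Resume]:
    """Copy the baseline resume, changing only one field per variant."""
    return [{**baseline, field: value} for value in values]


def audit(scorer: Scorer, baseline: Resume, variants: List[Resume],
          repeats: int = 3) -> Dict[str, float]:
    """Compare variant scores with the baseline and measure run-to-run spread."""
    base_runs = [scorer(baseline) for _ in range(repeats)]
    base_mean = mean(base_runs)
    # Average each variant over several runs, then take its shift from the baseline.
    deltas = [mean(scorer(v) for _ in range(repeats)) - base_mean for v in variants]
    return {
        "baseline_mean": base_mean,
        "baseline_spread": pstdev(base_runs),  # 0.0 means perfectly repeatable
        "max_abs_delta": max(abs(d) for d in deltas),
    }


def toy_scorer(resume: Resume) -> float:
    """Stand-in for a model call: rates experience but leaks the university field."""
    score = 7.0 if "Python" in resume["experience"] else 4.0
    if resume["university"] == "MIT":
        score -= 1.5  # planted penalty, for illustration only
    return score


baseline = {"experience": "5 years Python backend", "university": "State University"}
variants = make_variants(baseline, "university", ["MIT", "Community College"])
report = audit(toy_scorer, baseline, variants)
print(report)
```

Since the work history never changes, any nonzero `max_abs_delta` comes from the swapped field alone, and `baseline_spread` gives a rough stability number for comparing models on identical input.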
