I analyzed 25,500 LLM resume screenings to measure hiring bias. The results are a wake-up call.
Summary
A study of 25,500 LLM resume evaluations across 10 models found a 45% bias rate driven by "silent bias": models invented professional-sounding excuses to penalize candidates. Fairness and stability varied significantly across models. Claude, Mistral-Large, and Llama 4 were the most stable, while Qwen and older Gemini models were volatile.
Similar Articles
Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit
This paper audits six large language models for gender stereotyping across English, Korean, Chinese, and Japanese, anchoring against human baselines. It finds that LLM stereotyping often exceeds human cross-country variation and can compound across languages, introducing a four-pattern framework to characterize such behaviors.
Defining and evaluating political bias in LLMs
OpenAI presents a comprehensive framework for defining and evaluating political bias in LLMs, introducing a 500-prompt evaluation spanning 100 topics across five bias axes. Results show GPT-5 models achieve a 30% bias reduction compared to prior versions, with less than 0.01% of production ChatGPT responses exhibiting political bias.
Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
This paper studies how instruction-tuned LLMs can exhibit fair outputs while retaining biased internal representations in high-stakes decisions like mortgage underwriting, showing that these hidden biases are causally potent, asymmetric, and exploitable through activation steering.
How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
This paper presents a systematic evaluation of how differential privacy impacts social bias in large language models, finding that while it reduces bias in sentence scoring, the effect does not generalize across all tasks.
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
Researchers introduce MM-JudgeBias, a benchmark that exposes systematic compositional biases in multimodal large language models when used as automatic judges, testing 26 SOTA MLLMs across 1,800 samples.