Tag
This paper evaluates LLM-based coding agents (Claude Code and Codex) in social science analysis, finding they match or exceed human methodological diversity while remaining vulnerable to interpretation bias through verdict-layer manipulation.
The author observes that LLMs exhibit denominational bias depending on language (Protestant-leaning in English, Catholic-leaning in Spanish/French/Portuguese) and introduces a free Bible study app called Biblians.
This paper investigates how LLMs produce different outcomes based on conversational context, finding that topic, rather than explicit user demographics, is the primary driver of disparities in high-stakes scenarios like salary advice.
A study analyzing 25,500 LLM resume evaluations across 10 models found a 45% bias rate driven by 'silent bias', in which models invent professional-sounding excuses to penalize candidates. It highlights significant variability in fairness and stability: Claude, Mistral-Large, and Llama 4 were the most stable, while Qwen and older Gemini models were volatile.
This paper presents an experiment in which GPT-4.1 is asked 10,000 times to pick a random number between 1 and 100; the resulting distribution is then compared against a uniform baseline to measure bias.
A discussion on how AI language models may disproportionately recommend well-known brands, potentially making it harder for smaller companies to be discovered in AI-powered search.
This paper investigates how chain-of-thought prompting affects gender bias in large language models, finding that it does not consistently reduce bias and that apparent improvements stem from superficial compliance rather than genuine understanding.
This study demonstrates that large language models inherit and amplify biases from stigmatizing language in clinical notes, leading to less aggressive patient management, and that current mitigation strategies are insufficient.
This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.
This paper introduces a Probabilistic Graphical Model framework to causally audit LLM safety mechanisms, revealing that standard observational metrics overestimate demographic bias by ignoring context toxicity.
This paper investigates whether assigning personas to large language models induces human-like motivated reasoning, finding that persona-assigned LLMs show up to 9% reduced veracity discernment and are up to 90% more likely to evaluate scientific evidence in ways congruent with their induced political identity, with prompt-based debiasing largely ineffective.
This paper presents a large-scale audit of recommendation biases in LLM-based content curation across OpenAI, Anthropic, and Google, using 540,000 simulated selections from Twitter/X, Bluesky, and Reddit data. The study finds that LLMs systematically amplify polarization, exhibit distinct toxicity-handling trade-offs, and show a significant political bias favoring left-leaning authors despite a right-leaning plurality in the datasets.
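
The random-number experiment summarized above compares 10,000 picks between 1 and 100 against a uniform baseline. One common way to make that kind of comparison is a chi-square goodness-of-fit test. The sketch below is a minimal illustration under that assumption and is not necessarily the method the paper uses. The function name `chi_square_uniform` and the toy inputs are hypothetical, and no model is queried.

```python
from collections import Counter


def chi_square_uniform(picks, low=1, high=100):
    """Chi-square goodness-of-fit statistic of `picks` against a uniform
    distribution over the integers low..high (hypothetical helper).

    Returns (statistic, degrees_of_freedom).
    """
    if not picks:
        raise ValueError("picks must not be empty")
    if any(p < low or p > high for p in picks):
        raise ValueError("every pick must lie in [low, high]")

    k = high - low + 1            # number of possible values
    expected = len(picks) / k     # expected count per value if uniform
    counts = Counter(picks)

    # Sum over every possible value, including ones never picked.
    stat = sum(
        (counts.get(v, 0) - expected) ** 2 / expected
        for v in range(low, high + 1)
    )
    return stat, k - 1


if __name__ == "__main__":
    # Toy data standing in for model output (assumption, not real results).
    perfectly_uniform = list(range(1, 101)) * 100   # each value 100 times
    always_42 = [42] * 10_000                       # maximally skewed

    print(chi_square_uniform(perfectly_uniform))    # (0.0, 99)
    print(chi_square_uniform(always_42))            # (990000.0, 99)
```

A larger statistic means a stronger departure from uniformity. With 99 degrees of freedom, values above roughly 123 would typically be treated as significant at the 5% level.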