The Ghost Annotator framework combines conformal prediction with collaborative filtering to model LLM behavior and human label variation in content moderation, revealing structural demographic biases in larger models.
A study analyzing 25,500 LLM resume evaluations across 10 models found a 45% bias rate driven by 'silent bias', in which models invent professional-sounding excuses to penalize candidates. Fairness and stability varied widely across models: Claude, Mistral-Large, and Llama 4 were the most stable, while Qwen and older Gemini models were volatile.
This paper introduces GPF-LiveNews, a streaming evaluation protocol for auditing how large language models frame live news events differently for various demographic groups, using semantic sensitivity and sentiment disparity measures across 42 identity labels and seven prompt families.
This paper introduces Counterfactual Explanation Consistency (CEC), a framework to detect and mitigate hidden procedural bias in outcome-fair models by aligning feature attributions between individuals and their counterfactual counterparts, with experiments on credit and income datasets.
Researchers propose a surrogate modeling framework to quantify and interpret latent medical knowledge encoded in black-box LLMs, revealing both valid associations and persistent racial biases.
Columbia and Northwestern researchers propose a pipeline to surface race and gender bias in LLM abstractive summaries of life-story interviews, highlighting risks of representational harm.
Introduces a framework to quantify how LLMs overstate certainty through rhetorical devices, revealing model-agnostic patterns of epistemic-rhetorical miscalibration.
An arXiv preprint maps stereotype-encoding neurons and attention heads in GPT-2 Small and Llama 3.2, showing that biases cluster in small subsets of neurons, yet ablating those neurons barely reduces biased text generation.
Google Research introduces LocQA, a 12-language dataset revealing that multilingual LLMs exhibit strong US-centric and population-based locale biases when answering ambiguous locale-dependent questions.
Academic study exposes systemic counterfactual unfairness in LLMs: jokes from privileged speakers are refused 67% more often and rated as more malicious than identical jokes from marginalized speakers.
Researchers introduce BIASEDTALES-ML, a large-scale multilingual dataset of ~350,000 LLM-generated children's stories across eight languages, designed to analyze narrative attribute distributions and cross-lingual bias patterns in language model outputs. The work reveals significant cross-lingual variability, highlighting limitations of English-centric bias evaluations.