Tag
This paper introduces 'second-order bias', the bias LLMs exhibit when judging biased content, and proposes a reasoning task grounded in epistemic entitlement to evaluate it. Experiments show that the task evades safety guardrails and reveals systematic demographic biases in LLM judges.
BiasGRPO proposes a framework using Group Relative Policy Optimization (GRPO) to stabilize social bias mitigation in LLMs by normalizing rewards across sampled completions, outperforming DPO and PPO on multiple benchmarks. The authors also release a compute-efficient bias reward model designed for integration into multi-objective RLHF pipelines.
This paper presents a systematic evaluation of how differential privacy impacts social bias in large language models, finding that while it reduces bias in sentence scoring, the effect does not generalize across all tasks.
OpenAI published a study examining how subtle identity cues such as user names can influence ChatGPT's responses, introducing the concept of 'first-person fairness' to evaluate whether name-based biases lead to harmful stereotypes in direct user interactions. The authors note limitations, including a focus on English-language interactions, binary gender, and four racial/ethnic categories.
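
The BiasGRPO summary above mentions normalizing rewards across sampled completions. As a rough illustration of that general idea, and not the authors' implementation, the group-relative step in GRPO is commonly written as subtracting the group's mean reward and dividing by its standard deviation. In the sketch below, the function name, the `eps` constant, the choice of population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four completions of a single prompt.
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, a completion is rewarded only for scoring better than its siblings on the same prompt, which is one plausible reading of why such normalization would stabilize training.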