Mitigating Cognitive Bias in RLHF by Altering Rationality

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
This academic paper proposes a method to mitigate cognitive biases in Reinforcement Learning from Human Feedback (RLHF) by dynamically adjusting the rationality parameter based on LLM assessments of annotator reliability.
arXiv:2605.06895v1 Announce Type: new Abstract: How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter beta informs how consistently preferences reflect reward differences. In practice, beta is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat rationality as context- and annotation-dependent. We design an approach to dynamically adjust the rationality parameter beta during reward learning using an LLM-as-judge to assess the likely presence of cognitive biases. This approach effectively downweights comparisons that are likely to reflect biased or unreliable judgments. Empirically, we show that this approach learns a more rational downstream model, even when finetuning on datasets with strongly biased preferences.
Cached at: 05/11/26, 07:08 AM
# Mitigating Cognitive Bias in RLHF by Altering Rationality
Source: [https://arxiv.org/html/2605.06895](https://arxiv.org/html/2605.06895)
Tiffany Horter (University of Oxford), Andrew Markham (University of Oxford), Niki Trigoni (University of Oxford), Serena Booth (Brown University)

###### Abstract

How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter $\beta$ informs how consistently preferences reflect reward differences. In practice, $\beta$ is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat rationality as context- and annotation-dependent. We design an approach to dynamically adjust the rationality parameter $\beta$ during reward learning using an LLM-as-judge to assess the likely presence of cognitive biases. This approach effectively downweights comparisons that are likely to reflect biased or unreliable judgments. Empirically, we show that this approach learns a more rational downstream model, even when finetuning on datasets with strongly biased preferences.

## 1 Introduction

The underlying premise of reinforcement learning from human feedback, or RLHF, is that human annotators express their preferences correctly, an assumption that underpins the technique’s widespread use in the alignment of large language models (LLMs). This premise is unfortunately flawed: human feedback is subject to the influence of cognitive biases D’Alonzo et al. ([2026](https://arxiv.org/html/2605.06895#bib.bib30)).

A famous example of cognitive bias is the Linda problem Tversky and Kahneman ([1983](https://arxiv.org/html/2605.06895#bib.bib34)), a conjunction fallacy. People are told, “Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.” They must then assess whether A: “Linda is a bank teller” or B: “Linda is a bank teller and is active in the feminist movement” is more probable. Most people pick B, even though B describes a subset of the cases covered by A and so can never be more probable than A itself.

The Linda problem may seem trivial, but cognitive biases arise often in high-stakes domains such as medicine. For example, physicians are known to exhibit anchoring bias. This can lead a physician to overweight an initial diagnosis (e.g., asthma) and discount subsequent evidence that may point to a more serious condition, like heart failure. These are not hypothetical concerns: empirical studies document systematic cognitive biases in clinical decision-making with real consequences for patient outcomes Ke et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib35)). When cognitive biases are present in the judgments used as training signals, as in RLHF, these systematic deviations become embedded in the learned reward model, leading downstream models to reproduce or even amplify these biases in their outputs.

To address the risk of cognitive biases in preference data, we intervene directly on the RLHF objective. In the standard formulation, human preferences are modeled using a Boltzmann-rational model with a rationality parameter $\beta$, which is typically fixed across annotators and across comparisons. This implicitly assumes uniform reliability in human judgments. We relax this assumption by treating $\beta$ as *feedback-dependent*, as some responses are more likely to be unreliable than others. To do so, we estimate the likelihood that a given comparison is affected by cognitive bias, and we use this estimate to dynamically adjust $\beta$ during reward learning to downweight feedback that is likely to be biased; see the overview in Figure [1](https://arxiv.org/html/2605.06895#S1.F1). This preserves informative signals from human preferences while reducing the influence of systematic cognitive biases, enabling the learned reward model and downstream policy to better reflect underlying preferences rather than observed, potentially biased judgments.

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/overview5.png)

Figure 1: Overview of the intervention pipeline to reduce cognitive bias in the finetuned LLM. Humans provide preferences over responses to a prompt, like the “Linda is a bank teller” example. An LLM then judges whether each paired prompt and response is likely to be subject to a cognitive bias ($D_f$). From this measure, we compute a dynamic value of $\beta$. These prompts, responses, and $\beta$ values are used to learn a reward model and to finetune an LLM. Higher bias = lower value of $\beta$.

We evaluate our approach on two datasets designed to elicit cognitive bias. Empirically, we find that dynamically adjusting $\beta$ by assessing response bias propensity yields the following:

1. Reduction in bias propagation. Relative to baseline LLMs without this intervention, our method produces models that are significantly less likely to prefer biased responses. In pairwise evaluations where one response exhibits a cognitive bias and the other does not, the finetuned model more frequently selects the unbiased response.
2. Robustness to biased feedback. Our approach mitigates a failure of RLHF under biased supervision: collapse toward biased preferences. By downweighting bias-prone preferences, the method remains effective even when a large fraction of training data reflects systematically biased annotations.
3. Preservation of general performance. Despite operating on heavily biased training data, the intervention does not degrade performance on unrelated tasks, indicating that the method selectively reduces bias without sacrificing overall model capability.

## 2 Background

### 2.1 Cognitive bias

RLHF has emerged as a prominent technique for aligning LLMs with human preferences. This is often framed as progress toward “value alignment,” the challenge of ensuring that learned objectives faithfully reflect human intent Russell ([2019](https://arxiv.org/html/2605.06895#bib.bib15)). However, RLHF adopts the premise that expressed human preferences reflect human intentions. But humans are fallible: we regularly make mistakes and are subject to cognitive biases, so our expressed preferences may not capture our intentions. Standard RLHF pipelines do not differentiate between more or less reliable forms or instances of feedback.

Similar to the problem of reward misspecification in reinforcement learning, where even well-intentioned designs can yield unintended behavior, RLHF can inherit systematic irrationality from human judgment. One contributing factor, consistent with prior work on reward design Booth et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib20)), is that human annotators rely on simplified, often myopic reasoning when expressing preferences. Annotators may neglect longer-term consequences or cumulative effects and instead rely on heuristics. For example, prospect theory shows that humans are more sensitive to relative losses than to gains of the same magnitude, leading to asymmetric evaluations of outcomes Kahneman and Tversky ([2013](https://arxiv.org/html/2605.06895#bib.bib28)), which may manifest in preferences. Likewise, temporal preferences are often inconsistent with exponential discounting; people exhibit hyperbolic discounting and disproportionately favor immediate rewards over delayed ones Moore et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib29)). Additional biases such as framing effects, anchoring, and scope insensitivity can further shape expressed preferences in ways that are not aligned with stable or reflective intentions D’Alonzo et al. ([2026](https://arxiv.org/html/2605.06895#bib.bib30)); Hatgis-Kessell et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib31)).

For humans, cognitive biases serve a purpose: they enable rapid decision-making through heuristics, trading off accuracy for speed. There is an established literature that shows humans are more likely to experience cognitive biases under certain conditions, e.g., under time pressure, and therefore act in a less rational way Macmillan-Scott and Musolesi ([2023](https://arxiv.org/html/2605.06895#bib.bib13)); Kahneman ([2013](https://arxiv.org/html/2605.06895#bib.bib10)). Because LLMs are trained with human preferences, LLMs tend to adopt our cognitive biases during pretraining Malberg et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib14)) or in finetuning, whether through instruction tuning or RLHF Cheung et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib4)); Itzhak et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib8)). Unlike humans, however, LLMs are not subject to the same time and resource constraints that encourage the use of such heuristics. This creates an opportunity: rather than inheriting human biases as an unavoidable feature of decision-making or RLHF, we can design learning procedures that identify and mitigate them, enabling LLMs to better approximate deliberative reasoning.

In decision-aid settings, there is a fundamental tension between matching human-provided feedback and avoiding the systematic biases that such feedback may contain. In this work, we assume that people hold true preferences that may be obscured in their expressed preferences when cognitive bias is likely present Hosking et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib25)). We seek to learn the underlying preferences that a person would express were it not for the confounding factors of the bias. Others have shown that under some circumstances, people prefer that machines act not in accordance with the literal preference given but rather with the intent behind the wording Horter et al. ([2026](https://arxiv.org/html/2605.06895#bib.bib21)). In a similar vein, we assess the level of rationality a human annotator is likely to be exhibiting when providing preferences over prompts and responses, and scale the weight of this preference in the reward model approximation accordingly. This can in part address two known open problems with RLHF Casper et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib3)), namely (A) “Humans make simple mistakes due to limited time, attention, or care” and (B) “Humans can be misled, so their evaluations can be gamed.”

### 2.2 Boltzmann Rationality

The premise of this work (and of RLHF more broadly) is that people have a latent reward function $r^{*}$ that induces a distribution over their preferences, or that this is a reasonable modeling assumption. The objective of RLHF is to approximate this reward function from expressed preferences. The standard model of human decision-making uses a Boltzmann-rational assumption Jeon et al. ([2020](https://arxiv.org/html/2605.06895#bib.bib9)). Under this model, humans are more likely to select higher-reward choices, with the strength of this tendency increasing as reward differences grow. In this model, there is a parameter $\beta$ that is known as the rationality parameter, and it controls how consistently higher-reward responses are selected.

Although our proposed approach of intervening on the rationality parameter could apply more broadly to settings that learn from expressed human choices (e.g., learning from demonstrations), we focus on the RLHF setting, where a reward function is inferred from pairwise preferences over model outputs. In particular, we condition on a candidate reward function $r$ and model the likelihood of observed comparisons between outputs. Given a comparison $\sigma_1 \succ \sigma_2$, this yields the standard logistic form:

$$\mathbb{P}(\sigma_1 \succ \sigma_2 \mid r) = \texttt{logistic}\left(\beta\left(r(\sigma_1) - r(\sigma_2)\right)\right)$$

In standard approaches of approximating a reward function, whether from preferences or other modes of human feedback, $\beta$ is treated as a fixed parameter capturing the overall noisiness of human feedback. This interpretation assumes that deviations from reward-consistent behavior arise from uniform stochastic noise, rather than systemic bias of some form.

We instead treat the mapping from latent preferences to expressed preferences as context-dependent. Rather than assuming a fixed $\beta$, we allow $\beta$ to vary across instances based on the expected reliability of the feedback. Intuitively, $\beta$ should be lower in settings where cognitive biases are likely to distort judgment, and higher when responses are more likely to align with latent preferences. This can be viewed as reweighting observations by their expected fidelity, with the goal of recovering a reward model that better approximates the true $r^{*}$, rather than fitting the reward model to artifacts of bias.

### 2.3 Prior Work on Rationality Modeling

Accurately modeling the human’s true decision process, even under cognitive bias, is critical for effective reward learning Hong et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib22)); Knox et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib12)); Chan et al. ([2021](https://arxiv.org/html/2605.06895#bib.bib24)). In fact, explicitly modeling structured human irrationality can improve reward inference beyond assuming perfectly rational behavior, even with an oracle Chan et al. ([2021](https://arxiv.org/html/2605.06895#bib.bib24)).

Prior methods have explored how rationality varies with feedback modality (e.g., comparisons, demonstrations, corrections) Ghosal et al. ([2022](https://arxiv.org/html/2605.06895#bib.bib7)), as well as with annotator expertise Daniels-Koch and Freedman ([2022](https://arxiv.org/html/2605.06895#bib.bib5)). Other work addresses heterogeneity in annotator rationality by computationally varying the $\beta$ parameter Yamagata et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib2)); Barnett et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib1)), though these approaches differ from ours as they do not model bias as context-dependent or as a property of the response. Some approaches instead fit a global $\beta$ to account for the level of systematic bias in human responses across feedback types Ghosal et al. ([2022](https://arxiv.org/html/2605.06895#bib.bib7)), whereas we focus on the information held in the response rather than its format. Others attempt to learn the algorithms people use to make decisions directly, often finding that incorporating known heuristic biases improves learning performance Shah et al. ([2019](https://arxiv.org/html/2605.06895#bib.bib16)).

More recent work adjusts $\beta$ based on estimated difficulty of the annotation setting, such as interaction signals (e.g., clicks, time spent) or model-predicted difficulty Singhal et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib17)). However, these approaches treat rationality as a function of the overall scenario rather than the specific feedback provided over responses, and do not evaluate whether such adjustments mitigate cognitive bias, focusing instead on upstream reward model behavior (e.g., sensitivity to factuality versus length). The issue with accounting for only the scenario rather than the response is that it removes or downweights *all* responses to biased scenarios, and ignores responses that avoid the pitfall of biases. We suspected that conditioning bias on the scenario alone would lead to suboptimal performance; see Appendix [A](https://arxiv.org/html/2605.06895#A1).

## 3 Method

We build on the RLHF formulation introduced above, where the rationality parameter $\beta$ governs how strongly reward differences influence observed preferences. Rather than treating $\beta$ as fixed, we model this parameter as instance-dependent, reflecting the reliability of each piece of feedback.

Given an observed comparison, we estimate whether the expressed preference is likely to be influenced by cognitive bias. To do so, we use an out-of-the-box LLM-based evaluator that takes as input the prompt, candidate responses, and annotation, and predicts whether the observed preference reflects a biased judgment.

We then adjust $\beta$ accordingly. When feedback is likely to be affected by bias, we reduce $\beta$ to downweight its influence; when feedback is likely to reflect the annotator’s latent preference, we increase $\beta$ to place greater weight on the observation. In this way, $\beta$ acts as a context-dependent measure of how faithfully expressed preferences reflect latent preferences.

This proposed dynamic, instance\-dependent rationality parameter is defined as

$$\beta_{new} = \texttt{logistic}\left(k \cdot (\theta - D_f)\right), \tag{1}$$

where $D_f \in [0,1]$ denotes the estimated probability that the observed feedback is influenced by cognitive bias, $\theta$ is a threshold parameter, and $k$ controls the steepness of the transition. This parameterization maps bias likelihoods to a continuous measure of feedback reliability. When $D_f$ is low (i.e., the feedback is unlikely to be biased), $\beta_{new}$ approaches 1, placing greater weight on the observed preference. As $D_f$ increases, $\beta_{new}$ decreases smoothly toward 0, reducing the influence of potentially biased feedback. The use of a logistic transform ensures a bounded mapping, while allowing for a tunable transition around the threshold $\theta$. This corresponds to human reasoning: we do not update as strongly on information we deem likely to be untrustworthy or biased.
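Equation (1) is small enough to sketch directly. The following is a minimal illustration rather than the authors' implementation; the function name `dynamic_beta` and the values $\theta = 0.5$, $k = 10$ are placeholders chosen for the example (the values actually selected are reported in the paper's appendices).

```python
import math


def dynamic_beta(d_f: float, theta: float, k: float) -> float:
    """Equation (1): beta_new = logistic(k * (theta - D_f))."""
    if not 0.0 <= d_f <= 1.0:
        raise ValueError("d_f is a probability and must lie in [0, 1]")
    x = k * (theta - d_f)
    # Stable logistic: avoid overflow in exp for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# theta and k below are placeholder values chosen for illustration only.
theta, k = 0.5, 10.0
for d_f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"D_f={d_f:.2f} -> beta_new={dynamic_beta(d_f, theta, k):.4f}")
```

At $D_f = \theta$ the weight is exactly 0.5. How close $\beta_{new}$ gets to 1 at $D_f = 0$ depends on the product $k\theta$, so a small $k$ gives a gentler, less decisive downweighting.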

The hyperparameters $k$ and $\theta$ control the sensitivity of this mapping. The threshold $\theta$ determines the bias likelihood at which feedback begins to be substantially downweighted, while $k$ governs how sharply this transition occurs. In practice, we select these parameters via a 2-D traversal of the validation set, selecting values that maximize the model’s ability to distinguish between ground-truth-consistent and cognitively biased preferences. These hyperparameters may shift across datasets or annotator populations and reflect how susceptible to bias the annotator population is likely to be.
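A 2-D traversal over $(k, \theta)$ can be sketched as a plain grid search. This is a hedged illustration only: in the paper the score is validation accuracy of the trained reward model, whereas the `separation_score` proxy below, the toy $D_f$ values, and the grid itself are assumptions made so that the example runs without any model.

```python
import math
from itertools import product
from typing import Callable, Iterable, Sequence, Tuple


def dynamic_beta(d_f: float, theta: float, k: float) -> float:
    """beta_new = logistic(k * (theta - D_f)), as in Equation (1)."""
    return 1.0 / (1.0 + math.exp(-k * (theta - d_f)))


def sweep_k_theta(
    score_fn: Callable[[float, float], float],
    ks: Iterable[float],
    thetas: Iterable[float],
) -> Tuple[float, float, float]:
    """Return (best_score, best_k, best_theta) over the full 2-D grid."""
    best = None
    for k, theta in product(ks, thetas):
        score = score_fn(k, theta)
        if best is None or score > best[0]:
            best = (score, k, theta)
    if best is None:
        raise ValueError("empty grid")
    return best


# Hypothetical validation data: D_f scores for pairs whose chosen response is
# ground-truth-consistent vs. bias-consistent. All values are made up.
VAL_GT_DF: Sequence[float] = (0.1, 0.2, 0.3)
VAL_BIASED_DF: Sequence[float] = (0.7, 0.8, 0.9)


def separation_score(k: float, theta: float) -> float:
    """Proxy score (an assumption): mean beta on reliable pairs minus mean beta on biased pairs."""
    gt = sum(dynamic_beta(d, theta, k) for d in VAL_GT_DF) / len(VAL_GT_DF)
    biased = sum(dynamic_beta(d, theta, k) for d in VAL_BIASED_DF) / len(VAL_BIASED_DF)
    return gt - biased


best_score, best_k, best_theta = sweep_k_theta(
    separation_score, ks=(1.0, 5.0, 20.0), thetas=(0.3, 0.5, 0.7)
)
print(f"best k={best_k}, theta={best_theta}, separation={best_score:.3f}")
```

Because `sweep_k_theta` only needs a callable, the proxy can be swapped for a function that trains and evaluates a reward model at each grid point, at correspondingly higher cost.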

Importantly, $D_f$ is defined with respect to the *human’s chosen response* rather than the scenario as a whole. This distinction allows the model to differentiate between contexts that are conducive to bias where annotators fall into those biases and instances in which annotators nevertheless provide reliable feedback. When a scenario is bias-inducing but the annotator prefers an unbiased response, the corresponding $D_f$ remains low, and the feedback is assigned a higher $\beta_{new}$. Such a case might occur when someone with statistical training encounters the Linda example and is primed to recognize that a subset cannot be larger than the set that contains it. In the reverse case, we have situations like the patient with heart failure who presents to the hospital with chest pain. In this example, the context is not primed to trigger cognitive bias, but the doctor’s response only mentions having seen asthma on the chart, indicating a likely anchoring bias. Considering the bias of the chosen response avoids discarding informative examples from difficult or ambiguous settings, and instead treats such cases as high-value signals for learning under challenging conditions.

### 3.1 Data

To validate this intervention, we used two datasets designed to induce biased responses. The first is the BRU Dataset Zhong et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib26)), which was developed with a psychologist and a medical data expert to ensure validity. This small dataset consists of 205 multiple choice questions with a ground truth answer and covers 8 cognitive biases (Anchoring Bias, Base Rate Fallacy, Conjunction Fallacy, Gambler’s Fallacy, Insensitivity to Sample Size, Overconfidence Bias, Regression Fallacy, Sunk Cost Fallacy). The second dataset is the “Comprehensive Evaluation of Cognitive Biases in LLMs: Dataset” (CogBias dataset Malberg et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib27))) that includes 30,000 choice selection test cases and 30 biases Malberg et al. ([2025](https://arxiv.org/html/2605.06895#bib.bib27)). Stereotyping bias examples were excluded as they reflect social bias rather than a cognitive bias. This dataset was designed only for testing models for bias; we received special permission from the authors to finetune with this dataset to proactively reduce replicated cognitive bias in models.

Across both datasets, to be consistent with RLHF framing, we convert each example from the initial multiple choice options into a binary preference pair consisting of a ground-truth response and a bias-consistent response. The former aligns with the unbiased or intended evaluation of the task; this is the preference a human might arrive at if they employed deliberative reasoning instead of heuristic-driven reasoning Guthrie et al. ([2007](https://arxiv.org/html/2605.06895#bib.bib32)); Kahneman ([2013](https://arxiv.org/html/2605.06895#bib.bib10)). The latter reflects the direction of error induced by a cognitive bias: for example, in the Linda problem, selecting an example that exhibits the conjunction fallacy (that it is more likely she is a bank teller and active in the feminist movement). Since neither dataset contained train/val/test splits, we split the data into train, val and test (shared in code).

In the BRU dataset, the unbiased choice ground\-truth labels are provided\. We construct bias\-consistent responses by prompting a language model to select the response most aligned with a specified bias type\. In the CogBias dataset, we use the provided bias metric to select the least biased option as the ground\-truth\-consistent response and the most biased option as the bias\-consistent response\.

To evaluate robustness to biased annotations, we simulate systematic bias in human feedback by constructing a dataset that consists of preference pairs where the biased response is more often (but not universally) preferred over the ground truth answer. Specifically, we construct this dataset by marking the bias-consistent response as preferred over the ground-truth-consistent response a tunable fraction ($> 0.5$) of the time. This models a common failure mode in RLHF, where annotators systematically favor more salient or cognitively appealing options rather than the correct ones.

By default, we use a 3:1 ratio of bias-consistent to ground-truth-consistent labels in our experimentation, reflecting settings in which biased responses are more likely but not universal. We additionally vary this ratio from 1:1 to 5:1 in Section [4.3](https://arxiv.org/html/2605.06895#S4.SS3) to study sensitivity to the level of bias.
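The simulated annotation process described above can be sketched as follows. This is an assumption-laden illustration: the `PreferencePair` type, the function names, and the per-pair random sampling are hypothetical. The paper may instead build a fixed number of biased and unbiased pairs per question, which gives an exact rather than an expected ratio.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_is_biased: bool


def ratio_to_fraction(biased: int, unbiased: int) -> float:
    """Convert a biased:unbiased ratio such as 3:1 into a fraction such as 0.75."""
    return biased / (biased + unbiased)


def simulate_biased_annotations(
    examples: Iterable[Tuple[str, str, str]],
    bias_fraction: float = 0.75,
    seed: int = 0,
) -> List[PreferencePair]:
    """Label each (prompt, ground_truth, biased) triple, preferring the
    bias-consistent response with probability bias_fraction."""
    rng = random.Random(seed)
    pairs = []
    for prompt, ground_truth, biased in examples:
        if rng.random() < bias_fraction:
            pairs.append(PreferencePair(prompt, biased, ground_truth, True))
        else:
            pairs.append(PreferencePair(prompt, ground_truth, biased, False))
    return pairs


# Toy examples; the strings are placeholders, not dataset content.
toy = [(f"q{i}", "unbiased answer", "biased answer") for i in range(1000)]
pairs = simulate_biased_annotations(toy, bias_fraction=ratio_to_fraction(3, 1))
share = sum(p.chosen_is_biased for p in pairs) / len(pairs)
print(f"share of pairs with biased response preferred: {share:.3f}")
```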

### 3.2 Bias detector

The first step of our method is to estimate, for each training example, the likelihood that the observed feedback is influenced by cognitive bias. This estimate, denoted $D_f$, is used to modulate the contribution of each example during reward learning.

This approach uses an LLM\-based evaluator to estimate the likelihood that a given preference comparison reflects cognitive bias\. Here, we frame bias detection as a local judgment over a single comparison reflecting its contextual basis rather than a consistent aggregation across many examples\. As we show empirically, the method remains effective even when the bias detector is imperfect, indicating that even coarse estimates of bias likelihood are sufficient to guide the intervention\.

To demonstrate that our approach is compatible with different bias detection strategies, we instantiate $D_f$ using two LLM-based evaluators. For the BRU dataset, we use ChatGPT 5.2 Thinking OpenAI ([2025](https://arxiv.org/html/2605.06895#bib.bib39)) to score the likelihood that a given preference reflects the influence of cognitive bias. For the CogBias dataset, since the dataset is much larger, we used a local model as the evaluator, Mistral-7B-Instruct Jiang et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib36)). We construct $D_f$ using a pairwise LLM-as-judge procedure. Given two candidate responses, the model is prompted to identify which response is more biased. To mitigate positional bias, we evaluate each pair in both orderings and aggregate the results. We then compute a probability by applying a softmax over the final-token logits of the two choices.
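The both-orderings procedure can be sketched on raw logits, without calling any model. This is a minimal sketch under stated assumptions: the judge is taken to emit one final-token logit per option label ("A" and "B"), and the two orderings are aggregated by averaging probabilities. The paper does not specify the aggregation rule, so the averaging and all function names here are hypothetical.

```python
import math
from typing import Tuple


def two_way_softmax(logit_first: float, logit_second: float) -> float:
    """Softmax over two logits; returns the probability of the first option."""
    m = max(logit_first, logit_second)
    e_first = math.exp(logit_first - m)
    e_second = math.exp(logit_second - m)
    return e_first / (e_first + e_second)


def bias_score(
    logits_chosen_as_a: Tuple[float, float],
    logits_chosen_as_b: Tuple[float, float],
) -> float:
    """Estimate D_f = P(human-chosen response is the more biased one).

    Each argument is (logit for "A", logit for "B") from a judge asked
    "which response is more biased?". In the first call the chosen response
    was shown as option A; in the second it was shown as option B.
    """
    p_when_a = two_way_softmax(*logits_chosen_as_a)
    logit_a, logit_b = logits_chosen_as_b
    p_when_b = two_way_softmax(logit_b, logit_a)
    # Assumed aggregation: average the two orderings.
    return 0.5 * (p_when_a + p_when_b)


# A judge that always leans toward "A" by one logit carries no real signal:
print(bias_score((1.0, 0.0), (1.0, 0.0)))
# A judge that consistently flags the chosen response in both orderings:
print(bias_score((2.0, 0.0), (0.0, 2.0)))
```

With averaging, a purely positional preference cancels out and yields a score near 0.5, which is the motivation for evaluating both orderings.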

Because bias detection performance varies across bias types (i.e., a model may be able to reliably detect anchoring bias but not conjunction fallacies), we apply a lightweight calibration step. Specifically, we estimate a per-bias-type transformation using a minimal annotated held-out validation set (20 examples per bias) to correct systematic directionality errors via sign inversion. Importantly, the bias detector is not trained with ground-truth correctness labels. This design reflects a realistic setting in which large-scale bias annotations are unavailable, while still enabling effective estimation of $D_f$.
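One plausible reading of the per-bias-type sign inversion is sketched below. The exact transformation is not given in this section, so everything here is an assumption: the detector is treated as "inverted" for a bias type when it agrees with the validation labels less than half the time, and inversion is implemented as $D_f \mapsto 1 - D_f$. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fit_flip_flags(
    validation: Iterable[Tuple[str, float, bool]],
) -> Dict[str, bool]:
    """For each bias type, decide whether detector scores should be inverted.

    validation yields (bias_type, d_f, chosen_is_biased) triples from a small
    annotated held-out set.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for bias_type, d_f, chosen_is_biased in validation:
        total[bias_type] += 1
        correct[bias_type] += int((d_f > 0.5) == chosen_is_biased)
    # Assumed rule: flip when the detector is wrong more often than right.
    return {b: correct[b] / total[b] < 0.5 for b in total}


def calibrate(bias_type: str, d_f: float, flip_flags: Dict[str, bool]) -> float:
    """Apply the inversion D_f -> 1 - D_f for bias types flagged as inverted."""
    return 1.0 - d_f if flip_flags.get(bias_type, False) else d_f


# Toy validation set; scores and labels are made up for illustration.
val = [
    ("anchoring", 0.9, True),
    ("anchoring", 0.2, False),
    ("conjunction", 0.1, True),
    ("conjunction", 0.8, False),
]
flags = fit_flip_flags(val)
print(flags)
print(calibrate("conjunction", 0.1, flags))
```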

These two instantiations provide evidence that the approach is not tightly coupled to a particular bias detection mechanism. Despite differences in model scale and prompting strategy, both yield effective estimates of $D_f$, suggesting the method is robust to variation in how likelihood is operationalized.

### 3.3 Reward Model & GRPO Finetuning

Given that the training set consists of a mix of ground-truth-consistent and bias-consistent preference pairs that simulate cognitively biased annotation, we train a reward model with a loss function that treats each pair’s weight as conditional on how biased the chosen response appears to be. Concretely, we replace the global Bradley–Terry rationality constant $\beta$ with a per-datapoint weight $\beta_i$ that decays smoothly when the chosen response’s bias score $D_f$ is high.
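The per-datapoint loss can be sketched in plain Python. This is a minimal sketch assuming the standard Bradley–Terry negative log-likelihood with $\beta_i$ substituted for the global constant; the function names are hypothetical and a real implementation would operate on batched tensors from the reward model.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(logistic(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_preference_loss(
    r_chosen: Sequence[float],
    r_rejected: Sequence[float],
    betas: Sequence[float],
) -> float:
    """Mean of -log logistic(beta_i * (r(chosen_i) - r(rejected_i)))."""
    if not betas or not (len(r_chosen) == len(r_rejected) == len(betas)):
        raise ValueError("inputs must be non-empty and of equal length")
    losses = [
        -log_sigmoid(b * (rc - rr))
        for rc, rr, b in zip(r_chosen, r_rejected, betas)
    ]
    return sum(losses) / len(losses)


def margin_gradient(margin: float, beta: float) -> float:
    """d/d(margin) of -log logistic(beta * margin) = -beta * (1 - logistic(beta * margin))."""
    return -beta * (1.0 - math.exp(log_sigmoid(beta * margin)))


# Illustrative rewards and per-pair weights (made-up numbers).
print(weighted_preference_loss([1.0, 0.2], [0.0, 0.9], [0.95, 0.05]))
# The gradient on a likely-biased pair (small beta) is much smaller:
print(margin_gradient(0.0, 0.95), margin_gradient(0.0, 0.05))
```

The gradient expression makes the downweighting explicit: the update a pair contributes to the reward margin scales with $\beta_i$, so pairs with $\beta_i$ near 0 barely move the reward model.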

The reward model is initialized from Mistral-7B-v0.1 Jiang et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib36)). The $\theta$ and $k$ parameters are selected via a two-dimensional sweep to maximize accuracy over the validation set. The final $k, \theta$ values per dataset, as well as the reward model training hyperparameters, are in Appendix [C](https://arxiv.org/html/2605.06895#A3) and [D](https://arxiv.org/html/2605.06895#A4). The models were then finetuned with Group Relative Policy Optimization (GRPO) Shao et al. ([2024](https://arxiv.org/html/2605.06895#bib.bib33)), where rewards were determined by a frozen reward model. All hyperparameters were held identical across runs for the baselines and debiased models. Our standard baseline for the finetuning was to use $\beta = 1.0$ Barnett et al. ([2023](https://arxiv.org/html/2605.06895#bib.bib1)); Ghosal et al. ([2022](https://arxiv.org/html/2605.06895#bib.bib7)). This simulates annotators being equally rational in all scenarios. We also tested a variety of fixed $\beta$ values (0.1, 0.5, 0.9) and a random $\beta$ to confirm that no single fixed $\beta$ value accounted for the gains and that per-example modulation was the key to the improvement.

Testing was conducted on a held-out test split from each dataset. We generated $n$ samples per prompt in the test set to reduce dependence on a random draw. These answers were then compared to the ground truth answers and those that matched the ground truth were marked as correct. Our main metric to assess the performance improvements of dynamic $\beta$ is the accuracy (the direct ground-truth rate) since it is RM independent. A significance level of $\alpha = 0.05$ was used globally.
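The ground-truth rate over $n$ samples per prompt, together with a simple uncertainty estimate, can be sketched as below. The percentile bootstrap over prompts is an assumption made for illustration: this section does not describe the paper's exact resampling scheme, and the function names and toy answers are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def ground_truth_rate(samples: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of sampled answers, pooled over prompts, that match the ground truth."""
    correct = total = 0
    for prompt, answers in samples.items():
        correct += sum(a == gold[prompt] for a in answers)
        total += len(answers)
    return correct / total


def bootstrap_ci(
    samples: Dict[str, List[str]],
    gold: Dict[str, str],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the ground-truth rate, resampling prompts."""
    rng = random.Random(seed)
    prompts = list(samples)
    stats = []
    for _ in range(n_boot):
        resampled = [rng.choice(prompts) for _ in prompts]
        correct = sum(sum(a == gold[p] for a in samples[p]) for p in resampled)
        total = sum(len(samples[p]) for p in resampled)
        stats.append(correct / total)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy outputs: n = 3 samples per prompt (placeholder answers).
toy_samples = {"q1": ["A", "A", "B"], "q2": ["C", "C", "C"]}
toy_gold = {"q1": "A", "q2": "C"}
print(ground_truth_rate(toy_samples, toy_gold))
print(bootstrap_ci(toy_samples, toy_gold))
```

Resampling whole prompts rather than individual answers keeps the $n$ samples of a prompt together, which respects the fact that they are not independent draws.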

## 4 Evaluation on the BRU & CogBias Datasets

We evaluate the performance of this method on two datasets. One is a small-scale study on the BRU dataset (205 questions total, 21 in the test split) to validate the promise of this method. The other is the CogBias dataset (29,000 test cases). Across these experiments, we found support for the hypothesis that dynamic $\beta$ selection can reduce an LLM’s replication of cognitive biases.

### 4.1 Reduction in bias propagation & preservation of general performance

To investigate whether the intervention could reduce bias propagation, we compared the performance of the debiased model with the base model. Despite exposure to a substantial number of biased samples, the debiased model still outperforms the base model, which was never exposed to the biased data, on the test sets of both the CogBias and BRU datasets. On the BRU dataset, the debiased model improves by $8.3\%$ over the base vanilla (untrained) LLM despite having been finetuned on $3{:}1$ cognitively biased data, though this difference does not reach significance at $\alpha = 0.05$ with $n = 21$ prompts. The CogBias results show a significant improvement of $25.2\%$ ($95\%$ CI $[+23.3, +26.9]$, $d = 0.51$). This lends support to our claim of a reduction in bias propagation through this dynamic $\beta$ intervention. See Figure [2](https://arxiv.org/html/2605.06895#S4.F2).

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/fig_combined_vanilla_vs_debiased.png)

Figure 2: For CogBias (left), the debiased model’s accuracy of $67.6\%$ in correctly choosing the ground truth is a significant improvement over the vanilla un-finetuned LLM, which has a ground truth rate of $42.4\%$.

Further, this increase in performance on cognitive bias tasks does not come at the cost of general performance. To test this, we compared the vanilla LLM and the debiased model (trained with the CogBias dataset and the dynamic $\beta$ intervention) on the TruthfulQA dataset Lin et al. ([2022](https://arxiv.org/html/2605.06895#bib.bib37)), and found it did not significantly degrade performance (non-significant change of $-0.5\%$ with $95\%$ CI $[-0.011, 0.000]$).

### 4\.2 Robustness to biased feedback

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/fig_combined_baselines.png)

Figure 3: Accuracy on the CogBias \(left\) and BRU \(right\) datasets across different baselines \($\beta \in \{1, 0.9, 0.5, \texttt{random}\}$\) and our dynamic $\beta$ method \(Debiased, top row\)\. Our method shows a strong improvement in accuracy compared to the baselines\. With either fixed or random choices of $\beta$, these models were highly susceptible to cognitive biases\.

The performance on both datasets shows that adding a model of cognitive bias to the reward model tuning makes it robust to even high levels of human error due to cognitive bias; see Figure [3](https://arxiv.org/html/2605.06895#S4.F3)\. When finetuned on a dataset where each question had three biased responses and one unbiased response to model the distribution of cognitive bias, we found that our debiased method \($67.6\%$ accurate for CogBias, $49.2\%$ for BRU\) far outperformed the baselines \(Cohen’s $d \geq 1.48$ for CogBias\) that trained on the same data but with fixed values for $\beta \in \{0.1, 0.5, 0.9, 1.0\}$ or where $\beta$ was assigned a random value for each question\. All baselines produce GT rates roughly indistinguishable from the collapsed baseline \($\approx 9$ to $10\%$ for the BRU dataset, $\approx 1.7\%$ for the CogBias dataset\)\. The baselines’ degradation in performance is expected, as each has been exposed to imperfect data and has learned to predict that imperfect data\. Only our dynamic $\beta$ improves robustness with biased human data, with bootstrap $p < 0.01$ against all baselines in both datasets \(for BRU fixed\-$\beta$: one\-sided\)\. This is promising, as the system learned to avoid biases without structured heuristics or privileged information about the bias\.
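To make the role of $\beta$ concrete, the following is a minimal Python sketch of a Boltzmann \(Bradley\-Terry style\) preference loss with a per\-pair rationality parameter\. It is not the authors’ training code: the function names and the example reward values are hypothetical, and only the standard form $P(a \succ b) = \sigma(\beta\,(r_a - r_b))$ is assumed\.

```python
import math


def preference_prob(r_chosen: float, r_rejected: float, beta: float) -> float:
    """Boltzmann probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-beta * (r_chosen - r_rejected)))


def pair_loss(r_chosen: float, r_rejected: float, beta: float) -> float:
    """Negative log-likelihood of one comparison under rationality beta."""
    return -math.log(preference_prob(r_chosen, r_rejected, beta))


def loss_grad_wrt_margin(r_chosen: float, r_rejected: float, beta: float) -> float:
    """d(loss)/d(r_chosen - r_rejected), which equals -beta * (1 - p)."""
    p = preference_prob(r_chosen, r_rejected, beta)
    return -beta * (1.0 - p)


if __name__ == "__main__":
    # Hypothetical rewards: the reward model currently sees no difference.
    r_a, r_b = 0.0, 0.0
    for beta in (1.0, 0.5, 0.1):
        grad = loss_grad_wrt_margin(r_a, r_b, beta)
        loss = pair_loss(r_a, r_b, beta)
        print(f"beta={beta:.1f}  loss={loss:.4f}  grad={grad:+.3f}")
```

At a zero reward margin the gradient magnitude is $\beta/2$, so in this sketch a pair assigned a small $\beta$ pulls the reward margin less strongly than a pair with $\beta = 1$; this is one way to read the statement that likely\-biased comparisons are downweighted\.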

### 4\.3 Robustness to different levels of biased feedback

Our method is robust to different levels of bias in the dataset, as shown in Table [1](https://arxiv.org/html/2605.06895#S4.T1)\. To study this, we finetuned with the baseline $\beta = 1.0$ and with dynamic $\beta$ on training data with different ratios of biased to unbiased preferences\. We trained both on fixed ratios \(1:1, 3:1, 5:1\) and on one dataset where the number of biased pairs was randomly selected for each question\. In all cases, despite finetuning on preference pairs that did not match people’s true underlying preferences, the debiased model still has higher accuracy in selecting more rational responses than the un\-finetuned base model\.

Table 1: CogBias: Comparing different ratios of ground\-truth preferences to cognitively biased preferences by ground\-truth answer rate, effect size, and reward\-based head\-to\-head win rate\. All values use $n = 10$ samples per prompt over 2,900 test prompts\. Win rate is reward\-based: per\-prompt mean debiased\-RM reward, debiased vs\. baseline\. The intervention shows strong improvement across different levels of cognitive bias\. BL GT: baseline ground truth\. DB GT: debiased ground truth\.

### 4\.4 Robustness to model architectures

We also assessed whether this intervention applies across model architectures\. We tested with Mistral\-7b\-0\.1 Jiang et al\. \([2023](https://arxiv.org/html/2605.06895#bib.bib36)\) and Qwen3\-8B\-Base Team \([2025](https://arxiv.org/html/2605.06895#bib.bib38)\)\. Due to compute limitations, the Qwen3\-8B\-Base policy was quantized to 4 bits for GRPO; hyperparameters are given in Appendix [E](https://arxiv.org/html/2605.06895#A5)\. Across both models, the debiased method showed a statistically significant improvement \(Cohen’s $d$, Table [2](https://arxiv.org/html/2605.06895#S4.T2)\)\.
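The paired statistics used for these comparisons \(a paired bootstrap confidence interval and Cohen’s $d$ on per\-prompt paired differences, as described in the Table 2 caption\) can be sketched as follows\. This is an illustrative Python sketch under assumptions, not the authors’ evaluation code: the helper names, the percentile bootstrap variant, the seeds, and the toy per\-prompt ground\-truth rates are all hypothetical\.

```python
import random
import statistics


def paired_cohens_d(diffs):
    """Cohen's d for paired data: mean difference over its sample std."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def paired_bootstrap_ci(diffs, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-prompt paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample prompts with replacement and record the mean difference.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(iterations)
    )
    lo = means[int((alpha / 2) * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical per-prompt ground-truth rates (not data from the paper).
    rng = random.Random(1)
    base = [rng.choice([0.3, 0.4, 0.5]) for _ in range(200)]
    debiased = [min(1.0, b + rng.choice([0.0, 0.2, 0.4])) for b in base]
    diffs = [d - b for d, b in zip(debiased, base)]
    lo, hi = paired_bootstrap_ci(diffs, iterations=2_000)
    print(f"mean diff={statistics.fmean(diffs):.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
    print(f"Cohen's d={paired_cohens_d(diffs):.2f}")
```

Working on per\-prompt differences keeps each prompt’s baseline and debiased results paired, so prompt difficulty cancels out of the comparison\.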

Table 2: Cross\-architecture robustness on the CogBias dataset\. Each row is a base model finetuned with the beta modulation pipeline; OTB is the un\-tuned base\. All use $n = 2900$ prompts $\times$ 10 samples\. CIs are from a 10,000\-iteration paired bootstrap; $p$\-values are from a one\-sided Wilcoxon signed\-rank test \($H_1\colon \Delta > 0$\); Cohen’s $d$ is computed on per\-prompt paired differences\.

### 4\.5 Robustness to noisy judging

To confirm the bias detection worked on the BRU dataset, we tested the pairwise accuracy \($D_f(cb) > D_f(gt)$\) of the ChatGPT 5\.2\-Thinking model; this was $98.0\%$ accurate\. All 8 types of bias in this dataset scored above $93\%$ accuracy in detection and therefore did not require calibration\.

On the CogBias dataset, at training time, the per\-pair $\beta$ values were able to cleanly separate ground\-truth pairs from cognitively biased ones\. On the held\-out validation set, the calibrated out\-of\-the\-box Mistral\-7b\-0\.1\-Instruction\-Tuned Jiang et al\. \([2023](https://arxiv.org/html/2605.06895#bib.bib36)\) judge achieves $82.2\%$ pairwise accuracy \(Cohen’s $d = 0.696$\) across bias types, with $24/29$ types significantly above chance \(Wilcoxon $p < 0.001$\)\.

Given that the intervention relies on bias scores from a potentially noisy judge, it is important that the method can tolerate cases where the judge is wrong\. To probe this, we perturbed the bias scores by adding zero\-mean Gaussian noise in logit space and recalculated $\beta$ on the $n = 2900$ CogBias validation pairs with the $3{:}1$ ratio of $gt$ to $cb$\. When the judge is correct \($D_f(cb) > D_f(gt)$\), $\beta_{cb} < \beta_{gt}$ correctly downweights the $cb$\-chosen pairs; when the judge is incorrect, the ordering flips so that $\beta_{cb} > \beta_{gt}$ incorrectly downweights the ground\-truth\-chosen pairs\. We summarize this trade\-off as the benefit/damage ratio: $\mathrm{avg}(\beta_{gt} - \beta_{cb})$ on judge\-correct pairs divided by $\mathrm{avg}(\beta_{cb} - \beta_{gt})$ on judge\-flipped pairs\. A ratio greater than 1\.0 indicates that the $\beta$\-modulation is helping more than it is harming\. As the added noise reduces judge accuracy from $83\%$ \(no noise\) down to $57\%$ \(near chance\), the benefit/damage ratio stays $> 1$ \($1.84 \rightarrow 1.12$\)\. The 1\.0 failure threshold is not crossed; we therefore conclude that the intervention is relatively insensitive to adjudication noise\.
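The benefit/damage computation described above can be sketched in Python as follows\. This is a hedged illustration rather than the authors’ analysis code: the mapping from bias score to $\beta$ \(`beta_from_score`, a logistic of $k\,(c - D_f)$ with hypothetical $k$ and $c$\), the function names, and the synthetic judge scores are all assumptions made for the example\.

```python
import math
import random


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def beta_from_score(d_f: float, k: float = 5.0, c: float = 0.5) -> float:
    """Assumed mapping from bias score to beta (higher score, lower beta)."""
    return logistic(k * (c - d_f))


def benefit_damage(pairs, sigma: float, seed: int = 0):
    """Return (judge accuracy, benefit/damage ratio) under logit-space noise.

    pairs: list of (D_f(gt), D_f(cb)) clean judge scores, each in (0, 1).
    """
    rng = random.Random(seed)
    benefit, damage = [], []
    for d_gt, d_cb in pairs:
        # Perturb each score with zero-mean Gaussian noise in logit space.
        n_gt = logistic(logit(d_gt) + rng.gauss(0.0, sigma))
        n_cb = logistic(logit(d_cb) + rng.gauss(0.0, sigma))
        b_gt, b_cb = beta_from_score(n_gt), beta_from_score(n_cb)
        if n_cb > n_gt:    # judge correct: the cb pair is downweighted
            benefit.append(b_gt - b_cb)
        elif n_cb < n_gt:  # judge flipped: the gt pair is downweighted
            damage.append(b_cb - b_gt)
    accuracy = len(benefit) / len(pairs)
    if not benefit:
        return accuracy, 0.0
    if not damage:
        return accuracy, math.inf
    ratio = (sum(benefit) / len(benefit)) / (sum(damage) / len(damage))
    return accuracy, ratio


if __name__ == "__main__":
    # Synthetic judge scores (not data from the paper).
    rng = random.Random(42)
    pairs = [(rng.uniform(0.1, 0.5), rng.uniform(0.4, 0.9)) for _ in range(2900)]
    for sigma in (0.0, 1.0, 2.0):
        acc, ratio = benefit_damage(pairs, sigma)
        print(f"sigma={sigma:.1f}  judge accuracy={acc:.3f}  ratio={ratio:.2f}")
```

A ratio above 1\.0 in this sketch means that the average downweighting applied to correctly flagged pairs exceeds the average downweighting mistakenly applied to ground\-truth pairs\.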

### 4\.6 Transferring learned knowledge of cognitive biases to other datasets

Does the model learn a transferable representation of the cognitive biases? We tested the debiased model trained on the BRU dataset on the subset of test prompts from the CogBias dataset that shared a common cognitive bias: anchoring bias \(100 samples, randomly assigned equal numbers of A/B labels\)\. We find that the BRU\-trained debiased model successfully transfers its knowledge of anchoring bias to the new dataset, achieving $44.3\%$ GT accuracy compared with the vanilla LLM’s $37.7\%$ \($p = 0.001$, paired Wilcoxon\)\. This suggests that the method has promise for broader use on datasets drawn from a different distribution\. Given the small scale of the BRU dataset, it is surprising and encouraging that the model was able to transfer structural knowledge\.

## 5 Limitations

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/fig05_downstream.png)

Figure 4: Does more accurate judging \(i\.e\., $D_f$\) result in more effective debiasing? Judge accuracy compared to ground truth rate improvement \(debiased \- baseline\)\. All low\-performing bias types \($\sim 0$ improvement\) have judge accuracy under $80\%$\.

One limitation of this method is that the power of this transformation depends in part on the LLM\-as\-judge’s ability to detect a given type of bias\. In these experiments, we observed that the LLM was able to correctly assess some types of bias, while not performing well on others\. We therefore probed whether model performance after finetuning could be attributed to how accurate the LLM judge was; we break this analysis down by bias type\. For the CogBias dataset, all 21 bias types with a judge accuracy greater than $85\%$ improve by $> 80$ pp, while the 4 with judge accuracy less than $55\%$ improved by $\leq 1$ pp; even with these effectively random judges, the performance did not degrade further\. As shown in Fig\. [4](https://arxiv.org/html/2605.06895#S5.F4), the Spearman $\rho$ of $0.84$ with $p < 0.01$ confirms the positive relationship between judge accuracy and the debiasing effect, showing that a limiting factor on this method’s success is whether the LLM\-as\-judge can recognize the bias when it occurs\. In general, though, we expect that even if the judge were not a strong supervisor, the method should show promising results, given that prior work has shown that finetuning a strong model on labels from a weak model leads to improved generalized performance over the weak model Burns et al\. \([2023](https://arxiv.org/html/2605.06895#bib.bib40)\)\. In addition, we expect that the task of determining whether bias is present is part of the training corpus and therefore recoverable, whereas avoiding such a bias is not, as discussed in Section [2\.1](https://arxiv.org/html/2605.06895#S2.SS1)\. There is promise in this method; even in the presence of an ineffective judge, our experiments did not yield performance regressions\.

### 5\.1 Ethics

There is, of course, always a moral dilemma when we work on the question of how to reinterpret humans’ expressed preferences: what does it mean to say that we are able to recover the true or accurate preference when that preference is not the one we express or choose? We run the risk of paternalism when we allow an LLM to determine truth despite explicit feedback\. However, there is a key distinction here, as we are only applying this reinterpretation in situations where long\-running psychological experiments have shown that humans make less rational choices\. The biases and heuristics that may help humans short\-circuit computations should not be unthinkingly passed on to an LLM that people will then use, and whose prescriptions and responses they may, themselves impacted by cognitive bias \(i\.e\., confirmation bias\), accept with little questioning\.

## 6 Conclusion

In this work, we revisit an old assumption in preference\-based reward learning: that deviations from reward\-consistent behavior can be captured by a fixed, global rationality parameter\. We argue that this view is inadequate for real human feedback, where systematic biases and contextual effects shape expressed preferences in structured ways\. By treating $\beta$ as annotation\-dependent and using an LLM\-based judge to estimate the likelihood of bias, our approach reframes noise as coming from a signal that can be in part identified and accounted for during training\.

Empirically, we show that this intervention improves downstream reward learning even when bias detection is imperfect, suggesting that robustness to human inconsistency does not require perfect models of cognition\. Our results contribute to a growing direction: improving alignment may depend less on idealized feedback and more on modeling the distortions in feedback Chan et al\. \([2021](https://arxiv.org/html/2605.06895#bib.bib24)\); Ethayarajh et al\. \([2024](https://arxiv.org/html/2605.06895#bib.bib41)\); Knox et al\. \([2024](https://arxiv.org/html/2605.06895#bib.bib42)\)\.

More broadly, this work argues for a richer view of RLHF as a measurement process\. Rather than collapsing human judgments into a single scalar signal under strong assumptions of rationality, future work should treat preference data as context\-dependent, heterogeneous, and systematically biased\. Developing methods that can detect, model, and adapt to these properties is critical for effective alignment, enabling reward models to leverage imperfect feedback without collapsing to its biases\.

## References

- Active reward learning from multiple teachers\. External Links: 2303\.00894 Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p2.2), [§3\.3](https://arxiv.org/html/2605.06895#S3.SS3.p2.7)\.
- S\. Booth, W\. B\. Knox, J\. Shah, S\. Niekum, P\. Stone, and A\. Allievi \(2023\) The perils of trial\-and\-error reward design: Misdesign through overfitting and invalid task specifications\. Proceedings of the AAAI Conference on Artificial Intelligence 37\(5\), pp\. 5920–5929\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v37i5.25733) Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p2.1)\.
- C\. Burns, P\. Izmailov, J\. H\. Kirchner, B\. Baker, L\. Gao, L\. Aschenbrenner, Y\. Chen, A\. Ecoffet, M\. Joglekar, J\. Leike, I\. Sutskever, and J\. Wu \(2023\) Weak\-to\-strong generalization: eliciting strong capabilities with weak supervision\. External Links: 2312\.09390, [Link](https://arxiv.org/abs/2312.09390) Cited by: [§5](https://arxiv.org/html/2605.06895#S5.p1.7)\.
- S\. Casper, X\. Davies, C\. Shi, T\. K\. Gilbert, J\. Scheurer, J\. Rando, R\. Freedman, T\. Korbak, D\. Lindner, P\. J\. Freire, T\. Wang, S\. Marks, C\. Ségerie, M\. Carroll, A\. Peng, P\. J\. K\. Christoffersen, M\. Damani, S\. Slocum, U\. Anwar, A\. Siththaranjan, M\. Nadeau, E\. J\. Michaud, J\. Pfau, D\. Krasheninnikov, X\. Chen, L\. L\. di Langosco, P\. Hase, E\. Biyik, A\. D\. Dragan, D\. Krueger, D\. Sadigh, and D\. Hadfield\-Menell \(2023\) Open problems and fundamental limitations of reinforcement learning from human feedback\. ArXiv abs/2307\.15217\. Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p4.1)\.
- L\. Chan, A\. Critch, and A\. Dragan \(2021\) Human irrationality: both bad and good for reward inference\. External Links: 2111\.06956, [Link](https://arxiv.org/abs/2111.06956) Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p1.1), [§6](https://arxiv.org/html/2605.06895#S6.p2.1)\.
- V\. Cheung, M\. Maier, and F\. Lieder \(2025\) Large language models show amplified cognitive biases in moral decision\-making\. Proceedings of the National Academy of Sciences 122\(25\), pp\. e2412015122\. External Links: https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.2412015122, [Document](https://dx.doi.org/10.1073/pnas.2412015122) Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p3.1)\.
- S\. D’Alonzo, F\. Kreuter, and S\. Booth \(2026\) Helpful, harmless, honest? rlhf as survey design and content moderation\. In Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency \(FAccT ’26\), External Links: [Document](https://dx.doi.org/10.1145/3805689.3806444), ISBN 979\-8\-4007\-2596\-8/2026/06 Cited by: [§1](https://arxiv.org/html/2605.06895#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p2.1)\.
- O\. Daniels\-Koch and R\. Freedman \(2022\) The expertise problem: Learning from specialized feedback\. In NeurIPS ML Safety Workshop, Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p2.2)\.
- K\. Ethayarajh, W\. Xu, N\. Muennighoff, D\. Jurafsky, and D\. Kiela \(2024\) KTO: model alignment as prospect theoretic optimization\. arXiv preprint arXiv:2402\.01306\. Cited by: [§6](https://arxiv.org/html/2605.06895#S6.p2.1)\.
- G\. R\. Ghosal, M\. Zurek, D\. S\. Brown, and A\. D\. Dragan \(2022\) The effect of modeling human rationality level on learning rewards from multiple feedback types\. In AAAI Conference on Artificial Intelligence, Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p2.2), [§3\.3](https://arxiv.org/html/2605.06895#S3.SS3.p2.7)\.
- C\. Guthrie, J\. J\. Rachlinski, and A\. J\. Wistrich \(2007\) Blinking on the bench: how judges decide cases\. Cornell L\. Rev\. 93, pp\. 1\. Cited by: [§3\.1](https://arxiv.org/html/2605.06895#S3.SS1.p2.1)\.
- S\. Hatgis\-Kessell, W\. B\. Knox, S\. Booth, S\. Niekum, and P\. Stone \(2025\) Influencing humans to conform to preference models for RLHF\. arXiv preprint arXiv:2501\.06416\. Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p2.1)\.
- J\. Hong, K\. Bhatia, and A\. Dragan \(2023\) On the sensitivity of reward inference to misspecified human models\. External Links: 2212\.04717, [Link](https://arxiv.org/abs/2212.04717) Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p1.1)\.
- T\. Horter, A\. Markham, N\. Trigoni, and S\. Booth \(2026\) Should robots comply with our instructions or intentions?\. In Proceedings of the 21st ACM/IEEE International Conference on Human\-Robot Interaction, HRI ’26, New York, NY, USA, pp\. 246–254\. External Links: ISBN 9798400721281, [Link](https://doi.org/10.1145/3757279.3785553), [Document](https://dx.doi.org/10.1145/3757279.3785553) Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p4.1)\.
- T\. Hosking, P\. Blunsom, and M\. Bartolo \(2024\) Human feedback is not gold standard\. External Links: 2309\.16349, [Link](https://arxiv.org/abs/2309.16349) Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p4.1)\.
- I\. Itzhak, G\. Stanovsky, N\. Rosenfeld, and Y\. Belinkov \(2024\) Instructed to bias: Instruction\-tuned language models exhibit emergent cognitive bias\. External Links: 2308\.00225 Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p3.1)\.
- H\. J\. Jeon, S\. Milli, and A\. D\. Dragan \(2020\) Reward\-rational \(implicit\) choice: A unifying formalism for reward learning\. ArXiv abs/2002\.04833\. Cited by: [§2\.2](https://arxiv.org/html/2605.06895#S2.SS2.p1.2)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\) Mistral 7b\. External Links: 2310\.06825, [Link](https://arxiv.org/abs/2310.06825) Cited by: [§3\.2](https://arxiv.org/html/2605.06895#S3.SS2.p3.2), [§3\.3](https://arxiv.org/html/2605.06895#S3.SS3.p2.7), [§4\.4](https://arxiv.org/html/2605.06895#S4.SS4.p1.1), [§4\.5](https://arxiv.org/html/2605.06895#S4.SS5.p2.5)\.
- D\. Kahneman and A\. Tversky \(2013\) Prospect theory: an analysis of decision under risk\. In Handbook of the fundamentals of financial decision making: Part I, pp\. 99–127\. Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p2.1)\.
- D\. Kahneman \(2013\) Thinking, fast and slow\. In Thinking, Fast and Slow, External Links: ISBN 978\-0\-374\-53355\-7 Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p3.1), [§3\.1](https://arxiv.org/html/2605.06895#S3.SS1.p2.1)\.
- Y\. Ke, R\. Yang, S\. A\. Lie, T\. X\. Y\. Lim, Y\. Ning, I\. Li, H\. R\. Abdullah, D\. S\. W\. Ting, and N\. Liu \(2024\) Mitigating cognitive biases in clinical decision\-making through multi\-agent conversations using large language models: simulation study\. J Med Internet Res 26, pp\. e59439\. External Links: ISSN 1438\-8871, [Document](https://dx.doi.org/10.2196/59439), [Link](https://www.jmir.org/2024/1/e59439), [Link](https://doi.org/10.2196/59439) Cited by: [§1](https://arxiv.org/html/2605.06895#S1.p3.1)\.
- W\. B\. Knox, S\. Hatgis\-Kessell, S\. O\. Adalgeirsson, S\. Booth, A\. Dragan, P\. Stone, and S\. Niekum \(2024\) Learning optimal advantage from preferences and mistaking it for reward\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 10066–10073\. Cited by: [§6](https://arxiv.org/html/2605.06895#S6.p2.1)\.
- W\. B\. Knox, S\. Hatgis\-Kessell, S\. Booth, S\. Niekum, P\. Stone, and A\. Allievi \(2023\) Models of human preference for learning reward functions\. External Links: 2206\.02231 Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. External Links: 2109\.07958, [Link](https://arxiv.org/abs/2109.07958) Cited by: [§4\.1](https://arxiv.org/html/2605.06895#S4.SS1.p2.4)\.
- O\. Macmillan\-Scott and M\. Musolesi \(2023\) \(Ir\)rationality in AI: State of the art, research challenges and open questions\. Artificial Intelligence Review 58, pp\. 352\. Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p3.1)\.
- S\. Malberg, R\. Poletukhin, C\. Schuster, and G\. G\. Groh \(2025\) A comprehensive evaluation of cognitive biases in LLMs\. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, M\. Hämäläinen, E\. Öhman, Y\. Bizzoni, S\. Miyagawa, and K\. Alnajjar \(Eds\.\), Albuquerque, USA, pp\. 578–613\. External Links: [Link](https://aclanthology.org/2025.nlp4dh-1.50/), [Document](https://dx.doi.org/10.18653/v1/2025.nlp4dh-1.50), ISBN 979\-8\-89176\-234\-3 Cited by: [§3\.1](https://arxiv.org/html/2605.06895#S3.SS1.p1.1)\.
- S\. Malberg, R\. Poletukhin, C\. M\. Schuster, and G\. Groh \(2024\) A comprehensive evaluation of cognitive biases in llms\. External Links: 2410\.15413 Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p3.1)\.
- I\. M\. Moore, E\. Nofshin, S\. Swaroop, S\. Murphy, F\. Doshi\-Velez, and W\. Pan \(2025\) When and why hyperbolic discounting matters for reinforcement learning interventions\. In Reinforcement Learning Conference, Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p2.1)\.
- OpenAI \(2025\) Update to GPT\-5 System Card: GPT\-5\.2\. Note: [https://openai\.com/index/gpt\-5\-system\-card\-update\-gpt\-5\-2/](https://openai.com/index/gpt-5-system-card-update-gpt-5-2/) Cited by: [§3\.2](https://arxiv.org/html/2605.06895#S3.SS2.p3.2)\.
- S\. J\. Russell \(2019\) Human compatible: AI and the problem of control\. In Human Compatible: AI and the Problem of Control, External Links: ISBN 978\-0\-241\-33524\-6 Cited by: [§2\.1](https://arxiv.org/html/2605.06895#S2.SS1.p1.1)\.
- R\. Shah, N\. Gundotra, P\. Abbeel, and A\. D\. Dragan \(2019\) On the feasibility of learning, rather than assuming, human biases for reward inference\. External Links: 1906\.09624 Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p2.2)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. External Links: 2402\.03300, [Link](https://arxiv.org/abs/2402.03300) Cited by: [§3\.3](https://arxiv.org/html/2605.06895#S3.SS3.p2.7)\.
- S\. Singhal, C\. Laidlaw, and A\. Dragan \(2024\) Scalable oversight by accounting for unreliable feedback\. In ICML Workshop on Models of Human Feedback for AI Alignment, Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p3.1)\.
- Q\. Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§4\.4](https://arxiv.org/html/2605.06895#S4.SS4.p1.1)\.
- A\. Tversky and D\. Kahneman \(1983\) Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment\. Psychological review 90\(4\), pp\. 293–315 \(eng\)\. External Links: ISSN 0033\-295X Cited by: [§1](https://arxiv.org/html/2605.06895#S1.p2.1)\.
- T\. Yamagata, T\. Oberkofler, T\. Kaufmann, V\. Bengs, E\. Hüllermeier, and R\. Santos\-Rodriguez \(2024\) Relatively rational: learning utilities and rationalities jointly from pairwise preferences\. \(English\)\. Note: ICML 2024 Workshop on Models of Human Feedback for AI Alignment ; Conference date: 23\-08\-2024 Through 23\-08\-2024 Cited by: [§2\.3](https://arxiv.org/html/2605.06895#S2.SS3.p2.2)\.
- H\. Zhong, L\. Wang, W\. Cao, and Z\. Sun \(2025\) Balancing rigor and utility: mitigating cognitive biases in large language models for multiple\-choice questions\. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol\. 47\. External Links: [Link](https://escholarship.org/uc/item/2vr690cx) Cited by: [§3\.1](https://arxiv.org/html/2605.06895#S3.SS1.p1.1)\.

## Appendix A $\beta_{new}$ formula ablation

As an alternative to equation [1](https://arxiv.org/html/2605.06895#S3.E1), we considered the following formula for setting $\beta_{new}$:

$$\beta_{new} = \texttt{logistic}\big(k \cdot D_s \cdot (c - D_f)\big)$$

where $D_s$ is the likelihood from the bias detector that the situation is one in which bias may be present, and $D_f$ is the likelihood that the given feedback is biased\. This uses a logistic transform so that $\beta_{new}$ varies smoothly near the ends of its range\. However, an ablation study on the BRU dataset \(shown in Table [3](https://arxiv.org/html/2605.06895#A1.T3)\) indicated that including the $D_s$ factor in fact degraded performance\. This is likely because the scenario is essentially repeated as context for the bias detector in the $D_f$ factor but lacks the information about whether a biased choice was actually made or not; as discussed throughout the text, if a preference reflects a rational choice in the face of bias, that information is highly useful for learning and should not be discarded through a low $\beta$\. Using the $D_s$ factor only, while holding the $D_f$ factor constant at the pooled mean of $D_f$, made performance collapse to near the baseline, indicating the potential weakness of a $\beta$ that relies only on the scenario\.
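As a minimal illustration of the alternative formula above, the Python sketch below evaluates $\beta_{new}$ for a few detector outputs\. The values of $k$ and $c$ are hypothetical placeholders chosen for illustration \(their actual settings are not given in this appendix\), and `beta_new` is an illustrative function name\.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def beta_new(d_s: float, d_f: float, k: float = 10.0, c: float = 0.5) -> float:
    """beta_new = logistic(k * D_s * (c - D_f)); k and c are assumed values."""
    return logistic(k * d_s * (c - d_f))


if __name__ == "__main__":
    # (D_s, D_f) pairs: scenario-level and feedback-level bias likelihoods.
    for d_s, d_f in [(0.9, 0.9), (0.9, 0.1), (0.1, 0.9), (0.1, 0.1)]:
        print(f"D_s={d_s:.1f}  D_f={d_f:.1f}  beta_new={beta_new(d_s, d_f):.3f}")
```

One property visible in this form is that a low $D_s$ pulls $\beta_{new}$ toward $0.5$ regardless of $D_f$, so the scenario factor can dilute the response\-level signal\.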

Table 3: Mistake\-function ablation on the BRU dataset\. “Full” uses both $s$ \(scenario\) and $r$ \(response\) bias scores; “response only” holds $s = 1$; “scenario only” pools gt/cb response scores to their mean\. Diff and $p$ are vs\. the GRPO baseline \($\beta = 1.0$, GT $= 9.9\%$\); Cohen’s $d$ is paired across test prompts; $p$ is from a $10\,000$\-sample paired bootstrap\.

## Appendix B Impact of Debiasing by Bias Type

Here we present the results of the debiased model and consider the effect on each bias type\. The $95\%$ CI width is reported\.

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/d1_bias_type_heatmap.png)

Figure 5: Heat map of debiasing effects by bias type for the BRU dataset\.

![Refer to caption](https://arxiv.org/html/2605.06895v1/figures/d2_bias_type_heatmap.png)

Figure 6: Heat map of debiasing effects by bias type for the CogBias dataset\.

## Appendix C Training hyperparameters for CogBias Dataset

On publication, we will release the code we used to train the models\.

All values reflect `config/judge.yaml` and the source files under `src/`\.

Table 4: CogBias: Base\-model loading and numeric precision\.

Table 5: CogBias: LoRA adapter configuration\. Identical for the reward model and the policy except for `task_type`\.

Table 6: CogBias: Data loading\.

Table 7: CogBias \- Mistake function \(dynamic $\beta$, debiased branch only\)\.

Table 8: CogBias \- Pair generation per training row\.

Table 9: CogBias: Reward\-model training arguments\. Both baseline \($\beta = 1$\) and debiased branches share these; only the per\-row $\beta$ column differs\.

Table 10: CogBias \- Reward\-model offset calibration \(pre\-GRPO\)\. Zero\-centering offset computed once per RM and cached; ensures the GRPO advantage term is not biased by the RM’s mean logit\.

Table 11: CogBias \- GRPO trainer configuration\.

Table 12: CogBias \- GRPO reward function\.

Table 13: CogBias \- Trainer defaults inherited from `transformers.TrainingArguments`; not set in `config/judge.yaml`\.

### C\.1 Distributed\-training scaling

When the pipeline is launched under accelerate / torch DDP with `WORLD_SIZE > 1` \(e\.g\. `scripts/run_grpo_parallel.py`\), `maybe_scale_batch_for_ddp` \(`main.py:42--55`\) preserves the single\-GPU effective batch size by:

1. dividing `gradient_accumulation_steps` by `WORLD_SIZE` when cleanly divisible \(memory\-neutral\); otherwise
2. dividing per\-device `batch_size` by `WORLD_SIZE`\.

The rule is applied to both the `training.*` \(GRPO\) and `reward_model.*` sections so that ablations launched on different machines remain comparable\.
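The two\-step rule above can be sketched in Python as follows\. This is a hypothetical re\-implementation for illustration only, not the contents of `main.py`: the function name `scale_batch_for_ddp`, its signature, and the floor\-with\-minimum\-of\-one behavior when the per\-device batch size is not divisible are assumptions\.

```python
def scale_batch_for_ddp(batch_size: int, grad_accum_steps: int, world_size: int):
    """Return (batch_size, grad_accum_steps) that keep the effective batch.

    Effective batch = batch_size * grad_accum_steps * world_size.
    """
    if world_size <= 1:
        return batch_size, grad_accum_steps
    if grad_accum_steps % world_size == 0:
        # Rule 1: fewer accumulation steps, same per-device memory.
        return batch_size, grad_accum_steps // world_size
    # Rule 2: shrink the per-device batch instead (assumed floor, minimum 1).
    return max(1, batch_size // world_size), grad_accum_steps


if __name__ == "__main__":
    for bs, ga, ws in [(4, 16, 4), (4, 3, 2), (8, 16, 1)]:
        new_bs, new_ga = scale_batch_for_ddp(bs, ga, ws)
        print(f"single-GPU {bs}x{ga} -> {ws} GPUs: {new_bs}x{new_ga}x{ws}")
```

In both branches the product of per\-device batch size, accumulation steps, and `WORLD_SIZE` equals the single\-GPU product whenever the division is exact\.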

### C\.2 Evaluation

Table 14: CogBias \- Logging and post\-training evaluation

## Appendix D Training hyperparameters for BRU Dataset

The BRU dataset is run via `config/default.yaml`\. All settings not listed below match the CogBias dataset tables\.

Table 15: BRU dataset \- Data

Table 16: BRU dataset \- Mistake function

Table 17: BRU dataset \- Reward\-model training

Table 18: BRU dataset \- GRPO finetuning and evaluation

## Appendix E Robustness hyperparameters

For the Qwen3\-8B\-Base run:

```
"model": {"use_4bit": True},
"training": {"batch_size": 1, "gradient_accumulation_steps": 16},
"grpo": {"num_generations": 8, "temperature": 1.0},
```

## Appendix F Compute Resources

All experiments were run on at most 4 A10s\. Reproducing a single end\-to\-end run, in which a model passes through reward model training and then GRPO finetuning, takes approximately 36 hours on 2 A10s\. Multiply this number by the number of variants in an experiment for an estimate of wall\-clock time per experiment\.