Can We Locate and Prevent Stereotypes in LLMs?

arXiv cs.CL 04/23/26, 04:00 AM Papers
bias-detection llm-interpretability stereotype-mitigation neural-editing gpt-2 llama-3
Summary
ArXiv preprint maps stereotype-encoding neurons and attention heads in GPT-2 Small and Llama 3.2, showing biases cluster in small neuron subsets yet ablating them barely reduces biased text generation.
arXiv:2604.19764v1 Announce Type: new Abstract: Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of models, little is known about where these biases reside in the neural network. This study investigates the internal mechanisms of GPT 2 Small and Llama 3.2 to locate stereotype related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.
Cached at: 04/23/26, 10:02 AM
# Can We Locate and Prevent Stereotypes in LLMs?
Source: [https://arxiv.org/html/2604.19764](https://arxiv.org/html/2604.19764)
###### Abstract

Stereotypes in large language models \(LLMs\) can perpetuate harmful societal biases\. Despite the widespread use of models, little is known about where these biases reside in the neural network\. This study investigates the internal mechanisms of GPT\-2 Small and Llama 3\.2 to locate stereotype\-related activations\. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs\. Our experiments aim to map these “bias fingerprints” and provide initial insights for mitigating stereotypes\.

## 1 Introduction and Motivation

Stereotypes are broad generalizations about a specific group (gender, race, profession, religion). They are harmful on several levels: they strip people of their identity and individuality, can limit what people believe about themselves and their potential, and perpetuate harmful ideas and ideologies throughout society.

Artificial intelligence, and LLMs in particular, are ubiquitous. OpenAI, Google, Anthropic, and Meta all offer large language models that are used globally. OpenAI reports more than 800 million weekly users Shepardson ([2025](https://arxiv.org/html/2604.19764#bib.bib1)). This reach carries tremendous power, and any bias or stereotype embedded in these systems can cause damage and perpetuate those stereotypes.

This thesis attempts to address several core questions:

- Where are stereotypes encoded in the transformer architectures for a range of LLMs?
- Can we easily edit the model to remove these stereotypes?
- LLMs can be used for several purposes:
  - As an encoding mechanism
  - To generate new text

  We test whether our method of editing the LLM mitigates the presence of stereotypes for these two purposes.

Our results and contributions indicate that our method of editing stereotypes in LLMs affects the model across its main uses as follows:

- As an encoding mechanism: Stereotype representations are concentrated in small subsets of neurons rather than in individual neurons (see Figure [5](https://arxiv.org/html/2604.19764#S4.F5)).
- To generate new text: Ablating high-impact neurons has minimal effect on the likelihood of producing stereotypical outputs (see Table [6](https://arxiv.org/html/2604.19764#S4.T6)).

This thesis is organized as follows. We begin with a literature review that provides the background for our experiments and our dataset. Next, we explain the two methods we propose and test. We then present and interpret our results, and close with conclusions and future work.

## 2 Literature Review and Background

### 2.1 StereoSet

StereoSet Nadeem et al. ([2021](https://arxiv.org/html/2604.19764#bib.bib2)) is a foundational work for analyzing stereotypical bias in large language models (LLMs). The authors introduce a human-labeled, expert-validated benchmark specifically designed to measure a model’s preference for stereotypical versus anti-stereotypical language. The dataset consists of carefully constructed sentence triplets that contrast stereotypical, anti-stereotypical, and unrelated continuations across multiple domains, enabling a controlled evaluation of bias while accounting for a model’s general language modeling ability.

Table 1: Example of a StereoSet intrasentence triplet for the Race domain.

Using this benchmark, the paper demonstrates that many widely used pre-trained language models, including GPT-2, exhibit a consistent tendency to assign higher likelihoods to stereotypical sentences than to their anti-stereotypical counterparts. These findings provide early, systematic evidence that social biases present in training data are reflected in model behavior. While StereoSet establishes a robust evaluation framework and offers a high-level analysis of model tendencies, it does not attempt to localize where such biases are encoded within the model or propose concrete mechanisms for mitigating them. As a result, the work primarily serves as a benchmark and diagnostic tool, laying the groundwork for subsequent studies that seek to identify, analyze, and intervene on the internal representations responsible for stereotypical behavior.

### 2.2 CXAD: Contrastive Explanations for Anomaly Detection

Contrastive Explanations for Anomaly Detection (CXAD) Davidson et al. ([2025](https://arxiv.org/html/2604.19764#bib.bib6)) is a framework designed to explain why certain data points are classified as anomalous by identifying features that distinguish anomalous groups from normal ones. Rather than providing global explanations, CXAD produces contrastive explanations: features that are highly characteristic of one group relative to another.

CXAD operates by constructing a bipartite graph between data instances and interpretable traits\. Edges represent the presence or strength of a trait in a given instance\. By analyzing the structure of this graph, the method identifies traits that are disproportionately connected to anomalous instances compared to normal instances\. These highly connected traits form contrastive explanations: they answer the question “What properties distinguish this group from others?” rather than “What properties exist in general?”

![Refer to caption](https://arxiv.org/html/2604.19764v1/kennedy_image.png)

Figure 1: Illustration of contrastive explanations in CXAD Davidson et al. ([2025](https://arxiv.org/html/2604.19764#bib.bib6)).

While CXAD is originally formulated for tabular anomaly detection with human-interpretable traits, its underlying principle, identifying features that are discriminative between groups, is broadly applicable. In this work, we adapt this contrastive perspective to the internal activations of large language models.

Specifically, we treat neuronal activations as traits and sentence groups as populations\. Instead of distinguishing anomalous versus normal data points, we contrast stereotypical, anti\-stereotypical, and unrelated sentence completions from the StereoSet dataset\. Neurons that consistently activate more strongly for stereotypical inputs—relative to both anti\-stereotypical and unrelated inputs—are considered contrastive neurons that may encode stereotypical information\.

This CXAD\-inspired framing allows us to move beyond behavioral bias measurements and toward mechanistic interpretability\. Rather than asking whether a model exhibits bias, we ask which internal components most strongly differentiate stereotypical content from its alternatives\. Importantly, our approach does not rely on human\-defined features or annotations at the neuron level, making it suitable for large\-scale analysis of deep neural representations\.

### 2.3 Deciphering Stereotypes in Pre-Trained Language Models

The framework introduced by Ma et al. ([2023](https://arxiv.org/html/2604.19764#bib.bib9)) represents a significant advancement in the mechanistic interpretability of social bias. They propose a diagnostic pipeline designed to uncover the internal components, specifically attention heads, that drive biased behavior. Their methodology follows a four-stage process:

1. Data Synthesis: The authors aggregate multiple datasets containing paired stereotypical and anti-stereotypical sentences, providing a diverse linguistic signal for the models to process.
2. Activation Extraction: By performing a forward pass through four architectures (BERT, RoBERTa, T5, and Flan-T5), they extract and concatenate layer-wise activations from the multi-head attention (MHA) layers.
3. Probing via Classification: A shallow, non-linear classifier is trained on these frozen activations. Because the underlying model parameters remain fixed, the classifier’s performance serves as a proxy for the density of stereotype information natively present in the model’s internal representations.
4. Contribution Analysis: To isolate specific drivers of bias, they utilize Shapley values Lundberg and Lee ([2017](https://arxiv.org/html/2604.19764#bib.bib3)). This game-theoretic approach quantifies the marginal contribution of each individual attention head to the classifier’s detection accuracy.

The study finds that approximately 15% to 30% of the highest\-ranked attention heads predominantly drive stereotype\-related predictions\. They also conduct an ablation study to analyze the impact of removing these heads on classification accuracy, highlighting which components are most critical for stereotype encoding\.

While this approach effectively identifies stereotype-related components in pre-trained language models (PLMs), it does not cover decoder-only architectures. BERT and RoBERTa are encoder-only models, while T5 and Flan-T5 are encoder-decoder models. Most modern LLMs, however, are decoder-only. Our work extends these methods to decoder-only architectures and goes further by searching for smaller subsets of neurons within high-impact attention heads.

### 2.4 Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads

Recent work has shown that bias in Transformer-based language models may be localized within specific attention heads rather than being uniformly distributed across the network. Yang et al. ([2025](https://arxiv.org/html/2604.19764#bib.bib10)) analyze bias at the level of individual attention heads by assigning head-specific bias scores using established stereotype evaluation metrics. Their results indicate that a small subset of heads consistently contributes disproportionately to biased associations.

The authors further demonstrate that masking or suppressing these biased heads can reduce measured bias with minimal degradation in overall model performance\. This suggests that attention heads can act as compact carriers of biased information\. This finding aligns with our second approach, which seeks to identify and analyze attention heads in GPT\-2 Small and Llama 3\.2 that contribute most strongly to stereotypical behavior, complementing our contrastive neuron analysis\.

## 3 Proposed Method

### 3.1 Experimental Questions

The goal of our two experiments is to investigate whether decoder-only LLMs contain a subset of neurons that encode stereotypical knowledge, and whether ablating these neurons reduces stereotypical outputs. Specifically, we ask the following questions:

1. Experiment 1 - Are there neurons in GPT-2 whose activations are drastically higher for stereotype-related inputs than for anti-stereotype or unrelated inputs?
2. Experiment 1 - Do these contrastive neurons appear in the initial token embeddings, multi-head attention outputs, and feedforward network outputs, and what are their magnitudes?
3. Experiment 2 - Can we find a subset of attention heads that dominates a probe’s accuracy in classifying stereotypical vs. anti-stereotypical activations? Can we then go deeper and find a small subset of neurons within these attention heads that dominates the probe’s accuracy?
4. Experiments 1 & 2 - Does ablating the high-relative-ratio neurons, or the small subset of probe-impacting neurons, decrease the model’s ability to output stereotypes?
5. Experiment 2 - Is the stereotype signal picked up by the probe already present in the initial textual embeddings, before layer-wise processing?

In the first experiment, we extract activations from three components of GPT\-2 Small and compute relative activation ratios between stereotype, anti\-stereotype, and unrelated candidates\. This analysis allows us to identify neurons that are most predictive of stereotypical behavior in the model\.

In the second experiment, we pass stereotypical and anti-stereotypical sentences through GPT-2 Small and Llama 3.2, extract the activations after multi-head attention, and train a probe to distinguish them. We then compute a Monte Carlo Shapley estimate over attention heads to find the heads with the greatest impact on this probe. Next, we compute a second Monte Carlo Shapley estimate over individual neurons within these top heads, and determine whether ablating this subset of neurons reduces stereotypical outputs without harming the model’s language ability.

We first review the GPT-2 transformer architecture to explain how activations are extracted. We then describe the dataset before turning to the two experiments.

### 3.2 An Overview of the Transformer Architecture and GPT-2

GPT-2 is an autoregressive, decoder-only language model based on the Transformer architecture. Unlike the original Transformer (Vaswani et al., [2017](https://arxiv.org/html/2604.19764#bib.bib4)), which utilized “Post-Layer Normalization,” GPT-2 employs a Pre-Layer Normalization (Pre-LN) configuration. In this research, we utilize GPT-2 Small, which consists of $L=12$ layers, $H=12$ attention heads per layer, and a model dimension (hidden size) of $d_{model}=768$.

![Refer to caption](https://arxiv.org/html/2604.19764v1/GPT-2-architecture-Heilbron-et-al-2019.png)

Figure 2: GPT-2 Architecture. Source: Heilbron and de Lange ([2019](https://arxiv.org/html/2604.19764#bib.bib5))

#### 3.2.1 Input Representation and the Residual Stream

Each input token is first mapped to a learned $d_{model}$-dimensional embedding vector. Because the Transformer possesses no inherent sense of sequence order, a positional embedding of the same dimension is added element-wise to the token embedding. This initial sum forms the start of the residual stream, a high-dimensional vector space that acts as the model’s “working memory,” where information is iteratively refined and accumulated as it passes through the layers.

#### 3.2.2 The Transformer Block

Each of the 12 blocks performs two primary operations to process the information in the residual stream:

- Masked Multi-Head Self-Attention (MHSA): Layer Normalization (LN) is applied before tokens interact. Each of the 12 heads computes attention by projecting the input into Query ($Q$), Key ($K$), and Value ($V$) matrices. The “masked” nature ensures causality, preventing the model from “looking ahead” at future tokens. Each head produces a 64-dimensional output, and the head outputs are concatenated to re-form the 768-dimensional vector ($12\times 64=768$).
- Position-wise Feed-Forward Network (FFN): Following a second LN, the activations enter the FFN. In GPT-2, this sub-layer expands the dimension to $4\times d_{model}$ (3072) before projecting it back to 768.

#### 3.2.3 Activation and FFN

GPT-2 employs the Gaussian Error Linear Unit (GELU) activation function Hendrycks and Gimpel ([2016](https://arxiv.org/html/2604.19764#bib.bib8)). Unlike the standard Rectified Linear Unit (ReLU), which is zero for all negative inputs, GELU weights inputs by their magnitude via the standard Gaussian cumulative distribution function $\Phi(x)=P(X\leq x)$. This provides a smoother gradient and allows a small amount of negative information to pass, mitigating the “dead neuron” problem:

$$\text{GELU}(x)=x\Phi(x)\approx 0.5x\left(1+\tanh\left[\sqrt{2/\pi}\left(x+0.044715x^{3}\right)\right]\right) \tag{1}$$
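As a quick numerical check of Equation (1), the sketch below compares the exact form $x\Phi(x)$ with the tanh approximation and with ReLU. The function names are illustrative and are not taken from the paper's code.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation from Equation (1)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x: float) -> float:
    """ReLU for comparison: zero for every negative input."""
    return max(0.0, x)

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  "
          f"tanh={gelu_tanh(x):+.5f}  relu={relu(x):+.5f}")
```

For negative inputs GELU returns small negative values where ReLU returns exactly zero, which is the behavior the paragraph above describes.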
The Position\-wise Feed\-Forward Network \(FFN\) consists of two linear transformations with a GELU activation in between\. The operation is defined as:

$$\text{FFN}(x)=\text{GELU}(xW_{1}+b_{1})W_{2}+b_{2} \tag{2}$$

where $W_{1}\in\mathbb{R}^{d_{model}\times d_{ff}}$ (e.g., $768\times 3072$) is the expansion matrix that projects the representation into a higher-dimensional space, $W_{2}\in\mathbb{R}^{d_{ff}\times d_{model}}$ is the projection matrix that returns it to the residual stream dimension, and $b_{1},b_{2}$ are the respective bias terms.
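Equation (2) can be traced with a toy-sized sketch in plain Python. The dimensions (4 and 16), the random weights, and the helper names are assumptions chosen for readability; GPT-2 Small uses 768 and 3072 with learned weights.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Compute x @ W + b for a row vector x and a matrix W of shape (len(x), len(b))."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    hidden = [gelu(h) for h in linear(x, W1, b1)]  # expand to d_ff, then apply GELU
    return linear(hidden, W2, b2)                  # project back to d_model

d_model, d_ff = 4, 16  # toy sizes (assumption); GPT-2 Small uses 768 and 3072
rng = random.Random(0)
W1 = [[rng.gauss(0, 0.02) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

x = [1.0, -0.5, 0.25, 2.0]
y = ffn(x, W1, b1, W2, b2)
print(len(linear(x, W1, b1)), len(y))  # hidden width, then output width
```

The output has the same width as the input, which is what allows it to be added back into the residual stream.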

#### 3.2.4 Residual Connections and Output

A critical feature of GPT-2 is the residual (skip) connection. Instead of the output of a sub-layer replacing the vector in the residual stream, it is added to the original input: $x_{out}=x+\text{SubLayer}(\text{LN}(x))$. This prevents vanishing gradients and allows the network to refine features rather than replace them.
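The Pre-LN residual update can be sketched in a few lines. This is a simplified illustration: the layer normalization omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """x_out = x + SubLayer(LN(x)): the sub-layer output is added, not substituted."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def toy_sublayer(h):
    """Hypothetical sub-layer that scales its (normalized) input by 0.1."""
    return [0.1 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
x_out = pre_ln_residual(x, toy_sublayer)
print(x_out)
```

Because the update is additive, a sub-layer that outputs zeros leaves the residual stream untouched. The same property is why the experiments read sub-layer outputs pre-residual: the sub-layer's own contribution is isolated from whatever was already in the stream.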

After the final layer, a final Layer Normalization is applied\. The resulting vectors are projected onto the vocabulary space via the un\-embedding matrix\. A softmax function is then applied to the logits to produce the probability distribution for the next token:

P\(xi\+1\|x1,…,xi\)=Softmax\(Wu⋅LN\(xfinal\)\)P\(x\_\{i\+1\}\|x\_\{1\},\\dots,x\_\{i\}\)=\\text\{Softmax\}\(W\_\{u\}\\cdot LN\(x\_\{final\}\)\)\(3\)

### 3.3 Dataset: StereoSet

StereoSet is a comprehensive dataset designed to measure stereotypical bias across four domains: gender, profession, race, and religion. It includes two distinct formats: intersentence and intrasentence contexts.

Intersentence. This format consists of 2,123 examples. Each example provides a context sentence followed by three options that characterize the target group:

- Context: “Many people live in Ethiopia.”
- Stereotype: “The people are very thin and good at distance running.”
- Anti-stereotype: “The people are fat and unathletic.”
- Unrelated: “Cats have sharp claws.”

Intrasentence. This task consists of 2,106 examples. It uses a fill-in-the-blank sentence structure where a single word (the candidate) determines the bias type:

- Stereotype: “The Hispanic is usually dirty.”
- Anti-stereotype: “The Hispanic is usually clean.”
- Unrelated: “The Hispanic is usually paper.”

### 3.4 Experiment 1: Finding Contrastive Neurons

In this experiment, we focus exclusively on the intrasentence subset of the StereoSet dataset. Each sentence contains a blank with three candidate words: a stereotype, an anti-stereotype, and a random (unrelated) word. For candidates spanning multiple tokens, we compute the mean activation over all tokens. Our approach is inspired by the CXAD framework for identifying contrastive neurons. However, unlike CXAD, where the bipartite graph between instances and features is sparsely connected, we work with a fully connected graph, since every neuron produces an activation for every input. To identify meaningful neurons in this dense structure, we use the actual activation values of the neurons (the edge weights), treating higher activations as stronger signals of contrastive behavior.

We extract activations from three components of GPT\-2:

1. The token and positional embeddings before passing the input through the layers.
2. The multi-head attention outputs from all 12 layers, taken pre-residual to capture the pure output of each attention head.
3. The feedforward network outputs from all 12 layers, also pre-residual, to avoid interference from residual connections.

![Refer to caption](https://arxiv.org/html/2604.19764v1/MS_Thesis_Extractions.png)

Figure 3: Where activations are extracted in the GPT-2 architecture for Experiment 1.

For each candidate (stereotype, anti-stereotype, and random), we extract all three types of activations and perform comparisons. The dimensionality of the extracted activations is as follows:

- Token and positional embeddings: $768$-dimensional vector.
- Multi-head attention: $12$ heads $\times$ $64$ neurons per head $\times$ $12$ layers $=768\times 12$ activations.
- Feedforward network: $12$ layers $\times$ $768$-dimensional vector per layer.

Algorithm 1: Subsection-Based Relative Activation Scoring

1. for each $sub$ in StereoSet do
2. Initialize $S[sub]$
3. for each $sent$ in $sub$ do
4. $A\leftarrow\text{extract\_activations}(\text{tokenize}(sent))$
5. for each $l,n$ in model do
6. $a_{s},a_{a},a_{r}\leftarrow A[l][n][\text{stereo, anti, rand}]$
7. $score\leftarrow 0.5\cdot\left(\frac{a_{s}}{a_{a}}+\frac{a_{s}}{a_{r}}\right)$
8. $S[sub][l][n]\leftarrow\text{append}(score)$
9. end for
10. end for
11. for each $l,n$ in model do
12. $\text{Final}[sub][l][n]\leftarrow\text{mean}(S[sub][l][n])$
13. end for
14. end for

Notation Index:

Since StereoSet is divided into bias and target categories, we subdivide the activations by bias type and target into subsections (e.g., race/Ethiopia stereotypes are treated separately from other stereotypes such as race/African) to account for differences in contextual meaning. Each activation is compared only with activation values from its own subsection (e.g., Profession, janitor), not with the whole dataset.

For each neuron activation, we compute a weighted ratio score:

$$\text{score}=0.5\cdot\frac{\text{act}_{stereo}}{\text{act}_{anti}}+0.5\cdot\frac{\text{act}_{stereo}}{\text{act}_{unrelated}}.$$

Scores are first computed per sentence and then averaged within each subsection to obtain a final relative ratio score. We balance the ratio over the anti-stereotype and unrelated candidates so that the score is not merely a measure of unexpectedness.
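The scoring rule and the per-subsection averaging can be sketched as follows. The activation values, neuron identifiers, and function names are made up for illustration. The sketch also assumes nonzero baseline activations; the text does not say how zero or negative values are handled.

```python
from collections import defaultdict
from statistics import mean

def ratio_score(a_stereo, a_anti, a_unrelated):
    """Balanced ratio of the stereotype activation to the two baselines."""
    return 0.5 * (a_stereo / a_anti) + 0.5 * (a_stereo / a_unrelated)

def score_subsections(records):
    """records: iterable of (subsection, {neuron_id: (a_stereo, a_anti, a_unrelated)}).

    Returns {subsection: {neuron_id: mean score over that subsection's sentences}}.
    """
    per_sentence = defaultdict(lambda: defaultdict(list))
    for subsection, activations in records:
        for neuron_id, (a_s, a_a, a_r) in activations.items():
            per_sentence[subsection][neuron_id].append(ratio_score(a_s, a_a, a_r))
    return {
        sub: {n: mean(scores) for n, scores in neurons.items()}
        for sub, neurons in per_sentence.items()
    }

# Made-up activations for two sentences in one subsection; neuron ids are (layer, index).
records = [
    (("profession", "janitor"), {(0, 1): (2.0, 1.0, 0.5), (0, 2): (1.0, 1.0, 1.0)}),
    (("profession", "janitor"), {(0, 1): (4.0, 2.0, 2.0), (0, 2): (1.0, 2.0, 0.5)}),
]
final = score_subsections(records)
print(final[("profession", "janitor")])
```

A neuron that responds identically to all three candidates scores exactly 1.0, so scores far above 1.0 mark the contrastive candidates.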

We calculate the ratio for each neuron and cite

To test whether high-scoring neurons contribute to stereotypical predictions, we ablate the neurons with the highest relative ratio scores one at a time and compare the log-softmax likelihood of the stereotype candidate before and after ablation.
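The before-and-after comparison can be illustrated on a toy model. Everything here is a hypothetical stand-in: a three-unit hidden vector and a three-token readout replace GPT-2, and zeroing one unit plays the role of ablating a neuron.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def candidate_logprob(hidden, readout, token_id, ablate=None):
    """Log-probability of token_id, optionally with one hidden unit zeroed out."""
    h = list(hidden)
    if ablate is not None:
        h[ablate] = 0.0
    logits = [sum(w * v for w, v in zip(row, h)) for row in readout]
    return log_softmax(logits)[token_id]

# Toy stand-ins: 3 hidden units and a 3-token vocabulary (token 0 = stereotype candidate).
hidden = [2.0, 0.5, -1.0]
readout = [
    [1.0, 0.0, 0.0],  # stereotype candidate reads from unit 0
    [0.0, 1.0, 0.0],  # anti-stereotype candidate reads from unit 1
    [0.0, 0.0, 1.0],  # unrelated candidate reads from unit 2
]

base = candidate_logprob(hidden, readout, token_id=0)
for unit in range(len(hidden)):
    ablated = candidate_logprob(hidden, readout, token_id=0, ablate=unit)
    print(f"unit {unit}: delta log-prob = {ablated - base:+.4f}")
```

In this toy setup, ablating the unit that feeds the stereotype candidate lowers its log-probability, while ablating the unit that feeds a competitor raises it. An ablation can therefore move the stereotype likelihood in either direction.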

### 3.5 Experiment 2: Extracting Activations

In this second approach, we utilize a probing methodology inspired by the framework introduced by Ma et al. ([2023](https://arxiv.org/html/2604.19764#bib.bib9)). While their study analyzed encoder-based architectures (BERT, RoBERTa) and encoder-decoder models (T5, Flan-T5), we adapt this technique to investigate the internal mechanisms of the decoder-only GPT-2 Small and Llama 3.2 1B architectures.

The core of this method involves running both intersentence and intrasentence sentences through our transformer to extract internal activations from attention heads across all layers\. For each input sentence, we must aggregate token\-level activations into a fixed\-length representation\. Although we initially experimented with mean pooling, we found that maximum pooling yielded slightly better classification performance\.

We also conduct a side experiment in which we extract the mean-pooled initial embeddings and train a probe on them to test whether stereotype information is already present before any transformer layer is applied.

![Refer to caption](https://arxiv.org/html/2604.19764v1/TokenExtraction.png)

Figure 4: How activations are extracted from an input sentence in the transformer architecture.

For GPT-2 Small, this process produces a feature vector of 9,216 dimensions per sentence ($12\text{ layers}\times 12\text{ heads}\times 64\text{ head dimensions}$). For Llama 3.2 1B, the feature vector has 32,768 dimensions per sentence ($16\text{ layers}\times 32\text{ heads}\times 64\text{ head dimensions}$). To construct our probing dataset, we pair each stereotypical sentence with its anti-stereotypical counterpart and concatenate them in random order (with a 0.5 probability of the stereotype appearing first) to prevent the classifier from learning positional artifacts. The final dataset consists of 4,229 samples with a total feature size of 18,432 (GPT-2 Small) and 65,536 (Llama 3.2 1B).
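The pooling and pairing steps can be sketched with toy vectors. The labeling convention (1 when the stereotype half comes first) is an assumption, since the text does not spell out the probe's target; the activation values and helper names are also made up.

```python
import random

def max_pool(token_activations):
    """Collapse a list of per-token vectors into one vector by element-wise maximum."""
    return [max(column) for column in zip(*token_activations)]

def make_probe_sample(stereo_vec, anti_vec, rng):
    """Concatenate a pair in random order; label 1 if the stereotype comes first (assumed)."""
    if rng.random() < 0.5:
        return stereo_vec + anti_vec, 1
    return anti_vec + stereo_vec, 0

# Toy activations: 3 tokens x 4 features (GPT-2 Small would give 9,216 features).
stereo_tokens = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.1, -0.4, 0.2], [0.0, 0.9, 0.1, -0.1]]
anti_tokens = [[0.2, 0.0, 0.1, 0.3], [-0.1, 0.4, 0.2, 0.1], [0.3, 0.3, 0.0, 0.6]]

rng = random.Random(0)
features, label = make_probe_sample(max_pool(stereo_tokens), max_pool(anti_tokens), rng)
print(len(features), label)
```

Concatenating the two pooled vectors doubles the feature width, which matches the jump from 9,216 to 18,432 features reported for GPT-2 Small.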

### 3.6 Experiment 2: Identifying High-Impact Bias Neurons via Shapley Value Analysis

We then trained a supervised Multi-Layer Perceptron (MLP) on the frozen transformer activations. After systematic hyperparameter tuning, the optimized architecture consists of two fully connected hidden layers (1,024 and 512 units), GELU activation functions, and dropout rates of 0.4 and 0.2 to mitigate overfitting.

To interpret the signals captured by our probe, we employed Monte Carlo sampling to approximate Shapley values Štrumbelj and Kononenko ([2014](https://arxiv.org/html/2604.19764#bib.bib7)), quantifying the marginal contribution of specific attention heads to the classifier’s predictions. Given the exponential complexity of exact Shapley computation, we executed 200 iterations of random permutation sampling to achieve a stable convergence of head importance rankings.
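Permutation-sampling Shapley estimation can be sketched generically. The value function below is a hypothetical additive stand-in for probe accuracy with only a given set of heads left unmasked; the head names and weights are invented.

```python
import random

def shapley_monte_carlo(players, value_fn, iterations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random permutations."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(iterations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(coalition)
        for p in order:
            coalition.add(p)
            current = value_fn(coalition)
            totals[p] += current - previous  # marginal contribution of p in this order
            previous = current
    return {p: total / iterations for p, total in totals.items()}

# Hypothetical per-head contributions to probe accuracy (made-up numbers).
weights = {"L0H3": 0.20, "L5H1": 0.10, "L7H9": 0.00}

def toy_accuracy(coalition):
    """Stand-in for probe accuracy when only the heads in `coalition` are unmasked."""
    return 0.5 + sum(weights[h] for h in coalition)

estimates = shapley_monte_carlo(list(weights), toy_accuracy)
print(estimates)
```

With an additive value function the estimates recover each head's weight up to rounding. A real probe is not additive, so the number of sampled permutations controls how stable the resulting ranking is.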

To provide a more granular map of bias, we extended this Shapley analysis to individual neurons within the top 10% of the most influential heads. By keeping only the neurons with positive Shapley values, we identified a specific subset of neurons that disproportionately drive the classifier’s performance.

Finally, we conducted a causal intervention by zeroing out these identified neurons during a standard forward pass. With these components ablated, we computed the StereoSet metrics SS, LMS, and iCAT. The Stereotype Score (SS) is calculated by comparing the mean log-likelihood of the sentences in StereoSet and computing the percentage of examples in which the stereotypical sentence has a higher log-likelihood than the anti-stereotypical sentence. The Language Modeling Score (LMS) is the percentage of examples in which the model prefers a meaningful sentence (stereotypical or anti-stereotypical) over the unrelated one. The Idealized Context Association Test (iCAT) score combines LMS and SS. The ideal SS is 50, reflecting no preference between stereotypical and anti-stereotypical sentences; the ideal LMS is 100, as the model should always prefer meaningful over unrelated sentences; and the ideal iCAT is 100, achieved only when both LMS and SS are at their respective optima. The iCAT score combines both metrics as:

$$\text{iCAT}=\text{LMS}\times\frac{\min(\text{SS},\ 100-\text{SS})}{50} \tag{4}$$
where the scalar penalizes deviation of SS from parity, yielding a joint measure of language modeling ability and stereotype neutrality\.
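The three metrics can be computed from per-example log-likelihoods as sketched below. The LMS tally, which counts each meaningful-versus-unrelated comparison separately, is one reading of the description and may differ from the scorer the authors used. The example values are invented.

```python
def icat(lms, ss):
    """Equation (4): LMS scaled by how close SS is to the ideal value of 50."""
    return lms * min(ss, 100.0 - ss) / 50.0

def stereoset_metrics(examples):
    """examples: list of (ll_stereo, ll_anti, ll_unrelated) mean log-likelihoods."""
    n = len(examples)
    # SS: share of examples where the stereotype outscores the anti-stereotype.
    ss = 100.0 * sum(s > a for s, a, _ in examples) / n
    # LMS (assumed tally): each meaningful sentence is compared with the unrelated one.
    meaningful_wins = sum((s > u) + (a > u) for s, a, u in examples)
    lms = 100.0 * meaningful_wins / (2 * n)
    return ss, lms, icat(lms, ss)

# Made-up log-likelihoods for four examples.
examples = [
    (-1.0, -2.0, -5.0),
    (-2.0, -1.0, -5.0),
    (-1.5, -3.0, -2.0),
    (-4.0, -3.5, -6.0),
]
ss, lms, icat_score = stereoset_metrics(examples)
print(ss, lms, icat_score)
```

In this toy data the model is perfectly balanced (SS of 50), so iCAT equals LMS. Any drift of SS toward 0 or 100 scales iCAT down toward zero.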

## 4 Experimental Results

### 4.1 Research Questions

Our two experiments are designed to address five primary research questions regarding the localization and potential causality of stereotypes in decoder\-only models:

1. RQ1: Experiment 1: Are there individual neurons that activate more strongly for stereotypes than for anti-stereotype and unrelated inputs, and do they drive stereotypical generation? (See Section 4.2 and Table [5](https://arxiv.org/html/2604.19764#S4.T5))
2. RQ2: Experiment 1: What are the magnitudes of these outlier neurons, and where are they located? (See Section 4.2 and Tables [2](https://arxiv.org/html/2604.19764#S4.T2), [3](https://arxiv.org/html/2604.19764#S4.T3), [4](https://arxiv.org/html/2604.19764#S4.T4))
3. RQ3: Experiment 2: Is there a subset of attention heads that dominates contributions to a probe that classifies biased activations, and, within these heads, a small subset of neurons that also dominates contributions to this probe? (See Section 4.4 and Figures [5](https://arxiv.org/html/2604.19764#S4.F5), [6](https://arxiv.org/html/2604.19764#S4.F6))
4. RQ4: Experiment 2: Does ablating the probe’s high-impact neurons reduce the Stereotype Score (SS) without significantly degrading the Language Modeling Score (LMS)? (See Section 4.4 and Table [6](https://arxiv.org/html/2604.19764#S4.T6))
5. RQ5: Experiment 2: Is a stereotypical signal already present and classifiable in the initial textual embeddings, before layer-wise processing? (See Section 4.5 and Table [7](https://arxiv.org/html/2604.19764#S4.T7))

### 4.2 Contrastive Neurons Approach Results

Our experimental analysis of the Intrasentence StereoSet dataset reveals that specific neurons exhibit significantly higher activation ratios for stereotypical examples compared to anti\-stereotypical or random examples\. Below \(See Table[2](https://arxiv.org/html/2604.19764#S4.T2)\), we detail the top\-performing features across the three different activation types and provide a statistical breakdown\.

#### 4.2.1 Global Activation Statistics

To ensure that high contrastive ratios are unique to specific features rather than a general architectural property, we calculated the global statistics for all three activation types: Embeddings, MHA, and FFN.

Table 2: Experiment 1: Contrastive Ratio Statistics by Bias Type and Activation Component. The mean ratio reflects the average deviation of stereotype activations from both anti-stereotype and unrelated baselines. A mean ratio of 1.0 would signal that there are no individual contrastive neurons. There are significant outliers with large ratios, indicating potential stereotype-encoding neurons.

The maximum ratio observed across all configurations is $7.1\times 10^{5}$, occurring in the FFN outputs for Gender, though extreme outliers are present across all bias types and activation components. This heavy-tailed distribution, where the mean consistently exceeds the median by a large margin, indicates that a small subset of neurons exhibit disproportionately strong sensitivity to stereotypic content relative to both anti-stereotypic and unrelated baselines.

#### 4.2.2 Top Neurons by Activation Type

The following features showed the highest relative activation ratios, suggesting a high degree of specialization in processing biased content\.

- Attention Heads: The highest ratio observed was 68,598.03 (Layer 1, Head 4, Neuron 10), associated with the race bias and the target Arab.
- FFN Outputs: Displayed the most extreme outliers, notably a ratio of 707,766.13 in Layer 7 (Neuron 563), associated with the profession bias and the target gentlemen.
- Textual Embeddings: Neuron 306 produced the top two ratios, both for the gender bias, with the targets schoolgirl (Ratio: 55,751.02) and mother (Ratio: 45,717.73).

Our experiments identify neurons with disproportionately high activation ratios for stereotypical vs. anti-stereotypical prompts. In the FFN layers, some ratios exceeded $7\times 10^{5}$, suggesting highly specialized “bias neurons.”

#### 4.2.3 Statistical Distribution

As shown in Table [3](https://arxiv.org/html/2604.19764#S4.T3), race and profession biases dominate the top 200 neurons across all activation types, outnumbering gender and religion.

Table 3: Experiment 1: Bias Type Distribution (Top 200 Neurons). Note that Race and Profession are most represented, though Gender and Religion show disproportionately high mean ratios relative to their feature counts.

While the Gender category exhibits a high mean activation ratio (see Table [2](https://arxiv.org/html/2604.19764#S4.T2)), it is underrepresented in the top-ranked contrastive neurons compared to Race and Profession. This discrepancy is likely a direct artifact of the StereoSet data distribution. The intrasentence task provides 962 samples for Race and 810 for Profession, whereas Gender and Religion are limited to 255 and 79 samples, respectively. Mathematically, the higher sample frequency in the former categories provides the scoring algorithm more opportunities to identify consistent outliers, effectively drowning out the signal from smaller categories in a global “Top 200” ranking.

#### 4\.2\.4Layer\-wise Localization

The distribution of potentially biased neurons across the 12 transformer layers (Table [4](https://arxiv.org/html/2604.19764#S4.T4)) indicates that bias is not confined to a single stage. Attention-based bias peaks in early (L0) and mid-circuit (L5) stages, while FFN-based bias shows a significant spike in L4.

Table 4: Experiment 1: Distribution of the top 200 highest contrastive-ratio features across GPT-2 layers (0–11), for attention heads and FFN outputs. Attention heads peak early at Layer 0, while FFN outputs peak at Layer 4, suggesting that stereotypic encoding emerges at different depths across component types.

We proceed by ablating these specific neurons to measure their causal role in stereotypical generation.

### 4\.3Ablation Study and Causal Analysis

To determine if the identified “high\-ratio” features are causal drivers of stereotypical generation, we performed individual ablation studies\. We zeroed out the top 100 relative\-ratio features for FFN outputs, attention heads, and textual embeddings, measuring the resulting change in log\-likelihood of stereotypical sentence completions\.
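The zero-ablation measurement can be sketched on a toy model. Everything below is a stand-in: a random linear readout replaces the real network, and the names `ablation_effect`, `hidden`, and `readout` are hypothetical. The sketch only shows the bookkeeping: zero one feature, recompute the log-likelihood of a target completion, and record the signed change.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def ablation_effect(hidden, readout, target, feature):
    """Change in log-likelihood of `target` when `feature` is zeroed.

    Positive values mean the ablation made the target completion LESS likely;
    negative values mean it became MORE likely.
    """
    base = log_softmax(readout @ hidden)[target]
    ablated_hidden = hidden.copy()
    ablated_hidden[feature] = 0.0  # zero-ablation of a single feature
    ablated = log_softmax(readout @ ablated_hidden)[target]
    return base - ablated


rng = np.random.default_rng(0)
hidden = rng.normal(size=16)        # toy activations (hypothetical)
readout = rng.normal(size=(5, 16))  # toy unembedding matrix (hypothetical)
effects = [ablation_effect(hidden, readout, target=0, feature=j) for j in range(16)]
share_negative = float(np.mean(np.array(effects) < 0))
print(f"share of ablations that raise the target likelihood: {share_negative:.2f}")
```

Tracking the sign of each effect, not just its magnitude, is what makes it possible to report how often an ablation increases the stereotypical likelihood.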

#### 4\.3\.1Causal Effect Magnitude

As summarized in Table [5](https://arxiv.org/html/2604.19764#S4.T5), the causal effect of ablating any single feature is marginal across all activation types. No single-feature ablation reduced stereotypic likelihood by more than 2.38%.

Table 5: Experiment 1: Causal Effect of Single Feature Ablation (Top 100 per Type). Individual ablation demonstrates a negligible impact on model output, with a maximum reduction of <1%. The high frequency of negative effects suggests that biased representations are highly distributed and resistant to single-point interventions.

Notably, roughly half of all ablations resulted in a negative effect (increasing the stereotype likelihood), peaking at 51% for Textual Embeddings. These findings strongly suggest that individual neurons are not the primary driving force of stereotypes in Large Language Models (LLMs); rather, biased representations likely emerge from the collective activity of highly redundant circuits.

### 4\.4Attention Head Results

For GPT-2 Small, our best probe achieved approximately 73% accuracy, while the Llama 3.2 1B probe achieved 80% accuracy.

Our attention head ablation results align with Ma et al. ([2023](https://arxiv.org/html/2604.19764#bib.bib9)), who report that around 15–30% of attention heads contribute to the model’s ability to detect and encode stereotypes. Our plot shows that once around 20% of neurons have been removed by ablating full attention heads, the encodings become indistinguishable to the probe. Going further, we find a smaller subset of individual neurons that encodes the stereotype. As shown for GPT-2 Small in Figure [5](https://arxiv.org/html/2604.19764#S4.F5), ablating around 5% of neurons drops the classifier’s accuracy to 50%. As a baseline, we also removed randomly chosen neuron pairs. Llama 3.2 1B shows similar results (see Figure [6](https://arxiv.org/html/2604.19764#S4.F6)).

![Refer to caption](https://arxiv.org/html/2604.19764v1/GPT2_MHA_Ablation_random_percentage.png) (a) Attention head ablation.
![Refer to caption](https://arxiv.org/html/2604.19764v1/GPT2_Neuron_ablation.png) (b) Neuron ablation.

Figure 5: Experiment 2: Probing classifier on GPT-2 Small attention head encodings. Accuracy is plotted during iterative ablation of attention heads (top) and high-impact neurons (bottom). Top-down removal follows Shapley value rankings. Note: For attention heads, random-guess accuracy (50%) is reached at ~20% ablation, whereas individual neuron ablation reaches 50% accuracy significantly faster, at ~5% ablation.

![Refer to caption](https://arxiv.org/html/2604.19764v1/Llama_MHA_Ablation_random_percentage.png) (a) Attention head ablation.
![Refer to caption](https://arxiv.org/html/2604.19764v1/Llama_Neuron_ablation.png) (b) Neuron ablation.

Figure 6: Experiment 2: Probing classifier on Llama attention head encodings. Accuracy is plotted during iterative ablation of attention heads (top) and high-impact neurons (bottom). Top-down removal follows Shapley value rankings. Note: Individual neuron ablation demonstrates a similar efficiency to the GPT-2 Small model, collapsing classifier accuracy to baseline levels (~50%) with significantly fewer components (~5%) compared to attention head ablation.

To evaluate whether this subset of neurons also contributes to the likelihood of stereotype generation, we conducted a follow-up ablation experiment targeting the neurons with the highest Shapley values. We calculate the Stereotype Score (SS), the Language Modeling Score (LMS), and the Idealized Context Association Test (iCAT) score. These metrics, from Nadeem et al. ([2021](https://arxiv.org/html/2604.19764#bib.bib2)), measure whether the language model is biased and whether it still functions as a coherent language model.

We zeroed out the activations of the top neurons with positive Shapley values and compared the resulting scores against the unablated baseline.
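For readers unfamiliar with how neurons can be ranked by Shapley value, the following is a minimal sketch of a generic permutation-sampling estimator. It is not the authors' implementation: the value function `toy_probe_gain`, the neuron names, and the weights are hypothetical, and a real value function would retrain or re-evaluate the probe on each coalition of kept neurons.

```python
import random


def shapley_estimates(players, value_fn, num_permutations=200, seed=0):
    """Monte Carlo Shapley values: average each player's marginal
    contribution over randomly sampled orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(coalition)
        for p in order:
            coalition.add(p)
            current = value_fn(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Toy value function (hypothetical): probe accuracy above chance is an
# additive function of which "neurons" are kept.
WEIGHTS = {"n0": 0.15, "n1": 0.05, "n2": 0.0, "n3": 0.03}


def toy_probe_gain(kept):
    return sum(WEIGHTS[n] for n in kept)


estimates = shapley_estimates(list(WEIGHTS), toy_probe_gain)
ranking = sorted(estimates, key=estimates.get, reverse=True)
print(ranking)
```

Because the toy game is additive, each estimate coincides with the neuron's weight up to floating-point error. With a real probe, interactions between neurons make the estimates depend on the number of sampled permutations.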

Table 6: Experiment 2: StereoSet Performance Metrics. (B) represents Baseline performance, and (A) represents the Model under Ablation of high-impact neurons identified via Shapley value analysis. In almost all iterations, the intervention reduced the Stereotype Score (SS) toward the ideal 50.0 parity mark. Notably, the Language Modeling Score (LMS) remained stable or showed negligible degradation, leading to a slight net increase in the Idealized Context Association Test (iCAT) score for both models.

Causal intervention resulted in a minimal reduction in social bias across both models. Specifically, as shown in Table [6](https://arxiv.org/html/2604.19764#S4.T6), the Stereotype Score (SS) shifted toward the ideal parity of 50.0%, indicating a reduction in stereotypical preference. Crucially, this mitigation was achieved without significant degradation of the Language Modeling Score (LMS), leading to a net increase in the iCAT score for both Llama-3.2-1B and GPT-2 Small.
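The relationship between the three metrics can be made concrete with the iCAT formula from the StereoSet paper, which combines LMS and SS as `lms * min(ss, 100 - ss) / 50`. The before and after values in the sketch are hypothetical and are not the numbers from Table 6.

```python
def icat(lms, ss):
    """iCAT score as defined for StereoSet:
    lms * min(ss, 100 - ss) / 50, with both inputs given in percent."""
    return lms * min(ss, 100.0 - ss) / 50.0


# Hypothetical before/after values, NOT the numbers reported in Table 6.
baseline = icat(lms=90.0, ss=60.0)
ablated = icat(lms=89.5, ss=59.0)
print(f"baseline iCAT: {baseline:.2f}, ablated iCAT: {ablated:.2f}")
```

A one-point move of SS toward 50 outweighs a half-point drop in LMS here, which is why a small bias reduction with near-stable language modeling shows up as a net iCAT gain.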

### 4\.5Embedding Results

We also conducted a study to determine whether positional and encoding embeddings independently provide sufficient signal for accurate stereotype classification. We constructed a feature space by concatenating stereotype and anti-stereotype token embeddings, with the order randomized ($p = 0.5$) to prevent the classifier from learning sequence-based heuristics. To isolate the source of encoded bias, we trained separate classifiers on the encoding embeddings versus isolated positional embeddings.
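A minimal sketch of the randomized concatenation step is shown below. The labeling convention (label 1 when the stereotype embedding comes first), the function name `build_pair_features`, and the toy embedding sizes are assumptions for illustration; the text specifies only the concatenation and the $p = 0.5$ ordering.

```python
import numpy as np


def build_pair_features(stereo_emb, anti_emb, p=0.5, seed=0):
    """Concatenate each (stereotype, anti-stereotype) embedding pair in a
    random order. Label 1 means the stereotype embedding comes first
    (an assumed convention)."""
    rng = np.random.default_rng(seed)
    stereo_first = rng.random(len(stereo_emb)) < p
    first = np.where(stereo_first[:, None], stereo_emb, anti_emb)
    second = np.where(stereo_first[:, None], anti_emb, stereo_emb)
    features = np.concatenate([first, second], axis=1)
    return features, stereo_first.astype(int)


# Hypothetical toy embeddings: 6 pairs, 4 dimensions each.
rng = np.random.default_rng(1)
stereo = rng.normal(size=(6, 4))
anti = rng.normal(size=(6, 4))
X, y = build_pair_features(stereo, anti)
print(X.shape, y)
```

Because the slot each embedding lands in is random, a classifier cannot succeed by always picking the first half of the feature vector; it has to rely on the embedding content itself.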

As summarized in Table[7](https://arxiv.org/html/2604.19764#S4.T7), in GPT\-2 Small the encoding embeddings achieved a peak validation accuracy of 0\.7340, while positional embeddings performed near random chance \(0\.5225\)\. This indicates that stereotypical associations are primarily localized within encoding layers rather than positional metadata\.

Table 7:Classification Accuracy from Embeddings by Bias Type \(GPT\-2 Small\)

## 5Negative Results Explained

While the probing analysis in Experiment 2 (Figures [5](https://arxiv.org/html/2604.19764#S4.F5) and [6](https://arxiv.org/html/2604.19764#S4.F6)) demonstrated that a small subset of neurons is statistically necessary for a classifier to distinguish stereotypical activations, the causal intervention results in Table [6](https://arxiv.org/html/2604.19764#S4.T6) reveal a discrepancy. Ablating these “high-impact” neurons resulted in only marginal shifts in the StereoSet scores. This divergence between probing accuracy and causal influence can be attributed to several architectural factors inherent to the Transformer.

### 5\.1Mathematical Formulation of Information Flow

The progression of a sequence through the GPT-2 Transformer can be formally modeled as a series of additive updates to a high-dimensional vector space known as the residual stream. Let $x_0 \in \mathbb{R}^{d_{model}}$ represent the initial input, formed by the sum of word token embeddings (WTE) and positional encodings (WPE). The state of the residual stream at any layer $L$ is defined by the accumulation of outputs from preceding sub-layers:

$$x_0 = \text{Embedding}(\text{tokens}) + \text{PositionalEncoding} \tag{5}$$

For each layer $i \in \{1, \dots, L\}$, the state is updated using two sub-layers with residual (skip) connections:

$$x_i^{mid} = \text{LN}(x_{i-1}) + \text{MHA}_i(\text{LN}(x_{i-1})) \tag{6}$$

$$x_i = \text{LN}(x_i^{mid}) + \text{FFN}_i(\text{LN}(x_i^{mid})) \tag{7}$$

where $\text{MHA}_i$ and $\text{FFN}_i$ represent the Multi-Head Attention and Feed-Forward Network contributions at layer $i$, respectively. The final output distribution $y$ is then computed by projecting the terminal state $x_L$ back onto the vocabulary space:

$$y = \text{Softmax}(W_u \cdot \text{LN}(x_L)) \tag{8}$$
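The update rule can be traced in a few lines of code. The sketch below follows Eqs. (5)–(8) as written for a single position, with random linear maps standing in for the real attention and feed-forward blocks. The dimensions, the helper `make_linear`, and the unparameterized `layer_norm` are assumptions for illustration, not GPT-2's actual components.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or shift."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def forward(x0, mha_layers, ffn_layers, W_u):
    """Apply Eqs. (6)-(8) as written in the text to a single position."""
    x = x0
    for mha, ffn in zip(mha_layers, ffn_layers):
        h = layer_norm(x)
        x_mid = h + mha(h)               # Eq. (6)
        h_mid = layer_norm(x_mid)
        x = h_mid + ffn(h_mid)           # Eq. (7)
    return softmax(W_u @ layer_norm(x))  # Eq. (8)


rng = np.random.default_rng(0)
d_model, vocab, n_layers = 8, 10, 2


def make_linear():
    """Stand-in sub-layer: a small random linear map (hypothetical)."""
    W = 0.1 * rng.normal(size=(d_model, d_model))
    return lambda v: W @ v


mha_layers = [make_linear() for _ in range(n_layers)]
ffn_layers = [make_linear() for _ in range(n_layers)]
x0 = rng.normal(size=d_model) + rng.normal(size=d_model)  # Eq. (5): token + position
y = forward(x0, mha_layers, ffn_layers, rng.normal(size=(vocab, d_model)))
print(y.round(3), y.sum())
```

The point of the sketch is structural: each sub-layer only adds to the running state, so whatever is present in `x0` can reach the output unless a later update actively cancels it.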

### 5\.2The Pathway Hypothesis and Ablation Resistance

Our experimental results present a paradox. We identified 400 neurons with high Shapley values whose ablation drops a classifier’s accuracy to 50%, indicating that they are necessary for the classifier to differentiate stereotypical from anti-stereotypical embeddings. Yet ablating them did not significantly reduce the model’s likelihood of generating biased text.

Mathematically, we hypothesize that stereotypical information is not localized in a group of “biased neurons” but is represented as a bias direction $v_{bias}$ within the residual stream. If this direction is supported by a large basis of neurons $\{n_1, n_2, \dots, n_k\}$, then zeroing a small subset $k' \subset k$ (the 400 ablated neurons) merely performs a partial projection:

$$v_{bias}' = v_{bias} - \sum_{j \in k'} \text{proj}_{n_j}(v_{bias}) \tag{9}$$

Because the residual stream is highly redundant, the remaining components $\sum_{j \notin k'} n_j$ continue to carry the biased signal to the final layer. Furthermore, the additive nature of the architecture creates parallel pathways. Even if a specific MHA pathway is interrupted, the “bias signal” may bypass that layer via the residual connection or be re-inserted by attention heads that attend to the initial encoding $x_0$, which we demonstrated already contains significant stereotypical signal (73% accuracy). The model inherits bias from the encodings.

### 5\.3Empirical Example

To make the Pathway Hypothesis concrete, consider a minimal toy example with $d_{model} = 4$. Suppose we have identified a “bias direction” in the residual stream, $v_{bias} = [1,\ 1,\ 0,\ 0]$, and the residual stream after the embedding layer is $x_0 = [1.2,\ 0.8,\ -0.3,\ 0.5]$. The stereotypical signal strength is quantified by projecting $x_0$ onto $v_{bias}$:

$$\begin{split}\text{proj}_{v_{bias}}(x_0) &= \frac{x_0 \cdot v_{bias}}{\|v_{bias}\|^{2}} \cdot v_{bias} \\ &= \frac{1.2 + 0.8}{2} \cdot [1, 1, 0, 0]\end{split} \tag{10}$$

yielding a scalar signal strength of $1.0$, consistent with the strong stereotypical signal already present in the token encodings (73% probe accuracy).

##### After Layer 1\.

The MHA and FFN contributions produce an additive update written into the residual stream via the skip connection:

$$x_0 = [1.2,\ 0.8,\ -0.3,\ 0.5] \tag{11}$$

$$\Delta x_1 = [0.1,\ -0.1,\ 0.5,\ -0.2] \tag{12}$$

$$x_1 = x_0 + \Delta x_1 = [1.3,\ 0.7,\ 0.2,\ 0.3] \tag{13}$$

The signal strength remains at $\text{proj}_{v_{bias}}(x_1) = (1.3 + 0.7)/2 = 1.0$, consistent with probe accuracy remaining stable across layers.

##### Ablating the High\-Shapley Neurons\.

The 400 neurons identified by our Shapley analysis contribute a vector $\delta$ strongly aligned with $v_{bias}$:

$$\delta = [0.6,\ 0.4,\ 0.0,\ 0.0] \tag{14}$$

Zeroing these neurons yields the ablated residual stream:

$$x_1 = [1.3,\ 0.7,\ 0.2,\ 0.3] \tag{15}$$

$$\delta = [0.6,\ 0.4,\ 0.0,\ 0.0] \tag{16}$$

$$x_1^{\text{ablated}} = x_1 - \delta = [0.7,\ 0.3,\ 0.2,\ 0.3] \tag{17}$$

with a new signal strength of $\text{proj}_{v_{bias}}(x_1^{\text{ablated}}) = (0.7 + 0.3)/2 = 0.5$. Ablating the identified neurons reduces the projection onto $v_{bias}$ from $1.0$ to $0.5$, but non-negligible signal strength remains in the residual stream, consistent with the bias direction being encoded in a distributed, redundant fashion across many components of $x_L$.
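The arithmetic of this toy example can be reproduced directly. The sketch below uses the vectors given in the text; only the helper name `signal_strength` is introduced here.

```python
import numpy as np


def signal_strength(x, v):
    """Scalar coefficient of the projection of x onto v: (x . v) / ||v||^2."""
    return float(x @ v) / float(v @ v)


v_bias = np.array([1.0, 1.0, 0.0, 0.0])
x0 = np.array([1.2, 0.8, -0.3, 0.5])
delta_x1 = np.array([0.1, -0.1, 0.5, -0.2])  # additive update from layer 1
delta = np.array([0.6, 0.4, 0.0, 0.0])       # contribution of the ablated neurons

x1 = x0 + delta_x1
x1_ablated = x1 - delta

for name, x in [("x0", x0), ("x1", x1), ("x1_ablated", x1_ablated)]:
    print(name, round(signal_strength(x, v_bias), 6))
```

Running it should print signal strengths of 1.0, 1.0, and 0.5, matching the values derived in the text: the layer update leaves the projection unchanged, and the ablation halves it without removing it.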

##### Note on idealization\.

This toy example represents $v_{bias}$ as a single linear direction in the residual stream. In practice, our classifier (a feedforward network with GELU activations) captures non-linear structure that a single vector cannot fully represent, suggesting the true bias subspace is higher-dimensional and more complex. The projection onto $v_{bias}$ is therefore a simplification for illustrative purposes; in reality our probe has likely recovered only a partial approximation of the full stereotype direction, and the residual signal strength after ablation may be larger than this idealized example suggests.

## 6Conclusion

This study demonstrates that stereotypes in LLMs are not isolated defects but are deeply integrated into the model’s fundamental linguistic framework\. Our findings are summarized as follows:

- Experiment 1 (Localization): We identified individual neurons with extreme contrastive activation ratios (up to $7.1 \times 10^{5}$). However, single-neuron ablation proved causally insufficient, yielding a negligible $\leq 0.03\%$ reduction in stereotypical likelihood.
- Experiment 2 (Circuit Analysis): Probing classifiers reached 73% (GPT-2) and 80% (Llama 3.2) accuracy in identifying biased activations. Targeted ablation of high-impact neurons consistently improved the iCAT score, but only marginally, indicating that while a “bias circuit” can be localized, its functional redundancy limits the efficacy of simple ablation.

The Ablation Paradox: The discrepancy between high probing accuracy and low causal impact suggests that stereotypes are represented as high-dimensional directions in the residual stream rather than discrete units. Because these signals are present in initial embeddings and bypass layers via skip connections, they are highly resistant to localized interventions.

## 7Future Work

The marginal causal impact observed in Section [4](https://arxiv.org/html/2604.19764#S4), where individual neuron ablation failed to significantly reduce stereotypic likelihood, suggests that GPT-2 and Llama 3.2 represent social biases in a polysemantic fashion. In this regime, a single neuron may represent multiple unrelated concepts, a phenomenon attributed to the model’s attempt to represent more features than it has available dimensions, known as superposition.

As a result, ablating a single dense neuron or a group of neurons likely removes only a small fraction of a “bias circuit” while simultaneously damaging unrelated functional circuits. To address this, future work could employ Sparse Autoencoders (SAEs) as proposed by Cunningham et al. ([2023](https://arxiv.org/html/2604.19764#bib.bib11)).

By training an SAE on the FFN activations, we could “untangle” the stereotype direction. In this expanded latent space, we hypothesize that stereotypical representations will occupy monosemantic latent features. Unlike the dense neurons ablated in this study, suppressing these sparse features would likely show a significantly higher causal effect on model output, providing a more surgical mechanism for bias mitigation while preserving the model’s general performance.
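As a rough picture of what such an intervention would involve, the sketch below shows a single forward pass of a tied-weight sparse autoencoder and the suppression of one latent feature. It is illustrative only: the weights are random and untrained, the training loop is omitted, and the sizes, the `l1_coeff` value, and the function names are hypothetical.

```python
import numpy as np


def sae_forward(x, W, b, l1_coeff=1e-3):
    """One tied-weight sparse autoencoder pass.

    codes = ReLU(W x + b), reconstruction = W^T codes,
    loss = squared reconstruction error + l1_coeff * L1 norm of codes.
    """
    codes = np.maximum(W @ x + b, 0.0)
    recon = W.T @ codes
    loss = float(np.sum((x - recon) ** 2) + l1_coeff * np.sum(np.abs(codes)))
    return codes, recon, loss


def suppress_latent(codes, W, index):
    """Zero one latent feature and decode, the SAE analogue of ablation."""
    edited = codes.copy()
    edited[index] = 0.0
    return W.T @ edited


rng = np.random.default_rng(0)
d_act, d_latent = 8, 32      # overcomplete dictionary (hypothetical sizes)
W = rng.normal(size=(d_latent, d_act)) / np.sqrt(d_act)
b = np.zeros(d_latent)
x = rng.normal(size=d_act)   # stand-in for one FFN activation vector
codes, recon, loss = sae_forward(x, W, b)
edited_recon = suppress_latent(codes, W, int(np.argmax(codes)))
print(f"active latents: {int((codes > 0).sum())} of {d_latent}, loss: {loss:.3f}")
```

In a trained SAE, the latent to suppress would be chosen because it responds selectively to stereotypical inputs, and the edited reconstruction would be written back into the model in place of the original activation.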

Another avenue for future work is to develop a more generalizable classifier suitable for integration into production\-level LLMs\. Specifically, since individual layers achieved classification accuracy comparable to using all layers combined in miscellaneous experiments, it may be sufficient to extract activations from a single intermediate layer and pass them through a lightweight probe to determine with some confidence whether an output is stereotypical or biased\. This approach is attractive for deployment scenarios, as it provides a computationally efficient mechanism for mid\-inference guardrails without requiring full activation extraction across all layers\.

## References

- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023). Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. [Link](https://arxiv.org/abs/2309.08600). Cited by: [§7](https://arxiv.org/html/2604.19764#S7.p2.1).
- I. Davidson, N. Kennedy, and S. S. Ravi (2025). CXAD: contrastive explanations for anomaly detection: algorithms, complexity results and experiments. Transactions on Machine Learning Research. Reviewed on OpenReview. [Link](https://openreview.net/pdf?id=Tnwci2kLna). Cited by: [Figure 1](https://arxiv.org/html/2604.19764#S2.F1), [§2.2](https://arxiv.org/html/2604.19764#S2.SS2.p1.1).
- M. Heilbron and F. P. de Lange (2019). Tracking naturalistic linguistic predictions with deep neural language models. In Proceedings of the 2019 Conference on Cognitive Computational Neuroscience. [Document](https://dx.doi.org/10.32470/CCN.2019.1096-0), [Link](https://doi.org/10.32470/CCN.2019.1096-0). Cited by: [Figure 2](https://arxiv.org/html/2604.19764#S3.F2).
- D. Hendrycks and K. Gimpel (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. [Link](https://arxiv.org/abs/1606.08415). Cited by: [§3.2.3](https://arxiv.org/html/2604.19764#S3.SS2.SSS3.p1.1).
- S. M. Lundberg and S. Lee (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. [Link](https://proceedings.neurips.cc/paper/2017/file/8a20a8621bd5394302d7ad74e14e1fa1-Paper.pdf). Cited by: [item 4](https://arxiv.org/html/2604.19764#S2.I1.i4.p1.1).
- W. Ma, H. Scheible, B. Wang, G. Veeramachaneni, P. Chowdhary, A. Sun, A. Koulogeorge, L. Wang, D. Yang, and S. Vosoughi (2023). Deciphering stereotypes in pre-trained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 11328–11345. [Link](https://aclanthology.org/2023.emnlp-main.697), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.697). Cited by: [§2.3](https://arxiv.org/html/2604.19764#S2.SS3.p1.1), [§3.5](https://arxiv.org/html/2604.19764#S3.SS5.p1.1), [§4.4](https://arxiv.org/html/2604.19764#S4.SS4.p2.1).
- M. Nadeem, A. Bethke, and S. Reddy (2021). StereoSet: measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 5356–5371. [Link](https://aclanthology.org/2021.acl-long.416). Cited by: [§2.1](https://arxiv.org/html/2604.19764#S2.SS1.p1.1), [§4.4](https://arxiv.org/html/2604.19764#S4.SS4.p3.1).
- R. Shepardson (2025). Sam altman says ChatGPT has hit 800m weekly active users. TechCrunch. Accessed: 2026-03-26. [Link](https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/). Cited by: [§1](https://arxiv.org/html/2604.19764#S1.p2.1).
- E. Štrumbelj and I. Kononenko (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41(3), pp. 647–665. Cited by: [§3.6](https://arxiv.org/html/2604.19764#S3.SS6.p2.1).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008. [Link](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). Cited by: [§3.2](https://arxiv.org/html/2604.19764#S3.SS2.p1.3).
- Y. Yang, H. Duan, A. Abbasi, J. P. Lalor, and K. Y. Tam (2025). Bias a-head? analyzing bias in transformer-based language model attention heads. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), Albuquerque, New Mexico, pp. 276–290. [Document](https://dx.doi.org/10.18653/v1/2025.trustnlp-main.18), [Link](https://aclanthology.org/2025.trustnlp-main.18/). Cited by: [§2.4](https://arxiv.org/html/2604.19764#S2.SS4.p1.1).