Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

arXiv cs.CL Papers

Summary

This paper replicates the finding of 'emotion vectors' in open-weight LLMs Apertus-8B and Gemma-4-E4B, showing that valence geometry is recoverable across models with differences in layer emergence. The study also finds that arousal encoding is sensitive to the story corpus used for extraction.

arXiv:2606.26987v1 Announce Type: new Abstract: Recent work identified emotion vectors in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1--valence correlations of $r = 0.76$ and $r = 0.83$, approaching the $r = 0.81$ reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2--arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.


# Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs
Source: [https://arxiv.org/html/2606.26987](https://arxiv.org/html/2606.26987)
###### Abstract

Recent work identified “emotion vectors” in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B and Gemma-4-E4B, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1–valence correlations of $r = 0.76$ and $r = 0.83$, approaching the $r = 0.81$ reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2–arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.

emotion vectors, large language models, replication

## 1 Introduction

As users interact with Large Language Models (LLMs), they can encounter responses that appear emotionally reactive, such as expressing frustration when struggling with tasks or enthusiasm when helping users. Recent work by Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)) moved beyond surface-level observations, identifying internal “emotion vectors” in Claude Sonnet 4.5. They identified 171 linear directions in activation space corresponding to emotion concepts, with correlational and potentially causal relations to model behavior. Steering these vectors altered the model’s preferences and increased rates of misaligned behaviors such as reward hacking and blackmail. The overall geometry of the emotion space mirrors human psychology, with principal components aligning to valence and arousal axes consistent with Russell’s circumplex model (Russell, [1980](https://arxiv.org/html/2606.26987#bib.bib33)).

These findings raise key questions about generality: (1) Are emotion vectors specific to Claude’s training, or a general property of language models’ internal representations? (2) How does emotion geometry evolve across layers: does it emerge suddenly or build up gradually? (3) How does the choice of story corpus affect extraction? These questions matter for interpretability and safety: if emotion representations are universal and robustly extractable, monitoring them could provide early warnings of misaligned internal states across different models.

We address these questions by replicating and extending emotion vector analysis in two open-weight models: Apertus-8B (Hernández-Cano et al., [2025](https://arxiv.org/html/2606.26987#bib.bib20)), with fully open weights, training data, and code, and Gemma-4-E4B (DeepMind, [2026](https://arxiv.org/html/2606.26987#bib.bib35)), a recently released open-source model, both chosen for their relatively small size. For each model, we extract emotion contrast vectors across multiple layers using two story corpora (one generated by Apertus-8B and one by Gemma-4-E4B) to separate model-intrinsic geometry from corpus-dependent extraction artifacts. Additional related work is provided in [Appendix A](https://arxiv.org/html/2606.26987#A1). We release our code publicly at [https://github.com/sinievanderben/emotion_experiment](https://github.com/sinievanderben/emotion_experiment).

- Replication of key findings. We recover valence geometry in both Apertus-8B and Gemma-4-E4B, with the highest PC1–valence correlations of $r = 0.76$ and $r = 0.83$ respectively, demonstrating that emotion vectors generalize beyond Claude to open-weight models across different architectures.
- Divergent emergence. Models differ substantially in when valence structure emerges: Gemma-4-E4B peaks early (layer 16) then fades, while Apertus-8B builds progressively across depth, stabilizing around layer 20. Cross-layer CKA analysis shows a phase transition in Apertus-8B that is absent in Gemma.
- Corpus-dependent arousal. Arousal encoding is sensitive to story corpus: both models show substantially stronger PC2–arousal alignment when using Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated stories ($r \leq 0.21$).

## 2 Methods

### 2.1 Dataset

We generated two synthetic emotion-story datasets, following Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)), with 9 stories for each of 171 emotions. For each emotion, we prompted Apertus-8B and Gemma-4-E4B to write short stories in which characters experience the target emotion without naming it, using a prompt similar to that of Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)). This produced 1,539 stories per corpus across emotions (Table [1](https://arxiv.org/html/2606.26987#A2.T1)), plus 40 neutral stories from the same model. The 40 neutral texts form a single fixed set shared by all 171 emotions, since we compute the confound subspace once per layer and project every emotion vector through the same operation. The emotion concepts span the valence-arousal space.

We treat the story corpus as an independent variable. By running each model on both Apertus-generated and Gemma-generated stories, we intend to disentangle the emotion findings from corpus-dependent extraction artifacts. No previous work has tested the influence of the story corpus.

### 2.2 Model

We analyzed two open-weight language models: Apertus-8B-Instruct, a 32-layer transformer, and Gemma-4-E4B, a 42-layer transformer. Both models are instruction-tuned and comparable in scale to enable cross-model comparison of emotion representations. More details on both models can be found in Appendix [B.2](https://arxiv.org/html/2606.26987#A2.SS2).

### 2.3 Contrast Vector Extraction

Following Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)), we construct one activation vector $\mathbf{v}_{e}^{(l)}$ per emotion $e$ and layer $l$. Since raw activation vectors also capture general linguistic structure, we apply a two-step procedure to isolate the emotion-specific component.

First, for each emotion, we perform a forward pass on the corresponding nine stories and cache the residual stream activations at each layer, giving a tensor of shape $(\#\text{tokens}, d_{\text{model}})$ per layer. Averaging these activations across tokens and stories yields one raw vector $\mathbf{u}_{e} \in \mathbb{R}^{d_{\text{model}}}$ per emotion and layer, which still mixes emotion-specific and general linguistic features.

Second, we project out non-emotion-specific components. To characterize the emotion-agnostic subspace, we collect mean residual activations from the 40 neutral stories, producing a $(40, d_{\text{model}})$ matrix per layer. PCA on this matrix yields a basis for the subspace; we retain the top $K$ components that together explain 50% of the variance. To isolate the emotion-specific component, we subtract from each emotion vector its projection onto the neutral subspace to get the contrast vector $\mathbf{v}_{e}$:

$$\mathbf{v}_{e} = \mathbf{u}_{e} - \sum_{k=1}^{K} (\mathbf{u}_{e} \cdot \mathbf{p}_{k})\, \mathbf{p}_{k}$$

For Apertus-8B, we extracted vectors from layers *1–31*, and for Gemma-4-E4B, from layers *1–40*. Stacking these vectors across all $|E|$ emotions yields the matrix $V^{(l)} \in \mathbb{R}^{|E| \times d_{\text{model}}}$ at layer $l$, on which we perform the analyses.
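
The two-step extraction above can be sketched in NumPy as follows. This is an illustrative sketch on synthetic data, not the released implementation: the function names are hypothetical, and two details are assumptions the text does not settle, namely that tokens are pooled across stories before averaging and that the neutral activations are mean-centered before PCA.

```python
import numpy as np


def raw_emotion_vector(story_activations):
    """Average residual activations over all tokens of all stories.

    story_activations: list of (n_tokens_i, d_model) arrays, one per story.
    Assumption: tokens are pooled across stories before averaging.
    """
    return np.concatenate(story_activations, axis=0).mean(axis=0)


def neutral_basis(neutral_means, var_threshold=0.5):
    """Top-K principal directions of the neutral-story activations.

    neutral_means: (n_neutral, d_model) matrix, one mean activation per story.
    K is the smallest count whose cumulative explained variance reaches
    var_threshold. Assumption: activations are mean-centered before PCA.
    """
    centered = neutral_means - neutral_means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(cumulative, var_threshold)) + 1, len(s))
    return vt[:k]


def contrast_vector(u_e, basis):
    """v_e = u_e - sum_k (u_e . p_k) p_k, with p_k the rows of basis."""
    return u_e - basis.T @ (basis @ u_e)


# Synthetic stand-ins for real activations (d_model = 64 is arbitrary).
rng = np.random.default_rng(0)
stories = [rng.normal(size=(n_tokens, 64)) for n_tokens in (12, 20, 15)]
u_e = raw_emotion_vector(stories)
basis = neutral_basis(rng.normal(size=(40, 64)))
v_e = contrast_vector(u_e, basis)
```

Because the rows of `basis` are orthonormal, the resulting `v_e` has no remaining component along any retained neutral direction, which is the property the projection formula is meant to guarantee.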

### 2.4 Analysis

##### PCA and Valence-Arousal Correlation

We applied PCA to the emotion contrast matrix $V^{(l)}$ at each layer and correlated the first two principal components (PC1, PC2) with human valence and arousal ratings from the NRC Valence–Arousal–Dominance Lexicon (Mohammad, [2018](https://arxiv.org/html/2606.26987#bib.bib32)), following Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)). We report Pearson $r$ and corresponding $p$-values.
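
As a rough illustration of this step, the sketch below projects a synthetic contrast matrix onto its first two principal components and correlates PC1 with synthetic valence ratings. The helper names and the toy data are assumptions for illustration only; real ratings would come from the NRC lexicon. Since the sign of a principal component is arbitrary, the sketch looks at the magnitude of $r$.

```python
import numpy as np


def pc_scores(V, n_components=2):
    """Project the rows of V (emotions x d_model) onto the top PCs."""
    centered = V - V.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # u * s gives the coordinates of each emotion along each component.
    return u[:, :n_components] * s[:n_components]


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy data: 171 "emotions" whose vectors vary mostly along one hidden
# direction scaled by a synthetic valence rating (not real NRC ratings).
rng = np.random.default_rng(0)
n_emotions, d_model = 171, 32
valence = rng.uniform(-1.0, 1.0, size=n_emotions)
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
V = 5.0 * valence[:, None] * direction + 0.1 * rng.normal(size=(n_emotions, d_model))

scores = pc_scores(V)
r_valence = pearson_r(scores[:, 0], valence)
print(f"|r| between PC1 and valence: {abs(r_valence):.2f}")
```

To obtain $p$-values as well, `scipy.stats.pearsonr` returns both the coefficient and a two-sided $p$-value; it is left out here to keep the sketch dependent on NumPy only.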

##### Cross-layer Representational Similarity with CKA

We computed linear Centered Kernel Alignment (Kornblith et al., [2019](https://arxiv.org/html/2606.26987#bib.bib34)) between $V^{(l)}$ for all layer pairs within each model and story condition. CKA values near 1 indicate similar representational geometry, while values near 0 indicate orthogonal structure. Because CKA is invariant to orthogonal transformations and isotropic scaling of the latent space, it is well suited for this comparison and allowed us to quantify how emotion geometry evolves through the network.
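
For reference, linear CKA between column-centered matrices $X$ and $Y$ is $\lVert Y^{\top} X \rVert_F^2 / (\lVert X^{\top} X \rVert_F \, \lVert Y^{\top} Y \rVert_F)$ (Kornblith et al., 2019). The sketch below computes it for every pair of layers; the function names and the synthetic "layers" are illustrative assumptions, not the released code.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (n_samples, d) matrices with matching rows."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Yc.T @ Xc, ord="fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, ord="fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(layers):
    """Pairwise linear CKA for a list of per-layer contrast matrices."""
    n = len(layers)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = linear_cka(layers[i], layers[j])
    return out


# Toy check: a rotated copy of a layer should have CKA of 1 with the
# original, while an unrelated random layer should score lower.
rng = np.random.default_rng(0)
X = rng.normal(size=(171, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
layers = [X, X @ Q, rng.normal(size=(171, 16))]
M = cka_matrix(layers)
```

The rotated copy illustrates why CKA is useful across depth: two layers can use different coordinate axes and still register as having the same geometry.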

##### Valence direction stability

Lastly, we identified the valence direction at each layer as the vector most correlated with human valence ratings (using PC1 when this correlation is significant), then computed cosine similarity between these directions across layers to test whether the same subspace encodes valence at different depths.
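
A simplified version of this stability check might look like the sketch below. It always uses PC1 as the valence direction and flips its sign so that projections correlate positively with the ratings; both choices are assumptions made for illustration, and the fallback for layers where PC1 is not significantly correlated is not modeled. All names and data are hypothetical.

```python
import numpy as np


def valence_direction(V, valence):
    """Unit PC1 direction of V, oriented to correlate positively with valence."""
    centered = V - V.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]  # already unit-norm
    scores = centered @ direction
    if np.corrcoef(scores, valence)[0, 1] < 0:
        direction = -direction  # resolve the arbitrary sign of the PC
    return direction


def cosine_matrix(directions):
    """Pairwise cosine similarity between per-layer direction vectors."""
    D = np.stack(directions)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return D @ D.T


# Toy layers: layers 0 and 1 encode valence along axis 0,
# layer 2 encodes it along a different axis (axis 1).
rng = np.random.default_rng(0)
n_emotions, d_model = 171, 32
valence = rng.uniform(-1.0, 1.0, size=n_emotions)


def toy_layer(axis):
    V = 0.1 * rng.normal(size=(n_emotions, d_model))
    V[:, axis] += 5.0 * valence
    return V


layer_mats = [toy_layer(0), toy_layer(0), toy_layer(1)]
C = cosine_matrix([valence_direction(V, valence) for V in layer_mats])
```

In this toy setup the first two layers share a valence axis and the third does not, so the off-diagonal entries of `C` separate into a high and a near-zero value, mirroring the block structure one would look for in the real cosine matrices.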

## 3 Results

![Refer to caption](https://arxiv.org/html/2606.26987v1/images/fig_trajectory_combined_fracdepth_horizontal.png)

Figure 1: Pearson correlation between the top two PCs of the emotion-vector space and human valence (left, PC1) and arousal (right, PC2) across fractional layer depth, for Apertus-8B and Gemma-4-E4B probed on Apertus- and Gemma-generated stories (four conditions). Hue = model (blue = Apertus, red = Gemma); line style = story source (solid = Apertus, dashed = Gemma). Dotted gray lines mark the Sonnet 4.5 reference at a mid-late layer ($r = 0.81$ valence, $r = 0.66$ arousal; Sofroniew et al., [2026](https://arxiv.org/html/2606.26987#bib.bib18)).

### 3.1 Valence Replicates Across Models and Corpora

The first principal component of the emotion contrast matrix aligns with human valence ratings in both models, replicating the main result of Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)). [Figure 1](https://arxiv.org/html/2606.26987#S3.F1) shows PC1–valence correlations across fractional layer depth for all four model × corpus conditions; per-layer values are reported in [Table 3](https://arxiv.org/html/2606.26987#A3.T3).

##### Peak correlations\.

All model × corpus combinations reach a peak between $r = 0.72$ and $0.83$. Apertus-8B peaks at $r = 0.72$ (layer 23, Apertus stories) and $r = 0.76$ (layer 31, Gemma stories); Gemma-4-E4B peaks at $r = 0.79$ (layer 13, Apertus stories) and $r = 0.83$ (layer 16, Gemma stories). All peaks are significant ($p < 10^{-3}$) and approach or exceed the Sonnet 4.5 reference of $r = 0.81$.

##### Valence Across Network Depth

Both models reach similar $r$-value peaks but with opposite depth profiles. Apertus-8B shows *abrupt late emergence*: PC1–valence correlation is near zero through fractional depth $\approx 0.5$ (layer 17/18), then rises sharply, becoming significant at layer 18 ($r = 0.17$, $p < 0.05$) and exceeding $r = 0.60$ at layer 21 ($\approx$ 63% depth) under both story conditions.

Gemma-4-E4B instead shows *early encoding followed by collapse*: for Apertus stories, valence peaks at layer 16 ($\approx$ 38% depth), then falls near zero by layer 18, with only partial recovery ($r \approx 0.18$–$0.20$) in the final layers. For Gemma stories, the peak comes later and both pre- and post-peak values are higher. The Sonnet 4.5 reference peaks in the mid-late range, indicating that Apertus-8B follows a similar pattern.

##### Representation space vs\. valence\-axis stability

To interpret valence trajectories, we examine (i) whole-space representational similarity via linear CKA and (ii) cosine alignment of the layer-wise valence direction for each model–corpus combination.

Apertus-8B (Figs. [2](https://arxiv.org/html/2606.26987#A3.F2), [3](https://arxiv.org/html/2606.26987#A3.F3)) shows three CKA phases: layers 2–11 form a flat plateau ($\textrm{CKA} \approx 1$); layers 12–21 form a transition band with off-diagonal decay (minimum $0.33$ on Apertus stories, $0.58$ on Gemma stories); and layers 22–31 form a second plateau. This transition aligns with the rise of PC1–valence correlation, suggesting a representational reorganization. Gemma-4-E4B (Figs. [4](https://arxiv.org/html/2606.26987#A3.F4), [5](https://arxiv.org/html/2606.26987#A3.F5)) instead shows a smooth gradient across all 40 layers with no sharp transition and $\textrm{CKA} \geq 0.73$ between any pair. The collapse of Gemma’s valence correlation around layer 18 therefore cannot stem from a global reorganization, as the geometry remains approximately stable through the collapse.

The valence-direction cosine matrices show what changes. For Apertus-8B on its own stories (Fig. [6](https://arxiv.org/html/2606.26987#A3.F6)), no off-diagonal cell exceeds $0.49$ in absolute value, which may indicate that the recovered direction is noisy across layers. On Gemma stories (Fig. [7](https://arxiv.org/html/2606.26987#A3.F7)), early layers (2–11) form a coherent block with cosines $0.35$–$0.57$ before becoming noisy in later layers. For Gemma-4-E4B on Gemma stories (Fig. [10](https://arxiv.org/html/2606.26987#A3.F10)), there are two positive blocks (layers 2–8 and 9–14) and a late block (28–40), with adjacent-layer cosines up to $\pm 0.55$. On Apertus stories (Fig. [11](https://arxiv.org/html/2606.26987#A3.F11)) this structure is less pronounced. Because CKA matrices are similar across corpora, the emotion representational space appears largely corpus-invariant. However, the recovered valence axis depends on the input corpus, with Gemma stories yielding cleaner valence directions in both models. Thus, valence is recoverable in both models, but it is not encoded along a consistent axis across depth.

##### PCA cluster separation at peak layers

PCA projections at each model’s peak layer (Figs. [14](https://arxiv.org/html/2606.26987#A3.F14), [15](https://arxiv.org/html/2606.26987#A3.F15)) show emotion clustering and a clear corpus effect. PC1–valence correlations are similar across story conditions (Apertus-8B L23: $0.72$ vs. $0.75$; Gemma-4-E4B L13: $0.79$ vs. $0.80$), but clusters are more clearly separated for Gemma stories, with positive and negative emotions forming denser groups.

### 3.2 Arousal Encoding

PC2–arousal correlations are generally weaker than PC1–valence and depend strongly on the story corpus (Fig. [1](https://arxiv.org/html/2606.26987#S3.F1), right; [Table 4](https://arxiv.org/html/2606.26987#A3.T4)). On Apertus stories, both models peak at or below $r = 0.21$ (Apertus-8B: $r = 0.17$ at layer 18; Gemma-4-E4B: $r = 0.21$ at layer 40). On Gemma stories, both models reach $r > 0.40$ (Apertus-8B: $r = 0.45$ at layer 26; Gemma-4-E4B: $r = 0.41$ at layer 31, both $p < 10^{-8}$). One possible explanation is that Gemma-generated stories contain more arousal-discriminative linguistic content. We leave a corpus-content analysis to future work.

## 4 Discussion

##### Main Research Questions

Our results address the three questions raised in the introduction. (1) *Emotion vectors are not specific to Claude’s training*. We recover a valence axis of similar strength in two architecturally distinct open-weight models, with peak correlations matching ($r = 0.83$ for Gemma-4-E4B) or approaching ($r = 0.76$ for Apertus-8B) the $r = 0.81$ reported for Claude Sonnet 4.5. (2) *Emergence is not uniform across models*. Apertus-8B builds valence alignment abruptly in the second half of the network, while Gemma-4-E4B encodes it early and then loses it mid-network. (3) *The story corpus affects extraction*. This is especially clear for arousal: Gemma-generated stories yield peak correlations about twice as large or more compared with Apertus-generated stories in both probed models.

##### Different paths to the same geometry

Gemma-4-E4B and Apertus-8B reach similar peak valence correlations ($r \approx 0.76$–$0.83$) via different layer-wise trajectories. Gemma-4-E4B encodes valence in earlier layers before it degrades in later layers, while Apertus-8B develops it sharply across mid-to-upper layers. We have not yet explored whether this difference is attributable to architecture, training data, or post-training, since the models differ in all three. Our results show that similar peak valence correlations can hide substantial differences in *where* and *how* valence is computed.

##### Stable representation space, unstable axis

Stability of the representational space (CKA) and stability of the valence axis are decoupled. In Gemma-4-E4B, the space remains similar across layers even where the PC1–valence correlation collapses, so valence information may still be preserved. In Apertus-8B, the valence axis is relatively unstable across layers despite a late high plateau of valence–PC1 correlation. Thus, representational similarity between layers does not guarantee a shared valence direction.

##### The arousal gap and corpus dependence

Arousal shows the weakest replication, but the story-condition analysis suggests that this may be attributable to our methodological choices. With Gemma-generated stories, arousal correlations in both models rise (from $r \leq 0.21$ to $r \geq 0.41$), partially closing the gap with the original result ($r = 0.66$). Because Gemma stories improve arousal extraction in *both* models, the effect likely reflects corpus properties rather than model–story matching; this rules out the simple confound that each model encodes only its own corpus well. We hypothesize that Gemma generates stories with greater variation in narrative intensity and physiological arousal cues, so corpus choice for eliciting emotion contrasts is a substantive methodological factor, not an implementation detail. We leave verification to future work.

### 4.1 Limitations

Several limitations warrant mention. First, the original study (Sofroniew et al., [2026](https://arxiv.org/html/2606.26987#bib.bib18)) did not release code, so our implementation is reconstructed from the methods they described. Subtle methodological differences may therefore contribute to numerical differences. Second, our analysis covers two open-weight models from two families. Broader cross-architecture comparisons would strengthen claims about how general the valence pattern is, and whether the trajectory differences generalize to other model families. Third, the corpora we probe are themselves model-generated, which means we cannot fully separate properties of the probed model from properties of the distribution that the generating model produces. A fully model-independent stimulus set would be a stronger control.

### 4.2 Future Work

Several directions follow from our findings and limitations. The most direct is causal validation: steering model outputs at peak-correlation layers along the recovered valence direction would test whether the representational structure we identify is actually used by the model. Relatedly, the cross-layer rotation of the valence axis raises the question of whether steering vectors derived at one layer remain effective when applied to another, even within regions where the overall space is stable. Cross-layer feature tracking using sparse autoencoders could further reveal whether the same interpretable features carry emotion information across the depth ranges we identify, or whether different layers encode emotions through different feature combinations. Finally, extending this analysis to multi-modal models could test whether the valence axis is preserved across modalities.

## 5 Conclusion

We replicate Anthropic’s emotion findings in two open-weight models, achieving valence correlations of $r = 0.83$ (Gemma-4-E4B) and $r = 0.76$ (Apertus-8B). Cross-layer analysis reveals divergent developmental trajectories: Gemma-4-E4B encodes valence in early layers while Apertus-8B builds it progressively through late layers. These results suggest that similar representations can arise from different computational paths, with implications for layer selection in interpretability work and targeted steering interventions.

## References

- A. Arditi, O. B. Obeso, A. Syed, D. Paleka, N. Rimsky, W. Gurnee, and N. Nanda (2024) Refusal in language models is mediated by a single direction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=pH3XAQME6c)
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, et al. (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread.
- E. Cheng, D. Doimo, C. Kervadec, I. Macocco, L. Yu, A. Laio, and M. Baroni (2025) Emergence of a high-dimensional abstraction phase in language transformers. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=0fD3iIBhlV)
- B. J. Choi and M. Weber (2026) Latent structure of affective representations in large language models. arXiv:2604.07382. [Link](https://arxiv.org/abs/2604.07382)
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
- G. DeepMind (2026) Gemma 4: expanding the gemmaverse with apache 2.0. Accessed: 2026-04-28. [Link](https://opensource.googleblog.com/2026/03/gemma-4-expanding-the-gemmaverse-with-apache-20.html)
- N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah (2022) Toy models of superposition. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2022/toy_model/index.html)
- A. Hernández-Cano, A. Hägele, A. H. Huang, A. Romanou, A. Solergibert, B. Pasztor, B. Messmer, D. Garbaya, E. F. Ďurech, I. Hakimi, J. G. Giraldo, M. Ismayilzada, N. Foroutan, S. Moalla, T. Chen, V. Sabolčec, Y. Xu, M. Aerni, B. AlKhamissi, I. A. Marinas, M. H. Amani, M. Ansaripour, I. Badanin, H. Benoit, E. Boros, N. Browning, F. Bösch, M. Böther, N. Canova, C. Challier, C. Charmillot, J. Coles, J. Deriu, A. Devos, L. Drescher, D. Dzenhaliou, M. Ehrmann, D. Fan, S. Fan, S. Gao, M. Gila, M. Grandury, D. Hashemi, A. Hoyle, J. Jiang, M. Klein, A. Kucharavy, A. Kucherenko, F. Lübeck, R. Machacek, T. Manitaras, A. Marfurt, K. Matoba, S. Matrenok, H. Mendonça, F. R. Mohamed, S. Montariol, L. Mouchel, S. Najem-Meyer, J. Ni, G. Oliva, M. Pagliardini, E. Palme, A. Panferov, L. Paoletti, M. Passerini, I. Pavlov, A. Poiroux, K. Ponkshe, N. Ranchin, J. Rando, M. Sauser, J. Saydaliev, M. A. Sayfiddinov, M. Schneider, S. Schuppli, M. Scialanga, A. Semenov, K. Shridhar, R. Singhal, A. Sotnikova, A. Sternfeld, A. K. Tarun, P. Teiletche, J. Vamvas, X. Yao, H. Z. A. Ilic, A. Klimovic, A. Krause, C. Gulcehre, D. Rosenthal, E. Ash, F. Tramèr, J. VandeVondele, L. Veraldi, M. Rajman, T. Schulthess, T. Hoefler, A. Bosselut, M. Jaggi, and I. Schlag (2025) Apertus: Democratizing Open and Compliant LLMs for Global Language Environments. [https://arxiv.org/abs/2509.14233](https://arxiv.org/abs/2509.14233)
- S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In International Conference on Machine Learning, pp. 3519–3529.
- S. Marks and M. Tegmark (2024) The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824. [Link](https://arxiv.org/abs/2310.06824)
- T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, L. Vanderwende, H. Daumé III, and K. Kirchhoff (Eds.), Atlanta, Georgia, pp. 746–751. [Link](https://aclanthology.org/N13-1090/)
- S. M. Mohammad (2018) Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of ACL.
- K. Park, Y. J. Choe, and V. Veitch (2023) The linear representation hypothesis and the geometry of large language models. In Causal Representation Learning Workshop at NeurIPS 2023. [Link](https://openreview.net/forum?id=T0PoOJg8cK)
- A. Radford, R. Jozefowicz, and I. Sutskever (2017) Learning to generate reviews and discovering sentiment. arXiv:1704.01444. [Link](https://arxiv.org/abs/1704.01444)
- N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner (2024) Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 15504–15522. [Link](https://aclanthology.org/2024.acl-long.828/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.828)
- J. A. Russell (1980) A circumplex model of affect. Journal of Personality and Social Psychology 39(6), pp. 1161.
- N. Sofroniew, I. Kauvar, W. Saunders, R. Chen, T. Henighan, S. Hydrie, C. Citro, A. Pearce, J. Tarng, W. Gurnee, J. Batson, S. Zimmerman, K. Rivoire, K. Fish, C. Olah, and J. Lindsey (2026) Emotion concepts and their function in a large language model. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2026/emotions/index.html)
- L. Sun, L. Yan, X. Lu, A. Lee, J. Zhang, and J. Shao (2026) Valence-arousal subspace in llms: circular emotion geometry and multi-behavioral control. arXiv:2604.03147. [Link](https://arxiv.org/abs/2604.03147)
- C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda (2024) Language models linearly represent sentiment. In ICML 2024 Workshop on Mechanistic Interpretability. [Link](https://openreview.net/forum?id=Xsf6dOOMMc)
- L. Valeriani, D. Doimo, F. Cuturello, A. Laio, A. Ansuini, and A. Cazzaniga (2023) The geometry of hidden representations of large transformer models. In Thirty-seventh Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=cCYvakU5Ek)

## Appendix A Related Work

##### Linear representations in LLMs

The linear representation hypothesis holds that high-level concepts are encoded as directions in activation space (Mikolov et al., [2013](https://arxiv.org/html/2606.26987#bib.bib22); Elhage et al., [2022](https://arxiv.org/html/2606.26987#bib.bib24); Park et al., [2023](https://arxiv.org/html/2606.26987#bib.bib23)). Tigges et al. ([2024](https://arxiv.org/html/2606.26987#bib.bib21)) demonstrated this for sentiment, finding that a single direction captures positive-negative valence across tasks. Subsequent work extended linear representations to truth (Marks and Tegmark, [2024](https://arxiv.org/html/2606.26987#bib.bib25)), refusal (Arditi et al., [2024](https://arxiv.org/html/2606.26987#bib.bib26)), and behavioral tendencies (Rimsky et al., [2024](https://arxiv.org/html/2606.26987#bib.bib27)). Sparse autoencoders can extract directions at scale, decomposing polysemantic activations into interpretable features (Bricken et al., [2023](https://arxiv.org/html/2606.26987#bib.bib4); Cunningham et al., [2023](https://arxiv.org/html/2606.26987#bib.bib3)).

##### Emotion in language models

Early work identified a “sentiment neuron” in LSTMs (Radford et al., [2017](https://arxiv.org/html/2606.26987#bib.bib28)), though later analysis suggested emotional content is distributed across many neurons. Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)) provide a comprehensive analysis, extracting 171 emotion vectors from Claude Sonnet 4.5 and demonstrating causal influence on behavior. They found that emotion geometry mirrors human psychological structure, with valence and arousal as principal axes.

Concurrent work extends this to other models: Sun et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib37)) identify a valence-arousal subspace in Llama and Qwen with circumplex-consistent circular geometry, where steering along VA axes controls refusal and sycophancy. Choi and Weber ([2026](https://arxiv.org/html/2606.26987#bib.bib36)) find coherent affective representations in Gemma-2, Mistral, and LLaMA with modest nonlinear global structure. We build on Sofroniew et al. ([2026](https://arxiv.org/html/2606.26987#bib.bib18)), testing generalization across architectures and the role of extraction methodology.

##### Cross-layer geometry

Transformer representations evolve across layers in characteristic ways. Valeriani et al. ([2023](https://arxiv.org/html/2606.26987#bib.bib30)) found that intrinsic dimension expands then contracts, with semantics concentrated at intermediate depths. Cheng et al. ([2025](https://arxiv.org/html/2606.26987#bib.bib31)) identified a “high-dimensional abstraction phase” where representations peak in complexity before simplifying toward outputs.

## Appendix B Story Dataset

The emotion story datasets were generated using Apertus\-8B and Gemma\-4\-E4B, following a methodology similar to Anthropic’s emotion vectors work \(Sofroniew et al\., [2026](https://arxiv.org/html/2606.26987#bib.bib18)\)\. The 171 emotions were taken from their work\. Stories were designed to convey emotions implicitly: they never name the target emotion directly and rely instead on character actions, physical sensations, dialogue, and situational context\. The prompts were also kept similar, to introduce as few methodological confounds as possible\.

### B\.1 Dataset statistics

Table 1: Emotion story dataset statistics by corpus\. Apertus stories were deduplicated to match the uniform 9\-stories\-per\-topic structure of the Gemma corpus\.

### B\.2 Activation collection

Residual stream activations were collected from Apertus\-8B\-Instruct at multiple transformer layers \(Table [2](https://arxiv.org/html/2606.26987#A2.T2)\)\. The model can be found through HuggingFace: swiss\-ai/Apertus\-8B\-Instruct\-2509\.

For Gemma, activations were collected from a different set of layers \(Table [2](https://arxiv.org/html/2606.26987#A2.T2)\)\. The model was also accessed through HuggingFace: google/Gemma\-4\-E4B\-it\.

Table 2: Activation extraction configuration for both models\.

## Appendix C Additional Results

### C\.1 Principal Component Valence

Table 3: PC1–valence \(Pearson $r$\) across layers and story conditions\. Bold indicates the peak layer per model–condition pair\. $^{\dagger}p<0.05$; $^{\ddagger}p<0.01$; $^{*}p<0.001$\.

### C\.2 Principal Component Arousal

Table 4: PC2–arousal \(Pearson $r$\) across layers and story conditions\. Bold indicates the peak layer per model–condition pair\. $^{\dagger}p<0.05$; $^{\ddagger}p<0.01$; $^{*}p<0.001$\.

### C\.3 CKA Figures

CKA \(centered kernel alignment\) measures how similar the emotion space is between two layers\. The diagonal is always 1, since each diagonal cell compares a layer with itself\.

- CKA close to 1: the spatial arrangement of emotion vectors is nearly identical in the two layers\.
- CKA close to 0: the spatial arrangement has changed substantially between the two layers\.

So each cell answers: does the model organize emotions in the same way at layer A as in layer B? The higher the value, the more similar\.
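The CKA variant used for these figures is not specified in this appendix\. As a minimal sketch, assuming linear CKA on mean\-centered emotion\-vector matrices, the layer\-by\-layer matrix could be computed as follows\. The names `linear_cka` and `cka_matrix` and the random stand\-in data are illustrative assumptions, not the authors' code\.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_emotions, d) matrices with aligned rows."""
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

def cka_matrix(layers):
    """Symmetric matrix of pairwise CKA scores; the diagonal is 1 by definition."""
    n = len(layers)
    out = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = linear_cka(layers[i], layers[j])
    return out

# Hypothetical stand-in data: 171 emotion vectors at three layers of width 64.
rng = np.random.default_rng(0)
base = rng.normal(size=(171, 64))
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal map
layers = [base, base @ rotation, rng.normal(size=(171, 64))]

M = cka_matrix(layers)
print(np.round(M, 3))
```

In this toy setup the second layer is a rotated copy of the first, so their CKA is 1 up to rounding, while the unrelated third layer scores lower\. With few emotions relative to the hidden width, unrelated representations can still score well above 0, which is worth keeping in mind when reading the heatmaps\.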

#### C\.3\.1 Apertus\-8B CKA results

Figure 2: Apertus\-8B CKA results on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x1.png)

Figure 3: Apertus\-8B CKA values on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x2.png)

#### C\.3\.2 Gemma\-4\-E4B CKA results

Figure 4: Gemma\-4\-E4B CKA values on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x3.png)

Figure 5: Gemma\-4\-E4B CKA values on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x4.png)

### C\.4 Valence Direction Alignment

Each cell shows the cosine similarity between the valence direction vectors at 2 layers\. The valence direction is the axis in activation space that best predicts the emotion valence\.

- Cosine similarity close to 1: the valence axis points in the same direction in both layers, giving a consistent positive axis\.
- Cosine similarity close to 0: the valence axes are orthogonal, so the axis has rotated by about 90° between the two layers\.
- Cosine similarity close to \-1: the axis has flipped direction\.

CKA provides information about the whole space, while the cosine similarity specifically shows whether the valence axis is stable\. A predominantly blue matrix would indicate that the model has a persistent stable direction to represent positive vs\. negative emotions across many layers\.

The valence direction stability line plot shows the cosine similarity between adjacent layers only\. It is read like the entries of the matrix, but it tracks whether the valence axis keeps pointing in the same direction from one layer to the next\. A dip marks a specific transition where the model changes how it encodes valence\.
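How the valence direction is fitted is not detailed in this appendix\. The sketch below assumes a ridge regression from emotion vectors to per\-emotion valence ratings at each layer, then compares the resulting unit directions with cosine similarity\. The function names, the ridge penalty, and the synthetic layers are all hypothetical\.

```python
import numpy as np

def valence_direction(X, valence, ridge=1.0):
    """Unit vector in activation space that best predicts valence (ridge fit)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    yc = valence - valence.mean()
    # Dual-form ridge solution, convenient when width exceeds the emotion count.
    alpha = np.linalg.solve(Xc @ Xc.T + ridge * np.eye(Xc.shape[0]), yc)
    w = Xc.T @ alpha
    return w / np.linalg.norm(w)

def alignment_matrix(directions):
    """Cosine similarity between every pair of per-layer unit directions."""
    D = np.stack(directions)
    return D @ D.T

def adjacent_stability(directions):
    """Cosine similarity between each layer's direction and the next layer's."""
    return np.array([directions[i] @ directions[i + 1]
                     for i in range(len(directions) - 1)])

# Hypothetical stand-in data: 171 emotions, residual width 32, three layers.
rng = np.random.default_rng(0)
n, d = 171, 32
valence = rng.uniform(-1, 1, size=n)
u = rng.normal(size=d)
u /= np.linalg.norm(u)

def fake_layer(direction, noise=0.1):
    # Valence is written along `direction`, plus isotropic noise.
    return np.outer(valence, direction) + noise * rng.normal(size=(n, d))

# The third layer encodes valence along the flipped axis.
layers = [fake_layer(u), fake_layer(u), fake_layer(-u)]
dirs = [valence_direction(X, valence) for X in layers]

A = alignment_matrix(dirs)
stab = adjacent_stability(dirs)
print(np.round(A, 2))
print(np.round(stab, 2))
```

Cosine similarity across layers only makes sense when all layers share the same residual width, which this sketch assumes\. In the toy data the sign flip at the last layer shows up as a negative entry in both the matrix and the adjacent\-layer curve\.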

#### C\.4\.1 Apertus\-8B Valence Alignment

Figure 6: Apertus\-8B validation on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x5.png)

Figure 7: Apertus\-8B validation on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x6.png)

Figure 8: Apertus\-8B validation on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x7.png)

Figure 9: Apertus\-8B validation on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x8.png)

#### C\.4\.2 Gemma\-4\-E4B Results

Figure 10: Gemma\-4\-E4B validation on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x9.png)

Figure 11: Gemma\-4\-E4B validation on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x10.png)

Figure 12: Gemma\-4\-E4B validation on Apertus stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x11.png)

Figure 13: Gemma\-4\-E4B on Gemma stories
![Refer to caption](https://arxiv.org/html/2606.26987v1/x12.png)

### C\.5 PCA comparison

The PCA figure shows a map of the model’s emotional space at a specific layer\. We pick the layer with the highest PC1–valence correlation\. Each dot is an emotion, positioned according to how the model represents it in activation space\.

Comparing two panels can tell whether the map is reproducible across different inputs, or whether the emotional space is sensitive to what stories the model reads\.
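As a rough illustration of the comparison described above, the sketch below projects emotion vectors onto their first two principal components, correlates PC1 with valence and PC2 with arousal, and checks whether PC1 agrees across two story conditions\. The helper names, the synthetic valence and arousal ratings, and the two fake corpora are assumptions made for this example\.

```python
import numpy as np

def pca_scores(X, k=2):
    """Project the rows of X (n_emotions, d) onto the top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical stand-in data: 171 emotions with synthetic ratings.
rng = np.random.default_rng(0)
n, d = 171, 32
valence = rng.uniform(-1, 1, size=n)
arousal = rng.uniform(-1, 1, size=n)
axes, _ = np.linalg.qr(rng.normal(size=(d, 2)))  # two orthonormal directions

def fake_corpus(noise=0.3):
    # Valence is the dominant axis, arousal a weaker second axis.
    signal = 3.0 * np.outer(valence, axes[:, 0]) + 1.5 * np.outer(arousal, axes[:, 1])
    return signal + noise * rng.normal(size=(n, d))

scores_a = pca_scores(fake_corpus())  # e.g. vectors from one story corpus
scores_b = pca_scores(fake_corpus())  # e.g. vectors from the other corpus

# The sign of a principal component is arbitrary, so compare magnitudes.
r_valence = abs(pearson_r(scores_a[:, 0], valence))
r_arousal = abs(pearson_r(scores_a[:, 1], arousal))
r_repro = abs(pearson_r(scores_a[:, 0], scores_b[:, 0]))
print(round(r_valence, 2), round(r_arousal, 2), round(r_repro, 2))
```

Because the sign of a principal component is arbitrary, a component may need to be flipped before two panels are compared visually\.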

#### C\.5\.1 Apertus\-8B Results

Figure 14: Apertus\-8B validation
![Refer to caption](https://arxiv.org/html/2606.26987v1/x13.png)

#### C\.5\.2 Gemma\-4\-E4B Results

Figure 15: Gemma\-4\-E4B: valence\-arousal PCA
![Refer to caption](https://arxiv.org/html/2606.26987v1/x14.png)

## Appendix D Prompts

Below, we report verbatim the prompts used to generate the short stories\.

`System prompt — Explanation generation`
