Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
Summary
This paper introduces a principled approach to multilingual language steering using sparse autoencoders (SAEs) trained on multilingual data and a novel layer selection rule based on the intersection of multilingual alignment and language separability, evaluated on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization.
# Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
Source: [https://arxiv.org/html/2605.23036](https://arxiv.org/html/2605.23036)
Yusser Al Ghussin<sup>1,2</sup>, Daniil Gurgurov<sup>1,2</sup>, Tanja Bäumel<sup>1,2,5</sup>, Josef van Genabith<sup>1,2</sup>, Patrick Schramowski<sup>2,3,4,5</sup>, Simon Ostermann<sup>1,2,5</sup>
<sup>1</sup>Saarland University, <sup>2</sup>German Research Center for Artificial Intelligence (DFKI), <sup>3</sup>TU Darmstadt, <sup>4</sup>hessian.AI, <sup>5</sup>Centre for European Research in Trusted AI (CERTAIN). Contact: yusser.al_ghussin@dfki.de
###### Abstract
Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an *a priori* steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering. We release all code and models for reproducibility: [https://github.com/Yusser96/Multilingual-Steering-by-Design/](https://github.com/Yusser96/Multilingual-Steering-by-Design/) and [https://huggingface.co/collections/Yusser/multilingual-steering-by-design](https://huggingface.co/collections/Yusser/multilingual-steering-by-design).
## 1 Introduction
Figure 1: Overview of our language-control pipeline. A language-specific vector is constructed and used for layer selection and generation steering.
Large language models (LLMs) can generate text in many languages, yet reliably *controlling* the output language remains challenging. While sparse autoencoders (SAEs) have emerged as a promising tool for interpreting internal activations and constructing steering vectors that causally influence model behavior (Cunningham et al., [2023](https://arxiv.org/html/2605.23036#bib.bib54); Templeton, [2024](https://arxiv.org/html/2605.23036#bib.bib93)), SAE-based language steering in multilingual settings remains brittle and difficult to reproduce, with steering success varying unpredictably across models and layers: intervention depths are typically chosen heuristically (e.g., “mid-to-late” layers), requiring expensive layer sweeps and yielding inconsistent outcomes (Bayat et al., [2025](https://arxiv.org/html/2605.23036#bib.bib35); Chou et al., [2025](https://arxiv.org/html/2605.23036#bib.bib85)). As a result, although SAE steering can work, it lacks a predictive, mechanistic account of *where* and *why* language control should be applied inside the model (Tang et al., [2024](https://arxiv.org/html/2605.23036#bib.bib50); Deng et al., [2025](https://arxiv.org/html/2605.23036#bib.bib45)).
We argue that this haphazardness stems from the lack of a mechanistic perspective on how multilingual information is organized across model depth. We show that effective language steering requires access to two complementary signals: shared cross-lingual structure that supports fluent generation across languages, and language-specific information that distinguishes one language from another. Prior work has shown that multilingual pretrained models learn shared latent representations across languages, facilitating cross-lingual transfer even in the absence of shared vocabularies or parallel data (Conneau et al., [2020](https://arxiv.org/html/2605.23036#bib.bib4)). At the same time, language identity and language-specific features are differentially encoded across layers and can transition toward shared abstractions over depth in multilingual models (Riemenschneider and Frank, [2025](https://arxiv.org/html/2605.23036#bib.bib2); Zhang et al., [2025](https://arxiv.org/html/2605.23036#bib.bib3)). If an intervention targets layers dominated by shared structure, steering lacks specificity; if it targets layers dominated by language-specific signals, the model often fails to recover generation quality. Our hypothesis reframes language steering as a problem of identifying representational balance points, rather than amplifying language-specific features in isolation, as is common in prior work (Tang et al., [2024](https://arxiv.org/html/2605.23036#bib.bib50); Deng et al., [2025](https://arxiv.org/html/2605.23036#bib.bib45); Gurgurov et al., [2025](https://arxiv.org/html/2605.23036#bib.bib37)).
In this work, we operationalize this mechanistic hypothesis through two complementary contributions. First, we train SAEs directly on multilingual data for LLaMA-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.23036#bib.bib52)) and Gemma-2-9B (Team et al., [2024](https://arxiv.org/html/2605.23036#bib.bib75)), showing that multilingual training preserves the shared cross-lingual structure and language-specific distinctions required for predictable and interpretable steering in the sparse representation space. Compared to open-source SAEs (He et al., [2024](https://arxiv.org/html/2605.23036#bib.bib21); Lieberum et al., [2024](https://arxiv.org/html/2605.23036#bib.bib22)), these multilingual SAEs yield more stable and quality-preserving language steering across layers and model families. Second, we introduce a principled, *a priori* rule for selecting steering layers based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. Figure [1](https://arxiv.org/html/2605.23036#S1.F1) provides an overview of the proposed language steering framework.
We validate this mechanistic framework across machine translation and cross\-lingual summarization on LLaMA\-3\.1\-8B and Gemma\-2\-9B, explicitly testing the prediction that balanced layers yield optimal language identification accuracy and generation quality trade\-offs\. Across both benchmarks, we find that multilingual SAEs combined with intersection\-selected layers consistently stabilize language control and improve interpretability, supporting the view that effective steering depth is a property of the model’s internal multilingual organization rather than a heuristic tuning choice\.
Our contributions are threefold:
- Mechanistic characterization of language across depth. We show that effective language steering arises at layers where cross-lingual alignment and language separability coexist.
- Principled, *a priori* layer selection. We introduce an intersection-based criterion that predicts effective steering depths without layer sweeps.
- Multilingual SAEs as an interpretability enabler for language steering. We show that multilingual SAE training preserves the representational structure required for reliable, interpretable language control.
Figure 2: Correlation matrices of per-language contrast (DiffMean) vectors for Gemma-2-9B (Layer 23).
## 2 Related Work
#### SAE\-Based Activation and Language Steering\.
Sparse autoencoders (SAEs) have been widely used to interpret and steer internal activations in large language models (Templeton, [2024](https://arxiv.org/html/2605.23036#bib.bib93); Zhao et al., [2024](https://arxiv.org/html/2605.23036#bib.bib14); O’Brien et al., [2024](https://arxiv.org/html/2605.23036#bib.bib16); Wang et al., [2025](https://arxiv.org/html/2605.23036#bib.bib15); Zhao et al., [2026](https://arxiv.org/html/2605.23036#bib.bib17)). Methods such as Sparse Activation Steering (SAS) (Bayat et al., [2025](https://arxiv.org/html/2605.23036#bib.bib35)), Feature-Guided Activation Addition (FGAA) (Soo et al., [2025](https://arxiv.org/html/2605.23036#bib.bib26)), and SAE-Targeted Steering (SAE-TS) (Chalnev et al., [2024](https://arxiv.org/html/2605.23036#bib.bib25)) demonstrate that manipulating small sets of sparse features can causally influence model behavior. Applied to language control, prior work shows that editing individual SAE features can flip output language in models such as Gemma-2-9B and LLaMA-3.1-8B (Chou et al., [2025](https://arxiv.org/html/2605.23036#bib.bib85); Deng et al., [2025](https://arxiv.org/html/2605.23036#bib.bib45); Gurgurov et al., [2026](https://arxiv.org/html/2605.23036#bib.bib1)). However, effective steering depths are typically identified through manual exploration or fixed heuristics (e.g., mid-to-late layers), and many existing approaches rely on SAEs trained predominantly on English data. As a result, these methods do not provide a predictive, mechanistic account of where language steering should be applied across depth, nor of how multilingual structure is preserved in sparse representations.
#### Evaluating and Training SAEs\.
Recent benchmarks such as SAE-Bench (Karvonen et al., [2025](https://arxiv.org/html/2605.23036#bib.bib19)) and AxBench (Wu et al., [2025](https://arxiv.org/html/2605.23036#bib.bib88)) evaluate SAE fidelity, interpretability, and intervention quality, reporting mixed results for SAE-based steering compared to simpler baselines. Other work emphasizes reconstruction fidelity as critical for causal interventions: Gemma-Scope (Lieberum et al., [2024](https://arxiv.org/html/2605.23036#bib.bib22)) and LLaMA-Scope (He et al., [2024](https://arxiv.org/html/2605.23036#bib.bib21)) report that high reconstruction error degrades steering effectiveness, while JumpReLU SAEs (Rajamanoharan et al., [2024](https://arxiv.org/html/2605.23036#bib.bib20)) improve the fidelity–sparsity trade-off via straight-through training. These findings suggest that insufficient SAE fidelity may disproportionately affect low-frequency or multilingual features, motivating our use of high-fidelity JumpReLU SAEs for multilingual language steering.
#### Language Features inside Models\.
Beyond SAEs, prior analyses point to strong layer-dependent language signals in multilingual models. Tang et al. ([2024](https://arxiv.org/html/2605.23036#bib.bib50)) identify language-specific neurons in BLOOM and LLaMA-2 and show that toggling them can switch the output language. Chang et al. ([2022](https://arxiv.org/html/2605.23036#bib.bib24)) study multilingual geometry in XLM-R, finding that languages occupy approximately parallel subspaces separated by linear “language vectors”, particularly in middle layers; shifting hidden states along these directions flips predictions. Our findings echo these trends in the depth-wise distribution of multilingual structure and support treating language as a steerable direction in representation space (Gurgurov et al., [2026](https://arxiv.org/html/2605.23036#bib.bib1)), while further revealing correlations among language families (Gurgurov et al., [2025](https://arxiv.org/html/2605.23036#bib.bib37)).
Together, these lines of work motivate the need for a representation\-level account of multilingual language steering that explains both how language information is organized across depth and how this organization can be exploited to guide interventions predictively\.
## 3 Language Representations and Principled Steering
Our goal is not merely to improve language control, but to explain*where*and*why*language steering is possible inside multilingual LLMs, and to use this explanation to guide interventions*a priori*\.
We define *language vectors* as directions in representation space that capture both the presence of individual languages and the directions along which they can be causally steered, building on prior evidence that language identity is linearly encoded as a direction or low-dimensional subspace within model representations (Park et al., [2024](https://arxiv.org/html/2605.23036#bib.bib7); Deng et al., [2025](https://arxiv.org/html/2605.23036#bib.bib45)). Our layer-selection criterion is motivated by the observation that reliable language control requires access to two complementary signals: (i) *alignment*, corresponding to shared cross-lingual structure that supports generation across languages, and (ii) *separability*, corresponding to language-specific information that distinguishes one language from another. Only at depths where these signals are balanced can a small intervention reliably steer the output language.
### 3.1 Language Vectors
At each layer, we represent languages using contrastive *language vectors* constructed from model activations, either in the dense residual stream or in the sparse space induced by an SAE. Given activations from a target language and a set of other languages, we construct language steering vectors using the DiffMean method (Wu et al., [2025](https://arxiv.org/html/2605.23036#bib.bib88)). For a given target language at layer $\ell$, let $\mathcal{Z}^{+}$ denote the set of sparse codes corresponding to examples in the target language, and $\mathcal{Z}^{-}$ the set corresponding to all other languages. We compute the mean sparse representations by averaging SAE codes over all non-special tokens from all examples in that language,
$$\bar{z}^{+}_{\ell}=\frac{1}{|\mathcal{Z}^{+}|}\sum_{z\in\mathcal{Z}^{+}}z,\qquad\bar{z}^{-}_{\ell}=\frac{1}{|\mathcal{Z}^{-}|}\sum_{z\in\mathcal{Z}^{-}}z,$$
and define the steering vector as
$$w_{\mathrm{DiffMean}}(\ell)=\bar{z}^{+}_{\ell}-\bar{z}^{-}_{\ell}.$$
These vectors are then used additively in the SAE space to influence model outputs. Full mathematical definitions of the SAE representations, DiffMean steering vectors, and the inference-time steering procedure are provided in Appendix [C](https://arxiv.org/html/2605.23036#A3).
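The DiffMean construction and the additive use of the resulting vector can be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation: the function names, the toy codes, and the scaling factor `alpha` are assumptions introduced here, and the actual inference-time procedure is the one defined in Appendix C.

```python
import numpy as np


def diffmean_vector(z_pos: np.ndarray, z_neg: np.ndarray) -> np.ndarray:
    """Mean target-language code minus mean other-language code.

    z_pos: (n_pos, d_sae) sparse codes of target-language tokens.
    z_neg: (n_neg, d_sae) sparse codes of tokens from all other languages.
    """
    return z_pos.mean(axis=0) - z_neg.mean(axis=0)


def steer_codes(z: np.ndarray, w: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the steering vector in SAE space.

    `alpha` is a hypothetical strength parameter, not taken from the paper.
    """
    return z + alpha * w


# Toy codes (made up): 3 SAE features, feature 0 fires for the target language.
z_pos = np.array([[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]])
z_neg = np.array([[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]])

w = diffmean_vector(z_pos, z_neg)
steered = steer_codes(np.zeros(3), w, alpha=0.5)
```

In the toy data, the feature shared by both groups (index 2) cancels out of `w`, while the feature specific to the target language (index 0) dominates it, which is the contrastive behavior the definition is meant to capture.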
Beyond serving as steering directions, these language vectors exhibit meaningful linguistic structure\. In particular, at the layers selected by our intersection\-based criterion, pairwise correlations between per\-language vectors reveal clear language\-family groupings\. As shown in Figure[2](https://arxiv.org/html/2605.23036#S1.F2), languages from the same family \(e\.g\., Romance or Germanic\) exhibit high mutual similarity, while cross\-family correlations remain lower\. At the same time, a shared multilingual component persists across families, reflecting common cross\-lingual structure\. This coexistence of shared alignment and family\-specific separation aligns with the intuition behind our layer\-selection criterion and helps explain why these depths yield strong trade\-offs between language identification accuracy and generation quality\.
### 3.2 Multilingual SAEs for Language Steering
A central design choice in our framework is to train sparse autoencoders on multilingual data rather than English\-only corpora\. This choice is not merely pragmatic, but mechanistically important for reliable and interpretable language steering\.
English\-only SAEs preferentially encode monolingual structure: features that are frequent and salient in English dominate the sparse representation, while cross\-lingual correlations and low\-frequency language\-specific features are weakly represented or collapse entirely\. As a result, steering directions constructed from such representations are brittle\. Language vectors may activate English\-correlated features without cleanly isolating the intended target language, and the relationship between steering depth and downstream behavior becomes unstable\. In contrast, multilingual SAE training exposes the autoencoder to systematic variation across languages, encouraging the sparse feature space to preserve both shared cross\-lingual structure and language\-specific distinctions\.
From this perspective, multilingual SAEs can act as an*interpretability enabler*for representation\-level language steering\. They maintain the representational structure required to construct steering vectors whose effects can be predicted from representation\-level statistics\. The experimental comparisons in later sections empirically validate this claim, but the motivation for multilingual training arises directly from the mechanistic requirements of language steering\.
### 3.3 Principled Layer Selection
A common assumption in prior work is that effective language control primarily relies on manipulating strongly language-specific features, which are believed to emerge in later layers, where the model is less able to recover from an intervention (Tang et al., [2024](https://arxiv.org/html/2605.23036#bib.bib50); Gurgurov et al., [2025](https://arxiv.org/html/2605.23036#bib.bib37)). In contrast, we find that effective steering emerges at depths where language-specific signals coexist with sufficient shared cross-lingual structure, motivating an *a priori* layer-selection strategy based on the depthwise evolution of language representations. At each layer, we quantify *multilinguality* as the degree to which language vectors share a dominant common direction, and *separability* as the extent to which languages remain distinct in representation space.
Let $\{\lambda_{j}\}_{j=1}^{N}$ be the eigenvalues of the pairwise Pearson correlation matrix $C_{\ell}$ of the language vectors, where $N$ is the number of languages. We define the *multilinguality* score as the explained-variance ratio of the first principal component,
$$f_{\ell}=\frac{\max_{j}\lambda_{j}}{\sum_{k=1}^{N}\lambda_{k}},$$
which measures the degree of shared alignment across languages. We define *separability* as the complementary quantity
$$s_{\ell}=1-f_{\ell},$$
which reflects how distinct the language representations remain. We select steering layers at intersection points where these two signals are balanced, corresponding to depths that jointly preserve shared semantic structure while exposing discriminative language information.
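The two scores can be computed directly from the per-language vectors of one layer. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name and the toy vectors are hypothetical, and the exact definitions used in the paper are those of Appendix D.

```python
import numpy as np


def multilinguality_separability(lang_vectors: np.ndarray) -> tuple:
    """Return (f, s) for one layer.

    lang_vectors: (N, d) array with one language vector per row.
    """
    corr = np.corrcoef(lang_vectors)          # (N, N) pairwise Pearson correlations
    eigvals = np.linalg.eigvalsh(corr)        # real eigenvalues of a symmetric matrix
    f = float(eigvals.max() / eigvals.sum())  # explained-variance ratio of the top component
    return f, 1.0 - f


# Toy case 1 (made up): three perfectly correlated language vectors.
aligned = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 4.0, 6.0, 8.0],
                    [0.5, 1.0, 1.5, 2.0]])
# Toy case 2 (made up): three mutually uncorrelated language vectors.
orthogonal = np.array([[1.0, -1.0, 1.0, -1.0],
                       [1.0, 1.0, -1.0, -1.0],
                       [1.0, -1.0, -1.0, 1.0]])

f_shared, s_shared = multilinguality_separability(aligned)
f_orth, s_orth = multilinguality_separability(orthogonal)
```

The two toy cases mark the extremes of the score: fully shared structure gives a multilinguality of 1, while mutually uncorrelated vectors give the minimum value of 1/N, since the eigenvalues of a correlation matrix always sum to N.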
We empirically validate this criterion across models, SAE variants, and tasks. Our contribution is to replace heuristic layer choices (e.g., “mid-to-late” layers) with a principled, data-driven criterion that predicts effective steering depths *before* SAEs are trained and steering experiments are run. Full definitions of the language correlation matrices, multilinguality and separability metrics, and the intersection-based layer-selection procedure are given in Appendix [D](https://arxiv.org/html/2605.23036#A4).
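Because separability is defined as one minus multilinguality, the two curves meet exactly where the multilinguality score equals 0.5. A minimal sketch of locating such crossings from a list of per-layer scores is shown below. It assumes the raw, unnormalized curves are compared and that the layer closest to balance is reported; both are assumptions of this illustration, and the function name and toy values are hypothetical. The paper's actual procedure is the one in Appendix D.

```python
def intersection_layers(f_by_layer):
    """Return layer indices where f crosses s = 1 - f (that is, f crosses 0.5)."""
    gap = [2.0 * f - 1.0 for f in f_by_layer]  # f - s for every layer
    hits = []
    for layer, g in enumerate(gap):
        if g == 0.0:
            candidate = layer                   # exact balance at this layer
        elif layer > 0 and gap[layer - 1] * g < 0.0:
            # Sign change between layer - 1 and layer: keep the one closer to balance.
            candidate = layer - 1 if abs(gap[layer - 1]) < abs(g) else layer
        else:
            continue
        if not hits or hits[-1] != candidate:   # avoid reporting the same layer twice
            hits.append(candidate)
    return hits


# Toy multilinguality curve (made up) with two crossings of the 0.5 level.
toy_f = [0.9, 0.7, 0.45, 0.3, 0.55, 0.8]
selected = intersection_layers(toy_f)
```

Only the per-layer scores are needed, so the selection can be run without any generation or downstream metric.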
## 4 Experiments
### 4.1 Models and Data
We evaluate on *LLaMA-3.1-8B* (Grattafiori et al., [2024](https://arxiv.org/html/2605.23036#bib.bib52)) and *Gemma-2-9B* (Team et al., [2024](https://arxiv.org/html/2605.23036#bib.bib75)) using 21 FLORES–200 languages (Costa-Jussà et al., [2022](https://arxiv.org/html/2605.23036#bib.bib76); see Appendix [B](https://arxiv.org/html/2605.23036#A2)). For each model, we train parallel English-only and multilingual JumpReLU SAE suites (Rajamanoharan et al., [2024](https://arxiv.org/html/2605.23036#bib.bib20)) on 2.1B Wikipedia tokens (Wikimedia Foundation, [2023](https://arxiv.org/html/2605.23036#bib.bib36)) with identical architectures and optimization settings, isolating the effect of multilingual training data. Full details are provided in Appendix [A](https://arxiv.org/html/2605.23036#A1).
### 4.2 Evaluation
#### FLORES–200 Machine Translation\.
We evaluate language steering on machine translation using FLORES–200 (Costa-Jussà et al., [2022](https://arxiv.org/html/2605.23036#bib.bib76)). We use the `dev` split to construct steering vectors and the `devtest` split for evaluation. Each `devtest` set contains approximately 1,000 sentences per language, providing a large and clean evaluation set while ensuring strict separation between steering construction and evaluation.
For each non-English target language $i$ in our language set ($|i|=20$, i.e., 20 target languages), we define an English $\rightarrow$ `tgt_i` translation task, where English (`eng_Latn`) is always the source language. We construct a per-language steering vector using `dev` sentences, and apply this vector to steer generation into the intended output language, which we denote as `steer_i`. (Footnote 3: We use `tgt_i` for the prompt language and `steer_i` for the intended output language after steering. In our setup, `steer_i` is the language for which we construct the steering vector. We use the term “steer language” to emphasize that the output language is controlled via a steering intervention.)
Prompts are written *in the target language* using natural translation instructions (e.g., German: “Übersetze diesen Satz:”), followed by a target-language answer cue (e.g., “Übersetzung:”). We provide prompt examples in Appendix [F](https://arxiv.org/html/2605.23036#A6).
$$\underbrace{\texttt{Translate this sentence:}}_{\text{instruction in target language}}\;\underbrace{\text{``}\texttt{<source text>}\text{''}}_{\text{always English}}.\quad\underbrace{\texttt{Translation:}}_{\text{answer cue in target language}}$$
This setup biases the model toward both the translation task and the prompt language, so that any deviation toward a steering language can be attributed to the steering intervention rather than prompt ambiguity. We decode using greedy search with temperature 0, yielding a conservative and interpretable baseline that isolates the effect of steering from prompt engineering or decoding strategies. For three metrics, we report the differences (deltas) of each SAE variant relative to the open-source SAE baselines, which directly measure the effectiveness of multilingual training and layer selection. (1) LangID, computed by applying a fastText language identification classifier (Joulin et al., [2016](https://arxiv.org/html/2605.23036#bib.bib31)) to the generated outputs, measures how reliably steering enforces the intended output language. (2) SpBLEU (Post, [2018](https://arxiv.org/html/2605.23036#bib.bib13)), computed against the reference translation in the intended *steer language*, provides a script-agnostic measure of surface-level translation quality. (3) COMET (Rei et al., [2020](https://arxiv.org/html/2605.23036#bib.bib9)), a neural evaluation metric that leverages cross-lingual pretrained encoders and both the source and reference sentences, estimates semantic translation quality and correlates strongly with human judgments.
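The prompt template described above can be assembled with a small helper. This is a sketch for illustration only: the function name is hypothetical, the exact whitespace and punctuation are assumptions read off the template, and the English source sentence is a made-up example. The German instruction and answer cue are the ones quoted in the text; the prompts actually used are listed in Appendix F.

```python
def build_translation_prompt(instruction: str, source_text: str, answer_cue: str) -> str:
    """Instruction and cue are in the prompt language; the source text is always English."""
    return f'{instruction} "{source_text}". {answer_cue}'


# German prompt language with a made-up English source sentence.
prompt = build_translation_prompt(
    instruction="Übersetze diesen Satz:",
    source_text="The weather is nice today",
    answer_cue="Übersetzung:",
)
```

The model's continuation after the answer cue is what gets scored, so without steering the output is expected in the prompt language, and with steering it is compared against the reference in the steer language.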
We report results averaged across all 20 non-English prompt languages, where for each `tgt_i` we evaluate steering into every other target language `steer_j` with $j\neq i$. Concretely, the model is prompted in language `tgt_i` and steered toward `steer_j`, and results are averaged over the full cross-product of $(i,j)$ pairs. This aggregation directly measures how reliably different SAE variants enable control over the output language, independent of any single prompt–language pairing. As our primary focus is the *relative* (delta) performance differences between SAE variants rather than absolute per-language scores, we present these averages in the main text, while detailed per-language and per-pair results are provided in Appendix [O](https://arxiv.org/html/2605.23036#A15).
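The aggregation over prompt and steer languages can be sketched as below. The language identifiers are placeholders and the scores are made up; the sketch only illustrates that 20 non-English languages yield 20 × 19 = 380 ordered pairs with `tgt_i` different from `steer_j`, over which a metric is averaged.

```python
from itertools import permutations
from statistics import mean

# Placeholder identifiers for the 20 non-English languages (not the real codes).
languages = [f"lang_{k:02d}" for k in range(20)]

# Ordered (prompt language, steer language) pairs with i != j.
pairs = list(permutations(languages, 2))


def average_over_pairs(scores: dict) -> float:
    """Average a metric over every (tgt_i, steer_j) pair with i != j."""
    return mean(scores[pair] for pair in pairs)


# Made-up scores: 1.0 when steering into lang_00, 0.0 otherwise.
toy_scores = {(tgt, steer): float(steer == "lang_00") for tgt, steer in pairs}
toy_average = average_over_pairs(toy_scores)
```

In the toy example, 19 of the 380 pairs steer into `lang_00`, so the average equals 19/380 = 0.05.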
In addition, we report a restricted setting where steering is applied only when `steer_j` = `tgt_i`, allowing us to analyze the behavior of steering when the prompt language and intended output language coincide (Appendix [N](https://arxiv.org/html/2605.23036#A14)).






Figure 3: Performance deltas relative to Scope baselines for Gemma-2-9B at the best-performing steering layer. Top: FLORES machine translation (LangID, SpBLEU, COMET). Bottom: Cross-lingual summarization (LangID, ROUGE-L, LaSE).
#### Cross\-Lingual Summarization \(CrossSumm\)\.
To evaluate whether our findings generalize beyond translation, we use the cross-lingual summarization dataset CrossSum (Park et al., [2025](https://arxiv.org/html/2605.23036#bib.bib6)). We select document–summary pairs whose target languages intersect with our translation language set. The resulting dataset consists of 108 fully parallel English source documents paired with reference summaries in one of five target languages: Spanish (es), Russian (ru), Arabic (ar), Hindi (hi), and Turkish (tr).
We follow the same experimental design as in machine translation, reusing the same per-language steering vectors. The only change is the prompt, which is written in the target language as a natural summarization instruction (e.g., “Summarize the following article”) followed by a target-language answer cue (e.g., “summary:”), thereby biasing the model toward both the summarization task and the target language. We provide prompt examples in Appendix [E](https://arxiv.org/html/2605.23036#A5).
We evaluate generated summaries using three metrics: LangID, ROUGE-L (Lin, [2004](https://arxiv.org/html/2605.23036#bib.bib5)), and LaSE (Park et al., [2025](https://arxiv.org/html/2605.23036#bib.bib6)), following the evaluation protocol of the original dataset. ROUGE-L measures content overlap with the reference summary, while LaSE evaluates cross-lingual semantic similarity between the generated and reference summaries. This setup allows us to test whether steering preserves semantic content while controlling the output language in a non-translation generative task.
## 5 Results
We assess steering performance and layer sensitivity, focusing on: \(i\) benefits from multilingual SAE training, \(ii\) optimal intervention layers, and \(iii\) comparisons with open‑source SAEs\.
### 5.1 Benefits of Multilingual Training for SAEs
We study the effect of training data on SAE steering, comparing monolingual \(English‑only\) SAEs with multilingual SAEs across transformer layers\. We evaluate multilingual steering across two generation tasks: \(i\) machine translation and \(ii\) cross\-lingual summarization, focusing on language identification accuracy, surface quality, and semantic preservation\.
#### Multilingual training improves steering\.
Figure [3](https://arxiv.org/html/2605.23036#S4.F3) in the main text and Figure [16](https://arxiv.org/html/2605.23036#A8.F16) in the appendix summarize performance deltas relative to the open-source Scope baselines at the best-performing steering layers (i.e., the layers with the overall highest performance), across both machine translation (FLORES) and cross-lingual summarization (CrossSumm). For Gemma-2-9B (Figure [3](https://arxiv.org/html/2605.23036#S4.F3)), multilingual SAEs outperform English-only SAEs across all reported metrics, yielding substantial gains in generation quality for both tasks. In FLORES, multilingual training improves LangID and COMET while maintaining strong SpBLEU; in CrossSumm, it yields higher ROUGE-L and LaSE, indicating better content preservation and semantic alignment. For LLaMA-3.1-8B (Figure [16](https://arxiv.org/html/2605.23036#A8.F16)), the improvements are smaller in magnitude but remain directionally consistent across tasks for SpBLEU, COMET, and LaSE, while LangID remains competitive. Overall, these results demonstrate that multilingual SAEs induce more effective and semantically aligned steering directions, with consistent benefits across models and task families.




Figure 4: Layer-selection curves showing the balance between multilingual alignment and language separability across layers. Blue curves denote *multilinguality* (shared cross-lingual alignment), and orange curves denote *separability* (language-specific structure). Top: LLaMA-3.1-8B. Bottom: Gemma-2-9B. Left: Open-source SAEs (LLaMA-Scope, Gemma-Scope), where LLaMA-Scope shows no clear intersection and Gemma-Scope selects L14 and L23. Right: Residual representations, which exhibit clear balance points at L15 (LLaMA) and L14/L23 (Gemma).
Figure 5: Per-language, per-layer COMET score deltas for Gemma-2-9B on FLORES under cross-lingual steering (`tgt_i` ≠ `steer_j`).
### 5.2 Optimal Layers
For each model, we identify steering layers as *intersection points* where multilingual alignment and language separability are jointly balanced (Figure [4](https://arxiv.org/html/2605.23036#S5.F4)). Importantly, the multilinguality–separability curves are computed independently of any downstream generation metrics, making the predicted intersection layers a falsifiable, pre-intervention hypothesis.
For Gemma-2-9B, these curves exhibit a characteristic *two-hump* shape, yielding intersection regions near L14 and L23. Figure [3](https://arxiv.org/html/2605.23036#S4.F3) shows that these same layers achieve the strongest overall trade-offs between language identification accuracy and generation quality for both multilingual and English-only SAEs. Figure [5](https://arxiv.org/html/2605.23036#S5.F5) further confirms this pattern at the per-language level, where L14 and L23 consistently emerge as the best-performing steering depths.
In LLaMA-3.1-8B, a pronounced increase in multilinguality near L13 is followed by a rise in separability, yielding an intersection region spanning L13–L15. Figure [16](https://arxiv.org/html/2605.23036#A8.F16) (Appendix) reports the best-performing layers across SpBLEU, COMET, LaSE, and LangID, and shows that steering within this intersection region achieves the strongest overall trade-offs. In particular, L13 and L15 consistently emerge as the empirically optimal steering depths across most metrics.
To further validate our layer-selection criterion, we analyze layerwise performance trends under different steering regimes. Figure [6](https://arxiv.org/html/2605.23036#S5.F6) reports layerwise ΔCOMET and ΔLaSE averaged across SAE variants for two settings. (i) When the steering language matches the prompted target language, ΔCOMET and ΔLaSE increase monotonically with depth. This behavior is consistent with later layers exhibiting stronger language separability in LLaMA-3.1-8B and favoring same-language amplification. (ii) When steering toward a language different from the prompted target, performance follows a non-monotonic trend, peaking near the layers identified by our multilinguality–separability intersection.
This divergence highlights the role of representational balance: deeper layers benefit same\-language reinforcement, whereas effective cross\-language steering requires intervening at depths where shared cross\-lingual structure is still preserved alongside language\-specific distinctions\. These results provide additional empirical support that the layers selected by our criterion correspond to optimal steering depths\. Importantly, we show that layers selected by our criterion consistently outperform earlier and later layers when controlling for SAE architecture and training data\. This indicates that effective steering depth is a structural property of the base model rather than an artifact of a particular SAE\. We observe a different pattern in Gemma\-2\-9B: same\-language steering favors earlier layers, where language separability is high, while cross\-language steering again peaks near the layers identified by our intersection criterion\.




Figure 6: Layerwise $\Delta$COMET and $\Delta$LaSE trends for LLaMA\-3\.1\-8B averaged across SAEs under two steering regimes\. Top: steer\_lang $\neq$ target\_lang\. Bottom: steer\_lang $=$ target\_lang\.
### 5\.3 Additional Experiments: Steering with Open\-Source SAEs
We compare our multilingual SAEs against the open\-source *LLaMA\-Scope* and *Gemma\-Scope* suites\. For *Gemma\-Scope*, our criterion identifies two intersection layers at L14 and L23\. Steering performance at these depths remains consistently below that of our multilingual SAEs \(Figure [5](https://arxiv.org/html/2605.23036#S5.F5)\)\. At other layers, Gemma\-Scope can exhibit competitive or occasionally stronger results, highlighting the sensitivity of multilingual steering to intervention depth and suggesting that multilingual SAEs, even when trained on comparatively less multilingual data, can surpass Gemma\-Scope when applied at mechanistically appropriate layers\.
In contrast, for *LLaMA\-Scope*, we observe consistently negligible downstream gains across layers\. Applying our multilinguality\-separability analysis in the sparse space reveals that LLaMA\-Scope does not exhibit a meaningful intersection layer: language separability remains weak across depth \(Figures [4](https://arxiv.org/html/2605.23036#S5.F4) and [20](https://arxiv.org/html/2605.23036#A9.F20)\)\. The absence of a balance point between shared cross\-lingual alignment and language\-specific structure aligns with its poor steering performance\.
Figure [7](https://arxiv.org/html/2605.23036#S5.F7) further reproduces the early–late dynamics of multilingual representations previously reported for LLaMA\-3\.1\-8B \(Gurgurov et al\., [2025](https://arxiv.org/html/2605.23036#bib.bib37); Tan et al\., [2024](https://arxiv.org/html/2605.23036#bib.bib70)\): shared cross\-lingual structure is strongest in early\-to\-mid layers, while language separability increases toward later depths\. Notably, LLaMA\-Scope exhibits substantially lower separability than even the dense residual stream across all layers, which likely reflects the combined effects of English\-skewed training data and architectural choices in the SAE design, and helps explain its failure to support effective language control despite operating at similar depths\. More generally, when the separability score approaches zero, we consistently observe steering failure, indicating that separability provides a simple and predictive signal of multilingual steering capability at a specific layer \(Figure [20](https://arxiv.org/html/2605.23036#A9.F20) in the appendix\)\.
Figure 7: Separability across layers for LLaMA\-3\.1\-8B, comparing different SAEs\.
## 6 Take\-Aways and Conclusion
Our results show that reliable SAE\-based multilingual steering emerges from the combination of multilingual training and principled layer selection\. Crucially, multilingual SAE steering is not only promising: its success can be predicted from representation\-level structure\.
#### \(I\) Multilingual SAE training strengthens language representations\.
Across both models, multilingual SAEs consistently outperform English\-only SAEs on language identification accuracy and generation quality\. These gains indicate that multilingual training does more than expand language coverage: it induces richer shared cross\-lingual structure while preserving cleaner language\-specific signals in the sparse feature space, yielding more reliable steering directions\.
#### \(II\) Intersection points predict optimal steering depths\.
Balancing multilingual alignment and language separability identifies layers where language control and generation quality are jointly maximized\. The intersection of these signals provides an*a priori*rule for layer selection that replaces heuristic mid–late choices and avoids exhaustive layer sweeps and repeated SAE training\. Across both base models, the layers identified by this criterion consistently coincide with those yielding the strongest LangID–quality trade\-offs, and outperform earlier and later layers even when controlling for SAE architecture and training data\.
#### \(III\) Open\-source SAEs highlight the limits of heuristic depth choices\.
Open\-source SAEs provide useful baselines but illustrate the importance of principled layer selection and multilingual training\. *LLaMA\-Scope* does not exhibit a clear intersection between multilinguality and separability \(Figure [4](https://arxiv.org/html/2605.23036#S5.F4)\) and yields negligible steering gains across layers\. Its sparse representations show weak language separability, often worse than the dense residual stream, suggesting that English\-skewed training data and architectural choices collapse multilingual features \(Figure [7](https://arxiv.org/html/2605.23036#S5.F7)\)\.
## Limitations
We evaluated two base models \(LLaMA\-3\.1\-8B and Gemma\-2\-9B\); larger, instruction\-tuned, or encoder–decoder architectures may exhibit different cross\-lingual dynamics\. Our evaluation focuses on automated metrics \(LangID, SpBLEU, COMET, ROUGE\-L, LaSE\), which do not capture stylistic fidelity, code\-switching behavior, or robustness to ambiguous prompts\. Additionally, our findings are based on JumpReLU SAEs trained on the residual stream; extending this analysis to other sparse architectures, intervention sites \(e\.g\., attention or MLP activations\), or alternative steering constructions remains an open direction\. We do not claim that the intersection criterion is unique; alternative representational statistics may identify similar balance points\. Future work should complement these automated evaluations with manual translation\-error analysis and stronger comparisons to existing steering methods and state\-of\-the\-art multilingual systems, in order to better characterize failure modes and clarify the remaining performance gap for SAE\-based language control\. Similarly, the 0\.5 intersection threshold should be understood as an operational definition of equal multilingual alignment and language separability rather than as a uniquely optimal cutoff; future work should study adaptive or model\-specific thresholds\.
## Acknowledgments
This research was supported by the German Federal Ministry for Economic Affairs and Energy \(BMWE\) as part of the project “Souveräne KI für Europa \(SOOFI\)” \(13IPC040H\), and by the German Federal Ministry of Research, Technology and Space \(BMFTR\) as part of the project TRAILS \(01IW24005\)\.
## References
- Steering large language model activations in sparse spaces\.InProceedings of the Conference on Language Modelling \(COLM\),External Links:[Link](https://openreview.net/forum?id=VGw1viYliK)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Chalnev, M\. Siu, and A\. Conmy \(2024\)Improving steering vectors by targeting sparse autoencoder features\.arXiv preprint arXiv:2411\.02193\.Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- T\. A\. Chang, Z\. Tu, and B\. K\. Bergen \(2022\)The geometry of multilingual language model representations\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://aclanthology.org/2022.emnlp-main.9/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.9)Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Chou, G\. Liu, J\. Sun, C\. Blondin, K\. Zhu, V\. Sharma, and S\. O’Brien \(2025\)Causal language control in multilingual transformers via sparse feature steering\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop,Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Conneau, S\. Wu, H\. Li, L\. Zettlemoyer, and V\. Stoyanov \(2020\)Emerging cross\-lingual structure in pretrained language models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 6022–6034\.External Links:[Link](https://aclanthology.org/2020.acl-main.536/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.536)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p2.1)\.
- M\. R\. Costa\-Jussà, J\. Cross, O\. Çelebi, M\. Elbayad, K\. Heafield, K\. Heffernan, E\. Kalbassi, J\. Lam, D\. Licht, J\. Maillard,et al\.\(2022\)No language left behind: scaling human\-centered machine translation\.arXiv preprint arXiv:2207\.04672\.Cited by:[§4\.1](https://arxiv.org/html/2605.23036#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px1.p1.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\)Sparse autoencoders find highly interpretable features in language models\.arXiv preprint arXiv:2309\.08600\.Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1)\.
- B\. Deng, Y\. Wan, Y\. Zhang, B\. Yang, and F\. Feng \(2025\)Unveiling language\-specific features in large language models via sparse autoencoders\.External Links:2505\.05111,[Link](https://arxiv.org/abs/2505.05111)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1),[§1](https://arxiv.org/html/2605.23036#S1.p2.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.23036#S3.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.23036#S4.SS1.p1.1)\.
- D\. Gurgurov, Y\. A\. Ghussin, T\. Baeumel, C\. Chou, P\. Schramowski, M\. Mosbach, J\. van Genabith, and S\. Ostermann \(2026\)CLaS\-bench: a cross\-lingual alignment and steering benchmark\.External Links:2601\.08331,[Link](https://arxiv.org/abs/2601.08331)Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Gurgurov, K\. Trinley, Y\. Al Ghussin, T\. Baeumel, J\. van Genabith, and S\. Ostermann \(2025\)Language arithmetics: towards systematic language neuron identification and manipulation\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),External Links:[Link](https://aclanthology.org/2025.ijcnlp-long.156/)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p2.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2605.23036#S3.SS3.p1.1),[§5\.3](https://arxiv.org/html/2605.23036#S5.SS3.p3.1)\.
- Z\. He, W\. Shu, X\. Ge, L\. Chen, J\. Wang, Y\. Zhou, F\. Liu, Q\. Guo, X\. Huang, Z\. Wu,et al\.\(2024\)Llama scope: extracting millions of features from llama\-3\.1\-8b with sparse autoencoders\.arXiv preprint arXiv:2410\.20526\.Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p3.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Joulin, E\. Grave, P\. Bojanowski, M\. Douze, H\. Jégou, and T\. Mikolov \(2016\)FastText\.zip: compressing text classification models\.arXiv preprint arXiv:1612\.03651\.Cited by:[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px1.p3.1)\.
- A\. Karvonen, C\. Rager, J\. Lin, C\. Tigges, J\. I\. Bloom, D\. Chanin, Y\. Lau, E\. Farrell, C\. S\. Mcdougall, K\. Ayonrinde, D\. Till, M\. Wearden, A\. Conmy, S\. Marks, and N\. Nanda \(2025\)SAEBench: a comprehensive benchmark for sparse autoencoders in language model interpretability\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 29223–29264\.External Links:[Link](https://proceedings.mlr.press/v267/karvonen25a.html)Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Lieberum, S\. Rajamanoharan, A\. Conmy, L\. Smith, N\. Sonnerat, V\. Varma, J\. Kramar, A\. Dragan, R\. Shah, and N\. Nanda \(2024\)Gemma scope: open sparse autoencoders everywhere all at once on gemma 2\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,Miami, Florida, US,pp\. 278–300\.External Links:[Link](https://aclanthology.org/2024.blackboxnlp-1.19/),[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.19)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p3.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px2.p3.1)\.
- K\. O’Brien, D\. Majercak, X\. Fernandes, R\. Edgar, B\. Bullwinkel, J\. Chen, H\. Nori, D\. Carignan, E\. Horvitz, and F\. Poursabzi\-Sangdeh \(2024\)Steering language model refusal with sparse autoencoders\.arXiv preprint arXiv:2411\.11296\.Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Park, J\. Park, and H\. Lee \(2025\)Cross\-lingual summarization for low\-resource languages using multilingual retrieval\-based in\-context learning\.Applied Sciences15\(14\)\.External Links:[Link](https://www.mdpi.com/2076-3417/15/14/7800),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app15147800)Cited by:[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px2.p3.1)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2024\)The linear representation hypothesis and the geometry of large language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 39643–39666\.External Links:[Link](https://proceedings.mlr.press/v235/park24c.html)Cited by:[§3](https://arxiv.org/html/2605.23036#S3.p2.1)\.
- M\. Post \(2018\)A call for clarity in reporting BLEU scores\.InProceedings of the Third Conference on Machine Translation: Research Papers,O\. Bojar, R\. Chatterjee, C\. Federmann, M\. Fishel, Y\. Graham, B\. Haddow, M\. Huck, A\. J\. Yepes, P\. Koehn, C\. Monz, M\. Negri, A\. Névéol, M\. Neves, M\. Post, L\. Specia, M\. Turchi, and K\. Verspoor \(Eds\.\),Brussels, Belgium,pp\. 186–191\.External Links:[Link](https://aclanthology.org/W18-6319/),[Document](https://dx.doi.org/10.18653/v1/W18-6319)Cited by:[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px1.p3.1)\.
- S\. Rajamanoharan, T\. Lieberum, N\. Sonnerat, A\. Conmy, V\. Varma, J\. Kramár, and N\. Nanda \(2024\)Jumping ahead: improving reconstruction fidelity with jumprelu sparse autoencoders\.arXiv preprint arXiv:2407\.14435\.Cited by:[§A\.2](https://arxiv.org/html/2605.23036#A1.SS2.p1.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.23036#S4.SS1.p1.1)\.
- R\. Rei, C\. Stewart, A\. C\. Farinha, and A\. Lavie \(2020\)COMET: a neural framework for mt evaluation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,pp\. 2685–2702\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.213)Cited by:[§4\.2](https://arxiv.org/html/2605.23036#S4.SS2.SSS0.Px1.p3.1)\.
- F\. Riemenschneider and A\. Frank \(2025\)Cross\-lingual generalization and compression: from language\-specific to shared neurons\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 13470–13491\.External Links:[Link](https://aclanthology.org/2025.acl-long.661/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.661)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p2.1)\.
- S\. Soo, C\. Guang, W\. Teng, C\. Balaganesh, T\. Guoxian, and Y\. Ming \(2025\)Interpretable steering of large language models with feature guided activation additions\.arXiv preprint arXiv:2501\.09929\.Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Tan, D\. Wu, and C\. Monz \(2024\)Neuron specialization: leveraging intrinsic task modularity for multilingual machine translation\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6506–6527\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.374/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.374)Cited by:[§5\.3](https://arxiv.org/html/2605.23036#S5.SS3.p3.1)\.
- T\. Tang, W\. Luo, H\. Huang, D\. Zhang, X\. Wang, X\. Zhao, F\. Wei, and J\. Wen \(2024\)Language\-specific neurons: the key to multilingual capabilities in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 5701–5715\.External Links:[Link](https://aclanthology.org/2024.acl-long.309/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.309)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1),[§1](https://arxiv.org/html/2605.23036#S1.p2.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2605.23036#S3.SS3.p1.1)\.
- G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé,et al\.\(2024\)Gemma 2: improving open language models at a practical size\.arXiv preprint arXiv:2408\.00118\.Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.23036#S4.SS1.p1.1)\.
- A\. Templeton \(2024\)Scaling monosemanticity: extracting interpretable features from claude 3 sonnet\.Note:AnthropicExternal Links:[Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p1.1),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Wang, X\. Wu, D\. Shu, Y\. Ma, and N\. Liu \(2025\)Enhancing llm steering through sparse autoencoder\-based vector refinement\.arXiv preprint arXiv:2509\.23799\.Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- Wikimedia Foundation \(2023\)Wikipedia dump, november 1, 2023\.Note:[https://dumps\.wikimedia\.org/](https://dumps.wikimedia.org/)Accessed: 2025\-10\-06Cited by:[§A\.1](https://arxiv.org/html/2605.23036#A1.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.23036#S4.SS1.p1.1)\.
- Z\. Wu, A\. Arora, A\. Geiger, Z\. Wang, J\. Huang, D\. Jurafsky, C\. D\. Manning, and C\. Potts \(2025\)Axbench: steering llms? even simple baselines outperform sparse autoencoders\.arXiv preprint arXiv:2501\.17148\.Cited by:[§C\.2](https://arxiv.org/html/2605.23036#A3.SS2.p1.3),[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2605.23036#S3.SS1.p1.3)\.
- R\. Zhang, Q\. Yu, M\. Zang, C\. Eickhoff, and E\. Pavlick \(2025\)The same but different: structural similarities and differences in multilingual language modeling\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=NCrFA7dq8T)Cited by:[§1](https://arxiv.org/html/2605.23036#S1.p2.1)\.
- H\. Zhao, X\. Wu, F\. Yang, B\. Shen, N\. Liu, and M\. Du \(2026\)Denoising concept vectors with sparse autoencoders for improved language model steering\.InFindings of the Association for Computational Linguistics: EACL 2026,Rabat, Morocco,pp\. 797–808\.External Links:[Link](https://aclanthology.org/2026.findings-eacl.40/),[Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.40)Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhao, A\. Devoto, G\. Hong, X\. Du, A\. P\. Gema, H\. Wang, X\. He, K\. Wong, and P\. Minervini \(2024\)Steering knowledge selection behaviours in llms via sae\-based representation engineering\.arXiv preprint arXiv:2410\.15999\.Cited by:[§2](https://arxiv.org/html/2605.23036#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix
## Appendix A SAE Training
### A\.1 SAE Training Data
Following the mechanistic motivation outlined in Section [3\.2](https://arxiv.org/html/2605.23036#S3.SS2), we train parallel English\-only and multilingual SAE suites under a controlled setup to isolate the effect of training data on language steering\. We train two SAE suites per base model using Wikipedia data \(Wikimedia Foundation, [2023](https://arxiv.org/html/2605.23036#bib.bib36)\)\. For the multilingual suite \(*MULTI21\-SAEs*\), we construct a balanced corpus covering the same 21 languages, and select a total of 2\.1B tokens with a uniform distribution across languages\. For the English\-only suite \(*EN\-SAEs*\), we select the same number of tokens \(2\.1B\), drawn from English Wikipedia\. This controlled setup ensures that both suites are trained on identical data volume with identical optimization parameters, isolating the effect of multilingual versus monolingual training data from corpus size or training duration\.
### A\.2 SAE Training Procedure
For each layer of each base model, we train JumpReLU SAEs \(Rajamanoharan et al\., [2024](https://arxiv.org/html/2605.23036#bib.bib20)\) on the residual stream, matching the expansion factors used by the corresponding open\-source SAE suites \($8\times$ for LLaMA\-Scope and 16k for Gemma\-Scope\)\. We use identical architectures and optimization hyperparameters for our *EN\-SAEs* and *MULTI21\-SAEs*\. To ensure a controlled comparison, we fix the number of optimization steps and therefore the total number of parameter updates across both suites\. This setup cleanly isolates the impact of multilingual training from architectural and optimization confounds\.
### A\.3 Hyperparameters
We train SAEs with SAELens \([https://github\.com/jbloomAus/SAELens](https://github.com/jbloomAus/SAELens)\) using a JumpReLU architecture on the residual stream at multiple layers\. The base model is loaded in `float16`, while SAE training uses `float32`\. Hook sites follow `blocks.{layer}.hook_resid_post`\.
#### Key hyperparameters \(from code\)\.
- Architecture: `jumprelu` \(expansion factor $= 8$\), $L_1$ coefficient $= 5.0$, JumpReLU bandwidth $= 10^{-3}$, init threshold $= 10^{-3}$, decoder init = zeros, transpose encoder init, decoder heuristic init enabled, sparsity penalty scaled by decoder norm\.
- Training length \(\#steps\): $30{,}000$\.
- Batch size \(tokens/step\): $4{,}096$\.
- Context size: $512$\.
- Optimizer / schedule: Adam \($\beta_1 = 0.9,\ \beta_2 = 0.999$\), LR $= 5 \times 10^{-5}$, LR warmup $= 1{,}500$ steps, LR decay steps $= 3{,}000$, $L_1$ warmup $= 1{,}500$ steps\.
- Dead/feature refresh: feature sampling window $= 2000$, dead\-feature window $= 1000$, dead threshold $= 10^{-4}$\.
- Data loader: streaming enabled, `prepend_bos=True`\.
### A\.4 Computational Budget
For all experiments we used 1× H100 80GB\. Each SAE model was trained for 30,000 optimization steps with a batch size of 4,096 tokens per step, corresponding to approximately 123M training tokens and about 3 GPU hours per run\.
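The reported token count follows directly from the step count and batch size\. A quick check in Python, using only the two values stated above \(the variable names are illustrative\):

```python
# Values reported in the hyperparameter list and the computational budget.
steps = 30_000            # optimization steps per SAE
tokens_per_step = 4_096   # batch size in tokens

# Total tokens seen by one SAE during training.
total_tokens = steps * tokens_per_step
print(f"{total_tokens:,} tokens (~{total_tokens / 1e6:.0f}M)")
```

This gives 122,880,000 tokens, which rounds to the 123M quoted above\.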
### A\.5 License and Availability
All trained SAE checkpoints produced in this work, including both English\-only and multilingual variants for *LLaMA\-3\.1\-8B* and *Gemma\-2\-9B*, are released under the Apache License 2\.0\. This license permits use, modification, and distribution of the models for both research and commercial purposes, provided that proper attribution is maintained\.
We emphasize that the underlying base models \(*LLaMA\-3\.1\-8B*, *Gemma\-2\-9B*\) remain subject to their original licenses as released by Meta and Google, respectively\. Users of our checkpoints must therefore comply with both the Apache 2\.0 license governing our SAEs and the terms of the corresponding base model licenses\.
## Appendix B Language Labels

| Language | Code |
| --- | --- |
| English | `eng_Latn` |
| Tibetan | `bod_Tibt` |
| Maltese | `mlt_Latn` |
| Italian | `ita_Latn` |
| Spanish | `spa_Latn` |
| German | `deu_Latn` |
| Japanese | `jpn_Jpan` |
| Arabic | `arb_Arab` |
| Chinese \(Simplified\) | `zho_Hans` |
| Afrikaans | `afr_Latn` |
| Dutch | `nld_Latn` |
| French | `fra_Latn` |
| Portuguese | `por_Latn` |
| Russian | `rus_Cyrl` |
| Korean | `kor_Hang` |
| Hindi | `hin_Deva` |
| Turkish | `tur_Latn` |
| Polish | `pol_Latn` |
| Swedish | `swe_Latn` |
| Danish | `dan_Latn` |
| Norwegian Bokmål | `nob_Latn` |

Table 1: List of 21 target languages from FLORES\-200 and their language codes\.
## Appendix C Formal Definitions of Language Vectors and Layer Selection
This appendix provides the full mathematical formulation of the language vectors, steering procedure, and layer\-selection metrics summarized in the main text\.
### C\.1 Representation Extraction
At each transformer layer $\ell$, we extract the dense hidden representation from the residual stream,

$$h_{\ell}(x) \in \mathbb{R}^{D},$$

for input $x$\. To obtain a sparse and interpretable representation, we apply an encoder–decoder sparse autoencoder \(SAE\) trained at layer $\ell$\. The encoder maps dense activations to a high\-dimensional sparse code,

$$z_{\ell}(x) = \mathrm{Encoder}_{\ell}(h_{\ell}(x)), \qquad z_{\ell}(x) \in \mathbb{R}^{K}, \; K \gg D,$$

and the decoder reconstructs the activation,

$$\hat{h}_{\ell}(x) = \mathrm{Decoder}_{\ell}(z_{\ell}(x)).$$

Sparsity is enforced via the SAE objective, yielding sparse codes that isolate a small number of active features for each input\.
### C\.2 DiffMean Steering Vectors
We construct language steering vectors using the DiffMean method \(Wu et al\., [2025](https://arxiv.org/html/2605.23036#bib.bib88)\)\. For a given target language at layer $\ell$, let $\mathcal{Z}^{+}$ denote the set of sparse codes corresponding to examples in the target language, and $\mathcal{Z}^{-}$ the set corresponding to all other languages\. We compute the mean sparse representations

$$\bar{z}^{+}_{\ell} = \frac{1}{|\mathcal{Z}^{+}|} \sum_{z \in \mathcal{Z}^{+}} z, \qquad \bar{z}^{-}_{\ell} = \frac{1}{|\mathcal{Z}^{-}|} \sum_{z \in \mathcal{Z}^{-}} z,$$

and define the steering vector as

$$w_{\mathrm{DiffMean}}(\ell) = \bar{z}^{+}_{\ell} - \bar{z}^{-}_{\ell}.$$

This vector amplifies features that are characteristic of the target language while suppressing features shared with other languages\. Prior work applies DiffMean directly in the dense residual stream; in contrast, we primarily apply it in the SAE sparse space, which yields more disentangled and controllable steering directions\.
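As a rough illustration of the DiffMean construction, the sketch below computes $w_{\mathrm{DiffMean}}$ from a small matrix of sparse codes with NumPy\. The function name `diffmean_vector`, the toy codes, and the language labels are assumptions made for this example and are not taken from the released code\.

```python
import numpy as np

def diffmean_vector(codes: np.ndarray, labels: np.ndarray, target: str) -> np.ndarray:
    """Mean code of the target language minus mean code of all other languages.

    codes:  (num_examples, K) sparse codes at one layer
    labels: (num_examples,) language label per example
    """
    is_target = labels == target
    if not is_target.any() or is_target.all():
        raise ValueError("need at least one target and one non-target example")
    return codes[is_target].mean(axis=0) - codes[~is_target].mean(axis=0)

# Toy data (hypothetical values): 4 examples, K = 3 sparse features.
codes = np.array([
    [2.0, 0.0, 1.0],  # deu
    [4.0, 0.0, 1.0],  # deu
    [0.0, 2.0, 1.0],  # fra
    [0.0, 4.0, 1.0],  # spa
])
labels = np.array(["deu", "deu", "fra", "spa"])

w_deu = diffmean_vector(codes, labels, "deu")
print(w_deu)
```

In this toy case the resulting vector is \(3, \-3, 0\): the third feature is equally active in every language and cancels in the difference, while the first two features receive opposite signs\.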
### C\.3 Inference\-Time Steering
Given a hidden activation $h_{\ell}(x)$ at inference time, we apply steering as follows:
1. Encode the activation into sparse space: $z_{\ell}(x) = \mathrm{Encoder}_{\ell}(h_{\ell}(x))$\.
2. Apply the steering vector: $z'_{\ell}(x) = z_{\ell}(x) + \alpha\, w_{\mathrm{DiffMean}}(\ell)$, where $\alpha$ controls steering strength\. We use fixed steering coefficients for all test examples within each model setting, with $\alpha = 5.0$ for LLaMA and $\alpha = 100.0$ for Gemma\. These values were chosen in preliminary experiments as conservative values that improved target\-language identification, and were fixed before final evaluation; they were not tuned per language, layer, or test example\.
3. Decode back to dense space: $\hat{h}'_{\ell}(x) = \mathrm{Decoder}_{\ell}(z'_{\ell}(x))$\.
4. Correct for reconstruction error by adding the residual: $\tilde{h}_{\ell}(x) = \hat{h}'_{\ell}(x) + \big(h_{\ell}(x) - \mathrm{Decoder}_{\ell}(z_{\ell}(x))\big)$\.

The corrected activation $\tilde{h}_{\ell}(x)$ is then passed to subsequent layers\. This procedure preserves the original activation outside the SAE subspace while applying a targeted intervention along the language direction\.
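The four steps can be sketched end to end with a toy SAE\. Everything here is an assumption for illustration: the random weights, the plain ReLU encoder \(the paper trains JumpReLU SAEs\), the bias\-free linear decoder, and the function names\. The point is only to show how the residual correction in step 4 behaves\.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 8  # toy dense and sparse widths (hypothetical)

# Toy stand-in for a trained SAE: random weights, ReLU instead of JumpReLU.
W_enc = rng.normal(size=(D, K))
W_dec = rng.normal(size=(K, D))

def encode(h: np.ndarray) -> np.ndarray:
    return np.maximum(h @ W_enc, 0.0)

def decode(z: np.ndarray) -> np.ndarray:
    return z @ W_dec  # bias-free linear decoder (assumption)

def steer(h: np.ndarray, w: np.ndarray, alpha: float) -> np.ndarray:
    z = encode(h)                      # step 1: encode into sparse space
    z_steered = z + alpha * w          # step 2: add the scaled steering vector
    h_hat_steered = decode(z_steered)  # step 3: decode back to dense space
    residual = h - decode(z)           # step 4: SAE reconstruction error
    return h_hat_steered + residual

h = rng.normal(size=D)  # stand-in for a residual-stream activation
w = rng.normal(size=K)  # stand-in for a DiffMean steering vector

# With alpha = 0 the activation is returned unchanged.
assert np.allclose(steer(h, w, 0.0), h)
# With a linear decoder the net edit is alpha * decode(w).
assert np.allclose(steer(h, w, 5.0) - h, 5.0 * decode(w))
```

Two properties follow from the formulas and are checked by the assertions: with $\alpha = 0$ the activation is returned unchanged even though the toy SAE reconstructs poorly, and with a linear decoder the net edit is $\alpha$ times the decoded steering vector\.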
## Appendix D Language Correlation and Intersection\-Based Layer Selection
### D\.1 Per\-Language Contrast Vectors
For each language $i$ and layer $\ell$, we construct a contrast vector using DiffMean\. Let $\mathcal{H}_{i}^{+}$ denote dense codes from language $i$, and $\mathcal{H}_{i}^{-}$ dense codes from all other languages\. The per\-language vector is

$$\mathbf{v}_{i} = \frac{1}{|\mathcal{H}_{i}^{+}|} \sum_{h \in \mathcal{H}_{i}^{+}} h - \frac{1}{|\mathcal{H}_{i}^{-}|} \sum_{h \in \mathcal{H}_{i}^{-}} h.$$

These vectors represent languages in a shared feature space by emphasizing language\-specific features and suppressing shared ones\.
### D\.2 Correlation Matrix Across Languages
Given the set of language vectors $\{\mathbf{v}_{i}\}_{i=1}^{N}$ at layer $\ell$, where $N$ is the number of languages, we compute a pairwise Pearson correlation matrix

$$C_{\ell} \in \mathbb{R}^{N \times N}, \qquad C_{ij} = \mathrm{corr}(\mathbf{v}_{i}, \mathbf{v}_{j}).$$

This matrix captures how similarly different languages are represented at a given depth\.
### D\.3 Multilinguality and Separability Metrics
Let $\{\lambda_{j}\}_{j=1}^{N}$ be the eigenvalues of $C_{\ell}$\. We define the *multilinguality* score as the explained\-variance ratio of the first principal component,

$$f_{\ell} = \frac{\max_{j} \lambda_{j}}{\sum_{k=1}^{N} \lambda_{k}},$$

which measures the degree of shared alignment across languages\. We define *separability* as the complementary quantity

$$s_{\ell} = 1 - f_{\ell},$$

which reflects how distinct the language representations remain\.
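A minimal NumPy sketch of these two scores, assuming the contrast vectors for one layer are stacked as rows of a matrix\. The function name and the toy inputs are illustrative\. `np.corrcoef` treats each row as one variable, which matches the pairwise Pearson correlation defined above, and `np.linalg.eigvalsh` is used because $C_{\ell}$ is symmetric\.

```python
import numpy as np

def multilinguality_separability(V: np.ndarray) -> tuple[float, float]:
    """Return (f, s) for one layer; V has shape (N, dim), one vector per language."""
    C = np.corrcoef(V)               # (N, N) pairwise Pearson correlations
    eigvals = np.linalg.eigvalsh(C)  # eigenvalues of symmetric C, ascending
    f = float(eigvals[-1] / eigvals.sum())
    return f, 1.0 - f

# Toy case 1: three perfectly correlated language vectors.
V_aligned = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 1.0, 2.0, 3.0],
])

# Toy case 2: three mutually uncorrelated language vectors.
V_uncorrelated = np.array([
    [1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
])

f_aligned, s_aligned = multilinguality_separability(V_aligned)
f_uncorr, s_uncorr = multilinguality_separability(V_uncorrelated)
print(f_aligned, s_aligned)
print(f_uncorr, s_uncorr)
```

Because $C_{\ell}$ has unit diagonal, its eigenvalues sum to $N$, so $f_{\ell}$ lies between $1/N$ \(mutually uncorrelated language vectors\) and $1$ \(reached, for example, when all vectors are perfectly correlated\)\. The two toy inputs hit these extremes\.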
### D\.4 Intersection\-Based Layer Selection
We select steering layers at depths where multilinguality and separability are balanced\. Since $s_{\ell} = 1 - f_{\ell}$, an intersection occurs when $f_{\ell} \approx 0.5$, or equivalently when $2f_{\ell} - 1$ changes sign between adjacent layers\. In practice, we detect these sign changes with a small tolerance and linearly interpolate between layer indices\. These intersection points serve as *a priori* candidates for effective steering depths and consistently correspond to layers that yield strong language control while preserving generation quality\.
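The detection step can be sketched in plain Python\. The tolerance value, the handling of layers that land exactly on $0.5$, and the function name are assumptions, since the text only states that sign changes are detected with a small tolerance and linearly interpolated\.

```python
def intersection_layers(f_scores: list[float], tol: float = 1e-6) -> list[float]:
    """Return (possibly fractional) layer indices where f crosses 0.5."""
    # g > 0: multilinguality dominates; g < 0: separability dominates.
    g = [2.0 * f - 1.0 for f in f_scores]
    crossings = []
    for layer in range(len(g) - 1):
        a, b = g[layer], g[layer + 1]
        if abs(a) <= tol:
            # f is (numerically) 0.5 at this layer.
            crossings.append(float(layer))
        elif abs(b) > tol and a * b < 0:
            # Sign change: solve a + t * (b - a) = 0 for t in (0, 1).
            crossings.append(layer + a / (a - b))
    if g and abs(g[-1]) <= tol:
        crossings.append(float(len(g) - 1))
    return crossings

# Hypothetical per-layer multilinguality scores for a 6-layer toy model.
f_by_layer = [0.3, 0.4, 0.6, 0.7, 0.5, 0.4]
crossings = intersection_layers(f_by_layer)
print(crossings)
```

For the toy scores this reports one crossing halfway between layers 1 and 2 and one exactly at layer 4\. A fractional value such as 1\.5 means the crossing lies between layers 1 and 2\.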
## Appendix E CrossSum Prompts

Figure 8: Example prompt and outputs for cross\-lingual summarization \(CrossSum\)\. The model is prompted in Hindi and steered to generate a Spanish summary\. English glosses are provided for clarity\. Full article text omitted for readability\.

Figure 9: Example prompt and outputs for cross\-lingual summarization \(CrossSum\)\. The model is prompted in Spanish and steered to generate an Arabic summary\. English glosses are provided for clarity\. Full article text omitted for readability\.

## Appendix F FLORES Prompts

Figure 10: Example prompt and outputs for machine translation\. The model is prompted in Chinese and steered to generate a Russian translation\. English glosses are provided for readability\.

Figure 11: Example prompt and outputs for machine translation\. The model is prompted in German and steered to generate an Arabic translation\. English glosses are provided for readability\.

## Appendix G Multilingual Results for Gemma\-2\-9B


Figure 12: Performance deltas relative to Scope baselines for Gemma\-2\-9B at the selected steering layer\. Top: FLORES machine translation \(LangID, SpBLEU, COMET\)\. Bottom: Cross\-lingual summarization \(LangID, ROUGE\-L, LaSE\)\. Improvements from multilingual training are smaller than in LLaMA but remain directionally consistent across tasks and metrics\.





Figure 13: Performance deltas relative to Scope baselines for Gemma\-2\-9B averaged across layers\. Top: FLORES machine translation \(LangID, SpBLEU, COMET\)\. Bottom: Cross\-lingual summarization \(LangID, ROUGE\-L, LaSE\)\.


Figure 14: Layerwise heatmaps of performance deltas relative to *Gemma-Scope* for Gemma-2-9B on cross-lingual summarization (CrossSumm). Columns show deltas in LangID, LaSE, and ROUGE-L as a function of steering layer. Regions of positive gain cluster around the intersection layers identified by our multilinguality–separability criterion (L14 and L23), indicating that these depths support more reliable language control and semantic preservation, though gains remain smaller than those achieved by our multilingual SAEs.


Figure 15: Layerwise heatmaps of performance deltas relative to *Gemma-Scope* for Gemma-2-9B on machine translation (FLORES). Columns report deltas in LangID, COMET, and SpBLEU across steering layers. Improved performance concentrates near the predicted intersection layers (L14 and L23), validating that these depths balance cross-lingual alignment and language separability, but still underperform compared to multilingual SAEs trained in our framework.

## Appendix H Multilingual Results for LLaMA-3.1-8B


Figure 16: Performance deltas relative to Scope baselines for LLaMA-3.1-8B at the best-performing steering layer. Top: FLORES machine translation (LangID, SpBLEU, COMET). Bottom: Cross-lingual summarization (LangID, ROUGE-L, LaSE). Multilingual SAEs consistently outperform English-only SAEs across both tasks, with larger gains on semantic quality metrics.





Figure 17: Performance deltas relative to Scope baselines for LLaMA-3.1-8B averaged across layers. Top: FLORES machine translation (LangID, SpBLEU, COMET). Bottom: Cross-lingual summarization (LangID, ROUGE-L, LaSE).


Figure 18: Layerwise heatmaps of performance deltas relative to *LLaMA-Scope* for LLaMA-3.1-8B on cross-lingual summarization (CrossSumm). Columns show deltas in LangID, LaSE, and ROUGE-L as a function of steering layer. Unlike Gemma-Scope, LLaMA-Scope exhibits weak and diffuse gains across layers, with no clear concentration around an intersection depth, consistent with the absence of a strong multilinguality–separability balance and its limited downstream steering effectiveness.


Figure 19: Layerwise heatmaps of performance deltas relative to *LLaMA-Scope* for LLaMA-3.1-8B on machine translation (FLORES). Columns report deltas in LangID, COMET, and SpBLEU across steering layers. Performance improvements remain small and scattered across depth, with no distinct layer emerging as consistently effective, mirroring the lack of a clear intersection between multilingual alignment and language separability in LLaMA-Scope.

## Appendix I Language vectors correlations and sparsity score
| Layer | Metric | Residual Stream | Llama-Scope | Our-en-SAEs | Our-multi-SAEs |
|---|---|---|---|---|---|
| Layer 1 | Sep. score | 0.01 | 0.00 | 0.00 | 0.07 |
| Layer 11 | Sep. score | 0.16 | 0.01 | 0.50 | 0.37 |
| Layer 23 | Sep. score | 0.67 | 0.13 | 0.28 | 0.62 |

Figure 20: Comparison of the LLaMA-3.1-8B representation space using residual stream vectors, Llama-Scope sparse-space vectors, and the sparse-space vectors of our trained SAEs.

## Appendix J Raw Results ($\texttt{tgt}_i \neq \texttt{steer}_j$) for Gemma-2-9B
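| Layer | MULTI21-SAEs LangID | MULTI21-SAEs ROUGE-L | MULTI21-SAEs LaSE | EN-SAEs LangID | EN-SAEs ROUGE-L | EN-SAEs LaSE | gemma-scope LangID | gemma-scope ROUGE-L | gemma-scope LaSE | Base Model LangID | Base Model ROUGE-L | Base Model LaSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.0 | 0.52 | 0.0 | 0.0 | 0.53 | 0.0 | 98.94 | 4.14 | 13.13 | - | - | - |
| 14 | 48.33 | 4.17 | 16.55 | 42.92 | 4.02 | 15.75 | 57.73 | 3.78 | 18.53 | - | - | - |
| 23 | 11.81 | 1.25 | 12.38 | 10.79 | 1.15 | 11.29 | 21.39 | 1.78 | 11.48 | - | - | - |
| 32 | 9.35 | 1.27 | 11.39 | 5.46 | 1.12 | 10.03 | 17.04 | 1.47 | 15.79 | - | - | - |
| 40 | 0.0 | 0.59 | 0.0 | 0.0 | 0.56 | 0.0 | 0.0 | 0.6 | 0.0 | - | - | - |
| Prompt (No steering) | - | - | - | - | - | - | - | - | - | 36.48 | 3.47 | 22.87 |
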
Table 2: Gemma-2-9B CrossSumm, cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Table entries report LangID / ROUGE-L / LaSE (column order). No steering prompt results (baseline; LangID / ROUGE-L / LaSE): 36.48, 3.47, 22.87. *Prompt* scores are computed against the prompt language ($\texttt{tgt}_i$), whereas *steering* scores are computed against the steering-vector language ($\texttt{steer}_j$) and averaged over all mismatched pairs $(i,j)$, averaging across target prompt languages for each steering language; therefore the prompt baseline and steering results are not directly comparable, but the baseline usefully characterizes the model's unsteered default behavior. Strong shading marks the best value *overall in the table* (per metric), while *light shading* marks the best value *within each SAE family* (per metric). Highlighted cells concentrate around the best layer, and the best overall results are often achieved by MULTI21-SAEs at that layer.

| Layer | MULTI21-SAEs LangID | MULTI21-SAEs SpBLEU | MULTI21-SAEs COMET | EN-SAEs LangID | EN-SAEs SpBLEU | EN-SAEs COMET | gemma-scope LangID | gemma-scope SpBLEU | gemma-scope COMET | Base Model LangID | Base Model SpBLEU | Base Model COMET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.02 | 1.13 | 4.07 | 0.02 | 1.13 | 4.07 | 74.39 | 8.03 | 49.61 | - | - | - |
| 14 | 54.38 | 24.80 | 73.55 | 52.19 | 24.90 | 73.17 | 45.04 | 15.65 | 61.79 | - | - | - |
| 23 | 24.33 | 19.73 | 58.23 | 21.73 | 18.90 | 55.26 | 25.26 | 12.49 | 44.24 | - | - | - |
| 32 | 17.12 | 15.87 | 47.36 | 13.19 | 15.84 | 46.67 | 30.43 | 16.12 | 52.00 | - | - | - |
| 40 | 0.04 | 2.70 | 9.28 | 0.03 | 2.46 | 8.25 | 0.05 | 4.74 | 13.34 | - | - | - |
| Prompt (No steering) | - | - | - | - | - | - | - | - | - | 75.51 | 31.31 | 85.12 |

Table 3: Gemma-2-9B FLORES, cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Table entries report LangID / SpBLEU / COMET (column order). No steering prompt results (baseline; LangID / SpBLEU / COMET): 75.51, 31.31, 85.12. *Prompt* scores are computed against the prompt language ($\texttt{tgt}_i$), whereas *steering* scores are computed against the steering-vector language ($\texttt{steer}_j$) and averaged over all mismatched pairs $(i,j)$, averaging across target prompt languages for each steering language; therefore the prompt baseline and steering results are not directly comparable, but the baseline usefully characterizes the model's unsteered default behavior. Strong shading marks the best value *overall in the table* (per metric), while *light shading* marks the best value *within each SAE family* (per metric). Highlighted cells concentrate around the best layer, and the best overall results are often achieved by MULTI21-SAEs at that layer.

## Appendix K Raw Results ($\texttt{tgt}_i \neq \texttt{steer}_j$) for LLaMA-3.1-8B
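| Layer | MULTI21-SAEs LangID | MULTI21-SAEs ROUGE-L | MULTI21-SAEs LaSE | EN-SAEs LangID | EN-SAEs ROUGE-L | EN-SAEs LaSE | llama-scope LangID | llama-scope ROUGE-L | llama-scope LaSE | Base Model LangID | Base Model ROUGE-L | Base Model LaSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 78.70 | 3.26 | 15.79 | 79.86 | 2.92 | 16.17 | 0.00 | 0.18 | 0.00 | - | - | - |
| 13 | 66.25 | 3.90 | 24.89 | 59.31 | 3.65 | 27.54 | 0.00 | 0.29 | 0.00 | - | - | - |
| 15 | 30.46 | 2.12 | 30.47 | 44.49 | 4.64 | 27.18 | 0.00 | 0.26 | 0.00 | - | - | - |
| 19 | 3.80 | 1.84 | 17.61 | 6.76 | 2.45 | 9.89 | 0.00 | 0.01 | 0.00 | - | - | - |
| 25 | 0.00 | 1.50 | 0.00 | 0.00 | 1.49 | 0.00 | 0.00 | 0.18 | 0.00 | - | - | - |
| Prompt (No steering) | - | - | - | - | - | - | - | - | - | 16.85 | 2.87 | 32.92 |
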
Table 4: LLaMA-3.1-8B CrossSumm, cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Table entries report LangID / ROUGE-L / LaSE (column order). No steering prompt results (baseline; LangID / ROUGE-L / LaSE): 16.85, 2.87, 32.92. *Prompt* scores are computed against the prompt language ($\texttt{tgt}_i$), whereas *steering* scores are computed against the steering-vector language ($\texttt{steer}_j$) and averaged over all mismatched pairs $(i,j)$, averaging across target prompt languages for each steering language; therefore the prompt baseline and steering results are not directly comparable, but the baseline usefully characterizes the model's unsteered default behavior. Strong shading marks the best value *overall in the table* (per metric), while *light shading* marks the best value *within each SAE family* (per metric). Highlighted cells concentrate around the best layer, and the best overall results are often achieved by MULTI21-SAEs at that layer.

| Layer | MULTI21-SAEs LangID | MULTI21-SAEs SpBLEU | MULTI21-SAEs COMET | EN-SAEs LangID | EN-SAEs SpBLEU | EN-SAEs COMET | llama-scope LangID | llama-scope SpBLEU | llama-scope COMET | Base Model LangID | Base Model SpBLEU | Base Model COMET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 54.22 | 4.42 | 39.77 | 68.77 | 3.38 | 36.55 | 2.36 | 0.02 | 6.33 | - | - | - |
| 13 | 58.24 | 16.11 | 66.27 | 62.14 | 12.20 | 60.44 | 0.47 | 0.02 | 9.65 | - | - | - |
| 15 | 56.97 | 22.53 | 73.25 | 60.92 | 21.02 | 71.57 | 0.10 | 0.00 | 2.72 | - | - | - |
| 19 | 24.14 | 23.35 | 68.65 | 21.32 | 22.44 | 62.75 | 0.12 | 0.01 | 1.86 | - | - | - |
| 25 | 0.09 | 3.77 | 11.25 | 0.64 | 12.12 | 31.88 | 0.09 | 0.01 | 2.13 | - | - | - |
| Prompt (No steering) | - | - | - | - | - | - | - | - | - | 91.06 | 31.22 | 83.58 |

Table 5: LLaMA-3.1-8B FLORES, cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Table entries report LangID / SpBLEU / COMET (column order). No steering prompt results (baseline; LangID / SpBLEU / COMET): 91.06, 31.22, 83.58. *Prompt* scores are computed against the prompt language ($\texttt{tgt}_i$), whereas *steering* scores are computed against the steering-vector language ($\texttt{steer}_j$) and averaged over all mismatched pairs $(i,j)$, averaging across target prompt languages for each steering language; therefore the prompt baseline and steering results are not directly comparable, but the baseline usefully characterizes the model's unsteered default behavior. Strong shading marks the best value *overall in the table* (per metric), while *light shading* marks the best value *within each SAE family* (per metric). Highlighted cells concentrate around the best layer, and the best overall results are often achieved by MULTI21-SAEs at that layer.

## Appendix L Per-Language Results ($\texttt{tgt}_i = \texttt{steer}_j$) for Gemma-2-9B

Figure 21: Per-language, per-layer performance deltas for Gemma-2-9B on the CrossSum task when the steering language matches the target language ($\texttt{tgt}_i = \texttt{steer}_j$). Each heatmap shows the change relative to the SCOPE baseline (excluded), with rows corresponding to target languages, columns to transformer layers, and separate panels for each SAE variant. Positive values indicate improvements over the baseline.

Figure 22: Per-language, per-layer deltas for Gemma-2-9B on FLORES under matched steering and target languages ($\texttt{tgt}_i = \texttt{steer}_j$). The heatmaps show the impact of SAE variants on language identification and translation quality across model depth.

Figure 23: Per-language, per-layer COMET score deltas for Gemma-2-9B on FLORES with matched steering and target languages ($\texttt{tgt}_i = \texttt{steer}_j$). This figure emphasizes how semantic translation quality varies across languages, layers, and SAE configurations relative to the SCOPE baseline.

## Appendix M Per-Language Results ($\texttt{tgt}_i \neq \texttt{steer}_j$) for Gemma-2-9B

Figure 24: Per-language, per-layer performance deltas for Gemma-2-9B on the CrossSum task under cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Each heatmap shows how steering in a different language affects summarization quality and language identification across layers and SAE variants, relative to the SCOPE baseline.

Figure 25: Per-language, per-layer performance deltas for Gemma-2-9B on FLORES with cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). The figure highlights the degradation or transfer effects induced by mismatched steering languages across model depth.

Figure 26: Per-language, per-layer COMET score deltas for Gemma-2-9B on FLORES under cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). This visualization captures how semantic translation quality responds to cross-lingual steering at different layers and SAE variants, relative to the SCOPE baseline.

## Appendix N Per-Language Results ($\texttt{tgt}_i = \texttt{steer}_j$) for LLaMA-3.1-8B

Figure 27: Per-language, per-layer performance deltas for LLaMA-3.1-8B on the CrossSum task when the steering language matches the target language ($\texttt{tgt}_i = \texttt{steer}_j$). Each heatmap shows the change relative to the SCOPE baseline (excluded), with rows corresponding to target languages, columns to transformer layers, and separate panels for each SAE variant. Positive values indicate improvements over the baseline.

Figure 28: Per-language, per-layer performance deltas for LLaMA-3.1-8B on the FLORES benchmark when the steering language matches the target language ($\texttt{tgt}_i = \texttt{steer}_j$). Results are shown for language identification (LangID) and translation quality (SpBLEU), aggregated per SAE variant and measured relative to the SCOPE baseline.

Figure 29: Per-language, per-layer COMET score deltas for LLaMA-3.1-8B on FLORES under matched steering and target languages ($\texttt{tgt}_i = \texttt{steer}_j$). The heatmap highlights how SAE interventions affect semantic translation quality across languages and model depth, relative to the SCOPE baseline.

## Appendix O Per-Language Results ($\texttt{tgt}_i \neq \texttt{steer}_j$) for LLaMA-3.1-8B

Figure 30: Per-language, per-layer performance deltas for LLaMA-3.1-8B on the CrossSum task under cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Each panel corresponds to a different SAE variant, showing how mismatched steering languages impact summarization quality and language identification across layers, relative to the SCOPE baseline.

Figure 31: Per-language, per-layer performance deltas for LLaMA-3.1-8B on FLORES with cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). The figure illustrates how steering in a different language affects language identification accuracy and translation quality across layers and SAE variants.

Figure 32: Per-language, per-layer COMET score deltas for LLaMA-3.1-8B on FLORES under cross-lingual steering ($\texttt{tgt}_i \neq \texttt{steer}_j$). Results highlight the sensitivity of semantic translation quality to steering language mismatches at different depths of the model.