How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv cs.CL Papers

Summary

This paper investigates whether auto-generated labels for sparse autoencoder features generalize across languages and scripts, using Serbian digraphia as a controlled testbed. It finds that while feature sets show substantial overlap across languages, the labels often fail to track the same concept in non-English inputs, particularly in less represented scripts.

arXiv:2606.00356v1. Abstract: Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed -- the same language written in both Latin and Cyrillic via deterministic transliteration -- we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap (peak Jaccard similarity 0.57 vs. 0.13 random baseline), suggesting genuine cross-lingual semantic features. We then test whether auto-interpretation labels keep pace. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to $4\times$ more often than within English, and miss Serbian Cyrillic more than Serbian Latin -- two scripts that are deterministic transliterations of each other -- suggesting the failures track how well each form is represented in training. The gap grows with network depth, yet the labels give no indication that they fail. These results suggest that auto-interpretation labels may reflect a feature's behavior on well-represented inputs rather than the concept itself.

Cached at: 06/02/26, 03:36 PM

# How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings
Source: [https://arxiv.org/html/2606.00356](https://arxiv.org/html/2606.00356)

###### Abstract

Sparse autoencoder \(SAE\) features are increasingly used to interpret language models, with auto\-generated natural\-language labels serving as the primary interface for understanding what each feature represents\. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed \(the same language written in both Latin and Cyrillic via deterministic transliteration\), we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap \(peak Jaccard similarity 0\.57 vs\. 0\.13 random baseline\), suggesting genuine cross\-lingual semantic features\. We then test whether auto\-interpretation labels keep pace\. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to $4\times$ more often than within English, and miss Serbian Cyrillic more than Serbian Latin, two scripts that are deterministic transliterations of each other, suggesting the failures track how well each form is represented in training\. The gap grows with network depth, yet the labels give no indication that they fail\. These results suggest that auto\-interpretation labels may reflect a feature’s behavior on well\-represented inputs rather than the concept itself\.



Sripad Karne, Columbia University, sk5695@columbia\.edu

## 1 Introduction

Sparse autoencoders \(SAEs\) have become a standard tool for inspecting the internals of language models, decomposing dense activations into sparse features that are easier to interpret \(Bricken et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib29); Cunningham et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib30); Templeton et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib31)\)\. Because models contain far more features than anyone can inspect by hand, each feature is typically labeled automatically: a language model reads the feature’s top\-activating examples and writes a short natural\-language description \(Bills et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib22); Lin and Bloom, [2023](https://arxiv.org/html/2606.00356#bib.bib17)\)\. These labels are then what practitioners rely on to understand a model, to audit its behavior, or to steer it\.

But how far do these labels generalize? Consider a feature labeled “deception” or “violent content\.” A researcher reads the label, takes the feature to track that concept, and relies on it, perhaps to monitor safety\-relevant behavior\. What the label does not say is whether the feature tracks that concept equally well across languages and scripts\. If a feature detects deception mainly in English, weakening in Russian and weakening further in Serbian Cyrillic, then the same content that triggers it in one form could pass unnoticed in another\. The label names the concept correctly; what it omits is the *range* over which the feature actually tracks it\.

Testing this requires a setting where surface form can be varied while meaning is held exactly fixed\. Serbian digraphia offers precisely this control: Serbian is natively written in both Latin and Cyrillic scripts, related by deterministic, lossless transliteration, so we can change script while holding language, wording, and meaning constant\. Combined with cross\-language controls \(English, Russian\), we construct a factorial paradigm that isolates script, language, wording, and meaning independently\.

First, we find evidence that SAE features encode abstract meaning: feature sets activated by the same content across different languages, scripts, and wordings overlap substantially \(peak Jaccard 0\.57 vs\. 0\.13 baseline\), a pattern robust across model scales, architectures, and SAE hyperparameters \(Appendix [B](https://arxiv.org/html/2606.00356#A2)\), providing evidence that cross\-lingual semantic features exist and are worth labeling\. Second, we test whether the auto\-interpretation labels assigned to these features hold up when the same content is rendered in Serbian\. They often do not\. Content\-labeled features miss the same meaning in Serbian up to $4\times$ more often than within English, a gap that grows with network depth and that the labels themselves give no indication of, and they miss Serbian Cyrillic more than Serbian Latin despite the two being identical up to transliteration\. Our contributions are:

1. A controlled cross\-lingual evaluation showing that auto\-interpretation labels can fail systematically on less\-represented languages and scripts, with failures aligning with estimated training coverage, and that this failure is invisible from the label itself\.
2. A factorial paradigm that isolates script, language, wording, and meaning in SAE feature sets, providing evidence for cross\-lingual semantic features\.
3. A controlled multilingual evaluation suite built on FLORES\+: 300 sentences in four language\-script variants with validated paraphrases and matched random partners, designed for factorial analysis of SAE features and released to support future work\. \(Footnote 1: dataset and code available at [https://anonymous\.4open\.science/r/auto\-interp\-cross\-lingual\-eval\-5D85](https://anonymous.4open.science/r/auto-interp-cross-lingual-eval-5D85)\.\)

## 2 Related Work

#### Sparse autoencoders and feature interpretation\.

Sparse autoencoders \(SAEs\) decompose model activations into sparse, more interpretable features, offering a route to inspecting representations beyond individual neurons \(Bricken et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib29); Cunningham et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib30); Gao et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib33)\)\. Open SAE suites such as Gemma Scope 2 have made this practical at scale \(McDougall et al\., [2025](https://arxiv.org/html/2606.00356#bib.bib4)\), and feature directories built on them are now a common entry point for interpretability work\. We use Gemma Scope 2 SAEs and the labels served through one such directory, treating them as the deployed tools a practitioner would actually reach for\.

#### Automated interpretation and its reliability\.

Because models contain far more features than anyone can inspect by hand, features are typically labeled automatically: a language model reads a feature’s top\-activating examples and writes a short natural\-language description \(Bills et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib22); Lin and Bloom, [2023](https://arxiv.org/html/2606.00356#bib.bib17)\)\. The reliability of these labels is an open concern\. Prior work shows that automated explanations can be vague or inaccurate, and has proposed more rigorous ways to assess them \(Huang et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib19); Liu et al\., [2026](https://arxiv.org/html/2606.00356#bib.bib24)\)\. These evaluations, however, are monolingual: they ask whether a label fits a feature’s behavior on the inputs it was derived from, not whether it still holds when the same content appears in another language or script\. That gap is what we test\.

#### Multilingual representations and script\.

Multilingual models are known to represent meaning in a partly language\-agnostic way: equivalent sentences in different languages converge in a shared semantic space in the middle layers, before representations specialize toward the output language near the end \(Wendler et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib12); Wu et al\., [2025](https://arxiv.org/html/2606.00356#bib.bib8)\)\. At the SAE feature level, Verma et al\. \([2026](https://arxiv.org/html/2606.00356#bib.bib25)\) examine how script and linguistic structure shape which features activate\. These findings make a cross\-lingual test of feature labels meaningful: if features track meaning across languages, the labels assigned to them should hold across languages too\. We test whether they do\.

## 3 Methods

### 3\.1 Dataset

#### Source corpus\.

We draw 300 sentences from the FLORES\+ devtest split \(NLLB Team et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib1)\), a multilingual benchmark of professionally translated sentences, stratified across its Wikinews, Wikibooks, and Wikivoyage sources \(roughly 100 each\) to span news, instructional, and travel registers\. FLORES\+ supplies aligned professional translations of each sentence into Serbian \(Cyrillic\) and Russian \(Cyrillic\)\.

#### Language and script variants\.

Each sentence appears in four language\-script variants: English in Latin, Serbian in Cyrillic, Serbian in Latin, and Russian in Cyrillic\. Serbian Latin is produced by deterministic, lossless transliteration of Serbian Cyrillic using the Vuk Karadžić mapping implemented in `cyrtranslit` \(Labrèche, [2025](https://arxiv.org/html/2606.00356#bib.bib32)\)\. The two Serbian variants thus differ only in script while holding content fixed, the property our paradigm exploits to isolate script from language and meaning\.
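
As a rough illustration of why this script change is deterministic, the sketch below hand-codes the standard Serbian Cyrillic-to-Latin letter correspondence. It is not the `cyrtranslit` implementation used in the paper: the function name `sr_cyrillic_to_latin` and the casing rule for capital digraphs (title case unless a neighbouring letter shows the word is in all caps) are assumptions made for this example.

```python
# Hypothetical sketch: a hand-written Serbian Cyrillic -> Latin table.
# This is NOT the cyrtranslit package; names and casing rule are assumptions.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def sr_cyrillic_to_latin(text: str) -> str:
    """Map each Serbian Cyrillic letter to its Latin counterpart."""
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        if low not in _LOWER:
            out.append(ch)  # punctuation, digits, Latin text pass through
            continue
        latin = _LOWER[low]
        if ch != low:  # the source letter was uppercase
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # Treat the letter as part of an all-caps word if the next letter
            # is uppercase, or if it ends a word whose previous letter was.
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            latin = latin.upper() if all_caps else latin.capitalize()
        out.append(latin)
    return "".join(out)
```

Because every Cyrillic letter has exactly one Latin image, the function needs no context beyond casing, which is what lets the two Serbian variants differ in script alone.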

#### Conditions\.

Within each variant, every sentence appears in three conditions\. The *original* is the FLORES\+ professional translation, or its transliteration for Serbian Latin\. The *paraphrase* is a meaning\-preserving rewrite with varied surface form, generated and validated as described below\. The *random partner* is an unrelated FLORES\+ devtest sentence in the same language and script, length\-matched to within three words; it shares script and language with the target but no semantic content\. Table [1](https://arxiv.org/html/2606.00356#S3.T1) summarizes the full set of variants and conditions\. Crossing the four language\-script variants with the three conditions \(original, paraphrase, random partner\) across all 300 sentences yields $300 \times 4 \times 3 = 3{,}600$ texts\.
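
The length-matching constraint on random partners can be sketched as follows. The paper states only the constraint (same variant, within three words); whitespace-based word counting, the seeding scheme, and the function name `pick_random_partner` are assumptions of this example.

```python
import random


def pick_random_partner(target_idx, sentences, max_len_diff=3, seed=0):
    """Pick an unrelated sentence from the same variant whose word count is
    within max_len_diff words of the target; return None if none qualifies.

    sentences: list of strings, all in one language-script variant.
    Word counts use whitespace splitting, which is an assumption here.
    """
    rng = random.Random(seed + target_idx)  # reproducible per target
    target_len = len(sentences[target_idx].split())
    candidates = [
        j
        for j, s in enumerate(sentences)
        if j != target_idx and abs(len(s.split()) - target_len) <= max_len_diff
    ]
    return rng.choice(candidates) if candidates else None
```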

| Variant | Language | Script |
| --- | --- | --- |
| En\-Latin | English | Latin |
| Sr\-Latin | Serbian | Latin |
| Sr\-Cyrillic | Serbian | Cyrillic |
| Ru\-Cyrillic | Russian | Cyrillic |

| Condition | Description |
| --- | --- |
| Original | FLORES\+ reference translation |
| Paraphrase | Meaning\-preserving rewrite |
| Random | Unrelated sentence, same variant |

Table 1: The four language\-script variants and three conditions\.

#### Paraphrase generation\.

English paraphrases were generated with Claude Opus 4\.6, prompted to preserve meaning while varying surface form and keeping length to within three words of the original FLORES\+ sentence\. Serbian and Russian paraphrases were obtained by translating the English paraphrase with the same model; Serbian Latin by transliterating Serbian Cyrillic\. All candidates were filtered with LaBSE \(Feng et al\., [2022](https://arxiv.org/html/2606.00356#bib.bib2)\), keeping those with cosine similarity $\geq 0.80$ to the original\. Full prompts appear in Appendix [A\.1](https://arxiv.org/html/2606.00356#A1.SS1)\.
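
A minimal sketch of the similarity filter, assuming sentence embeddings have already been computed (for example with LaBSE) and are available as plain lists of floats. The helper names and the input layout are hypothetical; only the 0.80 threshold comes from the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_paraphrases(pairs, threshold=0.80):
    """Keep the ids of paraphrases that stay close to their original.

    pairs: iterable of (sentence_id, original_embedding, paraphrase_embedding).
    """
    return [
        sid
        for sid, emb_orig, emb_para in pairs
        if cosine_similarity(emb_orig, emb_para) >= threshold
    ]
```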

#### Native\-speaker validation\.

For each of English, Serbian, and Russian we recruited two validators\. Each judged 200 sentences with a 100\-sentence overlap between the pair, covering 300 unique sentences per language\. Each sentence pair received two binary judgments: whether the paraphrase preserves the original *meaning*, and whether it reads as *natural* text\. A pair passing both was accepted as is; otherwise the validator supplied a corrected paraphrase, which was added to the dataset and flagged for reference \(full instructions in Appendix [A\.2](https://arxiv.org/html/2606.00356#A1.SS2)\)\.

#### Disagreement resolution\.

On the 100 overlap sentences per language, we resolved decisions conservatively: when only one validator flagged a pair, we honored that flag; when both flagged it, we applied a content\-blind even/odd tiebreak on the FLORES\+ index, fixed before corrections were examined \(Appendix [A\.3](https://arxiv.org/html/2606.00356#A1.SS3)\)\.

#### Validation outcomes\.

Clean\-accept rates were 96\.0% \(English\), 90\.7% \(Russian\), and 70\.7% \(Serbian\); corrections were minor \(median 1–3 words\)\. Inter\-rater agreement was near\-perfect for English, high for Russian \(Gwet’s AC1 0\.87\), and moderate for Serbian \(AC1 0\.63\)\. Full statistics appear in Appendix [A\.4](https://arxiv.org/html/2606.00356#A1.SS4)\.
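
For readers unfamiliar with Gwet's AC1, the sketch below computes it for two raters giving binary judgments, using the standard definition: observed agreement corrected by a chance term $2\pi(1-\pi)$, where $\pi$ is the mean proportion of positive judgments across the two raters. Which of the two judgments (meaning, naturalness) the reported values refer to is not stated in this section, so the inputs here are purely illustrative.

```python
def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters and binary (0/1) judgments."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pi: mean proportion of "1" judgments across both raters.
    pi = (sum(ratings_a) + sum(ratings_b)) / (2 * n)
    # Chance agreement for two categories; at most 0.5, so no division by zero.
    chance = 2 * pi * (1 - pi)
    return (observed - chance) / (1 - chance)
```

Unlike Cohen's kappa, the chance term shrinks when almost all items fall in one category, which is why AC1 is often preferred when acceptance rates are very high.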

### 3\.2 Models and SAEs

#### Models\.

Our primary model is Gemma\-3\-27B \(Gemma Team, [2025](https://arxiv.org/html/2606.00356#bib.bib3)\)\. To check that our findings generalize across model scale, we replicate the core analyses on Gemma\-3\-1B and Gemma\-3\-12B; the factorial decomposition holds across all three sizes \(Appendix [B\.4](https://arxiv.org/html/2606.00356#A2.SS4), Figure [6](https://arxiv.org/html/2606.00356#A2.F6)\)\. To check generalization across model family we run Llama\-3\.1\-8B with Llama Scope SAEs \(He et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib26)\), a different architecture, SAE training regime, and dictionary width; the qualitative structure replicates \(Appendix [B\.5](https://arxiv.org/html/2606.00356#A2.SS5), Figure [7](https://arxiv.org/html/2606.00356#A2.F7)\)\.

#### Sparse autoencoders\.

We use the Gemma Scope 2 suite of SAEs \(McDougall et al\., [2025](https://arxiv.org/html/2606.00356#bib.bib4)\)\. All SAEs use the JumpReLU activation \(Rajamanoharan et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib5)\), a dictionary width of $16{,}384$ features, and are selected at low $L_0$ sparsity\. We analyze every layer of each model, using the layer\-matched SAE at each depth\. To confirm our findings are not artifacts of the extraction setup, we vary dictionary width \(Appendix [B\.2](https://arxiv.org/html/2606.00356#A2.SS2)\), $L_0$ sparsity \(Appendix [B\.1](https://arxiv.org/html/2606.00356#A2.SS1)\), and pooling strategy \(Appendix [B\.3](https://arxiv.org/html/2606.00356#A2.SS3)\) on Gemma\-3\-27B and find the decomposition robust across all three\.

### 3\.3 Feature Extraction and Similarity

#### Active feature sets\.

For a given text $s$ we extract the residual\-stream hidden state at the final token position at each layer $\ell$, encode it with the layer\-$\ell$ SAE, and define the active feature set as

$$F_{\ell}(s) = \{\, i : a_i^{(\ell)}(s) > 0 \,\}.$$
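
The definition above can be sketched for a single layer as follows, assuming a JumpReLU encoder of the usual form (pre-activation $W_{\text{enc}} h + b_{\text{enc}}$, kept only where it exceeds a per-feature threshold). The toy weights, the argument layout, and the function name are illustrative assumptions; loading the released Gemma Scope 2 checkpoints is not shown.

```python
def jumprelu_active_set(hidden, w_enc, b_enc, thresholds):
    """Return the indices of SAE features with positive activation.

    hidden:     list of floats, the residual-stream state at the final token.
    w_enc:      list of encoder rows, one per feature, each as long as hidden.
    b_enc:      list of encoder biases, one per feature.
    thresholds: list of JumpReLU thresholds, one per feature.
    """
    active = set()
    for i, (row, bias, theta) in enumerate(zip(w_enc, b_enc, thresholds)):
        pre = sum(w * h for w, h in zip(row, hidden)) + bias
        # JumpReLU: pass the pre-activation through only above its threshold.
        activation = pre if pre > theta else 0.0
        if activation > 0:
            active.add(i)
    return active
```

For positive thresholds, a feature is active exactly when its pre-activation clears the threshold, so the active set does not depend on activation magnitude.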

#### Similarity\.

We measure the overlap between two texts’ active feature sets with the Jaccard index,

$$J_{\ell}(s, s') = \frac{|F_{\ell}(s) \cap F_{\ell}(s')|}{|F_{\ell}(s) \cup F_{\ell}(s')|},$$

computed independently at each layer\.
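
A short sketch of the per-layer Jaccard computation, assuming each text's active sets are stored as a dictionary from layer index to a set of feature indices. The convention that two empty sets score 0 is an assumption of this example, not something the paper specifies.

```python
def jaccard(features_a, features_b):
    """Jaccard index of two collections of feature indices."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty active sets count as no overlap
    return len(a & b) / len(union)


def layerwise_jaccard(sets_a, sets_b):
    """Compute the Jaccard index independently at every shared layer.

    sets_a, sets_b: dict mapping layer index -> set of active feature indices.
    """
    return {
        layer: jaccard(sets_a[layer], sets_b[layer])
        for layer in sets_a
        if layer in sets_b
    }
```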

#### Factorial comparisons\.

We isolate script, language, wording, and meaning one at a time by comparing pairs that hold the remaining properties fixed \(Table [2](https://arxiv.org/html/2606.00356#S3.T2)\):

- Script: Sr\-Cyrillic orig vs\. Sr\-Latin orig \(same language and content, only script varies\)\.
- Language: Sr\-Cyrillic orig vs\. Ru\-Cyrillic orig \(same script and content, language varies\)\.
- Wording: English orig vs\. English paraphrase \(same language and script, surface form varies\)\.
- Meaning: English orig vs\. Russian paraphrase \(script, language, and wording all differ; only meaning shared\)\.

Each is read against a random\-partner baseline sharing only surface properties, except wording, which uses the identity ceiling of $1.0$\.

Table 2: Controlled comparisons isolating each property of a text\. ✓ = property *shared* between the two texts and ✗ = property *differs*\. Each test isolates one property in its *main* comparison and reads it against a reference *floor*; a feature set tracking the isolated property yields Jaccard well above its floor\. Note that the wording comparison uses the identity ceiling rather than a random floor because two texts cannot differ in meaning while sharing the same wording\.
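
The four main comparisons can be written down as a small lookup table and averaged over sentences, as in the sketch below. The key names (`sr_cyrillic`, `original`, and so on) and the nested-dictionary layout are hypothetical. The random-partner floors are left out because the exact pairing used for each floor is given in Table 2, which is not reproduced here.

```python
from statistics import mean

# (variant, condition) keys for the two texts in each main comparison.
COMPARISONS = {
    "script": (("sr_cyrillic", "original"), ("sr_latin", "original")),
    "language": (("sr_cyrillic", "original"), ("ru_cyrillic", "original")),
    "wording": (("en_latin", "original"), ("en_latin", "paraphrase")),
    "meaning": (("en_latin", "original"), ("ru_cyrillic", "paraphrase")),
}


def mean_overlap(feature_sets, comparison):
    """Mean Jaccard overlap for one comparison at a single layer.

    feature_sets: dict sentence_id -> dict (variant, condition) -> set of
    active feature indices at that layer.
    """
    key_a, key_b = COMPARISONS[comparison]
    scores = []
    for sets in feature_sets.values():
        a, b = sets[key_a], sets[key_b]
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return mean(scores)
```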

### 3\.4 Auto\-interpretation evaluation

The first leg of our study \(Leg 1, Section 3\.3\) described what features collectively encode; in the second leg \(Leg 2\) we ask whether the natural\-language labels assigned to individual features hold up\. At scale, features are labeled automatically: a language model reads a feature’s top\-activating examples and writes a short description of what the feature responds to \(Bills et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib22); Lin and Bloom, [2023](https://arxiv.org/html/2606.00356#bib.bib17)\)\. These labels are often what practitioners rely on to understand and act on a model’s internals\. A content label makes an implicit prediction: the feature should respond to that content however it is written\. Our paradigm lets us check this directly by varying surface form while holding meaning fixed\.

#### Selecting features to test\.

A candidate feature must pass two independent content checks\. First, it must fire on both the English original and the Russian paraphrase of a sentence, a pair that differs in script, language, and wording simultaneously, so co\-activation is strong evidence that the feature tracks meaning rather than any surface property\. This selection is deliberately strict: we test only features with prior evidence of cross\-lingual content tracking, so the failure rates we observe on Serbian are likely a conservative estimate of those in the broader feature population\. For each such feature, we retrieve its auto\-interp label from Neuronpedia, a deployed pipeline serving Gemma Scope 2 descriptions that practitioners consume directly, using two labelers, Claude Sonnet \(the strongest available\) and Gemini Flash \(Neuronpedia’s default\), to separate genuine feature behavior from labeler\-specific artifacts\. Labels are retrieved at the four layers Neuronpedia provides \(16, 31, 40, 53\), spanning early\-middle to late depth\. Second, from the features that pass this co\-activation filter, an LLM classifies each feature’s auto\-interp label as a *content* claim \(semantic topic or meaning\) versus a *surface* or *language* claim, and the features whose labels describe surface form rather than content are discarded, since only content labels predict invariance to surface variation\.

#### Cross\-language falsification\.

For each \(feature, sentence\) unit we test whether the feature also fires on renderings of the same sentence in other forms: an English paraphrase and the Russian original \(within\-language floors\), and the Serbian Latin and Serbian Cyrillic versions \(cross\-language test\)\. A content label predicts firing on all of these; a failure to fire on a given rendering counts as a miss for that condition\. We report the miss rate per condition with 95% bootstrap confidence intervals \($10{,}000$ resamples\)\. The two Serbian scripts are deterministic transliterations of each other, so their miss rates are directly comparable and their ratio isolates script\-specific bias\. We report Claude Sonnet as the primary labeler; Gemini gives the same pattern \(Appendix [C](https://arxiv.org/html/2606.00356#A3)\)\.
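
A minimal sketch of the miss-rate computation with a bootstrap interval. The paper specifies 95% intervals from 10,000 resamples but not the bootstrap variant or the resampling unit, so the percentile method and resampling over individual \(feature, sentence\) units are assumptions here, as are the function names.

```python
import random


def miss_rate(fired):
    """Fraction of (feature, sentence) units on which the feature did not fire.

    fired: list of booleans, True where the feature fired on the rendering.
    """
    return sum(1 for f in fired if not f) / len(fired)


def bootstrap_ci(fired, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the miss rate (assumed variant)."""
    rng = random.Random(seed)
    n = len(fired)
    stats = sorted(
        miss_rate([fired[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

The Cyrillic-to-Latin ratio reported later would then be `miss_rate(fired_cyrillic) / miss_rate(fired_latin)` over the same units.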

## 4 Results

### 4\.1 What SAE features encode

We first map how a text’s active feature set is organized, varying script, language, wording, and meaning one at a time while holding the others fixed \(Table [2](https://arxiv.org/html/2606.00356#S3.T2)\)\. For each property we measure the Jaccard overlap of active SAE feature sets across all layers of Gemma\-3\-27B, comparing each against an appropriate reference baseline\. We treat script, language, and wording as controls that isolate meaning, the property central to the rest of our analysis, and summarize them briefly before turning to it\. Figure [1](https://arxiv.org/html/2606.00356#S4.F1) shows the full layer\-wise picture: each panel isolates one property, plotting its Jaccard overlap across all 62 layers\.

![Refer to caption](https://arxiv.org/html/2606.00356v1/x1.png)

Figure 1: SAE feature overlap decomposed by script, language, wording, and meaning across all layers of Gemma\-3\-27B \(300 sentences; shaded bands are 95% bootstrap CIs, 10,000 resamples\)\. In each panel a solid *main line* varies only the target property while holding the others fixed, read against a dashed *baseline* \(or, in \(c\), the identity ceiling\); see Table [2](https://arxiv.org/html/2606.00356#S3.T2) for the exact comparisons\.

#### Script, language, and wording\.

Within our digraphic setting, script has strikingly little effect: transliterating a sentence between Latin and Cyrillic holds language, wording, and meaning fixed and preserves most of the active set \(mean Jaccard $0.73$\), far above the baseline\. This is consistent with prior work showing that multilingual representations become increasingly language\-agnostic in intermediate layers \(Wendler et al\., [2024](https://arxiv.org/html/2606.00356#bib.bib12); Wu et al\., [2025](https://arxiv.org/html/2606.00356#bib.bib8)\); in our controlled setting it confirms that script is the mildest of the four perturbations\. Language matters more: the same sentence in two languages \(Serbian vs\. Russian, same script\) shares less of its feature set \(mean $0.50$\), indicating that language has a larger effect on the active set than script does\. Wording sits between the two: paraphrasing within a language preserves a substantial share of the feature set \(mean $0.66$\)\. Each property also traces its own course across depth: non\-script overlap stays high until a sharp drop near the output, non\-language overlap holds early then declines through the second half, and wording overlap stays roughly flat\. The late declines in non\-script and non\-language overlap are consistent with prior work suggesting that later transformer layers become increasingly specialized for next\-token prediction and output generation \(Geva et al\., [2022](https://arxiv.org/html/2606.00356#bib.bib10); Belrose et al\., [2023](https://arxiv.org/html/2606.00356#bib.bib11)\)\. We make no strong claim here, however: the script baseline and wording overlap stay roughly flat across depth, so the picture is suggestive rather than conclusive\. These trajectories are worth detailed study, which the paradigm supports and we leave to future work\. These three properties show how the active set responds to surface variation\. We now turn to the property at the center of our analysis: meaning\.

#### Meaning\.

We isolate meaning by comparing two texts that share nothing but their content \(an English sentence and a Russian paraphrase, differing in script, language, and wording\), so any overlap in their feature sets points to genuinely cross\-lingual semantic features rather than surface form\. This overlap is substantial: it reaches $0.57$ Jaccard at its peak, well above the baseline \(mean 0\.13\) \(Figure [1](https://arxiv.org/html/2606.00356#S4.F1)d\)\. This suggests that SAE features encode abstract meaning that survives a complete change of surface form\. The signal is also depth\-dependent: after an initial drop in the first few layers, the overlap climbs to a peak around the lower\-middle of the network before declining toward the output\.

#### From feature sets to feature labels\.

Having established that genuinely semantic features exist, we next need a way to identify what each one represents\. At scale, this is typically done with auto\-interpretation, which assigns each feature a natural\-language label, and these labels are often what researchers rely on to make sense of a model\. Our set\-level analysis cannot speak to them: it describes populations of features, not whether any individual feature’s label holds\. We turn to that question next\.

### 4\.2 Auto\-interpretation labels generalize poorly across languages and scripts

Leg 1 gave a set\-level picture: SAE features encode genuine semantic content, while script, wording, and language reshape the active set to differing degrees\. Auto\-interpretation provides the complementary feature\-level view, labeling each feature from its top\-activating examples\. We evaluate these labels on the cross\-lingual semantic features identified in Leg 1: those that fire on a sentence’s content in both English and Russian, which confirms behaviorally that they track meaning rather than surface form\. As a second robustness check, we confirm these features’ labels describe content rather than surface form, using an LLM to classify each label as content or surface and discarding the rare exceptions \(Appendix [C](https://arxiv.org/html/2606.00356#A3)\)\.

Testing labels from Neuronpedia’s auto\-interpretation pipeline, where Claude Sonnet and Gemini write each label by reading a feature’s top\-activating examples, we find these predictions often break down: a feature labeled for its content can fail to fire on that same content when it is rewritten in another language or script, and this happens more for less well\-represented forms\. We quantify this as the *miss rate*: the fraction of cases where a content\-labeled feature fails to fire on the same meaning in another form \(Table [3](https://arxiv.org/html/2606.00356#S4.T3)\)\. We report Claude Sonnet labels, the stronger labeler; Gemini gives the same picture, with the full breakdown and additional baselines \(including random controls\) in Appendix [C](https://arxiv.org/html/2606.00356#A3)\.

\(a\) Within\- vs\. cross\-language miss rates\.

\(b\) Serbian script asymmetry\.

Table 3: Content\-feature miss rates over the maxvar pool \(features firing on English original and Russian paraphrase, label\-confirmed content; Sonnet labels\)\. \(a\) Within\-language floors \(Eng\-para, Rus\-orig\) vs\. the cross\-language Serbian average\. \(b\) Serbian Latin vs\. Cyrillic, with the Cyrillic/Latin ratio\. 95% bootstrap CIs in brackets \($10{,}000$ resamples\)\.

#### Even within a language, content labels are unreliable; across languages they break down\.

Reworded within their own language, these features already miss a non\-trivial share of the time, falling from $14.8\%$ in early layers to $7.3\%$ at the final layer for English, and staying in the $13$–$17\%$ range for Russian\. The same features miss Serbian far more often\. Averaged over the two Serbian scripts, miss rates start near the within\-language floor \($21.1\%$ versus $14.8\%$ at layer 16\) but reach $30.1\%$ by the final layer \(Table [3\(a\)](https://arxiv.org/html/2606.00356#S4.T3.st1)\), an excess of more than $20$ points over the English floor, roughly $4\times$ the within\-language miss rate\. Paraphrasing the Serbian text adds only $1$ to $3$ points \(Appendix [C](https://arxiv.org/html/2606.00356#A3)\), confirming the failure comes from the language and script, not the specific phrasing within them\. Across conditions, miss rates tend to order from English through Russian and Serbian Latin to Serbian Cyrillic \(Table [3\(a\)](https://arxiv.org/html/2606.00356#S4.T3.st1)\)\.
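
The headline comparison can be checked from the figures quoted in this paragraph, taking the Serbian average and the English paraphrase floor at layer 16 and at the final layer:

$$
\begin{aligned}
\text{layer 16:}\quad & 21.1\% - 14.8\% = 6.3 \text{ points}, & \frac{21.1}{14.8} &\approx 1.4,\\
\text{final layer:}\quad & 30.1\% - 7.3\% = 22.8 \text{ points}, & \frac{30.1}{7.3} &\approx 4.1.
\end{aligned}
$$

The final\-layer ratio is the source of the "roughly $4\times$" figure, and the growth from about $1.4\times$ to about $4.1\times$ reflects both a rising Serbian miss rate and a falling English floor.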

#### The Serbian Cyrillic disadvantage grows with depth\.

The two Serbian scripts are deterministic transliterations of each other, so a content feature should treat them identically\. Instead they diverge with depth: they are indistinguishable in early layers \($1.02\times$ Cyrillic\-to\-Latin miss ratio at layer 16\), but the ratio rises to $1.35\times$ at the final layer \(Table [3\(b\)](https://arxiv.org/html/2606.00356#S4.T3.st2)\)\. Since the scripts share language, wording, and meaning, the failure is unlikely to be semantic; the main difference is the character system, which points to how well the model represents each script\. The same gap appears whether we test the original Serbian sentences or their paraphrases, so it does not depend on the particular wording\. Table [4](https://arxiv.org/html/2606.00356#S4.T4) gives individual examples: content\-labeled features that fire on a sentence in three of the four forms but fail on one Serbian script\.

| Layer / Feature | Misses | Sonnet label | Gemini label | Sentence |
| --- | --- | --- | --- | --- |
| L40 / F4327 | Sr\-Cyrillic | “Olympic athletes” | “Olympic athletes and comebacks” | “Poland’s visually impaired skier Maciej Krezel…” |
| L53 / F3041 | Sr\-Cyrillic | “political leaders” | “political leaders or figures” | “Hu encouraged developing countries to avoid the old path…” |
| L40 / F8424 | Sr\-Latin | “diseases and conditions” | “diseases and conditions” | “Infectious diseases themselves, or dangerous animals…” |

Table 4: Content\-labeled features that fail on one Serbian script, verifiable on Neuronpedia\. Additional examples in Appendix [C](https://arxiv.org/html/2606.00356#A3), Table [18](https://arxiv.org/html/2606.00356#A3.T18)\.

#### The divergence sharpens with depth\.

These failures are mildest in the earliest layer we can probe: at layer 16 the within\-language floors and the Serbian miss rate sit fairly close \($15$–$17\%$ versus $21.1\%$\)\. They pull apart deeper in the network\. The within\-language floors hold steady or fall \(English paraphrase miss drops to $7.3\%$ by the final layer, Russian stays near $15\%$\) while the Serbian miss rate climbs to $30.1\%$\. Reliability thus holds or improves within a language and degrades across one as depth increases\. These patterns replicate across model scale: Gemma\-3\-1B and Gemma\-3\-12B show the same ordering and a Cyrillic/Latin asymmetry that grows with depth \(Appendix [D](https://arxiv.org/html/2606.00356#A4)\)\.

## 5 Discussion

#### Semantic features are real, but their labels overpromise\.

Our two analyses give a coherent picture\. Leg 1 shows that SAE features genuinely encode abstract meaning: cross\-lingual semantic overlap exists and survives a complete change of script, language, and wording\. Leg 2 shows that this encoding is narrower than its labels suggest\. A feature that tracks a concept in both English and Russian, and is labeled for that concept, still fails to fire on it in Serbian noticeably more often than it does within those languages\. The two findings are not in tension: these features carry real semantic content, which is why their labels are often reasonable, but that content does not transfer cleanly across languages and scripts\. The failure is not that the labels are wrong: they often capture something real about the feature, yet can still fail to predict its behavior on an equivalent rendering of the same content, with no indication of where\.

#### The failures are consistent with expected training distribution\.

The failures align with how often each form of a meaning appears in training. Averaged across layers, content features miss their target far more in Serbian than within English or Russian ($10.1\%$ and $15.3\%$ versus $22.2\%$ for Serbian Latin and $26.1\%$ for Serbian Cyrillic). The clearest case is the Serbian script asymmetry: the two scripts are deterministic transliterations, identical in language, wording, and meaning, so a difference between them cannot be semantic. The only thing that differs is the character system, and features miss Serbian Cyrillic more than Serbian Latin. This fits the known training distribution. Web corpora are heavily English-dominant: English makes up around $45\%$ of Common Crawl, with no other language above roughly $5\%$ (Common Crawl Foundation, [2024](https://arxiv.org/html/2606.00356#bib.bib14)). Cyrillic-script text is predominantly Russian (Russian $\sim 5$–$6\%$, Serbian near $0.2\%$), and Serbian online is written more often in Latin than in Cyrillic (RNIDS, [2024](https://arxiv.org/html/2606.00356#bib.bib16)), which suggests Serbian Cyrillic is comparatively scarce in the web text that dominates modern pretraining corpora. Such imbalances have measurable downstream effects: model performance has been shown to track a language’s share of pretraining data (Li et al., [2024](https://arxiv.org/html/2606.00356#bib.bib21)). A similar pattern could hold at the level of individual features: the Serbian Cyrillic gap is what this predicts, with the rarest form of the content the one its features handle worst. This is not a binary high- versus low-resource split but a graded one (English is the most dominant form, Russian high-resource but less dominant, and Serbian rarer still), with reliability ordered accordingly.

#### How the gap changes with depth\.

This ordering sharpens with depth: English becomes more reliable deeper in the network (paraphrase miss falling to $7.3\%$), Russian stays roughly flat near $15\%$, and Serbian degrades. This divergence is visible in the raw firing patterns, before any label is involved, so it is a property of the features rather than an artifact of the labeling procedure. One reading is that deeper layers tie representations more tightly to the specific form of the input, consistent with prior observations that multilingual representations become less language-agnostic toward later layers (Wendler et al., [2024](https://arxiv.org/html/2606.00356#bib.bib12); Liu and Niehues, [2025](https://arxiv.org/html/2606.00356#bib.bib9)); however, we cannot establish this directly. The cross-lingual gap itself may be unsurprising, but the within-language trend is not: a feature becoming *more* reliable for its strongest form with depth is, to our knowledge, new, and we offer it as an empirical observation.

None of this is visible from the label itself. A label is written from a feature’s top-activating examples (Bills et al., [2023](https://arxiv.org/html/2606.00356#bib.bib22); Lin and Bloom, [2023](https://arxiv.org/html/2606.00356#bib.bib17)), the inputs it fires on most strongly, which tend to be the well-represented forms. Consider a feature labeled “deception” or “violent content.” A researcher auditing the model reads the label, takes the feature to track that concept, and relies on it, perhaps to monitor or steer the behavior. What the label would not reveal is that such a feature might detect the concept mainly in English, with its response weakening in Russian and weakening further in Serbian Cyrillic, so the same content that triggers it in one form could pass unnoticed in another. The label may name the concept correctly; what it can omit is the range over which the feature actually tracks it.

#### Implications for practitioners\.

The practical lesson is that an auto\-interpretation label is a claim about the forms a feature was most active on, not a guarantee about the concept in general\. A content label can be locally accurate and still fail to generalize across languages and scripts, and because the label carries no signal of this, the failure is invisible to monolingual inspection\. Evaluating feature labels therefore requires controlled stimuli that vary surface form while holding meaning fixed\. The failures we document also appear addressable: auto\-interpretation pipelines could draw top\-activating examples from multiple languages, or flag features whose activation is inconsistent across equivalent inputs, though we leave validation of such mitigations to future work\.
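The flagging idea in the paragraph above could look roughly like the following. This is a minimal sketch under stated assumptions, not the paper's pipeline: `activations` is a hypothetical mapping from a variant name to a list of per-sentence sets of active feature ids (aligned by sentence index), the per-feature miss rate is defined here as "fires on the reference rendering but not on the equivalent variant", and the `threshold` value is arbitrary.

```python
from collections import defaultdict


def miss_rates(activations, reference, variant):
    """Per feature: fraction of sentences where the feature fires on the
    `reference` rendering but not on the equivalent `variant` rendering.
    Hypothetical helper; the input layout is an assumption of this sketch."""
    fired = defaultdict(int)
    missed = defaultdict(int)
    for ref_feats, var_feats in zip(activations[reference], activations[variant]):
        for feature in ref_feats:
            fired[feature] += 1
            if feature not in var_feats:
                missed[feature] += 1
    return {feature: missed[feature] / fired[feature] for feature in fired}


def flag_inconsistent(activations, reference, variants, threshold=0.25):
    """Return {feature: {variant: miss_rate}} for features whose miss rate
    on some variant exceeds `threshold` (an arbitrary illustrative cutoff)."""
    flagged = {}
    for variant in variants:
        for feature, rate in miss_rates(activations, reference, variant).items():
            if rate > threshold:
                flagged.setdefault(feature, {})[variant] = rate
    return flagged
```

A label attached to a flagged feature could then carry a note listing the variants on which it was unreliable, which is the kind of signal the labels studied here lack.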

#### Concurrent and recent work\.

The interpretability toolchain is developing rapidly. Circuit tracing (Ameisen et al., [2025](https://arxiv.org/html/2606.00356#bib.bib27)) traces computational graphs through model internals, and Natural Language Autoencoders (NLAs; Fraser-Taliente et al., [2026](https://arxiv.org/html/2606.00356#bib.bib28)) translate full activation vectors into natural-language explanations which, their authors note, can hallucinate. These efforts build increasingly powerful tools for inspecting models; our work is complementary, contributing controlled evaluation methodology that checks whether the descriptions these tools produce generalize beyond the conditions they were derived from. Both NLAs and the auto-interp labels we evaluate face the same underlying challenge: a natural-language description of a model’s internals may be accurate on the inputs it was built from and quietly fail on others.

## 6 Conclusion

Using Serbian digraphia as a controlled testbed, we find evidence that SAE features in multilingual language models encode abstract meaning that can survive complete changes of script, language, and wording. However, the auto-interpretation labels assigned to these features do not keep pace: content labels that hold in well-represented languages quietly fail on less-represented ones, with miss rates reaching $4\times$ the within-language floor for Serbian Cyrillic, and the labels themselves carry no signal of where they break. The failures are graded, aligning with the representation of each form in training data, and tend to sharpen with network depth. These results suggest that auto-interpretation labels should be understood as claims about a feature’s behavior on its best-represented inputs, not as guarantees about the concept in general.

## Limitations

Our cross-language evaluation covers three languages (English, Russian, and Serbian) from two language families, with Serbian closely related to Russian. The graded miss-rate pattern we observe should in principle extend to more typologically distant and lower-resource languages, but we have not verified this beyond the conditions in our dataset. Leg 2 replicates across Gemma model scales (1B, 12B, 27B; Appendix [D](https://arxiv.org/html/2606.00356#A4)), but we have not confirmed the label-failure patterns on other model families or labeling pipelines beyond Neuronpedia.

## Ethics Statement

Native\-speaker validators were recruited through personal networks and compensated $125–$200 per batch of 200 sentence pairs, exceeding standard crowdsourcing rates for comparable linguistic annotation\. No personal or sensitive data was collected from participants\. Paraphrases were generated using Claude Opus 4\.6 and auto\-interpretation labels were retrieved from Neuronpedia’s deployed pipeline \(Claude Sonnet and Gemini Flash\); all prompts are reported in full in Appendix A\. The FLORES\+ source data is used under its CC BY\-SA 4\.0 license\. Our released dataset and code are available under CC BY 4\.0 for research use\.

#### Use of Large Language Models\.

Claude was used to assist with paper drafting, editing, and code implementation\. All experimental design, analysis, and scientific claims are the authors’ own\.

## References

- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025). Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2025/attribution-graphs/methods.html).
- N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt (2023). Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112. [Link](https://arxiv.org/abs/2303.08112).
- S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders (2023). Language models can explain neurons in language models. [https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html).
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, S. Carter, and C. Olah (2023). Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2023/monosemantic-features).
- Common Crawl Foundation (2024). Common crawl. [https://commoncrawl.org](https://commoncrawl.org/). Accessed 2026-05-25.
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2024). Sparse autoencoders find highly interpretable features in language models. International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2309.08600).
- F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang (2022). Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 878–891. [Link](https://aclanthology.org/2022.acl-long.62/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.62).
- K. Fraser-Taliente, S. Kantamneni, E. Ong, D. Mossing, C. Lu, P. C. Bogdan, E. Ameisen, J. Chen, D. Kishylau, A. Pearce, J. Tarng, A. Wu, J. Wu, Y. Zhang, D. M. Ziegler, E. Hubinger, J. Batson, J. Lindsey, S. Zimmerman, and S. Marks (2026). Natural language autoencoders produce unsupervised explanations of llm activations. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2026/nla/index.html).
- L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Tow, I. Babuschkin, I. Sutskever, J. Leike, and J. Wu (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.
- Gemma Team (2025). Gemma 3 technical report. arXiv:2503.19786. [Link](https://arxiv.org/abs/2503.19786).
- M. Geva, A. Caciularu, K. Wang, and Y. Goldberg (2022). Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, pp. 30–45. [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.3), [Link](https://aclanthology.org/2022.emnlp-main.3/).
- Z. He, W. Shu, X. Ge, L. Chen, J. Wang, Y. Zhou, F. Liu, Q. Guo, X. Huang, Z. Wu, Y. Jiang, and X. Qiu (2024). Llama scope: extracting millions of features from llama-3.1-8b with sparse autoencoders. arXiv preprint arXiv:2410.20526.
- J. Huang, A. Geiger, K. D’Oosterlinck, Z. Wu, and C. Potts (2023). Rigorously assessing natural language explanations of neurons. arXiv preprint arXiv:2309.10312.
- G. Labrèche (2025). CyrTranslit. Zenodo. Python package for bidirectional Cyrillic–Latin transliteration. [Document](https://dx.doi.org/10.5281/zenodo.17663256), [Link](https://doi.org/10.5281/zenodo.17663256).
- Z. Li, Y. Shi, Z. Liu, F. Yang, A. Payani, N. Liu, and M. Du (2024). Language ranker: a metric for quantifying llm performance across high and low-resource languages. arXiv preprint arXiv:2404.11553. [Link](https://arxiv.org/abs/2404.11553).
- J. Lin and J. Bloom (2023). Neuronpedia. [https://www.neuronpedia.org](https://www.neuronpedia.org/). Interactive platform for sparse autoencoder and neuron interpretability.
- D. Liu and J. Niehues (2025). Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 15979–15996. [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.778), [Link](https://aclanthology.org/2025.acl-long.778/).
- W. Liu, Y. Miao, H. Zhao, Y. Liu, and M. Du (2026). NeuronScope: a multi-agent framework for explaining polysemantic neurons in language models. arXiv preprint arXiv:2601.03671. [Link](https://arxiv.org/abs/2601.03671).
- C. McDougall, A. Conmy, J. Kramár, T. Lieberum, S. Rajamanoharan, and N. Nanda (2025). Gemma scope 2: technical paper. Technical report, Google DeepMind. [Link](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/Gemma_Scope_2_Technical_Paper.pdf).
- NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. Mejia Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. Ram Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang (2024). Scaling neural machine translation to 200 languages. Nature 630 (8018), pp. 841–846. [Document](https://dx.doi.org/10.1038/s41586-024-07335-x), [Link](https://doi.org/10.1038/s41586-024-07335-x).
- S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda (2024). Jumping ahead: improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv:2407.14435. [Link](https://arxiv.org/abs/2407.14435).
- RNIDS (2024). Usage of latin and cyrillic scripts in serbian web domains. [https://www.rnids.rs](https://www.rnids.rs/). Statistics and reports on Serbian internet domain usage.
- A. Templeton, T. Bricken, J. Batson, B. Chen, A. Jermyn, S. Carter, C. Olah, et al. (2024). Scaling monosemanticity: extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2024/scaling-monosemanticity).
- A. A. K. Verma, A. Chatterjee, M. Gupta, and T. Chakraborty (2026). Multilingual language models encode script over linguistic structure. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). ACL 2026 Main Conference. [Link](https://openreview.net/forum?id=QHR4upgkCV).
- C. Wendler, V. Veselovsky, G. Monea, and R. West (2024). Do llamas work in English? on the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 15366–15394. [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820), [Link](https://aclanthology.org/2024.acl-long.820/).
- Z. Wu, X. V. Yu, D. Yogatama, J. Lu, and Y. Kim (2025). The semantic hub hypothesis: language models share semantic representations across languages and modalities. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2411.04986).

## Appendix A Dataset Details

### A.1 Paraphrase Generation

All paraphrases were produced with Claude Opus 4\.6 \(claude\-opus\-4\-6\) using extended thinking \(10,000\-token budget\) at temperature 1\.0, with the sentence to process passed as the user message and the task specified by the system prompts below\. Stage 1 paraphrases the English sentence; Stage 2 translates the English paraphrase into Serbian Cyrillic and Russian Cyrillic\. Serbian Latin is not model\-generated, but a deterministic 1:1 transliteration of the Serbian Cyrillic\.

#### Stage 1: English Paraphrase Prompt

> You are a careful paraphrasing assistant. Given an English sentence, generate a paraphrase that:
>
> 1. Preserves the exact meaning of the original
> 2. Uses different vocabulary and sentence structure where natural
> 3. Sounds like natural, fluent English
> 4. Maintains roughly similar length (within 30% of original word count)
> 5. Keeps proper nouns, numbers, and technical terms exactly as they appear in the original
> 6. Preserves grammatical gender markers throughout. If the original sentence implies a male or female subject (through pronouns, verb forms, or adjective agreements), the paraphrase must preserve that gender. Do not change “he” to “she” or vice versa.
> 7. Preserves the strength and scope of claims. Do not amplify or weaken descriptions:
>    - “sometimes” must not become “often” or “always”
>    - “may” or “can” must not become “will” or “must”
>    - “outbreak” must not become “epidemic”
>    - “some” must not become “most” or “all”
>    - Hedged claims must remain hedged
>    - Specific claims must remain specific (do not generalize)
>
>    If the original is uncertain or qualified, the paraphrase must remain qualified to the same degree.
>
> Return ONLY the paraphrased sentence, no preamble, no explanation, no quotes.

#### Stage 2a: Serbian Cyrillic Translation

> You are a careful translator. Translate the given English sentence into natural, fluent Serbian written in Cyrillic script. Preserve the exact meaning. Use natural Serbian phrasing — do not translate word-for-word from English. Keep proper nouns, numbers, and technical terms in their standard Serbian form (transliterate names if there’s a standard Serbian convention, otherwise keep as in English).
>
> For proper nouns, use the standard form of the name in the target language and script. For Serbian Cyrillic, use standard Serbian Cyrillic forms (e.g., Кенеди for Kennedy, Трамп for Trump, Бжежински for Brzezinski). Do not preserve Latin-script spellings within Cyrillic text. However, do not change which person, place, or entity is being referred to: if the original says “Port Stanley,” the translation should refer to Port Stanley (in target-language form), not “Stanley” or any other place name. Maintain the identity of named entities; localize their surface form to the target script and language conventions.
>
> Preserve grammatical gender markers throughout. If the original sentence implies a male or female subject (through pronouns, verb forms, or adjective agreements), the translation must preserve that gender. In Serbian, maintain consistent gender across all verb forms, pronouns, and adjective endings.
>
> Preserve verb tense and modality exactly:
>
> - Past tense remains past tense
> - Present tense remains present tense
> - Future tense remains future tense
> - Modal verbs (must, may, can, will, should) must map to equivalent strength in Serbian
> - Conditional remains conditional
> - Aspect distinctions (perfective vs imperfective) should match the original’s intent
>
> Return ONLY the translated sentence, no preamble, no explanation, no quotes.

#### Stage 2b: Russian Cyrillic Translation

> You are a careful translator. Translate the given English sentence into natural, fluent Russian written in Cyrillic script. Preserve the exact meaning. Use natural Russian phrasing — do not translate word-for-word from English. Keep proper nouns, numbers, and technical terms in their standard Russian form (transliterate names if there’s a standard Russian convention, otherwise keep as in English).
>
> For proper nouns, use the standard form of the name in the target language and script. For Russian, use standard Russian Cyrillic transliterations (e.g., Кеннеди for Kennedy, Трамп for Trump). Do not preserve Latin-script spellings within Cyrillic text. However, do not change which person, place, or entity is being referred to: if the original says “Port Stanley,” the translation should refer to Port Stanley (in target-language form), not “Stanley” or any other place name. Maintain the identity of named entities; localize their surface form to the target script and language conventions.
>
> Preserve grammatical gender markers throughout. If the original sentence implies a male or female subject (through pronouns, verb forms, or adjective agreements), the translation must preserve that gender. In Russian, maintain consistent gender across all verb forms, pronouns, and adjective endings.
>
> Preserve verb tense and modality exactly:
>
> - Past tense remains past tense
> - Present tense remains present tense
> - Future tense remains future tense
> - Modal verbs (must, may, can, will, should) must map to equivalent strength in Russian
> - Conditional remains conditional
> - Aspect distinctions (perfective vs imperfective) should match the original’s intent
>
> Return ONLY the translated sentence, no preamble, no explanation, no quotes.

### A.2 Validation Instructions

Table 5: Worked examples shown to validators, illustrating the two-axis judgment. Original sentence: *“The committee voted to approve the budget on Tuesday.”* Examples were presented in each validator’s target language; English is shown here for readability.

Native-speaker validators received identical instructions for each language, with the target language substituted throughout. Each validator saw sentence pairs (an original sentence and its candidate paraphrase in the same language and script) and rated each pair on two binary axes. Table [5](https://arxiv.org/html/2606.00356#A1.T5) shows worked examples of the four Meaning and Naturalness combinations.

#### Meaning\.

*Yes* if both sentences make the same factual claim, such that a reader of either would understand the same thing; *No* if the paraphrase adds, removes, or changes the claim. Validators were told not to mark *No* merely for differences in wording, word order, or style, since paraphrases are expected to differ on the surface. They were instructed to mark Meaning *No* whenever any of the following occurred: a flipped negation; a meaning-relevant change of tense or aspect; a changed number; a changed or omitted named entity (person, place, organization, or date); a changed modality; a reversed comparison; a swapped subject and object; or any other omission, weakening, or addition that alters the claim.

#### Naturalness\.

*Yes* if the paraphrase reads as fluent text a native speaker could plausibly have written; *No* if a native reader would notice something off without being told the text was machine-generated. Validators were told they did not need to name a grammatical rule; a native-speaker judgment that the text sounded wrong was sufficient. Mild shifts in formality were not grounds for *No* unless the register itself sounded unnatural.

#### Corrections\.

A pair rated *Yes* on both axes was accepted as is. Otherwise the validator supplied a corrected paraphrase, written in the target script, that preserved the original meaning and read naturally, along with a short note describing the change. If a paraphrase was too broken to judge confidently, validators marked both axes *No* and supplied a correction. Corrected paraphrases enter the dataset flagged separately from accepted model paraphrases, so the two can always be told apart, and no items are discarded.

#### Compensation\.

Validators were recruited through personal networks as native speakers of the target language, and each was paid $125–$200 per batch of 200 sentence pairs. Payment was calibrated to exceed standard crowdsourcing rates for comparable linguistic annotation tasks.

### A.3 Disagreement Resolution

For the 100 sentences in each language judged by both validators, a single decision per item populates the dataset, resolved conservatively. Table [6](https://arxiv.org/html/2606.00356#A1.T6) gives the per-language breakdown. When neither validator flagged the pair, the model paraphrase was kept. When exactly one validator flagged it, we adopted that validator’s correction, honoring the catch. When both flagged the same item with differing corrections, we kept the correction from one designated validator when the sentence’s FLORES+ index was even and from the other when it was odd. Because the index is assigned by the benchmark before validation, this rule was fixed in advance and cannot depend on the content or favor either validator. One-sided disagreements (0, 11, and 23 items for English, Russian, and Serbian) were therefore resolved deterministically by honoring the catch, and the parity tiebreak applied only to both-flagged items with differing corrections, at most 1, 2, and 14 items respectively, each a choice between two validator-approved paraphrases.
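The resolution rule described above is small enough to write out as a function. This is a sketch for illustration, not the released code: the function name `resolve` and its argument layout are hypothetical, and which validator is assigned to even indices is an assumption (the text says only that one designated validator is used for even indices and the other for odd).

```python
from typing import Optional


def resolve(model_paraphrase: str,
            correction_a: Optional[str],
            correction_b: Optional[str],
            flores_index: int) -> str:
    """Pick the single paraphrase that enters the dataset for an overlap item.

    correction_a / correction_b are None when that validator accepted the
    model paraphrase, otherwise the validator's corrected text.
    """
    if correction_a is None and correction_b is None:
        return model_paraphrase      # neither flagged: keep the model output
    if correction_b is None:
        return correction_a          # only A flagged: honor the catch
    if correction_a is None:
        return correction_b          # only B flagged: honor the catch
    if correction_a == correction_b:
        return correction_a          # both flagged with the same fix
    # Both flagged with different fixes: parity tiebreak on the FLORES+ index.
    # Assumption of this sketch: validator A is the one used for even indices.
    return correction_a if flores_index % 2 == 0 else correction_b
```

Since the index is fixed by the benchmark before validation, the last branch depends on nothing the validators wrote.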

Table 6: Validator decisions on the 100-sentence overlap subset, by accept-or-flag outcome (rows sum to 100). Items neither validator flagged kept the model paraphrase; one-sided flags were resolved by honoring the catch; both-flagged items used the parity tiebreak only when the two corrections differed.

### A.4 Detailed Validation Results

This section reports the full per-language validation statistics summarized in the main text. Table [7](https://arxiv.org/html/2606.00356#A1.T7) gives clean-accept and correction rates; the lower Serbian rate reflects rougher generated Serbian, which validators caught and corrected rather than discarded. Table [8](https://arxiv.org/html/2606.00356#A1.T8) breaks those corrections down by what failed: English corrections were almost all meaning fixes, every Russian correction involved naturalness, and Serbian spanned both. Table [9](https://arxiv.org/html/2606.00356#A1.T9) reports how large those corrections were, as the word-level edit distance between each model paraphrase and the validator’s replacement; the median touches one to three words in every language, and only Serbian shows a tail of larger rewrites, visible as a mean fraction well above its median. Table [10](https://arxiv.org/html/2606.00356#A1.T10) reports inter-rater agreement on the 100-sentence overlap subset, measuring how often the two validators for a language reached the same accept-or-correct decision on the items they both judged. We rely on raw agreement and Gwet’s AC1 rather than Cohen’s $\kappa$, since the high accept rates produce skewed marginals under which $\kappa$ is deflated even when raw agreement is high. Agreement is perfect for English, high for Russian, and moderate for Serbian. For English the two validators made the same accept-or-correct decision on all 100 overlap items: the only item that drew a correction was flagged by both, and both supplied the same fix, so raw agreement and AC1 are both 1.0. For Serbian, validators differed mainly on whether a borderline paraphrase warranted a minor fix, not on whether meaning was preserved.
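The preference for AC1 over $\kappa$ under skewed marginals can be seen in a short calculation. The sketch below uses the standard two-rater, two-category formulas (Cohen's chance agreement from the product of the raters' marginals; Gwet's from $2\pi(1-\pi)$, with $\pi$ the mean accept proportion). The function name is hypothetical, and the 96/2/2/0 split is invented purely to show the effect; it is not taken from Table 10.

```python
def agreement_stats(both_accept, only_a_flags, only_b_flags, both_flag):
    """Raw agreement, Cohen's kappa, and Gwet's AC1 for two raters making a
    binary accept/flag decision. Returns (p_obs, kappa, ac1)."""
    n = both_accept + only_a_flags + only_b_flags + both_flag
    p_obs = (both_accept + both_flag) / n

    # Each rater's marginal proportion of "accept" decisions.
    a_accept = (both_accept + only_b_flags) / n
    b_accept = (both_accept + only_a_flags) / n

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = a_accept * b_accept + (1 - a_accept) * (1 - b_accept)
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    # Gwet's AC1 (two categories): chance agreement 2 * pi * (1 - pi).
    pi = (a_accept + b_accept) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1


# Invented, highly skewed example: 96% raw agreement, almost all accepts.
p_obs, kappa, ac1 = agreement_stats(96, 2, 2, 0)
print(round(p_obs, 2), round(kappa, 3), round(ac1, 3))  # 0.96 -0.02 0.958
```

With the English pattern described above (99 items accepted by both validators, one flagged by both), the same function returns 1.0 for raw agreement, $\kappa$, and AC1.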

Table 7: Per-language outcomes over 300 sentences (counts out of 300). Serbian Cyrillic and Latin share these rates, since Latin inherits Cyrillic corrections via transliteration. The correction counts (12, 88, 28) match the totals in Table [8](https://arxiv.org/html/2606.00356#A1.T8).

Table 8: Reason for correction, by language, over corrected paraphrases only. The three categories are mutually exclusive: each corrected item is counted once, by whether it failed on meaning, on naturalness, or on both. Percentages are of the language’s corrected total.

Table 9: Edit magnitude over corrected items only. Edit distances are word-level; “frac.” is that distance as a fraction of sentence length. Serbian Latin tracks Cyrillic to within rounding. The Serbian mean fraction is inflated by a small tail of larger rewrites, while the median stays low.

Table 10: Inter-rater agreement on the 100-sentence overlap subset.

### A.5 Length and Tokenization

We report sentence length in words and in token count, both to characterize the conditions and to address the concern that cross-script effects could reflect tokenization rather than representation. Table [11](https://arxiv.org/html/2606.00356#A1.T11) gives both counts by condition and variant. Word counts are matched across the three conditions: originals, paraphrases, and random partners average $20.2$, $20.9$, and $20.2$ words respectively, with paraphrases running under one word longer than their originals, well inside the $\pm 3$-word matching tolerance. Variants differ modestly in word count (English averages about one word more than Serbian, Russian about one word less), reflecting language-level differences in how the same content is expressed rather than any condition effect.

Table 11: Word and subword token counts by condition and language-script variant, as mean $\pm$ standard deviation with median in parentheses. Word counts are matched across conditions; subword counts differ sharply across scripts. Token counts use the Gemma-3-27B tokenizer.

Table 12: Tokens per word (Gemma-3-27B). The same content costs roughly twice as many tokens in Serbian Cyrillic as in English, and about $15\%$ more in Serbian Cyrillic than in the deterministically transliterated Serbian Latin.

Subword tokenization, by contrast, differs sharply across scripts. Table [12](https://arxiv.org/html/2606.00356#A1.T12) reports tokens per word under the Gemma-3-27B tokenizer. English costs about $1.2$ tokens per word, whereas Serbian Cyrillic costs about $2.3$, so the same roughly $20$-word content expands from about $26$ tokens in English to $45$–$48$ in Serbian Cyrillic. The effect is partly script and partly language: Serbian Latin, a deterministic transliteration of the identical Serbian Cyrillic content, costs about $2.0$ tokens per word, so the Cyrillic script alone adds roughly $15\%$ more tokens for word-for-word identical text. Russian Cyrillic is more efficient than Serbian Cyrillic (about $1.9$ versus $2.3$).
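As a quick check, the ratios quoted above follow from the rounded tokens-per-word figures alone (the unrounded values in Table 12 may differ slightly):

$$
\frac{2.3}{2.0} = 1.15, \qquad \frac{2.3}{1.2} \approx 1.92, \qquad 20 \times 2.3 = 46 .
$$

That is, Serbian Cyrillic costs about $15\%$ more tokens than Serbian Latin for identical text, roughly twice as many as English, and a 20-word sentence lands inside the reported $45$–$48$ token range.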

These large, script\-driven token\-count differences are intrinsic to the comparison and cannot be removed without distorting content: equalizing token counts across scripts would require unmatching the content itself\. Our design therefore controls content length in words and treats subword tokenization as a property of each variant\. This raises the question of whether these token\-count differences themselves drive the feature overlap patterns we report\.

#### Tokenization and feature overlap\.

We test this directly. For each of the 300 Serbian sentence pairs (same sentence in Latin and Cyrillic), we compute the token count difference (Cyrillic minus Latin) and correlate it with the cross-script Jaccard similarity averaged across all layers. Token deltas ranged from $-3$ to $19$ (mean $5.4$, sd $3.6$, IQR $[3, 8]$), confirming that Cyrillic consistently tokenizes into more subwords. The correlation with feature overlap is statistically significant but negligible ($r = -0.172$, $p = 0.003$, $r^{2} = 0.03$): tokenization mismatch accounts for roughly 3% of the variance in Jaccard similarity. Active feature set sizes show no relationship with token count differences at all ($r = 0.027$, $p = 0.639$), ruling out a mechanism in which longer tokenizations produce systematically different feature counts. Combined with last-token pooling, which produces a single representation vector regardless of sequence length, these results indicate that tokenization asymmetry is not a meaningful driver of the cross-script patterns we report. Figure [2](https://arxiv.org/html/2606.00356#A1.F2) shows the relationship visually.
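The correlation test described above can be sketched as follows. This is a minimal illustration, not the authors' code: `pearson_r` is a hypothetical helper, and the `token_delta` and `mean_jaccard` values are made-up toy numbers rather than data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented for illustration only (one entry per sentence pair):
# token count difference (Cyrillic minus Latin) and layer-averaged Jaccard.
token_delta = [2, 5, 8, 3, 11, 6]
mean_jaccard = [0.74, 0.71, 0.73, 0.75, 0.69, 0.72]

r = pearson_r(token_delta, mean_jaccard)
r_squared = r ** 2  # share of variance in Jaccard explained by token delta
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Squaring $r$ gives the share of variance explained, which is how a correlation of $-0.172$ translates into roughly 3% of the variance in Jaccard similarity.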

![Refer to caption](https://arxiv.org/html/2606.00356v1/x2.png)

Figure 2: Cross-script Jaccard similarity (averaged across layers) vs. token count difference (Cyrillic minus Latin) for 300 Serbian sentence pairs. The regression line is nearly flat ($r = -0.17$, $r^{2} = 0.03$).

### A.6 Dataset Composition by Source and Topic

Source-domain composition is given in Table [13](https://arxiv.org/html/2606.00356#A1.T13). Originals and paraphrases preserve the FLORES+ stratification exactly; random partners are sampled within language-script and length-matched, so their domain composition shifts slightly. Table [14](https://arxiv.org/html/2606.00356#A1.T14) gives the finer-grained topic distribution; the dataset spans a broad range of subject matter, with no single topic dominating.

Table 13: Source-domain counts over the $300$ anchors. Originals and paraphrases preserve the FLORES+ stratification exactly; random partners are sampled within language-script and length-matched, shifting domain composition slightly.

Table 14: Topic distribution of the 300 original sentences, grouped into broad categories from FLORES+ metadata. Percentages exceed 100% because some sentences carry multiple topic tags (e.g., “disease, research, canada”).

## Appendix B Ablation Testing

### B.1 SAE Sparsity Level

We compare the small-$L_{0}$ SAEs used in the main text with the large-$L_{0}$ variant from Gemma Scope 2, both at 16K width (Figure [3](https://arxiv.org/html/2606.00356#A2.F3)). Higher $L_{0}$ means more features fire per input, producing larger and noisier active sets. Overlap drops uniformly by $6$–$7$ percentage points on panels A–C (e.g. script mean $0.66$ vs. $0.73$), while meaning is nearly unchanged ($0.33$ vs. $0.34$). The decomposition hierarchy and layer-wise trajectories are fully preserved, confirming that the results are not sensitive to sparsity level.

![Refer to caption](https://arxiv.org/html/2606.00356v1/x3.png)

Figure 3: Leg 1 decomposition comparing small-$L_{0}$ (blue) and large-$L_{0}$ (purple) Gemma Scope 2 SAEs at 16K width on Gemma-3-27B. Dashed grey lines show baselines. The factorial ordering and depth structure are preserved; the uniform downward shift reflects noisier active sets under higher $L_{0}$.

### B.2 SAE Dictionary Width

We replicate the Leg 1 decomposition on Gemma-3-27B using 262K-width Gemma Scope 2 SAEs, a $16\times$ increase over the primary 16K configuration (Figure [4](https://arxiv.org/html/2606.00356#A2.F4)). Absolute Jaccard values are $2$–$4$ percentage points lower across all panels (e.g. script mean $0.70$ vs. $0.73$), consistent with a larger dictionary spreading activations more thinly and reducing set overlap mechanically. The layer-wise trajectories and the ordering across panels are nearly identical to the 16K results, confirming that the factorial decomposition is not an artifact of dictionary size.

![Refer to caption](https://arxiv.org/html/2606.00356v1/x4.png)

Figure 4: Leg 1 decomposition comparing 16K-width (blue) and 262K-width (orange) Gemma Scope 2 SAEs on Gemma-3-27B. Dashed grey lines show baselines. The layer-wise shape and panel ordering are preserved; the uniform downward shift reflects sparser feature activation in the wider dictionary.

### B.3 Pooling Strategy

We compare last-token, mean, and max pooling on Gemma-3-27B (Figure [5](https://arxiv.org/html/2606.00356#A2.F5)). Max pooling is uninformative: element-wise maxima across token positions yield Jaccard above $0.94$ in every panel, including meaning, collapsing all distinctions. Mean pooling preserves some structure but inflates wording and meaning overlap (C $= 0.79$, D $= 0.46$ vs. $0.63$ and $0.34$ for last-token), compressing the very differences the paradigm is designed to measure. Last-token pooling gives the widest dynamic range across panels ($0.34$–$0.73$) and the cleanest separation between main conditions and baselines, confirming it as the most discriminative choice. The factorial ordering (script $>$ wording $>$ language $>$ meaning) holds under both last-token and mean pooling; only the effect sizes differ.
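The three pooling strategies and the resulting set overlap can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: `pool`, `active_set`, and `jaccard` are hypothetical helpers, the rule that a feature is active when its pooled activation exceeds `threshold` is an assumption (this section does not specify it), and the activation matrices are toy values.

```python
import numpy as np

def pool(acts, strategy):
    """Collapse a (tokens, features) activation matrix to one feature vector."""
    if strategy == "last":
        return acts[-1]            # activation at the final token position
    if strategy == "mean":
        return acts.mean(axis=0)   # average over token positions
    if strategy == "max":
        return acts.max(axis=0)    # element-wise maximum over token positions
    raise ValueError(f"unknown pooling strategy: {strategy}")

def active_set(vec, threshold=0.0):
    """Indices of features whose pooled activation exceeds `threshold` (assumed rule)."""
    return set(np.flatnonzero(vec > threshold).tolist())

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy SAE activations for two inputs: 3 token positions x 4 features.
acts_a = np.array([[1.0, 0.0, 0.0, 2.0],
                   [0.0, 3.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.5]])
acts_b = np.array([[0.0, 0.0, 1.0, 0.0],
                   [2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.5, 0.0, 1.0]])

for strategy in ("last", "mean", "max"):
    sim = jaccard(active_set(pool(acts_a, strategy)),
                  active_set(pool(acts_b, strategy)))
    print(strategy, sim)
```

With `threshold=0` and nonnegative activations, mean and max pooling select exactly the same features (anything that fires at any position), so a difference between the two depends on the thresholding rule. Either way, both aggregate over every token position and therefore produce larger active sets than last-token pooling.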

![Refer to caption](https://arxiv.org/html/2606.00356v1/x5.png)

Figure 5: Pooling strategy comparison on Gemma-3-27B. Max pooling (green) saturates near 1.0 across all panels; mean pooling (pink) compresses cross-panel differences; last-token pooling (blue) preserves the full factorial decomposition with the widest dynamic range. Dashed grey lines show the corresponding baselines for last-token pooling.

### B.4 Model Scale

To verify that our findings are not specific to Gemma-3-27B, we replicate the full Leg 1 factorial decomposition on Gemma-3-1B (25 layers) and Gemma-3-12B (47 layers) using the same dataset, SAE configuration (16K width, small $L_{0}$), and pipeline. Figure [6](https://arxiv.org/html/2606.00356#A2.F6) overlays the four panels across all three model sizes on a normalized depth axis.

The core structure holds across scale: in all three models, script overlap exceeds language overlap, meaning rises well above its random baseline, and wording stays roughly flat. Script invariance increases with model size (mean Jaccard $0.65$, $0.72$, $0.73$ for 1B, 12B, and 27B respectively), and the gap above floor grows accordingly ($0.36 \to 0.46 \to 0.52$), indicating that larger models separate content from script more cleanly. The depth trajectories are noisier at 1B, where the model has fewer layers to differentiate, but the qualitative ordering of the four panels is preserved. We conclude that the factorial decomposition reported in the main text is robust to model scale within the Gemma family.

### B.5 Cross-Architecture: Llama-3.1-8B

To verify that our findings are not specific to the Gemma architecture or Gemma Scope 2 SAEs, we replicate the Leg 1 decomposition on Llama-3.1-8B-Base using Llama Scope SAEs (He et al., [2024](https://arxiv.org/html/2606.00356#bib.bib26)): 32K-width ($8\times$ expansion) post-MLP residual stream SAEs at all 32 layers. These SAEs are trained with TopK activation ($k = 50$) and decoder-norm-weighted selection, then converted to JumpReLU with per-layer thresholds for independent feature activation at inference; we define the active set as features with nonzero post-JumpReLU activation. This setup differs from our primary analysis in model family, SAE training regime, dictionary width (32K vs. 16K), and tokenizer.
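The active-set definition above can be sketched as follows, using the standard JumpReLU form in which a pre-activation passes through unchanged when it exceeds its threshold and is zeroed otherwise. This is an illustrative sketch only: `jumprelu` and `active_features` are hypothetical helper names, and the pre-activations and threshold are toy values, not Llama Scope parameters.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """JumpReLU: keep a pre-activation if it exceeds the threshold, else output 0."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

def active_features(pre_acts, threshold):
    """Active set: indices of features with nonzero post-JumpReLU activation."""
    return set(np.flatnonzero(jumprelu(pre_acts, threshold)).tolist())

# Toy pre-activations for four features and a single per-layer threshold
# (both invented for illustration).
pre_acts = np.array([0.2, 1.5, -0.3, 0.9])
layer_threshold = 0.8

print(jumprelu(pre_acts, layer_threshold))
print(active_features(pre_acts, layer_threshold))
```

Unlike TopK, where features compete for a fixed number of $k$ slots, each feature here is thresholded independently, so the size of the active set can vary from input to input.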

The qualitative structure replicates (Figure [7](https://arxiv.org/html/2606.00356#A2.F7)). Script overlap is highest and dips in late layers, meaning peaks in the middle of the network and stays well above its random baseline, and wording remains in the $0.4$–$0.6$ range throughout. Absolute Jaccard values are lower across the board (e.g. script mean $0.48$ vs. $0.73$ for Gemma-3-27B), consistent with a smaller model having less robust cross-lingual representations, but the ordering across panels and the depth trajectories are preserved. The factorial decomposition is therefore not an artifact of the Gemma architecture or Gemma Scope 2 SAEs.

![Refer to caption](https://arxiv.org/html/2606.00356v1/x6.png)

Figure 6: Leg 1 decomposition across Gemma-3 model sizes (1B, 12B, 27B), plotted on normalized depth (300 sentences, 95% bootstrap CIs). Solid lines show the main comparison for each panel; thin grey dashed lines show the corresponding random-partner baseline averaged across all three model sizes (per-model baselines are nearly identical). Panel (c) retains the identity ceiling at 1.0. The qualitative structure holds at all three scales: script overlap remains highest, meaning stays well above its baseline, and wording is roughly flat (Table [2](https://arxiv.org/html/2606.00356#S3.T2)).

![Refer to caption](https://arxiv.org/html/2606.00356v1/x7.png)

Figure 7: Leg 1 decomposition for Llama-3.1-8B-Base with Llama Scope SAEs (32K width, 32 layers, 300 sentences). Solid lines show the main comparison; dashed lines show random-partner baselines. Despite a different model family, SAE training regime (TopK-trained, JumpReLU-converted), and dictionary width (32K vs. 16K), the qualitative structure matches Gemma-3-27B (Figure [1](https://arxiv.org/html/2606.00356#S4.F1)): script overlap highest, meaning above baseline peaking mid-network, wording roughly flat.

## Appendix C Auto Interpretation

Tables [15](https://arxiv.org/html/2606.00356#A3.T15) and [16](https://arxiv.org/html/2606.00356#A3.T16) report the full miss-rate breakdown for the neutral content pool under Sonnet and Gemini labels, respectively. Each table includes within-language paraphrase miss rates (Eng-para, Rus-para), cross-language Serbian miss rates for both originals and paraphrases, and random baselines confirming content-selectivity (77–96% miss on unrelated sentences). Table [17](https://arxiv.org/html/2606.00356#A3.T17) compares the two labelers side by side: despite classifying different subsets as content-claim, both converge on the same Cyrillic/Latin asymmetry ratio at every layer (e.g., $1.40\times$ vs. $1.41\times$ at layer 53). Table [18](https://arxiv.org/html/2606.00356#A3.T18) provides 20 additional feature vignettes beyond the three in the main text, spanning all four SAE layers and both failure directions, with full English sentences and dual-labeler agreement.
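The miss rates, the Cyrillic/Latin asymmetry ratio, and their bootstrap confidence intervals can be sketched as below. This is a minimal percentile-bootstrap illustration, not the authors' code: `bootstrap_ci`, `miss_rate_cyrillic`, `miss_rate_latin`, and `asymmetry_ratio` are hypothetical helpers, resampling whole (feature, sentence) pairs is an assumption about the resampling unit, and the miss indicators are synthetic.

```python
import numpy as np

def bootstrap_ci(rows, stat, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement and recompute `stat`."""
    rng = np.random.default_rng(seed)
    rows = np.asarray(rows)
    n = len(rows)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        sample = rows[rng.integers(0, n, size=n)]  # n rows drawn with replacement
        estimates[b] = stat(sample)
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return stat(rows), lower, upper

def miss_rate_cyrillic(rows):
    return rows[:, 0].mean()

def miss_rate_latin(rows):
    return rows[:, 1].mean()

def asymmetry_ratio(rows):
    """Cyrillic miss rate divided by Latin miss rate (undefined if the latter is 0)."""
    return rows[:, 0].mean() / rows[:, 1].mean()

# Synthetic miss indicators, one row per (feature, sentence) pair:
# column 0 = feature missed the Serbian Cyrillic form, column 1 = missed the Latin form.
data_rng = np.random.default_rng(42)
n_pairs = 500
misses = np.column_stack([
    data_rng.random(n_pairs) < 0.35,
    data_rng.random(n_pairs) < 0.25,
]).astype(float)

for name, stat in [("Cyrillic miss rate", miss_rate_cyrillic),
                   ("Latin miss rate", miss_rate_latin),
                   ("Cyrillic/Latin ratio", asymmetry_ratio)]:
    point, lower, upper = bootstrap_ci(misses, stat)
    print(f"{name}: {point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Resampling rows rather than columns keeps the Cyrillic and Latin outcomes for the same pair together, so the interval for the ratio reflects their joint variation.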

Table 15: Full Sonnet content-label miss rates over the maxvar content pool (Eng-orig $\cap$ Rus-para). Content $N$: L16 = 1,072, L31 = 2,236, L40 = 2,143, L53 = 1,392. Random baselines confirm content selectivity (features rarely fire on unrelated sentences). 95% bootstrap CIs ($10{,}000$ resamples).

Table 16: Full Gemini content-label miss rates over the maxvar content pool (Eng-orig $\cap$ Rus-para). Content $N$: L16 = 1,503, L31 = 2,444, L40 = 2,744, L53 = 2,067. Gemini classifies more features as content-claim than Sonnet, yielding larger $N$; the asymmetry pattern is consistent across labelers. 95% bootstrap CIs ($10{,}000$ resamples).

Table 17: Sonnet vs. Gemini labeler comparison over the maxvar content pool. Content $N$ is the number of (feature, sentence) pairs classified as content-claim. This count differs between labelers because Sonnet and Gemini describe features differently, producing distinct content-claim subsets from the same underlying activation pool; the converging ratios confirm the signal is robust to labeler choice.

Table 18: Extended examples of content-labeled features whose auto-interp labels fail on one Serbian script. All verifiable on Neuronpedia (gemma-3-27b, SAE {layer}-gemmascope-2-res-16k).

## Appendix D Leg 2 Across Model Scale

We replicate the Leg 2 auto-interpretation evaluation on Gemma-3-1B and Gemma-3-12B. The core pattern holds at both scales (Tables [19](https://arxiv.org/html/2606.00356#A4.T19) and [20](https://arxiv.org/html/2606.00356#A4.T20)): Serbian miss rates exceed within-language floors, the Cyrillic/Latin asymmetry grows with depth (reaching $1.18\times$ at 1B and $1.34\times$ at 12B), and miss rates are ordered by estimated training coverage. Miss rates are notably higher at 1B (up to $58.2\%$ at layer 22), consistent with fewer features encoding Serbian content in the smaller model. Early-layer results at 1B should be interpreted with caution due to small pool sizes ($n = 82$ at layer 7).

Table 19: Content-feature miss rates for Gemma-3-12B (Gemini labels). 95% bootstrap CIs in brackets (10,000 resamples). (a) Within- vs. cross-language miss rates. (b) Serbian script asymmetry.

Table 20: Content-feature miss rates for Gemma-3-1B (Gemini labels). 95% bootstrap CIs in brackets (10,000 resamples). Pool sizes are smaller at early layers ($n = 82$ at layer 7); interpret early-layer results with caution. (a) Within- vs. cross-language miss rates. (b) Serbian script asymmetry.
