Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative

arXiv cs.CL Papers

Summary

This paper demonstrates that mean-pooled cosine similarity is not length-invariant under anisotropic representations, showing it artificially inflates similarity with sequence length. It argues for using Centered Kernel Alignment (CKA) as a default metric to correct biases in cross-lingual and cross-representation analysis.

arXiv:2605.07345v1 Announce Type: new Abstract: Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.

# Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative
Source: [https://arxiv.org/html/2605.07345](https://arxiv.org/html/2605.07345)
###### Abstract

Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$–$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than $3\%$ of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by $83\%$ and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN–FR, $R^2 = 0.33$ EN–DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.

representation similarity, mean pooling, cosine similarity, CKA, anisotropy, cross-lingual analysis, mechanistic interpretability

## 1 Introduction

Comparing how a neural network represents different inputs is a foundational task in mechanistic interpretability. The standard procedure averages token-level hidden states into a single vector per input and reports cosine similarity between such vectors. This mean-pooled cosine similarity has become the de facto metric for cross-representation comparison and underpins recent claims that multilingual LLMs "think in English" (Schut et al., [2025](https://arxiv.org/html/2605.07345#bib.bib14); Wendler et al., [2024](https://arxiv.org/html/2605.07345#bib.bib15)) and that code LLMs route through Python (Yin et al., [2025](https://arxiv.org/html/2605.07345#bib.bib16); Kargaran et al., [2025](https://arxiv.org/html/2605.07345#bib.bib6)).

We show that the metric is not length-invariant. Under anisotropic representations (Ethayarajh, [2019](https://arxiv.org/html/2605.07345#bib.bib3); Mu & Viswanath, [2018](https://arxiv.org/html/2605.07345#bib.bib8); Gao et al., [2019](https://arxiv.org/html/2605.07345#bib.bib4)), the regime in which all modern transformers operate, mean-pooled cosine grows monotonically with sequence length, regardless of content. The mechanism is a $1/\sqrt{n}$ concentration of pooled vectors toward the shared anisotropy direction. The dependence is large enough to dominate published-style cross-language similarity analyses.

#### Key findings.

- F1. Mean-pooled cosine is monotonically increasing in sequence length under anisotropy (Prop. [1](https://arxiv.org/html/2605.07345#Thmproposition1)), validated by a synthetic experiment on random vectors with no model involvement.
- F2. Across four code LLMs and 164 HumanEvalPack problems, length ratio alone explains 52–75% of variance in Python proximity. AST depth and shared tokens add less than 3%.
- F3. Substituting CKA on the same data reduces explained variance by 83% and *flips* the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The metric-level conclusion about Python proximity reverses.
- F4. The artifact is not code-specific: Mistral-7B on WMT EN–FR shows $R^2 = 0.23$, EN–DE shows $R^2 = 0.33$. CKA on the same data has $R^2 < 0.01$.
- F5. In CLIP ViT-B/32, mean-pooling reduces the length effect ($R^2 < 0.01$) compared with EOS-pooling ($R^2 = 0.21$). The artifact requires anisotropy and is suppressed by CLIP's contrastive head.

#### What we are and are not claiming.

We are not claiming that representational convergence across languages or modalities is illusory. Our positive controls on Mistral-7B for French→English routing show that genuine convergence exists and is detectable with appropriate metrics. We are claiming that mean-pooled cosine cannot distinguish genuine convergence from a tokenizer length differential, and that the languages used as reference points in prior work (English; Python) are precisely the ones with systematically shorter tokenizations. The metric and the substantive conclusions therefore co-vary in exactly the way that produces a confound.

## 2 Related Work

#### Mean-pooled cosine in cross-lingual analysis.

Wendler et al. ([2024](https://arxiv.org/html/2605.07345#bib.bib15)) report that Llama-2 representations cluster around an English-language pivot in middle layers, using mean-pooled cosine on parallel inputs. Schut et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib14)) extend this with logit-lens and probing, but treat mean-pooled cosine as the primary similarity metric. Yin et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib16)) apply the same protocol to code LLMs and report that hidden states for Java/JavaScript/Go are "Python-proximate." Kargaran et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib6)) use the same family of measurements for low-resource languages. None of these papers control for sequence length in the metric.

#### Tokenization disparities across languages.

Petrov et al. ([2023](https://arxiv.org/html/2605.07345#bib.bib10)) document large, systematic differences in token counts across languages for parallel content: the same sentence may take 1.5–5× more tokens in some languages than in English, and Python is one of the most token-compact programming languages. The tokenizer-induced length differential is the input to our artifact mechanism.

#### Anisotropy in transformers.

Ethayarajh ([2019](https://arxiv.org/html/2605.07345#bib.bib3)) showed that contextualized representations from BERT, ELMo, and GPT-2 occupy a narrow cone in embedding space rather than being uniformly distributed. Gao et al. ([2019](https://arxiv.org/html/2605.07345#bib.bib4)) and Mu & Viswanath ([2018](https://arxiv.org/html/2605.07345#bib.bib8)) document the problem and propose mitigations. Anisotropy is the precondition for our artifact: in a fully isotropic representation, mean-pooling does not preferentially shrink toward any direction.

#### Centered Kernel Alignment.

Kornblith et al. ([2019](https://arxiv.org/html/2605.07345#bib.bib7)) introduced linear and kernel CKA as length- and rotation-invariant similarity measures for neural representations. Linear CKA operates on full token-level matrices rather than pooled vectors and is not affected by the $1/\sqrt{n}$ pooling concentration that drives the artifact we describe. The RV coefficient (Robert & Escoufier, [1976](https://arxiv.org/html/2605.07345#bib.bib12)) is its statistical antecedent.

#### Multilingual representation studies.

The hypothesis that multilingual transformers internally translate to a pivot language is a long-standing one (Conneau et al., [2020](https://arxiv.org/html/2605.07345#bib.bib1)); the recent revival in interpretability builds on it. Our results imply that the metric used in the latest wave of evidence may be confounded.

## 3 Mechanism

### 3.1 Anisotropy and mean pooling

Modern transformer hidden states are anisotropic: with high probability, two random states from the same model and layer have a positive cosine similarity, often exceeding 0.5 (Ethayarajh, [2019](https://arxiv.org/html/2605.07345#bib.bib3)). This is well-modeled by writing each token state as

$$h_i \;=\; \mu + \sigma\epsilon_i, \qquad \|\mu\| \gg \sigma\sqrt{d}, \tag{1}$$

where $\mu \in \mathbb{R}^d$ is a layer-specific shared direction and $\epsilon_i$ is zero-mean noise. Mean pooling over $n$ tokens gives

$$\bar{h} \;=\; \frac{1}{n}\sum_{i=1}^{n} h_i \;=\; \mu + \frac{\sigma}{\sqrt{n}}\,\bar{\epsilon}, \tag{2}$$

where $\bar{\epsilon}$ has $\mathbb{E}\|\bar{\epsilon}\|^2 = d$. The shared direction $\mu$ is preserved exactly; the noise term shrinks as $1/\sqrt{n}$. Pooled representations of longer sequences therefore lie closer to $\mu$, and consequently closer (in cosine) to *any* pooled vector from the same distribution.
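To make the mechanism concrete, here is a minimal numpy sketch of Eqs. (1)–(2); the dimension, $\|\mu\|$, and $\sigma$ values are illustrative choices, not the paper's experimental settings. The two sequences share nothing but $\mu$, yet their pooled vectors become more cosine-similar as the sequences get longer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                  # hidden dimension (assumed for illustration)
mu = rng.normal(size=d)
mu *= 10.0 / np.linalg.norm(mu)           # shared anisotropy direction, ||mu|| = 10
sigma = 1.0

def pooled(n):
    """Mean-pool n token states drawn from h_i = mu + sigma * eps_i (Eqs. 1-2)."""
    H = mu + sigma * rng.normal(size=(n, d))
    return H.mean(axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for n in (5, 20, 100, 500):
    sims = [cos(pooled(n), pooled(n)) for _ in range(200)]
    print(f"n={n:4d}  mean cosine between independent pooled vectors: {np.mean(sims):.3f}")
# Cosine rises with n even though the sequences share no content beyond mu.
```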

### 3.2 Formal result

###### Proposition 1 (Length dependence of mean-pooled cosine).

Let $\{x_1,\ldots,x_m\}$ and $\{y_1,\ldots,y_n\}$ be independent sequences in $\mathbb{R}^d$ drawn i.i.d. from a distribution with mean $\mu \neq 0$ and covariance $\sigma^2 I_d$. Let $\bar{x} = \tfrac{1}{m}\sum_i x_i$, $\bar{y} = \tfrac{1}{n}\sum_j y_j$. For $d \gg 1$ and $\|\mu\|^2 = \Theta(d\mu_{\mathrm{comp}}^2)$,

$$\mathbb{E}\!\left[\cos(\bar{x},\bar{y})\right] \;\approx\; \frac{1}{\sqrt{1+\tfrac{\sigma^2 d}{m\|\mu\|^2}}\,\sqrt{1+\tfrac{\sigma^2 d}{n\|\mu\|^2}}}, \tag{3}$$

which is strictly increasing in both $m$ and $n$.

###### Proof sketch.

By the central limit theorem, $\bar{x} = \mu + (\sigma/\sqrt{m})\epsilon_x$ with $\epsilon_x \sim \mathcal{N}(0, I_d)$, and analogously for $\bar{y}$.

*Numerator.* $\langle\bar{x},\bar{y}\rangle = \|\mu\|^2 + (\sigma/\sqrt{m})\mu^\top\epsilon_x + (\sigma/\sqrt{n})\mu^\top\epsilon_y + (\sigma^2/\sqrt{mn})\epsilon_x^\top\epsilon_y$. The two single-noise cross terms have mean zero. In high dimensions $\epsilon_x^\top\epsilon_y = \mathcal{O}(\sqrt{d})$ while $\|\mu\|^2 = \Theta(d\mu_{\mathrm{comp}}^2)$, so the cross-noise term is dominated. Hence $\mathbb{E}\langle\bar{x},\bar{y}\rangle = \|\mu\|^2$.

*Denominator.* $\|\bar{x}\|^2 = \|\mu\|^2 + (2\sigma/\sqrt{m})\mu^\top\epsilon_x + (\sigma^2/m)\|\epsilon_x\|^2$. Since $\|\epsilon_x\|^2$ concentrates around $d$, $\mathbb{E}\|\bar{x}\|^2 = \|\mu\|^2 + \sigma^2 d/m$. Analogously for $\bar{y}$.

*Assembly.* $\mathbb{E}\cos(\bar{x},\bar{y}) \approx \|\mu\|^2/\sqrt{(\|\mu\|^2+\sigma^2 d/m)(\|\mu\|^2+\sigma^2 d/n)}$, which simplifies to Eq. ([3](https://arxiv.org/html/2605.07345#S3.E3)). The function $k \mapsto \sqrt{1+c/k}$ is decreasing in $k$ for $c > 0$, so the denominator is decreasing in both $m$ and $n$ and the cosine is increasing in both. ∎

### 3.3 When the artifact is large

The size of the effect is governed by the dimensionless quantity $\rho = \sigma^2 d / \|\mu\|^2$, the ratio of noise energy to shared-direction energy. Plugging into Eq. ([3](https://arxiv.org/html/2605.07345#S3.E3)) and Taylor expanding for moderate $\rho/n$,

$$\mathbb{E}\cos(\bar{x},\bar{y}) \;\approx\; 1 - \tfrac{1}{2}\rho\!\left(\tfrac{1}{m}+\tfrac{1}{n}\right) + O(\rho^2). \tag{4}$$

Two predictions follow. First, the artifact is largest when the lengths are short and asymmetric: $|1/m - 1/n|$ is the relevant signal. Second, the artifact scales linearly in anisotropy ($\rho$); a model that has been trained or post-processed to suppress anisotropy will exhibit a smaller effect. Both predictions are consistent with the cross-domain pattern in Section [5](https://arxiv.org/html/2605.07345#S5): code LLMs (highly anisotropic, short sequences) show the largest effect; CLIP after the contrastive projection head (lower anisotropy) shows essentially none.
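In practice one can estimate $\rho$ directly from a layer's token states and plug it into Eq. (3) to predict how large the artifact will be for a given pair of lengths. A minimal sketch under the model of Eq. (1); treating the empirical mean state as $\mu$ is our simplifying assumption, and the numbers below are illustrative:

```python
import numpy as np

def estimate_rho(H):
    """Estimate rho = sigma^2 d / ||mu||^2 from a (tokens x dim) matrix of hidden states.

    mu is taken as the empirical mean state; sigma^2 d is the mean squared
    norm of the residuals (the total noise energy per token).
    """
    mu = H.mean(axis=0)
    noise_energy = ((H - mu) ** 2).sum(axis=1).mean()   # ~ sigma^2 * d
    return noise_energy / (mu @ mu)

def predicted_cosine(rho, m, n):
    """Expected mean-pooled cosine between two content-free sequences (Eq. 3)."""
    return 1.0 / np.sqrt((1 + rho / m) * (1 + rho / n))

# Synthetic check with assumed parameters: rho = sigma^2 d / ||mu||^2 = 1024 / 100.
rng = np.random.default_rng(1)
d, sigma = 1024, 1.0
mu = rng.normal(size=d)
mu *= 10.0 / np.linalg.norm(mu)
H = mu + sigma * rng.normal(size=(2000, d))
rho_hat = estimate_rho(H)
print(f"rho_hat = {rho_hat:.1f}, predicted cos(m=30, n=60) = {predicted_cosine(rho_hat, 30, 60):.3f}")
```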

### 3.4 Why CKA is immune

Linear CKA (Kornblith et al., [2019](https://arxiv.org/html/2605.07345#bib.bib7)) compares full token-level representation matrices $X \in \mathbb{R}^{n_X \times d}$ and $Y \in \mathbb{R}^{n_Y \times d}$ (with $n_X = n_Y$ on aligned tokens) via

$$\mathrm{CKA}(X,Y) \;=\; \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \cdot \|Y^\top Y\|_F}. \tag{5}$$

There is no pooling step, so the $1/\sqrt{n}$ noise concentration that drives the artifact does not arise. CKA is also invariant under invertible linear transformations of the column space, which is exactly what shifting along $\mu$ amounts to. The trade-off is that linear CKA requires aligned position counts; we use shared-surface-form alignment to obtain such pairs.
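A direct numpy implementation of Eq. (5) as written above (uncentered, on position-aligned token matrices); the rotation check at the end illustrates the invariance property, with array sizes chosen only for the demo:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between (tokens x dim) matrices of position-aligned token states (Eq. 5)."""
    assert X.shape[0] == Y.shape[0], "rows must be position-aligned"
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Invariance demo: rotating the feature space leaves CKA unchanged, unlike pooled cosine.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 256))
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))    # random orthogonal matrix
print(linear_cka(X, X), linear_cka(X, X @ Q))       # both equal 1.0
```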

### 3.5 Synthetic validation

To verify the mechanism without any model, we generate 200 pairs of random vectors in $\mathbb{R}^{4096}$ (matching CodeLlama-7B's hidden dimension) from $h_i = \mu + \epsilon_i$ with $\|\mu\| = 10$ and $\sigma = 1$. Each pair has lengths $n_1 = 100$ and $n_2 = \lfloor 100/r \rfloor$ where $r \sim \mathrm{Uniform}(0.3, 1.0)$. We then compute mean-pooled cosine and CKA on aligned subsets.

![Figure 1](https://arxiv.org/html/2605.07345v1/synthetic_length_bias.png)

Figure 1: Synthetic validation of the mechanism. 200 pairs of random anisotropic vectors in $\mathbb{R}^{4096}$ at varying length ratios. Mean-pooled cosine (left) tracks the length ratio; CKA on aligned subsets of the same vectors (right) does not. No model or semantics are involved.

Figure [1](https://arxiv.org/html/2605.07345#S3.F1) shows that mean-pooled cosine tracks the length ratio exactly as Eq. ([3](https://arxiv.org/html/2605.07345#S3.E3)) predicts, while CKA is flat. This rules out any explanation involving model behaviour, language structure, or content semantics: the dependence is intrinsic to the mean-pooling+cosine pipeline applied to anisotropic vectors.
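A minimal sketch of this synthetic protocol. The aligned-subset step is approximated here by truncating both sequences to the first $n_1$ positions, which is our stand-in for the paper's alignment procedure; the qualitative contrast (cosine tracks the length ratio, CKA does not) is what the sketch reproduces.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, n1 = 4096, 1.0, 100
mu = rng.normal(size=d)
mu *= 10.0 / np.linalg.norm(mu)                      # ||mu|| = 10, sigma = 1 as in the text

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def linear_cka(X, Y):
    # Eq. (5) evaluated via the n x n Gram matrices (same value, cheaper when d >> n).
    Kx, Ky = X @ X.T, Y @ Y.T
    return (Kx * Ky).sum() / (np.linalg.norm(Kx, "fro") * np.linalg.norm(Ky, "fro"))

ratios, cosines, ckas = [], [], []
for _ in range(200):
    r = rng.uniform(0.3, 1.0)
    n2 = int(100 / r)
    X = mu + sigma * rng.normal(size=(n1, d))        # h_i = mu + eps_i
    Y = mu + sigma * rng.normal(size=(n2, d))
    ratios.append(min(n1, n2) / max(n1, n2))
    cosines.append(cos(X.mean(axis=0), Y.mean(axis=0)))
    ckas.append(linear_cka(X, Y[:n1]))               # aligned subset: first n1 positions

for name, vals in (("mean-pooled cosine", cosines), ("linear CKA", ckas)):
    r2 = np.corrcoef(ratios, vals)[0, 1] ** 2
    print(f"{name}: R^2 against length ratio = {r2:.3f}")
```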

## 4 Experimental Setup

#### Models.

Code: CodeLlama-7B, CodeLlama-7B-Python, CodeLlama-13B (Rozière et al., [2023](https://arxiv.org/html/2605.07345#bib.bib13)), and Qwen2.5-Coder-7B (Hui et al., [2024](https://arxiv.org/html/2605.07345#bib.bib5)). NLP: Mistral-7B-v0.1. Vision: CLIP ViT-B/32 (Radford et al., [2021](https://arxiv.org/html/2605.07345#bib.bib11)).

#### Datasets.

Code: HumanEvalPack (Muennighoff et al., [2023](https://arxiv.org/html/2605.07345#bib.bib9)), 164 programming problems with parallel solutions in Python, Java, JavaScript, and Go. NLP: WMT14 French–English (442 sentence pairs) and WMT16 German–English (428 pairs), filtered for at least three shared surface-form tokens to enable position-aligned CKA. Vision: 400 synthetic captions of varying length paired with a fixed random-noise image, processed through CLIP's text and image encoders.

#### Metrics.

For each pair of inputs, we compute (i) mean-pooled cosine: $\cos(\bar{h}_A, \bar{h}_B)$ where $\bar{h} = \tfrac{1}{T}\sum_t h_t$ averages over token positions, then averaging across middle layers (layers $n/4$ to $3n/4$); (ii) linear CKA (Kornblith et al., [2019](https://arxiv.org/html/2605.07345#bib.bib7)) on aligned token positions; (iii) the RV coefficient (Robert & Escoufier, [1976](https://arxiv.org/html/2605.07345#bib.bib12)), the matrix generalization of squared correlation.
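A sketch of metric (i), assuming per-layer hidden states are already available as (tokens × dim) arrays (for example from a Hugging Face model called with `output_hidden_states=True`); the helper and its argument names are illustrative:

```python
import numpy as np

def mean_pooled_cosine(layers_a, layers_b):
    """Mean-pooled cosine averaged over middle layers.

    layers_a / layers_b: lists of (tokens x dim) arrays, one per transformer layer.
    """
    n_layers = len(layers_a)
    lo, hi = n_layers // 4, 3 * n_layers // 4        # middle layers n/4 .. 3n/4
    sims = []
    for A, B in zip(layers_a[lo:hi], layers_b[lo:hi]):
        a, b = A.mean(axis=0), B.mean(axis=0)        # mean-pool over token positions
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```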

#### Dependent variable.

For code, we follow the convention of Yin et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib16)) and define Python proximity for target language $L$ as $\mathrm{sim}(\text{Python}, L) - \tfrac{1}{|S|-1}\sum_{L' \neq L} \mathrm{sim}(L', L)$, averaged across middle layers. For NLP, we use the cosine (or CKA) directly between English and the target-language representations.
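For concreteness, a sketch of the Python-proximity computation given a per-problem table of pairwise similarities; the dictionary layout is an assumption for illustration, and the sum follows the formula above literally (every language other than the target enters the baseline):

```python
def python_proximity(sim, target, languages=("python", "java", "javascript", "go")):
    """sim(Python, target) minus the mean similarity of target to the other languages.

    sim: dict keyed by sorted language pairs, e.g. sim[("java", "python")] = 0.83;
    `target` is one of the non-Python languages.
    """
    def get(a, b):
        return sim[tuple(sorted((a, b)))]
    others = [lang for lang in languages if lang != target]     # the |S| - 1 other languages
    baseline = sum(get(lang, target) for lang in others) / len(others)
    return get("python", target) - baseline
```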

#### Confounds.

We regress the dependent variable on three predictors: length ratio ($\min/\max$ token counts), AST depth range (max minus min syntax-tree depth across languages, code only), and shared-token fraction (proportion of token surface forms appearing in both languages). All variables are standardized; reported coefficients are standardized $\beta$-weights.
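A minimal sketch of the standardized regression (ordinary least squares on z-scored variables, so the fitted coefficients are the standardized $\beta$-weights reported in the tables); variable names are illustrative:

```python
import numpy as np

def standardized_regression(y, predictors):
    """OLS with z-scored outcome and predictors; returns standardized betas and R^2.

    predictors: dict of name -> 1-D array (e.g. length ratio, AST depth range,
    shared-token fraction), each of the same length as y.
    """
    z = lambda v: (v - v.mean()) / v.std()
    names = list(predictors)
    X = np.column_stack([z(np.asarray(predictors[k], float)) for k in names])
    yz = z(np.asarray(y, float))
    beta, *_ = np.linalg.lstsq(X, yz, rcond=None)    # no intercept needed after z-scoring
    r2 = 1.0 - ((yz - X @ beta) ** 2).sum() / (yz ** 2).sum()
    return dict(zip(names, beta)), r2
```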

## 5 Results

### 5.1 Length explains nearly all of Python proximity

Table [1](https://arxiv.org/html/2605.07345#S5.T1) shows multiple regression results across four code LLMs. In every model, length ratio is the dominant predictor, with $\beta = 0.74$–$0.88$ and $p < 10^{-27}$, explaining $R^2 = 0.52$–$0.75$ of variance on its own. AST depth contributes at most $R^2 \leq 0.09$, and its univariate negative correlation reverses to near-zero in the multivariate model because depth covaries with length (a Simpson's-paradox pattern). Shared-token fraction is negligible across all models. Moving from the length-only model to all three predictors adds less than 1% of explained variance in three of four models.

Table 1: Multiple regression of Python proximity (mean-pooled cosine) on three confounds across four code LLMs. $R^2_\ell$: length-only $R^2$ with 95% bootstrap CI ($B = 5000$); $R^2_{\text{full}}$: all three predictors. $n = 164$ HumanEvalPack problems per model. Length dominates uniformly.

![Figure 2](https://arxiv.org/html/2605.07345v1/scatter_confounds_7b.png)

Figure 2: Length is the only predictor that matters (CodeLlama-7B). Python proximity vs. each confound: length ratio drives $R^2 = 0.72$; AST depth and shared-token fraction are flat once length is partialled out.

The mechanical explanation is direct. Python's tokenizations are systematically more compact than those of the other three languages on HumanEvalPack: the per-problem mean token count for Python is consistently lowest, producing length ratios $\min/\max < 1$ that, by Eq. ([3](https://arxiv.org/html/2605.07345#S3.E3)), mechanically inflate cosine similarity to Python relative to other pairs.

### 5.2 CKA reverses the sign of the conclusion

If the length–similarity correlation reflected genuine convergence of content, it should persist under any valid similarity metric. It does not. Table [2](https://arxiv.org/html/2605.07345#S5.T2) compares cosine, RV, and CKA on the same 164 problems and the same model (CodeLlama-7B). CKA reduces the explained variance by 83% and reverses the sign of the length coefficient. The substantive conclusion, "Python is a hub" versus "Python is an outlier", depends on which metric is used.

Table 2: Metric comparison on CodeLlama-7B ($n = 164$). Cosine uses mean-pooled vectors; RV and CKA use full token-level matrices. $r_{\text{univ}}$: univariate Pearson correlation with length ratio. The sign of $\beta_{\text{len}}$ flips between cosine and CKA.

![Figure 3](https://arxiv.org/html/2605.07345v1/cosine_vs_cka_fixed.png)

Figure 3: Cosine versus CKA on identical data (CodeLlama-7B, HumanEvalPack). The horizontal axis (length ratio) is identical in both panels. Mean-pooled cosine (left) shows the strong positive length artifact ($R^2 = 0.72$). Linear CKA on aligned positions (right) shows weak negative dependence ($R^2 = 0.13$, $\beta_{\text{len}} = -0.37$). The sign reversal is the central result.

The sign reversal is large and significant. Under cosine, Python's compact tokenization makes it appear *more* similar to other languages; under CKA, which is invariant to the pooling concentration, Python is *less* similar than the cross-language baseline. The RV coefficient, which operates on full matrices but retains some length sensitivity through the $X^\top X$ normalization, falls between the two. We interpret the CKA result as the pre-confound estimate of the cross-language convergence signal: there is no evidence for Python proximity once the length artifact is removed.

### 5.3 Generalization to natural language

If the mechanism is mathematical rather than code-specific, the artifact should appear whenever mean-pooled cosine is applied to anisotropic representations of variable-length sequences. We test this on Mistral-7B-v0.1 over parallel WMT14 EN–FR ($n = 442$) and WMT16 EN–DE ($n = 428$) sentence pairs.

![Figure 4](https://arxiv.org/html/2605.07345v1/nlp_cosine_vs_cka_fra.png)

Figure 4: NLP generalization: English–French (Mistral-7B, WMT14, $n = 442$). Left: mean-pooled cosine correlates with length ratio at $R^2 = 0.23$, $p < 10^{-26}$. Right: shared-token CKA shows no length dependence, $R^2 < 0.001$, $p = 0.69$. The artifact is not specific to code.

![Figure 5](https://arxiv.org/html/2605.07345v1/nlp_cosine_vs_cka_deu.png)

Figure 5: NLP generalization: English–German (Mistral-7B, WMT16, $n = 428$). The same pattern as French: cosine $R^2 = 0.33$, CKA $R^2 = 0.005$. German's longer tokenization relative to English produces a stronger length differential and a larger artifact.

The pattern reproduces (Figures [4](https://arxiv.org/html/2605.07345#S5.F4), [5](https://arxiv.org/html/2605.07345#S5.F5)). Cosine correlates with length ratio at $R^2 = 0.23$ (FR) and $R^2 = 0.33$ (DE), both highly significant; CKA on the same data shows essentially no dependence ($R^2 < 0.006$). The German result is larger than the French result, consistent with the well-documented fact that German's morphological compounding produces longer tokenizations relative to English than French does.

### 5.4 Vision: architecture-dependent suppression in CLIP

CLIP (Radford et al., [2021](https://arxiv.org/html/2605.07345#bib.bib11)) differs from the LLMs we test in two important ways: it uses an EOS-token pooling step (not mean-pooling) and trains with a contrastive objective that produces lower-anisotropy embeddings. We use 400 synthetic captions of varying length paired with a fixed random-noise image and measure both standard EOS-pooled cosine and a non-standard mean-pooled variant.

![Figure 6](https://arxiv.org/html/2605.07345v1/clip_baseline_comparison.png)

Figure 6: CLIP ViT-B/32, two pooling regimes ($n = 400$). Left: standard EOS-pooled cosine shows length sensitivity ($R^2 = 0.21$, $p < 10^{-21}$), driven by self-attention context length. Right: mean-pooled cosine shows essentially no length dependence ($R^2 < 0.01$, $p = 0.075$). Mean-pooling *reduces* the artifact in CLIP because the contrastive head produces less anisotropic embeddings, removing the substrate the artifact requires.

The result is at first surprising and is in fact the most informative cross-domain test of our theory. Mean-pooling *reduces* the length effect compared with EOS-pooling, the opposite of what one might expect from a paper documenting a flaw in mean-pooled cosine. Our theory explains why: CLIP's projection head and contrastive training produce embeddings with much lower anisotropy than decoder-only LLMs. The dimensionless ratio $\rho = \sigma^2 d / \|\mu\|^2$ is small, so the $1/\sqrt{n}$ concentration mechanism has little to act on. The EOS-pooling artifact arises from a separate mechanism, self-attention-mediated context-length effects on a single output token, which is orthogonal to the pooling-concentration mechanism we describe.

This result strengthens rather than weakens the thesis. The artifact is not a universal property of all pooling, but specifically of mean-pooling under high anisotropy. The cross-domain pattern (large in code LLMs, intermediate in Mistral-7B, suppressed in CLIP) tracks anisotropy.

### 5.5 What CKA shows about cross-language convergence

A natural reviewer question is whether, once the length artifact is removed, any cross-language convergence remains. We answer it directly using CKA on the same data. Raw mean CKA at matched token positions is large in every model we test: 0.9997 for Mistral-7B on EN–FR and EN–DE, 0.997 for CodeLlama-13B and 0.91 for Qwen2.5-Coder-7B, 0.65 for CodeLlama-7B and 0.52 for CodeLlama-7B-Python on Python vs. {Java, JS, Go} (per-pair detail with sample sizes in App. [A](https://arxiv.org/html/2605.07345#A1)). Genuine cross-lingual and cross-language convergence is therefore real, and in NLP it is essentially saturating. What does not survive metric correction is the *asymmetric* reading: under cosine, Python's compact tokenization made it appear privileged among code languages, and the same held for English among natural languages; under CKA the asymmetry vanishes ($\beta_{\text{len}}$ flips sign in CodeLlama-7B; Tab. [2](https://arxiv.org/html/2605.07345#S5.T2)) and the residual structure is symmetric. Convergence is real; only the privileged-pivot interpretation was metric-induced.

### 5.6 Cross-domain summary

Table 3: Cross-domain $R^2$ of length ratio on raw similarity at shared token positions, with 95% bootstrap CIs ($B = 5000$, code) or Fisher CIs (NLP, CLIP). Cosine: mean-pooled cosine. CKA: linear CKA on aligned positions. ⋆EOS-pooled CLIP; the mean-pooled variant has $R^2 < 0.01$.

## 6 Discussion

### 6.1 Implications for prior work

Several recent and influential papers in interpretability use mean-pooled cosine as the primary metric supporting claims about cross-lingual or cross-language representational structure. Schut et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib14)) argue that multilingual LLMs route non-English inputs through English-like representations; Wendler et al. ([2024](https://arxiv.org/html/2605.07345#bib.bib15)) argue that Llama-2 internally translates to English in middle layers; Yin et al. ([2025](https://arxiv.org/html/2605.07345#bib.bib16)) make the analogous claim for code LLMs and Python. In each case the reference language has systematically shorter tokenizations on parallel content, and the metric is mean-pooled cosine. On the strength of the metric alone, the data are equally consistent with genuine convergence and with a pure tokenizer-length differential; the metric cannot distinguish them. As Sec. [5.5](https://arxiv.org/html/2605.07345#S5.SS5) shows, convergence in fact exists and is large under CKA; what does not survive metric correction is the privileged-pivot framing.

### 6.2 Recommendations

We suggest the following defaults for cross-representation similarity work.

- (i) Use a length-invariant metric (linear CKA, RV) whenever input lengths can vary, and reserve mean-pooled cosine for pre-pooling baselines or sanity checks.
- (ii) Report length statistics, i.e. mean token counts and per-language tokenization summaries, alongside any similarity result.
- (iii) Use length-controlled baselines, either by equalizing the pooled length before computing the metric or by restricting to length-matched subsets (see the sketch after this list).
- (iv) Validate any finding with multiple metrics: a result that holds under cosine but vanishes under CKA should be presumed to be a metric artifact.
- (v) If anisotropy mitigation (Mu & Viswanath, [2018](https://arxiv.org/html/2605.07345#bib.bib8)) is acceptable in the application, it directly attenuates the artifact by reducing $\rho$.
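A minimal sketch of the length-matched-subset baseline from (iii); the token counts are assumed to be precomputed per pair, and the tolerance is an illustrative choice:

```python
import numpy as np

def length_matched_subset(token_counts_a, token_counts_b, tolerance=0.05):
    """Indices of pairs whose length ratio min/max is within `tolerance` of 1.

    Restricting a similarity analysis to this subset removes most of the
    length differential that the mean-pooling artifact feeds on.
    """
    a = np.asarray(token_counts_a, float)
    b = np.asarray(token_counts_b, float)
    ratio = np.minimum(a, b) / np.maximum(a, b)
    return np.flatnonzero(ratio >= 1.0 - tolerance)
```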

### 6.3 Limitations

We treat linear CKA as a more honest baseline than mean-pooled cosine, but it is not uncontested. Davari et al. ([2023](https://arxiv.org/html/2605.07345#bib.bib2)) show that CKA can be made misleading under adversarial column-scaling, and our shared-surface-form alignment introduces a selection effect, since only tokens appearing in both sequences contribute. We mitigate this by requiring at least three shared tokens per pair and averaging over middle layers, but the residual selection is real. Linear CKA is rotation- but not translation-invariant; for our purposes this is appropriate, since $\mu$ is exactly the translation we wish to ignore. We test only linear CKA, and kernel or nonlinear variants may behave differently. The theoretical analysis in Eq. ([3](https://arxiv.org/html/2605.07345#S3.E3)) assumes isotropic noise, so the closed-form expression is approximate; the qualitative monotonic length dependence is what we test empirically and what we observe across domains. Our code experiments cover four models from two families, and the CLIP experiment uses a fixed random-noise image; broader model coverage and natural image–caption pairs would further strengthen the conclusion.

## 7 Conclusion

Mean-pooled cosine similarity, the default metric for comparing neural representations across languages, modalities, and tasks, is not length-invariant under transformer anisotropy. We proved this from first principles, validated it on random vectors with no model involvement, and demonstrated it empirically across code, natural-language, and vision domains. Substituting linear CKA on identical data cuts length-explained variance by 83%, flips the sign of $\beta_{\text{len}}$ in code, and removes the artifact in NLP; the artifact is naturally suppressed in CLIP, exactly where the theory predicts. Yet CKA also shows that genuine cross-language convergence is large: near-saturating in Mistral-7B and 0.91–0.997 across code languages in CodeLlama-13B and Qwen2.5-Coder-7B. The convergence was real; only the asymmetric privileged-pivot framing was metric-induced. We therefore recommend length-invariant metrics as the default for cross-representation comparisons, and a careful re-examination of recent claims of cross-lingual convergence built on mean-pooled cosine.

## References

- Conneau et al. (2020) Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)*, 2020.
- Davari et al. (2023) Davari, M., Horoi, S., Natik, A., Lajoie, G., Wolf, G., and Belilovsky, E. Reliability of CKA as a similarity measure in deep learning. In *International Conference on Learning Representations (ICLR)*, 2023.
- Ethayarajh (2019) Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 representations. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pp. 55–65, 2019.
- Gao et al. (2019) Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.-Y. Representation degeneration problem in training natural language generation models. In *International Conference on Learning Representations (ICLR)*, 2019.
- Hui et al. (2024) Hui, B. et al. Qwen2.5-Coder technical report. *arXiv preprint arXiv:2409.12186*, 2024.
- Kargaran et al. (2025) Kargaran, A. H. et al. From languages to atoms: Unifying low-resource language representations through logit lens. In *Findings of the Association for Computational Linguistics (ACL)*, 2025.
- Kornblith et al. (2019) Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. In *Proceedings of the 36th International Conference on Machine Learning (ICML)*, 2019.
- Mu & Viswanath (2018) Mu, J. and Viswanath, P. All-but-the-top: Simple and effective postprocessing for word representations. In *International Conference on Learning Representations (ICLR)*, 2018.
- Muennighoff et al. (2023) Muennighoff, N. et al. OctoPack: Instruction tuning code large language models. *arXiv preprint arXiv:2308.07124*, 2023.
- Petrov et al. (2023) Petrov, A., La Malfa, E., Torr, P. H. S., and Bibi, A. Language model tokenizers introduce unfairness between languages. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2023.
- Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In *Proceedings of the 38th International Conference on Machine Learning (ICML)*, pp. 8748–8763, 2021.
- Robert & Escoufier (1976) Robert, P. and Escoufier, Y. A unifying tool for linear multivariate statistical methods: The RV-coefficient. *Journal of the Royal Statistical Society: Series C (Applied Statistics)*, 25(3):257–265, 1976.
- Rozière et al. (2023) Rozière, B. et al. Code Llama: Open foundation models for code. *arXiv preprint arXiv:2308.12950*, 2023.
- Schut et al. (2025) Schut, L., Gal, Y., and Farquhar, S. Do multilingual large language models think in English? In *International Conference on Learning Representations (ICLR)*, 2025.
- Wendler et al. (2024) Wendler, C., Veselovsky, V., Monea, G., and West, R. Do Llamas work in English? On the latent language of multilingual transformers. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2024.
- Yin et al. (2025) Yin, Z. et al. Do code LLMs understand programming languages? A comprehensive cross-lingual analysis. *arXiv preprint arXiv:2512.00123*, 2025.

Appendix

Supplementary material to "Mean-Pooled Cosine Similarity is Not Length-Invariant"

## Appendix A Per-pair raw CKA values

Table [4](https://arxiv.org/html/2605.07345#A1.T4) reports the raw mean linear CKA at shared token positions, averaged over middle layers, for each model and language pair used in Sec. [5.5](https://arxiv.org/html/2605.07345#S5.SS5). High values indicate that representations are similar at matched positions once the $1/\sqrt{n}$ pooling concentration has been removed. Variability across pairs is reported as the standard deviation across problems (code) or sentence pairs (NLP). Values are computed from the same data used for the regression results in Sec. [5](https://arxiv.org/html/2605.07345#S5).

Table 4: Raw mean CKA at shared token positions, with per-pair standard deviation. Values verified against the source JSON data files; std values reflect across-pair (not bootstrap) variation.
