LLM Parameters for Math Across Languages: Shared or Separate?

arXiv cs.CL Papers

Summary

This paper presents a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, finding partial overlap of math-associated parameters across languages, concentrated in intermediate layers. English has the largest set of math-relevant parameters, while lower-resource languages have smaller sets.

arXiv:2606.18453v1 Announce Type: new Abstract: Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

Cached at: 06/18/26, 05:45 AM

# LLM Parameters for Math Across Languages: Shared or Separate?
Source: [https://arxiv.org/html/2606.18453](https://arxiv.org/html/2606.18453)
Behzad Shomali¹,²†, Luisa Victor²†, Tim Selbach²†, Ali Hamza Bashir¹,³, David Berghaus¹,³, Joachim Koehler¹,³, Mehdi Ali¹,³, Markus Frey¹,²,³

¹Lamarr Institute, ²University of Bonn, ³Fraunhofer IAIS, †Equal contribution

###### Abstract

Large language models \(LLMs\) exhibit substantial cross\-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language\-specific parameters or a shared mechanism that manifests differently by language\. We present a cross\-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages\. We find that the extracted math\-associated parameters exhibit partial cross\-lingual overlap, with the strongest overlap concentrated in intermediate model layers\. We further observe that English consistently produces the largest set of math\-relevant parameters, whereas lower\-resource languages reveal smaller sets of relevant parameters\. These results suggest that math\-related behavior in multilingual LLMs is neither fully language\-invariant nor fully language\-specific, but instead exhibits partial cross\-lingual parameter overlap with systematic language\-dependent differences\.

Code: [https://github\.com/luisavictor/math\-across\-languages](https://github.com/luisavictor/math-across-languages)

Translated datasets: [https://huggingface\.co/math\-across\-languages](https://huggingface.co/math-across-languages)


Correspondence to: behzad\.shomali@uni\-bonn\.de

## 1 Introduction

Large language models \(LLMs\) have achieved remarkable success across a range of tasks, yet their performance on multilingual mathematical reasoning remains uneven \(Chen et al\., [2024](https://arxiv.org/html/2606.18453#bib.bib9)\)\. Understanding how these models process mathematical tasks across languages is critical for building robust multilingual systems\.

Existing research on LLM capabilities can be broadly categorized into two streams\. First, mechanistic studies localize task\-specific skills to subsets of model parameters\. Recent methods leverage forward\-pass statistics \(Sun et al\., [2023](https://arxiv.org/html/2606.18453#bib.bib8)\)\. Notably, Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\) introduced *MathNeurosurgery*, isolating parameters critical to mathematical reasoning in LLMs\. Second, multilingual reasoning studies primarily evaluate performance or representational properties\. Datasets and strategies like MathOctopus \(Chen et al\., [2024](https://arxiv.org/html/2606.18453#bib.bib9)\) and the UST dataset \(Ko et al\., [2025](https://arxiv.org/html/2606.18453#bib.bib10)\) target cross\-lingual generalization, while Liu et al\. \([2026](https://arxiv.org/html/2606.18453#bib.bib23)\) suggest that English\-centric latent pathways may dominate multilingual reasoning\. However, these approaches rarely identify whether the same parameters support mathematical reasoning across languages\.

Despite these advances, several challenges remain\. Mechanistic analyses are mostly limited to English, leaving open questions about cross\-lingual parameter sharing\. Similarly, performance\-focused multilingual studies do not reveal the underlying computational substrates of reasoning\. These gaps hinder our understanding of whether mathematical reasoning relies on a universal core or language\-specific circuits\. To address this gap, we investigate the localization and overlap of math\-specific parameters in LLMs\. Our contributions are threefold:

- Mechanistic localization of multilingual math circuits: We develop a framework to identify and compare model parameters supporting mathematical reasoning across languages\.
- Language\-dependent variation in parameter allocation: We show that English contains the largest set of math\-specific parameters, while lower\-resource languages have fewer\.
- Functional validation via weight intervention: We demonstrate that mathematical reasoning emerges as a collective property of these localized parameters\.

## 2 Related Work

#### Mechanistic analysis of LLMs\.

Several studies have explored localizing task\-specific capabilities within LLM parameters\. Gradient\-based attribution techniques identify parameters critical to specific tasks \(Panigrahi et al\., [2023](https://arxiv.org/html/2606.18453#bib.bib7)\), while forward\-pass statistics offer a computationally efficient alternative \(Sun et al\., [2023](https://arxiv.org/html/2606.18453#bib.bib8)\)\. Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\) proposed *MathNeurosurgery*, which isolates math\-specific parameters by comparing weight–activation products on math and non\-math datasets\. However, their analysis primarily focuses on English and does not examine cross\-lingual parameter behavior\.

#### Multilingual LLMs\.

Research on multilingual LLMs has explored dataset creation and evaluation\. MathOctopus \(Chen et al\., [2024](https://arxiv.org/html/2606.18453#bib.bib9)\) and the UST dataset \(Ko et al\., [2025](https://arxiv.org/html/2606.18453#bib.bib10)\) target cross\-lingual generalization, while other studies suggest that multilingual models often rely on English\-centric latent pathways \(Liu et al\., [2026](https://arxiv.org/html/2606.18453#bib.bib23)\)\. These analyses typically evaluate performance or representations, leaving the underlying parameter\-level mechanisms unexplored\.

#### Cross\-lingual mechanistic studies\.

Some recent work has investigated neuron activation patterns across translated inputs to identify shared functional units \(Wang et al\., [2024](https://arxiv.org/html/2606.18453#bib.bib33)\)\. Our study differs by using a parameter\-importance framework, analyzing both attention and Feed\-Forward Network \(FFN\) weights, to localize math\-specific circuits and quantify their overlap across languages\.

## 3 Methodology

We employ the MathNeurosurgery framework \(Christ et al\., [2025](https://arxiv.org/html/2606.18453#bib.bib6)\) to identify math\-specific parameters for each language, and then analyze how these parameter sets overlap across languages, both for the entire model and layer by layer\.

### 3\.1 Parameter Identification

Following Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\), we first compute the task\-specific $S_{ij}$ for weight $j$ in neuron $i$:

$$S_{ij} = \sum_{k=1}^{N} \left|W_{ij}\right| \cdot \left\|X_{j}^{k}\right\|_{2} \quad \text{for } X \in \mathcal{D} \tag{1}$$

where $W_{ij}$ represents the weight, $|\cdot|$ denotes the absolute value operator, and $\|X_{j}^{k}\|_{2}$ is the L2 norm of the $j$\-th feature aggregated across input tokens to normalize the input $X$ from dataset $\mathcal{D}$ for sample $k$\. Using this formulation, we compute scores for each parameter over math \($\mathcal{D}_{\text{math}}$\) and non\-math datasets \($\mathcal{D}_{\text{non-math}}$\)\. Then we separately identify the top\-$k$ proportion of parameters with the highest score for each task in each layer\. Finally, we identify the math\-specific parameters by taking the subset of parameters that are crucial for the math task but not for the unrelated task\.
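The three steps above \(scoring, per\-layer top\-$k$ selection, and set difference\) can be sketched for a single weight matrix as follows\. This is a minimal NumPy illustration, not the released implementation: the function names are hypothetical, the top\-$k$ proportion is applied over the whole matrix, and ties at the threshold are all kept\.

```python
import numpy as np

def importance_scores(weight, inputs):
    """S_ij = sum_k |W_ij| * ||X_j^k||_2, following Equation (1).

    weight: (out_features, in_features) matrix W.
    inputs: list of (num_tokens, in_features) activation arrays, one per sample k.
    """
    # L2 norm of each input feature j over the tokens of a sample, summed over samples.
    feature_norms = sum(np.linalg.norm(x, axis=0) for x in inputs)
    # |W_ij| does not depend on k, so it can be factored out of the sum.
    return np.abs(weight) * feature_norms[np.newaxis, :]

def top_k_mask(scores, proportion):
    """Boolean mask of the highest-scoring `proportion` of entries (assumed per matrix)."""
    n_keep = max(1, int(round(proportion * scores.size)))
    threshold = np.partition(scores.ravel(), -n_keep)[-n_keep]
    return scores >= threshold  # entries tied with the threshold are all kept

def math_specific_mask(weight, math_inputs, non_math_inputs, proportion):
    """Entries in the top-k for math but not in the top-k for the non-math task."""
    math_top = top_k_mask(importance_scores(weight, math_inputs), proportion)
    other_top = top_k_mask(importance_scores(weight, non_math_inputs), proportion)
    return math_top & ~other_top
```

Running `math_specific_mask` once per language on the same layer yields the per\-language parameter sets that the next subsection compares\.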

### 3\.2 Similarity Metric

To measure similarity between two sets of extracted parameters, we first separately run the previous step for each language, and then measure their pairwise overlaps using the Jaccard coefficient\. Let $T_{1}$ and $T_{2}$ denote the sets of important parameters representing task\-specific parameters for the same task in two languages\. The overlap between the two parameter sets is computed as:

$$J(T_{1}, T_{2}) = \frac{|T_{1} \cap T_{2}|}{|T_{1} \cup T_{2}|} \tag{2}$$
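As a small illustration of Equation \(2\), the following sketch computes the Jaccard coefficient for two sets of parameter indices\. The index tuples and set contents are made up for the example, and returning 0 for two empty sets is a convention chosen here\.

```python
def jaccard(t1, t2):
    """Jaccard coefficient |T1 ∩ T2| / |T1 ∪ T2| from Equation (2)."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(t1 & t2) / len(union)

# Hypothetical parameter indices (layer, row, column) for two languages.
params_en = {(0, 1, 2), (0, 3, 4), (1, 0, 0), (1, 2, 5)}
params_hi = {(0, 1, 2), (1, 0, 0), (1, 7, 7)}
print(jaccard(params_en, params_hi))  # 2 shared / 5 in the union = 0.4
```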

## 4 Results

We begin with the experimental setup, followed by an analysis of the results\.

### 4\.1 Experimental Setup

We evaluate Llama 3\.2 1B IT \(Llama 1B\) \(Meta AI, [2024b](https://arxiv.org/html/2606.18453#bib.bib13)\), Qwen3 4B IT\-2507 \(Qwen3 4B\) \(Qwen Team, [2025](https://arxiv.org/html/2606.18453#bib.bib15)\), and Llama 3\.1 8B IT \(Llama 8B\) \(Meta AI, [2024a](https://arxiv.org/html/2606.18453#bib.bib14)\) across English, German, French, and Hindi on GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2606.18453#bib.bib16)\) for math reasoning \($\mathcal{D}_{\text{math}}$\), and on MMLU \(Hendrycks et al\., [2020](https://arxiv.org/html/2606.18453#bib.bib17)\) and RACE \(Lai et al\., [2017](https://arxiv.org/html/2606.18453#bib.bib18)\) as general language understanding datasets \($\mathcal{D}_{\text{non-math}}$\)\. We mainly focus on Llama 1B in the main text, while results for larger models are in the Appendices\.

### 4\.2 Does math performance correlate with the model’s capacity for each language?

Figure [1](https://arxiv.org/html/2606.18453#S4.F1) \(left\) illustrates that, for each top\-$k$ value, English consistently exhibits the highest number of math\-specific parameters, followed by German and French, and then Hindi\. This pattern aligns with the observed reasoning performance across languages, as shown in Table [1](https://arxiv.org/html/2606.18453#S4.T1)\. More specifically, the trend in the number of math\-specific parameters aligns with the relative performance ranking across languages\. For example, in Figure [1](https://arxiv.org/html/2606.18453#S4.F1), at top\-$k = 0.01$, English has the highest parameter count, French and German have approximately equal counts, and Hindi has the lowest\. Moreover, the gap between Hindi and French is smaller than that between Hindi and English\. This pattern becomes even more pronounced with larger models \(Figures [F\.4](https://arxiv.org/html/2606.18453#A6.F4), [F\.5](https://arxiv.org/html/2606.18453#A6.F5)\)\.

Finding: The distribution of math\-specific parameters across languages tends to align with their relative reasoning performance in LLMs\.

Table 1: Pretraining accuracy \(%\) on GSM8K and RACE across English, German, Hindi, and French\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x1.png)

Figure 1: Math\-specific parameters in Llama 1B\. Left: Number of isolated parameters per language\. Right: Jaccard similarity between language\-specific parameter sets\.

### 4\.3 Is there a shared cross\-lingual core for mathematical reasoning?

Figure [1](https://arxiv.org/html/2606.18453#S4.F1) highlights the number of math\-associated parameters in English \(En\), German \(De\), Hindi \(Hi\) and French \(Fr\), as well as the Jaccard similarity across language pairs for Llama 1B\. Across all language pairs, the Jaccard similarity primarily peaks in the intermediate layers and decreases in deeper layers\. The same behavior is also observed in larger models \(Figures [F\.1](https://arxiv.org/html/2606.18453#A6.F1)\-[F\.3](https://arxiv.org/html/2606.18453#A6.F3)\)\. This pattern is consistent with prior findings that earlier layers capture more surface\-level linguistic features, while later layers can become more specialized and language\-dependent \(Tang et al\., [2024](https://arxiv.org/html/2606.18453#bib.bib30)\)\.

Figures [1](https://arxiv.org/html/2606.18453#S4.F1) and [2](https://arxiv.org/html/2606.18453#S4.F2) demonstrate that English consistently possesses the highest number of math\-associated parameters, both in the entire model and in nearly every layer\. This suggests that the model possesses a more distributed and resilient internal circuit for mathematical computations in English\.

Notably, the Jaccard similarity remains substantially above the random baseline across all pairings, except in the initial and final few layers\. This suggests the presence of a partially shared parameter substrate associated with mathematical problem solving, especially in intermediate layers\. Lower similarity in the initial and final layers, especially in larger models \(Figures [F\.2](https://arxiv.org/html/2606.18453#A6.F2), [F\.3](https://arxiv.org/html/2606.18453#A6.F3)\), likely reflects two distinct phenomena\. In early layers, low overlap is consistent with the surface\-level processing discussed above\. In later layers, it instead points to increased language\-specific specialization: with larger parameter budgets, models can mitigate the *curse of multilinguality* \(Conneau et al\., [2020](https://arxiv.org/html/2606.18453#bib.bib31)\) by allocating dedicated sub\-networks for language\-specific reasoning, rather than relying on shared parameters, consistent with other findings on layer specialization in LLMs \(Frey et al\., [2026](https://arxiv.org/html/2606.18453#bib.bib34)\)\.

While Figures [1](https://arxiv.org/html/2606.18453#S4.F1) and [3](https://arxiv.org/html/2606.18453#S4.F3) confirm that lexically similar, high\-resource languages like French and German exhibit the highest overlap, the lower similarity observed in pairs involving Hindi, particularly En\-Hi, suggests a structural barrier imposed by script and tokenization\. Because English and Hindi share neither an alphabet nor substantial lexical overlap, the model appears to rely on more disjoint parameter subsets to map divergent input encodings to a shared internal representation\. This pattern is also observed in Llama 8B, independent of model scale \(Figure [F\.3](https://arxiv.org/html/2606.18453#A6.F3)\)\. In contrast, the effect is less pronounced in Qwen3 4B \(Figure [F\.2](https://arxiv.org/html/2606.18453#A6.F2)\), potentially reflecting differences in multilingual pretraining coverage, vocabulary design, or tokenization\. We emphasize that this interpretation is speculative, as our experiments do not isolate these factors\.

Surprisingly, the De\-Hi and Fr\-Hi overlaps in many layers are notably higher than that of En\-Hi\. We consider two possible explanations\. First, the difference may be partly an effect of set size: the math\-associated parameter sets for French and Hindi, and for German and Hindi, are closer in size than those for English and Hindi\. Because English has the largest set of math\-associated parameters, the union in the denominator of Equation [2](https://arxiv.org/html/2606.18453#S3.E2) is larger for En\-Hi, which reduces the Jaccard similarity\. Second, an alternative interpretation is the existence of a shared non\-English subspace: a multilingual reasoning manifold that the model exploits when operating outside its primary English\-centric pathways\. Notably, the pattern in which En\-Hi similarity is lower than De\-Hi, Fr\-Hi, and De\-Fr persists across almost all layers and scales to larger models \(Figures [F\.2](https://arxiv.org/html/2606.18453#A6.F2), [F\.3](https://arxiv.org/html/2606.18453#A6.F3)\)\. This consistent, surprising observation further motivates the hypothesis that non\-English languages may rely on partially overlapping parameter subsets distinct from those used for English\.
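The set\-size effect in the first explanation follows from a general bound on the Jaccard coefficient: the intersection can be no larger than the smaller set, and the union no smaller than the larger set\. The set sizes in the worked example below are illustrative only and are not measurements from the experiments\.

$$J(T_{1}, T_{2}) = \frac{|T_{1} \cap T_{2}|}{|T_{1} \cup T_{2}|} \le \frac{\min(|T_{1}|, |T_{2}|)}{\max(|T_{1}|, |T_{2}|)}$$

$$\text{With } |T_{\text{Hi}}| = 100,\ |T_{\text{Fr}}| = 120,\ |T_{\text{En}}| = 300: \quad J(T_{\text{Fr}}, T_{\text{Hi}}) \le \tfrac{100}{120} \approx 0.83, \qquad J(T_{\text{En}}, T_{\text{Hi}}) \le \tfrac{100}{300} \approx 0.33$$

Under these assumed sizes, En\-Hi similarity is capped well below Fr\-Hi similarity even if every Hindi parameter were also an English parameter, so a lower En\-Hi score does not by itself distinguish the two explanations\.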

Finding: There exists a partially shared parameter substrate for math problem\-solving, which is most prominent in the intermediate layers\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x2.png)

Figure 2: Layer\-wise distribution of math\-reasoning parameters per language for Llama 1B\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x3.png)

Figure 3: Layer\-wise Jaccard similarity of math\-specific parameters across language pairs for Llama 1B\.

### 4\.4 How do weight scaling & pruning change the model’s behavior?

To validate and investigate the effectiveness of isolated parameters, following Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\), we apply two interventions: *pruning* and *scaling* \(full results in Appendix [D](https://arxiv.org/html/2606.18453#A4)\)\. In pruning, weights are set to zero, which ideally impairs mathematical performance more than non\-mathematical performance\. In scaling, weights are multiplied by a positive factor \($1.1$ for Llama 1B, $1.01$ otherwise\)\. While scaling may improve performance, the effect is less predictable than pruning, as outcomes depend sensitively on the factor choice\. In both settings, the intended trend is observed: non\-mathematical performance remains largely unchanged, whereas mathematical reasoning is either significantly reduced by pruning or enhanced by scaling\.
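Both interventions amount to an element\-wise edit of a weight matrix under a boolean mask of math\-specific entries\. The sketch below shows this on a toy matrix; the helper names and values are hypothetical, and applying the edits to an actual model would additionally require writing the result back into the model's weights\.

```python
import numpy as np

def prune(weight, mask):
    """Return a copy of `weight` with the masked entries set to zero."""
    out = weight.copy()
    out[mask] = 0.0
    return out

def scale(weight, mask, factor):
    """Return a copy of `weight` with the masked entries multiplied by `factor`."""
    out = weight.copy()
    out[mask] *= factor
    return out

# Toy weight matrix and a hypothetical mask of math-specific entries.
W = np.array([[0.2, -0.5], [1.0, 0.3]])
mask = np.array([[True, False], [False, True]])

print(prune(W, mask))       # masked entries become 0
print(scale(W, mask, 1.1))  # masked entries grow by 10%, the factor used for Llama 1B
```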

Analyzing examples where scaling guides the model to the correct answer, we find that the reasoning logic remains largely unchanged, even though the reasoning may not be entirely accurate and may include some missing steps \(Figure [E\.2](https://arxiv.org/html/2606.18453#A5.F2)\)\. However, for all four languages, the gains mainly come from correcting errors in arithmetic computations, as shown in Figures [E\.1](https://arxiv.org/html/2606.18453#A5.F1)\-[E\.4](https://arxiv.org/html/2606.18453#A5.F4)\. This observation may explain the marginal improvements observed in larger models, which already have strong arithmetic capabilities and are therefore less responsive to small parameter perturbations\.

Tables [D\.9](https://arxiv.org/html/2606.18453#A4.T9) and [D\.10](https://arxiv.org/html/2606.18453#A4.T10) show an unexpected pattern when pruning the Qwen3 4B model with top\-$k$ = 1e\-5\. Instead of dropping, performance on the English GSM8K task improved remarkably\. Manual inspection revealed that suppressing these parameters mainly improved output formatting and in\-context learning \(Figure [E\.9](https://arxiv.org/html/2606.18453#A5.F9)\)\. More details are given in Appendix [D](https://arxiv.org/html/2606.18453#A4)\.

Finding: Scaling math\-specific parameters primarily corrects arithmetic execution rather than reasoning logic\. In some cases, pruning may improve GSM8K accuracy by changing output format or few\-shot imitation behavior, rather than by improving reasoning itself\.

### 4\.5 Math\-specific parameters effect: collective or individual?

We investigate whether isolated math\-specific parameters exhibit a collective effect, i\.e\. their impact arises from their aggregation, or an individual effect, where performance changes are driven by a small number of critical parameters\. To this end, for the selected top\-$k$ parameters, we sample random subsets at several intermediate proportions\. For each proportion, we repeat the sampling process three times and prune the selected parameters\.
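The sampling procedure can be sketched as below\. The proportions, mask shape, and seed are assumptions for illustration \(the exact proportions are not listed here\), and the model evaluation that would follow each pruning step is left out\.

```python
import numpy as np

def sample_subset_mask(mask, proportion, rng):
    """Keep a uniformly random `proportion` of the True entries in `mask`."""
    selected = np.flatnonzero(mask)
    n_keep = int(round(proportion * selected.size))
    chosen = rng.choice(selected, size=n_keep, replace=False)
    subset = np.zeros(mask.size, dtype=bool)
    subset[chosen] = True
    return subset.reshape(mask.shape)

rng = np.random.default_rng(0)  # arbitrary seed
full_mask = np.zeros((5, 8), dtype=bool)
full_mask[:, :4] = True  # hypothetical set of 20 math-specific parameters

for proportion in (0.25, 0.5, 0.75):  # assumed intermediate proportions
    for repeat in range(3):  # three repetitions per proportion
        subset = sample_subset_mask(full_mask, proportion, rng)
        # In the full experiment, the entries in `subset` would be pruned
        # and the model re-evaluated at this point.
        print(proportion, repeat, int(subset.sum()))
```

If accuracy falls roughly in proportion to the number of pruned entries, with little spread across the three repeats, that is the signature of a collective rather than an individual effect\.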

Figure [4](https://arxiv.org/html/2606.18453#S4.F4) depicts the results for Llama 1B across three languages\. The observed trend is approximately linear, suggesting that the effect of the isolated parameters is predominantly collective rather than driven by individual parameters\. Specifically, the linear relationship indicates that performance degradation scales with the proportion of important parameters removed\. Moreover, the low standard deviation supports this conclusion, indicating that the results are not sensitive to the specific subset selected; instead, performance degradation is primarily determined by the proportion of parameters ablated\. We observe similar trends for larger models \(Figures [D\.1](https://arxiv.org/html/2606.18453#A4.F1), [D\.2](https://arxiv.org/html/2606.18453#A4.F2)\)\.

Finding: Identified math\-specific parameters have a collective influence on the outcome\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x4.png)

Figure 4: Effect of pruning isolated math\-specific parameters on performance across languages for Llama 1B\.

## 5 Conclusion

Our mechanistic analysis suggests that math\-associated parameters in multilingual LLMs exhibit partial cross\-lingual overlap, with the highest overlap concentrated in intermediate layers\. At the same time, deeper layers show more language\-specific patterns\. We find that English yields the largest set of extracted math\-associated parameters, consistent with its stronger performance in our experiments\. Lower\-resource languages and languages with different scripts, such as Hindi, show less overlap with English, while still sharing some overlap with other non\-English languages\. These results suggest a distinct pattern of overlap among non\-English math\-associated parameters\. Finally, intervention experiments indicate that the effect of the identified parameters is distributed across the selected set, i\.e\. exhibiting a collective effect, rather than dominated by a small number of individual parameters\.

## 6 Acknowledgment

This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North\-Rhine Westphalia as part of the Lamarr Institute for Machine Learning and Artificial Intelligence\.

## Limitations

Our work has several limitations\. First, despite manual verification, the translated datasets may still lack full fluency and could benefit from stronger translation models\. Second, we focus exclusively on mathematical reasoning, leaving open whether similar shared\-core and language\-dependent patterns extend to other forms of reasoning\. Third, our analysis considers a limited set of languages and model families, which may restrict the generality of our conclusions across scripts and tokenization schemes\. We leave these directions for future work\.

## References

- Detecting and pruning prominent but detrimental neurons in large language models\. arXiv preprint arXiv:2507\.09185\.
- A\. Ali, L\. Wolf, and I\. Titov \(2024\) Mitigating copy bias in in\-context learning through neuron pruning\. arXiv preprint arXiv:2410\.01288\.
- R\. S\. Beekes \(2011\) Comparative indo\-european linguistics: an introduction\. 2nd edition, John Benjamins Publishing Company, Amsterdam/Philadelphia\.
- B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini \(2024\) Large language monkeys: scaling inference compute with repeated sampling\. arXiv preprint arXiv:2407\.21787\.
- N\. Chen, Z\. Zheng, N\. Wu, M\. Gong, D\. Zhang, and J\. Li \(2024\) Breaking language barriers in multilingual mathematical reasoning: insights and observations\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 7001–7016\.
- B\. R\. Christ, Z\. Gottesman, J\. Kropko, and T\. Hartvigsen \(2025\) Math neurosurgery: isolating language models’ math reasoning abilities using only forward passes\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 24803–24840\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\.
- A\. Conneau, K\. Khandelwal, N\. Goyal, V\. Chaudhary, G\. Wenzek, F\. Guzmán, E\. Grave, M\. Ott, L\. Zettlemoyer, and V\. Stoyanov \(2020\) Unsupervised cross\-lingual representation learning at scale\. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp\. 8440–8451\.
- M\. R\. Costa\-Jussà, J\. Cross, O\. Çelebi, M\. Elbayad, K\. Heafield, K\. Heffernan, E\. Kalbassi, J\. Lam, D\. Licht, J\. Maillard, et al\. \(2022\) No language left behind: scaling human\-centered machine translation\. arXiv preprint arXiv:2207\.04672\.
- F\. Dong, Z\. Yan, X\. Ge, Z\. Xu, M\. Zhang, X\. Chen, B\. He, X\. Xin, Z\. Chen, and Y\. Zhou \(2026\) Identifying and transferring reasoning\-critical neurons: improving llm inference reliability via activation steering\. arXiv preprint arXiv:2601\.19847\.
- M\. Frey, B\. Shomali, A\. H\. Bashir, D\. Berghaus, J\. Koehler, and M\. Ali \(2026\) Adaptive loops and memory in transformers: think harder or know more?\. arXiv preprint arXiv:2603\.08391\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\) The language model evaluation harness\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring massive multitask language understanding\. arXiv preprint arXiv:2009\.03300\.
- H\. Ko, G\. Son, and D\. Choi \(2025\) Understand, solve and translate: bridging the multilingual mathematical reasoning gap\. arXiv preprint arXiv:2501\.02448\.
- G\. Lai, Q\. Xie, H\. Liu, Y\. Yang, and E\. Hovy \(2017\) Race: large\-scale reading comprehension dataset from examinations\. arXiv preprint arXiv:1704\.04683\.
- Y\. Liu, R\. Zhao, H\. Schütze, and M\. A\. Hedderich \(2026\) Large reasoning models are \(not yet\) multilingual latent reasoners\. arXiv preprint arXiv:2601\.02996\.
- Meta AI \(2024a\) Note: Accessed: 2026\-01\-26\. External Links: [Link](https://ai.meta.com/blog/meta-llama-3-1/)\.
- Meta AI \(2024b\) Note: Accessed: 2026\-01\-26\. External Links: [Link](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)\.
- A\. Panigrahi, N\. Saunshi, H\. Zhao, and S\. Arora \(2023\) Task\-specific skill localization in fine\-tuned language models\. In International Conference on Machine Learning, pp\. 27011–27033\.
- Qwen Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\.
- M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter \(2023\) A simple and effective pruning approach for large language models\. arXiv preprint arXiv:2306\.11695\.
- T\. Tang, W\. Luo, H\. Huang, D\. Zhang, X\. Wang, W\. X\. Zhao, F\. Wei, and J\. Wen \(2024\) Language\-specific neurons: the key to multilingual capabilities in large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 5701–5715\.
- W\. Wang, B\. Haddow, M\. Wu, W\. Peng, and A\. Birch \(2024\) Sharing matters: analysing neurons across languages and tasks in llms\. arXiv preprint arXiv:2406\.09265\.

## Appendix A Experimental Setup

In this section, we describe the languages, the dataset curation, and the translation process employed throughout this study\.

#### Languages

To evaluate cross\-lingual transferability of mathematical reasoning, we consider English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\. This selection enables comparisons across both closely related and more distantly related languages within the Indo\-European family\. English and German belong to the Germanic branch, while French is a Romance language, and Hindi belongs to the Indo\-Aryan branch, following the classification outlined by Beekes \([2011](https://arxiv.org/html/2606.18453#bib.bib20)\)\.

#### Dataset curation:

Since MMLU includes a substantial number of math\-oriented questions, unlike Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\), who include all available subjects, we apply category\-based filtering and remove subject areas that are strongly quantitative, as these can confound our parameter extraction by making parameters associated with mathematical reasoning appear as general language parameters\. Concretely, we exclude mathematics\-related categories \(abstract\_algebra, elementary\_mathematics, college\_mathematics, high\_school\_mathematics, high\_school\_statistics\), physics categories \(conceptual\_physics, college\_physics, high\_school\_physics, astronomy\), chemistry categories \(college\_chemistry, high\_school\_chemistry\), biology \(high\_school\_biology\), computer science \(college\_computer\_science, high\_school\_computer\_science\), and quantitatively oriented fields \(econometrics, professional\_accounting\)\.
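A minimal sketch of this category\-based filter is shown below\. The excluded subject names are those listed above; the `subject` field name and the example records are assumptions about how the data is stored\.

```python
# Subjects excluded from the non-math MMLU set, as listed in the text above.
EXCLUDED_SUBJECTS = {
    "abstract_algebra", "elementary_mathematics", "college_mathematics",
    "high_school_mathematics", "high_school_statistics",
    "conceptual_physics", "college_physics", "high_school_physics", "astronomy",
    "college_chemistry", "high_school_chemistry",
    "high_school_biology",
    "college_computer_science", "high_school_computer_science",
    "econometrics", "professional_accounting",
}

def keep_example(example):
    """True if the example's subject is not strongly quantitative."""
    # The "subject" key is an assumed field name for the MMLU category.
    return example["subject"] not in EXCLUDED_SUBJECTS

# Hypothetical records standing in for MMLU rows.
examples = [
    {"subject": "abstract_algebra", "question": "..."},
    {"subject": "philosophy", "question": "..."},
]
filtered = [ex for ex in examples if keep_example(ex)]
print([ex["subject"] for ex in filtered])  # only the philosophy record remains
```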

#### Translations:

We translate all datasets into German, Hindi, and French using the pretrained multilingual machine translation model facebook/nllb\-200\-3\.3B \(Costa\-Jussà et al\., [2022](https://arxiv.org/html/2606.18453#bib.bib24)\)\.

For GSM8K, our mathematical reasoning benchmark, we employ a placeholder\-based preprocessing step to preserve the integrity of mathematical expressions during translation\. Specifically, dataset\-specific markers such as $\langle\!\langle \cdot \rangle\!\rangle$ and $\#\#\#\#$ are temporarily replaced with placeholders prior to translation and subsequently restored to their original form\.
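A minimal sketch of such a placeholder step is shown below. It assumes the GSM8K markers appear in raw text as `<<...>>` calculator annotations and a `####` final-answer prefix. The placeholder strings (`__CALC0__`, `__FINAL__`) and the function names are hypothetical, and `translate` stands in for whatever translation call is used. A real pipeline would also need to check that the translation model leaves the placeholders intact.

```python
import re

CALC_PATTERN = re.compile(r"<<.*?>>")  # calculator annotations, e.g. <<48/2=24>>
FINAL_MARKER = "####"                  # prefix of the final answer line


def protect(text):
    """Replace dataset markers with placeholders; return text and saved spans."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__CALC{len(saved) - 1}__"

    text = CALC_PATTERN.sub(stash, text)
    return text.replace(FINAL_MARKER, "__FINAL__"), saved


def restore(text, saved):
    """Put the original markers back in place of the placeholders."""
    for i, original in enumerate(saved):
        text = text.replace(f"__CALC{i}__", original)
    return text.replace("__FINAL__", FINAL_MARKER)


def translate_with_placeholders(text, translate):
    """Translate `text` while keeping the protected markers unchanged."""
    protected, saved = protect(text)
    return restore(translate(protected), saved)


sample = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
# A stand-in "translator" that only upper-cases the text.
result = translate_with_placeholders(sample, lambda s: s.upper())
print(result)
```

The trailing double underscore in each placeholder prevents `__CALC1__` from matching inside `__CALC10__` during restoration.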

For RACE, we adopt a sentence\-aware chunking strategy to accommodate the substantially longer reading passages compared to GSM8K and MMLU\. The articles, questions, and answer options are translated independently, after which we reconstruct the hierarchical reading comprehension structure while strictly preserving the original evaluation schema\.
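One possible form of sentence-aware chunking is sketched below. The paper does not specify the splitting rule or the chunk size, so the regular expression, the `max_chars` budget, and the function name are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, max_chars=400):
    """Group whole sentences into chunks of at most `max_chars` characters.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    A single sentence longer than `max_chars` is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


article = "A b. C d! E f?"
print(chunk_sentences(article, max_chars=8))
```

Because chunks are cut only at sentence boundaries, each chunk can be translated on its own and the results joined back in order.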

To ensure translation reliability, we manually inspected a random sample of approximately 10% of the translated instances\.

## Appendix B Evaluation

To assess model performance before and after weight manipulation, we use the EleutherAI Evaluation Harness framework Gao et al\. \([2024](https://arxiv.org/html/2606.18453#bib.bib25)\)\. In addition to English, we evaluate on German, Hindi, and French by implementing custom lm\_eval tasks for GSM8K, RACE, and MMLU that mirror the original English task specifications \(i\.e\., matching dataset splits, preprocessing, and scoring\)\. For GSM8K, we adopt the gsm8k\_cot configuration, which uses an 8\-shot chain\-of\-thought prompting format\. Following prior work Christ et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib6)\); Brown et al\. \([2024](https://arxiv.org/html/2606.18453#bib.bib19)\), we use a subset of 200 samples from the GSM8K test set for efficiency\. Across languages, we evaluate on the same fixed set of examples per task to ensure comparability of results\. All evaluations are repeated for three different random seeds, and the mean and standard deviation are reported accordingly\. To further facilitate a fair comparison across settings, we increase max\_new\_tokens from its default of 150 to 300 for evaluations on Hindi and French\. This adjustment accounts for the possibly higher token counts needed to represent Hindi and French texts compared to math responses in other languages due to script differences\. Using different values of max\_new\_tokens per language does not undermine the objectives of our experiments, because our primary focus is on assessing the relative improvements or deteriorations observed in the intervention experiments for each language independently\.

## Appendix C Jaccard Random Baseline

#### Setup \(fixed\-size, independent masks\)

Let $U$ be a universe of $N$ parameters\. Each set $S_i \subseteq U$ is sampled *independently and uniformly* from all subsets of size

$$|S_i| = m, \qquad m = pN,$$

where $p \in (0,1)$ and $m$ is assumed to be an integer\. For any fixed parameter $\theta \in U$,

$$\Pr(\theta \in S_i) = \frac{m}{N} = p, \qquad \Pr(\theta \notin S_i) = 1 - p.$$

#### Difference sets

Define the one\-sided difference sets

$$A = S_1 \setminus S_2, \qquad B = S_3 \setminus S_4,$$

where all $S_i$ are independent\.

A parameter $\theta$ belongs to $A$ if and only if $\theta \in S_1$ and $\theta \notin S_2$\. Since $S_1$ and $S_2$ are independent,

$$\Pr(\theta \in A) = \Pr(\theta \in S_1)\,\Pr(\theta \notin S_2) = p(1-p).$$

The same holds for $B$:

$$\Pr(\theta \in B) = p(1-p).$$

#### Expected Jaccard similarity

The Jaccard similarity between $A$ and $B$ is

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}.$$

For any fixed $\theta \in U$, membership in $A$ and $B$ depends on disjoint pairs of sets and is therefore independent\. Hence,

$$\Pr(\theta \in A \cap B) = [p(1-p)]^{2},$$

and

$$\begin{gathered}\Pr(\theta \in A \cup B) \\ = \Pr(\theta \in A) + \Pr(\theta \in B) - \Pr(\theta \in A \cap B) \\ = 2p(1-p) - [p(1-p)]^{2}.\end{gathered}$$

For large $N$, the random set sizes concentrate around their expectations, and the expected Jaccard similarity is well approximated by the ratio of expectations:

$$\mathbb{E}[J(A,B)] \approx \frac{[p(1-p)]^{2}}{2p(1-p) - [p(1-p)]^{2}}.$$

Factoring out $p(1-p)$ yields the baseline expression:

$$\boxed{\mathbb{E}[J(A,B)] \approx \frac{p(1-p)}{2 - p(1-p)} = \frac{p(1-p)}{p^{2} - p + 2}.}$$

Note that the approximation arises from replacing $\mathbb{E}[|A \cap B| / |A \cup B|]$ by $\mathbb{E}|A \cap B| / \mathbb{E}|A \cup B|$\. This becomes accurate when $N$ is large due to concentration of measure\.
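As a sanity check on the closed form, the baseline can be compared with a direct simulation of four independent fixed-size random sets. The sketch below is illustrative only: the function names, the universe size, and the seed are arbitrary choices, not values from the paper.

```python
import random


def expected_jaccard(p):
    """Closed-form baseline: p(1-p) / (2 - p(1-p))."""
    q = p * (1.0 - p)
    return q / (2.0 - q)


def simulate_jaccard(n, p, seed=0):
    """Sample S1..S4 uniformly among size-m subsets and compute J(A, B)."""
    rng = random.Random(seed)
    m = round(p * n)
    s1, s2, s3, s4 = (set(rng.sample(range(n), m)) for _ in range(4))
    a = s1 - s2  # A = S1 \ S2
    b = s3 - s4  # B = S3 \ S4
    union = a | b
    return len(a & b) / len(union) if union else 0.0


theory = expected_jaccard(0.1)
empirical = simulate_jaccard(200_000, 0.1)
print(f"theory={theory:.4f} empirical={empirical:.4f}")
```

For $p = 0.1$ the closed form gives $0.09 / 1.91 \approx 0.047$, and the simulated value should land close to it for a universe of this size.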

## Appendix DIntervention Results

In the following, we present results for scaling \(Tables [D\.1](https://arxiv.org/html/2606.18453#A4.T1)\-[D\.6](https://arxiv.org/html/2606.18453#A4.T6)\) and pruning \(Tables [D\.7](https://arxiv.org/html/2606.18453#A4.T7)\-[D\.12](https://arxiv.org/html/2606.18453#A4.T12)\) math\-specific parameters across different models and multiple languages\. These results serve as evidence that the parameter identification for the task and language was performed correctly\. In addition to accuracies on GSM8K, MMLU, and RACE, we also report the flexible\-extract GSM8K score from the lm\_eval \(lm\-evaluation\-harness\) framework\. The harness computes GSM8K exact\-match accuracy after applying an answer\-extraction step: a strict extractor that follows the dataset’s canonical output format \(e\.g\., The answer is <answer\>\) and a more permissive flexible extractor that searches the generated text for a number\-like substring\. We include the flexible metric to reduce sensitivity to output formatting, as our analysis is not focused on instruction\-following behavior\.
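The difference between the two extraction modes can be illustrated with a small sketch. The regular expressions below are simplified stand-ins written for this example and are not the patterns used by lm-evaluation-harness. Taking the last number in the text for the flexible mode is likewise an assumption.

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def strict_extract(text):
    """Return the number following 'The answer is', or None if absent."""
    match = re.search(rf"The answer is ({NUMBER})", text)
    return match.group(1).replace(",", "") if match else None


def flexible_extract(text):
    """Return the last number-like substring in the text, or None."""
    numbers = re.findall(NUMBER, text)
    return numbers[-1].replace(",", "") if numbers else None


canonical = "204 + 160 + 330 is 694. The answer is 694."
free_form = "Total gems = 175 + 140 + 280 = 595. **Answer: 595**."

print(strict_extract(canonical), flexible_extract(canonical))
print(strict_extract(free_form), flexible_extract(free_form))
```

A response with the right number but the wrong format, like the second string, scores under the flexible metric but not under the strict one, which is why the gap between the two metrics is informative about formatting.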

### D\.1 Robustness of parameter identification

Across all settings, in both scaling and pruning experiments, the standard deviation across three random seeds remains low\. We take this as evidence that the identified parameter subset captures math\-reasoning capacity in a stable manner, rather than reflecting idiosyncrasies of a particular sample set\. Moreover, all scaling experiments yield highly stable accuracies on MMLU and RACE, indicating that scaling the identified parameters does not measurably affect general language understanding\. For pruning, the within\-dataset variability is more sensitive to the choice of top\-$k$, but remains within a bounded and reasonably stable range\. This indicates that we isolate parameter subsets that drive math reasoning while being largely irrelevant to general language understanding, which is essential for interpreting our parameter comparisons and Jaccard similarities\.

### D\.2 Pruning results

Across all settings, increasing top\-$k$ leads to a substantial decline in math\-reasoning performance, while non\-math performance remains comparatively stable on MMLU and RACE \(Tables [D\.7](https://arxiv.org/html/2606.18453#A4.T7)\-[D\.12](https://arxiv.org/html/2606.18453#A4.T12)\)\. This selective degradation is the intended effect and indicates that our parameter identification is also successful across model sizes\.

However, surprisingly, when pruning the Qwen3 model with a top\-$k$ = 1e\-5 threshold, corresponding to suppressing approximately 16K parameters, the performance on English GSM8K improves substantially, whereas only a marginal improvement is observed on English GSM8K\-Flex \(Table [D\.9](https://arxiv.org/html/2606.18453#A4.T9)\)\. This gap may be a strong signal that the unpruned model's errors stem largely from wrong output formatting\. This counter\-intuitive gain from parameter reduction aligns with the framework proposed in Ali et al\. \([2025](https://arxiv.org/html/2606.18453#bib.bib27)\), which identifies a class of prominent but detrimental neurons\. These are highly activated units that, while capturing frequent training patterns, often enforce brittle heuristics that hinder the model’s ability to generalize to nuanced reasoning tasks\. Manual inspection reveals that this improvement is only partly due to corrections in arithmetic computation; more importantly, it stems from enhanced in\-context instruction following\. This manifests as shorter reasoning traces that are less likely to be truncated by the maximum generation length and that follow the required output format\. This stabilization of the output path is consistent with AdaRAS by Dong et al\. \([2026](https://arxiv.org/html/2606.18453#bib.bib26)\), which suggests that over\-active reasoning circuits can lead to logical redundancy and verbosity\. By pruning these circuits, we effectively guide the model toward more direct and stable reasoning paths, thereby better mimicking the reasoning style demonstrated in the provided examples\.

Figure [E\.9](https://arxiv.org/html/2606.18453#A5.F9) illustrates a representative example\. Prior to pruning, the model persistently follows its own internal reasoning style, presumably acquired during post\-training, rather than adhering to the few\-shot exemplars\. After suppressing approximately 16K of the most highly activated parameters, the model more consistently aligns its reasoning format with that of the provided demonstrations\. This behavior provides empirical support for the theory of Copy Bias in In\-Context Learning Ali et al\. \([2024](https://arxiv.org/html/2606.18453#bib.bib28)\)\. According to this work, specific parameters act as stylistic anchors that prioritize internal priors over contextual cues; pruning them releases the model from these rigid post\-training biases, allowing for greater behavioral flexibility\.

### D\.3 Scaling results

In contrast to our smallest model \(Llama 1B\), scaling does not yield significant consistent gains on larger models\. Given the strong pruning results, we attribute this mainly to an insufficient scaling factor \(1\.01\) rather than to low\-quality parameter identification\. Since optimizing this hyperparameter is beyond the scope of this work and does not affect our cross\-language comparisons, we leave a systematic search over scaling factors to future work\.
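To make the two interventions concrete, the following sketch applies them to a flat list of weights given a boolean mask of identified parameters. It is a simplified illustration and not the authors' implementation. The function name is hypothetical, and treating pruning as setting the selected weights to zero is an assumption, since the appendix only speaks of "suppressing" parameters. The default factor of 1.01 is the scaling factor mentioned above.

```python
def apply_intervention(weights, mask, mode, factor=1.01):
    """Scale or prune the weights selected by `mask`.

    weights: list of floats (a flattened parameter tensor)
    mask:    list of bools, True where a parameter was identified
    mode:    "scale" multiplies selected weights by `factor`;
             "prune" sets them to zero (assumed form of suppression)
    """
    if len(weights) != len(mask):
        raise ValueError("weights and mask must have the same length")
    if mode == "scale":
        return [w * factor if sel else w for w, sel in zip(weights, mask)]
    if mode == "prune":
        return [0.0 if sel else w for w, sel in zip(weights, mask)]
    raise ValueError(f"unknown mode: {mode}")


weights = [1.0, 2.0, -3.0]
mask = [True, False, True]
scaled = apply_intervention(weights, mask, "scale")
pruned = apply_intervention(weights, mask, "prune")
print(scaled, pruned)
```

With a factor this close to 1, each selected weight changes by only 1%, which is in line with the text's suggestion that the scaling factor may be too small to move larger models.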

However, one can observe a relationship between the model’s general language understanding ability and the effect of scaling on performance, including indications of a saturation effect\. For the small Llama 1B model, scaling yields performance gains \(in English\) of up to 9\.7%\. For larger models, performance also improves in some cases, but the changes are much smaller, ranging only from \-1\.4% to 2\.4% compared to the baseline\. This suggests that scaling is most effective when the model has not yet reached a level of saturated performance\. In cases where performance is already high, scaling may be limited because the task\-relevant parameters have already been sufficiently optimized, rather than because the task\-specific weights were poorly identified\. For German, Hindi, and French, we frequently observe that baseline general language understanding performance \(e\.g\., measured by MMLU or RACE\) influences the gains obtained from scaling\. When language understanding performance is too low, as is often the case for Hindi, scaling tends to produce only small positive or even negative effects\. One possible explanation is that insufficient language knowledge prevents the model from identifying the optimal language\-specific parameters required for mathematical reasoning\. For instance, Qwen3 4B shows consistent improvements from scaling for both German and Hindi; however, the gains are larger for German, which has higher baseline language understanding scores on RACE and MMLU\. Overall, these results suggest that a sufficient level of language understanding is beneficial for the scaling approach, while very high baseline performance can lead to saturation effects that limit further improvements through scaling\.

Table D\.1: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Llama 1B and RACE\.

Table D\.2: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Llama 1B and MMLU\.

Table D\.3: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Qwen3 4B and RACE\.

Table D\.4: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Qwen3 4B and MMLU\.

Table D\.5: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Llama 8B and RACE\.

Table D\.6: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for scaling with Llama 8B and MMLU\.

Table D\.7: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Llama 1B and RACE\.

Table D\.8: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Llama 1B and MMLU\.

Table D\.9: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Qwen3 4B and RACE\.

Table D\.10: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Qwen3 4B and MMLU\.

Table D\.11: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Llama 8B and RACE\.

Table D\.12: Relative change in accuracy \(%\), averaged over 3 seeds \(mean $\pm$ std\), for pruning with Llama 8B and MMLU\.

### D\.4 Collective vs\. Individual Effects

We examine whether the identified math\-specific parameters operate primarily through a collective mechanism or exert independent \(individual\) effects\. For all experiments in this section, we set max\_new\_tokens to 300 across all languages and use RACE as the non\-math dataset\.

As discussed in Section [4\.5](https://arxiv.org/html/2606.18453#S4.SS5) and shown in Figure [4](https://arxiv.org/html/2606.18453#S4.F4), the Llama 1B model exhibits collective behavior\. This trend persists in larger models: Figures [D\.1](https://arxiv.org/html/2606.18453#A4.F1) and [D\.2](https://arxiv.org/html/2606.18453#A4.F2) show that both Qwen3 4B and Llama 8B demonstrate similar collective dynamics, suggesting that this phenomenon is consistent across model scales\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x5.png)

Figure D\.1: Effect of pruning isolated math\-specific parameters on performance across languages for Qwen3 4B at top\-$k \in \{0.001, 0.01, 0.1\}$\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x6.png)

Figure D\.2: Effect of pruning isolated math\-specific parameters on performance across languages for Llama 8B at top\-$k \in \{0.001, 0.01, 0.1\}$\.

## Appendix E Qualitative Examples for Scaling & Pruning

In this section, we present several qualitative examples where scaling improves the model’s initial answers for mathematical reasoning across languages \(Figures [E\.1](https://arxiv.org/html/2606.18453#A5.F1)\-[E\.4](https://arxiv.org/html/2606.18453#A5.F4)\)\. Furthermore, we show examples where pruning steers the model toward incorrect answers for mathematical reasoning \(Figures [E\.5](https://arxiv.org/html/2606.18453#A5.F5)\-[E\.8](https://arxiv.org/html/2606.18453#A5.F8)\), alongside a notable exception in which pruning instead improves performance \(Figure [E\.9](https://arxiv.org/html/2606.18453#A5.F9)\), as discussed in Appendix [D](https://arxiv.org/html/2606.18453#A4)\.

Input

Q: Toula went to the bakery and bought various types of pastries\. She bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen\. How much was the total cost?

A:

Output before scaling \(wrong\)

3 dozen donuts cost 3 × 68 = 204 dollars\. 2 dozen mini cupcakes cost 2 × 80 = 160 dollars\. 6 dozen mini cheesecakes cost 6 × 55 = 330 dollars\. 204 \+ 160 \+ 330 is 734\. The answer is 734\.

Output after scaling \(correct\)

3 dozen donuts cost 3 × 68 = 204 dollars\. 2 dozen mini cupcakes cost 2 × 80 = 160 dollars\. 6 dozen mini cheesecakes cost 6 × 55 = 330 dollars\. 204 \+ 160 \+ 330 is 694\. The answer is 694\.

Figure E\.1: Illustration of arithmetic inconsistency before scaling and corrected output after scaling for English reasoning using Llama 1B\.

Input

Q: Gunter versucht, die Marmeladebohnen in einem Glas zu zählen\. Er fragt seine Freunde, wie viele in dem Glas sind\. Einer sagt 80\. Ein anderer sagt 20 mehr als die Hälfte des ersten\. Ein Drittel sagt 25% mehr als der erste\. Wie hoch ist ihre durchschnittliche Vermutung?

A:

\(English translation: Gunter is trying to count the jam beans in a jar\. He asks his friends how many are in the jar\. One says 80\. Another says 20 more than half the first\. A third says 25% more than the first\. What is their average guess?\)

Output before scaling \(wrong\)

Der erste sagt 80\. Ein anderer sagt 20 mehr als die Hälfte des ersten, also 80 \+ 20 = 100\. Ein Drittel sagt 25% mehr als der erste, also 80 \+ \(80 \* 0,25\) = 100\. Die durchschnittliche Vermutung ist 100\. Die Antwort ist 100\.

\(English translation: The first person says 80\. Another says 20 more than half of the first, so 80 \+ 20 = 100\. One third says 25% more than the first, so 80 \+ \(80 \* 0\.25\) = 100\. The average guess is 100\. The answer is 100\.\)

Output after scaling \(correct\)

Der erste sagt 80\. Ein anderer sagt 20 mehr als die Hälfte des ersten, also 80 \+ 20 = 100\. Ein Drittel sagt 25% mehr als der erste, also 80 \+ \(80 \* 0,25\) = 100 \+ 20 = 120\. Die durchschnittliche Vermutung ist 80\. Die Antwort ist 80\.

\(English translation: The first person says 80\. Another says 20 more than half of the first, so 80 \+ 20 = 100\. One third says 25% more than the first, so 80 \+ \(80 \* 0\.25\) = 100 \+ 20 = 120\. The average guess is 80\. The answer is 80\.\)

Figure E\.2: Illustration of arithmetic inconsistency before scaling and corrected output after scaling for German reasoning using Llama 1B\.

Input

Q: मैरी ने एक चिकन भोजन का ऑर्डर दिया जिसकी कीमत 12 डॉलर है, 5 पैकेट दूध की कीमत 3 डॉलर है, 4 सेबों की कीमत 1\.50 डॉलर है, और पिज्जा के कुछ डिब्बे हैं। मैरी ने कुल 50 डॉलर का भुगतान किया। अगर प्रत्येक बॉक्स की कीमत 8\.50 डॉलर है तो मैरी ने पिज्जा के कितने डिब्बे मंगवाए?

A:

\(English translation: Mary ordered a chicken meal that costs $12, 5 packets of milk that cost $3, 4 apples that cost $1\.50, and some boxes of pizza\. Mary paid a total of $50\. How many boxes of pizza did Mary order if each box costs $8\.50?\)

Output before scaling \(wrong\)

मैरी ने एक चिकन की कीमत 12 डॉलर के लिए 1 बॉक्स खरीदा। उसके पास 50 \- 12 = 38 डॉलर बचे हैं। 5 पैकेट दूध की कीमत 3 डॉलर है, जिसकी कुल कीमत 5 × 3 = 15 डॉलर है। उसके पास 38 \- 15 = 23 डॉलर बचे हैं। 4 सेबों की कीमत 1\.50 डॉलर है, जिसकी कुल कीमत 4 × 1\.50 = 6 डॉलर है। उसके पास 23 \- 6 = 17 डॉलर बचे हैं। पिज्जा के कुछ डिब्बे की कीमत 8\.50 डॉलर है, जिसकी कुल कीमत 17 / 8\.50 = 1\.96 डॉलर है। उसके पास 1\.96 डॉलर बचे हैं। 1\.96 / 8\.50 = 0\.23। उसके पास 0\.23 डिब्बे हैं। उत्तर 0\.23 है।

\(English translation: Mary bought a box of chicken for $12\. She has $50 \- 12 = $38 left\. 5 packets of milk cost $3, for a total cost of 5 × 3 = $15\. She has $38 \- 15 = $23 left\. 4 apples cost $1\.50, for a total cost of 4 × 1\.50 = $6\. She has $23 \- 6 = $17 left\. Some boxes of pizza cost $8\.50, for a total cost of $17 / 8\.50 = $1\.96\. She has $1\.96 left\. 1\.96 / 8\.50 = 0\.23\. She has 0\.23 boxes\. The answer is 0\.23\.\)

Output after scaling \(correct\)

मैरी ने एक चिकन की कीमत 12 डॉलर के लिए 1 बॉक्स खरीदा। उसके पास 50 \- 12 = 38 डॉलर बचे हैं। 5 पैकेट दूध की कीमत 3 डॉलर है, जिसकी कुल कीमत 5 × 3 = 15 डॉलर है। उसके पास 38 \- 15 = 23 डॉलर बचे हैं। 4 सेबों की कीमत 1\.50 डॉलर है, जिसकी कुल कीमत 4 × 1\.50 = 6 डॉलर है। उसके पास 23 \- 6 = 17 डॉलर बचे हैं। पिज्जा के कुछ डिब्बे की कीमत 8\.50 डॉलर है, जिसकी कुल कीमत 17 / 8\.50 = 2 डिब्बे हैं। उत्तर 2 है।

\(English translation: Mary bought a box of chicken for $12\. She has $50 \- 12 = $38 left\. 5 packets of milk cost $3, for a total cost of 5 × 3 = $15\. She has $38 \- 15 = $23 left\. 4 apples cost $1\.50, for a total cost of 4 × 1\.50 = $6\. She has $23 \- 6 = $17 left\. Some boxes of pizza cost $8\.50, for a total cost of 17 / 8\.50 = 2 boxes\. The answer is 2\.\)

Figure E\.3: Illustration of arithmetic inconsistency before scaling and corrected output after scaling for Hindi reasoning using Llama 1B\.

Input

Q: Peter prévoit d’aller au cinéma cette semaine\. Il a toujours un billet à 7 $ et du pop\-corn à 7 $\. S’il a 42 dollars par semaine, combien de fois peut\-il aller au cinéma ?

A:

\(English translation: Peter plans to go to the movies this week\. He always has a $7 ticket and $7 popcorn\. If he has $42 a week, how many times can he go to the movies?\)

Output before scaling \(wrong\)

Peter a 42 dollars\. Il a 7 $ pour le billet et 7 $ pour le pop\-corn\. Donc 42 \- 7 \- 7 = 28\. Il peut aller au cinéma 28 fois\. La réponse est 28\.

\(English translation: Peter has $42\. He has $7 for the ticket and $7 for the popcorn\. So 42 \- 7 \- 7 = 28\. He can go to the movies 28 times\. The answer is 28\.\)

Output after scaling \(correct\)

Peter a 42 dollars\. 7 dollars pour le billet et 7 dollars pour le pop\-corn font 7 \+ 7 = 14 dollars\. Il peut donc aller au cinéma 42 / 14 = 3 fois\. La réponse est 3\.

\(English translation: Peter has $42\. $7 for the ticket and $7 for the popcorn make $7 \+ $7 = $14\. He can therefore go to the movies 42 / 14 = 3 times\. The answer is 3\.\)

Figure E\.4: Illustration of arithmetic inconsistency before scaling and corrected output after scaling for French reasoning using Llama 1B\.

Input

Q: Janet’s ducks lay 16 eggs per day\. She eats three for breakfast every morning and bakes muffins for her friends every day with four\. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg\. How much in dollars does she make every day at the farmers’ market?

A:

Output before pruning \(correct\)

Janet has 16 eggs per day\. She eats 3 for breakfast\. So she has 16 \- 3 = 13 eggs left\. She bakes muffins for 4 eggs\. So she has 13 \- 4 = 9 eggs left\. She sells 9 eggs for $2 each\. 9 \* 2 = 18 dollars\. The answer is 18\.

Output after pruning \(wrong\)

Janet started with 16 eggs\. She eats 3 for breakfast\. She bakes 4 muffins\. She sells 4 \- 3 = 2 eggs at the market\. 16 \- 3 is 8\. The answer is 8\.

Figure E\.5: Illustration of correct output before pruning and wrong output after pruning for English reasoning \(top\-$k$ = 0\.01\) using Llama 1B\.

Input

Q: Janets Enten legen 16 Eier am Tag\. Sie isst jeden Morgen drei zum Frühstück und backt mit vier täglich Muffins für ihre Freunde\. Sie verkauft den Rest auf dem Bauernmarkt täglich für 2 Dollar pro frisches Enten\-Ei\. Wie viel Dollar verdient sie jeden Tag auf dem Bauernmarkt?

A:

\(English translation: Janet’s ducks lay 16 eggs a day\. She eats three for breakfast every morning and bakes muffins with four of them daily for her friends\. She sells the rest at the farmers market every day for $2 per fresh duck egg\. How many dollars does she earn each day at the farmers market?\)

Output before pruning \(correct\)

Janets Enten legen 16 Eier am Tag\. Sie isst 3 zum Frühstück und backt 4 Muffins\. Das sind 3 \+ 4 = 7 Eier pro Tag\. Sie verkauft 16 \- 7 = 9 Eier pro Tag\. Sie verdient 2 Dollar pro Ei\. 9 \* 2 = 18\. Die Antwort ist 18\.

\(English translation: Janet’s ducks lay 16 eggs a day\. She eats 3 for breakfast and bakes 4 muffins\. That’s 3 \+ 4 = 7 eggs per day\. She sells 16 \- 7 = 9 eggs per day\. She earns $2 per egg\. 9 \* 2 = 18\. The answer is 18\.\)

Output after pruning \(wrong\)

Janets Enten legten 16 Eier am Tag\. Sie isst jeden Morgen drei zum Frühstück und backt mit vier täglich Muffins für ihre Freunde\. Sie verkauft den Rest auf dem Bauernmarkt täglich für 2 Dollar pro frisches Ei\. 16 \- 2 = 4\. Die Antwort ist 4\. Die Antwort ist 4\. Die Antwort ist 4\. Die Antwort ist 4\. Die Antwort ist 4\.

\(English translation: Janet’s ducks laid 16 eggs a day\. She eats three for breakfast every morning and bakes muffins for her friends with four of them daily\. She sells the rest at the farmers market every day for $2 a fresh egg\. 16 \- 2 = 4\. The answer is 4\. The answer is 4\. The answer is 4\. The answer is 4\. The answer is 4\.\)

Figure E\.6: Illustration of correct output before pruning and wrong output after pruning for German reasoning \(top\-$k$ = 0\.01\) using Llama 1B\.

Input

Q: हेनरी ने अपनी 60 मील की साइकिल यात्रा के दौरान दो बार रुका। वह पहली बार 20 मील के बाद रुका। यात्रा के अंत से 15 मील पहले उनका दूसरा पड़ाव था। उसने अपने पहले और दूसरे स्टॉप के बीच कितने मील की यात्रा की?

A:

\(English translation: Henry stopped twice during his 60\-mile bicycle trip\. He first stopped after 20 miles\. His second stop was 15 miles before the end of the trip\. How many miles did he travel between his first and second stops?\)

Output before pruning \(correct\)

हेनरी ने अपनी यात्रा के दौरान 60 मील की यात्रा की और दो बार रुका। पहली बार 20 मील की यात्रा के बाद, उसका दूसरा पड़ाव 15 मील की यात्रा के बाद था। 60 \- 20 \- 15 = 25 मील। उत्तर 25 है।

\(English translation: Henry traveled 60 miles during his journey and stopped twice\. The first time was after traveling 20 miles, and his second stop was after traveling 15 miles\. 60 \- 20 \- 15 = 25 miles\. The answer is 25\.\)

Output after pruning \(wrong\)

हेनरी ने अपनी 60 मील की साइकिल यात्रा के दौरान दो बार रुका। पहली बार 20 मील के बाद, उसका दूसरा पड़ाव 15 मील के बाद हुआ । 60 \- 20 = 40। 60 \- 20 = 40। 60 \- 20 = 40। उत्तर 40 है

\(English translation: Henry stopped twice during his 60\-mile bicycle ride\. The first time was after 20 miles, and his second stop was after 15 miles\. 60 \- 20 = 40\. 60 \- 20 = 40\. 60 \- 20 = 40\. The answer is 40\.\)

Figure E\.7: Illustration of correct output before pruning and wrong output after pruning for Hindi reasoning \(top\-$k$ = 0\.01\) using Llama 1B\.

Input

Q: James décide de courir 3 sprints 3 fois par semaine\. Il court 60 mètres à chaque sprint\. Combien de mètres au total par semaine ?

A:

\(English translation: James decides to run 3 sprints 3 times a week\. He runs 60 meters in each sprint\. How many meters in total per week?\)

Output before pruning \(correct\)

James court 3 sprints 3 fois par semaine\. Cela fait 3 × 3 = 9 sprints\. Chaque sprint fait 60 mètres\. Donc 9 × 60 = 540 mètres\. La réponse est 540\.

\(English translation: James runs 3 sprints 3 times a week\. That makes 3 × 3 = 9 sprints\. Each sprint is 60 meters\. So 9 × 60 = 540 meters\. The answer is 540\.\)

Output after pruning \(wrong\)

James a 3 sprints 3 fois par semaine\. Il court 60 mètres à chaque sprint\. 3 \+ 3 = 5 sprints\. 3 \+ 3 = 5\. La réponse est 5\.

\(English translation: James has 3 sprints 3 times a week\. He runs 60 meters in each sprint\. 3 \+ 3 = 5 sprints\. 3 \+ 3 = 5\. The answer is 5\.\)

Figure E\.8: Illustration of correct output before pruning and wrong output after pruning for French reasoning \(top\-$k$ = 0\.01\) using Llama 1B\.

Input

Q: There are 15 trees in the grove\. Grove workers will plant trees in the grove today\. After they are done, there will be 21 trees\. How many trees did the grove workers plant today?

A: There are 15 trees originally\. Then there were 21 trees after some more were planted\. So there must have been 21 \- 15 = 6\. The answer is 6\.

\[\.\.\.\]

Q: A treasure hunter found a buried treasure chest filled with gems\. There were 175 diamonds, 35 fewer rubies than diamonds, and twice the number of emeralds than the rubies\. How many of the gems were there in the chest?

A:

Output before pruning \(wrong format\)

Let’s break down the problem step by step:

1\. \*\*Number of diamonds\*\*: The problem states there are 175 diamonds\. So, diamonds = 175\.

2\. \*\*Number of rubies\*\*: There are 35 fewer rubies than diamonds\. So, rubies = diamonds \- 35 = 175 \- 35 = 140\.

3\. \*\*Number of emeralds\*\*: There are twice as many emeralds as rubies\. So, emeralds = 2 × rubies = 2 × 140 = 280\.

4\. \*\*Total number of gems\*\*: Total gems = diamonds \+ rubies \+ emeralds Total gems = 175 \+ 140 \+ 280 = 595\.

\*\*Answer: 595\*\*\. ✅

The total number of gems in the chest is \*\*595\*\*\. ✅

Final Answer: \*\*595\*\*\. ✅

Output after pruning \(correct format\)

The number of rubies is 175 \- 35 = 140\. The number of emeralds is 2 \* 140 = 280\. The total number of gems is 175 \+ 140 \+ 280 = 595\. The answer is 595\.

Figure E\.9: Illustration of wrong output formatting and poor in\-context learning before pruning and correct output after pruning for English reasoning \(top\-$k$ = 1e\-5\) using Qwen3 4B\.

## Appendix F Similarities across Languages & Tasks

In this section, we present plots showing the number of isolated parameters across languages \(English \(En\), German \(De\), Hindi \(Hi\), and French \(Fr\)\), as well as Jaccard similarity plots for each language pair, across all top\-$k$ values and models\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x8.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x10.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x11.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x12.png)

Figure F\.1: Layer\-wise Jaccard similarity of math\-reasoning parameters across top\-$k$ values in Llama 1B, computed for all language pairs \(English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\) and the random baseline \(Chance\) on RACE\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x14.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x16.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x18.png)

Figure F\.2: Layer\-wise Jaccard similarity of math\-reasoning parameters across top\-$k$ values in Qwen3 4B, computed for all language pairs \(English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\) and the random baseline \(Chance\) on RACE\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x19.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x20.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x21.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x22.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x24.png)

Figure F\.3: Layer\-wise Jaccard similarity of math\-reasoning parameters across top\-$k$ values in Llama 8B, computed for all language pairs \(English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\) and the random baseline \(Chance\) on RACE\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x25.png)

Figure F\.4: Math\-specific parameters in Qwen3 4B\. Left: Number of isolated parameters per language \(English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\)\. Right: Jaccard similarity between language\-specific parameter sets across all language pairs, compared to a random baseline using RACE as comparison dataset\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x26.png)

Figure F\.5: Math\-specific parameters in Llama 8B\. Left: Number of isolated parameters per language \(English \(En\), German \(De\), French \(Fr\), and Hindi \(Hi\)\)\. Right: Jaccard similarity between language\-specific parameter sets across all language pairs, compared to a random baseline using RACE as comparison dataset\.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x27.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x28.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x29.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x30.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x31.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x32.png)

Figure F.6: Layer-wise distribution of math-reasoning parameters across top-$k$ values in Llama 1B, using RACE as comparison dataset.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x33.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x34.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x35.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x36.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x37.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x38.png)

Figure F.7: Layer-wise distribution of math-reasoning parameters across top-$k$ values in Qwen3 4B, using RACE as comparison dataset.

![Refer to caption](https://arxiv.org/html/2606.18453v1/x39.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x40.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x41.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x42.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x43.png)

![Refer to caption](https://arxiv.org/html/2606.18453v1/x44.png)

Figure F.8: Layer-wise distribution of math-reasoning parameters across top-$k$ values in Llama 8B, using RACE as comparison dataset.
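The layer-wise Jaccard similarities reported in Figures F.2 to F.5 compare, for each language pair, the sets of parameters selected in each layer. The following is a minimal sketch of such a comparison, not the authors' implementation. The function names (`jaccard`, `layerwise_jaccard`, `chance_jaccard`) and the input layout (a dictionary mapping each language to per-layer sets of parameter indices) are hypothetical. The chance baseline shown here draws uniformly random sets of matching size, which is an assumption and may differ from how the Chance curve in the figures was obtained.

```python
from itertools import combinations
import random


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def layerwise_jaccard(params_by_lang: dict) -> dict:
    """Compute per-layer Jaccard similarity for every language pair.

    params_by_lang: {language: {layer_index: set of parameter indices}}
    (hypothetical layout, chosen for illustration).
    Returns {(lang_a, lang_b): {layer_index: similarity}} with languages
    in each pair sorted alphabetically.
    """
    result = {}
    for lang_a, lang_b in combinations(sorted(params_by_lang), 2):
        layers_a, layers_b = params_by_lang[lang_a], params_by_lang[lang_b]
        all_layers = sorted(set(layers_a) | set(layers_b))
        result[(lang_a, lang_b)] = {
            layer: jaccard(layers_a.get(layer, set()), layers_b.get(layer, set()))
            for layer in all_layers
        }
    return result


def chance_jaccard(n_total: int, size_a: int, size_b: int,
                   trials: int = 200, seed: int = 0) -> float:
    """Average Jaccard similarity of two uniformly random index sets.

    Assumption: the baseline samples size_a and size_b indices without
    replacement from n_total candidates; the paper's baseline may differ.
    """
    rng = random.Random(seed)
    population = range(n_total)
    total = 0.0
    for _ in range(trials):
        a = set(rng.sample(population, size_a))
        b = set(rng.sample(population, size_b))
        total += jaccard(a, b)
    return total / trials
```

Under this sketch, an observed per-layer similarity well above the value returned by `chance_jaccard` for sets of the same sizes would indicate overlap beyond what random selection produces.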
