Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Summary
This paper introduces PolyFact, a large-scale multilingual factual QA dataset, and demonstrates that reinforcement learning via GRPO significantly improves cross-lingual factual consistency in LLMs compared to supervised fine-tuning, by reorganizing multilingual representations.
# Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Source: [https://arxiv.org/html/2606.06586](https://arxiv.org/html/2606.06586)
Jonathan von Rad (corresponding author), Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O’Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp. University College London, Centre for Artificial Intelligence. {jonathan.rad.25, eduardo.sanchez.22, yao.lu}@ucl.ac.uk
###### Abstract
Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a problem known as *cross-lingual factual inconsistency*. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset of 100K Wikidata-grounded facts across 12 typologically diverse languages, and use it to compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses reveal that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads and promoting shared cross-lingual representations. We release our code, models, and dataset: [jvonrad/Lost-in-Mistranslation](https://github.com/jvonrad/Lost-in-Mistranslation), [jvonrad/PolyFact](https://huggingface.co/datasets/jvonrad/PolyFact).
## 1 Introduction
Large language models (LLMs) trained predominantly on English data encode vast amounts of world knowledge, yet struggle to reliably access this knowledge in other languages, leading to cross-lingual factual inconsistency Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)); Schut et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib10)). This raises a key question: how can models be enabled to access their latent knowledge through non-English interfaces without large-scale additional pretraining?
Recent work suggests that multilingual models may rely on shared internal representations Schut et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib10)); Wendler et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib30)), where reasoning is performed in a shared latent space before being translated into the target language. In line with this perspective, prior work has shown that cross-lingual factual inconsistency often does not stem from missing knowledge, but rather emerges during the *language transition phase* Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)); Gekhman et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib24)); Lu et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib29)); Liu et al. ([2025b](https://arxiv.org/html/2606.06586#bib.bib35)). Specifically, models may correctly retrieve the answer in intermediate layers, yet fail to map it reliably into the target language in later layers, leading to inconsistent or incorrect outputs across languages.
Figure 1: Incentivizing cross-lingual factual consistency through post-training on our multilingual question-answering dataset, PolyFact. GRPO-based reinforcement learning promotes shared internal representations that yield consistent factual predictions across languages, whereas SFT primarily leads to surface-level memorization.
More recently, parallel data has been identified as a key driver of multilingual capabilities during pretraining Qorib et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib2)); Shao et al. ([2026](https://arxiv.org/html/2606.06586#bib.bib23)); Wang et al. ([2025b](https://arxiv.org/html/2606.06586#bib.bib28)); Fu et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib25)); Lin et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib27)); Wu et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib26)). However, while continual pretraining (CPT) on parallel corpora improves translation fluency, it often fails to substantially improve performance on more demanding tasks such as multilingual factual recall Shen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib3)). This suggests that parallel data primarily improves the alignment of internal representations, while the model still struggles to reliably access knowledge encoded by the aligned representations through non-English language interfaces, resulting in inconsistent multilingual outputs.
Building on this insight, we hypothesize that multilingual factual recall in English-dominant LLMs can be improved without large-scale retraining by separating *representation alignment* from *cross-lingual knowledge access*. Concretely, our contributions are as follows:
1. We show that light CPT on parallel data provides limited gains for cross-lingual factual recall, motivating post-training on multilingual factual QA as a more direct mechanism for improving latent factual knowledge access in English-dominant LLMs.
2. We demonstrate that, among factual QA post-training methods, consistency-driven RL via GRPO consistently outperforms SFT, improving cross-lingual factual consistency and generalization to unseen languages while reshaping internal representations and multilingual routing, as visualized in Figure [1](https://arxiv.org/html/2606.06586#S1.F1).
3. We create and open-source PolyFact, a large, fully parallel multilingual factual QA dataset grounded in Wikidata, spanning 100K facts across 12 typologically diverse high- and low-resource languages listed in Figure [2](https://arxiv.org/html/2606.06586#S3.F2).
## 2 Related Work
#### Cross\-Lingual Factual Recall\.
Recent mechanistic studies of large language models (LLMs) have found that the primary bottleneck for cross-lingual factual recall is not a knowledge deficit but rather failures during language transition Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)). This breakdown can occur either in early layers, where the model fails to map prompts into its shared, English-like, language-agnostic conceptual space, or, more commonly, in the final layers, where latent concepts fail to decode into the correct target-language tokens Liu et al. ([2025b](https://arxiv.org/html/2606.06586#bib.bib35)); Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)). While query-level interventions such as “subject injection” or English pivoting Bandarkar et al. ([2026](https://arxiv.org/html/2606.06586#bib.bib36)); Liu et al. ([2025b](https://arxiv.org/html/2606.06586#bib.bib35)) can temporarily alleviate these inconsistencies, they function as inference-time patches. Their success suggests that cross-lingual alignment is an important bottleneck for multilingual consistency, motivating more persistent model-level adaptation.
#### Parallel Data\.
A common method of extending the multilingual capabilities of English-centric language models relies on Continual Pretraining (CPT) Fujii et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib5)); Kuulmets et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib6)); Shao et al. ([2026](https://arxiv.org/html/2606.06586#bib.bib23)). However, CPT is computationally expensive and often leads to catastrophic forgetting of the original model’s English capabilities Fujii et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib5)). More recently, parallel data has been identified as the most salient source of multilingual capabilities during pretraining Qorib et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib2)); Fu et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib25)); Lin et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib27)); Wu et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib26)). However, Shen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib3)) note a significant limitation: while CPT on parallel corpora improves translation ability, it often fails to significantly boost performance on harder tasks, such as cross-lingual factual recall. This suggests that while CPT succeeds in creating a performative interface, giving the illusion of multilingual capability through surface-level fluency, it remains largely disconnected from the model’s internal knowledge.
#### Post\-Training via Reinforcement Learning\.
The emergence of Reinforcement Learning (RL) as a post-training method offers a novel path for adapting models to new domains and improving alignment with task-specific objectives. Mechanistic analysis by Matsutani et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib37)) suggests that Supervised Fine-Tuning (SFT) and RL play complementary roles during post-training: while SFT expands the model’s behavioural search space, RL “squeezes” it, concentrating probability mass on consistent and correct reasoning paths. Current work on monolingual reasoning consistency, such as DeReason Hu et al. ([2026](https://arxiv.org/html/2606.06586#bib.bib38)) and CC-Learn Ye et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib39)), emphasises that RL is most effective when preceded by an SFT “warm-up” to mitigate its cold-start problem, but the effect of RL on cross-lingual consistency remains largely unexplored. GRPO was popularized by DeepSeek-R1 Guo et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib16)), which incentivizes reasoning ability with a verifiable consistency reward. The method was recently validated in the multilingual domain, applied to RAG, by Qi et al. ([2026](https://arxiv.org/html/2606.06586#bib.bib40)), allowing for a fundamental shift in multilingual optimisation.
## 3 Method
#### PolyFact Dataset\.
We construct PolyFact, a fully parallel multilingual multiple-choice QA dataset for studying cross-lingual factual consistency. Starting from Wikidata truthy triples ([https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2](https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2)), we retain 22 factual relations spanning geography, biography, creative works, and organizational or cultural ties, and extract labels in twelve high- and low-resource languages (Figure [2](https://arxiv.org/html/2606.06586#S3.F2)). For each fact, we sample three type- and length-matched distractors from co-occurring objects of the same property, then use round-robin balanced sampling to obtain 100,000 facts. We generate parallel MCQ bundles with Gemma-3-27B-IT (Kamath et al., [2025](https://arxiv.org/html/2606.06586#bib.bib22)). Quality is assessed using a web-grounded GPT-4o judge and manual review, yielding 91% LLM-human agreement and a recommended PolyFact-Clean filter for high-ambiguity relations. The final corpus contains 95,000 training, 2,500 validation, and 2,500 test facts, with verification labels and quality tiers included. Further details are provided in Appendix [D](https://arxiv.org/html/2606.06586#A4).
#### Continual Pretraining\.
We continually pretrain on TED2025 Shen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib3)), a multi-way parallel corpus covering the 12 languages in Figure [2](https://arxiv.org/html/2606.06586#S3.F2). We retain talks containing at least two target languages and format each row as a multilingual block where available translations appear once, in randomized order, followed by an end-of-sequence token. Adjacent rows from the same talk are packed into $\sim$512-token chunks and truncated at 1024 tokens. To improve coverage for Swahili and Bengali, we augment TED2025 with Rogendo English–Swahili and AI4Bharat Samanantar English–Bengali sentence pairs. The final CPT corpus contains 325,134 packed chunks totalling 235.5M tokens (Table [3](https://arxiv.org/html/2606.06586#A2.T3)).
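The row formatting described above can be sketched as follows. This is only an illustration under stated assumptions: the `EOS` string, the newline separator between translations, and the per-row `min_langs` check are hypothetical choices, not details confirmed by the paper (which applies the two-language filter at the talk level and does not specify a separator).

```python
import random

EOS = "</s>"  # hypothetical end-of-sequence marker; the real token depends on the tokenizer


def format_block(translations, rng, min_langs=2):
    """Build one multilingual block from a row of parallel translations.

    translations: dict mapping a language code to its sentence (or None if missing).
    Returns None when fewer than `min_langs` translations are available
    (an assumed per-row analogue of the talk-level filter).
    """
    available = [text for text in translations.values() if text]
    if len(available) < min_langs:
        return None
    rng.shuffle(available)  # each translation appears once, in randomized order
    return "\n".join(available) + EOS  # newline separator is an assumption


rng = random.Random(0)
block = format_block({"en": "Hello", "de": "Hallo", "sw": None}, rng)
```

Packing adjacent blocks into chunks of roughly 512 tokens would then operate on the tokenized output of `format_block`.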
Figure 2: Expanding language coverage from English-only capability to the 12 most widely spoken languages (18.5% → 70% of the global population).
#### Post\-Training via GRPO\.
We apply a multilingual variant of GRPO Shao et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib17)) on the PolyFact dataset, where each training item is a single factual multiple-choice question available in parallel across all twelve languages. For every fact we sample $G=8$ grouped rollouts; each rollout consists of twelve *independent* generations, one per language, produced from language-specific prompts that instruct the model to return the answer in the target language. For a rollout producing answers $\{\hat{y}_{\ell}\}_{\ell=1}^{L}$ across $L=12$ languages, the reward is computed as
$$R = \sum_{\ell=1}^{L} r_{\ell} \;+\; \mathbb{1}\!\left[\forall \ell,\; r_{\ell} = 1\right], \tag{1}$$
$$r_{\ell} = \begin{cases} +1 & \hat{y}_{\ell} \text{ correct option} \\ -0.5 & \hat{y}_{\ell} \text{ hallucination} \\ \phantom{+}0 & \hat{y}_{\ell} \text{ incorrect option} \end{cases} \tag{2}$$
where an answer is considered correct if $\hat{y}_{\ell}$ matches the gold answer $y^{\star}_{\ell}$, and invalid (a hallucination) if it does not match any option in $\mathcal{O}_{\ell}$. The final term adds a bonus of $+1$ if all languages are answered correctly, encouraging cross-lingual consistency.
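As a minimal sketch of Equations (1) and (2), the reward for one rollout could be computed as below. The function names and the use of exact string matching are assumptions made for illustration; the paper does not specify how generated answers are normalised before matching.

```python
def per_language_reward(answer, gold, options):
    """r_l from Eq. (2): +1 correct, -0.5 hallucination, 0 incorrect option."""
    if answer == gold:
        return 1.0
    if answer not in options:  # matches none of the offered options
        return -0.5
    return 0.0


def rollout_reward(answers, golds, options):
    """R from Eq. (1): sum of per-language rewards plus an all-correct bonus.

    answers, golds: dicts keyed by language code.
    options: dict mapping each language code to its list of answer options.
    A missing answer is treated as a hallucination (an assumption).
    """
    rewards = [
        per_language_reward(answers.get(lang), golds[lang], options[lang])
        for lang in golds
    ]
    bonus = 1.0 if all(r == 1.0 for r in rewards) else 0.0
    return sum(rewards) + bonus
```

With $L=12$ languages, a rollout that answers every language correctly receives $12 + 1 = 13$, so the bonus is worth as much as one additional correct language.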
#### Post\-Training via SFT\.
As a supervised counterpart to GRPO, we finetune Qwen-2.5-7B and OLMo-2-1124-7B on PolyFact with a joint classification-plus-consistency objective:
$$\mathcal{L} = -\tfrac{1}{L}\sum_{\ell=1}^{L}\log p_{\ell}(y^{\star}_{\ell}) \;+\; \lambda\cdot\tfrac{1}{L}\sum_{\ell=1}^{L}\mathrm{KL}\!\left(p_{\ell}\,\|\,\mathrm{sg}(\bar{p})\right), \tag{3}$$
where $p_{\ell}\in\Delta^{4}$ is the model’s distribution over $\{A,B,C,D\}$ in language $\ell$, $\bar{p}$ is the group mean across the $L=12$ parallel copies of a fact (all processed in the same forward pass), $\mathrm{sg}(\cdot)$ is stop-gradient, and $\lambda=0.5$. We introduce the consistency term because preliminary GRPO experiments showed that its joint reward reshapes the model’s internal representations toward a more language-agnostic space; to isolate whether this effect is driven by the *objective* (rewarding cross-lingual agreement) versus the *algorithm* (on-policy RL), our SFT baseline must share the former. Results for a pure SFT variant with $\lambda=0$ are reported in Appendix [E.1](https://arxiv.org/html/2606.06586#A5.SS1).
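To make the value of Equation (3) concrete, the following sketch evaluates it on plain Python lists. It only computes the scalar loss; the stop-gradient affects backpropagation and has no effect on the value, so it is not modelled here. The function names and the representation of the gold answers as option indices are assumptions for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sft_consistency_loss(probs, gold_idx, lam=0.5):
    """Scalar value of Eq. (3) for one fact.

    probs: list of L distributions over the options (A, B, C, D), one per language.
    gold_idx: list of L indices of the gold option in each language.
    """
    n_langs = len(probs)
    # Cross-entropy term: mean negative log-probability of the gold option.
    ce = -sum(math.log(p[g]) for p, g in zip(probs, gold_idx)) / n_langs
    # Group mean distribution across the parallel copies of the fact.
    mean = [sum(p[k] for p in probs) / n_langs for k in range(len(probs[0]))]
    # Consistency term: mean KL divergence of each language from the group mean.
    consistency = sum(kl(p, mean) for p in probs) / n_langs
    return ce + lam * consistency
```

When all languages produce the same distribution, the consistency term vanishes and the loss reduces to ordinary cross-entropy; disagreement between languages is penalised in proportion to $\lambda$.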
#### Mechanistic Interpretability – LAHIS
To examine how finetuning affects internal language processing, we apply LAHIS Liu et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib13)) to the base, SFT-, and GRPO-finetuned models. LAHIS estimates the contribution of individual attention heads to language-specific processing using a first-order Taylor approximation with respect to a learned head mask. We run the analysis across all 12 languages using parallel sentences from the TED2025 corpus Shen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib3)). For each language, attention heads are ranked by their resulting importance scores, and the top 2% (20 out of 1,024 in OLMo-2-7B’s $32\times 32$ configuration) are defined as language-important. We then track how their distribution across layers and overlap across languages evolve across models.
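The selection and comparison steps above can be sketched as follows, taking per-head importance scores as given (computing them with LAHIS is outside the scope of this sketch). The overlap definition used here, shared heads divided by the size of the first set, is an assumption; the text does not state the exact formula.

```python
def top_heads(scores, fraction=0.02):
    """Return the set of language-important heads for one language.

    scores: dict mapping (layer, head) -> importance score.
    With 32 x 32 = 1,024 heads and fraction=0.02 this keeps 20 heads.
    """
    k = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap(heads_a, heads_b):
    """Fraction of heads in heads_a that are also in heads_b (assumed definition)."""
    return len(heads_a & heads_b) / len(heads_a)


def layer_share(heads, n_layers=32):
    """Share of the selected heads that fall in each layer."""
    counts = [0] * n_layers
    for layer, _ in heads:
        counts[layer] += 1
    return [c / len(heads) for c in counts]
```

`overlap` corresponds to the pairwise head-sharing comparison between two languages, and `layer_share` to the per-layer distribution of language-important heads.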
#### Mechanistic Interpretability – LAPE
To identify language-specific neurons in MLP layers, we use LAPE Tang et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib34)). We define a neuron as one intermediate feed-forward dimension, corresponding to a row of the up-projection matrix and the matching column of the down-projection matrix, and treat it as active when its post-activation value is positive.
Using parallel TED2025 text across our twelve target languages Shen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib3)), we estimate each neuron’s activation frequency per language, normalize these frequencies across languages, and compute the Shannon entropy:
$$\mathrm{LAPE}_{i,j} = -\sum_{k} p^{\prime}_{i,j,k}\,\log\!\left(p^{\prime}_{i,j,k}\right).$$
Low-entropy neurons are more language-specialized. Following Tang et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib34)), we take the bottom 1% as language-specific, assign each neuron to the language with its highest activation frequency, and analyze their distribution across layers and languages.
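A minimal sketch of the entropy computation and the bottom-1% selection is given below, assuming the per-language activation frequencies have already been measured. The handling of neurons that never activate (treated as maximally unspecific) and the `max(1, ...)` floor on the selection size are assumptions added to keep the example well defined.

```python
import math


def lape(freqs):
    """Entropy of one neuron's activation frequencies, normalised across languages."""
    total = sum(freqs)
    if total == 0:
        return math.log(len(freqs))  # assumption: never-active neurons get maximum entropy
    probs = [f / total for f in freqs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def language_specific(neuron_freqs, fraction=0.01):
    """Select the lowest-entropy neurons and assign each to its dominant language.

    neuron_freqs: dict mapping a neuron id to a list of per-language frequencies.
    Returns a dict mapping each selected neuron id to a language index.
    """
    ranked = sorted(neuron_freqs, key=lambda n: lape(neuron_freqs[n]))
    k = max(1, int(len(ranked) * fraction))
    return {
        n: max(range(len(neuron_freqs[n])), key=neuron_freqs[n].__getitem__)
        for n in ranked[:k]
    }
```

A neuron that fires for only one language has entropy 0, while one that fires equally often for all twelve languages has entropy $\log 12$.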
## 4 Experimental Setup
#### Models\.
We evaluate our approach on OLMo-2-1124-7B OLMo et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib31)) and Qwen-2.5-7B Qwen et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib41)), two 7B-scale decoder-only language models with different pretraining distributions and multilingual capabilities. OLMo-2-1124-7B serves as an English-dominant base model (documented in its open-source pretraining corpus) with a large English–non-English performance gap, while Qwen-2.5-7B provides a stronger multilingual comparison point.
#### Training Pipeline\.
To disentangle the effects of continual pretraining \(CPT\) from those of post\-training, we study six model variants: \(i\) the base 7B models, \(ii\) a CPT model obtained by adapting the base model on 235\.5M tokens of balanced parallel data, \(iii\-iv\) SFT and GRPO models trained directly from the base model, and \(v\-vi\) SFT and GRPO models initialized from the CPT checkpoint\. This setup allows us to isolate the contribution of light multilingual continual pretraining and consistency\-driven post\-training, as well as their interaction\.
#### Training Setup and Evaluation\.
Experiments were conducted on A100 GPUs. SFT and GRPO use LoRA finetuning Hu et al. ([2022](https://arxiv.org/html/2606.06586#bib.bib32)) with rank $r=64$ and $\alpha=128$. Continual pretraining uses 235.5M tokens from the balanced TED2025 parallel corpus. The detailed CPT training pipeline and hyperparameters are listed in Appendix [B](https://arxiv.org/html/2606.06586#A2).
For post-training, we train both SFT and GRPO on PolyFact, either directly from the base model or from the CPT checkpoint. GRPO uses grouped rollouts with group size $G=8$ over $L=12$ languages and optimizes a reward that favors correct answers, penalizes hallucinated outputs, and adds a bonus when all languages are answered correctly. Full GRPO and SFT hyperparameters are reported in Appendix Tables [4](https://arxiv.org/html/2606.06586#A2.T4) and [5](https://arxiv.org/html/2606.06586#A2.T5).
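For readers unfamiliar with GRPO, the group-relative step can be sketched as below: each rollout's reward is standardised against the other rollouts sampled for the same fact. This follows the standard GRPO formulation; whether the multilingual variant used here changes the normalisation is not stated, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise the rewards of one group of rollouts (G = 8 in this setup).

    rewards: list of scalar rollout rewards R for the same fact.
    Returns one advantage per rollout; eps avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: one rollout answers all 12 languages correctly (R = 13).
advantages = group_relative_advantages([13.0, 5.5, 7.0, 5.5, 6.0, 4.0, 9.0, 6.0])
```

Because advantages are relative within the group, a fact on which all eight rollouts score identically contributes no learning signal.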
We evaluate multilingual factual recall on PolyFact, KLAR, and Global-MMLU Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)); Singh et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib15)). We run the benchmarks using LightEval Habib et al. ([2023](https://arxiv.org/html/2606.06586#bib.bib18)) and lm-evaluation-harness Gao et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib20)), with vLLM Kwon et al. ([2023](https://arxiv.org/html/2606.06586#bib.bib19)) serving as the backend. We further analyze our results using the mechanistic interpretability tools LAHIS Liu et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib13)) and LAPE Tang et al. ([2024](https://arxiv.org/html/2606.06586#bib.bib34)), as discussed in Section [3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px5), to compare how SFT and GRPO reshape language routing and specialization.
## 5 Results
We first present the main performance results across datasets, before analysing cross\-lingual transfer and mechanistic interpretability results to understand the roots of improved cross\-lingual factual consistency\.
### 5.1 Performance
#### Reinforcement Learning dominates\.
Table [1](https://arxiv.org/html/2606.06586#S5.T1) shows a clear difference between supervised finetuning and GRPO-based post-training. Across all three evaluation settings, GRPO delivers the strongest overall gains relative to the base model, while SFT provides weaker and less consistent improvements. On PolyFact, GRPO applied directly to the base model achieves the best results for both models in both high- and low-resource languages. CPT + GRPO also outperforms the baseline for OLMo-2-1124-7B, but falls below it for Qwen-2.5-7B (64.11 vs. 66.69 on high-resource and 49.33 vs. 52.26 on low-resource languages). This indicates that GRPO is highly effective at improving performance on the training-style task itself, and that alignment (in our light CPT setup on parallel data) before GRPO does not further improve in-domain factual accuracy.
#### Transfer to KLAR\.
Evaluating the SFT and GRPO finetuned models on the KLAR dataset Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)) highlights the different capabilities each training method actually develops (see Table [1](https://arxiv.org/html/2606.06586#S5.T1)). Crucially, KLAR requires free-form answer generation, whereas the PolyFact training data is multiple-choice, so transfer to KLAR tests whether the finetuned skills generalise beyond candidate selection. GRPO finetuning improves all languages except Russian, while SFT regresses in every language. This asymmetry suggests that SFT’s training-set gains reflect pattern-matching over the answer candidates rather than genuine improvements in cross-lingual retrieval: when the candidates are removed, the learned behaviour offers no benefit and actively hurts performance. In fact, SFT’s main failure mode is to output “1” or “2” rather than attempt to answer the prompt. GRPO, by contrast, appears to use the PolyFact data in a manner that develops real cross-lingual retrieval and target-language generation.
#### Generalisation to Global\-MMLU\.
On Global-MMLU, the pattern is slightly different but still favours GRPO. The GRPO models trained directly from the base model achieve the highest scores for both high- and low-resource languages, while the CPT-initialized (aligned) variants regress somewhat relative to the baseline. This suggests that alignment alone can reduce multilingual knowledge performance, and that while GRPO can recover and exceed baseline performance, combining alignment with post-training does not always produce additive gains. It is worth noting that Global-MMLU is substantially more challenging than pure factual recall benchmarks, as it requires multi-step reasoning and domain knowledge rather than direct retrieval. Overall, the results indicate that GRPO is the most robust method across datasets: it improves in-domain performance, transfers more effectively to held-out languages, and generalises better to broader multilingual knowledge evaluation than SFT.
| Model | Method | PolyFact High | PolyFact Low | KLAR Trained | KLAR OOD | Global-MMLU High | Global-MMLU Low |
|---|---|---|---|---|---|---|---|
| OLMo-2-1124-7B | Baseline | 57.93 | 51.80 | 24.6 | 13.3 | 38.72 | 31.79 |
| OLMo-2-1124-7B | CPT | 57.89 | 51.82 | 17.0 | 8.3 | 37.41 | 29.32 |
| OLMo-2-1124-7B | SFT | 56.33 | 50.04 | 18.1 | 7.8 | 35.40 | 30.32 |
| OLMo-2-1124-7B | CPT + SFT | 59.20 | 52.02 | 20.7 | 8.6 | 36.15 | 30.88 |
| OLMo-2-1124-7B | GRPO | 64.21 | 54.48 | 29.0 | 16.7 | 39.22 | 32.00 |
| OLMo-2-1124-7B | CPT + GRPO | 61.26 | 54.41 | 29.8 | 17.6 | 36.34 | 29.61 |
| Qwen-2.5-7B | Baseline | 66.69 | 52.26 | 47.68 | 35.76 | 68.15 | 52.19 |
| Qwen-2.5-7B | CPT | 61.29 | 51.09 | 38.99 | 23.75 | 63.95 | 49.19 |
| Qwen-2.5-7B | SFT | 63.25 | 48.88 | 43.23 | 29.96 | 64.45 | 46.75 |
| Qwen-2.5-7B | CPT + SFT | 58.10 | 47.53 | 45.43 | 30.21 | 64.20 | 51.19 |
| Qwen-2.5-7B | GRPO | 73.15 | 56.93 | 49.69 | 38.13 | 68.35 | 52.31 |
| Qwen-2.5-7B | CPT + GRPO | 64.11 | 49.33 | 45.97 | 30.75 | 65.20 | 50.00 |

Table 1: Knowledge performance across PolyFact, KLAR and Global-MMLU. CPT denotes light continual pretraining on parallel text before post-training. High and Low refer to high- and low-resource languages. KLAR includes evaluation on trained languages and 11 out-of-distribution (OOD) languages not present in either CPT or post-training. Language breakdown in Appendix [E.4](https://arxiv.org/html/2606.06586#A5.SS4).
### 5.2 Language Transition
To understand how GRPO improves cross-lingual behaviour, we analyse the layer-wise ranking of the correct answer in both the target language and English (Figures [3(a)](https://arxiv.org/html/2606.06586#S5.F3.sf1) and [3(b)](https://arxiv.org/html/2606.06586#S5.F3.sf2)). These plots show where the model retrieves the correct fact and how it transitions into the target language.
In the base model, a consistent failure mode emerges: the correct answer is retrieved in English (grey dashed line ranks highly in later layers), while the target-language form (blue line) remains poorly ranked. This indicates that the model accesses the correct knowledge but fails to convert it into the required language. In the Spanish example (Figure [3(b)](https://arxiv.org/html/2606.06586#S5.F3.sf2)), the model retrieves *London* but fails to produce *Londres*; similarly, in the Chinese example (Figure [3(a)](https://arxiv.org/html/2606.06586#S5.F3.sf1)), the concept is accessible but not properly transferred. This reflects a breakdown in the final stage of processing, the transition from a shared or English-centric representation to the target-language output, in line with the findings of Wang et al. ([2025a](https://arxiv.org/html/2606.06586#bib.bib1)).
Consistency-driven GRPO fine-tuning mitigates this issue. In both examples, the target-language answer rises in early layers and overtakes the English form in later layers. This suggests that GRPO strengthens the pathway between language-agnostic knowledge representations and target-language decoding, enabling the model to directly produce the correct linguistic form rather than relying on an English intermediate. A more detailed analysis of failure modes during factual retrieval can be found in Appendix [C](https://arxiv.org/html/2606.06586#A3).
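The layer-wise ranking used in this analysis can be illustrated with a small helper that, given one logit vector per layer, reports the rank of a chosen answer token at each depth. How the per-layer logits are obtained is an assumption here (for instance, by projecting intermediate hidden states through the output embedding); this section does not describe that step, and the function names are hypothetical.

```python
def token_rank(logits, token_id):
    """Rank of token_id within a logit vector (1 = highest-scoring token)."""
    target = logits[token_id]
    return 1 + sum(1 for value in logits if value > target)


def rank_trajectory(layer_logits, token_id):
    """Rank of token_id at every layer, from the first layer to the last."""
    return [token_rank(logits, token_id) for logits in layer_logits]


# Toy example with a 3-token vocabulary and 2 layers.
layer_logits = [[0.1, 0.5, 0.2], [0.3, 0.2, 0.9]]
trajectory = rank_trajectory(layer_logits, token_id=2)
```

Comparing the trajectory of the English answer token with that of the target-language token shows at which depth, if any, the target-language form overtakes the English one.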
(a) [ZH] Official language: “French Polynesia”
(b) [ES] Place of birth: “Mark Strong”
Figure 3: Layer-rank analysis of Base (left) and GRPO-finetuned OLMo-2 (right) models, showing the rank of the correct English and target-language tokens. (a) Example in a non-Latin-script language. (b) Example in a Latin-script language.
### 5.3 Cross-lingual Transfer
We evaluate cross-lingual transfer by measuring KLAR performance on 11 other languages held out during post-training, such as Greek, Ukrainian, and Hebrew. As shown in Figure [4(a)](https://arxiv.org/html/2606.06586#S5.F4.sf1), GRPO improves accuracy not only on previously seen languages but also on held-out languages, indicating that its gains are not limited to the training distribution. In contrast, SFT shows weaker improvements and limited transfer, suggesting that it primarily captures language-specific memorization rather than generalizable cross-lingual behaviour. The consistent gains of GRPO on held-out languages indicate that it promotes more language-agnostic representations, enabling the model to access and express knowledge across languages beyond those seen during training.
(a) Accuracy on trained vs. held-out languages.
(b) Per-language accuracy comparison.
Figure 4: OLMo-2-7B KLAR performance on trained and held-out languages across (a) models and (b) languages.
A more fine-grained per-language analysis (Figure [4(b)](https://arxiv.org/html/2606.06586#S5.F4.sf2)) reveals that these gains are not uniformly distributed. Instead, CPT combined with GRPO yields the largest improvements for seen languages with relatively low baseline performance, such as Arabic and Japanese, as well as for unseen languages with scripts similar to those seen during training. This is most evident in Catalan and Vietnamese (Latin script) and in Farsi, which likely benefits from the alignment of Perso-Arabic subwords during training on Arabic. Meanwhile, the most pronounced performance drops occur in unseen “script-isolated” languages such as Greek, Korean, and Hebrew, where the absence of similar tokens during training means language transition failures persist.
Figure 5: Language-specific neurons across languages; GRPO increases English alignment while SFT does not change language specialization considerably.
This pattern suggests that GRPO improves factual access through shared language interfaces, particularly for languages with lexical or orthographic overlap with the training languages, such as Latin-script languages and languages sharing Perso-Arabic subwords. However, these gains remain constrained by script and tokenization boundaries: languages with more isolated scripts, such as Greek, Korean, and Hebrew, benefit less. At the same time, small drops for some high-performing seen languages, such as French, suggest a mild trade-off between peak per-language accuracy and broader cross-lingual regularization. Taken together, these results support the view that GRPO encourages a more balanced allocation of model capacity across languages, reducing performance disparities and improving cross-lingual consistency by prioritizing shared rather than isolated retrieval pathways, which in turn leads to more shared representations.
### 5.4 Language-Specific Neurons
Following the LAPE methodology described in Section [3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px6), we identify and evaluate the distribution of language-specific neurons in three configurations of OLMo-2-7B: the base model and the PolyFact fine-tuned variants using SFT and GRPO.
#### Total neuron count shifts towards English\-specificity\.
An analysis of the language-specific neuron distribution reveals a striking and somewhat counterintuitive effect. While the SFT model remains nearly identical to the Base model, GRPO exhibits a pronounced shift toward English-specific neurons. As shown in Figure [5](https://arxiv.org/html/2606.06586#S5.F5), GRPO increases English-specific neurons by 38.2% (1671 to 2310), at the expense of other languages, with large reductions for Bengali (-36.1%, 424 to 271) and Swahili (-31.2%, 337 to 232), while Chinese is a notable exception (+18.6%, 86 to 102).
Notably, this shift cannot be explained by improved English performance. The SFT model achieves stronger English PolyFact accuracy than GRPO, and GRPO itself performs only slightly better than the baseline (Appendix [E.4](https://arxiv.org/html/2606.06586#A5.SS4), Table [9](https://arxiv.org/html/2606.06586#A5.T9)). This suggests the effect reflects a structural reorganisation rather than performance gains. One interpretation is that the RL objective “squeezes” the model’s behavioural space, encouraging reliance on the most stable representational backbone (here: English), consistent with the work of Matsutani et al. ([2025](https://arxiv.org/html/2606.06586#bib.bib37)).
#### GRPO delays linguistic specialisation in non\-Latin scripts\.
To quantify the layer-wise distribution of language-specific neurons, we analyse the empirical cumulative distribution function (ECDF) across layers (Figure [15](https://arxiv.org/html/2606.06586#A5.F15) in Appendix [E.3](https://arxiv.org/html/2606.06586#A5.SS3)). We observe a clear “late-discovery” effect unique to GRPO: while the base and SFT models show nearly identical distributions, the GRPO curve is systematically lower, indicating that specialised neurons are recruited later in the network. This is reflected in a negligible Kolmogorov-Smirnov distance for SFT ($D_{KS}=0.005$) and a substantial shift for GRPO ($D_{KS}=0.089$).
This suggests that GRPO preserves a larger language\-agnostic space in intermediate layers and defers linguistic specialisation\. Notably, Latin\-script languages and Russian \(which shares a large portion of its alphabet with Latin scripts\) shift specialisation earlier, while languages of non\-Latin script \(e\.g\. Arabic, Japanese\) concentrate it in the final layers, as visualized in Figure[6](https://arxiv.org/html/2606.06586#S5.F6)\. A full breakdown can be found in Appendix[E\.3](https://arxiv.org/html/2606.06586#A5.SS3)\.
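The Kolmogorov\-Smirnov distances reported above are the largest vertical gap between two layer\-wise ECDFs. The following is a minimal sketch of that computation, assuming each model is summarised by a list of per\-layer counts of language\-specific neurons. The function names and the toy counts are hypothetical and are not the paper's data or code.

```python
from itertools import accumulate


def layer_ecdf(counts):
    """ECDF over layers: fraction of neurons located at or before each layer."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]


def ks_distance(counts_a, counts_b):
    """Kolmogorov-Smirnov distance: largest absolute gap between the two ECDFs."""
    ecdf_a, ecdf_b = layer_ecdf(counts_a), layer_ecdf(counts_b)
    return max(abs(a - b) for a, b in zip(ecdf_a, ecdf_b))


# Invented per-layer neuron counts for a 4-layer toy model (illustration only).
base_counts = [10, 20, 30, 40]
late_counts = [5, 15, 30, 50]  # same total, mass moved toward later layers

d_ks = ks_distance(base_counts, late_counts)
print(f"D_KS = {d_ks:.3f}")
```

In this toy case the second ECDF lies at or below the first at every layer, which is the pattern described for the GRPO curve: the same number of neurons, recruited later in the network.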
Figure 6: Concentration of specialised neurons in the final layers \(28–32\)\. GRPO exhibits a bifurcation: Latin scripts shift toward intermediate layers, while distinct scripts consolidate in the final layers, leading to script\-dependent processing\.
### 5\.5 Mechanistic Analysis via LAHIS
Figure 7: Pairwise head overlap across language pairs for Base, SFT\-, and GRPO\-finetuned models\. Panels: \(a\) Base, \(b\) SFT, \(c\) GRPO\.

Figure 8: Percentage of language\-important attention heads in OLMo\-2\-1124\-7B across Base, SFT and GRPO\.

#### Cross\-lingual head sharing\.
Head overlap between language pairs increases substantially after both finetuning methods \(as visualized in Figure [7](https://arxiv.org/html/2606.06586#S5.F7)\)\. SFT produces the strongest gains in pairwise overlap, particularly among Indo\-European languages \(e\.g\. DE–FR: 25% → 90%, DE–ID: 35% → 85%\)\. GRPO also increases overlap but more moderately, with notable gains for typologically distant pairs such as JA–ZH \(50% → 80%\)\. This suggests that SFT primarily encourages shared representations across languages, while GRPO’s key effect lies in redistributing processing deeper into the network rather than maximising head sharing\.
#### Language processing shifts deeper into the network\.
In the base model, up to 50% of language\-important heads for some languages concentrate at layer 0, indicating near\-immediate language\-routing decisions at the first layer\. Both finetuning methods redistribute this: under SFT the peak drops to around 40% with a slight spread across early layers, while GRPO reduces it further to 20% and distributes heads across layers 0–10 \(Figure[8](https://arxiv.org/html/2606.06586#S5.F8)\)\. This suggests GRPO drives a more substantial reorganisation of language processing particularly in the early routing layers\. More detailed analysis of language\-important attention head reshaping can be found in Appendix[E\.2](https://arxiv.org/html/2606.06586#A5.SS2)\. Overall, the mechanistic interpretability results are complementary and reveal a qualitatively distinct effect of GRPO compared to SFT: while SFT primarily increases cross\-lingual head sharing, GRPO fundamentally restructures multilingual computation, shifting it away from early routing and toward a shared, language\-agnostic intermediate representation space with delayed language specialisation\.
## 6 Conclusion
We study cross\-lingual factual recall in predominantly English\-pretrained LLMs and introduce PolyFact, a fully parallel multilingual factual QA dataset spanning 100K Wikidata\-grounded facts across 12 languages\. Across Qwen\-2\.5\-7B and OLMo\-2\-1124\-7B, GRPO consistently outperforms supervised finetuning, improving factual recall, transfer to held\-out languages, and generalisation beyond the multiple\-choice setting, while light CPT on parallel data provides limited gains\. Mechanistically, our analyses suggest that these improvements are not merely surface\-level: GRPO improves the transition from shared factual representations into target\-language outputs, restructures language routing, and reduces language specialisation in favour of more shared cross\-lingual computation\. Overall, our results suggest that cross\-lingual factual inconsistency is less a problem of missing knowledge than of unreliable access, and that consistency\-driven reinforcement learning provides an effective way to improve cross\-lingual factual recall without large\-scale continual pretraining\.
## 7 Limitations
Our work has several limitations\. First, our experiments are conducted on two 7B model families, OLMo\-2\-7B and Qwen\-2\.5\-7B\. While these models cover different training setups and multilingual capabilities, it is not certain whether our findings generalise to larger models, smaller models, or substantially different architectures such as mixture\-of\-experts models\.
Second, training on PolyFact does not fully generalise to more challenging benchmarks such as Global\-MMLU, indicating limitations on tasks that require deeper reasoning beyond factual retrieval\. Our method is therefore best understood as improving cross\-lingual factual access rather than solving multilingual reasoning more broadly\.
Third, our evaluation is limited to a relatively small set of benchmarks and languages, which may not fully capture all aspects of real\-world multilingual performance\. Broader evaluation across more tasks, domains, and languages would strengthen our conclusions\.
Finally, our work introduces potential risks\. Since PolyFact is derived from Wikidata, it may inherit coverage biases, factual errors, or uneven representation across languages, regions, and entities\. Moreover, improving cross\-lingual factual recall may also make incorrect or biased factual associations more consistently expressed across languages if such information is present in the underlying model or dataset\. We mitigate these risks through relation filtering, automatic and manual quality checks, and by releasing the dataset and code to support transparency, auditing, and future corrections\.
#### LLMs Usage\.
Throughout the paper, we use LLMs to assist with grammar checking and minor rephrasing for clarity\. LLMs did not contribute to the conceptual design of the study, experimental implementation, or core writing of the paper\.
## References
- L\. Bandarkar, A\. Ansell, and T\. Cohn \(2026\)Large reasoning models struggle to transfer parametric knowledge across scripts\.arXiv preprint arXiv:2603\.17070\.External Links:[Link](https://arxiv.org/abs/2603.17070)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Fu, L\. Liu, D\. Cai, G\. Huang, S\. Shi, and R\. Yan \(2024\)The reasonableness behind unreasonable translation capability of large language model\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=3KDbIWT26J)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Fujii, T\. Nakamura, G\. I\. Winata,et al\.\(2024\)Continual pre\-training for cross\-lingual LLM adaptation: enhancing japanese language capabilities\.arXiv preprint arXiv:2404\.17790\.External Links:[Link](https://huggingface.co/papers/2404.17790)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[Appendix A](https://arxiv.org/html/2606.06586#A1.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- Z\. Gekhman, E\. Ben\-David, H\. Orgad, E\. Ofek, Y\. Belinkov, I\. Szpektor, J\. Herzig, and R\. Reichart \(2025\)Inside\-out: hidden factual knowledge in LLMs\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=f7GG1MbsSM)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p2.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, and S\. Ma \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Habib, C\. Fourrier, H\. Kydlíček, T\. Wolf, and L\. Tunstall \(2023\)LightEval: a lightweight framework for llm evaluation\.External Links:[Link](https://github.com/huggingface/lighteval)Cited by:[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2106.09685)Cited by:[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p1.2)\.
- H\. Hu, Y\. Wang, M\. Huan, J\. Vamvas, Y\. Huang, Z\. Guo, and R\. Sennrich \(2026\)DeReason: a difficulty\-aware curriculum improves decoupled SFT\-then\-RL training for general reasoning\.arXiv preprint arXiv:2603\.11193\.External Links:[Link](https://arxiv.org/abs/2603.11193)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Kaffee, A\. Piscopo, P\. Vougiouklis, E\. Simperl, L\. Carr, and L\. Pintscher \(2017\)A glimpse into babel: an analysis of multilinguality in Wikidata\.InProceedings of the 13th International Symposium on Open Collaboration \(OpenSym ’17\),pp\. 14:1–14:5\.External Links:[Document](https://dx.doi.org/10.1145/3125433.3125465)Cited by:[Appendix C](https://arxiv.org/html/2606.06586#A3.p3.1)\.
- A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard,et al\.\(2025\)Gemma 3 technical report\.arXiv preprint arXiv:2503\.19786\.External Links:[Link](https://arxiv.org/abs/2503.19786)Cited by:[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px1.p1.1)\.
- H\. Kuulmets, T\. Purason, A\. Luhtaru, and M\. Fishel \(2024\)Teaching Llama a new language through cross\-lingual knowledge transfer\.InFindings of NAACL 2024,External Links:[Link](https://aclanthology.org/2024.findings-naacl.210/)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[Appendix A](https://arxiv.org/html/2606.06586#A1.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- H\. Lin, Y\. Zhao, W\. Han, P\. Guo, B\. Liu, Y\. Zhang, B\. Zhang, T\. Wang, and Y\. Zheng \(2025\)From translation to multilinguality: revisit the role of parallel data in multilingual LLM pretraining\.External Links:[Link](https://openreview.net/forum?id=1gbJ8euERb)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Liu, Q\. Song, Q\. Zhou, H\. Du, S\. Xu, W\. Jiang, W\. Zhang, and X\. Jia \(2025a\)Focusing on language: revealing and exploiting language attention heads in multilingual large language models\.External Links:2511\.07498,[Link](https://arxiv.org/abs/2511.07498)Cited by:[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px5.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- Y\. Liu, M\. Wang, F\. Yvon, and H\. Schütze \(2025b\)On the entity\-level alignment in crosslingual consistency\.arXiv preprint arXiv:2510\.10280\.External Links:[Link](https://arxiv.org/abs/2510.10280)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p2.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Lu, R\. Zhang, C\. Eickhoff, and E\. Pavlick \(2025\)Paths not taken: understanding and mending the multilingual factual recall pipeline\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 15066–15096\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.762/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.762),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p2.1)\.
- K\. Matsutani, S\. Takashiro, G\. Minegishi, T\. Kojima, Y\. Iwasawa, and Y\. Matsuo \(2025\)RL squeezes, SFT expands: a comparative study of reasoning LLMs\.arXiv preprint arXiv:2509\.21128\.External Links:[Link](https://arxiv.org/abs/2509.21128)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px3.p1.1),[§5\.4](https://arxiv.org/html/2606.06586#S5.SS4.SSS0.Px1.p2.1)\.
- T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, N\. Lambert, D\. Schwenk, O\. Tafjord, T\. Anderson, D\. Atkinson, F\. Brahman, C\. Clark, P\. Dasigi, N\. Dziri, M\. Guerquin, H\. Ivison, P\. W\. Koh, J\. Liu, S\. Malik, W\. Merrill, L\. J\. V\. Miranda, J\. Morrison, T\. Murray, C\. Nam, V\. Pyatkin, A\. Rangapur, M\. Schmitz, S\. Skjonsberg, D\. Wadden, C\. Wilhelm, M\. Wilson, L\. Zettlemoyer, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2024\)2 olmo 2 furious\.External Links:2501\.00656,[Link](https://arxiv.org/abs/2501.00656)Cited by:[Appendix B](https://arxiv.org/html/2606.06586#A2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px1.p1.1)\.
- R\. Qi, F\. Mo, Y\. Chen, X\. Zhang, S\. Wang, H\. Li, J\. Xu, M\. Jiang, J\. Nie, and K\. Huang \(2026\)Language\-coupled reinforcement learning for multilingual retrieval\-augmented generation\.arXiv preprint arXiv:2601\.14896\.External Links:[Link](https://arxiv.org/abs/2601.14896)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px3.p1.1)\.
- M\. R\. Qorib, J\. Li, and H\. T\. Ng \(2025\)Just go parallel: improving the multilingual capabilities of large language models\.InProceedings of ACL 2025,External Links:[Link](https://aclanthology.org/2025.acl-long.1602/)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[Appendix B](https://arxiv.org/html/2606.06586#A2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px1.p1.1)\.
- L\. Schut, Y\. Gal, and S\. Farquhar \(2025\)Do multilingual LLMs think in english?\.InICLR 2025 Workshop on Building Trust in Language Models and Applications,External Links:[Link](https://openreview.net/forum?id=I8BOtOPcOv)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p1.1),[§1](https://arxiv.org/html/2606.06586#S1.p2.1)\.
- J\. Shao, R\. Tang, C\. Zhang, K\. Sevegnani, P\. Stenetorp, J\. Yang, and Y\. Lu \(2026\)The role of mixed\-language documents for multilingual large language model pretraining\.External Links:2601\.00364,[Link](https://arxiv.org/abs/2601.00364)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px3.p1.3)\.
- Y\. Shen, W\. Lai, S\. Wang, G\. Gao, K\. Luo, A\. Fraser, and M\. Sun \(2025\)From unaligned to aligned: scaling multilingual llms with multi\-way parallel corpora\.InProceedings of EMNLP 2025,External Links:[Link](https://aclanthology.org/2025.emnlp-main.374/)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px5.p1.1),[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px6.p2.1)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, S\. Ruder, W\. Ko, A\. Bosselut, A\. Oh, A\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, B\. Ermis, and S\. Hooker \(2025\)Global MMLU: understanding and addressing cultural and linguistic biases in multilingual evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 18761–18799\.External Links:[Link](https://aclanthology.org/2025.acl-long.919/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.919),ISBN 979\-8\-89176\-251\-0Cited by:[Appendix A](https://arxiv.org/html/2606.06586#A1.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- T\. Tang, W\. Luo, H\. Huang, D\. Zhang, X\. Wang, X\. Zhao, F\. Wei, and J\. Wen \(2024\)Language\-specific neurons: the key to multilingual capabilities in large language models\.External Links:2402\.16438,[Link](https://arxiv.org/abs/2402.16438)Cited by:[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px6.p1.1),[§3](https://arxiv.org/html/2606.06586#S3.SS0.SSS0.Px6.p2.2),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1)\.
- M\. Wang, H\. Adel, L\. Lange, Y\. Liu, E\. Nie, J\. Strötgen, and H\. Schütze \(2025a\)Lost in multilinguality: dissecting cross\-lingual factual inconsistency in transformer language models\.InProceedings of ACL 2025,External Links:[Link](https://aclanthology.org/2025.acl-long.253/)Cited by:[Appendix A](https://arxiv.org/html/2606.06586#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.06586#S1.p1.1),[§1](https://arxiv.org/html/2606.06586#S1.p2.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.06586#S4.SS0.SSS0.Px3.p3.1),[§5\.1](https://arxiv.org/html/2606.06586#S5.SS1.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2606.06586#S5.SS2.p2.1)\.
- Z\. Wang, J\. Li, H\. Zhou, R\. Weng, J\. Wang, X\. Huang, X\. Han, J\. Feng, C\. Deng, and S\. Huang \(2025b\)Investigating and scaling up code\-switching for multilingual language model pre\-training\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 11032–11046\.External Links:[Link](https://aclanthology.org/2025.findings-acl.575/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.575),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1)\.
- C\. Wendler, V\. Veselovsky, G\. Monea, and R\. West \(2024\)Do llamas work in English? on the latent language of multilingual transformers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15366–15394\.External Links:[Link](https://aclanthology.org/2024.acl-long.820/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p2.1)\.
- D\. Wu, S\. Tan, Y\. Meng, D\. Stap, and C\. Monz \(2024\)How far can 100 samples go? unlocking zero\-shot translation with tiny multi\-parallel data\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15092–15108\.External Links:[Link](https://aclanthology.org/2024.findings-acl.896/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.896)Cited by:[§1](https://arxiv.org/html/2606.06586#S1.p3.1),[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Ye, S\. Shrivastava, Z\. Li, J\. Dineen, S\. Lu, A\. Ahuja, M\. Shen, Z\. Xu, and B\. Zhou \(2025\)CC\-LEARN: cohort\-based consistency learning\.arXiv preprint arXiv:2506\.15662\.External Links:[Link](https://arxiv.org/abs/2506.15662)Cited by:[§2](https://arxiv.org/html/2606.06586#S2.SS0.SSS0.Px3.p1.1)\.
## Appendix A Benchmark details
We evaluate across three benchmarks targeting complementary aspects of multilingual factual recall: factual recall in the training\-task format \(PolyFact\), free\-form factual recall \(KLAR\-CLC\) and broader multilingual knowledge and reasoning \(Global\-MMLU\)\. All inference uses bf16 precision\.
#### PolyFact\.
We evaluate on the 2,500\-fact test split of PolyFact using a custom evaluator \(evaluate/evaluate\_consistency\.py\) that scores each of the four MCQ options by its length\-normalised conditional log\-likelihood under the model, conditioning on a language\-specific prompt that wraps the question and instructs the model to answer in the target language\. The option with the highest per\-token average log\-probability is selected\. We report per\-language accuracy across all twelve target languages \(en, de, es, fr, pt, id, ru, ar, zh, ja, sw, bn\)\. Because facts are fully parallel across languages, per\-language differences isolate language\-interface effects from underlying knowledge\.
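The selection rule described above can be sketched as follows. This is a minimal illustration, not the released evaluator: it assumes the per\-token log\-probabilities of each option have already been obtained from the model, and the function name, input format, and toy values are hypothetical.

```python
def pick_option(option_token_logprobs):
    """Return (index, scores) for the option with the highest mean token log-prob.

    option_token_logprobs: one list per MCQ option holding the log-probability
    of each option token given the prompt (hypothetical input format).
    """
    scores = [sum(lps) / len(lps) for lps in option_token_logprobs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Invented log-probabilities for four options of different lengths.
options = [
    [-1.0, -1.0],                     # sum -2.0, mean -1.0
    [-0.5, -0.5, -0.5, -0.5, -0.5],   # sum -2.5, mean -0.5
    [-3.0],                           # sum -3.0, mean -3.0
    [-2.0, -2.0],                     # sum -4.0, mean -2.0
]

best, scores = pick_option(options)
# Without length normalisation the shortest plausible option would win instead.
best_unnormalised = max(range(len(options)), key=lambda i: sum(options[i]))
```

The toy values are chosen so that the two rules disagree: the five\-token option has the best per\-token average but a lower summed log\-likelihood than the two\-token option. Normalising by length removes this bias against longer answers, which matters when the same fact is tokenised into very different numbers of tokens across languages.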
#### KLAR\-CLC\.
For free\-form cross\-lingual factual recall we evaluate on KLAR\-CLC Wang et al\. \([2025a](https://arxiv.org/html/2606.06586#bib.bib1)\) using a custom evaluator \(evaluate/evaluate\_klar\.py\)\. The evaluator runs 3\-shot prompted greedy generation \(max 10 new tokens\) and compares the generated answer to the gold using NFC\-normalised, case\-insensitive, punctuation\-stripped string matching, with the punctuation regex covering ASCII, CJK, and typographic quotes/guillemets to handle all 17 KLAR languages\. KLAR is structured into 20 factual relations \(e\.g\. capital, place of birth, manufacturer, occupation\)\. Unlike PolyFact, KLAR removes the candidate set, testing whether the model can *generate* the correct answer in the target language rather than just discriminate among given options\. We evaluate on six seen languages \(en, es, fr, ru, zh, ja\) and eleven held\-out languages \(ar, ca, el, fa, he, hu, ko, nl, tr, uk, vi\) to measure cross\-lingual transfer\.
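The string matching described above can be approximated with the standard library. The sketch below is an assumption about how such a matcher might look, not the code in evaluate/evaluate\_klar\.py: the exact punctuation set, the use of `casefold`, and the function names are all hypothetical.

```python
import re
import string
import unicodedata

# Assumed punctuation set: ASCII punctuation, common CJK punctuation,
# and typographic quotes / guillemets. The real evaluator may differ.
_PUNCT_CHARS = string.punctuation + "。、，．！？：；「」『』（）《》〈〉" + "“”‘’«»"
_PUNCT_RE = re.compile("[" + re.escape(_PUNCT_CHARS) + "]")


def normalise(text):
    """NFC-normalise, strip punctuation, and fold case."""
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT_RE.sub("", text)
    return text.casefold().strip()


def is_match(prediction, gold):
    """Exact match after normalisation of both strings."""
    return normalise(prediction) == normalise(gold)


# Decomposed "e" + combining acute accent versus the precomposed character.
decomposed = "Cafe\u0301"
precomposed = "Caf\u00e9"
same_after_nfc = is_match(decomposed, precomposed)
```

NFC normalisation matters here because a model can emit an accented character either as one code point or as a base letter plus a combining mark; the two render identically but compare unequal as raw strings.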
#### Global\-MMLU\.
For broader multilingual knowledge and reasoning we evaluate on Global\-MMLU Singh et al\. \([2025](https://arxiv.org/html/2606.06586#bib.bib15)\) using lm\-evaluation\-harness Gao et al\. \([2024](https://arxiv.org/html/2606.06586#bib.bib20)\) with vLLM Kwon et al\. \([2023](https://arxiv.org/html/2606.06586#bib.bib19)\) as the backend \(task identifiers global\_mmlu\_full\_\{lang\}, batch\_size=auto, gpu\_memory\_utilization=0\.85\)\. Scoring is standard log\-likelihood over the four MCQ options\. We report per\-language accuracy across the twelve target languages\. Global\-MMLU is substantially harder than direct factual\-recall benchmarks because it additionally requires multi\-step reasoning and domain knowledge, making it a stricter test of generalisation beyond training\-distribution facts\.
#### Inference backends\.
lm\-evaluation\-harness and LightEval use vLLM for throughput; the PolyFact and KLAR custom evaluators use HuggingFace Transformers directly because they need access to per\-token log\-probabilities \(PolyFact\) and explicit batched greedy decoding with custom string matching \(KLAR\), which are awkward to express in the harnesses\.
## Appendix B Hyperparameters
#### Continual Pretraining\.
We continually pretrain OLMo\-2\-1124\-7B OLMo et al\. \([2024](https://arxiv.org/html/2606.06586#bib.bib31)\) and Qwen\-2\.5\-7B Qwen et al\. \([2025](https://arxiv.org/html/2606.06586#bib.bib41)\) on the balanced TED2025 dataset using the hyperparameters in Table [2](https://arxiv.org/html/2606.06586#A2.T2)\. A detailed per\-language breakdown of the TED2025 subset used for CPT is listed in Table [3](https://arxiv.org/html/2606.06586#A2.T3)\.
| Setting | Value |
|---|---|
| **Data** | |
| Max sequence length | 1024 tokens |
| Chunk packing target | 512 tokens \(talk\-internal\) |
| Min languages per row | 2 |
| Min languages per talk | 2 |
| Validation split | 0\.5% \(language\-balanced\) |
| **Optimization** | |
| Optimizer | AdamW \(fused\) |
| Learning rate | $2\times 10^{-5}$ |
| LR schedule | Cosine |
| Warmup steps | 200 |
| Weight decay | 0\.01 |
| Precision | bf16 |
| Epochs | 1 |
| Per\-device batch size | 6 |
| Gradient accumulation steps | 8 |

Table 2: Continual pretraining hyperparameters\.

| Language | \# lang\-lines | Tokens \(M\) |
|---|---|---|
| English \(en\) | 1,778,887 | 67\.2 |
| Spanish \(es\) | 656,921 | 19\.5 |
| Arabic \(ar\) | 615,808 | 15\.3 |
| French \(fr\) | 534,647 | 16\.6 |
| Chinese \(zh\) | 515,553 | 13\.7 |
| Bengali \(bn\) | 500,581 | 35\.3 |
| Russian \(ru\) | 469,035 | 16\.0 |
| Japanese \(ja\) | 356,325 | 4\.9 |
| German \(de\) | 297,306 | 9\.8 |
| Indonesian \(id\) | 219,040 | 7\.3 |
| Portuguese \(pt\) | 210,441 | 6\.7 |
| Swahili \(sw\) | 174,125 | 4\.0 |
| Total | 6,328,669 | 216\.4 |

Table 3: Per\-language coverage in the TED 2025 CPT corpus tokenized with the Qwen\-2\.5 tokenizer\. “\# lang\-lines” counts how often each language appears as a section within a packed chunk; a single TED chunk typically contains 5–12 language sections\. Token counts exclude the \{lang\}: format wrapper\. The remaining $\sim$19M tokens in the 235\.5M total are end\-of\-sequence and other framing tokens\.
#### GRPO Post\-Training\.
Starting from the OLMo\-2\-1124\-7B base model and the CPT checkpoint, we run multilingual GRPO on PolyFact with the configuration in Table [4](https://arxiv.org/html/2606.06586#A2.T4)\.

| Parameter | Value |
|---|---|
| Dataset | PolyFact |
| Languages \($L$\) | 12 |
| Group size \($G$\) | 8 |
| Max train facts | 40,000 |
| Epochs | 1 |
| Batch size | 1 fact |
| Learning rate | $1\times 10^{-5}$ |
| LR schedule | cosine, 3% warmup |
| Weight decay | 0\.01 |
| Optimizer | AdamW |
| Grad clipping | 1\.0 |
| Max prompt / completion | 512 / 48 |
| Temperature / top\-$p$ | 0\.7 / 0\.95 |
| Repetition penalty | 1\.5 |
| No\-repeat $n$\-gram | 3 |
| Correct reward | $+1$ |
| Hallucination penalty | $-0.5$ |
| All\-correct bonus | $+1$ |
| KL coefficient \($\beta$\) | 0\.0 |
| LoRA $r$ / $\alpha$ / dropout | 64 / 128 / 0\.05 |
| Precision | bf16 |
| Grad checkpointing | enabled |
| Eval frequency | every 500 steps |
| Seed | 42 |

Table 4: GRPO post\-training hyperparameters\.
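To make the reward values in Table 4 concrete, the sketch below combines them with the group\-relative advantage used by GRPO (rewards standardised within a sampling group, following Shao et al\., 2024). It rests on two assumptions that this appendix does not confirm: that every incorrect completion receives the hallucination penalty, and that the all\-correct bonus is added to each completion when the whole group is correct. The function names and the example outcome are hypothetical.

```python
from statistics import mean, pstdev

CORRECT_REWARD = 1.0          # Table 4: correct reward
HALLUCINATION_PENALTY = -0.5  # Table 4: hallucination penalty
ALL_CORRECT_BONUS = 1.0       # Table 4: all-correct bonus


def completion_rewards(is_correct):
    """Assign one scalar reward per completion in a group.

    Assumptions (not confirmed by the source): any incorrect completion gets
    the hallucination penalty, and the bonus applies to every completion
    when all completions in the group are correct.
    """
    rewards = [CORRECT_REWARD if ok else HALLUCINATION_PENALTY for ok in is_correct]
    if all(is_correct):
        rewards = [r + ALL_CORRECT_BONUS for r in rewards]
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward standardised within its sampling group.

    Uses the population standard deviation; implementations differ on this.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Group of G = 8 sampled completions, two of them correct (invented outcome).
flags = [True, False, False, True, False, False, False, False]
rewards = completion_rewards(flags)
advantages = group_relative_advantages(rewards)
```

Because advantages are centred within the group, a group in which every completion earns the same reward produces zero advantage for all of them, so only facts on which the sampled completions disagree contribute a learning signal in this formulation.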
#### Supervised Finetuning\.
Starting from the OLMo\-2\-1124\-7B base model and the CPT checkpoint, we run multilingual SFT on PolyFact with the configuration in Table [5](https://arxiv.org/html/2606.06586#A2.T5)\.

| Parameter | Value |
|---|---|
| Dataset | PolyFact |
| Languages per fact \($L$\) | 12 |
| Epochs | 1 |
| Facts per device batch | 8 |
| Gradient accumulation | 8 |
| Effective batch \(facts\) | 64 |
| Learning rate | $2\times 10^{-5}$ |
| LR schedule | cosine, 5% warmup |
| Weight decay | 0\.01 |
| Optimizer | AdamW \(fused\) |
| Consistency weight \($\lambda$\) | 0\.5 |
| Consistency loss | KL to stop\-grad group mean |
| Max eval samples/lang | 200 |
| LoRA $r$ / $\alpha$ / dropout | 64 / 128 / 0\.05 |
| Precision | bf16 |
| Grad checkpointing | enabled |
| Eval / save frequency | every 200 steps |

Table 5: SFT post\-training hyperparameters\.
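The "KL to stop\-grad group mean" entry in Table 5 can be read as a penalty that pulls each language's answer distribution toward the average distribution over all languages for the same fact. The sketch below computes only the forward value of such a penalty in plain Python. The KL direction, the averaging over languages, and all names are assumptions; in an autodiff framework the group mean would additionally be detached so that no gradient flows through it.

```python
import math

LAMBDA = 0.5  # consistency weight from Table 5


def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def consistency_penalty(per_language_probs):
    """Weighted mean KL from the group-mean distribution to each language.

    The group mean acts as a fixed target (stop-gradient in a real training
    loop). The direction KL(mean || language) is an assumption.
    """
    n = len(per_language_probs)
    group_mean = [sum(col) / n for col in zip(*per_language_probs)]
    return LAMBDA * sum(kl(group_mean, p) for p in per_language_probs) / n


# Invented distributions over four MCQ options for three languages.
agree = [[0.7, 0.1, 0.1, 0.1]] * 3
disagree = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]

penalty_agree = consistency_penalty(agree)
penalty_disagree = consistency_penalty(disagree)
```

When all languages put their mass on the same option the penalty is essentially zero, and it grows as languages disagree, which is the behaviour a cross\-lingual consistency term is meant to have.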
## Appendix C Failure Modes during Cross\-Lingual Retrieval
A layer\-rank analysis of the base and GRPO\-finetuned models clarifies where the performance gains come from\. The main failure mode of the base model across all languages is producing incoherent answers, including randomly guessed numbers, the word “what”, and hybrid combinations of different languages\. GRPO is effective in reducing this failure mode, which accounts for most of the gain in performance\. Beyond this, the next most common failure modes that we improve upon differ largely between languages that do and do not share a Latin script with English\. We include Russian in the Latin\-script group here: although it is written in Cyrillic, it shares enough tokenisation overlap with Latin\-script languages that it exhibits the same failure mode\. In essence, the base model suffers from some languages’ representations being too close to those of English, such that their interlingual pathways are not distinct enough, and from others being too separated, such that the routes between them and English are too weak\. GRPO fine\-tuning helps to reduce both of these contrasting issues\.
Languages that share the script, and likely have similar representations within the model, tend to answer prompts with the right answer but in English rather than the target language \(Figure [9](https://arxiv.org/html/2606.06586#A3.F9)\)\. Either the base model cannot find the direct translation of the answer, or its prior training has conditioned it to rely too heavily on English\-tied pathways within the Latin\-script regions of its representations\. The former seems unlikely, particularly given the expected prominence of the answer in the specific example shown\. We therefore interpret the improvement as GRPO fine\-tuning encouraging the model to separate pathways for Latin\-script languages and respond in the correct language\. Unfortunately, this does not apply to all examples: GRPO produces nearly as many regressions of this type as it does improvements\. Although this looks discouraging on the surface, a closer analysis of where these failures occur reveals much about the base model and our fine\-tuning process\.
Figure 9:Layer\-rank analysis of Base and GRPO finetuned OLMO\-2 models for a prompt in latin script language\.The KLAR dataset is structured intorelationsthat group questions from the same fields\. We break these relations down into two groups, those with and without proper\-noun answers \(jurisdiction, manufacturer, owned\-by etc\)\. Given the global prevalence of English\-language proper nouns it is understandable that many of these proper\-nouns originate in the English language and may be known as such, even if they have native translations\. In some instances this will naturally aid the answering of these questions: target\-language answers are often similar or identical to the English forms the model can retrieve more easily\. Proper\-noun relations account for 61%, 76%, and 73% of base\-to\-GRPO regressions in Russian, French, and Spanish respectively\. While the prevalence of English proper\-nouns in the base model’s pretraining corpus likely helps it score well on the low\-hanging lexical cognates, it also pushes many answers into the wrong language failure state — a tendency our finetuning may have inadvertently reinforced\. Wikipedia is well known for exhibiting this bias, where entity labels for people, companies, and many places default to English or Latin\-script forms even on non\-English Wikipedias, because Wikidata’s canonical label is often English and localisers haven’t filled in every language\(Kaffeeet al\.,[2017](https://arxiv.org/html/2606.06586#bib.bib43)\)\. Finetuning onPolyFacttherefore strengthens this bias, worsening the wrong\-language failure mode on proper\-noun prompts\.
This points us to the notion that GRPO fine\-tuning is actually stronger than our initial metrics suggest\. The gains GRPO actually delivers on language abstraction are partly masked in the aggregate by regressions concentrated in proper\-noun prompts where dataset bias works against us\. Figure[10](https://arxiv.org/html/2606.06586#A3.F10)makes this visible: improvements on common\-noun \(non\-proper\-noun\) relations are substantially larger across the board, including in non\-Latin\-script languages\.
Figure 10:KLAR improvement by relation type\. Proper\-noun relations account for 55% of the KLAR dataset\.We highlight the regression of Russian performance after our fine\-tuning, as it is a slight anomaly though we cannot fully explain it\. Some gains come from cleaning up incoherent answers, but the overall result is dominated by regressions in the same failure mode as the Latin based languages\. This complicates our script\-based framing but the aforementioned the English\-leak failure mode reaches it too\. What we cannot fully explain is its magnitude: Russian is the only language to net\-regress on KLAR, and does so more severely than fully Latin\-script French, despite sharing the underlying mechanism\.
In languages that share neither the script nor this lexical overlap, the failure mode for questions unrelated to proper nouns is qualitatively different\. These languages perform much worse in general evaluation, in part because of weaker factual retrieval: responses are more often in the correct language but factually wrong \(Figure [3\(a\)](https://arxiv.org/html/2606.06586#S5.F3.sf1)\)\. The model’s representations for these languages are clearly distinct from those of English and its language\-agnostic regions\. Because of this separation, the base model answers in the correct language, but the cross\-lingual transfer of factual content from the English/agnostic knowledge region into the language\-specific output pathway fails\. Our GRPO fine\-tuning strengthens the pathways between these distinct languages and more often allows the model to answer in the correct language with the correct information\.
Our GRPO fine\-tuning uses a reward function that assigns credit only when the answer is both correct and produced in the target language\. This is what enables it to correct both failure modes, pushing the model towards answering correctly in the right language from either direction\. However, there is no notion of partial credit, and given the distinct failure modes \(correct language with wrong information versus wrong language with correct information\), future iterations may benefit from partial\-credit schemes tailored to each language type\. Similarly, an altered training dataset could yield further gains\. The reward favours correctness, and the canonical correct answer is often English, so the reward effectively trains in the English\-leak failure on proper\-noun prompts\. Correcting the aforementioned biases of the PolyFact dataset to favour native proper nouns more consistently would likely reveal the true gains of this method, unhampered by the regressions discussed above\.
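The all-or-nothing credit assignment described above can be illustrated with a minimal sketch. This is not the authors' reward implementation: the function names, the boolean inputs (which stand in for an answer checker and a language identifier), and the weights of the partial-credit variant are all assumptions made for illustration.

```python
def binary_reward(is_correct: bool, in_target_language: bool) -> float:
    """All-or-nothing credit: 1.0 only when both conditions hold."""
    return 1.0 if (is_correct and in_target_language) else 0.0


def partial_credit_reward(is_correct: bool, in_target_language: bool,
                          w_fact: float = 0.5, w_lang: float = 0.5) -> float:
    """Hypothetical variant: each satisfied condition earns its own weight."""
    return w_fact * float(is_correct) + w_lang * float(in_target_language)


# The two failure modes discussed above, plus full success and full failure.
cases = {
    "correct fact, target language": (True, True),
    "correct fact, wrong language (English leak)": (True, False),
    "wrong fact, target language": (False, True),
    "wrong fact, wrong language": (False, False),
}
for name, (fact_ok, lang_ok) in cases.items():
    print(f"{name}: binary={binary_reward(fact_ok, lang_ok)}, "
          f"partial={partial_credit_reward(fact_ok, lang_ok)}")
```

Under the binary scheme both failure modes receive the same zero reward, whereas the hypothetical partial-credit variant distinguishes them from a fully wrong answer.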
## Appendix D PolyFact Dataset\.
### D\.1 Detailed Construction\.
We describe the full pipeline used to construct PolyFact, a parallel multilingual multiple\-choice QA dataset grounded in Wikidata\.
Source data and relation selection\. We start from the full Wikidata truthy\-triples dump and retain only triples whose property belongs to a curated set of 22 factual relations, selected to cover stable, unambiguous facts suitable for MCQ generation\. These span geography \(capital, country, continent, official language, currency, shares border with\), biography \(country of citizenship, place of birth, place of death, educated at, employer\), creative works and media \(author, director, creator, developer, genre, country of origin, language of work, platform\), and organizational or cultural relations \(manufacturer, architect, discoverer or inventor\)\.
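As a rough illustration of the relation filter, the sketch below keeps only entity-to-entity lines of an N-Triples dump whose property is in an allow-list. It is not the authors' code: the three property IDs are an illustrative subset of the 22 relations, and the names `KEPT_PROPERTIES`, `TRIPLE_RE`, and `filter_triples` are hypothetical.

```python
import re

# Illustrative subset of the curated relations, keyed by Wikidata property ID.
KEPT_PROPERTIES = {"P36": "capital", "P19": "place of birth", "P50": "author"}

# Truthy statements use the prop/direct/ namespace; we only match triples
# whose object is another entity (literals such as dates are skipped).
TRIPLE_RE = re.compile(
    r"<http://www\.wikidata\.org/entity/(Q\d+)>\s+"
    r"<http://www\.wikidata\.org/prop/direct/(P\d+)>\s+"
    r"<http://www\.wikidata\.org/entity/(Q\d+)>\s*\."
)


def filter_triples(lines):
    """Yield (subject, property, object) IDs for triples in the allow-list."""
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match and match.group(2) in KEPT_PROPERTIES:
            yield match.group(1), match.group(2), match.group(3)


sample = [
    "<http://www.wikidata.org/entity/Q183> <http://www.wikidata.org/prop/direct/P36> <http://www.wikidata.org/entity/Q64> .",
    "<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P19> <http://www.wikidata.org/entity/Q350> .",
    '<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P569> "1952-03-11T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .',
    "<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P21> <http://www.wikidata.org/entity/Q6581097> .",
]
kept = list(filter_triples(sample))
print(kept)
```

Because `filter_triples` is a generator, the same function could be applied to a line-by-line stream of the compressed dump without loading it into memory.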
Multilingual label extraction\. For every subject, property, and object entity, we extract labels in twelve typologically and geographically diverse languages: English, German, Indonesian, Portuguese, Arabic, Bengali, Swahili, Spanish, Russian, French, Japanese, and Chinese\. Labels are collected primarily by streaming the compressed Wikidata JSON dump, with missing entries backfilled through the Wikidata wbgetentities API in batches of 50 IDs\. We additionally extract each entity’s instance of \(P31\) types from the dump for distractor selection\.
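The backfill step can be sketched as follows. The snippet only builds the batches and the query parameters for `wbgetentities` and performs no network access; the helper names and the exact language codes (in particular `zh` for Chinese) are assumptions rather than details taken from the paper.

```python
from typing import Iterator

# Assumed Wikidata language codes for the twelve target languages.
LANGS = ["en", "de", "id", "pt", "ar", "bn", "sw", "es", "ru", "fr", "ja", "zh"]
API_URL = "https://www.wikidata.org/w/api.php"


def batched(ids: list[str], size: int = 50) -> Iterator[list[str]]:
    """Split entity IDs into consecutive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def label_request_params(batch: list[str]) -> dict[str, str]:
    """Query parameters asking wbgetentities for labels only."""
    return {
        "action": "wbgetentities",
        "ids": "|".join(batch),
        "props": "labels",
        "languages": "|".join(LANGS),
        "format": "json",
    }


missing_ids = [f"Q{i}" for i in range(1, 121)]  # 120 hypothetical entity IDs
batches = list(batched(missing_ids))
params = label_request_params(batches[0])
print([len(b) for b in batches], params["ids"][:11])
```

Each parameter dictionary would then be sent as a GET request to `API_URL` with an HTTP client of choice, ideally with rate limiting and retries.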
Triple filtering and distractor construction\. We retain only triples whose subject and object have labels in a sufficient subset of the twelve target languages\. For each surviving triple, we sample three distractors from objects that appear with the same property elsewhere in the corpus, so distractors are plausible candidates for the relation \(e\.g\., distractors for a *place of birth* question are drawn from other Wikidata\-recorded birthplaces\)\. To prevent surface\-level shortcuts, distractors must additionally share at least one entity type with the gold object via Wikidata’s instance of property \(P31\), e\.g\. both being cities, and match its English label length within a small tolerance\. We discard any fact whose four options are not all distinct under case\- and whitespace\-normalised comparison\.
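The distractor constraints above can be sketched as below. The tolerance of 3 characters, the `Entity` container, and the toy pool are assumptions for illustration (the paper only says "a small tolerance"), and the entity and type IDs are used purely as example values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    qid: str
    en_label: str
    types: frozenset  # values of "instance of" (P31)


def normalise(text: str) -> str:
    """Case- and whitespace-normalised form used for the distinctness check."""
    return " ".join(text.casefold().split())


def sample_distractors(gold, pool, rng, k=3, length_tolerance=3):
    """Return k distractors for `gold`, or None if the fact should be discarded."""
    candidates = [
        e for e in pool
        if e.qid != gold.qid
        and e.types & gold.types  # share at least one entity type
        and abs(len(e.en_label) - len(gold.en_label)) <= length_tolerance
    ]
    if len(candidates) < k:
        return None
    chosen = rng.sample(candidates, k)
    labels = {normalise(e.en_label) for e in [gold, *chosen]}
    return chosen if len(labels) == k + 1 else None


CITY, COUNTRY = "Q515", "Q6256"
gold = Entity("Q64", "Berlin", frozenset({CITY}))
# `pool` stands for objects seen with the same property elsewhere in the corpus.
pool = [
    Entity("Q90", "Paris", frozenset({CITY})),
    Entity("Q2807", "Madrid", frozenset({CITY})),
    Entity("Q1741", "Vienna", frozenset({CITY})),
    Entity("Q142", "France", frozenset({COUNTRY})),       # wrong type
    Entity("Q8678", "Rio de Janeiro", frozenset({CITY})),  # label too long
]
distractors = sample_distractors(gold, pool, random.Random(0))
print(sorted(e.en_label for e in distractors))
```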
Balanced sampling\. To prevent high\-frequency relations from dominating the corpus, we apply round\-robin sampling across properties: on each pass, one fact is drawn from each property’s shuffled pool, cycling until 100,000 facts have been selected\. This produces a near\-uniform distribution over relations regardless of the underlying frequency skew in Wikidata\.
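A minimal sketch of round-robin sampling, assuming each property's candidate facts are held in a list; the function name, the seed handling, and the toy pools are hypothetical.

```python
import random
from collections import Counter


def round_robin_sample(pools: dict[str, list[str]], target: int,
                       seed: int = 0) -> list[tuple[str, str]]:
    """Draw one fact per property per pass until `target` facts are selected."""
    rng = random.Random(seed)
    # Shuffled copies, so the caller's pools are left untouched.
    shuffled = {prop: rng.sample(facts, len(facts)) for prop, facts in pools.items()}
    selected: list[tuple[str, str]] = []
    while len(selected) < target and any(shuffled.values()):
        for prop, facts in shuffled.items():
            if facts:  # exhausted pools are simply skipped
                selected.append((prop, facts.pop()))
                if len(selected) == target:
                    break
    return selected


pools = {
    "capital": [f"capital-{i}" for i in range(5)],   # rare relation
    "genre": [f"genre-{i}" for i in range(100)],     # frequent relation
    "author": [f"author-{i}" for i in range(100)],
}
picked = round_robin_sample(pools, target=12)
counts = Counter(prop for prop, _ in picked)
print(dict(counts))
```

In this toy run every relation contributes four facts despite the 20-fold difference in pool size. Once a small pool is exhausted, later passes draw only from the remaining pools, which is why the resulting distribution is near-uniform rather than exactly uniform.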
Multilingual MCQ generation\. For each selected fact, we generate a parallel MCQ bundle covering all twelve languages using gemma\-3\-27b\-it served via vLLM\. Entity and property labels missing from Wikidata are prefilled using a dedicated low\-temperature translation prompt \(temperature 0, max 32 tokens\) conditioned on the relation context, with results cached across the batch to amortize cost\. Given the localized subject, relation, gold answer, and three distractors, the model generates a question in the target language under explicit instructions to reuse the provided option strings verbatim \(temperature 0\.1, top\-p 0\.9, max 192 tokens\)\. The predicted answer\_text is resolved against the option set via exact match and, on failure, case\- and whitespace\-normalized string matching against both the options and the known gold label\.
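The answer-resolution cascade can be sketched as follows. The signature is hypothetical; in particular, passing the gold option's index alongside a separate `gold_label` surface form is an assumption about how the "known gold label" fallback might work.

```python
from typing import Optional


def normalise(text: str) -> str:
    """Lower-case (casefold) and collapse runs of whitespace."""
    return " ".join(text.casefold().split())


def resolve_answer(answer_text: str, options: list[str],
                   gold_label: str, gold_index: int) -> Optional[int]:
    """Return the index of the option that `answer_text` refers to, else None."""
    # 1) Exact string match against the options.
    if answer_text in options:
        return options.index(answer_text)
    # 2) Case- and whitespace-normalised match against the options.
    key = normalise(answer_text)
    normalised_options = [normalise(option) for option in options]
    if key in normalised_options:
        return normalised_options.index(key)
    # 3) Normalised match against the known gold label.
    if key == normalise(gold_label):
        return gold_index
    return None


options = ["Berlin", "Paris", "Madrid", "Vienna"]
print(resolve_answer("Paris", options, "Berlin", 0))       # exact match
print(resolve_answer("  berlin ", options, "Berlin", 0))   # normalised match
print(resolve_answer("Rome", options, "Berlin", 0))        # unresolved
```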
Validity and parallelism\. Each generated MCQ is validated for well\-formedness: exactly four distinct non\-empty options, a non\-empty question string, and an answer\_text that resolves to one of the options\. A fact is retained only if generation succeeds for *all* twelve languages; partial bundles are discarded entirely\. This strict all\-or\-nothing criterion guarantees that every fact in PolyFact is fully parallel, so cross\-lingual performance differences can be attributed to model behavior rather than to variation in underlying content, question difficulty, or answer set composition\.
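A minimal sketch of the well-formedness check and the all-or-nothing bundle rule. The dictionary layout and the `question` and `options` keys are assumptions; only `answer_text` is named in the text, and the sketch assumes it has already been resolved to one of the option strings.

```python
def is_well_formed(mcq: dict) -> bool:
    """Four distinct non-empty options, a non-empty question, a resolvable answer."""
    options = mcq.get("options") or []
    question = mcq.get("question") or ""
    answer = mcq.get("answer_text") or ""
    return (
        len(options) == 4
        and all(isinstance(o, str) and o.strip() for o in options)
        and len(set(options)) == 4
        and bool(question.strip())
        and answer in options
    )


def keep_bundle(bundle: dict, languages: list[str]) -> bool:
    """Keep a fact only if every language has a well-formed MCQ."""
    return all(lang in bundle and is_well_formed(bundle[lang]) for lang in languages)


good = {"question": "What is the capital of Germany?",
        "options": ["Berlin", "Paris", "Madrid", "Vienna"],
        "answer_text": "Berlin"}
duplicate_option = {**good, "options": ["Berlin", "Paris", "Paris", "Vienna"]}

languages = ["en", "de"]  # the real check would cover all twelve languages
print(keep_bundle({"en": good, "de": good}, languages))              # True
print(keep_bundle({"en": good, "de": duplicate_option}, languages))  # False
print(keep_bundle({"en": good}, languages))                          # False
```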
Splits\. The final corpus is partitioned into 95,000 training, 2,500 validation, and 2,500 test facts, with splits applied at the fact level so that all twelve language versions of a given fact remain in the same partition\.
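Fact-level splitting can be sketched by partitioning fact identifiers first and assigning every language version through its fact ID afterwards. The integer IDs, the seed, and the function names are assumptions for illustration.

```python
import random


def split_fact_ids(fact_ids, n_val=2500, n_test=2500, seed=0):
    """Partition fact IDs (not individual language rows) into three sets."""
    ids = sorted(fact_ids)
    random.Random(seed).shuffle(ids)
    return {
        "validation": set(ids[:n_val]),
        "test": set(ids[n_val:n_val + n_test]),
        "train": set(ids[n_val + n_test:]),
    }


def partition_of(fact_id, splits):
    """Every language version of a fact inherits the fact's partition."""
    return next(name for name, members in splits.items() if fact_id in members)


splits = split_fact_ids(range(100_000))
sizes = {name: len(members) for name, members in splits.items()}
print(sizes)

# All language rows of fact 42 land in the same partition by construction.
rows = [(42, lang) for lang in ("en", "de", "ja")]
print({partition_of(fact_id, splits) for fact_id, _ in rows})
```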
### D\.2 Quality Verification\.
To characterise residual ambiguity in the released dataset, we labelled 300 facts per language across all 12 languages \(3,600 items total\) with an LLM\-as\-judge \(GPT\-4o with web\-search tool grounding\), and independently human\-labelled 100 items per language across six languages \(Arabic, English, French, German, Russian, Spanish; 600 items total\), using a three\-label rubric \(correct / ambiguous / incorrect\)\. Using an LLM judge enables per\-relation quality estimation at a scale infeasible for purely manual annotation and provides per\-item verification flags for a much larger subset\. Agreement between the LLM judge and the independent human review is 91\.0% overall, with per\-language values ranging from 86% \(Arabic\) to 96% \(English\), as listed in Table [6](https://arxiv.org/html/2606.06586#A4.T6)\. The LLM judge identifies an overall ambiguity rate of 13\.3% across the 3,600\-item sample; rates vary both by Wikidata property and by language\. By property, three relations exhibit notably higher ambiguity: country of origin \(35%\), place of birth \(25%\), and genre \(25%\)\. The ambiguity in the first two cases is largely driven by person\-name queries, where multiple individuals may share the same name, while genre is inherently more subjective and underspecified\. The remaining 16 relations all have ambiguity rates at or below 20%\. By language, Japanese \(23%\) and Chinese \(21%\) show roughly twice the ambiguity rate of European languages \(9–13%\), consistent with the translation\-induced ambiguity patterns we observed in qualitative review \(e\.g\., transliterated subjects losing disambiguating information\)\. We release per\-item verification labels and a recommended PolyFact\-Clean filter \(which excludes the three highest\-ambiguity relations\) alongside the dataset; see Table [7](https://arxiv.org/html/2606.06586#A4.T7) for the per\-relation breakdown\.

| Language | N | Agreement |
|---|---|---|
| English (en) | 100 | 96.0% |
| French (fr) | 100 | 93.0% |
| German (de) | 100 | 92.4% |
| Spanish (es) | 100 | 92.0% |
| Russian (ru) | 100 | 87% |
| Chinese (zh) | 100 | 91.9% |
| Arabic (ar) | 100 | 86.0% |
| Overall | 700 | 91.0% |

Table 6: LLM\-human inter\-judge agreement on the post\-hoc verification sample \(100 items per language\)\. N is the number of items with both an LLM\-judge label and an independent human label\.

| Relation | N | Ambig. % |
|---|---|---|
| *Excluded from PolyFact-Clean* | | |
| country of origin | 168 | 35 |
| genre | 240 | 25 |
| place of birth | 216 | 25 |
| *Included in PolyFact-Clean* | | |
| director | 192 | 20 |
| continent | 216 | 20 |
| language of work or name | 252 | 18 |
| educated at | 240 | 17 |
| currency | 36 | 14 |
| employer | 216 | 11 |
| architect | 180 | 9 |
| official language | 168 | 8 |
| creator | 192 | 8 |
| place of death | 216 | 8 |
| developer | 192 | 7 |
| country | 192 | 6 |
| country of citizenship | 216 | 5 |
| author | 60 | 3 |
| discoverer or inventor | 216 | 3 |
| manufacturer | 192 | 2 |

Table 7: Per\-relation ambiguity rates from the full 3,600\-item LLM\-judge verification sample \(300 items per language across 12 languages\), sorted by the fraction of items labelled ambiguous or incorrect\. PolyFact\-Clean excludes the three highest\-ambiguity relations \(top group\), removing approximately 17% of items\. Three of the 22 curated relations did not appear in the random sample\.
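As a consistency check, the aggregate figures quoted above can be recomputed from the per-relation rows of Table 7. The sketch below copies N and the ambiguity percentage from the table; since the published percentages are rounded, the recomputed values are approximate.

```python
# (N items, ambiguity %) per relation, copied from Table 7.
EXCLUDED = {
    "country of origin": (168, 35),
    "genre": (240, 25),
    "place of birth": (216, 25),
}
INCLUDED = {
    "director": (192, 20),
    "continent": (216, 20),
    "language of work or name": (252, 18),
    "educated at": (240, 17),
    "currency": (36, 14),
    "employer": (216, 11),
    "architect": (180, 9),
    "official language": (168, 8),
    "creator": (192, 8),
    "place of death": (216, 8),
    "developer": (192, 7),
    "country": (192, 6),
    "country of citizenship": (216, 5),
    "author": (60, 3),
    "discoverer or inventor": (216, 3),
    "manufacturer": (192, 2),
}

all_rows = {**EXCLUDED, **INCLUDED}
total_items = sum(n for n, _ in all_rows.values())
# N-weighted mean of the per-relation ambiguity percentages.
overall_pct = sum(n * pct for n, pct in all_rows.values()) / total_items
# Share of items that the PolyFact-Clean filter would remove.
removed_pct = 100 * sum(n for n, _ in EXCLUDED.values()) / total_items
print(total_items, round(overall_pct, 1), round(removed_pct, 1))  # 3600 13.3 17.3
```

The recomputed totals agree with the 3,600-item sample size, the 13.3% overall ambiguity rate, and the roughly 17% of items removed by the filter.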
## Appendix E Additional Results
### E\.1 Pure SFT model
Table [8](https://arxiv.org/html/2606.06586#A5.T8) lists results for the model trained on PolyFact with pure SFT \($\lambda=0$\)\. Because this objective does not include any consistency term, the model over\-indexes on individual per\-language performance rather than learning to route the same knowledge consistently through shared pathways\.
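
KLAR \(Seen\)

| Model | en | es | fr | ru | zh | ja | ar |
|---|---|---|---|---|---|---|---|
| SFT-pure | 64.4 | 24.7 | 31.0 | 15.1 | 9.0 | 5.2 | 5.4 |

KLAR \(Unseen\)

| Model | ca | el | fa | he | hu | ko | nl | tr | uk | vi |
|---|---|---|---|---|---|---|---|---|---|---|
| SFT-pure | 16.7 | 8.4 | 3.7 | 3.4 | 8.8 | 1.7 | 19.6 | 16.1 | 6.6 | 12.1 |

PolyFact

| Model | en | de | es | fr | pt | id | ru | zh | ar | ja | sw | bn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT-pure | 81.37 | 69.08 | 66.79 | 69.96 | 45.78 | 41.85 | 48.16 | 72.89 | 54.46 | 71.78 | 51.41 | 48.79 |

Table 8: SFT\-pure performance on KLAR and PolyFact\.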
### E\.2 LAHIS
Despite similar overlap patterns, the two methods differ in how much they alter the base model’s routing\. Under SFT, between 45% and 75% of language\-important heads change relative to the base, while GRPO replaces between 70% and 85% \(Figure [12](https://arxiv.org/html/2606.06586#A5.F12)\)\. Per\-language delta maps for both models are provided in Appendix [E\.2](https://arxiv.org/html/2606.06586#A5.SS2.SSS0.Px1)\. Pairwise head overlap for the three Qwen\-2\.5\-7B variants is visualized in Figure [13](https://arxiv.org/html/2606.06586#A5.F13)\.



Figure 11: Percentage of language\-important attention heads per layer for Qwen\-2\.5\-7B before and after post\-training\. GRPO reduces the concentration of language\-specific heads in the earliest layers and distributes language routing more broadly across the network\.
Figure 12: Number of stable \(blue\) and changed \(orange\) language\-important heads per language\. Panels: \(a\) Base → SFT; \(b\) Base → GRPO\.
Figure 13: Pairwise overlap of language\-important attention heads across language pairs for Qwen\-2\.5\-7B before and after post\-training\. Panels: \(a\) Base; \(b\) SFT; \(c\) GRPO\.
#### Per\-language Delta Maps
Figures [16](https://arxiv.org/html/2606.06586#A5.F16) and [17](https://arxiv.org/html/2606.06586#A5.F17) show per\-language delta maps of LAHIS importance scores \(finetuned − base\) for GRPO and SFT\-consistent, respectively\. Both methods predominantly suppress existing language\-specific heads rather than creating new ones, with the strongest suppression at layer 0\.
### E\.3 Per\-language Neuron Distributions
Figure [14](https://arxiv.org/html/2606.06586#A5.F14) provides a detailed visualisation of the language activation probability entropy \(LAPE\) results, showing the frequency of specialised neurons per layer for all twelve target languages across the Base, SFT, and GRPO configurations\. Figure [15](https://arxiv.org/html/2606.06586#A5.F15) provides the empirical cumulative distribution functions for the Base, SFT, and GRPO configurations, summed over all languages\.
Figure 14: Layer\-wise frequency of language\-specific neurons for all target languages\. Note the distinct English \(EN\) surge and the redistribution of neurons in non\-Latin scripts \(AR, JA, ZH\-CN\) under the GRPO configuration \(red\)\.
Figure 15: Empirical Cumulative Distribution Function \(ECDF\) of specialised neuron discovery across layers\. The significant $D_{KS}$ shift in GRPO indicates a deferral of linguistic specialisation to later layers\.
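For readers unfamiliar with the statistic, the sketch below computes an ECDF over layer indices and the two-sample Kolmogorov-Smirnov distance (the largest vertical gap between two ECDFs). The exact procedure behind Figure 15 is not given in this section, so this is only an illustration; the layer lists are toy values, not data from the paper.

```python
from bisect import bisect_right


def ecdf(samples, grid):
    """Fraction of samples less than or equal to each grid point."""
    ordered = sorted(samples)
    return [bisect_right(ordered, x) / len(ordered) for x in grid]


def ks_statistic(a, b):
    """Two-sample KS distance: maximum absolute difference between ECDFs."""
    grid = sorted(set(a) | set(b))
    return max(abs(fa - fb) for fa, fb in zip(ecdf(a, grid), ecdf(b, grid)))


# Toy layer indices at which specialised neurons are found (hypothetical).
base_layers = [0, 0, 1, 1, 2, 3, 5, 8]
tuned_layers = [2, 3, 5, 8, 13, 21, 24, 27]

d_ks = ks_statistic(base_layers, tuned_layers)
print(d_ks)
```

A larger distance with the fine-tuned ECDF lying below the base ECDF means that the same cumulative share of specialised neurons is only reached at deeper layers, which is the sense of "deferral" used in the caption.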
### E\.4 Per\-language Performance
#### PolyFact:
See Table [9](https://arxiv.org/html/2606.06586#A5.T9)\.
#### KLAR:
See Table [10](https://arxiv.org/html/2606.06586#A5.T10)\.
#### Global\-MMLU:
See Table [11](https://arxiv.org/html/2606.06586#A5.T11)\.
Figure 16: Delta heatmap \(GRPO − base\)\. Blue indicates weakened heads, red indicates strengthened heads\.
Figure 17: Delta heatmap \(SFT\-consistent − base\)\. Pattern closely mirrors GRPO, suggesting both methods suppress similar heads\.

| Model | en | de | es | fr | pt | id | ru | zh | ar | ja | sw | bn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 79.94 | 65.95 | 64.45 | 68.61 | 42.81 | 41.46 | 44.31 | 70.04 | 54.58 | 66.19 | 48.83 | 47.96 |
| Aligned | 79.11 | 65.68 | 63.42 | 68.69 | 42.96 | 39.56 | 47.56 | 68.21 | 54.46 | 68.09 | 48.59 | 48.39 |
| GRPO | 76.50 | 70.87 | 67.97 | 71.30 | 49.43 | 44.83 | 48.83 | 71.66 | 55.81 | 69.80 | 51.41 | 50.18 |
| Aligned + GRPO | 79.98 | 71.03 | 66.71 | 70.91 | 49.23 | 43.36 | 51.13 | 71.74 | 58.07 | 71.74 | 51.09 | 50.14 |
| SFT | 80.90 | 63.06 | 65.40 | 60.36 | 44.07 | 40.35 | 46.29 | 65.64 | 49.98 | 65.60 | 49.03 | 45.26 |
| Aligned + SFT | 80.62 | 64.57 | 66.55 | 66.75 | 46.14 | 43.12 | 51.61 | 70.04 | 54.86 | 68.65 | 50.26 | 47.21 |

Table 9: PolyFact accuracy \(%\) across languages\.

Seen

| Model | en | es | fr | ru | zh | ja | ar |
|---|---|---|---|---|---|---|---|
| Baseline | 62.9 | 22.3 | 40.6 | 20.3 | 9.9 | 8.2 | 7.9 |
| Aligned | 50.2 | 13.2 | 21.9 | 13.7 | 9.7 | 3.0 | 7.4 |
| GRPO | 75.5 | 37.6 | 44.3 | 13.4 | 11.8 | 8.2 | 12.2 |
| Aligned + GRPO | 71.1 | 40.2 | 35.7 | 18.7 | 14.2 | 14.0 | 15.0 |
| SFT | 42.8 | 16.7 | 32.4 | 14.2 | 8.4 | 5.4 | 6.9 |
| Aligned + SFT | 50.2 | 29.3 | 35.1 | 12.1 | 8.4 | 2.3 | 7.3 |

Unseen

| Model | ca | el | fa | he | hu | ko | nl | tr | uk | vi |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 14.0 | 19.6 | 2.8 | 14.3 | 9.9 | 9.2 | 24.6 | 14.8 | 10.5 | 13.5 |
| Aligned | 9.5 | 10.4 | 2.4 | 6.5 | 8.1 | 6.2 | 11.2 | 14.7 | 3.7 | 10.4 |
| GRPO | 26.3 | 8.6 | 9.6 | 2.2 | 19.5 | 5.5 | 35.7 | 22.0 | 10.0 | 27.2 |
| Aligned + GRPO | 28.2 | 16.0 | 4.1 | 10.6 | 20.2 | 4.4 | 28.9 | 22.2 | 15.0 | 26.7 |
| SFT | 8.4 | 11.8 | 2.6 | 3.9 | 7.1 | 2.3 | 11.8 | 15.0 | 3.7 | 11.0 |
| Aligned + SFT | 11.1 | 7.2 | 2.3 | 1.6 | 8.3 | 3.7 | 13.9 | 20.3 | 6.3 | 11.5 |

Table 10: KLAR accuracy \(%\) across languages, grouped into seen and held\-out \(unseen\) languages\.

| Model | en | de | es | fr | pt | id | ru | zh | ar | ja | sw | bn | High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 56.49 | 42.46 | 40.70 | 41.75 | 36.84 | 37.19 | 31.58 | 31.93 | 33.68 | 32.98 | 29.82 | 25.26 | 40.25 |
| Aligned | 54.04 | 34.04 | 32.98 | 34.74 | 31.23 | 29.12 | 28.42 | 32.98 | 26.32 | 33.68 | 29.47 | 25.26 | 37.41 |
| GRPO | 55.79 | 42.81 | 38.60 | 43.86 | 38.60 | 38.25 | 32.98 | 34.04 | 29.12 | 32.98 | 31.23 | 28.42 | 40.95 |
| Aligned + GRPO | 54.39 | 41.75 | 38.60 | 41.75 | 35.09 | 32.28 | 33.33 | 35.09 | 26.32 | 32.28 | 29.82 | 27.37 | 39.43 |
| SFT | 54.04 | 40.35 | 40.35 | 41.05 | 37.19 | 34.39 | 33.68 | 32.63 | 31.58 | 32.28 | 29.82 | 23.51 | 39.90 |
| Aligned + SFT | 54.74 | 41.75 | 36.49 | 41.40 | 37.89 | 35.44 | 32.28 | 30.88 | 33.68 | 31.58 | 29.82 | 23.86 | 39.35 |

Table 11: Global\-MMLU accuracy \(%\) across languages\. High denotes the average over high\-resource languages\.