Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

arXiv cs.CL Papers

Summary

This paper studies catastrophic forgetting in multilingual expert language models during continual pretraining and proposes five parameter alignment strategies (hard layer freezing, soft regularization, post-hoc weight reversion, and model merging) to mitigate forgetting across 32 training languages with minimal cost to language acquisition.

arXiv:2606.00284v1 Announce Type: new

# Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models
Source: [https://arxiv.org/html/2606.00284](https://arxiv.org/html/2606.00284)
###### Abstract

While continual pretraining \(CPT\) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting\. Organizing training around language families reduces cross\-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks\. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer\-aware parameter alignment strategies: hard layer freezing, soft regularization, post\-hoc weight reversion, and model merging\. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held\-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation\. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post\-hoc reversion yields the strongest translation gains\. Together, these results map the acquisition–forgetting frontier for family\-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves\.

## 1 Introduction

Adapting Large Language Models (LLMs) through continual pretraining (CPT) is a practical solution for expanding model coverage to new languages while avoiding the prohibitive compute costs of pretraining from scratch (Zhao et al., [2025](https://arxiv.org/html/2606.00284#bib.bib39); Dou et al., [2024](https://arxiv.org/html/2606.00284#bib.bib40)). However, naïve dense CPT yields strong language acquisition but leads to catastrophic forgetting (McCloskey and Cohen, [1989](https://arxiv.org/html/2606.00284#bib.bib25); Kirkpatrick et al., [2017](https://arxiv.org/html/2606.00284#bib.bib27)) of the model’s original knowledge, particularly in multilingual settings, where the curse of multilinguality (Conneau et al., [2020](https://arxiv.org/html/2606.00284#bib.bib10)) forces trade-offs between language coverage and the preservation of existing capabilities.

A particularly promising paradigm, introduced by x-ELM (Blevins et al., [2024](https://arxiv.org/html/2606.00284#bib.bib15)), trains independent bilingual experts in parallel and merges them on demand, eliminating cross-language interference and facilitating efficient, distributed multilingual training. Drawing on targeted methods for low-resource language families (Downey et al., [2024](https://arxiv.org/html/2606.00284#bib.bib44); Ogueji et al., [2021](https://arxiv.org/html/2606.00284#bib.bib45)), we generalize this approach to training language family experts, scaling the language coverage per expert while limiting intra-expert interference (Chronopoulou et al., [2023](https://arxiv.org/html/2606.00284#bib.bib35)). However, while catastrophic forgetting has been studied in dense multilingual models (Owodunni and Kumar, [2025](https://arxiv.org/html/2606.00284#bib.bib28); Khelli et al., [2025](https://arxiv.org/html/2606.00284#bib.bib29)), how to mitigate it in the family-expert setting remains an open question.

Forgetting remains a clear issue in multilingual CPT: unconstrained dense CPT leads to 6.6–12.3 percentage point decreases on reading comprehension, and vanilla family experts, while less damaging in-family, can still drift from the shared initialization and degrade robustness on related held-out and cross-family languages. We hypothesize that this stems in part from excessive parameter drift away from the base model, and instantiate five parameter alignment strategies that vary in how they constrain parameter updates or correct model weights post-training (§[2.2](https://arxiv.org/html/2606.00284#S2.SS2)), each preserving the distributed, parallel nature of expert training. Motivated by recent analyses that suggest the middle layers of transformer LMs are the primary locus of language-neutral knowledge (Bandarkar and Peng, [2025](https://arxiv.org/html/2606.00284#bib.bib2); Bandarkar et al., [2025](https://arxiv.org/html/2606.00284#bib.bib3); Wendler et al., [2024](https://arxiv.org/html/2606.00284#bib.bib41)), our alignment methods are layer-aware, focusing on constraining changes in the middle layers while allowing the initial and final layers more freedom for better language acquisition.

We compare these five alignment strategies against two unregularized CPT baselines within our family-expert CPT setup spanning five language families (Slavic, Germanic, Indic, Austronesian, Romance) and 32 training languages, using Gemma-3-4B (Team et al., [2025](https://arxiv.org/html/2606.00284#bib.bib18)) as a shared initialization for each expert. We continue pretraining on up to 5B tokens per family on MADLAD-400 (Kudugunta et al., [2023](https://arxiv.org/html/2606.00284#bib.bib1)) and evaluate across four axes: Belebele reading comprehension (Bandarkar et al., [2024](https://arxiv.org/html/2606.00284#bib.bib5)), Global-PIQA physical reasoning (Chang et al., [2025](https://arxiv.org/html/2606.00284#bib.bib6)), FLORES-200 translation (Team et al., [2022](https://arxiv.org/html/2606.00284#bib.bib7)), and held-out perplexity as a proxy for language acquisition, including held-out relatives where benchmark coverage exists.

Our results show that parameter alignment substantially reduces forgetting over unregularized baselines at minimal cost to language acquisition, including generalization to held-out languages within each family. Which strategy works best is task-specific: freezing layer weights improves comprehension over the base model itself (Belebele avg. +1.7 pp), reverting some layers back to the base weights after training preserves strong translation quality (avg. +20.6 ChrF over base), and L2 regularization consistently maintains or improves held-out perplexity. These findings, along with a targeted interpolation analysis showing that middle-layer drift is the primary driver of comprehension degradation while FLORES translation follows a different layer-sensitivity profile, map a nuanced language acquisition–knowledge forgetting trade-off in multilingual expert training and indicate that alignment strategy selection should be layer-aware and driven by the target application rather than a single aggregate metric.

Our main contributions are as follows:

- We introduce family-expert CPT, a paradigm for distributed multilingual training centered on language families (§[2.1](https://arxiv.org/html/2606.00284#S2.SS1)), and five layer-aware parameter alignment strategies for mitigating catastrophic forgetting in this setting (§[2.2](https://arxiv.org/html/2606.00284#S2.SS2)).
- We comprehensively evaluate our methods across five typologically diverse language families and four evaluation axes, characterizing the acquisition–forgetting trade-off for each strategy on both seen and held-out languages (§[3](https://arxiv.org/html/2606.00284#S3)).
- Based on these analyses, we derive practical deployment guidelines linking each alignment strategy to the settings it best serves (§[4.2](https://arxiv.org/html/2606.00284#S4.SS2)).

## 2 Parameter-Aligned Family Experts

![Refer to caption](https://arxiv.org/html/2606.00284v1/x2.png)

Figure 1: Left: Overview of parameter alignment strategies. The layer-aware methods regularize or replace middle-layer parameters while allowing the other layers to learn language-specific information; Expert Soup uniformly averages the baseline Experts. Right: Summarized downstream results; parameter alignment improves reading-comprehension retention, while Dense-Reverted preserves strong translation quality.

We address catastrophic forgetting in multilingual continual pre-training with two key strategies. First, we propose family-expert CPT, a training paradigm that organizes data by language families to enable targeted, distributed expert training (§[2.1](https://arxiv.org/html/2606.00284#S2.SS1)), allowing for flexible scaling to new settings. However, without further intervention, language family experts can suffer from cross-lingual forgetting and parameter divergence from the shared initialization, reducing their multilingual robustness and making post-hoc combination less predictable. We therefore instantiate and benchmark five layer-aware parameter alignment methods that either regularize parameter updates or correct model weights after training (§[2.2](https://arxiv.org/html/2606.00284#S2.SS2)), alongside two baselines (§[2.3](https://arxiv.org/html/2606.00284#S2.SS3)). Together, family-expert CPT with parameter alignment retains the efficiency and flexibility of independent expert training (each expert can be trained in parallel and new families added on demand) while recovering multilingual generalization that unconstrained expert training sacrifices.

### 2.1 Language Family Grouping

An important design decision in multilingual expert training is how to group languages across models. We build on x-ELM (Blevins et al., [2024](https://arxiv.org/html/2606.00284#bib.bib15)), which grouped languages by syntactic similarity; however, this metric is not ablated, and their setting still harms performance when it groups languages that are too dissimilar (e.g., Swahili and Vietnamese).

We therefore organize experts by language family, following Chronopoulou et al. ([2023](https://arxiv.org/html/2606.00284#bib.bib35)), who show that family-level grouping mitigates inter-language interference and facilitates generalization to unseen low-resource languages. We create five experts corresponding to the Indic, Austronesian, Germanic, Romance, and Slavic families (Table [1](https://arxiv.org/html/2606.00284#S2.T1)), each trained on a mix of high-, medium-, and low-resource languages. We additionally designate held-out related languages to probe within-family generalization (§[3.5](https://arxiv.org/html/2606.00284#S3.SS5)).

### 2.2 Parameter Alignment Strategies

While each family expert is finetuned from a shared initialization, unconstrained training can shift parameters far from the original model, erasing prior knowledge. Our alignment strategies aim to limit this forgetting while preserving each expert’s ability to acquire new languages and maintaining the distributed efficiency of vanilla expert training. Specifically, motivated by evidence that the middle layers of transformer LMs encode language-neutral knowledge while the outer layers handle language-specific processing (e.g., Wendler et al., [2024](https://arxiv.org/html/2606.00284#bib.bib41)), our strategies primarily constrain the model’s middle layers. Figure [1](https://arxiv.org/html/2606.00284#S2.F1) summarizes our alignment strategies and baselines (§[2.3](https://arxiv.org/html/2606.00284#S2.SS3)):

Train-then-Revert. After training a dense model or a family expert, we reset the weights of the model’s middle layers back to the base model’s pre-trained weights, while keeping the updated weights of the first $m$ and last $n$ layers. Reverting middle layers post-hoc recovers general capabilities without requiring any retraining. This strategy was applied to both dense and expert settings, yielding two variants: Dense-Reverted and Expert-Reverted.
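
A minimal sketch of the post-hoc reversion step, assuming a toy checkpoint format: each model is a plain dict from parameter name to a list of floats, and layer indices are parsed from hypothetical names of the form `layers.<i>.`. The split (34 layers, first 9 and last 6 kept) follows the setup reported in §3.1; treating non-layer parameters such as embeddings as non-middle is an assumption, not something the paper states.

```python
import re

# Layer split from the experimental setup: 34 layers, first 9 and last 6 kept.
NUM_LAYERS, FIRST_M, LAST_N = 34, 9, 6


def is_middle(name):
    """True if a (hypothetical) parameter name belongs to a middle layer."""
    match = re.search(r"layers\.(\d+)\.", name)
    if match is None:
        return False  # assumption: embeddings, final norm, etc. stay tuned
    return FIRST_M <= int(match.group(1)) < NUM_LAYERS - LAST_N


def revert_middle(base, tuned):
    """Return tuned weights with the middle layers reset to the base weights."""
    return {
        name: list(base[name] if is_middle(name) else values)
        for name, values in tuned.items()
    }


# Toy checkpoints: the base model is all zeros, the tuned model all ones.
names = ["embed.weight", "layers.0.w", "layers.9.w", "layers.27.w", "layers.28.w"]
base = {name: [0.0] for name in names}
tuned = {name: [1.0] for name in names}
reverted = revert_middle(base, tuned)
```

With this split, layers 9 through 27 (19 layers) are reset to the base values, while layers 0 to 8 and 28 to 33 keep their tuned values.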

Layer Freezing. Rather than correcting forgetting after training, this strategy enforces layer boundaries as a hard constraint during training: the middle layers are frozen while the first $m$ and last $n$ layers receive gradient updates. This prevents middle-layer drift, at the cost of reducing the model’s capacity to absorb new language information.
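
As a rough illustration of the hard constraint, the sketch below applies one plain SGD step to a toy parameter dict and skips every middle-layer entry. The parameter naming (`layers.<i>.`), the learning rate, and the use of SGD are illustrative assumptions; in a framework such as PyTorch the same effect is typically obtained by setting `requires_grad = False` on the frozen parameters.

```python
import re

NUM_LAYERS, FIRST_M, LAST_N = 34, 9, 6


def is_frozen(name):
    """True if a (hypothetical) parameter name belongs to a frozen middle layer."""
    match = re.search(r"layers\.(\d+)\.", name)
    return match is not None and FIRST_M <= int(match.group(1)) < NUM_LAYERS - LAST_N


def sgd_step(params, grads, lr=0.5):
    """One plain SGD update that leaves frozen parameters untouched."""
    return {
        name: list(values) if is_frozen(name)
        else [w - lr * g for w, g in zip(values, grads[name])]
        for name, values in params.items()
    }


# Toy parameters: one first-group, one middle, and one last-group layer.
params = {"layers.0.w": [1.0], "layers.15.w": [1.0], "layers.33.w": [1.0]}
grads = {name: [1.0] for name in params}
updated = sgd_step(params, grads)
```

Only the outer layers move; the middle-layer entry is returned unchanged regardless of its gradient.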

Layer-Range L2. We apply L2 starting-point regularization (L2-SP; Li et al. [2018](https://arxiv.org/html/2606.00284#bib.bib26)), as adapted by Kumar et al. ([2024](https://arxiv.org/html/2606.00284#bib.bib4)), with layer-dependent penalty strengths, offering a soft alternative to layer freezing during training. This strategy adds $\mathcal{L}_{\text{reg}} = \sum_{l} \lambda_{l} \|\theta_{l} - \theta_{l}^{0}\|_{2}^{2}$ to the learning objective, where $\theta_{l}^{0}$ are the weights of the base model and $\lambda_{l}$ is set high for the middle layers ($\lambda_{\text{mid}} = 0.05$) and low for the outer layers ($\lambda_{\text{first}} = \lambda_{\text{last}} = 0.001$). The middle layers thus receive a strong anchor toward the pre-trained weights while the outer layers remain nearly unconstrained.
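
The penalty term can be sketched on toy data as follows. The penalty strengths and layer split come from the text; the dict-of-lists checkpoint format, the `layers.<i>.` naming, and leaving non-layer parameters unpenalized are assumptions made for illustration only.

```python
import re

NUM_LAYERS, FIRST_M, LAST_N = 34, 9, 6
LAMBDA_MID, LAMBDA_OUTER = 0.05, 0.001  # penalty strengths reported in the text


def layer_lambda(name):
    """Penalty strength for a (hypothetical) parameter name."""
    match = re.search(r"layers\.(\d+)\.", name)
    if match is None:
        return 0.0  # assumption: non-layer parameters are left unpenalized
    idx = int(match.group(1))
    return LAMBDA_MID if FIRST_M <= idx < NUM_LAYERS - LAST_N else LAMBDA_OUTER


def l2_sp_penalty(params, base):
    """Sum over parameters of lambda_l * ||theta_l - theta_l^0||_2^2."""
    return sum(
        layer_lambda(name) * sum((w - w0) ** 2 for w, w0 in zip(values, base[name]))
        for name, values in params.items()
    )


params = {"layers.0.w": [1.0, 1.0], "layers.15.w": [2.0], "embed.weight": [3.0]}
base = {name: [0.0] * len(values) for name, values in params.items()}
penalty = l2_sp_penalty(params, base)  # 0.001 * 2 + 0.05 * 4 + 0
```

In training, this scalar would be added to the language-modeling loss, so a unit of squared drift in a middle layer costs 50 times more than the same drift in an outer layer.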

Expert Soup. After training five vanilla family experts, we merge them into a single unified model by uniformly averaging their weights: $\theta_{\text{soup}} = \frac{1}{5}\sum_{f=1}^{5}\theta_{f}$, where $\theta_{f}$ are the weights of family $f$’s expert. Because all five experts are fine-tuned from the same base checkpoint for a relatively small number of steps, uniform averaging is a plausible model-soup baseline under the linear mode connectivity intuition for weight averaging (Wortsman et al., [2022](https://arxiv.org/html/2606.00284#bib.bib36)).
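
A minimal sketch of uniform weight averaging, assuming every expert is a toy dict with identical parameter names and shapes (the names and values here are made up for illustration):

```python
def uniform_soup(experts):
    """Average a list of same-shaped toy checkpoints parameter by parameter."""
    count = len(experts)
    return {
        name: [sum(column) / count for column in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


# Five toy "family experts" whose first weight is 1.0, 2.0, ..., 5.0.
experts = [{"layers.0.w": [float(f), 2.0 * f]} for f in range(1, 6)]
soup = uniform_soup(experts)
```

Because every expert receives the same weight of 1/5, the merge needs no validation data or tuning.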

Table 1: Language families, their training languages, and held-out languages for evaluating within-family generalization.

### 2.3 Baselines

We compare our parameter alignment strategies to two multilingual CPT baselines:

Dense CPT trains a single model jointly on all considered languages without any forgetting mitigation or family-based data partitioning.

Family Expert. Inspired by Blevins et al. ([2024](https://arxiv.org/html/2606.00284#bib.bib15)), we extend the x-ELM framework to language families, training one expert per family on linguistically related data without regularization or post-hoc weight correction.

## 3 Experiments

### 3.1 Experimental Setup

Pre-training corpus. We sample training data from MADLAD-400 (Kudugunta et al., [2023](https://arxiv.org/html/2606.00284#bib.bib1)), a massively multilingual web corpus. To ensure a fair comparison across families of different sizes, we fix a budget of 5B tokens per family (25B tokens total), distributing each family’s budget equally across its member languages. Language clusters are grouped based on genealogical relationships (as documented in [http://www.elinguistics.net/Language_Evolutionary_Tree.html](http://www.elinguistics.net/Language_Evolutionary_Tree.html)). Documents are tokenized with the Gemma-3 tokenizer (Team et al., [2025](https://arxiv.org/html/2606.00284#bib.bib18)) at a maximum sequence length of 2,048 tokens, with a 95%/5% train/validation split used for early stopping and per-language perplexity evaluation.
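
The equal split can be checked with simple arithmetic. The family size of six used in the example matches the six Austronesian training languages mentioned in §3.5; the function name is hypothetical, and integer division is an assumption about how a remainder would be handled.

```python
FAMILY_BUDGET = 5_000_000_000  # 5B tokens per family, as stated above
NUM_FAMILIES = 5


def per_language_budget(num_languages, family_budget=FAMILY_BUDGET):
    """Tokens allotted to each language when a family's budget is split equally."""
    return family_budget // num_languages


total_budget = FAMILY_BUDGET * NUM_FAMILIES
austronesian_share = per_language_budget(6)
```

Families with fewer training languages therefore see more tokens per language under the same 5B budget.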

Base model. All experiments use gemma-3-4b-pt (Team et al., [2025](https://arxiv.org/html/2606.00284#bib.bib18)), a 4B-parameter decoder-only transformer with 34 layers. Since the released checkpoint is multimodal, we strip the vision sub-network before any CPT, ensuring all capability changes are attributable to CPT alone. All runs use bfloat16 precision and gradient checkpointing.

For all layer-aware strategies, we designate the first $m=9$ and last $n=6$ layers as flanking (trainable) layers and the middle 19 as the constrained region, motivated by evidence that middle layers encode language-neutral knowledge while outer layers handle language-specific processing (Bandarkar et al., [2025](https://arxiv.org/html/2606.00284#bib.bib3); Bandarkar and Peng, [2025](https://arxiv.org/html/2606.00284#bib.bib2); Wendler et al., [2024](https://arxiv.org/html/2606.00284#bib.bib41)). We keep this layer range fixed across all families and strategies, then evaluate its task-specific consequences with the interpolation analysis in §[4](https://arxiv.org/html/2606.00284#S4).

Training. Dense CPT trains jointly on all 32 training languages for up to 50,000 steps. All per-family strategies were trained for up to $\sim$17,000 steps ($\approx$1 epoch), with early stopping (patience of 6 evaluations at 500-step intervals) across all strategies. For Layer-Range L2-SP, $\lambda$ values were selected on one family’s validation perplexity and held fixed across all five families. Train-then-Revert and Expert Soup are applied post-hoc and require no additional training. Full hyperparameter details are in Appendix [A.2](https://arxiv.org/html/2606.00284#A1.SS2).
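
The early-stopping rule can be sketched as below. The patience and evaluation interval are taken from the text; counting an evaluation as an improvement only when the validation loss is strictly lower than the best seen so far is an assumption, since the exact criterion is not stated here.

```python
def stopping_step(val_losses, patience=6, eval_interval=500):
    """Return the training step at which early stopping would trigger.

    val_losses holds one validation loss per evaluation, in order. If patience
    is never exhausted, the step of the last evaluation is returned.
    """
    best = float("inf")
    evals_without_improvement = 0
    for eval_index, loss in enumerate(val_losses, start=1):
        if loss < best:  # assumption: strict improvement resets the counter
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return eval_index * eval_interval
    return len(val_losses) * eval_interval


# Best loss at evaluation 2, followed by six evaluations with no improvement.
losses = [3.0, 2.5, 2.6, 2.7, 2.6, 2.8, 2.9, 3.0, 2.0]
stop = stopping_step(losses)
```

With a patience of 6 at 500-step intervals, a run stops 3,000 steps after its last improvement, so the final loss of 2.0 in this toy sequence is never reached.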

Evaluation. We evaluate in two directions: language acquisition and general knowledge retention, using a 2-shot setting throughout with lm-eval-harness (Gao et al., [2024](https://arxiv.org/html/2606.00284#bib.bib9)). Benchmarks cover: perplexity on held-out MADLAD text; Belebele (Bandarkar et al., [2024](https://arxiv.org/html/2606.00284#bib.bib5)) (reading comprehension); Global-PIQA (Chang et al., [2025](https://arxiv.org/html/2606.00284#bib.bib6)) (world-knowledge reasoning); and FLORES-200 (Team et al., [2022](https://arxiv.org/html/2606.00284#bib.bib7)) (ChrF, xx$\to$EN and EN$\to$xx). Evaluations cover the 32 training languages and held-out relatives (Appendix [A.1](https://arxiv.org/html/2606.00284#A1.SS1)).

### 3.2 Language Acquisition

Table [2](https://arxiv.org/html/2606.00284#S3.T2) summarizes the perplexity evaluation across all families and strategies for the *training languages* (per-language breakdowns are in Appendix [A.7](https://arxiv.org/html/2606.00284#A1.SS7), Table [11](https://arxiv.org/html/2606.00284#A1.T11)), while an analysis of model perplexity on held-out languages is in §[3.5](https://arxiv.org/html/2606.00284#S3.SS5). Dense CPT and Family Expert are close in overall in-domain acquisition, with Dense marginally ahead on average (7.20 vs. 7.30; Table [2](https://arxiv.org/html/2606.00284#S3.T2)). The two strategies differ by family: Expert is clearly ahead on Romance (8.40 vs. 8.55) and comparable on Slavic (6.40 vs. 6.41), while Dense is much better on Austronesian (6.22 vs. 6.74) and slightly ahead on Germanic (7.18 vs. 7.23) and Indic (7.66 vs. 7.73). Austronesian and Romance show the largest per-family gaps, pointing in opposite directions: Dense’s Austronesian advantage is consistent with cross-family transfer benefiting a typologically diverse low-resource family, while Expert’s Romance advantage suggests that family-level specialization can meaningfully outperform joint training when the family is well-represented in Gemma’s pretraining mixture.

We see more moderate perplexity improvements over the base model when training with parameter alignment strategies. Layer-Range L2-SP achieves moderate but consistent perplexity reductions across all families (e.g., Slavic mk: $6.70 \to 6.36$). Layer Freezing is comparable to Layer-Range L2-SP in acquisition strength but benefits from the hard constraint preventing middle-layer drift. The Revert variants (Dense-Reverted, Expert-Reverted) sacrifice perplexity gains relative to their non-reverted counterparts, confirming that middle-layer weights carry meaningful language-specific knowledge (e.g., Indic family average: Expert $10.07 \to 7.73$; Expert-Reverted $\to 9.07$; Table [2](https://arxiv.org/html/2606.00284#S3.T2)), but they sit between the base model and the plain CPT model.

Table 2: Perplexity ($\downarrow$) on the validation split of each family’s training data, averaged over the training languages within that family. Bold = best per row; underline = within 0.2 of best.

### 3.3 Catastrophic Forgetting on Downstream Tasks

We now evaluate our family-expert models on downstream tasks: Tables [3](https://arxiv.org/html/2606.00284#S3.T3) and [4](https://arxiv.org/html/2606.00284#S3.T4) summarize the Belebele and Global-PIQA results across language families and strategies, respectively (per-language breakdowns for each task are in Appendix [A.7](https://arxiv.org/html/2606.00284#A1.SS7), Tables [12](https://arxiv.org/html/2606.00284#A1.T12)–[15](https://arxiv.org/html/2606.00284#A1.T15)).

Dense CPT causes substantial forgetting on reading comprehension: Belebele accuracy drops 6.6–12.3 pp relative to the base model across families (e.g., English: $0.813 \to 0.674$). Global-PIQA shows a more muted, family-dependent pattern (Table [4](https://arxiv.org/html/2606.00284#S3.T4)). Family Expert shows intermediate behavior: it preserves in-family Belebele accuracy better than Dense (e.g., Slavic: 0.726 vs. Dense 0.619).

Among parameter alignment strategies, Layer Freezing best preserves downstream performance on average, while Layer-Range L2-SP remains competitive and is especially useful when held-out perplexity is prioritized. Layer Freezing exceeds the base model on average for Belebele and Global-PIQA, and Layer-Range L2-SP stays close to the base on both tasks (e.g., English: Freeze 0.817, Layer-Reg 0.802 vs. base 0.813 on Belebele).

The Revert strategies partially recover from forgetting: Dense-Reverted recovers approximately 8 pp relative to Dense on Belebele but remains below the base model on average, suggesting that post-hoc reversion of middle layers does not fully restore all general capabilities. Family Expert without reversion shows intermediate forgetting: in-domain language performance is preserved reasonably, but cross-family languages still show modest drops compared to the base.

Expert Soup achieves the second-best average Belebele accuracy (0.711) after Layer Freezing (0.716), and is best on Germanic (0.761), exceeding both the individual Expert (0.756) and Expert-Reverted (0.760) and demonstrating that uniform weight averaging across all five family experts produces stronger comprehension retention than any individual family expert checkpoint. We also tested additional soups, including a freeze-best soup built from the strongest ablation family; on Belebele and Global-PIQA, it produced only minimal changes relative to the best existing method, with average deltas near zero across held-in and held-out splits (Appendix [A.6](https://arxiv.org/html/2606.00284#A1.SS6)).

Table 3: Belebele accuracy ($\uparrow$, 2-shot), averaged over training languages within each family. Bold = best per row; underline = within 0.5 pp of best.

Table 4: Global-PIQA accuracy ($\uparrow$, 2-shot), averaged over evaluation languages within each family. Bold = best per row; underline = within 0.5 pp of best.

### 3.4 Translation Quality (FLORES-200)

Table 5: FLORES-200 ChrF ($\uparrow$, 2-shot), averaged over both translation directions (en$\to$xx and xx$\to$en) and training languages within each family. Bold = best per row; underline = within 1 ChrF point of best.

All CPT strategies improve translation performance over the base model, which averages 33.4 combined ChrF across families (Table [5](https://arxiv.org/html/2606.00284#S3.T5)). Appendix [A.5](https://arxiv.org/html/2606.00284#A1.SS5) describes the FLORES decoding and post-truncation rescoring protocol used to keep evaluation comparable across checkpoints while matching the Gemma-3 technical report numbers as closely as possible (Team et al., [2025](https://arxiv.org/html/2606.00284#bib.bib18)).

Dense-Reverted is the average leader at 54.0 ChrF, ahead of Dense (53.8) and $\sim$7 points above the next tier (Expert Soup 47.4, Layer-Range L2-SP 45.1, Layer Freezing 44.7, Expert-Reverted 44.5, Family Expert 44.1). The narrow Dense vs. Dense-Reverted gap shows that joint training already produces strong translation; reverting middle-layer weights primarily preserves comprehension (§[3.3](https://arxiv.org/html/2606.00284#S3.SS3)) without sacrificing translation.

Per family, Dense-Reverted wins four of five: Slavic (53.6), Germanic (59.5), Indic (44.2), and Romance (59.0). Dense itself leads Austronesian (55.7 vs. Dense-Reverted 53.6), the typologically diverse low-resource family where unconstrained joint training appears to extract the most translation gain. Per-family experts trail by larger margins on the difficult, non-Latin-script families (Indic Expert: 36.1 ChrF at PPL 7.73; Austronesian Expert: 25.0 ChrF at PPL 6.74); Soup recovers some of the gap (Indic 42.6, Austronesian 48.1), but Dense-Reverted still beats Soup on Indic and Austronesian (by 1.6 and 5.5 ChrF, respectively).

### 3.5 Within-Family Generalization

A natural question is whether the benefits of family-expert CPT extend to unseen languages within the targeted family. We evaluate this setting across all four benchmarks on languages withheld from training in each family (see Table [1](https://arxiv.org/html/2606.00284#S2.T1) as well as Appendix [A.1](https://arxiv.org/html/2606.00284#A1.SS1) for the full list with per-benchmark coverage). While these languages are held out during our CPT experiments, we are unable to confirm whether the base Gemma model was pretrained on them, as the model’s training data is not reported. Family-level held-out perplexity averages are reported in Table [9](https://arxiv.org/html/2606.00284#A1.T9) (Appendix [A.4](https://arxiv.org/html/2606.00284#A1.SS4)).

Dense CPT hurts held-out languages. Dense CPT increases held-out perplexity substantially across every family (e.g., Indic: $8.99 \to 14.70$) and degrades held-out Belebele by 8–11 pp, confirming that catastrophic forgetting extends to typologically related unseen languages.

Soft regularization enables within-family transfer. Layer-Range L2-SP is the only strategy that consistently matches or improves held-out perplexity relative to the base model across all five families (Germanic: $8.56 \to 8.38$; Indic: $8.99 \to 8.89$; Slavic: $7.36 \to 7.18$; Austronesian: $14.04 \to 14.02$; Romance: $7.90 \to 7.69$), never increasing it. Expert Soup achieves similar gains in four of five families, falling marginally short in Austronesian (14.17 vs. base 14.04). Freeze and Expert-Reverted improve held-out perplexity on a subset of families but sit slightly above base on the remaining ones, making them competitive but not uniformly improving. Vanilla family Experts degrade held-out perplexity in every family, most sharply for Austronesian ($14.04 \to 28.35$), where training on six Austronesian languages transfers poorly to the five held-out relatives.

On held\-out Belebele, the ranking mirrors §[3\.3](https://arxiv.org/html/2606.00284#S3.SS3): Layer Freezing and Expert Soup stay within 1–2 pp of the base model on average, while Dense drops up to 11 pp \(Slavic\)\. No strategy surpasses the base model on comprehension on average, confirming that family\-level CPT does not yield transferable comprehension gains for held\-out relatives\.

![Refer to caption](https://arxiv.org/html/2606.00284v1/x3.png)

Figure 2: Held-out Belebele accuracy delta and FLORES MT ChrF delta relative to the base model, averaged over the five families.

Translation generalizes; leaders shift by family. Unlike comprehension, translation quality generalizes to held-out languages: all CPT strategies improve ChrF over the base model (Figure [2](https://arxiv.org/html/2606.00284#S3.F2)). No single strategy dominates: Dense-Reverted leads Slavic (48.5); Layer Freezing leads Germanic (53.0, with Expert-Reverted 52.6 and Dense-Reverted 52.2 within 1 ChrF); Expert Soup leads Indic (35.2); Layer-Range L2-SP leads Austronesian (33.1, edging out Expert Soup 31.9 and Dense 31.4); and Dense narrowly leads Romance (58.5, with Dense-Reverted 58.3 within 0.2 ChrF). Per-direction breakdowns are in Appendix [A.8](https://arxiv.org/html/2606.00284#A1.SS8).

## 4 Understanding Layer-Aware Adaptation

The layer\-aware design strategies in this work manipulate the middle layers based on prior findings that a model’s middle layers encode more reasoning knowledge, while the outer layers are more involved in language understanding\. Having observed the downstream results in the prior section, we examine whether this design choice aligns with where forgetting occurs in our trained models\. We find that middle\-layer drift is the strongest causal contributor to comprehension degradation, aligning with our design assumptions, but that translation quality has a different layer\-sensitivity profile\.

![Refer to caption](https://arxiv.org/html/2606.00284v1/x4.png)

(a) Held-in Belebele accuracy

![Refer to caption](https://arxiv.org/html/2606.00284v1/x5.png)

(b) Held-in FLORES ChrF by translation direction

Figure 3: Layer interpolation between the base model and Dense CPT. All non-interpolated layers are kept at their Dense CPT values. Panel (a) reports held-in Belebele accuracy; panel (b) reports held-in FLORES ChrF separately for en$\to$xx and xx$\to$en directions.

### 4.1 Causal Analysis of Layer Drift

First, we analyze whether middle-layer drift merely correlates with forgetting or causally contributes to the loss of downstream performance. Specifically, we perform a targeted interpolation analysis on the unregularized Dense CPT model. For a layer group $G$, we replace only that group’s parameters with $\theta_{G}(\alpha) = \theta_{G}^{0} + \alpha(\theta_{G}^{\text{Dense}} - \theta_{G}^{0})$, while keeping all other layers fixed to the Dense CPT checkpoint. Thus $\alpha=0$ reverts the selected layer group to the base model and $\alpha=1$ recovers the original Dense CPT parameters for that group. We sweep $\alpha \in \{0, 0.25, 0.5, 0.75, 1\}$ over the first (9), middle (19), and last (6) layer groups and evaluate held-in language Belebele accuracy and ChrF for FLORES.
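
The interpolation can be sketched on toy checkpoints as follows. The formula, layer groups, and sweep values are those described above; the dict-of-lists checkpoint format and the `layers.<i>.` parameter naming are hypothetical stand-ins for a real model state.

```python
import re

NUM_LAYERS, FIRST_M, LAST_N = 34, 9, 6


def layer_group(name):
    """Map a (hypothetical) parameter name to 'first', 'middle', 'last', or None."""
    match = re.search(r"layers\.(\d+)\.", name)
    if match is None:
        return None
    idx = int(match.group(1))
    if idx < FIRST_M:
        return "first"
    return "middle" if idx < NUM_LAYERS - LAST_N else "last"


def interpolate_group(base, dense, group, alpha):
    """Apply theta_G(alpha) = theta_G^0 + alpha * (theta_G^Dense - theta_G^0) to one group."""
    out = {}
    for name, dense_vals in dense.items():
        if layer_group(name) == group:
            out[name] = [w0 + alpha * (w - w0) for w0, w in zip(base[name], dense_vals)]
        else:
            out[name] = list(dense_vals)  # all other layers stay at Dense CPT
    return out


base = {"layers.0.w": [0.0], "layers.15.w": [0.0], "layers.33.w": [0.0]}
dense = {name: [2.0] for name in base}
sweep = {a: interpolate_group(base, dense, "middle", a) for a in (0, 0.25, 0.5, 0.75, 1)}
```

Each entry of `sweep` is a full checkpoint that could then be evaluated; only the selected group moves along the line between base and Dense CPT weights.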

Figure [3](https://arxiv.org/html/2606.00284#S4.F3) shows that Belebele degradation after CPT is primarily driven by middle\-layer drift: restoring Dense CPT drift in the middle layers produces a monotonic 7\.69\-point drop, more than twice the first\-layer effect and an order of magnitude larger than the last\-layer control\. This holds even though the first layers undergo greater absolute parameter change than the middle layers, indicating that the location of drift matters more than its magnitude for comprehension\.

The FLORES sweep shows a different pattern\. Under the same post\-truncation scoring protocol used in our main FLORES tables, first\- and middle\-layer interpolation have little effect on ChrF in either translation direction, whereas restoring drift in the final layer group substantially improves translation quality\. This mismatch suggests that the layer locations most responsible for comprehension forgetting are not necessarily the same locations that control translation behavior\. More broadly, the result argues for task\-specific validation of layer ranges rather than assuming that middle\-layer preservation is universally optimal\. Per\-family trends are reported in Appendix [A\.3](https://arxiv.org/html/2606.00284#A1.SS3)\.

### 4\.2 Layer Design and Task\-Specific Trade\-offs

The causal interpolation result supports middle\-layer alignment as a useful design principle, but it does not indicate a clean universal partition of multilingual knowledge across layers\. Instead, our results suggest that both forgetting and layer\-wise parameter design are task\-dependent\. For comprehension and reasoning\-style tasks, middle\-layer preservation is clearly beneficial: Layer Freezing and Layer\-Range L2\-SP best preserve Belebele and Global\-PIQA performance, and the interpolation sweep shows that middle\-layer drift is the largest contributor to Belebele degradation\.

The pattern differs for perplexity and generative tasks, such as FLORES\. Dense CPT achieves the lowest perplexity and the second\-best translation performance on average, despite exhibiting the worst downstream knowledge retention\. In the FLORES interpolation sweep, first\- and middle\-layer drift have little effect after post\-truncation scoring, while restoring final\-layer drift substantially improves ChrF\. This suggests that translation quality in our setup depends more on output behavior and generation compatibility than on the middle\-layer drift that drives Belebele forgetting\.

These insights can inform future design choices when adapting multilingual experts for a specific task or downstream setting\. Hard constraints \(Layer Freezing\) best preserve comprehension and reasoning, while softer constraints \(Layer\-Range L2\-SP\) better balance language acquisition against forgetting\. Post\-hoc reversion can correct a trained model at certain layers without retraining, but the layer range should be selected with the target evaluation behavior in mind, as our FLORES interpolation experiments show\. In sum, the optimal constraint type and location remain task\-dependent, leaving room for future work to tune layer ranges and regularization strengths for specific objectives\.

## 5 Related Work

Multilingual Pretraining\. Scaling multilingual language models through dense pretraining has been approached via architectural changes \(Goyal et al\., [2021](https://arxiv.org/html/2606.00284#bib.bib19)\), cross\-lingual objectives \(Conneau and Lample, [2019](https://arxiv.org/html/2606.00284#bib.bib20); Chi et al\., [2022](https://arxiv.org/html/2606.00284#bib.bib21)\), and multilingual data curation \(Le Scao et al\., [2023](https://arxiv.org/html/2606.00284#bib.bib24); Fujii et al\., [2024](https://arxiv.org/html/2606.00284#bib.bib47); Zosa et al\., [2025](https://arxiv.org/html/2606.00284#bib.bib48)\)\. However, dense multilingual models are fundamentally constrained by the curse of multilinguality \(Conneau et al\., [2020](https://arxiv.org/html/2606.00284#bib.bib10)\): a fixed parameter budget forces trade\-offs between language coverage and per\-language quality\.

A complementary line of work targets specific language groups: Chronopoulou et al\. \([2023](https://arxiv.org/html/2606.00284#bib.bib35)\) show that organizing training around language families reduces cross\-language interference, and family\-targeted pretraining improves low\-resource generalization \(Ogueji et al\., [2021](https://arxiv.org/html/2606.00284#bib.bib45); Ogunremi et al\., [2023](https://arxiv.org/html/2606.00284#bib.bib23); Downey et al\., [2024](https://arxiv.org/html/2606.00284#bib.bib44)\)\. Our work adopts language families as the natural grouping for expert training, combining targeted data curation with embarrassingly parallel training\.

Expert Language Modeling\. Branch\-Train\-Merge \(Li et al\., [2022](https://arxiv.org/html/2606.00284#bib.bib16)\) introduces parallel expert training: independent models are fine\-tuned from a shared initialization and combined at inference time, eliminating synchronization overhead\. x\-ELM \(Blevins et al\., [2024](https://arxiv.org/html/2606.00284#bib.bib15)\) applies this paradigm to the multilingual setting, training bilingual experts that can be added on demand\. Crucially, x\-ELM sidesteps catastrophic forgetting by never modifying existing experts, but does not investigate strategies to mitigate forgetting within each expert during training\.

Multilingual Catastrophic Forgetting\. Catastrophic forgetting \(McCloskey and Cohen, [1989](https://arxiv.org/html/2606.00284#bib.bib25); Kirkpatrick et al\., [2017](https://arxiv.org/html/2606.00284#bib.bib27)\) is a central challenge when adapting pretrained models to new languages\. Khelli et al\. \([2025](https://arxiv.org/html/2606.00284#bib.bib29)\) find that partial parameter sharing can mitigate forgetting in multilingual CPT, while Owodunni and Kumar \([2025](https://arxiv.org/html/2606.00284#bib.bib28)\) study layer\-selective fine\-tuning but find no clear advantage of parameter\-efficient methods over full fine\-tuning\. However, these analyses focus exclusively on dense models\. Our work addresses this gap by studying forgetting in language\-family experts and proposing parameter alignment strategies for the distributed expert setting\.

## 6 Conclusion

In this work, we investigate whether layer\-aware parameter alignment mitigates catastrophic forgetting when specializing multilingual models into language\-family experts with CPT\. We evaluate five alignment strategies and two unregularized baselines across five typologically diverse families, 32 training languages, and held\-out relatives on perplexity and three downstream benchmarks\. Our experiments reveal that the acquisition–forgetting frontier is fundamentally strategy\- and task\-dependent, with no single strategy dominating across all evaluation axes\. Causal analysis of layer\-wise parameter changes supports these results: middle\-layer drift is the primary driver of comprehension degradation, while FLORES translation follows a different layer\-sensitivity profile that depends more on final\-layer drift\. Taken together, these findings indicate that CPT strategy selection should be driven by the target setting \(such as translation\-heavy, comprehension\-critical, balanced, or broad\-coverage\) rather than by a single aggregate metric\.

## Limitations

All experiments use a single 4B\-parameter model \(Gemma\-3 4B\) with a fixed budget of 5B tokens per family from one web corpus \(MADLAD\-400\); we do not evaluate whether strategy rankings transfer to other model scales, architectures, or data regimes\. The individual strategies are not themselves novel and each builds on established techniques, so our contribution is the systematic comparison under a unified protocol and the practical guidelines that emerge, rather than new forgetting\-mitigation methods\. Our guidelines \(§[4\.2](https://arxiv.org/html/2606.00284#S4.SS2)\) are derived from post\-hoc empirical comparison; we do not provide a principled method for automatically selecting a strategy given a target language set and task distribution\. Finally, as shown in §[3\.4](https://arxiv.org/html/2606.00284#S3.SS4), perplexity is an incomplete proxy for language acquisition: strategies with similar held\-in perplexity diverge sharply on downstream translation and comprehension benchmarks, highlighting the need for cross\-lingual evaluation metrics earlier in the pipeline\.

## Acknowledgments

We would like to thank Eugene Jang for feedback on the initial project idea and for detailed and helpful comments on our draft\. We would also like to thank Sanjana Londhe, who helped us design Figure [1](https://arxiv.org/html/2606.00284#S2.F1) for our draft\.

This work used H200 GPUs at NCSA DeltaAI through allocation CIS251341 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support \(ACCESS\) program, which is supported by U\.S\. National Science Foundation grants \#2138259, \#2138286, \#2138307, \#2137603, and \#2138296 \(Boerner et al\., [2023](https://arxiv.org/html/2606.00284#bib.bib49)\)\.

## References

- L\. Bandarkar, D\. Liang, B\. Muller, M\. Artetxe, S\. N\. Shukla, D\. Husa, N\. Goyal, A\. Krishnan, L\. Zettlemoyer, and M\. Khabsa \(2024\)The belebele benchmark: a parallel reading comprehension dataset in 122 language variants\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 749–775\.External Links:[Link](https://aclanthology.org/2024.acl-long.44/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.44)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p5.2)\.
- L\. Bandarkar, B\. Muller, P\. Yuvraj, R\. Hou, N\. Singhal, H\. Lv, and B\. Liu \(2025\)Layer swapping for zero\-shot cross\-lingual transfer in large language models\.External Links:2410\.01335,[Link](https://arxiv.org/abs/2410.01335)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p3.3)\.
- L\. Bandarkar and N\. Peng \(2025\)The unreasonable effectiveness of model merging for cross\-lingual transfer in LLMs\.InProceedings of the 5th Workshop on Multilingual Representation Learning \(MRL 2025\),D\. I\. Adelani, C\. Arnett, D\. Ataman, T\. A\. Chang, H\. Gonen, R\. Raja, F\. Schmidt, D\. Stap, and J\. Wang \(Eds\.\),Suzhou, China,pp\. 131–148\.External Links:[Link](https://aclanthology.org/2025.mrl-main.10/),[Document](https://dx.doi.org/10.18653/v1/2025.mrl-main.10),ISBN 979\-8\-89176\-345\-6Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p3.3)\.
- T\. Blevins, T\. Limisiewicz, S\. Gururangan, M\. Li, H\. Gonen, N\. A\. Smith, and L\. Zettlemoyer \(2024\)Breaking the curse of multilinguality with cross\-lingual expert language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 10822–10837\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.604/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.604)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.00284#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.00284#S2.SS3.p3.1),[§5](https://arxiv.org/html/2606.00284#S5.p3.1)\.
- T\. J\. Boerner, S\. Deems, T\. R\. Furlani, S\. L\. Knuth, and J\. Towns \(2023\)ACCESS: advancing innovation: nsf’s advanced cyberinfrastructure coordination ecosystem: services & support\.InPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good,PEARC ’23,New York, NY, USA,pp\. 173–176\.External Links:ISBN 9781450399852,[Link](https://doi.org/10.1145/3569951.3597559),[Document](https://dx.doi.org/10.1145/3569951.3597559)Cited by:[Acknowledgments](https://arxiv.org/html/2606.00284#Sx2.p2.1)\.
- T\. A\. Chang, C\. Arnett, A\. Eldesokey, A\. Sadallah, A\. Kashar, A\. Daud, A\. G\. Olanihun, A\. L\. Mohammed, A\. Praise, A\. M\. Sharma, A\. Gupta,et al\.\(2025\)Global piqa: evaluating physical commonsense reasoning across 100\+ languages and cultures\.External Links:2510\.24081,[Link](https://arxiv.org/abs/2510.24081)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p5.2)\.
- Z\. Chi, S\. Huang, L\. Dong, S\. Ma, B\. Zheng, S\. Singhal, P\. Bajaj, X\. Song, X\. Mao, H\. Huang, and F\. Wei \(2022\)XLM\-E: cross\-lingual language model pre\-training via ELECTRA\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 6170–6182\.External Links:[Link](https://aclanthology.org/2022.acl-long.427/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.427)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- A\. Chronopoulou, D\. Stojanovski, and A\. Fraser \(2023\)Language\-family adapters for low\-resource multilingual neural machine translation\.InProceedings of the Sixth Workshop on Technologies for Machine Translation of Low\-Resource Languages \(LoResMT 2023\),A\. Kr\. Ojha, C\. Liu, E\. Vylomova, F\. Pirinen, J\. Abbott, J\. Washington, N\. Oco, V\. Malykh, V\. Logacheva, and X\. Zhao \(Eds\.\),Dubrovnik, Croatia,pp\. 59–72\.External Links:[Link](https://aclanthology.org/2023.loresmt-1.5/),[Document](https://dx.doi.org/10.18653/v1/2023.loresmt-1.5)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.00284#S2.SS1.p2.1),[§5](https://arxiv.org/html/2606.00284#S5.p2.1)\.
- A\. Conneau, K\. Khandelwal, N\. Goyal, V\. Chaudhary, G\. Wenzek, F\. Guzmán, E\. Grave, M\. Ott, L\. Zettlemoyer, and V\. Stoyanov \(2020\)Unsupervised cross\-lingual representation learning at scale\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 8440–8451\.External Links:[Link](https://aclanthology.org/2020.acl-main.747/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.747)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p1.1),[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- A\. Conneau and G\. Lample \(2019\)Cross\-lingual language model pretraining\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- L\. Dou, Q\. Liu, G\. Zeng, J\. Guo, J\. Zhou, W\. Lu, and M\. Lin \(2024\)Sailor: open language models for south\-east asia\.External Links:2404\.03608,[Link](https://arxiv.org/abs/2404.03608)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p1.1)\.
- C\. M\. Downey, T\. Blevins, D\. Serai, D\. Parikh, and S\. Steinert\-Threlkeld \(2024\)Targeted multilingual adaptation for low\-resource language families\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 15647–15663\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.918/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.918)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§5](https://arxiv.org/html/2606.00284#S5.p2.1)\.
- K\. Fujii, T\. Nakamura, M\. Loem, H\. Iida, M\. Ohi, K\. Hattori, H\. Shota, S\. Mizuki, R\. Yokota, and N\. Okazaki \(2024\)Continual pre\-training for cross\-lingual llm adaptation: enhancing japanese language capabilities\.External Links:2404\.17790,[Link](https://arxiv.org/abs/2404.17790)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p5.2)\.
- N\. Goyal, J\. Du, M\. Ott, G\. Anantharaman, and A\. Conneau \(2021\)Larger\-scale transformers for multilingual masked language modeling\.InProceedings of the 6th Workshop on Representation Learning for NLP \(RepL4NLP\-2021\),A\. Rogers, I\. Calixto, I\. Vulić, N\. Saphra, N\. Kassner, O\. Camburu, T\. Bansal, and V\. Shwartz \(Eds\.\),Online,pp\. 29–33\.External Links:[Link](https://aclanthology.org/2021.repl4nlp-1.4/),[Document](https://dx.doi.org/10.18653/v1/2021.repl4nlp-1.4)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- M\. Khelli, S\. Cahyawijaya, A\. Purwarianti, and G\. I\. Winata \(2025\)What causes knowledge loss in multilingual language models?\.InProceedings of the Fourth Workshop on NLP Applications to Field Linguistics,É\. Le Ferrand, E\. Klyachko, A\. Postnikova, T\. Shavrina, O\. Serikov, E\. Voloshina, and E\. Vylomova \(Eds\.\),Vienna, Austria,pp\. 15–25\.External Links:[Link](https://aclanthology.org/2025.fieldmatters-1.2/),ISBN 979\-8\-89176\-282\-4Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§5](https://arxiv.org/html/2606.00284#S5.p4.1)\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska,et al\.\(2017\)Overcoming catastrophic forgetting in neural networks\.Proceedings of the national academy of sciences114\(13\),pp\. 3521–3526\.Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p1.1),[§5](https://arxiv.org/html/2606.00284#S5.p4.1)\.
- S\. Kudugunta, I\. Caswell, B\. Zhang, X\. Garcia, D\. Xin, A\. Kusupati, R\. Stella, A\. Bapna, and O\. Firat \(2023\)Madlad\-400: a multilingual and document\-level large audited dataset\.Advances in Neural Information Processing Systems36,pp\. 67284–67296\.Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p1.1)\.
- S\. Kumar, H\. Marklund, and B\. V\. Roy \(2024\)Maintaining plasticity in continual learning via regenerative regularization\.External Links:2308\.11958,[Link](https://arxiv.org/abs/2308.11958)Cited by:[§2\.2](https://arxiv.org/html/2606.00284#S2.SS2.p3.5)\.
- T\. Le Scao, A\. Fan, C\. Akiki, E\. Pavlick, S\. Ilić, D\. Hesslow, R\. Castagné,et al\.\(2023\)BLOOM: a 176b\-parameter open\-access multilingual language model\.External Links:2211\.05100,[Link](https://arxiv.org/abs/2211.05100)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.
- M\. Li, S\. Gururangan, T\. Dettmers, M\. Lewis, T\. Althoff, N\. A\. Smith, and L\. Zettlemoyer \(2022\)Branch\-train\-merge: embarrassingly parallel training of expert language models\.External Links:2208\.03306,[Link](https://arxiv.org/abs/2208.03306)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p3.1)\.
- X\. Li, Y\. Grandvalet, and F\. Davoine \(2018\)Explicit inductive bias for transfer learning with convolutional networks\.InProceedings of the 35th International Conference on Machine Learning \(ICML\),pp\. 2830–2839\.Cited by:[§2\.2](https://arxiv.org/html/2606.00284#S2.SS2.p3.5)\.
- M\. McCloskey and N\. J\. Cohen \(1989\)Catastrophic interference in connectionist networks: the sequential learning problem\.G\. H\. Bower \(Ed\.\),Psychology of Learning and Motivation, Vol\.24,pp\. 109–165\.External Links:ISSN 0079\-7421,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0079-7421%2808%2960536-8),[Link](https://www.sciencedirect.com/science/article/pii/S0079742108605368)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p1.1),[§5](https://arxiv.org/html/2606.00284#S5.p4.1)\.
- K\. Ogueji, Y\. Zhu, and J\. Lin \(2021\)Small data? no problem\! exploring the viability of pretrained multilingual language models for low\-resourced languages\.InProceedings of the 1st Workshop on Multilingual Representation Learning,D\. Ataman, A\. Birch, A\. Conneau, O\. Firat, S\. Ruder, and G\. G\. Sahin \(Eds\.\),Punta Cana, Dominican Republic,pp\. 116–126\.External Links:[Link](https://aclanthology.org/2021.mrl-1.11/),[Document](https://dx.doi.org/10.18653/v1/2021.mrl-1.11)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§5](https://arxiv.org/html/2606.00284#S5.p2.1)\.
- T\. Ogunremi, D\. Jurafsky, and C\. D\. Manning \(2023\)Mini but mighty: efficient multilingual pretraining with linguistically\-informed data selection\.InFindings of the Association for Computational Linguistics: EACL 2023,A\. Vlachos and I\. Augenstein \(Eds\.\),Dubrovnik, Croatia,pp\. 1251–1266\.External Links:[Link](https://aclanthology.org/2023.findings-eacl.93/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-eacl.93)Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p2.1)\.
- A\. T\. Owodunni and S\. Kumar \(2025\)Continually adding new languages to multilingual language models\.External Links:2509\.11414,[Link](https://arxiv.org/abs/2509.11414)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p2.1),[§5](https://arxiv.org/html/2606.00284#S5.p4.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin,et al\.\(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§A\.5](https://arxiv.org/html/2606.00284#A1.SS5.p1.2),[§1](https://arxiv.org/html/2606.00284#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p2.1),[§3\.4](https://arxiv.org/html/2606.00284#S3.SS4.p1.1)\.
- N\. Team, M\. R\. Costa\-jussà, J\. Cross, O\. Çelebi, M\. Elbayad, K\. Heafield, K\. Heffernan, E\. Kalbassi, J\. Lam, D\. Licht, J\. Maillard, A\. Sun, S\. Wang, G\. Wenzek, A\. Youngblood, B\. Akula, L\. Barrault, G\. M\. Gonzalez, P\. Hansanti, J\. Hoffman, S\. Jarrett, K\. R\. Sadagopan, D\. Rowe, S\. Spruit, C\. Tran, P\. Andrews, N\. F\. Ayan, S\. Bhosale, S\. Edunov, A\. Fan, C\. Gao, V\. Goswami, F\. Guzmán, P\. Koehn, A\. Mourachko, C\. Ropers, S\. Saleem, H\. Schwenk, and J\. Wang \(2022\)No language left behind: scaling human\-centered machine translation\.External Links:2207\.04672,[Link](https://arxiv.org/abs/2207.04672)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p5.2)\.
- C\. Wendler, V\. Veselovsky, G\. Monea, and R\. West \(2024\)Do llamas work in English? on the latent language of multilingual transformers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15366–15394\.External Links:[Link](https://aclanthology.org/2024.acl-long.820/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.00284#S2.SS2.p1.2),[§3\.1](https://arxiv.org/html/2606.00284#S3.SS1.p3.3)\.
- M\. Wortsman, G\. Ilharco, S\. Y\. Gadre, R\. Roelofs, R\. Gontijo\-Lopes, A\. S\. Morcos, H\. Namkoong, A\. Farhadi, Y\. Carmon, S\. Kornblith, and L\. Schmidt \(2022\)Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 23965–23998\.External Links:[Link](https://proceedings.mlr.press/v162/wortsman22a.html)Cited by:[§2\.2](https://arxiv.org/html/2606.00284#S2.SS2.p4.3)\.
- Y\. Zhao, C\. Liu, Y\. Deng, J\. Ying, M\. Aljunied, Z\. Li, L\. Bing, H\. P\. Chan, Y\. Rong, D\. Zhao, and W\. Zhang \(2025\)Babel: open multilingual large language models serving over 90% of global speakers\.External Links:2503\.00865,[Link](https://arxiv.org/abs/2503.00865)Cited by:[§1](https://arxiv.org/html/2606.00284#S1.p1.1)\.
- E\. Zosa, J\. Luoma, K\. Hakala, A\. Virtanen, M\. Koistinen, and J\. Burdge \(2025\)Continued Pretraining: A Practical Playbook for Language\-Specific LLM Adaptation — rocm\.blogs\.amd\.com\.Note:[https://rocm\.blogs\.amd\.com/artificial\-intelligence/multilingual\-continued\-pretraining/README\.html](https://rocm.blogs.amd.com/artificial-intelligence/multilingual-continued-pretraining/README.html)\[Accessed 29\-03\-2026\]Cited by:[§5](https://arxiv.org/html/2606.00284#S5.p1.1)\.

## Appendix A

### A\.1 Held\-Out Evaluation Languages

Table [6](https://arxiv.org/html/2606.00284#A1.T6) lists all held\-out languages evaluated in §[3\.5](https://arxiv.org/html/2606.00284#S3.SS5), grouped by family, together with their benchmark coverage\. Languages are excluded from the CPT training set but belong to the same family as the training languages, allowing us to probe within\-family generalization\. Not all languages are available across every benchmark: Latvian \(lv\) and Odia \(or\) lack Global\-PIQA coverage, and the held\-out Austronesian languages \(Ilocano, Malagasy, Māori, Sundanese, Waray\) are not represented in Global\-PIQA\. German \(de\) is excluded from the Germanic held\-out results because it was absent from the current held\-out table coverage\.

### A\.2 Training Hyperparameters and Fairness Controls

Tables [7](https://arxiv.org/html/2606.00284#A1.T7) and [8](https://arxiv.org/html/2606.00284#A1.T8) summarize the optimization settings used for the Gemma\-3 4B experiments\.

Layer Freezing learning rate\. Although Layer Freezing updates only ${\sim}44\%$ of parameters, we intentionally keep the learning rate unchanged across all per\-family strategies\. With fewer trainable parameters, the per\-parameter gradient signal is more concentrated, partially compensating for the reduced capacity\. We verified that validation loss converges before the patience window expires across all families, indicating the schedule does not under\-train this strategy\.

Layer\-Range L2\-SP $\lambda$ selection\. We selected $\lambda$ values by sweeping over a small grid on one family’s validation perplexity and held the chosen values fixed across all five families\. While a full ablation is infeasible given our compute budget, the consistent performance of Layer\-Range L2\-SP across all families and tasks suggests the method is reasonably robust to this hyperparameter choice\.
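To make the role of $\lambda$ concrete, the sketch below computes a layer\-restricted penalty under the assumption that Layer\-Range L2\-SP takes the standard L2\-SP form, $\frac{\lambda}{2}\sum_{\ell\in R}\lVert\theta_{\ell}-\theta_{\ell}^{0}\rVert^{2}$, applied only to a layer range $R$\. The exact loss used in the paper is defined outside this section, so the form, the function name, the `(layer, name)` keying, and the toy values here are assumptions for illustration only\.

```python
def layer_range_l2sp_penalty(params, base_params, layer_range, lam):
    """Return (lam / 2) * squared distance to the base model over `layer_range`.

    `params` and `base_params` map (layer_index, parameter_name) to lists of floats.
    Layers outside `layer_range` contribute nothing to the penalty.
    """
    total = 0.0
    for (layer, name), values in params.items():
        if layer in layer_range:
            anchor = base_params[(layer, name)]
            total += sum((v - a) ** 2 for v, a in zip(values, anchor))
    return 0.5 * lam * total


# Toy example: only layer 10 falls inside the assumed middle range 9-27.
base_params = {(0, "w"): [0.0, 0.0], (10, "w"): [1.0, 2.0]}
params = {(0, "w"): [5.0, 5.0], (10, "w"): [2.0, 4.0]}

penalty = layer_range_l2sp_penalty(params, base_params, range(9, 28), lam=0.1)
```

In training, a term of this kind would be added to the language\-modeling loss, so a larger $\lambda$ pulls the selected layers more strongly toward the base model\.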

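| Family | Code | Language | PPL | Belebele | PIQA | FLORES |
|---|---|---|---|---|---|---|
| Slavic | bg | Bulgarian | ✓ | ✓ | ✓ | ✓ |
| | cs | Czech | ✓ | ✓ | ✓ | ✓ |
| | lt | Lithuanian | ✓ | ✓ | ✓ | ✓ |
| | pl | Polish | ✓ | ✓ | ✓ | ✓ |
| | sl | Slovenian | ✓ | ✓ | ✓ | ✓ |
| | lv | Latvian | ✓ | ✓ | — | ✓ |
| Germanic | is | Icelandic | ✓ | ✓ | ✓ | ✓ |
| | no | Norwegian | ✓ | ✓ | ✓ | ✓ |
| | sv | Swedish | ✓ | ✓ | ✓ | ✓ |
| Indic | as | Assamese | ✓ | ✓ | ✓ | ✓ |
| | gu | Gujarati | ✓ | ✓ | ✓ | ✓ |
| | or | Odia | ✓ | ✓ | — | ✓ |
| | pa | Punjabi | ✓ | ✓ | ✓ | ✓ |
| | sd | Sindhi | ✓ | ✓ | ✓ | ✓ |
| | si | Sinhala | ✓ | ✓ | ✓ | ✓ |
| | ur | Urdu | ✓ | ✓ | ✓ | ✓ |
| Austronesian | ilo | Ilocano | ✓ | ✓ | — | ✓ |
| | mi | Māori | ✓ | ✓ | — | ✓ |
| | su | Sundanese | ✓ | ✓ | — | ✓ |
| | war | Waray | ✓ | ✓ | — | ✓ |
| | mg | Malagasy | ✓ | ✓ | — | ✓ |
| Romance | ca | Catalan | ✓ | ✓ | ✓ | ✓ |

Table 6: Held\-out evaluation languages per benchmark\. ✓ = evaluated; — = not available in that benchmark’s task suite\. PPL = held\-out perplexity; FLORES results cover both en$\to$xx and xx$\to$en directions\.

Table 7: Training configuration for Dense CPT on Gemma\-3 4B\.

Table 8: Training configuration for family\-specific expert variants on Gemma\-3 4B\.

### A\.3 Causal Layer Interpolation by Family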

Figures [4](https://arxiv.org/html/2606.00284#A1.F4) and [5](https://arxiv.org/html/2606.00284#A1.F5) expand the causal interpolation analysis from §[4\.1](https://arxiv.org/html/2606.00284#S4.SS1) by reporting held\-in Belebele accuracy and FLORES ChrF separately for each language family\. For Belebele, the middle\-layer curve degrades monotonically across all five families, while first\-layer interpolation has a smaller effect and last\-layer interpolation is nearly flat\. FLORES shows a different task profile: middle\-layer interpolation has only small family\-level effects after post\-truncation scoring, while restoring final\-layer CPT drift improves ChrF for every family, most strongly for Austronesian\.

![Refer to caption](https://arxiv.org/html/2606.00284v1/x6.png)

Figure 4: Held\-in Belebele accuracy under first\-, middle\-, and last\-layer interpolation, broken down by language family\.

![Refer to caption](https://arxiv.org/html/2606.00284v1/x7.png)

Figure 5: Held\-in FLORES ChrF under first\-, middle\-, and last\-layer interpolation, broken down by language family\.

### A\.4 Held\-Out Language Perplexity

Table [9](https://arxiv.org/html/2606.00284#A1.T9) reports perplexity on the held\-out \(unseen\) languages for each family, complementing the training\-language perplexity in Table [2](https://arxiv.org/html/2606.00284#S3.T2)\. Results show that Dense CPT substantially *increases* perplexity on unseen relatives across all families, while Layer\-Range L2\-SP and Expert Soup are the only strategies that consistently match or improve upon the base model\.

Table 9: Perplexity $\downarrow$ on held\-out \(unseen\) languages, averaged over each family’s held\-out relatives \(see Table [6](https://arxiv.org/html/2606.00284#A1.T6) for the full language list\)\. German \(de\) excluded from Germanic\. These languages were withheld from CPT training entirely; results probe within\-family generalization\. For per\-family strategies, the Expert column reports the model trained on that row’s family evaluated on its own held\-out relatives\. Bold = best per row; underline = within 0\.2 of best\.

### A\.5 FLORES\-200 Evaluation Protocol

As described in §[3\.1](https://arxiv.org/html/2606.00284#S3.SS1), we evaluate FLORES\-200 with lm\-eval\-harness in a 2\-shot setting and report corpus ChrF for both en$\to$xx and xx$\to$en directions\. Our initial goal was to match the Gemma\-3 technical report’s FLORES setup as closely as possible for the shared base model \(Team et al\., [2025](https://arxiv.org/html/2606.00284#bib.bib18)\)\. However, exact replication was not possible with the default lm\-eval\-harness task configuration, since newline stopping could terminate some base\-model generations before a translation was produced\.

We therefore use a uniform post\-truncation protocol for all checkpoints\. After generation, we strip leading whitespace, keep only the text before the first generated newline or literal \\n, and score the remaining span against the reference with ChrF\. This keeps decoding and scoring comparable across the base model, Dense CPT, family experts, reverted checkpoints, Layer Freezing, Layer\-Range L2\-SP, and Expert Soup\. The resulting FLORES numbers should therefore be read as a controlled comparison under a shared evaluation pipeline, rather than as a direct reproduction of the Gemma\-3 technical report score\.
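The truncation step of this protocol can be sketched as follows\. The function name and example strings are hypothetical, and the ChrF scoring step is deliberately left out; this is a reading of the description above, not the authors' evaluation code\.

```python
def post_truncate(generation: str) -> str:
    """Strip leading whitespace, then keep the text before the first newline
    character or the first literal backslash-n sequence, whichever comes first."""
    text = generation.lstrip()
    # str.find returns -1 when the marker is absent, so filter those out.
    cut_points = [pos for pos in (text.find("\n"), text.find("\\n")) if pos != -1]
    return text[: min(cut_points)] if cut_points else text


# Hypothetical generations: a leading newline, a literal "\n", and a clean output.
examples = [
    "\n  Bonjour le monde\nEnglish: next prompt",
    "Hola mundo\\nEnglish: extra",
    "Hallo Welt",
]
truncated = [post_truncate(g) for g in examples]
```

Stripping leading whitespace first matters here because a generation that begins with a newline would otherwise be truncated to an empty string, which is the failure mode that motivated the protocol\.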

### A\.6 Additional Model Soup Results

Table [10](https://arxiv.org/html/2606.00284#A1.T10) summarizes an additional freeze\-best soup, constructed by uniformly averaging the freeze models, the strongest ablation family in our main downstream results\. Across Belebele and Global\-PIQA, this soup changes performance only minimally relative to the best existing method in the main paper\.
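Uniform averaging of this kind can be sketched in a few lines\. The sketch assumes every checkpoint is a dictionary with identical parameter names and shapes, represented here as lists of floats; the function name and toy values are hypothetical\.

```python
def uniform_soup(checkpoints):
    """Average parameters element-wise across checkpoints with matching names and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Toy stand-ins for three per-family Layer Freezing experts.
freeze_experts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [3.0]},
    {"w": [5.0, 6.0], "b": [6.0]},
]
soup = uniform_soup(freeze_experts)
```

Because the average is taken once over stored weights, the resulting soup is a single model with the same inference cost as any individual expert\.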

Table 10: Average accuracy of the freeze\-best soup on Belebele and Global\-PIQA\. $\Delta$ is relative to the best existing method in the main paper tables for the same benchmark and split\.

### A\.7 Per\-Language Results Tables

Tables [11](https://arxiv.org/html/2606.00284#A1.T11)–[15](https://arxiv.org/html/2606.00284#A1.T15) report per\-language results for perplexity and all downstream benchmarks\. For each language, the Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns report the model trained on that language’s family \(e\.g\., the Slavic expert for Croatian and Russian\)\. All other strategy columns \(Dense, D\.\-Rev\., Soup\) are single models evaluated across all languages\.

Table 11: Per\-language perplexity ↓ on held\-out validation text\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within 0\.2 of best\.

Table 12: Per\-language Belebele accuracy ↑ \(2\-shot\)\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within 0\.5 pp of best\.

Table 13: Per\-language Global\-PIQA accuracy ↑ \(2\-shot\)\. Expert, E\.\-Rev\., Soup, Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within 0\.5 pp of best\.

Table 14: Per\-language FLORES\-200 ChrF ↑ \(xx→en, 2\-shot\)\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within 1 ChrF point of best\.

Table 15: Per\-language FLORES\-200 ChrF ↑ \(en→xx, 2\-shot\)\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within 1 ChrF point of best\.

### A\.8 Per\-Language Held\-Out Results Tables

Tables [16](https://arxiv.org/html/2606.00284#A1.T16)–[20](https://arxiv.org/html/2606.00284#A1.T20) report per\-language results on the *held\-out* \(unseen\) languages for each family, complementing the family\-averaged perplexity in Table [9](https://arxiv.org/html/2606.00284#A1.T9) and the summary in Figure [2](https://arxiv.org/html/2606.00284#S3.F2)\. For each held\-out language, the Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns report the model trained on that language’s family; Dense, D\.\-Rev\., and Soup are single global models\. German \(de\) is excluded from Germanic throughout; Austronesian languages are absent from Global\-PIQA and Odia/Latvian are absent from PIQA individually \(shown as “—”\)\.

Table 16: Per\-language held\-out perplexity ↓\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within threshold of best\. German \(de\) excluded from Germanic\.

Table 17: Per\-language held\-out Belebele accuracy ↑\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within threshold of best\. German \(de\) excluded from Germanic\.

Table 18: Per\-language held\-out Global\-PIQA accuracy ↑\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within threshold of best\. German \(de\) excluded from Germanic\.

Table 19: Per\-language held\-out FLORES\-200 ChrF \(xx→en\) ↑\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within threshold of best\. German \(de\) excluded from Germanic\.

Table 20: Per\-language held\-out FLORES\-200 ChrF \(en→xx\) ↑\. Expert, E\.\-Rev\., Freeze, and L\.\-Reg columns each report the model trained on that language’s family; all other columns are single models\. Bold = best per row; underline = within threshold of best\. German \(de\) excluded from Germanic\.
