Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

arXiv cs.CL 06/10/26, 04:00 AM Papers
Summary
This paper empirically compares several LoRA variants for multilingual instruction tuning and finds no significant advantage of complex variants over basic LoRA in balancing cross-lingual transfer and knowledge retention.
arXiv:2606.10428v1
# Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning
Source: [https://arxiv.org/html/2606.10428](https://arxiv.org/html/2606.10428)
Thamali Wijewardhana, Napoleon H\. Reyes, and Surangika Ranathunga

School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand

\{N\.Wiewardhana, N\.H\.Reyes, S\.Ranathunga\}@massey\.ac\.nz

###### Abstract

We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning\. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross\-lingual transfer and knowledge retention\. An analysis of hidden embeddings reveals that layer\-wise language representation remains largely similar across LLMs fine\-tuned with different LoRA techniques, suggesting that architectural novelty of LoRA techniques may not translate into better cross\-lingual adaptation\.

## 1 Introduction

Low\-Rank Adaptation \(LoRA\) \(Hu et al\., [2022](https://arxiv.org/html/2606.10428#bib.bib18)\) is a prominent parameter\-efficient fine\-tuning \(PEFT\) technique that trains only a small set of low\-rank matrices while freezing the original LLM parameters\. Because it adapts LLMs to new tasks and languages with fewer resources, many novel LoRA variants are being introduced \(Yang et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib49)\)\. While they have been evaluated on English tasks, their efficacy is underexplored in the context of multilingual instruction tuning\.

With the introduction of multilingual LLMs, instruction tuning is being used to adapt them for tasks in new languages \(Garcia et al\., [2026](https://arxiv.org/html/2606.10428#bib.bib5)\), and multilingual instruction tuning has been shown to be better than its monolingual counterpart \(Shaham et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib30)\)\. In multilingual instruction tuning, effective adaptation requires optimising cross\-lingual transfer \(the process of transferring knowledge learned in one language to another language\) while preserving previously learned knowledge\. Whether recent LoRA variants are better suited to achieving both is an open question\.

Some studies have examined LoRA in cross\-lingual settings, focusing on adapting LLMs to low\-resource languages \(LRLs\) \(Khade et al\., [2025](https://arxiv.org/html/2606.10428#bib.bib21)\), factors impacting cross\-lingual transfer \(Khelli et al\., [2025](https://arxiv.org/html/2606.10428#bib.bib22)\), improving language learning and knowledge retention \(Owodunni and Kumar, [2025](https://arxiv.org/html/2606.10428#bib.bib23)\), and the role of data availability \(Whitehouse et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib24)\)\. Furthermore, Hassan and Chen \([2026](https://arxiv.org/html/2606.10428#bib.bib50)\) proposed a method for effectively merging LoRA adapters for cross\-lingual transfer\. However, these investigations have been limited to basic LoRA\. To the best of our knowledge, no study has comprehensively compared and analysed LoRA variants in multilingual instruction tuning, particularly for LRLs\. Moreover, while several recent studies have investigated the internal language representations of LLMs \(Zhao et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib48); Wendler et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib53); Gurgurov et al\., [2025](https://arxiv.org/html/2606.10428#bib.bib55); Tamo et al\., [2026](https://arxiv.org/html/2606.10428#bib.bib54)\), none have analysed the impact of LoRA\-based instruction tuning on these representations\.

We conduct an empirical evaluation of a selected set of LoRA variants for their ability to facilitate cross\-lingual transfer while preserving previously gained knowledge during multilingual instruction tuning\. We mainly focus on low\-resource settings, selecting five linguistically diverse target languages \(TLs\): Urdu \(Ur\), Swahili \(Sw\), Hindi \(Hi\), Bengali \(Bn\), and Telugu \(Te\) \(see Appendix [A](https://arxiv.org/html/2606.10428#A1) for language details\)\. Our experiments reveal that more complex LoRA variants do not have a significant advantage over basic LoRA, indicating that architectural modifications introduced in recent LoRA variants do not always translate into performance gains in multilingual instruction tuning\. Using a layer\-wise language representation analysis, we show that LoRA variants do not introduce noticeable language representation changes in the LLM\. We also find that, in contrast to LoRA\-based pre\-training \(Owodunni and Kumar, [2025](https://arxiv.org/html/2606.10428#bib.bib23)\), in instruction tuning LoRA should be applied to all the layers rather than just the final layers\. Compared to Imam et al\. \([2026](https://arxiv.org/html/2606.10428#bib.bib56)\), the closest work to ours, which focused only on African ASR, our work provides a broader investigation of language learning in LLMs: it covers a wider range of languages, model architectures, and TL data mixtures, and it analyses the internal representations and layer\-wise behavior of LLMs\.

## 2 Methodology

### 2\.1 LoRA Variants

Out of the numerous LoRA variants available \(Yang et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib49)\), we selected the following: Weight\-Decomposed Low\-Rank Adaptation \(DoRA\), Vector\-Based Random Matrix Adaptation \(VeRA\), Adaptive Low\-Rank Adaptation \(AdaLoRA\), and Principal Singular Values and Singular Vectors Adaptation \(PiSSA\)\. This selection was based on their level of acceptance by the research community \(judged by the publication venue, the number of citations for the relevant paper, and their results as reported by other research\), the availability of their open\-source implementations, and their computational feasibility within our available GPU resources\.

LoRA \(baseline\) optimises LLM training by approximating weight updates through low\-rank decomposition \(Hu et al\., [2022](https://arxiv.org/html/2606.10428#bib.bib18)\)\. DoRA decomposes pretrained weights into two distinct components \(magnitude and direction\) and fine\-tunes both \(Liu et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib14)\)\. VeRA employs one pair of frozen, randomly initialised matrices that are shared across all adapted layers, and inserts trainable scaling vectors for layer\-wise adaptation \(Kopiczko et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib25)\)\. AdaLoRA dynamically redistributes the parameter budget across the weight matrices based on their importance score \(Zhang et al\., [2023](https://arxiv.org/html/2606.10428#bib.bib15)\)\. PiSSA uses singular value decomposition to decompose the original weight matrix and trains only the principal components \(Meng et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib26)\)\. Appendix [B](https://arxiv.org/html/2606.10428#A2) has more details on these LoRA techniques\.
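For reference, the five parameterisations described above can be summarised as follows\. This is a compact sketch based on the formulations in the cited method papers, not something stated in this paper: the symbols \(frozen weight $W_0$, rank $r$, scaling $\alpha$\) are notational assumptions, and scaling and initialisation conventions differ between implementations\.

```latex
% W_0 \in \mathbb{R}^{d \times k} is the frozen pretrained weight; r \ll \min(d, k).
\begin{align*}
\text{LoRA:}\quad    & W = W_0 + \tfrac{\alpha}{r}\, B A,
  && B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k} \text{ trainable} \\
\text{DoRA:}\quad    & W = m \cdot \frac{W_0 + B A}{\lVert W_0 + B A \rVert_c},
  && m \in \mathbb{R}^{1 \times k},\ B,\ A \text{ trainable} \\
\text{VeRA:}\quad    & W = W_0 + \Lambda_b B \Lambda_d A,
  && \Lambda_b,\ \Lambda_d \text{ trainable diagonals; } B,\ A \text{ frozen and shared} \\
\text{AdaLoRA:}\quad & W = W_0 + P \Lambda Q,
  && \text{diagonal entries of } \Lambda \text{ pruned by importance score} \\
\text{PiSSA:}\quad   & W = W_{\mathrm{res}} + A B,
  && A B \text{ initialised from the top-} r \text{ singular triplets of } W_0
\end{align*}
% \lVert \cdot \rVert_c denotes the column-wise vector norm.
% In PiSSA the roles of A and B are swapped relative to LoRA (A is d x r, B is r x k).
```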

### 2\.2 Main Experiments

Given that multilingual instruction tuning is better than monolingual instruction tuning \(Shaham et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib30); we confirm the same, see Table [14](https://arxiv.org/html/2606.10428#A4.T14)\), we fine\-tune the LLM with a mixture of English and TL data, and measure the performance on both languages\. As the first step, we determine the optimal data mixture percentage between English and the TL to be used during instruction tuning\. Since Shaham et al\. \([2024](https://arxiv.org/html/2606.10428#bib.bib30)\) reported that during full\-parameter fine\-tuning, even minimal multilingual exposure can significantly facilitate cross\-lingual transfer, we evaluate four TL ratios for LoRA\-based fine\-tuning: 0%, 1%, 10%, and 50%\.
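To make the four TL ratios concrete, the following is a minimal sketch of how such a mixture could be assembled\. It assumes a fixed\-size training set in which TL examples replace English ones; the paper does not describe its exact mixing procedure, and the helper name `build_mixture` and the toy data are hypothetical\.

```python
import random


def build_mixture(english, target, tl_ratio, total, seed=0):
    """Return `total` examples, of which round(tl_ratio * total) are target-language.

    Assumption: the training-set size is fixed and TL examples replace English ones.
    """
    if not 0.0 <= tl_ratio <= 1.0:
        raise ValueError("tl_ratio must lie in [0, 1]")
    n_tl = round(tl_ratio * total)
    n_en = total - n_tl
    if n_tl > len(target) or n_en > len(english):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)
    mixture = rng.sample(english, n_en) + rng.sample(target, n_tl)
    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    # Toy pools standing in for English and Urdu training examples.
    english = [("en", i) for i in range(5000)]
    target = [("ur", i) for i in range(5000)]
    for ratio in (0.0, 0.01, 0.10, 0.50):
        mix = build_mixture(english, target, ratio, total=5000)
        n_tl = sum(1 for lang, _ in mix if lang == "ur")
        print(f"TL ratio {ratio:.0%}: {n_tl} TL / {len(mix) - n_tl} En")
```

Under this assumption, a 1% ratio on a 5,000\-example training set corresponds to only 50 TL examples, which illustrates how small the "minimal multilingual exposure" setting is\.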

Our investigation on the impact of hyper\-parameters on different LoRA techniques follows a two\-step approach\. First, we assess each method using its optimal hyper\-parameters\. Second, to ensure a controlled comparison under a similar parameter budget, we fix the rank r to 8 and apply adapters to all linear layers of the transformer architecture\.

Note that all methods except DoRA converge to an optimal rank of 8 \(a parameter budget of 0\.26%\) during hyperparameter tuning\. DoRA reaches its peak performance at r = 16, with adapters applied to the Q, K, V, Up, and Down layers, resulting in a slightly larger parameter budget of 0\.36% trainable parameters\. To ensure a fair comparison, we introduce DoRA\*, in which r = 8 and adapters are applied to all linear layers, matching the parameter budget of the others\. Table [1](https://arxiv.org/html/2606.10428#S2.T1) presents the parameter budget and rank for each adapter configuration\. The marginal increase of 0\.01% in the number of trainable parameters for DoRA is due to the integration of learnable magnitude components \(Liu et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib14)\)\. Matching this parameter budget with VeRA would require a higher rank, which is infeasible under our hardware constraints\.
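The parameter budget of a LoRA configuration can be estimated directly from the layer shapes, since each adapted linear layer adds r × \(in \+ out\) trainable parameters\. The sketch below does this for plain LoRA on all linear layers\. The layer shapes and the base parameter count are assumptions about the public Llama\-3\.1\-8B architecture, not values given in this paper, and the function names are hypothetical\.

```python
# Assumed Llama-3.1-8B shapes (not stated in the paper): 32 decoder layers,
# hidden size 4096, MLP size 14336, grouped-query K/V projection width 1024.
NUM_LAYERS = 32
LINEAR_SHAPES = {  # module name: (in_features, out_features)
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
BASE_PARAMS = 8_030_261_248  # assumed parameter count of the frozen base model


def lora_trainable_params(rank, modules=None):
    """Count LoRA parameters: A is (rank x in) and B is (out x rank) per adapted layer."""
    names = list(LINEAR_SHAPES) if modules is None else list(modules)
    per_layer = sum(rank * sum(LINEAR_SHAPES[name]) for name in names)
    return NUM_LAYERS * per_layer


def budget_percent(trainable):
    """Trainable parameters as a percentage of the (assumed) base model size."""
    return 100.0 * trainable / BASE_PARAMS


if __name__ == "__main__":
    for r in (8, 16, 32):
        n = lora_trainable_params(r)
        print(f"r={r}: {n:,} trainable parameters ({budget_percent(n):.2f}% of base)")
```

Under these assumed shapes, r = 8 on all linear layers gives roughly 0\.26% of the base model, in line with the budget quoted above\. The count scales linearly with the rank, and variants such as DoRA add a small number of extra parameters on top that this sketch does not model\.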

Table 1: Rank and number of trainable parameters\.

Table 2: F1 scores across LoRA techniques during Llama\-3\.1\-8B fine\-tuning on XNLI \(averaged over 3 random seeds\)\. Each value is reported as English F1 / Target\-language F1\. Bold indicates the best result for each language and underline the second\-best\. Ln: Language; %: LRL percentage\.

To verify the generalizability of our findings, we carried out a series of ablation studies that vary the size and architecture of the LLM, the task, and the LoRA rank\.

## 3 Experimental Setup

Data: We use the cross\-lingual natural language inference \(XNLI\) dataset \(Conneau et al\., [2018](https://arxiv.org/html/2606.10428#bib.bib44)\) for the main experiments \(with Ur, Sw, and Hi\), and TyDiQA\-GoldP \(Clark et al\., [2020](https://arxiv.org/html/2606.10428#bib.bib45)\), a question answering dataset, for the ablation \(with Sw, Bn, and Te\)\. For XNLI, we use 5,000 samples for training, 500 for validation, and the full test portion \(5,010 samples\) for testing\. For TyDiQA\-GoldP, we use subsets of the training set for training and validation \(3,000 for training, 696 for validation\) and the entire validation set for testing, following Ahuja et al\. \([2023](https://arxiv.org/html/2606.10428#bib.bib52)\)\. For both datasets, we apply stratified random sampling\. We use the micro F1 score as the evaluation metric\.
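The two data\-handling steps mentioned above, stratified random sampling and micro F1, can be sketched as follows for a single\-label classification task such as XNLI\. This is an illustrative sketch only: the function names and the dictionary\-based example format are assumptions, and the QA setting of TyDiQA\-GoldP would need a different F1 computation that is not shown here\.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(examples, n, seed=0):
    """Draw n examples while preserving label proportions (largest-remainder rounding)."""
    if n > len(examples):
        raise ValueError("cannot sample more examples than are available")
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    # Fractional quota per label, floored; leftover slots go to the largest remainders.
    quotas = {lab: n * len(items) / len(examples) for lab, items in by_label.items()}
    counts = {lab: int(q) for lab, q in quotas.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - counts[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        counts[lab] += 1
    rng = random.Random(seed)
    sample = []
    for lab, items in by_label.items():
        sample.extend(rng.sample(items, counts[lab]))
    rng.shuffle(sample)
    return sample


def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes before computing F1."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    # In single-label classification every error counts once as FP and once as FN.
    fp = fn = len(gold) - tp
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    data = [{"label": i % 3, "id": i} for i in range(300)]  # toy three-class pool
    subset = stratified_sample(data, 30)
    print(Counter(ex["label"] for ex in subset))
    print(micro_f1([0, 1, 2, 1], [0, 2, 2, 1]))
```

Note that for single\-label classification the micro\-averaged F1 coincides with accuracy, which is why the pooled counts above reduce to the fraction of correct predictions\.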

Models: Preliminary experiments \(Table [14](https://arxiv.org/html/2606.10428#A4.T14)\) show that Llama\-3\.1\-8B base is better than its instruct version\. Therefore, the former is used for subsequent experiments\. To ensure generalizability of our findings, we extend a subset of our experiments to Llama\-3\.2\-3B base \(same model family, different size\) and Qwen3\-8B base \(different model family, same size\)\. We conduct a hyperparameter search starting from the values recommended for each LoRA variant \(see Appendix [D](https://arxiv.org/html/2606.10428#A4) and Appendix [E](https://arxiv.org/html/2606.10428#A5)\)\. Our code will be released\.

## 4 Results

Table 3: Comparison of computational efficiency metrics across LoRA techniques during Llama\-3\.1\-8B fine\-tuning on XNLI\. ATT: Avg\. Training Time; AGU: Avg\. GPU Utilisation; AMU: Avg\. Memory Usage; MMU: Max\. Memory Usage; AGT: Avg\. GPU Temperature\.

### 4\.1 Main Experiments

Table [2](https://arxiv.org/html/2606.10428#S2.T2) summarises the results \(across 3 seeds\) of instruction tuning Llama\-3\.1\-8B with XNLI\. Similar to findings for full fine\-tuning \(Shaham et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib30)\), introducing even a small proportion \(1%\) of TL data enhances performance in the TLs across configurations, with only three exceptions, and causes just a slight decline in En performance in most cases\. Increasing the TL proportion further \(to 10% and 50%\) results in performance degradation for both English and TLs in most configurations\.

Considering the peak performance across all LoRA types and TL percentages, the highest overall F1 scores for Ur, Sw, and Hi are achieved by LoRA, DoRA\*, and DoRA, respectively\. However, an ANOVA test performed across all four language percentages indicates no statistically significant difference \($F=0.33$, $p=0.89$\) among the LoRA variants on TL performance\. On En performance, there is a significant difference \($F=5.14$, $p=0.0002$\), and post\-hoc comparisons via Tukey’s HSD yield a distinct performance hierarchy: LoRA = DoRA = DoRA\* = AdaLoRA \> PiSSA \> VeRA\. VeRA forces all layers to share identical global $A$ and $B$ matrices, restricting updates to a single subspace\. We believe that this shared configuration is a potential disadvantage for prior knowledge preservation\.
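For readers unfamiliar with the test used above, the following sketch computes the one\-way ANOVA F statistic by hand: the ratio of between\-group to within\-group mean squares, where each group would hold the F1 scores of one LoRA variant\. The numbers in the example are placeholders chosen for easy arithmetic, not results from this paper, and the function name is hypothetical\.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-group score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and more scores than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Placeholder scores for three hypothetical methods (NOT values from the paper).
    scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
    f_stat, df_b, df_w = one_way_anova_f(scores)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Turning the statistic into a p\-value requires the F distribution with the two degrees of freedom returned here; in practice a statistics library routine such as `scipy.stats.f_oneway` returns both the statistic and the p\-value\.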

As reported in Table [3](https://arxiv.org/html/2606.10428#S4.T3), no single LoRA variant dominates across all efficiency dimensions\. LoRA demonstrates the most favorable efficiency profile overall, with AdaLoRA being the second best\. DoRA and DoRA\* demonstrate the least favorable efficiency profile \(ranked at the bottom for four of the five metrics\)\.

### 4\.2 Experimenting with Different LLMs

Table 4: Average F1 scores of LoRA techniques on the XNLI dataset for Llama\-3\.2\-3B \(L\) and Qwen3\-8B \(Q\)\.

We use the 1% TL ratio \(which gave the best result\) in subsequent experiments\. Table [4](https://arxiv.org/html/2606.10428#S4.T4) reports the average performance across all three TLs for both Llama\-3\.2\-3B and Qwen3\-8B\. Detailed results are in Table [7](https://arxiv.org/html/2606.10428#A0.T7) and Table [8](https://arxiv.org/html/2606.10428#A0.T8) in Appendix [C](https://arxiv.org/html/2606.10428#A3)\. Average F1 scores and statistical testing \(ANOVA and post\-hoc\) indicate that VeRA significantly \($p<0.05$\) underperforms the other methods on both TL and English in the Llama\-3\.2\-3B experiments\. For Qwen3\-8B, ANOVA testing indicates no statistically significant difference \($p=0.65$\) among the LoRA methods on TL performance\. For English, VeRA significantly underperforms \($p<0.05$\) the other methods\.

### 4\.3 Experimenting with Varying Ranks

Table 5: Average F1 of LoRA techniques on Llama\-3\.1\-8B across different ranks for the XNLI dataset\.

To investigate the sensitivity of each adapter to the parameter budget, we evaluate performance across rank values\. Here, the parameter budget is directly proportional to the rank, as we keep other factors constant\. For these experiments, we use DoRA\*, which maintains a parameter budget comparable to the other methods\. As per Table [5](https://arxiv.org/html/2606.10428#S4.T5) \(detailed results in Table [9](https://arxiv.org/html/2606.10428#A0.T9) in Appendix [C](https://arxiv.org/html/2606.10428#A3)\), TL performance exhibits sharp sensitivity to rank scaling, while En F1 scores remain highly stable across all four techniques\. TL performance uniformly degrades at rank 16 relative to rank 8, while rank 32 yields peak TL performance across nearly all variants\.

### 4\.4 Evaluating Robustness on TyDiQA

As per Table [6](https://arxiv.org/html/2606.10428#S4.T6), across all 3 languages, performance differences among the LoRA variants on Llama\-3\.1\-8B are minor \($p=0.30$ for TLs and $p=0.10$ for English\), validating our earlier observations\.

Table 6: Results of LoRA techniques on TyDiQA\-GoldP for Llama\-3\.1\-8B\.

![Refer to caption](https://arxiv.org/html/2606.10428v1/x1.png)

Figure 1: Layer\-wise language distribution\.

### 4\.5 Layer\-wise Hidden Embedding Analysis

Using the method proposed by Zhao et al\. \([2024](https://arxiv.org/html/2606.10428#bib.bib48)\), we perform a layer\-wise analysis of embeddings of LLMs to investigate whether the LoRA variants have been able to introduce significant language representation changes in the LLM\.

Experiments were conducted on Ur, Sw, and Hi using Llama\-3\.1\-8B fine\-tuned with each LoRA variant with 1% TL data\. For each test instance, we classified decoded layer\-wise embeddings into English, TL, or other using CLD3 \([https://pypi\.org/project/pycld3/](https://pypi.org/project/pycld3/)\), computed per\-layer language proportions, and averaged them over the test set\. The percentages of En and Ur in the hidden embeddings of each layer are illustrated in Figure [1](https://arxiv.org/html/2606.10428#S4.F1)\. The experimental setup and supplementary results for additional languages are in Appendix [F](https://arxiv.org/html/2606.10428#A6)\.
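The aggregation step described above \(classify, compute per\-layer proportions, average over the test set\) can be sketched as follows\. This is an assumption\-laden illustration: the decoding of hidden states into text is taken as given, `toy_classify` is a crude script\-based stand\-in for a real language identifier such as CLD3, and all names are hypothetical\.

```python
from collections import Counter

LABELS = ("en", "tl", "other")


def toy_classify(text):
    """Stand-in for a language identifier: Arabic script -> tl, ASCII letters -> en."""
    if any("\u0600" <= ch <= "\u06ff" for ch in text):
        return "tl"
    if any(ch.isascii() and ch.isalpha() for ch in text):
        return "en"
    return "other"


def layer_language_proportions(decoded, classify):
    """Average per-layer language proportions over test instances.

    decoded[i][l] is the list of strings decoded from layer l for instance i.
    Returns one {label: proportion} dict per layer.
    """
    num_layers = len(decoded[0])
    totals = [dict.fromkeys(LABELS, 0.0) for _ in range(num_layers)]
    for instance in decoded:
        for layer, texts in enumerate(instance):
            counts = Counter(classify(t) for t in texts)
            n = sum(counts.values())
            for lab in LABELS:
                # Proportion within this instance and layer (0 if nothing was decoded).
                totals[layer][lab] += counts[lab] / n if n else 0.0
    return [{lab: v / len(decoded) for lab, v in t.items()} for t in totals]


if __name__ == "__main__":
    urdu_word = "\u06a9\u062a\u0627\u0628"  # an Urdu token used as toy TL output
    decoded = [
        [["the", "##"], [urdu_word, "book"]],       # instance 0: layer 0, layer 1
        [["cat", "dog"], [urdu_word, urdu_word]],   # instance 1: layer 0, layer 1
    ]
    for layer, props in enumerate(layer_language_proportions(decoded, toy_classify)):
        print(layer, props)
```

Averaging proportions per instance first, rather than pooling all tokens, keeps long test instances from dominating the per\-layer estimate; whether the paper uses exactly this weighting is not stated in this section\.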

We note that the amount of language representation across layers is largely consistent across all LoRA variants, indicating that LoRA variants have not been able to introduce noticeable language representation changes in the LLM\. These language representations exhibit a distinct three\-stage pattern transitioning from initial layers, to English\-dominant middle layers, and finally to outer layers where English dominance declines\. This pattern aligns with findings of prior work that investigated language representations of LLMs \(Zhao et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib48); Wendler et al\., [2024](https://arxiv.org/html/2606.10428#bib.bib53); Tamo et al\., [2026](https://arxiv.org/html/2606.10428#bib.bib54); Gurgurov et al\., [2025](https://arxiv.org/html/2606.10428#bib.bib55)\)\.

TL percentages in the final layers \(30–31\), as well as the English percentages in the task\-solving layers \(22–27\), both exhibit a statistically significant \($p<0.05$\) positive Pearson correlation with the final TL F1 score \(detailed results in Figure [4](https://arxiv.org/html/2606.10428#A6.F4) of Appendix [F](https://arxiv.org/html/2606.10428#A6)\)\. Motivated by this, we applied LoRA adapters exclusively to these layers\. However, this configuration underperforms LoRA fine\-tuning of all layers \(see Table [13](https://arxiv.org/html/2606.10428#A4.T13) of Appendix [F](https://arxiv.org/html/2606.10428#A6)\)\. This contrasts with Owodunni and Kumar \([2025](https://arxiv.org/html/2606.10428#bib.bib23)\), who reported that applying LoRA to the first ten and last two layers is sufficient during continual pretraining\.
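The correlation analysis above pairs, for each fine\-tuned model, a layer\-level language percentage with that model's final TL F1 score\. A minimal sketch of the Pearson coefficient is given below; the input values are placeholders for illustration only and are not measurements from this paper\.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Placeholder values (NOT from the paper): TL share in the last layers per
    # fine-tuned model, paired with that model's TL F1 score.
    tl_share_final_layers = [0.10, 0.20, 0.30, 0.40]
    tl_f1 = [0.50, 0.55, 0.65, 0.70]
    print(f"r = {pearson_r(tl_share_final_layers, tl_f1):.3f}")
```

A positive coefficient only indicates that the two quantities move together across models; as the adapter\-placement experiment above shows, it does not by itself imply that adapting only the correlated layers is sufficient\.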

## 5 Conclusion

Our comprehensive experiments reveal no significant difference among the considered LoRA techniques in TL learning, while VeRA underperformed the other LoRA techniques in terms of English language preservation\. Layer\-wise embedding and correlation analyses show that every LoRA technique, regardless of its underlying architecture, converges on the same "Reason in English, Translate at Exit" pattern\. Future work may focus on developing LoRA architectures that formalize and optimize this mechanism\.

## 6 Limitations

While our evaluation encompasses a significant range of configurations, several constraints bound the scope of this study\. First, due to computational resource limitations, we restricted our analysis to LLMs in the 3B to 8B parameter range\. Future research could investigate whether the above observations persist in much larger models\.

Our experimental evaluation is limited to two downstream tasks \(NLI and QA\)\. This restriction stems from computational constraints as well as the limited availability of task\-specific fine\-tuning datasets that simultaneously support LRLs, are feasible within our GPU budget, and cover tasks beyond those considered in this study \(i\.e\., text classification and question answering\)\. If the availability of LRL datasets spanning diverse tasks increases, future work could extend this analysis to a broader range of tasks\.

Due to a lack of computing resources, we could only experiment with 0%, 1%, 10%, and 50% data mixtures\. While future experiments can conduct a denser sweep around 0–10%, we believe this is not necessary for the current research, as our primary goal is not to identify the optimal data mixture\. Similarly, we checked forgetting only in the context of English, because English is the most prominent language in LLMs and most language resources are available for English\.

Finally, while there are many LoRA variants available, we only experiment with LoRA techniques that are supported by the Hugging Face library and are feasible with our GPU resources\.

## 7 Ethical Considerations

We use publicly available datasets\. XNLI is released under CC\-BY\-NC\-4\.0, and TyDiQA\-GoldP under Apache 2\.0\. Any biases present in these datasets may be reflected in our results\.

## References

- K\. Ahuja, H\. Diddee, R\. Hada, M\. Ochieng, K\. Ramesh, P\. Jain, A\. Nambi, T\. Ganu, S\. Segal, M\. Ahmed, et al\. \(2023\) Mega: multilingual evaluation of generative ai\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 4232–4267\.
- J\. H\. Clark, E\. Choi, M\. Collins, D\. Garrette, T\. Kwiatkowski, V\. Nikolaev, and J\. Palomaki \(2020\) Tydi qa: a benchmark for information\-seeking question answering in typologically diverse languages\. Transactions of the Association for Computational Linguistics 8, pp\. 454–470\.
- A\. Conneau, R\. Rinott, G\. Lample, A\. Williams, S\. Bowman, H\. Schwenk, and V\. Stoyanov \(2018\) XNLI: evaluating cross\-lingual sentence representations\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 2475–2485\.
- G\. L\. Garcia, A\. d\. F\. Schuck, J\. R\. Manesco, P\. H\. Paiola, L\. A\. Passos, and J\. P\. Papa \(2026\) Think portuguese with bode reasoning\. In Proceedings of the 17th International Conference on Computational Processing of Portuguese \(PROPOR 2026\), Vol\. 1, pp\. 953–958\.
- D\. Gurgurov, K\. Trinley, Y\. Al Ghussin, T\. Bäumel, J\. van Genabith, and S\. Ostermann \(2025\) Language arithmetics: towards systematic language neuron identification and manipulation\. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics, pp\. 2911–2937\.
- B\. Hassan and X\. Chen \(2026\) GRASP lora: grpo guided adapter sparsity policy for cross lingual transfer\. arXiv preprint arXiv:2601\.06702\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In The Tenth International Conference on Learning Representations, OpenReview\.net\. [Link](https://openreview.net/forum?id=nZeVKeeFYf9)
- S\. H\. Imam, M\. Y\. Bello, H\. A\. Umar, T\. D\. Belay, I\. Abdulmumin, S\. M\. Yimam, and S\. H\. Muhammad \(2026\) Full fine\-tuning vs\. parameter\-efficient adaptation for low\-resource african asr: a controlled study with whisper\-small\. In Proceedings of the 7th Workshop on African Natural Language Processing \(AfricaNLP 2026\), pp\. 197–203\.
- O\. Khade, S\. Jagdale, A\. Phaltankar, G\. Takalikar, and R\. Joshi \(2025\) Challenges in adapting multilingual llms to low\-resource languages using lora peft tuning\. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages \(CHiPSAL 2025\), pp\. 217–222\.
- M\. Khelli, S\. Cahyawijaya, A\. Purwarianti, and G\. I\. Winata \(2025\) What causes knowledge loss in multilingual language models?\. arXiv preprint arXiv:2504\.20356\.
- D\. J\. Kopiczko, T\. Blankevoort, and Y\. M\. Asano \(2024\) VeRA: vector\-based random matrix adaptation\. In The Twelfth International Conference on Learning Representations\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\) DoRA: weight\-decomposed low\-rank adaptation\. In Proceedings of the 41st International Conference on Machine Learning, pp\. 32100–32121\.
- F\. Meng, Z\. Wang, and M\. Zhang \(2024\) Pissa: principal singular values and singular vectors adaptation of large language models\. Advances in Neural Information Processing Systems 37, pp\. 121038–121072\.
- A\. T\. Owodunni and S\. Kumar \(2025\) Continually adding new languages to multilingual language models\. arXiv preprint arXiv:2509\.11414\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, et al\. \(2019\) Pytorch: an imperative style, high\-performance deep learning library\. Advances in Neural Information Processing Systems 32\.
- U\. Shaham, J\. Herzig, R\. Aharoni, I\. Szpektor, R\. Tsarfaty, and M\. Eyal \(2024\) Multilingual instruction tuning with just a pinch of multilinguality\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 2304–2317\.
- J\. B\. Tamo, D\. Carlander\-Reuterfelt, J\. Rubin, O\. Poliannikov, D\. Hong, and M\. Wang \(2026\) LinguaMap: which layers of LLMs speak your language and how to tune them?\. In The Fourteenth International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=r00UxTl8El)
- C\. Wendler, V\. Veselovsky, G\. Monea, and R\. West \(2024\) Do llamas work in english? on the latent language of multilingual transformers\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 15366–15394\.
- C\. Whitehouse, F\. Huot, J\. Bastings, M\. Dehghani, C\. Lin, and M\. Lapata \(2024\) Low\-rank adaptation for multilingual summarization: an empirical study\. In Findings of the Association for Computational Linguistics: NAACL 2024, pp\. 1202–1228\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, et al\. \(2020\) Transformers: state\-of\-the\-art natural language processing\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp\. 38–45\.
- M\. Yang, J\. Chen, J\. Tao, Y\. Zhang, J\. Liu, J\. Zhang, Q\. Ma, H\. Verma, R\. Zhang, M\. Zhou, et al\. \(2024\) Low\-rank adaptation for foundation models: a comprehensive review\. arXiv preprint arXiv:2501\.00365\.
- Q\. Zhang, M\. Chen, A\. Bukharin, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\) Adaptive budget allocation for parameter\-efficient fine\-tuning\. In The Eleventh International Conference on Learning Representations\.
- Y\. Zhao, W\. Zhang, G\. Chen, K\. Kawaguchi, and L\. Bing \(2024\) How do large language models handle multilingualism?\. Advances in Neural Information Processing Systems 37, pp\. 15296–15319\.

Table 7: F1 scores of LoRA techniques on the XNLI dataset for Llama\-3\.2\-3B\. Each value is reported as English F1 / Target\-language F1\. Bold indicates the best result for each language and underline the second\-best\.

Table 8: F1 scores of LoRA techniques on the XNLI dataset for Qwen3\-8B\. Each value is reported as English F1 / Target\-language F1\. Bold indicates the best result for each language and underline the second\-best\.

Table 9: F1 scores of LoRA techniques on the XNLI dataset across different ranks using a 1% TL data ratio\. Each value is reported as English F1 / Target\-language F1\. Bold indicates the best result for each rank and underline the second\-best\.

## Appendix A Languages

For XNLI, we conducted our experiments on three distinct languages: Urdu, Swahili, and Hindi\. Urdu is a low\-resource Indo\-Aryan language written in the Perso\-Arabic script, and shares significant linguistic similarities with Hindi, which is represented in pretraining corpora of the LLMs used in this study\. Swahili is a low\-resource Bantu language written in the Latin script \(Owodunni and Kumar, [2025](https://arxiv.org/html/2606.10428#bib.bib23)\)\. Hindi is a medium\-resource Indo\-Aryan language written in the Devanagari script\. For TyDiQA, we evaluate on Swahili as well as Bengali and Telugu, the latter two representing the Indo\-Aryan and Dravidian language families and written in Bengali and Telugu scripts, respectively\. The selection of languages with diverse scripts, language families and structures ensures the robustness of our experiments\.

## Appendix B LoRA Techniques

This section provides a brief overview of the four LoRA variants used in our experiments\.

- DoRA is designed to bridge the performance gap between LoRA and full\-parameter fine\-tuning\. It operates by decomposing the pre\-trained weight matrices into magnitude and directional components\. During the adaptation process, the directional component is updated via low\-rank matrices while the magnitude is tuned independently\. This method provides a learning ability closer to full fine\-tuning without introducing any inference latency compared to LoRA\.
- VeRA departs from traditional LoRA by utilising a single pair of random, low\-rank matrices that remain frozen and are shared across all adapted layers\. Layer\-wise tuning is facilitated through scalable training vectors\. This design achieves a substantial reduction of trainable parameters compared to LoRA, while maintaining performance levels competitive with more parameter\-intensive adaptation methods\.
- AdaLoRA improves the efficiency of LoRA by moving away from a static parameter distribution across weight matrices\. AdaLoRA utilises a Singular Value Decomposition \(SVD\) based formulation to represent weight increments, allowing the system to quantify the relative importance of specific updates\. This technique prunes singular values with low importance scores, effectively reallocating the training budget toward incremental matrices that are more critical for the target task\. This adaptive approach ensures that the model capacity is utilised where it provides the greatest performance gain\.
- PiSSA targets the original weight matrix $W$ directly, in contrast to LoRA and its variants, which approximate the weight update $\Delta W$\. In this approach, the weight matrix $W$ undergoes singular value decomposition \(SVD\), allowing it to be segmented based on the magnitude of its singular values\. This process yields two components: a principal low\-rank matrix, which captures the most significant singular values, and a residual matrix, containing the remaining smaller singular values\. During training, the principal component is updated, and the residual component remains fixed\.
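
To make two of the descriptions above concrete, the following NumPy sketch illustrates a PiSSA-style split of a weight matrix into a trainable principal part and a frozen residual, and a DoRA-style magnitude/direction reparameterisation. It is a toy illustration under stated assumptions, not the implementation used in our experiments: the matrix sizes, the random stand-in for a pre-trained weight, and the names `A`, `B`, `W_res`, `m0`, and `dora_weight` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2                 # toy sizes (assumed); r is the adapter rank
W = rng.standard_normal((d_out, d_in))   # random stand-in for a pre-trained weight

# PiSSA-style split: W = A @ B (principal, trainable) + W_res (residual, frozen).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * np.sqrt(S[:r])            # (d_out, r): top-r left vectors, scaled
B = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, d_in): top-r right vectors, scaled
W_res = (U[:, r:] * S[r:]) @ Vt[r:]      # built from the remaining smaller singular values


# DoRA-style reparameterisation: weight = magnitude * direction, where the
# direction is the column-normalised matrix W0 + delta (delta is the low-rank update).
def dora_weight(W0, delta, m):
    V = W0 + delta
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)


m0 = np.linalg.norm(W, axis=0, keepdims=True)   # initial magnitude, shape (1, d_in)
W_dora = dora_weight(W, np.zeros_like(W), m0)   # a zero update reproduces W
```

In this sketch both reparameterisations reproduce `W` exactly at initialisation; training then updates `A` and `B` (PiSSA) or the low-rank `delta` and the magnitude `m` (DoRA).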

## Appendix C Extended Results

We provide the full per\-language breakdown for the generalizability experiments discussed in Section [4\.2](https://arxiv.org/html/2606.10428#S4.SS2) in Table [7](https://arxiv.org/html/2606.10428#A0.T7) and Table [8](https://arxiv.org/html/2606.10428#A0.T8)\. Table [7](https://arxiv.org/html/2606.10428#A0.T7) presents the F1 scores for Llama\-3\.2\-3B, and Table [8](https://arxiv.org/html/2606.10428#A0.T8) details the results for Qwen3\-8B\. All reported scores are derived from experiments using the 1% TL data ratio\.

Table [9](https://arxiv.org/html/2606.10428#A0.T9) presents the extended results discussed in Section [4\.3](https://arxiv.org/html/2606.10428#S4.SS3)\. For each language and rank setting, we report the F1 scores for both English and the corresponding TL\.

## Appendix D Hyperparameters

We initialized hyperparameters using the values recommended in the original method papers and subsequently performed controlled tuning to ensure a fair comparison across methods\.

We tuned the number of training epochs, learning rate, LoRA rank ($r$), LoRA alpha ($\alpha$) and target modules\. Following the original design of VeRA, we explored larger rank values $r \in \{256, 512, 1024, 2048\}$\. For all other PEFT methods, we evaluated $r \in \{4, 8, 16, 32\}$ and $\alpha \in \{4, 8, 16, 32\}$\. Learning rates were swept across $\{2 \times 10^{-5}, 5 \times 10^{-5}, 1 \times 10^{-4}, 2 \times 10^{-4}, 5 \times 10^{-4}, 1 \times 10^{-3}, 2 \times 10^{-3}, 4 \times 10^{-3}\}$\. The default number of training epochs was set to 3\. However, if the original method papers recommended an alternative epoch count, we evaluated that configuration as well and compared it against the default baseline\. For target modules, we evaluated both the module configuration recommended in the respective original papers and an "all\-linear" configuration, selecting the best\-performing option for each method\. Hyperparameter configurations for the main experiments on the XNLI and TyDiQA\-GoldP datasets are provided in Table [10](https://arxiv.org/html/2606.10428#A4.T10) and Table [11](https://arxiv.org/html/2606.10428#A4.T11), respectively\. For the ablation study conducted on XNLI with increased ranks $r \in \{16, 32\}$, we tuned the LoRA alpha over the set $\{16, 32, 64\}$ while keeping other hyperparameters consistent with the main configuration\. Table [12](https://arxiv.org/html/2606.10428#A4.T12) summarizes the optimal $\alpha$ values chosen for each method across these increased ranks\.
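
For illustration, the search ranges listed above can be enumerated as follows. This is a minimal sketch that assumes an exhaustive Cartesian grid, which is an assumption made only for this example. The function name `candidate_configs`, the dictionary keys, and the method labels are hypothetical, and epochs and target modules are omitted for brevity.

```python
from itertools import product

# Search ranges as listed in this appendix.
VERA_RANKS = [256, 512, 1024, 2048]
OTHER_RANKS = [4, 8, 16, 32]
ALPHAS = [4, 8, 16, 32]
LEARNING_RATES = [2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 4e-3]


def candidate_configs(method):
    """Yield hypothetical config dicts for one method (assumes a full grid)."""
    if method == "vera":
        # Only rank values are listed for VeRA, so alpha is left unset here.
        for r, lr in product(VERA_RANKS, LEARNING_RATES):
            yield {"method": method, "r": r, "alpha": None, "lr": lr}
    else:
        for r, alpha, lr in product(OTHER_RANKS, ALPHAS, LEARNING_RATES):
            yield {"method": method, "r": r, "alpha": alpha, "lr": lr}


lora_grid = list(candidate_configs("lora"))
vera_grid = list(candidate_configs("vera"))
print(len(lora_grid), len(vera_grid))  # 128 32
```

Under this full-grid assumption, each non-VeRA method has 4 × 4 × 8 = 128 candidate configurations and VeRA has 4 × 8 = 32, before epochs and target modules are taken into account.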

![Refer to caption](https://arxiv.org/html/2606.10428v1/x2.png)
Figure 2: Layer\-wise language distribution in hidden embeddings for Sw experiments

![Refer to caption](https://arxiv.org/html/2606.10428v1/x3.png)
Figure 3: Layer\-wise language distribution in hidden embeddings for Hi experiments

Table 10: Hyperparameter configurations used for different LoRA methods in the main experiments\.

Table 11: Hyperparameters for LoRA methods on TyDiQA\-GoldP\.

Table 12: Best\-performing alpha values for the XNLI rank sensitivity study

Table 13: F1 scores on XNLI obtained by multilingual instruction\-tuning selected layers of Llama\-3\.1\-8B using LoRA\. Each value is reported as English F1 / Target\-language F1\.

Table 14: F1 scores on XNLI obtained by instruction\-tuning Llama\-3\.1\-8B base LLM and Llama\-3\.1\-8B\-Instruct with LoRA under varying instruction\-tuning data compositions and dataset sizes \(2,000 vs\. 20,000\)\. Each value is reported as English F1 / Ur F1\.

## Appendix E Computing Infrastructure

All experiments were conducted using PyTorch \(Paszke et al\., [2019](https://arxiv.org/html/2606.10428#bib.bib46)\) and the HuggingFace library \(Wolf et al\., [2020](https://arxiv.org/html/2606.10428#bib.bib47)\) on an NVIDIA A100 GPU with 80GB memory\. We used PyTorch version 2\.5\.1 and CUDA 12\.2\.

## Appendix F Layer\-wise Hidden Embedding Analysis

We conducted a layer\-wise embedding analysis on Urdu, Swahili, and Hindi to track how internal language representations evolve across different fine\-tuning techniques\. For this analysis, we utilized the Llama\-3\.1\-8B models fine\-tuned on a training mixture consisting of 1% TL data and 99% English data\. We passed the target\-language test set through each model and extracted the hidden state representations from the output of every transformer layer\. The extracted embeddings were decoded using the model’s language modeling head and classified into three language categories: English, TL, and other\. Embeddings identified as neither English nor the corresponding TL were assigned to the other category\.

Language identification was performed using the CLD3 library\. For each test instance and each layer, we computed the proportion of embeddings assigned to each language category\. These layer\-wise language proportions were then averaged across the entire test set to obtain the final distributional patterns\.
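
The aggregation step described above can be sketched as follows. This is an illustrative example only: it assumes the language identifier has already produced one language code per decoded embedding, and the function names (`to_category`, `layer_proportions`, `average_over_test_set`) and the toy labels are hypothetical. No language identification library is called here.

```python
from collections import Counter

CATEGORIES = ("en", "tl", "other")


def to_category(lang_code, tl_code):
    """Map a detected language code to English, TL, or other."""
    if lang_code == "en":
        return "en"
    if lang_code == tl_code:
        return "tl"
    return "other"


def layer_proportions(lang_codes, tl_code):
    """Proportion of each category for one test instance at one layer."""
    counts = Counter(to_category(code, tl_code) for code in lang_codes)
    return {c: counts[c] / len(lang_codes) for c in CATEGORIES}


def average_over_test_set(instances, tl_code):
    """instances[i][l] holds the detected codes for instance i at layer l."""
    n_layers = len(instances[0])
    averaged = []
    for layer in range(n_layers):
        props = [layer_proportions(inst[layer], tl_code) for inst in instances]
        averaged.append({c: sum(p[c] for p in props) / len(props) for c in CATEGORIES})
    return averaged


# Toy data: two instances, two layers, Swahili ("sw") as the TL.
toy = [
    [["en", "en", "fr", "sw"], ["sw", "sw", "sw", "en"]],
    [["en", "en"], ["sw", "de"]],
]
result = average_over_test_set(toy, "sw")
```

Note that proportions are computed per instance first and then averaged, so every test instance contributes equally regardless of how many embeddings it has.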

To ensure robustness of our findings, we repeated a subset of the analysis using the Lingua language identification library \([https://pypi\.org/project/lingua/](https://pypi.org/project/lingua/)\)\. The resulting layer\-wise language distribution patterns were consistent with those obtained using CLD3, confirming that our observations are not dependent on a specific language identification tool\. Figure [2](https://arxiv.org/html/2606.10428#A4.F2) and Figure [3](https://arxiv.org/html/2606.10428#A4.F3) present results for Swahili and Hindi, respectively\.

![Refer to caption](https://arxiv.org/html/2606.10428v1/x4.png)
Figure 4: Correlation of layer\-wise TL ratio and English ratio to TL F1 score\. Warm colors \(reds/oranges\) denote negative correlations, while cool colors \(blues\) denote positive correlations\.