Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
Summary
This paper presents the first systematic study of multilingual instruction following in Vision-Language-Action (VLA) models, revealing significant performance degradation when models trained on English are evaluated on other languages. The authors propose Multilingual Principal Component Alignment (MPCA) to reduce the multilingual performance gap.
# Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

Source: [https://arxiv.org/html/2606.15714](https://arxiv.org/html/2606.15714)

Hongliang Li, Jiarui Cao, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

Beijing Jiaotong University

hanyangchen@bjtu.edu.cn, guoshn@bjtu.edu.cn

(June 14, 2026)

###### Abstract

Vision-Language-Action (VLA) models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap.
Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment (MPCA), which leverages Principal Component Analysis to obtain the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap. Our experiments show that MPCA can effectively improve multilingual performance in VLAs, demonstrating its potential as a practical solution for enhancing multilingual robustness in embodied agents.

Correspondence: Hanyang Chen, Shengnan Guo

Figure 1: Illustration of the evaluation stages for multilingual instruction following in VLAs. We first construct multilingual instruction variants based on existing robot benchmarks, then conduct a comprehensive evaluation and analysis to reveal the multilingual gap in current VLA systems, and finally explore strategies to improve multilingual performance in VLAs.

## 1 Introduction

Vision-Language-Action models (VLAs) have recently emerged as a promising paradigm for building generalist robot policies (team2024octo;kim2024openvla;black2024pi\_0;black2025pi\_;kim2025fine;cen2025rynnvla). By integrating visual perception, natural language understanding, and action generation within a unified framework, VLAs enable robots to execute complex tasks conditioned on natural language instructions. Built upon vision-language models (VLMs), VLAs extend multimodal understanding to action generation. Recent advances in VLMs (hurst2024gpt;team2024gemini;wang2024qwen2;bai2025qwen3) have also strengthened this foundation through large-scale multimodal pretraining, further improving generalization across tasks. As a result, VLAs have become an increasingly important research direction for developing scalable and flexible embodied agents.
Despite these advances, most existing VLA research focuses primarily on improving instruction\-following capability\(karnik2024embodied;zhang2025vlabench;xu2025seeing\)and visual grounding\(fei2025libero;zhou2025libero\)under English instructions\. In both training and evaluation, the majority of VLA benchmarks and datasets\(mees2022calvin;liu2023libero;li2023behavior;nasiriany2024robocasa;li2024evaluating;chen2025robotwin\)adopt English as the default instruction language\. However, this setting overlooks an important challenge in real\-world deployment: robots operating in global environments must be able to understand instructions expressed in different languages\. Although many VLAs rely on multilingual large language models as their backbone, it remains unclear whether these multilingual capabilities transfer to VLAs during training\. In particular, the alignment process between language tokens and robot actions may implicitly bias the model toward English instructions, potentially leading to degraded performance when instructions are provided in other languages\. This raises an important yet largely unexplored question:Do VLAs truly retain multilingual understanding after being aligned with robot actions? To address this question, a systematic evaluation of multilingual instructions following in VLAs is necessary\. In this work, we study multilingual capabilities in embodied systems through three key stages, as shown in Figure[1](https://arxiv.org/html/2606.15714#S0.F1)\.\(1\) Multilingual dataset construction: Evaluating multilingual generalization requires constructing instruction data across multiple languages to simulate multilingual interaction scenarios\. While the underlying language models may possess multilingual knowledge, it remains uncertain whether such knowledge can be effectively transferred to the action space through language\-conditioned policy learning\. 
Therefore, we extend existing robot datasets by constructing multilingual instruction variants that allow us to directly compare model behavior across languages\.\(2\) Multilingual evaluation and analysis: Beyond simply measuring task success rates, analyzing the results of VLAs under multilingual instructions can provide valuable insights into how language representations interact with action policies\. In particular, we will examine how performance varies across various dimensions and investigate whether the representations remain consistent under different linguistic inputs\.\(3\) Multilingual performance enhancement: Based on the insights gained from multilingual evaluation, we can further explore strategies to improve multilingual performance in VLAs\. Based on these stages, we introduce a multilingual evaluation framework for VLAs\. Specifically, we construct multilingual instruction sets by extending existing robot manipulation benchmarks with instructions expressed in multiple languages\. Our design focuses on two multilingual interaction settings\. The first setting is themultilingual instruction, where instructions are translated into different languages while preserving their original semantics, enabling controlled evaluation of multilingual generalization\. The second setting is thecode\-switching instruction, where instructions contain mixed\-language expressions that commonly occur in multilingual communication scenarios\. These settings allow us to systematically study how VLAs behave when language inputs deviate from the English\-only assumption commonly used in existing benchmarks\. Using this framework, we conduct a comprehensive empirical study on several representative VLA models across two robot benchmarks, LIBERO\(liu2023libero\)and SimplerEnv\(li2024evaluating\)\. 
Our experiments reveal a clear multilingual gap, where models trained primarily on English instructions exhibit noticeable performance degradation when evaluated on other languages\. Furthermore, we analyze the multilingual performance across different base models, suites, interaction settings and action head designs, and also provide discussions from the perspective of model behavior and representations\. Our findings suggest that multilingual generalization is an important yet overlooked aspect of current VLA systems, and provide insights for improving multilingual generalization for VLAs\. Based on these insights and analyses, we further explore strategies to improve multilingual performance in VLAs\. We also propose a simple yet effective multilingual fine\-tuning approach, Multilingual Principal Component Alignment \(MPCA\), which leverages Principal Component Analysis \(PCA\) to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap\. Our contributions are summarized as follows: \(1\) We present the first systematic study of multilingual instruction following in VLA models, introducing a multilingual evaluation framework that extends existing robot benchmarks with multilingual instruction variants\. \(2\) We conduct comprehensive experiments across multiple VLA models, suites, and multilingual settings, revealing a significant multilingual gap in current VLA systems\. \(3\) We analyze the multilingual performance from both behavioral and representational perspectives, providing insights into the underlying causes of the multilingual gap and potential avenues for improvement\. \(4\) We propose a simple yet effective multilingual fine\-tuning approach, MPCA, which effectively improves multilingual performance in VLAs\. 
## 2 Related Work

Vision-Language-Action Models. The rapid development of VLA models has been driven by large-scale pre-training and architectural innovations, yielding policies with strong English instruction-following and visual grounding capabilities. Early approaches, such as RT-1 (brohan2022rt) and Octo (team2024octo), pioneered end-to-end transformer policies trained on massive robot datasets, establishing the foundation for generalist manipulation. Subsequent paradigms (kim2024openvla;black2024pi\_0;black2025pi\_;kim2025fine;cen2025rynnvla;bjorck2025gr00t) increasingly leverage pre-trained VLMs as backbones, fine-tuning them for embodied control via action discretization (kim2024openvla;kim2025fine) or diffusion-based continuous action experts (black2024pi\_0;wen2024diffusion). To mitigate catastrophic forgetting and preserve foundation model priors during policy adaptation, recent works have explored parameter-efficient fine-tuning (torne2026mem), dual-system architectures that decouple high-level reasoning from low-level control (bjorck2025gr00t), and co-training with external data (bu2025univla;cen2025worldvla;lian2026bayesianvla). These architectures demonstrate remarkable zero-shot generalization in familiar settings, excelling at spatial reasoning, object grounding, and executing complex manipulation sequences conditioned on natural language. However, despite leveraging multilingual VLM backbones, contemporary VLA training pipelines remain overwhelmingly monolingual, which may bias the language-to-action alignment process toward English and potentially degrade performance on non-English instructions. Our work empirically demonstrates this multilingual gap and analyzes it in current VLA systems.
Evaluation Benchmarks for VLAs\.Evaluation benchmarks for VLA models have progressively expanded from static task completion metrics to systematic robustness analysis across visual, kinematic, and environmental dimensions\. Previous benchmarks, such as CALVIN\(mees2022calvin\)and LIBERO\(liu2023libero\), established standardized environments for language\-conditioned manipulation\. To expose hidden brittleness, recent benchmarks\(fei2025libero;zhou2025libero;chen2025robotwin;wang2025vlatest\)have automated perturbation generation and introduced multi\-dimensional evaluation protocols\. These benchmarks systematically evaluate model resilience to camera viewpoint shifts, object layout variations, lighting changes, robot initial state perturbations, and sensor noise, revealing that contemporary VLAs are highly sensitive to spatial and visual distribution shifts\. However, regarding language robustness, existing evaluations\(fei2025libero;wang2025vlatest\)typically restrict perturbations to English paraphrasing or instruction complexity variations\. While these studies confirm that VLAs can handle semantic variations within English, they operate under the implicit assumption that language understanding is language\-invariant once trained\. However, this assumption overlooks the real\-world needs for multilingual instruction following\. This omission creates a critical gap in embodied evaluation: without multilingual evaluation, performance degradation across languages cannot be quantified, limiting real\-world deployment\. Our work addresses this gap by systematically evaluating multilingual instruction following in VLAs, revealing a significant multilingual gap and providing insights into the underlying reasons and potential remedies\. 
## 3 Multilingual Evaluation Framework for VLAs

In this section, we introduce our multilingual evaluation framework for VLAs, which consists of two key components: (1) a multilingual instruction construction pipeline that generates multilingual instruction variants for existing robot benchmarks; and (2) a multilingual evaluation adapter that enables seamless integration of multilingual instructions into the evaluation process. This framework allows us to systematically assess the multilingual generalization capabilities of VLA models under controlled and consistent conditions.

### 3.1 Multilingual Instruction Construction

#### 3.1.1 Multilingual Settings Design

Expanding the scope of languages. Following most prior work on multilingual LLMs and VLMs (thellmann2024towards;yong2025state;luo2026lost), we select Chinese, French, Russian, and Arabic as the target languages for evaluation, covering high-resource languages (English and Chinese) and low-resource languages (French, Russian, and Arabic) in different language families and scripts.

Introducing diverse multilingual interaction settings. We consider two multilingual interaction settings in this paper: (1) multilingual instructions, where instructions are directly translated into different languages while preserving their original semantics; and (2) code-switching instructions, where instructions contain mixed-language expressions that commonly occur in multilingual communication scenarios. These settings allow us to systematically study how VLAs behave when language inputs deviate from the English-only assumption commonly used in existing benchmarks.

#### 3.1.2 Instruction Generation Pipeline

To construct multilingual instruction variants, we start from the original English instructions provided in existing benchmarks.
For the multilingual instruction setting, we use a standard machine translation system, namely the Cloud Translation API, to translate the English instructions into each target language. For the code-switching instruction setting, we use LLMs to first perform named entity recognition, identifying key verbs and nouns in both the original English instructions and their translated versions. Based on these alignments, the LLMs then substitute the corresponding key phrases in the translated instructions with those from the original English instructions, resulting in code-switched instructions. To ensure the quality of the generated instructions, we employ an LLM-as-a-judge to evaluate the semantic consistency between the original English instructions and the code-switching instructions. This pipeline allows us to systematically generate multilingual instruction variants while controlling for semantic equivalence, enabling a rigorous evaluation of multilingual generalization in VLA models.

### 3.2 Multilingual Evaluation Adapter

To achieve a simple and fair evaluation of multilingual performance across different VLA models, we provide a multilingual evaluation adapter that can replace the original language instruction with only a one-line code change. This adapter takes the original instruction as input and outputs the corresponding multilingual instruction variant based on the specified language setting. By integrating this adapter into the evaluation pipeline, we can seamlessly switch between different language inputs without modifying any other components of the model or environment. This design ensures that all models are evaluated under consistent conditions, allowing for a direct comparison of multilingual performance while maintaining the integrity of the original evaluation protocol.
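As a rough illustration of the adapter idea described above, the following Python sketch wraps a lookup table of pre-generated instruction variants behind a single call. The class name `MultilingualInstructionAdapter`, its fields, the `EXAMPLE_VARIANTS` table, and the example French strings are all hypothetical and are not taken from the paper's released code; the actual adapter may be organized quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical variant table: (setting, language) -> {english instruction -> variant}.
# The French strings are illustrative only, not entries from the paper's data.
EXAMPLE_VARIANTS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("multilingual", "fr"): {"turn on the stove": "allume la cuisinière"},
    ("code_switching", "fr"): {"turn on the stove": "allume la stove"},
}


@dataclass
class MultilingualInstructionAdapter:
    """Maps an English benchmark instruction to a pre-generated variant."""

    language: str = "en"           # assumed codes: "zh", "fr", "ru", "ar"
    setting: str = "multilingual"  # or "code_switching"
    variants: Dict[Tuple[str, str], Dict[str, str]] = field(default_factory=dict)

    def __call__(self, instruction: str) -> str:
        # English is the reference setting, so the instruction passes through.
        if self.language == "en":
            return instruction
        table = self.variants.get((self.setting, self.language), {})
        key = instruction.strip().lower()
        if key not in table:
            # Fail loudly rather than silently evaluating with English text.
            raise KeyError(
                f"no {self.setting} variant for {key!r} in {self.language!r}"
            )
        return table[key]
```

Under these assumptions, the only change inside an evaluation loop would be a line such as `instruction = adapter(instruction)` placed before the instruction is passed to the policy. Raising on a missing variant is a design choice of this sketch: it prevents an English instruction from being counted as a multilingual result.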
We provide instruction variants for LIBERO (liu2023libero) and SimplerEnv (li2024evaluating), and the adapter can automatically process new instructions and languages to generate corresponding variants.

### 3.3 Evaluation Protocol

We follow the original evaluation protocols of the respective benchmarks for each task, measuring task success rates under different language settings. To quantify the multilingual performance gap, we report relative performance with respect to the original English instructions, allowing us to directly compare the impact of language changes on model performance:

$$\text{Relative Performance} = (\text{Success Rate}_{\text{Multilingual}} - \text{Success Rate}_{\text{English}}) \times 100\% \tag{1}$$

A negative value thus indicates that the model succeeds less often than it does with the English instructions.

## 4 Experiments

In this section, we first introduce the experiment setup in Section [4.1](https://arxiv.org/html/2606.15714#S4.SS1), including the environments and baselines used for evaluation. We then present the main results in Section [4.2](https://arxiv.org/html/2606.15714#S4.SS2), where we analyze the multilingual performance of different VLA models across various languages and multilingual settings. We further investigate how the multilingual gap relates to model behavior in Section [4.3.1](https://arxiv.org/html/2606.15714#S4.SS3.SSS1), and analyze the underlying reasons behind the multilingual gap in Section [4.3.2](https://arxiv.org/html/2606.15714#S4.SS3.SSS2). Finally, we explore potential strategies to improve the multilingual capability of VLA models in Section [4.4](https://arxiv.org/html/2606.15714#S4.SS4).

### 4.1 Experiment Setup

Environments. We evaluate various VLA models on two simple but representative simulation benchmarks: LIBERO (liu2023libero) and SimplerEnv (li2024evaluating). These environments cover a range of tasks with varying complexity, such as object rearrangement and tool use.
Each environment provides a standardized evaluation protocol and metrics for assessing task success, enabling a systematic comparison of multilingual performance across diverse settings.

Baselines. We compare several VLA models that have been trained primarily on English instructions, including $\pi_{0.5}$ (intelligence2025pi\_), OpenVLA-OFT (kim2025fine), and ABot-M0 (yang2026abot). Multiple Qwen-VL-based VLAs with different action head designs (community2026starvla) are also included to evaluate the impact of multilingual VLM backbones and architectural choices on multilingual performance. In addition, we include a world-model-based policy, Cosmos Policy (kim2026cosmos), to investigate existing world-model-based approaches in multilingual settings. We provide detailed information about these models in Appendix [9.1](https://arxiv.org/html/2606.15714#S9.SS1).

### 4.2 Main Results

Table 1: Multilingual performance on LIBERO across four suites, with the average performance across suites shown in the last group. Relative performance is shown with respect to the original English instructions. ∅ denotes evaluation without any instructions. Avg. denotes the average performance under each language across the four suites. Percentage signs are omitted for readability.

| Model | Long zh | Long fr | Long ru | Long ar | Long ∅ | Goal zh | Goal fr | Goal ru | Goal ar | Goal ∅ | Object zh | Object fr | Object ru | Object ar | Object ∅ | Spatial zh | Spatial fr | Spatial ru | Spatial ar | Spatial ∅ | Avg. zh | Avg. fr | Avg. ru | Avg. ar | Avg. ∅ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA-OFT | -12.5 | -12.5 | -9.5 | -13.5 | -8.5 | -89.0 | -89.0 | -89.0 | -89.0 | -89.0 | -2.0 | -1.0 | -1.5 | -2.0 | -1.0 | -11.0 | -12.0 | -11.0 | -14.5 | -11.5 | -28.6 | -28.6 | -27.8 | -29.8 | -27.5 |
| $\pi_{0.5}$ | -20.0 | -17.0 | -23.5 | -21.0 | -23.0 | -78.0 | -84.5 | -88.5 | -85.5 | -85.0 | -32.5 | -28.0 | -31.5 | -33.5 | -34.5 | -28.0 | -22.5 | -23.5 | -25.5 | -31.5 | -39.6 | -38.0 | -41.8 | -41.4 | -43.5 |
| ABot-M0 | -5.5 | -5.0 | -14.0 | -15.5 | -18.0 | -33.5 | -88.0 | -89.5 | -87.0 | -87.5 | -4.5 | -19.0 | -11.5 | -8.5 | -8.5 | -3.5 | -9.0 | -7.5 | -15.5 | -13.5 | -11.8 | -30.3 | -30.6 | -31.6 | -31.9 |
| Cosmos Policy | -19.5 | -3.5 | -14.5 | -20.5 | -22.0 | -89.0 | -46.5 | -81.5 | -87.5 | -87.5 | -38.0 | -2.0 | -28.0 | -38.0 | -41.0 | -39.0 | -16.0 | -31.0 | -40.0 | -41.0 | -46.4 | -17.0 | -38.8 | -46.5 | -47.9 |
| Qwen2.5-VL-OFT | -7.0 | -22.0 | -26.0 | -35.5 | -32.9 | -18.0 | -71.5 | -71.5 | -90.0 | -92.0 | -19.0 | -8.5 | -34.5 | -42.0 | -23.4 | -27.0 | -26.5 | -49.0 | -69.0 | -25.0 | -17.8 | -32.1 | -45.3 | -59.1 | -43.3 |
| Qwen2.5-VL-FAST | -22.5 | -51.0 | -55.0 | -52.0 | -52.5 | -20.0 | -83.5 | -80.0 | -83.5 | -83.0 | -30.0 | -46.0 | -47.0 | -48.5 | -46.5 | -31.0 | -82.0 | -81.5 | -76.5 | -84.5 | -25.9 | -65.6 | -65.9 | -65.1 | -66.6 |
| Qwen2.5-VL-GR00T | 3.0 | -14.5 | -21.0 | -26.5 | -28.5 | -17.5 | -69.5 | -64.5 | -80.5 | -93.0 | -10.0 | -8.5 | -22.0 | -36.0 | -22.0 | -13.5 | -11.5 | -22.5 | -23.5 | -33.0 | -9.5 | -26.0 | -32.5 | -41.6 | -44.1 |
| Qwen2.5-VL-π | -16.5 | -22.5 | -31.0 | -26.5 | -29.0 | -57.5 | -76.5 | -79.0 | -83.0 | -83.5 | -22.0 | -27.0 | -27.5 | -35.5 | -39.0 | -34.5 | -27.0 | -39.0 | -37.0 | -35.5 | -32.6 | -38.3 | -44.1 | -45.5 | -46.8 |
| Qwen3-VL-OFT | 3.0 | -23.0 | -33.0 | -33.0 | -46.0 | -3.0 | -78.0 | -91.5 | -79.0 | -90.5 | -13.5 | -18.0 | -14.5 | -24.5 | -57.5 | -15.5 | -37.5 | -56.0 | -52.0 | -58.5 | -7.3 | -39.1 | -48.8 | -47.1 | -63.1 |
| Qwen3-VL-FAST | -25.5 | -65.5 | -70.0 | -69.0 | -63.5 | -32.0 | -80.0 | -82.5 | -78.5 | -81.0 | -26.5 | -63.0 | -68.5 | -64.5 | -61.0 | -34.5 | -74.5 | -84.5 | -82.5 | -76.5 | -29.6 | -70.8 | -76.4 | -73.6 | -70.5 |
| Qwen3-VL-GR00T | 1.0 | -19.5 | -23.0 | -26.0 | -34.0 | -2.0 | -87.0 | -84.0 | -75.0 | -92.5 | -9.0 | -14.0 | -34.0 | -25.0 | -36.0 | -14.5 | -11.5 | -14.0 | -18.0 | -29.5 | -6.1 | -33.0 | -38.8 | -36.0 | -48.0 |
| Qwen3-VL-π | -6.0 | -22.5 | -18.5 | -26.5 | -36.5 | -10.5 | -90.0 | -87.0 | -81.5 | -91.0 | -10.0 | -33.5 | -47.5 | -42.5 | -42.0 | -22.0 | -17.5 | -25.5 | -36.5 | -52.0 | -12.1 | -40.9 | -44.6 | -46.8 | -55.4 |

Table 2: Multilingual performance on SimplerEnv across four tasks, with the average performance across tasks shown in the last group. Relative results are shown with respect to the original English instructions. ∅ denotes evaluation without any instructions. Avg. denotes the average performance under each language across the four tasks. Percentage signs are omitted for readability. Column groups: Spoon = Put Spoon on Towel, Carrot = Put Carrot on Plate, Stack = Stack Green Block on Yellow Block, Eggplant = Put Eggplant in Yellow Basket.

| Model | Spoon zh | Spoon fr | Spoon ru | Spoon ar | Spoon ∅ | Carrot zh | Carrot fr | Carrot ru | Carrot ar | Carrot ∅ | Stack zh | Stack fr | Stack ru | Stack ar | Stack ∅ | Eggplant zh | Eggplant fr | Eggplant ru | Eggplant ar | Eggplant ∅ | Avg. zh | Avg. fr | Avg. ru | Avg. ar | Avg. ∅ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-OFT | -33.3 | -25.0 | -25.0 | -20.8 | -37.5 | -20.8 | -16.7 | -16.7 | -20.8 | -33.3 | -8.3 | -8.3 | -8.3 | -8.3 | -8.3 | -58.3 | -50.0 | -50.0 | -58.3 | -87.5 | -30.2 | -25.0 | -25.0 | -27.1 | -41.7 |
| Qwen2.5-VL-FAST | -29.2 | -12.5 | -37.5 | -29.2 | -79.2 | -4.2 | -12.5 | -16.7 | -50.0 | -50.0 | -12.5 | -12.5 | -29.2 | -29.2 | -37.5 | -37.5 | 0.0 | -54.2 | 12.5 | -83.3 | -20.8 | -9.4 | -34.4 | -24.0 | -62.5 |
| Qwen2.5-VL-GR00T | 0.0 | -1.0 | -72.9 | -76.0 | -84.4 | -1.0 | -13.5 | -29.2 | -54.2 | -51.0 | -8.3 | -23.6 | -37.5 | -41.7 | -41.7 | -10.4 | -11.5 | -49.0 | 8.3 | -66.7 | -4.9 | -12.4 | -47.1 | -40.9 | -60.9 |
| Qwen2.5-VL-π | 12.5 | 4.2 | -62.5 | -79.2 | -79.2 | -16.7 | -33.3 | -25.0 | -58.3 | -79.2 | -12.5 | -16.7 | -12.5 | -29.2 | -29.2 | 20.8 | 8.3 | 25.0 | -50.0 | -62.5 | 1.0 | -9.4 | -18.8 | -54.2 | -62.5 |

Table 3: Code-switching performance on LIBERO across four suites, with the average performance across suites shown in the last group. Relative results are shown with respect to the original English instructions. Avg. denotes the average performance under each language across the four suites. Percentage signs are omitted for readability.

| Model | Long zh | Long fr | Long ru | Long ar | Goal zh | Goal fr | Goal ru | Goal ar | Object zh | Object fr | Object ru | Object ar | Spatial zh | Spatial fr | Spatial ru | Spatial ar | Avg. zh | Avg. fr | Avg. ru | Avg. ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA-OFT | -14.5 | -10.5 | -10.0 | -4.0 | -73.0 | -45.0 | -53.5 | -50.5 | -2.5 | 0.0 | -1.0 | -1.5 | -11.0 | -1.5 | -5.0 | -12.0 | -25.25 | -14.25 | -17.38 | -17.00 |
| $\pi_{0.5}$ | -4.0 | -2.5 | -8.5 | -12.0 | -17.0 | -28.0 | -32.0 | -33.5 | -13.0 | -1.5 | -6.5 | -18.5 | -21.0 | -16.5 | -20.0 | -23.5 | -13.75 | -12.12 | -16.75 | -21.88 |
| ABot-M0 | -0.5 | -1.5 | -13.0 | -12.5 | -22.0 | -27.0 | -37.5 | -40.0 | -6.5 | -1.5 | -1.0 | -6.0 | 1.0 | -10.5 | -6.0 | -6.5 | -7.00 | -10.13 | -14.38 | -16.25 |
| Cosmos Policy | -15.5 | -6.5 | -3.0 | -11.0 | -50.0 | -21.0 | -36.0 | -61.0 | -25.0 | -2.5 | -5.0 | -20.5 | -40.0 | -9.0 | -21.5 | -30.0 | -32.63 | -9.75 | -16.38 | -30.63 |
| Qwen2.5-VL-OFT | -0.5 | -15.0 | -21.5 | -20.5 | -11.5 | -20.5 | -32.0 | -58.0 | -11.5 | -3.0 | -1.5 | -19.0 | -28.0 | -22.5 | -33.5 | -51.5 | -12.88 | -15.25 | -22.13 | -37.25 |
| Qwen2.5-VL-FAST | -20.0 | -47.5 | -46.5 | -43.0 | -20.5 | -63.0 | -51.5 | -53.0 | -30.0 | -43.0 | -37.5 | -43.5 | -30.5 | -74.0 | -30.5 | -71.0 | -25.25 | -56.88 | -41.50 | -52.63 |
| Qwen2.5-VL-GR00T | -6.5 | -9.0 | -15.0 | -14.0 | -17.0 | -26.0 | -27.0 | -35.5 | -9.0 | -4.5 | -12.0 | -19.0 | -18.5 | -13.5 | -20.5 | -20.0 | -12.75 | -13.25 | -18.63 | -22.13 |
| Qwen2.5-VL-π | -15.0 | -14.5 | -24.5 | -24.0 | -42.5 | -55.0 | -49.5 | -55.5 | -13.5 | -13.5 | -9.5 | -23.0 | -35.5 | -27.0 | -36.0 | -37.5 | -26.63 | -27.50 | -29.88 | -35.00 |
| Qwen3-VL-OFT | 1.0 | -18.5 | -7.0 | -16.5 | -3.0 | -19.5 | -34.0 | -31.0 | -10.5 | -0.5 | -5.0 | -10.5 | -14.5 | -30.5 | -43.0 | -52.0 | -6.75 | -17.25 | -22.25 | -27.50 |
| Qwen3-VL-FAST | -26.0 | -63.5 | -59.5 | -60.0 | -23.0 | -55.0 | -52.0 | -46.5 | -37.0 | -43.5 | -44.5 | -55.0 | -37.0 | -62.0 | -79.0 | -77.0 | -30.75 | -56.00 | -58.75 | -59.63 |
| Qwen3-VL-GR00T | 1.5 | -14.5 | -11.5 | -20.5 | 1.0 | -30.5 | -35.0 | -30.0 | -8.5 | -0.5 | -0.5 | -8.0 | -14.5 | -8.0 | -16.0 | -12.5 | -5.13 | -13.38 | -15.75 | -17.75 |
| Qwen3-VL-π | -4.0 | -11.0 | -14.5 | -14.5 | -1.0 | -38.0 | -35.5 | -45.0 | -11.0 | -14.0 | -6.5 | -28.5 | -20.5 | -9.0 | -20.5 | -31.0 | -9.13 | -18.00 | -19.25 | -29.75 |

Table 4: Code-switching performance on SimplerEnv across four tasks, with the average performance across tasks shown in the last group. Relative results are shown with respect to the original English instructions. Avg. denotes the average performance under each language across the four tasks. Percentage signs are omitted for readability. Column groups are abbreviated as in Table 2.

| Model | Spoon zh | Spoon fr | Spoon ru | Spoon ar | Carrot zh | Carrot fr | Carrot ru | Carrot ar | Stack zh | Stack fr | Stack ru | Stack ar | Eggplant zh | Eggplant fr | Eggplant ru | Eggplant ar | Avg. zh | Avg. fr | Avg. ru | Avg. ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-OFT | -33.3 | -25.0 | -25.0 | -37.5 | -20.8 | -12.5 | -25.0 | -20.8 | -4.2 | -8.3 | -4.2 | -8.3 | -62.5 | -41.7 | -70.8 | -83.3 | -30.2 | -21.9 | -31.3 | -34.4 |
| Qwen2.5-VL-FAST | 8.3 | -16.7 | 4.2 | -58.3 | -12.5 | -4.2 | -12.5 | -12.5 | -20.8 | -16.7 | -16.7 | -37.5 | 12.5 | -16.7 | 12.5 | 4.2 | -3.1 | -13.5 | -3.1 | -26.0 |
| Qwen2.5-VL-GR00T | -9.4 | -13.5 | 3.1 | -51.0 | -1.0 | -17.7 | -13.5 | -17.7 | -25.0 | -37.5 | -33.3 | -41.7 | 4.2 | -4.2 | 12.5 | -4.2 | -7.8 | -18.2 | -7.8 | -28.6 |
| Qwen2.5-VL-π | 4.2 | 6.3 | 8.3 | -45.8 | -12.5 | -33.3 | -16.7 | -29.2 | -16.7 | -25.0 | -4.2 | -25.0 | 4.2 | 0.0 | 12.5 | 0.0 | -9.4 | -13.0 | 0.0 | -25.0 |

We provide the relative performance of all models across different languages and multilingual settings in Table [1](https://arxiv.org/html/2606.15714#S4.T1), Table [2](https://arxiv.org/html/2606.15714#S4.T2), Table [3](https://arxiv.org/html/2606.15714#S4.T3), and Table [4](https://arxiv.org/html/2606.15714#S4.T4). Results with no instruction are also included as a reference to show the performance of models without any language input. We analyze the results from three perspectives and provide our findings for each perspective. Additional results are included in Appendix [10.1](https://arxiv.org/html/2606.15714#S10.SS1).

Finding 1: Multilingual gaps are driven by the language source of the base VLM. We analyze the relative performance of each language compared to English for each model, and we observe a consistent performance gap between English and non-English instructions across almost all models in both LIBERO and SimplerEnv environments.
We first divide the models into two groups based on their base VLMs: (1) $\pi_{0.5}$ and OpenVLA-OFT are based on PaliGemma (steiner2024paligemma) and Prismatic VLM (karamcheti2024prismatic), respectively, and Cosmos Policy is a world-model-based policy; all three are trained on English-centric data; (2) ABot-M0 and the Qwen-VL-based VLAs are based on Qwen-VL (bai2025qwen2;bai2025qwen3), a multilingual VLM trained on English, Chinese, and additional multilingual data. The first group shows significant performance drops on non-English instructions, close to the performance of the same models given no instructions. This suggests that these models struggle to understand and follow instructions in non-English languages, likely due to the lack of multilingual training data in their base VLMs. The second group shows better multilingual performance, especially on Chinese instructions. This suggests that the multilingual training data in Qwen-VL has enabled ABot-M0 and other Qwen-VL-based VLAs to generalize well to Chinese instructions, despite the absence of explicit training on Chinese instructions. Together, these two groups indicate that the multilingual training data in the base VLM can significantly impact the multilingual performance of the resulting VLA models, and that models trained on English-centric data may struggle to generalize to instructions in other languages.

Finding 2: Near-identical visual input amplifies the multilingual gap. We observe that the multilingual gap is more pronounced in the LIBERO-Goal suite than in the other suites. This is because LIBERO-Goal presents almost the same visual input across different tasks, which makes the language input more critical for task success. We therefore consider language understanding a key factor in the performance gap across languages, as also noted in previous work (lian2026langforce).
Finding 3: Key words in instructions can help reduce the multilingual gap. We compare the performance of models in the multilingual instruction setting and the code-switching instruction setting, where instructions contain several key verbs or nouns in English. We observe that the performance drop in the code-switching setting is generally smaller than that in the multilingual instruction setting across all models and environments. This suggests that the presence of English key words in the instructions can provide helpful cues for the models to better understand and follow the instructions, even when the overall instruction is in a different language. This finding highlights that key words in instructions are crucial for the language understanding of VLA models.

Finding 4: The action head plays an essential role in multilingual generalization. We compare the multilingual performance of models with different action head designs, including OFT-style, FAST-style, π-style, and GR00T-style action heads. With the same base VLM, we observe that models with GR00T-style and π-style action heads exhibit better multilingual performance compared to models with other action head designs. In contrast, models with FAST-style action heads show worse multilingual performance. This is because multilingual instructions shift the distribution of the representations from the base VLM, as supported by the visualization in Section [4.4](https://arxiv.org/html/2606.15714#S4.SS4), and the design of the action head can significantly affect the model's ability to adapt to this distribution shift. GR00T-style and π-style action heads generate actions through a diffusion transformer, which can retain the semantic information from the base VLM and better adapt to the distribution shift caused by multilingual instructions.
While FAST\-style action heads directly use FAST tokenizers to process the output of the base VLM, which may lead to a loss of semantic information and worse multilingual performance\. We also observe OFT\-style action heads sometimes perform in between\. We think that the MLP module in the OFT\-style action head cannot effectively adapt to the distribution shift caused by multilingual instructions\. Overall, this finding suggests that the design of action heads can play a crucial role in the multilingual generalization of VLA models, and that GR00T\-style andπ\\pi\-style action heads with diffusion transformers are more effective in handling multilingual instructions\. ### 4\.3Analysis We further analyze the multilingual gap from two perspectives: \(1\) how the multilingual gap reflects model behavior, and \(2\) where the multilingual gap comes from\. #### 4\.3\.1Model Behavior Reflects the Multilingual Gap \(a\)Behavior of Qwen3\-VL\-π\\piwith the instruction in French: "turn on the stove" \(b\)Behavior of Qwen2\.5\-FAST with the instruction in Chinese: "pick up the cream cheese and place it in the basket" Figure 2:Behavioral analysis of the failure cases under multilingual instructions\. Case \(a\) shows a failure case on LIBERO\-Goal, where the model confuses the task of "turn on the stove" with "put the bowl on the plate"\. Case \(b\) shows a failure case on LIBERO\-object, where the model recognizes the instruction but fails to execute the correct action\.In this section, we are going to analyze the relationship between the multilingual gap and the model behavior\. We analyze multiple failure cases under multilingual instructions and find two key failure modes\. The first failure mode is that the model fails to understand the instruction\. This failure mode will lead to the model completely misinterpreting the instruction and generating confusing actions\. 
Meanwhile, similar visual input may further exacerbate the confusion, as is the case in LIBERO-Goal. We provide an example in Figure [2a](https://arxiv.org/html/2606.15714#S4.F2.sf1), where Qwen3-VL-$\pi$ fails to understand the instruction "turn on the stove" in French and, because of the similar visual input, misinterprets it as "put the bowl on the plate". The second failure mode is that the model recognizes the instruction but fails to execute the correct action. This failure mode suggests that the model may have some understanding of the instruction but still struggles to generate the correct action. We provide an example in Figure [2b](https://arxiv.org/html/2606.15714#S4.F2.sf2), where Qwen2.5-FAST recognizes the instruction "pick up the cream cheese and place it in the basket" in Chinese and generates an action related to picking up the cream cheese, but fails to complete the full task of picking it up and placing it in the basket. These two failure modes suggest that the multilingual gap is reflected in both the language understanding and the action generation of VLA models. We provide more failure cases in Appendix [10.2](https://arxiv.org/html/2606.15714#S10.SS2), which further support our analysis.

#### 4.3.2 Representation Shift Leads to Multilingual Gap

Figure 3: Average pooled embeddings from the middle layer of Qwen3-VL-$\pi$ and $\pi_{0.5}$. The same instruction in different languages is connected with polygon lines.

In this section, we further analyze the reason behind the multilingual gap. Figure [3](https://arxiv.org/html/2606.15714#S4.F3) visualizes the average pooled embeddings from the middle layer of Qwen3-VL-$\pi$ and $\pi_{0.5}$, with the same instruction in different languages connected by polygon lines.
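As a rough illustration of the pooling step behind this kind of visualization, the following NumPy sketch averages token states over non-padding positions and compares each pooled embedding with a reference embedding. The function names, the toy arrays, and the padding-mask handling are assumptions made for illustration; the source does not specify which layer counts as the middle layer or how padding is treated.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token states over non-padding positions.

    hidden: (batch, seq, d) hidden states from one layer (hypothetical input).
    mask:   (batch, seq) with 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)
    # Clip the token count so an all-padding row does not divide by zero.
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)


def cosine_to_reference(pooled, ref):
    """Row-wise cosine similarity between pooled embeddings and references."""
    num = (pooled * ref).sum(axis=-1)
    den = np.linalg.norm(pooled, axis=-1) * np.linalg.norm(ref, axis=-1)
    return num / den


# Toy example: two "instructions", three token positions, two features.
hidden = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],  # last position is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(hidden, mask)
self_similarity = cosine_to_reference(pooled, pooled)
cross_similarity = cosine_to_reference(pooled[:1], pooled[1:])
```

In practice the pooled vectors for the same instruction in each language would be compared against the English one, and a low similarity would correspond to the long polygon edges in the figure.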
We observe that the embeddings of English and Chinese instructions are relatively close in Qwen3-VL-$\pi$, while the embeddings of other languages are far away from English and Chinese. In $\pi_{0.5}$, the embeddings of all non-English instructions are far away from English. The similarity of the principal components among different languages correlates well with the performance gap discussed in Section [4.2](https://arxiv.org/html/2606.15714#S4.SS2). The better multilingual performance of Qwen3-VL-$\pi$ can be attributed to the better cross-lingual alignment between English and Chinese, which may help the model generalize to other languages as well. We also provide multiple suite-wise visualizations for different models in Appendix [10.3](https://arxiv.org/html/2606.15714#S10.SS3), which further support our analysis.

### 4.4 Multilingual Performance Enhancement

Table 5: Multilingual performance of different training strategies on LIBERO across four suites. Absolute performance is shown. Bold and underline denote the best and second-best performance under each language, respectively. Avg. denotes the average performance under each language across the four suites. Percentage sign is omitted for better readability.

| Models | Long en | Long zh | Long fr | Long ru | Long ar | Goal en | Goal zh | Goal fr | Goal ru | Goal ar | Object en | Object zh | Object fr | Object ru | Object ar | Spatial en | Spatial zh | Spatial fr | Spatial ru | Spatial ar | Avg. en | Avg. zh | Avg. fr | Avg. ru | Avg. ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GR00T | 92.5 | 93.5 | 73.0 | 69.5 | 66.5 | 97.5 | 95.5 | 10.5 | 13.5 | 22.5 | 97.0 | 88.0 | 83.0 | 63.0 | 72.0 | 94.0 | 79.5 | 82.5 | 80.0 | 76.0 | 95.3 | 89.1 | 62.3 | 56.5 | 59.3 |
| E-FT | 94.5 | 91.5 | 62.0 | 69.0 | 64.0 | 98.0 | 92.0 | 12.5 | 6.0 | 27.0 | 100.0 | 80.5 | 91.0 | 72.5 | 69.0 | 92.5 | 73.0 | 54.5 | 46.5 | 51.0 | 96.3 | 84.3 | 55.0 | 48.5 | 52.8 |
| M-FT | 93.5 | 90.0 | 78.5 | 77.5 | 84.5 | 98.0 | 97.5 | 23.5 | 23.0 | 43.0 | 98.5 | 87.5 | 74.5 | 59.0 | 59.5 | 91.5 | 75.5 | 84.0 | 69.0 | 64.0 | 95.4 | 87.6 | 65.1 | 57.1 | 62.8 |
| E-CT | 88.5 | 86.0 | 78.0 | 72.5 | 75.0 | 96.5 | 91.5 | 23.0 | 14.5 | 25.0 | 98.0 | 84.5 | 82.5 | 59.5 | 65.5 | 93.0 | 85.5 | 89.5 | 71.5 | 64.0 | 94.0 | 86.9 | 68.3 | 54.5 | 57.4 |
| M-CT | 80.0 | 82.5 | 78.5 | 78.0 | 83.5 | 96.5 | 94.5 | 81.5 | 67.0 | 86.5 | 98.5 | 87.0 | 96.0 | 94.0 | 94.5 | 95.5 | 89.0 | 93.5 | 93.5 | 93.0 | 92.6 | 88.3 | 87.4 | 83.1 | 89.4 |
| MPCA | 90.0 | 90.0 | 86.5 | 85.5 | 83.5 | 97.0 | 95.5 | 77.5 | 80.0 | 92.0 | 98.0 | 88.0 | 97.0 | 95.0 | 97.5 | 96.0 | 89.0 | 97.0 | 81.5 | 90.0 | 95.3 | 90.6 | 89.5 | 85.5 | 90.8 |

#### 4.4.1 Training Strategies Evaluation

Based on the findings in Section [4.2](https://arxiv.org/html/2606.15714#S4.SS2), we provide multiple variants to further improve the multilingual performance. To ensure a fair comparison, we introduce a COCO-VQA dataset provided by StarVLA (community2026starvla) for VLA cotraining, and augment the dataset with multilingual instructions through translation. The dataset contains 50K image-question-answer triplets, covering a wide range of visual concepts. The multilingual instructions are generated by translating the original English questions into Chinese, French, Russian, and Arabic using the Cloud Translation API. Following Findings 1 and 4, we choose Qwen3-VL as the base VLM, which has better multilingual performance (bai2025qwen3), and adopt the GR00T-style action head design for all models. We then fine-tune the models on the multilingual COCO-VQA dataset to improve multilingual understanding.
Two multilingual variants are trained with different fine-tuning strategies: (1) M-FT: multilingual fine-tuning of the base VLM first, followed by training the action head on the LIBERO dataset; (2) M-CT: cotraining the VLA model on both the multilingual COCO-VQA dataset and the LIBERO dataset. We also include two monolingual variants as baselines, denoted as E-FT and E-CT, which are trained with the original English COCO-VQA dataset. The absolute success rates of these variants are provided in Table [5](https://arxiv.org/html/2606.15714#S4.T5). The original performance of Qwen3-VL-GR00T is also included as a reference, denoted as GR00T. We analyze the results from three perspectives and provide our findings for each. (1) Compared to the original GR00T, M-FT performs better on multilingual instructions, which further confirms Finding 1 in Section [4.2](https://arxiv.org/html/2606.15714#S4.SS2) that the multilingual training data in the base VLM can impact the multilingual performance of the resulting VLA models. (2) The average performance of M-FT and M-CT under English and Chinese instructions is generally worse than that of the original GR00T. The multilingual gains therefore come at the cost of a performance drop on English instructions, a phenomenon that is also common in multilingual large language models (qin2025survey; marchisio2024does; dou2023multispider). (3) M-CT performs better than M-FT on multilingual instructions, which suggests that cotraining with both the multilingual COCO-VQA dataset and the LIBERO dataset helps the model better understand and follow multilingual instructions.
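The three observations can be checked with a little arithmetic on the Avg. columns of Table 5. The sketch below copies those values and averages the four non-English languages per variant. That single aggregate is an illustrative choice made here, not a metric reported in the paper.

```python
# Avg. columns from Table 5, in the order (zh, fr, ru, ar).
avg_non_english = {
    "GR00T": [89.1, 62.3, 56.5, 59.3],
    "E-FT": [84.3, 55.0, 48.5, 52.8],
    "M-FT": [87.6, 65.1, 57.1, 62.8],
    "E-CT": [86.9, 68.3, 54.5, 57.4],
    "M-CT": [88.3, 87.4, 83.1, 89.4],
}
# Avg. column for English from the same table.
avg_english = {"GR00T": 95.3, "E-FT": 96.3, "M-FT": 95.4, "E-CT": 94.0, "M-CT": 92.6}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Illustrative aggregate: mean success rate over the four non-English languages.
summary = {name: mean(values) for name, values in avg_non_english.items()}

for name, value in summary.items():
    print(f"{name}: non-English mean {value:.2f}, English {avg_english[name]:.1f}")
```

Under this aggregate, M-CT ranks above M-FT, which in turn ranks above the original GR00T, while the English average of M-CT is lower than that of GR00T, in line with observations (1) to (3).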
#### 4.4.2 Multilingual Principal Component Alignment

Inspired by the analysis in Section [4.3.2](https://arxiv.org/html/2606.15714#S4.SS3.SSS2) and Section [4.4.1](https://arxiv.org/html/2606.15714#S4.SS4.SSS1), we further propose a simple yet effective method based on M-CT to improve the multilingual performance by aligning the principal components of multilingual instruction embeddings, so that the multilingual instruction embeddings can better align with English instruction embeddings. Specifically, we first sample $n$ groups of samples from the multilingual COCO-VQA dataset, where each group contains the same instruction in different languages. For sample $i$ in language $j$, we compute the average pooled embedding $\mathbf{h}_{i,j}$ from the middle layer of the base VLM, and perform principal component analysis (PCA) on these embeddings to obtain the principal components $\mathbf{U}$:

$$\mathbf{U} = \text{PCA}\left(\left\{\mathbf{h}_{i,j} \mid i = 1, \dots, n;\ j \in \{\text{en}, \text{zh}, \text{fr}, \text{ru}, \text{ar}\}\right\}\right) \tag{2}$$

We update the principal components $\mathbf{U}$ every $k$ steps during training. We then use the cosine similarity to align the projection of the multilingual instruction embeddings on the principal components with that of English instructions. The loss function is defined as follows:

$$\mathcal{L} = \mathcal{L}_{\text{VLM}} + \mathcal{L}_{\text{VLA}} + \lambda \sum_{i=1}^{n} \sum_{j=1}^{m} \left(1 - \cos\left(\mathbf{U}^{T}\mathbf{h}_{i,j},\ \mathbf{U}^{T}\mathbf{h}_{i,\text{en}}\right)\right) \tag{3}$$

where $\mathcal{L}_{\text{VLM}}$ and $\mathcal{L}_{\text{VLA}}$ are the original loss functions for training the base VLM and the action head, respectively, $\lambda$ is a hyperparameter that controls the weight of the alignment loss, and $m$ is the number of languages.
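The alignment term can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: PCA is computed by an SVD of mean-centered embeddings, the raw embeddings are projected as written in the loss, English is assumed to sit at language index 0, and the random array stands in for pooled embeddings from the base VLM. The names `principal_components` and `alignment_loss` are hypothetical.

```python
import numpy as np


def principal_components(h, num_components):
    """Top principal directions of the pooled embeddings.

    h: (n, m, d) array with n instruction groups, m languages, d features.
    Returns U with shape (d, num_components).
    """
    flat = h.reshape(-1, h.shape[-1])
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_components].T


def alignment_loss(h, U, en_index=0, eps=1e-8):
    """Sum over groups i and languages j of 1 - cos(U^T h_ij, U^T h_i,en)."""
    z = h @ U                                # (n, m, r) projected embeddings
    z_en = z[:, en_index:en_index + 1, :]    # (n, 1, r) English projections
    dot = (z * z_en).sum(axis=-1)
    norm = np.linalg.norm(z, axis=-1) * np.linalg.norm(z_en, axis=-1)
    cos = dot / (norm + eps)
    return float((1.0 - cos).sum())


# Hypothetical stand-in for pooled embeddings: 16 groups, 5 languages, 32 features.
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 5, 32))

U = principal_components(h, num_components=8)
loss = alignment_loss(h, U)
```

In training, this scalar would be multiplied by $\lambda$ and added to the VLM and VLA losses, with `principal_components` re-run every $k$ steps. A differentiable framework would be needed for the gradient, which this NumPy sketch does not provide.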
The insight of this method is to encourage the most significant components of the multilingual instruction embeddings to be aligned with those of English instruction embeddings, while allowing the less significant components to capture other information. We denote the model trained with this method as Multilingual Principal Component Alignment (MPCA). The performance is provided in Table [5](https://arxiv.org/html/2606.15714#S4.T5), which shows that MPCA can further improve the multilingual performance compared to M-CT. We also provide an ablation study on PCA in Appendix [10.4](https://arxiv.org/html/2606.15714#S10.SS4), which further supports the effectiveness of MPCA.

## 5 Conclusion

In this work, we present a systematic study of multilingual instruction following in VLA models. Through a comprehensive evaluation and extensive experiments across multiple VLA models and robot benchmarks, we reveal a significant multilingual gap that has been largely overlooked in the field. Our analysis shows that this gap is reflected in both instruction understanding and action execution, and that representation shifts caused by multilingual instructions may contribute to it. Inspired by these findings, we propose Multilingual Principal Component Alignment (MPCA), a simple yet effective approach that leverages PCA-based representation alignment to improve multilingual performance in VLAs.

## Acknowledgements

We thank the StarVLA team for providing a solid foundation for our evaluation. We are also grateful to the authors of the benchmarks and models used in our experiments for openly sharing their work with the research community.

## References

## 6 Limitations

While our work provides valuable insights into the multilingual capabilities of VLA models and proposes an effective approach to improve multilingual performance, there are several limitations that should be acknowledged.
First, our evaluation is limited to a specific set of VLA models and robot benchmarks, which may not fully capture the diversity of real\-world scenarios and applications\. Future work could explore a wider range of models and benchmarks to further validate our findings\. Second, while MPCA shows promising results in improving multilingual performance, it is a relatively simple approach that may not fully address all the challenges associated with multilingual instruction following\. More sophisticated methods that consider the nuances of different languages and their interactions with visual and action representations could be explored in future research\. Third, our analysis primarily focuses on the representation shift caused by multilingual instructions, but other factors, such as data quality, may also contribute to the multilingual gap\. A more comprehensive analysis that considers these factors could provide deeper insights into the underlying causes of the multilingual gap and inform more effective solutions\. ## 7Broader Impacts Our work has several broader impacts that are important to consider\. First, by systematically studying multilingual instruction following in VLA models, we highlight the importance of multilingual capabilities in embodied AI systems\. This can encourage researchers and practitioners to prioritize multilingual performance when developing and deploying VLA models, leading to more inclusive and accessible technologies that can serve a wider range of users across different languages and cultures\. Second, our findings regarding the multilingual gap and its underlying causes can inform the design of future VLA models and training strategies, potentially leading to more robust and effective multilingual instruction following capabilities\. 
This can have positive impacts on the usability and effectiveness of VLA systems in real-world applications, such as assistive robotics, where users may interact with robots in their native languages. Third, our proposed MPCA approach offers a practical solution for improving multilingual performance in VLAs, which can be adopted by researchers and practitioners to enhance the multilingual robustness of their models. This can contribute to the development of more versatile and adaptable VLA systems that can better serve diverse user needs. However, it is also important to acknowledge potential risks associated with multilingual VLA systems, such as the possibility of biased performance across different languages or the misuse of multilingual capabilities in harmful ways. Future research should continue to explore these risks and develop strategies to mitigate them, ensuring that the benefits of multilingual VLA systems are realized while minimizing potential harms. Overall, our work contributes to advancing the field of embodied AI towards more inclusive and multilingual-aware systems, with the potential to positively impact a wide range of applications and users.

## 8 Dataset Construction Details

In this section, we provide detailed information about the construction of our multilingual instruction-following dataset.

Multilingual Instruction Construction. We use the v2 version of the Cloud Translation API (https://translation.googleapis.com/language/translate/v2) to translate the original English instructions into the target languages.

Code-switching Instruction Construction. To construct code-switching instructions, we leverage the multilingual capabilities of LLMs to identify key verbs and nouns in both the original English instructions and their translated versions.
We then substitute the corresponding key phrases in the translated instructions with those from the original English instructions, resulting in code-switched instructions that contain mixed-language expressions. Another LLM is used to evaluate the semantic consistency between the original English instructions and the code-switching instructions, ensuring that the generated instructions maintain their intended meaning. The specific prompt templates used for code-switching instruction generation and evaluation are provided below. We use gpt-5.2-20251211 for both generation and evaluation, with temperature set to 0.2.

Code-Switching Prompt Template

You are a code-mixing assistant for multilingual instruction augmentation. Given English tokens [EN] and target language tokens [LT], your task is to generate a code-mixed variant of [LT] by replacing 1–3 nouns or verbs with tokens from [EN]. Follow these rules strictly:

Procedure

- Identify tokens in [LT] with POS $\in \{\text{NOUN, VERB}\}$.
- Uniformly sample $k \in \{1, 2, 3\}$ tokens.
- Replace each sampled token with a token from [EN].
- Keep all other tokens unchanged.

Constraints

- Do not introduce tokens outside [EN].
- Do not modify tokens outside [LT].
- Preserve word order and fluency.

Output Format

- The output must start with `###`.
- Format strictly as: `###<Modified target language instruction>`
- Do not include explanations or extra text.
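Some of the template's structural rules can also be checked without an LLM. The sketch below is a hypothetical helper, not part of the described pipeline: it parses the `###` output format and verifies that a candidate differs from [LT] in one to three positions, each filled with a token from [EN]. It assumes whitespace tokenization (which would need a segmenter for Chinese), does not check part-of-speech tags, and uses an invented French rendering of the instruction purely as an example.

```python
def parse_generation(raw):
    """Return the instruction following the '###' marker, or None if malformed."""
    raw = raw.strip()
    if not raw.startswith("###"):
        return None
    return raw[3:].strip()


def check_code_mixed(lt_tokens, en_tokens, candidate_tokens, max_swaps=3):
    """Check the positional replacement rules of the prompt template.

    The candidate must have the same length as [LT] (word order preserved),
    and every changed position must hold a token taken from [EN].
    """
    if len(candidate_tokens) != len(lt_tokens):
        return False
    en_vocab = set(en_tokens)
    swaps = 0
    for original, new in zip(lt_tokens, candidate_tokens):
        if original == new:
            continue
        if new not in en_vocab:
            return False  # introduces a token outside [EN]
        swaps += 1
    return 1 <= swaps <= max_swaps


# Hypothetical example; the French wording is illustrative only.
en = "turn on the stove".split()
lt = "allume la cuisinière".split()

candidate = parse_generation("### allume la stove")
is_valid = candidate is not None and check_code_mixed(lt, en, candidate.split())
```

A check like this could act as a cheap pre-filter, leaving the semantic-consistency judgment to the evaluation prompt.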
Code-Switching Evaluation Template

You are a strict evaluator of code-mixed instructions for a multilingual VLA dataset. Given English tokens [EN] and target language tokens [LT], your task is to evaluate whether a code-mixed instruction [CANDIDATE] satisfies the following criteria.

Rules the [CANDIDATE] MUST satisfy:

- The candidate equals [LT] except 1–3 tokens are replaced with English words from [EN].
- Replaced tokens are NOUN or VERB in [LT].
- No tokens are introduced that are absent from both [LT] and [EN] (e.g. extra English articles or particles like 'the', 'a').
- Word order is preserved.
- Semantic meaning is identical to [EN].

Think briefly (1–3 short sentences), then output the verdict on the LAST line strictly as `### YES` or `### NO`. Nothing after the verdict line.

## 9 Experiment Details

### 9.1 Model Details

In this section, we provide detailed information about the models evaluated in our experiments. The models we include fall into two groups: (1) models trained primarily on English instructions, including OpenVLA-OFT (kim2025fine), $\pi_{0.5}$ (intelligence2025pi\_), and Cosmos Policy (kim2026cosmos); (2) Qwen-VL-based VLAs with different action head designs (community2026starvla), which are evaluated to understand the impact of multilingual VLM backbones and architectural choices on multilingual performance. The specific details of each model are provided below:

- OpenVLA-OFT (kim2025fine) is a VLA model that utilizes LLaMA-2-7B as the backbone and is trained on a large-scale dataset. It employs an Optimized Fine-Tuning (OFT) approach to enhance the capabilities of the model. We use the official checkpoint finetuned on LIBERO for evaluation, which is available at [https://huggingface.co/moojink/openvla-7b-oft-finetuned-libero-spatial-object-goal-10](https://huggingface.co/moojink/openvla-7b-oft-finetuned-libero-spatial-object-goal-10).
- $\pi_{0.5}$ (intelligence2025pi\_) is a VLA model that leverages a diffusion-based continuous action expert to generate actions. We use the official checkpoint finetuned on LIBERO for evaluation, which is available at [https://storage.googleapis.com/openpi-assets/checkpoints/pi05\_libero](https://storage.googleapis.com/openpi-assets/checkpoints/pi05_libero).
- ABot-M0 (yang2026abot) is a VLA model that incorporates a manifold-based action head to enhance the model's ability to generate actions. We use the official checkpoint finetuned on LIBERO for evaluation, which is available at [https://huggingface.co/acvlab/ABot-M0-LIBERO](https://huggingface.co/acvlab/ABot-M0-LIBERO).
- Cosmos Policy (kim2026cosmos) is a world-model-based policy that utilizes a video-based world model to predict future states and generate actions. We use the official checkpoint finetuned on LIBERO for evaluation, which is available at [https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B).
- Qwen-VL-based VLAs (community2026starvla) are a series of VLA models that utilize Qwen-VL as the backbone and incorporate different action head designs. We evaluate multiple variants of these models to understand the impact of multilingual VLM backbones and architectural choices on multilingual performance. We use the official checkpoints finetuned on LIBERO or SimplerEnv for evaluation, which are available at [https://huggingface.co/StarVLA](https://huggingface.co/StarVLA).

### 9.2 Training Details

In this section, we provide training details of MPCA and other multilingual variants. For MPCA, E-CT, and M-CT, we use a batch size of 2 per GPU and train for 50K steps on 8 A800 GPUs or higher-end GPUs. The learning rate is set to 2.5e-5, with a cosine learning rate schedule. We use the AdamW optimizer with no weight decay.
For MPCA, we set the number of principal components to 128 and the alignment weight to 0.01. The principal components are updated every 1000 steps. For E-CT and M-CT, we use the same training data and hyperparameters as MPCA, which consists of multilingual instruction variants generated from the original English instructions in the respective benchmarks. For E-FT and M-FT, we keep the same training settings as MPCA in the VLA training stage. Note, however, that we use a batch size of 8 per GPU for E-FT and M-FT in the VLM fine-tuning stage, which differs from the batch size used in MPCA. E-FT and M-FT are trained for 30K steps in the VLM fine-tuning stage and 50K steps in the VLA training stage.

## 10 Extended Experimental Results and Analysis

### 10.1 Extended Main Results

We further provide the absolute performance of all models in the multilingual instruction setting in Table [6](https://arxiv.org/html/2606.15714#S10.T6) and Table [7](https://arxiv.org/html/2606.15714#S10.T7). We also provide the absolute code-switching performance in Table [8](https://arxiv.org/html/2606.15714#S10.T8) and Table [9](https://arxiv.org/html/2606.15714#S10.T9).

Table 6: Multilingual performance on LIBERO across four suites. Absolute performance is shown. ∅ denotes evaluation without any instructions.
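Percentage sign is omitted for better readability.

| Models | Long en | Long zh | Long fr | Long ru | Long ar | Long ∅ | Goal en | Goal zh | Goal fr | Goal ru | Goal ar | Goal ∅ | Object en | Object zh | Object fr | Object ru | Object ar | Object ∅ | Spatial en | Spatial zh | Spatial fr | Spatial ru | Spatial ar | Spatial ∅ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA-OFT | 94.5 | 89.0 | 89.5 | 80.5 | 79.0 | 76.5 | 97.5 | 64.0 | 9.5 | 8.0 | 10.5 | 10.0 | 100.0 | 95.5 | 81.0 | 88.5 | 91.5 | 91.5 | 93.5 | 90.0 | 84.5 | 86.0 | 78.0 | 80.0 |
| $\pi_{0.5}$ | 93.0 | 73.0 | 76.0 | 69.5 | 72.0 | 70.0 | 94.0 | 16.0 | 9.5 | 5.5 | 8.5 | 9.0 | 99.0 | 66.5 | 71.0 | 67.5 | 65.5 | 64.5 | 97.5 | 69.5 | 75.0 | 74.0 | 72.0 | 66.0 |
| ABot-M0 | 94.5 | 89.0 | 89.5 | 80.5 | 79.0 | 76.5 | 97.5 | 64.0 | 9.5 | 8.0 | 10.5 | 10.0 | 100.0 | 95.5 | 81.0 | 88.5 | 91.5 | 91.5 | 93.5 | 90.0 | 84.5 | 86.0 | 78.0 | 80.0 |
| Cosmos Policy | 98.0 | 78.5 | 94.5 | 83.5 | 77.5 | 76.0 | 96.5 | 7.5 | 50.0 | 15.0 | 9.0 | 9.0 | 99.5 | 61.5 | 97.5 | 71.5 | 61.5 | 58.5 | 96.5 | 57.5 | 80.5 | 65.5 | 56.5 | 55.5 |
| Qwen2.5-VL: OFT | 89.5 | 82.5 | 67.5 | 63.5 | 54.0 | 56.6 | 97.5 | 79.5 | 26.0 | 26.0 | 7.5 | 5.5 | 98.5 | 79.5 | 90.0 | 64.0 | 56.5 | 75.1 | 91.5 | 64.5 | 65.0 | 42.5 | 22.5 | 66.5 |
| Qwen2.5-VL: FAST | 88.5 | 66.0 | 37.5 | 33.5 | 36.5 | 36.0 | 91.0 | 71.0 | 7.5 | 11.0 | 7.5 | 8.0 | 96.5 | 66.5 | 50.5 | 49.5 | 48.0 | 50.0 | 86.5 | 55.5 | 4.5 | 5.0 | 10.0 | 2.0 |
| Qwen2.5-VL: GR00T | 93.5 | 96.5 | 79.0 | 72.5 | 67.0 | 65.0 | 97.0 | 79.5 | 27.5 | 32.5 | 16.5 | 4.0 | 99.5 | 89.5 | 91.0 | 77.5 | 63.5 | 77.5 | 95.5 | 82.0 | 84.0 | 73.0 | 72.0 | 62.5 |
| Qwen2.5-VL: $\pi$ | 89.0 | 72.5 | 66.5 | 58.0 | 62.5 | 60.0 | 94.5 | 37.0 | 18.0 | 15.5 | 11.5 | 11.0 | 99.0 | 77.0 | 72.0 | 71.5 | 63.5 | 60.0 | 88.0 | 53.5 | 61.0 | 49.0 | 51.0 | 52.5 |
| Qwen3-VL: OFT | 91.0 | 94.0 | 68.0 | 58.0 | 58.0 | 45.0 | 99.0 | 96.0 | 21.0 | 7.5 | 20.0 | 8.5 | 100.0 | 86.5 | 82.0 | 85.5 | 75.5 | 42.5 | 92.5 | 77.0 | 55.0 | 36.5 | 40.5 | 34.0 |
| Qwen3-VL: FAST | 83.0 | 57.5 | 17.5 | 13.0 | 14.0 | 19.5 | 87.5 | 55.5 | 7.5 | 5.0 | 9.0 | 6.5 | 98.0 | 71.5 | 35.0 | 29.5 | 33.5 | 37.0 | 89.0 | 54.5 | 14.5 | 4.5 | 6.5 | 12.5 |
| Qwen3-VL: GR00T | 92.5 | 93.5 | 73.0 | 69.5 | 66.5 | 58.5 | 97.5 | 95.5 | 10.5 | 13.5 | 22.5 | 5.0 | 97.0 | 88.0 | 83.0 | 63.0 | 72.0 | 61.0 | 94.0 | 79.5 | 82.5 | 80.0 | 76.0 | 64.5 |
| Qwen3-VL: $\pi$ | 96.5 | 90.5 | 74.0 | 78.0 | 70.0 | 60.0 | 97.0 | 86.5 | 7.0 | 10.0 | 15.5 | 6.0 | 99.5 | 89.5 | 66.0 | 52.0 | 57.0 | 57.5 | 93.0 | 71.0 | 75.5 | 67.5 | 56.5 | 41.0 |

Table 7: Multilingual performance on SimplerEnv across four tasks. Absolute performance is shown. ∅ denotes evaluation without any instructions. Percentage sign is omitted for better readability. Task columns: Spoon = Put Spoon on Towel; Carrot = Put Carrot on Plate; Stack = Stack Green Block on Yellow Block; Eggplant = Put Eggplant in Yellow Basket.

| Models | Spoon en | Spoon zh | Spoon fr | Spoon ru | Spoon ar | Spoon ∅ | Carrot en | Carrot zh | Carrot fr | Carrot ru | Carrot ar | Carrot ∅ | Stack en | Stack zh | Stack fr | Stack ru | Stack ar | Stack ∅ | Eggplant en | Eggplant zh | Eggplant fr | Eggplant ru | Eggplant ar | Eggplant ∅ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL: OFT | 37.5 | 4.2 | 12.5 | 12.5 | 16.7 | 0.0 | 33.3 | 12.5 | 16.7 | 16.7 | 12.5 | 0.0 | 8.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 87.5 | 29.2 | 37.5 | 37.5 | 29.2 | 0.0 |
| Qwen2.5-VL: FAST | 79.2 | 50.0 | 66.7 | 41.7 | 50.0 | 0.0 | 50.0 | 45.8 | 37.5 | 33.3 | 0.0 | 0.0 | 37.5 | 25.0 | 25.0 | 8.3 | 8.3 | 0.0 | 83.3 | 45.8 | 83.3 | 29.2 | 95.8 | 0.0 |
| Qwen2.5-VL: GR00T | 84.4 | 84.4 | 83.3 | 11.5 | 8.3 | 0.0 | 55.2 | 54.2 | 41.7 | 26.0 | 1.0 | 4.2 | 41.7 | 33.3 | 18.1 | 4.2 | 0.0 | 0.0 | 66.7 | 56.3 | 55.2 | 17.7 | 75.0 | 0.0 |
| Qwen2.5-VL: $\pi$ | 79.2 | 91.7 | 83.3 | 16.7 | 0.0 | 0.0 | 79.2 | 62.5 | 45.8 | 54.2 | 20.8 | 0.0 | 29.2 | 16.7 | 12.5 | 16.7 | 0.0 | 0.0 | 62.5 | 83.3 | 70.8 | 87.5 | 12.5 | 0.0 |

Table 8: Code-switching performance on LIBERO across four suites. Absolute performance is shown. Percentage sign is omitted for better readability.

| Models | Long zh | Long fr | Long ru | Long ar | Goal zh | Goal fr | Goal ru | Goal ar | Object zh | Object fr | Object ru | Object ar | Spatial zh | Spatial fr | Spatial ru | Spatial ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA-OFT | 79.5 | 83.5 | 84.0 | 90.0 | 26.0 | 54.0 | 45.5 | 48.5 | 97.0 | 99.5 | 98.5 | 98.0 | 81.5 | 91.0 | 87.5 | 80.5 |
| $\pi_{0.5}$ | 89.0 | 90.5 | 84.5 | 81.0 | 77.0 | 66.0 | 62.0 | 60.5 | 86.0 | 97.5 | 92.5 | 80.5 | 76.5 | 81.0 | 77.5 | 74.0 |
| ABot-M0 | 94.0 | 93.0 | 81.5 | 82.0 | 75.5 | 70.5 | 60.0 | 57.5 | 93.5 | 98.5 | 99.0 | 94.0 | 94.5 | 83.0 | 87.5 | 87.0 |
| Cosmos Policy | 82.50 | 91.50 | 95.00 | 87.00 | 46.50 | 75.50 | 60.50 | 35.50 | 74.50 | 97.00 | 94.50 | 79.00 | 56.50 | 87.50 | 75.00 | 66.50 |
| Qwen2.5-VL: OFT | 89.0 | 74.5 | 68.0 | 69.0 | 86.0 | 77.0 | 65.5 | 39.5 | 87.0 | 95.5 | 97.0 | 79.5 | 63.5 | 69.0 | 58.0 | 40.0 |
| Qwen2.5-VL: FAST | 68.5 | 41.0 | 42.0 | 45.5 | 70.5 | 28.0 | 39.5 | 38.0 | 66.5 | 53.5 | 59.0 | 53.0 | 56.0 | 12.5 | 56.0 | 15.5 |
| Qwen2.5-VL: GR00T | 87.0 | 84.5 | 78.5 | 79.5 | 80.0 | 71.0 | 70.0 | 61.5 | 90.5 | 95.0 | 87.5 | 80.5 | 77.0 | 82.0 | 75.0 | 75.5 |
| Qwen2.5-VL: $\pi$ | 74.0 | 74.5 | 64.5 | 65.0 | 52.0 | 39.5 | 45.0 | 39.0 | 85.5 | 85.5 | 89.5 | 76.0 | 52.5 | 61.0 | 52.0 | 50.5 |
| Qwen3-VL: OFT | 92.0 | 72.5 | 84.0 | 74.5 | 96.0 | 79.5 | 65.0 | 68.0 | 89.5 | 99.5 | 95.0 | 89.5 | 78.0 | 62.0 | 49.5 | 40.5 |
| Qwen3-VL: FAST | 57.0 | 19.5 | 23.5 | 23.0 | 64.5 | 32.5 | 35.5 | 41.0 | 61.0 | 54.5 | 53.5 | 43.0 | 52.0 | 27.0 | 10.0 | 12.0 |
| Qwen3-VL: GR00T | 94.0 | 78.0 | 81.0 | 72.0 | 98.5 | 67.0 | 62.5 | 67.5 | 88.5 | 96.5 | 96.5 | 89.0 | 79.5 | 86.0 | 78.0 | 81.5 |
| Qwen3-VL: $\pi$ | 92.5 | 85.5 | 82.0 | 82.0 | 96.0 | 59.0 | 61.5 | 52.0 | 88.5 | 85.5 | 93.0 | 71.0 | 72.5 | 84.0 | 72.5 | 62.0 |

Table 9: Code-switching performance on SimplerEnv across four tasks. Absolute performance is shown. Percentage sign is omitted for better readability. Task columns: Spoon = Put Spoon on Towel; Carrot = Put Carrot on Plate; Stack = Stack Green Block on Yellow Block; Eggplant = Put Eggplant in Yellow Basket.

| Models | Spoon zh | Spoon fr | Spoon ru | Spoon ar | Carrot zh | Carrot fr | Carrot ru | Carrot ar | Stack zh | Stack fr | Stack ru | Stack ar | Eggplant zh | Eggplant fr | Eggplant ru | Eggplant ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL: OFT | 4.2 | 12.5 | 12.5 | 12.5 | 12.5 | 20.8 | 8.3 | 12.5 | 4.2 | 0.0 | 4.2 | 0.0 | 25.0 | 45.8 | 16.7 | 4.2 |
| Qwen2.5-VL: FAST | 87.5 | 62.5 | 83.3 | 20.8 | 37.5 | 45.8 | 37.5 | 37.5 | 16.7 | 20.8 | 20.8 | 0.0 | 95.8 | 66.7 | 95.8 | 87.5 |
| Qwen2.5-VL: GR00T | 75.0 | 70.8 | 87.5 | 33.3 | 54.2 | 37.5 | 41.7 | 37.5 | 16.7 | 4.2 | 8.3 | 0.0 | 70.8 | 62.5 | 79.2 | 62.5 |
| Qwen2.5-VL: $\pi$ | 83.3 | 85.4 | 87.5 | 33.3 | 66.7 | 45.8 | 62.5 | 50.0 | 12.5 | 4.2 | 25.0 | 4.2 | 66.7 | 62.5 | 75.0 | 62.5 |

### 10.2 Extended Model Behavior Analysis

(a) Behavior of Qwen2.5-FAST with the instruction in French: "pick up the BBQ sauce and place it in the basket"

(b) Behavior of OpenVLA-OFT with the instruction in Chinese: "open the middle drawer of the cabinet"

(c) Behavior of ABot-M0 with the instruction in Chinese: "open the top drawer and put the bowl inside"

(d) Behavior of Qwen3-GR00T with the instruction in Russian: "put both moka pots on the stove"

(e) Behavior of Qwen2.5-VL-GR00T with the instruction in Arabic: "put the white mug on the left plate and put the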
yellow and white mug on the right plate"

(f) Behavior of Qwen2.5-VL-$\pi$ with the instruction in Arabic: "stack green block on yellow block"

(g) Behavior of Qwen2.5-VL-FAST with the instruction in Russian: "put the spoon on the towel"

Figure 4: Behavioral analysis of the failure cases under multilingual instructions. Cases (a)-(e) show failure cases on LIBERO. Cases (f) and (g) show failure cases on SimplerEnv. Cases (a)-(c) show failures where the model fails to understand the instruction and thus performs wrong actions. Cases (d)-(g) show failures where the model recognizes the instruction but fails to execute the correct action.

We provide additional behavioral analysis of the failure cases under multilingual instructions in Figure [4](https://arxiv.org/html/2606.15714#S10.F4). The observations confirm our previous analysis in Section [4.3.1](https://arxiv.org/html/2606.15714#S4.SS3.SSS1) that the performance drop in the multilingual instruction setting stems not only from the language understanding capability of the model, but also from its execution capability under multilingual instructions. In cases (d)-(g), the models recognize the objects or the spatial relationships in the instruction, but fail to execute the correct actions. This indicates that even if the model can understand the instruction, it may still struggle to execute the correct actions under multilingual instructions, which further highlights the challenges of multilingual instruction following in embodied tasks. In cases (a)-(c), the models fail to understand the instruction and thus perform wrong actions, which is a more direct failure and underlines the importance of improving the language understanding capability of the model under multilingual instructions.
### 10.3 Extended Representation Shift Analysis

Figure 5: Average pooled embeddings from the middle layer of the model under English instructions and multilingual instructions. The representation shift is observed across multiple models and suites, which indicates that the representation shift under multilingual instructions is a common phenomenon across different models and different suites.

We provide additional visualizations of the middle-layer representation shift across all suites in LIBERO, as shown in Figure [5](https://arxiv.org/html/2606.15714#S10.F5). The results show that the representation shift under multilingual instructions is a common phenomenon across different models and different suites, which further confirms our previous analysis in Section [4.3.2](https://arxiv.org/html/2606.15714#S4.SS3.SSS2).

### 10.4 Extended Ablation Study of MPCA

Table 10: Ablation study of MPCA. Multilingual performance of different variants of MPCA on LIBERO across four suites. Absolute performance is shown. Bold denotes the best performance under each language. Avg. denotes the average performance under each language across the four suites. Percentage sign is omitted for better readability.

| Models | Long en | Long zh | Long fr | Long ru | Long ar | Goal en | Goal zh | Goal fr | Goal ru | Goal ar | Object en | Object zh | Object fr | Object ru | Object ar | Spatial en | Spatial zh | Spatial fr | Spatial ru | Spatial ar | Avg. en | Avg. zh | Avg. fr | Avg. ru | Avg. ar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o PCA | 82.5 | 85.5 | 86.5 | 82.0 | 79.0 | 93.5 | 93.5 | 79.5 | 69.0 | 87.5 | 97.0 | 87.5 | 97.5 | 87.0 | 97.5 | 95.5 | 85.0 | 97.0 | 91.0 | 93.5 | 92.1 | 87.9 | 90.1 | 82.3 | 89.4 |
| MPCA | 90.0 | 90.0 | 86.5 | 85.5 | 83.5 | 97.0 | 95.5 | 77.5 | 80.0 | 92.0 | 98.0 | 88.0 | 97.0 | 95.0 | 97.5 | 96.0 | 89.0 | 97.0 | 81.5 | 90.0 | 95.3 | 90.6 | 89.5 | 85.5 | 90.8 |

We provide an ablation study of MPCA in Table [10](https://arxiv.org/html/2606.15714#S10.T10) to further analyze the contribution of the PCA projection. The w/o PCA variant removes the PCA projection and computes the cosine similarity directly in the original feature space. The results show that the PCA projection in MPCA can help improve the performance under multilingual instructions.
Since the principal components obtained by PCA can capture the most important variance in the data, projecting the features onto these components and aligning them in the subspace can help the optimization process focus on the most important features and thus improve the performance under multilingual instructions, instead of treating all features equally in the original feature space\. The results further confirm the effectiveness of MPCA in improving the multilingual performance by addressing the representation shift under multilingual instructions\.
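A toy example may help make the difference between the two variants concrete. In the sketch below, the 3-D vectors and the choice of the first two axes as "principal components" are purely hypothetical. The two embeddings agree on the retained axes and differ only along the discarded one, so the subspace cosine treats them as aligned while the full-space cosine does not.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-D feature space; the first two axes stand in for the principal components.
U = np.eye(3)[:, :2]

h_en = np.array([1.0, 2.0, 0.0])
h_fr = np.array([1.0, 2.0, 3.0])  # differs from h_en only along the discarded axis

full_space = cosine(h_fr, h_en)            # comparison used by the w/o PCA variant
subspace = cosine(U.T @ h_fr, U.T @ h_en)  # comparison after projecting onto U
```

Here the subspace term would contribute no alignment penalty, leaving the discarded direction free to carry other information, whereas the full-space term would still penalize the pair.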