mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?
Summary
Introduces mmPISA-bench, a compact multilingual reasoning benchmark derived from PISA, and evaluates proprietary LLMs across 43 languages, finding that they reason effectively with some performance variations, and that machine-translated questions do not degrade accuracy.
# mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?
Source: [https://arxiv.org/html/2606.07069](https://arxiv.org/html/2606.07069)
Yerzhan Sapenov, Independent Scholar, ysapenov@gmail.com; Jaromir Savelka, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, jsavelka@cs.cmu.edu
###### Abstract
We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.
## 1 Introduction
Large language models (LLMs) have demonstrated strong reasoning capabilities, yet their reasoning ability in many languages remains comparatively under-explored. Despite substantial investments in multilingual modeling, today’s LLM ecosystem is still largely dominated by English [wu-2025-bitter]. Persistent challenges include limited data for many languages, uneven performance across language communities, and tokenizer-induced disparities that can affect both effectiveness and cost [qin-2025-survey]. As a result, reasoning in diverse languages remains at an early stage of evaluation and understanding [ghosh-2025-survey].
To study reasoning in many languages in a controlled setting, we leverage the OECD Programme for International Student Assessment (PISA), which provides a rigorous framework for assessing student competencies and collects contextual data to explain performance differences [oecd-2023-pisa22-framework]. PISA is a worldwide study that measures the proficiency of 15-year-old students in reading, mathematics, and science. Crucially, PISA items undergo extensive translation and validation procedures to ensure cross-country comparability. This includes localization workflows (adaptation, translation, and validation), explicit translatability assessment, and reconciliation practices designed to reduce language-specific artifacts and improve equivalence across versions [oecd-2024-pisa-techreport]. These properties make PISA questions a high-quality source for evaluating whether LLM performance on questions requiring reasoning is stable across languages, rather than being confounded by low-quality or inconsistent translations.
We investigate the following research questions:
1. (RQ1) How stable are frontier LLMs in answering questions requiring reasoning across many different languages?
2. (RQ2) How does model performance on machine-translated questions compare to performance on official human translations?
3. (RQ3) Does reasoning length vary systematically across languages, and how does this relate to accuracy?
We release a dataset of 2,150 questions requiring reasoning drawn from PISA (25 multiple-choice questions in 43 languages with official human and matched machine translations). Further, we provide an analysis of stability and reasoning effort across human and machine translations in the 43 covered languages for selected frontier LLMs.
## 2 Related Work
#### PISA-based evaluation of LLMs.
PISA questions have only been used sparingly for LLM evaluation, primarily because most items are not publicly available and access is restricted to a subset of released questions. [takami-2023-PISA-japanese] evaluated ChatGPT using PISA multiple-choice questions, but limited the comparison to English and Japanese. [basaran-2025-PISA-reading] focused exclusively on English reading items to assess reading proficiency. The most extensive prior use of PISA is PISA-Bench [haller-2025-pisa], which adapts PISA questions for evaluating vision language models; however, their benchmark relies on English source items that are machine-translated into five additional languages. In contrast, our work focuses on text-only reasoning and leverages official human translations across 43 languages.
#### Massively multilingual benchmarks.
A broad range of benchmarks has been proposed to evaluate multilingual capabilities of LLMs. Global-MMLU [singh-2025-globalMMLU] extends MMLU to 42 languages with a culturally sensitive subset, though a portion of the data is machine-translated. M3Exam [zhang-2023-m3exam] evaluates models across multiple modalities and difficulty levels in nine languages. MMLU-ProX [xuan-2025-mmluprox] emphasizes reasoning complexity using translated items in 29 languages, while BUFFET [asai-2024-buffet] unifies 15 tasks across 54 languages using machine-translated instructions. GlotEval [luo-2025-gloteval] provides a framework for integrating and comparing results from 27 specialized multilingual benchmarks.
Several benchmarks target specific skills or modalities. Belebele [bandarkar-2024-belebele] focuses on reading comprehension in 122 language variants based on FLORES-200 [costajussa-2024-nllb]. [shi-2023-reasoners] introduce a multilingual grade-school mathematics benchmark in 10 languages, demonstrating chain-of-thought reasoning beyond English. mSTEB [beyene-2025-msteb] combines text and speech evaluation across many languages using FLORES-200 and FLEURS [conneau-2022-fleurs]. In addition, AI Language Proficiency Monitor [pomerenke-2025-monitor] aggregates results across multiple multilingual benchmarks to track model progress over time. Prior work has also examined trade-offs between accuracy and the language used for reasoning [qi-2025-accuracy-cost], as well as approaches to improve non-English reasoning efficiency [huang-2024-mindmerger] or to disentangle language processing from reasoning [zhao-2025-less]. Finally, identifying and covering low-resource languages remains an active area of research, with recent progress reaching over a thousand languages [kargaran-2023-glotlid].
Compared to these benchmarks, our dataset emphasizes high-quality, fully human-translated questions with explicit difficulty levels, enabling direct comparison to the performance of 15-year-old students across 43 languages.
#### Reasoning length and cost across languages.
Recent studies indicate that multilingual differences in tokenization create systematic disparities in token counts across languages [petrov-2023-tokenizers]. For models that externalize their reasoning, these disparities manifest not only as higher cost but also as differences in *reasoning length*, the amount of generated intermediate text used to justify an answer. Such variation matters because longer reasoning traces increase inference cost and may reflect different internal strategies or difficulty in a given language. While prior work has emphasized the cost implications of token inflation [ahia-2023-cost], we center our analysis on reasoning length itself and show that some languages elicit longer, more expensive reasoning while still exhibiting reduced accuracy.
## 3 Dataset
The collected dataset (https://github.com/ysapenov/mmPISA-bench) comprises 25 multiple-choice reasoning questions represented in 43 languages, derived from publicly available materials from the OECD Programme for International Student Assessment (PISA). Specifically, the collection includes 11 mathematics items from PISA 2022 and 14 reading comprehension items from PISA 2018 [oecd-2026-pisaexamples].
All publicly accessible PISA items were manually reviewed across available assessment years by the authors. The questions were hand-selected according to the following criteria:
1. availability of a broad set of official language translations;
2. exclusive reliance on textual information, excluding items requiring images or interactive components;
3. multiple-choice format, excluding items that require evaluation of free-form responses.
These constraints resulted in the inclusion of reading items from PISA 2018 and mathematics items from PISA 2022. Only languages for which complete translations existed for both reading and mathematics questions were retained. Question texts were obtained by structured scraping of interactive, language-specific subpages in [oecd-2026-pisaexamples], followed by manual verification against the original English sources to ensure textual fidelity.
PISA items are annotated with eight difficulty levels, comprising six major levels, with level 1 further subdivided into three sublevels (1a, 1b, and 1c). The two assessed competencies are defined by PISA as follows. Mathematics is defined as students’ capacity to reason mathematically and to formulate, employ, and interpret mathematics to solve problems in a variety of real-world contexts, encompassing concepts, procedures, facts, and tools to describe, explain, and predict phenomena. Reading is defined as students’ capacity to understand, use, evaluate, reflect on, and engage with written texts in order to achieve goals, develop knowledge and potential, and participate in society [oecd-2023-pisa22vol1].
This massively multilingual dataset enables systematic evaluation of LLM performance on questions requiring reasoning across 43 languages. It also supports analysis of machine translation effects, as models can be evaluated on both official human translations and machine-translated variants of the same questions. For mathematics items, the availability of rationales further enables the construction of auxiliary or derived reasoning tasks. As shown in Section [5](https://arxiv.org/html/2606.07069#S5), model performance on machine-translated questions is not lower than on human-translated versions, suggesting that large-scale machine translation could be used to extend the dataset to hundreds of additional languages and to probe the breadth of multilingual reasoning capabilities.
Because each item is decomposed into context, question, and answer options with line-level consistency across languages, the dataset also enables controlled experiments involving mixed-language inputs at the component or line level. In the present experiments, models were not explicitly informed of the input language, leaving language identification implicit; providing such information may represent a potential avenue for improving performance. Finally, the dataset supports extensions that increase task difficulty, such as introducing adversarial or incorrect answer options to study robustness across languages [goral-2025-wrong-option].
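The component-level decomposition described above could support mixed-language prompts along the following lines. This is only a minimal sketch: the `Item` class, its field names, the `mix_items` helper, and the sample strings are hypothetical and are not taken from the released dataset's schema or from actual PISA text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Item:
    """One question in one language (hypothetical schema, not the dataset's)."""
    language: str
    context: str
    question: str
    options: Tuple[str, ...]


def mix_items(context_src: Item, question_src: Item, options_src: Item) -> str:
    """Assemble a prompt whose components come from different language versions."""
    option_counts = {len(i.options) for i in (context_src, question_src, options_src)}
    if len(option_counts) != 1:
        raise ValueError("all language versions must have the same number of options")
    letters = "ABCDEFGH"
    option_lines = [f"{letters[i]}. {text}" for i, text in enumerate(options_src.options)]
    return "\n".join([context_src.context, question_src.question, *option_lines])


# Illustrative placeholder content only.
en = Item("en", "A shop sells pens.", "How many pens are sold?", ("Two", "Three"))
de = Item("de", "Ein Laden verkauft Stifte.", "Wie viele Stifte werden verkauft?", ("Zwei", "Drei"))

# English context and options, German question.
prompt = mix_items(context_src=en, question_src=de, options_src=en)
print(prompt)
```

Keeping each component as a separate field is what makes such swaps a one-line change; a line-level variant would split `context` further in the same way.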
## 4 Experimental Design
Across all experiments, we issued a total of 107,500 API calls to the evaluated LLMs. Unless explicitly stated otherwise, temperature and all other model parameters were kept at their defaults.
The total number of evaluations is given by:

$$
\begin{split}
&107{,}500 \text{ data points} = \\
&\; 25 \text{ questions} \times 43 \text{ languages} \\
&\times 2 \text{ models} \times 2 \text{ translation types} \\
&\times 5 \text{ reasoning effort levels} \times 5 \text{ repetitions}.
\end{split}
\tag{1}
$$
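The product in Equation (1) can be checked by enumerating the full experimental grid. The sketch below uses placeholder indices and labels for questions, languages, and effort levels (these labels are illustrative assumptions); only the factor sizes come from the text.

```python
from itertools import product

questions = range(1, 26)                  # 25 questions
languages = range(1, 44)                  # 43 languages (placeholder indices)
models = ("claude", "gpt")                # 2 models
translation_types = ("human", "machine")  # 2 translation types
effort_levels = range(5)                  # 5 reasoning effort configurations
repetitions = range(5)                    # 5 independent runs

# Every combination corresponds to one API call.
grid = list(product(questions, languages, models,
                    translation_types, effort_levels, repetitions))

# The released dataset counts each question once per language and translation type.
dataset_size = len(questions) * len(languages) * len(translation_types)

print(len(grid))      # 107500
print(dataset_size)   # 2150
```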
Among the 25 questions, 11 assess mathematical reasoning and 14 assess reading comprehension. With respect to difficulty, 10 questions are at levels 1–2, 6 at levels 3–4, and 9 at levels 5–6, following the official PISA difficulty annotations [oecd-2023-pisa22vol1].
We evaluated two proprietary frontier LLMs, OpenAI’s GPT and Anthropic’s Claude, under multiple reasoning effort settings. For GPT, we used the GPT-5.1-2025-11-13 model and evaluated five effort configurations: *none*, *none with double prompt*, *low*, *medium*, and *high*. The double-prompt, no-reasoning configuration was included to test the effect of prompt repetition, following the methodology of [leviathan-2025-prompt]. For Claude, the most recent Opus-4-5-20251101 model does not support disabling reasoning effort, so Haiku-4-5-20251001 was used to approximate the no-reasoning setting, while higher effort levels were evaluated using the Opus model.
Two translation conditions were considered. The first uses the human translations provided by PISA. The second uses machine-translated versions produced with Google Translate. Because English and French both serve as source languages in the original PISA materials, machine-translated English and French items were obtained by translating each language into the other.
Table 1: Comparative Language Performance Accuracy, %: Claude vs GPT. (a) The Haiku model was used for the no-reasoning case.
Each question–language–model–configuration combination was evaluated independently five times to assess answer stability and estimate accuracy under stochastic generation. The system prompts used for both models are provided in Appendix [B](https://arxiv.org/html/2606.07069#A2). These prompts were restricted to enforcing a uniform multiple-choice answer format. Aside from the system prompt, no additional instructions or contextual information were supplied to the models. All evaluations were conducted in a zero-shot setting. Cost is computed as the sum of input tokens multiplied by the input token price and output tokens multiplied by the output token price.
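The cost definition above amounts to a two-term weighted sum. A minimal sketch follows; the function name, the per-million-token price convention, and all numbers are illustrative assumptions and are not the prices or token counts used in the study.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in USD of one API call, assuming prices quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical token counts and prices, for illustration only.
cost = call_cost(input_tokens=1_200, output_tokens=800,
                 input_price_per_mtok=2.0, output_price_per_mtok=10.0)
print(f"{cost:.4f}")  # 0.0104
```

Because output tokens are typically priced higher than input tokens, languages that elicit longer reasoning can raise cost faster than input-side tokenization alone would suggest.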
## 5 Results
Subsections [5.1](https://arxiv.org/html/2606.07069#S5.SS1) and [5.3](https://arxiv.org/html/2606.07069#S5.SS3) report results on the original, human-translated questions only. Subsection [5.2](https://arxiv.org/html/2606.07069#S5.SS2) compares performance on original questions against their machine-translated counterparts.
### 5.1 RQ1: Consistency across languages
Table [1](https://arxiv.org/html/2606.07069#S4.T1) reports accuracy by language for both models under no-reasoning and high-reasoning settings. Both models’ performance varies somewhat across languages, and these differences persist across reasoning levels. While performance differences across languages clearly exist, the relatively bounded extent of the variation indicates reasonably capable multilingual reasoning behavior across the 43 studied languages.
### 5.2 RQ2: Human vs machine translation
Table [2](https://arxiv.org/html/2606.07069#S5.T2) compares accuracy and token usage between original human translations and machine-translated questions across reasoning effort levels. Overall, accuracy on machine-translated questions is not lower than on the original versions for either model, and in several configurations it is marginally higher. This suggests that machine translation does not introduce performance degradation for the evaluated reasoning tasks.
Table 2: Model accuracy in %, thousands of input and output tokens across model reasoning effort levels.
Table [3](https://arxiv.org/html/2606.07069#S5.T3) further breaks down results by PISA difficulty level. As expected, accuracy decreases with increasing difficulty, while average token usage per question increases. This trend is consistent across both original and machine-translated questions, indicating that difficulty effects dominate translation effects.
Table 3: Model accuracy in %, average token usage per question across difficulty levels.
Table [4](https://arxiv.org/html/2606.07069#S5.T4) shows results by question category. Both models perform better on reading comprehension than on mathematics, with similar patterns observed for original and machine-translated inputs. Token usage differs substantially between categories, particularly for the GPT model.
Table 4: Model accuracy in %, average token usage per question across categories.
### 5.3 RQ3: Reasoning length
Table [5](https://arxiv.org/html/2606.07069#S5.T5) summarizes the overall accuracy–cost trade-off. Claude achieves higher average accuracy, but at more than double the cost of GPT.
Table 5: Comparison of Model Accuracy and Cost.
Figure [1](https://arxiv.org/html/2606.07069#S5.F1) illustrates the relationship between input and output token usage across languages. Claude exhibits a very strong positive correlation between input and output tokens (0.950), whereas the relationship is much weaker for GPT (0.334). This indicates that Claude’s reasoning length scales closely with the length of the input across languages, while GPT exhibits more decoupled behavior in which longer prompts do not consistently result in longer generated reasoning.
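A correlation of this kind can be computed from per-language token totals as sketched below. The text does not state which coefficient was used; the sketch assumes Pearson's r, and the token values are made-up placeholders rather than the study's data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# One entry per language (hypothetical totals, in thousands of tokens).
input_tokens = [100.0, 150.0, 200.0, 270.0]
output_tokens = [210.0, 290.0, 410.0, 520.0]

r = pearson(input_tokens, output_tokens)
print(round(r, 3))
```

A value near 1 would correspond to the tightly coupled behavior reported for Claude, while a value near 0.3 would correspond to the weaker coupling reported for GPT.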


Figure 1: Claude (left) and GPT (right) input and output tokens.
Figure [2](https://arxiv.org/html/2606.07069#S5.F2) plots accuracy against cost by language. For both models, higher cost is generally associated with lower accuracy, yielding negative correlations (Claude: −0.484; GPT: −0.339). Importantly, cost reflects two distinct sources: tokenization-driven input inflation and variation in reasoning length (output tokens). For Claude, languages with higher input token counts also tend to elicit longer reasoning traces (Figure [1](https://arxiv.org/html/2606.07069#S5.F1)), amplifying cost and contributing to lower accuracy. In contrast, GPT exhibits weaker coupling between input length and output length, suggesting that cost–accuracy degradation cannot be attributed to tokenization alone but also reflects language-dependent differences in generated reasoning.


Figure 2: Claude (left) and GPT (right) accuracy (%) versus cost (\$).
### 5.4 Additional results
Figure [3](https://arxiv.org/html/2606.07069#S5.F3) compares accuracy between mathematics and reading questions. The correlation between category-specific accuracies for Claude is weak (0.210), and for GPT it is negative (−0.324), suggesting that performance on one category does not reliably predict performance on the other.


Figure 3: Claude (left) and GPT (right) accuracy (%) for Math and Reading categories.
Although LLM outputs are stochastic, qualitative inspection reveals that models can exhibit distinct reasoning behaviors across languages for the same question. Tables [6](https://arxiv.org/html/2606.07069#S5.T6), [7](https://arxiv.org/html/2606.07069#A3.T7), and [8](https://arxiv.org/html/2606.07069#A3.T8) show that the model’s reasoning is not necessarily carried out in the prompt language or in English. In some cases, the input text is implicitly translated from the prompt language into another language used for reasoning. Notably, for Kazakh prompts, Claude performs parts of its reasoning in Russian rather than English and may switch between languages within a single response. These behaviors were identifiable because one of the authors is fluent in Kazakh and Russian. Similar cross-language reasoning may occur in other languages but remains difficult to detect without native-language expertise.
Table 6: Comparison of Claude reasoning for the same question #23 in the Kazakh language across independent runs (same prompt, identical settings).
## 6 Discussion
This section interprets the empirical results in light of the three research questions, focusing on multilingual reasoning robustness, translation effects, and cost–accuracy trade-offs.
### 6.1 RQ1: Consistency across languages
The results indicate that leading LLMs can reason across all evaluated languages at a level comparable to that expected of 15-year-old students. Compared to earlier evaluations, cross-lingual performance gaps appear reduced, suggesting improved multilingual robustness in recent models [petrov-2023-tokenizers]. As shown in Table [1](https://arxiv.org/html/2606.07069#S4.T1), under high reasoning effort GPT achieves 100% accuracy only for German, whereas Claude reaches perfect accuracy in seven languages.
Claude’s accuracy spans 88%–100% under high effort and 80%–94.4% without explicit reasoning, corresponding to ranges of 12.0 and 14.4 percentage points, respectively. GPT exhibits narrower ranges: 90.4%–100% under high effort and 73.6%–84.8% without reasoning, corresponding to ranges of 9.6 and 11.2 percentage points. Thus, although Claude achieves higher average accuracy, GPT displays less variability across languages.
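The ranges quoted above follow directly from the reported minimum and maximum accuracies. A short check, using only the bounds stated in the text (the dictionary layout and names are illustrative):

```python
# (min, max) accuracy in % across languages, as reported in the text.
bounds = {
    ("Claude", "high"): (88.0, 100.0),
    ("Claude", "none"): (80.0, 94.4),
    ("GPT", "high"): (90.4, 100.0),
    ("GPT", "none"): (73.6, 84.8),
}

# Spread in percentage points, rounded to one decimal to absorb float noise.
spreads = {key: round(hi - lo, 1) for key, (lo, hi) in bounds.items()}

for (model, effort), spread in spreads.items():
    print(f"{model} ({effort}): {spread} percentage points")
```

Under both settings the GPT spread is smaller than the Claude spread (9.6 vs. 12.0 and 11.2 vs. 14.4), which is the basis for the variability comparison.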
### 6.2 RQ2: Human vs machine translation
Prompt repetition improves accuracy for GPT in the no-reasoning setting, as shown in Table [2](https://arxiv.org/html/2606.07069#S5.T2). In contrast, Claude Haiku does not benefit from double prompting, consistent with prior findings [leviathan-2025-prompt]. These results suggest that prompt repetition is model-dependent and primarily beneficial for architectures that do not expose internal reasoning by default.
Increasing question difficulty leads to lower accuracy and higher output token usage, as shown in Table [3](https://arxiv.org/html/2606.07069#S5.T3). Notably, higher-difficulty questions are often shorter in terms of input tokens, yet they elicit longer outputs, indicating more elaborate reasoning processes that more closely resemble human problem-solving behavior.
Table [4](https://arxiv.org/html/2606.07069#S5.T4) further shows category-specific differences. GPT uses substantially more output tokens for mathematics questions than for reading, whereas Claude’s output token usage is comparatively similar across categories. This suggests different internal strategies for handling numerical reasoning versus textual comprehension.
### 6.3 RQ3: Reasoning length
Token usage varies substantially across languages, corroborating earlier observations of disparities in tokenization premiums [petrov-2023-tokenizers]. For Claude, the highest tokenization premium is observed for Thai in Figure [1](https://arxiv.org/html/2606.07069#S5.F1), with factors of 2.71 for input tokens and 2.01 for output tokens relative to English. These values are lower than previously reported maxima for the cl100k_base tokenizer [petrov-2023-tokenizers], indicating partial mitigation of extreme token inflation. Crucially, we also observe systematic variation in reasoning length across languages, operationalized as output tokens under a fixed reasoning-effort setting. For some languages, models produce substantially longer rationales even when answering the same items.
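A tokenization premium of the kind discussed here is a ratio of token counts for the same content relative to a baseline language. The sketch below assumes English as the baseline, as in the text; the function name, language codes, and counts are hypothetical placeholders, not measurements from the study.

```python
from typing import Dict


def token_premium(tokens_by_language: Dict[str, int], baseline: str = "en") -> Dict[str, float]:
    """Token count of each language divided by the baseline language's count."""
    base = tokens_by_language[baseline]
    if base <= 0:
        raise ValueError("baseline token count must be positive")
    return {lang: count / base for lang, count in tokens_by_language.items()}


# Hypothetical total token counts for the same item set in three languages.
counts = {"en": 1000, "th": 2500, "de": 1200}
premiums = token_premium(counts)

for lang, premium in sorted(premiums.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {premium:.2f}x")
```

Applying the same function separately to input and output counts yields the two premiums reported per language, and comparing them shows whether longer inputs are accompanied by proportionally longer reasoning.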
For GPT, the largest input token premium occurs for Greek (1.84), while the largest output token premium is observed for Icelandic (1.51) in Figure [1](https://arxiv.org/html/2606.07069#S5.F1). For Claude, English consistently yields the lowest token usage, and input and output tokens are strongly correlated across languages, suggesting that languages that are “longer to read” also tend to elicit longer reasoning traces. In contrast, GPT shows weaker coupling between input and output length, indicating that cross-linguistic differences in generated reasoning verbosity are not fully explained by tokenization alone.
Figure [2](https://arxiv.org/html/2606.07069#S5.F2) illustrates the relationship between cost and accuracy. For both models, languages that are less costly to process tend to yield higher accuracy. For example, Claude incurs 2.22× higher cost on Thai than on English while achieving 5.1 percentage points lower accuracy. Similarly, GPT spends 1.69× more on Greek than on German, with a corresponding 5.3 percentage point accuracy decrease. These findings indicate that some languages are simultaneously more expensive and less accurate, reinforcing the importance of cost-aware multilingual evaluation. Taken together, these results suggest that multilingual evaluation should report not only accuracy but also reasoning length, since some languages systematically induce longer and sometimes less effective reasoning.
### 6.4 Additional results
Figure [3](https://arxiv.org/html/2606.07069#S5.F3) compares performance on mathematics and reading questions. The weak correlation between category-specific accuracies suggests that strong performance in one category does not necessarily transfer to the other. Notably, GPT achieves one of its lowest reading accuracies overall, yet performs strongly on reading questions in Chinese, highlighting language–category interactions that merit further investigation.
The qualitative examples in Tables [6](https://arxiv.org/html/2606.07069#S5.T6), [7](https://arxiv.org/html/2606.07069#A3.T7), and [8](https://arxiv.org/html/2606.07069#A3.T8) show that multilingual differences extend beyond accuracy and reasoning length to the linguistic behavior of model reasoning. In particular, Claude exhibits variation in reasoning structure across languages, including explicit translation during reasoning and cross-language reasoning. Notably, in some cases the model switches to a language different from both the input language and English (e.g., Russian when prompted in Kazakh), suggesting that intermediate reasoning may occur in a latent pivot language. These behaviors were identifiable only because one of the authors is fluent in Kazakh and Russian, highlighting a broader evaluation challenge: such phenomena may remain invisible without native-language expertise. This suggests that multilingual LLM evaluation would benefit from qualitative inspection in addition to aggregate accuracy metrics.
## 7 Conclusion
Our results show that leading proprietary LLMs are capable of reasoning across all 43 evaluated languages. While overall reasoning accuracy remains high, both performance and inference cost vary substantially across languages, reflecting differences in multilingual robustness and tokenization efficiency. These findings underscore the importance of evaluating LLM reasoning beyond English and of jointly considering accuracy and cost when assessing multilingual capabilities.
## 8 Future Work
Several directions follow naturally from this study. First, the dataset can be extended to a substantially larger number of languages by leveraging machine translation systems, such as Google Translate, which support more than 250 languages. Targeted human translations could be obtained for selected ultra low-resource languages to assess reasoning capabilities under extreme data scarcity. Second, future work may compare the observed trends with those of open-source LLMs, enabling analysis of how architectural choices and training regimes affect multilingual reasoning. Finally, a longitudinal analysis tracking multilingual reasoning performance of LLMs over the past several years would provide insight into the pace and nature of progress in this area.
## Acknowledgments
We thank the Organisation for Economic Co\-operation and Development \(OECD\) for conducting the PISA assessments worldwide and for providing open access to the test questions in numerous languages\. The second author acknowledges the generous support of the Carnegie Mellon\-Accenture Center of Excellence in AI\-Enabled Workforce Training \(ACE\-AI\)\. The content of this paper does not necessarily reflect the position or the policy of the funding organization, and no official endorsement should be inferred\.
## Appendix A Limitations
The evaluation in this study is limited to two proprietary large language models\. While these models are representative of current frontier systems, their training data, architectures, and inference mechanisms are not publicly documented\. Consequently, the observed patterns in multilingual reasoning, reasoning length, and cost may not generalize to other proprietary models or to open\-source models developed under different training regimes\.
In addition, the benchmark draws exclusively on publicly released PISA questions from specific assessment years \(PISA 2018 for reading and PISA 2022 for mathematics\)\. Although these items were selected to maximize language coverage and comparability, they represent only a subset of PISA competencies and formats\. In particular, the exclusion of constructed\-response items, visual prompts, and interactive tasks limits the scope of reasoning behaviors that can be evaluated\. Extending the benchmark to additional PISA cycles or complementary assessment frameworks would help capture a wider spectrum of multilingual reasoning skills\.
## Appendix B System Prompts
The system prompt for GPT models was "Reply format: \<LETTER\>"\. Interestingly, the same prompt did not work for Claude: GPT appears able to keep its reasoning tokens hidden, whereas Claude displays all of its tokens\. The system prompt for Claude models was therefore "Reply text inside the \<reasoning\> tags\. Output only the letter answer outside the tags\."
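
The paper does not describe how replies were scored, so the following is only a minimal sketch of how a reply produced under either prompt could be reduced to a single answer letter\. The function name `extract_answer`, the constant names, and the assumption that the text outside the reasoning tags is exactly one uppercase letter \(optionally wrapped in angle brackets\) are hypothetical and not taken from the authors' code\.

```python
import re
from typing import Optional

# System prompts as quoted in this appendix.
GPT_SYSTEM_PROMPT = "Reply format: <LETTER>"
CLAUDE_SYSTEM_PROMPT = (
    "Reply text inside the <reasoning> tags. "
    "Output only the letter answer outside the tags."
)

# Hypothetical helpers: match a complete <reasoning>...</reasoning> block,
# and a reply that consists of one uppercase letter, optionally in <>.
_REASONING_BLOCK = re.compile(
    r"<reasoning>.*?</reasoning>", re.DOTALL | re.IGNORECASE
)
_ANSWER = re.compile(r"<?([A-Z])>?")


def extract_answer(reply: str) -> Optional[str]:
    """Return the answer letter found outside any reasoning block.

    Returns None when the visible part of the reply is not a single
    letter, so malformed replies can be counted separately.
    """
    visible = _REASONING_BLOCK.sub(" ", reply).strip()
    match = _ANSWER.fullmatch(visible)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_answer("<reasoning>Option B fits the text.</reasoning>\nB"))
    print(extract_answer("<C>"))
    print(extract_answer("I am not sure"))
```

Returning `None` rather than guessing keeps format failures distinct from wrong answers, which matters when comparing accuracy across languages\.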
## Appendix C Additional Results
Table 7: Comparison of Claude reasoning for the same question \#11 in the Icelandic language across independent runs \(same prompt, identical settings\)\.

Table 8: Comparison of Claude reasoning for the same question \#14 in the Kazakh language across independent runs \(same prompt, identical settings\)\.