SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

arXiv cs.CL Papers

Summary

Introduces SPLIT, a 500-prompt benchmark evaluating LLM cross-lingual empathy and cultural grounding in English and Ukrainian. Findings show Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade in Ukrainian while DeepSeek-V3 remains stable, with weak agreement between human and AI evaluators on cultural dimensions.

arXiv:2607.02049v1 Announce Type: new

# Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses: Producing Ukrainian Text Is Not Equivalent to Producing Ukrainian Emotional Support
Source: [https://arxiv.org/html/2607.02049](https://arxiv.org/html/2607.02049)
\(July 2026\)

###### Abstract

Large Language Models are increasingly deployed in emotional\-support contexts and crisis\-related situations\. Nevertheless, their cross\-lingual abilities in these circumstances remain underexplored\. Existing benchmarks emphasize multilingual performance but rarely examine crisis\-related empathy and cultural grounding in low\-to\-mid\-resource languages\. We introduce SPLIT, a 500\-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension\. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding\. The framework aims to assess and compare the quality of LLM responses in both English and Ukrainian, as well as to explore the reliability of the LLM\-as\-a\-jury paradigm\. Our findings reveal that Gemini\-2\.5\-Flash and LLaMA\-3\.3\-70B\-Instruct degrade when transitioning to Ukrainian, while DeepSeek\-V3 remains comparatively stable within our benchmark\. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding\. We further argue that producing Ukrainian text is not equivalent to producing Ukrainian emotional support\. Our findings may assist in the future development of more culturally tailored benchmark designs, as well as encourage a stronger emphasis on human\-centered evaluation\.

## 1 Introduction

The realm of Large Language Models \(LLMs\) has been evolving rapidly in recent years, with several major breakthroughs\[[55](https://arxiv.org/html/2607.02049#bib.bib1),[5](https://arxiv.org/html/2607.02049#bib.bib2)\]taking place\. Crucially, the consistency of LLM responses remains highly questionable\[[60](https://arxiv.org/html/2607.02049#bib.bib3)\]due to apparent differences in training data and available benchmarks\. English and Chinese remain the primary sources of the vast majority of training data, creating performance variations in response generation in a range of low\- and mid\-resource languages\[[30](https://arxiv.org/html/2607.02049#bib.bib6)\]\. This inconsistency raises the question of whether LLMs are capable of preserving cultural nuances related to these languages\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\]and interpreting them precisely\.

![Refer to caption](https://arxiv.org/html/2607.02049v1/macro_average.png)

Figure 1: Cross\-lingual performance trajectories showing macro\-average human evaluation scores from EN to UA\.

Moreover, the question of LLMs’ ability to empathize with humans is not considered to be fully explored \[[48](https://arxiv.org/html/2607.02049#bib.bib13),[44](https://arxiv.org/html/2607.02049#bib.bib14)\]\. It is essential to bridge the gap between the way LLMs produce empathetic responses and what is considered accurate in modern psychology in terms of human capabilities and development \[[61](https://arxiv.org/html/2607.02049#bib.bib15),[25](https://arxiv.org/html/2607.02049#bib.bib16)\]\. As outlined in “Mind in Society” \[[57](https://arxiv.org/html/2607.02049#bib.bib9)\], early psychological frameworks asserted that mind “is a set of specific capabilities, each of which, to some extent, is independent of the others, and is developed independently\.” And while cognitive science has primarily switched to a more holistic evaluation of human interactive skills development \[[36](https://arxiv.org/html/2607.02049#bib.bib17)\], modern AI systems tend to reflect the first theory \[[29](https://arxiv.org/html/2607.02049#bib.bib18),[6](https://arxiv.org/html/2607.02049#bib.bib19),[51](https://arxiv.org/html/2607.02049#bib.bib20)\]\. The intelligence of LLMs has thereby been classified as multidimensional in a range of studies, suggesting that achieving accuracy with respect to their full potential requires rigorous experiments which include benchmarking a range of different parameters \[[33](https://arxiv.org/html/2607.02049#bib.bib4),[42](https://arxiv.org/html/2607.02049#bib.bib5),[61](https://arxiv.org/html/2607.02049#bib.bib15)\]\. Perception, cognition and interaction can be perceived as facets of the Emotional Intelligence \[[33](https://arxiv.org/html/2607.02049#bib.bib4)\] of LLMs\. This consequently posits the idea that high performance is required across these areas to ensure consistency in LLMs’ generated empathetic responses\.

Naturalness is widely regarded as a key variable in evaluating LLMs’ ability to communicate effectively\[[29](https://arxiv.org/html/2607.02049#bib.bib18),[6](https://arxiv.org/html/2607.02049#bib.bib19)\]with human beings as well, with this often being its primary task\. The ability to interpret and grasp human emotions and struggles is highly valuable in terms of understanding one’s predicament\[[42](https://arxiv.org/html/2607.02049#bib.bib5)\]and proposing effective measures for tackling issues regarding stress, anxiety or emotional exhaustion\. Empathetic grounding\[[1](https://arxiv.org/html/2607.02049#bib.bib8)\]is another primary term intersecting closely with LLMs’ ability to maintain decent interactions, acknowledging human struggles\. The idea of the importance of comprehending subtle cultural meanings may be directly justified by this concept, with emotional markers and idioms of distress varying in multiple languages\[[31](https://arxiv.org/html/2607.02049#bib.bib21),[37](https://arxiv.org/html/2607.02049#bib.bib22),[46](https://arxiv.org/html/2607.02049#bib.bib7)\]\.

Recent studies indicate\[[56](https://arxiv.org/html/2607.02049#bib.bib10)\]that lack of cultural understanding is a major bottleneck for the vast majority of current LLMs, impeding effective communication\. Another comprehensive study reveals\[[35](https://arxiv.org/html/2607.02049#bib.bib11)\]that there is a significant variation among empathetic responses produced by LLMs, due to apparent distinctions in demographics that largely influence models’ cultural understanding\.

LLMs demonstrate remarkable abilities in Natural Language Processing \(NLP\) across high\-resource languages, showing fluency, consistency, and accessibility\[[43](https://arxiv.org/html/2607.02049#bib.bib12)\]\. Conversely, their capabilities in low\-resource languages remain far from state\-of\-the\-art performance\[[16](https://arxiv.org/html/2607.02049#bib.bib23),[17](https://arxiv.org/html/2607.02049#bib.bib24),[34](https://arxiv.org/html/2607.02049#bib.bib25)\]across the three outlined dimensions\. Ukrainian has been widely considered a low\-resource language, with an apparent lack of digitalized benchmarks\[[53](https://arxiv.org/html/2607.02049#bib.bib26)\]\. Nevertheless, recent years have seen substantial growth in Ukrainian NLP resources, benchmarks, and language models\[[53](https://arxiv.org/html/2607.02049#bib.bib26)\]\.

The current study is aimed at evaluating the magnitude of the gap between English and Ukrainian NLP\[[24](https://arxiv.org/html/2607.02049#bib.bib27)\]as well as LLMs’ responses when providing empathetic grounding\[[1](https://arxiv.org/html/2607.02049#bib.bib8)\]to humans\. The objective of this study was motivated by the development and deployment of a multilingual Telegram Bot designed to support individuals experiencing Stress, Panic, Loneliness, Internal Displacement, or Tension\. During deployment, we observed qualitative differences between English and Ukrainian outputs, motivating a systematic study of whether multilingual LLMs preserve comparable levels of naturalness, cultural grounding, and empathetic consistency between these two languages\.

To investigate this performance gap, we introduce SPLIT \- a diverse 500\-prompt benchmark, aimed at crisis\-affected communication across five parameters: Stress, Panic, Loneliness, Internal Displacement and Tension\.

Therefore, the aim of this study is to provide answers to the research questions as follows:

RQ1: How do state\-of\-the\-art LLMs differ in empathetic response quality between English and Ukrainian crisis\-related scenarios?

RQ2: What linguistic and conversational discrepancies emerge when LLMs generate responses to English and Ukrainian crisis\-related scenarios?

RQ3: To what extent do LLM\-generated responses exhibit appropriate contextual and cultural grounding when addressing crisis scenarios in Ukrainian compared to English baselines?

RQ4: To what extent does automated LLM\-based evaluation agree with human assessment of empathetic conversational responses?

Figure [1](https://arxiv.org/html/2607.02049#S1.F1) illustrates the macro\-average scores across the Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding dimensions, capturing the aggregate trajectory of the current study\. A fine\-grained analysis of these results is further provided in the Results and Analysis section, where the exact Human Evaluation Baseline scores are detailed\.

## 2 Methodology

### 2\.1 Dataset Curation

Our SPLIT benchmark is intended to evaluate three technically diverse LLMs in 500 scenarios\. Hence, we establish a dataset of 500 distinct emotional support queries across 5 evaluation categories \- Stress, Panic, Loneliness, Internal Displacement, and Tension \- resulting in 100 prompts per category\. These specific categories are selected because they represent common psychosocial situations faced by crisis\-affected Ukrainians\.

Potential crisis\-affected queries are generated using a deliberately adjusted prompt\[[5](https://arxiv.org/html/2607.02049#bib.bib2)\]for an LLM such as GPT\-4o\. The NLP capabilities of this LLM and thus its ability to outperform a range of other LLMs in low\-resource languages\[[40](https://arxiv.org/html/2607.02049#bib.bib29),[39](https://arxiv.org/html/2607.02049#bib.bib28),[18](https://arxiv.org/html/2607.02049#bib.bib30),[47](https://arxiv.org/html/2607.02049#bib.bib31)\]reinforce the idea of it being a reliable prompt engineering source for this study\. In addition, it demonstrates capabilities in interpreting complex emotional and social interactions within text\-based scenarios\[[59](https://arxiv.org/html/2607.02049#bib.bib33)\]\. The prompts are generated in English and Ukrainian simultaneously, with machine translation being the main source\[[16](https://arxiv.org/html/2607.02049#bib.bib23),[32](https://arxiv.org/html/2607.02049#bib.bib32)\]of identically translated data in Ukrainian\.

Nevertheless, to ensure accuracy in Natural Language Generation \(NLG\), the prompts are subjected to rigorous testing \[[44](https://arxiv.org/html/2607.02049#bib.bib14),[17](https://arxiv.org/html/2607.02049#bib.bib24)\] by a native Ukrainian speaker with certified C2 English proficiency on the Cambridge scale\. This check is performed on a randomized 15% \($n=75$\) sample of the total 500 prompts according to the established holistic verification standards \[[29](https://arxiv.org/html/2607.02049#bib.bib18),[49](https://arxiv.org/html/2607.02049#bib.bib59)\]\.
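The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the prompt identifiers, the seed, and the function names are hypothetical, and simple random sampling without replacement is assumed because the text only states that the sample is randomized.

```python
import random

# The five SPLIT categories named in the text, 100 prompts each.
CATEGORIES = ["Stress", "Panic", "Loneliness", "Internal Displacement", "Tension"]
PROMPTS_PER_CATEGORY = 100


def build_prompt_ids():
    """Return hypothetical prompt IDs: 5 categories x 100 prompts = 500."""
    return [
        f"{category}-{index:03d}"
        for category in CATEGORIES
        for index in range(1, PROMPTS_PER_CATEGORY + 1)
    ]


def draw_verification_sample(prompt_ids, fraction=0.15, seed=0):
    """Draw a simple random sample without replacement (seed is an assumption)."""
    sample_size = round(len(prompt_ids) * fraction)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(prompt_ids, sample_size)


prompt_ids = build_prompt_ids()
verification_sample = draw_verification_sample(prompt_ids)
print(len(prompt_ids), len(verification_sample))  # 500 75
```

Fixing the seed is a design choice for reproducibility; a stratified draw of 15 prompts per category would be an alternative, but the text does not say which variant was used.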

### 2\.2 Large Language Models Selection

To achieve higher scalability and thus credibility in the conducted experiment, we deploy three technically diverse models to generate responses to the queries\. The currently adopted approach aligns well with peer studies conducted by a range of other researchers\[[29](https://arxiv.org/html/2607.02049#bib.bib18),[6](https://arxiv.org/html/2607.02049#bib.bib19)\]\. It also allows us to ensure the architectural diversity of the models deployed, making the final outcome more precise on a large scale\. The following models are deployed to act as response generators:

1. DeepSeek\-V3 \[[11](https://arxiv.org/html/2607.02049#bib.bib34)\]: this LLM showcases a mixture\-of\-experts \(MoE\) architecture \[[2](https://arxiv.org/html/2607.02049#bib.bib36)\], adopting an auxiliary\-loss\-free strategy \[[58](https://arxiv.org/html/2607.02049#bib.bib35)\] and a multi\-token prediction training objective \[[64](https://arxiv.org/html/2607.02049#bib.bib37)\]\. This approach allows the LLM to distribute users’ queries effectively and efficiently, demonstrating a level of performance remarkably close to that of closed\-source models \[[11](https://arxiv.org/html/2607.02049#bib.bib34)\], which makes it highly valuable for the current study\.
2. LLaMA\-3\.3\-70B\-Instruct \[[14](https://arxiv.org/html/2607.02049#bib.bib38)\]: this LLM possesses an architecture that contrasts directly with the MoE design \[[2](https://arxiv.org/html/2607.02049#bib.bib36)\] of the DeepSeek\-V3 model examined above\. A standard dense transformer architecture \[[55](https://arxiv.org/html/2607.02049#bib.bib1)\] was implemented for this LLM with minor adaptations to ensure training stability, thus avoiding potential loss spikes \[[14](https://arxiv.org/html/2607.02049#bib.bib38)\]\. A recent peer study additionally indicates that fine\-tuned LLaMA models have the potential to outperform larger open\-weight models \[[52](https://arxiv.org/html/2607.02049#bib.bib39)\]\. Another empirical study highlights that “cross\-lingual alignment might have been internalized within the model” \[[62](https://arxiv.org/html/2607.02049#bib.bib40)\], showcasing its competency in Natural Language Processing \(NLP\) and Natural Language Generation \(NLG\)\.
3. Gemini\-2\.5\-Flash \[[10](https://arxiv.org/html/2607.02049#bib.bib41)\]: this LLM’s architecture intersects closely with that of DeepSeek \[[11](https://arxiv.org/html/2607.02049#bib.bib34)\], implementing a sparse mixture\-of\-experts \[[2](https://arxiv.org/html/2607.02049#bib.bib36)\] transformer \[[55](https://arxiv.org/html/2607.02049#bib.bib1)\] approach\. Being a hybrid reasoning model which balances speed, cost and intelligence, its capabilities largely outperform those of earlier Gemini generations\. Its multilingual capabilities encompass over 400 languages via pretraining, with robust improvement in NLP\. Nevertheless, in contrast to DeepSeek\-V3 and LLaMA\-3\.3\-70B\-Instruct, it is a closed\-source commercial LLM, making it a relevant addition to the current study for a more balanced and accurate baseline\.

### 2\.3 SPLIT benchmark evaluation criteria

The LLMs’ responses are assessed across three parameters, corresponding directly to the research questions:

1. Empathetic Accuracy: Does the LLM identify the user’s emotional state accurately and produce an appropriate response without consistently falling back on clichés?
2. Linguistic Naturalness: Does the LLM preserve a natural response flow, using appropriate idioms and expressions related to the crisis\-affected situations?
3. Contextual and Cultural Grounding: Does the LLM take into account the language and cultural background of the user when producing an emotionally grounding response?

The performance interpretation of the 1–5 SPLIT scale is as follows:

1 \- Inadequate alignment\. The model’s response is entirely inappropriate, exhibiting severe structural and cohesive breakdowns\. It contains irrelevant advice and completely fails to recognize or adapt to the user’s emotional state\.

2 \- Superficial alignment\. The model’s response operates at a basic level; while it may be fluent with only minor collocation slips, it only partially addresses the user’s need for emotional assistance\. It lacks overall cohesion and cultural awareness, yielding a slightly robotic answer that relies heavily on generic grounding phrases\.

3 \- Sufficient alignment\. The model’s response aligns functionally with the user’s query and remains fluent in the language of the message\. However, it lacks empathetic depth, frequently offering basic or vague advice that does not align meaningfully with the user’s emotional state\.

4 \- High\-quality alignment\. The model’s response fully addresses both the user’s query and their immediate need for emotional grounding\. It encompasses natural idioms and collocations, providing meaningful, culturally, locally, and logically adapted reassurance\.

5 \- Human\-level alignment\. The model’s response possesses a fully natural, human\-like flow, utilizing appropriate, subtle idioms of distress and stable expressions\. It actively avoids ubiquitous clichés and artificial patterns while gradually adapting to the user’s emotional state\. Furthermore, it accounts for the individual’s unique background, objectively perceiving the necessity for emotional grounding and adjusting the tone accordingly\.

### 2\.4 The LLM Jury Setup

In pursuit of accuracy, as described in the preliminary analysis, three high\-reasoning LLMs are selected to act as a jury\[[63](https://arxiv.org/html/2607.02049#bib.bib42),[13](https://arxiv.org/html/2607.02049#bib.bib43),[3](https://arxiv.org/html/2607.02049#bib.bib44)\]in evaluating the responses of the assessed models to the preliminary queries, generated in the established dataset, adopting the introduced SPLIT benchmark\. The choice of the jury stems from a range of peer studies and literature review reinforcing a similar approach\[[63](https://arxiv.org/html/2607.02049#bib.bib42),[15](https://arxiv.org/html/2607.02049#bib.bib48),[26](https://arxiv.org/html/2607.02049#bib.bib49),[7](https://arxiv.org/html/2607.02049#bib.bib50)\], providing us with the variety required for authenticity and a less biased result\[[50](https://arxiv.org/html/2607.02049#bib.bib60),[63](https://arxiv.org/html/2607.02049#bib.bib42)\]\. A range of studies additionally reveal self\-preference bias tendencies\[[45](https://arxiv.org/html/2607.02049#bib.bib45),[27](https://arxiv.org/html/2607.02049#bib.bib46),[28](https://arxiv.org/html/2607.02049#bib.bib47)\]when scoring the model of its own architecture\. The selected judging LLMs are direct representatives of different architectures and vary in terms of a range of criteria\. They largely act as primary judges in studies as well\[[4](https://arxiv.org/html/2607.02049#bib.bib51)\]\. The following three models are deployed for the specific assessment to ensure a diverse, cross\-architecture jury for achieving an overall, stable consensus:

1. GPT\-4o \[[40](https://arxiv.org/html/2607.02049#bib.bib29),[39](https://arxiv.org/html/2607.02049#bib.bib28)\]: this closed\-source model is built on a transformer architecture \[[55](https://arxiv.org/html/2607.02049#bib.bib1)\], manifesting high reasoning skills and comprehension across multiple historically underrepresented languages and therefore scaling exceptionally well across non\-Latin alphabets, specifically Ukrainian \[[19](https://arxiv.org/html/2607.02049#bib.bib54)\]\. Being a morphologically rich language, Ukrainian suffers from high tokenizer fertility \[[41](https://arxiv.org/html/2607.02049#bib.bib53)\]\. This model’s expanded vocabulary is intended to partially address this predicament and reduce the “Ukrainian penalty”\. It outperforms a range of prior OpenAI models, fostering multimodal features and excelling at NLG\. While a range of fine\-tuned models tend to show higher performance rates, they underperform GPT\-4, and by extension GPT\-4o, in a number of dimensions \[[20](https://arxiv.org/html/2607.02049#bib.bib52)\]\. Likewise, its semantic density and its ability to recognize the user’s intent make it a valuable source for this study\. Being a closed\-source commercial model, it additionally contributes to the diversity and credibility of the newly introduced benchmark\.
2. Mistral Large: this LLM’s capabilities tend to reach the level of human expertise, encompassing state\-of\-the\-art features in NLG \[[54](https://arxiv.org/html/2607.02049#bib.bib55)\]\. Reflecting the architectural paradigm of the assessed Gemini\-2\.5\-Flash model, it adopts a sparse mixture\-of\-experts \[[2](https://arxiv.org/html/2607.02049#bib.bib36)\] transformer \[[55](https://arxiv.org/html/2607.02049#bib.bib1)\] approach, simultaneously featuring grouped\-query attention mechanisms \[[23](https://arxiv.org/html/2607.02049#bib.bib56)\]\. Being additionally trained on European cultural and linguistic data, the Mistral model is crucially distinct from the GPT and Claude models, facilitating diversity in the experiment\.
3. Claude 4\.5 Sonnet: this closed\-source constitutional model exhibits substantial similarities in performance with the GPT\-4o judge deployed in this study, with each model having specific strengths \[[21](https://arxiv.org/html/2607.02049#bib.bib57)\] and Claude surpassing GPT in cause\-and\-effect reasoning\. Being strictly rule\-based, it balances the overall scoring consensus\. Deploying Claude Sonnet is a logical addition to the current jury setup in order to mitigate length bias \[[7](https://arxiv.org/html/2607.02049#bib.bib50)\], which may be caused by a single judging model\.

### 2\.5 Evaluation Metric and Human Validation

First, the acquired data is evaluated by means of the LLM\-as\-a\-jury paradigm \[[50](https://arxiv.org/html/2607.02049#bib.bib60),[63](https://arxiv.org/html/2607.02049#bib.bib42),[26](https://arxiv.org/html/2607.02049#bib.bib49)\]\. Because the data is grouped by judge, assessed model, language, and dimension, a nested averaging system is required\. The evaluations of the three judging LLMs are first averaged independently, providing final mean scores \(split between 2 languages and 3 assessed models across the three measuring dimensions\) for each of the judges\. These intermediate calculations are stored in three separate files, so that the performance of each judging LLM can later be compared\. Once this data is fully processed, we calculate the grand mean across the three obtained files\. Each of the dimensions \(Empathetic Accuracy, Linguistic Naturalness, Contextual & Cultural Grounding\) for both languages receives its final mean score, shedding light on the overall performance of the LLMs\. In summary, the final score is a grand mean, which follows a simple, widely accepted formula:

$$M_{\text{grand}} = \frac{1}{N}\sum_{i=1}^{N} S_{\text{final},i}$$

The breakdown of each of the formula’s variables:

1. $M_{\text{grand}}$ \- the grand mean, which defines a final average score for a specific model, language and dimension\.
2. $N$ \- the total number of responses belonging to the specific model\-language group\.
3. $i$ \- the response index, representing the specific row of an individual generated response within the dataset\.
4. $S_{\text{final},i}$ \- the final score of response $i$ within the model\-language group, calculated as the average of the scores assigned to that response by the three judging models\.
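The nested averaging above can be sketched in a few lines. This is an illustrative sketch rather than the authors' pipeline: the row layout (`model`, `language`, `dimension`, `judge_scores`), the function name, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def grand_means(rows):
    """Compute M_grand per (model, language, dimension) group.

    Each row is a dict holding one response's three judge scores
    for one dimension (a hypothetical layout, not the paper's files).
    """
    groups = defaultdict(list)
    for row in rows:
        # S_final,i: average of the three judges' scores for response i.
        s_final = mean(row["judge_scores"])
        key = (row["model"], row["language"], row["dimension"])
        groups[key].append(s_final)
    # M_grand: average of S_final,i over the N responses in each group.
    return {key: mean(scores) for key, scores in groups.items()}


# Toy data for illustration only; these are not scores from the study.
toy_rows = [
    {"model": "model_a", "language": "EN", "dimension": "Empathetic Accuracy",
     "judge_scores": [4.0, 5.0, 3.0]},
    {"model": "model_a", "language": "EN", "dimension": "Empathetic Accuracy",
     "judge_scores": [2.0, 3.0, 4.0]},
    {"model": "model_a", "language": "UA", "dimension": "Empathetic Accuracy",
     "judge_scores": [3.0, 3.0, 3.0]},
]

result = grand_means(toy_rows)
print(result[("model_a", "EN", "Empathetic Accuracy")])  # 3.5
```

When every judge scores every response, averaging per response first and then per group gives the same number as averaging per judge first and then across the three judge files, so the sketch should agree with the file-based procedure described in the text under that assumption.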

Nevertheless, relying mainly on automated evaluation would undermine the study’s credibility, as argued in a range of peer research \[[6](https://arxiv.org/html/2607.02049#bib.bib19),[44](https://arxiv.org/html/2607.02049#bib.bib14)\]\. Therefore, we adopt an approach identical to the one used to verify that the GPT\-4o\-generated queries strictly follow the introduced SPLIT benchmark\. Specifically, 10% \($n=300$\) of the three models’ answers \(500 prompts × 3 models × 2 languages = 3,000 responses\) are manually evaluated on a randomized basis \[[49](https://arxiv.org/html/2607.02049#bib.bib59)\] via the same three parameters \(Empathetic Accuracy, Linguistic Naturalness, Contextual and Cultural Grounding\) by a native Ukrainian speaker with C2 English proficiency, to indicate to what extent the scoring framework is objective, valid, and aligned with real human perception\. However, the limitations caused by relying on one human annotator should be acknowledged\. We therefore discuss the implications of using a single annotator in the Limitations section\.

Because our SPLIT benchmark utilizes a continuous \(1–5\) scale, a categorical agreement measure such as Cohen’s Kappa \[[9](https://arxiv.org/html/2607.02049#bib.bib58)\] fails to capture proximity and nuances in scoring\. Therefore, we calculate the Pearson correlation coefficient \($r$\) \[[12](https://arxiv.org/html/2607.02049#bib.bib61)\] to measure the statistical alignment of human and AI jury scores for each of the three measured dimensions \(Empathetic Accuracy, Linguistic Naturalness, Contextual and Cultural Grounding\):

$$r = \frac{\sum_{i=1}^{n}(H_{i}-\bar{H})(J_{i}-\bar{J})}{\sqrt{\sum_{i=1}^{n}(H_{i}-\bar{H})^{2}}\sqrt{\sum_{i=1}^{n}(J_{i}-\bar{J})^{2}}}$$

The breakdown of this equation is as follows:

1. $r$ \- the Pearson correlation coefficient, measuring the alignment between the human evaluator’s scores and the AI Jury’s scores\.
2. $n$ \- the total number of validation samples manually evaluated by the human rater \($n=300$, representing 10% of the total dataset\)\.
3. $H_{i}$ \- the manually validated score on a \(1 to 5\) scale assigned to a specific response $i$\.
4. $\bar{H}$ \- the mean \(average\) score of all 300 human evaluations across that specific dimension\.
5. $J_{i}$ \- the automated consensus score \($S_{\text{final},i}$\) assigned by the AI Jury to that exact same response $i$ for that dimension\.
6. $\bar{J}$ \- the mean \(average\) consensus score of all 300 AI Jury evaluations across that specific dimension\.
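The coefficient defined above can be computed directly from the two score lists. The sketch below is a plain-Python transcription of the formula, with hypothetical names (`pearson_r`, `human`, `jury`) and toy inputs that are not taken from the study.

```python
import math


def pearson_r(human, jury):
    """Pearson correlation between human scores H_i and jury scores J_i."""
    if len(human) != len(jury) or len(human) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    h_bar = sum(human) / len(human)
    j_bar = sum(jury) / len(jury)
    # Numerator: sum of products of deviations from the two means.
    numerator = sum((h - h_bar) * (j - j_bar) for h, j in zip(human, jury))
    # Denominator: product of the square roots of the summed squared deviations.
    h_dev = math.sqrt(sum((h - h_bar) ** 2 for h in human))
    j_dev = math.sqrt(sum((j - j_bar) ** 2 for j in jury))
    if h_dev == 0 or j_dev == 0:
        raise ValueError("r is undefined when one list has zero variance")
    return numerator / (h_dev * j_dev)


# Toy inputs for illustration only.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
jury_aligned = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly linear in human scores
jury_reversed = [5.0, 4.0, 3.0, 2.0, 1.0]   # perfectly inverse

r_aligned = pearson_r(human, jury_aligned)
r_reversed = pearson_r(human, jury_reversed)
```

With these toy lists `r_aligned` is close to 1 and `r_reversed` is close to -1. In Python 3.10 and later, `statistics.correlation` computes the same quantity and may be preferable to a hand-written version.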

To further quantify agreement between human and automated evaluations, Mean Absolute Error \(MAE\) and Mean Error \(ME\), interpreted as a measure of systematic leniency bias, are additionally computed by using the following formulas:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|x_{i}-y_{i}|$$

$$\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}(x_{i}-y_{i})$$
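A minimal sketch of the two error measures follows. The text does not state which score plays the role of $x_i$ and which of $y_i$; the sketch assumes $x_i$ is the AI jury score and $y_i$ the human score, so that a positive ME would indicate a jury that is more lenient than the human rater. The function names and toy values are hypothetical.

```python
def mean_absolute_error(x, y):
    """MAE: average magnitude of the per-response score differences."""
    if len(x) != len(y) or not x:
        raise ValueError("need two equally long, non-empty lists")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def mean_error(x, y):
    """ME: average signed difference; its sign shows the direction of bias."""
    if len(x) != len(y) or not x:
        raise ValueError("need two equally long, non-empty lists")
    return sum(a - b for a, b in zip(x, y)) / len(x)


# Toy inputs for illustration only (assumed: x = jury, y = human).
jury_scores = [4.0, 3.5, 5.0]
human_scores = [3.0, 3.5, 4.0]

mae = mean_absolute_error(jury_scores, human_scores)
me = mean_error(jury_scores, human_scores)
```

MAE and ME coincide here because the toy jury never scores below the human; when over- and under-scoring cancel out, ME approaches zero while MAE stays positive, which is why the two are reported together.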
![Refer to caption](https://arxiv.org/html/2607.02049v1/framework.png)

Figure 2: Overview of the SPLIT evaluation framework\.

Overall, the methodological approach, demonstrated in Figure [2](https://arxiv.org/html/2607.02049#S2.F2), provides us with a clear research structure and, consequently, the empirical data required to explicitly answer the four questions our benchmark aims to address\.

## 3 Results and Analysis

### 3\.1 Human Evaluation Baseline

#### 3\.1\.1 Interpretation

To establish the Human Evaluation Baseline for the three evaluated models, manual assessment was conducted and mean scores were calculated for six language\-model groups across the three SPLIT dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding\. Overall, the baseline reveals a substantial performance gap when transitioning from high\-resource \(English\) to a low\-to\-mid resource \(Ukrainian\) language\.

Table 1: Human Evaluation Baseline\. Mean evaluator scores \(1–5 scale\) across languages\.

Consistent with prior peer research, the performance of LLMs demonstrates severe discrepancies in cross\-lingual conversations \[[60](https://arxiv.org/html/2607.02049#bib.bib3),[30](https://arxiv.org/html/2607.02049#bib.bib6),[16](https://arxiv.org/html/2607.02049#bib.bib23)\]\. The largest observed decline reaches 1\.76 points and occurs in LLaMA’s Linguistic Naturalness scores \(see Table [1](https://arxiv.org/html/2607.02049#S3.T1) and Figure [3](https://arxiv.org/html/2607.02049#S3.F3)\)\. Furthermore, Gemini received a manual score of 2\.5 in Ukrainian for Contextual & Cultural Grounding, a decline of 1\.2 points\. Manual evaluation also shows a drop in Gemini’s Empathetic Accuracy and Linguistic Naturalness in Ukrainian, with scores of 2\.72 and 2\.70, respectively\. LLaMA’s performance falls to 2\.16 in both Linguistic Naturalness and Contextual & Cultural Grounding when answering in Ukrainian, compared with its English scores of 3\.92 and 3\.26, respectively\. Among the three dimensions, Empathetic Accuracy remains LLaMA’s strongest category in Ukrainian, reaching 2\.58 on the 1–5 scale\.

Interestingly, DeepSeek’s performance stays the same or slightly improves when the language of communication changes from English to Ukrainian\. Specifically, Linguistic Naturalness stays at the same level for both evaluated languages with a score of 3\.44, suggesting relatively strong fluency and linguistic consistency in Ukrainian\. For the other two dimensions, Empathetic Accuracy and Contextual & Cultural Grounding, performance in Ukrainian increases slightly, reaching scores of 3\.56 and 3\.74, an increase of 0\.42 and 0\.20, respectively\.
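The arithmetic behind these comparisons can be checked with a short script. It uses only scores quoted in the text above; the dictionary layout, the dimension abbreviations (EA, LN, CCG), and the helper names are hypothetical, and the unweighted mean over the three dimensions is an assumption about how a macro-average would be formed.

```python
# Ukrainian human-evaluation scores quoted in the text (1-5 scale).
ua_scores = {
    "Gemini-2.5-Flash": {"EA": 2.72, "LN": 2.70, "CCG": 2.50},
    "LLaMA-3.3-70B-Instruct": {"EA": 2.58, "LN": 2.16, "CCG": 2.16},
    "DeepSeek-V3": {"EA": 3.56, "LN": 3.44, "CCG": 3.74},
}

# English scores that the text states explicitly; the rest are left out.
en_scores_stated = {
    ("LLaMA-3.3-70B-Instruct", "LN"): 3.92,
    ("LLaMA-3.3-70B-Instruct", "CCG"): 3.26,
    ("DeepSeek-V3", "LN"): 3.44,
}


def delta(model, dimension):
    """UA score minus EN score, rounded to two decimals."""
    return round(ua_scores[model][dimension] - en_scores_stated[(model, dimension)], 2)


def macro_average(scores):
    """Unweighted mean over the three dimensions (assumed definition)."""
    return round(sum(scores.values()) / len(scores), 2)


print(delta("LLaMA-3.3-70B-Instruct", "LN"))  # -1.76
ua_macro = {model: macro_average(scores) for model, scores in ua_scores.items()}
```

Under the assumed definition, the Ukrainian macro-averages come out at roughly 2.64 for Gemini, 2.30 for LLaMA, and 3.58 for DeepSeek; the values plotted in Figure 1 may differ if a different aggregation was used.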

#### 3\.1\.2 Explanation

![Refer to caption](https://arxiv.org/html/2607.02049v1/human_baseline.png)

Figure 3: Human Evaluation Baseline scores across the three evaluated dimensions in English \(EN\) and Ukrainian \(UA\)\.

Based on the results above, several observations can be made to explain the behavioral patterns of each of the models and outline the potential reasons for the specific outcome\. Our interpretation is informed by differences in model architectures and training data\.

DeepSeek\-V3 demonstrated strong performance across all three measured dimensions, moderately exceeding the English baseline shown in Table[1](https://arxiv.org/html/2607.02049#S3.T1)when producing answers in Ukrainian\. One possible explanation is that, unlike the LLaMA family, which features a dense\-transformer architecture, DeepSeek’s routing MoE strategy\[[11](https://arxiv.org/html/2607.02049#bib.bib34),[58](https://arxiv.org/html/2607.02049#bib.bib35)\]adopts a more specialized processing approach\. It utilizes specialized experts alongside shared experts responsible for more general linguistic patterns\. Additionally, its auxiliary\-loss\-free routing mechanism may allow queries to be distributed more efficiently among experts, potentially contributing to the model’s ability to maintain stable performance across languages\[[2](https://arxiv.org/html/2607.02049#bib.bib36)\]\. Furthermore, its multi\-token prediction objective may enable stronger phrase\-level planning and contextual coherence, which could be particularly beneficial when generating responses in a morphologically rich language such as Ukrainian\.

A similar pattern was observed in a range of other studies, suggesting that instruction\-aligned LLMs, including DeepSeek, demonstrate cross\-lingual consistency\[[38](https://arxiv.org/html/2607.02049#bib.bib62)\], while a range of other open\-weight models showcase relative linguistic instability\. Nevertheless, our findings only partially align with these observations\. While Gemini is generally claimed to demonstrate linguistic stability\[[10](https://arxiv.org/html/2607.02049#bib.bib41)\], our study reveals that its performance in crisis\-related situations remains less stable, exhibiting a noticeable decline across the three evaluated dimensions when transitioning to Ukrainian\. This discrepancy may stem from the specific design of the SPLIT benchmark, as well as limitations associated with human evaluation\.

The SPLIT benchmark, prompts, research log, and evaluation data are publicly available \(https://github\.com/Anna\-a\-host/SPLIT\-Cross\-Lingual\-Empathy\-and\-Cultural\-Grounding\-in\-English\-and\-Ukrainian\-LLMs\)\.

### 3\.2 Automated Multi\-Agent Assessment

Table 2: Automated Grand Summary Averages\.

Comparing the Human Evaluation Baseline with the LLM\-as\-a\-jury framework reveals several important observations\. As demonstrated in Table [2](https://arxiv.org/html/2607.02049#S3.T2) and Figure [4](https://arxiv.org/html/2607.02049#S3.F4), the widely accepted LLM\-as\-a\-judge paradigm appears partially blind to a range of linguistic and cultural nuances, failing to detect significant discrepancies and a lack of contextual adaptation when switching to a low\-to\-mid resource language \(Ukrainian\)\. This observation suggests that the AI jury rewards a set of features when evaluating complex Contextual & Cultural Grounding metrics that differ from Human Evaluation Baseline perception\. While LLM\-as\-a\-jury evaluations may statistically focus more on overall coherence, relevance, politeness, and instruction\-following, human evaluation tends to reward natural phrasing and the ability to provide grounding to an individual while avoiding robotic expressions\. Therefore, the scores attributed to a highly localized dimension such as Contextual & Cultural Grounding vary significantly across languages\.

Furthermore, several specific results deserve closer examination\. According to the AI Jury Baseline, Gemini and LLaMA scored 0\.722 and 0\.578 points higher, respectively, on the Contextual & Cultural Grounding dimension when switching to Ukrainian\. In contrast, the Human Evaluation Baseline indicates that their performance dropped substantially, to 2\.5 and 2\.16, respectively\. DeepSeek’s AI\-jury scores largely move in the same direction as the human evaluation; however, the model received a mean score of 2\.774 when answering in English, which is 0\.852 points lower than the human score for the same Contextual & Cultural Grounding metric\. This suggests that the AI jury may have partially underestimated DeepSeek’s English cultural grounding, and it points to another important observation: AI and human juries tend to reward different characteristics when assessing the Contextual & Cultural Grounding dimension\.

![Refer to caption](https://arxiv.org/html/2607.02049v1/automated_baseline.png)

Figure 4: Automated Baseline scores across the three evaluated dimensions in English \(EN\) and Ukrainian \(UA\)\.

Linguistic Naturalness remains another questionable dimension, with the AI jury being partly unable to identify a range of conversational discrepancies\. Although the AI jury assigns LLaMA lower scores when transitioning to Ukrainian, the drop is only 0\.156\. While this moves in the same negative direction as the Human Evaluation Baseline, it is too small to capture the magnitude of the performance gap\. In addition, Gemini’s and LLaMA’s Linguistic Naturalness scores show a substantial difference of roughly 1\.14 points when contrasted with the Human Evaluation Baseline\. This again reinforces the idea that the AI jury tends to overestimate the quality of performance in Linguistic Naturalness, assigning higher scores for grammatical correctness and sentence structure, while human evaluation places greater emphasis on authentic Ukrainian expressions and natural communication\. A highly similar pattern can be observed for Empathetic Accuracy, demonstrating differences between human\-based emotional realism standards and AI\-jury evaluations\.

In light of the findings above and prior research, a conclusion regarding the performance of LLMs when responding to crisis\-related queries can be made\. Prior work suggests that fine\-tuned models showcase stability across a number of metrics\[[38](https://arxiv.org/html/2607.02049#bib.bib62)\], while our SPLIT benchmark reveals that these abilities do not necessarily transfer to emotional\-support contexts such as those covered by our 500\-query dataset\. While the models’ multilingual competencies are sufficient to demonstrate the required level of cohesion and coherence for effective communication, their ability to exhibit multicultural understanding is not implied by the former\. This conclusion directly supports the findings of prior research\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\], demonstrating the significance of deep cultural nuances for effective communication\.

### 3\.3 Statistical Validation and Correlation Analysis

Following the approach described above for the sampled 300 LLM responses, the Pearson correlation coefficient \($r$\) was computed\. The results demonstrate several key findings, as summarized in Table [3](https://arxiv.org/html/2607.02049#S3.T3) and Figure [5](https://arxiv.org/html/2607.02049#S3.F5):

Table 3: Final Statistical Metrics

1. Empathetic Accuracy showed a weak, though highly significant, positive alignment with the Human Evaluation Baseline, with an $r$ value of 0\.198 and a $p<0.001$ value\. The result suggests that empathy is partially observable through general linguistic patterns and appears to be less challenging for the LLM\-as\-a\-jury framework to detect\. The additionally calculated $MAE$ \(Mean Absolute Error\) and $ME$ \(Systematic Leniency Bias\) values of 0\.721 and \+0\.281, respectively, suggest the idea of an AI overscoring tendency, positioning it as a lenient evaluation paradigm\.
2. Linguistic Naturalness demonstrates a highly similar, weak but significant positive alignment of $r=0.149$ and $p<0.01$, positing the idea of an LLM’s ability to perceive sentence structure and reward formal, structured responses rather than natural idioms of distress\. In this case, the additionally calculated values of $MAE=0.81$ and $ME=+0.506$ further support this interpretation, showing considerable systematic AI leniency and inflation\.
3. The Contextual & Cultural Grounding metric emerges as the most challenging dimension for the LLM\-as\-a\-jury paradigm, showing a slight negative correlation $r=-0.095$ and a non\-significant $p$\-value $p>0.05$\. This observation suggests that AI systems may assign lower scores to responses that exhibit cultural nuances while deviating from the grammatical or structural patterns the models tend to reward\. The corresponding values of $MAE=0.892$ and $ME=-0.114$ further underpin this interpretation\. Therefore, we find no evidence that AI can meaningfully predict real human judgments in the Contextual & Cultural Grounding dimension, owing to its potential inability to recognize authentic cultural adaptation that humans tend to reward\.
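
The three agreement statistics used in the list above can be sketched as follows\. This is a minimal illustration on made\-up ratings, not the study’s evaluation code: the six score pairs are hypothetical, and the sign convention for the mean error \(AI score minus human score, so that a positive value indicates AI leniency\) is an assumption inferred from how the values are interpreted in the text\.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_absolute_error(human, ai):
    """Average size of the disagreement, ignoring its direction."""
    return sum(abs(a - h) for h, a in zip(human, ai)) / len(human)


def mean_error(human, ai):
    """Signed bias, taken here as AI minus human (assumed convention)."""
    return sum(a - h for h, a in zip(human, ai)) / len(human)


# Hypothetical 1-5 ratings for six responses (NOT data from the study).
human_scores = [2, 3, 4, 2, 5, 3]
ai_scores = [3, 4, 4, 3, 4, 4]

r = pearson_r(human_scores, ai_scores)
mae = mean_absolute_error(human_scores, ai_scores)
me = mean_error(human_scores, ai_scores)
print(f"r = {r:.3f}, MAE = {mae:.3f}, ME = {me:+.3f}")
```

Reporting the signed and absolute errors together is useful because a small mean error can hide large disagreements that cancel out, whereas the mean absolute error cannot\.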

Although the correlation between the Human and AI Baselines in Empathetic Accuracy and Linguistic Naturalness is weak, it is positive and statistically significant, indicating that LLMs have the potential to capture certain tendencies of human judgment\. However, they cannot reliably substitute for human assessment in culturally grounded emotional\-support scenarios, as indicated by the negative, non\-significant correlation in the Contextual & Cultural Grounding metric\.
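
As a rough plausibility check, the reported significance levels can be related to the reported coefficients through the usual test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$\. The sketch below assumes $n=300$ paired scores per dimension and a two\-sided test, and it replaces the $t$ distribution with a normal approximation \(reasonable at this sample size\); the exact test used in the study is not stated here, so these are assumptions rather than a reproduction\.

```python
import math


def t_statistic(r, n):
    """Test statistic for H0: no linear correlation, given r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


n = 300  # assumed number of paired human/AI scores per dimension
reported_r = {
    "Empathetic Accuracy": 0.198,
    "Linguistic Naturalness": 0.149,
    "Contextual & Cultural Grounding": -0.095,
}

p_values = {}
for name, r in reported_r.items():
    t = t_statistic(r, n)
    p_values[name] = two_sided_p_normal(t)
    print(f"{name}: t = {t:.2f}, approximate p = {p_values[name]:.4f}")
```

Under these assumptions, the approximate $p$\-values fall below 0\.001, below 0\.01, and above 0\.05, respectively, which matches the thresholds reported for the three dimensions\.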

![Refer to caption](https://arxiv.org/html/2607.02049v1/agreement_overview.png)

Figure 5: Overall agreement between Automated and Human Evaluation Baselines

Our SPLIT benchmark serves as a logical supplement to prior studies in the realm of Cross\-Lingual Empathy Divergence\. Our findings suggest that LLMs showcase strong capabilities in maintaining coherent cross\-lingual interactions and preserving linguistic fluency, thus correlating modestly with humans in general\-purpose tasks\. However, our benchmark aims to address a more complex setting, requiring not only multilingual but also multicultural abilities\. Our observations strongly resemble those reported in previous work, specifically regarding the principle that multilingualism does not equate to multiculturalism\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\]\. Therefore, our SPLIT benchmark evaluation may lead to the following conclusion:

> LLMs’ abilities to generate text in many languages do not necessarily imply an understanding of the cultural norms, references, values, emotions, and communicative expectations associated with those languages\.

## 4 Discussion

### 4\.1 RQ1: How do state\-of\-the\-art LLMs differ in empathetic response quality between English and Ukrainian crisis\-related scenarios?

As outlined in the Results & Analysis section, LLMs tend to exhibit noticeable discrepancies when producing responses in Ukrainian in comparison with the English baseline\[[60](https://arxiv.org/html/2607.02049#bib.bib3),[30](https://arxiv.org/html/2607.02049#bib.bib6)\]\. While this is a pervasive tendency within the AI paradigm, our SPLIT benchmark specifically aims to evaluate crisis\-related scenarios, requiring emotionally aware empathetic accuracy in the generated responses\[[42](https://arxiv.org/html/2607.02049#bib.bib5),[35](https://arxiv.org/html/2607.02049#bib.bib11)\]\.

Nevertheless, according to the established human\-in\-the\-loop baseline, Gemini\-2\.5\-Flash and LLaMA\-3\.3\-70B\-Instruct models tend to experience a substantial decline in empathetic response quality when transitioning from English to Ukrainian\. This outcome aligns with a widely observed phenomenon in multilingual NLP; however, it does not necessarily apply to all LLMs\. According to our benchmark, DeepSeek\-V3 demonstrates strong performance in empathetic response quality within the Empathetic Accuracy metric, even slightly enhancing its abilities when transitioning to Ukrainian\.

One possible explanation for this outcome lies in the training objectives and architectural paradigms behind these LLMs\.

LLaMA’s pretraining dataset consists of 15 trillion tokens, the majority of which originate from English\-language sources\[[14](https://arxiv.org/html/2607.02049#bib.bib38)\]\. The model consistently falls back on basic, rigid phrasing, causing degradation in empathy within a morphologically rich language such as Ukrainian\. The model may simply experience a scarcity of culturally tailored data required to preserve empathetic authenticity\.

In contrast, Gemini models are pretrained on a massive volume of tokens spanning over 200 languages\[[10](https://arxiv.org/html/2607.02049#bib.bib41)\]; therefore, their linguistic capabilities in low\-to\-mid resource languages are technically substantial\. However, the results of our research indicate that, although Gemini performs slightly better than LLaMA across the Empathetic Accuracy dimension, it still lags behind DeepSeek\. While it understands Ukrainian vocabulary and syntax, the massive scale of web\-scraped data may average out unique, highly localized emotional idioms used to express empathy\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\]\.

The paradox behind DeepSeek’s strong performance in the Empathetic Accuracy dimension demonstrates an interesting, though predictable, outcome\. By adopting an MoE framework\[[11](https://arxiv.org/html/2607.02049#bib.bib34)\], its cross\-lingual objectives are not treated as an afterthought, encouraging the model to reason in the language of the query itself by routing it to specialized experts alongside shared experts\[[11](https://arxiv.org/html/2607.02049#bib.bib34),[58](https://arxiv.org/html/2607.02049#bib.bib35)\]\. This mechanism may enable the extraction of more authentic, emotionally relevant patterns without relying excessively on English\-based expressions of empathy\.

### 4\.2 RQ2: What linguistic and conversational discrepancies emerge when LLMs generate responses to English and Ukrainian crisis\-related scenarios?

Linguistic and conversational discrepancies have long been considered a major bottleneck in NLP\[[60](https://arxiv.org/html/2607.02049#bib.bib3),[16](https://arxiv.org/html/2607.02049#bib.bib23)\], particularly in morphologically rich and lower\-resource languages\[[60](https://arxiv.org/html/2607.02049#bib.bib3),[30](https://arxiv.org/html/2607.02049#bib.bib6)\], which might often require splitting words into multiple tokens\[[8](https://arxiv.org/html/2607.02049#bib.bib63)\]\. This issue has also been observed in the current study, with two of the LLMs showing a gap on the SPLIT benchmark’s Linguistic Naturalness dimension\. According to the Human Evaluation Baseline, LLaMA was assigned the lowest score across all three models, two languages, and three evaluation metrics, alongside the Contextual & Cultural Grounding measure\. A range of inaccuracies, such as overly formal or academic phrasing, was observed\. While partly fluent, the model remained unable to address the user’s emotional needs adequately\. Another notable observation included a consistent fallback on clichés and translated phrases, which substantially reduced the model’s performance in the other two closely related evaluated dimensions\.

Specifically, when LLaMA, which has a dense\-transformer architecture\[[14](https://arxiv.org/html/2607.02049#bib.bib38)\], operated in a less represented language environment, it appeared to exhibit behavior resembling language drifting and byte\-fallback decoding errors\[[22](https://arxiv.org/html/2607.02049#bib.bib64)\]\. One possible explanation for this phenomenon is the model’s inability to find appropriate grounding expressions within its vocabulary range to tackle crisis\-related situations\. Thus, LLaMA may lose its linguistic anchor, reverting to highly overrepresented multilingual patterns in its training distribution and occasionally producing untranslated or externally sourced expressions\.
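
One concrete reason byte\-level fallback is more costly for Ukrainian than for English is the encoding itself: in UTF\-8, each Cyrillic letter occupies two bytes, whereas each basic Latin letter occupies one\. The short sketch below illustrates only this encoding fact; the two words are arbitrary examples, the helper name is hypothetical, and no specific model’s tokenizer is reproduced\.

```python
def utf8_bytes_per_char(word):
    """Average number of UTF-8 bytes needed per character of the word."""
    return len(word.encode("utf-8")) / len(word)


# Arbitrary example words: English "anxiety" and Ukrainian "тривога".
for word in ["anxiety", "тривога"]:
    n_chars = len(word)
    n_bytes = len(word.encode("utf-8"))
    print(f"{word}: {n_chars} characters, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(word):.1f} bytes per character")
```

A tokenizer that falls back to raw bytes for a word missing from its vocabulary would therefore emit roughly twice as many units per character for Cyrillic text, leaving more room for decoding errors\.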

A comparable pattern can be detected when considering Gemini’s level of Linguistic Naturalness\. Though remaining fluent throughout its responses, and therefore receiving a higher score than LLaMA, it still lacks depth of naturalness and authenticity\. The model’s Reinforcement Learning from Human Feedback mechanism\[[10](https://arxiv.org/html/2607.02049#bib.bib41)\], though effective, appears to prioritize politeness over responses adapted to a real user’s emotional state\. As is widely known, when addressing crisis\-related and complex scenarios, such as those represented in our SPLIT benchmark, Gemini models may rely more heavily on safety\-oriented alignment objectives\[[10](https://arxiv.org/html/2607.02049#bib.bib41)\], producing heavily structured, overly defensive reassurance scripts accompanied by a range of clichéd expressions\. This specific pattern was observed in the manually validated sample as well\.

Similarly to the other two dimensions of our SPLIT benchmark, DeepSeek performs reasonably well in Linguistic Naturalness, which may stem from its NLG capabilities\. In comparison with the other two evaluated LLMs, DeepSeek uniquely pioneers a Multi\-Token Prediction training objective\[[11](https://arxiv.org/html/2607.02049#bib.bib34)\], allowing it to engage in phrase\-level planning\. Unlike Gemini and LLaMA models, which adopt a Next\-Token Prediction strategy, MTP may enable DeepSeek to preserve a coherent conversational flow in a morphologically complex language such as Ukrainian\.
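
The difference between the two training objectives can be sketched in terms of what each position is asked to predict\. The example below is purely illustrative: the pre\-split Ukrainian phrase, the prediction depth of 2, and the function names are hypothetical, and the sketch says nothing about how the additional targets are weighted in any model’s actual loss\.

```python
def next_token_targets(tokens):
    """Next-token prediction: position i is trained to predict token i+1."""
    return [(tokens[i], [tokens[i + 1]]) for i in range(len(tokens) - 1)]


def multi_token_targets(tokens, depth):
    """Multi-token prediction: position i is trained to predict tokens i+1 .. i+depth."""
    return [
        (tokens[i], tokens[i + 1 : i + 1 + depth])
        for i in range(len(tokens) - depth)
    ]


# Hypothetical, already tokenized phrase ("I am here with you.").
tokens = ["Я", "поруч", "з", "тобою", "."]

ntp = next_token_targets(tokens)
mtp = multi_token_targets(tokens, depth=2)

for context_token, targets in mtp:
    print(context_token, "->", targets)
```

With a depth of 2, each position receives a training signal about the following two tokens, which is the sense in which the objective encourages planning beyond a single step\.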

Another major feature of DeepSeek\-V3 is its post\-training objective, specifically the knowledge distillation process\[[11](https://arxiv.org/html/2607.02049#bib.bib34)\], through which the model inherits reasoning capabilities from the highly capable DeepSeek\-R1 LLM, while simultaneously undergoing additional fine\-tuning and eliminating long, chaotic reasoning chains\. This might have contributed to the overall more human\-like sentence structure in comparison with the other two models\.

### 4\.3 RQ3: To what extent do LLM\-generated responses exhibit appropriate contextual and cultural grounding when addressing crisis scenarios in Ukrainian compared to English baselines?

Generally, LLMs have been shown to face challenges in Contextual & Cultural Grounding\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\], with this metric being much more specific and tailored to human perception, rather than a capability which can be directly inherited from digitized corpora alone\. Therefore, Gemini and LLaMA experience difficulties in adapting effectively to users’ emotional states, as is evident from the Human Evaluation Baseline\. Empirical evidence suggests that English\-language responses target a much broader audience\[[14](https://arxiv.org/html/2607.02049#bib.bib38),[11](https://arxiv.org/html/2607.02049#bib.bib34)\], therefore making it easier to address queries using general idioms of distress\. Nevertheless, the Ukrainian language represents a much narrower community than the English baseline does, making it harder to address the need for Contextual & Cultural Grounding\.

As discussed in the previous sections, the decline in one metric often coincides with lower performance in the others due to the close relationship between these dimensions\. When LLaMA has difficulty producing natural phrasing in Ukrainian, possibly because of language drifting and byte\-fallback phenomena, this directly affects its ability to provide grounding to an individual, substantially undermining the perception of a natural conversation\. Similarly, if Gemini’s phrasing contains over\-alignment bias or becomes lexically saturated, its Empathetic Accuracy may decrease as well\. While some grounding phrases are appropriate in an English\-speaking environment, their direct translation into Ukrainian causes awkward, unnatural, and robotic phrasing that may not be fully noticeable to a machine\.

However, while LLMs’ multilingual abilities might be far from perfection, the distinction between multilingual competence and Contextual & Cultural Grounding becomes particularly evident from the perspective of native Ukrainian speakers when compared to English\. Therefore, while models can possess multilingual competence, Cultural Grounding requires a deep understanding of norms, values, emotional expectations, communicative conventions, and implicit references\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\]\. Consequently, both prior research and the findings obtained through the SPLIT benchmark suggest the following conclusion:

> Producing Ukrainian text is not equivalent to producing Ukrainian emotional support\.

In the context of the current conflict, it is particularly important to understand the cultural and contextual nuances associated with the Ukrainian language, as linguistic fluency alone is not the most challenging component of effective communication\. Nevertheless, prior research consistently indicates that English training corpora remain dominant within the LLM paradigm\[[14](https://arxiv.org/html/2607.02049#bib.bib38),[11](https://arxiv.org/html/2607.02049#bib.bib34)\]\. This may therefore lead to excessive formality and general reassurance, the exact tendency observed in LLaMA and Gemini models, consequently resulting in lower scores\.

Interestingly, DeepSeek’s performance differs from that of the other two models across all three dimensions of our SPLIT benchmark\. As has already been explained above, DeepSeek’s internal architecture and both its pre\-training and post\-training objectives may have contributed to its strong performance in both English and Ukrainian samples of users’ queries\. The relatively high scores assigned to the model appear to reflect its ability to maintain emotional continuity throughout responses\. While not exhibiting sufficiently human\-like performance to receive perfect scores, its ability to maintain stable performance when transitioning to a less represented language constitutes a notable finding in itself\.

### 4\.4 RQ4: To what extent does automated LLM\-based evaluation agree with human assessment of empathetic conversational responses?

Stating that there is no agreement between the two evaluation baselines, human and AI, would be statistically incorrect, as demonstrated by the Pearson correlation coefficient calculated and examined in the Results and Analysis section\. Accordingly, the Empathetic Accuracy metric demonstrates weak, but highly significant alignment, suggesting that AI systems can approximate certain aspects of human judgments of empathy\[[42](https://arxiv.org/html/2607.02049#bib.bib5),[35](https://arxiv.org/html/2607.02049#bib.bib11)\]\.

This is additionally evident from the Grand AI Averages Baseline, with the scores for English responses showcasing a similar tendency\. However, while the automated jury is capable of capturing general patterns of reassurance suited for dominant English\-speaking users, the scores never dropped significantly in Ukrainian response evaluations, indicating limitations in the jury’s performance in this context\. Specifically, Empathetic Accuracy is partially overscored, with positive $MAE$ and $ME$ values, suggesting the presence of Leniency Bias\. This reinforces an interesting observation, where AI tends to assign higher scores to potentially empathetic responses containing basic, though emotionally appropriate, vocabulary\[[42](https://arxiv.org/html/2607.02049#bib.bib5),[35](https://arxiv.org/html/2607.02049#bib.bib11)\]\.

Despite the challenges LLMs face when addressing emotionally draining situations, the score for DeepSeek remains relatively stable for both English and Ukrainian\. It may indicate that AI has the ability to detect more relevant, natural idioms of distress\. However, considering the similar scores assigned to the other two assessed models, which do not change significantly with the change of language, this posits the idea that the LLM\-as\-a\-jury paradigm may rely on surface\-level linguistic indicators rather than deeper emotional representations of empathy\.

Linguistic Naturalness demonstrates a similar pattern, with approximately the same correlation but slightly weaker significance, which is still sufficient to indicate a persistent level of alignment with the Human Evaluation Baseline\. However, the $MAE$ and $ME$ values presented in the section above indicate that the magnitude of bias is considerably larger, suggesting that AI and human evaluators attend to different characteristics of the generated responses\. One of the most reasonable explanations is that the AI jury rewards structure and grammar more than authenticity\[[60](https://arxiv.org/html/2607.02049#bib.bib3),[16](https://arxiv.org/html/2607.02049#bib.bib23)\], thus reflecting patterns inherited from the training corpora of the evaluating models\.

With the Contextual & Cultural Grounding dimension being more complicated to assess, even for human beings, the LLM\-as\-a\-jury framework exhibits a negative, non\-significant correlation, showing no meaningful agreement between human evaluation and the LLM\-as\-a\-jury paradigm\. Although both humans and LLMs recognize certain aspects of cultural grounding, according to the baseline averages, LLMs appear to demonstrate limited sensitivity to cultural nuances and implicit meanings\[[46](https://arxiv.org/html/2607.02049#bib.bib7)\]\. This leads to a broader conclusion: Large Language Models demonstrate comparatively stronger statistical performance on simpler dimensions that rely less heavily on culturally embedded knowledge, which may not be fully captured by training corpora alone\. This deduction is also supported by the slightly negative $ME$ and positive $MAE$ values, which are consistent with the interpretation that AI and human juries face difficulties in applying the same evaluation criteria to this dimension, with some words carrying implicit, culturally appropriate meanings\.

Overall, the results of our LLM\-as\-a\-jury framework can be summarized as follows:

> Automated evaluation may reliably capture certain superficial dimensions of empathetic communication, while diverging from human perception in aspects requiring contextual and cultural immersion\.

## 5 Limitations

Our study specializes in a niche subject, consequently aiming to explore and contribute to the realm of Large Language Models and cross\-lingual inconsistencies encountered when addressing crisis\-related queries by introducing our own SPLIT benchmark\. Nevertheless, several limitations should be acknowledged:

### 5\.1 Human Annotation Subjectivity

First and foremost, the Human Evaluation Baseline established in our study relies specifically on a single evaluator, meaning that the outcome may remain relatively subjective\. Accordingly, some degree of disagreement would likely emerge if the same responses were evaluated by another annotator\.

To address this specific issue, future work in this sphere should adopt an inter\-rater agreement metric to average the outcomes and produce more culturally reliable results\. Nevertheless, the current study aims to provide an unbiased assessment, informed by relevant peer research and a range of other considerations regarding the deployed models established preliminarily\.

### 5\.2 Language Scope

While our findings resemble those reported in a range of peer research, they do not establish empathy divergence across all low\-to\-mid resource languages\. While some broad statements are employed, their purpose is to state the omnipresence of the current cross\-lingual issue, rather than claim or appropriate other studies’ results\. Further research should include a number of other less represented languages to explore the lack of cultural understanding in LLMs more rigorously and grasp the full scope of such a multifaceted issue\.

### 5\.3 Benchmark Scope

The SPLIT benchmark specializes in testing queries in five specific topics, such as Stress, Panic, Loneliness, Internal Displacement, and Tension\. Therefore, a range of other potential bottlenecks remains to be included, with ours covering only several of them, which are directly related to crisis\-affected scenarios in a Ukrainian context\. We do not state that LLMs’ lack of cultural understanding directly transfers to other areas, such as education, healthcare, business communication, or general dialogue\. These are separate subjects, which are intentionally not included in our benchmark\.

### 5\.4 Statistical Scope

The SPLIT benchmark encompasses 500 crisis\-related queries, with 100 in each of the evaluated topics\. In total, 3000 responses are generated by three LLMs, 10% of which are manually evaluated\. While this sample is sufficient for exploratory analysis, it may not capture all forms of scenarios\. Therefore, future work could incorporate additional reliability metrics or simply extract a larger sample for human validation\.
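
The counts above can be reproduced with a small sketch: 500 prompts answered by three models in two languages give 3000 responses, and a 10% sample gives 300\. The stratified draw shown here \(the same fraction from every category, model, and language cell\) and the fixed seed are assumptions for illustration; the sampling procedure actually used is not described in this section\.

```python
import itertools
import random

CATEGORIES = ["Stress", "Panic", "Loneliness", "Internal Displacement", "Tension"]
MODELS = ["Gemini-2.5-Flash", "LLaMA-3.3-70B-Instruct", "DeepSeek-V3"]
LANGUAGES = ["EN", "UA"]
PROMPTS_PER_CATEGORY = 100

# One record per generated response: (category, prompt_id, model, language).
responses = list(
    itertools.product(CATEGORIES, range(PROMPTS_PER_CATEGORY), MODELS, LANGUAGES)
)


def stratified_sample(items, fraction, seed=0):
    """Draw the same fraction from every (category, model, language) cell.

    Stratification is an illustrative assumption, not the documented method.
    """
    rng = random.Random(seed)
    cells = {}
    for item in items:
        category, _, model, language = item
        cells.setdefault((category, model, language), []).append(item)
    sample = []
    for cell_items in cells.values():
        sample.extend(rng.sample(cell_items, round(len(cell_items) * fraction)))
    return sample


human_sample = stratified_sample(responses, fraction=0.10)
print(len(responses), len(human_sample))
```

Stratifying in this way would guarantee that each of the 30 cells contributes 10 responses to the human\-evaluated subset, so no category, model, or language is under\-represented by chance\.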

### 5\.5 Model Selection

Our research employed three LLMs to address the queries generated for our SPLIT benchmark\. The three models differ substantially in architecture and were chosen to mitigate bias in the assessments and make our research more valid and multifaceted\. Nevertheless, to ensure higher credibility, future work should aim to adopt a more comprehensive approach, using a wider range of models for both answer generation and their assessment\.

Furthermore, proprietary models evolve rapidly; thus, future versions might exhibit different behavior, consequently altering the models’ answers and assessments\.

## 6 Conclusion

We introduce the SPLIT benchmark, aimed at evaluating Cross\-Lingual Empathy Divergence and Conversational Discrepancies in English and Ukrainian LLM Responses\. Our dataset consists of 500 Ukrainian crisis\-related scenarios across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension\. Our study combines prior peer research, the assessment of three LLMs, and the establishment of the LLM\-as\-a\-jury framework\. Our work encompasses both human and automated assessment, additionally aiming to validate LLMs’ capabilities in detecting cultural and linguistic nuances\. Thus, we evaluate three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding\.

Our empirical findings reveal that Gemini\-2\.5\-Flash and LLaMA\-3\.3\-70B\-Instruct experience degradation in the Empathetic Accuracy and Linguistic Naturalness metrics, while DeepSeek\-V3 demonstrates stability when addressing crisis\-related and empathy\-demanding queries within our SPLIT benchmark\. However, a positive correlation on these two dimensions suggests that a certain degree of agreement exists between human and AI assessments\. Our analysis leads to the following conclusion: fluency does not necessarily imply naturalness\. Human evaluators tend to penalize translated, rigid, or overly formal expressions much more strongly than AI judges\.

Contextual & Cultural Grounding emerges as the most challenging dimension, showing negative, insignificant alignment between human and AI evaluators, suggesting that the LLM\-as\-a\-jury paradigm struggles to apply evaluation criteria that align with human judgments for this dimension\. Therefore, while showcasing decent grammatical structures, the models are not necessarily able to follow culturally tailored crisis\-related queries\. This has led to the core final idea derived from evaluating the results of our SPLIT benchmark: producing Ukrainian text is not equivalent to producing Ukrainian emotional support\. Therefore, automated evaluation may capture some dimensions reliably, while diverging from human perception in dimensions requiring cultural and contextual immersion\.

Our findings aim to assist in the further development of LLMs and their implementation in emotional\-support\-demanding situations\. We reinforce the idea that LLMs deployed in crisis contexts require deeper cultural adaptation\. Existing evaluation benchmarks may overlook the Cultural Grounding dimension, consequently demonstrating the need for multicultural, not merely multilingual, benchmarks\. Specifically, LLM developers should take into account the cultural nuances of specific mid\-to\-low\-resource languages when cultivating and adopting post\-training objectives\.

Ultimately, as LLMs are increasingly integrated into crisis response systems, ensuring they possess genuine cultural immersion, rather than simply linguistic fluency, is no longer just an optimization goal, but a fundamental requirement for safe and responsible deployment\.

## 7 Acknowledgments

The author wants to thank Professor Russell Reid for valuable comments and suggestions on an earlier version of this paper\. Any remaining errors or inaccuracies are solely the author’s responsibility\.

## References

- \[1\] M\. Arjmand, F\. Nouraei, I\. Steenstra, and T\. Bickmore \(2024\) Empathic grounding: explorations using multimodal interaction and large language models with conversational agents\. In Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA ’24, New York, NY, USA\. ISBN 9798400706257\. [Link](https://doi.org/10.1145/3652988.3673949)
- \[2\] L\. Bandarkar, C\. Yang, M\. Fayyaz, J\. Hu, and N\. Peng \(2026\) Multilingual routing in mixture\-of\-experts\. arXiv:2510\.04694\. [Link](https://arxiv.org/abs/2510.04694)
- \[3\] A\. Bavaresco, R\. Bernardi, L\. Bertolazzi, et al\. \(2025\) LLMs instead of human judges? a large scale empirical study across 20 nlp evaluation tasks\. arXiv:2406\.18403\. [Link](https://arxiv.org/abs/2406.18403)
- \[4\] R\. R\. Bellibatlu, E\. Raff, and W\. Zhang \(2026\) JudgeSense: a benchmark for prompt sensitivity in llm\-as\-a\-judge systems\. arXiv:2604\.23478\. [Link](https://arxiv.org/abs/2604.23478)
- \[5\] T\. Brown, B\. Mann, N\. Ryder, et al\. \(2020\) Language models are few\-shot learners\. In Advances in Neural Information Processing Systems, H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\), Vol\. 33, pp\. 1877–1901\. [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
- \[6\] Y\. Chang, X\. Wang, J\. Wang, Y\. Wu, L\. Yang, K\. Zhu, H\. Chen, X\. Yi, C\. Wang, Y\. Wang, W\. Ye, Y\. Zhang, Y\. Chang, P\. S\. Yu, Q\. Yang, and X\. Xie \(2023\) A survey on evaluation of large language models\. arXiv:2307\.03109\. [Link](https://arxiv.org/abs/2307.03109)
- \[7\]J\. Chen, Y\. Dong, H\. L\. Graves, W\. Su, Y\. Zhou, M\. Zhang, Y\. Liu, and Q\. Ai\(2026\)Benchmarking llm\-as\-a\-judge for long\-form output evaluation\.External Links:2606\.01629,[Link](https://arxiv.org/abs/2606.01629)Cited by:[item 3](https://arxiv.org/html/2607.02049#S2.I4.i3.p1.1),[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[8\]G\. Churchill and S\. Skiena\(2026\)Reducing tokenization premiums for low\-resource languages\.External Links:2601\.13328,[Link](https://arxiv.org/abs/2601.13328)Cited by:[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p1.1)\.
- \[9\]J\. Cohen\(1960\)A coefficient of agreement for nominal scales\.Educational and Psychological Measurement20\(1\),pp\. 37–46\.Cited by:[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p6.1)\.
- \[10\]G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261,[Link](https://arxiv.org/abs/2507.06261)Cited by:[item 3](https://arxiv.org/html/2607.02049#S2.I1.i3.p1.1),[§3\.1\.2](https://arxiv.org/html/2607.02049#S3.SS1.SSS2.p3.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p5.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p3.1)\.
- \[11\]DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang,et al\.\(2025\)DeepSeek\-v3 technical report\.External Links:2412\.19437,[Link](https://arxiv.org/abs/2412.19437)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I1.i1.p1.1),[item 3](https://arxiv.org/html/2607.02049#S2.I1.i3.p1.1),[§3\.1\.2](https://arxiv.org/html/2607.02049#S3.SS1.SSS2.p2.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p6.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p4.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p5.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p5.1)\.
- \[12\]M\. Freitag, N\. Mathur, C\. Lo,et al\.\(2023\-12\)Results of WMT23 metrics shared task: metrics might be guilty but references are not innocent\.InProceedings of the Eighth Conference on Machine Translation,P\. Koehn, B\. Haddow, T\. Kocmi, and C\. Monz \(Eds\.\),Singapore,pp\. 578–628\.External Links:[Link](https://aclanthology.org/2023.wmt-1.51/),[Document](https://dx.doi.org/10.18653/v1/2023.wmt-1.51)Cited by:[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p6.1)\.
- \[13\]X\. Fu and W\. Liu\(2025\)How reliable is multilingual llm\-as\-a\-judge?\.External Links:2505\.12201,[Link](https://arxiv.org/abs/2505.12201)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[14\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey,et al\.\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[item 2](https://arxiv.org/html/2607.02049#S2.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p2.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p5.1)\.
- \[15\]S\. Han, G\. T\. Junior, T\. Balough, and W\. Zhou\(2025\)Judge’s verdict: a comprehensive analysis of llm judge capability through human agreement\.External Links:2510\.09738,[Link](https://arxiv.org/abs/2510.09738)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[16\]W\. Han, Y\. Zhang, Z\. Chen, B\. Liu, H\. Lin, B\. Zhang, T\. Wang, M\. Pechenizkiy, M\. Fang, and Y\. Zheng\(2025\)MuBench: assessment of multilingual capabilities of large language models across 61 languages\.External Links:2506\.19468,[Link](https://arxiv.org/abs/2506.19468)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p5.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1),[§3\.1\.1](https://arxiv.org/html/2607.02049#S3.SS1.SSS1.p2.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p4.2)\.
- \[17\]L\. He, E\. Nie, S\. S\. Dindar, A\. Firoozi, A\. Florea, V\. Nguyen, C\. Puffay, R\. Shimizu, H\. Ye, J\. Brennan, H\. Schmid, H\. Schütze, and N\. Mesgarani\(2025\)XCOMPS: a multilingual benchmark of conceptual minimal pairs\.External Links:2502\.19737,[Link](https://arxiv.org/abs/2502.19737)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p5.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p3.1)\.
- \[18\]M\. F\. B\. Hossen, M\. E\. Rahman, M\. R\. Rahman,et al\.\(2025\)Optimizing ai language models: a study of chatgpt\-4 vs chatgpt\-4o\.ODU Digital Commons: Electrical & Computer Engineering Faculty Publications\.Note:Preprint\. Accessed: June 15, 2026External Links:[Link](https://digitalcommons.odu.edu/ece_fac_pubs/512/)Cited by:[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[19\]L\. R\. P\. Houamegni and F\. Gedikli\(2025\)Evaluating the effectiveness of large language models in automated news article summarization\.External Links:2502\.17136,[Link](https://arxiv.org/abs/2502.17136)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1)\.
- \[20\]H\. Huang, X\. Bu, H\. Zhou, Y\. Qu, J\. Liu, M\. Yang, B\. Xu, and T\. Zhao\(2025\)An empirical study of llm\-as\-a\-judge for llm evaluation: fine\-tuned judge model is not a general substitute for gpt\-4\.External Links:2403\.02839,[Link](https://arxiv.org/abs/2403.02839)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1)\.
- \[21\]Z\. Huang, Z\. Wang, S\. Xia, and P\. Liu\(2024\)OlympicArena medal ranks: who is the most intelligent ai so far?\.External Links:2406\.16772,[Link](https://arxiv.org/abs/2406.16772)Cited by:[item 3](https://arxiv.org/html/2607.02049#S2.I4.i3.p1.1)\.
- \[22\]E\. Jang, K\. Lee, J\. Chung, K\. Park, and S\. Shin\(2024\)Improbable bigrams expose vulnerabilities of incomplete tokens in byte\-level tokenizers\.arXiv preprint arXiv:2410\.23684\.External Links:[Link](https://arxiv.org/)Cited by:[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p2.1)\.
- \[23\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed\(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[item 2](https://arxiv.org/html/2607.02049#S2.I4.i2.p1.1)\.
- \[24\]P\. Joshi, S\. Santy, A\. Budhiraja, K\. Bali, and M\. Choudhury\(2021\)The state and fate of linguistic diversity and inclusion in the nlp world\.External Links:2004\.09095,[Link](https://arxiv.org/abs/2004.09095)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p6.1)\.
- \[25\]A\. Kumar, N\. Poungpeth, D\. Yang, B\. Lambert, and M\. Groh\(2026\)Practicing with language models cultivates human empathic communication\.External Links:2603\.15245,[Link](https://arxiv.org/abs/2603.15245)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[26\]D\. Li, B\. Jiang, L\. Huang, A\. Beigi, C\. Zhao, Z\. Tan, A\. Bhattacharjee, Y\. Jiang, C\. Chen, T\. Wu, K\. Shu, L\. Cheng, and H\. Liu\(2025\)From generation to judgment: opportunities and challenges of llm\-as\-a\-judge\.External Links:2411\.16594,[Link](https://arxiv.org/abs/2411.16594)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1),[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p1.1)\.
- \[27\]D\. Li, R\. Sun, Y\. Huang, M\. Zhong, B\. Jiang, J\. Han, X\. Zhang, W\. Wang, and H\. Liu\(2026\)Preference leakage: a contamination problem in llm\-as\-a\-judge\.External Links:2502\.01534,[Link](https://arxiv.org/abs/2502.01534)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[28\]Q\. Li, S\. Dou, K\. B\. Shao, C\. Chen, and H\. Hu\(2026\)Evaluating scoring bias in llm\-as\-a\-judge\.arXiv preprint arXiv:2604\.06996\.Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[29\]P\. Liang, R\. Bommasani, T\. Lee,et al\.\(2023\)Holistic evaluation of language models\.External Links:2211\.09110,[Link](https://arxiv.org/abs/2211.09110)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1),[§1](https://arxiv.org/html/2607.02049#S1.p3.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p3.1),[§2\.2](https://arxiv.org/html/2607.02049#S2.SS2.p1.1)\.
- \[30\]Z\. W\. Lim, A\. F\. Aji, and T\. Cohn\(2025\)Understanding cross\-lingual inconsistency in large language models\.arXiv preprint arXiv:2505\.13141\.Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p1.1),[§3\.1\.1](https://arxiv.org/html/2607.02049#S3.SS1.SSS1.p2.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p1.1)\.
- \[31\]C\. C\. Liu, F\. Koto, T\. Baldwin, and I\. Gurevych\(2024\)Are multilingual llms culturally\-diverse reasoners? an investigation into multicultural proverbs and sayings\.External Links:2309\.08591,[Link](https://arxiv.org/abs/2309.08591)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p3.1)\.
- \[32\]M\. Lu, R\. Zhang, C\. Eickhoff, and E\. Pavlick\(2025\)Paths not taken: understanding and mending the multilingual factual recall pipeline\.External Links:2505\.20546Cited by:[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[33\]M\. Lv, L\. Chen, E\. Zhang, A\. Zhou, X\. Xue, H\. Zhang, F\. Tang, Z\. R\. Han, and M\. Wu\(2026\)Emotional intelligence in large language models is fragmented across perception, cognition, and interaction\.External Links:2605\.24686,[Link](https://arxiv.org/abs/2605.24686)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[34\]A\. Maheshwari, K\. Sharma, V\. Patel, and A\. Maheshwari\(2026\)IndicParam: benchmark to evaluate llms on low\-resource indic languages\.External Links:2512\.00333,[Link](https://arxiv.org/abs/2512.00333)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p5.1)\.
- \[35\]A\. Malik, N\. Sabri, M\. M\. Karnaze, and M\. ElSherief\(2025\-11\)Are LLMs empathetic to all? investigating the influence of multi\-demographic personas on a model’s empathy\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24938–24959\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1358/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1358),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p4.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p2.2)\.
- \[36\]R\. Mao, Q\. Liu, X\. Li, E\. Cambria, and A\. Hussain\(2025\)Bridging minds and machines: toward an integration of ai and cognitive science\.External Links:2508\.20674,[Link](https://arxiv.org/abs/2508.20674)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[37\]R\. I\. Masoud, Z\. Liu, M\. Ferianc, P\. Treleaven, and M\. Rodrigues\(2024\)Cultural alignment in large language models: an explanatory analysis based on hofstede’s cultural dimensions\.External Links:2309\.12342,[Link](https://arxiv.org/abs/2309.12342)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p3.1)\.
- \[38\]P\. Nemkova, A\. Adhikari, M\. Pearson, V\. K\. Sadu, and M\. V\. Albert\(2025\)Cross\-lingual stability and bias in instruction\-tuned language models for humanitarian nlp\.External Links:2510\.22823,[Link](https://arxiv.org/abs/2510.22823)Cited by:[§3\.1\.2](https://arxiv.org/html/2607.02049#S3.SS1.SSS2.p3.1),[§3\.2](https://arxiv.org/html/2607.02049#S3.SS2.p4.1)\.
- \[39\]OpenAI, J\. Achiam, S\. Adler, S\. Agarwal,et al\.\(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[40\]OpenAI, A\. Hurst, A\. Lerer, A\. P\. Goucher,et al\.\(2024\)GPT\-4o system card\.External Links:2410\.21276,[Link](https://arxiv.org/abs/2410.21276)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[41\]V\. Ovcharov\(2026\)The tokenizer tax across 25 european languages: domain invariance, cross\-lingual few\-shot effects, and the ukrainian penalty\.External Links:2605\.24718,[Link](https://arxiv.org/abs/2605.24718)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1)\.
- \[42\]S\. J\. Paech\(2024\)EQ\-bench: an emotional intelligence benchmark for large language models\.External Links:2312\.06281,[Link](https://arxiv.org/abs/2312.06281)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1),[§1](https://arxiv.org/html/2607.02049#S1.p3.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p2.2)\.
- \[43\]P\. Pakray, A\. Gelbukh, and S\. Bandyopadhyay\(2025\)Natural language processing applications for low\-resource languages\.Natural Language Processing31\(2\),pp\. 183–197\.External Links:[Document](https://dx.doi.org/10.1017/nlp.2024.33)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p5.1)\.
- \[44\]S\. Park, J\. Kim, and H\. Kim\(2025\-07\)Too polite to be human: evaluating LLM empathy in Korean conversations via a DCT\-based framework\.InProceedings of the Third Workshop on Social Influence in Conversations \(SICon 2025\),J\. Hale, B\. D\. Kwon, and R\. Dutt \(Eds\.\),Vienna, Austria,pp\. 76–89\.External Links:[Link](https://aclanthology.org/2025.sicon-1.6/),[Document](https://dx.doi.org/10.18653/v1/2025.sicon-1.6),ISBN 979\-8\-89176\-266\-4Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1),[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p3.1),[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p5.1)\.
- \[45\]J\. Pombal, R\. Rei, and A\. F\. T\. Martins\(2026\)Self\-preference bias in rubric\-based evaluation of large language models\.arXiv preprint arXiv:2604\.06996\.Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1)\.
- \[46\]J\. Rystrøm, H\. R\. Kirk, and S\. Hale\(2025\)Multilingual \!= multicultural: evaluating gaps between multilingual capabilities and cultural alignment in llms\.External Links:2502\.16534,[Link](https://arxiv.org/abs/2502.16534)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p1.1),[§1](https://arxiv.org/html/2607.02049#S1.p3.1),[§3\.2](https://arxiv.org/html/2607.02049#S3.SS2.p4.1),[§3\.3](https://arxiv.org/html/2607.02049#S3.SS3.p4.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p5.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2607.02049#S4.SS3.p3.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p5.2)\.
- \[47\]S\. Shahriar, B\. D\. Lund, N\. R\. Mannuru, M\. A\. Arshad, K\. Hayawi, R\. V\. K\. Bevara, A\. Mannuru, and L\. Batool\(2024\)Putting gpt\-4o to the sword: a comprehensive evaluation of language, vision, speech, and multimodal proficiency\.Applied Sciences14\(17\)\.External Links:[Link](https://www.mdpi.com/2076-3417/14/17/7782),ISSN 2076\-3417Cited by:[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[48\]A\. Sharma, I\. W\. Lin, A\. S\. Miner, D\. C\. Atkins, and T\. Althoff\(2021\)Towards facilitating empathic conversations in online mental health support: a reinforcement learning approach\.External Links:2101\.07714,[Link](https://arxiv.org/abs/2101.07714)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[49\]J\. Sim and C\. C\. Wright\(2005\)The kappa statistic in reliability studies: use, interpretation, and sample size requirements\.Physical Therapy85\(3\),pp\. 257–268\.Cited by:[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p3.1),[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p5.1)\.
- \[50\]S\. K\. Soumik\(2026\)Judging the judges: a systematic evaluation of bias mitigation strategies in llm\-as\-a\-judge pipelines\.External Links:2604\.23178,[Link](https://arxiv.org/abs/2604.23178)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1),[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p1.1)\.
- \[51\]A\. Srivastava, A\. Rastogi, A\. Rao,et al\.\(2023\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.External Links:2206\.04615,[Link](https://arxiv.org/abs/2206.04615)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[52\]M\. Syromiatnikov, V\. Ruvinskaya, and N\. Komleva\(2025\)Empowering smaller models: tuning llama and gemma with chain\-of\-thought for ukrainian exam tasks\.External Links:2503\.13988,[Link](https://arxiv.org/abs/2503.13988)Cited by:[item 2](https://arxiv.org/html/2607.02049#S2.I1.i2.p1.1)\.
- \[53\]M\. V\. Syromiatnikov and V\. M\. Ruvinskaya\(2025\-11\)UA\-code\-bench: a competitive programming benchmark for evaluating large language models code generation in ukrainian\.Informatics Culture Technology2,pp\. 308–314\.External Links:ISSN 2522\-1523,[Link](http://dx.doi.org/10.15276/ict.02.2025.47),[Document](https://dx.doi.org/10.15276/ict.02.2025.47)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p5.1)\.
- \[54\]H\. Tsai, Y\. Huang, and C\. Kuo\(2024\)Comparative analysis of automatic literature review using mistral large language model and human reviewers\.Note:Research Square PreprintExternal Links:[Document](https://dx.doi.org/10.21203/rs.3.rs-4022248/v1)Cited by:[item 2](https://arxiv.org/html/2607.02049#S2.I4.i2.p1.1)\.
- \[55\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p1.1),[item 2](https://arxiv.org/html/2607.02049#S2.I1.i2.p1.1),[item 3](https://arxiv.org/html/2607.02049#S2.I1.i3.p1.1),[item 1](https://arxiv.org/html/2607.02049#S2.I4.i1.p1.1),[item 2](https://arxiv.org/html/2607.02049#S2.I4.i2.p1.1)\.
- \[56\]A\. Vayani, D\. Dissanayake, H\. Watawana,et al\.\(2025\)All languages matter: evaluating lmms on culturally diverse 100 languages\.External Links:2411\.16508,[Link](https://arxiv.org/abs/2411.16508)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p4.1)\.
- \[57\]L\. S\. Vygotsky\(1978\)Mind in society: the development of higher psychological processes\.M\. Cole, V\. John\-Steiner, S\. Scribner, and E\. Souberman \(Eds\.\),Harvard University Press,Cambridge, MA\.Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[58\]L\. Wang, H\. Gao, C\. Zhao, X\. Sun, and D\. Dai\(2024\)Auxiliary\-loss\-free load balancing strategy for mixture\-of\-experts\.External Links:2408\.15664,[Link](https://arxiv.org/abs/2408.15664)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I1.i1.p1.1),[§3\.1\.2](https://arxiv.org/html/2607.02049#S3.SS1.SSS2.p2.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p6.1)\.
- \[59\]V\. Williams and B\. Rosman\(2025\)Heartificial intelligence: exploring empathy in language models\.External Links:2508\.08271,[Link](https://arxiv.org/abs/2508.08271)Cited by:[§2\.1](https://arxiv.org/html/2607.02049#S2.SS1.p2.1)\.
- \[60\]X\. Xing, Z\. He, H\. Xu, X\. Wang, R\. Wang, and Y\. Hong\(2024\)Evaluating knowledge\-based cross\-lingual inconsistency in large language models\.External Links:2407\.01358,[Link](https://arxiv.org/abs/2407.01358)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p1.1),[§3\.1\.1](https://arxiv.org/html/2607.02049#S3.SS1.SSS1.p2.1),[§4\.1](https://arxiv.org/html/2607.02049#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2607.02049#S4.SS2.p1.1),[§4\.4](https://arxiv.org/html/2607.02049#S4.SS4.p4.2)\.
- \[61\]Z\. Xu and J\. Jiang\(2024\)Multi\-dimensional evaluation of empathetic dialog responses\.External Links:2402\.11409,[Link](https://arxiv.org/abs/2402.11409)Cited by:[§1](https://arxiv.org/html/2607.02049#S1.p2.1)\.
- \[62\]J\. Zhao, Z\. Zhang, L\. Gao, Q\. Zhang, T\. Gui, and X\. Huang\(2024\)LLaMA beyond english: an empirical study on language capability transfer\.External Links:2401\.01055,[Link](https://arxiv.org/abs/2401.01055)Cited by:[item 2](https://arxiv.org/html/2607.02049#S2.I1.i2.p1.1)\.
- \[63\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.External Links:2306\.05685,[Link](https://arxiv.org/abs/2306.05685)Cited by:[§2\.4](https://arxiv.org/html/2607.02049#S2.SS4.p1.1),[§2\.5](https://arxiv.org/html/2607.02049#S2.SS5.p1.1)\.
- \[64\]Q\. Zhong, H\. Liao, S\. Wang, M\. Zhou, X\. Wu, R\. Mao, and W\. Chen\(2025\)Understanding and enhancing the planning capability of language models via multi\-token prediction\.External Links:2509\.23186,[Link](https://arxiv.org/abs/2509.23186)Cited by:[item 1](https://arxiv.org/html/2607.02049#S2.I1.i1.p1.1)\.
