MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors

arXiv cs.CL 07/02/26, 04:00 AM Papers
metaphor-translation evaluation-framework machine-translation llm open-source error-analysis
Summary
MetaHOPE is a metaphor-oriented evaluation framework for analyzing translation errors in machine translation and large language models. The paper proposes an error severity-aware annotation framework and evaluates models like GoogleMT, GPT5.4, and Hunyuan-7b on English-Chinese metaphor translation.
arXiv:2607.00848v1 Announce Type: new Abstract: In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and processing (NLU, NLP) because their semantic complexity, contextual dependency, and cultural embeddedness can lead to ambiguity for NLP models. To investigate how state-of-the-art NLP models perform on translating metaphors, we select three representative systems, i.e., GoogleMT, GPT5.4, and Hunyuan-7b, as Neural MT (NMT) models and LLMs. We used two human-annotated metaphor corpora, VUAMC and PSUCMC, for English-to-Chinese and Chinese-to-English translation, respectively. The original corpora are monolingual; we carried out error annotation on the system translations using the MetaHOPE framework and also produced human post-edited gold references, yielding a new bilingual resource. We believe the MetaHOPE evaluation framework for metaphor translation annotation, the parallel corpus resources, and the error analysis of SOTA automatic translation models can be useful and shed some light on the study of metaphor translation. We will share our resources publicly upon paper acceptance.
Cached at: 07/02/26, 05:38 AM
# MetaHOPE: Metaphor Translation Evaluation Framework Investigating Open-Source LLMs and State-of-the-Art Neural Translation Models
Source: [https://arxiv.org/html/2607.00848](https://arxiv.org/html/2607.00848)
Jiahui Liang (1), Lifeng Han (2, 3)
(1) Centre for Linguistics, Humanities, Leiden University, NL; (2) LIACS, Leiden University, NL; (3) BDS, Leiden University Medical Centre, NL
j.h.l.jiahui@hum.leidenuniv.nl, l.han@lumc.nl

###### Abstract

In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and processing (NLU, NLP) because their semantic complexity, contextual dependency, and cultural embeddedness can lead to ambiguity for NLP models. To investigate how state-of-the-art NLP models perform on translating metaphors, we select three representative systems, i.e., GoogleMT, GPT5.4, and Hunyuan-7b, as Neural MT (NMT) models and LLMs. We used two human-annotated metaphor corpora, VUAMC and PSUCMC, for English-to-Chinese and Chinese-to-English translation, respectively. The original corpora are monolingual; we carried out error annotation on the system translations using the MetaHOPE framework and also produced human post-edited gold references, yielding a new bilingual resource. We believe the MetaHOPE evaluation framework for metaphor translation annotation, the parallel corpus resources, and the error analysis of SOTA automatic translation models can be useful and shed some light on the study of metaphor translation. We will share our resources publicly upon paper acceptance.


## 1 Introduction

Metaphors are pervasive in everyday discourse and serve as an essential cognitive tool, enabling people to understand and communicate abstract, complex, and unfamiliar concepts through more concrete and familiar experiences. For example, economic indicators may “soar” or “plummet”, governments may “fight” inflation, and negotiations may “reach a dead end”. These expressions draw on concrete experiences of movement, conflict, and space to convey meanings that extend beyond literal language [johnson1980metaphors; smedinga2023metaphors]. Beyond their semantic complexity, metaphors are also culturally embedded, and their interpretation often requires contextual awareness, sociocultural knowledge, and conceptual reasoning. As a result, they pose challenges for both machine translation (MT) and broader natural language understanding and processing (NLU, NLP) tasks.

Recent advances in neural MT (NMT) and large language models (LLMs) have substantially improved translation quality, with some systems achieving performance comparable to human translators on general translation benchmarks [kocmi2025findings]. However, such improvements do not necessarily extend to metaphor translation [han2026towards]. [karakanta2025metaphors] report metaphor translation accuracy rates of only 64-80%, while [wang2024mmte] find that around 20% of metaphorical expressions remain non-equivalent in translation. A major source of error is overly literal translation, particularly for multi-word expressions (MWEs) such as idioms and collocations, where models often fail to capture the intended figurative meaning [mwe-2023-multiword; bhatia2024proceedings; han2024overview]. To better understand the gap between general MT performance and metaphor translation performance, it is therefore necessary to systematically analyze metaphor translation errors. Existing studies mainly focus on translation strategies [pedersen2017metaphors; zajdel2022catching; li2025mindmachine] or on translation quality in terms of equivalence, fluency, emotional effect, and authenticity [wang2024mmte], whereas fine-grained error analysis remains limited. [karakanta2025metaphors] classify issues into meaning, form, and omission, but this framework is relatively coarse-grained and does not address severity.

To address this gap, this study adapts the HOPE framework [gladkoff2022hope] for metaphor translation evaluation, forming a new framework called MetaHOPE. Originally developed as a lightweight alternative to Multidimensional Quality Metrics (MQM) [lommel2014multidimensional; lommel2024multi; gladkoff2025non], HOPE reduces annotation complexity through a smaller set of error categories and a severity-based scoring scheme. Building on this design, we develop a metaphor-oriented annotation framework that enables the systematic identification and severity assessment of metaphor translation errors. It consists of five error categories (Impact, Style, Mistranslation, Required Adaptation Missing, and Proofreading Error) together with a five-level severity scale. Using this framework, the study investigates two research questions. RQ-1) What types of metaphor translation errors are produced by different MT systems? RQ-2) How do the frequency and severity of errors vary across systems and translation directions (EN-ZH and ZH-EN)?

In this opinion paper, preliminary results show that our human annotators’ agreement levels for [GoogleMT, GPT-5.4, Hunyuan-LLM-7B] are [0.536, 0.726, 0.333] in Pearson’s correlation and [76.9%, 70.8%, 61.5%] in exact agreement. Metaphor translation errors are the main source of translation errors, accounting for [91.7%, 93.8%, 61.8%] of the error penalties of the three translation systems, respectively. We further cluster the MT errors and other notable phenomena qualitatively into distinct categories.

## 2 Background and Related Work

### 2.1 Metaphors and Translation

Traditional studies of metaphor translation treat metaphors as isolated linguistic expressions or rhetorical ornaments. Their research topics include the translatability of metaphors, translation procedures, metaphor substitution, and the question of whether the metaphorical image should be preserved [newmark1988textbook; vandenbroeck1981limits]. A metaphor can be rendered as the same metaphor, a different metaphor, a simile, or a paraphrase, or it can be deleted [toury2012descriptive].

The later cognitive turn in metaphor understanding and translation emphasizes that metaphor translation should be a conceptual mapping from the source to the target language, not only a matter of lexical choice or stylistic embellishment [schaffner2004metaphor; johnson1980metaphors]. For instance, [hong2021cognitive] survey cognitive perspectives on metaphor translation and argue that the cognitive approach offers insight for cross-cultural communication, using English-Chinese and French-Chinese as examples of distant language pairs.

Previous studies on metaphor machine translation, including [wang2024mmte], [dorst2023metaphor], and [karakanta2025metaphors], have largely relied on sentence-level translation and evaluation. However, translation scholars have criticized sentence-level MT evaluation, arguing that translations that appear acceptable in isolation may become inappropriate or inaccurate when the broader discourse context is considered [castilho2020context]. [hong2021cognitive] also emphasize that metaphor translation should go beyond sentence-by-sentence mapping and discuss the potential of combining cognitive theory and translation theories.

In line with this development, the MetaHOPE design first carries out translation at the context-aware document level, then extracts the translated sentences for annotation, and gives annotators the surrounding context.

### 2.2 Language/Domain Specific Studies

There are language- or domain-specific studies on metaphors and their translation. For instance, for Serbian-to-English translation, [milenkovic2024influence] study the influence of translation on perceived metaphor features, including metaphoricity, quality, aptness, and familiarity, on both the source and target sides. The study covers 55 Serbian metaphors translated into English using the A is B form.

Meanwhile, [khalifah2022arabic] examine Arabic-English metaphor translation from a cognitive linguistic perspective, using evidence from Naguib Mahfuz's Midaq Alley and its translated version.

Focusing on the science domain, [shuttleworth2017studying] examines how figurative language in popular science articles functions across languages and bridges the gap between metaphor research and translation studies, specifically in neurobiology and biotechnology. This work challenges the notion that scientific language is purely literal, arguing that metaphor is a vital component of conveying complex scientific concepts to the public. It develops new, theoretically nuanced procedures to describe how translators navigate and adapt metaphorical language across different linguistic and cultural contexts. Similarly, [smedinga2023metaphors] study metaphors as tools for understanding in science communication among experts and with the public. For MetaHOPE, we study English-Chinese translation in both directions and focus on the news domain as a proof of concept.

### 2.3 LLMs on Metaphor Translation

Using LLMs for metaphor translation is still an under-explored area. [wong2025mapping] conducted a bibliometric analysis of metaphor research in Translation and Interpreting Studies (TIS) based on 1,023 publications from 1964 to 2023. They found that machine translation represents only 0.68% of translation-related metaphor studies, highlighting a substantial research gap in this area.

Recent work has begun to explore the use of large language models for metaphor-related tasks. For example, [csen2026comparative] employed GPT-4o with few-shot prompting to detect conceptual legal metaphors in English–Turkish HUDOC judgments and analyze conceptual shifts across translations. However, their focus was metaphor identification and conceptual labeling rather than evaluating the quality of metaphor translation outputs. In contrast, the present study investigates how different MT/LLM systems render metaphors in translation and evaluates translation quality using the metaphor-adapted MetaHOPE framework.

In parallel, some NLP work has begun to address figurative language translation using large language models. [donthi-etal-2025-improving], for example, investigated idiomatic translation and proposed semantic idiom alignment methods to improve LLM handling of non-literal expressions, finding that semantic-level alignment better preserved figurative meaning and cultural authenticity than direct prompting approaches. However, their focus was idiomatic translation generation rather than systematic evaluation of metaphor translation quality.

The work most relevant to ours is [li2025mindmachine], which examined the translation of metaphor-related words (MRWs) by human translators, NMT, and LLMs, combining translation product analysis with think-aloud protocols and quality assessment. Their findings suggest that LLMs produce translation strategies more similar to those of human translators than conventional NMT systems do, although performance remains inconsistent for novel metaphors. However, their analysis focuses primarily on translation strategies and MRW-level behavior rather than systematic evaluation of metaphor translation quality. In contrast, the present study investigates metaphor translation outputs at the segment level (with context), using the metaphor-adapted MetaHOPE framework (a dedicated error taxonomy) to analyze fine-grained error distributions across MT and LLM systems.

![Refer to caption](https://arxiv.org/html/2607.00848v1/x1.png)

Figure 1: MetaHOPE framework: metaphor corpus preparation, MT, alignment of segments and metaphor-related words (MRWs), and post-editing with annotations on pilot/development and test sets.

## 3 MetaHOPE Methodology

As shown in Figure [1](https://arxiv.org/html/2607.00848#S2.F1), read from left to right and top to bottom, the MetaHOPE framework comprises the following steps:

- 1) Text formatting and preprocessing of the VUAMC and PSUCMC corpora. This step includes a) plain text extraction and b) CSV file preparation, with fields such as word id, whether the word is a metaphor, POS, and token position. Example tables for the formatted Chinese and English corpora are shown in Figures [2](https://arxiv.org/html/2607.00848#A2.F2) and [3](https://arxiv.org/html/2607.00848#A2.F3).
- 2.1) Machine translation (MT) of the two source texts into the target languages at the full document level for context awareness, English-to-Chinese and Chinese-to-English, using three selected systems: GoogleMT, GPT5.4, and Hunyuan-LLM-7B, as representatives of state-of-the-art NMT systems, LLMs, and the top-performing systems at the annual WMT shared task on MT [kocmi2025findings], respectively.
- 2.2) In parallel with 2.1, we sample 20 and 200 segments from each of the two corpora as a pilot study set and a system testing set, respectively.
- 3) a) Manual alignment of the four data sets, 2 x (20, 200) segments, to the system translation outputs to obtain parallel translations for the English-Chinese and Chinese-English pairs. b) Manual alignment of metaphor-related words (MRWs) in the target MT outputs to the source-side metaphor words (more details in Section [D.1](https://arxiv.org/html/2607.00848#A4.SS1)).
- 4) Pilot studies are carried out on the two sets of 20 segments, one per translation direction. The lessons learned are used to discuss the metaphor translation error annotation guidelines, resolve annotator disagreements, and refine the annotation policies for the next-stage roll-out.
- 5) Larger-scale human annotation of metaphor-related errors in the translation outputs of the three MT systems, producing: a) a post-edited human reference for each translation direction, b) MetaHOPE score tables for the three systems, and c) a qualitative analysis of error types and MT behaviors in metaphor translation.
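The sampling in step 2.2 can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes uniform random sampling, that the pilot and test sets are disjoint, and a fixed seed, none of which the text states (Section 4.1 indicates that the actual extraction also takes POS into account). The function name and the segment ids are hypothetical.

```python
import random


def split_pilot_and_test(segment_ids, pilot_size=20, test_size=200, seed=13):
    """Draw disjoint pilot and test samples from one corpus (assumed design)."""
    ids = list(segment_ids)
    needed = pilot_size + test_size
    if needed > len(ids):
        raise ValueError(f"need {needed} segments, corpus has {len(ids)}")
    rng = random.Random(seed)         # fixed seed so the split is reproducible
    chosen = rng.sample(ids, needed)  # sampling without replacement
    return chosen[:pilot_size], chosen[pilot_size:]


# Toy corpora: the segment ids below are made up for illustration.
corpora = {
    "EN-ZH": [f"en-{i:04d}" for i in range(1000)],
    "ZH-EN": [f"zh-{i:04d}" for i in range(1000)],
}
splits = {name: split_pilot_and_test(ids) for name, ids in corpora.items()}
for name, (pilot, test) in splits.items():
    print(name, len(pilot), len(test))
```

Drawing both sets in one call to `sample` keeps them disjoint by construction, so no pilot segment can leak into the test set.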

Regarding error categories, MetaHOPE is limited to the following 5 types instead of the original 8 used by the HOPE metric:

- Impact (IMP): Over-literal translation; structural shifts affecting emphasis.
- Required Adaptation is Missing (RAM): Missing cultural or idiomatic adaptation of metaphor.
- Mistranslation (MIS): Meaning mismatch; incorrect interpretation of metaphor.
- Style (STL): Loss of metaphorical effect, imagery, or emotional tone.
- Proof-reading error (PRF): Awkward or unnatural expression (not reflecting meaning).

The design of these five categories follows existing metaphor-focused studies and their mapping to the original HOPE categories. The HOPE framework defines eight error categories, but not all of them are equally relevant for metaphor translation. Based on the existing literature on metaphor translation, common error types include overly literal translation, meaning mismatch, loss of metaphorical (rhetorical and aesthetic) effect, emotional shift, structural changes (e.g. active–passive alternation depending on discourse context), omission (when meaning is lost), and unauthentic expression. These error types are mapped onto the HOPE framework to adapt it for metaphor translation. Examples of each of the five MetaHOPE error categories are listed in Table [4](https://arxiv.org/html/2607.00848#A3.T4).

Categories such as TRM (Terminology), PRN (Proper Name), and UGR (Ungrammatical) are excluded, as they operate at a different level from metaphor processing. TRM mainly concerns terminology consistency, PRN relates to the correct translation of named entities, and UGR captures grammatical well-formedness. While these issues may affect overall translation quality, they do not directly explain how figurative meaning is interpreted, adapted, or expressed. In addition, these categories are not explicitly reflected in commonly discussed error types in the metaphor translation literature.

For quantitative scoring of error penalties for each error type, we keep the 5 severity levels, i.e., minor, medium, major, severe, and critical, but use the linear score alignment (2, 4, 6, 8, 10) instead of the exponential score range used by the original HOPE (2^x: 1, 2, 4, 8, 16). The rationale is that the original score range can be very sparse: e.g., the same error can be labeled by different annotators as 4 (medium) or 16 (critical), which potentially leads to higher disagreement.
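The scoring scheme described above can be sketched in a few lines. The category codes and the two point scales come from the text; the function names, the convention that a missing category means "no error" (0 points), and the example annotation are assumptions for illustration only.

```python
CATEGORIES = ("IMP", "RAM", "MIS", "STL", "PRF")
SEVERITIES = ("minor", "medium", "major", "severe", "critical")
METAHOPE_POINTS = (2, 4, 6, 8, 10)  # linear scale from the text
HOPE_POINTS = (1, 2, 4, 8, 16)      # exponential scale quoted for the original HOPE


def penalty(severity):
    """MetaHOPE penalty for one severity label; None means no error (0 points)."""
    if severity is None:
        return 0
    return METAHOPE_POINTS[SEVERITIES.index(severity)]


def segment_score(annotation):
    """Sum the penalties over the five categories for one annotated metaphor."""
    unknown = set(annotation) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return sum(penalty(annotation.get(cat)) for cat in CATEGORIES)


def widest_gap(points):
    """Largest possible score difference between two annotators on one error."""
    return max(points) - min(points)


example = {"MIS": "major", "STL": "minor"}  # made-up annotation
print(segment_score(example))               # 6 + 2 = 8
print(widest_gap(METAHOPE_POINTS), widest_gap(HOPE_POINTS))  # 8 15
```

The last line illustrates the stated rationale: when two annotators agree that an error exists but disagree on its severity, their scores can differ by at most 8 points on the linear scale, against 15 on the exponential one.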

In addition, we report what percentage of the error penalties in our data comes from metaphors and what percentage comes from the non-metaphorical parts of the same sentences.

## 4 Experimental Evaluation

### 4.1 Data Preprocessing

Section [D](https://arxiv.org/html/2607.00848#A4) details how we extracted the sampled data according to part-of-speech (POS) and describes the segmentation steps, including data extraction and filtering. The alignment of metaphor-related words from the source to the target language follows the guideline in Section [D.1](https://arxiv.org/html/2607.00848#A4.SS1).

### 4.2 Pilot Study on Development Set

Annotator backgrounds: Annotator-A is a PhD candidate in linguistics and translation studies, who holds an MA in digital humanities and a BA in translation. Annotator-B is a Master’s student majoring in translation and interpreting studies (MTI). The annotations were carried out independently, following the instruction manual.

#### 4.2.1 Annotator Agreements

Inter-annotator agreement was evaluated using Krippendorff’s α and quadratic weighted Cohen’s κ, treating MetaHOPE severity scores as ordered penalty levels (0, 2, 4, 6, 8, 10). Segment-level (SEGS) agreement was computed by summing IMP, RAM, MIS, STL, and PRF into SEGS for each metaphor instance, as in Table [1](https://arxiv.org/html/2607.00848#S4.T1). More detailed per-error-type agreement is listed in Table [6](https://arxiv.org/html/2607.00848#A5.T6). From these two tables, the strongest agreement appears for MIS, especially for Hunyuan-LLM-7B. However, overall α/κ values are low because Annotator B applied much higher total penalties than Annotator A. Exact agreement is high because most cells are zero, but α/κ reveal the severity-bias problem more clearly.

Although exact agreement was relatively high, α and κ were modest, suggesting systematic severity differences between annotators, particularly with Annotator B assigning substantially higher penalties.
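To make the difference between these agreement measures concrete, the sketch below computes exact agreement, Pearson's r, and quadratic weighted Cohen's κ over the ordered levels (0, 2, 4, 6, 8, 10). It is a plain-Python illustration with made-up scores, not the authors' evaluation script; Krippendorff's α is left out for brevity, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

LEVELS = (0, 2, 4, 6, 8, 10)  # ordered penalty levels from the text


def exact_agreement(a, b):
    """Share of items on which both annotators gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Pearson correlation; undefined (division by zero) if a rater is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def quadratic_weighted_kappa(a, b, levels=LEVELS):
    """Cohen's kappa with weights (i - j)^2 / (k - 1)^2 over ordered levels."""
    idx = {lv: i for i, lv in enumerate(levels)}
    n, k = len(a), len(levels)
    observed = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)
    col = Counter(idx[y] for y in b)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[(i, j)] / n          # observed disagreement
            den += w * row[i] * col[j] / (n * n)     # chance disagreement
    return 1.0 - num / den


# Made-up scores: annotator B is always exactly one level stricter than A.
ann_a = [0, 0, 2, 4]
ann_b = [2, 2, 4, 6]
print(exact_agreement(ann_a, ann_b))                     # 0.0
print(pearson_r(ann_a, ann_b))                           # 1.0
print(round(quadratic_weighted_kappa(ann_a, ann_b), 3))  # 0.579
```

In this toy case a constant severity offset leaves the correlation perfect while exact agreement drops to zero and κ falls in between, which is why the measures can tell different stories about the same pair of annotators.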

Table 1: Segment-level inter-annotator agreement on summed MetaHOPE penalties (SEGS).

GPT-5.4 has the highest annotator consistency (r = 0.726), GoogleMT has moderate agreement, and Hunyuan has lower agreement, probably because it generated more varied metaphor outputs, making annotation harder or more subjective. This is an interesting finding from the pilot study: harder-to-interpret translations may reduce annotator consistency.

#### 4.2.2 MetaHOPE Error Statistics

The pilot annotation covers 65 lines of translation annotation on 65 metaphors, drawn from 20 unique sentences for English-to-Chinese and 32 for Chinese-to-English; some sentences contain multiple metaphor words.

From Annotator-A, the summary of the full sentence-level error penalty (EPS), which includes the metaphor-level penalty component, is displayed in Table [2](https://arxiv.org/html/2607.00848#S4.T2). Correspondingly, the summary of the error penalty on metaphors only (EPM) across each error type on the 65 lines is shown in Table [3](https://arxiv.org/html/2607.00848#S4.T3) for the three tested systems, where Ratio (M/S) is the ratio EPM/EPS, i.e., the error score of the metaphors divided by the error score of the full sentences. From this Ratio (M/S), we can see that metaphor-caused errors account for 91.7% and 93.8% of the penalties for GoogleMT and GPT-5.4, respectively, i.e., they are the major cause of automated translation errors. For Hunyuan-LLM-7B, metaphor-caused errors are still the main source at 61.8% (>50%), but it also has many other error types, including hallucination, which we discuss in the qualitative analysis (Section [4.2.3](https://arxiv.org/html/2607.00848#S4.SS2.SSS3)).
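The Ratio (M/S) defined above can be sketched as follows. The record layout, the function name, and the numbers are hypothetical; the sketch assumes the ratio is taken over per-system totals and that each line's EPS already includes its EPM, as the text describes.

```python
from collections import defaultdict


def ratio_m_over_s(rows):
    """Return {system: total EPM / total EPS} from per-line penalty records."""
    eps_total = defaultdict(int)
    epm_total = defaultdict(int)
    for row in rows:
        if row["epm"] > row["eps"]:
            # EPS includes the metaphor component, so EPM can never exceed it.
            raise ValueError("metaphor penalty cannot exceed sentence penalty")
        eps_total[row["system"]] += row["eps"]
        epm_total[row["system"]] += row["epm"]
    return {
        system: (epm_total[system] / eps) if eps else None  # None: no errors
        for system, eps in eps_total.items()
    }


# Made-up penalties for two placeholder systems (not results from the paper).
rows = [
    {"system": "SystemA", "eps": 6, "epm": 6},
    {"system": "SystemA", "eps": 10, "epm": 6},
    {"system": "SystemB", "eps": 0, "epm": 0},
]
print(ratio_m_over_s(rows))  # {'SystemA': 0.75, 'SystemB': None}
```

Returning `None` for a system with no sentence-level penalty avoids a division by zero and keeps "no errors at all" distinct from "no metaphor errors".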

Table 2: MetaHOPE sentence-level penalty (EPS) distribution across three MT systems on the pilot set.

Table 3: MetaHOPE metaphor-level penalty score (EPM) distribution across three MT systems on the pilot set.

#### 4.2.3 Qualitative Categorization on Errors

We categorize some of the error phenomena observed in the three systems, which can be useful for further research on this topic.

1) GoogleMT and GPT5.4 track the source more closely, while Hunyuan-llm-7b is more flexible in its generated translations. This flexibility sometimes produces more natural translations, while at other times it leads to hallucination, e.g., adding extra information or losing some source meaning (addition or reduction). Example-1:

- Src: An organisation that doesn’t change fossilises. (fossilises; N)
- GoogleMT: 一个不改变的组织就会僵化。
- GPT5.4: 一个不变化的组织会僵化。
- Hunyuan-7b: 变革是必要的，否则组织就会僵化停滞

In Example-1, Hunyuan-7b departs from the original English word order, which actually makes the translation sound more natural. The other two translations are closer to literal translations that follow the source word by word (rigid). However, Hunyuan-7b’s output is also debatable, since “变革是必要的” (“change is necessary”) is not stated in the source sentence; the notion 必要的 (“necessary”) is added. Example-2:

- Src: She pledged that the Government would safeguard those that did not opt for trust status, but she expected this to be a minority. (pledged; V)
- GoogleMT: 她承诺政府将保障那些未选择信托地位的诊所的权益，但她预计这只是少数。
- GPT5.4: 她承诺，政府将保障那些不选择信托地位的医院，但她预计这将只是少数。
- Hunyuan-7b: 她相信…政府也会保护那些未选择自主管理的医院，但这类医院只会占少数。

In this example, on the one hand, “信托地位” is a literal translation of “trust status”, while “自主管理” (“self-management”) is more down-to-earth (native-like) Chinese that most readers can understand. Even though “信托地位” also exists in Chinese, it reads like a loanword, and lay readers will not understand what it means. So Hunyuan-7b did a better job of translating and localizing this foreign concept. On the other hand, Hunyuan-7b’s “相信” (“believe”) is not an accurate rendering of “pledged”. Interestingly, the current ChatGPT (using GPT5.5) can explain the concept behind this literal translation from GPT5.4 (API) well when given the following prompt:

«I am doing a new project on metaphor translation investigation using three different models - GoogleMT/GPT5.4/Hunyuan-llm. I need some feedback when I am doing annotation for human gold standard reference preparation from English to Chinese. From the following sentence, "She pledged that the Government would safeguard those that did not opt for trust status, but she expected this to be a minority." What does "trust status" mean?»

The suggestion of ChatGPT is:

«"她承诺政府会保障那些不选择转为信托制的学校，但她预计这只会是少数。" for school, or "成为信托机构的资格/地位" and "转为 NHS 信托机构" for healthcare.»

Example-3:

- Src: Earnings were level at 17.5p a share. (level; AJ)
- GoogleMT: 每股收益持平于17.5便士。
- GPT5.4: 每股收益持平，为17.5便士。
- Hunyuan: 每股收益为17.5便士
- Ref: 每股收益持平于17.5便士。

In Example-3, Hunyuan-7b lost the “level” information by saying only “收益为” (“earnings were”), dropping an important feature in a financial context: the figure is neither higher nor lower but unchanged compared with the previous reporting period. Meanwhile, the other two systems used “收益持平” (“earnings held level”), which indicates that the level is the same as in the previous report.

2) Translation inconsistency. The same translation system sometimes translates a term inconsistently. For instance, GoogleMT sometimes translates “FT-SE” into 富时100 and sometimes just keeps “FT-SE”, as in the two examples below. Example-1:

- Src: Prices have remained high - indeed the FT-SE index has risen another 55 points since then - allowing even the most passive of private investors, including unit trust holders, to take advantage of the market.
- GoogleMT: 价格一直保持高位——事实上，自那以后，FT-SE指数又上涨了 55 点——这使得包括单位信托基金持有者在内的最被动的私人投资者也能从市场中获利。

Example-2:

- Src: The change in employment may not always be so favourable as yesterday’s either, but the market is now starting to feel bullish and looking for a FT-SE of 3,000 by next year.
- GoogleMT: 就业形势的变化或许不会总是像昨日那样乐观，但市场目前开始看涨，并预期富时100指数明年将达到3000点。
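An inconsistency of this kind can be flagged automatically once the candidate renderings of a term are known. The sketch below is a hypothetical helper, not part of MetaHOPE: it assumes simple substring matching and a hand-supplied list of renderings, and it uses short excerpts of the two GoogleMT examples above.

```python
from collections import Counter


def rendering_counts(pairs, source_term, renderings):
    """Count which known rendering each MT output uses for source_term."""
    counts = Counter()
    for src, mt in pairs:
        if source_term not in src:
            continue  # the term does not occur in this source segment
        found = [r for r in renderings if r in mt]
        # Zero or several matches cannot be attributed to one rendering.
        counts[found[0] if len(found) == 1 else "other"] += 1
    return counts


def is_inconsistent(counts):
    """A term is rendered inconsistently if more than one rendering occurs."""
    return len(counts) > 1


# Excerpts of the two examples above (source excerpt, GoogleMT excerpt).
pairs = [
    ("indeed the FT-SE index has risen another 55 points since then",
     "自那以后，FT-SE指数又上涨了 55 点"),
    ("looking for a FT-SE of 3,000 by next year",
     "并预期富时100指数明年将达到3000点"),
]
counts = rendering_counts(pairs, "FT-SE", ["FT-SE", "富时100"])
print(dict(counts), is_inconsistent(counts))
```

Substring matching is enough for a fixed term such as an index name; inflected or segmented terms would need proper alignment, as in the manual MRW alignment step.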

3) Metaphorical word-alignment issues: metaphor word to none (general issues). In some cases, no clear target word maps to a source metaphor word. In Example-2 of the inconsistency issue above, GoogleMT used “看涨” (“bullish”) for “feel bullish”, and there is no exact or separate translation of “feel”. However, one of the individual annotation entries for this English source sentence focuses on (feel; V). This reflects a limitation of “word-level” metaphor translation evaluation.

4) The selection of Chinese interchangeable words. Modern Chinese often has interchangeable words, but there are regulations that guide their use, as reflected by the “Legislative Drafting Technical Standards (Trial Implementation) (II)” (Document No. [2011] 5 issued by the Legislative Affairs Commission of the Standing Committee of the National People’s Congress): 《立法技术规范(试行)(二)》（全国人大常委会法工委发 [2011]5号）. (Footnote 1: Post from China University of Petroleum, Beijing, accessed 8 June 2026, https://www.cup.edu.cn/yww/jpbl/d0f2ffb8c08a4f3cb74782ed39a50fef.htm) For instance, consider the words “做出” and “作出” in the following example:

- Src: She recalled a promise made by Mr Major when he became Prime Minister: that he would work for a nation at ease with itself. (ease; N)
- GoogleMT: 她回顾了梅杰先生就任首相时做出的承诺：他将致力于建设一个安心自在的国家。
- GPT5.4: 她回忆起梅杰先生出任首相时作出的承诺：他将致力于建设一个与自身和谐相处的国家。
- Hunyuan-7b: 她回忆起约翰梅杰在担任首相时曾承诺要打造一个让民众安心的医疗体系
- Ref: 她回顾了梅杰先生就任首相时作出的承诺：他将致力于建设一个安心自在的国家。

## 5 Conclusions

This paper introduced MetaHOPE, a metaphor-oriented adaptation of the HOPE translation evaluation framework for investigating metaphor translation performance in neural machine translation (NMT) and large language models (LLMs). Motivated by the semantic complexity, contextual dependency, and cultural embeddedness of metaphor, MetaHOPE operationalizes metaphor translation quality through five error categories, namely Impact (IMP), Required Adaptation Missing (RAM), Mistranslation (MIS), Style (STL), and Proofreading Error (PRF), combined with a severity-aware scoring scheme. In doing so, the framework transforms cognitively motivated concerns in metaphor translation, such as conceptual meaning transfer, metaphor remapping, and pragmatic-cultural adaptation, into measurable annotation categories suitable for empirical MT evaluation.

As a proof of concept, using metaphor-containing news data sampled from the VUAMC and PSUCMC corpora, we conducted a pilot investigation of three representative translation systems, GoogleMT, GPT-5.4, and Hunyuan-LLM-7B, across English–Chinese and Chinese–English translation settings. Preliminary findings suggest that metaphor translation remains a major challenge for current systems. Metaphor-related errors account for a substantial proportion of overall translation penalties, indicating that figurative language remains a key bottleneck despite recent improvements in general MT quality. We also observed systematic differences across systems: GoogleMT and GPT-5.4 tended to preserve source-side wording more conservatively, while Hunyuan-LLM-7B demonstrated greater flexibility and localization ability, although sometimes at the cost of factual consistency, including hallucination. In addition, the pilot study revealed that harder-to-interpret metaphor translations may reduce inter-annotator consistency.

## References

## 附录 ARationale for News Corpus

The rationale to use the news domain is that news is rich in metaphors and it is not a widely studied domain for this topic yet\. Corpus research has shown that, among academic articles, fiction, conversation, and news discourse, news texts contain the second\-highest frequency of metaphor\-related expressions, exceeding both fiction and conversational discoursesteen2010vu;steen2010method, which suggests that metaphor constitutes an important linguistic feature of news discourse\. News translation features: compared with other discourse types, such as literary translation that often emphasizes stylistic and aesthetic effectsbassnett2013translation, news translation primarily focuses on the rapid and effective communication of information to target audiences\. To achieve this, news texts are often adapted to suit the communicative needs and reading conventions of target readersbielsa2008translation\.

At the same time, news translation is not a fully objective transfer of information, but a process shaped by cultural, institutional, and ideological factors through selection and rewritinglefevere2016translation\. As a result, information may be reorganized or reframed to align with the sociocultural and sociopolitical expectations of target audiencesbielsa2008translation\.

Within this context, metaphors in news discourse may help make complex events and issues more understandable and vivid to readers, while also carrying rhetorical, emotional, humorous, ironic, and ideological functions in the representation and framing of news events (semino2008metaphor). Studies on media discourse further suggest that metaphors in news reporting may dramatize events, influence readers’ responses, and guide interpretations of political and social issues (trvckova2011multi; molek2014coercive). Metaphors in the news are therefore not merely decorative expressions, but important discourse tools that influence how events are presented and understood.

Overall, metaphor translation in news discourse warrants attention, since translation shifts or errors may alter the meanings and functions of source metaphors\. Investigating translation errors and translation quality in news metaphor translation is therefore important for examining how metaphorical meanings and discourse functions are conveyed across languages in news communication\.

## Appendix B Examples of Source Corpus and Related Work

### B.1 Metaphor Study Corpora

There are parallel corpora for metaphor studies, such as the aforementioned work from wang2024mmte on English-Italian and English-Chinese (one-directional). That work adopts a heterogeneous multi-domain corpus for benchmarking metaphor-sensitive MT evaluation.

Regarding monolingual corpora, VUAMC is the largest available English corpus annotated at the word level for all metaphorical language use, based on a systematic and explicit metaphor identification procedure, MIPVU (steen2010method; VUAMC: http://www.vismet.org/metcor/documentation/home.html). It covers about 190,000 lexical units from a subset of four broad registers: academic texts, conversation, fiction, and news texts. The corpus was used for metaphor identification shared tasks (leong2018report; leong2020report), and its Fleiss’ Kappa is over 0.8, which indicates good inter-annotator agreement (steen2010vu).

TroFi is one of the early datasets for distinguishing literal and nonliteral usages of verbs, constructed from the Wall Street Journal (news discourse) (birke2006clustering; birke2007active; TroFi: https://natlang.cs.sfu.ca/software/trofi.html). It consists of 3,727 English sentences covering 50 verbs from news texts. The metaphoricity of verb usage is annotated at the word level with a binary classification of literal vs nonliteral. Inter-annotator agreement is reported on a subset of 200 examples, with a Cohen’s Kappa of 0.77, indicating relatively good agreement.

PSUCMC is a word-level-annotated Chinese dataset based on an adapted version of the Metaphor Identification Procedure Vrije Universiteit (MIPVU), whose reliability for Mandarin Chinese has been validated through inter-annotator agreement (PSUCMC: https://sites.psu.edu/xxl13/cmc/). It consists of 30,012 words and covers three registers: academic discourse, fiction, and news. Fleiss’ Kappa on this corpus is over 0.8, which indicates good inter-annotator agreement (lu2017towards).

For our work, we select the VUAMC and PSUCMC corpora as the English and Chinese source-language data, respectively, because they use the same annotation strategy. Both corpora are annotated at the lexical-unit level following the Metaphor Identification Procedure Vrije Universiteit (MIPVU) (steen2010method), which identifies metaphorical expressions through dictionary-based comparison between contextual and more basic meanings and is a refinement of the metaphor identification procedure (MIP) proposed by group2007mip. They cover similar discourse types, and both include the news domain. These features make them comparable for studying the two languages. For MetaHOPE, we use these two source corpora to carry out MT and post-editing to generate a reference translation, rather than relying on existing human reference translations. This is because metaphor translation often allows multiple valid and creative solutions, and a single reference cannot adequately represent all acceptable interpretations. Therefore, the focus is placed on analyzing system outputs rather than comparing them against a fixed gold standard. We will present our methodology in detail in the next section.

![Refer to caption](https://arxiv.org/html/2607.00848v1/Fig-Meta/psucmc-formating.png)

Figure 2: PSUCMC formatting.

![Refer to caption](https://arxiv.org/html/2607.00848v1/Fig-Meta/vuamc-formatting.png)

Figure 3: VUAMC formatting.

## Appendix C MetaHOPE Error Types with Examples

Table 4: MetaHOPE error types with illustrative examples and explanations.
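
To make the idea of severity-aware scoring over the five error categories concrete, the following is a minimal sketch of how annotated errors could be aggregated into penalty totals and a metaphor-related share. The severity labels and point values, the field names `category`, `severity`, and `metaphor_related`, and the function `score_annotations` are all assumptions introduced for illustration; this section does not specify the actual penalty values.

```python
from collections import defaultdict

# The five MetaHOPE error categories named in the paper.
CATEGORIES = ("IMP", "RAM", "MIS", "STL", "PRF")

# ASSUMPTION: placeholder severity labels and point values for illustration;
# the actual penalty values are not given in this section.
SEVERITY_POINTS = {"minor": 1, "medium": 2, "major": 4, "severe": 8, "critical": 16}


def score_annotations(annotations):
    """Sum penalty points per category and compute the metaphor-related share.

    Each annotation is a dict with the hypothetical keys 'category',
    'severity', and 'metaphor_related' (bool).
    Returns (points per category, total points, metaphor-related share).
    """
    per_category = defaultdict(int)
    metaphor_points = 0
    for ann in annotations:
        if ann["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {ann['category']}")
        points = SEVERITY_POINTS[ann["severity"]]
        per_category[ann["category"]] += points
        if ann["metaphor_related"]:
            metaphor_points += points
    total = sum(per_category.values())
    share = metaphor_points / total if total else 0.0
    return dict(per_category), total, share
```

Keeping the metaphor-related flag separate from the category makes it possible to report how much of the overall penalty stems from metaphor handling, independent of which category the error falls under.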

## Appendix D Data Preprocessing Details

#### D.0.1 Data Extraction and Filtering via POS

Only metaphorically used nouns, verbs, adjectives, and adverbs are included in the analysis. Grammatical function words, particularly prepositions, are excluded, as previous studies on English-Chinese translation have shown that such items frequently undergo omission or transformation due to structural differences between the two languages (shih2012corpus). These translation shifts are often caused by grammatical differences between English and Chinese rather than by metaphor processing itself. Therefore, the analysis focuses on content words in order to better examine metaphor translation patterns. Chinese-specific Part-of-Speech (POS) categories in the original corpus annotation, such as vn (verb-noun) and i (idiom), were further normalized into the broader categories used in this study (noun, verb, adjective, and adverb). Since category boundaries in Chinese are often flexible, the classification was determined based on the contextual syntactic function of each metaphorical expression within the sentence. Similarly, English-specific annotation categories in the original corpus, such as “N+N” and “V+AV”, were also mapped onto these broader categories according to the contextual meaning and syntactic function of the metaphorical expression in context.
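
The filtering and normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the tag strings `n`, `v`, `a`, `d`, and `p`, the table `DEFAULT_MAP`, and the functions `normalize_pos` and `filter_content_mrws` are hypothetical and not taken from the corpora or the paper. Because tags such as vn, i, “N+N”, and “V+AV” were resolved by context, the sketch requires a manual label for them rather than guessing.

```python
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}

# ASSUMPTION: illustrative tag inventory; the real corpus tag sets may differ.
# A value of None means the tag has no fixed broad category and must be
# resolved by an annotator from the syntactic function in context.
DEFAULT_MAP = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "d": "adverb",
    "p": "preposition",  # function word, filtered out below
    "vn": None,          # Chinese verb-noun
    "i": None,           # Chinese idiom
    "N+N": None,         # English compound annotation
    "V+AV": None,        # English compound annotation
}


def normalize_pos(tag, manual_label=None):
    """Return the broad POS category, preferring an annotator's manual label."""
    if manual_label is not None:
        return manual_label
    broad = DEFAULT_MAP.get(tag)
    if broad is None:
        raise ValueError(f"tag {tag!r} needs a manual, context-based label")
    return broad


def filter_content_mrws(mrws):
    """Keep only metaphor-related words whose broad category is a content word.

    `mrws` is a list of (word, tag, manual_label_or_None) triples.
    """
    kept = []
    for word, tag, manual in mrws:
        broad = normalize_pos(tag, manual)
        if broad in CONTENT_POS:
            kept.append((word, broad))
    return kept
```

Raising an error for context-dependent tags, instead of assigning a default, keeps the manual decision explicit in the pipeline.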

#### D.0.2 Segment Processing

Context is particularly important for resolving ambiguity, maintaining coherence, and interpreting context-dependent expressions, especially for metaphors. Also, human translators do not usually translate sentences completely in isolation but rely on the surrounding context when producing translations. Therefore, instead of translating isolated sentences, the present study provides larger text segments to translation models in order to better approximate real translation conditions. To remain compatible with the HOPE-based annotation framework, the generated translations are subsequently segmented back into sentences and aligned at the sentence/segment level for later annotation and post-editing analysis. However, annotators are required to first read the full source text in order to build contextual understanding before conducting sentence/segment-level annotation and evaluation. For the segment selection, the study focuses on authentic news reports and excludes genres such as editorials and opinion pieces. Compared with opinion-oriented news discourse, hard news reporting generally prioritizes clarity and information delivery, allowing a more controlled comparison of metaphor translation across languages and systems. For both translation directions, segments were sampled from the corpora. Since not every sentence within a sampled segment contains metaphorical expressions, only sentences containing metaphorically used nouns, verbs, adjectives, or adverbs were included in the final analysis. A single sentence may also contain multiple metaphorical expressions.
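
The segment-then-sentence workflow above could look roughly like the sketch below: a translated segment is split back into sentences, paired with the source sentences by position, and only pairs whose source sentence contains a metaphor-related word are kept. The regex-based splitter, the positional pairing, the substring check, and the function names are simplifying assumptions for illustration; the study's actual segmentation and alignment procedure is not specified at this level of detail.

```python
import re

# Naive splitter: a run of text followed by sentence-final punctuation
# (English or Chinese). It will mis-split abbreviations and decimals.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split a segment into stripped, non-empty sentences."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def align_segment(source_segment, mt_segment):
    """Pair source and MT sentences by position.

    Returns None when the sentence counts differ, signalling that the
    segment needs manual alignment instead of automatic pairing.
    """
    src = split_sentences(source_segment)
    tgt = split_sentences(mt_segment)
    if len(src) != len(tgt):
        return None
    return list(zip(src, tgt))


def keep_metaphor_pairs(pairs, mrws):
    """Keep pairs whose source sentence contains at least one listed MRW.

    ASSUMPTION: substring matching stands in for the corpora's word-level
    metaphor annotation.
    """
    return [(s, t) for s, t in pairs if any(w in s for w in mrws)]
```

For example, `align_segment("Prices soared last year. The bank was closed.", "去年价格飙升。银行关门了。")` yields two sentence pairs, of which `keep_metaphor_pairs(pairs, ["soared"])` retains only the first.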

The final dataset consists of 200 metaphor-containing sentences for each translation direction, to ensure that the analysis remained manageable while still allowing for reasonably reliable observations of translation quality patterns. Research on translation quality evaluation has found, through comparisons of different sample sizes, that a sample of fewer than 200 sentences cannot statistically reflect MT system quality in the translation quality evaluation task (gladkoff2022measuring). (However, in our current study, we already discovered many errors from MT systems using a smaller set of samples; this does not affect the demonstration of the effectiveness of the MetaHOPE framework.) Since the MetaHOPE design involves detailed manual metaphor annotation and qualitative error analysis, 200 sentences were considered an appropriate balance between analytical reliability and the practical feasibility of in-depth analysis.

Table [5](https://arxiv.org/html/2607.00848#A4.T5) gives the statistics of the two extracted corpora we used, including the language, number of segments, sentence length, metaphor-containing sentences, total MRWs, and counts for each POS.

Table 5: Corpus statistics of the metaphor translation dataset, including average sentence length and the distribution of metaphor-related word categories.

### D.1 Aligning Metaphor Related Words (Source, MT.output)

In this section, we discuss the detailed framework for the Alignment of Metaphor\-Related Words \(MRWs\) in Translation and its application to our task of preparing the corpus for MetaHOPE annotators\.

The objective of the word-level alignment task is to identify how the metaphor-related meaning associated with an MRW is realized in translation. The task focuses on semantic-functional correspondence rather than strict lexical equivalence. Since metaphor-related meaning is often context-sensitive, semantically rich, and culturally dependent, metaphor translation frequently involves reformulation across linguistic and cultural systems. Therefore, annotators must examine the target text to identify whether there is a target word, phrase, or larger textual unit that realizes the metaphor-related meaning associated with the source MRW. The alignment framework is informed by the translation shift and equivalence literature in Translation Studies, which suggests that translation frequently involves structural, semantic, and pragmatic reformulation rather than strict formal correspondence (e.g., catford1965linguistic; baker1992other; chesterman1997ethics). In particular, paraphrasing, implicitation, restructuring, semantic reformulation, and pragmatic adaptation are common translational strategies that may affect how metaphor-related meaning is realized in the target text. As a result, metaphor-related meaning is not always realized through direct lexical correspondence in translation. Instead, it may be reformulated, redistributed across multiple units, implicitly conveyed through syntactic structure or discourse context, partially preserved, or omitted altogether. Alignment decisions in the present study are therefore made according to how metaphor-related meaning is represented in translation rather than through strict lexical matching.
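
One way to record such alignment decisions is a small typed record per MRW, sketched below. The realization labels follow the outcomes listed in the paragraph above, but the class names `Realization` and `MRWAlignment`, the field names, and the validation rule (an omitted MRW has no target span, an implicit one may have none, all others need one) are assumptions made for illustration rather than the annotation schema used in the study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Realization(Enum):
    """How the metaphor-related meaning of a source MRW surfaces in the target."""
    DIRECT = "direct lexical correspondence"
    REFORMULATED = "reformulated"
    REDISTRIBUTED = "redistributed across multiple units"
    IMPLICIT = "implicitly conveyed"
    PARTIAL = "partially preserved"
    OMITTED = "omitted"


@dataclass
class MRWAlignment:
    source_mrw: str
    realization: Realization
    target_span: Optional[str] = None  # word, phrase, or larger unit

    def __post_init__(self):
        # ASSUMPTION: hypothetical consistency rule for this sketch.
        if self.realization is Realization.OMITTED and self.target_span:
            raise ValueError("an omitted MRW cannot have a target span")
        needs_span = self.realization not in (Realization.OMITTED, Realization.IMPLICIT)
        if needs_span and not self.target_span:
            raise ValueError(f"{self.realization.name} requires a target span")
```

For instance, `MRWAlignment("soared", Realization.REFORMULATED, "大幅上涨")` records a reformulated rendering, while `MRWAlignment("soared", Realization.OMITTED)` records that no target unit carries the meaning.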

Manual alignment: pallucchini2025lost argue that multilingual models still struggle with polysemy, homonymy, and language-specific semantic structures, while current alignment methods often rely on only “implicit and somewhat weak” correspondence signals between languages. Similarly, miao2024enhancing note that “the acquisition of token-level or word-level supervisory signals remains a challenging topic of ongoing discussion,” indicating that reliable token-level semantic alignment is still unresolved in multilingual NLP systems. Therefore, even if auto-alignment tools are involved, manual verification and correction are still required. Based on the consideration above, this study adopts manual annotation directly to ensure context-sensitive and semantically informed alignment decisions.

Alignment Principle with Examples: In metaphor translation, metaphor-related meanings are realized through different translational patterns, as shown in Figure [4](https://arxiv.org/html/2607.00848#A4.F4). MetaHOPE annotators need to examine the target text to determine how metaphor-related meaning is conveyed in translation using the prepared highlighted/aligned data from this guideline framework.

![Refer to caption](https://arxiv.org/html/2607.00848v1/x2.png)

Figure 4: Metaphor translation alignment examples realized through different translational patterns

## Appendix E Detailed MetaHOPE Annotation Agreement

Table 6: Per-error-type inter-annotator agreement across 65 metaphor segments.
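
For readers who want to reproduce a per-error-type agreement figure on their own annotations, a minimal sketch using Cohen's kappa for two annotators is given below. The choice of Cohen's kappa, the binary "error type present in this segment" encoding, and the function name `cohen_kappa` are assumptions for illustration; this section does not state which agreement coefficient underlies Table 6.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Under the assumed encoding, each error type gets one pair of 0/1 lists (one entry per segment, 1 if that annotator marked the error type), and the function is called once per type.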

## Appendix F Discussion

When a metaphor is rendered as a paraphrase in the target language rather than as a metaphorical expression, it is an open question whether a translation error penalty should apply. In our current MetaHOPE annotation design, we count it as a penalty; however, from our observations during the pilot study, not all annotators applied this strictly. We apply strict control on this aspect for the full test set annotation.

## Appendix G Prompts Used

For Hunyuan\-MT\-7B, the default prompt provided in the Hugging Face example is adopted to ensure standardized usage\. As systems such as Google Translate and Hunyuan\-MT\-7B operate without external contextual information, the same prompt format is applied to GPT to control for prompt\-related variation and ensure comparability across systems\. The prompt example is as below:

##### “Translate the following segment into Chinese, without additional explanation. [input texts]”
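
As a small illustration of reusing one prompt format across systems, the sketch below fills the quoted template with a target language and an input segment. The function name `build_prompt`, the blank-line separator between instruction and input, and the use of "English" for the reverse direction are assumptions; only the Chinese-target wording is given above.

```python
# Instruction wording taken from the prompt quoted above; the target language
# is parameterized so the same format can serve both translation directions.
PROMPT_TEMPLATE = (
    "Translate the following segment into {target}, "
    "without additional explanation."
)


def build_prompt(segment, target="Chinese"):
    """Return the instruction followed by the segment to translate.

    ASSUMPTION: instruction and input are separated by a blank line.
    """
    if not segment.strip():
        raise ValueError("segment must not be empty")
    return PROMPT_TEMPLATE.format(target=target) + "\n\n" + segment


if __name__ == "__main__":
    print(build_prompt("Prices soared last year.", target="Chinese"))
```

Building the prompt in one place keeps the instruction text identical for every model that accepts a textual prompt.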

## Appendix H Implications and Future Work

More broadly, MetaHOPE contributes toward bridging cognitive metaphor theory and empirical MT evaluation\. Existing metaphor translation research has often emphasized translation strategies, metaphor preservation, or conceptual mapping, while MT evaluation research typically relies on broad sentence\-level quality metrics\. MetaHOPE offers a middle ground by enabling fine\-grained, metaphor\-sensitive error analysis that can systematically identify how metaphorical meaning is preserved, distorted, weakened, adapted, or lost during translation\.

The present work represents a proof-of-concept study and opens several directions for future research. First, we plan to extend MetaHOPE to the full-scale test set and further refine annotation guidelines to improve inter-annotator agreement. Second, although this study focuses on the news domain, future work will examine the generalizability of MetaHOPE across other metaphor-rich domains, such as literary texts, fiction, academic discourse, political speeches, and science communication, potentially using corpora such as literary metaphor datasets and scientific communication corpora. Third, future analyses may investigate how different metaphor types (such as conventional vs. novel metaphors, deliberate vs. non-deliberate metaphors, or culture-specific vs. potentially universal metaphors) affect MT and LLM translation behavior. Finally, future work may explore semi-automatic or LLM-assisted metaphor annotation and alignment support, reducing human annotation costs while maintaining interpretability and reliability in metaphor-sensitive translation evaluation.