Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

arXiv cs.CL 06/17/26, 04:00 AM Papers
Summary
This paper investigates verbalized methods for extracting LLM confidence in machine translation outputs, comparing them with internal token probabilities. The study finds that while both approaches perform similarly in error detection and calibration, there is little correlation between internal and verbalized confidence measures.
arXiv:2606.17234v1 Announce Type: new
# Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation
Source: [https://arxiv.org/html/2606.17234](https://arxiv.org/html/2606.17234)
Ali Marashian¹, Alexis Palmer¹, Katharina von der Wense¹,²

¹University of Colorado Boulder, USA; ²Johannes Gutenberg University Mainz, Germany

Correspondence: [ali\.marashian@colorado\.edu](mailto:ali.marashian@colorado.edu)

###### Abstract

The rapid rise in popularity of large language models \(LLMs\) for translation calls for a thorough study of the reliability of their confidence in their own outputs\. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity \(tokens, words, or spans\)\. Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness\. In addition, they require access to such internal signals\. Here, we devise five verbalized methods of extracting an LLM’s per\-token confidence without those shortcomings and compare their reliability with that of the model’s internal signals of certainty\. We evaluate reliability using two forms of alignment: fine\-grained error detection and calibration\. For both, internal and verbalized methods perform similarly, although results vary by model\. Interestingly, we find little to no correlation between internal and verbalized methods\.


## 1 Introduction

With the growing use of large language models \(LLMs\) for machine translation \(MT\) Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11), [2025](https://arxiv.org/html/2606.17234#bib.bib10)\), it is becoming increasingly important to be able to judge their reliability\. However, LLMs are known to be poorly self\-calibrated for MT: output token probabilities are not good indicators of the quality of the corresponding translation Wu et al\. \([2025](https://arxiv.org/html/2606.17234#bib.bib39)\); Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\)\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/x1.png)

Figure 1: Illustration of internal vs\. verbalized confidence estimation, using token probabilities as internal signals and word\-level confidence obtained via prompts; we explore broader variants of both in our experiments\.

One of the key challenges of MT calibration is that there is no single way of translating a given input: e\.g\., one can choose between synonyms for the same concept, or between multiple syntactic realizations of the same semantic content\. Therefore, if the model is unsure between multiple possible tokens, we cannot directly take that as representing the model’s confidence in the validity of the selected token\. Instead, it demonstrates the model’s uncertainty, as several potentially valid options compete for probability mass\. More generally, this phenomenon in generative tasks is known as surface form competition Holtzman et al\. \([2021](https://arxiv.org/html/2606.17234#bib.bib8)\); Wiegreffe et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib38)\)\.

An aspect that remains underexplored in MT is that LLMs can verbalize how certain they are by generating tokens that estimate confidence Tian et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib32)\), i\.e\., by "saying" how confident they are\. Thus, they can provide an estimate of the correctness of their outputs that is not limited by surface form competition\. Verbalized confidence has the further advantage of being easily obtainable by end users without access to model internals\.

In this paper, we focus on the verbalized per\-token confidence of LLMs for MT and explore how it aligns with the LLMs’ translation performance \(cf\. Figure [1](https://arxiv.org/html/2606.17234#S1.F1)\)\. Since there is no single, programmatic way of accessing verbalized confidence, we explore five techniques\. For \(the commonly considered\) internal uncertainty, we employ two different indicators: token\-level probability and entropy\. We compare how both verbalized confidence and internal uncertainty align with the ground truth using two evaluation paradigms: fine\-grained binary error detection and calibration\. Additionally, we examine the correlation between the verbalized confidence and internal uncertainty to understand how similar the two high\-level strategies are\.

Our results show that verbalized confidence performs comparably to or better than internal uncertainty\-based methods on fine\-grained binary error detection on average, depending on the model\. For calibration, we find that verbalized methods perform similarly to model probability but lag behind model entropy\. Moreover, we find no significant correlation between verbalized confidence and internal uncertainty in our setting\.

## 2 Related Work

Although uncertainty \(the variability of a model’s response for a particular input\) and confidence \(the model’s belief that a specific output is correct\) are distinct concepts, they are so closely related that uncertainty is frequently used in practice to estimate confidence Liu et al\. \([2025](https://arxiv.org/html/2606.17234#bib.bib16)\)\. As a result, “calibration” in the literature can mean both the alignment between confidence \(e\.g\., as verbalized by the model itself\) and ground truth and the alignment between certainty and ground truth\.

To estimate model uncertainty, some approaches utilize external knowledge, e\.g\., auxiliary models trained to predict confidence scores Mielke et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib21)\); Tsai et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib33)\); Ulmer et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib34)\), or search engines used to detect errors Gou et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib5)\)\. Other approaches rely solely on the model\. A number of them quantify sample consistency, measuring the similarity between multiple model responses Tian et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib32)\); Manakul et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib20)\); Wang and Holmes \([2024](https://arxiv.org/html/2606.17234#bib.bib37)\)\. Many works use internal signals\. For multiple\-choice question answering or other classification tasks, one can aggregate the probabilities of different tokens that the model would use for each option Han et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib7)\); Zhang et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib43)\); Wang et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib35)\); Lovering et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib17)\); Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\)\.

Another approach that leverages only the model itself is to have the LLMs verbalize their confidence\. This is typically done for QA Lin et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib15)\); Kadavath et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib9)\); Tian et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib32)\); Yang et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib41)\); Xiong et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib40)\); Ni et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib23)\)\. Instead of QA, our work focuses on MT\. Rather than considering the ground truth, Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\) focus on the relationship between a model’s internal uncertainty and its verbalized confidence for QA, introducing the concept of “Confidence\-Probability Alignment\.” In our work, we also consider the correlation of verbalized methods with token entropies as an additional internal signal\.

MT calibration has been a topic of interest in the literature even before the boom of LLMs Ott et al\. \([2018](https://arxiv.org/html/2606.17234#bib.bib25)\); Kumar and Sarawagi \([2019](https://arxiv.org/html/2606.17234#bib.bib14)\); Wang et al\. \([2020](https://arxiv.org/html/2606.17234#bib.bib36)\); Lu et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib19)\)\. A key challenge for MT calibration is assigning errors to particular spans in the output\. Previous work has addressed this problem by automatically estimating these spans using Translation Edit Rate Snover et al\. \([2006](https://arxiv.org/html/2606.17234#bib.bib31)\)¹ or by manually annotating the output of MT models Fomicheva et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib4)\); Sarti et al\. \([2022](https://arxiv.org/html/2606.17234#bib.bib28)\); Yang et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib42)\); Sarti et al\. \([2025a](https://arxiv.org/html/2606.17234#bib.bib29)\)\. Dinh and Niehues \([2025](https://arxiv.org/html/2606.17234#bib.bib3)\) boost the probability of certain tokens to counterbalance the under\-confidence caused by surface form competition for MT and other generative tasks\.

¹ By considering the minimum number of edits necessary to get from the produced hypothesis to a gold translation: if a token needs to change, it is labeled as wrong\. These pseudo\-labels have been shown to be inconsistent with real labels Yang et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib42)\)\.

The most relevant study to ours is Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\), which studies the predictive power of different signals of model uncertainty and their alignment with actual error spans in translation\. We extend their work by examining the capabilities of LLMs to verbalize their confidence and the alignment of these verbalizations with both ground\-truth error spans and internal indicators\.

## 3 Experimental Design

We now describe the experimental design, including the datasets, models, different measures of model confidence, and evaluation metrics\.

### 3\.1 Dataset

Our goal is to investigate the alignment of different measures of model confidence with ground truth error annotations\. The basis of our experiments is the WMT2024 General Machine Translation shared task data\. We look at translations from English to Chinese, Czech, Hindi, Japanese, and Russian\.

In order to assess the reliability of model confidence measures, we further leverage the Error Span Annotations \(ESA; Kocmi et al\., [2024b](https://arxiv.org/html/2606.17234#bib.bib12)\) by Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11)\), who mark errors in the outputs of several models for 9 language pairs\. In ESA, annotators mark erroneous spans in the translation and categorize them as either minor or major errors\. Table [1](https://arxiv.org/html/2606.17234#S3.T1) shows an annotated example\.

Following prior work, we do not distinguish between the two error categories: tokens are considered either correct or incorrect Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\)\. For each translation direction–model pair, we have 634 annotated outputs\. We use these annotations as the ground truth\. As a development set, we randomly select 100 English sentences and their respective translations for each language pair\. This leaves us with 534 test sentences per language pair\.

#### Word Segmentation

Some of our proposed verbalized approaches require prompting a model for its confidence for specific words\. For this, we automatically segment the text of the translations into words: for Czech, Hindi, and Russian we split text on whitespace, for Chinese we use [Jieba](https://github.com/fxsjy/jieba), and for Japanese we employ [Nagisa](https://github.com/taishi-i/nagisa)\.

### 3\.2 Models

Using ESA, Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11)\) annotate the outputs of 28 systems, including shared task submissions, 7 popular LLMs, and additional online systems\. Note that annotations are not available for all systems and all translation directions\. Of the available LLMs, we select the only two that are open\-source: Aya23 Aryabumi et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib1)\) and Llama3\-70B Grattafiori et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib6)\)\. We use the same setup and prompts as Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11)\), in order to obtain identical probability and entropy for the annotated outputs\. Code and results are available on GitHub \(link will be provided upon publication\)\.

### 3\.3 Verbalized Confidence Measures

We now describe our five proposed approaches for obtaining a model’s verbalized confidence in its translations\. Following Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\), we use separate prompts for translation and confidence elicitation: we do not tell the LLM that it is evaluating its own translations, in order to avoid bias in its judgments Zheng et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib44)\)\. We leave same\-session self\-evaluation to future work\.

1. List: We prompt the model to return a list containing those spans in the translation that it is not confident about\. If a token is part of any returned span, we assign that token the label incorrect\.
2. Word\_Numeric: We prompt the model to return its confidence for every word as a number within $[0,1]$\. We ask for the score of each word in a separate prompt; see Figure [1](https://arxiv.org/html/2606.17234#S1.F1)\. For our analysis, the numerical confidence score assigned to a word is assigned to all of its constituent tokens\. If a token contains letters of two words, we treat it as part of the second word\.
3. Word\_Likert: Instead of numbers, we ask LLMs to assign qualitative labels to each word\. Following the approach of Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\) for question answering, we consider six different labels: very uncertain, not certain, somewhat certain, moderately certain, fairly certain, and very certain\. We convert these labels into real numbers from 0 \(for very uncertain\) to 1 \(for very certain\) in increments of 0\.2\. As in Word\_Numeric, the score assigned to a word is assigned to its constituent tokens\. The motivation of Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\) for this approach is the hypothesis that an LLM might use qualitative descriptors better than numerical values\.
4. Token\_Numeric: We further explore the effect of increased granularity by using the model’s tokenizer and repeating Word\_Numeric with tokens instead of words\. This requires using the model’s tokenizer, which limits its applicability in practical settings\. In addition, it is more computationally expensive than the other methods\. We report the costs of all prompts in Appendix [E\.3](https://arxiv.org/html/2606.17234#A5.SS3)\.
5. Token\_Likert: We further experiment with a version of Word\_Likert, but for the outputs provided by the model’s tokenizer\.
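The label-to-score conversion for the Likert methods and the word-to-token propagation rule can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `word_scores_to_tokens` is hypothetical, and the sketch assumes that words and tokens have been stripped of whitespace so that they concatenate to the same string.

```python
# Six Likert labels, ordered from least to most confident.
LIKERT_LABELS = [
    "very uncertain",
    "not certain",
    "somewhat certain",
    "moderately certain",
    "fairly certain",
    "very certain",
]
# Evenly spaced scores 0.0, 0.2, ..., 1.0 (increments of 0.2).
LIKERT_SCORE = {label: i / 5 for i, label in enumerate(LIKERT_LABELS)}


def word_scores_to_tokens(words, word_scores, tokens):
    """Assign each token the score of the word it belongs to.

    Assumption (hypothetical setup): "".join(words) == "".join(tokens)
    and no token is empty. A token that spans two words takes the score
    of the second word, which here is the word of its last character.
    """
    # Map every character position to the index of the word covering it.
    char_to_word = []
    for word_index, word in enumerate(words):
        char_to_word.extend([word_index] * len(word))

    token_scores = []
    position = 0
    for token in tokens:
        last_char = position + len(token) - 1
        token_scores.append(word_scores[char_to_word[last_char]])
        position += len(token)
    return token_scores


words = ["das", "Haus"]
scores = [LIKERT_SCORE["very certain"], LIKERT_SCORE["not certain"]]
tokens = ["da", "sH", "aus"]  # "sH" straddles both words
print(word_scores_to_tokens(words, scores, tokens))
```

Using the last character of a token to pick its word is one simple way to realize the "second word" rule; other alignments would also satisfy the description.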

#### Prompts

We use the development set of English to Czech translations of Aya23 together with the corresponding error annotations for prompt engineering\. Appendix [A](https://arxiv.org/html/2606.17234#A1) showcases the prompts for all verbalized confidence methods\.

#### Binarization

Since the ground truth labels for error detection are binary, we binarize the output of continuous methods to measure their alignment with the ground truth\. For each method–model pair, following Ni et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib23)\), we tune a threshold on the development set\. We also report results using an optimal binarization threshold based on the test set itself, rather than the threshold of the development set as we do here; see Appendix [E](https://arxiv.org/html/2606.17234#A5)\.
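Threshold tuning of this kind can be sketched as a simple grid search. The selection criterion is an assumption here: the sketch picks the threshold that maximizes F1 of the incorrect class on the development set, and the names `f1_incorrect` and `tune_threshold` as well as the candidate grid are hypothetical.

```python
def f1_incorrect(pred_incorrect, gold_incorrect):
    """F1 of the 'incorrect' class for boolean predictions and labels."""
    pairs = list(zip(pred_incorrect, gold_incorrect))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(dev_confidences, dev_gold_incorrect, candidates):
    """Return (threshold, F1) with the best development-set F1.

    A token is flagged as incorrect when its confidence is below the
    threshold. Assumption: F1 of the incorrect class is the criterion.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        preds = [c < threshold for c in dev_confidences]
        score = f1_incorrect(preds, dev_gold_incorrect)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


dev_conf = [0.9, 0.8, 0.3, 0.2, 0.6]
dev_gold = [False, False, True, True, False]
print(tune_threshold(dev_conf, dev_gold, [0.1, 0.25, 0.5, 0.7, 0.95]))
```

For entropy, where higher values mean lower certainty, the comparison would be flipped or the score negated before thresholding.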

Table 1: An example of English→Russian translations from the WMT24 dataset Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11)\) annotated according to the ESA protocol\. Light red indicates minor errors, whereas darker red denotes major errors\. All tokens that are part of a shaded word are considered wrong in our analysis\.

### 3\.4 Baseline Confidence Measures

#### Internal \(Un\)certainty

We compare our verbalized approaches to two internal confidence measures\. The first one is the probability of $t^{*}_{i}$, the $i$\-th predicted token in the output translation, given all previously generated tokens: $p(t^{*}_{i} \mid t^{*}_{<i})$\. The second is the entropy of the probability distribution over the vocabulary $V$: $-\sum_{j=1}^{|V|} p(t_{ij} \mid t^{*}_{<i}) \, \log_{2} p(t_{ij} \mid t^{*}_{<i})$\. A higher probability indicates greater certainty in the correctness of that token\. In contrast, higher entropy corresponds to lower certainty\. We perform the binarization as outlined in Section [3\.3](https://arxiv.org/html/2606.17234#S3.SS3)\.
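Both internal measures can be computed from the next-token distribution at a single decoding step. The sketch below is illustrative only: it works on a toy list of logits rather than a real model's vocabulary, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probability(probs, chosen_index):
    """p(t*_i | t*_<i): probability of the token that was generated."""
    return probs[chosen_index]


def entropy_bits(probs):
    """Entropy (base 2) of the distribution over the vocabulary."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Two equally plausible synonyms: the chosen token only gets 0.5,
# although both options could be valid translations.
probs = softmax([0.0, 0.0])
print(token_probability(probs, 0), entropy_bits(probs))
```

The toy case with two equally likely tokens mirrors surface form competition: the probability of the selected token is low even if either choice would be a correct translation.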

We choose those two measures as they are the top\-performing unsupervised methods for the dataset we experiment with Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\)\. In addition to their high performance, they are often obtainable even from closed\-source models, as those frequently return token probabilities as well as the top\-$k$ tokens, from which entropy can be estimated Manakul et al\. \([2023](https://arxiv.org/html/2606.17234#bib.bib20)\)\.

#### Random Baseline

We further compare to a random baseline that assigns a label to each token by sampling from the labels’ probability distribution estimated on the development set\. We repeat this baseline 10 times and report the average\.
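One possible reading of this baseline is sketched below: the share of incorrect tokens is estimated on the development set, and each test token is then labeled incorrect with that probability, independently, in each of 10 runs. The function name, the seed, and the independence assumption are illustrative choices, not details given in the text.

```python
import random


def random_baseline(dev_gold_incorrect, n_tokens, runs=10, seed=0):
    """Sample `runs` random labelings for `n_tokens` test tokens.

    Each token is labeled incorrect (True) with the probability of the
    incorrect label observed on the development set. The seed is an
    assumption added for reproducibility.
    """
    rng = random.Random(seed)
    p_incorrect = sum(dev_gold_incorrect) / len(dev_gold_incorrect)
    return [
        [rng.random() < p_incorrect for _ in range(n_tokens)]
        for _ in range(runs)
    ]


labelings = random_baseline([True, False, False, False], n_tokens=5)
print(len(labelings), len(labelings[0]))
```

The reported baseline score would then be the average of the metric over the sampled labelings.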

### 3\.5 Binary Error Detection

To examine how well different confidence measures align with the ground truth, we consider a binary error detection task and report both performance and model calibration\. The goal of this task is to identify correct and incorrect tokens in a provided translation\.

Following Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\), we report the F1 score of the incorrect class as our main metric\. We use the binarized version of all confidence metrics\.

We also assess how well the models are calibrated for different confidence measures: in a well\-calibrated model, token confidence values accurately correspond to the actual probability of those tokens being correct\. To measure calibration, we report Expected Calibration Error \(ECE\)\. For ECE, the predictions are sorted and partitioned into $K$ bins $\{B_1, B_2, \cdots, B_K\}$\. Each bin corresponds to an interval of confidence\. ECE is the weighted average, over all bins, of the absolute difference between the actual accuracy and the expected accuracy \(according to the confidence scores\):

$$\mathrm{ECE} = \sum_{i=1}^{K} \frac{|B_i|}{N} \left| acc_i - conf_i \right|,$$

where $|B_i|$ denotes the number of predictions in bin $i$, $N$ is the total number of predictions, $acc_i$ is the true fraction of correctly predicted instances in bin $i$, and $conf_i$ is the mean of the prediction confidences in bin $i$ Naeini et al\. \([2015](https://arxiv.org/html/2606.17234#bib.bib22)\)\.
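The ECE definition translates directly into code. The sketch below assumes equal-width confidence bins over $[0,1]$ (the text only says that each bin covers an interval of confidence) and uses a hypothetical function name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (binning is an assumption).

    confidences: per-token confidence values in [0, 1]
    correct:     per-token booleans, True if the token is correct
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        acc = sum(1 for _, ok in members if ok) / len(members)
        avg_conf = sum(conf for conf, _ in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Four tokens with confidence 0.5, three of them correct:
# accuracy 0.75 vs. mean confidence 0.5 gives ECE = 0.25.
print(expected_calibration_error([0.5] * 4, [True, True, True, False]))
```

With discrete verbalized scores, many bins stay empty, so the result depends mostly on how far the accuracy within each reported confidence level is from that level.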

In addition, we report Area Under the Receiver Operating Characteristic Curve \(AUROC\) and Area Under the Precision\-Recall Curve \(AUPRC\) scores\. They both aggregate the results over all possible thresholds for classification, reflecting how well the confidence scores discriminate between classes Ulmer et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib34)\)\. AUPRC is more robust to class imbalance than AUROC Qi et al\. \([2021](https://arxiv.org/html/2606.17234#bib.bib27)\)\.
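As a threshold-free illustration, AUROC can be computed as the probability that a randomly chosen incorrect token receives a higher error score than a randomly chosen correct token, with ties counted as one half. The sketch below covers only AUROC, takes the error score to be something like one minus confidence (an assumption), and uses a quadratic-time pairwise comparison for clarity.

```python
def auroc(error_scores, is_incorrect):
    """AUROC via pairwise comparison of positives and negatives.

    error_scores: higher means 'more likely incorrect'
                  (assumption: e.g. 1 - confidence, or entropy)
    is_incorrect: booleans, True for the positive (incorrect) class
    """
    positives = [s for s, y in zip(error_scores, is_incorrect) if y]
    negatives = [s for s, y in zip(error_scores, is_incorrect) if not y]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one token of each class")
    wins = sum(
        (p > q) + 0.5 * (p == q) for p in positives for q in negatives
    )
    return wins / (len(positives) * len(negatives))


print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
```

A value of 0.5 corresponds to scores that carry no information about which tokens are wrong.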

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-entropy-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-probability-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-word_numeric_obo-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-word_likert_obo-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-token_numeric_obo-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-token_likert_obo-average-results.png)
![Refer to caption](https://arxiv.org/html/2606.17234v1/images/ece_pics/llama3-dev-list-average-results.png)

Figure 2: Reliability diagrams on the development set for eligible methods using Llama3\-70B, aggregated across all languages\. We use 10 bins in our evaluation\. Bar height is the average accuracy of the bin\. Darker shades imply higher density of predictions in the bin\.

Table 2: F1 scores of negative labels for the binary error detection task on the test set \(higher is better\)\. Best performing results are bolded in each column\. Underlined values indicate no statistically significant difference from the top score; all other results are significantly lower\.

Table 3: Results for error calibration on the test set\. All the numbers report the ECE score \(lower is better\)\. Best performing results are bolded in each column\.

#### Statistical Significance

We assess statistical significance using the bootstrap method with Holm–Bonferroni correction to control the family\-wise error rate\. Significance is marked in the tables in Section [4](https://arxiv.org/html/2606.17234#S4), and full results, including pairwise comparisons between methods, are reported in Appendix [B](https://arxiv.org/html/2606.17234#A2)\.
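The Holm–Bonferroni step-down procedure itself is short. The sketch below takes p-values as given (for instance from bootstrap comparisons) and uses a significance level of 0.05, which is an assumption since the level is not stated here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected.

    Step-down procedure: sort p-values ascending and compare the k-th
    smallest (k = 0, 1, ...) with alpha / (m - k); stop at the first
    failure. alpha=0.05 is an assumed significance level.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # all larger p-values are not rejected either
    return reject


print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

In the example, 0.005 and 0.01 pass their thresholds (0.0125 and about 0.0167), while 0.03 exceeds 0.025, so the procedure stops and the remaining hypotheses are retained.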

## 4 Results

#### F1

The results for error detection are presented in Table [2](https://arxiv.org/html/2606.17234#S3.T2)\. The scores are uniformly low but above random, ranging from 0\.02 to 0\.23\. Averaged over each method–model pair, they fall between 0\.10 and 0\.17\.

For Aya23, Entropy and Word\_Numeric perform best on average\. In 3 of the 5 language pairs, they both perform best or are statistically indistinguishable from the best result\. For Llama3\-70B, Word\_Likert has the best performance in 4 out of 5 language pairs\. This shows that no single method consistently outperforms the others; the best\-performing method depends on both the language and the model\. However, verbalized methods generally perform better with Llama3\-70B\. The clearest inconsistency appears with Word\_Likert: while it performs best on average for Llama3\-70B, it is the worst\-performing non\-random method for Aya23\.

The greatest discrepancy between internal and verbalized methods is seen for English→Japanese translations on the Llama3\-70B model, where verbalized methods beat internal methods by 0\.09 F1\. In no setting do Probability and Entropy significantly outperform all verbalized methods\. However, verbalized methods outperform them in 5 of the 10 settings\.

#### ECE

Table [3](https://arxiv.org/html/2606.17234#S3.T3) shows the test set results for calibration\. Scores range from 0\.04 to 0\.70, with method–model pair averages ranging from 0\.05 to 0\.48\. For both models, the normalized Entropy consistently yields the best results\. For Aya23, Word\_Likert is the second best method on average, being on par with Entropy for English\-to\-Japanese translations\. For Llama3\-70B, Probability, Word\_Likert and Token\_Likert show comparable results on average\.

We use reliability diagrams to visualize calibration\. They partition predictions into bins according to their confidence, and showcase the average accuracy of each bin Niculescu\-Mizil and Caruana \([2005](https://arxiv.org/html/2606.17234#bib.bib24)\)\. Figure [2](https://arxiv.org/html/2606.17234#S3.F2) shows the reliability diagrams for the aggregation of all tokens in the development set for the Llama3\-70B model\. As we can see, all methods underestimate the correctness of miscalibrated tokens\.

Likert methods consistently outperform their Numeric counterparts in terms of calibration\. The weak calibration of numeric methods is consistent with prior work; LLMs generally show poor calibration in numerical contexts Lovering et al\. \([2025](https://arxiv.org/html/2606.17234#bib.bib18)\)\.

Interestingly, even though Token\_Likert is one of the best performing methods for Llama3\-70B, it does not perform competitively for Aya23\. Word\_Likert is more consistent on both models\. This can perhaps be explained in part by differences in tokenization of the two models, while the word segmentation process used for Word\_Numeric and Word\_Likert is model\-agnostic\.

#### AUROC and AUPRC

Tables [16](https://arxiv.org/html/2606.17234#A5.T16) and [17](https://arxiv.org/html/2606.17234#A5.T17) \(in the appendix\) report AUROC and AUPRC results, respectively\. We observe mixed results overall, with generally weak alignment across methods\. For AUROC, Entropy consistently performs best for Aya23, while for Llama3\-70B, Word\_Numeric and Token\_Numeric outperform other measures\. For AUPRC, internal measures generally yield better results than verbalized ones for Aya23, whereas for Llama3\-70B, Word\_Numeric and Word\_Likert perform on par with internal methods on average\. These results are largely consistent with those for the F1 score\.

Table 4: Results for the alignment of continuous verbalized methods with token probabilities \(top four rows\) and with token entropies \(bottom four rows\)\. All the numbers report Spearman’s correlation coefficient \(higher indicates higher alignment\)\. The best alignment with either probabilities or entropies is bolded in each column\.

## 5 Analysis

### 5\.1 Alignment of Internal and Verbalized Measures

Given the similar performance of internal and verbalized measures, we further ask if there is substantial agreement between the two types of measures\. Therefore, we evaluate the correlation of verbalized and internal measures with each other\. Following Kumar et al\. \([2024](https://arxiv.org/html/2606.17234#bib.bib13)\), we report Spearman’s rank correlation coefficient\. Given variables $X$ and $Y$, each of size $n$, we have $n$ pairs of raw scores $(X_i, Y_i)$\. Spearman’s coefficient is calculated as follows:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)},$$

where $d_i$ is the difference in rank of $X_i$ and $Y_i$ in their corresponding lists\.
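The rank-difference formula can be checked on a small example. The sketch below assumes that neither list contains ties, which is the case in which this closed form is exact; with heavily tied scores (such as discrete confidence levels), a tie-aware implementation, for example `scipy.stats.spearmanr`, would be the safer choice.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).

    Exact only when x and y contain no tied values (assumption).
    """
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


verbalized = [0.1, 0.4, 0.3]
internal = [0.9, 0.2, 0.5]
print(spearman_rho(verbalized, internal))  # ranks are exactly reversed
```

Here the ranks are [1, 3, 2] and [3, 1, 2], so the squared differences sum to 8 and the coefficient is 1 - 48/24 = -1.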

Table [4](https://arxiv.org/html/2606.17234#S4.T4) and Tables [18](https://arxiv.org/html/2606.17234#A5.T18) and [19](https://arxiv.org/html/2606.17234#A5.T19) \(in the appendix\) show the alignment of verbalized confidence and internal certainty\. Word\_Numeric yields the highest correlation with probability in all 10 settings and the highest correlation with entropy in 9 of the 10 settings\. Overall, however, we see that, for the two models in our experiments, there is little to no correlation between internal and verbalized measures\. Comparing the two models, Llama3\-70B generally achieves slightly higher correlations than Aya23\. Appendix [C](https://arxiv.org/html/2606.17234#A3) presents an example of the disagreement between internal and verbalized measures\.

Table 5: The ratio of numerical confidence outputs that are multiples of the respective column number, aggregated across all languages\. All columns include the number 0 as a multiple\. The results are for the test set, for Aya23 \(top two rows\) and Llama3\-70B \(bottom two rows\)\.

### 5\.2 Precision and Recall Tradeoff

Figure [3](https://arxiv.org/html/2606.17234#S5.F3) shows the F1, precision, and recall for different methods averaged across all languages for Aya23 and Llama3\-70B\. Verbalized methods typically have higher recall than internal methods\. Verbalized methods have lower precision for Aya23 but they match internal methods for Llama3\-70B, particularly for the Word\_Likert and Word\_Numeric methods\.

Some methods behave noticeably differently between the two models\. For Llama3\-70B, Word\_Numeric has higher recall than Word\_Likert, and Token\_Numeric has lower recall than Token\_Likert\. For Aya23, both relations are reversed\. Notably, the List method consistently has some of the highest recalls of all methods in both models\. In contrast, Word\_Likert has a very high recall for Aya23 but the lowest recall for Llama3\-70B\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/f1_agg_charts/aya-test-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/f1_agg_charts/llama3-test-average-results.png)

Figure 3: F1, precision, and recall scores for different methods averaged over all the languages on the test set for Aya23 \(top\) and Llama3\-70B \(bottom\)\.

### 5\.3 Granularity

One advantage that verbalized Numeric methods might have over their Likert counterparts is their ability to return arbitrarily fine\-grained values\. However, from Figure [5](https://arxiv.org/html/2606.17234#A5.F5) \(in the appendix\) it seems that, in practice, the two models do not make use of this flexibility\. Table [5](https://arxiv.org/html/2606.17234#S5.T5) shows the ratios of different levels of granularity for the numeric methods for both models\. For Aya23, the confidences are rarely not multiples of $0.2$\. For both models, all confidences are multiples of $0.1$\. Thus, even the Numeric methods are limited to at most 11 confidence levels \(multiples of 0\.1 in the range $[0,1]$\), compared to the 6 levels of Likert methods\.
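The granularity ratios can be computed with a small helper like the one below. It is a sketch: the function name and the floating-point tolerance are assumptions, and the confidence values are made up for illustration.

```python
def ratio_of_multiples(confidences, step, tol=1e-9):
    """Fraction of confidences that are integer multiples of `step`.

    Zero counts as a multiple of every step. `tol` absorbs
    floating-point error and is an assumed value.
    """
    def is_multiple(value):
        quotient = value / step
        return abs(quotient - round(quotient)) < tol

    return sum(1 for c in confidences if is_multiple(c)) / len(confidences)


# Illustrative values only, not taken from the paper's results.
confidences = [0.0, 0.2, 0.5, 0.85, 1.0]
for step in (0.1, 0.2, 0.5):
    print(step, ratio_of_multiples(confidences, step))
```

In this toy list, four of five values are multiples of 0.1, and three of five are multiples of 0.2 and of 0.5.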

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/f1_agg_charts/open_vs_closed_threshold_aya-test-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/f1_agg_charts/open_vs_closed_threshold_llama3-test-average-results.png)

Figure 4: The difference between average precision, recall, and F1 for open vs\. closed class words on the test set for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Each column denotes the performance for open tokens minus that of closed tokens\.

### 5\.4 Syntactic Categories

To see differences between the methods, we use grammatical functions as our main point of comparison\. We use Stanza Qi et al\. \([2020](https://arxiv.org/html/2606.17234#bib.bib26)\) for part\-of\-speech tagging of the dataset\. For the purpose of our analysis, we assign the syntactic category of a word to each of its constituent tokens\.

Figures [6](https://arxiv.org/html/2606.17234#A5.F6) and [7](https://arxiv.org/html/2606.17234#A5.F7) \(in the appendix\) show, for each method and model, the distribution of the POS tags in the test set that the model labels as *incorrect*\. Additionally, we report the Hellinger distance of these distributions from that of the real errors \(for two discrete probability distributions, the Hellinger distance is defined as the Euclidean distance between their element\-wise square roots\)\. For both models, Token\_Numeric, Token\_Likert, and List are consistently the closest in distribution to the real error distribution\. For both models, Word\_Numeric and Word\_Likert report an unusually high proportion of punctuation errors, and a lower\-than\-expected proportion for nouns and proper nouns\.
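The definition of the Hellinger distance given above translates directly into a few lines of code\. The sketch below follows that unscaled definition; many references additionally multiply by 1/√2 so that the value lies in \[0, 1\], and which convention the reported numbers use is not stated here\. The helper names and the toy tag lists are hypothetical\.

```python
import math
from collections import Counter


def pos_distribution(tags, tagset):
    """Relative frequency of each tag in `tagset` among `tags`."""
    counts = Counter(tags)
    total = sum(counts[t] for t in tagset)
    if total == 0:
        raise ValueError("no tags from tagset found")
    return [counts[t] / total for t in tagset]


def hellinger(p, q):
    """Euclidean distance between the element-wise square roots of p and q.

    Unscaled variant; multiply by 1 / sqrt(2) for the [0, 1]-bounded form.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


# Toy example: POS tags of tokens flagged by a method vs. real error tokens.
tagset = ["NOUN", "VERB", "PUNCT"]
flagged = pos_distribution(["NOUN", "PUNCT", "PUNCT", "VERB"], tagset)
real = pos_distribution(["NOUN", "NOUN", "VERB", "PUNCT"], tagset)
distance = hellinger(flagged, real)
```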

Alternatively, each word belongs to either an open class or a closed class\. Open classes typically allow new words to be added, while closed classes rarely gain new members\. The open classes typically contain words with richer semantic content \(like nouns, verbs, and adjectives\)\. The closed classes are usually functional words serving grammatical purposes \(like adpositions or articles\)\. \(Note that this is not universal\. The distinction of open and closed classes is often compared to that of content and function words, but that is also debated\.\)
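As an illustration of how such a split can be applied at the token level, the sketch below propagates each word's POS tag to its constituent tokens and maps the tag to an open or closed class\. The open set matches the six “open” categories analyzed in Appendix D; the closed set and the “other” bucket follow the Universal Dependencies grouping of UPOS tags and are an assumption about the exact mapping used in the paper\. The input format and function name are hypothetical\.

```python
# Open classes as listed for the per-category analysis in Appendix D.
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
# Assumption: closed classes follow the Universal Dependencies UPOS grouping.
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}


def token_classes(words):
    """Assign each token the tag and class of the word it belongs to.

    `words` is a list of (upos_tag, n_tokens) pairs, one per word.
    Returns one (tag, class) pair per token, in order.
    """
    labels = []
    for tag, n_tokens in words:
        if tag in OPEN_CLASS:
            cls = "open"
        elif tag in CLOSED_CLASS:
            cls = "closed"
        else:
            cls = "other"  # e.g. PUNCT, SYM, X
        labels.extend([(tag, cls)] * n_tokens)
    return labels
```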

Figure [4](https://arxiv.org/html/2606.17234#S5.F4) shows the average difference in performance between open and closed class tokens on the test set\. For both models and all methods, we see higher average F1 and precision for open class words\. From Section [4](https://arxiv.org/html/2606.17234#S4), we know that verbalized methods generally perform better with Llama3\-70B\. Here, they show higher recall for open class words than for closed class words, whereas internal methods do not differentiate in recall\. For Aya23, verbalized methods do not outperform internal methods, as shown in Section [4](https://arxiv.org/html/2606.17234#S4)\. They also do not show the same recall pattern observed for Llama3\-70B\. Appendix [D](https://arxiv.org/html/2606.17234#A4) presents the results for the performance of all methods on each “open” syntactic category, along with some additional analysis\.

## 6 Conclusion and Future Work

For MT error detection, traditional unsupervised techniques that utilize the model’s internal representations, such as predicted probabilities, can be flawed because they reflect certainty among competing hypotheses rather than correctness\. We investigate verbalized confidence as an alternative unsupervised approach\. For both binary error detection and calibration, we show that verbalized confidence performs comparably to internal methods\. The List method and the word\-based methods also have the advantage of being more easily accessible without requiring model transparency\. We note that the performance of all methods, whether verbalized or internal, remains relatively low\. This is partly due to the difficulty of the task, as translation quality is typically very high for high\-resource languages\. As for the alignment between internal uncertainty and verbalized confidence, we find little to no correlation between them for either model, although Llama3\-70B shows slightly higher correlations than Aya23\. Notably, Llama3\-70B also derives greater benefits from verbalized measures in terms of alignment with ground truth\. Future work is needed to identify the factors that affect these alignments\.

Poorer MT performance in low\-resource languages creates an even greater need for quality control\. Future work should examine the extent to which our findings generalize to lower\-resource scenarios\. Although using Translation Edit Rate to generate pseudo\-labels may seem like an expedient solution in the absence of human annotations, such labels are not reliably consistent with human\-annotated labels\. Therefore, we emphasize the necessity of human annotation for these languages over the use of automated proxies\.

## Limitations

The main limitation of our work is the scarce availability of annotated translations for LLM outputs\. Such annotations are rare even for traditional, non\-LLM MT systems\. Therefore, we were unable to extend our study to a wider range of morphologically diverse languages, despite the potential benefits for low\-resource settings\.

Although some of the annotations in the WMT24 dataset are for closed\-source LLMs, we exclude these models from our experiments due to cost constraints\.

## Ethical Considerations

These unsupervised automatic error detection methods generally show limited effectiveness and are not a replacement for human annotation\.

We adhere to the licensing agreements of Llama3\-70B \([https://www\.llama\.com/llama3/license/](https://www.llama.com/llama3/license/)\) and Aya23 \(CC\-BY\-NC 4\.0\)\. The dataset we use in our work from Kocmi et al\. \([2024a](https://arxiv.org/html/2606.17234#bib.bib11)\) is already fully pseudonymized \(using random first names and surnames\) and the required steps were taken to ensure safety and anonymity\.

Additionally, our work relied on significant computational resources, with unavoidable environmental impact\.

## References

- Aryabumi et al\. \(2024\)Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, and 1 others\. 2024\.Aya 23: Open weight releases to further multilingual progress\.*arXiv preprint arXiv:2405\.15032*\.
- Berg\-Kirkpatrick et al\. \(2012\)Taylor Berg\-Kirkpatrick, David Burkett, and Dan Klein\. 2012\.[An empirical investigation of statistical significance in NLP](https://aclanthology.org/D12-1091/)\.In*Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning*, pages 995–1005, Jeju Island, Korea\. Association for Computational Linguistics\.
- Dinh and Niehues \(2025\)Tu Anh Dinh and Jan Niehues\. 2025\.[Are generative models underconfident? better quality estimation with boosted model probability](https://doi.org/10.18653/v1/2025.emnlp-main.166)\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 3364–3382, Suzhou, China\. Association for Computational Linguistics\.
- Fomicheva et al\. \(2022\)Marina Fomicheva, Shuo Sun, Erick Fonseca, Chrysoula Zerva, Frédéric Blain, Vishrav Chaudhary, Francisco Guzmán, Nina Lopatina, Lucia Specia, and André F\. T\. Martins\. 2022\.[MLQE\-PE: A multilingual quality estimation and post\-editing dataset](https://aclanthology.org/2022.lrec-1.530/)\.In*Proceedings of the Thirteenth Language Resources and Evaluation Conference*, pages 4963–4974, Marseille, France\. European Language Resources Association\.
- Gou et al\. \(2023\)Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen\. 2023\.Critic: Large language models can self\-correct with tool\-interactive critiquing\.*arXiv preprint arXiv:2305\.11738*\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others\. 2024\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*\.
- Han et al\. \(2022\)Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, and Furu Wei\. 2022\.Prototypical calibration for few\-shot learning of language models\.*arXiv preprint arXiv:2205\.10183*\.
- Holtzman et al\. \(2021\)Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer\. 2021\.[Surface form competition: Why the highest probability answer isn’t always right](https://doi.org/10.18653/v1/2021.emnlp-main.564)\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 7038–7051, Online and Punta Cana, Dominican Republic\. Association for Computational Linguistics\.
- Kadavath et al\. \(2022\)Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield\-Dodds, Nova DasSarma, Eli Tran\-Johnson, and 1 others\. 2022\.Language models \(mostly\) know what they know\.*arXiv preprint arXiv:2207\.05221*\.
- Kocmi et al\. \(2025\)Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others\. 2025\.[Findings of the WMT25 general machine translation shared task: Time to stop evaluating on easy test sets](https://doi.org/10.18653/v1/2025.wmt-1.22)\.In*Proceedings of the Tenth Conference on Machine Translation*, pages 355–413, Suzhou, China\. Association for Computational Linguistics\.
- Kocmi et al\. \(2024a\)Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, and 3 others\. 2024a\.[Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet](https://doi.org/10.18653/v1/2024.wmt-1.1)\.In*Proceedings of the Ninth Conference on Machine Translation*, pages 1–46, Miami, Florida, USA\. Association for Computational Linguistics\.
- Kocmi et al\. \(2024b\)Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, and Mariya Shmatova\. 2024b\.[Error span annotation: A balanced approach for human evaluation of machine translation](https://doi.org/10.18653/v1/2024.wmt-1.131)\.In*Proceedings of the Ninth Conference on Machine Translation*, pages 1440–1453, Miami, Florida, USA\. Association for Computational Linguistics\.
- Kumar et al\. \(2024\)Abhishek Kumar, Robert Morabito, Sanzhar Umbet, Jad Kabbara, and Ali Emami\. 2024\.[Confidence under the hood: An investigation into the confidence\-probability alignment in large language models](https://doi.org/10.18653/v1/2024.acl-long.20)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 315–334, Bangkok, Thailand\. Association for Computational Linguistics\.
- Kumar and Sarawagi \(2019\)Aviral Kumar and Sunita Sarawagi\. 2019\.Calibration of encoder decoder models for neural machine translation\.*arXiv preprint arXiv:1903\.00802*\.
- Lin et al\. \(2022\)Stephanie Lin, Jacob Hilton, and Owain Evans\. 2022\.Teaching models to express their uncertainty in words\.*arXiv preprint arXiv:2205\.14334*\.
- Liu et al\. \(2025\)Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei\. 2025\.Uncertainty quantification and confidence calibration in large language models: A survey\.In*Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2*, pages 6107–6117\.
- Lovering et al\. \(2024\)Charles Lovering, Michael Krumdick, Viet Dac Lai, Seth Ebner, Nilesh Kumar, Varshini Reddy, Rik Koncel\-Kedziorski, and Chris Tanner\. 2024\.Language model probabilities are not calibrated in numeric contexts\.*arXiv preprint arXiv:2410\.16007*\.
- Lovering et al\. \(2025\)Charles Lovering, Michael Krumdick, Viet Dac Lai, Varshini Reddy, Seth Ebner, Nilesh Kumar, Rik Koncel\-Kedziorski, and Chris Tanner\. 2025\.[Language model probabilities are not calibrated in numeric contexts](https://doi.org/10.18653/v1/2025.acl-long.1417)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 29218–29257, Vienna, Austria\. Association for Computational Linguistics\.
- Lu et al\. \(2022\)Yu Lu, Jiali Zeng, Jiajun Zhang, Shuangzhi Wu, and Mu Li\. 2022\.[Learning confidence for transformer\-based neural machine translation](https://doi.org/10.18653/v1/2022.acl-long.167)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 2353–2364, Dublin, Ireland\. Association for Computational Linguistics\.
- Manakul et al\. \(2023\)Potsawee Manakul, Adian Liusie, and Mark Gales\. 2023\.[SelfCheckGPT: Zero\-resource black\-box hallucination detection for generative large language models](https://doi.org/10.18653/v1/2023.emnlp-main.557)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 9004–9017, Singapore\. Association for Computational Linguistics\.
- Mielke et al\. \(2022\)Sabrina J\. Mielke, Arthur Szlam, Emily Dinan, and Y\-Lan Boureau\. 2022\.[Reducing conversational agents’ overconfidence through linguistic calibration](https://doi.org/10.1162/tacl_a_00494)\.*Transactions of the Association for Computational Linguistics*, 10:857–872\.
- Naeini et al\. \(2015\)Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht\. 2015\.Obtaining well calibrated probabilities using bayesian binning\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 29\.
- Ni et al\. \(2024\)Shiyu Ni, Keping Bi, Lulu Yu, and Jiafeng Guo\. 2024\.Are large language models more honest in their probabilistic or verbalized confidence?In*China Conference on Information Retrieval*, pages 124–135\. Springer\.
- Niculescu\-Mizil and Caruana \(2005\)Alexandru Niculescu\-Mizil and Rich Caruana\. 2005\.Predicting good probabilities with supervised learning\.In*Proceedings of the 22nd international conference on Machine learning*, pages 625–632\.
- Ott et al\. \(2018\)Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato\. 2018\.Analyzing uncertainty in neural machine translation\.In*International Conference on Machine Learning*, pages 3956–3965\. PMLR\.
- Qi et al\. \(2020\)Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D\. Manning\. 2020\.[Stanza: A python natural language processing toolkit for many human languages](https://doi.org/10.18653/v1/2020.acl-demos.14)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations*, pages 101–108, Online\. Association for Computational Linguistics\.
- Qi et al\. \(2021\)Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang\. 2021\.Stochastic optimization of areas under precision\-recall curves with provable convergence\.*Advances in neural information processing systems*, 34:1752–1765\.
- Sarti et al\. \(2022\)Gabriele Sarti, Arianna Bisazza, Ana Guerberof\-Arenas, and Antonio Toral\. 2022\.[DivEMT: Neural machine translation post\-editing effort across typologically diverse languages](https://doi.org/10.18653/v1/2022.emnlp-main.532)\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 7795–7816, Abu Dhabi, United Arab Emirates\. Association for Computational Linguistics\.
- Sarti et al\. \(2025a\)Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof\-Arenas, Malvina Nissim, and Arianna Bisazza\. 2025a\.Qe4pe: Word\-level quality estimation for human post\-editing\.*Transactions of the Association for Computational Linguistics*, 13:1410–1435\.
- Sarti et al\. \(2025b\)Gabriele Sarti, Vilém Zouhar, Malvina Nissim, and Arianna Bisazza\. 2025b\.[Unsupervised word\-level quality estimation for machine translation through the lens of annotators \(dis\)agreement](https://doi.org/10.18653/v1/2025.emnlp-main.924)\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 18320–18337, Suzhou, China\. Association for Computational Linguistics\.
- Snover et al\. \(2006\)Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul\. 2006\.[A study of translation edit rate with targeted human annotation](https://aclanthology.org/2006.amta-papers.25/)\.In*Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers*, pages 223–231, Cambridge, Massachusetts, USA\. Association for Machine Translation in the Americas\.
- Tian et al\. \(2023\)Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning\. 2023\.[Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback](https://doi.org/10.18653/v1/2023.emnlp-main.330)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 5433–5442, Singapore\. Association for Computational Linguistics\.
- Tsai et al\. \(2024\)Yao\-Hung Hubert Tsai, Walter Talbott, and Jian Zhang\. 2024\.Efficient non\-parametric uncertainty quantification for black\-box large language models and decision planning\.*arXiv preprint arXiv:2402\.00251*\.
- Ulmer et al\. \(2024\)Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Oh\. 2024\.[Calibrating large language models using their generations only](https://doi.org/10.18653/v1/2024.acl-long.824)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 15440–15459, Bangkok, Thailand\. Association for Computational Linguistics\.
- Wang et al\. \(2024\)Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, and Patrick Ernst\. 2024\.Calibrating verbalized probabilities for large language models\.*arXiv preprint arXiv:2410\.06707*\.
- Wang et al\. \(2020\)Shuo Wang, Zhaopeng Tu, Shuming Shi, and Yang Liu\. 2020\.[On the inference calibration of neural machine translation](https://doi.org/10.18653/v1/2020.acl-main.278)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 3070–3079, Online\. Association for Computational Linguistics\.
- Wang and Holmes \(2024\)Ziyu Wang and Chris Holmes\. 2024\.On subjective uncertainty quantification and calibration in natural language generation\.*arXiv preprint arXiv:2406\.05213*\.
- Wiegreffe et al\. \(2023\)Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, and Ashish Sabharwal\. 2023\.[Increasing probability mass on answer choices does not always improve accuracy](https://doi.org/10.18653/v1/2023.emnlp-main.522)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 8392–8417, Singapore\. Association for Computational Linguistics\.
- Wu et al\. \(2025\)Di Wu, Yibin Lei, and Christof Monz\. 2025\.Calibrating translation decoding with quality estimation on llms\.*arXiv preprint arXiv:2504\.19044*\.
- Xiong et al\. \(2023\)Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi\. 2023\.Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\.*arXiv preprint arXiv:2306\.13063*\.
- Yang et al\. \(2024\)Daniel Yang, Yao\-Hung Hubert Tsai, and Makoto Yamada\. 2024\.On verbalized confidence scores for llms\.*arXiv preprint arXiv:2412\.14737*\.
- Yang et al\. \(2023\)Zhen Yang, Fandong Meng, Yuanmeng Yan, and Jie Zhou\. 2023\.[Rethinking the word\-level quality estimation for machine translation from human judgement](https://doi.org/10.18653/v1/2023.findings-acl.126)\.In*Findings of the Association for Computational Linguistics: ACL 2023*, pages 2012–2025, Toronto, Canada\. Association for Computational Linguistics\.
- Zhang et al\. \(2024\)Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, and Xipeng Qiu\. 2024\.[Calibrating the confidence of large language models by eliciting fidelity](https://doi.org/10.18653/v1/2024.emnlp-main.173)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 2959–2979, Miami, Florida, USA\. Association for Computational Linguistics\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others\. 2023\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.*Advances in neural information processing systems*, 36:46595–46623\.

## Appendix A Prompts

Table [7](https://arxiv.org/html/2606.17234#A5.T7) shows all the prompts for the verbalized methods\.

## Appendix B Statistical Significance Results

We report the full statistical significance results for both binary error detection and calibration\. Following Berg\-Kirkpatrick et al\. \([2012](https://arxiv.org/html/2606.17234#bib.bib2)\), we use the bootstrap method and collect 10000 values of the test statistic for each comparison\. To control the family\-wise error rate \(FWER\), we use the Holm\-Bonferroni method\. We choose α = 0\.05\. We report the results for all 5 language directions for binary error detection in Tables [8](https://arxiv.org/html/2606.17234#A5.T8) and [9](https://arxiv.org/html/2606.17234#A5.T9)\. Tables [10](https://arxiv.org/html/2606.17234#A5.T10) and [11](https://arxiv.org/html/2606.17234#A5.T11) show the results for calibration\. Finally, Tables [12](https://arxiv.org/html/2606.17234#A5.T12) and [13](https://arxiv.org/html/2606.17234#A5.T13) present the results for the correlation of internal and verbalized methods\.
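For readers unfamiliar with the Holm\-Bonferroni correction, the step\-down procedure can be sketched as below\. The sketch assumes that one p\-value per pairwise comparison has already been obtained \(for instance from the bootstrap\); how the p\-values are computed is not shown, and the function name is hypothetical\.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-down procedure: sort p-values ascending and compare the k-th
    smallest (k = 1..m) against alpha / (m - k + 1); stop at the first
    comparison that fails and retain all remaining hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank is 0-based, so k = rank + 1
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Four hypothetical comparisons at alpha = 0.05.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

In this toy input, the two smallest p\-values pass their thresholds \(0\.0125 and about 0\.0167\), while 0\.03 fails against 0\.025, so the procedure stops and the last two comparisons are not significant\.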

## Appendix C Verbalized ⇔ Internal Extra Analysis

Although we do not see evidence of higher correlation between these two components, this does not mean they are any less correlated than \(verbalized ↔ ground\-truth\) and \(internal ↔ ground\-truth\)\. To compare them more directly, we should at least use the same metric\. For this, we binarize the internal methods to fit the results of the list method\. As with the alignments with the ground truth, we use the dev sets to find binarization thresholds\. Table [18](https://arxiv.org/html/2606.17234#A5.T18) shows the results for F1 on the negative labels\. The much higher performance here arises because the list method \(and some other verbalized methods\) has a high recall and lower precision \(see Figure [4](https://arxiv.org/html/2606.17234#S5.F4)\)\. Therefore, for both probability and entropy, the dev\-set threshold that maximizes the F1 score for the target tokens \(i\.e\., the wrong tokens that need to be changed and that we want to detect\) is one that marks almost everything as wrong\. For that reason, directly comparing \(internal ↔ verbalized\) with \(internal ↔ ground\-truth\) and \(verbalized ↔ ground\-truth\) is not very informative\. We also report AUROC scores as a threshold\-independent metric\. Table [16](https://arxiv.org/html/2606.17234#A5.T16) includes the AUROC scores for \(internal ↔ ground\-truth\) and \(verbalized ↔ ground\-truth\), and Table [19](https://arxiv.org/html/2606.17234#A5.T19) reports the same for \(internal ↔ verbalized\)\. We see that the alignments are generally on the same level, but \(internal ↔ verbalized\) is on average lower than most of the scores for \(internal ↔ ground\-truth\) and \(verbalized ↔ ground\-truth\)\. Here is an example of the misalignment between verbalized and internal signals:

- Source sentence: Plougheth mine feeldes\.
- Model output: Мои поля орошаются \(the mistake is that, instead of plowing, the translation points to irrigation\)
- Gold annotations: \[1, 1, 1, 0, 0, 0, 0\] \(орошаются is wrong\)
- Model probabilities: \[0\.086, 0\.876, 0\.797, 0\.174, 0\.93, 0\.914, 0\.894\]
- Verbalized confidence: \["a", "a", "a", "f", "f", "f", "f"\], which we convert to: \[1, 1, 1, 0, 0, 0, 0\]

Other than the first token, which has a very low probability, the model probabilities drop only on the first token of the word ‘орошаются’; the remaining tokens of that word receive high probabilities\. The Spearman correlation coefficient between the verbalized confidences and the internal signals here is about \-0\.577\.
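The coefficient quoted for this example can be reproduced from the two lists above\. The sketch below is a from\-scratch illustration \(average ranks for ties, followed by the Pearson correlation of the ranks\) rather than the authors' code, and the function names are hypothetical\.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


probabilities = [0.086, 0.876, 0.797, 0.174, 0.93, 0.914, 0.894]
verbalized = [1, 1, 1, 0, 0, 0, 0]
rho = spearman(verbalized, probabilities)
print(round(rho, 3))  # -0.577
```

The negative sign reflects that the tokens the model verbally marks as wrong mostly carry the highest internal probabilities in this sentence\.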

## Appendix D POS\-specific Results

Figures [8](https://arxiv.org/html/2606.17234#A5.F8) through [13](https://arxiv.org/html/2606.17234#A5.F13) show the performance of all methods for each “open” syntactic role as labeled by Stanza: adjectives, adverbs, interjections, nouns, proper nouns, and verbs\. We can see some interesting patterns in these charts\. For adjectives, Word\_Likert and Word\_Numeric have the best performances on average\. For interjections, Word\_Numeric is *by far* the best performing method for Aya23, while only falling short of the List method for Llama3\-70B\. For nouns, Word\_Numeric and Word\_Likert lead for Llama3\-70B\. For Aya23, Entropy replaces Word\_Likert as the second best method\. Verbalized methods also dominate the performances for proper nouns for Llama3\-70B\. For Aya23, Word\_Numeric still outperforms other methods\. For verbs, Entropy outperforms other methods for Aya23\. For Llama3\-70B it lags behind List, Word\_Numeric, and Word\_Likert\.

## Appendix E Other Results

### E\.1 Dev Set\-Independent Binarization

Table [6](https://arxiv.org/html/2606.17234#A5.T6) shows the results for binary error detection if we artificially binarize the scores for the listed methods optimally on the test set\. This serves as an upper bound on the performance of these methods, and is consistent with the analysis of Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\)\. Note that the Random and List methods do not need binarization and report the same performances as in Table [2](https://arxiv.org/html/2606.17234#S3.T2)\. Overall, the trends are consistent\. Word\_Numeric continues to perform well on average, with Entropy showing similar performance\. Entropy performs best for Aya23, consistent with the findings of Sarti et al\. \([2025b](https://arxiv.org/html/2606.17234#bib.bib30)\)\. In contrast, three verbalized methods outperform Entropy in the Llama3\-70B experiments\.
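A threshold search of this kind can be sketched as follows\. The sketch assumes that label 0 denotes a wrong token, that a token is predicted wrong when its confidence is at or below the threshold, and that candidate thresholds are the observed confidence values; these choices, and the function names, are illustrative assumptions rather than the authors' exact procedure\. Running the same search on the dev set instead of the test set gives the dev\-based binarization\.

```python
def f1_negative(gold, pred):
    """F1 score of the negative class (label 0 = wrong token)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    fn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(gold, confidences):
    """Return (threshold, f1) maximizing negative-class F1 on the given data.

    Ties are resolved in favor of the smallest threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(confidences)):
        pred = [0 if c <= t else 1 for c in confidences]
        score = f1_negative(gold, pred)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy data: two wrong tokens with low confidence, two correct with high.
threshold, score = best_threshold([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```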

Table 6: Results for the binary error detection task on the test set\. All the numbers report the F1 score of negative labels \(higher is better\), where continuous methods are *optimally binarized*\. Best performing results are bolded in each column\.
### E\.2 Bundling

For the methods that do not operate on the word level, we want to measure how the methods “bundle” together the tokens that they predict as wrong\. For this, we consider *all* the words that have *any* wrong tokens, and calculate what ratio of all the tokens of these words are predicted as wrong\. For example, suppose we have only one word with 4 tokens and the model predictions are \[0, 0, 0, 1\], where 0 marks a token predicted as wrong\. Then the bundling ratio is \(number of wrong tokens / sum of the lengths of all the words with any wrong tokens\) = \(3 / 4\) = 0\.75\. Table [14](https://arxiv.org/html/2606.17234#A5.T14) shows the bundling ratio of different models as aggregated across different languages on the test set\. We see that for both models, verbalized methods have a higher concentration of wrong tokens in the same words than internal methods\. This is closer to the way humans annotated the dataset, since they specified errors on the word level\.
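The bundling ratio defined above can be computed as in the following sketch\. The input format \(one list of token predictions per word, with 0 marking a token predicted as wrong\) and the function name are assumptions made for illustration\.

```python
def bundling_ratio(word_predictions):
    """Share of tokens predicted wrong among words with at least one wrong token.

    `word_predictions` holds one list per word; each entry is the prediction
    for one token of that word (0 = predicted wrong, 1 = predicted correct).
    Returns NaN when no token is predicted wrong, since the ratio is undefined.
    """
    wrong_tokens = 0
    tokens_in_flagged_words = 0
    for tokens in word_predictions:
        n_wrong = sum(1 for t in tokens if t == 0)
        if n_wrong > 0:
            wrong_tokens += n_wrong
            tokens_in_flagged_words += len(tokens)
    if tokens_in_flagged_words == 0:
        return float("nan")
    return wrong_tokens / tokens_in_flagged_words


# The single-word example from the text: three of four tokens flagged.
ratio = bundling_ratio([[0, 0, 0, 1]])
```

Words without any flagged token do not enter the ratio, so a method that flags whole words at once approaches a ratio of 1\.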

### E\.3 Prompt Costs

We approximate the cost of each verbalized method by reporting the average number of tokens for each prompt\. Table [15](https://arxiv.org/html/2606.17234#A5.T15) shows the results\.

### E\.4 Computing Infrastructure

We used a mixture of NVIDIA 80GB H100 SXM GPUs and NVIDIA A100 Tensor Core GPUs in our experiments\. Aya23, at 35B parameters, could be run on a single H100, whereas Llama3\-70B needs more GPUs\.

Table 7: Prompt templates for each verbalized method\.

Table 8: Statistical significance results for binary error detection as measured in F1 on the wrong \(negative\) tokens\. These are the results for English→Czech, Hindi, and Japanese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 9: Statistical significance results for binary error detection as measured in F1 on the wrong \(negative\) tokens\. These are the results for English→Russian and Chinese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 10: Statistical significance results for calibration as measured in ECE\. These are the results for English→Czech, Hindi, and Japanese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 11: Statistical significance results for calibration as measured in ECE\. These are the results for English→Russian and Chinese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 12: Statistical significance results for correlation of internal and verbalized methods as measured in Spearman’s coefficient\. These are the results for English→Czech, Hindi, and Japanese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 13: Statistical significance results for correlation of internal and verbalized methods as measured in Spearman’s coefficient\. These are the results for English→Russian and Chinese\. We employ the bootstrap method\. For each method, we report the methods that are *significantly better* in the column \(language, model pair\)\.

Table 14: The “bundling” ratio of different models, as described in [E\.2](https://arxiv.org/html/2606.17234#A5.SS2)\. We aggregate the tokens of the test sets of all different languages\.

Table 15: Average number of tokens for the prompt of each verbalized method\.

Table 16: AUROC for the continuous methods in their alignment with ground truth on the test set \(higher is better\)\. Best performing results are bolded in each column\.

Table 17: AUPRC \(or Average Precision\) for the methods in their alignment with ground truth on the test set \(higher is better\)\. Best performing results are bolded in each column\.

Table 18: F1 results for the alignment of list method with token probabilities and with token entropies \(higher is better\)\. We binarize according to a threshold obtained via the corresponding development set\.

Table 19: AUROC results for the alignment of list method with token probabilities and with token entropies \(higher is better\)\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/curve_plots/russian-ayatest_results_curve_plot.png)

\(a\) Aya23

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/curve_plots/russian-llama3test_results_curve_plot.png)

\(b\) Llama3\-70B

Figure 5: Precision and recall curves of different methods for English→Russian translations\. We observe that even numeric variants of verbalized methods provide limited granularity, effectively partitioning predictions into a few confidence levels \(see Table [5](https://arxiv.org/html/2606.17234#S5.T5)\)\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/pie_charts/aya-test-threshold-average-results-pos-piechart.png)

Figure 6: Distributions of different POS tags in the tokens considered as *incorrect* for each method, averaged across all languages on the test set for Aya23\. The number in parentheses is the Hellinger distance between the distribution of the errors as identified by the method and the real error distribution\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/pie_charts/llama3-test-threshold-average-results-pos-piechart.png)

Figure 7: Distributions of different POS tags in the tokens considered as wrong for each method, averaged across languages\. The number in parentheses is the Hellinger distance between the distribution of the errors for the method and the real error distribution\. These are for the test set on Llama3\-70B\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/ADJ-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/ADJ-llama3-test-threshold-average-results.png)

Figure 8: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the ADJ \(adjective\) label are considered\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/ADV-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/ADV-llama3-test-threshold-average-results.png)

Figure 9: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the ADV \(adverb\) label are considered\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/INTJ-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/INTJ-llama3-test-threshold-average-results.png)

Figure 10: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the INTJ \(interjection\) label are considered\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/NOUN-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/NOUN-llama3-test-threshold-average-results.png)

Figure 11: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the NOUN \(noun\) label are considered\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/PROPN-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/PROPN-llama3-test-threshold-average-results.png)

Figure 12: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the PROPN \(proper noun\) label are considered\.

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/VERB-aya-test-threshold-average-results.png)

![Refer to caption](https://arxiv.org/html/2606.17234v1/images/each_pos_charts/VERB-llama3-test-threshold-average-results.png)

Figure 13: F1, precision, and recall scores for different methods, averaged over all languages on the test set, for Aya23 \(top\) and Llama3\-70B \(bottom\)\. Only tokens with the VERB \(verb\) label are considered\.
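
The Hellinger distances reported in the captions of Figures 6 and 7 compare two discrete distributions over POS tags: the tags of tokens a method flags as incorrect, and the tags of tokens that are actually incorrect. The following is a minimal sketch of that comparison, assuming the standard definition normalized to the range 0 to 1 (the captions do not state which normalization is used). The function names, the tag inventory, and the toy tag lists are illustrative assumptions and are not taken from the paper's code or data.

```python
import math
from collections import Counter


def pos_distribution(tags, tagset):
    """Relative frequency of each tag in `tagset` among `tags`.

    Tags outside `tagset` are ignored. The result follows the order of `tagset`.
    """
    counts = Counter(tags)
    total = sum(counts[t] for t in tagset)
    if total == 0:
        raise ValueError("no tokens with a tag from the given tagset")
    return [counts[t] / total for t in tagset]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumed definition: H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2),
    which is 0 for identical distributions and 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical toy data: POS tags of tokens flagged by a method vs. true errors.
TAGSET = ["ADJ", "NOUN", "VERB"]
flagged_tags = ["NOUN", "NOUN", "VERB", "ADJ"]
true_error_tags = ["ADJ", "ADJ", "NOUN", "VERB"]

p = pos_distribution(flagged_tags, TAGSET)
q = pos_distribution(true_error_tags, TAGSET)
distance = hellinger(p, q)
print(f"Hellinger distance: {distance:.4f}")
```

Under this definition, a smaller value means that the POS profile of the tokens flagged by a method is closer to the POS profile of the real errors.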