On the Stability of Prompt Ranking in Large Language Model Evaluation

arXiv cs.CL Papers

Summary

This paper systematically studies the stability of prompt rankings in LLM evaluation under common sources of variability, finding that top-performing prompts often change. It proposes a stability-aware selection strategy based on a lower confidence bound to improve robustness.

arXiv:2606.24381v1 Announce Type: new Abstract: Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes that prompt rankings are stable under minor variations in evaluation conditions. In this paper, we systematically study prompt ranking stability under common sources of variability, including random seeds and limited evaluation subsets. Across three open-weight LLMs and two benchmark tasks, we find that while overall rank correlations are often moderate to high, the identity of the top-performing prompt frequently changes, leading to unreliable selection decisions. To address this issue, we propose a simple stability-aware selection strategy based on a lower confidence bound, which accounts for both performance and variance. Our results show that this approach improves robustness in unstable settings while remaining competitive in more stable regimes. These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.

Cached at: 06/24/26, 07:47 AM

# On the Stability of Prompt Ranking in Large Language Model Evaluation
Source: [https://arxiv.org/html/2606.24381](https://arxiv.org/html/2606.24381)
Affiliations: (1) University of Amsterdam, Amsterdam, Netherlands, s.du@uva.nl; (2) Northeastern University, Boston, MA, USA; (3) University of California San Diego, La Jolla, CA, USA; (4) Duke University, Durham, NC, USA

###### Abstract

Prompt\-based interaction has become a dominant paradigm for using large language models \(LLMs\), where multiple candidate prompts are evaluated and the top\-ranked one is selected for downstream use\. This workflow implicitly assumes that prompt rankings are stable under minor variations in evaluation conditions\. In this paper, we systematically study prompt ranking stability under common sources of variability, including random seeds and limited evaluation subsets\. Across three open\-weight LLMs and two benchmark tasks, we find that while overall rank correlations are often moderate to high, the identity of the top\-performing prompt frequently changes, leading to unreliable selection decisions\. To address this issue, we propose a simple stability\-aware selection strategy based on a lower confidence bound, which accounts for both performance and variance\. Our results show that this approach improves robustness in unstable settings while remaining competitive in more stable regimes\. These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking\.

## 1 Introduction

As AI models continue to advance, large language models \(LLMs\) are increasingly accessed through prompt\-based interfaces, where task behavior is specified via natural language instructions rather than task\-specific training\[[11](https://arxiv.org/html/2606.24381#bib.bib4),[9](https://arxiv.org/html/2606.24381#bib.bib5),[2](https://arxiv.org/html/2606.24381#bib.bib6)\]\. As a result, prompt design and prompt selection have become central to both research and deployment of LLM\-based systems\. In many practical workflows, multiple candidate prompts are evaluated on a benchmark, ranked by performance, and a single “best” prompt is selected for downstream use\.

A common but often implicit assumption underlying these practices is that prompt performance rankings are stable\. That is, a prompt that outperforms others under a given evaluation protocol is expected to remain superior under minor variations of evaluation conditions\. This assumption motivates widespread practices such as selecting prompts based on average accuracy or reporting a single top\-performing prompt\.

In practice, however, prompt evaluation is subject to multiple sources of variability. Evaluation protocols often rely on limited evaluation budgets, subsampling of benchmark datasets, or different random seeds. While prior work has studied output variability under stochastic decoding or sensitivity to few-shot examples, prompt evaluation itself is typically treated as deterministic. In particular, the stability of prompt *rankings*, rather than absolute scores, has received little systematic attention.

Importantly, instability in prompt rankings has direct practical implications\. When multiple prompts achieve similar average performance, even small fluctuations in evaluation scores can lead to changes in their relative ordering\. Such ranking instability may cause prompt selection decisions to be overly sensitive to evaluation noise, resulting in brittle or non\-reproducible outcomes\.

In this work, we challenge the assumption of stable prompt rankings\. Rather than focusing on absolute performance variance, we study how the relative ordering of prompts changes under realistic evaluation variability\. Specifically, we ask the following questions: \(1\) How stable are prompt performance rankings across random seeds and evaluation subset sizes? \(2\) How does ranking instability affect common prompt selection strategies? \(3\) Can prompt selection be made more robust using simple stability\-aware criteria?

To answer these questions, we conduct a systematic empirical study in which a fixed set of prompts is repeatedly evaluated under controlled variations of evaluation conditions\. Our results reveal substantial ranking instability, particularly for small evaluation subsets, and show that selecting prompts based solely on mean performance can lead to unreliable decisions\. At the same time, we demonstrate that incorporating stability considerations via a simple lower confidence bound criterion can improve robustness under noisy evaluation conditions while remaining competitive in more stable settings\.

Our contributions are as follows:

\(1\) We provide the first systematic study of prompt ranking stability under realistic evaluation variability, including random seeds and limited evaluation budgets\.

\(2\) We show that high rank correlation does not necessarily imply stable prompt selection, revealing a gap between global ranking agreement and decision\-level consistency\.

\(3\) We propose a simple stability\-aware selection method based on lower confidence bounds, which improves robustness under noisy evaluation conditions without sacrificing performance in stable regimes\.

## 2 Related Work

### 2.1 Prompt Engineering and Prompt Sensitivity

Prompt engineering has emerged as a key technique for controlling the behavior of large language models without additional training\[[12](https://arxiv.org/html/2606.24381#bib.bib7),[5](https://arxiv.org/html/2606.24381#bib.bib8)\]\. Prior work has explored instruction design, reasoning cues, output constraints, and prompt ensembles, demonstrating that prompt formulation can significantly affect model performance\. Several studies have also reported sensitivity of LLM outputs to prompt phrasing, formatting, and example selection\.

### 2.2 Evaluation Variability and Benchmark Robustness

Evaluation variability has been studied in various contexts, including randomness in model initialization, stochastic decoding, and dataset subsampling\[[3](https://arxiv.org/html/2606.24381#bib.bib9),[4](https://arxiv.org/html/2606.24381#bib.bib10)\]\. Prior work has shown that benchmark scores can be sensitive to evaluation noise, particularly under limited data or stochastic settings\. Bootstrap methods and statistical significance testing have been proposed to quantify uncertainty in model evaluation\.

In contrast to these studies, our focus is not on estimating confidence intervals for absolute scores, but on understanding how evaluation variability affects the *relative ordering* of prompts.

### 2.3 Ranking Stability and Model Selection

While ranking stability has been studied in areas such as information retrieval and model selection, these works typically focus on model\-level comparisons\[[10](https://arxiv.org/html/2606.24381#bib.bib11)\]\. In contrast, we investigate ranking stability at the prompt level, where performance differences are often subtle and evaluation noise plays a larger role\. Several works have noted that rank\-based decisions may be sensitive to noise, even when aggregate performance metrics appear stable\[[8](https://arxiv.org/html/2606.24381#bib.bib12)\]\.

Our work brings this perspective to prompt evaluation for LLMs\. To our knowledge, this is the first work that systematically studies prompt ranking stability under controlled evaluation variability and connects ranking instability to prompt selection robustness\.

## 3 Method

In this section, we formalize prompt evaluation as a stochastic ranking problem, where evaluation variability induces randomness in prompt performance and ranking outcomes\. Under this formulation, prompt ranking is no longer deterministic but depends on the underlying sampling distribution, and prompt selection corresponds to identifying robust optima under uncertainty\.

Figure 1: Overview of the proposed prompt evaluation and selection framework. The pipeline simulates evaluation variability through multi-seed subsampling before applying the stability-aware selection strategy.

### 3.1 Problem Formulation

Let $\mathcal{P}=\{p_1,p_2,\dots,p_M\}$ denote a fixed set of candidate prompts for a given task, and let $\mathcal{D}$ denote the evaluation dataset. Under an evaluation condition $c$ (e.g., a specific random seed and subset of examples), each prompt $p_i$ is assigned a performance score $s_i^{(c)}$, such as accuracy.

This induces a ranking over prompts:

$$\pi^{(c)}=\text{rank}\big(\{s_i^{(c)}\}_{i=1}^{M}\big),$$

where lower ranks in the induced permutation correspond to better-performing prompts.

We consider multiple evaluation conditions $\mathcal{C}=\{c_1,\dots,c_K\}$, obtained by varying random seeds and evaluation subsets. Our goal is to analyze how stable the rankings $\{\pi^{(c)}\}$ are across different conditions.
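As a minimal sketch of the ranking step defined above, the following Python function turns one condition's score vector into ranks, with rank 1 assigned to the best prompt. The function name `rank_prompts`, the example scores, and the tie-breaking rule (lower prompt index wins) are illustrative assumptions; the text does not specify how ties are handled.

```python
def rank_prompts(scores):
    """Return 1-based ranks for one evaluation condition.

    Rank 1 goes to the highest score. Ties are broken by prompt index,
    which is an assumption made for this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for position, prompt_idx in enumerate(order, start=1):
        ranks[prompt_idx] = position
    return ranks


# Hypothetical accuracies of four prompts under two conditions.
scores_c1 = [0.62, 0.70, 0.58, 0.66]
scores_c2 = [0.64, 0.60, 0.56, 0.68]

print(rank_prompts(scores_c1))  # [3, 1, 4, 2]
print(rank_prompts(scores_c2))  # [2, 3, 4, 1]
```

In this toy case the worst prompt (index 2) is ranked last under both conditions, while the top position moves from prompt 1 to prompt 3.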

### 3.2 Evaluation Variability

We model evaluation variability as arising from two sources:

Random seed variation: Different seeds induce different subsamples of the dataset.

Subset size variation: We evaluate on subsets of size $k\in\{50,100,200\}$ to simulate limited evaluation budgets.

For each condition $c$, all prompts are evaluated on the same subset to ensure fair comparison. This produces a score matrix of size $|\mathcal{C}|\times M$, where each row corresponds to a ranking over prompts.
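The construction of the score matrix can be sketched as follows, under the assumption that per-example correctness for every prompt has already been computed once on the full dataset. The names `build_score_matrix` and `correct`, the synthetic correctness table, and the choice of sampling without replacement are hypothetical and only illustrate the idea; they are not taken from the released code.

```python
import random


def build_score_matrix(correct, seeds, subset_size):
    """Build a |C| x M accuracy matrix from cached per-example results.

    correct[i][j] is 1 if prompt i answers example j correctly, else 0.
    Each seed draws one subset, which is shared by all prompts.
    """
    n_examples = len(correct[0])
    matrix = []
    for seed in seeds:
        rng = random.Random(seed)
        subset = rng.sample(range(n_examples), subset_size)
        row = [sum(prompt[j] for j in subset) / subset_size for prompt in correct]
        matrix.append(row)
    return matrix


# Synthetic correctness table: 3 prompts, 500 examples (illustrative only).
gen = random.Random(0)
assumed_accuracy = [0.60, 0.62, 0.55]
correct = [[1 if gen.random() < a else 0 for _ in range(500)]
           for a in assumed_accuracy]

scores = build_score_matrix(correct, seeds=[0, 1, 2, 3, 4], subset_size=50)
for row in scores:
    print([round(v, 2) for v in row])
```

Sharing one subset per seed keeps the comparison between prompts paired, so differences within a row reflect the prompts rather than different example draws.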

### 3.3 Ranking Stability Metrics

To quantify the similarity between rankings, we compute pairwise correlations across evaluation conditions\.

##### Rank Correlation\.

Given two rankings $\pi^{(c_1)}$ and $\pi^{(c_2)}$, we measure their agreement using Spearman's $\rho$, which captures correlation between rank positions, and Kendall's $\tau$, which measures pairwise ordering consistency. These metrics reflect global ranking consistency across evaluation conditions.
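For illustration, both correlations can be computed directly from two rank vectors. The sketch below assumes the rankings are tie-free permutations, in which case Spearman's $\rho$ has the closed form $1 - 6\sum_i d_i^2 / (n(n^2-1))$ and Kendall's $\tau$ is the normalized difference between concordant and discordant pairs. The function names and the example ranks are hypothetical; a library routine with tie handling would be preferable for real score data.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau (tau-a) for two tie-free rankings."""
    n = len(rank_a)
    total = 0
    for i, j in combinations(range(n), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        total += 1 if product > 0 else -1  # concordant vs. discordant pair
    return total / (n * (n - 1) / 2)


# Hypothetical ranks of four prompts under two conditions.
rank_c1 = [3, 1, 4, 2]
rank_c2 = [2, 3, 4, 1]

print(round(spearman_rho(rank_c1, rank_c2), 3))  # 0.4
print(round(kendall_tau(rank_c1, rank_c2), 3))   # 0.333
```

Averaging these values over all pairs of conditions gives a pairwise stability summary of the kind described in this section.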

##### Top-$k$ Consistency.

To evaluate decision-level stability, we further consider top-$k$ consistency metrics. Top-1 consistency measures the fraction of evaluation conditions that identify the same best-performing prompt, while top-$k$ consistency quantifies the average overlap between the sets of top-$k$ prompts across conditions. These metrics capture the reliability of prompt selection decisions beyond global ranking agreement.

For pairwise top-$k$ consistency, we compute the average overlap ratio between the top-$k$ prompt sets from two evaluation conditions:

$$\text{Top-}k\big(\pi^{(c_1)},\pi^{(c_2)}\big)=\frac{|T_k^{(c_1)}\cap T_k^{(c_2)}|}{k},$$

where $T_k^{(c)}$ denotes the set of top-$k$ prompts under condition $c$.
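The pairwise overlap ratio can be sketched in a few lines of Python. The helper names `top_k_set` and `top_k_overlap`, the example scores, and the index-based tie-breaking are assumptions made for illustration.

```python
def top_k_set(scores, k):
    """Indices of the k best prompts; ties broken by index (an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def top_k_overlap(scores_a, scores_b, k):
    """Overlap ratio |T_k(a) & T_k(b)| / k between two conditions."""
    return len(top_k_set(scores_a, k) & top_k_set(scores_b, k)) / k


# Hypothetical accuracies of four prompts under two conditions.
scores_c1 = [0.62, 0.70, 0.58, 0.66]
scores_c2 = [0.64, 0.60, 0.56, 0.68]

print(top_k_overlap(scores_c1, scores_c2, k=1))  # 0.0
print(top_k_overlap(scores_c1, scores_c2, k=2))  # 0.5
print(top_k_overlap(scores_c1, scores_c2, k=3))  # 1.0
```

The example shows how two conditions can agree fully on the top-3 set while disagreeing on the single best prompt.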

### 3.4 Prompt Selection Strategies

We consider two prompt selection strategies based on multiple evaluation runs\.

##### Mean\-based Selection\.

We compute the average score of each prompt across conditions:

$$\bar{s}_i=\frac{1}{K}\sum_{c\in\mathcal{C}}s_i^{(c)},$$

and select the prompt with the highest mean performance:

$$p^{*}_{\text{mean}}=\arg\max_{i}\bar{s}_i.$$

##### Stability\-aware Selection \(LCB\)\.

To account for variability, we define a lower confidence bound \(LCB\) score:

$$\text{LCB}_i=\bar{s}_i-z\cdot\frac{\sigma_i}{\sqrt{K}},$$

where $\sigma_i$ is the standard deviation of scores for prompt $p_i$, and $z$ controls the strength of the penalty.

We select:

$$p^{*}_{\text{LCB}}=\arg\max_{i}\text{LCB}_i.$$
This strategy favors prompts with both high mean performance and low variance\. We use this LCB score as a simple uncertainty\-aware heuristic rather than a strict statistical confidence interval, since the number of evaluation conditions is limited\.
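The two selection rules can be compared on a small synthetic score matrix. The sketch below assumes rows are evaluation conditions and columns are prompts, and it uses the sample standard deviation (`statistics.stdev`, with $K-1$ in the denominator); the text does not say whether the sample or population form is intended, so that choice, the function names, and the numbers are all illustrative assumptions.

```python
import math
import statistics


def column(score_matrix, i):
    """Scores of prompt i across all evaluation conditions."""
    return [row[i] for row in score_matrix]


def mean_select(score_matrix):
    """Index of the prompt with the highest mean score."""
    n_prompts = len(score_matrix[0])
    return max(range(n_prompts),
               key=lambda i: statistics.fmean(column(score_matrix, i)))


def lcb_select(score_matrix, z=1.0):
    """Index of the prompt maximizing mean - z * std / sqrt(K)."""
    K = len(score_matrix)
    n_prompts = len(score_matrix[0])

    def lcb(i):
        col = column(score_matrix, i)
        return statistics.fmean(col) - z * statistics.stdev(col) / math.sqrt(K)

    return max(range(n_prompts), key=lcb)


# Synthetic scores: prompt 0 has the highest mean but is volatile,
# prompt 1 is slightly lower on average but very stable.
score_matrix = [
    [0.70, 0.62, 0.55],
    [0.50, 0.61, 0.56],
    [0.72, 0.63, 0.54],
    [0.52, 0.62, 0.55],
    [0.70, 0.62, 0.55],
]

print(mean_select(score_matrix))        # 0
print(lcb_select(score_matrix, z=1.0))  # 1
```

With $z=0$ the LCB rule reduces to mean-based selection, so $z$ acts as a single knob trading peak average performance against stability.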

### 3.5 Selection Robustness Evaluation

We evaluate selection robustness using a leave\-one\-seed\-out \(LOSO\) protocol\.

For each held-out condition $c_{\text{test}}$, we:

1. Use the remaining conditions $\mathcal{C}\setminus\{c_{\text{test}}\}$ to select a prompt.
2. Evaluate the selected prompt on $c_{\text{test}}$.

We report the average and standard deviation of test performance across all held\-out conditions\.

This protocol measures how well a selection strategy generalizes to unseen evaluation settings\.
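The leave-one-seed-out loop described above can be sketched as follows. The function names, the synthetic score matrix, and the use of the sample standard deviation are assumptions for illustration; setting `z=0` recovers mean-based selection, so one function covers both strategies.

```python
import math
import statistics


def lcb_scores(rows, z):
    """LCB criterion per prompt, computed from the training conditions."""
    K = len(rows)
    out = []
    for i in range(len(rows[0])):
        col = [r[i] for r in rows]
        out.append(statistics.fmean(col) - z * statistics.stdev(col) / math.sqrt(K))
    return out


def loso(score_matrix, z):
    """Leave-one-seed-out: select on K-1 conditions, score on the held-out one."""
    held_out = []
    for t, test_row in enumerate(score_matrix):
        train = score_matrix[:t] + score_matrix[t + 1:]
        criterion = lcb_scores(train, z)
        chosen = max(range(len(criterion)), key=lambda i: criterion[i])
        held_out.append(test_row[chosen])
    return statistics.fmean(held_out), statistics.stdev(held_out)


# Synthetic scores (rows: conditions, columns: prompts), not results from the paper.
score_matrix = [
    [0.70, 0.62, 0.55],
    [0.50, 0.61, 0.56],
    [0.72, 0.63, 0.54],
    [0.52, 0.62, 0.55],
    [0.70, 0.62, 0.55],
]

mean_based = loso(score_matrix, z=0.0)
lcb_based = loso(score_matrix, z=1.0)
print(round(mean_based[0], 3))  # 0.578
print(round(lcb_based[0], 3))   # 0.62
```

In this constructed example, mean-based selection keeps switching to the volatile prompt exactly when its bad condition is held out, which lowers the held-out average.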

## 4 Experiments

We conduct experiments using open-weight instruction-tuned LLMs in a zero-shot setting with greedy decoding. Within each experiment the model is held fixed, which isolates the effects of evaluation variability from model-specific factors.

### 4.1 Setup

#### 4.1.1 Models

We evaluate prompt ranking stability across three representative open\-weight instruction\-tuned large language models with comparable parameter scales but different training recipes: Mistral\-7B\-Instruct\-v0\.3 \(Mistral\)\[[1](https://arxiv.org/html/2606.24381#bib.bib1)\], Phi\-3\-mini\-4k\-instruct \(Phi\)\[[6](https://arxiv.org/html/2606.24381#bib.bib2)\], and Qwen2\.5\-7B\-Instruct \(Qwen\)\[[7](https://arxiv.org/html/2606.24381#bib.bib3)\]\.

Mistral\-7B\-Instruct\-v0\.3 is a 7B\-parameter instruction\-tuned model designed for strong general\-purpose reasoning and instruction following\.

Phi\-3\-mini\-4k\-instruct is a compact instruction\-tuned model with competitive performance across a range of reasoning and knowledge tasks\.

Qwen2\.5\-7B\-Instruct is a 7B\-scale instruction\-tuned model that demonstrates strong multilingual and general knowledge capabilities\.

These models are selected to cover diverse training paradigms and capabilities while maintaining comparable model sizes, allowing us to isolate the effects of evaluation variability on prompt ranking stability\.

#### 4.1.2 Tasks

We evaluate on two benchmark tasks with automatic evaluation\. GSM8K requires multi\-step numerical reasoning and is sensitive to compounding errors\. MMLU is a multi\-disciplinary multiple\-choice question answering benchmark covering a broad range of subjects\. These tasks differ substantially in structure and difficulty, allowing us to examine task\-dependent stability effects\.

#### 4.1.3 Prompts

For each task, we construct a fixed set of 20 candidate prompts\. The prompts vary in instruction phrasing, reasoning guidance, and output constraints, while targeting the same underlying task\. All prompts are evaluated under identical conditions within each evaluation run\. The prompt pool was manually constructed by the authors to represent diverse instruction styles, reasoning cues, and output constraints commonly used in prompt engineering\.

### 4.2 Results

In this section, we analyze prompt ranking stability from three complementary perspectives: (1) global ranking consistency, (2) decision-level consistency of top-performing prompts, and (3) robustness of prompt selection under evaluation variability. In particular, Table [1](https://arxiv.org/html/2606.24381#S4.T1) reports pairwise consistency metrics that quantify agreement between pairs of evaluation conditions, whereas Table [2](https://arxiv.org/html/2606.24381#S4.T2) reports global consistency metrics that quantify agreement with a modal reference ranking aggregated across all conditions.

Table 1: Ranking stability across different models (Mistral, Phi, and Qwen). Metrics represent the mean $\pm$ standard deviation calculated over $n_{\text{pairs}}=10$ seed pairs.

#### 4.2.1 Ranking Stability under Evaluation Variability

Table [1](https://arxiv.org/html/2606.24381#S4.T1) reports ranking stability across random seeds. Across models, prompt rankings exhibit substantial instability under small evaluation budgets, particularly on GSM8K. At 50 evaluation examples, Spearman correlation ranges from 0.456 to 0.763 across models, with large variance observed in some cases (e.g., $\pm 0.310$ for Qwen), indicating weak to moderate agreement across seeds. Although ranking stability improves as the evaluation subset size increases, correlations remain far from perfect even at 200 examples.

MMLU exhibits higher overall ranking stability than GSM8K. For example, at 200 examples, Spearman correlation reaches up to 0.862 for Qwen and 0.821 for Phi. However, even in these settings, Kendall's $\tau$ remains substantially below 1.0 (e.g., 0.703 for Qwen), suggesting that non-trivial prompt reordering persists across evaluation conditions.

Notably, relatively high rank correlations do not guarantee stable prompt selection\. As we show next, even when rankings appear consistent at a global level, the identity of the top\-performing prompt can vary significantly across seeds\.

#### 4.2.2 Top-$k$ Consistency and Prompt Selection Reliability

Table 2: Top-1 and Top-3 ($k=3$) configuration consistency across models over $n=5$ random seeds. "Unique Top-1" represents the count of distinct best-performing configurations identified across seeds.

We note that the top-$k$ metrics reported in Table [1](https://arxiv.org/html/2606.24381#S4.T1) are computed in a pairwise manner across seed pairs, whereas those in Table [2](https://arxiv.org/html/2606.24381#S4.T2) are defined with respect to a global reference (mode) across all seeds. For global top-1 consistency, we identify the prompt that appears most frequently as the top-ranked prompt across evaluation conditions and report the fraction of conditions in which this modal top-1 prompt is selected. For global top-$k$ consistency, we analogously define a modal top-$k$ set and report the average overlap between each condition-specific top-$k$ set and this reference set. These metrics capture complementary aspects of ranking stability.
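A minimal sketch of the global (modal) consistency metrics follows. The modal top-$k$ set is interpreted here as the top-$k$ set that occurs most often across conditions; the text does not pin down this definition, so that interpretation, the function names, and the synthetic scores are assumptions.

```python
from collections import Counter


def top_k(row, k):
    """Frozen set of the k best prompt indices; ties broken by index."""
    order = sorted(range(len(row)), key=lambda i: (-row[i], i))
    return frozenset(order[:k])


def global_top1_consistency(score_matrix):
    """Return (modal top-1 prompt, fraction of conditions selecting it, unique top-1 count)."""
    winners = [max(range(len(row)), key=lambda i: row[i]) for row in score_matrix]
    counts = Counter(winners)
    modal_prompt, modal_count = counts.most_common(1)[0]
    return modal_prompt, modal_count / len(winners), len(counts)


def global_topk_consistency(score_matrix, k):
    """Average overlap between each condition's top-k set and the modal top-k set."""
    sets = [top_k(row, k) for row in score_matrix]
    reference, _ = Counter(sets).most_common(1)[0]
    return sum(len(s & reference) / k for s in sets) / len(sets)


# Synthetic scores (rows: conditions, columns: prompts), not results from the paper.
score_matrix = [
    [0.70, 0.62, 0.55],
    [0.50, 0.61, 0.56],
    [0.72, 0.63, 0.54],
    [0.52, 0.62, 0.55],
    [0.70, 0.62, 0.55],
]

print(global_top1_consistency(score_matrix))       # (0, 0.6, 2)
print(global_topk_consistency(score_matrix, k=2))  # 0.8
```

The third value returned by `global_top1_consistency` corresponds to the "Unique Top-1" count described in the Table 2 caption.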

Table [2](https://arxiv.org/html/2606.24381#S4.T2) summarizes decision-level consistency across seeds. On GSM8K with 50 evaluation examples, top-1 consistency is low across all models (around 40%), with multiple distinct prompts identified as the best-performing configuration. Even with 200 examples, top-1 consistency remains limited (typically $\leq 60\%$), indicating that the identity of the best prompt is highly sensitive to evaluation variability.

A similar pattern is observed on MMLU\. Despite relatively high rank correlations, top\-1 consistency remains low across most settings, and multiple prompts emerge as the top choice across different seeds\. These results reveal a critical gap between ranking stability and decision reliability: stable rankings do not necessarily imply reliable prompt selection\.

#### 4.2.3 Robustness of Stability-aware Prompt Selection

Table 3: Selection robustness via leave-one-seed-out (LOSO) evaluation across models. Results report mean $\pm$ std over $n=5$ held-out seeds with parameter $z=1.0$.

Table [3](https://arxiv.org/html/2606.24381#S4.T3) evaluates the robustness of prompt selection under a leave-one-seed-out (LOSO) protocol. On GSM8K, the proposed stability-aware selection strategy based on a lower confidence bound (LCB) consistently improves or matches mean-based selection. In particular, under small evaluation budgets, LCB yields substantial gains (e.g., 0.312 vs. 0.228 at size 50 for Qwen), demonstrating improved robustness under noisy evaluation conditions. At larger subset sizes, LCB remains competitive and often achieves higher or comparable performance.

On MMLU, LCB does not uniformly improve average performance\. In relatively stable settings, such as MMLU with larger evaluation subsets, the variance penalty may lead to the selection of slightly more conservative prompts, resulting in a modest reduction in mean accuracy\. This reflects an inherent trade\-off between robustness and peak performance\.

These results suggest that incorporating uncertainty into prompt selection is particularly beneficial in challenging or high\-variance regimes, while remaining safe in more stable settings\.

Importantly, the LCB\-based strategy requires no additional training or model modification, making it a practical drop\-in replacement for standard mean\-based selection\.

#### 4.2.4 Discussion

Our findings indicate that prompt ranking instability primarily arises among prompts with similar average performance\. While clearly suboptimal prompts are consistently identified, fine\-grained ordering among strong prompts is highly sensitive to evaluation variability\. This explains why rank correlations can be moderately high while top\-1 consistency remains low\.
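As an illustrative aside, a back-of-the-envelope calculation shows why near-tied prompts can swap places. Under the simplifying assumption (not stated in the text) that accuracy is the mean of $k$ independent correct/incorrect outcomes with true accuracy $a$, the standard error of the estimated accuracy $\hat{a}$ follows the binomial formula below; $a=0.5$ is a hypothetical value chosen only to make the arithmetic concrete.

```latex
\mathrm{SE}(\hat{a}) = \sqrt{\frac{a(1-a)}{k}},
\qquad
a = 0.5:\quad
\mathrm{SE}\big|_{k=50} = \sqrt{\tfrac{0.25}{50}} \approx 0.071,
\quad
\mathrm{SE}\big|_{k=200} = \sqrt{\tfrac{0.25}{200}} \approx 0.035 .
```

Under these assumptions, two prompts whose true accuracies differ by one or two points sit well inside the sampling noise at $k=50$, and quadrupling the subset size only halves that noise. Because all prompts share the same subset, the noise on the difference between two prompts is typically smaller than this single-prompt figure, so the numbers should be read as a rough upper guide rather than an exact bound.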

The contrast between GSM8K and MMLU highlights the role of task characteristics\. GSM8K requires multi\-step reasoning and is more sensitive to compounding errors, leading to higher variability across evaluation subsets\. In contrast, MMLU exhibits higher overall ranking stability but still suffers from frequent changes in the top\-ranked prompt, indicating that even relatively stable tasks can yield unreliable prompt selection outcomes\.

From a practical perspective, these results caution against over\-interpreting small performance differences when selecting prompts\. Rather than treating prompt evaluation as deterministic, practitioners should account for ranking stability, especially under limited evaluation budgets\. Simple stability\-aware criteria can provide a low\-cost way to improve robustness without modifying the underlying model\.

Overall, our results suggest that prompt evaluation should be viewed as a stochastic estimation problem rather than a deterministic comparison, particularly in realistic low\-resource evaluation settings\. This perspective suggests that future benchmarking practices should incorporate uncertainty\-aware evaluation protocols rather than relying solely on point estimates\.

A limitation of the current study is that it focuses on two benchmark tasks and a fixed pool of 20 prompts per task. Future work should examine a broader range of tasks, prompt families, and evaluation settings, including few-shot prompting and generative evaluation. A more comprehensive sensitivity analysis of $z$ is also left for future work.

To facilitate reproducibility, the code, prompts, and evaluation scripts used in this study are publicly available at: [GitHub Repository](https://github.com/shaoshuaidu/prompt_stability).

## 5 Conclusion

In this work, we studied the stability of prompt performance rankings under common sources of evaluation variability\. We showed that prompt rankings can be highly unstable, especially under limited evaluation budgets, and that the identity of the top\-ranked prompt frequently changes across evaluation conditions\. We further demonstrated that a simple stability\-aware selection strategy based on a lower confidence bound \(LCB\) can improve robustness in high\-variance settings while remaining competitive in more stable regimes\.

More broadly, our findings suggest that prompt evaluation should be viewed as a statistical estimation problem rather than a deterministic comparison\. We hope this work encourages more reliable evaluation practices and greater awareness of ranking instability in prompt\-based LLM systems\.

## References

- [1] M. AI (2024). Mistral-7b-instruct-v0.3. [https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3). Accessed: 2026-03-23.
- [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). [Link](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html).
- [3] J. Dodge, S. Gururangan, D. Card, R. Schwartz, and N. A. Smith (2019). Show your work: improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 2185–2194. [Link](https://doi.org/10.18653/v1/D19-1224), [Document](https://dx.doi.org/10.18653/V1/D19-1224).
- [4] R. Dror, G. Baumer, S. Shlomov, and R. Reichart (2018). The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 1383–1392. [Link](https://aclanthology.org/P18-1128/), [Document](https://dx.doi.org/10.18653/V1/P18-1128).
- [5] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022). Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html).
- [6] Microsoft (2024). Phi-3-mini-4k-instruct. [https://huggingface.co/microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct). Accessed: 2026-03-23.
- [7] A. G. Qwen Team (2024). Qwen2.5-7b-instruct. [https://huggingface.co/Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). Accessed: 2026-03-23.
- [8] N. Reimers and I. Gurevych (2017). Reporting score distributions makes a difference: performance study of lstm-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel (Eds.), pp. 338–348. [Link](https://doi.org/10.18653/v1/d17-1035), [Document](https://dx.doi.org/10.18653/V1/D17-1035).
- [9] Y. Shen, L. Wang, C. Shi, S. Du, Y. Tao, Y. Shen, and H. Zhang (2024). Comparative analysis of listwise reranking with large language models in limited-resource language contexts. CoRR abs/2412.20061. [Link](https://doi.org/10.48550/arXiv.2412.20061), [Document](https://dx.doi.org/10.48550/ARXIV.2412.20061).
- [10] E. M. Voorhees (1998). Variations in relevance judgments and the measurement of retrieval effectiveness. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel (Eds.), pp. 315–323. [Link](https://doi.org/10.1145/290941.291017), [Document](https://dx.doi.org/10.1145/290941.291017).
- [11] L. Wang, C. Shi, S. Du, Y. Tao, Y. Shen, H. Zheng, Y. Shen, and X. Qiu (2025). Performance review on LLM for solving leetcode problems. CoRR abs/2502.15770. [Link](https://doi.org/10.48550/arXiv.2502.15770), [Document](https://dx.doi.org/10.48550/ARXIV.2502.15770).
- [12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html).
