Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

arXiv cs.CL Papers

Summary

Researchers from PNNL and Washington University introduce a systematic framework to test how five LLMs detect subtle semantic changes in documents, revealing positional bias, context coherence effects, and model-specific scoring fingerprints.

arXiv:2604.18835v1 Announce Type: new

Cached at: 04/22/26, 08:29 AM

# Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Source: [https://arxiv.org/html/2604.18835](https://arxiv.org/html/2604.18835)
Sinan G\. Aksoy¹, Alexandra A\. Sabrio¹,², Erik VonKaenel¹,³, Lee Burke¹

¹Pacific Northwest National Laboratory, ²Washington University in St\. Louis, ³Humana Inc\.

\{first\.last\}@pnnl\.gov, vonkaenelerik@gmail\.com

###### Abstract

We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison\. We analogize this as a needle\-in\-a\-haystack problem: a single semantically altered sentence \(the needle\) is embedded within surrounding context \(the hay\), and we vary the perturbation type \(negation, conjunction swap, named entity replacement\), context type \(original vs\. topically unrelated\), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs\. Our analysis reveals several striking findings\. First, LLMs exhibit a within\-document positional bias distinct from previously studied candidate\-order effects: most models penalize semantic differences more harshly when they occur earlier in a document\. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity\. This is consistent with an interpretive frame account in which topically\-related context may allow models to contextualize and downweight the alterations\. Third, each LLM produces a qualitatively distinct scoring distribution, a stable “fingerprint” that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types\. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM\-agnostic toolkit for auditing and comparing scoring behavior across current and future models\.

## 1 Introduction

LLM\-as\-a\-judge systems are rapidly pervading domains ranging from complex scientific workflows to routine, everyday decisions\. In these systems, an LLM acts as an automated evaluator: assessing the quality of generated text \(Liu et al\., [2023b](https://arxiv.org/html/2604.18835#bib.bib16); Zheng et al\., [2023](https://arxiv.org/html/2604.18835#bib.bib23)\), grading student essays \(Song et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib20)\), scoring medical question\-answering \(Krolik et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib12)\), evaluating mathematical reasoning \(Li et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib13)\), and serving as a stand\-in for costly human annotation across natural language generation tasks \(Chiang and Lee, [2023](https://arxiv.org/html/2604.18835#bib.bib4); Dubois et al\., [2023](https://arxiv.org/html/2604.18835#bib.bib6)\)\. Recent surveys document the remarkable breadth and rapid adoption of this paradigm \(Gu et al\., [2025](https://arxiv.org/html/2604.18835#bib.bib7)\), while simultaneously raising concerns about systematic biases in LLM evaluation \(Wang et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib22); Chen et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib3)\)\. As new LLMs are released at an accelerating pace and these judge systems are continually updated, practitioners face a moving target: each model version may introduce new scoring behaviors, biases, or idiosyncrasies that are difficult to anticipate\. For these reasons, it is increasingly important to have experimental pipelines for evaluating LLM\-as\-a\-judge systems that are *detailed*, *generalizable*, and *scalable*\.

One particularly far\-reaching use case of LLM\-as\-a\-judge systems is *pairwise document similarity assessment*\. This task arises across a wide range of applications, including plagiarism and near\-duplicate detection, information retrieval and document clustering, semantic textual similarity benchmarks \(Cer et al\., [2017](https://arxiv.org/html/2604.18835#bib.bib2)\), and automated scoring of genre\-specific articles\. Beyond settings where document comparison is the primary objective, pairwise similarity assessment is a critical subcomponent of evaluating *other* tools\. For example, when testing text privatization, pairwise document similarity can quantify how much privatized text strays from the original semantic meaning\. Furthermore, the pairwise case can be naturally bootstrapped to multi\-way comparisons, making robust pairwise evaluation all the more consequential\.

Within this domain, the ability to detect and differentiate between *subtle* semantic changes is becoming especially important\. There are several reasons for this\. First, small semantic differences are critical to many real\-world applications of pairwise document similarity, beyond the aforementioned example of text privatization\. In medical documentation, even minor revisions to treatment recommendations or dosage instructions can lead to substantially different patient outcomes \(Krolik et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib12)\), making detection of subtle semantic drift a matter of safety\. Similarly, in legal texts, small changes in wording, such as the substitution of “and” for “or” in a contractual clause, can dramatically alter the obligations and rights of the parties involved \(Adams and Kaye, [2006](https://arxiv.org/html/2604.18835#bib.bib1)\)\.

Second, as LLMs become increasingly capable at pairwise similarity assessment, we need correspondingly challenging tasks to stress\-test them and expose their limitations\. Idiosyncratic scoring behaviors are often most visible “on the margins,” where subtle differences push the boundary of what a model can reliably detect \(Wang et al\., [2023](https://arxiv.org/html/2604.18835#bib.bib21)\)\. This makes fine\-grained tasks more discriminative than coarser evaluations\.

Third, even humans are unreliable at detecting subtle semantic changes, raising the question of whether LLMs inherit similar blind spots or introduce new ones\. Research on semantic illusions suggests readers routinely overlook meaning alterations embedded within coherent discourse \(Cook et al\., [2018](https://arxiv.org/html/2604.18835#bib.bib5); Nieuwland and Van Berkum, [2005](https://arxiv.org/html/2604.18835#bib.bib18)\), and that the position of a change within a passage modulates this failure rate \(Liu et al\., [2023a](https://arxiv.org/html/2604.18835#bib.bib15)\)\. Understanding whether LLMs exhibit analogous biases \(or different ones\) when scoring document similarity is essential for calibrating trust in these systems and for identifying failure modes before they propagate into downstream applications\.

Accordingly, in this work we propose, apply, and analyze an experimental pipeline for testing LLM\-as\-a\-judge sensitivity to semantic similarity between pairs of documents\. We frame our experiment as a *needle\-in\-a\-haystack* problem\. While this metaphor originates in work on long\-context retrieval, where a planted fact must be located within filler text \(Kamradt, [2023](https://arxiv.org/html/2604.18835#bib.bib11); Hsieh et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib10)\), we invoke it here instead to test how LLMs *score semantic similarity* between two almost\-identical documents, one of which contains a single semantic alteration embedded within surrounding context\. In our formulation, the “needle” is a single, semantically altered sentence, and the “hay” is the surrounding context of varying length, position, and relevance\. By systematically varying the needle type, hay type, hay amount, and needle position across all possible combinations and testing on tens of thousands of document pairs, we construct a scalable experimental design that probes LLM scoring sensitivity along multiple axes simultaneously\. Our main contributions are two\-fold:

- We propose an LLM\-agnostic, automated, and scalable factorial experimental design for sensitivity testing of LLM\-guided pairwise document semantic similarity\.
- We instantiate this framework across five LLMs \(GPT\-4o, GPT\-5, Claude, Gemini, and o4\-mini; see Appendix [A](https://arxiv.org/html/2604.18835#A1) for model information\) and develop quantitative analyses, including positionality bias measures, document length effects, distributional “fingerprints,” and a bipolarization index, that expose interpretable differences in scoring behavior across models\.

Figure 1: An example illustrating our experimental pipeline\.

Original document $d$: … Thousands of flights land in Chicago\. Tons of passengers transfer at O’Hare\. O’Hare is larger and busier than Midway\. Airport runways handle traffic all day\. United runs its busiest hub at O’Hare\. …

Legend: `orig`: original hay; `rand`: random hay; `neg`: negation insertion; `con`: conjunctive swap; `ner`: named entity replacement; $(i,j)$: $i$ pre and $j$ post sentences\. The figure highlights the needle sentence and underlines the semantic change, which is italicized in the comparisons listed here\.

Example comparisons:

- $d(\varnothing,\texttt{orig},(1,2))$ vs\. $d(\texttt{neg},\texttt{orig},(1,2))$: “Tons of passengers transfer at O’Hare\. O’Hare is larger and busier than Midway\. Airport runways handle traffic all day\. United runs its busiest hub at O’Hare\.” vs\. “Tons of passengers transfer at O’Hare\. O’Hare is *not* larger and busier than Midway\. Airport runways handle traffic all day\. United runs its busiest hub at O’Hare\.” Score: 77/100
- $d(\varnothing,\texttt{rand},(2,0))$ vs\. $d(\texttt{con},\texttt{rand},(2,0))$: “Goodfellas was a hit film in 1990\. Robert De Niro’s acting was praised\. O’Hare is larger and busier than Midway\.” vs\. “Goodfellas was a hit film in 1990\. Robert De Niro’s acting was praised\. O’Hare is larger *or* busier than Midway\.” Score: 89/100
- $d(\varnothing,\texttt{orig},(0,1))$ vs\. $d(\texttt{ner},\texttt{orig},(0,1))$: “O’Hare is larger and busier than Midway\. Airport runways handle traffic all day\.” vs\. “O’Hare is larger and busier than *Chicago*\. Airport runways handle traffic all day\.” Score: 62/100

Our framework does not make value judgments about LLM scoring behavior, nor does it rank model performance\. Rather, it provides a modular experimental framework, within the familiar needle\-in\-a\-haystack metaphor, through which one can generate rich experimental data about LLM scoring behavior and idiosyncrasies, for both current and future systems\. Our instantiation and subsequent analysis on real data demonstrate that the framework is highly discriminative, sharply highlighting differences between different LLMs as well as between versions of the same LLM\.
Together, the intuitive needle\-and\-hay framing that practitioners can readily adapt to new perturbation types, domains, and models, and the quantitative analyses we develop for interpreting the resulting data offer a practical toolkit for anyone seeking to understand, compare, or audit LLM scoring behavior\. Our work is organized as follows: Section 2 describes our experimental design, Section 3 presents our results, and Section 4 discusses our findings in context\.

## 2 Experimental Design

### Overview

We study how LLMs quantitatively score semantic similarity between pairs of text documents that are identical apart from a *needle*: a single, semantically altered sentence\. We consider three needle types: semantic perturbation via negation insertion, conjunction replacement, and named entity replacement\. These semantically altered sentences are subsequently positioned between some number of preceding and succeeding sentences, the surrounding *hay*\. We vary both the relative *position* of the needle within this hay \(moving the semantic difference across beginning, middle, and end\) and the total hay amount \(the length of the document by sentence count\)\. We also toggle the hay type: if the sentences surrounding the needle are taken from a randomly chosen document, we call this “random hay”, whereas “original hay” retains the original surrounding context\. We vary all possible combinations of these parameters, automate needle and hay insertion, run each setting on many documents, and repeat this across multiple LLMs\.

### Formal description

Let $\mathcal{U}$ denote a universe of raw text documents\. We take $\mathcal{U}$ to be a subset of Plain Text Wikipedia \(LtCmdrData, [2020](https://arxiv.org/html/2604.18835#bib.bib17)\) consisting of 453,602 documents\. We further clean these to remove non\-natural language text, and filter them for length, yielding 40,003 cleaned documents, $\mathcal{C}$\. There are four main parameters to our experiment:

- Needle type: $N\in\{\varnothing,\texttt{neg},\texttt{con},\texttt{ner}\}$, the semantic change type applied to the altered sentence\. Can be negation \(`neg`\), swapping between “and” and “or” \(`con`\), swapping a named entity \(`ner`\), or no change \($\varnothing$\)\.
- Hay type: $H\in\{\texttt{orig},\texttt{rand}\}$, the type of sentences preceding and following the altered sentence\. Can be the original document’s sentences \(`orig`\) or consecutive sentences from a random document in $\mathcal{C}$ \(`rand`\)\.
- Position: $P\in\{(i,j): i,j\in\{0,\dots,k\}\}$, the number of sentences $i$ preceding and number of sentences $j$ following the altered sentence, where $i$ and $j$ are integers between $0$ and $k$\. We take $k=9$\.
- LLM: $L\in\{$GPT\-4o, GPT\-5, Claude, Gemini, o4\-mini$\}$, the Large Language Model scoring document similarity\.
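As an illustration of how the four parameters combine, the grid can be enumerated in a few lines. This is a minimal sketch, not the authors' code: the list names are hypothetical, and it assumes the no-change needle is left out of the grid because it only appears as the unaltered side of each comparison. Under that assumption the count matches the 3000 settings the paper reports.

```python
from itertools import product

# Hypothetical names; values are taken from the parameter list in the text.
LLMS = ["GPT-4o", "GPT-5", "Claude", "Gemini", "o4-mini"]
NEEDLES = ["neg", "con", "ner"]   # assumption: the no-change needle is the baseline, not a setting
HAY_TYPES = ["orig", "rand"]
K = 9                             # position bound k from the text

# Every (i, j) with i preceding and j following sentences, 0 <= i, j <= K.
POSITIONS = [(i, j) for i in range(K + 1) for j in range(K + 1)]

# Full factorial design: one tuple (L, N, H, P) per experimental setting.
settings = list(product(LLMS, NEEDLES, HAY_TYPES, POSITIONS))

print(len(POSITIONS))  # number of positions
print(len(settings))   # number of parameter settings
```

Keeping the grid as plain tuples makes it easy to add a new model or perturbation type by appending to one list.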

We consider each possible parameter setting $(L,N,H,P)$ in the Cartesian product

$$\text{LLM}\times\text{Needle}\times\text{Hay}\times\text{Position},$$

which, for our choices detailed above, totals 3000 distinct parameter settings \(5 LLMs × 3 perturbing needle types × 2 hay types × 100 positions; the no\-change needle $\varnothing$ serves only as the unaltered side of each comparison\)\. For each position $(i,j)\in P$, we randomly permute the documents in $\mathcal{C}$ and process them in this order so that the set of documents processed by each LLM, needle, and hay triple is the same random sample\. Fixing a position $(i,j)$, we then run the following pipeline repeatedly until a stopping criterion is met: for a document $d\in\mathcal{C}$ with $|d|$ sentences, we select the sentence with index $m=\lceil\frac{|d|}{2}\rceil$ as the “middle” sentence to be semantically altered\. The resulting document, $d(N,H,P)$, is derived from $d$ and has $i+j+1$ sentences: the $m$\-th sentence of $d$, altered via semantic change $N$, preceded by $i$ and followed by $j$ sentences of type $H$\. We then use LLM $L$ to compare this document with its unaltered counterpart, i\.e\.

$$d(\varnothing,H,P)\ \text{vs.}\ d(N,H,P),$$

using a basic scoring prompt for semantic similarity \(see Figure [8](https://arxiv.org/html/2604.18835#A1.F8), Appendix [A](https://arxiv.org/html/2604.18835#A1)\)\. The resulting score, $s(N,H,P)$, is an integer in $[0,100]$\. Figure [1](https://arxiv.org/html/2604.18835#S1.F1) illustrates the comparison for different triples in $N\times H\times P$ on a toy example\. We continue scoring documents until at least 100 documents have been processed and a stopping criterion based on mean score stabilization is also met\. Finally, for a fixed $(L,N,H)$, we take the maximum number of documents run across all positions, and run all other positions on this number of documents\. For more details on data cleaning, needle implementation, and our stopping criteria, see Appendix [A](https://arxiv.org/html/2604.18835#A1)\.
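The construction of a document pair can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `build_pair` and its arguments are hypothetical names, original hay is assumed to be the sentences immediately adjacent to the middle sentence (as in the Figure 1 toy example), random hay is assumed to be one consecutive block from another document split around the needle, and the perturbation is passed in as a function because the actual needle implementations are described in the paper's appendix.

```python
import math
import random

def build_pair(doc, hay_doc, perturb, i, j, hay="orig", rng=None):
    """Return (baseline, altered) sentence lists for position (i, j).

    doc, hay_doc: lists of sentences; perturb: function applied to the needle.
    """
    rng = rng or random.Random(0)
    m = math.ceil(len(doc) / 2)          # 1-indexed "middle" sentence
    needle = doc[m - 1]
    if hay == "orig":
        # Assumption: use the i sentences just before and j just after the needle.
        if m - 1 < i or len(doc) - m < j:
            raise ValueError("document too short for this position")
        pre, post = doc[m - 1 - i : m - 1], doc[m : m + j]
    else:
        # Assumption: take i + j consecutive sentences from the other document.
        if len(hay_doc) < i + j:
            raise ValueError("hay document too short for this position")
        start = rng.randrange(len(hay_doc) - (i + j) + 1)
        block = hay_doc[start : start + i + j]
        pre, post = block[:i], block[i:]
    baseline = pre + [needle] + post
    altered = pre + [perturb(needle)] + post
    return baseline, altered

# Toy example from Figure 1, with a stand-in negation function.
doc = [
    "Thousands of flights land in Chicago.",
    "Tons of passengers transfer at O'Hare.",
    "O'Hare is larger and busier than Midway.",
    "Airport runways handle traffic all day.",
    "United runs its busiest hub at O'Hare.",
]
negate = lambda s: s.replace(" is ", " is not ", 1)
baseline, altered = build_pair(doc, [], negate, i=1, j=2)
print(altered[1])
```

Both returned documents have i + j + 1 sentences and differ only in the needle, which is what the scoring prompt then compares.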

![Refer to caption](https://arxiv.org/html/2604.18835v1/x1.png)

Figure 2: EMD and KDE plot comparison of score distributions for GPT\-4o \(left\) and Claude \(right\) by $(i,j)$ position, under the `ner` needle and `orig` hay\. Note the y\-axis scale difference between panels\.

## 3 Results

### Positionality

We begin by investigating how the *position* of the perturbed sentence within the document affects similarity scores\. Do LLMs penalize semantic differences more harshly when they appear early in a document, and how rapidly does this bias intensify with position? Does surrounding context that is topically unrelated to the altered sentence amplify or dampen this effect? To begin, a simple null hypothesis is that scores are identical \(in distribution\) whether the needle is in the first or the second half of the document, i\.e\.

$$\{s(N,H,(i,j)) : i>j\} = \{s(N,H,(i,j)) : i<j\}.$$

We perform a 2\-sample Kolmogorov\-Smirnov test between the score distributions of these “first\-half” vs “second\-half” needle documents for each LLM, needle, and hay parameter setting, for a total of 30 hypothesis tests\. The results are clear: for *all* such parameter settings, we can reject this hypothesis at a significant level \($p<0.05$\) and, with the exception of GPT\-5’s scores under original hay, at a very highly significant level \($p<0.001$\)\. A subsequent, more nuanced null hypothesis is to test equality individually for each $(i,j)$ position pair; that is:

$$s(N,H,(i,j)) = s(N,H,(j,i)).$$

Here, whether we can reject the null hypothesis varies depending on the position, LLM, needle, and hay settings\. However, some commonalities are worth highlighting\. Across all LLMs, needles, and hay types, we can reject the null hypothesis at a weakly significant level \($p<0.1$\) for certain recurrent positions:

- Original hay: $(0,7)$
- Random hay: $(0,3)$, $(0,4)$, $(0,5)$, $(0,7)$

This suggests positionality biases are consistently exhibited in our experiment whenever the perturbed sentence is first or last in documents that are between 4 and 8 sentences long, regardless of the choice of LLM or needle type\.
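To make the first-half versus second-half comparison concrete, here is a minimal, standard-library sketch of the two-sample KS statistic (the largest gap between the two empirical CDFs). It computes only the statistic, not a p-value; in practice a library routine such as `scipy.stats.ks_2samp` would supply both. The function names and the toy scores are hypothetical.

```python
from bisect import bisect_right

def split_by_half(scores_by_position):
    """Pool scores into needle-later (i > j) and needle-earlier (i < j) groups."""
    later, earlier = [], []
    for (i, j), scores in scores_by_position.items():
        if i > j:
            later.extend(scores)
        elif i < j:
            earlier.extend(scores)
        # positions with i == j fall in neither half and are skipped
    return later, earlier

def ks_statistic(xs, ys):
    """Two-sample KS statistic: max absolute difference of empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for t in set(xs) | set(ys):
        fx = bisect_right(xs, t) / len(xs)   # fraction of xs <= t
        fy = bisect_right(ys, t) / len(ys)   # fraction of ys <= t
        gap = max(gap, abs(fx - fy))
    return gap

# Made-up scores for one (LLM, needle, hay) setting, keyed by (i, j).
toy = {(0, 3): [60, 65, 70], (3, 0): [80, 85, 90], (1, 1): [75]}
later, earlier = split_by_half(toy)
print(ks_statistic(later, earlier))
```

The same `ks_statistic` applies unchanged to the per-position test by passing the scores for (i, j) and (j, i).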

However, there are also striking differences in positionality bias across LLMs\. For example, Figure [2](https://arxiv.org/html/2604.18835#S2.F2) compares GPT\-4o and Claude for original hay and the `ner` needle type\. Each $(i,j)$ cell in the upper portion displays the earth mover’s distance \(EMD\) between score distributions for $(i,j)$ and $(j,i)$\. One takeaway is the brightly colored first column, which indicates high EMD: GPT\-4o’s scores more sharply differentiate documents with perturbed sentences that appear first vs\. last in the document\. Claude also shows positionality bias here, but at a much smaller magnitude\.

The lower portion of each heatmap contains Kernel Density Estimation \(KDE\) plots of the distributions in question, such that the $(j,i)$ cell plots the distributions compared by EMD in the $(i,j)$ cell\. For example, the high EMD in GPT\-4o’s left column can also be seen in the divergence of the red and blue lines in the bottom row\. Studying the shape of these plots as we vary $i$ and $j$ yields nuanced differences within and across LLMs\. For example, with larger $i,j$, Claude’s distribution becomes sharply concentrated at high scores; the $(9,9)$ cell shows consistent, near\-perfect similarity scores for 19\-sentence documents differing in their middle sentence\. In contrast, GPT\-4o’s scores remain more spread out for this cell and appear bimodal throughout\. These observations suggest LLMs exhibit qualitatively different scoring “fingerprints” based on needle position\.
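For readers who want to reproduce a single heatmap cell, the EMD between two sets of integer scores can be computed directly from their empirical CDFs. The sketch below assumes scores are integers in [0, 100] and uses a hypothetical function name; `scipy.stats.wasserstein_distance` is a library alternative that handles the general case.

```python
from collections import Counter

def emd_integer_scores(xs, ys, lo=0, hi=100):
    """Earth mover's distance between two samples of integer scores in [lo, hi].

    For one-dimensional data the EMD equals the area between the two
    empirical CDFs; with integer scores that area is a sum over unit steps.
    """
    hx, hy = Counter(xs), Counter(ys)
    fx = fy = 0.0
    total = 0.0
    for t in range(lo, hi):      # both CDFs equal 1 at t = hi, so stop before it
        fx += hx[t] / len(xs)    # CDF of xs on the interval [t, t + 1)
        fy += hy[t] / len(ys)    # CDF of ys on the interval [t, t + 1)
        total += abs(fx - fy)
    return total

# Made-up scores for a needle-first position and its mirrored needle-last position.
needle_first = [60, 70]
needle_last = [80, 90]
print(emd_integer_scores(needle_first, needle_last))
```

Because the result is in score points, an EMD of 20 here reads as "on average, mass must move 20 points to turn one distribution into the other."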

![Refer to caption](https://arxiv.org/html/2604.18835v1/x2.png)

![Refer to caption](https://arxiv.org/html/2604.18835v1/x3.png)

Figure 3: Left: the mean \(top\) and standard deviation \(bottom\) of scores by document length for different LLMs under the `ner` needle and either `orig` or `rand` hay\. Right: the early\-positionality bias for different LLMs and needle types under original \(top\) and random \(bottom\) hay\.

To further explore these differences across LLMs, Figure [3](https://arxiv.org/html/2604.18835#S3.F3) \(right\) presents the Early\-Positionality Bias, which quantifies both bias magnitude and direction\. For a given $(i,j)$ pair with $i>j$, this computes the average score differential

$$s(N,H,(i,j)) - s(N,H,(j,i)),$$

and further averages this quantity over all such positions $(i,j)$\. Positive values indicate harsher penalization of earlier semantic differences, while negative values indicate harsher penalization of later differences\. Focusing on the original hay plot \(top\), we see GPT\-4o has roughly 8–10× as much early\-positionality bias as Claude\. Interestingly, GPT\-5 is the only LLM to show any negative bias here, meaning this LLM more harshly penalizes semantic needles that occur later within the document\. The scores for a given LLM show a comparable level of bias across needle types\. When the surrounding hay is random \(bottom\), the LLMs show a strong increase in early\-positionality bias, ranging from roughly 2× \(GPT\-4o\) to 8–9× \(Claude\), even reversing the direction of GPT\-5’s bias\. Clearly, positionality is heavily influenced by hay type: when the surrounding context is unrelated to the semantic change in question, the LLM more heavily penalizes differences which occur earlier in the document\.
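The Early-Positionality Bias can be sketched in a few lines. This is one plausible reading, not the authors' code: it takes the differential at each position as a difference of mean scores (which equals the mean paired difference when both positions are scored on the same documents), and the function name and toy scores are hypothetical.

```python
from statistics import mean

def early_positionality_bias(scores_by_position):
    """Average of mean s(i, j) - mean s(j, i) over all positions with i > j.

    With i > j the needle sits later in the document, so a positive value
    means later needles score higher, i.e. earlier differences are
    penalized more harshly.
    """
    diffs = []
    for (i, j), later_scores in scores_by_position.items():
        mirrored = scores_by_position.get((j, i))
        if i > j and mirrored:
            diffs.append(mean(later_scores) - mean(mirrored))
    return mean(diffs)

# Made-up scores for one (LLM, needle, hay) setting, keyed by (i, j).
toy = {
    (3, 0): [80, 90], (0, 3): [60, 70],   # differential: 85 - 65 = 20
    (2, 1): [75],     (1, 2): [70],       # differential: 75 - 70 = 5
}
print(early_positionality_bias(toy))
```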

### Document length

We now explore how document length affects scoring behavior, regardless of needle position\. Measured in sentences, the length of the document for position $(i,j)$ is simply $k=i+j+1$ \(in this subsection $k$ denotes document length, not the position bound $k=9$ of Section 2\)\. As $k$ increases, the proportion of the document’s sentences that have been semantically altered decreases\. Accordingly, a natural hypothesis is that the semantic similarity score increases with document length, in proportion to the fraction of semantically identical sentences, i\.e\. $\frac{k-1}{k}$\.

Figure [3](https://arxiv.org/html/2604.18835#S3.F3) \(left\) plots the mean and standard deviation of scores for `ner`\. The mean score plots include a reference line of $100\cdot\tfrac{k-1}{k}$\. Scores above this line suggest semantic perturbation preserves *some* of the semantic meaning of the altered sentence, while scores below suggest semantic perturbation alters the meaning of the document more severely than proportionally to the number of altered sentences\. For `orig` hay, Gemini’s mean scores closely follow this line, while Claude and o4\-mini are above, outputting the highest mean scores by document length\. For both `orig` and `rand` hay, all LLMs have mean scores that increase in document length, as hypothesized\.
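A few values of the reference line help calibrate how to read the plot. The helper below is a hypothetical name that evaluates the proportional baseline at several document lengths, up to the 19-sentence maximum implied by positions with i, j ≤ 9.

```python
def proportional_reference(length):
    """Score expected if similarity equals the share of unaltered sentences."""
    return 100 * (length - 1) / length

for length in (1, 2, 4, 10, 19):
    print(length, round(proportional_reference(length), 1))
```

The line rises quickly and then flattens: a one-sentence document gives 0, two sentences give 50, and ten or more sentences already exceed 90, so mean scores at longer lengths have little room above the line.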

Scores are notably lower across the board for `rand` hay\. One could hypothesize that random hay should *increase* semantic similarity scores relative to original hay, under the intuition that perturbing a sentence unrelated to the rest of the document does not alter its holistic semantic meaning as significantly\. One might likewise expect an increase in scores because original hay provides opportunities for contradictions between the needle sentence and the rest of the document: returning to the example in Figure [1](https://arxiv.org/html/2604.18835#S1.F1), asserting O’Hare is *not* busier than Midway seems at odds with surrounding original context describing it as an airline hub, whereas in a document about the film Goodfellas, whether O’Hare is busier than Midway seems immaterial\. Nonetheless, the results show the opposite: when there is no apparent connection between the needle and its context, LLMs ascribe \(on average\) more semantic importance to perturbations of that needle\. We return to this finding in Section [4](https://arxiv.org/html/2604.18835#S4), where we consider a competing interpretation that may account for it\.

Lastly, we consider standard deviation: for documents with 2 or more sentences, standard deviation is decreasing in document length for `orig`, suggesting LLMs become more consistent in rating longer documents highly\. However, the opposite occurs for `rand`, which shows increasing\-then\-plateauing standard deviation in document length, suggesting erratic scoring\.

### Needle type

A priori, there is no reason to expect that one perturbation type \(negation, named entity replacement, or conjunction swap\) changes semantic meaning any more or less than another\. Our null hypothesis is equality between all $\binom{3}{2}=3$ pairs of needle types:

$$\begin{aligned} s(\texttt{neg},H,P) &= s(\texttt{con},H,P) \\ s(\texttt{neg},H,P) &= s(\texttt{ner},H,P) \\ s(\texttt{ner},H,P) &= s(\texttt{con},H,P) \end{aligned}$$

Again, we perform a 2\-sample KS test with Bonferroni correction for each of the three comparisons for a given LLM and hay type\. For all such parameter settings, we reject the null hypothesis at the highly significant level \($p<0.01$\)\. The score distributions differ across needle types\.
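As a reminder of what the Bonferroni correction does with three comparisons, the sketch below scales each raw p-value by the number of tests and caps it at 1. The paper does not say whether it adjusts the p-values or the threshold (the two are equivalent), and the p-values here are invented purely for illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for neg-vs-con, neg-vs-ner, ner-vs-con.
raw = [0.001, 0.002, 0.5]
adjusted = bonferroni(raw)
significant = [p < 0.01 for p in adjusted]
print(adjusted, significant)
```

Equivalently, one can leave the p-values untouched and compare each against 0.01 / 3.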

![Refer to caption](https://arxiv.org/html/2604.18835v1/x4.png)

Figure 4: Distribution shape difference \($y$\-axis\) vs\. shift difference \($x$\-axis\) between needle types\.

![Refer to caption](https://arxiv.org/html/2604.18835v1/x5.png)

Figure 5: Violin plot of score distributions for each LLM and needle type with random vs original hay\.

![Refer to caption](https://arxiv.org/html/2604.18835v1/x6.png)

Figure 6: KDE plots of aggregate score distributions under random and original hay \(top\) and the Bipolarization Index \(bottom\) for those distributions as the sensitivity parameter $k$ is varied\.

However, we find that for a fixed \(LLM, hay type\), these score distribution differences are almost entirely due to translational shifts\. To isolate pure shape deformation from translational shifts, we computed the EMD between the mean\-centered score distributions\. Figure [4](https://arxiv.org/html/2604.18835#S3.F4) shows centered EMD remains consistently low, ranging between 1 and 4\. To contextualize this magnitude, the theoretical maximum EMD for any two centered distributions bounded between 0 and 100 is 50, making an observed EMD of 1 to 4 only 2% to 8% of the maximum possible divergence\. Consequently, accounting for the initial mean shift, the underlying score distribution shapes across needle types are nearly identical and likely fall well within the expected margin of stochastic noise inherent to LLM\-as\-a\-judge evaluations \(Zheng et al\., [2023](https://arxiv.org/html/2604.18835#bib.bib23); Wang et al\., [2024](https://arxiv.org/html/2604.18835#bib.bib22)\)\. Plotting the distribution shapes in Figure [5](https://arxiv.org/html/2604.18835#S3.F5) confirms this\. We observe that, for example, Claude’s negation scores have the same shape \(albeit shifted\) as those for conjunction swap, both of which are plainly different from those of other LLMs like GPT\-4o\. Thus, score distribution shapes across needle types are consistent within, but not across, LLMs\.
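The separation of shift from shape can be illustrated with a small sketch: subtract each sample's mean, then take the one-dimensional EMD of what remains. The function names and toy scores are hypothetical, and the paper's exact estimator may differ (for example in binning).

```python
from statistics import mean

def wasserstein_1d(xs, ys):
    """EMD between two empirical distributions on the real line,
    computed as the area between their empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, ix, iy = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while ix < len(xs) and xs[ix] <= left:
            ix += 1                       # ix = number of xs values <= left
        while iy < len(ys) and ys[iy] <= left:
            iy += 1                       # iy = number of ys values <= left
        total += abs(ix / len(xs) - iy / len(ys)) * (right - left)
    return total

def centered_emd(xs, ys):
    """EMD after removing each sample's mean, isolating shape differences."""
    mx, my = mean(xs), mean(ys)
    return wasserstein_1d([x - mx for x in xs], [y - my for y in ys])

# Made-up scores: same shape, shifted by 20 points.
con_scores = [60, 70, 80]
neg_scores = [80, 90, 100]
print(wasserstein_1d(con_scores, neg_scores))  # raw EMD reflects the shift
print(centered_emd(con_scores, neg_scores))    # centered EMD reflects shape only
```

In this toy case the raw EMD is about 20 while the centered EMD is 0, which mirrors the paper's observation that needle types mostly translate a model's score distribution without reshaping it.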

There is, however, a striking cross\-LLM commonality: the ordering of needle types by mean score is always the same\. For each LLM, and for both hay types, there is a consistent hierarchy,

$$\texttt{neg}\succ\texttt{ner}\succ\texttt{con},$$

meaning each LLM scores negation\-perturbed documents most highly, followed by named entity replacement and conjunction swap\. In short, while LLMs exhibit distinct scoring distribution fingerprints that are unchanged by perturbation type, each model shifts these fingerprints in the same hierarchical order in all parameter settings\.

### Hay type

Lastly, we study the impact of the surrounding context type – the hay\. Our analysis of other parameters so far has already revealed several insights\. Similarity scoring under random hay, relative to original hay, exhibits \(1\) significantly increased early\-positionality bias across all models; \(2\) lower mean scores and higher score variance, which weakly increase in document length; and \(3\) the same hierarchy in needle type preferences\. We now focus on the random and original hay score distributions themselves, and conduct a more fine\-grained comparison of their properties\.

Perhaps the most striking difference is that random hay, across all LLMs, induces a higher rate of “bipolarization” in scores, ping\-ponging between very high and very low scores, than original hay does\. To quantify this behavior rigorously, we define a Bipolarization Index $B$, which captures the simultaneous concentration of mass at both extremes of a 0–100 scoring scale\. For a distribution of document scores $X$ and a symmetric margin threshold $k$, the index is calculated as:

$$B = 4\cdot P(X\leq k)\cdot P(X\geq 100-k)$$

where $P(X\leq k)$ represents the proportion of documents receiving a score at or below the lower margin $k$, and $P(X\geq 100-k)$ represents the proportion of documents receiving a score at or above the upper margin\. Note that the product of these two probabilities ensures that the index only yields a non\-zero value if *both* extremes are populated, and the factor of 4 scales the index to a $[0,1]$ range, where $B=1.0$ represents an equal $50/50$ split of scores at the boundaries\. In short, this metric registers bimodal patterns that are specifically “all\-or\-nothing” scores in document similarity evaluation\.
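The index is simple enough to compute directly from a list of scores. The sketch below follows the formula as stated; the function name and the toy score lists are hypothetical.

```python
def bipolarization_index(scores, k):
    """B = 4 * P(X <= k) * P(X >= 100 - k) for scores on a 0-100 scale."""
    n = len(scores)
    low = sum(s <= k for s in scores) / n          # share at or below the lower margin
    high = sum(s >= 100 - k for s in scores) / n   # share at or above the upper margin
    return 4 * low * high

# Made-up score samples illustrating the three regimes.
all_or_nothing = [0, 100, 1, 99]      # half at each extreme
one_sided = [95, 98, 100, 97]         # only the high extreme is populated
lopsided = [0, 0, 0, 100]             # 75% low, 25% high

for name, xs in [("all_or_nothing", all_or_nothing),
                 ("one_sided", one_sided),
                 ("lopsided", lopsided)]:
    print(name, bipolarization_index(xs, k=5))
```

The product form is what makes the index zero for the one-sided sample even though every score there is extreme, and what caps it at 1 only for an even split.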

Figure [6](https://arxiv.org/html/2604.18835#S3.F6) presents KDE plots \(top\) of the original vs\. random scoring distributions aggregated over all needle types for each LLM, as well as the Bipolarization Index \(bottom\) for sensitivity values $k\in[0,25]$\. Lower values of $k$ represent stricter notions of polarization \(e\.g\. $k=2$ defines “extreme” scores as those at or above 98 and at or below 2\)\. Focusing first on original hay, we see all LLMs exhibit a near\-zero Bipolarization Index\. This reflects a consistent lack of extreme\-end scoring, though it does not mean scores are unimodal \(e\.g\. GPT\-4o exhibits secondary mass and “humps” in the intermediate scoring ranges, yet because these fluctuations do not reach the lower margin $k$, the Bipolarization Index remains low\)\.

In contrast, random hay induces a systematic shift towards polarized scoring. All LLMs exhibit a much higher Bipolarization Index, reaching at least $B = 0.4$ for $k = 25$. The KDE plots suggest commonalities in how this bifurcation occurs: a significant proportion of documents receive high scores, while the rest receive very low scores.

Analyzing sensitivity as we vary $k$ reveals a spectrum of behavioral patterns. At one extreme, Claude exhibits a sharp, step-function increase in bipolarization that plateaus quickly after $k = 10$, suggesting a rigid, binary judgment style under random hay. At the other extreme, GPT-4o and GPT-5 exhibit more graduated scoring, with a more incremental rise in bipolarization and no clear plateau, indicating that their scoring extends into the intermediate range rather than being confined to polar extremes. Gemini and o4-mini fall between these extremes, rising steeply but without the clean plateau of Claude. GPT-4o, in particular, exhibits an apparent trimodal distribution with a third peak at the 50-point mark, thereby alternating between very high, middle, and very low scores.

## 4 Discussion & Conclusion

Our results demonstrate that LLM\-as\-a\-judge scoring for pairwise document similarity is sensitive to a range of experimental parameters – needle position, document length, perturbation type, and surrounding context – and that these sensitivities manifest differently across models\. We now discuss these findings in the context of prior work, address the scope and limitations of our experimental design, and outline directions for future research\.

### Positionality bias as within\-document evaluation weighting

A growing body of work documents positional biases in LLM-as-a-judge systems, but the existing literature focuses almost exclusively on *candidate-order bias*: when an LLM compares two responses, the order in which they are presented influences the evaluation (Wang et al., [2024](https://arxiv.org/html/2604.18835#bib.bib22); Zheng et al., [2023](https://arxiv.org/html/2604.18835#bib.bib23); Shi et al., [2024](https://arxiv.org/html/2604.18835#bib.bib19)). Separately, Liu et al. ([2023a](https://arxiv.org/html/2604.18835#bib.bib15)) showed that LLM performance on information retrieval tasks degrades when relevant content appears in the middle of a long context, the so-called “lost in the middle” phenomenon. Our findings reveal a complementary and, to our knowledge, previously uncharacterized bias dimension: *within-document positional weighting*, in which the location of a semantic difference within the compared documents themselves influences scoring. Unlike the U-shaped retrieval curve of Liu et al. ([2023a](https://arxiv.org/html/2604.18835#bib.bib15)), our bias concerns *where* within a document a difference occurs and manifests in scoring magnitude rather than retrieval accuracy, revealing that most models penalize earlier differences more harshly. GPT-5 is a notable exception, exhibiting a reversed bias that more harshly penalizes later-occurring differences under original hay. Moreover, this bias persists across documents as short as 4–8 sentences, well below the long-context regime studied by Liu et al. ([2023a](https://arxiv.org/html/2604.18835#bib.bib15)), suggesting a distinct mechanism tied to evaluative weighting rather than attention decay over thousands of tokens.

Interestingly, the prevailing early-positionality pattern parallels findings from human cognition: research on semantic illusions has shown that readers are more likely to overlook meaning alterations embedded later in coherent discourse, with the position of the change modulating detection rates (Cook et al., [2018](https://arxiv.org/html/2604.18835#bib.bib5); Nieuwland and Van Berkum, [2005](https://arxiv.org/html/2604.18835#bib.bib18); Liu et al., [2023a](https://arxiv.org/html/2604.18835#bib.bib15)). Whether the mechanism in LLMs is analogous (i.e. an anchoring effect where early content disproportionately shapes the evaluative frame) or instead reflects transformer-specific attention dynamics is unclear, though GPT-5’s contrarian behavior suggests the bias is not an inevitable architectural consequence.

### The effect of context relevance

Perhaps our most thought-provoking finding concerns the effect of context relevance on scoring. Random hay (surrounding context topically unrelated to the semantic change) systematically *lowers* similarity scores and amplifies positionality bias. Two competing intuitions make opposing predictions here. Under a *contradiction view*, original hay permits cross-sentence semantic contradictions that compound the effect of the needle; randomizing the surrounding context removes these contradictions and should yield *higher* similarity scores. Under an *interpretive frame view*, original hay provides a topical context within which the perturbation can be contextualized and even dismissed. For example, a reader encountering one contradictory sentence among nine reinforcing ones might treat it as a minor anomaly rather than a fundamental change in meaning; removing this frame leaves the LLM with no basis on which to downweight the perturbation, predicting *lower* scores under random hay. Our results are consistent with the interpretive frame view: all five LLMs assign lower mean scores and exhibit higher score variance under random hay. This connection to contextual dismissal resonates with the semantic illusions literature (Cook et al., [2018](https://arxiv.org/html/2604.18835#bib.bib5); Nieuwland and Van Berkum, [2005](https://arxiv.org/html/2604.18835#bib.bib18)), in which human readers overlook meaning alterations precisely when surrounding discourse provides a coherent interpretive frame. Furthermore, the bipolarization analysis reveals that random hay induces sharply bifurcated, all-or-nothing scoring behavior across all models, with the sensitivity analysis distinguishing two behavioral profiles: step-function polarization (e.g. Claude) versus graduated polarization with intermediate scoring (e.g. GPT-5).
These results suggest that LLMs do not simply evaluate the needle in isolation; the coherence of the surrounding context modulates how the perturbation is weighted\. For practitioners, this implies that the domain and coherence of compared documents are not neutral factors and actively shape scoring behavior in ways that must be accounted for\.

### Scoring fingerprints and the needle type hierarchy

Another key finding of our analysis is that each LLM exhibits a qualitatively distinct scoring distribution, a “fingerprint”, that is stable across perturbation types. Changing the needle type shifts the distribution along the scoring scale but does not alter its shape. Yet despite these idiosyncratic shapes, all five LLMs share a universal hierarchy in mean scores: negation is scored most leniently, followed by named entity replacement, then conjunction swap. This hierarchy is consistent across all hay types and positions, suggesting it reflects something fundamental about how these models assess semantic change. One plausible interpretation is that negation, while logically inverting a proposition, preserves much of the surface-level lexical overlap and syntactic structure, whereas conjunction swaps can alter the logical relationship between clauses in ways that ripple through the sentence’s meaning. Named entity replacement falls between these extremes, changing referential content while preserving predicate structure. From a methodological standpoint, the stability of fingerprints within models and the consistency of the hierarchy across models is encouraging: it suggests that our framework captures reproducible structural features of LLM scoring behavior which are intrinsic to the model itself, not to the particular perturbation being evaluated. The practical implication is clear: when comparing scores across LLMs, model-specific baselines are essential, as raw scores from different models are not directly commensurable.

### On the idealization of semantic needles

A natural criticism of our approach is that our chosen perturbation types are idealized interventions that may not reflect the messier reality of natural semantic variation. We respond in two ways. First, each needle type represents failure modes encountered in real-world, high-stakes settings. Negation errors arise routinely in clinical documentation, where the presence or absence of "not" in a diagnosis or treatment plan is a well-documented patient safety concern, and in regulatory texts where amendments negate previously permitted actions. Conjunctions are among the most consequential ambiguities in legal drafting (Adams and Kaye, [2006](https://arxiv.org/html/2604.18835#bib.bib1)), as illustrated by the landmark case *O’Connor v. Oakhurst Dairy* (2017), in which a missing serial comma and resulting conjunctive ambiguity led to a multi-million-dollar settlement. Named entity replacement occurs in plagiarism and text reuse, template-based document generation, and adversarial text manipulation.

Second, the idealized nature of our needles is a deliberate design choice that confers several advantages\. Controlled, atomic perturbations allow us to isolate the effect of individual variables without the confounds introduced by naturalistic variation, where multiple semantic changes co\-occur and their effects are entangled\. The idealization enables automation and scalability: we test over 3000 distinct parameter settings across tens of thousands of document pairs, a scale that would be infeasible with hand\-crafted naturalistic perturbations\. Moreover, because each perturbation is well\-defined and reproducible, our framework can be applied to future LLMs consistently, enabling longitudinal comparisons across model versions\.

### Limitations

Our study has several limitations\. The corpus is drawn entirely from English Wikipedia, which may not generalize to domain\-specific texts \(e\.g\., legal, clinical, or literary documents\) where sentence structure and semantic density differ\. Our perturbation types, while motivated by real\-world scenarios, do not exhaust the space of possible semantic changes; modifications such as quantifier shifts, temporal alterations, or pragmatic implicature changes remain untested\. We use a single scoring prompt throughout; variations in prompt wording, scoring scale, or task framing may elicit different behaviors\. Our analysis treats each LLM as a fixed entity, but model behavior can vary with temperature, system prompt, and API version, and the specific model versions tested here will inevitably be superseded\. Finally, we do not provide a mechanistic explanation for the observed biases: whether they arise from attention patterns, tokenization effects, or training data artifacts remains an open question\.

### Future work

Several directions emerge naturally from our findings\. First, our framework’s modularity invites extension to new needle types and domains: testing on legal corpora with legally meaningful perturbations, or on clinical texts with dosage and treatment modifications, would assess whether the biases we observe are stable across genres\. Second, the pairwise similarity scores produced by our framework can be naturally extended to multi\-way comparisons in multiple ways\. For example, multi\-way rankings can be obtained by constructing win/draw/lose tournaments over collections of documents and analyzing the resulting preference graphs via methods such as Bradley\-Terry models\. Third, the scoring fingerprints we identify suggest a connection to model architecture and training: systematic comparisons across model families, sizes, and fine\-tuning strategies could illuminate which design choices give rise to which scoring behaviors\. Finally, because our framework is LLM\-agnostic and automated, it can be re\-run as new models are released, enabling longitudinal tracking of how scoring behaviors evolve across model versions and providing practitioners with a consistent basis for comparison\.
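
As a rough illustration of the second direction, the sketch below fits Bradley-Terry strengths to pairwise win counts using the standard minorization-maximization update. It is not part of our framework: the function name `bradley_terry`, the `wins` dictionary, and the rule by which similarity scores would be turned into wins are all hypothetical, and draws are simply ignored here.

```python
from typing import Dict, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], int], iters: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths; wins[(a, b)] counts how often a beat b."""
    items = sorted({x for pair in wins for x in pair})
    p = {x: 1.0 for x in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            # Total number of wins recorded for item i.
            w_i = sum(c for (a, _), c in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                # Number of comparisons played between i and j.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = w_i / denom if denom > 0 else p[i]
        # Normalize so that the strengths sum to one.
        total = sum(new_p.values())
        p = {x: v / total for x, v in new_p.items()}
    return p


# Hypothetical tournament over three documents A, B, and C.
wins = {("A", "B"): 3, ("B", "A"): 1, ("B", "C"): 3,
        ("C", "B"): 1, ("A", "C"): 4, ("C", "A"): 1}
strengths = bradley_terry(wins)
print(sorted(strengths, key=strengths.get, reverse=True))  # ['A', 'B', 'C']
```

Handling draws would require an extension of this model; the sketch leaves that choice open.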

In summary, our needle\-in\-a\-haystack framework provides a scalable, modular, and highly discriminative experimental design for probing LLM scoring behavior in pairwise document similarity\. The framework reveals that LLMs exhibit systematic positional biases, context\-dependent scoring shifts, and model\-specific fingerprints – phenomena that would be invisible in a simpler experimental setup that varies fewer parameters or aggregates over them\. These findings underscore the importance of fine\-grained sensitivity testing as a complement to standard benchmarks\. We hope the intuitive needle\-and\-hay framing, together with the quantitative analyses developed here, will serve as a practical toolkit for researchers and practitioners seeking to understand, compare, and audit LLM\-as\-a\-judge document similarity assessment systems for both the five models tested here and for those yet to come\.

Acknowledgements. We thank Emily Saldanha, Ian Stewart, Joshua Chong, Ana Usenko, and Kate Gibb for helpful conversations and paper feedback. This work was performed by Pacific Northwest National Laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830. This work was also supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract D2022-2204140001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. Information Release: PNNL-SA-221843.

## References

- Adams and Kaye (2006) Kenneth A. Adams and Alan S. Kaye. Revisiting the ambiguity of “and” and “or” in legal drafting. *St. John’s Law Review*, 80(4):1167–1198, 2006.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity – multilingual and cross-lingual focused evaluation. In *Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)*, pages 1–14. Association for Computational Linguistics, 2017. doi:10.18653/v1/S17-2001.
- Chen et al. (2024) Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? A study on judgement biases, 2024. URL https://arxiv.org/abs/2402.10669.
- Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 15607–15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.acl-long.870. URL https://aclanthology.org/2023.acl-long.870/.
- Cook et al. (2018) Anne E Cook, Erinn K Walsh, Margaret A A Bills, John C Kircher, and Edward J O’Brien. Validation of semantic illusions independent of anomaly detection: evidence from eye movements. *Quarterly Journal of Experimental Psychology*, 71(1):113–121, 2018. doi:10.1080/17470218.2016.1264432. URL https://doi.org/10.1080/17470218.2016.1264432.
- Dubois et al. (2023) Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. In *Advances in Neural Information Processing Systems*, volume 36, 2023.
- Gu et al. (2025) Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge, 2025. URL https://arxiv.org/abs/2411.15594.
- Harispe et al. (2015) Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. *Semantic Similarity from Natural Language and Ontology Analysis*. Springer International Publishing, 2015. ISBN 9783031021565. doi:10.1007/978-3-031-02156-5. URL http://dx.doi.org/10.1007/978-3-031-02156-5.
- Honnibal et al. (2020) Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020. doi:10.5281/zenodo.1212303.
- Hsieh et al. (2024) Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?, 2024.
- Kamradt (2023) Greg Kamradt. Needle in a haystack – pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023. Accessed: 2025-10-01.
- Krolik et al. (2024) Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, and Bahador Saket. Towards leveraging large language models for automated medical Q&A evaluation, 2024. URL https://arxiv.org/abs/2409.01941.
- Li et al. (2024) Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction, 2024. URL https://arxiv.org/abs/2406.00755.
- Lin (1998) Dekang Lin. An information-theoretic definition of similarity. In *Proceedings of the Fifteenth International Conference on Machine Learning*, ICML ’98, pages 296–304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1558605568.
- Liu et al. (2023a) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023a. URL https://arxiv.org/abs/2307.03172.
- Liu et al. (2023b) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023b.
- LtCmdrData (2020) LtCmdrData. Plain text wikipedia 202011, 2020. URL https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011. Accessed: 2025-09-11.
- Nieuwland and Van Berkum (2005) Mante S. Nieuwland and Jos J.A. Van Berkum. Testing the limits of the semantic illusion phenomenon: ERPs reveal temporary semantic change deafness in discourse comprehension. *Cognitive Brain Research*, 24(3):691–701, 2005. ISSN 0926-6410. doi:https://doi.org/10.1016/j.cogbrainres.2005.04.003. URL https://www.sciencedirect.com/science/article/pii/S0926641005001102.
- Shi et al. (2024) Lanxin Shi, Changmao Ma, Weijia Liang, Xiaodan Diao, Wenxuan Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge, 2024.
- Song et al. (2024) Yishen Song, Qianta Zhu, Huaibo Wang, and Qinhua Zheng. Automated essay scoring and revising based on open-source large language models. *IEEE Transactions on Learning Technologies*, 17:1880–1890, 2024.
- Wang et al. (2023) Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, and Dacheng Tao. Are large language models really robust to word-level perturbations?, 2023. URL https://arxiv.org/abs/2309.11166.
- Wang et al. (2024) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 9440–9450, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.511. URL https://aclanthology.org/2024.acl-long.511/.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In *Advances in Neural Information Processing Systems*, volume 36, 2023.

Original Document $d$: … Cleopatra was Queen of the Ptolemaic Kingdom of Egypt. A member of the Ptolemaic dynasty, she was a descendant of its founder Ptolemy I Soter. After her death, Egypt became a province of the Roman Empire. …

Document with Named Entity Replacement $\hat{d}$: … Cleopatra was Queen of the Ptolemaic Kingdom of Egypt. A member of the Ptolemaic dynasty, she was a descendant of its founder Cleopatra. After her death, Egypt became a province of the Roman Empire. …

Figure 7: Named entity replacement needle: spaCy NER identifies entities in the selected sentence; one is chosen via random sampling and its span is replaced with the text of a randomly chosen entity of the same label drawn from elsewhere in the document.

## Appendix A Experimental Design Details

### Data selection, cleaning, & length

The Plain Text Wikipedia documents used in our experiments are publicly available through Kaggle LtCmdrData \[[2020](https://arxiv.org/html/2604.18835#bib.bib17)\]. Though these documents are already processed as plain text, they are further cleaned to remove all section headers, appendical sections (e.g. “See Also”, “References”), and other non-natural language text such as tables and code. We parse sentences using spaCy and only consider documents with $L \geq 40$ sentences, discarding any documents with fewer sentences or with average sentence length that is unusually short or long, which we define as fewer than $150L$ or more than $2000L$ characters, respectively.

We vary our position parameters $0 \leq i, j \leq 9$, which yields cleaned documents ranging from 1 sentence to 19 sentences and spans 100 distinct needle positions. The maximum $i, j$ value of $9$ was selected to strike a balance between granularity and length, keeping documents within typical LLM context windows while providing enough positional variation to detect biases.
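
The position grid can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `build_document`, the 0-based needle index `m`, and the placeholder 40-sentence document are hypothetical and not taken from our pipeline.

```python
from typing import List


def build_document(sentences: List[str], m: int, i: int, j: int) -> List[str]:
    """Return the needle at index m with i sentences of hay before and j after."""
    if m - i < 0 or m + j >= len(sentences):
        raise ValueError("not enough hay around the needle")
    return sentences[m - i : m + j + 1]


# All (i, j) combinations with 0 <= i, j <= 9.
positions = [(i, j) for i in range(10) for j in range(10)]

# Hypothetical 40-sentence document with the needle at the middle index.
source = [f"Sentence {n}." for n in range(40)]
lengths = {len(build_document(source, 20, i, j)) for i, j in positions}

print(len(positions), min(lengths), max(lengths))  # 100 1 19
```

Each document has $i + j + 1$ sentences, which is where the range of 1 to 19 sentences comes from.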

### Needle type implementations

We implement the negation (neg), conjunction swap (con) and named-entity replacement (ner) needle types in Python using spaCy v3.8.11 under model en_core_web_sm for tokenization, POS tagging, dependency parsing, and NER Honnibal et al. \[[2020](https://arxiv.org/html/2604.18835#bib.bib9)\].

- neg: for each sentence marked for negation perturbation, we first detect existing negation via the “neg” dependency or tokens with lemmas not/n’t/never/no; if present, we skip the sentence and move to the next. Otherwise, we insert “not” after a root verb. For example, in the sentence “Caesar was a Roman general.”, spaCy dependency parsing assigns the labels Caesar → nsubj, was → ROOT, a → det, Roman → amod, general → attr, identifies the root verb, and inserts “not” immediately after it, yielding: “Caesar was not a Roman general.” We preserve casing, spacing, and punctuation and avoid contractions to prevent double negatives.
- con: we swap occurrences of “and” with occurrences of “or” in a sentence, and vice versa, using spaCy’s POS tagging, searching the text for “and” and “or” and swapping accordingly, as illustrated before in Figure [1](https://arxiv.org/html/2604.18835#S1.F1). We preserve original casing and do not alter ampersands.
- ner: for each document, we inspect the middle sentence and, if it contains an eligible entity (label $\in$ {PERSON, GPE, LOC, LANGUAGE, DATE}), we select one uniformly at random and replace its text span with the text of a randomly chosen entity of the same label from elsewhere in the document. Figure [7](https://arxiv.org/html/2604.18835#A0.F7) presents an example. (Note that in Figure [7](https://arxiv.org/html/2604.18835#A0.F7) spaCy’s model incorrectly labels ‘the Ptolemaic dynasty’ as DATE. While such errors are inevitable in natural language processing and contribute noise to the NER needle, this is partially mitigated by our large sample size and aggregate analysis over many documents.) Replacements operate on the exact span returned by spaCy, treating multi-token entities as a unit and preserving surrounding whitespace and punctuation.
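
For intuition, the conjunction swap can be approximated without spaCy. The sketch below is a simplified stand-in, not our implementation: it uses a regular expression in place of POS tagging, and the names `swap_conjunctions` and `_match_case` are hypothetical.

```python
import re

# Whole-word matches only, so words like "sand" or "order" are left alone.
_CONJ = re.compile(r"\b(and|or)\b", flags=re.IGNORECASE)


def _match_case(replacement: str, original: str) -> str:
    """Give the replacement the same casing pattern as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def swap_conjunctions(sentence: str) -> str:
    """Swap every 'and' with 'or' and vice versa in a single pass."""
    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        target = "or" if word.lower() == "and" else "and"
        return _match_case(target, word)

    return _CONJ.sub(repl, sentence)


print(swap_conjunctions("Tea and coffee or juice."))  # Tea or coffee and juice.
print(swap_conjunctions("And then R&D expanded."))    # Or then R&D expanded.
```

Doing the swap in one substitution pass avoids the pitfall of two sequential replacements, where the second would undo the first. Unlike the POS-based approach, this regular expression cannot tell whether a matched token is actually functioning as a conjunction.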

### Handling failed semantic perturbations

In some cases, this middle sentence cannot be semantically altered as desired. For example, the middle sentence might have no eligible entity, or the document lacks another entity of the same label to substitute, rendering ner impossible. Similarly, a sentence without connectives “and/or” cannot be altered via con.

In such cases, we check $5$ sentences above and below for sentences that can be satisfactorily altered. Recall that our minimum document length of 40 sentences, together with our chosen maximum position $(i,j) = (9,9)$, yields documents with under 20 sentences. This guarantees that the scanning procedure never places the needle in a position where the end of the document would be reached when appending varying amounts of hay. If still no such sentence is found, the document is discarded and the next document from $\mathcal{C}$ is used.
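
The fallback scan might look like the following sketch. The scan order is an assumption: the text above does not specify it, so this version tries the nearest sentences first and checks above before below. The names `find_alterable_index` and `has_connective` are hypothetical.

```python
from typing import Callable, List, Optional


def find_alterable_index(
    sentences: List[str],
    middle: int,
    can_alter: Callable[[str], bool],
    radius: int = 5,
) -> Optional[int]:
    """Return the index of the nearest alterable sentence within `radius`, else None."""
    if can_alter(sentences[middle]):
        return middle
    for offset in range(1, radius + 1):
        # Assumed order: the sentence above is tried before the one below.
        for idx in (middle - offset, middle + offset):
            if 0 <= idx < len(sentences) and can_alter(sentences[idx]):
                return idx
    return None  # Caller discards the document and moves to the next one.


def has_connective(sentence: str) -> bool:
    """Toy eligibility check for the conjunction swap needle."""
    words = sentence.lower().replace(".", "").split()
    return "and" in words or "or" in words


doc = ["Plain sentence."] * 40
doc[22] = "Cats and dogs."
print(find_alterable_index(doc, 20, has_connective))  # 22
```

With a 0-based middle index of 20 in a 40-sentence document, the needle can shift to indices 15 through 25; adding up to 9 sentences of hay on either side then needs indices 6 through 34, which stay inside the document.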

### Random hay implementation

Documents with random hay (rand) retain their needle sentence, but surround it with sentences from a randomly chosen document. More precisely, for a document $d(N, \texttt{rand}, (i,j))$ whose $m$’th sentence was selected as the needle, a randomly chosen document from $\mathcal{C}$ is selected, and we extract a consecutive window of $i$ sentences before and $j$ sentences after its $m$’th sentence. This random context is then placed around the needle sentence, as shown in Figure [1](https://arxiv.org/html/2604.18835#S1.F1). This random selection is made consistently across LLMs, guaranteeing $d(N, \texttt{rand}, (i,j))$ is the same across all choices of LLM $L$.
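
A minimal sketch of this construction follows. Several details are assumptions made for illustration: the function name `random_hay_document`, the 0-based index `m`, the use of a seeded `random.Random` as one way to keep the selection identical across LLMs, and the choice to raise an error when the donor document is too short.

```python
import random
from typing import List


def random_hay_document(
    needle: str,
    m: int,
    i: int,
    j: int,
    corpus: List[List[str]],
    rng: random.Random,
) -> List[str]:
    """Surround the needle with i and j sentences taken around index m of a random donor."""
    donor = rng.choice(corpus)
    if m - i < 0 or m + j >= len(donor):
        raise ValueError("donor document is too short for this window")
    before = donor[m - i : m]          # i sentences preceding the donor's m'th sentence
    after = donor[m + 1 : m + 1 + j]   # j sentences following it
    return before + [needle] + after


# Hypothetical two-document corpus of 40 sentences each.
corpus = [[f"Doc{d} sentence {n}." for n in range(40)] for d in range(2)]
doc = random_hay_document("NEEDLE.", 20, 3, 2, corpus, random.Random(0))
print(len(doc), doc[3])  # 6 NEEDLE.
```

Re-creating the generator with the same seed reproduces the same donor choice, so every LLM would be scored on an identical document.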

Semantic Similarity (Scale: 0–100)

Score the degree to which two text documents are semantically similar, based on underlying concepts and ideas rather than surface-level lexical features, word choice, or syntax.

- 0–25 (Poor): The documents have significantly different semantic meaning, conveying fundamentally different subject matter, topics, ideas, or context.
- 26–50 (Fair): The documents share some overlap in semantic meaning, but differences outweigh similarities. Significant differences in subject matter, topics, ideas, or context.
- 51–75 (Good): The documents share substantial semantic similarity. Similarities outweigh differences, but nontrivial differences in subject matter, topics, ideas, or context remain.
- 76–100 (Excellent): The documents have nearly identical or identical semantic meaning. Both convey the same core idea, information, subject matter, topics, and context.

Figure 8: Criteria: Scoring rubric for evaluating semantic similarity between documents.

### Prompt, scoring rubric, and trial independence

Figure [8](https://arxiv.org/html/2604.18835#A1.F8) presents the prompt and scoring rubric we utilize in our experiment. Semantic similarity has been defined in many ways (mathematically, algorithmically, and linguistically), and the literature offers robust frameworks for doing so Harispe et al. \[[2015](https://arxiv.org/html/2604.18835#bib.bib8)\], Lin \[[1998](https://arxiv.org/html/2604.18835#bib.bib14)\]. In our prompt, we adopt a deliberately broad, intuitive notion that emphasizes shared underlying meaning across words, phrases, and sentences, without committing to a specific formal definition. This choice allows us to assess LLM-as-a-judge’s ability to recognize semantic similarity based on widely held intuitions, rather than its adherence to any particular formalism.

Lastly, it is important to note we use stateless API calls when accessing the LLMs to ensure each prompt is processed in isolation\. This ensures that, for a given LLM, each scoring trial is independent of the next, with no shared context window\.

### Choosing number of documents

For a given position $(i,j)$, we continue scoring documents until two criteria are met: (1) at least $N$ documents have been processed; and (2) the maximum difference in the running mean over the last $w$ documents is at most a given threshold, $t$. More formally, $\text{nDoc}(i,j)$, the number of documents we process for position $(i,j)$, is defined by

$$\min\Bigl\{\, n \geq N : \max_{a, b \in \{n-w, \dots, n\}} \bigl|\overline{s}_{a} - \overline{s}_{b}\bigr| \leq t \,\Bigr\}.$$

where $\overline{s}_{a}$ denotes the mean score over documents $1, \dots, a$. We choose $N = 100$, $w = 10$, and $t = 1$. After all positions have been processed (for a fixed LLM, needle, and hay type), we take $D = \max_{i,j} \text{nDoc}(i,j)$, loop back over any positions for which fewer documents have been processed, and continue until $\text{nDoc}(i,j) = D$ for all $i, j$. We note that the 100 document minimum ends up being close to sufficient for satisfying the second criterion for many positions $(i,j)$: across all positions, needles, and LLMs, we see nDoc ranging from 100 to 110 for original hay, while for random hay (where more score variability was observed), this upper end goes up to 133 documents.
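
The stopping rule can be sketched as follows, applied after the fact to a list of scores. This is an illustration only: the function name `n_docs_needed` is hypothetical, and in practice scoring would proceed incrementally rather than over a precomputed list.

```python
from itertools import accumulate
from typing import Optional, Sequence


def n_docs_needed(
    scores: Sequence[float], N: int = 100, w: int = 10, t: float = 1.0
) -> Optional[int]:
    """Smallest n >= N whose running means over a = n-w, ..., n differ by at most t."""
    # running[a - 1] is the mean score over documents 1..a.
    running = [total / (idx + 1) for idx, total in enumerate(accumulate(scores))]
    for n in range(max(N, w + 1), len(scores) + 1):
        window = running[n - w - 1 : n]  # running means for a = n-w, ..., n
        # The largest pairwise gap in the window is max minus min.
        if max(window) - min(window) <= t:
            return n
    return None  # Criterion not yet met; more documents would be scored.


print(n_docs_needed([50.0] * 150))  # 100
print(n_docs_needed([50.0] * 99))   # None
```

The lower bound `max(N, w + 1)` only matters for small illustrative settings; with $N = 100$ and $w = 10$ the window always starts at a valid document index.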

### LLM Information

All Large Language Models (LLMs) were accessed via APIs and used with default hyperparameter settings, including default values for temperature, top-$p$, max tokens, and other generation parameters as provided by their respective APIs. The Azure OpenAI API was used to access GPT-5 and GPT-4o; see Table [1](https://arxiv.org/html/2604.18835#A1.T1) for version numbers.

A private API was used to access Gemini 2\.5 Flash, Claude Sonnet 4, and o4\-mini, with host information and version numbers given in Table[1](https://arxiv.org/html/2604.18835#A1.T1)\.

Table 1: Details of the Language Models Used
