On the Persistent Effects of Lexicality in Large Language Models

arXiv cs.CL Papers

Summary

This paper investigates how lexical overlap, rather than semantic content, influences LLM representations across layers and architectures, and demonstrates that this lexical effect persists even in models trained for semantic similarity, leading to degraded performance on downstream tasks.

arXiv:2606.02750v1 Announce Type: new

Cached at: 06/03/26, 09:35 AM

# On the Persistent Effects of Lexicality in Large Language Models
Source: [https://arxiv.org/html/2606.02750](https://arxiv.org/html/2606.02750)
Hammad Rizwan (Dalhousie University, Halifax, NS), Muhammad Umair Haider (University of Kentucky, Lexington, KY), Nishant Subramani (Carnegie Mellon University, Pittsburgh, PA), Mona T. Diab (Carnegie Mellon University, Pittsburgh, PA), A.B. Siddique (University of Kentucky, Lexington, KY), Hassan Sajjad (Dalhousie University, Halifax, NS)

###### Abstract

Representations extracted from large language models \(LLMs\) play an important role in many downstream applications\. However, the structure of these representations is often influenced by lexical overlap rather than semantic content\. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited\. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content\. We consider several adversarial semantic stress tests and further connect our findings to the information theory perspective\. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including the models trained for semantic similarity\. Moreover, we observe a mid\-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning\. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as a case study\.

## 1 Introduction

Natural Language Processing \(NLP\) has seen rapid advances with the emergence of LLMs, which now achieve strong performance across a wide variety of benchmarks and downstream tasks\. Beyond generating text, these models are increasingly used as general\-purpose embedding engines, providing vector representations that underpin retrieval, textual similarity, clustering, and evaluation pipelines\.

The widespread reliance on LLM embeddings implicitly assumes that, as we progress through model depth, representations gradually move away from local lexical structure and converge toward increasingly abstract sequence-level semantics (Jawahar et al., [2019](https://arxiv.org/html/2606.02750#bib.bib23); Hewitt and Liang, [2019](https://arxiv.org/html/2606.02750#bib.bib25); Haider et al., [2024](https://arxiv.org/html/2606.02750#bib.bib24)). In practice, this assumption is fragile: it does not imply that lexical cues are eliminated, and representations can remain highly similar whenever inputs share tokens, even when their meanings diverge (Dumpala et al., [2024](https://arxiv.org/html/2606.02750#bib.bib9); Rizwan et al., [2025](https://arxiv.org/html/2606.02750#bib.bib10)). When lexical overlap influences embedding geometry, similarity estimates degrade, making it harder to retrieve or classify documents based on meaning rather than wording and weakening systems that depend on meaning-driven representations. We therefore ask how representational geometry evolves along the axes of lexical and semantic information; how training regimes, including pretraining objectives, instruction tuning, and contrastive learning, shape these geometric structures; and how lexical influence propagates to downstream uses of LLMs.

We treat every layer of a model as a candidate embedding space and organize our study into four components. First, to quantify lexical influence, we use a triplet semantic-equivalence stress test consisting of an anchor, a meaning-preserving paraphrase, and a meaning-changing distractor that shares substantial lexical overlap with the anchor. This construction isolates semantic similarity failures driven by surface-form similarity and localizes the depths at which representations most often conflate lexical overlap with semantic equivalence. Second, to quantify these failures in terms of lexical and semantic signals, we pair the stress tests with two layer-wise measurements: lexical decodability and semantic fidelity. Lexical decodability is quantified by training linear token-identity probes on hidden states, measuring how much surface form remains linearly recoverable at each layer. We hypothesize that a strong token identity corresponds to a salient lexical signal that downstream methods can readily leverage, whether they rely on similarity geometry or learn additional predictors. We quantify semantic fidelity based on the representation's performance on diverse embedding tasks. Together, these diagnostics distinguish among competing explanations for representation-similarity failures: lexical dominance, weak semantic organization, and transitional regimes in which both lexical and semantic signals are simultaneously low. Third, as layer-wise lexical and semantic geometry does not by itself characterize *information processing dynamics*, we relate our investigation to these dynamics using *input (prompt) entropy* across layers. We test whether information compression and decompression shifts correlate with the layer-wise changes in lexical and semantic structure. Fourth, we show the consequences of lexical influence in practical LLM applications, focusing on abstractive summarization evaluation and factual model editing.

Our investigation leads to several central insights:

Lexical influence persists across depth (§[3](https://arxiv.org/html/2606.02750#S3)). Adversarial stress tests across multiple models reveal that token-overlap-driven similarity errors persist throughout model depth. Although attenuated in deeper layers, the effect is neither fully eliminated nor confined to shallow representations.

Adaptation regimes improve embeddings, but do not remove lexical influence (§[3](https://arxiv.org/html/2606.02750#S3)). Instruction tuning and metric learning improve overall embedding quality, yet our stress tests show that metric learning still fails under lexical overlap adversaries. Lexically similar but semantically mismatched pairs continue to receive inflated similarity scores, indicating that the dominant training paradigm for embedding models reduces but does not eliminate lexical bias.

Lexical decodability is non-monotonic across depth (§[4.1](https://arxiv.org/html/2606.02750#S4.SS1)). The ability to decode exact lexical identity from token representations fluctuates across layers rather than declining monotonically. This is in line with findings by Cheng et al. ([2025](https://arxiv.org/html/2606.02750#bib.bib71)), indicating that depth does not induce a clean lexical-to-semantic transition (Tenney et al., [2019](https://arxiv.org/html/2606.02750#bib.bib51); Jawahar et al., [2019](https://arxiv.org/html/2606.02750#bib.bib23); Li and Subramani, [2025](https://arxiv.org/html/2606.02750#bib.bib65)).

A mid\-depth valley where lexical and semantic signals both weaken \(§[4](https://arxiv.org/html/2606.02750#S4), §[5](https://arxiv.org/html/2606.02750#S5)\)\. Across model families, intermediate layers form a valley\-like region where token identity is least recoverable, and semantic performance stagnates or degrades across both raw embedding geometry and linearly probed semantic evaluations\. This valley appears to align with a compression–re\-expansion point in prompt information across layers; a similar valley shows up under full attention even when representation entropy stays roughly constant with depth, indicating that the effect is not simply an entropy transition\.

Practical implications (§[6](https://arxiv.org/html/2606.02750#S6)). In embedding-based pipelines, lexical overlap can miscalibrate similarity signals, degrading performance. In summarization evaluation, common reference-based metrics systematically reward reference wording, favoring surface overlap over semantic preservation. In weight-space model editing, updates generalize along surface-form similarity, producing correlated shifts for token-overlapping distractors and compromising edit locality.

## 2 Experimental Setup and Preliminaries

We use a common set of datasets and model families to ensure consistency across experiments\.

Datasets. To probe lexical influence, we use two adversarial benchmarks, CounterFact (Meng et al., [2022](https://arxiv.org/html/2606.02750#bib.bib12)) and SugarCrepe++ (SCPP) (Dumpala et al., [2024](https://arxiv.org/html/2606.02750#bib.bib9)). Relative to CounterFact, which primarily perturbs entities, SCPP offers a more targeted benchmark for lexical-influence tests, as it introduces systematic shifts in attributes and relations. We additionally use CounterFact for model editing. Dataset details and reference samples are provided in Appendix [D.1](https://arxiv.org/html/2606.02750#A4.SS1). To measure lexical decodability, we train token-identity probes on WikiText (Merity et al., [2017](https://arxiv.org/html/2606.02750#bib.bib11)), and to evaluate semantic fidelity, we use the MTEB benchmark (Muennighoff et al., [2023](https://arxiv.org/html/2606.02750#bib.bib13)).

Models. We consider three training paradigms: *pretrained*, *instruction-tuned*, and *metric learning trained* embedding models, spanning multiple model families. These are Llama 3.2 (Grattafiori et al., [2024](https://arxiv.org/html/2606.02750#bib.bib47)) and Gemma 3 (Team et al., [2025](https://arxiv.org/html/2606.02750#bib.bib46)) in both *pretrained* and *instruction-tuned* variants, along with *embedding* models including Qwen-3 (Zhang et al., [2025](https://arxiv.org/html/2606.02750#bib.bib45)) (Qwen3-Embedding-8B) and KaLM (Zhao et al., [2025](https://arxiv.org/html/2606.02750#bib.bib44)) (KaLM-Embedding-Gemma3-12B-2511). Note: KaLM modifies the Gemma3 architecture to use full attention.

## 3 Measuring Lexical Influence

In the context of learned representations, we define *lexical influence* as a semantic failure case in which an anchor prompt is embedded closer to a lexically overlapping distractor than to its meaning-preserving paraphrase, violating the intended triplet ordering. We measure this effect on a dataset of triplets $\mathcal{D}=\{(a,p,d)\}$, where $a$ is an anchor, $p$ a meaning-preserving paraphrase, and $d$ a lexical distractor. Dataset usage and details are provided in Appendix [D.2](https://arxiv.org/html/2606.02750#A4.SS2).

For a given model $M$ and input $x$ consisting of tokens $t\in\{1,\dots,T_x\}$, let $H_\ell^M(x)\in\mathbb{R}^{T_x\times d}$ denote the matrix of token-level hidden states at layer $\ell\in\mathcal{L}$, where the $t$-th row $H_\ell^M(x)_t\in\mathbb{R}^d$ corresponds to token position $t$. Let $h_\ell^M(x)\in\mathbb{R}^d$ denote the corresponding sentence-level embedding, obtained via one of two standard choices, mean pooling or last token:

$$h_\ell^{M,\text{mean}}(x)=\frac{1}{T_x}\sum_{t=1}^{T_x}H_\ell^M(x)_t,\quad h_\ell^{M,\text{last}}(x)=H_\ell^M(x)_{T_x}$$

We then $\ell_2$-normalize, $\tilde{h}_\ell^M(x)=h_\ell^M(x)/\|h_\ell^M(x)\|_2$, and measure Euclidean distance,

$$d_\ell^M(x,y)=\bigl\|\tilde{h}_\ell^M(x)-\tilde{h}_\ell^M(y)\bigr\|_2.$$

For $\ell_2$-normalized vectors, this is equivalent (up to a monotone transform) to cosine similarity, since $\bigl\|\tilde{u}-\tilde{v}\bigr\|_2^2=2-2\,\tilde{u}^\top\tilde{v}$. To quantify lexical influence, we measure a triplet success rate. When representations are driven by lexical overlap, the model is more likely to confuse a paraphrase with a lexically overlapping distractor, yielding a lower success rate. For each triplet $(a,p,d)$, success occurs when the anchor and paraphrase are closer than either pair involving the distractor. We report the layer-wise success rate

$$\mathrm{SR}_\ell^M=\frac{1}{|\mathcal{D}|}\sum_{(a,p,d)\in\mathcal{D}}\mathbb{I}\!\left[d_\ell^M(a,p)<\min\!\left\{d_\ell^M(a,d),d_\ell^M(p,d)\right\}\right].$$

Figure [1](https://arxiv.org/html/2606.02750#S3.F1) compares averaged token (mean-pooled) embeddings ([1(a)](https://arxiv.org/html/2606.02750#S3.F1.sf1), [1(c)](https://arxiv.org/html/2606.02750#S3.F1.sf3)) and last-token embeddings ([1(b)](https://arxiv.org/html/2606.02750#S3.F1.sf2), [1(d)](https://arxiv.org/html/2606.02750#S3.F1.sf4)) on CounterFact and SCPP (averaged results are reported for SCPP; task-wise results are provided in Appendix [E](https://arxiv.org/html/2606.02750#A5)). Across pretrained and instruction-tuned models, averaged token representations generally outperform last-token embeddings. The gap is clearer on CounterFact, where last-token embeddings consistently underperform, suggesting stronger lexical influence and that single-token summaries are under-contextualized for sentence-level meaning. On SCPP, the last token is more often comparable, and the pooling advantage is typically smaller. This pattern is consistent with prior work suggesting pooling encourages more semantically abstract representations than relying on a single position (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.02750#bib.bib67); Xing et al., [2024](https://arxiv.org/html/2606.02750#bib.bib66)). We also observe a consistent depth effect. Average token embeddings often exhibit a mid-layer valley, where performance briefly declines before recovering and stabilizing in the upper layers. By contrast, last-token representations typically stagnate over the same depth range, or improve only gradually. Only in the later layers do models more reliably separate true paraphrases from lexically similar distractors.
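
The pooling choices, the normalized Euclidean distance, and the triplet success rate defined above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the two-dimensional `TOY_TRIPLETS` are hypothetical, and real hidden states would come from a model's layer outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(hidden):
    """Average the T token vectors of one layer into a single sentence vector."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]

def last_token(hidden):
    """Use the final token's hidden state as the sentence vector."""
    return list(hidden[-1])

def distance(u, v):
    """Euclidean distance between the l2-normalized versions of u and v."""
    u, v = l2_normalize(u), l2_normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def success_rate(triplets, pool=mean_pool):
    """Fraction of (anchor, paraphrase, distractor) triplets in which the
    anchor-paraphrase distance is strictly smaller than both distances
    that involve the distractor."""
    hits = 0
    for anchor, paraphrase, distractor in triplets:
        ha, hp, hd = pool(anchor), pool(paraphrase), pool(distractor)
        if distance(ha, hp) < min(distance(ha, hd), distance(hp, hd)):
            hits += 1
    return hits / len(triplets)

# Hypothetical 2-d hidden states (two tokens per input), for illustration only.
# In the second triplet the distractor nearly copies the anchor's tokens.
TOY_TRIPLETS = [
    ([[1.0, 0.0], [1.0, 0.2]], [[1.0, 0.1], [0.9, 0.1]], [[0.0, 1.0], [0.2, 1.0]]),
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.1]]),
]

print(success_rate(TOY_TRIPLETS, pool=mean_pool))
print(success_rate(TOY_TRIPLETS, pool=last_token))
```

Running the same function once per layer, with that layer's hidden states, would give the layer-wise curve; the strict inequality follows the indicator in the success-rate definition.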

![Refer to caption](https://arxiv.org/html/2606.02750v1/x1.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x2.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x3.png)\(c\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x4.png)\(d\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x5.png)

Figure 1: Success rates for lexical influence tests on CounterFact and SCPP across models using average-token and last-token embeddings. Higher success indicates lower lexical influence. Panels show CounterFact with (a) average-token and (b) last-token embeddings, and SCPP with (c) average-token and (d) last-token embeddings.

Results for pretrained and instruction-tuned models suggest that lexical influence is strongly shaped during the core language-model pretraining phase, where the model is optimized for next-token prediction and therefore to match broad corpus statistics. In this regime, next-token prediction can make it statistically efficient for the model to rely on surface-level lexical cues (e.g., entity names, salient attributes) that correlate with the target (Du et al., [2024](https://arxiv.org/html/2606.02750#bib.bib14); Bandel et al., [2022](https://arxiv.org/html/2606.02750#bib.bib15)). This helps explain why lexical-overlap failures appear in both pretrained and instruction-tuned models. Post-training can improve instruction following and response quality (Ouyang et al., [2022](https://arxiv.org/html/2606.02750#bib.bib16); Chung et al., [2024](https://arxiv.org/html/2606.02750#bib.bib17)), but it operates on top of a pretrained representation space and may not fully remove lexical shortcuts inherited from pretraining (Serrano et al., [2023](https://arxiv.org/html/2606.02750#bib.bib18)).

The layer-wise patterns reinforce this picture. Up to roughly the mid-depth of the model, the success rates of pretrained and instruction-tuned models remain closely aligned, indicating that instruction tuning does not substantially reshape the earlier layers of the representation hierarchy. Beyond this point, the success rate of instruction-tuned models increases more rapidly than that of the purely pretrained model, suggesting that later layers are selectively adapted to support more instruction-following and semantically appropriate behavior (Zhao et al., [2024](https://arxiv.org/html/2606.02750#bib.bib74)). However, lexical failures on SCPP remain high even in the upper layers, suggesting that lexical influences are deeply embedded in the representational structure learned during pretraining. This is consistent with evidence that fine-tuning primarily reweighs features already extractable from the pretrained model (Lovering et al., [2021](https://arxiv.org/html/2606.02750#bib.bib21)), and that shortcut reliance can persist in LLMs despite post-training interventions (Yuan et al., [2024](https://arxiv.org/html/2606.02750#bib.bib62)). These findings are robust to prompt variations, as reflected by high distance correlations between layer-wise curves for both average and last-token embeddings (0.993 and 0.78, respectively; Appendix [F](https://arxiv.org/html/2606.02750#A6)).

Embedding models perform near-perfectly on the CounterFact dataset; paraphrases and distractors are almost always correctly separated in the learned embedding space. This is unsurprising given both the structure of the dataset and the fact that these models are trained explicitly for semantic textual similarity. CounterFact primarily perturbs entity tokens while keeping the rest of the prompt fixed, so that anchor and lexical distractor pairs differ in a small, highly local way.

On the more challenging SCPP benchmark, we observe a different pattern. SCPP introduces systematic edits to objects, attributes, and relations while keeping the rest of the description closely matched, making purely lexical cues much less discriminative. SCPP subtask results are provided in Appendix [E](https://arxiv.org/html/2606.02750#A5). They show that models perform best in cases where disambiguation can be achieved by tracking changes to the main object of the description, but struggle substantially when only attributes (e.g., color, size) or relations (e.g., subject–object roles, spatial relations) are modified. This highlights an important limitation of current embedding models, which can perform well on easier entity-level contrasts yet still fail on fine-grained semantic changes under high lexical overlap.

Taken together, these results suggest that metric learning approaches tend to exploit the easiest discriminative signal available, often overt differences in entities or salient attributes. This aligns with evidence that contrastive/metric objectives can admit "shortcut" solutions and may not reliably force the model to encode the intended features (Robinson et al., [2021](https://arxiv.org/html/2606.02750#bib.bib22)).

## 4 Probing Lexical and Semantic Structure

In this section, we analyze whether the failure patterns of lexical and semantic signals correlate with the accessibility of this information in the representations. We hypothesize that information that is easy to access should be decodable by a linear classifier. To test this, we train linear probing models to measure the lexical and semantic decodability. For lexical decodability, the classifier predicts the corresponding input token from its hidden state (an operational proxy for surface-form information). For semantic structure, we extract sentence embeddings from each layer and evaluate them on MTEB using the benchmark's standard lightweight heads. As the backbone is kept frozen, differences across layers reflect representational changes rather than task-specific adaptation. Training details for the probes and MTEB dataset are provided in Appendix [G](https://arxiv.org/html/2606.02750#A7).

### 4.1 Lexical Probes

![Refer to caption](https://arxiv.org/html/2606.02750v1/x6.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x7.png)

Figure 2: Layer-wise probe performance for lexical decodability on WikiText across model depths and architectures, measured by cross-entropy loss and Top-1 accuracy.

For each layer $l\in\mathcal{L}$ and token position $t\in\{1,\dots,T_x\}$, $H^M_l(x)_t\in\mathbb{R}^d$ denotes the model's $d$-dimensional hidden representation at position $t$ in layer $l$. To test whether token identity is linearly recoverable from this representation, we apply a layer-specific affine probe:

$$z_{l,t}=W_l H^M_l(x)_t+b_l,$$

where $W_l\in\mathbb{R}^{d\times d}$ and $b_l\in\mathbb{R}^d$. We score each vocabulary token using the model's input embedding matrix $E_{\text{in}}\in\mathbb{R}^{V\times d}$, whose $v$-th row $e_v\in\mathbb{R}^d$ is the embedding for token $v$. We treat $E_{\text{in}}$ as a natural reference basis for surface lexical form, since it is the learned lookup table that maps discrete token identities into continuous vectors at the input. After $\ell_2$-normalizing both $z_{l,t}$ and the embedding rows $e_v$, the score assigned to token $v$ is $s_{l,t,v}=\langle z_{l,t},e_v\rangle$. Applying a softmax over $v\in\{1,\dots,V\}$ gives a probe-induced distribution over token identities. The details regarding training and hyperparameters for the probes are provided in Appendix [G.1](https://arxiv.org/html/2606.02750#A7.SS1).
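
The forward pass of such a probe can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `probe_distribution` and `cross_entropy`, the toy values `E_IN`, `W`, `B`, and `H`, and the absence of any temperature or scale on the cosine scores are all assumptions, since the text above does not specify one. Training details are in the appendix and are not reproduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def probe_distribution(h, W, b, E_in):
    """Affine probe followed by cosine scoring against the input embeddings.

    h: hidden state (length d); W: d x d matrix; b: bias (length d);
    E_in: V x d input embedding matrix. Returns a distribution over V tokens.
    """
    # z = W h + b
    z = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    z = l2_normalize(z)
    # s_v = <z, e_v> with both sides l2-normalized, i.e. cosine similarity
    scores = [sum(a * c for a, c in zip(z, l2_normalize(e))) for e in E_in]
    # Numerically stable softmax over the vocabulary
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true token id."""
    return -math.log(probs[target])

# Hypothetical toy setup: d = 2, V = 3, identity probe weights.
E_IN = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
B = [0.0, 0.0]
H = [2.0, 0.1]  # hidden state whose true token id is assumed to be 0

PROBS = probe_distribution(H, W, B, E_IN)
PRED = max(range(len(PROBS)), key=PROBS.__getitem__)
print(PRED, cross_entropy(PROBS, 0))
```

Top-1 accuracy and cross-entropy, the two quantities plotted per layer, would then be averages of `PRED == target` and `cross_entropy(PROBS, target)` over held-out tokens.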

Figure [2](https://arxiv.org/html/2606.02750#S4.F2) reports top-1 accuracy and cross-entropy (CE) for token-identity probes as a function of relative depth. Token identity is highly linearly decodable in the earliest layers (high accuracy / low CE). As depth increases, probe performance degrades, and cross-entropy rises, peaking around mid-depth, roughly the middle half of the network, indicating that token identity is least linearly accessible in these intermediate representations. Notably, this mid-depth regime coincides with the layers where we observe the highest lexical failure rates on CounterFact and SCPP. This points to a transient re-encoding regime in which sensitivity to lexical influence may persist even as linear decodability of token identity decreases. Prior work links intermediate layers to more syntactic abstractions (Jawahar et al., [2019](https://arxiv.org/html/2606.02750#bib.bib23)), providing a complementary view of this region of the network. Beyond this point, token-identity decodability partially recovers in later layers before deteriorating again near the final layers, where representations are increasingly shaped by the model's training objectives. This non-monotonic pattern cautions against treating depth as a smooth lexical-to-semantic progression.

![Refer to caption](https://arxiv.org/html/2606.02750v1/x8.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x9.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x10.png)\(c\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x11.png)\(d\)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x12.png)

Figure 3: MTEB task performance across the full model depth for all evaluated models. For each dataset category, curves show the average task performance as a function of depth $l$, using layer-wise sentence embeddings.

### 4.2 Semantic Probes

We follow MTEB's two complementary evaluation settings to probe semantic information in layer-wise sentence embeddings. MTEB measures performance either directly from embedding similarities and induced rankings (no task-specific training) or by training a lightweight linear probe to test whether task-relevant information is linearly separable at a given layer. Dataset and evaluation details are provided in Appendix [H](https://arxiv.org/html/2606.02750#A8).

Figure [3](https://arxiv.org/html/2606.02750#S4.F3) reports MTEB results for classification (a), retrieval (b), and reranking (c). We provide the results for clustering, pairwise classification, and STS in Appendix [H.1](https://arxiv.org/html/2606.02750#A8.SS1). Although MTEB spans diverse tasks that could yield heterogeneous layer-wise trends, we see strong regularity where performance varies with depth, and curve trends cluster into a few characteristic profile types. Across classification, pairwise classification, retrieval, and semantic textual similarity (STS), we consistently observe a non-monotonic, valley-shaped depth profile in all models: performance improves from early to mid layers, degrades in an intermediate regime, and then recovers near the top of the model. In contrast, clustering and reranking do not exhibit the mid-depth valley observed for classification-style tasks. For reranking, performance remains essentially constant across depth. Clustering generally improves with depth, but the trend is noisy under the default MTEB setup (k-means + V-measure), where discrete assignment changes can cause jagged layer-to-layer variation.

For embedding models, performance in the intermediate "valley" regime either remains roughly constant or improves slowly, before increasing sharply after this. Peak scores for all models typically occur in the last few layers, but not consistently at the final layer, suggesting that the standard "use the final layer" heuristic can be suboptimal for semantic benchmarks, consistent with findings by Skean et al. ([2025](https://arxiv.org/html/2606.02750#bib.bib27)) and Cheng et al. ([2025](https://arxiv.org/html/2606.02750#bib.bib71)).

Together with §[3](https://arxiv.org/html/2606.02750#S3), these results identify a shared mid\-depth transition: lexical features become least recoverable over a broad intermediate regime, without a compensating semantic gain in MTEB\. Performance is often lowest over the same layers and recovers only later, suggesting a phase where surface form is suppressed before semantic structure is fully organized\. Consequently, the strongest representations typically emerge in mid\-to\-late layers\.

## 5 An Information-Theoretic Lens on the Mid-Depth Valley

To further explore how lexical and semantic structure evolve across depth, we juxtapose our probing results with an information-theoretic view of representations, which frames learning as a trade-off between compression and the preservation of predictive structure (Tishby et al., [2000](https://arxiv.org/html/2606.02750#bib.bib42); Shwartz-Ziv and Tishby, [2017](https://arxiv.org/html/2606.02750#bib.bib43)). Geometric accounts based on intrinsic dimension reach a compatible conclusion from a different direction: hidden representations do not evolve monotonically with depth but instead pass through distinct regimes of expansion, contraction, and abstraction (Valeriani et al., [2023](https://arxiv.org/html/2606.02750#bib.bib72); Cheng et al., [2025](https://arxiv.org/html/2606.02750#bib.bib71); Doimo et al., [2024](https://arxiv.org/html/2606.02750#bib.bib70)). Skean et al. ([2025](https://arxiv.org/html/2606.02750#bib.bib27)) offer a particularly relevant comparison, showing that Transformer representations follow a characteristic compression–decompression trajectory across layers and arguing that the low-entropy intermediate regime hosts the strongest representations.

To connect our probing results to representation geometry, we adopt prompt entropy as a per-layer summary statistic (formal definition in Appendix [I](https://arxiv.org/html/2606.02750#A9)). Intuitively, low entropy indicates compressed, low-effective-rank token representations. Figure [3](https://arxiv.org/html/2606.02750#S4.F3)(d) confirms the expected shape: entropy falls into the middle layers and rises again toward the output.
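
To make the "low entropy means low effective rank" intuition concrete, the sketch below computes one possible per-layer entropy. It should be read as an assumption, not as the paper's definition: the formal definition lives in the appendix and is not reproduced in this section, so the code uses an order-2 matrix-based Rényi entropy of the trace-normalized Gram matrix of token vectors, chosen here because it needs no eigendecomposition. The function name and the toy inputs are hypothetical.

```python
import math

def prompt_entropy_order2(hidden):
    """Order-2 matrix-based Renyi entropy of one layer's token representations.

    hidden: list of T token vectors. Builds the Gram matrix K = Z Z^T,
    normalizes it to unit trace, and returns -log tr(K^2), which equals
    -log of the sum of squared eigenvalues of the normalized K.
    """
    gram = [[sum(a * b for a, b in zip(zi, zj)) for zj in hidden] for zi in hidden]
    trace = sum(gram[i][i] for i in range(len(gram)))
    # For a symmetric matrix, tr(K^2) is the sum of its squared entries.
    tr_k2 = sum((g / trace) ** 2 for row in gram for g in row)
    return -math.log(tr_k2)

# Hypothetical inputs: tokens collapsed onto one direction vs. spread out.
COLLAPSED = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
SPREAD = [[1.0, 0.0], [0.0, 1.0]]

print(prompt_entropy_order2(COLLAPSED))
print(prompt_entropy_order2(SPREAD))
```

Under this stand-in definition, token vectors that all lie along one direction give an entropy near zero, while mutually orthogonal vectors of equal norm give the maximum, the log of the number of tokens.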

Analysing the entropy curves and empirical results in §[3](https://arxiv.org/html/2606.02750#S3) and §[4](https://arxiv.org/html/2606.02750#S4) together provides a more precise account of where the strongest representations emerge. The mid-depth valley at 40–60% relative depth aligns with the low-entropy bottleneck and its immediate re-expansion phase, precisely the regime that Skean et al. ([2025](https://arxiv.org/html/2606.02750#bib.bib27)) associate with the strongest intermediate representations. Yet in our measurements, this is where semantic performance is weak and lexical influence is most pronounced: the model is maximally compressed but not maximally meaningful. Clean semantic separation and reduced lexical-overlap failures only emerge in the later layers, after entropy has begun to rise. This nuances the compression-centric account: the low-entropy bottleneck is not the point of strongest embedding quality. These findings suggest that the re-expansion phase plays a critical role in producing a more meaning-sensitive embedding geometry.

## 6 Practical Implications

In this section, we explore the impact of lexical influence on practical settings that rely on embedding geometry and representation\-sensitive updates\.

### 6.1 Summarization

Text summarization compresses a document while preserving its key content. Summarization is evaluated with both n-gram-based and semantic metrics. The former, such as ROUGE (Lin, [2004](https://arxiv.org/html/2606.02750#bib.bib38)) and BLEU (Papineni et al., [2002](https://arxiv.org/html/2606.02750#bib.bib39)), reward n-gram overlap as a signal for a better summary, emphasizing phrase matching over semantic matching. The latter, such as BERTScore and BARTScore, aim to evaluate summaries based on meaning preservation. Here, we show that although these evaluation methods target semantics, the underlying models carry a lexical bias that distorts the semantic evaluation of summaries.

Data\. We use the Extreme Summarization benchmark Narayan et al\. \([2018](https://arxiv.org/html/2606.02750#bib.bib41)\)\. To probe lexical sensitivity, we augment each example with a highly lexicalized non\-summary distractor generated by an LLM\. Details for dataset construction and experimental methodology are provided in Appendix [J\.2](https://arxiv.org/html/2606.02750#A10.SS2)\.

Experiment\. For each document, we score \(i\) the gold summary and \(ii\) an LLM\-generated lexical distractor with BERTScore and BARTScore\. We count a *failure* when the distractor scores at least as high as the gold summary\. BERTScore fails on 64% of examples, consistent with a strong lexical signal in contextual representations and mirroring our earlier findings\. BARTScore fails on 20%: despite being likelihood\-based, it can score lexical distractors competitively because they remain locally plausible\. This shows that lexical reliance is not limited to similarity metrics built on contextual representations; the token\-level generative process can reward lexically aligned text as well\.
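
The failure criterion above can be sketched in a few lines of Python\. This is only an illustration under stated assumptions: the helper name `failure_rate` and the toy scores are hypothetical, and the scores would in practice come from a metric such as BERTScore or BARTScore rather than being typed in by hand\.

```python
from typing import Sequence


def failure_rate(gold_scores: Sequence[float],
                 distractor_scores: Sequence[float]) -> float:
    """Fraction of documents whose distractor scores at least as high as the gold summary."""
    if len(gold_scores) != len(distractor_scores):
        raise ValueError("scores must be aligned: one (gold, distractor) pair per document")
    if not gold_scores:
        raise ValueError("need at least one scored pair")
    # A failure is counted when distractor >= gold, matching the criterion in the text.
    failures = sum(d >= g for g, d in zip(gold_scores, distractor_scores))
    return failures / len(gold_scores)


# Hypothetical metric scores for four documents (not taken from the paper).
gold = [0.91, 0.85, 0.88, 0.80]
distractor = [0.93, 0.80, 0.88, 0.79]
rate = failure_rate(gold, distractor)
print(f"failure rate: {rate:.2f}")
```

Ties count as failures here because the criterion is "at least as high"; a strict comparison would give a lower rate on data with ties\.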

Table 1: Summarization ranking results \(lexical distractor vs\. gold\)\. FR is the fraction of cases where the distractor is scored higher than the gold; ASD/ASG are mean scores for distractor/gold \(method\-specific scales\)\.

Table 2: Mean \(over prompts\) distribution\-shift metrics for lexically similar \(LS\) and lexically dissimilar \(LD\) prompts\.

### 6\.2Model Editing

Deploying LLMs in production requires post\-training updates to maintain correctness and policy compliance \(e\.g\., incorporating new facts or suppressing harmful associations\) Meng et al\. \([2022](https://arxiv.org/html/2606.02750#bib.bib12)\); Mitchell et al\. \([2021](https://arxiv.org/html/2606.02750#bib.bib30)\)\. However, retraining is costly \(Hartvigsen et al\., [2023](https://arxiv.org/html/2606.02750#bib.bib33); Rizwan et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib10)\), and fine\-tuning can induce catastrophic forgetting \(Luo et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib31); Wang et al\., [2024b](https://arxiv.org/html/2606.02750#bib.bib32)\), while prompt\-based methods require routing and are brittle Rosati et al\. \([2024](https://arxiv.org/html/2606.02750#bib.bib34)\); Su et al\. \([2025](https://arxiv.org/html/2606.02750#bib.bib35)\)\. Model editing addresses this gap by enabling targeted, localized updates that satisfy an edit request while preserving behavior on non\-target inputs\.

Formally, given a model $M_{\theta_0}$, an edit request $(x_e, y_e^{\star})$, and a locality set $\mathcal{N}$ representing all other behaviors that should remain unchanged, model editing seeks either updated parameters $\theta'$ \(weight\-modifying\) or an auxiliary component parameterized by $\phi$ \(e\.g\., a learned prefix, adapter, or external memory\) while keeping the base weights $\theta_0$ fixed \(weight\-preserving\), such that the edit succeeds while non\-target behavior is preserved:

$$M_{\theta'}(x_e) = y_e^{\star} \;\;\text{or}\;\; M_{\theta_0,\phi}(x_e) = y_e^{\star},$$

$$D\!\left(\bigl(M_{\theta'} \,\big|\, M_{\theta_0,\phi}\bigr)(\cdot \mid x),\, M_{\theta_0}(\cdot \mid x)\right) \leq \varepsilon \quad \forall x \in \mathcal{N}.$$

Rizwan et al\. \([2025](https://arxiv.org/html/2606.02750#bib.bib10)\) showed that semantic\-similarity\-based scoping systems can fail in weight\-preserving editing due to lexical overlap\. We argue that this failure extends to *weight\-modifying* editors: surface\-token similarity can bring semantically unrelated prompts close in embedding space, so an update targeted at $x_e$ also shifts their output distributions\. Consequently, lexically similar but semantically mismatched distractors are disproportionately perturbed, producing systematic lexical skew in locality\.
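
To make the two editing criteria above concrete, the following Python sketch checks them on toy models\. Everything in it is an assumption for illustration: models are stand\-in functions from a prompt to a next\-token distribution, the edit is counted as successful when the target is the top\-1 token, total variation plays the role of the divergence $D$, and the prompts, probabilities, and the name `edit_is_valid` are hypothetical\.

```python
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]          # token -> next-token probability
Model = Callable[[str], Dist]    # prompt -> next-token distribution


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance, used here as a stand-in for the divergence D."""
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


def edit_is_valid(base: Model, edited: Model, x_e: str, y_star: str,
                  locality: Iterable[str], eps: float) -> bool:
    """Check edit success on x_e and the locality bound on every prompt in the locality set."""
    dist = edited(x_e)
    success = max(dist, key=dist.get) == y_star  # assumption: M(x_e) = y* means top-1 match
    local = all(total_variation(edited(x), base(x)) <= eps for x in locality)
    return success and local


# Toy lookup-table "models" (hypothetical numbers, not results from the paper).
FRANCE = "The capital of France is"
ITALY = "The capital of Italy is"   # lexically similar prompt about a different fact
base_table = {FRANCE: {"Paris": 0.9, "Rome": 0.1},
              ITALY: {"Rome": 0.8, "Paris": 0.2}}
edited_table = {FRANCE: {"Lyon": 0.7, "Paris": 0.3},
                ITALY: {"Rome": 0.5, "Lyon": 0.3, "Paris": 0.2}}

valid_strict = edit_is_valid(base_table.get, edited_table.get, FRANCE, "Lyon", [ITALY], eps=0.1)
valid_loose = edit_is_valid(base_table.get, edited_table.get, FRANCE, "Lyon", [ITALY], eps=0.5)
print(valid_strict, valid_loose)
```

In this toy case the edit succeeds on the target prompt, but mass leaks onto the edited target for the lexically similar prompt, so the locality bound holds only under the looser tolerance\.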

Method and dataset\. We use AlphaEdit \(Fang et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib29)\) and evaluate on a modified version of the CounterFact benchmark\. For each edit, we augment CounterFact with two prompt sets: \(*i*\) *lexically similar* variants created by substituting distractor entities into the edit prompt, and \(*ii*\) *lexically dissimilar* variants that keep the distractor’s semantics but substantially change its surface form\.

Metrics\. We quantify distribution shift at the answer position by comparing the base and edited next\-token distributions\. We compute Jensen–Shannon divergence \(JSD\) and total variation \(TV\) for probability change, Kendall’s $\tau$ for rank stability, and the Top\-1 Flip Rate; all metrics are evaluated on the merged top\-$K$ support\. Details of the method, dataset, and metrics, along with the top\-$K = 100$ results, are provided in Appendix [K](https://arxiv.org/html/2606.02750#A11)\. To measure leakage into distractors, we report Edited\-Target Presence: the fraction of locality prompts whose edited\-model top\-$K$ contains any relation\-specific edited target token\.
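
The metrics named above can be sketched in plain Python as follows\. This is a minimal illustration, not the paper's implementation \(which is described in its Appendix K\): renormalizing over the merged support, using natural\-log JSD, and using the tie\-free Kendall $\tau$\-a variant are assumptions of this sketch, and all function names and toy distributions are hypothetical\.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Set

Dist = Dict[str, float]  # token -> next-token probability


def top_k(dist: Dist, k: int) -> List[str]:
    """Return the k most probable tokens."""
    return sorted(dist, key=dist.get, reverse=True)[:k]


def merged_support(base: Dist, edited: Dist, k: int) -> List[str]:
    """Union of the two top-k token sets, in a fixed (alphabetical) order."""
    return sorted(set(top_k(base, k)) | set(top_k(edited, k)))


def restrict(dist: Dist, support: Sequence[str]) -> List[float]:
    """Project onto the support and renormalize (an assumption of this sketch)."""
    mass = [dist.get(tok, 0.0) for tok in support]
    total = sum(mass)
    return [m / total for m in mass]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in nats (upper bound: log 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kendall_tau(p: Sequence[float], q: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    pairs = list(combinations(range(len(p)), 2))
    if not pairs:
        raise ValueError("need at least two tokens to compare ranks")
    score = 0
    for i, j in pairs:
        s = (p[i] - p[j]) * (q[i] - q[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)


def top1_flip(base: Dist, edited: Dist) -> bool:
    return max(base, key=base.get) != max(edited, key=edited.get)


def edited_target_presence(edited_dists: Sequence[Dist], targets: Set[str], k: int) -> float:
    """Fraction of locality prompts whose edited top-k contains any edited target token."""
    hits = sum(any(tok in targets for tok in top_k(d, k)) for d in edited_dists)
    return hits / len(edited_dists)


# Toy base/edited distributions for one locality prompt (hypothetical numbers).
base = {"Paris": 0.6, "London": 0.3, "Rome": 0.1}
edited = {"Paris": 0.2, "London": 0.3, "Rome": 0.5}
support = merged_support(base, edited, k=2)
p, q = restrict(base, support), restrict(edited, support)

js, tv, tau = jsd(p, q), total_variation(p, q), kendall_tau(p, q)
flip = top1_flip(base, edited)
presence = edited_target_presence([edited, {"Berlin": 0.9, "Bonn": 0.1}], {"Rome"}, k=2)
print(support, round(js, 4), round(tv, 4), tau, flip, presence)
```

Averaging these per\-prompt values over the LS and LD prompt families separately would give the kind of comparison the locality experiment reports\.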

Table [2](https://arxiv.org/html/2606.02750#S6.T2) reports the results of our locality experiment\. Across metrics and Top\-$K$ settings, lexically similar \(LS\) locality prompts are consistently more affected by the edit than lexically dissimilar \(LD\) locality prompts, even though both prompt families query facts unrelated to the edited target\. The effect is stable across $K$, indicating it is not an artifact of the Top\-$K$ cutoff\.

The effect is clearest in the probability\-based shift metrics\. Relative to LD, LS exhibits larger changes in both the shape of the distribution \(JSD\) and the amount of probability mass moved \(TV\), implying that surface\-form overlap amplifies the edit’s collateral redistribution over plausible candidates\. Rank\-based evidence is consistent with this picture: Kendall’s $\tau$ shows greater rank instability under LS, with the separation most apparent at smaller $K$, where $\tau$ more directly reflects reordering among high\-likelihood tokens\. As $K$ increases, $\tau$ becomes less discriminative because it averages over many additional token pairs, which can dilute head reordering by including comparatively stable mid\- and lower\-probability tokens\. Finally, behavioral indicators align with these distributional shifts: LS distractors exhibit more decision\-level changes \(Top\-1 flips\) and more frequent appearance of relation\-specific edited targets in the post\-edit top\-$K$, consistent with stronger edit leakage under lexical overlap\.
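
The dilution argument for Kendall’s $\tau$ can be illustrated with a small worked example\. The sketch below assumes the tie\-free $\tau$\-a variant and a hypothetical edit that only swaps the two most likely tokens; the helper names are ours\. With one discordant pair out of $K(K-1)/2$, the statistic equals $1 - 4/(K(K-1))$, so the same head swap registers less and less as $K$ grows\.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall tau-a over all index pairs (no ties in this example)."""
    pairs = list(combinations(range(len(a)), 2))
    score = 0
    for i, j in pairs:
        s = (a[i] - a[j]) * (b[i] - b[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)


def tau_after_head_swap(k: int) -> float:
    """Tau between a strictly decreasing score list and a copy with its top two entries swapped."""
    base = list(range(k, 0, -1))   # scores k, k-1, ..., 1
    edited = base[:]
    edited[0], edited[1] = edited[1], edited[0]
    return kendall_tau(base, edited)


taus = {k: tau_after_head_swap(k) for k in (3, 10, 100)}
for k, tau in taus.items():
    print(k, round(tau, 4), round(1 - 4 / (k * (k - 1)), 4))
```

The same top\-1 flip moves $\tau$ far from 1 at $K = 3$ but barely at $K = 100$, which is one way to read why the LS/LD separation is most visible at smaller $K$\.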

Implication\. The results indicate that surface\-form overlap systematically increases a prompt’s susceptibility to unintended changes\. More broadly, this suggests that any evaluation criterion for model interventions \(e\.g\., steering or activation\-based methods, adapters/LoRA, targeted fine\-tuning\) should stress\-test scope using lexically similar control prompts that request different information\.

## 7Related Work

Layer\-wise probing studies show that linguistic properties are not uniformly encoded across depth, but peak at different layers \(Liu et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib50); Tenney et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib51); Jawahar et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib23)\); later work strengthens these analyses with selectivity and control tasks \(Hewitt and Liang, [2019](https://arxiv.org/html/2606.02750#bib.bib25)\)\. Complementary geometric and similarity\-based approaches study how embedding spaces themselves evolve across layers, including changes in contextual geometry and representational alignment \(Ethayarajh, [2019](https://arxiv.org/html/2606.02750#bib.bib52); Kornblith et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib53)\)\. Valeriani et al\. \([2023](https://arxiv.org/html/2606.02750#bib.bib72)\) further characterize the geometry of hidden representations in large transformers, showing that intrinsic dimension and neighborhood structure vary systematically across depth\. Recent work extends this with unified depth diagnostics and large\-scale evaluations showing that intermediate layers can be especially informative feature sources \(Skean et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib27)\), as well as evidence for high\-dimensional abstraction phases \(Cheng et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib71)\), staged inference dynamics \(Lad et al\., [2026](https://arxiv.org/html/2606.02750#bib.bib73)\), and representation\-landscape changes induced by few\-shot learning and fine\-tuning \(Doimo et al\., [2024](https://arxiv.org/html/2606.02750#bib.bib70)\)\.
Related depth\-centric work further shows that concepts and knowledge emerge progressively across layers \(Haider et al\., [2024](https://arxiv.org/html/2606.02750#bib.bib24); Jin et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib54)\), while predictions can stabilize well before the final layers \(Sajjad et al\., [2023](https://arxiv.org/html/2606.02750#bib.bib55); Fan et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib56)\)\.

Lexical bias is often framed as shortcut learning, where models exploit surface cues that correlate with labels rather than semantics\. In NLI, annotation artifacts introduce exploitable lexical patterns \(Gururangan et al\., [2018](https://arxiv.org/html/2606.02750#bib.bib59); He et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib60); Poliak et al\., [2018](https://arxiv.org/html/2606.02750#bib.bib61)\)\. Controlled diagnostics like HANS \(McCoy et al\., [2019](https://arxiv.org/html/2606.02750#bib.bib57)\) show that high\-performing models often follow lexical overlap and subsequence heuristics and fail when these heuristics are broken\. Large\-scale evaluations further confirm vulnerabilities to lexical shortcuts \(Yuan et al\., [2024](https://arxiv.org/html/2606.02750#bib.bib62)\)\. Lexical shortcuts can persist even after mitigation, motivating debiasing objectives that target unknown biases rather than a manually specified cue \(Serrano et al\., [2023](https://arxiv.org/html/2606.02750#bib.bib18); Utama et al\., [2020](https://arxiv.org/html/2606.02750#bib.bib58); Zhou and Bansal, [2020](https://arxiv.org/html/2606.02750#bib.bib64)\)\.

## 8Conclusion

We study the persistent effects of lexicality on LLM representations by tracking how lexical and semantic structure evolve across depth\. We find that instruction tuning and metric\-learning objectives improve semantic performance, but do not necessarily overcome lexical influence, so lexical\-overlap failures can persist despite gains on standard embedding evaluations: even embedding models trained for semantic similarity struggle on fine\-grained attribute and relation changes under high lexical overlap\. Connecting the stress test with layer\-wise token\-identity probes and semantic evaluations across model families and training regimes, we uncover a consistent mid\-layer valley in which semantic fidelity remains nearly constant while overlap\-driven failures peak, even as token identity is least linearly decodable\. Reading these results alongside the entropy curve refines the compression\-centric view, showing that the low\-entropy bottleneck is not where the strongest semantic representations arise\. Instead, stronger meaning\-sensitive geometry emerges later, during re\-expansion\. We further show that this lexical influence propagates into downstream LLM usage, as demonstrated with the use cases of automatic summarization evaluation and model editing\.

We conclude that lexically controlled stress tests should become a standard check for representation learning, automatic metrics, and model interventions because they reveal whether meaning sensitivity is robust to high\-overlap confounds or breaks down through overlap\-driven semantic failures\. In future work, we will dissect the mid\-depth transition regime with nonlinear transformations to identify what structure is formed there and how to prevent overlap\-driven similarity errors\.

## Acknowledgment

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada \(NSERC\), the Canada Foundation for Innovation \(CFI\), and Research Nova Scotia\. Advanced computing resources are provided by ACENET, the regional partner in Atlantic Canada, and the Digital Research Alliance of Canada\.

## References

- Lexical generalization improves with larger models and longer training\. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7\-11, 2022, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), pp\. 4398–4410\. External Links: [Document](https://dx.doi.org/10.18653/V1/2022.FINDINGS-EMNLP.323) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p4.1)\.
- E\. Cheng, D\. Doimo, C\. Kervadec, I\. Macocco, L\. Yu, A\. Laio, and M\. Baroni \(2025\) Emergence of a high\-dimensional abstraction phase in language transformers\. In The Thirteenth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p7.1), [§4\.2](https://arxiv.org/html/2606.02750#S4.SS2.p3.1), [§5](https://arxiv.org/html/2606.02750#S5.p1.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma, A\. Webson, S\. S\. Gu, Z\. Dai, M\. Suzgun, X\. Chen, A\. Chowdhery, A\. Castro\-Ros, M\. Pellat, K\. Robinson, D\. Valter, S\. Narang, G\. Mishra, A\. Yu, V\. Y\. Zhao, Y\. Huang, A\. M\. Dai, H\. Yu, S\. Petrov, E\. H\. Chi, J\. Dean, J\. Devlin, A\. Roberts, D\. Zhou, Q\. V\. Le, and J\. Wei \(2024\) Scaling instruction\-finetuned language models\. J\. Mach\. Learn\. Res\. 25, pp\. 70:1–70:53\. External Links: [Link](https://jmlr.org/papers/v25/23-0870.html) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p4.1)\.
- D\. Doimo, A\. Serra, A\. Ansuini, and A\. Cazzaniga \(2024\) The representation landscape of few\-shot learning and fine\-tuning in large language models\. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\)\. External Links: [Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/206018a258033def63607fbdf364bd2d-Abstract-Conference.html) Cited by: [§5](https://arxiv.org/html/2606.02750#S5.p1.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- M\. Du, F\. He, N\. Zou, D\. Tao, and X\. Hu \(2024\) Shortcut learning of large language models in natural language understanding\. Commun\. ACM 67\(1\), pp\. 110–120\. External Links: [Document](https://dx.doi.org/10.1145/3596490) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p4.1)\.
- S\. H\. Dumpala, A\. Jaiswal, C\. S\. Sastry, E\. E\. Milios, S\. Oore, and H\. Sajjad \(2024\) SUGARCREPE\+\+ dataset: vision\-language model sensitivity to semantic and lexical alterations\. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\)\. Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p2.1), [§2](https://arxiv.org/html/2606.02750#S2.p2.1)\.
- K\. Ethayarajh \(2019\) How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT\-2 embeddings\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP\-IJCNLP 2019, Hong Kong, China, November 3\-7, 2019, K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), pp\. 55–65\. External Links: [Link](https://doi.org/10.18653/v1/D19-1006) Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- S\. Fan, X\. Jiang, X\. Li, X\. Meng, P\. Han, S\. Shang, A\. Sun, and Y\. Wang \(2025\) Not all layers of LLMs are necessary during inference\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2025, Montreal, Canada, August 16\-22, 2025, pp\. 5083–5091\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- J\. Fang, H\. Jiang, K\. Wang, Y\. Ma, J\. Shi, X\. Wang, X\. He, and T\. Chua \(2025\) AlphaEdit: null\-space constrained knowledge editing for language models\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. External Links: [Link](https://openreview.net/forum?id=HvSytvg3Jh) Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p3.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [footnote 1](https://arxiv.org/html/2606.02750#footnote1)\.
- S\. Gururangan, S\. Swayamdipta, O\. Levy, R\. Schwartz, S\. R\. Bowman, and N\. A\. Smith \(2018\) Annotation artifacts in natural language inference data\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT, New Orleans, Louisiana, USA, June 1\-6, 2018, Volume 2 \(Short Papers\), M\. A\. Walker, H\. Ji, and A\. Stent \(Eds\.\), pp\. 107–112\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- M\. U\. Haider, U\. Farooq, A\. Siddique, and M\. Marron \(2024\) Looking into black box code language models\. arXiv preprint arXiv:2407\.04868\. Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p2.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- T\. Hartvigsen, S\. Sankaranarayanan, H\. Palangi, Y\. Kim, and M\. Ghassemi \(2023\) Aging with grace: lifelong model editing with discrete key\-value adaptors\. Advances in Neural Information Processing Systems 36, pp\. 47934–47959\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- H\. He, S\. Zha, and H\. Wang \(2019\) Unlearn dataset bias in natural language inference by fitting the residual\. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low\-Resource NLP, DeepLo@EMNLP\-IJCNLP 2019, Hong Kong, China, November 3, 2019, C\. Cherry, G\. Durrett, G\. F\. Foster, R\. Haffari, S\. Khadivi, N\. Peng, X\. Ren, and S\. Swayamdipta \(Eds\.\), pp\. 132–142\. External Links: [Link](https://doi.org/10.18653/v1/D19-6115), [Document](https://dx.doi.org/10.18653/V1/D19-6115) Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2021\) DeBERTa: decoding\-enhanced BERT with disentangled attention\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=XPZIaotutsD) Cited by: [§J\.2](https://arxiv.org/html/2606.02750#A10.SS2.p2.1)\.
- J\. Hewitt and P\. Liang \(2019\) Designing and interpreting probes with control tasks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP\-IJCNLP 2019, Hong Kong, China, November 3\-7, 2019, K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), pp\. 2733–2743\. External Links: [Link](https://doi.org/10.18653/v1/D19-1275), [Document](https://dx.doi.org/10.18653/V1/D19-1275) Cited by: [§G\.1](https://arxiv.org/html/2606.02750#A7.SS1.p5.3), [§1](https://arxiv.org/html/2606.02750#S1.p2.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- G\. Jawahar, B\. Sagot, and D\. Seddah \(2019\) What does BERT learn about the structure of language?\. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers, A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\), pp\. 3651–3657\. External Links: [Link](https://doi.org/10.18653/v1/p19-1356), [Document](https://dx.doi.org/10.18653/V1/P19-1356) Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p2.1), [§1](https://arxiv.org/html/2606.02750#S1.p7.1), [§4\.1](https://arxiv.org/html/2606.02750#S4.SS1.p2.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- M\. Jin, Q\. Yu, J\. Huang, Q\. Zeng, Z\. Wang, W\. Hua, H\. Zhao, K\. Mei, Y\. Meng, K\. Ding, F\. Yang, M\. Du, and Y\. Zhang \(2025\) Exploring concept depth: how large language models acquire knowledge and concept at different layers?\. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19\-24, 2025, O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\), pp\. 558–573\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\) Similarity of neural network representations revisited\. In International Conference on Machine Learning, pp\. 3519–3529\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- V\. Lad, J\. H\. Lee, W\. Gurnee, and M\. Tegmark \(2026\) Remarkable robustness of LLMs: stages of inference?\. Advances in Neural Information Processing Systems 38, pp\. 130050–130083\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- M\. Lewis, Y\. Liu, N\. Goyal, M\. Ghazvininejad, A\. Mohamed, O\. Levy, V\. Stoyanov, and L\. Zettlemoyer \(2019\) BART: denoising sequence\-to\-sequence pre\-training for natural language generation, translation, and comprehension\. arXiv preprint arXiv:1910\.13461\. Cited by: [§J\.2](https://arxiv.org/html/2606.02750#A10.SS2.p2.1)\.
- M\. Li and N\. Subramani \(2025\) Model internal sleuthing: finding lexical identity and inflectional morphology in modern language models\. arXiv preprint arXiv:2506\.02132\. Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p7.1)\.
- C\. Lin \(2004\) ROUGE: a package for automatic evaluation of summaries\. In Text Summarization Branches Out, pp\. 74–81\. Cited by: [§6\.1](https://arxiv.org/html/2606.02750#S6.SS1.p1.1)\.
- N\. F\. Liu, M\. Gardner, Y\. Belinkov, M\. E\. Peters, and N\. A\. Smith \(2019\) Linguistic knowledge and transferability of contextual representations\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2019, Minneapolis, MN, USA, June 2\-7, 2019, Volume 1 \(Long and Short Papers\), J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\), pp\. 1073–1094\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\) Decoupled weight decay regularization\. arXiv preprint arXiv:1711\.05101\. Cited by: [§G\.1](https://arxiv.org/html/2606.02750#A7.SS1.p6.5)\.
- C\. Lovering, R\. Jha, T\. Linzen, and E\. Pavlick \(2021\) Predicting inductive biases of pre\-trained models\. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021\. External Links: [Link](https://openreview.net/forum?id=mNtmhaDkAr) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p5.1)\.
- Y\. Luo, Z\. Yang, F\. Meng, Y\. Li, J\. Zhou, and Y\. Zhang \(2025\) An empirical study of catastrophic forgetting in large language models during continual fine\-tuning\. IEEE Transactions on Audio, Speech and Language Processing\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- T\. McCoy, E\. Pavlick, and T\. Linzen \(2019\) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference\. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers, A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\), pp\. 3428–3448\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in GPT\. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022, S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\)\. Cited by: [§D\.2](https://arxiv.org/html/2606.02750#A4.SS2.p1.1), [§2](https://arxiv.org/html/2606.02750#S2.p2.1), [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\) Pointer sentinel mixture models\. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings\. External Links: [Link](https://openreview.net/forum?id=Byj72udxe) Cited by: [§2](https://arxiv.org/html/2606.02750#S2.p2.1)\.
- E\. Mitchell, C\. Lin, A\. Bosselut, C\. Finn, and C\. D\. Manning \(2021\) Fast model editing at scale\. arXiv preprint arXiv:2110\.11309\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- N\. Muennighoff, N\. Tazi, L\. Magne, and N\. Reimers \(2023\) MTEB: massive text embedding benchmark\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2\-6, 2023, A\. Vlachos and I\. Augenstein \(Eds\.\), pp\. 2006–2029\. External Links: [Link](https://doi.org/10.18653/v1/2023.eacl-main.148), [Document](https://dx.doi.org/10.18653/V1/2023.EACL-MAIN.148) Cited by: [§2](https://arxiv.org/html/2606.02750#S2.p2.1)\.
- S\. Narayan, S\. B\. Cohen, and M\. Lapata \(2018\) Don’t give me the details, just the summary\! Topic\-aware convolutional neural networks for extreme summarization\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 1797–1807\. Cited by: [§J\.2](https://arxiv.org/html/2606.02750#A10.SS2.p1.1), [§6\.1](https://arxiv.org/html/2606.02750#S6.SS1.p2.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. F\. Christiano, J\. Leike, and R\. Lowe \(2022\) Training language models to follow instructions with human feedback\. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022, S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\)\. Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p4.1)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\) BLEU: a method for automatic evaluation of machine translation\. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp\. 311–318\. Cited by: [§6\.1](https://arxiv.org/html/2606.02750#S6.SS1.p1.1)\.
- A\. Poliak, J\. Naradowsky, A\. Haldar, R\. Rudinger, and B\. V\. Durme \(2018\) Hypothesis only baselines in natural language inference\. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, \*SEM@NAACL\-HLT 2018, New Orleans, Louisiana, USA, June 5\-6, 2018, M\. Nissim, J\. Berant, and A\. Lenci \(Eds\.\), pp\. 180–191\. External Links: [Link](https://doi.org/10.18653/v1/s18-2023), [Document](https://dx.doi.org/10.18653/V1/S18-2023) Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-BERT: sentence embeddings using siamese BERT\-networks\. arXiv preprint arXiv:1908\.10084\. Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p3.1)\.
- H\. Rizwan, D\. Rosati, G\. Wu, and H\. Sajjad \(2025\) Resolving lexical bias in model editing\. In Forty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025\. External Links: [Link](https://openreview.net/forum?id=aPm6SfcMWQ) Cited by: [§D\.2](https://arxiv.org/html/2606.02750#A4.SS2.p1.1), [§D\.2](https://arxiv.org/html/2606.02750#A4.SS2.p2.2), [§1](https://arxiv.org/html/2606.02750#S1.p2.1), [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1), [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p2.7)\.
- J\. Robinson, L\. Sun, K\. Yu, K\. Batmanghelich, S\. Jegelka, and S\. Sra \(2021\) Can contrastive learning avoid shortcut solutions?\. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual, M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\), pp\. 4974–4986\. Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p8.1)\.
- D\. Rosati, R\. Gonzales, J\. Chen, X\. Yu, Y\. Kayani, F\. Rudzicz, and H\. Sajjad \(2024\) Long\-form evaluation of model editing\. In NAACL\-HLT\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- H\. Sajjad, F\. Dalvi, N\. Durrani, and P\. Nakov \(2023\) On the effect of dropping layers of pre\-trained transformer models\. Comput\. Speech Lang\. 77, pp\. 101429\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- S\. Serrano, J\. Dodge, and N\. A\. Smith \(2023\) Stubborn lexical bias in data and models\. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9\-14, 2023, A\. Rogers, J\. L\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), pp\. 8131–8146\. External Links: [Link](https://doi.org/10.18653/v1/2023.findings-acl.516), [Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-ACL.516) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p4.1), [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- R\. Shwartz\-Ziv and N\. Tishby \(2017\) Opening the black box of deep neural networks via information\. arXiv preprint arXiv:1703\.00810\. Cited by: [§5](https://arxiv.org/html/2606.02750#S5.p1.1)\.
- O\. Skean, M\. R\. Arefin, D\. Zhao, N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\) Layer by layer: uncovering hidden representations in language models\. In Forty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025\. External Links: [Link](https://openreview.net/forum?id=WGXb7UdvTX) Cited by: [Appendix I](https://arxiv.org/html/2606.02750#A9.p1.4), [§4\.2](https://arxiv.org/html/2606.02750#S4.SS2.p3.1), [§5](https://arxiv.org/html/2606.02750#S5.p1.1), [§5](https://arxiv.org/html/2606.02750#S5.p3.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- W\. Su, Y\. Tang, Q\. Ai, J\. Yan, C\. Wang, H\. Wang, Z\. Ye, Y\. Zhou, and Y\. Liu \(2025\) Parametric retrieval augmented generation\. arXiv preprint arXiv:2501\.15915\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, et al\. \(2025\) Gemma 3 technical report\. arXiv preprint arXiv:2503\.19786\. Cited by: [footnote 1](https://arxiv.org/html/2606.02750#footnote1)\.
- I\. Tenney, D\. Das, and E\. Pavlick \(2019\) BERT rediscovers the classical NLP pipeline\. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers, A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\), pp\. 4593–4601\. External Links: [Link](https://doi.org/10.18653/v1/p19-1452) Cited by: [§1](https://arxiv.org/html/2606.02750#S1.p7.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- N\. Tishby, F\. C\. Pereira, and W\. Bialek \(2000\) The information bottleneck method\. arXiv preprint physics/0004057\. Cited by: [§5](https://arxiv.org/html/2606.02750#S5.p1.1)\.
- P\. A\. Utama, N\. S\. Moosavi, and I\. Gurevych \(2020\) Towards debiasing NLU models from unknown biases\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020, B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\), pp\. 7597–7610\. Cited by: [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- L\. Valeriani, D\. Doimo, F\. Cuturello, A\. Laio, A\. Ansuini, and A\. Cazzaniga \(2023\) The geometry of hidden representations of large transformer models\. Advances in Neural Information Processing Systems 36, pp\. 51234–51252\. Cited by: [§5](https://arxiv.org/html/2606.02750#S5.p1.1), [§7](https://arxiv.org/html/2606.02750#S7.p1.1)\.
- P\. Wang, N\. Zhang, B\. Tian, Z\. Xi, Y\. Yao, Z\. Xu, M\. Wang, S\. Mao, X\. Wang, S\. Cheng, K\. Liu, Y\. Ni, G\. Zheng, and H\. Chen \(2024a\) EasyEdit: an easy\-to\-use knowledge editing framework for large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\), Y\. Cao, Y\. Feng, and D\. Xiong \(Eds\.\), Bangkok, Thailand, pp\. 82–93\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-demos.9) Cited by: [§D\.2](https://arxiv.org/html/2606.02750#A4.SS2.p1.1)\.
- S\. Wang, Y\. Zhu, H\. Liu, Z\. Zheng, C\. Chen, and J\. Li \(2024b\) Knowledge editing for large language models: a survey\. ACM Computing Surveys 57\(3\), pp\. 1–37\. Cited by: [§6\.2](https://arxiv.org/html/2606.02750#S6.SS2.p1.1)\.
- J\. Xing, D\. Luo, C\. Xue, and R\. Xing \(2024\) Comparative analysis of pooling mechanisms in LLMs: a sentiment analysis perspective\. arXiv preprint arXiv:2411\.14654\. Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p3.1)\.
- Y\. Yuan, L\. Zhao, K\. Zhang, G\. Zheng, and Q\. Liu \(2024\) Do LLMs overcome shortcut learning? An evaluation of shortcut challenges in large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), pp\. 12188–12200\. External Links: [Link](https://doi.org/10.18653/v1/2024.emnlp-main.679), [Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.679) Cited by: [§3](https://arxiv.org/html/2606.02750#S3.p5.1), [§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\) Qwen3 embedding: advancing text embedding and reranking through foundation models\. arXiv preprint arXiv:2506\.05176\. Cited by: [footnote 1](https://arxiv.org/html/2606.02750#footnote1)\.
- X\. Zhao, X\. Hu, Z\. Shan, S\. Huang, Y\. Zhou, X\. Zhang, Z\. Sun, Z\. Liu, D\. Li, X\. Wei, Y\. Pan, Y\. Xiang, M\. Zhang, H\. Wang, J\. Yu, B\. Hu, and M\. Zhang \(2025\) KaLM\-embedding\-v2: superior training techniques and data inspire a versatile embedding model\. External Links: 2506\.20923, [Link](https://arxiv.org/abs/2506.20923) Cited by: [footnote 1](https://arxiv.org/html/2606.02750#footnote1)\.
- Z\. Zhao, Y\. Ziser, and S\. B\. Cohen \(2024\)Layer by layer: uncovering where multi\-task learning happens in instruction\-tuned large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA\.Cited by:[§3](https://arxiv.org/html/2606.02750#S3.p5.1)\.
- X\. Zhou and M\. Bansal \(2020\)Towards robustifying nli models against lexical dataset biases\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 8759–8771\.Cited by:[§7](https://arxiv.org/html/2606.02750#S7.p2.1)\.

## Appendix A Limitations

Our internal analysis probes representations only at transformer block outputs, so we do not distinguish contributions from finer\-grained components \(e\.g\., attention vs\. MLP pathways\) or alternative hook points\. Finally, the probing results are diagnostic: they indicate what information is recoverable from representations, but do not by themselves establish a causal mechanism\.

## Appendix B Impact Statement

This work characterizes lexical influence in representation spaces, showing that surface\-form overlap can distort embedding similarity and miscalibrate similarity\-based evaluation under lexical confounds\. It also introduces diagnostics that quantify when and where semantic structure degrades\. We expect the primary impact to be positive: improving evaluation rigor and informing more robust use of embeddings in retrieval, ranking, and monitoring pipelines\. While the findings could in principle be used to stress similarity\-based systems via lexically similar distractors, such misuse is likely limited in practice because it generally requires model and pipeline\-specific knowledge and sustained control over inputs; more often, the results will help identify and mitigate unintended brittleness\.

## Appendix C LLM Usage

LLMs were used only to improve grammar, wording, and prose clarity\. All technical content, experimental design, analysis, and conclusions were developed and verified by the authors\.

## Appendix D Datasets

For all experiments in the paper, we prioritize large evaluation sets and aggregate layer\-wise trends over repeated split\-based reruns, given the breadth of the study across models, tasks, and layers\.

### D\.1 Stress Tests and Model Editing

Table [3](https://arxiv.org/html/2606.02750#A4.T3) shows representative examples from CounterFact and SugarCrepe\+\+ \(SCPP\)\. Each CounterFact instance specifies a factual association of the form \(subject, relation, object\) together with a set of prompts designed to elicit that fact\. It includes \(i\) *anchor* prompts that directly query the target relation for the subject, \(ii\) *paraphrase* prompts that ask for the same fact using alternative surface forms to test generalization beyond a single template, and \(iii\) *distractor* prompts that query the same relation for semantically related subjects\. For model editing, distractors are used to test specificity/locality \(i\.e\., whether the edit remains localized\)\. After an edit, a successful model should produce the new target object for the anchor/paraphrase prompts while leaving predictions on distractor prompts largely unchanged\.

SCPP provides image–caption triplets with two semantically equivalent but lexically different positives and a hard negative that is lexically confusable yet semantically incorrect for the image\. We use only the texts in this paper\. SCPP partitions negatives into five subsets based on the perturbation used to create the distractor: *Swap Object* exchanges two object nouns \(or noun phrases\) in the caption; *Swap Attribute* swaps attributes between objects \(e\.g\., colors/sizes\); *Replace Object* substitutes an object noun with another; *Replace Attribute* substitutes an attribute term with another; and *Replace Relation* substitutes the relation predicate \(e\.g\., action or spatial preposition\), altering who\-does\-what\-to\-whom or the spatial configuration\.

### D\.2 Stress Test Sample Details

For the adversarial stress tests, we use the full SCPP dataset, consisting of 4,752 samples, together with 5,000 samples from CounterFact\. CounterFact exists in several variants, including the original version released with ROME \[Meng et al\., [2022](https://arxiv.org/html/2606.02750#bib.bib12)\], as well as later sampled or updated versions from EasyEdit \[Wang et al\., [2024a](https://arxiv.org/html/2606.02750#bib.bib69)\] and PENME \[Rizwan et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib10)\]\. The original dataset contains two paraphrase prompts per sample, whereas the EasyEdit version contains one paraphrase prompt and one locality/neighborhood prompt per sample\. The generation prompts contained in the dataset are not exact paraphrases, but rather prompts that may lead to the edited answer\. For example, for the edit prompt “Autonomous University of Madrid, which is located in,” one generation prompt is “The best restaurants around Autonomous University of Madrid include\.” Since this is not an exact paraphrase, we cannot use these prompts\.

We therefore use the version released by PENME \[Rizwan et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib10)\], which provides the two original paraphrases along with several LLM\-generated paraphrases \(at least three per sample\)\. There are 10 locality/neighborhood prompts for each sample\. For this dataset, we used n\-grams to rank locality prompts by lexical similarity to the edit prompt, obtaining a clear separation between five lexically similar prompts and five dissimilar ones\. Using this ranking, we sampled 5,000 samples from the dataset\. For the model editing experiments, we used this sampled dataset\. For the stress tests, we restricted evaluation to the lexically similar locality prompts, resulting in 25 triplet evaluations per sample and 125,000 evaluations in total\.
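The ranking step can be illustrated with a short sketch\. It assumes word\-level unigram and bigram Jaccard overlap as the similarity measure, which the text above does not specify; the function names and the four locality prompts are hypothetical, and only the edit prompt is taken from the example above\.

```python
def ngrams(text, n=2):
    """Set of word n-grams of orders 1..n in a lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap between n-gram sets (an assumed similarity measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def split_locality_prompts(edit_prompt, locality_prompts, k=5):
    """Rank prompts by overlap with the edit prompt; return (similar, dissimilar)."""
    ranked = sorted(locality_prompts,
                    key=lambda p: ngram_overlap(edit_prompt, p),
                    reverse=True)
    return ranked[:k], ranked[-k:]


# Edit prompt from the text; the locality prompts are hypothetical examples.
edit = "Autonomous University of Madrid, which is located in"
locality = [
    "Complutense University of Madrid, which is located in",
    "The country that hosts the University of Barcelona is",
    "University of Seville, which is located in",
    "Pompeu Fabra University can be found in the nation of",
]
similar, dissimilar = split_locality_prompts(edit, locality, k=2)
```

With ten locality prompts per sample and `k=5`, the same call would yield the five lexically similar and five dissimilar prompts described above\.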

Table 3: Samples from the CounterFact and SugarCrepe\+\+ \(SCPP\) datasets\.

![Refer to caption](https://arxiv.org/html/2606.02750v1/x13.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x14.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x15.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x16.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x17.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x18.png)

Figure 4: Results for SCPP average\-token embeddings

![Refer to caption](https://arxiv.org/html/2606.02750v1/x19.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x20.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x21.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x22.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x23.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x24.png)

Figure 5: Results for SCPP last\-token embeddings

## Appendix E Results for Lexical Influence Test on SCPP \(subtasks\)

Figures [4](https://arxiv.org/html/2606.02750#A4.F4) and [5](https://arxiv.org/html/2606.02750#A4.F5) show performance with average\-token and last\-token embeddings, respectively, for the models referenced in §[2](https://arxiv.org/html/2606.02750#S2)\.

## Appendix F Effect of Prompt Variation on the Lexical Influence Test

![Refer to caption](https://arxiv.org/html/2606.02750v1/x25.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x26.png)

Figure 6: Results of prompt variations on the lexical influence test using the Gemma\-12B\-IT model on the CounterFact dataset\.

To gauge the impact of prompt variation on the instruction\-tuned model, we evaluate Gemma\-3\-12b\-it on the CounterFact dataset under multiple instruction prompts: "query: " \(query\); "Represent this sentence for searching relevant passages: " \(Rept\); "Instruct: Represent this text for retrieval\. Query: " \(Inst\); and \(Paper\), which contains no instruction other than the base structure for the instruction models\.

The results, shown in Figure [6](https://arxiv.org/html/2606.02750#A6.F6), indicate that the overall pattern is preserved across prompt variations\. To measure the correlation between the resulting curves, we compute the pairwise average distance correlation \(dCor\)\. For mean\-pooled embeddings, both the pattern and performance are nearly unchanged, with dCor = 0\.993\. For last\-token embeddings, variation is larger in the second half of the network, where the model moves closer to its generation objective, but the overall trend remains the same; the dCor is 0\.78 in this case\.
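As a rough illustration of the dCor summary, the sketch below computes the sample distance correlation between layer\-wise curves and averages it over all curve pairs\. It assumes each curve is a list of per\-layer scores paired by layer index; the function names are hypothetical, and the exact estimator used for the reported numbers is not stated in the text\.

```python
import math
from itertools import combinations


def _centered(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    """Mean of the element-wise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D curves."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("curves must have the same length (at least 2)")
    a, b = _centered(x), _centered(y)
    dcov2 = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def mean_pairwise_dcor(curves):
    """Average distance correlation over all unordered pairs of curves."""
    pairs = list(combinations(curves, 2))
    return sum(distance_correlation(x, y) for x, y in pairs) / len(pairs)
```

A value near 1 means two prompt variants trace nearly the same layer\-wise shape, even if the absolute scores are shifted or rescaled\.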

## Appendix G Lexical and Semantic Probes

### G\.1 Lexical probe

Data\. We train the probe on raw WikiText \(WikiText\-103\-raw\)\. We tokenize the corpus with the same tokenizer as the base model, discard empty/whitespace\-only lines, and concatenate text segments with an end\-of\-sequence boundary token to avoid creating spurious cross\-document contexts\. From this token stream, we construct fixed\-length sequences of $L=50$ tokens \(each sequence is used both as input and as next\-token supervision with a standard causal language\-modeling shift\)\.
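The chunking procedure can be sketched as follows\. This is a minimal illustration that assumes tokenization has already happened upstream; `build_sequences`, `causal_shift`, and their parameters are hypothetical names, and the default cap of 3,000 sequences is only a placeholder argument\.

```python
def build_sequences(documents, eos_id, seq_len=50, max_sequences=3000):
    """Concatenate tokenized documents with EOS separators and cut windows.

    `documents` is an iterable of token-id lists produced by the base
    model's tokenizer (assumed to be available upstream).
    """
    stream = []
    for doc in documents:
        if not doc:               # skip empty / whitespace-only lines
            continue
        stream.extend(doc)
        stream.append(eos_id)     # boundary token between text segments

    sequences = []
    # Non-overlapping windows of exactly seq_len tokens; a short tail is dropped.
    for start in range(0, len(stream) - seq_len + 1, seq_len):
        sequences.append(stream[start:start + seq_len])
        if len(sequences) >= max_sequences:
            break
    return sequences


def causal_shift(sequence):
    """Standard causal LM shift: inputs are tokens [:-1], targets are [1:]."""
    return sequence[:-1], sequence[1:]
```

For example, `build_sequences([[1, 2, 3], [], [4, 5]], eos_id=0, seq_len=3)` produces two windows, the second of which starts with the boundary token\.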

To mitigate degenerate solutions that rely disproportionately on high\-frequency function words \(e\.g\., stop words\) rather than meaningful contextual cues, we promote diversity in the training signal by sampling sequences from a shuffled stream of documents \(and, in the streaming setting, shuffling with a buffer\) before chunking into length\-$L$ windows\. We additionally cap the number of extracted sequences per split \(typically 3,000\) to control compute while maintaining broad topical coverage\. Validation and test data are preprocessed identically but are never used for sampling or shuffling decisions\.

Training and Hyperparameters

We train one probe independently for each selected layer while keeping the backbone language model frozen\. For a given layer $l$, the probe applies a learned affine transformation to the layer’s hidden states,

$$z_{l,t} = W_l H^{M}_{l}(x)_t + b_l,$$

where $W_l$ is initialized to the identity and $b_l$ is initialized to zero\. The transformed features are then $\ell_2$\-normalized and scored against a row\-normalized, frozen copy of the model’s *input embedding matrix*\. Thus, the probe acts as a cosine\-similarity classifier over the input vocabulary\. We additionally learn a bounded per\-layer logit scale to keep optimization stable\.
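To make the scoring rule concrete, the following sketch computes the probe’s scores for a single hidden state\. It is an illustration under stated assumptions: plain Python lists stand in for tensors, `probe_logits` is a hypothetical name, and `scale` is passed as a fixed number because the text does not specify how the learned logit scale is bounded\.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean norm (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def probe_logits(hidden, W, b, embeddings, scale=1.0):
    """Cosine-similarity scores of one hidden state against every vocabulary row.

    hidden:     hidden state at one position (length d)
    W, b:       probe parameters (d x d matrix and length-d bias)
    embeddings: frozen input embedding matrix, one row per vocabulary item
    scale:      per-layer logit scale (assumed fixed here)
    """
    # Affine map z = W h + b.
    z = [sum(w_ij * h_j for w_ij, h_j in zip(row, hidden)) + b_i
         for row, b_i in zip(W, b)]
    z = l2_normalize(z)
    # Cosine similarity with each row-normalized embedding.
    return [scale * sum(z_i * e_i for z_i, e_i in zip(z, l2_normalize(e)))
            for e in embeddings]
```

With $W_l = I$ and $b_l = 0$, the scores reduce to the cosine similarity between the raw hidden state and each input embedding, which is the probe’s starting point at initialization\.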

We train the probe by minimizing the token\-level cross\-entropy loss $\mathcal{L}_{\mathrm{CE},l}$ between the probe’s scores and the ground\-truth token IDs, ignoring padded positions\. For causal models, we use the EOS token as the pad token\. We also apply a weak $\ell_2$ penalty that encourages $W_l$ to remain close to the identity\. This regularizer primarily stabilizes training and provides a mild preference for identity\-like mappings, discouraging large remappings that can confound probe\-based interpretation \[Hewitt and Liang, [2019](https://arxiv.org/html/2606.02750#bib.bib25)\]\. Consequently, probe accuracy reflects the linear accessibility of token identity in a space aligned with the model’s input embedding geometry, which we use as an operational reference for surface form\.

The resulting objective at layer $l$ is

$$\mathcal{L}_l = \mathcal{L}_{\mathrm{CE},l} + \lambda \|W_l - I\|_F^2, \tag{1}$$

where $\lambda > 0$ is the identity\-regularization coefficient\. We optimize the probe parameters with AdamW \[Loshchilov and Hutter, [2017](https://arxiv.org/html/2606.02750#bib.bib68)\], using learning rate $10^{-3}$, weight decay $10^{-4}$, identity regularization $\lambda = 10^{-6}$, and gradient\-norm clipping to stabilize optimization\.
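The objective in Eq\. \(1\) can be written out directly\. The sketch below is illustrative only: it uses plain Python lists rather than a deep\-learning framework, the function names are hypothetical, and the `ignore_index=-100` convention for padded positions is an assumption borrowed from common practice rather than something stated in the text\.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (stable form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def identity_penalty(W):
    """Squared Frobenius norm ||W - I||_F^2."""
    return sum((w - (1.0 if i == j else 0.0)) ** 2
               for i, row in enumerate(W) for j, w in enumerate(row))


def probe_loss(batch_logits, targets, W, lam=1e-6, ignore_index=-100):
    """Mean cross-entropy over non-padded positions plus lam * ||W - I||_F^2."""
    kept = [(lg, t) for lg, t in zip(batch_logits, targets) if t != ignore_index]
    ce = sum(cross_entropy(lg, t) for lg, t in kept) / max(len(kept), 1)
    return ce + lam * identity_penalty(W)
```

Because $\lambda$ is small, the penalty mostly matters when the probe drifts far from the identity; at initialization \($W_l = I$\) it contributes nothing to the loss\.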

## Appendix H Semantic Probe

Following the findings in Figure [1](https://arxiv.org/html/2606.02750#S3.F1), we use the average\-token \(mean\-pooled\) representation for pretrained and instruction\-tuned models and the last\-token representation for embedding models\.

We evaluate on a suite of English tasks from the Massive Text Embedding Benchmark \(MTEB\) and cache each task locally to ensure deterministic reruns\. We subsample the few *massive* cases to keep runtime bounded\. For MedRxivClusteringS2S\.v2 \(37,500 test instances\), we uniformly sample 10,000 test points without replacement using a fixed seed and cache the resulting subset\. For extremely large reranking data such as MindSmallReranking, we cap the cached data to 10,000 queries and 200,000 corpus documents; we then filter relevant\_docs and top\_ranked to remove any entries that reference dropped query or document IDs, ensuring internal consistency\. The datasets used and their categorization are provided in Table [4](https://arxiv.org/html/2606.02750#A8.T4)\.
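A minimal sketch of the subsampling and consistency filtering is given below\. The data layout is an assumption: `relevant_docs` is taken to map each query ID to a dictionary of document IDs and relevance labels, and `top_ranked` to map each query ID to a list of document IDs\. The function names and the seed value are hypothetical\.

```python
import random


def subsample_ids(ids, k, seed=0):
    """Uniformly sample k ids without replacement using a fixed seed."""
    ids = sorted(ids)  # sort first so the draw does not depend on input order
    if len(ids) <= k:
        return ids
    return sorted(random.Random(seed).sample(ids, k))


def filter_rerank_maps(relevant_docs, top_ranked, kept_queries, kept_docs):
    """Drop entries that reference query or document ids removed by the cap."""
    kept_queries, kept_docs = set(kept_queries), set(kept_docs)
    rel = {q: {d: s for d, s in docs.items() if d in kept_docs}
           for q, docs in relevant_docs.items() if q in kept_queries}
    top = {q: [d for d in docs if d in kept_docs]
           for q, docs in top_ranked.items() if q in kept_queries}
    return rel, top
```

Fixing the seed and sorting the IDs before sampling keeps the cached subset identical across reruns, which is the property the deterministic\-rerun requirement depends on\.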

### H\.1 MTEB Results

Figure [8](https://arxiv.org/html/2606.02750#A8.F8) shows the results for STS, Pairwise Classification, and Clustering\. Pairwise classification and STS follow the same geometric pattern discussed in the main text, while clustering diverges from the pattern\. We hypothesize that this issue is related to the data and how evaluation is performed for this task; we leave this as future work\.

### H\.2 Word Sense Disambiguation

To validate the mid\-depth degradation on token\-level semantic tasks, we evaluate word sense disambiguation \(WSD\) on the SemCor corpus\. We use token\-level sense labels and restrict the prediction space to the 1,000 most frequent senses in the training set\. For each layer, we train a linear probe and evaluate its performance\. Figure [7](https://arxiv.org/html/2606.02750#A8.F7) shows a similar geometric pattern: performance degrades at intermediate depths and partially recovers in the latter half of the network\. However, the later layers do not outperform the earlier ones\. This is expected, as token representations in higher layers tend to become increasingly aligned with sequence\-level semantics rather than preserving fine\-grained token\-specific information\.

![Refer to caption](https://arxiv.org/html/2606.02750v1/x27.png)

Figure 7: Results for WSD on the Gemma\-12B\-IT and Gemma\-12B\-PT models\.

Table 4: MTEB task categories and datasets used\.

![Refer to caption](https://arxiv.org/html/2606.02750v1/x28.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x29.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x30.png)
![Refer to caption](https://arxiv.org/html/2606.02750v1/x31.png)

Figure 8: Performance of the models on MTEB tasks: STS \(a\), pairwise classification \(b\), and clustering \(c\)\.

## Appendix I Input/Prompt Entropy

The definition follows \[Skean et al\., [2025](https://arxiv.org/html/2606.02750#bib.bib27)\]\. For an input \(prompt\) $p$ with $T_p$ tokens, let $Z_\ell(p) \in \mathbb{R}^{T_p \times d}$ be the matrix of layer\-$\ell$ token embeddings \(rows correspond to token positions\)\. Define the Gram matrix

$$K_\ell(p) = Z_\ell(p)\,Z_\ell(p)^\top, \qquad \tilde{K}_\ell(p) = \frac{K_\ell(p)}{\operatorname{tr}(K_\ell(p))}.$$

Let $\{\mu_i\}_{i=1}^{T_p}$ be the eigenvalues of $\tilde{K}_\ell(p)$ \(so $\sum_i \mu_i = 1$\)\. The input \(prompt\) entropy is the matrix\-based Rényi entropy

$$H_{\mathrm{in}}^{(\ell)}(p) = \frac{1}{1-\alpha} \log\!\left(\sum_{i=1}^{T_p} \mu_i^{\alpha}\right), \qquad \alpha > 0,\ \alpha \neq 1,$$

with the $\alpha \to 1$ limit \(von Neumann/Shannon\)

$$H_{\mathrm{in}}^{(\ell)}(p) = -\sum_{i=1}^{T_p} \mu_i \log \mu_i.$$
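The two entropy formulas can be sketched in plain Python\. The first function assumes the eigenvalues of the Gram matrix have already been computed by some linear\-algebra routine \(not shown\); the second covers only the special case $\alpha = 2$, where the identity $\sum_i \mu_i^2 = \|K\|_F^2 / \operatorname{tr}(K)^2$ avoids an eigendecomposition\. Function names are hypothetical\.

```python
import math


def renyi_entropy(eigenvalues, alpha=1.0, eps=1e-12):
    """Matrix-based Renyi entropy from the eigenvalues of a Gram matrix."""
    clipped = [max(v, 0.0) for v in eigenvalues]   # guard tiny negative values
    total = sum(clipped)
    if total <= 0:
        raise ValueError("Gram matrix must have positive trace")
    mu = [v / total for v in clipped]              # trace normalization
    if abs(alpha - 1.0) < 1e-9:                    # alpha -> 1 (Shannon) limit
        return -sum(m * math.log(m) for m in mu if m > eps)
    return math.log(sum(m ** alpha for m in mu)) / (1.0 - alpha)


def renyi2_from_embeddings(Z):
    """alpha = 2 entropy computed directly from token embeddings Z (T x d)."""
    K = [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]
    trace = sum(K[i][i] for i in range(len(K)))
    frob2 = sum(v * v for row in K for v in row)
    return -math.log(frob2 / trace ** 2)
```

A flat spectrum over $T_p$ eigenvalues gives the maximum value $\log T_p$ for every $\alpha$, while a rank\-one Gram matrix \(all tokens collinear\) gives zero\.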

## Appendix J Summarization

### J\.1 Details of Metrics

BERTScore\. BERTScore leverages contextual embeddings from a pre\-trained Transformer \(e\.g\., BERT\) to compute semantic similarity between a candidate summary and a reference summary\. Unlike $n$\-gram metrics such as ROUGE, it captures synonymy and paraphrasing by computing cosine similarity between contextual token embeddings with greedy matching\.

Given a reference summary $x = \langle x_1, \dots, x_{T_x} \rangle$ and a candidate summary $\hat{x} = \langle \hat{x}_1, \dots, \hat{x}_{T_{\hat{x}}} \rangle$, let $\mathbf{v}_i \in \mathbb{R}^d$ and $\hat{\mathbf{v}}_j \in \mathbb{R}^d$ denote their contextual token embeddings\. Define the cosine similarity

$$s_{ij} = \frac{\mathbf{v}_i^\top \hat{\mathbf{v}}_j}{\|\mathbf{v}_i\|_2\,\|\hat{\mathbf{v}}_j\|_2}.$$

Then the recall \($R_{\text{BERT}}$\), precision \($P_{\text{BERT}}$\), and F1 \($F_{\text{BERT}}$\) are

$$R_{\text{BERT}} = \frac{1}{T_x} \sum_{i=1}^{T_x} \max_{1 \leq j \leq T_{\hat{x}}} s_{ij}, \tag{2}$$

$$P_{\text{BERT}} = \frac{1}{T_{\hat{x}}} \sum_{j=1}^{T_{\hat{x}}} \max_{1 \leq i \leq T_x} s_{ij}, \tag{3}$$

$$F_{\text{BERT}} = 2 \cdot \frac{P_{\text{BERT}}\,R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}. \tag{4}$$

We report the F1 score \($F_{\text{BERT}}$\)\. We follow the official BERTScore configuration and use the default DeBERTa layer \(40\), which is tuned for correlation with human judgments on WMT16\.
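Equations \(2\) to \(4\) translate into a few lines of code once token embeddings are available\. The sketch below assumes the embeddings have already been extracted from the scoring model and are passed as lists of vectors; it omits optional features of the official package such as IDF weighting and baseline rescaling, and the function names are hypothetical\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 from token embeddings."""
    # sim[i][j] = s_ij between reference token i and candidate token j.
    sim = [[cosine(v, v_hat) for v_hat in cand_emb] for v in ref_emb]
    recall = sum(max(row) for row in sim) / len(ref_emb)
    precision = sum(max(sim[i][j] for i in range(len(ref_emb)))
                    for j in range(len(cand_emb))) / len(cand_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Recall averages the best match for each reference token and precision does the same for each candidate token, so a candidate that repeats one reference token scores high precision but low recall\.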

BARTScore\. BARTScore treats the evaluation of generated text as a sequence\-to\-sequence generation task\. It uses the log\-probability of the target sequence \(the candidate summary\) conditioned on a source sequence \(either the reference or the original document\) under a pre\-trained BART model\. This allows the metric to capture aspects like faithfulness and informativeness more effectively than surface\-level matching\.

The score is defined as the average log\-likelihood of the target sequence $y$ given the source sequence $x$:

$$\text{BARTScore} = \frac{1}{m} \sum_{t=1}^{m} \log P(y_t \mid y_{<t}, x, \theta), \tag{5}$$

where $\theta$ represents the parameters of the pre\-trained BART model, and $m$ is the length of the target sequence\.
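As a small worked instance of Eq\. \(5\), the sketch below averages token log\-probabilities\. It assumes the per\-token probabilities $P(y_t \mid y_{<t}, x, \theta)$ have already been read off a sequence\-to\-sequence model; obtaining them is outside the sketch, and the function name is hypothetical\.

```python
import math


def bartscore(token_probs):
    """Average log-likelihood of the target tokens given the source."""
    if not token_probs:
        raise ValueError("target sequence must be non-empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)
```

The score is always at most zero, and values closer to zero indicate that the model finds the candidate more likely given the source\.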

### J\.2 Data and Models

Data\. We use the standard extreme summarization \[Narayan et al\., [2018](https://arxiv.org/html/2606.02750#bib.bib41)\] benchmark, from which we randomly sample 500 examples\. To probe lexical sensitivity, we augment each example with a highly lexicalized non\-summary distractor generated by an LLM \(gpt\-5\.2\-mini, Temperature = 0\.7\)\. The prompt used to generate the distractor is as follows:

prompt = f"""

You are generating adversarial evaluation data to test lexical bias in summarization systems\.

Source Document:

\{document\}

Target Length: \{target\_length\} characters \(\+/\-5%\)

TASK:

Generate a text that has HIGH lexical overlap with the source document but NO semantic overlap with it\.

CRITICAL DEFINITION:

"NO semantic overlap" means that the text must not convey the topic, purpose, claims, events, or conclusions of the source in any form\.

A reader should be unable to infer what the source document is about from your text\.

CONSTRAINTS \(hard requirements\):

1\. Lexical Overlap:

\- Reuse distinctive words, phrases, and named entities from the source\.

\- Do not introduce new terminology\.

\- Try to use all words from the source\.

2\. Semantic Orthogonality:

\- Do NOT describe, summarize, or restate any part of the source’s meaning\.

\- Do NOT preserve relations, causality, chronology, or argumentation\.

\- The text must be about something else entirely, despite using the same words\.

3\. Length:

\- Approximately \{target\_length\} characters\.

VALIDATION CHECK \(internal\):

If the output allows a reader to identify the subject or purpose of the source document, it fails\.

OUTPUT FORMAT:

Return a JSON object with exactly one field:

\{\{

"lexical\_non\_summary": "<text\>"

\}\}

"""

Listing 1: Prompt for Lexical Distractor Non\-Summary Generation

Models\. For BERTScore, DeBERTa\-XLarge He et al\. \[[2021](https://arxiv.org/html/2606.02750#bib.bib48)\] is used\. For BARTScore, we use the bart\-large Lewis et al\. \[[2019](https://arxiv.org/html/2606.02750#bib.bib49)\] model fine\-tuned on the CNN/Daily Mail summarization dataset\.

## Appendix K Model Editing

### K\.1 Method and dataset

#### AlphaEdit \(null\-space constrained editing\)\.

AlphaEdit is a simple refinement that can be applied on top of a base locate–then–edit method \(e\.g\., ROME/MEMIT\)\. Suppose the base editor proposes an update $\Delta_{\text{base}}$ to a weight matrix $W$ to enforce the desired edit\. AlphaEdit additionally builds a *preservation set* of inputs whose behavior should remain unchanged, and extracts their corresponding key vectors $\{k_i\}_{i=1}^{m}$ at the edited layer\. Let $K_0 \in \mathbb{R}^{d \times m}$ be the matrix with columns $k_i$\. AlphaEdit then projects the base update onto the subspace orthogonal to $\mathrm{span}(K_0)$; since the update acts on keys from the right, the projector is applied on the input side of the update:

$$\Delta_\alpha \;=\; \Delta_{\text{base}}\,\Pi_\perp, \qquad \text{where } \Pi_\perp \text{ projects onto } \mathrm{span}(K_0)^\perp.$$

This guarantees

$$\Delta_\alpha K_0 = 0,$$

so the refined update leaves the model’s mapping on the preservation keys unchanged, while retaining the base editor’s effect on the target edit to the extent that the edit key lies outside $\mathrm{span}(K_0)$\.
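The null\-space constraint can be checked numerically with a small sketch\. It is a simplification, not the AlphaEdit implementation: it builds an orthonormal basis of the preservation keys with Gram\-Schmidt and removes the corresponding component from each row of the update, which amounts to applying the projector on the input \(right\-hand\) side so that $\Delta_\alpha K_0 = 0$ holds\. The function names and the absolute tolerance are hypothetical\.

```python
def orthonormal_basis(keys, tol=1e-10):
    """Gram-Schmidt basis for the span of the key vectors (each of length d)."""
    basis = []
    for k in keys:
        v = list(k)
        for q in basis:
            c = sum(a * b for a, b in zip(v, q))
            v = [a - c * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol:                      # skip linearly dependent keys
            basis.append([a / norm for a in v])
    return basis


def project_update(delta, keys):
    """Remove from each row of delta its component in span(keys)."""
    basis = orthonormal_basis(keys)
    projected = []
    for row in delta:
        r = list(row)
        for q in basis:
            c = sum(a * b for a, b in zip(r, q))
            r = [a - c * b for a, b in zip(r, q)]
        projected.append(r)
    return projected


def apply_update(delta, key):
    """Matrix-vector product delta @ key."""
    return [sum(a * b for a, b in zip(row, key)) for row in delta]
```

For instance, projecting a $2 \times 3$ update against keys spanning the first two coordinates leaves only its third column, so the update maps every preserved key to zero\.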

CounterFact dataset\. CounterFact’s distractor prompts are already partially lexically aligned with the anchor/edit prompt\. To more directly quantify the effect of model editing under controlled lexical overlap, we augment CounterFact with two additional prompt sets: \(*i*\) *lexically similar* prompts, obtained by substituting the subject entity in the original anchor/edit template with distractor entities; and \(*ii*\) *lexically dissimilar* prompts, which preserve the distractor’s semantics while substantially rewriting its surface form\. We generate the lexically dissimilar prompts using an LLM \(gpt\-5\.2\-mini\) with the following prompt:

prompt = f"""

You are a language expert tasked with rewriting prompts\.

Input fields:

\- original\_locality\_prompt: the prompt whose meaning must be preserved exactly

\- edit\_prompt: a reference prompt whose wording you must avoid copying

Task:

Produce THREE rewritten variants of original\_locality\_prompt\.

STRICT meaning preservation \(each variant\):

1\) Ask for the exact same fact as original\_locality\_prompt: same entity name\(s\) \+ same attribute/relation\.

2\) Copy entity name\(s\) EXACTLY as in original\_locality\_prompt \(same spelling/casing\)\. Do not replace entities with descriptions\.

3\) Do NOT add/remove qualifiers \(time, certainty, official/current, "capital", "city of", etc\.\)\. No hints, no assumptions\.

Cloze formatting \(each variant\):

4\) Each prompt must be a PREFIX such that the correct answer should be the next text generated immediately after the final space\.

5\) End with exactly ONE trailing ASCII space and NO trailing punctuation \(no \., ?, \!, :, ;, ,\)\.

6\) Do NOT end with the entity name\(s\)\.

7\) No interrogatives: do NOT use WH\-words \(what/which/where/who/when/how\)\. Use declarative stems only\.

8\) Validity check: appending the correct answer immediately after the final space must yield a complete grammatical result without adding any extra words\.

Anti\-copy/maximize lexical difference from edit\_prompt \(each variant\):

9\) Hard ban: Do NOT reuse any contiguous 2\+ word sequence found in edit\_prompt \(exception: entity tokens\)\.

10\) Content\-word taboo: Avoid reusing ANY non\-entity content word from edit\_prompt \(verbs/nouns/adjectives/adverbs\)\.

\- Allowed function\-word set \(may repeat\): \{the, a, an, of, to, in, on, at, from, for, with, by, is, are, was, were, be, been, being, as\}

\- If a relation word from original\_locality\_prompt also appears in edit\_prompt, you should paraphrase it using a synonym NOT present in edit\_prompt, while preserving meaning\.

11\) Target: minimize overlap of non\-entity content words with edit\_prompt to 0 whenever possible\.

MANDATORY structural diversity \(across the 3 variants\):

Generate exactly one from each frame:

A\) ENTITY\-SUBJECT frame:

"<ENTITY\> <connector\>"

B\) POSSESSIVE/ownership frame:

"<ENTITY\>’s <attribute phrase\> <connector\>" OR "<attribute phrase\> of <ENTITY\> <connector\>"

C\) ATTRIBUTE\-SUBJECT frame:

"The <attribute phrase\> for <ENTITY\> <connector\>"

Additional diversity constraints:

12\) No shared opening 3 tokens across variants\.

13\) Use three different connector tails \(the last 1 to 4 words before the final space must differ across variants\)\.

14\) Across the three outputs, do not reuse the same key content words \(besides the entity and unavoidable attribute noun\)\. Prefer different synonym sets\.

Search\-and\-select requirement \(important\):

15\) For EACH frame \(A/B/C\), internally draft at least 5 candidate rewrites, then choose the best one that:

\- satisfies all constraints,

\- has the fewest non\-entity content words overlapping with edit\_prompt,

\- and differs most from the other selected variants\.

Forbidden:

\- Do not include the literal strings "original\_locality\_prompt" or "edit\_prompt"\.

\- Output JSON only; no extra text\.

Output JSON only \(exactly these keys, no extras\):

\{"rewritten\_prompts": \["\.\.\.", "\.\.\.", "\.\.\."\]\}

"""

Listing 2: Data generation prompt for lexically dissimilar distractor/locality prompts

We perform editing on 2,000 sequential edit samples using the Llama\-3\-8B model, which is directly supported by the AlphaEdit codebase, and evaluate with the default base settings\.

### K\.2 Metrics

Let $M_0$ be the base model and $M_1$ the edited model\. For an input $x$, let $\tau_x$ denote the answer position\. Let $p_x^{M}(v) = P_M(v \mid x, \tau_x)$ be the next\-token distribution over the vocabulary $\mathcal{V}$\.

#### Merged top\-$K$ support\.

For $K \in \{20, 50, 100\}$, define

$$S_K(x) = \mathrm{Top\text{-}K}\bigl(p_x^{M_0}\bigr) \cup \mathrm{Top\text{-}K}\bigl(p_x^{M_1}\bigr), \qquad Z_x^{M}(K) = \sum_{u \in S_K(x)} p_x^{M}(u),$$

and the renormalized distributions on the merged support

$$\tilde{p}_{x,K}^{M}(v) = \frac{p_x^{M}(v)}{\max\{Z_x^{M}(K), \varepsilon_{\text{stab}}\}}, \qquad v \in S_K(x),$$

with $\varepsilon_{\text{stab}} = 10^{-12}$\. All @K divergence metrics are computed on $(\tilde{p}_{x,K}^{M_0}, \tilde{p}_{x,K}^{M_1})$\.
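The merged\-support construction can be sketched with dictionaries that map token IDs to probabilities\. This representation and the function names are assumptions made for illustration; a practical implementation would more likely operate on tensors of logits\.

```python
def top_k(dist, k):
    """Token ids of the k most probable entries of a {token: prob} mapping."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])


def merged_support(p0, p1, k, eps=1e-12):
    """Renormalize both distributions on the union of their top-k tokens."""
    support = top_k(p0, k) | top_k(p1, k)

    def renorm(p):
        z = sum(p.get(v, 0.0) for v in support)       # mass on merged support
        return {v: p.get(v, 0.0) / max(z, eps) for v in support}

    return renorm(p0), renorm(p1)
```

Taking the union of both top\-$K$ sets ensures that a token promoted by the edit is compared against its pre\-edit probability even if it was outside the base model’s top $K$\.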

#### Jensen–Shannon divergence \(JSD@K\)\.

Let $m_{x,K} = \tfrac{1}{2}\bigl(\tilde{p}_{x,K}^{M_0} + \tilde{p}_{x,K}^{M_1}\bigr)$\. Then

$$\mathrm{JSD}_K(x) = \frac{1}{2}\mathrm{KL}\!\left(\tilde{p}_{x,K}^{M_0} \,\|\, m_{x,K}\right) + \frac{1}{2}\mathrm{KL}\!\left(\tilde{p}_{x,K}^{M_1} \,\|\, m_{x,K}\right),$$

where

$$\mathrm{KL}(r \| s) = \sum_{v \in S_K(x)} r(v) \log \frac{\max\{r(v), \varepsilon_{\text{stab}}\}}{\max\{s(v), \varepsilon_{\text{stab}}\}}.$$

#### Total variation distance \(TV@K\)\.

$$\mathrm{TV}_K(x) = \frac{1}{2} \sum_{v \in S_K(x)} \left|\tilde{p}_{x,K}^{M_0}(v) - \tilde{p}_{x,K}^{M_1}(v)\right|.$$
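Both divergences follow directly from the definitions above\. The sketch assumes the two inputs are dictionaries over the same merged support that have already been renormalized; the function names are hypothetical, and natural logarithms are used, so the maximum JSD is $\log 2$\.

```python
import math

EPS = 1e-12  # stabilizer, matching epsilon_stab in the text


def kl(r, s):
    """Stabilized KL divergence between two distributions on the same support."""
    return sum(r[v] * math.log(max(r[v], EPS) / max(s[v], EPS)) for v in r)


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between p and q."""
    m = {v: 0.5 * (p[v] + q[v]) for v in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def tv(p, q):
    """Total variation distance between p and q."""
    return 0.5 * sum(abs(p[v] - q[v]) for v in p)
```

Two distributions with disjoint mass reach the upper bounds of both metrics \($\log 2$ for JSD and 1 for TV\), while identical distributions give zero for both\.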

#### Kendall’s $\tau$ \(rank stability on $S_K(x)$\)\.

Let $r_{x,K}^{M}(v)$ be the rank of token $v \in S_K(x)$ when sorting $p_x^{M}(v)$ in descending order\. Over unordered pairs $\{u, v\} \subset S_K(x)$, define

$$a = r_{x,K}^{M_0}(u) - r_{x,K}^{M_0}(v), \qquad b = r_{x,K}^{M_1}(u) - r_{x,K}^{M_1}(v).$$

Pairs with $a = 0$ or $b = 0$ \(ties in either ranking\) are skipped\. Let $C$ and $D$ be the number of remaining concordant and discordant pairs, where concordant means $ab > 0$ and discordant means $ab < 0$\. Then

$$\tau_K(x) = \begin{cases} \frac{C - D}{C + D}, & C + D > 0, \\[3pt] 0, & C + D = 0. \end{cases}$$
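The tie\-skipping variant of Kendall’s $\tau$ described above can be sketched as follows\. The sketch assumes competition ranking \(equal probabilities share a rank\), which is one way to make ties produce $a = 0$ or $b = 0$; the text does not say how ranks are assigned to tied tokens\. Token IDs must be sortable, and the function names are hypothetical\.

```python
from itertools import combinations


def ranks(dist, support):
    """Rank of each token (1 = most probable); equal probabilities share a rank."""
    probs = {v: dist.get(v, 0.0) for v in support}
    return {v: 1 + sum(1 for u in support if probs[u] > probs[v])
            for v in support}


def kendall_tau(p0, p1, support):
    """Kendall's tau over the support, skipping pairs tied in either ranking."""
    r0, r1 = ranks(p0, support), ranks(p1, support)
    concordant = discordant = 0
    for u, v in combinations(sorted(support), 2):
        a, b = r0[u] - r0[v], r1[u] - r1[v]
        if a == 0 or b == 0:        # tie in either ranking: skip the pair
            continue
        if a * b > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

A value of 1 means the edit preserved the relative order of all compared tokens, and \-1 means the order was fully reversed\.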

#### Top\-1 flip rate\.

$$\mathrm{Flip}(x) = \mathbf{1}\!\left[\arg\max_{v \in \mathcal{V}} p_x^{M_0}(v) \neq \arg\max_{v \in \mathcal{V}} p_x^{M_1}(v)\right].$$

#### Edited\-Target Presence@K \(leakage onto locality prompts\)\.

Let $\mathcal{T} \subset \mathcal{V}$ be the set of relation\-specific edited target token ids, and let $\mathcal{L}$ be the set of locality/distractor inputs\. For $x \in \mathcal{L}$,

$$\mathrm{ETP}_K(x) = \mathbf{1}\!\left[\mathrm{Top\text{-}K}\bigl(p_x^{M_1}\bigr) \cap \mathcal{T} \neq \emptyset\right], \qquad ETP_K = \frac{1}{|\mathcal{L}|} \sum_{x \in \mathcal{L}} \mathrm{ETP}_K(x).$$
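Both indicator metrics are simple to compute once the next\-token distributions are available\. The sketch below assumes dictionaries from token IDs to probabilities and uses a single target set for all locality prompts, which simplifies the relation\-specific sets described above; the function names and the toy tokens in the usage note are hypothetical\.

```python
def top1_flip(p0, p1):
    """1 if the most probable token differs between base and edited model."""
    return int(max(p0, key=p0.get) != max(p1, key=p1.get))


def etp_at_k(edited_dists, target_ids, k):
    """Fraction of locality prompts whose post-edit top-k contains an edited target."""
    if not edited_dists:
        raise ValueError("need at least one locality prompt")
    targets = set(target_ids)
    hits = 0
    for dist in edited_dists:
        top = sorted(dist, key=dist.get, reverse=True)[:k]
        hits += int(bool(targets.intersection(top)))
    return hits / len(edited_dists)
```

For two locality prompts where the edited target is the top token for only one of them, `etp_at_k(..., k=1)` returns 0\.5, and it rises toward 1 as $K$ grows and the target enters more top\-$K$ lists\.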
Results\. Results for top\-$K$ with $K = 100$ are provided in Table [5](https://arxiv.org/html/2606.02750#A11.T5)\.

Table 5: Distribution\-shift metrics on *lexically similar and dissimilar* prompts \(Top\-K\-100\)\.

## Appendix L Compute Resources

All experiments are conducted on an NVIDIA H100 GPU with 96 GB of VRAM\. Several experiments, including the semantic probing evaluations, require storing intermediate embeddings on disk; these runs used up to 512 GB of storage and 128 GB of system RAM\.
