Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

arXiv cs.CL Papers

Summary

This Systematization of Knowledge paper proposes a unified Multi-Trait Multi-Method (MTMM) geometric framework for evaluating Large Language Models, unifying disparate metrics into a shared latent coordinate space to address construct validity issues in current benchmarks.

arXiv:2605.08522v1 Announce Type: new Abstract: The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.

Cached at: 05/12/26, 06:51 AM

# Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
Source: [https://arxiv.org/html/2605.08522](https://arxiv.org/html/2605.08522)
Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan. Systems and Software Lab \(SSL\), Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh. \{adibsakhawat, tahsinislam, takiafarhin, rifatraiyan, hasan, hasank\}@iut\-dhaka\.edu

###### Abstract

The evaluation of Large Language Models \(LLMs\) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance—such as prompt sensitivity—with true latent capabilities\. Concurrently, emerging research indicates that LLM capabilities and outputs can be rigorously modeled as continuous geometric manifolds\. In this Systematization of Knowledge \(SoK\), we bridge these paradigms by proposing a generalized Multi\-Trait Multi\-Method \(MTMM\) framework for LLM evaluation\. We mathematically formalize and unify nine disparate metrics—ranging from Paraphrase Instability and Drift Score to Overton Width and Pluralism—interpreting them not as isolated scalar scores, but as geometric measurements \(displacements, spans, and distances\) within a shared latent coordinate space\. This spatial unification factorizes model behavior into three orthogonal latent dimensions: \(1\) Instability and Sensitivity, \(2\) Position and Alignment, and \(3\) Coverage and Expressiveness\. By systematically isolating task\-irrelevant perturbations from true capability spans, our framework provides a robust, domain\-agnostic taxonomy that moves the community toward theoretically grounded and empirically stable benchmark design\.


## 1 Introduction

The rapid expansion of Large Language Model \(LLM\) capabilities has precipitated a severe crisis in evaluation methodology\. Recent systematic reviews of the evaluation landscape reveal a discipline plagued by fragmented metrics, ad hoc benchmark design, and pervasive data contamination \(Chang et al\., [2023](https://arxiv.org/html/2605.08522#bib.bib1); Ni et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib4); Deng et al\., [2023](https://arxiv.org/html/2605.08522#bib.bib6)\)\. Crucially, current benchmarks frequently fail basic criteria for construct validity, conflating surface\-level method variance \(such as prompt formulation or reference\-text artifacts\) with the true underlying constructs they intend to measure \(Bean et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib2); Kearns, [2026](https://arxiv.org/html/2605.08522#bib.bib3)\)\. As a result, static leaderboards produce heterogeneous, scalar scores that struggle to reliably rank models or predict downstream robustness \(Zhang et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib5)\)\.

Concurrently, a growing body of work has begun to formalize LLM representations through a geometric and spatial lens\. Analyses of internal activations and token embeddings demonstrate that language models inherently operate on structured, low\-dimensional latent manifolds \(Lee et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib11); Ning et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib12); Choi and Weber, [2026](https://arxiv.org/html/2605.08522#bib.bib13)\)\. Furthermore, Item Response Theory frameworks have successfully embedded both models and evaluation tasks into shared Euclidean spaces, modeling capabilities as geometric interactions \(Yao et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib10); Yu and coauthors, [2026](https://arxiv.org/html/2605.08522#bib.bib15)\)\.

Despite these spatial insights at the representation level, output\-space evaluation metrics remain theoretically disconnected\. The community treats prompt sensitivity \(Chatterjee et al\., [2024](https://arxiv.org/html/2605.08522#bib.bib16); Hida et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib19)\), multi\-turn drift \(Dongre et al\., [2025a](https://arxiv.org/html/2605.08522#bib.bib21); Li et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib22)\), LLM\-as\-a\-judge inconsistency \(Ye et al\., [2024b](https://arxiv.org/html/2605.08522#bib.bib25)\), and ideological Overton windows \(Azzopardi and Moshfeghi, [2025a](https://arxiv.org/html/2605.08522#bib.bib29); Poole\-Dayan et al\., [2026a](https://arxiv.org/html/2605.08522#bib.bib31)\) as isolated phenomena\.

In this Systematization of Knowledge \(SoK\), we bridge this theoretical gap by proposing a generalized Multi\-Trait Multi\-Method \(MTMM\) framework mapped onto a shared geometric coordinate space\. We mathematically formalize and unify nine disparate evaluation metrics—including the Paraphrase Instability Score \(PIS\), Drift Score \(DS\), and Pluralism Score \(PS\)—reinterpreting them as spatial measurements of displacement, span, and distance\. By projecting these metrics into an MTMM matrix, we demonstrate that evaluation scores are not independent constructs, but rather noisy observables of three orthogonal latent dimensions: \(1\) Instability and Sensitivity, \(2\) Position and Alignment, and \(3\) Coverage and Expressiveness\.

Our core contributions are as follows:

- We systematize the current literature on LLM evaluation, highlighting the necessity of latent\-construct modeling to overcome the construct validity crisis\.
- We formalize nine widely used but previously disconnected evaluation metrics into explicit geometric equations operating in a generalized output space\.
- We introduce an MTMM taxonomy that rigorously disentangles task\-irrelevant method variance from true capability spans, providing a foundational blueprint for the next generation of robust, domain\-agnostic LLM evaluation frameworks\.

## 2 Background: Construct Validity and Geometric Representations

To systematically motivate a Multi\-Trait Multi\-Method \(MTMM\) framework, we must first map the dual trajectories of recent NLP evaluation literature: the growing consensus of a construct validity crisis, and the parallel, independent discovery that language model capabilities operate on structured, low\-dimensional geometric manifolds\.

### 2\.1 The Crisis of Construct Validity

The current LLM evaluation paradigm is characterized by a proliferation of static benchmarks that increasingly fail to measure the latent constructs they claim to represent\. Broad surveys highlight an evaluation landscape that is heavily fragmented, relying on ad hoc metric designs rather than theory\-driven measurement \(Chang et al\., [2023](https://arxiv.org/html/2605.08522#bib.bib1); Ni et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib4)\)\. In a systematic review of 445 benchmarks from top\-tier venues, Bean et al\. \([2025](https://arxiv.org/html/2605.08522#bib.bib2)\) found pervasive failures in basic construct validity, noting a systemic inability to map test items robustly to well\-defined capabilities\.

This disconnect is severely exacerbated by dataset contamination\. Retrieval\-based and probe\-based analyses demonstrate that models frequently exploit pretraining overlaps, successfully guessing masked answer options in benchmarks like MMLU over 50% of the time \(Deng et al\., [2023](https://arxiv.org/html/2605.08522#bib.bib6)\)\. Consequently, leaderboards often capture method variance \(such as memorization or sensitivity to reference\-text artifacts\) rather than true trait variance \(Sottana et al\., [2023](https://arxiv.org/html/2605.08522#bib.bib7)\)\.

Aggregate scalar scores further mask this issue by hiding severe internal heterogeneity\. Kim et al\. \([2026](https://arxiv.org/html/2605.08522#bib.bib8)\) demonstrate through “functional fragmentation” that models with identical top\-line scores exhibit radically divergent sub\-capability profiles\. To model this rigorously, Kearns \([2026](https://arxiv.org/html/2605.08522#bib.bib3)\) applied a structured capabilities model, proving that naive factor models tend to proxy model size rather than specific capabilities unless explicit latent\-construct modeling is used\. This consensus necessitates frameworks that explicitly disentangle measurement error from latent ability, a function natively fulfilled by an MTMM architecture \(Zhang et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib5)\)\.

### 2\.2 The Geometric Topology of LLM Representations

Concurrently, a distinct body of research establishes that the internal representations and outputs of LLMs exhibit rigorous geometric structure, treating high\-dimensional latent spaces as the primary substrate for reasoning \(Yu and coauthors, [2026](https://arxiv.org/html/2605.08522#bib.bib15)\)\. Analysis of token embeddings across diverse architectures reveals shared global orientations and local manifold structures, suggesting that semantic capabilities map to consistent coordinate spaces regardless of the specific model parameters \(Lee et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib11)\)\.

This geometric regularity is highly observable; dimensionality reduction of layer\-wise activations reveals clear structural separations between attention and MLP components \(Ning et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib12)\), and manifold learning has successfully recovered latent affective and semantic topologies directly from embeddings \(Choi and Weber, [2026](https://arxiv.org/html/2605.08522#bib.bib13)\)\.

Most critically for evaluation, the Joint Embedding Item Response Theory \(JE\-IRT\) proposed by Yao et al\. \([2025](https://arxiv.org/html/2605.08522#bib.bib10)\) embeds both language models and evaluation questions into a shared Euclidean space\. By encoding question semantics as directional vectors and difficulties as norms, JE\-IRT mathematically proves that interactions between models and benchmarks are fundamentally geometric phenomena\. Our framework extends this spatial intuition, arguing that metrics evaluating instability, alignment, and pluralism are universally measurable as displacements and spans within these manifolds\.

### 2\.3 Empirical Symptoms: Instability, Drift, and Evaluator Bias

Without a unifying geometric framework, the field has treated the structural fragility of LLMs as isolated anomalies rather than facets of a shared latent “Instability” trait\.

##### Prompt Sensitivity\.

Research shows drastic performance swings induced by surface\-level prompt variations\. The Prompt Sensitivity Index \(POSIX\) demonstrates substantial log\-likelihood shifts across intent\-preserving variants \(Chatterjee et al\., [2024](https://arxiv.org/html/2605.08522#bib.bib16)\)\. Similarly, minor structural changes in Japanese prompt templates have been shown to halve task accuracy, even for frontier models \(Gan and Mori, [2023](https://arxiv.org/html/2605.08522#bib.bib17)\)\. Crucially, social bias evaluations can entirely flip model rankings depending on instruction phrasing and few\-shot exemplars \(Hida et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib19)\)\. These variations empirically validate that prompt sensitivity must be modeled as an explicit latent dimension rather than ignored as noise\.

##### Multi\-Turn Drift\.

In conversational settings, single\-turn accuracy scores fail to predict multi\-turn robustness \(Kwan et al\., [2024](https://arxiv.org/html/2605.08522#bib.bib20)\)\. Dongre et al\. \([2025a](https://arxiv.org/html/2605.08522#bib.bib21)\) model context drift as a bounded stochastic process diverging from a goal\-consistent reference, while survival analyses on adversarial multi\-turn interactions identify semantic drift as a primary hazard factor accelerating time\-to\-failure \(Li et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib22)\)\.

##### Judge Bias\.

When LLMs are utilized as evaluators, they exhibit severe inconsistencies, including familiarity bias, anchoring effects, and skewed rating distributions \(Stureborg et al\., [2024](https://arxiv.org/html/2605.08522#bib.bib24)\)\. Frameworks like CALM define up to twelve distinct judgment bias types, quantifying instability through decision flip rates under principled perturbations \(Ye et al\., [2024b](https://arxiv.org/html/2605.08522#bib.bib25)\)\. Automated perturbation discovery further highlights that even strong LLM judges frequently perform at or below random on complex evaluation instances \(Lai et al\., [2026](https://arxiv.org/html/2605.08522#bib.bib27)\)\.

### 2\.4 Beyond Point Estimates: Overton Windows and Pluralism

Finally, a generalized evaluation framework must distinguish between a model’s central position and the distributional span of its outputs\. Azzopardi and Moshfeghi \([2025a](https://arxiv.org/html/2605.08522#bib.bib29)\) introduced the Political Overton Window \(POW\) framework, mapping the boundary of views models will espouse or refuse, demonstrating that window width is an entirely distinct property from point\-estimate ideology\.

This distinction is formalized via Overton pluralism\. Using set\-coverage metrics, Poole\-Dayan et al\. \([2026a](https://arxiv.org/html/2605.08522#bib.bib31)\) show that OvertonScore is moderately negatively correlated with political neutrality, proving that pluralism \(covering multiple legitimate viewpoints\) is not reducible to producing safe, centrist outputs\. The steerability of these outputs is tied to latent “ideological depth” \(Kabir et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib33)\), and these inferred ideological coordinates are robust enough to successfully substitute for expert political surveys \(Wu, [2025](https://arxiv.org/html/2605.08522#bib.bib34)\)\.

Collectively, this literature dictates that a unified taxonomy must separate Central Position, Output Span, and Bidirectional Coverage into distinct geometric coordinates, directly motivating the formalization in our proposed MTMM framework\.

## 3 The Theoretical Framework: MTMM in Latent Coordinate Space

To resolve the construct validity crisis outlined in Section [2](https://arxiv.org/html/2605.08522#S2), evaluation metrics must move beyond treating language model outputs as isolated, discrete strings or binary correctness labels\. We propose a formalized framework that unifies disparate evaluation methods by embedding them into a shared geometric space\. By mapping model generations to explicit coordinate vectors, we can mathematically factorize evaluation into a Multi\-Trait Multi\-Method \(MTMM\) matrix, where distinct metrics are merely different geometric operators \(e\.g\., distance, variance, convex hull\) applied to the same latent traits\.

### 3\.1 Defining the Output Space: The Mathematical Foundation

Before introducing the formal mathematical construction, we provide an intuitive view of the framework\. Consider a simple scenario where a model is asked the same question using multiple semantically equivalent prompts \(e\.g\., paraphrases\)\. If the model possesses a stable and well\-grounded internal representation, its outputs should remain consistent in meaning, regardless of superficial variations in phrasing\. In a geometric interpretation, this implies that all such outputs should map to nearby points within a shared latent space\. Conversely, if small changes in wording cause large variations in the generated responses, the corresponding points will be widely dispersed, indicating instability\. Extending this intuition, alignment can be understood as the distance between a model’s output and a reference point in this space, while expressiveness corresponds to how widely the model’s outputs span across different regions\. Thus, evaluation reduces to measuring distances, displacements, and spans within a continuous coordinate system, rather than comparing discrete text outputs directly\.

The foundational premise of our framework is the existence of a continuous, low\-dimensional coordinate space in which the semantic, functional, and ideological properties of model outputs can be rigorously quantified\.

Formally, let $m \in \mathcal{M}$ represent the Large Language Model under evaluation, and $p \in \mathcal{P}$ represent the input context or prompt\. The model operates as a function $f_{m}: \mathcal{P} \rightarrow \mathcal{S}$, mapping the prompt to a raw textual output string $s \in \mathcal{S}$\.

Traditional evaluations compute a scalar score directly on $s$ \(e\.g\., exact match, ROUGE, or a pass/fail heuristic\)\. Our geometric approach instead introduces a projection function $\phi$, which embeds the raw text into an $n$\-dimensional Euclidean latent output space, denoted as $\mathcal{O} \subseteq \mathbb{R}^{n}$:

$$\mathbf{o}_{m,p} = \phi(f_{m}(p)) \in \mathbb{R}^{n} \tag{1}$$

Here, $\mathbf{o}_{m,p}$ is the coordinate vector representing the model’s output in the latent space\. As defined in Equation [1](https://arxiv.org/html/2605.08522#S3.E1), the dimensionality $n$ and the semantic meaning of the axes depend entirely on the target evaluation construct:

- For ideological alignment and Overton window analysis, $n=2$ or $n=3$, mapping outputs to coordinates $(x,y,z)$ representing economic, social, and moral axes\.
- For generalized capability evaluations, each of the $n$ axes represents a specific functional trait, so that an output is a multidimensional trait vector, mirroring the Joint Embedding Item Response Theory \(JE\-IRT\) models that project both models and questions into a shared Euclidean space \(Yao et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib10)\)\.

The validity of the projection function $\phi$ defined in Equation [1](https://arxiv.org/html/2605.08522#S3.E1) is heavily supported by recent topological analyses of LLMs\. Because token embeddings and internal activations naturally converge onto shared, low\-dimensional geometric manifolds across different model architectures \(Lee et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib11); Ning et al\., [2025](https://arxiv.org/html/2605.08522#bib.bib12)\), it follows that the resulting functional outputs inhabit a similarly structured, continuous space\.

By defining the output space as $\mathcal{O} \subseteq \mathbb{R}^{n}$, we reduce the chaotic variability of natural language evaluation to pure geometry\. If $\mathbf{o}_{1}$ and $\mathbf{o}_{2}$ are the model’s coordinate responses to two distinct but semantically equivalent prompts, then the Euclidean distance between them, $\|\mathbf{o}_{1} - \mathbf{o}_{2}\|_{2}$, is no longer just a metaphorical "difference"; it is a mathematically precise measurement of instability\.
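As a minimal illustration of this distance reading, the Python sketch below treats two hand\-picked coordinate pairs as stand\-ins for $\mathbf{o}_{1}$ and $\mathbf{o}_{2}$\. The coordinates and the helper name `displacement` are assumptions made for illustration only; no real projection $\phi$ is involved\.

```python
import math

# Hypothetical latent coordinates for two paraphrases of one question.
# In practice these would come from a projection phi(f_m(p)); here they are
# toy values chosen by hand, not outputs of any real embedding model.
o_1 = (1.0, 1.0)
o_2 = (4.0, 5.0)


def displacement(a, b):
    """Euclidean distance ||a - b||_2 between two coordinate vectors."""
    return math.dist(a, b)


# A larger value means the two paraphrases landed further apart.
instability = displacement(o_1, o_2)
```

Under this reading, a displacement of zero would mean the two paraphrases map to the same point in the latent space\.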

This geometric foundation allows us to systematically construct our MTMM matrix\. If the coordinate space $\mathcal{O}$ represents the true "traits" \(the model’s actual capabilities or alignments\), then the perturbations we apply to the prompt $p$ \(e\.g\., paraphrasing, multi\-turn adversarial attacks, language switching\) are the "methods\." The resulting geometric movements, measured as point\-to\-point displacements, bounding box widths, or regional set\-coverage, yield the formalized metrics explored in the following sections\.

### 3\.2 The MTMM Matrix for LLMs: Disentangling Trait from Method Variance

In classical psychometrics, the Multi\-Trait Multi\-Method \(MTMM\) matrix assesses construct validity by evaluating multiple latent traits using multiple measurement methods, ensuring that variance in scores is driven by the underlying construct \(trait variance\) rather than the artifact of the test itself \(method variance\)\. Adapted to the geometric output space $\mathcal{O}$ defined in Equation [1](https://arxiv.org/html/2605.08522#S3.E1), an LLM’s observed behavior is heavily confounded by the evaluation method, specifically the syntactic phrasing, the conversational depth, or the designated reasoning pathway\.

To formalize this, we model any evaluation metric as an operation over an expected coordinate position $\mathbf{o}_{m,p}$ and a perturbation operator $\mu \in \mathcal{U}$ representing the evaluation method \(e\.g\., paraphrase generation, temporal advancement, or language translation\)\. The observed geometric displacement or spatial span, $\Delta_{\text{obs}}$, can be decomposed as:

$$\Delta_{\text{obs}} = \Delta_{\text{trait}} + \Delta_{\text{method}} + \epsilon \tag{2}$$

where $\Delta_{\text{trait}}$ is the true topological manifestation of the model’s capability, $\Delta_{\text{method}}$ is the systematic bias introduced by the perturbation $\mu$, and $\epsilon$ is random error\. Current single\-score leaderboards implicitly assume $\Delta_{\text{method}} \approx 0$, which the literature definitively refutes\.

\[Figure 1 diagram: Prompt $p \in \mathcal{P}$ → LLM $f_{m}$ → Output String $s \in \mathcal{S}$ → Latent Space $\mathcal{O} \subseteq \mathbb{R}^{n}$ with axes $x_{1}$, $x_{2}$, $x_{3}$ and point $\mathbf{o}_{m,p}$; arrows labeled $f_{m}(p)$ and $\phi(s)$\.\]

Figure 1: The geometric projection pipeline mapping discrete textual outputs \($s$\) via the embedding function \($\phi$\) into the continuous latent coordinate space \($\mathcal{O}$\)\.

Our generalized framework constructs an MTMM matrix by systematically applying diverse perturbation methods across three orthogonal latent traits: \(1\) Instability and Sensitivity, \(2\) Position and Alignment, and \(3\) Coverage and Expressiveness\.

Table 1: The MTMM Matrix Mapping LLM Evaluation Metrics to Latent Traits and Perturbation Methods\.

| Latent Trait | Perturbation Method \($\mu$\) | Metric | Abbr\. |
| --- | --- | --- | --- |
| 1\. Instability & Sensitivity | Semantic Paraphrasing | Paraphrase Instability Score | PIS |
| 1\. Instability & Sensitivity | Persona / Role Injection | Prompt Sensitivity Score | PSS |
| 1\. Instability & Sensitivity | Temporal / Multi\-Turn Extension | Drift Score | DS |
| 1\. Instability & Sensitivity | Cross\-Lingual Mapping | Linguistic Divergence Score | LDS |
| 1\. Instability & Sensitivity | Response Mode / CoT Routing | Reasoning Stability Score | RSS |
| 2\. Position & Alignment | Reference Grounding | Output Distance Score | IDS |
| 2\. Position & Alignment | Contextual Bounding | Output Distribution Width | OW |
| 3\. Coverage & Expressiveness | Dual\-Sided / Contested Prompting | Bidirectional Coverage Score | PS |
| Meta: Evaluator Stability | Judge Perturbation | Judge Bias Score | JBS |

Table [1](https://arxiv.org/html/2605.08522#S3.T1) illustrates this geometric MTMM classification\. By treating evaluation through this matrix, we can isolate actual capabilities\. For instance, if we aim to measure the latent trait of Instability, we must observe the model’s coordinate displacement under semantic paraphrasing \(PIS\), prompt framing \(PSS\), and cross\-lingual translation \(LDS\) simultaneously\.

\[Figure 2 diagram: axes Dimension 1 \(e\.g\., Economic\) and Dimension 2 \(e\.g\., Social\); labeled elements are the Centroid \(Alignment/Position\), the Output Distribution Width \(Span / $\mathcal{H}_{90}$\), and Instability / Drift \(Displacement $\Delta$\)\.\]

Figure 2: Geometric factorization of latent evaluation traits in the MTMM framework\. Alignment is measured by the distribution’s centroid, Instability by intra\-method coordinate displacement, and Expressiveness by the span of the bounding convex hull\.

If a model exhibits a high LDS but a low PIS, Equation [2](https://arxiv.org/html/2605.08522#S3.E2) allows us to diagnose that the instability is not a general trait of the model’s semantic processing, but a specific method variance tied to the cross\-lingual mapping \($\Delta_{\text{method}}$\)\. Conversely, consistently high displacements across all $\mu \in \mathcal{U}$ indicate true trait instability \($\Delta_{\text{trait}}$\)\.
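The diagnostic reading of Equation 2 described above can be sketched in Python as follows\. This is an illustrative assumption rather than a procedure given in the paper: the function name `diagnose`, the single `threshold` separating "high" from "low" displacement, and the example values are all hypothetical\.

```python
def diagnose(displacements, threshold):
    """Classify a pattern of per-method displacements.

    displacements: dict mapping a metric abbreviation (e.g. "PIS", "LDS")
        to the displacement observed under that perturbation method.
    threshold: hypothetical cut-off above which a displacement counts as
        "high"; choosing it is outside the scope of this sketch.

    Returns a (label, high_methods) pair where label is
      "stable" - no method produces a high displacement,
      "trait"  - every method does (consistent with trait instability),
      "method" - only some do (consistent with method-specific variance).
    """
    high = sorted(name for name, value in displacements.items() if value > threshold)
    if not high:
        return ("stable", [])
    if len(high) == len(displacements):
        return ("trait", high)
    return ("method", high)


# Toy values: high cross-lingual divergence, low paraphrase instability.
example = diagnose({"PIS": 0.1, "PSS": 0.2, "LDS": 0.9}, threshold=0.5)
```

With these toy numbers only LDS exceeds the cut\-off, so the pattern is attributed to the cross\-lingual method rather than to the model’s general semantic processing\.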

This factorization prevents the common pitfall of confusing a model’s narrow Overton Width \(a trait of restricted position\) with low Pluralism \(a trait of poor bidirectional coverage\) or equating a robust prompt template with high latent reasoning capability\. The explicit formalization of these metrics within the output space $\mathcal{O}$ follows in Section [4](https://arxiv.org/html/2605.08522#S4)\.

### 3\.3 Mathematical Boundaries and Identifiability

To prevent theoretical overreach, we must explicitly bound the MTMM\-geometric framework\. It does not claim to unify all LLM evaluation; rather, it provides an identifiable structure strictly for semantically continuous generation tasks \(e\.g\., ideological positioning, alignment, pluralism\)\. We formalize the operational boundaries of this space through three strict definitions\.

###### Definition 1 \(The Bounded Metric Space\)\.

We define the latent output space $\mathcal{O}$ not merely as a loose topological manifold, but strictly as a metric space $(\mathcal{O}, d)$, where $d$ is a standardized distance metric \(such as Mahalanobis distance or scale\-normalized Euclidean distance\)\.

Defining $\mathcal{O}$ as a metric space is mandatory for trait identifiability\. It guarantees that a unit of spatial displacement measured by the Paraphrase Instability Score \(Dimension 1\) is mathematically comparable to a unit of geometric span measured by the Overton Width \(Dimension 2\)\.
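One of the two standardized metrics named in Definition 1, scale\-normalized Euclidean distance, can be sketched in Python as below\. The choice of the population standard deviation as the per\-axis scale, the helper names, and the sample points are assumptions for illustration; this form coincides with Mahalanobis distance only when the covariance matrix is diagonal\.

```python
import math
import statistics


def axis_scales(points):
    """Per-axis population standard deviation of a reference sample."""
    scales = [statistics.pstdev(axis) for axis in zip(*points)]
    if any(s == 0 for s in scales):
        raise ValueError("an axis with zero spread cannot be normalized")
    return scales


def normalized_distance(a, b, scales):
    """Euclidean distance after dividing each axis difference by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


# Toy reference sample: the second axis has ten times the spread of the first.
sample = [(0.0, 0.0), (2.0, 20.0), (4.0, 40.0)]
scales = axis_scales(sample)

# After normalization, a move of 4 on axis 1 and a move of 40 on axis 2
# count as displacements of the same size.
d_axis1 = normalized_distance((0.0, 0.0), (4.0, 0.0), scales)
d_axis2 = normalized_distance((0.0, 0.0), (0.0, 40.0), scales)
```

The point of the normalization is that displacements along differently scaled axes become comparable, which is the property Definition 1 requires of $d$\.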

###### Definition 2 \(Lipschitz Continuity of the Projection Function\)\.

The projection function $\phi: \mathcal{S} \rightarrow \mathcal{O}$ mapping discrete text to continuous coordinates must satisfy Lipschitz continuity\. For any two textual outputs $s_{1}, s_{2} \in \mathcal{S}$ and a baseline semantic distance function $d_{\mathcal{S}}$, there exists a real constant $K \geq 0$ such that:

$$d(\phi(s_{1}), \phi(s_{2})) \leq K \cdot d_{\mathcal{S}}(s_{1}, s_{2}) \tag{3}$$

Equation [3](https://arxiv.org/html/2605.08522#S3.E3) provides the mathematical guarantee that a minor semantic perturbation in the generated text results in a proportionally bounded geometric displacement in the latent space\. If the chosen embedding model $\phi$ violates this condition, the coordinate projections become chaotic, and the MTMM framework cannot reliably factorize trait variance from method variance\.
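A finite sample cannot verify the condition in Equation 3, but it can expose violations\. The Python sketch below computes the largest observed ratio between output\-space and text\-space distances, which is only a lower bound on the true constant $K$\. The function name `estimate_lipschitz` and the toy stand\-ins for $\phi$ and $d_{\mathcal{S}}$ are assumptions; string length is used purely as a placeholder and is not a real semantic distance\.

```python
import itertools
import math


def estimate_lipschitz(texts, phi, d_out, d_text):
    """Largest observed ratio d_out(phi(s1), phi(s2)) / d_text(s1, s2).

    The result is an empirical lower bound on K: pairs outside the sample
    may produce a larger ratio. Returns math.inf if two texts at zero
    text distance are projected to different points, since no finite K
    can satisfy the bound in that case.
    """
    worst = 0.0
    for s1, s2 in itertools.combinations(texts, 2):
        base = d_text(s1, s2)
        moved = d_out(phi(s1), phi(s2))
        if base == 0:
            if moved > 0:
                return math.inf
            continue
        worst = max(worst, moved / base)
    return worst


def toy_phi(text):
    """Placeholder projection: one coordinate, twice the string length."""
    return (2.0 * len(text),)


def toy_d_text(a, b):
    """Placeholder text distance: absolute difference in string length."""
    return abs(len(a) - len(b))


k_hat = estimate_lipschitz(["a", "abc", "abcdef"], toy_phi, math.dist, toy_d_text)
```

In practice one would substitute the actual embedding model and a semantic distance for the placeholders; a very large or infinite estimate would suggest the projection is unsuitable for the framework\.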

###### Definition 3 \(The Boundary Condition: Semantic vs\. Symbolic\)\.

The framework is valid exclusively for evaluating conditional distributions over continuous semantic spaces\. It is formally invalid for evaluating discrete formal logic, arithmetic, or executable code generation\.

This boundary condition is a direct consequence of Equation [3](https://arxiv.org/html/2605.08522#S3.E3)\. In symbolic tasks, a single\-character algorithmic perturbation \(e\.g\., changing a `+` to a `-`, or `==` to `=`\) drastically alters the formal execution state despite minimal textual distance, severely violating the Lipschitz constraint\. Therefore, the MTMM\-geometric framework is explicitly constrained to evaluating semantic, cultural, and ideological topologies, where meaning degrades continuously rather than discretely\.
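The single\-character example can be made concrete with a toy Python sketch\. The two expressions, the tiny evaluator `run`, and the mismatch counter are hypothetical illustrations of the boundary condition, not part of the framework itself\.

```python
import operator

# Only the two operators from the example are supported in this toy evaluator.
OPS = {"+": operator.add, "-": operator.sub}


def run(expr):
    """Evaluate a toy 'a op b' expression with integer operands."""
    left, op, right = expr.split()
    return OPS[op](int(left), int(right))


def char_mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


s_1, s_2 = "7 + 5", "7 - 5"
text_gap = char_mismatches(s_1, s_2)     # how far apart the strings are
result_gap = abs(run(s_1) - run(s_2))    # how far apart the results are
```

The strings differ in one position while the results differ by ten, and that gap grows with the operands, so no fixed constant $K$ can bound the change in outcome by the change in text\.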

## 4 Formalizing the Latent Dimensions: Methodological Origins and Geometric Derivations

With the coordinate space $\mathcal{O} \subseteq \mathbb{R}^{n}$ established, we now systematically formalize the nine core evaluation metrics\. Rather than treating these metrics as independent, ad hoc heuristics, this section traces the empirical origin of each metric in the recent literature and rigorously derives its generalized geometric formulation\. By mapping these specific mathematical operations \(displacements, spans, and ratios\) into our shared continuous space, we formally construct the MTMM matrix introduced in Section [3](https://arxiv.org/html/2605.08522#S3)\.

### 4\.1 Dimension 1: Instability and Trait Sensitivity \($\Delta_{\text{method}}$\)

The first latent dimension isolates the structural fragility of the model\. In an ideal, construct\-valid system, altering the method of inquiry without altering the core semantics should result in zero spatial displacement \($\Delta_{\text{method}} = 0$\)\. The metrics in this dimension quantify the geometric noise introduced by the evaluation method itself\.

#### 4\.1\.1 Semantic Robustness: Deriving the Paraphrase Instability Score \(PIS\)

The Empirical Origin: The necessity of quantifying paraphrase variance was starkly highlighted by Röttger et al\. \([2024](https://arxiv.org/html/2605.08522#bib.bib36)\) during their evaluation of LLM values and opinions\. Using the Political Compass Test \(PCT\), the authors discovered that models fundamentally lack paraphrase robustness; they coined the "spinning arrow" metaphor to describe how minor, intent\-preserving variations in prompt wording triggered massive, unpredictable shifts in the model’s output stances\.

\[Figure 3 diagram: axes Dimension 1 and Dimension 2; centroid $(\bar{x}, \bar{y})$; variant outputs $(x_{v_{1}}, y_{v_{1}})$, $(x_{v_{2}}, y_{v_{2}})$, $(x_{v_{3}}, y_{v_{3}})$, $(x_{v_{4}}, y_{v_{4}})$; distance $\|o_{v} - \bar{o}\|_{2}$\.\]

Figure 3: Geometric representation of the Paraphrase Instability Score \(PIS\)\. The metric calculates the expected Euclidean displacement of semantically equivalent prompt variants \($v_{1} \dots v_{4}$\) from their construct centroid\.

Geometric Rationale for PIS\. In a geometric MTMM framework, a true latent trait \(e\.g\., a model’s underlying stance on a socio\-economic issue or its comprehension of a specific idiom\) cannot be observed directly; it must be inferred through the application of specific measurement methods \(prompts\)\. If the coordinate space $\mathcal{O}$ is topologically valid, semantic equivalence must strictly dictate spatial proximity\.

When evaluating subjective values or complex reasoning, there is rarely an absolute "ground truth" coordinate to measure against\. To resolve this, PIS designates the model’s central tendency across all semantically equivalent prompt variants, the centroid, as the best geometric estimator of the unobservable latent trait \($\Delta_{\text{trait}}$\)\.

The individual outputs for each paraphrase represent the observable behavior, which is contaminated by the specific syntax of the prompt\. By calculating the Euclidean distance of each individual output from the model’s own centroid, PIS mathematically isolates the spatial error introduced solely by the phrasing \($\Delta_{\text{method}}$\)\. Measuring displacement from the centroid, rather than relying on pairwise distances, prevents outlier responses from quadratically skewing the metric\. A high PIS geometrically proves that the model’s responses are anchored to surface\-level syntactical artifacts rather than the underlying semantic construct, violating the foundational assumption of construct validity\.

The Formal Equation: Generalizing from the 2D political compass space to any low-dimensional coordinate space, we define the Paraphrase Instability Score (PIS) as the expected Euclidean displacement from the construct centroid:

$$\text{PIS}(m)=\frac{1}{V}\sum_{v=1}^{V}\sqrt{(x_{v}-\bar{x})^{2}+(y_{v}-\bar{y})^{2}} \tag{4}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $V$ represents the total number of semantically equivalent prompt variants (paraphrases) tested for a single construct.
- $v\in\{1,\dots,V\}$ is the index of a specific paraphrase variant.
- $(x_{v},y_{v})$ are the specific coordinates of the model’s output in the latent space $\mathcal{O}$ when prompted with variant $v$.
- $(\bar{x},\bar{y})$ represents the model’s coordinate centroid for this construct, calculated as the mean position across all variants: $\left(\frac{1}{V}\sum_{v=1}^{V}x_{v},\frac{1}{V}\sum_{v=1}^{V}y_{v}\right)$.

PIS is bounded below by $0$. A PIS of exactly $0$ indicates perfect semantic robustness ($\Delta_{\text{method}}=0$), since every variant then coincides with the centroid.
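As a concrete illustration of Equation 4, the following minimal sketch computes the centroid and the mean displacement from it. The function names and the four example coordinates are hypothetical and chosen only for illustration; the sketch assumes model outputs have already been projected into the coordinate space $\mathcal{O}$.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Per-axis mean of the variant coordinates, i.e. (x-bar, y-bar)."""
    if not points:
        raise ValueError("at least one coordinate is required")
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def paraphrase_instability_score(points: Sequence[Point]) -> float:
    """Mean Euclidean displacement of each variant from the centroid (Equation 4)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Hypothetical coordinates for V = 4 paraphrases of one construct.
variants = [(0.0, 1.0), (2.0, 1.0), (1.0, 0.0), (1.0, 2.0)]
print(centroid(variants))                      # (1.0, 1.0)
print(paraphrase_instability_score(variants))  # 1.0: each variant is 1 unit from the centroid
```

Because `math.dist` accepts points of any matching dimensionality, the same sketch applies beyond the 2D case written out in Equation 4.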

#### 4.1.2 Targeted Perturbation: Deriving the Prompt Sensitivity Score (PSS)

The Empirical Origin: The necessity of isolating the impact of specific prompt structures was formalized by Zhuo et al. ([2024](https://arxiv.org/html/2605.08522#bib.bib37)) through their ProSA framework. The authors demonstrated that LLMs exhibit high variability when presented with semantically diverse but functionally equivalent queries, specifically noting that targeted prompt components (such as persona assignment or format restrictions) drastically alter the output distribution. To quantify this, they introduced the PromptSensiScore (PSS) to compute the expected deviation of model outputs from a reference prompt baseline.

[Figure 4: axes Dimension 1 and Dimension 2; the baseline centroid $\mathbf{o}_{\text{baseline}}$ with its baseline variance ($\sigma_{\text{within}}$), the framing-condition output $\mathbf{o}_{c}$, and the shift $\|\mathbf{o}_{c}-\mathbf{o}_{\text{baseline}}\|_{2}$ between them.]

Figure 4: Geometric derivation of the Prompt Sensitivity Score (PSS). The targeted framing condition $c$ pulls the coordinate $\mathbf{o}_{c}$ away from the baseline centroid $\mathbf{o}_{\text{baseline}}$. The significance of this displacement is normalized by the baseline spatial noise ($\sigma_{\text{within}}$).

Geometric Rationale for PSS. While the Paraphrase Instability Score (PIS) measures the unguided, natural "jitter" of a model under random semantic paraphrasing, evaluators frequently need to measure the impact of directed interventions, such as adversarial jailbreaks, role-playing personas, or emotional coercion.

If we strictly measure the absolute Euclidean displacement between a targeted output and the baseline, we lack geometric context: is a spatial shift of $2.0$ units a catastrophic failure of alignment, or simply the normal operating noise of a highly erratic model?

To resolve this, we adapt the deviation concept from Zhuo et al. ([2024](https://arxiv.org/html/2605.08522#bib.bib37)) into a geometric Cohen’s $d$ effect size within our MTMM framework. By dividing the absolute coordinate shift by the model’s natural baseline spread ($\sigma_{\text{within}}$), PSS standardizes the displacement. It distinguishes between a generally noisy model (where the targeted attack merely gets lost in the high $\sigma_{\text{within}}$ baseline noise) and a naturally stable model that is acutely vulnerable to a specific perturbation. The normalization makes PSS a measure of structural susceptibility relative to background instability, rather than of raw displacement.

The Formal Equation: Projecting the Zhuo et al. ([2024](https://arxiv.org/html/2605.08522#bib.bib37)) deviation baseline into our standardized coordinate space $\mathcal{O}$, we formalize the Prompt Sensitivity Score (PSS) as follows:

$$\text{PSS}(m,c)=\frac{\|\mathbf{o}_{c}-\mathbf{o}_{\text{baseline}}\|_{2}}{\sigma_{\text{within}}} \tag{5}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $c$ represents the specific targeted framing condition or perturbation (e.g., an adversarial persona injection).
- $\mathbf{o}_{c}$ is the resulting coordinate vector in the latent space when the model is prompted under condition $c$.
- $\mathbf{o}_{\text{baseline}}$ is the expected coordinate centroid of the model under neutral, unperturbed baseline conditions.
- $\sigma_{\text{within}}$ is the standard deviation of the model’s output coordinates under the neutral baseline conditions (a spread in coordinate units, not a variance), representing its inherent geometric noise.

A $\text{PSS}>1$ indicates that the injected framing condition $c$ displaces the output by more than one baseline standard deviation. In other words, $c$ exerts a stronger pull on the model’s latent representation than the model’s own baseline instability, which flags a vulnerability to that specific method of interaction.
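Equation 5 can be sketched as follows. The text does not fix an estimator for $\sigma_{\text{within}}$, so this sketch assumes the root-mean-square distance of the baseline outputs from their centroid; the function names and coordinates are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Per-axis mean of a set of coordinates."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def sigma_within(baseline: Sequence[Point]) -> float:
    """ASSUMPTION: baseline spread taken as the RMS distance from the baseline centroid."""
    c = centroid(baseline)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in baseline) / len(baseline))


def prompt_sensitivity_score(o_c: Point, baseline: Sequence[Point]) -> float:
    """Shift from the baseline centroid, in units of baseline spread (Equation 5)."""
    sigma = sigma_within(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero spread; PSS is undefined")
    return math.dist(o_c, centroid(baseline)) / sigma


# Hypothetical neutral-condition outputs and one output under framing condition c.
baseline = [(0.0, 1.0), (2.0, 1.0), (1.0, 0.0), (1.0, 2.0)]
o_c = (4.0, 5.0)
print(prompt_sensitivity_score(o_c, baseline))  # 5.0: a shift of 5 units against a spread of 1
```

The explicit zero-spread guard reflects a property of the ratio itself: a perfectly stable baseline leaves PSS undefined, so an evaluator would need to report the raw shift in that case.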

#### 4.1.3 Temporal Degradation: Formalizing Drift Score (DS) and Drift Velocity

The Empirical Origin: The evaluation of static prompts fundamentally fails to capture the dynamic degradation models experience during extended conversational interactions. The necessity of temporal metrics was formalized by Dongre et al. ([2025b](https://arxiv.org/html/2605.08522#bib.bib39)), who modeled "Context Equilibria" to observe how models drift from their initial semantic states over successive turns. Concurrently, Khraishi et al. ([2026](https://arxiv.org/html/2605.08522#bib.bib38)) demonstrated that accumulating dialogue history ($H$) predictably shifts model alignment and performance, a vulnerability particularly exposed when switching between models in multi-turn environments.

[Figure 5: axes Dimension 1 and Dimension 2; a trajectory from $\mathbf{o}_{0}$ (Turn 0) through the per-turn shifts $u_{1}, u_{2}, u_{3}, \dots, u_{T}$ to $\mathbf{o}_{T}$ (Turn $T$), with the endpoint drift $\|\mathbf{o}_{T}-\mathbf{o}_{0}\|_{2}$ marked.]

Figure 5: Geometric derivation of the Drift Score (DS). While the model experiences incremental shifts at each conversational turn ($u_{t}$), DS isolates the cumulative temporal degradation as the Euclidean displacement from the initial goal state ($\mathbf{o}_{0}$) to the final turn ($\mathbf{o}_{T}$).

Temporal Stability Rationale. In single-turn metrics like PIS or PSS, the prompt acts as a static, isolated perturbation ($\Delta_{\text{method}}$). However, in a multi-turn environment, the context window $h_{t}$ is a dynamically expanding Markov process. As defined by Dongre et al. ([2025b](https://arxiv.org/html/2605.08522#bib.bib39)), the context history at turn $t$ is recursively appended with the user utterance $u_{t}$ and the system response $a_{t}$: $h_{t}=h_{t-1}\oplus(u_{t},a_{t})$.

Geometrically, every new user utterance $u_{t}$ injects a directional vector into the model’s attention mechanism. If an LLM possesses true structural stability (e.g., rigid safety alignment or a robust persona), its output coordinates $\mathbf{o}_{t}$ should remain tightly anchored to its initial goal state $\mathbf{o}_{0}$, successfully resisting the adversarial "pull" of the conversational history.

To test this, we measure the cumulative drift. Dongre et al. ([2025b](https://arxiv.org/html/2605.08522#bib.bib39)) formalize the distance between the output at turn $t$ and the initial turn as $D_{\text{cumulative}}(t)=\|\phi(a_{t})-\phi(a_{0})\|$. Rather than calculating the sum of incremental, turn-by-turn steps (which only measures local volatility), we calculate the endpoint-to-endpoint translation vector. A high endpoint displacement indicates that the accumulating context tokens "dragged" the model’s representations across the topological space, defeating its alignment anchors.

Furthermore, to allow comparative analysis across evaluation benchmarks with varying conversation lengths $T$, we adopt the normalization principle utilized by Khraishi et al. ([2026](https://arxiv.org/html/2605.08522#bib.bib38)). By dividing the absolute drift by the number of turns, we derive an average per-turn rate of degradation.

The Formal Equations: We project the cumulative drift from Dongre et al. ([2025b](https://arxiv.org/html/2605.08522#bib.bib39)) into our generalized coordinate space $\mathcal{O}$ to define the Drift Score (DS):

$$\begin{split}\text{DS}(m)&=\|\mathbf{o}_{T}-\mathbf{o}_{0}\|_{2}\\ &=\sqrt{(x_{T}-x_{0})^{2}+(y_{T}-y_{0})^{2}}\end{split} \tag{6}$$

To standardize this metric across varying session lengths, we formalize the Drift Velocity:

$$\text{Drift Velocity}(m)=\frac{\text{DS}(m)}{T} \tag{7}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $T$ is the total number of dialogue turns in the evaluated session.
- $\mathbf{o}_{0}=(x_{0},y_{0})$ represents the coordinate projection of the model’s output at the initial turn $t=0$, acting as the baseline alignment state.
- $\mathbf{o}_{T}=(x_{T},y_{T})$ represents the coordinate projection of the model’s output at the final turn $t=T$.

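A Drift Velocity of $0$ indicates that the final output coincides with the initial one ($\mathbf{o}_{T}=\mathbf{o}_{0}$), i.e., the model maintains its context equilibrium at the endpoints despite the temporal accumulation of adversarial or off-topic context vectors. Because DS is an endpoint measure, it does not register intermediate excursions that later return to $\mathbf{o}_{0}$.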
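A minimal sketch of Equations 6 and 7 follows. It assumes the per-turn outputs have already been projected to coordinates and are stored in turn order, so a trajectory of length $T+1$ covers turns $0$ through $T$; the function names and the example trajectory are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def drift_score(trajectory: Sequence[Point]) -> float:
    """Endpoint-to-endpoint displacement between turn 0 and turn T (Equation 6)."""
    return math.dist(trajectory[-1], trajectory[0])


def drift_velocity(trajectory: Sequence[Point]) -> float:
    """Drift Score divided by the number of turns T (Equation 7)."""
    turns = len(trajectory) - 1  # turns are indexed 0..T
    if turns < 1:
        raise ValueError("need at least two turns to measure drift")
    return drift_score(trajectory) / turns


def path_length(trajectory: Sequence[Point]) -> float:
    """Sum of turn-by-turn steps, shown only for contrast with the endpoint measure."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


# Hypothetical coordinates o_0 .. o_3 for a three-turn session.
session = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
print(drift_score(session))     # 5.0
print(drift_velocity(session))  # 5.0 / 3 turns
print(path_length(session))     # longer than DS, because it accumulates local steps
```

The `path_length` helper is not part of the framework; it is included to show that the incremental sum can only equal or exceed the endpoint displacement.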

#### 4.1.4 Cross-Lingual Mapping: Deriving the Linguistic Divergence Score (LDS)

The Empirical Origin: The assumption that multilingual LLMs maintain semantic and ideological consistency across translations was systematically dismantled by Helwe et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib40)). Evaluating frontier models across English, French, and Arabic, the authors discovered that the language of the prompt acted as an overriding perturbator. Rather than translating a stable concept, switching languages frequently shifted the model’s outputs into entirely different quadrants of the political and semantic coordinate space, often overriding explicit persona or nationality instructions.

[Figure 6: axes Dimension 1 and Dimension 2; language centroids $\mathbf{o}_{\ell_{0}}$ (e.g., English, with its intra-language noise, PIS), $\mathbf{o}_{\ell_{1}}$ (e.g., French), and $\mathbf{o}_{\ell_{2}}$ (e.g., Arabic), joined by the pairwise distances $d(\mathbf{o}_{\ell_{0}},\mathbf{o}_{\ell_{1}})$, $d(\mathbf{o}_{\ell_{0}},\mathbf{o}_{\ell_{2}})$, and $d(\mathbf{o}_{\ell_{1}},\mathbf{o}_{\ell_{2}})$.]

Figure 6: Geometric derivation of the Linguistic Divergence Score (LDS). The metric calculates the average Euclidean distance between different language centroids (inter-language displacement) and normalizes it by the expected paraphrase noise within the baseline language (intra-language PIS).

Cross-Lingual Consistency Rationale. A construct-valid multilingual model should possess a unified, language-agnostic latent representation for a given concept. If this holds true geometrically, prompting the model for the same concept in different languages should map to roughly the same coordinate centroid.

However, we cannot simply measure the absolute Euclidean distance between language centroids (e.g., English vs. French) to prove cross-lingual failure. As established in Section [4.1.1](https://arxiv.org/html/2605.08522#S4.SS1.SSS1), models exhibit natural intra-language instability. If the distance between the English and French outputs is $1.5$ units, but the model’s English paraphrases naturally jitter by $1.5$ units, the language switch did not actually break the alignment; it merely fell within the expected margin of error.

To resolve this, the Linguistic Divergence Score (LDS) takes the average pairwise Euclidean distance across all tested language centroids and divides it by the baseline Paraphrase Instability Score (PIS). By doing so, LDS becomes a dimensionless ratio that isolates the displacement attributable to linguistic boundaries from ordinary paraphrase noise. If the language of the prompt accesses an entirely disconnected, culturally flattened semantic subspace, as Helwe et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib40)) empirically observed, the inter-language distance will vastly exceed the intra-language noise.

The Formal Equation: We formalize the Linguistic Divergence Score (LDS) as the ratio of average inter-language spatial displacement to baseline intra-language instability:

$$\text{LDS}(m)=\frac{\frac{1}{\binom{L}{2}}\sum_{i<j}d(\mathbf{o}_{\ell_{i}},\mathbf{o}_{\ell_{j}})}{\text{PIS}(m,\ell_{0})} \tag{8}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $L$ is the total number of distinct languages evaluated for the construct.
- $\ell_{i},\ell_{j}$ represent specific languages from the set of $L$ languages.
- $d(\mathbf{o}_{\ell_{i}},\mathbf{o}_{\ell_{j}})$ is the Euclidean distance $\|\mathbf{o}_{\ell_{i}}-\mathbf{o}_{\ell_{j}}\|_{2}$ between the coordinate centroids generated in language $i$ and language $j$.
- The numerator calculates the mean pairwise distance across all unique language combinations.
- $\text{PIS}(m,\ell_{0})$ is the Paraphrase Instability Score (Equation [4](https://arxiv.org/html/2605.08522#S4.E4)) computed in the baseline language (typically English, $\ell_{0}$), representing the inherent spatial noise of the model for that construct.

An $\text{LDS}\approx 1$ indicates that translating the prompt induces no more instability than simply paraphrasing it in the baseline language (typically English). An $\text{LDS}\gg 1$ indicates that the model’s conceptual representations have fractured across language lines.
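Equation 8 can be sketched as below, assuming the per-language centroids and the baseline-language PIS have already been computed. The function name, language keys, and coordinates are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, ...]


def linguistic_divergence_score(centroids: Dict[str, Point], baseline_pis: float) -> float:
    """Mean pairwise centroid distance divided by the baseline-language PIS (Equation 8)."""
    if len(centroids) < 2:
        raise ValueError("at least two languages are required")
    if baseline_pis <= 0:
        raise ValueError("baseline PIS must be positive for the ratio to be defined")
    # combinations() yields each unordered language pair once, i.e. all i < j.
    pairs = list(combinations(centroids.values(), 2))
    mean_pairwise = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return mean_pairwise / baseline_pis


# Hypothetical per-language centroids for one construct.
language_centroids = {"en": (0.0, 0.0), "fr": (3.0, 0.0), "ar": (0.0, 4.0)}
# Pairwise distances are 3, 4, and 5, so the mean is 4.
print(linguistic_divergence_score(language_centroids, baseline_pis=2.0))  # 2.0
```

In this example the languages sit twice as far apart as the baseline paraphrase noise, which is the kind of reading the ratio is designed to support.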

#### 4.1.5 Modality and Routing: Defining the Reasoning Stability Score (RSS)

The Empirical Origin: The assumption that Chain-of-Thought (CoT) prompting universally improves and stabilizes LLM outputs was systematically challenged by Jiang et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib41)). In their comprehensive benchmarking of CoT robustness, the authors observed a dual phenomenon: while CoT can anchor complex reasoning tasks, it frequently exacerbates hallucinations and introduces instability on simpler tasks where the model’s direct output would have been correct. This "overthinking" behavior demonstrates that intermediate reasoning steps do not inherently guarantee reliability; they can instead act as a destabilizing perturbator.

[Figure 7: axes Dimension 1 and Dimension 2; a tight cluster around the Direct Centroid ($\text{PIS}_{\text{direct}}$) and a wider cluster around the CoT Centroid ($\text{PIS}_{\text{reasoning}}$, overthinking).]

Figure 7: Geometric derivation of the Reasoning Stability Score (RSS). In this "overthinking" scenario, forcing the model through an intermediate reasoning manifold (CoT) introduces new degrees of freedom, expanding the expected spatial displacement ($\text{PIS}_{\text{reasoning}}$) compared to the tightly clustered direct generation ($\text{PIS}_{\text{direct}}$).

Reasoning-Path Rationale. Geometrically, forcing a model to generate intermediate reasoning tokens expands the generation trajectory. Instead of mapping a prompt $p$ directly to the output space $\mathcal{O}$, the model projects through an intermediate reasoning manifold $r$, such that the final output is conditioned on both: $\phi(f_{m}(p\oplus r))$.

If the reasoning manifold is well-aligned with the target construct, it should act as a geometric attractor, pulling semantically equivalent paraphrase variants into a tighter coordinate centroid and reducing overall spatial noise. However, if the model suffers from the "overthinking" failure mode identified by Jiang et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib41)), the reasoning tokens introduce new, unconstrained degrees of freedom. In this state, minor syntactical changes in the input prompt cause the reasoning chain to diverge wildly, subsequently fracturing the final output coordinates across the latent space.

To mathematically isolate the structural effect of the reasoning pathway from the model’s baseline variance, we must compute a relative ratio\. By dividing the instability observed under CoT prompting by the instability observed under direct prompting, the Reasoning Stability Score \(RSS\) formalizes whether the reasoning manifold acts as a spatial stabilizer or a destabilizing perturbator\.

The Formal Equation: We define the Reasoning Stability Score (RSS) as the ratio of Paraphrase Instability Scores (Equation [4](https://arxiv.org/html/2605.08522#S4.E4)) between reasoning and direct generation modes:

$$\text{RSS}(m)=\frac{\text{PIS}_{\text{reasoning}}(m)}{\text{PIS}_{\text{direct}}(m)} \tag{9}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $\text{PIS}_{\text{reasoning}}(m)$ is the expected Euclidean displacement from the construct centroid when the model is evaluated using a Chain-of-Thought or reasoning-eliciting prompt schema.
- $\text{PIS}_{\text{direct}}(m)$ is the expected Euclidean displacement from the construct centroid when the model is forced to generate an immediate, direct response (zero-shot, no intermediate steps).

The ratio has a direct geometric reading: an $\text{RSS}<1$ indicates that the reasoning trajectory reduces spatial noise relative to direct generation, acting as an anchor for construct validity. Conversely, an $\text{RSS}>1$ isolates the "overthinking" vulnerability, indicating that the intermediate reasoning steps increase the dispersion of the model’s outputs and fracture its semantic alignment.
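The following sketch illustrates Equation 9 by computing PIS (Equation 4) separately for the two prompting modes and taking their ratio. It assumes the same paraphrase set is used in both modes; the function names and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def paraphrase_instability_score(points: Sequence[Point]) -> float:
    """Mean Euclidean displacement from the centroid (Equation 4)."""
    n = len(points)
    c = tuple(sum(axis) / n for axis in zip(*points))
    return sum(math.dist(p, c) for p in points) / n


def reasoning_stability_score(reasoning: Sequence[Point], direct: Sequence[Point]) -> float:
    """PIS under reasoning prompts divided by PIS under direct prompts (Equation 9)."""
    pis_direct = paraphrase_instability_score(direct)
    if pis_direct == 0:
        raise ValueError("direct-mode PIS is zero; the ratio is undefined")
    return paraphrase_instability_score(reasoning) / pis_direct


# Hypothetical outputs for the same four paraphrases under each mode.
direct_outputs = [(0.5, 1.0), (1.5, 1.0), (1.0, 0.5), (1.0, 1.5)]  # PIS = 0.5
cot_outputs = [(0.0, 1.0), (2.0, 1.0), (1.0, 0.0), (1.0, 2.0)]     # PIS = 1.0
print(reasoning_stability_score(cot_outputs, direct_outputs))  # 2.0, the "overthinking" case
```

Swapping the two arguments gives a ratio of 0.5, the stabilizing case in which reasoning tightens the cluster.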

### 4.2 Dimension 2: Position and Alignment ($\Delta_{\text{trait}}$)

While Dimension 1 formalizes the noise and spatial volatility induced by evaluation methods ($\Delta_{\text{method}}$), Dimension 2 isolates the model’s actual topological location within the latent space. This dimension decouples a model’s central tendency (its true alignment or capability) from the noise surrounding it, directly measuring the underlying construct ($\Delta_{\text{trait}}$).

#### 4.2.1 Reference Grounding: Deriving the Generalized Output Distance Score (IDS)

The Empirical Origin: The necessity of measuring absolute geometric distance to evaluate alignment was formalized by Bernardelle et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib43)). In their study mapping LLM political ideologies, the authors projected model outputs into a continuous 2D Cartesian space defined by economic and social axes. To quantify how synthetic personas influenced model alignment, they calculated the exact Euclidean distance between the model’s baseline position and the new persona-steered position, effectively treating ideological shifts as measurable spatial trajectories.

[Figure 8: axes Dimension 1 (e.g., Economic) and Dimension 2 (e.g., Social); the reference centroid $\mathbf{o}_{r}$, the model centroid $\mathbf{o}_{m}$ with its intra-method variance (ignored by IDS), and the distance $\text{IDS}=\|\mathbf{o}_{m}-\mathbf{o}_{r}\|_{2}$.]

Figure 8: Geometric derivation of the Generalized Output Distance Score (IDS). By calculating the distance between the model’s expected coordinate position ($\mathbf{o}_{m}$) and a target reference vector ($\mathbf{o}_{r}$), IDS measures true trait alignment independent of the model’s surrounding spatial noise.

Reference-Based Alignment Rationale. To objectively evaluate alignment (whether assessing political neutrality, factual grounding, or compliance with safety policies), we must define a target spatial coordinate, $\mathbf{o}_{r}$, representing the "gold standard" or "ground truth."

However, evaluating individual outputs directly against this reference is mathematically confounded by the model’s inherent instability (Dimension 1). If we measure the distance of a single, noisy output from the reference, the resulting scalar incorporates both the model’s true misalignment ($\Delta_{\text{trait}}$) and the prompt’s specific syntactical noise ($\Delta_{\text{method}}$).

To resolve this, the Generalized Output Distance Score (IDS) calculates the Euclidean distance from the model’s centroid ($\mathbf{o}_{m}$) to the reference target ($\mathbf{o}_{r}$). By using the centroid, the mean position defined alongside Equation [4](https://arxiv.org/html/2605.08522#S4.E4), IDS filters out the high-frequency geometric noise of individual paraphrases. It answers the fundamental question of alignment: once we strip away the surface-level instability of the prompt, how far away is the core representation of the model from the desired target? This transforms alignment benchmarking from a binary exact-match paradigm into a continuous, objective measurement of topological proximity.

The Formal Equation: Abstracting the political coordinate methodology of Bernardelle et al. ([2025](https://arxiv.org/html/2605.08522#bib.bib43)) into a generalized $n$-dimensional Euclidean space, we formalize the Generalized Output Distance Score (IDS) as:

$$\begin{split}\text{IDS}(m,r)&=\|\mathbf{o}_{m}-\mathbf{o}_{r}\|_{2}\\ &=\sqrt{\sum_{k=1}^{n}(x_{m,k}-x_{r,k})^{2}}\end{split} \tag{10}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $r$ is the designated target reference (e.g., a gold-standard human baseline, an explicitly safe policy region, or a target persona).
- $\mathbf{o}_{m}$ is the coordinate centroid of the model’s outputs for the evaluated construct (derived by averaging coordinates across multiple semantic variants).
- $\mathbf{o}_{r}$ is the exact coordinate vector representing the target reference $r$.
- $k$ indexes the axes of the $n$-dimensional latent space (e.g., economic and social dimensions in a 2D space, or multidimensional capability vectors).

An $\text{IDS}=0$ indicates that the model’s centroid coincides with the target reference, i.e., perfect trait alignment. Because it computes distance using the model’s centroid, a model can have a massive Paraphrase Instability Score (high noise) but still possess a low IDS if its chaotic outputs average out precisely on the target.
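The separation between position (IDS) and noise (PIS) described above can be checked numerically. The sketch below uses hypothetical coordinates whose centroid lands exactly on the reference while the individual outputs remain scattered; the function names are illustrative only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Per-axis mean of the model's outputs, i.e. o_m."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def output_distance_score(points: Sequence[Point], reference: Point) -> float:
    """Distance from the model centroid o_m to the reference o_r (Equation 10)."""
    return math.dist(centroid(points), reference)


def paraphrase_instability_score(points: Sequence[Point]) -> float:
    """Mean displacement from the centroid (Equation 4), shown for contrast."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Hypothetical noisy outputs that average out exactly on the reference (1, 1).
outputs = [(0.0, 1.0), (2.0, 1.0), (1.0, 0.0), (1.0, 2.0)]
reference = (1.0, 1.0)
print(output_distance_score(outputs, reference))  # 0.0: the centroid sits on the target
print(paraphrase_instability_score(outputs))      # 1.0: the individual outputs are still noisy
```

Reporting the two numbers together avoids reading a low IDS as evidence of a stable model.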

#### 4.2.2 Contextual Bounding: Formalizing the Output Distribution Width (OW)

The Empirical Origin: The critical limitation of relying solely on central tendency metrics (like the IDS) was exposed by Azzopardi and Moshfeghi ([2025b](https://arxiv.org/html/2605.08522#bib.bib44)). In their evaluation of 28 LLMs, the authors argued that traditional audits treat a model’s ideology as a singular "point estimate," which entirely masks the boundaries of what the model is actually willing or unwilling to say. To capture this missing dimension, they introduced the Political Overton Window (POW), utilizing indirect probing to map the expansive boundaries of a model’s permissible responses, demonstrating that models with identical point-estimate centroids can possess drastically different ideological spans.

[Figure 9: axes Dimension 1 and Dimension 2; the 90th-percentile convex hull ($\mathcal{H}_{90}$) of the outputs, one outlier outside it (ignored), and the Overton Width $\max\|p-q\|_{2}$ drawn across the hull.]

Figure 9: Geometric derivation of the Output Distribution Width (OW). While standard variance measures average deviation, OW calculates the maximum spatial diameter across the convex hull of the model’s outputs, mathematically defining its behavioral boundary.

Coverage Bound Rationale. Within the MTMM framework, measuring a model’s Alignment (IDS, Section [4.2.1](https://arxiv.org/html/2605.08522#S4.SS2.SSS1)) only provides the location of its expected value ($\mathbf{o}_{m}$). However, in high-stakes deployments, knowing where a model is centered is insufficient; we must also know how far it can reach.

If a model is heavily subjected to Reinforcement Learning from Human Feedback (RLHF) for safety, its responses will tightly cluster, refusing any prompt that pushes it toward the edges of the semantic space. Conversely, a highly creative or unaligned model will generate responses spanning massive regions of the coordinate system. Dispersion metrics like PIS measure the average "noise" around the centroid, but they do not capture the absolute bounds of the model’s capabilities.

To formalize the Overton Window mapped by Azzopardi and Moshfeghi ([2025b](https://arxiv.org/html/2605.08522#bib.bib44)), we must measure the geometric diameter of the model’s reachable output space. We achieve this by calculating the maximum pairwise Euclidean distance between any two outputs generated across a diverse set of prompts. To ensure this metric represents the true structural capability span and is not artificially inflated by a single anomalous hallucination (a random coordinate point projected far outside the norm), we first bound the output distribution within a 90th-percentile convex hull ($\mathcal{H}_{90}$). The Output Distribution Width (OW) is the diameter of this hull.

The Formal Equation: We formalize the Output Distribution Width (OW) as the maximum spatial diameter within the bounded convex hull of the model’s outputs:

$$\text{OW}(m)=\max_{p,q\in\mathcal{H}_{90}}\|p-q\|_{2} \tag{11}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- The model is prompted with a large, diverse set of queries designed to elicit the full spectrum of its capabilities or ideological stances for a specific construct.
- $\mathcal{H}_{90}$ is the convex hull bounding the 90th-percentile density of the resulting output coordinate vectors (excluding the 10% most extreme spatial outliers).
- $p$ and $q$ represent any two coordinate vectors lying on or within the boundary of the convex hull $\mathcal{H}_{90}$.
- $\|p-q\|_{2}$ is the Euclidean distance between these two points.

A highly restricted, safety-clamped model will exhibit a near-zero OW, compressing all outputs into a narrow spatial region, while an expressive, unconstrained model will exhibit a large OW, indicating that its outputs reach diverse regions of the latent space.
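Equation 11 can be sketched as below. The text does not specify how the 10% most extreme outliers are identified, so this sketch assumes they are the points farthest from the overall centroid. It also relies on the fact that the diameter of a convex hull is attained between two of the underlying points, so the maximum pairwise distance over the retained points suffices and no hull construction is needed. Function names and data are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def trimmed_points(points: Sequence[Point], drop_percent: int = 10) -> List[Point]:
    """ASSUMPTION: outliers are the drop_percent% of points farthest from the centroid."""
    n = len(points)
    c = tuple(sum(axis) / n for axis in zip(*points))
    ranked = sorted(points, key=lambda p: math.dist(p, c))
    n_drop = (n * drop_percent) // 100  # integer arithmetic avoids float rounding
    return ranked[: n - n_drop]


def output_distribution_width(points: Sequence[Point], drop_percent: int = 10) -> float:
    """Diameter of the retained point set, equal to the diameter of its convex hull."""
    kept = trimmed_points(points, drop_percent)
    if len(kept) < 2:
        raise ValueError("need at least two retained points")
    # Brute-force O(n^2) search over pairs; adequate for a sketch.
    return max(math.dist(p, q) for p, q in combinations(kept, 2))


# Nine hypothetical outputs on a 3x3 grid plus one anomalous point far away.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
outputs = grid + [(100.0, 100.0)]
print(output_distribution_width(outputs))     # about 2.83: the outlier is trimmed
print(output_distribution_width(outputs, 0))  # about 141.42: the outlier dominates untrimmed
```

The contrast between the two printed values shows why the trimming step matters: without it, a single anomalous point determines the entire width.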

### 4.3 Dimension 3: Coverage and Expressiveness

The final latent dimension resolves a critical geometric ambiguity present in Dimension 2. While the Output Distribution Width (OW, the generalized Overton Width) measures the absolute bounding diameter of a model’s capabilities, it provides no guarantees regarding the density or distribution within that hull. A model might achieve a high OW by randomly emitting extreme outlier responses across unrelated prompts, while systematically collapsing to a single, monolithic stance whenever presented with a bilaterally contested issue.

#### 4.3.1 Bidirectional Traversal: Deriving the Pluralism Score (PS)

The Empirical Origin: The formalization of viewpoint diversity as a rigorously measurable construct was achieved by Poole-Dayan et al. ([2026b](https://arxiv.org/html/2605.08522#bib.bib45)) through the OvertonBench framework. The authors defined "Overton Pluralism" as the extent to which diverse, legitimate viewpoints are represented in model outputs. Crucially, they showed that pluralistic alignment is frequently at odds with standard safety alignment; empirical evaluations demonstrated that models actively optimizing for political neutrality (a slant near zero) exhibited a negative correlation with pluralism, often refusing to surface valid opposing arguments.

[Figure 10: output density (Probability / Density) along an Ideological Axis with Region A (e.g., Pro) and Region B (e.g., Con); Model 1: Neutral (PS=0) is concentrated between the regions, while Model 2: Pluralistic (PS=1) covers both regions.]

Figure 10: Geometric derivation of the Pluralism Score (PS). A mathematically "neutral" model (Model 1) collapses its output to the coordinate origin, failing to cover the contested regions. A pluralistic model (Model 2) successfully traverses the latent space, populating both Region A and Region B.

Bidirectional Coverage Rationale. Geometrically, a contested issue $c$ defines at least two distinct, non-overlapping target regions in the latent output space, $\mathcal{O}_{A}$ and $\mathcal{O}_{B}$. If an evaluator asks a model to generate arguments for both sides of $c$, the model must project its outputs into both respective regions.

If an LLM has been heavily fine-tuned using standard RLHF, it learns a topological "shortcut": to avoid penalization on controversial topics, it projects all responses to the exact center of the coordinate space (the origin, or neutral zone), regardless of whether it was prompted for Side A or Side B. In this state, its distance score (IDS, Section [4.2.1](https://arxiv.org/html/2605.08522#S4.SS2.SSS1)) measured against the center is $0$ (perfect neutrality), but it completely fails the functional requirement of the task.

To penalize this spatial collapse, Poole-Dayan et al. ([2026b](https://arxiv.org/html/2605.08522#bib.bib45)) formalized pluralism as a strict set-coverage metric. The Pluralism Score (PS) does not measure distance or variance; it operates as a binary indicator function per topic. It checks whether the model’s generated coordinates landed inside the geometric boundaries of both $\mathcal{O}_{A}$ and $\mathcal{O}_{B}$. This ensures that true expressiveness is measured by actual bidirectional traversal of the latent manifold, rather than a false neutrality achieved through semantic evasion.

The Formal Equation: Adapting the OvertonScore from Poole-Dayan et al. ([2026b](https://arxiv.org/html/2605.08522#bib.bib45)), we formalize the Pluralism Score (PS) as the proportion of contested cases where the model successfully covers all required semantic regions:

$$\text{PS}(m)=\frac{|\{c\in C:\mathbf{o}_{c,A}\in\mathcal{O}_{A}\land\mathbf{o}_{c,B}\in\mathcal{O}_{B}\}|}{|C|} \tag{12}$$

where:

- $m\in\mathcal{M}$ is the language model under evaluation.
- $C$ represents the total set of evaluated contested issues or cases.
- $c\in C$ is an individual contested issue requiring bidirectional representation.
- $\mathcal{O}_{A}$ and $\mathcal{O}_{B}$ define the distinct spatial regions in the latent coordinate space representing legitimate opposing viewpoints for case $c$.
- $\mathbf{o}_{c,A}$ and $\mathbf{o}_{c,B}$ are the model’s output coordinate vectors when explicitly prompted to generate responses for Side A and Side B, respectively.
- The numerator counts the cardinality of the subset of cases where the model successfully projected outputs into both required target regions, without defaulting to a neutral refusal.

A $\text{PS}=1.0$ indicates perfect structural pluralism over the evaluated cases: for every contested issue in $C$, the model’s output manifold contains reachable, distinct sub-spaces for both required perspectives.
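Equation 12 reduces to counting the cases in which both membership tests pass. In the sketch below, the regions are represented as membership predicates shared across all cases, which is a simplifying assumption (the definition allows the regions to differ per case $c$). The one-dimensional axis, thresholds, and function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
Region = Callable[[Point], bool]  # membership test for a target region


def pluralism_score(
    cases: Sequence[Tuple[Point, Point]],
    in_region_a: Region,
    in_region_b: Region,
) -> float:
    """Fraction of cases whose Side-A and Side-B outputs both land in their regions (Equation 12)."""
    if not cases:
        raise ValueError("at least one contested case is required")
    covered = sum(1 for o_a, o_b in cases if in_region_a(o_a) and in_region_b(o_b))
    return covered / len(cases)


# ASSUMPTION: a single ideological axis with Region A at x <= -0.5 and Region B at x >= 0.5.
def region_a(o: Point) -> bool:
    return o[0] <= -0.5


def region_b(o: Point) -> bool:
    return o[0] >= 0.5


# Each case holds (output when prompted for Side A, output when prompted for Side B).
pluralistic = [((-0.8,), (0.9,)), ((-0.6,), (0.7,)), ((-0.9,), (0.1,)), ((-0.7,), (0.8,))]
neutral = [((0.0,), (0.0,))] * 4  # every response collapses to the origin

print(pluralism_score(pluralistic, region_a, region_b))  # 0.75: the third case misses Region B
print(pluralism_score(neutral, region_a, region_b))      # 0.0: neutrality by evasion covers nothing
```

The neutral model scores 0 even though its distance to the origin is 0, which matches the contrast between position and coverage drawn in the rationale above.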

### 4.4 Meta-Evaluation: Evaluator Stability

As the evaluation of complex topological traits increasingly scales through automated LLM\-as\-a\-judge pipelines, the framework introduces a recursive vulnerability: the projection function itself may suffer from latent instability\. If the measuring instrument is structurally brittle, the entire coordinate space collapses\.

#### 4.4.1 Instrument Calibration: Deriving the Judge Bias Score (JBS)

The Empirical Origin: The rigorous quantification of evaluator fragility was systematized by Ye et al. ([2024a](https://arxiv.org/html/2605.08522#bib.bib42)) through the CALM (Comprehensive Assessment of LLM-as-a-judge Modeling biases) framework. The authors demonstrated that even frontier models acting as judges are highly susceptible to task-irrelevant perturbations. By mapping 12 distinct bias categories, ranging from positional bias (favoring the first presented option) to verbosity and stylistic biases, they showed that an LLM judge will frequently reverse its own evaluation outcome when the semantic meaning of the inputs remains identical but the surface presentation is perturbed.

[Figure 11 diagram: a decision boundary ($\psi$) separates the regions "Prefers Model A" and "Prefers Model B". The original pair $c = (A, B)$ is shown with three perturbations: $\mu_1(c)$ (swap order), $\mu_2(c)$ (add verbosity), and $\mu_3(c)$ (style tweak). Red arrows indicate a "flip" (inconsistency), where the perturbation crossed the decision boundary.]

Figure 11: Geometric derivation of the Judge Bias Score (JBS). A construct-valid judge should evaluate the original comparison pair ($c$) and any task-irrelevant perturbation ($\mu(c)$) on the same side of the decision boundary. A flip indicates the judge is anchored to surface artifacts rather than semantic quality.

Meta-Evaluation Rationale. Within the MTMM framework, if an automated judge is used to compute downstream spatial metrics (such as evaluating the semantic equivalence for PIS or the task completion for DS), the judge itself acts as the projection function $\phi$ that maps raw text into the coordinate space.

Equation [2](https://arxiv.org/html/2605.08522#S3.E2) defined the observed spatial displacement as $\Delta_{\text{obs}} = \Delta_{\text{trait}} + \Delta_{\text{method}} + \epsilon$. If the judge is unstable, it injects massive systemic error ($\epsilon_{\text{judge}}$) into the coordinate mappings. For example, if a judge is evaluating two models and prefers Model A, but reverses its decision to prefer Model B simply because their positions in the prompt were swapped (a task-irrelevant perturbation $\mu$), the judge's decision boundary is topologically invalid.

To mathematically calibrate this instrument, we must compute the Judge Bias Score \(JBS\)\. JBS calculates the expectation that a judge will flip its evaluation classification when subjected to non\-semantic geometric perturbations\. It operates as a strict inversion of the agreement rate\. If JBS is high, the "ruler" used to measure the latent traits is made of rubber; any coordinate displacements observed in the target models could merely be artifacts of the evaluator’s own structural fragility\.

The Formal Equation: Abstracting the specific bias calculations from the CALM methodology into our generalized framework, we define the Judge Bias Score (JBS) using an indicator function over a set of task-irrelevant perturbations:

$$\text{JBS}(\psi) = 1 - \frac{1}{|C|\,|\mathcal{U}_{\text{irr}}|} \sum_{c \in C} \sum_{\mu \in \mathcal{U}_{\text{irr}}} \mathbb{I}\big(\psi(c) = \psi(\mu(c))\big) \tag{13}$$

where:

- $\psi$ is the LLM-as-a-judge function evaluating a given input.
- $C$ is the set of baseline evaluation instances (e.g., pairs of model outputs to be compared).
- $c \in C$ represents a specific, unperturbed evaluation instance.
- $\mathcal{U}_{\text{irr}}$ is the defined set of task-irrelevant perturbation operators (e.g., position swapping, injecting safe filler text, altering formatting).
- $\mu(c)$ represents the instance $c$ after applying the task-irrelevant perturbation $\mu$.
- $\mathbb{I}$ is the indicator function, which returns $1$ if the judge's classification remains perfectly consistent ($\psi(c) = \psi(\mu(c))$) and $0$ if the judge "flips" its decision.

A $\text{JBS} = 0$ indicates that, over the tested perturbations, the judge's decision boundary is unaffected by surface-level noise, making it a topologically stable instrument for computing the MTMM coordinate matrix. A JBS approaching $1$ indicates a catastrophically brittle evaluator.
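
Equation 13 can be sketched in Python as follows. This is an illustrative sketch, not the CALM implementation: the two toy judges, the two perturbation operators, and all names are hypothetical. It also assumes the judge returns the identity of the preferred model rather than its position, so that a verdict stays comparable after the order is swapped.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, str]             # (model id, response text)
Instance = Tuple[Candidate, Candidate]  # ordered pair as shown to the judge
Judge = Callable[[Instance], str]       # returns the id of the preferred model
Perturbation = Callable[[Instance], Instance]


def judge_bias_score(judge: Judge,
                     instances: Sequence[Instance],
                     perturbations: Sequence[Perturbation]) -> float:
    """One minus the share of (instance, perturbation) pairs with an unchanged verdict."""
    if not instances or not perturbations:
        raise ValueError("need at least one instance and one perturbation")
    consistent = 0
    for c in instances:
        baseline = judge(c)
        for mu in perturbations:
            consistent += int(judge(mu(c)) == baseline)
    return 1.0 - consistent / (len(instances) * len(perturbations))


def swap_order(c: Instance) -> Instance:
    return (c[1], c[0])


def add_verbosity(c: Instance) -> Instance:
    first, (model_id, text) = c
    return (first, (model_id, text + " To be clear, that is my full answer."))


def position_biased_judge(c: Instance) -> str:
    return c[0][0]  # toy judge: always prefers whichever response is shown first


def content_judge(c: Instance) -> str:
    for model_id, text in c:  # toy judge: prefers the response containing "42"
        if "42" in text:
            return model_id
    return c[0][0]


instances = [
    (("A", "The answer is 42."), ("B", "I am not sure.")),
    (("A", "Maybe 7?"), ("B", "It is 42.")),
]
perturbations = [swap_order, add_verbosity]

jbs_biased = judge_bias_score(position_biased_judge, instances, perturbations)
jbs_stable = judge_bias_score(content_judge, instances, perturbations)
```

In this toy setup the position-biased judge flips on every order swap but not on added filler, so half of its verdicts change, while the content-based judge never flips.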

## 5 Synthesizing the Construct: Orthogonality and Correlations

The primary theoretical advantage of organizing LLM evaluation into a geometric Multi\-Trait Multi\-Method \(MTMM\) framework is the ability to analyze metric interactions\. In classical psychometrics, a robust MTMM matrix demands that distinct latent traits remain structurally orthogonal, even if they exhibit empirical correlations in specific test populations\.

By projecting our nine formalized metrics into the shared output space $\mathcal{O}$, we can leverage their geometric intersections as diagnostic tools. A single scalar score obscures failure modes; however, mapping the intersections of Instability (Dimension 1), Position (Dimension 2), and Coverage (Dimension 3) reveals precise, multidimensional behavioral profiles.

### 5.1 Disentangling Span from Coverage: OW vs. PS

The interaction between the Overton Width (OW, Equation [11](https://arxiv.org/html/2605.08522#S4.E11)) and the Pluralism Score (PS, Equation [12](https://arxiv.org/html/2605.08522#S4.E12)) addresses a pervasive confounding factor in diversity and alignment evaluations. Because OW measures the maximum geometric diameter of the $\mathcal{H}_{90}$ convex hull, and PS measures the explicit bidirectional coverage of contested regions, their correlation serves as a vital diagnostic for "false pluralism".

Consider a model exhibiting a broadly dispersed output space (high OW) but a near-zero Pluralism Score (low PS). Under a single-metric paradigm, the high variance might be misconstrued as robust expressiveness. Geometrically, however, this profile indicates a model that hallucinates wildly across different prompts but systematically collapses to a single, monolithic stance when presented with a bilaterally contested issue. It spans a wide range but strictly avoids inhabiting opposing coordinates simultaneously. Conversely, the empirical finding from Poole-Dayan et al. ([2026a](https://arxiv.org/html/2605.08522#bib.bib31)), a moderate negative correlation ($r = -0.41$) between political slant and pluralism, indicates that a model can achieve a highly constrained, "neutral" position (low OW, low IDS) while fundamentally failing to provide pluralistic coverage (low PS).

### 5.2 Disentangling Stability from Alignment: PIS vs. IDS

The intersection of the Paraphrase Instability Score (PIS, Equation [4](https://arxiv.org/html/2605.08522#S4.E4)) and the Generalized Output Distance Score (IDS, Equation [10](https://arxiv.org/html/2605.08522#S4.E10)) isolates true capability from method variance.

- Low PIS + High IDS (Stubborn Misalignment): The model generates outputs that are tightly clustered ($\Delta_{\text{method}} \approx 0$) but geometrically distant from the target reference distribution. This indicates robust but incorrect internal representations. The model's behavior is highly predictable, which is advantageous for security auditing, but it fundamentally fails the alignment objective.
- High PIS + Low IDS (Fragile Alignment): The expected centroid of the model's outputs may align closely with the reference, but the variance around that centroid is massive. This reveals a model that is technically capable of hitting the target coordinate space, but only under highly specific, "lucky" prompt formulations. This signifies a severe lack of construct validity in the underlying alignment training; the safety or factual grounding is merely an artifact of surface-level syntax rather than a generalized capability.
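
The two profiles above can be read as quadrants of a PIS-IDS plane. The sketch below is one hypothetical way to encode that reading; the cutoff values are arbitrary assumptions (the source gives no thresholds), and the labels for the two quadrants not named in the text are ours.

```python
def stability_alignment_profile(pis: float, ids: float,
                                pis_cutoff: float = 0.2,
                                ids_cutoff: float = 0.3) -> str:
    """Map a (PIS, IDS) pair to a qualitative profile using assumed cutoffs."""
    unstable = pis > pis_cutoff   # large displacement under paraphrase
    distant = ids > ids_cutoff    # far from the reference distribution
    if not unstable and distant:
        return "stubborn misalignment"
    if unstable and not distant:
        return "fragile alignment"
    if unstable and distant:
        return "unstable misalignment"  # label not taken from the source
    return "stable alignment"           # label not taken from the source


profile_a = stability_alignment_profile(pis=0.05, ids=0.80)
profile_b = stability_alignment_profile(pis=0.60, ids=0.10)
```

In practice the cutoffs would need to be calibrated per embedding space and task, since both scores depend on the projection used.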

### 5.3 Temporal vs. Static Vulnerability: DS vs. PSS

Finally, mapping Prompt Sensitivity (PSS, Equation [5](https://arxiv.org/html/2605.08522#S4.E5)) against multi-turn Drift Score (DS, Equation [6](https://arxiv.org/html/2605.08522#S4.E6)) separates immediate vulnerabilities from temporal degradation.

A model might possess a low PSS, successfully resisting single-turn role-injection attacks or adversarial personas. However, if it simultaneously exhibits a high DS and Drift Velocity, the spatial stability is revealed to be a temporary equilibrium. Under prolonged conversational interaction, the model's trajectory inevitably diverges from the goal-consistent policy, accumulating geometric displacement over time (Dongre et al., [2025a](https://arxiv.org/html/2605.08522#bib.bib21); Li et al., [2025](https://arxiv.org/html/2605.08522#bib.bib22)). Without synthesizing these metrics, evaluators relying solely on single-turn PSS would falsely certify a model as robust, entirely missing the temporal hazard mathematically captured by DS.

In summary, the orthogonality of these formalized metrics ensures that evaluators can pinpoint exact functional failures—whether an issue stems from a structural collapse in the reasoning pathway \(RSS\), a failure of semantic anchoring \(PIS\), or an inability to traverse contested sub\-spaces \(PS\)—rather than simply observing a degraded top\-line score\.

## 6 Synthesizing the Construct: Constructing the Geometric MTMM Matrix

With the nine core evaluation metrics mathematically formalized as spatial operations within the continuous latent space $\mathcal{O}$ (Section [4](https://arxiv.org/html/2605.08522#S4)), we can now construct the generalized Multi-Trait Multi-Method (MTMM) matrix.

In classical psychometrics, the MTMM matrix is used to empirically demonstrate construct validity by separating true trait variance from measurement artifact\. By mapping our geometric metrics into this matrix framework, we provide a structured blueprint for designing LLM evaluations that systematically disentangle a model’s true capabilities from its prompt sensitivity\.

### 6.1 Matrix Visualization and Topology

A standard LLM evaluation leaderboard functions as a one-dimensional vector, assessing a single trait (e.g., accuracy) using a single method (e.g., zero-shot direct prompting). The geometric MTMM matrix expands this into a two-dimensional grid. The rows represent the Latent Traits we wish to measure (Position, Expressiveness, Stability), and the columns represent the Perturbation Methods applied to the input (Paraphrasing, Cross-Lingual translation, Persona Injection).

| | Method 1: Direct Query | Method 2: Paraphrase ($v$) | Method 3: Cross-Lingual ($\ell$) | Method 4: Persona ($c$) |
| --- | --- | --- | --- | --- |
| Trait 1: Alignment (Position) | Baseline IDS | IDS Variance (across $v$) | Cross-Lingual IDS Shift | Steerability (Targeted IDS) |
| Trait 2: Pluralism (Expressiveness) | Baseline PS / OW | Coverage Consistency | Multilingual Pluralism Gap | Persona Restriction |
| Trait 3: Robustness (Stability) | Baseline Noise ($\sigma$) | PIS (Equation [4](https://arxiv.org/html/2605.08522#S4.E4)) | LDS (Equation [8](https://arxiv.org/html/2605.08522#S4.E8)) | PSS (Equation [5](https://arxiv.org/html/2605.08522#S4.E5)) |

Figure 12: The Geometric MTMM Matrix. By intersecting latent traits with varied measurement methods, we map specific spatial metrics (like PIS, LDS, and PSS) to their precise role in diagnosing structural fragility ($\Delta_{\text{method}}$) versus true trait variance ($\Delta_{\text{trait}}$).

As illustrated in Figure [12](https://arxiv.org/html/2605.08522#S6.F12), specific metrics from Section [4](https://arxiv.org/html/2605.08522#S4) act as operators within specific cells. The Paraphrase Instability Score (PIS), Linguistic Divergence Score (LDS), and Prompt Sensitivity Score (PSS) populate the bottom row, explicitly quantifying the magnitude of the geometric displacement caused by their respective column methods.

### 6.2 Matrix Interpretation and Discriminant Validity

To successfully evaluate an LLM, one must populate this matrix and interpret the intersections\. The geometric formalization allows us to rigorously distinguish between two primary failure modes:

##### Monotrait\-Heteromethod \(Diagnosing Method Variance\):

If we hold the latent trait constant \(e\.g\., we want to measure the model’s factual alignment, Trait 1\) but vary the method \(Direct vs\. Paraphrase vs\. Cross\-Lingual\), a construct\-valid model should maintain a stable Ideological Distance Score \(IDS\) across all columns\.

Consider the domain of low-resource figurative language (e.g., Bengali idioms). A standard leaderboard might query the model in English (Method 1) and find a highly accurate alignment (low IDS). However, if evaluating the exact same idiom in Bengali (Method 3) causes the output coordinates to displace sharply, resulting in an exceptionally high LDS (Equation [8](https://arxiv.org/html/2605.08522#S4.E8)), the matrix indicates that the model's apparent capability was an artifact of the English language space. The trait (figurative understanding) is structurally broken; because the same trait fails to agree across methods, it lacks cross-lingual convergent validity.
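
A monotrait-heteromethod check of this kind can be sketched as a comparison of one matrix row against its direct-query baseline. The scores, the method keys, and the tolerance below are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple


def monotrait_heteromethod_check(row: Dict[str, float],
                                 baseline: str = "direct",
                                 tolerance: float = 0.1) -> Tuple[Dict[str, float], List[str]]:
    """Return each method's absolute shift from the baseline and the methods exceeding tolerance."""
    base = row[baseline]
    shifts = {m: abs(score - base) for m, score in row.items() if m != baseline}
    flagged = sorted(m for m, delta in shifts.items() if delta > tolerance)
    return shifts, flagged


# Hypothetical IDS values for one trait measured under four methods.
ids_row = {"direct": 0.10, "paraphrase": 0.12, "cross_lingual": 0.55, "persona": 0.15}
shifts, flagged = monotrait_heteromethod_check(ids_row)
```

Here only the cross-lingual column moves far from the baseline, which mirrors the Bengali idiom scenario: the trait looks stable until the method changes language.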

##### Heterotrait\-Monomethod \(Disentangling Orthogonal Traits\):

Conversely, we must ensure that distinct traits are not geometrically conflated when measured by the same method. As mathematically defined in our framework, the Overton Width (OW, Equation [11](https://arxiv.org/html/2605.08522#S4.E11)) measures the maximum spatial span of the model's outputs, while the Pluralism Score (PS, Equation [12](https://arxiv.org/html/2605.08522#S4.E12)) measures its targeted coverage of opposing viewpoints.

If an evaluator populates the Pluralism row (Trait 2) using only direct prompting (Method 1), they might observe a high OW (the model generates widely dispersed text) but a near-zero PS (the model completely fails to cover the requested opposing stances). By analyzing this intersection within the MTMM framework, researchers can show that high generative variance does not imply true pluralistic alignment, preventing "false diversity" from artificially inflating safety or capability scores.

Ultimately, constructing the MTMM matrix shifts LLM evaluation from an exercise in achieving high scalar scores to a topological audit\. By systematically isolating spatial noise from structural alignment, the framework provides the rigorous diagnostic resolution required to build truly reliable artificial intelligence\.

## 7 Future Directions and Limitations: Towards Dynamic Meta-Evaluation

While the MTMM\-geometric framework provides a mathematically rigorous foundation for resolving the construct validity crisis, its operationalization introduces specific constraints that define the trajectory for future empirical work\.

### 7.1 Methodological Limitations

The primary limitation of this framework lies in its dependence on the projection function $\phi$ (Equation [1](https://arxiv.org/html/2605.08522#S3.E1)). Because metrics like the Linguistic Divergence Score (LDS) and Paraphrase Instability Score (PIS) calculate Euclidean or cosine displacements in a latent space, they are inherently bottlenecked by the quality of the underlying embedding model. If the embedding architecture itself suffers from geometric collapse or lacks robust cross-lingual alignment, the resulting evaluation scores will conflate the LLM's instability with the embedder's representation error.

Furthermore, populating an MTMM matrix is computationally expensive. Evaluating a single construct requires querying the model across multiple perturbation methods (e.g., semantic paraphrasing, role injection, cross-lingual mapping). While mathematically necessary to isolate true trait variance ($\Delta_{\text{trait}}$), scaling this across massive, generalized leaderboards will require optimized sampling techniques to prevent combinatorial explosion.
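
The cost concern can be made concrete with a rough query count. The sketch below contrasts applying each perturbation method independently (one matrix column at a time) with fully crossing the methods; the item count and variant counts are invented for illustration, and the function names are hypothetical.

```python
from math import prod
from typing import Dict


def independent_budget(n_items: int, variants: Dict[str, int], samples: int = 1) -> int:
    """Model calls when each method's variants are applied separately to every item."""
    return n_items * sum(variants.values()) * samples


def crossed_budget(n_items: int, variants: Dict[str, int], samples: int = 1) -> int:
    """Model calls when every combination of method variants is applied to every item."""
    return n_items * prod(variants.values()) * samples


# Hypothetical setup: 500 items, a handful of variants per method.
variants = {"direct": 1, "paraphrase": 5, "cross_lingual": 4, "persona": 6}
calls_independent = independent_budget(500, variants)  # 500 * (1 + 5 + 4 + 6)
calls_crossed = crossed_budget(500, variants)          # 500 * (1 * 5 * 4 * 6)
```

The independent design grows with the sum of the variant counts and the crossed design with their product, which is where sampling strategies become necessary.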

### 7.2 Future Direction I: Dynamic Drift in Adversarial Environments

Static, single-turn evaluation is fundamentally insufficient for modern deployment contexts. A critical future direction is the application of this framework to dynamic, multi-turn environments, specifically tracking temporal instability via the Drift Score (DS) and Drift Velocity (Equations [6](https://arxiv.org/html/2605.08522#S4.E6) and [7](https://arxiv.org/html/2605.08522#S4.E7)).

An ideal empirical testbed for this is the evaluation of LLMs within an Adversarial Resource Extraction Game (AREG). In such environments, the model is initialized with a secure goal state (e.g., safeguarding a secret or policy), while an adversarial agent systematically attempts to manipulate the context over $N$ turns (Li et al., [2025](https://arxiv.org/html/2605.08522#bib.bib22)). By calculating the spatial displacement at each turn, future research can mathematically map the trajectory of alignment decay, determining precisely when and how a model's latent coordinates are dragged across safety boundaries. This shifts evaluation from a binary "jailbroken/secure" label to a continuous topological survival analysis.
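
The per-turn tracking described here can be sketched as follows. The sketch assumes, for illustration only, that displacement is the Euclidean distance from a fixed goal-state coordinate and that the safety boundary is a ball of fixed radius around it; the paper's Equations 6 and 7 may define drift differently, and the trajectory values are invented.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def drift_trajectory(turn_coords: Sequence[Vector],
                     goal: Vector,
                     safety_radius: float) -> Tuple[List[float], Optional[int]]:
    """Return per-turn displacement from the goal and the first turn (1-indexed) outside the radius."""
    displacements = [math.dist(point, goal) for point in turn_coords]
    crossing = next(
        (turn for turn, d in enumerate(displacements, start=1) if d > safety_radius),
        None,  # the trajectory never left the assumed safe region
    )
    return displacements, crossing


# Hypothetical 4-turn trajectory drifting away from the goal at the origin.
trajectory = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.1), (0.6, 0.2)]
displacements, first_crossing = drift_trajectory(trajectory, goal=(0.0, 0.0), safety_radius=0.5)
```

Recording the crossing turn per conversation, rather than a single secure or jailbroken label, yields the time-to-event data that a survival analysis would need.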

## 8 Conclusion

The current paradigm of Large Language Model evaluation has reached a critical bottleneck\. As models are increasingly deployed in complex, dynamic, and cross\-cultural environments, the reliance on fragmented, one\-dimensional leaderboards has precipitated a severe crisis of construct validity\. The NLP community can no longer afford to conflate a model’s surface\-level sensitivity to a specific prompt with its true underlying capability\.

In this Systematization of Knowledge, we have introduced a generalized Multi\-Trait Multi\-Method \(MTMM\) framework that resolves this crisis by mapping evaluation into a shared geometric coordinate space\. By formalizing nine disparate metrics—ranging from the Paraphrase Instability Score \(PIS\) to the Pluralism Score \(PS\)—as strict topological measurements of displacement, span, and distance, we have demonstrated that instability, alignment, and expressiveness are not isolated anomalies\. Rather, they are orthogonal geometric properties of the model’s latent output manifold\.

This spatial unification fundamentally shifts how we diagnose LLM behavior\. Instead of merely observing that a model fails on low\-resource figurative language or degrades during multi\-turn adversarial interactions, our framework allows researchers to mathematically isolate the exact nature of the failure\. Evaluators can now precisely determine whether an issue stems from a structural collapse in cross\-lingual mapping \(LDS\), a temporal decay of a safety boundary \(DS\), or a rigid inability to traverse contested viewpoints \(low PS despite a high Overton Width\)\.

Ultimately, evaluating the next generation of language models requires the rigorous discipline of psychometrics combined with the precision of spatial topology\. By systematically disentangling task\-irrelevant method variance from true capability spans, the MTMM\-geometric framework provides the foundational blueprint necessary to move beyond ad hoc scoring, enabling the development of robust, construct\-valid, and structurally stable artificial intelligence systems\.

## References

- POW: political overton windows of large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2025,External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1347)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.4](https://arxiv.org/html/2605.08522#S2.SS4.p1.1)\.
- L\. Azzopardi and Y\. Moshfeghi \(2025b\)POW: political overton windows of large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24767–24773\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1347/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1347),ISBN 979\-8\-89176\-335\-7Cited by:[§4\.2\.2](https://arxiv.org/html/2605.08522#S4.SS2.SSS2.p1.1),[§4\.2\.2](https://arxiv.org/html/2605.08522#S4.SS2.SSS2.p4.1)\.
- A\. M\. Bean, R\. O\. Kearns, A\. Romanou, F\. S\. Hafner, H\. Mayne, J\. Batzner, N\. Foroutan, C\. Schmitz, K\. Korgul, H\. Batra,et al\.\(2025\)Measuring what matters: construct validity in large language model benchmarks\.InNeurIPS 2025 Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=mdA5lVvNcU)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p1.1)\.
- P\. Bernardelle, L\. Fröhling, S\. Civelli, R\. Lunardi, K\. Roitero, and G\. Demartini \(2025\)Mapping and influencing the political ideology of large language models using synthetic personas\.External Links:2412\.14843,[Document](https://dx.doi.org/https%3A//doi.org/10.1145/3701716.3715578),[Link](https://arxiv.org/abs/2412.14843)Cited by:[§4\.2\.1](https://arxiv.org/html/2605.08522#S4.SS2.SSS1.p1.1),[§4\.2\.1](https://arxiv.org/html/2605.08522#S4.SS2.SSS1.p5.1)\.
- Y\. Chang, X\. Wang, J\. Wang, Y\. Wu, K\. Zhu, H\. Chen, L\. Yang, X\. Yi, C\. Wang, Y\. Wang, W\. Ye, Y\. Zhang, Y\. Chang, P\. S\. Yu, Q\. Yang, and X\. Xie \(2023\)A survey on evaluation of large language models\.arXiv preprint arXiv:2307\.03109\.External Links:[Link](https://arxiv.org/abs/2307.03109)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p1.1)\.
- A\. Chatterjee, H\. S\. V\. N\. S\. K\. Renduchintala, S\. Bhatia, and T\. Chakraborty \(2024\)POSIX: a prompt sensitivity index for large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 14550–14565\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.852)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px1.p1.1)\.
- B\. J\. Choi and M\. Weber \(2026\)Latent structure of affective representations in large language models\.External Links:2604\.07382,[Link](https://arxiv.org/abs/2604.07382)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08522#S2.SS2.p2.1)\.
- C\. Deng, Y\. Zhao, X\. Tang, M\. Gerstein, and A\. Cohan \(2023\)Investigating data contamination in modern benchmarks for large language models\.arXiv preprint arXiv:2311\.09783\.External Links:[Link](https://arxiv.org/abs/2311.09783)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p2.1)\.
- V\. Dongre, R\. A\. Rossi, V\. D\. Lai, D\. S\. Yoon, D\. Hakkani\-Tür, and T\. Bui \(2025a\)Drift no more? context equilibria in multi\-turn LLM interactions\.External Links:2510\.07777,[Link](https://arxiv.org/abs/2510.07777)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.08522#S5.SS3.p2.1)\.
- V\. Dongre, R\. A\. Rossi, V\. D\. Lai, D\. S\. Yoon, D\. Hakkani\-Tür, and T\. Bui \(2025b\)Drift no more? context equilibria in multi\-turn LLM interactions\.InProceedings of the AAAI 2026 Workshop on Personalization in the Era of Large Foundation Models,Philadelphia, USA\.Note:arXiv:2510\.07777External Links:[Link](https://arxiv.org/abs/2510.07777)Cited by:[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p1.1),[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p2.6),[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p4.2),[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p6.1)\.
- C\. Gan and T\. Mori \(2023\)Sensitivity and robustness of large language models to prompt template in japanese text classification tasks\.arXiv preprint arXiv:2305\.08714\.External Links:[Link](https://arxiv.org/abs/2305.08714)Cited by:[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px1.p1.1)\.
- C\. Helwe, O\. Balalau, and D\. Ceolin \(2025\)Navigating the political compass: evaluating multilingual LLMs across languages and nationalities\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 17179–17204\.External Links:[Link](https://aclanthology.org/2025.findings-acl.883),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.883)Cited by:[§4\.1\.4](https://arxiv.org/html/2605.08522#S4.SS1.SSS4.p1.1),[§4\.1\.4](https://arxiv.org/html/2605.08522#S4.SS1.SSS4.p4.1)\.
- R\. Hida, M\. Kaneko, and N\. Okazaki \(2025\)Social bias evaluation for large language models requires prompt variations\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 14507–14530\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.783)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px1.p1.1)\.
- D\. Jiang, R\. Zhang, Z\. Guo, Y\. Li, Y\. Qi, X\. Chen, L\. Wang, J\. Jin, C\. Guo, S\. Yan, B\. Zhang, C\. Fu, P\. Gao, and H\. Li \(2025\)MME\-cot: benchmarking chain\-of\-thought in large multimodal models for reasoning quality, robustness, and efficiency\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=YZvefQVLJI)Cited by:[§4\.1\.5](https://arxiv.org/html/2605.08522#S4.SS1.SSS5.p1.1),[§4\.1\.5](https://arxiv.org/html/2605.08522#S4.SS1.SSS5.p3.1)\.
- S\. Kabir, K\. Esterling, and Y\. Dong \(2025\)Beyond the surface: probing the ideological depth of large language models\.External Links:2508\.21448,[Link](https://arxiv.org/abs/2508.21448)Cited by:[§2\.4](https://arxiv.org/html/2605.08522#S2.SS4.p2.1)\.
- R\. O\. Kearns \(2026\)Quantifying construct validity in large language model evaluations\.External Links:2602\.15532,[Link](https://arxiv.org/abs/2602.15532)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p3.1)\.
- R\. Khraishi, I\. Zafar, K\. Myles, and G\. A\. Cowan \(2026\)Evaluating performance drift from model switching in multi\-turn LLM systems\.InProceedings of the CAO Workshop at ICLR 2026,Vienna, Austria\.Note:arXiv:2603\.03111External Links:[Link](https://arxiv.org/abs/2603.03111)Cited by:[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p1.1),[§4\.1\.3](https://arxiv.org/html/2605.08522#S4.SS1.SSS3.p5.1)\.
- T\. S\. Kim, H\. Lee, Y\. Lee, J\. Seering, and J\. Kim \(2026\)Evaluating large language models through functional fragmentation\.InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems,External Links:[Link](https://dl.acm.org/doi/10.1145/3731120.3744588)Cited by:[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p3.1)\.
- W\. Kwan, X\. Zeng, Y\. Jiang, Y\. Wang, L\. Li, L\. Shang, X\. Jiang, Q\. Liu, and K\. Wong \(2024\)MT\-Eval: a multi\-turn capabilities evaluation benchmark for large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 20153–20177\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1124)Cited by:[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px2.p1.1)\.
- P\. Lai, Z\. Ou, Y\. Wang, L\. Wang, J\. Yang, Y\. Chen, and G\. Chen \(2026\)BiasScope: towards automated detection of bias in LLM\-as\-a\-judge evaluation\.External Links:2602\.09383,[Link](https://arxiv.org/abs/2602.09383)Cited by:[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px3.p1.1)\.
- A\. Lee, F\. Viégas, and M\. Wattenberg \(2025\)Shared global and local geometry of language model embeddings\.External Links:2503\.21073,[Link](https://arxiv.org/abs/2503.21073)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08522#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.08522#S3.SS1.p7.1)\.
- Y\. Li, R\. Krishnan, and R\. Padman \(2025\)Time\-to\-inconsistency: a survival analysis of large language model robustness to adversarial attacks\.External Links:2510\.02712,[Link](https://arxiv.org/abs/2510.02712)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.08522#S5.SS3.p2.1),[§7\.2](https://arxiv.org/html/2605.08522#S7.SS2.p2.1)\.
- S\. Ni, G\. Chen, S\. Li, X\. Chen, S\. Li, B\. Wang, Q\. Wang, X\. Wang, Y\. Zhang, and L\. Fan \(2025\)A survey on large language model benchmarks\.External Links:2508\.15361,[Link](https://arxiv.org/abs/2508.15361)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p1.1)\.
- A\. Ning, V\. Rangaraju, and Y\. Kuo \(2025\)Visualizing LLM latent space geometry through dimensionality reduction\.External Links:2511\.21594,[Link](https://arxiv.org/abs/2511.21594)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08522#S2.SS2.p2.1),[§3\.1](https://arxiv.org/html/2605.08522#S3.SS1.p7.1)\.
- E\. Poole\-Dayan, J\. Wu, T\. Sorensen, J\. Pei, and M\. A\. Bakker \(2026a\)Benchmarking overton pluralism in LLMs\.InProceedings of the International Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=f2VxF4QIx1)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p3.1),[§2\.4](https://arxiv.org/html/2605.08522#S2.SS4.p2.1),[§5\.1](https://arxiv.org/html/2605.08522#S5.SS1.p2.1)\.
- E\. Poole\-Dayan, J\. Wu, T\. Sorensen, J\. Pei, and M\. A\. Bakker \(2026b\)Benchmarking overton pluralism in llms\.External Links:2512\.01351,[Link](https://arxiv.org/abs/2512.01351)Cited by:[§4\.3\.1](https://arxiv.org/html/2605.08522#S4.SS3.SSS1.p1.1),[§4\.3\.1](https://arxiv.org/html/2605.08522#S4.SS3.SSS1.p4.2),[§4\.3\.1](https://arxiv.org/html/2605.08522#S4.SS3.SSS1.p5.1)\.
- P\. Röttger, V\. Hofmann, V\. Pyatkin, M\. Hinck, H\. Kirk, H\. Schütze, and D\. Hovy \(2024\)Political compass or spinning arrow? towards more meaningful evaluations for values and opinions in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 15295–15311\.External Links:[Link](https://aclanthology.org/2024.acl-long.816),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.816)Cited by:[§4\.1\.1](https://arxiv.org/html/2605.08522#S4.SS1.SSS1.p1.1)\.
- A\. Sottana, B\. Liang, K\. Zou, and Z\. Yuan \(2023\)Evaluation metrics in the era of GPT\-4: reliably evaluating large language models on sequence to sequence tasks\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 8776–8788\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.543)Cited by:[§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p2.1)\.
- R\. Stureborg, D\. Alikaniotis, and Y\. Suhara \(2024\)Large language models are inconsistent and biased evaluators\.arXiv preprint arXiv:2405\.01724\.External Links:[Link](https://arxiv.org/abs/2405.01724)Cited by:[§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px3.p1.1)\.
- P\. Y\. Wu \(2025\)Large language models can be a viable substitute for expert political surveys when a shock disrupts traditional measurement approaches\.External Links:2506\.06540,[Link](https://arxiv.org/abs/2506.06540)Cited by:[§2\.4](https://arxiv.org/html/2605.08522#S2.SS4.p2.1)\.
- L\. H\. Yao, N\. Jarvis, T\. Zhan, S\. Ghosh, L\. Liu, and T\. Jiang \(2025\)JE\-IRT: a geometric lens on LLM abilities through joint embedding item response theory\.External Links:2509\.22888,[Link](https://arxiv.org/abs/2509.22888)Cited by:[§1](https://arxiv.org/html/2605.08522#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08522#S2.SS2.p3.1),[2nd item](https://arxiv.org/html/2605.08522#S3.I1.i2.p1.1)\.
- J\. Ye, Y\. Wang, Y\. Huang, D\. Chen, Q\. Zhang, N\. Moniz, T\. Gao, W\. Geyer, C\. Huang, P\. Chen, N\. V\. Chawla, and X\. Zhang \(2024a\)Justice or prejudice? quantifying biases in llm\-as\-a\-judge\.External Links:2410\.02736,[Link](https://arxiv.org/abs/2410.02736)Cited by:[§4\.4\.1](https://arxiv.org/html/2605.08522#S4.SS4.SSS1.p1.1)\.
- J\. Ye, Y\. Wang, Y\. Huang, D\. Chen, Q\. Zhang, N\. Moniz, T\. Gao, W\. Geyer, C\. Huang, P\. Chen, N\. V\. Chawla, and X\. Zhang \(2024b\) Justice or prejudice? Quantifying biases in LLM\-as\-a\-judge\. External Links: 2410\.02736, [Link](https://arxiv.org/abs/2410.02736) Cited by: [§1](https://arxiv.org/html/2605.08522#S1.p3.1), [§2\.3](https://arxiv.org/html/2605.08522#S2.SS3.SSS0.Px3.p1.1)\.
- Y\. Yu and coauthors \(2026\) The latent space: foundation, evolution, mechanism, ability, and outlook\. External Links: 2604\.02029, [Link](https://arxiv.org/abs/2604.02029) Cited by: [§1](https://arxiv.org/html/2605.08522#S1.p2.1), [§2\.2](https://arxiv.org/html/2605.08522#S2.SS2.p1.1)\.
- M\. Zhang, Y\. Shen, J\. Deng, Y\. Wang, H\. Sha, K\. Tan, Q\. Peng, Y\. Zhang, J\. Wang, S\. Liu, et al\. \(2025\) LLMEval\-Fair: a large\-scale longitudinal study on robust and fair evaluation of large language models\. External Links: 2508\.05452, [Link](https://arxiv.org/abs/2508.05452) Cited by: [§1](https://arxiv.org/html/2605.08522#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.08522#S2.SS1.p3.1)\.
- J\. Zhuo, S\. Zhang, X\. Fang, H\. Duan, D\. Lin, and K\. Chen \(2024\) ProSA: assessing and understanding the prompt sensitivity of LLMs\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, pp\. 1950–1976\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.108), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.108) Cited by: [§4\.1\.2](https://arxiv.org/html/2605.08522#S4.SS1.SSS2.p1.1), [§4\.1\.2](https://arxiv.org/html/2605.08522#S4.SS1.SSS2.p4.3), [§4\.1\.2](https://arxiv.org/html/2605.08522#S4.SS1.SSS2.p5.1)\.
