SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

arXiv cs.CL 05/19/26, 04:00 AM Papers
Summary
Proposes SKG-Eval, a quasi-deterministic evaluation framework for multi-turn dialogue that uses incremental semantic knowledge graphs to detect cross-turn inconsistencies, contradiction, and topic drift, achieving higher correlation with human judgments.
arXiv:2605.16650v1 Announce Type: new Abstract: Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
Cached at: 05/19/26, 06:34 AM
# SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs
Source: [https://arxiv.org/html/2605.16650](https://arxiv.org/html/2605.16650)
Avijit Shil¹, Suman Samui²
¹Maulana Abul Kalam Azad University of Technology, West Bengal, India; ²National Institute of Technology Durgapur, West Bengal, India
avijitshil52460@gmail.com, ssamui@nitdgp.ac.in

###### Abstract

Evaluating multi-turn dialogue systems poses fundamental challenges: the quality of each response depends not only on the immediate prompt but on a growing context of prior commitments, entities, and claims that the model is implicitly bound to. Existing automatic evaluators, including LLM-as-a-judge protocols and embedding-based metrics, largely operate on flat or turn-isolated representations, and therefore fail to reliably detect cross-turn failure modes such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable evaluation framework that models dialogue as an evolving *Semantic Knowledge Graph* (SKG) of entities, relations, and commitments across turns. At each turn, the graph is incrementally updated via structured triple extraction, and three complementary signals are computed: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, quantifying how newly introduced information connects to prior conversational state via graph-anchored and embedding-based signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are fused using a regime-adaptive mechanism and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves recall of long-range inconsistencies, particularly in extended conversations where existing evaluators degrade. In addition, SKG-Eval produces explicit contradiction certificates and yields deterministic scores given fixed inputs, enabling reproducible and auditable evaluation.
Our results suggest that externalized state tracking via structured representations is a principled and scalable alternative to implicit reasoning in LLM\-based evaluators for long\-horizon dialogue systems\.

## 1Introduction

Large language models (LLMs) are increasingly deployed as multi-turn conversational agents in domains where the cost of a contradicted prior statement, a forgotten constraint, or a silently shifted topic ranges from user frustration to material harm. Yet the dominant paradigm for automatically evaluating dialogue still rests on *turn-isolated* signals: a single response is scored against a single prompt and (sometimes) a reference, with conversational history compressed into a prefix that the evaluator is trusted to read (Lin and Chen, [2023](https://arxiv.org/html/2605.16650#bib.bib7); Zhang et al., [2024](https://arxiv.org/html/2605.16650#bib.bib8); Mendonça et al., [2024](https://arxiv.org/html/2605.16650#bib.bib9)). This paradigm has known and increasingly visible failure modes. As soon as conversations grow beyond a few turns, models drop substantial competence (Kwan et al., [2024](https://arxiv.org/html/2605.16650#bib.bib2); Laban et al., [2025](https://arxiv.org/html/2605.16650#bib.bib12)); they make assumptions in early turns that they later contradict (Li et al., [2026a](https://arxiv.org/html/2605.16650#bib.bib13)); and frontier judges trained or prompted on short contexts do not reliably surface these errors at the session level (Sirdeshmukh et al., [2025](https://arxiv.org/html/2605.16650#bib.bib4)).

The root cause is that conversational quality is intrinsically *stateful and temporal*. A turn that looks locally fluent and on-topic may nonetheless be wrong because, two turns earlier, the assistant claimed the opposite; or because it has slowly drifted off the user’s actual question; or because it has substituted a new value for a fact already established. Capturing these failure modes requires the evaluator itself to maintain a structured representation of the conversation’s commitments and to reason explicitly about how a new turn relates to that commitment store. LLM-as-a-judge mitigates this only partially: it pushes the burden of stateful reasoning onto a black-box judge whose attention pattern over long histories is itself unreliable, whose verdicts are non-deterministic, and whose contradiction recall on paraphrased and numerical conflicts is poor (Ike et al., [2025](https://arxiv.org/html/2605.16650#bib.bib19)).

We argue for an alternative: explicit, externalized state. We propose SKG-Eval, an evaluation framework that incrementally builds a typed, time-stamped *Semantic Knowledge Graph* from the conversation as it unfolds, and scores each new turn against this graph rather than against a flat prefix. Three signals are extracted at every turn: a *local relevance* score that triangulates the response against the prompt and (when available) a reference, a *historical consistency* score that measures how the new turn’s entities and facts attach to the existing graph, and a *logical coherence* score produced by a purely geometric contradiction engine that compares the current turn’s edges to its historical edges through negation flips, antonym pairs, numeric mismatches, and combined relation/object divergence. These signals are fused under a regime-adaptive weighting and aggregated across the session via a recency-weighted regression with a length-adaptive trend coefficient.

To summarize our contributions:

- **Stateful Dialogue Evaluation via Explicit Semantic Memory.** We formulate multi-turn dialogue evaluation as reasoning over an evolving Semantic Knowledge Graph (SKG) that explicitly represents entities, relations, and conversational commitments across turns. Through this externalized semantic state, cross-turn dependencies and long-range conversational consistency can be analyzed in a structured manner.
- **Geometric Contradiction Engine with Revision Awareness.** A geometry-driven contradiction detection framework is proposed for graph-structured semantic representations. The engine detects inconsistencies through structured comparison of relations and objects, including negation reversals, antonymic relations, numeric mismatch, and relation-consistent object divergence. Revision-aware filtering is further incorporated to avoid penalizing legitimate conversational updates. As a result, interpretable contradiction certificates are produced without requiring NLI models or LLM judges during scoring.
- **Graph-Anchored Historical Consistency Modeling.** We introduce a graph-anchored consistency metric that evaluates whether newly introduced information remains semantically connected to prior conversational state. A complementary session-anchor mechanism is also designed to capture higher-level thematic continuity across turns.
- **Robust Local Relevance through Multi-Signal Semantic Alignment.** A triangulated relevance metric is developed that jointly considers prompt alignment and optional reference coverage, together with adaptive fallback mechanisms for short or reference-free responses. This design improves robustness relative to single-signal semantic similarity measures.
- **Regime-Adaptive Fusion and Session-Level Aggregation.** We develop a regime-aware fusion strategy that dynamically weights relevance, consistency, and logical coherence according to response characteristics. At the session level, a recency-weighted regression-based aggregation mechanism is introduced to capture both quality level and temporal degradation trends across long conversations.
- **Interpretable and Quasi-Deterministic Evaluation.** SKG-Eval provides explicit audit trails through per-turn scores, semantic anchors, and contradiction certificates. Empirically, strong alignment with human judgments is achieved while improved sensitivity to long-range conversational inconsistency and semantic drift is maintained across extended dialogue sessions.

The remainder of the paper is organized as follows. Section [2](https://arxiv.org/html/2605.16650#S2) surveys multi-turn evaluation, LLM-as-judge, dialogue coherence, and knowledge-graph-augmented evaluation, identifying the precise gap that SKG-Eval fills. Section [3](https://arxiv.org/html/2605.16650#S3) formalizes the problem and states the desiderata. Section [4](https://arxiv.org/html/2605.16650#S4) develops the proposed framework component by component, with formal definitions, propositions, algorithmic descriptions, and complexity analysis. Section [5](https://arxiv.org/html/2605.16650#S5) reports empirical results.

## 2Related Work

#### Multi\-turn dialogue benchmarks\.

Early multi-turn benchmarks showed that single-turn evaluation overstates dialogue ability. MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.16650#bib.bib1)) proposed two-turn open prompts with GPT-4 acting as the judge; its short horizon, however, limits its ability to detect long-range failure cases. MT-Eval (Kwan et al., [2024](https://arxiv.org/html/2605.16650#bib.bib2)) expanded on this by considering four types of interaction (recollection, extension, refinement, and follow-up) and reported significant drops in multi-turn performance that single-turn scores do not predict, citing distance from relevant information and error propagation as key causes. MINT (Wang et al., [2024](https://arxiv.org/html/2605.16650#bib.bib3)) evaluated tool use and feedback handling over multi-turn interactions; MultiChallenge (Sirdeshmukh et al., [2025](https://arxiv.org/html/2605.16650#bib.bib4)) increased problem complexity so that advanced models fail to exceed 50% accuracy on multi-turn tasks that involve instruction following, context assignment, and contextual reasoning. MT-dyna (Gao et al., [2026](https://arxiv.org/html/2605.16650#bib.bib5)) presented dynamic questioning techniques based on conversation history and intentions. Domain-specific multi-turn tests include medical consultation evaluators (Liao et al., [2023](https://arxiv.org/html/2605.16650#bib.bib17)) and psychological dialogue benchmarks such as PsycoLLM (Hu et al., [2025](https://arxiv.org/html/2605.16650#bib.bib18)). Large-scale simulation tests (Laban et al., [2025](https://arxiv.org/html/2605.16650#bib.bib12)) have confirmed that LLMs get "lost" once they make an initial wrong assumption; Li et al. ([2026a](https://arxiv.org/html/2605.16650#bib.bib13)) showed that even reasoning models will disregard correct answers when faced with multi-turn challenges.
These benchmarks have helped identify failure cases, while SKG\-Eval attempts to measure them\.

#### Reference\-free and LLM\-as\-judge evaluation\.

A second line replaces n-gram metrics with neural reference-free evaluators. LLM-Eval (Lin and Chen, [2023](https://arxiv.org/html/2605.16650#bib.bib7)) unifies multiple dimensions in a single prompt; Zhang et al. ([2024](https://arxiv.org/html/2605.16650#bib.bib8)) analyze 30 LLMs across 12 meta-evaluation datasets, exposing both promising correlations with human judgment and significant brittleness under adversarial perturbations. ECoh (Mendonça et al., [2024](https://arxiv.org/html/2605.16650#bib.bib9)) distills GPT-3.5 into a small open evaluator for turn-level coherence in five languages. BotChat (Duan et al., [2024](https://arxiv.org/html/2605.16650#bib.bib6)) uses an LLM judge to assess pairwise dialogue generation quality. Ike et al. ([2025](https://arxiv.org/html/2605.16650#bib.bib19)) compare GPT-4o to human judges across seven KPIs, reporting that GPT-4o handles factuality and commonsense well but struggles with redundancy and self-contradiction, precisely the cross-turn failures SKG-Eval targets. LLM Comparator (Kahng et al., [2025](https://arxiv.org/html/2605.16650#bib.bib11)) addresses interpretability of side-by-side judge outputs but does not change the underlying judging mechanism. Across this literature, the evaluator’s state is implicit in the model’s attention; whether the judge actually *tracked* the prior turns is unfalsifiable.

#### Knowledge graphs in dialogue and reasoning evaluation\.

Knowledge-graph-grounded dialogue systems are a long-standing line of work, but their use *as evaluators* of free-form multi-turn LLM dialogue is far less developed. The closest benchmarks are surveyed in (Li et al., [2026b](https://arxiv.org/html/2605.16650#bib.bib14); Guan et al., [2026](https://arxiv.org/html/2605.16650#bib.bib15)). Yao et al. ([2026](https://arxiv.org/html/2605.16650#bib.bib16)) quantify *benchmark* quality (hardness, separability, diversity) but do not address per-conversation quality. To our knowledge, no prior work integrates incremental triple extraction, time-stamped graph state, geometric contradiction detection, and session-level temporal aggregation into a single evaluation framework.

#### Research gap\.

Across \(a\) benchmarks, which surface failure modes but do not measure them mechanistically; \(b\) LLM\-as\-judge protocols, which trust the judge’s implicit state and inherit its non\-determinism and weak contradiction recall; and \(c\) reference\-based metrics, which score isolated turns—there is no evaluator that \(i\) maintains an explicit, time\-stamped state of the conversation’s factual commitments, \(ii\) detects cross\-turn contradictions deterministically and interpretably without an LLM judge in the inner loop, and \(iii\) aggregates per\-turn quality into a session\-level metric in a length\-invariant way\. SKG\-Eval is designed to fill exactly this gap\.

## 3Problem Formulation

#### Dialogue and turn structure\.

A dialogue is a sequence of turns $\mathcal{D} = (\tau_1, \tau_2, \ldots, \tau_T)$, where each turn $\tau_t = (q_t, r_t, r^{\ast}_t)$ consists of a user prompt $q_t$, an assistant response $r_t$ generated by the system under evaluation, and an optional reference response $r^{\ast}_t$ used only when scoring local relevance. We denote the dialogue prefix up to and including turn $t$ by $\mathcal{D}_{1:t}$.

#### Evaluation as a sequential decision problem\.

The goal of an automatic evaluator is to produce, at each turn $t$, a quality score $Q_t \in [0,1]$ together with a session-level summary $\mathcal{S}(\mathcal{D}) \in [0,1]$. We require:

- **Causality.** $Q_t$ depends only on $\mathcal{D}_{1:t}$. The evaluator may not consult future turns.
- **Statefulness.** $Q_t$ depends on the entire prefix, not only on $\tau_t$. Specifically, the evaluator must be able to detect cross-turn failures in which $r_t$ in isolation appears acceptable but $(r_1, \ldots, r_t)$ is not.
- **Determinism.** Given the same dialogue prefix and the same model parameters (embedding model, extractor), $Q_t$ must be reproducible.
- **Length invariance.** The session-level summary $\mathcal{S}(\mathcal{D})$ should not be artificially inflated or deflated by session length alone.
- **Interpretability.** For any low score, the evaluator should expose the structural cause (which entity, which prior turn, which contradiction class).

#### Failure modes\.

We consider six cross\-turn conversational failure modes:

- **(F1) Direct contradiction:** A response conflicts with a previously asserted fact for the same subject or attribute.
- **(F2) Numeric/value substitution:** A different numeric or categorical value is assigned to an earlier subject–predicate pair.
- **(F3) Antonymic flip:** A previously asserted relation is reversed through an antonymic transformation (e.g., “increases” vs “decreases”).
- **(F4) Topic drift:** New entities or relations are introduced without semantic grounding in prior conversational state.
- **(F5) Local irrelevance:** The response fails to address the current user query.
- **(F6) Silent forgetting:** Previously established constraints or commitments are omitted without acknowledgement.

F5 is captured by local relevance; F4 and F6 by historical consistency; F1–F3 by the logical coherence engine\. F6 is partially captured by F4 in our framework and fully recovered through structural inspection of the graph\.

#### State representation\.

We represent the evaluator’s state at turn $t$ by an incremental Semantic Knowledge Graph $G_t = (V_t, E_t)$, defined formally in Section [4](https://arxiv.org/html/2605.16650#S4) ([Mo et al.](https://arxiv.org/html/2605.16650#bib.bib27)). The graph is a directed multigraph whose nodes are entity labels, whose edges carry typed metadata (relation, attribute, intent, property type, turn id), and whose state at time $t$ summarizes all factual commitments made up to turn $t$ together with semantic-similarity scaffolding induced by an embedding model.

#### Quality functional\.

We posit a per\-turn factorization

$$Q_t \;=\; \mathcal{F}\!\left(S^{\text{loc}}_t,\, S^{\text{cons}}_t,\, S^{\text{log}}_t\,;\, \theta_t\right), \tag{1}$$

where $S^{\text{loc}}_t, S^{\text{cons}}_t, S^{\text{log}}_t \in [0,1]$ are the local-relevance, historical-consistency, and logical-coherence scores respectively, and $\theta_t$ is a regime-adaptive weighting selected by simple statistics of $(q_t, r_t)$. The session-level summary is

$$\mathcal{S}(\mathcal{D}) \;=\; \mathcal{A}\!\left(Q_1, \ldots, Q_T\,;\, \boldsymbol{w}_T\right), \tag{2}$$

where $\mathcal{A}$ is the recency-weighted aggregator of Section [4.7](https://arxiv.org/html/2605.16650#S4.SS7) and $\boldsymbol{w}_T$ the recency weights. The remainder of the paper develops $\mathcal{F}$, $\mathcal{A}$, and the state $G_t$.

Table 1: Key notation used in SKG-Eval.

![Refer to caption](https://arxiv.org/html/2605.16650v1/figures/skg_illustration2.png)

Figure 1: End-to-end pipeline of SKG-Eval. Each dialogue turn is converted into structured triples, which are integrated into an evolving Semantic Knowledge Graph (SKG). The framework performs deduplication, semantic linking, and contradiction detection across turns, producing a unified state representation that enables interpretable and state-aware evaluation of multi-turn dialogue.

## 4Proposed Method: SKG\-Eval

As illustrated in Figure [1](https://arxiv.org/html/2605.16650#S3.F1), SKG-Eval incrementally constructs and reasons over a structured representation of dialogue state.

We now develop the framework. Section [4.1](https://arxiv.org/html/2605.16650#S4.SS1) defines the incremental SKG and its update rule. Sections [4.3](https://arxiv.org/html/2605.16650#S4.SS3), [4.4](https://arxiv.org/html/2605.16650#S4.SS4), and [4.5](https://arxiv.org/html/2605.16650#S4.SS5) develop the three per-turn scores. Section [4.6](https://arxiv.org/html/2605.16650#S4.SS6) defines the regime-adaptive fusion. Section [4.7](https://arxiv.org/html/2605.16650#S4.SS7) develops the session aggregator. Section [4.8](https://arxiv.org/html/2605.16650#S4.SS8) discusses complexity. Notation: $\cos(\cdot,\cdot)$ denotes cosine similarity; $\phi: \text{strings} \to \mathbb{R}^{d}$ denotes the sentence-embedding map (a frozen SentenceTransformer in our implementation); $\mathbf{1}[\cdot]$ is the indicator function. We summarize frequently used symbols in Table [1](https://arxiv.org/html/2605.16650#S3.T1).

### 4\.1The Incremental Semantic Knowledge Graph

###### Definition 1\(Semantic Knowledge Graph\)\.

A Semantic Knowledge Graph at time $t$ is a directed multigraph $G_t = (V_t, E_t)$, where each node $v \in V_t$ carries attributes

$$v \;=\; \big(\ell(v),\; \tau(v),\; \phi(v),\; \iota(v),\; t_0(v),\; q(v)\big),$$

denoting respectively a normalized label, an entity type from a fixed taxonomy $\mathcal{T} = \{\textsc{Person}, \textsc{Event}, \textsc{Object}, \textsc{Concept}, \textsc{Condition}, \textsc{Organization}, \textsc{Time}, \textsc{Number}\}$, an embedding $\phi(v) \in \mathbb{R}^{d}$, an importance score $\iota(v) \in [0,1]$, the introduction turn $t_0(v) \in \mathbb{N}$, and a quarantine flag $q(v) \in \{0,1\}$. Each edge $e \in E_t$ carries

$$e \;=\; \big(u,\, v,\; \rho(e),\; \kappa(e),\; t(e),\; \alpha(e),\; \mu(e),\; \pi(e),\; q(e)\big),$$

denoting source/target nodes, relation string $\rho(e)$, edge kind $\kappa(e) \in \{\textsc{fact}, \textsc{semantic}\}$, turn index $t(e)$, attribute $\alpha(e) \in \mathcal{A}$, intent $\mu(e) \in \mathcal{I}$, property type $\pi(e) \in \{\textsc{Exclusive}, \textsc{Additive}\}$, and a quarantine flag $q(e)$.

The attribute taxonomy $\mathcal{A} = \{\textsc{def}, \textsc{eff}, \textsc{prop}, \textsc{cmp}, \textsc{req}, \textsc{qty}, \textsc{neg}\}$ encodes *which aspect* of the subject a triple describes (definition, effect, property, comparison, requirement, quantity, negation). The intent taxonomy $\mathcal{I} = \{\textsc{State}, \textsc{Advice}, \textsc{Hyp}\}$ encodes modality. These typed annotations are central to the contradiction engine: only edges with matching attributes and intents are compared.
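To make Definition 1 concrete, the node and edge records can be sketched as plain Python data classes. This is an illustrative sketch only: the class names, field names, and the `comparable` helper are hypothetical and not taken from the authors' implementation; only the taxonomies and the matching rule (same attribute and same intent) come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EntityType(Enum):
    PERSON = "Person"
    EVENT = "Event"
    OBJECT = "Object"
    CONCEPT = "Concept"
    CONDITION = "Condition"
    ORGANIZATION = "Organization"
    TIME = "Time"
    NUMBER = "Number"


class EdgeKind(Enum):
    FACT = "fact"
    SEMANTIC = "semantic"


class Attribute(Enum):
    DEF = "def"    # definition
    EFF = "eff"    # effect
    PROP = "prop"  # property
    CMP = "cmp"    # comparison
    REQ = "req"    # requirement
    QTY = "qty"    # quantity
    NEG = "neg"    # negation


class Intent(Enum):
    STATE = "State"
    ADVICE = "Advice"
    HYP = "Hyp"


class PropertyType(Enum):
    EXCLUSIVE = "Exclusive"  # only one value is admissible
    ADDITIVE = "Additive"    # multiple values can coexist


@dataclass
class Node:
    label: str                         # normalized label
    entity_type: EntityType            # type from the fixed taxonomy
    embedding: Optional[List[float]]   # None until the embedding is computed
    importance: float                  # importance score in [0, 1]
    intro_turn: int                    # turn at which the node was introduced
    quarantined: bool = False          # quarantine flag


@dataclass
class Edge:
    source: str
    target: str
    relation: str
    kind: EdgeKind
    turn: int
    attribute: Attribute
    intent: Intent
    property_type: PropertyType
    quarantined: bool = False


def comparable(e1: Edge, e2: Edge) -> bool:
    """Hypothetical helper: two edges are candidates for a contradiction
    check only when their attribute and intent annotations match."""
    return e1.attribute == e2.attribute and e1.intent == e2.intent
```

Under this sketch, a stated effect and a hypothetical effect on the same subject would never be compared, which is the filtering role the typed annotations play in the text.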

#### Triple extraction\.

We treat triple extraction as a one-shot LLM call $\mathrm{Extract}: \text{turn text} \to \mathcal{T}_t$, where $\mathcal{T}_t$ is a (possibly empty) set of typed triples

$$T \;=\; (s,\, r,\, o,\, \tau_s,\, \tau_o,\, \iota,\, \alpha,\, \mu,\, \pi).$$

The extractor is governed by a deterministic prompt encoding (i) subject normalization rules (strip leading actions, comparative qualifiers, role-of constructs, location qualifiers, possessives), (ii) an attribute taxonomy with disambiguation rules between $\textsc{def}$, $\textsc{eff}$, $\textsc{prop}$, $\textsc{qty}$, etc., and (iii) a property-type rule that classifies whether multiple values can coexist ($\textsc{Additive}$) or only one is admissible ($\textsc{Exclusive}$). The LLM is used *only* for extraction, never for reasoning or judgment, which preserves determinism of the downstream scoring pipeline modulo the extractor.

#### Update rule\.

Given the new triple set $\mathcal{T}_t$ at turn $t$, the graph is updated by Algorithm [1](https://arxiv.org/html/2605.16650#alg1). The map $\nu(s) = \texttt{lower}(s).\texttt{strip}()$ is the canonical key. Cross-turn deduplication merges new subjects into existing graph nodes when $\cos(\phi(s), \phi(v)) \geq \theta_{\text{dedup}}$ for some $v \in V_{t-1}$ (with $\theta_{\text{dedup}} = 0.80$ in our implementation), enforcing label consistency across turns, a precondition for cross-turn contradiction detection. After fact edges are added, semantic edges are added between newly introduced nodes and existing nodes whenever their embedding similarity exceeds $\theta_{\text{sem}} = 0.50$.
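The update rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: triples are reduced to `(subject, relation, object)`, the graph is a dictionary of node embeddings plus a list of edge tuples, the embedding function is passed in by the caller, merging into the most similar prior node is an assumed tie-breaking rule, and the `similar_to` relation label on semantic edges is hypothetical. The thresholds 0.80 and 0.50 are the values quoted above.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Edge = Tuple[str, str, str, str, int]  # (source, target, relation, kind, turn)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def update_graph(
    nodes: Dict[str, Vector],
    edges: List[Edge],
    triples: List[Tuple[str, str, str]],
    turn: int,
    embed: Callable[[str], Vector],
    theta_dedup: float = 0.80,
    theta_sem: float = 0.50,
) -> Set[str]:
    """Mutates nodes/edges in place and returns the nodes new at this turn."""
    prior = dict(nodes)          # snapshot of V_{t-1}
    new_nodes: Set[str] = set()  # N_t

    def resolve(label: str, dedup: bool) -> str:
        key = label.lower().strip()  # canonical key nu(.)
        if key in nodes:
            return key
        vec = embed(key)
        if dedup:
            # Merge into the most similar prior node at or above the threshold.
            best, best_sim = None, theta_dedup
            for cand, cand_vec in prior.items():
                sim = cosine(vec, cand_vec)
                if sim >= best_sim:
                    best, best_sim = cand, sim
            if best is not None:
                return best
        nodes[key] = vec
        new_nodes.add(key)
        return key

    for s, r, o in triples:
        u = resolve(s, dedup=True)   # the text describes merging for subjects
        v = resolve(o, dedup=False)
        edges.append((u, v, r, "fact", turn))

    # Link each new node to prior nodes whose similarity exceeds theta_sem.
    for u in sorted(new_nodes):
        for cand, cand_vec in prior.items():
            if cosine(nodes[u], cand_vec) > theta_sem:
                edges.append((u, cand, "similar_to", "semantic", turn))
    return new_nodes
```

Calling `update_graph` once per turn with the same `nodes` and `edges` objects accumulates the graph incrementally; the returned set plays the role of $\mathcal{N}_t$ for the consistency score.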

### 4\.2Worked Example: A Three\-Turn Walkthrough

We illustrate the incremental graph construction with a three-turn example. Fig. [2](https://arxiv.org/html/2605.16650#S4.F2) shows how new nodes and relations are added at each turn and anchored to the existing graph via factual and semantic connections.

#### Conversation\.

Three turns with reference answers:

- T1. $q_1$: *“Does skipping breakfast affect metabolism?”* $r_1$: *“Yes, skipping breakfast significantly slows down metabolism.”* $r^{\ast}_1$: *“Skipping breakfast slows metabolism.”*
- T2. $q_2$: *“What about concentration?”* $r_2$: *“Skipping breakfast also reduces concentration and focus.”* $r^{\ast}_2$: *“It reduces concentration.”*
- T3. $q_3$: *“Does metabolic rate affect weight gain?”* $r_3$: *“Yes, a slower metabolic rate is linked to weight gain.”* $r^{\ast}_3$: *“Skipping breakfast slows metabolism.”*

**Algorithm 1** UpdateGraph: incremental SKG update at turn $t$

1. **Input:** graph $G_{t-1}$, triple set $\mathcal{T}_t$, embedding model $\phi$, thresholds $\theta_{\text{dedup}}, \theta_{\text{sem}}$
2. $\mathcal{T}_t \leftarrow \textsc{Deduplicate}(\mathcal{T}_t, V_{t-1}, \phi, \theta_{\text{dedup}})$ ▷ cross-turn label consistency
3. $\mathcal{N}_t \leftarrow \emptyset$ ▷ new nodes added at turn $t$
4. **for each** triple $T = (s, r, o, \tau_s, \tau_o, \iota, \alpha, \mu, \pi) \in \mathcal{T}_t$ **do**
5. $u \leftarrow \nu(s),\; v \leftarrow \nu(o)$
6. **if** $u \notin V_{t-1}$ **then** add $u$ with $(\ell = s,\, \tau = \tau_s,\, \phi = \bot,\, \iota,\, t_0 = t)$; $\mathcal{N}_t \leftarrow \mathcal{N}_t \cup \{u\}$
7. **end if**
8. **if** $v \notin V_{t-1}$ **then** add $v$ with $(\ell = o,\, \tau = \tau_o,\, \phi = \bot,\, \iota,\, t_0 = t)$; $\mathcal{N}_t \leftarrow \mathcal{N}_t \cup \{v\}$
9. **end if**
10. Update existing-node importance: $\iota(u) \leftarrow \max(\iota(u), \iota)$ and similarly for $v$
11. Add fact edge $u \xrightarrow{r} v$ with $(\kappa = \textsc{fact},\, t(e) = t,\, \alpha,\, \mu,\, \pi)$
12. **end for**
13. $\textsc{LinkSemantic}(G_t, \mathcal{N}_t, \phi, \theta_{\text{sem}})$ ▷ add cosine-thresholded semantic edges
14. **return** $G_t,\, \mathcal{N}_t$

[Figure 2 panels. Turn 1 (*“skipping breakfast slows metabolism”*): nodes sb, met; edge slows. Turn 2 (*“… also reduces concentration”*): nodes sb, met, conc; edges slows, reduces. Turn 3 (*“slower metabolic rate → weight gain”*): nodes sb, met, conc, m. rate, weight gain; edges slows, reduces, linked, sem (0.88).]

Figure 2: Turn-wise growth of the incremental Semantic Knowledge Graph. Orange nodes and edges denote information introduced at the current turn, black edges denote prior factual commitments, and dashed blue edges denote semantic links induced by embedding similarity. Turn 3 illustrates semantic anchoring of *metabolic rate* to the prior node *metabolism*, followed by factual expansion toward *weight gain*. Abbreviations: *sb* = skipping breakfast, *met* = metabolism, *conc* = concentration.

### 4\.3Local Relevance via the Semantic Triangle

The local relevance score $S^{\text{loc}}_t$ measures whether $r_t$ addresses $q_t$ and, when a reference $r^{\ast}_t$ is available, whether it covers what a correct answer should contain. We compute it as a weighted combination of sentence-level maximum cosine similarities, with a reference-availability gate that protects against degenerate inputs.

###### Definition 2\(Semantic Triangle\)\.

Let $\mathcal{S}(r_t) = \{s_1, \ldots, s_K\}$ be the sentence segmentation of $r_t$, and let $\Phi(\mathcal{S}(r_t)) = [\phi(s_1); \ldots; \phi(s_K)] \in \mathbb{R}^{K \times d}$. Define

$$m_q \;=\; \max_{k \leq K} \cos\big(\phi(q_t),\, \phi(s_k)\big), \tag{3}$$

$$m_r \;=\; \max_{k \leq K} \cos\big(\phi(r^{\ast}_t),\, \phi(s_k)\big) \quad \text{(when } r^{\ast}_t \neq \varnothing\text{)}. \tag{4}$$

With reference-availability indicator $\beta_t = \mathbf{1}\big[\, r^{\ast}_t \neq \varnothing \,\wedge\, |q_t| \geq L_{\min} \,\big]$, the Semantic Triangle local-relevance score is

$$S^{\text{loc}}_t \;=\; \begin{cases} \omega_q\, m_q + \omega_r\, m_r & \text{if } \beta_t = 1, \\ m_q & \text{if } \beta_t = 0, \end{cases} \quad \text{with } \omega_q + \omega_r = 1. \tag{5}$$

We use $\omega_q = 0.4$, $\omega_r = 0.6$ and $L_{\min} = 10$ words.
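Equations (3)–(5) translate almost directly into code. The sketch below assumes the prompt, reference, and response sentences have already been embedded by the caller (sentence segmentation and the embedding model are outside its scope); the function and parameter names are hypothetical, and no clipping to $[0,1]$ is applied because the text does not specify one.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def local_relevance(
    q_vec: Vector,
    sent_vecs: Sequence[Vector],
    ref_vec: Optional[Vector] = None,
    prompt_len_words: int = 0,
    w_q: float = 0.4,
    w_r: float = 0.6,
    l_min: int = 10,
) -> float:
    """Semantic Triangle score: max-pooled prompt alignment, blended with
    max-pooled reference alignment only when the gate beta_t is on."""
    if not sent_vecs:
        raise ValueError("response must contain at least one sentence")
    m_q = max(cosine(q_vec, s) for s in sent_vecs)                 # Eq. (3)
    gate = ref_vec is not None and prompt_len_words >= l_min      # beta_t
    if not gate:
        return m_q                                                 # Eq. (5), beta_t = 0
    m_r = max(cosine(ref_vec, s) for s in sent_vecs)               # Eq. (4)
    return w_q * m_q + w_r * m_r                                   # Eq. (5), beta_t = 1
```

With toy 2-D vectors, a response whose best sentence matches the prompt exactly ($m_q = 1$) and whose best reference match is $0.8$ scores $0.4 \cdot 1 + 0.6 \cdot 0.8 = 0.88$ when the gate is on, and falls back to $m_q = 1$ when the prompt is shorter than $L_{\min}$ words or no reference is given.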

#### Why max\-pooling\.

The maximum over sentences (rather than the mean) implements a coverage notion: $r_t$ addresses $q_t$ if *some* sentence in $r_t$ is highly aligned with $q_t$. This rewards extended responses that contain a focused answer surrounded by elaboration without diluting the score by averaging in low-relevance sentences.

#### Reference\-availability gate\.

The indicator $\beta_t$ guards against two failure modes of naive triangulation. First, when no reference is provided, $m_r$ is undefined; collapsing to $m_q$ keeps the score honest. Second, when the prompt itself is degenerate (e.g. a one-word topic label, $|q_t| < L_{\min}$), $m_q$ is dense in the response by construction and the triangle would over-credit reference match alone; collapsing to $m_q$ prevents this.

#### Short\-response context glue\.

For *short* responses (word count $<W_{\text{short}}$), pronouns and elliptical references can deflate $m_q$. We apply a context glue: encode the augmented string "Prompt: $q_t$ Response: $r_t$" in lieu of $r_t$ itself for the $S^{\text{loc}}_t$ computation. This is a linguistic anaphora-resolution heuristic, not a semantic change to Definition [2](https://arxiv.org/html/2605.16650#Thmdefinition2).

### 4.4 Historical Consistency via Graph Connectivity and Session Anchor

Historical consistency $S^{\text{cons}}_t$ measures whether the new turn's content attaches to the existing conversation. We give two complementary measurements: a structural *graph anchor* score over the typed-edge connectivity of the new nodes, and a sequence-level *session anchor* score that rescues focused-Q&A patterns where graph disconnection is structurally expected.

###### Definition 3 (Graph anchor score).

Let $\mathcal{N}_t$ be the new nodes added at turn $t$, and let $V^{<t}=V_{t-1}$. For each $u\in\mathcal{N}_t$, define the per-node anchor score

$$a(u) \;=\; \begin{cases}\eta_{\text{F}} & \text{if }\exists\,v\in V^{<t}\text{ with edge }u\to v\text{ or }v\to u\text{ of kind }\textsc{fact},\\ \eta_{\text{S}} & \text{else if }\exists\,v\in V^{<t}\text{ connected to }u\text{ by a }\textsc{semantic}\text{ edge},\\ \eta_{\text{D}} & \text{otherwise (drift)},\end{cases} \tag{6}$$

with $\eta_{\text{F}}=1.0$, $\eta_{\text{S}}=0.65$, $\eta_{\text{D}}=0.20$. The graph anchor score is the importance-weighted mean

$$S^{\text{graph}}_t \;=\; \frac{\sum_{u\in\mathcal{N}_t}\iota(u)\,a(u)}{\sum_{u\in\mathcal{N}_t}\iota(u)}, \tag{7}$$

with the convention $S^{\text{graph}}_t=1$ when $\mathcal{N}_t=\emptyset$ or $V^{<t}=\emptyset$.

#### Worked example\.

Suppose a turn introduces two new nodes, *metabolic rate* and *weight gain*, with importances $0.85$ and $0.60$ respectively. The node *metabolic rate* is connected to the existing node *metabolism* only via a semantic edge (cosine similarity $0.88$ between their label embeddings, above the $\theta_{\text{sem}}=0.50$ threshold), so $a(\textit{metabolic rate})=\eta_{\text{S}}=0.65$. The node *weight gain* has no connection to any prior node, so $a(\textit{weight gain})=\eta_{\text{D}}=0.20$. The graph anchor score is then

$$S^{\text{graph}}_t=\frac{0.85\cdot 0.65+0.60\cdot 0.20}{0.85+0.60}=\frac{0.553+0.120}{1.45}\approx 0.464,$$

which is the importance-weighted compromise between a moderately-anchored important concept and an isolated minor one. Without importance weighting, both nodes would contribute equally and the score would be the unweighted mean $0.425$, slightly under-crediting the anchored part of the turn.
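The arithmetic of this example can be checked with a few lines of code. This is a minimal sketch of (6) and (7), assuming each new node is summarized as an (importance, link kind) pair with positive importance; the `ETA` table and the function name `graph_anchor_score` are illustrative names rather than the paper's implementation.

```python
# Per-node anchor values eta_F, eta_S, eta_D from Definition 3.
ETA = {"fact": 1.0, "semantic": 0.65, "drift": 0.20}


def graph_anchor_score(new_nodes, has_prior_nodes=True):
    """Importance-weighted mean of per-node anchor values.

    new_nodes: list of (importance, kind) pairs, where kind is the strongest
    link from the node to the prior graph: "fact", "semantic" or "drift".
    """
    if not new_nodes or not has_prior_nodes:
        return 1.0  # convention for an empty N_t or an empty prior graph
    total_importance = sum(imp for imp, _ in new_nodes)
    weighted = sum(imp * ETA[kind] for imp, kind in new_nodes)
    return weighted / total_importance


# metabolic rate (semantic link, importance 0.85), weight gain (drift, 0.60)
weighted_score = graph_anchor_score([(0.85, "semantic"), (0.60, "drift")])
unweighted_score = graph_anchor_score([(1.0, "semantic"), (1.0, "drift")])
print(round(weighted_score, 3), round(unweighted_score, 3))
```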

#### Session anchor rescue\.

Pure graph attachment under-credits Q&A sessions in which each turn introduces a new sub-topic but all turns share an overarching theme: in such sessions, $\mathcal{N}_t$ is graph-disconnected by design. We rescue the score with a sequence-level anchor.

###### Definition 4 (Session anchor).

Let $\phi^{(1)}$ denote the embedding of the (glued) first turn, fixed at the start of the session. Define the session-anchor similarity

$$S^{\text{anc}}_t \;=\; \delta\cdot\cos\big(\phi(r_t),\,\phi^{(1)}\big),\qquad \delta\in(0,1]. \tag{8}$$

The full historical consistency score is

$$S^{\text{cons}}_t \;=\; \max\big(\,S^{\text{graph}}_t,\,S^{\text{anc}}_t\,\big). \tag{9}$$

We use $\delta=0.85$, reflecting that the anchor is weaker evidence than a direct graph edge and should not dominate it when the latter is high.
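The rescue in (8) and (9) amounts to a discounted cosine followed by a maximum. The sketch below assumes the graph anchor score and both embeddings are already available; `historical_consistency` is an illustrative name and the vectors are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def historical_consistency(s_graph, resp_emb, first_turn_emb, delta=0.85):
    """S_cons = max(S_graph, delta * cos(phi(r_t), phi^(1)))."""
    s_anc = delta * cosine(resp_emb, first_turn_emb)
    return max(s_graph, s_anc)


# A graph-disconnected turn (S_graph = eta_D = 0.20) that stays on the session theme.
rescued = historical_consistency(0.20, [1.0, 0.0], [1.0, 0.0])
# A well-anchored turn keeps its graph score, since delta caps the anchor at 0.85.
unchanged = historical_consistency(0.90, [1.0, 0.0], [1.0, 0.0])
```

Because $\delta<1$, the session anchor can lift a drifting turn to at most $0.85$ and never overrides a graph score above that value.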

###### Proposition 1 (Boundedness and monotonicity of $S^{\text{cons}}_t$).

For any $t\geq 1$ and any non-empty $\mathcal{N}_t$, $S^{\text{cons}}_t\in[\eta_{\text{D}},\,1]$. Moreover, if a new turn introduces at least one node $u\in\mathcal{N}_t$ that is fact-connected to $V^{<t}$ and $\sum_{u'\in\mathcal{N}_t}\iota(u')>0$, then $S^{\text{graph}}_t\geq\eta_{\text{F}}\cdot\min_{u'\in\mathcal{N}_t}\iota(u')\big/\sum_{u'}\iota(u')>0$.

The proof is by direct algebra on \([7](https://arxiv.org/html/2605.16650#S4.E7)\); we omit it here\.

### 4.5 Logical Coherence via the Geometric Contradiction Engine

The logical coherence score $S^{\text{log}}_t$ is the primary component of the framework. Rather than relying on string-level NLI or prompt-based LLM judgment, it applies a deterministic geometric reasoner that identifies contradictions over the evolving Semantic Knowledge Graph (SKG).

#### Contradiction as stateful semantic incompatibility\.

Unlike sentence-level NLI models, which evaluate contradiction on isolated premise-hypothesis pairs, SKG-Eval treats contradiction identification as a state-based incompatibility task over a semantic memory that is continuously updated during the conversation. This memory is a Semantic Knowledge Graph ([Mo et al.](https://arxiv.org/html/2605.16650#bib.bib27)) that reflects the set of factual claims made throughout the conversation.

We denote the conversational semantic state at turn $t$ as $\Sigma_t:=G_t$, where $G_t$ stores all factual and semantic commitments accumulated up to turn $t$.

The contradiction engine combines symbolic contradiction detection with embedding-based geometric compatibility. The symbolic component detects high-precision logical contradictions, such as negations, antonyms, and numerical contradictions, while geometric similarity provides robustness to paraphrasing and surface variation.

The engine checks current claims against past claims attached to the same semantic anchor, pairing them by relation and object similarity, and then applies the contradiction detector cascade.

#### Time\-partitioned edge sets\.

For each candidate node $u$, define

$$\mathcal{E}^{\text{cur}}(u)=\{\,e\in E_t:\kappa(e)=\textsc{fact},\,q(e)=0,\,t(e)=t,\,e\text{ incident to }u\,\}, \tag{10}$$

$$\mathcal{E}^{\text{hist}}(u)=\{\,e\in E_t:\kappa(e)=\textsc{fact},\,q(e)=0,\,t(e)<t,\,e\text{ incident to }u\,\}. \tag{11}$$

The geometric engine compares each pair $(e_c,e_h)\in\mathcal{E}^{\text{cur}}(u)\times\widetilde{\mathcal{E}}^{\text{hist}}(u)$, where $\widetilde{\mathcal{E}}^{\text{hist}}(u)$ is the revision-filtered history set defined in ([12](https://arxiv.org/html/2605.16650#S4.E12)), for every node in the candidate set

$$\mathcal{C}_t=\Big(\mathcal{N}_t\;\cup\;\big\{\,u\in V_{t-1}:\mathcal{E}^{\text{cur}}(u)\neq\emptyset\,\big\}\Big)\;\setminus\;\mathcal{B},$$

where $\mathcal{B}$ is a blocklist of generic pronoun/filler nodes (i, you, this, thing, etc.) that carry no contradiction-bearing semantics.

#### Revision\-aware history filtering\.

In multi\-turn dialogue, users may intentionally revise previously established information \(e\.g\., “change that to…”, “instead use…”\)\. Such operations correspond to authorized conversational state updates rather than logical inconsistencies\.

We therefore introduce a revision-aware filtering mechanism. Let $\mathcal{R}_t\subseteq E_t$ denote the set of historical edges identified as revision targets at turn $t$. For contradiction evaluation, the historical comparison set is redefined as:

$$\widetilde{\mathcal{E}}^{\text{hist}}(u)\;=\;\mathcal{E}^{\text{hist}}(u)\setminus\mathcal{R}_t. \tag{12}$$

All subsequent contradiction comparisons are then performed on

$$\mathcal{E}^{\text{cur}}(u)\times\widetilde{\mathcal{E}}^{\text{hist}}(u).$$

Importantly, this filtering is *ephemeral*: the underlying graph is not permanently modified, and the suppressed edges remain available for future conversational context.

#### Intuition through example\.

Consider a dialogue in which the user intentionally revises a previously established slogan\.

Turn 1 \(initial state\)\.

- •User: “I am launching a new coffee brand\. The slogan is ‘Morning Spark: Fueling your day’\. What kind of audience does that attract?”
- •Assistant: “That slogan attracts young professionals seeking an energetic start to the day\.”

The extracted semantic memory contains:

$$[\texttt{Brand}]\rightarrow[\texttt{has slogan}]\rightarrow[\texttt{Morning Spark: Fueling your day}]$$

Turn 2 (intentional revision).

- •User: “Please change the slogan to ‘Morning Spark: Relax and unwind’\. That works better for my decaf line\. Who is the audience now?”
- •Assistant: “The revised slogan targets users seeking a calming and soothing morning routine\.”

Without revision\-aware filtering, the contradiction engine would compare:

$$[\texttt{has slogan}]\rightarrow[\texttt{Fueling your day}]$$

against

$$[\texttt{has slogan}]\rightarrow[\texttt{Relax and unwind}]$$

under the same semantic relation. Since the object meanings are directionally opposed, the detector cascade could incorrectly trigger a semantic drift or exclusivity conflict.

However, the phrase "Please change the slogan to..." acts as an explicit revision signal. Consequently, the earlier slogan edge is temporarily added to $\mathcal{R}_t$ and removed from the contradiction comparison set:

$$\widetilde{\mathcal{E}}^{\text{hist}}(u)=\mathcal{E}^{\text{hist}}(u)\setminus\mathcal{R}_t.$$

As a result, the new slogan is interpreted as a legitimate conversational update rather than a contradiction\. This prevents the evaluator from penalizing dialogue systems for correctly following user\-authorized revisions\.
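The slogan example can be traced with a small sketch of the time-partitioned edge sets in (10) to (12). The `Edge` record and the `comparison_pairs` helper are hypothetical names introduced for illustration; how revision targets are actually detected is not shown here and is passed in as a precomputed set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    turn: int
    kind: str = "fact"
    quarantined: bool = False


def comparison_pairs(edges, node, t, revision_targets):
    """Return E_cur(u) x (E_hist(u) minus R_t) for node u at turn t.

    The edge list itself is never modified, so the filtering is ephemeral.
    """
    incident = [e for e in edges
                if e.kind == "fact" and not e.quarantined
                and node in (e.subject, e.obj)]
    current = [e for e in incident if e.turn == t]
    history = [e for e in incident if e.turn < t and e not in revision_targets]
    return [(c, h) for c in current for h in history]


old = Edge("Brand", "has slogan", "Morning Spark: Fueling your day", turn=1)
new = Edge("Brand", "has slogan", "Morning Spark: Relax and unwind", turn=2)
graph = [old, new]

unfiltered = comparison_pairs(graph, "Brand", 2, revision_targets=set())
filtered = comparison_pairs(graph, "Brand", 2, revision_targets={old})
```

Without the revision set, the old and new slogans form one comparison pair; with the old edge in $\mathcal{R}_t$, no pair reaches the detectors, while the graph still contains both edges.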

The contradiction engine therefore evaluates inconsistencies through progressively weaker regimes: \(i\) explicit polarity contradictions, \(ii\) directional semantic contradictions, \(iii\) structural exclusivity conflicts, and \(iv\) residual semantic divergence\. This hierarchy ensures that high\-confidence symbolic conflicts are resolved before softer embedding\-geometric inconsistencies are considered\.

#### The detector hierarchy\.

For each pair $(e_c,e_h)$ on a candidate node $u$, the engine computes relation similarity

$$\mathrm{rs}=\cos\big(\phi(\rho(e_c)),\,\phi(\rho(e_h))\big)$$

and object similarity

$$\mathrm{os}=\cos\big(\phi(\ell(o_c)),\,\phi(\ell(o_h))\big),$$

where $o_c$ and $o_h$ are the object endpoints. The pair is then processed by a prioritized cascade of typed contradiction detectors summarized in Table [2](https://arxiv.org/html/2605.16650#S4.T2).

The ordering of the cascade reflects contradiction reliability. High-precision symbolic conflicts, such as explicit negation and antonymic reversal, are evaluated first. Abstaining guards (IntentGate, ElabGuard, NoiseFloor, and ExclusivityGuard) suppress categorically incomparable edge pairs and therefore prevent specific classes of false positives. Finally, numeric mismatch, exclusive-object conflict (EOC), and semantic drift are evaluated as progressively softer forms of incompatibility.

The cascade exits at the first detector that fires, ensuring that strong symbolic contradictions dominate weaker embedding-geometric inconsistencies. The field $\rho_{\text{type}}$ denotes the relation-type label assigned by the extractor (cf. Section [4.1](https://arxiv.org/html/2605.16650#S4.SS1)); $\mathcal{A}_{\text{ant}}$ is a curated antonym dictionary (increases/decreases, causes/prevents, always/never, etc.); and $\mathcal{M}_{\neg}$ is a fixed set of negation markers. Same-type groups (Person, Number, $\ldots$) are defined according to the entity taxonomy in Definition [1](https://arxiv.org/html/2605.16650#Thmdefinition1).

#### Revision\-aware guard\.

The revision filtering described above is applied prior to the detector cascade and acts as a higher-priority guard. Consequently, even when a pair $(e_c,e_h)$ satisfies all geometric and type constraints, it is excluded from contradiction evaluation if $e_h\in\mathcal{R}_t$. This ensures that the engine distinguishes between model inconsistency and user-directed state updates.

#### Unified contradiction operator\.

We define contradiction confidence as:

$$c(e_c,e_h)=\max_{k\in\mathcal{D}}c_k(e_c,e_h), \tag{13}$$

where $\mathcal{D}$ denotes the set of geometric detectors.

Table 2: Geometric contradiction detector cascade. Detectors are evaluated top-to-bottom, and the first firing detector assigns the contradiction confidence used in $S^{\text{log}}_t$. Italicized rows denote abstaining guards that suppress categorically incomparable comparisons. Implementation thresholds are reported in Appendix [B](https://arxiv.org/html/2605.16650#A2). Revision-filtered edges (cf. $\widetilde{\mathcal{E}}^{\text{hist}}(u)$) are excluded prior to contradiction evaluation.

| Detector | Firing condition | Confidence $c$ |
| --- | --- | --- |
| NegFlip | exactly one of $\rho(e_c),\rho(e_h)\in\mathcal{M}_{\neg}$; $\mathrm{os}>\theta^{\text{neg}}_{\text{obj}}$; both $\rho_{\text{type}}\notin\{\textsc{elab},\textsc{sol},\textsc{diag}\}$ | $0.95$ |
| Antonym | $(\rho(e_c),\rho(e_h))$ opposite in $\mathcal{A}_{\text{ant}}$, $\mathrm{os}>\theta^{\text{obj}}_{\min}$ | $0.88$ |
| *IntentGate* | $\mu(e_c)\neq\mu(e_h)$ | abstain |
| *ElabGuard* | $\rho_{\text{type}}(e_c)\in\{\textsc{elab},\textsc{sol},\textsc{diag}\}$ | abstain |
| *NoiseFloor* | $\mathrm{rs}<\theta^{\text{rel}}_{\min}$ or $\mathrm{os}<\theta^{\text{obj}}_{\min}$ | abstain |
| NumMismatch | $\mathrm{rs}>0.70$ and extracted numerals satisfy $n_c\neq n_h$ | $0.92$ |
| EOC | $\mathrm{rs}>0.85$, $\mathrm{os}<\theta^{\text{obj}}_{\text{div}}$, both predicates satisfy $\pi=\textsc{Excl}$ | $1-\mathrm{os}$ |
| Same-Type EOC | $\mathrm{rs}>0.85$, same-type entity group, $\mathrm{os}<\theta^{\text{ST}}$, both predicates satisfy $\pi=\textsc{Excl}$ | $\max(0.60,\,1-\mathrm{os})$ |
| Residual Semantic Drift | $\mathrm{rs}>0.60$, $\theta^{\text{obj}}_{\min}<\mathrm{os}<0.75$ | $1-\mathrm{os}$ if $\mathrm{os}<0.40$; $0.45$ if $0.40\leq\mathrm{os}<0.75$ |

Figure 3: Geometric contradiction detector cascade with revision-aware filtering. Each pair $(e_c,e_h)$ is drawn from the comparison set $\mathcal{E}^{\text{cur}}(u)\times\widetilde{\mathcal{E}}^{\text{hist}}(u)$, where $\widetilde{\mathcal{E}}^{\text{hist}}(u)$ excludes revision-targeted edges $\mathcal{R}_t$. The RevisionFilter acts as a pre-cascade guard, removing user-intended updates before contradiction evaluation. The remaining pairs are processed top-to-bottom; the first detector that fires assigns confidence $c$. Italic gray nodes denote abstaining guards that suppress incomparable pairs. The final score is $S^{\text{log}}_t=1-\max_{u\in\mathcal{C}_t}\max_{(e_c,e_h)}c$. (Flowchart order: compute $\mathrm{rs},\mathrm{os}$; RevisionFilter ($e_h\notin\mathcal{R}_t$); NegFlip ($\mathrm{os}>0.40$); Antonym; IntentGate/ElabGuard/NoiseFloor; NumMismatch; EOC (standard/same-type); SemDrift (strong/moderate); no fire $\Rightarrow c=0$.)

#### Score aggregation\.

For each candidate node $u$, let $c^{*}(u)=\max_{(e_c,e_h)}c(e_c,e_h)$ be the maximum confidence over all detector firings ($0$ if none). The logical coherence score is

$$S^{\text{log}}_t\;=\;1\;-\;\max_{u\in\mathcal{C}_t}\,c^{*}(u), \tag{14}$$

with $S^{\text{log}}_t=1$ when no detector fires.
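To make the first-fire semantics of Table 2 and the aggregation in (14) concrete, here is a simplified sketch. It assumes every pair has already been reduced to a feature record (similarities plus boolean flags); the `PairFeatures` fields and function names are hypothetical. Only the $0.40$ NegFlip threshold and the numeric cut-offs $0.60$, $0.70$, $0.75$, $0.85$ appear in the table and figure; the remaining thresholds below are placeholders, not the Appendix B values.

```python
from dataclasses import dataclass
from typing import Optional

THETA_OBJ_NEG = 0.40  # NegFlip object threshold shown in Figure 3
THETA_OBJ_MIN = 0.30  # placeholder (hypothetical)
THETA_REL_MIN = 0.30  # placeholder (hypothetical)
THETA_OBJ_DIV = 0.50  # placeholder (hypothetical)
THETA_ST = 0.60       # placeholder (hypothetical)


@dataclass
class PairFeatures:
    rs: float                      # relation similarity
    os: float                      # object similarity
    neg_flip: bool = False         # exactly one relation carries a negation marker
    antonym: bool = False          # relations are opposite in the antonym dictionary
    same_intent: bool = True       # mu(e_c) == mu(e_h)
    elab_c: bool = False           # rho_type(e_c) in {elab, sol, diag}
    elab_h: bool = False           # rho_type(e_h) in {elab, sol, diag}
    num_c: Optional[float] = None  # numeral extracted from the current edge
    num_h: Optional[float] = None  # numeral extracted from the historical edge
    exclusive: bool = False        # both predicates are exclusive
    same_type: bool = False        # objects belong to the same entity-type group


def cascade(p: PairFeatures) -> float:
    """Return the confidence of the first detector that fires (0.0 if none)."""
    if p.neg_flip and p.os > THETA_OBJ_NEG and not (p.elab_c or p.elab_h):
        return 0.95                                   # NegFlip
    if p.antonym and p.os > THETA_OBJ_MIN:
        return 0.88                                   # Antonym
    if not p.same_intent or p.elab_c:
        return 0.0                                    # IntentGate / ElabGuard abstain
    if p.rs < THETA_REL_MIN or p.os < THETA_OBJ_MIN:
        return 0.0                                    # NoiseFloor abstains
    if (p.rs > 0.70 and p.num_c is not None and p.num_h is not None
            and p.num_c != p.num_h):
        return 0.92                                   # NumMismatch
    if p.rs > 0.85 and p.exclusive:
        if p.os < THETA_OBJ_DIV:
            return 1.0 - p.os                         # EOC
        if p.same_type and p.os < THETA_ST:
            return max(0.60, 1.0 - p.os)              # Same-Type EOC
    if p.rs > 0.60 and THETA_OBJ_MIN < p.os < 0.75:
        return 1.0 - p.os if p.os < 0.40 else 0.45    # Residual Semantic Drift
    return 0.0


def logical_coherence(pairs_by_node) -> float:
    """S_log = 1 - max over candidate nodes and pairs of the cascade confidence."""
    confidences = [cascade(p) for pairs in pairs_by_node.values() for p in pairs]
    return 1.0 - max(confidences, default=0.0)


# "boils at 100 C" (history) versus "boils at 90 C" (current): a numeric mismatch.
boiling = PairFeatures(rs=0.9, os=0.8, num_c=90.0, num_h=100.0)
score = logical_coherence({"water": [boiling]})
```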

#### Why geometric rather than NLI\.

String\-level NLI cross\-encoders process linearized premise–hypothesis pairs and exhibit three failure modes that occur frequently in long dialogue: \(i\) numeric substitution within otherwise identical claims, \(ii\) antonymic or paraphrased contradiction across surface\-different statements, and \(iii\) contradictions whose conflicting evidence is buried in long conversational prefixes\.

SKG\-Eval addresses these limitations by operating directly on structured semantic state rather than serialized text\. Contradiction detection is performed over typed graph edges with explicit symbolic reasoning, embedding\-geometric compatibility analysis, and revision\-aware filtering that suppresses user\-intended updates prior to contradiction evaluation\.

###### Proposition 2 (Conditions favoring geometric contradiction reasoning).

Let $f_{\textsc{NLI}}$ denote a string-level NLI cross-encoder with bounded effective context $L_{\max}$, and let $f_{\textsc{GEO}}$ denote the proposed geometric contradiction engine. The geometric engine is expected to exhibit higher contradiction-recall than $f_{\textsc{NLI}}$ under the following conditions:

- (i) Numeric substitution. Contradictory claims differ primarily in symbolic values embedded within otherwise similar contexts (e.g. "boils at 100 °C" vs. "boils at 90 °C").
- (ii) Long-prefix contradiction. The contradictory historical claim lies outside the effective context window of the NLI encoder.
- (iii) Antonymic paraphrase. The contradiction is expressed through semantically opposing relations under paraphrased surface forms (e.g. "increases" vs. "reduces").

###### Proof sketch\.

Numeric substitutions are difficult for text\-level semantic models because shared contextual tokens dominate the representation, whereas the proposed engine isolates and compares symbolic values directly\. Long\-prefix contradictions may be truncated or attenuated in string\-level NLI, while SKG\-Eval retrieves historical claims through graph\-indexed semantic anchors independent of dialogue length\. Antonymic paraphrases are handled explicitly through relation\-level opposition rather than implicit encoder generalization\. Revision\-aware filtering further distinguishes user\-directed state updates from genuine model inconsistency\. ∎

The proposition does not claim universal dominance over NLI systems\. Rather, it identifies the specific contradiction regimes that dominate long\-form conversational inconsistency and for which structured geometric reasoning is particularly effective\.

###### Proposition 3 (Determinism and complexity of the engine).

Given a fixed embedding model $\phi$, fixed extractor outputs $\mathcal{T}_{1:t}$, and deterministic revision filtering, the score $S^{\text{log}}_t$ is deterministic. The time complexity at turn $t$ is

$$\mathcal{O}\big(|\mathcal{C}_t|\cdot\bar{C}\bar{H}\big),$$

where $\bar{C}=\mathbb{E}[|\mathcal{E}^{\text{cur}}(u)|]$ and $\bar{H}=\mathbb{E}[|\widetilde{\mathcal{E}}^{\text{hist}}(u)|]$. Since $\bar{C}$ and $\bar{H}$ remain small in practice, the effective complexity is near-linear in the number of turns.

The complexity follows from bounded per\-node edge comparisons\. Embeddings are computed once per edge, and each detector executes in constant time\. The resulting computation is dominated by embedding evaluation and is naturally parallelizable across candidate nodes\.

#### Quarantine\.

When $Q_t<\theta_{\text{quar}}=0.40$, all nodes and edges introduced at turn $t$ are marked $q(\cdot)=1$ and excluded from subsequent contradiction checks and consistency scoring. Quarantine prevents low-quality content from propagating through the graph state and serves as the framework's analogue to a hypothesis-rejection mechanism. Edges suppressed via revision filtering (i.e., those in $\mathcal{R}_t$) are not considered erroneous and therefore do not trigger quarantine.

### 4.6 Regime-Adaptive Fusion

The three scores are fused by a regime-adaptive convex combination,

$$\bar{Q}_t\;=\;w^{\text{loc}}_t\cdot S^{\text{loc}}_t\;+\;w^{\text{cons}}_t\cdot S^{\text{cons}}_t\;+\;w^{\text{log}}_t\cdot S^{\text{log}}_t,\quad w^{\text{loc}}_t+w^{\text{cons}}_t+w^{\text{log}}_t=1, \tag{15}$$

where the weights depend on a turn-level regime selector. Define the regime indicator

$$g_t\;=\;\begin{cases}\textsc{Short} & |r_t|<W_{\text{short}},\\ \textsc{QA} & |r_t|\geq W_{\text{short}}\,\wedge\,|q_t|<W_{\text{qa}},\\ \textsc{General} & \text{otherwise.}\end{cases}$$

The weight profile $\theta_t=(w^{\text{loc}}_t,\,w^{\text{cons}}_t,\,w^{\text{log}}_t)$ is selected by lookup, $\theta_t=\Theta[g_t]$, with the three profiles

$$\Theta[\textsc{Short}]=(0.50,\,0.10,\,0.40),\qquad\Theta[\textsc{QA}]=(0.65,\,0.05,\,0.30),\qquad\Theta[\textsc{General}]=(0.50,\,0.20,\,0.30).$$

Short responses (e.g. "Yes.", "42.") legitimately produce low $S^{\text{cons}}_t$ because they introduce few new nodes; the profile down-weights consistency. QA sessions are encyclopedic in nature; consistency is down-weighted further while local relevance is up-weighted, since each turn is a largely self-contained sub-question. General dialogue is the default mixed regime.
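The regime lookup and the convex combination in (15) can be sketched as follows. The profile values are those given above, in the order (local, consistency, logic); the word-count thresholds `w_short` and `w_qa` are hypothetical defaults, since this section does not state the values of $W_{\text{short}}$ and $W_{\text{qa}}$.

```python
# Weight profiles (w_loc, w_cons, w_log) per regime.
PROFILES = {
    "SHORT": (0.50, 0.10, 0.40),
    "QA": (0.65, 0.05, 0.30),
    "GENERAL": (0.50, 0.20, 0.30),
}


def regime(resp_words, prompt_words, w_short=15, w_qa=12):
    """Select the turn regime from word counts (threshold defaults are assumptions)."""
    if resp_words < w_short:
        return "SHORT"
    if prompt_words < w_qa:
        return "QA"
    return "GENERAL"


def fuse(s_loc, s_cons, s_log, g):
    """Convex combination of the three signals under regime g."""
    w_loc, w_cons, w_log = PROFILES[g]
    return w_loc * s_loc + w_cons * s_cons + w_log * s_log


g = regime(resp_words=80, prompt_words=40)
q_bar = fuse(0.8, 0.5, 1.0, g)
```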

The guard cascade described next ensures that strong logical failures dominate scoring, while preventing high logical coherence from masking relevance or consistency failures.

#### Guard cascades\.

Three monotone guards refine the convex combination:

1. Hard logic gate. If $S^{\text{log}}_t<\theta^{\text{log}}_{\text{hard}}=0.60$, set $\bar{Q}_t\leftarrow\min(\bar{Q}_t,\,0.40)$. A confirmed contradiction firmly fails the turn regardless of high local relevance. (Revision-suppressed comparisons do not contribute to $S^{\text{log}}_t$ and therefore do not trigger this gate.)
2. Joint weakness penalty. If $S^{\text{loc}}_t<0.50$ *and* $S^{\text{cons}}_t<0.45$ *and* $g_t\neq\textsc{Both-Short}$, set $\bar{Q}_t\leftarrow\mu_{\text{joint}}\,\bar{Q}_t$ with $\mu_{\text{joint}}=0.75$. This prevents $S^{\text{log}}_t=1$ from compensating for responses that are simultaneously off-topic and disconnected.
3. Non-sequitur softening. If $S^{\text{cons}}_t<0.45$ *and* $S^{\text{loc}}_t<0.20$ *and* $g_t\neq\textsc{Both-Short}$, set $\bar{Q}_t\leftarrow 0.5\,\bar{Q}_t$.

The final per-turn quality is $Q_t=\mathrm{clip}(\bar{Q}_t,\,0,\,1)$. All guards are monotone non-increasing in $\bar{Q}_t$; they cannot inflate the score, only reduce it under structural failure conditions.
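A compact sketch of the three guards and the final clipping is given below. It assumes the guards are applied in the listed order and compound when several conditions hold, which is one plausible reading of the text. The Both-Short condition is not defined in this section, so it is passed in as a boolean flag; `apply_guards` is an illustrative name.

```python
def apply_guards(q_bar, s_loc, s_cons, s_log, both_short=False):
    """Apply the guard cascade to the fused score and clip to [0, 1].

    Assumption: guards run sequentially and their penalties compound.
    """
    # 1. Hard logic gate: a confirmed contradiction caps the score.
    if s_log < 0.60:
        q_bar = min(q_bar, 0.40)
    # 2. Joint weakness penalty: off-topic and disconnected at once.
    if s_loc < 0.50 and s_cons < 0.45 and not both_short:
        q_bar *= 0.75
    # 3. Non-sequitur rule: very low relevance plus low consistency.
    if s_cons < 0.45 and s_loc < 0.20 and not both_short:
        q_bar *= 0.5
    return min(max(q_bar, 0.0), 1.0)


contradiction = apply_guards(0.90, s_loc=0.9, s_cons=0.9, s_log=0.05)
joint_weak = apply_guards(0.60, s_loc=0.4, s_cons=0.3, s_log=1.0)
non_sequitur = apply_guards(0.60, s_loc=0.1, s_cons=0.3, s_log=1.0)
```

Every branch can only lower the score, which matches the monotonicity property stated above.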

#### Reporting thresholds\.

For all empirical analyses we treat $Q_t\geq\theta_{\text{pass}}=0.60$ as a passing turn and $S^{\text{loc}}_t<0.50$ as a hard relevance failure (the turn is reported as failed regardless of $Q_t$, complementing the hard-logic gate above). These thresholds are used solely for reporting and evaluation purposes and are not part of the scoring functional that produces $Q_t$.

Revision-suppressed comparisons do not affect these thresholds, as they are excluded prior to the computation of $S^{\text{log}}_t$ and therefore do not influence the pass/fail decision.

###### Proposition 4 (Threshold invariance of ranking).

Let $Q_t\in[0,1]$ denote the continuous turn-level score produced by the SKG-Eval scoring functional, and let $\theta_{\text{pass}}$ be a reporting threshold used only to assign binary pass/fail labels. For any two turns $i$ and $j$, if $Q_i>Q_j$, then their relative ranking remains unchanged for any choice of $\theta_{\text{pass}}$.

###### Proof.

The threshold $\theta_{\text{pass}}$ is applied only after $Q_t$ has been computed and maps each score to a reporting label $\mathbf{1}[Q_t\geq\theta_{\text{pass}}]$. Since the thresholding operation does not modify the underlying continuous scores $Q_i$ and $Q_j$, the ordering induced by $Q_i>Q_j$ is invariant to the choice of $\theta_{\text{pass}}$. ∎

### 4.7 Session-Level Aggregation: Recency-Weighted Trend

A session\-level summary that simply averages per\-turn quality is biased against improving sessions and overly lenient toward degrading ones\. We therefore aggregate via a recency\-weighted regression with a length\-adaptive trend coefficient\.

###### Definition 5 (Recency weights).

For a session of length $T$ and decay rate $\gamma>0$, define

$$w_i\;=\;\frac{e^{\gamma\,(i-1)}}{\sum_{j=1}^{T}e^{\gamma\,(j-1)}},\qquad i=1,\ldots,T. \tag{16}$$

###### Definition 6 (Session aggregator).

Let $\hat{Q}=(Q_1,\ldots,Q_T)$ denote the sequence of turn-level scores and $\boldsymbol{w}=(w_1,\ldots,w_T)$ the corresponding recency weights. Compute

$$\bar{Q}^{\text{rec}}\;=\;\sum_{i=1}^{T}w_i\,Q_i, \tag{17}$$

$$(\hat{\beta},\,\hat{\alpha})\;=\;\arg\min_{\beta,\alpha}\,\sum_{i=1}^{T}w_i\,\big(Q_i-\alpha-\beta\,(i-1)\big)^{2}, \tag{18}$$

$$\lambda_{\text{eff}}\;=\;\lambda_{\text{base}}\,\frac{T}{T_{\text{ref}}}, \tag{19}$$

$$\mathcal{S}(\mathcal{D})\;=\;\mathrm{clip}\big(\,\bar{Q}^{\text{rec}}+\lambda_{\text{eff}}\,\hat{\beta},\;0,\,1\,\big). \tag{20}$$

We use $\gamma=0.1$, $\lambda_{\text{base}}=5.0$, and $T_{\text{ref}}=20$.
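The aggregator in (16) to (20) can be sketched in pure Python using the closed-form weighted least-squares slope, $\hat{\beta}=\sum_i w_i(x_i-\bar{x}_w)(Q_i-\bar{Q}^{\text{rec}})\big/\sum_i w_i(x_i-\bar{x}_w)^2$ with $x_i=i-1$. The function name `session_score` is illustrative, the input is assumed non-empty, and setting the slope to zero for a single-turn session is an assumption of this sketch, since the regression is degenerate there.

```python
import math


def session_score(q, gamma=0.1, lam_base=5.0, t_ref=20):
    """Recency-weighted level plus length-scaled WLS slope, clipped to [0, 1]."""
    T = len(q)
    raw = [math.exp(gamma * i) for i in range(T)]   # e^{gamma (i-1)}, 0-based i
    z = sum(raw)
    w = [r / z for r in raw]                        # recency weights, sum to 1

    q_rec = sum(wi * qi for wi, qi in zip(w, q))    # weighted level
    x_bar = sum(wi * i for i, wi in enumerate(w))   # weighted mean turn index
    var_x = sum(wi * (i - x_bar) ** 2 for i, wi in enumerate(w))
    cov = sum(wi * (i - x_bar) * (qi - q_rec)
              for i, (wi, qi) in enumerate(zip(w, q)))
    beta = cov / var_x if var_x > 0 else 0.0        # WLS slope (0 if T == 1)

    lam_eff = lam_base * T / t_ref                  # length-adaptive coefficient
    return min(1.0, max(0.0, q_rec + lam_eff * beta))


improving = session_score([0.20, 0.25, 0.30, 0.35])
degrading = session_score([0.35, 0.30, 0.25, 0.20])
flat = session_score([0.7] * 5)
```

The two four-turn sessions have the same plain average, yet the improving one scores higher because both the recency-weighted level and the slope term favour it.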

#### Why two layers\.

Equation ([17](https://arxiv.org/html/2605.16650#S4.E17)) captures the *level* (current operating quality), giving more weight to recent turns. Equation ([18](https://arxiv.org/html/2605.16650#S4.E18)) captures the *slope* (whether the conversation is improving or degrading). Their combination via ([20](https://arxiv.org/html/2605.16650#S4.E20)) integrates both instantaneous quality and temporal trend, with the contribution of the trend term scaled by $\lambda_{\text{eff}}$.

#### Why adaptive $\lambda$.

A fixed $\lambda$ would cause a slope of fixed magnitude to contribute the same adjustment irrespective of session length $T$. However, a slope $\beta$ sustained over $T$ turns induces a cumulative level change proportional to $\beta T$. Accordingly, a length-aware coefficient $\lambda_{\text{eff}}\propto T$ ensures that the contribution of the trend term is properly normalized across sessions of different lengths. Calibrating to $T_{\text{ref}}=20$ preserves consistency with the canonical choice $\lambda=5$ on reference-length sessions.

###### Proposition 5 (Shift invariance and slope unbiasedness).

The aggregator $\mathcal{S}$ in Definition [6](https://arxiv.org/html/2605.16650#Thmdefinition6) satisfies: (i) *Shift invariance:* for any constant $\Delta$ such that $Q_i+\Delta\in[0,1]$ for all $i$, we have $\mathcal{S}(\hat{Q}+\Delta)=\mathcal{S}(\hat{Q})+\Delta$ prior to clipping. (ii) *Slope unbiasedness:* the estimator $\hat{\beta}$ obtained in ([18](https://arxiv.org/html/2605.16650#S4.E18)) is the weighted least-squares (WLS) slope, which is unbiased under the model $Q_i=\alpha+\beta(i-1)+\varepsilon_i$, where $\mathbb{E}[\varepsilon_i]=0$ and $\mathrm{Var}(\varepsilon_i)<\infty$.

Property (i) follows because both $\bar{Q}^{\text{rec}}$ and the fitted intercept $\hat{\alpha}$ shift by $\Delta$, while the slope $\hat{\beta}$ remains unchanged. Property (ii) is the standard unbiasedness result for weighted least-squares estimators under fixed weights.

### 4.8 Putting It All Together: Complexity and Determinism

The full per-turn pipeline at turn $t$ is given in Algorithm [2](https://arxiv.org/html/2605.16650#alg2). The per-turn cost is dominated by: (a) one extractor LLM call on $r_t$, (b) $\mathcal{O}(|\mathcal{T}_t|)$ graph updates, (c) $\mathcal{O}(|\mathcal{N}_t|\cdot|V_{t-1}|)$ embedding cosine computations for semantic linking and consistency scoring, and (d) $\mathcal{O}(|\mathcal{C}_t|\cdot\bar{C}\bar{H})$ detector evaluations within the geometric engine.

Given a fixed extractor output $\mathcal{T}_{1:t}$, fixed embedding model $\phi$, and deterministic revision-filtering mechanism, all components except (a) are deterministic. In practice, $\bar{C}$ and $\bar{H}$ remain small due to bounded per-node edge counts, making the effective complexity near-linear in the number of turns.

The session-level aggregation is performed once per session in $\mathcal{O}(T)$ time. This makes SKG-Eval scalable to long conversations where repeated LLM-based judging over growing dialogue prefixes becomes increasingly expensive.

Algorithm 2: ScoreTurn, per-turn evaluation in SKG-Eval

1. Input: prompt $q_t$, response $r_t$, optional reference $r^{*}_t$, prior graph $G_{t-1}$, embedding model $\phi$
2. Classify response/prompt regime $g_t$; select weights $\theta_t = \Theta[g_t]$
3. Compute $S^{\text{loc}}_t$ via Definition [2](https://arxiv.org/html/2605.16650#Thmdefinition2) (with context glue if $g_t = \textsc{Short}$)
4. $\mathcal{T}_t \leftarrow \textsc{Extract}(\text{turn text})$
5. $(G_t, \mathcal{N}_t) \leftarrow \textsc{UpdateGraph}(G_{t-1}, \mathcal{T}_t, \phi, \theta_{\text{dedup}}, \theta_{\text{sem}})$
6. Compute $S^{\text{graph}}_t$ via ([7](https://arxiv.org/html/2605.16650#S4.E7)); compute $S^{\text{anc}}_t$ via ([8](https://arxiv.org/html/2605.16650#S4.E8))
7. $S^{\text{cons}}_t \leftarrow \max(S^{\text{graph}}_t, S^{\text{anc}}_t)$
8. $b_t \leftarrow \textsc{IsRevisionPrompt}(q_t)$
9. $\mathcal{R}_t \leftarrow \textsc{ExtractRevisionTargets}(q_t, G_{t-1})$ if $b_t = 1$, else $\emptyset$
10. $S^{\text{log}}_t \leftarrow \textsc{Geometric-Engine}(G_t, \mathcal{N}_t, \mathcal{T}_t, \phi, \mathcal{R}_t)$ via ([14](https://arxiv.org/html/2605.16650#S4.E14))
11. $\bar{Q}_t \leftarrow \theta_t^{\top}(S^{\text{loc}}_t, S^{\text{cons}}_t, S^{\text{log}}_t)$
12. Apply hard-logic gate, joint-weakness penalty, non-sequitur softening
13. $Q_t \leftarrow \mathrm{clip}(\bar{Q}_t, 0, 1)$; if $Q_t < \theta_{\text{quar}}$, quarantine turn-$t$ nodes/edges in $G_t$
14. return $Q_t,\, S^{\text{loc}}_t,\, S^{\text{cons}}_t,\, S^{\text{log}}_t,\, G_t$
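The fusion steps of Algorithm 2 (steps 7, 11, and 13) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the weights (0.5, 0.2, 0.3) are read off the Turn 4 computation in the worked example of Section 4.9, and the step-12 gates are omitted because their thresholds are not given in this section.

```python
# Weights (local, consistency, logic) for the General profile, read off the
# Turn 4 computation in the worked example; other regimes are not modeled here.
GENERAL_WEIGHTS = (0.5, 0.2, 0.3)


def clip(value, lo=0.0, hi=1.0):
    """Clamp value into [lo, hi]."""
    return max(lo, min(hi, value))


def fuse_turn(s_loc, s_graph, s_anc, s_log, weights=GENERAL_WEIGHTS):
    """Steps 7, 11 and 13 of Algorithm 2, without the step-12 gates."""
    s_cons = max(s_graph, s_anc)                              # step 7
    w_loc, w_cons, w_log = weights
    q_bar = w_loc * s_loc + w_cons * s_cons + w_log * s_log   # step 11
    return clip(q_bar), s_cons                                # step 13 (clip only)


# Turn 4 of the worked example: S_loc = 0.80, S_cons = 1, S_log = 0.05.
# s_anc is a placeholder here; the max in step 7 already yields 1.
q4, _ = fuse_turn(s_loc=0.80, s_graph=1.0, s_anc=0.0, s_log=0.05)
print(round(q4, 3))  # 0.615, the pre-gate score before the hard-logic gate
```

Keeping the fusion a pure function of the three component scores is what makes the turn score reproducible once the extractor output is fixed.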

#### Determinism and reproducibility\.

Modulo the \(off\-line, low\-temperature\) extractor, every score in SKG\-Eval is a deterministic function of the inputs, the embedding model, and a small set of fixed thresholds, including the revision\-filtering mechanism\. This stands in contrast to LLM\-as\-judge protocols, whose outputs may vary across decoding seeds and prompt orderings, and whose run\-to\-run variance can be comparable to the inter\-method differences they aim to measure\.

#### Interpretability\.

For every low score, SKG-Eval surfaces the exact structural cause: the disconnected nodes that reduce $S^{\text{cons}}_t$, the (current edge, historical edge) pair that triggers a contradiction along with the detector type and confidence, and the regime that determines the weighting. Each contradiction certificate is a tuple $(u,\, e_c,\, e_h,\, \text{detector},\, c)$, which can be presented to a human auditor or used in dataset construction for negative example mining. We argue that such an explicit audit trail is a necessary condition for evaluator trust at the session level, a property that judge-LLM protocols typically cannot provide without additional external mechanisms.
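To make the certificate tuple $(u, e_c, e_h, \text{detector}, c)$ concrete, the sketch below represents it as an immutable record that serializes deterministically. The class and field names are hypothetical choices for illustration; the paper specifies only the five tuple components. The example values come from the four-turn walkthrough in Section 4.9.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    turn: int  # turn at which the edge entered the graph (assumed field)


@dataclass(frozen=True)
class ContradictionCertificate:
    anchor: str            # u: shared semantic anchor
    current_edge: Edge     # e_c
    historical_edge: Edge  # e_h
    detector: str          # name of the detector that fired
    confidence: float      # c

    def to_json(self) -> str:
        # sort_keys gives a stable serialization for audit logs
        return json.dumps(asdict(self), sort_keys=True)


cert = ContradictionCertificate(
    anchor="skipping breakfast",
    current_edge=Edge("skipping breakfast", "has no effect", "metabolic rate", turn=4),
    historical_edge=Edge("skipping breakfast", "slows", "metabolism", turn=1),
    detector="NegFlip",
    confidence=0.95,
)
print(cert.to_json())
```

A frozen record with sorted-key serialization means the same inputs always produce byte-identical certificates, which fits the auditability goal described above.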

Figure 4: Turn-wise growth of the incremental Semantic Knowledge Graph. Panels: Turn 1 (“skipping breakfast slows metabolism”), Turn 2 (“enters conservation mode”), Turn 3 (“carbohydrates are preferred fuel”), and Turn 4, Contradiction (“skipping breakfast has no effect”, with a semantic link labeled sem (0.88)). Orange nodes and edges denote information introduced at the current turn, black edges denote prior factual commitments, and dashed blue edges denote semantic links induced by embedding similarity.

### 4\.9Worked Example: A Four\-Turn Walkthrough

We illustrate the incremental construction and reasoning behavior of SKG-Eval using a four-turn dialogue excerpt. Figure [4](https://arxiv.org/html/2605.16650#S4.F4) shows how the Semantic Knowledge Graph evolves across turns through factual expansion, semantic anchoring, topic drift, and contradiction detection. The example demonstrates how SKG-Eval maintains persistent semantic state and performs structured cross-turn reasoning over accumulated conversational commitments.

#### Conversation\.

The dialogue proceeds as follows:

- T1. $q_1$: *“Does skipping breakfast affect metabolism?”* $r_1$: *“Yes, skipping breakfast slows metabolism.”*
- T2. $q_2$: *“What happens if I just count calories?”* $r_2$: *“Skipping breakfast can trigger conservation mode in the body.”*
- T3. $q_3$: *“What about carbohydrates?”* $r_3$: *“Carbohydrates are the body’s preferred fuel source.”*
- T4. $q_4$: *“So skipping breakfast is fine?”* $r_4$: *“Skipping breakfast has no real effect on metabolism.”*

#### Turn 1 \(baseline establishment\)\.

Triple extraction yields the edge $\langle \textit{skipping breakfast}, \textit{slows}, \textit{metabolism} \rangle$ with high importance $\iota \approx 0.95$. Two new nodes enter $G_1$, establishing the initial semantic state for the session. The Semantic Triangle produces strong alignment with the prompt, yielding $S^{\text{loc}}_1 \approx 0.88$. By definition, $S^{\text{cons}}_1 = 1$ and $S^{\text{log}}_1 = 1$. Under the General profile, the final turn score is high ($Q_1 \approx 0.94$). The node *metabolism* subsequently becomes a semantic anchor for future contradiction and consistency analysis.

#### Turn 2 \(semantic expansion with mild drift\)\.

Extraction yields $\langle \textit{skipping breakfast}, \textit{triggers}, \textit{conservation mode} \rangle$ with importance $\iota \approx 0.88$. The subject node is deduplicated against the existing semantic state, while *conservation mode* is introduced as a new node connected through a factual edge.

The geometric engine compares this new edge against the historical edge $\langle \textit{skipping breakfast}, \textit{slows}, \textit{metabolism} \rangle$. The new edge remains semantically related to the metabolism subgraph through the reused subject node, but relation and object similarities remain below contradiction-triggering regimes. Consequently, no detector fires and the pair is treated as semantically adjacent but logically compatible.

Historical consistency remains strong due to reuse of the existing semantic anchor, yielding $S^{\text{cons}}_2 \approx 0.92$. Local relevance also remains high ($S^{\text{loc}}_2 \approx 0.84$). The final score therefore remains above the pass threshold, reflecting a coherent but slightly drifting continuation of the discussion.

#### Turn 3 \(parallel topic expansion with weak semantic divergence\)\.

Extraction yields $\langle \textit{carbohydrates}, \textit{preferred}, \textit{fuel source} \rangle$ with importance $\iota \approx 0.90$. Both nodes are newly introduced, forming a parallel nutritional subgraph structurally disconnected from the earlier metabolism-focused discussion.

The geometric engine evaluates the new edge against prior historical edges. Although moderate semantic proximity exists between the new and historical claims, the resulting semantic-drift confidence remains low. The contradiction cascade therefore assigns only a mild penalty, yielding $S^{\text{log}}_3 = 0.55$.

Consistency decreases because the new subgraph lacks direct factual anchoring to the prior semantic state. However, the session-anchor mechanism partially recovers the score through overall thematic similarity with the nutritional discussion, yielding $S^{\text{cons}}_3 \approx 0.65$.

Local relevance remains high ($S^{\text{loc}}_3 \approx 0.82$), and the final score remains above threshold, reflecting a response that is locally valid but structurally disconnected from the main conversational trajectory.

#### Turn 4 \(cross\-turn contradiction via semantic anchoring\)\.

Extraction yields $\langle \textit{skipping breakfast}, \textit{has no effect}, \textit{metabolic rate} \rangle$ with high importance $\iota \approx 0.92$. The subject node is matched to the existing node *skipping breakfast*, while the object *metabolic rate* is semantically aligned with the historical node *metabolism*, creating a valid contradiction-comparison pair.

The geometric engine retrieves the historical edge $\langle \textit{skipping breakfast}, \textit{slows}, \textit{metabolism} \rangle$ and evaluates the contradiction cascade.

- NegFlip. The current relation *has no effect* contains a negation marker, whereas the historical relation *slows* does not. Since the objects remain strongly aligned and neither relation belongs to the elaboration/solution/diagnosis suppression set, the NegFlip detector fires with confidence $c = 0.95$.
- Same-Type Exclusive Conflict (secondary confirmation). Even without explicit negation detection, the pair forms a structurally incompatible claim assignment under the same semantic anchor. The same-type exclusivity detector therefore also identifies the pair as contradictory, albeit with lower confidence.
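The NegFlip condition described above (a negation-marker mismatch between two relations whose objects are strongly aligned, outside a suppression set) can be sketched as follows. Everything concrete here is an assumption made for illustration: the marker list, the suppression labels, the 0.8 object-similarity threshold, and the fixed 0.95 confidence are hypothetical stand-ins, since the paper defers detector thresholds to its appendix.

```python
import re
from typing import Optional

# Hypothetical negation markers and suppressed relation labels.
NEGATION_TOKENS = {"no", "not", "never", "without", "cannot"}
SUPPRESSED_RELATIONS = {"elaborates", "solves", "diagnoses"}


def has_negation(relation: str) -> bool:
    """True if the relation phrase contains a negation marker."""
    tokens = re.findall(r"[a-z']+", relation.lower())
    return any(tok in NEGATION_TOKENS or tok.endswith("n't") for tok in tokens)


def neg_flip(current_rel: str, historical_rel: str, object_similarity: float,
             sim_threshold: float = 0.8, confidence: float = 0.95) -> Optional[float]:
    """Return a contradiction confidence if polarity flips, else None."""
    if current_rel in SUPPRESSED_RELATIONS or historical_rel in SUPPRESSED_RELATIONS:
        return None
    if object_similarity < sim_threshold:
        return None  # objects are not aligned enough to compare
    if has_negation(current_rel) != has_negation(historical_rel):
        return confidence
    return None


# Turn 4 vs. Turn 1 of the walkthrough; 0.88 is used as an illustrative
# object similarity between "metabolic rate" and "metabolism".
c = neg_flip("has no effect", "slows", object_similarity=0.88)
s_log = 1.0 if c is None else 1.0 - c
print(c, round(s_log, 2))
```

With these assumed settings the Turn 2 pair ("triggers" vs. "slows") produces no signal, while the Turn 4 pair fires, consistent with the walkthrough.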

By cascade priority, NegFlip becomes the selected contradiction signal, yielding $S^{\text{log}}_4 = 1 - 0.95 = 0.05$.

Local relevance remains high due to direct prompt alignment ($S^{\text{loc}}_4 \approx 0.80$), and no disconnected nodes are introduced, giving $S^{\text{cons}}_4 = 1$. The pre-gate score is therefore

$\bar{Q}_4 = 0.5(0.80) + 0.2(1) + 0.3(0.05) = 0.615.$
Since $S^{\text{log}}_4 < \theta^{\text{log}}_{\text{hard}}$, the hard-logic gate activates and caps the final score.

The turn is therefore classified as failed, and the framework emits the contradiction certificate $(\textit{skipping breakfast}, e_c, e_h, \textsc{NegFlip}, 0.95)$.

#### Session aggregation\.

Under recency-weighted aggregation, the sequence $(Q_1, Q_2, Q_3, Q_4)$ exhibits a clear downward trajectory. The weighted regression therefore produces a negative slope $\hat{\beta} < 0$, yielding the session-level score

$\mathcal{S}(\mathcal{D}) = \bar{Q}^{\text{rec}} + \lambda_{\text{eff}} \hat{\beta}.$

The dialogue is consequently classified as Degrading, reflecting the emergence of a high-confidence contradiction despite initially coherent responses. The aggregator therefore captures both instantaneous response quality and long-term conversational trajectory.
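A minimal sketch of this aggregation is given below. It assumes geometric recency weights $w_i = \gamma^{T-i}$ and a constant $\lambda_{\text{eff}}$; both, along with the function names and the values of `gamma` and `lam`, are hypothetical, since Definition 6 is not reproduced in this section. Only $Q_1 \approx 0.94$ comes from the walkthrough; the other turn scores are illustrative placeholders chosen to decline.

```python
def recency_weights(num_turns, gamma=0.8):
    """Geometric weights that favor later turns (assumed form)."""
    return [gamma ** (num_turns - 1 - i) for i in range(num_turns)]


def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def wls_slope(scores, weights):
    """Weighted least-squares slope of scores against turn index i - 1."""
    xs = list(range(len(scores)))
    x_bar = weighted_mean(xs, weights)
    q_bar = weighted_mean(scores, weights)
    num = sum(w * (x - x_bar) * (q - q_bar) for w, x, q in zip(weights, xs, scores))
    den = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    return num / den if den > 0 else 0.0


def session_score(scores, gamma=0.8, lam=1.0):
    """Return (unclipped session score, slope)."""
    weights = recency_weights(len(scores), gamma)
    q_rec = weighted_mean(scores, weights)
    beta = wls_slope(scores, weights)
    return q_rec + lam * beta, beta


turn_scores = [0.94, 0.90, 0.70, 0.30]  # Q_1 from the text; the rest illustrative
raw, beta = session_score(turn_scores)
final = min(1.0, max(0.0, raw))  # clipping applied last
print(round(final, 3), round(beta, 3))
```

Because the slope term is added to a recency-weighted mean, a late collapse lowers the session score twice: once through the heavier weight on recent turns and once through the negative trend.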

Overall, SKG\-Eval converts dialogue evaluation from implicit judgment over flat text into explicit reasoning over an evolving, auditable semantic state representation\.

## 5Experimental Results

We evaluate SKG-Eval along five axes: (i) alignment with human judgments at both the turn and session levels (§[5.1](https://arxiv.org/html/2605.16650#S5.SS1.SSS0.Px5)); (ii) recall of the cross-turn failure modes formalized in §[3](https://arxiv.org/html/2605.16650#S3) on controlled adversarial dialogues (§[5.2](https://arxiv.org/html/2605.16650#S5.SS2)); (iii) component-level ablations (§[5.5](https://arxiv.org/html/2605.16650#S5.SS5)); (iv) behavior as a function of session length (§[5.6](https://arxiv.org/html/2605.16650#S5.SS6)); and (v) computational cost relative to LLM-as-judge (§[5.7](https://arxiv.org/html/2605.16650#S5.SS7)). A qualitative case study (§[5.8](https://arxiv.org/html/2605.16650#S5.SS8)) illustrates the contradiction certificates produced by the framework.

### 5\.1Experimental Setup

#### Datasets\.

We evaluate SKG\-Eval on two complementary multi\-turn dialogue benchmarks:

MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.16650#bib.bib1)), a widely used short-horizon open-ended dialogue benchmark with GPT-4-based ratings; and

MultiChallenge (Sirdeshmukh et al., [2025](https://arxiv.org/html/2605.16650#bib.bib4)), a long-horizon benchmark designed to stress context tracking, instruction following, and reasoning consistency across extended conversations.

Where ground-truth human ratings are unavailable, we collect Likert-scale ratings from three independent annotators on a randomly sampled subset of conversations, achieving average inter-annotator agreement $\kappa = 0.71$. For each benchmark, we prompt all models with the same user turns and generate full multi-turn conversations by iteratively feeding prior turns as context. Decoding parameters are standardized across models where possible (temperature $= 0.7$, top-$p = 0.9$), and generation is repeated with multiple seeds for stochastic models.

#### Implementation\.

The embedding model $\phi$ is all-mpnet-base-v2 (768-dim, frozen). The triple extractor is nvidia/nemotron-3-nano-30b-a3b, called at temperature 0 with deterministic decoding. All thresholds are fixed at the values reported in §[4](https://arxiv.org/html/2605.16650#S4) and frozen across all benchmarks and ablations. The full pipeline runs on a single CPU; no GPU is required at scoring time. Additional implementation details, detector thresholds, and prompt templates are provided in Appendix [A](https://arxiv.org/html/2605.16650#A1) and Appendix [B](https://arxiv.org/html/2605.16650#A2).

#### Baselines\.

We compare against five representative evaluators:

- LLM-Eval (Lin and Chen, [2023](https://arxiv.org/html/2605.16650#bib.bib7)): single-prompt unified judge over multiple dimensions.
- ECoh (Mendonça et al., [2024](https://arxiv.org/html/2605.16650#bib.bib9)): distilled turn-level coherence judge.
- DeepEval + GPT-4o: a prompt-based LLM-as-a-Judge pipeline implemented using the DeepEval framework with GPT-4o as the backend evaluator.
- GPT-4o-Judge (turn-only): prompt of the form “Rate this response 1–5 given the prompt”, with no access to dialogue history.
- GPT-4o-Judge (history-aware): the same judge prompt augmented with the full preceding conversation history.

The exact evaluation prompts used for all judge\-based baselines are reported in Appendix[A](https://arxiv.org/html/2605.16650#A1)\.

#### Meta\-evaluation metrics\.

Following standard practice (Zhang et al., [2024](https://arxiv.org/html/2605.16650#bib.bib8); Kwan et al., [2024](https://arxiv.org/html/2605.16650#bib.bib2)), we report Spearman’s $\rho$, Pearson’s $r$, and Kendall’s $\tau$-b between evaluator scores and aggregated human ratings, computed at both the turn and session levels. For controlled adversarial dialogues with binary contradiction/drift labels, we report precision, recall, and F1.
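For reference, the three correlation statistics can be computed as in the pure-Python sketch below. It is an illustration on made-up scores, not the paper's evaluation code; in practice a statistics library would normally be used. The tie handling matters because Likert ratings repeat often: Spearman uses average ranks, and Kendall's $\tau$-b corrects the denominator for tied pairs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    conc = disc = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = sqrt((conc + disc + only_x_tied) * (conc + disc + only_y_tied))
    return (conc - disc) / denom


evaluator = [1, 2, 3, 4, 5]   # made-up evaluator scores
human = [2, 1, 4, 3, 5]       # made-up human ratings
print(round(spearman(evaluator, human), 3), round(kendall_tau_b(evaluator, human), 3))
```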

#### Alignment with Human Judgments

Tables [3](https://arxiv.org/html/2605.16650#S5.T3) and [4](https://arxiv.org/html/2605.16650#S5.T4) report turn-level and session-level correlation with human ratings on MT-Bench and MultiChallenge. SKG-Eval achieves the strongest correlation on both benchmarks, with the largest gains observed on MultiChallenge, where conversations are sufficiently long to expose stateful failures.

Table 3: Turn-level correlation with human ratings. Higher is better; best in bold, second-best underlined.

Table 4: Session-level correlation with human ratings. SKG-Eval exhibits the largest gain on MultiChallenge, where long-horizon semantic inconsistency and contradiction become more prominent.

Table 5: Model ranking consistency on generated conversations. Scores correspond to mean session-level quality across SKG-Probe sessions; higher is better.

#### Analysis\.

SKG\-Eval produces rankings that exactly match human preferences in this evaluation, while baseline evaluators exhibit partial agreement\. Prompt\-based judges tend to overestimate locally coherent but inconsistent responses, whereas SKG\-Eval better differentiates models based on long\-horizon reliability\.

### 5\.2Mechanism\-Targeted Diagnostic Probes

To isolate the behavior of individual contradiction mechanisms within the geometric engine, we construct a controlled diagnostic benchmark termed SKG-Probe. Unlike conventional adversarial dialogue benchmarks that measure aggregate robustness, SKG-Probe is explicitly designed to target specific contradiction regimes handled by the neuro-symbolic detector cascade, including negation reversal, antonymic contradiction, symbolic numeric mismatch, semantic drift, and revision-aware memory updates. Additional diagnostic sessions and mechanism-targeted examples are provided in Appendix [E](https://arxiv.org/html/2605.16650#A5).

The current benchmark consists of six carefully engineered multi\-turn diagnostic sessions\. Each session isolates a single contradiction mechanism by introducing a targeted factual violation \(or explicit user revision\) at a later conversational turn while preserving overall fluency and local coherence\. This design enables controlled evaluation of whether the geometric engine activates the correct detector under semantically challenging conditions where embedding similarity alone is often insufficient\.

#### Probe categories\.

The benchmark currently evaluates six mechanism\-targeted regimes:

- Negation reversal: introduce explicit polarity inversion through negation markers while preserving semantic overlap.
- Antonymic contradiction: replace a directional predicate with a semantically opposing relation drawn from the curated antonym lexicon $\mathcal{A}_{\text{ant}}$.
- Numeric mismatch: modify symbolic numeric values embedded within otherwise nearly identical factual claims.
- Moderate semantic drift: introduce semantically related but incompatible object substitutions under the same semantic anchor.
- Strong semantic drift: introduce structurally disconnected factual continuations that remain weakly semantically adjacent.
- Revision-aware update: evaluate whether the framework correctly distinguishes user-authorized memory revision from genuine contradiction.

#### Design rationale\.

SKG-Probe directly instantiates the contradiction regimes formalized in Proposition [2](https://arxiv.org/html/2605.16650#Thmproposition2), particularly cases where embedding-based evaluators are vulnerable to semantic similarity inflation. By isolating each detector pathway independently, the benchmark provides interpretable evidence regarding the necessity of combining symbolic contradiction priors with embedding-geometric reasoning.

Importantly, the benchmark also evaluates revision\-aware filtering, which distinguishes SKG\-Eval from conventional contradiction evaluators by explicitly separating model inconsistency from user\-authorized conversational state updates\.

Table 6: Per-category contradiction/drift detection on SKG-Probe (binary, F1 %).

#### Evaluation on generated LLM conversations\.

To analyze the behavior of SKG-Eval on real model-generated dialogues, we evaluated six representative LLMs spanning different architectural families and capability levels: GPTOSS-20B, Gemma-4-31B, MiniMax-M2.7, Llama-3-70B, DeepSeek-V4-Pro, and Mistral-7B. Each model was prompted using the same multi-turn conversational sessions from SKG-Probe, and the resulting dialogues were scored using the full SKG-Eval pipeline.

The results reveal several important trends. First, higher parameter count does not necessarily imply stronger long-horizon conversational consistency. For example, Gemma-4-31B achieved higher overall session quality than both Llama-3-70B and DeepSeek-V4-Pro, despite being smaller. Second, several models achieved near-perfect logical consistency scores ($S^{\text{log}}$), yet still obtained lower overall session quality due to weaker local relevance and semantic anchoring. This supports SKG-Eval’s multi-component formulation, which combines local relevance, historical consistency, and contradiction reasoning rather than relying on contradiction detection alone.

Among all evaluated models, GPTOSS-20B achieved the strongest overall performance, exhibiting the best balance between logical coherence, semantic consistency, and trajectory stability across turns. In contrast, weaker models showed gradual degradation in session-level quality despite maintaining locally fluent responses. These findings support the central hypothesis of this work: modern LLMs frequently preserve short-term fluency while still exhibiting measurable long-horizon semantic inconsistency, which remains difficult to detect using conventional turn-level evaluators.

#### Comparison with LLM\-as\-a\-Judge evaluation\.

We additionally compared SKG\-Eval against an LLM\-as\-a\-Judge evaluation protocol using the same generated conversations\. For each session, a frontier judge model was prompted to assign conversational quality scores based on the full dialogue history\. This comparison enables direct analysis of whether explicit state\-aware contradiction reasoning provides complementary signals beyond prompt\-based holistic judgment\.

The results reveal a broadly consistent pattern: LLM-as-a-Judge systems effectively reward local fluency and stylistic coherence, but may under-penalize certain forms of cross-turn semantic inconsistency. In several sessions, models that received high judge scores received lower SKG-Eval scores due to contradiction emergence, semantic drift, or degradation in historical consistency. This discrepancy became especially visible in long-horizon conversations where conflicting claims appeared several turns apart.

Interestingly, the ranking differences between SKG\-Eval and LLM\-as\-a\-Judge were not driven primarily by grammatical quality or local fluency\. Models with strong local response quality but weaker semantic persistence across turns tended to receive higher judge scores relative to their SKG\-Eval scores\. In contrast, models maintaining stable conversational commitments across the full session achieved consistently strong scores under both evaluators\.

These findings highlight a key distinction between the two paradigms\. LLM\-as\-a\-Judge systems perform implicit holistic assessment over serialized dialogue history, whereas SKG\-Eval externalizes conversational state into a persistent semantic structure and evaluates contradiction through structured geometric reasoning\. Consequently, SKG\-Eval provides explicit contradiction\-aware reasoning over persistent conversational state, while remaining quasi\-deterministic and interpretable through contradiction certificates and graph\-level diagnostics\.

#### Numerical comparison with LLM\-as\-a\-Judge\.

We further compared SKG-Eval against an LLM-as-a-Judge evaluation protocol on the same generated conversations using six representative models: GPTOSS-20B, Gemma-4-31B, MiniMax-M2.7, Llama-3-70B, DeepSeek-V4-Pro, and Mistral-7B.

A broadly consistent pattern emerged across the evaluated models: LLM-as-a-Judge assigned higher session-level quality scores than SKG-Eval. For example, GPTOSS-20B obtained a mean SKG-Eval score of 0.766, whereas the LLM judge assigned 0.988. Similarly, Gemma-4-31B received 0.756 under SKG-Eval versus 0.954 under judge evaluation, while Llama-3-70B received 0.741 versus 0.994. Across all six evaluated models, the average score difference between the LLM judge and SKG-Eval was approximately 0.24.

Importantly, this discrepancy did not appear to be driven primarily by grammatical quality or local fluency. Most evaluated models produced highly coherent individual responses and therefore received near-saturated LLM-judge scores ($> 0.95$). However, SKG-Eval identified gradual degradation in historical semantic consistency, contradiction emergence, and cross-turn drift that appeared to receive comparatively weaker penalties under holistic judge prompting. This effect became especially visible in long-horizon sessions, where models such as MiniMax-M2.7 and Mistral-7B exhibited stronger negative session slopes ($\hat{\beta} \approx -0.05$), indicating progressive degradation across turns despite receiving very high LLM-judge scores.

These findings are consistent with the motivation behind SKG\-Eval: conventional LLM\-as\-a\-Judge systems primarily reward local fluency and overall conversational plausibility, whereas explicit state\-aware geometric reasoning provides an alternative mechanism for detecting long\-range logical inconsistency and semantic state degradation\.
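The per-model gaps quoted above can be checked with a few lines of arithmetic. The sketch uses only the three model pairs whose scores are stated in the text; scores for the other three models are not reported in this section, so they are not included.

```python
# (SKG-Eval score, LLM-judge score) as quoted in the text.
reported = {
    "GPTOSS-20B": (0.766, 0.988),
    "Gemma-4-31B": (0.756, 0.954),
    "Llama-3-70B": (0.741, 0.994),
}

gaps = {model: round(judge - skg, 3) for model, (skg, judge) in reported.items()}
mean_gap = sum(gaps.values()) / len(gaps)

for model, gap in gaps.items():
    print(f"{model}: judge - SKG-Eval = {gap:.3f}")
print(f"mean over the three reported models = {mean_gap:.3f}")
```

The three reported gaps (0.222, 0.198, 0.253) average about 0.224, so the stated six-model average of roughly 0.24 implies somewhat larger gaps for the three models whose scores are not quoted.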

#### Agreement analysis with LLM\-as\-Judge\.

Figure [5](https://arxiv.org/html/2605.16650#S5.F5) compares SKG-Eval and LLM-as-Judge scores across six diagnostic sessions generated using GPTOSS-20B. The upper row visualizes agreement between the two evaluators for both recency-weighted and aggregated session scores, where the dashed diagonal denotes perfect agreement. The lower row shows the corresponding session-wise trajectories.

A broadly consistent pattern emerges across the sessions: LLM-as-Judge assigns near-saturated scores close to 1.0, whereas SKG-Eval produces lower and more differentiated scores. The discrepancy is particularly visible in sessions containing semantic drift or delayed contradiction, where SKG-Eval produces stronger penalties through reductions in $S^{\text{log}}_t$ and historical consistency. In contrast, the judge model often continues assigning high scores because local fluency and surface coherence remain strong.

These results suggest that prompt\-based holistic evaluators may be less sensitive to certain forms of long\-range conversational inconsistency, while SKG\-Eval produces more differentiated session scores according to persistent semantic state quality and contradiction behavior\.

![Refer to caption](https://arxiv.org/html/2605.16650v1/figures/llm_judge_trajectory_gpt0ss20B.png)

Figure 5: Comparison between SKG-Eval and LLM-as-Judge scores across six diagnostic sessions.

### 5\.3Statistical Significance

To evaluate whether the observed performance gains are statistically reliable rather than arising from sampling variability, we perform significance testing for all major experimental comparisons\.

#### Correlation metrics\.

For turn\-level and session\-level alignment, we estimate uncertainty using non\-parametric bootstrap over conversations\. Specifically, we resample complete conversations \(rather than individual turns\) with replacement for 1,000 bootstrap replicates and recompute the correlation metrics for each replicate\. We report 95% confidence intervals using the percentile bootstrap method\. All significance tests are two\-sided unless otherwise stated\.
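The conversation-level resampling can be sketched as below. This is an illustrative implementation of a percentile bootstrap, not the authors' code: the data layout (each conversation as a list of (evaluator score, human rating) pairs), the pooled Pearson statistic, the function names, and the toy data are all assumptions. Resampling whole conversations keeps turns from the same dialogue together, which respects their dependence.

```python
import random
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def pooled_pearson(conversations):
    """Correlation over all (evaluator, human) turn pairs in the sample."""
    pairs = [pair for conv in conversations for pair in conv]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def bootstrap_ci(conversations, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling conversations with replacement."""
    rng = random.Random(seed)
    n = len(conversations)
    stats = []
    for _ in range(n_boot):
        sample = [conversations[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data: five conversations of three (evaluator score, human rating) turns.
conversations = [
    [(0.90, 5), (0.80, 4), (0.40, 2)],
    [(0.70, 4), (0.60, 3), (0.20, 1)],
    [(0.95, 5), (0.50, 3), (0.30, 2)],
    [(0.85, 4), (0.75, 4), (0.35, 1)],
    [(0.60, 3), (0.90, 5), (0.10, 1)],
]
low, high = bootstrap_ci(conversations, pooled_pearson)
print(round(low, 3), round(high, 3))
```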

#### Binary classification metrics\.

For adversarial experiments \(Table[6](https://arxiv.org/html/2605.16650#S5.T6)\), we report precision, recall, and F1\. Confidence intervals are computed via bootstrap over dialogues\. Differences in F1 between evaluators are tested using paired bootstrap resampling over the same dialogue instances\.
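A paired bootstrap over dialogues can be sketched as follows. The code is a simplified illustration under assumptions: the labels and predictions are toy data, the function names are hypothetical, and the returned value is a one-sided fraction of replicates in which evaluator A does not beat evaluator B, which is one common simple variant rather than necessarily the exact test used in the paper.

```python
import random


def f1_score(labels, preds):
    """Binary F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_f1(labels, preds_a, preds_b, n_boot=1000, seed=0):
    """Resample the same dialogue indices for both evaluators."""
    rng = random.Random(seed)
    n = len(labels)
    observed = f1_score(labels, preds_a) - f1_score(labels, preds_b)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        l = [labels[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        if f1_score(l, a) - f1_score(l, b) <= 0:
            not_better += 1
    return observed, not_better / n_boot


# Toy example: evaluator A is always right, evaluator B is always wrong.
labels = [1, 0] * 5
preds_a = list(labels)
preds_b = [1 - l for l in labels]
diff, frac = paired_bootstrap_f1(labels, preds_a, preds_b)
print(diff, frac)
```

Using the same resampled indices for both evaluators is what makes the comparison paired: per-dialogue difficulty cancels out of the difference.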

#### Multiple comparisons\.

Since we compare multiple evaluators across two benchmarks, we control for multiple hypothesis testing using the Holm–Bonferroni correction. Adjusted $p$-values are reported in the appendix.
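For readers unfamiliar with the procedure, the Holm–Bonferroni step-down adjustment can be sketched as below. The function name and the input p-values are illustrative only; they are not results from the paper.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]  # illustrative raw p-values
adjusted_p = holm_bonferroni(raw_p)
print([round(p, 3) for p in adjusted_p])
```

A comparison is then declared significant at level $\alpha$ when its adjusted p-value is at most $\alpha$.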

#### Result summary\.

Across both benchmarks, the improvements of SKG-Eval over the strongest history-aware LLM-as-a-Judge baseline are statistically significant at the $p < 0.01$ level for session-level Spearman correlation. On adversarial probes, gains in contradiction-detection F1 for the numeric mismatch and antonymic contradiction probes are significant at $p < 0.001$.

Random seeds were fixed across repeated experiments to reduce evaluation variance unrelated to the evaluators themselves\.

### 5\.4Qualitative Trajectory Analysis

Figure [6](https://arxiv.org/html/2605.16650#S5.F6) shows the turn-level trajectory for a representative SKG-Probe session. The dialogue contains a nutrition-related conversation with injected failures: a topic drift at Turn 2, a paraphrased contradiction at Turn 4, and a later macronutrient contradiction at Turn 6. SKG-Eval captures these failures through distinct score components. The drop in $Q_t$ at Turns 2–4 is driven by reduced historical consistency and a sharp collapse in $S^{\text{log}}_t$ at Turn 4, where the response contradicts the earlier claim that skipping breakfast slows metabolism. Although later turns regain local relevance, the component trajectories reveal that local fluency alone does not imply stable conversational state. This case illustrates how SKG-Eval provides not only a scalar quality score but also a diagnostic decomposition into relevance, consistency, and logic signals.

![Refer to caption](https://arxiv.org/html/2605.16650v1/figures/falw2_120B.png)

Figure 6: Turn-level SKG-Eval trajectory for a representative SKG-Probe nutrition session. The upper panel shows the final turn quality score $Q_t$ and recency-weighted trend, with healthy and critical zones shaded. The lower panel decomposes the score into local relevance $S^{\text{loc}}_t$, historical consistency $S^{\text{cons}}_t$, and logical coherence $S^{\text{log}}_t$. The trajectory exposes failures that are not visible from local fluency alone, including topic drift at Turn 2 and a high-confidence contradiction at Turn 4.

### 5\.5Component Ablations

Table [7](https://arxiv.org/html/2605.16650#S5.T7) ablates each module of SKG-Eval, reporting the change in session-level Spearman on MultiChallenge, where stateful effects are most visible. Removing the geometric engine and replacing $S^{\text{log}}_t$ with the cross-encoder NLI premise-pool baseline reduces Spearman correlation by 0.09. Removing the typed attribute taxonomy reduces performance by 0.05. Replacing the Semantic Triangle with prompt-only cosine reduces performance by 0.04. Removing the session-anchor rescue reduces performance by 0.03, with larger effects observed on longer MultiChallenge sessions.

Table 7: Component ablations on MultiChallenge (session-level Spearman $\rho$). $\Delta$ is the absolute drop from the full system.

#### Takeaway.

These ablations indicate that SKG-Eval’s performance gains arise from structured state tracking and geometric contradiction reasoning, with cross-turn identity consistency and revision-aware filtering acting as enabling conditions rather than auxiliary refinements.

### 5\.6Behavior as a Function of Session Length

We bin MultiChallenge sessions by lengthT∈\{2−5,6−10,11−20,21−30,31\+\}T\\in\\\{2\{\-\}5,\\,6\{\-\}10,\\,11\{\-\}20,\\,21\{\-\}30,\\,31\{\+\}\\\}and report session\-level Spearman correlation in each bin\. Baselines degrade asTTincreases, reflecting the burial effect of long prefixes\. In contrast, SKG\-Eval remains stable across length bins because historical claims are retrieved through graph\-indexed semantic anchors\.
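The length binning can be sketched as a small helper. The bin edges follow the text; the function names, the session identifiers, and the choice to drop sessions shorter than two turns are assumptions made for illustration.

```python
from collections import defaultdict

# (lower bound, upper bound or None for open-ended, label), as listed in the text.
LENGTH_BINS = [
    (2, 5, "2-5"),
    (6, 10, "6-10"),
    (11, 20, "11-20"),
    (21, 30, "21-30"),
    (31, None, "31+"),
]


def length_bin(num_turns):
    """Return the bin label for a session length, or None if below 2 turns."""
    for lo, hi, label in LENGTH_BINS:
        if num_turns >= lo and (hi is None or num_turns <= hi):
            return label
    return None


def group_by_length(sessions):
    """Group (session_id, num_turns) pairs by length bin."""
    groups = defaultdict(list)
    for session_id, num_turns in sessions:
        label = length_bin(num_turns)
        if label is not None:
            groups[label].append(session_id)
    return dict(groups)


# Hypothetical session identifiers and lengths.
sessions = [("s1", 3), ("s2", 8), ("s3", 10), ("s4", 25), ("s5", 40), ("s6", 1)]
print(group_by_length(sessions))
```

A per-bin correlation would then be computed on each group separately, for example with the Spearman statistic applied to the evaluator scores and human ratings of the sessions in that bin.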

Figure 7: Session\-level rank correlation as a function of session length $T$ on MultiChallenge\. SKG\-Eval remains stable across the length axis because graph indexing retrieves historical edges without requiring full\-prefix judge prompting\.

### 5\.7 Computational Cost and Determinism

Table [8](https://arxiv.org/html/2605.16650#S5.T8) reports wall\-clock cost and reproducibility on a 1,000\-turn evaluation run\. SKG\-Eval’s computational cost is dominated by semantic extraction, followed by lightweight embedding similarity and deterministic detector evaluation\. Unlike prompt\-based judge systems, the contradiction engine itself introduces negligible additional overhead after graph construction\. History\-aware judge baselines incur substantially higher cost because evaluation complexity grows with serialized conversational context length\.

Table 8: Computational cost and reproducibility on a 1,000\-turn evaluation run\.

### 5\.8 Qualitative Case Study: Contradiction Certificates

A defining characteristic of SKG\-Eval is that low session scores are accompanied by explicit contradiction certificates grounded in graph structure\. Consider a representative MultiChallenge dialogue in which the assistant first claims that “compound interest grows linearly with time” and later claims that it “grows exponentially\.” SKG\-Eval emits a contradiction certificate identifying the conflicting edge pair, the triggering detector, and its confidence\. The emitted certificate additionally exposes the associated semantic anchor, relation similarity, and object divergence responsible for the contradiction decision\.
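To make the shape of such a certificate concrete, here is a minimal sketch of a record holding the fields named above. The class names, field names, and all numeric values are hypothetical; the paper does not specify a serialization format, and the detector label used in the example is a placeholder.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Edge:
    """One extracted triple, tagged with the turn it came from."""
    turn: int
    sub: str
    rel: str
    obj: str


@dataclass(frozen=True)
class ContradictionCertificate:
    """Hypothetical container for the fields the text says a certificate exposes."""
    anchor: str                 # semantic anchor shared by both edges
    prior_edge: Edge            # earlier commitment
    current_edge: Edge          # conflicting later claim
    detector: str               # name of the detector that fired
    confidence: float
    relation_similarity: float
    object_divergence: float

    def summary(self):
        return (
            f"[{self.detector}] turn {self.prior_edge.turn} vs "
            f"turn {self.current_edge.turn} on '{self.anchor}': "
            f"'{self.prior_edge.obj}' conflicts with '{self.current_edge.obj}' "
            f"(confidence {self.confidence:.2f})"
        )


# Illustrative values only; none of these numbers come from the paper.
cert = ContradictionCertificate(
    anchor="compound interest",
    prior_edge=Edge(2, "compound interest", "grows", "linearly with time"),
    current_edge=Edge(7, "compound interest", "grows", "exponentially"),
    detector="placeholder-detector",
    confidence=0.90,
    relation_similarity=1.00,
    object_divergence=0.70,
)
record = asdict(cert)  # nested dict, convenient for logging or JSON export
```

Keeping both edges inside the record, instead of only a score, is what lets a reviewer trace a low session score back to the two turns that conflict.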

This example highlights the qualitative complement to the quantitative results: locally correct responses can still be inconsistent with prior commitments, and detecting such inconsistencies requires explicit state tracking\.

### 5\.9 Error Analysis

We analyze failure cases of SKG\-Eval to understand its limitations\. Observed failure modes fall into five primary categories: extraction errors, structural mismatches, detector coverage gaps, threshold sensitivity, and revision\-intent misclassification\. Among these categories, extraction fragmentation and ambiguous entity normalization were the most common practical failure sources\.

#### Takeaway\.

Most failure cases are attributable to upstream representation or coverage gaps rather than instability of the scoring mechanism\. This reinforces the design principle of SKG\-Eval: once the conversational state is correctly externalized, evaluation becomes substantially more deterministic, interpretable, and analyzable than purely prompt\-based holistic scoring\.

### 5\.10 Discussion and Limitations

Although the upstream semantic extraction stage of SKG\-Eval may introduce limited variability due to LLM\-based parsing, all downstream stages, including graph construction, semantic\-state updates, revision filtering, contradiction detection, and score aggregation, are fully deterministic given the extracted symbolic structure\.

#### Comparison between ECoh, LLM\-as\-a\-Judge, and SKG\-Eval\.

The three evaluators compared here, including the proposed SKG\-Eval, represent fundamentally different paradigms for measuring conversational quality\. ECoh \(Mendonça et al\., [2024](https://arxiv.org/html/2605.16650#bib.bib9)\) is a distilled neural coherence model that detects local semantic consistency between successive turns\. Although computationally efficient, ECoh is essentially a shallow coherence estimator: it neither maintains persistent conversational state nor performs structured contradiction reasoning\.

The LLM\-as\-a\-Judge class, which includes both the DeepEval \+ GPT\-4o and GPT\-4o\-Judge variants, performs holistic prompt\-based assessment of the serialized conversation history\. These models excel at measuring fluency, style, and overall conversational consistency\. However, because their reasoning is implicit in the judge prompt, they tend to miss contradictions and long\-range semantic drift when a dialogue appears locally fluent\.

SKG\-Eval differs from both paradigms by externalizing persistent conversational state into a Semantic Knowledge Graph and evaluating logical consistency through neuro\-symbolic geometric contradiction reasoning\. Unlike the other two evaluators, it explicitly analyzes local conversational relevance, consistency across time, and logical coherence with respect to previous turns\. This makes it possible to detect cross\-turn contradictions, semantic drift, numeric inconsistencies, and revision updates, and to emit contradiction certificates for further analysis of the generated dialogue\.

Empirically, ECoh performs robustly on short\-horizon local coherence but degrades on long\-form dialogues\. The LLM\-as\-a\-Judge systems score fluent local conversations consistently high and often fail to penalize late contradictions\. As a result, SKG\-Eval tends to align better with human judgments when estimating conversational quality over long horizons\.

#### Relationship to LLM\-as\-a\-Judge evaluation\.

In many cases SKG\-Eval and the LLM\-as\-a\-Judge systems produce highly comparable scores, especially for dialogues whose responses are locally coherent, semantically stable, and free of long\-term contradictions\. This suggests that current judge\-based models are often capable of assessing conversation quality under such conditions\.

The two frameworks nevertheless differ in how they reach a judgment\. LLM\-as\-a\-Judge systems rely on implicit holistic reasoning over serialized dialogue histories, whereas SKG\-Eval explicitly externalizes the conversation state in a Semantic Knowledge Graph and analyzes incompatibility using geometric contradiction detection\. As a result, SKG\-Eval produces interpretable contradiction certificates, semantic anchors, and traceable detector\-level reasoning, all of which can be audited\. In addition, once the knowledge graph has been constructed, all further evaluation is deterministic and reproducible, unlike prompt\-dependent judge models\.

#### Where SKG\-Eval is strongest\.

SKG\-Eval exhibits its largest empirical improvements on long dialogues, on reasoning about facts and numbers, and under adversarial attacks that exploit weaknesses across turns\. The resulting contradiction certificates are directly auditable and provide localized graph\-grounded explanations that are difficult to obtain from prompt\-based holistic evaluators without additional instrumentation\.

#### Where the framework is weaker\.

First, SKG\-Eval depends on the quality of semantic triple extraction\. Second, the antonym dictionary $\mathcal{A}_{\text{ant}}$ is curated, so limited domain coverage may reduce contradiction recall in specialized technical or scientific domains\. Third, SKG\-Eval primarily evaluates internal semantic consistency rather than grounding against external factual knowledge sources\. Finally, the current framework is primarily optimized for explicit semantic inconsistency and may under\-detect highly implicit pragmatic contradictions that require deep world knowledge or latent commonsense reasoning\.

#### Reproducibility\.

To facilitate reproducibility, we will release the full evaluation pipeline, the SKG\-Probe adversarial benchmark, extraction templates, preprocessing scripts, and evaluation configurations under a permissive open\-source license upon acceptance\.

Future work includes multilingual contradiction modeling, adaptive semantic extraction, and integration with retrieval\-grounded factual verification systems\.

## 6 Conclusion

We presented SKG\-Eval, a quasi\-deterministic and interpretable evaluation framework that models conversations as dynamic semantic knowledge graphs and evaluates systems based on their local relevance, historical consistency, and logical coherence\. This approach addresses a gap in previous approaches to conversational system evaluation, which lack robustness in detecting contradictions, semantic drift, and entity inconsistencies at longer ranges\.

Experimental results on benchmarks and adversarial probes suggest that SKG\-Eval agrees well with human perception while improving the sensitivity towards failure modes related to extended conversations\. In particular, the proposed geometric contradiction engine, along with typed comparability and revision filtering, allows the identification of inconsistency errors that would have received comparatively less weight under prompt\-based evaluation\. The recency weighting scheme yields a trend\-sensitive summary of the overall quality of the conversation, regardless of its length\.

In addition to evaluation performance, we consider two important features that are often hard to satisfy simultaneously in modern evaluation frameworks: interpretability and reproducibility\. Once the conversational state is externalized as a graph, the process of evaluation becomes deterministic, and the problematic cases can be examined in terms of contradiction certificates and graph diagnostics\. Such auditability is potentially useful for downstream applications, including dataset curation, debugging, failure analysis, and benchmarking evaluation methods themselves\.

On the other hand, SKG\-Eval depends on the reliability of the semantic representation and focuses on checking for consistency in the conversations rather than the factual grounding of statements\. Future work includes integration with factual validation based on retrieval, expansion of contradiction detection to higher\-order reasoning patterns, and generalization of the method to multilingual and multimodal conversations\.

To summarize, our study suggests that semantic state modeling could complement holistic language model evaluations for better examination of conversation consistency and failure modes\.

## References

- BotChat: evaluating LLMs’ capabilities of having multi\-turn dialogues\. In Findings of the Association for Computational Linguistics: NAACL 2024, pp\. 3184–3200\.
- J\. Gao, J\. Cui, A\. Yang, Y\. Tong, H\. Wu, X\. Zhang, Y\. Yang, and Z\. He \(2026\) MT\-dyna: a framework for evaluating multi\-turn capabilities of LLMs\. Applied Soft Computing 193, pp\. 114785\.
- S\. Guan et al\. \(2026\) Evaluating LLM\-based agents for multi\-turn conversations: a survey\. ACM Computing Surveys\. Note: Article No\. 3793671\.
- J\. Hu, T\. Dong, G\. Luo, H\. Ma, P\. Zou, X\. Sun, D\. Guo, X\. Yang, and M\. Wang \(2025\) PsycoLLM: enhancing LLM for psychological understanding and evaluation\. IEEE Transactions on Computational Social Systems 12\(2\), pp\. 539–551\.
- E\. Ike, J\. Takeuchi, F\. Joublin, A\. Ceravola, and M\. Tanti \(2025\) Automating dialogue evaluation: LLMs vs human judgment\. In Proceedings of the 27th International Conference on Human\-Computer Interaction \(HCII\), LNAI 15820, pp\. 353–372\.
- M\. Kahng, I\. Tenney, M\. Pushkarna, M\. X\. Liu, J\. Wexler, E\. Reif, K\. Kallarackal, M\. Chang, M\. Terry, and L\. Dixon \(2025\) LLM comparator: interactive analysis of side\-by\-side evaluation of large language models\. IEEE Transactions on Visualization and Computer Graphics 31\(1\), pp\. 503–513\.
- W\. Kwan, X\. Zeng, Y\. Jiang, Y\. Wang, L\. Li, L\. Shang, X\. Jiang, Q\. Liu, and K\. Wong \(2024\) MT\-eval: a multi\-turn capabilities evaluation benchmark for large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 20153–20177\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2025\) LLMs get lost in multi\-turn conversation\. arXiv preprint arXiv:2505\.06120\.
- Y\. Li, R\. Krishnan, and R\. Padman \(2026a\) Consistency of large reasoning models under multi\-turn attacks\. arXiv preprint arXiv:2602\.13093\.
- Y\. Li, X\. Shen, Y\. Miao, X\. Ding, X\. Yao, K\. Ramayya, and R\. Padman \(2026b\) Beyond single\-turn: a survey on multi\-turn interactions with large language models\. arXiv preprint arXiv:2504\.04717\.
- Y\. Liao, Y\. Meng, H\. Liu, Y\. Wang, and Y\. Wang \(2023\) An automatic evaluation framework for multi\-turn medical consultations capabilities of large language models\. arXiv preprint arXiv:2309\.02077\.
- Y\. Lin and Y\. Chen \(2023\) LLM\-Eval: unified multi\-dimensional automatic evaluation for open\-domain conversations with large language models\. In Proceedings of the 5th Workshop on NLP for Conversational AI\.
- J\. Mendonça, I\. Trancoso, and A\. Lavie \(2024\) ECoh: turn\-level coherence evaluation for multilingual dialogues\. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue \(SIGDIAL\)\.
- B\. Mo, K\. Yu, J\. Kazdan, P\. Mpala, L\. Yu, C\. I\. Kanatsoulis, and S\. Koyejo\. KGGen: extracting knowledge graphs from plain text with language models\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\.
- V\. Sirdeshmukh, K\. Deshpande, J\. Mols, L\. Jin, E\. Cardona, D\. Lee, J\. Kritz, W\. Primack, S\. Yue, and C\. Xing \(2025\) MultiChallenge: a realistic multi\-turn conversation evaluation benchmark challenging to frontier LLMs\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 18632–18702\.
- X\. Wang, Z\. Wang, J\. Liu, Y\. Chen, L\. Yuan, H\. Peng, and H\. Ji \(2024\) MINT: evaluating LLMs in multi\-turn interaction with tools and language feedback\. In Proceedings of the 12th International Conference on Learning Representations \(ICLR\)\.
- J\. Yao, P\. Jin, K\. Bao, Q\. Yu, K\. Bhardwaj, C\. Su, J\. Wang, Y\. Zhu, S\. Devare, D\. Mosk\-Aoyama, Z\. Dong, V\. K\. Srinivasan, Y\. Zhang, O\. Kuchaiev, J\. Jiao, and B\. Zhu \(2026\) The measure of all measures: quantifying LLM benchmark quality\. arXiv preprint\.
- C\. Zhang, L\. F\. D’Haro, Y\. Chen, M\. Zhang, and H\. Li \(2024\) A comprehensive analysis of the effectiveness of large language models as automatic dialogue evaluators\. In Proceedings of the 38th AAAI Conference on Artificial Intelligence \(AAAI\)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\. In Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\.

## Appendix A Prompt Templates

For reproducibility, we report the finalized prompt templates used for semantic triple extraction and judge\-based baselines\. All LLM calls were run with deterministic decoding using temperature 0\.

### A\.1 Semantic Triple Extraction Prompt

#### System message\.

```
You are a structured data extraction system. Output only a valid JSON array.
No explanation, no markdown, no preamble.
```

#### User prompt template\.

```
You are an information extraction system for a multi-turn conversation evaluator.

Extract ONLY explicit factual subject-relation-object triples from the text.

Rules:
- Do NOT infer anything not directly stated.
- Keep relations short using normalized verb phrases.
- Assign subject/object type from:
Person, Event, Object, Concept, Condition, Organization, Time, Number.
- Normalize subjects across turns whenever possible.
- Preserve negation markers, numeric values, and comparison terms.
- Assign exactly one attribute:
definition, effect, property, comparison, requirement, quantity, negation.
- Assign exactly one relation type:
assertion, negation_assertion, diagnosis, solution, elaboration.
- Assign property_type:
EXCLUSIVE or ADDITIVE.
- Extract all factual claims; do not summarize multiple claims into one triple.

Return ONLY a valid JSON list:

[
{
  "sub": "...",
  "rel": "...",
  "obj": "...",
  "type_sub": "...",
  "type_obj": "...",
  "importance": 0.0,
  "rel_type": "assertion",
  "attribute": "property",
  "property_type": "ADDITIVE"
}
]

Text:
{text}
```
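Because every downstream stage consumes the extractor's JSON, it can help to validate that output against the vocabulary the prompt defines before building the graph. The following is a minimal sketch under that assumption; `validate_triples` and its error messages are hypothetical and not part of the released pipeline. The allowed values are copied from the prompt above, and since the prompt does not state a range for `importance`, only its type is checked.

```python
import json

# Allowed values, copied from the extraction prompt.
ENTITY_TYPES = {"Person", "Event", "Object", "Concept", "Condition",
                "Organization", "Time", "Number"}
ATTRIBUTES = {"definition", "effect", "property", "comparison",
              "requirement", "quantity", "negation"}
REL_TYPES = {"assertion", "negation_assertion", "diagnosis",
             "solution", "elaboration"}
PROPERTY_TYPES = {"EXCLUSIVE", "ADDITIVE"}
REQUIRED_KEYS = {"sub", "rel", "obj", "type_sub", "type_obj", "importance",
                 "rel_type", "attribute", "property_type"}


def _in_vocab(value, vocab):
    """True when value is a string drawn from the given vocabulary."""
    return isinstance(value, str) and value in vocab


def validate_triples(raw):
    """Parse extractor output and return (valid_triples, error_messages)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return [], ["top-level value is not a JSON array"]

    valid, errors = [], []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"item {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(item)
        if missing:
            errors.append(f"item {i}: missing keys {sorted(missing)}")
            continue
        importance = item["importance"]
        checks = [
            (_in_vocab(item["type_sub"], ENTITY_TYPES), "type_sub"),
            (_in_vocab(item["type_obj"], ENTITY_TYPES), "type_obj"),
            (_in_vocab(item["rel_type"], REL_TYPES), "rel_type"),
            (_in_vocab(item["attribute"], ATTRIBUTES), "attribute"),
            (_in_vocab(item["property_type"], PROPERTY_TYPES), "property_type"),
            (isinstance(importance, (int, float))
             and not isinstance(importance, bool), "importance"),
        ]
        bad = [name for ok, name in checks if not ok]
        if bad:
            errors.append(f"item {i}: invalid fields {bad}")
        else:
            valid.append(item)
    return valid, errors
```

Rejected items are reported rather than silently dropped, so extraction failures remain visible when auditing a session.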

### A\.2 LLM\-Eval Prompt

```
You are evaluating a multi-turn dialogue assistant.

Given the user’s prompt and the assistant’s response, rate the response on the following
four dimensions, each on a 1-5 integer scale:

1. relevance: how well does the response address the user’s prompt?
2. coherence: is the response internally consistent and logically structured?
3. naturalness: does it read like a fluent, well-formed answer?
4. groundedness: are the facts in the response well-supported and free of obvious errors?

User prompt:
<<<
{prompt}
>>>

Assistant response:
<<<
{response}
>>>

Reply with ONLY a single line of JSON:
{"relevance": <int>, "coherence": <int>, "naturalness": <int>, "groundedness": <int>}
```

### A\.3 ECoh\-Style Coherence Prompt

```
You are a turn-level dialogue coherence judge.

Rate ONLY whether the assistant’s response is coherent with the immediate user prompt:
does it logically follow, address what was asked, and read as a well-formed continuation
of the dialogue? Ignore factual correctness; focus on coherence.

User prompt:
<<<
{prompt}
>>>

Assistant response:
<<<
{response}
>>>

Reply with a single integer 1-5 only.
```

### A\.4 GPT\-4o Judge: Turn\-Only Prompt

```
Rate the following assistant response on a scale of 1 to 5 given the user prompt.

1 = the response is unhelpful, off-topic, or incorrect.
3 = the response is acceptable but has noticeable problems.
5 = the response fully and correctly addresses the prompt.

User prompt:
<<<
{prompt}
>>>

Assistant response:
<<<
{response}
>>>

Reply with a single integer 1-5 only. No explanation.
```

### A\.5 GPT\-4o Judge: History\-Aware Prompt

> Rate the assistant’s CURRENT response on a scale of 1 to 5, given the full conversation so far\.
>
> 1 = poor: irrelevant, contradictory with the conversation, or factually wrong\.
>
> 3 = acceptable: addresses the current prompt but has weaknesses or minor inconsistencies\.
>
> 5 = excellent: fully addresses the current prompt, consistent with the conversation, factually sound\.
>
> CONVERSATION SO FAR:
>
> \{history\_block\}
>
> The CURRENT response to evaluate is the assistant’s last reply above\.
>
> Reply with a single integer 1\-5 only\. No explanation\.

## Appendix B Implementation Details of the Geometric Contradiction Engine

This appendix provides detailed implementation specifications for the neuro\-symbolic geometric contradiction engine used in SKG\-Eval\. The appendix complements the conceptual presentation in the main paper and is intended to improve reproducibility\.

#### Design philosophy\.

The contradiction engine is designed as a hybrid neuro\-symbolic system in which symbolic contradiction priors operate jointly with embedding\-geometric semantic similarity\. High\-confidence logical conflicts are resolved deterministically through symbolic rules, whereas embedding similarity provides robustness to paraphrase and lexical variation\. This design intentionally prioritizes interpretability and contradiction localization over purely end\-to\-end neural scoring\.

### B\.1 Detector Thresholds

Table [9](https://arxiv.org/html/2605.16650#A2.T9) summarizes the thresholds and constants used throughout the contradiction cascade\. All parameters were fixed prior to experimentation and were not tuned separately for individual benchmarks\.

Table 9: Implementation thresholds used in the geometric contradiction engine\.

### B\.2 Detector Ordering

The contradiction cascade is evaluated in the following order:

1. NegFlip
2. Antonym
3. IntentGate
4. ElabGuard
5. NoiseFloor
6. NumMismatch
7. Exclusive\-Object Conflict
8. Same\-Type Exclusive Conflict
9. Residual Semantic Drift

The ordering reflects contradiction reliability\. High\-confidence symbolic contradictions are evaluated before softer embedding\-geometric inconsistencies\.
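An ordered cascade of this kind can be sketched as a list of stages evaluated in priority order, where the first stage that does not abstain decides the outcome. This is a simplified reading, not the authors' implementation: this appendix does not spell out how the gating stages (IntentGate, ElabGuard, NoiseFloor) behave, so the sketch models every stage uniformly as a function that returns either `None` or a verdict. The two toy stages and their confidence values are placeholders.

```python
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    detector: str
    contradiction: bool
    confidence: float


# A stage inspects (prior_edge, current_edge) and either abstains (None) or decides.
Stage = Callable[[dict, dict], Optional[Verdict]]


def run_cascade(stages, prior_edge, current_edge):
    """Return the verdict of the first stage that does not abstain, else None."""
    for stage in stages:
        verdict = stage(prior_edge, current_edge)
        if verdict is not None:
            return verdict
    return None


# Toy stages for illustration only; they are not the paper's detectors.
def toy_neg_flip(prior, current):
    if prior["negated"] != current["negated"] and prior["obj"] == current["obj"]:
        return Verdict("NegFlip", True, 0.9)  # placeholder confidence
    return None


def toy_residual_drift(prior, current):
    if prior["obj"] != current["obj"]:
        return Verdict("Residual Semantic Drift", True, 0.4)  # placeholder
    return None


STAGES = [toy_neg_flip, toy_residual_drift]  # list order encodes priority

prior = {"obj": "fiction books", "negated": False}
current = {"obj": "fiction books", "negated": True}
result = run_cascade(STAGES, prior, current)
```

Returning at the first non-abstaining stage is what lets a high-confidence symbolic rule pre-empt a softer embedding-based signal on the same edge pair.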

## Appendix C Formal Detector Definitions

### C\.1 Negation Reversal

The NegFlip detector activates when exactly one relation contains a negation marker from the predefined set\. The current contradiction lexicon includes:

$$\{\textit{not}, \textit{never}, \textit{no}, \textit{does not}, \textit{cannot}, \textit{without}\}.$$

Contradiction confidence is assigned only when object similarity exceeds $\theta^{\text{neg}}_{\text{obj}}$\.
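The activation rule can be sketched as below. The marker list is the lexicon given above; everything else is an assumption made for illustration: the function names, whole-word matching, passing object similarity in as a precomputed number, and the default threshold of 0.8, which is a placeholder rather than the value of $\theta^{\text{neg}}_{\text{obj}}$ in Table 9.

```python
import re

# Lexicon from the text; longer phrases are listed first for readability only.
NEGATION_MARKERS = ("does not", "cannot", "without", "never", "not", "no")


def has_negation(relation):
    """True if the relation string contains any marker as a whole word or phrase."""
    text = relation.lower()
    return any(re.search(rf"\b{re.escape(m)}\b", text) for m in NEGATION_MARKERS)


def neg_flip(rel_a, rel_b, object_similarity, theta_obj_neg=0.8):
    """Fire when exactly one relation is negated and the objects still match.

    theta_obj_neg is a placeholder default, not the paper's threshold.
    """
    exactly_one_negated = has_negation(rel_a) != has_negation(rel_b)
    return exactly_one_negated and object_similarity > theta_obj_neg
```

For the pair "I read fiction books every night" and "I do not read fiction books", the relations would be roughly `read` and `do not read` with near-identical objects, so the rule fires. Two negated relations, or one negated relation with unrelated objects, do not trigger it.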

### C\.2 Antonymic Contradiction

The antonym detector uses a curated relation\-opposition lexicon $\mathcal{A}_{\text{ant}}$ containing directional semantic opposites such as:

- increase/decrease
- allow/prevent
- accept/reject
- always/never

The detector activates only when object similarity remains above the minimum semantic overlap threshold\.
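A minimal sketch of this check follows. The four pairs are the examples listed above; the full lexicon is larger and is not reproduced in this appendix. The function names, the crude plural handling, and the `min_overlap` default of 0.5 are assumptions for illustration only.

```python
# Example pairs from the text; the curated lexicon contains more entries.
ANTONYM_PAIRS = [("increase", "decrease"), ("allow", "prevent"),
                 ("accept", "reject"), ("always", "never")]


def _build_index(pairs):
    """Map each word to the set of its listed opposites (both directions)."""
    index = {}
    for a, b in pairs:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


ANTONYMS = _build_index(ANTONYM_PAIRS)


def normalize(token):
    """Map 'increases' to 'increase' when the stripped form is in the lexicon."""
    if token in ANTONYMS:
        return token
    if token.endswith("s") and token[:-1] in ANTONYMS:
        return token[:-1]
    return token


def antonym_conflict(rel_a, rel_b, object_similarity, min_overlap=0.5):
    """Fire when the relations contain listed opposites and the objects overlap.

    min_overlap is a placeholder default, not the paper's threshold.
    """
    tokens_a = {normalize(t) for t in rel_a.lower().split()}
    tokens_b = {normalize(t) for t in rel_b.lower().split()}
    opposed = any(ANTONYMS.get(t, set()) & tokens_b for t in tokens_a)
    return opposed and object_similarity > min_overlap
```

The object-similarity gate matters: "increases soil moisture" and "decreases battery life" use opposite verbs but make claims about different things, so they should not be flagged.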

### C\.3 Numeric Mismatch

Numeric mismatch is detected through symbolic numeric extraction applied to object spans\. Numbers are extracted using regular\-expression matching:

```
\b(\d+(?:\.\d+)?)\b
```

A contradiction is triggered when the numeric values extracted from the two object spans differ under sufficiently aligned relations\.
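The extraction step can be sketched directly with the regular expression given above. The comparison rule is an assumption: the exact trigger condition is not reproduced in this text, so the sketch treats a mismatch as the two spans yielding different multisets of numbers. The function names and the `theta_rel` default of 0.8 are likewise placeholders.

```python
import re

# Pattern taken from the text: integers or decimals delimited by word boundaries.
NUMBER_RE = re.compile(r"\b(\d+(?:\.\d+)?)\b")


def extract_numbers(span):
    """Return all numbers found in an object span as floats."""
    return [float(m) for m in NUMBER_RE.findall(span)]


def numeric_mismatch(obj_a, obj_b, relation_similarity, theta_rel=0.8):
    """Assumed rule: aligned relations whose object spans carry different numbers.

    theta_rel is a placeholder default, not the paper's threshold.
    """
    nums_a, nums_b = extract_numbers(obj_a), extract_numbers(obj_b)
    if not nums_a or not nums_b:
        return False  # nothing to compare, so the detector abstains
    aligned = relation_similarity >= theta_rel
    return aligned and sorted(nums_a) != sorted(nums_b)
```

With "3 dogs" and "1 dog" under the same predicate, the extracted values are 3.0 and 1.0, so the rule fires. It abstains when either span contains no number.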

### C\.4 Exclusive\-Object Conflict

This detector captures contradictions arising from predicates that admit only one valid object assignment\. Examples include:

- favorite color
- birthplace
- capital city
- exact count or quantity

When relation similarity is high but object similarity falls below the exclusivity divergence threshold, the pair is treated as structurally incompatible\.
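The rule in the last sentence reduces to two threshold comparisons, sketched below. Tying the check to the extractor's `property_type` field (`EXCLUSIVE` versus `ADDITIVE`) is an assumption about how exclusivity is signalled. Both threshold defaults are placeholders, not the values from Table 9.

```python
def exclusive_object_conflict(property_type, relation_similarity,
                              object_similarity,
                              theta_rel=0.8, theta_excl=0.4):
    """Fire for single-valued predicates whose objects diverge.

    theta_rel and theta_excl are placeholder defaults for illustration.
    """
    if property_type != "EXCLUSIVE":
        return False  # additive predicates may legitimately take several objects
    return relation_similarity >= theta_rel and object_similarity < theta_excl
```

Under this reading, "favorite color is blue" followed by "favorite color is green" conflicts, whereas "I like blue" followed by "I like green" would be additive and pass.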

### C\.5 Residual Semantic Drift

Residual semantic drift acts as a fallback incompatibility detector when explicit symbolic contradiction cannot be established but semantic divergence remains high under aligned relational structure\. The detector activates only after symbolic and structural contradiction checks abstain\.

## Appendix D Revision\-Aware Memory Filtering

SKG\-Eval distinguishes contradiction from user\-authorized memory updates\. Revision detection operates prior to contradiction evaluation\.

Current revision markers include:

- change
- replace
- update
- switch
- instead

When revision intent is detected, prior conflicting edges are marked with the attribute `user_deprecated`\. Deprecated edges are excluded from contradiction comparison\. This prevents the evaluator from incorrectly penalizing a model for following an explicit user instruction to update the conversation state\.
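A minimal sketch of this filtering step follows. The marker list and the `user_deprecated` attribute come from the text. The rest is assumed for illustration: edges are plain dictionaries, the edges to deprecate are those whose subject equals a given anchor (the text does not say how conflicting edges are selected), and markers are matched as whole words, so inflected forms such as "changed" are not caught.

```python
import re

# Revision markers listed in the text.
REVISION_MARKERS = ("change", "replace", "update", "switch", "instead")
_MARKER_RE = re.compile(r"\b(" + "|".join(REVISION_MARKERS) + r")\b", re.IGNORECASE)


def has_revision_intent(user_text):
    """True if the user turn contains any revision marker as a whole word."""
    return _MARKER_RE.search(user_text) is not None


def apply_revision(history, user_text, anchor):
    """Mark prior edges on `anchor` as user_deprecated when a revision is requested.

    Selecting edges by subject equality is an assumption of this sketch.
    """
    if has_revision_intent(user_text):
        for edge in history:
            if edge["sub"] == anchor:
                edge["user_deprecated"] = True
    return history


def comparable_edges(history):
    """Edges still eligible for contradiction comparison."""
    return [e for e in history if not e.get("user_deprecated", False)]


history = [{"sub": "main dish", "rel": "is", "obj": "tacos"}]
apply_revision(history, "Please change the main dish to salmon.", "main dish")
remaining = comparable_edges(history)
```

After the user's request, the "tacos" edge is deprecated, so a later "the main dish is salmon" has nothing left to contradict.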

## Appendix E Diagnostic Benchmark Sessions

The SKG\-Probe benchmark consists of six mechanism\-targeted diagnostic sessions designed to isolate specific contradiction regimes\.

Table 10: Mechanism\-targeted SKG\-Probe sessions\.

### E\.1 Session 1: Negation Reversal

Example contradiction:

> “I read fiction books every night\.” → “I do not read fiction books\.”

Expected behavior: high\-confidence contradiction\.

### E\.2 Session 2: Antonymic Contradiction

Example contradiction:

> “Watering increases soil moisture\.” → “Watering decreases soil moisture\.”

Expected behavior: directional contradiction\.

### E\.3 Session 3: Numeric Mismatch

Example contradiction:

> “I own 3 dogs\.” → “I own 1 dog\.”

Expected behavior: numeric inconsistency under the same predicate\.

### E\.4 Session 4: Moderate Semantic Drift

Example:

> “My favorite color is blue\.” → “My favorite color is green\.”

Expected behavior: moderate semantic conflict\.

### E\.5 Session 5: Strong Semantic Drift

Example:

> “I wash dirty dishes\.” → “I sweep the kitchen floor\.”

Expected behavior: strong semantic divergence\.

### E\.6 Session 6: Revision\-Aware Update

Target mechanism: revision filtering\.

Example:

> “The main dish is tacos\.” → “Please change the main dish to salmon\.”

Expected behavior: no contradiction should be triggered\.

## Appendix F Additional Failure Analysis

We observe five primary failure categories:

1. Triple extraction errors
2. Entity\-linking ambiguity
3. Missing antonym coverage
4. Over\-fragmented semantic graphs
5. False semantic drift penalties

#### Most common failure source\.

The dominant source of observed failure is imperfect or ambiguous upstream semantic extraction rather than instability of the contradiction cascade itself\. In particular, extractor fragmentation occasionally produces semantically incomplete triples, which may reduce relation alignment quality and suppress downstream contradiction activation\.

## Appendix G Computational Complexity

For a candidate node $u$, contradiction evaluation requires pairwise comparison between $\mathcal{E}^{\text{cur}}(u)$ and $\widetilde{\mathcal{E}}^{\text{hist}}(u)$\. The resulting complexity is:

$$\mathcal{O}\left(|\mathcal{C}_t|\bar{C}\bar{H}\right),$$

where $\bar{C}=\mathbb{E}[|\mathcal{E}^{\text{cur}}(u)|]$ and $\bar{H}=\mathbb{E}[|\widetilde{\mathcal{E}}^{\text{hist}}(u)|]$\.

Since both remain small in practice, runtime exhibits near\-linear empirical scaling with dialogue length due to bounded edge cardinality per semantic anchor\.
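To make the near-linear claim concrete, the following short derivation sums the bound over a session of $T$ turns. It rests on two assumptions that this appendix does not state: that the bound above is the cost of a single turn $t$, and that uniform constants $m$, $c$, and $h$ (hypothetical symbols introduced only for this sketch) bound $|\mathcal{C}_t|$, $\bar{C}$, and $\bar{H}$ at every turn.

```latex
\sum_{t=1}^{T} \mathcal{O}\left(|\mathcal{C}_t|\,\bar{C}\,\bar{H}\right)
\;\le\; \sum_{t=1}^{T} \mathcal{O}(m\,c\,h)
\;=\; \mathcal{O}(T\,m\,c\,h)
\;=\; \mathcal{O}(T) \quad \text{for fixed } m,\ c,\ h.
```

If the number of historical edges per anchor instead grew in proportion to $t$, the same sum would be $\mathcal{O}(T^2)$, which is why bounded edge cardinality per anchor is the condition that matters.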

## Appendix H Determinism and Reproducibility

All extraction and evaluation prompts were executed using deterministic decoding with temperature 0\. The contradiction engine itself is fully deterministic after graph construction, ensuring reproducible evaluation under identical extraction outputs\.

Given fixed inputs, fixed extractor outputs, fixed embedding model, and fixed threshold values, SKG\-Eval produces identical turn\-level and session\-level scores across repeated runs\.

## Appendix I Human Annotation Protocol

Human annotators evaluated sessions using a 1–5 Likert scale based on:

- local relevance,
- historical consistency,
- logical coherence,
- contradiction severity,
- conversational usefulness\.

Annotators were instructed to evaluate conversations globally rather than independently per turn, with emphasis on long\-horizon consistency, contradiction severity, and whether later responses respected previously established conversational commitments\.

Each session was independently scored by three annotators\. Final scores were obtained through averaging\. Inter\-annotator agreement was measured using Cohen’s $\kappa$\.
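The aggregation can be sketched as follows. Cohen's $\kappa$ is defined for two raters, and the text does not say how it was applied to three annotators. Averaging $\kappa$ over all annotator pairs, as done here, is one common convention and is an assumption of this sketch, as is the use of unweighted $\kappa$ on Likert labels. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average kappa over all annotator pairs (an assumed convention)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def session_scores(ratings):
    """Final per-session score: mean over annotators, as described in the text."""
    return [sum(col) / len(col) for col in zip(*ratings)]
```

Here `ratings` holds one list per annotator, with one 1-5 score per session in a shared session order.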