Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

This paper introduces Bipredictability (P) and the Information Digital Twin (IDT), a lightweight method for monitoring conversational consistency in multi-turn LLM interactions using only token frequency statistics, with no embeddings, auxiliary evaluators, or access to model internals. Across 4,574 turns and 34 conditions, the IDT detected all tested contradictions, topic shifts, and non-sequiturs (100 percent sensitivity), and P aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent.

Abstract (arXiv:2604.13061v2): Large language models (LLMs) are increasingly deployed in multi-turn settings where earlier responses shape later ones, making reliability dependent on whether a conversation remains consistent over time. When this consistency degrades undetected, downstream decisions lose their grounding in the exchange that produced them. Yet current evaluation methods assess isolated outputs rather than the interaction producing them. Here we show that conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators, or access to model internals. We formalize this signal as Bipredictability (P), which measures shared predictability across the loop of context, response, and next prompt, relative to the turn's total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin (IDT). Across 4,574 conversational turns spanning 34 conditions, one student model, and three frontier teacher models, P established a stable runtime baseline and aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent, and the IDT detected all tested contradictions, topic shifts, and non-sequiturs with 100 percent sensitivity. These results show that reliability in extended LLM interaction cannot be reduced to response quality alone, and that structural monitoring from the observable token stream can complement semantic evaluation in deployment.
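
The abstract does not state the formula for P, so the following Python sketch is only an illustration of the general idea: scoring one turn from token counts alone. It is not the paper's definition of Bipredictability. The function `overlap_score` and its normalization are assumptions made for this example. It treats the segment label S (context, response, or next prompt) and the token T as random variables, and returns 1 - I(S;T) / H(S), which is near 1 when the three segments draw on the same token distribution and near 0 when their vocabularies are disjoint.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a Counter of non-negative counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def overlap_score(context, response, next_prompt):
    """Hypothetical proxy (NOT the paper's P): 1 - I(S;T) / H(S).

    S is the segment a token came from, T is the token itself.
    Each argument is a list of tokens for one segment of the turn.
    """
    segments = [Counter(context), Counter(response), Counter(next_prompt)]
    sizes = Counter({i: sum(seg.values()) for i, seg in enumerate(segments)})
    h_source = entropy(sizes)
    if h_source == 0:
        return 1.0  # fewer than two non-empty segments: nothing to compare
    total = sum(sizes.values())
    pooled = sum(segments, Counter())  # token counts over the whole turn
    # H(T|S): size-weighted average of the per-segment token entropies
    h_cond = sum(sizes[i] / total * entropy(seg) for i, seg in enumerate(segments))
    mutual_info = entropy(pooled) - h_cond  # I(S;T) = H(T) - H(T|S)
    return 1.0 - mutual_info / h_source


on_topic = overlap_score(
    "the cat sat on the mat".split(), "the cat sat".split(), "the mat".split()
)
drifted = overlap_score(
    "the cat sat on the mat".split(), "the cat sat".split(), "stock prices fell".split()
)
print(on_topic > drifted)
```

In a monitoring setting, a score like this would be tracked per turn against a running baseline, with a sharp drop flagged for review. The thresholds and baseline procedure used by the IDT are not given in the abstract and are not reproduced here.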

# Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Source: https://arxiv.org/abs/2604.13061
View PDF (https://arxiv.org/pdf/2604.13061)

## Submission history

From: Wael Hafez

**[v1](https://arxiv.org/abs/2604.13061v1)** Wed, 18 Mar 2026 18:10:37 UTC (607 KB)

**[v2]** Fri, 17 Apr 2026 15:43:33 UTC (613 KB)

Similar Articles

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild
