Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction

arXiv cs.CL Papers

Summary

This paper introduces Bipredictability (P), a signal computed from token frequency statistics, and the Information Digital Twin (IDT), a lightweight auxiliary architecture that uses it to monitor conversational consistency in multi-turn LLM interactions without embeddings or access to model internals. Across 4,574 conversational turns, the IDT detected all tested contradictions, topic shifts and non-sequiturs (100% sensitivity), while P aligned with structural consistency in 85% of conditions but with semantic quality in only 44%.

arXiv:2604.13061v2

Abstract: Large language models (LLMs) are increasingly deployed in multi-turn settings where earlier responses shape later ones, making reliability dependent on whether a conversation remains consistent over time. When this consistency degrades undetected, downstream decisions lose their grounding in the exchange that produced them. Yet current evaluation methods assess isolated outputs rather than the interaction producing them. Here we show that conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators, or access to model internals. We formalize this signal as Bipredictability (P), which measures shared predictability across the context-response-next-prompt loop relative to the turn's total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin (IDT). Across 4,574 conversational turns spanning 34 conditions, one student model and three frontier teacher models, P established a stable runtime baseline, aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent, and the IDT detected all tested contradictions, topic shifts and non-sequiturs with 100 percent sensitivity. These results show that reliability in extended LLM interaction cannot be reduced to response quality alone, and that structural monitoring from the observable token stream can complement semantic evaluation in deployment.
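The abstract does not state the formula for P, so the following is only a rough sketch of the general idea of scoring a turn from token frequencies alone. It is not the paper's Bipredictability. As a stand-in it uses one minus the normalized Jensen-Shannon divergence between the unigram distributions of the context, response and next prompt. The names `token_dist`, `shared_score` and `flag_drift`, the whitespace tokenizer, and the `warmup` and `k` parameters are all assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def token_dist(text):
    """Unigram relative frequencies; whitespace tokenization is an assumption."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no tokens")
    return {tok: n / total for tok, n in counts.items()}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def shared_score(context, response, next_prompt):
    """Assumed proxy, NOT the paper's P: 1 - normalized Jensen-Shannon divergence.

    Returns about 1.0 when the three segments have the same token distribution
    and about 0.0 when their vocabularies are disjoint.
    """
    dists = [token_dist(t) for t in (context, response, next_prompt)]
    vocab = set().union(*dists)
    mixture = {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}
    jsd = entropy(mixture) - sum(entropy(d) for d in dists) / len(dists)
    return 1.0 - jsd / math.log2(len(dists))


def flag_drift(scores, warmup=3, k=2.0):
    """Indices of turns scoring more than k std devs below the warm-up baseline."""
    base = scores[:warmup]
    threshold = mean(base) - k * pstdev(base)
    return [i for i, s in enumerate(scores) if i >= warmup and s < threshold]


# Hypothetical usage: score one turn, then check a series of per-turn scores.
score = shared_score(
    "how do I reset my router password",
    "to reset the router password hold the reset button",
    "where is the reset button on my router",
)
print(round(score, 3))
print(flag_drift([0.80, 0.82, 0.81, 0.80, 0.30]))
```

A score like this only captures vocabulary overlap within a turn. The baseline-and-threshold step mirrors the abstract's notion of a stable runtime baseline in spirit, but the thresholding rule shown here is a generic choice, not the IDT's detection logic.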

Cached at: 04/20/26, 08:32 AM

# Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Source: https://arxiv.org/abs/2604.13061
[View PDF](https://arxiv.org/pdf/2604.13061)


## Submission history

From: Wael Hafez

- **[v1](https://arxiv.org/abs/2604.13061v1)** Wed, 18 Mar 2026 18:10:37 UTC (607 KB)
- **[v2]** Fri, 17 Apr 2026 15:43:33 UTC (613 KB)

Similar Articles

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

arXiv cs.AI

This paper investigates whether different LLMs share common inference patterns when predicting the same token, using interaction-based explanations. Results show that advanced LLMs exhibit consistent interaction patterns, suggesting implicit optimization toward shared inference mechanisms.

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

arXiv cs.CL

This paper introduces Found in Conversation (FiC), a training framework using View-Asymmetric Self-Distillation to close the multi-turn performance gap in LLMs. The method teaches models to recover single-turn competence from underspecified multi-turn prompts, achieving 92-100% recovery across model families and sizes.