Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Summary
This paper introduces Bipredictability (P), a measure of conversational structural consistency computed from token frequency statistics, and the Information Digital Twin (IDT), a lightweight auxiliary architecture that implements it. Together they monitor multi-turn LLM interactions without embeddings, auxiliary evaluators, or access to model internals. Across 4,574 conversational turns and 34 conditions, the IDT detected all tested contradictions, topic shifts, and non-sequiturs (100% sensitivity), while P aligned with structural consistency in 85% of conditions but with semantic quality in only 44%.
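To make the idea of a token-frequency signal concrete, here is a minimal sketch. It is not the paper's definition of P, which this page does not give. It assumes whitespace tokenization and uses a stand-in score: one minus the generalized Jensen-Shannon divergence between the token distributions of the context, response, and next prompt, normalized by the entropy of the segment sizes. The names `entropy` and `shared_fraction` are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a table of non-negative counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def shared_fraction(context, response, next_prompt):
    """Stand-in overlap score in [0, 1]; NOT the paper's formula for P.

    Assumption: tokens are lowercase whitespace-separated words.
    1.0 means the three segments have identical token distributions,
    0.0 means their vocabularies are completely disjoint.
    """
    segments = [Counter(s.lower().split()) for s in (context, response, next_prompt)]
    sizes = [sum(c.values()) for c in segments]
    total = sum(sizes)
    if total == 0:
        raise ValueError("all three segments are empty")

    # Uncertainty of the whole turn: entropy of the pooled token counts.
    h_pooled = entropy(sum(segments, Counter()))
    # Size-weighted average uncertainty inside each segment.
    h_within = sum(n / total * entropy(c) for n, c in zip(sizes, segments))
    # Generalized Jensen-Shannon divergence = I(token; segment).
    divergence = h_pooled - h_within
    # Upper bound on that divergence: entropy of the segment sizes.
    h_segment = entropy(Counter(dict(enumerate(sizes))))
    if h_segment == 0:
        return 1.0  # only one non-empty segment, nothing to compare
    return min(1.0, max(0.0, 1.0 - divergence / h_segment))


if __name__ == "__main__":
    ctx = "the cat sat on the mat"
    rsp = "the cat likes the mat"
    print(shared_fraction(ctx, rsp, "does the cat like the mat"))
    print(shared_fraction(ctx, rsp, "quarterly revenue forecasts exceeded analyst expectations"))
```

In this toy case the on-topic follow-up scores higher than the off-topic one. A runtime monitor in the spirit of the IDT could track such a score per turn and flag departures from its baseline, though the thresholds and the exact statistic would have to come from the paper itself.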
Cached at: 04/20/26, 08:32 AM
# Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction

Source: https://arxiv.org/abs/2604.13061

View PDF (https://arxiv.org/pdf/2604.13061)

> Abstract: Large language models, LLMs, are increasingly deployed in multiturn settings where earlier responses shape later ones, making reliability dependent on whether a conversation remains consistent over time. When this consistency degrades undetected, downstream decisions lose their grounding in the exchange that produced them. Yet current evaluation methods assess isolated outputs rather than the interaction producing them. Here we show that conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators or access to model internals. We formalize this signal as Bipredictability, P, which measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin, IDT. Across 4,574 conversational turns spanning 34 conditions, one student model and three frontier teacher models, P established a stable runtime baseline, aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent, and the IDT detected all tested contradictions, topic shifts and non-sequiturs with 100 percent sensitivity. These results show that reliability in extended LLM interaction cannot be reduced to response quality alone, and that structural monitoring from the observable token stream can complement semantic evaluation in deployment.

## Submission history

From: Wael Hafez

**[v1](https://arxiv.org/abs/2604.13061v1)** Wed, 18 Mar 2026 18:10:37 UTC (607 KB)

**[v2]** Fri, 17 Apr 2026 15:43:33 UTC (613 KB)
Similar Articles
Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning
Introduces Independent Combinatorial Tokens (ICT) framework that uses Jensen-Shannon divergence between token logit distributions to identify critical branching points, preventing entropy collapse and explosion in RLVR for LLM reasoning. Achieves up to 14.9% pass@4 improvement on Qwen models.
Cross-LLM Consistency in Inference: Evidence from Shared Interactions
This paper investigates whether different LLMs share common inference patterns when predicting the same token, using interaction-based explanations. Results show that advanced LLMs exhibit consistent interaction patterns, suggesting implicit optimization toward shared inference mechanisms.
SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
The paper introduces SeDT, a training-free inference-time method that improves LLM reliability in multi-turn conversations by annotating conversation history with cumulative relevance scores from three signals, achieving up to +37.7% performance gains on the Lost-in-Conversation benchmark.
Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap
This paper introduces Found in Conversation (FiC), a training framework using View-Asymmetric Self-Distillation to close the multi-turn performance gap in LLMs. The method teaches models to recover single-turn competence from underspecified multi-turn prompts, achieving 92-100% recovery across model families and sizes.
Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild
This paper analyzes longitudinal conversational trajectories of Bing Copilot users and compares them with WildChat data, finding that individual user habits are sticky and that WildChat overrepresents power users, challenging static views of user-LLM interactions.