CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
Summary
This paper introduces CIG (Conversational Information Gain), a framework for measuring how utterances advance collective understanding in deliberative dialogues by tracking evolving semantic memory and scoring utterances on novelty, relevance, and implication scope. The authors demonstrate that memory-derived dynamics correlate better with human-perceived dialogue quality than traditional heuristics and develop LLM-based predictors for information-focused conversation analysis.
# CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
Source: https://arxiv.org/html/2604.15647
Jey Han Lau, Lea Frermann
School of Computing and Information Systems, The University of Melbourne
{mingbin,laujh,lfrermann}@unimelb.edu.au
## Abstract
Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize CIG, we model an evolving semantic memory of the discussion: the system extracts atomic claims from utterances and incrementally consolidates them into a structured memory state. Using this memory, we score each utterance along three interpretable dimensions: Novelty, Relevance, and Implication Scope. We annotate 80 segments from two moderated deliberative settings (TV debates and community discussions) with these dimensions and show that memory-derived dynamics (e.g., the number of claim updates) correlate more strongly with human-perceived CIG than traditional heuristics such as utterance length or TF–IDF. We also develop effective LLM-based CIG predictors, paving the way for information-focused analysis of conversation quality and deliberative success. The code and annotation data are available at https://github.com/mrknight21/memcig-analysis/tree/master/data.
## 1 Introduction

**Figure 1:** Overview of the CIG pipeline. Each utterance is evaluated with the Semantic Memory as knowledge context for Novelty, Relevance, Implication Scope, and overall CIG (1–4). The Semantic Memory is maintained through two modules: Extraction, which converts utterances into atomic claims; and Consolidation, which matches extracted claims against the retrieved memory, triggering ADD, UPDATE, or NONE operations.
Public deliberation, meaning reasoned dialogue aimed at collective understanding and decision-making in the public interest, is fundamental to democratic societies. Yet the quality of these exchanges, from community forums to public debates, varies widely, from stagnation to productive collaboration. Metrics for evaluating dialogue quality have largely focused on structural formality over substantive content. Approaches based on schemes like the Discourse Quality Index (DQI) or computational proxies for civility and argument structure struggle to capture true informational progress. These surface-level signals can mislead: civil phrasing can conceal malicious intent, and valuable insights might emerge from informal exchanges. Such methods fail to distinguish constructive progress from bureaucratic talk.
Grounded in Meadow and Yuan's notion of information impact as "A change, or the nature or magnitude of change, in the knowledge base of a subject domain of the recipient," we define Conversational Information Gain (CIG) as the degree to which an utterance advances collective understanding toward the goal/topic. We decompose CIG into three interpretable aspects—Novelty, Relevance, and Implication Scope—capturing whether a contribution introduces new information, connects to the shared goal, and extends its implications to the public community beyond individual cases.
To operationalize CIG, we require a representation of the evolving collective knowledge state against which each new utterance can be evaluated: both annotators and models must know what has already been said. As shown in Figure 1, we implement this through a lightweight semantic memory that maintains a consolidated set of claims. We first validate CIG through human annotation of 80 dialogue segments drawn from two moderated group-discussion settings—TV debates and community discussions—achieving moderate-to-high inter-annotator agreement for CIG and its aspects. We then validate automation by using an LLM with access to the same information as the human annotators (topic, short context, and a prior-memory summary), and show that LLM predictions closely track aggregated human judgments. Finally, by varying the model's prior context, we find that predictions based on retrieved memory summaries are highly correlated with those based on the full preceding transcript, indicating that memory summaries provide a compact yet faithful substitute for full-history context in automated CIG assessment.
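The Consolidation step described above can be illustrated with a toy sketch. It assumes claims have already been extracted as short strings, and it substitutes a crude token-overlap (Jaccard) similarity for whatever retrieval and LLM-based matching the actual system uses. The class name `SemanticMemory`, the `update_threshold` value, and the matching rule are all hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a semantic encoder."""
    return {w.strip(".,;:!?\"'").lower() for w in text.split()} - {""}


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; assumed here in place of semantic matching."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class SemanticMemory:
    update_threshold: float = 0.5  # hypothetical cut-off for "same claim, new detail"
    claims: list[str] = field(default_factory=list)
    log: list[str] = field(default_factory=list)

    def consolidate(self, claim: str) -> str:
        """Match one extracted claim against memory and apply ADD, UPDATE, or NONE."""
        best_idx, best_sim = -1, 0.0
        for i, stored in enumerate(self.claims):
            sim = jaccard(claim, stored)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_sim == 1.0:
            op = "NONE"                      # already known, memory unchanged
        elif best_sim >= self.update_threshold:
            self.claims[best_idx] = claim    # refine the matched claim
            op = "UPDATE"
        else:
            self.claims.append(claim)        # no close match, store as new
            op = "ADD"
        self.log.append(op)
        return op


memory = SemanticMemory()
for claim in [
    "Rents rose sharply last year",
    "Rents rose sharply last year",
    "Rents rose sharply last year in the city centre",
    "Zoning rules block new housing",
]:
    memory.consolidate(claim)
print(memory.log)  # ['ADD', 'NONE', 'UPDATE', 'ADD']
```

Counting the `ADD` and `UPDATE` entries in `log` per utterance gives the kind of simple memory-state signal that can then be compared with human ratings.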
We then analyze how memory dynamics relate to perceived informational progress, finding that simple memory-state signals (e.g., claim update counts) track human CIG ratings more consistently than common heuristics such as sentence entropy or TF–IDF. An unsupervised aggregation analysis reveals a "conjunctive bottleneck": an utterance's CIG is effectively limited by its weakest aspect. Finally, we provide a case study illustrating how CIG can be used to analyze downstream interaction dynamics in moderated discussions.
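To make the "conjunctive bottleneck" concrete, the sketch below contrasts two ways of combining the three aspect scores. Taking the minimum is the simplest idealization of "limited by the weakest aspect"; it is an assumption made for illustration and not the aggregation model fitted in the paper. The function names and the dictionary keys are hypothetical.

```python
ASPECTS = ("novelty", "relevance", "implication_scope")


def bottleneck_cig(scores: dict[str, int]) -> int:
    """Idealized conjunctive aggregation: overall gain equals the weakest aspect."""
    for aspect in ASPECTS:
        value = scores[aspect]
        if not 1 <= value <= 4:
            raise ValueError(f"{aspect} must be in 1..4, got {value}")
    return min(scores[a] for a in ASPECTS)


def mean_cig(scores: dict[str, int]) -> float:
    """Compensatory alternative: a strong aspect can offset a weak one."""
    return sum(scores[a] for a in ASPECTS) / len(ASPECTS)


# A highly novel, principle-level remark that is off-topic.
off_topic = {"novelty": 4, "relevance": 1, "implication_scope": 4}
print(bottleneck_cig(off_topic), mean_cig(off_topic))  # prints: 1 3.0
```

Under the averaging rule the off-topic remark looks fairly informative, whereas under the bottleneck rule its low relevance caps the overall score.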
## 2 Related Work
### Measuring Informativeness in Conversation
While definitions of informativeness vary across disciplines, they share a common conceptual core: the change a message induces in the recipient, whether in certainty, utility, or knowledge state. In particular, dialogue evaluation has largely converged on two necessary conditions for triggering this change: novelty (the presence of new signal) and relevance (the alignment of that signal with the goal context). However, these two dimensions alone are insufficient to capture the magnitude or worthiness of a contribution. While metrics like "impact" or "usefulness" attempt to proxy this third dimension, they are often vaguely defined or rely on subjective Likert scales that conflate personal preference with collective value.
A complementary line of work extends classical information theory, leveraging surprisal or information-density to trace information flow. However, such measures only partially align with human-perceived salience, limiting their precision in localizing information exchanges.
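As a rough illustration of a surprisal-style measure, the sketch below scores an utterance by its mean per-token surprisal, $-\log_2 p(w)$, under a unigram model estimated from the preceding utterances. The add-one smoothing, the whitespace tokenization, and the function name `unigram_surprisal` are assumptions made for brevity; the work referred to above typically relies on much stronger language models.

```python
import math
from collections import Counter


def unigram_surprisal(utterance: str, history: list[str]) -> float:
    """Mean surprisal in bits per token under an add-one-smoothed unigram model."""
    tokens = utterance.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(w for u in history for w in u.lower().split())
    vocab = set(counts) | set(tokens)          # include unseen words in the vocabulary
    total = sum(counts.values()) + len(vocab)  # add-one smoothing denominator
    return sum(-math.log2((counts[w] + 1) / total) for w in tokens) / len(tokens)


history = ["housing costs are rising", "housing costs are rising fast"]
repeated = unigram_surprisal("housing costs are rising", history)
fresh = unigram_surprisal("zoning reform matters", history)
print(fresh > repeated)  # True
```

A measure like this reacts to unfamiliar wording rather than to new claims, which is one reason it may only partially align with perceived salience.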
### Agent Semantic Memory
Agent memory modules provide a way to measure knowledge acquisition during conversation. Originally developed to sustain coherence and personalization in conversational agents over long conversation sessions, these modules function by maintaining a selective, persistent state. In frameworks such as Mem0, salient claims are extracted from utterances and an LLM applies update operations to a knowledge store. This approach improves temporal and multi-hop reasoning on long-dialogue benchmarks and reduces latency and token cost relative to full-history baselines.
| Aspect | Label | Definition / Anchor |
|--------|-------|-------------------|
| **Conversational Information Gain** | 1. No gain | Repeats or obstructs; no meaningful advance beyond the existing knowledge. |
| | 2. Minimal gain | Small clarification or slight nuance that is noticeable but limited. |
| | 3. Incremental | Adds new details/mechanisms within the same conceptual frame or ideas within the topic. |
| | 4. Insightful | Reframes or introduces new ideas under the topic; shifts the conversation in a new valuable way. |
| **Novelty** | 1. Not novel | Repetition/paraphrase of prior content or non-common-sense content. |
| | 2. Minimally novel | Minor or mostly predictable detail added to an existing idea. |
| | 3. Moderately novel | New evidence, concrete example, or supporting detail expanding an existing idea. |
| | 4. Highly novel | New framework, principle, idea, or line of reasoning that opens a new direction. |
| **Relevance** | 1. Not relevant | Off-topic; no connection to the conversation goal. |
| | 2. Minimally relevant | Loose or indirect link; requires inference to connect. |
| | 3. Moderately relevant | Substantially related but not central (e.g., side issue or counterpoint). |
| | 4. Highly relevant | Directly and explicitly addresses the core topic or goal. |
| **Implication Scope** | 1. Local | Manages the immediate moment; implication limited to participants/procedures. |
| | 2. Bounded | Self-contained fact, feeling, or stance; no generalization beyond the case. |
| | 3. Generalizing | Inductively generalizes a case or evidence to a broader audience. |
| | 4. Universal | States an abstract principle, value, or norm with wide or universal applicability. |
**Table 1:** Conversational Information Gain (CIG) rubric. Each aspect is rated on a 1–4 scale using Prior Knowledge and the preceding dialogue as the collective knowledge context.
## 3 Conversational Information Gain (CIG)
Following previous studies in group discussion analysis, we define Conversational Information Gain (CIG) as how much a response advances the group's shared understanding of the topic or progress toward the goal, given both prior knowledge and the preceding dialogue. We decompose CIG into three aspects—Novelty, Relevance, and Implication Scope—to connect general criteria for informativeness to the demands of public deliberation. Novelty and Relevance capture whether a contribution introduces new information and is tied to the discussion goal, while Implication Scope captures how broadly the contribution's implications extend beyond the immediate case and people, reflecting the public orientation of deliberation. All constructs are rated on a four-level scale (Table 1).
Importantly, CIG reflects informational advancement rather than conversational quality per se: reiterations or coordination moves typically score low on CIG, even though they may still support the discussion indirectly. To anchor high CIG levels, we draw on Chi's (2009) typology of conceptual change, which distinguishes assimilation (adding information within an existing mental model) from accommodation (restructuring the model itself). Accordingly, we separate our top two CIG levels: Level 3 (Incremental) reflects assimilation, where an utterance adds evidence, details, or mechanisms within the current framing; Level 4 (Insightful) reflects accommodation, where an utterance introduces a new framing or principle that qualitatively redirects the discussion. Detailed definitions for each aspect are in Table 1, and examples in Appendix Table 12.
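For readers who want to work with the rubric programmatically, the level labels from Table 1 can be held in a small lookup structure. The sketch below is one possible encoding; the dictionary name `RUBRIC`, the aspect keys, and the helper `label_for` are hypothetical and are not taken from the released code.

```python
# Level labels copied from Table 1; index 0 corresponds to score 1.
RUBRIC = {
    "cig": ("No gain", "Minimal gain", "Incremental", "Insightful"),
    "novelty": ("Not novel", "Minimally novel", "Moderately novel", "Highly novel"),
    "relevance": ("Not relevant", "Minimally relevant", "Moderately relevant", "Highly relevant"),
    "implication_scope": ("Local", "Bounded", "Generalizing", "Universal"),
}


def label_for(aspect: str, score: int) -> str:
    """Return the rubric label for a 1-4 score, rejecting out-of-range values."""
    labels = RUBRIC[aspect]
    if not 1 <= score <= len(labels):
        raise ValueError(f"{aspect} score must be in 1..{len(labels)}, got {score}")
    return labels[score - 1]


print(label_for("cig", 3))                # Incremental
print(label_for("implication_scope", 4))  # Universal
```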
### Novelty
Assesses whether the information is new compared to the prior knowledge and preceding dialogue. Traditional approaches often proxy novelty with n-gram overlap against prior text, but recent work shows that lexical novelty correlates poorly with human perception. Consequently, we rate novelty on a four-level scale parallel to that of CIG, with levels reflecting the message's effect on the existing conceptual framing. The novelty score is intentionally independent of topical alignment and magnitude of impact.
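The weakness of the n-gram proxy is easy to see in a small example. The sketch below computes the fraction of an utterance's bigrams that do not occur in the prior text; the function names and the whitespace tokenization are assumptions for illustration, and this is a generic overlap heuristic rather than a specific baseline from the paper.

```python
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lexical_novelty(utterance: str, prior: list[str], n: int = 2) -> float:
    """Fraction of the utterance's n-grams that never appear in the prior text."""
    new = ngrams(utterance, n)
    if not new:
        return 0.0
    seen = set().union(*(ngrams(p, n) for p in prior))
    return len(new - seen) / len(new)


prior = ["rents are too high"]
print(lexical_novelty("rents are too high", prior))            # 0.0
print(lexical_novelty("housing is far too expensive", prior))  # 1.0
```

The second utterance paraphrases the first, so a human rater following Table 1 would likely call it not novel, yet the lexical proxy assigns it the maximum score.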
### Relevance
Measures how substantively a message relates to the main conversation topic or goal. Since lexical-overlap proxies correlate weakly with human judgments in context-sensitive and implicit dialogue, we adopt the four-level topic–conversation relevance scale. Levels 1 (off-topic) and 4 (directly on-topic) form the endpoints; the key distinction is between the middle tiers. Level 2 (minimally relevant) covers content whose connection to the goal is indirect and requires a bridging inference (e.g., "school choice" for topic "housing affordability"), and Level 3 covers content that is clearly connected but not central, typically a recognized subtopic (e.g., "zoning restrictions").
### Implication Scope
Measures the intended reach of a statement: who it is meant to matter to and how far its implications generalize. Many dialogue-evaluation schemes include an "impact" dimension. In public deliberation, however, perceived impact is subjective and highly context-dependent. We therefore operationalize this idea, drawing on the notion of "generality" in Meadow and Yuan (1997), as the scope of implication a contribution projects beyond the immediate speaker and case. This is deliberation-motivated: participants often move between situated testimony (what happened to me/us here) and public reasons (what should matter to the community), and this "lifting" from particulars to publicly addressable concerns is commonly treated as central to deliberative legitimacy. Our rubric has four levels. Level 1 covers local or procedural moves whose significance is confined to the immediate interaction (e.g., turn management). Level 2 captures bounded, case-specific content (e.g., a personal fact or stance) that does not generalize beyond the case. Level 3 reflects inductive generalization, where a specific experience or piece of evidence is used to motivate a broader pattern, such as testimony offered to evidence social harm. Level 4 denotes universal, principle-level claims framed for wide public applicability (e.g., rights, fairness, and so on).