CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling

arXiv cs.CL Papers

Summary

CobwebTM is a low-parameter lifelong hierarchical topic modeling approach that adapts the Cobweb algorithm to continuous document embeddings, enabling unsupervised topic discovery and dynamic hierarchical organization without predefining topic counts. The method combines incremental symbolic concept formation with pretrained representations to achieve strong topic coherence while avoiding catastrophic forgetting.

arXiv:2604.14489v2

# CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling
Source: https://arxiv.org/html/2604.14489
Karthik Singaravadivelan, Anant Gupta, Zekun Wang, Christopher J\. MacLellan

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

\{ksingara3,agupta886,zwang910\}@gatech\.edu

###### Abstract

Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision\. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data\. We introduce CobwebTM, a low\-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation\. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics\. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high\-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling\.

Code available at https://github.com/Teachable-AI-Lab/cobweb-language-embedding

Karthik Singaravadivelan and Anant Gupta contributed equally\.

## 1 Introduction

Figure 1: A visualization of three levels of the hierarchy induced by CobwebTM\. Panels: \(a\) subtree from the 20 Newsgroups dataset; \(b\) subtree from the AG News dataset\. For each node, we display the top five representative words extracted using the c\-TF\-IDF procedure described in Section 3\.2\.1 (https://arxiv.org/html/2604.14489#S3.SS2.SSS1)\. Words that appear in multiple nodes at the same level are underlined to highlight shared semantic content across sibling topics\.

Topic modeling seeks to uncover latent semantic structure in large document collections by grouping text into coherent topics\. It is a fundamental tool for document organization, corpus exploration, and information retrieval, particularly in settings where labeled data is unavailable\. As modern text corpora grow in scale, diversity, and temporal span, effective topic modeling increasingly requires methods that support unsupervised topic discovery, adapt to streaming data, and represent topics at multiple levels of abstraction\.

Early work in topic modeling was dominated by probabilistic generative models, most notably Latent Dirichlet Allocation (LDA) (Blei et al\., 2003b (https://arxiv.org/html/2604.14489#bib.bib42))\. While influential, LDA requires the number of topics to be specified in advance, assumes independence between topics, and relies on bag\-of\-words representations that ignore semantic similarity between words\. These assumptions limit its ability to model imbalanced, correlated, or evolving topics, making it poorly suited for lifelong or streaming settings\.

Recent advances in representation learning have led to neural topic models that leverage dense document embeddings (Zheng et al\., 2013 (https://arxiv.org/html/2604.14489#bib.bib32); Wu et al\., 2024b (https://arxiv.org/html/2604.14489#bib.bib31))\. These approaches often achieve improved topic coherence and richer semantic representations, but at the cost of increased complexity\. Neural topic models are typically highly parameterized, sensitive to hyperparameter choices, and trained in batch settings that assume access to the full corpus\. Consequently, they struggle in lifelong learning scenarios where data arrives incrementally and topic structure must evolve over time\. Moreover, neural architectures are prone to catastrophic forgetting, causing previously learned topics to degrade as new data is introduced\.

Lifelong topic modeling addresses these challenges by updating topics incrementally as new documents arrive\. Methods such as Online LDA (Hoffman et al\., 2010 (https://arxiv.org/html/2604.14489#bib.bib44)) and neural lifelong topic models mitigate some scalability issues but retain key limitations, including fixed topic capacity, limited topic restructuring, and reliance on corpus\-specific training\. More recent embedding\-based pipelines replace static clustering with incremental clustering algorithms, yet these methods remain sensitive to parameter choices and typically lack principled mechanisms for organizing topics at multiple levels of abstraction\.

In practice, however, topic structure is inherently hierarchical: broad themes naturally decompose into progressively finer subtopics\. Capturing such hierarchical organization improves interpretability and allows models to represent semantic relationships between topics rather than treating them as independent clusters\. Consequently, hierarchical topic models have been widely explored in both probabilistic and neural frameworks (Blei et al\., 2003a (https://arxiv.org/html/2604.14489#bib.bib1); Koltsov et al\., 2021 (https://arxiv.org/html/2604.14489#bib.bib40))\. These approaches aim to learn topic trees that capture varying levels of abstraction within a corpus\.

Despite their promise, many hierarchical topic models rely on fixed\-depth latent structures or require batch training over the full corpus, limiting their applicability in dynamic or streaming environments\. In many modern systems, hierarchy is therefore imposed post hoc after flat topic discovery, rather than learned incrementally as the data evolves\. This disconnect between lifelong learning and hierarchical structure motivates the need for topic modeling approaches that can simultaneously support incremental updates and flexible hierarchical organization\.

In this work, we revisit incremental concept formation as an alternative paradigm for topic modeling\. We introduce CobwebTM, a lifelong hierarchical topic modeling framework based on the Cobweb algorithm (Fisher, 1987 (https://arxiv.org/html/2604.14489#bib.bib45)) for probabilistic concept formation\. By adapting Cobweb to operate over continuous document embeddings, CobwebTM incrementally constructs a semantic hierarchy as documents arrive, enabling unsupervised topic discovery without predefining the number of topics\.

Our contributions are threefold: (1) we introduce CobwebTM, an incremental hierarchical topic modeling framework for unsupervised topic discovery over streaming text; (2) we show that probabilistic concept formation in embedding space provides a simple yet effective mechanism for lifelong topic modeling without catastrophic forgetting or fixed topic capacity; and (3) through extensive empirical evaluation, we demonstrate that CobwebTM matches or outperforms recent neural and clustering\-based methods in both topic quality and hierarchical structure\.

## 2 Related Work

### 2\.1 Lifelong Topic Modeling

Online LDA (Hoffman et al\., 2010 (https://arxiv.org/html/2604.14489#bib.bib44)) is the most widely used lifelong topic model, updating global topics via mini\-batch variational inference\. However, it inherits LDA's bag\-of\-words assumption, requires a predefined number of topics, and lacks mechanisms for restructuring topics as new data arrives\.

Most neural topic models are trained in batch settings and struggle with sequential updates without retraining (Wu et al\., 2024a (https://arxiv.org/html/2604.14489#bib.bib50))\. They are also prone to catastrophic forgetting (Luo et al\., 2025 (https://arxiv.org/html/2604.14489#bib.bib46))\. Mitigation techniques such as replay or elastic weight consolidation (Gupta et al\., 2020 (https://arxiv.org/html/2604.14489#bib.bib43)) reduce forgetting but still rely on fixed latent dimensions\.

Embedding\-based pipelines instead perform topic discovery through clustering over neural representations\. BERTopic (Grootendorst, 2022 (https://arxiv.org/html/2604.14489#bib.bib47)), for example, combines transformer embeddings and clustering\. Lifelong variants replace static clustering with incremental methods such as DBStream (Bär et al\., 2014 (https://arxiv.org/html/2604.14489#bib.bib12)) or Mini\-Batch KMeans (Sculley, 2010 (https://arxiv.org/html/2604.14489#bib.bib21)), though these approaches typically assume flat clustering and remain sensitive to parameter choices\. Recent approaches such as TopicGPT (Pham et al\., 2024 (https://arxiv.org/html/2604.14489#bib.bib3)) and FASTopic (Wu et al\., 2024c (https://arxiv.org/html/2604.14489#bib.bib4)) improve topic quality through LLM\-based generation or embedding\-level semantic modeling, but are either computationally expensive at scale or do not support hierarchical and incremental topic discovery\.

### 2\.2 Hierarchical Topic Modeling

Hierarchical topic models organize topics across levels of abstraction\. Early Bayesian approaches such as hLDA (Blei et al\., 2003a (https://arxiv.org/html/2604.14489#bib.bib1)) and related models (Mimno et al\., 2007 (https://arxiv.org/html/2604.14489#bib.bib8); Perotte et al\., 2011 (https://arxiv.org/html/2604.14489#bib.bib6)) learn topic trees through generative processes\. More recent methods construct hierarchies over embedding\-based topic representations\. Examples include CluHTM (Viegas et al\., 2020 (https://arxiv.org/html/2604.14489#bib.bib22)), HyHTM (Shahidi et al\., 2023 (https://arxiv.org/html/2604.14489#bib.bib23)), and hierarchical variants of BERTopic (Grootendorst, 2022 (https://arxiv.org/html/2604.14489#bib.bib47)), which typically derive hierarchies through clustering or linkage procedures applied after flat topic discovery\. Neural hierarchical topic models further learn structured latent representations using VAEs (Kingma and Welling, 2013 (https://arxiv.org/html/2604.14489#bib.bib24)), including tree\-based (Isonuma et al\., 2020 (https://arxiv.org/html/2604.14489#bib.bib41)), fixed\-depth (Duan et al\., 2021 (https://arxiv.org/html/2604.14489#bib.bib36)), and geometrically regularized models (Wu et al\., 2024d (https://arxiv.org/html/2604.14489#bib.bib38); Lu et al\., 2024 (https://arxiv.org/html/2604.14489#bib.bib39))\. However, these models are generally trained in batch settings and impose structural constraints that limit their flexibility in lifelong or streaming scenarios\.

### 2\.3 Incremental Concept Formation

Humans organize knowledge hierarchically using prototypes and graded category membership (Rosch and Mervis, 1975 (https://arxiv.org/html/2604.14489#bib.bib19))\. Incremental clustering methods formalize this process by building taxonomies whose internal nodes summarize concept\-level statistics\.

Cobweb (Fisher, 1987 (https://arxiv.org/html/2604.14489#bib.bib45)) incrementally constructs a probabilistic taxonomy through conceptual clustering, dynamically creating and restructuring nodes to maximize category utility\. Recent work has extended Cobweb to neural settings and demonstrated robustness in vision and language tasks (MacLellan et al\., 2022 (https://arxiv.org/html/2604.14489#bib.bib13); MacLellan and Thakur, 2021 (https://arxiv.org/html/2604.14489#bib.bib14); Wang et al\., 2025 (https://arxiv.org/html/2604.14489#bib.bib15); Barari et al\., 2024a (https://arxiv.org/html/2604.14489#bib.bib10),b (https://arxiv.org/html/2604.14489#bib.bib16); Lian et al\., 2025 (https://arxiv.org/html/2604.14489#bib.bib17))\.

Unlike probabilistic topic models such as LDA, which directly learn P(word|topic) and P(topic|document) through Dirichlet priors, our approach derives these quantities through clustering in embedding space\. Continuous Cobweb incrementally partitions transformer document embeddings into a hierarchical mixture of clusters, estimating document–topic associations via category utility\. Topic–word distributions are computed post hoc using class\-based TF–IDF over the documents assigned to each node\.

## 3 Methodology

We propose CobwebTM, a topic modeling framework that incrementally organizes document embeddings into a dynamic semantic hierarchy\. Unlike batch clustering methods such as k\-Means or HDBSCAN, CobwebTM supports continual updates without retraining through a two\-step neuro\-symbolic process\.

First, we perform document–topic inference directly in the latent space of pretrained transformer embeddings\. Assuming the embedding space reflects an underlying mixture of topics, we apply the continuous Cobweb algorithm to incrementally partition the space, assigning each document to a node that maximizes category utility\. This procedure produces a hierarchical clustering that implicitly defines the document–topic distribution\.

Second, we derive topic–word representations from the resulting hierarchy\. Each node represents a topic defined by the documents in its subtree\. Treating nodes as classes, we compute word–topic distributions using c\-TF\-IDF, producing interpretable topic descriptors from the highest\-ranked words\.
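The second step can be illustrated with a minimal sketch of class-based TF-IDF. This excerpt does not include Section 3.2.1, so the weighting below is an assumption: it uses the BERTopic-style formula, term frequency within a class times log(1 + A / f), where A is the average word count per class and f is the word's frequency across all classes. The function name `c_tf_idf`, the whitespace tokenizer, and the toy node contents are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(class_docs: dict[str, list[str]], top_k: int = 3) -> dict[str, list[str]]:
    """Rank words per class with the BERTopic-style weight tf * log(1 + A / f)."""
    # One bag of words per class: all documents of a class are pooled together.
    tf = {c: Counter(w for doc in docs for w in doc.lower().split())
          for c, docs in class_docs.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average word count per class

    top_words = {}
    for c, counts in tf.items():
        size = sum(counts.values())
        weights = {w: (n / size) * math.log(1.0 + avg_words / total[w])
                   for w, n in counts.items()}
        # Sort by descending weight; break ties alphabetically for determinism.
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        top_words[c] = ranked[:top_k]
    return top_words


# Hypothetical node contents; here a "class" stands for the documents in one node's subtree.
nodes = {
    "node_a": ["team won final match", "coach praised team"],
    "node_b": ["court heard final case", "judge dismissed case"],
}
top = c_tf_idf(nodes, top_k=2)
```

In this toy input the word "final" occurs in both nodes, so its weight is lower than that of words unique to one node, which is the behavior that makes the top-ranked words usable as topic descriptors.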

### 3\.1 Probabilistic Concept Formation

At the core of our approach is a variant of Cobweb adapted for continuous\-valued attributes (Barari et al\., 2024a (https://arxiv.org/html/2604.14489#bib.bib10))\. Each concept node $c$ maintains a $D$\-dimensional multivariate Gaussian with diagonal covariance,

$$p(x \mid c) = \mathcal{N}\big(x;\ \mu_c,\ \operatorname{diag}(\sigma_c^2)\big),$$

where $\mu_c \in \mathbb{R}^D$ is the node mean and $\sigma_c^2 \in \mathbb{R}^D$ is the variance vector\. These statistics are updated incrementally as new documents are incorporated\.
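The section does not spell out the update rule, so the following is only a sketch of one standard way to maintain a per-dimension mean and variance online, using Welford's algorithm. The class name `GaussianNode` and its methods are hypothetical and are not taken from the CobwebTM code base.

```python
import numpy as np


class GaussianNode:
    """Running diagonal-Gaussian summary of the embeddings seen by one concept node."""

    def __init__(self, dim: int):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # per-dimension sum of squared deviations from the mean

    def update(self, x: np.ndarray) -> None:
        """Fold one embedding into the node (Welford's online update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self) -> np.ndarray:
        """Population variance per dimension (zeros before any point is seen)."""
        if self.count == 0:
            return np.zeros_like(self.mean)
        return self.m2 / self.count


rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 4))  # stand-in for document embeddings
node = GaussianNode(dim=4)
for x in docs:
    node.update(x)
```

After the loop, `node.mean` and `node.variance()` agree with the batch statistics of `docs`, which is the property that lets a node absorb documents one at a time without revisiting earlier ones.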

Cobweb constructs a hierarchy of concepts online\. Given a new document embedding x, the algorithm performs a top\-down search over the tree guided by Category Utility (CU) (Gluck and Corter, 1985 (https://arxiv.org/html/2604.14489#bib.bib18); Corter and Gluck, 1992 (https://arxiv.org/html/2604.14489#bib.bib30))\. Following Barari et al\. (2024a (https://arxiv.org/html/2604.14489#bib.bib10)), we adopt an information\-theoretic formulation that measures the expected reduction in feature uncertainty obtained by knowing the child concept\.

Let a parent node $c_p$ have children $\mathcal{C}(c_p)$, each with count $N_c$\. The empirical probability of concept $c$ under the parent is

$$P(c \mid c_p) = \frac{N_c}{\sum_{c' \in \mathcal{C}(c_p)} N_{c'}} = \frac{N_c}{N_{c_p}}. \tag{1}$$

We measure node uncertainty using the differential entropy of the Gaussian:

$$U(c) = \frac{1}{2} \sum_{d=1}^{D} \log\left(2\pi e\, \sigma_{c,d}^{2}\right). \tag{2}$$

The category utility of a parent node is then

$$\mathrm{CU}(c_p) = \sum_{c \in \mathcal{C}(c_p)} P(c \mid c_p)\,\big[U(c_p) - U(c)\big]. \tag{3}$$
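As a worked step (an expansion of Eqs. (2) and (3), not an equation stated in the paper), substituting the entropy into the category utility cancels the constant $2\pi e$ terms and leaves a weighted sum of log variance ratios:

$$
U(c_p) - U(c) = \frac{1}{2}\sum_{d=1}^{D} \log\frac{\sigma_{c_p,d}^{2}}{\sigma_{c,d}^{2}}
\quad\Longrightarrow\quad
\mathrm{CU}(c_p) = \frac{1}{2}\sum_{c \in \mathcal{C}(c_p)} P(c \mid c_p) \sum_{d=1}^{D} \log\frac{\sigma_{c_p,d}^{2}}{\sigma_{c,d}^{2}}
$$

Each term is positive when a child's variance in dimension $d$ is smaller than the parent's, so a partition scores well when its children are tighter than the parent, weighted by how many documents each child holds.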

Maximizing CU favors partitions that reduce feature uncertainty while maintaining sufficient support, balancing intra\-cluster similarity and inter\-cluster separation\. For continuous attributes, this corresponds to maximizing variance reduction induced by the partition, allowing Cobweb to determine the depth and breadth of the hierarchy without specifying the number of topics K\.
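Equations (1) to (3) can be checked numerically with a short sketch. It computes CU directly from the member embeddings of each child rather than from incrementally maintained statistics, and it adds a variance floor `var_floor` to keep the logarithm finite. Both choices, along with the function names, are assumptions made for illustration.

```python
import numpy as np


def gaussian_entropy(var: np.ndarray, var_floor: float = 1e-6) -> float:
    """Differential entropy of a diagonal Gaussian, Eq. (2); var_floor is an assumed guard."""
    var = np.maximum(var, var_floor)
    return 0.5 * float(np.sum(np.log(2.0 * np.pi * np.e * var)))


def category_utility(children: list) -> float:
    """CU of a parent whose children are given as arrays of member embeddings, Eq. (3)."""
    parent = np.vstack(children)
    u_parent = gaussian_entropy(parent.var(axis=0))
    n_parent = parent.shape[0]
    cu = 0.0
    for members in children:
        p_child = members.shape[0] / n_parent  # Eq. (1)
        cu += p_child * (u_parent - gaussian_entropy(members.var(axis=0)))
    return cu


rng = np.random.default_rng(0)
a = rng.normal(loc=-5.0, size=(100, 3))  # synthetic cluster around -5
b = rng.normal(loc=5.0, size=(100, 3))   # synthetic cluster around +5
mixed = np.vstack([a, b])
rng.shuffle(mixed)  # shuffles rows in place

cu_separated = category_utility([a, b])                     # children match the clusters
cu_arbitrary = category_utility([mixed[:100], mixed[100:]])  # children are random halves
```

The partition that follows the two synthetic clusters yields a clearly higher CU than the random split, because only the former reduces per-child variance relative to the parent.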

At each node, Cobweb evaluates four operators to determine how x should be incorporated into the hierarchy: (1) insert x into the best\-matching existing child and update its Gaussian parameters; (2) create a new singleton child node for x; (3) merge the two best\-matching children and assign x to the merged node; and (4) split the best\-matching child by promoting its children to siblings\. For each operator, Cobweb computes the resulting category utility and selects the operator that produces the highest CU\. This greedy approach enables dynamic topic restructuring without requiring batch retraining\.
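The operator comparison at a single node can be sketched as follows. This is a simplified illustration, not the authors' implementation: it scores the operators at one level only, without recursing down the tree or mutating it. Nodes are summarized by sufficient statistics (count, sum, sum of squares). The names `Node`, `score_operators`, `build`, the `VAR_FLOOR` constant, and the choice to score split before x is placed are all assumptions.

```python
import numpy as np

VAR_FLOOR = 1e-3  # assumed variance floor so singleton nodes have finite entropy


class Node:
    """Concept node holding sufficient statistics of a diagonal Gaussian."""

    def __init__(self, n, s, ss, children=()):
        self.n, self.s, self.ss = n, s, ss
        self.children = list(children)

    @classmethod
    def leaf(cls, x):
        return cls(1, x.copy(), x ** 2)

    def plus(self, other):
        """Statistics of the union of two nodes (children are not carried over)."""
        return Node(self.n + other.n, self.s + other.s, self.ss + other.ss)

    def entropy(self):
        mean = self.s / self.n
        var = np.maximum(self.ss / self.n - mean ** 2, VAR_FLOOR)
        return 0.5 * float(np.sum(np.log(2.0 * np.pi * np.e * var)))


def category_utility(parent, partition):
    """Eq. (3): weighted entropy reduction of `partition` relative to `parent`."""
    u_parent = parent.entropy()
    return sum(c.n / parent.n * (u_parent - c.entropy()) for c in partition)


def score_operators(parent, x):
    """Return {operator name: CU} for the operators applicable at `parent`."""
    new = Node.leaf(x)
    after = parent.plus(new)  # parent statistics once x is absorbed
    kids = parent.children

    def with_x_in(i):
        return kids[:i] + [kids[i].plus(new)] + kids[i + 1:]

    ranked = sorted(range(len(kids)),
                    key=lambda i: category_utility(after, with_x_in(i)),
                    reverse=True)
    scores = {"create": category_utility(after, kids + [new])}
    if ranked:
        best = ranked[0]
        scores["insert"] = category_utility(after, with_x_in(best))
        if len(kids) > 2:
            second = ranked[1]
            rest = [k for i, k in enumerate(kids) if i not in (best, second)]
            merged = kids[best].plus(kids[second]).plus(new)
            scores["merge"] = category_utility(after, rest + [merged])
        if kids[best].children:
            rest = [k for i, k in enumerate(kids) if i != best]
            # Assumption: split is scored before x is placed, so x is excluded here.
            scores["split"] = category_utility(parent, rest + kids[best].children)
    return scores


def build(points, children=()):
    """Helper that summarizes a batch of points as one node."""
    pts = np.asarray(points, dtype=float)
    return Node(len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0), children)


rng = np.random.default_rng(0)
left = rng.normal(loc=[-4.0, -4.0], size=(30, 2))
right = rng.normal(loc=[4.0, 4.0], size=(30, 2))
top = rng.normal(loc=[0.0, 8.0], size=(30, 2))
left_node = build(left, [build(left[:15]), build(left[15:])])
root = build(np.vstack([left, right, top]), [left_node, build(right), build(top)])

near = score_operators(root, np.array([-4.0, -4.0]))  # lies inside the left cluster
far = score_operators(root, np.array([40.0, -40.0]))  # far from every cluster
choice_far = max(far, key=far.get)
```

For the distant point, creating a new child scores highest, since adding it to any existing child would inflate that child's variance. In this sketch the variance floor strongly affects how attractive singleton children look, so the scores for points near an existing cluster should be read with that in mind.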
