A comparative study of transformer-based embeddings for topic coherence

arXiv cs.CL Papers

Summary

This paper examines how model size affects topic quality by running seven transformer-based language models, ranging from 22 million to 13 billion parameters, through a BERTopic pipeline. It finds that model size has a negligible effect on topic quality, which suggests that smaller models can perform comparably to larger ones.

arXiv:2605.28832v1 Announce Type: new Abstract: Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that the size of the model (in terms of number of parameters) has a significant impact in the performance of the language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performances of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on the quality of the topic, suggesting that smaller models can achieve comparable performance to larger models.

Cached at: 05/29/26, 09:12 AM

# A comparative study of transformer-based embeddings for topic coherence
Source: [https://arxiv.org/abs/2605.28832](https://arxiv.org/abs/2605.28832)
[View PDF](https://arxiv.org/pdf/2605.28832)

> Abstract: Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that the size of the model (in terms of number of parameters) has a significant impact in the performance of the language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performances of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on the quality of the topic, suggesting that smaller models can achieve comparable performance to larger models.

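The abstract states that topic quality is scored with coherence metrics but does not say which ones. As a rough illustration only, the sketch below computes NPMI (normalized pointwise mutual information), one commonly used coherence measure, from document-level word co-occurrence. The function name `npmi_coherence`, the toy corpus, and the choice of NPMI are assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Illustrative sketch: co-occurrence is counted at the document level,
    which is an assumption, not the paper's stated setup.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words to form a pair")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: lower bound of NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # words co-occur everywhere: avoid dividing by log(1) = 0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "bank"],
    ["stock", "bank", "loan"],
]

coherent_score = npmi_coherence(["cat", "dog"], toy_docs)    # words always appear together
incoherent_score = npmi_coherence(["cat", "stock"], toy_docs)  # words never appear together
```

In a comparison like the one the abstract describes, a score of this kind would be computed for the top words of each topic and averaged per embedding model, so that models of different sizes can be compared on the same corpus.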
## Submission history

From: Willy Rodriguez \[[view email](https://arxiv.org/show-email/04047c1a/2605.28832)\] \[via CCSD proxy\] **\[v1\]** Fri, 10 Apr 2026 08:34:47 UTC (2,342 KB)
