How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description

arXiv cs.CL 05/26/26, 04:00 AM Papers
llm bibliometrics cluster-description evaluation scientific-synthesis hybrid-workflow
Summary
This paper evaluates whether bibliometric structure improves LLM-assisted scientific literature synthesis by comparing six pipelines for generating cluster descriptions. Results show LLMs perform best in a hybrid workflow where bibliometric algorithms define clusters and LLMs generate readable descriptions.
arXiv:2605.24351v1 Announce Type: new Abstract: Large language models (LLMs) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization. We evaluate whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure. Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding. Results show that LLMs produce descriptions semantically close to human-written ones, but are unreliable when asked to infer bibliometric structure from scratch. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them. Overall, LLM-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions.
Cached at: 05/26/26, 09:02 AM
# How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description
Source: [https://arxiv.org/html/2605.24351](https://arxiv.org/html/2605.24351)
Abraham Camelo\-Guerrero, School of Information Technology, York University, Toronto, Ontario M3J 1P3, acamelog@yorku\.ca

Jairo Diaz\-Rodriguez, Department of Mathematics and Statistics, York University, Toronto, Ontario M3J 1P3, jdiazrod@yorku\.ca

###### Abstract

Large language models \(LLMs\) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization\. We evaluate whether bibliometric structure improves LLM\-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure\. Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human\-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding\. Results show that LLMs produce descriptions semantically close to human\-written ones, but are unreliable when asked to infer bibliometric structure from scratch\. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them\. Overall, LLM\-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions\.


## 1 Introduction

Large language models \(LLMs\) are increasingly used for scientific literature synthesis, including related\-work generation, scientific summarization, and automatic literature review generation \(Hu and Wan, [2014](https://arxiv.org/html/2605.24351#bib.bib23); Chen et al\., [2021](https://arxiv.org/html/2605.24351#bib.bib24); Lu et al\., [2020](https://arxiv.org/html/2605.24351#bib.bib25); Kasanishi et al\., [2023](https://arxiv.org/html/2605.24351#bib.bib26); Tang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib36)\)\. Retrieval\-augmented generation improves grounding in external corpora \(Lewis et al\., [2020](https://arxiv.org/html/2605.24351#bib.bib27); Gao et al\., [2023](https://arxiv.org/html/2605.24351#bib.bib28)\), and recent systems target scientific synthesis directly \(Asai et al\., [2026](https://arxiv.org/html/2605.24351#bib.bib29)\)\. However, generating a literature review from scratch remains difficult: LLMs may hallucinate references, provide uneven coverage, or impose an organization on the literature that is not supported by the scholarly record \(Tang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib36)\)\. A natural alternative is to separate organization from writing: first construct a structured map of the literature, then use the LLM to synthesize descriptions within that structure\.

Bibliometric analysis provides such a structured approach\. It uses publication and citation metadata to organize papers through relations such as bibliographic coupling, co\-citation, and direct citation \(Kessler, [1963](https://arxiv.org/html/2605.24351#bib.bib4); Small, [1973](https://arxiv.org/html/2605.24351#bib.bib11); Boyack and Klavans, [2010](https://arxiv.org/html/2605.24351#bib.bib12)\)\. Widely used in management, information science, health sciences, environmental studies, education, and scientometrics, bibliometric science mapping provides an auditable way to identify research streams, intellectual foundations, influential works, and scholarly communities \(Cobo et al\., [2011](https://arxiv.org/html/2605.24351#bib.bib3)\)\. Recent work has begun to use LLMs around bibliometric analysis, mainly for auxiliary tasks such as search support, summarization, topic classification, and thematic mapping \(Sarachuk, [2025](https://arxiv.org/html/2605.24351#bib.bib49); He et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib50); Keenan and Heavin, [2026](https://arxiv.org/html/2605.24351#bib.bib51)\)\. However, these studies do not systematically evaluate how different levels of LLM responsibility affect the quality of the bibliometric workflow itself\. This makes bibliometric analysis a useful test case for LLM\-assisted synthesis: if LLMs struggle to organize a literature from scratch, how much external structure should they receive?

![Refer to caption](https://arxiv.org/html/2605.24351v1/llm_assisted_bibliometric_synthesis_workflow.drawio.png)

Figure 1: Workflow for evaluating LLM\-assisted bibliometric synthesis\.

We address this question by evaluating six LLM\-assisted pipelines for generating bibliometric cluster descriptions under different levels of evidence and structure \(Figure [1](https://arxiv.org/html/2605.24351#S1.F1)\)\. The pipelines range from Blind, where the model receives only the search query, to structured settings such as Labeled and Ranked, where the model receives papers grouped by bibliographic coupling or direct citation clusters\. We build the benchmark from 100 published bibliometric analyses by manually extracting author queries, reconstructing Scopus corpora, and collecting human\-written cluster descriptions\. We assess the outputs in terms of human alignment, semantic coverage, clustering quality, agreement with the underlying bibliometric graph, and reference grounding\.

This design contributes to a broader shift in LLM\-assisted literature review research from isolated single\-task applications toward structured, multi\-stage workflows\. Recent work has examined LLMs for tasks such as screening, query generation, search, extraction, organization, and synthesis under human supervision within these multi\-stage workflows \(Nykvist et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib40); Wang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib41); Ye et al\., [2024](https://arxiv.org/html/2605.24351#bib.bib42); Pei and Sun, [2025](https://arxiv.org/html/2605.24351#bib.bib43); Silva and Wickramaarachchi, [2025](https://arxiv.org/html/2605.24351#bib.bib44)\)\. Other studies emphasize structured outputs, including review tables, schemas, hierarchical maps, and extracted evidence, because these outputs are easier to inspect than free\-form prose \(Padmakumar et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib45); Hsu et al\., [2024](https://arxiv.org/html/2605.24351#bib.bib46); John et al\., [2026](https://arxiv.org/html/2605.24351#bib.bib47); Jansen et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib48)\)\. Our study extends this direction by testing structure itself as an experimental condition\.

Our results show that: \(i\) LLMs can generate cluster descriptions that are semantically close to human\-written ones, but are unreliable when asked to infer bibliometric structure from scratch; \(ii\) when given appropriate bibliometric structure, LLM\-generated descriptions can score higher than human descriptions on our corpus\-level, clustering, and graph\-based metrics, with the strongest performance occurring when bibliometric algorithms first define the clusters and the LLM is used to interpret them; \(iii\) the best form of structure depends on the relation: bibliographic coupling benefits from full cluster context, while citation analysis can often be summarized from compact link\-ranked evidence; and \(iv\) overall, LLM\-based bibliometric analysis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs translate that structure into readable descriptions\.

## 2 Related Work

NLP work on scientific literature synthesis includes related\-work generation, scientific multi\-document summarization, and automatic literature review generation \(Hu and Wan, [2014](https://arxiv.org/html/2605.24351#bib.bib23); Chen et al\., [2021](https://arxiv.org/html/2605.24351#bib.bib24); Lu et al\., [2020](https://arxiv.org/html/2605.24351#bib.bib25); Kasanishi et al\., [2023](https://arxiv.org/html/2605.24351#bib.bib26); Tang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib36)\)\. Recent retrieval\-augmented systems further ground scientific generation in external corpora \(Lewis et al\., [2020](https://arxiv.org/html/2605.24351#bib.bib27); Gao et al\., [2023](https://arxiv.org/html/2605.24351#bib.bib28); Asai et al\., [2026](https://arxiv.org/html/2605.24351#bib.bib29)\), while recent evaluations show that LLM\-generated literature reviews still suffer from hallucinated references and uneven coverage \(Tang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib36)\)\. Our work differs by studying not full review generation, but the more specific task of generating cluster descriptions with the principles of bibliometric analysis \(Cobo et al\., [2011](https://arxiv.org/html/2605.24351#bib.bib3)\)\.

Our task connects topic labeling with bibliometric science mapping\. Topic models and automatic labeling methods induce and verbalize latent themes from text \(Blei et al\., [2003](https://arxiv.org/html/2605.24351#bib.bib32); Mei et al\., [2007](https://arxiv.org/html/2605.24351#bib.bib33); Lau et al\., [2011](https://arxiv.org/html/2605.24351#bib.bib34); Bhatia et al\., [2016](https://arxiv.org/html/2605.24351#bib.bib35)\), and recent work uses LLMs to improve topic modeling and topic interpretability \(Liu et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib37); Yang et al\., [2025](https://arxiv.org/html/2605.24351#bib.bib38)\)\. Bibliometric methods instead cluster papers using relations such as bibliographic coupling, co\-citation, and direct citation \(Kessler, [1963](https://arxiv.org/html/2605.24351#bib.bib4); Small, [1973](https://arxiv.org/html/2605.24351#bib.bib11); Boyack and Klavans, [2010](https://arxiv.org/html/2605.24351#bib.bib12)\), often with community detection algorithms such as Louvain \(Blondel et al\., [2008](https://arxiv.org/html/2605.24351#bib.bib39)\)\. We use bibliometric clusters as an auditable scaffold for LLM generation, testing both whether LLMs can infer literature structure on their own and how much external structure is needed to improve cluster descriptions\.

## 3 Background: Bibliometric Analysis

### 3\.1 Bibliometric Analysis as Science Mapping

Bibliometric analysis is a structured approach for studying scientific literatures through metadata such as titles, abstracts, keywords, references, and citations\. In science mapping, the goal is to reveal the organization of a research field: its main themes, intellectual foundations, influential works, and research communities \(Cobo et al\., [2011](https://arxiv.org/html/2605.24351#bib.bib3)\)\. Rather than producing a single linear summary, bibliometric analysis typically organizes a literature into clusters that can be interpreted as research streams or thematic areas\.

A simplified bibliometric workflow consists of four stages \(Donthu et al\., [2021](https://arxiv.org/html/2605.24351#bib.bib53)\):

$$\text{query} \rightarrow \text{corpus} \rightarrow \text{clustering} \rightarrow \text{cluster descriptions}.$$

##### Query\.

The query defines the scope of the analysis\. It specifies which topic, keywords, time period, document types, or database fields are included in the search\. Because all later stages depend on the retrieved papers, the query strongly shapes the resulting analysis\. In published bibliometric reviews, the query is part of the methodological record and provides a reproducible entry point into the literature\.

##### Corpus\.

Executing the query in a bibliographic database such as Scopus produces a corpus of papers\. The corpus is the universe of documents to be organized\. It provides textual evidence, such as titles and abstracts, as well as bibliographic evidence, such as references and citation links\. Importantly, the corpus is not itself a clustering\. It is a collection of papers that still needs to be structured\.

##### Clustering\.

The clustering stage organizes the corpus into groups of related papers\. Bibliometric clustering usually begins by defining relations among papers, then applying a clustering algorithm to those relations\. Different relations capture different notions of scholarly relatedness\. In this work, we focus on *bibliographic coupling* and *citation* analysis\.

*Bibliographic coupling* links two papers when they cite the same prior work \(Kessler, [1963](https://arxiv.org/html/2605.24351#bib.bib4)\)\. Let $R_i$ denote the set of references cited by paper $p_i$\. The bibliographic coupling weight between papers $p_i$ and $p_j$ is:

$$w_{ij}^{BC} = |R_i \cap R_j|.$$

If two papers share many references, they are likely to draw on similar intellectual foundations\. Bibliographic coupling is therefore useful for identifying research fronts based on shared prior literature \(Boyack and Klavans, [2010](https://arxiv.org/html/2605.24351#bib.bib12)\)\.
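
As a minimal sketch of the coupling weight defined above, the following Python snippet counts shared references for every pair of papers. The function name `coupling_weights` and the toy reference lists are illustrative assumptions, not part of the benchmark or the authors' code.

```python
from itertools import combinations


def coupling_weights(references):
    """Map each coupled pair (i, j) to the number of references they share.

    `references` maps a paper id to the set of works that paper cites.
    """
    weights = {}
    for i, j in combinations(sorted(references), 2):
        shared = len(references[i] & references[j])
        if shared > 0:  # keep only pairs that share at least one reference
            weights[(i, j)] = shared
    return weights


# Toy reference lists (hypothetical identifiers, not benchmark data).
refs = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "c", "d"},
    "p3": {"e"},
}
weights = coupling_weights(refs)
print(weights)
```

Here only `p1` and `p2` are coupled, because `p3` shares no reference with either of them.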

*Citation* analysis links papers when one paper cites another \(Garfield, [1955](https://arxiv.org/html/2605.24351#bib.bib21); Price, [1965](https://arxiv.org/html/2605.24351#bib.bib10)\)\. Let $c_{ij} = 1$ if paper $p_i$ cites paper $p_j$, and $0$ otherwise\. We use an undirected projection:

$$w_{ij}^{CIT} = c_{ij} + c_{ji}.$$

This relation captures direct scholarly influence within a retrieved corpus\. Compared with bibliographic coupling, citation analysis emphasizes citation paths and intellectual lineage\.
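
The undirected projection can be sketched the same way. In this illustrative snippet, `citation_weights` is a hypothetical helper and the citation pairs are toy data; a pair of papers that cite each other receives weight 2, and a one-way citation receives weight 1.

```python
def citation_weights(cites):
    """Project directed citations onto undirected weights c_ij + c_ji.

    `cites` is an iterable of (citing, cited) paper-id pairs.
    """
    weights = {}
    for i, j in set(cites):  # each directed link counts at most once
        if i == j:
            continue  # ignore self-citations in this sketch
        key = tuple(sorted((i, j)))
        weights[key] = weights.get(key, 0) + 1
    return weights


# Toy citation links (hypothetical identifiers, not benchmark data).
links = [("p1", "p2"), ("p2", "p1"), ("p3", "p1")]
cit = citation_weights(links)
print(cit)
```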

After constructing a bibliographic\-coupling or citation\-based representation, a clustering algorithm groups papers into communities\. In our study, this stage is implemented using Louvain community detection \(Yin, [2024](https://arxiv.org/html/2605.24351#bib.bib52)\)\. The result is a paper\-level clustering that specifies which papers belong together\.

##### Cluster descriptions\.

A clustering identifies groups of related papers, but it does not explain what those groups mean\. In traditional bibliometric reviews, human analysts interpret clusters by inspecting representative papers, central works, recurring terms, and citation patterns\. They then write labels and descriptions that summarize each cluster’s theme, scope, and relation to other clusters\.

This final interpretive step is where LLMs may be especially useful\. LLMs can synthesize text and produce fluent descriptions, but they may be unreliable when asked to infer an entire literature structure from a topic alone\. A bibliometric workflow provides increasing structure: first a query, then a retrieved corpus, then an algorithmic clustering\. This structure allows us to test whether LLMs are more reliable as autonomous cluster generators or as interpreters of clusters produced by standard bibliometric methods\.

## 4 Conceptual Framework

We view LLM\-assisted bibliometric analysis as a workflow\-allocation problem: which stages should be handled by the LLM, and which should remain algorithmic? Our six pipelines form a ladder of increasing structure, from query\-only generation, where the LLM must infer papers, themes, references, and clusters on its own, to structured workflows where a corpus is retrieved, a bibliographic or citation graph is built, clusters are assigned algorithmically, and the LLM only summarizes the resulting clusters\. We hypothesize that LLMs are more reliable as interpreters of bibliometric structure than as unconstrained generators of literature maps, and test whether retrieved evidence and graph\-induced clusters reduce hallucinated references, weak coverage, and unsupported thematic organization\. Our goal is not to evaluate unconstrained topic discovery, but cluster\-faithful bibliometric description generation: given or inferred evidence about a corpus, a system should produce descriptions whose induced interpretation preserves the bibliographic coupling or citation structure that bibliometric analysis is designed to expose\.

Table 1: The six pipelines form a continuum of decreasing LLM responsibility and increasing bibliometric structure\.

## 5 Methodology

Our methodology follows the standard workflow of bibliometric analysis and uses it to define a controlled evaluation of LLM\-assisted cluster description\. We begin from published bibliometric review papers, extract the authors’ original search queries and cluster descriptions, reconstruct the corresponding Scopus corpora, apply standard bibliometric clustering analysis, and then evaluate six LLM pipelines that receive different amounts of evidence and structure\.

### 5\.1 Data: Source Bibliometric Review Papers

The benchmark is constructed from 100 peer\-reviewed, published bibliometric review papers, which we manually reviewed\. Each source paper conducts a bibliographic coupling analysis of a research topic and reports a set of cluster descriptions\. We use these papers because they provide naturally occurring examples of expert bibliometric analysis\.

Let $r_i$ denote the $i$\-th source paper\. From each source paper, we manually extract two artifacts\. First, we extract the search query used by the authors, denoted by $q_i$\. Second, we extract the human\-authored cluster descriptions reported in the paper, denoted by $H_i = \{h_{i1}, h_{i2}, \ldots, h_{iK_i}\}$, where $K_i = |H_i|$ is the number of clusters reported by the source paper\. The extracted query $q_i$ defines the topic and retrieval scope of the original bibliometric analysis\. The extracted human descriptions $H_i$ provide the reference interpretation for the *bibliographic coupling* analysis and determine the target number of clusters used in evaluation\. Human\-authored citation\-cluster descriptions were not consistently available, so we leave human comparison for *citation* analysis outside our scope\.

### 5\.2 Benchmark Construction

For each source paper $r_i$, we reconstruct the bibliometric\-analysis workflow in four stages:

$$r_i \rightarrow q_i \rightarrow D_i \rightarrow Z_i \rightarrow Y_i.$$

Here, $q_i$ is the query extracted from the source paper, $D_i$ is the Scopus corpus obtained by rerunning that query, $Z_i$ is the clustering obtained by applying a standard bibliometric clustering analysis, and $Y_i$ is the cluster description output generated by a pipeline\.

##### Query extraction\.

For each bibliometric review paper $r_i$, we manually identify the search query used by the original authors\. This query may include keywords, Boolean operators, field restrictions, publication\-year restrictions, or other search constraints reported in the paper\. We denote the extracted query by $q_i$\.

##### Corpus reconstruction\.

We execute $q_i$ in Scopus to obtain a corpus $D_i = \{p_{i1}, p_{i2}, \ldots, p_{in_i}\}$\. Each paper $p_{ij}$ contains a title, abstract, and citation metadata when available\.

##### Bibliometric clustering\.

For each reconstructed corpus $D_i$, we apply a standard bibliometric clustering analysis\. We consider two bibliometric analysis modes: bibliographic coupling and direct citation\. In the bibliographic\-coupling mode, papers are related by shared references\. In the direct\-citation mode, papers are related by citation links among papers in $D_i$\. We construct the corresponding paper\-relation representation and apply Louvain community detection to obtain a paper\-level clustering $Z_i = \{z_{i1}, z_{i2}, \ldots, z_{in_i}\}$, where $z_{ij} \in \{1, \ldots, K_i\}$ is the cluster assignment of paper $p_{ij}$\.

The number of clusters is fixed to $K_i$, the number of human\-authored cluster descriptions extracted from the source paper\. We tune the Louvain resolution parameter only to match this target number of clusters\. We do not tune the clustering to optimize agreement with the human descriptions or with any downstream evaluation metric\.
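
The text does not specify how the resolution is searched, so the following is only one plausible sketch: scan a list of candidate resolutions and keep the one whose clustering is closest to the target count. The names `match_cluster_count` and `toy_cluster_fn` are hypothetical; in practice `cluster_fn` would wrap a Louvain implementation, and the toy function below merely stands in for it.

```python
def match_cluster_count(cluster_fn, target_k, resolutions):
    """Scan candidate resolutions and return the one closest to target_k.

    `cluster_fn(resolution)` must return one cluster label per paper.
    Only the number of clusters is inspected, never a quality metric.
    """
    best = None
    for res in resolutions:
        labels = cluster_fn(res)
        gap = abs(len(set(labels)) - target_k)
        if best is None or gap < best[0]:
            best = (gap, res, labels)
        if gap == 0:
            break  # exact match found, stop scanning
    return best[1], best[2]


def toy_cluster_fn(resolution, n_papers=6):
    """Stand-in for a real community detector (assumption for illustration)."""
    k = max(1, min(n_papers, int(resolution * 2)))
    return [paper % k for paper in range(n_papers)]


resolution, labels = match_cluster_count(toy_cluster_fn, 3, [0.5, 1.0, 1.5, 2.0])
print(resolution, labels)
```

Because the selection criterion looks only at the cluster count, this kind of search stays independent of the human descriptions and of the downstream metrics.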

##### Cluster description generation\.

Each evaluated pipeline produces a set of cluster descriptions

$$Y_i = \{(l_{i1}, d_{i1}), \ldots, (l_{iK_i}, d_{iK_i})\},$$

where $l_{ik}$ is the generated label for cluster $k$, and $d_{ik}$ is its generated natural\-language description\.

Different pipelines receive different objects from the reconstructed workflow\. Some receive only the extracted query $q_i$\. Others receive the reconstructed corpus $D_i$\. The most structured pipelines receive the Louvain clustering $Z_i$ and ask the LLM only to describe the resulting clusters\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/BERTScore_human-alignment_by_pipeline.png)

Figure 2: Pipeline human comparison, measured with BERTScore\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/reference-grounded_coverage.png)

Figure 3: Reference\-grounded coverage across pipelines\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/reference_grounding.png)

Figure 4: Reference validation for Blind by matching criterion: \(a\) title, \(b\) title and year, and \(c\) title, year, and first author\.

### 5\.3 Pipeline Design

The six pipelines differ in how much evidence and bibliometric structure are provided to the LLM\. They form a progression from Blind, where the model receives only the author query, to Ranked, where the clusters have already been produced by a Louvain\-based bibliometric or citation procedure and the model receives only a compact set of representative papers to summarize\. The pipeline ladder is designed to progressively reduce the structural responsibility assigned to the LLM\. In the least structured setting, the model must infer the relevant themes and produce cluster descriptions from the query alone\. In the most structured setting, both the cluster structure and the evidence selection are supplied by the bibliometric procedure, leaving the LLM primarily responsible for final verbalization\. A detailed description of the pipelines and prompts is provided in Appendices [B](https://arxiv.org/html/2605.24351#A2) and [C](https://arxiv.org/html/2605.24351#A3)\.

##### Blind\.

The Blind pipeline receives only the extracted author query $q_i$ and the target number of clusters $K_i$\. The LLM is asked to generate $K_i$ cluster labels and descriptions directly from the query\. This is the least informed baseline: the model must rely on its own knowledge to infer the main research areas associated with the query\. This pipeline approximates a naive LLM\-based literature mapping prompt, except that the number of clusters is controlled\.

##### Corpus\.

The Corpus pipeline receives the extracted author query $q_i$, the target number of clusters $K_i$, and the full reconstructed Scopus corpus $D_i$, represented by paper titles and abstracts\. The LLM is asked to generate all cluster labels and descriptions in a single pass\. This pipeline tests whether access to the retrieved corpus is sufficient for the model to organize the literature\. The LLM receives substantially more evidence than in Blind, but it still does not receive an algorithmic clustering\. It must both infer the cluster structure and describe the resulting clusters\.

##### Corpus\-select\.

The Corpus\-select pipeline receives the same inputs as Corpus\. However, it decomposes the task into two stages\. In the first stage, the LLM selects papers from $D_i$ that are relevant or representative for each of the $K_i$ clusters\. In the second stage, the model synthesizes cluster labels and descriptions from the selected papers\. This pipeline tests whether explicit evidence selection improves corpus\-grounded cluster analysis\. The LLM still remains responsible for inducing the cluster structure, but it is encouraged to filter the corpus before producing the final descriptions\.

##### Labeled\.

The Labeled pipeline receives the corpus pre\-grouped according to the Louvain clustering $Z_i$\. The LLM is given papers from $D_i$ together with their algorithmic cluster assignments\. The model is then asked to generate labels and descriptions for the given clusters in a single pass\. This pipeline provides strong structural information\. Unlike Corpus and Corpus\-select, the LLM no longer needs to decide which papers belong together\. Its role shifts from cluster induction to cluster interpretation: it must explain the meaning of clusters produced by the standard bibliometric or citation clustering procedure\.

##### Labeled\-select\.

The Labeled\-select pipeline also receives the corpus pre\-grouped by the Louvain clustering $Z_i$, but uses a two\-stage procedure\. In the first stage, the LLM selects the most informative or representative papers within each predefined cluster\. In the second stage, it synthesizes the final cluster labels and descriptions from the selected within\-cluster evidence\. This pipeline tests whether evidence selection remains useful when the cluster structure is already given\. Compared with Labeled, the model receives the same algorithmic grouping, but is asked to identify the most useful evidence inside each cluster before writing the description\.

##### Ranked\.

The Ranked pipeline also starts from the Louvain clustering $Z_i$, but provides the LLM with only a compact subset of papers from each cluster\. For each cluster, papers are ranked using link\-based importance derived from the bibliometric clustering algorithm\. The top 10 papers per cluster are then provided to the LLM, which generates the cluster labels and descriptions\. This pipeline tests whether a compact, high\-signal subset of each algorithmic cluster is sufficient for accurate cluster description\. The LLM receives less context than in Labeled and Labeled\-select, but it also has less structural responsibility: the clusters have already been defined by Louvain, and the evidence shown to the model has already been ranked by the bibliometric procedure\. The model’s role is therefore limited to synthesizing descriptions from representative evidence, making Ranked the most constrained pipeline in terms of LLM decision\-making\.
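
The exact link-based importance measure is not given in this section. As an illustration only, the sketch below assumes it is the total weight of a paper's links to other papers in the same cluster, and keeps the top `n` papers per cluster. The function name `top_papers_per_cluster` and the toy graph are hypothetical.

```python
from collections import defaultdict


def top_papers_per_cluster(weights, clusters, n=10):
    """Rank papers inside each cluster by within-cluster link strength.

    `weights` maps undirected pairs (i, j) to a link weight and
    `clusters` maps each paper id to its cluster label.
    The importance measure is an assumption made for this sketch.
    """
    strength = defaultdict(float)
    for (i, j), w in weights.items():
        if clusters[i] == clusters[j]:  # count only within-cluster links
            strength[i] += w
            strength[j] += w
    members = defaultdict(list)
    for paper, cluster in clusters.items():
        members[cluster].append(paper)
    # Sort by descending strength, breaking ties by paper id.
    return {
        cluster: sorted(papers, key=lambda p: (-strength[p], p))[:n]
        for cluster, papers in members.items()
    }


# Toy link weights and cluster labels (hypothetical, not benchmark data).
toy_weights = {("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 5, ("a", "d"): 2}
toy_clusters = {"a": 0, "b": 0, "c": 0, "d": 1}
top = top_papers_per_cluster(toy_weights, toy_clusters, n=2)
print(top)
```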

Table 2: Main results for bibliographic coupling and citation analyses for all pipelines\. For each metric, *Rank* is the mean rank across benchmark instances, where lower is better; *Med\.* is the median value; and *%Win* is the percentage of instances won, where higher is better\. Bold values mark the best result within each analysis block\.
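
To make the three summary statistics in the caption concrete, here is a small sketch that computes mean rank, median, and win percentage from per-instance scores. It assumes that higher metric values are better and that ties are broken by input order; both are assumptions of this example, and the scores are toy numbers rather than results from the paper.

```python
from statistics import mean, median


def summarize(scores):
    """Return mean rank, median value, and %Win for each pipeline.

    `scores` maps a pipeline name to one metric value per benchmark
    instance; every list must have the same length.
    """
    pipelines = list(scores)
    n_instances = len(scores[pipelines[0]])
    ranks = {p: [] for p in pipelines}
    wins = {p: 0 for p in pipelines}
    for t in range(n_instances):
        # Rank 1 goes to the highest value on this instance.
        ordered = sorted(pipelines, key=lambda p: -scores[p][t])
        for position, p in enumerate(ordered, start=1):
            ranks[p].append(position)
        wins[ordered[0]] += 1
    return {
        p: {
            "rank": mean(ranks[p]),
            "med": median(scores[p]),
            "win": 100 * wins[p] / n_instances,
        }
        for p in pipelines
    }


# Toy scores for two hypothetical pipelines over four instances.
summary = summarize({"A": [0.5, 0.7, 0.6, 0.9], "B": [0.4, 0.8, 0.5, 0.3]})
print(summary)
```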

## 6 Evaluation

We evaluate each generated cluster analysis along five dimensions that reflect the main goals of bibliometric analysis: alignment with human interpretations, semantic coverage of the corpus, recovery of cluster structure, agreement with the underlying bibliometric graph, and reference grounding\. Full metric definitions are provided in Appendix [D](https://arxiv.org/html/2605.24351#A4)\. We complement these scalable, structure\-sensitive automatic metrics with a small\-scale human evaluation reported in Appendix [E\.5](https://arxiv.org/html/2605.24351#A5.SS5)\.

##### Human comparison\.

We use BERTScore F1 to measure semantic alignment between generated cluster descriptions and the human\-authored descriptions reported in the source bibliometric reviews\. Generated and human descriptions are matched one\-to\-one to maximize total semantic similarity\. This captures whether the generated descriptions recover similar themes, even when they use different wording\.
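
The one-to-one matching step can be illustrated with a brute-force search over permutations, which is feasible for the small numbers of clusters typical of bibliometric reviews. The matching algorithm actually used is not stated here; for larger problems a Hungarian-style solver would be the usual choice. The function name `best_one_to_one` and the similarity matrix are hypothetical stand-ins for pairwise BERTScore F1 values.

```python
from itertools import permutations


def best_one_to_one(sim):
    """Find the one-to-one matching with the highest total similarity.

    `sim[g][h]` is the similarity between generated description g and
    human description h (square matrix). Brute force: O(K!) in K clusters.
    """
    k = len(sim)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(k)):
        total = sum(sim[g][perm[g]] for g in range(k))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy similarity matrix (hypothetical values, not BERTScore outputs).
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.20, 0.70, 0.60],
]
assignment, total = best_one_to_one(sim)
print(assignment, round(total, 2))
```

In this toy case a greedy match would pair the first generated description with the first human description, while the optimal matching pairs it with the second, which is why the matching is solved jointly rather than row by row.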

##### Semantic coverage\.

We measure coverage as the average maximum embedding similarity between each sentence in the reconstructed Scopus corpus and the generated cluster descriptions\. This evaluates whether the pipeline describes the broader corpus rather than only a narrow subset of papers\.
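
A minimal sketch of this coverage score, assuming cosine similarity over precomputed embeddings: each corpus sentence takes its best similarity to any description, and the scores are averaged. The embedding model is left out, and the two-dimensional vectors below are toy values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_coverage(corpus_vecs, description_vecs):
    """Average, over corpus sentences, of the best match to any description."""
    best = [max(cosine(s, d) for d in description_vecs) for s in corpus_vecs]
    return sum(best) / len(best)


# Toy embeddings (hypothetical; real embeddings would come from a model).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
descriptions = [[1.0, 0.0], [0.0, 1.0]]
coverage = semantic_coverage(corpus, descriptions)
print(round(coverage, 4))
```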

##### Clustering quality\.

We assign each paper to the generated cluster description with the highest embedding similarity and compare these induced assignments with the bibliometric or citation clusters using Adjusted Rand Index\. We also report silhouette score in abstract embedding space\. These metrics test whether the generated descriptions preserve the cluster structure they are intended to explain\.
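
The induced-assignment step and the Adjusted Rand Index can be sketched in plain Python as follows. The similarity values are toy numbers, and the helper names `induced_labels` and `adjusted_rand_index` are illustrative; a library implementation would normally be used in practice.

```python
from collections import Counter
from math import comb


def induced_labels(sim):
    """Assign each paper to the description with the highest similarity.

    `sim[p][k]` is the similarity between paper p and description k.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same papers."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # trivial case: nothing to compare
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy paper-to-description similarities (hypothetical values).
paper_sim = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5]]
induced = induced_labels(paper_sim)
reference = [0, 0, 1, 1]  # stand-in for the algorithmic clusters
ari = adjusted_rand_index(induced, reference)
print(induced, ari)
```

The index is invariant to relabeling, so a description set that recovers the same groups under different cluster numbers still scores 1.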

##### Graph\-structural quality\.

We compute modularity using the induced paper assignments on the underlying bibliographic or citation graph\. This evaluates whether the descriptions correspond to dense regions of the paper\-relation network, rather than only matching the final Louvain labels\.
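
For reference, the standard weighted modularity of a partition can be computed as below. This is a generic sketch of the textbook formula rather than the authors' implementation; the graph is a toy example of two triangles joined by a single edge, and `modularity` is an illustrative helper name.

```python
from collections import defaultdict


def modularity(weights, clusters):
    """Weighted modularity of a partition of an undirected graph.

    `weights` maps undirected pairs (i, j) to a positive edge weight and
    `clusters` maps each node to its cluster label.
    Q = sum over clusters of (inside / m) - (degree / (2 * m)) ** 2.
    """
    m = sum(weights.values())  # total edge weight
    inside = defaultdict(float)  # edge weight inside each cluster
    degree = defaultdict(float)  # summed weighted degree per cluster
    for (i, j), w in weights.items():
        degree[clusters[i]] += w
        degree[clusters[j]] += w
        if clusters[i] == clusters[j]:
            inside[clusters[i]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles connected by one bridge edge.
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(edges, split)
print(round(q, 4))
```

Placing every node in one cluster yields a modularity of zero, which is the natural baseline for judging whether induced assignments follow dense regions of the graph.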

##### Reference grounding\.

For Blind, we validate generated references against OpenAlex using title, year, and first\-author matching\. For the other pipelines, we compute reference\-grounded coverage between cited abstracts and generated descriptions\. This tests whether references are valid and whether descriptions reflect the evidence they cite\.
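
The three matching criteria can be illustrated with a small comparison function. This sketch assumes exact matching after simple text normalization and does not call OpenAlex; the record fields (`title`, `year`, `first_author`) and the example records are hypothetical, and the matching rule actually used may be more tolerant.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation so minor formatting is ignored."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_levels(generated, candidate):
    """Check a generated reference against a candidate record at three levels."""
    title = normalize(generated["title"]) == normalize(candidate["title"])
    title_year = title and generated["year"] == candidate["year"]
    full = title_year and (
        normalize(generated["first_author"]) == normalize(candidate["first_author"])
    )
    return {"title": title, "title_year": title_year, "title_year_author": full}


# Hypothetical records: the title matches but the year does not.
generated_ref = {"title": "Mapping Research Fronts: A Survey", "year": 2019, "first_author": "Lee"}
candidate_ref = {"title": "Mapping research fronts: a survey", "year": 2018, "first_author": "Lee"}
levels = match_levels(generated_ref, candidate_ref)
print(levels)
```

A reference like this one would count as valid under title-only matching but fail the stricter criteria, which is the kind of gap the stricter validation levels are designed to expose.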

## 7 Results

Figure [2](https://arxiv.org/html/2605.24351#S5.F4) reports semantic alignment with the human\-authored cluster descriptions, and Figures [3](https://arxiv.org/html/2605.24351#S5.F4) and [4](https://arxiv.org/html/2605.24351#S5.F4) show reference\-grounded results\. Table [2](https://arxiv.org/html/2605.24351#S5.T2) summarizes the main results for bibliographic coupling and citation analyses across metrics\.

##### Human comparison\.

Figure [2](https://arxiv.org/html/2605.24351#S5.F4) shows consistently high BERTScore values above 0\.8 across all pipelines, indicating strong semantic overlap with human\-written cluster descriptions\. Table [2](https://arxiv.org/html/2605.24351#S5.T2) shows a different pattern when the outputs are evaluated against corpus\-level and graph\-structural metrics\. Human descriptions achieve reasonable coverage, but they rank closer to the middle on average and perform poorly on cluster and graph quality\. This does not mean that the human descriptions are low quality in an absolute sense\. Rather, it shows that descriptions that are semantically similar to human\-written summaries do not necessarily induce paper assignments that recover the bibliometric or citation structure\. Structured LLM pipelines often achieve higher values than the human descriptions on these structure\-sensitive metrics, suggesting that they can complement human\-written descriptions when the goal is to summarize corpus coverage and clustering structure in bibliometric analyses\.

##### Structure improves performance\.

Table [2](https://arxiv.org/html/2605.24351#S5.T2) also shows that LLMs are weak when they must infer bibliometric structure on their own\. The Blind pipeline performs poorly in both bibliographic coupling and citation settings, suggesting that query\-only generation is not reliable for autonomous bibliometric mapping\. Figure [4](https://arxiv.org/html/2605.24351#S5.F4) further shows that this weakness extends to reference grounding\. Under title\-only matching, many Blind references appear to correspond to real papers, but when the validation also requires the publication year and first author to match, precision drops\. Thus, Blind can often produce references that look plausible, but it is less reliable at generating fully accurate citations\.

Providing the full Scopus corpus helps: Corpus and Corpus\-select improve over Blind\. Figure [4](https://arxiv.org/html/2605.24351#S5.F4) further shows that reference\-grounded coverage is strongest for pipelines that rely on selected or representative evidence, especially Corpus\-select and Ranked, while Corpus scores lowest\. This suggests that evidence selection improves local grounding\. However, Table [2](https://arxiv.org/html/2605.24351#S5.T2) shows that corpus\-only pipelines still lag behind pipelines that include clustering structure on cluster and graph quality\. Seeing the papers improves grounding, but it does not guarantee that the model recovers bibliographic or citation structure\.

A strong conclusion is that LLMs are better used as interpreters of bibliometric structure than as generators of that structure from scratch\. Labeled, Labeled\-select, and Ranked perform best overall\. This reflects the main design principle of the pipeline ladder: as more structure is provided to the model, the LLM has less freedom to invent or infer the organization of the field\. In the structured pipelines, the bibliometric clustering determines which papers belong together, and the LLM is increasingly restricted to interpreting those clusters\. The same pattern appears in both bibliographic coupling and citation analysis, but the best form of structure differs by mode\.

##### Task\-dependent structure\.

The best amount of structure depends on the bibliometric relation being summarized\. Bibliographic coupling groups papers through shared references, so its clusters often reflect broad intellectual backgrounds or research traditions\. Because these themes may be distributed across both central and peripheral papers, Labeled performs especially well: the model benefits from access to the full labeled cluster\. Citation analysis, by contrast, is based on direct links among papers and often concentrates structure around influential or central works\. In this setting, Ranked performs especially well because top link\-ranked papers can capture much of the cluster’s lineage, influence pattern, or methodological core\. Overall, broad shared\-reference clusters benefit from richer labeled context, while citation\-based clusters can often be described effectively from compact, central evidence\.

##### Ablations\.

We conduct four ablation studies to test the robustness of the pipeline comparisons \(Appendix [E](https://arxiv.org/html/2605.24351#A5)\)\. We ablate the generative model by comparing GPT\-5\.4 and Gemini\-2\.5\-Flash, the embedding model used for evaluation by comparing all\-mpnet\-base\-v2 and text\-embedding\-3\-large, the prompt formulation by removing or adding query and bibliographic\-coupling context, and the amount of evidence by varying the number of references or top\-ranked papers provided per cluster\. Across these ablations, the main patterns described above remain stable\. Prompt and evidence\-size choices can shift which structured pipeline performs best, but they do not change the broader conclusion that evidence structure matters more than prompt wording or input parameters\.

## 8 Conclusions

In this paper, we evaluated how LLMs support bibliometric cluster analysis by comparing six pipelines with increasing levels of evidence and structure\. Using published bibliometric reviews, we extracted author queries and human cluster descriptions, reconstructed Scopus corpora, and generated cluster descriptions under different workflow designs\. Our results show that LLMs can produce descriptions that are semantically similar to human\-written ones and, when given bibliometric structure, can achieve higher values than human descriptions on corpus\-level, clustering, and graph\-based metrics\. However, they remain weak at inferring bibliometric structure from scratch, even with access to the full corpus\. They work best when bibliometric algorithms first define the clusters and the LLM is used to interpret them\.

The best pipeline also depends on the bibliometric relation\. Bibliographic coupling benefits from full cluster context, while citation analysis can often be summarized from compact link\-ranked evidence\. This suggests that structure helps most when it constrains parts of the task where LLMs are unreliable, but the right amount of structure depends on the reasoning required\. Overall, LLM\-based bibliometric analysis is most promising as a hybrid workflow: algorithms provide auditable structure, and LLMs translate that structure into readable cluster descriptions\.

This finding reflects a broader pattern in LLM system design\. Across retrieval\-augmented generation, tool use, code generation, and data analysis, LLMs are most reliable when external systems provide structure and the model performs synthesis, explanation, or translation\. In bibliometric analysis, algorithms should handle precise structural tasks such as clustering papers by citations or shared references, while LLMs should handle the interpretive task of writing coherent descriptions\.

##### Limitation\.

These results should be interpreted as evidence of structural fidelity, not as a general comparison between LLMs and human experts\. Human descriptions were written for scholarly interpretation, not to optimize embedding\-based coverage, ARI, or modularity\. Thus, when structured LLM pipelines score higher on these metrics, it means they better preserve the reconstructed bibliometric or citation structure used in our evaluation, not that they offer greater domain insight, nuance, or scholarly value\.

## Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada \(NSERC\) under grant DGECR\-2022\-04531\.

## References

- A\. Asai, J\. He, R\. Shao, W\. Shi, A\. Singh, J\. C\. Chang, K\. Lo, L\. Soldaini, S\. Feldman, M\. D’Arcy,et al\.\(2026\)Synthesizing scientific literature with retrieval\-augmented language models\.Nature,pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- S\. Bhatia, J\. H\. Lau, and T\. Baldwin \(2016\)Automatic labelling of topics with neural embeddings\.InProceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers,pp\. 953–963\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- D\. M\. Blei, A\. Y\. Ng, and M\. I\. Jordan \(2003\)Latent dirichlet allocation\.Journal of Machine Learning Research3,pp\. 993–1022\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- V\. D\. Blondel, J\. Guillaume, R\. Lambiotte, and E\. Lefebvre \(2008\)Fast unfolding of communities in large networks\.Journal of Statistical Mechanics: Theory and Experiment2008\(10\),pp\. P10008\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- K\. W\. Boyack and R\. Klavans \(2010\)Co\-citation analysis, bibliographic coupling, and direct citation: which citation approach represents the research front most accurately?\.Journal of the American Society for Information Science and Technology61\(12\),pp\. 2389–2404\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1),[§2](https://arxiv.org/html/2605.24351#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.SSS0.Px3.p4.1)\.
- X\. Chen, H\. Alamro, M\. Li, S\. Gao, X\. Zhang, D\. Zhao, and R\. Yan \(2021\)Capturing relations between scientific papers: an abstractive model for related work section generation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 6068–6077\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- M\. J\. Cobo, A\. G\. López\-Herrera, E\. Herrera\-Viedma, and F\. Herrera \(2011\)Science mapping software tools: review, analysis, and cooperative study among tools\.Journal of the American Society for Information Science and Technology62\(7\),pp\. 1382–1402\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.p1.1)\.
- N\. Donthu, S\. Kumar, D\. Mukherjee, N\. Pandey, and W\. M\. Lim \(2021\)How to conduct a bibliometric analysis: an overview and guidelines\.Journal of Business Research133,pp\. 285–296\.Cited by:[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.p2.1)\.
- Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, and H\. Wang \(2023\)Retrieval\-augmented generation for large language models: a survey\.arXiv preprint arXiv:2312\.10997\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- E\. Garfield \(1955\)Citation indexes for science: a new dimension in documentation through association of ideas\.Science122\(3159\),pp\. 108–111\.Cited by:[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.SSS0.Px3.p5.4)\.
- M\. He, J\. Niu, D\. Liu, P\. Wu, and B\. X\. Hu \(2025\)Remote sensing in river obstruction research: a bibliometric analysis integrated with large language model\.Journal of Hydrology: Regional Studies62,pp\. 102850\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1)\.
- C\. Hsu, E\. Bransom, J\. Sparks, B\. Kuehl, C\. Tan, D\. Wadden, L\. L\. Wang, and A\. Naik \(2024\)Chime: llm\-assisted hierarchical organization of scientific studies for literature review support\.Findings of the Association for Computational Linguistics: ACL 2024,pp\. 118–132\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- Y\. Hu and X\. Wan \(2014\)Automatic generation of related work sections in scientific papers: an optimization approach\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 1624–1633\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- L\. Hubert and P\. Arabie \(1985\)Comparing partitions\.Journal of Classification2\(1\),pp\. 193–218\.Cited by:[§D\.3](https://arxiv.org/html/2605.24351#A4.SS3.p3.3)\.
- T\. Jansen, L\. W\. Liebenow, U\. Mertens, F\. T\. Schmidt, J\. F\. Lohmann, J\. Fleckenstein, and J\. Meyer \(2025\)Data extraction by generative artificial intelligence: assessing determinants of accuracy using human\-extracted data from systematic review databases\.Psychological Bulletin151\(10\),pp\. 1280\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- L\. John, A\. M\. Ghanmi, T\. Wittenborg, S\. Auer, and O\. Karras \(2026\)ExtracTable: human\-in\-the\-loop transformation of scientific corpora into structured knowledge\.InInternational Conference on Theory and Practice of Digital Libraries,pp\. 470–487\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- T\. Kasanishi, M\. Isonuma, J\. Mori, and I\. Sakata \(2023\)SciReviewGen: a large\-scale dataset for automatic literature review generation\.InFindings of the Association for Computational Linguistics: ACL 2023,Toronto, Canada,pp\. 6695–6715\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- P\. B\. Keenan and C\. Heavin \(2026\)Beyond manual review: using llms to systematically map five decades of IFIP WG8\.3 decision\-support research\.Journal of Decision Systems35\(1\),pp\. 2665325\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1)\.
- M\. M\. Kessler \(1963\)Bibliographic coupling between scientific papers\.American Documentation14\(1\),pp\. 10–25\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1),[§2](https://arxiv.org/html/2605.24351#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.SSS0.Px3.p2.4)\.
- J\. H\. Lau, K\. Grieser, D\. Newman, and T\. Baldwin \(2011\)Automatic labelling of topic models\.InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies,pp\. 1536–1545\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- J\. Liu, Z\. Shang, W\. Ke, P\. Wang, Z\. Luo, J\. Liu, G\. Li, and Y\. Li \(2025\)LLM\-guided semantic\-aware clustering for topic modeling\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics,pp\. 18420–18435\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.902)Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- Y\. Lu, Y\. Dong, and L\. Charlin \(2020\)Multi\-XScience: a large\-scale dataset for extreme multi\-document summarization of scientific articles\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 8068–8074\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- Q\. Mei, X\. Shen, and C\. Zhai \(2007\)Automatic labeling of multinomial topic models\.InProceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 490–499\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- M\. E\. J\. Newman \(2004\)Fast algorithm for detecting community structure in networks\.Physical Review E69\(6\),pp\. 066133\.Cited by:[§D\.4](https://arxiv.org/html/2605.24351#A4.SS4.p2.1)\.
- B\. Nykvist, B\. Macura, M\. Xylia, and E\. Olsson \(2025\)Testing the utility of gpt for title and abstract screening in environmental systematic evidence synthesis\.Environmental Evidence14\(1\),pp\. 7\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- V\. Padmakumar, J\. C\. Chang, K\. Lo, D\. Downey, and A\. Naik \(2025\)Intent\-aware schema generation and refinement for literature review tables\.Findings of the Association for Computational Linguistics: EMNLP2025,pp\. 23450–23472\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- B\. Pei and X\. Sun \(2025\)Leveraging llms for streamlining and demystifying the systematic literature review process\.In2025 IEEE Frontiers in Education Conference \(FIE\),pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- D\. J\. d\. S\. Price \(1965\)Networks of scientific papers: the pattern of bibliographic references indicates the nature of the scientific research front\.Science149\(3683\),pp\. 510–515\.Cited by:[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.SSS0.Px3.p5.4)\.
- P\. J\. Rousseeuw \(1987\)Silhouettes: a graphical aid to the interpretation and validation of cluster analysis\.Journal of Computational and Applied Mathematics20,pp\. 53–65\.Cited by:[§D\.3](https://arxiv.org/html/2605.24351#A4.SS3.p6.1)\.
- K\. Sarachuk \(2025\)Still not a remedy for academics: the use of generative AI\-powered tools in bibliometric analysis\.InElectronics, Communications and Networks,pp\. 57–63\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1)\.
- N\. Silva and D\. Wickramaarachchi \(2025\)Enhancing systematic literature reviews: evaluating the performance of llm\-based tools across key systematic literature review stages\.In2025 5th International Conference on Advanced Research in Computing \(ICARC\),pp\. 1–6\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- H\. Small \(1973\)Co\-citation in the scientific literature: a new measure of the relationship between two documents\.Journal of the American Society for information Science24\(4\),pp\. 265–269\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p2.1),[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- X\. Tang, X\. Duan, and Z\. G\. Cai \(2025\)Large language models for automated literature review: an evaluation of reference generation, abstract writing, and review composition\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 1602–1617\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p1.1),[§2](https://arxiv.org/html/2605.24351#S2.p1.1)\.
- S\. Wang, H\. Scells, B\. Koopman, and G\. Zuccon \(2025\)Reassessing large language model boolean query generation for systematic reviews\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 3296–3305\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- X\. Yang, H\. Zhao, W\. Xu, Y\. Qi, J\. Lu, D\. Phung, and L\. Du \(2025\)Neural topic modeling with large language models in the loop\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1377–1401\.Cited by:[§2](https://arxiv.org/html/2605.24351#S2.p2.1)\.
- A\. Ye, A\. Maiti, M\. Schmidt, and S\. J\. Pedersen \(2024\)A hybrid semi\-automated workflow for systematic and literature review processes with large language model analysis\.Future Internet16\(5\),pp\. 167\.Cited by:[§1](https://arxiv.org/html/2605.24351#S1.p4.1)\.
- D\. Yin \(2024\)Quantitative analysis of scholars’ topic switching behavior in computer science: a two\-dimensional metric approach\.IEEE Access12,pp\. 104263–104271\.Cited by:[§3\.1](https://arxiv.org/html/2605.24351#S3.SS1.SSS0.Px3.p8.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2019\)BERTScore: evaluating text generation with bert\.arXiv preprint arXiv:1904\.09675\.Cited by:[§D\.1](https://arxiv.org/html/2605.24351#A4.SS1.p1.1)\.

## Appendix A Ethical Considerations

LLM\-assisted bibliometric analysis may influence how researchers understand a field\. Incorrect descriptions, unsupported references, or misleading cluster labels could distort scientific interpretation\. We therefore evaluate reference grounding and distinguish in\-corpus references from unsupported or invalid references\.

The study uses bibliographic metadata and abstracts from Scopus\. Licensing constraints may limit redistribution of raw records\. Where raw Scopus data cannot be released, we will release prompts, code, derived identifiers, and evaluation scripts to support reproducibility within licensing constraints\.

## Appendix B Data Collection and Pipeline Description

### B\.1 Data Collection

All pipelines required the number of clusters as an input parameter\. This value was derived from the human\-generated cluster descriptions, which served as the reference for comparison\.

To construct this reference set, we collected 100 bibliometric studies, whose corpus characteristics are summarized in Table [3](https://arxiv.org/html/2605.24351#A2.T3), and extracted the cluster descriptions reported in their bibliographic coupling analyses\. For each PDF document, we generated a CSV file containing one row per cluster description\. The number of rows in each CSV file determined the number of clusters, which was passed as an input parameter to the corresponding pipeline\.

In addition, for each of the 100 studies, we extracted the search strategy used to construct the original corpus of papers\. Specifically, we identified whether the study relied on Scopus or Web of Science, extracted the reported search query, and incorporated all available filters whenever possible\. The query was subsequently executed in the corresponding database to retrieve the dataset of papers\. When certain filters could not be implemented directly within the query, they were applied manually after retrieval\.

Therefore, for each bibliometric study, three main elements were obtained: the human\-generated cluster descriptions, the database query, and the corresponding dataset of papers\. These elements were used as inputs across the pipelines, with the exception of Blind, which used only the query and the number of clusters derived from the human\-generated cluster descriptions\.

Table 3: Summary of benchmark corpus characteristics\.

### B\.2 Pipeline Output

Each pipeline produced a CSV file as output\. This file contained the cluster identifier, the cluster description, and the references associated with each cluster\.

### B\.3 Pipelines

All six pipelines use the same base prompt, which defines the task, output structure, and formatting requirements\. The complete prompt schema is reported in Appendix [C](https://arxiv.org/html/2605.24351#A3)\.

#### B\.3\.1 Blind

Blind used three inputs: the search query, the base prompt, and the target number of clusters\. The query provided the thematic context of the study, while the base prompt specified the task instructions and the expected structure of the response\. The target number of clusters defined the exact number of clusters to be generated\. These inputs were passed to the LLM, which generated a structured response in JSON format\. The JSON output was then parsed and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.
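The parse-and-save step can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the function names, the `" | "` separator used to join references in one CSV cell, and the example JSON string are hypothetical and not taken from the authors' code.

```python
import csv
import io
import json

REQUIRED_FIELDS = {"cluster_id", "description", "references"}

def parse_clusters(raw_json, target_cluster_count):
    """Parse the model's JSON array and check the basic output contract."""
    clusters = json.loads(raw_json)
    if not isinstance(clusters, list) or len(clusters) != target_cluster_count:
        raise ValueError("expected a JSON array with one object per cluster")
    for obj in clusters:
        if not isinstance(obj, dict):
            raise ValueError("each cluster must be a JSON object")
        missing = REQUIRED_FIELDS - set(obj)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    return clusters

def clusters_to_csv(clusters, handle):
    """Write one row per cluster; references share one cell (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(["cluster_id", "description", "references"])
    for obj in clusters:
        writer.writerow([
            obj["cluster_id"],
            obj["description"],
            " | ".join(obj["references"]),
        ])

# Hypothetical model response with a single cluster.
raw = '[{"cluster_id": 1, "description": "Theme A", "references": ["Ref X", "Ref Y"]}]'
buffer = io.StringIO()
clusters_to_csv(parse_clusters(raw, 1), buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Checking the cluster count and required fields before writing means a malformed response fails early instead of producing a CSV with missing columns.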

#### B\.3\.2 Corpus

Corpus used four inputs: the search query, the base prompt, the target number of clusters, and the Scopus/Web of Science corpus context\. The query provided the thematic context of the study, while the base prompt specified the task instructions and the expected output format\. The target number of clusters defined the exact number of clusters to be generated, and the corpus context provided the records used as the evidence base\. These inputs were passed to the LLM, which generated a structured response in JSON format\. The JSON output was then parsed and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.

#### B\.3\.3 Corpus\-select

Corpus\-select used four inputs: the search query, the base prompt, the target number of clusters, and the Scopus/Web of Science corpus context\. The query provided the thematic context of the study, while the base prompt specified the task instructions\. The target number of clusters defined the exact number of clusters to be generated, and the corpus context provided the records used as the evidence base\.

This pipeline was implemented in two stages\. First, the LLM identified the clusters and selected the most relevant references for each cluster, producing a structured list of cluster identifiers and associated paper identifiers\. Second, the selected references were used to construct a reduced cluster\-specific context\. This focused context was then passed back to the LLM to generate the thematic description of each cluster\. The final response was generated in JSON format, parsed, and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.
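The two-stage flow can be sketched as below. This is an illustration only: `call_llm` is a hypothetical callable that takes a prompt string and returns a JSON string, the prompt texts are placeholders, paper identifiers are assumed to be integers, and the context layout is an assumption rather than the authors' format.

```python
import json

def two_stage_describe(papers, select_prompt, describe_prompt, call_llm):
    """papers: dict paper_id -> abstract text.

    Stage 1 asks the model for clusters and their selected [#] references.
    Stage 2 sends back only the selected papers and asks for descriptions.
    """
    selection = json.loads(call_llm(select_prompt))
    kept_refs = {}
    contexts = []
    for item in selection:
        cid = item["cluster_id"]
        # Keep only well-formed [#] tokens that point at papers in the corpus.
        ids = [int(tok[1:-1]) for tok in item["references"]
               if tok.startswith("[") and tok.endswith("]") and tok[1:-1].isdigit()]
        ids = [pid for pid in ids if pid in papers]
        kept_refs[cid] = [f"[{pid}]" for pid in ids]
        block = "\n".join(f"[{pid}] {papers[pid]}" for pid in ids)
        contexts.append(f"Cluster {cid}:\n{block}")
    # Reduced, cluster-specific context for the description step.
    prompt = describe_prompt + "\n\n" + "\n\n".join(contexts)
    descriptions = json.loads(call_llm(prompt))
    by_id = {d["cluster_id"]: d["description"] for d in descriptions}
    return [
        {"cluster_id": cid, "description": by_id.get(cid, ""), "references": refs}
        for cid, refs in kept_refs.items()
    ]

# Stand-in for a real model call, used only to exercise the control flow.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    if len(calls) == 1:
        return json.dumps([{"cluster_id": 1, "references": ["[2]", "[9]"]}])
    return json.dumps([{"cluster_id": 1, "description": "Topic A"}])

result = two_stage_describe({1: "x", 2: "y"}, "select", "describe", fake_llm)
```

In this sketch, references that do not resolve to a provided paper (here `[9]`) are dropped before the second call, so the description step only sees in-corpus evidence.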

### B\.4 Louvain Algorithm

For Labeled, Labeled\-select, and Ranked, the Scopus/Web of Science corpus context was enriched with cluster labels\. These labels were generated using a bibliographic coupling and Louvain community detection procedure\. In this approach, each paper was represented by its reference list, and papers were considered related when they cited the same prior works\. The references were used to construct a paper\-by\-reference matrix, from which a bibliographic coupling matrix was computed\. This matrix captured the degree of reference overlap between pairs of papers and was then normalized using association strength to obtain a similarity score\.
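The coupling and normalization steps can be sketched in plain Python. Counting shared references with set intersections is equivalent to the off-diagonal entries of the product of a binary paper-by-reference matrix with its transpose. The normalization shown, coupling count divided by the product of each paper's total coupling, is one common form of association strength; the exact variant the authors used is not stated here, so treat it as an assumption, along with the function name and toy data.

```python
from itertools import combinations

def coupling_similarity(reference_lists):
    """reference_lists: dict paper_id -> iterable of cited-work ids.

    Returns dict (paper_a, paper_b) -> association-strength similarity
    for every pair of papers sharing at least one reference.
    """
    refs = {paper: set(cited) for paper, cited in reference_lists.items()}
    # Bibliographic coupling: number of references two papers share.
    coupling = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            coupling[(a, b)] = shared
    # Total coupling per paper, used as the normalizing term.
    totals = dict.fromkeys(refs, 0)
    for (a, b), count in coupling.items():
        totals[a] += count
        totals[b] += count
    # Association strength (assumed variant): c_ij / (c_i * c_j).
    return {
        (a, b): count / (totals[a] * totals[b])
        for (a, b), count in coupling.items()
    }

# Toy corpus: three papers and their reference lists.
toy = {
    "p1": ["r1", "r2", "r3"],
    "p2": ["r2", "r3"],
    "p3": ["r3", "r9"],
}
similarity = coupling_similarity(toy)
```

Here `p1` and `p2` share two references while every other pair shares one, so their similarity is the largest of the three.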

The resulting similarities were used to build a weighted network, where nodes represented papers and edges represented bibliographic coupling relationships above a minimum threshold\. Louvain community detection was then applied to identify clusters of papers\. When human\-generated cluster descriptions were available, the Louvain resolution parameter was adjusted so that the number of detected communities approximated the number of human clusters\. After clustering, each paper received a cluster label, and a link strength value was computed to measure its centrality within the bibliographic coupling network\. These labels were incorporated into the corpus context used as input for Labeled, Labeled\-select, and Ranked\.
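The resolution adjustment can be sketched as a simple bisection search. The sketch is independent of any particular clustering library: `cluster_fn` is a hypothetical callable that maps a resolution value to a list of communities (in practice it might wrap a routine such as networkx's `louvain_communities`, which accepts `weight`, `resolution`, and `seed` arguments). The search bounds and step count are arbitrary assumptions, and the search relies on the community count tending to grow with resolution, which is typical for Louvain but not guaranteed.

```python
def tune_resolution(cluster_fn, target, low=0.1, high=5.0, steps=25):
    """Find a resolution whose clustering has about `target` communities.

    Returns (resolution, communities) for the closest count seen.
    """
    best_res, best_parts = None, None
    for _ in range(steps):
        mid = (low + high) / 2
        parts = cluster_fn(mid)
        # Remember the clustering whose size is closest to the target.
        if best_parts is None or abs(len(parts) - target) < abs(len(best_parts) - target):
            best_res, best_parts = mid, parts
        if len(parts) == target:
            break
        if len(parts) < target:
            low = mid      # too few communities: raise the resolution
        else:
            high = mid     # too many communities: lower the resolution
    return best_res, best_parts

# Stand-in clustering whose community count grows with resolution.
def fake_cluster(resolution):
    return [set() for _ in range(max(1, int(resolution * 2)))]

resolution, communities = tune_resolution(fake_cluster, target=4)
```

Because Louvain is randomized, a real `cluster_fn` would usually fix a seed so that repeated calls at the same resolution return the same partition.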

### B\.5 Label\-Based Pipelines

#### B\.5\.1 Labeled

Labeled used four inputs: the search query, the base prompt, the target number of clusters, and the labeled Scopus/Web of Science corpus context\. The query provided the thematic context of the study, while the base prompt specified the task instructions and required output format\. The target number of clusters indicated how many clusters had to be described, and the labeled corpus context provided the records together with their assigned cluster labels\.

Using this labeled corpus context as the evidence base, the LLM generated one thematic description for each existing cluster while preserving the provided cluster identities\. The final response was generated in JSON format, parsed, and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.

#### B\.5\.2 Labeled\-select

Labeled\-select used four inputs: the search query, the base prompt, the target number of clusters, and the labeled Scopus/Web of Science corpus context\. The query provided the thematic context of the study, while the base prompt specified the task instructions\. The target number of clusters indicated how many clusters had to be processed, and the labeled corpus context provided the records together with their assigned cluster labels\.

This pipeline was implemented in two stages\. First, the LLM selected the most relevant references within each labeled cluster, producing a structured list of cluster identifiers and associated paper identifiers\. Second, the selected references were used to construct a reduced cluster\-specific context\. This focused context was then passed back to the LLM to generate one thematic description for each cluster while preserving the original cluster labels\. The final response was generated in JSON format, parsed, and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.

#### B\.5\.3 Ranked

Ranked used four inputs: the search query, the base prompt, the target number of clusters, and the selected cluster context\. The query provided the thematic context of the study, while the base prompt specified the task instructions and required output format\. The target number of clusters indicated how many clusters had to be described, and the selected cluster context provided a reduced evidence set composed of the most representative papers from each labeled cluster\.

Using this focused evidence set, the LLM generated one thematic description for each selected cluster while preserving the provided cluster labels\. The final response was generated in JSON format, parsed, and saved as a CSV file containing the cluster identifier, cluster description, and associated references\.
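Selecting representative papers per cluster can be sketched as a top-k ranking by link strength. Two assumptions are made here: link strength is taken to be the sum of a paper's incident edge weights, and `k` is an arbitrary cutoff. Neither is confirmed as the authors' exact definition, and the function names are hypothetical.

```python
from collections import defaultdict

def total_link_strength(edges):
    """Sum of incident edge weights per paper (assumed definition)."""
    strength = defaultdict(float)
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
    return dict(strength)

def top_ranked(labels, strength, k=10):
    """labels: dict paper -> cluster label.

    Returns dict cluster -> up to k papers, strongest first.
    Ties are broken by paper id so the output is deterministic.
    """
    by_cluster = defaultdict(list)
    for paper, cluster in labels.items():
        by_cluster[cluster].append(paper)
    return {
        cluster: sorted(papers, key=lambda p: (-strength.get(p, 0.0), str(p)))[:k]
        for cluster, papers in by_cluster.items()
    }

# Toy network with two clusters.
toy_edges = [("a", "b", 3), ("a", "c", 1), ("d", "e", 2)]
toy_labels = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
strength = total_link_strength(toy_edges)
selected = top_ranked(toy_labels, strength, k=2)
```

The selected papers per cluster would then form the reduced evidence set passed to the model in place of the full labeled corpus.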

## Appendix C Prompt Schema

### C\.1 Blind

#### C\.1\.1 Prompt 1

You are an expert research analyst in bibliometrics\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query: \{\{query\}\}

Task: Identify exactly \{\{target\_cluster\_count\}\} coherent thematic clusters within this research topic\. The number of clusters must exactly match the human reference clustering\. Use cluster\_id values 1 through \{\{target\_cluster\_count\}\}\.

For each cluster:

- write a precise academic description of the theme, main lines of inquiry, and distinctive focus
- use less than 250 words to describe the cluster
- provide references from your prior knowledge only
- use no more than 10 references per cluster
- do not browse or claim internet access
- do not invent references if uncertain

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, description, references
- references must be an array of bibliographic strings
- cluster\_id must be numeric

### C\.2 Corpus

#### C\.2\.1 Prompt 1

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the Scopus records below\. Identify exactly \{\{target\_cluster\_count\}\} thematic clusters supported by this corpus, describe each cluster, and select the most relevant papers for each cluster in the same response\. The number of clusters must exactly match the human reference clustering\. Use cluster\_id values 1 through \{\{target\_cluster\_count\}\}\. Use less than 250 words to describe each cluster\.

Reference rules:

- every reference must be written exactly as \[\#\]
- \# must correspond to one of the provided paper\_id values
- only cite papers from the provided Scopus records
- only select papers that are truly relevant for describing the cluster
- use no more than 10 references per cluster

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, description, references
- references must be an array of \[\#\] tokens

Scopus records: \{\{scopus\_context\}\}
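The reference rules in this prompt lend themselves to an automatic check on the model's output. The sketch below is illustrative only: it assumes `paper_id` values are integers, and the function name and returned problem messages are hypothetical.

```python
import re

TOKEN = re.compile(r"^\[(\d+)\]$")

def check_references(clusters, valid_ids, max_refs=10):
    """Return a list of (cluster_id, problem) tuples; empty means the rules hold.

    clusters: parsed JSON array of objects with cluster_id and references.
    valid_ids: set of integer paper_id values that were provided in the prompt.
    """
    problems = []
    for obj in clusters:
        cid = obj.get("cluster_id")
        refs = obj.get("references", [])
        if len(refs) > max_refs:
            problems.append((cid, "too many references"))
        for tok in refs:
            found = TOKEN.match(str(tok).strip())
            if found is None:
                problems.append((cid, f"malformed token: {tok!r}"))
            elif int(found.group(1)) not in valid_ids:
                problems.append((cid, f"unknown paper_id: {found.group(1)}"))
    return problems

# Hypothetical model output checked against a three-paper corpus.
output = [
    {"cluster_id": 1, "description": "Theme A", "references": ["[1]", "[2]"]},
    {"cluster_id": 2, "description": "Theme B", "references": ["[7]", "Smith 2020"]},
]
issues = check_references(output, valid_ids={1, 2, 3})
```

A check of this kind separates in-corpus references from unsupported ones without needing any external lookup.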

### C\.3 Corpus\-select

#### C\.3\.1 Prompt 1

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the Scopus records below\. Identify exactly \{\{target\_cluster\_count\}\} thematic clusters and select the most relevant papers for each cluster\. The number of clusters must exactly match the human reference clustering\. Use cluster\_id values 1 through \{\{target\_cluster\_count\}\}\.

Reference rules:

- every reference must be written exactly as \[\#\]
- \# must correspond to one of the provided paper\_id values
- only select papers that are truly relevant for describing the cluster
- use no more than 10 references per cluster

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, references
- references must be an array of \[\#\] tokens

Scopus records: \{\{scopus\_context\}\}

#### C\.3\.2 Prompt 2

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the selected cluster papers below\. Write one academic description for each of the exactly \{\{target\_cluster\_count\}\} clusters\. Use cluster\_id values 1 through \{\{target\_cluster\_count\}\}\. Use less than 250 words to describe each cluster\.

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, description
- do not include references in this step

Selected cluster papers: \{\{cluster\_contexts\}\}

### C\.4 Labeled

#### C\.4\.1 Prompt 1

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the labeled Scopus records below\. Each paper already has a cluster label, and you must preserve those labels\. Write one academic description for each of the exactly \{\{target\_cluster\_count\}\} labeled clusters using the whole labeled corpus as context\. Use less than 250 words to describe each cluster\.

Reference rules:

- every reference must be written exactly as \[\#\]
- use only the provided paper\_id values
- do not cite papers outside the labeled Scopus records
- use no more than 10 references per cluster

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, description, references
- cluster\_id must match one of the provided labeled cluster values
- references must be an array of \[\#\] tokens

Labeled Scopus records: \{\{labeled\_scopus\_context\}\}

### C\.5 Labeled\-select

#### C\.5\.1 Prompt 1

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the labeled Scopus records below\. Each paper already has a cluster label, and you must preserve those labels\. Select the papers that are most relevant for writing a representative description for each of the exactly \{\{target\_cluster\_count\}\} labeled clusters\.

Reference rules:

- every reference must be written exactly as \[\#\]
- use only the provided paper\_id values
- for each cluster\_id, only select papers that belong to that same labeled cluster
- use no more than 10 references per cluster

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, references
- cluster\_id must match one of the provided labeled cluster values
- references must be an array of \[\#\] tokens

Labeled Scopus records: \{\{labeled\_scopus\_context\}\}

#### C\.5\.2 Prompt 2

You are an expert in bibliometrics and scientific literature analysis\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

Use ONLY the labeled Scopus records below\. Each paper already has a cluster label, and you must preserve those labels\. Select the papers that are most relevant for writing a representative description for each of the exactly \{\{target\_cluster\_count\}\} labeled clusters\.

Reference rules:

- every reference must be written exactly as \[\#\]
- use only the provided paper\_id values
- for each cluster\_id, only select papers that belong to that same labeled cluster
- use no more than 10 references per cluster

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, references
- cluster\_id must match one of the provided labeled cluster values
- references must be an array of \[\#\] tokens

Labeled Scopus records: \{\{labeled\_scopus\_context\}\}

### C\.6 Ranked

#### C\.6\.1 Prompt 1

You are an expert research analyst writing bibliometric cluster descriptions\.

This task is specifically based on bibliographic coupling, not general bibliometric analysis\.

Query/context: \{\{query\}\}

You will be provided multiple bibliographic coupling clusters at once\. For each cluster, you will receive the top papers of that cluster ranked by link strength\. Use ONLY the selected cluster papers below, and preserve the provided cluster labels\.

Your task: Write one coherent academic description for each of the exactly \{\{target\_cluster\_count\}\} clusters\.

Strict rules:

- use ONLY the information provided below
- do NOT introduce external knowledge or invent papers, methods, findings, datasets, journals, years, or topics
- base all statements strictly on patterns visible in the provided titles and abstracts
- identify the main research theme of each cluster
- emphasize the topic, subtheme, angle, or intellectual profile that makes each cluster distinctive relative to the others in this set
- do not infer distinctions that are not clearly supported by the provided information
- use less than 250 words to describe each cluster
- every reference must be written exactly as \[\#\]
- use only the provided paper\_id values
- each cluster description must cite at least 3 different papers from that same cluster when enough papers are available
- use no more than 10 references per cluster
- write each cluster description as one dense academic paragraph or two short paragraphs

Output requirements:

- return ONLY valid JSON
- return a JSON array with exactly \{\{target\_cluster\_count\}\} objects
- each object must contain: cluster\_id, description, references
- cluster\_id must match one of the provided cluster labels
- references must be an array of \[\#\] tokens

Selected cluster papers: \{\{selected\_clusters\_context\}\}
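The output requirements shared by the prompts above lend themselves to a mechanical check before any metric is computed. The following is a minimal sketch and not the authors' code: the function `validate_output`, its parameters, and the regular expression for a `[#]` token are all assumptions introduced for illustration, and paper_id values are assumed to be compared as strings.

```python
import json
import re

# Assumed shape of a "[#]" reference token: one bracketed identifier.
REF_TOKEN = re.compile(r"^\[([^\[\]]+)\]$")


def validate_output(raw, target_cluster_count, valid_paper_ids,
                    required_keys=("cluster_id", "description", "references"),
                    max_refs=10):
    """Return a list of problems found in a model response (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value is not a JSON array"]

    problems = []
    if len(data) != target_cluster_count:
        problems.append(f"expected {target_cluster_count} objects, got {len(data)}")
    for idx, obj in enumerate(data):
        if not isinstance(obj, dict):
            problems.append(f"item {idx}: not a JSON object")
            continue
        for key in required_keys:
            if key not in obj:
                problems.append(f"item {idx}: missing {key}")
        refs = obj.get("references", [])
        if not isinstance(refs, list):
            problems.append(f"item {idx}: references is not an array")
            continue
        if len(refs) > max_refs:
            problems.append(f"item {idx}: more than {max_refs} references")
        for ref in refs:
            match = REF_TOKEN.match(ref) if isinstance(ref, str) else None
            if match is None:
                problems.append(f"item {idx}: malformed reference {ref!r}")
            elif match.group(1) not in valid_paper_ids:
                problems.append(f"item {idx}: unknown paper_id {match.group(1)}")
    return problems
```

For the selection-only prompts, which ask for `cluster_id` and `references` but no description, the same check could be called with `required_keys=("cluster_id", "references")`.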

![Refer to caption](https://arxiv.org/html/2605.24351v1/bibliographic_and_citation.png) Figure 5: Boxplots of pipeline performance across papers under the bibliographic and citation settings\. Panels \(a–d\) correspond to the bibliographic setting and panels \(e–h\) to the citation setting\.

## Appendix D Evaluation

### D\.1 Human Comparison

We compare generated cluster descriptions with the human\-authored benchmark descriptions using BERTScore \(Zhang et al\., [2019](https://arxiv.org/html/2605.24351#bib.bib54)\)\. For each source paper $r_i$, the human benchmark is

$$H_i = \{h_{i1}, h_{i2}, \ldots, h_{iK_i}\},$$

and the pipeline output is

$$Y_i = \{(l_{i1}, d_{i1}), \ldots, (l_{iK_i}, d_{iK_i})\},$$

where $d_{ik}$ is the generated description for cluster $k$\. We compute the BERTScore F1 value between every generated description $d_{iu}$ and every human description $h_{iv}$, producing an alignment matrix $A_i \in \mathbb{R}^{K_i \times K_i}$:

$$A_{iuv} = \operatorname{BERTScore}_{F1}(d_{iu}, h_{iv}).$$

We then solve a one\-to\-one assignment problem that maximizes the total semantic alignment between generated and human descriptions\. Let $M_i^{*}$ denote the optimal set of matched generated–human pairs\. Since each generated output contains $K_i$ clusters, the human\-comparison score for source paper $r_i$ is the mean BERTScore F1 across the $K_i$ optimally matched pairs:

$$\operatorname{HumanBERT}(i) = \frac{1}{K_i} \sum_{(u,v) \in M_i^{*}} A_{iuv}.$$
Higher values indicate that the generated cluster descriptions are more semantically similar to the human\-authored benchmark descriptions\.
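The matching step can be illustrated with a small sketch. It assumes the alignment matrix has already been computed and uses made-up similarity values rather than real BERTScore output. For brevity it searches all permutations, which is only practical for small $K_i$; the text does not state which solver was used, and the function name `human_bert` is hypothetical. A Hungarian-style solver such as SciPy's `linear_sum_assignment` with `maximize=True` would be the usual choice at scale.

```python
from itertools import permutations


def human_bert(alignment):
    """Mean similarity over the best one-to-one matching of a square matrix.

    alignment[u][v] is the similarity between generated description u and
    human description v (BERTScore F1 in the paper; toy numbers here).
    Returns (score, matched_pairs).
    """
    k = len(alignment)
    if k == 0:
        raise ValueError("alignment matrix must be non-empty")
    best_total, best_perm = float("-inf"), None
    # Brute force: perm[u] is the human description matched to generated u.
    for perm in permutations(range(k)):
        total = sum(alignment[u][perm[u]] for u in range(k))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total / k, list(enumerate(best_perm))


# Toy alignment matrix (invented values, not results from the paper).
A = [
    [0.90, 0.80, 0.70],
    [0.85, 0.60, 0.75],
    [0.70, 0.88, 0.65],
]
score, pairs = human_bert(A)
```

Here the optimal matching pairs generated description 0 with human description 0, 1 with 2, and 2 with 1, so the score is the mean of 0.90, 0.75, and 0.88.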

### D\.2 Semantic Evaluation

The semantic evaluation asks whether the generated cluster descriptions collectively cover the content of the reconstructed Scopus corpus $D_i$\. Let

$$S_i = \{s_{i1}, \ldots, s_{ia}\}$$

be the set of sentence\-level textual atoms extracted from $D_i$, and let

$$B_i = \{b_{i1}, \ldots, b_{ir}\}$$

be the set of textual atoms extracted from the generated cluster descriptions\.

We embed all corpus and description atoms into a shared embedding space\. For each corpus sentence $s_{ij}$, we compute its maximum cosine similarity to any generated description atom:

$$\operatorname{cov}(s_{ij}) = \max_{b \in B_i} \cos(e(s_{ij}), e(b)).$$

The semantic coverage score for instance $i$ is:

$$\operatorname{Coverage}(i) = \frac{1}{|S_i|} \sum_{s_{ij} \in S_i} \operatorname{cov}(s_{ij}).$$
This metric measures how well the generated descriptions collectively cover the semantic content of the corpus\. We treat it as a scalable proxy for coverage rather than as a direct measure of factual entailment\.
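As a rough illustration of the coverage formula, the sketch below takes embeddings as plain lists of floats. The two-dimensional vectors are invented for the example and stand in for real sentence embeddings; the function names `cosine` and `coverage` are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def coverage(corpus_vecs, description_vecs):
    """Mean over corpus atoms of the best cosine match among description atoms."""
    if not corpus_vecs or not description_vecs:
        raise ValueError("both atom sets must be non-empty")
    per_atom = [max(cosine(s, b) for b in description_vecs) for s in corpus_vecs]
    return sum(per_atom) / len(per_atom)


# Toy embeddings: three corpus atoms, one description atom.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
description = [[1.0, 0.0]]
toy_coverage = coverage(corpus, description)
```

With these toy vectors the per-atom values are 1, 0, and 1/√2, so a single description atom leaves the orthogonal corpus atom uncovered and pulls the average down.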

### D\.3 Clustering Evaluation

The clustering evaluation asks whether the generated descriptions induce paper\-level assignments that recover the bibliometric or citation clustering $Z_i$\. For each generated cluster $k$, let $d_{ik}$ be the text representation formed by concatenating its label and description\. For each paper $p_{ij}$, let $a_{ij}$ be its title\-and\-abstract representation\. We assign each paper to the most similar generated cluster:

$$\hat{z}_{ij} = \arg\max_{k \in \{1, \ldots, K_i\}} \cos(e(a_{ij}), e(d_{ik})).$$

This produces an induced partition $\hat{Z}_i$ of the reconstructed corpus\. Our primary clustering metric is Adjusted Rand Index \(ARI\) \(Hubert and Arabie, [1985](https://arxiv.org/html/2605.24351#bib.bib14)\), computed between the induced partition $\hat{Z}_i$ and the Louvain bibliometric or citation clustering $Z_i$:

$$\operatorname{ARI}(\hat{Z}_i, Z_i).$$
ARI measures whether the paper groups implied by the generated descriptions match the clusters produced by the standard bibliometric clustering procedure\. For corpus\-based pipelines, this evaluates whether the LLM recovers the algorithmic cluster structure from text alone\. For labeled and ranked pipelines, it evaluates whether the generated descriptions preserve the meaning of the clusters they were given\.
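The two steps, nearest-description assignment followed by ARI, can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions: the embeddings are toy vectors, the function names are hypothetical, and ARI is computed from the standard pair-counting formula rather than from any particular library (scikit-learn's `adjusted_rand_score` could serve as a cross-check).

```python
import math
from collections import Counter


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def assign_papers(paper_vecs, cluster_vecs):
    """Assign each paper to the index of its most similar cluster description."""
    return [
        max(range(len(cluster_vecs)), key=lambda k: cosine(p, cluster_vecs[k]))
        for p in paper_vecs
    ]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length >= 2")
    # Pairs of items that fall together in each contingency cell, row, column.
    sum_cells = sum(math.comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: three papers, two cluster descriptions.
induced = assign_papers([[1.0, 0.1], [0.2, 1.0], [0.9, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because ARI is invariant to how clusters are numbered, the induced cluster indices do not need to match the identifiers used by the reference clustering.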

As a secondary clustering metric, we compute the silhouette score \(Rousseeuw, [1987](https://arxiv.org/html/2605.24351#bib.bib13)\) of the induced partition $\hat{Z}_i$ in abstract embedding space\. The silhouette score measures whether papers assigned to the same generated cluster are semantically closer to one another than to papers assigned to different clusters\. While ARI measures agreement with the bibliometric clustering, silhouette measures semantic separability of the induced groups\.

### D\.4 Graph Evaluation

The graph evaluation asks whether the induced partition $\hat{Z}_i$ respects the bibliographic or citation structure of the corpus\. This is distinct from ARI: ARI compares the induced assignments to the Louvain clustering $Z_i$, while graph modularity directly measures how well the induced assignments align with the underlying paper\-relation structure\.

Given induced assignments $\hat{Z}_i$, we compute modularity \(Newman, [2004](https://arxiv.org/html/2605.24351#bib.bib5)\):

$$Q = \frac{1}{2w} \sum_{u,v} \left[ w_{uv} - \frac{k_u k_v}{2w} \right] \mathbb{I}[\hat{z}_u = \hat{z}_v],$$

where $w_{uv}$ is the edge weight between papers $u$ and $v$, $k_u$ is the weighted degree of paper $u$, and $2w = \sum_{u,v} w_{uv}$\.

Higher modularity indicates that the clusters induced by the generated descriptions correspond more strongly to dense regions of the bibliographic or citation relation structure\.
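A direct transcription of the modularity formula is sketched below for a small symmetric weight matrix. The four-node graph is a toy example and the function name `modularity` is hypothetical; the double loop runs over ordered pairs, matching the definition of $2w$ as the sum over all $u, v$.

```python
def modularity(weights, assignment):
    """Modularity Q of a partition on a weighted undirected graph.

    weights: symmetric matrix as a list of lists, weights[u][v] = w_uv.
    assignment: assignment[u] is the cluster of node u.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]  # weighted degree k_u
    two_w = sum(degree)                     # 2w = sum over all ordered pairs
    if two_w == 0:
        raise ValueError("graph has no edge weight")
    q = 0.0
    for u in range(n):
        for v in range(n):
            if assignment[u] == assignment[v]:
                q += weights[u][v] - degree[u] * degree[v] / two_w
    return q / two_w


# Toy graph: two disconnected edges, 0-1 and 2-3, each with weight 1.
W = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(W, [0, 0, 1, 1])   # partition follows the edges
q_merged = modularity(W, [0, 0, 0, 0])  # everything in one cluster
```

In this toy case the partition that follows the two edges reaches Q = 0.5, while merging all nodes into one cluster gives Q = 0.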

### D\.5 Reference Grounding

We evaluate reference grounding in two ways\. For *Blind*, we validate whether generated references correspond to real bibliographic records\. Because this pipeline may generate references without access to the reconstructed Scopus corpus, each reference is matched to the closest OpenAlex record and evaluated under three increasingly strict criteria: title match, title plus publication year, and title plus publication year plus first author\. Title matching uses fuzzy string similarity with a threshold of 80, year matching requires the same publication year, and author matching requires the same normalized first\-author surname\.
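The three nested criteria can be sketched as below. The text does not name the fuzzy-matching library or the exact normalization, so this example substitutes Python's standard `difflib.SequenceMatcher` scaled to 0–100 and a simple accent-stripping normalizer; both are assumptions, as are the record fields (`title`, `year`, `first_author`) and the toy records themselves.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents, collapse whitespace (an assumed normalization)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def title_similarity(a, b):
    """Similarity on a 0-100 scale; difflib stands in for the unnamed fuzzy matcher."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_levels(generated, candidate, threshold=80.0):
    """Evaluate the three increasingly strict criteria for one reference."""
    title_ok = title_similarity(generated["title"], candidate["title"]) >= threshold
    year_ok = title_ok and generated["year"] == candidate["year"]
    author_ok = year_ok and (
        normalize(generated["first_author"]) == normalize(candidate["first_author"])
    )
    return {"title": title_ok, "title_year": year_ok, "title_year_author": author_ok}


# Toy records (invented for illustration, not real bibliographic entries).
generated = {"title": "A toy title about citaton networks", "year": 2020,
             "first_author": "Díaz"}
candidate = {"title": "A Toy Title About Citation Networks", "year": 2020,
             "first_author": "diaz"}
levels = match_levels(generated, candidate)
```

Each stricter criterion is only satisfied when the looser one already holds, so the three precision values can only decrease from title to title plus year to title plus year plus author.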

For the other pipelines, we evaluate whether generated references are grounded in the provided evidence\. Each generated reference is classified as in\-corpus valid, out\-of\-corpus valid, or invalid\. In\-corpus valid references point to papers in the reconstructed Scopus corpus $D_i$\. Out\-of\-corpus valid references correspond to real papers outside $D_i$, while invalid references cannot be matched to a real paper or valid corpus identifier\. This distinction is especially important because less structured pipelines may rely on parametric memory, whereas structured pipelines constrain references to retrieved papers, algorithmic clusters, or link\-ranked evidence\.
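The three-way classification can be expressed as a short decision rule. In the sketch below, `is_real_paper` is a hypothetical callable standing in for whatever external lookup decides that a reference corresponds to a real paper; how that check is implemented is not specified here, and the identifiers are toy values.

```python
from collections import Counter

LABELS = ("in_corpus_valid", "out_of_corpus_valid", "invalid")


def classify_reference(ref_id, corpus_ids, is_real_paper):
    """Classify one generated reference against the reconstructed corpus."""
    if ref_id in corpus_ids:
        return "in_corpus_valid"
    if is_real_paper(ref_id):  # hypothetical external check
        return "out_of_corpus_valid"
    return "invalid"


def grounding_profile(ref_ids, corpus_ids, is_real_paper):
    """Share of references falling in each of the three classes."""
    counts = Counter(classify_reference(r, corpus_ids, is_real_paper) for r in ref_ids)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Toy identifiers: two corpus papers, one known external paper, one unmatched.
corpus_ids = {"p1", "p2"}
known_external = {"x9"}
profile = grounding_profile(["p1", "p2", "x9", "zz"], corpus_ids,
                            lambda ref: ref in known_external)
```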

We also measure whether each generated description reflects the cited evidence\. Reference\-grounded coverage applies the same sentence\-level coverage procedure used in the semantic evaluation, but restricts the evidence set to the papers explicitly cited by each generated cluster\. For each source paper $r_i$ and generated cluster $k$, let $E_{ik}^{\mathrm{ref}}$ be the set of sentence\-level textual atoms extracted from the Scopus abstracts cited in that cluster’s references field, and let $B_{ik}$ be the set of sentence\-level textual atoms extracted from the generated description $d_{ik}$\. The cluster\-level score is

$$\operatorname{RGC}_{ik} = \frac{1}{|E_{ik}^{\mathrm{ref}}|} \sum_{e \in E_{ik}^{\mathrm{ref}}} \max_{b \in B_{ik}} \cos(\mathbf{e}(e), \mathbf{e}(b)).$$

The final score for source paper $r_i$ is the average across generated clusters:

$$\operatorname{RGC}(i) = \frac{1}{K_i} \sum_{k=1}^{K_i} \operatorname{RGC}_{ik}.$$
Higher values indicate that generated cluster descriptions more fully reflect the semantic content of the specific Scopus abstracts they cite as supporting evidence\.
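A minimal sketch of the two-level average follows, again with toy vectors in place of real embeddings and hypothetical function names. How clusters with no cited evidence are handled is not stated in the text, so the sketch simply raises an error in that case.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def rgc_cluster(cited_vecs, description_vecs):
    """Coverage of one cluster's cited evidence atoms by its description atoms."""
    if not cited_vecs or not description_vecs:
        # Behaviour for empty sets is an assumption; the text leaves it open.
        raise ValueError("cluster needs cited evidence and description atoms")
    best = [max(cosine(e, b) for b in description_vecs) for e in cited_vecs]
    return sum(best) / len(best)


def rgc(clusters):
    """Average cluster-level score; clusters is a list of (cited, description)."""
    scores = [rgc_cluster(cited, desc) for cited, desc in clusters]
    return sum(scores) / len(scores)


# Toy input: cluster 1 covers half of its cited atoms, cluster 2 covers all.
toy_clusters = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]),
    ([[1.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]),
]
toy_rgc = rgc(toy_clusters)
```

The only difference from corpus-level coverage is the evidence set: each description is compared with the abstracts it cites, not with the whole corpus, and the cluster scores are then averaged.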

### D\.6 Evaluation rationale

Our evaluation is designed around the purpose of bibliometric analysis, not generic text\-generation quality\. A bibliometric cluster description should interpret a science map by covering the corpus, distinguishing the induced clusters, aligning with the underlying paper\-relation graph, and citing supporting evidence\. We therefore complement semantic similarity to human\-written descriptions with structure\-sensitive metrics: semantic coverage, ARI, silhouette, modularity, and reference grounding\.

Human\-authored descriptions remain an important reference for interpretive alignment, but they are not the sole evaluation target\. Published cluster descriptions are written for scholarly explanation and may emphasize conceptual nuance or narrative framing rather than paper\-level reconstruction in a rebuilt corpus\. Accordingly, we use human comparison as one dimension of evaluation, while the remaining metrics test whether generated descriptions preserve the bibliometric structure they are intended to explain\.

Our evaluation captures several important aspects of bibliometric cluster description quality, but it also has limitations\. This work is positioned not only as a bibliometric study but also as an NLP evaluation of structured LLM\-assisted scientific synthesis\. The proposed metrics assess semantic alignment, corpus coverage, clustering fidelity, graph consistency, and reference grounding, but they do not fully capture expert judgment, conceptual nuance, or the scholarly usefulness of a generated cluster description\. In addition, the benchmark reconstruction process and human\-evaluation protocol require careful reporting, since differences between reconstructed corpora and the original bibliometric studies may affect downstream comparisons\. These considerations motivate our use of multiple complementary metrics and our interpretation of the results as evidence of structural fidelity rather than as a direct replacement for human bibliometric expertise\.

## Appendix E Ablations

To further analyze the behavior of the proposed pipelines, a set of ablation studies was conducted\. These experiments were not performed on the full set of 100 bibliometric studies, but on a subset of 20 studies, due to the computational and financial cost associated with repeatedly executing the pipelines under different configurations\.

### E\.1 Generative Model Ablation

The first ablation study evaluated the effect of the language model used to generate the cluster descriptions\. In this experiment, the pipelines were executed using two different generative models: GPT\-5\.4 and Gemini\-2\.5\-Flash\. The objective was to compare how the performance and behavior of the pipelines changed depending on the model responsible for producing the thematic cluster descriptions\.

Table [4](https://arxiv.org/html/2605.24351#A5.T4) reports the results of the generative\-model ablation\. The results show broadly consistent behavior across the two generative models\. The *Blind* pipeline remains the weakest setting for most metrics, while the best results are generally obtained by the structured pipelines, especially *Labeled* and *Ranked*\. Under GPT\-5\.4, *Labeled* achieves the best mean\-rank and median coverage results, as well as the strongest median ARI and modularity\. Under Gemini\-2\.5\-Flash, *Ranked* obtains the best median modularity and nearly the best median ARI, while *Labeled* has the best ARI mean rank and highest ARI win percentage\. The main difference between models is therefore not whether structure helps, but which structured pipeline benefits most\. This supports the robustness of the main finding that LLM\-assisted bibliometric synthesis performs best when the model is given algorithmic cluster structure or representative cluster evidence\.

Table 4: Benchmark results by model source\. Results are grouped by summary statistic\. Median summarizes central performance, mean rank summarizes relative ordering across instances, and win percentage reports how often each pipeline achieves the best score\. Bold values indicate the best result for each metric and model source within each statistic\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/human_alignment_gemin.png) Figure 6: BERTScore human alignment by pipeline for Gemini\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/reference-grounded_coverage.png) Figure 7: Distribution of reference\-grounded coverage scores across description\-generation pipelines using Gemini\.

![Refer to caption](https://arxiv.org/html/2605.24351v1/reference_grounding.png) Figure 8: Histograms of cluster\-level reference precision under progressively stricter matching criteria for Gemini\. Panel \(a\) shows precision based on title matching only, panel \(b\) requires both title and year to match, and panel \(c\) requires title, year, and first author to match\.
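The three summary statistics used in the ablation tables (median, mean rank, and win percentage) can be computed from a per-instance score table along the following lines. This is a sketch under assumptions: higher scores are treated as better, tied pipelines share the average of the ranks they span, and a tie for first place counts as a win for each tied pipeline. None of these tie rules is stated in the text, and the scores below are invented for illustration.

```python
from statistics import mean, median


def summarize(scores):
    """scores maps pipeline name -> list of per-instance scores (higher is better)."""
    names = list(scores)
    n = len(scores[names[0]])
    if any(len(scores[name]) != n for name in names) or n == 0:
        raise ValueError("every pipeline needs the same non-zero number of scores")
    ranks = {name: [] for name in names}
    wins = {name: 0 for name in names}
    for i in range(n):
        column = {name: scores[name][i] for name in names}
        best = max(column.values())
        for name in names:
            higher = sum(1 for v in column.values() if v > column[name])
            equal = sum(1 for v in column.values() if v == column[name])
            # Rank 1 is best; ties receive the average of the ranks they span.
            ranks[name].append(higher + (equal + 1) / 2)
            if column[name] == best:
                wins[name] += 1
    return {
        name: {
            "median": median(scores[name]),
            "mean_rank": mean(ranks[name]),
            "win_pct": 100.0 * wins[name] / n,
        }
        for name in names
    }


# Invented scores for three pipelines on three benchmark instances.
toy_scores = {
    "Blind": [0.1, 0.2, 0.3],
    "Labeled": [0.5, 0.6, 0.2],
    "Ranked": [0.4, 0.7, 0.4],
}
summary = summarize(toy_scores)
```

In this toy table *Labeled* has the highest median while *Ranked* has the best mean rank and wins two of three instances, which shows how the three statistics can favor different pipelines.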

### E\.2 Embedding Model Ablation

The second ablation study tested whether the evaluation results depended on the embedding model used to compute semantic similarities\. We evaluated the same generated cluster descriptions using two embedding models: all\-mpnet\-base\-v2 and text\-embedding\-3\-large\. All other experimental settings were kept fixed\. We then recomputed the evaluation metrics reported in Table [5](https://arxiv.org/html/2605.24351#A5.T5): coverage, silhouette score, ARI, and modularity\. Stable relative performance across both embedding models was interpreted as evidence that the pipeline comparisons were robust to the choice of semantic representation model\.

Table [5](https://arxiv.org/html/2605.24351#A5.T5) shows that the main conclusions are broadly stable across embedding models\. In both MPNet and OpenAI embedding spaces, the *Blind* pipeline remains among the weakest settings, while the structured pipelines, especially *Labeled* and *Ranked*, obtain the strongest results across ARI and modularity\. *Labeled* is the most consistently strong pipeline, achieving the best mean\-rank and median results for coverage and ARI under both embedding models, as well as the best median modularity\. *Ranked* is also competitive, especially for silhouette and modularity\. The main difference is that MPNet gives higher absolute coverage and silhouette values than OpenAI, but the relative ordering of the pipelines remains similar\. This suggests that the evaluation results are not driven by a single embedding model\.

Table 5: Benchmark results by embedding model\. MPNet denotes all\-mpnet\-base\-v2, and OpenAI denotes text\-embedding\-3\-large\. Median summarizes central performance, mean rank summarizes relative ordering across instances, and win percentage reports how often each pipeline achieves the best score\. Bold values indicate the best result for each metric and embedding model within each statistic\.

### E\.3 Prompt Ablation

The third ablation study evaluated the effect of prompt formulation on cluster\-description generation\. We compared a simpler prompt against an enriched prompt that included two additional elements\. First, the enriched prompt added the query field:

Second, it added a methodological explanation of the bibliographic relation:

> The analysis that will be performed is a bibliographic coupling analysis\. Bibliographic coupling clusters papers that share cited references\. Papers in the same cluster often reflect related intellectual backgrounds, methods, or problem framings\.

The simpler prompt removed both of these elements\. This ablation tests whether giving the model explicit topic context and bibliometric\-relation context improves the generated descriptions relative to a more minimal prompt\.

Table [6](https://arxiv.org/html/2605.24351#A5.T6) shows that prompt formulation affects performance, but it does not change the main pattern across pipelines\. The *Blind* pipeline remains weak across most configurations, while structured pipelines continue to perform best\. Adding bibliographic\-coupling context improves some structural results, especially for *Labeled*, which achieves the best ARI mean rank and win percentage under the bibliographic prompt, and for *Ranked*, which obtains the best modularity median under the same configuration\. The no\-query condition improves coverage for *Corpus\-select* and modularity for *Labeled\-select*, but these gains are not consistent across all metrics\. Overall, the results suggest that prompt details can shift which structured pipeline performs best, but they do not overturn the broader conclusion that externally provided bibliometric structure is more important than prompt formulation alone\.

Table 6: Benchmark results by prompt configuration\. Biblio\. denotes prompts with bibliographic context\. Median summarizes central performance, mean rank summarizes relative ordering across instances, and win percentage reports how often each pipeline achieves the best score\. Bold values indicate the best result for each metric and prompt configuration within each statistic\.

### E\.4 Evidence\-Size Ablation

The fourth ablation study evaluated whether the amount of reference evidence requested from or provided to the model affects the quality of the generated cluster descriptions\. We compared three evidence\-size settings: 10, 20, and 40 references per cluster\. For pipelines 1–5, this value was expressed in the prompt as the approximate number of references \[\#\] that the model should use when enough relevant papers were available\. For *Ranked*, the value instead controlled the number of top\-ranked papers from each cluster that were provided to the model as input evidence\. This ablation tests whether more compact or broader cluster\-level evidence leads to better bibliometric cluster descriptions\.

Table [7](https://arxiv.org/html/2605.24351#A8.T7) shows that changing the amount of reference evidence affects performance, but the overall pattern remains stable\. The *Blind* pipeline remains weak across most settings, while the strongest results continue to come from structured pipelines\. *Labeled* performs best for several core metrics, especially coverage and ARI, and remains strong across all three evidence\-size settings\. Increasing the evidence size benefits some pipelines, particularly *Corpus\-select*, which improves in coverage as the setting increases from 10 to 40, and *Labeled\-select*, which obtains strong modularity and ARI results at larger settings\. However, larger evidence sets do not uniformly improve performance; for example, *Ranked* is competitive across ARI and modularity but does not consistently improve as more papers are provided\. Overall, the results suggest that the amount of evidence matters, but evidence structure and cluster labels remain more important than simply increasing the number of references\.

### E\.5 Human Evaluation of Structured Pipeline Outputs

To complement the automatic metrics, we conducted a small\-scale human evaluation of all pipelines\. Two independent evaluators assessed outputs for five randomly selected benchmark papers, with five clusters per paper, yielding 25 cluster descriptions per pipeline\. They compared descriptions by thematic coherence, specificity, faithfulness to evidence, and usefulness for cluster interpretation\. Both evaluators agreed that, among the structured pipelines, *Ranked* produced the strongest descriptions\. *Labeled* and *Labeled\-select* were generally similar in quality, with *Labeled* slightly preferred overall\. Human descriptions were rated below the structured pipelines and above *Blind*, which was penalized for hallucinated references\.

## Appendix F Code Availability

## Appendix G Computational Cost

All experiments were run on a single laptop and required approximately three days of wall\-clock time\. The total API cost was approximately 500 USD, consisting of about 400 USD in OpenAI usage and 100 USD in Gemini usage\.

## Appendix H Use of AI Assistants

Beyond their use in the experimental pipelines, LLMs were also used as writing and research assistants during the preparation of the paper, including for editing, contrasting ideas, and supporting literature search\.

Table 7: Benchmark results by experiment variant\. Median summarizes central performance, mean rank summarizes relative ordering across instances, and win percentage reports how often each pipeline achieves the best score\. Bold values indicate the best result for each metric and variant within each statistic\.