Term-Centric Hierarchy Induction from Heterogeneous Corpora
Summary
Proposes a term-centric framework for inducing hierarchical taxonomies from heterogeneous text sources, enabling cross-source alignment and interpretable hierarchies. Experiments on a multi-source benchmark demonstrate improved coherence and quality over text- and summary-based baselines.
# Term-Centric Hierarchy Induction from Heterogeneous Corpora
Source: [https://arxiv.org/html/2606.26963](https://arxiv.org/html/2606.26963)
Elena Senger¹,², Yuri Campbell², Jan-Peter Bergmann², Rob van der Goot³, Barbara Plank¹

¹ MaiNLP, Center for Information and Language Processing, LMU Munich, Germany
² Fraunhofer Institute for Systems and Innovation Research ISI, Germany
³ Department of Computer Science, IT University of Copenhagen, Denmark

elena.senger@cis.lmu.de, robv@itu.dk, b.plank@lmu.de, {yuri.campbell,jan-peter.bergmann}@isi.fraunhofer.de
###### Abstract
Organizing knowledge from diverse text sources into interpretable hierarchies is crucial for tasks such as policy analysis, innovation monitoring, and exploratory domain mapping\. Existing taxonomy induction methods typically rely on document\-level representations that capture entire documents rather than the specific domain concepts relevant for knowledge organization, limiting their ability to generalize across heterogeneous sources\. We propose a term\-centric framework for inducing hierarchical taxonomies from heterogeneous corpora that scales to massive document collections\. Our approach maps documents from diverse sources into a shared representation space using automatic term extraction, enabling robust cross\-source alignment\. Based on these representations, we construct interpretable hierarchies that integrate domain priors with data\-driven clustering\. Experiments on a novel English and German multi\-source benchmark of over one million documents demonstrate that our method improves cross\-source coherence and hierarchy quality over text\- and summary\-based baselines\. A case study on German regional innovation analysis further demonstrates its practical utility for technology landscape mapping\.
Figure 1: Overview of the TERMNET framework. Documents from heterogeneous sources are first mapped to a shared representation via automatic term extraction. The resulting embeddings are organized using a seed-guided hierarchical clustering procedure: predefined seed categories (representing broad scientific and technological domains) initialize the top levels of the hierarchy, which are then expanded in a data-driven manner.

## 1 Introduction
Hierarchical representations organize large text corpora into interpretable, multi-level structures. Such hierarchies facilitate exploratory search, domain mapping, and trend analysis by enabling users to navigate from broad thematic areas to fine-grained topics. Recent clustering- and LLM-based approaches have shown promise for taxonomy induction from scientific literature: Zhu et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib1)); Katz et al. ([2024](https://arxiv.org/html/2606.26963#bib.bib7)); Oarga et al. ([2024](https://arxiv.org/html/2606.26963#bib.bib6)); Gao et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib5)). In practice, analysing complex domains often requires integrating heterogeneous sources reflecting different contexts. Tasks such as policy foresight, regional innovation analysis, and domain-specific knowledge discovery rely on synthesizing evidence from multiple data sources, for example to identify emerging technologies or monitor strategic priorities (Polchar, [2024](https://arxiv.org/html/2606.26963#bib.bib39); Hakiman and Stull-Lane, [2022](https://arxiv.org/html/2606.26963#bib.bib40)).

Heterogeneous data sources pose two main challenges for hierarchy induction. (1) They differ in style and structure. For example, scientific papers emphasize methods and findings, patents focus on technical claims, and funding records describe strategic objectives. Standard document embeddings may therefore reflect source boundaries rather than thematic structure. (2) Data-driven clustering follows the empirical corpus rather than the true structure of the domain. Since methods such as K-Means allocate more centroids to high-density or high-variance regions (Manning et al., [2008](https://arxiv.org/html/2606.26963#bib.bib25)), sampling and coverage biases can lead to finer partitions where documents are abundant, while under-represented areas may be merged or fragmented (Ester et al., [1996](https://arxiv.org/html/2606.26963#bib.bib23); McInnes et al., [2017](https://arxiv.org/html/2606.26963#bib.bib24)).
To address these challenges, we propose TERMNET, a scalable term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora (Figure [1](https://arxiv.org/html/2606.26963#S0.F1)). Unlike prior taxonomy induction methods that rely on raw-document or summary embeddings, TERMNET maps documents into a shared semantic space using automatic term extraction, reducing source-specific style effects. Based on these representations, we construct hierarchies through a clustering process that integrates domain priors with data-driven signals, producing human-interpretable and domain-balanced taxonomy structures. We evaluate TERMNET on our newly introduced large-scale multi-source benchmark containing over one million English and German documents. It outperforms raw-text and summary baselines in clustering quality, cross-source integration, and human interpretability, as validated by automatic and human evaluation. A policy-oriented case study further demonstrates the practical utility of the induced hierarchies. Our main contributions are:
- We introduce TERMNET, a scalable term-centric framework for hierarchy induction from heterogeneous corpora.
- We propose an evaluation protocol for multi-source hierarchy induction, including source entropy and intruder detection, and perform large-scale automatic and human evaluations.
- We release a multi-source benchmark of over one million documents spanning publications, patents, and funding records to support research on heterogeneous knowledge organization.
## 2 Related Work
Research on hierarchy induction has traditionally focused on pattern-based hypernym extraction (e.g. Hearst, [1992](https://arxiv.org/html/2606.26963#bib.bib32); Shwartz et al., [2016](https://arxiv.org/html/2606.26963#bib.bib31); Panchenko et al., [2016](https://arxiv.org/html/2606.26963#bib.bib30)) and distributional or clustering-based methods that organize semantically related concepts or documents into hierarchical structures (e.g. Wang et al., [2013](https://arxiv.org/html/2606.26963#bib.bib33); Liu et al., [2012](https://arxiv.org/html/2606.26963#bib.bib34); Mimno et al., [2007](https://arxiv.org/html/2606.26963#bib.bib35)). More recently, LLMs have been applied either as standalone approaches or integrated into traditional pipelines (e.g. Gao et al., [2025](https://arxiv.org/html/2606.26963#bib.bib5); Zhu et al., [2025](https://arxiv.org/html/2606.26963#bib.bib1); Katz et al., [2024](https://arxiv.org/html/2606.26963#bib.bib7)).
### 2.1 LLM-Enhanced Hierarchy Induction
Research on hierarchy induction has largely focused on single-source corpora. One early multi-source example, Zhu et al. ([2013](https://arxiv.org/html/2606.26963#bib.bib49)), constructs topic hierarchies from blogs, community question-answering sites, and Twitter, but targets narrow topics and small-scale user-generated content. We extend this direction to large-scale institutional data sources and domain-level hierarchies.
Table 1: Mean text, summary, and parsed response lengths and example sentences per origin.

The work most closely related to ours is SCYCHIC (Gao et al., [2025](https://arxiv.org/html/2606.26963#bib.bib5)), which organizes scientific abstracts into multi-level hierarchies by combining embedding-based K-Means clustering with selective LLM-based summarization. A key insight is that decomposing papers into contribution types yields more coherent structures than treating each paper as a single-topic entity. Oarga et al. ([2024](https://arxiv.org/html/2606.26963#bib.bib6)) leverage LLMs for zero-shot ontology and knowledge graph generation from scientific literature by prompting models to extract vocabulary, infer hierarchical category structures, and extract relations in an end-to-end manner, showing effectiveness in domain-specific settings, e.g. chemistry. Zhu et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib1)) encode papers along multiple semantic aspects (e.g., methodology, data, evaluation) and cluster each summarized aspect with a probabilistic embedding-based model, followed by a dynamic search to ensure consistent cluster assignments when building a taxonomy. Katz et al. ([2024](https://arxiv.org/html/2606.26963#bib.bib7)) introduce an LLM-guided framework that organizes scientific query results (a few thousand papers) into two-level hierarchies. Their system first embeds and clusters retrieved papers using Gaussian Mixture Models, followed by LLM-based naming, filtering, and grouping for exploratory browsing.
These methods operate on single-source scientific corpora and rely on document-level representations, such as raw text, summaries, or aspect-based reformulations. Summaries may preserve source-specific style or hallucinate content. Therefore, we propose to use automatic term extraction for hierarchy induction.
### 2.2 Seed-Guided Hierarchy Induction
Seed-guided hierarchy construction expands a small initial hierarchy using corpus evidence (Shen et al., [2025](https://arxiv.org/html/2606.26963#bib.bib27)). Early approaches rely on embedding-based methods that recursively organize concepts or attach new ones to existing nodes (e.g. Zhang et al., [2018](https://arxiv.org/html/2606.26963#bib.bib29); Lee et al., [2022](https://arxiv.org/html/2606.26963#bib.bib38); Huang et al., [2020](https://arxiv.org/html/2606.26963#bib.bib28)). More recent work leverages LLMs for seed-guided hierarchy construction. For instance, TAXOINSTRUCT (Shen et al., [2025](https://arxiv.org/html/2606.26963#bib.bib27)) uses instruction-tuned LLMs to generate sibling entities and infer parent relations, and other approaches iteratively extend seed hierarchies via prompting strategies (Gao et al., [2025](https://arxiv.org/html/2606.26963#bib.bib5)). We use seed hierarchies in a cross-source setting, combining seed-guided category assignment for the upper layers with data-driven clustering for the lower layers, to approximate the conceptual structure of the domain rather than the empirical distribution of individual datasets.
## 3 Source Data
To evaluate our approach, we construct a multi-source benchmark combining scientific publications, patents, and public research funding records. These sources capture complementary stages of the innovation pipeline: scientific knowledge production, technological protection, and publicly funded research activity. The dataset focuses primarily on German-affiliated organizations and contains documents in both German and English. The corpus integrates four major data sources. It contains 578,335 publication abstracts from OpenAlex (openalex) (Priem et al., [2022](https://arxiv.org/html/2606.26963#bib.bib42)), 353,043 patent abstracts from the USPTO corpus (uspto) (Li et al., [2018](https://arxiv.org/html/2606.26963#bib.bib41)), 12,979 project descriptions from the EU framework programs Horizon 2020 (Publications Office of the European Union, [2015](https://arxiv.org/html/2606.26963#bib.bib43)) and Horizon Europe (horizon) (Publications Office of the European Union, [2022](https://arxiv.org/html/2606.26963#bib.bib44)), and 100,655 project descriptions from FöKAT (foekat), which catalogs research projects funded by the German Federal Government (Bundesministerium für Forschung, Technologie und Raumfahrt (BMFTR), [2026](https://arxiv.org/html/2606.26963#bib.bib45)). The resulting dataset comprises 1,044,977 documents.
The sources differ substantially in their linguistic characteristics and document structure. Publication and patent abstracts typically contain well-structured descriptions of research contributions and technological inventions, whereas funding records are shorter and often contain administrative or program-specific terminology. Combining these heterogeneous sources provides a challenging benchmark for methods aiming to capture technological and scientific topics across institutional contexts. Table [1](https://arxiv.org/html/2606.26963#S2.T1) shows representative example documents and summary statistics for each source. Details on data retrieval, filtering criteria, and licensing are provided in Appendix [A](https://arxiv.org/html/2606.26963#A1).
## 4 Methods
### 4.1 Problem Formulation
Let $D=\{d_1,\dots,d_N\}$ denote a heterogeneous corpus consisting of documents from multiple sources. Each document may differ in structure, style, and length depending on its source. Our goal is to induce a hierarchical taxonomy $\mathcal{H}=(\mathcal{V},\mathcal{E})$ over the documents in $D$, where $\mathcal{V}$ represents the set of named taxonomy nodes, also referred to as categories, and $\mathcal{E}$ represents the unique parent–child relations between them. Each node has a unique parent, except for the root $v_0\in\mathcal{V}$, which has none. The hierarchy organizes documents into increasingly specific categories. Each document is associated with a unique path from the root to a leaf node. For a node $v$, we use $D_v$ to denote the documents that pass through $v$ along this path, and $\mathcal{C}_v$ to denote its direct children. The resulting hierarchy should satisfy two objectives: (i) semantically coherent clusters, in which sibling nodes are thematically distinct and documents within a node share a common topic; and (ii) balanced domain coverage, such that the hierarchy reflects the breadth of the technological and scientific landscape rather than the frequency distribution of the corpus. We further assume access to domain priors $\mathcal{H}_p$ that define coarse-grained categories in the upper layers of $\mathcal{H}$. These priors may guide the hierarchy construction but do not fully determine its structure.
### 4.2 TERMNET
We propose TERMNET, a term-centric framework for inducing scalable and interpretable hierarchies from heterogeneous corpora (Figure [1](https://arxiv.org/html/2606.26963#S0.F1), Algorithm [1](https://arxiv.org/html/2606.26963#algorithm1)). The key motivations are (i) to abstract away from source-specific linguistic conventions and document structures by representing documents through their technological key terms, and (ii) to ensure interpretability through a guided hierarchy construction procedure that balances domain priors with data-driven discovery.
#### Term\-centric Representation
Documents from different sources may follow distinct stylistic conventions, which can cause document representations to reflect source identity rather than thematic structure. We address this by representing documents through their domain-specific terms, that is, words or phrases denoting defined concepts within a specialized domain. We use DiSTER, a fine-tuned model for cross-domain term extraction (Senger et al., [2025](https://arxiv.org/html/2606.26963#bib.bib21)), to identify concepts, methods, materials, and technologies in each document. The extracted terms are concatenated and embedded to form the document representation. This maps heterogeneous texts into a shared representation space based on domain-specific terms, which encourages documents referring to similar concepts to align while reducing source-specific linguistic variation.
#### Hierarchy Guidance
We construct the hierarchy using a hybrid strategy that combines domain priors with recursive clustering. The upper levels are initialized using predefined seed categories representing broad scientific domains, forming $\mathcal{H}_p$. First, all documents are assigned to the root node $v_0$ (Alg. [1](https://arxiv.org/html/2606.26963#algorithm1), line 2). Then, for each node $v$ with direct children, we apply K-Means over the term-based embeddings of the documents $D_v$ to obtain fine-grained clusters, where the number of clusters $k$ is a multiple $\alpha$ of the number of seed categories under node $v$, that is, $k=\alpha|\mathcal{C}_v|$. Each cluster is then assigned, using zero-shot LLM classification, either to one of the existing children of $v$ or to a newly created child (Alg. [1](https://arxiv.org/html/2606.26963#algorithm1), lines 5–14). For this decision, the LLM receives representative keywords and documents retrieved using class-TF–IDF (Grootendorst, [2022](https://arxiv.org/html/2606.26963#bib.bib36)). New nodes are created only if the cluster exceeds a minimum relative size $s_{\min}$ and if the cosine distance (dissimilarity) between the proposed label's embedding and the closest sibling label's embedding exceeds a threshold $\tau$. Together, $s_{\min}$ and $\tau$ control the granularity of the guided layers and suppress the sibling explosion caused by LLMs' eagerness to create new categories.
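The two guidance rules in this paragraph, the cluster count $k=\alpha|\mathcal{C}_v|$ and the gate for creating a new node, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, and the gate assumes the LLM has already proposed a new category. It follows the cosine-distance reading used in this paragraph, although the paper elsewhere calls $\tau$ a similarity threshold.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def num_fine_clusters(alpha, num_seed_children):
    """Number of fine-grained clusters under a node: k = alpha * |C_v|."""
    return alpha * num_seed_children


def may_create_node(cluster_size, parent_size, label_emb, sibling_embs, s_min, tau):
    """Gate for a category proposed by the LLM (hypothetical helper).

    A new child is allowed only if the cluster holds at least a fraction
    s_min of the parent's documents AND the proposed label is farther than
    tau (cosine distance) from its closest sibling label.
    """
    large_enough = cluster_size >= s_min * parent_size
    # With no siblings there is nothing to be redundant with.
    closest = min(
        (cosine_distance(label_emb, s) for s in sibling_embs),
        default=float("inf"),
    )
    return large_enough and closest > tau
```

With the values reported later in the setup ($\alpha=50$, $s_{\min}=1\%$, $\tau=0.12$), a node with six seed children would be over-clustered into 300 fine-grained clusters, and a cluster holding 2% of the parent's documents would pass the size check.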
After reaching the seed depth $P$, the hierarchy is expanded in a fully data-driven manner using recursive top-down K-Means clustering (Alg. [1](https://arxiv.org/html/2606.26963#algorithm1), lines 15–18). For each node, the number of children is chosen proportional to its document count while remaining within predefined bounds $(B_m, B_M)$. This allows dense nodes to be partitioned more finely while maintaining comparable granularity across hierarchy levels. Overall, the hybrid strategy preserves interpretability at higher levels while enabling fine-grained organization at lower levels.
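The text states only that the number of children is proportional to a node's document count and bounded by $B_m$ and $B_M$; the proportionality constant is not given. A minimal sketch, assuming a hypothetical `docs_per_child` parameter that sets how many documents justify one additional child:

```python
def branching_factor(num_docs, docs_per_child, b_min, b_max):
    """Children count grows with node size but is clamped to [b_min, b_max].

    docs_per_child is an illustrative assumption, not a parameter
    named in the paper.
    """
    proposed = num_docs // docs_per_child
    return max(b_min, min(b_max, proposed))
```

Under this assumption and the reported bounds $B_m=3$, $B_M=6$, a sparse node still receives three children and a very dense node is capped at six, which keeps granularity comparable across a level.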
Input: Documents $\{d_1,\ldots,d_n\}$, hierarchy guidance $\mathcal{H}_p$ with depth $P$, total layers $L$, cluster multiplier $\alpha$, size threshold $s_{\min}$, similarity threshold $\tau$, branching bounds $B_m, B_M$
1: Initialization: Extract domain-specific terms for each $d_i$ and compute embedding $r_i$
2: Add all documents $d_i$ to the root node $v_0$
3: for $l=0$ to $P-1$ do
4:   for each node $v$ in $\mathcal{V}_l$ do
5:     $k \leftarrow \alpha\cdot|\mathcal{C}_v|$
6:     Create $k$ fine-grained clusters $\mathcal{M}$ using K-Means
7:     for fine-grained cluster $m\in\mathcal{M}$ do
8:       Retrieve representative keywords and documents
9:       Zero-shot classify cluster $m$ into children of $v$
10:       if LLM proposes new category and $|m|\geq s_{\min}|D_v|$ and dissimilarity $>\tau$ then
11:         Create new child node $w\in\mathcal{C}_v$
12:         Assign documents of $m$ to $w$
13:       else
14:         Assign documents of $m$ to selected child of $v$
15: for $l=P$ to $L-1$ do
16:   for each node $v$ in $\mathcal{V}_l$ do
17:     Cluster document representations of $v$ into $k\in\{B_m,B_m+1,\cdots,B_M\}$ clusters
18:     Add every cluster as child of $v$
19: return Hierarchical structure
Algorithm 1: TERMNET algorithm
### 4.3 Baselines
We compare against hierarchical K-Means clustering and a recent embedding-based taxonomy induction approach. Algorithms are provided in Appendix [B](https://arxiv.org/html/2606.26963#A2).
#### Recursive K\-Means
We implement recursive K-Means as a hierarchical clustering approach. Starting from a global partition, clusters are recursively subdivided in a breadth-first manner until the target granularity is reached. To ensure scalability, we apply Mini-Batch K-Means (Sculley, [2010](https://arxiv.org/html/2606.26963#bib.bib37)) for clusters exceeding 10,000 documents and standard K-Means (MacQueen, [1967](https://arxiv.org/html/2606.26963#bib.bib13)) otherwise. This approach produces a strictly nested hierarchy in which each document is assigned to a path defined by successive centroid refinements. We apply this baseline to three document representations: raw text, document summaries, and extracted domain-specific terms.
#### SCYCHIC
To keep computation feasible, we adapt the SCYCHIC framework (Gao et al., [2025](https://arxiv.org/html/2606.26963#bib.bib5)) to our large-scale corpus by using smaller embedding and summarization models (see Section [5.1](https://arxiv.org/html/2606.26963#S5.SS1)). To address token constraints in the bottom-up phase, we summarize only a representative subset of documents per cluster, selected using Maximal Marginal Relevance (MMR), which balances similarity to the cluster centroid with semantic diversity.
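The MMR selection step can be illustrated with a greedy sketch in plain Python. This is an assumption-laden example rather than the adapted SCYCHIC code: the trade-off weight `lam`, the subset size `n_select`, and the function names are hypothetical, since the section does not state how relevance and diversity are weighted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(embeddings, centroid, n_select, lam=0.5):
    """Greedy Maximal Marginal Relevance; returns indices of chosen documents.

    Each step picks the candidate maximizing
        lam * sim(doc, centroid) - (1 - lam) * max sim(doc, already selected),
    so lam=1.0 reduces to ranking by similarity to the centroid.
    """
    candidates = list(range(len(embeddings)))
    selected = []
    while candidates and len(selected) < n_select:

        def score(i):
            relevance = cosine(embeddings[i], centroid)
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a balanced weight, a near-duplicate of an already selected document is skipped in favour of a more distinct one, which is the behaviour the token budget calls for.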
## 5 Experimental Design
### 5.1 Setup
All methods use Qwen3-Embedding-0.6B (Zhang et al., [2025](https://arxiv.org/html/2606.26963#bib.bib47)) to generate document embeddings. For document summarization, we employ Meta-Llama-3-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.26963#bib.bib48)), and for automatic term extraction we use DiSTER with instruction-tuned Meta-Llama-3-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.26963#bib.bib48)), following Senger et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib21)).
The hierarchy guidance $\mathcal{H}_p$ is fixed with two category layers curated by domain experts and inspired by the ASJC classification (Elsevier, [2024](https://arxiv.org/html/2606.26963#bib.bib46)). Fine-grained clustering uses a multiplier $\alpha=50$ that determines the number of clusters per node. The creation of new categories during guidance is controlled by a minimum relative cluster size $s_{\min}=1\%$ and a cosine similarity threshold $\tau=0.12$ to prevent redundant labels. For the data-driven hierarchy expansion phase, the branching factor is bounded by $B_m=3$ and $B_M=6$. The final hierarchy depth is fixed to $L=4$, resulting in hierarchies with approximately $k=\{6,40,180,680\}$ clusters from top to bottom. This configuration was selected in consultation with domain experts to support the downstream case study and analyses (Section [7](https://arxiv.org/html/2606.26963#S7)). However, the method itself is not restricted to a fixed hierarchy depth or number of clusters.
Table 2: Automatic metrics and human evaluation scores across the different hierarchy induction approaches. TERMNET w/o terms differs from the full TERMNET approach only in the input representation, using raw text instead of extracted terms.
### 5.2 Evaluation
We evaluate the hierarchies with unsupervised, supervised, and human\-centered metrics to assess clustering quality, cross\-source integration, structure and usability\.
#### Unsupervised Metrics
We measure lexical cluster distinctiveness at each hierarchy level using *topic diversity*, computed as Inverted Rank-Biased Overlap (Webber et al., [2010](https://arxiv.org/html/2606.26963#bib.bib16); Terragni et al., [2021](https://arxiv.org/html/2606.26963#bib.bib17)) over the top $10$ representative terms per cluster. Cross-source integration is measured with *source entropy* for each cluster, based on the distribution of document origins. Higher entropy indicates a more balanced source mixture, while lower entropy reflects source dominance. Perfect balance is unattainable because source sizes are skewed, and it is not necessarily desirable; hence we use entropy comparatively, to evaluate reductions in source-driven fragmentation. Preliminary experiments showed that clustering raw documents produced source-separated clusters with low entropy; therefore, higher entropy indicates improved integration. The composition of clusters from different sources is further validated through human evaluation.
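Source entropy for a single cluster can be sketched as the Shannon entropy of its source distribution. The normalization by $\log_2$ of the number of sources is an assumption made here so that values fall in $[0,1]$; the section does not spell out the exact scaling or how per-cluster values are aggregated, and the function name is hypothetical.

```python
import math
from collections import Counter


def source_entropy(sources, num_sources):
    """Normalized Shannon entropy of document origins within one cluster.

    sources: iterable of source labels, one per document in the cluster.
    num_sources: total number of sources in the corpus (must be > 1).
    Returns 0.0 for a single-source cluster and 1.0 for a perfectly
    balanced mixture over all sources.
    """
    if num_sources < 2:
        raise ValueError("num_sources must be at least 2")
    counts = Counter(sources)
    total = sum(counts.values())
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(num_sources)
```

For the four sources in this benchmark, a cluster containing only patents scores 0.0, while a cluster split evenly between two sources scores 0.5.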
#### Supervised Metrics
To evaluate high-level thematic alignment, we manually construct a gold standard from source-specific classifications (e.g., OpenAlex fields and IPC codes for patents) and map these categories to induced top-level clusters. Documents are considered correctly assigned if their original categories correspond to the mapped cluster. For example, publications labeled with *General Immunology and Microbiology* and *Health Informatics* are expected to appear in the corresponding *Life Sciences* or *Health Sciences* clusters. Since we only have these labels for a subset of the data, we can only measure precision.
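The precision computation over the labeled subset might look as follows. This is a minimal sketch under stated assumptions: the dictionary layout, the function name, and the idea that one source category may map to a set of acceptable clusters are illustrative choices, and the example mapping in the test below is made up for demonstration.

```python
def top_level_precision(assignments, gold_labels, category_to_clusters):
    """Share of labeled documents that land in a cluster mapped to their category.

    assignments: doc_id -> induced top-level cluster name.
    gold_labels: doc_id -> source-specific category (labeled subset only).
    category_to_clusters: category -> set of acceptable cluster names.
    Unlabeled documents are ignored, which is why only precision is measured.
    """
    labeled = [doc for doc in assignments if doc in gold_labels]
    if not labeled:
        return None
    correct = sum(
        1
        for doc in labeled
        if assignments[doc] in category_to_clusters.get(gold_labels[doc], set())
    )
    return correct / len(labeled)
```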
To evaluate semantic coherence, we design an intruder detection task inspired by Bhatia et al. ([2018](https://arxiv.org/html/2606.26963#bib.bib9)). Given a parent cluster and its child clusters, along with one semantically similar intruder cluster drawn from a different parent at the same hierarchy depth, the task is to identify the non-matching cluster. In a validation study, we randomly sampled 60 parent clusters, balanced across hierarchy levels. For each instance, two human annotators independently identified the intruder cluster. Human inter-annotator agreement reached Cohen's $\kappa=0.73$, while intruder predictions from an LLM (Llama-3.3-70B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.26963#bib.bib48))) matched annotators with mean $\kappa=0.76$ and higher mean precision, so we use it for large-scale evaluation and report macro-accuracy per layer.
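The agreement statistic used in this validation study, Cohen's $\kappa$, compares observed agreement with the agreement expected by chance. A minimal sketch for two annotators follows; the function name and the encoding of each judgment as the index of the cluster chosen as intruder are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Identical annotations give $\kappa=1$, while agreement at chance level gives $\kappa=0$.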
#### Human Evaluation
Following Hu et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib8)) and Zhu et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib1)), we evaluate hierarchies along three dimensions adapted to cross-source hierarchy induction: *Structure*, *Validity*, and *Composition*. *Structure* assesses whether the hierarchy exhibits a clear progression from broad research areas to specific sub-technologies with consistent granularity across levels. *Validity* evaluates alignment with experts' understanding of how concepts are typically grouped and related. *Composition* examines whether clusters reflect a plausible mix of data sources given the domain (e.g., publication- vs. patent- or project-intensive areas). Each dimension is rated on a five-point Likert scale (see Appendix [C](https://arxiv.org/html/2606.26963#A3)). The mean inter-annotator agreement is Cohen's $\kappa=0.43$.¹

¹ The moderate chance-corrected agreement scores ($\kappa=0.43$, $\alpha=0.57$) reflect minor calibration differences and low variation. While exact agreement is 44%, 87% of ratings differed by at most one Likert point. Both annotators consistently ranked TERMNET highest, indicating that disagreement mainly affects absolute ratings rather than system ordering. For per-annotator scores, see Appendix [D](https://arxiv.org/html/2606.26963#A4).
## 6 Results & Analysis
### 6.1 Quantitative Analysis
Table [2](https://arxiv.org/html/2606.26963#S5.T2) summarizes automatic metrics and human evaluation scores. Overall, TERMNET achieves the strongest performance across most metrics, obtaining the highest precision (67.70), intruder accuracy (94.44), entropy (36.70), and human evaluation scores. Topic diversity remains very high for all methods ($\approx 99$), indicating that clusters are lexically distinct and largely non-overlapping regardless of the clustering strategy.
Figure 2: Sensitivity of the hierarchy shape to the three construction hyperparameters.

#### Clustering Quality
Among the baselines, K-Means with term representations achieves the highest precision (61.12), outperforming raw text (53.28) and summaries (47.56), while SCYCHIC reaches 60.52. This suggests that term-centric representations reduce source-specific linguistic variation and better capture technological concepts. Seed guidance further improves alignment: TERMNET without terms reaches 66.39 precision and the highest baseline entropy (33.11). The full model performs best overall, with 67.70 precision and 94.44 intruder accuracy. The large improvement in intruder accuracy compared to baselines (e.g., 84.30 for K-Means raw and 79.02 for SCYCHIC) indicates substantially higher semantic coherence of the generated hierarchy.
#### Structure and Validity
Table [2](https://arxiv.org/html/2606.26963#S5.T2) shows that TERMNET obtains the highest expert ratings for Structure and Validity (both 4.5). The strongest baselines reach Structure scores of 4.0 and Validity scores of at most 3.0. Even without term extraction, TERMNET achieves a Structure score of 4.0, indicating that hierarchy guidance alone already improves the organization of domains.
#### Cross\-Source Integration
Term representations substantially increase source entropy, indicating better integration of publications, patents, and funding records. K-Means with terms increases entropy to 30.00 compared to 24.50 for raw text. TERMNET further improves integration, achieving the highest entropy (36.70). The human *Composition* score of 5.0 confirms that its clusters contain plausible source mixtures, rather than reflecting source-specific artifacts.
### 6.2 Sensitivity Analysis and Practical Guidance
We analyse the three hyperparameters controlling hierarchy construction: the over-clustering multiplier $\alpha$, the minimum relative size $s_{\min}$ for creating guided nodes, and the similarity threshold $\tau$ for avoiding redundant labels. Figure [2](https://arxiv.org/html/2606.26963#S6.F2) shows how the parameters affect the hierarchy shape. Reducing $s_{\min}$ from 0.03 to 0.003 increases cluster counts roughly three- to fivefold, from 32 to 148 at layer 1 and from 639 to 1,768 at layer 3, as smaller candidate clusters survive the guided merging step. The effect saturates for $s_{\min}\geq 0.02$, where further increases no longer affect the hierarchy. The seed multiplier $\alpha$ has a weaker but visible effect: smaller values create more clusters, while results stabilize for $\alpha\geq 60$ for the given $s_{\min}$ and $\tau$. The hyperparameters are interdependent. At $s_{\min}=0.01$, sweeping $\tau$ yields identical cluster counts. In contrast, at $s_{\min}=0.003$, $\tau=0.4$ nearly doubles the number of clusters compared to $\tau\leq 0.2$. Thus, $\tau$ only becomes active when the size threshold is permissive enough for many small candidate clusters to be considered for new labels. In practice, we therefore recommend choosing $s_{\min}$ first, using $\alpha$ as a secondary density control, and tuning $\tau$ only for very low $s_{\min}$.
Table 3: Evaluation scores for five sensitivity configurations.

Table 4: Top-level clusters and document counts per data source for the raw text K-Means baseline and TERMNET. Cluster names for the baseline were generated using an LLM. Keywords represent truncated examples of characteristic terms for each cluster. The distribution illustrates how TERMNET produces more balanced cross-source clusters compared to raw text clustering.

Table [3](https://arxiv.org/html/2606.26963#S6.T3) supports this interpretation: across five configurations, intruder accuracy remains between 92.57 and 96.28, and normalized source entropy between 44.50 and 48.64. The parameters therefore control navigational granularity while leaving measured quality comparatively stable.
### 6.3 Qualitative Analysis
Table 5: Automatic metrics and human evaluation scores for the proprietary dataset.

Table 6: Impact of document representations on source entropy and human-perceived accuracy.

Analysis of intruder detection errors reveals subtle hierarchy limitations. TERMNET without terms performs worse at broad levels because some clusters include semantically distant subclusters. For example, ‘Social and Community Development’ was placed under ‘Life Sciences’, causing the model to select it instead of the true intruder, ‘Mathematics’. With term extraction, such social-science subclusters are absent from ‘Life Sciences’, and the true intruder ‘Automotive Engineering’ is identified correctly. Such errors illustrate that while the seed-guided hierarchy ensures broad structural coherence, precision at fine-grained levels benefits from term-centric representations.
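The failure mode described above can be illustrated with a small sketch of an intruder test. It assumes that an item shows several child labels of one parent plus one label drawn from elsewhere, and that accuracy is the share of items where the judge picks the true intruder. The helper names and all child labels other than the two quoted in the text are hypothetical.

```python
import random

def build_intruder_item(children, outsiders, rng, num_shown=4):
    """Mix sampled child labels of one parent with one label from elsewhere."""
    shown = rng.sample(children, min(num_shown, len(children)))
    intruder = rng.choice(outsiders)
    options = shown + [intruder]
    rng.shuffle(options)
    return {"options": options, "intruder": intruder}

def intruder_accuracy(items, predictions):
    """Percentage of items where the predicted label is the true intruder."""
    if not items:
        return 0.0
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item["intruder"])
    return 100.0 * hits / len(items)

rng = random.Random(0)
# Hypothetical children of 'Life Sciences'; the last one is the misplaced subcluster.
life_sciences = [
    "Molecular Biology",
    "Ecology",
    "Neuroscience",
    "Social and Community Development",
]
item = build_intruder_item(life_sciences, ["Mathematics"], rng)

# One correct answer and one answer misled by the semantically distant child.
misled = "Social and Community Development"
score = intruder_accuracy([item, item], ["Mathematics", misled])
print(item["options"], score)
```

A misplaced child competes with the true intruder for the judge's attention, which is why errors of this kind point to impure clusters rather than to a weak judge.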
We hypothesize that the high Composition scores are largely driven by term extraction, whereas Structure and Validity benefit from hierarchy guidance. The upper layers of the hierarchy are initialized using the seed hierarchy, so they roughly resemble established scientific hierarchies. This familiar structure helps experts navigate the taxonomy. In contrast, purely data-driven approaches can deviate from established structures. For example, K-Means with summaries creates detailed engineering clusters at the top level but omits social sciences and humanities, and deeper clusters such as ‘Policy and Governance Studies’ appear under ‘Life Sciences’, reflecting less coherent assignments without seed guidance.
Table [4](https://arxiv.org/html/2606.26963#S6.T4) further illustrates differences in cluster composition. Raw text K-Means produces clusters that are strongly dominated by individual data sources. For example, ‘Engineering and Applied Technologies’ contains 275k USPTO documents and 205k FöKAT records, while other sources are sparsely represented. In contrast, TERMNET produces more balanced clusters. For instance, the ‘Engineering’ and ‘IT and Computing’ clusters contain substantial contributions from both OpenAlex and USPTO documents, suggesting that documents from different sources referring to similar concepts are aligned more effectively.
These observations suggest that TERMNET balances hierarchy guidance with data-driven clustering, producing hierarchies that are interpretable at high levels while capturing fine-grained, semantically coherent subtopics.
## 7 Case Study: Structural Change
To illustrate TERMNET’s utility for regional analysis, we provide induced hierarchies to geographers studying structurally weak German regions. This case study uses in-house data: Scopus abstracts replace the publicly available OpenAlex abstracts for higher-quality metadata, and PATSTAT replaces USPTO to focus on German affiliations. The horizon and foekat datasets are used as before. TERMNET achieves the highest scores on most metrics (Table [5](https://arxiv.org/html/2606.26963#S6.T5)), demonstrating robust performance. For human evaluation, we added a Usefulness dimension to capture the practical utility of the hierarchies for geographers. Expert annotations were limited to the most promising approaches due to resource constraints.
We further evaluate clustering quality and cross-source integration using K-Means with $k = 650$ on three input types: raw text, summaries, and terms (Table [6](https://arxiv.org/html/2606.26963#S6.T6)). Only a single clustering layer was generated. From each clustering, 50 documents were randomly sampled for blinded manual assessment. A geography expert judged whether each document was thematically consistent with its cluster, using the raw document text, representative keywords, representative cluster documents, and an LLM-generated cluster label. Term-centric representations achieved higher source entropy and human-perceived accuracy, indicating improved cross-source mixing and more coherent clusters.
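The sampling protocol can be sketched as follows. The sketch assumes that "blinded" means the expert does not see which input type produced a given cluster assignment, and it uses random toy assignments in place of real K-Means output. All function names and the judgment values are hypothetical placeholders.

```python
import random

def sample_for_review(assignments, n, rng):
    """Draw n (doc_id, cluster_id) pairs uniformly at random without replacement."""
    return rng.sample(sorted(assignments.items()), n)

def blind(samples_by_condition, rng):
    """Pool samples from all conditions and shuffle them, keeping a hidden key."""
    pooled = [
        (cond, doc, cluster)
        for cond, pairs in samples_by_condition.items()
        for doc, cluster in pairs
    ]
    rng.shuffle(pooled)
    sheet = [{"item": i, "doc": d, "cluster": c} for i, (_, d, c) in enumerate(pooled)]
    key = {i: cond for i, (cond, _, _) in enumerate(pooled)}
    return sheet, key

def accuracy_by_condition(judgments, key):
    """Share of items judged thematically consistent, per condition, in percent."""
    totals, hits = {}, {}
    for item, consistent in judgments.items():
        cond = key[item]
        totals[cond] = totals.get(cond, 0) + 1
        hits[cond] = hits.get(cond, 0) + int(consistent)
    return {cond: 100.0 * hits[cond] / totals[cond] for cond in totals}

rng = random.Random(42)
# Toy stand-in for clustering output: 1,000 documents over k = 650 clusters.
conditions = {
    name: {f"doc{i}": rng.randrange(650) for i in range(1000)}
    for name in ("raw_text", "summaries", "terms")
}
samples = {name: sample_for_review(a, 50, rng) for name, a in conditions.items()}
sheet, key = blind(samples, rng)

# Placeholder judgments; in the study these come from the geography expert.
judgments = {row["item"]: True for row in sheet}
print(len(sheet), accuracy_by_condition(judgments, key))
```

Keeping the condition key separate from the review sheet is one simple way to prevent the input type from influencing the judgment.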
## 8 Conclusion
We introduced TERMNET, a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora. By combining domain-specific term representations, seed-guided construction, and data-driven clustering, TERMNET produces interpretable hierarchies that align with established domain structures while integrating multiple data sources. Experiments show consistent gains in thematic alignment, semantic coherence, and cross-source integration across automatic metrics and expert evaluations.
## Limitations
While TERMNET demonstrates strong performance in cross\-source hierarchy induction, several limitations remain that offer directions for future research\. First, the interpretability and structure of the upper hierarchy levels depend on the quality of the expert\-provided seed categories\. While this guidance improves alignment with established domain structures, poorly specified seeds may bias the resulting taxonomy\. Second, our evaluation focuses on English and German corpora within scientific and technological domains\. The generalizability of the approach to other languages, domains, or document genres remains an open question\. Third, the current implementation constructs a static hierarchy from a fixed snapshot of documents\. In real\-world technology monitoring scenarios, corpora evolve continuously as new publications and patents appear\. Supporting incremental updates and modelling the temporal evolution of hierarchical structures would therefore be an important extension for future research\.
## Ethical Considerations
This work uses publicly available textual data sources and does not involve personal or sensitive information\. Nevertheless, hierarchical representations learned from large corpora may reflect biases present in the underlying data, which could influence how topics or domains are structured and interpreted\. Users should therefore treat automatically induced hierarchies as exploratory tools rather than authoritative representations of knowledge\.
Large language models were used to assist with grammar correction, spelling, and refinement of the manuscript text\. They were not used to generate experimental results, analyses, or scientific claims\.
## References
- Topic intrusion for automatic topic model evaluation\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 844–849\.External Links:[Link](https://aclanthology.org/D18-1098/),[Document](https://dx.doi.org/10.18653/v1/D18-1098)Cited by:[§5\.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px2.p2.2)\.
- Bundesministerium für Forschung, Technologie und Raumfahrt \(BMFTR\) \(2026\)FöKAT – Federal Catalog of Research Projects in Germany\.Note:Used and published with permission\. The dataset aggregates publicly available information on research, development, and innovation projects, including contact details of German research institutions \(see Bundesbericht Forschung und Innovation, BuFI, https://www\.bmftr\.bund\.de\), available research reports via Technische Informationsbibliothek \(TIB, https://www\.tib\.eu\), and information on project funding managed by administrative project carriers \(https://www\.ptnetz\.de\)\. Additional federal funding sources include BMFTR, BMWE, BMV, BMUKN, BMJV, BMLEH\.DatabaseExternal Links:[Link](https://foerderportal.bund.de/foekat/jsp/SucheAction.do?actionMode=searchmask)Cited by:[4th item](https://arxiv.org/html/2606.26963#A1.I1.i4.p1.1),[§3](https://arxiv.org/html/2606.26963#S3.p1.1)\.
- Elsevier \(2024\)All science journal classification \(asjc\) codes\.Note:[https://service\.elsevier\.com/app/answers/detail/a\_id/15181/supporthub/scopus/](https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/)Accessed: 2026\-03\-13Cited by:[§5\.1](https://arxiv.org/html/2606.26963#S5.SS1.p2.8)\.
- M\. Ester, H\. Kriegel, J\. Sander, and X\. Xu \(1996\)A density\-based algorithm for discovering clusters in large spatial databases with noise\.InProceedings of the Second International Conference on Knowledge Discovery and Data Mining \(KDD’96\),pp\. 226–231\.Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1)\.
- M\. Gao, J\. Shah, W\. Wang, K\. Huang, and D\. Khashabi \(2025\)Science hierarchography: hierarchical organization of science literature\.External Links:2504\.13834,[Link](https://arxiv.org/abs/2504.13834)Cited by:[Appendix B](https://arxiv.org/html/2606.26963#A2.p1.1),[§1](https://arxiv.org/html/2606.26963#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26963#S2.SS1.p2.1),[§2\.2](https://arxiv.org/html/2606.26963#S2.SS2.p1.1),[§2](https://arxiv.org/html/2606.26963#S2.p1.1),[§4\.3](https://arxiv.org/html/2606.26963#S4.SS3.SSS0.Px2.p1.1)\.
- A. Grattafiori, A. Dubey, A. Jauhri, et al. (2024) The llama 3 herd of models. External Links: 2407.21783, [Link](https://arxiv.org/abs/2407.21783) Cited by: [§5.1](https://arxiv.org/html/2606.26963#S5.SS1.p1.1), [§5.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px2.p2.2).
- M\. R\. Grootendorst \(2022\)BERTopic: neural topic modeling with a class\-based tf\-idf procedure\.ArXivabs/2203\.05794\.External Links:[Link](https://api.semanticscholar.org/CorpusID:247411231)Cited by:[§4\.2](https://arxiv.org/html/2606.26963#S4.SS2.SSS0.Px2.p1.13)\.
- K\. Hakiman and C\. Stull\-Lane \(2022\)Innovation in governance: integrating technical and contextual perspectives to address fragility\.Technical ReportSPARC\.External Links:[Link](https://www.sparc-knowledge.org/publications-resources/innovation-governance-integrating-technical-and-contextual-perspectives)Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1)\.
- M\. A\. Hearst \(1992\)Automatic acquisition of hyponyms from large text corpora\.InCOLING 1992 Volume 2: The 14th International Conference on Computational Linguistics,External Links:[Link](https://aclanthology.org/C92-2082/)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- Y\. Hu, Z\. Li, Z\. Zhang, C\. Ling, R\. Kanjiani, B\. Zhao, and L\. Zhao \(2025\)Taxonomy tree generation from citation graph\.External Links:2410\.03761,[Link](https://arxiv.org/abs/2410.03761)Cited by:[§5\.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px3.p1.1)\.
- J\. Huang, Y\. Xie, Y\. Meng, Y\. Zhang, and J\. Han \(2020\)CoRel: seed\-guided topical taxonomy construction by concept learning and relation transferring\.InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’20,pp\. 1928–1936\.External Links:[Link](http://dx.doi.org/10.1145/3394486.3403244),[Document](https://dx.doi.org/10.1145/3394486.3403244)Cited by:[§2\.2](https://arxiv.org/html/2606.26963#S2.SS2.p1.1)\.
- U\. Katz, M\. Levy, and Y\. Goldberg \(2024\)Knowledge navigator: LLM\-guided browsing framework for exploratory search in scientific literature\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 8838–8855\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.516/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.516)Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26963#S2.SS1.p2.1),[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- D\. Lee, J\. Shen, S\. Kang, S\. Yoon, J\. Han, and H\. Yu \(2022\)TaxoCom: topic taxonomy completion with hierarchical discovery of novel topic clusters\.InProceedings of the ACM Web Conference 2022,WWW ’22,New York, NY, USA,pp\. 2819–2829\.External Links:ISBN 9781450390965,[Link](https://doi.org/10.1145/3485447.3512002),[Document](https://dx.doi.org/10.1145/3485447.3512002)Cited by:[§2\.2](https://arxiv.org/html/2606.26963#S2.SS2.p1.1)\.
- S\. Li, J\. Hu, Y\. Cui, and J\. Hu \(2018\)DeepPatent: patent classification with convolutional neural networks and word embedding\.Scientometrics117\(2\),pp\. 721–744\.External Links:ISSN 1588\-2861,[Document](https://dx.doi.org/10.1007/s11192-018-2905-5)Cited by:[2nd item](https://arxiv.org/html/2606.26963#A1.I1.i2.p1.1),[§3](https://arxiv.org/html/2606.26963#S3.p1.1)\.
- X\. Liu, Y\. Song, S\. Liu, and H\. Wang \(2012\)Automatic taxonomy construction from keywords\.InProceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’12,New York, NY, USA,pp\. 1433–1441\.External Links:ISBN 9781450314626,[Link](https://doi.org/10.1145/2339530.2339754),[Document](https://dx.doi.org/10.1145/2339530.2339754)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- J. B. MacQueen (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. Cited by: [§4.3](https://arxiv.org/html/2606.26963#S4.SS3.SSS0.Px1.p1.1).
- C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press. Cited by: [§1](https://arxiv.org/html/2606.26963#S1.p1.1).
- L\. McInnes, J\. Healy, and S\. Astels \(2017\)hdbscan: hierarchical density based clustering\.The Journal of Open Source Software2\(11\),pp\. 205\.External Links:[Document](https://dx.doi.org/10.21105/joss.00205)Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1)\.
- D\. Mimno, W\. Li, and A\. McCallum \(2007\)Mixtures of hierarchical topics with pachinko allocation\.InProceedings of the 24th International Conference on Machine Learning,ICML ’07,New York, NY, USA,pp\. 633–640\.External Links:ISBN 9781595937933,[Link](https://doi.org/10.1145/1273496.1273576),[Document](https://dx.doi.org/10.1145/1273496.1273576)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- A\. Oarga, M\. Hart, A\. M\. Bran, M\. Lederbauer, and P\. Schwaller \(2024\)Scientific knowledge graph and ontology generation using open large language models\.InAI for Accelerated Materials Design \- NeurIPS 2024,External Links:[Link](https://openreview.net/forum?id=wMMhffCxXZ)Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26963#S2.SS1.p2.1)\.
- A\. Panchenko, S\. Faralli, E\. Ruppert, S\. Remus, H\. Naets, C\. Fairon, S\. P\. Ponzetto, and C\. Biemann \(2016\)TAXI at SemEval\-2016 task 13: a taxonomy induction method based on lexico\-syntactic patterns, substrings and focused crawling\.InProceedings of the 10th International Workshop on Semantic Evaluation \(SemEval\-2016\),S\. Bethard, M\. Carpuat, D\. Cer, D\. Jurgens, P\. Nakov, and T\. Zesch \(Eds\.\),San Diego, California,pp\. 1320–1327\.External Links:[Link](https://aclanthology.org/S16-1206/),[Document](https://dx.doi.org/10.18653/v1/S16-1206)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- J. Polchar (2024) Using foresight to anticipate emerging critical risks: proposed methodology. Technical Report 79, OECD Working Papers on Public Governance, OECD Publishing, Paris. External Links: [Document](https://dx.doi.org/10.1787/84820cd8-en), [Link](https://doi.org/10.1787/84820cd8-en) Cited by: [§1](https://arxiv.org/html/2606.26963#S1.p1.1).
- J\. Priem, H\. Piwowar, and R\. Orr \(2022\)OpenAlex: a fully\-open index of scholarly works, authors, venues, institutions, and concepts\.External Links:2205\.01833,[Link](https://arxiv.org/abs/2205.01833)Cited by:[1st item](https://arxiv.org/html/2606.26963#A1.I1.i1.p1.1),[§3](https://arxiv.org/html/2606.26963#S3.p1.1)\.
- Publications Office of the European Union \(2015\)CORDIS \- EU Research Projects under Horizon 2020 \(2014–2020\)\.Publications Office of the European Union\.Note:Accessed: 2026\-03\-13External Links:[Document](https://dx.doi.org/10.2906/112117098108/12),[Link](https://doi.org/10.2906/112117098108/12)Cited by:[3rd item](https://arxiv.org/html/2606.26963#A1.I1.i3.p1.1),[§3](https://arxiv.org/html/2606.26963#S3.p1.1)\.
- Publications Office of the European Union \(2022\)CORDIS \- EU Research Projects under HORIZON EUROPE \(2021–2027\)\.Publications Office of the European Union\.Note:Accessed: 2026\-03\-13External Links:[Document](https://dx.doi.org/10.2906/112117098108/20),[Link](https://doi.org/10.2906/112117098108/20)Cited by:[3rd item](https://arxiv.org/html/2606.26963#A1.I1.i3.p1.1),[§3](https://arxiv.org/html/2606.26963#S3.p1.1)\.
- D\. Sculley \(2010\)Web\-scale k\-means clustering\.InProceedings of the 19th International Conference on World Wide Web,WWW ’10,New York, NY, USA,pp\. 1177–1178\.External Links:ISBN 9781605587998,[Link](https://doi.org/10.1145/1772690.1772862),[Document](https://dx.doi.org/10.1145/1772690.1772862)Cited by:[§4\.3](https://arxiv.org/html/2606.26963#S4.SS3.SSS0.Px1.p1.1)\.
- E\. Senger, Y\. Campbell, R\. V\. D\. Goot, and B\. Plank \(2025\)Crossing domains without labels: distant supervision for term extraction\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,S\. Potdar, L\. Rojas\-Barahona, and S\. Montella \(Eds\.\),Suzhou \(China\),pp\. 1366–1378\.External Links:[Link](https://aclanthology.org/2025.emnlp-industry.95/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.95),ISBN 979\-8\-89176\-333\-3Cited by:[§4\.2](https://arxiv.org/html/2606.26963#S4.SS2.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.26963#S5.SS1.p1.1)\.
- Y\. Shen, Y\. Zhang, Y\. Zhang, and J\. Han \(2025\)A unified taxonomy\-guided instruction tuning framework for entity set expansion and taxonomy expansion\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 3208–3220\.External Links:[Link](https://aclanthology.org/2025.findings-acl.167/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.167),ISBN 979\-8\-89176\-256\-5Cited by:[§2\.2](https://arxiv.org/html/2606.26963#S2.SS2.p1.1)\.
- V\. Shwartz, Y\. Goldberg, and I\. Dagan \(2016\)Improving hypernymy detection with an integrated path\-based and distributional method\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),K\. Erk and N\. A\. Smith \(Eds\.\),Berlin, Germany,pp\. 2389–2398\.External Links:[Link](https://aclanthology.org/P16-1226/),[Document](https://dx.doi.org/10.18653/v1/P16-1226)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- S\. Terragni, E\. Fersini, and E\. Messina \(2021\)Word embedding\-based topic similarity measures\.InInternational conference on applications of Natural Language to information systems,pp\. 33–45\.Cited by:[§5\.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px1.p1.1)\.
- C\. Wang, M\. Danilevsky, N\. Desai, Y\. Zhang, P\. Nguyen, T\. Taula, and J\. Han \(2013\)A phrase mining framework for recursive construction of a topical hierarchy\.InProceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’13,New York, NY, USA,pp\. 437–445\.External Links:ISBN 9781450321747,[Link](https://doi.org/10.1145/2487575.2487631),[Document](https://dx.doi.org/10.1145/2487575.2487631)Cited by:[§2](https://arxiv.org/html/2606.26963#S2.p1.1)\.
- W\. Webber, A\. Moffat, and J\. Zobel \(2010\)A similarity measure for indefinite rankings\.ACM Transactions on Information Systems \(TOIS\)28\(4\),pp\. 1–38\.Cited by:[§5\.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px1.p1.1)\.
- C\. Zhang, F\. Tao, X\. Chen, J\. Shen, M\. Jiang, B\. Sadler, M\. Vanni, and J\. Han \(2018\)TaxoGen: unsupervised topic taxonomy construction by adaptive term embedding and clustering\.InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,KDD ’18,New York, NY, USA,pp\. 2701–2709\.External Links:ISBN 9781450355520,[Link](https://doi.org/10.1145/3219819.3220064),[Document](https://dx.doi.org/10.1145/3219819.3220064)Cited by:[§2\.2](https://arxiv.org/html/2606.26963#S2.SS2.p1.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.External Links:2506\.05176,[Link](https://arxiv.org/abs/2506.05176)Cited by:[§5\.1](https://arxiv.org/html/2606.26963#S5.SS1.p1.1)\.
- K\. Zhu, L\. Liao, Y\. Gu, L\. Huang, X\. Feng, and B\. Qin \(2025\)Context\-aware hierarchical taxonomy generation for scientific papers via LLM\-guided multi\-aspect clustering\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 15627–15645\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.788/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.788),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.26963#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26963#S2.SS1.p2.1),[§2](https://arxiv.org/html/2606.26963#S2.p1.1),[§5\.2](https://arxiv.org/html/2606.26963#S5.SS2.SSS0.Px3.p1.1)\.
- X\. Zhu, Z\. Ming, X\. Zhu, and T\. Chua \(2013\)Topic hierarchy construction for the organization of multi\-source user generated contents\.InProceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’13,New York, NY, USA,pp\. 233–242\.External Links:ISBN 9781450320344,[Link](https://doi.org/10.1145/2484028.2484032),[Document](https://dx.doi.org/10.1145/2484028.2484032)Cited by:[§2\.1](https://arxiv.org/html/2606.26963#S2.SS1.p1.1)\.
## Appendix A Details on the Dataset Creation and Licensing
### A.1 Dataset Creation
This appendix provides details on the data retrieval, filtering, and text construction steps applied to each source in the dataset\.
- OpenAlex (Priem et al., [2022](https://arxiv.org/html/2606.26963#bib.bib42)): We retrieved publications with at least one German-affiliated author for the period 2015–2023 via the OpenAlex API. Only records with a DOI were retained. The text corpus was constructed by concatenating the title and abstract fields.
- USPTO (Li et al., [2018](https://arxiv.org/html/2606.26963#bib.bib41)): We used patent records from the USPTO dataset for the years 2014–2015. The textual content consists of the Abstract field.
- CORDIS: We used project data from Horizon 2020 (funded 2014–2020) (Publications Office of the European Union, [2015](https://arxiv.org/html/2606.26963#bib.bib43)) and Horizon Europe (funded 2021–2027) (Publications Office of the European Union, [2022](https://arxiv.org/html/2606.26963#bib.bib44)). The text corpus combines the ProjectTitle and ProjectObjective fields.
- FöKAT (Bundesministerium für Forschung, Technologie und Raumfahrt (BMFTR), [2026](https://arxiv.org/html/2606.26963#bib.bib45)): We used funding records from 2014–2025, with the year determined by the Laufzeit von (start of funding period) field. The text corpus consists of the Thema (topic) field. Compared to the other datasets, FöKAT entries are substantially shorter and frequently contain funding-specific terminology or proper names of funding programs without descriptive context. To mitigate this, domain experts excluded selected categories from the FöKAT taxonomy (Leistungsplansystematik) and applied additional keyword-based filtering to remove highly specific funding instruments (e.g., Professorinnenprogramm).
Across all sources, we additionally removed extremely short entries (fewer than five characters).
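The text construction and length filter described above can be sketched as follows. The field names follow the list above, while the record layout, the source keys, and the helper names are assumptions made for illustration. The retrieval steps (API queries, DOI and year filters, expert category exclusions) are not reproduced here.

```python
def build_text(record, fields):
    """Concatenate the given fields of a record, skipping missing or empty values."""
    parts = [str(record[f]).strip() for f in fields if record.get(f)]
    return " ".join(p for p in parts if p)

# Field names follow the appendix; this mapping itself is a hypothetical layout.
SOURCE_FIELDS = {
    "openalex": ["title", "abstract"],
    "uspto": ["Abstract"],
    "cordis": ["ProjectTitle", "ProjectObjective"],
    "foekat": ["Thema"],
}
MIN_CHARS = 5  # entries with fewer characters are removed

def build_corpus(records):
    """Yield (source, text) pairs for records that pass the length filter."""
    for record in records:
        text = build_text(record, SOURCE_FIELDS[record["source"]])
        if len(text) >= MIN_CHARS:
            yield record["source"], text

# Invented example records.
records = [
    {"source": "openalex", "title": "Graph learning", "abstract": "We study ..."},
    {"source": "foekat", "Thema": "KI"},  # too short, removed by the filter
    {"source": "uspto", "Abstract": "A method for cooling a battery pack."},
]
corpus = list(build_corpus(records))
print(corpus)
```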
### A.2 Licensing
The benchmark dataset aggregates information from United States Patent and Trademark Office (USPTO) patents (CC BY 4.0), OpenAlex publications (CC0), CORDIS project data (generally available under CC BY 4.0 / PSI reuse terms), and FöKAT records used with permission. To ensure license compatibility and meet attribution requirements across sources, the compiled benchmark dataset is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license.
## Appendix B Algorithms
In Algorithm [2](https://arxiv.org/html/2606.26963#algorithm2), we present the classical approach to hierarchical clustering using recursive K-Means. In Algorithm [3](https://arxiv.org/html/2606.26963#algorithm3), we reproduce the SCYCHIC algorithm, introduced by Gao et al. ([2025](https://arxiv.org/html/2606.26963#bib.bib5)).
Algorithm 2: Recursive K-Means hierarchy construction

Input: Set of documents $\{d_1, \ldots, d_n\}$, number of layers $L$, target number of clusters $(k_1, \ldots, k_L)$

Initialization: Embed each $d_i$

- for $l = 1$ to $L$ do
  - if $l = 1$ then
    - Cluster all documents into $k_1$ clusters
  - else
    - for each cluster $c$ from layer $l-1$ do
      - Cluster documents of $c$ into $k_l$ subclusters
- return Hierarchical structure
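As a rough illustration of Algorithm 2, the sketch below builds a nested hierarchy by recursively splitting each cluster. It is a toy under stated assumptions: a plain pure-Python Lloyd-style K-Means stands in for whatever K-Means implementation is used in practice, the 2-D points stand in for document embeddings, and the function names and the dictionary-based tree layout are hypothetical.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm on lists of floats; returns one label per point."""
    k = min(k, len(points))
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def recursive_kmeans(embeddings, ks, rng, ids=None):
    """Split the documents into ks[0] clusters, then recurse with ks[1:]."""
    if ids is None:
        ids = list(range(len(embeddings)))
    if not ks or len(ids) <= 1:
        return {"docs": ids, "children": []}
    labels = kmeans([embeddings[i] for i in ids], ks[0], rng)
    children = []
    for c in sorted(set(labels)):
        member_ids = [i for i, lab in zip(ids, labels) if lab == c]
        children.append(recursive_kmeans(embeddings, ks[1:], rng, member_ids))
    return {"docs": ids, "children": children}

rng = random.Random(0)
# Toy 2-D "embeddings": three well-separated blobs of 20 points each.
embeddings = [
    [cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)]
    for cx, cy in [(0, 0), (5, 5), (10, 0)]
    for _ in range(20)
]
tree = recursive_kmeans(embeddings, ks=[3, 2], rng=rng)
print(len(tree["children"]), [len(child["docs"]) for child in tree["children"]])
```

Because the centers are initialized at random, the cluster sizes can vary between seeds. Every document is still assigned to exactly one cluster per layer, which is the property the recursion relies on.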
Input: Set of documents $\{d_i, \ldots, d_n\}$, number of layers $L$, target number of clusters $(k_1, \ldots, k_L)$

Initialization: Embed each $d_i$

- for $l = 1$ to $\lfloor L/2 \rfloor$ do \(top\-down phase\)
  - if $l = 1$ then
    - Cluster all documents into $k_1$ clusters
  - else
    - for each cluster $c$ from layer $l-1$ do
      - Cluster documents of $c$ into $k_l$ subclusters
- for each cluster $\tau$ at level $\lfloor L/2 \rfloor$ do \(bottom\-up phase\)
  - for $l = L$ down to $\lfloor L/2 \rfloor + 1$ do
    - if $l = L$ then
      - $E$ = \{ embeddings of documents within $\tau$ \}
    - else
      - $E$ = \{ embeddings of summaries of cluster $l+1$ \}
    - Cluster $E$ into $k_l$ subclusters
    - Generate summary for each subcluster
- return Hierarchical structure

Algorithm 3: Scychic algorithm
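The bottom\-up phase of Algorithm 3 for a single mid\-level cluster can be sketched as follows. This is an illustrative sketch only: the step "generate summary, then embed it" is replaced by a stand\-in that returns the centroid of the subcluster, which is an assumption of this example and not how the original method produces summaries. The names `bottom_up`, `mean_vector`, and `summarize` are hypothetical, and the top\-down phase is omitted because it matches the recursive K\-Means of Algorithm 2.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Basic Lloyd's k-means; returns one cluster label per point."""
    k = min(k, len(points))
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def mean_vector(vectors):
    """Stand-in (assumption) for a summary embedding: the subcluster centroid."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def bottom_up(tau_embeddings, ks, summarize=mean_vector, seed=0):
    """Bottom-up phase for one cluster tau at level floor(L/2).

    `ks` lists the target cluster counts from layer L up to layer
    floor(L/2)+1. Returns one entry per layer, deepest first; each entry
    is a list of groups holding indices into the items clustered at that
    layer (documents at layer L, subcluster summaries above it).
    """
    rng = random.Random(seed)
    items = list(tau_embeddings)  # E at layer L: document embeddings in tau
    layers = []
    for k in ks:
        labels = kmeans(items, k, rng)
        groups = [
            [i for i, lab in enumerate(labels) if lab == c]
            for c in sorted(set(labels))
        ]
        layers.append(groups)
        # E for the next layer up: one summary embedding per subcluster.
        items = [summarize([items[i] for i in group]) for group in groups]
    return layers
```

Passing `summarize` as a parameter keeps the clustering logic separate from the summary step, so a function that generates a text summary and embeds it could be swapped in without changing the loop.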
## Appendix C Details on Human Evaluation
The four dimensions for the Likert scale are:
- •Structure: Does the organization follow a clear logical hierarchy, transitioning from broad research areas to specific sub\-technologies, or do the clusters maintain a consistent level of granularity on each layer?
- •Validity: Does the taxonomy align with experts’ understanding of the scientific and technological landscape, including how concepts are commonly grouped, related, or distinguished?
- •Composition: Do the clusters reflect an appropriate and plausible composition of data sources given the nature of the technology \(e\.g\., publication\-dominated scientific fields versus patent\- or project\-intensive applied domains\)?
- •Usefulness \(Case Study\): How useful is the hierarchy for policy analysts or foresight practitioners seeking to monitor technological developments, compare signals across data sources, or identify emerging or underexplored areas?
The 1–5 Likert scale is defined as:
1. Completely inaccurate, with significant factual errors or misrepresentations of the domain\.
2. Mostly inaccurate, capturing only a few correct facts but failing to represent the domain coherently\.
3. Moderately accurate, containing some factual correctness but missing important concepts or relationships\.
4. Mostly accurate, representing the domain well with minor factual inaccuracies or omissions\.
5. Highly accurate, thoroughly reflecting the domain’s factual structure with no noticeable errors\.
## Appendix D Per\-Annotator Likert Scale Ratings
Tables [7](https://arxiv.org/html/2606.26963#A4.T7) and [8](https://arxiv.org/html/2606.26963#A4.T8) present the per\-annotator Likert\-scale ratings\. TERMNET consistently achieves the highest score, either alone or tied with other approaches, with one exception: in the “Structure” dimension for Annotator 1, it scores one point below the top\-rated approach\.
Table 7: Likert\-scale ratings from Annotator 1

Table 8: Likert\-scale ratings from Annotator 2