Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study
Summary
This paper proposes a zero-shot multi-label topic classification framework enhanced with per-article knowledge graphs, comparing four base variants and their graph-augmented counterparts across fifteen LLMs and eight datasets. The study finds that keyword-enhanced classification performs best, and graph augmentation improves small models but degrades performance in larger ones.
# Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study
Source: [https://arxiv.org/html/2605.30465](https://arxiv.org/html/2605.30465)
Shahana Akter1, Yatharth Vohra2, Ankita Shukla2, Souvika Sarkar1

1A2I Lab, School of Computing, Wichita State University \(sxakter7@shockers\.wichita\.edu, souvika\.sarkar@wichita\.edu\)

2EIS Lab, College of Engineering, University of Nevada, Reno \(yvohra@unr\.edu, ankitas@unr\.edu\)
###### Abstract
Multi\-label topic classification without labeled training data is a challenging task, especially when documents contain complex relational information\. We present a zero\-shot multi\-label topic classification framework and systematically investigate how per\-article knowledge graph augmentation affects its performance\. The base framework classifies topics in documents without labeled training data and has four variants: article\-only classification, keyword\-enhanced classification, and self\-consistency decoding variants of both\. We then augment each base variant with a per\-article knowledge graph, extracted from the input document as subject\-predicate\-object triples through a pipeline similar to KGGen\. We test all eight methods \(four base and four graph\-augmented\) on fifteen LLMs and eight multi\-label datasets across different domains\. For the base framework, keyword\-enhanced classification \(AK\) is the best\-performing method, and six out of fifteen LLMs surpass the sentence\-encoder baseline\. Graph augmentation helps small models but hurts large ones, which suggests that larger models already capture enough relational information from pretraining\. Furthermore, the self\-consistency decoding variants do not improve performance in any experiment while increasing computation cost about fivefold\.
## 1 Introduction
Text classification is a core task in natural language processing\. It is concerned with assigning one or more labels to a document based on its content \(Alghamdi and Alfalqi, [2015](https://arxiv.org/html/2605.30465#bib.bib8); Chauhan and Shah, [2021](https://arxiv.org/html/2605.30465#bib.bib9)\)\. Recent large language models \(LLMs\) have expanded the scope of this task by enabling zero\-shot and few\-shot classification from natural\-language instructions alone \(Brown et al\., [2020](https://arxiv.org/html/2605.30465#bib.bib5); Yin et al\., [2019](https://arxiv.org/html/2605.30465#bib.bib6)\)\. This is especially useful in settings where labeled training data is unavailable or where label sets must be defined dynamically at inference time\.
A particularly difficult variant is *multi\-label* text classification, in which each document may belong to multiple categories \(Veeranna et al\., [2016](https://arxiv.org/html/2605.30465#bib.bib7)\)\. In realistic applications, a single article or review often spans several related topics\. For example, a health article may concern both Heart Health and Women’s Health, and a product review may discuss both Lens and Battery performance\. In zero\-shot settings the problem becomes harder, because the model must map documents to previously unseen or user\-defined labels without task\-specific fine\-tuning\.
Recent work has shown that zero\-shot multi\-label topic inference is feasible using either sentence\-encoder similarity or prompt\-based LLM classification \(Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\)\. In such settings, users provide a collection of documents together with a set of candidate topic labels, and optionally short keyword descriptions for each topic, and the system must assign zero or more labels to each document\. Prior results indicate that topic keywords substantially improve performance over topic names alone\. They also show that strong sentence encoders such as Sentence\-BERT are competitive embedding\-based baselines, and that prompt\-based LLM methods can outperform encoder\-based similarity approaches on many datasets \(Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1); Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30465#bib.bib2)\)\.
However, most existing zero\-shot approaches represent documents and labels as largely flat text objects\. Sentence\-encoder methods collapse a document into a single embedding vector, while direct prompting methods typically supply only the raw article text and a list of candidate labels or keywords\. In both cases, relational structure inside the document remains implicit\. As a result, documents that contain overlapping vocabulary but differ in how concepts are related can be difficult to distinguish\. For example, two topics such as Mental Health and Brain and Cognitive Health may be associated with similar terms, even when the underlying conceptual relationships in the document differ in meaningful ways\.
This paper addresses that limitation by augmenting zero\-shot multi\-label topic classification with *per\-article knowledge graphs*\. For each input article, we extract a directed graph of subject–predicate–object triples using an LLM\-based pipeline adapted from KGGen \(Mo et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib14)\)\. The resulting graph provides a structured representation of entities and relations mentioned in the article\. It captures information that is difficult to preserve in a single pooled embedding or an unconstrained raw\-text prompt\. We then serialize this graph into a compact textual form and provide it to an LLM as additional context for topic prediction\.
We evaluate our approach across fifteen LLMs and eight multi\-label datasets, including the seven benchmark datasets used in prior zero\-shot topic\-inference work and the English subset of the SemEval\-2018 Task 1 dataset \(Mohammad et al\., [2018](https://arxiv.org/html/2605.30465#bib.bib15); Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\)\. We compare against established sentence\-encoder baselines and examine whether structured relational context improves zero\-shot multi\-label classification performance\.
Specifically, our contributions are as follows\.
1. We build a zero\-shot multi\-label topic classification framework with four variants: Article Only \(AO\), Article \+ Keywords \(AK\), and self\-consistency decoding versions of both \(AOS, AKS\)\. We then evaluate it systematically across fifteen LLMs and eight datasets\.
2. We propose a per\-article knowledge graph construction pipeline adapted from KGGen \(Mo et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib14)\) and define four graph\-augmented variants \(AG, AKG, AGS, AKGS\) that add graph context to each base method\.
3. We conduct a systematic study comparing all eight methods\. The study demonstrates that graph augmentation consistently helps smaller models\. It is most effective when combined with keyword\-enhanced classification with self\-consistency\.
4. We provide a runtime and cost analysis showing that self\-consistency decoding increases computational cost approximately fivefold but does not improve classification performance in any of the settings\.
## 2 Related Work
This work is built on prior research in topic modeling, supervised and zero\-shot text classification, large language models, knowledge graph construction, and knowledge\-graph\-enhanced text classification\. We review the most relevant work in each area below and position our contribution relative to the existing literature\.
### 2\.1 Topic Modeling and Topic Classification
##### Classical Topic Models:
Probabilistic approaches to topic discovery date back to PLSA \(Hofmann, [1999](https://arxiv.org/html/2605.30465#bib.bib31)\)\. Blei et al\. later introduced LDA, which added a document\-level generative process and became the dominant unsupervised topic model \(Blei et al\., [2003](https://arxiv.org/html/2605.30465#bib.bib32)\)\. Subsequent work extended topic modeling in several directions, including document\-relative similarity \(Du et al\., [2015](https://arxiv.org/html/2605.30465#bib.bib34)\), weakly supervised topic\-label mapping \(Hingmire and Chakraborti, [2014](https://arxiv.org/html/2605.30465#bib.bib35)\), and hierarchical Dirichlet processes \(Wang et al\., [2011](https://arxiv.org/html/2605.30465#bib.bib33)\)\.
##### Supervised and Weakly Supervised Topic Classification:
When labeled data is available, supervised methods can learn topic assignments directly and often perform well on well\-annotated corpora \(Tuarob et al\., [2015](https://arxiv.org/html/2605.30465#bib.bib39)\)\. Other work has studied noisy category annotations \(Iwata et al\., [2009](https://arxiv.org/html/2605.30465#bib.bib36)\), used topic models to support document annotation \(Poursabzi\-Sangdeh and Boyd\-Graber, [2015](https://arxiv.org/html/2605.30465#bib.bib38)\), and applied latent topic models to video categorization \(Engels et al\., [2010](https://arxiv.org/html/2605.30465#bib.bib37)\)\. Additional work has explored weakly supervised neural classification \(Meng et al\., [2018](https://arxiv.org/html/2605.30465#bib.bib40)\) and domain\-specific supervised classification tasks such as tracking sexual violence reports \(Hassan et al\., [2020](https://arxiv.org/html/2605.30465#bib.bib41)\)\.
##### Zero\-Shot Topic Classification:
A separate line of work considers settings in which no labeled examples are available at inference time\. Karmaker Santu et al\. explored topic\-modeling\-based zero\-shot methods \(Santu et al\., [2016](https://arxiv.org/html/2605.30465#bib.bib42)\), while Li et al\. and Zha and Li studied dataless classification without annotated training data \(Li et al\., [2016](https://arxiv.org/html/2605.30465#bib.bib43); Zha and Li, [2019](https://arxiv.org/html/2605.30465#bib.bib44)\)\. Veeranna et al\. measured label\-document similarity using pretrained word embeddings \(Veeranna et al\., [2016](https://arxiv.org/html/2605.30465#bib.bib7)\), and later work pursued zero\-shot text classification through embedding\-based and prompt\-based approaches \(Rios and Kavuluru, [2018](https://arxiv.org/html/2605.30465#bib.bib45); Xia et al\., [2018](https://arxiv.org/html/2605.30465#bib.bib46); Pushp and Srivastava, [2017](https://arxiv.org/html/2605.30465#bib.bib47); Puri and Catanzaro, [2019](https://arxiv.org/html/2605.30465#bib.bib48); Chen et al\., [2021](https://arxiv.org/html/2605.30465#bib.bib49); Gong and Eldardiry, [2021](https://arxiv.org/html/2605.30465#bib.bib50); Yin et al\., [2019](https://arxiv.org/html/2605.30465#bib.bib6)\)\.
The most directly relevant prior work is that of Sarkar et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib1)\), who formalized zero\-shot multi\-label topic inference and benchmarked sentence encoders and LLMs across seven datasets\. Their results showed that keyword\-augmented topic representations improved performance and that Sentence\-BERT was the strongest embedding model among the sentence\-encoder baselines they evaluated \(Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1); Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30465#bib.bib2)\)\. However, their framework reduces each document and topic to a single vector representation, leaving relational structure within the document unmodeled\. Van Nooten et al\. \([2026](https://arxiv.org/html/2605.30465#bib.bib12)\) further showed that fixed global thresholds perform poorly when similarity distributions vary across models and label sets\. Our approach addresses these limitations by constructing per\-article knowledge graphs that provide relational context unavailable to flat embedding representations\.
### 2\.2 Large Language Models and Prompting
LLMs have shown strong zero\-shot classification ability, assigning labels from natural\-language task descriptions without task\-specific training \(Brown et al\., [2020](https://arxiv.org/html/2605.30465#bib.bib5); Yin et al\., [2019](https://arxiv.org/html/2605.30465#bib.bib6); Chae and Davidson, [2025](https://arxiv.org/html/2605.30465#bib.bib11); Vandemoortele et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib18)\)\. Prior work has also shown that prompting strategy matters: prompt design can substantially affect model behavior and downstream performance \(Kojima et al\., [2022](https://arxiv.org/html/2605.30465#bib.bib19); Sahoo et al\., [2024](https://arxiv.org/html/2605.30465#bib.bib20); White et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib28); Reynolds and McDonell, [2021](https://arxiv.org/html/2605.30465#bib.bib29)\)\. Jang et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib30)\) further tested the limits of prompt understanding under negated instructions\. Applying prompt\-based classification to multi\-label topic classification introduces practical challenges\. Plain\-text prompts do not explicitly encode the relations among entities, events, and concepts mentioned in an article\. To address this limitation, we serialize a per\-article knowledge graph as a compact set of entities and subject–predicate–object triples, giving the model relational context that raw text alone does not provide\. Huang et al\. \([2024](https://arxiv.org/html/2605.30465#bib.bib51)\) suggest that LLMs tend to treat graph\-structured prompts more as contextual paragraphs than as explicit graph structures\.
### 2\.3 Knowledge Graph Construction from Text
Knowledge graph extraction from unstructured text is a longstanding problem in information extraction and knowledge acquisition \(Ji et al\., [2022](https://arxiv.org/html/2605.30465#bib.bib17)\)\. Early rule\-based systems such as YAGO \(Suchanek et al\., [2007](https://arxiv.org/html/2605.30465#bib.bib52)\) relied on hard\-coded rules\. OpenIE \(Angeli et al\., [2015](https://arxiv.org/html/2605.30465#bib.bib53)\) later improved on this by using dependency parsing to extract \(subject, relation, object\) triples\. However, both tend to produce overly specific, inconsistent predicates\. More recent transformer\-based extraction pipelines \(Qiao et al\., [2022](https://arxiv.org/html/2605.30465#bib.bib54)\) and external knowledge base approaches, such as entity linking to Wikidata or ConceptNet, offer better quality, but they require fixed relation schemas, domain\-specific supervision, or entities present in a pre\-built knowledge base\. None of these are applicable in our open\-domain, zero\-shot setting\. Hence, we adapt a more recent approach, KGGen \(Mo et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib14)\)\.
KGGen uses LLMs to produce subject–predicate–object triples in two stages: entity extraction and relation extraction based on the extracted entity list\. It also includes clustering to merge similar entities\. Because KGGen was designed for corpus\-level extraction, we adapt it here to operate independently on each article to match the per\-document setting of zero\-shot multi\-label topic classification\.
### 2\.4 Knowledge\-Graph\-Enhanced Text Classification
A growing body of work has explored how knowledge graphs can improve text classification\. Wang et al\. \([2017](https://arxiv.org/html/2605.30465#bib.bib27)\) augmented a CNN\-based classifier with concept mappings from an external knowledge base to address limited context in short texts\. Chen et al\. \([2022](https://arxiv.org/html/2605.30465#bib.bib21)\) proposed S\-BERT\-KG, which enriches Sentence\-BERT representations with ConceptNet\-based knowledge for zero\-shot social media classification\. Shi et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib22)\) introduced ChatGraph, which extracts knowledge graphs from text and uses the resulting graph representations to train an interpretable classifier\. Liu et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib23)\) incorporated external knowledge into a hierarchical text classification model through a knowledge\-aware encoder and hierarchical label attention\. Zang et al\. \([2025](https://arxiv.org/html/2605.30465#bib.bib24)\) proposed KG\-HTC, which retrieves relevant subgraphs from a label\-taxonomy knowledge graph and provides them to an LLM as structured context for zero\-shot hierarchical text classification\. Other related work has explored knowledge\-graph\-based data expansion under limited labeled data \(Zhang and Shafiq, [2023](https://arxiv.org/html/2605.30465#bib.bib25)\) and graph\-based similarity methods enriched with external knowledge \(Shanavas et al\., [2021](https://arxiv.org/html/2605.30465#bib.bib26)\)\.
Despite these advances, most existing approaches rely on external knowledge bases or supervised training \(Chen et al\., [2022](https://arxiv.org/html/2605.30465#bib.bib21); Shi et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib22); Liu et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib23); Zang et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib24); Shanavas et al\., [2021](https://arxiv.org/html/2605.30465#bib.bib26)\)\. In contrast, our approach constructs a knowledge graph directly from each article\. It requires neither labeled training data nor an external graph, and it targets zero\-shot classification\.
### 2\.5 Self\-Consistency Decoding
Because LLM outputs are stochastic, a single decoding pass may be unreliable\. Wang et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib13)\) proposed self\-consistency, which samples multiple outputs at non\-zero temperature and aggregates them\. We adapt that idea to two of our four base classification variants and to their graph\-augmented counterparts\. We run the model five times and retain a topic only if it appears in at least two runs\.
To our knowledge, prior work has not directly combined per\-article knowledge graph extraction with LLM prompting for zero\-shot multi\-label topic classification\. Embedding\-based approaches \(Veeranna et al\., [2016](https://arxiv.org/html/2605.30465#bib.bib7); Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\) and direct LLM prompting \(Brown et al\., [2020](https://arxiv.org/html/2605.30465#bib.bib5); Yin et al\., [2019](https://arxiv.org/html/2605.30465#bib.bib6); Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\) both treat documents and topics as largely flat representations, leaving relational structure unmodeled\. Knowledge\-graph\-enhanced classifiers, as reviewed above, largely rely on pre\-built graphs as an external source of information\. Our framework addresses this gap by extracting a knowledge graph from each article’s text and supplying it as structured context to a zero\-shot LLM classifier\. We evaluate it across fifteen LLMs and eight multi\-label datasets\.
## 3 Problem Statement
In traditional topic classification, a labeled training dataset and a set of predefined labels are given, which allows the model to be trained in a supervised manner \(Alghamdi and Alfalqi, [2015](https://arxiv.org/html/2605.30465#bib.bib8)\)\. However, in real\-life applications, the set of labels is not always fixed\. It is mostly application\-dependent and differs from one user to another\. Consequently, it is hard to anticipate the target set of labels and the corresponding labeled training dataset in advance\. This is why we work on a zero\-shot basis, as proposed by Yin et al\. \([2019](https://arxiv.org/html/2605.30465#bib.bib6)\)\. In zero\-shot classification, the user provides the set of labels at inference time, and no labeled data is provided beforehand\. We define our problem as follows:
###### Definition 1
Given a collection of documents $\mathcal{D}=\{d_1,d_2,\ldots,d_n\}$ and a set of user\-defined topics $\mathcal{T}_x=\{t_1,\ldots,t_m\}$ provided at inference time, assign zero or more topics from $\mathcal{T}_x$ to each document $d_i\in\mathcal{D}$\. The model does not have access to any labeled training data\. Each topic $t\in\mathcal{T}_x$ is expressed as a word or short phrase\. The user optionally provides a list of associated keywords $\mathcal{K}_t$ per topic as additional context\.
The zero\-shot method requires everything to be processed in real time\. The topic list $\mathcal{T}_x$ is defined by the user at inference time\. Therefore, no topic\-specific labeled data is provided beforehand \(Yin et al\., [2019](https://arxiv.org/html/2605.30465#bib.bib6)\)\. The multi\-label requirement reflects that a single document may have several topics associated with it\. For example, a health article may talk about both Heart Health and Women’s Health, or a product review may discuss both Lens and Battery performance\.
The optional keyword list $\mathcal{K}_t$ helps the model better understand the user’s intent\. For the same document, two different users can define entirely different topic lists $\mathcal{T}_x$\. In this case, keywords can help to clarify the meaning of each topic in a particular context\. A topic $t$ may not appear by its name or phrase anywhere in a document $d_i$\. For example, a text about Depression, Anxiety, and Antidepressant Drugs may not contain the phrase Mental Health anywhere, yet the document is clearly about Mental Health\. It is important to identify such implicit topics\.
Conversely, if a keyword is present in a document, that does not necessarily mean that the document is about the corresponding topic\. Similarly, the absence of keywords does not mean that the topic is not relevant\. The keyword is given just as an additional detail\. In summary, keywords are useful clues that the user provides to the model\. However, the presence or absence of the keywords does not always determine the correct topic assignments for a particular document\.
## 4 Methodology
Our framework consists of two components: a base classification pipeline that operates directly on article text, and an optional knowledge graph augmentation that adds structured relational context to each classification variant\. The graph\-augmented pipeline is illustrated in Figure [1](https://arxiv.org/html/2605.30465#S4.F1)\. We describe the base methods first, then the knowledge graph construction, and finally the graph\-augmented variants\.
### 4\.1 Base Classification Pipeline
In the base classification pipeline, the LLM is queried directly for the topic\(s\) of a given document\. The input to the model consists of the document text and a list of candidate topics, and the model is expected to generate zero or more topic labels in one or more forward passes, depending on whether a self\-consistency variant is used\. No labeled examples are provided in the prompt, so the pipeline is fully zero\-shot\.
#### 4\.1\.1 Variant AO: Article Only
The model receives only the article text and the topic names, with no additional information\. It must rely entirely on its own knowledge of what each topic name means and on the content of the article\. This is the hardest setting, and also the most realistic one when neither keywords nor structured context are available\.
#### 4\.1\.2 Variant AK: Article \+ Keywords
The model is additionally provided with a per\-topic keyword list\. Every dataset we test on has a corresponding keyword file \(provided by Sarkar et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib1)\) for seven of the datasets and generated by us for SemEval\) that contains keywords strongly associated with each topic, arranged as: topic \(keywords: k1, k2, k3, …\)\. The model is asked to use these keywords to understand what each topic means, compare the article content with them, and identify the appropriate topics\. If there is no matching topic, the output is none\. The keywords guide the model in connecting the article content to the topic labels\.
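As a rough illustration of the keyword format above, the sketch below renders a topic list for the AO and AK prompts and maps a raw model reply back to candidate labels\. The helper names `format_topics` and `parse_prediction`, and the assumption that the model replies with a comma\-separated list, are hypothetical and are not taken from the paper\.

```python
from typing import Dict, List, Optional


def format_topics(topics: List[str],
                  keywords: Optional[Dict[str, List[str]]] = None) -> str:
    """Render one topic per line, as `topic (keywords: k1, k2, ...)` when
    keywords are available (AK) and as the bare topic name otherwise (AO)."""
    lines = []
    for topic in topics:
        kws = (keywords or {}).get(topic)
        if kws:
            lines.append(f"{topic} (keywords: {', '.join(kws)})")
        else:
            lines.append(topic)
    return "\n".join(lines)


def parse_prediction(raw: str, topics: List[str]) -> List[str]:
    """Map a raw reply to candidate labels. Assumes (hypothetically) a
    comma-separated reply; the literal reply 'none' means no topic."""
    text = raw.strip()
    if text.lower() == "none":
        return []
    allowed = {t.lower(): t for t in topics}
    picked: List[str] = []
    for part in text.split(","):
        key = part.strip().lower()
        # Drop anything that is not one of the user-supplied topics.
        if key in allowed and allowed[key] not in picked:
            picked.append(allowed[key])
    return picked
```

Restricting the parsed output to the user\-supplied topic list is a design choice of this sketch: it keeps labels that the model invents out of the prediction set\.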
#### 4\.1\.3 Variant AOS: Article Only with Self\-Consistency Decoding
Self\-consistency \(Wang et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib13)\) addresses the stochastic nature of single\-pass LLM generation: the same prompt may produce different outputs on different runs\. In AOS, the Article Only prompt is run $N=5$ times at temperature $0.5$\. All topics predicted from all iterations are collected, and topics that appear in at least two out of five iterations are retained in the final prediction\. In this way, self\-consistency serves as noise reduction\. Topics which repeat consistently are more likely to be valid, whereas those which occur in a single iteration are treated as noise\.
#### 4\.1\.4 Variant AKS: Article \+ Keywords with Self\-Consistency Decoding
In AKS, keyword\-based classification is combined with self\-consistency decoding, using the same voting mechanism as in AOS \($N=5$ iterations, temperature $0.5$, vote threshold 2\)\. Note that a threshold of two out of five is a minimum\-vote rule rather than a strict majority\. The input includes the article, the topic list, and the keywords\.
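The voting rule shared by the self\-consistency variants can be sketched as follows\. This is a minimal illustration under stated assumptions: `sample_once` is a hypothetical callable standing in for one LLM call at temperature 0\.5, and a topic repeated within a single run is assumed to count as one vote\.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistent_topics(sample_once: Callable[[], Iterable[str]],
                           n_runs: int = 5,
                           vote_threshold: int = 2) -> List[str]:
    """Run the classifier n_runs times and keep topics that were predicted
    in at least vote_threshold of those runs."""
    votes: Counter = Counter()
    for _ in range(n_runs):
        # set() so that a topic repeated within one run counts only once.
        votes.update(set(sample_once()))
    return sorted(t for t, count in votes.items() if count >= vote_threshold)
```

With the defaults, each article costs five LLM calls instead of one, which matches the roughly fivefold cost increase reported for these variants\.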
Figure 1: Overall architecture of the proposed KG\-augmented zero\-shot multi\-label topic inference system\. Each article first passes through a four\-stage KG construction pipeline to produce a per\-article knowledge graph\. The graph, corresponding article, topic list and an optional keyword list are then fed into four classification variants: graph\-only classification, keyword\-enhanced classification, and self\-consistency decoding for both graph\-only classification and keyword\-enhanced classification\.
### 4\.2 Knowledge Graph Construction
Having established the base classification pipeline, we explore whether adding structured relational information from a per\-article knowledge graph improves classification results\. Specifically, for each input article, we create a directed graph of subject\-predicate\-object triples following a modified version of the KGGen pipeline \(Mo et al\., [2025](https://arxiv.org/html/2605.30465#bib.bib14)\)\. KGGen is a framework for constructing knowledge graphs from plain text; it uses LLMs to generate structured triples in a multi\-stage process\.
The construction process follows four stages applied independently to each article:
1. Entity Extraction: key entities \(nouns, verbs, or sentiment\-relevant adjectives\) are extracted from the article text\.
2. Relation Extraction: subject–predicate–object triples are extracted, grounded on the entity list from stage 1\.
3. Entity Clustering: a sentence encoder \(all\-MiniLM\-L6\-v2\) groups entities by cosine similarity \(threshold $0.75$\)\. Candidate clusters are validated by an LLM before merging, to ensure that they contain genuinely synonymous entities\.
4. Graph Assembly: validated clusters are merged and the resulting triples form the per\-article knowledge graph\.
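The clustering stage can be sketched roughly as below\. Several details are assumptions rather than the authors' implementation: the toy vectors stand in for all\-MiniLM\-L6\-v2 embeddings, the greedy first\-match linkage against each cluster's first member is one of several possible rules \(the paper does not specify the linkage\), and `validate` is a hypothetical stand\-in for the LLM check\.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_entities(entities: List[str],
                     vectors: List[Sequence[float]],
                     threshold: float = 0.75,
                     validate: Callable[[List[str]], bool] = lambda c: True
                     ) -> List[List[str]]:
    """Greedy clustering: an entity joins the first cluster whose first
    member is similar enough AND whose merged form passes `validate`."""
    clusters: List[List[int]] = []
    for i in range(len(entities)):
        for cluster in clusters:
            if cosine(vectors[i], vectors[cluster[0]]) >= threshold:
                candidate = [entities[j] for j in cluster] + [entities[i]]
                if validate(candidate):  # stand-in for the LLM validation
                    cluster.append(i)
                    break
        else:
            # No cluster accepted this entity, so it starts its own.
            clusters.append([i])
    return [[entities[j] for j in cluster] for cluster in clusters]
```

In a real pipeline the vectors would come from the sentence encoder, and a rejected candidate cluster simply leaves the entity unmerged\.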
In any given experiment, the LLM under evaluation also serves as the base model for entity and relation extraction\. The knowledge graph is created once for each LLM and stored in JSON format on disk, so that the same graph is used in every classification experiment for that LLM\.
### 4\.3 Graph\-Augmented Variants
Each base variant has a corresponding graph\-augmented variant that takes the serialized knowledge graph as an extra input\. The classification rule and self\-consistency procedure are unchanged\. This one\-to\-one mapping is deliberate: it isolates the effect of the graph on each model’s performance\. The four graph\-augmented variants are:
- AG \(Article \+ Graph\): AO augmented with the graph\.
- AKG \(Article \+ Keywords \+ Graph\): AK augmented with the graph\.
- AGS \(Article \+ Graph, Multi\-pass\): AOS augmented with the graph\.
- AKGS \(Article \+ Keywords \+ Graph, Multi\-pass\): AKS augmented with the graph\.
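To make the extra graph input concrete, the sketch below serializes a list of triples into a compact text block of entities and triples\. The exact textual layout \(an `Entities:` line followed by `subject | predicate | object` lines\) is an assumption made for illustration; the paper states only that the graph is serialized into a compact textual form\.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def serialize_graph(triples: Iterable[Triple]) -> str:
    """Serialize subject-predicate-object triples into a compact text block,
    dropping exact duplicates and listing entities in first-seen order."""
    seen = set()
    entities: List[str] = []
    lines: List[str] = []
    for subj, pred, obj in triples:
        if (subj, pred, obj) in seen:
            continue
        seen.add((subj, pred, obj))
        for entity in (subj, obj):
            if entity not in entities:
                entities.append(entity)
        lines.append(f"{subj} | {pred} | {obj}")
    return "Entities: " + ", ".join(entities) + "\nTriples:\n" + "\n".join(lines)
```

Under this sketch, a graph\-augmented prompt is the corresponding base prompt with this block appended, which mirrors the one\-to\-one mapping between base and graph\-augmented variants\.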
### 4\.4 Zero\-Shot Setting and Task Formulation
All methods operate in a strict zero\-shot setting\. No examples from the target evaluation dataset are used at any step of the pipeline\. The system must generalise purely based on the information available in the article, the topic names, the optional keyword list, and the knowledge encoded in the pretrained LLM or sentence encoder\.
## 5 Experimental Setup
This section describes the datasets, large language models, evaluation metrics, and implementation details used in our experiments\.
### 5\.1 Datasets
Our system was evaluated on eight multi\-label topic classification datasets, including five product review datasets \(Cellular phone, Digital camera 1, Digital camera 2, DVD player, and Mp3 player\) \(Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\), two large datasets \(Medical and News\) \(Sarkar et al\., [2023](https://arxiv.org/html/2605.30465#bib.bib1)\), and the English subset of the SemEval 2018 dataset \(Mohammad et al\., [2018](https://arxiv.org/html/2605.30465#bib.bib15)\)\. An overview of the datasets is provided in Table [1](https://arxiv.org/html/2605.30465#S5.T1)\.

| Dataset | Articles | Avg\. Article Length | Topics | Topics/Article |
| --- | --- | --- | --- | --- |
| Medical | 2066 | 693 | 18 | 1\.128 |
| News | 8940 | 589 | 12 | 0\.805 |
| Cellular phone | 587 | 16 | 23 | 1\.058 |
| Digital camera 1 | 642 | 18 | 24 | 1\.069 |
| Digital camera 2 | 380 | 17 | 20 | 1\.039 |
| DVD player | 839 | 15 | 23 | 0\.781 |
| Mp3 player | 1811 | 17 | 21 | 0\.956 |
| SemEval | 3259 | 16 | 11 | 2\.415 |

Table 1: Statistics of the datasets

In zero\-shot learning, the end user can provide auxiliary information about each topic\. All seven datasets from Sarkar et al\. \([2023](https://arxiv.org/html/2605.30465#bib.bib1)\) come with associated keyword files\. These keywords are the auxiliary information: words related to each topic of a specific dataset\.
The eighth dataset is SemEval\-2018 Task 1 \(Mohammad et al\., [2018](https://arxiv.org/html/2605.30465#bib.bib15)\), a multi\-label emotion classification benchmark\. No keyword file is associated with this dataset, so we generated per\-topic keywords ourselves using the concept\-annotation framework of Sarkar and Karmaker \([2022](https://arxiv.org/html/2605.30465#bib.bib16)\), following the same keyword format used in the other seven datasets\.
### 5\.2 Large Language Models
A significant experimental contribution of our study is a systematic comparison of fifteen large language models across the full end\-to\-end pipeline\. Within each experiment, the same model is used for both knowledge graph extraction and topic classification\. This end\-to\-end design means that each model’s classification performance reflects both the quality of the knowledge graph it extracts and the quality of its classification reasoning\.
The fifteen LLMs include LLaMA 3\.3\-70B Instruct, LLaMA 3\.1\-8B, Qwen 2\.5\-72B, Qwen 3\-32B, Qwen 2\.5\-7B, Gemma 3\-27B, and others\. Detailed performance comparisons are given in Tables [2](https://arxiv.org/html/2605.30465#S6.T2) and [3](https://arxiv.org/html/2605.30465#S6.T3)\.
### 5\.3 Evaluation Metrics
We evaluate with three widely used metrics: Precision, Recall, and F1\-score\. First, we compare each model’s inferred topics against the ground\-truth \("gold"\) topics to derive true positive, false positive, and false negative counts for every article\. Second, we use these per\-article statistics to produce the final scores for the entire dataset\. We report the micro\-averaged F1\-score, computed by aggregating the global counts of true positives, false positives, and false negatives across all instances\.
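The micro\-averaging described above can be sketched as follows\. This is a minimal illustration, not the authors' evaluation script; the function name `micro_prf` and the convention of returning 0\.0 when a denominator is zero are assumptions\.

```python
from typing import Iterable, List, Tuple


def micro_prf(predicted: List[Iterable[str]],
              gold: List[Iterable[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 from per-article label sets."""
    tp = fp = fn = 0
    for pred, true in zip(predicted, gold):
        p, g = set(pred), set(true)
        tp += len(p & g)   # predicted and correct
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

An empty prediction for an article adds only false negatives, one for each of that article's gold topics\.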
### 5.4 Implementation Details
We implemented all experiments in Python 3.11. The libraries we used include DSPy (a version compatible with Hugging Face and OpenAI backends) for LLM management and structured prompting, pandas for results management, and json for graph serialization and deserialization. Knowledge graphs are generated once per LLM and per dataset, and cached to disk as JSON files. This caching strategy ensures that the expensive KG construction step is performed only once and that all downstream classification experiments for the same LLM use identical graph inputs. When an experiment is resumed after interruption, cached graphs are loaded directly without regeneration. All API calls are made sequentially (one article at a time), with a try-except block around each call to handle transient API errors. A failed call results in an empty prediction for the affected article, so in evaluation every gold topic of that article counts as a false negative and none as a true positive. The self-consistency variants make N × (number of articles) API calls per experiment. Temperature is set to 0.3 for the single-pass (non-self-consistency) variants and 0.5 for self-consistency decoding.
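The caching and error-handling behaviour described above can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `load_or_build_graphs`, `classify_all`, and the callables `extract_graph` and `classify` are hypothetical names standing in for the DSPy-backed modules, and graphs are assumed to be lists of subject-predicate-object triples keyed by article ID.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Triple = List[str]  # [subject, predicate, object]


def load_or_build_graphs(
    cache_path: Path,
    articles: Dict[str, str],
    extract_graph: Callable[[str], List[Triple]],
) -> Dict[str, List[Triple]]:
    """Return cached graphs if the JSON file exists; otherwise build and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Resumed run: reuse the graphs so every experiment sees identical inputs.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    graphs = {article_id: extract_graph(text) for article_id, text in articles.items()}
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(graphs), encoding="utf-8")
    return graphs


def classify_all(
    articles: Dict[str, str],
    classify: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Classify articles one at a time; a failed call yields an empty prediction."""
    predictions: Dict[str, List[str]] = {}
    for article_id, text in articles.items():
        try:
            predictions[article_id] = list(classify(text))
        except Exception:
            # Transient API error: every gold topic becomes a false negative.
            predictions[article_id] = []
    return predictions
```

Keeping one cache file per (LLM, dataset) pair is one way to honour the "once per LLM, per dataset" rule; the file naming scheme is left to the caller here.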
## 6 Results and Analysis

Model columns run from large models (left) through medium-size models to smaller models (right).

| Dataset | Base. | Method | LLaMA 3.3-70B | Qwen 2.5-72B | Qwen 3-32B | Gemma 3-27B | GPT-4o | GPT-OSS 20B | Mixtral 8x7B | Gemma 2-9B | LLaMA 3.1-8B | Qwen 2.5-7B | Gemma 3n-E4B | LLaMA 3.2-3B | Qwen 2.5-3B | Ministral 3B | DS-R1 1.5B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medical | 0.594 | AO | 0.669 | 0.636 | 0.622 | 0.707 | 0.714 | 0.646 | 0.616 | 0.634 | 0.676 | 0.698 | 0.574 | 0.565 | 0.539 | 0.528 | 0.488 |
| | | AK | 0.702 | 0.699 | 0.677 | 0.759 | 0.766 | 0.671 | 0.657 | 0.671 | 0.519 | 0.671 | 0.585 | 0.570 | 0.557 | 0.539 | 0.497 |
| | | AOS | 0.678 | 0.637 | 0.628 | 0.704 | 0.719 | 0.624 | 0.612 | 0.631 | 0.663 | 0.702 | 0.551 | 0.550 | 0.542 | 0.526 | 0.484 |
| | | AKS | 0.699 | 0.685 | 0.661 | 0.739 | 0.736 | 0.631 | 0.614 | 0.626 | 0.436 | 0.568 | 0.544 | 0.537 | 0.525 | 0.511 | 0.472 |
| News | 0.512 | AO | 0.643 | 0.643 | 0.660 | 0.636 | 0.704 | 0.675 | 0.632 | 0.649 | 0.622 | 0.613 | 0.503 | 0.498 | 0.477 | 0.463 | 0.422 |
| | | AK | 0.702 | 0.713 | 0.716 | 0.687 | 0.749 | 0.693 | 0.667 | 0.685 | 0.660 | 0.646 | 0.505 | 0.507 | 0.487 | 0.475 | 0.433 |
| | | AOS | 0.638 | 0.641 | 0.665 | 0.635 | 0.703 | 0.657 | 0.626 | 0.644 | 0.623 | 0.608 | 0.499 | 0.493 | 0.469 | 0.460 | 0.421 |
| | | AKS | 0.674 | 0.689 | 0.696 | 0.670 | 0.719 | 0.660 | 0.632 | 0.641 | 0.619 | 0.605 | 0.481 | 0.484 | 0.459 | 0.443 | 0.406 |
| Cell. phone | 0.520 | AO | 0.669 | 0.647 | 0.693 | 0.664 | 0.648 | 0.683 | 0.595 | 0.614 | 0.587 | 0.633 | 0.529 | 0.380 | 0.483 | 0.470 | 0.415 |
| | | AK | 0.763 | 0.698 | 0.715 | 0.713 | 0.758 | 0.752 | 0.635 | 0.648 | 0.619 | 0.643 | 0.554 | 0.345 | 0.495 | 0.594 | 0.429 |
| | | AOS | 0.688 | 0.658 | 0.681 | 0.663 | 0.676 | 0.695 | 0.589 | 0.607 | 0.584 | 0.672 | 0.507 | 0.529 | 0.482 | 0.399 | 0.409 |
| | | AKS | 0.702 | 0.689 | 0.677 | 0.693 | 0.736 | 0.680 | 0.588 | 0.605 | 0.582 | 0.666 | 0.494 | 0.314 | 0.464 | 0.510 | 0.398 |
| Dig. cam. 1 | 0.500 | AO | 0.635 | 0.622 | 0.648 | 0.635 | 0.615 | 0.678 | 0.571 | 0.596 | 0.677 | 0.674 | 0.501 | 0.444 | 0.463 | 0.486 | 0.395 |
| | | AK | 0.742 | 0.714 | 0.705 | 0.687 | 0.750 | 0.724 | 0.607 | 0.635 | 0.696 | 0.712 | 0.527 | 0.411 | 0.478 | 0.514 | 0.401 |
| | | AOS | 0.639 | 0.643 | 0.652 | 0.635 | 0.631 | 0.678 | 0.565 | 0.593 | 0.649 | 0.644 | 0.482 | 0.477 | 0.458 | 0.372 | 0.388 |
| | | AKS | 0.672 | 0.701 | 0.655 | 0.663 | 0.716 | 0.577 | 0.564 | 0.596 | 0.650 | 0.665 | 0.478 | 0.269 | 0.447 | 0.415 | 0.379 |
| Dig. cam. 2 | 0.603 | AO | 0.625 | 0.638 | 0.642 | 0.663 | 0.623 | 0.652 | 0.627 | 0.584 | 0.581 | 0.613 | 0.552 | 0.458 | 0.548 | 0.417 | 0.374 |
| | | AK | 0.737 | 0.571 | 0.697 | 0.621 | 0.707 | 0.683 | 0.653 | 0.622 | 0.614 | 0.649 | 0.580 | 0.466 | 0.562 | 0.480 | 0.387 |
| | | AOS | 0.631 | 0.653 | 0.638 | 0.649 | 0.644 | 0.660 | 0.621 | 0.578 | 0.639 | 0.624 | 0.560 | 0.538 | 0.549 | 0.382 | 0.367 |
| | | AKS | 0.756 | 0.585 | 0.664 | 0.690 | 0.676 | 0.571 | 0.614 | 0.575 | 0.641 | 0.482 | 0.547 | 0.439 | 0.537 | 0.454 | 0.357 |
| DVD player | 0.501 | AO | 0.549 | 0.602 | 0.592 | 0.549 | 0.578 | 0.580 | 0.505 | 0.513 | 0.502 | 0.564 | 0.479 | 0.265 | 0.431 | 0.397 | 0.339 |
| | | AK | 0.653 | 0.692 | 0.667 | 0.600 | 0.699 | 0.655 | 0.534 | 0.550 | 0.605 | 0.570 | 0.496 | 0.323 | 0.446 | 0.540 | 0.352 |
| | | AOS | 0.535 | 0.599 | 0.605 | 0.548 | 0.574 | 0.610 | 0.498 | 0.510 | 0.534 | 0.538 | 0.472 | 0.288 | 0.427 | 0.351 | 0.340 |
| | | AKS | 0.586 | 0.684 | 0.699 | 0.574 | 0.698 | 0.595 | 0.501 | 0.512 | 0.605 | 0.532 | 0.465 | 0.237 | 0.414 | 0.406 | 0.330 |
| Mp3 player | 0.521 | AO | 0.616 | 0.577 | 0.559 | 0.613 | 0.672 | 0.656 | 0.526 | 0.536 | 0.495 | 0.589 | 0.493 | 0.360 | 0.445 | 0.507 | 0.378 |
| | | AK | 0.730 | 0.644 | 0.610 | 0.667 | 0.717 | 0.804 | 0.557 | 0.573 | 0.620 | 0.641 | 0.512 | 0.394 | 0.457 | 0.641 | 0.385 |
| | | AOS | 0.615 | 0.575 | 0.560 | 0.615 | 0.668 | 0.662 | 0.526 | 0.536 | 0.606 | 0.610 | 0.487 | 0.347 | 0.440 | 0.406 | 0.372 |
| | | AKS | 0.648 | 0.621 | 0.594 | 0.642 | 0.698 | 0.705 | 0.520 | 0.536 | 0.627 | 0.647 | 0.480 | 0.375 | 0.429 | 0.522 | 0.358 |
| SemEval | 0.550 | AO | 0.642 | 0.609 | 0.624 | 0.639 | 0.655 | 0.614 | 0.559 | 0.576 | 0.564 | 0.578 | 0.521 | 0.446 | 0.470 | 0.456 | 0.404 |
| | | AK | 0.644 | 0.626 | 0.635 | 0.638 | 0.653 | 0.617 | 0.589 | 0.613 | 0.573 | 0.591 | 0.545 | 0.518 | 0.484 | 0.476 | 0.412 |
| | | AOS | 0.630 | 0.604 | 0.629 | 0.639 | 0.650 | 0.606 | 0.553 | 0.574 | 0.568 | 0.553 | 0.518 | 0.474 | 0.466 | 0.475 | 0.399 |
| | | AKS | 0.638 | 0.606 | 0.621 | 0.635 | 0.647 | 0.594 | 0.549 | 0.571 | 0.536 | 0.555 | 0.515 | 0.399 | 0.452 | 0.439 | 0.388 |

Table 2: F1-scores for the four base classification methods (AO, AK, AOS, AKS) across model sizes. AO = Article Only; AK = Article + Keywords; AOS = Article Only + Self-Consistency; AKS = Article + Keywords + Self-Consistency. DS-R1 1.5B = DeepSeek-R1-Distill-Qwen-1.5B. Bold values exceed the baseline of Sarkar et al. ([2023](https://arxiv.org/html/2605.30465#bib.bib1)) (Base. column).

We first present the performance of the four base methods (Section [6.1](https://arxiv.org/html/2605.30465#S6.SS1)), then measure the effect of adding the knowledge graph, and finally analyse runtime and cost implications (Section [6.4](https://arxiv.org/html/2605.30465#S6.SS4)). Table [2](https://arxiv.org/html/2605.30465#S6.T2) reports F1-scores for the four base methods (AO, AK, AOS, AKS) across all fifteen LLMs and eight datasets. Table [3](https://arxiv.org/html/2605.30465#S6.T3) reports F1-scores for the four graph-augmented methods (AG, AKG, AGS, AKGS). Both tables include the sentence encoder baseline from Sarkar et al. ([2023](https://arxiv.org/html/2605.30465#bib.bib1)) for reference.
### 6.1 Base Classification Performance
##### Article Only (AO).
The Article Only method provides a strong competitive baseline for large and medium models. Among the large models, LLaMA 3.3-70B achieves strong AO scores (e.g., Medical 0.669, News 0.643). Qwen models are competitive: Qwen 2.5-72B outperforms LLaMA on DVD player (0.602 vs. 0.549), and Qwen 3-32B outperforms LLaMA on News (0.660) and Cellular phone (0.693). GPT-4o performs strongly, achieving AO scores of 0.714 on Medical, 0.704 on News, and 0.672 on Mp3 player. Among medium models, most exceed the baseline on most datasets. Small models ($\leq$3B) fail to outperform the baseline on almost all datasets.
##### Article \+ Keywords \(AK\)\.
Adding keywords results in the largest improvements of any method we test. For all large models, AK yields an average F1 improvement of 0.06–0.11 over their AO scores. For example, LLaMA 3.3-70B improves by an average of $+0.086$ F1 (Mp3 player: $0.616 \to 0.730$; Digital cam. 1: $0.635 \to 0.742$), Qwen 2.5-72B by $+0.067$ (DVD player: $0.602 \to 0.692$), and Qwen 3-32B by $+0.050$ (Cellular phone: $0.693 \to 0.715$). These improvements generalise across topic domains, suggesting that lexical guidance provides useful semantic insight that helps LLMs better align article content with the correct topic labels.
##### Article Only with Self\-Consistency \(AOS\)\.
Adding self-consistency to the article-only setup has a negligible or slightly negative effect on average. LLaMA 3.3-70B AOS scores lie within $\pm 0.010$ of the corresponding AO scores on six of the eight datasets, and fall noticeably below AO on DVD player ($0.535$ vs. $0.549$) and SemEval ($0.630$ vs. $0.642$). Qwen 2.5-72B and Qwen 3-32B show similar trends. These results suggest that running multiple AO predictions and applying majority voting does not provide additional useful information compared to a single prediction, and can in some cases weaken correct predictions.
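To make the voting step concrete, the sketch below aggregates several sampled multi-label predictions by per-label majority vote. The exact aggregation rule is not spelled out in this section, so the strict-majority threshold, the function name `majority_vote`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Set


def majority_vote(runs: List[Set[str]], threshold: float = 0.5) -> Set[str]:
    """Keep a label only if it appears in more than `threshold` of the runs."""
    counts = Counter(label for run in runs for label in run)
    return {label for label, count in counts.items() if count > threshold * len(runs)}


# Five sampled predictions for one article (hypothetical labels).
runs = [
    {"battery", "price"},
    {"battery"},
    {"battery", "screen"},
    {"battery", "price"},
    {"battery", "price"},
]
final_topics = majority_vote(runs)
print(sorted(final_topics))
```

Here "battery" appears in all five runs and "price" in three, so both survive, while "screen" appears once and is dropped. A correct label that is sampled in only a minority of runs is lost in the same way, which is one mechanism by which voting can weaken correct predictions.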
##### Article \+ Keywords with Self\-Consistency \(AKS\)\.
AKS generally falls behind AK. For LLaMA 3.3-70B on Digital cam. 1, the AKS score is $0.672$ compared to $0.742$ for AK; on Digital cam. 2 the scores are $0.756$ vs. $0.737$ (a rare exception). A similar pattern holds across Qwen and other models. We suspect that sampling introduces noise on each run, and that majority voting then selects a more conservative prediction than the keyword cue alone would produce. Nevertheless, AKS still outperforms AO by approximately $+0.04$ F1 on average.
### 6.2 Effect of Knowledge Graph: With vs. Without Comparison

Model columns run from large models (left) through medium-size models to smaller models (right).

| Dataset | Base. | Method | LLaMA 3.3-70B | Qwen 2.5-72B | Qwen 3-32B | Gemma 3-27B | GPT-4o | GPT-OSS 20B | Mixtral 8x7B | Gemma 2-9B | LLaMA 3.1-8B | Qwen 2.5-7B | Gemma 3n-E4B | LLaMA 3.2-3B | Qwen 2.5-3B | Ministral 3B | DS-R1 1.5B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medical | 0.594 | AG | 0.704 | 0.628 | 0.618 | 0.701 | 0.708 | 0.644 | 0.614 | 0.631 | 0.657 | 0.688 | 0.576 | 0.565 | 0.541 | 0.529 | 0.487 |
| | | AKG | 0.652 | 0.702 | 0.679 | 0.759 | 0.766 | 0.676 | 0.660 | 0.673 | 0.496 | 0.714 | 0.592 | 0.572 | 0.559 | 0.544 | 0.503 |
| | | AGS | 0.684 | 0.625 | 0.620 | 0.697 | 0.712 | 0.621 | 0.610 | 0.628 | 0.633 | 0.697 | 0.549 | 0.549 | 0.538 | 0.526 | 0.483 |
| | | AKGS | 0.599 | 0.685 | 0.663 | 0.739 | 0.738 | 0.651 | 0.638 | 0.649 | 0.424 | 0.703 | 0.567 | 0.561 | 0.548 | 0.535 | 0.495 |
| News | 0.512 | AG | 0.635 | 0.637 | 0.654 | 0.631 | 0.698 | 0.672 | 0.630 | 0.648 | 0.623 | 0.611 | 0.504 | 0.498 | 0.473 | 0.461 | 0.421 |
| | | AKG | 0.701 | 0.711 | 0.715 | 0.689 | 0.752 | 0.698 | 0.672 | 0.690 | 0.661 | 0.647 | 0.510 | 0.510 | 0.491 | 0.476 | 0.437 |
| | | AGS | 0.630 | 0.634 | 0.656 | 0.627 | 0.693 | 0.654 | 0.626 | 0.645 | 0.619 | 0.607 | 0.500 | 0.495 | 0.470 | 0.458 | 0.418 |
| | | AKGS | 0.677 | 0.694 | 0.699 | 0.669 | 0.721 | 0.679 | 0.651 | 0.666 | 0.639 | 0.626 | 0.502 | 0.507 | 0.480 | 0.467 | 0.428 |
| Cellular phone | 0.520 | AG | 0.662 | 0.637 | 0.684 | 0.658 | 0.641 | 0.684 | 0.592 | 0.611 | 0.587 | 0.631 | 0.528 | 0.381 | 0.481 | 0.469 | 0.413 |
| | | AKG | 0.763 | 0.701 | 0.716 | 0.716 | 0.756 | 0.753 | 0.637 | 0.653 | 0.625 | 0.644 | 0.555 | 0.351 | 0.499 | 0.600 | 0.431 |
| | | AGS | 0.677 | 0.651 | 0.672 | 0.654 | 0.668 | 0.692 | 0.588 | 0.608 | 0.583 | 0.670 | 0.508 | 0.529 | 0.478 | 0.397 | 0.408 |
| | | AKGS | 0.704 | 0.694 | 0.681 | 0.696 | 0.738 | 0.703 | 0.611 | 0.629 | 0.603 | 0.689 | 0.513 | 0.333 | 0.488 | 0.533 | 0.419 |
| Digital cam. 1 | 0.500 | AG | 0.630 | 0.617 | 0.638 | 0.627 | 0.608 | 0.675 | 0.568 | 0.597 | 0.679 | 0.674 | 0.501 | 0.443 | 0.461 | 0.484 | 0.391 |
| | | AKG | 0.745 | 0.715 | 0.706 | 0.685 | 0.752 | 0.729 | 0.611 | 0.639 | 0.701 | 0.717 | 0.528 | 0.415 | 0.479 | 0.517 | 0.407 |
| | | AGS | 0.627 | 0.632 | 0.640 | 0.623 | 0.620 | 0.679 | 0.564 | 0.594 | 0.649 | 0.646 | 0.482 | 0.473 | 0.458 | 0.369 | 0.387 |
| | | AKGS | 0.675 | 0.701 | 0.654 | 0.665 | 0.717 | 0.601 | 0.587 | 0.615 | 0.671 | 0.686 | 0.498 | 0.293 | 0.468 | 0.437 | 0.398 |
| Digital cam. 2 | 0.603 | AG | 0.624 | 0.623 | 0.637 | 0.653 | 0.618 | 0.653 | 0.626 | 0.582 | 0.582 | 0.555 | 0.553 | 0.455 | 0.549 | 0.416 | 0.372 |
| | | AKG | 0.725 | 0.660 | 0.698 | 0.621 | 0.707 | 0.684 | 0.658 | 0.624 | 0.619 | 0.615 | 0.584 | 0.469 | 0.567 | 0.485 | 0.389 |
| | | AGS | 0.627 | 0.610 | 0.632 | 0.640 | 0.632 | 0.656 | 0.623 | 0.579 | 0.639 | 0.639 | 0.556 | 0.535 | 0.546 | 0.382 | 0.368 |
| | | AKGS | 0.650 | 0.641 | 0.664 | 0.691 | 0.678 | 0.590 | 0.635 | 0.600 | 0.661 | 0.654 | 0.570 | 0.460 | 0.556 | 0.478 | 0.381 |
| DVD player | 0.501 | AG | 0.544 | 0.594 | 0.586 | 0.541 | 0.569 | 0.581 | 0.503 | 0.514 | 0.501 | 0.561 | 0.476 | 0.267 | 0.431 | 0.398 | 0.341 |
| | | AKG | 0.652 | 0.692 | 0.669 | 0.599 | 0.699 | 0.659 | 0.541 | 0.556 | 0.606 | 0.571 | 0.499 | 0.325 | 0.449 | 0.546 | 0.358 |
| | | AGS | 0.527 | 0.588 | 0.594 | 0.537 | 0.563 | 0.611 | 0.499 | 0.511 | 0.531 | 0.535 | 0.472 | 0.286 | 0.428 | 0.347 | 0.337 |
| | | AKGS | 0.586 | 0.685 | 0.699 | 0.579 | 0.699 | 0.618 | 0.520 | 0.532 | 0.624 | 0.552 | 0.488 | 0.257 | 0.438 | 0.425 | 0.349 |
| Mp3 player | 0.521 | AG | 0.610 | 0.568 | 0.552 | 0.607 | 0.663 | 0.657 | 0.528 | 0.537 | 0.495 | 0.586 | 0.489 | 0.361 | 0.443 | 0.507 | 0.374 |
| | | AKG | 0.730 | 0.642 | 0.613 | 0.665 | 0.719 | 0.805 | 0.562 | 0.579 | 0.624 | 0.643 | 0.517 | 0.397 | 0.461 | 0.645 | 0.388 |
| | | AGS | 0.607 | 0.565 | 0.554 | 0.603 | 0.659 | 0.660 | 0.524 | 0.534 | 0.606 | 0.612 | 0.485 | 0.347 | 0.440 | 0.403 | 0.370 |
| | | AKGS | 0.652 | 0.625 | 0.597 | 0.645 | 0.702 | 0.729 | 0.542 | 0.555 | 0.648 | 0.666 | 0.501 | 0.398 | 0.450 | 0.547 | 0.381 |
| SemEval | 0.550 | AG | 0.635 | 0.601 | 0.616 | 0.632 | 0.648 | 0.615 | 0.558 | 0.573 | 0.561 | 0.579 | 0.523 | 0.445 | 0.468 | 0.456 | 0.401 |
| | | AKG | 0.643 | 0.625 | 0.637 | 0.640 | 0.656 | 0.622 | 0.593 | 0.615 | 0.579 | 0.595 | 0.548 | 0.523 | 0.486 | 0.480 | 0.417 |
| | | AGS | 0.620 | 0.598 | 0.618 | 0.628 | 0.643 | 0.605 | 0.554 | 0.570 | 0.567 | 0.555 | 0.519 | 0.471 | 0.465 | 0.472 | 0.397 |
| | | AKGS | 0.637 | 0.608 | 0.621 | 0.635 | 0.649 | 0.614 | 0.571 | 0.591 | 0.557 | 0.574 | 0.535 | 0.421 | 0.475 | 0.462 | 0.409 |

Table 3: F1-scores for the four graph-augmented classification methods (AG, AKG, AGS, AKGS) across model sizes. AG = Article + Graph; AKG = Article + Keywords + Graph; AGS = AG + Self-Consistency; AKGS = AKG + Self-Consistency. DS-R1 1.5B = DeepSeek-R1-Distill-Qwen-1.5B. Bold values exceed the baseline of Sarkar et al. ([2023](https://arxiv.org/html/2605.30465#bib.bib1)) (Base. column).

To determine the impact of the knowledge graph, we also assessed each classification variant in a graph-free setting. Here, the LLM is fed only the article text and the topic list (optionally with keywords). No graph information is sent to the LLM. Table [2](https://arxiv.org/html/2605.30465#S6.T2) presents the F1-scores of the no-graph methods for each model and dataset.
##### Size Dependency of Model Performance\.
The impact of the knowledge graph varies considerably with model size. For larger models, including the knowledge graph leads to a marginal average reduction in performance ($\Delta = -0.0121$), while for smaller models there is a consistent average improvement ($\Delta = +0.0145$). Larger models appear to possess enough implicit relational structure from pretraining that the knowledge graph contributes very little, and the extra input may instead act as noise that interferes with their predictions. Smaller models, on the other hand, seem to benefit from the added knowledge graph.
##### Effect of Dataset Size\.
In both the larger article-based datasets (Medical, News) and the smaller product review datasets, the graph yields an average change that is close to zero and slightly negative ($\Delta = -0.0025$ and $\Delta = -0.0017$, respectively). This suggests that the effect of the knowledge graph does not depend on dataset size or document length. Short and long documents show a similar change in performance when the graph is included.
##### Method Sensitivity\.
The greatest divergence occurs between classification methods. The graph slightly lowers the performance of AG and AGS ($\Delta = -0.0087$ and $\Delta = -0.0102$, respectively). AKG gains a slight benefit ($\Delta = +0.0040$), and AKGS obtains the largest gain ($\Delta = +0.0241$). We therefore conclude that the graph contributes to performance only when combined with lexical information: without keywords, the graph does not add enough to the prediction and may even confuse the model, whereas with lexical cues it adds useful relational context.
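The with-versus-without comparison can be reproduced for any slice of the tables. The sketch below assumes that $\Delta$ denotes the graph-augmented F1 minus the matching base F1, averaged over the cells being compared; the helper name `mean_delta` is hypothetical. It uses only the DS-R1 1.5B column of Tables 2 and 3, so its values describe that single model and are not expected to equal the paper-wide averages quoted above.

```python
from statistics import mean
from typing import Sequence


def mean_delta(with_graph: Sequence[float], without_graph: Sequence[float]) -> float:
    """Average of (graph-augmented F1 - base F1) over paired cells."""
    if len(with_graph) != len(without_graph):
        raise ValueError("score lists must have the same length")
    return mean(g - b for g, b in zip(with_graph, without_graph))


# DS-R1 1.5B column, datasets in table order (Medical ... SemEval).
ao = [0.488, 0.422, 0.415, 0.395, 0.374, 0.339, 0.378, 0.404]    # Table 2
ag = [0.487, 0.421, 0.413, 0.391, 0.372, 0.341, 0.374, 0.401]    # Table 3
aks = [0.472, 0.406, 0.398, 0.379, 0.357, 0.330, 0.358, 0.388]   # Table 2
akgs = [0.495, 0.428, 0.419, 0.398, 0.381, 0.349, 0.381, 0.409]  # Table 3

print(f"AG   vs AO : {mean_delta(ag, ao):+.4f}")
print(f"AKGS vs AKS: {mean_delta(akgs, aks):+.4f}")
```

For this one small model the graph changes AO by roughly $-0.002$ and AKS by roughly $+0.02$, the same direction as the method-level trend described above.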
Model columns run from large models (left) through medium-size models to smaller models (right).

| Dataset | Method | LLaMA 3.3-70B | Qwen 2.5-72B | Qwen 3-32B | Gemma 3-27B | GPT-4o | GPT-OSS 20B | Mixtral 8x7B | Gemma 2-9B | LLaMA 3.1-8B | Qwen 2.5-7B | Gemma 3n-E4B | LLaMA 3.2-3B | Qwen 2.5-3B | Ministral 3B | DS-R1 1.5B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medical | AG | 14.19 | 13.42 | 6.33 | 5.54 | 12.87 | 3.17 | 2.23 | 1.99 | 1.71 | 2.65 | 0.81 | 0.77 | 0.85 | 0.69 | 0.51 |
| | AKG | 15.63 | 14.76 | 6.92 | 6.09 | 14.18 | 3.45 | 2.41 | 2.19 | 1.93 | 2.91 | 0.89 | 0.85 | 0.93 | 0.72 | 0.56 |
| | AGS | 70.97 | 67.03 | 31.55 | 27.61 | 64.35 | 15.79 | 11.06 | 9.87 | 8.65 | 13.26 | 3.96 | 3.92 | 4.16 | 3.37 | 2.53 |
| | AKGS | 78.03 | 73.73 | 34.71 | 30.37 | 70.79 | 17.32 | 12.16 | 10.82 | 9.52 | 14.58 | 4.36 | 4.32 | 4.57 | 3.71 | 2.79 |
| News | AG | 14.21 | 13.38 | 6.29 | 5.50 | 12.91 | 3.13 | 2.19 | 1.95 | 1.75 | 2.61 | 0.77 | 0.81 | 0.81 | 0.65 | 0.49 |
| | AKG | 15.59 | 14.72 | 6.96 | 6.05 | 14.22 | 3.49 | 2.45 | 2.15 | 1.89 | 2.95 | 0.85 | 0.89 | 0.89 | 0.76 | 0.54 |
| | AGS | 70.93 | 67.05 | 31.51 | 27.57 | 64.41 | 15.75 | 11.02 | 9.83 | 8.69 | 13.18 | 3.92 | 3.96 | 4.12 | 3.33 | 2.49 |
| | AKGS | 78.07 | 73.69 | 34.67 | 30.33 | 70.83 | 17.36 | 12.12 | 10.86 | 9.56 | 14.54 | 4.32 | 4.36 | 4.53 | 3.67 | 2.75 |
| Cellular phone | AG | 1.44 | 1.32 | 0.65 | 0.57 | 1.31 | 0.34 | 0.24 | 0.18 | 0.19 | 0.27 | 0.10 | 0.06 | 0.10 | 0.05 | 0.04 |
| | AKG | 1.58 | 1.45 | 0.67 | 0.63 | 1.44 | 0.33 | 0.22 | 0.24 | 0.17 | 0.29 | 0.11 | 0.07 | 0.11 | 0.09 | 0.05 |
| | AGS | 7.08 | 6.72 | 3.13 | 2.78 | 6.54 | 1.56 | 1.12 | 0.97 | 0.89 | 1.32 | 0.41 | 0.37 | 0.43 | 0.32 | 0.21 |
| | AKGS | 7.82 | 7.35 | 3.49 | 3.06 | 7.17 | 1.75 | 1.19 | 1.10 | 0.93 | 1.45 | 0.45 | 0.41 | 0.48 | 0.39 | 0.23 |
| Digital cam. 1 | AG | 1.40 | 1.36 | 0.61 | 0.53 | 1.29 | 0.30 | 0.20 | 0.22 | 0.15 | 0.25 | 0.06 | 0.10 | 0.06 | 0.09 | 0.04 |
| | AKG | 1.54 | 1.49 | 0.71 | 0.59 | 1.42 | 0.37 | 0.26 | 0.20 | 0.21 | 0.31 | 0.07 | 0.11 | 0.07 | 0.05 | 0.05 |
| | AGS | 7.12 | 6.68 | 3.17 | 2.74 | 6.58 | 1.60 | 1.08 | 1.01 | 0.85 | 1.28 | 0.37 | 0.41 | 0.39 | 0.36 | 0.22 |
| | AKGS | 7.78 | 7.39 | 3.45 | 3.02 | 7.21 | 1.71 | 1.23 | 1.06 | 0.97 | 1.41 | 0.41 | 0.45 | 0.44 | 0.35 | 0.24 |
| Digital cam. 2 | AG | 1.43 | 1.35 | 0.62 | 0.56 | 1.33 | 0.32 | 0.23 | 0.20 | 0.17 | 0.26 | 0.09 | 0.08 | 0.08 | 0.07 | 0.04 |
| | AKG | 1.57 | 1.46 | 0.70 | 0.62 | 1.46 | 0.36 | 0.25 | 0.21 | 0.20 | 0.28 | 0.10 | 0.09 | 0.11 | 0.07 | 0.05 |
| | AGS | 7.11 | 6.69 | 3.16 | 2.77 | 6.56 | 1.59 | 1.09 | 1.00 | 0.87 | 1.31 | 0.40 | 0.38 | 0.42 | 0.33 | 0.22 |
| | AKGS | 7.81 | 7.36 | 3.48 | 3.05 | 7.19 | 1.74 | 1.20 | 1.09 | 0.94 | 1.43 | 0.44 | 0.42 | 0.47 | 0.38 | 0.24 |
| DVD player | AG | 1.41 | 1.33 | 0.64 | 0.54 | 1.30 | 0.33 | 0.21 | 0.19 | 0.16 | 0.24 | 0.08 | 0.09 | 0.09 | 0.06 | 0.03 |
| | AKG | 1.55 | 1.48 | 0.68 | 0.60 | 1.43 | 0.35 | 0.23 | 0.23 | 0.18 | 0.30 | 0.09 | 0.10 | 0.08 | 0.08 | 0.05 |
| | AGS | 7.09 | 6.71 | 3.14 | 2.75 | 6.52 | 1.57 | 1.11 | 0.98 | 0.88 | 1.29 | 0.38 | 0.40 | 0.40 | 0.35 | 0.21 |
| | AKGS | 7.79 | 7.38 | 3.46 | 3.03 | 7.15 | 1.72 | 1.22 | 1.07 | 0.96 | 1.42 | 0.42 | 0.44 | 0.45 | 0.36 | 0.23 |
| Mp3 player | AG | 1.45 | 1.34 | 0.63 | 0.55 | 1.34 | 0.31 | 0.22 | 0.21 | 0.18 | 0.28 | 0.08 | 0.08 | 0.08 | 0.07 | 0.04 |
| | AKG | 1.59 | 1.47 | 0.69 | 0.61 | 1.47 | 0.35 | 0.24 | 0.17 | 0.19 | 0.27 | 0.09 | 0.09 | 0.09 | 0.07 | 0.05 |
| | AGS | 7.13 | 6.70 | 3.15 | 2.76 | 6.57 | 1.58 | 1.10 | 0.99 | 0.87 | 1.30 | 0.39 | 0.39 | 0.41 | 0.34 | 0.22 |
| | AKGS | 7.83 | 7.37 | 3.47 | 3.04 | 7.20 | 1.73 | 1.21 | 1.08 | 0.95 | 1.44 | 0.43 | 0.43 | 0.46 | 0.37 | 0.24 |
| SemEval | AG | 1.42 | 1.36 | 0.62 | 0.56 | 1.32 | 0.32 | 0.22 | 0.20 | 0.16 | 0.26 | 0.08 | 0.08 | 0.07 | 0.07 | 0.04 |
| | AKG | 1.56 | 1.49 | 0.70 | 0.60 | 1.45 | 0.36 | 0.26 | 0.22 | 0.20 | 0.32 | 0.10 | 0.09 | 0.10 | 0.08 | 0.05 |
| | AGS | 7.10 | 6.72 | 3.16 | 2.77 | 6.55 | 1.60 | 1.12 | 1.01 | 0.86 | 1.33 | 0.40 | 0.37 | 0.43 | 0.32 | 0.21 |
| | AKGS | 7.80 | 7.39 | 3.48 | 3.05 | 7.18 | 1.75 | 1.23 | 1.09 | 0.94 | 1.46 | 0.44 | 0.41 | 0.48 | 0.36 | 0.23 |

Table 4: Total runtime (seconds) per article across methods and model sizes. AG = Article + Graph; AKG = Article + Keywords + Graph; AGS = AG + Self-Consistency; AKGS = AKG + Self-Consistency. DS-R1 1.5B = DeepSeek-R1-Distill-Qwen-1.5B. Bold values indicate the self-consistency variants (AGS and AKGS), which require approximately five times more computation than their single-pass counterparts (AG and AKG) due to multiple runs.
### 6.3 Summary of Findings
1. Keyword enhancement (AK) is the best single method, dominating all others by a considerable margin, with average F1 improvements of $0.06$–$0.11$ over article-only classification (AO).
2. Self-consistency decoding does not improve performance in any setting. AOS $\approx$ AO and AKS $<$ AK across nearly all datasets and models, indicating a trade-off between sampling diversity and lexical precision.
3. Six of the fifteen LLMs consistently outperform the baseline across all datasets and all four base classification variants: LLaMA 3.3-70B, Qwen 2.5-72B, Qwen 3-32B, Gemma 3-27B, GPT-4o, and Mixtral 8x7B. Medium models are competitive on most datasets. Small ($\leq$3B) models generally fail to match the baseline.
4. Knowledge graph augmentation has mixed effects: it consistently helps smaller models (avg. $+0.015$ F1) and slightly hurts large models (avg. $-0.012$ F1). Large models appear to already capture sufficient relational structure from pretraining.
5. The graph is most useful in combination with lexical information. AKGS gains the largest benefit from graph augmentation ($+0.024$ F1 on average), while AG and AGS experience minor degradation.
### 6.4 Runtime and Cost Analysis
Table [4](https://arxiv.org/html/2605.30465#S6.T4) presents total runtime (in seconds) per article across methods and model sizes.
Self-consistency variants (AOS, AKS, AGS, AKGS) require approximately five times more computation than their single-pass counterparts due to $N=5$ inference runs. As discussed above, they offer no improvement in performance. In contrast, keyword-enhanced classification (AK) achieves the best balance between performance and computational cost, with only a marginal increase in runtime compared to article-only classification. AK is therefore the most practical choice for real-world deployment.
Runtime is also affected by document length and model size. The Medical and News datasets contain significantly longer documents, so all methods take more time on these datasets. Large models such as GPT-4o and LLaMA 3.3-70B require more inference time than smaller models due to higher computational complexity.
For API-based models (e.g., GPT-4o), the monetary cost of self-consistency variants is five times that of single-pass methods, a cost not justified given that AKS consistently underperforms AK. For open-source models deployed locally (LLaMA, Qwen, Gemma), cost is measured in runtime rather than monetary terms, but the fivefold increase applies equally.
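The fivefold overhead can be checked directly against the reported runtimes. The sketch below uses the LLaMA 3.3-70B row for the Medical dataset from Table 4 and the Medical article count from Table 1; the helper names `overhead` and `total_hours` are hypothetical, and the projected wall-clock time assumes strictly sequential processing as described in Section 5.4.

```python
# Per-article runtime in seconds, LLaMA 3.3-70B on Medical (Table 4).
RUNTIME_S = {"AG": 14.19, "AKG": 15.63, "AGS": 70.97, "AKGS": 78.03}
N_ARTICLES_MEDICAL = 2066  # Table 1


def overhead(self_consistency: str, single_pass: str) -> float:
    """Runtime ratio of a self-consistency variant to its single-pass counterpart."""
    return RUNTIME_S[self_consistency] / RUNTIME_S[single_pass]


def total_hours(method: str, n_articles: int) -> float:
    """Projected sequential wall-clock time for a whole dataset, in hours."""
    return RUNTIME_S[method] * n_articles / 3600


for sc, sp in [("AGS", "AG"), ("AKGS", "AKG")]:
    print(f"{sc}/{sp} ratio: {overhead(sc, sp):.2f}")
    print(f"  {sp}: {total_hours(sp, N_ARTICLES_MEDICAL):.1f} h, "
          f"{sc}: {total_hours(sc, N_ARTICLES_MEDICAL):.1f} h")
```

Both ratios come out close to 5, in line with $N=5$ sampled runs per article. On this configuration a single-pass run over the Medical dataset takes on the order of eight hours, while the self-consistency run takes on the order of forty.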
## 7 Conclusion
We introduced a zero\-shot multi\-label topic classification framework evaluated across four variants, fifteen LLMs, and eight benchmark datasets\. We then conducted a systematic ablation of the effect of per\-article graph augmentation on each variant\. The base framework, without any graph, already demonstrates strong performance: six of the fifteen LLMs consistently outperform the sentence\-encoder baseline, and keyword\-enhanced classification \(AK\) achieves the best results among all methods tested\.
Adding the knowledge graph improves performance for small models (avg. +0.015 F1) but does not help larger models, likely because they already capture enough relational information from pretraining. The greatest improvement from graph augmentation is achieved by AKGS (+0.024 F1). Self-consistency decoding degrades performance in most setups while increasing cost roughly fivefold. Model size and keyword guidance are the major factors that influence classification performance, with graph augmentation providing a complementary improvement for smaller models.
## 8 Limitations
Our approach has a number of practical limitations. First, self-consistency methods need five inference runs per article, which makes them much costlier than single-pass methods and potentially unaffordable when models are accessed through paid APIs. Second, the keyword-enriched methods were significantly better than the graph-only methods, which indicates that performance still depends to a certain extent on the availability of keywords. Third, we use the same model for both graph generation and classification, so the quality of the graph is tied to that model's reasoning ability and the two effects cannot be separated. Lastly, we test our framework only on English datasets; generalization to other languages has not been explored.
## References
- A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications 6(1).
- G. Angeli, M. J. J. Premkumar, and C. D. Manning (2015) Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 344–354.
- D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- Y. Chae and T. Davidson (2025) Large language models for text classification: from zero-shot learning to instruction-tuning. Sociological Methods & Research.
- U. Chauhan and A. Shah (2021) Topic modeling using latent Dirichlet allocation: a survey. ACM Computing Surveys 54(7), pp. 1–35.
- J. Chen, Z. Gong, and W. Liu (2021) Multi-label zero-shot text classification by exploiting label semantics. In ACL Findings.
- Q. Chen, W. Wang, K. Huang, and F. Coenen (2022) Zero-shot text classification via knowledge graph embedding for social media data. IEEE Internet of Things Journal 9(12), pp. 9205–9213.
- J. Du, J. Jiang, D. Song, and L. Liao (2015) Topic modeling with document relative similarities. In IJCAI.
- C. Engels, K. Deschacht, and M. Moens (2010) Automatic categorization of videos using a latent topic model. Multimedia Tools and Applications 49(1), pp. 77–90.
- H. Gong and H. Eldardiry (2021) Prompt-based zero-shot text classification. In NAACL.
- N. Hassan, A. Poudel, J. Hale, C. Hubacek, K. T. Huq, S. K. K. Santu, and S. I. Ahmed (2020) Towards automated sexual violence report tracking. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 250–259.
- S. Hingmire and S. Chakraborti (2014) Topic labeled text classification: a weakly supervised approach. In SIGIR.
- T. Hofmann (1999) Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57.
- J. Huang, X. Zhang, Q. Mei, and J. Ma (2024) Can LLMs effectively leverage graph structural information through prompts, and why? arXiv:2309.16595, [Link](https://arxiv.org/abs/2309.16595).
- T. Iwata, T. Yamada, and N. Ueda (2009) Modeling social annotation data with content relevance using a topic model. In NeurIPS.
- J. Jang, S. Ye, and M. Seo (2023) Can large language models truly understand prompts? A case study with negated prompts. In Transfer Learning for Natural Language Processing Workshop, pp. 52–62.
- S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu (2022) A survey on knowledge graphs: representation, acquisition, and applications. IEEE TNNLS 33(2), pp. 494–514.
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, et al. (2023) DSPy: compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. In NeurIPS.
- C. Li, J. Xing, A. Sun, and Z. Ma (2016) Effective document labeling with very few seed words: a topic model approach. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 85–94.
- Y. Liu, K. Zhang, Z. Huang, K. Wang, Y. Zhang, Q. Liu, and E. Chen (2023) Enhancing hierarchical text classification through knowledge graph integration. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 5797–5810.
- Y. Meng, J. Shen, C. Zhang, and J. Han (2018) Weakly-supervised neural text classification. In CIKM.
- B. Mo, K. Yu, J. Kazdan, J. Cabezas, P. Mpala, L. Yu, C. Cundy, C. Kanatsoulis, and S. Koyejo (2025) KGGen: extracting knowledge graphs from plain text with language models. arXiv preprint arXiv:2502.09956.
- S. Mohammad, F. Bravo-Marquez, M. Salameh, and S. Kiritchenko (2018) SemEval-2018 Task 1: affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 1–17.
- F. Poursabzi-Sangdeh and J. Boyd-Graber (2015) Speeding document annotation with topic models. In NAACL.
- R. Puri and B. Catanzaro (2019) Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165.
- P. K. Pushp and M. M. Srivastava (2017) Train once, test anywhere: zero-shot learning for text classification. arXiv preprint arXiv:1712.05972.
- B. Qiao, Z. Zou, Y. Huang, K. Fang, X. Zhu, and Y. Chen (2022) A joint model for entity and relation extraction based on BERT. Neural Computing and Applications 34(5), pp. 3471–3481.
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 3982–3992.
- L. Reynolds and K. McDonell (2021) Prompt programming for large language models: beyond the few-shot paradigm. In CHI Extended Abstracts.
- A. Rios and R. Kavuluru (2018) Few-shot and zero-shot multi-label learning for structured label spaces. In EMNLP.
- P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha (2024) A systematic survey of prompt engineering in large language models: techniques and applications. arXiv:2402.07927.
- S. K. K. Santu, S. Syed, and J. Foulds (2016) Generalized topic modeling. JMLR 17(1), pp. 1–39.
- S. Sarkar, D. Feng, and S. K. K. Santu (2023) Zero-shot multi-label topic inference with sentence encoders & LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 16218–16233.
- S. Sarkar and S. K. Karmaker (2022) Concept annotation from users perspective: a new challenge. In Companion Proceedings of the Web Conference 2022, pp. 1180–1188.
- N. Shanavas, H. Wang, Z. Lin, and G. Hawe (2021) Knowledge-driven graph similarity for text classification. International Journal of Machine Learning and Cybernetics 12(4), pp. 1067–1081.
- Y. Shi, H. Ma, W. Zhong, G. Mai, X. Li, T. Liu, and J. Huang (2023) ChatGraph: interpretable text classification by converting ChatGPT knowledge to graphs. In ICDMW, pp. 515–520.
- F. M. Suchanek, G. Kasneci, and G. Weikum (2007) YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pp. 697–706.
- S. Tuarob, C. S. Tucker, M. Salathe, and N. Ram (2015) An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages. Journal of Biomedical Informatics 55, pp. 73–89.
- J. Van Nooten, A. Kosar, G. De Pauw, and W. Daelemans (2026) One size does not fit all: exploring variable thresholds for distance-based multi-label text classification. IEEE Transactions on Knowledge and Data Engineering.
- N. Vandemoortele, B. Steenwinckel, F. Ongenae, and S. V. Hoecke (2025) From haystack to needle: label space reduction for zero-shot classification. arXiv preprint arXiv:2502.08436.
- S. P. Veeranna, J. Nam, E. L. Mencía, and J. Fürnkranz (2016) Using semantic similarity for multi-label zero-shot classification of text documents. In European Symposium on Artificial Neural Networks (ESANN), pp. 423–428.
- C\. Wang, J\. Paisley, and D\. Blei \(2011\)Online variational inference for the hierarchical Dirichlet process\.InAISTATS,Cited by:[§2\.1](https://arxiv.org/html/2605.30465#S2.SS1.SSS0.Px1.p1.1)\.
- J\. Wang, Z\. Wang, D\. Zhang, and J\. Yan \(2017\)Combining knowledge with deep convolutional neural networks for short text classification\.InProceedings of the Twenty\-Sixth International Joint Conference on Artificial Intelligence \(IJCAI\-17\),pp\. 2915–2921\.External Links:[Document](https://dx.doi.org/10.24963/ijcai.2017/406),[Link](https://doi.org/10.24963/ijcai.2017/406)Cited by:[§2\.4](https://arxiv.org/html/2605.30465#S2.SS4.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InInternational Conference on Learning Representations,Cited by:[§2\.5](https://arxiv.org/html/2605.30465#S2.SS5.p1.1),[§4\.1\.3](https://arxiv.org/html/2605.30465#S4.SS1.SSS3.p1.2)\.
- J\. White, Q\. Fu, S\. Hays, M\. Sandborn, C\. Olea, H\. Gilbert, A\. Elnashar, J\. Spencer\-Smith, and D\. C\. Schmidt \(2023\)A prompt pattern catalog to enhance prompt engineering with ChatGPT\.arXiv preprint arXiv:2302\.11382\.Cited by:[§2\.2](https://arxiv.org/html/2605.30465#S2.SS2.p1.1)\.
- C\. Xia, C\. Zhang, X\. Yan, Y\. Chang, and P\. S\. Yu \(2018\)Zero\-shot user intent detection via capsule neural networks\.InEMNLP,Cited by:[§2\.1](https://arxiv.org/html/2605.30465#S2.SS1.SSS0.Px3.p1.1)\.
- W\. Yin, J\. Hay, and D\. Roth \(2019\)Benchmarking zero\-shot text classification: datasets, evaluation and entailment approach\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,pp\. 3914–3923\.Cited by:[§1](https://arxiv.org/html/2605.30465#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.30465#S2.SS1.SSS0.Px3.p1.1),[§2\.2](https://arxiv.org/html/2605.30465#S2.SS2.p1.1),[§2\.5](https://arxiv.org/html/2605.30465#S2.SS5.p2.1),[§3](https://arxiv.org/html/2605.30465#S3.p1.1),[§3](https://arxiv.org/html/2605.30465#S3.p2.1)\.
- Q\. Zang, C\. Zgrzendek, I\. Tchappi, A\. Khadangi, and J\. Sedlmeir \(2025\)KG\-HTC: integrating knowledge graphs into LLMs for effective zero\-shot hierarchical text classification\.arXiv preprint arXiv:2505\.05583\.Cited by:[§2\.4](https://arxiv.org/html/2605.30465#S2.SS4.p1.1),[§2\.4](https://arxiv.org/html/2605.30465#S2.SS4.p2.1)\.
- D\. Zha and J\. Li \(2019\)Multi\-label dataless text classification with topic modeling\.Knowledge and Information Systems61\(1\),pp\. 137–160\.Cited by:[§2\.1](https://arxiv.org/html/2605.30465#S2.SS1.SSS0.Px3.p1.1)\.
- H\. Zhang and O\. Shafiq \(2023\)Towards improving text classification tasks based on knowledge graphs for limited labeled data\.InCanadian AI,Cited by:[§2\.4](https://arxiv.org/html/2605.30465#S2.SS4.p1.1)\.
## Appendix A
### A.1 DSPy Signatures
All classification and graph construction steps are implemented as DSPy signatures (Khattab et al., [2023](https://arxiv.org/html/2605.30465#bib.bib55)). A DSPy signature is a Python class whose docstring contains the task instruction and whose typed fields declare what the model receives as input and what it must produce as output. DSPy automatically constructs the prompt to the LLM from this schema. All variants are zero-shot: no labeled examples are provided and no DSPy optimizer is used.
#### A.1.1 Graph Construction Signatures
##### Stage 1: Entity Extraction.
```python
import dspy


class EntityExtractor(dspy.Signature):
    """Extract key entities from the given text. Extracted entities are nouns, verbs, or adjectives, particularly regarding sentiment. This is for an extraction task, please be thorough and accurate to the reference text.

    Return ONLY a valid JSON list format:
    ["entity1", "entity2", "entity3"]"""
    text = dspy.InputField(desc="The text to extract entities from")
    entities = dspy.OutputField(desc="List of extracted entities in JSON format")
```

##### Stage 2: Relation Extraction.

```python
class RelationExtractor(dspy.Signature):
    """Extract subject-predicate-object triples from the assistant message. A predicate (1-3 words) defines the relationship between subject and object. Subject and object are entities from the provided list. This is an extraction task; be thorough, accurate, and faithful to the reference text.

    Return ONLY valid JSON format:
    [["subj1","pred1","obj1"],["subj2","pred2","obj2"]]"""
    text = dspy.InputField(desc="The text to extract relations from")
    entities = dspy.InputField(desc="List of available entities")
    triples = dspy.OutputField(desc="List of [subject, predicate, object] "
                                    "triples in JSON format")
```
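The relation-extraction prompt asks for a JSON list of triples whose subject and object come from the provided entity list. The paper does not describe how this output is parsed, so the following is only a minimal sketch of one plausible post-processing step. The helper name `parse_triples` and its filtering rules are assumptions for illustration, not the authors' code.

```python
import json


def parse_triples(raw_output, entities):
    """Parse a JSON list of [subject, predicate, object] triples.

    Hypothetical helper: malformed output yields an empty list, and
    triples whose subject or object is not in `entities` are dropped.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    allowed = set(entities)
    triples = []
    for item in data:
        # Each triple must be a list of exactly three strings.
        if not (isinstance(item, list) and len(item) == 3):
            continue
        if not all(isinstance(part, str) for part in item):
            continue
        subj, pred, obj = (part.strip() for part in item)
        # The prompt restricts subject and object to the entity list.
        if subj in allowed and obj in allowed and pred:
            triples.append((subj, pred, obj))
    return triples
```

Dropping malformed or out-of-list triples, rather than raising an error, keeps a single bad generation from aborting graph construction for an article. Whether the original pipeline does the same is not stated.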
##### Stage 3: Cluster Validation.
After relation extraction, `all-MiniLM-L6-v2` groups entities by cosine similarity (threshold 0.75). Candidate clusters of 2–4 entities are validated by the following signature before merging.
```python
class ClusterValidator(dspy.Signature):
    """Verify if these entities belong in the same cluster. A cluster should contain entities that are the same in meaning, with different tenses, plural forms, stem forms, upper/lower cases, or close semantic meanings.

    Return ONLY valid JSON: ["entity1", "entity2", ...]
    Return only entities you are confident belong together.
    If not confident, return empty list []."""
    entities = dspy.InputField(desc="Entities to validate")
    valid_cluster = dspy.OutputField(
        desc="Validated cluster in JSON format")
```
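The grouping step that produces candidate clusters for this signature can be sketched as follows. The paper gives only the similarity threshold (0.75) and the cluster size range (2–4). The greedy seed-based grouping below, the function names, and the toy two-dimensional vectors standing in for `all-MiniLM-L6-v2` embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_clusters(embeddings, threshold=0.75, min_size=2, max_size=4):
    """Greedy grouping (an assumed strategy, not the authors' algorithm).

    Each unassigned entity seeds a group and collects unassigned entities
    whose similarity to the seed meets the threshold, up to max_size.
    Groups smaller than min_size are discarded.
    """
    names = list(embeddings)
    assigned = set()
    clusters = []
    for seed in names:
        if seed in assigned:
            continue
        group = [seed]
        for other in names:
            if other == seed or other in assigned or len(group) == max_size:
                continue
            if cosine(embeddings[seed], embeddings[other]) >= threshold:
                group.append(other)
        if len(group) >= min_size:
            clusters.append(group)
            assigned.update(group)
    return clusters


# Toy vectors in place of real sentence embeddings.
toy = {
    "run": [1.0, 0.0],
    "running": [0.9, 0.1],
    "runs": [0.95, 0.05],
    "bank": [0.0, 1.0],
}
print(candidate_clusters(toy))
```

Each returned group would then be passed to `ClusterValidator`, and only the entities the model confirms would be merged.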
#### A.1.2 Classification Signatures
The knowledge graph is serialized into the following format before being passed to any classification signature. If the graph has no edges, the field reads `"No knowledge graph available."`
```text
Knowledge Graph:
Entities: entity1, entity2, entity3, ...

Relationships:
entity1 --[predicate]--> entity2
entity2 --[predicate]--> entity3
...
```
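A serializer producing this format could look like the sketch below. The function name `serialize_graph` and the representation of triples as `(subject, predicate, object)` tuples are assumptions. The output layout and the fallback string follow the description above.

```python
def serialize_graph(entities, triples):
    """Render entities and (subject, predicate, object) triples as text.

    Hypothetical helper: returns the fallback string when there are no edges.
    """
    if not triples:
        return "No knowledge graph available."
    lines = [
        "Knowledge Graph:",
        "Entities: " + ", ".join(entities),
        "",
        "Relationships:",
    ]
    lines += [f"{subj} --[{pred}]--> {obj}" for subj, pred, obj in triples]
    return "\n".join(lines)


print(serialize_graph(["stress", "anxiety"], [("stress", "causes", "anxiety")]))
```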
##### Variant A — AG:
```python
class KGOnlyTopicClassification(dspy.Signature):
    """Classify topics for this article using ONLY the article text and its knowledge graph. DO NOT use any external keyword information. Analyze: (1) entities in the KG, (2) relationships between entities, (3) the overall semantic structure, (4) the article text.

    Return 1-3 topics from the available list that best match the article. If no topics strongly match, return 'none'."""
    article_text = dspy.InputField(desc="The article content")
    kg_summary = dspy.InputField(desc="Knowledge graph: entities and relationships")
    available_topics = dspy.InputField(desc="List of possible topic labels")
    predicted_topics = dspy.OutputField(desc="1-3 relevant topics or 'none'."
                                             " ONLY names from available_topics.")
```
##### Variant B — AKG:
Topics are passed as one entry per line, e.g. `Mental Health (keywords: depression, anxiety, ...)`, with up to 10 keywords per topic.
```python
class TopicClassificationWithKeywords(dspy.Signature):
    """Given article text, its KG, and topics with keywords, determine which topics are most relevant. Use keywords to understand what each topic represents. Match article content and KG entities/relations against keywords.

    Return only names from available_topics. Return NO MORE THAN 4 topics. If no topics meet strong evidence criteria, return 'none'."""
    article_text: str = dspy.InputField(
        desc="The text content of the article")
    knowledge_graph: str = dspy.InputField(
        desc="Knowledge graph: entities and relationships")
    available_topics_with_keywords: str = dspy.InputField(
        desc="Topics with keywords: "
             "'topic (keywords: kw1, kw2, ...)'")
    predicted_topics: str = dspy.OutputField(
        desc="Comma-separated topic NAMES. Max 4. "
             "If no match, return 'none'.")
```
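The one-entry-per-line topic format described for this variant can be produced with a small helper such as the one below. This is a sketch only: the name `format_topics` and the dictionary input are assumptions, and the cap of 10 keywords per topic comes from the description above.

```python
def format_topics(topic_keywords, max_keywords=10):
    """Format topics as 'Topic (keywords: kw1, kw2, ...)', one per line.

    Hypothetical helper: `topic_keywords` maps a topic name to its
    keyword list, and at most `max_keywords` keywords are kept per topic.
    """
    lines = []
    for topic, keywords in topic_keywords.items():
        shown = ", ".join(keywords[:max_keywords])
        lines.append(f"{topic} (keywords: {shown})")
    return "\n".join(lines)


print(format_topics({"Mental Health": ["depression", "anxiety"]}))
```

The resulting string would be passed as the `available_topics_with_keywords` input field.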
##### Variant C — AGS:
Called $N=5$ times at temperature $0.5$. A topic is retained only if it appears in at least 2 of the 5 runs.
```python
class ConsensusClassification(dspy.Signature):
    """Classify topics for this article. Be thoughtful and precise. Return 1-3 topics clearly discussed in the article."""
    article_text: str = dspy.InputField(desc="Article content")
    kg_summary: str = dspy.InputField(desc="Knowledge graph summary")
    available_topics: str = dspy.InputField(
        desc="Possible topics")
    predicted_topics: str = dspy.OutputField(desc="1-3 relevant topics or 'none'")
```
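The retention rule for this variant (five sampled runs, keeping a topic that appears in at least two of them) can be sketched as a simple vote count. The function name `aggregate_votes`, the treatment of `'none'` as an abstention, and the assumption that each run's output has already been split into a list of topic names are illustrative choices, not details given in the paper.

```python
from collections import Counter


def aggregate_votes(runs, min_votes=2):
    """Keep topics predicted in at least `min_votes` of the sampled runs.

    Hypothetical helper: `runs` is a list of per-run topic lists. A topic
    is counted once per run, and the 'none' sentinel is ignored.
    """
    counts = Counter()
    for topics in runs:
        cleaned = {t.strip() for t in topics}
        counts.update(t for t in cleaned if t and t.lower() != "none")
    return sorted(t for t, votes in counts.items() if votes >= min_votes)


runs = [
    ["Health", "Sports"],
    ["Health"],
    ["none"],
    ["Sports", "Politics"],
    ["Health"],
]
print(aggregate_votes(runs))
```

With a threshold of 2 out of 5, a topic does not need a strict majority of runs to be retained.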
##### Variant D — AKGS:
Same vote aggregation as Variant C ($N=5$, temperature $0.5$).
```python
class ConsensusClassificationWithKeywords(dspy.Signature):
    """Classify topics using provided keywords as guidance. Keywords help understand what each topic represents. Match article content and KG against keywords. Be precise; return 1-3 topics CLEARLY discussed. If uncertain, be conservative."""
    article_text: str = dspy.InputField(desc="Article content")
    kg_summary: str = dspy.InputField(desc="Knowledge graph summary")
    available_topics_with_keywords: str = dspy.InputField(
        desc="Topics with keywords: "
             "'topic (keywords: kw1, kw2, ...)'")
    predicted_topics: str = dspy.OutputField(
        desc="1-3 relevant topics or 'none'. "
             "ONLY names from available_topics.")
```