SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

arXiv cs.CL Papers

Summary

SCHK-HTC is a novel method for few-shot hierarchical text classification that combines sibling contrastive learning with hierarchical knowledge-aware prompt tuning to better distinguish semantically similar classes at deeper hierarchy levels. The approach achieves state-of-the-art performance across three benchmark datasets by enhancing model perception of subtle differences between sibling classes.


# SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for Hierarchical Text Classification

Source: https://arxiv.org/html/2604.15998

###### Abstract

Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck: the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Our code is available at https://github.com/happywinder/SCHK-HTC.

Index Terms—hierarchical text classification, prompt tuning, contrastive learning, knowledge graph

## 1 Introduction

Hierarchical Text Classification (HTC), a specialized form of multi-label text classification, has found wide-ranging applications in numerous real-world scenarios, such as news topic categorization and academic paper classification. Few-shot HTC extends this task, presenting even greater challenges. The core objective of few-shot HTC is to accurately classify texts or documents from the coarsest to the finest granularity within a class hierarchy, given an extremely limited number of samples.

With the advent and proliferation of Pre-trained Language Models (PLMs), the prompt-tuning paradigm, which employs PLMs as text encoders, has emerged as a dominant research trend. This approach effectively bridges the gap between the pretraining objectives of PLMs and the requirements of downstream tasks. Early prominent HTC methods utilized graph neural networks to encode the label taxonomy. While effective, these approaches are inherently data-intensive and perform poorly in few-shot scenarios. HierVerb introduced a paradigm shift by replacing explicit label hierarchy encoders with a contrastive learning objective. This approach proved highly effective, setting new SOTA performance on several datasets. Nevertheless, pulling together label embeddings at lower levels increases their representational overlap and thus exacerbates confusion, ultimately hindering performance. This highlights a critical limitation of such approaches: as classification descends to deeper levels of the hierarchy, the semantic differences between labels become increasingly subtle, making them difficult to distinguish based solely on the text. This amplifies the need for external knowledge. K-HTC incorporates Knowledge Graphs (KG) to provide domain knowledge, aiming to mitigate interference from general-purpose pre-training data. However, its knowledge utilization is not hierarchical, and it lacks a mechanism to effectively fuse label semantics with domain-specific knowledge. Furthermore, its performance in low-resource settings was not analyzed. DCL leverages an external knowledge base through retrieval-augmented generation and large language models (LLMs), achieving impressive performance gains. However, this approach suffers from two significant drawbacks: a massive parameter count that increases computational costs, and heavy reliance on extensive annotated data for in-context learning. Thus, the challenge of achieving effective discrimination between sibling labels at deeper levels, especially under low-resource constraints, constitutes a central and unresolved issue.

Fig. 1: Classification accuracy (%) at the deepest level of the WOS and DBpedia datasets.

Fig. 2: The overall architecture of the proposed sibling contrastive learning with hierarchical knowledge-aware prompt-tuning (SCHK-HTC) framework.

Motivated by these observations, we propose a novel framework to tackle these challenges through two core innovations. First, to compensate for the scarcity of domain knowledge, we introduce a mechanism to extract hierarchical knowledge features from a KG. This provides the model with structured, level-aware context crucial for classification in data-limited settings. Second, to address the ambiguity among fine-grained classes, we employ a contrastive learning objective specifically on sibling labels. This forces the model to learn subtle yet critical distinctions between semantically similar categories. Together, these two components enable our model to learn more discriminative representations for effective few-shot HTC. The main contributions of this paper are summarized as follows: (1) We propose a novel hierarchical knowledge-aware contrastive learning method based on prompt tuning. (2) We integrate a KG into few-shot HTC to alleviate the issue of insufficient domain knowledge, and employ contrastive learning to further address the problem of high semantic similarity among sibling classes. (3) We validate the effectiveness of our method on multiple mainstream datasets, achieving significant performance improvements.

## 2 Methods

In this section, we will introduce the proposed SCHK-HTC in detail. To enhance the model's discriminative power for sibling classes by endowing it with domain-specific knowledge, we propose a framework that incorporates both contrastive learning and KG into prompt-tuning. Our architecture's Hierarchical Knowledge-aware Encoder (HK-Encoder) captures intrinsic knowledge hierarchies, while the hierarchical context encoder extracts richly contextualized and highly discriminative features from text. The overall architecture is depicted in Fig. 2.

### 2.1 Hierarchical Knowledge-aware Prompt-tuning

#### 2.1.1 Hierarchical Knowledge-aware Encoder

To generate a knowledge-aware representation, we construct a relevant subgraph $\mathcal{G}$ by performing entity linking on the input text against Wikidata, extracting the linked entities $\mathcal{E}$ along with their one-hop neighbors and interconnecting relations $\mathcal{R}$. The entity linking process is modeled as a two-stage procedure. First, a mention detection (MD) function identifies a set of textual mentions $M=\{m_1, m_2, \ldots, m_k\}$ within the document $D$. Second, an entity disambiguation step links each mention $m_i$ to its correct entity $e_i^*$ in the KG. This step typically involves generating a set of candidate entities $C(m_i) \subset KG$ and ranking them to find the best match. The final set of linked entities is denoted as $\mathcal{E}=\{e_1, e_2, \ldots, e_k\}$:

$$\mathcal{E}=\{e_i^* \mid m_i \in \mathrm{MD}(D), e_i^*=\underset{c \in C(m_i)}{\mathrm{argmax}} \psi(m_i, c, D)\}$$
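As a loose illustration of this two-stage procedure, the sketch below wires mention detection, candidate generation, and argmax disambiguation together; `detect_mentions`, `candidate_entities`, and `score` are hypothetical stand-ins for whatever off-the-shelf linker is used against Wikidata, which the excerpt does not name.

```python
# Minimal sketch of the two-stage entity linking step described above.
# `detect_mentions`, `candidate_entities`, and `score` are hypothetical stand-ins
# for an off-the-shelf linker against Wikidata; the paper does not name one.
from typing import Callable, List


def link_entities(
    document: str,
    detect_mentions: Callable[[str], List[str]],     # MD(D): mention detection
    candidate_entities: Callable[[str], List[str]],  # C(m_i): candidate generation
    score: Callable[[str, str, str], float],         # psi(m_i, c, D): ranking score
) -> List[str]:
    """Return the linked entity set E = {e_i*} for document D."""
    linked: List[str] = []
    for mention in detect_mentions(document):
        candidates = candidate_entities(mention)
        if not candidates:
            continue  # unlinkable mention, skip it
        # Entity disambiguation: keep the highest-scoring candidate (argmax over psi).
        linked.append(max(candidates, key=lambda c: score(mention, c, document)))
    return linked
```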

We employ BERT to encode knowledge from two complementary modalities. Given an input sequence $X=\{x_1, x_2, \ldots, x_n\}$, we prepend a pre-defined cloze-style template "[CLS] the first layer's knowledge is [MASK]..." to the input text via string concatenation:

$$\text{input} = \text{template} + X$$

Then we link the entities within $X$ to the subgraph, obtaining a corresponding set of entities $\{e_1, e_2, \ldots, e_k\}$. For the semantic modality, we initialize entity representations $\{w_1, w_2, \ldots, w_k\}$ using BERT's embedding layer $\mathrm{Emb}_{\mathrm{BERT}}$:

$$\{w_1, \ldots, w_k\} = \mathrm{Emb}_{\mathrm{BERT}}(\{e_1, \ldots, e_k\})$$
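For illustration, a minimal sketch of this lookup with Hugging Face BERT follows; averaging the sub-word embeddings into a single vector per entity name is our assumption, since the excerpt does not specify how multi-token entity names are pooled.

```python
# Sketch of the semantic-modality initialization: each linked entity's surface form is
# embedded with BERT's input embedding layer (Emb_BERT). Averaging the sub-word vectors
# into one vector per entity is an assumption; the excerpt does not state the pooling.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
emb_layer = BertModel.from_pretrained("bert-base-uncased").get_input_embeddings()

def entity_semantic_embeddings(entity_names):
    vectors = []
    for name in entity_names:
        ids = tokenizer(name, add_special_tokens=False, return_tensors="pt")["input_ids"]
        vectors.append(emb_layer(ids).mean(dim=1).squeeze(0))  # average over sub-words
    return torch.stack(vectors)  # w_1 ... w_k, shape (k, hidden_size)

w = entity_semantic_embeddings(["machine learning", "neural network"])
```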

For the structural modality, we employ a two-stage strategy: first, initial global embeddings $L$ are generated by running Node2Vec on the subgraph:

$$L = \text{Node2Vec}(\mathcal{E}, \mathcal{R})$$

Second, for each node, we aggregate information from a randomly sampled set of its neighbors in $\mathcal{G}$: random neighbor sampling followed by feature aggregation combines the node's own features with those of its neighbors to produce contextually enriched embeddings. $\mathcal{AGG}$ denotes this random-sampling and averaging aggregation function:

$$\{g_1, g_2, \ldots, g_k\} = \mathcal{AGG}(L, \mathcal{G}, \{e_1, e_2, \ldots, e_k\})$$
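The snippet below sketches the $\mathcal{AGG}$ step under stated assumptions: the Node2Vec table $L$ and the subgraph adjacency are precomputed dictionaries, and the neighbor sample size is illustrative.

```python
# Sketch of the AGG step: for each linked entity we randomly sample neighbors from the
# subgraph G and average their Node2Vec vectors with the node's own vector. The global
# embedding table L and the adjacency dict are assumed precomputed; the sample size of
# 5 is illustrative.
import random
import torch

def aggregate_structural(node2vec_emb, adjacency, entities, num_samples=5):
    """node2vec_emb: {entity: Tensor}, adjacency: {entity: [neighbor entities]}."""
    out = []
    for e in entities:
        neighbors = adjacency.get(e, [])
        sampled = random.sample(neighbors, min(num_samples, len(neighbors)))
        vectors = [node2vec_emb[e]] + [node2vec_emb[n] for n in sampled]
        out.append(torch.stack(vectors).mean(dim=0))  # mean over self + sampled neighbors
    return torch.stack(out)  # g_1 ... g_k, shape (k, embedding_dim)
```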

The semantic and structural representations are fused via element-wise addition. Finally, we extract the resulting [MASK] token's hidden state from the transformer blocks to serve as the final hierarchical knowledge-aware representation.
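A rough sketch of the fusion and readout is given below, assuming the structural vectors are first projected to BERT's hidden size so that element-wise addition is well defined; how the fused entity features are injected into the encoder input is not detailed in the excerpt, so only the knowledge prompt and the [MASK]-state extraction are shown concretely.

```python
# Rough sketch of the fusion and [MASK] readout. We assume the structural vectors are
# projected to BERT's hidden size so the element-wise addition is well defined; how the
# fused entity features are injected into the encoder input is not detailed in the
# excerpt, so only the knowledge prompt and the [MASK]-state extraction are shown.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

w = torch.randn(4, bert.config.hidden_size)          # semantic entity features (k=4, dummy)
g = torch.randn(4, 128)                              # structural features (Node2Vec dim assumed)
proj = torch.nn.Linear(128, bert.config.hidden_size)
fused = w + proj(g)                                  # element-wise addition of both modalities

ordinals = ["first", "second", "third"]
prompt = " ".join(f"the {o} layer's knowledge is [MASK]." for o in ordinals)
enc = tokenizer(prompt + " some input document ...", return_tensors="pt", truncation=True)
hidden = bert(**enc).last_hidden_state                       # (1, seq_len, hidden_size)
mask_positions = enc["input_ids"] == tokenizer.mask_token_id
h_knowledge = hidden[mask_positions]                         # one [MASK] state per hierarchy layer
```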

#### 2.1.2 Hierarchical Context Encoder

While knowledge-aware features capture entity-specific details, they lack broader sentence context information. To complement them, we extract discriminative contextual features using a prompt-based text encoding strategy adapted from DPT. For each hierarchical layer, we construct a contrastive prompt "[CLS] the first layer is [MASK] rather than [MASK]..." containing a positive-negative [MASK] pair. The $[\mathrm{MASK}]_{\text{pos}}$ is assigned the ground-truth label, while the $[\mathrm{MASK}]_{\text{neg}}$ is assigned a confusable sibling label, compelling the model to learn fine-grained distinctions. We define the final-layer feature of the $[\mathrm{MASK}]_{\text{pos}}$ token as $h_{\text{text}}$, which will be utilized in the subsequent fusion stage.
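The following sketch builds such a dual-[MASK] prompt and separates positive from negative mask positions; treating the first [MASK] of each layer clause as the positive slot and the second as the negative slot is an assumption made for illustration.

```python
# Sketch of the dual-[MASK] contrastive prompt used by the hierarchical context encoder.
# Treating the first [MASK] of each layer clause as the positive slot and the second as
# the negative slot is an assumption made for illustration.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
ordinals = ["first", "second", "third"]

def build_contrastive_prompt(text, num_layers=2):
    clauses = [f"the {ordinals[l]} layer is [MASK] rather than [MASK]." for l in range(num_layers)]
    return " ".join(clauses) + " " + text

enc = tokenizer(build_contrastive_prompt("some input document ..."), return_tensors="pt")
mask_idx = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
pos_mask_idx = mask_idx[0::2]  # [MASK]_pos positions, one per layer
neg_mask_idx = mask_idx[1::2]  # [MASK]_neg positions, one per layer
```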

### 2.2 Training Objectives

#### 2.2.1 Knowledge-aware Hierarchical InfoNCE Loss

Our model extracts hierarchical knowledge in a layer-by-layer fashion. To structure the learned representation space, we introduce a Knowledge-aware Hierarchical InfoNCE loss, which is driven by the label hierarchy. The core principle is that for any two samples $x_i$ and $x_j$, let $y_i^{(l)}$ and $y_j^{(l)}$ denote their ground-truth labels at layer $l$. If $y_i^{(l)} = y_j^{(l)}$, then their corresponding knowledge representations, $h_i^{(l)}$ and $h_j^{(l)}$, should exhibit higher similarity than they would with the representation $h_k^{(l)}$ of any sample $x_k$ where the label $y_k^{(l)} \neq y_i^{(l)}$. This structural constraint is enforced using a contrastive objective. For an anchor sample $x_i$ with its layer $l$ representation $h_i^{(l)}$, we define the set of positives $\mathcal{P}_i^{(l)}$ as samples sharing the label $y_i^{(l)}$, and the set of negatives $\mathcal{N}_i^{(l)}$ as those with different labels. The InfoNCE loss for layer $l$ then aims to pull the anchor $h_i^{(l)}$ closer to all positive representations $\{h_p^{(l)} \mid p \in \mathcal{P}_i^{(l)}\}$ while pushing it away from all negative representations $\{h_n^{(l)} \mid n \in \mathcal{N}_i^{(l)}\}$. The loss is formulated as:

$$\mathcal{L}_{\mathrm{K}}^{(l)} = -\log \frac{\sum_{p \in \mathcal{P}_i^{(l)}} e^{s(h_i^{(l)}, h_p^{(l)}) / \tau}}{\sum_{p \in \mathcal{P}_i^{(l)}} e^{s(h_i^{(l)}, h_p^{(l)}) / \tau} + \sum_{n \in \mathcal{N}_i^{(l)}} e^{s(h_i^{(l)}, h_n^{(l)}) / \tau}}$$

We perform a layer-wise summation of the losses:

$$\mathcal{L}_{\mathrm{KH\text{-}infoNCE}} = \sum_{l=1}^{L} \lambda_l \cdot \mathcal{L}_{\mathrm{K}}^{(l)}$$

where $s(\cdot)$ represents the cosine similarity function, $\tau$ is the temperature hyper-parameter, and $\lambda_l$ is the coefficient per layer.
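A compact PyTorch sketch of these two equations is given below; batch construction and the per-layer weights $\lambda_l$ are illustrative assumptions.

```python
# Compact PyTorch sketch of the knowledge-aware hierarchical InfoNCE loss above. Batch
# construction and the per-layer weights lambda_l are illustrative assumptions.
import torch
import torch.nn.functional as F

def layer_infonce(reps, labels, tau=0.1):
    """reps: (B, d) layer-l knowledge representations; labels: (B,) layer-l gold labels."""
    sim = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1) / tau  # s(.,.)/tau
    exp_sim = torch.exp(sim)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # positives share the layer-l label
    eye = torch.eye(len(labels), dtype=torch.bool, device=reps.device)
    pos, neg = (same & ~eye).float(), (~same).float()
    pos_sum = (exp_sim * pos).sum(dim=1)
    denom = pos_sum + (exp_sim * neg).sum(dim=1)
    valid = pos.sum(dim=1) > 0                               # anchors with at least one positive
    return -torch.log(pos_sum[valid] / denom[valid]).mean()

def kh_infonce(reps_per_layer, labels_per_layer, lambdas, tau=0.1):
    """L_KH-infoNCE = sum_l lambda_l * L_K^(l), the layer-wise weighted sum."""
    return sum(lam * layer_infonce(r, y, tau)
               for lam, r, y in zip(lambdas, reps_per_layer, labels_per_layer))
```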

#### 2.2.2 Sibling Contrastive Learning Loss

To enhance discriminability among sibling classes, we introduce a Sibling Contrastive Learning (SCL) Loss that leverages the verbalizer's output for hard-negative mining. For each layer $l$, we select the top-$k$ labels with the highest predicted probabilities from the verbalizer's output, excluding the ground-truth label, as the hard-negative set $\mathcal{N}_{\text{hard}}^{(l)}$. These hard negatives are used as targets for a corresponding negative verbalizer in a contrastive objective. The objective of our dual-template contrastive learning strategy is to compel the model to focus on the fine-grained semantic differences between labels, thereby enhancing its discriminative capability. We initialize our verbalizer by first using an LLM to generate detailed textual explanations for each class label. These explanations are then passed through a pre-trained BERT, and the resulting "[CLS]" token embeddings serve as the initial vectors for our verbalizer. $h_n^{(l)}$ and $h_p^{(l)}$ denote the $l$-th layer negative-verbalizer and positive-verbalizer outputs, respectively; $v_p^{(l)}$ denotes the ground-truth label embedding, and $v_{n,i}^{(l)}$ is the embedding of the $i$-th hard-negative label sampled at the $l$-th layer. The loss is formulated as:

$$\mathcal{L}_{\text{Sibling}} = -\frac{1}{L}\sum_{l=1}^{L}\log\frac{e^{s(h_p^{(l)}, v_p^{(l)}) / \tau}}{e^{s(h_p^{(l)}, v_p^{(l)}) / \tau} + e^{s(h_p^{(l)}, h_n^{(l)}) / \tau} + \sum_{i=1}^{|\mathcal{N}_{\text{hard}}^{(l)}|} e^{s(h_p^{(l)}, v_{n,i}^{(l)}) / \tau}}$$
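As a hedged illustration, the sketch below mines the top-$k$ non-gold labels from the verbalizer distribution as hard negatives and contrasts the positive-[MASK] feature against the gold label embedding versus those negatives; the negative-verbalizer term is simplified away, so this is not the authors' exact objective.

```python
# Hedged sketch of the sibling contrastive objective: per layer, the top-k non-gold
# labels from the verbalizer distribution act as hard negatives, and the positive-[MASK]
# feature is contrasted against the gold label embedding versus those negatives. The
# negative-verbalizer term is omitted for simplicity; this is an illustration, not the
# authors' exact objective.
import torch
import torch.nn.functional as F

def sibling_loss(h_pos, label_emb, probs, gold, k=3, tau=0.1):
    """
    h_pos:     list of L tensors (d,)      -- [MASK]_pos features per layer
    label_emb: list of L tensors (C_l, d)  -- verbalizer label embeddings per layer
    probs:     list of L tensors (C_l,)    -- verbalizer predicted distribution per layer
    gold:      list of L ints              -- gold label index per layer
    """
    losses = []
    for h, V, p, y in zip(h_pos, label_emb, probs, gold):
        p = p.clone()
        p[y] = float("-inf")                                 # exclude the ground-truth label
        hard = p.topk(min(k, p.numel() - 1)).indices         # hard-negative set N_hard^(l)
        pos_sim = F.cosine_similarity(h, V[y], dim=0) / tau
        neg_sim = F.cosine_similarity(h.unsqueeze(0), V[hard], dim=1) / tau
        logits = torch.cat([pos_sim.view(1), neg_sim]).unsqueeze(0)
        target = torch.zeros(1, dtype=torch.long, device=h.device)
        losses.append(F.cross_entropy(logits, target))       # InfoNCE over gold vs. hard negatives
    return torch.stack(losses).mean()                        # average over the L layers
```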

#### 2.2.3 Verbalizer Classification Loss

For each hierarchical layer $l$, we fuse the knowledge-aware features $h_k^{(l)}$ with the contextual feature $h_{\text{text}}$, and the fused representation is fed to the layer's verbalizer for label prediction under a standard classification objective.

Similar Articles

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Hugging Face Daily Papers

Switch-KD proposes a novel visual-switch knowledge distillation framework for efficiently compressing vision-language models by unifying multimodal knowledge transfer within a shared text-probability space. The method achieves 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher model.

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Hugging Face Daily Papers

This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.

Stochastic Neural Networks for hierarchical reinforcement learning

OpenAI Blog

OpenAI researchers propose a framework using stochastic neural networks for hierarchical reinforcement learning that pre-trains useful skills guided by a proxy reward, then leverages these skills for faster learning in downstream tasks with sparse rewards or long horizons.