SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

arXiv cs.CL Papers

Summary

Proposes SSDAU, a structured semantic data augmentation method for joint entity and relation extraction that preserves semantic structure by segmenting text based on entity labels and using BERTTopic for topic consistency, significantly outperforming existing augmentation methods.

arXiv:2605.23440v1 Announce Type: new

Cached at: 05/25/26, 09:02 AM

# SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction
Source: [https://arxiv.org/html/2605.23440](https://arxiv.org/html/2605.23440)

Jiawei He (1, 2), jiaweihe@smail.nju.edu.cn; Mengyu Shi (1), mengyushi@smail.nju.edu.cn; Chunrong Fang (1, \*), fangchunrong@nju.edu.cn

(1) State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; (2) Amap, Alibaba Group, China; (\*) Corresponding author

###### Abstract

Joint Entity and Relation Extraction \(JERE\) is highly susceptible to weak generalization due to low\-quality training data\. Data augmentation is a common strategy to enhance model generalization across different domains\. However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization\. In this paper, we propose Structured Semantic Data Augmentation \(SSDAU\), a novel method designed to preserve the semantic structure of text during augmentation\. SSDAU segments text based on entity labels and employs an encoder to capture semantic features of entities through context awareness\. It then performs entity semantic restructuring to generate augmented data\. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores\. To mitigate potential topic ambiguity and information loss, we apply the BERTTopic model to filter out irrelevant topics, ensuring topic consistency\. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines\. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity \(8\.26% F1 decrease vs\. 31\.91% for baselines\), significantly outperforming all existing methods across all metrics\.

## 1 Introduction

Joint Entity and Relation Extraction (JERE) is widely used for representation learning on text data due to its strong performance in applications such as information retrieval (Lin et al., [2020](https://arxiv.org/html/2605.23440#bib.bib41)), question answering (Abdelaziz et al., [2021](https://arxiv.org/html/2605.23440#bib.bib42)), and text summarization (Zhong et al., [2020](https://arxiv.org/html/2605.23440#bib.bib43)). The generalization performance of JERE models heavily depends on the quality and scale of the training data. A common strategy to enhance generalization is data augmentation. Techniques such as MixUp (Cheng et al., [2020](https://arxiv.org/html/2605.23440#bib.bib45)) and back-translation (Xie et al., [2020](https://arxiv.org/html/2605.23440#bib.bib46)) enable efficient expansion of the training set by generating new data with subtle perturbations derived from the original samples.

However, a key challenge in applying existing techniques to enhance the generalization of JERE models is that introducing noise or perturbations into the original data may weaken entity relevance (Kambhatla et al., [2022](https://arxiv.org/html/2605.23440#bib.bib47)). Training on incorrectly generated data can ultimately degrade the performance of JERE models. Additionally, entities are often involved in multiple triples with complex semantic relations and dependencies. Existing data augmentation methods can disrupt these structures and dependencies, leading to issues such as overlapping relations and cascading errors (Liu et al., [2020](https://arxiv.org/html/2605.23440#bib.bib35)).

To address this issue, we propose Structured Semantic Data Augmentation (SSDAU) to preserve the semantic structure of text during data augmentation. Instead of directly perturbing text, SSDAU aligns triplet text to maintain semantic integrity. First, we use a feature-based encoder to segment the text, ensuring that each segment retains the semantics of its neighboring regions. Next, we match segments with similar semantic labels using a decoder. Our approach integrates contextualized embeddings with pretrained pooler weights to differentiate semantically similar but distinct entities. We then substitute text with high similarity to reorganize the original text, generating augmented data while preserving semantic coherence. Finally, to mitigate error propagation, we employ a topic-aware consistency filtering mechanism that scores candidate triples using the BERTTopic model and eliminates those inconsistent with gold-standard semantics.

To assess the effectiveness of SSDAU, we compared its performance on four widely used datasets with seven baseline methods\. The experimental results validate our main finding: SSDAU consistently outperforms other methods in both common and low\-quality data scenarios\. Our ablation studies further reinforce this conclusion\. Even when faced with semantic ambiguity, SSDAU maintains stable performance with an average F1 score decrease of only 8\.26% across all datasets, while other baselines suffer substantial degradation\. This robust performance extends to overall effectiveness, with SSDAU achieving average precision of 92\.03% and F1 score of 91\.96% across all datasets, substantially outperforming all baseline methods including recent approaches like ChatIE\.

## 2 Related Work

#### Information Extraction

JERE is a fundamental NLP task that aims to map entities and relations, generate a text-to-triplet model based on their correlation, and assign the triple to a new annotation (Fu et al., [2019](https://arxiv.org/html/2605.23440#bib.bib10)). Previous JERE models mostly employ joint modeling (Ren et al., [2017](https://arxiv.org/html/2605.23440#bib.bib9)) or sequential annotation ([38](https://arxiv.org/html/2605.23440#bib.bib8)) to extract entities and relations together. They focus on structured learning by manually constructing features and building information tables or knowledge to enhance the relevance of entity extraction and relation recognition (Miwa and Bansal, [2016](https://arxiv.org/html/2605.23440#bib.bib11)). However, manually constructed features make it hard to achieve good results across different applications. To address this challenge, Zhao et al. ([2021](https://arxiv.org/html/2605.23440#bib.bib12)) propose decomposing the JERE task and completing contextual learning by modifying the classification process. They divide JERE models into three categories: multi-module multi-step (Zheng et al., [2021](https://arxiv.org/html/2605.23440#bib.bib14); Wei et al., [2020](https://arxiv.org/html/2605.23440#bib.bib15)), multi-module one-step (Sui et al., [2020](https://arxiv.org/html/2605.23440#bib.bib13); Wang et al., [2020](https://arxiv.org/html/2605.23440#bib.bib17)), and one-module one-step (Shang et al., [2022](https://arxiv.org/html/2605.23440#bib.bib16)). The accuracy of these models is limited by the quality of the training data. Our structured semantic data augmentation method can help generate a large amount of high-quality data, which benefits both the basic and downstream applications of JERE models.

#### Semantic Match

Semantic matching is a sub-task of text matching used to retrieve semantically similar texts in search scenarios (Wu et al., [2022](https://arxiv.org/html/2605.23440#bib.bib18)). Some representative approaches include cosine similarity, term frequency-inverse document frequency (TF-IDF) calculation, and the deep structured semantic model (DSSM) (Gao et al., [2021](https://arxiv.org/html/2605.23440#bib.bib19)). Recent studies have shown that pre-training semantic classification models can effectively compress massive text and improve the generalization ability of semantic matching models (Brown et al., [2020](https://arxiv.org/html/2605.23440#bib.bib24)). For example, the emergence of *Similarities* (Zhang Bingyu, [2022](https://arxiv.org/html/2605.23440#bib.bib23)) provides a solid foundation for developing practical applications for text semantic matching tasks. In particular, the semantic matching function of *Similarities* has been widely recognized for its superior effect in text relation extraction. Therefore, based on the existing text similarity matching techniques, we improve the existing JERE work by text semantic matching.

![Refer to caption](https://arxiv.org/html/2605.23440v1/x1.png)

Figure 1: Overview of SSDAU. The Data Discretization and Reconstruction component discretizes the text data $S$ semantically using the Encoder and outputs text collections in the form of segmented sets. The Decoder then processes these segmented sets to facilitate the Structured Semantic Data Augmentation component, where the Input View is based on similarity matching, while the Output View focuses on augmenting the data. Finally, the Scoring-based Consistency Filtering component uses a structured semantic classifier to filter low-resource data, and the remaining augmented data $\pounds$ and $T$ are used as augmented data $S_g$ to train a more robust JERE model.

#### Data Augmentation

Data augmentation is a cost-effective and efficient method that can improve the performance and accuracy of machine learning models, especially in a data-constrained environment (Cashman et al., [2020](https://arxiv.org/html/2605.23440#bib.bib25)). Common data augmentation techniques used in NLP include proximal word replacement (Wei and Zou, [2019](https://arxiv.org/html/2605.23440#bib.bib28)), word vector replacement (Wang and Yang, [2015](https://arxiv.org/html/2605.23440#bib.bib29)), masked language model replacement (Jiao et al., [2020](https://arxiv.org/html/2605.23440#bib.bib30)), back translation (Zhang et al., [2020](https://arxiv.org/html/2605.23440#bib.bib31)), and adding noise (Min et al., [2020](https://arxiv.org/html/2605.23440#bib.bib32); Yan et al., [2019](https://arxiv.org/html/2605.23440#bib.bib33); Hou et al., [2018](https://arxiv.org/html/2605.23440#bib.bib34)). In addition, Zhang et al. ([2015](https://arxiv.org/html/2605.23440#bib.bib26)) and Mueller and Thyagarajan ([2016](https://arxiv.org/html/2605.23440#bib.bib27)) propose a lexical substitution method for augmented data that preserves original semantics by word proxemics. However, this method is limited by the size and lexical coverage of the proxemics list. Unlike existing methods that employ simple perturbation (Liu et al., [2020](https://arxiv.org/html/2605.23440#bib.bib35)) or an extra augmentor model (Hou et al., [2021](https://arxiv.org/html/2605.23440#bib.bib51); Hu et al., [2019](https://arxiv.org/html/2605.23440#bib.bib60)), we propose sampling-based augmentation, which generates data with the same semantic structure by maintaining the semantic logic of the samples.

## 3 Method

In this section, we first define the problems\. Then, we introduce the three main components of SSDAU: 1\) data discretization and reconstruction, 2\) structured semantic data augmentation, and 3\) scoring\-based consistency filtering\.

### 3.1 Preliminaries

Given a set of sentences $S=\{s_1, s_2, \ldots, s_N\}$ containing $L$ tokens and $K$ predefined relations $R=\{r_1, r_2, \ldots, r_K\}$, we extract entities and relations to construct triples $T=\{(h_i, r_i, t_i)\}_{i=1}^{M}$ in $S$, where $h_i$ and $t_i$ are the head and tail entities, respectively, $N$ is the number of sentences, and $M$ is the number of triples. In this process, we maintain a three-dimensional matrix $M^{L \times K \times L}$ to store the existing knowledge.

Since triplets are the core output format of JERE, we use the triplet as the basic unit of data augmentation and partition the text according to the triplet to obtain three series of text collections. To preserve the contextual semantics of the segmented text, we keep the contextual token $l$ of each segmented text and record the location of each cut point $p$.

### 3.2 Data Discretization and Reconstruction

#### Encoder

We use the triplet as the basic unit of data augmentation to eliminate the noise from textual perturbations. We design a text feature-based encoder $E$. The input of the encoder is the sentence text $S$, and for each sentence $s_i$, we find the specified text blocks $(q_{h_i}, q_{r_i}, q_{t_i})$ based on the triplet tags $(\rho_{h_i}, \rho_{r_i}, \rho_{t_i})$, and record the context tokens $(l_{h_i}, l_{r_i}, l_{t_i})$ and the cut positions $(p_{h_i}, p_{r_i}, p_{t_i})$. The encoder processes all the input text and produces three output text collections according to the tag types: the head entity collection $Q_h$, the tail entity collection $Q_t$, and the relation entity collection $Q_r$.
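
The segmentation step can be illustrated with a minimal sketch. This is not the authors' implementation: the `TextBlock` class, the `segment` function, the `window` parameter, and the example sentence are all hypothetical names and values chosen for illustration, and the sketch assumes the triplet tags are already available as token ranges.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str       # q: the segmented text
    context: tuple  # l: neighbouring tokens kept as context
    tag: str        # rho: "head", "relation", or "tail"
    position: int   # p: token index of the cut point


def segment(tokens, spans, window=1):
    """Split one tokenized sentence into tagged blocks.

    `spans` maps a tag to a (start, end) token range, end exclusive.
    `window` (an assumed parameter) is how many tokens of context to keep
    on each side of a block.
    """
    blocks = {}
    for tag, (start, end) in spans.items():
        left = tokens[max(0, start - window):start]
        right = tokens[end:end + window]
        blocks[tag] = TextBlock(
            text=" ".join(tokens[start:end]),
            context=tuple(left + right),
            tag=tag,
            position=start,
        )
    return blocks


# Hypothetical sentence built around entities from the paper's case study.
tokens = "Mitch Mustain lived in Arkansas".split()
spans = {"head": (0, 2), "relation": (2, 4), "tail": (4, 5)}
blocks = segment(tokens, spans)
```

Collecting the `head`, `tail`, and `relation` blocks over all sentences would give the three collections $Q_h$, $Q_t$, and $Q_r$ described above.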

![Refer to caption](https://arxiv.org/html/2605.23440v1/x2.png)Figure 2:The structure of our feature\-based encoder\.
#### Decoder

We design a similarity form based on the $K$ relation types and $M$ triplet labels in the sentence set $S$, and use it as the basis for designing a form similarity-based text matching decoder $D$. The input of decoder $D$ is $(Q_h, Q_t, Q_r)$, and it divides the text collections according to the relation types and label types to obtain $LKL$ groups of text libraries $B=\{B_1, B_2, \ldots, B_{LKL}\}$, each holding text with the same relation type and the same label.
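
The grouping performed by the decoder can be sketched as a simple bucketing step. The record layout `(relation_type, tag, text)` and the function name `group_blocks` are assumptions made for this example; the paper does not specify the data structures.

```python
from collections import defaultdict


def group_blocks(records):
    """Bucket (relation_type, tag, text) records by (relation_type, tag)."""
    libraries = defaultdict(list)
    for relation_type, tag, text in records:
        libraries[(relation_type, tag)].append(text)
    return dict(libraries)


# Hypothetical records; relation names are illustrative only.
records = [
    ("place_lived", "head", "Mitch Mustain"),
    ("place_lived", "tail", "Arkansas"),
    ("place_lived", "head", "Amy Grant"),
    ("contains", "head", "Arkansas"),
]
libraries = group_blocks(records)
```

Each bucket plays the role of one text library $B_i$: only blocks that share a relation type and a label are later compared with each other.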

### 3.3 Structured Semantic Data Augmentation

#### Discrete Text Matching

We design a text matcher based on the semantic similarity evaluation tool *Similarities* to align the decoder's output. A text block $b$ in an output group $B_i=\{b_1, b_2, \ldots, b_j\}$ from the decoder stores the text $q$, context tokens $l$, label type $\rho$, and segmentation position $p$. We perform matching across all $b$ in different text corpora $B_i$, incorporating semantic, syntactic, and lexical similarity evaluations, as well as context token similarity assessments. To effectively distinguish between semantically similar but distinct entities, we enhance this process by incorporating contextualized [CLS] embeddings from a pretrained BERT encoder and applying pretrained pooler weights to compute entity-level semantic correlation. This correlation score is then fused with the original semantic similarity score to obtain a hybrid similarity measure. The matching results are normalized to a value between 0 and 1 and inserted into a priority queue sorted in descending order of similarity. Finally, for each $B_i$, we obtain a similarity-based priority queue $P_i$.
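
The fusion and queueing steps can be sketched as follows. The paper does not state how the two scores are fused, so the convex combination and its weight `alpha` are assumptions made for this example, as are the function names. The input scores are taken as already computed and already in $[0, 1]$.

```python
import heapq


def fuse(semantic, contextual, alpha=0.5):
    """Convex combination of two scores in [0, 1].

    `alpha` is a hypothetical weight; the result also lies in [0, 1].
    """
    return alpha * semantic + (1 - alpha) * contextual


def build_queue(candidates, alpha=0.5):
    """Return (pair_id, score) pairs in descending order of fused score.

    `candidates` is a list of (pair_id, semantic_score, contextual_score).
    """
    heap = []
    for pair_id, semantic, contextual in candidates:
        score = fuse(semantic, contextual, alpha)
        # heapq is a min-heap, so push the negated score.
        heapq.heappush(heap, (-score, pair_id))
    ordered = []
    while heap:
        neg_score, pair_id = heapq.heappop(heap)
        ordered.append((pair_id, -neg_score))
    return ordered


# Illustrative scores only.
queue = build_queue([("a", 0.9, 0.7), ("b", 0.4, 0.6), ("c", 0.8, 1.0)])
```

With these illustrative inputs the candidates come out in the order `c`, `a`, `b`, which is the descending-similarity order the priority queue $P_i$ is meant to provide.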

#### Data Augmentation

After completing the similarity matching, we filter out the data in the priority queues $P_i=\{P_1, P_2, \ldots, P_{KM}\}$ with a similarity score lower than the threshold $\varepsilon$. For the remaining data, we replace the text content of the corresponding text blocks based on the recorded segmentation position $p$ in each block's information, thereby generating the augmented data.
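
A minimal sketch of the thresholding and substitution step is shown below. The function name `augment`, the default threshold of 0.8, and the candidate scores are illustrative assumptions rather than values taken from this section.

```python
def augment(tokens, start, end, queue, threshold=0.8):
    """Replace tokens[start:end] with every candidate scoring >= threshold.

    `queue` holds (candidate_text, score) pairs sorted by descending score.
    """
    augmented = []
    for candidate, score in queue:
        if score < threshold:
            break  # the queue is sorted, so all later candidates score lower
        augmented.append(tokens[:start] + candidate.split() + tokens[end:])
    return augmented


# Hypothetical sentence and scores for replacing the head entity.
tokens = "Mitch Mustain lived in Arkansas".split()
queue = [("Amy Grant", 0.91), ("Nashville", 0.42)]
result = augment(tokens, start=0, end=2, queue=queue)
```

Because the replacement happens at the recorded cut position, the rest of the sentence and its labels stay untouched, which is the property the method relies on to preserve semantic structure.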

### 3.4 Scoring-based Classifier

To further improve the quality of the augmented data, we employ a BERTTopic model to identify and retain key terms from topic descriptions\. We then filter out augmented data associated with irrelevant topics, ensuring the topic coherence of the generated text\.

First, we extract all entities and relations from the text. Then, we encode the tokens using BERT (Kenton and Toutanova, [2019](https://arxiv.org/html/2605.23440#bib.bib36)). Next, we combine entities and relations in the form of $(l_h, r, l_t)$ and perform triplet extraction using joint entity and relation extraction (Shang et al., 2022). Finally, we apply a function to compute the correlation between the head and tail entities. The scoring function is defined as:

$$h \star t = \phi\left(W[l_h; l_t]^T + b\right) \tag{1}$$

where $h$ and $t$ represent the head and tail, respectively. $\star$ denotes circular correlation $(\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^d)$. $W \in \mathbb{R}^{d_e \times 2d}$ and $b$ are trainable weights and biases, respectively, where $d_e$ denotes the dimension of the entity. $[;]$ is the concatenation operation and $\phi(\cdot)$ represents the ReLU activation function.

We then incorporate the highly evaluated entity pairs with the relations and use the relational representation function $R \in \mathbb{R}^{d_e \times 4K}$. The vector function is defined as follows:

$$\upsilon(l_h, r_k, l_t)_{k=1}^{K} = R^T \phi\left(\mathrm{drop}\left(W[l_h; l_t]^T + b\right)\right) \tag{2}$$

where $\upsilon$ represents the score vector and $\mathrm{drop}(\cdot)$ refers to the dropout strategy (Srivastava et al., 2014).
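
To make the shapes in Equations (1) and (2) concrete, here is a small pure-Python sketch. It assumes inference mode, where dropout acts as the identity, and uses toy values with $d = 1$, $d_e = 2$, and $K = 1$. The function names and all numbers are hypothetical.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def pair_features(W, b, l_h, l_t):
    """Eq. (1): phi(W [l_h; l_t]^T + b), with phi = ReLU."""
    concatenated = l_h + l_t  # [l_h; l_t], length 2d
    affine = [z + bias for z, bias in zip(matvec(W, concatenated), b)]
    return relu(affine)


def relation_scores(R, W, b, l_h, l_t):
    """Eq. (2) with dropout treated as the identity (inference mode)."""
    features = pair_features(W, b, l_h, l_t)  # length d_e
    R_transposed = [list(column) for column in zip(*R)]  # 4K x d_e
    return matvec(R_transposed, features)  # length 4K


# Toy parameters: W is d_e x 2d = 2 x 2, R is d_e x 4K = 2 x 4.
W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, 0.5]
R = [[1.0, 0.0, 0.0, 1.0], [0.5, 1.0, -1.0, 0.0]]
scores = relation_scores(R, W, b, l_h=[1.0], l_t=[2.0])
```

The output has $4K$ entries, matching the second dimension of $R$. During training, a dropout mask would be applied to the affine output before the ReLU, as in Equation (2).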

Next, we pass the score vector $\upsilon$ through the softmax function to predict the corresponding labels. The triple objective is formulated as follows:

$$\zeta_{\text{triple}} = -\frac{\sum_{i,j,k} \log P\left(y(l_i, r_k, l_j), g(l_i, r_k, l_j) \mid S\right)}{L \times K \times L} \tag{3}$$

where $g(l_i, r_k, l_j)$ represents the gold tag obtained from annotations. We match all triplets with the golden-label triplets to compute the topic score for each triplet. This topic-aware consistency filtering mechanism effectively mitigates error propagation by scoring candidate triples and eliminating those inconsistent with gold-standard semantics, ensuring robust performance even under semantic ambiguity.
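
Equation (3) reads as an average negative log-probability of the gold tag over the $L \times K \times L$ cells. The sketch below illustrates that reading under two assumptions that are not stated in the paper: the cells are flattened into a list, and each cell carries a small vector of tag scores. The function names are hypothetical.

```python
import math


def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def triple_loss(score_cells, gold_tags):
    """Average negative log-probability of the gold tag over all cells.

    `score_cells` is a flattened list of per-cell tag score vectors and
    `gold_tags` holds the index of the gold tag for each cell.
    """
    total = 0.0
    for scores, gold in zip(score_cells, gold_tags):
        total -= math.log(softmax(scores)[gold])
    return total / len(score_cells)


# Two cells with uniform scores over two tags: the loss equals log(2).
uniform_loss = triple_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])
# Confident, correct predictions give a much smaller loss.
confident_loss = triple_loss([[5.0, 0.0], [0.0, 5.0]], [0, 1])
```

Cells whose predicted tag distribution puts little mass on the gold tag contribute the most to the loss, which is consistent with using the gold-label match as a consistency signal.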

## 4 Experiment

### 4.1 Experimental Setup

#### Baseline

We compare SSDAU with seven commonly used data augmentation methods, including Word Substitution (WS) (Wei and Zou, [2019](https://arxiv.org/html/2605.23440#bib.bib28)), Back Translation (BT) (Xie et al., [2020](https://arxiv.org/html/2605.23440#bib.bib46)), Noise Introduction (NI) (Fanghua Ye, [2022](https://arxiv.org/html/2605.23440#bib.bib50)), Same-tag Semantic Noise (SSN) (Yan et al., [2019](https://arxiv.org/html/2605.23440#bib.bib33)), Generative Models (GM) (Hou et al., [2021](https://arxiv.org/html/2605.23440#bib.bib51)), Mixup (Hu et al., [2019](https://arxiv.org/html/2605.23440#bib.bib60)), and ChatIE (Wei et al., [2023](https://arxiv.org/html/2605.23440#bib.bib62)).

#### Dataset

We conduct our experiments on two representative English datasets, NYT and WebNLG. Each dataset has two variants: a fully annotated type (NYT, WebNLG) and a partially annotated type (NYT\*, WebNLG\*).

#### Protocol

We select five models for three different types of JERE tasks: Multi-module Multi-Step (PRGC (Zheng et al., [2021](https://arxiv.org/html/2605.23440#bib.bib14)), CasRel (Wei et al., [2020](https://arxiv.org/html/2605.23440#bib.bib15))), Multi-module One-Step (TPLinker (Wang et al., [2020](https://arxiv.org/html/2605.23440#bib.bib17)), SPN4RE (Sui et al., [2020](https://arxiv.org/html/2605.23440#bib.bib13))), and One-module One-Step (OneRel (Shang et al., [2022](https://arxiv.org/html/2605.23440#bib.bib16))). We use the following metrics to measure the effectiveness, performance, and adaptability of SSDAU: precision (Prec), F1-score (F1), and Intersection over Union (IoU).
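
For reference, the three metrics can be computed over sets of extracted triples as in the sketch below. The paper does not spell out how IoU is defined for triples, so the set-based definition here (intersection size over union size) is an assumption, and the example triples are made up.

```python
def triple_metrics(predicted, gold):
    """Precision, recall, F1, and set-based IoU over (head, relation, tail) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    union = predicted | gold
    iou = true_positives / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Hypothetical triples: two of four predictions match the gold set.
gold = [("a", "r1", "b"), ("c", "r2", "d"), ("e", "r1", "f"), ("g", "r3", "h")]
predicted = [("a", "r1", "b"), ("c", "r2", "d"), ("x", "r1", "y"), ("z", "r3", "w")]
metrics = triple_metrics(predicted, gold)
```

In this toy case precision, recall, and F1 are all 0.5, while IoU is 2/6 because the union contains six distinct triples.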

#### Implementation

We conducted all experiments on a single server equipped with an Intel Xeon Gold 6248 2\.50GHz CPU, two Tesla V100 SXM2 32GB GPUs, and Ubuntu 18\.04\.6 operating system\. We reused the pre\-trained BERT model \(base\-cased English\) from Huggingface\.

### 4.2 Result

#### Comparison with Baselines

Table 1: The number of augmented samples produced by SSDAU at various thresholds on different datasets.

Table [2](https://arxiv.org/html/2605.23440#S4.T2) presents the effectiveness (Prec), performance (F1), and adaptability (IoU) results of SSDAU and seven baselines for different JERE tasks. The results demonstrate that SSDAU consistently outperforms all baselines in terms of the effectiveness of data augmentation for various JERE tasks. In terms of performance, SSDAU achieves the best F1 scores and generates positive outcomes, unlike the seven baselines that negatively impact JERE models. Regarding adaptability, the results of IoU for augmented data indicate that our method performs better across different JERE models. Notably, our method maintains stable performance even in challenging scenarios. When tested with semantically ambiguous data, SSDAU exhibits minimal performance degradation (less than 2.1% F1 drop on NYT), while baselines like BT suffer significant deterioration (up to 22.5% drop).

In comparison to Back Translation (Xie et al., [2020](https://arxiv.org/html/2605.23440#bib.bib46)) and Generative Models (Hou et al., [2021](https://arxiv.org/html/2605.23440#bib.bib51)), maintaining the semantic structure of the text proves to be more effective than preserving semantic continuity. Contrasted with Noise Introduction (Fanghua Ye, [2022](https://arxiv.org/html/2605.23440#bib.bib50)) and Same-tag Semantic Noise (Yan et al., [2019](https://arxiv.org/html/2605.23440#bib.bib33)), the method that maps discrete text by tags exhibits superior performance to adding noise directly. In contrast to Word Substitution (Wei and Zou, [2019](https://arxiv.org/html/2605.23440#bib.bib28)) and Mixup (Cheng et al., [2020](https://arxiv.org/html/2605.23440#bib.bib45)), labeled discrete texts demonstrate superior properties in JERE data augmentation tasks compared to unlabeled samples. When compared to ChatIE (Wei et al., [2023](https://arxiv.org/html/2605.23440#bib.bib62)), SSDAU demonstrates substantially higher precision and F1 scores across all datasets. This highlights the superiority of our structured approach over LLM-based methods for precise extraction tasks. Based on these results, we conclude that SSDAU's approach of preserving structured semantics through contextualized embeddings and topic-aware consistency filtering is superior to existing data augmentation strategies, particularly in scenarios requiring high precision and effective disambiguation capabilities.

We evaluate method robustness using semantically ambiguous data constructed with SciER (Zhang et al., [2024](https://arxiv.org/html/2605.23440#bib.bib63)). Table [2](https://arxiv.org/html/2605.23440#S4.T2) compares SSDAU with seven baselines under normal conditions and with semantically ambiguous data. Baselines show significant performance degradation with ambiguous data, with BT's F1 score dropping by 52.20 percentage points on WebNLG (from 89.90% to 37.70%). In contrast, SSDAU maintains stable performance across all datasets, with a maximum F1 score reduction of only 1.80% on NYT\*. These results confirm that our topic-aware consistency filtering mechanism effectively mitigates error propagation, providing superior robustness against semantic ambiguity.

Table 2: Comparison of SSDAU with baselines under normal conditions and with semantically ambiguous data.

#### Performance on Different JERE Tasks

Table [3](https://arxiv.org/html/2605.23440#S4.T3) displays the effectiveness of SSDAU and baselines for various JERE models. The results indicate that the SSDAU-augmented dataset exhibits improvements across different types of JERE models, such as a 3.03% improvement in precision for the WebNLG\* dataset in SPN and a 0.94% improvement for the NYTg dataset in the TPLinker model. These outcomes demonstrate the feasibility of our approach for augmenting unstructured texts into structured semantic data for JERE tasks. Moreover, we observe that SSDAU performs better on partially annotated type datasets than on fully annotated type datasets. Notably, our method achieves about a 3% improvement with the NYT\* dataset of the CasRel model and the WebNLG\* dataset of the SPN model.

Table 3: The precision of different models under different datasets. Each cell (A/B) represents the performance of training with the original dataset (A) and the data augmented by SSDAU (B). Values in bold indicate the improvement.

### 4.3 Ablation Study of SSDAU

We conduct an ablation study to evaluate SSDAU’s important components\. Throughout this process, we maintain consistent settings across all components\.

#### Data Discretization and Reconstruction

We evaluate performance after removing the pre-processing component by directly splitting data based on triplet information without semantic tags, and by applying conventional no-split and full-split schemes (Gao et al., [2020](https://arxiv.org/html/2605.23440#bib.bib55)). As shown in Table [4](https://arxiv.org/html/2605.23440#S4.T4), we evaluate the effectiveness of the pre-processing components both before and after removal using precision as a metric. Our results demonstrate that the Data Discretization and Reconstruction component outperforms the no-pre-processing approach, with an improvement of approximately 2.02%–3.20%. Furthermore, we find that incorporating semantic tagging prompts positively impacts discrete text data augmentation in low-resource JERE tasks.

#### Structured Semantic Data Augmentation

We evaluate the augmentation component by measuring similarity between pre\-processed texts using exact matching, generating augmented data by substituting labels in composed discrete texts\. The augmented data is classified by triplet type, used for model training, and assessed after component removal\.

Table 4: Ablation study for SSDAU. "No Split" denotes not splitting the text. "No Label Split" denotes splitting by semantics without semantic tag. "Full Split" denotes complete splitting of the words in the text.

As shown in Table [4](https://arxiv.org/html/2605.23440#S4.T4), only the third group $(h, r, t)$ of augmented text shows a slight positive effect (0.38%) on JERE tasks, while the other four types negatively impact precision. Removing the augmentation component eliminates threshold restrictions, introducing low-quality data that reduces model precision. The component's absence disrupts text extraction and semantic structure preservation, causing significant performance degradation and highlighting the importance of semantically structured augmentation.

#### Scoring\-based Consistency Filtering

We assess the impact of the consistency filtering component in SSDAU. Table [4](https://arxiv.org/html/2605.23440#S4.T4) shows the precision of the JERE models with and without filtered data. The results demonstrate that the filtered data positively impacts the model's precision, whereas the precision decreases when low-quality augmented data are not removed. This highlights the importance of consistency filtering in maintaining the model's precision.

#### Parameter Initialization

We investigate the impact of parameter initialization in our model by comparing three initialization methods: random initialization, zero initialization, and pretrained initialization (used in SSDAU). As shown in Table [6](https://arxiv.org/html/2605.23440#S4.T6), the pretrained initialization method consistently outperforms others across all four datasets. Significance tests (t-test) confirm these improvements are statistically significant ($p=0.012$ on NYT, $p=0.009$ on WebNLG, $p=0.016$ on NYT\*, $p=0.008$ on WebNLG\*). These results validate our design choice of using HuggingFace's default pretrained parameters for the BERTTopic model.

Table 5: Semantic consistency verification of augmented text. $\nu$ is the syntactic coherence.

### 4.4 Analysis

![Refer to caption](https://arxiv.org/html/2605.23440v1/x3.png)

(a) Partially annotated data

![Refer to caption](https://arxiv.org/html/2605.23440v1/x4.png)

(b) Exactly annotated data

Figure 3: Comparison of the number of triplets per text after SSDAU augmentation with the number in the original data, for different types of datasets.

#### Semantic Coherence Analysis

During the semantic coherence analysis of SSDAU, we follow a two-step process to ensure semantic consistency in the augmented text. First, we augment all texts by considering similarities between annotations of the same type and entity text, while preserving the semantic annotations (e.g., “location contains location”). Next, we use Biber Tagger (A. Bergman, [2022](https://arxiv.org/html/2605.23440#bib.bib59)) to match triplet texts with identical tags. The high degree of syntactic agreement between Text1 and Text2 is demonstrated in Table [5](https://arxiv.org/html/2605.23440#S4.T5). We filter out texts with low relevance (below 0.8) and incorporate the remaining data into the training set as augmented data, ensuring the semantic consistency of the augmented text.

Table 6: Ablation study on parameter initialization across four datasets.

#### Training Cost and Convergence

Figure [3](https://arxiv.org/html/2605.23440#S4.F3) provides details about the original and augmented texts containing varying numbers of triplets. We focus specifically on scenarios where an entity appears in multiple triplet relations and categorize the texts based on the number of triplets to evaluate the effectiveness of SSDAU for such texts. By classifying the augmented data according to triplet counts and incorporating it into the training set, we assess the performance of different JERE models using the same test set. The results demonstrate the effectiveness of SSDAU for texts with different triplet counts. Our method proves valuable across texts with varying numbers of triplets, showing that as the number of triplets in the training set decreases, the availability of augmented data increases, leading to improved model precision.

Table 7: Some augmented examples selected by SSDAU. Black denotes original examples. Text chunks in Red are the discrete text. Text chunks in Blue are the precondition for text segmentation and augmentation. $\varepsilon_1$ is the entity similarity threshold and $\varepsilon_2$ is the relation similarity threshold.

### 4\.5 Case Study

Table [7](https://arxiv.org/html/2605.23440#S4.T7) presents three cases of SSDAU applied to JERE tasks\. In the first case, we replace the head entity “Mitch Mustain” with “Amy Grant” while keeping the semantic label and the remaining text intact\. In the second case, we substitute the tail entity “Arkansas” with “Nashville” while keeping the original semantic label and the remaining text\. In the third case, we modify all the text except for the entity and change the semantic label from “people\|\|people\|\|place\_lived” to “people\|\|people\|\|location\.” Our data augmentation approach can expand texts without introducing additional noise, resulting in natural and diverse augmentations\. Compared to existing methods, SSDAU’s augmented data resolves diversity and quality issues more effectively\.
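
The first two cases can be illustrated with a minimal sketch of entity substitution that leaves the relation label and the surrounding text untouched\. The sample schema, the `replace_entity` helper, and the example sentence are hypothetical; in SSDAU the replacement entity is chosen by similarity, whereas here it is simply passed in by hand\.

```python
def replace_entity(sample, role, new_entity):
    """Return a copy of `sample` with the head or tail entity swapped.

    `role` is "head" or "tail". The relation label and the rest of the
    text are left unchanged.
    """
    if role not in ("head", "tail"):
        raise ValueError("role must be 'head' or 'tail'")
    old_entity = sample[role]
    if old_entity not in sample["text"]:
        raise ValueError(f"{old_entity!r} not found in text")
    new_sample = dict(sample)
    # Replace only the first mention to avoid touching unrelated spans.
    new_sample["text"] = sample["text"].replace(old_entity, new_entity, 1)
    new_sample[role] = new_entity
    return new_sample


# Invented sentence; the entity names and label follow the case study.
original = {
    "text": "Mitch Mustain grew up in Arkansas.",
    "head": "Mitch Mustain",
    "relation": "people||people||place_lived",
    "tail": "Arkansas",
}

case1 = replace_entity(original, "head", "Amy Grant")
case2 = replace_entity(original, "tail", "Nashville")
```

Both results keep the label “people\|\|people\|\|place\_lived”, and the original sample is not modified\.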

## 5 Conclusion

We propose SSDAU, a data augmentation paradigm designed to perform instance augmentation for low\-resource JERE tasks by labeling the semantic segmentation of entity texts and assessing similarity within neighboring semantic regions\. Our approach integrates contextualized embeddings with traditional similarity scores to effectively distinguish semantically similar but distinct entities, while employing topic\-aware consistency filtering with pretrained initialization to mitigate error propagation\. Compared to traditional methods, SSDAU effectively addresses the challenge of data scarcity in low\-resource scenarios and mitigates issues such as reduced textual relevance and overlapping relations\. These findings suggest that preserving the semantic structure of texts through structured semantic tags can be a promising approach for text data augmentation\.

## 6 Limitation

Although the proposed SSDAU outperforms all baseline methods, it still has some limitations\. First, while we alleviate the need for high\-quality data in SSDAU by filtering low\-quality data, incorporating more high\-quality data may further improve SSDAU’s performance\. Second, we improve the similarities used for structured semantic matching of long texts through pre\-processing; the efficiency of our approach could be enhanced by a more efficient semantic text\-matching component\. In future work, it would be interesting to validate our approach in real time using newly acquired high\-quality data and to explore semantic text\-matching components that deliver superior results for long texts\.

## References

- \[1\] \(2022\-05\) Towards responsible natural language annotation for the varieties of arabic\. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, pp\. 364–371\. External Links: [Link](https://aclanthology.org/2022.findings-acl.31), [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.31) Cited by: [§4\.4](https://arxiv.org/html/2605.23440#S4.SS4.SSS0.Px1.p1.1)\.
- \[2\] I\. Abdelaziz, S\. Ravishankar, P\. Kapanipathi, S\. Roukos, and A\. Gray \(2021\) A semantic parsing and reasoning\-based approach to knowledge base question answering\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 35, pp\. 15985–15987\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/17988) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p1.1)\.
- \[3\] T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, et al\. \(2020\) Language models are few\-shot learners\. Advances in neural information processing systems 33, pp\. 1877–1901\. External Links: [Link](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px2.p1.1)\.
- \[4\] D\. Cashman, S\. Xu, S\. Das, F\. Heimerl, C\. Liu, S\. R\. Humayoun, M\. Gleicher, A\. Endert, and R\. Chang \(2020\) Cava: a visual analytics system for exploratory columnar data augmentation using knowledge graphs\. IEEE Transactions on Visualization and Computer Graphics 27\(2\), pp\. 1731–1741\. External Links: [Link](https://ieeexplore.ieee.org/abstract/document/9222249) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[5\] Y\. Cheng, L\. Jiang, W\. Macherey, and J\. Eisenstein \(2020\) AdvAug: robust adversarial augmentation for neural machine translation\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp\. 5961–5970\. External Links: [Link](https://arxiv.org/abs/2006.11834) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[6\] E\. Y\. Fanghua Ye \(2022\-05\) ASSIST: towards label noise\-robust dialogue state tracking\. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, pp\. 2719–2731\. External Links: [Link](https://aclanthology.org/2022.findings-acl.214), [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.214) Cited by: [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[7\] T\. Fu, P\. Li, and W\. Ma \(2019\-07\) Graphrel: modeling text as relational graphs for joint entity and relation extraction\. In Proceedings of the 57th annual meeting of the association for computational linguistics, Florence, Italy, pp\. 1409–1418\. External Links: [Link](https://aclanthology.org/P19-1136), [Document](https://dx.doi.org/10.18653/v1/P19-1136) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\] P\. Gao, Q\. Wan, and L\. Shen \(2020\) Split and merge: component based segmentation network for text detection\. In International Conference on Pattern Recognition and Artificial Intelligence, pp\. 14–27\. External Links: [Link](https://link.springer.com/chapter/10.1007/978-3-030-59830-3_2) Cited by: [§4\.3](https://arxiv.org/html/2605.23440#S4.SS3.SSS0.Px1.p1.1)\.
- \[9\] T\. Gao, X\. Yao, and D\. Chen \(2021\) SimCSE: simple contrastive learning of sentence embeddings\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 6894–6910\. External Links: [Link](https://arxiv.org/abs/2104.08821) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px2.p1.1)\.
- \[10\] Y\. Hou, S\. Chen, W\. Che, C\. Chen, and T\. Liu \(2021\) C2c\-genda: cluster\-to\-cluster generation for data augmentation of slot filling\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 35, pp\. 13027–13035\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/17540) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[11\] Y\. Hou, Y\. Liu, W\. Che, and T\. Liu \(2018\) Sequence\-to\-sequence data augmentation for dialogue language understanding\. In Proceedings of the 27th International Conference on Computational Linguistics, pp\. 1234–1245\. External Links: [Link](https://arxiv.org/abs/1807.01554) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[12\] Z\. Hu, B\. Tan, R\. Salakhutdinov, T\. Mitchell, and E\. P\. Xing \(2019\) Learning data manipulation for augmentation and weighting\. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp\. 15764–15775\. External Links: [Link](https://proceedings.neurips.cc/paper/2019/hash/671f0311e2754fcdd37f70a8550379bc-Abstract.html) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1)\.
- \[13\] X\. Jiao, Y\. Yin, L\. Shang, X\. Jiang, X\. Chen, L\. Li, F\. Wang, and Q\. Liu \(2020\-11\) TinyBERT: distilling bert for natural language understanding\. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp\. 4163–4174\. External Links: [Link](https://aclanthology.org/2020.findings-emnlp.372), [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.372) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[14\] N\. Kambhatla, L\. Born, and A\. Sarkar \(2022\-05\) CipherDAug: ciphertext based data augmentation for neural machine translation\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Dublin, Ireland, pp\. 201–218\. External Links: [Link](https://aclanthology.org/2022.acl-long.17), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.17) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p2.1)\.
- \[15\] J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of NAACL\-HLT, pp\. 4171–4186\. External Links: [Link](https://openreview.net/forum?id=SkZmKmWOWH) Cited by: [§3\.4](https://arxiv.org/html/2605.23440#S3.SS4.p2.1)\.
- \[16\] Y\. Lin, H\. Ji, F\. Huang, and L\. Wu \(2020\-07\) A joint neural model for information extraction with global features\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp\. 7999–8009\. External Links: [Link](https://aclanthology.org/2020.acl-main.713), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.713) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p1.1)\.
- \[17\] S\. Liu, K\. Lee, and I\. Lee \(2020\) Document\-level multi\-topic sentiment classification of email data with bilstm and data augmentation\. Knowledge\-Based Systems 197, pp\. 105918\. External Links: [Link](https://www.sciencedirect.com/science/article/pii/S0950705120302574) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p2.1), [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[18\] J\. Min, R\. T\. McCoy, D\. Das, E\. Pitler, and T\. Linzen \(2020\-07\) Syntactic data augmentation increases robustness to inference heuristics\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp\. 2339–2352\. External Links: [Link](https://aclanthology.org/2020.acl-main.212), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.212) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[19\] M\. Miwa and M\. Bansal \(2016\-10\) End\-to\-end relation extraction using lstms on sequences and tree structures\. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Seoul, South Korea, pp\. 1105–1116\. External Links: [Link](https://aclanthology.org/Y16-3002) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\] J\. Mueller and A\. Thyagarajan \(2016\) Siamese recurrent architectures for learning sentence similarity\. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp\. 2786–2792\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/10350) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[21\] X\. Ren, Z\. Wu, W\. He, M\. Qu, C\. R\. Voss, H\. Ji, T\. F\. Abdelzaher, and J\. Han \(2017\) Cotype: joint extraction of typed entities and relations with knowledge bases\. In Proceedings of the 26th International Conference on World Wide Web, pp\. 1015–1024\. External Links: [Link](https://dl.acm.org/doi/abs/10.1145/3038912.3052708) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1)\.
- \[22\] Y\. Shang, H\. Huang, and X\. Mao \(2022\) Onerel: joint entity and relation extraction with one module in one step\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 36, pp\. 11285–11293\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/21379) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px3.p1.1), [Table 3](https://arxiv.org/html/2605.23440#S4.T3.2.6.4.1)\.
- \[23\] D\. Sui, Y\. Chen, K\. Liu, J\. Zhao, X\. Zeng, and S\. Liu \(2020\) Joint entity and relation extraction with set prediction networks\. arXiv preprint arXiv:2011\.01675\. External Links: [Link](https://arxiv.org/abs/2011.01675) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px3.p1.1), [Table 3](https://arxiv.org/html/2605.23440#S4.T3.2.3.1.1)\.
- \[24\] W\. Y\. Wang and D\. Yang \(2015\) That’s so annoying\!\!\!: a lexical and frame\-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using\# petpeeve tweets\. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp\. 2557–2563\. External Links: [Link](https://openreview.net/forum?id=r1NRyfzuWB) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\] Y\. Wang, B\. Yu, Y\. Zhang, T\. Liu, H\. Zhu, and L\. Sun \(2020\-12\) TPLinker: single\-stage joint extraction of entities and relations through token pair linking\. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain \(Online\), pp\. 1572–1582\. External Links: [Link](https://aclanthology.org/2020.coling-main.138), [Document](https://dx.doi.org/10.18653/v1/2020.coling-main.138) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px3.p1.1), [Table 3](https://arxiv.org/html/2605.23440#S4.T3.2.7.5.1)\.
- \[26\] J\. Wei and K\. Zou \(2019\-11\) EDA: easy data augmentation techniques for boosting performance on text classification tasks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), Hong Kong, China, pp\. 6382–6388\. External Links: [Link](https://aclanthology.org/D19-1670), [Document](https://dx.doi.org/10.18653/v1/D19-1670) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[27\] X\. Wei, X\. Cui, N\. Cheng, X\. Wang, X\. Zhang, S\. Huang, P\. Xie, J\. Xu, Y\. Chen, M\. Zhang, et al\. \(2023\) Chatie: zero\-shot information extraction via chatting with chatgpt\. arXiv preprint arXiv:2302\.10205\. Cited by: [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[28\] Z\. Wei, J\. Su, Y\. Wang, Y\. Tian, and Y\. Chang \(2020\-07\) A novel cascade binary tagging framework for relational triple extraction\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp\. 1476–1488\. External Links: [Link](https://aclanthology.org/2020.acl-main.136), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.136) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px3.p1.1), [Table 3](https://arxiv.org/html/2605.23440#S4.T3.2.5.3.1)\.
- \[29\] X\. Wu, Z\. Wu, Y\. Lu, L\. Ju, and S\. Wang \(2022\) Style mixing and patchwise prototypical matching for one\-shot unsupervised domain adaptive semantic segmentation\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 36, pp\. 2740–2749\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/20177) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px2.p1.1)\.
- \[30\] Q\. Xie, Z\. Dai, E\. Hovy, M\. Luong, and Q\. V\. Le \(2020\) Unsupervised data augmentation for consistency training\. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp\. 6256–6268\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/20177) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[31\] G\. Yan, Y\. Li, S\. Zhang, and Z\. Chen \(2019\) Data augmentation for deep learning of judgment documents\. In International Conference on Intelligent Science and Big Data Engineering, pp\. 232–242\. External Links: [Link](https://link.springer.com/chapter/10.1007/978-3-030-36204-1_19) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p2.1)\.
- \[32\] N\. A\. Zhang Bingyu \(2022\-05\) The document vectors using cosine similarity revisited\. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, Dublin, Ireland, pp\. 129–133\. External Links: [Link](https://aclanthology.org/2022.insights-1.17), [Document](https://dx.doi.org/10.18653/v1/2022.insights-1.17) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px2.p1.1)\.
- \[33\] Q\. Zhang, Z\. Chen, H\. Pan, C\. Caragea, L\. J\. Latecki, and E\. Dragut \(2024\) SciER: an entity and relation extraction dataset for datasets, methods, and tasks in scientific documents\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 13083–13100\. Cited by: [§4\.2](https://arxiv.org/html/2605.23440#S4.SS2.SSS0.Px1.p3.1)\.
- \[34\] X\. Zhang, J\. Zhao, and Y\. Lecun \(2015\) Character\-level convolutional networks for text classification\. Advances in Neural Information Processing Systems 2015, pp\. 649–657\. External Links: [Link](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[35\] Y\. Zhang, T\. Ge, and X\. Sun \(2020\-07\) Parallel data augmentation for formality style transfer\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp\. 3221–3228\. External Links: [Link](https://aclanthology.org/2020.acl-main.294), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.294) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px3.p1.1)\.
- \[36\] T\. Zhao, Z\. Yan, Y\. Cao, and Z\. Li \(2021\) A unified multi\-task learning framework for joint extraction of entities and relations\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 35, pp\. 14524–14531\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/17707) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1)\.
- \[37\] H\. Zheng, R\. Wen, X\. Chen, Y\. Yang, Y\. Zhang, Z\. Zhang, N\. Zhang, B\. Qin, X\. Ming, and Y\. Zheng \(2021\-08\) PRGC: potential relation and global correspondence based joint relational triple extraction\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), Online, pp\. 6225–6235\. External Links: [Link](https://aclanthology.org/2021.acl-long.486), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.486) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.23440#S4.SS1.SSS0.Px3.p1.1), [Table 3](https://arxiv.org/html/2605.23440#S4.T3.2.4.2.1)\.
- \[38\] S\. Zheng, F\. Wang, H\. Bao, Y\. Hao, P\. Zhou, and B\. Xu \(2017\-07\) Joint extraction of entities and relations based on a novel tagging scheme\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Vancouver, Canada, pp\. 1227–1236\. External Links: [Link](https://aclanthology.org/P17-1113), [Document](https://dx.doi.org/10.18653/v1/P17-1113) Cited by: [§2](https://arxiv.org/html/2605.23440#S2.SS0.SSS0.Px1.p1.1)\.
- \[39\] M\. Zhong, P\. Liu, Y\. Chen, D\. Wang, X\. Qiu, and X\. Huang \(2020\-07\) Extractive summarization as text matching\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp\. 6197–6208\. External Links: [Link](https://aclanthology.org/2020.acl-main.552), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.552) Cited by: [§1](https://arxiv.org/html/2605.23440#S1.p1.1)\.
