TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

arXiv cs.CL Papers

Summary

TRIDENT is a framework and dataset synthesis pipeline for enhancing LLM safety through tri-dimensional red-teaming data that covers lexical diversity, malicious intent, and jailbreak tactics. Fine-tuning Llama-3.1-8B on TRIDENT-Edge achieves an average 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to the best-performing baseline model, which is fine-tuned on the WildBreak dataset.

arXiv:2505.24672v2

# TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

Source: https://arxiv.org/html/2505.24672

Xiaorui Wu1, Xiaofeng Mao2, Fei Li1*, Xin Zhang3, Xuanhong Li1, Chong Teng1, Donghong Ji1*, Zhuang Li4†

1Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
2Ant Group
3Ant International
4School of Computing Technologies, Royal Melbourne Institute of Technology, Australia

1{wuxiaorui, lifei_csnlp, lixuanhong, tengchong, dhji}@whu.edu.cn

## Abstract

Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: **Lexical Diversity**, **Malicious Intent**, and **Jailbreak Tactics**. We further introduce **TRIDENT**, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: **TRIDENT-Core**, comprising 26,311 examples, and **TRIDENT-Edge**, with 18,773 examples. Fine-tuning **Meta-Llama-3.1-8B** on **TRIDENT-Edge** demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the **WildBreak** dataset. Our datasets are available at https://github.com/FishT0ucher/TRIDENT.

**Disclaimer:** The paper contains content that may be profane, vulgar, or offensive.

## 1 Introduction

![Figure 1: Instruction classification in six baseline red-teaming datasets and TRIDENT-Core using Llama-Guard-3-8B reveals a heavily skewed distribution, with most instructions concentrated in domains like Violent Crimes, Non-Violent Crimes and Sexual Content.]

Large Language Models (LLMs) have led to remarkable advances in natural language processing (NLP), contributing to progress in fields such as economics, society, and culture. However, their widespread deployment poses significant risks. Trained on extensive unsupervised corpora, LLMs may generate outputs that reflect biases, discrimination, or values misaligned with societal norms. Moreover, they can be exploited for malicious ends, such as crafting phishing messages (Shibli et al. 2024) or enabling cyberattacks (Mahmoodi and Jameii 2024), which underscores the urgent need to address these safety issues.

Red-teaming is a widely used strategy for uncovering vulnerabilities in LLMs by generating a diverse range of malicious instructions, either automatically using LLMs or manually by experts. These malicious instructions, when paired with carefully crafted, norm-adherent responses, form specialized datasets that support safety alignment efforts, particularly methods such as Supervised Fine-Tuning (SFT). Fine-tuning LLMs on alignment datasets helps reduce the likelihood of harmful outputs, ensuring safer and more reliable model behavior (Ganguli et al. 2022).

A key challenge in this process is achieving comprehensive coverage of potential safety risks, which requires diverse red-teaming instructions. Current data curation methods often focus on lexical diversity, enriching vocabulary (Chan et al. 2024), but neglect other critical dimensions. As shown in Figure 1, even lexically varied datasets are imbalanced across domains of malicious user intent, with certain types dominating while others are underrepresented. Such imbalances limit LLMs' ability to acquire comprehensive safety knowledge. We also find that most existing datasets do not consider jailbreak tactics, so LLMs fine-tuned on them handle jailbreak attacks poorly.

![Figure 2: Illustration of our data generation pipeline for building TRIDENT]

To address this limitation, we identify three essential dimensions of risk-related diversity:

- **Lexical Diversity** enriches the vocabulary and linguistic complexity of instructions, improving model robustness.
- **Malicious Intent Diversity** ensures a balanced coverage of multiple harmful intent categories (e.g., violence, defamation) within user instructions, broadening the model's exposure to diverse harmful scenarios.
- **Jailbreak Tactic Diversity** incorporates various adversarial techniques, enhancing the model's resilience against manipulative jailbreak attacks.

Measuring these dimensions provides a framework to quantify risk coverage, guiding more effective dataset curation to enhance LLM safety.
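As a rough illustration of what quantifying coverage can look like, the sketch below computes two simple statistics: a distinct-n ratio as a proxy for lexical diversity, and a normalized Shannon entropy over intent labels as a proxy for how evenly malicious intent categories are covered. Both metrics and the function names `distinct_n` and `normalized_entropy` are assumptions chosen for illustration; this section does not specify which measures the framework uses.

```python
import math
from collections import Counter


def distinct_n(texts, n=2):
    """Fraction of n-grams that are unique across all texts (higher = more varied).

    Illustrative proxy for lexical diversity, not the paper's metric.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def normalized_entropy(labels, num_categories):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    1.0 means examples are spread evenly over all categories;
    0.0 means every example falls into a single category.
    Illustrative proxy for intent coverage, not the paper's metric.
    """
    if not labels or num_categories < 2:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_categories)
```

With labels produced by a classifier such as Llama-Guard-3-8B, a skewed distribution like the one in Figure 1 would yield a normalized entropy well below 1.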

Based on these dimensions, we introduce **TRIDENT**, an innovative automated data generation pipeline that minimizes human intervention. TRIDENT employs a zero-shot approach using a chat-LLM to generate diverse personas and attributes, which then guide instruction generation. Through persona-based role-playing, the LLM ensures both lexical and malicious intent diversity (Shah et al. 2023), while integrated jailbreak tactics further expand risk coverage. Each harmful instruction is then paired with a benign, ethically aligned response generated by a safety-focused LLM, such as GPT-4o-mini.

This process produces two comprehensive datasets: **TRIDENT-Core**, which contains 26,311 examples focused on lexical and malicious intent diversity, and **TRIDENT-Edge** (examples in Table), which incorporates jailbreak tactic diversity into the examples in TRIDENT-Core, resulting in 18,773 examples. Our evaluation shows that **Meta-Llama-3.1-8B** fine-tuned on **TRIDENT-Edge** significantly outperforms the same model fine-tuned on current state-of-the-art baseline datasets (AttaQ (Kour et al. 2023), AART (Radharapu et al. 2023), HH_RLHF (Ganguli et al. 2022), Safe_RLHF (Ji et al. 2024a), WildBreak (Jiang et al. 2024b), and WildChat (Zhao et al. 2024)) across seven benchmarks, reducing the Harm Score (HS) by 13.89% and Attack Success Rate (ASR) by 20%. Additionally, our ablation studies reveal that each dimension of diversity substantially contributes to improving LLM safety.

Overall, our contributions are as follows:

i) We introduce a systematic framework to analyze the risk coverage of red-teaming datasets across three fundamental diversity dimensions: lexical, malicious intent, and jailbreak tactic.

ii) We present TRIDENT, an automated and scalable pipeline that efficiently generates diverse instruction-response pairs, yielding TRIDENT-Core and TRIDENT-Edge datasets.

iii) Through extensive experiments, we demonstrate that our diversity-enhanced datasets substantially improve both LLM safety and helpfulness across multiple benchmarks, with ablation studies highlighting the distinct contributions of each diversity dimension.

## 2 TRIDENT Data Generation Pipeline

To overcome the limitations of existing red-teaming datasets, we introduce **TRIDENT**, an automated data curation pipeline designed to systematically enhance three key dimensions of diversity: **Lexical Diversity**, **Malicious Intent Diversity**, and **Jailbreak Tactic Diversity**. These dimensions address critical gaps in current datasets by broadening linguistic variation, expanding the coverage of malicious intents, and fortifying models against adversarial tactics. Figure 2 illustrates the pipeline, which progresses from defining high-level intent domains to generating diverse, malicious instructions and norm-adherent responses.

#### Defining Intent Domains

The starting point of TRIDENT is the definition of **Intent Domains**, which comprise 14 categories of malicious user intent, such as violent crimes, defamation, and sex-related crimes. These domains are adopted from the hazard categories defined by Llama-Guard-3-8B (Inan et al. 2023) and MLCommons (https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/), chosen for their well-established standards and broad coverage of potential threats. This framework addresses both common risks, such as defamation and violent crimes, and specialized threats, such as code interpreter abuse. By leveraging Llama-Guard-3-8B's systematic classification, TRIDENT ensures accuracy, scalability, and comprehensive coverage in categorizing malicious intents, providing a strong foundation for subsequent steps in the pipeline.

#### Scenario Generation

As shown in Figure 2, we generate domain-specific scenarios in a zero-shot setting using an uncensored Llama-3.1-8B-Instruct model (https://huggingface.co/aifeifei798/DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored). For instance, in the "Code Interpreter Abuse" domain, it might create a scenario involving the misuse of programming tools to embed hidden malicious code. By grounding abstract intent domains in realistic scenarios, this method effectively supports subsequent persona creation and instruction generation.

#### Persona Generation

We use a two-step approach with a single LLM to generate diverse personas and their attributes from scenarios. A **persona** captures an individual's role, behavior, and goals within a scenario context, while **attributes** define more specific persona details like occupation, personality traits, and experiences.

**Step 1: Scenario-to-Persona Generation.** The same LLM from scenario generation infers contextually appropriate personas and their defining attributes from each scenario. This ensures personas exhibit realistic motivations and behaviors grounded in plausible situations. For instance, given a scenario in the "Code Interpreter Abuse" domain, the model might generate a persona of a "charismatic hacker who exploits technical expertise to manipulate others," with attributes including "occupation: cybercriminal," "personality: manipulative and ambitious," and "life experiences: influenced by unethical tech leaders."

**Step 2: Persona-to-Persona Expansion.** We further diversify our persona set by prompting the LLM to generate related personas by exploring interpersonal connections and shared attributes. For example, the model might expand the hacker persona to include a "brilliant but reclusive developer who creates technical tools for phishing campaigns." Guided by the Six Degrees of Separation theory (Travers and Milgram 1977), this approach allows us to expand from the intent domains defined by Llama-Guard into undefined domains by generating a sufficient number of related personas.
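The two persona steps can be pictured as a breadth-first expansion over a graph of related personas. The sketch below is a minimal illustration under stated assumptions: the `Persona` record, the `expand_personas` function, and the `max_hops` and `limit` parameters are hypothetical names, and `generate_related` is a placeholder for the LLM prompt that proposes related personas. The actual prompts and stopping criteria are not given in this section.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    description: str
    attributes: tuple = ()  # e.g. (("occupation", "..."), ("personality", "..."))


def expand_personas(seeds, generate_related, max_hops=2, limit=100):
    """Breadth-first expansion from seed personas.

    generate_related is a hypothetical callable standing in for the LLM call;
    it takes a Persona and returns an iterable of related Personas.
    max_hops bounds how far the expansion moves away from the seed personas.
    """
    seen = {p.description for p in seeds}
    result = list(seeds)
    queue = deque((p, 0) for p in seeds)
    while queue and len(result) < limit:
        persona, hops = queue.popleft()
        if hops >= max_hops:
            continue  # do not expand personas at the maximum distance
        for related in generate_related(persona):
            if related.description in seen:
                continue  # skip personas that were already generated
            seen.add(related.description)
            result.append(related)
            queue.append((related, hops + 1))
            if len(result) >= limit:
                break
    return result
```

Deduplicating on the description keeps the expansion from cycling back to personas it has already produced, while the hop bound reflects the idea that a few degrees of separation are enough to reach personas outside the original intent domains.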

#### Instruction Generation

Our pipeline generates harmful instructions through two key steps: i) transforming prepared personas and attributes into instructions to enhance **Lexical** and **Malicious Intent Diversity**, and ii) improving **Jailbreak Tactic Diversity**. These steps together ensure comprehensive coverage of risks in the instructions.

**Step 1: Enhancing Lexical and Malicious Intent Diversity.** We employ a role-playing approach where the LLM acts as previously generated personas to create diverse instructions. Each persona's unique characteristics naturally influence the language and style of generated content, contributing to lexical diversity. For instance, when adopting the role of a "cunning politician," the LLM generates formally worded content, while as a "cybercriminal," it produces technically sophisticated malicious instructions. Additionally, Persona-to-Persona Expansion extends coverage beyond the intent domains defined by Llama-Guard into domains it does not define, enhancing the diversity of malicious intent.

**Step 2: Incorporating Jailbreak Tactics.** To improve the dataset's adversarial robustness, we apply six advanced jailbreak methods, each including a multitude of jailbreak tactics, to transform base instructions into six varied forms. One of these transformed instructions, selected at random, replaces the original if it successfully bypasses Meta-Llama-3.1-8B's defenses. The methods are:

- **Cipher Encoding** (Yuan et al. 2024b) encrypts instructions in code-like formats, requiring decryption to reveal the harmful intent.
- **Code Injection** (Kang et al. 2023) embeds harmful instructions within benign-appearing code snippets.
- **Low-Resource Translation** (Deng et al. 2024) converts instructions into less common languages while maintaining their malicious intent.
- **Past Tense Rewriting** (Andriushchenko and Flammarion 2024) modifies the temporal context of instructions.
- **Persona Modulation** (Shah et al. 2023) adapts instructions to match specific persona styles.
- **RENELLM Techniques** (Ding et al. 2024) apply multiple transformations, including paraphrasing, structure alteration, and strategic misspellings.
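The selection logic in Step 2 can be sketched as follows. This is one reading of the description above, in which the random choice is made among the transformed variants that bypass the target model, and the original instruction is kept when none do. The function `select_variant`, the `transforms` mapping, and the `bypasses` predicate are hypothetical names; the transforms in the usage example are trivial string placeholders, not implementations of the six methods.

```python
import random


def select_variant(instruction, transforms, bypasses, rng=None):
    """Return (method_name, text) for one variant that bypasses the target model.

    transforms: mapping of method name -> callable(str) -> str (placeholders).
    bypasses:   hypothetical callable(str) -> bool that would wrap the target
                model plus a judge of whether its response is unsafe.
    If no variant bypasses the model, returns (None, instruction) unchanged.
    """
    rng = rng or random.Random()
    variants = {name: fn(instruction) for name, fn in transforms.items()}
    successful = [(name, text) for name, text in variants.items() if bypasses(text)]
    if not successful:
        return None, instruction
    return rng.choice(successful)


# Usage with toy placeholders in place of real transformation methods.
toy_transforms = {"upper": str.upper, "reverse": lambda s: s[::-1]}
method, text = select_variant("abc", toy_transforms, bypasses=lambda t: t.isupper())
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible across runs.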

**TRIDENT-Core and TRIDENT-Edge.** TRIDENT-Core consists of instructions generated with an emphasis on Lexical Diversity and Malicious Intent Diversity; it is intended as a base that other researchers can extend with more advanced jailbreak methods. TRIDENT-Edge extends this foundation by incorporating the jailbreak tactics, adding the third dimension of diversity and strengthening the dataset's defense against adversarial jailbreak attacks.

#### Instruction Filtering

TRIDENT employs a two-stage filtering process to ensure dataset quality and diversity. First, Llama-Guard-3-8B identifies and retains only instructions classified as 'unsafe,' filtering out benign ones. Second, the process iterates through the instruction set, calculating pairwise BLEU similarity scores (Papineni et al. 2002) between each new instruction and existing entries.
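A minimal sketch of the two-stage filter is shown below, under several assumptions. The `is_unsafe` callable is a stand-in for the Llama-Guard-3-8B classification, `bleu_like` is a simplified BLEU (clipped n-gram precisions up to bigrams with a brevity penalty) rather than a reference implementation, and the rule of discarding an instruction whose similarity to any kept entry exceeds `max_similarity` is an assumed use of the scores. The threshold value of 0.5 is purely illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty. No smoothing is applied,
    so any n-gram order with zero overlap gives a score of 0.0."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def filter_instructions(instructions, is_unsafe, max_similarity=0.5):
    """Stage 1: keep only instructions flagged unsafe by the classifier.
    Stage 2: drop an instruction if it is too similar to one already kept."""
    kept = []
    for text in instructions:
        if not is_unsafe(text):
            continue
        if any(bleu_like(text, prev) > max_similarity for prev in kept):
            continue
        kept.append(text)
    return kept
```

Because each new instruction is compared against every kept entry, the second stage is quadratic in the number of retained instructions, which is worth keeping in mind at the scale of tens of thousands of examples.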
