RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models


Summary

RedBench introduces a universal dataset aggregating 37 benchmark datasets with 29,362 samples across 22 risk categories and 19 domains to enable standardized and comprehensive red teaming evaluation of large language models. The work addresses inconsistencies in existing red teaming datasets and provides baselines, evaluation code, and open-source resources for assessing LLM robustness against adversarial prompts.

arXiv:2601.03699v2

# RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Source: https://arxiv.org/html/2601.03699

Quy-Anh Dang, Chris Ngo
Knovel Engineering Lab
Knovel Engineering Singapore
{quyanh.dang, chris.ngo}@knoveleng.com

Correspondence: [email protected]

This work was also completed while I was pursuing my master's degree at VNU University of Science.

Truong-Son Hy
Department of Computer Science
University of Alabama at Birmingham
USA
[email protected]

###### Abstract

As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment.

Code: https://github.com/knoveleng/redeval

Warning: This paper contains examples that may be offensive, harmful, or biased.

## 1 Introduction

Large language models (LLMs) have transformed the landscape of natural language processing, delivering remarkable performance in diverse applications, including multilingual translation (Team et al., 2022), medical diagnostics (Thirunavukarasu et al., 2023; Li et al., 2023), tool-augmented reasoning (Schick et al., 2023; Bubeck et al., 2023), and conversational assistance (Achiam et al., 2023; Touvron et al., 2023; Anil et al., 2023). As LLMs increasingly support safety-critical domains such as healthcare (Singhal et al., 2022) and legal advisory systems (Maddela et al., 2023), their robustness against adversarial inputs has become a critical concern. Adversarial prompts, inputs crafted to exploit model vulnerabilities and elicit unsafe, biased, or erroneous responses, pose significant risks to the reliability and trustworthiness of LLMs (Perez et al., 2022c; Wei et al., 2023; Zou et al., 2023). Addressing these vulnerabilities is both a technical necessity and an ethical imperative for the safe and responsible deployment of LLMs in real-world settings (Anwar et al., 2024; Hendrycks et al., 2022).

Despite the growing availability of datasets for probing LLM vulnerabilities, such as AdvBench (Zou et al., 2023), HarmBench (Mazeika et al., 2024), and Do-Not-Answer (Wang et al., 2024b), significant challenges remain. Existing datasets often adopt inconsistent definitions of risk categories, vary in scope and format, and lack comprehensive coverage of specific domains or use cases. For example, some datasets focus narrowly on toxicity or bias, while others emphasize jailbreaking techniques, resulting in fragmented evaluation frameworks. This inconsistency raises a critical research question: Why is there no universal dataset that provides consistent risk categorization and comprehensive evaluation across diverse domains? A related question follows: Which risk categories and domains do existing datasets prioritize, and how do they align with real-world LLM deployment scenarios?

Moreover, prior red-teaming methods such as RainbowPlus (Dang et al., 2025), AutoDAN (Liu et al., 2024; 2025), Tree-of-Attacks (Mehrotra et al., 2023), and GPTFuzzer (Yu et al., 2023) have primarily been evaluated on older LLM architectures. These studies often lack comparisons with recently released state-of-the-art models such as Qwen2.5 (Yang et al., 2024; Team, 2024c), Llama 3.1 (Team, 2024b), and Gemma 2 (Team, 2024a). This gap prompts another key research question: How do modern LLMs perform under red teaming evaluations, and what new insights can be gained from benchmarking their robustness?

To address these research questions, we introduce *RedBench*, a novel universal dataset designed to advance red teaming for LLMs. RedBench aggregates and harmonizes 37 existing datasets from leading conferences and influential papers, providing a standardized framework for evaluating LLM vulnerabilities. By systematically analyzing the risk categories and domains covered by these datasets, RedBench offers a comprehensive overview of the current landscape of red teaming resources. Furthermore, we select targeted sub-datasets from RedBench to establish baselines for modern LLMs, enabling robust comparisons and fostering future research. To ensure accessibility and reproducibility, we open-source both the RedBench dataset and the associated evaluation code.

Our study makes the following contributions:

- **RedBench Dataset**: A universal dataset that consolidates 37 existing red teaming datasets, providing consistent risk categorization and comprehensive coverage of domains to enable standardized LLM evaluations.

- **Comprehensive Analysis**: A detailed analysis of risk categories and domains in existing datasets, highlighting gaps and opportunities for future red-teaming research.

- **Baselines for Modern LLMs**: Evaluation baselines for state-of-the-art LLMs, including Qwen2.5, Llama 3.1, and Gemma 2, to assess their robustness against adversarial prompts and foster comparative studies.

- **Open-Source Resources**: Publicly available dataset and evaluation code to promote transparency, reproducibility, and community-driven advances in LLM red-teaming.

## 2 Methodology

### 2.1 Data Collection

To construct RedBench, a high-quality and comprehensive dataset for red teaming large language models (LLMs), we aggregated 37 benchmark datasets sourced from leading peer-reviewed conferences, journals, and reliable repositories. These sources include prominent venues such as the Conference on Neural Information Processing Systems (NeurIPS), the Annual Meeting of the Association for Computational Linguistics (ACL), the International Conference on Machine Learning (ICML), the International Conference on Learning Representations (ICLR), and the preprint repository arXiv, among others. The selection criteria prioritized peer-reviewed status, relevance to red teaming objectives, and coverage of diverse risk scenarios, ensuring a robust and representative foundation for a universal red-teaming dataset.

The resulting corpus comprises 29,362 samples, covering a wide range of prompt types designed to probe LLM vulnerabilities. These samples are categorized into two primary red teaming directions:

- **Attack**: This direction evaluates the susceptibility of the model to harmful or adversarial prompts that aim to elicit unsafe, biased, or erroneous responses (Dang et al., 2025; Liu et al., 2025). Of the 37 datasets, 33 focus on this direction, including well-known benchmarks such as HarmBench (Mazeika et al., 2024), AdvBench (Zou et al., 2023), and DAN (Shen et al., 2024a). These datasets contain instructions designed to exploit vulnerabilities in areas such as toxicity, misinformation, and jailbreaking.

- **Refusal**: This direction assesses the model's tendency to over-defend by refusing benign or legitimate prompts, which can hinder usability (Brahman et al., 2024; Cui et al., 2024). Four datasets address this direction: CoCoNot (Brahman et al., 2024), ORBench (Cui et al., 2024), SGXTest (Gupta et al., 2024), and XSTest (Röttger et al., 2024). These datasets include prompts designed to test the boundaries of appropriate refusal behavior, ensuring that models do not reject harmless requests unnecessarily.

This dual-focus approach ensures that RedBench captures both offensive vulnerabilities (via attack prompts) and defensive overreach (via refusal prompts), providing a holistic framework for evaluating LLM robustness.
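The split into attack and refusal directions can be sketched as a small registry. This is a minimal illustration, not the released evaluation code: it lists only the seven datasets named above, and the `source` field on each sample and the `split_by_direction` helper are hypothetical names introduced here.

```python
from collections import Counter

# Only the datasets named in the text are listed; the other attack
# datasets are not enumerated in this section.
NAMED_DATASETS = {
    "HarmBench": "attack",
    "AdvBench": "attack",
    "DAN": "attack",
    "CoCoNot": "refusal",
    "ORBench": "refusal",
    "SGXTest": "refusal",
    "XSTest": "refusal",
}

# Dataset counts per direction, as stated in the text.
DIRECTION_COUNTS = {"attack": 33, "refusal": 4}
TOTAL_DATASETS = 37

# How many datasets of each direction are named explicitly above.
named_counts = Counter(NAMED_DATASETS.values())
unnamed_attack = DIRECTION_COUNTS["attack"] - named_counts["attack"]


def split_by_direction(samples):
    """Group sample dicts by the direction of their source dataset.

    Each sample is assumed (hypothetically) to carry a "source" key
    holding the name of the benchmark it came from.
    """
    groups = {"attack": [], "refusal": []}
    for sample in samples:
        direction = NAMED_DATASETS[sample["source"]]
        groups[direction].append(sample)
    return groups


if __name__ == "__main__":
    print(sum(DIRECTION_COUNTS.values()) == TOTAL_DATASETS)
    print(dict(named_counts), "unnamed attack datasets:", unnamed_attack)
```

Keeping the direction on the dataset rather than on each prompt mirrors the description above, where a benchmark as a whole is assigned to one direction.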

**Figure 1**: Distribution of publication sources for the 37 benchmark datasets in RedBench. The figure illustrates the diversity of high-quality sources, with arXiv, ACL, NeurIPS, and ICLR being the most represented.

The distribution of dataset sources reflects the diversity and academic rigor of the collected datasets. The majority of datasets originate from arXiv (8 datasets), ACL (6 datasets), NeurIPS (6 datasets), and ICLR (6 datasets), underscoring the prominence of these venues in LLM and red-teaming research. Additional contributions come from EMNLP, ACM, ICML, EACL, USENIX, and NAACL, ensuring a broad representation of perspectives and methodologies. Figure 1 provides a visual representation of this distribution, highlighting the balance between preprint repositories and peer-reviewed conference proceedings.
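The counts above imply how the 37 datasets divide between the four most represented sources and the rest. The arithmetic below is a quick check, assuming each dataset is attributed to exactly one source; the per-venue counts for the six remaining venues are not given in this section, so only their combined total is derived.

```python
TOTAL_DATASETS = 37

# Counts stated in the text for the four most represented sources.
TOP_VENUES = {"arXiv": 8, "ACL": 6, "NeurIPS": 6, "ICLR": 6}

# Venues named without individual counts.
OTHER_VENUES = ["EMNLP", "ACM", "ICML", "EACL", "USENIX", "NAACL"]

top_total = sum(TOP_VENUES.values())
remaining = TOTAL_DATASETS - top_total
top_share = top_total / TOTAL_DATASETS
arxiv_share = TOP_VENUES["arXiv"] / TOTAL_DATASETS

if __name__ == "__main__":
    print(f"top four sources: {top_total} datasets ({top_share:.1%})")
    print(f"other {len(OTHER_VENUES)} venues combined: {remaining} datasets")
    print(f"arXiv preprints: {arxiv_share:.1%} of all datasets")
```

Under that assumption, the four leading sources account for 26 datasets (about 70%), leaving 11 spread across the other six venues.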

To ensure the quality and relevance of RedBench, each dataset was meticulously curated based on several criteria:

(1) *Task Relevance*: Datasets were selected for their alignment with red teaming objectives, focusing on prompts that test model safety, robustness, or refusal behavior.

(2) *Risk Scenario Coverage*: The datasets collectively cover a wide spectrum of risk categories, including toxicity, bias, misinformation, jailbreaking, and over-refusal, addressing both offensive and defensive failure modes.

(3) *Data Integrity*: Only datasets with clear documentation, reproducible prompts, and verified sources were included to ensure reliability and usability.

This rigorous curation process guarantees that RedBench serves as a high-quality, standardized resource for evaluating LLMs in diverse red-teaming scenarios. By aggregating and harmonizing these 37 datasets, RedBench provides a unified and comprehensive platform for red teaming research. The dataset's extensive coverage of attack and refusal prompts, combined with its diverse and credible sources, positions RedBench as a valuable tool for benchmarking modern LLMs and advancing the development of robust and secure language models.

### 2.2 Taxonomy for Datasets

A critical limitation of existing red teaming datasets is the lack of consistency in risk definitions and categorizations, which often results in labels that are overlapping, ambiguous, or poorly defined (Zou et al., 2023; Xie et al., 2024). This fragmentation hinders cross-dataset comparisons and complicates systematic evaluations of LLM vulnerabilities. To address this challenge, we developed a standardized taxonomy for adversarial prompts in RedBench, assigning to each of the 29,362 samples two labels: *Risk Category* and *Domain*. This taxonomy unifies disparate datasets, ensures clarity in risk and context classification, and facilitates comprehensive red teaming evaluations across diverse scenarios.

The taxonomy is structured around two dimensions: *Risk Category*, which identifies the type of harm or misuse a prompt may elicit, and *Domain*, which specifies the contextual area in which the prompt is situated. By applying this dual-labeling approach, RedBench enables researchers to analyze LLM vulnerabilities with granularity, supporting both broad risk assessments and domain-specific investigations. The following subsections detail the definitions and annotation process for these labels.

#### 2.2.1 Risk Categories

We define 22 distinct risk categories, each corresponding to a specific type of harm or misuse that LLMs might enable or exacerbate. These categories, presented in Table 8, were developed through a systematic review of existing red teaming frameworks (Zou et al., 2023; Xie et al., 2024) and of safety guidelines from organizations such as NIST (Autio et al., 2024) and OWASP (OWASP Foundation, 2025). Each category is precisely defined to avoid overlap, is grounded in real-world implications, and is applicable to a wide range of testing scenarios. For prompts designed to evaluate refusal behavior (i.e., benign prompts that should not be rejected), we assign the *Risk Category* as "No Risk" to distinguish them from adversarial prompts.

#### 2.2.2 Domains

To capture the contextual diversity of adversarial prompts, we define 19 domains, each representing a specific area of application or use case for LLMs. These domains, listed in Table 9, were informed by a comprehensive analysis of LLM deployment scenarios (Singhal et al., 2022; Maddela et al., 2023), stakeholder consultations, and application-specific red-teaming studies (Bhatt et al., 2023; Han et al., 2024). The domains range from specialized fields such as healthcare and military to broader areas such as general knowledge, ensuring that RedBench reflects the multifaceted contexts in which LLMs operate.
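The dual-label scheme can be pictured as a record that carries both labels and enforces the "No Risk" convention for refusal prompts. This is a hedged sketch, not the schema of the released dataset: the class name, field names, and `cross_tabulate` helper are hypothetical, and the category and domain strings in the usage example are placeholders rather than entries confirmed from Tables 8 and 9.

```python
from collections import Counter
from dataclasses import dataclass

NO_RISK = "No Risk"  # label the text assigns to benign refusal prompts


@dataclass(frozen=True)
class LabeledSample:
    """One prompt with its two taxonomy labels (hypothetical schema)."""

    prompt: str
    source: str         # originating benchmark dataset
    direction: str      # "attack" or "refusal"
    risk_category: str  # one of the 22 risk categories, or "No Risk"
    domain: str         # one of the 19 domains

    def __post_init__(self):
        if self.direction not in ("attack", "refusal"):
            raise ValueError(f"unknown direction: {self.direction!r}")
        # Refusal prompts are benign, so they must carry "No Risk".
        if self.direction == "refusal" and self.risk_category != NO_RISK:
            raise ValueError("refusal prompts must be labeled 'No Risk'")


def cross_tabulate(samples):
    """Count samples per (risk_category, domain) pair."""
    return Counter((s.risk_category, s.domain) for s in samples)
```

A usage example with placeholder labels:

```python
samples = [
    LabeledSample("placeholder prompt A", "HarmBench", "attack",
                  "Misinformation", "Healthcare"),
    LabeledSample("placeholder prompt B", "XSTest", "refusal",
                  NO_RISK, "General Knowledge"),
]
table = cross_tabulate(samples)
```

Counting by label pair is one way to support both the broad risk assessments and the domain-specific investigations described in Section 2.2.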

#### 2.2.3 Annotation Process

To assign *Risk Category* and *Domain* labels to all 29,362 samples in RedBench, we implemented a semi-automated annotation pipeline that combines the efficiency of state-of-the-art LLMs with the reliability of human oversight. The process utilized Qwen2.5-72B-Instruct (Team, 2024c), selected for its strong instruction-following capabilities and high performance in classification tasks, further validated by high agreement with human annotators on a random sample of 300 prompts (see Appendix C for details). The annotation pipeline proceeded as follows:

1. **Prompt Design**: We developed detailed prompts to guide the LLM in classification...
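The validation step mentioned above, comparing LLM labels with human labels on a random sample of prompts, can be sketched as follows. The text does not state which agreement metric was used (details are deferred to Appendix C), so raw percent agreement and Cohen's kappa are shown here purely as common choices, not as the authors' method. The function names are introduced for this illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labeled independently according to their own marginals.
    all_labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Kappa is often reported alongside raw agreement for label sets with skewed frequencies, because a few dominant categories can inflate raw agreement.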
