Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv cs.CL Papers

Summary

This paper presents a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples across 8 NIST safety categories, finding that model size does not correlate with detection performance and that Qwen Guard (4B) achieves the highest recall.

arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:11 AM

# A Comprehensive EvaluationPublished as a workshop paper at ICLR 2026
Source: [https://arxiv.org/html/2605.28830](https://arxiv.org/html/2605.28830)
## Benchmarking Open\-Source Safety Guard Models: A Comprehensive Evaluation††thanks:Published as a workshop paper at ICLR 2026

###### Abstract

As Large Language Models \(LLMs\) are increasingly deployed in safety\-critical applications, robust content moderation becomes essential\. We present a comprehensive evaluation of 14 open\-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories\. Our benchmark aggregates four diverse datasets \(HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails\), filtered to focus exclusively on safety\-relevant content \(violence, hate speech, harassment, sexual content, suicide/self\-harm, profanity, threats, and health misinformation\)\. We find thatrecall is the critical metricfor safety applications, as missing harmful content poses greater risk than false positives\. Our evaluation reveals surprising results: Qwen Guard \(4B parameters\) achieves the highest recall \(83\.97%\) while larger models like Llama Guard \(12B\) and GPT\-OSS Safeguard \(20B\) exhibit conservative behavior, missing up to 75% of unsafe content\. We demonstrate that model size does not correlate with safety detection performance and that general\-purpose guard models outperform specialized ones\. These findings provide practical guidance for selecting safety guard models in production deployments\.

## 1Introduction

The rapid adoption of Large Language Models \(LLMs\) in various applications introduces significant safety concerns that demand rigorous content moderation\. Safety guard models have emerged as a critical component in LLM deployment pipelines, serving as filters that classify input prompts or model outputs as safe or unsafe\. However, the proliferation of guard models, each with different architectures, training methodologies, and safety taxonomies, creates uncertainty for practitioners selecting appropriate safeguards for their applications\.

Despite the critical importance of guard models, no comprehensive benchmark exists that evaluates modern open\-source safety models across diverse, safety\-focused datasets using standardized taxonomy\.

### 1\.1Research Questions

This work addresses three primary research questions:

1. 1\.RQ1: How do existing open\-source guard models perform on safety detection across diverse datasets and harm categories?
2. 2\.RQ2: What is the relationship between model size, architecture, and safety detection performance?
3. 3\.RQ3: Which evaluation metrics best capture guard model effectiveness for safety\-critical deployments?

### 1\.2Contributions

Our contributions include: \(1\) a comprehensive benchmark of 79,331 samples from four publicly available sources, filtered using the NIST AI Risk Management Framework\(National Institute of Standards and Technology,[2023](https://arxiv.org/html/2605.28830#bib.bib5)\)to focus on 8 safety subcategories; \(2\) large\-scale evaluation of 14 open\-source guard models \(110M–20B parameters\) from Google, NVIDIA, IBM, Meta, and Alibaba; \(3\) metric analysis demonstrating that recall is the primary metric and high\-precision models can be dangerously conservative; and \(4\) actionable recommendations for model selection\.

## 2Related Work

### 2\.1Guard Models

Safety guard models have evolved from simple keyword\-based filters to sophisticated LLM\-based classifiers\. Llama Guard\(Meta AI,[2025](https://arxiv.org/html/2605.28830#bib.bib17)\)introduced a taxonomy\-driven approach using instruction\-tuned models for input\-output safeguarding\. WildGuard\(Hanet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib8)\)extended this to handle both prompt and response classification with broader coverage\. More recently, models from diverse organizations have emerged: Granite Guardian\(Padhiet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib10)\)from IBM, Qwen Guard\(Zhaoet al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib6)\)from Alibaba, and ShieldGemma\(Google DeepMind,[2024](https://arxiv.org/html/2605.28830#bib.bib16)\)from Google, each with varying safety taxonomies and architectural choices\.

### 2\.2Safety Benchmarks

As large language models become more powerful and widely deployed, ensuring the safety of their outputs is critical\. To mitigate risks,guardrail modelssuch as Llama Guard, ShieldGemma, WildGuard, and Qwen Guard are adopted as filtering mechanisms that perform real\-time risk detection on both user prompts and model responses, ensuring safer interactions in AI systems\.

Existing safety benchmarks primarily target general\-purpose LLMs\. SafetyBench\(Zhanget al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib21)\)provides 11,435 multiple\-choice questions to test whether LLMs select safe responses\. HELM\(Lianget al\.,[2023](https://arxiv.org/html/2605.28830#bib.bib22)\)evaluates LLMs holistically across many dimensions including safety\. RabakBench\(Chuaet al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib23)\)constructs localized multilingual safety benchmarks for low\-resource languages\. These benchmarks evaluate the generation behavior of LLMs, not the effectiveness of guardrail models\.

For guardrail model evaluation, GuardBench\(Bassani and Sanchez,[2024](https://arxiv.org/html/2605.28830#bib.bib20)\)introduced a benchmark comprising 40 datasets and evaluated 13 models\. However, their evaluated models were primarily Llama Guard variants and content moderation models \(Detoxify, ToxiGen\), and they acknowledge the lack of “a generally accepted taxonomy of unsafe content,” leaving results fragmented across datasets without unified categorization\.

Our work focuses on evaluating guardrail models and addresses these gaps: \(1\) we evaluate 14 models from diverse vendors \(Google, NVIDIA, IBM, Meta, Alibaba\) including recent 2025 models \(Qwen Guard, GPT\-OSS Safeguard, DynaGuard, GuardReasoner, Llama Guard 4\); \(2\) we adopt the NIST AI Risk Management Framework\(National Institute of Standards and Technology,[2023](https://arxiv.org/html/2605.28830#bib.bib5)\)for standardized categorization across 8 safety subcategories; and \(3\) we provide category\-level analysis revealing which safety categories remain challenging for current guardrail models\.

## 3Methodology

### 3\.1Dataset Construction

We construct a master safety dataset by aggregating and filtering four publicly available sources according to the NIST AI Risk Management Framework\.

##### NIST AI Risk Framework\.

The NIST AI RMF defines 7 risk categories: SAFETY, SECURITY, ACCOUNTABILITY, PRIVACY, FAIRNESS, TRANSPARENCY, and UNCERTAINTY\. Our benchmark focuses exclusively on SAFETY to ensure evaluation targets safety\-specific detection rather than conflating safety with security or privacy\. Within SAFETY, we define 8 subcategories: \(1\) Violence, \(2\) Hate Speech, \(3\) Suicide & Self\-Harm, \(4\) Sexual Content, \(5\) Harassment, \(6\) Profanity, \(7\) Threats, and \(8\) Health misinformation\.

##### Source Datasets\.

HarmBench\(Mazeikaet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib4)\): 400 text behaviors designed to test model robustness\. We retain 103 samples after filtering out copyright \(100\), cybercrime and illegal activity \(132→\\rightarrowSECURITY\), and misinformation \(65→\\rightarrowACCOUNTABILITY\)\.

StrongREJECT\(Soulyet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib2)\): 313 forbidden prompts testing rejection capabilities\. We retain 154 samples \(Violence, Sexual Content, Hate/Harassment\) after filtering out non\-violent crimes and illegal goods \(109→\\rightarrowSECURITY\) and disinformation \(50→\\rightarrowACCOUNTABILITY\)\.

RealToxicityPrompts\(Gehmanet al\.,[2020](https://arxiv.org/html/2605.28830#bib.bib1)\): 99,442 naturally occurring prompts with Perspective API toxicity scores across 8 dimensions\. We retain 67,521 samples after filtering out theflirtationcategory \(31,921 samples\) as it does not map to any of our 8 NIST SAFETY subcategories\. We map remaining scores to binary labels using a threshold of 0\.5, the standard midpoint for binary classification that balances false positives and false negatives\.

BeaverTails\(Jiet al\.,[2023](https://arxiv.org/html/2605.28830#bib.bib3)\): 27,186 human\-annotated samples with 14 harm categories\. We retain 11,553 samples after filtering out terrorism and weapons \(SECURITY\), financial crime and privacy violations \(PRIVACY\), political content \(CROSS\-CUTTING\), and non\-violent unethical behavior \(not safety\-specific\)\.

##### Dataset Selection Rationale\.

Our dataset achieves complete coverage of all 8 NIST safety subcategories: RealToxicityPrompts provides naturally\-occurring toxicity \(Harassment, Threats, Profanity, Hate Speech\); BeaverTails contributes human\-annotated samples \(Violence, Sexual Content, Health, and critically, Suicide & Self\-Harm\-the only source for this category\); HarmBench and StrongREJECT add adversarial edge cases\. This 79,331\-sample benchmark spans adversarial, natural, and human\-annotated sources\.

##### Labeling Logic\.

Labeling varies by dataset: HarmBench and StrongREJECT samples are allunsafe\(adversarial by design\); BeaverTails uses human\-annotated labels; RealToxicityPrompts computes the maximum toxicity score across 7 dimensions \(excluding flirtation\) and labels samples as unsafe if score\>0\.5\>0\.5\. Full labeling details are in Appendix[B](https://arxiv.org/html/2605.28830#A2)\.

Table 1:Master Safety Dataset CompositionThe final dataset contains 79,331 samples with 54\.7% unsafe and 45\.3% safe labels\. Notably, all safe samples originate from RealToxicityPrompts, as the other three datasets are designed exclusively for adversarial/harmful content evaluation\.

### 3\.2Guard Models Evaluated

We evaluate 14 open\-source guard models spanning diverse architectures, sizes, and safety taxonomies\. Table[2](https://arxiv.org/html/2605.28830#S3.T2)summarizes the models\.

Table 2:Guard Models Evaluated\. Models are grouped by architecture type: decoder\-only LLMs \(top\) and encoder\-only transformers \(bottom\)\.ModelSizeBase ArchitectureCategoriesDecoder\-only LLMsQwen Guard\(Zhaoet al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib6)\)4BQwen310Nemotron Safety\(Joshiet al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib7)\)8BLlama 3\.123WildGuard\(Hanet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib8)\)7BMistral\-7B13MD\-Judge\(Liet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib9)\)7BInternLM216Granite Guardian\(Padhiet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib10)\)8BGraniteCustomDynaGuard\(Hooveret al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib11)\)8BQwen3DynamicDuoGuard\(Denget al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib12)\)0\.5BQwen 2\.512Llama Guard\(Meta AI,[2025](https://arxiv.org/html/2605.28830#bib.bib17)\)12BLlama 4 \(pruned\)14ShieldGemma\(Google DeepMind,[2024](https://arxiv.org/html/2605.28830#bib.bib16)\)2BGemma 24GuardReasoner\(Liuet al\.,[2025](https://arxiv.org/html/2605.28830#bib.bib15)\)3BLlama 3\.2ReasoningGPT\-OSS Safeguard\(OpenAI,[2025](https://arxiv.org/html/2605.28830#bib.bib18)\)20BGPT\-OSSCustomEncoder\-only TransformersEthicalEye\(Patel and Raj,[2024](https://arxiv.org/html/2605.28830#bib.bib13)\)270MXLM\-RoBERTaBinaryPoliteGuard\(Intel Corporation,[2024](https://arxiv.org/html/2605.28830#bib.bib14)\)110MBERT\-base4MetaHateBERT\(Piotet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib19)\)110MBERT\-baseBinary##### Label Normalization\.

Models output varying labels \(safe, unsafe, controversial, error\)\. We normalize by mapping Qwen Guard’scontroversiallabel tounsafe\(as these contain contextually harmful content that should be flagged\), excluding error predictions, and mapping all other labels to binary safe/unsafe\.

## 4Evaluation and Results

### 4\.1Evaluation Metrics \(RQ3\)

For safety\-critical applications, recall is the primary metric: missing harmful content \(false negatives\) poses greater risk than incorrectly flagging safe content \(false positives\)\. We report Recall, Precision, F1 Score, Accuracy, ROC\-AUC, and MCC \(Matthews Correlation Coefficient\), with results ranked by recall throughout\.

### 4\.2Overall Performance \(RQ1\)

Table[3](https://arxiv.org/html/2605.28830#S4.T3)presents the main evaluation results, ranked by recall\. Our primary finding is that Qwen Guard achieves the highest recall \(83\.97%\), significantly outperforming larger models\.

Table 3:Guard Model Performance \(Ranked by Recall\)Key observations: \(1\) Conservative models are dangerous\-while ShieldGemma achieves highest precision \(82\.20%\), it misses 54\.51% of unsafe content, and GPT\-OSS Safeguard misses 75\.14%; \(2\) General\-purpose models outperform specialized ones\-MetaHateBERT, designed for hate speech, achieves only 15\.79% recall, failing to generalize across categories\.

Figure[1](https://arxiv.org/html/2605.28830#S4.F1)visualizes model performance through two complementary views: the precision\-recall tradeoff and detailed confusion matrices\.

![Refer to caption](https://arxiv.org/html/2605.28830v1/x1.png)\(a\)Precision\-Recall Tradeoff
![Refer to caption](https://arxiv.org/html/2605.28830v1/x2.png)\(b\)Confusion Matrices

Figure 1:Performance analysis of 14 guard models\.\(a\)Precision\-recall comparison where each line connects recall \(green\) to precision \(blue\)\. Line length indicates the gap\-models with long leftward lines \(GPT\-OSS, Llama Guard\) miss most unsafe content despite high precision\.\(b\)Normalized confusion matrices sorted by recall\. The bottom\-left cell \(FN\) shows missed unsafe content: top models achieve 8\.8–12\.5% FN, while conservative models show 41–46% FN rates\.
### 4\.3Performance by Dataset

Table[4](https://arxiv.org/html/2605.28830#S4.T4)shows per\-dataset recall for the top 5 models\.

Table 4:Recall by Dataset \(Top 5 Models\)Adversarial datasets show varied performance: Most models achieve near\-perfect recall on HarmBench \(99\-100%\), while StrongREJECT performance varies\-Qwen Guard achieves only 54\.55% recall on StrongREJECT despite leading overall, suggesting dataset\-specific biases\. RealToxicityPrompts proves most challenging across all models, containing subtle, naturally\-occurring toxicity that many models fail to detect\. To verify that results are not driven by source artifacts, we provide stratified analysis comparing RealToxicityPrompts \(balanced safe/unsafe from same source\) versus adversarial datasets combined in Appendix[E](https://arxiv.org/html/2605.28830#A5); model rankings remain consistent across both splits\.

### 4\.4Performance by NIST Category

Table[5](https://arxiv.org/html/2605.28830#S4.T5)shows recall variation across safety categories for all models\. This analysis reveals systematic patterns in what types of harmful content models detect well versus poorly\.

Table 5:Recall by NIST Safety Category \(All 14 Models\)\. Categories ordered by detection difficulty \(avg recall\): Suicide/Self\-Harm \(78%\)\>\>Violence \(71%\)\>\>Hate \(62%\)\>\>Sexual \(59%\)\>\>Health \(57%\)\>\>Harassment \(54%\)\>\>Profanity \(51%\)\>\>Threats \(43%\)\. Bold indicates best per category\.Category\-specific findings reveal systematic patterns: Violence and Suicide/Self\-Harm are easiest to detect \(\>\>90% recall for most LLMs\), while Threats is hardest \(43\.2% average\)\. Encoder models show surprising strength\-EthicalEye achieves highest recall on Harassment \(89\.5%\) and Profanity \(92\.3%\) despite ranking 8th overall\. No single model dominates: MD\-Judge leads on Violence/Suicide/Health, Qwen Guard on Hate/Sexual/Threats\. Large models \(GPT\-OSS, Llama Guard\) consistently underperform on implicit harm categories\.

### 4\.5Threshold Sensitivity Analysis

Using the labeling function defined in Section[3\.1](https://arxiv.org/html/2605.28830#S3.SS1), we analyze sensitivity to thresholdτ∈\{0\.3,0\.4,0\.5,0\.6,0\.7\}\\tau\\in\\\{0\.3,0\.4,0\.5,0\.6,0\.7\\\}for RealToxicityPrompts\. Table[6](https://arxiv.org/html/2605.28830#S4.T6)shows results for top 5 models\.

Table 6:Threshold Sensitivity: Recall \(R\), Precision \(P\), and F1 at Different Labeling Thresholds\. Bold indicates default threshold 0\.5\. Atτ\\tau=0\.3, 90\.6% of samples are unsafe; atτ\\tau=0\.7, only 24\.5% are unsafe\.Key findings: \(1\) Model rankings are stable\-Qwen Guard maintains highest recall at all thresholds, followed by Nemotron Safety and WildGuard\. \(2\) Precision\-recall tradeoff\-as threshold increases \(stricter ground truth\), recall improves while precision degrades\. \(3\) F1 peaks at 0\.5\-justifying our default threshold choice\. Full results for all 14 models in Appendix[C\.2](https://arxiv.org/html/2605.28830#A3.SS2)\.

### 4\.6Error Analysis: False Negatives

To understand model failures beyond aggregate metrics, we analyze false negatives\-unsafe content incorrectly classified as safe\. This analysis reveals systematic blind spots that explain the 5\.3×\\timesrecall gap between best and worst performers\.

Methodology\.For each of the 43,393 unsafe samples, we compute each model’s prediction and calculate false negative rates \(FN rate = samples missed / total unsafe\) by NIST category\. Table[7](https://arxiv.org/html/2605.28830#S4.T7)shows results for top\-3 and bottom\-3 models by overall recall\.

Table 7:False Negative Rate \(%\) by NIST Category\. Lower is better\. Top\-3 models \(Qwen, Nemotron, WildGuard\) vs Bottom\-3 \(Llama Guard, GPT\-OSS, MetaHateBERT\)\. Bold indicates best per category\. Full 14\-model results in Appendix[D](https://arxiv.org/html/2605.28830#A4)\.Key findings: The 5\.3×\\timesperformance gap between Qwen Guard \(15\.9% FN\) and MetaHateBERT \(84\.2% FN\) reveals systematic differences\. Consistent with Section 4\.3, Suicide/Self\-Harm has lowest FN rate \(21\.8%\) while Threats has highest \(56\.1%\)\. Conservative models show a striking explicit\-vs\-implicit pattern: Llama Guard achieves 18\.9% FN on explicit categories but 83\.0% on implicit categories \(64\.1pp gap\), suggesting reliance on keyword matching rather than semantic understanding\.

## 5Discussion

Our evaluation addresses the three research questions posed in Section[1](https://arxiv.org/html/2605.28830#S1): RQ1 is answered through our comprehensive performance analysis across datasets and NIST categories \(Sections 4\.2–4\.5\), RQ2 through our model size analysis \(Finding 2\), and RQ3 through our metric analysis demonstrating recall as the critical metric \(Section 4\.1, Finding 1\)\.

### 5\.1Key Findings

Finding 1 \(RQ3\): High precision does not compensate for low recall\.

Precision\-optimized models \(GPT\-OSS Safeguard, Llama Guard\) miss up to 75% of unsafe content\. A 25% recall rate is insufficient regardless of precision\.

Finding 2 \(RQ2\): Larger is not safer\.

The inverse correlation between model size and safety detection performance challenges conventional assumptions\. Figure[2](https://arxiv.org/html/2605.28830#S5.F2)visualizes this relationship: Qwen Guard \(4B\) achieves 3\.4x higher recall than GPT\-OSS Safeguard \(20B\), and the Pearson correlation between log10\-transformed model size and recall is negligible \(r=0\.21r=0\.21,p=0\.48p=0\.48,n=14n=14\)\. We hypothesize that larger models may be trained with more conservative safety thresholds, prioritizing avoiding false positives over comprehensive detection\.

![Refer to caption](https://arxiv.org/html/2605.28830v1/figures/fig_size_vs_recall.png)Figure 2:Model size does not predict safety detection performance\. Each point represents a guard model, with decoder\-only LLMs \(blue circles\) and encoder\-only transformers \(red squares\)\. The 4B Qwen Guard achieves 3\.4x higher recall than the 20B GPT\-OSS Safeguard\. The weak trend line \(r=0\.21\) confirms no meaningful correlation between model size and recall\.Finding 3: Label normalization significantly impacts results\.

Qwen Guard outputs three labels:safe,unsafe, andcontroversial\. Thecontroversialcategory contains contextually harmful content \(e\.g\., implicit threats, borderline hate speech\)\. Table[8](https://arxiv.org/html/2605.28830#S5.T8)shows the impact of treatingcontroversialas unsafe versus safe\.

Table 8:Impact of Qwen Guard’scontroversiallabel treatment on evaluation metrics\.Mergingcontroversialwithunsafeincreases recall by 37\.2 percentage points \(from 46\.75% to 83\.97%\) at the cost of 20\.3 percentage points in precision\. This tradeoff is justified for safety\-critical applications: the F1 score improves by 14\.3 percentage points, and\-consistent with our central thesis\-catching 37% more harmful content outweighs the increase in false positives\. We do not treatcontroversialas a separate third class because our benchmark focuses on binary classification for production safety systems, where content must be either blocked or allowed\-a three\-class output does not provide actionable guidance for deployment\. Without this merging, Qwen Guard’s recall would drop to 46\.75%, ranking it 10th rather than 1st among evaluated models\.

### 5\.2Limitations

Our study has limitations: \(1\) all safe samples originate from RealToxicityPrompts, potentially biasing evaluation; \(2\) the 0\.5 threshold is a design choice, though model rankings remain stable across 0\.3–0\.7 \(Section 4\.4\); \(3\) we evaluate only prompts, not responses; \(4\) evaluation is English\-only; and \(5\) domain\-specific applications may require specialized benchmarks\.

## 6Conclusion

We present the first comprehensive benchmark of 14 open\-source safety guard models on 79,331 samples spanning 8 NIST safety categories, revealing that many widely\-deployed models exhibit dangerous conservatism, missing up to 75% of harmful content\. Our key contributions are: \(1\) a methodology for NIST\-aligned safety benchmarks; \(2\) empirical evidence that model size does not predict performance; and \(3\) actionable recommendations for model selection\. Qwen Guard \(4B\) emerges as best\-performing with 83\.97% recall, followed by Nemotron Safety \(8B\) and WildGuard \(7B\)\. Future work should address multilingual evaluation and response\-level classification\.

#### Reproducibility Statement

All datasets are publicly available: HarmBench, StrongREJECT\(Soulyet al\.,[2024](https://arxiv.org/html/2605.28830#bib.bib2)\), RealToxicityPrompts, and BeaverTails\. Model inference used default configurations \(temperature 0\.0\)\. Dataset construction follows Section[3\.1](https://arxiv.org/html/2605.28830#S3.SS1)\.

#### Ethics Statement

This work uses publicly available datasets containing harmful content for safety research\. Our benchmark aims to improve AI safety\. While metrics could inform attacks, transparent evaluation benefits outweigh risks\.

## References

- GuardBench: a large\-scale benchmark for guardrail models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 18393–18409\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1022)Cited by:[§2\.2](https://arxiv.org/html/2605.28830#S2.SS2.p3.1)\.
- G\. Chua, L\. Tan, Z\. Ge, and R\. K\. Lee \(2025\)RabakBench: scaling human annotations to construct localized multilingual safety benchmarks for low\-resource languages\.arXiv preprint arXiv:2507\.05980\.External Links:2507\.05980,[Link](https://arxiv.org/abs/2507.05980)Cited by:[§2\.2](https://arxiv.org/html/2605.28830#S2.SS2.p2.1)\.
- Y\. Deng, Y\. Yang, J\. Zhang, W\. Wang, and B\. Li \(2025\)DuoGuard: a two\-player rl\-driven framework for multilingual llm guardrails\.External Links:2502\.05163,[Link](https://arxiv.org/abs/2502.05163)Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.9.9.1)\.
- S\. Gehman, S\. Gururangan, M\. Sap, Y\. Choi, and N\. A\. Smith \(2020\)Realtoxicityprompts: evaluating neural toxic degeneration in language models\.arXiv preprint arXiv:2009\.11462\.Cited by:[§3\.1](https://arxiv.org/html/2605.28830#S3.SS1.SSS0.Px2.p3.1)\.
- Google DeepMind \(2024\)ShieldGemma: generative ai content moderation based on gemma\.Note:[https://ai\.google\.dev/gemma/docs/shieldgemma](https://ai.google.dev/gemma/docs/shieldgemma)Built on Gemma 2Cited by:[§2\.1](https://arxiv.org/html/2605.28830#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.11.11.1)\.
- S\. Han, K\. Rao, A\. Ettinger, L\. Jiang, B\. Y\. Lin, N\. Lambert, Y\. Choi, and N\. Dziri \(2024\)WildGuard: open one\-stop moderation tools for safety risks, jailbreaks, and refusals of llms\.External Links:2406\.18495,[Link](https://arxiv.org/abs/2406.18495)Cited by:[§2\.1](https://arxiv.org/html/2605.28830#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.5.5.1)\.
- M\. Hoover, V\. Baherwani, N\. Jain, K\. Saifullah, J\. Vincent, C\. Jain, M\. K\. Rad, C\. B\. Bruss, A\. Panda, and T\. Goldstein \(2025\)DynaGuard: a dynamic guardrail model with user\-defined policies\.arXiv preprint arXiv:2509\.02563\.External Links:[Link](https://arxiv.org/abs/2509.02563)Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.8.8.1)\.
- Intel Corporation \(2024\)Polite guard: text classification for politeness detection\.Note:[https://huggingface\.co/Intel/polite\-guard](https://huggingface.co/Intel/polite-guard)Fine\-tuned BERT\-baseCited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.16.16.1)\.
- J\. Ji, M\. Liu, J\. Dai, X\. Pan, C\. Zhang, C\. Bian, C\. Zhang, R\. Sun, Y\. Wang, and Y\. Yang \(2023\)BeaverTails: towards improved safety alignment of llm via a human\-preference dataset\.arXiv preprint arXiv:2307\.04657\.Cited by:[§3\.1](https://arxiv.org/html/2605.28830#S3.SS1.SSS0.Px2.p4.1)\.
- R\. Joshi, R\. Paul, K\. Singla, A\. Kamath, M\. Evans, K\. Luna, S\. Ghosh, U\. Vaidya, E\. Long, S\. S\. Chauhan,et al\.\(2025\)CultureGuard: towards culturally\-aware dataset and guard model for multilingual safety applications\.arXiv preprint arXiv:2508\.01710\.Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.4.4.1)\.
- L\. Li, B\. Dong, R\. Wang, X\. Hu, W\. Zuo, D\. Lin, Y\. Qiao, and J\. Shao \(2024\)SALAD\-bench: a hierarchical and comprehensive safety benchmark for large language models\.arXiv preprint arXiv:2402\.05044\.Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.6.6.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar,et al\.\(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.External Links:[Link](https://arxiv.org/abs/2211.09110)Cited by:[§2\.2](https://arxiv.org/html/2605.28830#S2.SS2.p2.1)\.
- Y\. Liu, H\. Gao, S\. Zhai, X\. Jun, T\. Wu, Z\. Xue, Y\. Chen, K\. Kawaguchi, J\. Zhang, and B\. Hooi \(2025\)GuardReasoner: towards reasoning\-based llm safeguards\.arXiv preprint arXiv:2501\.18492\.Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.12.12.1)\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks \(2024\)HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\.External Links:2402\.04249Cited by:[§3\.1](https://arxiv.org/html/2605.28830#S3.SS1.SSS0.Px2.p1.2)\.
- Meta AI \(2025\)Llama guard 4: safeguarding human\-ai conversations with multimodal content moderation\.Note:[https://huggingface\.co/meta\-llama/Llama\-Guard\-4\-12B](https://huggingface.co/meta-llama/Llama-Guard-4-12B)12B parameters, pruned from Llama 4 Scout\. Classifies 14 hazard categories per MLCommons taxonomy\.Cited by:[§2\.1](https://arxiv.org/html/2605.28830#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.10.10.1)\.
- National Institute of Standards and Technology \(2023\)Artificial intelligence risk management framework \(ai rmf 1\.0\)\.Technical reportTechnical ReportNIST AI 100\-1,U\.S\. Department of Commerce\.External Links:[Document](https://dx.doi.org/10.6028/NIST.AI.100-1),[Link](https://www.nist.gov/itl/ai-risk-management-framework)Cited by:[§1\.2](https://arxiv.org/html/2605.28830#S1.SS2.p1.1),[§2\.2](https://arxiv.org/html/2605.28830#S2.SS2.p4.1)\.
- OpenAI \(2025\)GPT\-oss\-safeguard: safety reasoning models for content moderation\.arXiv preprint arXiv:2508\.10925\.Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.13.13.1)\.
- I\. Padhi, M\. Nagireddy, G\. Cornacchia, S\. Chaudhury, T\. Pedapati, P\. Dognin, K\. Murugesan, E\. Miehling, M\. S\. Cooper, K\. Fraser, G\. Zizzo, M\. Z\. Hameed, M\. Purcell, M\. Desmond, Q\. Pan, Z\. Ashktorab, I\. Vejsbjerg, E\. M\. Daly, M\. Hind, W\. Geyer, A\. Rawat, K\. R\. Varshney, and P\. Sattigeri \(2024\)Granite guardian\.External Links:2412\.07724,[Link](https://arxiv.org/abs/2412.07724)Cited by:[§2\.1](https://arxiv.org/html/2605.28830#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.7.7.1)\.
- K\. Patel and J\. Raj \(2024\)EthicalEye: cross\-lingual toxicity detection model\.Note:[https://huggingface\.co/autopilot\-ai/EthicalEye](https://huggingface.co/autopilot-ai/EthicalEye)AutopilotAICited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.15.15.1)\.
- P\. Piot, P\. Martín\-Rodilla, and J\. Parapar \(2024\)MetaHate: a dataset for unifying efforts on hate speech detection\.InProceedings of the International AAAI Conference on Web and Social Media,Vol\.18,pp\. 2025–2039\.External Links:[Document](https://dx.doi.org/10.1609/icwsm.v18i1.31445)Cited by:[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.17.17.1)\.
- A\. Souly, Q\. Lu, D\. Bowen, T\. Trinh, E\. Hsieh, S\. Pandey, P\. Abbeel, J\. Svegliato, S\. Emmons, O\. Watkins, and S\. Toyer \(2024\)A strongREJECT for empty jailbreaks\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,Cited by:[§3\.1](https://arxiv.org/html/2605.28830#S3.SS1.SSS0.Px2.p2.2),[§6](https://arxiv.org/html/2605.28830#S6.SS0.SSSx1.p1.1)\.
- Z\. Zhang, L\. Lei, L\. Wu, R\. Sun, Y\. Huang, C\. Long, X\. Liu, X\. Lei, J\. Tang, and M\. Huang \(2024\)SafetyBench: evaluating the safety of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,pp\. 15814–15839\.External Links:[Link](https://arxiv.org/abs/2309.07045)Cited by:[§2\.2](https://arxiv.org/html/2605.28830#S2.SS2.p2.1)\.
- H\. Zhao, C\. Yuan, F\. Huang, X\. Hu, Y\. Zhang, A\. Yang, B\. Yu, D\. Liu, J\. Zhou, J\. Lin,et al\.\(2025\)Qwen3Guard technical report\.arXiv preprint arXiv:2510\.14276\.Cited by:[§2\.1](https://arxiv.org/html/2605.28830#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.28830#S3.T2.4.3.3.1)\.

## Appendix AAdditional Results

This appendix provides comprehensive supplementary results that support the main findings presented in the paper\.

### A\.1Full Performance Metrics

Table[9](https://arxiv.org/html/2605.28830#A1.T9)provides complete evaluation metrics for all 14 models, including ROC\-AUC and Matthews Correlation Coefficient \(MCC\)\. MCC is particularly informative for imbalanced datasets as it considers all four confusion matrix quadrants\. Models are ranked by recall, consistent with our thesis that detecting unsafe content is the primary objective for safety guard models\. The MCC values reveal that even top\-performing models show moderate correlation \(0\.40–0\.46\), indicating room for improvement across all evaluated systems\.

Table 9:Complete Evaluation Metrics

## Appendix BImplementation Details

### B\.1Label Normalization Details

Qwen Guard\(Qwen3Guard\-Gen\-4B\): Outputs three labels \(safe, unsafe, controversial\)\. Thecontroversiallabel is merged withunsafeas these samples contain contextually harmful content\.

Nemotron Safety\(Llama\-3\.1\-Nemotron\-Safety\-Guard\-8B\-v3\): Returns JSON with “User Safety” and “Safety Categories” fields\. We use the “User Safety” field for binary classification\.

WildGuard: Returns structured output with “Harmful request: yes/no”\. We map “yes” to unsafe\.

MD\-Judge\(MD\-Judge\-v0\_2\-internlm2\_7b\): LLM\-based judge that outputs safe/unsafe classification with reasoning\.

Granite Guardian\(granite\-guardian\-3\.3\-8b\): Returns safe/unsafe with optional violation categories\.

DynaGuard\(DynaGuard\-4B\): Decoder\-based safety model returning binary safe/unsafe labels\.

DuoGuard\(DuoGuard\-0\.5B\): Sequence classifier with 12 risk categories\. Uses sigmoid activation; sample is unsafe if any category probability exceeds 0\.5 threshold\.

EthicalEye: Encoder\-based classifier outputs Safe/Un\-Safe\. We map “Un\-Safe” to unsafe\.

PoliteGuard: Politeness classifier with four labels\. We map “impolite” to unsafe; “polite”, “somewhat polite”, and “neutral” to safe\.

GuardReasoner\(GuardReasoner\-3B\): Llama\-3\.2\-3B fine\-tuned via R\-SFT and HS\-DPO\. Uses simplified prompt for one\-word safe/unsafe output\.

ShieldGemma\(shieldgemma\-2b\): Uses Yes/No generation for policy violation detection\. “Yes” indicates policy violation \(unsafe\)\.

Llama Guard\(Llama\-Guard\-4\-12B\): Returns safe/unsafe with S1–S14 category codes indicating violation type\.

GPT\-OSS Safeguard: Uses NIST category\-specific policy prompts\. Returns JSON with label and rationale\.

MetaHateBERT: Hate speech encoder classifier\. LABEL\_1 \(hate\) maps to unsafe; LABEL\_0 to safe\.

## Appendix CDataset Statistics

### C\.1NIST Category Distribution

Table 10:Sample Distribution by NIST Safety CategoryData Quality Notes\.BeaverTails is the only source for Suicide & Self\-Harm category \(173 samples\), while RealToxicityPrompts is the primary source for Profanity \(5,600\) and Threats \(17,187\)\. Safe samples exclusively come from RealToxicityPrompts due to other datasets being adversarial by design\. Confidence scores are 1\.0 for HarmBench, StrongREJECT, and BeaverTails, and variable \(0\.0–1\.0\) for RealToxicityPrompts\.

### C\.2Threshold Sensitivity Analysis

RealToxicityPrompts samples are labeled based on Perspective API toxicity scores\. For each sample, we computehighest\_score=max⁡\(combined scores across 7 dimensions\)\\text\{highest\\\_score\}=\\max\(\\text\{combined scores across 7 dimensions\}\), where combined score is the average of prompt and continuation scores per dimension\. Samples are labeled unsafe ifhighest\_score\>τ\\text\{highest\\\_score\}\>\\tau\. We analyze sensitivity to thresholdτ∈\{0\.3,0\.4,0\.5,0\.6,0\.7\}\\tau\\in\\\{0\.3,0\.4,0\.5,0\.6,0\.7\\\}\.

Table 11:Dataset Composition at Different ThresholdsTable 12:Threshold Sensitivity: Recall, Precision, and F1 at Different Labeling Thresholds \(All Models\)\. Bold indicates threshold 0\.5 \(default\)\. As threshold increases, ground truth becomes stricter\-fewer samples are labeled unsafe\-causing recall to increase and precision to decrease\.Key findings:

- •Ranking stability: Model rankings remain consistent across all thresholds\. Qwen Guard maintains the highest recall regardless of threshold choice, followed by Nemotron Safety and WildGuard\.
- •Precision\-recall tradeoff: As the threshold increases \(stricter labeling\), recall improves while precision degrades\. At threshold 0\.5, F1 scores peak for most models, indicating optimal balance\.
- •Anomalous behavior: EthicalEye, PoliteGuard, and MetaHateBERT show minimal recall variation across thresholds, suggesting these models’ predictions are largely independent of subtle toxicity distinctions\.
- •Robustness: Our conclusions about relative model performance are robust to threshold selection\.

## Appendix DFalse Negative Analysis \(Full Results\)

This section provides the complete false negative analysis for all 14 models across all 8 NIST safety categories\. False Negative Rate \(FN%\) represents the percentage of unsafe samples incorrectly classified as safe\-lower values indicate better detection\. This comprehensive breakdown reveals category\-specific strengths and weaknesses that are masked by aggregate metrics\. For example, MD\-Judge achieves the lowest FN rates on Violence \(1\.1%\) and Suicide \(0\.6%\), while EthicalEye excels on Harassment \(10\.5%\) and Profanity \(7\.7%\)\-demonstrating that no single model dominates across all harm types\.

Table 13:False Negative Rate \(%\) by NIST Category\-All 14 Models\. Models sorted by Overall FN rate \(best to worst\)\. Bold indicates best performer per category\.Extended observations:

- •Category difficulty: Violence and Suicide & Self\-Harm are consistently the easiest categories for detection \(lowest FN rates\), while Threats and Harassment show the highest variability across models\.
- •Specialized trade\-offs: Some models excel in specific categories despite poor overall performance\. EthicalEye achieves 7\.7% FN on Profanity \(best among all models\) but 81\.4% on Violence\. PoliteGuard achieves 16\.0% on Harassment but 65\.7% on Health\.
- •Performance gap: The gap between best \(Qwen Guard, 15\.9%\) and worst \(MetaHateBERT, 84\.2%\) performers represents a 5\.3×\\timesdifference in false negative rate\.
- •Implicit vs\. explicit harm: Models trained on explicit safety taxonomies \(Qwen, Nemotron\) consistently outperform those trained on general toxicity \(MetaHateBERT, GPT\-OSS\) for contextually harmful content like Threats and Harassment\.

## Appendix EPer\-Source Performance Analysis

To verify that model rankings are not driven by dataset source artifacts, we analyze performance stratified by data source\. RealToxicityPrompts \(RTP\) contains both safe and unsafe samples \(67,521 samples\), while adversarial datasets are 100% unsafe by design\. Tables[14](https://arxiv.org/html/2605.28830#A5.T14)–[16](https://arxiv.org/html/2605.28830#A5.T16)show F1, Recall, and Precision by dataset for all 14 models\.

Table 14:F1 Score by Dataset \(All 14 Models\)\. Models sorted by overall recall\.Table 15:Recall by Dataset \(All 14 Models\)\. Models sorted by overall recall\.Table 16:Precision by Dataset \(All 14 Models\)\. Precision=1\.0 for adversarial datasets because all samples are unsafe\.Key Findings:\(1\) RTP is the most challenging source\-all models show lower F1 on RTP vs adversarial datasets; \(2\) Model rankings are consistent across sources\-Qwen Guard leads on RTP recall \(80\.1%\) and BeaverTails, while bottom models \(MetaHateBERT, GPT\-OSS\) rank poorly everywhere; \(3\) Encoder models \(EthicalEye, PoliteGuard\) excel on RTP but fail on adversarial; \(4\)Top\-7 decoder LLMs maintain consistent rankingsacross sources, confirming robustness to dataset composition\.

Similar Articles

OSGuard: A Benchmark for Safety in Computer-Use Agents

arXiv cs.AI

OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.

Benchmarking Large Language Models for Safety Data Extraction

arXiv cs.CL

This paper benchmarks four large language models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, Llama 3.1-70B) for extracting structured information from Safety Data Sheets, finding that text-based extraction with chain-of-thought prompting yields the highest accuracy (84% by Gemini 1.5 Pro) but no model surpasses the 90% threshold required for reliable industrial deployment.

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG

This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.