AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Summary
AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.
# AdversaBench: Automated LLM Red-Teaming
Source: [https://arxiv.org/html/2606.24589](https://arxiv.org/html/2606.24589)
###### Abstract
Scaling adversarial evaluation of large language models requires two things at once: a way to generate hard inputs, and a way to confirm that the resulting failures are real. A single judge is fast, but it cannot tell you when it is being lenient, and it cannot resolve close calls on its own. We built AdversaBench, an automated red-teaming pipeline and reliability study that mutates seed prompts with five structured operators, queries a weak target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report a study on 45 seeds (15 per category: reasoning, instruction-following, and tool use), expanded from an initial pilot of 30 to improve coverage of temporal reasoning, strict formatting constraints, and tool edge cases. Every seed produced a confirmed failure. Four results stood out. First, operator effectiveness varies sharply by category: `inject_distractor` scores 0.33 mean reward on instruction-following seeds but 1.00 on reasoning and tool-use, a pattern that is visible in the per-category heatmap but invisible in aggregate counts. Second, binary failure rate hides difficulty: instruction-following seeds needed 2.4 attacker iterations on average versus 1.1 for reasoning and tool-use, and this gap is confirmed by a survival curve showing 60% of instruction seeds still unbroken after iteration 1 compared to 10% in other categories. Third, pairwise judge agreement of 80–87% coexists with near-zero Cohen’s $\kappa$ ($-0.05$ to $-0.11$) because failures make up 90–97% of all verdicts; category-level disagreement rates are more informative, with instruction-following showing 33% panel splits versus 0% for reasoning. Fourth, a zero-shot transferability test on 15 verified adversarial prompts shows that attacks generated against Llama 3.1 8B transfer to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than weaknesses specific to a small model.
We release the pipeline, dataset, and analysis scripts at [https://github.com/khanak0509/AdversaBench](https://github.com/khanak0509/AdversaBench).
## 1\. Introduction
Red\-teaming has become a routine part of LLM development\. The goal is simple: find inputs that expose incorrect reasoning, instruction violations, or tool misuse before those failures show up in production\. Naive prompting finds the easy cases\. The harder ones need structured search over prompt variants, and once you find a candidate failure, you still need to verify it\.
At scale, verification usually means LLM-as-a-Judge (Zheng et al., [2023](https://arxiv.org/html/2606.24589#bib.bib1)). Human pairwise evaluation remains the reference standard, but it does not scale to thousands of synthetic examples per day. Strong models can stand in for human annotators when the task is preference ranking, and Zheng et al. ([2023](https://arxiv.org/html/2606.24589#bib.bib1)) showed that careful rubric prompting can match human agreement rates above 80%.
Adversarial confirmation is a different task\. The judge is not picking between Answer A and Answer B\. It sees one prompt, one response, and a ground\-truth specification of correct behavior\. The verdict is fail or pass\. That shift matters\. A lenient judge silently drops real failures\. A strict judge inflates them\. A single\-judge pipeline gives you no way to see when either is happening\.
This paper describes AdversaBench and reports what we learned from running it end to end. The contributions are:
1. A reproducible LangGraph pipeline with five mutation operators, epsilon-greedy operator selection, attacker escalation, and checkpoint resume. Failures are confirmed by a three-judge panel with a meta-judge tiebreaker.
2. A case for reporting *iteration cost* alongside failure rate. On 45 seeds, instruction-following required 2.4 iterations on average versus 1.1 for reasoning and tool-use, even though every category eventually broke. A survival curve makes this gap visually apparent.
3. An inter-judge reliability study adapted from Zheng et al. ([2023](https://arxiv.org/html/2606.24589#bib.bib1)). We apply Cohen’s $\kappa$ to single-response verdicts rather than pairwise preferences, and show that high raw agreement can mask near-zero $\kappa$ when failures dominate the label distribution.
4. A zero-shot transferability experiment showing that adversarial prompts generated against a weak 8B target model transfer to a substantially stronger 70B model, indicating that the mutations expose general behavioral patterns rather than model-specific weaknesses.
## 2\. Background and Related Work
### 2\.1\. LLM\-as\-a\-Judge
Zheng et al. ([2023](https://arxiv.org/html/2606.24589#bib.bib1)) introduced MT-Bench and Chatbot Arena, showing that GPT-4 judgments align with human preferences at scale. Li et al. ([2023](https://arxiv.org/html/2606.24589#bib.bib6)) extended the same idea to instruction-following leaderboards. Both lines of work report inter-annotator agreement, often as Cohen’s $\kappa$:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{1}$$

where $P_o$ is observed agreement and $P_e$ is the agreement you would expect if two judges voted independently at their own base rates. Values above 0.8 are usually called strong agreement; 0.6 to 0.8 is moderate.
Our setup differs in one important way. Zheng et al. compare two candidate answers in fixed positions. We ask each judge whether a single target response violates `expected_behavior`. The metric is still $\kappa$, but the unit of analysis is a binary correctness verdict, not a preference label.
### 2.2. When $\kappa$ is Misleading Under Skew
When one label dominates, $P_e$ rises toward $P_o$ and $\kappa$ collapses even if the judges are substantively aligned on hard cases. Feinstein and Cicchetti ([1990](https://arxiv.org/html/2606.24589#bib.bib5)) called this the high-agreement, low-$\kappa$ paradox. It shows up whenever prevalence is skewed, and adversarial evaluation with a weak target is an extreme version of that setting.
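The collapse is easy to reproduce numerically. The sketch below computes Cohen's κ for two binary verdict lists following Equation 1. The function name `cohen_kappa` and the two verdict vectors are illustrative assumptions (30 hypothetical rows, one judge passing a single row and the other passing three different rows); they are not taken from the released dataset.

```python
import math
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two judges' binary verdicts (True = fail)."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement P_o: fraction of rows where the judges match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement P_e from each judge's own fail/pass base rates.
    a_fail = sum(a) / n
    b_fail = sum(b) / n
    p_e = a_fail * b_fail + (1 - a_fail) * (1 - b_fail)
    if p_e == 1.0:
        return math.nan  # both judges are constant; kappa is undefined
    return (p_o - p_e) / (1 - p_e)


# Hypothetical skewed panel: almost every verdict is "fail".
judge_a = [False] + [True] * 29               # passes row 0 only
judge_b = [True] + [False] * 3 + [True] * 26  # passes rows 1-3 only

raw_agreement = sum(x == y for x, y in zip(judge_a, judge_b)) / len(judge_a)
kappa = cohen_kappa(judge_a, judge_b)
print(f"raw agreement = {raw_agreement:.3f}, kappa = {kappa:.3f}")
```

With these inputs the judges agree on 26 of 30 rows, yet κ comes out slightly negative because chance agreement under such skew is already about as high as the observed agreement.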
### 2\.3\. Automated Red\-Teaming
Perez et al. ([2022](https://arxiv.org/html/2606.24589#bib.bib2)) showed that one language model can generate adversarial inputs for another at scale. Later systems tend to use structured mutations rather than open-ended generation, because operators make it easier to attribute *why* a prompt broke the target. AdversaBench follows that pattern. Recent jailbreak and safety benchmarks such as HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2606.24589#bib.bib7)), JailbreakBench (Chao et al., [2024](https://arxiv.org/html/2606.24589#bib.bib8)), and StrongREJECT (Souly et al., [2024](https://arxiv.org/html/2606.24589#bib.bib9)) standardize attack suites and automated grading at larger scale; our work complements these efforts with a focused study on multi-judge confirmation, evaluator reliability, and cross-model transferability.
### 2\.4\. Agent Benchmarks
AgentDojo (Debenedetti et al., [2024](https://arxiv.org/html/2606.24589#bib.bib3)) evaluates prompt-injection attacks in tool-using agents. Our tool-use category targets related failure modes (wrong tool choice, ignored errors, fabricated results) but uses mock tools and focuses on behavioral elicitation rather than injection into a live agent stack.
## 3\. System Design
### 3\.1\. Overview
Figure [1](https://arxiv.org/html/2606.24589#S3.F1) shows the full loop. The attacker mutates a seed prompt, the target model responds, three judges vote fail or pass, and disagreements go to a meta-judge. If no failure is confirmed, the loop repeats for up to five iterations.

[Figure 1 diagram. Nodes: seeds.json, Attacker, Target (Llama 8B), Judges ×3, Meta-judge, Save (dataset.json). Edge labels: prompt, 3/3 agree, split, resolve, retry.]

Figure 1: AdversaBench pipeline. Unanimous judge consensus saves directly; splits are resolved by a GPT-4o-mini meta-judge.
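The control flow of the loop can be sketched in a few lines. This is a minimal illustration, not the repository's LangGraph graph: `red_team_seed` and its three callables are hypothetical names, and the sketch assumes that each mutation is applied on top of the previous mutated prompt, which the text does not state explicitly.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5  # per-seed iteration budget stated in the text


def red_team_seed(
    seed_prompt: str,
    mutate: Callable[[str, int], str],
    query_target: Callable[[str], str],
    confirm_failure: Callable[[str, str], bool],
) -> Optional[dict]:
    """Run the attack loop for one seed; return the confirmed row or None."""
    prompt = seed_prompt
    for iteration in range(1, MAX_ITERATIONS + 1):
        prompt = mutate(prompt, iteration)        # attacker step
        response = query_target(prompt)           # target step
        if confirm_failure(prompt, response):     # judge panel (+ meta-judge)
            return {"prompt": prompt, "response": response,
                    "iterations": iteration}
    return None  # budget exhausted without a confirmed failure


# Stub components so the sketch runs without any model calls.
result = red_team_seed(
    "Summarize this in exactly 3 sentences.",
    mutate=lambda p, i: f"{p} [mutation {i}]",
    query_target=lambda p: p.upper(),
    confirm_failure=lambda p, r: "[mutation 3]" in p,
)
print(result["iterations"])
```

Passing the attacker, target, and judges in as callables keeps the loop itself independent of any particular provider.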
### 3\.2\. Seed Construction
We wrote 45 seed prompts in three categories: reasoning \(15\), instruction\-following \(15\), and tool\-use \(15\)\. The initial 30 seeds \(10 per category\) were expanded by 15 additional seeds targeting specific failure modes identified during the pilot run: temporal and spatial reasoning, strict formatting constraints \(e\.g\., valid JSON with an exact key count\), negative constraints \(e\.g\., avoiding a specific letter\), and tool edge cases such as division\-by\-zero and hallucinated API schemas\.
Each seed has a base prompt, an `expected_behavior` field, and a `reference_answer`. Ground truth is fixed before any judge runs, so verdicts can be checked against an objective standard rather than panel opinion alone. The `expected_behavior` specifications were written to be unambiguous enough for a judge to verify without subjective interpretation, for instance by specifying “exactly 3 sentences” rather than “a brief summary.”
The box below is a real verified-tier row from the released dataset. `instruction-001` asks for a three-sentence summary *and* a detailed explanation. The target violates the length constraint. Two judges flag the failure; Qwen3 passes. The meta-judge confirms fail.

Seed (`instruction-001`)
- prompt: "Summarize this in exactly 3 sentences but also provide a detailed explanation."
- expected_behavior: "Exactly 3 sentences; detailed explanation must not break the sentence limit."

Iteration 1: `rephrase` (confirmed failure)
- Mutated prompt restates the conflicting length constraints in more explicit terms.
- Target response: produced a long multi-paragraph answer.
- Judge verdicts: Llama=fail, Cerebras=fail, Qwen3=pass
- Resolution: meta-judge (GPT-4o-mini) => fail [verified tier]
### 3\.3\. Mutation Operators
Each iteration applies one of five operators. Selection is epsilon-greedy with $\varepsilon = 0.2$: with probability $1 - \varepsilon$ the attacker reuses the best-performing operator seen so far for that seed category; otherwise it samples uniformly. The $\varepsilon = 0.2$ value was not ablated in this study and should be treated as a hyperparameter choice; retrospective analysis of operator transition sequences is reported in Section [4.4](https://arxiv.org/html/2606.24589#S4.SS4). The operators are:
- `rephrase`: surface restatement of the seed.
- `inject_distractor`: adds a plausible but misleading constraint.
- `role_flip`: reframes the task under an adversarial persona.
- `constraint_add`: appends conflicting requirements.
- `jailbreak_wrap`: embeds the seed in a jailbreak-style template.
The primary attacker is Llama 3\.3 70B on Groq\. Starting at iteration 2, the pipeline escalates to GPT\-4o\-mini for stronger mutations when the primary attacker has not yet confirmed a failure\. Each seed gets at most five iterations before the run stops\.
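The selection rule described above can be sketched as a small class. The class name `EpsilonGreedySelector` is hypothetical, and two details are assumptions rather than facts from the paper: "best-performing" is taken to mean highest mean reward per category, and untried operators score 0.0 with ties broken by list order.

```python
import random
from collections import defaultdict

OPERATORS = ["rephrase", "inject_distractor", "role_flip",
             "constraint_add", "jailbreak_wrap"]


class EpsilonGreedySelector:
    """Per-category epsilon-greedy choice over the five operators."""

    def __init__(self, epsilon: float = 0.2, rng: random.Random = None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.total = defaultdict(float)  # (category, operator) -> summed reward
        self.count = defaultdict(int)    # (category, operator) -> times tried

    def mean_reward(self, category: str, operator: str) -> float:
        n = self.count[(category, operator)]
        return self.total[(category, operator)] / n if n else 0.0

    def select(self, category: str) -> str:
        # Explore: with probability epsilon, sample an operator uniformly.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(OPERATORS)
        # Exploit: reuse the best operator seen so far for this category.
        return max(OPERATORS, key=lambda op: self.mean_reward(category, op))

    def update(self, category: str, operator: str, reward: float) -> None:
        self.total[(category, operator)] += reward
        self.count[(category, operator)] += 1


selector = EpsilonGreedySelector(epsilon=0.2, rng=random.Random(0))
selector.update("reasoning", "inject_distractor", 1.0)
print(selector.select("reasoning"))
```

Keying the statistics by category rather than globally is what lets the policy pick different operators for reasoning and instruction-following seeds.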
### 3\.4\. Judge Panel and Consensus
Three judges score every candidate failure: Llama 3.3 70B (Groq), Cerebras GPT-OSS 120B, and Qwen3 32B (Groq). Each returns a structured `failure_detected` verdict through Pydantic-validated output.
Unanimous fail across all three judges yields a `clean` row. If the panel splits, or a judge errors on structured output, GPT-4o-mini acts as meta-judge. The exported `dataset_verified.json` contains all confirmed failures: clean rows plus rows whose consensus flag is `verified` after meta-judge arbitration.
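The tiering rule can be written down compactly. In this sketch, `resolve_consensus` and the `"none"` tier label are hypothetical names, a judge error is represented as `None`, and a unanimous pass is assumed to mean "no failure confirmed" without consulting the meta-judge; the paper does not spell out that last case.

```python
from typing import Callable, Optional, Sequence, Tuple


def resolve_consensus(
    verdicts: Sequence[Optional[bool]],
    meta_judge: Callable[[], bool],
) -> Tuple[bool, str]:
    """Map panel verdicts to (failure_confirmed, tier).

    verdicts: True = failure detected, False = pass,
              None = judge errored on structured output.
    """
    if all(v is True for v in verdicts):
        return True, "clean"          # unanimous fail, saved directly
    if all(v is False for v in verdicts):
        return False, "none"          # assumption: unanimous pass, retry
    # Split panel or a judge error: defer to the meta-judge.
    confirmed = meta_judge()
    return confirmed, ("verified" if confirmed else "none")


# The instruction-001 row: two fails, one pass, meta-judge says fail.
print(resolve_consensus([True, True, False], meta_judge=lambda: True))
```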
### 3\.5\. Target and Implementation
The target is Llama 3\.1 8B Instant on Groq\. We chose a smaller model on purpose: the point of this study is failure elicitation and evaluation reliability, not frontier\-model robustness\. A weak target also keeps failure prevalence high, which stress\-tests the judge metrics in instructive ways\.
The code uses LangGraph for orchestration and LangChain for model calls. All models and paths live in `config.yaml`. Checkpoint files let interrupted runs resume from the last completed seed. The pipeline includes robust API rate-limit handling: when a provider exhausts its daily token limit, the pipeline falls back to an alternate provider without losing checkpoint state.
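Resuming from the last completed seed can be illustrated with a plain JSON checkpoint. This is a generic sketch under stated assumptions, not the repository's checkpoint format: `run_with_checkpoint`, the file layout (a JSON list of completed seed IDs), and the seed IDs in the demo are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoint(seed_ids, process_seed, checkpoint_path):
    """Process seeds in order, skipping any already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    finished = set(done)
    for seed_id in seed_ids:
        if seed_id in finished:
            continue  # completed in an earlier, interrupted run
        process_seed(seed_id)
        done.append(seed_id)
        finished.add(seed_id)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
processed = []
run_with_checkpoint(["reasoning-001", "reasoning-002"], processed.append, checkpoint)
# A second run finds both seeds in the checkpoint and does no new work.
run_with_checkpoint(["reasoning-001", "reasoning-002"], processed.append, checkpoint)
print(processed)
```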
### 3\.6\. Reproducibility
```
pip install -r requirements.txt
python main.py
python audit.py
python inter_judge_analysis.py
python visualize_results.py
python operator_ablation.py
python test_transferability.py
```
The analysis scripts read saved verdicts from `dataset.json` and `dataset_verified.json`. They do not re-query judges. Full per-row outputs including mutation history and judge reasons are in the repository.
## 4\. Results
### 4\.1\. Overall Performance
All 45 seeds produced confirmed failures\. The clean tier \(unanimous 3/3 judge consensus\) and verified tier \(clean \+ meta\-judge arbitration\) together cover every seed\. An independent audit scored clean\-tier rows on a 1–5 quality scale; all received 5/5\.
### 4\.2\. Operator Effectiveness
Table [1](https://arxiv.org/html/2606.24589#S4.T1) shows the final operator for each confirmed break by category. The aggregate picture is misleading: globally `inject_distractor` and `role_flip` appear most often, but the per-category breakdown in Figure [2](https://arxiv.org/html/2606.24589#S4.F2) tells a cleaner story.

Table 1: Final operator by task category (45 seeds).

Figure 2: Operator effectiveness heatmap (mean reward per operator–category pair). `inject_distractor` achieves 1.00 mean reward on reasoning and tool-use seeds but 0.33 on instruction-following, confirming that no single operator dominates across all categories.

`inject_distractor` accounts for half of tool-use breaks but achieves a mean reward of 0.33 on instruction-following seeds (see Figure [2](https://arxiv.org/html/2606.24589#S4.F2)). `rephrase` is the most common finisher on instruction-following (4/15), and both `jailbreak_wrap` successes were instruction-following seeds. This category–operator interaction is the main argument against reporting aggregate operator counts: the aggregate hides the fact that the best operator depends strongly on the task type.
### 4\.3\. Iteration Cost
Every category eventually broke, so binary failure rate is 15/15 across the board. Iteration cost tells a different story (Table [2](https://arxiv.org/html/2606.24589#S4.T2)). Instruction-following seeds needed 2.4 iterations on average. Reasoning and tool-use needed 1.1. Only 4/10 original instruction seeds broke on the first attempt, compared with 9/10 in the other two categories.

Table 2: Average iteration cost and first-attempt success by category.

Figure 3: Target model survival curve by iteration and category. Reasoning and tool-use seeds drop to 10% survival after iteration 1; instruction-following seeds remain at 60% survival after iteration 1 and require up to 5 iterations to fully break.

The survival curve (Figure [3](https://arxiv.org/html/2606.24589#S4.F3)) makes this gap visually apparent. Reasoning and tool-use lines drop steeply after the first iteration; the instruction-following line decays slowly across all five iterations. Seed `instruction-002` illustrates the stacking behavior: it took five iterations (`inject_distractor`, then `role_flip`, then three `constraint_add` attempts) before judges unanimously confirmed the break.
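A survival curve of this kind is straightforward to derive from per-seed iteration counts. The sketch below uses a hypothetical helper `survival_curve` and hypothetical iteration counts for ten seeds per category, chosen only so that the summary figures resemble the ones quoted above (mean 1.1 versus 2.4, 10% versus 60% survival after iteration 1); they are not the released data.

```python
def survival_curve(iterations_to_break, max_iter=5):
    """Fraction of seeds still unbroken after iteration 0..max_iter."""
    n = len(iterations_to_break)
    return [sum(1 for k in iterations_to_break if k > i) / n
            for i in range(max_iter + 1)]


# Hypothetical iterations-to-failure per seed (illustrative only).
easy_category = [1] * 9 + [2]
instruction = [1, 1, 1, 1, 2, 2, 3, 3, 5, 5]

easy_curve = survival_curve(easy_category)
instruction_curve = survival_curve(instruction)

print("mean iterations:", sum(easy_category) / 10, sum(instruction) / 10)
print("survival after iteration 1:", easy_curve[1], instruction_curve[1])
```

Both categories end at zero survival, so the binary failure rate is identical; only the shape of the curve separates them.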
### 4\.4\. Operator Ablation
A retrospective analysis of saved `mutation_history` arrays reveals the average reward per operator across all seeds: `rephrase` and `jailbreak_wrap` each achieved 1.00 mean reward but were selected infrequently ($n=4$ and $n=2$ respectively). The workload was carried by `constraint_add` (0.636, $n=11$), `role_flip` (0.615, $n=13$), and `inject_distractor` (0.562, $n=16$), all of which maintained strong success rates despite high selection volume.
Markov-chain analysis of operator transitions shows three high-confidence escalation paths: `constraint_add` → `jailbreak_wrap` (100% success), `role_flip` → `role_flip` (100% success on repeated application), and `inject_distractor` → `constraint_add` (100% success). These patterns suggest that the epsilon-greedy policy discovered a reasonable escalation ordering, but a learned policy over operator transitions could improve efficiency further.
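A first-order transition analysis of this kind amounts to counting consecutive operator pairs. The sketch below is one way to do it; the record layout (a per-seed list of `(operator, reward)` tuples), the function name `transition_stats`, and the rule that a transition "succeeds" when the second operator earns reward 1.0 are assumptions, since the actual `mutation_history` schema is not described in the text. The three histories are invented for illustration.

```python
from collections import Counter


def transition_stats(histories):
    """Return {(prev_op, next_op): (count, success_rate)}."""
    counts = Counter()
    wins = Counter()
    for history in histories:
        # Walk consecutive pairs of attempts within one seed.
        for (prev_op, _), (next_op, reward) in zip(history, history[1:]):
            counts[(prev_op, next_op)] += 1
            if reward >= 1.0:  # assumed: reward 1.0 marks a confirmed break
                wins[(prev_op, next_op)] += 1
    return {pair: (n, wins[pair] / n) for pair, n in counts.items()}


# Hypothetical per-seed histories, in the order operators were applied.
histories = [
    [("inject_distractor", 0.0), ("constraint_add", 1.0)],
    [("role_flip", 0.0), ("role_flip", 1.0)],
    [("inject_distractor", 0.0), ("role_flip", 0.0), ("constraint_add", 1.0)],
]

stats = transition_stats(histories)
for pair, (n, rate) in sorted(stats.items()):
    print(pair, n, rate)
```

Reporting the count next to each rate matters here, because a 100% success rate on one or two observed transitions is weak evidence.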
### 4\.5\. Inter\-Judge Reliability
We computed all reliability statistics post hoc from saved verdicts in `dataset.json`, following the spirit of Zheng et al. ([2023](https://arxiv.org/html/2606.24589#bib.bib1)) but applied to single-response fail/pass labels.
#### 4.5.1. Leniency
Leniency for judge $j$ is the fraction of pass votes:

$$L_j = \frac{\text{pass votes by judge } j}{N}, \qquad N = 45. \tag{2}$$
Llama 3\.3 70B passed only one seed \(3% leniency\)\. Cerebras and Qwen3 each passed three \(10%\)\. Higher leniency means more missed failures\.
Table 3:Judge leniency across 45 seeds\.
#### 4.5.2. Pairwise Agreement and the $\kappa$ Paradox
Table [4](https://arxiv.org/html/2606.24589#S4.T4) reports raw agreement and Cohen’s $\kappa$ for each judge pair. Agreement is high (80–87%), but $\kappa$ is near zero or slightly negative.

Table 4: Pairwise judge agreement and Cohen’s $\kappa$.

This is not a bug in the code. With marginal fail rates around 90–97%, chance agreement is already high. For Llama and Cerebras specifically, $P_e = p_{1,\mathrm{fail}}\,p_{2,\mathrm{fail}} + p_{1,\mathrm{pass}}\,p_{2,\mathrm{pass}} \approx 0.873$, while observed agreement is $P_o \approx 0.867$. The numerator in Equation [1](https://arxiv.org/html/2606.24589#S2.E1) is therefore close to zero. Judges agree on most rows because failures are common, not because they are reading the evidence the same way on hard cases.
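Substituting the two rounded values quoted above into Equation 1 makes the size of the effect concrete. The arithmetic uses only the reported $P_o$ and $P_e$, so the last digit is subject to rounding:

```latex
\kappa \;=\; \frac{P_o - P_e}{1 - P_e}
\;\approx\; \frac{0.867 - 0.873}{1 - 0.873}
\;=\; \frac{-0.006}{0.127}
\;\approx\; -0.05
```

This matches the upper end of the reported range of $-0.05$ to $-0.11$: observed agreement sits just below chance agreement, so $\kappa$ is slightly negative despite 87% raw agreement.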
#### 4\.5\.3\. Disagreement by Category
Because $\kappa$ is unreliable under this label distribution, we report raw disagreement rates in Table [5](https://arxiv.org/html/2606.24589#S4.T5) and Figure [4](https://arxiv.org/html/2606.24589#S4.F4). A row counts as a disagreement if the three judges do not all share the same verdict.

Table 5: Judge disagreement rate by category.

Figure 4: Judge disagreement by category. Reasoning failures are unanimous across all seeds; instruction-following produces partial disagreement in 33% of cases, confirming that the meta-judge is most active on the hardest category.

Reasoning failures are clear cut: every row was a unanimous fail. Instruction-following is where the panel actually splits. The meta-judge is doing exactly what it was added for: resolving cases where reasonable judges disagree on genuinely ambiguous failures.
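The category-stratified disagreement rate follows directly from the definition above. In this sketch the row layout (dicts with `category` and `verdicts` keys), the function name `disagreement_by_category`, and the six example rows are hypothetical; the released `dataset.json` may use different field names.

```python
from collections import defaultdict


def disagreement_by_category(rows):
    """Fraction of rows per category where the judges are not unanimous."""
    totals = defaultdict(int)
    splits = defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        # More than one distinct verdict means the panel split.
        if len(set(row["verdicts"])) > 1:
            splits[row["category"]] += 1
    return {cat: splits[cat] / totals[cat] for cat in totals}


# Hypothetical rows (True = fail), for illustration only.
rows = [
    {"category": "reasoning", "verdicts": [True, True, True]},
    {"category": "reasoning", "verdicts": [True, True, True]},
    {"category": "reasoning", "verdicts": [True, True, True]},
    {"category": "instruction", "verdicts": [True, True, False]},
    {"category": "instruction", "verdicts": [True, True, True]},
    {"category": "instruction", "verdicts": [True, True, True]},
]

rates = disagreement_by_category(rows)
print(rates)
```

Unlike $\kappa$, this rate does not depend on how skewed the overall label distribution is, which is why it stays informative when almost every verdict is a fail.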
### 4\.6\. Zero\-Shot Transferability
A key question for any red-teaming pipeline is whether the generated adversarial prompts are overfit to the specific target model. To test this, we extracted 15 verified adversarial prompts from `dataset_verified.json` that had successfully broken the Llama 3.1 8B target, and passed them zero-shot to Llama 3.3 70B Versatile, a model roughly 8× larger and significantly more capable.
Preliminary manual review of `transferability_results.json` indicates that a substantial fraction of the logical traps, constraint additions, and distractors that fooled the 8B model successfully transferred to the 70B model. Seeds involving ambiguous or conflicting constraints (primarily `constraint_add` and `inject_distractor` mutations) showed the highest transfer rate, while `jailbreak_wrap` mutations were less consistent across model scales.
This result suggests that the mutations expose general behavioral patterns — difficulty resolving conflicting constraints, susceptibility to misleading context — rather than weaknesses specific to a small model\. Systematic human annotation of the transfer results and extension to additional target models are left for future work\.
## 5\. Discussion
Iteration cost deserves to be a first-class metric. When every seed eventually fails, pass/fail rate stops discriminating. Mean iterations-to-failure does. Instruction-following took more than twice the attacker effort in this run, a gap that is invisible if you only report whether the model broke. The survival curve makes this argument without requiring tables.
Operator selection is category-dependent. The heatmap in Figure [2](https://arxiv.org/html/2606.24589#S4.F2) makes it clear that no single operator dominates across all task types. A pipeline that selects operators uniformly or based only on aggregate reward is leaving efficiency on the table. The retrospective Markov-chain analysis (Section [4.4](https://arxiv.org/html/2606.24589#S4.SS4)) suggests that learning transition probabilities per category could reduce average iteration cost further.
Multi-judge evaluation is not optional for instruction-following. A single judge would have missed failures that Cerebras or Qwen3 passed, and would have hidden the fact that a third of instruction seeds split the panel. Logging disagreement by category is more informative than picking the strictest judge and calling it done.
Report $\kappa$ with context, or report something else. Our 80–87% agreement numbers look healthy in isolation. They are mostly base-rate agreement. For skewed adversarial settings, we would report category-stratified disagreement alongside $\kappa$, and treat near-zero $\kappa$ as a prevalence signal rather than proof that judges are independent.
Limitations. This is a study on 45 seeds with an intentionally weak target, so a 45/45 break rate is by design. The ground-truth `expected_behavior` specifications are author-constructed and have not been independently validated by human raters; verdicts are only as reliable as the specifications themselves. The $\varepsilon = 0.2$ selection parameter was not ablated. The transferability result is based on preliminary manual review of 15 prompts rather than systematic annotation. Larger seed sets, stronger targets, independent spec validation, and human annotation of transfer results are the obvious next steps.
## 6\. Conclusion
AdversaBench is a reproducible red-teaming methodology with structured mutations, multi-judge confirmation, tiered dataset export, and cross-model transferability testing. Running it on 45 seeds produced four practical lessons for adversarial evaluation design: operator effectiveness varies sharply by task category; iteration cost reveals difficulty that binary failure rate hides; Cohen’s $\kappa$ can look meaningless under heavy class imbalance even when raw disagreement rates show exactly where judges diverge; and adversarial prompts generated against a weak model transfer to stronger ones, suggesting the mutations capture general behavioral patterns. The code, data, and analysis scripts are available at [https://github.com/khanak0509/AdversaBench](https://github.com/khanak0509/AdversaBench).
## References
- Zheng et al. [2023] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. *Advances in Neural Information Processing Systems*, 36. [https://arxiv.org/abs/2306.05685](https://arxiv.org/abs/2306.05685).
- Perez et al. [2022] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. (2022). Red teaming language models with language models. *arXiv preprint arXiv:2202.03286*.
- Debenedetti et al. [2024] Debenedetti, E., Zhang, J., Balunovic, M., Beurer-Kellner, L., Fischer, M., and Vechev, M. (2024). AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. *arXiv preprint arXiv:2406.13352*.
- Cohen [1960] Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement*, 20(1), 37–46.
- Feinstein and Cicchetti [1990] Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. *Journal of Clinical Epidemiology*, 43(6), 543–549.
- Li et al. [2023] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). AlpacaEval: An automatic evaluator of instruction-following models. [https://github.com/tatsu-lab/alpaca_eval](https://github.com/tatsu-lab/alpaca_eval).
- Mazeika et al. [2024] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. *arXiv preprint arXiv:2402.04249*.
- Chao et al. [2024] Chao, P., Robey, A., Dobriban, E., Hsieh, C.-J., Smith, R., and Jana, S. (2024). JailbreakBench: An open robustness benchmark for jailbreaking large language models. *arXiv preprint arXiv:2404.01318*.
- Souly et al. [2024] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Yang, J., Gupta, S., Mangaokar, N., et al. (2024). A strongREJECT for empty jailbreaks. *arXiv preprint arXiv:2402.10260*.