Gate AI: LLM Security Benchmark Evaluation Methodology and Results
Summary
This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
# LLM Security Benchmark Evaluation Methodology & Results
Footnote 1: Working preprint. Subsequent versions may update benchmark numbers, dataset composition, or methodology as the framework evolves; refer to the most recent version for current figures.
Source: [https://arxiv.org/html/2606.02959](https://arxiv.org/html/2606.02959)
(2026-05-27)
###### Abstract
Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector’s threshold is re-tuned to the competitor’s published false-positive rate so head-to-head values are evaluated at matched operating points.
Keywords: LLM Security · Prompt Injections · Jailbreaks · Benchmark
## 1 Introduction
A representative deployment for the systems examined here is an agentic assistant with read access to a user’s email inbox and write access to outgoing mail\. Once attacker\-supplied content reaches the assistant \(a phishing email, a hostile attachment, a poisoned calendar invite\), a single line of natural\-language instruction inside that content can cause the model to draft and dispatch mail on the attacker’s behalf, exfiltrate prior thread contents, or invoke any other tool the assistant has access to\. The same threat surface appears in retrieval\-augmented chat, browser\-automation agents, and any application where untrusted data is mixed into an LLM’s prompt\. Defensive systems that filter these attacks have proliferated, but their published evaluations are uneven: ad\-hoc datasets, undisclosed thresholds, per\-dataset tuning, and inconsistent definitions of what constitutes a positive label make cross\-system comparison difficult\.
This report describes the evaluation harness used to benchmark the detector under test against published competitor numbers\. The work is intentionally scoped to non\-proprietary testing methodology: how the trace is assembled, how the cross\-validation prevents sibling\-chunk leakage, how a single global operating point is selected and applied uniformly, how per\-dataset matched\-FPR comparisons re\-tune the threshold to neutralise FPR mismatches, and how leave\-one\-dataset\-out and random\-label diagnostics stress\-test generalisation\. The detector itself is treated as a black box\. The remainder of the paper is organised as follows\. Section 2 describes the testing methodology in full: trace assembly, leakage\-resistant cross\-validation, inner\-validation and threshold selection, per\-chunk\-to\-per\-prompt aggregation, micro vs macro choice, operating\-point selection, the matched\-FPR comparison protocol, bootstrap confidence intervals, calibration, and the generalisation diagnostics that run on every release alongside the empirical results that ground them \(per\-fold threshold stability, leave\-one\-dataset\-out, random\-label, calibration tables\), plus notes on determinism, limitations, and pretraining contamination\. Section 3 describes the dataset corpus and its attack\-family composition\. Section 4 reports aggregate results, per\-dataset comparisons, and a head\-to\-head with the most\-published commercial competitor\. Section 5 reports end\-to\-end latency\. An appendix glossary defines every metric and term used in the paper\.
## 2 Methodology
### 2\.1 Datasets and trace assembly
The evaluation trace combines 16 public benchmarks covering balanced collections, all\-attack adversarial corpora, and all\-benign over\-defense benchmarks\. Each upstream dataset is loaded through a single source\-tagged loader and the resulting samples are unioned into one trace\. Trace identity is content\-hashed over the loader file contents and a per\-loader sample cap, so re\-runs against the same loader version and cap produce bit\-identical traces; this makes cache reuse safe and reproducibility verifiable\. Per\-dataset descriptions and citations appear in Section 3\.
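A content-hashed trace identity of this kind could be computed roughly as follows. This is a minimal sketch, not the harness's implementation: the function name `trace_id`, the choice of SHA-256, and the `(file_contents, sample_cap)` input shape are all assumptions made for illustration. The source states only that identity is hashed over the loader file contents and a per-loader sample cap.

```python
import hashlib
from typing import Mapping, Tuple


def trace_id(loaders: Mapping[str, Tuple[bytes, int]]) -> str:
    """Hex digest over each loader's file bytes and sample cap (hypothetical scheme).

    `loaders` maps a loader name to (file_contents, sample_cap). Names only
    fix the iteration order; they are not hashed themselves.
    """
    digest = hashlib.sha256()
    for name in sorted(loaders):
        contents, cap = loaders[name]
        # Length-prefix the contents so adjacent fields cannot be confused.
        digest.update(len(contents).to_bytes(8, "big"))
        digest.update(contents)
        digest.update(cap.to_bytes(8, "big"))
    return digest.hexdigest()
```

Under this scheme, changing either a loader file or its cap changes the identity, which is the property that makes cache reuse safe.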
### 2\.2 Cross\-validation
Out-of-fold predictions come from $K=5$ cross-validation. The trace $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$ is partitioned into disjoint folds $\mathcal{D}=\bigsqcup_{k=1}^{K}\mathcal{D}_k$. Two splitters run in parallel:
- StratifiedKFold, preserving the label marginal $P(y\mid\mathcal{D}_k)\approx P(y\mid\mathcal{D})$ for every $k$. This is the headline pass; the trained model and headline F1 / FPR come from it.
- StratifiedGroupKFold (diagnostic), with a *composite* group key $g(\cdot)$ that is the union (under union-find) of two membership relations: the *primary* key, which is the parent-prompt index when the upstream pipeline emits chunked rows and the row id otherwise; and the *near-duplicate* key, given by MinHash + LSH clusters on 5-character shingles, calibrated so rows with Jaccard similarity $\gtrsim 0.8$ collide. Rows that share either key collapse into the same group transitively. The splitter places every member of a group in the same fold (no within-group leakage) while keeping the per-fold label marginal close to $P(y\mid\mathcal{D})$.
The two passes use the same model, same features, same inner-valid early-stopping / threshold methodology; only the fold-assignment policy differs. The gap $\Delta F_1=F_1^{\text{strat}}-F_1^{\text{sgk}}$ estimates the residual leakage premium that exact-text-identity folds would have captured but the composite-grouped folds reject. The earlier paper revision used plain GroupKFold for this diagnostic; that conflated leakage with class-marginal imbalance across folds (GroupKFold does not stratify by label) and turned $\Delta F_1$ into a loose upper bound. Switching the diagnostic to StratifiedGroupKFold removes the class-marginal term, so $\Delta F_1$ on a clean trace tracks leakage alone.
On a per-prompt trace with no chunked rows and no near-duplicate clusters, the composite key degenerates to row identity, the diagnostic partition matches a random stratified KFold, and $\Delta F_1$ is uninformative in that regime; we say so explicitly to avoid claiming protection the diagnostic cannot provide.
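The composite group key described above can be illustrated with a small union-find sketch. Two simplifications are assumed here and are not the paper's method: near-duplicates are found by exact pairwise Jaccard on 5-character shingles (a quadratic stand-in for MinHash + LSH, suitable only for tiny inputs), and the helper names `shingles`, `jaccard`, and `composite_groups` are hypothetical.

```python
from itertools import combinations


def shingles(text, k=5):
    """Set of k-character shingles; short texts yield a single shingle."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def composite_groups(texts, parent_ids, threshold=0.8):
    """One group label per row: rows sharing a parent id OR a near-duplicate
    link end up in the same group, transitively."""
    parent = list(range(len(texts)))  # union-find forest over row indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Primary key: rows chunked from the same parent prompt.
    first_seen = {}
    for i, pid in enumerate(parent_ids):
        if pid in first_seen:
            union(i, first_seen[pid])
        else:
            first_seen[pid] = i

    # Near-duplicate key: exact pairwise Jaccard (stand-in for MinHash + LSH).
    sh = [shingles(t) for t in texts]
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sh[i], sh[j]) >= threshold:
            union(i, j)

    return [find(i) for i in range(len(texts))]
```

The resulting labels are the kind of array that scikit-learn's `StratifiedGroupKFold.split(X, y, groups)` accepts as `groups`, which keeps every member of a group in one fold.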
Figure 1: Per-fold ROC overlay for the headline 5-fold cross-validation. Each curve is one held-out fold’s score distribution against ground truth; tight clustering around a common envelope is the leak-free signal that fold-to-fold performance is stable.
### 2\.3 Inner validation, early stopping, and threshold selection
Each outer fold’s training set $\mathcal{D}_k^{\text{train}}=\mathcal{D}\setminus\mathcal{D}_k$ is split into stratified inner-train and inner-valid subsets in an 85 / 15 ratio. The inner-valid slice is the early-stopping monitor (max 800 iterations, patience 40, log-loss on the inner-valid slice as the stopping criterion) and the operating-threshold selector. The per-fold threshold is
$$\theta_k^{*}=\arg\max_{\theta\in\Theta_k}\,F_1\!\left(\hat{y}_{\theta},\,y\,\middle|\,\mathcal{D}_k^{\text{inner-valid}}\right),$$
where $\Theta_k$ is the union of all observed inner-valid probabilities and a fallback grid $\{0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$. The chosen $\theta_k^{*}$ is applied to the held-out test fold $\mathcal{D}_k$ for hard labels. The canonical operating point is $\tilde{\theta}=\operatorname{median}_k\theta_k^{*}$. The test fold is never an evaluation target during training.
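The per-fold selection rule can be sketched in a few lines. The function names are hypothetical, and the tie-breaking rule (lowest threshold wins among equal F1 values) is an assumption, since the source does not specify one.

```python
from statistics import median

FALLBACK_GRID = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)


def f1_at(threshold, probs, labels):
    """F1 of the hard labels 1[p >= threshold]; 0.0 when there are no true positives."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def pick_fold_threshold(inner_valid_probs, inner_valid_labels):
    """argmax of F1 over observed inner-valid probabilities plus the fallback grid."""
    candidates = sorted(set(inner_valid_probs) | set(FALLBACK_GRID))
    # max() keeps the first maximum, so ties go to the lowest threshold (assumption).
    return max(candidates, key=lambda t: f1_at(t, inner_valid_probs, inner_valid_labels))


def canonical_operating_point(fold_thresholds):
    """Median of the per-fold thresholds."""
    return median(fold_thresholds)
```

Only inner-valid rows enter `pick_fold_threshold`; the held-out fold is scored afterwards with the returned value.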
#### Per\-fold threshold stability \(empirical\)\.
The 5 inner\-valid\-picked thresholds for the headline StratifiedKFold pass:
Mean $\theta_k^{*}=0.460$, median $\tilde{\theta}=0.500$ (the canonical operating point defined in §2.3), $\sigma=0.153$, range $[0.200,0.650]$. Tight clustering around $\tilde{\theta}$ is the leak-free signal: each fold independently arrived at a similar threshold from its own held-out inner-valid slice, without ever seeing its test fold.
#### Adversarial validation\.
Train an auxiliary classifier $g_{\phi}:x\mapsto[0,1]$ to predict whether each row came from train or test: $\phi^{*}=\arg\min_{\phi}\tfrac{1}{N}\sum_{i}\bigl(g_{\phi}(x_i)-\mathbb{1}[i\in\text{test}]\bigr)^{2}$. A well-balanced split yields $\mathrm{AUC}(g_{\phi^{*}})\approx 0.5$. Any meaningful lift means the OOF metric is conflating distribution shift with detection signal.
Figure 2: Adversarial-validation AUC per fold (target $\approx 0.5$).
#### Train\-vs\-OOF agreement\.
Score the per-fold model on its own training rows and compare to the OOF score on the same fold. Define the gap $\Delta^{(k)}=F_1^{\text{train},k}-F_1^{\text{OOF},k}$. A small mean $\bar{\Delta}$ over folds confirms the OOF metric is not under-reporting due to a pipeline-state mismatch between train and score paths; a large $\bar{\Delta}$ signals overfitting.
Figure 3: Per-fold train vs OOF F1.
### 2\.4 Per\-chunk to per\-prompt aggregation
When the pipeline chunks long inputs, training operates per chunk and aggregates at metric time. Let chunks of parent prompt $p$ be indexed by $c\in C(p)$. Continuous probabilities are max-pooled, and the hard label is the single global threshold $\tilde{\theta}=\operatorname{median}_k\theta_k^{*}$ (from §2.3) applied to the max-pooled probability:
$$\hat{p}_p=\max_{c\in C(p)}\hat{p}_{p,c},\qquad\hat{y}_p=\mathbb{1}\!\left[\hat{p}_p\geq\tilde{\theta}\right].$$
Sample-level F1 / FPR / precision / recall come from this hard label; AUC comes from $\hat{p}_p$ (threshold-free).
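As a minimal illustration of the max-pool rule, the sketch below assumes chunk scores arrive as `(parent_id, probability)` pairs; that input shape and the function name are hypothetical.

```python
from collections import defaultdict


def aggregate_per_prompt(chunk_rows, threshold):
    """Max-pool chunk probabilities per parent prompt, then threshold.

    `chunk_rows` is an iterable of (parent_id, probability) pairs.
    Returns {parent_id: (pooled_probability, hard_label)}.
    """
    pooled = defaultdict(float)  # probabilities are >= 0, so 0.0 is a safe start
    for parent_id, prob in chunk_rows:
        pooled[parent_id] = max(pooled[parent_id], prob)
    return {pid: (p, int(p >= threshold)) for pid, p in pooled.items()}
```

With one chunk per prompt the pooled value is simply that chunk's probability, which matches the identity case where $|C(p)|=1$.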
#### Per\-fold vs aggregate operating point\.
The per-fold threshold table in §2.3 reports the per-fold inner-valid-chosen $\theta_k^{*}$, and the per-fold F1 in that table is evaluated at $\theta_k^{*}$ on fold $k$’s held-out rows (per-fold operating point). The aggregate F1/FPR reported in Section 4 use the single $\tilde{\theta}$ applied to every row’s max-pooled probability (global operating point). Per-fold F1 and aggregate F1 are therefore *not directly comparable* on chunked traces where sibling chunks of one parent can land on different folds; the per-fold table is for stability diagnostics, the aggregate is the headline number that ships with one threshold. On the trace evaluated in this paper every prompt fits in a single chunk, so this aggregation reduces to identity ($|C(p)|=1$ for all $p$) and the two operating points coincide; the formulae are reported because the same pipeline scores chunked production traffic where $|C(p)|>1$.
Figure 4: Cascade F1 vs FPR sweep, micro (blue) and macro (orange) on the same axes. Each curve sweeps the global threshold $\theta$ across the full OOF range. The micro curve pools the per-source confusion matrices and is therefore dominated by the bigger sources; the macro curve takes the unweighted per-source F1 mean and gives every source the same weight. A meaningful gap between the two curves at the same FPR is the visual signature of per-source skew: when macro sits below micro the model is leaning on a few large sources at the expense of smaller ones; when macro sits above micro the bigger sources are dragging the pooled metric down. The headline operating point (FPR $\leq 1\%$) and natural threshold are marked on both curves.
### 2\.5 Aggregation: micro vs macro
Let $\mathcal{S}$ be the set of source datasets. Micro-aggregates pool the confusion matrix across all rows, then derive metrics:
$$F_1^{\text{micro}}=\frac{2\,\mathrm{TP}_{\text{pool}}}{2\,\mathrm{TP}_{\text{pool}}+\mathrm{FP}_{\text{pool}}+\mathrm{FN}_{\text{pool}}}.$$
Macro-aggregates take the unweighted mean across sources, skipping single-class slices where the metric is undefined:
$$F_1^{\text{macro}}=\frac{1}{\lvert\mathcal{S}_{\text{def}}\rvert}\sum_{s\in\mathcal{S}_{\text{def}}}F_1^{(s)}.$$
Per-source positive rates vary widely (0% on benign-only sets, 100% on all-attack sets, balanced mixes elsewhere), so micro is dominated by larger sources while macro weights all sources equally. Both views are reported; large gaps indicate per-source skew worth examining.
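The two aggregations can be contrasted with a short sketch. It assumes that a "single-class slice" means a source whose labels are all 0 or all 1, and the function names are hypothetical.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN) for binary labels and hard predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_source):
    """`per_source` maps a source name to (y_true, y_pred) lists."""
    pooled = [0, 0, 0]
    per_source_f1 = []
    for y_true, y_pred in per_source.values():
        counts = confusion(y_true, y_pred)
        pooled = [a + b for a, b in zip(pooled, counts)]
        # Single-class sources still feed the pooled counts but are
        # skipped in the macro mean (assumed reading of "undefined").
        if len(set(y_true)) == 2:
            per_source_f1.append(f1_from_counts(*counts))
    micro = f1_from_counts(*pooled)
    macro = sum(per_source_f1) / len(per_source_f1) if per_source_f1 else float("nan")
    return micro, macro
```

In a toy case with one large perfectly scored source and one small badly scored source, micro stays high while macro drops to the midpoint, which is the per-source skew the two views are meant to expose.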
#### Prevalence ratio per source\.
Per-source positive rate $p_s=\mathbb{E}_{x\in s}[y]$ in the train mix vs the eval mix: $\rho_s=p_s^{\text{eval}}/p_s^{\text{train}}$. A $\rho_s$ far from $1$ on any source signals the evaluation has been re-balanced in a way that flatters or punishes the macro number; we report $\rho_s$ alongside per-source results.
Figure 5: Train vs eval positive-rate ratio $\rho_s$ per source.
### 2\.6 Operating\-point selection
A single global operating point applies uniformly to every dataset\. There is no per\-dataset threshold tuning and no per\-dataset hand\-tuned decision rule\. The operating point is the threshold combination that solves
$$\theta^{\text{op}}=\arg\max_{\theta}F_1(\hat{y}_{\theta},y)\quad\text{subject to}\quad\mathrm{FPR}(\hat{y}_{\theta},y)\leq\tau,$$
solved by exhaustive search over a coarse-but-well-spread candidate grid on out-of-fold predictions, with $\tau$ the target FPR (this run: $\tau=1\%$). $\tau$ is a deployment choice, not a theoretical constant: $1\%$ approximates the operator-reported false-alarm rate above which downstream users report alert fatigue and start ignoring or bypassing the detector. The same $\theta^{\text{op}}$ then scores every dataset; per-dataset values reflect what one global threshold yields, not the best achievable in isolation. The threshold is reported alongside the document so the operating point is auditable.
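The constrained search can be sketched as a plain loop over candidates. The default grid of 0.01 to 0.99 in steps of 0.01 is a placeholder assumption (the source does not list its candidate grid), and the function names are hypothetical.

```python
def fpr_and_f1(theta, probs, labels):
    """FPR and F1 of the hard labels 1[p >= theta]."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        predicted = p >= theta
        if predicted and y == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return fpr, f1


def select_operating_point(probs, labels, tau=0.01, grid=None):
    """Exhaustive search for max F1 subject to FPR <= tau on OOF predictions.

    Returns (theta, f1, fpr), or None if no candidate meets the constraint.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # placeholder grid (assumption)
    best = None
    for theta in grid:
        fpr, f1 = fpr_and_f1(theta, probs, labels)
        # Strict '>' keeps the lowest threshold among equal-F1 candidates.
        if fpr <= tau and (best is None or f1 > best[1]):
            best = (theta, f1, fpr)
    return best
```

Returning `None` when the constraint is infeasible makes the failure visible instead of silently falling back to an unconstrained threshold.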
#### Threshold transferability\.
For each source $s$ we re-pick the threshold $\theta_s^{*}$ that achieves the same target FPR on $s$-only rows. We report the spread $\sigma_{\theta}=\mathrm{stddev}(\{\theta_s^{*}-\theta^{\text{op}}\}_{s\in\mathcal{S}})$. A tight spread is the signature of a transferable operating point; a wide spread means a single global $\theta^{\text{op}}$ is mis-calibrated for some sources.
Figure 6: Per-source $\theta_s^{*}$ at matched FPR vs the global $\theta^{\text{op}}$. Each dot is a source.
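A sketch of the transferability spread follows. Several choices here are assumptions rather than the paper's stated procedure: candidate thresholds are the observed scores of each source, sources without negatives are skipped because their FPR is undefined, and the population standard deviation is used.

```python
from statistics import pstdev


def threshold_at_fpr(probs, labels, target_fpr):
    """Smallest observed score usable as a threshold with FPR <= target_fpr.

    Returns None for sources without negatives (FPR undefined) or when no
    observed score satisfies the target.
    """
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    if not negatives:
        return None
    for theta in sorted(set(probs)):
        fpr = sum(1 for p in negatives if p >= theta) / len(negatives)
        if fpr <= target_fpr:
            return theta
    return None


def transfer_spread(per_source, theta_op, target_fpr=0.01):
    """Std-dev of (theta_s - theta_op) over sources with a defined threshold."""
    deltas = []
    for probs, labels in per_source.values():
        theta_s = threshold_at_fpr(probs, labels, target_fpr)
        if theta_s is not None:
            deltas.append(theta_s - theta_op)
    return pstdev(deltas) if deltas else float("nan")
```

The same `threshold_at_fpr` helper, pointed at a competitor's published FPR instead of the global target, gives the kind of per-dataset re-tuning used for matched-FPR comparisons.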
### 2\.7 Matched\-FPR per\-dataset comparisons
Comparing detectors at the same global threshold can mislead when competitors publish their numbers at very different operating points. To remove that confound, every per-dataset competitor entry that publishes both a primary metric and an FPR is also surfaced with our value re-evaluated at the threshold whose FPR matches theirs on that dataset. The result is a like-for-like comparison at the same FPR per row. Two surrogate paths apply on a subset of rows so the reader can tell them apart from a true match. (a) On datasets whose primary metric is itself FPR or over-defense-accuracy (notinject, wildguard-benign), matching FPR collapses the comparison; our value at the global headline threshold is shown instead and tagged † in both the per-dataset table and the Lakera overview plot. (b) When a competitor publishes only the primary metric without an FPR (which Lakera does on the three gentel-bench attack-type splits), our threshold is matched to a 1% FPR surrogate rather than the competitor’s unknown operating point; those rows are tagged ‡.
#### Threshold\-selection caveat\.
Picking $\theta^{\text{matched}}_s$ to satisfy the competitor’s published FPR uses the same per-source rows that the matched-FPR F1 is then scored on; in the threshold-on-test sense of Cawley & Talbot 2010 [[22](https://arxiv.org/html/2606.02959#bib.bib22)] this is double-dipping, and the point-estimate F1 slightly overstates what a fixed pre-registered threshold would have produced. The matched-$n$ bootstrap CI surfaced in the companion table mitigates this by resampling rows; rows whose draws also re-pick $\theta$ widen the CI to include the threshold-selection uncertainty, so when the competitor’s point estimate falls inside the CI we mark the row as “within noise” rather than claim a win. The headline global-$\theta^{\text{op}}$ F1 in Section 4 is not affected; it is reported at a single threshold picked from inner-valid data without seeing the per-source rows it is scored on.
### 2\.8 Generalisation diagnostics
We stress\-test the headline number with a small set of integrity checks that each carry a precise pass criterion\. Below: name, mechanism, math, and the actual diagnostic plot inline\. Each check runs as part of every evaluation release\.
#### Leave\-one\-dataset\-out \(LODO\)\.
Hold out one source $s$ from training and evaluate Gate only on rows of $s$. For each held-out source we compute $F_1^{(\text{LODO},s)}=F_1(\hat{y}_{-s},y_s)$ where $\hat{y}_{-s}$ is the prediction from a model trained on $\mathcal{D}\setminus\{s\}$. The macro mean $\bar{F}_1^{\text{LODO}}=\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}F_1^{(\text{LODO},s)}$ bounds how much performance drops when a previously unseen distribution arrives.
Figure 7: Leave-one-dataset-out F1 delta from the macro mean per held-out source. Bar = (per-source F1) $-$ (macro F1) in percentage points; macro mean at 0. Blue $\geq$ macro, red below. Held-out F1% and sample count $n$ annotated at right of each row. ilion-bench sits substantially below the macro mean. Its prompt distribution differs markedly from the rest of the public benchmark cohort because of its role-playing structure.
#### Random\-label control\.
Shuffle the labels $\tilde{y}=\pi(y)$ where $\pi$ is a uniform permutation of $\{1,\dots,N\}$ and score the model’s existing hard predictions $\hat{y}$ against the shuffled targets. Under independence, the expected F1 is
$$F_1^{\text{chance}}=\frac{2\,p_y\,p_{\hat{y}}}{p_y+p_{\hat{y}}},$$
where $p_y=\mathbb{E}[y]$ is the label prevalence and $p_{\hat{y}}=\mathbb{E}[\hat{y}]$ is the predicted-positive rate. Anything notably above $F_1^{\text{chance}}$ is row-identity leakage. (A retrain-from-scratch variant of this check, in which the full pipeline is refit on shuffled labels, is also supported but is not part of the cheap CI diagnostic.)
Figure 8: Random-label control: predicted hard labels scored against shuffled targets collapse to the chance F1 baseline $F_1^{\text{chance}}=2p_yp_{\hat{y}}/(p_y+p_{\hat{y}})$, confirming no row-identity leakage.
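The chance baseline and the shuffle itself are simple to reproduce. The sketch below uses hypothetical function names and a fixed seed for repeatability; it covers only the cheap variant that re-scores existing predictions, not the retrain-from-scratch variant.

```python
import random


def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def chance_f1(p_y, p_yhat):
    """Expected F1 when predictions are independent of labels."""
    total = p_y + p_yhat
    return 2 * p_y * p_yhat / total if total else 0.0


def random_label_f1(y_true, y_pred, seed=0):
    """F1 of the existing hard predictions against a shuffled copy of the labels."""
    shuffled = list(y_true)
    random.Random(seed).shuffle(shuffled)
    return f1_score(shuffled, y_pred)
```

Setting the predicted-positive rate to 1 in `chance_f1` gives $2p_y/(1+p_y)$, the always-positive value. Because the chance F1 increases with the predicted-positive rate, that value is also its upper bound.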
#### Random\-label result \(empirical\)\.
Under shuffled labels: AUC = 0.5146 · F1 = 84.44% · always-positive baseline $2p_y/(1+p_y)=84.44\%$ (upper bound on chance F1; $p_{\hat{y}}$ unavailable). Both statistics sit at chance level; no exploitable signal survives label permutation, so out-of-fold predictions are not leveraging row-identity leakage.
#### Length\-bias correlation\.
Pearson correlation between the input text length $|x|$ and Gate’s score $\hat{p}$, $\rho_{\text{len}}=\mathrm{corr}(|x|,\hat{p})$, overall and per-source. A strong $|\rho|>0.3$ flags a shallow detector that is effectively a length heuristic: longer prompts have more surface area for malicious tokens to land in, so a detector that learns to flag long inputs would inherit some of the signal without actually understanding the attack. The per-source breakdown matters because a global correlation near zero can hide source-specific biases that cancel out in aggregate. Sources where benign and attack examples differ sharply in length are flagged separately so the per-source $\rho$ can be inspected.
Figure 9: Length-bias Pearson correlation between input length and Gate score, overall and per source.
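The overall and per-source correlations can be computed as sketched below. Length is measured in characters here, which is an assumption (the source does not say whether $|x|$ counts characters or tokens), and the function names and row shape are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def length_bias(rows, flag_above=0.3):
    """`rows` is a list of (source, text, score) tuples.

    Returns {key: (rho, flagged)} for each source plus the key 'overall'.
    """
    groups = {}
    for source, text, score in rows:
        groups.setdefault(source, []).append((len(text), score))
    groups["overall"] = [(len(text), score) for _, text, score in rows]
    report = {}
    for key, pairs in groups.items():
        rho = pearson([length for length, _ in pairs], [s for _, s in pairs])
        report[key] = (rho, abs(rho) > flag_above)
    return report
```

Two sources with opposite length trends can produce an overall correlation near zero while each is flagged individually, which is why the per-source breakdown is reported.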
#### Permutation feature importance\.
For each feature $j$, permute its column $x_{\cdot j}\to x_{\pi(\cdot)j}$ and re-score on OOF predictions. The importance is the drop in F1: $\Delta F_1^{(j)}=F_1(\hat{y},y)-F_1(\hat{y}^{\pi_j},y)$. Large $\Delta F_1^{(j)}$ confirms the feature is load-bearing in the held-out regime, beyond what a training-time importance ranking can credit, since the latter can attribute weight to features that fail to generalise out of fold.
Figure 10: Top permutation-importance features by held-out F1 drop.
#### Classifier-head agreement (Cohen’s $\kappa$).
Pairwise Cohen’s $\kappa$ between the hard predictions of the ensemble heads: $\kappa=\frac{p_o-p_e}{1-p_e}$ where $p_o$ is observed agreement and $p_e$ is chance agreement. Low $\kappa$ values confirm the heads carry independent signal worth combining; $\kappa\to 1$ would mean the heads are redundant. Head names are anonymised in the rendered figure. *Empirical:* pairwise $\kappa\in[0.45,0.56]$ across 3 head pairs; the heads carry distinguishable signal.
Figure 11: Pairwise Cohen’s $\kappa$ between ensemble heads.
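For binary hard predictions, the pairwise $\kappa$ computation reduces to a few lines. The function names and the placeholder head names are hypothetical.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary hard predictions."""
    n = len(a)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n   # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                  # positive rate per head
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


def pairwise_kappa(heads):
    """`heads` maps an (anonymised) head name to its hard predictions."""
    return {
        (h1, h2): cohen_kappa(heads[h1], heads[h2])
        for h1, h2 in combinations(sorted(heads), 2)
    }
```

Three heads yield three pairs, the same number of head pairs as in the empirical summary above.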
#### Other checks \(summary only\)\.
Cross\-source near\-duplicate hashing catches prompts that appear in multiple datasets under conflicting labels \(the most common form of implicit data\-quality leak when stitching public benchmarks\), and surfaces them for manual relabel; determinism replay diffs two runs with identical seed for byte\-equal OOF probabilities; a paraphrase\-invariance check bounds how lexically\-shallow the final\-stage signal is\.
### 2\.9 Confidence intervals and bootstrap
Every headline metric (F1, precision, recall, FPR, AUC) is reported with a 95% stratified bootstrap confidence interval. Resamples are drawn within stratification cells $(s,y)\in\mathcal{S}\times\{0,1\}$ (source $\times$ label), so rare-class slices do not collapse; within each cell $(s,y)$ we draw $n_{s,y}$ rows with replacement, so the total resample size equals the original $N$ exactly. With $B=10\,000$ resamples and per-resample metric estimates $\hat{\theta}^{*(1)},\dots,\hat{\theta}^{*(B)}$ (here $\hat{\theta}$ denotes a metric estimate, not a threshold), the percentile interval is
$$\mathrm{CI}_{95}(\hat{\theta})=\bigl[\,\hat{\theta}^{*}_{(\lceil 0.025B\rceil)},\ \hat{\theta}^{*}_{(\lfloor 0.975B\rfloor)}\,\bigr].$$
Per-dataset CIs are stratified by label only. Datasets with $n<200$ are flagged as small-n (their CI is correspondingly wide). When a competitor’s published point estimate falls inside our CI we treat the comparison as within noise rather than claiming a win. $B$ controls tail-stability of the CI (with $B=10\,000$ each tail is determined by $\approx 250$ resamples, reproducible across seeds to better than a column-width of the CI); the CI width itself is driven by $n$.
The full\-page forest figure visualises this directly: Gate’s per\-dataset point estimate is a tapered blue band \(50% CI core, 80 / 95 / 99% tails\), grouped by attack family; red diamonds mark Lakera Guard, grey dots every other published competitor \(asterisk = self\-evaluation, treated as an upper bound\)\.
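A stratified percentile bootstrap along these lines can be sketched with the standard library alone. The row shape `(source, label, prediction)`, the function name, and the seeding scheme are assumptions for illustration; the order-statistic indices follow the percentile formula in §2.9.

```python
import math
import random
from collections import defaultdict


def stratified_bootstrap_ci(rows, metric, n_resamples=10_000, seed=0):
    """95% percentile CI, resampling within (source, label) cells.

    `rows` is a list of (source, label, prediction) tuples; `metric` maps a
    list of such rows to a float.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row in rows:
        cells[(row[0], row[1])].append(row)
    estimates = []
    for _ in range(n_resamples):
        sample = []
        for cell_rows in cells.values():
            # Draw exactly n_{s,y} rows with replacement from each cell,
            # so every resample has the original total size N.
            sample.extend(rng.choices(cell_rows, k=len(cell_rows)))
        estimates.append(metric(sample))
    estimates.sort()
    # Convert 1-based order statistics to 0-based list indices.
    lo = estimates[math.ceil(0.025 * n_resamples) - 1]
    hi = estimates[math.floor(0.975 * n_resamples) - 1]
    return lo, hi
```

Because each cell keeps its original size, a rare (source, label) slice can never vanish from a resample, which is the reason for stratifying.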
### 2\.10 Calibration and threshold sweeps
Probability calibration is reported pre- and post-isotonic regression. Brier score and Expected Calibration Error (ECE) are both reported; ECE is computed over $M=15$ equal-width confidence bins $B_1,\dots,B_M$ and the Brier score over all $N$ rows:
$$\mathrm{ECE}=\sum_{m=1}^{M}\frac{|B_m|}{N}\bigl|\operatorname{acc}(B_m)-\operatorname{conf}(B_m)\bigr|,\qquad\mathrm{Brier}=\frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i-y_i)^{2}.$$
A reliability diagram checks whether the detector’s output probabilities are distribution-calibrated against the empirical positive rate at each predicted-probability bucket.
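Both calibration statistics can be sketched directly from the formulas. One interpretation is assumed: bins are taken over the predicted positive-class probability, with $\operatorname{acc}(B_m)$ read as the observed positive rate in the bin and $\operatorname{conf}(B_m)$ as the mean predicted probability, following the reliability-diagram description. The function names are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE over equal-width bins on the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(y for _, y in members) / len(members)
            # Weight each bin's gap by its share of rows.
            ece += len(members) / n * abs(acc - conf)
    return ece
```

Running both functions on raw and isotonic-calibrated probabilities gives the pre- and post-calibration pair of numbers.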
#### Calibration \(empirical\)\.
Pre- and post-isotonic regression over $M=15$ equal-width bins:
Figure 12: Reliability diagram: predicted probability bin vs observed positive rate, with a per-bin sample-count histogram below.
Figure 13: Per-dataset primary metric with nested bootstrap CIs.
### 2\.11 Comparing against external competitors
External baselines come from third\-party published numbers \(papers, model cards, dataset cards\)\. Self\-evaluations are marked with an asterisk and treated as upper bounds because the competitor’s training split may overlap the dataset\. We do not rerun frozen published numbers on our own trace \(evaluation splits, dataset versions, and preprocessing all differ\), so head\-to\-head deltas are bracketed by our own bootstrap CIs, and per\-dataset footnotes record the competitor’s publishednnand the date of their evaluation\.
### 2\.12 Determinism and reproducibility
Every randomised step in the evaluation pipeline is seeded from a singleSEEDconstant: the outer CV split, the inner\-valid split inside each fold, and the bootstrap\-resample sequence used for confidence intervals\. Intermediate artifacts are checkpointed to JSONL so downstream metrics replay deterministically from the recorded artifact rather than re\-running upstream computation that may drift over time\.
### 2\.13 Limitations
The composite group key in Section 2.2 covers exact-text duplicates (SHA-256 of normalised text) and approximate-text duplicates (MinHash + LSH on 5-character shingles at a Jaccard cut-off of $\approx 0.8$). MinHash + LSH was chosen for its scalability on larger evaluation traces: clustering cost is near-linear in the row count, so the diagnostic remains tractable as the corpus grows. Adversaries that mutate prompts beyond that shingle-overlap budget (aggressive paraphrasing, language switching, encoding-level transforms) will not be linked into the same group and could in principle leak across folds; future evaluations will pair MinHash with more comprehensive similarity functions (sentence embeddings, dense-retrieval search indexes) for the overlap calculation so semantic near-duplicates that shingle-Jaccard misses also get caught. Per-dataset matched-FPR comparisons use the competitor’s published per-dataset FPR as the threshold target where available, but some sources publish only aggregate numbers, in which case the comparison falls back to the global threshold and is tagged as such in figures. The benchmark breadth itself is limited to publicly available datasets with at least one published competitor number on a metric we evaluate; the breadth grows as new academic releases are added.
### 2\.14 Pretraining contamination
Public prompt-injection datasets such as deepset/prompt-injections, xTRam1/safe-guard-prompt-injection, and BIPIA are widely indexed and almost certainly appear in the pretraining data of any modern foundation-model-derived detector. This applies to the system under test and to every competitor built on or fine-tuned from a publicly pretrained base, so it is not a per-system disclaimer but a property of the field. Mitigations tracked as follow-up work include (a) held-out evaluation on novel attack patterns generated after foundation-model training cutoffs, and (b) per-source comparison of test F1 against the rate at which the source appears in publicly indexed corpora.
## 3 Data
Every dataset in this leaderboard is publicly available, with its primary metric defined a priori and its competitor citations sourced from third\-party publications \(papers, model cards, dataset cards\) unless explicitly marked as a self\-evaluation\. The trace is content\-hashed by its loader source set so any re\-run on the same loader hash is bit\-identical\.
### Trace composition by attack family
Datasets grouped by the attack distribution they primarily stress\-test\. The same global operating point scores every family, without per\-family tuning\. 3 synthetic datasets are excluded from this headline trace and from every metric reported in Sections 4–5; they appear only in the OOF diagnostics, where they sanity\-check threshold transferability\.
Figure 14: Metric availability across published competitor entries: F1 and recall dominate, FPR is rarest, which is why matched-FPR comparisons need per-dataset fallbacks.
### Per\-dataset detail
*Rank column.* Gate’s rank on this dataset’s primary metric among all comparable systems (Gate plus every third-party-verified competitor that publishes that metric), shown as rank / total. The total counts only systems publishing the primary metric, so it can be smaller than the # comp. column, which counts every competitor publishing any comparable metric (F1 / recall / precision / FPR). Computed at Gate’s global headline threshold, not the matched-FPR line.
*Gate vs best comp. column.* The main row is Gate at the global headline operating point (FPR $\leq 1\%$) vs the best third-party-verified competitor on this dataset’s *primary metric* (the metric the dataset was designed to be scored on: F1 for balanced splits, Recall for all-attack collections, ODA $= 1 - \text{FPR}$ for all-benign probes, FPR for over-defense corpora). Each cell leads with the named metric (F1 / Recall / Precision / Accuracy / ODA / FPR) so the comparison axis is unambiguous. The second line, where present, is the like-for-like matched-FPR comparison: Gate re-thresholded to a competitor’s published FPR on this dataset, scored on the same primary metric. The matched-FPR comparator is the first competitor walked down the metric-ranked list (best first) that publishes an FPR *and* for which retuning Gate to that FPR improves on Gate’s value at the global headline operating point. Lakera Guard is excluded from this walk because the right-hand column already carries that comparison. When the best competitor publishes FPR and the retune improves Gate, the second line and the main row name the same comparator; otherwise the walk falls through to the highest-metric non-Lakera competitor that satisfies both conditions. When no non-Lakera competitor on the row publishes an FPR, or when every such retune would lower Gate’s value, the matched-FPR line is dropped. Datasets whose primary metric is FPR or ODA (notinject, wildguard-benign) skip the second line because matching FPR is degenerate there (the matched axis equals the primary axis).
*Gate vs Lakera column.* All Lakera-column rows are true matched-FPR comparisons (Gate’s threshold re-tuned per row to hit Lakera’s published FPR), except those carrying one of the two markers below.
† Lakera comparison shown at Gate’s global headline threshold because matching FPR collapses the comparison (the dataset’s primary metric is FPR / over-defense-accuracy itself). Applies to: notinject, wildguard-benign.
‡ Lakera published the primary metric but not an FPR for this dataset, so Gate’s threshold is matched to a 1% FPR surrogate instead of Lakera’s (unknown) operating point. Applies to: gentel-jailbreaking, gentel-goal-hijacking, gentel-prompt-leaking.
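The re-thresholding and comparator walk described above can be sketched as follows. This is a minimal illustration, not the paper’s harness: the function names, the `>=` flagging convention, the use of recall as the primary metric, and all toy scores and competitor entries are assumptions made for the example.

```python
def confusion(scores, labels, threshold):
    """Return (tp, fp, tn, fn); a row is flagged when score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if y == 1:
            tp, fn = tp + flagged, fn + (not flagged)
        else:
            fp, tn = fp + flagged, tn + (not flagged)
    return tp, fp, tn, fn


def threshold_at_fpr(scores, labels, target_fpr):
    """Smallest threshold whose empirical FPR stays at or below target_fpr."""
    if 0 not in labels:
        raise ValueError("matching an FPR needs at least one benign row")
    # FPR is non-increasing in the threshold, so scan candidates in ascending order.
    for t in sorted(set(scores)) + [float("inf")]:
        _, fp, tn, _ = confusion(scores, labels, t)
        if fp / (fp + tn) <= target_fpr:
            return t
    return float("inf")


def recall_at(scores, labels, threshold):
    tp, _, _, fn = confusion(scores, labels, threshold)
    return tp / (tp + fn) if tp + fn else 0.0


def pick_matched_comparator(competitors, metric_at_fpr, headline_value, excluded=()):
    """Walk competitors best-first; return the first one that publishes an FPR
    and for which retuning to that FPR improves on the headline value."""
    for c in sorted(competitors, key=lambda c: c["metric"], reverse=True):
        if c["name"] in excluded or c.get("fpr") is None:
            continue
        if metric_at_fpr(c["fpr"]) > headline_value:
            return c["name"]
    return None  # the matched-FPR line is dropped


# Toy data (hypothetical): five benign rows followed by five attack rows.
scores = [0.05, 0.10, 0.20, 0.30, 0.55, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
headline_recall = recall_at(scores, labels, 0.60)  # hypothetical global threshold

# Hypothetical competitor entries; "system-c" stands in for the excluded vendor.
competitors = [
    {"name": "system-a", "metric": 0.95, "fpr": None},
    {"name": "system-c", "metric": 0.93, "fpr": 0.2},
    {"name": "system-b", "metric": 0.90, "fpr": 0.2},
]
chosen = pick_matched_comparator(
    competitors,
    lambda fpr: recall_at(scores, labels, threshold_at_fpr(scores, labels, fpr)),
    headline_recall,
    excluded={"system-c"},
)
print(chosen)
```

In this toy run the best-ranked entry publishes no FPR and the second is excluded, so the walk falls through to the third entry, whose FPR budget lets the retuned threshold recover more recall than the headline threshold.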
### Coverage gaps
Areas this report does *not* cover. Customers should treat claims as scoped to the distributions actually evaluated above.
- Adversarial pixel and image-payload attacks (vision-only channels). The current evaluation covers text inputs.
- Voice-channel jailbreaks (audio prompts transcribed inside the agent loop). Tracked as follow-up but not in this report.
- Languages beyond the public multilingual benchmarks (MultiJail’s nine languages and SEA-SafeguardBench’s eight). Coverage of low-resource and emerging-language attack patterns is a known gap.
- Long-context document injection beyond the chunk lengths of the public benchmarks (typical max ~8 KB). Production traces with very long documents may follow a different distribution.
- Code-execution and shell-injection payloads embedded in natural-language prompts (e.g. subprocess wrappers, Python eval bait). The current trace contains a handful of such examples through harmbench and salad-data, but no dedicated benchmark for code-execution intent.
- Memory-poisoning and stored-prompt attacks where the injection lands in an upstream cache, RAG store, or session memory and fires on a subsequent unrelated request. The current evaluation scores each prompt independently; cross-turn or cross-session attacks are not represented.
- Adaptive (white-box) attackers who optimise prompts against a frozen Gate snapshot. The published benchmarks here are mostly static; competitive evaluation against an active attacker who queries the detector before crafting their payload is left for future work.
- Drift across model generations: the cascade is anchored to a specific reasoning-model checkpoint. Behaviour against attack distributions targeted at newer or different base models has not been measured.
- Prompts in domain dialects (legal contracts, clinical notes, financial filings) where lexical priors differ sharply from the general-web text used to train the public benchmarks. The evaluation under-samples these distributions.
- Sibling-chunk and near-duplicate leakage protection is a diagnostic, not a guarantee. Headline F1/FPR come from the StratifiedKFold pass; the parallel StratifiedGroupKFold pass is reported only as a leakage diagnostic via the F1 gap $\Delta F_1 = F_1^{\text{strat}} - F_1^{\text{sgk}}$. Where the composite group-key degenerates to row-identity (no chunked rows and no near-duplicate clusters) the diagnostic delta is uninformative.
*Scope of the claim.* The per-dataset numbers in this paper apply to the model snapshot identified on the title page, evaluated on the public benchmark cohort listed in §3. The deployed model is fine-tuned on additional internal data distributions that are not part of any benchmark above; production deployments may therefore report different per-dataset numbers than this paper. Where benchmarks are revised (dataset hardening, new versions) or the model is retrained, we re-evaluate and publish a revised version of this paper rather than retroactively edit existing figures.
## 4 Results
![[Uncaptioned image]](https://arxiv.org/html/2606.02959v1/plots/global_competitor_pareto.png)
Figure 15: Global Pareto: F1 vs FPR averaged across each system’s independently verified rows on the benchmarks evaluated here. Marker colours: blue = Gate at the natural threshold ($\theta = 0.5$), green = Gate at the 1% FPR headline operating point, red = Lakera Guard, grey = other competitors with evidence on $n_{\mathrm{ds}} \geq 2$ benchmarks, orange = single-benchmark competitors ($n_{\mathrm{ds}} = 1$). We keep entries where F1 and FPR are published on the same row and the source is third-party; vendor blogs and self-reported numbers are excluded. Per-system rows are averaged within a benchmark first, then across benchmarks, so each benchmark contributes once. Marker size scales with mean published competitor latency; Gate’s marker uses the median competitor latency as a presentational fallback (the cascade has no single representative latency, and §5 reports Gate’s actual distribution).
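The two-level averaging named in the caption (within a benchmark first, then across benchmarks) can be illustrated with a short sketch. The function name, tuple layout, and the rows themselves are hypothetical and chosen only to show why the order of averaging matters.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_system(rows):
    """Average (f1, fpr) within each benchmark first, then across benchmarks.

    rows: iterable of (system, benchmark, f1, fpr) tuples, one per published row.
    Returns {system: (mean_f1, mean_fpr, n_benchmarks)}.
    """
    within = defaultdict(list)
    for system, benchmark, f1, fpr in rows:
        within[(system, benchmark)].append((f1, fpr))

    across = defaultdict(list)
    for (system, _benchmark), values in within.items():
        across[system].append(
            (mean(v[0] for v in values), mean(v[1] for v in values))
        )

    return {
        system: (mean(v[0] for v in values), mean(v[1] for v in values), len(values))
        for system, values in across.items()
    }


# Hypothetical rows: two published rows on bench-a, one on bench-b.
rows = [
    ("system-x", "bench-a", 0.80, 0.02),
    ("system-x", "bench-a", 0.90, 0.04),
    ("system-x", "bench-b", 0.60, 0.01),
]
mean_f1, mean_fpr, n_ds = aggregate_per_system(rows)["system-x"]
print(round(mean_f1, 3), round(mean_fpr, 3), n_ds)
```

Here the two-level mean F1 is 0.725, whereas a flat mean over the three rows would be about 0.767: the benchmark with more published rows no longer pulls the system’s point toward itself.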
Two operating points anchor the comparison. The *headline* threshold $\theta_{1\%}$ is the value picked on the held-out folds that bounds pooled FPR at $\leq 1\%$, applied uniformly to every dataset. The *natural* threshold $\theta_{F_1}$ is the unconstrained F1-maximiser ($\theta = 0.5$ on the calibrated probability). The table below reports both, for both micro (pooled confusion matrix) and macro (per-source F1 then averaged) aggregations across the 16 public datasets (12,111 samples, 5-fold StratifiedKFold (by row)).
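A minimal sketch of how a single global operating point and the two aggregations could be computed from pooled out-of-fold scores is given below. It assumes the max-F1-subject-to-FPR-cap rule stated in the abstract; the function names and the toy scores are hypothetical, and the toy cap is loosened to 0.25 only because four benign rows cannot resolve a 1% FPR.

```python
from statistics import mean


def f1_and_fpr(rows, threshold):
    """rows: (score, label) pairs, label 1 = attack. Flag when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in rows:
        flagged = score >= threshold
        if label == 1:
            tp, fn = tp + flagged, fn + (not flagged)
        else:
            fp, tn = fp + flagged, tn + (not flagged)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


def select_global_threshold(pooled_rows, fpr_cap=0.01):
    """Max-F1 threshold among candidates whose pooled FPR stays within the cap."""
    best_threshold, best_f1 = float("inf"), -1.0
    for threshold in sorted({score for score, _ in pooled_rows}):
        f1, fpr = f1_and_fpr(pooled_rows, threshold)
        if fpr <= fpr_cap and f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold


def micro_macro_f1(rows_by_source, threshold):
    """Micro: F1 of the pooled confusion matrix. Macro: mean of per-source F1."""
    pooled = [row for rows in rows_by_source.values() for row in rows]
    micro = f1_and_fpr(pooled, threshold)[0]
    macro = mean(f1_and_fpr(rows, threshold)[0] for rows in rows_by_source.values())
    return micro, macro


# Hypothetical out-of-fold scores for two sources.
oof = {
    "source-a": [(0.9, 1), (0.8, 1), (0.3, 0), (0.2, 0)],
    "source-b": [(0.7, 1), (0.4, 1), (0.6, 0), (0.1, 0)],
}
pooled = [row for rows in oof.values() for row in rows]
theta = select_global_threshold(pooled, fpr_cap=0.25)  # toy cap, not the 1% headline cap
micro, macro = micro_macro_f1(oof, theta)
print(theta, round(micro, 3), round(macro, 3))
```

The same threshold is applied to every source when computing the macro average, which is what distinguishes a global operating point from per-dataset tuning.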
At the headline operating point, Gate ranks \#1 on 8 ranked datasets, \#2 on 3, \#3 on 1; 1 dataset \(hackaprompt\) has no published third\-party comparator with a comparable metric and is omitted from the per\-dataset table\.
### 4\.1 Per\-dataset leaderboard
Each dataset uses its *primary metric* (F1 for balanced splits, Recall for all-attack collections, Over-Defense Accuracy for all-benign sets; asterisk = self-evaluation, treat as upper bound). The per-system per-dataset Pareto below plots one point per (system $\times$ dataset) pair; the aggregated mean Pareto at the top of the section is the same data projected onto a single point per system. The per-dataset table grounding these plots lives in Section 3 (Data).
Figure 16: Detection Error Tradeoff on probit axes. Probit scaling stretches the low-FPR regime where competitor differentiation happens.
Figure 17: Maximum TPR at each FPR budget. Gate sweeps $\theta$ along the curve with a bootstrap band; black dots are individual competitors.
Figure 18: Every (system $\times$ dataset) operating point. Gate at natural and at FPR-1%; Lakera Guard in red; second-most-published competitor in orange; everything else grey.
### Head\-to\-head: Lakera Guard
Lakera Guard is the most-cited commercial competitor in this space and publishes per-dataset numbers on the broadest set of benchmarks. The figures below report the head-to-head comparison at *per-dataset matched FPR* so the comparison is like-for-like regardless of where Lakera placed its operating point. Two caveats apply on a subset of rows. (a) When the dataset’s primary metric is FPR or over-defense-accuracy (notinject, wildguard-benign), matching FPR collapses the comparison, so Gate’s value at the global headline threshold is shown instead, marked †. (b) When Lakera published the primary metric but no FPR for a dataset (the three gentel-bench splits), Gate’s threshold is matched to a 1% FPR surrogate rather than Lakera’s unknown operating point, marked ‡. The footnote under each figure enumerates which rows are true matched-FPR comparisons.
#### A note on Lakera’s marketed operating point\.
Lakera’s product literature ([https://www.lakera.ai/blog/lakera-guard-fall-25-adaptive-at-scale](https://www.lakera.ai/blog/lakera-guard-fall-25-adaptive-at-scale)) advertises an FPR as low as 0.1–0.2%. That figure comes from Lakera’s internal *Challenging Moderation Benchmark*, with the model’s adaptive-calibration tuned per customer as part of onboarding; it is neither an out-of-the-box number nor a third-party measurement on a public benchmark. The matched-FPR figures below evaluate Lakera at its third-party published FPR on each public benchmark (arXiv:2505.13028 Table 5 \[[1](https://arxiv.org/html/2606.02959#bib.bib1)\] on deepset, arXiv:2409.19521 Table 2 (Goal Hijacking Attack) \[[18](https://arxiv.org/html/2606.02959#bib.bib18)\] on gentel-goal-hijacking, arXiv:2409.19521 Table 1 (Jailbreak Attack) \[[18](https://arxiv.org/html/2606.02959#bib.bib18)\] on gentel-jailbreaking, arXiv:2409.19521 Table 3 (Prompt Leaking Attack) \[[18](https://arxiv.org/html/2606.02959#bib.bib18)\] on gentel-prompt-leaking, arXiv:2603.13247 Table 3 (ILION-Bench v2 comparative evaluation) \[[3](https://arxiv.org/html/2606.02959#bib.bib3)\] on ilion-bench, arXiv:2410.22770 Table 1 \[[11](https://arxiv.org/html/2606.02959#bib.bib11)\] on notinject, and arXiv:2410.22770 Table 7 \[[11](https://arxiv.org/html/2606.02959#bib.bib11)\] on wildguard-benign). We include the note for transparency; the figures below report what this evaluation framework can measure directly: third-party numbers on public benchmarks at a matched operating point.
### Sample\-size\-matched comparisons
Comparing a frozen published score on $n_{\text{theirs}}$ samples to our number on a larger trace is not strictly apples-to-apples; a smaller evaluation set carries wider variance. For every competitor that published a smaller $n$ than ours on the same dataset, we draw $B$ subsamples of size $n_{\text{theirs}}$ from our out-of-fold predictions and bracket the comparison with a 95% percentile interval. The reader can compare the competitor’s point estimate against the matched-$n$ interval directly: when the published value sits inside the interval the comparison is within sampling noise.
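The matched-$n$ procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than the harness itself: subsamples are drawn without replacement, the percentile interval uses a simple nearest-rank rule, F1 stands in for the primary metric, and the out-of-fold predictions are synthetic.

```python
import random


def f1_from_pairs(pairs):
    """pairs: (predicted_label, true_label) with 1 = attack."""
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matched_n_interval(oof_pairs, n_theirs, n_boot=1000, seed=0, metric=f1_from_pairs):
    """95% percentile interval of `metric` over n_boot subsamples of size n_theirs."""
    if n_theirs > len(oof_pairs):
        raise ValueError("n_theirs must not exceed the size of our trace")
    rng = random.Random(seed)
    # Assumption: each subsample is drawn without replacement.
    stats = sorted(metric(rng.sample(oof_pairs, n_theirs)) for _ in range(n_boot))
    lower = stats[int(0.025 * (n_boot - 1))]
    upper = stats[int(0.975 * (n_boot - 1))]
    return lower, upper


def within_sampling_noise(published_value, interval):
    """True when the competitor's point estimate sits inside the interval."""
    lower, upper = interval
    return lower <= published_value <= upper


# Synthetic out-of-fold predictions: 1000 rows, roughly 90% of hard labels correct.
data_rng = random.Random(42)
oof_pairs = []
for _ in range(1000):
    y = data_rng.randint(0, 1)
    p = y if data_rng.random() < 0.9 else 1 - y
    oof_pairs.append((p, y))

interval = matched_n_interval(oof_pairs, n_theirs=100)
print(interval, within_sampling_noise(0.88, interval))  # 0.88 is a made-up published score
```

A smaller `n_theirs` widens the interval, which is the point of the comparison: a competitor score measured on a small sample should only be called a loss when it falls outside what our own predictions would produce at that sample size.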
#### ilion\-bench\.
Primary metric: F1; our trace: $n = 400$.
Figure 19: NotInject: false-positive rate against every published competitor. Lower is better. Blue bar is the system under test at the global headline threshold; green bars are competitors with higher FPR than us, red bars lower.
Figure 20: WildGuard-benign: false-positive rate against every published competitor. Same colour rules as the NotInject chart above.
Figure 21: Gate vs Lakera Guard: per-dataset values at matched FPR. Bars are the per-dataset primary metric (F1 for balanced splits, over-defense accuracy for all-benign splits) with Lakera’s published FPR used as the threshold target. Rows marked † have FPR / over-defense-accuracy as the primary metric, where matching FPR collapses the comparison, so Gate’s value at the global headline threshold is shown instead. Rows marked ‡ are gentel-bench splits where Lakera published the primary metric but no FPR, so Gate’s threshold is matched to a 1% FPR surrogate rather than to Lakera’s unknown operating point. Unmarked rows are true matched-FPR comparisons; the per-dataset table on the preceding page enumerates which rows fall into each bucket.
#### What the figure shows\.
At matched FPR, Gate is at or above Lakera Guard on every unmarked row in the figure — the bars give the head\-to\-head magnitude per dataset\. The largest improvements appear on the multi\-attack benchmarks where Lakera’s headline operating point is tuned for a different FPR than ours; matching at Lakera’s published FPR moves both points onto a like\-for\-like axis\. Dagger rows are excluded from the headline matched\-FPR tally because the primary metric is FPR or over\-defense\-accuracy \(the comparison collapses when FPR itself is matched\), and double\-dagger rows are gentel\-bench splits where Lakera published the primary metric but no corresponding FPR\.
#### Caveats\.
The head\-to\-head is bounded by what each side publishes\. Lakera’s per\-dataset numbers come from the third\-party measurements cited above; Gate is re\-evaluated against those frozen reference points on the same public traces, so the comparison is point\-in\-time and tied to the Lakera version those papers measured\. The figure isolates detection at a matched operating point — latency, throughput, and serving cost are reported separately in §5 and are not part of this matched\-FPR claim\. Benchmarks where Lakera does not publish are absent from the figure rather than reported as wins\.
## 5 Latency
End-to-end latency is measured under the deployed routing policy: per-stage samples come from concurrency sweeps on the production hardware, and the per-request composite is drawn under the cascade’s actual exit policy. The distribution is bimodal: a fast-mode cluster ($\approx$90% of requests, first stage only) and a deferred-mode tail ($\approx$10%) that incurs a second per-request analysis step. Numbers are saturation-free percentiles on warm pods; cold-start, bursty-queue contention, and cross-AZ jitter are not modelled and would shift the tail above the table.
Figure 22: Latency density distribution: density histogram of end-to-end latency (log x-axis, linear y-axis).
Against published competitor latencies, Gate’s mean of 104 ms is the lowest in the corpus. Each competitor figure is the arithmetic mean of every per-dataset latency we have in the cited papers; the parenthesised range shows the min–max spread. Closest: Lakera Guard (mean 140 ms across $n = 6$ per-dataset measurements, range 51–305 ms). Slowest: Rebuff (mean 23898 ms across $n = 2$ per-dataset measurements, range 18465–29330 ms). Competitor papers do not label percentiles (likely mean-of-N on the cited hardware).
Figure 23: End-to-end detection latency: Gate (blue, p50 $\to$ p95 with mean marker) vs published competitors (grey min–max range or single dot). Log x-axis. Sources: Palit et al. \[[1](https://arxiv.org/html/2606.02959#bib.bib1)\] (Tables 3, 5), JavelinGuard \[[2](https://arxiv.org/html/2606.02959#bib.bib2)\] (Table 4), ILION-Bench v2 / Chitan \[[3](https://arxiv.org/html/2606.02959#bib.bib3)\] (Table 3; OpenAI Moderation and Lakera baselines reported by an author of a competing system; flagged 3rd-party but the surrounding context is vendor-biased), and NeuralTrust \[[4](https://arxiv.org/html/2606.02959#bib.bib4)\] firewall comparison. “$\times N$” counts data points per source.
## 6 Conclusion
Two design choices drive the headline numbers reported here. The first is treating prompt-injection detection as a *cascade ensemble* problem rather than a single-model classification problem: signals from independent classifier heads, an embedding representation, and a deferred second-stage analysis path are combined under a global decision rule, so each component contributes within its strength and the system as a whole inherits none of any single component’s blind spots. The second is enforcing a single global operating point during evaluation: one threshold applied uniformly across every benchmark, selected on held-out out-of-fold predictions subject to a fixed FPR cap, rather than per-dataset tuning. The combination yields a micro-F1 of 97.4% at the deployable headline operating point (FPR $\leq 1\%$, pooled FPR 1.0%), and 98.7% at the unconstrained $F_1$-maximising natural threshold ($\theta = 0.5$, pooled FPR 4.2%), across 12,111 samples from 16 public benchmarks. The system ranks #1 on 8 ranked benchmarks and within 95% bootstrap noise on most of the rest.
The methodology checks support those numbers as well as the model. Headline metrics are reported with stratified bootstrap 95% confidence intervals; a parallel StratifiedGroupKFold pass with a composite (parent-prompt id + MinHash near-duplicate) key runs alongside the headline StratifiedKFold pass so the gap quantifies any residual leakage; random-label, adversarial-validation, length-bias, threshold-transferability, train-vs-OOF agreement, and prevalence-ratio diagnostics each carry a precise pass criterion and run on every release; per-fold threshold stability is reported numerically so the operating point is auditable; external comparisons are evaluated at *per-dataset matched FPR* where competitors publish an FPR, and as small-$n$ matched bootstrap intervals where they don’t, so head-to-head wins are bracketed by noise rather than asserted point-estimate-to-point-estimate. End-to-end latency is 53 ms p50, 104 ms mean, 571 ms p95 at the deployed operating point.
The biggest unmodelled risk is pretraining contamination of the benchmark corpora themselves; the largest in\-scope follow\-up is held\-out evaluation on attack patterns generated after foundation\-model training cutoffs\.
## Appendix A Glossary
*F1.* Harmonic mean of precision and recall: $F_1 = 2 \cdot P \cdot R / (P + R)$.
*Precision (P).* Fraction of flagged prompts that were actually injections: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$.
*Recall (R).* Fraction of injections caught: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.
*FPR.* False-positive rate: $\mathrm{FPR} = \mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$.
*AUC.* Area under the ROC curve. 0.5 is chance, 1.0 is perfect.
*Over-Defense Accuracy (ODA).* For all-benign datasets, $\mathrm{ODA} = 1 - \mathrm{FPR}$.
*$K$-fold CV.* Partition the trace into $K$ disjoint folds; hold each out for evaluation while training on the rest.
*StratifiedKFold.* $K$-fold variant that preserves the label marginal $P(y \mid \mathcal{D}_k) \approx P(y \mid \mathcal{D})$; the headline pass that feeds the abstract and leaderboard.
*StratifiedGroupKFold.* $K$-fold variant that preserves the label marginal *and* enforces a group constraint so members of one group (parent prompt + near-duplicate cluster) cannot straddle folds; reported as a leakage diagnostic via $\Delta F_1 = F_1^{\text{strat}} - F_1^{\text{sgk}}$.
*OOF prediction.* Out-of-fold prediction: for each row, the prediction made by a model trained on the $K - 1$ folds that excluded it.
*Stratified bootstrap CI.* Confidence interval from $B$ resamples drawn within stratification cells (here: source $\times$ label); percentile interval at $2.5\%$ / $97.5\%$.
*LODO.* Leave-one-dataset-out cross-validation. Train on every other source; evaluate on the held-out source.
*ECE.* Expected Calibration Error: weighted average of $|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)|$ across confidence bins.
*Brier score.* Mean squared error between predicted probabilities and binary labels: $\frac{1}{N} \sum_i (\hat{p}_i - y_i)^2$.
*Operating point.* The threshold (or threshold combination) at which the system commits to hard labels. Chosen here at a fixed FPR target on out-of-fold predictions, applied uniformly across every dataset.
*Self-evaluation (\*).* A competitor score reported by the system’s own authors; treated as an upper bound.
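For readers who prefer code to formulas, the glossary metrics can be written out as below. This is a minimal sketch, not the evaluation harness: the function names are hypothetical, and the ECE variant shown (equal-width bins over the confidence $\max(p, 1 - p)$, hard label at $p \geq 0.5$) is one common binary formulation assumed for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, FPR and ODA = 1 - FPR from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # ODA is only meaningful as a headline metric on all-benign datasets.
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr, "oda": 1.0 - fpr}


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)           # confidence in the predicted class
        correct = int((p >= 0.5) == (y == 1))  # 1 when the hard label matches
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(k for _, k in members) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical confusion counts and probabilities.
metrics = confusion_metrics(tp=8, fp=2, tn=88, fn=2)
print(metrics)
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))
print(expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0]))
```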
## References
- \[1\] S. Palit, D. Woods. Evaluating the efficacy of LLM Safety Solutions: The Palit Benchmark Dataset. arXiv:2505.13028. 2025. <https://arxiv.org/abs/2505.13028>
- \[2\] Y. Datta, S. Rajasekar. JavelinGuard: Low-Cost Transformer Architectures for LLM Security. arXiv:2506.07330. 2025. <https://arxiv.org/abs/2506.07330>
- \[3\] F. A. Chitan. ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems. arXiv:2603.13247. 2026. <https://arxiv.org/abs/2603.13247>
- \[4\] V. García. Which firewall best prevents prompt injection attacks? NeuralTrust blog. 2025. <https://neuraltrust.ai/blog/prevent-prompt-injection-attacks-firewall-comparison>
- \[5\] deepset. deepset/prompt-injections (community-labelled prompt-injection dataset). Hugging Face Datasets. 2023. <https://huggingface.co/datasets/deepset/prompt-injections>
- \[6\] Rogue Security. prompt-injection-jailbreak-sentinel-v2 (model card). Hugging Face. 2025. <https://huggingface.co/rogue-security/prompt-injection-jailbreak-sentinel-v2>
- \[7\] S. Schulhoff et al. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. EMNLP 2023; project site. 2023. <https://www.hackaprompt.com/>
- \[8\] jackhhao. jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset). Hugging Face Datasets. 2023. <https://huggingface.co/datasets/jackhhao/jailbreak-classification>
- \[9\] S. Abdelnabi et al. LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge. arXiv:2506.09956. 2025. <https://arxiv.org/abs/2506.09956>
- \[10\] I. Wu, M. Maslowski. CourtGuard: A Local, Multiagent Prompt Injection Classifier. arXiv:2510.19844. 2025. <https://arxiv.org/abs/2510.19844>
- \[11\] H. Li, X. Liu. InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models. arXiv:2410.22770. 2024. <https://arxiv.org/abs/2410.22770>
- \[12\] L. E. Erdogan et al. safe-guard-prompt-injection (synthetic prompt-injection dataset, n=10,296). Hugging Face Datasets. 2024. <https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection>
- \[13\] J. Kasundra et al. AprielGuard. arXiv:2512.20293. 2025. <https://arxiv.org/abs/2512.20293>
- \[14\] A. Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. 2023. <https://arxiv.org/abs/2307.15043>
- \[15\] A. Robey et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684. 2023. <https://arxiv.org/abs/2310.03684>
- \[16\] J. Yi et al. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. arXiv:2312.14197. 2023. <https://arxiv.org/abs/2312.14197>
- \[17\] Lakera AI. Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset). Hugging Face Datasets. 2023. <https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions>
- \[18\] R. Li et al. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks. arXiv:2409.19521. 2024. <https://arxiv.org/abs/2409.19521>
- \[19\] M. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249. 2024. <https://arxiv.org/abs/2402.04249>
- \[20\] S. Han et al. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv:2406.18495. 2024. <https://arxiv.org/abs/2406.18495>
- \[21\] L. Li et al. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv:2402.05044. 2024. <https://arxiv.org/abs/2402.05044>
- \[22\] G. C. Cawley, N. L. C. Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research 11. 2010. <https://jmlr.org/papers/v11/cawley10a.html>