Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

arXiv cs.AI Papers

Summary

This paper introduces LeanGuard, a lightweight bidirectional encoder-based safety guardrail that matches the accuracy of larger reasoning-based guardrails while being approximately 100x faster, challenging the assumption that chain-of-thought reasoning is necessary for effective moderation.

arXiv:2606.26686v1 Announce Type: new Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied robot. In this paper, we pose a question whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reaches an average F1 of 82.90 $\pm$ 0.26 over public benchmarks. It matches a reasoning guard that is built on a much larger decoder, while it uses only a single forward pass over an input of at most 512 tokens. This is about a ~100x reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard, so a heavier reasoning guard is not the more robust choice either. Our finding suggests that the current guardrail benchmarks may not be hard enough to reward reasoning, and that the necessity of CoT for moderation is still not proven. We release all source codes and models including LeanGuard at https://github.com/ndb796/LeanGuard.

# Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation
Source: [https://arxiv.org/html/2606.26686](https://arxiv.org/html/2606.26686)
###### Abstract

In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied robot. In this paper, we pose a question whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reaches an average F1 of 82.90 $\pm$ 0.26 over public benchmarks. It matches a reasoning guard that is built on a much larger decoder, while it uses only a single forward pass over an input of at most 512 tokens. This is about a $\sim100\times$ reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard, so a heavier reasoning guard is not the more robust choice either. Our finding suggests that the current guardrail benchmarks may not be hard enough to reward reasoning, and that the necessity of CoT for moderation is still not proven. We release all source codes and models including LeanGuard at https://github.com/ndb796/LeanGuard. (Project page: https://ndb796.github.io/LeanGuard.)

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.26686v1/x1.png)

Figure 1: Cost-accuracy plane (log $x$). LeanGuard, our 395M label-only encoder, matches the much larger reasoning guards at about $\sim100\times$ lower inference cost and a single forward pass. We train this model and release it as an open-source guardrail.

Large language models are increasingly deployed behind *safety guardrails*, models that screen prompts and responses for harmful content and decide whether the system should comply or refuse (Inan et al. [2023](https://arxiv.org/html/2606.26686#bib.bib4); Han et al. [2024](https://arxiv.org/html/2606.26686#bib.bib6); Ghosh et al. [2024](https://arxiv.org/html/2606.26686#bib.bib7)). As guardrails have become standard infrastructure, two design choices have come to dominate. The first is to build the guard as a large decoder-based generative classifier that is fine-tuned to emit its verdict directly in natural language. The second, a fast-growing line of work, makes that same decoder *reason* first before it commits to a verdict. GuardReasoner (Liu et al. [2025b](https://arxiv.org/html/2606.26686#bib.bib5)), for instance, trains the model to produce an explicit chain-of-thought (CoT) before its verdict, on the premise that thinking step by step yields a more accurate and trustworthy guard.

This reasoning-first view has hardened into a near-consensus, yet this consensus may be misplaced. Safety moderation is, at its core, a bounded *labeling* decision, asking for example *“is this input harmful?”* or *“did the model comply or refuse?”* It is not the kind of open-ended, multi-step problem on which CoT has actually been shown to help (Sprague et al. [2025](https://arxiv.org/html/2606.26686#bib.bib15)). In our experiments, a lightweight bidirectional encoder (a BERT-family model) matches a decoder reasoning guard *even under a tight input budget of a few hundred tokens*, without generating any reasoning at all. That a small single-pass classifier keeps pace with a far larger reasoning model may point to one of two conclusions, each consequential: either (1) current guardrail benchmarks are not hard enough to reward reasoning, or (2) the necessity of CoT for safety moderation has simply never been established. We set out to test which.

![Refer to caption](https://arxiv.org/html/2606.26686v1/x2.png)

Figure 2: The chain-of-thought of a guard decoder may be post-hoc. As the decoder generates its chain from left to right, we read its hidden state at an early token, a middle token, and a late token, then we fit a linear probe on the late (at-verdict) state and apply it to the earlier states. All three hidden states already fall inside the harmful region of the decision space, and the probe confidence barely changes. The verdict is therefore fixed before the chain is written, and the later reasoning only restates it. When we re-sample the chain, the majority verdict changes on only $\sim5\%$ of inputs (Observation 1), at about $\sim100\times$ the inference cost.

We organize the evidence around two misconceptions the field holds about reasoning guards: (M1) that chain-of-thought is necessary for an accurate guardrail, and (M2) that a heavier generative or reasoning guard is more capable and more robust to training-label noise. Our central tool is a *same-base* comparison that varies only whether the model reasons before it labels. We hold architecture, scale, pretraining, and data fixed, which is the one controlled cut that isolates the effect of reasoning. Our contributions are as follows.

- A plain decoder needs no reasoning to reach state-of-the-art accuracy. Through extensive same-base experiments, on an identical decoder backbone, simply training the model to predict the verdict directly, with no chain-of-thought, matches the reasoning recipe of GuardReasoner. The reasoning that prior work frequently treats as essential may buy no accuracy. We argue that this is the overlooked result. A standard discriminative backbone may already be enough (Section [5](https://arxiv.org/html/2606.26686#S5)).
- Reasoning is frequently post-hoc and should be used with caution. In our experiments, re-sampling the chain-of-thought almost never changes the verdict. The model has effectively decided before it reasons, so the chain justifies a pre-determined answer rather than computing it, which raises the question of whether the reasoning contributes at all. We have also found that adding a chain-of-thought does not improve, and can even lower, accuracy on a fixed base, so the chain can quietly hurt (Section [5](https://arxiv.org/html/2606.26686#S5.SSx1)).
- A heavier reasoning guard may not be the more robust choice. A label-only encoder stays accurate under injected training-label noise, holding strong F1 even when a large fraction of its labels are corrupted, and it retains far more recall than the reasoning guard at a strict false-positive rate, where the reasoning guard’s confidence polarizes. In our experiments, heavier and slower does not translate into more robust (Section [6](https://arxiv.org/html/2606.26686#S6)).
- An open, deployable guardrail. Beyond the analysis we release a practical model, LeanGuard, a 395M single-pass guard that we publicly release with an ONNX export for on-device use, together with a unified and reproducible comparison against the prevailing guard landscape (Llama Guard 2 and 3, WildGuard, Aegis, ShieldGemma, MD-Judge) and all training and evaluation code. We hope that our lightweight, reasoning-free guard model becomes a ready-to-deploy baseline the community can build on.

In this study, “reasoning” refers to CoT fine\-tuning, not test\-time reasoning, tool use, or verifier\-based methods\. This work is a controlled empirical study rather than a new architecture, and we release LeanGuard with all code and models\.

## 2 Related Work

LLM\-based safety guards\.Llama Guard\(Inanet al\.[2023](https://arxiv.org/html/2606.26686#bib.bib4); Dubey and others[2024](https://arxiv.org/html/2606.26686#bib.bib40)\)recast prompt and response moderation as instruction\-tuned generative classification on a roughly 7B decoder, a template inherited by Llama Guard 3\(Meta AI[2024](https://arxiv.org/html/2606.26686#bib.bib24)\), Aegis\(Ghoshet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib7)\), ShieldGemma\(Zeng and others[2024](https://arxiv.org/html/2606.26686#bib.bib23)\), WildGuard\(Hanet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib6)\), MD\-Judge\(Li and others[2024](https://arxiv.org/html/2606.26686#bib.bib25)\), and Granite Guardian\(Padhi and others[2024](https://arxiv.org/html/2606.26686#bib.bib26)\)and evaluated on a now\-standard suite\(Linet al\.[2023](https://arxiv.org/html/2606.26686#bib.bib8); Markovet al\.[2023](https://arxiv.org/html/2606.26686#bib.bib9); Mazeikaet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib10); Jiet al\.[2023](https://arxiv.org/html/2606.26686#bib.bib11); Daiet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib12); Röttgeret al\.[2024](https://arxiv.org/html/2606.26686#bib.bib13)\)\. Across this line of studies the guard is almost always a multi\-billion\-parameter generative model, and the discriminative encoder\-based alternative is the road not taken, even though such an encoder is the natural fit for a fixed\-label decision\.

Reasoning guards\.A fast\-growing line makes the generative guard*reason*first\. GuardReasoner\(Liuet al\.[2025b](https://arxiv.org/html/2606.26686#bib.bib5)\), our closest baseline, uses R\-SFT and HS\-DPO\. ThinkGuard\(Wen and others[2025](https://arxiv.org/html/2606.26686#bib.bib32)\)distills slow thinking on the premise that single\-pass classifiers are too shallow, and R2\-Guard\(Kang and Li[2024](https://arxiv.org/html/2606.26686#bib.bib33)\)adds knowledge\-grounded logical reasoning, with extensions to multilingual and multi\-modal moderation\. All of them share the assumption that an accurate guard must reason, yet they do not deeply address the clean*same\-base*ablation that removes only the reasoning, so their reported gains confound the reasoning with architecture, scale, data, and objective\. We supply that ablation\.

Revisiting noisy labels and necessities of CoT\.Recent evidence finds that CoT helps far less broadly than assumed\. CoT gives large gains almost only on math and symbolic tasks\(Spragueet al\.[2025](https://arxiv.org/html/2606.26686#bib.bib15)\), its chains are often post\-hoc rather than causal\(Turpinet al\.[2023](https://arxiv.org/html/2606.26686#bib.bib14); Lanham and others[2023](https://arxiv.org/html/2606.26686#bib.bib27)\), an efficiency literature documents wasteful overthinking\(Sui and others[2025](https://arxiv.org/html/2606.26686#bib.bib34)\), and latent\-reasoning methods recover the benefit while emitting no reasoning at all\(Denget al\.[2024](https://arxiv.org/html/2606.26686#bib.bib35); Haoet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib36)\)\. Safety moderation is a short, non\-symbolic labeling task with a small output space, so the prior work may sit in this unfavorable regime\. Our guards also train on imperfectly labeled corpora, and early\-learning regularization\(Liuet al\.[2020](https://arxiv.org/html/2606.26686#bib.bib17)\)explains why a single\-epoch discriminative recipe stays robust, while CoT fine\-tuning is fragile to propagating noise\(Havrilla and Iyer[2024](https://arxiv.org/html/2606.26686#bib.bib18); Zhouet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib19)\)and classical noise\-robust losses\(Zhang and Sabuncu[2018](https://arxiv.org/html/2606.26686#bib.bib20); Mülleret al\.[2019](https://arxiv.org/html/2606.26686#bib.bib21); Chowdhuryet al\.[2024](https://arxiv.org/html/2606.26686#bib.bib22)\)have no clean analogue for a free\-form reasoning trace\.

Safety for embodied and robotic agents\.As LLMs and VLMs begin to drive robots, benchmarks such as ASIMOV\(Sermanetet al\.[2025](https://arxiv.org/html/2606.26686#bib.bib29)\), SafeAgentBench\(Yin and others[2024](https://arxiv.org/html/2606.26686#bib.bib28)\), and AgentSafe\(Liuet al\.[2025a](https://arxiv.org/html/2606.26686#bib.bib30)\), organizing risk after Asimov into harm to humans, to the environment, and to the agent itself, report that capable planners execute unsafe tasks at high rates\(Zhang and others[2024](https://arxiv.org/html/2606.26686#bib.bib38)\), and defenses such as RoboGuard\(Ravichandranet al\.[2025](https://arxiv.org/html/2606.26686#bib.bib37)\)insert heavy, reasoning\-driven safeguards into the control loop\. These motivate why an on\-device guard must be small and fast, and why a guard that must wait for a chain is a poor fit for the embodied setting\(Wanget al\.[2025](https://arxiv.org/html/2606.26686#bib.bib31)\)\. Reliability is just as critical for the spatial decisions these agents make\. In spatial question answering for navigation, a localization agent such as BinTrack\(Naet al\.[2026b](https://arxiv.org/html/2606.26686#bib.bib41)\)must return an accurate metric coordinate, because a large localization error can send the robot far from the goal and waste a long traversal before it recovers\. This high cost of a confidently wrong answer is exactly what a guardrail should prevent, and Semantic Flip\(Naet al\.[2026a](https://arxiv.org/html/2606.26686#bib.bib42)\)synthesizes out\-of\-distribution query and memory pairs so that a lightweight rejection module can learn when an embodied query is unanswerable and the agent should decline rather than act on an arbitrary coordinate\. These robotic spatial\-reasoning pipelines make the same case as text moderation, that the guard which is accurate, lightweight, and fast is the one an on\-device agent can actually run\.

## 3 Problem Formulation

A moderation instance is $x=(p,a)$ with a prompt $p$ and an optional response $a$. The label is a triple $y=(y^{\textsc{req}},y^{\textsc{comp}},y^{\textsc{resp}})$ of integer-coded verdicts. The request-harm label is $y^{\textsc{req}}\in\{0,1\}$, where $0$ denotes an unharmful request and $1$ a harmful one. The response-harm label is $y^{\textsc{resp}}\in\{0,1\}$, where $0$ denotes an unharmful response and $1$ a harmful response. The completion label is $y^{\textsc{comp}}\in\{0,1\}$, where $0$ denotes a refusal and $1$ a compliance. A *discriminative encoder* computes one bidirectional representation $h=\mathrm{Enc}_{\phi}(x)$ and reads off $\hat{y}^{(k)}=\arg\max(W_{k}h)$ in a single forward pass. A *generative reasoner* models a chain $r=(r_{1},\dots,r_{T})$ and the verdict jointly, $P_{\theta}(r,y\mid x)=\prod_{t}P_{\theta}(r_{t}\mid x,r_{<t})\,P_{\theta}(y\mid x,r)$, and *decodes* $r$ before $y$.

The discriminative encoder. Our primary guard $f_{\phi}$ is ModernBERT-large (Warner et al. [2024](https://arxiv.org/html/2606.26686#bib.bib2); Devlin et al. [2019](https://arxiv.org/html/2606.26686#bib.bib1)). It encodes the full instance into a single pooled representation $h=\mathrm{Enc}_{\phi}(x)\in\mathbb{R}^{d}$ and attaches three independent linear heads $W_{\textsc{req}},W_{\textsc{comp}},W_{\textsc{resp}}$, one per verdict component, each trained with cross-entropy against its gold label. The objective is their sum, $\mathcal{L}_{\phi}=\sum_{k}\mathrm{CE}(W_{k}h,\,y^{(k)})$, and inference reads off $\hat{y}^{(k)}=\arg\max(W_{k}h)$ in a single forward pass. The supervision is the verdict alone, and no reasoning is ever constructed or scored, so the model commits its entire capacity to the labeling decision rather than to fluent text. This is the natural inductive bias for a bounded-label problem. Bidirectional attention lets every token attend to every other before one decision is made, and a fixed-size head casts moderation as one-shot classification rather than autoregressive generation, which is also what makes it robust to label noise (Section [6](https://arxiv.org/html/2606.26686#S6)). All encoder weights are fine-tuned. We refer to this trained guard as LeanGuard.
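The summed objective and the argmax read-out above can be sketched in a few lines. This is a minimal illustration in plain Python, not the released training code: the logits stand in for $W_{k}h$ and are hypothetical values, and the function names are ours.

```python
import math

def softmax_cross_entropy(logits, gold):
    """Cross-entropy of one head: -log softmax(logits)[gold]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[gold]

def multi_head_loss(head_logits, gold_labels):
    """Sum of per-head cross-entropies, one term per verdict component k."""
    return sum(softmax_cross_entropy(head_logits[k], gold_labels[k])
               for k in gold_labels)

def predict(head_logits):
    """Single-pass inference: argmax over each head's logits."""
    return {k: max(range(len(v)), key=v.__getitem__)
            for k, v in head_logits.items()}

# Hypothetical logits W_k h for one instance (class index 0 or 1 per head).
logits = {"req": [0.2, 2.1], "comp": [1.5, -0.3], "resp": [1.0, 0.1]}
gold = {"req": 1, "comp": 0, "resp": 0}

print(predict(logits), round(multi_head_loss(logits, gold), 4))
```

Each head contributes an independent term, so a missing label for one component could in principle be handled by dropping its term; the source does not say whether that is done.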

The generative guard and parameter-efficient adaptation. The competing paradigm $g_{\theta}$ instead *emits* the verdict as text. We instantiate this generative paradigm on a decoder (Llama-3.2 (Dubey and others [2024](https://arxiv.org/html/2606.26686#bib.bib40))) and an encoder-decoder (T5 (Raffel et al. [2020](https://arxiv.org/html/2606.26686#bib.bib3))) and train it with a *completion-only* objective. The prompt tokens are masked out of the loss ($\text{label}=-100$) and the gradient flows only through the completion, so the model is supervised purely on what it must generate. The same backbone is trained in two modes that share data, optimizer, and schedule and differ only in the completion target. The *with-CoT* mode generates the GuardReasoner reasoning trace and then the verdict, while *label-only* generates the verdict token(s) alone. This yields the three settings whose comparison is the entire study, generate-with-CoT ($g_{\theta}^{\textsc{cot}}$), generate-label-only ($g_{\theta}^{\textsc{lab}}$), and the classification head ($f_{\phi}$). The two generative settings are reachable on one Llama base by changing only the completion target, which isolates the chain of thought on a fixed base, while $f_{\phi}$ is the discriminative encoder. For the heavier decoder settings we use parameter-efficient adaptation. Updating all parameters of a 3B decoder is costly, so we adapt these heavier settings with Low-Rank Adaptation (LoRA (Hu et al. [2021](https://arxiv.org/html/2606.26686#bib.bib39))). LoRA freezes each adapted projection $W_{0}$ and learns a low-rank update $\Delta W=BA$ with $A\in\mathbb{R}^{r\times d}$, $B\in\mathbb{R}^{k\times r}$, and rank $r\ll\min(d,k)$, injected on the attention projections, so only a small fraction of parameters is trained while the pretrained weights stay fixed.
For the 1.24B same-base comparison we report a single-epoch budget and rest the conclusion on the absence of a chain-of-thought gain rather than on the sign of any single adapter-matched pair. Holding backbone, data, optimizer, and schedule fixed and varying only the target and the head turns the question “does the chain help?” into a single controlled comparison rather than a confounded one (Section [4](https://arxiv.org/html/2606.26686#S4)).
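The completion-only objective can be illustrated with a small label-building sketch. It assumes token ids are already available (the ids below are hypothetical placeholders, not output of any real tokenizer) and shows only how the two modes differ in the completion target while the prompt positions receive the ignored label value $-100$.

```python
IGNORE_INDEX = -100  # label value the text gives for masked prompt tokens

def build_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) with prompt positions masked out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Hypothetical token ids; a real run would use the backbone's tokenizer.
prompt = [11, 12, 13, 14]
cot_then_verdict = [21, 22, 23, 24, 25, 7]  # with-CoT target: trace, then verdict
verdict_only = [7]                           # label-only target: verdict alone

for name, target in [("with-CoT", cot_then_verdict), ("label-only", verdict_only)]:
    ids, labels = build_labels(prompt, target)
    supervised = sum(1 for t in labels if t != IGNORE_INDEX)
    print(name, "sequence length:", len(ids), "supervised tokens:", supervised)
```

The prompt half of the sequence is identical in both modes, which is what lets the comparison attribute any difference to the completion target alone.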

Cost. With $C(\ell)$ the FLOPs of one forward pass, the reasoner costs $\geq(T+1)\,C(|x|)$ against the encoder’s $C(|x|)$, so

$$\frac{\text{cost}(g_{\theta})}{\text{cost}(f_{\phi})}\;\gtrsim\;(T+1)\cdot\frac{|\theta|}{|\phi|}\,,\tag{1}$$

which for $T=\mathcal{O}(10^{2})$ and $|\theta|/|\phi|=1.24\text{B}/0.395\text{B}$ is two orders of magnitude.
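As a quick arithmetic check of Eq. (1), the bound can be evaluated directly. The parameter counts are the ones stated in the text; the chain length $T=100$ is an assumed representative value for $T=\mathcal{O}(10^{2})$, not a measured one.

```python
def cost_ratio(chain_tokens, reasoner_params, encoder_params):
    """Lower bound from Eq. (1): (T + 1) * |theta| / |phi|."""
    return (chain_tokens + 1) * reasoner_params / encoder_params

# |theta| = 1.24B and |phi| = 0.395B come from the text.
# T = 100 is an assumption standing in for "T = O(10^2)".
ratio = cost_ratio(chain_tokens=100, reasoner_params=1.24e9, encoder_params=0.395e9)
print(f"reasoner / encoder cost ratio: about {ratio:.0f}x")
```

Under these assumptions the bound lands in the low hundreds, which agrees with the "two orders of magnitude" reading of Eq. (1).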

Noise slope. Training corpora for guards are labeled by humans and with model assistance, and both introduce errors, so robustness to training-label noise matters in practice. Corrupting each training label independently with rate $\eta$ gives a headline $F_{1}(\eta)$ and a degradation slope $s=-\,dF_{1}/d\eta$. A guard is noise-robust if and only if $s$ is small (Section [6](https://arxiv.org/html/2606.26686#S6)).
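One way to estimate the slope $s$ is an ordinary least-squares fit of headline F1 against the noise rate. The sketch below uses only the three encoder values quoted in the text (82.90 with clean labels, 82.16 at 10% noise, 80.56 at 30% noise), and it assumes the clean-label headline is the $\eta=0$ point. The paper's reported slope presumably comes from its full noise grid, so this three-point fit is only an approximation.

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Noise rates (fraction of corrupted labels) and encoder headline F1 from the text.
eta = [0.0, 0.10, 0.30]
f1 = [82.90, 82.16, 80.56]

dF1_deta = least_squares_slope(eta, f1)
s = -dF1_deta  # degradation slope s = -dF1/d(eta); small s means noise-robust
print(f"F1 change per 10% label noise: {dF1_deta * 0.10:.2f}")
```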

Operating-point metric. A guard emits a harmful-class score $\sigma(x)\in[0,1]$. At threshold $\tau$ it has a false-positive rate $\mathrm{FPR}(\tau)$ and a true-positive rate $\mathrm{TPR}(\tau)$. Production fixes a small target FPR $\alpha$ and cares about $\mathrm{TPR}@\alpha$, the recall at $\mathrm{FPR}(\tau_{\alpha})=\alpha$. A paradigm that polarizes $\sigma$ toward $\{0,1\}$ loses the score resolution needed near $\tau_{\alpha}$ (Section [6](https://arxiv.org/html/2606.26686#S6)).
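The effect of polarized scores on $\mathrm{TPR}@\alpha$ can be seen on a toy example. The sketch below is a plain-Python illustration with synthetic scores invented for this purpose, not the paper's evaluation code. It picks the loosest threshold whose false-positive rate does not exceed $\alpha$ and reports the recall there.

```python
def tpr_at_fpr(scores, labels, alpha):
    """Recall of the harmful class at the loosest threshold with FPR <= alpha."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(alpha * len(neg))  # number of negatives we may flag
    # An input is flagged when score > tau; tau is the (allowed_fp + 1)-th
    # largest negative score, so at most allowed_fp negatives exceed it.
    tau = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    return sum(1 for s in pos if s > tau) / len(pos)

labels = [0] * 10 + [1] * 5

# Synthetic graded scores: negatives and positives overlap slightly.
graded = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.80,
          0.60, 0.70, 0.85, 0.90, 0.95]
# Synthetic polarized scores: every score is 0 or 1, two negatives sit at 1.
polarized = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
             1, 1, 1, 1, 0]

print("graded:   ", tpr_at_fpr(graded, labels, alpha=0.1))
print("polarized:", tpr_at_fpr(polarized, labels, alpha=0.1))
```

In the polarized case the confident false positives are tied with the true positives at the top score, so no threshold separates them within the FPR budget and recall collapses.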

These give three testable hypotheses, organized by the two misconceptions. H1 (against M1): the CoT term in $P_{\theta}$ does not improve $\hat{y}$ on a fixed base. H2 (against M2): a single-pass discriminative encoder remains robust as the training-label noise rate $\eta$ grows, so its degradation slope $s$ is small. H3 (against M2): the reasoner loses recall at a strict FPR because its confidence polarizes, so the discriminative inductive bias, not a reasoning component, governs the operating point.

## 4 Experimental Setup

Corpus and protocol. We use the public GuardReasoner training corpus of 127,465 conversation-level examples, which carries both the three-part verdict label *and* a reasoning trace per example. This lets us define a clean ablation on *one* corpus. The with-CoT condition trains on (reasoning + verdict), and the label-only condition trains on the verdict target alone, with the reasoning removed. The data, the optimizer, and the schedule are identical, and only the supervision target changes. We train every setting ourselves. All settings share the data, a one-epoch schedule, an effective batch of 16, AdamW with cosine decay and 10% warmup, and bf16 mixed precision. What differs per backbone is exactly the lever we study. Our primary model is the ModernBERT encoder, trained at learning rate $3\times10^{-5}$ over a 1024-token context, and the completion-only decoder (Llama-3.2-1B) and encoder-decoder (T5) settings use learning rates $1\times10^{-5}$ and $3\times10^{-4}$ respectively, with the heavier decoder settings adapted by LoRA, and we defer their full configuration to the recipe. To put the central null on solid statistical footing we run three seeds for every trained setting and report the mean $\pm$ standard deviation. We additionally evaluate the released GuardReasoner-1B and 3B (Liu et al. [2025b](https://arxiv.org/html/2606.26686#bib.bib5)) checkpoints, which were trained on this same corpus, with our scorer as the reference reasoning system.

Models and evaluation. Our primary model is ModernBERT-large (Warner et al. [2024](https://arxiv.org/html/2606.26686#bib.bib2)) (395M) with three linear heads on the pooled representation, and the same-base experiments use a Llama-3.2-1B decoder (1.24B) and a T5-base (Raffel et al. [2020](https://arxiv.org/html/2606.26686#bib.bib3)) encoder-decoder (220M). We evaluate on public test sets over three tasks, abbreviated in Table [2](https://arxiv.org/html/2606.26686#S5.T2) as follows. *Prompt-harm*: ToxicChat (TC), OpenAI Moderation (OAI), AegisSafetyTest (Aeg), SimpleSafetyTests (SST), HarmBenchPrompt (HBp), and the WildGuardTest prompt split (WGp). *Response-harm*: HarmBenchResponse (HBr), the WildGuardTest response split (WGr), BeaverTails (BT), SafeRLHF (SRL), and XSTestResponseHarmful (XSh). *Refusal*: the WildGuardTest refusal split (WGf) and XSTestResponseRefusal (XSr). The three families probe complementary failure modes: prompt-harm asks whether a *request* is harmful, response-harm whether a model’s *reply* is harmful, and refusal whether the reply declines or complies, which a guard must separate from harmfulness so a safe refusal is not scored as a violation. Per task we report the dataset-size-weighted F1 of the harmful or refusal class, and the *headline classification performance* (the *headline*, for short) is the unweighted mean of the three task F1 scores. This is the reasoning baseline’s protocol, matched cell-for-cell (Table [2](https://arxiv.org/html/2606.26686#S5.T2)).
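The two-level aggregation behind the headline number can be written out explicitly. This is a sketch of the averaging as described in the text, with made-up dataset sizes and F1 values for illustration only; none of the numbers below are from the paper.

```python
def task_f1(per_dataset):
    """Dataset-size-weighted mean F1 over (n_examples, f1) pairs of one task."""
    total = sum(n for n, _ in per_dataset)
    return sum(n * f for n, f in per_dataset) / total

def headline(tasks):
    """Unweighted mean of the per-task F1 scores."""
    scores = [task_f1(d) for d in tasks.values()]
    return sum(scores) / len(scores)

# Hypothetical sizes and F1 values, for illustration only.
tasks = {
    "prompt_harm": [(2000, 80.0), (500, 90.0)],
    "response_harm": [(1000, 78.0), (3000, 82.0)],
    "refusal": [(400, 88.0), (100, 93.0)],
}

for name, datasets in tasks.items():
    print(name, task_f1(datasets))
print("headline", headline(tasks))
```

Because the outer mean is unweighted, a task with few test examples (refusal, in this toy setup) counts as much toward the headline as the largest task.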

## 5 CoT May Not Improve Accuracy (M1)

Table [1](https://arxiv.org/html/2606.26686#S5.T1) reports the same-base experiments and Table [2](https://arxiv.org/html/2606.26686#S5.T2) the per-benchmark landscape, both with three seeds per trained setting. These experiments answer the obvious objection that a “label-only decoder” is still a generative causal LM. On *one* Llama base we compare generate-with-CoT against generate-label-only, varying only whether the model reasons before it labels. Reading along the 1.24B row, the released with-CoT GuardReasoner-1B reaches 82.05, while a single-epoch label-only setting on the same Llama 1B model base already reaches 81.42, so the reasoning pipeline buys at most $0.63$ F1 over a far cheaper label-only run, and the 395M encoder (82.90) exceeds both. On a compute-matched single-epoch comparison on the *same* base and data, adding CoT does not help, giving 81.35 against the label-only 81.42. We have also found that a simple fine-tuning run *without* HS-DPO reproduces this prior state of the art almost exactly, reaching 81.93, which confirms the baseline is easy to hit and is not the bottleneck. The same pattern holds at *3B*, where GuardReasoner-3B reaches 82.50 and still does not beat the 395M encoder. On the T5 encoder-decoder, removing CoT raises F1 by 7.05 at 220M (80.02 against 72.97), a residual consistent with exposure bias in seq2seq verdict generation, so we treat the T5 result as corroborating rather than headline evidence. The label-only encoder leads GuardReasoner-1B on most axes (Table [2](https://arxiv.org/html/2606.26686#S5.T2), 9 of 13 cells).

Table 1: Same-base experiments over backbone and training setting (headline F1, 3 seeds, mean, with std $\leq0.4$ for all trained settings). Reading *along a row* isolates the chain-of-thought (+CoT vs. label-only) on *one* base. On the same Llama-3.2-1B base a chain-of-thought does not improve accuracy (81.35 with CoT vs. 81.42 label-only), and on T5-base it costs $7.05$ F1 (72.97 vs. 80.02). The 395M ModernBERT encoder reaches 82.90 in a single forward pass and exceeds both reasoning guards.

Table 2: 1:1 per-benchmark landscape under GuardReasoner’s protocol (per-cell F1, identical test cells, with dataset abbreviations defined in the Experimental Setup). Per column, the best score is shown in bold and the second best in *italic*. LeanGuard and GuardReasoner-1B are scored with our unified scorer. § WildGuard-7B and GuardReasoner-3B are taken from the GuardReasoner paper (Liu et al. [2025b](https://arxiv.org/html/2606.26686#bib.bib5)) (Tables 2, 5, and 8), so their cells follow that paper’s protocol, which can slightly differ from ours by a small margin. The 395M LeanGuard wins or ties 9 of 13 cells against GuardReasoner-1B.

A lightweight classifier suffices. The 395M label-only encoder, LeanGuard, reaches 82.90 $\pm$ 0.26 (3 seeds). It matches GuardReasoner-1B in one forward pass at $\sim100\times$ lower cost (Figure [1](https://arxiv.org/html/2606.26686#S1.F1)), and stays competitive with the wider production-guard landscape (Llama-Guard 2 and 3, WildGuard, Aegis, ShieldGemma, MD-Judge) under our scorer. We do not over-attribute this cross-architecture cell, because ModernBERT’s pretraining is newer, so we claim only *sufficiency*. The confound-controlled statement is the same-base comparison that removes only the reasoning (Table [1](https://arxiv.org/html/2606.26686#S5.T1)), and the cheap encoder simply shows that a single-pass classifier can also be small and fast.

### The Reasoning May Be Post\-hoc

Why does a chain-of-thought leave the verdict unchanged? The usual information argument is real but weak. At inference the reasoning is generated from the input, so $R$ is a function of $X$, the chain $Y-X-R$ holds, and $I(Y;R\mid X)=0$ by the data-processing inequality, so there is no new *information* about the verdict. This is almost definitional and concerns information, not computation. A reasoning trace could still help by making the decision easier to *compute*. What rules that out here is not a theorem but a *measurable* property of the trained reasoner, which we state as a simple observation and then verify.

Model the guard as forming an *implicit* pre-reasoning verdict $\hat{y}_{0}=\arg\max_{y}P_{\theta}(y\mid x)$, the answer it would give from $x$ alone, then sampling a reasoning trace $r\sim P_{\theta}(\cdot\mid x,\hat{y}_{0})$ and emitting $\hat{y}=\arg\max_{y}P_{\theta}(y\mid x,r)$. R-SFT trains on reasoning traces annotated to support the gold verdict, and HS-DPO prefers reasoning-verdict pairs whose verdict matches, so both push the chain toward *self-consistency*, $\Pr[\hat{y}=\hat{y}_{0}]\geq1-\epsilon$.

Observation 1 (post-hoc justification, not revision). *If the chain is $\epsilon$-self-consistent then $\hat{y}=\hat{y}_{0}$ with probability $\geq1-\epsilon$, so it overturns an erroneous first-glance verdict with probability $\leq\epsilon$. Combined with $I(Y;R\mid X)=0$, the reasoning adds neither information nor error-correction.*

The content of the observation is not the one-line algebra but whether its premise holds, so we *measure* $\epsilon$ in two complementary ways. First, we re-sample the reasoning and re-read the verdict, and across a $K$-sample majority vote the final verdict changes on only 5.08% of inputs. Second, we read the decoder’s hidden state at an early, a middle, and a late token while it generates its chain, fit a linear probe on the late at-verdict state, and apply it to the earlier states (Figure [2](https://arxiv.org/html/2606.26686#S1.F2)). The probe already places all three states inside the harmful region, and the harmful-class confidence is essentially flat across the chain, about $0.67$ at the early token and $0.67$ at the late at-verdict token, so the verdict is fixed before the chain is written. The chain is therefore empirically a justification step, not a correction step, and a slow one, $\sim10^{2}$ sequential forward passes per verdict.
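The first measurement can be sketched as a flip-rate computation over re-sampled chains. The sketch assumes that "changes" means the $K$-sample majority verdict differs from a reference verdict obtained from the original decode; the text does not spell out the reference, so that choice, the function names, and the toy verdicts below are all our assumptions.

```python
from collections import Counter

def majority(verdicts):
    """Most common verdict among K sampled chains (ties go to the first seen)."""
    return Counter(verdicts).most_common(1)[0][0]

def flip_rate(reference, resampled):
    """Fraction of inputs whose K-sample majority differs from the reference."""
    flips = sum(1 for ref, votes in zip(reference, resampled)
                if majority(votes) != ref)
    return flips / len(reference)

# Toy data: 4 inputs, K = 3 re-sampled chains each (1 = harmful, 0 = unharmful).
reference = [1, 0, 1, 1]
resampled = [[1, 1, 0], [0, 0, 0], [0, 0, 1], [1, 1, 1]]

print("flip rate:", flip_rate(reference, resampled))
```

A low flip rate under this definition is an empirical estimate of a small $\epsilon$ in Observation 1.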

Corollary (the chain does not correct label-noise errors). If training-label noise $\eta$ raises the rate of an incorrect $\hat{y}_0$, the rationalized output inherits those errors at rate $\geq (1 - \epsilon)\Pr[\hat{y}_0 \text{ wrong}]$, and a confident chain tends to defend rather than revise them (Chegini et al. [2025](https://arxiv.org/html/2606.26686#bib.bib16)), so a chain has no error-correcting advantage under label noise. Consistent with this, a one-dimensional linear projection of the encoder already separates the harmful/unharmful decision at AUC 0.926. The decision is nearly linearly separable, so a single pass realizes it and there is no serial structure for a chain to exploit, as Sprague et al. ([2025](https://arxiv.org/html/2606.26686#bib.bib15)) would predict.
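The multiplicative bound in the corollary can be unpacked in three lines. The sketch below is our reading rather than a derivation given in the text, and it rests on one extra assumption: that $\epsilon$-self-consistency also holds conditionally on the first-glance verdict being wrong, that is, $\Pr[\hat{y} = \hat{y}_0 \mid \hat{y}_0 \text{ wrong}] \geq 1 - \epsilon$.

```latex
\begin{align*}
\Pr[\hat{y} \text{ wrong}]
  &\geq \Pr[\hat{y} = \hat{y}_0,\ \hat{y}_0 \text{ wrong}] \\
  &= \Pr[\hat{y} = \hat{y}_0 \mid \hat{y}_0 \text{ wrong}] \, \Pr[\hat{y}_0 \text{ wrong}] \\
  &\geq (1 - \epsilon) \, \Pr[\hat{y}_0 \text{ wrong}].
\end{align*}
```

The first line holds because a final verdict that copies a wrong first-glance verdict is itself wrong. Without the conditional assumption, the unconditional premise $\Pr[\hat{y} = \hat{y}_0] \geq 1 - \epsilon$ yields only the weaker additive bound $\Pr[\hat{y} \text{ wrong}] \geq \Pr[\hat{y}_0 \text{ wrong}] - \epsilon$.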

What actually moves the needle. If the chain is not the lever, classical discriminative training is. On the same encoder, we also experimented with label smoothing (Müller et al. [2019](https://arxiv.org/html/2606.26686#bib.bib21)) (82.88), generalized cross-entropy (Zhang and Sabuncu [2018](https://arxiv.org/html/2606.26686#bib.bib20)) (81.83), and a longer schedule (83.37), all of which hold at or above the reasoning guard. Recent reasoning-guard work largely overlooks that, for accuracy under realistic labels, these decades-old single-pass tools dominate a reasoning trace.
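For readers unfamiliar with the two losses named above, here is a minimal per-example sketch in plain Python. It follows the standard textbook definitions (smoothed target $(1-\alpha)\,\mathrm{onehot} + \alpha/K$, and $L_q = (1 - p_y^{q})/q$), not the paper's training code; the default values `alpha=0.1` and `q=0.7` are illustrative assumptions, not the hyperparameters used for the reported scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_smoothing_ce(logits, label, alpha=0.1):
    """Cross-entropy against a smoothed target: (1 - alpha) * one-hot + alpha / K."""
    probs = softmax(logits)
    k = len(logits)
    loss = 0.0
    for i, p in enumerate(probs):
        target = (1.0 - alpha) * (1.0 if i == label else 0.0) + alpha / k
        loss -= target * math.log(p)
    return loss


def generalized_ce(logits, label, q=0.7):
    """Generalized cross-entropy L_q = (1 - p_y^q) / q for q in (0, 1].

    q -> 0 recovers ordinary cross-entropy; q = 1 gives 1 - p_y, which is
    bounded and therefore less sensitive to confidently mislabeled examples.
    """
    p = softmax(logits)[label]
    return (1.0 - p ** q) / q


# A confidently "wrong" example: the model strongly prefers class 0, label says 1.
plain_ce = label_smoothing_ce([10.0, -10.0], 1, alpha=0.0)  # ordinary CE, large
bounded = generalized_ce([10.0, -10.0], 1, q=0.7)           # capped below 1 / q
```

The contrast between `plain_ce` and `bounded` is the usual argument for why such losses tolerate label noise: a single mislabeled example cannot contribute an arbitrarily large loss.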

## 6 Heavy Reasoning and Robustness (M2)

A single-pass encoder is robust to label noise. A heavier reasoning guard is often assumed to be the safer and more robust choice. We find that a lightweight label-only encoder is already highly robust to training-label noise. Under injected label noise (Table [3](https://arxiv.org/html/2606.26686#S6.T3)), the 395M label-only encoder degrades at only $-0.81$ F1 per 10% ($R^{2} = 0.99$) and still reaches 80.56 with 30% of its training labels corrupted, while a smaller T5 encoder-decoder trained label-only degrades almost twice as fast, at $-1.55$ per 10%. The practical headline is that the encoder trained with 10% of its labels corrupted (82.16) still matches a clean GuardReasoner-1B (82.05). For accuracy under realistic noisy labels, a single-pass discriminative encoder is a strong and robust lever, and the strict-FPR comparison below shows that a heavier reasoning guard is not the more robust choice where it matters most in production.
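The noise protocol and the reported slope can be illustrated with a short sketch. It assumes binary labels (1 = harmful, 0 = unharmful) and takes "symmetric" noise to mean that each training label is flipped independently with probability $\eta$; the paper's exact corruption procedure may differ. Both function names are hypothetical.

```python
import random


def inject_symmetric_noise(labels, eta, seed=0):
    """Flip each binary label independently with probability eta (assumed protocol)."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < eta else y for y in labels]


def f1_slope_per_10pct(noise_rates, f1_scores):
    """Least-squares slope of F1 against noise rate, in F1 points per 10% noise."""
    if len(noise_rates) != len(f1_scores) or len(noise_rates) < 2:
        raise ValueError("need at least two (noise rate, F1) pairs")
    xs = [r * 10.0 for r in noise_rates]  # a rate of 0.10 becomes one unit
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(f1_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("noise rates must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, f1_scores))
    return sxy / sxx


# Toy usage on synthetic numbers (not the paper's measurements).
clean = [0, 1] * 5000
noisy = inject_symmetric_noise(clean, eta=0.3, seed=1)
flipped_fraction = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
slope = f1_slope_per_10pct([0.0, 0.1, 0.2, 0.3], [83.0, 82.2, 81.4, 80.6])
```

A training run per noise level, followed by `f1_slope_per_10pct` over the resulting scores, gives a degradation rate that can be compared with the per-10% slopes quoted above.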

![Refer to caption](https://arxiv.org/html/2606.26686v1/x3.png)

Figure 3: Noise-robustness under symmetric training-label corruption. The 395M label-only encoder (LeanGuard) degrades at only $-0.81$ F1 per 10% (seed 6100) and still scores 80.56 at 30% noise, while a smaller T5 encoder-decoder trained label-only degrades almost twice as fast ($-1.55$ per 10%). A single bidirectional pass over the whole input is robust to realistic label noise.

Table 3: Headline F1 score under symmetric training-label noise (curve in Figure [3](https://arxiv.org/html/2606.26686#S6.F3)). Here $\uparrow$ denotes that higher is better and $\downarrow$ denotes that a smaller slope magnitude is better, and the best entry in each column is in bold. The 395M ModernBERT encoder degrades only $-0.81$ F1 per 10% and still scores 80.56 with 30% of its labels corrupted, so a single-pass discriminative encoder is robust to realistic label noise.

Lightweight and embodied deployment. Real guardrails must often be small and fast, on-device, low-latency, with no room to wait for a chain, and trained on labels of uneven quality. A 395M single-pass encoder fits this regime, and the reasoning guard, about $100\times$ slower, does not. LeanGuard runs cheaply on a short context. A 512-token window already matches GuardReasoner-1B (Figure [4](https://arxiv.org/html/2606.26686#S6.F4)), and it exports to ONNX for a single-pass deployment. This is the embodied-robot scenario that motivates the work, where a guard that must generate a chain before it acts is a poor fit for the control loop. On such a controller the guard must return a verdict within a single control step, so a chain of about $10^{2}$ sequential decode steps can miss the deadline outright, whereas one bidirectional pass returns a fixed-latency verdict the controller can schedule against. The same property pays off away from the robot as well. A guard that screens every prompt and every response sits on the critical path of each interaction, so a hundred-fold reduction in per-call cost lowers serving cost directly and leaves headroom to run the guard at a higher sampling rate or alongside other safety checks rather than in place of them.

![Refer to caption](https://arxiv.org/html/2606.26686v1/x4.png)

Figure 4: Headline F1 score under various context lengths. A 512-token window already matches GuardReasoner-1B. Most public benchmarks are short or put the decisive content early, so a short context is enough to match a 1B reasoning guard.

Data efficiency. The single-pass design is also cheap to *train*, not only to run. With only a quarter of the GuardReasoner corpus the label-only encoder already reaches 81.43, within about 0.6 F1 of the full-data GuardReasoner-1B (82.05), and the full corpus carries it past every 1B and 3B reasoning guard we evaluate (Figure [5](https://arxiv.org/html/2606.26686#S6.F5)). The curve is steep early and then flat, so most of the signal is learned from a small, easily curated slice of the data. Because the encoder needs neither a long context nor a large labeled budget to match a reasoning decoder, it is also cheaper to *re-train* as moderation policies drift, whereas a reasoning guard must regenerate a full chain-annotated corpus for every policy change.

![Refer to caption](https://arxiv.org/html/2606.26686v1/x5.png)

Figure 5: Data efficiency comparison. With 25% of the training corpus the label-only encoder is already within about 0.6 F1 of the full-data GuardReasoner-1B, and the full corpus exceeds it. The decision is learnable from a small, well-curated slice, so the encoder is cheap to retrain as policies drift.

Strict-FPR operating point. Production guards run at a small false-positive rate to avoid over-blocking benign traffic, so recall at a fixed FPR matters more than a thresholded F1. We obtain a harmful-class score $\sigma(x)$ per model and sweep its threshold (Figure [6](https://arxiv.org/html/2606.26686#S6.F6)). The encoder uses its softmax probability, and the reasoner uses its harmful verdict-token probability. At a 1% FPR the encoder retains a TPR of 44.8% while the reasoner retains only 10.1%, far less recall where production guards operate. The gap is not a rescaling artifact, because $\mathrm{TPR}@\alpha$ depends only on the ranking that the score induces and is invariant to monotonic transforms. It instead reflects ranking resolution, since the reasoner’s confidence is polarized toward $\{0, 1\}$ (Chegini et al. [2025](https://arxiv.org/html/2606.26686#bib.bib16)). We therefore scope the claim to substantially more recall at strict FPR.
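The ranking argument can be made concrete with a small sketch of $\mathrm{TPR}@\alpha$. This is one common way to compute the metric, not necessarily the paper's exact evaluation code: it assumes labels are 1 for harmful and 0 for benign, flags an input when its score is strictly above a cutoff, and handles ties conservatively so the false-positive rate never exceeds $\alpha$. The function name and toy scores are hypothetical.

```python
import math


def tpr_at_fpr(scores, labels, alpha):
    """Highest recall on harmful inputs while flagging at most a fraction alpha of benign ones."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one harmful and one benign example")
    # Number of benign inputs we may flag (small epsilon guards float rounding).
    max_fp = math.floor(alpha * len(neg) + 1e-9)
    if max_fp >= len(neg):
        return 1.0
    # Flag only scores strictly above the (max_fp + 1)-th highest benign score.
    cutoff = neg[max_fp]
    return sum(s > cutoff for s in pos) / len(pos)


# Toy scores for illustration only: five benign inputs, then four harmful ones.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
graded = [0.1, 0.2, 0.3, 0.4, 0.9, 0.5, 0.6, 0.7, 0.95]
cubed = [s ** 3 for s in graded]                            # strictly increasing transform
polarized = [1.0 if s >= 0.5 else 0.0 for s in graded]      # saturated toward {0, 1}

graded_tpr = tpr_at_fpr(graded, labels, 0.0)
cubed_tpr = tpr_at_fpr(cubed, labels, 0.0)
polarized_tpr = tpr_at_fpr(polarized, labels, 0.0)
```

On this toy data the strictly increasing transform leaves the result unchanged, while saturating the scores creates ties at the top of the ranking and removes the recall that was available at the strictest operating point. Saturation is not a strictly increasing transform, which is why it can change the metric even though rescaling cannot.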

![Refer to caption](https://arxiv.org/html/2606.26686v1/x6.png)

Figure 6: Recall at a strict 1% false-positive rate. Because $\mathrm{TPR}@\alpha$ is invariant to monotonic rescaling of the score, the gap reflects ranking resolution rather than calibration. The label-only encoder retains far more recall where production guards operate, so a heavier reasoning guard may not be the more deployable choice.

## 7 Discussion and Limitations

This is a *controlled empirical study*, not a new architecture. Its contribution is the same-base ablation that the reasoning-guard literature has not reported. (i) *Reasoning means CoT fine-tuning.* We test the dominant recipe (R-SFT and HS-DPO, GuardReasoner) at 1.24B and 3B, not test-time reasoning, tool use, or verifier pipelines, and the title and abstract are scoped accordingly. (ii) *Cross-architecture confound.* The 395M versus 1.24B comparison conflates architecture with pretraining, so the controlled claim rests on the same-base comparison that removes only the reasoning (Table [1](https://arxiv.org/html/2606.26686#S5.T1)), which varies the chain alone on one base. We report the cheap encoder and its noise robustness as sufficiency, and bidirectional attention is a factor we do not isolate. (iii) *The mechanism is an observation.* Its force is the measured ${\sim}5\%$ verdict-flip rate and the at-verdict probe, not the one-line algebra. (iv) *Score choice at strict FPR.* We report the harmful verdict-token extraction for the reasoner and scope the claim to substantially more recall. Our contribution is to combine known ingredients, namely that CoT helps mainly on reasoning tasks (Sprague et al. [2025](https://arxiv.org/html/2606.26686#bib.bib15)) and that encoders are cheap and robust, into the controlled test that settles whether a reasoning guard’s chain earns its cost.

When a chain would earn its cost. Our claim is about *accuracy and robustness* on the standard moderation suite, not about interpretability. A reasoning trace still produces a human-readable rationale that an operator can audit, and that has value even when it does not change the verdict, so a deployment that needs an explanation may accept the cost knowingly. We see this as a separable axis from accuracy. A label-only encoder can be paired with a cheap post-hoc rationale only when an audit trail is actually requested, rather than paying for a chain on every call by default. The result also predicts where reasoning *should* help: a task whose decision is not nearly linearly separable, with genuinely multi-step policies, compositional rules, or tool-mediated lookups, would give the chain real serial structure to exploit, unlike the short, single-step labeling task studied here. Mapping that boundary is exactly what the same-base test is for, and extending it to multilingual and multimodal moderation, where the harmful/unharmful boundary may be less linear, is the natural next step.

## 8 Conclusion

On a fixed base, at both 1.24B and 3B, a chain-of-thought does not measurably improve a safety guardrail’s accuracy. This is because the reasoning guard largely *justifies* a first-glance verdict that it overturns on only ${\sim}5\%$ of inputs, at ${\sim}100\times$ the inference cost. A 395M label-only encoder, LeanGuard, matches a 1.24B reasoning guard, stays robust under injected training-label noise, and retains far more recall at a strict false-positive rate than the reasoning guard, which counters the assumption that a heavier reasoning guard is the safer choice. For a lightweight safety guardrail, and especially for the on-device embodied setting that motivates this work, a calibrated label-only encoder is a simpler, cheaper, and at-least-as-accurate default than a CoT-fine-tuned reasoning decoder. We release LeanGuard with all data splits, models, code, and an ONNX export.

More broadly, a calibrated discriminative encoder, not a CoT decoder, is the natural *default baseline* against which new guards are measured. We hope that future studies will report the same-base ablation that isolates the chain from architecture, scale, and data.

## References

- A. Chegini, H. Kazemi, G. Souza, M. Safi, Y. Song, S. Bengio, S. Williamson, and M. Farajtabar (2025). Reasoning’s razor: reasoning improves accuracy but can hurt recall at critical operating points in safety and hallucination detection. arXiv preprint arXiv:2510.21049.
- S. R. Chowdhury, A. Kini, and N. Natarajan (2024). Provably robust DPO: aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409.
- J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang (2024). Safe RLHF: safe reinforcement learning from human feedback. In International Conference on Learning Representations (ICLR).
- Y. Deng, Y. Choi, and S. Shieber (2024). From explicit CoT to implicit CoT: learning to internalize CoT step by step. arXiv preprint arXiv:2405.14838.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
- A. Dubey et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- S. Ghosh, P. Varshney, E. Galinkin, and C. Parisien (2024). AEGIS: online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993.
- S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri (2024). WildGuard: open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track).
- S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- A. Havrilla and M. Iyer (2024). Understanding the effect of noise in LLM training data with algorithmic chains of thought. arXiv preprint arXiv:2402.04004.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
- J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang (2023). BeaverTails: towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems.
- M. Kang and B. Li (2024). R2-Guard: robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. arXiv preprint arXiv:2407.05557.
- T. Lanham et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
- L. Li et al. (2024). SALAD-Bench: a hierarchical and comprehensive safety benchmark for large language models. In Findings of the ACL.
- Z. Lin, Z. Wang, Y. Tong, Y. Wang, Y. Guo, Y. Wang, and J. Shang (2023). ToxicChat: unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of EMNLP.
- A. Liu, Z. Ying, et al. (2025a). AgentSafe: benchmarking the safety of embodied agents on hazardous instructions. arXiv preprint arXiv:2506.14697.
- S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda (2020). Early-learning regularization prevents memorization of noisy labels. In Advances in Neural Information Processing Systems.
- Y. Liu, H. Gao, S. Zhai, J. Xia, T. Wu, Z. Xue, Y. Chen, K. Kawaguchi, J. Zhou, and B. Hooi (2025b). GuardReasoner: towards reasoning-based LLM safeguards. arXiv preprint arXiv:2501.18492.
- T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng (2023). A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence.
- M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks (2024). HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the International Conference on Machine Learning (ICML).
- Meta AI (2024). Llama Guard 3. https://huggingface.co/meta-llama/Llama-Guard-3-8B
- R. Müller, S. Kornblith, and G. Hinton (2019). When does label smoothing help? In Advances in Neural Information Processing Systems.
- D. Na, C. Kim, G. Choi, and D. Hong (2026a). Semantic flip: synthetic OOD generation for robust refusal in embodied question answering and spatial localization. arXiv preprint arXiv:2606.16898.
- D. Na, C. Kim, S. Rho, G. Choi, G. Lee, and D. Hong (2026b). Binary tracking for spatial QA and navigation with open vision-language models. arXiv preprint arXiv:2606.16902.
- I. Padhi et al. (2024). Granite Guardian. arXiv preprint arXiv:2412.07724.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
- Z. Ravichandran, A. Robey, V. Kumar, G. J. Pappas, et al. (2025). Safety guardrails for LLM-enabled robots. arXiv preprint arXiv:2503.07885.
- P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy (2024). XSTest: a test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of NAACL-HLT.
- P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V. Sindhwani (2025). Generating robot constitutions and benchmarks for semantic safety. arXiv preprint arXiv:2503.08663.
- Z. Sprague, F. Yin, J. D. Rodriguez, D. Jiang, M. Wadhwa, P. Singhal, X. Zhao, X. Ye, K. Mahowald, and G. Durrett (2025). To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. In International Conference on Learning Representations (ICLR).
- Y. Sui et al. (2025). Stop overthinking: a survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.
- M. Turpin, J. Michael, E. Perez, and S. R. Bowman (2023). Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems.
- N. Wang, Z. Yan, W. Li, C. Ma, H. Chen, and T. Xiang (2025). Advancing embodied agent security: from safety benchmarks to input moderation. arXiv preprint arXiv:2504.15699.
- B. Warner, A. Chaffin, B. Clavié, et al. (2024). Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv preprint arXiv:2412.13663.
- X. Wen et al. (2025). ThinkGuard: deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458.
- S. Yin et al. (2024). SafeAgentBench: a benchmark for safe task planning of embodied language model agents. arXiv preprint arXiv:2412.13178.
- W. Zeng et al. (2024). ShieldGemma: generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772.
- H. Zhang et al. (2024). BadRobot: jailbreaking embodied LLMs in the physical world. arXiv preprint arXiv:2407.20242.
- Z. Zhang and M. R. Sabuncu (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems.
- Z. Zhou, R. Tao, J. Zhu, Y. Luo, Z. Wang, and B. Han (2024). Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? In Advances in Neural Information Processing Systems.
