Robust and Efficient Guardrails with Latent Reasoning

arXiv cs.AI 05/29/26, 04:00 AM Papers
latent-reasoning guardrails llm-safety efficient-inference arxiv uc-davis
Summary
CoLaGuard is a new guardrail model that transfers multi-step safety reasoning into a continuous latent space, achieving a 12.9x speedup and a 22.4x token reduction compared to an explicit reasoning baseline while matching its macro-F1 across ten prompt- and response-moderation settings spanning eight safety benchmarks.
arXiv:2605.29068v1 Announce Type: new Abstract: Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose CoLaGuard, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, CoLaGuard improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9× speedup and 22.4× reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.
Cached at: 05/29/26, 09:12 AM
# Robust and Efficient Guardrails with Latent Reasoning
Source: [https://arxiv.org/html/2605.29068](https://arxiv.org/html/2605.29068)
Siddharth Sai, Xiaofei Wen, Muhao Chen

University of California, Davis

{sai,xfwe,muhchen}@ucdavis.edu

###### Abstract

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose CoLaGuard, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, CoLaGuard improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9× speedup and 22.4× reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.29068v1/x1.png)

Figure 1: Overview of CoLaGuard. Unlike explicit reasoning guardrails (left) that generate chain-of-thought tokens before assigning labels, CoLaGuard (right) reasons through recurrent latent states, preserving moderation performance while avoiding token generation overhead and enabling 12.9× faster inference and 22.4× fewer tokens. CoLaGuard's stage-wise internalization curriculum (center) begins with explicit CoT supervision and progressively replaces reasoning tokens with latent states, shifting reasoning into hidden activations.

As Large Language Models (LLMs) become integral to daily and industrial applications, ensuring their alignment with human values is critical. Although alignment training methods such as RLHF (Ouyang et al., [2022](https://arxiv.org/html/2605.29068#bib.bib56); Rafailov et al., [2023](https://arxiv.org/html/2605.29068#bib.bib57)) can improve model behavior, they require modifying the target model and are costly to update after deployment. External safety guardrails (Inan et al., [2023](https://arxiv.org/html/2605.29068#bib.bib28); Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)) therefore provide a practical alternative by offloading input and output moderation to smaller, third-party models. Early guardrails typically formulate moderation as single-pass classification, which is efficient but often becomes brittle under ambiguous, adversarial, or context-dependent safety decisions.

Recent explicit reasoning guardrails (Wen et al., [2025b](https://arxiv.org/html/2605.29068#bib.bib61); Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35)) improve robustness by learning from distilled chain-of-thought (CoT) supervision (Hsieh et al., [2023](https://arxiv.org/html/2605.29068#bib.bib49); Kim et al., [2023](https://arxiv.org/html/2605.29068#bib.bib50)) and generating intermediate rationales before predicting a safety label. MrGuard (Yang et al., [2025](https://arxiv.org/html/2605.29068#bib.bib2)) further extends reasoning-based guardrails to multilingual safety moderation by combining synthetic multilingual supervision with curriculum-guided Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.29068#bib.bib3)). However, this robustness comes at a steep computational cost. Because these models verbalize their intermediate rationales, moderation becomes a long autoregressive generation process. The additional CoT tokens substantially inflate inference time and completion-token cost, making explicit reasoning guardrails difficult to deploy in high-traffic, real-time settings (Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35); Sreedhar et al., [2025](https://arxiv.org/html/2605.29068#bib.bib37)). Existing efficiency-oriented variants, such as shorter supervised traces or reasoning on/off switches (NVIDIA, [2025](https://arxiv.org/html/2605.29068#bib.bib62); Sreedhar et al., [2025](https://arxiv.org/html/2605.29068#bib.bib37)), reduce the amount or frequency of rationale generation but still rely on explicit decoding and may sacrifice robustness.

This motivates a natural question: can guardrails retain the benefits of reasoning supervision without generating reasoning tokens at inference time? We study this question through CoLaGuard, a latent-reasoning safety guardrail that internalizes explicit safety rationales into continuous recurrent states, as shown in Figure [1](https://arxiv.org/html/2605.29068#S1.F1). Inspired by Coconut (Hao et al., [2025](https://arxiv.org/html/2605.29068#bib.bib51)) and ICoT-SI (Deng et al., [2025](https://arxiv.org/html/2605.29068#bib.bib42)), CoLaGuard performs a fixed number of latent recurrent steps in place of explicit rationale generation. It first learns from CoT supervision and then progressively replaces rationale tokens with latent states, allowing the model to directly predict the safety label without autoregressive rationale generation. A practical challenge is that pretrained LLMs are optimized to consume token embeddings rather than recirculated contextual hidden states, which can create a distribution mismatch during latent recurrence. To reduce this mismatch, we adopt Context-Prediction Fusion (Liu et al., [2026](https://arxiv.org/html/2605.29068#bib.bib15)), which combines contextual hidden-state information with predictive semantic guidance from the vocabulary embedding space. This stabilizes latent recurrence while preserving the latency and token-efficiency benefits of avoiding explicit CoT generation.

In summary, this work makes three main contributions. (1) We introduce CoLaGuard, a latent-reasoning safety guardrail that internalizes explicit safety rationales through a stage-wise curriculum, enabling moderation without autoregressive rationale generation at inference time. (2) We show that CoLaGuard preserves the robustness of explicit reasoning guardrails while substantially reducing inference cost, suggesting that reasoning-based moderation can be made practical without verbalized rationales. (3) We analyze the latent recurrence process and find that CoLaGuard improves over vanilla Coconut, consistent with progressive safety-relevant representation shifts across latent steps that are largely absent in vanilla Coconut recurrence.

## 2 Related Work

#### LLM Guardrails

External guardrails provide a lightweight mechanism for safety moderation without modifying the base LLM. Early architectures such as Llama Guard (Inan et al., [2023](https://arxiv.org/html/2605.29068#bib.bib28)) and WildGuard (Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)) treated moderation as classification, followed by models like ShieldGemma (Zeng et al., [2024](https://arxiv.org/html/2605.29068#bib.bib29)), Aegis (Ghosh et al., [2024](https://arxiv.org/html/2605.29068#bib.bib16)), and Qwen3Guard (Zhao et al., [2025](https://arxiv.org/html/2605.29068#bib.bib17)), which improved performance through broader taxonomies. The broader guardrail literature has expanded robustness through adversarially resilient moderation (Yuan et al., [2024](https://arxiv.org/html/2605.29068#bib.bib21)) and structured safety knowledge (Kang and Li, [2025](https://arxiv.org/html/2605.29068#bib.bib20)). Recent work further improves performance with reasoning: GuardReasoner (Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35)) and ThinkGuard (Wen et al., [2025b](https://arxiv.org/html/2605.29068#bib.bib61)) use chain-of-thought rationales (Wei et al., [2023](https://arxiv.org/html/2605.29068#bib.bib19)) from expert models to improve generalization, while MrGuard (Yang et al., [2025](https://arxiv.org/html/2605.29068#bib.bib2)) extends reasoning-based guardrails to multilingual moderation through synthetic multilingual supervision and curriculum-guided GRPO. Others explore efficiency trade-offs through shorter rationale traces and on/off switches (NVIDIA, [2025](https://arxiv.org/html/2605.29068#bib.bib62); Sreedhar et al., [2025](https://arxiv.org/html/2605.29068#bib.bib37); Rebedea et al., [2023](https://arxiv.org/html/2605.29068#bib.bib34)). However, because these models verbalize reasoning in natural language, they incur steep autoregressive decoding costs that limit their practicality for high-traffic, real-world deployment.

#### Latent Reasoning

A growing literature suggests that effective reasoning can occur within a model's hidden states rather than through explicit tokens (Chen et al., [2025](https://arxiv.org/html/2605.29068#bib.bib22); Zhu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib23); Biran et al., [2024](https://arxiv.org/html/2605.29068#bib.bib59)). This space includes augmenting models with "thinking" tokens (Goyal et al., [2024](https://arxiv.org/html/2605.29068#bib.bib47); Zelikman et al., [2024](https://arxiv.org/html/2605.29068#bib.bib46); Pfau et al., [2024](https://arxiv.org/html/2605.29068#bib.bib60)), internalizing CoT through staged curricula (Deng et al., [2023](https://arxiv.org/html/2605.29068#bib.bib45), [2025](https://arxiv.org/html/2605.29068#bib.bib42)), and feeding hidden states back as continuous input embeddings (Hao et al., [2025](https://arxiv.org/html/2605.29068#bib.bib51); Cheng and Durme, [2024](https://arxiv.org/html/2605.29068#bib.bib48); Zhu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib23)). However, these methods have largely been studied on mathematical and logical reasoning tasks, and directly recycling raw hidden states can become unstable at larger scales due to distribution mismatch with the token embedding manifold. Latent Thoughts Tuning (Liu et al., [2026](https://arxiv.org/html/2605.29068#bib.bib15)) addresses this with a context-prediction fusion mechanism that aligns contextual hidden states with predictive signals from the vocabulary embedding space. CoLaGuard adapts these techniques, showing that latent reasoning can drastically reduce latency costs and preserve the robustness of explicit baselines in safety moderation.

## 3 CoLaGuard

We now present CoLaGuard, a latent-reasoning guardrail framework for efficient prompt and response moderation. CoLaGuard uses explicit safety rationales generated by expert models as training-time supervision, then progressively internalizes this step-by-step reasoning into recurrent latent states so that inference incurs only a fixed latent computation budget before decoding the safety labels. We formulate the guardrail task in §[3.1](https://arxiv.org/html/2605.29068#S3.SS1), describe reasoning-augmented supervision and explicit warm-up in §[3.2](https://arxiv.org/html/2605.29068#S3.SS2)–§[3.3](https://arxiv.org/html/2605.29068#S3.SS3), and present latent recurrence, stage-wise internalization, and efficient inference in §[3.4](https://arxiv.org/html/2605.29068#S3.SS4)–§[3.6](https://arxiv.org/html/2605.29068#S3.SS6).

### 3.1 Guardrail Task

Given a user prompt $x$ and a model response $s$, a guardrail model $G_\theta$ predicts the safety of both the input request and the generated response, $(\hat{y}^{p}, \hat{y}^{r}) = G_\theta(x, s)$, where $\hat{y}^{p} \in \mathcal{Y}$ denotes the predicted prompt harmfulness label, $\hat{y}^{r} \in \mathcal{Y}$ denotes the predicted response harmfulness label, and $\mathcal{Y}$ denotes the set of safety categories in the guardrail's policy.

### 3.2 Reasoning-Augmented Supervision

The central challenge is maintaining the robustness of reasoning-based guardrails without requiring that the guardrail verbalize its reasoning process at inference time. CoLaGuard addresses this by using explicit rationales for the initial training scaffolding. This follows prior work on chain-of-thought reasoning, step-by-step distillation, and reasoning-based safety guardrails, where intermediate rationales provide richer supervision than final labels alone (Wei et al., [2023](https://arxiv.org/html/2605.29068#bib.bib19); Hsieh et al., [2023](https://arxiv.org/html/2605.29068#bib.bib49); Kim et al., [2023](https://arxiv.org/html/2605.29068#bib.bib50); Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35); Wen et al., [2025b](https://arxiv.org/html/2605.29068#bib.bib61)).

We assume access to a reasoning-augmented guardrail corpus $\mathcal{D} = \{(x_i, s_i, r_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a user prompt, $s_i$ is the corresponding model response, $y_i = (y_i^{p}, y_i^{r})$ contains the final prompt and response safety labels, and

$$r_i = (r_i^{1}, r_i^{2}, \ldots, r_i^{m_i})$$

is a step-separated safety rationale.

Unlike standard label-only guardrail training, this supervision exposes the model to the reasoning underlying the final moderation decision. However, CoLaGuard does not aim to generate these rationales at inference time. Instead, the rationales serve as targets during the initial stages of training so that the model can later compress the deliberation process into latent steps.

### 3.3 Explicit Reasoning Warm-Up

The first stage (Stage 0) trains the model as an explicit reasoning guardrail. Given an instruction $I$, prompt $x$, response $s$, rationale $r$, and final label tuple $y$, the model is optimized to generate a structured safety-relevant rationale followed by the final safety labels:

$$\mathcal{L}_{\mathrm{warm}} = -\mathbb{E}_{(x,s,r,y)\sim\mathcal{D}} \log p_\theta(r, y \mid I, x, s).$$

This warm-up follows the explicit reasoning guardrail paradigm, where models learn to verbalize intermediate safety reasoning before predicting final moderation labels (Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35); Wen et al., [2025b](https://arxiv.org/html/2605.29068#bib.bib61)). We denote the resulting model as $G_\theta^{0}$, from which subsequent stages progressively replace explicit rationale steps with latent recurrent steps.

### 3.4 Dual-Mode Latent Recurrence

To internalize reasoning, CoLaGuard switches between two modes. In language mode, the model consumes standard token embeddings and predicts the next token autoregressively. In latent mode, the model does not consume a standard token embedding; instead, the previous hidden state is fed back as the next input representation.

Let $e(\cdot)$ denote the token embedding function and let $h_t \in \mathbb{R}^{d}$ be the last-layer hidden state at position $t$. For a sequence with a latent span beginning at position $a$ and ending at position $b$, vanilla latent recurrence replaces the input embedding at each latent position with the previous hidden state:

$$E_t = \begin{cases} e(w_t), & t < a \text{ or } t > b, \\ h_{t-1}, & a \leq t \leq b, \end{cases}$$

where $w_t$ is the discrete token at position $t$ outside the latent span. This follows the chain-of-continuous-thought formulation introduced by Hao et al. ([2025](https://arxiv.org/html/2605.29068#bib.bib51)), allowing the model to perform recurrent computation in continuous latent space rather than generating intermediate rationale tokens.
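The case split for $E_t$ can be illustrated with a toy loop. This is only a sketch under strong simplifying assumptions: `toy_step` is a hypothetical stand-in for the transformer (a real model attends over the whole prefix rather than a single previous state), positions are 0-indexed, and the hidden state before the first position is taken to be zero.

```python
import math

def toy_step(inp, prev_hidden):
    """Hypothetical stand-in for the model: mixes the input with the previous state."""
    return [math.tanh(0.5 * x + 0.5 * h) for x, h in zip(inp, prev_hidden)]

def run_with_latent_span(token_embs, a, b):
    """Feed e(w_t) outside [a, b] and the previous hidden state h_{t-1} inside it."""
    h_prev = [0.0] * len(token_embs[0])  # assumed initial state
    hiddens = []
    for t, emb in enumerate(token_embs):
        E_t = h_prev if a <= t <= b else emb  # the case split from the equation
        h_prev = toy_step(E_t, h_prev)
        hiddens.append(h_prev)
    return hiddens

# Five positions with a latent span covering positions 2 and 3.
hiddens = run_with_latent_span([[1.0, -1.0]] * 5, a=2, b=3)
```

Because the token embeddings at positions 2 and 3 are never read, replacing them with arbitrary values leaves `hiddens` unchanged, which is the sense in which the latent span carries no discrete tokens.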

While this latent recurrence lays the foundation of CoLaGuard, directly feeding contextual hidden states back into a pretrained transformer creates a distribution mismatch, since the base model is trained to consume token embeddings while $h_{t-1}$ is a hidden representation. To reduce the hidden-state/token-embedding mismatch observed in latent recurrence, we adopt context-prediction fusion from Latent Thoughts Tuning (Liu et al., [2026](https://arxiv.org/html/2605.29068#bib.bib15)). At each latent position, the model first computes a predictive embedding from the next-token distribution induced by the previous hidden state:

$$e_{\mathrm{pred}}(h_{t-1}) = \sum_{v \in \mathcal{V}_p} \tilde{p}_\theta(v \mid h_{t-1})\, e(v),$$

where $\mathcal{V}_p$ is the nucleus-filtered vocabulary set, and $\tilde{p}_\theta(v \mid h_{t-1})$ is the renormalized probability distribution over this set. Structural latent-control tokens are excluded from this distribution.

The recurrent input is then constructed by fusing the contextual hidden state with the predictive embedding:

$$\tilde{e}_t = \alpha h_{t-1} + (1 - \alpha)\, e_{\mathrm{pred}}(h_{t-1}),$$

where $\alpha \in [0, 1]$ controls the balance between contextual continuity and semantic anchoring. Finally, a lightweight projection module maps the fused representation back into the model input space:

$$\mathbf{e}^{\mathrm{in}}_t = \begin{cases} h_{t-1}, & \alpha = 1, \\ g_\phi(\tilde{e}_t), & \alpha < 1 \text{ and adapter is used}, \\ \tilde{e}_t, & \alpha < 1 \text{ and no adapter is used}, \end{cases}$$

where $g_\phi$ is a trainable adapter. When $\alpha = 1$, this reduces to vanilla hidden-state recurrence; when $\alpha < 1$, the recurrent state is anchored by predictive information from the vocabulary embedding.
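A minimal pure-Python sketch of the fusion step may help make the three equations concrete. It is not the authors' implementation. It assumes the next-token logits for $h_{t-1}$ are supplied by the caller, that hidden and embedding dimensions match, and it covers only the no-adapter branch. The exclusion of latent-control tokens and the fusion temperature are omitted, and all function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the highest-probability ids until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def fused_input(h_prev, logits, embedding_table, alpha, top_p=0.9):
    """Return alpha * h_{t-1} + (1 - alpha) * e_pred(h_{t-1}) (no adapter)."""
    if alpha == 1.0:
        return list(h_prev)  # vanilla hidden-state recurrence
    p_tilde = nucleus_filter(softmax(logits), top_p)
    e_pred = [
        sum(p * embedding_table[v][j] for v, p in p_tilde.items())
        for j in range(len(h_prev))
    ]
    return [alpha * h + (1 - alpha) * e for h, e in zip(h_prev, e_pred)]

# Toy vocabulary of three embeddings; the third token is nearly impossible.
table = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mixed = fused_input([2.0, 2.0], [0.0, 0.0, -50.0], table, alpha=0.6)
```

In this toy call the nucleus keeps the first two tokens with equal weight, so the predictive embedding is their midpoint and the fused input lies between that midpoint and the hidden state.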

### 3.5 Stage-Wise Internalization

The core of CoLaGuard is a stage-wise curriculum that progressively replaces natural-language rationale steps with recurrent latent steps. The staged replacement schedule follows prior internalization curricula showing that gradually replacing explicit reasoning tokens is more stable than removing rationales all at once (Deng et al., [2023](https://arxiv.org/html/2605.29068#bib.bib45), [2025](https://arxiv.org/html/2605.29068#bib.bib42); Hao et al., [2025](https://arxiv.org/html/2605.29068#bib.bib51)).

For an example with $m$ rationale steps, we write the step-separated rationale as

$$r = (r^{1}, r^{2}, \ldots, r^{m}),$$

let $K$ denote the maximum number of reasoning steps represented by the latent budget, and define $\ell_k = \min(k, K)$. At stage $k$, the first $k$ rationale steps are removed and replaced with $\ell_k c$ latent positions:

$$(r^{1}, \ldots, r^{k}) \rightarrow (z_1, \ldots, z_{\ell_k c}),$$

where $c$ is the number of latent positions allocated per replaced reasoning step within the latent budget. We denote the resulting training sequence as $q^{(k)}$, which contains the instruction, prompt, response, latent span, any remaining rationale steps $r^{k+1}, \ldots, r^{m}$, and the final labels $y$.

If $k \geq m$, the rationale is fully replaced and the final label tuple follows the latent span directly; however, because the latent budget is fixed, examples with more than $K$ rationale steps may still contain explicit rationale tokens after the maximum latent stage is reached. We therefore include a final compression stage that keeps the latent span fixed at $Kc$ positions while removing all remaining rationale steps:

$$r^{1:m} \rightarrow (z_1, \ldots, z_{Kc}).$$

This extra stage enables the absorption of residual explicit reasoning signal into the fixed latent recurrence rather than remaining decoded as text.
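The replacement schedule can be sketched as a small helper that builds the middle of the training sequence $q^{(k)}$. This is an illustrative sketch, not the authors' code: the `"<latent>"` placeholder string and the function name are hypothetical, and the instruction, prompt, response, latent-control tokens, and final labels that surround this span are left out.

```python
def build_stage_sequence(rationale_steps, k, K, c, final_compression=False):
    """Return latent placeholders followed by the rationale steps still kept as text."""
    if final_compression:
        # Final stage: a fixed K*c latent span and no explicit rationale at all.
        n_latent, remaining = K * c, []
    else:
        # Stage k: drop the first k steps, insert min(k, K) * c latent positions.
        n_latent = min(k, K) * c
        remaining = list(rationale_steps[k:])
    return ["<latent>"] * n_latent + remaining

steps = ["s1", "s2", "s3", "s4"]              # m = 4 rationale steps
stage_2 = build_stage_sequence(steps, k=2, K=3, c=1)
last_stage = build_stage_sequence(steps, k=3, K=3, c=1)
compressed = build_stage_sequence(steps, k=3, K=3, c=1, final_compression=True)
```

With $m = 4$ and $K = 3$, the last regular stage still leaves `"s4"` as text, which is the residual that the final compression stage removes.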

Training optimizes only the remaining language tokens and final labels. The prompt, response, latent-control tokens, and latent positions are masked from the language-modeling loss. Let $\mathcal{M}^{(k)}$ be the set of supervised token positions in $q^{(k)}$. The internalization objective is

$$\mathcal{L}_{\mathrm{int}}^{(k)} = -\mathbb{E}_{(x,s,r,y)\sim\mathcal{D}} \sum_{t \in \mathcal{M}^{(k)}} \log p_\theta\big(q^{(k)}_t \mid q^{(k)}_{<t}\big).$$

The same masked language-modeling objective is used for the final compression stage, with supervised positions restricted to the final safety-label tokens.
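For a single training sequence, the masked objective reduces to summing negative log-probabilities over the supervised positions only. The sketch below assumes per-token log-probabilities are already available and uses a hypothetical boolean mask to represent $\mathcal{M}^{(k)}$; the expectation over the dataset is omitted.

```python
import math

def masked_nll(token_log_probs, supervised_mask):
    """Sum of -log p over supervised positions; masked positions contribute nothing."""
    return -sum(lp for lp, keep in zip(token_log_probs, supervised_mask) if keep)

# Hypothetical layout: 2 prompt tokens, 2 latent positions, 1 rationale token, 1 label token.
log_probs = [math.log(0.5)] * 4 + [math.log(0.25), math.log(0.5)]
mask = [False, False, False, False, True, True]
loss = masked_nll(log_probs, mask)  # only the rationale and label tokens count
```

In the final compression stage the same function would apply with the mask set to `True` only on the label tokens.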

As $k$ increases, less of the original rationale remains in language space, forcing more of the safety decision process to be represented by latent recurrence. This objective gives the latent positions no direct textual target, so the latent states are optimized through their downstream ability to predict the remaining rationale steps and the final safety labels.

### 3.6 Efficient Inference

At inference time, CoLaGuard receives only the instruction, prompt, and response. It appends a fixed latent span and performs recurrent latent computation using the fused update in §[3.4](https://arxiv.org/html/2605.29068#S3.SS4). Let

$$Z_L = \big[\langle\text{start-latent}\rangle, z_1, \ldots, z_L, \langle\text{end-latent}\rangle\big]$$

denote the fixed latent span, where $L$ is the latent budget used at deployment. After the latent span, the model returns to language mode and autoregressively predicts the final prompt and response safety labels:

$$(\hat{y}^{p}, \hat{y}^{r}) = \arg\max_{(y^{p}, y^{r}) \in \mathcal{Y}^{2}} p_\theta(y^{p}, y^{r} \mid I, x, s, Z_L).$$

Because CoLaGuard does not generate natural-language rationales, its inference cost scales with the number of latent positions rather than the length of an explicit chain-of-thought.

Table 1: Evaluation benchmarks for prompt and response harmfulness detection.

| Benchmark | Samples |
| --- | --- |
| *Prompt Harmfulness Detection* | |
| ToxicChat (Lin et al., [2023](https://arxiv.org/html/2605.29068#bib.bib38)) | 2,853 |
| OpenAI Moderation (Markov et al., [2023](https://arxiv.org/html/2605.29068#bib.bib40)) | 1,680 |
| Aegis Safety Test (Ghosh et al., [2024](https://arxiv.org/html/2605.29068#bib.bib16)) | 359 |
| HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.29068#bib.bib39)) | 239 |
| WildGuardTest (Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)) | 1,756 |
| *Response Harmfulness Detection* | |
| HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.29068#bib.bib39)) | 602 |
| SafeRLHF (Ji et al., [2024](https://arxiv.org/html/2605.29068#bib.bib14)) | 2,000 |
| BeaverTails (Ji et al., [2023](https://arxiv.org/html/2605.29068#bib.bib30)) | 3,021 |
| XSTest (Röttger et al., [2024](https://arxiv.org/html/2605.29068#bib.bib41)) | 446 |
| WildGuardTest (Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)) | 1,768 |

Table 2: F1 Score (%) of Models on 5 Benchmarks of Prompt Harmfulness Detection. Bold and underlined values denote the best and runner-up. "–" denotes the result is unavailable.

| Method | Model Size | ToxicChat | HarmBench | OpenAI Mod. | Aegis Safety Test | WildGuardTest | Macro Avg | Micro Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Closed-Source Guard API* | | | | | | | | |
| GPT-4o | Unknown | 64.46 | 82.27 | 62.26 | 81.07 | 80.87 | 74.19 | 69.59 |
| GPT-4o+CoT | Unknown | 73.43 | 81.98 | 76.78 | 88.24 | 82.75 | 80.64 | 77.69 |
| o1-preview | Unknown | 57.69 | 89.61 | 74.60 | 83.15 | 76.31 | 76.27 | 69.00 |
| *Open-Source Guard Model* | | | | | | | | |
| LLaMA Guard | 7B | 61.60 | 67.20 | 75.80 | 74.10 | 56.00 | 66.94 | 64.48 |
| LLaMA Guard 2 | 8B | 47.10 | 94.00 | 76.10 | 71.80 | 70.90 | 71.98 | 63.16 |
| LLaMA Guard 3 | 8B | 53.12 | 98.94 | 79.69 | 99.50 | 68.47 | 79.94 | 67.52 |
| Aegis Guard Defensive | 7B | 70.00 | 77.70 | 67.50 | 84.80 | 78.50 | 75.70 | 72.60 |
| Aegis Guard Permissive | 7B | 73.00 | 70.50 | 74.70 | 82.90 | 71.50 | 74.52 | 73.46 |
| Aegis Guard 2.0 | 8B | – | – | 81.00 | – | 81.60 | – | – |
| ShieldGemma | 2B | 6.91 | 11.81 | 13.89 | 7.47 | 9.36 | 9.89 | 9.44 |
| ShieldGemma | 9B | 67.92 | 67.96 | 78.58 | 77.63 | 57.74 | 69.97 | 68.43 |
| WildGuard | 7B | 70.80 | 98.90 | 72.10 | 89.40 | 88.90 | 84.02 | 77.68 |
| QwQ-preview | 32B | 34.81 | 86.73 | 61.58 | 80.23 | 66.02 | 65.87 | 53.47 |
| GuardReasoner | 1B | 72.43 | 96.31 | 70.06 | 89.34 | 87.37 | 83.10 | 77.37 |
| GuardReasoner | 3B | 78.20 | 89.10 | 71.87 | 91.39 | 89.01 | 83.91 | 80.48 |
| GuardReasoner | 8B | 78.79 | 91.86 | 72.00 | 90.18 | 89.17 | 84.40 | 80.83 |
| *Latent Reasoning Guardrail (Ours)* | | | | | | | | |
| CoLaGuard (Ours) | 3B | 75.27 | 94.25 | 73.15 | 90.58 | 88.15 | 84.28 | 79.49 |
| CoLaGuard (Ours) | 8B | 75.26 | 93.54 | 73.45 | 89.45 | 89.44 | 84.23 | 79.77 |

## 4 Experiments

To evaluate CoLaGuard, we conduct experiments on multiple safety benchmarks, comparing (1) safety classification performance across various baselines and (2) inference efficiency against explicit reasoning guardrails.

### 4.1 Experimental Setup

#### Reasoning Augmented Dataset.

We use the GuardReasonerTrain dataset (Liu et al., [2025](https://arxiv.org/html/2605.29068#bib.bib35)) as the primary training source for our guardrail model. GuardReasonerTrain is a 127,000-example reasoning-augmented compilation of the following safety-focused datasets: WildGuard (Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)), AegisSafety (Ghosh et al., [2024](https://arxiv.org/html/2605.29068#bib.bib16)), BeaverTails (Ji et al., [2023](https://arxiv.org/html/2605.29068#bib.bib30)), and ToxicChat (Lin et al., [2023](https://arxiv.org/html/2605.29068#bib.bib38)). Each example comes with a prompt, composed of the guardrail instructions, a user input, and a user output; a multi-step reasoning trace separated into the three tasks of request moderation, refusal detection, and response moderation; and ground-truth answers for the three tasks. Using the same reasoning-augmented supervision source as GuardReasoner allows us to directly compare explicit rationale generation against latent internalization under a matched training signal.

Both iCoT (Deng et al., [2025](https://arxiv.org/html/2605.29068#bib.bib42)) and Coconut (Hao et al., [2025](https://arxiv.org/html/2605.29068#bib.bib51)) show that replacing too many language tokens per stage can destabilize training. We therefore split reasoning traces into smaller step-level replacements, but this increases the number of training stages and overall computational cost. To reduce cost and focus on request and response safety moderation, we remove the refusal task from CoLaGuard training supervision.

#### Training Details.

We use separate training configurations for the explicit CoT warm-up stage and the latent internalization stages. In Stage 0, we fully fine-tune Llama 3.1 8B on GuardReasonerTrain to obtain an explicit reasoning baseline. Training is performed on 8× A100 (80GB) GPUs for 3 epochs, with per-device batch size 1, gradient accumulation 32, AdamW optimization (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.29068#bib.bib58)), a cosine learning-rate schedule, and an initial learning rate of $5 \times 10^{-5}$. The fusion coefficient is set to $\alpha = 1.0$ in this stage, so the fusion module is inactive.

Table 3: F1 Score (%) of Models on 5 Benchmarks of Response Harmfulness Detection. Bold and underlined values denote the best and runner-up. "–" denotes the result is unavailable.

| Method | Model Size | HarmBench | SafeRLHF | BeaverTails | XSTestResponse | WildGuardTest | Macro Avg | Micro Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Closed-Source Guardrail API* | | | | | | | | |
| GPT-4o | Unknown | 56.34 | 64.05 | 78.63 | 65.12 | 65.24 | 65.88 | 69.41 |
| GPT-4o+CoT | Unknown | 65.99 | 65.10 | 82.26 | 86.90 | 71.43 | 74.34 | 74.45 |
| o1-preview | Unknown | 76.40 | 66.60 | 79.96 | 74.75 | 50.00 | 69.54 | 69.22 |
| *Open-Source Guardrail* | | | | | | | | |
| LLaMA Guard | 7B | 52.00 | 48.40 | 67.10 | 82.00 | 50.50 | 60.00 | 58.27 |
| LLaMA Guard 2 | 8B | 77.80 | 51.60 | 71.80 | 90.80 | 66.50 | 71.70 | 66.99 |
| LLaMA Guard 3 | 8B | 85.07 | 44.36 | 67.84 | 87.67 | 70.80 | 71.15 | 64.97 |
| Aegis Guard Defensive | 7B | 62.20 | 59.30 | 74.70 | 52.80 | 49.10 | 59.62 | 62.79 |
| Aegis Guard Permissive | 7B | 60.80 | 55.90 | 73.80 | 60.40 | 56.40 | 61.46 | 63.55 |
| Aegis Guard 2.0 | 8B | – | – | – | 86.20 | 77.50 | – | – |
| ShieldGemma | 2B | 35.36 | 16.92 | 30.97 | 65.55 | 20.13 | 33.79 | 27.24 |
| ShieldGemma | 9B | 56.44 | 47.07 | 63.61 | 73.86 | 47.00 | 57.60 | 55.67 |
| HarmBench LLaMA | 13B | 84.30 | 60.00 | 77.10 | 64.50 | 45.70 | 66.32 | 65.49 |
| HarmBench Mistral | 7B | 87.00 | 52.40 | 75.20 | 72.00 | 60.10 | 69.34 | 66.70 |
| MD-Judge | 7B | 81.60 | 64.70 | 86.70 | 90.40 | 76.80 | 80.04 | 78.67 |
| BeaverDam | 7B | 58.40 | 72.10 | 89.90 | 83.60 | 63.40 | 73.48 | 76.60 |
| WildGuard | 7B | 86.30 | 64.20 | 84.40 | 94.70 | 75.40 | 81.00 | 77.95 |
| QwQ-preview | 32B | 69.65 | 62.76 | 77.26 | 45.95 | 17.56 | 54.64 | 57.73 |
| GuardReasoner | 1B | 84.75 | 68.39 | 85.84 | 90.12 | 74.81 | 80.78 | 79.06 |
| GuardReasoner | 3B | 85.66 | 69.02 | 86.72 | 91.36 | 79.70 | 82.49 | 80.80 |
| GuardReasoner | 8B | 85.47 | 70.04 | 87.60 | 94.34 | 78.20 | 83.13 | 81.22 |
| *Latent Reasoning Guardrail (Ours)* | | | | | | | | |
| CoLaGuard | 3B | 86.36 | 68.72 | 86.29 | 94.19 | 77.23 | 82.56 | 80.22 |
| CoLaGuard | 8B | 86.38 | 70.49 | 86.55 | 92.02 | 81.23 | 83.33 | 81.55 |

Starting from the Stage-0 checkpoint, we then train the stage-wise internalization curriculum. Since roughly 80% of GuardReasonerTrain examples contain at most six reasoning steps, we use six latent recurrent steps as the fixed inference budget. Each internalization stage replaces one additional reasoning step with latent states, and a final compression stage removes any remaining explicit reasoning for longer traces while preserving the same six-step latent budget. Each stage is trained for one epoch with a reset AdamW optimizer and a constant learning rate of $1 \times 10^{-5}$.

During internalization, we linearly anneal the fusion coefficient from $\alpha = 1.0$ to $\alpha = 0.6$ over the first 200 warm-up steps. We set the fusion temperature to 1.0, top-$p$ to 0.9, and use a fusion adapter with hidden dimension 1024. All training is conducted in bf16 precision, and checkpoints are saved after each stage.
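The annealing schedule can be written as a simple linear interpolation. This sketch assumes the step counter counts optimizer updates from zero and that $\alpha$ is held at its end value afterwards. Neither detail, nor whether the schedule restarts at each stage, is specified in the text, and the function name is hypothetical.

```python
def fusion_alpha(step, start=1.0, end=0.6, warmup_steps=200):
    """Linearly move alpha from start to end over warmup_steps, then hold it at end."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

# Alpha at the start, midpoint, and end of the warm-up window.
schedule = [fusion_alpha(s) for s in (0, 100, 200)]
```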

For the 3B model, we use Llama 3.2 3B as the backbone and set the internalization learning rate to $2 \times 10^{-5}$, which we found more stable for this scale. Following the implementation choice in Liu et al. ([2026](https://arxiv.org/html/2605.29068#bib.bib15)), we disable the fusion adapter for the 3B model because Llama 3.2 3B uses tied input-output embeddings. All other training settings are kept identical to the 8B configuration.

#### Safety Evaluation\.

To assess the performance and efficiency of our guardrail model while isolating the effect of latent reasoning against explicit rationale generation under matched supervision, we evaluate on benchmarks used byLiuet al\.\([2025](https://arxiv.org/html/2605.29068#bib.bib35)\)\(Table[1](https://arxiv.org/html/2605.29068#S3.T1)\) and use GuardReasoner \(SFT\-only, without hard\-sample DPO\) as our primary explicit reasoning baseline\. More details on these benchmarks can be found in Appendix[A](https://arxiv.org/html/2605.29068#A1)\.

We compare CoLaGuard against 20 baselines spanning closed-source APIs, open-source guard models, and our primary explicit reasoning baseline. Baseline names and model sizes are reported in Tables [2](https://arxiv.org/html/2605.29068#S3.T2) and [3](https://arxiv.org/html/2605.29068#S4.T3); corresponding references are provided in Appendix [A.2](https://arxiv.org/html/2605.29068#A1.SS2).

### 4\.2Results

#### Overall Classification Performance\.

Tables [2](https://arxiv.org/html/2605.29068#S3.T2) and [3](https://arxiv.org/html/2605.29068#S4.T3) report F1 scores on prompt and response harmfulness detection. CoLaGuard 8B is comparable to GuardReasoner 8B, with prompt macro-F1 of 84.23 vs. 84.40 and response macro-F1 of 83.33 vs. 83.13. Compared with Llama Guard 3, it improves the average macro-F1 across both tasks by 8.24 points while avoiding explicit rationale generation.
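The response macro-F1 figures quoted here can be re-derived from the per-benchmark scores in Table 3, since a macro average is the unweighted mean over benchmarks. The short check below uses only values reported in the paper; the dictionary and function names are illustrative, and treating the combined macro-F1 as the mean of the prompt and response macro averages is an assumption that happens to reproduce the 83.78 reported for CoLaGuard 8B.

```python
# Per-benchmark response F1 from Table 3, in the order
# HarmBench, SafeRLHF, BeaverTails, XSTestResponse, WildGuardTest.
RESPONSE_F1 = {
    "GuardReasoner 8B": [85.47, 70.04, 87.60, 94.34, 78.20],
    "CoLaGuard 8B": [86.38, 70.49, 86.55, 92.02, 81.23],
}


def macro_avg(scores):
    """Unweighted mean over benchmarks (macro average)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, scores in RESPONSE_F1.items():
        print(f"{name}: response macro-F1 = {macro_avg(scores):.2f}")

    # Assumption: combined macro-F1 = mean of prompt and response macro-F1.
    prompt_macro, response_macro = 84.23, 83.33
    print(f"CoLaGuard 8B combined macro-F1 = {macro_avg([prompt_macro, response_macro]):.2f}")
```

Micro averages cannot be checked the same way without per-benchmark sample counts, because they weight each benchmark by its size.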

At the benchmark level, CoLaGuard 8B achieves the best F1 on WildGuardTest for both prompt and response detection (89.44 and 81.23), and ranks second on HarmBench response and SafeRLHF (86.38 and 70.49). Its lower prompt micro-F1 relative to GuardReasoner 8B (79.77 vs. 80.83) is mainly due to ToxicChat, which accounts for 41.4% of the prompt evaluation set and therefore has a large effect on the micro average.

#### Model Size Comparison\.

CoLaGuard 3B is already competitive with GuardReasoner 3B, slightly improving both prompt macro-F1 (84.28 vs. 83.91) and response macro-F1 (82.56 vs. 82.49). Scaling to 8B mainly benefits response detection and yields better combined averages (83.78 vs. 83.42 macro; 80.66 vs. 79.86 micro), suggesting a modest but more consistent gain from the larger backbone.

Table 4: Inference Efficiency and Performance Comparison. We report inference time, completion token cost, and efficiency-adjusted F1 (EA-F1). Inference is conducted on 1× H100 (80GB) GPU. EA-F1 denotes Efficiency-Adjusted F1 (Wen et al., [2025a](https://arxiv.org/html/2605.29068#bib.bib64)), a normalized metric that jointly accounts for F1 score and inference speed, where higher values indicate better efficiency-performance trade-off.

| Metric | GuardReasoner 3B | CoLaGuard 3B | GuardReasoner 8B | CoLaGuard 8B |
|---|---|---|---|---|
| Time Cost (ms/query) | 3801.03 | 318.9 | 4407.8 | 342.0 |
| Token Cost (token/query) | 281.96 | 13.0 | 289.4 | 12.9 |
| EA-F1 | 0.2122 | 2.5041 | 0.1838 | 2.3601 |

#### Inference Efficiency\.

Table [4](https://arxiv.org/html/2605.29068#S4.T4) shows that CoLaGuard substantially reduces inference cost compared with GuardReasoner. At 8B, latency drops from 4,407.8 to 342.0 ms/query, a 12.9× speedup, while token usage decreases from 289.4 to 12.9 tokens/query, a 22.4× reduction. These gains come from replacing long autoregressive CoT generation with a fixed six-step latent recurrence. CoLaGuard also achieves much higher EA-F1 at both model sizes, showing a stronger accuracy-efficiency trade-off for deployment.
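The 8B speedup and token-reduction factors follow directly from the reported per-query costs. A small arithmetic check, with illustrative variable names:

```python
def reduction_factor(baseline: float, ours: float) -> float:
    """How many times smaller `ours` is than `baseline`."""
    return baseline / ours


# 8B figures reported in the text (GuardReasoner vs. CoLaGuard).
latency_speedup = reduction_factor(4407.8, 342.0)   # ms/query
token_reduction = reduction_factor(289.4, 12.9)     # tokens/query

if __name__ == "__main__":
    print(f"latency speedup: {latency_speedup:.1f}x")
    print(f"token reduction: {token_reduction:.1f}x")
```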

### 4\.3Ablation Studies

#### Analyzing Latent Recurrence Dynamics\.

![Refer to caption](https://arxiv.org/html/2605.29068v1/x2.png)

Figure 2: Geometric Analysis of Latent Representations. (Top) UMAP of mean harmful/unharmful trajectories across recurrence steps $h_0$–$h_5$. (Bottom) Intra-sample cosine similarity heatmap between latent steps. Vanilla Coconut shows highly similar latent states and early label separation, while CoLaGuard exhibits progressive class differentiation across recurrence steps.

Recent work questions whether latent tokens in Coconut-style recurrence perform meaningful computation beyond acting as learned placeholders. Zhang et al. ([2025](https://arxiv.org/html/2605.29068#bib.bib5)) find that vanilla Coconut tokens form clustered embeddings with limited input sensitivity, suggesting placeholder behavior from learned shortcuts. Liu et al. ([2026](https://arxiv.org/html/2605.29068#bib.bib15)) show that Context-Prediction Fusion mitigates inter-sample representational collapse, suggesting more expressive latent states. We extend this analysis to CoLaGuard through WildGuardTest latent trajectories and a full-suite CPF ablation against vanilla Coconut.

Figure [2](https://arxiv.org/html/2605.29068#S4.F2) shows the average pairwise cosine similarity between latent steps $(h_i, h_j)$ across samples and mean harmful/unharmful trajectories via UMAP (McInnes et al., [2020](https://arxiv.org/html/2605.29068#bib.bib52)). Vanilla Coconut exhibits uniformly high cross-step similarity, consistent with early commitment to a fixed latent state that is simply propagated forward; its harmful and unharmful trajectories are already separated at $h_0$, with limited additional separation in later steps. In contrast, CoLaGuard shows noticeably lower cross-step similarity, indicating that its latent states continue to evolve throughout the recurrence rather than collapsing after the initial step. Its trajectories begin closer together and diverge progressively, suggesting that recurrence contributes to the refinement of safety-relevant representations rather than simply preserving an early decision.
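The cross-step similarity heatmap can be computed along the following lines. This is a minimal sketch, assuming the latent states have been collected into an array of shape `(num_samples, num_steps, hidden_dim)`. The function name `mean_step_similarity` and the random demo data are illustrative and not taken from the paper.

```python
import numpy as np


def mean_step_similarity(latents: np.ndarray) -> np.ndarray:
    """Average intra-sample cosine similarity between latent steps.

    latents: array of shape (num_samples, num_steps, hidden_dim).
    Returns a (num_steps, num_steps) matrix whose (i, j) entry is the
    cosine similarity between h_i and h_j, averaged over samples.
    """
    norms = np.linalg.norm(latents, axis=-1, keepdims=True)
    unit = latents / np.clip(norms, 1e-12, None)      # guard against zero vectors
    sims = np.einsum("nid,njd->nij", unit, unit)      # per-sample step-by-step cosines
    return sims.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(32, 6, 64))               # synthetic stand-in for h_0..h_5
    heatmap = mean_step_similarity(demo)
    print(heatmap.shape)
    print(np.round(heatmap, 2))
```

A matrix with off-diagonal entries near 1 would correspond to the collapsed, propagated-forward pattern described for vanilla Coconut, while lower off-diagonal values indicate states that keep changing across steps.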

As an ablation of Context-Prediction Fusion, a vanilla Coconut guardrail with the same six-step latent budget reaches 81.82 combined macro-F1 and 79.78 combined micro-F1, compared with 83.78 and 80.72 for CoLaGuard. Context-Prediction Fusion thus yields clear gains (+1.96 macro-F1, +0.94 micro-F1) that bring CoLaGuard to parity with the explicit reasoning baseline, suggesting that the more progressive latent shifts in Figure [2](https://arxiv.org/html/2605.29068#S4.F2) may be relevant to downstream moderation performance.

#### Scaling Training Data\.

![Refer to caption](https://arxiv.org/html/2605.29068v1/x3.png)

Figure 3: Training Data Scaling. CoLaGuard 8B prompt and response macro-F1 across training data sizes.

Figure [3](https://arxiv.org/html/2605.29068#S4.F3) shows that CoLaGuard 8B improves consistently with more reasoning-augmented training data. Response macro-F1 improves sharply from 8k to 30k examples (+1.97 points) but shows limited additional gain at 127k (+0.19 points). Prompt macro-F1 increases more gradually, gaining 0.53 points from 8k to 30k and 1.12 points from 30k to 127k.

These trends suggest that response moderation benefits earlier from diverse supervision, while prompt moderation continues to improve at larger scale. Overall, the results show that CoLaGuard scales reliably with training data and achieves its best performance with the full GuardReasonerTrain corpus.

## 5Conclusion

We introduced CoLaGuard, a latent reasoning guardrail that internalizes explicit safety reasoning through a stage-wise curriculum. Across prompt and response harmfulness detection benchmarks, CoLaGuard matches the average macro-F1 of an explicit reasoning guardrail while substantially reducing inference cost. CoLaGuard 8B matches GuardReasoner 8B in macro-F1 while achieving 12.9× lower latency and 22.4× fewer tokens. These results show that latent reasoning is a practical path toward safety guardrails that are both robust and efficient for deployment.

## Limitations

While CoLaGuard demonstrates strong efficiency and competitive safety performance, several limitations remain. First, our evaluation focuses on text-based prompt and response harmfulness detection, leaving broader policy taxonomies, multilingual inputs, multimodal content, and long-horizon agent behavior for future work. Second, CoLaGuard is trained from distilled reasoning traces and may inherit biases or coverage gaps from the underlying supervision. Finally, although our latent representation analysis suggests progressive safety-relevant refinement, more causal interventions are needed to fully characterize how each latent step contributes to the final decision and thereby improve the interpretability of safety decisions.

## Ethics Statement

The aim of this work is to improve the reliability and efficiency of LLM safety guardrails. While latent reasoning moderation may make strong safety filters more practical in high-traffic settings, these guardrails can still produce false positives and false negatives on ambiguous or context-dependent inputs. Therefore, CoLaGuard itself should not be considered a replacement for human oversight in real-world deployment, but rather should be used as part of a broader moderation system. The safety data used in the evaluation and training processes may contain harmful or sensitive content and should be handled with appropriate access controls and annotator-care practices.

## References

- Hopping too late: exploring the limitations of large language models on multi\-hop queries\.External Links:2406\.12775,[Link](https://arxiv.org/abs/2406.12775)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Chen, A\. Zhao, H\. Xia, X\. Lu, H\. Wang, Y\. Chen, W\. Zhang, J\. Wang, W\. Li, and X\. Shen \(2025\)Reasoning beyond language: a comprehensive survey on latent chain\-of\-thought reasoning\.External Links:2505\.16782,[Link](https://arxiv.org/abs/2505.16782)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Cheng and B\. V\. Durme \(2024\)Compressed chain of thought: efficient reasoning through dense representations\.External Links:2412\.13171,[Link](https://arxiv.org/abs/2412.13171)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Deng, Y\. Choi, and S\. Shieber \(2025\)From explicit CoT to implicit CoT: learning to internalize CoT step by step\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fRPmc94QeH)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p3.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1),[§3\.5](https://arxiv.org/html/2605.29068#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p2.1)\.
- Y\. Deng, K\. Prasad, R\. Fernandez, P\. Smolensky, V\. Chaudhary, and S\. Shieber \(2023\)Implicit chain of thought reasoning via knowledge distillation\.External Links:2311\.01460,[Link](https://arxiv.org/abs/2311.01460)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1),[§3\.5](https://arxiv.org/html/2605.29068#S3.SS5.p1.1)\.
- S\. Ghosh, P\. Varshney, E\. Galinkin, and C\. Parisien \(2024\)AEGIS: online adaptive ai content safety moderation with ensemble of llm experts\.External Links:2404\.05993,[Link](https://arxiv.org/abs/2404.05993)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p4.1),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.8.2),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.9.2),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.5.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Ghosh, P\. Varshney, M\. N\. Sreedhar, A\. Padmakumar, T\. Rebedea, J\. R\. Varghese, and C\. Parisien \(2025\)AEGIS2\.0: a diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Albuquerque, New Mexico,pp\. 5992–6026\.External Links:[Link](https://aclanthology.org/2025.naacl-long.306/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.306)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.10.2)\.
- S\. Goyal, Z\. Ji, A\. S\. Rawat, A\. K\. Menon, S\. Kumar, and V\. Nagarajan \(2024\)Think before you speak: training language models with pause tokens\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ph04CRkPdC)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Grattafioriet al\.\(2024\)The Llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.7.2)\.
- S\. Han, K\. Rao, A\. Ettinger, L\. Jiang, B\. Y\. Lin, N\. Lambert, Y\. Choi, and N\. Dziri \(2024\)Wildguard: open one\-stop moderation tools for safety risks, jailbreaks, and refusals of llms\.Advances in neural information processing systems37,pp\. 8093–8131\.Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p2.1),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.12.2),[§1](https://arxiv.org/html/2605.29068#S1.p1.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.13.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.7.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. E\. Weston, and Y\. Tian \(2025\)Training large language model to reason in a continuous latent space\.External Links:[Link](https://openreview.net/forum?id=tG4SgayTtk)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p3.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2605.29068#S3.SS4.p2.7),[§3\.5](https://arxiv.org/html/2605.29068#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p2.1)\.
- C\. Hsieh, C\. Li, C\. Yeh, H\. Nakhost, Y\. Fujii, A\. Ratner, R\. Krishna, C\. Lee, and T\. Pfister \(2023\)Distilling step\-by\-step\! outperforming larger language models with less training data and smaller model sizes\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 8003–8017\.External Links:[Link](https://aclanthology.org/2023.findings-acl.507/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.507)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.29068#S3.SS2.p1.1)\.
- H\. Inan, K\. Upasani, J\. Chi, R\. Rungta, K\. Iyer, Y\. Mao, M\. Tontchev, Q\. Hu, B\. Fuller, D\. Testuggine, and M\. Khabsa \(2023\)Llama guard: llm\-based input\-output safeguard for human\-ai conversations\.External Links:2312\.06674,[Link](https://arxiv.org/abs/2312.06674)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.5.2),[§1](https://arxiv.org/html/2605.29068#S1.p1.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Ji, D\. Hong, B\. Zhang, B\. Chen, J\. Dai, B\. Zheng, T\. Qiu, J\. Zhou, K\. Wang, B\. Li, S\. Han, Y\. Guo, and Y\. Yang \(2024\)PKU\-SafeRLHF: towards multi\-level safety alignment for LLMs with human preference\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/2406.15513)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p7.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.10.1)\.
- J\. Ji, M\. Liu, J\. Dai, X\. Pan, C\. Zhang, C\. Bian, B\. Chen, R\. Sun, Y\. Wang, and Y\. Yang \(2023\)BeaverTails: towards improved safety alignment of LLM via a human\-preference dataset\.InThirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=g0QovXbFw3)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p8.1),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.17.2),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.11.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p1.1)\.
- M\. Kang and B\. Li \(2025\)$R^2$\-guard: robust reasoning enabled LLM guardrail via knowledge\-enhanced logical reasoning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=CkgKSqZbuC)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Kim, S\. J\. Joo, D\. Kim, J\. Jang, S\. Ye, J\. Shin, and M\. Seo \(2023\)The cot collection: improving zero\-shot and few\-shot learning of language models via chain\-of\-thought fine\-tuning\.External Links:2305\.14045,[Link](https://arxiv.org/abs/2305.14045)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.29068#S3.SS2.p1.1)\.
- L\. Li, B\. Dong, R\. Wang, X\. Hu, W\. Zuo, D\. Lin, Y\. Qiao, and J\. Shao \(2024\)SALAD\-bench: a hierarchical and comprehensive safety benchmark for large language models\.External Links:2402\.05044,[Link](https://arxiv.org/abs/2402.05044)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.16.2)\.
- Z\. Lin, Z\. Wang, Y\. Tong, Y\. Wang, Y\. Guo, Y\. Wang, and J\. Shang \(2023\)ToxicChat: unveiling hidden challenges of toxicity detection in real\-world user\-ai conversation\.External Links:2310\.17389,[Link](https://arxiv.org/abs/2310.17389)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p3.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.3.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p1.1)\.
- W\. Liu, D\. Min, and L\. Cheng \(2026\)Latent thoughts tuning: bridging context and reasoning with fused information in latent tokens\.External Links:2602\.10229,[Link](https://arxiv.org/abs/2602.10229)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p3.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2605.29068#S3.SS4.p3.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px2.p4.1),[§4\.3](https://arxiv.org/html/2605.29068#S4.SS3.SSS0.Px1.p1.1)\.
- Y\. Liu, H\. Gao, S\. Zhai, Y\. He, J\. Xia, Z\. Hu, Y\. Chen, X\. Yang, J\. Zhang, S\. Z\. Li, H\. Xiong, and B\. Hooi \(2025\)GuardReasoner: towards reasoning\-based llm safeguards\.External Links:2501\.18492,[Link](https://arxiv.org/abs/2501.18492)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.18.2),[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.29068#S3.SS2.p1.1),[§3\.3](https://arxiv.org/html/2605.29068#S3.SS3.p1.6),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px3.p1.1)\.
- Llama Team \(2024\)Meta Llama guard 2\.Note:[https://github\.com/meta\-llama/PurpleLlama/blob/main/Llama\-Guard2/MODEL\_CARD\.md](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.6.2)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.External Links:1711\.05101,[Link](https://arxiv.org/abs/1711.05101)Cited by:[§4\.1](https://arxiv.org/html/2605.29068#S4.SS1.SSS0.Px2.p1.3)\.
- T\. Markov, C\. Zhang, S\. Agarwal, T\. Eloundou, T\. Lee, S\. Adler, A\. Jiang, and L\. Weng \(2023\)A holistic approach to undesired content detection in the real world\.External Links:2208\.03274,[Link](https://arxiv.org/abs/2208.03274)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p6.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.4.1)\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks \(2024\)HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\.External Links:2402\.04249,[Link](https://arxiv.org/abs/2402.04249)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p5.1),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.14.2),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.15.2),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.6.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.9.1)\.
- L\. McInnes, J\. Healy, and J\. Melville \(2020\)UMAP: uniform manifold approximation and projection for dimension reduction\.External Links:1802\.03426,[Link](https://arxiv.org/abs/1802.03426)Cited by:[§4\.3](https://arxiv.org/html/2605.29068#S4.SS3.SSS0.Px1.p2.2)\.
- NVIDIA \(2025\)Nemotron Content Safety Reasoning 4B\.Note:[https://huggingface\.co/nvidia/Nemotron\-Content\-Safety\-Reasoning\-4B](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2024\)OpenAI o1 system card\.Note:[https://openai\.com/index/openai\-o1\-system\-card/](https://openai.com/index/openai-o1-system-card/)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.2.2),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.3.2),[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.4.2)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.External Links:2203\.02155,[Link](https://arxiv.org/abs/2203.02155)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p1.1)\.
- J\. Pfau, W\. Merrill, and S\. R\. Bowman \(2024\)Let’s think dot by dot: hidden computation in transformer language models\.External Links:2404\.15758,[Link](https://arxiv.org/abs/2404.15758)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- Qwen Team \(2024\)QwQ: reflect deeply on the boundaries of the unknown\.Note:[https://qwenlm\.github\.io/blog/qwq\-32b\-preview/](https://qwenlm.github.io/blog/qwq-32b-preview/)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.13.2)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=HPuSIXJaa9)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p1.1)\.
- T\. Rebedea, R\. Dinu, M\. Sreedhar, C\. Parisien, and J\. Cohen \(2023\)NeMo guardrails: a toolkit for controllable and safe llm applications with programmable rails\.External Links:2310\.10501,[Link](https://arxiv.org/abs/2310.10501)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Röttger, H\. Kirk, B\. Vidgen, G\. Attanasio, F\. Bianchi, and D\. Hovy \(2024\)XSTest: a test suite for identifying exaggerated safety behaviours in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 5377–5400\.External Links:[Link](https://aclanthology.org/2024.naacl-long.301/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.301)Cited by:[§A\.1](https://arxiv.org/html/2605.29068#A1.SS1.p9.1),[Table 1](https://arxiv.org/html/2605.29068#S3.T1.1.12.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1)\.
- M\. N\. Sreedhar, T\. Rebedea, and C\. Parisien \(2025\)Safety through reasoning: an empirical study of reasoning guardrail models\.External Links:2505\.20087,[Link](https://arxiv.org/abs/2505.20087)Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.29068#S3.SS2.p1.1)\.
- X\. Wen, W\. J\. Mo, Y\. Xie, P\. Qi, and M\. Chen \(2025a\)Towards policy\-compliant agents: learning efficient guardrails for policy violation detection\.arXiv preprint arXiv:2510\.03485\.Cited by:[Table 4](https://arxiv.org/html/2605.29068#S4.T4)\.
- X\. Wen, W\. Zhou, W\. Jacky Mo, and M\. Chen \(2025b\)THINKGUARD: deliberative slow thinking leads to cautious guardrails\.arXiv preprint arXiv:2502\.13458\.Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.29068#S3.SS2.p1.1),[§3\.3](https://arxiv.org/html/2605.29068#S3.SS3.p1.6)\.
- Y\. Yang, S\. Dan, S\. Li, D\. Roth, and I\. Lee \(2025\)MrGuard: a multilingual reasoning guardrail for universal LLM safety\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 27377–27396\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1392/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1392),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2605.29068#S1.p2.1),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Yuan, Z\. Xiong, Y\. Zeng, N\. Yu, R\. Jia, D\. Song, and B\. Li \(2024\)RigorLLM: resilient guardrails for large language models against undesired content\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Zelikman, G\. R\. Harik, Y\. Shao, V\. Jayasiri, N\. Haber, and N\. Goodman \(2024\)Quiet\-STar: language models can teach themselves to think before speaking\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=oRXPiSOGH9)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Zeng, Y\. Liu, R\. Mullins, L\. Peran, J\. Fernandez, H\. Harkous, K\. Narasimhan, D\. Proud, P\. Kumar, B\. Radharapu, O\. Sturman, and O\. Wahltinez \(2024\)ShieldGemma: generative ai content moderation based on gemma\.External Links:2407\.21772,[Link](https://arxiv.org/abs/2407.21772)Cited by:[Table 5](https://arxiv.org/html/2605.29068#A1.T5.1.11.2),[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, B\. Tang, T\. Ju, S\. Duan, and G\. Liu \(2025\)Do latent tokens think? a causal and adversarial analysis of chain\-of\-continuous\-thought\.External Links:2512\.21711,[Link](https://arxiv.org/abs/2512.21711)Cited by:[§4\.3](https://arxiv.org/html/2605.29068#S4.SS3.SSS0.Px1.p1.1)\.
- H\. Zhao, C\. Yuan, F\. Huang, X\. Hu, Y\. Zhang, A\. Yang, B\. Yu, D\. Liu, J\. Zhou, J\. Lin, B\. Yang, C\. Cheng, J\. Tang, J\. Jiang, J\. Zhang, J\. Xu, M\. Yan, M\. Sun, P\. Zhang, P\. Xie, Q\. Tang, Q\. Zhu, R\. Zhang, S\. Wu, S\. Zhang, T\. He, T\. Tang, T\. Xia, W\. Liao, W\. Shen, W\. Yin, W\. Zhou, W\. Yu, X\. Wang, X\. Deng, X\. Xu, X\. Zhang, Y\. Liu, Y\. Li, Y\. Zhang, Y\. Jiang, Y\. Wan, and Y\. Zhou \(2025\)Qwen3Guard technical report\.External Links:2510\.14276,[Link](https://arxiv.org/abs/2510.14276)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Zhu, T\. Peng, T\. Cheng, X\. Qu, J\. Huang, D\. Zhu, H\. Wang, K\. Xue, X\. Zhang, Y\. Shan, T\. Cai, T\. Kergan, A\. Kembay, A\. Smith, C\. Lin, B\. Nguyen, Y\. Pan, Y\. Chou, Z\. Cai, Z\. Wu, Y\. Zhao, T\. Liu, J\. Yang, W\. Zhou, C\. Zheng, C\. Li, Y\. Zhou, Z\. Li, Z\. Zhang, J\. Liu, G\. Zhang, W\. Huang, and J\. Eshraghian \(2025\)A survey on latent reasoning\.External Links:2507\.06203,[Link](https://arxiv.org/abs/2507.06203)Cited by:[§2](https://arxiv.org/html/2605.29068#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix ASafety Evaluation

### A\.1Description of Benchmarks

To assess the performance and efficiency of our latent reasoning guardrail model, we evaluate it across eight unique safety\-related benchmarks\.

*WildGuard* (Han et al., [2024](https://arxiv.org/html/2605.29068#bib.bib1)): WildGuardMix is a large-scale safety moderation dataset with 92,000 labeled examples covering both normal and adversarial prompts, paired with corresponding refusal and compliance responses. The WildGuardTest split is human-annotated and contains 5,000 safety-labeled examples.

*ToxicChat* (Lin et al., [2023](https://arxiv.org/html/2605.29068#bib.bib38)): ToxicChat is a benchmark of 10,000 real user queries, used as adversarial prompts for testing content moderation and toxicity detection in human-AI interactions.

*Aegis Safety Test 1.0* (Ghosh et al., [2024](https://arxiv.org/html/2605.29068#bib.bib16)): A dataset of approximately 11,000 manually annotated examples, Aegis Safety Test 1.0 was curated to test LLM safety alignment against NVIDIA's content safety taxonomy.

*HarmBench* (Mazeika et al., [2024](https://arxiv.org/html/2605.29068#bib.bib39)): HarmBench is a framework designed to address the lack of standardized evaluation in automated red teaming. By leveraging various behaviors, it can be used to generate red-teaming test cases for evaluating the adversarial robustness of LLMs.

*OpenAI Moderation* (Markov et al., [2023](https://arxiv.org/html/2605.29068#bib.bib40)): A benchmark for assessing LLMs' ability to detect harmful content based on OpenAI's safety guidelines, covering violence, self-harm, and misinformation.

*SafeRLHF* (Ji et al., [2024](https://arxiv.org/html/2605.29068#bib.bib14)): SafeRLHF is a dataset of 82,000 questions with two responses each. Every entry includes safety meta-labels as well as a preference between the two responses.

*BeaverTails* (Ji et al., [2023](https://arxiv.org/html/2605.29068#bib.bib30)): The BeaverTails dataset was introduced to further research on safety alignment in LLMs. The complete dataset includes over 300,000 question-answer pairs annotated with safety meta-labels and the corresponding violated safety categories.

*XSTest* (Röttger et al., [2024](https://arxiv.org/html/2605.29068#bib.bib41)): Developed to evaluate refusal behaviors and identify systematic failure modes in large language models, XSTest comprises 250 safe prompts across ten prompt types and 200 contrasting unsafe prompts that human-aligned models should refuse.

### A\.2Baseline Details

Details are presented in Table[5](https://arxiv.org/html/2605.29068#A1.T5)\.

Table 5: Baseline references and model sizes for Tables [2](https://arxiv.org/html/2605.29068#S3.T2) and [3](https://arxiv.org/html/2605.29068#S4.T3).

| Baseline | Reference | Model Size |
|---|---|---|
| GPT-4o | OpenAI ([2024](https://arxiv.org/html/2605.29068#bib.bib7)) | Unknown |
| GPT-4o + CoT | OpenAI ([2024](https://arxiv.org/html/2605.29068#bib.bib7)) | Unknown |
| o1-preview | OpenAI ([2024](https://arxiv.org/html/2605.29068#bib.bib7)) | Unknown |
| LLaMA Guard | Inan et al. ([2023](https://arxiv.org/html/2605.29068#bib.bib28)) | 7B |
| LLaMA Guard 2 | Llama Team ([2024](https://arxiv.org/html/2605.29068#bib.bib10)) | 8B |
| LLaMA Guard 3 | Grattafiori et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib11)) | 8B |
| Aegis Guard Defensive | Ghosh et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib16)) | 7B |
| Aegis Guard Permissive | Ghosh et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib16)) | 7B |
| Aegis Guard 2.0 | Ghosh et al. ([2025](https://arxiv.org/html/2605.29068#bib.bib12)) | 8B |
| ShieldGemma | Zeng et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib29)) | 2B / 9B |
| WildGuard | Han et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib1)) | 7B |
| QwQ-preview | Qwen Team ([2024](https://arxiv.org/html/2605.29068#bib.bib13)) | 32B |
| HarmBench LLaMA | Mazeika et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib39)) | 13B |
| HarmBench Mistral | Mazeika et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib39)) | 7B |
| MD-Judge | Li et al. ([2024](https://arxiv.org/html/2605.29068#bib.bib31)) | 7B |
| BeaverDam | Ji et al. ([2023](https://arxiv.org/html/2605.29068#bib.bib30)) | 7B |
| GuardReasoner | Liu et al. ([2025](https://arxiv.org/html/2605.29068#bib.bib35)) | 1B / 3B / 8B |