AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue
Summary
Introduces AERIC, a lightweight hidden-state monitoring method for detecting implicit harmful content in LLM dialogue without extra forward passes, achieving improved AUROC over strong baselines with minimal latency overhead.
Cached at: 05/26/26, 08:59 AM
# AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue
Source: [https://arxiv.org/html/2605.23974](https://arxiv.org/html/2605.23974)
Jihyung Park, Saleh Afroogh, Junfeng Jiao · The University of Texas at Austin · {jihyung803, saleh.afroogh}@utexas.edu, jjiao@austin.utexas.edu
###### Abstract
Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator’s own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
These results support a focused claim: explicit harmful supervision can transfer to implicit harm monitoring under a strict no\-extra\-forward\-pass constraint, yielding a practical pre\-exposure risk signal even against a strong native streaming guard, although intervention policy remains an open systems problem\.
## 1 Introduction
Current LLM safety faces two partly distinct challenges\. First, safeguards are typically strongest when judging fully observed text, whereas streaming deployment requires safety decisions over partial generations and therefore benefits from early intervention during decoding\[[12](https://arxiv.org/html/2605.23974#bib.bib2),[10](https://arxiv.org/html/2605.23974#bib.bib3),[26](https://arxiv.org/html/2605.23974#bib.bib4)\]\. Second, harmfulness is not always explicit: unsafe behavior can be covert, context\-sensitive, or only implicit in the trajectory of a response, especially in dialogue and advice settings\[[16](https://arxiv.org/html/2605.23974#bib.bib5),[24](https://arxiv.org/html/2605.23974#bib.bib6),[21](https://arxiv.org/html/2605.23974#bib.bib7),[25](https://arxiv.org/html/2605.23974#bib.bib9),[14](https://arxiv.org/html/2605.23974#bib.bib15),[13](https://arxiv.org/html/2605.23974#bib.bib28)\]\. Existing guards such as ShieldGemma and WildGuard are strong response\-level moderators, but their default role is retrospective: they judge whether a prompt or a completed response is unsafe after enough text has already been produced\. Even when moderation is moved closer to token time, detection often still depends on running an additional guard over growing prefixes, or on waiting until the generated text itself becomes sufficiently explicit to classify\. In both cases, the system reacts after the unsafe trajectory has already become legible on the surface\.
This paper studies a stricter objective: anticipatory same\-pass monitoring\. In streaming generation, safety decisions must be made over partial outputs rather than only after the full response is available, which pushes moderation toward fine\-grained checks during decoding\[[29](https://arxiv.org/html/2605.23974#bib.bib10),[20](https://arxiv.org/html/2605.23974#bib.bib11)\]\. But frequent safety checking can itself become a systems bottleneck: repeated intervention during generation increases inference cost and latency unless the monitoring signal is kept lightweight\[[27](https://arxiv.org/html/2605.23974#bib.bib14)\]\. We therefore focus on a same\-pass setting in which the monitor reads hidden states already produced by the generator during ordinary decoding, rather than repeatedly invoking a separate generative guard\. This follows a broader shift toward deriving safety signals from internal representations to support real\-time monitoring at lower cost\[[9](https://arxiv.org/html/2605.23974#bib.bib13)\]\.
The harder challenge is that harmfulness is often implicit\. In many dialogue settings, danger is not carried by a single explicit toxic phrase\. A response may remain polite, supportive in tone, or locally reasonable while still drifting toward self\-harm encouragement, unsafe medical guidance, exploitative escalation, or other context\-dependent failures\. This is especially true in dialogue safety and advice settings, where the same words can play very different roles depending on the prompt and the conversational context\. Looking only at the visible prefix, or only at the model’s top next\-token preferences, often answers the wrong question\. Those signals say what the model is likely to say next locally, but not whether the continuation has already begun to move toward an unsafe region of behavior\[[16](https://arxiv.org/html/2605.23974#bib.bib5),[24](https://arxiv.org/html/2605.23974#bib.bib6),[25](https://arxiv.org/html/2605.23974#bib.bib9)\]\.
Our approach starts from the observation that the model’s hidden states can contain forward-looking information about continuation that is not fully recoverable from surface text alone. We introduce AERIC, Anticipatory Evidence and Residual Inference for Continuations, a same-pass hidden-state monitor for implicit harmful dialogue. AERIC combines three signals that correspond directly to the failure modes above. The first is a future-hazard head that predicts whether harmful continuation is about to begin within a short horizon. This is the mechanism that lets the monitor act before the harmful content becomes explicit on screen. The second is a support head that models counterevidence for safe, bounded, supportive, or de-escalatory continuation. This matters because emotionally intense language is not necessarily unsafe, and a monitor that only looks for hazard tends to overfire on difficult but still appropriate assistance. The third is a paired residual head that measures prompt-conditioned unsafe drift. Instead of asking only whether the current text looks risky in isolation, it asks whether the hidden-state trajectory has deviated toward an unsafe continuation relative to safe behavior for the same kind of prompt. The resulting raw score is deployed through a same-pass exponential moving average decision rule, which preserves prefix-measurability while stabilizing online triggers.
Across two generator families, this framing produces a consistent picture: AERIC reaches AUROC 0.7143 on DiaSafety and 0.8582 on Harmful Advice with Qwen/Qwen3-8B, and 0.7181 and 0.8287, respectively, with google/gemma-4-E4B-it. Qwen3Guard-Stream-4B is the strongest native streaming baseline in our comparison, yet AERIC remains above it in AUROC on both balanced targets. AUPRC margins are smaller and overlap in some comparisons, so we report both ranking metrics explicitly. On prompt-only harmful request suites evaluated under a source-side 10% safe-trigger budget, the monitor remains actionable before exposure. On HarmBench DirectRequest it reaches a trigger@64 of 0.6438 with Qwen and 0.4656 with Gemma, while on SocialHarmBench it reaches 0.6849 and 0.7363, respectively. It also does so with low overhead. On a 63-prompt harmful-prompt fixed-generation aggregate over HarmBench DirectRequest and SocialHarmBench under Qwen/Qwen3-8B, the same-pass monitor increases mean latency by only 2.34%. The corresponding overheads are 79.40% for Qwen3Guard-Stream-4B, 158.73% for prefixized ShieldGemma-9B, and 216.74% for prefixized WildGuard.
Existing guards are usually best at deciding whether harm is already present in completed or sufficiently explicit text, whereas our goal is to detect harmful continuation before exposure. Existing moderation signals also rely heavily on surface form, whereas our target is implicit harmful drift that depends on prompt context and discourse role. AERIC shows that explicit harmful supervision can be repurposed into an anticipatory same-pass monitor for implicit harmful dialogue by combining short-horizon hazard forecasting, support-sensitive suppression, prompt-conditioned residual scoring, and EMA-smoothed online triggering. This gives a practical pre-exposure signal under a strict no-extra-forward-pass constraint, even though the downstream intervention policy remains a separate systems problem.¹
¹ To support reproducibility, code and evaluation scripts will be released with the camera-ready version.
## 2 Related Work
#### Implicit harmfulness\.
A central difficulty in language\-model safety is that harmfulness is often implicit rather than explicit\. Prior work argues that harmful text is not a single surface\-level phenomenon and should be characterized along multiple dimensions, including context and pragmatic effect\[[22](https://arxiv.org/html/2605.23974#bib.bib31)\]\. Work on covertly unsafe text and implicit toxicity further shows that dangerous or toxic meaning may not be reducible to obvious keywords, known surface patterns, slurs, or overtly toxic phrases\[[16](https://arxiv.org/html/2605.23974#bib.bib5),[7](https://arxiv.org/html/2605.23974#bib.bib30),[25](https://arxiv.org/html/2605.23974#bib.bib9)\]\. In dialogue and advice settings, unsafe meaning may depend less on a single toxic phrase than on conversational context, discourse role, or the likely direction of the continuation\[[24](https://arxiv.org/html/2605.23974#bib.bib6),[21](https://arxiv.org/html/2605.23974#bib.bib7)\]\. DiaSafety\[[24](https://arxiv.org/html/2605.23974#bib.bib6)\]is a representative benchmark for this setting, since safety judgments often depend on contextual interpretation rather than explicit wording alone\. Harmful Advice\[[14](https://arxiv.org/html/2605.23974#bib.bib15),[13](https://arxiv.org/html/2605.23974#bib.bib28)\]provides a complementary advice\-oriented target in which harmfulness can remain locally subtle even when the overall recommendation is unsafe\. Prior work on hidden\-state probing further motivates our approach by showing that internal representations can reveal latent properties such as deception and future continuation structure before those properties are fully visible in surface text\[[1](https://arxiv.org/html/2605.23974#bib.bib18),[18](https://arxiv.org/html/2605.23974#bib.bib17)\]\. 
We build on that premise, but shift it to a transfer setting in which the goal is not simply to read out latent attributes, but to forecast unsafe continuation in implicit harmful dialogue from prefix hidden states\.
#### LLM safeguards\.
A separate line of work studies safeguards more directly. ShieldGemma ([https://huggingface.co/google/shieldgemma-9b](https://huggingface.co/google/shieldgemma-9b)) \[[28](https://arxiv.org/html/2605.23974#bib.bib34)\] and WildGuard ([https://huggingface.co/allenai/wildguard](https://huggingface.co/allenai/wildguard)) \[[6](https://arxiv.org/html/2605.23974#bib.bib20)\] are representative response-level guards that classify harmful prompts, harmful responses, and refusal behavior after text has already been produced. They are strong and practical baselines, and we compare against them directly, but their default operating mode is still retrospective. Recent work has therefore pushed moderation closer to token time. Qwen3Guard-Stream-4B ([https://huggingface.co/Qwen/Qwen3Guard-Stream-4B](https://huggingface.co/Qwen/Qwen3Guard-Stream-4B)) \[[31](https://arxiv.org/html/2605.23974#bib.bib21)\] is especially relevant in our setting because it is a native streaming guard, while NExT-Guard \[[4](https://arxiv.org/html/2605.23974#bib.bib22)\], HIDDENGUARD \[[17](https://arxiv.org/html/2605.23974#bib.bib23)\], and Kelp \[[11](https://arxiv.org/html/2605.23974#bib.bib24)\] all move safety monitoring toward streaming or latent-state signals. Our work is aligned with that trajectory, but differs in two ways. First, we focus specifically on implicit harmful dialogue rather than broad harmfulness detection. Second, we impose a same-pass constraint: the monitor may read the generator’s own hidden states during ordinary decoding, but may not invoke an additional forward pass through the base model.
On the supervision side, FineHarm\[[12](https://arxiv.org/html/2605.23974#bib.bib2)\]is useful because it provides onset\-sensitive labels over explicit harmful continuation rather than only response\-level labels, while Anthropic HH\-RLHF\[[2](https://arxiv.org/html/2605.23974#bib.bib26)\]and Anthropic red\-teaming data\[[5](https://arxiv.org/html/2605.23974#bib.bib27)\]provide matched safe and unsafe behavior that we use for support\-style supervision and prompt\-conditioned residual learning\.
## 3 Method
Figure 1: Overview of AERIC. During ordinary decoding, the frozen generator produces a current hidden state $h_t$ and a cached prompt representation $p$. AERIC reads these already-computed states, computes future-hazard, support, and paired-residual scores, and applies EMA smoothing to produce an online trigger signal.
### 3.1 Same-Pass Monitoring Setup
Let a frozen causal language model generate tokens $y_1,\ldots,y_T$ conditioned on a prompt $x$. At decode step $t$, the model produces a hidden state $h_t\in\mathbb{R}^{d}$ for the current prefix $(x,y_{<t})$. We also cache a prompt summary $p\in\mathbb{R}^{d}$ from prompt-side hidden states before response decoding begins. A monitor is same-pass if its decision at time $t$ depends only on quantities already available from this decode trajectory, such as $h_t$, $p$, and previous monitor values. It may not invoke an additional forward pass through the base model or repeatedly call a separate generative guard. This pipeline is depicted in Figure [1](https://arxiv.org/html/2605.23974#S3.F1).
This constraint defines the deployment setting we care about. The monitor must be prefix-measurable, available before end-of-sequence, and cheap enough to run during ordinary decoding. In the default AERIC monitor, hidden states are projected to a 128-dimensional representation and standardized. The trainable part of the monitor consists only of three scalar heads, described below.
### 3.2 Anticipatory Evidence and Prompt-Conditioned Drift
AERIC combines three signals. The first signal is future hazard. Instead of training a standard unsafe-prefix classifier that labels every prefix of an unsafe response as unsafe, we use onset-aware supervision from FineHarm \[[12](https://arxiv.org/html/2605.23974#bib.bib2)\]. Let $o$ denote the harmful onset index derived from span or sentence annotations. For a horizon $H$, the hazard label at prefix $t$ is
$$z_t^{\mathrm{haz}}=\mathbb{1}[\,0\leq o-t\leq H\,]. \tag{1}$$
The future-hazard score is
$$f_t=w_f^{\top}\tilde{h}_t+b_f, \tag{2}$$
where $\tilde{h}_t$ is the projected and standardized hidden state. This head asks whether the current trajectory is close to entering harmful continuation, rather than whether the visible prefix already looks unsafe.
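To make Equations (1) and (2) concrete, the following is a minimal sketch of how onset-aware hazard labels and a scalar linear head could be computed. The function names, the 0-based prefix indexing, the convention that a response without a harmful span has `onset=None`, and the example horizon of 3 are all illustrative assumptions, not details taken from the paper.

```python
def hazard_labels(num_prefixes, onset, horizon):
    """Return z_t = 1[0 <= onset - t <= horizon] for t = 0..num_prefixes-1.

    Assumption: `onset` is None when the response has no harmful span,
    in which case every prefix gets label 0.
    """
    if onset is None:
        return [0] * num_prefixes
    return [1 if 0 <= onset - t <= horizon else 0 for t in range(num_prefixes)]


def linear_head(weights, bias, h_tilde):
    """Scalar score w^T h_tilde + b for one projected, standardized state."""
    return sum(w * x for w, x in zip(weights, h_tilde)) + bias


# Prefixes within `horizon` steps before the onset (and the onset itself)
# are positive; earlier and post-onset prefixes are negative.
labels = hazard_labels(num_prefixes=10, onset=6, horizon=3)
score = linear_head([1.0, -2.0], 0.5, [3.0, 1.0])
```

Under this labeling, only a short window ending at the onset is positive, which is what separates the head from a classifier that marks every prefix of an unsafe response as unsafe.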
The second signal is support evidence. A hazard-only monitor can overfire on emotionally intense but still bounded responses, especially in advice, refusal, or de-escalation settings. We therefore train a support head
$$c_t=w_c^{\top}\tilde{h}_t+b_c \tag{3}$$
whose positive class consists of safe, supportive, bounded, or de-escalatory continuations, and whose negative class consists of imminent or post-onset unsafe continuations. In practice, the support-positive pool mixes safe auxiliary rows with harmless responses to harmful prompts from Anthropic HH-style data \[[2](https://arxiv.org/html/2605.23974#bib.bib26)\]. This head acts as counterevidence: it suppresses alerts when the response is intense but still consistent with safe assistance.
The third signal is prompt-conditioned residual drift. A simple way to condition on the prompt would be to classify $h_t-p$ directly, but naive prompt subtraction can remove useful structure. Instead, we learn the residual direction from matched safe and unsafe continuations for the same prompt. Let $h_t^{(s)}$ and $h_t^{(u)}$ denote hidden states from safe and unsafe continuations paired at the prompt level, and define
$$\delta_t^{(s)}=h_t^{(s)}-p,\qquad \delta_t^{(u)}=h_t^{(u)}-p. \tag{4}$$
The residual head scores a prompt-relative state as
$$r(\delta)=w_r^{\top}\delta+b_r, \tag{5}$$
and is trained with the pairwise ranking loss
$$\mathcal{L}_{\mathrm{res}}=\max\left(0,\ 1-r(\delta_t^{(u)})+r(\delta_t^{(s)})\right). \tag{6}$$
This encourages unsafe residual states to score above safe residual states for the same prompt. At test time, the residual score is
$$r_t=r(h_t-p). \tag{7}$$
The residual head is therefore not just prompt subtraction. It learns which prompt-relative deviations correspond to unsafe continuation under matched prompt conditions.
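As an illustration of Equations (4) to (7), here is a small sketch of the residual score and the pairwise hinge loss on one matched pair. The vectors, weights, and function names are made-up toy values for illustration only; the actual training loop and optimizer are not described in this section and are not reproduced here.

```python
def residual_score(w_r, b_r, h, p):
    """r(h - p) = w_r^T (h - p) + b_r for one hidden state and prompt summary."""
    delta = [hi - pi for hi, pi in zip(h, p)]
    return sum(w * d for w, d in zip(w_r, delta)) + b_r


def pairwise_hinge(score_unsafe, score_safe, margin=1.0):
    """max(0, margin - r(delta_u) + r(delta_s)), as in Equation (6)."""
    return max(0.0, margin - score_unsafe + score_safe)


# Toy matched pair sharing the same prompt summary p (illustrative values).
w_r, b_r = [1.0, 0.0], 0.0
p = [0.5, 0.5]
r_safe = residual_score(w_r, b_r, [0.5, 1.0], p)    # delta = [0.0, 0.5]
r_unsafe = residual_score(w_r, b_r, [2.5, 0.5], p)  # delta = [2.0, 0.0]

loss_ordered = pairwise_hinge(r_unsafe, r_safe)   # unsafe ranked above safe
loss_reversed = pairwise_hinge(r_safe, r_unsafe)  # ordering violated
```

The loss is zero once the unsafe residual scores at least one margin above the safe one, and grows linearly when the ordering is violated.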
### 3.3 Online Score, Size, and Calibration
The raw per-token monitor score combines the three signals:
$$g_t=f_t-\alpha c_t+\beta r_t, \tag{8}$$
where $\alpha>0$ controls suppression by support evidence and $\beta>0$ controls the contribution of prompt-conditioned residual drift. The future-hazard term asks whether harmful continuation is likely soon, the support term asks whether the current trajectory still looks bounded or corrective, and the residual term asks whether the continuation has drifted toward an unsafe direction relative to safe behavior for the same prompt type.
For online monitoring, we smooth the raw score with an exponential moving average:
$$m_t=\lambda g_t+(1-\lambda)m_{t-1}, \tag{9}$$
with $\lambda=0.3$ in our main runs. Appendix [C](https://arxiv.org/html/2605.23974#A3) includes a short sensitivity check showing that balanced AUROC is stable across a wider range of $\lambda$ values. A generation is flagged when $m_t$ exceeds the operating threshold.
Because $m_t$ depends only on the current hidden state, the cached prompt summary, and previous monitor values, the decision remains same-pass and prefix-measurable.
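The online rule in Equations (8) and (9) can be sketched as a few lines of streaming code. This is a hedged illustration: the initial value `m0 = 0.0`, the `>=` comparison against the threshold, and the example values of `alpha`, `beta`, and `lam` are assumptions made for the sketch, not settings reported by the paper (which uses $\lambda=0.3$ in its main runs).

```python
def combine(f, c, r, alpha, beta):
    """Raw per-token score g_t = f_t - alpha * c_t + beta * r_t."""
    return f - alpha * c + beta * r


def ema_monitor(raw_scores, lam=0.3, m0=0.0):
    """Smoothed scores m_t = lam * g_t + (1 - lam) * m_{t-1}.

    Assumption: the recursion starts from m0 = 0.0.
    """
    m = m0
    smoothed = []
    for g in raw_scores:
        m = lam * g + (1 - lam) * m
        smoothed.append(m)
    return smoothed


def first_trigger(smoothed, threshold):
    """Index of the first step whose smoothed score reaches the threshold."""
    for t, m in enumerate(smoothed):
        if m >= threshold:
            return t
    return None


g = combine(f=1.0, c=0.5, r=2.0, alpha=2.0, beta=0.5)
trace = ema_monitor([1.0, 1.0], lam=0.5)
fired_at = first_trigger(trace, threshold=0.7)
```

Each step uses only the current head outputs and the previous smoothed value, so the loop needs no lookahead and no second pass over the prefix.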
The default linear monitor is intentionally small. Each of the three scalar heads has 128 weights and one bias, for a total of $3\times 129=387$ trainable head parameters. The deployed artifact additionally stores fixed projection matrices and normalization statistics, totaling about 1.60M stored scalars (6.1 MiB in fp32) for Qwen/Qwen3-8B and about 1.00M stored scalars (3.8 MiB) for google/gemma-4-E4B-it. These counts describe only the monitor artifact; the generator is the frozen model whose already-computed hidden states are being read. Full monitor-size and hyperparameter details are reported in Appendix [C](https://arxiv.org/html/2605.23974#A3).
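The size figures above follow from simple arithmetic, which the short check below reproduces. The scalar counts are the approximate values quoted in the text, and the only assumption is 4 bytes per fp32 scalar with 1 MiB = 2^20 bytes.

```python
PROJ_DIM = 128   # projected hidden-state dimension
NUM_HEADS = 3    # future-hazard, support, paired-residual

# Each scalar head has PROJ_DIM weights plus one bias.
head_params = NUM_HEADS * (PROJ_DIM + 1)


def fp32_mib(num_scalars):
    """Storage in MiB for `num_scalars` values at 4 bytes each."""
    return num_scalars * 4 / 2**20


qwen_mib = fp32_mib(1.60e6)   # approximate stored scalars for Qwen/Qwen3-8B
gemma_mib = fp32_mib(1.00e6)  # approximate stored scalars for gemma-4-E4B-it
```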
For balanced benchmarks, no operating threshold is selected; AUROC and AUPRC are computed from the terminal EMA score. For prompt-level trigger benchmarks, we use a source-side safe-budget calibration rule for AERIC: on a held-out source calibration split, we choose the threshold that maximizes trigger coverage while constraining the safe-trigger rate to be at most 10%. We therefore treat prompt-benchmark results as exposure-withholding diagnostics rather than as primary balanced-classification evidence. We also tested small same-pass MLP replacements for individual heads, but report them only as a head-capacity follow-up because they were not standardized across all model and benchmark pairs.
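One way to read the safe-budget rule is as a constrained search over candidate thresholds. The sketch below is an assumption-laden illustration: it summarizes each calibration row by a single peak smoothed score, treats a row as triggered when that peak is at least the threshold, and searches only over observed scores. The paper's exact objective is specified in its appendix and may differ.

```python
def calibrate_threshold(safe_peaks, unsafe_peaks, budget=0.10):
    """Pick the threshold with the highest unsafe coverage whose
    safe-trigger rate stays within `budget`.

    Returns (threshold, coverage, safe_rate), or None if no candidate
    satisfies the budget. Ties keep the lowest qualifying threshold.
    """
    candidates = sorted(set(safe_peaks) | set(unsafe_peaks))
    best = None
    for tau in candidates:
        safe_rate = sum(s >= tau for s in safe_peaks) / len(safe_peaks)
        if safe_rate > budget:
            continue  # violates the safe-trigger budget
        coverage = sum(u >= tau for u in unsafe_peaks) / len(unsafe_peaks)
        if best is None or coverage > best[1]:
            best = (tau, coverage, safe_rate)
    return best


# Toy calibration split (illustrative values, not from the paper).
safe = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
unsafe = [0.95, 1.2, 1.5, 0.5]
result = calibrate_threshold(safe, unsafe, budget=0.10)
```

Because lowering the threshold can only increase both rates, the selected value is the lowest threshold that still keeps safe triggers within the budget.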
## 4 Experimental Setup
### 4.1 Model and Inference
We evaluate AERIC on two base models: Qwen3-8B and gemma-4-E4B-it. In both cases we read hidden states from layer $-8$. The monitor is always same-pass: no experiment allows an additional forward pass through the generator. The default AERIC scores come from small linear heads on top of hidden states that are already available during decoding, and the online decision signal is the EMA-smoothed monitor described above.
### 4.2 Source and Target Data
#### Source supervision.
Most transfer experiments use a FineHarm split of 1024/256/256 for train, calibration, and held-out source test. The paired residual head uses 1024 matched safe rows and 1024 matched unsafe rows from Anthropic HH data, yielding 981 matched prompt contexts. We also use auxiliary safe rows from English advice-oriented sources when training the support head.
#### Balanced implicit\-harm targets\.
We evaluate on the full DiaSafety \[[24](https://arxiv.org/html/2605.23974#bib.bib6)\] test set (1094 rows) and on Harmful Advice \[[14](https://arxiv.org/html/2605.23974#bib.bib15),[13](https://arxiv.org/html/2605.23974#bib.bib28)\] (550 rows). DiaSafety is the harder conversational target because contextual confounds are stronger. Harmful Advice is the clearest transfer win for the full method.
#### Prompt\-conditioned trigger targets\.
We evaluate on HarmBench DirectRequest \[[15](https://arxiv.org/html/2605.23974#bib.bib32)\] (320 prompts) and on SocialHarmBench \[[19](https://arxiv.org/html/2605.23974#bib.bib33)\] (585 raw prompts; 584 non-empty prompts in our run). Both are prompt-only harmful request suites, so they are suitable for answer-generation trigger diagnostics but not for balanced response classification. Their role here is purely systems-oriented: does the monitor fire early enough during generation to withhold continuation before it is exposed? Additional dataset and preprocessing details are given in Appendix [E](https://arxiv.org/html/2605.23974#A5).
### 4.3 Baselines
We compare against three external baselines, grouped by moderation style.
- Post-hoc moderation: ShieldGemma-9B and WildGuard \[[6](https://arxiv.org/html/2605.23974#bib.bib20)\] on completed prompt-response pairs. For prompt-level trigger comparisons, we run ShieldGemma and WildGuard on growing prefixes as fixed-budget streaming approximations.
- Native streaming guard: Qwen3Guard-Stream-4B \[[31](https://arxiv.org/html/2605.23974#bib.bib21)\], a pretrained token-level streaming guard evaluated on both balanced targets, on HarmBench DirectRequest, and on SocialHarmBench for both generator families.
### 4.4 Metrics
#### Balanced implicit-harm benchmarks.
On DiaSafety and Harmful Advice we report AUROC and AUPRC in the main table. These are the paper’s primary threshold-free ranking metrics. Each balanced-benchmark cell additionally reports a nonparametric 95% bootstrap confidence interval obtained by resampling evaluation rows with replacement and recomputing the metric. The intervals therefore capture evaluation-set sampling variability with the trained monitor and frozen generator held fixed, rather than retraining or random-seed variation. Thresholded operating behavior still matters for deployment, but it is secondary to the ranking story and is discussed in the text rather than elevated to the headline table.
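A row-resampling bootstrap of this kind can be sketched with the standard library alone. The percentile interval, the number of resamples, the fixed seed, and the pairwise AUROC implementation below are assumptions chosen for a compact illustration; the source states only that rows are resampled with replacement and the metric recomputed.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half). Returns None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval over evaluation rows."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip single-class resamples
            stats.append(value)
    stats.sort()
    lo = stats[int((1 - level) / 2 * (len(stats) - 1))]
    hi = stats[int((1 + level) / 2 * (len(stats) - 1))]
    return lo, hi


# Toy evaluation rows (illustrative values only).
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
point = auroc(y, s)
low, high = bootstrap_ci(y, s, n_boot=200)
```

Since the monitor and generator are held fixed, an interval like this reflects only which evaluation rows happened to be sampled.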
#### Prompt\-conditioned trigger benchmarks\.
On HarmBench DirectRequest and SocialHarmBench we report trigger@8, trigger@16, trigger@32, trigger@64, and mean *withheld answer tokens* relative to a no-stop baseline. Here, withheld answer tokens refers to the number of answer tokens that would have been emitted under no-stop but were never shown because the monitor fired first. For AERIC, we set the operating threshold by a source-side safe-budget calibration rule: on a held-out source calibration split, we choose the threshold that maximizes trigger coverage while constraining the safe-trigger rate to be at most 10%. The calibration objective and selected source-side thresholds are detailed in Appendix [B](https://arxiv.org/html/2605.23974#A2).
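The two trigger metrics can be illustrated with a small sketch. The conventions here are assumptions: the trigger position is a 0-based answer-token index, a prompt counts toward trigger@k when that index is below k, all tokens from the trigger position onward count as withheld, and the mean is taken over all prompts including those that never trigger. The paper's precise definitions are given in its appendix.

```python
def trigger_at_k(first_trigger_steps, k):
    """Fraction of prompts whose first trigger falls within the first k
    answer tokens. `None` means the monitor never fired."""
    hits = sum(1 for t in first_trigger_steps if t is not None and t < k)
    return hits / len(first_trigger_steps)


def mean_withheld(first_trigger_steps, no_stop_lengths):
    """Mean number of answer tokens emitted under no-stop but never shown."""
    withheld = []
    for t, n in zip(first_trigger_steps, no_stop_lengths):
        withheld.append(0 if t is None else max(0, n - t))
    return sum(withheld) / len(withheld)


# Toy run over four prompts (illustrative values only).
steps = [3, None, 40, 70]
lengths = [64, 64, 64, 100]
t8 = trigger_at_k(steps, 8)
t64 = trigger_at_k(steps, 64)
withheld = mean_withheld(steps, lengths)
```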
## 5 Results
### 5.1 Balanced Implicit-Harm Benchmarks
Table 1: Balanced implicit-harm benchmarks. Each cell reports the point estimate with a nonparametric 95% bootstrap confidence interval on a second line, computed by resampling evaluation rows with replacement. Bold marks the best point estimate in each metric column and underline marks the second-best point estimate. AERIC is evaluated with hidden states from two base models.
Table [1](https://arxiv.org/html/2605.23974#S5.T1) shows that AERIC improves implicit-harm ranking over both response-level guards and the strongest native streaming guard in our comparison. The main table reports point estimates together with nonparametric 95% bootstrap confidence intervals for the balanced benchmarks. Compared with the best external baseline in each metric column, the best AERIC configuration improves on DiaSafety by 3.26 AUROC points and 2.21 AUPRC points, and on Harmful Advice by 3.63 AUROC points and 0.90 AUPRC points. The gains are largest in AUROC, indicating that the main effect is improved ordering of safe and unsafe continuations rather than simply choosing a better operating threshold. This distinction matters because our balanced benchmarks are evaluated as threshold-free ranking tasks, while deployment-time threshold selection is handled separately in the trigger experiments.
These results support the paper’s main balanced\-benchmark claim: explicit harmful supervision can be converted into a same\-pass hidden\-state signal that transfers to implicit harmful dialogue\. The cross\-model pattern strengthens this interpretation\. Gemma gives the strongest DiaSafety result, while Qwen gives the strongest Harmful Advice result, so the monitor is not simply exploiting idiosyncrasies of one generator’s representation space\. Instead, the same future\-support\-residual monitor family extracts a reusable anticipatory risk signal from already\-computed hidden states, improving over guards that rely on completed text or separate streaming guard computation\.
### 5.2 Early-Trigger Benchmarks
Table [2](https://arxiv.org/html/2605.23974#S5.T2) reports prompt-only harmful request benchmarks separately because they are not balanced response-classification tasks. Their role is systems-oriented: can a source-calibrated online monitor fire early enough to withhold continuation before more of the answer is exposed?
Table 2: Prompt-level trigger benchmarks under a source-side 10% safe-trigger budget for AERIC. Rows are grouped first by dataset and then by evaluation setting; bold marks the best value within each dataset/setting block. Withheld denotes the mean number of answer tokens that would have been shown under no-stop but were never exposed because the monitor fired first. The safe-budget calibration applies to AERIC; external guards use their native unsafe decisions under the prefix protocol described in Appendix [D](https://arxiv.org/html/2605.23974#A4).
The prompt-level picture is more conservative under the source-side 10% safe-trigger budget, but it remains encouraging. On HarmBench DirectRequest with Qwen3-8B, AERIC reaches a trigger@64 of 0.6438, compared with 0.1063 for Qwen3Guard-Stream-4B, 0.1438 for WildGuard Prefix, and 0.2188 for ShieldGemma-9B Prefix, while withholding 32.55 answer tokens on average. On the same HarmBench slice with Gemma-4-E4B-it, AERIC reaches 0.4656, compared with 0.3281 for Qwen3Guard-Stream-4B, 0.3750 for WildGuard Prefix, and 0.2094 for ShieldGemma-9B Prefix. SocialHarmBench shows that the signal is not limited to generic harmful instructions and that the effect survives a second generator family. Under the Qwen generator, the same calibrated monitor reaches a trigger@64 of 0.6849, compared with 0.0719 for Qwen3Guard-Stream-4B, 0.1199 for WildGuard Prefix, and 0.2038 for ShieldGemma-9B Prefix. Under the Gemma generator, it reaches 0.7363, compared with 0.2962 for Qwen3Guard-Stream-4B, 0.3322 for WildGuard Prefix, and 0.1199 for ShieldGemma-9B Prefix, while withholding 41.86 answer tokens on average. These benchmarks depend on the source-side 10% safe-trigger budget calibration and therefore should not be read as balanced classification evidence, but they do show that the same-pass signal remains actionable before end-of-sequence on prompt-only harmful request suites.
### 5.3 Component Ablation of the Base Score Family
The final online monitor combines the three linear heads with an EMA decision rule. To isolate what each head contributes under that same online rule, Table [3](https://arxiv.org/html/2605.23974#S5.T3) reruns the underlying Qwen/Qwen3-8B transfer family one component at a time with the refreshed EMA aggregation: future-hazard only, then future-hazard plus support, and finally the paired residual head. The main balanced results above are the final EMA-selected monitor; this table is the matching structural ablation under the same aggregation scheme.
Table 3: Ablation of the underlying Qwen/Qwen3-8B transfer family under the same EMA aggregation used by the final online monitor. Bold marks the best value in each metric column. The support head provides the first large gain on Harmful Advice, and the paired residual term adds the final AUROC lift on both balanced targets.
The pattern is consistent across targets. The support head explains the first large jump on Harmful Advice, where assertive but still bounded guidance is a major confound, moving AUROC from 0.6873 to 0.8430. The paired residual term then improves AUROC on both targets, with the clearest relative benefit on DiaSafety, where prompt-conditioned conversational confounds are stronger, moving AUROC from 0.6993 to 0.7143. DiaSafety AUPRC also rises modestly after adding the residual head, which suggests that the main benefit is better ranking without changing the basic mechanism of the monitor family. In other words, the final EMA monitor is not hiding a different signal; it is aggregating a score family whose component contributions still point in the same direction under the refreshed methodology.
### 5.4 Computational Overhead
The systems story is equally important. We measure latency on a harmful-prompt fixed-generation overhead benchmark sampled from HarmBench DirectRequest and SocialHarmBench under Qwen/Qwen3-8B. On that aggregate, the paired-residual monitor increases mean latency by only 2.34%, compared with 79.40% for Qwen3Guard-Stream-4B, 158.73% for prefixized ShieldGemma-9B, and 216.74% for prefixized WildGuard. Broken out by dataset, AERIC increases mean latency by 2.38% on HarmBench DirectRequest and 2.31% on SocialHarmBench; the corresponding numbers are 77.61% and 81.41% for Qwen3Guard-Stream-4B, 157.68% and 159.91% for prefixized ShieldGemma-9B, and 218.88% and 214.35% for prefixized WildGuard. Measurement details are reported in Appendix [A](https://arxiv.org/html/2605.23974#A1).
## 6Limitations and Broader Impact
AERIC should be understood as an online risk signal, not as a complete intervention policy\. Our balanced results evaluate threshold\-free ranking, while prompt\-level trigger results on HarmBench DirectRequest and SocialHarmBench depend on a frozen operating threshold selected before evaluation\. High trigger coverage can reduce exposure, but it does not decide what a deployed system should do next: stop generation, ask a clarifying question, route to a safer response policy, or escalate to human review\. This distinction matters because safety interventions can create their own failure modes\. In particular, overly conservative safeguards can produce false refusals or over\-refusal on benign but sensitive requests, reducing usefulness precisely in settings where users may need careful support \[[3](https://arxiv.org/html/2605.23974#bib.bib35),[30](https://arxiv.org/html/2605.23974#bib.bib36),[23](https://arxiv.org/html/2605.23974#bib.bib37)\]\. The support head is designed to reduce part of this failure mode, but it does not remove the need for careful thresholding and downstream policy design\.
The empirical scope is also limited\. We evaluate two generator families and read hidden states from a single layer, so the current results do not establish robustness across model scales, architectures, languages, decoding strategies, or layer choices\. The supervision is also imperfectly matched to the target setting: we transfer from explicit harmful supervision and matched safe/unsafe behavior to implicit, context\-sensitive harmfulness, rather than training on large\-scale prompt\-matched implicit\-harm onset labels\. Better source data with implicit onset annotations, more domain\-matched safe continuations, and multilingual coverage would likely improve the transfer frontier\. More broadly, same\-pass monitoring reduces overhead but does not remove safety risk: a low\-latency hidden\-state monitor can warn before further exposure, but automatic hard stops can still produce confusing truncations, unnecessary refusals, or missed harms\. AERIC should therefore be treated as one component of a broader safety stack rather than as a standalone substitute for evaluation, user\-facing policy design, and human oversight \[[8](https://arxiv.org/html/2605.23974#bib.bib38)\]\.
## 7Conclusion
We presented AERIC, a same\-pass hidden\-state monitor for implicit harmful dialogue\. Rather than asking a separate guard to repeatedly classify growing text prefixes, AERIC reads the generator’s already\-computed hidden states and combines short\-horizon hazard forecasting, support\-sensitive suppression, and prompt\-conditioned residual scoring under an EMA\-smoothed online decision rule\. The results show that explicit harmful supervision can transfer to implicit\-harm ranking across two generator families, improving over response\-level guards and Qwen3Guard\-Stream\-4B, the strongest native streaming baseline in our comparison\. The same frozen monitor also remains actionable on prompt\-only harmful request suites, where it can trigger before end\-of\-sequence and withhold continuation before further exposure\. The claim is deliberately limited: AERIC provides a lightweight pre\-exposure risk signal, not a complete serving policy\. Remaining bottlenecks include contextual confounds on DiaSafety, calibration\-sensitive trigger behavior, and the still\-open problem of deciding how an application should respond after an early risk signal\.
## References
- \[1\] A\. Azaria and T\. Mitchell \(2023\-12\) The Internal State of an LLM Knows When It’s Lying\. In Findings of the Association for Computational Linguistics: EMNLP 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 967–976\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.68/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.68) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[2\] Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan \(2022\-04\) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback\. arXiv\. Note: arXiv:2204\.05862 \[cs\] Comment: Data available at https://github\.com/anthropics/hh\-rlhf External Links: [Link](http://arxiv.org/abs/2204.05862), [Document](https://dx.doi.org/10.48550/arXiv.2204.05862) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1), [§3\.2](https://arxiv.org/html/2605.23974#S3.SS2.p2.2)\.
- \[3\] J\. Cui, W\. Chiang, I\. Stoica, and C\. Hsieh \(2025\-06\) OR\-Bench: An Over\-Refusal Benchmark for Large Language Models\. In Proceedings of the 42nd International Conference on Machine Learning, \(en\)\. External Links: [Link](https://openreview.net/forum?id=CdFnEu0JZV) Cited by: [§6](https://arxiv.org/html/2605.23974#S6.p1.1)\.
- \[4\] J\. Fang, N\. Chen, H\. Jiang, D\. Zhang, F\. Shen, X\. Wang, X\. He, and T\. Chua \(2026\-02\) NExT\-Guard: Training\-Free Streaming Safeguard without Token\-Level Labels\. arXiv\. Note: arXiv:2603\.02219 \[cs\] External Links: [Link](http://arxiv.org/abs/2603.02219), [Document](https://dx.doi.org/10.48550/arXiv.2603.02219) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1)\.
- \[5\] D\. Ganguli, L\. Lovitt, J\. Kernion, A\. Askell, Y\. Bai, S\. Kadavath, B\. Mann, E\. Perez, N\. Schiefer, K\. Ndousse, A\. Jones, S\. Bowman, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, N\. Elhage, S\. El\-Showk, S\. Fort, Z\. Hatfield\-Dodds, T\. Henighan, D\. Hernandez, T\. Hume, J\. Jacobson, S\. Johnston, S\. Kravec, C\. Olsson, S\. Ringer, E\. Tran\-Johnson, D\. Amodei, T\. Brown, N\. Joseph, S\. McCandlish, C\. Olah, J\. Kaplan, and J\. Clark \(2022\-11\) Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned\. arXiv\. Note: arXiv:2209\.07858 \[cs\] External Links: [Link](http://arxiv.org/abs/2209.07858), [Document](https://dx.doi.org/10.48550/arXiv.2209.07858) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\] S\. Han, K\. Rao, A\. Ettinger, L\. Jiang, B\. Y\. Lin, N\. Lambert, Y\. Choi, and N\. Dziri \(2024\-12\) WildGuard: Open One\-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs\. arXiv\. Note: arXiv:2406\.18495 \[cs\] Comment: NeurIPS 2024 Camera Ready\. First two authors contributed equally\. Third and fourth authors contributed equally External Links: [Link](http://arxiv.org/abs/2406.18495), [Document](https://dx.doi.org/10.48550/arXiv.2406.18495) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1), [1st item](https://arxiv.org/html/2605.23974#S4.I1.i1.p1.1)\.
- \[7\] T\. Hartvigsen, S\. Gabriel, H\. Palangi, M\. Sap, D\. Ray, and E\. Kamar \(2022\-07\) ToxiGen: A Large\-Scale Machine\-Generated Dataset for Adversarial and Implicit Hate Speech Detection\. arXiv\. Note: arXiv:2203\.09509 \[cs\] Comment: Published as a long paper at ACL 2022\. Code: https://github\.com/microsoft/TOXIGEN External Links: [Link](http://arxiv.org/abs/2203.09509), [Document](https://dx.doi.org/10.48550/arXiv.2203.09509) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\] International AI Safety Report \(2026\-02\) International AI Safety Report 2026\. Technical report, International AI Safety Report\. External Links: [Link](https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026) Cited by: [§6](https://arxiv.org/html/2605.23974#S6.p2.1)\.
- \[9\] D\. Jiao, Y\. Liu, Y\. Yuan, Z\. Tang, L\. Du, H\. Wu, and A\. Anderson \(2026\-04\) LLM Safety From Within: Detecting Harmful Content with Internal Representations\. arXiv\. Note: arXiv:2604\.18519 \[cs\] version: 1 Comment: 17 pages, 10 figures, 6 tables External Links: [Link](http://arxiv.org/abs/2604.18519), [Document](https://dx.doi.org/10.48550/arXiv.2604.18519) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p2.1)\.
- \[10\] P\. Kavumba, K\. Wataoka, H\. H\. Nguyen, J\. Li, and M\. Ohagi \(2026\-04\) Predict, Don’t React: Value\-Based Safety Forecasting for LLM Streaming\. arXiv\. Note: arXiv:2604\.03962 \[cs\] External Links: [Link](http://arxiv.org/abs/2604.03962), [Document](https://dx.doi.org/10.48550/arXiv.2604.03962) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1)\.
- \[11\] X\. Li, M\. Wu, Y\. Zhu, Y\. Lv, Y\. Chen, C\. Chen, J\. Guo, and H\. Xue \(2025\-10\) Kelp: A Streaming Safeguard for Large Models via Latent Dynamics\-Guided Risk Detection\. arXiv\. Note: arXiv:2510\.09694 \[cs\] External Links: [Link](http://arxiv.org/abs/2510.09694), [Document](https://dx.doi.org/10.48550/arXiv.2510.09694) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1)\.
- \[12\] Y\. Li, Q\. Sheng, Y\. Yang, X\. Zhang, and J\. Cao \(2025\-09\) From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring\. arXiv\. Note: arXiv:2506\.09996 \[cs\] Comment: NeurIPS 2025 Accepted Paper External Links: [Link](http://arxiv.org/abs/2506.09996), [Document](https://dx.doi.org/10.48550/arXiv.2506.09996) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1), [§3\.2](https://arxiv.org/html/2605.23974#S3.SS2.p1.3)\.
- \[13\] L\. Luettgau, V\. Cheung, M\. Dubois, K\. Juechems, J\. Bergs, L\. Symes, H\. Davidson, B\. O’Dell, H\. R\. Kirk, M\. Rollwage, and C\. Summerfield \(2026\-04\) People readily follow personal advice from AI but it does not improve their well\-being\. arXiv\. Note: arXiv:2511\.15352 \[cs\] version: 3 External Links: [Link](http://arxiv.org/abs/2511.15352), [Document](https://dx.doi.org/10.48550/arXiv.2511.15352) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23974#S4.SS2.SSS0.Px2.p1.2)\.
- \[14\]Cited by:[§1](https://arxiv.org/html/2605.23974#S1.p1.1),[§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.23974#S4.SS2.SSS0.Px2.p1.2)\.
- \[15\] M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks \(2024\-02\) HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal\. arXiv\. Note: arXiv:2402\.04249 \[cs\] Comment: Website: https://www\.harmbench\.org External Links: [Link](http://arxiv.org/abs/2402.04249), [Document](https://dx.doi.org/10.48550/arXiv.2402.04249) Cited by: [§4\.2](https://arxiv.org/html/2605.23974#S4.SS2.SSS0.Px3.p1.3)\.
- \[16\] A\. Mei, A\. Kabir, S\. Levy, M\. Subbiah, E\. Allaway, J\. Judge, D\. Patton, B\. Bimber, K\. McKeown, and W\. Y\. Wang \(2022\-12\) Mitigating Covertly Unsafe Text within Natural Language Systems\. In Findings of the Association for Computational Linguistics: EMNLP 2022, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), Abu Dhabi, United Arab Emirates, pp\. 2914–2926\. External Links: [Link](https://aclanthology.org/2022.findings-emnlp.211/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.211) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§1](https://arxiv.org/html/2605.23974#S1.p3.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[17\] L\. Mei, S\. Liu, Y\. Wang, B\. Bi, R\. Yuan, and X\. Cheng \(2024\-10\) HiddenGuard: Fine\-Grained Safe Generation with Specialized Representation Router\. arXiv\. Note: arXiv:2410\.02684 \[cs\] External Links: [Link](http://arxiv.org/abs/2410.02684), [Document](https://dx.doi.org/10.48550/arXiv.2410.02684) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1)\.
- \[18\] K\. Pal, J\. Sun, A\. Yuan, B\. Wallace, and D\. Bau \(2023\-12\) Future Lens: Anticipating Subsequent Tokens from a Single Hidden State\. In Proceedings of the 27th Conference on Computational Natural Language Learning \(CoNLL\), J\. Jiang, D\. Reitter, and S\. Deng \(Eds\.\), Singapore, pp\. 548–560\. External Links: [Link](https://aclanthology.org/2023.conll-1.37/), [Document](https://dx.doi.org/10.18653/v1/2023.conll-1.37) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[19\] P\. S\. Pandey, H\. S\. Le, D\. Bhardwaj, R\. Mihalcea, and Z\. Jin \(2026\-02\) SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests\. arXiv\. Note: arXiv:2510\.04891 \[cs\] Comment: ICLR 2026 External Links: [Link](http://arxiv.org/abs/2510.04891), [Document](https://dx.doi.org/10.48550/arXiv.2510.04891) Cited by: [§4\.2](https://arxiv.org/html/2605.23974#S4.SS2.SSS0.Px3.p1.3)\.
- \[20\] H\. Phan, V\. Li, and Q\. Lei \(2025\-11\) Think Twice, Generate Once: Safeguarding by Progressive Self\-Reflection\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 9466–9483\. External Links: ISBN 979\-8\-89176\-335\-7, [Link](https://aclanthology.org/2025.findings-emnlp.503/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.503) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p2.1)\.
- \[21\] H\. Qiu, T\. Zhao, A\. Li, S\. Zhang, H\. He, and Z\. Lan \(2023\-07\) A Benchmark for Understanding Dialogue Safety in Mental Health Support\. arXiv\. Note: arXiv:2307\.16457 \[cs\] Comment: accepted to The 12th CCF International Conference on Natural Language Processing and Chinese Computing \(NLPCC2023\) External Links: [Link](http://arxiv.org/abs/2307.16457), [Document](https://dx.doi.org/10.48550/arXiv.2307.16457) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[22\] M\. Rauh, J\. Mellor, J\. Uesato, P\. Huang, J\. Welbl, L\. Weidinger, S\. Dathathri, A\. Glaese, G\. Irving, I\. Gabriel, W\. Isaac, and L\. A\. Hendricks \(2022\-10\) Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models\. arXiv\. Note: arXiv:2206\.08325 \[cs\] Comment: Accepted to NeurIPS 2022 Datasets and Benchmarks Track; 10 pages plus appendix External Links: [Link](http://arxiv.org/abs/2206.08325), [Document](https://dx.doi.org/10.48550/arXiv.2206.08325) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[23\] G\. Sullutrone, R\. A\. Vigliermo, S\. Bergamaschi, and L\. Sala \(2025\-07\) COVER: Context\-Driven Over\-Refusal Verification in LLMs\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 24214–24229\. External Links: ISBN 979\-8\-89176\-256\-5, [Link](https://aclanthology.org/2025.findings-acl.1243/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1243) Cited by: [§6](https://arxiv.org/html/2605.23974#S6.p1.1)\.
- \[24\] H\. Sun, G\. Xu, J\. Deng, J\. Cheng, C\. Zheng, H\. Zhou, N\. Peng, X\. Zhu, and M\. Huang \(2022\-05\) On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark\. In Findings of the Association for Computational Linguistics: ACL 2022, S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3906–3923\. External Links: [Link](https://aclanthology.org/2022.findings-acl.308/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.308) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§1](https://arxiv.org/html/2605.23974#S1.p3.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1), [§4\.2](https://arxiv.org/html/2605.23974#S4.SS2.SSS0.Px2.p1.2)\.
- \[25\] J\. Wen, P\. Ke, H\. Sun, Z\. Zhang, C\. Li, J\. Bai, and M\. Huang \(2023\-11\) Unveiling the Implicit Toxicity in Large Language Models\. arXiv\. Note: arXiv:2311\.17391 \[cs\] Comment: EMNLP 2023 Main Conference External Links: [Link](http://arxiv.org/abs/2311.17391), [Document](https://dx.doi.org/10.48550/arXiv.2311.17391) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1), [§1](https://arxiv.org/html/2605.23974#S1.p3.1), [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px1.p1.1)\.
- \[26\] Z\. Xuan, X\. Mao, D\. Chen, X\. Zhang, Y\. Dong, and J\. Zhou \(2025\-07\) ShieldHead: Decoding\-time Safeguard for Large Language Models\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 18129–18143\. External Links: ISBN 979\-8\-89176\-256\-5, [Link](https://aclanthology.org/2025.findings-acl.932/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.932) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p1.1)\.
- \[27\] J\. Yang, H\. Guo, Z\. Lu, Y\. Zhao, Y\. Zhang, S\. Hu, Q\. Duan, Y\. Wang, and T\. Wei \(2025\-12\) Prefix Probing: Lightweight Harmful Content Detection for Large Language Models\. arXiv\. Note: arXiv:2512\.16650 \[cs\] version: 1 External Links: [Link](http://arxiv.org/abs/2512.16650), [Document](https://dx.doi.org/10.48550/arXiv.2512.16650) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p2.1)\.
- \[28\] W\. Zeng, Y\. Liu, R\. Mullins, L\. Peran, J\. Fernandez, H\. Harkous, K\. Narasimhan, D\. Proud, P\. Kumar, B\. Radharapu, O\. Sturman, and O\. Wahltinez \(2024\-08\) ShieldGemma: Generative AI Content Moderation Based on Gemma\. arXiv\. Note: arXiv:2407\.21772 \[cs\] External Links: [Link](http://arxiv.org/abs/2407.21772), [Document](https://dx.doi.org/10.48550/arXiv.2407.21772) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1)\.
- \[29\] X\. Zeng, Y\. Shang, J\. Chen, J\. Zhang, and Y\. Tian \(2025\-07\) Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 1974–1988\. External Links: ISBN 979\-8\-89176\-251\-0, [Link](https://aclanthology.org/2025.acl-long.97/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.97) Cited by: [§1](https://arxiv.org/html/2605.23974#S1.p2.1)\.
- \[30\] Z\. Zhang, W\. Xu, F\. Wu, and C\. K\. Reddy \(2025\-07\) FalseReject: A Resource for Improving Contextual Safety and Mitigating Over\-Refusals in LLMs via Structured Reasoning\. arXiv \(en\)\. Note: arXiv:2505\.08054 \[cs\] Comment: Accepted at COLM 2025 External Links: [Link](http://arxiv.org/abs/2505.08054), [Document](https://dx.doi.org/10.48550/arXiv.2505.08054) Cited by: [§6](https://arxiv.org/html/2605.23974#S6.p1.1)\.
- \[31\] H\. Zhao, C\. Yuan, F\. Huang, X\. Hu, Y\. Zhang, A\. Yang, B\. Yu, D\. Liu, J\. Zhou, J\. Lin, B\. Yang, C\. Cheng, J\. Tang, J\. Jiang, J\. Zhang, J\. Xu, M\. Yan, M\. Sun, P\. Zhang, P\. Xie, Q\. Tang, Q\. Zhu, R\. Zhang, S\. Wu, S\. Zhang, T\. He, T\. Tang, T\. Xia, W\. Liao, W\. Shen, W\. Yin, W\. Zhou, W\. Yu, X\. Wang, X\. Deng, X\. Xu, X\. Zhang, Y\. Liu, Y\. Li, Y\. Zhang, Y\. Jiang, Y\. Wan, and Y\. Zhou \(2025\-10\) Qwen3Guard Technical Report\. arXiv\. Note: arXiv:2510\.14276 \[cs\] External Links: [Link](http://arxiv.org/abs/2510.14276), [Document](https://dx.doi.org/10.48550/arXiv.2510.14276) Cited by: [§2](https://arxiv.org/html/2605.23974#S2.SS0.SSS0.Px2.p1.1), [2nd item](https://arxiv.org/html/2605.23974#S4.I1.i2.p1.1)\.
## Appendix ASystems Measurement Details
We measure runtime on a single NVIDIA RTX 6000 Ada Generation GPU with 48GB memory, driver version 575\.57\.08, CUDA 12\.8, PyTorch 2\.11\.0, and Transformers 5\.5\.4\. All latency measurements use greedy decoding with Qwen/Qwen3\-8B, disable\_thinking, a maximum of 64 generated answer tokens, and the same assistant system prompt used in the prompt\-trigger experiments\. Times include generator decoding and monitor or guard computation, but not model\-loading time\.
The harmful\-prompt latency benchmark samples the same prompt families used in Table [2](https://arxiv.org/html/2605.23974#S5.T2): 32 HarmBench DirectRequest prompts and 31 SocialHarmBench prompts\. For AERIC, the monitor reads hidden states already produced by the generator and computes the same\-pass heads during decoding\. For WildGuard Prefix and ShieldGemma\-9B Prefix, we call the guard every 8 emitted answer tokens on the current prompt\-response prefix\. For Qwen3Guard\-Stream\-4B, we update the native streaming guard with each newly emitted guard\-token delta\. The measurement is fixed\-generation rather than stop\-policy latency: generation continues up to the same 64\-token cap, so the comparison isolates monitoring overhead rather than savings from early stopping\.
Table 4: Absolute latency and overhead for the harmful\-prompt fixed\-generation benchmark\. Overhead is computed relative to the no\-stop generation baseline measured on the same dataset and run configuration\.

Aggregating the two prompt families by row count gives mean total latency 1190\.69 ms for AERIC, 2087\.77 ms for Qwen3Guard\-Stream\-4B, 3026\.31 ms for ShieldGemma\-9B Prefix, and 3685\.04 ms for WildGuard Prefix\. The corresponding overheads over no\-stop generation are 2\.34%, 79\.40%, 158\.73%, and 216\.74%, respectively\.
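The overhead definition in the Table 4 caption can be written as a one-line ratio. The sketch below is illustrative only: the function names are hypothetical, and it assumes overhead is the relative increase of monitored latency over the no-stop baseline for the same prompts.

```python
def relative_overhead_pct(baseline_ms: float, monitored_ms: float) -> float:
    """Percent increase of monitored latency over the no-stop baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return 100.0 * (monitored_ms - baseline_ms) / baseline_ms


def implied_baseline_ms(monitored_ms: float, overhead_pct: float) -> float:
    """Invert the ratio: the baseline latency implied by a latency/overhead pair."""
    return monitored_ms / (1.0 + overhead_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical timings, not measurements from the paper.
    print(relative_overhead_pct(baseline_ms=1000.0, monitored_ms=1025.0))
```

Because the reported figures are rounded, and because the aggregate may be formed differently from a ratio of aggregate means, baselines recovered this way from Table 4 should be read as approximate.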
## Appendix BCalibration and Threshold Details
The balanced response\-classification benchmarks do not use a tuned decision threshold in the headline table\. We report AUROC and AUPRC from the final EMA score, so those results should be read as threshold\-free ranking evaluations\.
For prompt\-level trigger benchmarks, AERIC uses a source\-side safe\-budget rule\. Let $\mathrm{SafeTrigger}(\tau)$ denote the fraction of safe source\-calibration continuations that would trigger at threshold $\tau$, and let $\mathrm{HarmTrigger@}K(\tau)$ denote the fraction of harmful source\-calibration continuations that trigger within the first $K$ monitored answer tokens\. We select

$$\tau^{\star}=\arg\max_{\tau:\ \mathrm{SafeTrigger}(\tau)\leq B}\mathrm{HarmTrigger@}K(\tau), \tag{11}$$

with $B=0.10$ and $K=16$ in the main prompt\-trigger table\. The threshold is chosen before evaluating HarmBench DirectRequest or SocialHarmBench\.
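The selection rule in Eq. (11) can be sketched as a search over candidate thresholds. This is a minimal illustration under stated assumptions, not the authors' implementation: trajectories are lists of per-token monitor scores, a trajectory triggers when any score reaches the threshold, candidates are the observed score values, and ties are broken toward the smallest feasible threshold. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def first_trigger_index(scores: Sequence[float], tau: float) -> Optional[int]:
    """Index of the first monitored token whose score reaches tau, else None."""
    for t, s in enumerate(scores):
        if s >= tau:
            return t
    return None


def trigger_rate(trajs: Sequence[Sequence[float]], tau: float,
                 within: Optional[int] = None) -> float:
    """Fraction of trajectories that trigger, optionally within the first `within` tokens."""
    hits = 0
    for scores in trajs:
        idx = first_trigger_index(scores, tau)
        if idx is not None and (within is None or idx < within):
            hits += 1
    return hits / len(trajs)


def select_threshold(safe_trajs: List[Sequence[float]],
                     harm_trajs: List[Sequence[float]],
                     budget: float = 0.10, k: int = 16) -> Tuple[Optional[float], float]:
    """Maximize HarmTrigger@K subject to SafeTrigger <= budget (assumed candidate grid)."""
    candidates = sorted({s for traj in list(safe_trajs) + list(harm_trajs) for s in traj})
    best_tau, best_cov = None, -1.0
    for tau in candidates:  # ascending, so ties keep the smallest feasible tau
        if trigger_rate(safe_trajs, tau) > budget:
            continue
        cov = trigger_rate(harm_trajs, tau, within=k)
        if cov > best_cov:
            best_tau, best_cov = tau, cov
    return best_tau, best_cov
```

Both trigger rates are non-increasing in the threshold, so under these assumptions the rule reduces to picking the smallest threshold that still respects the safe budget.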
Table 5: Source\-side safe\-budget calibration statistics for prompt\-level trigger experiments\. The main table uses the 10% budget rows\.
## Appendix CMonitor Size and Hyperparameters
The default AERIC monitor uses three linear heads after a 128\-dimensional projection: future hazard, support counterevidence, and paired residual drift\. Each head has 128 weights plus one bias, for $3\times 129=387$ trainable head parameters\. The fixed projection matrices and normalization statistics are stored in the monitor artifact but are not additional trainable head parameters\. In fp32, the stored monitor artifact contains about 1\.60M scalars for Qwen/Qwen3\-8B \(6\.1 MiB\) and about 1\.00M scalars for google/gemma\-4\-E4B\-it \(3\.8 MiB\)\.
Table 6: Main monitor hyperparameters\. The score\-composition weights $\alpha$ and $\beta$ are selected from the listed grid by development AUROC; $\lambda$ is fixed rather than jointly tuned\.

To check whether the EMA smoother is sensitive to this choice, Table [7](https://arxiv.org/html/2605.23974#A3.T7) varies $\lambda$ for the default Qwen/Qwen3\-8B balanced\-benchmark monitor while leaving the learned heads fixed\. Balanced AUROC changes only modestly across the tested range: from 0\.7196 to 0\.7096 on DiaSafety and from 0\.8596 to 0\.8563 on Harmful Advice\. We therefore do not observe a sharp optimum, and $\lambda=0.3$ is retained as a middle operating point rather than as a brittle tuned value\.
Table 7: Sensitivity of the EMA smoothing coefficient $\lambda$ for the default Qwen/Qwen3\-8B balanced\-benchmark monitor\. The learned heads are held fixed and only the EMA smoothing parameter is changed at evaluation time\. Performance is stable across the tested range, so we keep $\lambda=0.3$ as a central default\.
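The size arithmetic above can be checked directly. The short sketch below recomputes the head parameter count and the fp32 artifact sizes from the numbers stated in this appendix; the constant and function names are illustrative, and the only assumption is 4 bytes per fp32 scalar.

```python
PROJ_DIM = 128  # projection width stated in the text
HEADS = ("future_hazard", "support_counterevidence", "paired_residual_drift")


def head_param_count(proj_dim: int = PROJ_DIM, num_heads: int = len(HEADS)) -> int:
    """Each linear head has proj_dim weights plus one bias."""
    return num_heads * (proj_dim + 1)


def fp32_mib(num_scalars: float) -> float:
    """Storage in MiB for num_scalars fp32 values (4 bytes each)."""
    return num_scalars * 4 / 2**20


if __name__ == "__main__":
    print(head_param_count())          # trainable head parameters
    print(round(fp32_mib(1.60e6), 1))  # approximate Qwen/Qwen3-8B artifact size
    print(round(fp32_mib(1.00e6), 1))  # approximate google/gemma-4-E4B-it artifact size
```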
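For readers who want to reproduce this kind of sensitivity sweep, a minimal EMA smoother is sketched below. It assumes the common convention in which the coefficient weights the newest raw score and the state is initialized to the first score; the paper's exact recursion is defined in the main text and may differ, and the function names here are hypothetical.

```python
from typing import List, Optional, Sequence


def ema_scores(raw: Sequence[float], lam: float = 0.3) -> List[float]:
    """Smooth per-token scores with s_t = lam * r_t + (1 - lam) * s_{t-1} (assumed form)."""
    smoothed: List[float] = []
    state: Optional[float] = None
    for r in raw:
        # Assumption: the first raw score initializes the EMA state.
        state = r if state is None else lam * r + (1.0 - lam) * state
        smoothed.append(state)
    return smoothed


def first_trigger(raw: Sequence[float], tau: float, lam: float = 0.3) -> Optional[int]:
    """First token index at which the smoothed score reaches tau, else None."""
    for t, s in enumerate(ema_scores(raw, lam)):
        if s >= tau:
            return t
    return None
```

Under this convention a smaller coefficient reacts more slowly to isolated score spikes, which is the trade-off a sweep over the coefficient probes.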
## Appendix DBaseline Protocols
For balanced response\-classification benchmarks, post\-hoc guards score completed prompt\-response pairs\. WildGuard is decoded deterministically and parsed into a harmful\-response decision\. ShieldGemma\-9B is run over the prompt\-response pair under four policy guidelines: dangerous content, harassment, hate speech, and sexually explicit information\. We use the maximum unsafe probability across these categories as the ShieldGemma score\.
For prompt\-trigger benchmarks, all prefix baselines are replayed over the same realized no\-stop continuations used by AERIC\. WildGuard Prefix and ShieldGemma\-9B Prefix are called every 8 emitted answer tokens on the current prompt\-response prefix\. ShieldGemma\-9B Prefix again uses the maximum unsafe probability across the four guideline categories and triggers when the resulting unsafe decision is positive\. Qwen3Guard\-Stream\-4B is evaluated as a native streaming guard: the prompt is first sent as the user stream, and each newly emitted assistant guard\-token delta is then passed to the guard\. The stream triggers when the latest native risk level is Unsafe\.
The prompt\-trigger comparison is therefore not a post\-hoc full\-output comparison\. It asks when each method would first fire on the same deterministic generation path\. This replay design avoids confounding guard quality with different generator samples\.
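The replay protocol for prefix guards can be sketched as a loop over a fixed continuation. This is an illustration under assumptions rather than the evaluation code: `guard_is_unsafe` is a hypothetical callable standing in for a WildGuard or ShieldGemma decision, and the sketch assumes the guard is only consulted at exact multiples of the stride, not on a trailing partial chunk.

```python
from typing import Callable, Optional, Sequence


def first_fire_prefix_guard(
    prompt: str,
    answer_tokens: Sequence[str],
    guard_is_unsafe: Callable[[str, Sequence[str]], bool],
    stride: int = 8,
) -> Optional[int]:
    """Replay a prefix guard every `stride` answer tokens on a fixed continuation.

    Returns the number of answer tokens emitted when the guard first fires,
    or None if it never fires on the replayed path.
    """
    for n in range(stride, len(answer_tokens) + 1, stride):
        if guard_is_unsafe(prompt, answer_tokens[:n]):
            return n
    return None


if __name__ == "__main__":
    # Toy guard: flags any prefix containing a placeholder marker token.
    toy_guard = lambda prompt, prefix: "FLAG" in prefix
    tokens = ["ok"] * 9 + ["FLAG"] + ["ok"] * 10
    print(first_fire_prefix_guard("example prompt", tokens, toy_guard))
```

In the toy run the marker is the tenth token, but a stride of 8 means the guard can only fire at the next check point, which illustrates why call frequency affects how many tokens are exposed before a trigger.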
## Appendix EDataset and Preprocessing Details
DiaSafety and Harmful Advice are treated as balanced implicit\-harm response\-classification benchmarks\. DiaSafety contains 1094 test rows in our run; Harmful Advice contains 550 rows\. The headline metrics for these datasets are AUROC and AUPRC over completed response trajectories\.
HarmBench DirectRequest and SocialHarmBench are treated as prompt\-only harmful request suites\. HarmBench DirectRequest contains 320 prompts\. SocialHarmBench contains 585 raw prompts in the loaded split; one prompt is empty after field extraction in our run, yielding 584 non\-empty prompts for the full prompt\-trigger table\. These datasets are not used as balanced response\-classification benchmarks in the paper\.
The prompt\-trigger generation setup uses greedy decoding, a maximum answer length of 128 tokens for the main trigger table, and a system prompt instructing the generator to provide direct answers without refusal or safety disclaimers\. The latency appendix uses the same prompt families but caps generation at 64 tokens to isolate fixed\-generation monitoring overhead\.
## Appendix FUncertainty Interval Details
For the balanced implicit\-harm benchmarks in Table [1](https://arxiv.org/html/2605.23974#S5.T1), the bracketed intervals are nonparametric 95% bootstrap confidence intervals\. For each method, benchmark, and metric, we repeatedly resample evaluation rows with replacement from the fixed test set, recompute the metric on each bootstrap sample, and report the percentile interval\. The source of variability is therefore the finite evaluation sample rather than retraining or random\-seed variation: all generators, guards, and AERIC monitors are kept fixed while the test rows are resampled\.
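The percentile bootstrap described above can be sketched in a few lines of standard-library Python. The number of resamples, the seed, and the handling of single-class resamples (skipped here) are assumptions, since the text does not specify them; the pairwise AUROC is quadratic in the sample size and is written for clarity rather than speed.

```python
import random
from typing import Callable, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 metric: Callable[[Sequence[int], Sequence[float]], float] = auroc,
                 n_boot: int = 1000, level: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval over evaluation rows; model outputs stay fixed."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes in the evaluation set")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption: skip degenerate single-class resamples
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((1.0 - level) / 2.0 * (n_boot - 1))]
    hi = stats[int((1.0 + level) / 2.0 * (n_boot - 1))]
    return lo, hi
```

Only the test rows are resampled, which matches the statement that the interval reflects finite-sample variability rather than retraining or seed variation.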