Before the Last Token: Diagnosing Final-Token Safety Probe Failures

arXiv cs.LG Papers

Summary

This paper investigates failures of final-token safety probes on jailbreak prompts, finding that harmful content can be distributed across earlier tokens and missed by the final readout. It proposes a PCA-HMM trajectory model as a diagnostic tool that recovers many misses without the false positives of naive token pooling.

arXiv:2605.12726v1 Announce Type: new Abstract: Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes

# Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Source: [https://arxiv.org/html/2605.12726](https://arxiv.org/html/2605.12726)
###### Abstract

Final\-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe\-visible unsafe evidence distributed across earlier user\-token representations that is missed by this readout\. We study this prefill\-time failure mode using SafeSwitch\-style probes trained only on clean harmful and benign prompts across three instruction\-tuned LLMs\. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety\-adjacent benign prompts\. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe’s representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch\. Token\-level prefill analyses reveal that probe\-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final\-token readout, while naive max\-pooling over token positions overfires on safe prompts\. A simple PCA\-HMM trajectory model, trained only on the same clean split, recovers many final\-token misses from user\-content prefill trajectories without the catastrophic false\-positive behavior of naive token pooling, motivating trajectory\-aware hidden\-state analyses as diagnostic complements to final\-token probes\.

Keywords: AI safety, jailbreak diagnostics, safety probes, hidden-state trajectories, mechanistic interpretability, language model monitoring

## 1 Introduction

Probe-based safety systems such as SafeSwitch (Han et al., [2025](https://arxiv.org/html/2605.12726#bib.bib3)) offer a lightweight way to monitor language models before they generate unsafe outputs. These systems train classifiers on internal activations and use the resulting signal to trigger downstream safety interventions. In the prefill setting, a common design is to make this decision from the hidden state at the final prompt token. This is efficient, but it makes safety monitoring depend on one learned readout of one contextual representation.

We study a failure mode of this final-token readout. Across three instruction-tuned models, we train SafeSwitch-style probes on clean harmful and benign prompts from the original SafeSwitch train split. We construct jailbreak evaluations by applying jailbreak templates from HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.12726#bib.bib7)) to held-out SorryBench harmful requests. These probes achieve high recall on clean harmful prompts, yet miss many jailbreaks. Jailbreak prompts are not content-free distribution shifts: they contain an otherwise harmful request embedded inside adversarial framing, role play, obfuscation, or instruction-following context. Thus, a final-token miss does not imply that harmful content is absent from the prompt; it means the trained probe does not detect the wrapped request from the final prompt-token representation.

Our results suggest that jailbreak wrappers shift the final\-token representation away from the clean harmfulness contrast captured by the probe\. This is not simply a capacity problem: increasing the probe bottleneck width does not reliably improve jailbreak detection\. Geometrically, contrast directions for missed jailbreaks have little energy in the probe\-visible subspace, while the visible caught\-versus\-missed jailbreak contrast aligns strongly with the clean harmful\-versus\-benign training direction\. This is consistent with a shortcut\-like readout: jailbreaks are caught when their final\-token representation still projects onto the clean harmful contrast, and missed when the wrapper\-induced shift moves them away from that readout\.

We then inspect token\-by\-token hidden\-state trajectories during prompt prefill\. Missed jailbreaks often contain non\-final positions that receive high probe scores, including positions near the embedded harmful request, even though the final\-token score is low\. Naive pooling over token positions is not a solution because safe prompts can also produce high intermediate scores\. This motivates trajectory\-aware diagnostics: simple PCA\-HMM models fit only on the same original SafeSwitch train split recover most final\-token\-probe jailbreak misses across models\. We use these models diagnostically, not as deployed jailbreak detectors, to show that final\-token probe failures reflect where and how wrapper\-induced representations expose safety\-relevant information in hidden\-state trajectories\.

## 2 Related Work

#### Representation geometry of refusal and harmfulness.

Prior work has studied internal directions and subspaces associated with refusal, harmfulness, and alignment-related behavior. Arditi et al. ([2024](https://arxiv.org/html/2605.12726#bib.bib1)) identify refusal-related directions that can affect model outputs, while subsequent work argues that refusal and safety-related representations need not reduce to a single axis and can involve multiple directions or separable concepts (Pan et al., [2025](https://arxiv.org/html/2605.12726#bib.bib10); Wollschläger et al., [2025](https://arxiv.org/html/2605.12726#bib.bib13); Zhao et al., [2025](https://arxiv.org/html/2605.12726#bib.bib15)). Closest to our geometry analysis, Shah et al. ([2025](https://arxiv.org/html/2605.12726#bib.bib12)) train harmfulness subcategory probes and study the low-rank structure of their probe weights, including steering experiments. Our analysis is narrower: we use difference-of-means contrasts and the row-space of one trained binary final-token probe to characterize where that probe is sensitive or insensitive. We do not interpret these directions as universal harmfulness features or as mechanistic control variables.

#### Hidden-state trajectories for safety diagnostics.

Recent work also studies safety- or alignment-relevant information in hidden-state dynamics, including layer-wise evolution, decoding-time trajectories, and latent trajectory classifiers (Zhou et al., [2024](https://arxiv.org/html/2605.12726#bib.bib16); Liu et al., [2026](https://arxiv.org/html/2605.12726#bib.bib6); Lin et al., [2026](https://arxiv.org/html/2605.12726#bib.bib5); Damirchi et al., [2026](https://arxiv.org/html/2605.12726#bib.bib2)). Our setting is more restricted: within a single prefill pass, we ask whether jailbreak prompts missed by a final-token probe contain earlier probe-visible evidence, and whether a simple trajectory diagnostic trained on the same clean split recovers those misses. We treat the PCA-HMM as a diagnostic complement to the final-token readout, not as a standalone safety monitor.

## 3 Experimental Setup

#### Models and probes.

We evaluate three instruction-tuned models: Llama-3.1-8B-Instruct (Meta, [2024](https://arxiv.org/html/2605.12726#bib.bib8)), Mistral-7B-Instruct-v0.1 (Jiang et al., [2023](https://arxiv.org/html/2605.12726#bib.bib4)), and OLMo3-7B-Instruct (Olmo et al., [2025](https://arxiv.org/html/2605.12726#bib.bib9)). For each model, we train SafeSwitch-style probes on final prompt-token hidden states from the original SafeSwitch train split, using clean harmful prompts and benign prompts. Unless otherwise stated, probes are evaluated at threshold 0.5.

#### Evaluation sets.

We use held-out SorryBench harmful prompts (Xie et al., [2024](https://arxiv.org/html/2605.12726#bib.bib14)) as clean harmful evaluations. To construct semantic jailbreaks, we wrap held-out SorryBench harmful requests with jailbreak templates from HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2605.12726#bib.bib7)). We additionally report direct harmful-request evaluations on AdvBench (Zou et al., [2023](https://arxiv.org/html/2605.12726#bib.bib17)) and false-positive rates on benign XSTest prompts (Röttger et al., [2024](https://arxiv.org/html/2605.12726#bib.bib11)). We report detection rate on harmful and jailbreak sources and false-positive rate on XSTest.

## 4 Final-Token Probes

### 4.1 Probe evaluation

We evaluate the SafeSwitch-style monitoring interface directly: a probe reads the final prompt-token hidden state, projects it through a width-$w$ bottleneck, and predicts unsafe versus safe. We use the released SafeSwitch Llama prober and train analogous Mistral and OLMo3 probers with the same interface, using $w = 64$ throughout. We then apply each frozen probe to clean held-out SorryBench prompts, the same SorryBench prompts wrapped with HarmBench jailbreak templates, and benign safety-adjacent XSTest prompts. This tests whether a final-token readout trained on clean harmfulness still exposes the unsafe request after jailbreak wrapping.
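The final-token readout described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released SafeSwitch prober: the ReLU nonlinearity, the sigmoid output, the toy dimensions, and all names (`probe_score`, `hidden_states`) are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np

def probe_score(h, W, b1, v, b2):
    """Score one hidden state h (shape [D]) with a width-w bottleneck probe.

    W is the [w, D] first-layer matrix and v the [w] output weights.
    The ReLU and sigmoid are assumptions of this sketch.
    """
    z = np.maximum(W @ h + b1, 0.0)        # project into the width-w bottleneck
    logit = float(v @ z + b2)              # scalar unsafe-vs-safe logit
    return 1.0 / (1.0 + np.exp(-logit))    # probability assigned to "unsafe"

rng = np.random.default_rng(0)
D, w = 32, 8                               # toy sizes; the paper uses w = 64
W = rng.normal(size=(w, D)) / np.sqrt(D)   # random stand-in for trained weights
b1 = np.zeros(w)
v = rng.normal(size=w)
b2 = 0.0

# Stand-in for a [T, D] prefill trajectory of prompt-token hidden states.
hidden_states = rng.normal(size=(5, D))

# The monitored decision uses only the last prompt position.
final_token_score = probe_score(hidden_states[-1], W, b1, v, b2)
flagged = final_token_score >= 0.5         # threshold 0.5, as in the setup
```

The point of the sketch is structural: every earlier row of `hidden_states` is ignored by the decision, which is the property the rest of the paper examines.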

### 4.2 Probes miss wrapped requests

Table [1](https://arxiv.org/html/2605.12726#S4.T1) reports final-token probe results across the three evaluation conditions. The probes retain high recall on clean held-out SorryBench harmful prompts, but miss a substantial fraction of the same harmful requests once they are wrapped in HarmBench jailbreak templates. The same probes also produce non-trivial false positives on XSTest, a benign set designed to be safety-adjacent. The failure is therefore not a simple thresholding issue: lowering the threshold to recover wrapped harmful requests would further increase false positives on benign safety-adjacent instructions.

Table 1: Final-token probe detection rates on harmful sources and false-positive rate on benign XSTest. Jailbreak prompts use the same held-out SorryBench harmful requests as the clean condition, wrapped with HarmBench templates.

This failure has direct downstream consequences. SafeSwitch-style monitoring is a cascade in which the final-token prober gates a stage-2 refusal head: if the prober does not fire on the prefill, the refusal head is never invoked and the wrapped harmful request flows through to generation.

Direct harmful benchmarks rule out a generic distribution\-shift explanation\. The same final\-token probes detect AdvBench at 99\.2–99\.8% and HarmBench at 96\.2–100\.0% across the three models, so the jailbreak drop is not a generic inability to recognize harmful requests outside the training file; the controlled contrast is clean SorryBench versus the same prompts under adversarial wrapping\.
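The thresholding argument made earlier in this subsection can be illustrated with a small sketch. The scores below are hypothetical values chosen for illustration, not measurements from the paper, and `rate_above` is a helper name introduced here.

```python
def rate_above(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical final-token probe scores, for illustration only.
jailbreak_scores = [0.91, 0.62, 0.35, 0.28, 0.74, 0.12]
xstest_scores = [0.05, 0.31, 0.58, 0.09, 0.40, 0.22]

for threshold in (0.5, 0.3, 0.1):
    detection = rate_above(jailbreak_scores, threshold)      # jailbreak detection rate
    false_positive = rate_above(xstest_scores, threshold)    # benign false-positive rate
    print(f"threshold={threshold:.1f}  detection={detection:.2f}  fpr={false_positive:.2f}")
```

When the two score distributions overlap as they do in this toy data, every threshold that recovers more jailbreaks also flags more benign prompts, which is why the failure is not treated as a calibration problem.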

### 4.3 Width is not the bottleneck

SafeSwitch\-style probers use a bottlenecked linear readout, so a natural explanation for the wrapped\-request failure is that the bottleneck row\-space is too narrow to expose the relevant signal\. We test this directly by sweeping the bottleneck width from 64 to 1024 with the same training procedure and split\.

Wider readouts do not reliably improve jailbreak detection. Going from width 64 to 1024, jailbreak detection moves from 72.2% to 78.0% for Llama, from 86.9% to 87.6% for Mistral, and from 63.6% to 55.7% for OLMo3. For Llama and Mistral the change is small; for OLMo3 the wider probe is worse. The full sweep across widths $\{64, 128, 256, 512, 1024\}$ is reported in Appendix [A](https://arxiv.org/html/2605.12726#A1) and shows the same picture. The wrapped-request failure is therefore not just a capacity issue, which motivates looking at where the missed signal sits in representation space rather than how much the probe row-space can hold.

## 5 Geometry of Missed Jailbreaks

Width does not explain the wrapped-request failure, so we ask where the missed signal lives in representation space. Let $h(x) \in \mathbb{R}^{D}$ denote the model’s hidden state at the final prompt token of input $x$. For a finite prompt set $\mathcal{A}$, define

$$
\mu(\mathcal{A}) = \frac{1}{|\mathcal{A}|} \sum_{x \in \mathcal{A}} h(x).
$$

For any nonzero vector $v$, define $u(v) = v / \lVert v \rVert_2$. Let $\mathcal{H}_{\mathrm{train}}$ and $\mathcal{B}_{\mathrm{train}}$ be the SafeSwitch-train harmful and benign prompt sets, $\mathcal{S}_{\mathrm{sorry}}$ be held-out SorryBench harmful prompts, $\mathcal{X}_{\mathrm{xstest}}$ be benign XSTest prompts, and $\mathcal{J}_{\mathrm{caught}}, \mathcal{J}_{\mathrm{miss}}$ be the jailbreak prompts caught and missed by the final-token probe. We compute four unit-norm difference-of-means directions:

$$
\begin{aligned}
d_{\mathrm{harm}} &= u\!\left(\mu(\mathcal{H}_{\mathrm{train}}) - \mu(\mathcal{B}_{\mathrm{train}})\right), \\
d_{\mathrm{safe}} &= u\!\left(\mu(\mathcal{S}_{\mathrm{sorry}}) - \mu(\mathcal{X}_{\mathrm{xstest}})\right), \\
d_{\mathrm{miss}} &= u\!\left(\mu(\mathcal{J}_{\mathrm{miss}}) - \mu(\mathcal{B}_{\mathrm{train}})\right), \\
\Delta_{\mathrm{cm}} &= u\!\left(\mu(\mathcal{J}_{\mathrm{caught}}) - \mu(\mathcal{J}_{\mathrm{miss}})\right).
\end{aligned}
$$

$d_{\mathrm{harm}}$ is the clean training-set harmful-vs-benign contrast the prober is trained to find. Motivated by refusal-direction work (Arditi et al., [2024](https://arxiv.org/html/2605.12726#bib.bib1)), $d_{\mathrm{safe}}$ is a safety-contrast proxy, not the Arditi refusal direction itself: because SorryBench is genuinely harmful and XSTest is safety-adjacent benign, their mean difference contrasts harmful prompts with safety-adjacent benign prompts while partially controlling for refusal-likely surface cues. We keep $d_{\mathrm{harm}}$ and $d_{\mathrm{safe}}$ as distinct directions, consistent with recent evidence that LLMs encode harmfulness and refusal as separate internal concepts (Zhao et al., [2025](https://arxiv.org/html/2605.12726#bib.bib15)). $d_{\mathrm{miss}}$ measures whether missed jailbreaks differ from clean benign at the final token. $\Delta_{\mathrm{cm}}$ characterizes what separates caught from missed jailbreaks within the jailbreak set.
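The four directions defined above reduce to mean subtraction followed by normalization. The sketch below shows that computation on random stand-in arrays; the array names, set sizes, and mean offsets are hypothetical placeholders for real final-token hidden states, so the resulting directions carry no empirical meaning.

```python
import numpy as np

def unit(v):
    """u(v) = v / ||v||_2 for a nonzero vector v."""
    return v / np.linalg.norm(v)

def mean_state(H):
    """mu(A): mean final-token hidden state over a prompt set; H has shape [n, D]."""
    return H.mean(axis=0)

rng = np.random.default_rng(0)
D = 16  # toy hidden size

# Random stand-ins for final-token hidden states of each prompt set.
H_train = rng.normal(size=(40, D)) + 1.0    # SafeSwitch-train harmful
B_train = rng.normal(size=(40, D))          # SafeSwitch-train benign
S_sorry = rng.normal(size=(30, D)) + 0.8    # held-out SorryBench harmful
X_xstest = rng.normal(size=(30, D)) + 0.2   # benign XSTest
J_caught = rng.normal(size=(20, D)) + 0.9   # jailbreaks caught by the probe
J_miss = rng.normal(size=(20, D)) - 0.3     # jailbreaks missed by the probe

d_harm = unit(mean_state(H_train) - mean_state(B_train))
d_safe = unit(mean_state(S_sorry) - mean_state(X_xstest))
d_miss = unit(mean_state(J_miss) - mean_state(B_train))
delta_cm = unit(mean_state(J_caught) - mean_state(J_miss))
```

Because each direction is unit-norm, later comparisons between directions depend only on their orientation, not on how far apart the underlying set means are.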

Let $W \in \mathbb{R}^{w \times D}$ be the prober’s first-layer weight matrix. We compute its thin SVD $W = U \Sigma V_r^{\top}$, where $V_r \in \mathbb{R}^{D \times r}$ and $r = \mathrm{rank}(W) \leq w$, and use the right singular vectors as an orthonormal basis for $\mathrm{rowspace}(W)$. Thus $\Pi_W = V_r V_r^{\top}$ is the orthogonal projector onto the probe-visible subspace. For each direction $d \in \mathbb{R}^{D}$ we report its probe-visible *energy*

$$
\mathrm{E}_W(d) = \frac{\lVert \Pi_W d \rVert_2^2}{\lVert d \rVert_2^2} \in [0, 1],
$$

the fraction of $d$’s squared norm captured by the probe row-space. For alignments, we compare the ordinary cosine $\cos(a, b)$ with the probe-visible cosine

$$
\cos_W(a, b) = \frac{\langle \Pi_W a, \Pi_W b \rangle}{\lVert \Pi_W a \rVert_2 \lVert \Pi_W b \rVert_2},
$$

using $a = \Delta_{\mathrm{cm}}$ and $b \in \{d_{\mathrm{harm}}, d_{\mathrm{safe}}, d_{\mathrm{miss}}\}$.
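The energy and probe-visible cosine defined above can be computed directly from the SVD of $W$. The following is a minimal sketch on a random toy matrix; the function names, the rank tolerance `rel_tol`, and the test vectors are assumptions introduced for illustration. It uses the identity $\langle \Pi_W a, \Pi_W b \rangle = \langle V_r^{\top} a, V_r^{\top} b \rangle$, which holds because the columns of $V_r$ are orthonormal.

```python
import numpy as np

def rowspace_basis(W, rel_tol=1e-10):
    """Orthonormal basis V_r (shape [D, r]) of rowspace(W) from the thin SVD."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = int(np.sum(s > rel_tol * s.max()))   # numerical rank (tolerance is an assumption)
    return Vt[:r].T

def energy(d, V_r):
    """E_W(d): fraction of d's squared norm inside the probe row-space."""
    coords = V_r.T @ d                       # coordinates of Pi_W d in the basis V_r
    return float(coords @ coords / (d @ d))

def cos_w(a, b, V_r):
    """Probe-visible cosine between a and b after projection onto rowspace(W)."""
    pa, pb = V_r.T @ a, V_r.T @ b
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 12))                 # toy probe: w = 4, D = 12
V_r = rowspace_basis(W)

d_visible = W[0]                             # lies in the row-space by construction
d_random = rng.normal(size=12)
d_hidden = d_random - V_r @ (V_r.T @ d_random)   # component orthogonal to the row-space

print(energy(d_visible, V_r), energy(d_random, V_r), energy(d_hidden, V_r))
print(cos_w(d_visible, d_random, V_r))
```

A direction like `d_hidden` has essentially zero energy, so a probe with this first layer cannot respond to it regardless of how large it is in the full space; this is the sense in which low-energy directions are poorly exposed to the readout.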

Table 2: Probe-visible energy $\mathrm{E}_W$ of four contrast directions (%).

Table [2](https://arxiv.org/html/2605.12726#S5.T2) reports $\mathrm{E}_W$ for each direction. Across all three models, $d_{\mathrm{harm}}$ has the highest probe-visible energy and $d_{\mathrm{miss}}$ has the lowest ($3.5\%$, $2.4\%$, $3.2\%$): the mean direction separating missed jailbreaks from clean benign prompts is the most poorly represented of all four directions in the probe readout, not just at threshold. Equivalently, roughly 96–98% of $d_{\mathrm{miss}}$’s squared norm lies in the orthogonal complement of the probe row-space. Moreover, no single SVD basis vector of $W$ aligns with any of the four directions above $|\cos| = 0.29$, so even $d_{\mathrm{harm}}$ is spread thinly across basis vectors rather than captured by one dominant readout direction. This is consistent with harmfulness being a multi-dimensional internal concept (Zhao et al., [2025](https://arxiv.org/html/2605.12726#bib.bib15)).

Full alignment results are reported in Appendix [B](https://arxiv.org/html/2605.12726#A2), Table [5](https://arxiv.org/html/2605.12726#A2.T5). Probe projection sharpens alignment with the harmful proxy from $0.32$–$0.39$ to $0.68$–$0.78$, and with the safety proxy from $0.12$–$0.19$ to $0.41$–$0.51$. The full-space anti-alignment with $d_{\mathrm{miss}}$ is not visible to the probe: after projection, $\Delta_{\mathrm{cm}}$ has only weak alignment with $d_{\mathrm{miss}}$ ($0.21$, $0.13$, $0.06$).

Together these findings are consistent with a shortcut-like readout: within the jailbreak distribution, caught prompts are those whose final-token representations retain stronger projection onto the clean harmful-vs-benign direction, while missed prompts move away from this readout. Safety-related cues contribute partially (probe cosine $0.41$–$0.51$ with $d_{\mathrm{safe}}$), but the strongest projected alignment is with $d_{\mathrm{harm}}$. Of the four directions tested, the probe row-space is most orthogonal to $d_{\mathrm{miss}}$, and within probe row-space $\Delta_{\mathrm{cm}}$ has only weak alignment with $d_{\mathrm{miss}}$. The wrapper appears to reduce exposure of the relevant feature to the final-token readout, so the probe relies primarily on the clean training-distribution harm proxy.

## 6 Unsafe Evidence Appears Before the Last Token

The geometry above shows that the final-token probe row-space is most blind to the direction separating missed jailbreaks from clean benign prompts. This suggests a positional failure: the unsafe request may be visible earlier in the prompt but no longer exposed at the final token. Recent work studies safety-relevant hidden-state dynamics along other axes, including layer-wise safety evolution, decoding-time token trajectories, multi-probe latent trajectories, and token-layer displacement paths (Zhou et al., [2024](https://arxiv.org/html/2605.12726#bib.bib16); Liu et al., [2026](https://arxiv.org/html/2605.12726#bib.bib6); Lin et al., [2026](https://arxiv.org/html/2605.12726#bib.bib5); Damirchi et al., [2026](https://arxiv.org/html/2605.12726#bib.bib2)). Our diagnostic is narrower: within a single prefill pass, we sweep prompt-token positions with the same frozen prober and ask whether final-token misses contain earlier probe-visible unsafe evidence.

Across the three models, 87 Llama, 65 Mistral, and 127 OLMo3 final-token-missed jailbreaks had recoverable embedded harmful-request spans. All recoverable cases are detected at the harmful-request span but missed at the final token (Appendix [E](https://arxiv.org/html/2605.12726#A5)). Here recoverable means that token matching locates the original harmful request inside the wrapped prompt. Missed jailbreaks are therefore not invisible to the prober everywhere. The failure is positional: by the final token, the wrapped prompt no longer exposes the same probe-visible feature to the prober.
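The notion of a recoverable span can be made concrete with a simple contiguous-subsequence search over token ids. This is a sketch under the assumption that matching is exact and contiguous; the paper does not specify the matching procedure beyond "token matching", and the function name and token ids below are hypothetical.

```python
def locate_span(prompt_tokens, request_tokens):
    """Return (start, end) of the first exact contiguous match, or None.

    `end` is exclusive, so prompt_tokens[start:end] == request_tokens.
    """
    n, m = len(prompt_tokens), len(request_tokens)
    if m == 0 or m > n:
        return None
    for start in range(n - m + 1):
        if prompt_tokens[start:start + m] == request_tokens:
            return start, start + m
    return None

# Hypothetical token ids: a wrapper around an embedded three-token request.
wrapped_prompt = [101, 7, 8, 9, 42, 43, 44, 5, 6, 102]
harmful_request = [42, 43, 44]

span = locate_span(wrapped_prompt, harmful_request)   # positions of the embedded request
missing = locate_span(wrapped_prompt, [42, 44])       # no contiguous match in the prompt
```

Under exact matching, wrappers that paraphrase or re-tokenize the request would yield no span, which is consistent with the paper excluding unlocated prompts from the span-specific analysis rather than counting them as failures.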

This clarifies the Stage\-2 limitation\. The same Stage\-1 prober can fire on the embedded harmful request when evaluated at the request tokens, but SafeSwitch uses only the final\-token Stage\-1 decision to gate Stage 2\. Thus the downstream refusal head can only refine prompts that pass the final\-token gate; it cannot recover probe\-visible unsafe evidence that appears earlier and collapses before the readout position\.

This observation does not make naive token pooling a usable detector. Across all three models, max-pooling the prober score over prompt positions flags $100\%$ of jailbreak prompts, but it also flags $100\%$ of XSTest prompts. For Llama, XSTest has mean max score $0.998$ despite a final-token score of $0.229$. Thus intermediate high scores are not semantically reliable on their own. The useful object is not whether any token score is high, but how the score evolves across the prompt: missed jailbreaks exhibit a high harmful-request score followed by collapse at the final token. This motivates trajectory-level diagnostics rather than single-position readouts or max-pooling.
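The contrast between the two readouts discussed in this section can be sketched with per-token score sequences. The sequences below are hypothetical and chosen only to mirror the qualitative pattern described (a high score at the request span that collapses by the final token, and a transient spike on a benign prompt); they are not data from the paper.

```python
def final_token_flag(scores, threshold=0.5):
    """Decision from the last prompt position only."""
    return scores[-1] >= threshold

def max_pool_flag(scores, threshold=0.5):
    """Decision from the highest score at any prompt position."""
    return max(scores) >= threshold

# Hypothetical per-token prober scores.
missed_jailbreak = [0.10, 0.20, 0.97, 0.95, 0.40, 0.15]   # high at the request span, then collapse
benign_xstest = [0.05, 0.99, 0.30, 0.20, 0.23]            # transient spike, low final score

for name, scores in (("jailbreak", missed_jailbreak), ("xstest", benign_xstest)):
    print(name, "final:", final_token_flag(scores), "max-pool:", max_pool_flag(scores))
```

In this toy case the final-token readout misses both prompts and max-pooling flags both, so neither separates them; only the shape of the sequence differs, which is what a trajectory-level model can exploit.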

## 7 Token Trajectories Recover Probe Misses

The max\-pooling result shows that any\-position probe scores are too noisy, but the token\-position analysis suggests a more structured signal: missed jailbreaks can expose probe\-visible unsafe evidence earlier and then collapse at the final token\. We therefore use a simple generative sequence diagnostic that asks whether the user\-content trajectory is more likely under clean harmful trajectories or clean benign trajectories\.

For each model, we fit a 64-dimensional PCA on prompt-token hidden states from balanced clean harmful and benign SafeSwitch training prompts, then fit two 2-state diagonal-covariance Gaussian HMMs in PCA space, one per class. Sequence lengths are passed to the HMM fit. We score a trajectory $z_{1:T}$ by the length-normalized log-likelihood ratio

$$
s(z_{1:T}) = \frac{1}{T}\left[\log p_{\mathrm{harm}}(z_{1:T}) - \log p_{\mathrm{benign}}(z_{1:T})\right].
$$

The threshold is selected only on the clean training split to maximize harmful-vs-benign training accuracy. We do not tune thresholds on jailbreaks or XSTest.
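The score $s(z_{1:T})$ can be evaluated with the standard forward algorithm once the two HMMs are available. The sketch below assumes the trajectory is already PCA-projected and uses hand-set toy parameters in place of fitted ones (in practice a library such as hmmlearn could fit them); all parameter values, dimensions, and the zero threshold are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_gauss_diag(Z, means, variances):
    """Log-density of each z_t under each state's diagonal Gaussian; returns [T, K]."""
    diff = Z[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)

def hmm_loglik(Z, log_pi, log_A, means, variances):
    """Forward algorithm in log space; returns log p(z_{1:T})."""
    log_b = log_gauss_diag(Z, means, variances)
    alpha = log_pi + log_b[0]
    for t in range(1, len(Z)):
        # log_A[i, j] = log P(state j at step t | state i at step t-1)
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]
    return float(logsumexp(alpha, axis=0))

def trajectory_score(Z, harm_hmm, benign_hmm):
    """Length-normalized log-likelihood ratio s(z_{1:T})."""
    return (hmm_loglik(Z, *harm_hmm) - hmm_loglik(Z, *benign_hmm)) / len(Z)

# Hand-set toy parameters (hypothetical, not fitted): K = 2 states, 3 PCA dimensions.
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
variances = np.ones((2, 3))
harm_hmm = (log_pi, log_A, np.array([[1.0] * 3, [2.0] * 3]), variances)
benign_hmm = (log_pi, log_A, np.array([[-1.0] * 3, [-2.0] * 3]), variances)

rng = np.random.default_rng(0)
Z = 1.0 + 0.1 * rng.normal(size=(6, 3))    # toy trajectory near the "harm" state means
score = trajectory_score(Z, harm_hmm, benign_hmm)
flagged = score > 0.0                       # 0.0 is a placeholder threshold
```

Dividing by $T$ keeps long and short prompts on a comparable scale, which matters because jailbreak wrappers tend to lengthen the user-content window.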

![Refer to caption](https://arxiv.org/html/2605.12726v1/x1.png)

Figure 1: Jailbreak–XSTest operating-point shift from the final-token probe to the user-window PCA-HMM trajectory diagnostic. Across all three models, the trajectory diagnostic recovers many final-token misses while reducing XSTest false positives at this operating point.

Table 3: User-window PCA-HMM diagnostic results. Recovery is measured only among jailbreaks missed by the final-token probe.

Figure [1](https://arxiv.org/html/2605.12726#S7.F1) visualizes the operating-point shift, and Table [3](https://arxiv.org/html/2605.12726#S7.T3) gives the corresponding rates. The diagnostic recovers 236 of 250 Llama misses, 106 of 118 Mistral misses, and 310 of 328 OLMo3 misses without inheriting max-pooling’s $100\%$ XSTest false-positive rate.
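The recovery counts can be turned into rates with simple arithmetic. The sketch below uses only counts stated in the text plus the 900-prompt jailbreak set size from the Figure 2 caption in Appendix C; the variable names are introduced here. As a consistency check, the implied final-token detection rates match the width-64 figures quoted in Section 4.3.

```python
TOTAL_JAILBREAKS = 900  # jailbreak prompts per model, from the Figure 2 caption

final_token_misses = {"Llama": 250, "Mistral": 118, "OLMo3": 328}
recovered_by_hmm = {"Llama": 236, "Mistral": 106, "OLMo3": 310}

summary = {}
for model, missed in final_token_misses.items():
    recovered = recovered_by_hmm[model]
    summary[model] = {
        # share of all jailbreaks caught by the final-token probe
        "final_token_detection_pct": 100.0 * (TOTAL_JAILBREAKS - missed) / TOTAL_JAILBREAKS,
        # share of final-token misses that the trajectory diagnostic flags
        "recovery_pct": 100.0 * recovered / missed,
        # jailbreaks flagged by neither readout
        "missed_by_both": missed - recovered,
    }
    print(model, {key: round(value, 1) for key, value in summary[model].items()})
```

These derived numbers say nothing about false positives, so they should be read together with the XSTest rates in Table 3 rather than as a standalone measure of detector quality.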

This result is consistent with the geometry above. The missed-jailbreak direction $d_{\mathrm{miss}}$ has much higher squared-projection energy in the trajectory PCA subspace than in the final-token probe row-space: $3.5\% \to 11.5\%$ for Llama, $2.4\% \to 29.2\%$ for Mistral, and $3.2\% \to 33.7\%$ for OLMo3. Thus the PCA-HMM is best interpreted as a diagnostic bridge: information that is weakly exposed to the final-token probe can remain recoverable in the prompt trajectory. We do not treat the HMM states as mechanistic features or the diagnostic as a deployed detector.

## 8 Limitations and Future Work

This study is diagnostic rather than a deployed defense\. The PCA\-HMM threshold is tuned only on clean training data, but false positives remain nonzero and HMM states should not be interpreted as mechanistic safety states\. Future work should study calibrated fusion with final\-token probes, robustness to adaptive and length\-controlled jailbreaks, and causal interventions on trajectory features\.

## References

- Arditi et al. (2024) Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction, 2024. URL [https://arxiv.org/abs/2406.11717](https://arxiv.org/abs/2406.11717).
- Damirchi et al. (2026) Damirchi, H., la Jara, I. M. D., Abbasnejad, E., Shamsi, A., Zhang, Z., and Shi, J. Truth as a trajectory: What internal representations reveal about large language model reasoning, 2026. URL [https://arxiv.org/abs/2603.01326](https://arxiv.org/abs/2603.01326).
- Han et al. (2025) Han, P., Qian, C., Chen, X., Zhang, Y., Ji, H., and Zhang, D. Safeswitch: Steering unsafe llm behavior via internal activation signals, 2025. URL [https://arxiv.org/abs/2502.01042](https://arxiv.org/abs/2502.01042).
- Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023. URL [https://arxiv.org/abs/2310.06825](https://arxiv.org/abs/2310.06825).
- Lin et al. (2026) Lin, Z., Yang, J., Qiu, Y., Guo, H., Bao, Y., and Guan, Y. N-glare: An non-generative latent representation-efficient llm safety evaluator, 2026. URL [https://arxiv.org/abs/2511.14195](https://arxiv.org/abs/2511.14195).
- Liu et al. (2026) Liu, C., Liu, X., Li, X., Xin, B., and Ding, K. Trajguard: Streaming hidden-state trajectory detection for decoding-time jailbreak defense, 2026. URL [https://arxiv.org/abs/2604.07727](https://arxiv.org/abs/2604.07727).
- Mazeika et al. (2024) Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. URL [https://arxiv.org/abs/2402.04249](https://arxiv.org/abs/2402.04249).
- Meta (2024) Meta. meta-llama/llama-3.1-8b-instruct, 2024. URL [https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Accessed: 2025-02-21.
- Olmo et al. (2025) Olmo, T., Ettinger, A., Bertsch, A., Kuehl, B., Graham, D., Heineman, D., Groeneveld, D., Brahman, F., Timbers, F., Ivison, H., Morrison, J., Poznanski, J., Lo, K., Soldaini, L., Jordan, M., Chen, M., Noukhovitch, M., Lambert, N., Walsh, P., Dasigi, P., Berry, R., Malik, S., Shah, S., Geng, S., Arora, S., Gupta, S., Anderson, T., Xiao, T., Murray, T., Romero, T., Graf, V., Asai, A., Bhagia, A., Wettig, A., Liu, A., Rangapur, A., Anastasiades, C., Huang, C., Schwenk, D., Trivedi, H., Magnusson, I., Lochner, J., Liu, J., Miranda, L. J. V., Sap, M., Morgan, M., Schmitz, M., Guerquin, M., Wilson, M., Huff, R., Bras, R. L., Xin, R., Shao, R., Skjonsberg, S., Shen, S. Z., Li, S. S., Wilde, T., Pyatkin, V., Merrill, W., Chang, Y., Gu, Y., Zeng, Z., Sabharwal, A., Zettlemoyer, L., Koh, P. W., Farhadi, A., Smith, N. A., and Hajishirzi, H. Olmo 3, 2025. URL [https://arxiv.org/abs/2512.13961](https://arxiv.org/abs/2512.13961).
- Pan et al. (2025) Pan, W., Liu, Z., Chen, Q., Zhou, X., Yu, H., and Jia, X. The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions, 2025. URL [https://arxiv.org/abs/2502.09674](https://arxiv.org/abs/2502.09674).
- Röttger et al. (2024) Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. Xstest: A test suite for identifying exaggerated safety behaviours in large language models, 2024. URL [https://arxiv.org/abs/2308.01263](https://arxiv.org/abs/2308.01263).
- Shah et al. (2025) Shah, M., Angeline, S., Kumar, A. R., Chheda, N., Zhu, K., Sharma, V., O’Brien, S., and Cai, W. The geometry of harmfulness in llms through subconcept probing, 2025. URL [https://arxiv.org/abs/2507.21141](https://arxiv.org/abs/2507.21141).
- Wollschläger et al. (2025) Wollschläger, T., Elstner, J., Geisler, S., Cohen-Addad, V., Günnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence, 2025. URL [https://arxiv.org/abs/2502.17420](https://arxiv.org/abs/2502.17420).
- Xie et al. (2024) Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. Sorry-bench: Systematically evaluating large language model safety refusal, 2024. URL [https://arxiv.org/abs/2406.14598](https://arxiv.org/abs/2406.14598).
- Zhao et al. (2025) Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W. Llms encode harmfulness and refusal separately, 2025. URL [https://arxiv.org/abs/2507.11878](https://arxiv.org/abs/2507.11878).
- Zhou et al. (2024) Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y. How alignment and jailbreak work: Explain llm safety through intermediate hidden states, 2024. URL [https://arxiv.org/abs/2406.05644](https://arxiv.org/abs/2406.05644).
- Zou et al. (2023) Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models, 2023. URL [https://arxiv.org/abs/2307.15043](https://arxiv.org/abs/2307.15043).

## Appendix A Bottleneck Width Sweep

Table [4](https://arxiv.org/html/2605.12726#A1.T4) reports final-token probe jailbreak detection rate as a function of bottleneck width. Wider readouts do not reliably improve detection: Llama gains modestly, Mistral is roughly flat, and OLMo3 degrades.

Table 4: Final-token probe jailbreak detection rate (%) as a function of bottleneck width.

## Appendix B Geometry Alignment Results

Table 5: Cosine alignment of $\Delta_{\mathrm{cm}}$ with the harmful, safety-proxy, and missed-jailbreak directions, in the full hidden-state space (Full) and after projection onto the probe row-space (Probe).

## Appendix C Trajectory Complementarity

![Refer to caption](https://arxiv.org/html/2605.12726v1/x2.png)

Figure 2: Complementarity on jailbreak prompts. Stacked bars partition 900 jailbreak prompts by whether they are caught by the final-token probe, the PCA-HMM trajectory diagnostic, both, or neither. Percentages inside bars are normalized by the 900-prompt jailbreak set. PCA-HMM catches many prompts missed by the final-token probe: 236 for Llama, 106 for Mistral, and 310 for OLMo3.

## Appendix D PCA-HMM Length Correlations

Table [6](https://arxiv.org/html/2605.12726#A4.T6) reports the within-source Spearman rank correlation between user-content window length and PCA-HMM log-likelihood ratio. Correlations are weak to moderate within each source, indicating that the trajectory score is not purely a length proxy within a given evaluation condition.
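For readers who want to reproduce this kind of length check, Spearman correlation is the Pearson correlation of ranks. The sketch below is a dependency-free illustration with average ranks for ties; the function names and the example lengths and scores are hypothetical, not values from Table 6.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical window lengths and PCA-HMM scores within one evaluation source.
window_lengths = [12, 30, 18, 45, 27, 60]
hmm_scores = [0.4, 0.1, 0.9, 0.3, 0.8, 0.2]
rho = spearman(window_lengths, hmm_scores)
```

Computing the correlation within each source, as the appendix does, avoids conflating length effects with the systematic length difference between wrapped jailbreaks and short direct prompts.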

Table 6: Within-source Spearman correlation between window length and PCA-HMM score.

## Appendix E Token-Position Results

Table 7: Token-position probe scores for final-token-missed jailbreak prompts with located harmful-request spans. Goal is the maximum prober score over the embedded harmful request span.

All located spans are detected at the harmful-request span and missed at the final token. A span is “located” when the original harmful request token sequence can be matched inside the wrapped jailbreak prompt; prompts without a located span are excluded from this span-specific analysis rather than counted as failures.

As a direct-harmful control, Table [8](https://arxiv.org/html/2605.12726#A5.T8) reports the same span-versus-final-token comparison on sampled clean harmful prompts. Since each direct prompt is itself the harmful request, the request span is located by construction. These prompts remain high at the final token, unlike final-token-missed jailbreaks.

Table 8: Direct harmful prompt control. We sample 50 SorryBench and 50 AdvBench prompts per model. Goal is the maximum prober score over the direct harmful request span.
