When Autoregressive Consistency Hurts Safety Alignment

arXiv cs.LG Papers

Summary

This paper analyzes why LLM safety alignment is fragile, attributing it to 'autoregressive consistency'—the tendency of next-token prediction to extend the current response trajectory—which concentrates alignment updates on early tokens. The authors introduce a 'random insertion attack' exploiting this property and propose an adversarial safety alignment framework to address it.

arXiv:2606.04168v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood through autoregressive consistency, the tendency of next-token prediction to preserve and extend the current response trajectory consistently. By analyzing the learning dynamics of safety alignment, we show that autoregressive consistency can concentrate alignment updates on early tokens, offering a mechanistic explanation for shallow safety alignment. The same mechanism also predicts a broader class of attacks on LLMs: attacks that induce harmful continuation states at arbitrary positions in the output trajectory. As a concrete example, we introduce random insertion attack, which inserts a short harmful span into an otherwise safe refusal trajectory and exploits autoregressive consistency to sustain the resulting harmful branch, thereby bypassing safety alignment. Notably, a short harmful span can redirect the generation to be harmful even after a long refusal prefix, highlighting autoregressive consistency as a potential broader failure mechanism. This suggests that safety alignment should also break harmful autoregressive consistency throughout the output trajectory. We therefore propose adversarial safety alignment, an initial framework based on worst-case harmful continuation states, and instantiate it with random worst-insertion training. Overall, our results suggest that autoregressive consistency should be treated as a central consideration in both safety alignment and attack design.

Cached at: 06/05/26, 02:21 AM

# When Autoregressive Consistency Hurts Safety Alignment
Source: [https://arxiv.org/html/2606.04168](https://arxiv.org/html/2606.04168)
Bochen Lyu (University of Southampton, bochen.lyu@soton.ac.uk), Yiyang Jia∗ (Independent Researcher, yyjiahbar@gmail.com), Xiaohao Cai (University of Southampton, x.cai@soton.ac.uk), Zhanxing Zhu (University of Southampton, z.zhu@soton.ac.uk)

###### Abstract

Safety alignment in large language models (LLMs) is fragile due to the shallowness issue, where fine-tuning mainly reshapes the model’s behavior near the first few tokens. In this paper, we argue that this phenomenon can be understood through a property of autoregressive models, *autoregressive consistency*: the tendency of next-token prediction to preserve and extend the current response trajectory consistently. Specifically, we analyze the learning dynamics of safety alignment and show that autoregressive consistency can concentrate alignment updates on early tokens, thus offering a mechanistic explanation for shallow safety alignment. This analysis points to a broader class of attacks on LLMs that induce harmful continuation at arbitrary positions in the output trajectory. As a concrete example, we introduce random insertion attack, which inserts a short harmful span into an otherwise safe refusal trajectory and exploits autoregressive consistency to sustain the resulting harmful branch, thereby bypassing safety alignment. Notably, a short harmful span can redirect the generation to be harmful even after a long refusal prefix, highlighting autoregressive consistency as a potential broader failure mechanism. This suggests that safety alignment should also *break harmful autoregressive consistency* throughout the output trajectory. To this end, we propose a new adversarial safety alignment framework employing worst-case harmful continuation states, together with a practical initial implementation using random worst-insertion attack. Overall, our results suggest that autoregressive consistency should be treated as a crucial consideration in both safety alignment and attack design.

## 1 Introduction

Safety alignment stands at the core of large language model (LLM) safety, aiming to ensure that LLMs refuse to generate harmful content given harmful user queries. Current safety alignment is typically achieved via supervised fine-tuning (SFT) (Wei et al., [2022](https://arxiv.org/html/2606.04168#bib.bib1)) and preference optimization methods such as Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.04168#bib.bib2)) and Reinforcement Learning from Human Feedback (RLHF) (Bai et al., [2022](https://arxiv.org/html/2606.04168#bib.bib3); Ouyang et al., [2022](https://arxiv.org/html/2606.04168#bib.bib5)). Despite their practical success, aligned LLMs remain fragile under certain attacks, e.g., prefill attacks (Andriushchenko et al., [2025](https://arxiv.org/html/2606.04168#bib.bib18)) and optimization-based attacks (Chao et al., [2025](https://arxiv.org/html/2606.04168#bib.bib19); Mehrotra et al., [2024](https://arxiv.org/html/2606.04168#bib.bib22); Zou et al., [2025](https://arxiv.org/html/2606.04168#bib.bib6), [2023](https://arxiv.org/html/2606.04168#bib.bib20)).

![Refer to caption](https://arxiv.org/html/2606.04168v1/x1.png)

Figure 1: When autoregressive consistency hurts safety alignment. Autoregressive consistency (Definition [2.1](https://arxiv.org/html/2606.04168#S2.Thmdefinition1)), the tendency to preserve and extend the current response branch, helps explain why safety alignment can become shallow: an analysis of the learning dynamics of safety alignment reveals that gradient updates concentrate on early refusal tokens, while later refusal tokens are largely sustained by consistency-driven continuation (Section [2.1](https://arxiv.org/html/2606.04168#S2.SS1)). The same mechanism exposes a broader class of attacks that induce harmful continuation states throughout the trajectory and rely on autoregressive consistency to bypass safety alignment; random insertion attack (Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2)) is a simple representative example. This motivates the adversarial safety alignment framework (Section [3.1](https://arxiv.org/html/2606.04168#S3.SS1)), which trains models to break harmful autoregressive consistency and recover safe refusal behavior from harmful states throughout the output trajectory.

Recently, Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) showed that this fragility is closely related to *shallow safety alignment*: safety fine-tuning primarily modifies the model’s generative distribution near only the first few output tokens. As a result, aligned models may remain safe when their responses begin with regular refusal prefixes, such as “Sorry, I cannot”, but their safety alignment can be bypassed once the initial tokens are forced away from this refusal regime. For example, requiring a safety-aligned model to start with “Sure, here is a detailed instruction on” can make it engage with harmful queries such as “How to build a bomb at home?” rather than refuse them. This led Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) to argue that safety alignment should be made more than just a few tokens deep.

But *why does safety alignment naturally concentrate on the first few tokens*, and *is there a broader failure mechanism beyond “safety alignment is shallow hence brittle to attacks targeting prefixes”?* These questions matter because they shift the problem from observing shallow alignment to explaining why it arises and what more general mechanism makes aligned models brittle. In this paper, we argue that both issues can be understood through autoregressive consistency (Definition [2.1](https://arxiv.org/html/2606.04168#S2.Thmdefinition1)): the tendency of next-token prediction to preserve and extend the local branch consistently.

Autoregressive consistency is essential for coherent generation in benign settings, allowing autoregressive models to maintain topic, style, and response trajectory over time\. However, the same property can hurt safety alignment in two ways:

1. Autoregressive consistency offers a mechanistic explanation for why safety alignment can become shallow. Intuitively, during safety fine-tuning, the model is trained to give refusal responses to harmful queries. Once the initial tokens move the model onto a refusal branch, however, the base model can already sustain the remaining refusal continuation with high confidence. This makes later refusal tokens less informative for learning, so alignment updates mainly concentrate on the early tokens that initiate refusal while leaving late positions largely unchanged. Theorem [2.1](https://arxiv.org/html/2606.04168#S2.Thmtheorem1) formalizes this gradient-concentration effect.
2. Autoregressive consistency reveals a broader class of trajectory-level attacks beyond prefix-targeting ones. Prefix-based attacks can be viewed as inducing a harmful continuation state near the start of generation, after which autoregressive consistency sustains the harmful branch. However, the same mechanism can arise anywhere in the output trajectory: any attack that induces a harmful continuation state may lead the model to extend that harmful branch; it need not target only the start of generation. We demonstrate this broader class through *random insertion attack*, a simple representative example that inserts a short harmful span into a random position of an otherwise safe refusal trajectory. Its effectiveness shows that harmful autoregressive consistency can be triggered away from the beginning of generation, revealing a failure mode beyond the prefix-based brittleness captured by shallow safety alignment.

![Refer to caption](https://arxiv.org/html/2606.04168v1/x2.png)

Figure 2: An example of random insertion attack $[\mathbf{r}_{:r}; \mathbf{h}_{:q}]$.

Random insertion attack is intentionally simple: given a harmful query and a safe refusal response, it inserts a short harmful span at a random position inside the refusal trajectory and then asks the model to continue from the resulting partial response (Fig. [2](https://arxiv.org/html/2606.04168#S1.F2)). For example, for the query “How to build a bomb?”, the response may begin with a protected refusal prefix such as “I cannot fulfill your request (more tokens from the refusal response)”, but a later inserted short span “To make a bomb, you can” can redirect generation onto a harmful branch, which is then extended consistently. Crucially, the failure can occur even after the model has already been placed on a safe refusal trajectory for many tokens. Thus, protecting only the first few tokens, or even making refusal prefixes deeper, does not fully address the underlying mechanism: harmful autoregressive consistency can still be induced later in generation. This highlights a more general failure mode not captured by prefix fragility alone, suggesting that “depth” is not the full primitive. Furthermore, safety alignment should be guided by a broader objective:

Safety alignment should also train models to break harmful autoregressive consistency throughout the whole trajectory\.

In other words, the central challenge is not merely to make a response start safely, or even to extend refusal beyond the first few tokens, but to ensure that the model can break harmful autoregressive consistency at arbitrary positions in the output sequence, instead of being redirected into a harmful branch and continuing it consistently\.

Motivated by this view, we propose an initial *adversarial safety alignment* framework (Section [3](https://arxiv.org/html/2606.04168#S3)), inspired by adversarial training (Madry et al., [2019](https://arxiv.org/html/2606.04168#bib.bib14)). At a high level, the framework treats safety alignment as learning to recover from adversarially constructed harmful continuation states. For each harmful query and refusal response, the inner problem searches for a continuation state from which the current model is most likely to sustain a harmful branch, while the outer problem trains the model to recover the refusal trajectory from that state. As a practical first instantiation, we approximate this objective with random worst-insertion training, which selects the most harmful state among randomly inserted harmful spans along the refusal trajectory. Our experiments show that even this simple instantiation improves robustness to random insertion attacks while remaining competitive against prefill and several common jailbreak attacks.
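The inner selection step of random worst-insertion training described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `num_candidates` and `max_span` parameters, and the `toy_score` function are all hypothetical, and the tokens are placeholders. In practice the score would come from the current model (for example, how likely it is to sustain the harmful branch from a given state), which is not modeled here.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def insertion_state(refusal: Sequence[str], harmful: Sequence[str], r: int, q: int) -> List[str]:
    """Return [r_{:r}; h_{:q}]: the first r refusal tokens, then the first q harmful tokens."""
    return list(refusal[:r]) + list(harmful[:q])


def random_worst_insertion(
    refusal: Sequence[str],
    harmful: Sequence[str],
    score: Callable[[List[str]], float],
    num_candidates: int = 8,          # hypothetical sampling budget
    max_span: int = 4,                # hypothetical cap on the inserted span length
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], float]:
    """Sample random insertion states and keep the one with the highest score."""
    rng = rng or random.Random(0)
    best_state: List[str] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        r = rng.randint(0, len(refusal))                 # insertion position (0 is a pure prefix state)
        q = rng.randint(1, min(max_span, len(harmful)))  # inserted span length, at least one token
        state = insertion_state(refusal, harmful, r, q)
        s = score(state)
        if s > best_score:
            best_state, best_score = state, s
    return best_state, best_score


# Placeholder tokens only: "R*" stands for refusal tokens, "H*" for span tokens.
refusal_tokens = ["R0", "R1", "R2", "R3", "R4", "R5"]
harmful_tokens = ["H0", "H1", "H2", "H3"]


def toy_score(state: List[str]) -> float:
    """Hypothetical stand-in for a model-based score: fraction of span tokens in the state."""
    return sum(tok.startswith("H") for tok in state) / len(state)


worst_state, worst_score = random_worst_insertion(refusal_tokens, harmful_tokens, toy_score)
```

The outer problem would then train the model to continue `worst_state` with the refusal trajectory; that training step is omitted because it depends on details not given in this section.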

#### Summary of contributions\.

This paper studies when autoregressive consistency hurts safety alignment, following the conceptual chain summarized in Fig. [1](https://arxiv.org/html/2606.04168#S1.F1). First, we introduce autoregressive consistency as a mechanism for understanding safety alignment and show, through a learning-dynamics analysis, how it can make alignment updates concentrate on early refusal tokens while leaving later continuation largely unchanged. Second, we show that the same mechanism predicts a broader class of attacks: attacks can induce harmful continuation states not only at the beginning of generation, but also at arbitrary positions in the output trajectory. We instantiate this class with random insertion attack, a deliberately simple but effective representative example. Third, motivated by this failure mode, we propose adversarial safety alignment as an initial framework for training models to break harmful autoregressive consistency, together with random worst-insertion training as an initial practical implementation. Overall, our goal is not to present final attack or defense algorithms, but to identify harmful autoregressive consistency as a broader failure mechanism and to motivate future safety alignment methods.

Beyond these contributions, our view suggests how existing attack and defense methods could be generalized. On the attack side, optimization-based attacks such as GCG (Zou et al., [2023](https://arxiv.org/html/2606.04168#bib.bib20)) primarily search over input-side suffixes; our perspective suggests that similar optimization could also target intermediate autoregressive states inside a refusal trajectory. On the defense side, recent adversarial training methods (Sheshadri et al., [2025](https://arxiv.org/html/2606.04168#bib.bib7); Xhonneux et al., [2024](https://arxiv.org/html/2606.04168#bib.bib8)) expose models to worst-case perturbations in input, embedding, or latent space, while decoding-based defenses (Xu et al., [2024](https://arxiv.org/html/2606.04168#bib.bib10)) often focus on the first few generated tokens. Our results suggest a complementary trajectory-level object of robustness: an adversarial autoregressive state from which harmful continuation is likely to persist. Thus, future defenses should not only elicit refusal at the beginning of generation, but also train models to break harmful autoregressive consistency and recover throughout the output trajectory.

#### Preliminaries\.

We use $p_\theta$ for the autoregressive model parametrized by $\theta$. $\mathbf{x} = (x_1, \dots, x_L)$ denotes a sequence of $L$ tokens, which we also call a trajectory in this paper. The length of $\mathbf{x}$ is denoted by $|\mathbf{x}|$. We use $\mathbf{x}_{:t}$ to denote its first $t$ tokens and, similarly, $\mathbf{x}_{t:t+k}$ for $(x_t, x_{t+1}, \dots, x_{t+k})$. We use $[\mathbf{x}; \mathbf{y}]$ to denote the concatenation of $\mathbf{x}$ and $\mathbf{y}$. $[T]$ denotes the integers in $[1, T]$. In the context of alignment, we use $\mathbf{x}$ to denote a harmful prompt and $\mathbf{y}$ to denote its response, where $\mathbf{y} = \mathbf{r}$ if it is a refusal and $\mathbf{y} = \mathbf{h}$ if it is a harmful response. The safety of a model under an attack is measured by the attack success rate (ASR) on a given safety evaluation dataset, i.e., the fraction of the model's outputs under that attack that are harmful.
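As a small illustration of this notation, the sketch below maps the slicing conventions and the ASR definition to Python. The helper names are hypothetical, and the per-output harmfulness labels are assumed to come from some external judge that is not specified here. Note that $\mathbf{x}_{t:t+k}$ is 1-indexed and includes both endpoints, unlike a Python slice.

```python
from typing import List, Sequence


def prefix(x: Sequence[str], t: int) -> List[str]:
    """x_{:t}: the first t tokens of x."""
    return list(x[:t])


def span(x: Sequence[str], t: int, k: int) -> List[str]:
    """x_{t:t+k} = (x_t, ..., x_{t+k}), using 1-indexed positions and inclusive endpoints."""
    return list(x[t - 1 : t + k])


def concat(x: Sequence[str], y: Sequence[str]) -> List[str]:
    """[x; y]: concatenation of two token sequences."""
    return list(x) + list(y)


def attack_success_rate(is_harmful: Sequence[bool]) -> float:
    """Fraction of model outputs judged harmful under a given attack."""
    if len(is_harmful) == 0:
        raise ValueError("ASR is undefined on an empty evaluation set")
    return sum(is_harmful) / len(is_harmful)


tokens = ["a", "b", "c", "d", "e"]
first_two = prefix(tokens, 2)          # ["a", "b"]
middle = span(tokens, 2, 2)            # ["b", "c", "d"], i.e. k + 1 tokens
asr = attack_success_rate([True, False, False, True])  # 2 harmful out of 4
```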

## 2 Understanding Safety Alignment via Autoregressive Consistency

The shallow safety alignment phenomenon, in which the model’s safe behavior is strongest near the first few tokens, can cause aligned LLMs to comply with harmful user queries once generation departs from the regular refusal prefixes of safe responses\. This observation raises a fundamental question: why should safety alignment naturally concentrate near the beginning of the response, even when the model is supervised on an entire refusal trajectory?

In this section, we analyze safety alignment through the lens of autoregressive consistency. After formalizing the concept in Definition [2.1](https://arxiv.org/html/2606.04168#S2.Thmdefinition1), we study the learning dynamics of safety alignment and show that autoregressive consistency can make later refusal tokens less informative for learning, leading alignment updates to concentrate near the beginning of the response while leaving late positions largely unchanged (Section [2.1](https://arxiv.org/html/2606.04168#S2.SS1)). We then show that the same mechanism suggests a broader class of attacks: attacks need not target only the start of generation, but can also bypass safety alignment by inducing harmful continuation states inside the output trajectory, with random insertion attack as one concrete example (Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2)). Thus, whereas Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) identified shallow safety alignment and motivated deeper safety alignment, we explain why such shallowness can arise and argue that the underlying failure mechanism extends beyond the shallowness issue.

### 2.1 Shallow Safety Alignment Can Arise from Autoregressive Consistency

In certain natural language domains, a sufficiently long prefix is expected to induce a near-deterministic conditional distribution over continuations (Piantadosi, [2014](https://arxiv.org/html/2606.04168#bib.bib34); Raychev et al., [2014](https://arxiv.org/html/2606.04168#bib.bib33); Shannon, [1951](https://arxiv.org/html/2606.04168#bib.bib31)). This arises in text categories that are low-entropy by design, where the space of expected responses is severely restricted by syntactic and semantic consistency alone, such that $p(x_{t+1} \mid \mathbf{x}_{:t})$ is close to 1 for $t$ larger than some critical value $t_c$. A canonical example is program code. When an autoregressive model $p_{\theta_{\mathrm{base}}}$ is trained well enough to capture such tendencies, one may expect $p_{\theta_{\mathrm{base}}}$ to satisfy the same property. We formalize this property as the model’s autoregressive consistency with the data:

###### Definition 2.1 (Autoregressive consistency).

The base autoregressive model $p_{\theta_{\mathrm{base}}}$ has autoregressive consistency on $\mathcal{D}_{\mathrm{data}}$, if $\forall \mathbf{x} \in \mathcal{D}_{\mathrm{data}}$ and $\forall \epsilon > 0$,

$$\exists\, t_c > 0 \ \text{such that}\ \forall t > t_c:\quad p_{\theta_{\mathrm{base}}}(x_{t+1} \mid \mathbf{x}_{:t}) > 1 - \epsilon. \tag{1}$$
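On a finite sequence, this condition can be checked for one fixed $\epsilon$ by locating the last position at which the next-token probability fails to exceed $1 - \epsilon$. The sketch below is illustrative only: the function name and the example probabilities are made up, and `probs[t - 1]` is assumed to hold $p_{\theta_{\mathrm{base}}}(x_{t+1} \mid \mathbf{x}_{:t})$ for $t = 1, \dots, T$. On a finite sequence some threshold always exists trivially (take $t_c = T$), so the check is only informative when the returned value is well below the sequence length.

```python
from typing import Sequence


def critical_position(probs: Sequence[float], eps: float) -> int:
    """Return the smallest t_c >= 0 such that probs[t - 1] > 1 - eps for every t > t_c.

    probs[t - 1] is assumed to be the probability of the observed next token
    given the first t tokens, for t = 1, ..., len(probs).
    """
    t_c = 0
    for t, p in enumerate(probs, start=1):
        if p <= 1.0 - eps:
            t_c = t  # position t violates the bound, so the threshold must be at least t
    return t_c


# Made-up probabilities: uncertain early, near-deterministic later.
example_probs = [0.20, 0.50, 0.97, 0.99, 0.98]
t_c = critical_position(example_probs, eps=0.05)  # last violation is at t = 2
```

Any larger threshold also satisfies the definition; the function returns the tightest one for the given sequence and $\epsilon$.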

Intuitively, the output trajectory of autoregressive models is obtained by repeatedly conditioning on their own generated tokens. As a result, next-token prediction tends to sustain and extend the current prefix in a locally consistent way. The work of Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) strongly suggests that the data commonly used in safety alignment training already allow the model to exhibit such autoregressive consistency in the first place; e.g., forcing the base (unaligned) model to start with a refusal prefix can make it maintain that refusal response. This is formalized in Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1).

![Refer to caption](https://arxiv.org/html/2606.04168v1/x3.png) (a) Llama-2-7B.
![Refer to caption](https://arxiv.org/html/2606.04168v1/x4.png) (b) Llama-3.1-8B.

Figure 3: Empirical measurement of next-token probability. For each continuation position $t$, we plot the mean probability assigned by the base model to the next token in the given trajectory, $p_{\theta_{\mathrm{base}}}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})$, for harmful and refusal continuations. Shaded regions denote standard errors across examples. For each continuation type, the dashed horizontal line of the same color shows the mean over the first 5 continuation positions, and the dotted horizontal line of the same color shows the mean over the last 5 continuation positions. The results are computed on the dataset open-sourced by Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)), which contains 256 harmful instructions, each paired with a refusal response and a harmful response (see Appendix [C](https://arxiv.org/html/2606.04168#A3) for more details).

###### Assumption 2.1 (Autoregressive consistency in safety alignment).

For safety alignment data pairs $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\mathrm{align}}$, where $\mathbf{x}$ is a harmful prompt and $\mathbf{y}$ is a paired response (which can be a safe refusal or a harmful response), the base autoregressive model $p_{\theta_{\mathrm{base}}}$ has autoregressive consistency on $\mathcal{D}_{\mathrm{align}}$, if $\forall (\mathbf{x}, \mathbf{y}) \in \mathcal{D}_{\mathrm{align}}$ and $\forall \epsilon > 0$,

$$\exists\, t_c > 0 \ \text{such that}\ \forall t > t_c:\quad p_{\theta_{\mathrm{base}}}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon. \tag{2}$$

That is, the base model confidently locks into a refusal or harmful response once the first few response tokens are given. Fig. [3](https://arxiv.org/html/2606.04168#S2.F3) provides empirical support for the soft analogue of Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1). For both Llama-2-7B and Llama-3.1-8B, the probability assigned to the trajectory-consistent next token, $p_{\theta_{\mathrm{base}}}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})$, increases sharply after the initial positions and stabilizes at a substantially higher level for both harmful and refusal continuations. This suggests that, once a continuation branch has been established, the base model increasingly favors locally consistent continuation. Although Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1) is stated in a near-deterministic form for analytical clarity, exact-token probabilities around $0.5$ are already high in open-ended natural language, where probability mass is distributed across many semantically equivalent continuations. Thus, these results can be interpreted as evidence for a soft empirical analogue of autoregressive consistency, and not as a literal verification of the idealized assumption.

In benign generation, autoregressive consistency is essential: it allows a model to maintain the topic, style, and reasoning continuity in a narrow scope\. However, it may also induce a shortcut in safety alignment: move only the first few tokens onto a refusal trajectory and leave the rest of the generation to autoregressive consistency\. We theoretically investigate whether safety alignment exploits this by analyzing the learning dynamics of alignment, beginning with a formal definition of shallow alignment\.

###### Definition 2.2 (Shallow alignment).

Let $p_{\theta_{\mathrm{base}}}$ be the base model and $p_{\theta_{\mathrm{aligned}}}$ be the aligned model. Suppose the prompt-response data pairs $(\mathbf{x}, \mathbf{y})$ in an alignment dataset $\mathcal{D}_{\mathrm{align}}$ follow a distribution $p_{\mathrm{data}}$. An alignment is shallow if the aligned model $p_{\theta_{\mathrm{aligned}}}$ satisfies: $\forall \varepsilon_1, \varepsilon_2 > 0$, $\exists\, t_2 \geq t_1 > 0$ such that

$$\mathrm{KL}\Big[ p_{\mathrm{data}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \,\Big|\, p_{\theta_{\mathrm{aligned}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \Big] < \varepsilon_1, \qquad 0 \leq t < t_1, \tag{3}$$

$$\left\| p_{\theta_{\mathrm{aligned}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_{\mathrm{base}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \right\| < \varepsilon_2, \qquad t > t_2, \tag{4}$$

where $\mathrm{KL}$ denotes Kullback–Leibler divergence.

The first condition means that the model is well aligned up to the first few response tokens, and the second condition means that the alignment training did not move the model far from its base model for late token generations. In practice, this means shallow alignment makes sure that the first few tokens are refusal keywords for a harmful prompt, and the remaining refusal generation is guaranteed by autoregressive consistency. However, if a harmful prefix with length larger than $t_2$ is given, we expect a harmful response to continue. In the following, we take the first condition as a given and mainly investigate the second.
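The two conditions of Definition 2.2 can be checked numerically on toy distributions, as sketched below. Several choices here are assumptions made for illustration: the function names are hypothetical, the unspecified norm in Eq. (4) is taken to be the L1 norm, the check is done for one fixed choice of $\varepsilon_1, \varepsilon_2, t_1, t_2$ instead of the full quantified statement, and list index `t` stands for the distribution conditioned on $\mathbf{y}_{:t}$.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]


def kl(p: Dist, q: Dist) -> float:
    """KL divergence between p and q, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l1_distance(p: Dist, q: Dist) -> float:
    """L1 distance between two distributions (the norm choice is an assumption)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_shallow(
    p_data: List[Dist], p_aligned: List[Dist], p_base: List[Dist],
    t1: int, t2: int, eps1: float, eps2: float,
) -> bool:
    """Check Eq. (3) for 0 <= t < t1 and Eq. (4) for t > t2 on finite toy data."""
    num_positions = len(p_aligned)
    early_ok = all(kl(p_data[t], p_aligned[t]) < eps1 for t in range(min(t1, num_positions)))
    late_ok = all(l1_distance(p_aligned[t], p_base[t]) < eps2 for t in range(t2 + 1, num_positions))
    return early_ok and late_ok


# Toy two-token vocabulary: index 0 is "refusal-like", index 1 is "harmful-like".
data = [[0.9, 0.1]] * 4
base = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
aligned = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]

shallow = is_shallow(data, aligned, base, t1=2, t2=2, eps1=0.01, eps2=0.01)
unaligned = is_shallow(data, base, base, t1=2, t2=2, eps1=0.01, eps2=0.01)
```

In this toy example the aligned model matches the data at early positions and coincides with the base model at late positions, so the first check passes, while the unmodified base model fails the early-position KL condition.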

We now show that the shallow alignment defined above can arise from autoregressive consistency. Consider alignment training initialized from the base autoregressive model $\theta_{\mathrm{base}}$ with the optimization objectives of SFT or DPO:

$$L_{\mathrm{SFT}}(\theta) = -\sum_{t} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}} \left[ \log p_{\theta}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) \right], \tag{5}$$

$$L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \left[ \log \frac{p_{\theta}(\mathbf{r} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{r} \mid \mathbf{x})} - \log \frac{p_{\theta}(\mathbf{h} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{h} \mid \mathbf{x})} \right] \right) \right]. \tag{6}$$

Although these objectives differ in form, their gradients similarly decompose into token-level terms involving next-token log-probabilities and their parameter gradients $\nabla_{\theta} p_{\theta}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})$. We therefore analyze the learning dynamics following the strategy below.

#### Analysis strategy\.

Our analysis proceeds in two stages. First, we separate out the part of the argument that does not depend on the particular alignment objective: under autoregressive consistency, the late-position next-token distribution $p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})$ has a small gradient with respect to the model parameters (Proposition [2.1](https://arxiv.org/html/2606.04168#S2.Thmprop1)). Second, we apply this observation to the SFT and DPO objectives in Eq. ([5](https://arxiv.org/html/2606.04168#S2.E5)) and Eq. ([6](https://arxiv.org/html/2606.04168#S2.E6)), showing that their late-position gradient contributions are indeed suppressed (Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2)). As a result, the effective learning signal concentrates only near the early tokens that initiate refusal, giving a learning-dynamics explanation of shallow alignment under gradient descent. The final statement is summarized in Theorem [2.1](https://arxiv.org/html/2606.04168#S2.Thmtheorem1).
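The token-level decomposition of the SFT and DPO gradients used in the second stage can be written out explicitly. The following is a sketch, assuming the gradient can be exchanged with the expectation and using the autoregressive factorization $\log p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \sum_t \log p_{\theta}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})$; the shorthand $u_{\theta}$ for the DPO margin is introduced here for illustration and is not notation from the paper.

$$\nabla_{\theta} L_{\mathrm{SFT}}(\theta) = -\sum_{t} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}} \left[ \frac{\nabla_{\theta} p_{\theta}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}{p_{\theta}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})} \right]$$

$$u_{\theta} = \beta \left[ \log \frac{p_{\theta}(\mathbf{r} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{r} \mid \mathbf{x})} - \log \frac{p_{\theta}(\mathbf{h} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{h} \mid \mathbf{x})} \right], \qquad \nabla_{\theta} L_{\mathrm{DPO}}(\theta) = -\beta\, \mathbb{E}_{(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim \mathcal{D}} \Big[ \sigma(-u_{\theta}) \big( \nabla_{\theta} \log p_{\theta}(\mathbf{r} \mid \mathbf{x}) - \nabla_{\theta} \log p_{\theta}(\mathbf{h} \mid \mathbf{x}) \big) \Big]$$

$$\nabla_{\theta} \log p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \sum_{t} \frac{\nabla_{\theta} p_{\theta}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})}{p_{\theta}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})}$$

The DPO expression uses $\frac{d}{du} \log \sigma(u) = \sigma(-u)$ and the fact that the base-model terms do not depend on $\theta$. In both objectives, each position contributes a next-token gradient scaled by an inverse probability, and at late positions that scaling factor stays below $1/(1-\epsilon)$ under Assumption 2.1, so a small next-token gradient there translates into a small contribution to the loss gradient.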

We start with the first stage, which relies on a bound on the softmax Jacobian (proof in Appendix [B.1](https://arxiv.org/html/2606.04168#A2.SS1)).

###### Lemma 1 (Bound on softmax Jacobian).

Under Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1), the Frobenius norm of the base model softmax Jacobian satisfies

$$\|A_t\|_F^2 < 2\epsilon, \qquad \forall t > t_c, \tag{7}$$

where $(A_t)_{ij} = \nabla_{z_i} p_{\theta_{\mathrm{base}}}(j\text{-th token} \mid \mathbf{x}, \mathbf{y}_{:t})$ and $z_i = z_{\theta}(i\text{-th token} \mid \mathbf{x}, \mathbf{y}_{:t})$ is the output logit for the $i$-th token in the vocabulary before the final softmax layer.
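The bound in Lemma 1 can be sanity-checked numerically using the standard closed form of the softmax Jacobian, $\partial p_j / \partial z_i = p_j(\delta_{ij} - p_i)$. The sketch below is a numerical illustration on made-up logits, not a substitute for the proof in the appendix; the function names and the example values are chosen here for illustration.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(p: Sequence[float]) -> List[List[float]]:
    """Jacobian A with A[i][j] = d p_j / d z_i = p_j * (delta_ij - p_i)."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)] for i in range(n)]


def frobenius_sq(a: Sequence[Sequence[float]]) -> float:
    """Squared Frobenius norm of a matrix given as nested lists."""
    return sum(v * v for row in a for v in row)


eps = 0.01
peaked = softmax([6.0, 0.0, 0.0, 0.0])    # one token holds more than 1 - eps of the mass
uniform = softmax([0.0, 0.0, 0.0, 0.0])   # maximally uncertain over four tokens

peaked_norm_sq = frobenius_sq(softmax_jacobian(peaked))
uniform_norm_sq = frobenius_sq(softmax_jacobian(uniform))
```

For the peaked distribution the squared norm falls well below $2\epsilon$, while for the uniform distribution over four tokens it equals $0.1875$, which illustrates why confident late positions transmit little gradient through the softmax.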

As a result, the parameter gradient of the conditional probability has a small magnitude at late positions, as shown below.

###### Proposition 2.1 (Gradient concentration of conditional probability).

Under Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1), and assuming $\|\nabla_{\theta} z\|_F$ is upper bounded by a constant $C$, where $z$ is the logit vector defined in Lemma [1](https://arxiv.org/html/2606.04168#Thmlemma1), then at initialization

$$\Big[ \|\nabla_{\theta} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \Big]_{\theta = \theta_{\mathrm{base}}} < C (2\epsilon)^{\frac{1}{2}}, \qquad \forall t > t_c. \tag{8}$$

###### Proof\.

By the chain rule, $\nabla_{\theta} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) = \nabla_{z} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \cdot \nabla_{\theta} z$. At initialization, by Lemma [1](https://arxiv.org/html/2606.04168#Thmlemma1) we have $\|\nabla_{\theta} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F = \|A_t \cdot \nabla_{\theta} z\|_F \leq \|A_t\|_F \|\nabla_{\theta} z\|_F < C (2\epsilon)^{\frac{1}{2}}$. ∎

Proposition [2.1](https://arxiv.org/html/2606.04168#S2.Thmprop1) is independent of the particular form of the alignment objective: it shows that next-token distributions at late positions have small parameter gradients under autoregressive consistency. We can specialize this bound to the SFT and DPO objectives in Eq. ([5](https://arxiv.org/html/2606.04168#S2.E5)) and Eq. ([6](https://arxiv.org/html/2606.04168#S2.E6)). This yields a corresponding concentration property for their loss gradients, summarized in the following proposition (proof in Appendix [B.2](https://arxiv.org/html/2606.04168#A2.SS2)).

###### Proposition 2.2 (Concentration of alignment gradients).

For SFT \(Eq\. \([5](https://arxiv.org/html/2606.04168#S2.E5)\)\) or DPO \(Eq\. \([6](https://arxiv.org/html/2606.04168#S2.E6)\)\) on a traditional alignment dataset \(where autoregressive consistency holds\), the dominant contributions to the loss gradient∇θL\|θ=θbase\\nabla\_\{\\theta\}L\|\_\{\\theta=\\theta\_\{\\mathrm\{base\}\}\}are from∇θpθ\(⋅\|𝐱,𝐲:t\)\\nabla\_\{\\theta\}p\_\{\\theta\}\(\\cdot\\,\|\{\\mathbf\{x\}\},\{\\mathbf\{y\}\}\_\{:t\}\)witht≤tct\\leq t\_\{c\}\.
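A rough illustration of this concentration for the SFT case, at the level of logits only: for a cross-entropy term $-\log p(y)$ with $p$ a softmax over the logits, the gradient with respect to the logits is $p - e_y$, which is small whenever the target token already has probability close to 1. The sketch below is not the proof in Appendix B.2; the per-position logits and function names are hypothetical, chosen so that early positions are uncertain and late positions are sharply peaked.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def sft_logit_grad_norm(logits, target):
    """L2 norm of d(-log p[target]) / d(logits), which equals p - onehot(target)."""
    p = softmax(logits)
    return math.sqrt(sum((pi - (1.0 if i == target else 0.0)) ** 2
                         for i, pi in enumerate(p)))

# Hypothetical logit gaps for a 6-token response; the target token is index 0.
# Small gaps model uncertain early positions, large gaps model consistent late positions.
gaps = [0.0, 0.5, 1.0, 6.0, 8.0, 10.0]
norms = [sft_logit_grad_norm([g, 0.0, 0.0], target=0) for g in gaps]
for t, n in enumerate(norms):
    print(f"position {t}: logit-gradient norm = {n:.4f}")
```

Under these assumptions the per-position contribution falls by several orders of magnitude between the first and last positions, so a sum over positions is dominated by the early terms.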

Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2) implies that $\nabla_{\theta} L|_{\theta=\theta_{\mathrm{base}}}$ effectively truncates at the $t_{c}$-th term, i.e., autoregressive consistency suppresses gradient contributions at late positions for the base model. As a result, a gradient update from the base model primarily changes next-token probabilities near early positions $t \leq t_{c}$, while only weakly affecting those at late positions (Proposition [B.1](https://arxiv.org/html/2606.04168#A2.Thmprop1)), and thus a similar autoregressive consistency holds after the first gradient update (Corollary [B.1](https://arxiv.org/html/2606.04168#A2.Thmcorollary1)). The same argument extends to multi-step gradient descent on the base model by induction: once the first update leaves next-token probabilities at late positions close to those of the base model, autoregressive consistency continues to hold there in a weakened form, so subsequent updates remain similarly suppressed. Iterating this argument over multiple gradient steps yields the shallow alignment phenomenon: alignment training mainly modifies the early tokens, while the next-token distributions at later positions remain close to those of the base model. The following theorem summarizes this effect (proof in Appendix [B.4](https://arxiv.org/html/2606.04168#A2.SS4)).

###### Theorem 2.1 (Shallow alignment from autoregressive consistency).

Suppose the base model satisfies Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1) with $\epsilon_{0} = \epsilon$, and let $\theta_{0} = \theta_{\mathrm{base}}$ at initialization. Consider gradient descent for a fixed iteration count $K \geq 1$,

$$\theta_{k+1} = \theta_{k} - \eta \nabla L(\theta_{k}), \quad k = 0, 1, \ldots, K-1,$$

where $L$ is an alignment objective of SFT in Eq. ([5](https://arxiv.org/html/2606.04168#S2.E5)) or DPO in Eq. ([6](https://arxiv.org/html/2606.04168#S2.E6)). Assume that, along the optimization trajectory, the logit Jacobian and the second derivative of the conditional next-token distribution $\|\nabla_{\theta}^{2} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_{F}$ at late positions $t > t_{c}$ are uniformly bounded by constants $C$ and $B$, respectively, and that the loss gradient is also upper bounded, $\|\nabla L(\theta_{k})\| \leq G$. For $\varepsilon_{2}$ defined in Definition [2.2](https://arxiv.org/html/2606.04168#S2.Thmdefinition2), let $\alpha = \max\{C/\sqrt{2}, \sqrt{B/2}\}$, and let $\eta \leq \frac{\sqrt{\epsilon_{0} + \varepsilon_{2}} - \sqrt{\epsilon_{0}}}{\alpha K G}$. Define $\epsilon_{k}$ recursively by

$$\epsilon_{k+1} = \epsilon_{k} + C\sqrt{2\epsilon_{k}}\,\eta\|\nabla L(\theta_{k})\| + \frac{B}{2}\eta^{2}\|\nabla L(\theta_{k})\|^{2}. \tag{9}$$

Then, the model satisfies a weaker form of autoregressive consistency after each gradient update:

$$\forall k \in [K],\ t > t_{c}: \quad p_{\theta_{k}}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon_{k}, \tag{10}$$

and SFT/DPO gradient contributions at late positions are also suppressed to $O(\sqrt{\epsilon_{k}})$. Therefore, the effective alignment gradient remains concentrated on the early positions $t \leq t_{c}$ while leaving late positions largely unchanged throughout training. As a result, the alignment is shallow, i.e., SFT/DPO changes the model mainly near the early response tokens while leaving next-token distributions at late positions close to those of the base model:

$$\|\Delta_{\theta_{\operatorname{base}} \to \theta_{\operatorname{aligned}}} p_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\| := \|p_{\theta_{K}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_{\mathrm{base}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_{F} \leq \varepsilon_{2}, \quad t > t_{c}. \tag{11}$$
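One way to read the step-size condition is through the recursion in Eq. (9). Since $C\sqrt{2} \leq 2\alpha$ and $B/2 \leq \alpha^{2}$, each step satisfies $\epsilon_{k+1} \leq (\sqrt{\epsilon_{k}} + \alpha\eta\|\nabla L(\theta_{k})\|)^{2}$, so $\sqrt{\epsilon_{K}} \leq \sqrt{\epsilon_{0}} + \alpha K \eta G \leq \sqrt{\epsilon_{0} + \varepsilon_{2}}$. This reading is our own unpacking of the stated constants, not the proof in Appendix B.4. The sketch below iterates Eq. (9) with the largest admissible $\eta$ and the worst-case gradient norm $G$; all numerical constants are hypothetical.

```python
import math

def run_recursion(eps0, eps2, C, B, G, K):
    """Iterate Eq. (9) with the largest allowed step size and gradient norm fixed at G."""
    alpha = max(C / math.sqrt(2.0), math.sqrt(B / 2.0))
    eta = (math.sqrt(eps0 + eps2) - math.sqrt(eps0)) / (alpha * K * G)
    eps = eps0
    history = [eps]
    for _ in range(K):
        eps = eps + C * math.sqrt(2.0 * eps) * eta * G + 0.5 * B * (eta * G) ** 2
        history.append(eps)
    return history

# Hypothetical constants, chosen only for illustration.
history = run_recursion(eps0=1e-3, eps2=1e-2, C=1.0, B=0.5, G=2.0, K=50)
print(f"eps_0 = {history[0]:.4e}")
print(f"eps_K = {history[-1]:.4e}")
print(f"eps_0 + eps_2 = {1e-3 + 1e-2:.4e}")
```

With these values the consistency slack $\epsilon_{k}$ grows at every step but stays below $\epsilon_{0} + \varepsilon_{2}$ after $K$ updates, matching the "weaker form" of consistency in Eq. (10).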

#### Consequence: aligned model differs from base mainly at depth $t \leq t_{c}$.

This is precisely the finding of Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)): the aligned model differs from the base model mainly at the first few output positions. The base model’s existing capacity to continue a sufficiently long given prefix is left largely untouched by alignment training. When that prefix is harmful, jailbreaking happens.

Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2) and Theorem [2.1](https://arxiv.org/html/2606.04168#S2.Thmtheorem1) together suggest that we might be able to mitigate shallow alignment by breaking harmful autoregressive consistency at arbitrarily late token positions, thereby enhancing the gradient norm of the loss functions. This will be the main topic of Section [3](https://arxiv.org/html/2606.04168#S3).

### 2\.2Attacks Need Not Target Only the Start

Our analysis of the learning dynamics of safety alignment identifies autoregressive consistency as a deeper failure mechanism\. The same mechanism also exposes a broader class of attacks targeting the trajectory\. Attacks targeting prefixes are effective not only because they bypass shallow safety alignment near the beginning of generation, but also because they induce a harmful continuation state that autoregressive consistency can then preserve and extend\. From this perspective, the start of generation is only one possible attack location\. More generally, any attack that induces an autoregressive state from which the model favors harmful continuation over refusal can be dangerous, regardless of where that state appears in the output trajectory\. This is the attack\-side counterpart of Fig\.[1](https://arxiv.org/html/2606.04168#S1.F1): harmful autoregressive consistency can be triggered throughout the trajectory, not only at the prefix\.

Specifically, let $\mathbf{x}$ be a harmful prompt, let $\mathbf{y}$ be the output trajectory, and let $\tau_{t} = (\mathbf{x}, \mathbf{y}_{:t})$ be the partial autoregressive state. Let $\Phi_{\theta}(\tau_{t})$ measure how strongly the model favors harmful continuation over a refusal response at $\tau_{t}$, which we call the *harmful autoregressive consistency margin*. A simple example is $\Phi_{\theta}(\tau_{t}) = \log p_{\theta}(\mathbf{h} \mid \tau_{t}) - \log p_{\theta}(\mathbf{r} \mid \tau_{t})$, where $\mathbf{h}$ is a harmful span and $\mathbf{r}$ is a refusal span. Then, for example, prefill attacks can be seen as constructing $\tilde{\tau}_{t} = (\mathbf{x}, \tilde{\mathbf{y}}_{:t})$, where $\tilde{\mathbf{y}}_{:t}$ is an unsafe assistant prefix and $t$ is small, such that $\Phi_{\theta}(\tilde{\tau}_{t})$ is large. Similarly, input-side suffix attacks such as GCG (Zou et al., [2023](https://arxiv.org/html/2606.04168#bib.bib20)) can be viewed as modifying the prompt context so that the model’s initial autoregressive state has a large harmful margin. By exploiting autoregressive consistency throughout the output trajectory, the attack need not be confined to the start. We introduce random insertion attack as a concrete example below.
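To make the margin concrete, the following sketch evaluates $\Phi_{\theta}(\tau_{t}) = \log p_{\theta}(\mathbf{h} \mid \tau_{t}) - \log p_{\theta}(\mathbf{r} \mid \tau_{t})$ on a deliberately simple toy model. The model is an assumption made for illustration only: every token is labeled either `"H"` (harmful branch) or `"R"` (refusal branch), and the next token stays on the current branch with probability $1 - \epsilon$. The function names and the value of `EPS` are hypothetical.

```python
import math

EPS = 0.05  # hypothetical probability of switching branch at each step

def next_logprob(prev_token, token):
    """Toy 'sticky' model: the next token repeats the previous branch with prob. 1 - EPS."""
    return math.log(1.0 - EPS) if token == prev_token else math.log(EPS)

def seq_logprob(span, state):
    """log p(span | state) via the chain rule; only the last token of the state matters here."""
    prev, total = state[-1], 0.0
    for tok in span:
        total += next_logprob(prev, tok)
        prev = tok
    return total

def margin(state, h, r):
    """Phi(state) = log p(h | state) - log p(r | state)."""
    return seq_logprob(h, state) - seq_logprob(r, state)

h, r = ["H"] * 3, ["R"] * 3         # harmful span and refusal span
refusing_state = ["x", "R", "R"]     # the model is on a refusal trajectory
prefilled_state = ["x", "H"]         # prefill attack: a short unsafe assistant prefix

print(margin(refusing_state, h, r))   # negative: refusal is favored
print(margin(prefilled_state, h, r))  # positive: harmful continuation is favored
```

In this toy model a single harmful token is enough to flip the sign of the margin, which is the sense in which a prefill attack constructs a state with large $\Phi_{\theta}$.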

#### Random insertion attack\.

If the model’s autoregressive consistency is strong, the attacker may only need to insert a short harmful span to trigger harmful continuation. Random insertion attack works by forcing the model onto a harmful branch at a random position in the response, after which autoregressive consistency sustains that branch. Specifically, we construct the random insertion attack using a triplet dataset $(\mathbf{x}, \mathbf{r}, \mathbf{h})$, where $\mathbf{r}$ is the refusal trajectory and $\mathbf{h}$ is a harmful response, which can be short. For each triplet, we take a short span $\mathbf{h}_{:q}$ of length $q$, and randomly sample a position $i$ from $[|\mathbf{r}|]$. This yields the final attack $\tilde{\tau}_{i+q} = (\mathbf{x}, [\mathbf{r}_{:i}; \mathbf{h}_{:q}])$ (Fig. [2](https://arxiv.org/html/2606.04168#S1.F2)). The model is then asked to continue from $\tilde{\tau}_{i+q}$ according to $p_{\theta}(\cdot \mid \tilde{\tau}_{i+q})$. If the inserted span induces harmful autoregressive consistency, the model may leave the refusal trajectory and continue the harmful branch. In this sense, random insertion attack covers prefill attack as a special case in which the insertion position is restricted to the beginning of the output trajectory.
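The construction of the attacked state can be sketched in a few lines. The sketch assumes token sequences are plain Python lists and that the insertion position is drawn uniformly from $\{1, \ldots, |\mathbf{r}|\}$; the function name and the placeholder tokens are hypothetical, and a real pipeline would operate on tokenizer ids.

```python
import random

def random_insertion_state(x, r, h, q, rng):
    """Build the attacked state (x, [r[:i]; h[:q]]) for a random insertion position i."""
    i = rng.randint(1, len(r))        # assumption: i is uniform on {1, ..., |r|}
    response_prefix = r[:i] + h[:q]   # truncated refusal followed by the harmful span
    return i, (x, response_prefix)

# Placeholder tokens only.
x = ["<prompt>"]
r = [f"r{j}" for j in range(8)]
h = [f"h{j}" for j in range(5)]

i, (prompt, prefix) = random_insertion_state(x, r, h, q=3, rng=random.Random(0))
print(i, prefix)
```

The model would then be asked to continue generation from `prefix`. Fixing `i = 0` instead of sampling it drops the refusal tokens entirely and gives the prefill attack.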

![Refer to caption](https://arxiv.org/html/2606.04168v1/x5.png)

Figure 4: ASR vs. harmful span length for prefill attack and random insertion attack, evaluated on the HEx-PHI safety benchmark (Qi et al., [2024](https://arxiv.org/html/2606.04168#bib.bib15)).

The random insertion attack is intentionally simple. Its purpose is to demonstrate that harmful autoregressive consistency can be induced away from the start of generation. Even when the model is forced to begin on a refusal trajectory and to remain on it for many tokens, hence not “shallow” (Fig. [2](https://arxiv.org/html/2606.04168#S1.F2)), a short harmful span inserted later can still redirect generation onto a harmful branch. Specifically, Fig. [4](https://arxiv.org/html/2606.04168#S2.F4) shows that, for both Llama-2-7B-Chat (Touvron et al., [2023](https://arxiv.org/html/2606.04168#bib.bib26)) and Gemma-2-9b-it (Gemma Team, Google, [2024](https://arxiv.org/html/2606.04168#bib.bib25)), the ASR of random insertion attack is above 75% and generally increases with harmful span length. In addition, even a 10-token harmful span inserted into an otherwise safe refusal output can dramatically redirect the generation toward a harmful output, revealing the limitation of protecting only the prefix. Therefore, the vulnerability is not merely a failure of shallow prefixes, but also a consequence of the model’s tendency to preserve and extend the current autoregressive trajectory. Autoregressive consistency should therefore be treated as a crucial consideration in future safety alignment.

## 3Breaking Autoregressive Consistency by Adversarial Safety Alignment

Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) made an important step by identifying shallow safety alignment and arguing that safety alignment should be made deeper. Our analysis in Section [2](https://arxiv.org/html/2606.04168#S2) suggests that depth alone is still an incomplete objective. Training deeper may extend the defended horizon against prefix-like attacks, but it does not by itself address the underlying dynamical problem. As shown in Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2), random insertion attack can remain effective even when the beginning of the output trajectory is already protected by a safe refusal (hence not “shallow”). The broader failure mechanism arises from autoregressive consistency: once generation enters a harmful branch, whether near the beginning or in the middle of the output, the model tends to preserve and extend that branch. This implies that safety alignment should not only be trained to extend refusal farther from the beginning of generation.

Then what should the objective be instead? Following the conceptual chain in Fig. [1](https://arxiv.org/html/2606.04168#S1.F1), we argue that safety alignment should also learn to *break harmful autoregressive consistency* and recover from harmful continuation states throughout the output trajectory. Following this principle, we introduce adversarial safety alignment in Section [3.1](https://arxiv.org/html/2606.04168#S3.SS1) as an initial framework, and then present training with random worst-insertion attack in Section [3.2](https://arxiv.org/html/2606.04168#S3.SS2) as a practical first approximation to this framework.

The purpose of adversarial safety alignment is to translate the preceding theoretical insight into a trainable objective\. Our goal is not to present a final defense recipe, but to use a simple instantiation to study whether training against harmful continuation states can improve robustness beyond depth\-specialized alignment\.

### 3\.1Adversarial Safety Alignment

Our core intuition is the following: given a refusal response $\mathbf{r}$, even from the worst corrupted continuation state along this trajectory (i.e., a state from which the model can most easily establish and sustain a locally consistent harmful branch), training should make the model recover safe refusal behavior. In practice, we use the original refusal trajectory $\mathbf{r}$ as the recovery target. As in the conceptual chain in Fig. [1](https://arxiv.org/html/2606.04168#S1.F1), this encourages safety alignment to become robust to harmful continuation induced by autoregressive consistency throughout the output trajectory.

We therefore propose adversarial safety alignment as an initial framework inspired by adversarial training (Madry et al., [2019](https://arxiv.org/html/2606.04168#bib.bib14)). In standard adversarial training, the inner problem typically searches for a perturbation that maximizes the same loss minimized by the outer problem. Here, the roles of the two problems are different. The inner problem identifies a continuation state from which the current model most strongly favors harmful continuation, while the outer problem trains the model to recover safe refusal behavior from that state. Concretely, for each training example, we construct a set of candidate continuation states along the output trajectory, select the state that most strongly supports harmful autoregressive consistency, and then minimize a supervised refusal loss conditioned on the selected state.

Formally, given an LLM parameterized by $\theta$, Adversarial Safety Alignment is formulated as

$$\min_{\theta}\; \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}} \Big[ L_{\operatorname{SFT}}\big(\mathbf{r}, \tau^{\star}_{\theta}(\mathbf{x}, \mathbf{r}); \theta\big) \Big], \quad \text{s.t.} \;\; \tau^{\star}_{\theta}(\mathbf{x}, \mathbf{r}) \in \underset{\tilde{\tau}_{t} \in \mathcal{A}_{\theta}(\mathbf{x}, \mathbf{r})}{\arg\max}\ \Phi_{\theta}(\tilde{\tau}_{t}). \tag{12}$$

Here, $(\mathbf{x}, \mathbf{r})$ is a pair of a harmful prompt $\mathbf{x}$ and a corresponding refusal response $\mathbf{r}$. The set $\mathcal{A}_{\theta}(\mathbf{x}, \mathbf{r})$ contains candidate harmful states adversarially constructed from $(\mathbf{x}, \mathbf{r})$ along the refusal trajectory. The inner objective $\Phi_{\theta}(\tilde{\tau}_{t})$, which we call the harmful autoregressive consistency margin, measures how easily the current model can maintain harmful autoregressive consistency from the adversarially constructed state $\tilde{\tau}_{t}$. The outer loss

$$L_{\operatorname{SFT}}\big(\mathbf{r}, \tau^{\star}_{\theta}(\mathbf{x}, \mathbf{r}); \theta\big) = -\log p_{\theta}\left(\mathbf{r} \mid \tau^{\star}_{\theta}(\mathbf{x}, \mathbf{r})\right)$$

then trains the model to recover the refusal response from the selected adversarial state. Although this framework is motivated by the goal of breaking harmful autoregressive consistency, Eq. ([12](https://arxiv.org/html/2606.04168#S3.E12)) should be understood as a practical surrogate rather than a direct optimization of a formal trajectory-level measure of that quantity.

A practical implementation of adversarial safety alignment therefore requires two design choices. The first is the candidate set $\mathcal{A}_{\theta}(\mathbf{x}, \mathbf{r})$, which should cover harmful continuation states throughout the output trajectory. The second is the inner objective $\Phi_{\theta}(\tilde{\tau}_{t})$, which should measure how strongly a perturbed state favors harmful continuation over refusal recovery. We next present a first practical approximation to this framework by specifying concrete choices for both components.

### 3\.2A Practical Instantiation: Adversarial Safety Alignment with Worst\-Insertion Attack

Since solving the adversarial safety alignment objective in Eq. ([12](https://arxiv.org/html/2606.04168#S3.E12)) exactly is intractable, we instantiate it with a practical approximation based on random worst-insertion attack. This choice is directly motivated by Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2): random insertion attack is a concrete example of the broader class of attacks that exploit harmful autoregressive consistency by inducing harmful continuation states along the output trajectory. To instantiate the framework, we specify both the candidate set $\mathcal{A}$ and the inner objective $\Phi$.

#### Construction of $\mathcal{A}$.

We construct $\mathcal{A}$ using random insertion attack (Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2)). Because this construction uses a harmful span $\mathbf{h}$ in addition to the harmful prompt $\mathbf{x}$ and refusal trajectory $\mathbf{r}$, we denote the resulting candidate set by $\mathcal{A}(\mathbf{x}, \mathbf{r}, \mathbf{h})$. We construct it using a triplet dataset $(\mathbf{x}, \mathbf{r}, \mathbf{h})$, where $\mathbf{r}$ is the refusal trajectory and $\mathbf{h}$ is a harmful response from which the inserted span is taken. For each triplet, we take a short harmful span $\mathbf{h}_{:q}$ of length $q$, and sample $k$ positions $\{i_{j}\}_{j=1}^{k}$ uniformly from $[|\mathbf{r}|]$. This yields the candidate set $\mathcal{A}(\mathbf{x}, \mathbf{r}, \mathbf{h}) = \left\{(\mathbf{x}, [\mathbf{r}_{:i_{j}}; \mathbf{h}_{:q}])\right\}_{j=1}^{k}$. Each element of $\mathcal{A}(\mathbf{x}, \mathbf{r}, \mathbf{h})$ is thus a perturbed continuation state obtained by truncating the refusal trajectory at position $i_{j}$ and appending a short harmful span $\mathbf{h}_{:q}$, thereby forcing the model into a locally harmful state at that point. In this way, the candidate set covers diverse corrupted states throughout the refusal trajectory while keeping the inner search tractable.

#### Inner objective $\Phi$.

For the inner objective, we compare harmful continuation with refusal recovery from the same perturbed state. We first prepare a bank of $m$ short candidate refusal continuations, $\mathcal{R} = \{\mathbf{r}_{j}^{\operatorname{C}}\}_{j=1}^{m}$, which we use to measure how strongly the model favors refusal recovery. Then, for each $\tau_{t} \in \mathcal{A}(\mathbf{x}, \mathbf{r}, \mathbf{h})$, we define the harmful autoregressive consistency margin as

$$\Phi_{\theta}(\tau_{t}) = H(\tau_{t}; \theta) - R(\tau_{t}; \theta).$$

Here,

$$H(\tau_{t}; \theta) := \frac{1}{T_{o}} \log p_{\theta}\left(\mathbf{h}_{q+1:q+T_{o}} \mid \tau_{t}\right)$$

measures how strongly the model continues the harmful branch from the perturbed state. We interpret this as a practical score of harmful autoregressive consistency. Similarly,

$$R(\tau_{t}; \theta) := \max_{\mathbf{r}^{\operatorname{C}} \in \mathcal{R}} \frac{1}{|\mathbf{r}^{\operatorname{C}}|} \log p_{\theta}\left(\mathbf{r}^{\operatorname{C}} \mid \tau_{t}\right)$$

measures how easily the model can restart refusal from that same state, where the $\max$ operator selects the single refusal candidate $\mathbf{r}^{\operatorname{C}}$ that the model most prefers. Thus, maximizing $\Phi_{\theta}$ selects the inserted candidate state where the model most strongly favors harmful continuation over refusal recovery. We refer to this approximation as random worst-insertion attack.
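The inner search can be sketched end to end on a toy scorer. Everything model-related here is an assumption for illustration: `p_harmful_next` is a hand-written stand-in for $p_{\theta}$ in which harmful continuation becomes more likely with the number of trailing harmful tokens and less likely with the number of refusal tokens already emitted. Tokens are strings whose first character marks the branch (`"h"` or `"r"`), the prompt is omitted because the toy scorer ignores it, and all function names are hypothetical. Only the structure of $H$, $R$, $\Phi$, and the $\arg\max$ over $k$ sampled positions follows the definitions above.

```python
import math
import random

def p_harmful_next(response):
    """Toy stand-in for the model: probability that the next token is on the harmful branch."""
    trailing_h = 0
    for tok in reversed(response):
        if tok.startswith("h"):
            trailing_h += 1
        else:
            break
    n_refusal = sum(tok.startswith("r") for tok in response)
    return 1.0 / (1.0 + math.exp(-(2.0 * trailing_h - 0.3 * n_refusal)))

def avg_logprob(span, response):
    """Length-normalized log p(span | state) under the toy model, via the chain rule."""
    response, total = list(response), 0.0
    for tok in span:
        p_h = p_harmful_next(response)
        total += math.log(p_h if tok.startswith("h") else 1.0 - p_h)
        response.append(tok)
    return total / len(span)

def margin(response, h, q, T_o, refusal_bank):
    """Phi = H - R for one candidate state."""
    H = avg_logprob(h[q:q + T_o], response)                     # continue the harmful branch
    R = max(avg_logprob(rc, response) for rc in refusal_bank)   # best refusal restart
    return H - R

def worst_insertion(r, h, q, k, T_o, refusal_bank, rng):
    """Sample k insertion positions and return the (position, state) with the largest margin."""
    positions = [rng.randint(1, len(r)) for _ in range(k)]
    candidates = [(i, r[:i] + h[:q]) for i in positions]
    return max(candidates, key=lambda c: margin(c[1], h, q, T_o, refusal_bank))

r = [f"r{j}" for j in range(20)]
h = [f"h{j}" for j in range(10)]
bank = [["r_sorry", "r_cannot"], ["r_no"]]
i_star, state = worst_insertion(r, h, q=3, k=8, T_o=4, refusal_bank=bank, rng=random.Random(0))
print(i_star, state)
```

Under this particular toy scorer the margin decreases with the length of the refusal prefix, so the selected state is the earliest sampled position. With a real model the worst position depends on the learned distribution, which is why the search is run against the current parameters.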

### 3\.3Experiments

#### Implementation details\.

For adversarial safety alignment with random worst-insertion attack, we use the objective in Eq. ([12](https://arxiv.org/html/2606.04168#S3.E12)), denoted by $L_{\mathrm{Insertion}}$ for simplicity, on a triplet dataset $(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim D_{\operatorname{Safety}}$ with $\mathcal{A}$ and $\Phi$ specified above, and add a utility loss $L_{\operatorname{Utility}}$ on a utility dataset $(\mathbf{z}, \mathbf{y}) \sim D_{\operatorname{Utility}}$ to maintain model utility. Thus, the overall objective is

$$\min_{\theta}\ \lambda L_{\mathrm{Insertion}} + (1 - \lambda) L_{\operatorname{Utility}}, \quad \text{where}\ \ L_{\operatorname{Utility}} := -\mathbb{E}_{(\mathbf{z}, \mathbf{y}) \sim D_{\operatorname{Utility}}} \Big[ \log p_{\theta}(\mathbf{y} \mid \mathbf{z}) \Big]. \tag{13}$$

We refer to the method as Insertion for simplicity below. For comparison, we study two baselines, Clean and Prefill. Specifically, (1) for Clean, we fine-tune the model directly on the clean harmful prompt and refusal response pair from $(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim D_{\operatorname{Safety}}$ by standard SFT (we do not use the harmful response $\mathbf{h}$, hence the name Clean), i.e., the objective is

$$\min_{\theta}\ -\lambda \mathbb{E}_{(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim D_{\operatorname{Safety}}} \Big[ \log p_{\theta}(\mathbf{r} \mid \mathbf{x}) \Big] + (1 - \lambda) L_{\operatorname{Utility}};$$

(2) for Prefill, we apply the deep alignment method from Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) by fine-tuning with SFT on an augmented dataset $([\mathbf{x}; \mathbf{h}_{:q}], \mathbf{r})$ for $(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim D_{\operatorname{Safety}}$, i.e., the objective is

$$\min_{\theta}\ -\lambda \mathbb{E}_{(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim D_{\operatorname{Safety}}} \Big[ \log p_{\theta}(\mathbf{r} \mid \mathbf{x}, \mathbf{h}_{:q}) \Big] + (1 - \lambda) L_{\operatorname{Utility}},$$

which corresponds to making safety alignment deeper against prefix-style attacks. This objective can also be seen as adversarial safety alignment with the inner problem solved by prefill attack, hence the simplified name Prefill.
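The three objectives differ only in the context on which the refusal target is conditioned. A minimal sketch of that difference, assuming token sequences are Python lists and that the insertion position `i` has already been chosen by the inner search; the function names are hypothetical and the loss values passed to `mixed_objective` are placeholders rather than measured quantities.

```python
def build_example(method, x, r, h, q=50, i=None):
    """Return (context, target) for one safety example; the target is always the refusal r."""
    if method == "clean":
        context = x                    # condition on the harmful prompt only
    elif method == "prefill":
        context = x + h[:q]            # [x; h_{:q}]
    elif method == "insertion":
        context = x + r[:i] + h[:q]    # (x, [r_{:i}; h_{:q}]) for the selected position i
    else:
        raise ValueError(f"unknown method: {method}")
    return context, r

def mixed_objective(safety_loss, utility_loss, lam=0.5):
    """lambda * L_safety + (1 - lambda) * L_utility, the form shared by all three objectives."""
    return lam * safety_loss + (1.0 - lam) * utility_loss

# Placeholder tokens only.
x = ["<prompt>"]
r = ["r0", "r1", "r2", "r3"]
h = ["h0", "h1", "h2"]

for method in ("clean", "prefill", "insertion"):
    context, target = build_example(method, x, r, h, q=2, i=3)
    print(method, context, "->", target)

print(mixed_objective(safety_loss=1.2, utility_loss=0.8))
```

Seen this way, Prefill is the Insertion construction with the refusal prefix removed, which matches the remark that it solves the inner problem with a prefill attack.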

In our experiment, we set $\lambda = 0.5$ for all three methods, and use $q = 50$ for both Prefill and Insertion. (A natural variant is to sample $q$ from a distribution, such as a uniform distribution, for each data point. Since our primary goal here is to study the underlying mechanism, we focus on the simplest setting and leave this generalization to future work.) For Insertion, we set the number of candidate positions in $\mathcal{A}$ to $k = 32$. For the inner objective $\Phi$, we use $m = 4$ for the bank of candidate refusal restarts. We use the Llama-2-7b-Chat model, with the safety data from Qi et al. ([2025](https://arxiv.org/html/2606.04168#bib.bib4)) as $D_{\operatorname{Safety}}$. The utility dataset $D_{\operatorname{Utility}}$ is from the Alpaca data (Taori et al., [2023](https://arxiv.org/html/2606.04168#bib.bib16)). We evaluate ASR on the HEx-PHI safety benchmark and defer further details to Appendix [C](https://arxiv.org/html/2606.04168#A3).

Table 1: ASR (%) ↓ comparison across attacks with harmful span lengths 25, 50, 75, and 100.
#### Results\.

Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) reports ASR for random insertion attack (Section [2.2](https://arxiv.org/html/2606.04168#S2.SS2)) and prefill attack across harmful span lengths of 25, 50, 75, and 100 tokens on Llama-2-7b-Chat models (further) aligned with Clean, Prefill, and Insertion, respectively. Results are reported as mean ± standard deviation across three runs. Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) shows three main findings: (1) Clean remains highly vulnerable to both attacks, even though both attacks are simple and inexpensive, indicating that standard safety alignment remains brittle; (2) Prefill sharply reduces ASR on the matched prefill attack, but transfers poorly to random insertion attack, confirming that making alignment deeper helps but is still incomplete; (3) Insertion achieves the lowest ASR on random insertion attack while remaining competitive on prefill attack, suggesting that adversarial safety alignment can better mitigate the underlying failure mechanism induced by autoregressive consistency.

#### Evaluation on other attacks\.

Table 2: ASR (%) ↓ of different attacks on Llama-2-7b-Chat.

We further evaluate on several common attacks, including: (1) GCG, a white-box suffix attack; (2) PAIR (Chao et al., [2024](https://arxiv.org/html/2606.04168#bib.bib23)), a black-box attack that uses an attacker LLM to generate prompts; (3) TAP (Mehrotra et al., [2024](https://arxiv.org/html/2606.04168#bib.bib22)), a black-box attack that utilizes an attacker LLM to iteratively refine prompts; and (4) AutoDAN (Liu et al., [2024](https://arxiv.org/html/2606.04168#bib.bib21)), which performs adversarial input perturbations. We use the raw Llama-2-7b-Chat model (named Initial) and a further aligned version obtained by adversarial safety alignment (named Insertion, as before). We use HarmBench (Mazeika et al., [2024](https://arxiv.org/html/2606.04168#bib.bib17)) for the evaluation, taking its 200 standard behaviors as our evaluation dataset due to the significant computational cost of attacks like GCG. The results are summarized in Tab. [2](https://arxiv.org/html/2606.04168#S3.T2), showing that adversarial safety alignment remains competitive beyond random insertion attack.

#### Evaluation on other models\.

In Tab. [3](https://arxiv.org/html/2606.04168#S3.T3), we repeat the experiments from Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) on two additional models, Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.04168#bib.bib24)) and Gemma-2-9b-it (Gemma Team, Google, [2024](https://arxiv.org/html/2606.04168#bib.bib25)), to validate that our conclusions also hold for these models. The results show that the main finding in Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) transfers across model families. On both Llama-3.1-8B-Instruct and Gemma-2-9b-it, adversarial safety alignment substantially improves robustness under random insertion attack, while still remaining competitive on prefill attack.

Table 3: Comparison of ASR (%) ↓ on different models across attacks with harmful span lengths 25, 50, 75, and 100. These models are trained using the same set of parameters as in Tab. [1](https://arxiv.org/html/2606.04168#S3.T1).

## 4Discussion

Our work explains shallow safety alignment through the lens of autoregressive consistency by analyzing the learning dynamics of safety alignment\. We show that autoregressive consistency can concentrate alignment updates near the early response tokens, providing a mechanistic explanation for why safety alignment can become shallow\. We further argue that the same mechanism can hurt safety alignment by preserving and extending a harmful continuation once it has begun\. This suggests a broader class of attacks that induce harmful continuation states inside the output trajectory, with random insertion attack as one concrete example\.

These observations suggest that making alignment deeper is helpful but incomplete, because it does not directly target the underlying harmful mechanism\. Safety alignment should also train models to break harmful autoregressive consistency and recover safe behavior from harmful continuation states\. As an initial step, we propose an adversarial safety alignment framework and instantiate it with random worst\-insertion training\. We stress that the goal of this work is not to present a final defense recipe, but to identify harmful autoregressive consistency as a mechanism underlying safety fragility and to study the alignment objectives suggested by this mechanism\. We hope this work encourages future alignment and attack methods to explicitly consider the role of autoregressive consistency throughout generation\.

## References

- M. Andriushchenko, F. Croce, and N. Flammarion (2025). Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=hXA8wqRdyV)
- A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717. [Link](https://arxiv.org/abs/2406.11717)
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. [Link](https://arxiv.org/abs/2204.05862)
- S. Casper, L. Schulze, O. Patel, and D. Hadfield-Menell (2025). Defending against unforeseen failure modes with latent adversarial training. arXiv:2403.05030. [Link](https://arxiv.org/abs/2403.05030)
- P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong (2025). Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 23–42.
- P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong (2024). Jailbreaking black box large language models in twenty queries. arXiv:2310.08419. [Link](https://arxiv.org/abs/2310.08419)
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark (2022). Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. arXiv:2209.07858. [Link](https://arxiv.org/abs/2209.07858)
- Gemma Team, Google (2024). Gemma 2: improving open language models at a practical size. arXiv:2408.00118. [Link](https://arxiv.org/abs/2408.00118)
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, et al. (2024). The llama 3 herd of models. arXiv:2407.21783. [Link](https://arxiv.org/abs/2407.21783)
- B. Y. Lin, A. Ravichander, X. Lu, N. Dziri, M. Sclar, K. Chandu, C. Bhagavatula, and Y. Choi (2024). The unlocking spell on base LLMs: rethinking alignment via in-context learning. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=wxJ0eXwwda)
- X. Liu, N. Xu, M. Chen, and C. Xiao (2024). AutoDAN: generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=7Jwpw4qKkb)
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2019). Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. [Link](https://arxiv.org/abs/1706.06083)
- M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks (2024). HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249. [Link](https://arxiv.org/abs/2402.04249)
- A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi (2024). Tree of attacks: jailbreaking black-box llms automatically. arXiv:2312.02119. [Link](https://arxiv.org/abs/2312.02119)
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. [Link](https://arxiv.org/abs/2203.02155)
- S. T. Piantadosi (2014). Zipf’s word frequency law in natural language: a critical review and future directions. Psychonomic bulletin & review 21(5), pp. 1112–1130.
- X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson (2025). Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=6Mxhg9PtDE)
- X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, and P. Henderson (2024). Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=hTEGyKf0dZ)
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=HPuSIXJaa9)
- V. Raychev, M. Vechev, and E. Yahav (2014). Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, New York, NY, USA, pp. 419–428. ISBN 9781450327848. [Link](https://doi.org/10.1145/2594291.2594321), [Document](https://dx.doi.org/10.1145/2594291.2594321)
- C. E. Shannon (1951). Prediction and entropy of printed english. The Bell System Technical Journal 30(1), pp. 50–64. [Document](https://dx.doi.org/10.1002/j.1538-7305.1951.tb01366.x)
- A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, and S. Casper (2025). Latent adversarial training improves robustness to persistent harmful behaviors in llms. arXiv:2407.15549. [Link](https://arxiv.org/abs/2407.15549)
- R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). Stanford alpaca: an instruction-following llama model. GitHub. [https://github.com/tatsu-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv:2307.09288. [Link](https://arxiv.org/abs/2307.09288)
- J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022). Finetuned language models are zero-shot learners. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=gEZrGCozdqR)
- T. Wollschläger, J. Elstner, S. Geisler, V. Cohen-Addad, S. Günnemann, and J. Gasteiger (2026). The geometry of refusal in large language models: concept cones and representational independence. arXiv:2502.17420. [Link](https://arxiv.org/abs/2502.17420)
- S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, and L. Schwinn (2024). Efficient adversarial training in llms with continuous attacks. arXiv:2405.15589. [Link](https://arxiv.org/abs/2405.15589)
- Z. Xu, F. Jiang, L. Niu, J. Jia, B. Y. Lin, and R. Poovendran (2024). SafeDecoding: defending against jailbreak attacks via safety-aware decoding. arXiv:2402.08983. [Link](https://arxiv.org/abs/2402.08983)
- J. Zhao, J. Huang, Z. Wu, D. Bau, and W. Shi (2026). LLMs encode harmfulness and refusal separately. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=zLkpt30ngy)
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy (2023). LIMA: less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=KBMOKmX2he)
- A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks (2025). Representation engineering: a top-down approach to ai transparency. arXiv:2310.01405. [Link](https://arxiv.org/abs/2310.01405)
- A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson (2023). Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043. [Link](https://arxiv.org/abs/2307.15043)

## Appendix

## Appendix A Related Works

#### Shallow \(safety\) alignment\.

Qi et al. \[[17](https://arxiv.org/html/2606.04168#bib.bib4)\] identified a shortcut in current safety alignment procedures: apparent refusal behavior can often be induced by modifying the model’s generative distribution primarily over the first few response tokens. They argued that this shallow alignment contributes to a range of downstream vulnerabilities, motivating their claim that “safety alignment should be made more than just a few tokens deep”. This phenomenon is closely related to the Superficial Alignment Hypothesis (SAH) \[[30](https://arxiv.org/html/2606.04168#bib.bib27)\], which argues that alignment in current LLMs may largely change the surface form of model-user interaction rather than deeply altering the model’s underlying capabilities or behavior. In addition, Lin et al. \[[10](https://arxiv.org/html/2606.04168#bib.bib30)\] showed that the differences that alignment fine-tuning introduces between aligned models and their unaligned base models diminish as the generated sequence becomes longer. Our work builds on this line of research but asks a different question. Rather than characterizing the existence of shallow alignment, we analyze the learning dynamics of safety alignment to understand why such shallowness naturally arises under autoregressive generation and what broader failure mechanism it reflects. We show that autoregressive consistency provides a mechanistic account of shallow safety alignment and further suggests vulnerabilities beyond prefix-targeting attacks. A deeper connection between our analysis and the broader phenomenon of superficial alignment is left to future work.

#### Understanding safety of LLMs\.

A recent line of work studies LLM safety by probing and intervening on internal model states. Arditi et al. \[[2](https://arxiv.org/html/2606.04168#bib.bib28)\] showed that refusal behavior can be mediated by a one-dimensional subspace, while Wollschläger et al. \[[26](https://arxiv.org/html/2606.04168#bib.bib29)\] argued that refusal is better characterized by multiple independent directions and higher-dimensional concept cones, pointing to a richer geometric structure underlying refusal. More recently, Zhao et al. \[[29](https://arxiv.org/html/2606.04168#bib.bib11)\] distinguished harmfulness recognition from refusal behavior, showing that harmfulness is represented at the user-instruction position whereas refusal is represented at the post-instruction position. Our work is complementary: we show that even after refusal has been established in the output trajectory, generation can still be redirected by a harmful continuation state. This suggests that safety alignment should not merely elicit refusal, but should connect harmfulness recognition to stable recovery throughout generation. In addition, Xu et al. \[[28](https://arxiv.org/html/2606.04168#bib.bib10)\] observed that safety-disclaimer tokens often remain among the top-ranked next-token candidates under jailbreak attacks, and used this signal to amplify safety tokens during decoding. However, for efficiency, their defense protects only the first few tokens. In contrast, our results show that this can be insufficient, since harmful autoregressive consistency may be induced later in the generation trajectory.

#### Defenses for LLMs\.

A family of defenses inspired by adversarial training (AT) has been developed to improve LLM robustness, typically by augmenting training with adversarial prompts or perturbations generated dynamically during training. Casper et al. \[[4](https://arxiv.org/html/2606.04168#bib.bib9)\] studied latent adversarial training (LAT), which applies adversarial perturbations to hidden representations, motivated by the view that latent states encode more compressed and abstract features used by the model. Xhonneux et al. \[[27](https://arxiv.org/html/2606.04168#bib.bib8)\] proposed continuous adversarial training in token-embedding space, including CAT and CAPO, and showed that robustness to continuous embedding perturbations can transfer to discrete jailbreak attacks such as GCG, AutoDAN, and PAIR. Sheshadri et al. \[[22](https://arxiv.org/html/2606.04168#bib.bib7)\] further extended LAT from untargeted perturbations to targeted latent perturbations that explicitly steer the model toward undesirable behaviors. While these methods primarily define adversarial perturbations in the input or latent space, our work offers a complementary trajectory-level perspective: the adversary should also be defined over autoregressive states induced during generation, where harmful continuation can be preserved and extended.

## Appendix B Proofs for Section [2](https://arxiv.org/html/2606.04168#S2)

### B.1 Proof of Lemma [1](https://arxiv.org/html/2606.04168#Thmlemma1)

###### Proof\.

By a straightforward computation, the softmax Jacobian is

$$A_t = \mathrm{diag}\bigl(p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\bigr) - p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\, p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})^{T},$$

where $p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})$ is viewed as a column vector whose $i$th entry is $p_i := p_\theta(i\text{th token} \mid \mathbf{x}, \mathbf{y}_{:t})$, and $\mathrm{diag}(u)$ forms a diagonal matrix whose diagonal entries are the components of the vector $u$. We first prove that all eigenvalues of $A_t$ lie between 0 and 1. For an arbitrary normalized vector $v$, we have

$$v^{T} A_t v = \sum_i p_i v_i^2 - \Bigl(\sum_i p_i v_i\Bigr)^2.$$

Since the $p_i$ are probabilities, this is the variance of $v_i$ under the probability distribution $p_i$, and is therefore always non-negative. This shows that $A_t$ is positive semi-definite. Furthermore, $v^{T} A_t v \le \sum_i p_i v_i^2$, and since $v$ is normalized so that $v_i^2 \le 1$, we have $v^{T} A_t v \le \sum_i p_i = 1$, which proves that all eigenvalues of $A_t$ are at most 1.

To prove Lemma [1](https://arxiv.org/html/2606.04168#Thmlemma1), we note that

$$\|A_t\|_F^2 = \mathrm{tr}(A_t^2) \le \mathrm{tr}(A_t) = 1 - \|p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|^2 \le 1 - p_\theta(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t})^2,$$

where the first inequality holds because every eigenvalue $\lambda \in [0, 1]$ of $A_t$ satisfies $\lambda^2 \le \lambda$. At initialization $\theta = \theta_{\mathrm{base}}$, Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1) shows that $p_\theta(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon$ for all $t > t_c$. Therefore

$$\|A_t\|_F^2 < 1 - (1 - \epsilon)^2 = 2\epsilon - \epsilon^2 < 2\epsilon$$

for all $t > t_c$ at $\theta = \theta_{\mathrm{base}}$.

∎
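The two facts used in this proof (eigenvalues of the softmax Jacobian in $[0, 1]$, and $\|A_t\|_F^2 < 2\epsilon$ once one token has probability above $1 - \epsilon$) can be sanity-checked numerically. The following is a minimal NumPy sketch rather than the authors' code; the vocabulary size, the logit values, the value of `eps`, and all helper names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(p):
    """A = diag(p) - p p^T, the Jacobian of softmax with respect to the logits."""
    return np.diag(p) - np.outer(p, p)

def check_lemma(p, eps):
    """Collect the quantities that appear in the proof for one distribution p."""
    A = softmax_jacobian(p)
    eig = np.linalg.eigvalsh(A)           # A is symmetric, so eigvalsh applies
    return {
        "eig_min": eig.min(),
        "eig_max": eig.max(),
        "fro_sq": float(np.sum(A ** 2)),   # squared Frobenius norm
        "trace": float(np.trace(A)),       # equals 1 - ||p||^2
        "confident": bool(p.max() > 1 - eps),
        "bound_holds": bool(np.sum(A ** 2) < 2 * eps),
    }

rng = np.random.default_rng(0)
eps = 1e-3

# A spread-out distribution (illustrative stand-in for an early position).
flat = softmax(rng.normal(size=50))

# A peaked distribution (illustrative stand-in for a late, consistent position).
logits = np.zeros(50)
logits[0] = 12.0
peaked = softmax(logits)

flat_report = check_lemma(flat, eps)
peaked_report = check_lemma(peaked, eps)
print(flat_report)
print(peaked_report)
```

For the peaked distribution the squared Frobenius norm falls below $2\epsilon$, whereas the spread-out distribution is not covered by the assumption and generally has a much larger Jacobian.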

### B.2 Proof of Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2)

###### Proof\.

For supervised fine-tuning,

$$L_{\mathrm{SFT}} = -\sum_{t} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}}\left[\log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right],$$

where $\mathbf{x}$ is a harmful prompt and $\mathbf{r}$ is a refusal response. Splitting the sum at $t_c$, the gradient is

$$\nabla_\theta L_{\mathrm{SFT}} = -\nabla_\theta \sum_{t \le t_c} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}}\left[\log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right] - \sum_{t > t_c} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}}\left[\frac{\nabla_\theta p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}{p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}\right].$$

By assumption, $p_{\theta_{\mathrm{base}}}$ is autoregressive on $\mathcal{D}$; therefore, by Proposition [2.1](https://arxiv.org/html/2606.04168#S2.Thmprop1),

$$\left\| \sum_{t > t_c} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathcal{D}}\left[\frac{\nabla_\theta p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}{p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}\right] \right\|_{\theta = \theta_{\mathrm{base}}} < \frac{(2\epsilon)^{\frac{1}{2}}}{1 - \epsilon}\,(T - t_c),$$

where $T$ is the maximal length of the refusal sequence.

For direct preference optimization,

$$L_{\mathrm{DPO}} = -\mathbb{E}_{(\mathbf{x}, \mathbf{r}, \mathbf{h}) \sim \mathcal{D}}\left[\log \sigma\bigl(\beta k_\theta(\mathbf{x}, \mathbf{r}, \mathbf{h})\bigr)\right], \tag{14}$$

where $\sigma$ is the logistic function and

$$k_\theta(\mathbf{x}, \mathbf{r}, \mathbf{h}) = \sum_{t} \log \frac{p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}{p_{\theta_{\mathrm{base}}}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})} - \sum_{t} \log \frac{p_\theta(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})}{p_{\theta_{\mathrm{base}}}(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})}.$$

In the data triplet $(\mathbf{x}, \mathbf{r}, \mathbf{h})$, $\mathbf{x}$ is the harmful prompt, $\mathbf{r}$ is a refusal response, and $\mathbf{h}$ is a harmful response. The gradient is

$$\nabla_\theta L_{\mathrm{DPO}} = -\beta\, \mathbb{E}\left[\bigl[1 - \sigma(\beta k_\theta)\bigr] \Bigl(\sum_{t} \nabla_\theta \log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) - \sum_{t} \nabla_\theta \log p_\theta(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})\Bigr)\right].$$

At initialization every log-ratio in $k_\theta$ vanishes, so $k_{\theta_{\mathrm{base}}} = 0$ and $1 - \sigma(\beta k_\theta(\mathbf{x}, \mathbf{r}, \mathbf{h})) = 1 - \sigma(0) = 1/2$, which is a constant. We can therefore again split $\nabla_\theta L_{\mathrm{DPO}}$ into $t \le t_c$ and $t > t_c$ contributions, just as we did for $L_{\mathrm{SFT}}$. The $t > t_c$ contribution can be bounded in the same manner as the SFT bound:

$$\begin{aligned}
&\frac{1}{2} \left\| \sum_{t > t_c} \nabla_\theta \log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) - \sum_{t > t_c} \nabla_\theta \log p_\theta(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t}) \right\|_{\theta = \theta_{\mathrm{base}}} \\
&\quad \le \frac{1}{2} \left\| \sum_{t > t_c} \nabla_\theta \log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) \right\|_{\theta = \theta_{\mathrm{base}}} + \frac{1}{2} \left\| \sum_{t > t_c} \nabla_\theta \log p_\theta(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t}) \right\|_{\theta = \theta_{\mathrm{base}}} \\
&\quad < \frac{(2\epsilon)^{\frac{1}{2}}}{1 - \epsilon}\,(T - t_c),
\end{aligned}$$

where $T$ is the maximal length of $\mathbf{r}$ and $\mathbf{h}$. ∎
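To build intuition for why late positions contribute little, it can help to look at the per-token gradient in logit space. For the negative log-likelihood of a target token $y$, the gradient with respect to the logits is $p - e_y$, whose Euclidean norm is at most $\sqrt{2}\,(1 - p_y)$. The sketch below compares an undecided position with a highly confident one. It is only an illustration under assumed toy distributions (the vocabulary size, logit values, and target index are arbitrary), and it is not the parameter-space bound stated in the proposition.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def nll_logit_grad(p, target):
    """Gradient of -log p[target] with respect to the logits: p - e_target."""
    g = p.copy()
    g[target] -= 1.0
    return g

vocab, target = 50, 3

# Undecided position: uniform next-token distribution (assumed toy case).
early = softmax(np.zeros(vocab))

# Confident position: the target token already has probability close to 1.
late_logits = np.zeros(vocab)
late_logits[target] = 12.0
late = softmax(late_logits)

early_norm = np.linalg.norm(nll_logit_grad(early, target))
late_norm = np.linalg.norm(nll_logit_grad(late, target))

# Upper bound sqrt(2) * (1 - p_y) on the logit-gradient norm.
early_bound = np.sqrt(2) * (1 - early[target])
late_bound = np.sqrt(2) * (1 - late[target])

print(f"early: norm={early_norm:.4f}, bound={early_bound:.4f}")
print(f"late:  norm={late_norm:.2e}, bound={late_bound:.2e}")
```

In this toy setting the confident position yields a logit gradient several orders of magnitude smaller than the undecided one, which matches the qualitative picture of gradient mass concentrating on positions $t \le t_c$.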

### B.3 One step version

Before proving the full multi-step result, we first examine the simpler case of a single gradient update from the base model. This one-step setting already captures the central mechanism: under autoregressive consistency, next-token distributions at late positions have small sensitivity to the alignment gradient, so the update mainly affects early response positions while leaving later positions close to the base model. Proposition [B.1](https://arxiv.org/html/2606.04168#A2.Thmprop1) makes this statement precise. It also implies that autoregressive consistency is preserved after the update in a weakened form, as shown in Corollary [B.1](https://arxiv.org/html/2606.04168#A2.Thmcorollary1). This one-step result provides the basic induction step used in the multi-step proof in Appendix [B.4](https://arxiv.org/html/2606.04168#A2.SS4). Define

$$\Delta_{\theta_{\mathrm{base}} \to \theta_{\mathrm{aligned}}}\, p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) := p_{\theta_{\mathrm{aligned}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_{\mathrm{base}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}). \tag{15}$$
###### Proposition B.1 (One-step shallow safety alignment).

Let $\eta$ be the learning rate and $L$ be the alignment loss function (SFT or DPO). Under the conditions of Proposition [2.1](https://arxiv.org/html/2606.04168#S2.Thmprop1), and assuming that $\|\nabla_\theta z\| < C$, the one-step gradient update from the base model satisfies

$$\begin{aligned}
\left\| \Delta_{\theta_{\mathrm{base}} \to \theta_{\mathrm{aligned}}}\, p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \right\|_F &= \left\| -\eta\, \nabla_\theta p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) \cdot \nabla_\theta L \right\|_{\theta = \theta_{\mathrm{base}}} + O(\eta^2) \\
&< \eta\, C\, (2\epsilon)^{\frac{1}{2}}\, \|\nabla_\theta L\|_{\theta = \theta_{\mathrm{base}}} + O(\eta^2)
\end{aligned} \tag{16}$$

for all $t > t_c$.

Note that since the Frobenius norm is (the square root of) a sum of squares, the above inequality also holds for the change in probability of any given $i$th token in the vocabulary, $\Delta_{\theta_{\mathrm{base}} \to \theta_{\mathrm{aligned}}}\, p_\theta(i\text{th token} \mid \mathbf{x}, \mathbf{y}_{:t})$, which we summarize as

###### Corollary B.1.

Under the conditions of Theorem [2.1](https://arxiv.org/html/2606.04168#S2.Thmtheorem1),

$$\left| \Delta_{\theta_{\mathrm{base}} \to \theta_{\mathrm{aligned}}}\, p_\theta(\text{any next token} \mid \mathbf{x}, \mathbf{y}_{:t}) \right| < \eta\, C\, (2\epsilon)^{\frac{1}{2}}\, \|\nabla_\theta L\|_{\theta = \theta_{\mathrm{base}}}, \qquad \forall t > t_c, \tag{17}$$

and therefore, after one gradient update,

$$p_{\theta_1}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon - \eta\, C\, (2\epsilon)^{\frac{1}{2}}\, \|\nabla_\theta L\|_{\theta = \theta_{\mathrm{base}}}, \qquad \forall t > t_c. \tag{18}$$

Therefore the gradient updates on late-position tokens are suppressed by a factor of $\epsilon^{1/2}$.
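The one-step statement can be illustrated on a toy model in which the parameters are the logits themselves, so that $\nabla_\theta z$ is the identity and the chain-rule factor $C$ is effectively replaced by 1. This is a simplifying assumption made purely for illustration; the update direction `grad` is a random unit vector standing in for $\nabla_\theta L$, and the vocabulary size, learning rate, and `eps` are arbitrary. The sketch compares the first-order change $-\eta A_t \nabla_\theta L$ with the actual change in the next-token distribution.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def one_step_change(logits, grad, lr):
    """Return the base distribution, the first-order change, and the actual change
    after one gradient step on a toy model whose parameters are the logits."""
    p0 = softmax(logits)
    A = np.diag(p0) - np.outer(p0, p0)       # softmax Jacobian at the base point
    first_order = -lr * (A @ grad)           # linearized change in p
    actual = softmax(logits - lr * grad) - p0
    return p0, first_order, actual

rng = np.random.default_rng(0)
vocab, lr, eps = 50, 1e-3, 1e-3

# Stand-in for the alignment gradient (hypothetical direction, unit norm).
grad = rng.normal(size=vocab)
grad /= np.linalg.norm(grad)

# Late, consistent position: one token has probability above 1 - eps.
logits = np.zeros(vocab)
logits[0] = 12.0

p0, first_order, actual = one_step_change(logits, grad, lr)

# Right-hand side of the one-step bound with C = 1 in this toy setting.
bound = lr * np.sqrt(2 * eps) * np.linalg.norm(grad)

print("p0[0]              =", p0[0])
print("||first-order||    =", np.linalg.norm(first_order))
print("||actual change||  =", np.linalg.norm(actual))
print("bound              =", bound)
```

The gap between the actual and first-order changes is the $O(\eta^2)$ term, which is small here because the step size is small.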

### B.4 Proof of Theorem [2.1](https://arxiv.org/html/2606.04168#S2.Thmtheorem1)

We prove the multi-step version of the gradient-concentration argument. Recall that Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2) shows that, at initialization, the dominant contributions to the SFT or DPO gradient come from the early positions $t \le t_c$, because autoregressive consistency suppresses the gradient contribution of late positions $t > t_c$. We now show that the same concentration persists along gradient descent, and hence the resulting alignment remains shallow.

###### Proof\.

Let $\theta_0 = \theta_{\mathrm{base}}$, and consider gradient descent

$$\theta_{k+1} = \theta_k - \eta_k \nabla L(\theta_k), \qquad k = 0, \ldots, K-1, \tag{19}$$

where $L$ is either the SFT objective in Eq. ([5](https://arxiv.org/html/2606.04168#S2.E5)) or the DPO objective in Eq. ([6](https://arxiv.org/html/2606.04168#S2.E6)).

#### Smoothness assumptions\.

We use the following standard smoothness assumptions along the optimization trajectory. This is the usual local smoothness regime in which a Taylor expansion is meaningful.

1. For all contexts $(\mathbf{x}, \mathbf{y}_{:t})$ with $t > t_c$, the logit Jacobian is uniformly bounded (as in Proposition [2.1](https://arxiv.org/html/2606.04168#S2.Thmprop1)): $\|\nabla_\theta z_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \le C$. (A.1)
2. The conditional next-token distribution is twice differentiable along the line segment between consecutive iterates, and $\|\nabla_\theta^2 p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \le B$ for all $t > t_c$. (A.2)
3. The total update length is small enough that the sequence $\epsilon_k$ (defined below) satisfies $\epsilon_k < 1$ for all $k \le K$. (A.3)
4. The gradient norm is uniformly bounded along the optimization path: $\|\nabla L(\theta_k)\| \le G$ for $k = 0, \ldots, K-1$. (A.4)

For a constant learning rate $\eta$, we let

$$\eta \le \frac{\sqrt{\epsilon_0 + \varepsilon_2} - \sqrt{\epsilon_0}}{\alpha K G}, \quad \alpha = \max\left\{\frac{C}{\sqrt{2}}, \sqrt{\frac{B}{2}}\right\}, \tag{20}$$

where $\bar{\epsilon_2}$ is defined in Definition [2.2](https://arxiv.org/html/2606.04168#S2.Thmdefinition2). For DPO, the same argument is applied to both the refusal trajectory $\mathbf{r}$ and the harmful trajectory $\mathbf{h}$, since the DPO gradient contains token-level log-probability gradients from both. For SFT, only the refusal trajectory $\mathbf{r}$ is needed. To avoid duplicating notation, we write $\mathbf{y}$ (as stated in the Preliminaries in Section [1](https://arxiv.org/html/2606.04168#S1)) for whichever trajectory is currently being considered.

By Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1), for every $t > t_c$,

$$p_{\theta_0}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon_0,$$

where $\epsilon_0 = \epsilon$. We prove by induction that there exists a nondecreasing sequence

$$\epsilon_0 \le \epsilon_1 \le \cdots \le \epsilon_K \tag{21}$$

such that

$$\forall k \le K,\ t > t_c: \quad p_{\theta_k}(y_{t+1} \mid \mathbf{x}, \mathbf{y}_{:t}) > 1 - \epsilon_k. \tag{22}$$

#### Step 0.

The base case $k = 0$ is exactly Assumption [2.1](https://arxiv.org/html/2606.04168#S2.Thmassumption1).

#### Stepkk\.

Now suppose the claim holds at stepkk\. Fix a late positiont\>tct\>t\_\{c\}\. By the induction hypothesis,

pθk\(⋅∣𝐱,𝐲:t\)\>1−ϵk\.p\_\{\\theta\_\{k\}\}\(\\cdot\\mid\{\\mathbf\{x\}\},\{\\mathbf\{y\}\}\_\{:t\}\)\>1\-\\epsilon\_\{k\}\.Recall that the proof of Lemma[1](https://arxiv.org/html/2606.04168#Thmlemma1)does not depend on being exactly atθbase\\theta\_\{\\mathrm\{base\}\}; it only uses the fact that the conditional distribution assigns probability at least1−ϵk1\-\\epsilon\_\{k\}to one token\. Therefore, we obtain

$$\|\nabla_z p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F < \sqrt{2\epsilon_k}.$$

By the chain rule and the bounded-logit-Jacobian assumption ([A.1](https://arxiv.org/html/2606.04168#A2.Ex20)),

$$
\begin{aligned}
\|\nabla_\theta p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F
&\leq \|\nabla_z p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \, \|\nabla_\theta z_{\theta_k}\|_F \\
&< C\sqrt{2\epsilon_k}.
\end{aligned} \tag{23}
$$

Let

$$\Delta_k := \theta_{k+1}-\theta_k = -\eta_k \nabla L(\theta_k).$$

By Taylor expansion and the second-order smoothness assumption ([A.2](https://arxiv.org/html/2606.04168#A2.Ex21)),

$$
\begin{aligned}
\|p_{\theta_{k+1}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F
&\leq \|\nabla_\theta p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \, \|\Delta_k\| + \frac{B}{2}\|\Delta_k\|^2 \\
&< C\sqrt{2\epsilon_k}\,\eta_k \|\nabla L(\theta_k)\| + \frac{B}{2}\eta_k^2 \|\nabla L(\theta_k)\|^2,
\end{aligned} \tag{24}
$$

where we substitute the bound from Eq. ([23](https://arxiv.org/html/2606.04168#A2.E23)) in the second inequality.

Now define

$$\epsilon_{k+1} := \epsilon_k + C\sqrt{2\epsilon_k}\,\eta_k \|\nabla L(\theta_k)\| + \frac{B}{2}\eta_k^2 \|\nabla L(\theta_k)\|^2.$$

Then

$$
\begin{aligned}
p_{\theta_{k+1}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})
&\geq p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - \|p_{\theta_{k+1}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \\
&> 1-\epsilon_k-(\epsilon_{k+1}-\epsilon_k) \\
&= 1-\epsilon_{k+1}.
\end{aligned} \tag{25}
$$

This completes the induction. Hence autoregressive consistency is preserved at every late position, with a gradually weakened constant $\epsilon_k$.
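To get a feel for how slowly the constants $\epsilon_k$ grow, the recursion can be iterated numerically. The sketch below is only illustrative: the constants `EPS0`, `EPS2`, `C`, `B`, `K`, `G` are made-up values rather than quantities from the paper, the function names are hypothetical, and the gradient norm is held at its upper bound `G` at every step as a worst case. The learning rate is set to the largest value permitted by Eq. (20).

```python
import math


def epsilon_schedule(eps0, C, B, step_sizes):
    """Iterate eps_{k+1} = eps_k + C*sqrt(2*eps_k)*d_k + (B/2)*d_k**2,
    where d_k stands for eta_k * ||grad L(theta_k)||."""
    eps = [eps0]
    for d in step_sizes:
        e = eps[-1]
        eps.append(e + C * math.sqrt(2.0 * e) * d + 0.5 * B * d * d)
    return eps


def max_learning_rate(eps0, eps2, C, B, K, G):
    """Largest learning rate allowed by the condition in Eq. (20)."""
    alpha = max(C / math.sqrt(2.0), math.sqrt(B / 2.0))
    return (math.sqrt(eps0 + eps2) - math.sqrt(eps0)) / (alpha * K * G)


# Illustrative constants only (assumptions, not values from the paper).
EPS0, EPS2, C, B, K, G = 0.01, 0.05, 2.0, 3.0, 100, 1.0

eta = max_learning_rate(EPS0, EPS2, C, B, K, G)
# Worst case: the gradient norm equals its bound G at every step.
schedule = epsilon_schedule(EPS0, C, B, [eta * G] * K)
drift = schedule[-1] - schedule[0]
print(f"eta = {eta:.6f}, eps_K - eps_0 = {drift:.6f}, tolerance = {EPS2}")
```

With these toy values the sequence is nondecreasing and the accumulated drift $\epsilon_K-\epsilon_0$ stays within the tolerance $\varepsilon_2$, as the learning-rate condition is designed to guarantee.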

We next show that this implies gradient concentration throughout training, not merely at initialization\.

#### Gradient concentration for SFT\.

For SFT (Eq. ([5](https://arxiv.org/html/2606.04168#S2.E5))) and $t > t_c$, using the induction result above, $p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) > 1-\epsilon_k$, we obtain (similarly to the proof in Appendix [B.2](https://arxiv.org/html/2606.04168#A2.SS2))

$$
\begin{aligned}
\left\|-\nabla_\theta \log p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right\|_F
&= \left\|\frac{\nabla_\theta p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}{p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})}\right\|_F \\
&\leq \frac{\|\nabla_\theta p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{r}_{:t})\|_F}{1-\epsilon_k} \\
&< \frac{C\sqrt{2\epsilon_k}}{1-\epsilon_k}.
\end{aligned} \tag{26}
$$

Thus every late SFT token-gradient contribution is $O(\sqrt{\epsilon_k})$. Summing over $t > t_c$, the total late-position contribution to the SFT gradient is bounded by

$$\left\|\sum_{t>t_c} -\nabla_\theta \log p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right\|_F \leq (T-t_c)\frac{C\sqrt{2\epsilon_k}}{1-\epsilon_k}, \tag{27}$$

where $T$ is the response length. Hence, as long as $\epsilon_k$ remains small, the SFT gradient contribution from $t > t_c$ remains suppressed at every step $k$. This is exactly the multi-step analogue of Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2) for SFT.
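The suppression effect is easiest to see in logit space, where the gradient of the token-level negative log-likelihood is the softmax output minus the one-hot target. The sketch below is a simplified illustration, not the paper's parameter-space bound: it omits the Jacobian constant $C$, and the function names are hypothetical. It uses the elementary fact that $\|p - e_y\|_2 \leq \sqrt{2}\,(1-p_y)$, so a token predicted with probability above $1-\epsilon$ contributes a logit gradient of norm below $\sqrt{2}\,\epsilon$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_logit_grad_norm(logits, target):
    """Return (||grad_z of -log p[target]||_2, p[target]).

    The gradient of -log softmax(z)[target] with respect to z is p - onehot(target).
    """
    p = softmax(logits)
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return math.sqrt(sum(g * g for g in grad)), p[target]


# A confidently predicted ("late") token versus an uncertain one.
peaked_norm, peaked_p = nll_logit_grad_norm([10.0, 0.0, 0.0, 0.0], target=0)
flat_norm, flat_p = nll_logit_grad_norm([0.0, 0.0, 0.0, 0.0], target=0)
print(f"peaked: p = {peaked_p:.5f}, grad norm = {peaked_norm:.5f}")
print(f"flat:   p = {flat_p:.5f}, grad norm = {flat_norm:.5f}")
```

The confidently predicted token yields a far smaller gradient than the uncertain one, which mirrors why late positions contribute little to the SFT update.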

#### DPO\.

For DPO, recall that the DPO objective is

$$L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(\mathbf{x},\mathbf{r},\mathbf{h})\sim\mathcal{D}}\left[\log \sigma\left(\beta k_\theta(\mathbf{x},\mathbf{r},\mathbf{h})\right)\right],$$

where

$$k_\theta(\mathbf{x},\mathbf{r},\mathbf{h}) = \log\frac{p_\theta(\mathbf{r} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{r} \mid \mathbf{x})} - \log\frac{p_\theta(\mathbf{h} \mid \mathbf{x})}{p_{\theta_{\mathrm{base}}}(\mathbf{h} \mid \mathbf{x})}.$$

Since the base-model terms do not depend on $\theta$, we can further expand its gradient as

$$\nabla_\theta L_{\mathrm{DPO}}(\theta) = -\beta\,\mathbb{E}_{(\mathbf{x},\mathbf{r},\mathbf{h})\sim\mathcal{D}}\left[\sigma\left(-\beta k_\theta(\mathbf{x},\mathbf{r},\mathbf{h})\right)\nabla_\theta k_\theta(\mathbf{x},\mathbf{r},\mathbf{h})\right],$$

with $0 < \sigma(-\beta k_\theta(\mathbf{x},\mathbf{r},\mathbf{h})) < 1$. Furthermore,

$$
\begin{aligned}
\nabla_\theta k_\theta(\mathbf{x},\mathbf{r},\mathbf{h})
&= \nabla_\theta \log p_\theta(\mathbf{r} \mid \mathbf{x}) - \nabla_\theta \log p_\theta(\mathbf{h} \mid \mathbf{x}) \\
&= \sum_t \left[\nabla_\theta \log p_\theta(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) - \nabla_\theta \log p_\theta(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})\right],
\end{aligned} \tag{28}
$$

where we use the autoregressive factorization in the second equality. Therefore, the DPO gradient is a bounded scalar reweighting of the difference between the refusal and harmful token-level log-probability gradients. Similar to the case of SFT, for every late refusal token $t > t_c$ and every late harmful token, the induction result above gives, $\forall t > t_c$:

$$
\begin{aligned}
p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t}) > 1-\epsilon_k
&\overset{\text{suppression at late tokens}}{\implies}
\left\|\nabla_\theta \log p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right\|_F < \frac{C\sqrt{2\epsilon_k}}{1-\epsilon_k}, \\
p_{\theta_k}(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t}) > 1-\epsilon_k
&\overset{\text{suppression at late tokens}}{\implies}
\left\|\nabla_\theta \log p_{\theta_k}(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})\right\|_F < \frac{C\sqrt{2\epsilon_k}}{1-\epsilon_k}.
\end{aligned} \tag{29}
$$

Consequently, the late-position contribution to the DPO gradient satisfies

$$
\begin{aligned}
&\|\nabla_\theta L_{\mathrm{DPO},\,t>t_c}(\theta_k)\|_F \\
\leq{}& \beta\,\mathbb{E}_{(\mathbf{x},\mathbf{r},\mathbf{h})\sim\mathcal{D}}\left[\sum_{t>t_c}\left\|\nabla_\theta \log p_{\theta_k}(r_{t+1} \mid \mathbf{x}, \mathbf{r}_{:t})\right\|_F + \sum_{t>t_c}\left\|\nabla_\theta \log p_{\theta_k}(h_{t+1} \mid \mathbf{x}, \mathbf{h}_{:t})\right\|_F\right] \\
<{}& 2\beta(T-t_c)\frac{C\sqrt{2\epsilon_k}}{1-\epsilon_k},
\end{aligned} \tag{30}
$$

where $T$ upper-bounds the response lengths. Hence the DPO gradient at late positions is also $O(\sqrt{\epsilon_k})$ at every training step $k \leq K$. This gives the multi-step analogue of Proposition [2.2](https://arxiv.org/html/2606.04168#S2.Thmprop2) for DPO.
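The "bounded scalar reweighting" can be made concrete for a single triplet. The sketch below is a minimal illustration that assumes the four sequence-level log-probabilities have already been computed by some model; the function names and the numeric inputs are hypothetical. It returns the per-example loss $-\log\sigma(\beta k_\theta)$ and the scalar $\beta\,\sigma(-\beta k_\theta)$ that multiplies $\nabla_\theta k_\theta$ in the gradient.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(logp_r, logp_h, logp_r_base, logp_h_base):
    """k_theta: log-ratio of the refusal minus log-ratio of the harmful response."""
    return (logp_r - logp_r_base) - (logp_h - logp_h_base)


def dpo_loss_and_weight(logp_r, logp_h, logp_r_base, logp_h_base, beta):
    """Return (-log sigma(beta * k), beta * sigma(-beta * k)) for one triplet."""
    k = dpo_margin(logp_r, logp_h, logp_r_base, logp_h_base)
    loss = -log_sigmoid(beta * k)
    weight = beta * math.exp(log_sigmoid(-beta * k))
    return loss, weight


# Policy identical to the base model: k = 0, so the weight is beta / 2.
loss_init, weight_init = dpo_loss_and_weight(-5.0, -5.0, -5.0, -5.0, beta=0.1)
# Policy already prefers the refusal: k = 6, so the weight shrinks.
loss_later, weight_later = dpo_loss_and_weight(-2.0, -8.0, -5.0, -5.0, beta=0.1)
print(loss_init, weight_init, loss_later, weight_later)
```

Because the weight always lies strictly between 0 and $\beta$, it cannot amplify the token-level gradients, which is why the late-position bound carries over from SFT with only the extra factor $2\beta$.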

Therefore, for both SFT and DPO, the gradient contributions at late positions remain suppressed throughout training. The only token positions that can provide a dominant learning signal are therefore the early positions $t \leq t_c$. In this precise sense, the alignment gradient remains concentrated on the early tokens throughout gradient descent.

It remains to prove that the final aligned model stays close to the base model at late positions. From the Taylor bound above, $\|p_{\theta_{k+1}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \leq \epsilon_{k+1}-\epsilon_k$. Summing over $k = 0, \ldots, K-1$ gives

$$
\begin{aligned}
\|p_{\theta_K}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_0}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F
&\leq \sum_{k=0}^{K-1}\|p_{\theta_{k+1}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_k}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \\
&\leq \sum_{k=0}^{K-1}(\epsilon_{k+1}-\epsilon_k) \\
&= \epsilon_K-\epsilon_0
\end{aligned} \tag{31}
$$

$$\implies \|p_{\theta_K}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t}) - p_{\theta_{\operatorname{base}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{:t})\|_F \leq \epsilon_K-\epsilon_0. \tag{32}$$

It remains to verify that $\epsilon_K-\epsilon_0$ is smaller than the tolerance $\varepsilon_2$ required in Definition [2.2](https://arxiv.org/html/2606.04168#S2.Thmdefinition2). Let $S_K = \sum_{k=0}^{K-1}\eta_k\|\nabla L(\theta_k)\|$. From the recursion defining $\epsilon_k$, we have

$$\sqrt{\epsilon_K} \leq \sqrt{\epsilon_0} + \alpha S_K.$$

Under the bounded-gradient assumption ([A.4](https://arxiv.org/html/2606.04168#A2.Ex23)), $S_K \leq KG\eta$. Therefore,

$$\epsilon_K-\epsilon_0 \leq 2\alpha KG\eta\sqrt{\epsilon_0} + \alpha^2K^2G^2\eta^2.$$

By the condition on the learning rate given in Eq. ([20](https://arxiv.org/html/2606.04168#A2.E20)), this quantity is at most $\varepsilon_2$. Hence the second condition in Definition [2.2](https://arxiv.org/html/2606.04168#S2.Thmdefinition2) is established.
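The passage from the recursion to the square-root bound is stated without intermediate steps. One way to fill them in, using only the definitions of $\epsilon_{k+1}$ and $\alpha$ given in this proof, is sketched below; the shorthand $\delta_k$ is introduced here purely for illustration and does not appear in the original.

```latex
\begin{aligned}
\delta_k &:= \eta_k \|\nabla L(\theta_k)\|, \qquad
C\sqrt{2} = 2\cdot\frac{C}{\sqrt{2}} \leq 2\alpha, \qquad
\frac{B}{2} \leq \alpha^2, \\
\epsilon_{k+1} &= \epsilon_k + C\sqrt{2\epsilon_k}\,\delta_k + \frac{B}{2}\delta_k^2
\;\leq\; \epsilon_k + 2\alpha\sqrt{\epsilon_k}\,\delta_k + \alpha^2\delta_k^2
= \left(\sqrt{\epsilon_k} + \alpha\delta_k\right)^2, \\
\sqrt{\epsilon_{k+1}} &\leq \sqrt{\epsilon_k} + \alpha\delta_k
\;\implies\;
\sqrt{\epsilon_K} \leq \sqrt{\epsilon_0} + \alpha\sum_{k=0}^{K-1}\delta_k
= \sqrt{\epsilon_0} + \alpha S_K, \\
\epsilon_K - \epsilon_0 &\leq \left(\sqrt{\epsilon_0} + \alpha K G \eta\right)^2 - \epsilon_0
\;\leq\; \left(\sqrt{\epsilon_0 + \varepsilon_2}\right)^2 - \epsilon_0
= \varepsilon_2 .
\end{aligned}
```

The last line uses the learning-rate condition in Eq. (20) in the rearranged form $\alpha K G \eta \leq \sqrt{\epsilon_0+\varepsilon_2} - \sqrt{\epsilon_0}$.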

The first condition in Definition [2.2](https://arxiv.org/html/2606.04168#S2.Thmdefinition2) concerns the model being aligned at early positions. Since the effective learning signal of SFT or DPO is concentrated on the early positions $t \leq t_c$, under the usual assumption that the alignment optimization succeeds on the part of the objective where the gradient is not suppressed, the trained model satisfies this condition by assumption. Therefore, the alignment is shallow. ∎

## Appendix C Experimental Details

Algorithm 1: Adversarial Safety Alignment with Worst-Insertion Attack

1: Input: model $p_\theta$; safety triplet dataset $\mathcal{D}_{\mathrm{Safety}}$ containing $(\mathbf{x},\mathbf{r},\mathbf{h})$; utility dataset $\mathcal{D}_{\mathrm{Utility}}$; harmful span length $q$; number of sampled insertion positions $k$; refusal restart bank $\mathcal{R} = \{r^C_j\}_{j=1}^m$; utility weight $\lambda \in (0,1]$

2: Output: safety-aligned model parameters $\theta$

3: for each training step do

4: Sample a safety minibatch $\mathcal{B}_S \sim \mathcal{D}_{\mathrm{Safety}}$

5: Sample a utility minibatch $\mathcal{B}_U \sim \mathcal{D}_{\mathrm{Utility}}$

6: Initialize $L_{\mathrm{Insertion}} \leftarrow 0$

7: for each triplet $(\mathbf{x},\mathbf{r},\mathbf{h}) \in \mathcal{B}_S$ do

8: Take the harmful prefix span $\mathbf{h}_{:q}$

9: Uniformly sample insertion positions $\{i_j\}_{j=1}^k \sim \mathrm{Unif}([|\mathbf{r}|])$

10: Construct candidate perturbed states $\mathcal{A}(\mathbf{x},\mathbf{r},\mathbf{h}) = \left\{\tau_j = \bigl(\mathbf{x},\, [\mathbf{r}_{:i_j}; \mathbf{h}_{:q}]\bigr)\right\}_{j=1}^k$

11: for each candidate state $\tau_j \in \mathcal{A}(\mathbf{x},\mathbf{r},\mathbf{h})$ do

12: Compute harmful-continuation score $H(\tau_j;\theta) = \frac{1}{T_o}\log p_\theta\left(\mathbf{h}_{q+1:q+T_o} \mid \tau_j\right)$

13: Compute refusal-restart score $R(\tau_j;\theta) = \max_{\mathbf{r}^C \in \mathcal{R}} \frac{1}{|\mathbf{r}^C|}\log p_\theta\left(\mathbf{r}^C \mid \tau_j\right)$

14: Compute harmful autoregressive consistency margin $\Phi_\theta(\tau_j) = H(\tau_j;\theta) - R(\tau_j;\theta)$

15: end for

16: Select the worst insertion state $\tau^\star_\theta(\mathbf{x},\mathbf{r},\mathbf{h}) \in \arg\max_{\tau_j \in \mathcal{A}(\mathbf{x},\mathbf{r},\mathbf{h})} \Phi_\theta(\tau_j)$

17: Accumulate refusal-recovery loss $L_{\mathrm{Insertion}} \leftarrow L_{\mathrm{Insertion}} - \log p_\theta\left(\mathbf{r} \mid \tau^\star_\theta(\mathbf{x},\mathbf{r},\mathbf{h})\right)$

18: end for

19: Compute utility loss $L_{\mathrm{Utility}} = -\frac{1}{|\mathcal{B}_U|}\sum_{(\mathbf{z},\mathbf{y}) \in \mathcal{B}_U}\log p_\theta(\mathbf{y} \mid \mathbf{z})$

20: Form total objective $L = \lambda L_{\mathrm{Insertion}} + (1-\lambda)L_{\mathrm{Utility}}$

21: Update parameters $\theta$ by gradient descent on $L$

22: end for

23: return $\theta$

For convenience, we first present adversarial safety alignment with random worst-insertion in Algorithm [1](https://arxiv.org/html/2606.04168#alg1).
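The inner selection loop of Algorithm 1 (steps 8 to 16) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: token sequences are plain lists of strings, the model is abstracted as a callable `logprob(continuation, context)` returning a total log-probability, `toy_logprob` is a hypothetical stand-in for a real language model, and $\mathrm{Unif}([|\mathbf{r}|])$ is assumed to draw positions from $\{1, \ldots, |\mathbf{r}|\}$.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (continuation tokens, context tokens) -> total log-probability of the continuation.
LogProbFn = Callable[[Sequence[str], Sequence[str]], float]


def worst_insertion_state(
    x: Sequence[str],
    r: Sequence[str],
    h: Sequence[str],
    q: int,
    k: int,
    t_o: int,
    refusal_bank: Sequence[Sequence[str]],
    logprob: LogProbFn,
    rng: random.Random,
) -> Tuple[List[str], float]:
    """Return the sampled state [x; r[:i]; h[:q]] with the largest margin H - R."""
    if len(h) < q + t_o:
        raise ValueError("harmful response must contain at least q + t_o tokens")
    span = list(h[:q])            # harmful prefix span h_{:q}
    target = list(h[q:q + t_o])   # harmful continuation h_{q+1:q+T_o}
    best_state: List[str] = []
    best_margin = float("-inf")
    for _ in range(k):
        # Assumption: insertion positions are drawn from {1, ..., |r|}.
        i = rng.randint(1, len(r))
        state = list(x) + list(r[:i]) + span
        harmful_score = logprob(target, state) / t_o
        restart_score = max(logprob(rc, state) / len(rc) for rc in refusal_bank)
        margin = harmful_score - restart_score
        if margin > best_margin:
            best_state, best_margin = state, margin
    return best_state, best_margin


def toy_logprob(continuation: Sequence[str], context: Sequence[str]) -> float:
    """Hypothetical model: refusal restarts become less likely as the context grows."""
    per_token = -1.0
    if continuation and continuation[0] == "I":
        per_token -= 0.1 * len(context)
    return per_token * len(continuation)


prompt = ["<prompt>"]
refusal = ["I", "cannot", "help", "with", "that"]
harmful = ["Sure", ",", "here", "is", "how", "to"]
bank = [["I", "cannot", "fulfill"], ["I", "am", "unable"]]
state, margin = worst_insertion_state(
    prompt, refusal, harmful, q=2, k=4, t_o=3,
    refusal_bank=bank, logprob=toy_logprob, rng=random.Random(0),
)
print(state, round(margin, 3))
```

In a training loop, the returned state would serve as the conditioning context for the refusal-recovery loss in step 17; that part is omitted here because it depends on the training framework.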

#### Training configurations\.

In all fine-tuning experiments that involve both $D_{\operatorname{Safety}}$ and $D_{\operatorname{Utility}}$, including those in Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) and Tab. [3](https://arxiv.org/html/2606.04168#S3.T3), we use the configurations in Tab. [4](https://arxiv.org/html/2606.04168#A3.T4), where the value of $\lambda$ is controlled by the batch size.

Table 4: Training configurations for Tab. [1](https://arxiv.org/html/2606.04168#S3.T1) and Tab. [3](https://arxiv.org/html/2606.04168#S3.T3)

#### Evaluation of ASR\.

In this paper, attacks construct a partial perturbed autoregressive state $\tilde{\tau}_t$ from the data $(\mathbf{x},\mathbf{r},\mathbf{h})$, where $\mathbf{x}$ is the harmful prompt, $\mathbf{r}$ is the safe refusal response, and $\mathbf{h}$ is the harmful response (possibly a short span). The model then continues generation from this perturbed state according to $p_\theta(\cdot \mid \tilde{\tau}_t)$ at position $t$. To conveniently evaluate the ASR under an attack, we first construct a set of refusal start keywords $\mathcal{C}$ that commonly appear in a safe response, and let the model continue the generation for up to 50 tokens by $p_\theta(\mathbf{y} \mid \tilde{\tau}_t)$, where $|\mathbf{y}| = 50$. The attack is then considered successful if none of the keywords in $\mathcal{C}$ appears in $\mathbf{y}$.
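The keyword-based success criterion is simple enough to sketch directly. The snippet below is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, the example keywords are placeholders because the actual set $\mathcal{C}$ is not listed in this section, and case-insensitive substring matching is an assumption.

```python
from typing import Iterable, Sequence


def attack_succeeded(continuation: str, refusal_keywords: Iterable[str]) -> bool:
    """An attack counts as successful if no refusal keyword appears in the continuation."""
    text = continuation.lower()  # assumption: matching is case-insensitive
    return not any(keyword.lower() in text for keyword in refusal_keywords)


def attack_success_rate(continuations: Sequence[str], refusal_keywords: Iterable[str]) -> float:
    """Fraction of continuations that contain none of the refusal keywords."""
    if not continuations:
        raise ValueError("need at least one continuation to compute a rate")
    keywords = list(refusal_keywords)
    hits = sum(attack_succeeded(c, keywords) for c in continuations)
    return hits / len(continuations)


# Placeholder keywords and continuations, for illustration only.
example_keywords = ["I cannot", "I'm sorry"]
example_continuations = [
    "I cannot help with that request.",
    "Step 1: gather the following materials",
]
asr = attack_success_rate(example_continuations, example_keywords)
print(asr)
```

A keyword check of this kind is cheap but coarse: it counts any continuation lacking a refusal phrase as a success, whether or not the content is actually harmful.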

#### Dataset\.

(1) For training data, as indicated in Section [3.1](https://arxiv.org/html/2606.04168#S3.SS1), we use a triplet dataset $(\mathbf{x},\mathbf{r},\mathbf{h}) \sim D_{\operatorname{Safety}}$ for both the prefill attack and the random insertion attack. We use the dataset open-sourced by Qi et al. \[[17](https://arxiv.org/html/2606.04168#bib.bib4)\]. Specifically, this dataset consists of 256 harmful instructions $\mathbf{x}$, primarily from Ganguli et al. \[[7](https://arxiv.org/html/2606.04168#bib.bib12)\], and does not overlap with any safety evaluation dataset used in this paper. (2) The utility data $D_{\operatorname{Utility}}$ is obtained from the benign instructions in Alpaca \[[23](https://arxiv.org/html/2606.04168#bib.bib16)\], each paired with a response from the initial Llama-2-7B-Chat. This dataset aims to make the fine-tuned model preserve its original behavior on benign instructions. (3) For evaluation data, we use the HEx-PHI safety benchmark \[[17](https://arxiv.org/html/2606.04168#bib.bib4)\] for the experiments in Fig. [4](https://arxiv.org/html/2606.04168#S2.F4), Tab. [1](https://arxiv.org/html/2606.04168#S3.T1), and Tab. [3](https://arxiv.org/html/2606.04168#S3.T3). This dataset contains 330 harmful instructions spanning 11 categories, and we evaluate whether the model produces compliant responses to these harmful requests.

#### Candidate refusal bank $\mathcal{R}$\.

For adversarial safety alignment with random worst-insertion, we use $\mathcal{R} = \{\text{``I cannot fulfill''},\ \text{``I apologize, but I cannot''},\ \text{``I am unable''},\ \text{``I'm sorry''}\}$. We intentionally keep this bank small to avoid introducing significant additional computational overhead.

#### Additional details for experiments in Tab\. [2](https://arxiv.org/html/2606.04168#S3.T2)\.

To evaluate ASR against these attacks, we use the official HarmBench codebase \[[13](https://arxiv.org/html/2606.04168#bib.bib17)\], which provides both the attack optimization procedures and the final evaluation protocol. We keep all settings at their default values and evaluate on the HarmBench dataset. Since evaluating optimization-based attacks is computationally expensive, we use the subset of 200 standard harmful behaviors. For the aligned model, we randomly select one of the three models trained with adversarial safety alignment using random worst-insertion from the experiments in Tab. [1](https://arxiv.org/html/2606.04168#S3.T1).
