FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

arXiv cs.CL Papers

Summary

This paper introduces FragileFlow, a plug-in regularizer that improves the robustness of LLMs and VLMs by controlling 'correct-but-fragile' predictions through spectral analysis and PAC-Bayes bounds.

arXiv:2605.08896v1 Announce Type: new Abstract: Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.

Cached at: 05/12/26, 07:06 AM

# FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness
Source: [https://arxiv.org/html/2605.08896](https://arxiv.org/html/2605.08896)
Zhuoyun Li, Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong

School of Computer Science and Informatics, University of Liverpool, UK

###### Abstract

Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce *FragileFlow*, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.

## 1 Introduction

Foundation models have evolved from simple text generators into central components of high-stakes decision-making Bommasani et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib43)); Brown et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib44)); Radford et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib36)); Hu et al. ([2026a](https://arxiv.org/html/2605.08896#bib.bib3)). In these practical applications, robustness is an important requirement, as a minor visual distraction or a subtle linguistic perturbation can easily alter a critical recommendation Wang et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib46)); Zhu et al. ([2024](https://arxiv.org/html/2605.08896#bib.bib47)); Mao et al. ([2023](https://arxiv.org/html/2605.08896#bib.bib48)); Hu et al. ([2026b](https://arxiv.org/html/2605.08896#bib.bib70)); Schlarmann et al. ([2024](https://arxiv.org/html/2605.08896#bib.bib49)). Consequently, a reliable model must not only perform well on clean inputs but also remain stable under various perturbations.

Existing robust adaptation methods have made important progress toward this goal. Adversarial training exposes models to difficult perturbations, while regularized fine-tuning methods encourage local smoothness or consistency between clean and perturbed inputs Goodfellow et al. ([2015](https://arxiv.org/html/2605.08896#bib.bib50)); Aghajanyan et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib35)); Madry et al. ([2018](https://arxiv.org/html/2605.08896#bib.bib6)); Zhang et al. ([2019](https://arxiv.org/html/2605.08896#bib.bib7)); Jiang et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib8)); Wu et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib53)). In LLMs, this is often implemented through adversarial, KL-based, noise-based, or trust-region regularization Zhu et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib54)); Jiang et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib8)); Aghajanyan et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib35)); in VLMs, robustness is commonly improved by perturbing visual inputs, tuning prompts, or adapting lightweight components Mao et al. ([2023](https://arxiv.org/html/2605.08896#bib.bib48)); Zhang et al. ([2024](https://arxiv.org/html/2605.08896#bib.bib55)); Schlarmann et al. ([2024](https://arxiv.org/html/2605.08896#bib.bib49)); Wang et al. ([2025](https://arxiv.org/html/2605.08896#bib.bib56)). However, two gaps remain. First, most objectives still optimize an average notion of robustness: they ask whether accuracy, loss, or consistency improves on average, but they do not track how probability mass moves among wrong options before the final prediction fails Xu et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib15)); Benz et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib58)); Tian et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib42)); Li and Liu ([2023](https://arxiv.org/html/2605.08896#bib.bib60)). This misses a fragile regime in which the predicted option remains correct, yet perturbations already push substantial probability toward systematic wrong competitors. Second, many practical robust adaptation methods are supported mainly by empirical improvements, while their objectives are not tied to a generalization bound for the specific fragile behavior they aim to reduce Neyshabur et al. ([2018](https://arxiv.org/html/2605.08896#bib.bib22)); Lotfi et al. ([2022](https://arxiv.org/html/2605.08896#bib.bib62)); Hu et al. ([2025](https://arxiv.org/html/2605.08896#bib.bib69)); Jin et al. ([2025](https://arxiv.org/html/2605.08896#bib.bib66)). As a result, a model may look stable under standard aggregate metrics while hiding structured vulnerabilities.

To address this empirical and theoretical gap, we study this correct-but-fragile regime through finite-option LLM and VLM tasks. This setting gives a controlled interface: the model’s distribution over candidate options is directly observable, and each wrong option has a clear semantic meaning. Instead of asking only whether the final answer flips, we ask where the off-class probability goes, which true options are most vulnerable, and whether the resulting error pattern is scattered or structured. Based on this observation, we introduce FragileFlow, a lightweight plug-in regularizer for robust adaptation. FragileFlow constructs a margin-aware error-flow matrix under perturbation. The margin buffer focuses on examples that are already misclassified or still correct but close to the decision boundary, while the spectral penalty suppresses coherent probability flow from true options toward recurring wrong competitors. In this way, FragileFlow targets a failure pattern that average robustness objectives can easily miss.

Crucially, we provide the theoretical foundation that existing empirical methods lack. While PAC-Bayes analysis has been used for standard generalization and begun to extend to modern neural networks and language models Jin et al. ([2025](https://arxiv.org/html/2605.08896#bib.bib66)); Nagarajan and Kolter ([2019](https://arxiv.org/html/2605.08896#bib.bib64)), we establish, to our knowledge, the first PAC-Bayes spectral control framework tailored for margin-aware error flow. Our bound rigorously connects the empirical spectral error-flow term to the population’s vulnerable worst-class risk. Under a stated logit-stability condition, this demonstrates that FragileFlow is not merely a heuristic add-on, but is mathematically aligned with deterministic worst-class robustness at test time, directly bridging the gap between empirical adaptation and provable generalization. Our main contributions are threefold:

- We formalize the “correct-but-fragile” failure mode in robust adaptation, showing how models can silently leak probability mass toward systematic wrong competitors near the decision boundary before the final prediction fails.
- We introduce *FragileFlow*, a margin-aware spectral plug-in regularizer that suppresses structured vulnerable probability flow and can be attached to existing adaptation objectives.
- We bridge the theoretical gap in robust adaptation by establishing the first PAC-Bayes spectral control route for this error-flow object. We further show its connection to deterministic worst-class risk and validate its effectiveness across both LLM and VLM settings.

## 2 Methodology

### 2.1 Prediction and Perturbation Formulation

We now develop FragileFlow in the order shown in Fig. [1](https://arxiv.org/html/2605.08896#S2.F1). We first set up the prediction interface and the perturbation notation for a fixed adapted model $\theta$; each input is assigned one label from a fixed set of $K$ candidate options. Randomized adaptation parameters are introduced later in Section [2.3](https://arxiv.org/html/2605.08896#S2.SS3).

###### Definition 2.1 (Finite-option prediction).

Let $\mathcal{D}$ be a distribution over input–label pairs $(x,y)$ with $y\in\{1,\ldots,K\}:=[K]$. Let $S=\{(x_r,y_r)\}_{r=1}^{m}$ be a finite sample drawn from $\mathcal{D}$. For a model $\theta$, each input $x$ induces a score vector $s_\theta(x)\in\mathbb{R}^{K}$, where $s_\theta(x,k)$ is the score assigned to option $k$. The induced option distribution is $p_\theta(k\mid x):=\frac{\exp(s_\theta(x,k))}{\sum_{k'=1}^{K}\exp(s_\theta(x,k'))}$. The deterministic predictor used for evaluation is $\hat{y}^{\mathrm{det}}_\theta(x):=\arg\max_{k\in[K]}s_\theta(x,k)$.
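As a concrete reading of Definition 2.1, the following minimal sketch computes the option distribution and the deterministic predictor from one score vector. It uses plain Python with 0-based option indices (the paper indexes options from 1), and the names `option_distribution` and `predict_det` are illustrative rather than taken from the paper.

```python
import math

def option_distribution(scores):
    """Softmax over the K option scores, shifted by the max score for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_det(scores):
    """Deterministic predictor: index of the highest-scoring option (first index on ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical scores for one input with K = 4 candidate options.
scores = [2.0, 1.0, 0.5, -1.0]
probs = option_distribution(scores)
prediction = predict_det(scores)
```

Because softmax is monotone in the scores, the arg-max of `probs` coincides with `predict_det(scores)`.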

###### Definition 2.2 (Option margin).

For an example $(x,y)$, the option margin is $\Delta_\theta(x,y):=s_\theta(x,y)-\max_{k\neq y}s_\theta(x,k)$. A positive margin means that the correct option is selected by the deterministic predictor. A negative margin means the predictor selects a wrong option. A small positive margin means that the prediction is still correct, but a wrong option is close to overtaking the true one.
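The three regimes described above can be illustrated with a short sketch. The function name `option_margin`, the 0-based label index, and the example score vectors are assumptions made for illustration only.

```python
def option_margin(scores, y):
    """Score of the true option y minus the best score among the wrong options."""
    best_wrong = max(s for k, s in enumerate(scores) if k != y)
    return scores[y] - best_wrong

# Hypothetical score vectors with K = 3 options.
confident = option_margin([3.0, 0.5, 0.2], y=0)  # large positive margin: correct and far from the boundary
fragile = option_margin([1.0, 0.9, 0.2], y=0)    # small positive margin: correct but easy to flip
wrong = option_margin([1.0, 0.9, 0.2], y=1)      # negative margin: a wrong option is predicted
```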

###### Definition 2.3 (Perturbed distribution and sample).

For each input $x$, let $\mathcal{U}(x)$ be the allowed perturbation set. A perturbation rule $\Pi(\cdot\mid x,y)$ is supported on $\mathcal{U}(x)$ and may represent either random or adversarially selected perturbations. The perturbed distribution $\mathcal{D}'$ is induced by drawing $(x,y)\sim\mathcal{D}$ and then drawing $x'\sim\Pi(\cdot\mid x,y)$. Given the finite sample $S$, the corresponding perturbed sample is $S':=\{(x'_r,y_r)\}_{r=1}^{m},\quad x'_r\sim\Pi(\cdot\mid x_r,y_r)$.

![Refer to caption](https://arxiv.org/html/2605.08896v1/x1.png)

Figure 1: Overview of FragileFlow. The figure summarizes the pipeline from finite-option prediction to margin-aware error flow, PAC-Bayes control, deterministic stability, and the final plug-in objective. Section tags indicate where each component is defined.

### 2.2 Margin-aware Error Flow

Perturbations can change more than the final predicted option. They can also change where the model places probability among the wrong options. We first record this off-option allocation with an error-flow matrix.

###### Definition 2.4 (Ungated error-flow matrix).

For a fixed model $\theta$, the ungated population error-flow matrix under $\mathcal{D}'$ is

$$(M_\theta^{\mathcal{D}'})_{ij}:=\mathbb{E}_{(x',y)\sim\mathcal{D}'}\left[p_\theta(i\mid x')\mid y=j\right],\quad i\neq j,\qquad (M_\theta^{\mathcal{D}'})_{jj}:=0. \tag{1}$$

Rows index the receiving option $i$, and columns index the ground-truth option $j$. The zero diagonal removes probability mass assigned to the correct option. This matrix tells us where wrong-option probability goes, but not how dangerous that probability is. A wrong option receiving extra probability is less concerning when the true option is far ahead. It is more concerning when the true option is only slightly ahead, because a small additional shift can flip the prediction.
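A minimal sketch of an empirical counterpart of Eq. 1 is given below, assuming a finite list of perturbed score vectors with labels. Options are 0-indexed, and the names `softmax` and `error_flow_matrix` are hypothetical. Leaving the column of an option with no samples at zero is an assumption of this sketch, since the class-conditional mean is undefined there.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def error_flow_matrix(score_rows, labels, num_options):
    """M[i][j]: mean probability placed on option i over samples whose true option is j (i != j)."""
    K = num_options
    M = [[0.0] * K for _ in range(K)]
    counts = [0] * K
    for scores, j in zip(score_rows, labels):
        p = softmax(scores)
        counts[j] += 1
        for i in range(K):
            if i != j:  # the diagonal stays zero: mass on the true option is not an error flow
                M[i][j] += p[i]
    for j in range(K):
        if counts[j] > 0:  # columns of absent options are left at zero (assumption)
            for i in range(K):
                M[i][j] /= counts[j]
    return M

# Hypothetical perturbed scores for three samples and K = 3 options.
score_rows = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, 0, 1]
M = error_flow_matrix(score_rows, labels, num_options=3)
```

Each column sum equals the average off-option probability for that true option, so column $j$ sums to one minus the mean probability assigned to option $j$ on its own samples.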

###### Definition 2.5 (Margin gate).

Given a safety buffer $\gamma\geq 0$ and a temperature $\kappa>0$, the smooth margin gate is

$$g_{\gamma,\kappa}^{\theta}(x',j):=\sigma\left(\frac{\gamma-\Delta_\theta(x',j)}{\kappa}\right)\in(0,1), \tag{2}$$

where $\sigma(\cdot)$ is the logistic sigmoid. This gate is a differentiable version of $\mathbf{1}\{\Delta_\theta(x',j)\leq\gamma\}$. It is large when the perturbed example is already misclassified or lies inside the safety buffer, and small when the true option has a clear margin. Intuitively, $\gamma$ draws a band around the decision boundary. Examples inside this band are not all wrong, but they are easy to flip. We therefore weight the off-option probability in Eq. [1](https://arxiv.org/html/2605.08896#S2.E1) by this margin gate.
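The behavior of the gate in Eq. 2 can be checked numerically with a small sketch. The function name `margin_gate` and the values of `gamma` and `kappa` are illustrative assumptions, not settings reported in the paper.

```python
import math

def margin_gate(margin, gamma, kappa):
    """Smooth gate sigma((gamma - margin) / kappa), evaluated in a numerically stable way."""
    z = (gamma - margin) / kappa
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

gamma, kappa = 0.5, 0.1  # hypothetical buffer and temperature
gate_misclassified = margin_gate(-1.0, gamma, kappa)  # negative margin: gate close to 1
gate_on_buffer_edge = margin_gate(0.5, gamma, kappa)  # margin equal to gamma: gate is 0.5
gate_confident = margin_gate(5.0, gamma, kappa)       # clear margin: gate close to 0
```

A smaller `kappa` makes the gate approach the hard indicator $\mathbf{1}\{\Delta\leq\gamma\}$, while a larger `kappa` spreads the transition over a wider range of margins.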

###### Definition 2.6 (Margin-aware error-flow matrix).

The margin-aware population error-flow matrix is

$$(M_\theta^{\mathcal{D}',\gamma})_{ij}:=\mathbb{E}_{(x',y)\sim\mathcal{D}'}\left[g_{\gamma,\kappa}^{\theta}(x',j)\,p_\theta(i\mid x')\mid y=j\right],\quad i\neq j,\qquad (M_\theta^{\mathcal{D}',\gamma})_{jj}:=0. \tag{3}$$

Larger values of $\gamma$ include a wider safety band, so the matrix becomes more conservative. We quantify the risk of the phenomenon:

###### Definition 2.7 (Vulnerable risks).

For a fixed model $\theta$, the vulnerable worst-option risk is $\mathrm{VWR}_\gamma(\theta;\mathcal{D}'):=\|M_\theta^{\mathcal{D}',\gamma}\|_1$. The vulnerable spectral risk is $\mathrm{VSR}_\gamma(\theta;\mathcal{D}'):=\|M_\theta^{\mathcal{D}',\gamma}\|_2$.

Here $\|\cdot\|_1$ denotes the induced matrix $1$-norm, i.e., the maximum column sum, and $\|\cdot\|_2$ denotes the spectral norm. The first quantity asks which ground-truth option leaks the most vulnerable probability mass to wrong options. The second asks whether this leakage is structured rather than scattered. For example, several true options may drift toward the same wrong option, or a group of options may become mutually confusable. The standard norm conversion gives $\mathrm{VWR}_\gamma(\theta;\mathcal{D}')\leq\sqrt{K}\,\mathrm{VSR}_\gamma(\theta;\mathcal{D}')$ (see Appendix [A.2](https://arxiv.org/html/2605.08896#A1.SS2)). Thus, controlling the spectral risk gives a conservative way to reduce the worst-option vulnerable readout while also penalizing coherent error-flow patterns.
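The difference between the two norms can be seen on two hand-built matrices. The sketch below is illustrative only: the matrices are hypothetical, the function names are not from the paper, and the spectral norm is estimated by power iteration (which approaches the true value from below) instead of an exact SVD. The all-ones start vector assumes non-negative entries, which holds for error-flow matrices.

```python
import math

def vulnerable_worst_option_risk(M):
    """Induced matrix 1-norm: the largest column sum of absolute entries."""
    rows, cols = len(M), len(M[0])
    return max(sum(abs(M[i][j]) for i in range(rows)) for j in range(cols))

def vulnerable_spectral_risk(M, iters=200):
    """Spectral norm estimated by power iteration on M^T M."""
    rows, cols = len(M), len(M[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

# Rows are receiving options, columns are true options (K = 3).
# Scattered: true options 1 and 2 leak to each other.
scattered = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.2], [0.0, 0.2, 0.0]]
# Coherent: true options 1 and 2 both leak to the same wrong option 0.
coherent = [[0.0, 0.2, 0.2], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

vwr_scattered = vulnerable_worst_option_risk(scattered)
vwr_coherent = vulnerable_worst_option_risk(coherent)
vsr_scattered = vulnerable_spectral_risk(scattered)
vsr_coherent = vulnerable_spectral_risk(coherent)
```

Both matrices have the same maximum column sum of 0.2, but the coherent one has spectral norm $0.2\sqrt{2}\approx 0.283$ against 0.2 for the scattered one. This is the sense in which the spectral risk penalizes several true options drifting toward one competitor.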

### 2.3 PAC-Bayes Spectral Control with Randomized Adaptation

The previous definitions describe the vulnerable error-flow object on the perturbed population $\mathcal{D}'$. In practice, we only observe a finite perturbed sample $S'$. For generalization, we need to relate the empirical matrix on $S'$ to the population risk on $\mathcal{D}'$. We use PAC-Bayes analysis because it gives a direct way to separate the observed empirical term from the complexity of the adapted model.

###### Definition 2.8 (Randomized adaptation).

Let $w\in\mathbb{R}^{d_{\mathrm{train}}}$ denote the trainable adaptation coordinates, and write $\theta(w):=\mathcal{T}(\theta_0,w)$, where $\theta_0$ is fixed and $\mathcal{T}$ specifies how $w$ is inserted into the predictor. Let $P$ be a prior over $w$ chosen before observing the sample, and let $Q$ be a posterior after observing the sample. We write $\mu:=\mathbb{E}_{\widetilde{w}\sim Q}[\widetilde{w}]$, and the deterministic adapted model used for evaluation is $\theta(\mu)$.

For the analysis, a sampled coordinate $\widetilde{w}\sim Q$ induces a model $\theta(\widetilde{w})$. We use the corresponding option distribution $p_{\theta(\widetilde{w})}(\cdot\mid x)$ to define a Gibbs predictor. This randomized predictor is only an analysis device; the empirical results still use the deterministic model $\theta(\mu)$. The connection between these two predictors is handled in Section [2.4](https://arxiv.org/html/2605.08896#S2.SS4).

###### Definition 2.9 (Posterior-averaged vulnerable risks).

We fix the perturbation protocol before the PAC-Bayes analysis, so $\mathcal{D}'$ and $S'$ are the population and empirical perturbed objects. The remaining randomness comes from $\widetilde{w}\sim Q$. The posterior-averaged margin-aware matrices are $\bar{M}_{\mathcal{D}',\gamma}^{Q}:=\mathbb{E}_{\widetilde{w}\sim Q}\left[M_{\theta(\widetilde{w})}^{\mathcal{D}',\gamma}\right]$ and $\bar{M}_{S',\gamma}^{Q}:=\mathbb{E}_{\widetilde{w}\sim Q}\left[M_{\theta(\widetilde{w})}^{S',\gamma}\right]$. The posterior vulnerable worst-option risk and spectral risk are $\mathrm{VWR}_\gamma(Q;\mathcal{D}'):=\|\bar{M}_{\mathcal{D}',\gamma}^{Q}\|_1$ and $\mathrm{VSR}_\gamma(Q;\mathcal{D}'):=\|\bar{M}_{\mathcal{D}',\gamma}^{Q}\|_2$, respectively. We also write $\mathrm{VSR}_\gamma(Q;S'):=\|\bar{M}_{S',\gamma}^{Q}\|_2$ for the empirical posterior spectral risk.

The following theorem gives the empirical-to-population step. It shows that a small empirical spectral error-flow signal on $S'$ controls the population vulnerable worst-option risk, up to an adaptation-complexity term.

###### Theorem 2.10 (PAC-Bayes control of margin-aware vulnerable risk).

Fix $\gamma\geq 0$ and $\delta\in(0,1)$. Let $m_j$ be the number of option-$j$ samples used in the option-conditional empirical matrix, and let $m_{\min}:=\min_{j\in[K]}m_j\geq 1$. Condition on the realized option counts $\{m_j\}_{j=1}^{K}$ and on the perturbation protocol that induces $\mathcal{D}'$ and $S'$. Then, with probability at least $1-\delta$ over the draw of $S$ and the induced perturbed sample $S'$, for all posteriors $Q$ simultaneously,

$$\mathrm{VWR}_\gamma(Q;\mathcal{D}')\leq\sqrt{K}\,\mathrm{VSR}_\gamma(Q;S')+2\sqrt{\frac{2K\left(\mathrm{KL}(Q\|P)+2K\ln 9+\ln\frac{2}{\delta}\right)}{m_{\min}}}. \tag{4}$$

The theorem gives the control chain used by our method. The first term is the empirical vulnerable spectral risk measured on $S'$, which is the quantity later targeted by the plug-in regularizer. The second term is the price of adaptation: it grows with $\mathrm{KL}(Q\|P)$ and decreases with the smallest option-wise sample size $m_{\min}$. The buffer $\gamma$ changes which examples contribute to the vulnerable matrix, but it does not add a separate complexity term. The proof is given in Appendix [A.3](https://arxiv.org/html/2605.08896#A1.SS3).
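To get a feel for how the two terms of Eq. 4 trade off, the right-hand side can be evaluated directly. The sketch below is a plain transcription of the formula; the function name `pac_bayes_vwr_bound` and all numeric inputs are hypothetical and do not correspond to values reported in the paper.

```python
import math

def pac_bayes_vwr_bound(vsr_empirical, kl, num_options, m_min, delta):
    """Right-hand side of Eq. 4: sqrt(K) * empirical VSR plus the adaptation-complexity term."""
    K = num_options
    inner = kl + 2.0 * K * math.log(9.0) + math.log(2.0 / delta)
    complexity = 2.0 * math.sqrt(2.0 * K * inner / m_min)
    return math.sqrt(K) * vsr_empirical + complexity

# Hypothetical setting: K = 4 options, KL(Q||P) = 10, confidence level delta = 0.05.
bound_small_sample = pac_bayes_vwr_bound(0.05, kl=10.0, num_options=4, m_min=500, delta=0.05)
bound_large_sample = pac_bayes_vwr_bound(0.05, kl=10.0, num_options=4, m_min=5000, delta=0.05)
bound_larger_kl = pac_bayes_vwr_bound(0.05, kl=20.0, num_options=4, m_min=5000, delta=0.05)
```

The complexity term shrinks at rate $1/\sqrt{m_{\min}}$, so the rarest option in the sample governs how tight the bound can be.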

### 2.4 From Posterior Risk to Deterministic Worst-class Risk

Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10) controls a posterior-averaged risk, while the model used at test time is the deterministic adapted model $\theta(\mu)$. We now connect these two objects through a simple logit-stability condition.

###### Definition 2.11 (Deterministic worst-class risk).

For a deterministic model $\theta$, the perturbed worst-class risk is $\mathrm{WCR}^{\mathrm{det}}(\theta;\mathcal{D}'):=\max_{j\in[K]}\Pr_{(x',y)\sim\mathcal{D}'}(\hat{y}^{\mathrm{det}}_\theta(x')\neq j\mid y=j)$.

This is the hard-error quantity used for downstream evaluation. It differs from $\mathrm{VWR}_\gamma$, which measures gated off-option probability mass rather than hard prediction errors.

###### Definition 2.12 (Posterior logit stability).

For a posterior sample $\widetilde{w}\sim Q$, define the largest option-score shift from the posterior mean model as

$$\Xi_Q(\mu,\widetilde{w}):=\sup_{(x',y)\in\mathrm{supp}(\mathcal{D}')}\max_{k\in[K]}\left|s_{\theta(\widetilde{w})}(x',k)-s_{\theta(\mu)}(x',k)\right|. \tag{5}$$

We say that the posterior mean is $(\gamma,\rho)$-stable if $\Pr_{\widetilde{w}\sim Q}(\Xi_Q(\mu,\widetilde{w})\leq\gamma/2)\geq 1-\rho$.

The stability condition says that most posterior samples stay close to the mean model in option-score space. On the stable event $\Xi_Q(\mu,\widetilde{w})\leq\gamma/2$, every option score changes by at most $\gamma/2$, so the margin changes by at most $\gamma$. The same buffer $\gamma$ therefore has two roles: it defines the vulnerable band in the gate, and it also absorbs bounded posterior score fluctuations around the mean model. As a result, when the mean model makes a deterministic error on a perturbed example, every stable posterior sample remains inside the $\gamma$-vulnerable region counted by the gate. Only the unstable $\rho$-probability event can escape this accounting. The detailed case analysis is shown in Appendix [A.4](https://arxiv.org/html/2605.08896#A1.SS4).
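The step from a score shift of at most $\gamma/2$ to a margin shift of at most $\gamma$ can be written out as follows. This is a short sketch of that single step, using only Definitions 2.2 and 2.12 and the fact that a maximum moves by no more than the largest change among its arguments; it is not the paper's full case analysis. For any $(x',y)$ in the support of $\mathcal{D}'$ and any posterior sample on the stable event,

$$
\begin{aligned}
\left|\Delta_{\theta(\widetilde{w})}(x',y)-\Delta_{\theta(\mu)}(x',y)\right|
&\leq \left|s_{\theta(\widetilde{w})}(x',y)-s_{\theta(\mu)}(x',y)\right|
+\left|\max_{k\neq y}s_{\theta(\widetilde{w})}(x',k)-\max_{k\neq y}s_{\theta(\mu)}(x',k)\right| \\
&\leq \Xi_Q(\mu,\widetilde{w})+\max_{k\neq y}\left|s_{\theta(\widetilde{w})}(x',k)-s_{\theta(\mu)}(x',k)\right| \\
&\leq 2\,\Xi_Q(\mu,\widetilde{w})\leq\gamma .
\end{aligned}
$$

In particular, if the mean model errs on $(x',y)$, so that $\Delta_{\theta(\mu)}(x',y)\leq 0$, then $\Delta_{\theta(\widetilde{w})}(x',y)\leq\gamma$, which places the sampled model inside the region that the hard indicator $\mathbf{1}\{\Delta\leq\gamma\}$ counts.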

###### Proposition 2.13 (Deterministic bridge under logit stability).

Suppose the posterior mean is $(\gamma,\rho)$-stable. Also suppose that the gate satisfies $g_{\gamma,\kappa}^{\theta(\widetilde{w})}(x',j)\geq\eta$ whenever $\Delta_{\theta(\widetilde{w})}(x',j)\leq\gamma$, for some $\eta>0$. Then

$$\mathrm{WCR}^{\mathrm{det}}(\theta(\mu);\mathcal{D}')\leq\eta^{-1}(1+e^{\gamma})\,\mathrm{VWR}_\gamma(Q;\mathcal{D}')+\rho.$$

The proof is given in Appendix [A.5](https://arxiv.org/html/2605.08896#A1.SS5)-[A.6](https://arxiv.org/html/2605.08896#A1.SS6). Together, Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10) and Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13) give the control route used in this paper. The empirical spectral term controls the posterior vulnerable risk, and logit stability transfers this control to the deterministic model evaluated at test time. In the experiments, perturbed worst-class accuracy is the downstream readout, while $\mathrm{VWR}_\gamma$ and $\mathrm{VSR}_\gamma$ are the theory-facing quantities.
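The bridge in Proposition 2.13 can be evaluated numerically once a value for the posterior vulnerable risk is available. In the sketch below, the function name and the inputs are hypothetical. The default `eta=0.5` is an observation of this sketch and not a setting stated in the text: for the sigmoid gate of Eq. 2, a margin of at most $\gamma$ gives a non-negative gate argument and hence a gate value of at least $\sigma(0)=1/2$.

```python
import math

def deterministic_wcr_bound(vwr_posterior, gamma, rho, eta=0.5):
    """Right-hand side of Proposition 2.13: (1 + e^gamma) / eta * VWR + rho."""
    return (1.0 + math.exp(gamma)) / eta * vwr_posterior + rho

# Hypothetical inputs: posterior vulnerable risk 0.02, buffer 0.5, instability probability 0.01.
bound = deterministic_wcr_bound(vwr_posterior=0.02, gamma=0.5, rho=0.01)
bound_wider_buffer = deterministic_wcr_bound(vwr_posterior=0.02, gamma=1.0, rho=0.01)
```

For a fixed value of the vulnerable risk, the factor $(1+e^{\gamma})$ grows with the buffer, so a wider buffer loosens the conversion from probability mass to hard errors even though it makes the stability condition easier to satisfy.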

### 2.5 Plug-in Spectral Safety Control

The theory suggests a direct plug-in principle: reduce the empirical vulnerable spectral risk, while encouraging local logit stability under input and coordinate perturbations. We use the same trainable coordinate notation $w$ as in Definition [2.8](https://arxiv.org/html/2605.08896#S2.Thmtheorem8).

###### Definition 2.14 (Plug-in objective).

Let $w$ denote the current trainable coordinates, and let $\mathcal{L}_{\mathrm{base}}$ be a standard training objective. For a fixed safety buffer $\gamma$, we optimize

$$\mathcal{L}_{\mathrm{total}}(w):=\mathcal{L}_{\mathrm{base}}(w)+\alpha\mathcal{R}_{\mathrm{spec}}(w;\gamma)+\beta\mathcal{R}_{\mathrm{stab}}(w), \tag{6}$$

where $\alpha,\beta\geq 0$ are validation-selected weights. Here $\mathcal{R}_{\mathrm{spec}}$ targets the empirical spectral term in Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10), while $\mathcal{R}_{\mathrm{stab}}$ is an output-level proxy for the logit-stability condition in Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13).

###### Definition 2.15 (Mini-batch spectral regularizer).

Given a paired mini-batch, let $B_j:=\{r\in B:y_r=j\}$. For every option $j$ with $|B_j|>0$, we form the batch margin-aware estimator

$$(\widetilde{M}_w^{B,\gamma})_{ij}:=\frac{1}{|B_j|}\sum_{r\in B_j}g_{\gamma,\kappa}^{\theta(w)}(x'_r,j)\,p_{\theta(w)}(i\mid x'_r),\quad i\neq j,\qquad (\widetilde{M}_w^{B,\gamma})_{jj}:=0.$$

Then,

$$\mathcal{R}_{\mathrm{spec}}(w;\gamma):=\widehat{\mathrm{VSR}}_\gamma(w;B):=\left\|\widetilde{M}_w^{B,\gamma}\right\|_2. \tag{7}$$

If option $j$ is absent from the mini-batch, its column is skipped for that update. This term suppresses the dominant structured mode of vulnerable probability flow; implementation details are given in Appendix [A.8](https://arxiv.org/html/2605.08896#A1.SS8).
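A minimal sketch of the value of the mini-batch regularizer in Eq. 7 is shown below, computed from perturbed score vectors. It is not the authors' implementation: it uses plain Python without automatic differentiation, so it returns only the value and no gradient, all function names and numeric settings are hypothetical, and the spectral norm is estimated by power iteration. Absent options are represented by zero columns, which leaves the spectral norm unchanged relative to dropping them.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def batch_error_flow(score_rows, labels, K, gamma, kappa):
    """Gated batch estimator: M[i][j] averages gate * p(i | x') over samples with label j."""
    M = [[0.0] * K for _ in range(K)]
    counts = [0] * K
    for scores, j in zip(score_rows, labels):
        p = softmax(scores)
        margin = scores[j] - max(s for k, s in enumerate(scores) if k != j)
        gate = sigmoid((gamma - margin) / kappa)
        counts[j] += 1
        for i in range(K):
            if i != j:
                M[i][j] += gate * p[i]
    for j in range(K):
        if counts[j] > 0:  # absent options keep a zero column
            for i in range(K):
                M[i][j] /= counts[j]
    return M

def spectral_norm(M, iters=200):
    """Largest singular value by power iteration (assumes non-negative entries)."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [sum(M[i][j] * u[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in u))

def spectral_regularizer(score_rows, labels, K, gamma=0.5, kappa=0.1):
    return spectral_norm(batch_error_flow(score_rows, labels, K, gamma, kappa))

# Two hypothetical batches with the same labels: one confident, one correct but fragile.
r_confident = spectral_regularizer([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0]], [0, 1], K=3)
r_fragile = spectral_regularizer([[1.0, 0.9, 0.0], [0.9, 1.0, 0.0]], [0, 1], K=3)
```

Both batches are classified correctly, yet the fragile batch yields a much larger penalty because its margins fall inside the buffer and the gate stays open.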

###### Definition 2.16 (Local stability regularizer).

Let $\widetilde{Q}_w$ be a local perturbation distribution centered at $w$, and draw $\tilde{w}\sim\widetilde{Q}_w$. For the same paired mini-batch $B$, define

$$\mathcal{R}_{\mathrm{stab}}(w):=\mathbb{E}_{\tilde{w}\sim\widetilde{Q}_w}\left[\frac{1}{|B|}\sum_{r\in B}\mathrm{KL}\left(\operatorname{sg}\!\left[p_{\theta(w)}(\cdot\mid x_r)\right]\,\middle\|\,p_{\theta(\tilde{w})}(\cdot\mid x'_r)\right)\right], \tag{8}$$

where $\operatorname{sg}[\cdot]$ denotes stop-gradient. This term penalizes prediction changes caused jointly by input perturbation and local coordinate noise. Its coordinate-noise component discourages large centered logit and margin shifts, making it a practical proxy for reducing the stability-failure probability $\rho$ in Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13). See Appendix [A.10](https://arxiv.org/html/2605.08896#A1.SS10).
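The quantity inside the expectation of Eq. 8 can be sketched for a single draw of coordinate noise. The code below assumes the clean scores of $\theta(w)$ and the perturbed scores of $\theta(\tilde{w})$ have already been computed, and all names and numbers are hypothetical. Stop-gradient has no effect on the value itself; in an automatic-differentiation framework it would correspond to detaching the clean reference distribution.

```python
import math

def log_softmax(scores):
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def kl_from_scores(reference_scores, other_scores):
    """KL(p_reference || p_other) between the softmax distributions of two score vectors."""
    log_p = log_softmax(reference_scores)
    log_q = log_softmax(other_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def stability_penalty(clean_scores, noisy_perturbed_scores):
    """Batch mean of KL(clean || noisy-perturbed) for one draw of coordinate noise."""
    pairs = list(zip(clean_scores, noisy_perturbed_scores))
    return sum(kl_from_scores(c, n) for c, n in pairs) / len(pairs)

# Hypothetical batch of two examples with K = 3 options.
clean = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]
mild_shift = [[1.9, 0.6, 0.1], [0.3, 1.7, 0.3]]
strong_shift = [[0.5, 2.0, 0.1], [1.8, 0.3, 0.2]]

penalty_mild = stability_penalty(clean, mild_shift)
penalty_strong = stability_penalty(clean, strong_shift)
```

Averaging this value over several noise draws would give a Monte Carlo estimate of the expectation; how many draws are used is not specified in this section.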

## 3 Experiments

### 3.1 Experimental setup

To validate the theoretical control route and test its practical effect, we evaluate FragileFlow under a unified finite-option prediction protocol across two perturbation channels. In LLM tasks, perturbations modify the question text; in VLM tasks, perturbations modify the image input. In both cases, the model produces a distribution over candidate options, so the same margin-aware error-flow matrix, calibrated buffer, and worst-class readouts can be used. This lets us test whether the proposed risk-control mechanism transfers across text-side and image-side robustness settings.

Models and tasks. For LLMs, we evaluate Qwen2.5-0.5B-Instruct and Qwen2.5-1.5B-Instruct Qwen Team ([2025](https://arxiv.org/html/2605.08896#bib.bib28), [2024](https://arxiv.org/html/2605.08896#bib.bib41)), and Mistral-7B-Instruct-v0.2 Jiang et al. ([2023](https://arxiv.org/html/2605.08896#bib.bib29)); Mistral AI ([2023](https://arxiv.org/html/2605.08896#bib.bib30)) on ARC-Challenge Clark et al. ([2018](https://arxiv.org/html/2605.08896#bib.bib31)) and CommonsenseQA Talmor et al. ([2019](https://arxiv.org/html/2605.08896#bib.bib32)). These multiple-choice tasks expose the full verbalizer distribution, so the class-conditional error-flow matrix can be formed directly. We test three task-preserving perturbations: typo noise, distractor insertion, and format rewriting, following text robustness and behavioral testing protocols Morris et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib33)); Ribeiro et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib34)). For VLMs, we evaluate CLIP ViT-B/32 Radford et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib36)) following Schlarmann et al. ([2024](https://arxiv.org/html/2605.08896#bib.bib49)) with LoRA adaptation Hu et al. ([2022](https://arxiv.org/html/2605.08896#bib.bib37)) on DTD Cimpoi et al. ([2014](https://arxiv.org/html/2605.08896#bib.bib38)), OxfordPets Parkhi et al. ([2012](https://arxiv.org/html/2605.08896#bib.bib39)), and Caltech101 Fei-Fei et al. ([2007](https://arxiv.org/html/2605.08896#bib.bib40)). Robustness is measured on PGD-perturbed test images Madry et al. ([2018](https://arxiv.org/html/2605.08896#bib.bib6)), following the adversarial LoRA adaptation protocol of Ghiasvand et al. ([2025](https://arxiv.org/html/2605.08896#bib.bib27)).

Baselines and plug-in placement. For LLMs, we attach FragileFlow to cross-entropy training (CE) with augmentation Morris et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib33)), R3F Aghajanyan et al. ([2021](https://arxiv.org/html/2605.08896#bib.bib35)), and SMART Jiang et al. ([2020](https://arxiv.org/html/2605.08896#bib.bib8)), covering data-level augmentation, randomized embedding smoothness, and local smoothness regularization. Plain CE is included as an unpaired reference. For VLMs, we apply the plug-in to the *inner* PGD step, the *outer* model update, or *both*, separating whether the regularizer shapes adversarial-example construction, regularizes the adapted model, or combines the two effects.

Metrics and protocol. We report clean accuracy, perturbed worst-class accuracy, $\widehat{\mathrm{VWR}}_{\gamma}$, and $\widehat{\mathrm{VSR}}_{\gamma}$; the VLM table additionally reports perturbed average accuracy and clean worst-class accuracy. Higher accuracy is better, while lower $\widehat{\mathrm{VWR}}_{\gamma}$ and $\widehat{\mathrm{VSR}}_{\gamma}$ indicate less margin-vulnerable probability flow. All paired comparisons use the same data split, perturbation, random seeds, and calibrated buffer.

### 3.2 Safety-buffer calibration

The buffer $\gamma$ determines which perturbed examples are counted as margin-vulnerable by the error-flow matrix. Since raw logit margins vary across models and tasks, we do not fix $\gamma$ as a global numeric value. Instead, for each setting, we compute margins on a held-out perturbed validation split and set $\gamma=\gamma_{q}$, the $q$-th quantile of the validation-margin distribution. We sweep $q\in\{0.10,0.25,0.50\}$ and compare each base learner with its plug-in counterpart under the same calibrated buffer.
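The quantile calibration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `logit_margin`, `quantile`, and `calibrate_buffer` are hypothetical, the margin is assumed to be the true-option logit minus the strongest competing logit, and the quantile is assumed to use linear interpolation over all held-out perturbed examples. The paper's exact margin definition may differ.

```python
def logit_margin(logits, label):
    """Assumed margin: true-option logit minus the strongest competitor."""
    competitor = max(s for k, s in enumerate(logits) if k != label)
    return logits[label] - competitor


def quantile(values, q):
    """Linear-interpolation quantile for q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def calibrate_buffer(val_logits, val_labels, q=0.25):
    """Return gamma_q from margins on a held-out perturbed validation split."""
    margins = [logit_margin(s, y) for s, y in zip(val_logits, val_labels)]
    return quantile(margins, q)


# Hypothetical toy validation split: five perturbed examples, three options.
val_logits = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 2.0],
]
val_labels = [0, 1, 0, 0, 2]
gamma_25 = calibrate_buffer(val_logits, val_labels, q=0.25)
```

Because the buffer is a quantile rather than a fixed number, the same value of $q$ selects a comparable fraction of low-margin examples even when the logit scale differs between models.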

Figure [2](https://arxiv.org/html/2605.08896#S3.F2) shows the calibration behavior on the LLM settings. The last two columns report $\widehat{\mathrm{VWR}}_{\gamma}$ and $\sigma_{\max}$, where $\sigma_{\max}=\widehat{\mathrm{VSR}}_{\gamma}$ represents the empirical spectral norm of the margin-aware error-flow matrix. As $q$ increases, the buffer covers a wider vulnerable region, so more perturbed examples and off-option probability mass are counted; the measured vulnerable risks therefore increase by design. The meaningful comparison is within the same $q$, where the plug-in consistently moves the operating point toward lower vulnerable flow. The intermediate choice $q=0.25$ gives the most stable trade-off: it preserves clean accuracy, maintains or improves perturbed worst-class accuracy, and reduces the two margin-aware risk measures. We therefore use $\gamma_{25}$, the 25th percentile of held-out perturbed validation margins, in all remaining experiments.

![Refer to caption](https://arxiv.org/html/2605.08896v1/x2.png)

Figure 2: Safety-buffer calibration for LLM robustness. Results are averaged over three LLMs and two datasets. Dashed lines are base learners, and solid curves are plug-in counterparts. Larger $q$ widens the vulnerable region, so risk values increase by construction.

### 3.3 Main results: reducing vulnerable error flow

With the safety buffer fixed to $\gamma_{25}$, we evaluate whether the plug-in realizes the control mechanism predicted by the theory and whether this control translates into practical robustness gains. A limited sweep over $(\alpha,\beta)$ is reported in Appendix [D.4](https://arxiv.org/html/2605.08896#A4.SS4); in the main experiments, we use one shared default setting across models, datasets, and base learners.

Table [1](https://arxiv.org/html/2605.08896#S3.T1) reports the LLM results. Across the three models, two datasets, and three robust adaptation objectives, FragileFlow reduces both $\widehat{\mathrm{VWR}}_{\gamma}$ and $\widehat{\mathrm{VSR}}_{\gamma}$ in every paired comparison. This directly verifies the intended mechanism: under the same calibrated buffer, the plug-in compresses the dominant margin-vulnerable error-flow structure rather than merely changing average accuracy. The downstream metrics show the same trend in most cases. Perturbed worst-class accuracy improves in most paired comparisons, while clean accuracy remains comparable to the corresponding base learner. In the few cases where perturbed worst-class accuracy does not increase, we keep the results unfiltered and still observe lower values for both controlled risk quantities. This separation is expected: the plug-in directly targets vulnerable probability flow, while its conversion into deterministic worst-class accuracy also depends on the stability bridge in Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13) and on the optimization behavior of the base learner.

Table 1: Main LLM results at the calibrated buffer $\gamma_{25}$. Results are reported over three paired seeds. Perturbed-side metrics average over typo noise, distractor insertion, and format rewriting; clean accuracy is measured on the original test set. Lower is better for $\widehat{\mathrm{VWR}}_{\gamma}$ and $\widehat{\mathrm{VSR}}_{\gamma}$; higher is better for accuracy. Bold indicates improvement over the paired base learner.

| Model | Dataset | Method | $\widehat{\mathrm{VWR}}_{\gamma}$ ↓ (base) | $\widehat{\mathrm{VWR}}_{\gamma}$ ↓ (+plug) | $\widehat{\mathrm{VSR}}_{\gamma}$ ↓ (base) | $\widehat{\mathrm{VSR}}_{\gamma}$ ↓ (+plug) | Ptb WC Acc ↑ (base) | Ptb WC Acc ↑ (+plug) | Clean Acc ↑ (base) | Clean Acc ↑ (+plug) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-0.5B | ARC-C | CE | *49.64 ± 1.73* | – | *45.06 ± 2.02* | – | *26.40 ± 2.17* | – | *52.33 ± 2.52* | – |
| Qwen-0.5B | ARC-C | CE+aug | 48.07 ± 1.55 | 38.46 ± 1.42 | 43.34 ± 1.47 | 34.26 ± 2.15 | 38.58 ± 8.92 | 38.31 ± 8.00 | 54.20 ± 3.41 | 53.87 ± 3.19 |
| Qwen-0.5B | ARC-C | R3F | 49.68 ± 1.76 | 39.45 ± 1.07 | 45.00 ± 2.07 | 35.64 ± 0.54 | 26.40 ± 2.21 | 28.09 ± 3.34 | 52.13 ± 2.44 | 53.93 ± 1.89 |
| Qwen-0.5B | ARC-C | SMART | 53.70 ± 3.68 | 36.83 ± 4.00 | 46.85 ± 0.84 | 34.11 ± 2.55 | 22.04 ± 2.42 | 27.73 ± 2.82 | 52.00 ± 0.53 | 53.13 ± 0.61 |
| Qwen-0.5B | CSQA | CE | *40.85 ± 2.88* | – | *36.73 ± 0.70* | – | *28.56 ± 2.82* | – | *57.87 ± 2.32* | – |
| Qwen-0.5B | CSQA | CE+aug | 38.32 ± 3.46 | 29.30 ± 2.73 | 34.02 ± 2.17 | 25.79 ± 1.56 | 40.33 ± 6.61 | 43.22 ± 7.15 | 59.27 ± 1.92 | 59.73 ± 2.81 |
| Qwen-0.5B | CSQA | R3F | 40.84 ± 2.46 | 32.03 ± 2.78 | 36.95 ± 1.13 | 28.59 ± 1.25 | 28.67 ± 2.56 | 28.22 ± 2.72 | 57.53 ± 2.55 | 57.73 ± 2.02 |
| Qwen-0.5B | CSQA | SMART | 39.22 ± 2.83 | 31.47 ± 2.53 | 34.80 ± 1.60 | 27.45 ± 1.50 | 30.22 ± 2.37 | 32.89 ± 2.24 | 59.33 ± 2.60 | 60.80 ± 2.78 |
| Qwen-1.5B | ARC-C | CE | *30.87 ± 2.29* | – | *27.83 ± 2.57* | – | *48.98 ± 4.94* | – | *74.13 ± 2.39* | – |
| Qwen-1.5B | ARC-C | CE+aug | 30.34 ± 4.12 | 26.46 ± 1.52 | 26.30 ± 1.58 | 23.77 ± 1.25 | 63.84 ± 4.07 | 68.62 ± 3.69 | 76.20 ± 1.71 | 76.93 ± 0.42 |
| Qwen-1.5B | ARC-C | R3F | 30.95 ± 2.19 | 28.43 ± 2.63 | 27.85 ± 2.57 | 26.33 ± 3.14 | 47.89 ± 5.03 | 48.80 ± 4.97 | 74.33 ± 2.25 | 74.20 ± 2.62 |
| Qwen-1.5B | ARC-C | SMART | 31.61 ± 1.04 | 29.09 ± 1.89 | 27.49 ± 0.45 | 25.40 ± 1.07 | 50.18 ± 5.91 | 52.71 ± 5.80 | 75.00 ± 0.53 | 74.93 ± 0.70 |
| Qwen-1.5B | CSQA | CE | *28.12 ± 1.39* | – | *24.92 ± 1.93* | – | *45.67 ± 6.16* | – | *73.20 ± 0.92* | – |
| Qwen-1.5B | CSQA | CE+aug | 25.20 ± 0.89 | 21.87 ± 1.25 | 21.84 ± 0.51 | 18.86 ± 0.93 | 62.22 ± 1.77 | 65.67 ± 1.80 | 75.40 ± 0.72 | 75.13 ± 0.76 |
| Qwen-1.5B | CSQA | R3F | 27.82 ± 1.48 | 24.46 ± 1.60 | 24.71 ± 1.78 | 21.64 ± 2.17 | 44.67 ± 6.28 | 45.89 ± 6.44 | 73.53 ± 0.81 | 73.40 ± 1.06 |
| Qwen-1.5B | CSQA | SMART | 27.57 ± 2.42 | 23.41 ± 1.78 | 24.25 ± 1.46 | 20.26 ± 1.18 | 49.78 ± 6.27 | 50.67 ± 5.52 | 74.80 ± 2.51 | 74.87 ± 2.91 |
| Mistral-7B | ARC-C | CE | *31.41 ± 4.52* | – | *27.13 ± 1.71* | – | *52.18 ± 4.97* | – | *75.33 ± 0.64* | – |
| Mistral-7B | ARC-C | CE+aug | 32.91 ± 4.15 | 29.60 ± 4.76 | 28.09 ± 1.34 | 25.96 ± 2.01 | 61.18 ± 4.21 | 65.96 ± 6.26 | 75.80 ± 1.25 | 76.93 ± 3.56 |
| Mistral-7B | ARC-C | R3F | 31.32 ± 3.77 | 30.99 ± 2.81 | 26.99 ± 1.47 | 25.80 ± 1.12 | 50.73 ± 4.33 | 51.20 ± 4.34 | 75.20 ± 0.53 | 75.07 ± 0.46 |
| Mistral-7B | ARC-C | SMART | 52.44 ± 5.96 | 43.60 ± 5.65 | 41.88 ± 5.33 | 36.11 ± 5.93 | 26.67 ± 9.98 | 36.09 ± 8.82 | 68.67 ± 3.90 | 71.73 ± 4.32 |
| Mistral-7B | CSQA | CE | *26.56 ± 3.27* | – | *22.13 ± 0.50* | – | *51.89 ± 8.69* | – | *76.40 ± 0.92* | – |
| Mistral-7B | CSQA | CE+aug | 25.90 ± 2.06 | 23.34 ± 1.62 | 20.81 ± 0.51 | 18.57 ± 0.93 | 60.78 ± 4.81 | 64.33 ± 4.23 | 77.33 ± 1.51 | 77.67 ± 1.60 |
| Mistral-7B | CSQA | R3F | 26.18 ± 3.92 | 24.04 ± 4.26 | 21.94 ± 1.13 | 20.02 ± 1.62 | 51.22 ± 8.55 | 52.00 ± 8.84 | 76.87 ± 0.76 | 76.67 ± 0.81 |
| Mistral-7B | CSQA | SMART | 37.73 ± 3.25 | 30.37 ± 0.93 | 31.78 ± 2.00 | 26.21 ± 0.87 | 34.56 ± 6.84 | 38.89 ± 5.43 | 69.33 ± 0.31 | 70.67 ± 0.31 |

Table [2](https://arxiv.org/html/2605.08896#S3.T2) provides a cross-modal test of the same mechanism. Here the perturbation acts on the image input through PGD, rather than on the text-side question or verbalizer interface. Despite this different perturbation channel, FragileFlow again lowers $\widehat{\mathrm{VWR}}_{\gamma}$ and $\widehat{\mathrm{VSR}}_{\gamma}$ across datasets and plug-in placements. Clean accuracy is largely preserved, and the additional VLM metrics show that the gains are not confined to the proposed risk scores: perturbed average accuracy and worst-class readouts remain stable or improve in most settings.

Table 2: Cross-modal VLM results on CLIP ViT-B/32. Results are reported for 16-shot LoRA adversarial adaptation over three seeds, with robustness evaluated by 100-step PGD at $\varepsilon=1.0/255$. *Inner*, *outer*, and *both* denote where the plug-in is applied during adaptation. Lower is better for $\widehat{\mathrm{VWR}}_{\gamma}$ and $\widehat{\mathrm{VSR}}_{\gamma}$; higher is better for accuracy metrics.

| Dataset | Plugin type | $\widehat{\mathrm{VWR}}_{\gamma}$ ↓ | $\widehat{\mathrm{VSR}}_{\gamma}$ ↓ | Clean Acc ↑ | Ptb Acc ↑ | Clean WC ↑ | Ptb WC ↑ |
|---|---|---|---|---|---|---|---|
| DTD | – | 28.33 ± 0.38 | 38.97 ± 4.80 | **67.57 ± 0.93** | 22.66 ± 1.12 | 28.70 ± 3.46 | 0.00 |
| DTD | Inner | 27.22 ± 0.47 | 37.47 ± 5.04 | 67.53 ± 1.04 | **22.73 ± 1.10** | 27.78 ± 4.54 | 0.00 |
| DTD | Outer | 26.91 ± 0.59 | 36.25 ± 4.83 | 67.55 ± 0.68 | 22.56 ± 1.19 | **29.63 ± 4.72** | 0.00 |
| DTD | Both | **26.89 ± 0.56** | **36.15 ± 4.34** | 67.38 ± 0.43 | 22.28 ± 1.07 | 29.63 ± 3.46 | 0.00 |
| OxfordPets | – | 63.53 ± 2.09 | 58.81 ± 5.48 | 89.52 ± 0.08 | 20.34 ± 1.14 | 54.00 ± 2.45 | 0.00 |
| OxfordPets | Inner | 63.27 ± 2.15 | 58.02 ± 5.94 | 89.38 ± 0.19 | 20.72 ± 1.15 | 54.00 ± 2.94 | **0.33 ± 0.47** |
| OxfordPets | Outer | 61.60 ± 2.81 | 57.12 ± 5.65 | 89.79 ± 0.34 | 20.70 ± 1.04 | **56.67 ± 3.40** | **0.33 ± 0.47** |
| OxfordPets | Both | **61.53 ± 2.79** | **56.82 ± 5.66** | **89.82 ± 0.44** | **20.82 ± 1.33** | 56.33 ± 1.25 | **0.33 ± 0.47** |
| Caltech101 | – | 47.30 ± 8.61 | 22.20 ± 3.74 | 95.04 ± 0.25 | 64.60 ± 2.72 | 42.22 ± 3.14 | **5.19 ± 4.09** |
| Caltech101 | Inner | 45.40 ± 7.20 | **20.79 ± 3.61** | **95.06 ± 0.30** | 64.72 ± 2.60 | 40.00 ± 3.44 | **5.19 ± 4.09** |
| Caltech101 | Outer | 46.14 ± 9.77 | 20.90 ± 4.31 | 94.97 ± 0.32 | 64.72 ± 2.96 | **42.78 ± 2.83** | 5.00 ± 3.60 |
| Caltech101 | Both | **45.23 ± 8.80** | 20.89 ± 4.14 | 94.90 ± 0.26 | **64.75 ± 2.99** | 42.22 ± 3.14 | **5.19 ± 4.09** |

The placement results further clarify how the regularizer acts\. Applying the plug\-in to the inner PGD step shapes adversarial\-example construction, whereas applying it to the outer update directly regularizes the adapted model\. These placements lead to different downstream trade\-offs, but the common pattern is stable: under a fixed calibrated buffer, FragileFlow reduces concentrated vulnerable error flow while preserving clean utility across both text\-side and image\-side robustness settings\.

### 3.4 Mechanistic ablation: spectral compression and stability

Table 3: Compact stability-term ablation summary. Values are average shifts relative to the corresponding non-plug reference. Risk changes are relative; accuracy changes are absolute percentage points.

| Setting | Objective | $\Delta\widehat{\mathrm{VSR}}_{\gamma}$ ↓ | $\Delta\widehat{\mathrm{VWR}}_{\gamma}$ ↓ | $\Delta\mathrm{WC}$ ↑ | $\Delta\mathrm{PtbAcc}$ ↑ | $\Delta\mathrm{CleanAcc}$ ↑ |
|---|---|---|---|---|---|---|
| LLM | $R_{\mathrm{spec}}$ only ($\beta=0$) | −13.02% | −13.14% | +0.48 pp | +0.44 pp | +0.42 pp |
| LLM | $R_{\mathrm{spec}}+\beta R_{\mathrm{stab}}$ | −13.97% | −14.37% | +1.52 pp | +0.68 pp | +0.57 pp |
| VLM | $R_{\mathrm{spec}}$ only ($\beta=0$) | −5.21% | −3.35% | +0.24 pp | +0.33 pp | −0.08 pp |
| VLM | $R_{\mathrm{spec}}+\beta R_{\mathrm{stab}}$ | −6.47% | −4.35% | +0.28 pp | +0.28 pp | +0.02 pp |

![Refer to caption](https://arxiv.org/html/2605.08896v1/x3.png)

Figure 3: Spectral compression with and without stability. Each panel plots $\sigma_{\max}=\widehat{\mathrm{VSR}}_{\gamma}$ against $\widehat{\mathrm{VWR}}_{\gamma}/\sqrt{K}$. Centroids summarize the base and plug-in runs under the spectral-only and composite objectives.

We finally isolate the two components of FragileFlow. The spectral term $R_{\mathrm{spec}}$ directly penalizes the dominant mode of the margin-aware error-flow matrix, measured by $\sigma_{\max}=\widehat{\mathrm{VSR}}_{\gamma}$. The stability term $R_{\mathrm{stab}}$ is not meant to replace this spectral mechanism; it is added to make the controlled risk less sensitive to local coordinate perturbations, which is the condition used by the deterministic bridge in Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13).
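To make the quantity $\sigma_{\max}$ concrete, the following Python sketch builds a toy class-wise vulnerable-flow matrix and computes its largest singular value by power iteration. It is an illustration under assumptions rather than the paper's definition: the function names are hypothetical, rows are assumed to index the true class and columns the competing class, an example is assumed to be vulnerable when its margin falls below $\gamma$, and each row is assumed to be normalized by the number of examples of that class. The paper's formal construction may use a different normalization or weighting.

```python
import math


def vulnerable_flow_matrix(probs, labels, margins, gamma, num_classes):
    """Assumed construction: average off-class probability mass per true class,
    counted only on examples whose margin is below the buffer gamma."""
    counts = [0] * num_classes
    flow = [[0.0] * num_classes for _ in range(num_classes)]
    for p, y, m in zip(probs, labels, margins):
        counts[y] += 1
        if m < gamma:  # margin-vulnerable example
            for j in range(num_classes):
                if j != y:
                    flow[y][j] += p[j]
    for i in range(num_classes):
        if counts[i] > 0:
            flow[i] = [v / counts[i] for v in flow[i]]
    return flow


def spectral_norm(matrix, iters=200):
    """Largest singular value via power iteration on M^T M.
    The uniform start vector suits entrywise non-negative matrices."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0  # zero matrix: no vulnerable flow was recorded
        v = [x / norm_w for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


# Hypothetical toy batch: two class-0 examples, one of them inside the buffer.
probs = [[0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]
labels = [0, 0]
margins = [0.3, 0.85]
flow = vulnerable_flow_matrix(probs, labels, margins, gamma=0.5, num_classes=3)
sigma_max = spectral_norm(flow)
```

A large $\sigma_{\max}$ under this construction indicates that vulnerable mass is concentrated along a dominant true-class-to-competitor direction, which is the structure a spectral penalty is meant to compress.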

Figure [3](https://arxiv.org/html/2605.08896#S3.F3) visualizes this mechanism. Each panel plots $\sigma_{\max}$ against the normalized vulnerable worst-class mass $\widehat{\mathrm{VWR}}_{\gamma}/\sqrt{K}$. In both LLM and VLM settings, the spectral-only objective ($\beta=0$) already moves the plug-in centroid toward the lower-left region, showing that $R_{\mathrm{spec}}$ alone compresses the intended vulnerable-flow geometry. The composite objective $R_{\mathrm{spec}}+\beta R_{\mathrm{stab}}$ preserves the same direction rather than changing the mechanism, which supports the intended complementary role of the stability term.

Table [3](https://arxiv.org/html/2605.08896#S3.T3) shows the aggregate shifts relative to the corresponding non-plug reference; risk metrics are reported as *relative* changes, and accuracy metrics as *absolute* percentage-point changes. Detailed per-setting LLM and VLM results are reported in Appendix Tables [7](https://arxiv.org/html/2605.08896#A4.T7) and [8](https://arxiv.org/html/2605.08896#A4.T8). On LLMs, $R_{\mathrm{spec}}$ alone substantially reduces the two risk measures, while the larger downstream worst-class gain appears after adding $R_{\mathrm{stab}}$. On VLMs, both variants reduce the risk measures under PGD evaluation, with the composite objective giving the stronger risk reduction. These results show the intended functional separation: $R_{\mathrm{spec}}$ compresses the vulnerable error-flow structure, while $R_{\mathrm{stab}}$ improves the reliability of its transfer to evaluation accuracy.
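The two reporting conventions can be illustrated with a short Python sketch. The helper names are hypothetical, and the inputs are the mean values of a single Table 1 pair (Qwen-0.5B, ARC-C, CE+aug). Table 3 averages such shifts over the ablation runs, so this one pair is not expected to reproduce any Table 3 entry.

```python
def relative_change(base, plug):
    """Relative change in percent, the convention used for risk metrics."""
    return 100.0 * (plug - base) / base


def absolute_change(base, plug):
    """Absolute change in percentage points, the convention used for accuracy."""
    return plug - base


# Mean values from Table 1: Qwen-0.5B, ARC-C, CE+aug (base vs. +plug).
vwr_shift = relative_change(48.07, 38.46)   # roughly a 20% relative reduction
acc_shift = absolute_change(54.20, 53.87)   # roughly a 0.33 pp drop
```

Reporting risk as a relative change keeps shifts comparable across settings whose baseline risk levels differ widely, while percentage points keep accuracy shifts directly readable against the tables.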

Experiment summary. Taken together, the experiments support the proposed risk-control chain: the calibrated buffer defines a stable vulnerable region, the plug-in reduces the targeted error-flow risks, and the ablations confirm the complementary roles of $R_{\mathrm{spec}}$ and $R_{\mathrm{stab}}$. This improves worst-class robustness while largely preserving clean utility.

## 4 Related Work

### 4.1 Robustness to Input Perturbations

Robustness to input perturbations is commonly studied through adversarial training, consistency regularization, and flatness-based objectives. Adversarial training improves worst-case robustness by training on perturbed examples (Goodfellow et al., [2015](https://arxiv.org/html/2605.08896#bib.bib50); Madry et al., [2018](https://arxiv.org/html/2605.08896#bib.bib6); Zhang et al., [2019](https://arxiv.org/html/2605.08896#bib.bib7)), while SMART, R3F, adversarial weight perturbation, and sharpness-aware training stabilize predictions under input, embedding, or parameter perturbations (Jiang et al., [2020](https://arxiv.org/html/2605.08896#bib.bib8); Aghajanyan et al., [2021](https://arxiv.org/html/2605.08896#bib.bib35); Wu et al., [2020](https://arxiv.org/html/2605.08896#bib.bib53); Foret et al., [2021](https://arxiv.org/html/2605.08896#bib.bib10)). These objectives are effective, but they usually aggregate robustness across samples or losses. Recent work extends these ideas to LLMs and VLMs. For LLMs, prompt perturbations, distractors, typos, and instruction edits motivate perturbation-aware fine-tuning and prompt-consistency learning (Qiang et al., [2024](https://arxiv.org/html/2605.08896#bib.bib11); Gupta et al., [2024](https://arxiv.org/html/2605.08896#bib.bib12); Agrawal et al., [2025](https://arxiv.org/html/2605.08896#bib.bib13)). For VLMs, robust adaptation often perturbs visual inputs while tuning lightweight components such as adapters, LoRA modules, or vision encoders (Schlarmann et al., [2024](https://arxiv.org/html/2605.08896#bib.bib49); Ghiasvand et al., [2025](https://arxiv.org/html/2605.08896#bib.bib27); Oh et al., [2024](https://arxiv.org/html/2605.08896#bib.bib4)). A related line studies class-wise and worst-class robustness. Prior work shows that adversarial training can create large robustness disparities across classes (Xu et al., [2021](https://arxiv.org/html/2605.08896#bib.bib15); Tian et al., [2021](https://arxiv.org/html/2605.08896#bib.bib42)). BAT, CFA, and WAT reduce such disparities through class balancing, class-specific adversarial configurations, or direct worst-class optimization (Sun et al., [2023](https://arxiv.org/html/2605.08896#bib.bib16); Wei et al., [2023](https://arxiv.org/html/2605.08896#bib.bib17); Li and Liu, [2023](https://arxiv.org/html/2605.08896#bib.bib60)). Confusional spectral regularization further regularizes hard adversarial confusion for worst-class robust fairness (Jin et al., [2025](https://arxiv.org/html/2605.08896#bib.bib66)). FragileFlow shares the weakest-class motivation, but controls a margin-aware probability-flow matrix in LLM/VLM adaptation, capturing where probability mass moves before a prediction fails.

### 4.2 PAC-Bayes for Neural Networks

PAC-Bayes theory is a classical framework for connecting empirical performance with population risk through a posterior–prior complexity term (McAllester, [1999](https://arxiv.org/html/2605.08896#bib.bib19); Catoni, [2007](https://arxiv.org/html/2605.08896#bib.bib20)). For neural networks, it has been used to analyze generalization through flatness, compression, margin-based complexity, and parameter perturbations (Dziugaite and Roy, [2017](https://arxiv.org/html/2605.08896#bib.bib21); Neyshabur et al., [2018](https://arxiv.org/html/2605.08896#bib.bib22); Arora et al., [2018](https://arxiv.org/html/2605.08896#bib.bib23)). Recent work has gradually extended this framework to larger models and robust learning. Compression-based analyses suggest that pretrained or adapted LLMs can have much smaller effective complexity than their raw parameter count (Lotfi et al., [2024](https://arxiv.org/html/2605.08896#bib.bib24); Li et al., [2026](https://arxiv.org/html/2605.08896#bib.bib65)). PAC-driven fine-tuning further uses posterior perturbations to guide pretrained language model adaptation (Liu et al., [2023](https://arxiv.org/html/2605.08896#bib.bib67)). Robust PAC-Bayes analyses connect bound minimization to adversarial or certified robustness (Wang et al., [2023](https://arxiv.org/html/2605.08896#bib.bib25); Jin et al., [2025](https://arxiv.org/html/2605.08896#bib.bib66)). Our work is inspired by this line of analysis, but uses PAC-Bayes for a different purpose. Rather than bounding standard generalization error, compression behavior, fine-tuning performance, or average robust risk, we analyze the margin-aware error-flow matrix induced by finite-option LLM/VLM adaptation under perturbation. This lets us move beyond a purely empirical regularizer: the same object optimized by FragileFlow also appears in a PAC-Bayes control route for vulnerable worst-class risk.

## 5 Conclusion

In this paper, we identified a problem that is often missed by average robustness evaluation: under perturbation, some classes can become fragile by consistently leaking probability mass toward recurring wrong competitors. We introduced *FragileFlow* to capture and reduce this vulnerable error flow, and provided a PAC-Bayes control route to connect the empirical regularized object to worst-class behavior at the population level. Experiments on LLM and VLM adaptation validate this view. More broadly, our results suggest that robustness should be studied not only through final prediction errors, but also through the structure that emerges before those errors occur. The probability-flow view gives a natural way to inspect how fragility concentrates across classes and points toward more interpretable robustness analysis. An immediate next step, which we are currently pursuing, is to understand how this form of regularization acts inside the model, including whether it suppresses fragile representation directions, changes internal error-flow pathways, or extends from finite-option prediction to generated tokens and semantic decision states.

## References

- \[1\]\(2021\)Better fine\-tuning by reducing representational collapse\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=OQ08SN70M1V)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p3.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[2\]A\. Agrawal, L\. Alazraki, S\. Honarvar, T\. Mensink, and M\. Rei\(2025\)Enhancing LLM robustness to perturbed instructions: an empirical study\.InICLR 2025 Workshop on Building Trust in Language Models and Applications,External Links:[Link](https://openreview.net/forum?id=abllmCsDp8)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[3\]S\. Arora, R\. Ge, B\. Neyshabur, and Y\. Zhang\(2018\)Stronger generalization bounds for deep nets via a compression approach\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 254–263\.External Links:[Link](https://proceedings.mlr.press/v80/arora18b.html)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[4\]P\. Benz, C\. Zhang, A\. Karjauv, and I\. S\. Kweon\(2021\-11 Dec\)Robustness may be at odds with fairness: an empirical study on class\-wise accuracy\.pp\. 325–342\.External Links:[Link](https://proceedings.mlr.press/v148/benz21a.html)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[5\]R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill, E\. Brynjolfsson, S\. Buch, D\. Card, R\. Castellon, N\. S\. Chatterji, A\. S\. Chen, K\. A\. Creel, J\. Davis, D\. Demszky, C\. Donahue, M\. Doumbouya, E\. Durmus, S\. Ermon, J\. Etchemendy, K\. Ethayarajh, L\. Fei\-Fei, C\. Finn, T\. Gale, L\. E\. Gillespie, K\. Goel, N\. D\. Goodman, S\. Grossman, N\. Guha, T\. Hashimoto, P\. Henderson, J\. Hewitt, D\. E\. Ho, J\. Hong, K\. Hsu, J\. Huang, T\. F\. Icard, S\. Jain, D\. Jurafsky, P\. Kalluri, S\. Karamcheti, G\. Keeling, F\. Khani, O\. Khattab, P\. W\. Koh, M\. S\. Krass, R\. Krishna, R\. Kuditipudi, A\. Kumar, F\. Ladhak, M\. Lee, T\. Lee, J\. Leskovec, I\. Levent, X\. L\. Li, X\. Li, T\. Ma, A\. Malik, C\. D\. Manning, S\. P\. Mirchandani, E\. Mitchell, Z\. Munyikwa, S\. Nair, A\. Narayan, D\. Narayanan, B\. Newman, A\. Nie, J\. C\. Niebles, H\. Nilforoshan, J\. F\. Nyarko, G\. Ogut, L\. Orr, I\. Papadimitriou, J\. S\. Park, C\. Piech, E\. Portelance, C\. Potts, A\. Raghunathan, R\. Reich, H\. Ren, F\. Rong, Y\. H\. Roohani, C\. Ruiz, J\. Ryan, C\. R’e, D\. Sadigh, S\. Sagawa, K\. Santhanam, A\. Shih, K\. P\. Srinivasan, A\. Tamkin, R\. Taori, A\. W\. Thomas, F\. Tramèr, R\. E\. Wang, W\. Wang, B\. Wu, J\. Wu, Y\. Wu, S\. M\. Xie, M\. Yasunaga, J\. You, M\. A\. Zaharia, M\. Zhang, T\. Zhang, X\. Zhang, Y\. Zhang, L\. Zheng, K\. Zhou, and P\. Liang\(2021\)On the opportunities and risks of foundation models\.ArXiv\.External Links:[Link](https://crfm.stanford.edu/assets/report.pdf)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.
- \[6\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei\(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 1877–1901\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.
- \[7\]O\. Catoni\(2007\)PAC\-bayesian supervised classification: the thermodynamics of statistical learning\.Institute of Mathematical Statistics,Beachwood, Ohio\.External Links:ISBN 9780940600720,[Link](https://www.jstor.org/stable/i20461497)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[8\]M\. Cimpoi, S\. Maji, I\. Kokkinos, S\. Mohamed, and A\. Vedaldi\(2014\)Describing textures in the wild\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 3606–3613\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2014.461),[Link](https://openaccess.thecvf.com/content_cvpr_2014/html/Cimpoi_Describing_Textures_in_2014_CVPR_paper.html)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[9\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have solved question answering? try ARC, the AI2 reasoning challenge\.External Links:1803\.05457,[Document](https://dx.doi.org/10.48550/arXiv.1803.05457),[Link](https://arxiv.org/abs/1803.05457)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[10\]G\. K\. Dziugaite and D\. M\. Roy\(2017\)Computing nonvacuous generalization bounds for deep \(stochastic\) neural networks with many more parameters than training data\.InProceedings of the Thirty\-Third Conference on Uncertainty in Artificial Intelligence,External Links:[Link](https://arxiv.org/abs/1703.11008)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[11\]L\. Fei\-Fei, R\. Fergus, and P\. Perona\(2007\)Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories\.Computer Vision and Image Understanding106\(1\),pp\. 59–70\.External Links:[Document](https://dx.doi.org/10.1016/j.cviu.2005.09.012),[Link](https://doi.org/10.1016/j.cviu.2005.09.012)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[12\]P\. Foret, A\. Kleiner, H\. Mobahi, and B\. Neyshabur\(2021\)Sharpness\-aware minimization for efficiently improving generalization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6Tm1mposlrM)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[13\]S\. Ghiasvand, H\. E\. Oskouie, M\. Alizadeh, and R\. Pedarsani\(2025\)Few\-shot adversarial low\-rank fine\-tuning of vision\-language models\.External Links:2505\.15130,[Link](https://arxiv.org/abs/2505.15130)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[14\]I\. J\. Goodfellow, J\. Shlens, and C\. Szegedy\(2015\)Explaining and harnessing adversarial examples\.In3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7\-9, 2015, Conference Track Proceedings,Y\. Bengio and Y\. LeCun \(Eds\.\),External Links:[Link](http://arxiv.org/abs/1412.6572)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[15\]V\. Gupta, P\. Pandya, T\. Kataria, V\. Gupta, and D\. Roth\(2024\)Evaluating concurrent robustness of language models across diverse challenge sets\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 22162–22184\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1237),[Link](https://aclanthology.org/2024.emnlp-main.1237/)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[16\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[17\]J\. Hu, Y\. Dong, Y\. Sun, and X\. Huang\(2026\-Mar\.\)Tapas are free\! training\-free adaptation of programmatic agents via llm\-guided program synthesis in dynamic environments\.Proceedings of the AAAI Conference on Artificial Intelligence40\(35\),pp\. 29477–29485\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40189),[Document](https://dx.doi.org/10.1609/aaai.v40i35.40189)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.
- \[18\]J\. Hu, X\. Huang, Y\. Sun, Y\. Dong, and X\. Huang\(2026\)Lying with truths: open\-channel multi\-agent collusion for belief manipulation via generative montage\.arXiv preprint arXiv:2601\.01685\.Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.
- \[19\]J\. Hu, Z\. Huang, X\. Yin, W\. Ruan, G\. Cheng, Y\. Dong, and X\. Huang\(2025\)FALCON: fine\-grained activation manipulation by contrastive orthogonal unalignment for large language model\.pp\. 102210–102232\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/9408564a4229f4a933ac9bd09a29ee96-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[20\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. Le Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. El Sayed\(2023\)Mistral 7b\.External Links:2310\.06825,[Document](https://dx.doi.org/10.48550/arXiv.2310.06825),[Link](https://arxiv.org/abs/2310.06825)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[21\]H\. Jiang, P\. He, W\. Chen, X\. Liu, J\. Gao, and T\. Zhao\(2020\)SMART: robust and efficient fine\-tuning for pre\-trained natural language models through principled regularized optimization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 2177–2190\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.197),[Link](https://aclanthology.org/2020.acl-main.197/)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p3.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[22\]G\. Jin, S\. Wu, J\. Liu, T\. Huang, and R\. Mu\(2025\)Enhancing robust fairness via confusional spectral regularization\.External Links:[Link](https://openreview.net/forum?id=lW0ZndAimF)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§1](https://arxiv.org/html/2605.08896#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[23\]B\. Li and W\. Liu\(2023\)WAT: improve the worst\-class robustness in adversarial training\.External Links:ISBN 978\-1\-57735\-880\-0,[Link](https://doi.org/10.1609/aaai.v37i12.26749),[Document](https://dx.doi.org/10.1609/aaai.v37i12.26749)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[24\]Z\. Li, B\. Wang, J\. Hu, Z\. Huang, Q\. He, X\. Huang, G\. Cheng, X\. Huang, and Y\. Dong\(2026\)Where do prompt perturbations break generation? a segment\-level view of robustness in lora\-tuned language models\.External Links:2605\.01605,[Link](https://arxiv.org/abs/2605.01605)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[25\]G\. Liu, Z\. Xue, X\. Zhang, K\. Johnson, and R\. Wang\(2023\-12\)PAC\-tuning: fine\-tuning pre\-trained language models with PAC\-driven perturbed gradient descent\.Singapore,pp\. 12178–12189\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.748/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.748)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[26\]S\. Lotfi, M\. A\. Finzi, S\. Kapoor, A\. Potapczynski, M\. Goldblum, and A\. G\. Wilson\(2022\)PAC\-bayes compression bounds so tight that they can explain generalization\.External Links:[Link](https://openreview.net/forum?id=o8nYuR8ekFm)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[27\]S\. Lotfi, M\. A\. Finzi, Y\. Kuang, T\. G\. J\. Rudner, M\. Goldblum, and A\. G\. Wilson\(2024\)Non\-vacuous generalization bounds for large language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 32801–32818\.External Links:[Link](https://proceedings.mlr.press/v235/lotfi24a.html)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[28\]A\. Madry, A\. Makelov, L\. Schmidt, D\. Tsipras, and A\. Vladu\(2018\)Towards deep learning models resistant to adversarial attacks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rJzIBfZAb)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[29\]C\. Mao, S\. Geng, J\. Yang, X\. Wang, and C\. Vondrick\(2023\)Understanding zero\-shot adversarial robustness for large\-scale models\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=P4bXCawRi5J)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1),[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[30\]D\. A\. McAllester\(1999\)PAC\-bayesian model averaging\.InProceedings of the Twelfth Annual Conference on Computational Learning Theory,COLT ’99,New York, NY, USA,pp\. 164–170\.External Links:[Document](https://dx.doi.org/10.1145/307400.307435),[Link](https://doi.org/10.1145/307400.307435)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[31\]Mistral AI\(2023\)Mistral\-7B\-Instruct\-v0\.2\.Note:[https://huggingface\.co/mistralai/Mistral\-7B\-Instruct\-v0\.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)Hugging Face model cardCited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[32\]J\. X\. Morris, E\. Lifland, J\. Y\. Yoo, J\. Grigsby, D\. Jin, and Y\. Qi\(2020\)TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,pp\. 119–126\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.16),[Link](https://aclanthology.org/2020.emnlp-demos.16/)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p3.1)\.
- \[33\]V\. Nagarajan and Z\. Kolter\(2019\)Deterministic PAC\-bayesian generalization bounds for deep networks via generalizing noise\-resilience\.External Links:[Link](https://openreview.net/forum?id=Hygn2o0qKX)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p4.1)\.
- \[34\]B\. Neyshabur, S\. Bhojanapalli, D\. McAllester, and N\. Srebro\(2018\)A PAC\-bayesian approach to spectrally\-normalized margin bounds for neural networks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Skz_WfbCZ)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[35\]C\. Oh, H\. Lim, M\. Kim, D\. Han, S\. Yun, J\. Choo, A\. Hauptmann, Z\. Cheng, and K\. Song\(2024\)Towards calibrated robust fine\-tuning of vision\-language models\.InAdvances in Neural Information Processing Systems,Vol\.37\.External Links:[Document](https://dx.doi.org/10.52202/079017-0403),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/173e4732a89fab9fb225203f35996677-Abstract-Conference.html)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[36\]O\. M\. Parkhi, A\. Vedaldi, A\. Zisserman, and C\. V\. Jawahar\(2012\)Cats and dogs\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 3498–3505\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2012.6248092),[Link](https://www.robots.ox.ac.uk/%CB%9Cvgg/publications/2012/parkhi12a/)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[37\]Y\. Qiang, S\. Nandi, N\. Mehrabi, G\. Ver Steeg, A\. Kumar, A\. Rumshisky, and A\. Galstyan\(2024\)Prompt perturbation consistency learning for robust language models\.InFindings of the Association for Computational Linguistics: EACL 2024,St\. Julian’s, Malta,pp\. 1357–1370\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-eacl.91),[Link](https://aclanthology.org/2024.findings-eacl.91/)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[38\]Qwen Team\(2024\)Qwen2\.5\-1\.5B\-Instruct\.Note:[https://huggingface\.co/Qwen/Qwen2\.5\-1\.5B\-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)Hugging Face model cardCited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[39\]Qwen Team\(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Document](https://dx.doi.org/10.48550/arXiv.2412.15115),[Link](https://arxiv.org/abs/2412.15115)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[40\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever\(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.External Links:[Link](https://proceedings.mlr.press/v139/radford21a.html)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[41\]M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh\(2020\)Beyond accuracy: behavioral testing of NLP models with CheckList\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 4902–4912\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.442),[Link](https://aclanthology.org/2020.acl-main.442/)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[42\]C\. Schlarmann, N\. D\. Singh, F\. Croce, and M\. Hein\(2024\-21–27 Jul\)Robust CLIP: unsupervised adversarial fine\-tuning of vision embeddings for robust large vision\-language models\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 43685–43704\.External Links:[Link](https://proceedings.mlr.press/v235/schlarmann24a.html)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1),[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[43\]C\. Sun, C\. Xu, C\. Yao, S\. Liang, Y\. Wu, D\. Liang, X\. Liu, and A\. Liu\(2023\)Improving robust fairness via balance adversarial training\.InProceedings of the Thirty\-Seventh AAAI Conference on Artificial Intelligence and Thirty\-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’23/IAAI’23/EAAI’23\.External Links:ISBN 978\-1\-57735\-880\-0,[Link](https://doi.org/10.1609/aaai.v37i12.26769),[Document](https://dx.doi.org/10.1609/aaai.v37i12.26769)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[44\]A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant\(2019\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4149–4158\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1421),[Link](https://aclanthology.org/N19-1421/)Cited by:[§3\.1](https://arxiv.org/html/2605.08896#S3.SS1.p2.1)\.
- \[45\]Q\. Tian, K\. Kuang, K\. Jiang, F\. Wu, and Y\. Wang\(2021\)Analysis and applications of class\-wise robustness in adversarial training\.InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining,KDD ’21,New York, NY, USA,pp\. 1561–1570\.External Links:ISBN 9781450383325,[Link](https://doi.org/10.1145/3447548.3467403),[Document](https://dx.doi.org/10.1145/3447548.3467403)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[46\]B\. Wang, C\. Xu, S\. Wang, S\. Wang, Z\. Gan, Y\. Cheng, J\. Gao, A\. Awadallah, and B\. Li\(2021\)Adversarial glue: a multi\-task benchmark for robustness evaluation of language models\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,J\. Vanschoren and S\. Yeung \(Eds\.\),Vol\.1,pp\.\.External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/335f5352088d7d9bf74191e006d8e24c-Paper-round2.pdf)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.
- \[47\]X\. Wang, K\. Chen, J\. Zhang, J\. Chen, and X\. Ma\(2025\)TAPT: test\-time adversarial prompt tuning for robust inference in vision\-language models\.pp\. 19910–19920\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52734.2025.01854)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[48\]Z\. Wang, N\. Ding, T\. Levinboim, X\. Chen, and R\. Soricut\(2023\)Improving robust generalization by direct PAC\-bayesian bound minimization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 16458–16468\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52729.2023.01579),[Link](https://openaccess.thecvf.com/content/CVPR2023/html/Wang_Improving_Robust_Generalization_by_Direct_PAC-Bayesian_Bound_Minimization_CVPR_2023_paper.html)Cited by:[§4\.2](https://arxiv.org/html/2605.08896#S4.SS2.p1.1)\.
- \[49\]Z\. Wei, Y\. Wang, Y\. Guo, and Y\. Wang\(2023\)CFA: class\-wise calibrated fair adversarial training\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 8193–8201\.External Links:[Link](https://openaccess.thecvf.com/content/CVPR2023/html/Wei_CFA_Class-Wise_Calibrated_Fair_Adversarial_Training_CVPR_2023_paper.html),[Document](https://dx.doi.org/10.1109/CVPR52729.2023.00792)Cited by:[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[50\]D\. Wu, S\. Xia, and Y\. Wang\(2020\)Adversarial weight perturbation helps robust generalization\.pp\. 2958–2969\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1ef91c212e30e14bf125e9374262401f-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[51\]H\. Xu, X\. Liu, Y\. Li, A\. K\. Jain, and J\. Tang\(2021\)To be robust or to be fair: towards fairness in adversarial training\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 11492–11501\.External Links:[Link](https://proceedings.mlr.press/v139/xu21b.html)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[52\]H\. Zhang, Y\. Yu, J\. Jiao, E\. Xing, L\. El Ghaoui, and M\. I\. Jordan\(2019\-09–15 Jun\)Theoretically principled trade\-off between robustness and accuracy\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 7472–7482\.External Links:[Link](https://proceedings.mlr.press/v97/zhang19p.html)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08896#S4.SS1.p1.1)\.
- \[53\]J\. Zhang, X\. Ma, X\. Wang, L\. Qiu, J\. Wang, Y\. Jiang, and J\. Sang\(2024\)Adversarial prompt tuning for vision\-language models\.Berlin, Heidelberg,pp\. 56–72\.External Links:ISBN 978\-3\-031\-72994\-2,[Link](https://doi.org/10.1007/978-3-031-72995-9_4),[Document](https://dx.doi.org/10.1007/978-3-031-72995-9%5F4)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[54\]C\. Zhu, Y\. Cheng, Z\. Gan, S\. Sun, T\. Goldstein, and J\. Liu\(2020\)FreeLB: enhanced adversarial training for natural language understanding\.External Links:[Link](https://openreview.net/forum?id=BygzbyHFvB)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p2.1)\.
- \[55\]K\. Zhu, J\. Wang, J\. Zhou, Z\. Wang, H\. Chen, Y\. Wang, L\. Yang, W\. Ye, Y\. Zhang, N\. Gong, and X\. Xie\(2024\)PromptRobust: towards evaluating the robustness of large language models on adversarial prompts\.InProceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis,LAMPS ’24,New York, NY, USA,pp\. 57–68\.External Links:ISBN 9798400712098,[Link](https://doi.org/10.1145/3689217.3690621),[Document](https://dx.doi.org/10.1145/3689217.3690621)Cited by:[§1](https://arxiv.org/html/2605.08896#S1.p1.1)\.

## Appendix A Proofs and Additional Details for Section [2](https://arxiv.org/html/2605.08896#S2)

### A.1 Proof of margin-stability

###### Proof\.

Let $y\in[K]$ be the true class and define $c(x):=\max_{k\neq y}s_{\theta}(x,k)$ and $c(x'):=\max_{k\neq y}s_{\theta}(x',k)$. By definition, $\Delta_{\theta}(x,y)=s_{\theta}(x,y)-c(x)$ and $\Delta_{\theta}(x',y)=s_{\theta}(x',y)-c(x')$. Assume

$$\max_{k\in[K]}\left|s_{\theta}(x,k)-s_{\theta}(x',k)\right|\leq\varepsilon.$$

Then $|s_{\theta}(x,y)-s_{\theta}(x',y)|\leq\varepsilon$. Moreover, for every $k\neq y$, we have $s_{\theta}(x',k)\leq s_{\theta}(x,k)+\varepsilon$, so taking the maximum over $k\neq y$ yields $c(x')\leq c(x)+\varepsilon$. Hence

$$\Delta_{\theta}(x',y)=s_{\theta}(x',y)-c(x')\geq\big(s_{\theta}(x,y)-\varepsilon\big)-\big(c(x)+\varepsilon\big)=\Delta_{\theta}(x,y)-2\varepsilon.$$

If $\Delta_{\theta}(x,y)>\gamma$ and $\varepsilon<\gamma/2$, then $\Delta_{\theta}(x',y)>0$, and hence the true class is the unique maximizer among the verbalizer classes. Therefore $\hat{y}^{\mathrm{det}}_{\theta}(x')=y$. ∎

This result also lays the groundwork for the subsequent deterministic bridge. A later proposition shows that when the logits of the posterior samples are sufficiently close to those of the mean model, the posterior vulnerable risk can be linked to the deterministic worst-class risk. The argument above establishes in advance, under a fixed input perturbation, that the margin buffer can absorb a limited amount of logit deviation: a per-class deviation of at most $\varepsilon$ lowers the margin by at most $2\varepsilon$.
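The margin-stability argument can be checked numerically on random score vectors. The following is a minimal sketch, not part of the paper's code: the helper names `margin`, `perturb`, and `check_margin_stability`, the score range, and the choice $\varepsilon=0.49\gamma$ are all illustrative assumptions.

```python
import random

def margin(scores, y):
    """Margin of class y: its score minus the largest competing score."""
    return scores[y] - max(s for k, s in enumerate(scores) if k != y)

def perturb(scores, eps, rng):
    """Shift every class score by at most eps in absolute value."""
    return [s + rng.uniform(-eps, eps) for s in scores]

def check_margin_stability(trials=10_000, K=5, gamma=0.4, seed=0):
    """Empirically verify the two claims of the proof on random toy scores."""
    rng = random.Random(seed)
    eps = 0.49 * gamma  # illustrative choice; any eps < gamma / 2 satisfies the assumption
    for _ in range(trials):
        s = [rng.uniform(-2.0, 2.0) for _ in range(K)]
        y = rng.randrange(K)
        s_pert = perturb(s, eps, rng)
        # Claim 1: the margin drops by at most 2 * eps.
        assert margin(s_pert, y) >= margin(s, y) - 2 * eps - 1e-12
        # Claim 2: a clean margin above gamma keeps the prediction correct.
        if margin(s, y) > gamma:
            assert margin(s_pert, y) > 0
            assert max(range(K), key=lambda k: s_pert[k]) == y
    return True
```

Calling `check_margin_stability()` raises an `AssertionError` if either claim fails on any sampled pair. A passing run is a sanity check only and does not replace the proof.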

### A.2 Proof of spectral norm control

###### Proof\.

Recall $\|A\|_{1}=\max_{j}\sum_{i}|A_{ij}|$ and $\|A\|_{2}=\max_{\|u\|_{2}=1}\|Au\|_{2}$. Let $e_{j}$ be the $j$-th standard basis vector. Then

$$\|A\|_{1}=\max_{j}\|Ae_{j}\|_{1}\leq\max_{j}\sqrt{K}\,\|Ae_{j}\|_{2}\leq\sqrt{K}\,\|A\|_{2},$$

where we used $\|z\|_{1}\leq\sqrt{K}\|z\|_{2}$ for $z\in\mathbb{R}^{K}$. ∎
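The chain of inequalities above can be traced numerically. The sketch below uses only the Python standard library and assumes, for illustration, a random nonnegative $K\times K$ matrix with zero diagonal, mimicking the shape of a vulnerable-risk matrix. The function names are hypothetical, and the spectral norm is estimated by power iteration, which approaches $\|A\|_2$ from below.

```python
import math
import random

def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def induced_1_norm(A):
    """Maximum absolute column sum, i.e. max_j ||A e_j||_1."""
    return max(sum(abs(x) for x in col) for col in zip(*A))

def max_column_l2(A):
    """max_j ||A e_j||_2, the intermediate quantity in the proof."""
    return max(l2(col) for col in zip(*A))

def spectral_norm(A, iters=500):
    """Estimate ||A||_2 by power iteration on A^T A (a lower estimate)."""
    K = len(A)
    At = [list(col) for col in zip(*A)]
    v = [1.0 / math.sqrt(K)] * K
    for _ in range(iters):
        w = mat_vec(At, mat_vec(A, v))
        n = l2(w)
        if n == 0.0:
            return 0.0
        v = [x / n for x in w]
    return l2(mat_vec(A, v))

def check_norm_conversion(trials=100, K=4, seed=0):
    """Check ||A||_1 <= sqrt(K) max_j ||A e_j||_2 and ||A||_1 <= sqrt(K) ||A||_2."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Assumed toy matrix: nonnegative off-diagonal entries, zero diagonal.
        A = [[0.0 if i == j else rng.random() for j in range(K)] for i in range(K)]
        a1 = induced_1_norm(A)
        assert a1 <= math.sqrt(K) * max_column_l2(A) + 1e-12
        assert a1 <= math.sqrt(K) * spectral_norm(A) + 1e-9
    return True
```

For the matrix `[[0.0, 0.3], [0.2, 0.0]]`, both `induced_1_norm` and `spectral_norm` evaluate to approximately 0.3, so the bound holds with the factor $\sqrt{2}$ to spare.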

### A.3 Proof of Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10)

###### Proof\.

We condition throughout on the realized class counts $\{m_{j}\}_{j=1}^{K}$ and on the perturbation protocol that induces $D'$ and $S'$. In particular, for this PAC-Bayes statement, the perturbed sample is treated as fixed with respect to the posterior $Q$ being evaluated. This ensures that the quantities below are ordinary bounded statistics indexed by the trainable-coordinate random variable $w$.

Step 1: Spectral norm as a bilinear form. For any matrix $A\in\mathbb{R}^{K\times K}$,

$$\|A\|_{2}=\sup_{\|u\|_{2}=1,\ \|v\|_{2}=1}\left|u^{\top}Av\right|.$$

We apply this to

$$A_{Q}:=\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}.$$

Step 2: Stratified bilinear statistic for fixed $u,v$. Fix unit vectors $u,v\in\mathbb{R}^{K}$. For each class $j\in[K]$, define

$$\ell_{u,j}(w;x'):=\sum_{i\neq j}u_{i}\,g_{\gamma,\kappa}^{\theta(w)}(x',j)\,p_{\theta(w)}(i\mid x').$$

Let $D'_{j}$ denote the class-$j$ conditional distribution, and write the class-$j$ sample in $S'$ as $\{x'_{j,r}\}_{r=1}^{m_{j}}$. For fixed $w$, define the stratified bilinear generalization gap

$$Z_{u,v}(w;S'):=\sum_{j=1}^{K}v_{j}\left(\mathbb{E}_{x'\sim D'_{j}}\ell_{u,j}(w;x')-\frac{1}{m_{j}}\sum_{r=1}^{m_{j}}\ell_{u,j}(w;x'_{j,r})\right).$$

By expanding the bilinear form column by column, we have

$$u^{\top}\big(\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}\big)v=\mathbb{E}_{\widetilde{w}\sim Q}\left[Z_{u,v}(\widetilde{w};S')\right].$$

Step 3: Boundedness. For fixed $j$, define

$$a_{i}:=g_{\gamma,\kappa}^{\theta(w)}(x',j)\,p_{\theta(w)}(i\mid x'),\qquad i\neq j.$$

Then $a_{i}\geq 0$ and

$$\sum_{i\neq j}a_{i}\leq g_{\gamma,\kappa}^{\theta(w)}(x',j)\sum_{i\neq j}p_{\theta(w)}(i\mid x')\leq 1.$$

Since $\|u\|_{2}=1$ implies $\|u\|_{\infty}\leq 1$, we obtain

$$\left|\ell_{u,j}(w;x')\right|=\left|\sum_{i\neq j}u_{i}a_{i}\right|\leq\|u\|_{\infty}\sum_{i\neq j}a_{i}\leq 1.$$

Thus each class-conditional term is bounded in $[-1,1]$.

Step 4: PAC-Bayes bound for the stratified bilinear gap. For fixed $u,v,w$, the random variables in $Z_{u,v}(w;S')$ are independent across the stratified class samples, conditional on the class counts. Since each $\ell_{u,j}(w;x')\in[-1,1]$, Hoeffding's lemma gives, for any $\lambda>0$,

$$\mathbb{E}_{S'}\exp\left(\lambda Z_{u,v}(w;S')\right)\leq\exp\left(\frac{\lambda^{2}}{2}\sum_{j=1}^{K}\frac{v_{j}^{2}}{m_{j}}\right),$$

and the same bound holds for $-Z_{u,v}(w;S')$. By the standard PAC-Bayes change-of-measure argument, with probability at least $1-\delta_{0}$ over $S'$, simultaneously for all posteriors $Q$,

$$\left|\mathbb{E}_{\widetilde{w}\sim Q}Z_{u,v}(\widetilde{w};S')\right|\leq\sqrt{2\left(\sum_{j=1}^{K}\frac{v_{j}^{2}}{m_{j}}\right)\left(\mathrm{KL}(Q\|P)+\ln\frac{2}{\delta_{0}}\right)}.$$

Since $m_{j}\geq m_{\min}$ and $\|v\|_{2}=1$,

$$\sum_{j=1}^{K}\frac{v_{j}^{2}}{m_{j}}\leq\frac{1}{m_{\min}}.$$

Therefore, for fixed $u,v$, with probability at least $1-\delta_{0}$, simultaneously for all $Q$,

$$\left|u^{\top}\big(\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}\big)v\right|\leq\sqrt{\frac{2\left(\mathrm{KL}(Q\|P)+\ln\frac{2}{\delta_{0}}\right)}{m_{\min}}}.$$

Step 5: Epsilon-net reduction. Let $\mathcal{N}_{1/4}$ be a $1/4$-net of the Euclidean unit sphere in $\mathbb{R}^{K}$. We use the standard bound

$$|\mathcal{N}_{1/4}|\leq 9^{K},$$

and the standard net-to-spectrum reduction

$$\|A\|_{2}\leq 2\max_{\hat{u},\hat{v}\in\mathcal{N}_{1/4}}\left|\hat{u}^{\top}A\hat{v}\right|.$$

Apply the fixed-$(u,v)$ bound to all pairs $(\hat{u},\hat{v})\in\mathcal{N}_{1/4}\times\mathcal{N}_{1/4}$ with

$$\delta_{0}:=\frac{\delta}{|\mathcal{N}_{1/4}|^{2}}.$$

By a union bound, with probability at least $1-\delta$, simultaneously for all posteriors $Q$ and all net points,

$$\left|\hat{u}^{\top}\big(\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}\big)\hat{v}\right|\leq\sqrt{\frac{2\left(\mathrm{KL}(Q\|P)+\ln\frac{2|\mathcal{N}_{1/4}|^{2}}{\delta}\right)}{m_{\min}}}.$$

Using $|\mathcal{N}_{1/4}|\leq 9^{K}$, we get

$$\ln\frac{2|\mathcal{N}_{1/4}|^{2}}{\delta}\leq 2K\ln 9+\ln\frac{2}{\delta}.$$

Thus,

$$\left\|\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}\right\|_{2}\leq 2\sqrt{\frac{2\left(\mathrm{KL}(Q\|P)+2K\ln 9+\ln\frac{2}{\delta}\right)}{m_{\min}}}.$$

Step 6: From spectral deviation to vulnerable worst-class risk. By the norm conversion in Appendix [A.2](https://arxiv.org/html/2605.08896#A1.SS2),

$$\mathrm{VWR}_{\gamma}(Q;D')=\left\|\bar{M}^{Q}_{D',\gamma}\right\|_{1}\leq\sqrt{K}\left\|\bar{M}^{Q}_{D',\gamma}\right\|_{2}.$$

Using the triangle inequality,

$$\left\|\bar{M}^{Q}_{D',\gamma}\right\|_{2}\leq\left\|\bar{M}^{Q}_{S',\gamma}\right\|_{2}+\left\|\bar{M}^{Q}_{D',\gamma}-\bar{M}^{Q}_{S',\gamma}\right\|_{2}.$$

Therefore,

$$\mathrm{VWR}_{\gamma}(Q;D')\leq\sqrt{K}\,\mathrm{VSR}_{\gamma}(Q;S')+2\sqrt{\frac{2K\left(\mathrm{KL}(Q\|P)+2K\ln 9+\ln\frac{2}{\delta}\right)}{m_{\min}}}.$$

This completes the proof. ∎
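To see how the terms of the final bound interact, it can help to evaluate the right-hand side for concrete inputs. The sketch below transcribes the Step 5 deviation bound and the Step 6 inequality. The function names and any numerical inputs a reader passes in are hypothetical, and nothing here reproduces values reported in the paper.

```python
import math

def complexity_term(kl, K, delta):
    """KL(Q||P) + 2K ln 9 + ln(2/delta), the numerator term after the net union bound."""
    return kl + 2 * K * math.log(9) + math.log(2 / delta)

def spectral_deviation_bound(kl, K, m_min, delta):
    """Step 5: upper bound on the spectral norm of the population-minus-empirical matrix."""
    return 2 * math.sqrt(2 * complexity_term(kl, K, delta) / m_min)

def vwr_upper_bound(vsr_emp, kl, K, m_min, delta):
    """Step 6: sqrt(K) * VSR plus sqrt(K) times the spectral deviation bound."""
    return math.sqrt(K) * (vsr_emp + spectral_deviation_bound(kl, K, m_min, delta))

def min_class_samples_for(target, kl, K, delta):
    """Smallest m_min for which the Step 5 deviation bound is at most `target`."""
    return math.ceil(8 * complexity_term(kl, K, delta) / target ** 2)
```

The deviation term scales as $1/\sqrt{m_{\min}}$, so quadrupling the smallest class count halves it. The helper `min_class_samples_for` inverts the Step 5 expression, $2\sqrt{2C/m_{\min}}\leq t\iff m_{\min}\geq 8C/t^{2}$, where $C$ denotes the complexity term. This makes explicit how the $2K\ln 9$ contribution from the net grows with the number of classes.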

### A.4 Case-wise view of the stability event

Let

$$\Delta_{\mu}:=\Delta_{\theta(\mu)}(x',y),\qquad\widetilde{\Delta}:=\Delta_{\theta(\tilde{w})}(x',y),$$

and define the stability event

$$\mathcal{E}_{\mathrm{stab}}:=\left\{\Xi_{Q}(\mu,\tilde{w})\leq\gamma/2\right\}.$$

On $\mathcal{E}_{\mathrm{stab}}$, every verbalizer score changes by at most $\gamma/2$. Since the margin is the difference between the true-class score and the largest wrong-class score, the margin can change by at most $\gamma$:

$$|\widetilde{\Delta}-\Delta_{\mu}|\leq\gamma.$$

Equivalently,

$$\Delta_{\mu}-\gamma\leq\widetilde{\Delta}\leq\Delta_{\mu}+\gamma.$$

We now spell out the implications case by case.

##### Case 1: $\Delta_{\mu}\leq 0$.

The deterministic mean model makes an error. On $\mathcal{E}_{\mathrm{stab}}$, the upper bound gives

$$\widetilde{\Delta}\leq\Delta_{\mu}+\gamma\leq\gamma.$$

Thus the posterior sample lies in the $\gamma$-vulnerable region $\{\widetilde{\Delta}\leq\gamma\}$. This is the only direction needed for Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13): a deterministic error of the mean model is covered by the posterior vulnerable event, except when $\mathcal{E}_{\mathrm{stab}}$ fails.

##### Case 2: $0<\Delta_{\mu}\leq\gamma$.

The mean model is correct but already near the decision boundary. On $\mathcal{E}_{\mathrm{stab}}$,

$$\Delta_{\mu}-\gamma\leq\widetilde{\Delta}\leq\Delta_{\mu}+\gamma.$$

Since $0<\Delta_{\mu}\leq\gamma$, this implies

$$-\gamma<\Delta_{\mu}-\gamma\leq\widetilde{\Delta}\leq\Delta_{\mu}+\gamma\leq 2\gamma.$$

Hence the posterior sample may remain correct, flip, or enter the $\gamma$-vulnerable region, depending on the realized posterior shift. Counting such mean-correct but noise-sensitive examples in the posterior vulnerable risk is *conservative* and does not invalidate the upper bound on deterministic worst-class risk.

##### Case 3: $\Delta_{\mu}>\gamma$.

The mean model has a positive safety margin. On $\mathcal{E}_{\mathrm{stab}}$, the lower bound gives

$$\widetilde{\Delta}\geq\Delta_{\mu}-\gamma>0,$$

so posterior sampling cannot flip the prediction under the stable event. There are two subcases.

If $\Delta_{\mu}>2\gamma$, then

$$\widetilde{\Delta}\geq\Delta_{\mu}-\gamma>\gamma,$$

so the stable posterior sample stays outside the vulnerable region.

If $\gamma<\Delta_{\mu}\leq 2\gamma$, then

$$0<\Delta_{\mu}-\gamma\leq\gamma,$$

so a stable posterior sample may still enter the vulnerable region $\{\widetilde{\Delta}\leq\gamma\}$, reflecting local sensitivity around the mean model. This conservative counting is therefore harmless for the upper bound, while genuinely large posterior-induced margin shifts are accounted for by the failure probability $\rho$.

These unstable posterior-induced margin shifts are precisely the events accounted for by the failure probability $\rho$, and they motivate the stability regularizer used in the main method.
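The case analysis can be summarized as a small lookup that maps a mean-model margin to the range of posterior margins allowed on the stability event. This is an illustrative sketch only: the function names, the case labels, and the returned flags are hypothetical, and $\gamma>0$ is assumed.

```python
def stable_margin_interval(delta_mu, gamma):
    """Range of the posterior margin on the stability event: |delta_tilde - delta_mu| <= gamma."""
    return delta_mu - gamma, delta_mu + gamma

def classify_stability_case(delta_mu, gamma):
    """Map a mean-model margin to the matching case of the analysis above (assumes gamma > 0)."""
    lo, hi = stable_margin_interval(delta_mu, gamma)
    if delta_mu <= 0:
        case = "1"        # mean model errs
    elif delta_mu <= gamma:
        case = "2"        # mean model correct but near the boundary
    elif delta_mu <= 2 * gamma:
        case = "3-near"   # gamma < margin <= 2 * gamma
    else:
        case = "3-far"    # margin > 2 * gamma
    return {
        "case": case,
        "interval": (lo, hi),
        "can_err": lo <= 0,                # a stable sample may have margin <= 0
        "can_be_vulnerable": lo <= gamma,  # a stable sample may have margin <= gamma
        "always_vulnerable": hi <= gamma,  # every stable sample has margin <= gamma
    }
```

With $\gamma=1$, a mean margin of $-0.5$ falls in Case 1 and is always vulnerable, $0.5$ falls in Case 2 and may flip, $1.5$ cannot flip but may be vulnerable, and $2.5$ stays outside the vulnerable region on the stability event.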

### A.5 Risk-level Gibbs-to-deterministic bridge

For the bridge proof, define the posterior margin-failure risk

$$R_{\mathrm{mf},\gamma}(Q;D'):=\max_{j\in[K]}\mathbb{E}_{x'\sim D'_{j}}\left[\Pr_{\widetilde{w}\sim Q}\left(\Delta_{\theta(\widetilde{w})}(x',j)\leq\gamma\right)\right],\tag{9}$$

where $D'_{j}$ denotes the class-$j$ conditional perturbed distribution.

###### Lemma A.1 (Risk-level bridge under logit stability).

Fix $\gamma\geq 0$ and let $Q$ be any posterior over trainable coordinates with finite mean $\mu=\mathbb{E}_{\tilde{w}\sim Q}[\tilde{w}]$. Assume

$$\Pr_{\tilde{w}\sim Q}\big(\Xi_{Q}(\mu,\tilde{w})\leq\gamma/2\big)\geq 1-\rho.$$

Then

$$\mathrm{WCR}^{\mathrm{det}}(\theta(\mu);D^{\prime})\leq R_{\mathrm{mf},\gamma}(Q;D^{\prime})+\rho.$$

###### Proof.

Fix $(x^{\prime},y)$ and define

$$E:=\{\hat{y}^{\mathrm{det}}_{\theta(\mu)}(x^{\prime})\neq y\},\qquad F(\tilde{w}):=\{\Delta_{\theta(\tilde{w})}(x^{\prime},y)\leq\gamma\},$$

and the stability event

$$S(\tilde{w}):=\{\Xi_{Q}(\mu,\tilde{w})\leq\gamma/2\}.$$

If $E$ occurs, then $\Delta_{\theta(\mu)}(x^{\prime},y)\leq 0$. On $S(\tilde{w})$, the score perturbation between $\theta(\mu)$ and $\theta(\tilde{w})$ is bounded by $\gamma/2$ for every class. By the margin-stability argument in Appendix [A.1](https://arxiv.org/html/2605.08896#A1.SS1), with the two score vectors swapped, we obtain

$$\Delta_{\theta(\tilde{w})}(x^{\prime},y)\leq\Delta_{\theta(\mu)}(x^{\prime},y)+\gamma\leq\gamma.$$

Hence $E\cap S(\tilde{w})\subseteq F(\tilde{w})$, so pointwise

$$\mathbf{1}(E)\leq\mathbf{1}(F(\tilde{w}))+\mathbf{1}(S(\tilde{w})^{c}).$$

Taking expectation over $\tilde{w}\sim Q$ and using $\Pr_{\tilde{w}\sim Q}(S(\tilde{w})^{c})\leq\rho$ from the stability assumption gives

$$\mathbf{1}(E)\leq\mathbb{E}_{\tilde{w}\sim Q}\big[\mathbf{1}\{\Delta_{\theta(\tilde{w})}(x^{\prime},y)\leq\gamma\}\big]+\rho.$$

Now take the conditional expectation over $x^{\prime}\sim D^{\prime}_{j}$ for each class $j$, and then maximize over $j\in[K]$. This yields the stated risk-level bridge. No entrywise matrix domination is claimed. ∎
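The margin-stability step used in this proof can be checked numerically. The sketch below assumes the margin is the true-class score minus the best competing score (consistent with how the margin is used in Appendix A.6), draws random score vectors, perturbs every class score by at most $\gamma/2$, and records the largest observed margin shift. The function names and the uniform noise model are illustrative assumptions, not part of the paper.

```python
import numpy as np

def margin(scores, y):
    """Assumed margin: score of class y minus the best competing score."""
    competitors = np.delete(scores, y)
    return scores[y] - competitors.max()

def max_margin_shift(K=5, gamma=0.4, trials=10000, seed=0):
    """Largest |margin shift| seen when every score moves by at most gamma/2."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        s_mean = rng.normal(size=K)                       # scores of the mean model
        noise = rng.uniform(-gamma / 2, gamma / 2, size=K)  # stable-event perturbation
        s_sample = s_mean + noise                         # scores of a posterior sample
        y = int(rng.integers(K))
        shift = abs(margin(s_sample, y) - margin(s_mean, y))
        worst = max(worst, shift)
    return worst

worst_shift = max_margin_shift()
```

Under these assumptions the observed shift never exceeds $\gamma$, which is the inequality that places $E\cap S(\tilde{w})$ inside $F(\tilde{w})$.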

### A.6 Margin failure is controlled by vulnerable mass

We now prove the second part of the deterministic-risk bridge used in the main text.

###### Lemma A.2 (Margin failure controlled by margin-aware vulnerable mass).

Fix $\gamma\geq 0$. Suppose the gate satisfies

$$g_{\gamma,\kappa}^{\theta}(x^{\prime},j)\geq\eta\quad\text{whenever}\quad\Delta_{\theta}(x^{\prime},j)\leq\gamma$$

for some $\eta>0$. Then

$$R_{\mathrm{mf},\gamma}(Q;D^{\prime})\leq\eta^{-1}(1+e^{\gamma})\,\mathrm{VWR}_{\gamma}(Q;D^{\prime}).$$

For the sigmoid gate $g_{\gamma,\kappa}^{\theta}(x^{\prime},j)=\sigma((\gamma-\Delta_{\theta}(x^{\prime},j))/\kappa)$, one may take $\eta=1/2$.

###### Proof.

Fix a posterior sample $\tilde{w}\sim Q$, a perturbed example $x^{\prime}$, and a class $j$. For brevity write $\theta=\theta(\tilde{w})$. If $\Delta_{\theta}(x^{\prime},j)\leq\gamma$, then by the definition of the margin there exists some competitor $i\neq j$ such that

$$s_{\theta}(x^{\prime},i)\geq s_{\theta}(x^{\prime},j)-\gamma.$$

Therefore,

$$\exp(s_{\theta}(x^{\prime},i))\geq e^{-\gamma}\exp(s_{\theta}(x^{\prime},j)).$$

Using only this competitor in the softmax denominator gives

$$p_{\theta}(j\mid x^{\prime})=\frac{\exp(s_{\theta}(x^{\prime},j))}{\sum_{k=1}^{K}\exp(s_{\theta}(x^{\prime},k))}\leq\frac{\exp(s_{\theta}(x^{\prime},j))}{\exp(s_{\theta}(x^{\prime},j))+\exp(s_{\theta}(x^{\prime},i))}\leq\frac{1}{1+e^{-\gamma}}.$$

Hence

$$1-p_{\theta}(j\mid x^{\prime})\geq\frac{1}{1+e^{\gamma}}.$$

Moreover, on the same event $\Delta_{\theta}(x^{\prime},j)\leq\gamma$, the gate satisfies $g_{\gamma,\kappa}^{\theta}(x^{\prime},j)\geq\eta$. Thus,

$$g_{\gamma,\kappa}^{\theta}(x^{\prime},j)\bigl(1-p_{\theta}(j\mid x^{\prime})\bigr)\geq\frac{\eta}{1+e^{\gamma}}.$$

Equivalently, pointwise,

$$\mathbf{1}\{\Delta_{\theta}(x^{\prime},j)\leq\gamma\}\leq\eta^{-1}(1+e^{\gamma})\,g_{\gamma,\kappa}^{\theta}(x^{\prime},j)\bigl(1-p_{\theta}(j\mid x^{\prime})\bigr).$$

Taking expectation over $\tilde{w}\sim Q$ and over the class-conditional distribution $x^{\prime}\sim D^{\prime}_{j}$, and then maximizing over $j\in[K]$, gives

$$R_{\mathrm{mf},\gamma}(Q;D^{\prime})\leq\eta^{-1}(1+e^{\gamma})\,\mathrm{VWR}_{\gamma}(Q;D^{\prime}).$$

∎
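The pointwise inequality at the heart of this proof is easy to test on random score vectors. The sketch below assumes the sigmoid gate with $\eta=1/2$ and the margin defined as the class-$j$ score minus the best competitor; `check_pointwise` and its default values are hypothetical names and settings chosen for illustration.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def check_pointwise(K=4, gamma=0.5, kappa=0.1, eta=0.5, trials=20000, seed=0):
    """Count violations of 1{margin <= gamma} <= (1 + e^gamma) / eta * gate * (1 - p_j)."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(trials):
        s = rng.normal(scale=2.0, size=K)
        j = int(rng.integers(K))
        delta = s[j] - np.delete(s, j).max()                   # assumed margin definition
        gate = 1.0 / (1.0 + np.exp(-(gamma - delta) / kappa))  # sigmoid gate
        lhs = float(delta <= gamma)
        rhs = (1.0 + np.exp(gamma)) / eta * gate * (1.0 - softmax(s)[j])
        if lhs > rhs + 1e-12:
            violations += 1
    return violations

num_violations = check_pointwise()
```

No violations are expected: whenever the margin is at most $\gamma$, the gate is at least $1/2$ and the off-class mass is at least $1/(1+e^{\gamma})$, so the right-hand side is at least 1.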

##### Combined consequence.

Combining Lemma [A.2](https://arxiv.org/html/2605.08896#A1.Thmtheorem2) with the risk-level bridge in Lemma [A.1](https://arxiv.org/html/2605.08896#A1.Thmtheorem1) gives

$$\mathrm{WCR}^{\mathrm{det}}(\theta(\mu);D^{\prime})\leq\eta^{-1}(1+e^{\gamma})\,\mathrm{VWR}_{\gamma}(Q;D^{\prime})+\rho.$$

Combining this inequality with Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10) further yields, with the same high-probability event as the theorem,

$$\mathrm{WCR}^{\mathrm{det}}(\theta(\mu);D^{\prime})\leq\eta^{-1}(1+e^{\gamma})\left[\sqrt{K}\,\mathrm{VSR}_{\gamma}(Q;S^{\prime})+2\sqrt{\frac{2K\left(\mathrm{KL}(Q\|P)+2K\ln 9+\ln\frac{2}{\delta}\right)}{m_{\min}}}\right]+\rho.$$

This is the conservative PAC-Bayes route from the empirical margin-aware spectral structure to deterministic worst-class risk.
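To get a feel for how the terms in this final bound trade off, the right-hand side can be evaluated directly. The helper below is a minimal sketch: the function name and the example inputs are illustrative assumptions, not values reported in the paper.

```python
import math

def wcr_det_upper_bound(vsr_emp, kl, K, m_min, delta, gamma, rho, eta=0.5):
    """Right-hand side of the combined bound (all arguments supplied by the caller)."""
    # PAC-Bayes complexity term: 2 * sqrt(2K (KL + 2K ln 9 + ln(2/delta)) / m_min)
    complexity = 2.0 * math.sqrt(
        2.0 * K * (kl + 2.0 * K * math.log(9.0) + math.log(2.0 / delta)) / m_min
    )
    # Margin-to-mass conversion factor (1 + e^gamma) / eta
    factor = (1.0 + math.exp(gamma)) / eta
    return factor * (math.sqrt(K) * vsr_emp + complexity) + rho

# Illustrative inputs only.
bound_small_sample = wcr_det_upper_bound(0.1, 5.0, 4, 100, 0.05, 0.2, 0.01)
bound_large_sample = wcr_det_upper_bound(0.1, 5.0, 4, 10000, 0.05, 0.2, 0.01)
```

Since a risk cannot exceed 1, any value above 1 is vacuous; the factor $\eta^{-1}(1+e^{\gamma})\geq 4$ for $\eta=1/2$ means the bound is informative only when the empirical spectral term and the complexity term are both small.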

### A.7 Training algorithm (classification plug-in; experimental LoRA instantiation)

**Algorithm 1** Plug-in spectral safety control for verbalizer classification

**Require:** Dataset $S$, perturbation mechanism $\mathcal{U}(\cdot)$, safety buffer $\gamma$, gate temperature $\kappa$, base objective $\mathcal{L}_{\mathrm{base}}$, spectral weight $\alpha$, stability weight $\beta$, power-iteration steps $T_{\mathrm{pi}}$, refresh interval $N$, numerical constant $\varepsilon_{\mathrm{spec}}$, and local-coordinate perturbation scale $\sigma_{Q}$.

1. Initialize trainable coordinates $\phi$ and a unit vector $v\in\mathbb{R}^{K}$.
2. **for** each training step $t$ **do**
3. Sample a mini-batch $B=\{(x_{q},y_{q})\}$ from $S$.
4. Construct perturbed inputs $x^{\prime}_{q}\in\mathcal{U}(x_{q})$ and form $B^{\prime}=\{(x^{\prime}_{q},y_{q})\}$.
5. Compute verbalizer probabilities $p_{\theta(\phi)}(\cdot\mid x^{\prime}_{q})$, margins $\Delta_{\theta(\phi)}(x^{\prime}_{q},y_{q})$, and gates $g^{\theta(\phi)}_{\gamma,\kappa}(x^{\prime}_{q},y_{q})$.
6. Build the differentiable batch matrix $\widetilde{M}_{\phi}^{\mathrm{batch},\gamma}$ via Eq. ([7](https://arxiv.org/html/2605.08896#S2.E7)). Columns whose classes are absent from $B$ are masked out for this update; the matrix dimension remains $K\times K$.
7. **if** $t \bmod N=0$ **then**
8. Update $v$ by $T_{\mathrm{pi}}$ steps of power iteration on $(\widetilde{M}_{\phi}^{\mathrm{batch},\gamma})^{\top}\widetilde{M}_{\phi}^{\mathrm{batch},\gamma}$, and normalize $v$.
9. **end if**
10. Treat $v$ as fixed within the current refresh window and estimate $\mathcal{R}_{\mathrm{spec}}=\sqrt{v^{\top}(\widetilde{M}_{\phi}^{\mathrm{batch},\gamma})^{\top}\widetilde{M}_{\phi}^{\mathrm{batch},\gamma}v+\varepsilon_{\mathrm{spec}}}$.
11. **if** $\beta>0$ **then**
12. Sample a temporary local perturbation $u\sim\mathcal{N}(0,\sigma_{Q}^{2}I)$ on the trainable coordinates and set $\widetilde{\phi}=\phi+u$.
13. Compute the stability penalty $\mathcal{R}_{\mathrm{stab}}$ via Eq. ([8](https://arxiv.org/html/2605.08896#S2.E8)).
14. **else**
15. Set $\mathcal{R}_{\mathrm{stab}}=0$.
16. **end if**
17. Update $\phi$ by minimizing $\mathcal{L}_{\mathrm{base}}+\alpha\mathcal{R}_{\mathrm{spec}}+\beta\mathcal{R}_{\mathrm{stab}}$.
18. **end for**
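The power-iteration refresh and the spectral estimate in Algorithm 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the batch matrix of Eq. (7) is replaced by a random nonnegative $K\times K$ stand-in, and the names `power_iteration` and `spectral_risk` are hypothetical.

```python
import numpy as np

def power_iteration(M, v, steps):
    """Run `steps` power-iteration updates on M^T M, renormalizing v each step."""
    B = M.T @ M
    for _ in range(steps):
        v = B @ v
        v = v / np.linalg.norm(v)
    return v

def spectral_risk(M, v, eps=1e-8):
    """Stabilized estimate sqrt(v^T M^T M v + eps) with v held fixed."""
    return np.sqrt(v @ (M.T @ M) @ v + eps)

rng = np.random.default_rng(0)
K = 5
M = rng.uniform(0.0, 1.0, size=(K, K))   # stand-in for the batch matrix of Eq. (7)
v0 = np.ones(K) / np.sqrt(K)             # initial unit vector
v = power_iteration(M, v0, steps=50)
r_spec = spectral_risk(M, v)
sigma_max = np.linalg.svd(M, compute_uv=False)[0]  # exact top singular value
```

On this stand-in matrix the estimate agrees with the exact top singular value to within the stabilizer. In an autodiff framework, `v` would be treated as a constant (no gradient) between refreshes, as the algorithm describes.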

### A.8 Gradient estimator induced by power iteration

Let $A=\widetilde{M}^{\gamma}_{\phi}$ and define $B=A^{\top}A$. Power iteration produces a unit vector $v$ approximating the top eigenvector of $B$, hence $\sigma_{\max}(A)\approx\sqrt{v^{\top}Bv}$. If we treat $v$ as fixed within a refresh window, then

$$\widehat{\sigma}(A):=\sqrt{v^{\top}A^{\top}Av}\quad\Rightarrow\quad\frac{\partial\widehat{\sigma}}{\partial A}=\frac{1}{\widehat{\sigma}(A)}\,Avv^{\top}.\tag{10}$$

In implementation, we use the stabilized estimator

$$\widehat{\sigma}_{\varepsilon}(A)=\sqrt{v^{\top}A^{\top}Av+\varepsilon_{\mathrm{spec}}},$$

with a small $\varepsilon_{\mathrm{spec}}>0$. The corresponding gradient is

$$\frac{\partial\widehat{\sigma}_{\varepsilon}}{\partial A}=\frac{Avv^{\top}}{\widehat{\sigma}_{\varepsilon}(A)}.$$
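The closed-form gradient of the stabilized estimator can be verified against central finite differences. The sketch below uses a random matrix and a random unit vector as stand-ins; all names are illustrative.

```python
import numpy as np

def sigma_eps(A, v, eps):
    """Stabilized estimator sqrt(v^T A^T A v + eps) with v fixed."""
    return np.sqrt(v @ A.T @ A @ v + eps)

def analytic_grad(A, v, eps):
    """Closed-form gradient A v v^T / sigma_eps(A)."""
    return np.outer(A @ v, v) / sigma_eps(A, v, eps)

def finite_diff_grad(A, v, eps, h=1e-6):
    """Entrywise central-difference approximation of the same gradient."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A_plus, A_minus = A.copy(), A.copy()
            A_plus[i, j] += h
            A_minus[i, j] -= h
            G[i, j] = (sigma_eps(A_plus, v, eps) - sigma_eps(A_minus, v, eps)) / (2 * h)
    return G

rng = np.random.default_rng(1)
A = rng.uniform(size=(4, 4))
v = rng.normal(size=4)
v /= np.linalg.norm(v)
grad_gap = np.abs(analytic_grad(A, v, 1e-8) - finite_diff_grad(A, v, 1e-8)).max()
```

The gradient is rank one, so the update only moves $A$ along the direction $Av\,v^{\top}$ selected by the current power-iteration vector.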

### A.9 Safety gate temperature

The temperature $\kappa$ in Eq. ([2](https://arxiv.org/html/2605.08896#S2.E2)) interpolates between a hard margin indicator and a smooth weighting. A smaller $\kappa$ makes $g_{\gamma,\kappa}^{\theta}$ closer to $\mathbf{1}\{\Delta\leq\gamma\}$, but may increase gradient variance.
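The effect of $\kappa$ is visible on a handful of margins. The sketch below uses the sigmoid gate form $\sigma((\gamma-\Delta)/\kappa)$ stated in Lemma A.2; the margin values and the two temperatures are illustrative choices.

```python
import numpy as np

def sigmoid_gate(delta, gamma, kappa):
    """Sigmoid gate sigma((gamma - delta) / kappa), written via tanh for stability."""
    t = (gamma - np.asarray(delta, dtype=float)) / kappa
    return 0.5 * (1.0 + np.tanh(0.5 * t))

margins = np.array([-1.0, 0.0, 0.2, 0.4, 1.0])
gamma = 0.2
soft = sigmoid_gate(margins, gamma, kappa=0.5)    # smooth weighting
sharp = sigmoid_gate(margins, gamma, kappa=0.01)  # close to the hard indicator
```

With the small temperature the gate is nearly 1 for margins below $\gamma$ and nearly 0 above it, so almost all gradient signal comes from examples very close to $\Delta=\gamma$; this is one way to read the gradient-variance remark above.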

### A.10 Local Interpretation of the Joint Stability Regularizer

We provide a local interpretation of the stability regularizer in Eq. ([8](https://arxiv.org/html/2605.08896#S2.E8)). For a mini-batch index $q$, let

$$z_{q}:=s_{\theta(w)}(x_{q}),\qquad z^{\prime}_{q,u}:=s_{\theta(w+u)}(x^{\prime}_{q}),$$

where $u$ denotes a temporary perturbation in the trainable-coordinate space. Let

$$p_{q}:=\mathrm{softmax}(z_{q}/T),\qquad p^{\prime}_{q,u}:=\mathrm{softmax}(z^{\prime}_{q,u}/T).$$

The consistency loss compares $p_{q}$ with $p^{\prime}_{q,u}$. Since the softmax is invariant to adding the same constant to all logits, we consider the centered logit drift

$$\Delta z_{q}(u):=\Pi\left(z^{\prime}_{q,u}-z_{q}\right),\qquad\Pi:=I-\frac{1}{K}\mathbf{1}\mathbf{1}^{\top},$$

where $\mathbf{1}\in\mathbb{R}^{K}$ is the all-one vector. The projection $\Pi$ removes the common-shift direction, i.e., the component of the logits to which the softmax distribution is invariant.

For small centered logit drift, the softmax KL admits the second-order expansion

$$\mathrm{KL}(p_{q}\|p^{\prime}_{q,u})=\frac{1}{2T^{2}}\Delta z_{q}(u)^{\top}F_{q}\Delta z_{q}(u)+O(\|\Delta z_{q}(u)\|^{3}),$$

where

$$F_{q}:=\mathrm{Diag}(p_{q})-p_{q}p_{q}^{\top}$$

is the Fisher matrix of the categorical distribution induced by the clean branch. Thus the KL consistency term locally penalizes a Fisher-weighted centered logit drift.
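The second-order expansion can be checked numerically by comparing the exact KL divergence with the Fisher-weighted quadratic form at two drift scales. The logits, temperature, and drift direction below are random stand-ins chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
K, T = 5, 2.0
z = rng.normal(size=K)            # clean logits
direction = rng.normal(size=K)    # drift direction
p = softmax(z / T)
F = np.diag(p) - np.outer(p, p)   # categorical Fisher matrix
Pi = np.eye(K) - np.ones((K, K)) / K

def gap(scale):
    """|exact KL - quadratic approximation| for a drift of the given scale."""
    dz = scale * direction
    exact = kl(p, softmax((z + dz) / T))
    centered = Pi @ dz
    quad = centered @ F @ centered / (2 * T**2)
    return abs(exact - quad)

gap_small, gap_large = gap(1e-2), gap(1e-1)
```

The gap shrinks rapidly as the drift shrinks, as expected from the cubic remainder. Centering with $\Pi$ does not change the quadratic form here because $F_{q}\mathbf{1}=0$.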

To separate the sources of this drift, assume $\tilde{w}=w+u$ with $u\sim\mathcal{N}(0,\sigma_{Q}^{2}I)$ and linearize the logits around $w$:

$$s_{\theta(w+u)}(x^{\prime}_{q})\approx s_{\theta(w)}(x^{\prime}_{q})+\nabla_{w}s_{\theta(w)}(x^{\prime}_{q})\,u.$$

Therefore,

$$\Delta z_{q}(u)\approx a_{q}+J_{q}u,$$

where

$$a_{q}:=\Pi\left(s_{\theta(w)}(x^{\prime}_{q})-s_{\theta(w)}(x_{q})\right)$$

captures the clean-to-perturbed input drift, and

$$J_{q}:=\Pi\nabla_{w}s_{\theta(w)}(x^{\prime}_{q})$$

captures the local sensitivity of the logits to trainable-coordinate perturbations. Substituting this decomposition into the quadratic approximation gives

$$(a_{q}+J_{q}u)^{\top}F_{q}(a_{q}+J_{q}u)=a_{q}^{\top}F_{q}a_{q}+2a_{q}^{\top}F_{q}J_{q}u+u^{\top}J_{q}^{\top}F_{q}J_{q}u.$$

Taking expectation over the zero-mean Gaussian perturbation eliminates the cross term:

$$\mathbb{E}_{u}[2a_{q}^{\top}F_{q}J_{q}u]=2a_{q}^{\top}F_{q}J_{q}\mathbb{E}[u]=0.$$

Moreover, using $\mathbb{E}[uu^{\top}]=\sigma_{Q}^{2}I$, we have

$$\mathbb{E}_{u}\left[u^{\top}J_{q}^{\top}F_{q}J_{q}u\right]=\sigma_{Q}^{2}\mathrm{Tr}(J_{q}^{\top}F_{q}J_{q}).$$

Hence,

$$\mathbb{E}_{u}\,\mathrm{KL}(p_{q}\|p^{\prime}_{q,u})\approx\frac{1}{2T^{2}}a_{q}^{\top}F_{q}a_{q}+\frac{\sigma_{Q}^{2}}{2T^{2}}\mathrm{Tr}(J_{q}^{\top}F_{q}J_{q}).$$

The first term penalizes clean-to-perturbed input drift, matching the consistency principle used in standard robust training. The second term penalizes the output effect of local perturbations in the trainable coordinates. This second term is the part directly related to the stability event used in Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13): it discourages a posterior sample $\theta(w+u)$ from inducing a large centered logit drift relative to $\theta(w)$ on the perturbed input.
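The split into an input-drift term and a trace term can be confirmed by Monte Carlo on the quadratic form itself (before the $1/(2T^{2})$ factor). The dimensions, the random Jacobian, and the sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, sigma_q, n = 4, 6, 0.3, 200_000

p = rng.dirichlet(np.ones(K))             # clean-branch probabilities
F = np.diag(p) - np.outer(p, p)           # Fisher matrix
Pi = np.eye(K) - np.ones((K, K)) / K      # centering projection
a = Pi @ rng.normal(size=K)               # stand-in for the input drift a_q
J = Pi @ rng.normal(size=(K, d))          # stand-in for the centered Jacobian J_q

U = rng.normal(scale=sigma_q, size=(n, d))        # samples u ~ N(0, sigma_q^2 I)
drift = a + U @ J.T                               # a_q + J_q u for every sample
mc_mean = np.einsum("nk,kl,nl->n", drift, F, drift).mean()
closed_form = a @ F @ a + sigma_q**2 * np.trace(J.T @ F @ J)
```

The Monte Carlo average matches the closed form up to sampling error, with the cross term averaging out as in the derivation.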

To see the connection to margin stability, let $d_{q}(u):=\Pi(s_{\theta(w+u)}(x^{\prime}_{q})-s_{\theta(w)}(x^{\prime}_{q}))$. Since margins are invariant to common logit shifts, only this centered drift matters. For any label $y$,

$$\left|\Delta_{\theta(w+u)}(x^{\prime}_{q},y)-\Delta_{\theta(w)}(x^{\prime}_{q},y)\right|\leq 2\|d_{q}(u)\|_{\infty}\leq 2\|d_{q}(u)\|_{2}.$$

Thus, a posterior-induced margin shift larger than $\gamma$ requires $\|d_{q}(u)\|_{2}>\gamma/2$. Under the local non-degeneracy condition that the Fisher matrix has eigenvalue at least $\lambda_{q}>0$ on the relevant centered subspace, the quadratic term satisfies

$$d_{q}(u)^{\top}F_{q}d_{q}(u)\geq\lambda_{q}\|d_{q}(u)\|_{2}^{2}.$$

Consequently, by Markov's inequality, the probability of a large local margin shift in absolute value is controlled by the expected Fisher-weighted drift:

$$\Pr_{u}\left(\left|\Delta_{\theta(w+u)}(x^{\prime}_{q},y)-\Delta_{\theta(w)}(x^{\prime}_{q},y)\right|>\gamma\right)\lesssim\frac{4}{\lambda_{q}\gamma^{2}}\mathbb{E}_{u}\left[d_{q}(u)^{\top}F_{q}d_{q}(u)\right],$$

up to the local second-order approximation above. Therefore, $\mathcal{R}_{\mathrm{stab}}$ should be understood as a practical proxy for reducing the stability-failure probability $\rho$: it penalizes the input drift term and, more importantly for the deterministic bridge, the coordinate-noise sensitivity that can produce large margin shifts. The PAC-Bayes complexity term controls the size of the posterior in parameter space, while this Fisher-weighted term controls how such local perturbations affect the predictive distribution.

## Appendix B Special Cases of the Trainable-Coordinate PAC-Bayes Setup

### B.1 LoRA specialization

In the LoRA case, the trainable coordinates are exactly the LoRA parameters:

$$w=\phi_{\mathrm{LoRA}}\in\mathbb{R}^{d_{\ell}},\qquad d_{\mathrm{train}}=d_{\ell},\qquad\theta(w)=\mathcal{T}_{\mathrm{LoRA}}(\theta_{0},w).$$

Under the Gaussian family

$$P=\mathcal{N}(0,\tau_{P}^{2}I),\qquad Q=\mathcal{N}(\mu,\sigma_{Q}^{2}I),\tag{11}$$

the KL term becomes

$$\mathrm{KL}(Q\|P)=\frac{\|\mu\|_{2}^{2}}{2\tau_{P}^{2}}+\frac{d_{\ell}}{2}\left(\frac{\sigma_{Q}^{2}}{\tau_{P}^{2}}-1-\ln\frac{\sigma_{Q}^{2}}{\tau_{P}^{2}}\right).\tag{12}$$

This is the LoRA-dimensional specialization of the Gaussian $\mathrm{KL}(Q\|P)$ term used in Theorem [2.10](https://arxiv.org/html/2605.08896#S2.Thmtheorem10). The PAC-Bayes theorem itself is unchanged; only the instantiation of $w$ and $d_{\mathrm{train}}$ changes.
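Eq. (12) is straightforward to evaluate. The helper below is a minimal sketch with a hypothetical name; it takes the posterior mean as a flat vector, so the dimension $d_{\ell}$ is read off from its length.

```python
import numpy as np

def gaussian_kl_isotropic(mu, sigma_q, tau_p):
    """KL(N(mu, sigma_q^2 I) || N(0, tau_p^2 I)) following Eq. (12)."""
    mu = np.asarray(mu, dtype=float)
    d = mu.size
    ratio = sigma_q**2 / tau_p**2
    mean_term = mu @ mu / (2.0 * tau_p**2)
    variance_term = 0.5 * d * (ratio - 1.0 - np.log(ratio))
    return float(mean_term + variance_term)

kl_zero = gaussian_kl_isotropic(np.zeros(3), sigma_q=1.0, tau_p=1.0)
kl_unit_shift = gaussian_kl_isotropic([1.0, 1.0], sigma_q=1.0, tau_p=1.0)
kl_narrow = gaussian_kl_isotropic(np.zeros(1000), sigma_q=0.1, tau_p=1.0)
```

The variance term grows linearly with the number of trainable coordinates whenever $\sigma_{Q}\neq\tau_{P}$, which is why the choice of coordinates matters for the complexity term.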

### B.2 Full-parameter fine-tuning

For full fine-tuning, we take

$$w=\theta-\theta_{0}\in\mathbb{R}^{d_{\mathrm{full}}},\qquad d_{\mathrm{train}}=d_{\mathrm{full}}.$$

The theorem in Section [2.3](https://arxiv.org/html/2605.08896#S2.SS3) again applies in exactly the same form. The main difference is quantitative: the complexity term is typically larger because the Gaussian KL now scales with $d_{\mathrm{full}}$ rather than $d_{\ell}$. The routes below are only *sufficient* ways to instantiate the stability condition of Proposition [2.13](https://arxiv.org/html/2605.08896#S2.Thmtheorem13); they are not automatic guarantees.

##### Setup.

Let $\mu\in\mathbb{R}^{d_{\mathrm{full}}}$ denote the full fine-tuning displacement, so that the posterior mean corresponds to parameters $\theta(\mu)=\theta_{0}+\mu$. For $u\sim\mathcal{N}(0,\sigma_{Q}^{2}I_{d_{\mathrm{full}}})$, the sampled parameter vector is $\theta(\mu+u)$. We seek sufficient conditions implying

$$\Pr_{u\sim\mathcal{N}(0,\sigma_{Q}^{2}I_{d_{\mathrm{full}}})}\big(\Xi(\mu,u)\leq\gamma/2\big)\geq 1-\rho.$$

#### Route 1: A Lipschitz-type bound (explicit but conservative)

Suppose one can bound the score perturbation by

$$\Xi(\mu,u)\leq L_{\mu}\|u\|_{2}$$

for some constant $L_{\mu}>0$. For Transformers, such a bound can be obtained conservatively by combining layerwise operator-norm perturbation bounds with bounded hidden representations. For example, if a pointwise nonlinearity $\psi$ is $L_{\psi}$-Lipschitz and the relevant linear maps admit perturbation bounds of the form

$$\|(W+\Delta W)h-Wh\|_{2}\leq\|\Delta W\|_{2}\,\|h\|_{2},$$

then repeated application through the network yields a global constant $L_{\mu}$. Combining this with Gaussian norm concentration,

$$\Pr\Big(\|u\|_{2}\leq\sigma_{Q}\big(\sqrt{d_{\mathrm{full}}}+\sqrt{2\ln(1/\rho)}\big)\Big)\geq 1-\rho,$$

gives the sufficient condition

$$L_{\mu}\,\sigma_{Q}\big(\sqrt{d_{\mathrm{full}}}+\sqrt{2\ln(1/\rho)}\big)\leq\gamma/2.$$
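Solving the sufficient condition for $\sigma_{Q}$ shows how quickly the admissible posterior scale shrinks with dimension. The helper name and the numbers below are illustrative assumptions; in particular, the Lipschitz constant is simply taken as given.

```python
import math

def max_posterior_scale(gamma, lipschitz, dim, rho):
    """Largest sigma_Q with L * sigma_Q * (sqrt(d) + sqrt(2 ln(1/rho))) <= gamma / 2."""
    radius_factor = math.sqrt(dim) + math.sqrt(2.0 * math.log(1.0 / rho))
    return gamma / (2.0 * lipschitz * radius_factor)

# Illustrative values: a LoRA-sized and a full-model-sized coordinate space.
sigma_small_dim = max_posterior_scale(gamma=0.5, lipschitz=10.0, dim=10**6, rho=0.05)
sigma_large_dim = max_posterior_scale(gamma=0.5, lipschitz=10.0, dim=10**9, rho=0.05)
```

For large dimension the admissible scale behaves like $\gamma/(2L_{\mu}\sqrt{d})$, which illustrates why this route is explicit but conservative.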

#### Route 2: Local \(Jacobian\-based\) stability certificate

Alternatively, one can use local first-order sensitivity. If $s_{\theta(w)}(x^{\prime},k)$ is differentiable in $w$, then for $u$ in a local neighborhood,

$$|s_{\theta(\mu+u)}(x^{\prime},k)-s_{\theta(\mu)}(x^{\prime},k)|\leq\|\nabla_{w}s_{\theta(\mu)}(x^{\prime},k)\|_{2}\,\|u\|_{2}+\frac{H}{2}\|u\|_{2}^{2},$$

where $H$ controls the local Hessian along the segment $\mu+tu$. Ignoring or separately controlling the second-order term yields the practical approximation

$$\Xi(\mu,u)\lesssim G_{\mu}\,\|u\|_{2},\qquad G_{\mu}:=\sup_{(x^{\prime},y)\sim D^{\prime}}\max_{k\in[K]}\|\nabla_{w}s_{\theta(\mu)}(x^{\prime},k)\|_{2}.$$

Together with Gaussian norm concentration, a sufficient local condition is

$$G_{\mu}\,\sigma_{Q}\big(\sqrt{d_{\mathrm{full}}}+\sqrt{2\ln(1/\rho)}\big)\leq\gamma/2.$$

##### Remark on complexity.

Under a fully isotropic Gaussian posterior, the KL term scales with $d_{\mathrm{full}}$. This is the main reason why the LoRA specialization is often more attractive in PAC-Bayes analyses, even though the theorem itself does not require parameter efficiency.

## Appendix C Discussions and Limitations

##### Adaptive perturbation generation\.

Our PAC\-Bayes analysis conditions on a fixed perturbed sample or a perturbation protocol independent of the learned posterior\. This is appropriate for the LLM text\-perturbation setting and for frozen or sample\-split perturbation generators\. The VLM PGD training loop is more adaptive: adversarial examples can depend on the current model parameters\. We therefore interpret the VLM results as an empirical stress test of the FragileFlow regularizer rather than as a direct instantiation of the strict PAC\-Bayes theorem\. Extending the bound to fully model\-dependent adversarial data generation, for example through data\-dependent PAC\-Bayes or algorithm\-dependent perturbation kernels, is a direction for future work\.

##### Trainable\-coordinate choice and complexity\.

Our theory is stated over generic trainable coordinates and is not tied to LoRA\. We use LoRA in the experiments mainly for resource efficiency and because it matches common adaptation practice for large models\. Full\-parameter fine\-tuning is compatible with the same formal framework and may give the regularizer more freedom to reshape fragile error\-flow patterns, but it also increases the PAC\-Bayes complexity term and may require stronger control of the adapted parameters\. In practice, this suggests a trade\-off between adaptation capacity and complexity control; empirical parameter regularization, structured posteriors, or norm\-constrained updates are natural ways to keep this trade\-off stable\. We leave a systematic comparison between LoRA and full\-parameter adaptation to future work\.

##### Calibration and tuning budget\.

Our experiments use a fixed calibration protocol for the safety buffer and a limited sensitivity sweep for the plug\-in weights, rather than exhaustively tuning these choices for every model–dataset–learner combination\. This makes the comparisons more conservative and avoids selecting hyperparameters directly to maximize each reported test metric, but it also means that the reported numbers should not be read as the best achievable performance of FragileFlow\. The consistent reductions in vulnerable\-flow measures under this restrained tuning protocol support the main theory\-facing claim, while larger\-scale calibration and task\-specific tuning may further improve the robustness–utility trade\-off\.

##### Data scale and perturbation strength\.

We evaluate FragileFlow under fixed data budgets and fixed perturbation protocols in order to keep the comparison controlled across models, datasets, and base learners\. We do not exhaustively study how the effect changes with larger adaptation sets, different calibration\-set sizes, or stronger perturbation budgets\. These factors may influence both the estimated vulnerable\-flow matrix and the robustness–utility trade\-off\. A more systematic scaling study over data size and perturbation strength would be useful for understanding when spectral error\-flow control is most beneficial\.

##### Broader impacts\.

This work studies a general robustness objective and does not introduce a new deployment domain or user\-facing system\. Its positive impact is that worst\-class\-oriented robustness may help reduce concentrated failures in finite\-option LLM and VLM applications, especially when errors repeatedly affect a small subset of classes or choices\. At the same time, robustness methods are dual\-use: stronger adaptation techniques could also make undesirable or poorly governed models more stable under perturbation, and improved robustness metrics may create overconfidence if used without task\-specific safety evaluation\. For this reason, FragileFlow should be viewed as a diagnostic and training tool to be combined with domain\-specific validation, safety testing, and monitoring rather than as a standalone guarantee of safe deployment\.

## Appendix D Additional Experimental Results

### D.1 LLM main results with per-cell standard deviations

Table 4: LLM main results (mean ± seed std): Clean Worst-Class Acc (%). This is the WC-Acc on clean data excluding the perturbation. Reported as mean ± seed standard deviation over three paired seeds. Calibrated buffer $\gamma_{25}$, default $(\alpha,\beta)$.

| Model | Dataset | *CE* | CE+aug base | CE+aug +plug | R3F base | R3F +plug | SMART base | SMART +plug |
|---|---|---|---|---|---|---|---|---|
| Qwen-0.5B | ARC-C | *41.16±3.28* | 44.18±3.98 | 43.47±3.93 | 40.89±2.83 | 42.84±5.22 | 34.13±3.79 | 43.38±4.38 |
| Qwen-0.5B | CSQA | *45.22±3.64* | 45.33±7.16 | 45.56±7.94 | 45.56±3.37 | 45.00±3.36 | 48.00±4.01 | 48.89±3.69 |
| Qwen-1.5B | ARC-C | *66.84±2.85* | 68.27±5.43 | 70.67±2.49 | 66.67±3.01 | 66.67±2.89 | 66.49±3.18 | 66.76±4.36 |
| Qwen-1.5B | CSQA | *61.67±2.79* | 67.56±2.10 | 67.33±1.80 | 61.78±2.72 | 61.67±2.87 | 63.22±3.19 | 63.33±3.87 |
| Mistral-7B | ARC-C | *67.29±4.87* | 65.69±4.06 | 68.53±6.25 | 67.47±4.41 | 67.11±4.03 | 42.13±20.54 | 54.84±14.82 |
| Mistral-7B | CSQA | *66.00±3.62* | 67.11±3.88 | 66.89±3.57 | 66.67±4.44 | 66.44±4.07 | 50.56±4.87 | 57.89±3.42 |

Table 5: LLM main results (mean ± seed std): Accuracy on perturbation-only inputs (%). Each cell averages accuracy over {typo, distractor, format_rewrite}. Calibrated buffer $\gamma_{25}$, default $(\alpha,\beta)$. Reported as mean ± seed standard deviation over three paired seeds.

| Model | Dataset | *CE* | CE+aug base | CE+aug +plug | R3F base | R3F +plug | SMART base | SMART +plug |
|---|---|---|---|---|---|---|---|---|
| Qwen-0.5B | ARC-C | *41.44±1.80* | 49.16±6.02 | 49.47±5.50 | 41.53±1.97 | 42.40±2.36 | 42.16±1.28 | 42.11±0.99 |
| Qwen-0.5B | CSQA | *42.82±1.77* | 53.60±2.53 | 54.36±2.68 | 42.67±1.64 | 42.73±1.50 | 44.67±2.08 | 44.16±2.31 |
| Qwen-1.5B | ARC-C | *62.80±3.00* | 72.18±1.16 | 73.22±1.10 | 62.84±3.10 | 63.02±3.38 | 68.47±2.33 | 68.20±2.59 |
| Qwen-1.5B | CSQA | *59.98±3.12* | 71.49±1.37 | 71.53±1.08 | 59.98±3.18 | 59.71±3.33 | 65.33±1.80 | 65.58±3.52 |
| Mistral-7B | ARC-C | *66.78±2.12* | 72.42±2.85 | 73.04±4.04 | 66.36±2.34 | 66.07±2.45 | 55.51±5.35 | 56.98±4.78 |
| Mistral-7B | CSQA | *66.42±3.44* | 74.11±2.42 | 74.60±2.19 | 66.47±2.96 | 66.11±3.10 | 51.96±4.03 | 52.89±4.34 |

##### Discussion\.

These auxiliary metrics are reported mainly for completeness and as utility checks, aligning the LLM appendix with the additional metrics reported for VLMs\. Clean worst\-class accuracy shows mixed but generally small changes across base learners, while perturbation\-only average accuracy is mostly stable\. Thus, these results do not form a separate claim of uniform improvement; they simply indicate that the vulnerable\-flow reductions reported in the main paper are not obtained by a clear degradation of standard utility metrics\.

### D.2 VLM 4-shot low-shot stress test

Table 6: CLIP ViT-B/32 4-shot low-shot stress test. Same evaluation protocol as the main ViT table, but with shots = 4. Robustness is evaluated on test images using the default PGD setting of our ViT pipeline (attack_type=PGD, $\varepsilon=1.0/255$, 100 attack steps). PGD WC-Acc denotes worst-class accuracy on PGD-perturbed test images. Bold = best among LoRA-adv family rows, as in the main table.

| Dataset | Method | Clean Acc ↑ | PGD Acc ↑ | Clean WC ↑ | PGD WC ↑ | $\widehat{\mathrm{VSR}}_{\gamma}$ (pgd) ↓ | $\mathrm{VWR}_{\gamma}$ (pgd) ↓ | $n$ |
|---|---|---|---|---|---|---|---|---|
| DTD | LoRA-adv | 59.54±0.37 | 20.23±1.19 | **15.74±5.71** | **0.00** | 32.19±2.83 | 26.92±0.06 | 3 |
| DTD | LoRA-adv + plugin (inner) | **59.57±0.49** | 20.37±1.24 | 15.74±5.71 | 0.00 | 32.37±3.03 | 27.01±0.12 | 3 |
| DTD | LoRA-adv + plugin (outer) | 58.94±0.85 | **20.41±1.21** | 12.96±7.29 | 0.00 | 31.21±1.95 | 26.66±0.28 | 3 |
| DTD | LoRA-adv + plugin (both) | 59.06±0.91 | 20.29±1.23 | 12.04±6.55 | 0.00 | **31.18±2.41** | **26.61±0.35** | 3 |
| OxfordPets | LoRA-adv | 86.64±0.44 | 19.48±0.72 | 54.33±9.29 | 0.67±0.47 | 57.09±1.99 | 61.77±0.36 | 3 |
| OxfordPets | LoRA-adv + plugin (inner) | 86.73±0.53 | 19.49±0.70 | 54.33±10.50 | 0.67±0.47 | 57.32±2.41 | 61.48±0.48 | 3 |
| OxfordPets | LoRA-adv + plugin (outer) | **87.27** | **20.55** | **56.00** | **1.01** | **53.48** | 60.84 | 1 |
| OxfordPets | LoRA-adv + plugin (both) | 86.98±0.42 | 19.88±0.78 | 53.00±11.43 | 1.00±0.82 | 55.25±1.80 | **60.31±0.24** | 3 |
| Caltech101 | LoRA-adv | 94.04±0.03 | **60.97±1.06** | 25.16±14.13 | **1.28±1.81** | 25.71±4.56 | 55.83±3.56 | 3 |
| Caltech101 | LoRA-adv + plugin (inner) | 94.04±0.10 | 60.82±1.06 | **26.46±13.84** | 1.28±1.81 | 25.63±4.62 | 56.09±3.45 | 3 |
| Caltech101 | LoRA-adv + plugin (outer) | **94.08±0.09** | 60.85±0.98 | 24.23±13.07 | 0.00 | 24.26±4.03 | 54.62±3.15 | 3 |
| Caltech101 | LoRA-adv + plugin (both) | 93.98±0.11 | 60.77±0.89 | 24.23±13.07 | 0.00 | **24.04±4.58** | **54.45±3.03** | 3 |

##### Discussion\.

This 4\-shot setting is included as a low\-shot stress test and as the per\-dataset detail behind the VLM ablation in Table[8](https://arxiv.org/html/2605.08896#A4.T8)\. Because the adaptation set is very small, the results are naturally noisy and should not be read as a claim that every plug\-in placement improves every metric\. The main signal is that the outer and both variants often reduce the vulnerable\-flow measures while keeping clean and PGD average accuracy close to the LoRA\-adv baseline, whereas the inner\-only variant is less stable\. PGD worst\-class accuracy is frequently floor\-limited in this setting, so it provides limited resolution beyond showing the difficulty of the stress test\. We therefore use this table as supporting evidence that FragileFlow remains useful in a harder low\-shot protocol, rather than as a primary claim of uniform improvement\.
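To make the floor effect concrete, the following is a minimal sketch of worst-class accuracy, assuming it is the minimum of the per-class accuracies over classes present in the evaluation set. The helper names and the toy labels are hypothetical and are not taken from the paper's pipeline.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy computed separately for each class (NaN if a class is absent)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = (y_pred[mask] == c).mean()
    return acc

def worst_class_accuracy(y_true, y_pred, num_classes):
    """Assumed definition: minimum per-class accuracy over observed classes."""
    return float(np.nanmin(per_class_accuracy(y_true, y_pred, num_classes)))

# Toy labels (hypothetical): class 2 is always misclassified.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1])

avg_acc = float((y_true == y_pred).mean())
wc_acc = worst_class_accuracy(y_true, y_pred, num_classes=3)
print(f"average accuracy = {avg_acc:.3f}, worst-class accuracy = {wc_acc:.3f}")
```

In this toy case a single fully misclassified class pins the worst-class value at zero while the average stays well above it, which is the sense in which a floor-limited worst-class metric offers little resolution.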

### D.3 $\beta = 0$ ablation: per-cell numeric breakdown

Table 7: LLM $\beta = 0$ ablation ($\mathcal{R}_{\mathrm{spec}}$ only vs. composite). This table is intended as a mechanism ablation for the stability term. The main comparison is between the spectral-only plug-in ($\beta = 0$) and the composite objective ($\beta > 0$). Base provides a non-plug scale reference averaged over four learners: base_clean (CE), base_aug, r3f, and smart. The +Plug rows are computed from the corresponding plug-in runs. Values in parentheses report changes relative to Base; the bottom block averages these shifts over the six model–dataset cells. Lower is better for $\widehat{\mathrm{VSR}}_{\gamma}$ and $\widehat{\mathrm{VWR}}_{\gamma}$, while higher is better for accuracy metrics. Bold marks the desired direction relative to Base.

| Model | Dataset | Config | $\widehat{\mathrm{VSR}}_{\gamma}$ ↓ | $\widehat{\mathrm{VWR}}_{\gamma}$ ↓ | WC-Acc ↑ | Ptb Acc ↑ | Clean Acc ↑ |
|---|---|---|---|---|---|---|---|
| Qwen-0.5B | ARC-C | Base | 45.06 | 50.27 | 40.09 | 50.01 | 52.67 |
| Qwen-0.5B | ARC-C | +Plug ($\beta = 0$) | **34.88 (−10.18)** | **38.44 (−11.83)** | **43.69 (+3.60)** | **51.47 (+1.46)** | **54.30 (+1.63)** |
| Qwen-0.5B | ARC-C | +Plug ($\beta > 0$) | **34.67 (−10.39)** | **38.25 (−12.02)** | **43.23 (+3.14)** | **50.90 (+0.89)** | **53.64 (+0.98)** |
| Qwen-0.5B | CSQA | Base | 35.62 | 39.81 | 46.03 | 53.10 | 58.50 |
| Qwen-0.5B | CSQA | +Plug ($\beta = 0$) | **26.79 (−8.84)** | **30.42 (−9.39)** | **46.09 (+0.06)** | **53.84 (+0.74)** | **59.32 (+0.82)** |
| Qwen-0.5B | CSQA | +Plug ($\beta > 0$) | **27.27 (−8.35)** | **30.93 (−8.88)** | **46.48 (+0.45)** | **53.80 (+0.70)** | **59.42 (+0.92)** |
| Qwen-1.5B | ARC-C | Base | 27.37 | 30.94 | 67.07 | 71.34 | 74.92 |
| Qwen-1.5B | ARC-C | +Plug ($\beta = 0$) | **25.36 (−2.01)** | **28.86 (−2.08)** | 66.40 (−0.67) | **71.73 (+0.39)** | **75.00 (+0.08)** |
| Qwen-1.5B | ARC-C | +Plug ($\beta > 0$) | **25.50 (−1.87)** | **28.33 (−2.61)** | **68.03 (+0.96)** | **71.93 (+0.59)** | **75.36 (+0.44)** |
| Qwen-1.5B | CSQA | Base | 23.93 | 27.18 | 63.56 | 69.22 | 74.23 |
| Qwen-1.5B | CSQA | +Plug ($\beta = 0$) | **20.52 (−3.41)** | **23.13 (−4.04)** | **65.22 (+1.67)** | **70.05 (+0.83)** | **74.96 (+0.72)** |
| Qwen-1.5B | CSQA | +Plug ($\beta > 0$) | **20.25 (−3.68)** | **23.25 (−3.93)** | **64.11 (+0.56)** | **69.76 (+0.54)** | **74.47 (+0.23)** |
| Mistral-7B | ARC-C | Base | 31.02 | 37.02 | 60.64 | 70.40 | 73.75 |
| Mistral-7B | ARC-C | +Plug ($\beta = 0$) | **30.56 (−0.47)** | **35.00 (−2.02)** | 60.22 (−0.42) | 70.08 (−0.32) | 73.60 (−0.15) |
| Mistral-7B | ARC-C | +Plug ($\beta > 0$) | **29.62 (−1.40)** | **34.73 (−2.29)** | **63.50 (+2.85)** | **71.41 (+1.01)** | **74.58 (+0.83)** |
| Mistral-7B | CSQA | Base | 24.17 | 29.09 | 62.58 | 70.82 | 74.98 |
| Mistral-7B | CSQA | +Plug ($\beta = 0$) | **22.32 (−1.85)** | **27.73 (−1.36)** | 61.24 (−1.34) | 70.34 (−0.48) | 74.41 (−0.57) |
| Mistral-7B | CSQA | +Plug ($\beta > 0$) | **21.60 (−2.57)** | **25.91 (−3.18)** | **63.74 (+1.16)** | **71.19 (+0.37)** | **75.00 (+0.02)** |

*Averaged shift vs. Base (6 cells)*

| Config | $\Delta_{\mathrm{rel}}\,\widehat{\mathrm{VSR}}_{\gamma}$ | $\Delta_{\mathrm{rel}}\,\widehat{\mathrm{VWR}}_{\gamma}$ | $\Delta_{\mathrm{abs}}\,\mathrm{WC}$ | $\Delta_{\mathrm{abs}}\,\mathrm{PtbAcc}$ | $\Delta_{\mathrm{abs}}\,\mathrm{CleanAcc}$ |
|---|---|---|---|---|---|
| +Plug ($\beta = 0$) | −13.02% | −13.14% | +0.48 pp | +0.44 pp | +0.42 pp |
| +Plug ($\beta > 0$) | −13.97% | −14.37% | +1.52 pp | +0.68 pp | +0.57 pp |

Table 8: VLM $\beta = 0$ ablation ($\mathcal{R}_{\mathrm{spec}}$ only vs. composite). This table reports a CLIP ViT-B/32 + LoRA mechanism ablation for the stability term. The main comparison is between the spectral-only plug-in ($\beta = 0$) and the composite objective ($\beta > 0$). Base provides a non-plug scale reference averaged over lora and lora_adv; the +Plug rows average the corresponding plug-in variants. Values in parentheses report changes relative to Base, and the bottom block averages these shifts over the six dataset–shot cells. Because PGD WC is often near the floor, this VLM ablation is mainly interpreted as evidence of vulnerable-risk compression and utility preservation. Lower is better for $\widehat{\mathrm{VSR}}_{\gamma}$ and $\widehat{\mathrm{VWR}}_{\gamma}$, while higher is better for accuracy metrics. Bold marks the desired direction relative to Base.

*PGD evaluation* (Ptb WC/Ptb Acc = PGD worst-class/accuracy; Clean Acc shown in the last column)

| Dataset | Shots | Config | $\widehat{\mathrm{VSR}}_{\gamma}$ ↓ | $\widehat{\mathrm{VWR}}_{\gamma}$ ↓ | Ptb WC ↑ | Ptb Acc ↑ | Clean Acc ↑ |
|---|---|---|---|---|---|---|---|
| DTD | 4 | Base | 30.99 | 27.05 | 0.00 | 20.92 | 60.76 |
| DTD | 4 | +Plug ($\beta = 0$) | **30.42 (−0.57)** | **26.93 (−0.12)** | 0.00 (+0.00) | **21.04 (+0.12)** | 60.25 (−0.51) |
| DTD | 4 | +Plug ($\beta > 0$) | **28.76 (−2.23)** | **25.78 (−1.28)** | 0.00 (+0.00) | 20.54 (−0.38) | 60.71 (−0.05) |
| DTD | 16 | Base | 49.33 | 27.82 | 0.00 | 21.37 | 68.62 |
| DTD | 16 | +Plug ($\beta = 0$) | **47.10 (−2.22)** | **27.74 (−0.08)** | 0.00 (+0.00) | 21.25 (−0.12) | 68.22 (−0.40) |
| DTD | 16 | +Plug ($\beta > 0$) | **46.51 (−2.82)** | **27.82 (−0.01)** | 0.00 (+0.00) | 21.11 (−0.25) | 68.31 (−0.31) |
| OxfordPets | 4 | Base | 65.86 | 64.19 | 1.00 | 19.34 | 86.56 |
| OxfordPets | 4 | +Plug ($\beta = 0$) | **58.40 (−7.45)** | **62.33 (−1.86)** | **1.01 (+0.01)** | **19.78 (+0.44)** | **86.87 (+0.31)** |
| OxfordPets | 4 | +Plug ($\beta > 0$) | **58.11 (−7.75)** | **62.07 (−2.12)** | **1.25 (+0.25)** | **19.94 (+0.60)** | **86.90 (+0.34)** |
| OxfordPets | 16 | Base | 65.61 | 67.10 | 0.50 | 18.70 | 88.70 |
| OxfordPets | 16 | +Plug ($\beta = 0$) | **62.56 (−3.05)** | **66.18 (−0.93)** | 0.25 (−0.25) | **19.01 (+0.31)** | **89.03 (+0.33)** |
| OxfordPets | 16 | +Plug ($\beta > 0$) | **62.32 (−3.29)** | **65.94 (−1.16)** | 0.25 (−0.25) | **19.20 (+0.50)** | **88.98 (+0.28)** |
| Caltech101 | 4 | Base | 25.83 | 55.99 | 0.00 | 61.89 | 94.14 |
| Caltech101 | 4 | +Plug ($\beta = 0$) | **24.72 (−1.10)** | **53.33 (−2.66)** | 0.00 (+0.00) | **62.11 (+0.22)** | 94.13 (−0.01) |
| Caltech101 | 4 | +Plug ($\beta > 0$) | **24.79 (−1.04)** | **53.12 (−2.87)** | 0.00 (+0.00) | **62.05 (+0.16)** | 94.14 (+0.00) |
| Caltech101 | 16 | Base | 19.94 | 48.44 | 5.00 | 65.13 | 95.09 |
| Caltech101 | 16 | +Plug ($\beta = 0$) | **19.02 (−0.93)** | **43.44 (−5.00)** | **6.67 (+1.67)** | **66.15 (+1.01)** | 94.90 (−0.19) |
| Caltech101 | 16 | +Plug ($\beta > 0$) | **18.92 (−1.02)** | **43.02 (−5.42)** | **6.67 (+1.67)** | **66.20 (+1.06)** | 94.95 (−0.14) |

*Averaged shift vs. Base (6 dataset × shot cells)*

| Config | $\Delta_{\mathrm{rel}}\,\widehat{\mathrm{VSR}}_{\gamma}$ | $\Delta_{\mathrm{rel}}\,\widehat{\mathrm{VWR}}_{\gamma}$ | $\Delta_{\mathrm{abs}}\,\mathrm{PtbWC}$ | $\Delta_{\mathrm{abs}}\,\mathrm{PtbAcc}$ | $\Delta_{\mathrm{abs}}\,\mathrm{CleanAcc}$ |
|---|---|---|---|---|---|
| +Plug ($\beta = 0$) | −5.21% | −3.35% | +0.24 pp | +0.33 pp | −0.08 pp |
| +Plug ($\beta > 0$) | −6.47% | −4.35% | +0.28 pp | +0.28 pp | +0.02 pp |

##### Discussion\.

Tables [7](https://arxiv.org/html/2605.08896#A4.T7) and [8](https://arxiv.org/html/2605.08896#A4.T8) present the functional ablation of the two terms we designed. The ablation isolates the role of the stability term in the composite plug-in objective. The spectral-only variant already reduces the vulnerable-flow measures in both LLM and VLM settings, confirming that the main effect comes from controlling the class-structured error-flow matrix. Adding the stability term further improves the average LLM shifts and makes the accuracy-side gains more reliable, especially in cases where the spectral-only variant reduces risk but does not consistently improve worst-class or perturbation accuracy. For VLMs, the same pattern appears mainly through stronger vulnerable-risk compression and utility preservation, while PGD worst-class accuracy remains floor-limited and is therefore less informative. Thus, these tables support the intended mechanism: spectral control is the core driver of vulnerable-flow reduction, and the stability term helps this control translate more stably into downstream robustness and utility.
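The averaged-shift block of Table 7 can be checked with a short calculation. The sketch below assumes that the relative shift is the mean over cells of (plug − base) / base and that the absolute shift is the mean of per-cell differences; the function names are hypothetical. The inputs are the rounded Base and +Plug ($\beta = 0$) entries for $\widehat{\mathrm{VSR}}_{\gamma}$ and WC-Acc copied from Table 7, so agreement with the reported averages holds only up to rounding.

```python
import numpy as np

# Rounded per-cell values from Table 7, in row order:
# Qwen-0.5B ARC-C, Qwen-0.5B CSQA, Qwen-1.5B ARC-C,
# Qwen-1.5B CSQA, Mistral-7B ARC-C, Mistral-7B CSQA.
base_vsr = np.array([45.06, 35.62, 27.37, 23.93, 31.02, 24.17])
plug_vsr = np.array([34.88, 26.79, 25.36, 20.52, 30.56, 22.32])  # beta = 0
base_wc = np.array([40.09, 46.03, 67.07, 63.56, 60.64, 62.58])
plug_wc = np.array([43.69, 46.09, 66.40, 65.22, 60.22, 61.24])   # beta = 0

def mean_relative_shift(base, plug):
    """Assumed Delta_rel: mean of per-cell relative changes, in percent."""
    return float(np.mean((plug - base) / base) * 100.0)

def mean_absolute_shift(base, plug):
    """Assumed Delta_abs: mean of per-cell differences, in percentage points."""
    return float(np.mean(plug - base))

rel_vsr = mean_relative_shift(base_vsr, plug_vsr)
abs_wc = mean_absolute_shift(base_wc, plug_wc)
print(f"Delta_rel VSR = {rel_vsr:.2f}%  Delta_abs WC = {abs_wc:+.2f} pp")
```

Under this assumption the computed values land close to the reported −13.02% and +0.48 pp. Because each cell's relative change is averaged, the cells with the largest proportional reduction (the two Qwen-0.5B cells here) contribute most to the relative figure.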

### D.4 Plug-in strength sensitivity ($\alpha, \beta$ sweep)

![Refer to caption](https://arxiv.org/html/2605.08896v1/x4.png)

Figure 4: Sensitivity of the plug-in to its strength on CLIP ViT-B/32, Caltech101. We sweep along the diagonal $\beta \approx \alpha/2$ for both $\mathrm{Plugin}_{\mathrm{outer}}$ and $\mathrm{Plugin}_{\mathrm{both}}$. *Left:* clean accuracy remains nearly flat across $\alpha \in \{0, 0.03, 0.1, 0.3, 1\}$, suggesting that the plug-in does not introduce a clear utility cost over this range. *Right:* the vulnerable-flow measure is generally lower for nonzero plug-in strengths than for $\alpha = 0$, although the response is not strictly monotonic. This suggests that the default setting is not an isolated tuned point, while also showing that very aggressive strength choices need not improve the safety metric further.

##### Discussion\.

This sweep is intended as a sensitivity check rather than a full hyperparameter search\. The main observation is that clean accuracy stays stable across the tested range, while the vulnerable\-flow measure is usually reduced once the plug\-in is activated\. The response is not strictly monotonic, so we do not claim that larger regularization strength is always better\. Instead, the sweep supports the weaker and more relevant point that the default plug\-in strength is not a fragile single\-point choice and that the robustness gains do not come from an obvious clean\-accuracy trade\-off\.
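As an illustration of the sweep layout, the sketch below enumerates the diagonal grid from the Figure 4 caption, taking $\beta = \alpha/2$ exactly in place of the approximate relation. The `plugin_penalty` helper is a hypothetical weighted sum in which $\alpha$ scales the spectral term and $\beta$ scales the stability term; this form is an assumption inferred from the $\beta = 0$ ablation and is not a statement of the paper's objective.

```python
# Alpha values listed in the Figure 4 caption.
alphas = [0.0, 0.03, 0.1, 0.3, 1.0]

# Diagonal sweep: beta is tied to alpha (exactly alpha / 2 in this sketch).
sweep = [{"alpha": a, "beta": a / 2} for a in alphas]

def plugin_penalty(r_spec, r_stab, alpha, beta):
    """Hypothetical composite penalty: alpha * spectral + beta * stability."""
    return alpha * r_spec + beta * r_stab

for cfg in sweep:
    # Placeholder term values (1.0 each) only show how the weights scale.
    total = plugin_penalty(1.0, 1.0, cfg["alpha"], cfg["beta"])
    print(f"alpha={cfg['alpha']:<5} beta={cfg['beta']:<6} unit penalty={total:.3f}")
```

Tying $\beta$ to $\alpha$ turns the two-dimensional strength choice into a single knob, so the first grid point ($\alpha = 0$) corresponds to the plug-in being switched off.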

### D.5 Additional VLM sanity checks

Table 9: Cross-model sanity on Qwen2.5-VL-3B-Instruct (4-shot, 3 seeds, PGD-10 at $\varepsilon = 1/255$). Bold = best among LoRA-adv family rows (same convention as the main VLM table). Clean WC is uninformative on Qwen2.5-VL because the per-class evaluation uses only $\sim 10$ samples per class; PGD WC-Acc is often identically 0 for the same reason. We therefore regard $\widehat{\mathrm{VSR}}_{\gamma}$ (pgd) and $\mathrm{VWR}_{\gamma}$ (pgd) as the primary safety signals here.

| Dataset | Method | Clean Acc ↑ | PGD Acc ↑ | Clean WC ↑ | PGD WC ↑ | $\widehat{\mathrm{VSR}}_{\gamma}$ (pgd) ↓ | $\mathrm{VWR}_{\gamma}$ (pgd) ↓ | $n$ |
|---|---|---|---|---|---|---|---|---|
| DTD | ZeroShot | 60.00 | 3.20 | 0.00 | 0.00 | 19.89 | 22.25 | 1 |
| DTD | LoRA | 68.33 ± 3.90 | 14.20 ± 0.82 | 0.00 | 0.00 | 20.63 ± 3.22 | 27.75 ± 0.44 | 3 |
| DTD | LoRA-adv | **61.47 ± 3.35** | 13.60 ± 2.12 | **0.00** | **0.00** | 10.18 ± 2.42 | 18.65 ± 3.04 | 3 |
| DTD | LoRA-adv + plugin (outer) | 61.33 ± 3.73 | **14.20 ± 1.07** | 0.00 | 0.00 | **8.07 ± 1.44** | **16.15 ± 2.25** | 3 |
| DTD | LoRA-adv + plugin (both) | 59.20 ± 2.20 | 13.40 ± 1.31 | 0.00 | 0.00 | 9.01 ± 1.68 | 17.31 ± 2.27 | 3 |

![Refer to caption](https://arxiv.org/html/2605.08896v1/x5.png)

Figure 5: QwenVL cross-model summary on DTD. Delta bars of each LoRA-adv + plug-in variant relative to LoRA-adv on the four reported metrics. Both variants improve the class-structural safety metrics ($\widehat{\mathrm{VSR}}_{\gamma}$, $\widehat{\mathrm{VWR}}_{\gamma}$) with minimal utility cost, showing that the plug-in is not tied to CLIP ViT.

Table 10: Weak-attack supplementary results on CLIP ViT-B/32 (4-shot) - smaller radius. Test robustness is evaluated with PGD at $\varepsilon = 0.5/255$ for 100 steps. These results come from the dedicated weak-attack branch and are reported as point estimates from a single seed ($n = 1$ for every row). Bold = best among LoRA-adv family rows.

| Dataset | Method | Clean Acc ↑ | PGD Acc ↑ | Clean WC ↑ | PGD WC ↑ | $\widehat{\mathrm{VSR}}_{\gamma}$ (pgd) ↓ | $\mathrm{VWR}_{\gamma}$ (pgd) ↓ | $n$ |
|---|---|---|---|---|---|---|---|---|
| DTD | ZeroShot | 42.79 | 19.92 | 0.00 | 0.00 | 18.38 | 21.76 | 1 |
| DTD | LoRA-adv | 59.87 | **28.43** | **16.67** | **0.00** | 27.73 | 26.17 | 1 |
| DTD | LoRA-adv + plugin (inner) | **60.22** | **28.43** | **16.67** | **0.00** | 27.96 | 26.02 | 1 |
| DTD | LoRA-adv + plugin (outer) | 59.93 | 28.07 | **16.67** | **0.00** | **27.33** | 25.72 | 1 |
| DTD | LoRA-adv + plugin (both) | 59.87 | 28.01 | **16.67** | **0.00** | 27.63 | **25.50** | 1 |
| OxfordPets | ZeroShot | 85.04 | 36.74 | 0.00 | 0.00 | 46.69 | 51.86 | 1 |
| OxfordPets | LoRA-adv | 86.18 | 34.94 | 51.00 | 4.00 | 49.36 | 64.42 | 1 |
| OxfordPets | LoRA-adv + plugin (inner) | 86.21 | 34.86 | 51.00 | 4.00 | 49.18 | 63.93 | 1 |
| OxfordPets | LoRA-adv + plugin (outer) | **87.33** | 35.68 | **56.00** | **5.00** | 46.13 | 62.30 | 1 |
| OxfordPets | LoRA-adv + plugin (both) | 87.22 | **35.90** | **56.00** | **5.00** | **45.88** | **61.74** | 1 |
| Caltech101 | ZeroShot | 91.40 | 66.65 | 6.67 | 0.00 | 15.78 | 38.60 | 1 |
| Caltech101 | LoRA-adv | 94.04 | 72.66 | 26.67 | **0.00** | 15.32 | 29.87 | 1 |
| Caltech101 | LoRA-adv + plugin (inner) | 94.00 | **72.74** | **33.33** | **0.00** | 15.51 | 30.24 | 1 |
| Caltech101 | LoRA-adv + plugin (outer) | **94.16** | 72.70 | **33.33** | **0.00** | 15.29 | 29.84 | 1 |
| Caltech101 | LoRA-adv + plugin (both) | 94.08 | 72.62 | 26.67 | **0.00** | **15.22** | **29.74** | 1 |

Table 11: Weak-attack supplementary results on CLIP ViT-B/32 (4-shot) - fewer steps. Test robustness is evaluated with PGD at $\varepsilon = 1.0/255$ for 20 steps. These results come from the dedicated weak-attack branch and are reported as point estimates from a single seed ($n = 1$ for every row). Caltech101 was not included in this branch. Bold = best among LoRA-adv family rows.

| Dataset | Method | Clean Acc ↑ | PGD Acc ↑ | Clean WC ↑ | PGD WC ↑ | $\widehat{\mathrm{VSR}}_{\gamma}$ (pgd) ↓ | $\mathrm{VWR}_{\gamma}$ (pgd) ↓ | $n$ |
|---|---|---|---|---|---|---|---|---|
| DTD | ZeroShot | 42.79 | 14.72 | 0.00 | 0.00 | 19.27 | 21.78 | 1 |
| DTD | LoRA-adv | 60.05 | 21.04 | **16.67** | **0.00** | 29.78 | 26.83 | 1 |
| DTD | LoRA-adv + plugin (inner) | **60.22** | **21.22** | **16.67** | **0.00** | 29.91 | 26.76 | 1 |
| DTD | LoRA-adv + plugin (outer) | 60.05 | 21.10 | **16.67** | **0.00** | 29.63 | 26.30 | 1 |
| DTD | LoRA-adv + plugin (both) | 60.28 | 20.98 | **16.67** | **0.00** | **29.05** | **26.17** | 1 |
| OxfordPets | ZeroShot | 85.04 | 21.45 | 0.00 | 0.00 | 47.68 | 50.80 | 1 |
| OxfordPets | LoRA-adv | 86.18 | 19.41 | 51.00 | 1.00 | 56.04 | 62.25 | 1 |
| OxfordPets | LoRA-adv + plugin (inner) | 86.21 | 19.54 | 49.00 | **2.00** | 55.82 | 62.04 | 1 |
| OxfordPets | LoRA-adv + plugin (outer) | **87.27** | **20.55** | **56.00** | 1.01 | **53.37** | 60.80 | 1 |
| OxfordPets | LoRA-adv + plugin (both) | 86.70 | 20.33 | 47.00 | **2.00** | 53.47 | **60.71** | 1 |

##### Discussion\.

These tables are intended as supplementary sanity checks rather than primary evidence\. Table[9](https://arxiv.org/html/2605.08896#A4.T9)shows that the plug\-in can also reduce vulnerable\-flow measures on Qwen2\.5\-VL, suggesting that the effect is not tied only to CLIP ViT\-B/32\. The weak\-attack CLIP results in Tables[10](https://arxiv.org/html/2605.08896#A4.T10)and[11](https://arxiv.org/html/2605.08896#A4.T11)are single\-seed point estimates, so we interpret them conservatively\. Across these weaker PGD protocols, the outer and both variants usually reduce the vulnerable\-flow measures with little change in clean accuracy, while inner\-only placement is less stable\. PGD worst\-class accuracy remains floor\-limited in several cells, so these tables are used mainly to check that the vulnerable\-flow compression persists under alternative VLM evaluation protocols\.
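The two weak-attack protocols differ from the default setting only in attack radius or step count. As a rough illustration of how those two knobs enter a PGD attack, here is a minimal NumPy sketch on a toy linear softmax classifier. The $L_\infty$ constraint, the random start, the step-size heuristic, and the toy model are all assumptions made for illustration; the sketch does not reproduce the paper's ViT pipeline.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_linf(x, y, W, b, eps, steps, step_size=None, seed=0):
    """L-infinity PGD against a toy linear softmax model (illustrative only).

    x: (n, d) inputs in [0, 1]; y: (n,) integer labels; W: (d, k); b: (k,).
    """
    rng = np.random.default_rng(seed)
    if step_size is None:
        step_size = 2.5 * eps / steps  # assumed heuristic, not from the paper
    onehot = np.eye(W.shape[1])[y]
    # Random start inside the eps-ball, kept in the valid pixel range.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        probs = softmax(x_adv @ W + b)
        grad = (probs - onehot) @ W.T             # d(cross-entropy)/d(input)
        x_adv = x_adv + step_size * np.sign(grad)  # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in the pixel range
    return x_adv

# Toy data and model (hypothetical shapes and values).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(8, 16))
y = rng.integers(0, 3, size=8)
W = rng.normal(size=(16, 3))
b = np.zeros(3)

x_small_radius = pgd_linf(x, y, W, b, eps=0.5 / 255, steps=100)  # Table 10 style
x_fewer_steps = pgd_linf(x, y, W, b, eps=1.0 / 255, steps=20)    # Table 11 style
```

In this sketch the radius `eps` bounds how far any input coordinate can move, while `steps` controls how thoroughly the attack searches within that bound, which is why the two tables probe different ways of weakening the attack.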

## Appendix E Compute resources

All experiments are conducted in Python® on a machine equipped with an AMD EPYC® 7452 32-Core Processor, 128 GB of RAM, and one A100 GPU with 40 GB of VRAM.
