# Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms
Source: [https://arxiv.org/html/2605.05224](https://arxiv.org/html/2605.05224)
Bo Wang, Jia Ni, Mengnan Zhao, Zhan Qin, Kui Ren. Manuscript received April 16, 2026. Bo Wang and Jia Ni are with the School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China. Mengnan Zhao is with the School of Computer Science and Technology, Anhui University, Hefei 230601, China (e-mail: zmn@ahu.edu.cn). Zhan Qin and Kui Ren are with the School of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China.
###### Abstract
The unauthorized use of personal data in model training has emerged as a growing privacy threat. Unlearnable examples (UEs) address this issue by embedding imperceptible perturbations into benign examples to obstruct feature learning. However, existing studies mainly evaluate UEs under from-scratch training settings, leaving their behavior under the widely adopted pretraining–finetuning (PF) paradigm largely unexplored. In this work, we provide the first systematic investigation of unlearnable examples across diverse training paradigms. Our analysis reveals that loading and freezing pretrained weights significantly weakens the effectiveness of existing UE methods. We further explain these findings through semantic filtering: while UEs tend to induce models to overfit non-semantic noise, thereby weakening their semantic extraction capabilities, under the PF paradigm frozen shallow layers preserve data semantics and effectively filter out distracting information such as unlearnable noise. Guided by these insights, we propose a hierarchical deception strategy, Shallow Semantic Camouflage (SSC), that confines the generation process to a semantically valid subspace, aiming to bypass the semantic suppression introduced by pretrained weights. Extensive experiments demonstrate that our method consistently preserves data unlearnability even under challenging training paradigms, such as shallow-layer freezing and semantic-focused pretraining (SF-Pretrain), bridging a critical gap in pretraining-based unlearnable learning.
## I Introduction
The availability of large\-scale datasets\[[2](https://arxiv.org/html/2605.05224#bib.bib1),[15](https://arxiv.org/html/2605.05224#bib.bib2)\]has driven rapid progress in deep learning\. Such data is frequently scraped from public sources and lacks effective access restrictions\[[10](https://arxiv.org/html/2605.05224#bib.bib5),[17](https://arxiv.org/html/2605.05224#bib.bib4)\], leading to growing concerns about the unauthorized use of data and intellectual property protection\[[1](https://arxiv.org/html/2605.05224#bib.bib9),[41](https://arxiv.org/html/2605.05224#bib.bib27),[54](https://arxiv.org/html/2605.05224#bib.bib11)\]\. For example, massive web\-collected datasets frequently contain personal or sensitive user information, and such information can be memorized during model optimization\[[47](https://arxiv.org/html/2605.05224#bib.bib13),[20](https://arxiv.org/html/2605.05224#bib.bib14),[32](https://arxiv.org/html/2605.05224#bib.bib10)\], posing substantial privacy\-leakage risks\[[14](https://arxiv.org/html/2605.05224#bib.bib15)\]\. More seriously, prior work\[[34](https://arxiv.org/html/2605.05224#bib.bib64),[27](https://arxiv.org/html/2605.05224#bib.bib18),[30](https://arxiv.org/html/2605.05224#bib.bib17)\]has demonstrated that even with data sanitization, adversaries can still reconstruct high\-fidelity private content or infer sensitive attributes from trained models\. To mitigate these issues, recent research\[[7](https://arxiv.org/html/2605.05224#bib.bib20),[31](https://arxiv.org/html/2605.05224#bib.bib37),[43](https://arxiv.org/html/2605.05224#bib.bib38),[42](https://arxiv.org/html/2605.05224#bib.bib35),[21](https://arxiv.org/html/2605.05224#bib.bib12)\]has proposed the unlearnable technique, which reduces model generalization on private data and thereby discourages unauthorized data exploitation\.
Figure 1: Protection effect of UEs against unauthorized use. To prevent privacy leaks from raw data, unlearnable perturbations are used to inhibit feature learning. While standard perturbations falter under the PF paradigm, the proposed SSC employs channel-level semantic noise to provide robust data protection.

Early efforts for constructing unlearnable examples originate from the Error-Minimizing Noise paradigm [[12](https://arxiv.org/html/2605.05224#bib.bib31)], which suppresses learnable signals by directly reducing the training loss of each example. Although the resulting perturbations are visually imperceptible, the unlearnable examples [[12](https://arxiv.org/html/2605.05224#bib.bib31), [5](https://arxiv.org/html/2605.05224#bib.bib34)] show vulnerability to adversarial training [[39](https://arxiv.org/html/2605.05224#bib.bib30), [49](https://arxiv.org/html/2605.05224#bib.bib23)]. Meanwhile, data preprocessing operations applied before training can also reduce the effectiveness of unlearnable perturbations [[53](https://arxiv.org/html/2605.05224#bib.bib40), [55](https://arxiv.org/html/2605.05224#bib.bib41), [19](https://arxiv.org/html/2605.05224#bib.bib42)], making it essential to enhance robustness. To this end, several methods [[6](https://arxiv.org/html/2605.05224#bib.bib32), [44](https://arxiv.org/html/2605.05224#bib.bib19), [45](https://arxiv.org/html/2605.05224#bib.bib21)] have been proposed that exhibit increased resistance to adversarial training. Nevertheless, both Error-Minimizing Noise and Robust Unlearnable Examples suffer from optimization instability. To address this issue, Liu et al. [[24](https://arxiv.org/html/2605.05224#bib.bib33)] further introduced Stable Unlearnable Examples, which improves the stability of the defensive noise by training it against random perturbations.
While recent unlearnable approaches have demonstrated strong effectiveness in obstructing model learning, they mainly focus on the from\-scratch training settings\. However, their behavior under the widely adopted pretraining–finetuning paradigm\[[13](https://arxiv.org/html/2605.05224#bib.bib24),[26](https://arxiv.org/html/2605.05224#bib.bib25)\]remains insufficiently explored\. Thus, this work provides a systematic investigation of UEs across diverse training paradigms\. The analyses reveal that loading and freezing pretrained network weights during optimization substantially weakens the unlearnability of existing methods\. In particular, the loading and freezing of the shallow layers of pretrained models significantly affect the protection performance\. Building on this, we further observe that weight\-aware perturbations tailored for pretrained weights can mitigate the vulnerability, but these generation methods exhibit limited cross\-weight transferability in practical black\-box scenarios\. Consequently, both normal and weight\-aware unlearnable examples remain vulnerable under the PF paradigm\.
From the perspective of semantic mismatch, we further provide explanations for these findings\. In our view, the pretrained shallow layers extract semantic information, whereas existing unlearnable perturbations are mostly semantically inconsistent with natural images\. This mismatch enables frozen shallow layers to function as reliable semantic filters, ignoring semantic\-mismatch information and preserving the structure of natural objects, thus passing more useful information\. To further validate this hypothesis, we introduce SF\-Pretrain, which forces the pretrained network to extract semantic representations, thereby concentrating on natural representation spaces and enhancing semantic filtering capabilities during shallow layer learning\. Experimental results demonstrate that loading weights reinforced with semantic priors can more effectively reduce unlearnability\.
In this work, to enhance the data protection capabilities of UEs across diverse training paradigms, we further propose a hierarchical deception strategy, Shallow Semantic Camouflage, preserving the efficacy of unlearnable examples suppressed by pretrained weights and semantic\-focused shallow layers\. In contrast to conventional methods that optimize noise solely to induce overfitting, our framework employs a reference model as a semantic guide and applies adversarial constraints to enforce strict semantic alignment in shallow layers to generate channel\-level semantic perturbations\. This approach compels the optimization process to transfer perturbations from shallow features to higher levels of the representation space\. Within these spaces, the induced perturbations can simulate genuine semantics during propagation\. As a result, perturbations bypass shallow filtering and influence deeper semantic processing\. Extensive experiments confirm that our method not only performs robustly under standard transfer settings, but also shows strong robustness under challenging training paradigms such as shallow freezing and SF\-Pretrain\. We summarize our main contributions as follows:
- •We systematically reveal the vulnerability of unlearnable examples under the pretraining–finetuning paradigm. Through a comprehensive analysis across diverse finetuning configurations, we find that loading and freezing pretrained shallow layers significantly weakens the protection of existing UEs.
- •We systematically explain why UEs fail under the PF paradigm. Through an analysis of feature propagation and frequency-domain behavior, we observe that semantic mismatch with natural image statistics is the primary cause of the degradation in unlearnability.
- •We propose a hierarchical deception strategy to preserve unlearnability across diverse training paradigms. By enforcing semantic alignment in shallow layers, it generates channel-level semantic perturbations that can bypass pretrained semantic filters and remain robust.
- •Extensive experiments on CIFAR\-10, CIFAR\-100, and Tiny\-ImageNet demonstrate that our hierarchical deception strategy consistently outperforms state\-of\-the\-art \(SOTA\) baselines across diverse training paradigms\.
## II Related Work
### II-A Unlearnable Examples
Unlearnable examples are a data\-centric privacy protection technique\. Its core principle is to introduce imperceptible perturbations into training examples, aiming to prevent machine learning models from extracting effective feature representations, thereby protecting data privacy\. Existing research on unlearnable examples can be broadly categorized into three directions according to their optimization objectives: improving protection effectiveness, enhancing perturbation robustness, and increasing task versatility and stealthiness\.
The first line of work focuses on improving protection effectiveness\. Error\-Minimizing noise\[[12](https://arxiv.org/html/2605.05224#bib.bib31)\]formulates UEs generation as a bi\-level optimization problem that induces the model into local optima\. To alleviate its computational cost, the Game\-Theoretic Unlearnable Example\[[23](https://arxiv.org/html/2605.05224#bib.bib57)\]reformulates the process as a Stackelberg game with a generator approximating the equilibrium\. Other approaches enhance effectiveness from different perspectives\. Targeted Adversarial Poisoning\[[5](https://arxiv.org/html/2605.05224#bib.bib34)\]frames unlearnability as a data poisoning problem, while Neural Tangent Generalization Attack\[[50](https://arxiv.org/html/2605.05224#bib.bib58)\]exploits Neural Tangent Kernel analysis to induce clean\-label generalization failure in black\-box settings\. The Autoregressive method\[[38](https://arxiv.org/html/2605.05224#bib.bib59)\]further removes reliance on surrogate models by generating perturbations without targeting a specific network\.
From the perspective of robustness, previous studies focus on preserving the efficacy of perturbation under data augmentation and defensive training strategies\. Robust Error\-Minimizing\[[6](https://arxiv.org/html/2605.05224#bib.bib32)\]integrates adversarial training and expectation over transformations into the optimization process to counter augmentations\. Stable Error\-Minimizing\[[24](https://arxiv.org/html/2605.05224#bib.bib33)\]introduces consistency regularization into the generation process to maintain consistent unlearnable properties of perturbations across diverse model parameter perturbations and input transformations\. Provably Unlearnable Examples\[[43](https://arxiv.org/html/2605.05224#bib.bib38)\]no longer relies solely on empirical test accuracy; instead, it adopts techniques such as parametric smoothing to derive certified upper bounds on achievable test accuracy, providing theoretical guarantees for the degradation of model performance\.
A third research direction investigates versatility and stealthiness across diverse tasks and model settings\. Versatile Transferable Generator\[[22](https://arxiv.org/html/2605.05224#bib.bib56)\]introduces an adversarial domain enhancement strategy that generates perturbations by simulating changes in the data distribution, thereby improving cross\-architecture transferability\. Multimodal Unlearnable Examples\[[42](https://arxiv.org/html/2605.05224#bib.bib35)\]introduce collaborative disruption between the visual and linguistic modalities by jointly optimizing image perturbations and textual triggers\. Deep Hiding\[[31](https://arxiv.org/html/2605.05224#bib.bib37)\]enhances perceptual stealthiness by embedding predefined semantic patterns into clean examples using invertible neural networks\.
### II-B Existing Defenses against UEs
Defenses against unlearnable examples aim to restore data utility by reducing the impact of imperceptible perturbations during model training\. Existing research can be divided into three primary directions: preprocessing\-based defenses, training\-phase defenses, and generative purification defenses\.
Preprocessing\-based defenses leverage the sensitivity of low\-level perturbation features to image processing to neutralize UEs through data transformation\. Traditional methods utilize lightweight operations such as grayscale conversion, JPEG compression\[[25](https://arxiv.org/html/2605.05224#bib.bib39),[29](https://arxiv.org/html/2605.05224#bib.bib43)\], and spatial filtering to eliminate the shortcuts introduced by UEs\[[36](https://arxiv.org/html/2605.05224#bib.bib46)\]\. While computationally efficient, these methods are often bypassed by unlearnable examples with complex features, and their effectiveness varies significantly depending on the specific attack mechanism\.
Training\-phase defenses integrate robustness directly into the model optimization process\. Adversarial Training\[[28](https://arxiv.org/html/2605.05224#bib.bib47)\]is a classic strategy that forces the model to learn robust features\. To balance efficiency, UEraser\[[36](https://arxiv.org/html/2605.05224#bib.bib46)\]applies combinations of augmentations \(such as Mixup\[[52](https://arxiv.org/html/2605.05224#bib.bib53)\], CutMix\[[51](https://arxiv.org/html/2605.05224#bib.bib54)\], and CutOut\[[3](https://arxiv.org/html/2605.05224#bib.bib52)\]\) and loss\-maximizing policies to diversify data distributions\. Furthermore, specialized objectives such as orthogonal projection\[[37](https://arxiv.org/html/2605.05224#bib.bib49)\]isolate and suppress the influence of unlearnable examples during backpropagation\. Despite their robustness, the computational cost of these measures during the training phase is massive, and they may lead to potential performance degradation on clean data\[[25](https://arxiv.org/html/2605.05224#bib.bib39)\]\.
Generative purification defenses utilize deep generative architectures to separate or strip away perturbations from clean data\. For example, D\-VAE\[[48](https://arxiv.org/html/2605.05224#bib.bib50)\]employs a rate\-constrained variational autoencoder for unsupervised perturbation separation, thereby maintaining data integrity while removing harmful signals\. Similarly, AVATAR\[[4](https://arxiv.org/html/2605.05224#bib.bib44)\]builds upon the methodology of DiffPure\[[33](https://arxiv.org/html/2605.05224#bib.bib51)\], applying diffusion models to counteract intentional perturbations while preserving the essential semantics of training images\. However, these approaches depend on complex architectures and require substantial auxiliary data, which increases implementation costs\[[35](https://arxiv.org/html/2605.05224#bib.bib16)\]\.
Overall, existing defenses balance efficiency, robustness, and deployment cost\. Defenses based on preprocessing incur relatively low computational overhead but often struggle to handle complex feature distortions\. Training\-stage defenses enhance model resilience through optimized learning strategies; however, they are accompanied by substantial computational expenses and performance degradation\. Generative purification methods achieve strong restoration quality, yet their reliance on sophisticated architectures and auxiliary data limits their scalability in large\-scale deployment scenarios\.
Existing generation methods of unlearnable examples primarily focus on the training\-from\-scratch paradigm, leaving their efficacy under pretraining–finetuning unexplored\. This PF paradigm is both more computationally efficient and more aligned with real\-world deployment than training models from the ground up\. Our evaluation reveals that the pretrained models, especially with layer freezing, significantly suppress the unlearnability of existing UEs\. Instead, this paper proposes a novel SSC to realize unlearnability under the PF paradigm\.
## III Vulnerability of Unlearnable Examples
### III-A Preliminary
#### III-A1 Unlearnable Examples
Unlearnable examples aim to prevent models from using unauthorized examples by reducing the learning performance of the model on perturbed data. Following prior work [[12](https://arxiv.org/html/2605.05224#bib.bib31), [6](https://arxiv.org/html/2605.05224#bib.bib32)], we consider a standard $K$-class image classification task. Let $\mathcal{D}_c=\{(x_i,y_i)\}_{i=1}^{N}$ denote the clean training dataset containing $N$ examples, where $x_i\in\mathcal{X}\subset\mathbb{R}^d$ represents the input image and $y_i\in\mathcal{Y}=\{1,\dots,K\}$ denotes the corresponding label.
To prevent unauthorized model training using the dataset, the protector creates an unlearnable dataset $\mathcal{D}_u$ by adding imperceptible perturbations to the clean images. The unauthorized model is denoted by $f(\cdot;\theta):\mathcal{X}\to\mathcal{Y}$, with parameters $\theta$. The unlearnable dataset is defined as:
$$\mathcal{D}_u=\{(x_i+\delta_i,y_i)\}_{i=1}^{N},\qquad(1)$$ where $\delta_i\in\mathbb{R}^d$ represents the unlearnable perturbation. To ensure visual stealthiness, the perturbation is limited within a bounded budget, generally satisfying $\|\delta_i\|_p\leq\epsilon$.
Existing methods lead the model to learn perturbations rather than semantic information by injecting shortcut features that take precedence over semantic content. Specifically, this is achieved by solving a bi-level optimization problem on a surrogate model $f'(\cdot;\theta')$. In this process, the inner optimization models how a learner updates model parameters using perturbed data, while the outer optimization adjusts the perturbation to hinder learning as much as possible.
$$\min_{\delta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}_c}\big[\mathcal{L}(f'(x+\delta;\theta^{*}),y)\big],\quad\text{s.t.}\quad\theta^{*}=\arg\min_{\theta'}\mathbb{E}_{(x,y)\sim\mathcal{D}_c}\big[\mathcal{L}(f'(x+\delta;\theta'),y)\big],\qquad(2)$$

where $\mathcal{L}(\cdot)$ denotes the cross-entropy loss function. The perturbation $\delta$ encourages the training loss to decrease more rapidly on shortcut features than on semantically meaningful features during optimization. As a result, the unauthorized model $f$ is more likely to learn non-robust representations, leading to degraded generalization performance on the clean test set.
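To make the bi-level structure of Eq. (2) concrete, the sketch below (a simplified PyTorch-style illustration under our own naming, not the authors' released code) alternates a few inner steps that train the surrogate on the perturbed batch with an outer projected-gradient step that further lowers the loss with respect to $\delta$:

```python
import torch
import torch.nn.functional as F

def error_minimizing_step(model, optimizer, x, y, delta,
                          epsilon=8/255, eta=1/255, inner_steps=5):
    """One alternation of the bi-level problem in Eq. (2): the inner loop trains
    the surrogate on the perturbed batch; the outer step updates delta so that
    the training loss decreases even further (error-minimizing noise)."""
    model.train()
    for _ in range(inner_steps):                      # inner: update theta'
        optimizer.zero_grad()
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        loss.backward()
        optimizer.step()
    delta = delta.detach().requires_grad_(True)       # outer: update delta
    loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
    grad, = torch.autograd.grad(loss, delta)
    with torch.no_grad():                             # projected signed-gradient step
        delta = torch.clamp(delta - eta * grad.sign(), -epsilon, epsilon)
    return delta.detach()
```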
#### III-A2 PF Paradigm
Unlike previous studies that predominantly evaluated unlearnable examples under the training\-from\-scratch setting, this work investigates the robustness of UEs under the practical pretraining–finetuning paradigm\. This setup reflects realistic unauthorized training scenarios, where adversaries typically rely on pretrained weights instead of training models from scratch\.
Let $f(\cdot;\theta_{pre})$ denote a model pretrained on a large-scale source dataset. Under the PF paradigm, the adversary initializes the unauthorized model parameters $\theta$ with $\theta_{pre}$ to perform the classification task described in Eq. (2). To formally analyze the impact of layer-wise transferability, we freeze different network components to obtain distinct finetuning configurations. Regardless of the specific freezing strategy employed, the model parameters can be generalized into a binary structure. Specifically, we decompose the model parameters $\theta$ into two functional components:
$$\theta=\{\theta_f,\theta_l\},\qquad(3)$$ where $\theta_f$ (fixed parameters) denotes the subset of weights kept frozen during the fine-tuning process, and $\theta_l$ (learnable parameters) represents the remaining weights that are subject to gradient-based optimization. This decomposition is particularly important for investigating how the fixed layers of the pretrained model interact with unlearnable perturbations and how the learnable layers evolve during training.
During the fine-tuning phase, the model loads pretrained parameters and subsequently updates $\theta_l$ on the unlearnable dataset $\mathcal{D}_u$, expressed as:
$$\theta_l^{*}=\arg\min_{\theta_l}\sum_{(x_i+\delta_i,y_i)\in\mathcal{D}_u}\mathcal{L}\big(f(x_i+\delta_i;\theta_f^{pre},\theta_l),y_i\big).\qquad(4)$$
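For illustration, a minimal PyTorch sketch of the Standard PF setup in Eq. (4) is given below; the torchvision model and the choice of frozen modules follow the "freeze conv1 and layer1" configuration adopted later in the paper, while the optimizer hyperparameters are placeholders rather than reported values:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 as the unauthorized model and adapt its head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. a 10-class downstream task

# Freeze the shallow components (theta_f); here conv1/bn1 and layer1.
for module in (model.conv1, model.bn1, model.layer1):
    for p in module.parameters():
        p.requires_grad = False

# Only the learnable parameters theta_l enter the optimizer, as in Eq. (4).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9, weight_decay=5e-4)
```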
TABLE I: Vulnerability analysis of UEs under the PF paradigm. We evaluate the clean test accuracy (%) of various UE methods against unauthorized models with different frozen-layer configurations. The column headers denote the layers frozen during the unauthorized model's training, where $L_{1,2}$ indicates that Layer 1 and Layer 2 are frozen, while '—' denotes full model training without any frozen layers. Higher accuracy indicates greater vulnerability of the unlearnable examples.

To thoroughly evaluate the robustness of unlearnable examples under the PF paradigm, we conduct quantitative experiments on the CIFAR-10 dataset using the ResNet-18 architecture. In our experimental configuration, $L_i$ denotes the $i$-th residual block of the model. For example, $L_{3,4}$ indicates that the first two blocks remain trainable, while the third and fourth blocks are fixed with pretrained weights. We measure the clean test accuracy of the unauthorized model to evaluate defense effectiveness: higher accuracy indicates greater vulnerability of the UEs. The quantitative results are summarized in Table [I](https://arxiv.org/html/2605.05224#S3.T1). Our key observations are as follows:
1. Unlearnable examples demonstrate significant effectiveness under the training-from-scratch paradigm. For instance, training on EMN yields a clean test accuracy of 18.47%, indicating that model generalization is successfully prevented in the absence of pretrained parameters.
2. All evaluated UE methods exhibit significant vulnerabilities under the PF paradigm. The protection fails substantially when foundational components are included in the fixed parameters $\theta_f$. This widespread failure across all evaluated methods suggests that pretrained features can effectively bypass the protection intended by UEs.
3. Freezing shallow layers is more detrimental to UE protection than freezing deep layers. For the AR method, the $L_{1,2}$ configuration reaches 79.27% accuracy, while the $L_{3,4}$ configuration reaches only 12.41%. This phenomenon is also observed for the other defense methods, confirming that shallow-layer features play a dominant role in neutralizing the defense.
Figure 2:Evaluation of cross\-weight robustness in black\-box settings\. Evaluating the transferability of weight\-aware unlearnable examples \(optimized on ImageNet V1 weights\) against models initialized with different weight configurations\. The dashed line marks the theoretical lower bound of the attack\.
### III-B Vulnerabilities of Weight-Aware Unlearnable Generation
Based on the observation that frozen shallow layers significantly attenuate unlearnability, a natural question arises: Can data protectors restore effective protection under the PF paradigm by incorporating pretrained models into the construction of unlearnable examples? To answer this, we conduct a weight\-aware unlearnable generation experiment, investigating the extent to which this method can counteract the observed protection failure\. To systematically evaluate the effectiveness of these adaptive attacks, we generated a series of unlearnable examples by unfreezing different combinations of residual blocks, resulting in 10 distinct generation strategies, ranging from shallow\-focused optimization to deep\-focused optimization and multi\-scale hybrid configurations\.
The performance of these strategies under white\-box settings, where the surrogate and unauthorized model share identical weights, is presented in Table[II](https://arxiv.org/html/2605.05224#S3.T2)\. Furthermore, the black\-box experimental results, which evaluate the transferability of perturbations optimized on ImageNet V1 against different weight initializations, are shown in Figure[2](https://arxiv.org/html/2605.05224#S3.F2)\. Based on a comprehensive comparison and analysis, the primary experimental observations are presented as follows:
TABLE II: Vulnerability analysis of weight-aware UEs. We evaluate the clean test accuracy (%) of UEs generated by 10 freezing strategies under 4 different weight initialization configurations. The column headers denote the trainable layers of the surrogate model during generation. The row headers indicate the trainable layers of the unauthorized model during training.

1. Models trained on weight-aware UEs consistently achieve lower clean test accuracy than those trained on standard UEs under the PF setting. This suggests that incorporating pretrained weights during generation can partially alleviate the degradation of unlearnability.
2. In the white-box setting, freezing conv1 and layer1 during generation yields the strongest protection performance across all configurations. This indicates that constraining perturbation optimization in the shallow layers biases the generation process toward perturbations that are more robust to layer freezing. As a result, these perturbations become more transferable under adaptive attacks.
3. When the weights used during generation precisely match the unauthorized model's initialization (ImageNet V1), the unlearnable examples preserve their unlearnability, reducing accuracy to 28%. This confirms that the perturbation effectively exploits the specific numerical distribution of the weights for which it was optimized.
4. In the black-box setting, when the attack is transferred to models initialized with different weights, such as those pretrained on ImageNet V2/V3 and CIFAR-100, its effectiveness declines markedly, with test accuracy increasing to 45% or higher. This finding indicates that the weight-aware generation strategy overfits to particular parameter values rather than learning weight-agnostic features.
In summary, the results demonstrate that while weight-aware unlearnable generation can maintain protection by leveraging specific pretrained parameters, these perturbations generalize poorly in black-box settings. Since data protectors rarely know the exact weights an unauthorized model will be initialized with, these findings indicate that enhancing the robustness of unlearnable examples under the PF paradigm cannot depend exclusively on exploiting specific pretrained weights.
## IV Failure Mechanisms of Unlearnable Examples
In this section, we explore the reasons why unlearnable examples fail under the pretraining\-finetuning paradigm when shallow layers remain frozen\. First, we evaluated the inconsistency of layer\-wise features and found that features exhibited unusually high consistency in pre\-trained models with frozen shallow layers; this indicates that perturbations fail to dominate feature representations, unlike the patterns observed in models trained from scratch\. Then, we investigated the perturbation transmission dynamics and discovered that semantic mismatch significantly suppresses perturbation energy within these shallow layers\. Additionally, further analysis pinpointed this mismatch to differences in frequency\-domain distributions: pre\-trained shallow layers function as semantic filters that primarily respond to low\-frequency natural image semantics while automatically filtering out mid\-to\-high frequency perturbation signals\. Finally, we validated this analysis through SF\-Pretrain experiments\.
### IV-A Noise Transmission Analysis under PF
To quantitatively describe the impact of unlearnable perturbations at different network depths, we first examine their effects at the representation level via feature-space consistency. Let $f_l(\cdot)$ denote the feature map output by the $l$-th layer of the model. For a clean example $x$ and its unlearnable counterpart $x'=x+\delta$, we measure the layer-wise cosine similarity:
$$\mathcal{S}_l(x,\delta)=\frac{\langle\mathrm{vec}(f_l(x)),\,\mathrm{vec}(f_l(x+\delta))\rangle}{\|\mathrm{vec}(f_l(x))\|_2\cdot\|\mathrm{vec}(f_l(x+\delta))\|_2},\qquad(5)$$ where $\mathrm{vec}(\cdot)$ flattens the feature map into a vector.
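A possible implementation of this measurement with forward hooks is sketched below (layer names assume a torchvision ResNet-18 surrogate; this is an illustrative reconstruction, not the authors' evaluation code):

```python
import torch
import torch.nn.functional as F

def layerwise_cosine_similarity(model, named_layers, x_clean, x_pert):
    """S_l (Eq. 5): cosine similarity between clean and perturbed features,
    captured at the given layers with forward hooks."""
    feats = {name: [] for name in named_layers}
    hooks = []
    for name, module in named_layers.items():
        hooks.append(module.register_forward_hook(
            lambda _m, _inp, out, key=name: feats[key].append(out.detach())))
    model.eval()
    with torch.no_grad():
        model(x_clean)   # first pass fills feats[name][0]
        model(x_pert)    # second pass fills feats[name][1]
    for h in hooks:
        h.remove()
    return {name: F.cosine_similarity(f[0].flatten(1), f[1].flatten(1), dim=1).mean().item()
            for name, f in feats.items()}

# Example usage (layer names are assumptions for a torchvision ResNet-18):
# sims = layerwise_cosine_similarity(model,
#            {"layer1": model.layer1, "layer2": model.layer2,
#             "layer3": model.layer3, "layer4": model.layer4},
#            x, torch.clamp(x + delta, 0, 1))
```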
As shown by the red line in Fig. [3](https://arxiv.org/html/2605.05224#S4.F3), the semantic consistency $\mathcal{S}_l$ exhibits a pronounced declining trend with network depth when training from scratch. In contrast, under the PF paradigm with frozen shallow layers (blue line), we observe a strikingly different phenomenon: despite the injection of perturbations capable of deceiving models trained from scratch, feature representations remain highly consistent throughout the network ($\mathcal{S}_l>0.98$ in stages 1–2 of ResNet).
However, consistency at the feature level alone does not clarify whether the perturbation was successfully injected and subsequently ignored, or whether it failed to propagate through the network\. To directly examine the propagation behavior of unlearnable perturbations, we analyze the dynamics of signal transmission across network layers\.
To better characterize this propagation, we introduce the Perturbation Transfer Rate \(PTR\), a metric that quantifies the sensitivity of feature representations to injected noise, defined as follows:
$$\mathrm{PTR}=\frac{\|\Phi(x+\delta)-\Phi(x)\|_2}{\|\Phi(x)\|_2}.\qquad(6)$$
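The PTR of Eq. (6) can be estimated per layer as in the short sketch below, where `phi` is assumed to map an input batch to the feature map of the layer under study:

```python
import torch

def perturbation_transfer_rate(phi, x, delta):
    """PTR (Eq. 6): relative feature displacement induced by the perturbation."""
    with torch.no_grad():
        f_clean = phi(x).flatten(1)
        f_pert = phi(x + delta).flatten(1)
    num = torch.linalg.vector_norm(f_pert - f_clean, dim=1)
    den = torch.linalg.vector_norm(f_clean, dim=1).clamp_min(1e-12)
    return (num / den).mean().item()
```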
Fig\.[4](https://arxiv.org/html/2605.05224#S4.F4)illustrates the contrasting behaviors of a model trained from scratch on unlearnable examples and a pretrained model with frozen shallow layers\. Unlearnable noise demonstrates distinct transmission patterns under these two training paradigms\. The model trained from scratch exhibits a steadily increasing PTR, confirming that perturbations are amplified during training and subsequently become the primary shortcut features\. In contrast, the pretrained model demonstrates significant suppression of perturbation energy\. This attenuation emerges in the initial shallow layer and persists throughout the network\. These findings suggest that unlearnable perturbations generated by different methods do not propagate beyond frozen, pretrained, shallow layers, thereby preventing their influence on downstream learnable parameters\. Consequently, the feature inconsistency observed in Fig\.[3](https://arxiv.org/html/2605.05224#S4.F3)results directly from perturbation transmission failure, rather than from insensitivity of the pretrained model to the features\.
Figure 3: Layer-wise feature consistency. Cosine similarity $\mathcal{S}_l$ between clean and unlearnable feature representations across network depth. While perturbations progressively decouple features in models trained from scratch, pretrained models with frozen shallow layers maintain consistently high similarity.

Figure 4: Perturbation transmission analysis. Solid bars represent UEs trained from scratch, showing progressive amplification of perturbation energy in deeper layers. In contrast, hatched inner bars denote pretrained models, where the energy is attenuated. The line indicates the mean PTR trend of the models.
### IV-B Analysis of Semantic Filtering
Figure 5: Illustration of the Semantic Filtering Effect. This figure demonstrates that the inherent semantic information in the input is naturally stronger than the perturbations. The frozen shallow layers of the model act as a filter, further suppressing noise and amplifying the dominance of semantic features.

The preceding analysis demonstrates that frozen, pretrained shallow layers inhibit the propagation of unlearnable perturbations. However, the precise physical origin of this semantic filtering, and why these perturbations fail to bypass the shallow layers, remains unclear. To address this, we perform a spectral analysis, which reveals that the pretrained shallow layers act as semantic filters due to the statistical mismatch between the perturbations and the semantics of natural images.
The response of the shallow layers $\Phi(\cdot;\theta_s)$ can be approximated via a first-order Taylor expansion:
$$\Phi(x+\delta)\approx\Phi(x)+\nabla_x\Phi(x)\cdot\delta.\qquad(7)$$ Under this approximation, the feature discrepancy induced by the perturbation $\delta$ is governed by the projection term $\|\nabla_x\Phi(x)\cdot\delta\|_2$, which reflects how much of the perturbation survives the shallow layers. A spectral mismatch between the perturbation energy and the frequency components of natural images leads to discrepancies in shallow-layer gradient interactions, which in turn produce semantic filtering. To quantitatively characterize this mismatch, we employ Power Spectral Density (PSD) analysis in the frequency domain. We compute the radially averaged PSD, denoted $P(f)$, by averaging the 2D power spectrum $|\mathcal{F}(z)(u,v)|^2$ over the azimuthal angle $\phi$ for a given radial frequency $f=\sqrt{u^2+v^2}$:
$$P(f)=\frac{1}{2\pi}\int_{0}^{2\pi}|\mathcal{F}(z)(f,\phi)|^2\,d\phi.\qquad(8)$$ Here, $z$ denotes an input (either a clean example $x$ or a perturbation $\delta$). The function $P(f)$ measures the energy distribution across spatial frequencies $f$. As illustrated in Fig. [6](https://arxiv.org/html/2605.05224#S4.F6)(a), the radially averaged PSD $P(f)$ reveals a fundamental divergence in energy distribution between natural images and perturbations. For a more precise comparison, we define the relative spectral density between the perturbation and the natural image as:
$$R(f)=\log_2\!\left(\frac{P_\delta(f)}{P_x(f)}\right),\qquad(9)$$ where $P_\delta(f)$ and $P_x(f)$ denote the radially averaged PSD of the perturbation $\delta$ and the clean image $x$, respectively.
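The radially averaged PSD of Eq. (8) and the relative density of Eq. (9) can be approximated by binning the 2D power spectrum by integer radius, as in the following NumPy sketch (an approximation of the azimuthal integral, not the authors' exact analysis code):

```python
import numpy as np

def radial_psd(img):
    """Radially averaged power spectral density P(f) (Eq. 8) of a 2D array
    (e.g. one channel of an image or a perturbation)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.indices(spec.shape)
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2).astype(int)  # radial frequency bins
    # Mean power per radial bin approximates the azimuthal average.
    psd = np.bincount(r.ravel(), weights=spec.ravel()) / np.bincount(r.ravel())
    return psd[: min(cy, cx)]

def relative_spectral_density(delta, x, eps=1e-12):
    """R(f) = log2(P_delta(f) / P_x(f)) (Eq. 9); delta and x share the same size."""
    return np.log2((radial_psd(delta) + eps) / (radial_psd(x) + eps))
```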
As shown in Fig. [6](https://arxiv.org/html/2605.05224#S4.F6)(b), clean images exhibit a characteristic power-law decay, concentrating energy in low-frequency components associated with semantic structure. In contrast, unlearnable perturbations show reduced energy in these low-frequency bands and a relative dominance in the mid-to-high frequency range. Because the shallow kernels $\Phi$ are pretrained to capture natural image regularities, their input gradients $\nabla_x\Phi(x)$ predominantly reside in a subspace aligned with these low-frequency semantic structures. Consequently, as the perturbation $\delta$ deviates from these spectral patterns, its interaction with the shallow-layer gradients weakens, leading to a minimal projection:
$$\|\nabla_x\Phi(x)\cdot\delta\|_2\ll\|\Phi(x)\|_2.\qquad(10)$$

Figure 6: Spectral analysis of UEs. (a) Radial PSD of various examples, where the power density $P(f)$ is defined in Eq. (8). (b) Relative spectral energy density $R(f)$ between perturbations and clean examples, defined in Eq. (9) as the $\log_2$ ratio.

Figure 7: Robustness comparison of UEs under diverse training paradigms. UEs exhibit vulnerability under SF-Pretrain.
These results confirm that frozen pretrained shallow layers function as semantic filters. By suppressing spectral components inconsistent with natural image statistics, they prevent perturbations from propagating to deeper learnable parameters. This ensures that updates to $\theta_{deep}$ are driven by authentic semantics, disrupting the shortcut pathways exploited by UEs.
### IV-C Analysis of Shallow Semantic Enhancement
Figure 8: Overview of the proposed Hierarchical Deception framework via Shallow Semantic Camouflage. After creating noise $\delta$, we introduce a reference anchor to enforce semantic alignment: the noise is anchored to the semantic features of a reference model. Through a bilevel optimization loop, the attack simultaneously updates the surrogate model's weights and the perturbation while maintaining semantic alignment ($R_{sem}$).

To further validate the semantic-filtering explanation, this part introduces SF-Pretrain, which enhances semantic awareness in pretrained shallow layers. Specifically, SF-Pretrain imposes constraints on the shallow layers of the model, prompting the shallow network to extract more discriminative features. Let $\{\Phi_k(x;\theta_s^{(k)})\}_{k=1}^{K}$ denote the shallow feature extractors at $K$ distinct early stages. We attach a lightweight auxiliary classifier $h_{aux}^{(k)}(\cdot)$ to each shallow exit; the enhancement objective is formulated as:
$$\mathcal{L}_{SF}=\mathcal{L}_{task}(f(x;\theta),y)+\lambda\sum_{k=1}^{K}\mathcal{L}_{CE}^{(k)}\big(h_{aux}^{(k)}(\Phi_k(x;\theta_s^{(k)})),y\big),\qquad(11)$$ where $\mathcal{L}_{task}$ is the standard classification loss on the final output, and $\lambda$ is a penalty coefficient that prioritizes shallow semantic alignment. From an information-theoretic perspective, minimizing the joint semantic loss $\sum_k\mathcal{L}_{CE}^{(k)}$ encourages higher mutual dependence between the shallow features at multiple scales and the semantic labels $y$, which is closely related to maximizing a variational lower bound of the Mutual Information (MI):
$$\max_{\theta_s}\sum_{k=1}^{K}I\big(\Phi_k(x;\theta_s^{(k)});y\big).\qquad(12)$$ By explicitly promoting semantic alignment across shallow layers, SF-Pretrain enhances the semantic selectivity of shallow feature extractors, compelling $\theta_s$ to discard information weakly correlated with the semantic class $y$ at the start of the forward pass. We initialize the unauthorized model with the SF-Pretrain weights and evaluate its robustness against SOTA unlearnable methods.
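A minimal sketch of the deep-supervision-style objective in Eq. (11) is shown below; the auxiliary-head design (global pooling plus a linear layer) and the default $\lambda$ are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Lightweight auxiliary classifier h_aux^(k) attached to a shallow stage."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))

def sf_pretrain_loss(logits, shallow_feats, aux_heads, y, lam=3.0):
    """L_SF (Eq. 11): task loss plus lambda-weighted shallow CE losses.
    `shallow_feats` are the Phi_k outputs, `aux_heads` the matching AuxHeads."""
    ce = nn.CrossEntropyLoss()
    loss = ce(logits, y)
    for feat, head in zip(shallow_feats, aux_heads):
        loss = loss + lam * ce(head(feat), y)
    return loss
```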
As visualized in the radar chart in Fig. [7](https://arxiv.org/html/2605.05224#S4.F7), the results provide strong support for our hypothesis. Compared to standard pretraining and training from scratch (the red and blue references), models initialized with SF-Pretrain (green line) further reduce unlearnability, achieving substantially higher clean test accuracy across all evaluated attack benchmarks.
These findings indicate that substantial enhancement of shallow semantics leads to more thorough filtering of unlearnable perturbations\. These results conclusively demonstrate that the failure of existing methods is closely associated with semantic filtering, thereby motivating the development of the proposed Hierarchical Deception strategy to bypass this filter\.
## V SSC for Hierarchical Deception
Based on the preceding analysis, the vulnerability of existing unlearnable examples stems from a semantic mismatch between perturbation and the natural images\. We propose a hierarchical deception framework termed SSC\. Our approach aims to achieve protection by aligning channel\-level semantic perturbations with the shallow semantic manifold of natural images, enabling them to bypass filters and influence the learning process of deeper layers across diverse training paradigms\.
The overview of SSC is shown in Fig. [8](https://arxiv.org/html/2605.05224#S4.F8). SSC begins with generation and semantic anchoring, where a trainable generator produces a perturbation $\delta$ to construct an adversarial example. The example is then processed through both a surrogate model and a frozen reference model to extract shallow features, which are used to calculate a semantic alignment loss ($R_{sem}$). Within the bilevel optimization loop, the framework iteratively updates the surrogate model's weights and the perturbation $\delta$ to ensure the noise remains semantically consistent while achieving the poisoning goal. During the evaluation stage, the optimized perturbations are added to the clean dataset, inducing learning failure in the unauthorized model across diverse training paradigms.
The proposed SSC framework employs a dual-objective bilevel optimization to induce hierarchical deception. Given a surrogate model $f'(\cdot;\theta')$ with parameters $\theta'=\{\theta'_s,\theta'_d\}$, where $\theta'_s$ denotes the shallow parameters and $\theta'_d$ the deep parameters, we introduce a dual-objective inner loop that minimizes classification error and feature discrepancy on perturbed data. The optimization problem is defined as:
$$\min_{\delta}\;\mathcal{L}_{outer}(\theta^{*},\delta)\quad\text{s.t.}\quad\theta^{*}=\arg\min_{\theta'}\mathcal{L}_{inner}(\theta',\delta).\qquad(13)$$ The perturbation $\delta$ is constrained within the $\ell_\infty$-norm ball $\Delta=\{\delta\mid\|\delta\|_\infty\leq\epsilon\}$, where $\mathrm{Proj}_\epsilon$ denotes the projection operator that clips the perturbation to the $\epsilon$-boundary.
The outer objective $\mathcal{L}_{outer}$ targets unlearnability by minimizing the classification loss on the perturbed examples:
$$\mathcal{L}_{outer}(\theta^{*},\delta)=\sum_{(x_i,y_i)\in\mathcal{D}_c}\mathcal{L}_{CE}\big(f'(x_i+\delta_i;\theta^{*}),y_i\big).\qquad(14)$$
By minimizing $\mathcal{L}_{outer}$, the framework encourages the surrogate model to rely on non-robust features, thereby preventing it from learning the underlying clean semantic patterns. Concurrently, we add a semantic alignment term to the standard objective in the inner loop to ensure that these perturbations can penetrate pretrained filters. The modified inner loss $\mathcal{L}_{inner}$ is formulated as:
$$\mathcal{L}_{inner}(\theta',\delta)=\sum_{(x_i,y_i)\in\mathcal{D}_c}\Big[\mathcal{L}_{CE}\big(f'(x_i+\delta_i;\theta'),y_i\big)+\lambda\cdot\mathcal{R}_{sem}\big(g_{\theta'_s}(x_i+\delta_i),\,g_{ref}(x_i)\big)\Big],\qquad(15)$$
where $\mathcal{L}_{CE}$ denotes the cross-entropy loss. For notational simplicity, we use $g_{\theta_s}(\cdot)$ to denote the aggregate shallow feature extractor corresponding to the set of shallow layers defined in Section [IV-C](https://arxiv.org/html/2605.05224#S4.SS3). The term $\mathcal{R}_{sem}$ quantifies the discrepancy between the surrogate model's shallow features and the semantic anchor established by a frozen reference model $g_{ref}$. In practice, $\mathcal{R}_{sem}$ is computed using Gram matrices of feature activations, which capture the channel-wise correlations and texture-level statistics enforced by the reference model within shallow representations. By constraining these Gram-based statistics, the optimization ensures that generated perturbations remain consistent with the semantic priors imposed by the pretrained shallow filters. This dual-objective approach prevents semantically mismatched perturbations from being suppressed during forward propagation.
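Under these definitions, $\mathcal{R}_{sem}$ can be implemented as a Gram-matrix distance between shallow features, as in the following sketch (Algorithm 1 only specifies a batch-averaged squared Frobenius norm; the normalization constant here is our own assumption):

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix of a feature map (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)  # normalization is an assumption

def semantic_alignment_loss(feat_adv, feat_ref):
    """R_sem in Eq. (15): batch-averaged squared Frobenius distance between the Gram
    statistics of the surrogate's shallow features on x+delta and the frozen
    reference features on x."""
    diff = gram_matrix(feat_adv) - gram_matrix(feat_ref)
    return diff.pow(2).flatten(1).sum(dim=1).mean()
```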
By introducing channel-level semantic perturbations, SSC constrains shallow representations toward the natural image manifold, effectively shifting the interference to deeper representations. This misleads the model's learning process and preserves unlearnability across diverse training paradigms.
Algorithm 1: Shallow Semantic Camouflage (SSC)

1: Input: clean dataset $\mathcal{D}_c$, reference extractor $g_{ref}$, surrogate model $f'(\cdot;\theta')$, budget $\epsilon$, penalty $\lambda$, inner steps $K$, epochs $T$, batch size $B$, learning rates $\alpha,\eta$.
2: Output: unlearnable dataset $\mathcal{D}_u$.
3: // Phase I: Initialization
4: $\delta_i\leftarrow\mathbf{0}$ for all $(x_i,y_i)\in\mathcal{D}_c$
5: Initialize $\theta'=\{\theta'_s,\theta'_d\}$
6: // Phase II: Optimization loop
7: for epoch $t=1$ to $T$ do
8:   for minibatch $\{(x_i,y_i)\}_{i=1}^{B}\subset\mathcal{D}_c$ do
9:     $\theta'^{(t)}\leftarrow\theta'$ // save current state for parameter reset
10:    // Inner loop: semantic-constrained training
11:    for $k=1$ to $K$ do
12:      $z_{adv}\leftarrow g_{\theta'_s}(x_i+\delta_i)$; $z_{ref}\leftarrow g_{ref}(x_i)$
13:      $\mathcal{R}_{sem}\leftarrow\frac{1}{B}\sum_{i=1}^{B}\|\mathrm{Gram}(z_{adv})-\mathrm{Gram}(z_{ref})\|_F^2$
14:      $\mathcal{L}_{inner}\leftarrow\frac{1}{B}\sum\mathcal{L}_{CE}(f'(x_i+\delta_i;\theta'),y_i)+\lambda\mathcal{R}_{sem}$
15:      $\theta'\leftarrow\theta'-\alpha\nabla_{\theta'}\mathcal{L}_{inner}$
16:    end for
17:    $\theta^{*}\leftarrow\theta'$
18:    // Outer loop: perturbation update (poisoning)
19:    $\mathcal{L}_{outer}\leftarrow\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{CE}(f'(x_i+\delta_i;\theta^{*}),y_i)$
20:    $\delta_i\leftarrow\mathrm{Proj}_\epsilon\big(\delta_i-\eta\cdot\mathrm{sign}(\nabla_{\delta_i}\mathcal{L}_{outer})\big)$
21:    $\theta'\leftarrow\theta'^{(t)}$ // reset surrogate model
22:  end for
23: end for
24: $\mathcal{D}_u\leftarrow\{(x_i+\delta_i,y_i)\mid(x_i,y_i)\in\mathcal{D}_c\}$
## VI Experiments
### VI-A Experimental Setups
#### VI-A1 Datasets and Models
We evaluate the effectiveness of our method on three benchmark datasets: CIFAR\-10\[[16](https://arxiv.org/html/2605.05224#bib.bib62)\], CIFAR\-100\[[16](https://arxiv.org/html/2605.05224#bib.bib62)\], and Tiny ImageNet\[[18](https://arxiv.org/html/2605.05224#bib.bib61)\]\. These datasets vary in resolution and complexity, serving as standard benchmarks for image classification\. Regarding the model architecture, following the standard evaluation protocol for unlearnable examples\[[12](https://arxiv.org/html/2605.05224#bib.bib31),[6](https://arxiv.org/html/2605.05224#bib.bib32)\], we primarily use ResNet\-18\[[8](https://arxiv.org/html/2605.05224#bib.bib65)\]as the backbone network to simulate the unauthorized model\. To rigorously assess the resistance of unlearnable examples against transfer learning, we utilize models initialized with weights trained on different datasets\. Additionally, we also conduct experiments using pretrained models ResNet\-50\[[9](https://arxiv.org/html/2605.05224#bib.bib66)\], DenseNet\-121\[[11](https://arxiv.org/html/2605.05224#bib.bib68)\], and VGG\-11\[[40](https://arxiv.org/html/2605.05224#bib.bib67)\]to evaluate the cross\-architecture transferability of our method\.
#### VI-A2 Training Paradigms
To rigorously evaluate protection performance, we evaluate our approach across three training paradigms: training from scratch, standard pretraining–finetuning, and semantic-focused pretraining–finetuning. 1) Full Training: As a standard baseline, the unauthorized model is initialized with random weights and trained on the unlearnable dataset. This setting evaluates the basic effectiveness of UEs without prior-knowledge interference. 2) Standard PF: Based on the vulnerability analysis in Section [III-A](https://arxiv.org/html/2605.05224#S3.SS1), we found that freezing shallow layers maximally reduces unlearnability under the standard pretraining–finetuning paradigm; specifically, the configuration freezing conv1 and layer1 achieves the highest clean test accuracy. To rigorously evaluate the effectiveness of unlearnable examples, we adopt this configuration as the default setting for all subsequent pretraining experiments. 3) SF-PF: Building upon the standard PF paradigm, we introduce semantic-focused pretraining–finetuning as a stricter training paradigm. By introducing auxiliary branches and deep supervision into the shallow layers during pretraining, SF-Pretrain significantly enhances the pretrained model's ability to filter out semantically mismatched noise.
TABLE III: Comparison of clean test accuracy (%) ↓ on CIFAR-10, CIFAR-100, and Tiny-ImageNet, using ResNet-18 as the unauthorized model. Columns list the UE generation methods. Lower accuracy indicates better protection performance.

| Dataset | Training Paradigm | Pretraining Dataset | EMN | NTGA | REM | AR | SHR | VTG | GUE | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Full Training | — | 18.47 | 11.83 | 17.25 | 13.91 | 12.06 | 16.58 | 13.62 | 12.34 |
| CIFAR-10 | Standard PF | ImageNet | 65.19 | 66.77 | 50.42 | 64.05 | 46.88 | 63.51 | 55.29 | 32.96 |
| CIFAR-10 | Standard PF | Tiny-ImageNet | 53.63 | 52.14 | 32.75 | 62.38 | 33.57 | 54.92 | 54.11 | 22.68 |
| CIFAR-10 | Standard PF | CIFAR-100 | 57.71 | 61.33 | 52.48 | 53.65 | 49.82 | 55.15 | 59.59 | 38.27 |
| CIFAR-10 | SF-PF | CIFAR-100 | 72.04 | 69.66 | 56.23 | 66.91 | 60.55 | 70.78 | 67.40 | 47.17 |
| CIFAR-100 | Full Training | — | 7.42 | 4.87 | 11.33 | 4.19 | 8.65 | 8.04 | 9.71 | 5.58 |
| CIFAR-100 | Standard PF | ImageNet | 64.27 | 51.95 | 40.66 | 63.48 | 46.13 | 62.81 | 64.55 | 24.39 |
| CIFAR-100 | Standard PF | Tiny-ImageNet | 56.51 | 42.76 | 32.88 | 56.11 | 27.35 | 54.62 | 50.97 | 15.24 |
| CIFAR-100 | Standard PF | CIFAR-10 | 68.09 | 56.44 | 51.70 | 58.53 | 49.88 | 63.26 | 55.61 | 30.17 |
| CIFAR-100 | SF-PF | CIFAR-10 | 72.85 | 62.13 | 62.49 | 63.77 | 60.34 | 68.56 | 66.92 | 49.05 |
| Tiny-ImageNet | Full Training | — | 8.73 | 4.26 | 16.51 | 5.89 | 9.14 | 8.67 | 12.05 | 9.92 |
| Tiny-ImageNet | Standard PF | ImageNet | 57.18 | 52.84 | 49.35 | 60.77 | 53.41 | 51.63 | 62.29 | 31.56 |
| Tiny-ImageNet | Standard PF | CIFAR-100 | 58.93 | 47.70 | 31.44 | 43.58 | 37.21 | 53.07 | 52.86 | 26.19 |
| Tiny-ImageNet | Standard PF | CIFAR-10 | 55.34 | 52.11 | 43.79 | 59.45 | 46.90 | 50.28 | 49.67 | 27.52 |
| Tiny-ImageNet | SF-PF | CIFAR-10 | 61.88 | 55.46 | 47.03 | 62.12 | 49.57 | 55.71 | 53.24 | 34.69 |
| Average Accuracy | | | 52.76 | 48.21 | 38.74 | 55.03 | 41.24 | 51.67 | 51.03 | 28.94 |
#### VI-A3 Compared Methods
We compare our proposed hierarchical deception strategy with seven state\-of\-the\-art unlearnable examples generation methods, covering distinct generation mechanisms: EMN\[[12](https://arxiv.org/html/2605.05224#bib.bib31)\], the pioneering approach based on the error\-minimizing noise paradigm; REM\[[6](https://arxiv.org/html/2605.05224#bib.bib32)\], which integrates adversarial training to enhance robustness against data augmentations; NTGA\[[50](https://arxiv.org/html/2605.05224#bib.bib58)\], an algorithm exploiting Neural Tangent Kernel theory to degrade model generalization; AR\[[38](https://arxiv.org/html/2605.05224#bib.bib59)\], which prevents feature extraction by constructing adversarial regimes; SHR\[[46](https://arxiv.org/html/2605.05224#bib.bib63)\], an approach combining synthetic noise with robustness optimization objectives; GUE\[[23](https://arxiv.org/html/2605.05224#bib.bib57)\], which reformulates the attack as a nonzero\-sum Stackelberg game solved via an autoencoder\-like generator; and VTG\[[22](https://arxiv.org/html/2605.05224#bib.bib56)\], which employs adversarial domain enhancement and perturbation\-label coupling to improve transferability across varying architectures and resolutions\.
#### VI-A4 Experiment Settings
We reproduce all compared methods under identical experimental settings. For a fair evaluation, all methods, including ours, are restricted to an $\ell_\infty$-norm perturbation budget of $\epsilon=8/255$ with respect to the image pixel range $[0,1]$. For all experiments, we enforce a unified training configuration to ensure fair comparison: models are trained with SGD, momentum 0.9, weight decay $5\times10^{-4}$, and a batch size of 128. We train for 100 epochs with a standard learning rate schedule.
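For reference, a hypothetical PyTorch configuration consistent with this description might look as follows; the initial learning rate and decay milestones are placeholders, since the exact schedule is not specified above:

```python
import torch

def make_optimizer(model, lr=0.1):
    # Unified training setup described in the text; lr and milestones are assumptions.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=5e-4)
    # A common "standard" schedule: step decay late in the 100-epoch run.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60, 80], gamma=0.1)
    return optimizer, scheduler
```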
#### VI-A5 Evaluation Metrics
To assess the privacy protection capability of unlearnable examples, we adopt Clean Test Accuracy as the evaluation metric\. This metric quantifies classification accuracy of models trained on unlearnable datasets and evaluated on clean test sets\. Lower clean test accuracy suggests models fail to learn effective information from training data, indicating stronger privacy protection of UEs\.
### VI-B Full Training from Scratch
We first evaluate the effectiveness of unlearnable examples under the standard full training-from-scratch paradigm, as summarized in Table [III](https://arxiv.org/html/2605.05224#S6.SS1.SSS2). Across datasets of increasing complexity, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, our method consistently achieves state-of-the-art protection performance. On CIFAR-10, our approach restricts clean test accuracy to 12.34%, performing comparably to strong baselines such as NTGA, which is competitive in low-complexity settings. As dataset complexity increases, the advantage of our method becomes more pronounced. On Tiny-ImageNet, the most complex dataset in our evaluation, featuring higher resolution and more diverse semantics, our method maintains a low accuracy of 9.92%. This result highlights the method's ability to inject persistent unlearnable signals even when the model must learn highly intricate feature representations. These findings demonstrate that our approach scales effectively with dataset complexity and outperforms baseline methods.
### VI-C Robustness under the PF Paradigm
We next evaluate robustness under the more practical and challenging pretraining–finetuning paradigm, where unauthorized models are initialized from pretrained weights and their shallow layers are frozen. In this setting, the performance gap between our method and the compared methods becomes significantly more pronounced. As shown in Table [III](https://arxiv.org/html/2605.05224#S6.SS1.SSS2), the compared methods suffer a severe protection collapse under the PF paradigm. This degradation can be attributed to the frozen, pretrained shallow layers, which suppress perturbations lying off the natural image feature manifold and thereby cause unlearnable signals to fail. In contrast, our method demonstrates strong robustness under the same conditions. For example, on Tiny-ImageNet using an ImageNet-pretrained ResNet-18, established methods such as AR and GUE fail substantially, with clean test accuracies rising to 60.77% and 62.29%, respectively. Meanwhile, our method suppresses the accuracy to 31.56%, outperforming the strongest compared method, REM, by 17.79 percentage points. This substantial margin highlights the effectiveness of our approach in overcoming the semantic filtering induced by frozen pretrained shallow layers.
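To make the threat model concrete, the sketch below shows one way an unauthorized model could be set up under the PF paradigm in PyTorch: ImageNet-pretrained weights are loaded, the classification head is replaced, and the shallow layers are frozen. Treating `conv1`, `bn1`, and `layer1` of ResNet-18 as the frozen shallow block is an assumption made for illustration; the paper's exact layer split may differ.

```python
# Minimal sketch of the pretraining–finetuning (PF) threat model: pretrained
# initialization plus frozen shallow layers before finetuning on the
# (unlearnable) target data. The shallow/deep split here is an assumption.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head, e.g. for CIFAR-10

shallow_modules = [model.conv1, model.bn1, model.layer1]  # assumed "shallow" block
for module in shallow_modules:
    for p in module.parameters():
        p.requires_grad = False

# Only the unfrozen (deeper) parameters are handed to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```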
TABLE IV: Comparison of clean test accuracy (%) ↓ under the Standard PF paradigm across different architectures. All network architectures are pretrained on ImageNet. Lower accuracy indicates better protection performance.

### VI-D Robustness under the SF-Pretrain Paradigm
Finally, we subject the proposed approach to a more rigorous training paradigm, SF-Pretrain, which explicitly penalizes semantic-mismatch features in shallow layers. This setting represents an extreme scenario in which the semantic filtering effect is maximally strengthened. As summarized in Table [III](https://arxiv.org/html/2605.05224#S6.SS1.SSS2), our approach outperforms the compared methods under the SF-Pretrain paradigm across all datasets. For a finer-grained analysis, we generate unlearnable examples on CIFAR-10 and evaluate them under pretraining paradigms with varying degrees of semantic focus. The penalty strength $\lambda \in \{1, 3, 5, 7\}$ determines the degree of semantic regularization imposed on the shallow layers of the pretrained model; larger values correspond to stronger suppression of semantic-mismatch features. This experiment aims to verify that the filtering of perturbations by frozen shallow layers stems from their semantic mismatch with natural images: as semantic filtering is progressively strengthened, the suppression of these perturbations intensifies.
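The exact SF-Pretrain objective is defined earlier in the paper; purely to illustrate the role of $\lambda$, the sketch below shows one plausible form of a $\lambda$-weighted shallow-layer regularizer, in which the shallow block is penalized for reacting to small non-semantic (noise-like) input changes. The noise model, layer split, and loss form are assumptions, not the paper's formulation.

```python
# Heavily hedged sketch of an SF-Pretrain-style objective: a classification loss
# plus a lambda-weighted penalty that suppresses the shallow layers' response to
# non-semantic input components. Illustrative only; not the paper's definition.
import torch
import torch.nn as nn

def sf_pretrain_loss(model, shallow, images, labels, lam,
                     criterion=nn.CrossEntropyLoss()):
    # Standard classification term on clean pretraining images.
    cls_loss = criterion(model(images), labels)

    # Penalty term: shallow features should barely react to a small,
    # non-semantic perturbation (uniform noise used here as a stand-in).
    noise = (torch.rand_like(images) * 2 - 1) * (8 / 255)
    feat_clean = shallow(images)
    feat_noisy = shallow(images + noise)
    penalty = (feat_noisy - feat_clean).pow(2).mean()

    return cls_loss + lam * penalty  # larger lambda -> stronger semantic filtering

# Example shallow block for a ResNet-18 backbone (assumed split):
# shallow = nn.Sequential(model.conv1, model.bn1, model.relu, model.layer1)
```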
As shown in the bottom section of Table [V](https://arxiv.org/html/2605.05224#S6.T5), increasing $\lambda$ substantially reduces the effectiveness of most baseline methods. Under the strongest semantic constraint ($\lambda = 7$), the protection provided by EMN and NTGA nearly collapses, with clean test accuracies rising to 80.57% and 77.23%, respectively. In contrast, our method remains the most resilient, maintaining a suppressed accuracy of 55.29% even under aggressive semantic enforcement. These results suggest that, by explicitly aligning perturbations with the natural feature manifold, our method constructs a form of semantic camouflage. Although increasing the semantic penalty inevitably weakens the unlearnable effect for all methods, including ours, this degradation is substantially milder for the proposed approach than for prior methods that rely on semantic-mismatch perturbations. Consequently, even as the semantic filters become more restrictive, our method consistently preserves a stronger protection effect, allowing unlearnability to persist when semantic-mismatch perturbations are largely eliminated.
TABLE V: Comparison of clean test accuracy (%) on CIFAR-10 under different $\lambda$ values in the SF pretraining–finetuning paradigm.

### VI-E Scalability to Target Dataset Complexity
To systematically assess the scalability of the proposed method, we evaluate its protection efficacy on datasets characterized by higher resolutions and more intricate semantic spaces. As detailed in Table [III](https://arxiv.org/html/2605.05224#S6.SS1.SSS2), while the baseline methods achieve competitive results on the relatively simple CIFAR-10 dataset, their protection capabilities collapse on the more complex Tiny-ImageNet. For instance, under the Standard PF paradigm with ImageNet pretraining, methods such as AR and GUE fail substantially on Tiny-ImageNet, yielding clean test accuracies of 60.77% and 62.29%, respectively. Conversely, our method consistently maintains a robust unlearnable effect, suppressing the accuracy to 31.56% and outperforming the strongest baseline, REM, by a margin of 17.79 percentage points. This suppression demonstrates that the proposed hierarchical deception strategy SSC scales effectively without being compromised by the rich semantic diversity and high dimensionality of the protected data.
Figure 9: Feature perturbation responses in shallow layers of an SF-Pretrain model. The second row presents the feature activations of the clean example. The following rows visualize feature-level perturbation residuals using the Seismic colormap, where red and blue indicate feature enhancement and suppression, respectively, and white denotes minimal change.

### VI-F Robustness in Black-Box Scenarios
In practical applications, protectors generally do not possess prior knowledge of the architecture or pretraining data utilized by unauthorized models\. Consequently, a practical protection method must exhibit robustness across diverse pretraining datasets and ensure cross\-architecture transferability\.
#### VI-F1 Robustness across Diverse Pretraining Datasets
Unauthorized models are frequently initialized with weights pretrained on diverse large\-scale datasets to accelerate convergence and enhance feature extraction capabilities\. To confirm the cross\-dataset defense efficacy of our method, we evaluate its resilience against variations in pretraining data distributions\.
As shown in the PF sections of Table [III](https://arxiv.org/html/2605.05224#S6.SS1.SSS2), the proposed approach maintains a strong unlearnable effect regardless of the origin of the pretraining data. Specifically, when an unauthorized model uses weights pretrained on Tiny-ImageNet to learn from unlearnable examples generated on CIFAR-100, our strategy restricts the test accuracy to a mere 15.24%. Under the identical cross-domain setting, baseline methods including EMN and NTGA fail to preserve their protective capabilities, permitting accuracies to surge to 56.51% and 42.76%, respectively. Furthermore, in near-domain transfer scenarios where unlearnable CIFAR-10 examples are evaluated on models initialized with CIFAR-100 weights, our method limits the accuracy to 38.27%, significantly outperforming the most competitive baseline, REM, at 52.48%. These outcomes verify that the proposed hierarchical deception strategy SSC aligns perturbations with natural image feature manifolds, ensuring robustness across diverse pretraining datasets.
#### VI-F2 Cross-Architecture Transferability
Beyond diverse pretraining datasets, unauthorized users may deploy a wide array of neural network architectures varying in depth and capacity\. To evaluate cross\-architecture transferability, we generate unlearnable examples utilizing a lightweight ResNet\-18 model and assess their effectiveness against various unseen network architectures\. Crucially, to isolate the impact of architectural variations from pretraining data biases, all evaluated models are pretrained on the same dataset, ImageNet\.
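The sketch below illustrates one way the cross-architecture evaluation could be set up: each unseen architecture is initialized from ImageNet-pretrained torchvision weights and its classification head is replaced before finetuning on the ResNet-18-generated unlearnable data. The head-replacement details and class count are illustrative assumptions.

```python
# Minimal sketch of the cross-architecture evaluation setup: unlearnable examples
# are generated once with ResNet-18, and each unseen architecture is initialized
# from ImageNet-pretrained weights before finetuning on them.
import torch.nn as nn
from torchvision import models

def build_pretrained(arch: str, num_classes: int) -> nn.Module:
    if arch == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif arch == "densenet121":
        m = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    elif arch == "vgg11":
        m = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1)
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, num_classes)
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return m

# Each model is then finetuned on the ResNet-18-generated unlearnable set and
# scored with clean test accuracy, exactly as in the single-architecture case.
victims = {a: build_pretrained(a, num_classes=10)
           for a in ("resnet50", "densenet121", "vgg11")}
```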
The comprehensive results in Table [IV](https://arxiv.org/html/2605.05224#S6.SS3) demonstrate that our method consistently achieves the lowest clean test accuracy across all evaluated architectures and datasets. Notably, the protective efficacy remains highly robust not only on deeper models with expanded capacity such as ResNet-50, where it restricts CIFAR-100 accuracy to 27.83%, but also on architectures employing entirely distinct feature extraction paradigms. Against the dense connectivity pattern of DenseNet-121, our method suppresses the accuracy to 27.95% on CIFAR-10 and 34.72% on Tiny-ImageNet. Similarly, against the sequential convolutions of VGG-11, the accuracy is held at 24.41% on CIFAR-100. These findings provide compelling evidence that the increased parameter capacity and varied structural designs of different models are insufficient to bypass the deeply integrated unlearnable perturbations introduced by our approach.
### VI-G Semantic Perception Analysis in Shallow Layers
We qualitatively analyze the perceptual representations induced by different UE generation methods by extracting features from the first layer of the SF-Pretrain model. By examining the feature perturbations in these initial layers, we assess whether the model's perceptual response to the injected noise aligns with the intrinsic image semantics. To isolate the effect of perturbations from the underlying image structure, we compute the feature-level residuals induced solely by the injected noise. These residuals are visualized using the Seismic colormap, where red denotes feature enhancement, blue indicates suppression, and white corresponds to negligible change.
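A minimal sketch of this residual computation is given below, with a torchvision ImageNet-pretrained ResNet-18 standing in for the SF-Pretrain model and its `conv1`–`bn1`–`relu` stack taken as the "first layer"; both choices, as well as the input preprocessing, are illustrative assumptions.

```python
# Minimal sketch of the feature-residual visualization: first-layer activations
# are extracted for a clean image and its unlearnable counterpart, and their
# difference is rendered with matplotlib's "seismic" colormap
# (red = enhancement, blue = suppression, white = little change).
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
first_layer = nn.Sequential(model.conv1, model.bn1, model.relu)  # assumed "first layer"

@torch.no_grad()
def residual_map(clean, perturbed, channel=0):
    # clean, perturbed: (1, 3, H, W) tensors in [0, 1].
    r = first_layer(perturbed)[0, channel] - first_layer(clean)[0, channel]
    return r.cpu().numpy()

def show_residual(clean, perturbed, channel=0):
    r = residual_map(clean, perturbed, channel)
    lim = abs(r).max() + 1e-8  # symmetric scale so white corresponds to zero change
    plt.imshow(r, cmap="seismic", vmin=-lim, vmax=lim)
    plt.axis("off")
    plt.show()
```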
As shown in Figure [9](https://arxiv.org/html/2605.05224#S6.F9), the second row presents the standard activations of clean samples, serving as a reference for evaluating the perturbation patterns. Compared with conventional baselines, our channel-wise semantic perturbations produce lower-intensity responses in the non-semantic regions of the residual maps, as reflected by the reduced color saturation. Such moderate feature deviations avoid inducing extreme activations, which would otherwise be attenuated by the frozen shallow layers of pretrained models. Moreover, while baseline methods tend to activate spatially irrelevant regions in certain channels, our perturbations exhibit strong spatial alignment with the corresponding clean feature maps. The activation regions closely follow the semantic boundaries of the target objects, indicating a higher degree of semantic consistency.
These observations demonstrate that our method generates perturbations whose feature responses align well with the original image semantics across channels. As a result, the perturbations resemble natural feature variations rather than anomalous signals, allowing them to propagate more effectively through frozen shallow layers and to influence deeper representations during subsequent training.
## VII CONCLUSION
In this study, we conducted the first systematic analysis of UEs under the pretraining–finetuning paradigm. We identified a critical vulnerability in which standard UEs, which depend on shallow shortcuts, are suppressed by the frozen shallow layers of pretrained models through a semantic filtering effect. To address this vulnerability, we introduced SSC, a method that aligns perturbations with shallow semantic representations to circumvent semantic filtering and concentrate the poisoning effect in the deep, trainable layers. Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SSC consistently surpasses existing approaches, preserving the efficacy of unlearnable examples across diverse training paradigms. In summary, this work reveals a previously unrecognized vulnerability of UEs in contemporary transfer learning pipelines and offers a practical approach for safeguarding data privacy across diverse training paradigms.
## References
- \[1\] (2024). The right to scrap data on the internet: from the US case hiQ Labs, Inc. v. LinkedIn Corp. to the ChatGPT scraping cases: differences between US and EU law. Global Privacy Law Review.
- \[2\] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- \[3\] T. DeVries and G. W. Taylor (2017). Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552.
- \[4\] H. M. Dolatabadi, S. Erfani, and C. Leckie (2024). The devil's advocate: shattering the illusion of unexploitable data using diffusion models. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 358–386.
- \[5\] L. Fowl, M. Goldblum, P. Chiang, J. Geiping, W. Czaja, and T. Goldstein (2021). Adversarial examples make strong poisons. In Advances in Neural Information Processing Systems, Vol. 34, pp. 30339–30351.
- \[6\] S. Fu, F. He, Y. Liu, L. Shen, and D. Tao (2022). Robust unlearnable examples: protecting data privacy against adversarial learning. In International Conference on Learning Representations.
- \[7\] Y. Ge, R. Gu, Y. Liu, L. Zhao, B. Du, and Q. Wang (2025). When unlearnable examples cooperate with watermarking: a dual voice data protection against unauthorized exploitation. IEEE Transactions on Dependable and Secure Computing, pp. 1–18.
- \[8\] K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- \[9\] K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- \[10\] R. Hong, J. A. Hutson, W. Agnew, I. Huda, T. Kohno, and J. Morgenstern (2025). A common pool of privacy problems: legal and technical lessons from a large-scale web-scraped machine learning dataset. arXiv:2506.17185.
- \[11\] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- \[12\] H. Huang, X. Ma, S. M. Erfani, J. Bailey, and Y. Wang (2021). Unlearnable examples: making personal data unexploitable. In International Conference on Learning Representations.
- \[13\] E. Iofinova, A. Peste, M. Kurtz, and D. Alistarh (2022). How well do sparse ImageNet models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12266–12276.
- \[14\] S. Ishihara (2023). Training data extraction from pre-trained language models: a survey. arXiv:2305.16157.
- \[15\] T. Karras, S. Laine, and T. Aila (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
- \[16\] A. Krizhevsky, G. Hinton, et al. (2009). Learning multiple layers of features from tiny images.
- \[17\] V. Krotov and L. Johnson (2022). Big web data: challenges related to data, technology, legality, and ethics. Business Horizons.
- \[18\] Y. Le and X. Yang (2015). Tiny ImageNet visual recognition challenge. CS 231N, 7(7), p. 3.
- \[19\] H. J. Lee and Y. M. Ro (2023). Robust proxy: improving adversarial robustness by robust proxy learning. IEEE Transactions on Information Forensics and Security, 18, pp. 4021–4033.
- \[20\] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini (2021). Deduplicating training data makes language models better. pp. 8424–8445.
- \[21\] S. Li, M. Xue, B. Z. H. Zhao, H. Zhu, and X. Zhang (2021). Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Transactions on Dependable and Secure Computing, 18(5), pp. 2088–2105.
- \[22\] Z. Li, J. Cai, G. Xu, H. Zheng, Q. Li, F. Zhou, S. Yang, C. Ling, and B. Wang (2025). Versatile transferable unlearnable example generator. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- \[23\] S. Liu, Y. Wang, and X. Gao (2024). Game-theoretic unlearnable example generator. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 21349–21358.
- \[24\] Y. Liu, K. Xu, X. Chen, and L. Sun (2023). Stable unlearnable example: enhancing the robustness of unlearnable examples via stable error-minimizing noise. arXiv:2311.13091.
- \[25\] Z. Liu, Z. Zhao, and M. A. Larson (2023). Image shortcut squeezing: countering perturbative availability poisons with compression. pp. 22473–22487.
- \[26\] Y. Lu, M. Y. Yang, G. Kamath, and Y. Yu (2024). Indiscriminate data poisoning attacks on pre-trained feature extractors. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 327–343.
- \[27\] N. Lukas, A. Salem, R. Sim, S. Tople, L. Wutschitz, and S. Zanella-Béguelin (2023). Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 346–363.
- \[28\] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018). Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
- \[29\] M. W. Marcellin, M. J. Gormish, A. Bilgin, and M. P. Boliek (2000). An overview of JPEG2000. In Proceedings DCC 2000, Data Compression Conference, pp. 523–541.
- \[30\] S. Meisenbacher, A. Klymenko, A. Bodea, and F. Matthes (2025). The double-edged sword of LLM-based data reconstruction: understanding and mitigating contextual vulnerability in word-level differential privacy text sanitization. In Proceedings of the 24th Workshop on Privacy in the Electronic Society.
- \[31\] R. Meng, C. Yi, Y. Yu, S. Yang, B. Shen, and A. C. Kot (2024). Semantic deep hiding for robust unlearnable examples. IEEE Transactions on Information Forensics and Security, 19, pp. 6545–6558.
- \[32\] J. Morris, S. Newman, K. Palaniappan, J. Fan, and D. Lin (2023). "Do you know you are tracked by photos that you didn't take": large-scale location-aware multi-party image privacy protection. IEEE Transactions on Dependable and Secure Computing, 20(1), pp. 301–312.
- \[33\] W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar (2022). Diffusion models for adversarial purification. In International Conference on Machine Learning (ICML).
- \[34\] S. Pang, Z. Lu, H. Wang, P. Fu, Y. Zhou, M. Xue, and B. Li (2024). Reconstruction of differentially private text sanitization via large language models. In 2025 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), pp. 1–17.
- \[35\] S. Pang, Y. Rao, Z. Lu, H. Wang, Y. Zhou, and M. Xue (2025). PriDM: effective and universal private data recovery via diffusion models. IEEE Transactions on Dependable and Secure Computing, 22(4), pp. 3259–3274.
- \[36\] T. Qin, X. Gao, J. Zhao, K. Ye, and C. Xu (2023). Learning the unlearnable: adversarial augmentations suppress unlearnable example attacks. arXiv preprint arXiv:2303.15127.
- \[37\] P. Sandoval-Segura, S. Singla, J. Geiping, M. Goldblum, and T. Goldstein (2024). What can we learn from unlearnable datasets? Advances in Neural Information Processing Systems, 36.
- \[38\] P. Sandoval-Segura, V. Singla, J. Geiping, M. Goldblum, T. Goldstein, and D. Jacobs (2022). Autoregressive perturbations for data poisoning. Advances in Neural Information Processing Systems, 35, pp. 27374–27386.
- \[39\] P. Sandoval-Segura, V. Singla, J. Geiping, M. Goldblum, and T. Goldstein (2023). What can we learn from unlearnable datasets? Advances in Neural Information Processing Systems, 36, pp. 75372–75391.
- \[40\] K. Simonyan and A. Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- \[41\] G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein (2023). Diffusion art or digital forgery? Investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6048–6058.
- \[42\] F. N. Wang et al. (2024). Multimodal unlearnable examples: protecting data against multimodal contrastive learning. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. \[page range\].
- \[43\] D. Wang, M. Xue, B. Li, S. Camtepe, and L. Zhu (2025). Provably unlearnable data examples. In The Network and Distributed System Security (NDSS) Symposium.
- \[44\] Z. Wang, Y. Wang, and Y. Wang (2021). Fooling adversarial training with inducing noise. arXiv preprint arXiv:2111.10130.
- \[45\] R. Wen, Z. Zhao, Z. Liu, M. Backes, T. Wang, and Y. Zhang (2023). Is adversarial training really a silver bullet for mitigating data poisoning? In The Eleventh International Conference on Learning Representations.
- \[46\] D. Yu, H. Zhang, W. Chen, J. Yin, and T. Liu (2022). Availability attacks create shortcuts. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2367–2376.
- \[47\] W. Yu, T. Pang, Q. Liu, C. Du, B. Kang, Y. Huang, M. Lin, and S. Yan (2023). Bag of tricks for training data extraction from language models. arXiv:2302.04460.
- \[48\] Y. Yu, Y. Wang, S. Xia, W. Yang, S. Lu, Y. Tan, and A. C. Kot (2024). Purify unlearnable examples via rate-constrained variational autoencoders. In International Conference on Machine Learning.
- \[49\] Y. Yu, W. Yang, Y. Tan, and A. C. Kot (2022). Towards robust rain removal against adversarial attacks: a comprehensive benchmark analysis and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6013–6022.
- \[50\] C. Yuan and S. Wu (2021). Neural tangent generalization attacks. In International Conference on Machine Learning, pp. 12230–12240.
- \[51\] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe (2019). CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6022–6031.
- \[52\] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz (2018). Mixup: beyond empirical risk minimization. In Proceedings of the 6th International Conference on Learning Representations.
- \[53\] J. Zhang, S. Peng, Y. Gao, Z. Zhang, and Q. Hong (2023). APMSA: adversarial perturbation against model stealing attacks. IEEE Transactions on Information Forensics and Security, 18, pp. 1667–1679.
- \[54\] X. Zhang, H. Xu, Z. Ba, Z. Wang, Y. Hong, J. Liu, Z. Qin, and K. Ren (2024). PrivacyAsst: safeguarding user privacy in tool-using large language model agents. IEEE Transactions on Dependable and Secure Computing, 21(6), pp. 5242–5258.
- \[55\] Y. Zhu, Y. Chen, X. Li, R. Zhang, X. Tian, B. Zheng, and Y. Chen (2023). Information-containing adversarial perturbation for combating facial manipulation systems. IEEE Transactions on Information Forensics and Security, 18, pp. 2046–2059.