Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.

arXiv:2604.15780v1 Announce Type: cross Abstract: Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.


Cached at: 04/20/26, 08:30 AM

## Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

Source: https://arxiv.org/html/2604.15780

Wai Man Si, Mingjie Li♣, Michael Backes, Yang Zhang♣

CISPA Helmholtz Center for Information Security

###### Abstract

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain “unsafe tickets” responsible for harmful behaviors, and pruning reveals “safety tickets” that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.

♣ Co-corresponding authors.

Figure 1: Illustration of post-training methods for safety. The instruct model fails to produce safe outputs when given unsafe prompts. Circuit breaking maps unsafe prompts to random outputs with LoRA fine-tuning, which requires extra storage. Our framework selectively removes connections to unsafe outputs, forcing the model to redirect unsafe prompts to safe responses.

## Introduction

The rapid advancement of ML, particularly in language models (LMs) and vision-language models (VLMs), has enabled their integration into applications ranging from productivity tools to mobile assistants. While these models demonstrate impressive capabilities, training them from scratch is expensive, requiring substantial computational resources and large datasets. Consequently, developers increasingly rely on open-source pre-trained models such as those released by Mistral [JSMBCCBLLSLLSSLWLS23] and Meta [TLIMLLRGHARJGL23,TMSAABBBBBBBCCCEFFFFGGGHHHIKKKKKKLLLLLMMMMMNPRRSSSSSTTTWKXYZZFKNRSES23,DJPKALMSYFGHYMSKHRZRGSRBTCCNBMMKTWWFNASPLECMGPHLALDSRZSLANMPCNKXTZIKMECLGVPMSLBHLFCHLWYBSPRJSJAUPLHSa24]. While these models lower the barrier to entry, they also raise safety concerns, as their behaviors may misalign with safety constraints or remain vulnerable to adversarial attacks. To mitigate unsafe behaviors in ML models, several post-training methods have been proposed. Common approaches include SFT [OWJAWMZASRSHKMSAWCLL22] and RLHF [CLBMLA17]. While effective, these methods require large annotated datasets, substantial GPU memory, and significant training time. Moreover, they do not explicitly remove the unsafe behaviors inherited from pre-training, and such behaviors often resurface under adversarial prompts or distribution shifts. Lighter-weight alternatives such as internal model interventions [LPVPW23,LBPWKM24,TTULMM23] and prompt engineering [ZYKMWH24] reduce overhead but suffer from scalability issues, manual effort, and inference inefficiencies, and they cannot intrinsically remove unsafe behaviors from the model itself. Overall, existing strategies remain either resource-intensive or fragile in practice.

From the perspective of the Lottery Ticket Hypothesis (LTH) [FC19], we argue that subnetworks responsible for unsafe behaviors (“unsafe tickets”) remain embedded in aligned models. These subnetworks may be activated in rare but critical situations, leading to harmful outputs. To fundamentally eliminate these safety risks, we propose to directly identify and remove these unsafe subnetworks, thereby exposing a “safety ticket” that preserves general utility while mitigating unsafe behaviors.

In this paper, we present a resource-efficient pruning framework for safety alignment (Figure 1 (https://arxiv.org/html/2604.15780#S0.F1)). Our method employs gradient-free attribution to localize parameters contributing disproportionately to unsafe generations and prunes them iteratively, without requiring fine-tuning or prompt engineering. Our pruning framework can mitigate unsafe behaviors in ML models, including both LMs and VLMs, while maintaining overall performance. Empirical evaluations demonstrate its effectiveness. On Mistral-7B-Instruct-v0.2 [JSMBCCBLLSLLSSLWLS23], the proportion of unsafe responses is reduced to 1%, and on LLaVA-v1.6-Mistral-7B [LLLL23], to 2%, with negligible degradation in general performance. These improvements are achieved in just 455 seconds of pruning using a greedy search strategy. Beyond safety, the framework also enhances robustness, reducing the success rates of various attacks. These results highlight the framework’s utility as a lightweight post-training mechanism for enhancing the robustness of ML models against real-world threats.

To better understand these improvements, we find that pruning does not impair the model’s ability to recognize unsafe inputs. Instead, it rebalances the output distribution, reducing the likelihood of unsafe completions while increasing refusal responses. Moreover, these behaviors are primarily localized in the output projection and second MLP blocks, consistent with prior findings [LPVPW23,LBPWKM24].

Our main contributions are as follows:

- We propose a resource-efficient, model-agnostic pruning framework that improves safety and robustness for deployment in resource-constrained environments.
- We introduce a novel perspective that connects pruning-based safety alignment with LTH, demonstrating that models contain “unsafe tickets” whose removal reveals safety-preserving subnetworks.
- We conduct extensive evaluations on LMs and VLMs, showing substantial reductions in unsafe outputs and improved robustness at low computational cost.

Figure 2: Overview of the pruning framework pipeline.

## Methodology

Based on LTH and its extensions [CIJLTZ23,DK21,MYPT19], we hypothesize that unsafe behaviors encountered during pre-training may be retained within the model as sparse subnetworks, which we term “unsafe tickets.” When reactivated, these subnetworks can lead aligned LLMs to exhibit unsafe behaviors. To mitigate these potential risks, we propose a resource-efficient pruning framework that iteratively identifies and removes such subnetworks. As illustrated in Figure 2 (https://arxiv.org/html/2604.15780#S1.F2), the framework consists of four stages as follows.

### Stage 1: Behavior Profiling

The first stage of our framework constructs a model-specific behavioral dataset to ensure that pruning decisions align with actual failure cases on the target model. Unlike fine-tuning, which updates model parameters using general datasets, our goal is to prune connections leading to unsafe outputs, which requires data that directly activates those connections. However, public safety datasets often fail to align with the target model’s behavior. For example, a prompt labeled as unsafe in the dataset may not elicit an unsafe response from the model, making it ineffective for identifying relevant parameters. To address this, we generate two tailored datasets, one for unsafe and one for safe behaviors, by querying the model with diverse prompts. Each prompt-response pair is automatically labeled using an external safety classifier (here, “unsafe” is defined by this safety classifier, though users may substitute other criteria such as truthfulness or bias). To improve labeling accuracy and reduce false positives, we check the label against the refusal indicators listed in Appendix B.6 (https://arxiv.org/html/2604.15780#A2.SS6). To increase diversity and minimize redundancy, we cluster responses within each set and sample $K$ representative pairs, producing a compact dataset that reflects the model’s risk surface. Finally, to avoid degradation in pruning quality caused by response-length variation (see Appendix A (https://arxiv.org/html/2604.15780#A1)), we enforce a fixed response length $l$ during data collection.
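The labeling and length-control steps of this stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `classify` is a hypothetical stand-in for the external safety classifier, `REFUSAL_MARKERS` is an invented placeholder for the refusal indicators of Appendix B.6, whitespace splitting stands in for the model tokenizer, truncation approximates the fixed response length $l$, and the clustering step is omitted.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (prompt, response)

# Hypothetical refusal phrases; the real list is in the paper's appendix.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def profile_behavior(
    pairs: List[Pair],
    classify: Callable[[str, str], bool],  # True means "flagged unsafe"
    response_len: int,
) -> Tuple[List[Pair], List[Pair]]:
    """Split prompt-response pairs into unsafe and safe sets."""
    unsafe, safe = [], []
    for prompt, response in pairs:
        # Keep at most `response_len` whitespace tokens per response.
        text = " ".join(response.split()[:response_len])
        refused = any(marker in text.lower() for marker in REFUSAL_MARKERS)
        # Assumption: a flagged response that contains a refusal phrase is
        # treated as a classifier false positive and kept in the safe set.
        if classify(prompt, text) and not refused:
            unsafe.append((prompt, text))
        else:
            safe.append((prompt, text))
    return unsafe, safe


demo_pairs = [
    ("p1", "Sure, here is how to do it step by step"),
    ("p2", "I'm sorry, but I cannot help with that"),
    ("p3", "The capital of France is Paris"),
]
# Toy classifier that flags the first two prompts.
unsafe_set, safe_set = profile_behavior(
    demo_pairs, lambda p, r: p in {"p1", "p2"}, response_len=4
)
```

In this toy run only the first pair lands in the unsafe set: the second is flagged by the classifier but contains a refusal phrase, and the third is never flagged.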

### Stage 2: Attribution Analysis

The second stage quantifies the contribution of model parameters to unsafe and safe outputs. Precise attribution identifies which parameters can be pruned to reduce unsafe behavior while preserving utility. Traditional attribution methods, such as patch attribution [SRC23], gradient-based techniques [LAT19], and linear probing [LPVPW23,LBPWKM24], are effective but require multiple forward/backward passes, making them impractical for large models or resource-constrained environments. To address this, we adopt and extend Wanda [SLBK24], an unstructured pruning method originally designed to remove redundant parameters. We reinterpret Wanda as a parameter attribution tool, enabling efficient estimation of each parameter’s influence on safe and unsafe generation without backpropagation.

Let $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ be the weight matrix of a transformer linear layer (e.g., attention), and let $X \in \mathbb{R}^{L \times C_{\text{in}}}$ be the corresponding input activation matrix, where $L = A + B$, with $A$ denoting the number of prompt tokens and $B$ the number of response tokens. The original Wanda pruning score is defined as

$$S = |W| \cdot \sum_{i=1}^{L} \|x_i\|_2,$$

where $\|x_i\|_2$ is the L2 norm of the $i$-th row of $X$. Note that the original Wanda score treats prompt and response tokens equally. To isolate response-related activations, as unsafe behavior arises only during the generation phase, we introduce a binary mask $M \in \{0,1\}^{L}$ defined as $M_{:A} = 0$, $M_{A+1:A+B} = 1$. Our modified Wanda score therefore accounts only for the activation magnitude of the response tokens:

$$S' = |W| \cdot \sum_{i=1}^{L} M_i \|x_i\|_2 = |W| \cdot \sum_{i=A+1}^{A+B} \|x_i\|_2.$$

Thus, $S'$ completely excludes the prompt activations and reflects importance solely under the response context.
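A small worked example may help make the masked score concrete. The sketch below follows the response-only formula exactly as written in this section (a sum of L2 norms over the response-token rows of $X$, multiplied by $|W|$); the original Wanda paper instead takes the norm per input channel, so treat this as an illustration of the masking idea rather than a reference implementation. The function name and the toy matrices are assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def masked_wanda_score(weight: Matrix, activations: Matrix, num_prompt_tokens: int) -> Matrix:
    """Compute S' = |W| * sum of L2 norms of the response-token rows of X."""
    # Applying the mask M is equivalent to dropping the first A (prompt) rows.
    response_rows = activations[num_prompt_tokens:]
    activation_term = sum(
        math.sqrt(sum(value * value for value in row)) for row in response_rows
    )
    return [[abs(w) * activation_term for w in weight_row] for weight_row in weight]


# Toy layer with C_out = 2 and C_in = 2.
W = [[0.5, -2.0], [1.0, 0.0]]
# One prompt token (A = 1) followed by two response tokens (B = 2).
X = [[9.0, 9.0], [3.0, 4.0], [0.0, 2.0]]

scores = masked_wanda_score(W, X, num_prompt_tokens=1)
# Response norms are 5 and 2, so every |W| entry is scaled by 7.
```

The large prompt activation `[9.0, 9.0]` has no effect on the result, which is the point of the mask.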

### Stage 3: Component Scoring

Following attribution analysis, we compute a normalized importance score for each model component to guide unsafe-aware pruning decisions. The objective is to identify the top-$p$ percent of parameters within a component that contribute to unsafe outputs while minimizing the impact on safe generation. This contrastive mechanism is central to our framework, enabling suppression of unsafe behaviors without degrading benign capabilities. We begin by computing the masked attribution scores over the unsafe and safe subsets, denoted $S'_u$ and $S'_s$, respectively. The contrastive importance score $I$ is then defined as

$$I = \frac{S'_u}{S'_s + \epsilon},$$

where $\epsilon$ is a small constant for numerical stability. Since raw importance scores are not directly comparable across layers and components due to statistical variability, we apply z-score normalization within each component to obtain the normalized score $\hat{I}$. This normalization ensures consistent prioritization across components. Finally, we select the top-$p$ percent of parameters within each component and sum their values to compute the final importance score $\hat{I'}$. A higher $\hat{I'}$ indicates that parameters in the corresponding component are disproportionately active during unsafe generations relative to safe ones. These scores enable fine-grained pruning decisions that effectively target unsafe behavior while preserving overall model performance.
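The scoring steps in this stage (contrastive ratio, z-score normalization, top-$p$ sum) can be sketched as follows. This is a hedged illustration operating on flattened per-parameter scores for one component. The use of the population standard deviation, the rounding rule for the number of selected parameters, and the value of `eps` are assumptions not specified in the text.

```python
import statistics
from typing import List, Tuple


def component_score(
    s_unsafe: List[float],
    s_safe: List[float],
    top_percent: float,
    eps: float = 1e-8,
) -> Tuple[float, List[int]]:
    """Return the summed top-p z-scores and the indices of the selected parameters."""
    # Contrastive importance: I = S'_u / (S'_s + eps), per parameter.
    ratios = [u / (s + eps) for u, s in zip(s_unsafe, s_safe)]
    # Z-score normalization within the component (population std is an assumption).
    mean = statistics.fmean(ratios)
    std = statistics.pstdev(ratios)
    z_scores = [(r - mean) / std if std > 0 else 0.0 for r in ratios]
    # Select the top-p percent of parameters, keeping at least one.
    k = max(1, round(len(z_scores) * top_percent / 100))
    top = sorted(range(len(z_scores)), key=lambda i: z_scores[i], reverse=True)[:k]
    return sum(z_scores[i] for i in top), top


# Toy component with four parameters; the first is far more active on unsafe data.
score, selected = component_score(
    s_unsafe=[4.0, 1.0, 1.0, 1.0],
    s_safe=[1.0, 1.0, 1.0, 1.0],
    top_percent=25,
)
```

Here the first parameter is the only one selected, and its z-score becomes the component score used to rank this component against others.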

### Stage 4: Iterative Pruning

In the final stage, we adopt an iterative pruning strategy to refine model behavior incrementally, rather than pruning all components at once with a fixed ratio. This design is motivated by two observations: transformer components contribute unevenly to unsafe outputs, so indiscriminate pruning may harm utility; and some components exert a stronger influence on unsafe behavior and should therefore be pruned more aggressively. Iterative pruning thus allows adaptive and precise mitigation of unsafe behaviors. From the LTH perspective, this iterative search process can be viewed as progressively removing unsafe subnetworks until a safety-preserving subnetwork emerges. We explore two strategies:

Greedy Pruning. This efficient approach prunes the component with the highest normalized importance score at each iteration. Although computationally cheap, it may be suboptimal because it ignores nonlinear interdependencies among components.
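The greedy loop can be sketched as below. This is a simulation under stated assumptions rather than the authors' code: `score_fn` is a hypothetical callback that would re-run attribution on the partially pruned model, each step removes `prune_percent` of the chosen component's original size, the loop stops once cumulative sparsity exceeds a threshold (the stopping rule the text gives for beam search, assumed here for greedy as well), and the component names and `toy_score` are invented for the demo.

```python
from typing import Callable, Dict, List, Tuple


def greedy_prune(
    sizes: Dict[str, int],
    score_fn: Callable[[Dict[str, int]], Dict[str, float]],
    prune_percent: float,
    sparsity_threshold: float,
) -> Tuple[List[str], Dict[str, int]]:
    """Repeatedly prune the highest-scoring component until sparsity exceeds the threshold."""
    total = sum(sizes.values())
    pruned = {name: 0 for name in sizes}
    history: List[str] = []
    while sum(pruned.values()) / total <= sparsity_threshold:
        scores = score_fn(pruned)  # re-score after every pruning step
        best = max(scores, key=scores.get)
        step = min(int(sizes[best] * prune_percent / 100), sizes[best] - pruned[best])
        if step <= 0:
            break  # nothing left to prune in the top-ranked component
        pruned[best] += step
        history.append(best)
    return history, pruned


# Hypothetical components and base scores, for illustration only.
SIZES = {"o_proj": 100, "down_proj": 100, "q_proj": 100}
BASE = {"o_proj": 3.0, "down_proj": 2.0, "q_proj": 0.5}


def toy_score(pruned: Dict[str, int]) -> Dict[str, float]:
    # Toy assumption: a component's score drops as more of it is pruned.
    return {n: BASE[n] * (1 - 5 * pruned[n] / SIZES[n]) for n in SIZES}


history, pruned = greedy_prune(SIZES, toy_score, prune_percent=10, sparsity_threshold=0.08)
```

In this toy run the loop alternates between the two highest-scoring components because each pruning step lowers the score of the component just pruned.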

Beam Search Pruning. In contrast, beam search can more effectively explore pruning trajectories by evaluating multiple candidate sequences in each iteration. Beam search begins by computing initial importance scores and selecting the top-$b_1$ candidates to initialize the beam. Each candidate prunes $p\%$ of its parameters, yielding $b_1$ initial model variants. For each variant, we reconstruct the model state according to its pruning history, recompute component importance scores, and expand the search by generating up to $b_1 \times b_1$ new variants. Each candidate is then evaluated with a contrastive objective:

$$L = \mathrm{CELoss}_s - \mathrm{CELoss}_u,$$

where $\mathrm{CELoss}_s$ and $\mathrm{CELoss}_u$ are the cross-entropy losses on safe and unsafe validation sets. This objective favors a variant that maintains performance on safe data while degrading performance on unsafe data. The top-$b_2$ candidates with the highest $L$ scores are retained for the next beam, and the process terminates once the cumulative sparsity exceeds the pruning threshold $\rho$. This search-based pruning framework enables the discovery of effective trajectories that suppress unsafe behavior. While greedy pruning provides a fast heuristic solution, beam search allows deeper exploration of the pruning space, yielding safer and more robust models at the cost of increased computation.
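The expand-and-retain mechanics of the beam search can be sketched generically. In this minimal sketch a variant is represented only by its pruning history (a tuple of component names). `score_fn` and `objective_fn` are hypothetical callbacks standing in for the recomputed component importance scores and for the candidate-ranking objective evaluated on validation data. A fixed number of steps replaces the sparsity threshold $\rho$, which is a simplification that assumes every step prunes the same amount.

```python
from typing import Callable, Dict, List, Tuple

History = Tuple[str, ...]  # sequence of pruned component names


def beam_search(
    score_fn: Callable[[History], Dict[str, float]],
    objective_fn: Callable[[History], float],
    b1: int,
    b2: int,
    max_steps: int,
) -> List[History]:
    """Expand each beam entry into b1 children, then keep the best b2 by objective."""

    def expand(history: History) -> List[History]:
        scores = score_fn(history)  # importance scores for this variant
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [history + (name,) for name in ranked[:b1]]

    beam = expand(())  # top-b1 initial variants
    for _ in range(max_steps - 1):
        candidates = [child for history in beam for child in expand(history)]
        beam = sorted(candidates, key=objective_fn, reverse=True)[:b2]
    return beam


# Toy stand-ins, invented for the demo.
BASE_SCORES = {"a": 3.0, "b": 2.0, "c": 0.5}


def toy_scores(history: History) -> Dict[str, float]:
    # A component's score shrinks each time it has already been pruned.
    return {n: BASE_SCORES[n] / (1 + history.count(n)) for n in BASE_SCORES}


def toy_objective(history: History) -> float:
    # Prefers histories that spread pruning over distinct components.
    return float(len(set(history)))


final_beam = beam_search(toy_scores, toy_objective, b1=2, b2=2, max_steps=2)
```

With `b1 = 2`, the second step evaluates four candidate histories and retains the two that prune both `a` and `b`, whichever order they were pruned in.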

Table 1: Performance of different post-training methods on Mistral-7B. Baselines include DPO, CB, and Goal, while One-Pass, Greedy, and Beam represent our pruning framework.

Table 2: Performance of different post-training methods on Mistral-Nemo and Mistral-Small, including Unsafe Rate (Unsafe), Over-Refusal Rate (Over-Refusal), and Utility.

## Experiments

### Language Models

Models and Datasets. We evaluate our pruning framework on three instruction-tuned LMs from the HuggingFace Hub: “Mistral-7B-Instruct-v0.2,” “Mistral-Nemo-Instruct-2407” (12B), and “Mistral-Small-Instruct-2409” (23B). (For simplicity, we omit the “Instruct-version” suffix throughout the remainder of this paper.) We also include an 8-bit quantized variant of Mistral-7B. All models are evaluated with their default decoding configuration (e.g., sampling temperature of 1.0), and each output is sampled three times to reduce variance. To evaluate model safety, we use HarmBench [MPYZWMSLBLFH24], which aggregates AdvBench [ZWKF23] and the TDC 2023. We use JailbreakBench (JBB) [CDRACSDFPTHW24] for over-refusal, and MT-Bench [ZCSZWZLLLXZGS23] for model utility. Detailed evaluation metrics are provided in Appendix B.2 (https://arxiv.org/html/2604.15780#A2.SS2). We also report the dataset leakage study in Appendi

