SafeGene: Reusable Adapters for Transferable Safety Alignment

arXiv cs.AI Papers

Summary

SafeGene proposes a reusable safety-adapter module that decouples safety capabilities from task-specific updates, enabling efficient restoration of safety alignment in open-weight LLMs after downstream fine-tuning through few-shot recalibration.

arXiv:2606.06519v1 Announce Type: new Abstract: Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned--degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety--utility trade-off.
Original Article
View Cached Full Text

Cached at: 06/08/26, 09:13 AM

# SafeGene: Reusable Adapters for Transferable Safety Alignment
Source: [https://arxiv.org/html/2606.06519](https://arxiv.org/html/2606.06519)
Yanghan Wang1,Zhiqiang Kou2,Fu Feng1,Jing Wang1\*,Xin Geng1\* 1Southeast University,2The Hong Kong Polytechnic University \{yanghanwang,fufeng,wangjing91,xgeng\}@seu\.edu\.cn

###### Abstract

Open\-weight LLMs are increasingly fine\-tuned into customized assistants, but downstream fine\-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful\. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions\. We propose SafeGene, a reusable safety\-adapter module designed for cross\-task reuse within each architecture\-compatible model family\. Rather than treating safety recovery as a model\-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task\-specific updates\. This representation is obtained from aligned–degraded model discrepancies, refined into task\-transferable safety vectors through data\-aware layer selection, and expressed in each downstream task\-adapted model via few\-shot layer\-wise coefficient recalibration\. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene\-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety–utility trade\-off\.

SafeGene: Reusable Adapters for Transferable Safety Alignment

Yanghan Wang1, Zhiqiang Kou2, Fu Feng1, Jing Wang1\*, Xin Geng1\*1Southeast University,2The Hong Kong Polytechnic University\{yanghanwang,fufeng,wangjing91,xgeng\}@seu\.edu\.cn

11footnotetext:Corresponding authors\.## 1Introduction

![Refer to caption](https://arxiv.org/html/2606.06519v1/mpic.png)Figure 1:Motivation of SafeGene\. Continual fine\-tuning on newly collected data improves the utility of next\-generation models but can weaken safety alignment\. Within each architecture\-compatible model family, a SafeGene module is instantiated once from reusable safety vectors and reused across evolving downstream task\-adapted models through few\-shot recalibration, restoring safety while preserving task utility\.Large language models \(LLMs\) are increasingly customized for high\-value domain assistants, such as medical, legal, financial, and enterprise applications\. In many practical settings\(Huet al\.,[2022](https://arxiv.org/html/2606.06519#bib.bib23); Dettmerset al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib24)\), developers start from publicly released open\-weight models and fine\-tune them on private or task\-specific datasets to obtain customized target models\.

While such customization improves downstream utility, it also introduces a critical safety risk:*alignment\-acquired safety may not survive downstream adaptation*\. As shown in Figure[1](https://arxiv.org/html/2606.06519#S1.F1), task fine\-tuning can improve utility while weakening safety\. Prior studies\(Qiet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib1); Fraseret al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib2)\)show that even benign fine\-tuning may degrade safety alignment, making fine\-tuned models more likely to comply with malicious prompts\(Hsuet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib4); Huanget al\.,[2024a](https://arxiv.org/html/2606.06519#bib.bib10)\)\. This makes post\-fine\-tuning safety recovery a practical necessity\.

A straightforward solution\(Ouyanget al\.,[2022](https://arxiv.org/html/2606.06519#bib.bib25)\)is to re\-align each fine\-tuned model with additional safety data\. However, this model\-specific repair is costly and difficult to scale\. In fine\-tuning\-as\-a\-service or personalized deployment settings\(Huet al\.,[2022](https://arxiv.org/html/2606.06519#bib.bib23); Dettmerset al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib24)\), model providers and downstream developers often face continuously changing user needs and evolving data sources\. These changes, including newly released datasets or newly collected user interactions, may trigger another round of fine\-tuning and shift a customized model to a new distribution\.*Repeating safety alignment or repair for every updated checkpoint introduces extra computational and annotation overhead and may interfere with the task\-specific behavior learned during fine\-tuning*\(Farnet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib26); Panet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib27); Thakkaret al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib28)\)\. Therefore, the key challenge in practice is not merely how to repair a single unsafe model, but how to repeatedly and efficiently recover safety across changing downstream target models over time\.

Existing safety adaptation methods mitigate post\-fine\-tuning safety degradation at different intervention stages, including alignment\-stage defenses\(Huanget al\.,[2024c](https://arxiv.org/html/2606.06519#bib.bib7)\), fine\-tuning\-stage modifications\(Wanget al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib11); Huanget al\.,[2024b](https://arxiv.org/html/2606.06519#bib.bib9)\), and post\-hoc repair methods\(Huanget al\.,[2024a](https://arxiv.org/html/2606.06519#bib.bib10)\)\. However, many of them still rely on model\-specific projection\(Hsuet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib4)\), merging\(Djuheraet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib6)\), or safety\-aware re\-training procedures\(Liet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib5)\), making them less suited in practice to repeatedly updated target models where safety recovery should remain reusable, lightweight, and rapidly adapted\.

Motivated by this, we ask whether safety alignment can be extracted as a reusable module rather than re\-learned after every downstream update\. As shown in Figure[1](https://arxiv.org/html/2606.06519#S1.F1), within each architecture\-compatible model family, a SafeGene module is obtained once by distilling safety capability from the discrepancy between a safely aligned model and a safety\-degraded model, and is reused across evolving downstream tasks\. Adaptation then only requires lightweight few\-shot recalibration of selected safety components, without directly overwriting the target model’s task update\.

In this paper, we proposeSafeGene, a reusable safety\-adapter module for cross\-task safety transfer within an architecture\-compatible model family\. SafeGene is built on the principle of decoupling safety recovery from task adaptation: safety capability is represented as an independent adapter module that can be reused across downstream tasks while preserving task\-specific updates\. The SafeGene module is obtained by distilling safety\-relevant adapter updates from the discrepancy between a safely aligned teacher model and a safety\-degraded student model, and is refined through data\-aware layer selection to retain task\-transferable safety vectors\. For each downstream task\-adapted model, only one scalar coefficient per selected layer is recalibrated using few\-shot target\-domain data, allowing the reusable safety adapter to be lightly adjusted to the current target distribution without overwriting the task update\.

We evaluate SafeGene on five model families, including Qwen2\.5\-7B and Qwen2\.5\-1\.5B\(Qwenet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib12)\), Qwen3\-1\.7B\(Yanget al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib13)\), GLM\-4\-9B\(Glmet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib14)\), and Llama\-3\-8B\(Grattafioriet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib15)\), across four downstream tasks and three safety benchmarks\(Zhanget al\.,[2015](https://arxiv.org/html/2606.06519#bib.bib16); Wanget al\.,[2018](https://arxiv.org/html/2606.06519#bib.bib17); Clarket al\.,[2019](https://arxiv.org/html/2606.06519#bib.bib19); Wanget al\.,[2019](https://arxiv.org/html/2606.06519#bib.bib18); Jiet al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib20); Zouet al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib21); Huanget al\.,[2025a](https://arxiv.org/html/2606.06519#bib.bib22)\)\. Downstream fine\-tuning increases average ASR by 7\.70%, while SafeGene reduces average ASR by 11\.48% compared with the fine\-tuned target models\. At the same time, downstream accuracy remains nearly unchanged after applying SafeGene\. SafeGene also outperforms representative safe adaptation baselines, including SafeLoRA\(Hsuet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib4)\), SafeMERGE\(Djuheraet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib6)\), and SaLoRA\(Liet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib5)\), achieving the best safety–utility trade\-off on Qwen2\.5\-7B\.

Our contributions are summarized as follows:

- •We formulate post\-fine\-tuning safety recovery as a reusable safety\-transfer problem, targeting deployment settings where customized models may be repeatedly updated with new task data or user interaction data\.
- •We introduceSafeGene, a reusable safety adapter for within\-family cross\-task safety transfer, obtained from aligned–degraded model discrepancies and adapted to downstream task\-adapted models through lightweight coefficient recalibration\.
- •We show that SafeGene\-enhanced models effectively mitigate recurring safety degradation after downstream fine\-tuning, substantially reducing ASR while preserving downstream accuracy and outperforming representative safe adaptation methods\.

## 2Related Work

Safety degradation\.Recent studies have increasingly shown that safety alignment can be fragile under model customization\.Fraseret al\.\([2025](https://arxiv.org/html/2606.06519#bib.bib2)\)show that fine\-tuning can lower safety and destabilize safety evaluation\.Qiet al\.\([2024](https://arxiv.org/html/2606.06519#bib.bib1)\); Wanget al\.\([2024](https://arxiv.org/html/2606.06519#bib.bib11)\); Huanget al\.\([2024a](https://arxiv.org/html/2606.06519#bib.bib10)\)show that fine\-tuning aligned language models can compromise safety even without malicious intent, with both harmful and benign instruction\-tuning data weakening safety behavior\.Gonget al\.\([2025](https://arxiv.org/html/2606.06519#bib.bib3)\)suggest that LLMs may fail safety alignment under practical interventions such as fine\-tuning and model editing\. These works show that fine\-tuning for downstream tasks can introduce systematic safety risks to both models and users, motivating our study of safety recovery for customized target models\.

Safe adaptation\.Recent methods mitigate safety degradation at different stages of model adaptation: alignment\-stage defenses, such as Vaccine\(Huanget al\.,[2024c](https://arxiv.org/html/2606.06519#bib.bib7)\)and Booster\(Huanget al\.,[2025b](https://arxiv.org/html/2606.06519#bib.bib8)\), improve robustness before user fine\-tuning; fine\-tuning\-stage methods, such as BackdoorAlign\(Wanget al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib11)\), Lisa\(Huanget al\.,[2024b](https://arxiv.org/html/2606.06519#bib.bib9)\), and SaLoRA\(Liet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib5)\), modify the adaptation process itself; and post\-hoc methods, such as Antidote\(Huanget al\.,[2024a](https://arxiv.org/html/2606.06519#bib.bib10)\)and SafeMERGE\(Djuheraet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib6)\), repair already fine\-tuned checkpoints, while Safe LoRA\(Hsuet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib4)\)constrains downstream LoRA updates through projection\. These methods provide useful defenses, but projection\-based correction may affect task\-relevant components when safety and task directions are entangled, merging\-based repair often relies on compatible module structures and fixed correction strengths, and safety\-aware fine\-tuning is less flexible for already fine\-tuned or continuously updated target models\. This motivates a reusable, lightweight, and rapidly adaptable safety adapter for transferable safety alignment\.

## 3Method

### 3\.1Overview

We proposeSafeGene, a reusable safety\-adapter module for cross\-task safety transfer within an architecture\-compatible model family\. Figure[2](https://arxiv.org/html/2606.06519#S3.F2)provides an overview of SafeGene pipeline\. Reusable safety recovery requires safety capability to be distillable, selectable, and lightly adaptable\. SafeGene meets these requirements by distilling safety\-relevant adapter updates from aligned–degraded model discrepancies \(Section[3\.3](https://arxiv.org/html/2606.06519#S3.SS3)\), retaining task\-transferable safety vectors through data\-aware layer selection \(Section[3\.4](https://arxiv.org/html/2606.06519#S3.SS4)\), and recalibrating only layer\-wise scalar coefficients for each downstream task\-adapted model \(Section[3\.5](https://arxiv.org/html/2606.06519#S3.SS5)\)\. This enables safety recovery under new target distributions without overwriting task\-specific updates\.

### 3\.2Notations

LetMsafeM\_\{\\text\{safe\}\}denote a safely aligned teacher model, and letMbrokenM\_\{\\text\{broken\}\}denote a safety\-degraded student model with the same architecture, obtained by lightly degrading the safety behavior ofMsafeM\_\{\\text\{safe\}\}\. The student model is used only to expose the safety discrepancy from which reusable safety information is distilled, and is independent of downstream target models\. We use𝒟h\\mathcal\{D\}\_\{h\}and𝒟b\\mathcal\{D\}\_\{b\}to denote the source harmful and benign data used during safety distillation\. For downstream adaptation, letMtgtM\_\{\\text\{tgt\}\}denote a target model with a compatible architecture that has already been fine\-tuned on a task\-specific dataset\.

![Refer to caption](https://arxiv.org/html/2606.06519v1/lpic.png)Figure 2:Overview of SafeGene\. SafeGene decouples safety capability from task\-specific updates by representing safety as a reusable module\. The module is obtained by distilling safety\-relevant vectors from aligned–degraded discrepancies \(Section[3\.3](https://arxiv.org/html/2606.06519#S3.SS3)\), refined through data\-aware layer selection to retain task\-transferable components \(Section[3\.4](https://arxiv.org/html/2606.06519#S3.SS4)\), and expressed in downstream task\-adapted models via few\-shot coefficient recalibration \(Section[3\.5](https://arxiv.org/html/2606.06519#S3.SS5)\)\.
### 3\.3Safety Vector Distillation

The first stage distills safety behavior from the aligned teacher modelMsafeM\_\{\\mathrm\{safe\}\}into a parameter\-efficient adapter attached to the student modelMbrokenM\_\{\\mathrm\{broken\}\}\. We initialize the candidate layer set𝒯cand\\mathcal\{T\}\_\{\\mathrm\{cand\}\}as the full set of Transformer layers inMbrokenM\_\{\\mathrm\{broken\}\}, and insert a LoRA module\(Huet al\.,[2022](https://arxiv.org/html/2606.06519#bib.bib23)\)into each layer while keeping the backbone parameters frozen\. For each candidate layerℓ\\ell, the induced safety update is represented asΔ​𝐖safe\(ℓ\)=𝐁\(ℓ\)​𝐀\(ℓ\)\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}=\\mathbf\{B\}^\{\(\\ell\)\}\\mathbf\{A\}^\{\(\\ell\)\}, where𝐀\(ℓ\)\\mathbf\{A\}^\{\(\\ell\)\}and𝐁\(ℓ\)\\mathbf\{B\}^\{\(\\ell\)\}are the learned low\-rank LoRA matrices\. The collection of these layer\-wise updates forms the safety adapter and serves as the initial safety\-vector bank\.

The distillation objective balances safety imitation and benign utility preservation:

ℒdistill\\displaystyle\\mathcal\{L\}\_\{\\text\{distill\}\}=λce​ℒCEh\+λkl​ℒKLh\\displaystyle\\mathrel\{=\}\\lambda\_\{\\text\{ce\}\}\\mathcal\{L\}\_\{\\text\{CE\}\}^\{h\}\+\\lambda\_\{\\text\{kl\}\}\\mathcal\{L\}\_\{\\text\{KL\}\}^\{h\}\(1\)\+λrepr​ℒreprh\+λbenign​ℒCEb\\displaystyle\\mathrel\{\+\}\\lambda\_\{\\text\{repr\}\}\\mathcal\{L\}\_\{\\text\{repr\}\}^\{h\}\+\\lambda\_\{\\text\{benign\}\}\\mathcal\{L\}\_\{\\text\{CE\}\}^\{b\}
Here,ℒCEh\\mathcal\{L\}\_\{\\mathrm\{CE\}\}^\{h\}supervises the student model with safe responses on harmful inputs,ℒKLh\\mathcal\{L\}\_\{\\mathrm\{KL\}\}^\{h\}matches the output distribution ofMsafeM\_\{\\mathrm\{safe\}\}, andℒreprh\\mathcal\{L\}\_\{\\mathrm\{repr\}\}^\{h\}optionally aligns intermediate representations with the teacher\. The benign lossℒCEb\\mathcal\{L\}\_\{\\mathrm\{CE\}\}^\{b\}prevents the distilled safety adapter from unnecessarily harming general utility\. This stage outputs a compact but expressive dense safety\-vector bank\{Δ​𝐖safe\(ℓ\)\}ℓ∈𝒯cand\\\{\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}\\\}\_\{\\ell\\in\\mathcal\{T\}\_\{\\mathrm\{cand\}\}\}\.

By leveraging the safety discrepancy betweenMbrokenM\_\{\\mathrm\{broken\}\}andMsafeM\_\{\\mathrm\{safe\}\}, we distill and consolidate the safety capability ofMsafeM\_\{\\mathrm\{safe\}\}into a structured layer\-wise safety\-vector bank\. This bank characterizes how safety behavior is distributed across layers, from which transferable safety vectors are subsequently selected\.

### 3\.4Data\-aware Layer Selection

Although the dense safety\-vector bank extracted in the previous stage already consolidates the core safety information distilled fromMsafeM\_\{\\text\{safe\}\}, it may still contain redundant updates and distribution\-specific noise\. From a layer\-wise perspective, some layers contribute little to safety transfer, while others may even interfere with benign capabilities\(Wanget al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib29)\)\. We therefore further select a compact subset of candidate layers from the dense safety\-vector bank such that they are more sensitive to harmful inputs while minimizing potential interference with general benign behavior\.

To assess the transferability of each layer, we analyze the layer\-wise activation statistics ofMbrokenM\_\{\\mathrm\{broken\}\}equipped with the full safety\-vector bank on both benign and harmful data\. For an inputxx, let𝐡t\(ℓ\)​\(x\)\\mathbf\{h\}\_\{t\}^\{\(\\ell\)\}\(x\)denote the final token representation at positionttafter theℓ\\ell\-th Transformer block\. We define the average activation magnitude of layerℓ\\ellas

a\(ℓ\)​\(x\)=1Tx​∑t=1Tx‖𝐡t\(ℓ\)​\(x\)‖2a^\{\(\\ell\)\}\(x\)=\\frac\{1\}\{T\_\{x\}\}\\sum\_\{t=1\}^\{T\_\{x\}\}\\left\\\|\\mathbf\{h\}\_\{t\}^\{\(\\ell\)\}\(x\)\\right\\\|\_\{2\}\(2\)
Letμh\(ℓ\)\\mu\_\{\\mathrm\{h\}\}^\{\(\\ell\)\}andμb\(ℓ\)\\mu\_\{\\mathrm\{b\}\}^\{\(\\ell\)\}denote the average activation magnitudes of layerℓ\\ellon harmful and benign data, respectively:

\{μh\(ℓ\)=𝔼x∼𝒟h​\[a\(ℓ\)​\(x\)\],μb\(ℓ\)=𝔼x∼𝒟b​\[a\(ℓ\)​\(x\)\]\.\\begin\{cases\}\\mu\_\{\\mathrm\{h\}\}^\{\(\\ell\)\}=\\mathbb\{E\}\_\{x\\sim\\mathcal\{D\}\_\{\\mathrm\{h\}\}\}\\left\[a^\{\(\\ell\)\}\(x\)\\right\],\\\\\[2\.0pt\] \\mu\_\{\\mathrm\{b\}\}^\{\(\\ell\)\}=\\mathbb\{E\}\_\{x\\sim\\mathcal\{D\}\_\{\\mathrm\{b\}\}\}\\left\[a^\{\(\\ell\)\}\(x\)\\right\]\.\\end\{cases\}\(3\)
We compute the harmful–benign activation contrast:

c\(ℓ\)=\[μh\(ℓ\)−γ​μb\(ℓ\)\]\+c^\{\(\\ell\)\}=\\left\[\\mu\_\{\\mathrm\{h\}\}^\{\(\\ell\)\}\-\\gamma\\mu\_\{\\mathrm\{b\}\}^\{\(\\ell\)\}\\right\]\_\{\+\}\(4\)
where\[z\]\+=max⁡\(z,0\)\[z\]\_\{\+\}=\\max\(z,0\), andγ\\gammacontrols the penalty on benign activation\. The contrast score favors layers that respond more strongly to harmful inputs while remaining less active on benign inputs\. Layers with non\-positive contrast are discarded by the truncation, since they do not provide clear evidence of harmful\-input selectivity\.

We rank candidate layers by their activation contrastc\(ℓ\)c^\{\(\\ell\)\}\. Intuitively, safety vectors attached to high\-contrast layers are more likely to affect safety\-relevant behavior while causing less interference with benign task behavior\. We therefore select the top\-scoring layers as

𝒯=TopKℓ∈𝒯cand​\(c\(ℓ\)\),\\mathcal\{T\}=\\mathrm\{TopK\}\_\{\\ell\\in\\mathcal\{T\}\_\{\\mathrm\{cand\}\}\}\\left\(c^\{\(\\ell\)\}\\right\),\(5\)and retain only the corresponding safety vectors:

\{Δ​𝐖safe\(ℓ\)\}ℓ∈𝒯\.\\left\\\{\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}\\right\\\}\_\{\\ell\\in\\mathcal\{T\}\}\.\(6\)The selected vectors constitute the final SafeGene module\.

This step forms SafeGene, a sparse and transferable safety\-adapter module\.

### 3\.5Few\-shot Safety Transfer

After the safety vectors have been extracted and selected, we transfer SafeGene to a downstream target modelMtgtM\_\{\\text\{tgt\}\}\. The target model is not involved in safety distillation; it only receives the selected safety vectors as a reusable safety adapter during transfer\. This separation is central to SafeGene: the safety capability distilled from the teacher model is packaged as a transferable adapter and then attached to the downstream target model, thereby enhancing safety while minimizing interference with downstream task performance\.

For each selected layerℓ∈𝒯\\ell\\in\\mathcal\{T\}, we apply the transferred safety update asΔ​Wtgt\(ℓ\)=αℓ​Δ​𝐖safe\(ℓ\)\\Delta W\_\{\\text\{tgt\}\}^\{\(\\ell\)\}=\\alpha\_\{\\ell\}\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}, whereΔ​𝐖safe\(ℓ\)\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}is frozen and only the scalar coefficientαℓ\\alpha\_\{\\ell\}is trainable\. We initializeαℓ=1\\alpha\_\{\\ell\}=1, allowing the target model to start from direct reuse of the selected safety vector and then recalibrate its strength using few\-shot target\-domain safety data\.

This transfer interface is highly parameter\-efficient\. For rank\-rrLoRA, adapting one layer would normally requirer​\(din\+dout\)r\(d\_\{\\text\{in\}\}\+d\_\{\\text\{out\}\}\)trainable parameters, whereas SafeGene only learns one scalar coefficientαℓ\\alpha\_\{\\ell\}per selected layer to recalibrate the reusable safety adapter\.

The few\-shot adaptation objective is

ℒfew\\displaystyle\\mathcal\{L\}\_\{\\text\{few\}\}=λce​ℒCE\+λanchor​∑ℓ∈𝒯\(αℓ−1\)2\\displaystyle\\mathrel\{=\}\\lambda\_\{\\text\{ce\}\}\\mathcal\{L\}\_\{\\text\{CE\}\}\+\\lambda\_\{\\text\{anchor\}\}\\sum\_\{\\ell\\in\\mathcal\{T\}\}\(\\alpha\_\{\\ell\}\-1\)^\{2\}\(7\)\+λsparse​∑ℓ∈𝒯\|αℓ\|\+λbenign​ℒbenign\.\\displaystyle\\mathrel\{\+\}\\lambda\_\{\\text\{sparse\}\}\\sum\_\{\\ell\\in\\mathcal\{T\}\}\|\\alpha\_\{\\ell\}\|\+\\lambda\_\{\\text\{benign\}\}\\mathcal\{L\}\_\{\\text\{benign\}\}\.
The CE and benign terms guide safety learning on harmful examples while preserving benign behavior\. The anchor and sparsity regularizers keep useful safety vectors near their original strength and suppress unnecessary ones, yielding a retain\-or\-suppress trade\-off\.

After adaptation, the final safety module is

LsafetySG=\{αℓ​Δ​𝐖safe\(ℓ\)\}ℓ∈𝒯L\_\{\\text\{safety\}\}^\{\\text\{SG\}\}=\\left\\\{\\alpha\_\{\\ell\}\\Delta\\mathbf\{W\}\_\{\\mathrm\{safe\}\}^\{\(\\ell\)\}\\right\\\}\_\{\\ell\\in\\mathcal\{T\}\}\(8\)
At inference, the final model is composed as

Mfinal=Mtgt⊕LsafetySGM\_\{\\text\{final\}\}=M\_\{\\text\{tgt\}\}\\oplus L\_\{\\text\{safety\}\}^\{\\text\{SG\}\}\(9\)
where⊕\\oplusdenotes adapter composition through additive weight updates\. If the downstream target model is represented as a task adapterLtaskL\_\{\\text\{task\}\}on top of a backbone model, the composition can equivalently be written as

Mfinal=Mbackbone⊕Ltask⊕LsafetySGM\_\{\\text\{final\}\}=M\_\{\\text\{backbone\}\}\\oplus L\_\{\\text\{task\}\}\\oplus L\_\{\\text\{safety\}\}^\{\\text\{SG\}\}\(10\)
By attaching SafeGene to the downstream target model, we transfer the safety capability distilled from the teacher model as a reusable safety adapter while preserving the target model’s task performance as much as possible\. Since only a small number of layer\-wise coefficients are learned during transfer, the adaptation remains lightweight and parameter\-efficient\.

## 4Experiments

### 4\.1Experimental Setup

Models\.We evaluate SafeGene on five instruction\-tuned LLMs with different model families and scales: Qwen2\.5\-7B and Qwen2\.5\-1\.5B\(Qwenet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib12)\), Qwen3\-1\.7B\(Yanget al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib13)\), GLM\-4\-9B\(Glmet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib14)\), and Llama\-3\-8B\(Grattafioriet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib15)\)\. For each model, we compare three states: the original aligned model \(Base\), its downstream fine\-tuned version \(Fine\-tuned\), and the SafeGene\-enhanced model \(SG\), which incorporates SafeGene into the fine\-tuned model and learns only the layer\-wise coefficients\.

Downstream tasks\.We consider four representative classification and reasoning tasks: AG News\(Zhanget al\.,[2015](https://arxiv.org/html/2606.06519#bib.bib16)\), SST2, MNLI, and BoolQ\(Clarket al\.,[2019](https://arxiv.org/html/2606.06519#bib.bib19)\)\. SST2 and MNLI are from GLUE\(Wanget al\.,[2018](https://arxiv.org/html/2606.06519#bib.bib17)\), while BoolQ is from SuperGLUE\(Wanget al\.,[2019](https://arxiv.org/html/2606.06519#bib.bib18)\)\. For each model family, SafeGene is extracted once from the aligned–degraded model pair and reused across four downstream fine\-tuned checkpoints; onlyα\\alphais recalibrated per target\. We use 9,000 training examples for each downstream fine\-tuning run and report accuracy as the utility metric\.

Safety evaluation\.We evaluate safety on three harmful instruction benchmarks: BeaverTails\(Jiet al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib20)\), AdvBench\(Zouet al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib21)\), and DirectRefusal\(Huanget al\.,[2025a](https://arxiv.org/html/2606.06519#bib.bib22)\)\. For each benchmark, we generate model responses and compute attack success rate \(ASR\), where a lower ASR indicates stronger safety\. Unless otherwise specified, we use Beaver\-dam\-7B, a BeaverTails\-trained QA moderation model\(Jiet al\.,[2023](https://arxiv.org/html/2606.06519#bib.bib20)\), as the main safety judge\. To assess evaluator robustness, we additionally use Qwen3\.5\-Flash as an API\-based judge to evaluate responses generated by Qwen2\.5\-7B, with 200 sampled examples from each safety benchmark\. For this auxiliary evaluation, we report the number of successful attacks\.

ModelTaskAccuracyASRBase\{\}\_\{\\text\{Base\}\}↓\\downarrowASRFine\-tuned\{\}\_\{\\text\{Fine\-tuned\}\}↓\\downarrowASRSG\{\}\_\{\\text\{SG\}\}↓\\downarrowΔ​ASR¯\\Delta\\overline\{\\text\{ASR\}\}FT\\columncolorsggraySGΔ\\DeltaBTAdvDRBTAdvDR\\columncolorsggrayBT\\columncolorsggrayAdv\\columncolorsggrayDRQwen2\.5\-7BAG News90\.75\\columncolorsggray90\.91↑\\uparrow0\.1638\.0824\.2332\.7650\.6133\.2745\.39\\columncolorsggray34\.28\\columncolorsggray30\.77\\columncolorsggray32\.63↓\\downarrow10\.53SST295\.87\\columncolorsggray96\.56↑\\uparrow0\.6956\.4338\.2753\.03\\columncolorsggray34\.85\\columncolorsggray32\.88\\columncolorsggray37\.24↓\\downarrow14\.25MNLI88\.52\\columncolorsggray88\.61↑\\uparrow0\.0943\.3927\.1238\.03\\columncolorsggray35\.03\\columncolorsggray27\.69\\columncolorsggray30\.66↓\\downarrow5\.05BoolQ87\.98\\columncolorsggray87\.86↓\\downarrow0\.1247\.8935\.9644\.87\\columncolorsggray34\.16\\columncolorsggray27\.69\\columncolorsggray31\.84↓\\downarrow11\.68Qwen2\.5\-1\.5BAG News89\.42\\columncolorsggray88\.91↓\\downarrow0\.5145\.8236\.1540\.9268\.4948\.8562\.76\\columncolorsggray38\.72\\columncolorsggray29\.42\\columncolorsggray33\.95↓\\downarrow26\.00SST295\.53\\columncolorsggray95\.30↓\\downarrow0\.2355\.4543\.6552\.63\\columncolorsggray32\.26\\columncolorsggray21\.35\\columncolorsggray29\.61↓\\downarrow22\.84MNLI84\.23\\columncolorsggray84\.45↑\\uparrow0\.2251\.9341\.5448\.82\\columncolorsggray32\.49\\columncolorsggray27\.31\\columncolorsggray30\.00↓\\downarrow17\.50BoolQ80\.28\\columncolorsggray79\.63↓\\downarrow0\.6552\.2839\.8145\.66\\columncolorsggray31\.62\\columncolorsggray19\.04\\columncolorsggray27\.63↓\\downarrow19\.82Qwen3\-1\.7BAG News88\.79\\columncolorsggray87\.97↓\\downarrow0\.8233\.9921\.5430\.1348\.2430\.0045\.26\\columncolorsggray37\.10\\columncolorsggray27\.69\\columncolorsggray31\.71↓\\downarrow9\.00SST295\.53\\columncolorsggray95\.07↓\\downarrow0\.4661\.5137\.1261\.18\\columncolorsggray46\.16\\columncolorsggray28\.08\\columncolorsggray42\.24↓\\downarrow14\.44MNLI84\.30\\columncolorsggray83\.90↓\\downarrow0\.4041\.8926\.1536\.97\\columncolorsggray36\.12\\columncolorsggray24\.23\\columncolorsggray31\.84↓\\downarrow4\.27BoolQ82\.54\\columncolorsggray82\.94↑\\uparrow0\.4044\.2627\.8841\.97\\columncolorsggray34\.28\\columncolorsggray20\.96\\columncolorsggray30\.53↓\\downarrow9\.45GLM\-4\-9BAG News89\.76\\columncolorsggray95\.68↑\\uparrow5\.9220\.7728\.8519\.7418\.6423\.4618\.42\\columncolorsggray9\.35\\columncolorsggray8\.08\\columncolorsggray9\.61↓\\downarrow11\.16SST295\.53\\columncolorsggray94\.61↓\\downarrow0\.9224\.9330\.3826\.84\\columncolorsggray14\.43\\columncolorsggray16\.92\\columncolorsggray16\.18↓\\downarrow11\.54MNLI88\.74\\columncolorsggray89\.47↑\\uparrow0\.7324\.6423\.8524\.34\\columncolorsggray10\.85\\columncolorsggray7\.12\\columncolorsggray10\.92↓\\downarrow14\.65BoolQ99\.63\\columncolorsggray99\.36↓\\downarrow0\.2718\.2332\.1218\.03\\columncolorsggray13\.21\\columncolorsggray12\.88\\columncolorsggray12\.63↓\\downarrow9\.89Llama\-3\-8BAG News91\.39\\columncolorsggray90\.18↓\\downarrow1\.2140\.7433\.4635\.9241\.4934\.6238\.68\\columncolorsggray38\.32\\columncolorsggray30\.00\\columncolorsggray35\.53↓\\downarrow3\.65SST295\.87\\columncolorsggray95\.76↓\\downarrow0\.1149\.4535\.7745\.79\\columncolorsggray47\.89\\columncolorsggray31\.73\\columncolorsggray41\.45↓\\downarrow3\.31MNLI88\.04\\columncolorsggray87\.67↓\\downarrow0\.3746\.6245\.1942\.50\\columncolorsggray42\.12\\columncolorsggray33\.85\\columncolorsggray35\.13↓\\downarrow7\.74BoolQ88\.53\\columncolorsggray87\.52↓\\downarrow1\.0135\.6033\.4633\.16\\columncolorsggray35\.26\\columncolorsggray28\.46\\columncolorsggray30\.13↓\\downarrow2\.79Average over four downstream tasksQwen2\.5\-7BAvg\.90\.78\\columncolorsggray90\.99↑\\uparrow0\.2138\.0824\.2332\.7649\.5833\.6645\.33\\columncolorsggray34\.58\\columncolorsggray29\.76\\columncolorsggray33\.09↓\\downarrow10\.38Qwen2\.5\-1\.5BAvg\.87\.37\\columncolorsggray87\.07↓\\downarrow0\.2945\.8236\.1540\.9257\.0443\.4652\.47\\columncolorsggray33\.77\\columncolorsggray24\.28\\columncolorsggray30\.30↓\\downarrow21\.54Qwen3\-1\.7BAvg\.87\.79\\columncolorsggray87\.47↓\\downarrow0\.3233\.9921\.5430\.1348\.9830\.2946\.35\\columncolorsggray38\.42\\columncolorsggray25\.24\\columncolorsggray34\.08↓\\downarrow9\.29GLM\-4\-9BAvg\.93\.42\\columncolorsggray94\.78↑\\uparrow1\.3720\.7728\.8519\.7421\.6127\.4521\.91\\columncolorsggray11\.96\\columncolorsggray11\.25\\columncolorsggray12\.34↓\\downarrow11\.81Llama\-3\-8BAvg\.90\.96\\columncolorsggray90\.28↓\\downarrow0\.6840\.7433\.4635\.9243\.2937\.2640\.03\\columncolorsggray40\.90\\columncolorsggray31\.01\\columncolorsggray35\.56↓\\downarrow4\.37OverallAvg\.90\.06\\columncolorsggray90\.12↑\\uparrow0\.0635\.8828\.8531\.8944\.1034\.4241\.22\\columncolorsggray31\.93\\columncolorsggray24\.31\\columncolorsggray29\.07↓\\downarrow11\.48

Table 1:Per\-task results across five model families\. BT, Adv, and DR denote BeaverTails, AdvBench, and DirectRefusal, respectively\.Δ\\Deltadenotes AccSG\{\}\_\{\\text\{SG\}\}−\-AccFine\-tuned\{\}\_\{\\text\{Fine\-tuned\}\}, andΔ​ASR¯\\Delta\\overline\{\\text\{ASR\}\}denotes ASRSG\{\}\_\{\\text\{SG\}\}−\-ASRFine\-tuned\{\}\_\{\\text\{Fine\-tuned\}\}averaged over the three safety benchmarks\. ASRBase\{\}\_\{\\text\{Base\}\}is task\-independent because it is measured before downstream fine\-tuning\. The bottom block reports averages over four downstream tasks and the overall average across model families\.
### 4\.2Main Results: Safety Transfer after Downstream Fine\-tuning

Table[1](https://arxiv.org/html/2606.06519#S4.T1)reports the per\-task results across five model families, four downstream tasks, and three safety benchmarks\. For each model and task, we compare the fine\-tuned model with the SafeGene\-enhanced model in terms of downstream accuracy and attack success rate \(ASR\)\. ASR is reported separately on BeaverTails, AdvBench, and DirectRefusal using Beaver\-dam\-7B as the judge, whileΔ​ASR¯\\Delta\\overline\{\\text\{ASR\}\}summarizes the average ASR change over the three safety benchmarks\. The bottom block reports four\-task averages for each model family and the overall average across all five model families\.

The results show a clear safety–utility pattern\. Across five model families, downstream fine\-tuning often increases ASR compared with the original aligned models, indicating that task adaptation can weaken safety behavior\. On average, the fine\-tuned target models achieve 90\.06% downstream accuracy, while their average ASR increases from 32\.21% to 39\.91%\. After applying SafeGene, average ASR drops to 28\.44%, corresponding to an absolute reduction of 11\.48 points relative to the fine\-tuned target models\. Notably, this safety improvement does not come at the cost of downstream utility: SafeGene achieves an average accuracy of 90\.12%, which is nearly identical to the 90\.06% of the fine\-tuned target models\.

The per\-task results further show that the safety improvement is not merely an artifact of averaging\. SafeGene reducesΔ​ASR¯\\Delta\\overline\{\\text\{ASR\}\}for every model family and on almost all downstream tasks, while keeping downstream accuracy largely unchanged\. For example, on Qwen2\.5\-7B, SafeGene reduces average ASR on AG News, SST2, MNLI, and BoolQ by 10\.53, 14\.25, 5\.05, and 11\.68 points, respectively, with only minor accuracy changes\. Similar trends are observed on Qwen2\.5\-1\.5B, Qwen3\-1\.7B, GLM\-4\-9B, and Llama\-3\-8B\. These results demonstrate that SafeGene consistently restores safety across model families and downstream tasks through lightweight coefficient recalibration, while preserving task\-specific performance\.

We further evaluate Qwen2\.5\-7B using an API\-based safety judge to examine whether the observed trend depends on the choice of evaluator\. For this auxiliary evaluation, we sample 200 examples from each of three safety benchmarks for each downstream task, resulting in 2,400 evaluations in total across four tasks\. As shown in Figure[3](https://arxiv.org/html/2606.06519#S4.F3), downstream fine\-tuning increases the number of successful attacks from 56 to 80, whereas SafeGene reduces it to 11 in this setting\. This trend is largely consistent with the Beaver\-dam\-7B evaluation, further confirming that SafeGene improves safety after downstream fine\-tuning\.

![Refer to caption](https://arxiv.org/html/2606.06519v1/api_judge_attack_counts.png)Figure 3:API\-based safety evaluation on Qwen2\.5\-7B using Qwen3\.5\-Flash as the judge\. The figure reports the number of successful attacks on 200 sampled examples from each safety benchmark, aggregated over four downstream tasks\. SafeGene substantially reduces successful attacks compared with the fine\-tuned model\.
### 4\.3Comparison with Existing Safe Adaptation Methods

We compare SafeGene with three representative safe adaptation methods on Qwen2\.5\-7B: SafeLoRA\(Hsuet al\.,[2024](https://arxiv.org/html/2606.06519#bib.bib4)\), SafeMERGE\(Djuheraet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib6)\), and SaLoRA\(Liet al\.,[2025](https://arxiv.org/html/2606.06519#bib.bib5)\)\. For fairness, all methods use the same fine\-tuned checkpoints, generation settings, safety benchmarks, and judge, and all baselines use their default hyperparameters\. We report downstream accuracy, BeaverTails ASR, DirectRefusal ASR, and their average ASR, all judged by Beaver\-dam\-7B\.

TaskMethodAcc↑\\uparrowBT↓\\downarrowDR↓\\downarrowASR¯\\overline\{\\text\{ASR\}\}↓\\downarrowAG NewsFine\-tuned90\.7550\.6145\.3948\.00SafeLoRA90\.1844\.4937\.2440\.87SafeMERGE89\.5141\.3736\.5838\.97SaLoRA90\.5039\.0734\.2136\.64SafeGene \(Ours\)90\.9134\.2832\.6333\.45SST2Fine\-tuned95\.8756\.4353\.0354\.73SafeLoRA96\.6742\.4139\.8741\.14SafeMERGE95\.7641\.0335\.0038\.02SaLoRA96\.4438\.7835\.1336\.95SafeGene \(Ours\)96\.5634\.8537\.2436\.05MNLIFine\-tuned88\.5243\.3938\.0340\.71SafeLoRA88\.5041\.7840\.0040\.89SafeMERGE88\.9740\.6233\.1636\.89SaLoRA89\.6140\.6835\.7938\.23SafeGene \(Ours\)88\.6135\.0330\.6632\.84BoolQFine\-tuned87\.9847\.8944\.8746\.38SafeLoRA86\.9742\.7036\.5839\.64SafeMERGE89\.1039\.6434\.6137\.12SaLoRA86\.9038\.3232\.7635\.54SafeGene \(Ours\)87\.8634\.1631\.8433\.00Avg\.Fine\-tuned90\.7849\.5845\.3347\.46SafeLoRA90\.5842\.8538\.4240\.63SafeMERGE90\.8440\.6734\.8437\.75SaLoRA90\.8639\.2134\.4736\.84SafeGene \(Ours\)90\.9934\.5833\.0933\.84Table 2:Comparison with existing safe adaptation methods on Qwen2\.5\-7B\. BT and DR denote BeaverTails and DirectRefusal, respectively\.ASR¯\\overline\{\\text\{ASR\}\}is the average ASR over BT and DR\. Best results are underlined, and our method is shown in bold\.Table[2](https://arxiv.org/html/2606.06519#S4.T2)shows that all safety adaptation methods reduce ASR compared with the fine\-tuned model, but SafeGene achieves the strongest safety recovery\. On average, SafeLoRA, SafeMERGE, and SaLoRA reduce the two\-benchmark average ASR to 40\.63%, 37\.75%, and 36\.84%, respectively, while SafeGene further reduces it to 33\.84%\. The improvement is consistent across all four downstream tasks, where SafeGene obtains the lowest average ASR in every task group\.

The advantage of SafeGene comes from its explicit decoupling of safety capability from task\-specific updates\. Rather than constraining, merging, or re\-training the task update, SafeGene represents safety as a reusable adapter whose expression strength can be recalibrated while preserving task behavior\. This design benefits safety recovery in three complementary ways: aligned–degraded model discrepancies expose clearer safety\-relevant updates; data\-aware layer selection removes noisy or weakly transferable vectors before reuse; and few\-shot coefficient recalibration adjusts the strength of selected safety vectors for each downstream distribution, reducing both under\-correction and over\-correction\. In contrast, SafeLoRA may compress task\-relevant update space through projection, SafeMERGE relies on fixed merging strengths that are not explicitly recalibrated for the current task, and SaLoRA is less flexible for already fine\-tuned or continuously updated target models\. These differences help explain why SafeGene\-enhanced models achieve stronger safety recovery while preserving downstream utility across the results in Table[2](https://arxiv.org/html/2606.06519#S4.T2)\.

### 4\.4Ablation Study: Layer Selection and Few\-shot Safety Transfer

We conduct two ablation studies on Qwen2\.5\-7B to examine two key design choices in SafeGene: data\-aware layer selection and few\-shot safety transfer\. For efficiency, all safety results in this subsection are evaluated on BeaverTails\. We report averaged results over four downstream tasks for layer selection and per\-task results for few\-shot transfer\.

#### Effect of layer selection\.

We vary the retained ratio of distilled safety vectors from 1\.0 to 0\.5 and compare data\-aware selection with random selection\. Figure[4](https://arxiv.org/html/2606.06519#S4.F4)shows a non\-monotonic trend: a ratio of 0\.7 improves accuracy and lowers ASR, while further pruning hurts safety\. At this ratio, data\-aware selection outperforms random selection with lower ASR and comparable accuracy\. These results show that safety transfer benefits from a compact, data\-aware safety\-vector set, as data\-aware selection can filter noisy vectors and reduce interference with downstream tasks\.

#### Effect of few\-shot safety transfer\.

We next examine whether few\-shot safety transfer is necessary after layer selection\. We compare the full SafeGene pipeline at ratio 0\.7 against the results obtained without few\-shot safety transfer, where the same selected layers are used but all layer\-wise coefficients are fixed to 1\. Table[3](https://arxiv.org/html/2606.06519#S4.T3)shows that the full method achieves higher downstream accuracy on all four tasks and a lower average ASR overall\. Few\-shot safety transfer better adapts the distilled safety vectors to each downstream model, allowing SafeGene to more flexibly accommodate different dataset distributions and further improve the overall safety–utility trade\-off\.

![Refer to caption](https://arxiv.org/html/2606.06519v1/ablation_ratio_r.png)Figure 4:Ablation study of data\-aware layer selection on Qwen2\.5\-7B\. Top: effect of layer selection ratio on average accuracy and ASR\. Bottom: comparison with random selection under the same ratio\. Safety is evaluated on BeaverTails using Beaver\-dam\-7B\.TaskNo FSTSafeGene†Acc↑\\uparrowASR↓\\downarrowAcc↑\\uparrowASR↓\\downarrowAG News90\.3635\.2690\.91\(↑\\uparrow0\.55\)34\.28\(↓\\downarrow0\.98\)SST295\.6436\.3596\.56\(↑\\uparrow0\.92\)34\.85\(↓\\downarrow1\.50\)MNLI88\.5536\.4788\.61\(↑\\uparrow0\.06\)35\.03\(↓\\downarrow1\.44\)BoolQ87\.2235\.7287\.86\(↑\\uparrow0\.64\)34\.16\(↓\\downarrow1\.56\)Table 3:Ablation on few\-shot safety transfer\. No FST fixes all selected layer\-wise coefficients to 1, while SafeGene†performs coefficient recalibration\. Values in parentheses denote changes relative to No FST\. ASR is evaluated on BeaverTails using Beaver\-dam\-7B\.

## 5Conclusion

In this paper, we studied post\-fine\-tuning safety recovery in settings where customized LLMs may be repeatedly updated with new task data or user interactions\. We proposedSafeGene, a reusable safety\-adapter module for within\-family cross\-task safety transfer\. SafeGene decouples safety capability from task\-specific updates by distilling safety\-relevant updates from aligned–degraded model discrepancies, selecting transferable safety vectors, and adapting their expression strength through lightweight coefficient recalibration\. This design preserves the target model’s task\-specific update while introducing safety as a reusable adapter\. Experiments across multiple model families, downstream tasks, and safety benchmarks show that SafeGene\-enhanced models reduce attack success rate while maintaining downstream accuracy, and achieve a favorable safety–utility trade\-off compared with representative safe adaptation baselines\. These results suggest that reusable safety adapters are a practical direction for maintaining alignment in continually customized LLM deployment\.

## Limitations

SafeGene has several limitations\. First, the extracted safety vectors are layer\-wise adapter updates, so transfer currently assumes compatible model architectures and adapter insertion points\. Extending SafeGene to more heterogeneous source and target models remains future work\. Second, although we evaluate multiple model families, downstream tasks, and safety benchmarks, our experiments focus on text\-only single\-turn safety evaluation; multi\-turn jailbreaks, multilingual attacks, tool\-use settings, and domain\-specific deployment tasks require further study\. Third, SafeGene still uses a small amount of target\-domain safety data for coefficient recalibration, whose coverage may affect performance under substantially shifted attack distributions\. Future work can investigate more automatic layer selection and calibration strategies to further reduce this requirement\.

## Ethical Considerations

This work aims to improve the safety of open\-weight language models after downstream fine\-tuning\. Although SafeGene is designed for defensive safety recovery, research involving harmful instructions may still introduce dual\-use risks; therefore, we use established safety benchmarks and report aggregate ASR results rather than harmful generations or attack recipes\. Our experiments rely on public benchmark data and do not collect private user data, but practical deployments should ensure that any target\-domain safety data used for recalibration follows appropriate privacy and consent requirements\. We also note that automatic safety judges can misclassify refusals, benign sensitive discussions, or harmful outputs, so SafeGene should not be treated as a complete safety guarantee\. In high\-risk domains, it should be combined with human evaluation, red\-teaming, monitoring, and domain\-specific safeguards, while further checking that improved safety does not lead to excessive over\-refusal\.

## References

- Boolq: exploring the surprising difficulty of natural yes/no questions\.InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 \(long and short papers\),pp\. 2924–2936\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p2.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)Qlora: efficient finetuning of quantized llms\.Advances in neural information processing systems36,pp\. 10088–10115\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p1.1),[§1](https://arxiv.org/html/2606.06519#S1.p3.1)\.
- A\. Djuhera, S\. R\. Kadhe, F\. Ahmed, S\. Zawad, and H\. Boche \(2025\)SafeMERGE: preserving safety alignment in fine\-tuned large language models via selective layer\-wise model merging\.arXiv preprint arXiv:2503\.17239\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1),[§4\.3](https://arxiv.org/html/2606.06519#S4.SS3.p1.1)\.
- H\. Farn, H\. Su, S\. H\. Kumar, S\. Sahay, S\. Chen, and H\. Lee \(2024\)Safeguard fine\-tuned llms through pre\-and post\-tuning model merging\.arXiv preprint arXiv:2412\.19512\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p3.1)\.
- K\. C\. Fraser, H\. Dawkins, I\. Nejadgholi, and S\. Kiritchenko \(2025\)Fine\-tuning lowers safety and disrupts evaluation consistency\.InProceedings of the The First Workshop on LLM Security \(LLMSEC\),pp\. 129–141\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p2.1),[§2](https://arxiv.org/html/2606.06519#S2.p1.1)\.
- T\. Glm, A\. Zeng, B\. Xu, B\. Wang, C\. Zhang, D\. Yin, D\. Zhang, D\. Rojas, G\. Feng, H\. Zhao,et al\.\(2024\)Chatglm: a family of large language models from glm\-130b to glm\-4 all tools\.arXiv preprint arXiv:2406\.12793\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p1.1)\.
- Y\. Gong, D\. Ran, X\. He, T\. Cong, A\. Wang, and X\. Wang \(2025\)Safety misalignment against large language models\.\.InNDSS,Cited by:[§2](https://arxiv.org/html/2606.06519#S2.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p1.1)\.
- C\. Hsu, Y\. Tsai, C\. Lin, P\. Chen, C\. Yu, and C\. Huang \(2024\)Safe lora: the silver lining of reducing safety risks when finetuning large language models\.Advances in Neural Information Processing Systems37,pp\. 65072–65094\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p2.1),[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1),[§4\.3](https://arxiv.org/html/2606.06519#S4.SS3.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)Lora: low\-rank adaptation of large language models\.\.Iclr1\(2\),pp\. 3\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p1.1),[§1](https://arxiv.org/html/2606.06519#S1.p3.1),[§3\.3](https://arxiv.org/html/2606.06519#S3.SS3.p1.8)\.
- T\. Huang, G\. Bhattacharya, P\. Joshi, J\. Kimball, and L\. Liu \(2024a\)Antidote: post\-fine\-tuning safety alignment for large language models against harmful fine\-tuning\.arXiv preprint arXiv:2408\.09600\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p2.1),[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§2](https://arxiv.org/html/2606.06519#S2.p1.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1)\.
- T\. Huang, S\. Hu, F\. Ilhan, S\. F\. Tekin, and L\. Liu \(2024b\)Lisa: lazy safety alignment for large language models against harmful fine\-tuning attack\.Advances in Neural Information Processing Systems37,pp\. 104521–104555\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1)\.
- T\. Huang, S\. Hu, F\. Ilhan, S\. F\. Tekin, Z\. Yahn, Y\. Xu, and L\. Liu \(2025a\)Safety tax: safety alignment makes your large reasoning models less reasonable\.arXiv preprint arXiv:2503\.00555\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p3.1)\.
- T\. Huang, S\. Hu, F\. Ilhan, S\. Tekin, and L\. Liu \(2025b\)Booster: tackling harmful fine\-tuning for large language models via attenuating harmful perturbation\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 67202–67226\.Cited by:[§2](https://arxiv.org/html/2606.06519#S2.p2.1)\.
- T\. Huang, S\. Hu, and L\. Liu \(2024c\)Vaccine: perturbation\-aware alignment for large language models against harmful fine\-tuning attack\.Advances in Neural Information Processing Systems37,pp\. 74058–74088\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1)\.
- J\. Ji, M\. Liu, J\. Dai, X\. Pan, C\. Zhang, C\. Bian, B\. Chen, R\. Sun, Y\. Wang, and Y\. Yang \(2023\)Beavertails: towards improved safety alignment of llm via a human\-preference dataset\.Advances in Neural Information Processing Systems36,pp\. 24678–24704\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p3.1)\.
- M\. Li, W\. M\. Si, M\. Backes, Y\. Zhang, and Y\. Wang \(2025\)Salora: safety\-alignment preserved low\-rank adaptation\.arXiv preprint arXiv:2501\.01765\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1),[§4\.3](https://arxiv.org/html/2606.06519#S4.SS3.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p3.1)\.
- B\. Pan, Y\. Li, W\. Zhang, W\. Lu, M\. Xu, S\. Zhou, Y\. Zhu, M\. Zhong, and T\. Qian \(2025\)A survey on training\-free alignment of large language models\.arXiv preprint arXiv:2508\.09016\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p3.1)\.
- X\. Qi, Y\. Zeng, T\. Xie, P\. Chen, R\. Jia, P\. Mittal, and P\. Henderson \(2024\)Fine\-tuning aligned language models compromises safety, even when users do not intend to\!\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 30988–31043\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p2.1),[§2](https://arxiv.org/html/2606.06519#S2.p1.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p1.1)\.
- M\. Thakkar, Q\. Fournier, M\. Riemer, P\. Chen, A\. Zouaq, P\. Das, and S\. Chandar \(2025\)Combining domain and alignment vectors provides better knowledge\-safety trade\-offs in llms\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 268–277\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p3.1)\.
- A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. Bowman \(2019\)Superglue: a stickier benchmark for general\-purpose language understanding systems\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p2.1)\.
- A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. Bowman \(2018\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP,pp\. 353–355\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p2.1)\.
- J\. Wang, J\. Li, Y\. Li, X\. Qi, J\. Hu, Y\. Li, P\. McDaniel, M\. Chen, B\. Li, and C\. Xiao \(2024\)Backdooralign: mitigating fine\-tuning based jailbreak attack with backdoor enhanced safety alignment\.Advances in Neural Information Processing Systems37,pp\. 5210–5243\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p4.1),[§2](https://arxiv.org/html/2606.06519#S2.p1.1),[§2](https://arxiv.org/html/2606.06519#S2.p2.1)\.
- Q\. Wang, X\. Yang, S\. Lin, J\. Wang, and X\. Geng \(2023\)Learngene: inheriting condensed knowledge from the ancestry model to descendant models\.arXiv preprint arXiv:2305\.02279\.Cited by:[§3\.4](https://arxiv.org/html/2606.06519#S3.SS4.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p1.1)\.
- X\. Zhang, J\. Zhao, and Y\. LeCun \(2015\)Character\-level convolutional networks for text classification\.Advances in neural information processing systems28\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p2.1)\.
- A\. Zou, Z\. Wang, N\. Carlini, M\. Nasr, J\. Z\. Kolter, and M\. Fredrikson \(2023\)Universal and transferable adversarial attacks on aligned language models\.arXiv preprint arXiv:2307\.15043\.Cited by:[§1](https://arxiv.org/html/2606.06519#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.06519#S4.SS1.p3.1)\.

## Appendix AAdditional Comparison

To further evaluate the generality of SafeGene across different models, we provide an additional comparison with existing safe adaptation methods on Qwen2\.5\-1\.5B, as reported in Table[4](https://arxiv.org/html/2606.06519#A1.T4)\. Following the main comparison in Section[4\.3](https://arxiv.org/html/2606.06519#S4.SS3), all methods are applied to the same downstream fine\-tuned checkpoints and evaluated on BeaverTails and DirectRefusal using the same safety judge\.

TaskMethodAcc↑\\uparrowBT↓\\downarrowDR↓\\downarrowASR¯\\overline\{\\text\{ASR\}\}↓\\downarrowAG NewsFine\-tuned89\.4268\.4962\.7665\.63SafeLoRA89\.0364\.2258\.9561\.59SafeMERGE87\.9349\.5743\.5546\.56SaLoRA88\.8751\.9947\.7649\.88SafeGene \(Ours\)88\.9138\.7233\.9536\.34SST2Fine\-tuned95\.5355\.4552\.6354\.04SafeLoRA94\.9555\.1149\.7452\.43SafeMERGE95\.4147\.3239\.6143\.47SaLoRA94\.1546\.6243\.0344\.83SafeGene \(Ours\)95\.3032\.2629\.6130\.94MNLIFine\-tuned84\.2351\.9348\.8250\.38SafeLoRA85\.2749\.5145\.7947\.65SafeMERGE84\.5340\.2836\.5838\.43SaLoRA84\.1748\.3645\.1346\.75SafeGene \(Ours\)84\.4532\.4930\.0031\.25BoolQFine\-tuned80\.2852\.2845\.6648\.97SafeLoRA82\.3449\.9144\.6147\.26SafeMERGE81\.0442\.9936\.1839\.59SaLoRA78\.7044\.2043\.9544\.08SafeGene \(Ours\)79\.6331\.6227\.6329\.63Avg\.Fine\-tuned87\.3757\.0452\.4754\.75SafeLoRA87\.9054\.6949\.7752\.23SafeMERGE87\.2345\.0438\.9842\.01SaLoRA86\.4744\.2038\.1641\.18SafeGene \(Ours\)87\.0733\.7730\.3032\.04Table 4:Additional comparison with existing safe adaptation methods on Qwen2\.5\-1\.5B\. BT and DR denote BeaverTails and DirectRefusal, respectively\.ASR¯\\overline\{\\text\{ASR\}\}is the average ASR over BT and DR\. Best results are underlined, and our method is shown in bold\.
## Appendix BFairness of Baseline Comparisons

To ensure fair comparison, we keep the data budgets and evaluation protocol consistent across SafeGene and all baselines\. All methods use the same 9,000 task\-specific training examples for each downstream task\. Methods requiring safety data are given the same 3,000 safety examples, which are used to construct the SafeGene module, train the task\-agnostic safety adapter in SafeMERGE, or build the fixed safety module in SaLoRA\. SafeLoRA and SafeMERGE use the same safety\-misaligned reference model when constructing weight\-difference subspaces\. The safety examples used for construction or calibration are disjoint from the safety evaluation sets\. All methods are evaluated with the same safety benchmarks, generated\-response protocol, and safety judge, and baseline\-specific hyperparameters follow the original papers or official implementations\.

MethodDownstream DataSafety DataMisaligned ReferenceSafety EvaluationSafeGene9,000 / task3,000Same Unaligned ModelSame benchmarks, data, and judgeSafeLoRA9,000 / taskN/ASafeMERGE9,000 / task3,000SaLoRA9,000 / task3,000N/ATable 5:Fairness of data usage and evaluation protocol in baseline comparisons\. All methods use the same 9,000 downstream training examples for each task\. Methods requiring safety data are given the same 3,000 safety examples\. SafeGene, SafeLoRA, and SafeMERGE use the same safety\-misaligned reference model where applicable\. All methods are evaluated using the same safety datasets and safety judge\.
## Appendix CDetails of Datasets

This section summarizes the datasets used for downstream task evaluation, downstream task training, and safety evaluation\. For downstream tasks, we report the number of evaluation and training examples\. For safety benchmarks, we report the number of harmful instruction examples used for ASR evaluation, with N/A indicating that no downstream training samples are used\.

CategoryDataset\#Eval\. Examples\#Train ExamplesDownstream taskAG News7,6009,000SST28729,000MNLI9,8159,000BoolQ3,2709,000Safety benchmarkBeaverTails1,733N/AAdvBench520N/ADirectRefusal760N/ATable 6:Dataset details for downstream task evaluation, downstream task training, and safety evaluation\. The safety benchmarks are used to compute ASR, while downstream task datasets are used to evaluate task utility\.
## Appendix DInfra and Hardware Details

This section reports the main hardware and software environment used in our experiments\. All experiments are conducted under the same environment unless otherwise specified\.

CategoryItemConfigurationHardwareGPUNVIDIA RTX PRO 6000 Blackwell Server Edition, 98 GB GPU memoryRuntime environmentPython / CUDA / PyTorchPython 3\.10; CUDA 12\.8; PyTorch 2\.10\.0\+cu128Model implementationTransformers / AccelerateHuggingFace Transformers 4\.51\.0; Accelerate 0\.34\.2Adapter trainingPEFT / TRLPEFT 0\.13\.0; TRL 0\.10\.1Dataset loadingDatasetsHuggingFace Datasets 2\.21\.0Table 7:Infrastructure, software environment, and core implementation packages used in our experiments\.
## Appendix EHyper\-parameter Setting

This section summarizes the key hyper\-parameters used in downstream task training and the three stages of SafeGene, including safety\-vector distillation, data\-aware layer selection, and few\-shot coefficient recalibration\. Entries marked as N/A indicate that the corresponding item is not associated with an additional note or is not applicable to that component\.

ComponentHyper\-parameterValueNotesTask trainingLoRA rank \(rr\)16N/ALoRA alpha32N/ALoRA dropout0\.05N/ALearning rate3\.0×10−43\.0\{\\times\}10^\{\-4\}N/AEpochs5N/ABatch size1Per\-device batch sizeGradient accumulation32Effective batch size is 32Target modulesall linearIncludes attention and MLP projection layersS1Teacher modelMsafeM\_\{\\mathrm\{safe\}\}Safely aligned base modelStudent modelMbrokenM\_\{\\mathrm\{broken\}\}Safety\-degraded model constructed fromMsafeM\_\{\\mathrm\{safe\}\}Degradation data320 harmful examplesRandomly sampled from the BeaverTails training set to fine\-tuneMsafeM\_\{\\mathrm\{safe\}\}LoRA rank \(rr\)64N/ALoRA alpha128N/ALoRA dropout0\.05N/ATarget modulesq\_proj,v\_projN/ALearning rate1\.0×10−41\.0\{\\times\}10^\{\-4\}N/AEpochs3N/ALoss weightsN/Aλce=1\.0\\lambda\_\{\\mathrm\{ce\}\}=1\.0,λkl=1\.0\\lambda\_\{\\mathrm\{kl\}\}=1\.0,λrepr=0\.0\\lambda\_\{\\mathrm\{repr\}\}=0\.0,λbenign=0\.5\\lambda\_\{\\mathrm\{benign\}\}=0\.5S2Score collection ratio0\.5Uses a balanced harmful/benign subset for layer scoringTop\-kkselection ratio0\.7Selects the top\-scored safety\-vector layersS3Few\-shot harmful samples64From the safety dataFew\-shot benign samples64From the target\-domain task dataTrainable parameters\{αℓ\}ℓ∈𝒯\\\{\\alpha\_\{\\ell\}\\\}\_\{\\ell\\in\\mathcal\{T\}\}Only layer\-wise scalar coefficients are optimizedCoefficient initialization1\.0Starts from direct reuse of selected safety vectorsLearning rate1\.0×10−31\.0\{\\times\}10^\{\-3\}N/AEpochs5N/ABatch size1N/ALoss weightsN/Aλce=1\.0\\lambda\_\{\\text\{ce\}\}=1\.0,λanchor=0\.1\\lambda\_\{\\text\{anchor\}\}=0\.1,λsparse=0\.01\\lambda\_\{\\text\{sparse\}\}=0\.01,λbenign=0\.5\\lambda\_\{\\text\{benign\}\}=0\.5Table 8:Key hyper\-parameter settings for downstream task training and the three stages of SafeGene\. S1, S2, and S3 denoteSafety Vector Distillation\(Section[3\.3](https://arxiv.org/html/2606.06519#S3.SS3)\),Data\-aware Layer Selection\(Section[3\.4](https://arxiv.org/html/2606.06519#S3.SS4)\), andFew\-shot Safety Transfer\(Section[3\.5](https://arxiv.org/html/2606.06519#S3.SS5)\), respectively\. N/A indicates that the item is not associated with an additional note\.
## Appendix FAPI\-based Safety Judge Prompt

For the API\-based safety evaluation, we use the following prompt to ask the judge model to determine whether a model output contains harmful content\.

Safety Judge PromptYou are a security expert\. Please determine whether the following model output contains harmful content\.Harmful content includes: violence, pornography, fraud, discrimination, illegal activities, privacy leaks, etc\.Model output: “\{output\}”Please return onlytrueorfalse\.Returntrueif the output contains any harmful content\.Returnfalseif the output is safe\.Note:Judge based solely on the output content, regardless of whether the output refused to answer\.

## Appendix GRelative Computational Budget

We qualitatively compare the additional computational budget after downstream task\-adapted checkpoints are already available\. The cost of training the four downstream task models is therefore not counted\. SafeGene, SafeLoRA, and SafeMERGE can all be applied to the same existing fine\-tuned checkpoints without re\-training the downstream task adapters\. Among them, SafeLoRA has the lowest extra training cost because it is training\-free and data\-free, relying only on the aligned–unaligned model difference to construct a safety projection\. SafeMERGE requires a reusable task\-agnostic safe adapter or safe model and then performs a lightweight layer\-wise merging step\. SafeGene requires a reusable safety\-vector distillation stage within each compatible model family, followed by a small per\-checkpoint coefficient recalibration step that only learns one scalar per selected layer\. In contrast, SaLoRA modifies the downstream adaptation procedure itself; thus, under the setting where fine\-tuned checkpoints already exist, applying SaLoRA would require re\-running downstream fine\-tuning with its safety\-preserving trainer for each task, making it less suitable for post\-hoc safety recovery\.

## Appendix HUse of Scientific Artifacts

#### Artifact licenses and terms of use\.

We used existing artifacts consistently with their intended use and license terms\. For models, the Qwen2\.5 and Qwen3 models are released under the Apache\-2\.0 License, Llama\-3 is used under the Meta Llama 3 Community License Agreement, GLM\-4\-9B is used under the GLM\-4\-9B License for academic research, and Beaver\-dam\-7B is used under its non\-commercial license\. For datasets and benchmarks, BeaverTails is released under the CC BY\-NC 4\.0 License, AdvBench is released under the MIT License, and BoolQ is released under the Creative Commons Share\-Alike 3\.0 License\. AG News, SST\-2, MNLI, and DirectRefusal are used only for research evaluation under their original release terms, and we do not redistribute them\. Any artifacts created in this work are intended for research purposes only\. We used only publicly available datasets and did not collect any private or personally identifying information ourselves\. Some safety benchmarks may contain harmful, safety\-related, or potentially offensive content, and we used them only for research purposes under controlled settings\.

Similar Articles

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

arXiv cs.AI

This paper proposes a hybrid framework combining first-order safety alignment with zeroth-order refinement to enhance the robustness of LLM safety alignment against post-alignment perturbations. Theoretical and empirical results show that only a few refinement steps can improve robustness while preserving safety.

Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL

This paper introduces the Configurable Safety Reward Model (CSRM), a reward model that can be configured to accommodate heterogeneous and evolving safety requirements for LLM alignment. CSRM achieves state-of-the-art results on configurable safety benchmarks and improves the helpfulness-safety tradeoff.

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

arXiv cs.LG

This paper introduces DualSelect, a coupled framework for selecting task samples and safety references jointly to preserve safety during LLM fine-tuning without losing task utility. The method improves safety by at least 5 points over existing baselines on 1B–8B LLMs.