PACT: Preserving Anchored Cores in Task-vectors for Model Merging
Summary
The paper identifies 'Load-Bearing Wall' dimensions in pre-trained models that retain task-specific knowledge not fully captured by task vectors in model merging, and proposes PACT (PreserveAnchoredCores) to preserve these cores, achieving state-of-the-art performance across benchmarks.
Cached at: 06/18/26, 05:44 AM
# PACT: Preserving Anchored Cores in Task-vectors for Model Merging
Source: [https://arxiv.org/html/2606.18627](https://arxiv.org/html/2606.18627)
Ningyuan Shi1,⋆, Zhipeng Zhou2,⋆, Hao Wang3, Chunyan Miao2, Peilin Zhao3,†
1 Shanghai Jiao Tong University; 2 Nanyang Technological University; 3 The Hong Kong University of Science and Technology (Guangzhou)
shiningyuanAccount@sjtu.edu.cn, zzpustcml@gmail.com, haowang@hkust-gz.edu.cn, ascymiao@ntu.edu.sg, peilinzhao@sjtu.edu.cn
###### Abstract
Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the *Task Arithmetic* paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded solely within task vectors. In this paper, we argue that this assumption generally does not hold due to the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, namely some task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into task vectors. We characterize LBW dimensions from both scalar-weight and subspace perspectives, thereby covering the major paradigms of existing model merging methods. Our analysis reveals that, by ignoring LBW dimensions, task-vector-based approaches fail to fully resolve task conflicts and may inadvertently damage task-specific knowledge encoded in the pre-trained model, leading to degradation. To address this issue, we propose PreserveAnchoredCores, which preserves the anchored task-specific cores (i.e., LBW components) within task vectors by aligning their orthogonal complements with the subspace of the pre-trained weights. These aligned subspace components are then removed from the task vectors before applying existing model merging algorithms. Furthermore, we develop an efficient variant based on randomized SVD to improve scalability. Since PreserveAnchoredCores approaches model merging from a complementary perspective, it can be seamlessly integrated with existing methods.
Extensive experiments across multiple benchmarks demonstrate that PreserveAnchoredCores consistently enhances mainstream model merging approaches and establishes new state-of-the-art performance.
## 1 Introduction
Pre-trained models (PTMs) have become the foundation of modern machine learning systems, providing strong general-purpose representations across a wide range of domains and tasks (Carion et al., [2020](https://arxiv.org/html/2606.18627#bib.bib3); Radford et al., [2021](https://arxiv.org/html/2606.18627#bib.bib31); Caron et al., [2021](https://arxiv.org/html/2606.18627#bib.bib4)). To achieve high performance on downstream applications, PTMs are typically specialized through task-specific fine-tuning (Wortsman et al., [2022b](https://arxiv.org/html/2606.18627#bib.bib46); Ilharco et al., [2022b](https://arxiv.org/html/2606.18627#bib.bib18)). However, extending such specialization to the multi-task setting often requires complex optimization procedures, substantial computational resources, and access to large amounts of task-specific training data (Wei et al., [2024](https://arxiv.org/html/2606.18627#bib.bib43)). Model merging has recently emerged as an attractive training-free alternative, aiming to combine multiple task-specific fine-tuned models into a single multi-task model without additional training (Li et al., [2023](https://arxiv.org/html/2606.18627#bib.bib24); Yang et al., [2026](https://arxiv.org/html/2606.18627#bib.bib51)).
Existing model merging methods can be broadly categorized into scalar weight-based and subspace-based approaches (Yang et al., [2026](https://arxiv.org/html/2606.18627#bib.bib51); Ruan et al., [2025](https://arxiv.org/html/2606.18627#bib.bib32)). Scalar weight-based methods (Ilharco et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib17); Yadav et al., [2023](https://arxiv.org/html/2606.18627#bib.bib50); Nguyen et al., [2025](https://arxiv.org/html/2606.18627#bib.bib27)) treat individual parameter coordinates as the basic unit of task knowledge and perform coordinate-wise operations on model weights or task vectors. In contrast, subspace-based approaches (Stoica et al., [2025](https://arxiv.org/html/2606.18627#bib.bib36); Gargiulo et al., [2025](https://arxiv.org/html/2606.18627#bib.bib12); Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)) represent task knowledge as a collection of important directions in parameter space and merge models by identifying, preserving, or recombining task-relevant subspaces. Despite their methodological differences, both paradigms operate primarily in the space of *task vectors*, namely the parameter updates induced by fine-tuning.
The widespread adoption of task vectors is largely motivated by the insights of Task Arithmetic (Ilharco et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib17)), which showed that task vectors corresponding to different tasks are often approximately orthogonal. This observation suggests that task-specific knowledge is largely encoded in task vectors and can therefore be manipulated independently. Consequently, most existing model merging methods implicitly assume that task vectors provide a sufficiently complete representation of task-specific knowledge, while the pre-trained model serves merely as a shared initialization point. In this paper, we challenge this assumption by asking a simple yet fundamental question:
> *Is task\-specific and \-critical knowledge fully captured by task vectors?*
Our answer is negative. Through a series of controlled experiments spanning both scalar weight-based and subspace-based merging paradigms, we identify a previously overlooked phenomenon: a portion of task-specific and -critical knowledge remains anchored in the pre-trained model itself. We refer to such dimensions as Load-Bearing Wall (LBW) dimensions. Although these dimensions often exhibit only small changes during fine-tuning and therefore contribute little to the task vector, they play a disproportionately important role in maintaining task performance. As a result, task vectors alone provide an incomplete representation of task-specific knowledge.
Figure 1: Illustration of LBW dimensions in the pre-trained model and their impact on model merging.

This observation has important implications for model merging. Since existing approaches operate exclusively on task vectors, they are unable to explicitly protect LBW dimensions during merging. Consequently, updates introduced by other task vectors may overwrite or distort the pre-trained structures on which a task critically depends, leading to performance degradation after merging. Through destructive perturbation experiments, we verify both the existence of LBW components and their vulnerability during model aggregation, as illustrated in Figure [1](https://arxiv.org/html/2606.18627#S1.F1).
Building upon these findings, we propose PACT, a data-free framework for protecting task-specific LBW knowledge during model merging. The key idea is to identify and preserve the task-anchored components associated with LBW dimensions while filtering out task-vector components that interfere with the corresponding pre-trained subspace. Since PACT operates directly on task vectors, it can be seamlessly integrated with existing scalar weight-based and subspace-based merging algorithms. To further improve scalability, we develop an efficient variant based on randomized SVD with substantially reduced computational complexity. Our contributions are summarized as follows:
- We revisit a fundamental assumption underlying Task Arithmetic and investigate whether task vectors fully capture task-specific knowledge. We empirically identify the existence of LBW dimensions and demonstrate their critical role in model merging.
- We propose PACT, a data-free framework that protects task-specific LBW knowledge from interference during model merging. We further develop an efficient variant with improved scalability through randomized SVD.
- Extensive experiments on vision and language benchmarks, covering both full fine-tuning and LoRA fine-tuning settings, demonstrate that PACT consistently improves mainstream model merging methods and establishes new state-of-the-art (SOTA) performance.
## 2 Related Work
Model merging aims to integrate multiple task-specific fine-tuned expert models into a single multi-task model without accessing their original training data, thereby avoiding the computational overhead and data privacy concerns of traditional multi-task learning. The task vector lies at the core of most static merging methods. Task Arithmetic (Ilharco et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib17)) performs multi-task merging by scaling and summing task vectors, establishing the plug-and-play merging paradigm. TIES-Merging (Yadav et al., [2023](https://arxiv.org/html/2606.18627#bib.bib50)) sequentially applies magnitude pruning, sign disambiguation, and disjoint aggregation to alleviate parameter conflicts. Consensus Merging (Wang et al., [2024](https://arxiv.org/html/2606.18627#bib.bib42)) eliminates “selfish” and “catastrophic” weights that benefit only a single task at the expense of others. Concrete Merging (Tang et al., [2023](https://arxiv.org/html/2606.18627#bib.bib39)) uses meta-learning to generate binary masks that suppress conflicting parameters. AWD (Xiong et al., [2024](https://arxiv.org/html/2606.18627#bib.bib49)) reduces interference by promoting orthogonality among task vectors. DARE (Yu et al., [2024](https://arxiv.org/html/2606.18627#bib.bib52)) randomly drops part of the parameters as a dropout-style preprocessing step. MAP (Li et al., [2025](https://arxiv.org/html/2606.18627#bib.bib23)) leverages a second-order Taylor expansion and linear regression to estimate the Hessian, providing loss-based guidance for merging. Both MetaGPT (Zhou et al., [2024](https://arxiv.org/html/2606.18627#bib.bib53)) and TATR (Sun et al., [2025](https://arxiv.org/html/2606.18627#bib.bib38)) explicitly model the loss gap: the former derives a closed-form solution for the merging coefficient, while the latter models the loss gap as the product of gradients and task vectors to quantify knowledge conflicts.
Recent approaches exploit SVD to capture the low-rank structure of task matrices, operating within subspaces to align, orthogonalize, or equalize components so as to preserve shared knowledge while suppressing interference. KnOTS (Stoica et al., [2025](https://arxiv.org/html/2606.18627#bib.bib36)), designed for LoRA fine-tuning, merges adapters by aligning and averaging their right singular vectors. TSV (Gargiulo et al., [2025](https://arxiv.org/html/2606.18627#bib.bib12)) applies SVD to each task matrix and whitens the resulting subspaces to reduce inter-task interference. Iso-C (Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)) replaces the singular values of the summed task matrix with their mean, flattening the singular spectrum to expand the effective subspace and significantly improve the subspace alignment ratio. Iso-CTS (Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)) extends this by injecting task-specific singular directions, balancing shared and task-specific knowledge. DOGE TA (Wei et al., [2025](https://arxiv.org/html/2606.18627#bib.bib44)) formulates merging as a constrained optimization problem, constructs a shared subspace, and optimizes a correction vector via projected gradient descent with adaptive task-aware merging coefficients, further narrowing the performance gap to individual task models.
However, all these model merging methods rest on an implicit assumption: that the task\-specific and \-critical knowledge resides solely within the corresponding task vector and is independent of the pre\-trained base model\. Consequently, they overlook the LBW parameters within the pre\-trained parameters that are critical for each specific task\. Our study is the first to examine the role of pre\-trained parameters for individual tasks and to protect the pre\-trained core subspace by filtering LBW dimensions from task vectors\.
## 3 Motivation and Observation
In this section, we first introduce the general task-vector formulation of model merging and establish the notation used throughout the remainder of the paper. We then validate our motivation from two perspectives across both scalar weight and subspace paradigms: (1) identifying the existence of Load-Bearing Wall (LBW) dimensions, and (2) demonstrating how these components are disrupted during model merging. Together, these observations provide key insights that motivate the design of our method presented in the next section.
### 3.1 Preliminary
Model merging seeks to combine multiple deep neural networks, each fine-tuned from an identical pre-trained model on separate tasks, into a single unified model. Let $\theta_0$ represent the pre-trained weights and $\theta_t$ denote the weights fine-tuned for task $t$, with $t=1,\dots,T$, where $T$ is the total number of tasks. We write $\theta^{(\ell)}_t$ for the weights at layer $\ell$ corresponding to task $t$, and let $L$ be the total number of layers. The goal of model merging is to determine a merging function $f$ such that the resulting model

$$\theta^{(\ell)}_{M}=f\left(\theta^{(\ell)}_{0},\{\theta^{(\ell)}_{t}\}_{t=1}^{T}\right),\quad \forall\,\ell=1,\dots,L \tag{1}$$

can perform all tasks that the individual models $\theta_t$ were trained on.
Following (Ilharco et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib17)), the layer-wise task vector $\Delta_t^{(\ell)}$ is defined as the difference between the fine-tuned weights $\theta^{(\ell)}_t$ and the pre-trained weights $\theta^{(\ell)}_0$ at layer $\ell$: $\Delta_t^{(\ell)}=\theta^{(\ell)}_t-\theta^{(\ell)}_0$. For the rest of the paper, the superscript $\ell$ is dropped when the context is clear, and all definitions apply to an arbitrary layer. In Task Arithmetic, the merging function sums all task matrices and adds them back to the pre-trained weights:

$$\theta_{TA}^{(\ell)}=\theta^{(\ell)}_{0}+\alpha\Delta_{TA}^{(\ell)},\quad\text{with}\quad \Delta_{TA}^{(\ell)}=\sum_{t=1}^{T}\Delta_{t}^{(\ell)}, \tag{2}$$

where $\alpha$ is a scaling factor chosen on a held-out validation set. This strategy enables the reuse and transfer of knowledge from several fine-tuned models to the pre-trained model without requiring additional training or access to the original training data. Building upon this, subsequent algorithms have focused on the conflicts among the $\Delta_t$ and on how to aggregate them, implicitly assuming that there is no interference between $\theta_0$ and $\Delta_{TA}$ and that they can be directly summed. Our research begins precisely from this premise. Before presenting our empirical evidence, we first give a brief definition of LBW dimensions.
###### Definition 3.1 (LBW Dimensions).
For a task $t$, a parameter dimension $i$ is called an LBW dimension if it contributes significantly to the task performance while receiving only negligible updates during fine-tuning, i.e., $|\Delta_t^{(i)}|\approx 0$. In such dimensions, task-relevant knowledge resides predominantly in the pre-trained parameters $\theta_0^{(i)}$ rather than in the task vector $\Delta_t^{(i)}$.
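To make the notation of this subsection concrete, the following is a minimal NumPy sketch of the Task Arithmetic rule in Eq. (2). It assumes each model is stored as a dictionary mapping a layer name to a weight array; the function name `task_arithmetic` and the toy weights are illustrative and do not come from the paper.

```python
import numpy as np

def task_arithmetic(theta0, finetuned, alpha=1.0):
    """Per layer: theta_TA = theta0 + alpha * sum_t (theta_t - theta0)."""
    merged = {}
    for name, w0 in theta0.items():
        # Delta_TA for this layer: sum of the task vectors theta_t - theta0
        delta_sum = sum(ft[name] - w0 for ft in finetuned)
        merged[name] = w0 + alpha * delta_sum
    return merged

# Toy example with one 2x2 layer and two fine-tuned models (hypothetical values).
theta0 = {"layer": np.zeros((2, 2))}
theta_a = {"layer": np.array([[1.0, 0.0], [0.0, 0.0]])}
theta_b = {"layer": np.array([[0.0, 0.0], [0.0, 2.0]])}
merged = task_arithmetic(theta0, [theta_a, theta_b], alpha=0.5)
```

Note that the pre-trained weights enter only as the base point `w0`; nothing in this rule checks whether the summed update disturbs coordinates where $|\Delta_t^{(i)}|\approx 0$ for some task, which is the gap that Definition 3.1 points at.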
### 3.2 Identifying and Validating LBW Dimensions
#### 3.2.1 LBW Dimensions in Scalar Weight
To validate our hypothesis, we first introduce two metrics for identifying potential LBW dimensions, which are used throughout the subsequent experiments:
$$F_{\theta}=\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\partial\mathcal{L}_i}{\partial\theta}\right)^{2},\qquad E_{B}=\frac{d}{n}\,\frac{\sum_{\tau\in[\widehat{\Delta_{A}}]}\tau^{2}}{\sum_{\tau\in[\Delta_{B}]}\tau^{2}}. \tag{3}$$

Here, $F_\theta$ denotes the Fisher information (Kirkpatrick et al., [2017](https://arxiv.org/html/2606.18627#bib.bib19)), where $N$ is the number of samples (1000 in our experiments) and $\mathcal{L}_i$ is the loss of the $i$-th sample. Fisher information quantifies the sensitivity of the loss to a parameter: a larger value indicates that small perturbations to the parameter induce larger changes in the loss. The second metric, $E_B$, is an enrichment score (Subramanian et al., [2005](https://arxiv.org/html/2606.18627#bib.bib37)), where $\tau$ denotes an individual parameter, $\widehat{\Delta_A}$ is a subset of task vector $\Delta_A$, $n$ is the number of parameters in the subset, and $d$ is the total number of parameters in the layer. Intuitively, $E_B$ measures the extent to which task $B$ concentrates its updates within the dimensions selected from task $A$. Values greater than 1 indicate that task $B$ disproportionately modifies these dimensions relative to the layer-wide average.
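The two metrics in Eq. (3) can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: per-sample gradients are assumed to be available as an `(N, d)` array, and, following the prose description, the numerator of $E_B$ is read as the energy of task $B$'s update restricted to the coordinates selected from task $A$. The function names are hypothetical.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Mean squared per-sample gradient for each parameter; input shape (N, d)."""
    return np.mean(per_sample_grads ** 2, axis=0)

def enrichment_score(delta_b, subset_mask):
    """(d / n) * (energy of delta_b inside the subset) / (total energy of delta_b)."""
    d = delta_b.size
    n = int(subset_mask.sum())
    return (d / n) * np.sum(delta_b[subset_mask] ** 2) / np.sum(delta_b ** 2)

# Two samples, two parameters (toy gradients).
fisher = diagonal_fisher(np.array([[1.0, 2.0], [3.0, 0.0]]))

# Task B spreads its update uniformly: enrichment equals the layer-wide average.
uniform = enrichment_score(np.ones(4), np.array([True, True, False, False]))
# Task B puts all of its update on the single selected coordinate.
focused = enrichment_score(np.array([3.0, 0.0, 0.0, 0.0]),
                           np.array([True, False, False, False]))
```

With a uniform update the score is 1, matching the "layer-wide average" reading in the text, while an update concentrated entirely on the selected coordinates gives $d/n$.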
##### Filter Process\.
For each layer, we first select the $30\%$ of parameters in $\Delta_A$ with the smallest magnitudes. Among them, we retain the $30\%$ with the largest Fisher values, yielding a subset $\widehat{\Delta_A}$ that contains approximately $9\%$ of the layer’s parameters. We then compute $E_B$ and retain only dimensions exhibiting statistically significant enrichment ($p<0.05$). The resulting parameters constitute the *crucial mask*. This procedure directly targets our hypothesis by identifying dimensions that are weakly represented in the task vector, highly important to task performance, and likely to be modified by another task.
The crucial mask contains fewer than $2\%$ of the model parameters, and all subsequent experiments are conducted on this subset. For comparison, we construct two control masks of identical size. The *safe mask* is obtained by selecting the $30\%$ of parameters with the lowest Fisher values and applying the same enrichment filtering procedure. The *random mask* is obtained by uniformly sampling parameters at random and applying the same filtering criterion.
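The first two selection steps of the filter process can be sketched as follows. This is a hedged illustration: the function name `candidate_subset` and the random inputs are hypothetical, and the final enrichment significance test ($p<0.05$) is omitted because the text above does not specify how the $p$-value is computed.

```python
import numpy as np

def candidate_subset(delta_a, fisher, mag_frac=0.3, fisher_frac=0.3):
    """Boolean mask over a flattened layer: low-|delta_a| entries with the highest Fisher values."""
    d = delta_a.size
    n_small = int(round(mag_frac * d))
    # Step 1: indices of the smallest-magnitude task-vector entries.
    small_idx = np.argsort(np.abs(delta_a))[:n_small]
    # Step 2: among them, keep the entries with the largest Fisher values.
    n_keep = int(round(fisher_frac * n_small))
    order = np.argsort(fisher[small_idx])[::-1][:n_keep]
    mask = np.zeros(d, dtype=bool)
    mask[small_idx[order]] = True
    return mask

rng = np.random.default_rng(0)
delta_a = rng.normal(size=1000)   # stand-in for a flattened task vector
fisher = rng.random(1000)         # stand-in for per-parameter Fisher values
mask = candidate_subset(delta_a, fisher)
```

On a layer with 1000 parameters this keeps 90 entries, i.e. the roughly $9\%$ subset $\widehat{\Delta_A}$ before enrichment filtering.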
Figure 2: Accuracy (%) under varying attack strengths with respect to different masks (panels (a)-(h)). In general, $M2$ is easily attacked while $M3$ and $M4$ remain rather stable, indicating the importance of the parameters within the crucial mask. For more results, please refer to Appendix Sec. [C.1](https://arxiv.org/html/2606.18627#A3.SS1).
##### Do LBW Parameters Exist?
We first investigate whether LBW parameters exist. Specifically, we select 16 task pairs $(A,B)$ from commonly employed benchmarks (Wang et al., [2024](https://arxiv.org/html/2606.18627#bib.bib42)) and compute the crucial mask for task $A$. We then use task $B$’s task vector to perturb the corresponding dimensions. Let $M0$ denote the performance of task $A$’s fine-tuned model. We first restore the crucial-mask parameters to their pre-trained values, producing $M1$, to verify whether these dimensions indeed derive their functionality from the pre-trained model. We then perform three controlled perturbation experiments on the crucial, safe, and random masks, respectively: $\theta_0+\alpha\tau_B$, where $\alpha\in\{1,5,10,20,30\}$ controls the perturbation strength and $\tau_B$ denotes the parameters from the corresponding dimensions in task vector $\Delta_B$. The resulting performances are denoted by $M2$ (crucial mask), $M3$ (safe mask), and $M4$ (random mask). All experiments are conducted using ViT-B/16.
Representative results are shown in Figure [2](https://arxiv.org/html/2606.18627#S3.F2) and Table [2](https://arxiv.org/html/2606.18627#S3.T2), with additional results provided in the Appendix. The near-identical performance of $M0$ and $M1$ indicates that the selected dimensions contribute little to the task vector itself, confirming that their functionality primarily originates from the pre-trained model. In contrast, perturbing the crucial mask ($M2$) causes a dramatic performance drop, whereas perturbations of identical magnitude applied to the safe mask ($M3$) and random mask ($M4$) lead to substantially smaller degradation. As shown in Figure [2](https://arxiv.org/html/2606.18627#S3.F2), performance decreases monotonically with increasing perturbation strength for all three masks, with the most severe degradation consistently observed on the crucial mask. These results provide direct evidence for the existence of LBW parameters.
##### Are LBW Parameters Damaged During Merging?
Having established the existence of LBW parameters, we next examine whether these parameters are disrupted during model merging. Specifically, we first construct the merged model $\theta_0+\Delta_A+\Delta_B$ and evaluate its performance on task $A$. We then perform a targeted restoration by resetting only the crucial-mask parameters to their original pre-trained values in $\theta_0$, while keeping all other parameters exactly the same as in the merged model. The restored model is then evaluated again on task $A$. If this restoration leads to a clear performance recovery, it indicates that the crucial parameters of task $A$ have been overwritten during merging, and that such disruption directly contributes to the degradation of task $A$.
Representative results are reported in Table [2](https://arxiv.org/html/2606.18627#S3.T2), with additional results provided in Appendix Sec. [C.1](https://arxiv.org/html/2606.18627#A3.SS1). Restoring only the crucial-mask parameters yields substantial performance improvements on task $A$. Given that the crucial mask accounts for less than $2\%$ of the total parameters, the magnitude of the recovery is remarkably large. These findings indicate that existing task-vector-based merging methods indeed modify LBW dimensions and thereby impair task-specific knowledge embedded in the pre-trained model.
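The masked perturbation and targeted restoration used in the two experiments above reduce to element-wise selection. The sketch below is an interpretation rather than the authors' code: it assumes that, in the perturbation experiment, the entries outside the mask stay at task $A$'s fine-tuned values (the text does not state this explicitly), and the function names and toy arrays are hypothetical.

```python
import numpy as np

def perturb_masked(theta_a, theta0, delta_b, mask, alpha):
    """Set masked entries to theta0 + alpha * delta_b; keep theta_a elsewhere (assumption)."""
    return np.where(mask, theta0 + alpha * delta_b, theta_a)

def restore_masked(merged, theta0, mask):
    """Reset masked entries of the merged weights to their pre-trained values."""
    return np.where(mask, theta0, merged)

theta0 = np.zeros(4)
delta_a = np.array([1.0, 0.0, 2.0, 0.0])
delta_b = np.array([0.0, 3.0, 0.0, 4.0])
mask = np.array([False, True, False, False])   # stand-in for a crucial mask of task A

perturbed = perturb_masked(theta0 + delta_a, theta0, delta_b, mask, alpha=2.0)
merged = theta0 + delta_a + delta_b            # naive merge of the two tasks
restored = restore_masked(merged, theta0, mask)
```

In this toy case the masked coordinate carries no update from task $A$ but a large one from task $B$, so the naive merge overwrites it and `restore_masked` puts the pre-trained value back while leaving all other merged entries unchanged.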
Table 1: Accuracy (%) under different parameter modifications. The value of $\alpha$ is set to 30.0.
Table 2: $\Delta_{acc}$ represents the accuracy (%) change due to the crucial mask replacement.
#### 3.2.2 LBW Dimensions in Subspace
We have established the existence of LBW dimensions and their vulnerability during model merging. However, whether a similar phenomenon exists in subspace-based model merging methods (Yang et al., [2026](https://arxiv.org/html/2606.18627#bib.bib51); Ruan et al., [2025](https://arxiv.org/html/2606.18627#bib.bib32)) remains unclear. While the Fisher information analysis reveals the vulnerability of specific parameters at a microscopic level, subspace-based approaches operate on low-rank geometric structures rather than individual weights. Therefore, to understand how task interference arises in these methods, we extend our analysis from parameter space to subspace geometry.
We perform SVD on both the pre-trained weight matrix $\theta_0\in\mathbb{R}^{m\times n}$ and the task vectors $\Delta_1,\dots,\Delta_T\in\mathbb{R}^{m\times n}$:

$$\theta_{0}=U_{0}\Sigma_{0}V_{0}^{\top},\quad\text{and}\quad\Delta_{t}=U_{t}\Sigma_{t}V_{t}^{\top},\quad\forall t=1,\dots,T, \tag{4}$$

where $\Sigma_{*}$ denotes the diagonal matrix of singular values, and $U$ and $V$ denote the left and right singular vectors, respectively. Physically, for any input feature $x\in\mathbb{R}^{n}$, the forward pass transformation is governed by $\theta_0x=U_0\Sigma_0V_0^{\top}x$. In this formulation, the right singular vectors $V_0$ act as an orthonormal basis for the input space, where the projection coordinates $V_0^{\top}x$ are scaled by the singular values $\Sigma_0$. Consequently, the top-$K$ right singular vectors $V_{0,K}$ represent the principal directions that preserve the maximum variance of the representations, thereby capturing the dominant structural foundation of the pre-trained model.
To identify LBW dimensions in the spectral domain, we extract the top-$K$ singular vectors of the pre-trained weights to form the base matrix $V_{0,K}\in\mathbb{R}^{n\times K}$. Similarly, we extract the top-$k$ singular vectors of $\Delta_t$ to form the active task matrix $V_{t,k}\in\mathbb{R}^{n\times k}$. Unless otherwise specified, we set $(K,k)=(15,8)$ throughout all experiments.
We define the LBW subspace $V_{t,LBW}$ of task $t$ as the component of the pre-trained core subspace $V_{0,K}$ that lies outside the active update subspace of task $t$. Formally, it is obtained by projecting $V_{0,K}$ onto the orthogonal complement of $V_{t,k}$ followed by orthonormalization:

$$V_{t,LBW}=\text{orth}\left((I-V_{t,k}V_{t,k}^{\top})V_{0,K}\right)\in\mathbb{R}^{n\times d_{pact}}. \tag{5}$$

Specifically, the projection matrix $I-V_{t,k}V_{t,k}^{\top}$ filters out components of the pre-trained core $V_{0,K}$ that lie within the active task updates $V_{t,k}$. This projection isolates the pre-trained directions that are critical for task performance yet receive negligible updates during fine-tuning. The orth operator then orthonormalizes these remaining components to construct a stable, non-redundant basis $V_{t,LBW}$ with dimension $d_{pact}$, which is at most $K$. Conceptually, $V_{t,LBW}$ serves as the spectral analogue of the LBW dimensions, capturing pre-trained feature directions that remain untouched during task adaptation.
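Eq. (5) can be sketched directly with NumPy. The sketch below is illustrative only: it realizes the orth operator with an SVD of the projected matrix and a small absolute tolerance `tol` (one of several reasonable choices, not necessarily the one used in the paper), and the function names and random test matrices are hypothetical.

```python
import numpy as np

def top_right_singular_vectors(w, k):
    """Top-k right singular vectors of w as columns, shape (n, k)."""
    _, _, vt = np.linalg.svd(w, full_matrices=False)
    return vt[:k].T

def lbw_subspace(theta0, delta_t, K=15, k=8, tol=1e-8):
    """Orthonormal basis of (I - V_tk V_tk^T) V_0K, following Eq. (5)."""
    v0 = top_right_singular_vectors(theta0, K)    # pre-trained core V_{0,K}
    vt = top_right_singular_vectors(delta_t, k)   # active task directions V_{t,k}
    residual = v0 - vt @ (vt.T @ v0)              # project out the active directions
    u, s, _ = np.linalg.svd(residual, full_matrices=False)
    return u[:, s > tol]                          # drop numerically null directions

rng = np.random.default_rng(0)
theta0 = rng.standard_normal((64, 32))            # stand-in for a pre-trained layer
delta = 0.01 * rng.standard_normal((64, 32))      # stand-in for a task vector
v_lbw = lbw_subspace(theta0, delta)
```

The returned basis has at most $K$ columns, is orthonormal, and is orthogonal to $V_{t,k}$ up to floating-point error, which is the property the subsequent ablation relies on.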
##### Causal Validation via Subspace Ablation\.
To investigate whether model performance depends on $V_{t,LBW}$, we conduct a global subspace ablation study. Given a task-specific model $\theta_t=\theta_0+\Delta_t$, we remove the LBW subspace by constructing $\theta_{abla}=\theta_t(I-V_{t,LBW}V_{t,LBW}^{\top})$. Since $\Delta_t$ is approximately orthogonal to $V_{t,LBW}$ by construction (i.e., $\Delta_tV_{t,LBW}\approx 0$), this operation primarily removes the pre-trained components contained in $V_{t,LBW}$ while preserving the task-specific updates:

$$\theta_{abla}\approx\theta_{0}(I-V_{t,LBW}V_{t,LBW}^{\top})+\Delta_{t}. \tag{6}$$

We provide a formal derivation in Appendix Sec. [A.3](https://arxiv.org/html/2606.18627#A1.SS3). As a control group, we ablate a random orthogonal subspace $V_{rand}$: $\theta_{rand}=\theta_t(I-V_{rand}V_{rand}^{\top})$. To ensure a fair comparison, the control subspace $V_{rand}$ is constructed under two constraints: it must share the identical dimension $d_{pact}$ and lie strictly within the non-core pre-trained space $V_{0,K}^{\perp}$, i.e., orthogonal to $V_{0,K}$. Formally, we project a random Gaussian matrix $X_{rand}$ onto the orthogonal complement of the pre-trained core to obtain $(I-V_{0,K}V_{0,K}^{\top})X_{rand}$, and then apply QR decomposition to extract the orthonormal basis. From this basis, we randomly select and re-orthonormalize $d_{pact}$ columns to yield $V_{rand}$. Unlike $V_{t,LBW}$, which preserves the structured, task-essential components derived from the pre-trained core $V_{0,K}$, $V_{rand}$ represents purely unstructured directions within the non-core space $V_{0,K}^{\perp}$.
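The ablation and the random control can be sketched as below. This is a simplified illustration with hypothetical names: instead of drawing a larger Gaussian matrix and then selecting and re-orthonormalizing $d_{pact}$ columns, it draws exactly `dim` Gaussian columns and orthonormalizes them once with QR, which gives a basis with the same two properties (dimension `dim`, orthogonal to the core).

```python
import numpy as np

def ablate(theta_t, v):
    """theta_t (I - V V^T): remove the input directions spanned by the columns of v."""
    return theta_t - (theta_t @ v) @ v.T

def random_control_subspace(v0_core, dim, rng):
    """Orthonormal basis of a random `dim`-dimensional subspace orthogonal to v0_core."""
    n = v0_core.shape[0]
    x = rng.standard_normal((n, dim))
    x = x - v0_core @ (v0_core.T @ x)   # project onto the complement of the core
    q, _ = np.linalg.qr(x)              # reduced QR: q has shape (n, dim)
    return q

rng = np.random.default_rng(0)
v0_core, _ = np.linalg.qr(rng.standard_normal((32, 5)))   # stand-in for V_{0,K}
v_rand = random_control_subspace(v0_core, 4, rng)
theta_t = rng.standard_normal((6, 32))                    # stand-in for a fine-tuned layer
theta_rand = ablate(theta_t, v_rand)
```

After ablation the layer maps every direction in the removed subspace to zero, so any accuracy change can be attributed to those directions alone.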
Table 3: $\Delta_{acc}^{*}$ represents the accuracy (%) change due to the subspace filtering. The random subspace filtering is averaged across three random seeds. The numbers in parentheses indicate the decrease in accuracy compared to $\theta_t$.

Table [3](https://arxiv.org/html/2606.18627#S3.T3) reports results on 19 tasks using ViT-B/16. Although $V_{t,LBW}$ accounts for only 1.17% of the total dimensions (780 out of 66,816), its removal leads to severe performance degradation across tasks. For example, EMNIST accuracy drops from 99.78% to 17.00% ($\downarrow 82.78\%$), while GTSRB accuracy decreases from 99.92% to 1.91% ($\downarrow 98.01\%$). In contrast, ablating a random control subspace has a negligible effect, with EMNIST and GTSRB retaining 99.75% ($\downarrow 0.03\%$) and 99.86% ($\downarrow 0.06\%$) accuracy, respectively. Additional results are provided in Appendix Sec. [C.2](https://arxiv.org/html/2606.18627#A3.SS2).
On average, the performance degradation caused by removing $V_{t,LBW}$ is 458.05$\times$ larger than that observed for the random control subspace. These results provide strong empirical evidence that the pre-trained components contained in $V_{t,LBW}$ play a disproportionately important role in downstream task performance, supporting our interpretation of $V_{t,LBW}$ as the subspace-level manifestation of LBW dimensions.
##### Interference of LBW Subspaces During Model Merging\.
Having established $V_{t,LBW}$ as the LBW subspace, we analyze how standard model merging disrupts these structures by introducing two geometric metrics for any task pair $(A,B)$. We define the symmetric Sacred Space Similarity ($Sim$) to quantify the geometric overlap between the LBW subspaces of two distinct tasks, and the asymmetric Hidden Interference Ratio ($Inf$) to quantify the direct intrusion of one task’s updates into another task’s parameters:

$$Sim(A,B)=\frac{\|(V_{A,LBW})^{\top}V_{B,LBW}\|_{F}^{2}}{\min(r_{A},r_{B})},\qquad Inf(B\to A)=\frac{\|(V_{B,k})^{\top}V_{A,LBW}\|_{F}^{2}}{k}, \tag{7}$$

where $r_A$ and $r_B$ denote the ranks of the respective LBW subspaces. Mathematically, $Sim(A,B)$ computes the sum of squared cosines of the principal angles between the two subspaces, normalized by the smaller rank, a classical metric heavily utilized in Grassmannian manifold alignment (Fernando et al., [2013](https://arxiv.org/html/2606.18627#bib.bib11)) and task singular vector synchronization (Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)). $Inf$ computes the proportion of Task $B$’s explicit active task vector updates ($V_{B,k}$) that project onto the LBW subspace of Task $A$. Unlike parameter-wise ratios, the $Inf$ metric is purely geometric and scale-invariant, evaluating only the directional collisions between Task $B$’s core task vector updates and Task $A$’s implicit dependency on its LBW subspace. This projection-based interference measurement is closely related to the gradient projection metrics developed in continual learning for preventing catastrophic forgetting (Farajtabar et al., [2020](https://arxiv.org/html/2606.18627#bib.bib10); Saha et al., [2021](https://arxiv.org/html/2606.18627#bib.bib33)).
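Both metrics in Eq. (7) are normalized squared Frobenius norms of a cross-Gram matrix, so they are short to compute. The sketch below assumes the inputs are matrices with orthonormal columns; the function names and the toy bases built from the identity matrix are illustrative only.

```python
import numpy as np

def sacred_space_similarity(v_a_lbw, v_b_lbw):
    """Sim(A, B): squared Frobenius norm of V_A^T V_B divided by the smaller rank."""
    r = min(v_a_lbw.shape[1], v_b_lbw.shape[1])
    return np.linalg.norm(v_a_lbw.T @ v_b_lbw, "fro") ** 2 / r

def hidden_interference_ratio(v_b_k, v_a_lbw):
    """Inf(B -> A): squared Frobenius norm of V_{B,k}^T V_{A,LBW} divided by k."""
    k = v_b_k.shape[1]
    return np.linalg.norm(v_b_k.T @ v_a_lbw, "fro") ** 2 / k

e = np.eye(4)
v_a = e[:, :2]    # span{e1, e2}
v_b = e[:, 1:3]   # span{e2, e3}: shares one direction with v_a
sim = sacred_space_similarity(v_a, v_b)
inf_inside = hidden_interference_ratio(e[:, :1], v_a)    # update lies inside A's LBW subspace
inf_outside = hidden_interference_ratio(e[:, 3:4], v_a)  # update is orthogonal to it
```

Two 2-dimensional subspaces sharing exactly one direction give $Sim=0.5$, and $Inf$ ranges from 0 (no directional collision) to 1 (the active update lies entirely inside the other task's LBW subspace).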
Figure 3: $Sim$ and $Inf$ of the `mlp.c_fc` layer in Block 0 (panels a and b).

The empirical severity of these subspace-level collisions is visible in the heatmaps for Block 0 (Figure [3](https://arxiv.org/html/2606.18627#S3.F3)). On one hand, the $Sim$ heatmap reveals that the overlap between the LBW subspaces of different tasks is far from complete, with values dropping as low as 0.59. This indicates that different tasks depend on heterogeneous subsets of pre-trained parameters, so a universal, task-agnostic “safe zone” does not exist; instead, each task carves out its own indispensable dependency on its LBW subspace. On the other hand, the $Inf$ heatmap uncovers a non-negligible geometric collision: the active updates of task vectors intrude on the LBW subspaces of other tasks. While the diagonal remains strictly 0.00, confirming self-orthogonality, the off-diagonal entries rise sharply, with $Inf$ reaching up to 48%: nearly half of one task's explicit task-vector update directions can fall inside the LBW subspace that another task implicitly relies upon (for instance, SUN397 severely intrudes upon the LBW subspace of EuroSAT). This geometric collision shows that naive model merging corrupts the LBW subspaces of downstream tasks (refer to the Appendix for more results). It motivates a layer-wise mechanism that explicitly shields these LBW subspaces during merging, which is the basis of our design.
## 4 Preserving Anchored Cores in Task-vectors
While Fisher-based methods can in principle locate these vital parameters, their heavy computational cost and need for training data make them impractical for explicitly protecting those parameters in data-free model merging. We therefore need a merging framework that is task-agnostic, data-free, and capable of geometrically shielding these implicit dependencies layer by layer. To this end, we propose PACT. Instead of directly summing task matrices, PACT turns the isolated “Sacred Spaces” ($V_{t,LBW}$, derived in Eqn. [5](https://arxiv.org/html/2606.18627#S3.E5)) into a geometric shield, applying strict orthogonal filtering to neutralize hidden interference before integration.
Algorithm 1: The Process of PACT

1. Input: Pre-trained model $\theta_{0}$, task vectors $\{\Delta_{t}\}_{t=1}^{T}$
2. Output: Filtered task vectors $\{\tilde{\Delta}_{t}\}_{t=1}^{T}$
3. Compute the principal subspace $V_{0,K}$ of $\theta_{0}$.
4. for $t=1,\ldots,T$ do
5. Compute the task subspace $V_{t,k}$.
6. Compute the anchored core subspace $V_{t,LBW}$ using Eqn. [5](https://arxiv.org/html/2606.18627#S3.E5).
7. end for
8. for $t=1,\ldots,T$ do
9. Construct the protected subspace $V_{\mathrm{protect}}^{t}$ using Eqn. [8](https://arxiv.org/html/2606.18627#S4.E8).
10. Filter the task vector using Eqn. [9](https://arxiv.org/html/2606.18627#S4.E9).
11. end for
As defined in Section [3](https://arxiv.org/html/2606.18627#S3), $V_{t,LBW}$ identifies the orthogonal anchor directions that a single task $t$ implicitly relies upon. However, merging $T$ distinct tasks requires a holistic defense mechanism: we must ensure that no task's explicit update interferes with the load-bearing walls of any other task. Simply filtering against each $V_{t,LBW}$ sequentially is mathematically flawed and computationally inefficient: the sacred spaces of different tasks overlap substantially (as reflected by the similarities in Figure [3](https://arxiv.org/html/2606.18627#S3.F3)), and successive projections onto non-orthogonal subspaces do not compose into a single orthogonal projection. To construct a unified, non-redundant protection boundary for a given task $t$, we aggregate the sacred spaces of all other tasks. We then extract a unified orthonormal column basis using QR decomposition, denoted by the operator ${\rm orth}(\cdot)$:
$$V_{\mathrm{protect}}^{t}={\rm orth}\Big(\big[V_{1,LBW}\ \big|\ \dots\ \big|\ V_{t-1,LBW}\ \big|\ V_{t+1,LBW}\ \big|\ \dots\ \big|\ V_{T,LBW}\big]\Big) \tag{8}$$
The resulting matrix $V_{\mathrm{protect}}^{t}$ serves as the *Global Orthogonal Shield* for task $t$. It spans the union of all critical pre-trained dependencies required by the other tasks, while eliminating overlapping basis vectors so that the update space is not over-constrained. Equipped with the global shield $V_{\mathrm{protect}}^{t}$, we can now filter the update matrix of task $t$. We enforce strict non-interference by subtracting the projection of $\Delta_{t}$ onto the protected subspace:
$$\tilde{\Delta}_{t}=\Delta_{t}-\Delta_{t}V_{\mathrm{protect}}^{t}(V_{\mathrm{protect}}^{t})^{\top} \tag{9}$$

where $V_{\mathrm{protect}}^{t}(V_{\mathrm{protect}}^{t})^{\top}$ is the orthogonal projection operator onto the protection space. Applying it to $\Delta_{t}$ extracts the components of $\Delta_{t}$ that would overwrite the anchor directions of other tasks. Subtracting these components removes the detrimental parts and retains only the portions of $\Delta_{t}$ that are harmless to other tasks, which completes the filtering. In this process, the filtering of each task is jointly determined by the anchor directions of all other tasks, embodying a mutual protection mechanism across tasks.
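The shield construction in Eq. (8) and the filtering in Eq. (9) can be sketched for a single layer as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the LBW bases $V_{t,LBW}$ are taken as given inputs (Eq. (5) is not reproduced here), the function names `orth` and `pact_filter` are hypothetical, at least two tasks are assumed, and `orth` uses a thresholded SVD in place of the QR decomposition named in the text so that redundant directions are dropped explicitly.

```python
import numpy as np

def orth(mat, tol=1e-10):
    """Orthonormal basis for the column space of `mat`.

    Assumption: a thresholded SVD stands in for plain QR, so linearly
    dependent (overlapping) directions are removed from the basis.
    """
    u, s, _ = np.linalg.svd(mat, full_matrices=False)
    if s.size == 0 or s[0] == 0:
        return mat[:, :0]
    return u[:, s > tol * s[0]]

def pact_filter(deltas, lbw_bases):
    """Apply Eq. (8) and Eq. (9) to every task matrix of one layer.

    deltas:    list of (m, n) task matrices Delta_t
    lbw_bases: list of (n, r_t) orthonormal bases V_{t,LBW}
    """
    filtered = []
    for t, delta in enumerate(deltas):
        others = [v for j, v in enumerate(lbw_bases) if j != t]
        v_protect = orth(np.concatenate(others, axis=1))          # Eq. (8)
        filtered.append(delta - delta @ v_protect @ v_protect.T)  # Eq. (9)
    return filtered

# Hypothetical toy data: 3 tasks, one 20 x 32 layer, 4 LBW directions per task.
rng = np.random.default_rng(0)
m, n, num_tasks = 20, 32, 3
bases = [np.linalg.qr(rng.standard_normal((n, 4)))[0] for _ in range(num_tasks)]
deltas = [rng.standard_normal((m, n)) for _ in range(num_tasks)]
filtered = pact_filter(deltas, bases)
```

After filtering, each $\tilde{\Delta}_t$ has no component along the LBW directions of any other task, and applying the filter a second time leaves it unchanged because the projection is idempotent.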
The complete computational procedure is presented in Algorithm [1](https://arxiv.org/html/2606.18627#alg1). Our algorithm requires performing $T+1$ SVD decompositions for each layer of the model, each with time complexity $O(mn\cdot\min(m,n))$. Moreover, it requires storing the full $U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$, which imposes substantial memory pressure for large-scale matrices. To address this, we propose an efficient variant of PACT. Specifically, we replace the full SVD decomposition in Eqn. [4](https://arxiv.org/html/2606.18627#S3.E4) with randomized SVD (Halko et al., [2011](https://arxiv.org/html/2606.18627#bib.bib14)). Randomized SVD uses random projections for low-rank approximation, and its ability to efficiently approximate the top-$k$ singular values and singular vectors suits the top-$k$ extraction step of PACT. Meanwhile, the steep singular value spectrum keeps randomized SVD efficient. With randomized SVD, the time complexity drops to $O(mnk)$, and we only need to store the narrow matrix $Q\in\mathbb{R}^{m\times(k+p)}$ (with oversampling parameter $p$) and a few small matrices, which substantially alleviates memory pressure as well. The computational complexity analysis is provided in Appendix Sec. [B](https://arxiv.org/html/2606.18627#A2).
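For readers unfamiliar with randomized SVD, the following is a minimal NumPy sketch of the range-finder scheme in the style of Halko et al. (2011). It is an assumption-laden illustration, not the variant used in the paper: the function name `randomized_svd`, the default oversampling `p=5`, and the two power iterations are choices made here for the example.

```python
import numpy as np

def randomized_svd(a, k, p=5, n_power_iter=2, seed=0):
    """Approximate top-k SVD of an (m, n) matrix; requires k + p <= min(m, n)."""
    rng = np.random.default_rng(seed)
    n = a.shape[1]
    omega = rng.standard_normal((n, k + p))   # Gaussian test matrix
    q, _ = np.linalg.qr(a @ omega)            # orthonormal basis Q, shape (m, k+p)
    for _ in range(n_power_iter):             # power iterations (assumed here)
        q, _ = np.linalg.qr(a.T @ q)          # re-orthonormalise for stability
        q, _ = np.linalg.qr(a @ q)
    b = q.T @ a                               # small (k+p, n) matrix
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_b)[:, :k], s[:k], vt[:k]

# Hypothetical example: an exactly rank-5 matrix is recovered up to round-off.
rng = np.random.default_rng(1)
a = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
u, s, vt = randomized_svd(a, k=5)
s_ref = np.linalg.svd(a, compute_uv=False)[:5]
print(np.max(np.abs(s - s_ref)))
```

Only the narrow matrix `q` and the small matrix `b` are kept in memory, which mirrors the storage argument made above; the dominant cost is the handful of products with `a`, each of order $O(mn(k+p))$.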
##### Theoretical Analysis\.
To rigorously justify the empirical success of PACT, we provide a theoretical analysis demonstrating its preservation of individual task performance.
###### Theorem 1 (Projected First-Order Performance Preservation).
Let $\theta_{j}=\theta_{0}+\Delta_{j}$ denote the task-specific fine-tuned parameters for task $j$, and let $P_{j}=V_{\mathrm{protect}}^{j}$ be the orthonormal basis of the protection space used by PACT. The filtered update is defined as $\tilde{\Delta}_{j}=\Delta_{j}\left(I-P_{j}P_{j}^{\top}\right)$. Assume that $\mathcal{L}_{j}$ is locally $\beta_{j}$-smooth around $\theta_{j}$. Then the degradation caused by the projection filtering satisfies

$$\mathcal{L}_{j}(\theta_{0}+\tilde{\Delta}_{j})-\mathcal{L}_{j}(\theta_{0}+\Delta_{j})\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}\left\|\Delta_{j}P_{j}\right\|_{F}+\frac{\beta_{j}}{2}\left\|\Delta_{j}P_{j}\right\|_{F}^{2}.$$

In particular, the first-order degradation is bounded by

$$\Delta\mathcal{L}_{j}^{(1)}\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}\left\|\Delta_{j}P_{j}\right\|_{F}. \tag{10}$$
Theorem [1](https://arxiv.org/html/2606.18627#Thmtheorem1) shows that the self-performance degradation is not determined by the overall norm of the update, but by the discarded update energy and the task sensitivity inside the protection space. Equivalently, defining

$$\rho_{g,j}=\frac{\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}}{\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}},\qquad\rho_{\Delta,j}=\frac{\left\|\Delta_{j}P_{j}\right\|_{F}}{\left\|\Delta_{j}\right\|_{F}},$$

the first-order term is controlled by

$$\Delta\mathcal{L}_{j}^{(1)}\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left\|\Delta_{j}\right\|_{F}\rho_{g,j}\rho_{\Delta,j}. \tag{11}$$
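The bound in Theorem 1 follows from the standard smoothness inequality. A short sketch of the argument, where the shorthand $G_j := \nabla\mathcal{L}_j(\theta_j)$ is introduced here for readability and $\theta_0+\Delta_j=\theta_j$ is used as the expansion point:

```latex
\begin{aligned}
\mathcal{L}_j(\theta_0+\tilde{\Delta}_j) - \mathcal{L}_j(\theta_0+\Delta_j)
&\le \big\langle G_j,\ \tilde{\Delta}_j-\Delta_j \big\rangle
   + \frac{\beta_j}{2}\,\big\|\tilde{\Delta}_j-\Delta_j\big\|_F^2
   && \text{($\beta_j$-smoothness)} \\
&= -\big\langle G_j P_j,\ \Delta_j P_j \big\rangle
   + \frac{\beta_j}{2}\,\big\|\Delta_j P_j P_j^{\top}\big\|_F^2
   && \text{(since } \tilde{\Delta}_j-\Delta_j = -\Delta_j P_j P_j^{\top}\text{)} \\
&\le \big\|G_j P_j\big\|_F\,\big\|\Delta_j P_j\big\|_F
   + \frac{\beta_j}{2}\,\big\|\Delta_j P_j\big\|_F^2
   && \text{(Cauchy--Schwarz, } P_j^{\top}P_j = I\text{)}
\end{aligned}
```

The second line uses $\langle G_j, \Delta_j P_j P_j^{\top}\rangle = \mathrm{tr}(P_j^{\top} G_j^{\top} \Delta_j P_j) = \langle G_j P_j, \Delta_j P_j\rangle$, and the last line uses that right-multiplication by $P_j^{\top}$ preserves the Frobenius norm when $P_j$ has orthonormal columns. Dividing and multiplying by $\|G_j\|_F\|\Delta_j\|_F$ then gives Eq. (11).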
###### Corollary 1 (Active-Subspace Relaxation).
Let $V_{j,k}$ be the orthonormal basis of the active subspace of task $j$, and define $\cos(\phi_{j})=\left\|V_{j,k}^{\top}P_{j}\right\|_{2}$. Suppose that both the gradient and the task update are approximately concentrated in $V_{j,k}$, namely

$$\left\|\nabla\mathcal{L}_{j}(\theta_{j})\left(I-V_{j,k}V_{j,k}^{\top}\right)\right\|_{F}\leq\epsilon_{g}\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F},$$

and

$$\left\|\Delta_{j}\left(I-V_{j,k}V_{j,k}^{\top}\right)\right\|_{F}\leq\epsilon_{\Delta}\left\|\Delta_{j}\right\|_{F}.$$

Then the degradation caused by PACT's filtering satisfies

$$\begin{aligned}\mathcal{L}_{j}(\theta_{0}+\tilde{\Delta}_{j})-\mathcal{L}_{j}(\theta_{0}+\Delta_{j})\leq{}&\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left\|\Delta_{j}\right\|_{F}\left(\cos(\phi_{j})+\epsilon_{g}\right)\left(\cos(\phi_{j})+\epsilon_{\Delta}\right)\\&+\frac{\beta_{j}}{2}\left\|\Delta_{j}\right\|_{F}^{2}\left(\cos(\phi_{j})+\epsilon_{\Delta}\right)^{2}.\end{aligned}$$

In the exact active-subspace case where $\epsilon_{g}=\epsilon_{\Delta}=0$, the first-order degradation reduces to

$$\Delta\mathcal{L}_{j}^{(1)}\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left\|\Delta_{j}\right\|_{F}\cos^{2}(\phi_{j}). \tag{12}$$
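The key step behind Corollary 1 is the relaxation $\rho_{\Delta,j}\leq\cos(\phi_j)+\epsilon_{\Delta}$ (and likewise for the gradient). The sketch below checks this numerically on synthetic data. Everything in it is hypothetical: the helper names, the dimensions, and the toy update, which is built to be concentrated in $V_{j,k}$ up to small noise; $\epsilon_{\Delta}$ is taken as the tightest value satisfying the corollary's assumption.

```python
import numpy as np

def orthonormal(n, r, rng):
    """Hypothetical test data: an n x r matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q

def relaxation_terms(delta, v_k, p):
    """Return (rho_delta, cos_phi, eps_delta) for one task matrix."""
    n = v_k.shape[0]
    delta_norm = np.linalg.norm(delta, "fro")
    cos_phi = np.linalg.norm(v_k.T @ p, 2)               # spectral norm
    residual = delta @ (np.eye(n) - v_k @ v_k.T)         # part outside V_{j,k}
    eps_delta = np.linalg.norm(residual, "fro") / delta_norm
    rho_delta = np.linalg.norm(delta @ p, "fro") / delta_norm
    return rho_delta, cos_phi, eps_delta

rng = np.random.default_rng(0)
m, n, k, r = 24, 48, 8, 10
v_k = orthonormal(n, k, rng)     # stand-in for the active subspace V_{j,k}
p = orthonormal(n, r, rng)       # stand-in for the protection basis P_j
# Toy update: mostly inside span(V_{j,k}) plus a small perturbation.
delta = rng.standard_normal((m, k)) @ v_k.T + 0.01 * rng.standard_normal((m, n))

rho_delta, cos_phi, eps_delta = relaxation_terms(delta, v_k, p)
print(rho_delta, cos_phi + eps_delta)
```

The printed pair shows the discarded-energy ratio next to its upper bound; the bound holds for any input because it only relies on the triangle inequality and submultiplicativity of the norms involved.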
## 5 Experiments
In this section, we evaluate the model merging performance of PACT under various experimental settings. We examine the effectiveness of PACT when combined with representative model merging approaches, e.g., TA, TSV-M, and Iso-C. Experiments are first conducted on fully fine-tuned vision models, followed by tests on LoRA fine-tuned vision models. We also investigate the effect of replacing full SVD with randomized SVD. For more details and additional experiments (e.g., hyper-parameter sensitivity analysis), please refer to Appendix Sec. [D](https://arxiv.org/html/2606.18627#A4) and Sec. [E](https://arxiv.org/html/2606.18627#A5), respectively.
### 5.1 Fully Fine-tuned Vision Models
Table 4: PACT-Iso-C achieves SOTA performance for all backbones in all evaluated scenarios. We present average absolute accuracy and average normalized accuracy (in brackets) in %. The best method is in bold and the second-best is underlined.

We evaluate our approaches over sets of 8, 14, and 20 datasets, following Marczak et al. ([2025](https://arxiv.org/html/2606.18627#bib.bib25)). We provide the details of the datasets in the Appendix. We consider three variants of CLIP (Radford et al., [2021](https://arxiv.org/html/2606.18627#bib.bib31)) with ViT-B/32, ViT-B/16, and ViT-L/14 as visual encoders (Dosovitskiy et al., [2020](https://arxiv.org/html/2606.18627#bib.bib9)). We use the checkpoints fine-tuned on the tasks above, provided in Wang et al. ([2024](https://arxiv.org/html/2606.18627#bib.bib42)). The hyperparameters $(K, k)$ are set to $(15, 8)$.
We compare our approaches with the following model merging methods: weight averaging (Wortsman et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib45)), Task Arithmetic (Ilharco et al., [2022a](https://arxiv.org/html/2606.18627#bib.bib17)), TIES-Merging (Yadav et al., [2023](https://arxiv.org/html/2606.18627#bib.bib50)), Consensus TA (Wang et al., [2024](https://arxiv.org/html/2606.18627#bib.bib42)), TSV-M (Gargiulo et al., [2025](https://arxiv.org/html/2606.18627#bib.bib12)), Iso-C, and Iso-CTS (Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)). We include the results of the zero-shot model and the fine-tuned models, which serve as lower and upper bounds, respectively. We compare the results based on absolute and normalized accuracy, following standard practice (Wang et al., [2024](https://arxiv.org/html/2606.18627#bib.bib42); Gargiulo et al., [2025](https://arxiv.org/html/2606.18627#bib.bib12); Marczak et al., [2025](https://arxiv.org/html/2606.18627#bib.bib25)).
Table [4](https://arxiv.org/html/2606.18627#S5.T4) presents the performance of multi-task model merging. PACT consistently improves all three base methods (TA, TSV-M, and Iso-C) across all settings. Notably, PACT-Iso-C achieves SOTA results in all 9 settings, and its gain over Iso-C becomes more pronounced as the number of tasks increases, surpassing the protection offered by Iso-CTS in the common space. PACT-TSV-M also attains SOTA performance on ViT-B/32. Moreover, PACT-TA delivers highly competitive results, with all three experiments on ViT-L/14 approaching the SOTA level. These improvements further corroborate the existence of the LBW effect within pre-trained parameters.
### 5.2 LoRA-adapted Vision Models
To evaluate our approaches in the low-rank adaptation scenario, we follow the evaluation protocol of KnOTS (Stoica et al., [2025](https://arxiv.org/html/2606.18627#bib.bib36)), a recent SOTA method for merging LoRA fine-tuned models. We use the codebase and checkpoints provided by KnOTS: ViT-B/32 and ViT-L/14 fine-tuned with rank-16 LoRA (Hu et al., [2022](https://arxiv.org/html/2606.18627#bib.bib16)) on 8 vision tasks. To adapt our methodology to the low-rank regime, rather than operating on the reconstructed dense matrix $\Delta W_{t}=B_{t}A_{t}$, we simply perform SVD on the small matrix $A_{t}$. This is justified because the two matrices share the same row space whenever $B_{t}$ has full column rank, so the task subspace can be extracted at negligible cost. We compare our approaches with Iso-C, Iso-CTS, TIES, and DARE-TIES (Yu et al., [2024](https://arxiv.org/html/2606.18627#bib.bib52)), combined with KnOTS or not, as well as TA. The hyperparameters $(K, k)$ are set to $(7, 4)$.
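The row-space argument for LoRA factors can be checked directly. The sketch below uses randomly generated, hypothetical factors (not the KnOTS checkpoints) and assumes $B_t$ has full column rank, which holds almost surely for Gaussian entries; the helper `top_right_subspace` is a name introduced for this example.

```python
import numpy as np

def top_right_subspace(mat, k):
    """Top-k right singular vectors of `mat`, returned as an (n, k) basis."""
    _, _, vt = np.linalg.svd(mat, full_matrices=False)
    return vt[:k].T

rng = np.random.default_rng(0)
m, n, r = 96, 64, 16
b = rng.standard_normal((m, r))   # hypothetical LoRA factor B_t
a = rng.standard_normal((r, n))   # hypothetical LoRA factor A_t

v_dense = top_right_subspace(b @ a, r)   # SVD of the dense m x n product
v_small = top_right_subspace(a, r)       # SVD of the small r x n factor

# Compare the subspaces through their orthogonal projectors, since the
# individual basis vectors may differ by rotation or sign.
proj_dense = v_dense @ v_dense.T
proj_small = v_small @ v_small.T
print(np.max(np.abs(proj_dense - proj_small)))
```

Note that this sketch only verifies the full rank-$r$ row space. For $k<r$, the leading $k$ right singular directions of $A_t$ and of $B_tA_t$ need not coincide, because $B_t$ reweights the directions.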
As presented in Table [5](https://arxiv.org/html/2606.18627#S5.T5), PACT-Iso-C surpasses all methods in both experiments, achieving SOTA performance, which demonstrates the versatility of our merging approach. Meanwhile, PACT-TA also achieves performance close to KnOTS, a method specifically designed for LoRA fine-tuning, indicating that the LBW effect also exists in LoRA fine-tuning, even though the pre-trained parameters remain untouched during the fine-tuning process.
Table 5: Normalized per-task average accuracy. We merge 8 models fine-tuned with LoRA following (Stoica et al., [2025](https://arxiv.org/html/2606.18627#bib.bib36)). PACT-Iso-C achieves SOTA performance in all evaluated scenarios.
### 5.3 Efficient Substitute
We evaluate the effect of replacing the full SVD in PACT with randomized SVD (RSVD) (Halko et al., [2011](https://arxiv.org/html/2606.18627#bib.bib14)), as reported in Table [6](https://arxiv.org/html/2606.18627#S5.T6). Under exactly the same hyperparameters $(K, k, \alpha)$, RSVD achieves performance on par with full SVD, and in one-third of the experiments it even surpasses the full SVD results, which is counterintuitive. We attribute this to the implicit denoising effect of random projection (Halko et al., [2011](https://arxiv.org/html/2606.18627#bib.bib14)). Full SVD faithfully extracts all directions, including potential overfitting noise hidden in $\Delta_{t}$; in contrast, the random Gaussian projection used in RSVD acts as a low-pass filter. It captures the main load-bearing wall components while smoothing out high-frequency noise. Thus, RSVD provides a “free lunch” for PACT: as a structural regularizer, it reduces the complexity by several orders of magnitude while also improving the generalization ability of the merged model. We summarize and compare the computational complexity of PACT and its variant against the baselines in Appendix Sec. [B](https://arxiv.org/html/2606.18627#A2).
Table 6: Results of replacing SVD with RSVD. We present average absolute accuracy and average normalized accuracy (in brackets) in %. Performance improvements are shown in bold.
## 6 Conclusion and Future Work
In this work, we have addressed a fundamentally overlooked assumption in existing task-vector-based model merging paradigms, namely that pre-trained base parameters remain inert and task performance depends solely on task-vector modifications. Through microscopic Fisher information analysis and macroscopic SVD subspace projections, we have identified and causally validated the existence of LBW dimensions. These low-dimensional, high-energy pre-trained structures remain untouched during downstream fine-tuning, yet are essential for preserving task-specific inference capabilities. Our causal ablation experiments demonstrate that a mere 1.17% parametric disruption to these localized pre-trained core structures leads to a catastrophic collapse of downstream performance, establishing their indispensable structural role.
To mitigate the severe subspace collisions on these LBW dimensions during unconstrained merging, we proposed PACT. By constructing a global orthogonal shield, PACT filters out of each task vector the components that would intrude on the load-bearing subspaces of the other tasks. This data-free geometric approach is highly modular and compatible with existing task-vector-based merging methods. Empirical evaluations across 20 diverse vision tasks, spanning different backbones (ViT-B and ViT-L) and adaptation paradigms (full fine-tuning and LoRA), demonstrate that PACT-enhanced merging algorithms consistently outperform baseline methods. Furthermore, our efficient variant based on randomized SVD avoids the cubic complexity of full SVD, supporting scalability to large-scale merging scenarios.
Our findings offer a new, geometry-centric perspective on the mechanics of deep neural networks, suggesting that model merging is not merely about resolving conflicts among task updates, but also about actively respecting and shielding the critical pre-trained foundations. Moving forward, several promising research directions emerge. First, extending the PACT framework to large language models (LLMs) and investigating the capacity limits of protected subspaces under hundreds of tasks remains a critical next step. Second, exploring non-linear projection operators and adaptive, weight-importance-aware dimensionalities for $V_{t,LBW}$ could further refine the precision of the orthogonal shield. We hope this work inspires deeper exploration into the geometric structures of pre-trained models and paves the way for more robust multi-task model synthesis. We also summarize our limitations in Appendix Sec. [F](https://arxiv.org/html/2606.18627#A6).
## References
- Ba et al\. \(2016\)Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton\.Layer normalization\.*arXiv preprint arXiv:1607\.06450*, 2016\.
- Bossard et al\. \(2014\)Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool\.Food\-101–mining discriminative components with random forests\.In*European conference on computer vision*, pp\. 446–461\. Springer, 2014\.
- Carion et al\. \(2020\)Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko\.End\-to\-end object detection with transformers\.In*European conference on computer vision*, pp\. 213–229\. Springer, 2020\.
- Caron et al\. \(2021\)Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin\.Emerging properties in self\-supervised vision transformers\.In*Proceedings of the IEEE/CVF international conference on computer vision*, pp\. 9650–9660, 2021\.
- Cheng et al\. \(2017\)Gong Cheng, Junwei Han, and Xiaoqiang Lu\.Remote sensing image scene classification: Benchmark and state of the art\.*Proceedings of the IEEE*, 105\(10\):1865–1883, 2017\.
- Cimpoi et al\. \(2014\)Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi\.Describing textures in the wild\.In*Proceedings of the IEEE conference on computer vision and pattern recognition*, pp\. 3606–3613, 2014\.
- Coates et al\. \(2011\)Adam Coates, Andrew Ng, and Honglak Lee\.An analysis of single\-layer networks in unsupervised feature learning\.In*Proceedings of the fourteenth international conference on artificial intelligence and statistics*, pp\. 215–223\. JMLR Workshop and Conference Proceedings, 2011\.
- Cohen et al\. \(2017\)Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik\.Emnist: Extending mnist to handwritten letters\.In*2017 international joint conference on neural networks \(IJCNN\)*, pp\. 2921–2926\. IEEE, 2017\.
- Dosovitskiy et al\. \(2020\)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al\.An image is worth 16x16 words: Transformers for image recognition at scale\.*arXiv preprint arXiv:2010\.11929*, 2020\.
- Farajtabar et al\. \(2020\)Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li\.Orthogonal gradient descent for continual learning\.In*International conference on artificial intelligence and statistics*, pp\. 3762–3773\. PMLR, 2020\.
- Fernando et al\. \(2013\)Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars\.Unsupervised visual domain adaptation using subspace alignment\.In*Proceedings of the IEEE international conference on computer vision*, pp\. 2960–2967, 2013\.
- Gargiulo et al\. \(2025\)Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola\.Task singular vectors: Reducing task interference in model merging\.In*Proceedings of the Computer Vision and Pattern Recognition Conference*, pp\. 18695–18705, 2025\.
- Goodfellow et al\. \(2013\)Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong\-Hyun Lee, et al\.Challenges in representation learning: A report on three machine learning contests\.In*International conference on neural information processing*, pp\. 117–124\. Springer, 2013\.
- Halko et al\. \(2011\)Nathan Halko, Per\-Gunnar Martinsson, and Joel A Tropp\.Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions\.*SIAM review*, 53\(2\):217–288, 2011\.
- Helber et al\. \(2019\)Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth\.Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification\.*IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing*, 12\(7\):2217–2226, 2019\.
- Hu et al\. \(2022\)Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al\.Lora: Low\-rank adaptation of large language models\.In*International Conference on Learning Representations*, 2022\.
- Ilharco et al\. \(2022a\)Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi\.Editing models with task arithmetic\.*arXiv preprint arXiv:2212\.04089*, 2022a\.
- Ilharco et al\. \(2022b\)Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, and Ludwig Schmidt\.Patching open\-vocabulary models by interpolating weights\.*Advances in Neural Information Processing Systems*, 35:29262–29277, 2022b\.
- Kirkpatrick et al\. \(2017\)James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska\-Barwinska, et al\.Overcoming catastrophic forgetting in neural networks\.*Proceedings of the national academy of sciences*, 114\(13\):3521–3526, 2017\.
- Krause et al\. \(2013\)Jonathan Krause, Michael Stark, Jia Deng, and Li Fei\-Fei\.3d object representations for fine\-grained categorization\.In*Proceedings of the IEEE international conference on computer vision workshops*, pp\. 554–561, 2013\.
- Krizhevsky et al\. \(2009\)Alex Krizhevsky, Geoffrey Hinton, et al\.Learning multiple layers of features from tiny images\.2009\.
- LeCun et al\. \(1998\)Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner\.Gradient\-based learning applied to document recognition\.*Proceedings of the IEEE*, 86\(11\):2278–2324, 1998\.
- Li et al\. \(2025\)Lu Li, Tianyu Zhang, Zhiqi Bu, Suyuchen Wang, Huan He, Jie Fu, Yonghui Wu, Jiang Bian, Yong Chen, and Yoshua Bengio\.Map: Low\-compute model merging with amortized pareto fronts via quadratic approximation\.In*International Conference on Learning Representations*, volume 2025, pp\. 65032–65064, 2025\.
- Li et al\. \(2023\)Weishi Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen\.Deep model fusion: A survey\.*arXiv preprint arXiv:2309\.15698*, 2023\.
- Marczak et al\. \(2025\)Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D Bagdanov, and Joost Van De Weijer\.No task left behind: Isotropic model merging with common and task\-specific subspaces\.*arXiv preprint arXiv:2502\.04959*, 2025\.
- Netzer et al\. \(2011\)Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al\.Reading digits in natural images with unsupervised feature learning\.In*NIPS workshop on deep learning and unsupervised feature learning*, volume 2011, pp\. 4\. Granada, 2011\.
- Nguyen et al\. \(2025\)The\-Hai Nguyen, Dang Huu\-Tien, Takeshi Suzuki, and Le\-Minh Nguyen\.Regmean\+\+: Enhancing effectiveness and generalization of regression mean for model merging\.*arXiv preprint arXiv:2508\.03121*, 2025\.
- Nilsback & Zisserman \(2008\)Maria\-Elena Nilsback and Andrew Zisserman\.Automated flower classification over a large number of classes\.In*2008 Sixth Indian conference on computer vision, graphics & image processing*, pp\. 722–729\. IEEE, 2008\.
- Parkhi \(2012\)Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar\.Cats and dogs\.In*2012 IEEE Conference on Computer Vision and Pattern Recognition*, pp\. 3498–3505, 2012\.
- Radford et al\. \(2019\)Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al\.Language models are unsupervised multitask learners\.*OpenAI blog*, 1\(8\):9, 2019\.
- Radford et al\. \(2021\)Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al\.Learning transferable visual models from natural language supervision\.In*International conference on machine learning*, pp\. 8748–8763\. PMLR, 2021\.
- Ruan et al\. \(2025\)Wei Ruan, Tianze Yang, Yifan Zhou, Tianming Liu, and Jin Lu\.From task\-specific models to unified systems: A review of model merging approaches\.*arXiv preprint arXiv:2503\.08998*, 2025\.
- Saha et al\. \(2021\)Gobinda Saha, Isha Garg, and Kaushik Roy\.Gradient projection memory for continual learning\.*arXiv preprint arXiv:2103\.09762*, 2021\.
- Socher et al\. \(2013\)Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts\.Recursive deep models for semantic compositionality over a sentiment treebank\.In*Proceedings of the 2013 conference on empirical methods in natural language processing*, pp\. 1631–1642, 2013\.
- Stallkamp et al\. \(2011\)Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel\.The german traffic sign recognition benchmark: a multi\-class classification competition\.In*The 2011 international joint conference on neural networks*, pp\. 1453–1460\. IEEE, 2011\.
- Stoica et al\. \(2025\)George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman\.Model merging with svd to tie the knots\.In*International Conference on Learning Representations*, volume 2025, pp\. 4501–4519, 2025\.
- Subramanian et al\. \(2005\)Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al\.Gene set enrichment analysis: a knowledge\-based approach for interpreting genome\-wide expression profiles\.*Proceedings of the national academy of sciences*, 102\(43\):15545–15550, 2005\.
- Sun et al\. \(2025\) Wenju Sun, Qingyong Li, Wen Wang, Yangliao Geng, and Boyang Li\. Task arithmetic in trust region: A training\-free model merging approach to navigate knowledge conflicts\. In *Proceedings of the 33rd ACM International Conference on Multimedia*, pp\. 5178–5187, 2025\.
- Tang et al\. \(2023\) Anke Tang, Li Shen, Yong Luo, Liang Ding, Han Hu, Bo Du, and Dacheng Tao\. Concrete subspace learning based interference elimination for multi\-task model fusion\. *arXiv preprint arXiv:2312\.06173*, 2023\.
- Vasudevan & Ramakrishna \(2017\) Vinita Vasudevan and M Ramakrishna\. A hierarchical singular value decomposition algorithm for low rank matrices\. *arXiv preprint arXiv:1710\.02812*, 2017\.
- Veeling et al\. \(2018\) Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling\. Rotation equivariant cnns for digital pathology\. In *International Conference on Medical image computing and computer\-assisted intervention*, pp\. 210–218\. Springer, 2018\.
- Wang et al\. \(2024\) Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz\-Jimenez, François Fleuret, and Pascal Frossard\. Localizing task information for improved model merging and compression\. *arXiv preprint arXiv:2405\.07813*, 2024\.
- Wei et al\. \(2024\) Yongxian Wei, Zixuan Hu, Li Shen, Zhenyi Wang, Yu Li, Chun Yuan, and Dacheng Tao\. Task groupings regularization: Data\-free meta\-learning with heterogeneous pre\-trained models\. *arXiv preprint arXiv:2405\.16560*, 2024\.
- Wei et al\. \(2025\) Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, and Xiaochun Cao\. Modeling multi\-task model merging as adaptive projective gradient descent\. *arXiv preprint arXiv:2501\.01230*, 2025\.
- Wortsman et al\. \(2022a\) Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo\-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al\. Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time\. In *International conference on machine learning*, pp\. 23965–23998\. PMLR, 2022a\.
- Wortsman et al\. \(2022b\) Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al\. Robust fine\-tuning of zero\-shot models\. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pp\. 7959–7971, 2022b\.
- Xiao et al\. \(2017\) Han Xiao, Kashif Rasul, and Roland Vollgraf\. Fashion\-mnist: a novel image dataset for benchmarking machine learning algorithms\. *arXiv preprint arXiv:1708\.07747*, 2017\.
- Xiao et al\. \(2016\) Jianxiong Xiao, Krista A Ehinger, James Hays, Antonio Torralba, and Aude Oliva\. Sun database: Exploring a large collection of scene categories\. *International Journal of Computer Vision*, 119\(1\):3–22, 2016\.
- Xiong et al\. \(2024\) Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, and Ruifeng Xu\. Multi\-task model merging via adaptive weight disentanglement\. *arXiv preprint arXiv:2411\.18729*, 2024\.
- Yadav et al\. \(2023\) Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal\. Ties\-merging: Resolving interference when merging models\. *Advances in neural information processing systems*, 36:7093–7115, 2023\.
- Yang et al\. \(2026\) Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao\. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities\. *ACM Computing Surveys*, 58\(8\):1–41, 2026\.
- Yu et al\. \(2024\) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li\. Language models are super mario: Absorbing abilities from homologous models as a free lunch\. In *Forty\-first International Conference on Machine Learning*, 2024\.
- Zhou et al\. \(2024\) Yuyan Zhou, Liang Song, Bingning Wang, and Weipeng Chen\. Metagpt: Merging large language models using model exclusive task arithmetic\. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pp\. 1711–1724, 2024\.
## Appendix A Proofs of Theoretical Properties for PACT
### A\.1 Proof of Theorem [1](https://arxiv.org/html/2606.18627#Thmtheorem1)
###### Proof\.
Let

$$D_{j}=\Delta_{j}-\tilde{\Delta}_{j}=\Delta_{j}P_{j}P_{j}^{\top} \tag{13}$$

be the component discarded by PACT\. Then

$$\theta_{0}+\tilde{\Delta}_{j}=\theta_{j}-D_{j}. \tag{14}$$

Since $\mathcal{L}_{j}$ is locally $\beta_{j}$\-smooth around $\theta_{j}$, we have

$$\mathcal{L}_{j}(\theta_{j}-D_{j})\leq\mathcal{L}_{j}(\theta_{j})-\left\langle\nabla\mathcal{L}_{j}(\theta_{j}),D_{j}\right\rangle+\frac{\beta_{j}}{2}\|D_{j}\|_{F}^{2}. \tag{15}$$

Substituting $D_{j}=\Delta_{j}P_{j}P_{j}^{\top}$ gives

$$\begin{aligned}
-\left\langle\nabla\mathcal{L}_{j}(\theta_{j}),D_{j}\right\rangle
&\leq\left|\left\langle\nabla\mathcal{L}_{j}(\theta_{j}),\Delta_{j}P_{j}P_{j}^{\top}\right\rangle\right| \\
&=\left|\left\langle\nabla\mathcal{L}_{j}(\theta_{j})P_{j},\Delta_{j}P_{j}\right\rangle\right| \\
&\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}\left\|\Delta_{j}P_{j}\right\|_{F},
\end{aligned} \tag{16}$$

where the last step follows from Cauchy–Schwarz\. Moreover, since $P_{j}$ has orthonormal columns,

$$\|D_{j}\|_{F}=\|\Delta_{j}P_{j}P_{j}^{\top}\|_{F}=\|\Delta_{j}P_{j}\|_{F}. \tag{17}$$

Combining the above inequalities yields

$$\begin{aligned}
\mathcal{L}_{j}(\theta_{0}+\tilde{\Delta}_{j})-\mathcal{L}_{j}(\theta_{0}+\Delta_{j})
&=\mathcal{L}_{j}(\theta_{j}-D_{j})-\mathcal{L}_{j}(\theta_{j}) \\
&\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}\left\|\Delta_{j}P_{j}\right\|_{F}+\frac{\beta_{j}}{2}\left\|\Delta_{j}P_{j}\right\|_{F}^{2}.
\end{aligned} \tag{18}$$

∎
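As a numerical sanity check of the bound in Eqn. (18), the sketch below evaluates both sides for a hypothetical quadratic loss, which is $\beta$-smooth by construction. The loss, the matrix sizes, and the random orthonormal basis `P` are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, beta = 32, 4, 2.0

# Hypothetical loss L(theta) = (beta / 2) * ||theta - theta_star||_F^2.
# It is beta-smooth everywhere, with gradient beta * (theta - theta_star).
theta_star = rng.standard_normal((n, n))

def loss(theta):
    return 0.5 * beta * np.linalg.norm(theta - theta_star) ** 2

def grad(theta):
    return beta * (theta - theta_star)

theta_0 = rng.standard_normal((n, n))
delta = 0.1 * rng.standard_normal((n, n))   # task vector Delta_j
theta_j = theta_0 + delta

# P has orthonormal columns (an arbitrary basis standing in for P_j).
P, _ = np.linalg.qr(rng.standard_normal((n, r)))

D = delta @ P @ P.T          # discarded component, Eqn. (13)
delta_tilde = delta - D      # filtered task vector

# Left-hand side and right-hand side of Eqn. (18).
lhs = loss(theta_0 + delta_tilde) - loss(theta_j)
g_proj = np.linalg.norm(grad(theta_j) @ P)   # ||grad L_j(theta_j) P_j||_F
d_proj = np.linalg.norm(delta @ P)           # ||Delta_j P_j||_F
rhs = g_proj * d_proj + 0.5 * beta * d_proj ** 2
```

For a quadratic loss the smoothness inequality in Eqn. (15) holds with equality, so any slack between `lhs` and `rhs` here comes only from the Cauchy–Schwarz step.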
### A\.2 Proof of Corollary [1](https://arxiv.org/html/2606.18627#Thmcorollary1)
###### Proof\.
Let $P_{j}=V_{j,k}V_{j,k}^{\top}$\. By the triangle inequality,

$$\begin{aligned}
\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}\right\|_{F}
&\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})P_{j}P_{j}\right\|_{F}+\left\|\nabla\mathcal{L}_{j}(\theta_{j})(I-P_{j})P_{j}\right\|_{F} \\
&\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left\|V_{j,k}^{\top}P_{j}\right\|_{2}+\epsilon_{g}\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F} \\
&=\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left(\cos(\phi_{j})+\epsilon_{g}\right).
\end{aligned} \tag{19}$$

Similarly,

$$\left\|\Delta_{j}P_{j}\right\|_{F}\leq\left\|\Delta_{j}\right\|_{F}\left(\cos(\phi_{j})+\epsilon_{\Delta}\right). \tag{20}$$

Substituting these two inequalities into Theorem [1](https://arxiv.org/html/2606.18627#Thmtheorem1) gives Eqn\. [1](https://arxiv.org/html/2606.18627#S4.Ex5)\. When $\epsilon_{g}=\epsilon_{\Delta}=0$, the first\-order term becomes

$$\Delta\mathcal{L}_{j}^{(1)}\leq\left\|\nabla\mathcal{L}_{j}(\theta_{j})\right\|_{F}\left\|\Delta_{j}\right\|_{F}\cos^{2}(\phi_{j}). \tag{21}$$

∎
### A\.3 Derivation and Proof of the Ablation Approximation
To justify the ablation approximation presented in Eqn\. [6](https://arxiv.org/html/2606.18627#S3.E6), we present a unified mathematical derivation based on the construction of the LBW subspace\. Recall that the task\-specific model parameter is decomposed as $\theta_{t}=\theta_{0}+\Delta_{t}$, and the ablation operation is defined as:

$$\theta_{abla}=\theta_{t}(I-V_{t,LBW}V_{t,LBW}^{\top}) \tag{22}$$

where the LBW subspace basis $V_{t,LBW}\in\mathbb{R}^{n\times d_{LBW}}$ is constructed via Eqn\. [5](https://arxiv.org/html/2606.18627#S3.E5)\.

By this construction, $V_{t,LBW}$ lies within the column space of $(I-V_{t,k}V_{t,k}^{\top})V_{0,K}$\. Consequently, we can express $V_{t,LBW}=(I-V_{t,k}V_{t,k}^{\top})V_{0,K}R^{-1}$ for some invertible matrix $R$\. Left\-multiplying this relation by $V_{t,k}^{\top}$ yields:

$$\begin{aligned}
V_{t,k}^{\top}V_{t,LBW}
&=V_{t,k}^{\top}(I-V_{t,k}V_{t,k}^{\top})V_{0,K}R^{-1} \\
&=(V_{t,k}^{\top}-V_{t,k}^{\top}V_{t,k}V_{t,k}^{\top})V_{0,K}R^{-1}
\end{aligned} \tag{23}$$

Substituting the orthonormality condition $V_{t,k}^{\top}V_{t,k}=I$, the expression simplifies to:

$$\begin{aligned}
V_{t,k}^{\top}V_{t,LBW}
&=(V_{t,k}^{\top}-I\cdot V_{t,k}^{\top})V_{0,K}R^{-1} \\
&=0\cdot V_{0,K}R^{-1}=0
\end{aligned} \tag{24}$$

This establishes that the constructed LBW subspace $V_{t,LBW}$ is orthogonal to the principal task\-specific subspace $V_{t,k}$\.

Next, we assume that the task\-specific parameter update $\Delta_{t}$ primarily aligns with the subspace spanned by its principal components $V_{t,k}$\. This relationship can be formally expressed as the approximation $\Delta_{t}\approx\Delta_{t}V_{t,k}V_{t,k}^{\top}$\. Projecting the update $\Delta_{t}$ onto the LBW subspace then yields:

$$\begin{aligned}
\Delta_{t}V_{t,LBW}
&\approx\left(\Delta_{t}V_{t,k}V_{t,k}^{\top}\right)V_{t,LBW} \\
&=\Delta_{t}V_{t,k}\left(V_{t,k}^{\top}V_{t,LBW}\right)
\end{aligned} \tag{25}$$

Applying the strict orthogonality $V_{t,k}^{\top}V_{t,LBW}=0$ derived above, we immediately obtain:

$$\Delta_{t}V_{t,LBW}\approx 0\quad\text{and thus}\quad\Delta_{t}V_{t,LBW}V_{t,LBW}^{\top}\approx 0 \tag{26}$$

Finally, by substituting the decomposition $\theta_{t}=\theta_{0}+\Delta_{t}$ into the ablation definition in Eqn\. [22](https://arxiv.org/html/2606.18627#A1.E22) and expanding the terms, we have:

$$\begin{aligned}
\theta_{abla}
&=(\theta_{0}+\Delta_{t})(I-V_{t,LBW}V_{t,LBW}^{\top}) \\
&=\theta_{0}(I-V_{t,LBW}V_{t,LBW}^{\top})+\Delta_{t}-\Delta_{t}V_{t,LBW}V_{t,LBW}^{\top}
\end{aligned} \tag{27}$$

Using the approximation $\Delta_{t}V_{t,LBW}V_{t,LBW}^{\top}\approx 0$ from Eqn\. [26](https://arxiv.org/html/2606.18627#A1.E26), the expression simplifies to:

$$\theta_{abla}\approx\theta_{0}(I-V_{t,LBW}V_{t,LBW}^{\top})+\Delta_{t} \tag{28}$$

This completes the derivation, demonstrating that the ablation operation selectively filters out pre\-trained components within the LBW subspace while leaving the task\-specific updates largely intact\.
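The derivation above can be checked numerically. The following is a minimal sketch under stated assumptions: the LBW basis is obtained by a reduced QR factorization of $(I-V_{t,k}V_{t,k}^{\top})V_{0,K}$, which matches the form $V_{t,LBW}=(I-V_{t,k}V_{t,k}^{\top})V_{0,K}R^{-1}$ used in the text but may differ in detail from the construction in Eqn. 5; the task update is made exactly rank-$k$ so that $\Delta_t=\Delta_tV_{t,k}V_{t,k}^{\top}$ holds exactly; and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, K = 64, 4, 8

theta_0 = rng.standard_normal((n, n))
# Exactly rank-k task update, so Delta_t = Delta_t V_{t,k} V_{t,k}^T exactly.
delta = 0.1 * rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
theta_t = theta_0 + delta

# Top-k right singular vectors of the task update (columns of V_tk).
_, _, Vt_delta = np.linalg.svd(delta, full_matrices=False)
V_tk = Vt_delta[:k].T
# Top-K right singular vectors of the pre-trained weights (columns of V_0K).
_, _, Vt_0 = np.linalg.svd(theta_0, full_matrices=False)
V_0K = Vt_0[:K].T

# Remove the task subspace from the pre-trained core, then orthonormalize.
# QR gives residual = V_lbw @ R, i.e. V_lbw = residual @ R^{-1}.
residual = (np.eye(n) - V_tk @ V_tk.T) @ V_0K
V_lbw, R = np.linalg.qr(residual)

# Ablation (Eqn. 22) versus its approximation (Eqn. 28).
proj_out = np.eye(n) - V_lbw @ V_lbw.T
theta_abla = theta_t @ proj_out
theta_abla_approx = theta_0 @ proj_out + delta
```

With a task update that is only approximately low-rank, the two sides of Eqn. (28) differ by $\Delta_tV_{t,LBW}V_{t,LBW}^{\top}$, whose norm can be measured directly to see how well the approximation holds for a given layer.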
## Appendix B Computational complexity analysis
In this section, we analyze the computational complexity of our proposed PACT\-Iso\-C and PACT\-TSV\-M merging algorithms, alongside their efficient RSVD variants \(Halko et al\., [2011](https://arxiv.org/html/2606.18627#bib.bib14)\)\. We compare them with the baseline Iso\-C and the SOTA subspace merging methods, TSV\-M \(Gargiulo et al\., [2025](https://arxiv.org/html/2606.18627#bib.bib12)\) and Iso\-CTS \(Marczak et al\., [2025](https://arxiv.org/html/2606.18627#bib.bib25)\)\.
Following the conventions established in previous works, let $\Delta_{t}\in\mathbb{R}^{n\times n}$, and let $T$ and $L$ be the number of tasks and network layers, respectively\. For simplicity, assume that each layer consists of a single square $n\times n$ matrix\.
In our analysis, we focus on the number of Singular Value Decompositions \(SVDs\) performed by each algorithm, as this is by far the most costly component of each pipeline\. The complexity of a single full SVD on $\Delta_{t}\in\mathbb{R}^{n\times n}$ is $\mathcal{O}(n^{3})$ \(Vasudevan & Ramakrishna, [2017](https://arxiv.org/html/2606.18627#bib.bib40)\)\. Below, we detail the total computational complexity for each merging method and summarize them in Table [7](https://arxiv.org/html/2606.18627#A2.T7):
- Iso\-C performs a single SVD on the aggregated $\Delta_{TA}$ per layer, with total complexity:

  $$\mathcal{O}(\text{Iso-C})=\mathcal{O}(Ln^{3}) \tag{29}$$

- TSV\-M performs:
  - $T$ SVDs per layer, one on each task matrix, yielding $\mathcal{O}(TLn^{3})$\.
  - Two additional SVDs per layer for subspace alignment, yielding $\mathcal{O}(2Ln^{3})$\.

  This yields the total complexity:

  $$\mathcal{O}(\text{TSV-M})=\mathcal{O}(TLn^{3}+2Ln^{3})=\mathcal{O}\big((T+2)Ln^{3}\big)=\mathcal{O}(TLn^{3}) \tag{30}$$

- Iso\-CTS performs:
  - One SVD on $\Delta_{TA}$ per layer, with complexity $\mathcal{O}(Ln^{3})$\.
  - One SVD on each $\Delta_{t}$, for all $T$ tasks per layer, with complexity $\mathcal{O}(TLn^{3})$\.
  - Two SVDs on two constructed matrices $U^{*},V^{*}\in\mathbb{R}^{n\times n}$ per layer, yielding $\mathcal{O}(2Ln^{3})$\.

  Therefore, the total complexity equals:

  $$\mathcal{O}(\text{Iso-CTS})=\mathcal{O}(Ln^{3}+TLn^{3}+2Ln^{3})=\mathcal{O}\big((T+3)Ln^{3}\big)=\mathcal{O}(TLn^{3}) \tag{31}$$

- PACT\-Iso\-C \(Ours\) integrates our orthogonal projection filtering with the Iso\-C framework\. It performs:
  - $(T+1)$ SVDs for the PACT filtering phase \(one on $\theta_{0}$ and $T$ on task matrices $\Delta_{t}$\): $\mathcal{O}\big((T+1)Ln^{3}\big)$\.
  - One SVD on the aggregated filtered matrix $\sum\tilde{\Delta}_{j}$ for the final Iso\-C merging: $\mathcal{O}(Ln^{3})$\.

  This yields the total complexity:

  $$\mathcal{O}(\text{PACT-Iso-C})=\mathcal{O}\big((T+1)Ln^{3}+Ln^{3}\big)=\mathcal{O}\big((T+2)Ln^{3}\big)=\mathcal{O}(TLn^{3}) \tag{32}$$

  \(Note: The QR decompositions required to build the protection spaces have a complexity of $\mathcal{O}(n(TK)^{2})$, which is dominated by the $\mathcal{O}(n^{3})$ SVD terms since $K\ll n$\.\)

- PACT\-TSV\-M \(Ours\) integrates PACT filtering with TSV\-M merging\. It performs:
  - $(T+1)$ SVDs for the PACT filtering phase: $\mathcal{O}\big((T+1)Ln^{3}\big)$\.
  - $(T+2)$ SVDs applied to the filtered matrices $\tilde{\Delta}_{t}$ for the TSV\-M merging phase: $\mathcal{O}\big((T+2)Ln^{3}\big)$\.

  This yields the total complexity:

  $$\mathcal{O}(\text{PACT-TSV-M})=\mathcal{O}\big((T+1)Ln^{3}+(T+2)Ln^{3}\big)=\mathcal{O}\big((2T+3)Ln^{3}\big)=\mathcal{O}(TLn^{3}) \tag{33}$$

Efficient Variants with Randomized SVD \(RSVD\): Since extracting task\-specific features only requires the top\-$k$ singular vectors, RSVD can approximate these components using random projections, reducing the decomposition complexity from $\mathcal{O}(n^{3})$ to $\mathcal{O}(n^{2}k_{max})$, where $k_{max}$ is the maximum rank retained\. However, SVDs used for final subspace alignment or full aggregation must remain exact\.

- Efficient TSV\-M \(RSVD\) replaces the first $T$ task\-specific SVDs, but the 2 final alignment SVDs cannot be substituted:

  $$\mathcal{O}(\text{Efficient TSV-M})=\mathcal{O}\big(TLn^{2}k_{max}+2Ln^{3}\big) \tag{34}$$

- Efficient PACT\-Iso\-C \(RSVD\) replaces all $(T+1)$ SVDs in the PACT phase, leaving only the final Iso\-C aggregation SVD exact:

  $$\mathcal{O}(\text{Efficient PACT-Iso-C})=\mathcal{O}\big((T+1)Ln^{2}k_{max}+Ln^{3}\big) \tag{35}$$

- Efficient PACT\-TSV\-M \(RSVD\) replaces the $(T+1)$ SVDs in the PACT phase and the $T$ task SVDs in the TSV\-M phase, while the 2 final TSV\-M alignment SVDs remain exact:

  $$\mathcal{O}(\text{Efficient PACT-TSV-M})=\mathcal{O}\big((2T+1)Ln^{2}k_{max}+2Ln^{3}\big) \tag{36}$$
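The SVD counts and leading-order costs listed above can be tabulated with a small cost model. This is only a bookkeeping sketch: the function names are hypothetical, constant factors hidden by the $\mathcal{O}(\cdot)$ notation are set to one, and the example sizes in the usage note are arbitrary.

```python
def full_svd_counts(T):
    """Number of exact n x n SVDs per layer for each exact method (Eqns. 29-33)."""
    return {
        "Iso-C": 1,
        "TSV-M": T + 2,
        "Iso-CTS": T + 3,
        "PACT-Iso-C": T + 2,
        "PACT-TSV-M": 2 * T + 3,
    }


def exact_costs(T, L, n):
    """Leading-order cost of each exact method: (#SVDs) * L * n^3."""
    return {name: count * L * n**3 for name, count in full_svd_counts(T).items()}


def efficient_costs(T, L, n, k_max):
    """Leading-order cost of the RSVD variants (Eqns. 34-36)."""
    rsvd = L * n**2 * k_max   # one randomized SVD per layer
    exact = L * n**3          # one exact SVD per layer
    return {
        "Efficient TSV-M": T * rsvd + 2 * exact,
        "Efficient PACT-Iso-C": (T + 1) * rsvd + exact,
        "Efficient PACT-TSV-M": (2 * T + 1) * rsvd + 2 * exact,
    }
```

For example, calling `exact_costs(T=8, L=12, n=768)` and `efficient_costs(T=8, L=12, n=768, k_max=16)` (illustrative values, not the paper's settings) shows how the $T$-dependent term shrinks once only the alignment and aggregation SVDs remain exact.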
Comparison Summary\. As highlighted in the analysis, all exact subspace methods \(TSV\-M, Iso\-CTS, PACT\-Iso\-C, and PACT\-TSV\-M\) share the same asymptotic complexity of $\mathcal{O}(TLn^{3})$\. Notably, exact PACT\-Iso\-C requires $(T+2)$ full SVDs, slightly fewer than the $(T+3)$ required by Iso\-CTS\. Exact PACT\-TSV\-M carries the highest constant factor, with $(2T+3)$ SVDs, because it runs two distinct algorithms in sequence\.
More importantly, our Efficient RSVD Variants decouple the cubic cost from the number of tasks $T$\. While a small, constant number of full SVDs \(e\.g\., 1 for Iso\-C, 2 for TSV\-M\) remains necessary for the final aggregation steps, substituting the $T$\-dependent operations with $\mathcal{O}(n^{2}k_{max})$ ones removes the main scaling bottleneck\. Because the number of retained vectors $k_{max}$ is a small fraction of the matrix dimension \($k_{max}\ll n$\), the cost that scales with $T$ shrinks by a factor of roughly $n/k_{max}$\. This makes the multi\-stage Efficient PACT\-TSV\-M pipeline scalable and considerably cheaper to compute than exact merging pipelines on large\-scale models\.
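To make the $\mathcal{O}(n^{2}k_{max})$ cost concrete, here is a minimal sketch of a randomized SVD in the style of Halko et al. (2011): a Gaussian range finder with a few power iterations, followed by an exact SVD of a small projected matrix. The function name, the oversampling amount, and the number of power iterations are assumptions chosen for illustration; the paper's own RSVD implementation may differ.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k singular triplets of A via random projections."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, min(m, n))

    # Sample the range of A with a Gaussian test matrix: O(m * n * ell).
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))

    # Power iterations sharpen the captured subspace when the spectrum
    # decays slowly; re-orthonormalizing each time keeps them stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Exact SVD of the small (ell x n) projected matrix.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]
```

The dominant operations are the products of the $n\times n$ matrix with $n\times\ell$ blocks, where $\ell$ is slightly larger than $k$, which is where the $n^{2}k_{max}$ scaling comes from. The rows of the returned `Vt` play the role of the top-$k$ right singular vectors used in the filtering phase.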
Table 7: Comparison of computational complexity among different merging methods\. $T$ and $L$ denote the number of tasks and layers, respectively\. $n$ is the dimension of the square layer matrices, and $k_{max}$ \($k_{max}\ll n$\) is the maximum number of vectors retained in RSVD approximations\.
## Appendix C Additional Experiments
### C\.1 Extended Analysis on Fisher\-level Parameter Collisions
Table 8: Accuracy \(%\) under different parameter modifications\. The value of $\alpha$ is set to 30\.0\.

Our parameter\-level perturbation experiments, as detailed in Section [3\.2\.1](https://arxiv.org/html/2606.18627#S3.SS2.SSS1), are extended here across 16 diverse task pairs $(A,B)$ to test how general the LBW hypothesis is\. Table [8](https://arxiv.org/html/2606.18627#A3.T8) presents the downstream performance of task $A$ under different modification schemes with an attack strength of $\alpha=30$\. We observe a consistent trend across all evaluated pairs:
##### Severe Degradation under Crucial Attack \($M2$\)
Modifying the parameters within the crucial mask \($M2$\) leads to a drastic performance drop\. For instance, in the \(Cars, SUN397\) pair, the accuracy of task $A$ \(Cars\) collapses from a baseline of 86\.49% to 5\.04%\. Similarly, for \(Cars, PCAM\), the accuracy drops to 2\.70%\. This collapse indicates that these small\-magnitude, high\-Fisher parameters are highly sensitive and hold critical task\-specific information\.
##### Robustness under Safe and Random Attacks \($M3$, $M4$\)
Conversely, modifying the parameters in the safe mask \($M3$\) or the random mask \($M4$\) with the same intensity yields considerably milder degradation\. For example, under $M3$ and $M4$, the performance on Cars in the \(Cars, SUN397\) pair remains at 77\.93% and 70\.64%, respectively\. These results indicate that the “load\-bearing” property is highly localized and is captured by our Fisher\-based identification method\.
To confirm whether model merging interferes with these LBW dimensions, we evaluate the effect of restoring the crucial mask parameters to their pre\-trained values $\theta_{0}$ after a naive merging of task vectors, $\theta_{0}+\Delta_{A}+\Delta_{B}$\. As shown in Table [9](https://arxiv.org/html/2606.18627#A3.T9), across almost all task pairs, simply resetting the crucial mask parameters \(which constitute only $0.80\%$–$2.97\%$ of the total parameters\) back to $\theta_{0}$ yields a measurable performance recovery $\Delta_{acc}$\. Specifically, we observe an improvement of $+2.33\%$ for \(Cars, PCAM\) and $+2.13\%$ for \(DTD, GTSRB\)\. This empirical recovery indicates that the task vector of the counterpart task \($\Delta_{B}$\) introduces destructive interference to the crucial pre\-trained weights of task $A$ during naive addition\.
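The restoration experiment described above amounts to a masked overwrite of the merged weights. The sketch below assumes the crucial mask is already available as a boolean array of the same shape as the weights; how that mask is derived from Fisher information is defined in Section 3.2.1 and is not reproduced here. The function name is hypothetical.

```python
import numpy as np


def merge_and_restore(theta_0, deltas, crucial_mask):
    """Naively add task vectors, then reset masked entries to pre-trained values.

    theta_0      : pre-trained weights (array)
    deltas       : list of task vectors with the same shape as theta_0
    crucial_mask : boolean array, True where a parameter is treated as crucial
    """
    merged = theta_0 + sum(deltas)
    # Keep theta_0 where the mask is True, the merged value elsewhere.
    restored = np.where(crucial_mask, theta_0, merged)
    return merged, restored
```

Evaluating task $A$ on `merged` and on `restored` and taking the difference gives the $\Delta_{acc}$ quantity reported for the crucial mask replacement.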
Table 9: $\Delta_{acc}$ represents the accuracy \(%\) change due to the crucial mask replacement, with improvement in red and decrease in blue\.

To thoroughly evaluate the sensitivity of the identified load\-bearing parameters, we visualize the downstream accuracy of task $A$ under varying attack strengths $\alpha\in\{1.0,5.0,10.0,20.0,30.0\}$ across all 16 evaluated task pairs in Figure [4](https://arxiv.org/html/2606.18627#A3.F4)\. The curves show a systematic and consistent divergence\. Across all 16 pairs, the $M2$ curve \(representing the targeted attack on the crucial mask\) drops precipitously as the attack strength $\alpha$ increases\. In contrast, the $M3$ \(safe mask\) and $M4$ \(random mask\) curves exhibit a much gentler decline, remaining relatively stable even under high perturbation levels\. This uniform behavior across diverse task combinations suggests that the load\-bearing wall phenomenon is not an artifact of a specific dataset pair, but rather a structural characteristic inherent to pre\-trained vision models\.
### C\.2 Extended Geometric Subspace Analysis and Layer\-wise Heterogeneity
This subsection provides the complete experimental results for the global subspace ablation analysis introduced in Section [3\.2\.2](https://arxiv.org/html/2606.18627#S3.SS2.SSS2) of the main text\. While Table [3](https://arxiv.org/html/2606.18627#S3.T3) in Section [3\.2\.2](https://arxiv.org/html/2606.18627#S3.SS2.SSS2) presents a concise subset of six representative tasks due to space constraints, we extend this evaluation here to the full set of 19 diverse downstream datasets on the ViT\-B\-16 model, as summarized in Table [10](https://arxiv.org/html/2606.18627#A3.T10)\. The results consistently indicate that removing the low\-dimensional $V_{t,LBW}$ subspace \(which represents on average only $1.17\%$ of the total parameter dimensions\) leads to a severe degradation of downstream performance across all evaluated tasks\. For instance, CIFAR100 accuracy falls from $91.08\%$ to $1.88\%$, and Flowers102 drops from $97.06\%$ to $0.98\%$\. In contrast, ablating a random orthogonal control subspace of identical dimensionality \($V_{rand}$\) causes far smaller performance changes, retaining $88.15\%$ and $91.18\%$ accuracy, respectively\. The ratio of the resulting performance drops \($\Delta_{acc}^{LBW}/\Delta_{acc}^{rand}$\) ranges from $13.75\times$ to over $3104\times$\. This asymmetry provides empirical evidence that these low\-rank pre\-trained subspaces are critical for sustaining downstream capabilities\.
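As a worked example of the drop ratio $\Delta_{acc}^{LBW}/\Delta_{acc}^{rand}$, the snippet below recomputes it for the two tasks quoted above, using only the accuracies stated in the text. The helper name is hypothetical.

```python
def drop_ratio(acc_finetuned, acc_lbw_ablated, acc_rand_ablated):
    """Ratio of the accuracy drop under LBW ablation to the drop under random ablation."""
    drop_lbw = acc_finetuned - acc_lbw_ablated
    drop_rand = acc_finetuned - acc_rand_ablated
    return drop_lbw / drop_rand


# CIFAR100: 91.08 -> 1.88 (LBW) versus 91.08 -> 88.15 (random)
cifar100_ratio = drop_ratio(91.08, 1.88, 88.15)      # 89.20 / 2.93, about 30.4
# Flowers102: 97.06 -> 0.98 (LBW) versus 97.06 -> 91.18 (random)
flowers102_ratio = drop_ratio(97.06, 0.98, 91.18)    # 96.08 / 5.88, about 16.3
```

Both values fall inside the reported $13.75\times$ to $3104\times$ range.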
Figure 4: Accuracy \(%\) under varying attack strengths with respect to different masks across 16 task pairs\. Generally, $M2$ is easily attacked, while $M3$ and $M4$ remain rather stable, indicating the importance of the parameters within the crucial mask\.

To analyze the underlying geometric conflicts causing these performance drops, we supplement the main text experiments by evaluating the pairwise Sacred Space Similarity \($Sim$, Eqn\. [7](https://arxiv.org/html/2606.18627#S3.E7)\) and Hidden Interference Ratio \($Inf$, Eqn\. [7](https://arxiv.org/html/2606.18627#S3.E7)\)\. When these metrics are averaged globally across all network layers \(Figure [9](https://arxiv.org/html/2606.18627#A3.F9), global\-average panels\), task similarities appear deceptively high \($>0.91$\), and the pairwise interference appears mild \($4\%$–$11\%$\)\.
However, we find that evaluating these metrics globally hides severe local conflicts, because the conflicts are distributed non\-uniformly across layers\. When we rank the layers and isolate the top\-6 worst\-affected layers \(Figure [9](https://arxiv.org/html/2606.18627#A3.F9), top\-6 panels\), a starkly different geometric landscape emerges\. In these critical foundational layers, the similarity \($Sim$\) drops to the $0.75$–$0.87$ range, and the Hidden Interference \($Inf$\) rises sharply, indicating that explicit updates from one task can overwrite up to $25\%$ of the implicit pre\-trained dependencies of another \(e\.g\., SUN397 and Cars invading SVHN’s space\)\.
Table 10: $\Delta_{acc}^{*}$ represents the accuracy \(%\) change due to the subspace filtering\. The random subspace filtering is averaged across three random seeds\. The numbers in parentheses indicate the decrease in accuracy relative to $\theta_{t}$\.

This sharp contrast reveals that the degree of pairwise task conflict is highly dependent on layer depth, with early foundational layers absorbing severe, high\-density interference\. This pronounced layer\-wise variation suggests that a uniform, global merging approach is insufficient, motivating a dedicated investigation into how these conflicts are distributed layer\-by\-layer across the pre\-trained architecture\.
Figure 5: Sim \(Global Avg\)
Figure 6: Interf \(Global Avg\)
Figure 7: Sim \(Top\-6 Layers\)
Figure 8: Interf \(Top\-6 Layers\)
Figure 9: Comparison of global averages versus the top\-6 worst\-affected layers for Sacred Space Similarity \(Sim\) and Hidden Interference Ratio \(Inf\)\.
### C\.3 Layer\-wise and Component\-wise Analysis of Pre\-trained Subspace Intrusion
To systematically investigate how task updates interact with the pre\-trained architecture under a subspace geometric view, we conduct a layer\-wise analysis\. This investigation requires a metric that directly quantifies how task\-specific updates modify the shared, pre\-trained representation space, rather than measuring pairwise task\-to\-task relationships\. To this end, motivated by the projection strength metrics \(Saha et al\., [2021](https://arxiv.org/html/2606.18627#bib.bib33)\) and subspace alignment formulations \(Fernando et al\., [2013](https://arxiv.org/html/2606.18627#bib.bib11)\), we define the Intrusion Energy \($E_{in}$\) for a specific task $t$ at layer $\ell$ as:

$$E_{in}(t,\ell)=\frac{\|\Delta_{t}V_{0,K}\|_{F}^{2}}{\|\Delta_{t}\|_{F}^{2}} \tag{37}$$

where $V_{0,K}\in\mathbb{R}^{n\times K}$ represents the top\-$K$ right singular vectors of the pre\-trained weights $\theta_{0}$, capturing the core coordinates of general pre\-trained knowledge\. $E_{in}$ measures the proportion of a task’s explicit update energy \($\Delta_{t}$\) that projects directly onto this shared pre\-trained core\. A higher $E_{in}$ indicates that fine\-tuning has aggressively reshaped the foundational pre\-trained representation, creating a high\-energy “intrusion” that is vulnerable to being overwritten during merging\.
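Eqn. (37) translates directly into a few lines of NumPy. The sketch below assumes a single layer given as dense matrices `delta` and `theta_0` (the per-layer weights of the task vector and the pre-trained model); the function name is hypothetical.

```python
import numpy as np


def intrusion_energy(delta, theta_0, K):
    """Fraction of the update energy that lies in the top-K right singular
    subspace of the pre-trained weights (Eqn. 37)."""
    # Rows of Vt are right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(theta_0, full_matrices=False)
    V_0K = Vt[:K].T                      # n x K basis of the pre-trained core
    projected = np.linalg.norm(delta @ V_0K) ** 2
    total = np.linalg.norm(delta) ** 2
    return projected / total
```

Because `V_0K` has orthonormal columns, the returned value always lies in $[0,1]$: it equals 1 when every row of `delta` lies in the span of the top-$K$ right singular vectors and 0 when the update is orthogonal to that span.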
Figure 10: Layer\-wise Intrusion Energy $E_{in}$ across the transformer blocks of ViT\-B\-16\.

#### C\.3\.1 Global Layer\-wise Intrusion Dynamics
Figure [10](https://arxiv.org/html/2606.18627#A3.F10) plots the average $E_{in}$ across the transformer blocks of ViT\-B\-16, revealing a clear layer\-wise heterogeneity:
##### Foundational Bottlenecks in Early Blocks\.
In early blocks \(blocks 0 to 2\), which extract fundamental visual representations \(e\.g\., textures, edges\), $E_{in}$ is heavily elevated\. For instance, the intrusion energy in the input projection layers exceeds $30\%$\. This indicates that downstream specialization forces task updates to aggressively modify these early representation bottlenecks, rendering them highly sensitive to cross\-task interference\.
##### Orthogonal Routing in Deeper Blocks\.
Conversely, in deeper layers \(blocks 5 to 11\), $E_{in}$ drops significantly and plateaus below $5\%$\. This suggests that deeper layers route high\-level semantic updates into task\-specific orthogonal subspaces with minimal perturbation of the core general\-purpose representations\.
This layer\-wise distribution highlights the risk of “the averaging illusion”: a $48\%$ structural corruption at foundational block 0 cannot be resolved by deeper blocks\. Treating all layers uniformly allows high\-energy foundational collisions to proceed unchecked, leading to downstream performance degradation\.
Figure 11: Task\-specific and Component\-wise Intrusion Energy $E_{in}$ across different transformer components\.
#### C\.3\.2 Component\-wise and Task\-wise Intrusion Dynamics
To evaluate these dynamics at a more granular scale, Figure [11](https://arxiv.org/html/2606.18627#A3.F11) decomposes $E_{in}$ across individual tasks and specific transformer components \(attn\.in\_proj, attn\.out\_proj, mlp\.c\_fc, and mlp\.c\_proj\)\.
We observe a strong component\-wise heterogeneity\. The input\-facing projection layers \(attn\.in\_proj and mlp\.c\_fc\) absorb nearly $40\%$ of the update energy at block 0, acting as the primary bottleneck of task intrusion\. In contrast, the output projection layers \(attn\.out\_proj and mlp\.c\_proj\) remain flat and low \(mostly below $5\%$\), suggesting they act primarily as feature routers that do not perturb the shared core\. Importantly, the tightly clustered curves across 8 diverse downstream datasets indicate that this early\-layer collision is a general geometric characteristic of ViT fine\-tuning rather than an artifact of a specific task\.
Figure 12: Comprehensive heatmap of Intrusion Energy \($E_{in}$\) across all layers and tasks\.

This universality is further illustrated in the exhaustive heatmap in Figure [12](https://arxiv.org/html/2606.18627#A3.F12)\. Across all evaluated tasks, the earliest blocks \(e\.g\., b00\.attn\.in\_proj, b00\.mlp\.c\_fc\) manifest as prominent dark bands of intense, high\-energy intrusion into the pre\-trained space\. Beyond block 4, the heatmap transitions consistently to light hues, indicating that the pre\-trained core remains largely undisturbed in deeper blocks\.
Figure 13: Fractional Intrusion Energy distributed across individual right singular vectors \($v_{1}$ to $v_{15}$\) for select top\-affected layers\.
#### C\.3\.3 Microscopic Intrusion Distribution Along Principal Axes
To understand the geometric mechanics of this intrusion at the finest scale, we analyze which specific singular vectors within the pre\-trained core subspace \($K=15$\) are affected\. Figure [13](https://arxiv.org/html/2606.18627#A3.F13) visualizes the fractional intrusion energy distributed across the individual right singular vectors \($v_{1}$ to $v_{15}$\) for the most heavily affected layers\.
The visualization reveals that task updates do not uniformly perturb the core space\. Instead, they target specific principal axes in a highly task\-dependent manner\. For example, in the foundational layer b00\.mlp\.c\_fc, SVHN heavily modifies $v_{4}$, whereas EuroSAT targets $v_{3}$\. Additionally, in the positional\_embedding layer, the intrusion across all tasks is almost exclusively concentrated on the single most dominant vector \($v_{1}$\), indicating a universal shift in positional priors during downstream specialization\.
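A per-vector breakdown of this kind can be sketched as follows. The normalization is an assumption: here each vector's share is its energy $\|\Delta_{t}v_{i}\|_{2}^{2}$ divided by the total energy inside the top-$K$ core, so the shares sum to one; the figure in the paper may instead normalize by $\|\Delta_{t}\|_{F}^{2}$. The function name is hypothetical.

```python
import numpy as np


def per_vector_intrusion(delta, theta_0, K=15):
    """Share of the intrusion energy carried by each of the top-K right
    singular vectors v_1..v_K of the pre-trained weights."""
    _, _, Vt = np.linalg.svd(theta_0, full_matrices=False)
    V_0K = Vt[:K].T                                   # columns are v_1..v_K
    # Column i of (delta @ V_0K) is delta @ v_i; sum of squares gives its energy.
    energy = np.sum((delta @ V_0K) ** 2, axis=0)
    return energy / energy.sum()
```

Index `i` of the returned array corresponds to $v_{i+1}$, so an update concentrated on $v_{4}$ would show its largest share at index 3.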
This vector\-wise heterogeneity highlights the necessity of PACT’s design\. Because different tasks rely on and explicitly attack different combinations of principal axes, simply discarding a fixed, task\-agnostic set of singular vectors is insufficient\. The adaptive, QR\-decomposition\-based global shield implemented in PACT is essential to dynamically map and protect these highly heterogeneous, task\-specific attack vectors\.
## Appendix D Detailed Ablation Studies
To systematically evaluate the contribution of each design component in PACT, we conduct extensive ablation studies\. This section is organized into three parts: \(1\) an algorithmic component ablation comparing PACT against a static core\-filtering baseline; \(2\) a layer\-wise depth ablation exploring the performance of localized filtering; and \(3\) a sensitivity analysis of the hyper\-parameters\.
### D\.1Algorithmic Component Ablation: Dynamic Task Shields vs\. StaticPACT\(S\-PACT\)
To verify the necessity of dynamically extracting the active task subspaces \(the top\-kktask\-specific SVD\) and constructing the mutual protection shields, we comparePACTagainst a simplified, static baseline which we term S\-PACT\(StaticPACT\)\.
S\-PACTbypasses the task\-vector SVD and mutual projection phases entirely\. Instead, it only performs a single SVD on the pre\-trained weight matrixθ0\\theta\_\{0\}to extract the top\-KKcore directionsV0,KV\_\{0,K\}, and then directly projects each task matrix onto its orthogonal complement:
Δ~j=Δj\(I−V0,KV0,K⊤\)\\displaystyle\\tilde\{\\Delta\}\_\{j\}=\\Delta\_\{j\}\(I\-V\_\{0,K\}V\_\{0,K\}^\{\\top\}\)\(38\)This baseline effectively filters task updates against the pre\-trained foundation but ignores the distinct, task\-specific subspaces of other expert models\. S\-PACTis computationally cheaper as it requires only one SVD per layer and no subsequent QR decompositions\.
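The S-PACT projection in Eq. (38) can be sketched in a few lines of NumPy. The function name `s_pact_filter`, the toy shapes, and the random inputs are illustrative assumptions; only the projection itself follows the equation.

```python
import numpy as np

def s_pact_filter(theta0: np.ndarray, delta: np.ndarray, K: int = 15) -> np.ndarray:
    """Project a task matrix onto the orthogonal complement of the top-K
    right singular subspace of the pre-trained weight, as in Eq. (38)."""
    # numpy returns V^T; its first K rows are the top-K right singular vectors.
    _, _, Vt = np.linalg.svd(theta0, full_matrices=False)
    V = Vt[:K].T  # V_{0,K}, shape (n, K)
    # delta (I - V V^T), computed without forming the n x n projector.
    return delta - (delta @ V) @ V.T

# Toy shapes and random values, purely illustrative.
rng = np.random.default_rng(0)
theta0 = rng.standard_normal((48, 32))
delta = rng.standard_normal((48, 32))
delta_filtered = s_pact_filter(theta0, delta, K=15)
```

After filtering, `delta_filtered @ V` is zero up to floating-point error, so the update no longer moves the weights along the core directions. Note that the same projector is applied to every task, which is exactly the task-agnostic behavior this ablation contrasts with full PACT.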
Table 11: Algorithmic ablation study comparing PACT against the core-only S-PACT baseline across different model backbones. We present average absolute accuracy and average normalized accuracy (in brackets) in %.

Table [11](https://arxiv.org/html/2606.18627#A4.T11) summarizes the merging performance of S-PACT against TA, Iso-C, and full PACT. The empirical results reveal that merely performing static core-filtering (S-PACT) yields limited performance improvements. For instance, on ViT-B/16 (14 tasks), S-PACT-TA only marginally improves over TA (71.2% vs. 70.5%), while S-PACT-Iso-C slightly degrades the original Iso-C baseline (84.2% vs. 84.4%).
In contrast, PACT-TA and PACT-Iso-C achieve substantial performance improvements, reaching 76.8% and 86.1% respectively on the same setting. This difference indicates that simply protecting the static pre-trained core is insufficient. To prevent task updates from blindly overwriting each other's distinct downstream manifolds, it is critical to dynamically extract the active task dimensions and construct mutually orthogonal protection shields.
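The two metrics reported in these tables can be sketched as follows. This section does not define normalized accuracy, so the sketch assumes the convention common in the task-arithmetic literature: the merged model's accuracy on a task divided by the accuracy of the individually fine-tuned model on that task. The task names and numbers are made up for illustration.

```python
def average_accuracies(merged_acc: dict, finetuned_acc: dict) -> tuple:
    """Return (average absolute accuracy, average normalized accuracy) in %.

    Normalized accuracy is assumed to be merged / fine-tuned accuracy per task.
    """
    tasks = sorted(merged_acc)
    absolute = sum(merged_acc[t] for t in tasks) / len(tasks)
    normalized = sum(merged_acc[t] / finetuned_acc[t] for t in tasks) / len(tasks)
    return 100.0 * absolute, 100.0 * normalized

# Hypothetical per-task accuracies (fractions in [0, 1]), not results from the paper.
merged = {"task_a": 0.80, "task_b": 0.90}
finetuned = {"task_a": 1.00, "task_b": 0.90}
abs_acc, norm_acc = average_accuracies(merged, finetuned)
print(abs_acc, norm_acc)
```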
### D.2 Layer-wise Depth Ablation: Localized Filtering for Efficiency
In Appendix [C.3](https://arxiv.org/html/2606.18627#A3.SS3), we demonstrated that the energy intrusion effect is highly localized within the early foundational layers of the model, whereas deeper layers route semantics in mutually orthogonal directions. Based on this observation, we perform an ablation on the depth of layers filtered by PACT to identify potential efficiency-performance trade-offs. We compare applying PACT globally to 100% of the layers versus applying it to various percentages of early foundational layers.
Table 12: Ablation study on filtering depth comparing localized early-layer PACT filtering under various percentages versus 100% global PACT filtering. We present average absolute accuracy and average normalized accuracy (in brackets) in %. Best performances are shown in bold.

Table [12](https://arxiv.org/html/2606.18627#A4.T12) presents the comparison across three base merging algorithms (TA, Iso-C, TSV-M). Overall, integrating PACT systematically improves the multi-task accuracy of all three baseline methods across all evaluated backbones and task scales. Specifically, while baseline methods suffer from performance degradation as the task load increases, the PACT-enhanced variants mitigate this drop, with the best-performing configurations generally concentrated at 75% or 100% filtering depths. To trace the depth-wise trajectory of this base-protection effect, we focus our analysis primarily on PACT-Iso-C, which provides a complete gradient of filtering depths (25%, 50%, 75%, and 100%).
For PACT-Iso-C, we observe a non-monotonic performance curve as the filtering depth increases. Moving from the baseline Iso-C to 25% and 50% PACT-Iso-C, the accuracy steadily climbs across all configurations, confirming that shielding the load-bearing walls in the early-to-mid layers is highly effective. Interestingly, the performance peaks at different depths depending on model capacity and task scale. For the majority of 8-task and 14-task scenarios on ViT-B/16 and ViT-L/14, the accuracy peaks at 75% depth (e.g., reaching 92.1% and 95.2% on 8 tasks) and then plateaus or slightly declines at 100%. This indicates that while protecting the first 75% of the layers is essential due to high-energy visual bottlenecks, the deepest 25% of layers route high-level semantic representations in naturally orthogonal directions, making strict orthogonal constraints in these final blocks redundant or over-constraining.
This peak shifts when scaling the backbone capacity or task load. For the highly parameter-constrained ViT-B/32 backbone, the optimal performance peaks earlier: at 50% depth for 8 tasks (87.8%) and at 75% depth for larger task counts. Conversely, under the heaviest task load of 20 tasks on larger backbones (ViT-B/16 and ViT-L/14), the peak shifts to 100% global filtering (achieving 83.4% and 90.3%). This shift reveals that under extreme multi-task interference, the necessity of safeguarding the pre-trained core extends globally across all layers, and larger backbones possess sufficient dimensional capacity to accommodate the global orthogonal constraints without suffering from dimensionality lock.
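A depth-limited variant of the filtering could be configured with a helper like the one below. It assumes that the depth percentage refers to the earliest fraction of an ordered list of layers (rounded up); how the paper counts layers is not specified in this section, and the helper name and block names are hypothetical.

```python
import math

def layers_to_filter(layer_names: list, depth_fraction: float) -> set:
    """Return the earliest layers covering `depth_fraction` of an ordered list."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie in [0, 1]")
    n_keep = math.ceil(depth_fraction * len(layer_names))
    return set(layer_names[:n_keep])

# Hypothetical 12-block backbone; the names are placeholders.
blocks = [f"block{i:02d}" for i in range(12)]
selected = layers_to_filter(blocks, 0.75)
print(sorted(selected))
```

Layers outside `selected` would skip the filter and go straight to the base merging algorithm, which is where the efficiency gain of localized filtering would come from.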
### D.3 Hyperparameter Sensitivity Analysis
Table 13: Performance of PACT-Iso-C under different hyperparameters ($K$, $k$) and the corresponding scaling coefficient, where (0, 0) represents the results of our reproduced Iso-C. We present average absolute accuracy and average normalized accuracy (in brackets) in %. The best performance is in bold.

To study the sensitivity of PACT to its primary hyperparameters, we refer to the detailed evaluation presented in Table [13](https://arxiv.org/html/2606.18627#A4.T13), which examines the effects of varying the pre-trained core dimension $K$ and the active task vector dimension $k$, alongside the corresponding base merging scaling coefficients $\alpha$.
The sensitivity analysis indicates that PACT is relatively robust to different hyperparameter configurations. Specifically, the parameter pair $(K,k)=(15,8)$ consistently yields favorable absolute and normalized accuracy across diverse dataset scales (8, 14, and 20 tasks) and various model backbones (ViT-B/32, ViT-B/16, and ViT-L/14). This stable behavior suggests that a moderate, low-rank projection configuration is sufficient to separate the shared general-purpose representation from the task-specific update manifolds, simplifying practical deployments.
## Appendix E Experimental Details
In this appendix, we provide the dataset and implementation details used to carry out the experiments presented in the paper.
### E.1 Datasets
The 8-dataset benchmark consists of: Cars (Krause et al., [2013](https://arxiv.org/html/2606.18627#bib.bib20)), DTD (Cimpoi et al., [2014](https://arxiv.org/html/2606.18627#bib.bib6)), EuroSAT (Helber et al., [2019](https://arxiv.org/html/2606.18627#bib.bib15)), GTSRB (Stallkamp et al., [2011](https://arxiv.org/html/2606.18627#bib.bib35)), MNIST (LeCun et al., [1998](https://arxiv.org/html/2606.18627#bib.bib22)), RESISC45 (Cheng et al., [2017](https://arxiv.org/html/2606.18627#bib.bib5)), SUN397 (Xiao et al., [2016](https://arxiv.org/html/2606.18627#bib.bib48)), and SVHN (Netzer et al., [2011](https://arxiv.org/html/2606.18627#bib.bib26)).
The 14-dataset benchmark builds on the preceding one, incorporating six additional datasets: CIFAR100 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.18627#bib.bib21)), STL10 (Coates et al., [2011](https://arxiv.org/html/2606.18627#bib.bib7)), Flowers102 (Nilsback & Zisserman, [2008](https://arxiv.org/html/2606.18627#bib.bib28)), OxfordIIITPet (Parkhi, [2012](https://arxiv.org/html/2606.18627#bib.bib29)), PCAM (Veeling et al., [2018](https://arxiv.org/html/2606.18627#bib.bib41)), and FER2013 (Goodfellow et al., [2013](https://arxiv.org/html/2606.18627#bib.bib13)).
Finally, the 20-dataset benchmark includes the preceding 14 plus the following six: EMNIST (Cohen et al., [2017](https://arxiv.org/html/2606.18627#bib.bib8)), CIFAR10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.18627#bib.bib21)), Food101 (Bossard et al., [2014](https://arxiv.org/html/2606.18627#bib.bib2)), FashionMNIST (Xiao et al., [2017](https://arxiv.org/html/2606.18627#bib.bib47)), RenderedSST2 (Socher et al., [2013](https://arxiv.org/html/2606.18627#bib.bib34)), and KMNIST (Ba et al., [2016](https://arxiv.org/html/2606.18627#bib.bib1)).
- SUN397 (Xiao et al., [2016](https://arxiv.org/html/2606.18627#bib.bib48)) contains more than 100,000 images of 397 categories for benchmarking scene understanding. The number of images varies across categories, but each category has at least 100 images.
- Stanford Cars (Cars) (Krause et al., [2013](https://arxiv.org/html/2606.18627#bib.bib20)) has 16,185 images in total of 196 types of cars and is evenly split into training and testing sets.
- RESISC45 (Cheng et al., [2017](https://arxiv.org/html/2606.18627#bib.bib5)) is developed for remote sensing image scene classification. This dataset covers 45 scene classes with 700 images of size $256\times 256$ for each.
- EuroSAT (Helber et al., [2019](https://arxiv.org/html/2606.18627#bib.bib15)) is used for land use and land cover classification using Sentinel-2 satellite images of size $64\times 64$, consisting of 27,000 images covering 10 classes.
- SVHN (Netzer et al., [2011](https://arxiv.org/html/2606.18627#bib.bib26)) is a street view house number classification benchmark, containing more than 600,000 RGB images of 10 printed digits of size $32\times 32$, cropped from house number plates.
- GTSRB (Stallkamp et al., [2011](https://arxiv.org/html/2606.18627#bib.bib35)) is a German traffic sign recognition benchmark consisting of over 50,000 images of 43 classes of traffic signs in varying light and background conditions.
- MNIST (LeCun et al., [1998](https://arxiv.org/html/2606.18627#bib.bib22)) is a well-known classical dataset for hand-written digit classification with 60,000 training and 10,000 testing images of size $28\times 28$ in 10 classes of numbers.
- DTD (Cimpoi et al., [2014](https://arxiv.org/html/2606.18627#bib.bib6)) is a collection of 5,640 images across 47 categories of textures in the wild, annotated with human-centric attributes.
- Flowers102 (Nilsback & Zisserman, [2008](https://arxiv.org/html/2606.18627#bib.bib28)) contains 102 flower categories that are popular in the United Kingdom, with 1,020 training and 6,149 testing images. The images have varying poses and light conditions.
- PCAM (PatchCamelyon) (Veeling et al., [2018](https://arxiv.org/html/2606.18627#bib.bib41)) consists of more than 300,000 color images of size $96\times 96$ pixels extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue.
- FER2013 (Goodfellow et al., [2013](https://arxiv.org/html/2606.18627#bib.bib13)) is developed for facial expression recognition. The images are grayscale and have a size of $48\times 48$ pixels, describing seven different kinds of emotions. The training and testing splits consist of 28,709 and 7,178 samples, respectively.
- OxfordIIITPet (Parkhi, [2012](https://arxiv.org/html/2606.18627#bib.bib29)) is a 37-category pet dataset with roughly 200 images for each category, and is equally divided into training and testing splits. The images vary in scale, pose, and lighting conditions.
- STL10 (Coates et al., [2011](https://arxiv.org/html/2606.18627#bib.bib7)) is primarily built for unsupervised image recognition tasks covering 10 classes. Hence, the number of labeled images is quite small: 500 training and 800 testing images for each class. All of them are in $96\times 96$ pixel resolution.
- CIFAR100 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.18627#bib.bib21)) consists of color images categorized into 100 general classes. Each class contains 600 images, and each image is of size $32\times 32$. There are 50,000 training images and 10,000 testing images.
- CIFAR10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.18627#bib.bib21)) is similar to CIFAR100, except it has 10 classes.
- Food101 (Bossard et al., [2014](https://arxiv.org/html/2606.18627#bib.bib2)) consists of 101 food categories, with 101,000 images. For each class, 750 images are for training and 250 are for testing. Only the testing images are manually reviewed. The training images contain noise, mostly from intense colors, and are sometimes mislabelled.
- FashionMNIST (Xiao et al., [2017](https://arxiv.org/html/2606.18627#bib.bib47)) is designed as a drop-in replacement benchmark for the original MNIST, thereby inheriting the same structure as MNIST.
- EMNIST (Cohen et al., [2017](https://arxiv.org/html/2606.18627#bib.bib8)) is an extended version of MNIST. EMNIST contains images of both characters and digits. We choose to use only the EMNIST Letters split, which contains around 145,000 images evenly distributed in 26 classes of the alphabet letters.
- KMNIST (Ba et al., [2016](https://arxiv.org/html/2606.18627#bib.bib1)), yet another version of MNIST, represents 10 Japanese Hiragana characters.
- RenderedSST2 (Socher et al., [2013](https://arxiv.org/html/2606.18627#bib.bib34); Radford et al., [2019](https://arxiv.org/html/2606.18627#bib.bib30)) is used for evaluating the models' capability on optical character recognition. The images are rendered from sentences in the Stanford Sentiment Treebank v2 (Socher et al., [2013](https://arxiv.org/html/2606.18627#bib.bib34)), with black text on a white background in $448\times 448$ resolution. Each image is labeled as positive or negative based on the sentiment expressed in the text, and the number of images for both classes is nearly balanced. There are 6,920 training and 1,821 testing images.
### E.2 Additional Notes
Our method relies on SVD and RSVD, which are defined for two-dimensional matrices $\Delta\in\mathbb{R}^{m\times n}$. However, some weights of the neural networks are represented by vectors $\delta\in\mathbb{R}^{n}$, e.g., bias vectors and parameters of layer normalization (Ba et al., [2016](https://arxiv.org/html/2606.18627#bib.bib1)). Therefore, these parameters bypass the filtering process of the PACT algorithm and directly enter the subsequent merging process.
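The bypass rule for non-matrix parameters can be sketched as a simple dispatch on tensor rank. The dictionary layout, the parameter names, and the stand-in filter `halve` are assumptions made for illustration; a real filter would be the PACT projection for the corresponding layer.

```python
import numpy as np

def filter_task_vector(task_vector: dict, filter_fn) -> dict:
    """Apply `filter_fn` to 2-D entries only; 1-D entries pass through unchanged."""
    out = {}
    for name, delta in task_vector.items():
        if delta.ndim == 2:
            out[name] = filter_fn(name, delta)  # matrices go through the filter
        else:
            out[name] = delta                   # biases, LayerNorm parameters, etc.
    return out

def halve(name: str, delta: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the actual per-layer filter.
    return 0.5 * delta

# Hypothetical parameter names and toy values.
task_vector = {"block0.weight": np.ones((4, 3)), "block0.bias": np.ones(4)}
filtered = filter_task_vector(task_vector, halve)
```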
### E.3 Selection of the scaling factor $\alpha$
Figure 14: The scaling factor $\alpha$ is chosen based on the best average performance on the validation set across tasks. Each point denotes the optimal $\alpha$ for each method. The model is ViT-B/16.

In Figure [14](https://arxiv.org/html/2606.18627#A5.F14), we present the relationship between validation accuracy and the scaling factor $\alpha$ on the 8-task ViT-B/16 model. TA is sensitive to the choice of $\alpha$, whereas Iso-C and TSV-M are relatively robust, which is consistent with previous findings. Due to the cleaning and filtering effect of PACT, the optimal $\alpha$ values increase for all three methods when combined with PACT, which also aligns with theoretical predictions. Notably, PACT-TA becomes robust to the choice of $\alpha$, though the optimal value itself becomes substantially larger. For reproducibility, Table [14](https://arxiv.org/html/2606.18627#A5.T14) lists the optimal $\alpha$ value chosen on the held-out validation set for each model and number of tasks. Note that in the optimal-$\alpha$ search procedure, the early stopping step is set to 3.
Table 14: Optimal $\alpha$ value chosen on a held-out validation set for different model types and numbers of tasks.
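One way to read the search procedure is as a sweep over increasing $\alpha$ values that stops once validation accuracy has failed to improve for three consecutive candidates. The sketch below follows that reading; interpreting the early stopping step as a patience of 3, the grid of candidates, and the toy validation curve are all assumptions, not details confirmed by the paper.

```python
def select_alpha(alphas, validate, patience: int = 3):
    """Sweep candidate alphas in order and keep the best validation score.

    Stops after `patience` consecutive candidates without improvement.
    """
    best_alpha, best_score, stale = None, float("-inf"), 0
    for alpha in alphas:
        score = validate(alpha)
        if score > best_score:
            best_alpha, best_score, stale = alpha, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_alpha, best_score

# Toy validation curve peaking near alpha = 1.2; a real run would merge the
# models with each alpha and average validation accuracy across tasks.
calls = []
def toy_validate(alpha: float) -> float:
    calls.append(alpha)
    return -(alpha - 1.2) ** 2

grid = [0.1 * i for i in range(1, 31)]  # hypothetical candidate grid
best_alpha, best_score = select_alpha(grid, toy_validate, patience=3)
```

With this stopping rule the sweep evaluates only three candidates past the peak, which keeps the search cheap when the optimal $\alpha$ is far below the upper end of the grid.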
## Appendix F Limitation
Although PACT approaches model merging from a new perspective, namely the identification and protection of LBW dimensions, and achieves substantial improvements over existing methods, several limitations remain. First, despite the proposed efficient variant, PACT still introduces additional computational overhead. Second, there may exist more efficient and accurate strategies for identifying LBW dimensions. We leave these directions for future work.