Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space

arXiv cs.LG Papers

Summary

This paper frames model merging as probabilistic inference under a product-of-experts scenario, showing that existing methods are special cases and proposing a heavy-tailed Cauchy expert design that better captures real residual behavior, achieving significant improvements over state-of-the-art baselines.

arXiv:2607.01689v1 Announce Type: new

Cached at: 07/03/26, 05:43 AM

# Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
Source: [https://arxiv.org/html/2607.01689](https://arxiv.org/html/2607.01689)
Tuan Anh Le Van (Hanoi University of Science and Technology), Tung Phi Duc (Hanoi University of Science and Technology), Phi Le Nguyen (Hanoi University of Science and Technology), Jana Doppa (Washington State University), Trong Nghia Hoang (Washington State University)

###### Abstract

Model merging aims to combine existing single-task solutions into a multi-task solution without additional data-driven fine-tuning. Most existing approaches achieve this using geometric properties of local solution spaces. However, such geometric views provide limited guidance for scoring how statistically useful each task-specific update direction is across tasks during merging. To address this, we formulate model merging from a new perspective of probabilistic inference under a product-of-experts (PoE) scenario where each single-task solution defines an energy-based expert model (EBM) over the merged parameters. We show that several existing model merging methods arise as special cases of our framework under energy designs that impose implicit Gaussian assumptions on directional residuals between merged and task-specific models. Empirically, we find that these residuals are often heavy-tailed, which exposes a mismatch with the imposed light-tailed Gaussian structures. We address this with a heavy-tailed PoE design based on Cauchy experts, which better captures the observed residual behavior while admitting a provably convergent inference procedure. Experiments across multiple tasks and architectures show significant improvements over state-of-the-art baselines. Our code is available at [https://github.com/MinhLong210/PoE-EBM-Merging.git](https://github.com/MinhLong210/PoE-EBM-Merging.git).

## 1 Introduction

Large pre-trained foundation models and their task-specific fine-tuned variants [achiam2023gpt, touvron2023llama] have become increasingly available for a wide range of downstream tasks. The growing availability of such specialized models has motivated model merging, which seeks to combine multiple task-specific models into a single multi-task model without additional data-driven fine-tuning [hoang2019collective, NghiaNeurIPS19, NghiaICML20, NghiaICML21, yang2024model, li2023deep, hoang2024fewshot]. For example, monolingual models can be combined to obtain a single multilingual model [ahmadian2024mix]. Such model merging approaches are particularly valuable in many real-world production systems [Su2018ARS] where neither local datasets nor training pipelines can be centralized and synchronized (footnote 1: Federated Learning [McMahan2016CommunicationEfficientLO] can help address data privacy but still requires synchronized local training processes). Despite requiring no additional training data, model merging can often achieve performance competitive with full multi-task fine-tuning, which might be impractical in such settings. Model merging is also attractive when storage resources are limited, such as on edge devices [voghoei2018deep, narayanswamy2024scaling], or when access to privacy-sensitive task-specific data is restricted [liang2025vision, zhang2024challenges, pan2024federated].

Prior Work. A common paradigm in model merging is to represent each task-specific model with a fine-tuning module, such as a low-rank adaptation (LoRA) matrix [hu2022lora] or a task vector capturing the difference between fine-tuned and pre-trained parameters [ilharco2022editing]. Model merging then reduces to aggregating these task-specific updates into a single multi-task model. Most existing approaches perform this aggregation using geometric properties of local solution spaces.

The simplest methods, including weight averaging and task arithmetic [wortsman2022robust, ilharco2022editing], assume that task-specific updates lie in a common solution manifold and can be directly combined. More sophisticated approaches, such as Fisher-weighted averaging [matena2022merging] and Gram-based weighting [jin2022dataless], incorporate curvature or data-dependent geometric information to better account for differences among local solution spaces during aggregation. Another line of work explicitly seeks to align local solution spaces before merging. For example, DOGE [wei2025modeling] constructs a shared tangent space, while KnOTS [stoica2024model] derives a common low-dimensional subspace using singular value decomposition. Task-specific models are then projected onto these aligned representations prior to aggregation.

![Refer to caption](https://arxiv.org/html/2607.01689v1/x1.png)

Figure 1: Empirical distributions of the directional residual $r = (\boldsymbol{\zeta} - \boldsymbol{\theta})^{\top}\boldsymbol{\theta}$, which measures the drift of a merged update $\boldsymbol{\zeta}$ from a task update $\boldsymbol{\theta}$ along that task's own update direction. We compare residuals produced by different merging methods when merging 7 fine-tuned ViT-L/14 models. Log-scale density plots show that empirical tails decay substantially more slowly than the fitted Gaussian and align more closely with a fitted Cauchy distribution. This indicates pronounced heavy-tailed residual behavior in model merging and reveals a mismatch with the (implicit) Gaussian expert structure in prior work.

Limitation. These approaches largely treat each task-specific fine-tuning module as a deterministic point estimate and aggregate them with geometry-guided operations. However, this deterministic view provides little information regarding how useful each update direction is across different tasks. As a result, update directions that are effective only for individual tasks can be aggregated with directions that are consistently useful across tasks. Such task-specific directions can then cancel or dominate useful cross-task directions, pulling the merged update away from shared directions that benefit multiple tasks. This suggests that model merging should not only aggregate update directions, but also estimate how confidently each direction should influence the merged module.

Motivation and Solution Vision. Motivated by the above intuition, we investigate model merging within a broader probabilistic framework in which each task update induces a distribution over candidate shared update directions rather than a single point estimate. The local probabilistic score assigned to each candidate direction thus reflects how strongly it is supported by the corresponding task. In this view, merging becomes evidence aggregation. Directions supported by multiple tasks receive higher aggregate confidence, while directions supported only by individual tasks receive low support from the remaining tasks and are down-weighted.

We instantiate this idea by casting model merging as MAP inference in the fine-tuning parameter space under a product of task-specific energy-based experts. This view recovers existing merging rules as special cases under particular energy designs and exposes the uncertainty assumptions they implicitly impose. In particular, quadratic energies recover classical averaging-style methods and correspond to Gaussian experts. This approach is also closely related to a recent uncertainty-aware gradient matching method of [daheim2023model], which imposes an implicit Gaussian expert design via its Laplace approximation interpretation. While this method provides important empirical evidence for the effectiveness of probabilistic model merging, our analysis reveals that its implicit Gaussian structure does not fit the distribution over directional residuals between individual and merged updates well. In particular, we show that these directional residuals exhibit substantial heavy-tailed behavior while Gaussian models are inherently light-tailed (see Fig. [1](https://arxiv.org/html/2607.01689#S1.F1)). To address this limitation, we develop a product-of-experts (PoE) formulation with heavy-tailed energy-based expert models (EBM). Our solution approach is substantiated by the following technical contributions:

1\. A unified probabilistic model merging framework. We formulate model merging as MAP inference in the fine-tuning parameter space under a product of task-specific energy-based experts. In this view, each fine-tuning module induces an energy over the directional residual between a candidate merged update and the task-specific update. The merged module then corresponds to the MAP estimate under the resulting product of experts. We show that existing merging rules, including uniform averaging and Fisher-weighted averaging, arise as special cases under particular energy designs. This provides a unified lens for these methods and exposes the implicit probabilistic assumptions imposed by their aggregation rules (Sections [3](https://arxiv.org/html/2607.01689#S3) and [4](https://arxiv.org/html/2607.01689#S4)).

2\. Heavy-tailed merging with convergence guarantees. Under the above probabilistic framework, we show that existing merging methods correspond to light-tailed distributions over the directional residual $r = (\boldsymbol{\zeta} - \boldsymbol{\theta})^{\top}\boldsymbol{\theta}$, which measures how far a candidate merged update $\boldsymbol{\zeta}$ drifts from a task update $\boldsymbol{\theta}$ along that task's own update direction. Our empirical analysis shows that these residuals exhibit heavy-tailed behavior, revealing a mismatch with the Gaussian expert structure often implicitly assumed in prior work. To address this, we develop a novel heavy-tailed Cauchy-based expert design and an efficient fixed-point MAP inference algorithm with convergence guarantees (Section [5](https://arxiv.org/html/2607.01689#S5)).

3\. Evaluation across vision and language models. We evaluate the proposed framework across diverse vision and language benchmarks spanning multiple model families. Our approach consistently improves merged model performance over state-of-the-art (SOTA) baselines (Section [6](https://arxiv.org/html/2607.01689#S6)).

## 2 Problem Setup and Background

Model merging aims to aggregate $N$ existing models which are fine-tuned from a large pre-trained model $\boldsymbol{W}_0$ on different downstream datasets. Each fine-tuned task-specific model has parameters $\boldsymbol{W}_i$ which are learned from the local dataset $D_i$. To extract task-specific information for task $i \in [N]$, [ilharco2022editing] introduced the task vector $\boldsymbol{\theta}_i = \boldsymbol{W}_i - \boldsymbol{W}_0$. The task vector encodes task-specific information and allows the analysis of each individual task's characteristics. The goal of model merging is to find merged parameters $\boldsymbol{W}_m$ that perform well on all tasks by designing an aggregation algorithm $A$ to combine the task vectors:

$$\boldsymbol{W}_m = A\big(\boldsymbol{W}_0, \boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_N\big).$$

[ilharco2022editing] show that we can obtain multi-task models $\boldsymbol{W}_m$ by performing simple task arithmetic operations on the task vectors $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_N$:

$$\boldsymbol{W}_m = \boldsymbol{W}_0 + \lambda \sum_{i=1}^{N} \boldsymbol{\theta}_i, \tag{1}$$

where $\lambda$ is a scaling coefficient, usually tuned on a validation set. When $\lambda = 1/N$, Eq. ([1](https://arxiv.org/html/2607.01689#S2.E1)) reduces to performing weight averaging over $\boldsymbol{W}_i$ [wortsman2022robust]. [matena2022merging] further generalize this idea by scaling local parameters with their corresponding Fisher Information Matrix $\boldsymbol{F}_i$, which results in the following Fisher Averaging:

$$\boldsymbol{W}_m = \sum_{i=1}^{N} \alpha_i \left(\sum_{t=1}^{N} \alpha_t \boldsymbol{F}_t\right)^{-1} \boldsymbol{F}_i \boldsymbol{\theta}_i, \tag{2}$$

where

$$\boldsymbol{F}_i = \mathbb{E}_{x \sim D_i}\Big[\mathbb{E}_{y \sim p_{\boldsymbol{W}_i}(y|x)}\big[\boldsymbol{G}_i \boldsymbol{G}_i^{\top}\big]\Big] \tag{3}$$

with $\boldsymbol{G}_i = \nabla_{\boldsymbol{W}_i} \log p_{\boldsymbol{W}_i}(y|x)$.

Alternatively, RegMean [jin2022dataless] matches model behavior by aligning activations between the merged and task-specific models at each linear layer. This leads to a linear regression problem defined by the task-specific data matrix $\boldsymbol{X}_i$, which introduces another merging rule:

$$\boldsymbol{W}_m = \left(\sum_{i=1}^{N} \frac{1}{N_i} \boldsymbol{X}_i^{\top}\boldsymbol{X}_i\right)^{-1} \left(\sum_{i=1}^{N} \frac{1}{N_i} \boldsymbol{X}_i^{\top}\boldsymbol{X}_i \boldsymbol{\theta}_i\right). \tag{4}$$
Overall, existing merging methods differ in how they align local solution spaces. Yet, they essentially view task updates as deterministic quantities to be aggregated without weighing their statistical usefulness across different tasks (Section [1](https://arxiv.org/html/2607.01689#S1)). We therefore seek a probabilistic treatment of fine-tuning modules that not only accounts for the local update geometries, but also estimates how confidently each direction should influence the merged module.
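As a small illustration of the task-vector notation and of Eq. (1), the following NumPy sketch builds toy task vectors and applies task arithmetic. The arrays are random stand-ins for model weights, and the helper names `task_vectors` and `task_arithmetic` are hypothetical, not taken from the paper's code.

```python
import numpy as np

def task_vectors(w0, finetuned):
    """Task vectors theta_i = W_i - W_0 for each fine-tuned parameter array."""
    return [w_i - w0 for w_i in finetuned]

def task_arithmetic(w0, thetas, lam):
    """Eq. (1): W_m = W_0 + lam * sum_i theta_i."""
    return w0 + lam * np.sum(thetas, axis=0)

rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 3))  # toy "pre-trained" weights (assumption)
finetuned = [w0 + 0.1 * rng.normal(size=w0.shape) for _ in range(3)]

thetas = task_vectors(w0, finetuned)
w_avg = task_arithmetic(w0, thetas, lam=1.0 / len(thetas))

# With lam = 1/N the rule coincides with plain weight averaging of the W_i.
same_as_mean = np.allclose(w_avg, np.mean(finetuned, axis=0))
```

Here `same_as_mean` checks the statement that $\lambda = 1/N$ reduces task arithmetic to weight averaging.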

## 3 A Product-of-Experts (PoE) Perspective on Model Merging

In this section, we develop a new perspective for model merging in which task-specific fine-tuning modules can be interpreted as observations of a shared latent parameter that captures information common across tasks. The merged model can be viewed as a sample drawn from a product-of-experts (PoE) model combining these local energy-based densities (Section [3.1](https://arxiv.org/html/2607.01689#S3.SS1)). In this view, model merging can then be formulated as MAP inference, which also provides a unified probabilistic interpretation of existing merging rules. In particular, we show that weight averaging, Fisher-weighted averaging, and RegMean arise as special cases under specific choices of task-specific energy functions, thereby making their implicit distributional assumptions explicit (Section [3.2](https://arxiv.org/html/2607.01689#S3.SS2)). Section [4](https://arxiv.org/html/2607.01689#S4) then shows that these implicit assumptions are mismatched with the empirical deviations observed between candidate merged updates and task-specific modules, and introduces a redesign of the local energy functions to mitigate this mismatch.

### 3.1 PoE Formulation for Model Merging

Energy-based models (EBMs) [teh2003energy, lecun2006tutorial, song2021train] define probability distributions through an unnormalized function called an *energy* $E(\boldsymbol{\zeta})$, which assigns lower energy to more likely inputs $\boldsymbol{\zeta}$. The resulting distribution is given by

$$p(\boldsymbol{\zeta}) = Z^{-1} \cdot \exp\big(-E(\boldsymbol{\zeta})\big), \tag{5}$$

where $Z \triangleq \int \exp\big(-E(\boldsymbol{\zeta})\big)\, d\boldsymbol{\zeta}$ is the partition function, ensuring the distribution integrates to 1. Under this view, inputs with lower energies correspond to higher probability. This formulation only needs to specify compatibility via the energy without explicit normalization.

Now consider the problem of merging $N$ task-specific models. We assume that there exists an unknown latent parameter $\boldsymbol{\zeta}$ that is shared across tasks. Each task-specific fine-tuning module can then be viewed as providing noisy evidence about the latent shared parameter $\boldsymbol{\zeta}$. Therefore, we associate each task with an expert distribution over $\boldsymbol{\zeta}$, where higher probability corresponds to greater compatibility between the latent parameter and the task-specific update. We model this compatibility through an energy function, resulting in the following energy-based expert:

$$p_i(\boldsymbol{\zeta}) \propto \exp\big(-E_i(\boldsymbol{\zeta})\big), \tag{6}$$

where $E_i(\boldsymbol{\zeta})$ is an energy function determined by the task-specific parameter $\boldsymbol{\theta}_i$. The energy function is general and can be designed with specific desiderata. Local experts can also be combined naturally through multiplication of their densities, as established in [hinton2002training]. Essentially, each task defines its energy $E_i(\boldsymbol{\zeta})$ and, under the assumption that tasks are conditionally independent given $\boldsymbol{\zeta}$, the PoE distribution is given as

$$p(\boldsymbol{\zeta}) \propto \prod_{i=1}^{N} p_i(\boldsymbol{\zeta}) \propto \exp\left(-\sum_{i=1}^{N} E_i(\boldsymbol{\zeta})\right). \tag{7}$$

This results in the global energy $E(\boldsymbol{\zeta}) \triangleq \sum_{i=1}^{N} E_i(\boldsymbol{\zeta})$, such that $p(\boldsymbol{\zeta}) \propto \exp\big(-E(\boldsymbol{\zeta})\big)$. Combining task experts then corresponds to summing up their energies. The resulting PoE distribution thus concentrates probability mass on configurations $\boldsymbol{\zeta}$ that simultaneously achieve low energy for all experts. Under this view, model merging reduces to inference with an energy-based model over the shared parameter.

A natural way to obtain a merged model is to compute the maximum\-a\-posteriori \(MAP\) estimate:

$$\boldsymbol{\zeta}^{*} = \arg\max_{\boldsymbol{\zeta}} p(\boldsymbol{\zeta}) = \arg\min_{\boldsymbol{\zeta}} E(\boldsymbol{\zeta}) = \arg\min_{\boldsymbol{\zeta}} \sum_{i=1}^{N} E_i(\boldsymbol{\zeta}). \tag{8}$$

Interestingly, this formulation unifies existing averaging heuristics as cases of quadratic energies, as shown in Section [3.2](https://arxiv.org/html/2607.01689#S3.SS2). At the same time, it also allows for new, more flexible and robust designs of the energy function. The merging operation is thus tightly coupled with the design of the local energy functions, as shown in Eq. ([8](https://arxiv.org/html/2607.01689#S3.E8)). The MAP solution of the resulting PoE is found by minimizing the objective in Eq. ([8](https://arxiv.org/html/2607.01689#S3.E8)), resulting in a merged model corresponding to the most probable shared parameter.
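The equivalence between multiplying experts and summing energies, and between maximizing the density and minimizing the global energy, can be checked numerically on a toy problem. The sketch below uses a hypothetical one-dimensional setup with two quadratic energies evaluated on a grid; the centers, scales, and grid are illustrative assumptions only.

```python
import numpy as np

# Hypothetical 1-D setup: two experts centred at 0.0 and 2.0 with different scales.
grid = np.linspace(-3.0, 5.0, 8001)
centers = np.array([0.0, 2.0])
scales = np.array([1.0, 0.5])

# Quadratic energies E_i(zeta), one row per expert.
energies = 0.5 * ((grid[None, :] - centers[:, None]) / scales[:, None]) ** 2

# Product of unnormalised experts exp(-E_i) versus the summed (global) energy.
product_of_experts = np.prod(np.exp(-energies), axis=0)
global_energy = energies.sum(axis=0)

# MAP estimate on the grid, computed both ways as in Eq. (8).
zeta_map_from_density = grid[np.argmax(product_of_experts)]
zeta_map_from_energy = grid[np.argmin(global_energy)]
```

Both estimates land on the same grid point, which for these two quadratic energies sits at the precision-weighted mean of the two centers.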

### 3.2 Gaussian Experts Recover Existing Merging Rules

The key modeling aspect in the PoE-EBM framework is the design choice of the task-specific energy function $E_i(\boldsymbol{\zeta})$. Different choices encode different assumptions about how task-specific parameters deviate from the shared latent structure. These choices also lead to various merging designs, several of which are recovered below as special cases of PoE with Gaussian experts.

A. Gaussian Expert Formulation. Assume each task-specific parameter $\boldsymbol{\theta}_i$ is generated from the shared latent parameter $\boldsymbol{\zeta}$ under Gaussian noise:

$$\boldsymbol{\theta}_i = \boldsymbol{\zeta} + \boldsymbol{\epsilon}_i, \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_i). \tag{9}$$

This induces the likelihood over the latent $\boldsymbol{\zeta}$,

$$p_i(\boldsymbol{\zeta}) \propto \exp\Big(-\frac{1}{2}(\boldsymbol{\zeta} - \boldsymbol{\theta}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\boldsymbol{\zeta} - \boldsymbol{\theta}_i)\Big). \tag{10}$$

Thus, the corresponding energy function is quadratic:

$$E_i(\boldsymbol{\zeta}) = \frac{1}{2}(\boldsymbol{\zeta} - \boldsymbol{\theta}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\boldsymbol{\zeta} - \boldsymbol{\theta}_i). \tag{11}$$

The MAP solution in Eq. ([8](https://arxiv.org/html/2607.01689#S3.E8)) thus exhibits a closed form:

$$\boldsymbol{\zeta}^{*} = \Bigg(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1}\Bigg)^{-1} \Bigg(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\theta}_i\Bigg). \tag{12}$$
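For completeness, the closed form in Eq. (12) follows from setting the gradient of the summed quadratic energies in Eq. (11) to zero. The short derivation below assumes that each precision matrix $\boldsymbol{\Sigma}_i^{-1}$ is symmetric and that their sum is invertible (positive definite), in which case the stationary point is the unique minimizer.

```latex
\nabla_{\boldsymbol{\zeta}} \sum_{i=1}^{N} E_i(\boldsymbol{\zeta})
  = \sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} (\boldsymbol{\zeta} - \boldsymbol{\theta}_i)
  = \boldsymbol{0}
\quad\Longleftrightarrow\quad
\Bigg(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1}\Bigg) \boldsymbol{\zeta}
  = \sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\theta}_i
\quad\Longleftrightarrow\quad
\boldsymbol{\zeta}^{*}
  = \Bigg(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1}\Bigg)^{-1}
    \Bigg(\sum_{i=1}^{N} \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\theta}_i\Bigg).
```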
B. Recovering Existing Merging Methods. We now show that various classical merging methods are special cases of the above PoE with Gaussian experts, which exhibit quadratic local energy functions. In particular, we will show that the above MAP estimator admits several well-known model merging rules as special cases under different choices and structural assumptions on the precision matrices $\boldsymbol{\Sigma}_i^{-1}$.

1\. Uniform Averaging. Suppose all precision matrices are isotropic, $\boldsymbol{\Sigma}_i^{-1} = \boldsymbol{I}$. Then Eq. ([12](https://arxiv.org/html/2607.01689#S3.E12)) simplifies to

$$\boldsymbol{\zeta}^{\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\theta}_i, \tag{13}$$

which recovers uniform averaging [wortsman2022robust].

2\. Fisher-Weighted Averaging. Suppose the precision matrix of each task is chosen as its Fisher information matrix, $\boldsymbol{\Sigma}_i^{-1} = \boldsymbol{F}_i$. Then Eq. ([12](https://arxiv.org/html/2607.01689#S3.E12)) becomes

$$\boldsymbol{\zeta}^{\mathrm{FA}} = \left(\sum_{i=1}^{N} \boldsymbol{F}_i\right)^{-1} \left(\sum_{i=1}^{N} \boldsymbol{F}_i \boldsymbol{\theta}_i\right), \tag{14}$$

which recovers Fisher averaging [matena2022merging].

3\. RegMean. When the precision matrix is taken as the Gram matrix of empirical data,

$$\boldsymbol{\Sigma}_i^{-1} = \frac{1}{N_i} \boldsymbol{X}_i^{\top} \boldsymbol{X}_i, \tag{15}$$

the MAP estimator reduces to

$$\boldsymbol{\zeta}^{\mathrm{RM}} = \left(\sum_{i=1}^{N} \frac{1}{N_i} \boldsymbol{X}_i^{\top}\boldsymbol{X}_i\right)^{-1} \left(\sum_{i=1}^{N} \frac{1}{N_i} \boldsymbol{X}_i^{\top}\boldsymbol{X}_i \boldsymbol{\theta}_i\right), \tag{16}$$

which recovers RegMean [jin2022dataless].
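The special cases above can be reproduced with a single solver for Eq. (12). The following is a minimal sketch on toy vectors, assuming dense precision matrices small enough to store explicitly; the function name `gaussian_poe_map` and the random data standing in for layer inputs $\boldsymbol{X}_i$ are hypothetical and not part of the paper's code.

```python
import numpy as np

def gaussian_poe_map(thetas, precisions):
    """Eq. (12): solve (sum_i P_i) zeta = sum_i P_i theta_i for zeta."""
    lhs = sum(precisions)
    rhs = sum(p @ t for p, t in zip(precisions, thetas))
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
d, n_tasks = 5, 3
thetas = [rng.normal(size=d) for _ in range(n_tasks)]

# Identity precisions: the MAP estimate is the uniform average, Eq. (13).
zeta_avg = gaussian_poe_map(thetas, [np.eye(d)] * n_tasks)

# Gram-matrix precisions (1/N_i) X_i^T X_i: the RegMean form, Eq. (16).
# X_i here is random toy data standing in for layer inputs (assumption).
xs = [rng.normal(size=(20, d)) for _ in range(n_tasks)]
grams = [x.T @ x / x.shape[0] for x in xs]
zeta_rm = gaussian_poe_map(thetas, grams)

# Stationarity check: the summed energy gradient vanishes at the MAP estimate.
grad = sum(g @ (zeta_rm - t) for g, t in zip(grams, thetas))
```

Using `np.linalg.solve` avoids forming the inverse explicitly. For realistic model sizes a full precision matrix is typically too large to store, so practical variants usually rely on diagonal or per-layer structure.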

## 4 From Gaussian to Heavy-Tailed Expert Models

Section [3](https://arxiv.org/html/2607.01689#S3) shows that several existing merging methods can be recovered as PoE models with Gaussian experts. We now show that these Gaussian formulations impose light-tailed assumptions on directional residuals, which mismatch the heavy-tailed behavior observed in practice (Section [4.1](https://arxiv.org/html/2607.01689#S4.SS1)). To address this mismatch, we introduce a heavy-tailed PoE formulation based on Cauchy experts (Section [4.2](https://arxiv.org/html/2607.01689#S4.SS2)).

### 4.1 Limitations of Gaussian Experts

As shown above, the use of quadratic energies leads to EBMs with Gaussian shapes. Such Gaussian experts are closely related to the framework of [daheim2023model], which assumes a Gaussian prior over task-specific parameters. This is followed by a Laplace approximation to obtain the Gaussian posterior for the merged parameters. Under specific choices of the precision matrix, their formulation reduces to precision-weighted averaging and recovers standard schemes such as weight averaging and Fisher-weighted averaging. However, such Gaussian structures often do not sufficiently capture the tail behaviors of the merged models due to their fast-decaying tails. This represents a structural mismatch according to our empirical findings in Fig. [1](https://arxiv.org/html/2607.01689#S1.F1), which shows heavy-tailed behavior of the merged models.

This can be seen by analyzing the *directional residual* of the merged model with respect to each task-specific module. Given a candidate merged parameter $\boldsymbol{\zeta}$ and the $i$-th fine-tuning module $\boldsymbol{\theta}_i$, the directional residual is defined as

$$r_i(\boldsymbol{\zeta}) \triangleq (\boldsymbol{\zeta} - \boldsymbol{\theta}_i)^{\top} \boldsymbol{\theta}_i. \tag{17}$$

Intuitively, this computes the difference $(\boldsymbol{\zeta} - \boldsymbol{\theta}_i)$ between the merged and individual models, which is then projected onto the direction of the task update $\boldsymbol{\theta}_i$. The result measures how far the merged solution has drifted from the task-specific module along that task's preferred direction. Under a Gaussian model $\boldsymbol{\zeta} - \boldsymbol{\theta}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_i)$, each directional residual is a linear function of a Gaussian vector and thus follows a Gaussian $r_i(\boldsymbol{\zeta}) \sim \mathcal{N}(0, \boldsymbol{\theta}_i^{\top} \boldsymbol{\Sigma}_i \boldsymbol{\theta}_i)$.
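The directional residual in Eq. (17) is a single inner product on flattened parameters. The sketch below computes it for toy task updates, using uniform averaging as the candidate merge; the helper name `directional_residual` and the random vectors are illustrative assumptions.

```python
import numpy as np

def directional_residual(zeta, theta):
    """Eq. (17): r = (zeta - theta)^T theta, on flattened parameters."""
    zeta, theta = np.ravel(zeta), np.ravel(theta)
    return float((zeta - theta) @ theta)

rng = np.random.default_rng(0)
thetas = [rng.normal(size=8) for _ in range(3)]  # toy task updates (assumption)
zeta = np.mean(thetas, axis=0)  # uniform averaging as the candidate merge

residuals = [directional_residual(zeta, t) for t in thetas]

# A merge that equals a task update has zero residual for that task.
zero_residual = directional_residual(thetas[0], thetas[0])
```

Note that the residual is not normalized by $\lVert\boldsymbol{\theta}_i\rVert$, so its magnitude scales with the size of the task update as well as with the drift.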

As a result, existing choices of Gaussian experts, and by extension the classical merging methods they recover, implicitly impose light-tailed assumptions on the residuals. However, a closer inspection of the directional residuals of different merging methods reveals an intrinsic heavy-tailed behavior that stands in contrast to these Gaussian assumptions. In Fig. [1](https://arxiv.org/html/2607.01689#S1.F1), we plot the empirical density of the directional residuals across all tasks and layers when merging 7 (fully) fine-tuned ViT-L/14 models and overlay Gaussian and heavy-tailed Cauchy distributions for comparison. The slow decay in the tail region deviates from Gaussian behavior and follows the Cauchy distribution more closely. In particular, large residuals occur more frequently than would be predicted under a Gaussian assumption.
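To give a feel for how different the two tail assumptions are, the following sketch compares two-sided tail probabilities of a standard Gaussian and a standard Cauchy distribution using their closed-form CDFs. The unit scales are an illustrative assumption and are not fitted to any residuals from the paper.

```python
import math

def gaussian_tail(k):
    """P(|r| > k) for a standard Gaussian (zero mean, unit variance)."""
    return math.erfc(k / math.sqrt(2.0))

def cauchy_tail(k):
    """P(|r| > k) for a standard Cauchy (zero location, unit scale)."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

tails = {k: (gaussian_tail(k), cauchy_tail(k)) for k in (1.0, 3.0, 5.0, 10.0)}
for k, (g, c) in tails.items():
    print(f"k={k:>4}: Gaussian {g:.3e}  Cauchy {c:.3e}")
```

The Gaussian tail shrinks faster than exponentially in $k$, whereas the Cauchy tail decays only polynomially (roughly like $1/k$), which is why a Gaussian expert treats large residuals as far more surprising than a Cauchy expert does.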

### 4.2 Heavy-Tailed Expert Models

To capture this inherent heavy-tailed residual distribution, we adopt a robust energy function whose logarithmic growth reduces the influence of task-specific updates with large directional residuals. The negative log-density of a Cauchy distribution [liu2014robustness] naturally exhibits this behavior, motivating the following Cauchy expert design:

$$E_i^{\mathrm{Cauchy}}(\boldsymbol{\zeta}) \triangleq \log\left(1 + \frac{r_i(\boldsymbol{\zeta})^2}{\gamma^2}\right), \tag{18}$$

where $r_i(\boldsymbol{\zeta})$ denotes the directional residual at task-specific parameter $\boldsymbol{\theta}_i$ and $\gamma > 0$ is a user-defined scale controlling the tail heaviness of the distribution. This energy function induces the following (unnormalized) Cauchy density:

$$p_i(\boldsymbol{\zeta}) \propto \exp\Big(-E_i^{\mathrm{Cauchy}}(\boldsymbol{\zeta})\Big) = \frac{\gamma^2}{\gamma^2 + r_i(\boldsymbol{\zeta})^2}, \tag{19}$$

which is heavy-tailed. Under the conditional independence assumption across tasks, the PoE-EBM posterior factorizes according to Eq. ([7](https://arxiv.org/html/2607.01689#S3.E7)), leading to the following global energy:

$$E^{\mathrm{Cauchy}}(\boldsymbol{\zeta}) = \sum_{i} \log\Big(\gamma^2 + r_i(\boldsymbol{\zeta})^2\Big) + \mathcal{C}, \tag{20}$$

where $\mathcal{C}$ is a constant independent of $\boldsymbol{\zeta}$. This defines the Cauchy score as the negative of the global energy gradient:

$$S^{\mathrm{Cauchy}}(\boldsymbol{\zeta}) \triangleq -\nabla_{\boldsymbol{\zeta}} E^{\mathrm{Cauchy}} = -\sum_{i=1}^{N} \frac{2\, r_i(\boldsymbol{\zeta})}{\gamma^2 + r_i(\boldsymbol{\zeta})^2} \nabla_{\boldsymbol{\zeta}} r_i(\boldsymbol{\zeta}), \tag{21}$$

where the residual gradient is $\nabla_{\boldsymbol{\zeta}} r_i(\boldsymbol{\zeta}) = \boldsymbol{\theta}_i$. We further provide insights into the Cauchy score and the connection between Cauchy and Gaussian experts in Appendix [D](https://arxiv.org/html/2607.01689#A4).
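The global Cauchy energy in Eq. (20) and the score in Eq. (21) are straightforward to evaluate once the residuals are available. The sketch below implements both on toy data and compares the analytic score against central finite differences of the energy. The function names, the random task updates, and the value $\gamma = 1$ are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def residuals(zeta, thetas):
    """r_i(zeta) = (zeta - theta_i)^T theta_i for every row theta_i of thetas."""
    return np.einsum("nd,nd->n", zeta[None, :] - thetas, thetas)

def cauchy_energy(zeta, thetas, gamma):
    """Eq. (20) without the additive constant: sum_i log(gamma^2 + r_i^2)."""
    r = residuals(zeta, thetas)
    return float(np.sum(np.log(gamma**2 + r**2)))

def cauchy_score(zeta, thetas, gamma):
    """Eq. (21): minus the energy gradient, using grad r_i = theta_i."""
    r = residuals(zeta, thetas)
    weights = 2.0 * r / (gamma**2 + r**2)
    return -(weights[:, None] * thetas).sum(axis=0)

rng = np.random.default_rng(0)
thetas = rng.normal(size=(4, 6))  # 4 toy task updates of dimension 6 (assumption)
zeta = rng.normal(size=6)
gamma = 1.0

# Central finite differences of the energy, to compare against the score.
eps = 1e-6
fd_grad = np.array([
    (cauchy_energy(zeta + eps * e, thetas, gamma)
     - cauchy_energy(zeta - eps * e, thetas, gamma)) / (2 * eps)
    for e in np.eye(6)
])
max_gap = float(np.max(np.abs(cauchy_score(zeta, thetas, gamma) + fd_grad)))
```

The per-task coefficient $2r_i/(\gamma^2 + r_i^2)$ is bounded in magnitude by $1/\gamma$ and shrinks as $|r_i|$ grows, whereas the corresponding coefficient of a quadratic energy in the residual would grow linearly. This bounded influence is what makes the Cauchy expert robust to tasks with large residuals.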

## 5 MAP Inference Algorithm and Convergence Guarantee

We will now develop a practical MAP inference algorithm for the previously established heavy\-tailed PoE model\. We first derive a fixed\-point characterization of the optimal merged update and then use it to obtain an iterative procedure with a convergence guarantee\.

As shown in the general PoE-EBM setting in Section [3.1](https://arxiv.org/html/2607.01689#S3.SS1), merging corresponds to computing the MAP estimator in Eq. ([8](https://arxiv.org/html/2607.01689#S3.E8)). For Cauchy experts with global Cauchy energy in Eq. ([20](https://arxiv.org/html/2607.01689#S4.E20)), the inference task then becomes

$$\boldsymbol{\zeta}^{*} = \arg\min_{\boldsymbol{\zeta}}\sum_{i=1}^{N}\log\Big(\gamma^{2}+r_{i}(\boldsymbol{\zeta})^{2}\Big). \tag{22}$$

Unlike the quadratic case in Eq. ([12](https://arxiv.org/html/2607.01689#S3.E12)), this optimization loss is nonconvex and does not admit a closed-form minimizer. However, we can exploit its structure to derive an explicit optimality characterization. In particular, the MAP estimator can be characterized as the solution of a nonlinear fixed-point equation whose mapping is available in closed form, revealing that robust model merging amounts to a residual-dependent weighted consensus among task-specific models. The following theorem formalizes this characterization of the MAP solution.

###### Theorem 5.1 (Closed-form characterization of MAP).

Let $\{\boldsymbol{\theta}_{i}\}_{i=1}^{N}$ be a set of task-specific fine-tuning modules, and define the following auxiliary function:

$$u_{i}(\boldsymbol{\zeta}) = \left(\big[(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\big]^{2}+\gamma^{2}\right)^{-1} \quad \text{with} \quad \gamma>0. \tag{23}$$

It then follows that any stationary point $\boldsymbol{\zeta}^{*}$ of the MAP loss in Eq. ([22](https://arxiv.org/html/2607.01689#S5.E22)) satisfies $\boldsymbol{\zeta}^{*}=F(\boldsymbol{\zeta}^{*})$ with a closed-form mapping $F$ defined below:

$$\begin{aligned} F(\cdot) &\triangleq \left(\sum_{i=1}^{N}u_{i}(\cdot)\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}+\eta\boldsymbol{I}\right)^{-1}\left(\sum_{i=1}^{N}u_{i}(\cdot)\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}\right) \\ &\triangleq \boldsymbol{H}(\cdot)^{-1}\boldsymbol{b}(\cdot), \end{aligned} \tag{24}$$

where we define $\boldsymbol{H}(\cdot)\triangleq\sum_{i=1}^{N}u_{i}(\cdot)\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}+\eta\boldsymbol{I}$ and $\boldsymbol{b}(\cdot)\triangleq\sum_{i=1}^{N}u_{i}(\cdot)\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}$ for ease of notation. Here, $\eta>0$ is a conditioning hyper-parameter that ensures the matrix inverse in the definition of $F$ is well-defined.
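To make the mapping in Eq. (24) concrete, here is a minimal NumPy sketch of a single application of $F$. It assumes flattened task vectors stacked as rows of an array; the function name `fixed_point_map` and the toy input are hypothetical and chosen only for illustration.

```python
import numpy as np

def fixed_point_map(zeta, thetas, gamma, eta):
    """One application of F in Eq. (24): F(zeta) = H(zeta)^{-1} b(zeta)."""
    d = thetas.shape[1]
    # Residuals r_i = (zeta - theta_i)^T theta_i and weights u_i from Eq. (23).
    r = np.einsum("nd,nd->n", zeta[None, :] - thetas, thetas)
    u = 1.0 / (r**2 + gamma**2)
    # H = sum_i u_i theta_i theta_i^T + eta * I
    H = (thetas.T * u) @ thetas + eta * np.eye(d)
    # b = sum_i u_i theta_i (theta_i^T theta_i)
    sq_norms = np.sum(thetas**2, axis=1)
    b = thetas.T @ (u * sq_norms)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(H, b)

# Single hypothetical task vector with squared norm 9.
theta = np.array([[1.0, 2.0, 2.0]])
out = fixed_point_map(np.zeros(3), theta, gamma=1.0, eta=0.1)
print(out / theta[0])  # all entries equal: the output is a rescaled theta
```

With a single task, $F(\boldsymbol{\zeta})$ reduces to $c\,\boldsymbol{\theta}_1$ with $c = u_1\|\boldsymbol{\theta}_1\|^2 / (u_1\|\boldsymbol{\theta}_1\|^2 + \eta)$, which shows how the conditioning term $\eta$ shrinks the output slightly toward zero.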

Under mild boundedness assumptions on the fine-tuning modules (see Assumption [5.2](https://arxiv.org/html/2607.01689#S5.Thmtheorem2)), Theorem [5.4](https://arxiv.org/html/2607.01689#S5.Thmtheorem4) further shows that $F$ is a contractive map with a unique fixed point. Consequently, repeated application of $F$ converges to this fixed point, which corresponds to the MAP estimate (i.e., the optimal merged solution) of PoE-EBM. This forms the basis of our merging algorithm (see Algorithm [1](https://arxiv.org/html/2607.01689#alg1)).

###### Assumption 5.2 (Bounded fine-tuning modules).

There exists a constant $M>0$ such that $\|\boldsymbol{\theta}_{i}\|_{2}\leq M,\ \forall i\in[N]$.

We note that in fine-tuning regimes, task-specific parameters are fine-tuned from a common pre-trained model with common practices such as $\ell_{2}$ regularization with small learning rates or constraining the rank of the fine-tuning modules. Consequently, these modules often remain within a bounded region of the parameter space.

To validate this assumption, we compute the $\ell_{2}$ norm of fully fine-tuned task vectors across vision tasks and their ratios relative to the pretrained model. As shown in Table [6](https://arxiv.org/html/2607.01689#A6.T6), the ratio is consistently in the range of $0.6\%-0.8\%$ across all tasks and models. Motivated by this empirical observation, we restrict our analysis to a local neighborhood around the fixed point $\boldsymbol{\zeta}^{*}$ and assume that all iterates remain within a unit ball centered at $\boldsymbol{\zeta}^{*}$, as formally stated below.

###### Assumption 5.3 (Local Contraction Neighborhood).

Let $\boldsymbol{\zeta}^{\ast}$ be a fixed point of $F$. The iterates $\{\boldsymbol{\zeta}^{(k)}\}_{k}$ generated by $F$ are assumed to remain in a closed unit ball around $\boldsymbol{\zeta}^{\ast}$:

$$\forall k\ :\ \boldsymbol{\zeta}^{(k)}\in B_{1}(\boldsymbol{\zeta}^{\ast}) \triangleq \big\{\boldsymbol{\zeta}\;:\;\|\boldsymbol{\zeta}-\boldsymbol{\zeta}^{\ast}\|\leq 1\big\}. \tag{25}$$

Given the above assumptions, we can now show that $F$ is contractive via Theorem [5.4](https://arxiv.org/html/2607.01689#S5.Thmtheorem4) below.

###### Theorem 5.4 (Contraction of $F$).

Under Assumptions [5.2](https://arxiv.org/html/2607.01689#S5.Thmtheorem2) and [5.3](https://arxiv.org/html/2607.01689#S5.Thmtheorem3), it follows that $F$ is a contractive mapping where $\|F(\boldsymbol{\zeta}^{\ast})-\boldsymbol{\zeta}^{(k+1)}\|=\|F(\boldsymbol{\zeta}^{\ast})-F(\boldsymbol{\zeta}^{(k)})\|\leq L\|\boldsymbol{\zeta}^{\ast}-\boldsymbol{\zeta}^{(k)}\|$ with $L$ being a provably small Lipschitz constant,

$$L \triangleq 2M\sum_{i=1}^{N}\left\|\boldsymbol{H}^{-1}\boldsymbol{J}_{i}(\boldsymbol{\zeta}^{*})\right\| \leq 1, \tag{26}$$

where $\boldsymbol{H}$ is previously defined in Theorem [5.1](https://arxiv.org/html/2607.01689#S5.Thmtheorem1) and $\boldsymbol{J}_{i}(\boldsymbol{\zeta}^{*})\triangleq u_{i}(\boldsymbol{\zeta}^{*})\Big(\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}-\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}F(\boldsymbol{\zeta}^{*})\Big)$.

Due to limited space, the proof of Theorem [5.4](https://arxiv.org/html/2607.01689#S5.Thmtheorem4) is deferred to Appendix [B](https://arxiv.org/html/2607.01689#A2). We also show empirically in Fig. [4](https://arxiv.org/html/2607.01689#A2.F4) that $L<1$ in all experiments, confirming that $F$ is indeed a contractive map in a neighborhood of the true MAP $\boldsymbol{\zeta}^{*}$. This leads to a practical merging procedure that iterates $F$, given in Algorithm [1](https://arxiv.org/html/2607.01689#alg1). Its complexity analysis is provided in Appendix [C](https://arxiv.org/html/2607.01689#A3).

Algorithm 1 PoE-EBM merging

1: INPUT: Backbone model $\boldsymbol{W}_{0}$, fine-tuning modules $\{\boldsymbol{\theta}_{i}\}_{i=1}^{N}$, conditioning parameter $\eta>0$, scaling coefficient $\lambda$, and number of iterations $T$.

2: OUTPUT: Merged module $\boldsymbol{\zeta}$, merged model $\boldsymbol{W}_{m}$.

3: Initialize $\boldsymbol{\zeta}^{(0)}$

4: for $k=0$ to $T-1$ do

5: for $i=1$ to $N$ do

6: Compute residual $r_{i}^{(k)}\leftarrow(\boldsymbol{\zeta}^{(k)}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}$

7: Compute $u_{i}^{(k)}\leftarrow 1/\big((r_{i}^{(k)})^{2}+\gamma\|\boldsymbol{\theta}_{i}\|^{2}\big)$

8: end for

9: Compute $\boldsymbol{H}^{(k)}=\boldsymbol{H}(\boldsymbol{\zeta}^{(k)}),\ \boldsymbol{b}^{(k)}=\boldsymbol{b}(\boldsymbol{\zeta}^{(k)})$ via Eq. ([24](https://arxiv.org/html/2607.01689#S5.E24))

10: Compute new iterate $\boldsymbol{\zeta}^{(k+1)}=\big(\boldsymbol{H}^{(k)}\big)^{-1}\boldsymbol{b}^{(k)}$

11: end for

12: return $\boldsymbol{\zeta}\triangleq\boldsymbol{\zeta}^{(T)}$, $\boldsymbol{W}_{m}=\boldsymbol{W}_{0}+\lambda\boldsymbol{\zeta}^{(T)}$
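The loop in Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation, and it makes several assumptions that the text does not fix: modules are flattened vectors stacked as rows, the iterate is initialized at the mean of the task vectors, and the weights $u_i$ follow Eq. (23) with $\gamma^2$ in the denominator (step 7 of Algorithm 1 instead scales $\gamma$ by $\|\boldsymbol{\theta}_i\|^2$). The function name and default values are hypothetical.

```python
import numpy as np

def poe_ebm_merge(thetas, gamma=1.0, eta=1e-2, num_iters=10, zeta_init=None):
    """Fixed-point iteration zeta <- F(zeta); returns the final iterate and
    the update magnitudes ||zeta^(k+1) - zeta^(k)||_2 for each iteration."""
    d = thetas.shape[1]
    sq_norms = np.sum(thetas**2, axis=1)
    # Assumed initialization: simple average of the task vectors.
    zeta = thetas.mean(axis=0) if zeta_init is None else zeta_init.copy()
    deltas = []
    for _ in range(num_iters):
        r = np.einsum("nd,nd->n", zeta[None, :] - thetas, thetas)  # step 6
        u = 1.0 / (r**2 + gamma**2)                                # Eq. (23)
        H = (thetas.T * u) @ thetas + eta * np.eye(d)              # step 9
        b = thetas.T @ (u * sq_norms)
        new_zeta = np.linalg.solve(H, b)                           # step 10
        deltas.append(float(np.linalg.norm(new_zeta - zeta)))
        zeta = new_zeta
    return zeta, deltas

rng = np.random.default_rng(0)
thetas = 0.1 * rng.normal(size=(3, 8))   # hypothetical small task vectors
zeta_hat, deltas = poe_ebm_merge(thetas, gamma=1.0, eta=1e-2, num_iters=10)
print(deltas)

# Merged weights as in step 12, with a hypothetical backbone and scaling.
W0 = np.zeros(8)
lam = 1.0
W_merged = W0 + lam * zeta_hat
```

The returned `deltas` list is the same quantity plotted in the convergence figure, so it offers a simple way to monitor whether the iteration has stabilized for a given layer.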

## 6 Experiments and Results

In this section, we validate the effectiveness of our framework on diverse empirical settings covering both vision and language tasks. We describe the setup of our experiments in Section [6.1](https://arxiv.org/html/2607.01689#S6.SS1) and provide the main results in Section [6.2](https://arxiv.org/html/2607.01689#S6.SS2). We also provide additional ablation analysis in Section [6.4](https://arxiv.org/html/2607.01689#S6.SS4).

For clarity, we use the following highlighting convention: (1) best accuracy is bolded; (2) second best accuracy is underlined; and (3) task-specific fine-tuning accuracy (as performance upper bound) is colored blue.

Table 1: Multi-task performance comparison when merging ViT-B/32 (fully finetuned) across 7 vision benchmarks (absolute accuracy). Task-specific finetuning accuracy, shown in blue, is the accuracy upper bound.

### 6.1 Experimental setup

Tasks. We evaluate on both vision and language benchmarks. For vision, we consider a 7-task benchmark and an extended 13-task benchmark. For language tasks, we use 8 datasets from the GLUE benchmark [wang2018glue].

Models. For vision tasks, we merge CLIP ViT-B/32 and ViT-L/14 models [radford2021learning] under both fully fine-tuned and LoRA fine-tuned settings. For language tasks, we merge LoRA fine-tuned Flan-T5-base and Flan-T5-large.

Model Merging Methods. We compare our PoE-EBM against common model merging baselines including Weight Averaging [wortsman2022robust], Task Arithmetic [ilharco2022editing], and DARE-TIES [yadav2023ties,yu2024language]. We also compare with recent subspace-based and optimization-based merging methods such as Concrete [tang2023concrete], KnOTS [stoica2024model] (for LoRA-fine-tuned vision models), and DOGE [wei2025modeling].

Metrics. We report absolute accuracy and normalized accuracy (relative to each task's fine-tuned model). Full experimental details are provided in Appendix [F](https://arxiv.org/html/2607.01689#A6).
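The two metrics can be illustrated with a small sketch. The exact definition is deferred to the appendix, so the code below assumes the common convention in which normalized accuracy is the merged model's accuracy divided by the corresponding fine-tuned model's accuracy, computed per task and then averaged. The function name and all numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def normalized_accuracy(merged_acc, finetuned_acc):
    """Per-task normalized accuracy (%), assuming the ratio convention
    merged accuracy / task-specific fine-tuned accuracy."""
    merged = np.asarray(merged_acc, dtype=float)
    finetuned = np.asarray(finetuned_acc, dtype=float)
    return 100.0 * merged / finetuned

# Hypothetical absolute accuracies (%) on two tasks.
merged = [70.0, 90.0]
finetuned = [80.0, 95.0]

per_task = normalized_accuracy(merged, finetuned)
print(per_task, per_task.mean())
```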

![Refer to caption](https://arxiv.org/html/2607.01689v1/x2.png)

Figure 2: Plots of average directional residual over tasks (log-scale) at every layer weight in ViT-B-32 (left column) and ViT-L-14 (right column) incurred by different merging methods. Top row: merging fully fine-tuned models. Bottom row: merging LoRA fine-tuned models. It can be observed that PoE-EBM consistently achieves lower directional residual values at every layer than other merging methods.

### 6.2 Model Merging for Vision Tasks

We present the results for merging fully-finetuned ViT-B-32 and ViT-L-14 models on 7 vision benchmarks in Table [1](https://arxiv.org/html/2607.01689#S6.T1) and Table [8](https://arxiv.org/html/2607.01689#A8.T8), respectively. We also present the results for LoRA-finetuned ViT-L-14 models in Table [12](https://arxiv.org/html/2607.01689#A8.T12).

PoE-EBM consistently outperforms both simple weight averaging and task arithmetic baselines on every individual task. More recent alignment-based methods, such as Concrete and DOGE, yield further improvements over task arithmetic; however, their average accuracies remain lower than that of our approach. In general, PoE-EBM consistently achieves best or second best accuracies on individual tasks and best accuracy on average.

We note that while DOGE-TA outperforms PoE-EBM on a subset of tasks, these gains come at the expense of substantial degradation on others. For instance, when merging ViT-B/32 and ViT-L/14 models, DOGE-TA achieves absolute accuracies of $64.31$ and $72.23$ on DTD, respectively, which are markedly lower than those obtained by PoE-EBM ($76.06$ and $83.56$). This behavior highlights that our approach yields a more favorable trade-off across tasks, maintaining strong general performance without disproportionately sacrificing any individual task. PoE-EBM also demonstrates strong robustness as the number of tasks increases. As shown in Table [2](https://arxiv.org/html/2607.01689#S6.T2), our framework achieves substantial performance gains over all baselines when merging a larger set of tasks, for both ViT-B/32 and ViT-L/14 models.

Table [12](https://arxiv.org/html/2607.01689#A8.T12) shows that under LoRA-fine-tuned settings, our merging framework substantially outperforms task arithmetic and weight averaging variants and achieves better performance than LoRA-specific merging methods such as KnOTS-TIES and KnOTS-DARE-TIES [stoica2024model]. Taken together, these results demonstrate that PoE-EBM consistently generalizes across model scales, numbers of merged tasks, and fine-tuning regimes, without requiring architecture-specific modifications.

Table 2: Average performance comparison when merging fully finetuned ViT-B/32 models and ViT-L/14 models across 13 vision benchmarks. Task-specific fine-tuning accuracy, shown in blue, is the accuracy upper bound.

Table 3: Multi-task performance comparison when merging Flan-T5-base models (LoRA-fine-tuned) across 8 GLUE benchmarks. Task-specific fine-tuning accuracy is colored blue, which indicates the upper bound on performance accuracy for model merging. It can be observed that PoE-EBM achieves the best rank among the considered methods across tasks.

### 6.3 Model Merging for Language Tasks

We now present results on eight datasets from the GLUE benchmark using Flan-T5-base and Flan-T5-large models in Tables [3](https://arxiv.org/html/2607.01689#S6.T3) and [9](https://arxiv.org/html/2607.01689#A8.T9). As noted in [tang2023concrete], pretrained text-generation models already exhibit strong inherent multitask capabilities, which can limit the extent of gains achievable through task-specific fine-tuning. Despite this, PoE-EBM achieves the best average generation performance across tasks on both model architectures, indicating that our framework remains effective even in regimes where performance improvements are intrinsically constrained.

Scaling to larger LLMs. We further evaluate PoE-EBM by merging three 7B-parameter models: Vicuna-7B, Llama-2-Coder, and WizardMath, all fine-tuned from the same Llama-2-7B backbone. The merged model is evaluated on GSM8K math problems. Despite operating on multi-billion-parameter models, PoE-EBM requires only 122 seconds on 2$\times$ A100 GPUs, demonstrating that its computational cost is practical. In addition, PoE-EBM achieves the highest accuracy among all compared methods, outperforming both individual expert models and strong merging baselines. As shown in Table [4](https://arxiv.org/html/2607.01689#S6.T4), the merged model surpasses the strongest expert (WizardMath) by more than 4 percentage points, indicating that PoE-EBM can effectively combine complementary capabilities from specialized models.

Table 4: Accuracy (%) on GSM8K dataset after merging three 7B-parameter models (Vicuna-7B, WizardMath, and Llama-2-Coder) fine-tuned from a shared Llama-2-7B backbone. PoE-EBM achieves the highest accuracy, outperforming both individual expert models and standard merging baselines (Task Arithmetic, Weight Averaging).

### 6.4 Analysis and Ablations

We provide ablation studies on the averaged directional residuals incurred by PoE-EBM and other merging baselines, the empirical convergence of the fixed point map, and runtime analysis of PoE-EBM. Additional ablation on the performance sensitivity with respect to the scaling parameter $\gamma$ is provided in Appendix [G](https://arxiv.org/html/2607.01689#A7).

![Refer to caption](https://arxiv.org/html/2607.01689v1/x3.png)

Figure 3: Convergence of the fixed point iteration method (see Algorithm [1](https://arxiv.org/html/2607.01689#alg1)) across layers of the merged ViT-L-14 model on 7 vision benchmarks using our PoE-EBM framework with heavy-tailed experts. We visualize $\|\boldsymbol{\zeta}^{(k+1)}-\boldsymbol{\zeta}^{(k)}\|_{2}$ across 10 iterations. Our results show rapid convergence to the MAP solution.

Residual values across layers. To compare directional alignment of different merging methods, we plot the task-average squared residual value

$$\mathrm{Mean}\Big(\|(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\|^{2}\Big),$$

where $\boldsymbol{\zeta}^{*}$ is the merged model at each layer of both fully and LoRA fine-tuned ViT-B/32 and ViT-L/14 models, and $\mathrm{Mean}(\cdot)$ represents the average taken across tasks. The results are shown in Figure [2](https://arxiv.org/html/2607.01689#S6.F2). Across all settings, PoE-EBM consistently achieves lower residual magnitude, indicating better directional alignment than other approaches. Notably, other methods exhibit pronounced outlier residuals at certain layers. For example, it can be observed that the residual magnitude is abnormally large at layer 5 of the LoRA-fine-tuned ViT-L/14 in Figure [2](https://arxiv.org/html/2607.01689#S6.F2). In contrast, PoE-EBM maintains low residual values across all layers, reflecting more stable and robust alignment behavior.
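The residual metric above is straightforward to compute for a given merged vector. The sketch below assumes flattened task vectors, so that each residual is a scalar; the function name and the toy inputs are hypothetical. It compares two simple merges of orthogonal task vectors: their sum and their average.

```python
import numpy as np

def mean_squared_residual(zeta, thetas):
    """Task-average of the squared directional residual
    ((zeta - theta_i)^T theta_i)^2 for flattened task vectors."""
    r = np.einsum("nd,nd->n", zeta[None, :] - thetas, thetas)
    return float(np.mean(r**2))

# Two hypothetical orthogonal task vectors.
thetas = np.array([[1.0, 0.0],
                   [0.0, 2.0]])

summed = thetas.sum(axis=0)      # sum of task vectors
averaged = thetas.mean(axis=0)   # plain average of task vectors

print(mean_squared_residual(summed, thetas))
print(mean_squared_residual(averaged, thetas))
```

In this toy case the sum has zero residual for every task because the task vectors are orthogonal, while the average shrinks each direction and therefore leaves a nonzero residual.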

Empirical convergence of the fixed point map. We empirically validate the convergence of the fixed point iteration algorithm (see Algorithm [1](https://arxiv.org/html/2607.01689#alg1)) to compute the MAP point of our PoE-EBM. We report the merging results with ViT-L-14 models on 7 vision tasks. As shown in Figure [3](https://arxiv.org/html/2607.01689#S6.F3) in Appendix [G](https://arxiv.org/html/2607.01689#A7), the sequence $\{\boldsymbol{\zeta}^{(k)}\}$ converges rapidly across all examined layers, with the update magnitude $\|\boldsymbol{\zeta}^{(k+1)}-\boldsymbol{\zeta}^{(k)}\|_{2}$ decreasing sharply within the first few iterations and stabilizing thereafter. This convergence pattern indicates that the contracting map admits a solution and the resulting estimate corresponds to the MAP solution under our probabilistic formulation. Importantly, the convergence behavior is uniform across layers, implying that the optimization landscape induced by PoE-EBM is well-conditioned in practice and that the algorithm is reliable for large-scale model merging.

Runtime analysis. We also evaluate the efficiency of PoE-EBM by measuring the wall-clock runtime of the merging procedure across different tasks and model scales. As reported in Table [7](https://arxiv.org/html/2607.01689#A7.T7), the proposed merging procedure completes in under one minute for all models considered. For the largest model, Flan-T5-large with 0.7B parameters, the full-layer merging process takes only $59.19\,\mathrm{s}$. The detailed complexity analysis is deferred to Appendix [C](https://arxiv.org/html/2607.01689#A3).

## 7 Conclusion

We revisit model merging through the lens of MAP inference in the parameter space under a product\-of\-experts \(PoE\) energy\-based model\. We showed that many existing methods can be rediscovered from this new perspective\. It enriches the solution space for model merging and at the same time exposes critical limitations of existing work\. In this new view, existing merging methods can be interpreted as imposing a Gaussian structure with light tails on the directional residual between the merged and individual models\. To address this limitation, we introduce Cauchy\-based experts that better capture the observed heavy\-tailed behavior, resulting in improved performance\. We developed practical algorithms to perform MAP inference on the new PoE design and proved convergence guarantees\.

## Acknowledgement

This work utilized GPU compute resources at SDSC and ACES through allocation CIS230391 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support (ACCESS) program [ACCESS-resource], which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. Trong Nghia Hoang is supported by the National Science Foundation CAREER Award IIS-2544071.

## References

## Appendix A Proof for Theorem [5.1](https://arxiv.org/html/2607.01689#S5.Thmtheorem1)

Consider the MAP loss under PoE-EBM ([22](https://arxiv.org/html/2607.01689#S5.E22)):

$$\boldsymbol{\zeta}^{*} = \arg\min_{\boldsymbol{\zeta}}\ \sum_{i=1}^{N}\log\Big(r_{i}(\boldsymbol{\zeta})^{2}+\gamma^{2}\Big) = \arg\min_{\boldsymbol{\zeta}}\ \sum_{i=1}^{N}\log\Big(\big[(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\big]^{2}+\gamma^{2}\Big). \tag{27}$$

We write the squared residual term as

$$r_{i}(\boldsymbol{\zeta})^{2} = \big[(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\big]^{2} = (\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i}). \tag{28}$$

A necessary condition for optimality is that the gradient of the global energy vanishes at $\boldsymbol{\zeta}^{*}$. This implies the Cauchy score $S^{Cauchy}(\boldsymbol{\zeta}^{*})=0$, resulting in the following stationary equation:

$$\sum_{i=1}^{N}\frac{2r_{i}(\boldsymbol{\zeta}^{*})}{\gamma^{2}+r_{i}(\boldsymbol{\zeta}^{*})^{2}}\boldsymbol{\theta}_{i} = 0. \tag{29}$$

Using $r_{i}(\boldsymbol{\zeta}^{*})\boldsymbol{\theta}_{i}=\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})$ together with ([28](https://arxiv.org/html/2607.01689#A1.E28)) and the definition of $u_{i}$ in Theorem [5.1](https://arxiv.org/html/2607.01689#S5.Thmtheorem1), this can be rewritten as

$$-S^{Cauchy}(\boldsymbol{\zeta}^{*}) = \sum_{i=1}^{N}\frac{2\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})}{(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})+\gamma^{2}} = 2\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i}) = 0. \tag{30}$$

Solving for $\boldsymbol{\zeta}^{*}$ thus reveals the fixed-point equation as desired,

$$\boldsymbol{\zeta}^{*} = \left(\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\right)^{\dagger}\left(\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}\right). \tag{31}$$
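The step from the stationarity condition to the fixed-point equation can be spelled out as follows. This is a reading of the derivation rather than part of the original proof; the shorthand $\boldsymbol{A}$ is introduced here only for illustration. Moving the terms in $\boldsymbol{\theta}_{i}$ to the right-hand side gives a linear system in $\boldsymbol{\zeta}^{*}$ once the weights $u_{i}(\boldsymbol{\zeta}^{*})$ are held fixed:

$$\underbrace{\left(\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\right)}_{\boldsymbol{A}(\boldsymbol{\zeta}^{*})}\boldsymbol{\zeta}^{*} = \sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i} .$$

Since $\boldsymbol{A}(\boldsymbol{\zeta}^{*})$ has rank at most $N$, it is singular whenever the parameter dimension exceeds the number of tasks, which is why the pseudo-inverse appears in the final expression. Replacing $\boldsymbol{A}^{\dagger}$ by $(\boldsymbol{A}+\eta\boldsymbol{I})^{-1}$ with a small $\eta>0$ gives the conditioned mapping $F$ stated in Theorem 5.1.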

## Appendix B Proof for Theorem [5.4](https://arxiv.org/html/2607.01689#S5.Thmtheorem4)

We prove that the fixed-point mapping

$$F(\boldsymbol{\zeta}^{*}) = \left(\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}+\eta\boldsymbol{I}\right)^{-1}\left(\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\,\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}\right), \tag{32}$$

where

$$u_{i}(\boldsymbol{\zeta}^{*}) = \frac{1}{\big[(\boldsymbol{\zeta}^{*}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\big]^{2}+\gamma} \triangleq \frac{1}{r_{i}(\boldsymbol{\zeta}^{\ast})^{2}+\gamma},\qquad\eta>0,\;\gamma>0,$$

is a contractive map under Assumptions [5.2](https://arxiv.org/html/2607.01689#S5.Thmtheorem2) and [5.3](https://arxiv.org/html/2607.01689#S5.Thmtheorem3), i.e.,

$$\|F(\boldsymbol{\zeta}^{*})-F(\boldsymbol{\nu})\| \leq L\|\boldsymbol{\zeta}^{*}-\boldsymbol{\nu}\| \quad \text{where} \quad 0<L<1. \tag{33}$$

Choosing $\boldsymbol{\nu}=\boldsymbol{\zeta}^{(k)}$ and using $\boldsymbol{\zeta}^{(k+1)}=F(\boldsymbol{\zeta}^{(k)})$ then leads to the desired result. For notational ease, we denote $\boldsymbol{H}(\boldsymbol{\zeta}^{*})=\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}+\eta\boldsymbol{I}$ and $\boldsymbol{b}(\boldsymbol{\zeta}^{*})=\sum_{i=1}^{N}u_{i}(\boldsymbol{\zeta}^{*})\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}$. This results in

$$F(\boldsymbol{\zeta}^{*}) = \Big(\boldsymbol{H}(\boldsymbol{\zeta}^{*})\Big)^{-1}\boldsymbol{b}(\boldsymbol{\zeta}^{*}). \tag{34}$$

We first need the following lemma for computing the directional gradient of $F$.

###### Lemma B\.1\.

Consider the mapping $F$ defined in ([34](https://arxiv.org/html/2607.01689#A2.E34)), and pick an arbitrary weight $\boldsymbol{\zeta}^{*}$ and direction $\boldsymbol{h}$. Then the directional gradient of $F$ at $\boldsymbol{\zeta}^{*}$ in the direction $\boldsymbol{h}$ is given by

$$DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}} = -2\boldsymbol{H}^{-1}\sum_{i=1}^{N}\frac{r_{i}(\boldsymbol{\zeta}^{*})}{(r_{i}(\boldsymbol{\zeta}^{*})^{2}+\gamma)^{2}}(\boldsymbol{\theta}_{i}^{\top}\boldsymbol{h})\Big(\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}-\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}F(\boldsymbol{\zeta}^{*})\Big). \tag{35}$$

###### Proof\.

Applying the definition of directional derivative:

D​F​\(𝜻∗\)𝒉\\displaystyle DF\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}=\\displaystyle=dd​t​F​\(𝜻∗\+t​𝒉\)\|t=0\.\\displaystyle\\frac\{d\}\{dt\}F\(\\boldsymbol\{\\zeta\}^\{\*\}\+t\\boldsymbol\{h\}\)\\Big\|\_\{t=0\}\.Denote𝑯t≜𝑯​\(𝜻∗\+t​𝒉\)\\boldsymbol\{H\}\_\{t\}\\triangleq\\boldsymbol\{H\}\(\\boldsymbol\{\\zeta\}^\{\*\}\+t\\boldsymbol\{h\}\)and𝒃t≜𝒃​\(𝜻∗\+t​𝒉\)\\boldsymbol\{b\}\_\{t\}\\triangleq\\boldsymbol\{b\}\(\\boldsymbol\{\\zeta\}^\{\*\}\+t\\boldsymbol\{h\}\)\. ThenFt≜F​\(𝜻∗\+t​𝒉\)=𝑯t−1​𝒃tF\_\{t\}\\triangleq F\(\\boldsymbol\{\\zeta\}^\{\*\}\+t\\boldsymbol\{h\}\)=\\boldsymbol\{H\}\_\{t\}^\{\-1\}\\boldsymbol\{b\}\_\{t\}\. Applying chain rule, we have

d​Ftd​t\\displaystyle\\frac\{dF\_\{t\}\}\{dt\}=\\displaystyle=dd​t​𝑯t−1​𝒃t\+𝑯t−1​d​𝒃td​t=−𝑯t−1​d​𝑯td​t​𝑯t−1​𝒃t⏟Ft\+𝑯t−1​d​𝒃td​t,\\displaystyle\\frac\{d\}\{dt\}\\boldsymbol\{H\}^\{\-1\}\_\{t\}\\boldsymbol\{b\}\_\{t\}\\ \+\\ \\boldsymbol\{H\}^\{\-1\}\_\{t\}\\frac\{d\\boldsymbol\{b\}\_\{t\}\}\{dt\}\\ =\\ \-\\boldsymbol\{H\}\_\{t\}^\{\-1\}\\frac\{d\\boldsymbol\{H\}\_\{t\}\}\{dt\}\\underbrace\{\\boldsymbol\{H\}\_\{t\}^\{\-1\}\\boldsymbol\{b\}\_\{t\}\}\_\{F\_\{t\}\}\+\\boldsymbol\{H\}\_\{t\}^\{\-1\}\\frac\{d\\boldsymbol\{b\}\_\{t\}\}\{dt\}\\ ,\(36\)where we apply the matrix inverse derivative identity\. Att=0t=0, \([36](https://arxiv.org/html/2607.01689#A2.E36)\) becomes

D​F​\(𝜻∗\)𝒉\\displaystyle DF\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}=\\displaystyle=−𝑯​\(𝜻∗\)−1​\[D​𝒃​\(𝜻∗\)𝒉−D​𝑯​\(𝜻∗\)𝒉​F​\(𝜻∗\)\],\\displaystyle\-\\boldsymbol\{H\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)^\{\-1\}\[D\\boldsymbol\{b\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}\-D\\boldsymbol\{H\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}F\(\\boldsymbol\{\\zeta\}^\{\*\}\)\],\(37\)where

D​𝒃​\(𝜻∗\)𝒉\\displaystyle D\\boldsymbol\{b\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}=\\displaystyle=∑i=1N\(∇ui​\(𝜻∗\)⊤​𝒉\)​𝜽i​𝜽i⊤​𝜽i=∑i=1N2​ri​\(𝜻∗\)\(ri​\(𝜻∗\)2\+γ\)2​\(𝜽i⊤​𝒉\)​𝜽i​𝜽i⊤​𝜽i,\\displaystyle\\sum\_\{i=1\}^\{N\}\(\\nabla u\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)^\{\\top\}\\boldsymbol\{h\}\)\\boldsymbol\{\\theta\}\_\{i\}\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\\boldsymbol\{\\theta\}\_\{i\}\\ =\\ \\sum\_\{i=1\}^\{N\}\\frac\{2r\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\}\{\(r\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)^\{2\}\+\\gamma\)^\{2\}\}\(\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\\boldsymbol\{h\}\)\\boldsymbol\{\\theta\}\_\{i\}\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\\boldsymbol\{\\theta\}\_\{i\},D​𝑯​\(𝜻∗\)𝒉\\displaystyle D\\boldsymbol\{H\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\_\{\\boldsymbol\{h\}\}=\\displaystyle=∑i=1N\(∇ui​\(𝜻∗\)⊤​𝒉\)​𝜽i​𝜽i⊤=∑i=1N2​ri​\(𝜻∗\)\(ri​\(𝜻∗\)2\+γ\)2​\(𝜽i⊤​𝒉\)​𝜽i​𝜽i⊤\.\\displaystyle\\sum\_\{i=1\}^\{N\}\(\\nabla u\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)^\{\\top\}\\boldsymbol\{h\}\)\\boldsymbol\{\\theta\}\_\{i\}\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\\ =\\ \\sum\_\{i=1\}^\{N\}\\frac\{2r\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)\}\{\(r\_\{i\}\(\\boldsymbol\{\\zeta\}^\{\*\}\)^\{2\}\+\\gamma\)^\{2\}\}\(\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\\boldsymbol\{h\}\)\\boldsymbol\{\\theta\}\_\{i\}\\boldsymbol\{\\theta\}\_\{i\}^\{\\top\}\.\(38\)Substituting \([B](https://arxiv.org/html/2607.01689#A2.Ex8)\) to \([37](https://arxiv.org/html/2607.01689#A2.E37)\) thus completes the proof\. ∎

It is now sufficient to show that the directional gradient‖D​F​\(𝜻∗\)‖\\\|DF\(\\boldsymbol\{\\zeta\}^\{\*\}\)\\\|ofFFin an arbitrary direction𝒉\\boldsymbol\{h\}in the neighborhood of𝜻∗\\boldsymbol\{\\zeta\}^\{\*\}is less than 1 to show local contraction ofFF\[kreyszig1991introductory\]\.

First, we compute the gradient of $u_{i}(\boldsymbol{\zeta}^{*})$ as:

$$
\nabla u_{i}(\boldsymbol{\zeta}^{*}) = \frac{2r_{i}(\boldsymbol{\zeta}^{*})}{\big(r_{i}(\boldsymbol{\zeta}^{*})^{2}+\gamma\big)^{2}}\cdot\boldsymbol{\theta}_{i}. \tag{39}
$$
Next, we define $\boldsymbol{h}=\boldsymbol{\nu}-\boldsymbol{\zeta}^{*}$ for arbitrary weights $\boldsymbol{\zeta}^{*}$ and $\boldsymbol{\nu}$ such that $\|\boldsymbol{h}\|=1$ (as per Assumption [5.3](https://arxiv.org/html/2607.01689#S5.Thmtheorem3)). We bound the norm of the directional gradient $DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}}$ in direction $\boldsymbol{h}$. Using Lemma [B.1](https://arxiv.org/html/2607.01689#A2.Thmtheorem1), we can compute $DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}}$ as

$$
DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}} = -2\boldsymbol{H}^{-1}\sum_{i=1}^{N}\underbrace{\frac{r_{i}(\boldsymbol{\zeta}^{*})}{\big(r_{i}(\boldsymbol{\zeta}^{*})^{2}+\gamma\big)^{2}}\Big(\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}\boldsymbol{\theta}_{i}-\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}F(\boldsymbol{\zeta}^{*})\Big)}_{\triangleq\,\boldsymbol{J}_{i}}(\boldsymbol{\theta}_{i}^{\top}\boldsymbol{h}). \tag{40}
$$

Therefore, $DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}}$ is bounded by

$$
\|DF(\boldsymbol{\zeta}^{*})_{\boldsymbol{h}}\| = 2\,\Bigg\|\sum_{i=1}^{N}\boldsymbol{H}^{-1}\boldsymbol{J}_{i}(\boldsymbol{\theta}_{i}^{\top}\boldsymbol{h})\Bigg\| \;\leq\; 2\sum_{i=1}^{N}\|\boldsymbol{H}^{-1}\boldsymbol{J}_{i}\boldsymbol{\theta}_{i}^{\top}\| \;\leq\; 2M\sum_{i=1}^{N}\|\boldsymbol{H}^{-1}\boldsymbol{J}_{i}\|, \tag{41}
$$

where the first inequality follows from the triangle inequality and $\|\boldsymbol{h}\|=1$, and the second inequality follows from Assumption [5.2](https://arxiv.org/html/2607.01689#S5.Thmtheorem2). We empirically show in Figure [4](https://arxiv.org/html/2607.01689#A2.F4) that this upper bound is less than 1 in our experiment settings, making $F$ a contraction map in a neighborhood of $\boldsymbol{\zeta}^{*}$.
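For readers who want the step from "derivative norm below 1" to "contraction" spelled out, the standard mean-value argument can be sketched as follows. The constant $L$, the radius $\rho$, and the iterate notation $\boldsymbol{\zeta}_{t}$ are introduced only for this sketch. It assumes that the bound holds uniformly on a ball around $\boldsymbol{\zeta}^{*}$, that $\boldsymbol{\zeta}^{*}$ is a fixed point of $F$, and that the initial iterate lies inside the ball.

```latex
% Sketch only: L, \rho and \zeta_t are notation introduced here.
% Assume \|DF(\zeta)_h\| \le L < 1 for every unit direction h and every
% \zeta in the ball B_\rho(\zeta^*), that F(\zeta^*) = \zeta^*, and that
% \zeta_0 \in B_\rho(\zeta^*). The directional derivative is linear in h.
\begin{aligned}
\|F(\boldsymbol{\nu}) - F(\boldsymbol{\zeta}^{*})\|
  &= \Big\| \int_{0}^{1} DF\big(\boldsymbol{\zeta}^{*} + t(\boldsymbol{\nu} - \boldsymbol{\zeta}^{*})\big)_{\boldsymbol{\nu} - \boldsymbol{\zeta}^{*}} \, dt \Big\|
  \;\le\; L \, \|\boldsymbol{\nu} - \boldsymbol{\zeta}^{*}\|, \\
\|\boldsymbol{\zeta}_{t} - \boldsymbol{\zeta}^{*}\|
  &= \|F(\boldsymbol{\zeta}_{t-1}) - F(\boldsymbol{\zeta}^{*})\|
  \;\le\; L \, \|\boldsymbol{\zeta}_{t-1} - \boldsymbol{\zeta}^{*}\|
  \;\le\; \dots \;\le\; L^{t} \, \|\boldsymbol{\zeta}_{0} - \boldsymbol{\zeta}^{*}\|.
\end{aligned}
```

Under these assumptions the iterates never leave the ball, since each step shrinks the distance to $\boldsymbol{\zeta}^{*}$ by at least the factor $L$.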

![Refer to caption](https://arxiv.org/html/2607.01689v1/x4.png)

Figure 4: Lipschitz constant values of $F$ across multiple layers of our PoE-EBM when merging 7 fully fine-tuned ViT models. Top: ViT-L-14. Bottom: ViT-B-32. The Lipschitz constant is consistently less than 1, indicating that $F$ is a contraction map, which ensures convergence of Algorithm [1](https://arxiv.org/html/2607.01689#alg1) (see Figure [3](https://arxiv.org/html/2607.01689#S6.F3)).

## Appendix C Complexity Analysis of Algorithm [1](https://arxiv.org/html/2607.01689#alg1)

Let $\boldsymbol{\theta}_{i}\in\mathbb{R}^{d\times d}$ be a fine-tuning module. The outer loop over maximum iterations (lines 4-12) is executed $T$ times. For each outer iteration, the inner loop over tasks (lines 5-8) is executed $N$ times.

Inner loop complexity:

1. Residual computation (line 6): $\mathcal{O}(d^{2})$.
2. Task weight computation (line 7): $\mathcal{O}(1)$.

After the inner loop:

1. Compute $\boldsymbol{H}$ and $\boldsymbol{b}$ (line 9): $\mathcal{O}(Nd^{2})$.
2. Update merged fine-tuning module (line 11): $\mathcal{O}(d^{3})$.

Therefore, the total complexity is

$$
\mathcal{O}\Big(T\big(N(d^{2}+1)+Nd^{2}+d^{3}\big)\Big) = \mathcal{O}\big(T(Nd^{2}+d^{3})\big),
$$

which is in turn bounded above by $\mathcal{O}(TNd^{3})$.
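To get a feel for which term dominates, the per-step counts listed above can be tallied directly. This is a rough sketch with constants dropped; the function name `merge_cost` and the sizes used in the example are illustrative assumptions, not measurements from the paper.

```python
def merge_cost(T: int, N: int, d: int) -> dict:
    """Tally the operation counts from the analysis above (constants dropped)."""
    inner = N * (d**2 + 1)   # residual (d^2) plus task weight (1), once per task
    assemble = N * d**2      # forming H and b after the inner loop
    solve = d**3             # updating the merged fine-tuning module
    per_iter = inner + assemble + solve
    return {
        "inner": inner,
        "assemble": assemble,
        "solve": solve,
        "total": T * per_iter,
    }


# Hypothetical sizes chosen only for illustration.
cost = merge_cost(T=10, N=7, d=768)
```

For these hypothetical sizes the $d^{3}$ update term is much larger than the two $Nd^{2}$ terms combined, which is the regime where $\mathcal{O}(T(Nd^{2}+d^{3}))$ behaves like $\mathcal{O}(Td^{3})$.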

## Appendix D Insights into Cauchy score ([21](https://arxiv.org/html/2607.01689#S4.E21))

### D.1 Gaussian energy and base score

For Gaussian experts ([11](https://arxiv.org/html/2607.01689#S3.E11)) with the precision matrix chosen as $\boldsymbol{\Sigma}_{i}^{-1}=\frac{1}{\gamma^{2}}\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}$, the corresponding energy function, which we refer to as the base energy, for the $i$-th task has the following form:

$$
E_{i}^{base}(\boldsymbol{\zeta}) \;\triangleq\; \frac{1}{2\gamma^{2}}(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})^{\top}\boldsymbol{\theta}_{i}\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i}) = \frac{1}{2\gamma^{2}}r_{i}(\boldsymbol{\zeta})^{2}. \tag{42}
$$

The global base energy is simply the sum over task energies, $E^{base}(\boldsymbol{\zeta})=\sum_{i=1}^{N}\frac{1}{2\gamma^{2}}r_{i}(\boldsymbol{\zeta})^{2}$, which admits the following score:

$$
S^{base}(\boldsymbol{\zeta}) \;\triangleq\; -\nabla_{\boldsymbol{\zeta}}E^{base} = -\sum_{i=1}^{N}\frac{r_{i}(\boldsymbol{\zeta})}{\gamma^{2}}\boldsymbol{\theta}_{i}. \tag{43}
$$
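The base score can be sanity-checked numerically against a finite-difference gradient of the base energy. The sketch below assumes that task vectors are flattened to 1-D arrays and that $r_{i}(\boldsymbol{\zeta})=\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})$, as read off from the quadratic form of the base energy. The random inputs and all function names are illustrative.

```python
import numpy as np


def residuals(zeta, thetas):
    # r_i(zeta) = theta_i^T (zeta - theta_i); thetas has shape (N, d).
    return np.einsum("id,id->i", thetas, zeta[None, :] - thetas)


def base_energy(zeta, thetas, gamma):
    # Sum over tasks of r_i^2 / (2 * gamma^2).
    return np.sum(residuals(zeta, thetas) ** 2) / (2.0 * gamma**2)


def base_score(zeta, thetas, gamma):
    # Negative gradient of the base energy: -sum_i (r_i / gamma^2) * theta_i.
    return -(residuals(zeta, thetas) / gamma**2) @ thetas


# Random toy inputs (assumption: not values from the paper).
rng = np.random.default_rng(0)
thetas = rng.normal(size=(3, 5))
zeta = rng.normal(size=5)
gamma = 0.5

# Central finite differences of the energy, one coordinate at a time.
eps = 1e-6
num_grad = np.array([
    (base_energy(zeta + eps * e, thetas, gamma)
     - base_energy(zeta - eps * e, thetas, gamma)) / (2.0 * eps)
    for e in np.eye(zeta.size)
])

# The score should equal minus the numerical gradient up to rounding.
max_err = np.max(np.abs(base_score(zeta, thetas, gamma) + num_grad))
```

Because the base energy is quadratic in $\boldsymbol{\zeta}$, central differences are exact up to floating-point rounding, so `max_err` should be tiny.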

### D.2 Cauchy experts as additive robust guidance

To understand the effect of heavy-tailed experts, we compare the Cauchy score ([21](https://arxiv.org/html/2607.01689#S4.E21)) with the base score ([43](https://arxiv.org/html/2607.01689#A4.E43)). The Cauchy score ([21](https://arxiv.org/html/2607.01689#S4.E21)) can be expressed as

$$
S^{Cauchy} = S^{base} + G(\boldsymbol{\zeta}),
$$

where we define the guidance term as:

$$
G(\boldsymbol{\zeta}) \;\triangleq\; \sum_{i=1}^{N}\Bigg(\frac{r_{i}(\boldsymbol{\zeta})}{\gamma^{2}} - \frac{2r_{i}(\boldsymbol{\zeta})}{\gamma^{2}+r_{i}(\boldsymbol{\zeta})^{2}}\Bigg)\boldsymbol{\theta}_{i}.
$$

Simplifying the expression yields:

$$
G(\boldsymbol{\zeta}) = \sum_{i=1}^{N}\frac{r_{i}(\boldsymbol{\zeta})\Big(r_{i}(\boldsymbol{\zeta})^{2}-\gamma^{2}\Big)}{\gamma^{2}\Big(\gamma^{2}+r_{i}(\boldsymbol{\zeta})^{2}\Big)}\boldsymbol{\theta}_{i}. \tag{44}
$$

Intuitively, $G(\boldsymbol{\zeta})$ acts as a residual-dependent adjustment that pushes $\boldsymbol{\zeta}$ away from directions with large $|r_{i}(\boldsymbol{\zeta})|$. In contrast, the base Gaussian score ([43](https://arxiv.org/html/2607.01689#A4.E43)) amplifies large residuals linearly, pulling the solution toward the conflict regime. The Cauchy guidance thus provides automatic robustness: it tempers the influence of misaligned tasks while still encouraging alignment in well-aligned directions.
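The decomposition can be checked numerically, and the per-task pull of each score can be compared as a function of the residual. The sketch below assumes flattened task vectors and $r_{i}(\boldsymbol{\zeta})=\boldsymbol{\theta}_{i}^{\top}(\boldsymbol{\zeta}-\boldsymbol{\theta}_{i})$. The closed form used in `cauchy_score` is the one implied by adding the guidance term to the base score, not a transcription of Eq. (21). All names and inputs are illustrative.

```python
import numpy as np


def residuals(zeta, thetas):
    # r_i(zeta) = theta_i^T (zeta - theta_i); thetas has shape (N, d).
    return thetas @ zeta - np.sum(thetas * thetas, axis=1)


def base_score(zeta, thetas, gamma):
    # Gaussian base score: -sum_i (r_i / gamma^2) * theta_i.
    return -(residuals(zeta, thetas) / gamma**2) @ thetas


def guidance(zeta, thetas, gamma):
    # Simplified guidance term: r (r^2 - gamma^2) / (gamma^2 (gamma^2 + r^2)).
    r = residuals(zeta, thetas)
    coef = r * (r**2 - gamma**2) / (gamma**2 * (gamma**2 + r**2))
    return coef @ thetas


def cauchy_score(zeta, thetas, gamma):
    # Implied by S_base + G: -sum_i 2 r_i / (gamma^2 + r_i^2) * theta_i.
    r = residuals(zeta, thetas)
    return -(2.0 * r / (gamma**2 + r**2)) @ thetas


# Random toy inputs (assumption: not values from the paper).
rng = np.random.default_rng(0)
thetas = rng.normal(size=(4, 6))
zeta = rng.normal(size=6)
gamma = 0.3

lhs = cauchy_score(zeta, thetas, gamma)
rhs = base_score(zeta, thetas, gamma) + guidance(zeta, thetas, gamma)
decomposition_gap = np.max(np.abs(lhs - rhs))

# Magnitude of the per-task coefficient as a function of the residual.
r_grid = np.linspace(-50.0, 50.0, 2001)
gauss_pull = np.abs(r_grid) / gamma**2
cauchy_pull = 2.0 * np.abs(r_grid) / (gamma**2 + r_grid**2)
```

The Gaussian coefficient grows linearly in $|r_{i}|$, whereas the Cauchy coefficient $2|r_{i}|/(\gamma^{2}+r_{i}^{2})$ never exceeds $1/\gamma$ (attained at $|r_{i}|=\gamma$) and decays for larger residuals, which is the tempering effect described above.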

## Appendix E Limitations

PoE-EBM currently assumes offline model merging with simultaneous access to all task models and a shared model architecture. Moreover, our formulation focuses on Cauchy experts as the underlying heavy-tailed distribution. Extending the framework to continual merging, alternative heavy-tailed experts, heterogeneous architectures, and multi-modal settings remains a potential direction for future work.

## Appendix F Additional Experiment Details

Table 5: Hyperparameter settings $(\lambda,\gamma,\eta)$ for PoE-EBM when used to merge fine-tuned models in vision and NLP benchmarks. The numerical algorithm (Section [5](https://arxiv.org/html/2607.01689#S5)) used to compute the MAP of the PoE-EBM model is configured with convergence tolerance parameter $10^{-5}$ in all scenarios.

| Domain | Training | Model | Scale $\lambda$ | $\gamma$ | $\eta$ |
|---|---|---|---|---|---|
| Vision | FFT | ViT-L/14 | 1.0 | 0.01 | $10^{-3}$ |
| Vision | FFT | ViT-B/32 | 1.0 | 0.3 | $10^{-3}$ |
| Vision | LoRA | ViT-L/14 | 0.25 | 0.01 | $10^{-3}$ |
| NLP | LoRA | Flan-T5-base | 1.0 | $10^{-3}$ | $10^{-3}$ |
| NLP | LoRA | Flan-T5-large | 1.0 | $10^{-2}$ | $10^{-3}$ |

Tasks. We conduct our experiments on vision and natural language processing (NLP) tasks. The downstream vision tasks contain the 7-task benchmark: MNIST [lecun1998mnist], SVHN [netzer2011reading], Stanford Cars [krause20133d], DTD [cimpoi2014describing], GTSRB [stallkamp2011german], EuroSAT [helber2019eurosat], and Resisc45 [cheng2017remote]. We also include an additional 6 datasets to create a more challenging 13-task benchmark: CIFAR10, CIFAR100 [krizhevsky2009learning], FashionMNIST [xiao2017fashion], Flowers102 [nilsback2008automated], Food [bossard14], and Oxford-IIIT Pet. For NLP tasks, we use 8 datasets from the GLUE benchmark [wang2018glue], including CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST2, and STSB.

Models. For vision experiments, we leverage pretrained ViT-B/32 and ViT-L/14 models from CLIP [radford2021learning]. We consider merging models in both fully fine-tuned and LoRA fine-tuned [hu2022lora] settings. We use the checkpoints provided by [ilharco2022editing] for fully fine-tuned models. The LoRA fine-tuned versions of these models are provided by [stoica2024model]. For NLP tasks, we merge LoRA fine-tuned Flan-T5-base and Flan-T5-large models whose checkpoints are provided by [wei2025modeling].

Metrics. We report the absolute accuracy of the merging methods as well as that of the individual fine-tuned models. Following [ilharco2022editing], we also report the "normalized accuracy", i.e., the ratio between the absolute accuracy of the merged model on the $i$-th task and the accuracy of the fine-tuned model on the same task. Normalized accuracy shows how closely the merged model performs relative to the fine-tuned model for each task. Additional experiment details are provided in Appendix [F](https://arxiv.org/html/2607.01689#A6).
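As a concrete reading of the normalized-accuracy definition, the following sketch computes the per-task ratio and its mean. The task names and numbers are made-up placeholders, not results from the paper. Averaging the ratios across tasks is an assumption about how a single summary number would be formed.

```python
def normalized_accuracy(merged_acc, finetuned_acc):
    """Return per-task merged/fine-tuned accuracy ratios and their mean."""
    if merged_acc.keys() != finetuned_acc.keys():
        raise ValueError("both dicts must cover the same tasks")
    per_task = {
        task: merged_acc[task] / finetuned_acc[task] for task in merged_acc
    }
    average = sum(per_task.values()) / len(per_task)
    return per_task, average


# Placeholder accuracies (assumption: purely illustrative values).
merged = {"task_a": 0.90, "task_b": 0.60}
finetuned = {"task_a": 1.00, "task_b": 0.80}
per_task, average = normalized_accuracy(merged, finetuned)
```

A ratio close to 1 means the merged model nearly matches the task-specific model on that task. Values can exceed 1 if the merged model happens to outperform the fine-tuned one.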

Hyperparameters. All hyperparameters are selected via extensive grid search on validation performance for each architecture and training regime. We use the same convergence criterion across all experiments and report the best-performing configuration for each setting. Convergence criterion: for all experiments with PoE-EBM, fixed-point iterations are terminated when the norm of the difference between two consecutive iterates is less than or equal to $10^{-5}$. The conditioning parameter $\eta$ is fixed to $10^{-3}$. Other hyperparameter configurations for using PoE-EBM on vision and NLP tasks are reported in Table [5](https://arxiv.org/html/2607.01689#A6.T5).
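The stopping rule can be illustrated with a generic fixed-point loop. This is a minimal sketch, not the authors' implementation: `fixed_point` is a hypothetical helper, and `toy_map` is a simple affine contraction standing in for the paper's update map.

```python
import numpy as np


def fixed_point(F, x0, tol=1e-5, max_iter=1000):
    """Iterate x <- F(x) until consecutive iterates differ by at most tol."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_next = F(x)
        if np.linalg.norm(x_next - x) <= tol:
            return x_next, k
        x = x_next
    # Tolerance not reached: return the last iterate and the iteration cap.
    return x, max_iter


# Toy affine contraction with Lipschitz constant 0.5 (assumption: stand-in map).
A = 0.5 * np.eye(3)
c = np.array([1.0, -2.0, 0.5])


def toy_map(x):
    return A @ x + c


x_star, n_iter = fixed_point(toy_map, np.zeros(3))
```

For this toy map the exact fixed point is $2\boldsymbol{c}$, and the loop stops well before the iteration cap because each step halves the distance between consecutive iterates.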

Table 6: Average $\ell_{2}$ norms of full fine-tuning task vectors $\boldsymbol{\theta}_{i}$ across seven vision tasks, together with their magnitudes relative to the pretrained model $\boldsymbol{W}_{0}$. All values are averaged across layers. The consistently small ratios (<1%) indicate that task-specific updates remain localized perturbations of the pretrained model, supporting the fine-tuning parameterization adopted throughout this work.

## Appendix G Additional Results

We present additional experimental results and ablation studies in this section. The normalized accuracy for fully fine-tuned ViT-B/32 and ViT-L/14 models is reported in Tables [10](https://arxiv.org/html/2607.01689#A8.T10) and [11](https://arxiv.org/html/2607.01689#A8.T11), respectively. Results for merging LoRA fine-tuned and fully fine-tuned ViT-L/14 models on the 7-task vision benchmark are shown in Tables [12](https://arxiv.org/html/2607.01689#A8.T12) and [8](https://arxiv.org/html/2607.01689#A8.T8), respectively. We report the merging results on the 13-task vision benchmark in Table [2](https://arxiv.org/html/2607.01689#S6.T2). Table [9](https://arxiv.org/html/2607.01689#A8.T9) summarizes performance on 8 GLUE datasets when merging LoRA fine-tuned Flan-T5-large models. Across all settings, PoE-EBM consistently outperforms the compared baselines, including KnOTS [stoica2024model], which is specifically designed for merging LoRA fine-tuned models. In Figs. [5](https://arxiv.org/html/2607.01689#A7.F5) and [6](https://arxiv.org/html/2607.01689#A7.F6), we also provide additional empirical results demonstrating the heavy-tailed behavior of the residuals when merging Flan-T5 models and ViTs, which closely aligns with the Cauchy distribution.

![Refer to caption](https://arxiv.org/html/2607.01689v1/x5.png)

Figure 5: Empirical distributions of directional residuals $r$ (see Eq. [17](https://arxiv.org/html/2607.01689#S4.E17)) produced by different merging methods when merging seven fully fine-tuned ViT models. Across all methods, the observed tails are substantially heavier than those predicted by light-tailed models and are most accurately captured by a Cauchy distribution among the distributions considered.

![Refer to caption](https://arxiv.org/html/2607.01689v1/x6.png)

Figure 6: Empirical distributions of directional residuals $r$ (see Eq. [17](https://arxiv.org/html/2607.01689#S4.E17)) produced by different merging methods when merging 8 LoRA fine-tuned Flan-T5 models.

Ablation on the covariance scale parameter. We study the effect of the scale parameter $\gamma$ in ([19](https://arxiv.org/html/2607.01689#S4.E19)) on the averaged absolute accuracy of our merging algorithm on both ViT-B/32 and ViT-L/14. We vary the value of $\gamma$, which controls the computation of the task-wise weights $u_{i}(\cdot)$, over the range $[10^{-3},3]$ and plot the corresponding accuracies in Figure [7](https://arxiv.org/html/2607.01689#A7.F7). We observe that the performance is relatively stable across a wide range of $\gamma$ values for both ViT models. For ViT-B/32, accuracy peaks at $\gamma=0.3$ and gradually decreases as $\gamma$ increases to $3$. Even at $\gamma=2$, PoE-EBM still outperforms all compared baselines, with an accuracy of $83.36\%$. For ViT-L/14, the performance remains relatively stable, with accuracy gradually increasing as $\gamma$ decreases. However, at extremely small values (e.g., $\gamma=10^{-3}$), performance collapses, indicating numerical instability and over-sensitivity in the computation of the weights $u_{i}$. These results suggest that moderate $\gamma$ values result in stable performance of our merging algorithm, while overly small scales can be detrimental, particularly for larger models.

![Refer to caption](https://arxiv.org/html/2607.01689v1/x7.png)

Figure 7: Effect of $\gamma$ on the accuracy of PoE-EBM when merging 7 fully fine-tuned ViT models.

Table 7: Reported processing time for our model merging algorithm (see Algorithm [1](https://arxiv.org/html/2607.01689#alg1)) across NLP and vision benchmarks.

## Appendix H Additional Related Works

To resolve task interference in task arithmetic [ilharco2022editing], TIES [yadav2023ties] reduces interference between parameters using their signs and magnitudes before merging. DARE [yu2024language] randomly removes fine-tuned weights and rescales the remaining ones to create sparse task vectors, improving the generalization of task arithmetic. Another line of work aims to resolve interference by merging within subspaces. Alternatively, [ortiz2023task] proposes fine-tuning models in the tangent space, disentangling the fine-tuned models and thus improving their mergeability. Task Singular Vectors [gargiulo2025task] combines task vectors using low-rank approximation and reduces interference by means of whitening. KnOTS [stoica2024model] leverages the Singular Value Decomposition (SVD) of the concatenated task vectors to extract the shared information across all tasks and merge only the task-specific components using task arithmetic. Concrete [tang2023concrete] uses meta-learning to find a common low-dimensional subspace and performs merging with reduced interference.
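For orientation, two of the baselines above can be sketched in a few lines. This is a simplified illustration on flat NumPy arrays, not the reference implementations: the function names, the scaling coefficient `scale`, and the drop probability `drop_prob` are hypothetical choices.

```python
import numpy as np


def task_vectors(pretrained, finetuned):
    # A task vector is the difference between fine-tuned and pretrained weights.
    return [ft - pretrained for ft in finetuned]


def dare(tau, drop_prob, rng):
    # Drop each entry with probability drop_prob and rescale the survivors by
    # 1 / (1 - drop_prob), so the expected task vector is unchanged.
    keep = rng.random(tau.shape) >= drop_prob
    return np.where(keep, tau / (1.0 - drop_prob), 0.0)


def task_arithmetic(pretrained, taus, scale):
    # Add the scaled sum of task vectors back onto the pretrained weights.
    return pretrained + scale * np.sum(taus, axis=0)


# Toy weights (assumption: random stand-ins for real checkpoints).
rng = np.random.default_rng(0)
pretrained = rng.normal(size=1000)
finetuned = [pretrained + 0.01 * rng.normal(size=1000) for _ in range(3)]

taus = task_vectors(pretrained, finetuned)
sparse_taus = [dare(tau, drop_prob=0.9, rng=rng) for tau in taus]
merged = task_arithmetic(pretrained, sparse_taus, scale=0.3)
```

Both procedures combine task vectors deterministically or by random sparsification, without assigning a probabilistic weight to each update direction.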

More recent works explore model merging through the lens of optimization. DOGE [wei2025modeling] views model merging as a single constrained optimization problem whose objective is to align the test performance of the merged model with that of the task-specific models on their respective tasks. MAP [li2024map] models the merging problem as a multi-objective optimization problem and aims to identify the Pareto front of the merging coefficients using proxy data. Nevertheless, these works leverage the geometry of the parameter space and employ non-probabilistic merging schemes, and thus do not account for uncertainty. In contrast, our PoE-EBM framework explicitly views the merging problem as probabilistic inference in the parameter space and takes uncertainty into account.

Table 8: Multi-task performance comparison when merging ViT-L/14 (full fine-tuned) across 7 vision benchmarks (absolute accuracy). Task-specific fine-tuning accuracy is colored blue to highlight the performance upper bound.

Table 9: Multi-task performance comparison when merging Flan-T5-large models (LoRA fine-tuned) across 8 GLUE benchmarks. Task-specific fine-tuning accuracy is colored blue to highlight the performance upper bound. Our PoE-EBM achieves the best rank across tasks.

Table 10: Multi-task performance comparison when merging ViT-B/32 (full fine-tuned) across 7 vision benchmarks. The performance is reported in terms of normalized accuracy.

Table 11: Multi-task performance comparison when merging ViT-L/14 (full fine-tuned) across 7 vision benchmarks. The performance is reported in terms of normalized accuracy.

Table 12: Multi-task performance comparison when merging ViT-L/14 (LoRA fine-tuned) across 7 vision benchmarks.
