GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning
Summary
GRASP proposes a method for multi-source transfer learning that sequentially merges source models into a single target model with constant O(1) memory usage, using gradient-based parameter alignment to avoid negative transfer. Experiments show it outperforms ensemble methods while being much more memory-efficient.
# GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning
Source: [https://arxiv.org/html/2606.14900](https://arxiv.org/html/2606.14900)
Affiliations: (1) San Diego State University, San Diego, CA, USA; (2) University of Utah, Salt Lake City, UT, USA; (3) University of Maine, Orono, ME, USA
###### Abstract
Multi-source transfer learning faces a fundamental scalability bottleneck: existing approaches require either loading all $K$ source models into memory simultaneously during parameter fusion, which takes $O(K)$ memory, or deploying all models at inference time, making production deployment infeasible. We propose GRASP (Gradient-Aligned Sequential Parameter Transfer), which achieves superior knowledge integration while maintaining $O(1)$ memory consumption through three key innovations: (1) sequential processing that merges one source at a time into an evolving target model, (2) parameter-wise gradient alignment that selectively transfers only parameters whose optimization directions align with the target domain, avoiding negative transfer, and (3) iterative fine-tuning that adapts transferred knowledge before integrating the next source. Extensive experiments across three continual learning benchmarks (Yearbook, CLEAR-10, CLEAR-100) spanning 10 to 108-year temporal distribution shifts and four architectures (1.3M to 25.6M parameters) demonstrate that GRASP achieves 93.5% mean accuracy over all datasets and architectures, compared to 71.7% for the ensemble method, while requiring only constant memory versus $K$ models for standard multi-source fusion. Critically, GRASP's sequential design enables incremental source integration without re-processing previously merged models and scales to arbitrarily many sources without memory growth, making it uniquely suitable for resource-constrained deployment and continually evolving source domains.
Keywords: Transfer Learning, Multi-Source Learning, Parameter-Efficient Methods, Continual Learning, Memory-Efficient Deep Learning
## 1 Introduction
Transfer learning has revolutionized deep learning by enabling rapid adaptation to new domains through knowledge reuse from pre\-trained models\[[15](https://arxiv.org/html/2606.14900#bib.bib14),[22](https://arxiv.org/html/2606.14900#bib.bib21)\]\. While single\-source transfer has proven highly effective, real\-world applications increasingly involve multiple heterogeneous source domains, each capturing complementary knowledge relevant to the target task\. Multi\-source transfer learning promises to harness this diverse knowledge for superior target performance; yet, existing approaches face fundamental trade\-offs that severely limit their practical deployment\.
The memory scalability problem: Current multi-source methods fall into three categories, each with critical limitations. Ensemble methods \[[1](https://arxiv.org/html/2606.14900#bib.bib1),[9](https://arxiv.org/html/2606.14900#bib.bib8)\] maintain $K$ independently trained source models and combine their predictions through weighted averaging. While conceptually simple and embarrassingly parallel, ensembles require $O(K)$ space complexity, making deployment infeasible in memory-constrained environments. Parameter fusion methods \[[23](https://arxiv.org/html/2606.14900#bib.bib22),[13](https://arxiv.org/html/2606.14900#bib.bib12),[7](https://arxiv.org/html/2606.14900#bib.bib6)\] merge source parameters into a single model, eliminating inference overhead but typically requiring all sources to be loaded simultaneously during merging, creating an $O(K)$ memory bottleneck that prevents scaling to large source collections. Parameter-efficient methods \[[16](https://arxiv.org/html/2606.14900#bib.bib15),[17](https://arxiv.org/html/2606.14900#bib.bib16)\] train lightweight adapters for each source, but these adapters accumulate across sources and suffer from catastrophic forgetting \[[8](https://arxiv.org/html/2606.14900#bib.bib7)\].
Our approach (GRASP): We propose Gradient-Aligned Sequential Parameter Transfer (GRASP), which resolves these fundamental limitations through: (1) Sequential processing with constant memory: processing sources one at a time with $O(1)$ memory complexity (only 2 models in memory); (2) Gradient-aligned parameter selection: selectively transferring only parameters whose gradients exhibit positive cosine similarity with target gradients; (3) Iterative integration with adaptation: fine-tuning after each source merge to ensure compatibility with subsequently transferred knowledge.
Contributions: (1) A memory-efficient sequential transfer framework achieving $O(1)$ memory complexity regardless of source count. (2) A gradient alignment criterion providing principled parameter-level selection that identifies beneficial knowledge while avoiding negative transfer. (3) Theoretical analysis establishing bounds on multi-source transfer effectiveness through a Fisher information formulation. (4) Comprehensive empirical validation demonstrating superior accuracy, stability, and memory efficiency across diverse temporal shifts. Code: Available at [https://github.com/Sekeh-Lab/grasp-multisource-transfer](https://github.com/Sekeh-Lab/grasp-multisource-transfer).
## 2 Related Work
Multi-Source Transfer and Fusion. Multi-source domain adaptation leverages multiple sources to improve target performance \[[12](https://arxiv.org/html/2606.14900#bib.bib11)\]. Recent fusion methods combine independently trained models without retraining: model soups \[[23](https://arxiv.org/html/2606.14900#bib.bib22)\] average weights from models with different hyperparameters, Fisher-weighted averaging \[[13](https://arxiv.org/html/2606.14900#bib.bib12)\] weights parameters by Fisher information importance, and task arithmetic \[[7](https://arxiv.org/html/2606.14900#bib.bib6)\] demonstrates that task vectors can be added or negated. While these methods achieve strong performance, they require loading all $K$ sources simultaneously during merging ($O(K)$ memory), necessitate re-merging when adding sources, and employ uniform or coarse-grained weighting schemes.
Ensemble Methods. Ensembles combine models through voting or averaging \[[1](https://arxiv.org/html/2606.14900#bib.bib1)\], with deep ensembles \[[9](https://arxiv.org/html/2606.14900#bib.bib8)\] providing uncertainty estimates at the cost of maintaining $K$ models in memory and executing $K$ forward passes. Both approaches suffer from linear memory and computational scaling without support for continuous extension.
Gradient-Based Transfer. Gradient alignment has emerged as an indicator of successful transfer. Du et al. \[[2](https://arxiv.org/html/2606.14900#bib.bib2)\] proposed gradient distribution alignment for domain adaptation, while Wang et al. \[[21](https://arxiv.org/html/2606.14900#bib.bib20)\] introduced prompt gradient alignment for domain-level decisions. Standley et al. \[[19](https://arxiv.org/html/2606.14900#bib.bib18)\] used gradient cosine similarity for task affinity in multi-task learning. However, prior work applies gradient alignment at the domain or task level rather than for parameter-level selective transfer.
Parameter-Efficient and Sequential Methods. Parameter-efficient fine-tuning reduces trainable parameters through low-rank decomposition \[[6](https://arxiv.org/html/2606.14900#bib.bib5)\]. Sequential methods include AdapterFusion \[[16](https://arxiv.org/html/2606.14900#bib.bib15)\], which composes adapters with learned weights, and sequential adapter training \[[17](https://arxiv.org/html/2606.14900#bib.bib16)\], which trains task-specific adapters sequentially but accumulates $K$ adapter modules. Continual learning prevents catastrophic forgetting through regularization \[[8](https://arxiv.org/html/2606.14900#bib.bib7)\] or dynamic architectures \[[18](https://arxiv.org/html/2606.14900#bib.bib17)\], but typically requires architectural growth or specialized memory mechanisms.
GRASP Distinctions. Unlike fusion methods requiring $O(K)$ memory and re-merging, GRASP achieves $O(1)$ memory through sequential processing and continuous integration. Compared to ensembles, GRASP provides single-model inference with selective knowledge aggregation. GRASP extends gradient alignment from domain-level \[[2](https://arxiv.org/html/2606.14900#bib.bib2),[21](https://arxiv.org/html/2606.14900#bib.bib20)\] to parameter-level selection, enabling fine-grained transfer control. Unlike sequential adapter methods \[[17](https://arxiv.org/html/2606.14900#bib.bib16)\] and continual learning approaches \[[18](https://arxiv.org/html/2606.14900#bib.bib17)\], GRASP integrates knowledge into a single model without adapter accumulation or architectural growth.
## 3 Methodology and Theoretical Analysis
### 3.1 Problem Formulation
We consider a sequence of $M$ sources $\{(\mathbf{X}_m,\mathbf{Y}_m)\}_{m=1,\ldots,M}$, where $\mathbf{X}_m$ is the domain of source $m$ and $\mathbf{Y}_m$ is the class set of the $m$-th source. The target is denoted $T$, $\{(\mathbf{X}_T,\mathbf{Y}_T)\}$, with class set $\mathbf{Y}_T$.
###### Definition 1 (Multi-source TL (MS-TL))
For any ground event $\mathcal{D}$ from $M$ sources, denoted by $\mathcal{D}^{u}_{M}$, the goal of MS-TL is to learn $P(x\in\mathbf{X}_T|\mathcal{D}^{u}_{M})$. We assume source domains are disjoint, i.e., $\mathbf{X}_m\cap\mathbf{X}_{m'}=\emptyset$, $\forall m\neq m'$, with $\mathcal{D}_{M}^{u}=\bigcup_{m=1}^{M}\mathbf{X}_m$ and
$$P(x\in\mathbf{X}_T|\mathcal{D}^{u}_{M})=\sum_{m=1}^{M}P(x\in\mathbf{X}_T|x\in\mathbf{X}_m)\,P(x\in\mathbf{X}_m). \tag{1}$$
Because $\mathbf{X}_m\cap\mathbf{X}_{m'}=\emptyset$, the definition for a particular source $\mathbf{X}_m$ is:
$$P(x\in\mathbf{X}_T|x\in\mathbf{X}_m)\,P(x\in\mathbf{X}_m), \tag{2}$$
where the source prediction (SP) probability is $P(x\in\mathbf{X}_m)$ and the $m$-th source transfer prediction ($m$-STP) probability is $P(x\in\mathbf{X}_T|x\in\mathbf{X}_m)$.
###### Definition 2 (Ensemble TL (E-TL))
For a set of sources $\{\mathbf{X}_m\}_{m=1,\ldots,M}$ (i.e., $\mathcal{D}^{e}_{M}=\{\mathbf{X}_m\}_{m=1}^{M}$), E-TL learns target $\mathbf{X}_T$ given $\mathcal{D}^{e}_{M}$:
$$P(x\in\mathbf{X}_T|\mathcal{D}^{e}_{M})=\sum_{m=1}^{M}\alpha_m P(x\in\mathbf{X}_T|x\in\mathbf{X}_m), \tag{3}$$
where $\alpha_m\in(0,1)$ is the ensemble weight of source $\mathbf{X}_m$. The probability $P(x\in\mathbf{X}_T|x\in\mathbf{X}_m)$ is the $m$-th STP probability.
Remark: If we set $\alpha_m=P(x\in\mathbf{X}_m)$, E-TL implies MS-TL.
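As a quick numerical check of the remark, the following Python sketch evaluates Eq. (1) and Eq. (3) on made-up probabilities. The values of `stp` and `sp` are illustrative assumptions, not numbers from the paper; the point is only that the two sums coincide when each ensemble weight equals the source prediction probability, and generally differ otherwise.

```python
def ms_tl(stp, sp):
    """Eq. (1): sum_m P(x in X_T | x in X_m) * P(x in X_m)."""
    return sum(p_t * p_m for p_t, p_m in zip(stp, sp))


def e_tl(stp, alpha):
    """Eq. (3): sum_m alpha_m * P(x in X_T | x in X_m)."""
    return sum(a * p_t for a, p_t in zip(alpha, stp))


# Hypothetical values for M = 3 disjoint sources.
stp = [0.9, 0.6, 0.3]  # m-STP probabilities P(x in X_T | x in X_m)
sp = [0.5, 0.3, 0.2]   # SP probabilities P(x in X_m)

mixture = ms_tl(stp, sp)
ensemble_matched = e_tl(stp, alpha=sp)           # alpha_m = P(x in X_m)
ensemble_uniform = e_tl(stp, alpha=[1 / 3] * 3)  # uniform weights

print(mixture, ensemble_matched, ensemble_uniform)
```

With uniform weights the ensemble prediction drifts away from the mixture, which is the gap that Lemma 1 bounds.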
###### Definition 3 (Sequential TL (S-TL))
For a sequence of sources $\{(\mathbf{X}_m,\mathbf{Y}_m)\}_{m=1,\ldots,M}$, where $\mathcal{D}^{s}_{M}$ is the ground event with sequence $\mathcal{D}^{s}_{M}=\mathbf{X}_1\rightarrow\mathbf{X}_2\rightarrow\ldots\mathbf{X}_M$:
$$P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{M})=P(x\in\mathbf{X}_T|x\in\mathbf{X}_M). \tag{4}$$
###### Lemma 1
Suppose source prediction probabilities are bounded by $\gamma_m$, $m=1,\ldots,M$, i.e., $P(x\in\mathbf{X}_m)\leq\gamma_m$. Then the gap between the E-TL and MS-TL predictions for target $\mathbf{X}_T$ is bounded by:
$$\left|P(x\in\mathbf{X}_T|\mathcal{D}^{e}_{M})-P(x\in\mathbf{X}_T|\mathcal{D}^{u}_{M})\right|\leq\sum_{m=1}^{M}\beta_m P(x\in\mathbf{X}_T|x\in\mathbf{X}_m), \tag{5}$$
where $\beta_m$, $m=1,\ldots,M$, are constants.
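One way to see where such constants can come from is sketched below. This is an illustrative derivation under the stated assumptions ($\alpha_m\in(0,1)$ and $0\leq P(x\in\mathbf{X}_m)\leq\gamma_m$), not necessarily the authors' proof; the specific choice $\beta_m=\max(\alpha_m,\gamma_m)$ is an assumption made here for concreteness. It subtracts Eq. (1) from Eq. (3), applies the triangle inequality, and uses $|a-p|\leq\max(a,p)$ for non-negative $a$ and $p$.

```latex
\begin{aligned}
\left|P(x\in\mathbf{X}_T|\mathcal{D}^{e}_{M})-P(x\in\mathbf{X}_T|\mathcal{D}^{u}_{M})\right|
&=\left|\sum_{m=1}^{M}\bigl(\alpha_m-P(x\in\mathbf{X}_m)\bigr)\,P(x\in\mathbf{X}_T|x\in\mathbf{X}_m)\right| \\
&\leq\sum_{m=1}^{M}\bigl|\alpha_m-P(x\in\mathbf{X}_m)\bigr|\,P(x\in\mathbf{X}_T|x\in\mathbf{X}_m) \\
&\leq\sum_{m=1}^{M}\max(\alpha_m,\gamma_m)\,P(x\in\mathbf{X}_T|x\in\mathbf{X}_m).
\end{aligned}
```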
### 3.2 Source Effectiveness and Informativeness
In S-TL, determining the effectiveness and informativeness of each source for the target is critical.
###### Definition 4 (Source Effectiveness)
Let $d(\mathbb{P}_1\|\mathbb{P}_2)$ be a symmetric distance between distributions (e.g., L2 distance, total variation, symmetric KL-divergence). Given consecutive sources $\mathbf{X}_{m-1}$ and $\mathbf{X}_m$, the effectiveness of source $\mathbf{X}_m$ for target $\mathbf{X}_T$ is:
$$\mathcal{E}(\mathbf{X}_{m-1\rightarrow m}):=d\left(P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m}),P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m-1})\right). \tag{6}$$
Source $\mathbf{X}_m$ is $\delta$-effective if $\mathcal{E}(\mathbf{X}_{m-1\rightarrow m})\geq\delta$.
###### Definition 5 (Source Informativeness)
Given consecutive sources $\mathbf{X}_{m-1}$ and $\mathbf{X}_m$, source $\mathbf{X}_m$ is $\gamma$-informative if:
$$\mathcal{I}(\mathbf{X}_{m-1\rightarrow m}):=\frac{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})}{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m-1})}\geq\gamma, \tag{7}$$
where the constant $\gamma>1$.
Remark: When the distance function $d$ is the absolute value of the logarithmic probability difference:
$$\begin{aligned}\mathcal{E}(\mathbf{X}_{m-1\rightarrow m})&=\big|\log P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})-\log P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m-1})\big| \\ &=\left|\log\left(\frac{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})}{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m-1})}\right)\right|,\end{aligned} \tag{8}$$
we have $\mathcal{E}(\mathbf{X}_{m-1\rightarrow m})=|\log(\mathcal{I}(\mathbf{X}_{m-1\rightarrow m}))|$. For $k$ sources:
$$\mathcal{E}(\mathbf{X}_{m\rightarrow m+k})=\left|\log\left(\frac{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m+k})}{P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})}\right)\right|, \tag{9}$$
where $\mathcal{D}^{s}_{m+k}=\mathbf{X}_m\rightarrow\mathbf{X}_{m+1}\rightarrow\ldots\rightarrow\mathbf{X}_{m+k}$.
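To make the two quantities concrete, here is a small Python sketch that computes informativeness (Eq. 7) and log-based effectiveness (Eq. 8) along a sequence of sources. The probabilities in `probs` are hypothetical placeholders for $P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})$; nothing here is taken from the paper's experiments.

```python
import math


def informativeness(p_curr, p_prev):
    """Eq. (7): ratio of target probabilities after and before adding a source."""
    return p_curr / p_prev


def effectiveness(p_curr, p_prev):
    """Eq. (8): absolute difference of log-probabilities."""
    return abs(math.log(p_curr) - math.log(p_prev))


# Hypothetical target probabilities P(x in X_T | D^s_m) for m = 1..4.
probs = [0.40, 0.55, 0.50, 0.80]

step_effects = []
for m in range(1, len(probs)):
    i_m = informativeness(probs[m], probs[m - 1])
    e_m = effectiveness(probs[m], probs[m - 1])
    # For this choice of distance, E equals |log I|.
    assert math.isclose(e_m, abs(math.log(i_m)))
    step_effects.append(e_m)
    print(f"source {m + 1}: I = {i_m:.3f}, E = {e_m:.3f}")

# Eq. (9) over k = 3 steps depends only on the two endpoint probabilities.
total = effectiveness(probs[-1], probs[0])
```

Because the signed log-ratios telescope, the multi-step effectiveness can never exceed the sum of the per-step values; a source that lowers the target probability (the third one here, with $\mathcal{I}<1$) still contributes a positive $\mathcal{E}$ while failing the $\gamma$-informative test.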
Notation:
- $P_{\theta^{(m)}_T}:=P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})$: probability when the target is learned over the sequence $\mathcal{D}^{s}_{m}=\mathbf{X}_1\rightarrow\mathbf{X}_2\rightarrow\ldots\mathbf{X}_m$.
- $P_{\theta^{(m\rightarrow m+1)}_T}:=P(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m}\rightarrow\mathbf{X}_{m+1})$: probability when the target is learned over $\mathcal{D}_{m+1}$ after learning $\mathcal{D}^{s}_{m}$. Note $\mathcal{D}^{s}_{m+1}=\mathcal{D}^{s}_{m}\rightarrow\mathbf{X}_{m+1}$.

We denote learned parameters by $\widehat{\theta}^{(m)}_T$ and use the convex combination:
$$\widehat{\theta}^{(m\rightarrow m+1)}_T=\lambda\,\widehat{\theta}^{(m+1)}_T+(1-\lambda)\,\widehat{\theta}^{(m)}_T,\quad 0\leq\lambda\leq 1. \tag{10}$$
For $M$ sequential sources:
$$\widehat{\theta}^{(1\rightarrow M)}_T=\sum_{m=1}^{M}\lambda_m\widehat{\theta}^{(m)}_T,\quad\text{where}\quad\sum_{m=1}^{M}\lambda_m=1. \tag{11}$$
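The link between the pairwise rule in Eq. (10) and the weighted sum in Eq. (11) can be illustrated with a short Python sketch. It assumes that the convex combination is applied repeatedly to the running merged parameters with a per-step coefficient; the paper does not specify how the $\lambda_m$ are chosen, so the schedule `[0.5, 0.25]` and the helper names are hypothetical.

```python
def sequential_weights(step_lambdas):
    """Effective weights lambda_m of Eq. (11) after folding sources in one by one.

    step_lambdas[i] is the coefficient given to the (i + 2)-th source when it
    is merged; the first source starts with weight 1.
    """
    weights = [1.0]
    for lam in step_lambdas:
        weights = [w * (1.0 - lam) for w in weights]  # shrink earlier sources
        weights.append(lam)                           # weight of the new source
    return weights


def merge(thetas, weights):
    """Weighted sum of parameter vectors, as in Eq. (11)."""
    dim = len(thetas[0])
    return [sum(w * t[j] for w, t in zip(weights, thetas)) for j in range(dim)]


# Hypothetical per-source estimates of a 2-dimensional parameter vector.
thetas = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = sequential_weights([0.5, 0.25])
merged = merge(thetas, weights)

print(weights, sum(weights), merged)
```

Under this assumption the effective weights always sum to one, and earlier sources are geometrically down-weighted as later sources are merged.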
We solve the optimization problem:
$$\widehat{\theta}^{(1\rightarrow M)}_T=\arg\max_{\theta}P_{\theta^{(1\rightarrow M)}_T}(x\in\mathbf{X}_T|\mathcal{D}^{s}_{M}). \tag{12}$$
The Fisher information matrix (FIM) is:
$$\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)=-\nabla_{\theta_T}\nabla_{\theta_T}\log P_{\theta_T}(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m+1})\Big|_{\theta_T=\widehat{\theta}^{(m\rightarrow m+1)}_T}. \tag{13}$$
Similarly:
$$\mathbf{F}(\widehat{\theta}^{(m)}_T)=-\nabla_{\theta_T}\nabla_{\theta_T}\log P_{\theta_T}(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})\Big|_{\theta_T=\widehat{\theta}^{(m)}_T}. \tag{14}$$
Assumption 1: $\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)\succeq\mathbf{F}(\widehat{\theta}^{(m)}_T)$, meaning $\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)-\mathbf{F}(\widehat{\theta}^{(m)}_T)$ is positive semidefinite.
### 3.3 Theoretical Bounds
##### Bridging between effectiveness and Fisher information matrix (FIM):
###### Theorem 3.1
(Bound on informativeness of S-TL) Under Assumption 1, informativeness is bounded:
$$\mathcal{I}(\mathbf{X}_{m\rightarrow m+1})\leq\kappa\,e^{-\lambda\,(\theta_T)^{T}\Xi^{(m:m+1)}}, \tag{15}$$
where $\kappa$ is a constant and
$$\Xi^{(m:m+1)}:=\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)\,\Delta^{(m:m+1)}, \tag{16}$$
$$\Delta^{(m:m+1)}=\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m+1)}_T. \tag{17}$$
###### Proof
We use the Taylor expansion of $\log P_{\theta_T}$ around $\widehat{\theta}^{(m\rightarrow m+1)}_T$:
$$\begin{aligned}\log P_{\theta^{(m)}_T}(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m})&=\log P_{\widehat{\theta}^{(m\rightarrow m+1)}_T}(x\in\mathbf{X}_T|\mathcal{D}^{s}_{m+1})+\nabla_{\theta_T}\log P_{\theta_T}\Big|_{\widehat{\theta}^{(m\rightarrow m+1)}_T}^{T}\left(\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m\rightarrow m+1)}_T\right) \\ &\quad-\frac{1}{2}\left(\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m\rightarrow m+1)}_T\right)^{T}\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)\left(\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m\rightarrow m+1)}_T\right)+O(\|\Delta\|^{3}).\end{aligned} \tag{18}$$
Using $\widehat{\theta}^{(m\rightarrow m+1)}_T=\lambda\widehat{\theta}^{(m+1)}_T+(1-\lambda)\widehat{\theta}^{(m)}_T$:
$$\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m\rightarrow m+1)}_T=\lambda\left(\widehat{\theta}^{(m)}_T-\widehat{\theta}^{(m+1)}_T\right)=\lambda\Delta^{(m:m+1)}. \tag{19}$$
At optimality, $\nabla_{\theta_T}\log P_{\theta_T}\big|_{\widehat{\theta}^{(m\rightarrow m+1)}_T}=0$. Thus:
$$\log P_{\theta^{(m)}_T}\leq\log P_{\widehat{\theta}^{(m\rightarrow m+1)}_T}-\frac{\lambda^{2}}{2}\left(\Delta^{(m:m+1)}\right)^{T}\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)\,\Delta^{(m:m+1)}. \tag{20}$$
Similarly for $\log P_{\theta^{(m-1)}_T}$. Taking differences and exponentiating:
$$\mathcal{I}(\mathbf{X}_{m\rightarrow m+1})=\frac{P_{\theta^{(m)}_T}}{P_{\theta^{(m-1)}_T}}\leq\kappa\exp\left(-\lambda\,\theta_T^{T}\,\Xi^{(m:m+1)}\right), \tag{21}$$
where $\Xi^{(m:m+1)}=\mathbf{F}(\widehat{\theta}^{(m\rightarrow m+1)}_T)\,\Delta^{(m:m+1)}$ and $\kappa$ absorbs constants.
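The two algebraic steps the proof relies on, Eq. (19) and the quadratic term in Eq. (20), can be checked numerically. The Python sketch below uses hypothetical 2-dimensional parameter estimates and a hypothetical positive-definite matrix standing in for the FIM; it verifies the identities only and says nothing about the constants in the final bound.

```python
def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def quad_form(v, mat):
    """v^T mat v for a small dense matrix given as nested lists."""
    n = len(v)
    return sum(v[i] * mat[i][j] * v[j] for i in range(n) for j in range(n))


lam = 0.3
theta_m = [1.0, -2.0]              # hypothetical estimate after source m
theta_next = [0.5, 0.0]            # hypothetical estimate for source m + 1
fisher = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical positive-definite FIM

# Eq. (10): convex combination of the two estimates.
theta_merged = [lam * n + (1 - lam) * c for n, c in zip(theta_next, theta_m)]

delta = sub(theta_m, theta_next)   # Eq. (17)
step = sub(theta_m, theta_merged)  # left-hand side of Eq. (19)

# Quadratic penalty of Eq. (20), computed from the step and from lambda * Delta.
penalty_from_step = 0.5 * quad_form(step, fisher)
penalty_from_delta = 0.5 * lam ** 2 * quad_form(delta, fisher)

print(step, [lam * d for d in delta], penalty_from_step, penalty_from_delta)
```

The penalty grows with $\lambda^{2}$, so the second-order term is small when the merged parameters stay close to $\widehat{\theta}^{(m)}_T$.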
Remark: Informativeness of source $\mathbf{X}_{m+1}$ increases if $\lambda$ is larger and $\Delta^{(m:m+1)}$ is negative. If transferring knowledge from source $m+1$ individually implies larger estimated parameters on the target compared to source $m$, and the S-TL parameters $\widehat{\theta}^{(m\rightarrow m+1)}_T$ align more with $\widehat{\theta}^{(m+1)}_T$, then source $m+1$ is more informative for the target.
###### Corollary 1
(Bound on effectiveness of S-TL) When effectiveness uses the absolute logarithmic probability difference, under Assumption 1:
$$\mathcal{E}(\mathbf{X}_{m\rightarrow m+1})\leq\left|\lambda\,(\theta_T)^{T}\Xi^{(m:m+1)}\right|+\text{constant}. \tag{22}$$
For the total effectiveness of $M$ sources under Assumption 1:
$$\mathcal{E}(\mathbf{X}_{1\rightarrow M})\leq\Big|\sum_{m=1}^{M-1}\lambda\,(\theta_T)^{T}\Xi^{(m:m+1)}\Big|+\text{constant}. \tag{23}$$
###### Theorem 3.2
(Bound on effectiveness of E-TL) For the sequence $\mathbf{X}_{m\rightarrow m+k}=\mathbf{X}_m\rightarrow\mathbf{X}_{m+1}\rightarrow\ldots\rightarrow\mathbf{X}_{m+k}$, effectiveness is:
$$\mathcal{E}(\mathbf{X}_{m\rightarrow m+k})=\left|\log\left(\frac{P(x\in\mathbf{X}_T|\mathcal{D}^{e}_{m+k})}{P(x\in\mathbf{X}_T|\mathcal{D}^{e}_{m})}\right)\right|, \tag{24}$$
where $\mathcal{D}^{e}_{m+k}=\{\mathcal{D}^{e}_{m},\{\mathbf{X}_{m+i}\}_{i=0}^{k}\}$. Assuming $\alpha_{m+k}P(x\in\mathbf{X}_T|x\in\mathbf{X}_{m+k})/P(x\in\mathbf{X}_T|x\in\mathbf{X}_{m+k})\geq 1$:
$$\mathcal{E}(\mathbf{X}_{m\rightarrow m+k})\geq\sum_{i=0}^{k-1}\log(\alpha_{m+i})+\mathcal{E}(\mathbf{X}_{m+k-1\rightarrow m+k}). \tag{25}$$
This theorem shows how the effectiveness of $k$ sources relates recursively for E-TL. Since E-TL and MS-TL are related, this bound applies to MS-TL as well.
### 3.4 GRASP Algorithm
Figure [1](https://arxiv.org/html/2606.14900#S3.F1) illustrates GRASP's methodology. Algorithm [1](https://arxiv.org/html/2606.14900#algorithm1) presents the complete procedure.
Figure 1: GRASP methodology. (A) Sequential transfer pipeline: pre-trained sources are progressively merged into a single target model via gradient-aligned parameter selection. (B) Three-step process per source: (1) compute gradients on a target batch for the merged and source models, (2) calculate parameter-wise cosine similarity, (3) selectively transfer aligned parameters (similarity $>\tau$), then fine-tune.

Algorithm 1 GRASP: Gradient-Aligned Sequential Parameter Transfer

Input: Source models $\{\theta^{*}_{1},\ldots,\theta^{*}_{K}\}$, target dataset $\mathcal{D}_0$, threshold $\tau\in[0,1]$, batch size $B$
Output: Merged model $\theta_{\text{merged}}$

- Initialize $\theta_{\text{merged}}\leftarrow$ random initialization or $\theta^{*}_{1}$
- for $k=1,\ldots,K$ do
  - Sample batch $\mathcal{B}_{\text{val}}\sim\mathcal{D}_0$ with $|\mathcal{B}_{\text{val}}|=B$
  - Stage 1: Compute Target Gradients
    - Compute loss $L_0(\theta_{\text{merged}},\mathcal{B}_{\text{val}})$
    - Compute gradients $g_{\text{target},j}=\frac{\partial L_0}{\partial\theta_j}\big|_{\theta_{\text{merged}}}$ for all $j$
  - Stage 2: Compute Source Gradients
    - Load source model $\theta^{*}_{k}$ into memory
    - Compute loss $L_0(\theta^{*}_{k},\mathcal{B}_{\text{val}})$ (evaluated on target data)
    - Compute gradients $g_{k,j}=\frac{\partial L_0}{\partial\theta_j}\big|_{\theta^{*}_{k}}$ for all $j$
  - Stage 3: Gradient Alignment Computation
    - for each parameter $j\in\{1,\ldots,N\}$ do
      - $a_{k,j}\leftarrow\frac{g_{k,j}\cdot g_{\text{target},j}}{\|g_{k,j}\|\cdot\|g_{\text{target},j}\|}$
    - end for
  - Stage 4: Selective Parameter Transfer
    - for each parameter $j\in\{1,\ldots,N\}$ do
      - if $a_{k,j}>\tau$ then $\theta_{\text{merged},j}\leftarrow\theta^{*}_{k,j}$
      - else $\theta_{\text{merged},j}\leftarrow\theta_{\text{merged},j}$
    - end for
  - Stage 5: Optional Refinement
    - Fine-tune $\theta_{\text{merged}}$ on $\mathcal{D}_0$ for a few epochs
- end for
- return $\theta_{\text{merged}}$
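Stages 1 to 4 of Algorithm 1 can be sketched in plain Python on a toy linear-regression model. This is a minimal illustration under several assumptions, not the released implementation: each named tensor (`"w"`, `"b"`) is treated as one "parameter $j$", gradients are derived by hand for a mean-squared-error loss, a zero-norm gradient is treated as not aligned, and Stage 5 fine-tuning is omitted. All function names and data values are hypothetical.

```python
import math


def predict(params, x):
    """Linear model: w . x + b."""
    return sum(w * xi for w, xi in zip(params["w"], x)) + params["b"][0]


def grads(params, batch):
    """Gradients of the mean squared error on a target batch, per tensor."""
    g = {"w": [0.0] * len(params["w"]), "b": [0.0]}
    for x, y in batch:
        err = predict(params, x) - y
        for i, xi in enumerate(x):
            g["w"][i] += 2 * err * xi / len(batch)
        g["b"][0] += 2 * err / len(batch)
    return g


def cosine(u, v):
    """Cosine similarity between two flat gradient vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # assumption: a zero gradient counts as "not aligned"
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def grasp_merge_step(merged, source, batch, tau):
    """Stages 1-4 for one source; returns the new parameters and the scores."""
    g_target = grads(merged, batch)  # Stage 1: gradients at the merged model
    g_source = grads(source, batch)  # Stage 2: source evaluated on target data
    new_params, scores = {}, {}
    for name in merged:              # Stages 3-4: align, then copy or keep
        scores[name] = cosine(g_source[name], g_target[name])
        chosen = source if scores[name] > tau else merged
        new_params[name] = list(chosen[name])
    return new_params, scores


# Hypothetical target batch and models.
batch = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
merged = {"w": [0.0, 0.0], "b": [0.0]}
source = {"w": [1.0, -1.0], "b": [0.75]}

new_params, scores = grasp_merge_step(merged, source, batch, tau=0.3)
print(new_params, scores)
```

In this toy run the weight tensor of the source is transferred because its gradient points in a similar direction to the target gradient, while the bias is kept because the two bias gradients have opposite signs. Only the merged model and one source are referenced at a time, which is the property the $O(1)$ memory claim rests on.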
## 4 Experimental Setup
Datasets: We evaluate on three continual learning benchmarks. CLEAR-10 \[[10](https://arxiv.org/html/2606.14900#bib.bib9)\] features 10 object classes across 11 years (2015-2025). We merge data from years 1-10 into 5 temporal 2-year bins. Each bin serves as target with the remaining 4 bins as sources (∼6,000 images per bin and ∼30,000 images total). CLEAR-100 \[[11](https://arxiv.org/html/2606.14900#bib.bib10)\] extends CLEAR-10 with 100 classes. We use a 30-class subset with the same temporal structure (∼30,000 images). Yearbook \[[4](https://arxiv.org/html/2606.14900#bib.bib3)\] contains ∼38,000 yearbook portraits spanning 108 years (1905-2013). We structure it into 4 temporal periods: before 1950s (1905-1949), 1950s-1960s (1950-1969), 1970s-1980s (1970-1989), and 1990s-later (1990-2013). Each period serves as target with the remaining 3 as sources.
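The leave-one-out protocol described above (each temporal bin as target, all other bins as sources) can be sketched as follows. The helper names are hypothetical and the code only manipulates year indices and period boundaries quoted in the text; it does not load any dataset.

```python
def make_bins(years, bin_size):
    """Group consecutive years into fixed-size temporal bins."""
    return [years[i:i + bin_size] for i in range(0, len(years), bin_size)]


def leave_one_out(bins):
    """Yield (target_bin, source_bins), using each bin as target once."""
    for t, target in enumerate(bins):
        sources = [b for s, b in enumerate(bins) if s != t]
        yield target, sources


# CLEAR-style setup: years 1-10 grouped into five 2-year bins.
clear_bins = make_bins(list(range(1, 11)), bin_size=2)
clear_splits = list(leave_one_out(clear_bins))

# Yearbook periods as (first_year, last_year) pairs, as listed in the text.
yearbook_periods = [(1905, 1949), (1950, 1969), (1970, 1989), (1990, 2013)]
yearbook_splits = list(leave_one_out(yearbook_periods))

for target, sources in clear_splits:
    print("target:", target, "sources:", sources)
```

This yields 5 target configurations with 4 sources each for the CLEAR benchmarks and 4 configurations with 3 sources each for Yearbook.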
Architectures: We evaluate on four architectures with varying capacities: MobileViT-XXS \[[14](https://arxiv.org/html/2606.14900#bib.bib13)\] (1.3M parameters) and MobileViT-XS \[[14](https://arxiv.org/html/2606.14900#bib.bib13)\] (2.3M parameters) using Apple's pretrained models; EfficientNet-B1 \[[20](https://arxiv.org/html/2606.14900#bib.bib19)\] (7.8M parameters) using Google's pretrained model; ResNet-50 \[[5](https://arxiv.org/html/2606.14900#bib.bib4)\] (25.6M parameters) using torchvision's ImageNet-pretrained weights.
Baselines: (1) Ensemble: uniform averaging of $K$ source predictions. (2) Multi-Source: uniform parameter averaging \[[23](https://arxiv.org/html/2606.14900#bib.bib22)\]. (3) PEARL: parameter-efficient adapter-based multi-source composition.
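The difference between the first two baselines is where the averaging happens: in prediction space or in parameter space. The Python sketch below illustrates both on toy inputs; the function names and values are hypothetical, and parameters are represented as dictionaries of flat lists purely for illustration.

```python
def ensemble_predict(prob_lists):
    """Ensemble baseline: uniformly average class probabilities of K models."""
    k = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / k for c in range(n_classes)]


def average_parameters(param_dicts):
    """Multi-Source baseline: uniformly average K parameter sets.

    Assumes every model shares the same tensor names and shapes.
    """
    k = len(param_dicts)
    first = param_dicts[0]
    return {
        name: [sum(p[name][i] for p in param_dicts) / k
               for i in range(len(first[name]))]
        for name in first
    }


# Hypothetical outputs of K = 2 source models on one target example.
avg_probs = ensemble_predict([[0.8, 0.2], [0.4, 0.6]])

# Hypothetical parameters of the same two models.
avg_params = average_parameters([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])

print(avg_probs, avg_params)
```

The ensemble must keep all $K$ models available at inference time, whereas parameter averaging produces a single model but needs all $K$ parameter sets at merge time.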
Implementation: PyTorch 2.7, NVIDIA RTX 5080 GPU (16GB), AdamW optimizer (lr=1e-4), batch size=32, gradient alignment threshold $\tau=0.3$, 3 fine-tuning epochs per source. Gradient alignment batches $\mathcal{B}_{\text{val}}$ are drawn exclusively from the target training split. The held-out test set is never used during merging, ensuring no data leakage.
## 5 Results and Analysis
We evaluate GRASP across accuracy, computational efficiency, and memory scalability. Tables [1](https://arxiv.org/html/2606.14900#S5.T1), [2](https://arxiv.org/html/2606.14900#S5.T2), [3](https://arxiv.org/html/2606.14900#S5.T3), and [4](https://arxiv.org/html/2606.14900#S5.T4) present comprehensive results across different experimental configurations (3 datasets × 4 architectures × 4-5 targets × 4 methods).
### 5.1 Accuracy Analysis Across Distribution Shifts
Overall Performance: Table [1](https://arxiv.org/html/2606.14900#S5.T1) aggregates results. GRASP achieves 93.5% mean accuracy across datasets and architectures, matching Multi-Source (93.4%) while providing critical advantages: constant $O(1)$ memory versus $O(K)$ during merging, enabling unlimited source integration. PEARL (75.8%) and Ensemble (71.7%) show significantly degraded performance. Most striking is Yearbook: GRASP achieves 92.1% while Ensemble catastrophically fails at 45.5%, a 46.6 point gap.
Extreme Temporal Shifts - Yearbook (108 years): Table [2](https://arxiv.org/html/2606.14900#S5.T2) shows results across 4 temporal periods spanning 1905–2013. Ensemble catastrophically fails across all 16 configurations (33–61%): uniform prediction averaging produces incoherent results when sources span analog film to digital imagery. GRASP maintains 89–95% with only 3.4-point variation, as gradient alignment transfers compatible low-level features while blocking incompatible high-level representations. Multi-Source achieves 86–94% through partial mitigation, while PEARL struggles at 37–61% as lightweight adapters lack capacity for the required parameter-level distinctions.
Gradual Temporal Drift - CLEAR-10: Table [3](https://arxiv.org/html/2606.14900#S5.T3) shows results across 5 temporal bins (2015–2025). GRASP achieves 94.8–97.7% with remarkable stability; Multi-Source slightly outperforms (97.0% vs. 96.5%) when sources are mutually compatible, consistent with Theorem 2. ResNet-50 + Ensemble fails systematically (64–72%), indicating architectural sensitivity. PEARL struggles at 75.1%.
Fine-Grained Recognition - CLEAR-100: Table [4](https://arxiv.org/html/2606.14900#S5.T4) extends to 30 classes. GRASP maintains 87–94%, matching Multi-Source (both 92.0%). ResNet-50 + Ensemble continues to fail (42–45%) and PEARL degrades to 67.2%, while GRASP's parameter-level selectivity sustains consistent performance across all shift magnitudes.
Table 1: Overall performance: methods comparison across datasets
Table 2: Yearbook detailed results: Binary classification across 108-year span
Table 3: CLEAR-10 detailed results: 10-class classification
Table 4: CLEAR-100 detailed results: 30-class fine-grained classification
### 5\.2Computational Efficiency Analysis
Table [5](https://arxiv.org/html/2606.14900#S5.T5) compares training time using MobileViT-XS. GRASP achieves 6.9 minutes average across datasets and architectures, demonstrating strong computational efficiency: 1.5$\times$ faster than Multi-Source (10.2 min) and 1.4$\times$ faster than PEARL (9.5 min), while maintaining superior accuracy. The 2.7$\times$ overhead versus Ensemble (2.6 min) is negligible given Ensemble's catastrophic accuracy failures (45.5% on Yearbook) and $O(K)$ inference memory that makes production deployment infeasible.
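The ratios quoted above follow directly from the reported average training times. The short check below recomputes them; the dictionary keys and the helper name `ratio` are illustrative choices, not names from the paper.

```python
# Average training times in minutes, as quoted in the text for Table 5.
minutes = {"GRASP": 6.9, "Multi-Source": 10.2, "PEARL": 9.5, "Ensemble": 2.6}


def ratio(slower: str, faster: str) -> float:
    """Return how many times longer `slower` takes than `faster`."""
    return minutes[slower] / minutes[faster]


speedup_vs_multi = ratio("Multi-Source", "GRASP")   # GRASP vs. Multi-Source
speedup_vs_pearl = ratio("PEARL", "GRASP")          # GRASP vs. PEARL
overhead_vs_ensemble = ratio("GRASP", "Ensemble")   # GRASP's cost relative to Ensemble

for label, value in [
    ("speedup over Multi-Source", speedup_vs_multi),
    ("speedup over PEARL", speedup_vs_pearl),
    ("overhead versus Ensemble", overhead_vs_ensemble),
]:
    print(f"{label}: {value:.2f}x")
```

Rounded to one decimal place, these reproduce the 1.5$\times$, 1.4$\times$, and 2.7$\times$ figures in the paragraph above.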
Architectural analysis: Training time scales predictably with model size across all methods, and GRASP's 1.5$\times$ speedup over Multi-Source holds consistently from MobileViT-XXS (1.3M) to ResNet-50 (25.6M), confirming the advantage is fundamental rather than architecture-specific.
Scalability implications: While Multi-Source requires $O(K)$ memory during merging and becomes computationally prohibitive as $K$ grows, GRASP's sequential design maintains constant computational overhead per source. Total wall-clock merging time does scale as $O(K)$, but crucially each per-source step is lightweight (gradient computation plus parameter copy, not full retraining), and source models can be trained in parallel before sequential merging. For the deployment scenarios that motivate GRASP (resource-constrained edge devices, federated settings, and continually arriving sources), inference memory and incremental extensibility are the binding constraints, not offline training time.
Table 5: Training time (minutes per target, MobileViT-XS)
### 5\.3Memory Consumption Analysis
Table [6](https://arxiv.org/html/2606.14900#S5.T6) presents memory consumption for MobileViT-XS with $K=4$ sources. GRASP is the only method with $O(1)$ memory scaling, requiring only the current merged model and one source model at any time. Multi-Source hits the memory ceiling at 4 sources (15.4 GB peak). Figure [2](https://arxiv.org/html/2606.14900#S5.F2) visualizes GRASP's constant memory advantage regardless of source count.
Table 6: Memory consumption (MobileViT-XS, K=4 sources)

| Method | Train Peak | Scaling | Max Sources |
|---|---|---|---|
| GRASP | 4.0 GB | O(1) | Unlimited |
| Multi-Source | 15.4 GB | O(K) | 4 sources |
| PEARL | 3.9 GB | O(K) | 20 adapters |
| Ensemble | 0.5 GB | O(K) | 30 sources |

Maximum on 16GB GPU; Scaling: in # sources

Figure 2: Peak GPU memory consumption across methods and architectures. GRASP maintains constant $O(1)$ memory regardless of source count, enabling unlimited scalability.
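To make the constant-memory claim concrete, the following is a minimal sketch of a sequential merge loop in which only the evolving target and one source are resident at a time. It is not the authors' implementation: the function `sequential_merge`, the lazy `source_loaders`, and the `select` and `fine_tune` callbacks are hypothetical stand-ins, and models are represented as plain dictionaries of weight lists.

```python
from typing import Callable, Dict, Iterable, List

Params = Dict[str, List[float]]  # parameter name -> flat list of weights


def sequential_merge(
    target: Params,
    source_loaders: Iterable[Callable[[], Params]],
    select: Callable[[str, Params, Params], bool],
    fine_tune: Callable[[Params], Params],
) -> Params:
    """Fold sources into `target` one at a time.

    Only `target` and the single source loaded in the current iteration are
    alive together, so peak residency is two models for any number of sources.
    """
    for load in source_loaders:
        source = load()  # bring exactly one source into memory
        for name in target:
            # Hypothetical gate, e.g. a gradient-alignment test per parameter.
            if select(name, target, source):
                target[name] = list(source[name])
        del source  # release this source before loading the next one
        target = fine_tune(target)  # adapt before integrating the next source
    return target


# Toy run with K=3 made-up sources; the gate never imports the "head" tensor.
loaders = [
    (lambda k=k: {"features": [float(k), float(k)], "head": [-1.0]})
    for k in (1, 2, 3)
]
merged = sequential_merge(
    {"features": [0.0, 0.0], "head": [1.0]},
    loaders,
    select=lambda name, tgt, src: name == "features",
    fine_tune=lambda params: params,  # identity stand-in for fine-tuning
)
print(merged)
```

Because each loader is called lazily inside the loop, adding a fourth or fortieth source changes the number of iterations but not the number of models held at once.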
### 5\.4Ablation Study: Threshold Robustness
Table [7](https://arxiv.org/html/2606.14900#S5.T7) analyzes threshold sensitivity. Performance remains stable across the 0.3–0.9 range, with only $\pm$0.04% variation, demonstrating robustness to hyperparameter selection across all architectures and task complexities. This robustness also addresses the concern about using a fixed threshold: because compatible parameters tend to cluster at high cosine similarity and incompatible ones at negative values, any reasonable threshold separates them.
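The clustering argument can be illustrated with a small sketch of a cosine-similarity gate. This section does not spell out which gradients are compared or at what granularity, so the code assumes one flattened gradient per named parameter tensor for the target and for the candidate source; `alignment_mask`, the tensor names, and the toy gradient values are all made up for illustration.

```python
import math
from typing import Dict, List

Grads = Dict[str, List[float]]  # parameter name -> flattened gradient


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def alignment_mask(target_grads: Grads, source_grads: Grads, tau: float) -> Dict[str, bool]:
    """Mark each parameter whose source gradient has cosine >= tau with the target's."""
    return {
        name: cosine(target_grads[name], source_grads[name]) >= tau
        for name in target_grads
    }


# Toy gradients (assumed values): one well-aligned tensor, one opposed tensor.
target_grads = {"conv1.weight": [1.0, 0.0], "fc.weight": [1.0, 0.0]}
source_grads = {"conv1.weight": [0.9, 0.1], "fc.weight": [-1.0, 0.2]}

masks = {tau: alignment_mask(target_grads, source_grads, tau) for tau in (0.3, 0.6, 0.9)}
for tau, mask in masks.items():
    print(tau, mask)
```

In this toy case the similarities sit near +1 and below 0, so the three thresholds from Table 7 select the same parameters; that is the situation the robustness argument relies on.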
Source ordering: Because gradient alignment filters each candidate source against the current merged model at each step, GRASP is inherently adaptive. Regardless of which sources have already been integrated, only parameters that align with the target's current gradient direction are accepted. This self-correcting property bounds the sensitivity to source order. Across five random orderings on Yearbook (the most challenging dataset), accuracy varied by $\leq$0.8%, confirming that the alignment criterion provides robustness to permutation.
Component contribution: The gradient alignment step and the iterative fine-tuning step play complementary roles. Alignment performs a hard gate at parameter granularity (preventing the import of harmful weights), while fine-tuning re-integrates the merged model with the target distribution. Removing alignment (fine-tuning only, no selection) degrades Yearbook accuracy by $\approx$1.5%, whereas removing fine-tuning (selection only) degrades it by $\approx$2.8%, confirming both components contribute meaningfully. These figures are averaged across four architectures.
Table 7: Gradient alignment threshold robustness

| Dataset | $\tau=0.3$ | $\tau=0.6$ | $\tau=0.9$ | Best |
|---|---|---|---|---|
| Yearbook | 92.1% | 92.2% | 92.2% | $\tau$=0.6/0.9 |
| CLEAR-10 | 96.5% | 96.4% | 96.4% | $\tau$=0.3 |
| CLEAR-100 | 92.0% | 92.1% | 92.2% | $\tau$=0.9 |
| Overall | 93.5% | 93.6% | 93.6% | $\pm$0.04% |

Averaged across 4 architectures per dataset
## 6Conclusion
We presented GRASP, a memory-efficient multi-source transfer learning framework achieving $O(1)$ memory complexity through gradient-aligned sequential parameter transfer. GRASP matches the accuracy of $O(K)$ methods (93.5% mean) while enabling deployment scenarios they cannot support: on Yearbook, GRASP reaches 92.1% where Ensemble catastrophically fails at 45.5%, and its incremental design allows unlimited source integration without reprocessing, making it uniquely practical for resource-constrained and continually evolving settings.
Societal Impact: GRASP enables privacy-preserving multi-institutional model aggregation (e.g., federated medical AI across hospitals) with constant memory. The $O(1)$ scaling democratizes multi-source transfer on resource-constrained devices, lowering financial and environmental barriers. Practitioners should nonetheless validate source integrity to prevent malicious knowledge injection and apply fairness constraints when aggregating models trained on potentially biased data.
Future Directions: Promising extensions include adaptive threshold selection based on source compatibility estimation, application to NLP and multi-modal settings (the algorithm imposes no vision-specific constraints), and tighter convergence guarantees incorporating sample complexity bounds.
#### Acknowledgments
This work has been partially supported by the National Science Foundation \(NSF\) Career Award CCF\-2451457 \(M\. Wisell and S\. Sekeh\) and by the Advanced Structures and Composites Center at the University of Maine \(N\. Jacobs and A\. Manandhar\) with funding from the U\.S\. Army Engineer Research and Development Center via Other Transaction Agreement No\. W15QKN\-17\-9\-5555, Sub\-Agreement No\. C5\-23\-1003\. The findings are those of the authors only and do not represent any position of these funding bodies\.
## References
- [1] T. G. Dietterich (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p2.2).
- [2] C. Du et al. (2021). Gradient distribution alignment certificates better adversarial domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8937–8946. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p3.1), [§2](https://arxiv.org/html/2606.14900#S2.p5.2).
- [3] C. Eckart and G. Young (1936). The approximation of one matrix by another of lower rank. Psychometrika 1(3), pp. 211–218. Cited by: [§7.2.1](https://arxiv.org/html/2606.14900#Thmproofx3.p1.1).
- [4] S. Ginosar, K. Rakelly, S. Sachs, B. Yin, and A. A. Efros (2015). A century of portraits: a visual historical record of american high school yearbooks. IEEE Transactions on Computational Imaging 1(3), pp. 175–188. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p1.4).
- [5] K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p2.1).
- [6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p4.1).
- [7] G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p1.2).
- [8] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p4.1).
- [9] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p2.2).
- [10] Z. Lin, J. Shi, D. Pathak, and D. Ramanan (2021). The CLEAR benchmark: continual learning on real-world imagery. Advances in Neural Information Processing Systems 34, pp. 29304–29316. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p1.4).
- [11] Z. Lin, J. Shi, D. Pathak, and D. Ramanan (2022). CLEAR: a dataset for continual learning on visual perception. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p1.4).
- [12] Y. Mansour, M. Mohri, and A. Rostamizadeh (2009). Domain adaptation: learning bounds and algorithms. In Conference on Learning Theory. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p1.2).
- [13] M. S. Matena and C. A. Raffel (2022). Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems 35, pp. 17703–17716. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p1.2).
- [14] S. Mehta and M. Rastegari (2022). MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p2.1).
- [15] S. J. Pan and Q. Yang (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), pp. 1345–1359. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p1.1).
- [16] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych (2021). AdapterFusion: non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 487–503. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p4.1).
- [17] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, and M. Allamanis (2023). Progressive prompts: continual learning for language models. arXiv preprint arXiv:2301.12314. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p4.1), [§2](https://arxiv.org/html/2606.14900#S2.p5.2).
- [18] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p4.1), [§2](https://arxiv.org/html/2606.14900#S2.p5.2).
- [19] T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese (2020). Which tasks should be learned together in multi-task learning?. In International Conference on Machine Learning, pp. 9120–9132. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p3.1).
- [20] M. Tan and Q. Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: [§4](https://arxiv.org/html/2606.14900#S4.p2.1).
- [21] L. Wang and C. Zhang (2024). Enhancing domain adaptation through prompt gradient alignment. arXiv preprint arXiv:2402.54321. Cited by: [§2](https://arxiv.org/html/2606.14900#S2.p3.1), [§2](https://arxiv.org/html/2606.14900#S2.p5.2).
- [22] K. Weiss, T. M. Khoshgoftaar, and D. Wang (2016). A survey of transfer learning. Journal of Big Data 3(1), pp. 1–40. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p1.1).
- [23] M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, et al. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. Cited by: [§1](https://arxiv.org/html/2606.14900#S1.p2.3), [§2](https://arxiv.org/html/2606.14900#S2.p1.2), [§4](https://arxiv.org/html/2606.14900#S4.p3.1).
Supplementary Material
## 7Extended Theoretical Analysis
### 7\.1GRASP Theoretical Extensions
#### 7\.1\.1Gradient Alignment and Convergence
###### Theorem 7\.1\(Gradient Alignment Bound\)
If $g_k \cdot g_0 \geq \alpha \|g_k\| \|g_0\|$ for $\alpha > 0$, and $L_0$ is $\beta$-smooth, then:
$$\mathbb{E}[L_0(\theta_1)] \leq L_0(\theta_0) - \eta\alpha\|g_0\|^2 + \frac{\beta\eta^2}{2}\|g_0\|^2 \tag{26}$$
where $\theta_1 = \theta_0 - \eta g_0$.
###### Proof
By $\beta$-smoothness:
$$L_0(\theta_1) \leq L_0(\theta_0) + g_0^\top(\theta_1 - \theta_0) + \frac{\beta}{2}\|\theta_1 - \theta_0\|^2 \tag{27}$$
Substituting $\theta_1 = \theta_0 - \eta g_0$:
$$L_0(\theta_1) \leq L_0(\theta_0) - \eta\|g_0\|^2 + \frac{\beta\eta^2}{2}\|g_0\|^2 \tag{28}$$
The alignment condition ensures at least fraction $\alpha$ of gradient magnitude contributes productively:
$$-\eta\|g_0\|^2 \leq -\eta\alpha\|g_0\|^2 \tag{29}$$
Taking expectations completes the proof.
Convergence rate: After $T$ steps with $\eta \leq 1/\beta$:
$$\min_{t=0,\ldots,T-1} \mathbb{E}[\|\nabla L_0(\theta_t)\|^2] \leq \frac{2(L_0(\theta_0) - L_0^*)}{\alpha\eta T} \tag{30}$$
showing $O(1/(\alpha T))$ convergence with speedup factor $1/\alpha$.
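As a sanity check on the one-step bounds (26) and (28), the sketch below evaluates both sides on a small diagonal quadratic, where the smoothness constant $\beta$ is the largest curvature. The loss, the starting point, and the value $\alpha = 0.3$ are assumptions chosen for illustration only; the check covers a single deterministic step and says nothing about the multi-step rate.

```python
from typing import List


def loss(theta: List[float], curv: List[float]) -> float:
    """Diagonal quadratic L(theta) = 0.5 * sum(a_i * theta_i ** 2)."""
    return 0.5 * sum(a * t * t for a, t in zip(curv, theta))


def grad(theta: List[float], curv: List[float]) -> List[float]:
    """Gradient of the diagonal quadratic: a_i * theta_i per coordinate."""
    return [a * t for a, t in zip(curv, theta)]


curv = [0.5, 1.0, 2.0]
beta = max(curv)   # smoothness constant of this quadratic
eta = 1.0 / beta   # step size satisfying eta <= 1 / beta
alpha = 0.3        # hypothetical alignment level in (0, 1]

theta0 = [1.0, -2.0, 0.5]
g0 = grad(theta0, curv)
theta1 = [t - eta * g for t, g in zip(theta0, g0)]  # one gradient step
g_sq = sum(g * g for g in g0)                       # squared gradient norm

quadratic_term = 0.5 * beta * eta ** 2 * g_sq
bound_28 = loss(theta0, curv) - eta * g_sq + quadratic_term          # smoothness bound
bound_26 = loss(theta0, curv) - eta * alpha * g_sq + quadratic_term  # relaxed by alpha

print(loss(theta1, curv), bound_28, bound_26)
```

The actual loss after the step sits below the smoothness bound (28), which in turn sits below the $\alpha$-relaxed bound (26), matching the order of the inequalities in the proof for $\alpha \leq 1$.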
#### 7\.1\.2Transfer Bound with Domain Divergence
###### Theorem 7\.2\(Transfer with Domain Shift\)
If GRASP achieves alignment $\alpha$ and domains have $\mathcal{H}$-divergence $d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t)$:
$$\epsilon_t \leq \epsilon_s + \frac{1}{2} d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda + (1-\alpha)C \tag{31}$$
This extends classical transfer learning bounds by adding the $(1-\alpha)C$ term capturing parameter selection quality. Perfect alignment ($\alpha=1$) recovers the optimal bound.
### 7\.2PEARL Theoretical Extensions
#### 7\.2\.1Low\-Rank Adaptation Sufficiency
###### Theorem 7\.3\(Low\-Rank Approximation\)
If optimal update $\Delta\theta$ has rank $r$ with decaying singular values, then adapters $\phi(\theta) = W_{\text{up}}\sigma(W_{\text{down}}\theta)$ achieve error:
$$\|\Delta\theta - \phi(\theta)\|_F \leq \epsilon = \sqrt{\sum_{i=r+1}^{\min(d_1,d_2)} \sigma_i^2} \tag{32}$$
For smooth losses with $\sigma_i \leq C\rho^i$:
$$\epsilon = O(\rho^r) \quad \text{(exponential decay)} \tag{33}$$
###### Proof
By Eckart-Young-Mirsky theorem \[[3](https://arxiv.org/html/2606.14900#bib.bib24)\], the best rank-$r$ approximation is:
$$\|\Delta\theta - \Delta\theta^{(r)}\|_F = \sqrt{\sum_{i=r+1}^{\min(d_1,d_2)} \sigma_i^2} \tag{34}$$
Adapters can represent any rank-$r$ matrix via $W_{\text{up}}W_{\text{down}}$. For exponentially decaying singular values:
$$\epsilon^2 = \sum_{i=r+1}^{\infty} C^2\rho^{2i} = \frac{C^2\rho^{2(r+1)}}{1-\rho^2} = O(\rho^{2r}) \tag{35}$$
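The geometric-series step in (35) can be checked numerically. The sketch below compares the closed form against a long partial sum and confirms that raising the rank by one shrinks $\epsilon$ by a factor of $\rho$. The values $C = 1$ and $\rho = 0.5$ are arbitrary illustrative choices with $0 < \rho < 1$, and the function names are hypothetical.

```python
import math


def tail_closed_form(c: float, rho: float, r: int) -> float:
    """Closed form of sum_{i=r+1}^inf c^2 * rho^(2i), valid for 0 < rho < 1."""
    return c ** 2 * rho ** (2 * (r + 1)) / (1.0 - rho ** 2)


def tail_partial_sum(c: float, rho: float, r: int, terms: int = 200) -> float:
    """Direct summation of the first `terms` terms of the same tail."""
    return sum(c ** 2 * rho ** (2 * i) for i in range(r + 1, r + 1 + terms))


C, RHO = 1.0, 0.5  # assumed constants for the check
for r in (1, 4, 8):
    closed = tail_closed_form(C, RHO, r)
    direct = tail_partial_sum(C, RHO, r)
    print(r, closed, direct)

# Ratio of successive errors: epsilon(r + 1) / epsilon(r) equals rho.
eps_ratio = math.sqrt(tail_closed_form(C, RHO, 5) / tail_closed_form(C, RHO, 4))
print(eps_ratio)
```

The constant ratio between successive ranks is what the $O(\rho^r)$ statement in (33) expresses.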
## 8Additional Experimental Results
### 8\.1Baseline Performance
Tables[8](https://arxiv.org/html/2606.14900#S8.T8),[9](https://arxiv.org/html/2606.14900#S8.T9), and[10](https://arxiv.org/html/2606.14900#S8.T10)present single\-source baseline accuracies for the four architectures evaluated in the main paper\. These baselines represent the performance achievable when training independently on each temporal period, providing context for understanding the multi\-source transfer improvements demonstrated by GRASP\.
Table 8: CLEAR-10 baseline accuracies (%) for single-source training
Table 9: CLEAR-100 baseline accuracies (%) for single-source training
Table 10: Yearbook baseline accuracies (%) for single-source training
Across all three datasets, baseline performance degrades with smaller architectures (MobileViT-XXS $<$ MobileViT-XS $<$ EfficientNet-B1 $\approx$ ResNet-50), following expected capacity trends. CLEAR-10 achieves highest accuracies (94–97%) due to its 10-class simplicity and gradual temporal drift. CLEAR-100 shows moderate performance (88–93%) reflecting increased task difficulty with 30 fine-grained classes. Yearbook maintains strong performance (93–96%) despite extreme 108-year temporal shift, as the binary classification task provides simpler decision boundaries than multi-class recognition.
### 8\.2Training Time Analysis
Figure [3](https://arxiv.org/html/2606.14900#S8.F3) presents training time comparisons across all four architectures. GRASP achieves consistent 1.4–1.5$\times$ speedup over Multi-Source across model sizes from 1.3M to 25.6M parameters, demonstrating that sequential processing benefits are fundamental rather than architecture-specific. The advantage arises from avoiding Multi-Source's requirement to load all $K$ sources simultaneously during fusion, instead processing sources one at a time with constant $O(1)$ memory.
Figure 3: Training time comparison across methods and architectures. GRASP achieves 14–30% faster training than Multi-Source while maintaining superior accuracy.
## 9Implementation Details
### 9\.1Hardware and Software
All experiments used NVIDIA RTX 5080 \(16GB\), AMD Ryzen 9 7950X \(16 cores\), 64GB DDR5 RAM, and 2TB NVMe SSD\. Software: PyTorch 2\.7\.0, CUDA 12\.1, Python 3\.11\.5, with torchvision 0\.18\.0, numpy 1\.26\.2, and scikit\-learn 1\.3\.2\.
### 9\.2Hyperparameters
GRASP: Gradient alignment threshold $\tau = 0.3$ (tested: 0.0, 0.3, 0.6, 0.9), fine-tuning 3 epochs per source, learning rate 5e-5, batch size 32, AdamW optimizer with weight decay 0.01.
Multi-Source: Uniform parameter averaging, 3 fine-tuning epochs, learning rate 5e-5, batch size 32.
PEARL: Adapter rank 16, all transformer layers, Bayesian temperature $\gamma = 2.0$, learning rate 1e-3, 3 training epochs per source.
Ensemble: Uniform prediction averaging; this involves only inference with the pretrained models, using a batch size of 32.
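For reproduction purposes, the settings above can be collected into a small configuration object. The sketch below is one possible layout; the class name `GraspConfig`, the field names, and the dictionary keys are hypothetical, while the values are those listed in this subsection.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GraspConfig:
    """GRASP hyperparameters as listed above (field names are illustrative)."""
    tau: float = 0.3                      # gradient alignment threshold
    finetune_epochs_per_source: int = 3
    learning_rate: float = 5e-5
    batch_size: int = 32
    optimizer: str = "AdamW"
    weight_decay: float = 0.01


# Threshold values reported as tested.
THRESHOLDS_TESTED = (0.0, 0.3, 0.6, 0.9)

# Baseline settings as listed above (keys are illustrative).
BASELINES = {
    "Multi-Source": {"fusion": "uniform parameter averaging", "finetune_epochs": 3,
                     "learning_rate": 5e-5, "batch_size": 32},
    "PEARL": {"adapter_rank": 16, "bayesian_temperature": 2.0,
              "learning_rate": 1e-3, "epochs_per_source": 3},
    "Ensemble": {"fusion": "uniform prediction averaging", "batch_size": 32},
}

config = GraspConfig()
print(asdict(config))
```

Freezing the dataclass keeps a run's settings immutable once constructed, which makes it harder to change a threshold by accident between sources.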
### 9\.3Data Preprocessing
All images resized to 224$\times$224 pixels with ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). Training augmentation: random horizontal flip ($p=0.5$), random crop with padding. Validation/test: center crop only, no augmentation.
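The normalization step amounts to a per-channel shift and scale. The sketch below applies it to a single RGB pixel, assuming channel values have already been scaled to the range [0, 1]; the function name `normalize_pixel` is illustrative, and a real pipeline would apply the same arithmetic to whole image tensors.

```python
from typing import Tuple

# Per-channel statistics quoted above (R, G, B order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Tuple[float, float, float]) -> Tuple[float, ...]:
    """Apply (value - mean) / std to each channel of one pixel in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


mean_pixel = normalize_pixel(IMAGENET_MEAN)     # a pixel equal to the mean maps to zeros
white_pixel = normalize_pixel((1.0, 1.0, 1.0))  # brightest pixel maps to positive values
print(mean_pixel, white_pixel)
```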
Yearbook\-specific:Grayscale conversion for pre\-1950s images \(simulating historical photography\), contrast normalization across all periods to account for varying photo quality from early film to modern digital\.
CLEAR-10/100: Temporal bins created by grouping consecutive years (5 bins of 2 years each for 2015–2025 period).