Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift

arXiv cs.AI Papers

Summary

This paper proposes a locality-aware private class identification approach and a reliable optimal transport-based method (ReOT) to address domain adaptation challenges under extreme label shift, particularly distinguishing shared from private classes.

arXiv:2605.05567v1 Announce Type: new Abstract: Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist only in one domain but not the other. These non-overlapping classes are referred to as private classes. Identifying private class samples and mitigating their adverse effects is critical in the literature. Existing methods rely on the assumption that shifts in private classes are large enough to be considered outliers. However, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class, challenging this assumption. Consequently, private classes substantially increase the difficulty of cross-domain classification. To address these issues, based on local transportation and metric properties of optimal transport (OT), a locality-aware private class identification approach is proposed in the form of a score function on transport mass. The effectiveness of the proposed approach is theoretically proven, highlighting the score function's strong ability to distinguish between shared and private class samples. Building on this, we introduce a reliable OT-based method (ReOT) for domain adaptation under severe label shift. ReOT minimizes classification risk while learning the separated cluster structure between the identified shared classes and private classes, effectively avoiding mismatch between shared-private sample pairs, thus ensuring that important knowledge is reliably transported intra-class to mitigate class-conditional discrepancy. Furthermore, a generalization upper bound of the target risk is provided for extreme label shift scenarios, which can be minimized by ReOT. Extensive experiments on benchmarks validate the effectiveness of ReOT.

Cached at: 05/08/26, 08:26 AM

# Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift
Source: [https://arxiv.org/html/2605.05567](https://arxiv.org/html/2605.05567)
Chuan-Xian Ren, Cheng-Jun Guo, and Hong Yan

C.X. Ren and C.J. Guo are with the School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China. Hong Yan is with the Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China. C.X. Ren is the corresponding author (email: rchuanx@mail.sysu.edu.cn). This work is supported in part by National Key R&D Program of China (2024YFA1011900), in part by National Natural Science Foundation of China (Grant No. 62376291), in part by Guangdong Basic and Applied Basic Research Foundation (2023B1515020004), in part by Science and Technology Program of Guangzhou (2024A04J6413), in part by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (24xkjc013), and in part by the Hong Kong Innovation and Technology Commission (ITC) (InnoHK Project CIMDA) and the Institute of Digital Medicine of City University of Hong Kong (Project 9229503).

###### Abstract

Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions\. In real\-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist only in one domain but not the other\. These non\-overlapping classes are referred to as private classes\. Identifying private class samples and mitigating their adverse effects is critical in the literature\. Existing methods rely on the assumption that shifts in private classes are large enough to be considered outliers\. However, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class, challenging this assumption\. Consequently, private classes substantially increase the difficulty of cross\-domain classification\. To address these issues, based on local transportation and metric properties of optimal transport \(OT\), a locality\-aware private class identification approach is proposed in the form of a score function on transport mass\. The effectiveness of the proposed approach is theoretically proven, highlighting the score function’s strong ability to distinguish between shared and private class samples\. Building on this, we introduce a reliable OT\-based method \(ReOT\) for domain adaptation under severe label shift\. ReOT minimizes classification risk while learning the separated cluster structure between the identified shared classes and private classes, effectively avoiding mismatch between shared\-private sample pairs, thus ensuring that important knowledge is reliably transported intra\-class to mitigate class\-conditional discrepancy\. Furthermore, a generalization upper bound of the target risk is provided for extreme label shift scenarios, which can be minimized by ReOT\. Extensive experiments on benchmarks validate the effectiveness of ReOT\.

###### Index Terms:

Open set domain adaptation, partial domain adaptation, optimal transport, generalization error analysis, private class identification\.

## 1 Introduction

It is well-known that a model with strong generalization performance requires large-scale datasets with sufficient label annotations. However, data collected from the real world are often unlabeled, and labeling large-scale datasets typically involves excessive costs. In particular, the collected datasets may have different data distributions, resulting in a severe degradation of model performance. To solve these issues, unsupervised domain adaptation (UDA) \[[1](https://arxiv.org/html/2605.05567#bib.bib1),[2](https://arxiv.org/html/2605.05567#bib.bib2),[3](https://arxiv.org/html/2605.05567#bib.bib3)\] has received wide attention as a way to transfer knowledge from a labeled domain (i.e., source domain) to an unlabeled domain with a different data distribution (i.e., target domain), notably saving the time and labor needed to collect labels for the target domain. However, most existing UDA works \[[4](https://arxiv.org/html/2605.05567#bib.bib4),[5](https://arxiv.org/html/2605.05567#bib.bib5),[6](https://arxiv.org/html/2605.05567#bib.bib6)\] follow a closed set assumption, i.e., the two domains share an identical label space. This severely limits their application to real-world scenarios since the label spaces of source and target domains are usually heterogeneous or extremely shifted. Specifically, the heterogeneous scenarios can be roughly divided into two categories: 1) the source label space is a subset of the target label space; 2) the source label space includes the target label space. To address these two realistic challenges, open set domain adaptation (OSDA) \[[7](https://arxiv.org/html/2605.05567#bib.bib7),[8](https://arxiv.org/html/2605.05567#bib.bib8),[9](https://arxiv.org/html/2605.05567#bib.bib9)\] and partial domain adaptation (PDA) \[[10](https://arxiv.org/html/2605.05567#bib.bib10),[11](https://arxiv.org/html/2605.05567#bib.bib11),[12](https://arxiv.org/html/2605.05567#bib.bib12)\] are proposed, respectively. These two scenarios can be viewed as cases of extreme label shift DA.

Generally, OSDA assumes the private classes exist in the target domain, and it requires the model to recognize them. PDA is the mirrored scenario, in which the private classes exist in the source domain. Therefore, both OSDA and PDA face two primary challenges: 1) Domain gap on shared classes: The data distribution for shared classes differs between the source and target domains, reducing classification accuracy on these classes. 2) Category gap: The absence of private classes in one domain (either source or target) leads to a significant drop in model performance in the target domain, even if the model performs well in the source domain. Of these two challenges, the category gap is the core issue. It is not only inherently challenging to address but also complicates the resolution of domain discrepancies using well-studied domain alignment techniques, as the absence of private classes can lead to misalignment between private and shared classes. Therefore, it is crucial to mitigate the adverse effects of the category gap.

Most current works\[[13](https://arxiv.org/html/2605.05567#bib.bib13),[14](https://arxiv.org/html/2605.05567#bib.bib14),[15](https://arxiv.org/html/2605.05567#bib.bib15)\]addressed these challenges by the following two steps: 1\) Identifying and then separating the private class samples to eliminate label shift; 2\) Implementing domain alignment approach to alleviate domain discrepancy\. Following this strategy, these works can recognize some target/source private samples and achieve a degree of domain alignment, thereby obtaining good model performance\. While achieving great success, it is worth noting that mainstream methods always focus on only one scenario, ignoring the connection between OSDA and PDA\. In fact, OSDA and PDA are symmetric in some sense, as the biggest difference is in which domain the private classes exist\. Although OSDA may be more challenging since the private class data are unlabeled while in PDA the private class data are labeled, we would emphasize that a proper method should be able to address both scenarios\.

Meanwhile, it is important to recognize that even when considering only one scenario, previous methods have significant limitations, as they often rely on unrealistic strong assumptions to identify private class samples\. Specifically, private class identification in previous works mainly includes importance weighting\-based approaches\[[10](https://arxiv.org/html/2605.05567#bib.bib10),[16](https://arxiv.org/html/2605.05567#bib.bib16)\]and threshold\-based approaches\[[8](https://arxiv.org/html/2605.05567#bib.bib8),[17](https://arxiv.org/html/2605.05567#bib.bib17),[15](https://arxiv.org/html/2605.05567#bib.bib15)\]\. Importance weighting\-based approaches are widely used in PDA\. However, they involve assigning very small weights for the private class samples, while learning high\-quality weights generally relies on the strong assumption that domain discrepancy is negligible\. Threshold\-based methods use a metric to determine the likelihood that a sample belongs to a private class\. If the metric value exceeds a certain threshold, the sample is classified as private\. To ensure effectiveness, these approaches often assume that the gap within a shared class is consistently smaller than the gap between private and shared classes\. While this assumption may be less strict than negligible domain discrepancy, it remains difficult to satisfy in practice\. As presented in Figure[1](https://arxiv.org/html/2605.05567#S1.F1), in real\-world scenarios, the gap within a single shared class can significantly exceed the gap between a private class and another shared class\. This contradicts the foundational assumptions of many existing methods\. Therefore, it is necessary to explore a reliable method, which ensures effectiveness under realistic conditions, for private class identification\.

![Refer to caption](https://arxiv.org/html/2605.05567v1/x1.png) (a) Open Set Domain Adaptation (OSDA)
![Refer to caption](https://arxiv.org/html/2605.05567v1/x2.png) (b) Partial Domain Adaptation (PDA)

Figure 1: Two typical extreme label shifts in unsupervised domain adaptation, i.e., (a) OSDA; (b) PDA. To achieve successful private class identification, existing methods often assume that the semantic shift of the private class is consistently larger than the domain shift of shared classes. However, the variance for a specific shared class can be larger than the gap between a private class and another shared class. Better view in color.

In the following, we explore the reliability of the private class identification approach for extreme label shift scenarios\. We observe that although the shift of shared classes is not always smaller than that of private classes across the entire data manifold, it generally holds true in local regions\. Specifically, in a local neighborhood of a shared class cluster, the semantic shift of private class samples is always larger\. This observation can be formulated as a local spatial structure\. Motivated by this insight, we develop theories to characterize local spatial structures using masked OT\[[18](https://arxiv.org/html/2605.05567#bib.bib18)\], which offers significant advantages in local correlation characterization and geometric interpretability\. Then, we propose a novel locality\-aware private class identification method, and theoretically ensure the effectiveness of this method in distinguishing between shared and private class samples\. Leveraging this approach, we derive a reliable OT\-based model \(ReOT\) for reliable transfer in practical applications\. Furthermore, we provide a generalization upper bound for the target risk in DA scenarios with heterogeneous label spaces, and show that ReOT significantly reduces this bound, indicating its efficiency in addressing extreme label shift DA\. Our contributions can be summarized as follows\.

- A locality-aware private class identification method is proposed using a transport mass-based score function. It effectively captures the complex manifold structure of data through the local transport property and distance-related inequality of masked OT (MOT). These attributes of MOT enable our method to mitigate cross-domain class-conditional discrepancy, so it is employed for practical knowledge transfer.
- The effectiveness of the proposed identification method is theoretically proved, highlighting the score function's strong ability to distinguish between shared and private class samples. This further implies the reliability of the proposed private class identification approach, i.e., the ability to effectively identify private class samples under realistic conditions.
- A theory-guided ReOT method is proposed utilizing locality-aware identification. ReOT minimizes classification risk while learning a separated cluster structure with MOT, avoiding inter-class mismatch and ensuring reliable intra-class transfer. Further, we give a generalization upper bound for extreme label shift scenarios, showing that ReOT effectively minimizes this bound.
- Extensive experiments are conducted on several benchmark image datasets, in both OSDA and PDA settings, to validate the empirical effectiveness of ReOT. The results show that ReOT generally outperforms the state-of-the-art methods, demonstrating the stable and superior performance of the learned model.

The rest of this paper is organized as follows. In Section [2](https://arxiv.org/html/2605.05567#S2), we provide a brief review of open-set domain adaptation, partial domain adaptation, and the OT-based domain adaptation methods. In Section [3](https://arxiv.org/html/2605.05567#S3), we present the locality-aware private class identification approach and the ReOT method, following which a generalization error analysis is also provided to show the efficiency of ReOT in addressing extreme label shift DA. Extensive experiments and analysis under standard OSDA/PDA settings are presented in Section [4](https://arxiv.org/html/2605.05567#S4). Finally, Section [5](https://arxiv.org/html/2605.05567#S5) concludes this paper.

## 2 Related Work

In this section, several related works on OSDA, PDA and OT\-based DA methods are briefly reviewed\.

### 2.1 Open Set Domain Adaptation

OSDA addresses a realistic challenge where the target domain may include private classes\. It is difficult for a source pre\-trained model to recognize target samples correctly due to the lack of private classes in the source domain\.

Saito et al. \[[7](https://arxiv.org/html/2605.05567#bib.bib7)\] introduce the OSDA setting for practical application, where private class samples appear only in the target domain. Recent works generally follow a pipeline that first identifies private class samples and then aligns cross-domain shared distributions. Liu et al. \[[13](https://arxiv.org/html/2605.05567#bib.bib13)\] propose separate to align (STA), which trains a binary classifier to separate shared and private samples, and implements domain alignment by adversarial learning. Fang et al. \[[19](https://arxiv.org/html/2605.05567#bib.bib19)\] provide a theoretical bound for open set domain adaptation, and propose a theory-guided method called distribution alignment with open difference (DAOD). Bucci et al. \[[8](https://arxiv.org/html/2605.05567#bib.bib8)\] focus on the effectiveness of image rotation for open set domain adaptation, and propose rotation-based open set (ROS), which uses rotation invariance as a metric to identify the private class samples. Jang et al. \[[20](https://arxiv.org/html/2605.05567#bib.bib20)\] propose unknown-aware domain adversarial learning (UADAL) to align the source and the target-shared distribution while segregating the target-private distribution. Both progressive graph learning (PGL) \[[21](https://arxiv.org/html/2605.05567#bib.bib21)\] and manifold regularized joint transfer (MRJT) \[[17](https://arxiv.org/html/2605.05567#bib.bib17)\] essentially set empirical thresholds to assign pseudo-labels to confident samples from the target domain. Huang et al. \[[22](https://arxiv.org/html/2605.05567#bib.bib22)\] propose a correlation metric-based graph framework which employs the Hilbert-Schmidt independence criterion to characterize the separation between unknown and known classes. Adjustment and alignment (ANNA) \[[9](https://arxiv.org/html/2605.05567#bib.bib9)\] aims to improve the two-step pipeline using causality-based debiasing. However, its applicability is limited to OSDA scenarios and does not extend to PDA. By contrast, we view OSDA and PDA as cases of extreme label shift and address both scenarios.

### 2.2 Partial Domain Adaptation

PDA assumes that the label space of the target domain is a subset of that of the source domain. A PDA method therefore needs to identify the private classes in the source domain and then mitigate the effect of negative transfer.

Pioneering PDA works are devoted to reweighting samples based on adversarial networks. Cao et al. propose partial adversarial domain adaptation (PADA) \[[10](https://arxiv.org/html/2605.05567#bib.bib10)\], which adds class-level weights to the source domain classifier. Technically, it weights both the classification loss and the adversarial loss on the source domain samples. Zhang et al. \[[23](https://arxiv.org/html/2605.05567#bib.bib23)\] propose the importance weighted adversarial nets (IWAN), which have separate feature extractors for each domain, and then view the domain discrimination as a binary classification problem. Subsequently, Cao et al. \[[24](https://arxiv.org/html/2605.05567#bib.bib24)\] propose the example transfer network (ETN), which introduces an additional classifier and a progressive weighting scheme based on the transferability of source domain samples. Chen et al. \[[25](https://arxiv.org/html/2605.05567#bib.bib25)\] introduce domain adversarial learning to learn the shared feature subspace of selected source domain instances and target domain instances. Sahoo et al. \[[14](https://arxiv.org/html/2605.05567#bib.bib14)\] introduce a selector network and make instance-level binary predictions to select the source shared class samples for subsequent adversarial domain alignment.

Unlike methods that rely on adversarial learning to solve partial domain adaptation problems, Li et al. \[[26](https://arxiv.org/html/2605.05567#bib.bib26)\] propose the deep residual correction network (DRCN) to address optimization difficulties and cross-domain differences using MMD. Thopalli et al. \[[27](https://arxiv.org/html/2605.05567#bib.bib27)\] propose deep orthogonal Procrustes alignment (DeepOPA), which combines deep features with orthogonal Procrustes alignment to achieve effective subspace alignment for domain adaptation. Note that the accuracy of target domain prediction plays an important role in estimating the class-wise weights in the source domain. Liang et al. \[[28](https://arxiv.org/html/2605.05567#bib.bib28)\] use samples in the source domain to perform data augmentation for the target domain. Yang et al. \[[29](https://arxiv.org/html/2605.05567#bib.bib29)\] propose a new reweighting scheme by incorporating the weights generated by the target domain information. Tian et al. \[[15](https://arxiv.org/html/2605.05567#bib.bib15)\] follow the alignment and separation assumptions to train an SVM for pseudo-labeling the target samples for private class identification. As mentioned above, these methods often rely on unrealistic assumptions for successful private class identification, limiting their applicability in real-world scenarios.

### 2.3 Optimal Transport for Domain Adaptation

Optimal transport (OT) has achieved great success in domain alignment to learn invariant representations \[[30](https://arxiv.org/html/2605.05567#bib.bib30)\]. Courty et al. \[[31](https://arxiv.org/html/2605.05567#bib.bib31)\] introduce OT for domain adaptation to align the source and target marginal distributions. Li et al. \[[32](https://arxiv.org/html/2605.05567#bib.bib32)\] further incorporate the label information into the cost matrix to gain more precise transport. Since marginal alignment may fail to learn an invariant representation space when domain discrepancy is large, joint distribution optimal transport (JDOT) \[[33](https://arxiv.org/html/2605.05567#bib.bib33)\] is proposed for joint distribution alignment. To improve JDOT and estimate an unbiased transport plan when using mini-batches, OT variants with relaxed marginal constraints \[[34](https://arxiv.org/html/2605.05567#bib.bib34),[35](https://arxiv.org/html/2605.05567#bib.bib35)\] are proposed. Another efficient way to learn invariant representations is conditional distribution alignment. To achieve conditional alignment, Ren et al. \[[36](https://arxiv.org/html/2605.05567#bib.bib36)\] extend optimal transport to the reproducing kernel Hilbert space. Wang et al. \[[37](https://arxiv.org/html/2605.05567#bib.bib37)\] propose a probability-polarized optimal transport for conditional distribution alignment.

Recently, OT with mask \[[38](https://arxiv.org/html/2605.05567#bib.bib38)\] is proposed for knowledge transfer to achieve intra-class alignment, implicitly aligning the conditional distribution. Gu et al. \[[39](https://arxiv.org/html/2605.05567#bib.bib39)\] further incorporate the mask mechanism into Gromov-Wasserstein. Note that the strict marginal constraints severely limit the application of these methods in more realistic scenarios such as PDA. In light of this, masked optimal transport (MOT) \[[18](https://arxiv.org/html/2605.05567#bib.bib18)\] has been proposed by relaxing the strict marginal constraints. However, in more challenging OSDA scenarios, the theoretical understanding of masked OT has not yet been developed. Thus, it remains necessary to explore the theoretical foundation of masked OT for successful OSDA.

## 3 Methodology

Under the settings of extreme label shift in domain adaptation, there is a labeled source domain $\mathcal{D}_{s}=\{\bm{x}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$ and an unlabeled target domain $\mathcal{D}_{t}=\{\bm{x}_{j}^{t}\}_{j=1}^{n_{t}}$ drawn from different distributions $P\neq Q$. Notably, for any $\bm{x}_{j}^{t}$, its true label $y_{j}^{t}$ is also deterministic though unavailable. Moreover, let $\mathcal{Y}_{s}$ and $\mathcal{Y}_{t}$ represent their label spaces, respectively. Then $\mathcal{Y}_{s}$ and $\mathcal{Y}_{t}$ are heterogeneous, e.g.,

- In OSDA, $\mathcal{D}_{t}$ contains private class data, i.e., $\mathcal{Y}_{s}\subsetneq\mathcal{Y}_{t}$.
- In PDA, $\mathcal{D}_{s}$ contains private class data, i.e., $\mathcal{Y}_{s}\supsetneq\mathcal{Y}_{t}$.

Our final goal is to learn a classification model that classifies target samples as correctly as possible, where target private classes not in $\mathcal{Y}_{s}\cap\mathcal{Y}_{t}$ are uniformly treated as one *private* class. A model $h\circ g$ consists of a feature transformation $g$ and a classifier $h$. The transformation $g$ outputs a feature vector $\bm{z}=g(\bm{x})$ as the representation of $\bm{x}$. The classifier $h$ receives the representation as input and outputs the predicted label $\hat{y}=h\circ g(\bm{x})$.
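As an illustration of the two label-space settings and of the single private class used for evaluation, the definitions above can be sketched in a few lines of Python. The function names and the `private_label` argument are hypothetical and not part of the paper; the sketch also assumes ground-truth target labels are at hand, which in UDA is only true when scoring a benchmark.

```python
def label_shift_setting(source_labels, target_labels):
    """Classify the relation between the two label spaces."""
    ys, yt = set(source_labels), set(target_labels)
    if ys == yt:
        return "closed-set"
    if ys < yt:   # strict subset: the target has private classes
        return "OSDA"
    if ys > yt:   # strict superset: the source has private classes
        return "PDA"
    return "other"


def collapse_private(target_labels, source_labels, private_label):
    """Map every target class outside the shared label set to one private label."""
    shared = set(source_labels) & set(target_labels)
    return [y if y in shared else private_label for y in target_labels]
```

For example, `collapse_private([0, 3, 1, 4], [0, 1, 2], private_label=99)` keeps labels 0 and 1 and maps 3 and 4 to the single private label 99.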

### 3.1 Preliminary

Optimal transport focuses on finding an optimal solution for moving mass from one distribution to another under constraints\. We will first provide a brief introduction to OT\.

We denote by $\mathcal{S}_{n}$ the set of histograms with $n$ bins: $\mathcal{S}_{n}=\left\{S\in\mathbb{R}^{n}_{+},\ S^{\top}\bm{1}_{n}=1\right\}$, where $\bm{1}_{n}$ denotes the all-ones vector in $\mathbb{R}^{n}$. Let $P_{X}\in\mathcal{S}_{n}$ and $P_{X'}\in\mathcal{S}_{m}$ be two empirical distributions over two sample sets $\{\bm{x}_{i}\}_{i=1}^{n}$ and $\{\bm{x}_{j}'\}_{j=1}^{m}$, i.e., $P_{X}=\frac{1}{n}\sum_{i}\delta_{\bm{x}_{i}}$ and $P_{X'}=\frac{1}{m}\sum_{j}\delta_{\bm{x}_{j}'}$. Then the set of all possible couplings between $P_{X}$ and $P_{X'}$ is given by

$$\Pi(P_{X},P_{X'})=\left\{\bm{\Gamma}\in\mathbb{R}_{+}^{n\times m}\mid\bm{\Gamma}\bm{1}_{m}=P_{X},\ \bm{\Gamma}^{\top}\bm{1}_{n}=P_{X'}\right\},$$

where $\bm{\Gamma}$ is a transport plan whose entry $\Gamma_{ij}$ describes the amount of mass transported from $\bm{x}_{i}$ to $\bm{x}_{j}'$. Given a cost matrix $\bm{C}\in\mathbb{R}^{n\times m}$ that measures the distance between $\{\bm{x}_{i}\}_{i=1}^{n}$ and $\{\bm{x}_{j}'\}_{j=1}^{m}$, and the entropy regularization $H(\bm{\Gamma})=\sum_{i,j}\Gamma_{ij}\ln\Gamma_{ij}$, the classical entropy regularized OT between $P_{X}$ and $P_{X'}$ is written as \[[40](https://arxiv.org/html/2605.05567#bib.bib40),[31](https://arxiv.org/html/2605.05567#bib.bib31)\]

$$\mathrm{OT}^{\lambda}(P_{X},P_{X'},\bm{C})=\operatorname*{arg\,min}_{\bm{\Gamma}\in\Pi(P_{X},P_{X'})}\ \langle\bm{\Gamma},\bm{C}\rangle_{F}+\lambda H(\bm{\Gamma}),$$

where $\lambda>0$ is the parameter of the sparsity penalty, and $\langle\bm{\Gamma},\bm{C}\rangle_{F}=\sum_{ij}\Gamma_{ij}C_{ij}$ is the Frobenius inner product.
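The entropy regularized problem above is commonly solved with Sinkhorn's matrix scaling iterations. The following is a minimal NumPy sketch under stated assumptions: the function name `sinkhorn`, the toy data, the cost normalization, and the values of `lam` and `n_iter` are illustrative choices, not taken from the paper.

```python
import numpy as np


def sinkhorn(a, b, C, lam=0.5, n_iter=500):
    """Entropic OT plan between histograms a (n,) and b (m,) for cost C (n, m)."""
    K = np.exp(-C / lam)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)               # rescale to match the row marginals
        v = b / (K.T @ u)             # rescale to match the column marginals
    return u[:, None] * K * v[None, :]


# Toy example: two small point clouds with uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
X_prime = rng.normal(size=(4, 2))
C = ((X[:, None, :] - X_prime[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
C = C / C.max()                       # normalize so exp(-C / lam) stays well scaled
a = np.full(5, 1 / 5)
b = np.full(4, 1 / 4)
G = sinkhorn(a, b, C)
```

After convergence, the rows of `G` sum to `a` and the columns sum to `b`, so `G` lies in the coupling set defined above. A smaller `lam` gives a sparser plan but needs more iterations and, in practice, a log-domain implementation for numerical stability.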

Such an optimization objective seeks a sample-level matching, which usually ignores and even degrades the discriminability of the transport plan. It is also limited in handling category shift challenges due to the hard constraints. By replacing the strict constraints with relaxed penalty terms and introducing the mask mechanism, Luo et al. \[[18](https://arxiv.org/html/2605.05567#bib.bib18)\] propose masked optimal transport to address the problems above. Suppose the labels $\{y_{i}\}_{i=1}^{n}$ and $\{y_{j}'\}_{j=1}^{m}$ are known (in unsupervised domain adaptation tasks the labels may not be completely known, so they are estimated by pseudo labels); then the mask matrix $\bm{M}\in\mathbb{R}^{n\times m}$ is defined as

$$M_{ij}:=\begin{cases}1, & \text{if } y_{i}=y_{j}',\\ \infty, & \text{if } y_{i}\neq y_{j}'.\end{cases}$$

Then the masked cost matrix is defined as $\bar{\bm{C}}=\bm{C}\odot\bm{M}$, and the masked OT is formulated as

$$\mathrm{MOT}^{\lambda}_{\beta_{1},\beta_{2}}(P_{X},P_{X'},\bar{\bm{C}})=\operatorname*{arg\,min}_{\bm{\Gamma}\in\mathbb{R}_{+}^{n\times m}}\ \langle\bm{\Gamma},\bar{\bm{C}}\rangle_{F}+\lambda H(\bm{\Gamma})+\beta_{1}D_{\rm KL}(\bm{\Gamma}\bm{1}_{m}\,\|\,P_{X})+\beta_{2}D_{\rm KL}(\bm{\Gamma}^{\top}\bm{1}_{n}\,\|\,P_{X'}),$$

where $D_{\rm KL}$ is the Kullback-Leibler divergence, and $\beta_{1}$ and $\beta_{2}$ are non-negative penalty parameters. The all-ones vectors are sized to match $\bm{\Gamma}\in\mathbb{R}_{+}^{n\times m}$, consistent with the marginal constraints in $\Pi(P_{X},P_{X'})$.
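To make the masked formulation concrete, the sketch below builds a masked cost from (pseudo) labels and solves the relaxed problem with the standard scaling iteration for entropic unbalanced OT. This is an assumption-laden illustration rather than the solver used by the authors: it assumes the generalized KL divergence and the common entropy convention (which can differ from $H$ by a constant factor in the kernel), and the names `masked_cost` and `masked_unbalanced_sinkhorn` as well as all parameter values are hypothetical.

```python
import numpy as np


def masked_cost(C, y, y_prime):
    """Keep C where labels agree and set the cost to +inf elsewhere."""
    same = np.asarray(y)[:, None] == np.asarray(y_prime)[None, :]
    # np.where avoids the 0 * inf = nan that a literal product C * M can produce
    return np.where(same, C, np.inf)


def masked_unbalanced_sinkhorn(a, b, C_bar, lam=0.5, beta1=1.0, beta2=1.0, n_iter=500):
    """Scaling iterations with KL-relaxed marginals (assumed solver, see lead-in)."""
    K = np.exp(-C_bar / lam)          # exactly zero on masked (cross-label) pairs
    p1 = beta1 / (beta1 + lam)        # exponent induced by the row KL penalty
    p2 = beta2 / (beta2 + lam)        # exponent induced by the column KL penalty
    u = np.ones_like(a)
    v = np.ones_like(b)
    tiny = 1e-300                     # guards rows or columns with no admissible match
    for _ in range(n_iter):
        u = (a / np.maximum(K @ v, tiny)) ** p1
        v = (b / np.maximum(K.T @ u, tiny)) ** p2
    return u[:, None] * K * v[None, :]


# Toy example with 1-D features and two classes.
x = np.array([0.0, 0.2, 2.0, 2.3])
x_prime = np.array([0.1, 1.9, 2.2, 0.3, 0.0])
y = np.array([0, 0, 1, 1])
y_prime = np.array([0, 1, 1, 0, 0])
C = (x[:, None] - x_prime[None, :]) ** 2
C_bar = masked_cost(C, y, y_prime)
a = np.full(4, 0.25)
b = np.full(5, 0.2)
G = masked_unbalanced_sinkhorn(a, b, C_bar)
```

In this sketch no mass is transported between samples with different labels, because the kernel vanishes on masked entries. The marginals of `G` are only softly tied to `a` and `b`, with larger `beta1` and `beta2` enforcing them more strictly.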

As mentioned above, on the whole data manifold, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class. However, in a local neighborhood of a shared class cluster, the distance between two shared class samples is always smaller than that between a shared class sample and a private class sample. We will formulate this key observation as the local spatial structure of data. Then MOT is used to leverage this local structure and establish a general method for identifying the private class.

### 3.2 Locality-aware Private Class Identification

The private class identification problems in extreme label shift can be uniformly described as follows. Given two sample sets $\bm{X}_{1}=\{\bm{x}_{i}\}_{i=1}^{n}$ and $\bm{X}_{2}=\{\bm{x}_{j}'\}_{j=1}^{m}$, they share $K$ classes while $\bm{X}_{2}$ contains extra private class samples (for simplicity, we do not distinguish which is the source domain and which is the target domain here). We aim to identify the private class samples in $\bm{X}_{2}$.

In our method, the magnitude relationship between shared and private shifts is formulated as the local spatial structure, which is further characterized by masked OT\. The identification is finally achieved through the magnitude relationship between transported mass\. As shown in Figure[2](https://arxiv.org/html/2605.05567#S3.F2), the method relaxes the strong assumptions on the whole data manifold structure, achieving more reliable identification than previous methods in practical applications\.

![Refer to caption](https://arxiv.org/html/2605.05567v1/x3.png)Figure 2:Illustration of most existing private class identification methods \(top\), and our OT\-based reliable private class identification method \(down\)\. Better view in color\.Specifically, we will formulate the local spatial structure as the magnitude relationship of the expected distances between instances with the same label\. Given labels\{y¯i\}i=1n\\\{\\bar\{y\}\_\{i\}\\\}\_\{i=1\}^\{n\}and\{y¯j′\}j=1m\\\{\\bar\{y\}\_\{j\}^\{\\prime\}\\\}\_\{j=1\}^\{m\}corresponding to𝑿1\\bm\{X\}\_\{1\}and𝑿2\\bm\{X\}\_\{2\}, respectively\. We divide the representationsg​\(𝑿2\)=\{g​\(𝒙j′\)\}j=1mg\(\\bm\{X\}\_\{2\}\)=\\\{g\(\\bm\{x\}\_\{j\}^\{\\prime\}\)\\\}\_\{j=1\}^\{m\}into small subsetsℱk=\{𝒛j′∈g​\(𝑿2\)∣y¯j′=k\}\\mathcal\{F\}\_\{k\}=\\\{\\bm\{z\}\_\{j\}^\{\\prime\}\\in g\(\\bm\{X\}\_\{2\}\)\\mid\\bar\{y\}\_\{j\}^\{\\prime\}=k\\\}\. Within each subsetℱk\\mathcal\{F\}\_\{k\}, the representations corresponding to the private and shared classes are denoted byℱkp​r​v=\{𝒛j′∈ℱk∣yj′=K\+1\}\\mathcal\{F\}\_\{k\}^\{prv\}=\\\{\\bm\{z\}\_\{j\}^\{\\prime\}\\in\\mathcal\{F\}\_\{k\}\\mid y\_\{j\}^\{\\prime\}=K\+1\\\}andℱk\\ℱkp​r​v\\mathcal\{F\}\_\{k\}\\backslash\\mathcal\{F\}\_\{k\}^\{prv\}, respectively\. Hereyj′y\_\{j\}^\{\\prime\}is the true label of𝒛j′\\bm\{z\}\_\{j\}^\{\\prime\}andK\+1K\+1represents thep​r​i​v​a​t​eprivate\. Essentially,ℱkp​r​v\\mathcal\{F\}\_\{k\}^\{prv\}denotes a set of private class representations with labely¯j′=k\\bar\{y\}\_\{j\}^\{\\prime\}=k, andℱk\\ℱkp​r​v\\mathcal\{F\}\_\{k\}\\backslash\\mathcal\{F\}\_\{k\}^\{prv\}represents a set of shared class representations with labely¯j′=k\\bar\{y\}\_\{j\}^\{\\prime\}=k\. In addition, letPZkp​r​vP\_\{Z^\{prv\}\_\{k\}\}andPZks​h​rP\_\{Z^\{shr\}\_\{k\}\}denote two empirical distributions overℱkp​r​v\\mathcal\{F\}^\{prv\}\_\{k\}andℱk\\ℱkp​r​v\\mathcal\{F\}\_\{k\}\\backslash\\mathcal\{F\}\_\{k\}^\{prv\}, respectively\. 
Then, given a non-negative function $f$ as the distance metric, the local spatial structure is reformulated as the magnitude relationship between the distances, i.e.,

$$\mathbb{E}_{P_{Z^{shr}_k}}[f(\bm{z},\bm{z}')]<\mathbb{E}_{P_{Z^{prv}_k}}[f(\bm{z},\bm{z}')], \tag{1}$$

where $k\in[\![K]\!]$, and $\bm{z}$ is any representation belonging to $\{\bm{z}_i\in g(\bm{X}_1)\mid\bar{y}_i=k\}$. In other words, among representations that share the same label, the expected distance from a representation in $g(\bm{X}_1)$ to a shared class representation is smaller than that to a private class representation.

#### 3\.2\.1OT\-based Local Spatial Structure Characterization

Masked OT will be built, and the formulated local structure in Eq\. \([1](https://arxiv.org/html/2605.05567#S3.E1)\) will be converted to the magnitude relationship of entries in the transport plan\.

Let𝑪Z∈ℝn×m\\bm\{C\}^\{Z\}\\in\\mathbb\{R\}^\{n\\times m\}denote the transport cost betweeng​\(𝑿1\)g\(\\bm\{X\}\_\{1\}\)andg​\(𝑿2\)g\(\\bm\{X\}\_\{2\}\), with entries given byCi​jZ=c​\(g​\(𝒙i\),g​\(𝒙j′\)\)C^\{Z\}\_\{ij\}=c\(g\(\\bm\{x\}\_\{i\}\),g\(\\bm\{x\}\_\{j\}^\{\\prime\}\)\)for a given cost functionc​\(⋅,⋅\)c\(\\cdot,\\cdot\)\(e\.g\., the squared Euclidean distance\)\. Let𝑴∈ℝn×m\\bm\{M\}\\in\\mathbb\{R\}^\{n\\times m\}be the mask matrix estimated with labels\{y¯i\}i=1n\\\{\\bar\{y\}\_\{i\}\\\}\_\{i=1\}^\{n\}and\{y¯j′\}j=1m\\\{\\bar\{y\}\_\{j\}^\{\\prime\}\\\}\_\{j=1\}^\{m\},𝑪¯Z=𝑪Z⊙𝑴\\bar\{\\bm\{C\}\}^\{Z\}=\\bm\{C\}^\{Z\}\\odot\\bm\{M\}be the masked cost matrix, and\(PZ,PZ′\)∈Sn×Sm\(P\_\{Z\},P\_\{Z^\{\\prime\}\}\)\\in S\_\{n\}\\times S\_\{m\}represent two empirical distributions overg​\(𝑿1\)g\(\\bm\{X\}\_\{1\}\)andg​\(𝑿2\)g\(\\bm\{X\}\_\{2\}\), respectively\. The optimal transport plan𝚪∗∈ℝn×m\\bm\{\\Gamma\}^\{\*\}\\in\\mathbb\{R\}^\{n\\times m\}is

$$\bm{\Gamma}^{*}=\mathrm{MOT}^{\lambda}_{\infty,\beta_2}(P_Z,P_{Z'},\bar{\bm{C}}^{Z}). \tag{2}$$

Here $\beta_1$ is set to $\infty$, which imposes a unilateral strict marginal constraint on $\bm{\Gamma}^{*}$, i.e., $\bm{\Gamma}^{*}\bm{1}_m=P_Z=\frac{1}{n}\sum_i\delta_{\bm{z}_i}$. The constraint ensures that $\sum_{j=1}^{m}\Gamma^{*}_{ij}=\frac{1}{n}$, i.e., each representation in $g(\bm{X}_1)$ transports an equal amount of mass $\frac{1}{n}$ and is therefore equally important in the transportation. Intuitively, all representations in $g(\bm{X}_1)$ originate from the shared classes, and private class identification relies only on the information from these shared class instances. Every representation in $g(\bm{X}_1)$ thus influences the identification, and each is given equal weight.
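As a rough illustration of the strict row constraint, the following sketch computes a masked, semi-relaxed entropic transport plan with a Sinkhorn-style scaling iteration. It is not the authors' solver: the exact form of $\mathrm{MOT}^{\lambda}_{\infty,\beta_2}$ is defined elsewhere in the paper, and here we assume that $\lambda$ is an entropic regularization weight, that $\beta_2$ weights a KL penalty on the column marginal, and that masking simply forbids transport between different labels. The function name, default values, and toy data are hypothetical.

```python
import numpy as np

def masked_semirelaxed_sinkhorn(C, labels_1, labels_2, lam=0.5, beta2=0.1, n_iter=200):
    """Sketch of a masked, semi-relaxed entropic OT solver (assumed formulation).

    Rows (X_1) keep the strict marginal 1/n; columns (X_2) are only softly
    pulled toward 1/m through a KL penalty weighted by beta2.
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Assumed masking rule: transport is allowed only between equal labels.
    mask = labels_1[:, None] == labels_2[None, :]
    K = np.where(mask, np.exp(-C / lam), 0.0)
    u = np.ones(n)
    power = beta2 / (beta2 + lam)   # exponent of the relaxed (column) update
    tiny = 1e-300                   # guards against division by zero
    for _ in range(n_iter):
        v = (b / np.maximum(K.T @ u, tiny)) ** power
        u = a / np.maximum(K @ v, tiny)   # strict update: rows sum to 1/n
    return u[:, None] * K * v[None, :]

# Toy data: 6 points in X_1 and 8 points in X_2, two labels each.
rng = np.random.default_rng(0)
Z1 = rng.normal(size=(6, 2))
Z2 = rng.normal(size=(8, 2))
labels_1 = np.array([0, 0, 0, 1, 1, 1])
labels_2 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
C = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost
gamma = masked_semirelaxed_sinkhorn(C, labels_1, labels_2)
```

Because the row update is applied last, each row of `gamma` sums to $\frac{1}{n}$, while the column sums are free to deviate from $\frac{1}{m}$; that deviation is what the identification score later relies on. Very large costs relative to `lam` can underflow the kernel, so normalized features are assumed.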

It can be proved that the optimal transport plan𝚪∗\\bm\{\\Gamma\}^\{\*\}has the following properties\.

###### Proposition 1\.

\(distance\-related inequality\)\. Given𝒛i∈g​\(𝑿1\)\\bm\{z\}\_\{i\}\\in g\(\\bm\{X\}\_\{1\}\)withy¯i=k\\bar\{y\}\_\{i\}=k, and𝒛j′\\bm\{z\}\_\{j\}^\{\\prime\},𝒛l′∈ℱk⊂g​\(𝑿2\)\\bm\{z\}\_\{l\}^\{\\prime\}\\in\\mathcal\{F\}\_\{k\}\\subset g\(\\bm\{X\}\_\{2\}\)\. If the cost functionc​\(⋅,⋅\)c\(\\cdot,\\cdot\)satisfiesc​\(𝒛i,𝒛j′\)<c​\(𝒛i,𝒛l′\)c\(\\bm\{z\}\_\{i\},\\bm\{z\}\_\{j\}^\{\\prime\}\)<c\(\\bm\{z\}\_\{i\},\\bm\{z\}\_\{l\}^\{\\prime\}\), then∃λ\>0\\exists\\,\\lambda\>0and sufficiently smallβ2\>0\\beta\_\{2\}\>0, s\.t\.,𝚪∗\\bm\{\\Gamma\}^\{\*\}satisfies

Γi​j∗\>Γi​l∗\.\\Gamma^\{\*\}\_\{ij\}\>\\Gamma^\{\*\}\_\{il\}\.

The proof is given in the supplementary material. Proposition [1](https://arxiv.org/html/2605.05567#Thmprop1) tells us that $\bm{\Gamma}^{*}$ tends to assign larger values to representation pairs with smaller distances. Intuitively, if the local spatial structure holds, then for $\bm{z}_j'\in\mathcal{F}_k\backslash\mathcal{F}_k^{prv}$ and $\bm{z}_l'\in\mathcal{F}_k^{prv}$, the amount of mass transported from those $\bm{z}_i$ with $\bar{y}_i=k$ to $\bm{z}_j'$ is expected to be larger than that to $\bm{z}_l'$, i.e.,

$$\sum_{\bar{y}_i=k}\Gamma^{*}_{ij}>\sum_{\bar{y}_i=k}\Gamma^{*}_{il}.$$

According to the local transport property of masked OT [[18](https://arxiv.org/html/2605.05567#bib.bib18)], $\Gamma^{*}_{ij}=0$ if $\bar{y}_i\neq\bar{y}_j'$, thus the above inequality is equivalent to

$$\sum_{i=1}^{n}\Gamma^{*}_{ij}>\sum_{i=1}^{n}\Gamma^{*}_{il}, \tag{3}$$

i.e., the local spatial structure can be described by the magnitude relationship between the total mass transported to shared and private class samples.

#### 3\.2\.2Private Class Identification through Transport Mass

Inspired by the above discussion, a private class identification method can be naturally induced utilizing the transport mass\.

Equation ([3](https://arxiv.org/html/2605.05567#S3.E3)) indicates that the mass transported to a private class representation tends to be small due to the local structure. Hence, by reversing the sign of the transport mass, an identification score $s(\bm{z}_j')$ measuring the likelihood that $\bm{z}_j'$ belongs to a private class can be defined as

$$s(\bm{z}_j'):=\frac{1}{m}-\sum_{i=1}^{n}\Gamma^{*}_{ij}. \tag{4}$$

![Refer to caption](https://arxiv.org/html/2605.05567v1/x4.png)

Figure 3: Illustration of Theorem [1](https://arxiv.org/html/2605.05567#Thmtheorem1) on guiding the identification of private samples. The local spatial structure is characterized by $\bm{\Gamma}^{*}$; the identification score function $s$ then acts on the columns of $\bm{\Gamma}^{*}$, and shared and private samples are identified based on the values of $s$.

The mean of the scores over $g(\bm{X}_2)$ equals zero, because the strict marginal constraint fixes the total transported mass at $\sum_{j=1}^{m}\sum_{i=1}^{n}\Gamma^{*}_{ij}=1$. So intuitively, if $\bm{z}_j'$ belongs to a private class, then $s(\bm{z}_j')>0$; otherwise $s(\bm{z}_j')<0$. Let $\mathcal{F}^{prv}=\bigcup_{k=1}^{K+1}\mathcal{F}_k^{prv}\subset g(\bm{X}_2)$ denote all private class representations in $g(\bm{X}_2)$, and $\mathcal{F}^{shr}=\bigcup_{k=1}^{K+1}(\mathcal{F}_k\backslash\mathcal{F}_k^{prv})\subset g(\bm{X}_2)$ denote all shared class representations in $g(\bm{X}_2)$. $P_{Z^{prv}}$ and $P_{Z^{shr}}$ represent the empirical distributions over $\mathcal{F}^{prv}$ and $\mathcal{F}^{shr}$, respectively. Then we have the following theoretical result.

###### Theorem 1\.

If the inequality \([1](https://arxiv.org/html/2605.05567#S3.E1)\) holds forf​\(𝒛,𝒛′\)=exp⁡\(c​\(𝒛,𝒛′\)\)f\(\\bm\{z\},\\bm\{z\}^\{\\prime\}\)=\\exp\(c\(\\bm\{z\},\\bm\{z\}^\{\\prime\}\)\)and∀k∈\[\[K\]\]\\forall\\,k\\in\[\\\!\[K\]\\\!\], then∃λ\>0\\exists\\,\\lambda\>0and sufficiently smallβ2\>0\\beta\_\{2\}\>0, s\.t\.,

𝔼PZs​h​r​\[s​\(𝒛′\)\]<0<𝔼PZp​r​v​\[s​\(𝒛′\)\]\.\\displaystyle\\mathbb\{E\}\_\{P\_\{Z^\{shr\}\}\}\[s\(\\bm\{z\}^\{\\prime\}\)\]<0<\\mathbb\{E\}\_\{P\_\{Z^\{prv\}\}\}\[s\(\\bm\{z\}^\{\\prime\}\)\]\.

The proof is given in the supplementary material. Theorem [1](https://arxiv.org/html/2605.05567#Thmtheorem1) highlights the score function's strong ability to distinguish between shared and private class samples. Specifically, it ensures that, under the local spatial structure assumption, the expected scores of shared and private class representations are strictly separated by 0. This suggests that zero can serve as a threshold for private class identification, i.e., we classify representations with $s>0$ as private, as presented in Figure [3](https://arxiv.org/html/2605.05567#S3.F3).

Although zero can serve as a threshold for identification, its use is contingent upon the assumptions of Theorem [1](https://arxiv.org/html/2605.05567#Thmtheorem1). To broaden the applicability of our method in practical scenarios, we relax these assumptions by introducing a margin, so that only samples with sufficiently large scores are considered private. To derive an appropriate margin, we reinterpret $\bm{\Gamma}^{*}$ as a joint distribution and consider the following probability matrix $\bm{\Gamma}'\in\mathbb{R}^{K\times m}$, i.e.,

$$\bm{\Gamma}':=\begin{bmatrix}\sum_{y_i=1}\Gamma^{*}_{i1}&\cdots&\sum_{y_i=1}\Gamma^{*}_{im}\\ \vdots&\ddots&\vdots\\ \sum_{y_i=K}\Gamma^{*}_{i1}&\cdots&\sum_{y_i=K}\Gamma^{*}_{im}\end{bmatrix}.$$

Let $P_Y$ be the empirical label distribution determined by $\bm{X}_1$, i.e., $P_Y=\frac{1}{n}\sum_i\delta_{y_i}$, where $y_i\in[\![K]\!]$ is the true label of $\bm{x}_i$. Then $\bm{\Gamma}'$ satisfies $\bm{\Gamma}'\bm{1}_m=P_Y$ and $\bm{\Gamma}'^{\top}\bm{1}_K\approx P_{Z'}$, so it can be viewed as an approximate probability coupling between $P_Y$ and $P_{Z'}$. Although $\Gamma'_{kj}$ is not an exact joint probability, it provides a meaningful characterization of the co-occurrence at $(Y=k,Z'=\bm{z}_j')$ in the sense of optimal transport. Larger values indicate stronger correspondence between $\bm{z}_j'$ and class $k$. Thus, $\Gamma'_{kj}$ can be intuitively understood as an analog of the probability that $\bm{z}_j'$ corresponds to class $k$. Since $[\![K]\!]$ consists of shared classes, we further interpret $\sum_{k=1}^{K}\Gamma'_{kj}=\sum_{i=1}^{n}\Gamma^{*}_{ij}$ as an analog of the probability that $\bm{z}_j'$ corresponds to the shared classes. Recall that $s(\bm{z}_j')$ measures the likelihood that $\bm{z}_j'$ belongs to the private class. Intuitively, for a private class sample, its score $s$ should exceed its probability of belonging to the shared classes. By the definition of $s$ in Eq. ([4](https://arxiv.org/html/2605.05567#S3.E4)), this leads to the inequality $s(\bm{z}_j')=\frac{1}{m}-\sum_{i=1}^{n}\Gamma^{*}_{ij}>\sum_{k=1}^{K}\Gamma'_{kj}=\sum_{i=1}^{n}\Gamma^{*}_{ij}$. Substituting $\sum_{i=1}^{n}\Gamma^{*}_{ij}=\frac{1}{m}-s(\bm{z}_j')$ into the right-hand side gives $2s(\bm{z}_j')>\frac{1}{m}$, which yields $s(\bm{z}_j')>\frac{1}{2m}$ for a private class sample $\bm{z}_j'$.

Consequently, we obtain the set of potential shared class samples $\mathcal{D}^{shr}\subset\bm{X}_2$ with $s<0$, and the set of potential private class samples $\mathcal{D}^{prv}\subset\bm{X}_2$ with $s>\frac{1}{2m}$, i.e.,

$$\begin{aligned}\mathcal{D}^{shr}&=\left\{\bm{x}_j'\in\bm{X}_2\mid s(g(\bm{x}_j'))<0\right\},\\ \mathcal{D}^{prv}&=\left\{\bm{x}_j'\in\bm{X}_2\mid s(g(\bm{x}_j'))>\tfrac{1}{2m}\right\}.\end{aligned} \tag{5}$$

Private class identification is finally achieved by obtaining these two sets.
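The score and the two thresholds can be sketched in a few lines of NumPy. The sketch assumes a transport plan is already available as an $n\times m$ array whose rows each sum to $\frac{1}{n}$; the function name and the toy plan are hypothetical and only serve to illustrate Eqs. (4) and (5).

```python
import numpy as np

def identify_private(gamma):
    """Split the columns of a transport plan into shared / private candidates."""
    m = gamma.shape[1]
    scores = 1.0 / m - gamma.sum(axis=0)                   # score of Eq. (4)
    shared_idx = np.flatnonzero(scores < 0)                # candidates for D^shr
    private_idx = np.flatnonzero(scores > 1.0 / (2 * m))   # candidates for D^prv
    return scores, shared_idx, private_idx

# Hypothetical plan with n = 2 rows (each summing to 1/2) and m = 4 columns.
gamma = np.array([[0.30, 0.20, 0.00, 0.00],
                  [0.00, 0.00, 0.45, 0.05]])
scores, shared_idx, private_idx = identify_private(gamma)
```

In this toy plan, columns 0 and 2 receive more than $\frac{1}{m}=0.25$ of mass and are flagged as shared, column 3 is flagged as private, and column 1 (score about 0.05, between 0 and $\frac{1}{2m}=0.125$) falls inside the margin and is assigned to neither set.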

We summarize the entire process of private class identification in extreme label shift scenarios as follows. For practical DA tasks with extreme label shift, prior knowledge is required to ascertain which set, either $\{\bm{x}_i^s\}_{i=1}^{n_s}$ or $\{\bm{x}_j^t\}_{j=1}^{n_t}$, contains the private class samples. We designate the set containing the private class samples as $\bm{X}_2$ and assign the remaining set to $\bm{X}_1$. The corresponding labels $\{\bar{y}_i\}_{i=1}^{n}$ and $\{\bar{y}_j'\}_{j=1}^{m}$ are given by the true labels when they are available and by the predicted labels otherwise. Subsequently, we build a masked OT from $\bm{X}_1$ to $\bm{X}_2$ and calculate the optimal transport plan $\bm{\Gamma}^{*}$ by Eq. ([2](https://arxiv.org/html/2605.05567#S3.E2)). Leveraging $\bm{\Gamma}^{*}$, we then identify the shared class samples $\mathcal{D}^{shr}$ and the private class samples $\mathcal{D}^{prv}$ using Eq. ([5](https://arxiv.org/html/2605.05567#S3.E5)).

### 3\.3Algorithm and Applications for Extreme Label Shift

Based on the locality\-aware private class identification method, we now present the numerical algorithm for solving DA under two extreme label shift scenarios, i\.e\., open set domain adaptation and partial domain adaptation\.

The algorithm can be roughly divided into two steps\. First, the private and shared class samples will be identified\. Then, knowledge transfer will be implemented\.

For convenience, w.l.o.g., we consider the case $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$, i.e., OSDA. The results can be easily extended to $\mathcal{Y}_s\supsetneq\mathcal{Y}_t$, i.e., PDA. The condition $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$ signifies that the target domain $\{\bm{x}_j^t\}_{j=1}^{n_t}$ contains private class samples. Therefore, as mentioned in Sec. [3.2](https://arxiv.org/html/2605.05567#S3.SS2), we substitute the sets $\bm{X}_1$ and $\bm{X}_2$ with $\{\bm{x}_i^s\}_{i=1}^{n_s}$ and $\{\bm{x}_j^t\}_{j=1}^{n_t}$, respectively. The corresponding labels $\{\bar{y}_i\}_{i=1}^{n}$ and $\{\bar{y}_j'\}_{j=1}^{m}$ are given by the true labels $\{y_i^s\}_{i=1}^{n_s}$ and the predicted labels $\{\hat{y}_j^t\}_{j=1}^{n_t}$, respectively. The default cost function $c$ is the squared Euclidean distance. The optimal transport plan $\bm{\Gamma}^{*}$ is then computed by Eq. ([2](https://arxiv.org/html/2605.05567#S3.E2)), and we obtain the identified target shared class samples $\mathcal{D}^{shr}$ and the identified target private class samples $\mathcal{D}^{prv}$ by Eq. ([5](https://arxiv.org/html/2605.05567#S3.E5)). These identified samples form the foundation for constructing a reliable transfer model in subsequent steps.

#### 3\.3\.1Reliable Transfer Across Domains

To ensure correct classification of these samples, we need to consider aligning the shared class\-conditional distributions across domains through the cross\-domain intra\-class alignment on shared classes, which can be achieved by utilizing the local transport property of masked OT\.

Specifically, we approximate the labels of target samples by the predicted labels, and then build a masked OT from the source domain to the target domain using the samples in the shared classes together with their labels. The cost function $c$ is the squared Euclidean distance. Notably, in the case where $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$, the $\bm{\Gamma}^{*}$ computed during private class identification already serves as a masked optimal transport plan from the source to the target domain. By setting the mass transported between shared and private classes in $\bm{\Gamma}^{*}$ to zero, we obtain the desired transport plan $\bm{\Gamma}^{shr}$ from the source shared to the target shared classes, i.e.,

$$\Gamma^{shr}_{ij}:=\begin{cases}0,&\text{both }\bm{x}_i^s\text{ and }\bm{x}_j^t\notin\mathcal{D}^{shr},\\ \Gamma^{*}_{ij},&\text{otherwise}.\end{cases}$$

Then, the alignment term seeks $g$ to minimize the cross-domain intra-class distances, formulated as

$$\min_{g}~\langle\bm{\Gamma}^{shr},\bm{C}^{Z}\rangle_F,$$

where $\bm{C}^{Z}$ measures the squared Euclidean distance between $\{\bm{z}_i^s\}_{i=1}^{n_s}$ and $\{\bm{z}_j^t\}_{j=1}^{n_t}$.

In addition, to enhance the discrimination ability of the model and avoid mismatching between shared and private class samples, we also enlarge the discrepancy between the shared and private classes in the representation space. First, similar to $\bm{\Gamma}^{shr}$, a transport plan $\bm{\Gamma}^{prv}$ between shared class samples and private class samples is defined by

$$\Gamma^{prv}_{ij}:=\begin{cases}0,&\text{both }\bm{x}_i^s\text{ and }\bm{x}_j^t\notin\mathcal{D}^{prv},\\ \Gamma^{*}_{ij},&\text{otherwise}.\end{cases}$$

As before, $\bm{C}^{Z}$ denotes the squared Euclidean distance between $\{\bm{z}_i^s\}_{i=1}^{n_s}$ and $\{\bm{z}_j^t\}_{j=1}^{n_t}$. The separation term is induced by maximizing the distances between shared and private class samples, which can be equivalently written as the following minimization objective:

$$\min_{g}~-\langle\bm{\Gamma}^{prv},\bm{C}^{Z}\rangle_F.$$

Finally, the reliable transfer loss $\mathcal{L}_{rt}$ is obtained by combining the alignment term and the separation term:

$$\min_{g}\mathcal{L}_{rt}=\langle\bm{\Gamma}^{shr},\bm{C}^{Z}\rangle_F-\langle\bm{\Gamma}^{prv},\bm{C}^{Z}\rangle_F. \tag{6}$$
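A minimal NumPy sketch of the reliable transfer loss is given below for the OSDA case. It assumes that the transport plan is held fixed while the loss is evaluated, and it reads the two case definitions as "keep only the columns of target samples in $\mathcal{D}^{shr}$ (resp. $\mathcal{D}^{prv}$)", which is our interpretation since source samples never belong to these target sets. The function name and toy numbers are hypothetical.

```python
import numpy as np

def reliable_transfer_loss(gamma, C, shared_idx, private_idx):
    """L_rt = <Gamma^shr, C>_F - <Gamma^prv, C>_F for a fixed plan (OSDA case)."""
    gamma_shr = np.zeros_like(gamma)
    gamma_shr[:, shared_idx] = gamma[:, shared_idx]     # keep mass sent to D^shr
    gamma_prv = np.zeros_like(gamma)
    gamma_prv[:, private_idx] = gamma[:, private_idx]   # keep mass sent to D^prv
    align = np.sum(gamma_shr * C)      # alignment term (pulled together)
    separate = np.sum(gamma_prv * C)   # separation term (pushed apart)
    return align - separate

# Hypothetical plan and squared-distance matrix for 2 source and 4 target samples.
gamma = np.array([[0.30, 0.20, 0.00, 0.00],
                  [0.00, 0.00, 0.45, 0.05]])
C = np.array([[1.0, 2.0, 9.0, 9.0],
              [9.0, 9.0, 1.0, 4.0]])
loss = reliable_transfer_loss(gamma, C, shared_idx=[0, 2], private_idx=[3])
```

Here the alignment term is $0.30\cdot 1+0.45\cdot 1=0.75$ and the separation term is $0.05\cdot 4=0.20$, so the loss is $0.55$. Target samples that fall inside the margin (column 1) contribute to neither term.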

#### 3\.3\.2Risk Minimization

To ensure basic model performance, the empirical classification risk $\mathcal{L}_{cls}$ is minimized. Specifically, the risk is minimized not only for the shared classes but also for the private class. In fact, if the model can precisely recognize the private samples, it can also determine which samples belong to the shared classes. Therefore, the model performance on shared classes can benefit from risk minimization on the private class. Let $\ell(\cdot,\cdot)$ represent a given loss function. Then $\mathcal{L}_{cls}$ is written as

$$\min_{g,h}\mathcal{L}_{cls}=\frac{1}{|\mathcal{D}_s\cup\mathcal{D}^{prv}|}\sum_{\bm{x}\in\mathcal{D}_s\cup\mathcal{D}^{prv}}\ell(h\circ g(\bm{x}),y), \tag{7}$$

where the pseudo label $y=|\mathcal{Y}_s\cap\mathcal{Y}_t|+1$ is used for all $\bm{x}\in\mathcal{D}^{prv}$, since $\mathcal{D}^{prv}\subset\mathcal{D}_t$ and the true labels are unavailable.

Additionally, we propose to minimize the risk on the transported source domain, referred to as the barycenter reconstruction loss. Specifically, we map the target representations back to the source domain based on the transport plan $\bm{\Gamma}^{*}$ and then minimize the reconstruction loss. The mapping is established via the barycentric mapping problem:

$$\hat{\bm{z}}_i^s=\operatorname*{arg\,min}_{\bm{z}}\sum_{j=1}^{n_t}\Gamma^{*}_{ij}\,c(\bm{z},\bm{z}_j^t).$$

As demonstrated by Courty et al. [[31](https://arxiv.org/html/2605.05567#bib.bib31)], for the squared Euclidean cost an analytic solution to this problem can be written as

$$\hat{\bm{z}}_i^s=n_s\sum_{j=1}^{n_t}\Gamma^{*}_{ij}\bm{z}_j^t.$$

The barycenter reconstruction loss is then formulated as

$$\min_{g,h}\mathcal{L}_{br}=\frac{1}{n_s}\sum_{i=1}^{n_s}\ell(h(\hat{\bm{z}}_i^s),y_i^s). \tag{8}$$
The barycenter mapping leverages target samples in the representation space to reconstruct the source sample𝒛s\\bm\{z\}^\{s\}with minimal cost\. By minimizingℒb​r\\mathcal\{L\}\_\{br\}, each target sample receives soft supervision proportional to its contribution to the reconstruction\. The target samples that belong to the same class as𝒛s\\bm\{z\}^\{s\}will contribute more significantly\. Consequently, this soft supervision will focus on pushing the target samples close to their corresponding class prototypes, implicitly lowering their classification risk\. Additionally, this process enhances the representation space and accentuates the local spatial structure, thereby improving the accuracy of private class identification\.
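The closed-form barycentric map is a single matrix product. The sketch below assumes a plan whose rows each sum to $\frac{1}{n_s}$ (the strict marginal constraint) and the squared Euclidean cost; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def barycentric_map(gamma, Z_t):
    """Map each source point to the weighted mean of the targets it sends mass to."""
    n_s = gamma.shape[0]
    # Rows of gamma sum to 1/n_s, so n_s * gamma has rows of convex weights.
    return n_s * gamma @ Z_t

# Hypothetical plan (2 source, 4 target samples) and 2-D target representations.
gamma = np.array([[0.30, 0.20, 0.00, 0.00],
                  [0.00, 0.00, 0.45, 0.05]])
Z_t = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [4.0, 4.0],
                [6.0, 4.0]])
Z_hat = barycentric_map(gamma, Z_t)
```

The first reconstructed source point is $2\,(0.3\cdot[0,0]+0.2\cdot[1,0])=[0.4,0]$ and the second is $2\,(0.45\cdot[4,4]+0.05\cdot[6,4])=[4.2,4.0]$. A target sample that receives more mass from a source sample pulls the reconstruction more strongly, which is the soft supervision described above.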

Let $\eta_1,\eta_2>0$ be trade-off parameters. Our final training objective is written as

$$\min_{g,h}\mathcal{L}=\mathcal{L}_{cls}+\eta_1\mathcal{L}_{rt}+\eta_2\mathcal{L}_{br}. \tag{9}$$
Algorithm 1: ReOT for extreme label shift DA

Input: source domain data $\{\bm{x}_i^s,y_i^s\}_{i=1}^{n_s}$, target domain data $\{\bm{x}_j^t\}_{j=1}^{n_t}$, parameters $\lambda$, $\beta_2$, $\eta_1$ and $\eta_2$, number of iterations $T$;

Output: classifier $h(\cdot)$ and feature transformation $g(\cdot)$;

1: Initialize $h(\cdot)$ and $g(\cdot)$;

2: for $i=1$ to $T$ do

3: Forward propagate $\{\bm{x}_j^t\}_{j=1}^{n_t}$ to obtain $\{\hat{y}_j^t\}_{j=1}^{n_t}$;

4: case $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$:

5: Substitution: $\bm{X}_1\leftarrow\{\bm{x}_i^s\}_{i=1}^{n_s}$, $\bm{X}_2\leftarrow\{\bm{x}_j^t\}_{j=1}^{n_t}$; and $\{\bar{y}_i\}_{i=1}^{n}\leftarrow\{y_i^s\}_{i=1}^{n_s}$, $\{\bar{y}_j'\}_{j=1}^{m}\leftarrow\{\hat{y}_j^t\}_{j=1}^{n_t}$;

6: Compute $\bm{\Gamma}^{*}$ by Eq. ([2](https://arxiv.org/html/2605.05567#S3.E2));

7: Implement private class identification by Eq. ([5](https://arxiv.org/html/2605.05567#S3.E5));

8: case $\mathcal{Y}_s\supsetneq\mathcal{Y}_t$:

9: Substitution: $\bm{X}_1\leftarrow\{\bm{x}_j^t\}_{j=1}^{n_t}$, $\bm{X}_2\leftarrow\{\bm{x}_i^s\}_{i=1}^{n_s}$; and $\{\bar{y}_i\}_{i=1}^{n}\leftarrow\{\hat{y}_j^t\}_{j=1}^{n_t}$, $\{\bar{y}_j'\}_{j=1}^{m}\leftarrow\{y_i^s\}_{i=1}^{n_s}$;

10: Compute $\bm{\Gamma}^{*}$ by Eq. ([2](https://arxiv.org/html/2605.05567#S3.E2));

11: Implement private class identification by Eq. ([5](https://arxiv.org/html/2605.05567#S3.E5));

12: $\bm{\Gamma}^{*}\leftarrow(\bm{\Gamma}^{*})^{\top}$; % $\bm{\Gamma}^{*}$ was computed as a transport plan from the target to the source domain

13: Compute $\mathcal{L}_{rt}$, $\mathcal{L}_{cls}$, $\mathcal{L}_{br}$ by Eqs. ([6](https://arxiv.org/html/2605.05567#S3.E6)), ([7](https://arxiv.org/html/2605.05567#S3.E7)), ([8](https://arxiv.org/html/2605.05567#S3.E8)), respectively;

14: Update $h(\cdot)$ and $g(\cdot)$ by Eq. ([9](https://arxiv.org/html/2605.05567#S3.E9)).

15: end for

For OSDA, $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$ holds, so the above discussion applies directly. Since neural networks (NNs) generally offer strong learning capacity, we parameterize the classifier $h(\cdot)$ and the feature transformation $g(\cdot)$ using NNs. The NN-based model can be updated with batch gradient descent. The overall training algorithm for OSDA is summarized in Algorithm [1](https://arxiv.org/html/2605.05567#alg1), corresponding to the case $\mathcal{Y}_s\subsetneq\mathcal{Y}_t$.

For PDA, since $\mathcal{Y}_s\supsetneq\mathcal{Y}_t$ and $\{\bm{x}_i^s\}_{i=1}^{n_s}$ contains private class samples, we just need to substitute $\bm{X}_1$ and $\bm{X}_2$ with $\{\bm{x}_j^t\}_{j=1}^{n_t}$ and $\{\bm{x}_i^s\}_{i=1}^{n_s}$, while substituting $\{\bar{y}_i\}_{i=1}^{n}$ and $\{\bar{y}_j'\}_{j=1}^{m}$ with $\{\hat{y}_j^t\}_{j=1}^{n_t}$ and $\{y_i^s\}_{i=1}^{n_s}$, respectively, for computing $\bm{\Gamma}^{*}$ in Eq. ([2](https://arxiv.org/html/2605.05567#S3.E2)). Private class identification is then also implemented by Eq. ([5](https://arxiv.org/html/2605.05567#S3.E5)). Now $\bm{\Gamma}^{*}$ transports probability mass from the target to the source domain, so the subsequent reliable transfer cannot proceed directly. Fortunately, since optimal transport is symmetric, we can directly transpose the matrix $\bm{\Gamma}^{*}$ to obtain the transport plan from source to target. Using the transposed $\bm{\Gamma}^{*}$, $\mathcal{L}_{rt}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{br}$ can also be computed by Eqs. ([6](https://arxiv.org/html/2605.05567#S3.E6)), ([7](https://arxiv.org/html/2605.05567#S3.E7)), and ([8](https://arxiv.org/html/2605.05567#S3.E8)), respectively. Notably, $\mathcal{Y}_s\supsetneq\mathcal{Y}_t$ implies that the private class samples belong to the labeled source domain, i.e., $\mathcal{D}^{prv}\subset\mathcal{D}_s$. Consequently, there is no need to use the pseudo label when computing $\mathcal{L}_{cls}$. By instantiating $h(\cdot)$ and $g(\cdot)$ as NNs, the overall training algorithm for PDA is also summarized in Algorithm [1](https://arxiv.org/html/2605.05567#alg1), corresponding to the case $\mathcal{Y}_s\supsetneq\mathcal{Y}_t$.

### 3\.4Generalization Error Analysis

In this section, we study the interpretability and transferability of the model learned by ReOT from a theoretical perspective. By decomposing the error gap between the source and target domains, theoretical bounds are provided for severe label shift scenarios.

Theoretical bounds for OSDA have been given by Fang et al. [[19](https://arxiv.org/html/2605.05567#bib.bib19),[41](https://arxiv.org/html/2605.05567#bib.bib41)], but they are not applicable to PDA tasks. Zhao et al. [[42](https://arxiv.org/html/2605.05567#bib.bib42),[43](https://arxiv.org/html/2605.05567#bib.bib43)] give sufficient upper bounds for domain adaptation, but handling these bounds in the extreme label shift scenario is challenging. Combes et al. [[44](https://arxiv.org/html/2605.05567#bib.bib44)] provide theoretical bounds for domain adaptation under generalized label shift. However, their conclusions are formulated under the closed set scenario, i.e., the source and target domains have identical label spaces. Therefore, it is necessary to extend the conclusions to extreme label shift scenarios where the label spaces are heterogeneous, including both OSDA and PDA.

Let $p(Y)$ and $q(Y)$, defined on $\mathcal{Y}:=\mathcal{Y}_s\cup\mathcal{Y}_t$, represent the probability mass functions of the true label distributions of $P$ and $Q$, respectively. For convenience, w.l.o.g., the source/target private classes not in $\mathcal{Y}_s\cap\mathcal{Y}_t$ are uniformly denoted by $K_s/K_t$, i.e., $\mathcal{Y}=(\mathcal{Y}_s\cap\mathcal{Y}_t)\cup K_s\cup K_t$ and $p(Y=K_t)=q(Y=K_s)=0$. Now let $P(\hat{Y}\mid Y)$ represent the conditional prediction distribution given $P$ and a predictor $\hat{Y}:=h\circ g(X)$. Then the error of the model $h\circ g$ on the source domain can be defined as $\varepsilon_s(h\circ g):=\sum_{i\neq j\in\mathcal{Y}}p(\hat{Y}=i,Y=j)$. The error on the target domain, $\varepsilon_t(h\circ g)$, is defined similarly. Before presenting the main results, we first introduce two important concepts for the theoretical analysis, following Combes et al. [[44](https://arxiv.org/html/2605.05567#bib.bib44)].

###### Definition 1\.

(Maximal Prediction Error). The maximal error of predictor $\hat{Y}$ on source distribution $P$ is

$$\Delta_{\rm MPE}(P)(\hat{Y}\|Y):=\max_{j\in\mathcal{Y}_s\cap\mathcal{Y}_t}p(\hat{Y}\neq j\mid Y=j).$$

Intuitively, the maximal prediction error is the largest class-conditional error rate of the predicted label $\hat{Y}$ with respect to the true label $Y$ over the shared classes, so it measures the worst-case degree of error of the prediction. If the predictor correctly classifies all shared class samples drawn from the distribution $P$, then the maximal prediction error $\Delta_{\rm MPE}(P)(\hat{Y}\|Y)$ will be zero.

###### Definition 2\.

\(Shared Conditional Gap\)\. Given source distributionPPand target distributionQQ, the conditional gap of predictorY^\\hat\{Y\}on shared classes is

ΔSCG​\(Y^\):=∑j∈𝒴s∩𝒴tq​\(Y=j\)​𝔻​\(P​\(Y^\|Y=j\),Q​\(Y^\|Y=j\)\),\\displaystyle\\begin\{split\}\\Delta\_\{\{\\rm SCG\}\}\(\\hat\{Y\}\):=\\sum\_\{j\\in\\mathcal\{Y\}\_\{s\}\\cap\\mathcal\{Y\}\_\{t\}\}q\(Y=j\)\\mathbb\{D\}\(P\(\\hat\{Y\}\|Y=j\),Q\(\\hat\{Y\}\|Y=j\)\),\\end\{split\}where

𝔻​\(P​\(Y^\|Y=j\),Q​\(Y^\|Y=j\)\):=\\displaystyle\\mathbb\{D\}\(P\(\\hat\{Y\}\|Y=j\),Q\(\\hat\{Y\}\|Y=j\)\):=maxi≠j,i∈𝒴\|p\(Y^=i\|Y=j\)\\displaystyle\\max\_\{i\\neq j,i\\in\\mathcal\{Y\}\}\|p\(\\hat\{Y\}=i\|Y=j\)−q\(Y^=i\|Y=j\)\|\.\\displaystyle\-q\(\\hat\{Y\}=i\|Y=j\)\|\.

The shared conditional gap measures the cross-domain conditional discrepancy on shared classes. A smaller shared conditional gap implies a smaller conditional discrepancy across domains. In fact, as demonstrated by Combes et al. \[[44](https://arxiv.org/html/2605.05567#bib.bib44)\], if the following process is a Markov chain:

$$X\stackrel{g}{\longrightarrow}Z\stackrel{h}{\longrightarrow}\hat{Y},$$

where $Z=g(X)$ and $\hat{Y}=h\circ g(X)$, then $P(g(X)|Y=j)=Q(g(X)|Y=j)$ for any $j\in\mathcal{Y}_{s}\cap\mathcal{Y}_{t}$ implies that the shared conditional gap $\Delta_{\rm SCG}(\hat{Y})$ is zero. In other words, when the cross-domain class-conditional distributions on shared classes are aligned and $\hat{Y}$ depends only on $g(X)$, the shared conditional gap $\Delta_{\rm SCG}(\hat{Y})$ will be zero.
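Definitions 1 and 2 can be evaluated directly once the class-conditional prediction distributions are tabulated. The following is a minimal sketch, assuming those distributions are available as row-stochastic matrices; the function names and the three-class toy problem are hypothetical and are not taken from the paper.

```python
import numpy as np

def maximal_prediction_error(cond_p, shared):
    """Delta_MPE: largest p(Y_hat != j | Y = j) over shared classes j.

    cond_p[j, i] holds p(Y_hat = i | Y = j); every row that is used sums to 1.
    """
    return max(1.0 - cond_p[j, j] for j in shared)

def shared_conditional_gap(cond_p, cond_q, q_label, shared):
    """Delta_SCG: q-weighted sum over shared j of max_{i != j} |p(i|j) - q(i|j)|."""
    num_classes = cond_p.shape[0]
    gap = 0.0
    for j in shared:
        others = [i for i in range(num_classes) if i != j]
        gap += q_label[j] * np.max(np.abs(cond_p[j, others] - cond_q[j, others]))
    return gap

# Hypothetical toy problem: classes 0 and 1 are shared, class 2 plays the role of K_t.
shared = [0, 1]
cond_p = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.0],
                   [0.0, 0.0, 1.0]])  # last row is unused because p(Y = K_t) = 0
cond_q = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.1, 0.8]])
q_label = np.array([0.4, 0.4, 0.2])

mpe = maximal_prediction_error(cond_p, shared)
scg = shared_conditional_gap(cond_p, cond_q, q_label, shared)
print(f"MPE = {mpe:.2f}, SCG = {scg:.2f}")
```

In this toy case the source predictor misclassifies at most 20% of any shared class, and the two domains disagree by at most 0.1 in every off-diagonal conditional, so both quantities are small.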

Based on these two concepts, we present our result on the generalization error in the following theorem\.

###### Theorem 2\.

Given source distribution $P$ and target distribution $Q$, then for any predictor $\hat{Y}=h\circ g(X)$,

$$\begin{aligned}|\varepsilon_{t}(h\circ g)-\varepsilon_{s}(h\circ g)|\leq{}&\|p(Y)-q(Y)\|_{1}\cdot\Delta_{\rm MPE}(P)(\hat{Y}\|Y)+(|\mathcal{Y}|-1)\Delta_{\rm SCG}(\hat{Y})\\&+p(\hat{Y}\neq Y,Y=K_{s})+q(\hat{Y}\neq Y,Y=K_{t}),\end{aligned}$$

where

$$\|p(Y)-q(Y)\|_{1}=\sum_{i\in\mathcal{Y}_{s}\cap\mathcal{Y}_{t}}\left|p(Y=i)-q(Y=i)\right|.$$

The proof is given in the supplementary material. The upper bound in Theorem [2](https://arxiv.org/html/2605.05567#Thmtheorem2) provides a new way to decompose the error gap between source and target domains for the heterogeneous label space scenario. Since $\varepsilon_{t}\leq\varepsilon_{s}+|\varepsilon_{t}-\varepsilon_{s}|$, it can also be directly used to obtain a generalization upper bound of the target risk $\varepsilon_{t}$, i.e.,

$$\begin{aligned}\varepsilon_{t}\leq{}&\varepsilon_{s}+p(\hat{Y}\neq Y,Y=K_{s})+\|p(Y)-q(Y)\|_{1}\cdot\Delta_{\rm MPE}(P)(\hat{Y}\|Y)\\&+(|\mathcal{Y}|-1)\Delta_{\rm SCG}(\hat{Y})+q(\hat{Y}\neq Y,Y=K_{t}).\end{aligned}$$

Note that $p(\hat{Y}\neq Y,Y=K_{s})\leq\varepsilon_{s}$, thus it can be simplified as

$$\begin{aligned}\varepsilon_{t}\leq{}&2\varepsilon_{s}+\|p(Y)-q(Y)\|_{1}\cdot\Delta_{\rm MPE}(P)(\hat{Y}\|Y)\\&+(|\mathcal{Y}|-1)\Delta_{\rm SCG}(\hat{Y})+q(\hat{Y}\neq Y,Y=K_{t}).\end{aligned}\tag{10}$$

This generalization upper bound, given by the right-hand side of inequality ([10](https://arxiv.org/html/2605.05567#S3.E10)), consists of four terms. The first term, $2\varepsilon_{s}$, measures the classification error of model $h\circ g$ on the source domain and can be sufficiently minimized by ReOT, as $\mathcal{L}_{cls}$ reduces the risk on the source domain.

The second term contains $\Delta_{\rm MPE}(P)(\hat{Y}\|Y)$ and $\|p(Y)-q(Y)\|_{1}$. As mentioned, $\Delta_{\rm MPE}(P)(\hat{Y}\|Y)$ decreases to $0$ as the model achieves correct classification on source shared classes. Meanwhile, $\|p(Y)-q(Y)\|_{1}$ measures the discrepancy between the true label distributions and is a constant. Therefore, the second term can also be minimized by ReOT through risk minimization on the source domain.

The third term involves the constant $|\mathcal{Y}|-1$ and the shared conditional gap $\Delta_{\rm SCG}(\hat{Y})$. By optimizing $\mathcal{L}_{rt}$, ReOT decreases the intra-class distances on shared classes, and thus mitigates the cross-domain class-conditional discrepancy on shared classes. Our experiments will show that $\mathcal{L}_{rt}$ is indeed effective in alleviating class-conditional discrepancy. Thus, the third term can also be reduced by ReOT.

The fourth term measures the classification error on target private class samples. In PDA, it is naturally $0$ since there is no target private class. In OSDA, this term can be reduced by risk minimization on target private class samples. Although the target private class samples are unlabeled in OSDA, most of them can be recognized by the proposed private class identification method. Consequently, ReOT can decrease this term by using $\mathcal{L}_{cls}$ for risk minimization on the private class samples.

In conclusion, ReOT can sufficiently decrease the upper bound of the target risk, thus achieving a small generalization error on the target domain. This supports the effectiveness of ReOT from a theoretical perspective.
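To make the four-term decomposition of inequality (10) concrete, the sketch below evaluates each term on a hypothetical three-class OSDA toy problem (classes 0 and 1 shared, class 2 standing in for $K_t$) and compares their sum with the target risk. All names and numbers are illustrative assumptions; this is a sanity check on one example, not a proof or the authors' code.

```python
import numpy as np

def risk(cond, label):
    """Error rate sum_{i != j} P(Y_hat = i, Y = j) from conditionals and label marginal."""
    joint = cond * label[:, None]  # joint[j, i] = P(Y = j, Y_hat = i)
    return joint.sum() - np.trace(joint)

def bound_terms(cond_p, cond_q, p_label, q_label, shared, k_t):
    """Return the four terms on the right-hand side of inequality (10)."""
    off_diag = ~np.eye(len(p_label), dtype=bool)  # mask selecting i != j
    mpe = max(1.0 - cond_p[j, j] for j in shared)
    gaps = np.where(off_diag, np.abs(cond_p - cond_q), 0.0).max(axis=1)
    scg = sum(q_label[j] * gaps[j] for j in shared)
    shift = sum(abs(p_label[j] - q_label[j]) for j in shared)
    return {
        "2*eps_s": 2.0 * risk(cond_p, p_label),
        "shift*MPE": shift * mpe,
        "(|Y|-1)*SCG": (len(p_label) - 1) * scg,
        "target_private": q_label[k_t] * (1.0 - cond_q[k_t, k_t]),
    }

# Hypothetical toy distributions; the source has no mass on the private class 2.
cond_p = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.0, 1.0]])
cond_q = np.array([[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]])
p_label = np.array([0.5, 0.5, 0.0])
q_label = np.array([0.4, 0.4, 0.2])

terms = bound_terms(cond_p, cond_q, p_label, q_label, shared=[0, 1], k_t=2)
upper_bound = sum(terms.values())
eps_t = risk(cond_q, q_label)
print(terms, upper_bound, eps_t)
```

On this example the target risk is 0.28 while the bound evaluates to 0.54, with the source-risk term dominating, which matches the discussion that lowering the source risk tightens most of the bound.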

## 4 Experiments

### 4.1 Datasets and Setup

Datasets. We evaluate ReOT on four benchmark datasets, i.e., Image-CLEF \[[45](https://arxiv.org/html/2605.05567#bib.bib45)\], Office-31 \[[46](https://arxiv.org/html/2605.05567#bib.bib46)\], Office-Home \[[47](https://arxiv.org/html/2605.05567#bib.bib47)\], and VisDA-2017 \[[48](https://arxiv.org/html/2605.05567#bib.bib48)\]. We follow the standard protocols \[[10](https://arxiv.org/html/2605.05567#bib.bib10), [16](https://arxiv.org/html/2605.05567#bib.bib16), [17](https://arxiv.org/html/2605.05567#bib.bib17), [9](https://arxiv.org/html/2605.05567#bib.bib9)\] to generate adaptation tasks. Details of the datasets and settings are summarized as follows.

- Image-CLEF consists of 12 categories from 4 domains, i.e., Caltech (C), ImageNet (I), Pascal (P), and Bing (B), that are collected from existing public datasets. Each domain contains 600 images. For the OSDA setting, the first 6 categories in alphabetical order are selected as shared classes and the remaining 6 are selected as target private classes. For the PDA setting, the target domain consists of the first 6 categories (in alphabetical order).
- Office-31 contains 4652 images of 31 categories. The images are collected from 3 different domains, i.e., Amazon (A), DSLR (D), and Webcam (W). For the OSDA setting, the first 10 categories in alphabetical order are regarded as the shared classes and the last 11 categories are considered target private classes. Following the protocols in \[[10](https://arxiv.org/html/2605.05567#bib.bib10), [16](https://arxiv.org/html/2605.05567#bib.bib16)\], the target domain consists of 10 categories in the PDA setting.
- Office-Home contains 15500 images from 4 domains, including Art (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw); each domain contains 65 categories. For the OSDA setting, the first 25 categories (in alphabetical order) are selected as shared classes, while the rest are selected as target private classes. For the PDA setting, the first 25 classes (in alphabetical order) are selected as the target domain.
- VisDA-2017 contains about 200k images of 12 categories from 2 domains, i.e., Synthetic (S) and Real (R). The first 6 classes (in alphabetical order) are taken as the target domain in the PDA setting. The PDA task is considered as the knowledge transfer from S to R.
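The alphabetical splits described in the list above can be sketched as a small helper. The function name and the placeholder class names are hypothetical; only the counts (for example 10 shared and 11 target private classes for Office-31 in OSDA) come from the text.

```python
def split_label_space(class_names, num_shared, num_target_private=0):
    """Split alphabetically ordered classes into shared and target private sets.

    The first `num_shared` classes are shared; the last `num_target_private`
    classes are target private (use 0 for PDA, where the target has none).
    """
    ordered = sorted(class_names)
    shared = ordered[:num_shared]
    target_private = ordered[len(ordered) - num_target_private:]
    return shared, target_private

# Placeholder names standing in for the real category names.
office31 = [f"class_{i:02d}" for i in range(31)]
office_home = [f"class_{i:02d}" for i in range(65)]

# OSDA on Office-31: first 10 shared, last 11 target private.
osda_shared, osda_private = split_label_space(office31, 10, 11)

# PDA on Office-Home: the target domain keeps only the first 25 classes.
pda_target, pda_private = split_label_space(office_home, 25)

print(len(osda_shared), len(osda_private), len(pda_target), len(pda_private))
```

Slicing from `len(ordered) - num_target_private` rather than from a negative index keeps the PDA case (zero target private classes) from accidentally returning the whole list.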

Evaluation Metrics. In the OSDA setting, we use three evaluation metrics following the mainstream OSDA works \[[17](https://arxiv.org/html/2605.05567#bib.bib17), [9](https://arxiv.org/html/2605.05567#bib.bib9)\], i.e., OS\*, UNK, and H. OS\* is the average classification accuracy over shared classes and UNK is the classification accuracy of the private class. ${\rm H}=2\times\frac{{\rm OS^{*}}\times{\rm UNK}}{{\rm OS^{*}}+{\rm UNK}}$ is the harmonic mean accuracy of OS\* and UNK, and it is the core metric since a high H appears only if the model performs well on both OS\* and UNK. In the PDA setting, we use the standard classification accuracy as the evaluation metric following the mainstream PDA works \[[10](https://arxiv.org/html/2605.05567#bib.bib10), [16](https://arxiv.org/html/2605.05567#bib.bib16)\].
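As a quick illustration of the H metric, the sketch below computes the harmonic mean from OS\* and UNK values. The function name is hypothetical, and the zero-denominator guard is an added assumption for the degenerate case where both accuracies are zero.

```python
def h_score(os_star, unk):
    """Harmonic mean of shared-class accuracy (OS*) and private-class accuracy (UNK)."""
    if os_star + unk == 0:
        return 0.0  # assumed convention when both accuracies are zero
    return 2.0 * os_star * unk / (os_star + unk)

# ReOT on the Image-CLEF task B->C, as reported in Table I.
h_reot = h_score(93.2, 95.8)

# A high OS* cannot compensate for a collapsed UNK (PGL on Rw->Cl in Table IV).
h_collapsed = h_score(68.8, 0.0)

print(round(h_reot, 1), h_collapsed)
```

The second call shows why H is treated as the core metric: a model that never recognizes private samples scores zero regardless of its shared-class accuracy.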

Implementation. We employ ResNet-50 \[[49](https://arxiv.org/html/2605.05567#bib.bib49)\] pre-trained on ImageNet \[[50](https://arxiv.org/html/2605.05567#bib.bib50)\] as the backbone, which has been widely used in OSDA and PDA; we keep it fixed and use it to extract 2048-dimensional feature vectors as input for all datasets. The feature transformation $g(\cdot)$ consists of two fully-connected layers with dimensions $\mathbb{R}^{2048}\to\mathbb{R}^{1024}\to\mathbb{R}^{256}$. The classifier $h(\cdot)$ consists of a single fully-connected layer with $(|\mathcal{Y}_{s}|+1)$-dimensional output and SoftMax activation. The loss function $\ell(\cdot,\cdot)$ is defined as the cross-entropy between the classifier's output and the one-hot encoded form of the true label. The final predicted label is determined by the index of the maximum component value. $g(\cdot)$ and $h(\cdot)$ are initially fine-tuned with the empirical risk on the source domain. Then the whole model $h\circ g$ is optimized by Algorithm [1](https://arxiv.org/html/2605.05567#alg1). More details about the training implementations are provided in the supplementary material.
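The layer shapes described above can be sketched as a plain NumPy forward pass on pre-extracted features. This is an illustrative assumption-laden sketch, not the authors' implementation: the ReLU activations, the weight initialization, the value of $|\mathcal{Y}_s|$, and all names are hypothetical, and training by Algorithm 1 is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SOURCE_CLASSES = 10  # hypothetical |Y_s|

def init_linear(d_in, d_out):
    """Small random weights and zero bias for one fully-connected layer."""
    return rng.normal(0.0, 0.01, size=(d_in, d_out)), np.zeros(d_out)

g_params = [init_linear(2048, 1024), init_linear(1024, 256)]
h_params = init_linear(256, NUM_SOURCE_CLASSES + 1)

def g(x):
    """Feature transformation R^2048 -> R^1024 -> R^256 (ReLU is an assumption)."""
    for weight, bias in g_params:
        x = np.maximum(x @ weight + bias, 0.0)
    return x

def h(z):
    """Single fully-connected layer followed by a numerically stable softmax."""
    weight, bias = h_params
    logits = z @ weight + bias
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean cross-entropy against integer labels (equivalent to one-hot targets)."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + 1e-12))

features = rng.normal(size=(4, 2048))  # stand-in for frozen ResNet-50 features
labels = np.array([0, 3, 7, NUM_SOURCE_CLASSES])
probs = h(g(features))
predictions = probs.argmax(axis=1)     # index of the maximum component
loss = cross_entropy(probs, labels)
print(probs.shape, predictions, loss)
```

The extra output dimension beyond $|\mathcal{Y}_s|$ is used here as an index for private samples, which is one plausible reading of the $(|\mathcal{Y}_s|+1)$-dimensional output.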

### 4.2 Comparative Experiments Under OSDA Setting

TABLE I: Classification accuracies (%) in the OSDA setting on Image-CLEF dataset. Bold fonts indicate the best results, and the rest are the same.

| Method | B→C OS\* | B→C UNK | B→C H | B→I OS\* | B→I UNK | B→I H | B→P OS\* | B→P UNK | B→P H | C→B OS\* | C→B UNK | C→B H | C→I OS\* | C→I UNK | C→I H | C→P OS\* | C→P UNK | C→P H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP \[[7](https://arxiv.org/html/2605.05567#bib.bib7)\] | 87.0 | 81.0 | 83.9 | 85.3 | 65.7 | 74.3 | 66.3 | 66.7 | 66.5 | 62.0 | 58.0 | 59.9 | 89.0 | 80.0 | 84.3 | 87.7 | 53.7 | 66.7 |
| STA \[[13](https://arxiv.org/html/2605.05567#bib.bib13)\] | 93.3 | 51.7 | 66.5 | 86.0 | 60.7 | 71.2 | 77.7 | 48.7 | 59.8 | 61.3 | 69.7 | 65.2 | 91.7 | 66.7 | 77.2 | 84.0 | 54.0 | 65.7 |
| ROS \[[8](https://arxiv.org/html/2605.05567#bib.bib8)\] | 78.3 | 90.0 | 83.8 | 73.0 | 76.3 | 74.6 | 59.0 | 67.3 | 62.9 | 59.0 | 68.3 | 63.3 | 78.3 | 83.0 | 80.6 | 68.7 | 78.7 | 73.3 |
| DAOD \[[19](https://arxiv.org/html/2605.05567#bib.bib19)\] | 79.4 | 82.0 | 80.7 | 78.4 | 90.9 | 84.3 | 72.1 | 80.8 | 76.3 | 51.3 | 47.1 | 49.1 | 79.0 | 88.6 | 83.6 | 74.5 | 78.9 | 76.7 |
| MRJT \[[17](https://arxiv.org/html/2605.05567#bib.bib17)\] | 90.7 | 99.0 | 94.7 | 74.7 | 95.3 | 83.7 | 63.3 | 92.7 | 75.2 | 58.0 | 85.3 | 69.1 | 87.0 | 93.3 | 90.1 | 69.7 | 85.7 | 76.8 |
| ANNA \[[9](https://arxiv.org/html/2605.05567#bib.bib9)\] | 95.3 | 98.3 | 96.8 | 81.3 | 84.7 | 83.0 | 74.0 | 75.0 | 74.5 | 58.0 | 83.0 | 68.3 | 87.0 | 93.0 | 89.9 | 78.7 | 84.0 | 81.2 |
| ReOT | 93.2 | 95.8 | 94.5 | 90.0 | 93.9 | 91.9 | 77.2 | 82.1 | 79.6 | 56.6 | 90.0 | 69.4 | 88.9 | 97.0 | 92.8 | 78.1 | 87.4 | 82.5 |

| Method | I→B OS\* | I→B UNK | I→B H | I→C OS\* | I→C UNK | I→C H | I→P OS\* | I→P UNK | I→P H | P→B OS\* | P→B UNK | P→B H | P→C OS\* | P→C UNK | P→C H | P→I OS\* | P→I UNK | P→I H | Mean OS\* | Mean UNK | Mean H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP | 55.7 | 60.7 | 58.1 | 80.7 | 92.7 | 86.3 | 66.3 | 74.3 | 70.1 | 52.3 | 61.0 | 56.3 | 94.0 | 68.0 | 78.9 | 66.0 | 80.7 | 72.6 | 74.4 | 70.2 | 71.5 |
| STA | 62.3 | 54.0 | 57.9 | 94.0 | 53.7 | 68.4 | 80.7 | 59.0 | 68.2 | 61.3 | 43.7 | 51.0 | 93.7 | 47.7 | 63.2 | 90.0 | 51.0 | 65.1 | 81.3 | 55.1 | 65.0 |
| ROS | 58.0 | 59.7 | 58.8 | 88.7 | 92.7 | 90.6 | 78.0 | 76.0 | 77.0 | 47.3 | 59.3 | 52.7 | 71.3 | 90.3 | 79.7 | 79.7 | 81.3 | 80.5 | 69.9 | 76.9 | 73.1 |
| DAOD | 54.5 | 56.9 | 55.7 | 80.3 | 82.0 | 81.2 | 73.3 | 80.8 | 76.9 | 51.7 | 51.0 | 51.3 | 79.0 | 82.0 | 80.5 | 79.6 | 86.6 | 83.9 | 71.1 | 75.8 | 73.3 |
| MRJT | 56.7 | 83.3 | 67.5 | 97.3 | 81.3 | 88.6 | 79.0 | 88.7 | 83.5 | 52.0 | 86.3 | 64.9 | 93.7 | 80.0 | 86.3 | 87.3 | 87.7 | 87.5 | 75.7 | 88.2 | 80.7 |
| ANNA | 56.0 | 78.0 | 65.2 | 94.3 | 97.7 | 96.0 | 80.7 | 82.7 | 81.7 | 54.0 | 73.7 | 62.3 | 94.0 | 93.7 | 93.8 | 85.0 | 83.3 | 84.2 | 78.2 | 85.6 | 81.4 |
| ReOT | 58.4 | 87.6 | 70.1 | 93.2 | 98.4 | 95.8 | 81.7 | 86.4 | 83.9 | 55.1 | 80.6 | 65.4 | 94.8 | 95.7 | 95.2 | 89.0 | 95.4 | 92.1 | 79.7 | 90.9 | 84.4 |

TABLE II: Ablation study results (%) on different modules.

| $\mathcal{L}_{cls}$ | $\mathcal{L}_{rt}$ | $\mathcal{L}_{br}$ | Image-CLEF OSDA | Image-CLEF PDA | Office-31 OSDA | Office-31 PDA | Office-Home OSDA | Office-Home PDA | Mean OSDA | Mean PDA |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 80.9 | 84.1 | 88.5 | 87.1 | 63.6 | 61.4 | 77.7 | 77.5 |
| ✓ | ✓ | ✗ | 83.2 | 92.2 | 91.1 | 95.9 | 65.8 | 69.3 | 80.0 | 85.8 |
| ✓ | ✗ | ✓ | 82.1 | 86.9 | 90.6 | 92.3 | 66.4 | 68.8 | 79.7 | 82.7 |
| ✓ | ✓ | ✓ | 84.4 | 93.4 | 93.4 | 98.4 | 70.3 | 78.7 | 82.7 | 90.2 |

TABLE III: Classification accuracies (%) in the OSDA setting on Office-31 dataset.

| Method | A→W OS\* | A→W UNK | A→W H | A→D OS\* | A→D UNK | A→D H | D→A OS\* | D→A UNK | D→A H | D→W OS\* | D→W UNK | D→W H | W→A OS\* | W→A UNK | W→A H | W→D OS\* | W→D UNK | W→D H | Mean OS\* | Mean UNK | Mean H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP \[[7](https://arxiv.org/html/2605.05567#bib.bib7)\] | 86.8 | 79.2 | 82.7 | 90.5 | 75.5 | 82.4 | 76.1 | 72.3 | 75.1 | 97.7 | 96.7 | 97.2 | 73.0 | 74.4 | 73.7 | 99.1 | 84.2 | 91.1 | 87.2 | 80.4 | 83.7 |
| STA \[[13](https://arxiv.org/html/2605.05567#bib.bib13)\] | 92.1 | 58.0 | 71.0 | 95.4 | 45.5 | 61.6 | 94.1 | 55.0 | 69.4 | 97.1 | 49.7 | 65.5 | 92.1 | 46.2 | 60.9 | 96.6 | 48.5 | 64.4 | 94.6 | 50.5 | 65.5 |
| ROS \[[8](https://arxiv.org/html/2605.05567#bib.bib8)\] | 88.4 | 76.7 | 82.1 | 87.5 | 77.8 | 82.4 | 74.8 | 81.2 | 77.9 | 99.3 | 93.0 | 96.0 | 69.7 | 86.6 | 77.2 | 100 | 99.4 | 99.7 | 86.6 | 85.8 | 85.9 |
| cUADAL \[[20](https://arxiv.org/html/2605.05567#bib.bib20)\] | 85.5 | 95.1 | 90.1 | 85.6 | 90.4 | 87.9 | 74.2 | 87.8 | 80.5 | 98.7 | 97.7 | 98.2 | 65.6 | 87.8 | 75.1 | 99.3 | 99.4 | 99.4 | 84.8 | 93.0 | 88.5 |
| MRJT \[[17](https://arxiv.org/html/2605.05567#bib.bib17)\] | 90.4 | 89.6 | 90.0 | 90.2 | 89.7 | 90.0 | 92.1 | 90.0 | 91.0 | 98.3 | 100 | 99.1 | 91.0 | 94.7 | 92.9 | 100 | 93.1 | 96.4 | 93.7 | 92.9 | 93.3 |
| ANNA \[[9](https://arxiv.org/html/2605.05567#bib.bib9)\] | 82.8 | 88.4 | 85.5 | 93.2 | 76.1 | 83.8 | 75.4 | 91.1 | 82.5 | 99.4 | 99.6 | 99.5 | 76.0 | 87.9 | 81.6 | 100 | 96.8 | 98.4 | 87.8 | 90.0 | 88.6 |
| ReOT | 93.4 | 98.5 | 95.9 | 94.6 | 95.9 | 95.2 | 79.7 | 93.1 | 85.9 | 99.1 | 96.4 | 97.7 | 78.7 | 95.7 | 86.3 | 98.7 | 100 | 99.3 | 90.7 | 96.6 | 93.4 |

TABLE IV: Classification accuracies (%) in the OSDA setting on Office-Home dataset.

| Method | Ar→Cl OS\* | Ar→Cl UNK | Ar→Cl H | Ar→Pr OS\* | Ar→Pr UNK | Ar→Pr H | Ar→Rw OS\* | Ar→Rw UNK | Ar→Rw H | Cl→Ar OS\* | Cl→Ar UNK | Cl→Ar H | Cl→Pr OS\* | Cl→Pr UNK | Cl→Pr H | Cl→Rw OS\* | Cl→Rw UNK | Cl→Rw H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP \[[7](https://arxiv.org/html/2605.05567#bib.bib7)\] | 50.2 | 61.1 | 55.1 | 71.8 | 59.8 | 65.2 | 79.3 | 67.5 | 72.9 | 59.4 | 70.3 | 64.3 | 67.0 | 62.7 | 64.7 | 72.0 | 69.2 | 70.6 |
| STA \[[13](https://arxiv.org/html/2605.05567#bib.bib13)\] | 50.8 | 63.4 | 56.3 | 68.7 | 59.7 | 63.7 | 81.1 | 50.5 | 62.1 | 53.0 | 63.9 | 57.9 | 61.4 | 63.5 | 62.5 | 69.8 | 63.2 | 66.3 |
| ROS \[[8](https://arxiv.org/html/2605.05567#bib.bib8)\] | 50.6 | 74.1 | 60.1 | 68.4 | 70.3 | 69.3 | 75.8 | 77.2 | 76.5 | 53.6 | 65.5 | 58.9 | 59.8 | 71.6 | 65.2 | 65.3 | 72.2 | 68.6 |
| DAOD \[[19](https://arxiv.org/html/2605.05567#bib.bib19)\] | 72.6 | 51.8 | 60.5 | 55.3 | 57.9 | 56.6 | 78.2 | 62.6 | 69.5 | 59.1 | 61.7 | 60.4 | 70.8 | 52.6 | 60.4 | 77.8 | 57.0 | 65.8 |
| PGL \[[21](https://arxiv.org/html/2605.05567#bib.bib21)\] | 63.3 | 19.1 | 29.3 | 78.9 | 32.1 | 45.6 | 87.7 | 40.9 | 55.8 | 85.9 | 5.3 | 10.0 | 73.9 | 24.5 | 36.8 | 70.2 | 33.8 | 45.6 |
| cUADAL \[[20](https://arxiv.org/html/2605.05567#bib.bib20)\] | 55.0 | 75.6 | 63.6 | 69.4 | 73.9 | 71.6 | 82.2 | 73.3 | 77.5 | 53.8 | 82.0 | 65.0 | 61.1 | 77.4 | 68.3 | 69.3 | 76.3 | 72.6 |
| MRJT \[[17](https://arxiv.org/html/2605.05567#bib.bib17)\] | 49.7 | 75.3 | 59.9 | 69.3 | 77.1 | 73.0 | 78.2 | 73.4 | 76.1 | 52.3 | 82.6 | 64.1 | 59.0 | 76.1 | 66.5 | 67.4 | 79.5 | 73.0 |
| ANNA \[[9](https://arxiv.org/html/2605.05567#bib.bib9)\] | 61.4 | 78.7 | 69.0 | 68.3 | 79.9 | 73.7 | 74.1 | 79.7 | 76.8 | 58.0 | 73.1 | 64.7 | 64.2 | 73.6 | 68.6 | 66.9 | 80.2 | 73.0 |
| ReOT | 57.1 | 78.9 | 66.3 | 72.7 | 76.0 | 74.3 | 77.4 | 81.8 | 79.5 | 59.0 | 69.6 | 63.9 | 64.1 | 75.4 | 69.3 | 70.2 | 75.2 | 72.6 |

| Method | Pr→Ar OS\* | Pr→Ar UNK | Pr→Ar H | Pr→Cl OS\* | Pr→Cl UNK | Pr→Cl H | Pr→Rw OS\* | Pr→Rw UNK | Pr→Rw H | Rw→Ar OS\* | Rw→Ar UNK | Rw→Ar H | Rw→Cl OS\* | Rw→Cl UNK | Rw→Cl H | Rw→Pr OS\* | Rw→Pr UNK | Rw→Pr H | Mean OS\* | Mean UNK | Mean H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP | 59.1 | 68.1 | 63.2 | 44.5 | 66.3 | 53.2 | 76.2 | 71.7 | 73.9 | 66.1 | 67.3 | 66.7 | 48.0 | 63.0 | 54.5 | 76.3 | 68.6 | 72.3 | 64.1 | 66.3 | 64.7 |
| STA | 55.4 | 73.7 | 63.1 | 44.7 | 71.5 | 55.0 | 78.1 | 63.3 | 69.7 | 67.9 | 62.3 | 65.0 | 51.4 | 57.9 | 54.2 | 77.9 | 58.0 | 66.4 | 63.4 | 62.6 | 61.9 |
| ROS | 57.3 | 64.3 | 60.6 | 46.5 | 71.2 | 56.3 | 70.8 | 78.4 | 74.4 | 67.0 | 70.8 | 68.8 | 51.5 | 73.0 | 60.4 | 72.0 | 80.0 | 75.7 | 61.6 | 72.4 | 66.2 |
| DAOD | 71.3 | 50.5 | 59.1 | 58.4 | 42.8 | 49.4 | 81.8 | 50.6 | 62.5 | 66.7 | 43.3 | 52.5 | 60.0 | 36.6 | 45.5 | 84.1 | 34.7 | 49.1 | 69.6 | 50.2 | 57.6 |
| PGL | 73.7 | 34.7 | 47.2 | 59.2 | 38.4 | 46.6 | 84.8 | 27.6 | 41.6 | 81.5 | 6.1 | 11.4 | 68.8 | 0.0 | 0.0 | 84.8 | 38.0 | 52.5 | 76.1 | 25.0 | 35.2 |
| cUADAL | 50.9 | 82.4 | 62.9 | 41.2 | 80.7 | 54.6 | 71.2 | 83.4 | 76.8 | 66.8 | 79.6 | 72.6 | 51.8 | 71.1 | 59.9 | 77.8 | 75.6 | 76.7 | 62.5 | 77.6 | 68.5 |
| MRJT | 60.3 | 73.2 | 66.1 | 51.0 | 72.2 | 59.7 | 76.4 | 69.0 | 72.5 | 66.9 | 66.1 | 66.5 | 55.1 | 71.1 | 62.1 | 77.0 | 75.6 | 76.3 | 63.5 | 74.3 | 68.0 |
| ANNA | 63.0 | 70.3 | 66.5 | 54.6 | 74.8 | 63.1 | 74.3 | 78.9 | 76.6 | 66.1 | 77.3 | 71.3 | 59.7 | 73.1 | 65.7 | 76.4 | 81.0 | 78.7 | 65.6 | 76.7 | 70.7 |
| ReOT | 61.8 | 74.8 | 67.7 | 53.6 | 73.5 | 62.0 | 73.9 | 77.7 | 75.7 | 60.9 | 80.7 | 69.4 | 54.8 | 77.1 | 64.1 | 75.4 | 84.7 | 79.7 | 65.1 | 77.1 | 70.4 |

TABLE V: Classification accuracies (%) in the PDA setting on Office-Home and VisDA-2017 datasets.

| Methods | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Mean (Office-Home) | S→R (VisDA-2017) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DANN \[[5](https://arxiv.org/html/2605.05567#bib.bib5)\] | 43.8 | 67.9 | 77.5 | 63.7 | 59.0 | 67.6 | 56.8 | 37.1 | 76.4 | 69.2 | 44.3 | 77.5 | 61.7 | 51.0 |
| PADA \[[10](https://arxiv.org/html/2605.05567#bib.bib10)\] | 52.0 | 67.0 | 78.7 | 52.2 | 53.8 | 59.0 | 52.6 | 43.2 | 78.8 | 73.7 | 56.6 | 77.1 | 62.1 | 53.5 |
| IWAN \[[23](https://arxiv.org/html/2605.05567#bib.bib23)\] | 53.9 | 54.5 | 78.1 | 61.3 | 48.0 | 63.3 | 54.2 | 52.0 | 81.3 | 76.5 | 56.8 | 82.9 | 63.6 | - |
| ETN \[[24](https://arxiv.org/html/2605.05567#bib.bib24)\] | 59.2 | 77.0 | 79.5 | 62.9 | 65.7 | 75.0 | 68.3 | 55.4 | 84.4 | 75.7 | 57.7 | 84.5 | 70.5 | - |
| HAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 53.4 | 72.7 | 80.8 | 64.2 | 65.3 | 71.1 | 66.1 | 51.6 | 78.3 | 72.5 | 55.3 | 79.0 | 67.5 | 65.1 |
| SAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 58.9 | 76.3 | 81.4 | 70.4 | 73.0 | 77.8 | 72.4 | 55.3 | 80.4 | 75.8 | 60.4 | 79.9 | 71.8 | 67.7 |
| BA3US \[[28](https://arxiv.org/html/2605.05567#bib.bib28)\] | 60.6 | 83.2 | 88.4 | 71.8 | 72.8 | 83.4 | 75.5 | 61.6 | 86.5 | 79.3 | 62.8 | 86.1 | 76.0 | - |
| DRCN \[[26](https://arxiv.org/html/2605.05567#bib.bib26)\] | 54.0 | 76.4 | 83.0 | 62.1 | 64.5 | 71.0 | 70.8 | 49.8 | 80.5 | 77.5 | 59.1 | 79.9 | 69.0 | 58.2 |
| DMP \[[52](https://arxiv.org/html/2605.05567#bib.bib52)\] | 59.0 | 81.2 | 86.3 | 68.1 | 72.8 | 78.8 | 71.2 | 57.6 | 84.9 | 77.3 | 61.5 | 82.9 | 73.5 | 72.7 |
| AR \[[16](https://arxiv.org/html/2605.05567#bib.bib16)\] | 67.4 | 85.3 | 90.0 | 77.3 | 70.6 | 85.2 | 79.0 | 64.8 | 89.5 | 80.4 | 66.2 | 86.4 | 78.3 | 88.7 |
| m-POT \[[35](https://arxiv.org/html/2605.05567#bib.bib35)\] | 64.6 | 80.6 | 87.2 | 76.4 | 77.6 | 83.6 | 77.1 | 63.7 | 87.6 | 81.4 | 68.5 | 87.4 | 78.0 | 87.0 |
| IDSP \[[53](https://arxiv.org/html/2605.05567#bib.bib53)\] | 60.8 | 80.8 | 87.3 | 69.3 | 76.0 | 80.2 | 74.7 | 59.2 | 85.3 | 77.8 | 61.3 | 85.7 | 74.9 | - |
| SAN++ \[[54](https://arxiv.org/html/2605.05567#bib.bib54)\] | 61.3 | 81.6 | 88.6 | 72.8 | 76.4 | 81.9 | 74.5 | 57.7 | 87.2 | 79.7 | 63.8 | 86.1 | 76.0 | 63.1 |
| RAN \[[55](https://arxiv.org/html/2605.05567#bib.bib55)\] | 63.3 | 83.1 | 89.0 | 75.0 | 74.5 | 82.9 | 78.0 | 61.2 | 86.7 | 79.9 | 63.5 | 85.0 | 76.8 | 75.1 |
| MUL \[[11](https://arxiv.org/html/2605.05567#bib.bib11)\] | 57.4 | 88.7 | 90.8 | 71.0 | 80.4 | 82.1 | 77.9 | 59.8 | 91.2 | 83.5 | 58.1 | 87.7 | 77.4 | 77.5 |
| DeepOPA \[[27](https://arxiv.org/html/2605.05567#bib.bib27)\] | 51.1 | 65.9 | 76.4 | 65 | 65.76 | 72.8 | 64.2 | 49.3 | 74.6 | 70.4 | 54.9 | 77.3 | 65.6 | - |
| SLM \[[14](https://arxiv.org/html/2605.05567#bib.bib14)\] | 61.1 | 84.0 | 91.4 | 76.5 | 75.0 | 81.8 | 74.6 | 55.6 | 87.8 | 82.3 | 57.8 | 83.5 | 76.0 | 91.7 |
| PLSC \[[15](https://arxiv.org/html/2605.05567#bib.bib15)\] | 63.2 | 85.7 | 91.6 | 72.1 | 80.2 | 82.7 | 78.7 | 56.1 | 86.0 | 76.6 | 58.9 | 87.6 | 76.6 | 74.9 |
| ReOT | 63.7 | 87.4 | 92.3 | 74.2 | 81.4 | 87.1 | 79.2 | 60.8 | 88.0 | 79.2 | 62.8 | 88.5 | 78.7 | 91.8 |

TABLE VI: Classification accuracies (%) in the PDA setting on Image-CLEF dataset.

| Method | I→P | P→I | I→C | C→I | C→P | P→C | Mean |
|---|---|---|---|---|---|---|---|
| DANN \[[5](https://arxiv.org/html/2605.05567#bib.bib5)\] | 78.1 | 86.3 | 91.3 | 84.0 | 72.1 | 90.3 | 83.7 |
| PADA \[[10](https://arxiv.org/html/2605.05567#bib.bib10)\] | 81.7 | 92.1 | 94.6 | 89.8 | 77.7 | 94.1 | 88.3 |
| HAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 79.1 | 87.7 | 93.7 | 90.3 | 77.8 | 94.7 | 87.2 |
| SAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 79.5 | 90.7 | 93.0 | 90.3 | 77.8 | 94.0 | 87.5 |
| DMP \[[52](https://arxiv.org/html/2605.05567#bib.bib52)\] | 82.4 | 94.5 | 96.7 | 94.3 | 78.7 | 96.4 | 90.5 |
| MUL \[[11](https://arxiv.org/html/2605.05567#bib.bib11)\] | 87.5 | 92.2 | 98.1 | 94.6 | 87.6 | 98.5 | 93.1 |
| PLSC \[[15](https://arxiv.org/html/2605.05567#bib.bib15)\] | 85.0 | 97.3 | 98.3 | 96.3 | 78.0 | 99.0 | 92.3 |
| ReOT | 88.1 | 94.3 | 98.1 | 93.8 | 88.0 | 97.9 | 93.4 |

TABLE VII: Classification accuracies (%) in the PDA setting on Office-31 dataset.

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Mean |
|---|---|---|---|---|---|---|---|
| DANN \[[5](https://arxiv.org/html/2605.05567#bib.bib5)\] | 73.6 | 96.3 | 98.7 | 81.5 | 82.8 | 86.1 | 86.5 |
| PADA \[[10](https://arxiv.org/html/2605.05567#bib.bib10)\] | 86.5 | 99.3 | 100 | 82.2 | 92.7 | 95.4 | 92.7 |
| IWAN \[[23](https://arxiv.org/html/2605.05567#bib.bib23)\] | 89.2 | 99.3 | 99.4 | 90.5 | 95.6 | 94.3 | 94.7 |
| ETN \[[24](https://arxiv.org/html/2605.05567#bib.bib24)\] | 94.5 | 100 | 100 | 95.0 | 96.2 | 94.6 | 96.7 |
| HAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 87.5 | 96.7 | 99.2 | 87.3 | 89.2 | 90.7 | 91.7 |
| SAFN \[[51](https://arxiv.org/html/2605.05567#bib.bib51)\] | 87.5 | 96.6 | 99.4 | 89.8 | 92.6 | 92.7 | 93.1 |
| BA3US \[[28](https://arxiv.org/html/2605.05567#bib.bib28)\] | 99.0 | 100 | 98.7 | 99.4 | 94.8 | 95.0 | 97.8 |
| DRCN \[[26](https://arxiv.org/html/2605.05567#bib.bib26)\] | 90.8 | 100 | 100 | 94.3 | 95.2 | 94.8 | 95.9 |
| DMP \[[52](https://arxiv.org/html/2605.05567#bib.bib52)\] | 96.6 | 100 | 100 | 96.4 | 95.1 | 95.4 | 97.2 |
| AR \[[16](https://arxiv.org/html/2605.05567#bib.bib16)\] | 93.5 | 100 | 99.7 | 96.8 | 95.5 | 96.0 | 96.9 |
| IDSP \[[53](https://arxiv.org/html/2605.05567#bib.bib53)\] | 99.7 | 99.7 | 100 | 99.4 | 95.1 | 95.7 | 98.3 |
| SAN++ \[[54](https://arxiv.org/html/2605.05567#bib.bib54)\] | 99.7 | 100 | 100 | 98.1 | 94.1 | 95.5 | 97.9 |
| RAN \[[55](https://arxiv.org/html/2605.05567#bib.bib55)\] | 99.0 | 100 | 100 | 97.7 | 96.3 | 96.2 | 98.2 |
| MUL \[[11](https://arxiv.org/html/2605.05567#bib.bib11)\] | 94.2 | 100 | 100 | 98.5 | 95.6 | 96.3 | 97.5 |
| SLM \[[14](https://arxiv.org/html/2605.05567#bib.bib14)\] | 99.8 | 100 | 99.8 | 98.7 | 96.1 | 95.9 | 98.4 |
| PLSC \[[15](https://arxiv.org/html/2605.05567#bib.bib15)\] | 99.3 | 100 | 100 | 100 | 96.0 | 96.6 | 98.7 |
| ReOT | 99.3 | 99.9 | 100 | 98.5 | 96.1 | 96.5 | 98.4 |

We focus more on H in performance evaluation, since the harmonic mean H is a composite metric of OS\* and UNK.

The results on Image-CLEF are presented in Table [I](https://arxiv.org/html/2605.05567#S4.T1). In terms of OS\*, some methods perform very well. For example, STA achieves the best OS\* on 5 out of all 12 tasks and outperforms ReOT by 1.6% on average OS\*. However, STA relies heavily on an empirical threshold for private class identification, which lacks reliability and precision. As a result, it fails to identify the private samples effectively and obtains significantly lower UNK and H values. In contrast, ReOT utilizes a reliable private class identification approach, which ensures the precision of the identification. Consequently, the trained model can successfully recognize the private samples without degrading the performance on the shared classes. ReOT achieves the best UNK on 7 out of all 12 tasks and outperforms the best baseline by 2.7% on average UNK (90.9%), while maintaining the second-highest average OS\* of 79.7%. Several methods also perform well on both OS\* and UNK, e.g., MRJT and ANNA, which results in good performance in terms of H. Compared with these methods, ReOT achieves the best H on 10 out of all 12 tasks. Moreover, it attains the best average H of 84.4% over the 12 OSDA tasks. It outperforms the state-of-the-art (SOTA) method ANNA by 1.5%, 5.3% and 3.0% on average OS\*, UNK and H, respectively.

Table [III](https://arxiv.org/html/2605.05567#S4.T3) presents the results on Office-31. The performance of ReOT on Office-31 shows similar trends to those on Image-CLEF. Specifically, ReOT achieves the best UNK on 5 out of all 6 tasks on Office-31 and outperforms the best baseline by 3.6% on average UNK (96.6%), while maintaining a high average OS\* of 90.7%. Compared with other SOTA methods that cannot identify the private samples precisely, ReOT also attains the highest H value of 93.4%. These results verify the effectiveness of the proposed private class identification approach in helping the model correctly recognize the target private samples and improving model performance.

Table [IV](https://arxiv.org/html/2605.05567#S4.T4) presents the results on the Office-Home dataset. ReOT achieves an average H-value of 70.3%, which is only lower than the causality-driven method ANNA by 0.4% and outperforms all the other methods. Compared with related OSDA methods, such as MRJT which uses manifold regularization, ReOT improves the average OS\*, UNK, and H values by 1.7%, 2.5%, and 2.3%, respectively. All the above results demonstrate that ReOT reaches a better balance between identifying the private samples and classifying the shared samples than SOTA methods. It outperforms SOTA models by an average of 2.0% to 2.5% per dataset, validating its stability and effectiveness in dealing with domain adaptation tasks with extreme label shift.

### 4.3 Comparative Experiments Under PDA Setting

Table [VI](https://arxiv.org/html/2605.05567#S4.T6) presents the results on Image-CLEF. In terms of mean accuracy, ReOT achieves the best result (93.4%) among the compared methods. Specifically, through precise local structure characterization, ReOT improves on the global manifold structure-based method DMP. It can also be observed that PLSC suffers a serious performance degradation on the C→P task, as it relies on the alignment and separation assumptions to train an SVM for private class identification. With relaxed assumptions, ReOT generally reaches the level of PLSC on all adaptation tasks while improving on PLSC by 10% on the C→P task.

Table [VII](https://arxiv.org/html/2605.05567#S4.T7) presents the results on the Office-31 dataset. In the PDA setting, tasks on Office-31 are relatively simple, i.e., the domain gap on shared classes is relatively small. Therefore, methods that rely on strong assumptions generally achieve high performance (close to 100%). ReOT's mean accuracy is only slightly lower (0.3%) than that of PLSC, and it remains among the top-performing methods. This again validates the stability of ReOT across different environments.

The results on the Office-Home dataset are presented in Table [V](https://arxiv.org/html/2605.05567#S4.T5). In terms of average accuracy, ReOT achieves the best performance (78.7%). Specifically, ReOT achieves the best accuracy on 6 out of all 12 tasks. Compared with recent PDA methods that similarly implement private class identification under strong assumptions (e.g., SLM and PLSC), ReOT improves on them by an average of 2.4%, demonstrating the reliability of the proposed method.

The results on the VisDA-2017 dataset are also presented in Table [V](https://arxiv.org/html/2605.05567#S4.T5). VisDA-2017 is a more challenging dataset, containing larger-scale data and a more complicated cluster structure. It can be observed from the results that most existing methods experience serious performance degradation. For example, PLSC delivers the best performance on Office-31, but its results on VisDA-2017 are poor; MUL attains the second-best performance on Image-CLEF and comparable results on Office-Home, but it also suffers severe performance degradation on VisDA-2017. By contrast, ReOT still provides the best performance on VisDA-2017. The overall results demonstrate that ReOT maintains superior performance on challenging datasets, further validating the reliability and transferability of the model learned by ReOT.

![Refer to caption](https://arxiv.org/html/2605.05567v1/x5.png) (a) W→A (OSDA)
![Refer to caption](https://arxiv.org/html/2605.05567v1/x6.png) (b) W→A (PDA)

Figure 4: Comparison of cross-domain class-conditional discrepancies, including $\mathcal{A}$-distance and H/ACC-value, of different models. (a) OSDA. (b) PDA.

### 4.4 Other Analysis

Ablation Study. To evaluate the effectiveness of different modules, we conduct ablation experiments on three datasets. We present the harmonic accuracy (H) for OSDA and the standard accuracy (ACC) for PDA in Table [II](https://arxiv.org/html/2605.05567#S4.T2). It can be observed that, compared with the baseline model (Row 1), introducing either the transfer module $\mathcal{L}_{rt}$ (Row 2) or the reconstruction module $\mathcal{L}_{br}$ (Row 3) improves the H/ACC across all three datasets. Specifically, $\mathcal{L}_{rt}$ improves the mean H-value and ACC by 2.3% and 8.3%, respectively, while $\mathcal{L}_{br}$ improves the mean H-value and ACC by 2.0% and 5.2%, respectively. This highlights the benefit of each module in OSDA and PDA scenarios. Besides, the transfer module yields the larger improvement in both mean H-value and mean ACC, indicating that invariant representation learning via class-conditional alignment plays a crucial role in both OSDA and PDA. Furthermore, ReOT (Row 4) consistently achieves the best performance on all datasets, which implies that the two modules complement each other and significantly benefit the full model. In conclusion, the experimental results verify the general effectiveness of $\mathcal{L}_{rt}$ and $\mathcal{L}_{br}$ on different datasets.
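The per-module gains quoted above follow directly from the Mean columns of Table II. The short sketch below recomputes them; the dictionary keys are hypothetical labels for the four rows, and the numbers are copied from the table.

```python
# Mean (OSDA H, PDA ACC) per row of Table II; keys are illustrative row labels.
table_ii_mean = {
    "cls": (77.7, 77.5),        # Row 1: baseline with L_cls only
    "cls+rt": (80.0, 85.8),     # Row 2: adds the transfer module L_rt
    "cls+br": (79.7, 82.7),     # Row 3: adds the reconstruction module L_br
    "cls+rt+br": (82.7, 90.2),  # Row 4: full ReOT
}

def gain_over_baseline(row):
    """Improvement (in percentage points) of a row over the Row 1 baseline."""
    base_h, base_acc = table_ii_mean["cls"]
    h, acc = table_ii_mean[row]
    return round(h - base_h, 1), round(acc - base_acc, 1)

gains = {row: gain_over_baseline(row) for row in table_ii_mean if row != "cls"}
print(gains)
```

The full model's gain exceeds that of either single module in both settings, consistent with the statement that the two modules complement each other.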

Class-conditional Discrepancy. To validate the effectiveness of ReOT with the alignment term $\mathcal{L}_{rt}$ for cross-domain class-conditional alignment on shared classes, we conduct experiments to analyze the conditional discrepancies of different models, i.e., ReOT (w/o $\mathcal{L}_{rt}$), ANNA, and ReOT. Here, ReOT (w/o $\mathcal{L}_{rt}$) represents ReOT without the distribution alignment term $\mathcal{L}_{rt}$. As the discrepancy measure, the $\mathcal{A}$-distance \[[56](https://arxiv.org/html/2605.05567#bib.bib56)\] is applied to the class-wise data to estimate the conditional discrepancy on shared classes. The experimental results on the Office-31 task W→A are presented in Figure [4](https://arxiv.org/html/2605.05567#S4.F4), where the class-conditional $\mathcal{A}_{c}$-distance is computed as the mean of the $\mathcal{A}$-distances on all class-wise data. From the results, we can observe that in both OSDA and PDA, the domain discrepancy is large without alignment, and ReOT significantly mitigates the discrepancy by minimizing $\mathcal{L}_{rt}$. Specifically, as shown in Figure [4a](https://arxiv.org/html/2605.05567#S4.F4.sf1), in the OSDA setting, ANNA shows better capacity than ReOT (w/o $\mathcal{L}_{rt}$) in mitigating the domain discrepancy; however, by introducing $\mathcal{L}_{rt}$, ReOT attains a smaller $\mathcal{A}_{c}$-distance with a higher H-value, which implies that ReOT with $\mathcal{L}_{rt}$ learns a better invariant representation without degrading the discriminative ability. A similar trend appears in the PDA setting, as shown in Figure [4b](https://arxiv.org/html/2605.05567#S4.F4.sf2). Overall, the results above validate that the proposed ReOT effectively decreases conditional discrepancy by optimizing $\mathcal{L}_{rt}$.
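The text does not state how the $\mathcal{A}$-distance is estimated. A common choice is the proxy $\mathcal{A}$-distance $2(1-2\epsilon)$, where $\epsilon$ is the held-out error of a classifier trained to separate source from target features; the sketch below assumes that estimator and averages it over classes to obtain an $\mathcal{A}_c$-style value. The function names and error values are hypothetical.

```python
def proxy_a_distance(domain_error):
    """Proxy A-distance 2 * (1 - 2 * err) from a domain classifier's test error."""
    return 2.0 * (1.0 - 2.0 * domain_error)

def class_conditional_a_distance(errors_by_class):
    """Mean proxy A-distance over per-class domain-classifier errors."""
    distances = [proxy_a_distance(err) for err in errors_by_class.values()]
    return sum(distances) / len(distances)

# Hypothetical per-class errors of a source-vs-target classifier on shared classes.
errors = {"class_00": 0.25, "class_01": 0.5}
a_c = class_conditional_a_distance(errors)
print(a_c)
```

An error of 0.5 means the two domains are indistinguishable for that class (distance 0), while an error of 0 gives the maximum distance of 2, so a smaller mean indicates better class-conditional alignment.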

![Refer to caption](https://arxiv.org/html/2605.05567v1/x7.png) (a) W→A (OSDA)
![Refer to caption](https://arxiv.org/html/2605.05567v1/x8.png) (b) Cl→Pr (OSDA)
![Refer to caption](https://arxiv.org/html/2605.05567v1/x9.png) (c) W→A (PDA)
![Refer to caption](https://arxiv.org/html/2605.05567v1/x10.png) (d) Cl→Pr (PDA)

Figure 5: Sensitivity analysis on hyper-parameters $\eta_{1}$ and $\eta_{2}$. (a)-(b) OSDA. (c)-(d) PDA.

Effectiveness of the Private Class Identification. To verify the effectiveness of the proposed private class identification method, namely, that a majority of private class samples can be successfully identified, we conduct experiments on three benchmark datasets. As the measure, we calculate the average ratio of identified private class samples to all private class samples across all transfer tasks for each dataset. The experimental results are presented in Table [VIII](https://arxiv.org/html/2605.05567#S4.T8). We observe that the proposed approach identifies the vast majority of private class samples on the Image-CLEF dataset, achieving a 90.7% identification ratio in OSDA and 95.1% in PDA. Similarly, on the Office-31 dataset, it achieves a 94.5% identification ratio in OSDA and 99.4% in PDA. On the Office-Home dataset, although the identification performance is slightly reduced, the proposed approach still recognizes most private class samples, achieving a 79.5% identification ratio in OSDA and 92.2% in PDA. Overall, our method successfully identifies the majority of private class samples across all three datasets, with an average identification ratio of 88.2% in OSDA and 95.6% in PDA.

To further assess the reliability of the private class identification, we also investigate the behavior of ReOT when there are no private classes. Specifically, on the Office-Home dataset, we remove all private classes and train the model using only the remaining 25 shared classes. The model is then evaluated on the target domain in terms of classification accuracy and false positive rate. The latter is obtained with our private class identification method and measures the proportion of shared class samples mistakenly identified as private. Figure [6](https://arxiv.org/html/2605.05567#S4.F6) shows the result on the task Cl→Pr: ReOT achieves a high accuracy of about 82% while keeping the false positive rate consistently low throughout training. This behavior can be explained by the fact that, without private classes, shared samples naturally cluster around their class centers. Consequently, the transport mass within local regions becomes approximately uniform, resulting in consistently low identification scores and thus preventing false identification as private. These results indicate that even in the absence of private classes, ReOT rarely misidentifies shared class samples as private, validating the reliability of the proposed method.
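The false positive rate used in this experiment can be sketched in the same style. The function name and toy flags are hypothetical; the sketch only assumes one boolean "flagged as private" decision per target sample.

```python
def false_positive_rate(is_private, flagged_private):
    """Share of shared-class samples that were mistakenly flagged as private."""
    flags_on_shared = [f for p, f in zip(is_private, flagged_private) if not p]
    if not flags_on_shared:
        raise ValueError("no shared-class samples present")
    return sum(flags_on_shared) / len(flags_on_shared)


# Toy setting without private classes: every target sample is a shared-class sample.
toy_is_private = [False] * 8
toy_flagged = [False, False, True, False, False, False, False, False]
toy_fpr = false_positive_rate(toy_is_private, toy_flagged)
```

When no private classes exist, the identification ratio is undefined because its denominator is zero, so the false positive rate is the natural quantity to report in this setting.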

TABLE VIII: The ratio (%) of identified private class samples to all private class samples using the proposed private class identification method.

| Setting | Image-CLEF | Office-31 | Office-Home | Mean |
| --- | --- | --- | --- | --- |
| OSDA | 90.7 | 94.5 | 79.5 | 88.2 |
| PDA | 95.1 | 99.4 | 92.2 | 95.6 |

![Refer to caption](https://arxiv.org/html/2605.05567v1/x11.png)

Figure 6: Model performance evolution across epochs on the Office-Home task Cl→Pr. False positive rate denotes the proportion of shared-class samples that are mistakenly identified as private class samples.

Hyper-parameter Analysis. To investigate the sensitivity of the model w.r.t. hyper-parameters $\eta_1$ and $\eta_2$, we conduct experiments on Office-31 and Office-Home datasets. For the OSDA setting, both the values of $\eta_1$ and $\eta_2$ are selected from $\{0.5, 0.75, 1, 1.25, 1.5\}$. For the PDA setting, the values of $\eta_1$ and $\eta_2$ are searched from $\{0.1, 0.2, 0.3, 0.4, 0.5\}$ and $\{3, 3.25, 3.5, 3.75, 4\}$, respectively. The 3-D grid visualizations for the results are presented in Figure [5](https://arxiv.org/html/2605.05567#S4.F5), where Office-31 task W→A and Office-Home task Cl→Pr are tested. It can be observed that ReOT is generally stable for different choices of parameters $\eta_1$ and $\eta_2$. Besides, the best performance is typically achieved with a larger value of $\eta_1$, further indicating the crucial role of invariant representation learning in both OSDA and PDA. In conclusion, the results above demonstrate that ReOT is generally robust to different settings of hyper-parameters.
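The search grids described above can be enumerated as in the sketch below. The grid values are taken from the text; the helper names and the `dummy_evaluate` scoring function are hypothetical placeholders, since a real run would train ReOT for each pair and return the evaluation metric.

```python
from itertools import product

# Candidate values for eta_1 and eta_2 as listed in the text.
OSDA_GRID = {"eta1": [0.5, 0.75, 1, 1.25, 1.5], "eta2": [0.5, 0.75, 1, 1.25, 1.5]}
PDA_GRID = {"eta1": [0.1, 0.2, 0.3, 0.4, 0.5], "eta2": [3, 3.25, 3.5, 3.75, 4]}


def grid_points(grid):
    """All (eta1, eta2) combinations of a grid, as a list of dicts."""
    return [{"eta1": a, "eta2": b} for a, b in product(grid["eta1"], grid["eta2"])]


def best_setting(grid, evaluate):
    """Grid point with the highest score under the supplied evaluate function."""
    return max(grid_points(grid), key=lambda p: evaluate(p["eta1"], p["eta2"]))


def dummy_evaluate(eta1, eta2):
    """Placeholder score peaking at (1, 1); stands in for a full training run."""
    return -((eta1 - 1.0) ** 2 + (eta2 - 1.0) ** 2)


osda_points = grid_points(OSDA_GRID)
pda_points = grid_points(PDA_GRID)
best = best_setting(OSDA_GRID, dummy_evaluate)
```

Each setting yields a 5 × 5 grid, that is, 25 training runs per task, which is the surface plotted in each panel of Figure 5.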

![Refer to caption](https://arxiv.org/html/2605.05567v1/x12.png)\(a\)Source\-only
![Refer to caption](https://arxiv.org/html/2605.05567v1/x13.png)\(b\)STA
![Refer to caption](https://arxiv.org/html/2605.05567v1/x14.png)\(c\)ANNA
![Refer to caption](https://arxiv.org/html/2605.05567v1/x15.png)\(d\)ReOT

Figure 7: t-SNE \[[57](https://arxiv.org/html/2605.05567#bib.bib57)\] visualizations of the representations learned by different OSDA methods on the Office-31 task W→A. In the diagrams, "∘" denotes the source domain and "+" denotes the target domain; different colors represent different classes, and gray represents the private class.

![Refer to caption](https://arxiv.org/html/2605.05567v1/x16.png)(a) Source-only
![Refer to caption](https://arxiv.org/html/2605.05567v1/x17.png)\(b\)AR
![Refer to caption](https://arxiv.org/html/2605.05567v1/x18.png)\(c\)SLM
![Refer to caption](https://arxiv.org/html/2605.05567v1/x19.png)\(d\)ReOT

Figure 8: t-SNE \[[57](https://arxiv.org/html/2605.05567#bib.bib57)\] visualizations of the representations learned by different PDA methods on the Office-31 task W→A. In the diagrams, "∘" denotes the source domain and "+" denotes the target domain; different colors represent different classes, and gray represents the private class.

![Refer to caption](https://arxiv.org/html/2605.05567v1/x20.png)(a) OSDA
![Refer to caption](https://arxiv.org/html/2605.05567v1/x21.png)\(b\)PDA

Figure 9: Impact of cost function selection in optimal transport on model performance. "ReOT (w/ $c_{\mathrm{dis}}$)" indicates using a discriminator-based cost function, while "ReOT" uses the default squared Euclidean cost. (a) OSDA setting. (b) PDA setting.

Feature Visualization. As shown in Figures [7](https://arxiv.org/html/2605.05567#S4.F7) and [8](https://arxiv.org/html/2605.05567#S4.F8), we conduct t-SNE feature visualization and comparison for both OSDA and PDA. In the OSDA setting, as shown in Figure [7a](https://arxiv.org/html/2605.05567#S4.F7.sf1), the target private class in gray is mixed with the shared classes before adaptation. Although STA and ANNA improve the discrimination of features, there is still a large overlap between private and shared classes, and the intra-class distance is not sufficiently small, as shown in Figures [7b](https://arxiv.org/html/2605.05567#S4.F7.sf2) and [7c](https://arxiv.org/html/2605.05567#S4.F7.sf3). Compared with these methods, ReOT separates the private and shared classes more completely and yields well-aligned cluster structures, as shown in Figure [7d](https://arxiv.org/html/2605.05567#S4.F7.sf4). Similarly, in the PDA setting, although AR (Figure [8b](https://arxiv.org/html/2605.05567#S4.F8.sf2)) and SLM (Figure [8c](https://arxiv.org/html/2605.05567#S4.F8.sf3)) learn improved representations compared with Source-only (Figure [8a](https://arxiv.org/html/2605.05567#S4.F8.sf1)), ReOT achieves better intra-class compactness and inter-class separability, as shown in Figure [8d](https://arxiv.org/html/2605.05567#S4.F8.sf4). These results demonstrate that ReOT learns a better representation space, and further indicate the effectiveness of ReOT in mitigating the cross-domain class-conditional distribution discrepancy.

Impact of the Cost Function. To examine the robustness of ReOT with respect to the choice of the cost function in OT, we conduct additional ablation studies on the Office-31 and Office-Home datasets. Throughout our experiments, the default cost function $c$ is the squared Euclidean distance. Here, we compare it with a discriminator-based cost function $c_{dis}$. The discriminator $dis(\cdot)$ is a network that outputs the probability that an input feature $\bm{z}$ belongs to the source domain. It is pre-trained with a binary cross-entropy loss on both source and target domain samples, and its parameters are frozen during the subsequent ReOT training. Intuitively, a source feature $\bm{z}$ and a target feature $\bm{z}^{\prime}$ that are easily confused by the discriminator should have a low transport cost. Based on this, we define $c_{dis}$ as:

$$c_{dis}(\bm{z},\bm{z}^{\prime}) = |dis(\bm{z}) - 0.5| + |dis(\bm{z}^{\prime}) - 0.5|.$$

Figure [9](https://arxiv.org/html/2605.05567#S4.F9) presents the results for the Office-31 task W→A and the Office-Home task Cl→Pr under both OSDA and PDA settings. The results show that ReOT with the default squared Euclidean cost consistently achieves superior performance across all settings. For example, on the W→A task under the OSDA setting, ReOT outperforms ReOT (w/ $c_{dis}$) by approximately 1.7% on H. A similar trend is observed in the PDA setting, where ReOT also demonstrates better performance. Despite these performance differences, ReOT (w/ $c_{dis}$) still delivers competitive results. This indicates that the proposed method is relatively insensitive to the selection of the cost function.
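The two cost functions compared above can be sketched as follows. This is an illustrative, dependency-free version rather than the authors' implementation: `toy_dis` is a hypothetical stand-in for the frozen, pre-trained discriminator, and the feature vectors are made up.

```python
import math


def squared_euclidean_cost(z, z_prime):
    """Default cost c: squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(z, z_prime))


def discriminator_cost(dis, z, z_prime):
    """c_dis(z, z') = |dis(z) - 0.5| + |dis(z') - 0.5| for a frozen discriminator."""
    return abs(dis(z) - 0.5) + abs(dis(z_prime) - 0.5)


def cost_matrix(cost, source, target):
    """Pairwise costs with one row per source feature, one column per target feature."""
    return [[cost(zs, zt) for zt in target] for zs in source]


def toy_dis(z):
    """Hypothetical discriminator: logistic function of the first coordinate."""
    return 1.0 / (1.0 + math.exp(-z[0]))


source = [[0.0, 1.0], [2.0, 0.0]]
target = [[0.0, -1.0]]
c_euc = cost_matrix(squared_euclidean_cost, source, target)
c_dis = cost_matrix(lambda a, b: discriminator_cost(toy_dis, a, b), source, target)
```

In this toy example the first source feature and the target feature both receive a discriminator output of 0.5, so their $c_{dis}$ cost is zero even though their squared Euclidean distance is 4. This illustrates that $c_{dis}$ depends only on how domain-ambiguous each feature is, not on how close the two features are to each other.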

## 5Conclusion

In this paper, we addressed the challenge of extreme label shift in domain adaptation by proposing a novel locality-aware private class identification method, which is realized as a score function on masked optimal transport mass. This method relaxes the strong assumptions on which existing methods are based, and its effectiveness is established from a theoretical perspective, highlighting its strong ability to distinguish between shared and private class samples. Building on this foundation, we introduced the reliable OT-based (ReOT) method for practical applications. ReOT integrates risk minimization and employs masked transport to learn invariant representations with separated cluster structures. In addition, we provided a generalization bound on the target risk for the severe label shift scenario. This upper bound is shown to be tight and can be minimized by ReOT. Extensive experiments on benchmark datasets validated the effectiveness of ReOT, indicating its reliable and superior performance in extreme label shift scenarios.

While ReOT demonstrates strong performance in both OSDA and PDA, the current method focuses on the single-source single-target scenario and assumes access to labeled source data. A promising direction for future work is to extend ReOT to multi-source \[[58](https://arxiv.org/html/2605.05567#bib.bib58)\] and multi-target \[[59](https://arxiv.org/html/2605.05567#bib.bib59)\] scenarios, as well as to the source-free setting \[[60](https://arxiv.org/html/2605.05567#bib.bib60)\]. Potential strategies include generalizing the transport plan to a multi-marginal formulation \[[61](https://arxiv.org/html/2605.05567#bib.bib61)\] for multiple domains, or using source class prototypes distilled from a pre-trained source model in the source-free scenario \[[60](https://arxiv.org/html/2605.05567#bib.bib60)\], while carefully addressing the associated computational and theoretical challenges.

## References

- \[1\]Y\. Liu, Z\. Zhou, and B\. Sun, “COT: Unsupervised domain adaptation with clustering and optimal transport,” in*CVPR*, 2023, pp\. 19 998–20 007\.
- \[2\]Y\. E\. Kim, Y\. W\. Lee, and S\. W\. Lee, “LC\-MSM: Language\-conditioned masked segmentation model for unsupervised domain adaptation,”*Pattern Recognition*, vol\. 148, p\. 110201, 2024\.
- \[3\]H\. Wang, L\. Zheng, H\. Zhao*et al\.*, “Unsupervised domain adaptation with class\-aware memory alignment,”*IEEE TNNLS*, 2024, preprint\.
- \[4\]M\. Long, H\. Zhu, J\. Wang*et al\.*, “Unsupervised domain adaptation with residual transfer networks,” in*NeurIPS*, vol\. 29, 2016\.
- \[5\]Y\. Ganin, E\. Ustinova, H\. Ajakan*et al\.*, “Domain\-adversarial training of neural networks,”*JMLR*, vol\. 17, no\. 59, pp\. 1–35, 2016\.
- \[6\]M\. Long, Z\. Cao, J\. Wang*et al\.*, “Conditional adversarial domain adaptation,” in*NeurIPS*, vol\. 31, 2018\.
- \[7\]K\. Saito, S\. Yamamoto, Y\. Ushiku*et al\.*, “Open set domain adaptation by backpropagation,” in*ECCV*, 2018, pp\. 153–168\.
- \[8\]S\. Bucci, M\. R\. Loghmani, and T\. Tommasi, “On the effectiveness of image rotation for open set domain adaptation,” in*ECCV*, 2020, pp\. 422–438\.
- \[9\]W\. Li, J\. Liu, B\. Han*et al\.*, “Adjustment and alignment for unbiased open set domain adaptation,” in*CVPR*, 2023, pp\. 24 110–24 119\.
- \[10\]Z\. Cao, L\. Ma, M\. Long*et al\.*, “Partial adversarial domain adaptation,” in*ECCV*, 2018, pp\. 135–150\.
- \[11\]Y\. W\. Luo and C\. X\. Ren, “Generalized label shift correction via minimum uncertainty principle: Theory and algorithm,”*arXiv preprint arXiv:2202\.13043*, 2022\.
- \[12\]C\. J\. Guo, C\. X\. Ren, Y\. W\. Luo*et al\.*, “Partial domain adaptation via importance sampling\-based shift correction,”*IEEE TIP*, vol\. 34, pp\. 5009–5022, 2025\.
- \[13\]H\. Liu, Z\. Cao, M\. Long*et al\.*, “Separate to adapt: Open set domain adaptation via progressive separation,” in*CVPR*, 2019, pp\. 2927–2936\.
- \[14\]A\. Sahoo, R\. Panda, R\. Feris*et al\.*, “Select, label, and mix: Learning discriminative invariant feature representations for partial domain adaptation,” in*WACV*, 2023, pp\. 4210–4219\.
- \[15\]L\. Tian, Y\. Tang, and W\. Zhang, “Partial domain adaptation by progressive sample learning of shared classes,”*NPL*, vol\. 55, 2023\.
- \[16\]X\. Gu, X\. Yu, J\. Sun*et al\.*, “Adversarial reweighting for partial domain adaptation,” in*NeurIPS*, vol\. 34, 2021, pp\. 14 860–14 872\.
- \[17\]J\. Liu, H\. He, M\. Liu*et al\.*, “Manifold regularized joint transfer for open set domain adaptation,”*IEEE TMM*, 2023\.
- \[18\]Y\. W\. Luo and C\. X\. Ren, “MOT: Masked optimal transport for partial domain adaptation,” in*CVPR*, 2023, pp\. 3531–3540\.
- \[19\]Z\. Fang, J\. Lu, F\. Liu*et al\.*, “Open set domain adaptation: Theoretical bound and algorithm,”*IEEE TNNLS*, vol\. 32, no\. 10, pp\. 4309–4322, 2020\.
- \[20\]J\. Jang, B\. Na, D\. H\. Shin*et al\.*, “Unknown\-aware domain adversarial learning for open\-set domain adaptation,” in*NeurIPS*, vol\. 35, 2022, pp\. 16 755–16 767\.
- \[21\]Y\. Luo, Z\. Wang, Z\. Huang*et al\.*, “Progressive graph learning for open\-set domain adaptation,” in*ICML*, 2020, pp\. 6468–6478\.
- \[22\]Z\. X\. Huang and C\. X\. Ren, “Rethinking correlation learning via label prior for open set domain adaptation,” in*IJCAI*, 2024\.
- \[23\]J\. Zhang, Z\. Ding, W\. Li*et al\.*, “Importance weighted adversarial nets for partial domain adaptation,” in*CVPR*, 2018, pp\. 8156–8164\.
- \[24\]Z\. Cao, K\. You, M\. Long*et al\.*, “Learning to transfer examples for partial domain adaptation,” in*CVPR*, 2019, pp\. 2985–2994\.
- \[25\]J\. Chen, X\. Wu, L\. Duan*et al\.*, “Domain adversarial reinforcement learning for partial domain adaptation,”*IEEE TNNLS*, vol\. 33, no\. 2, pp\. 539–553, 2022\.
- \[26\]S\. Li, C\. H\. Liu, Q\. Lin*et al\.*, “Deep residual correction network for partial domain adaptation,”*IEEE TPAMI*, vol\. 43, no\. 7, pp\. 2329–2344, 2020\.
- \[27\]K\. Thopalli, R\. Anirudh, P\. Turaga*et al\.*, “The surprising effectiveness of deep orthogonal procrustes alignment in unsupervised domain adaptation,”*IEEE Access*, vol\. 11, pp\. 12 858–12 869, 2023\.
- \[28\]J\. Liang, Y\. Wang, D\. Hu*et al\.*, “A balanced and uncertainty\-aware approach for partial domain adaptation,” in*ECCV*, 2020, pp\. 123–140\.
- \[29\]C\. Yang, Y\. M\. Cheung, J\. Ding*et al\.*, “Contrastive learning assisted\-alignment for partial domain adaptation,”*IEEE TNNLS*, vol\. 34, no\. 10, pp\. 7621–7634, 2023\.
- \[30\]Y\. W\. Luo and C\. X\. Ren, “Conditional bures metric for domain adaptation,” in*CVPR*, 2021, pp\. 13 989–13 998\.
- \[31\]N\. Courty, R\. Flamary, D\. Tuia, and A\. Rakotomamonjy, “Optimal transport for domain adaptation,”*IEEE TPAMI*, vol\. 39, no\. 9, pp\. 1853–1865, September 2017\.
- \[32\]M\. Li, Y\. M\. Zhai, Y\. W\. Luo*et al\.*, “Enhanced transport distance for unsupervised domain adaptation,” in*CVPR*, 2020, pp\. 13 936–13 944\.
- \[33\]N\. Courty, R\. Flamary, A\. Habrard*et al\.*, “Joint distribution optimal transportation for domain adaptation,” in*NeurIPS*, vol\. 30, 2017\.
- \[34\]K\. Fatras, T\. Séjourné, R\. Flamary*et al\.*, “Unbalanced minibatch optimal transport; applications to domain adaptation,” in*ICML*, 2021, pp\. 3186–3197\.
- \[35\]K\. Nguyen, D\. Nguyen, T\. Pham*et al\.*, “Improving mini\-batch optimal transport via partial transportation,” in*ICML*, 2022, pp\. 16 656–16 690\.
- \[36\]C\. X\. Ren, Y\. W\. Luo, and D\. Q\. Dai, “Buresnet: Conditional bures metric for transferable representation learning,”*IEEE TPAMI*, vol\. 45, no\. 4, pp\. 4198–4213, 2023\.
- \[37\]Y\. Wang, C\. X\. Ren, Y\. M\. Zhai*et al\.*, “Probability\-polarized optimal transport for unsupervised domain adaptation,” in*AAAI*, vol\. 38, no\. 14, 2024, pp\. 15 653–15 661\.
- \[38\]J\. Zhang, X\. Xiao, L\.\-K\. Huang*et al\.*, “Fine\-tuning graph neural networks via graph topology induced optimal transport,” in*IJCAI*, 2022, pp\. 3730–3736\.
- \[39\]X\. Gu, Y\. Yang, W\. Zeng, J\. Sun, and Z\. Xu, “Keypoint\-guided optimal transport with applications in heterogeneous domain adaptation,” in*NeurIPS*, vol\. 35, 2022, pp\. 14 972–14 985\.
- \[40\]M\. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in*NeurIPS*, 2013, pp\. 2292–2300\.
- \[41\]L\. Zhong, Z\. Fang, F\. Liu*et al\.*, “Bridging the theoretical bound and deep algorithms for open set domain adaptation,”*IEEE TNNLS*, vol\. 34, no\. 8, pp\. 3859–3873, 2021\.
- \[42\]H\. Zhao, R\. T\. Des Combes, K\. Zhang*et al\.*, “On learning invariant representations for domain adaptation,” in*ICML*, 2019, pp\. 7523–7532\.
- \[43\]B\. Li, Y\. Wang, S\. Zhang*et al\.*, “Learning invariant representations and risks for semi\-supervised domain adaptation,” in*CVPR*, 2021, pp\. 1104–1113\.
- \[44\]R\. Tachet des Combes, H\. Zhao, Y\.\-X\. Wang*et al\.*, “Domain adaptation with conditional distribution matching and generalized label shift,” in*NeurIPS*, vol\. 33, 2020, pp\. 19 276–19 289\.
- \[45\]B\. Caputo, H\. Müller, J\. Martinez Gomez*et al\.*, “ImageCLEF 2014: Overview and analysis of the results,” in*International Conference of the Cross\-Language Evaluation Forum for European Languages*, 2014, pp\. 192–211\.
- \[46\]K\. Saenko, B\. Kulis, M\. Fritz*et al\.*, “Adapting visual category models to new domains,” in*ECCV*, 2010, pp\. 213–226\.
- \[47\]H\. Venkateswara, J\. Eusebio, S\. Chakraborty*et al\.*, “Deep hashing network for unsupervised domain adaptation,” in*CVPR*, 2017, pp\. 5018–5027\.
- \[48\]X\. Peng, B\. Usman, N\. Kaushik*et al\.*, “VisDA: The visual domain adaptation challenge,”*arXiv preprint arXiv:1710\.06924*, 2017\.
- \[49\]K\. He, X\. Zhang, S\. Ren*et al\.*, “Deep residual learning for image recognition,” in*CVPR*, 2016, pp\. 770–778\.
- \[50\]J\. Deng, W\. Dong, R\. Socher*et al\.*, “ImageNet: A large\-scale hierarchical image database,” in*CVPR*, 2009, pp\. 248–255\.
- \[51\]R\. Xu, G\. Li, J\. Yang*et al\.*, “Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation,” in*ICCV*, 2019, pp\. 1426–1435\.
- \[52\]Y\. W\. Luo, C\. X\. Ren, D\. Q\. Dai*et al\.*, “Unsupervised domain adaptation via discriminative manifold propagation,”*IEEE TPAMI*, vol\. 44, no\. 3, pp\. 1653–1669, 2020\.
- \[53\]W\. Li and S\. Chen, “Partial domain adaptation without domain alignment,”*IEEE TPAMI*, 2022\.
- \[54\]Z\. Cao, K\. You, Z\. Zhang*et al\.*, “From big to small: Adaptive learning to partial\-set domains,”*IEEE TPAMI*, vol\. 45, no\. 2, pp\. 1766–1780, 2022\.
- \[55\]K\. Wu, M\. Wu, Z\. Chen*et al\.*, “Reinforced adaptation network for partial domain adaptation,”*IEEE TCSVT*, 2022\.
- \[56\]S\. Ben\-David, J\. Blitzer, K\. Crammer*et al\.*, “A theory of learning from different domains,”*Machine Learning*, vol\. 79, pp\. 151–175, 2010\.
- \[57\]L\. Van der Maaten and G\. Hinton, “Visualizing data using t\-SNE,”*JMLR*, vol\. 9, no\. 11, 2008\.
- \[58\]S\. Zhao, B\. Li, P\. Xu*et al\.*, “Multi\-source domain adaptation in the deep learning era: A systematic survey,”*arXiv preprint arXiv:2002\.12169*, 2020\.
- \[59\]T\. Isobe, X\. Jia, S\. Chen*et al\.*, “Multi\-target domain adaptation with collaborative consistency learning,” in*CVPR*, 2021, pp\. 8187–8196\.
- \[60\]J\. Liang, D\. Hu, and J\. Feng, “Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation,” in*ICML*, 2020, pp\. 6028–6039\.
- \[61\]F\. Beier, J\. von Lindheim, S\. Neumayer*et al\.*, “Unbalanced multi\-marginal optimal transport,”*Journal of Mathematical Imaging and Vision*, vol\. 65, no\. 3, pp\. 394–413, 2023\.
