Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios

Summary

Proposes a novel meta-learning strategy called MEDIC for open set domain generalization, which uses implicit gradient matching across domain and class splits to achieve better boundaries. Experiments show state-of-the-art performance.

arXiv:2606.23758v1 Announce Type: new Abstract: Domain generalization learns from multiple source domains to generalize to unseen target domains. However, it often neglects the realistic case of label mismatch between source and target. Open set domain generalization is then proposed to recognize unseen classes in unseen domains. A simple approach trains one-vs-all classifiers to separate each class and detect outliers as unknown. Yet, the imbalance between few positive samples and many negative samples skews the decision boundary towards the positive ones, leading the model to over-reject out-of-distribution data, even from known classes in unseen domains. In this paper, we propose a novel meta-learning strategy called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers implicit gradient matching towards inter-domain and inter-class task splits simultaneously to find optimal boundaries balanced for both domains and classes. Experimental results show that MEDIC not only outperforms prior methods in open set scenarios, but also maintains competitive close set generalization ability.

# Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios
Source: [https://arxiv.org/html/2606.23758](https://arxiv.org/html/2606.23758)
Xiran Wang, Jian Zhang, Lei Qi, Yang Gao, Yinghuan Shi

The corresponding author is Yinghuan Shi\. Xiran Wang, Jian Zhang, Yang Gao and Yinghuan Shi are with the State Key Laboratory for Novel Software Technology, Nanjing University, China\. Lei Qi is with the School of Computer Science and Engineering, Southeast University, China\. This work was supported by NSFC Project \(62536005, 62192783, 62506162\), Jiangsu Science and Technology Project \(BF2025061, BK20251241\), Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(JYB2025XDXM118\), “111 Center” \(B26023\) and Fundamental Research Funds for the Central Universities \(KG202508\)\.

###### Abstract

Domain generalization learns from multiple source domains to generalize to unseen target domains\. However, it often neglects the realistic case of label mismatch between source and target\. Open set domain generalization is then proposed to recognize unseen classes in unseen domains\. A simple approach trains one\-vs\-all classifiers to separate each class and detect outliers as unknown\. Yet, the imbalance between few positive samples and many negative samples skews the decision boundary towards the positive ones, leading the model to over\-reject out\-of\-distribution data, even from known classes in unseen domains\. In this paper, we propose a novel meta\-learning strategy called dualistic MEta\-learning with joint DomaIn\-Class matching \(MEDIC\), which considers implicit gradient matching towards inter\-domain and inter\-class task splits simultaneously to find optimal boundaries balanced for both domains and classes\. Experimental results show that MEDIC not only outperforms prior methods in open set scenarios, but also maintains competitive close set generalization ability\. Our code is available at [this link](https://github.com/zzwdx/MEDIC-plus)\.

## I Introduction

Deep neural networks have achieved enormous success in a wide range of computer vision tasks, usually assuming that the training and test samples are drawn from the same data distribution and label space\. However, real\-world application scenarios often introduce unpredictability, potentially placing the model at risk of performance degradation when the above constraints are not satisfied \[[48](https://arxiv.org/html/2606.23758#bib.bib6)\]\. Domain generalization \(DG\) \[[87](https://arxiv.org/html/2606.23758#bib.bib12)\] is then motivated as a more realistic setting to deal with data distribution shift, which refers to using multiple source domains to obtain a model with the generalization ability that can be directly applied to arbitrary unseen target domains\.

Most current domain generalization research \[[46](https://arxiv.org/html/2606.23758#bib.bib2),[47](https://arxiv.org/html/2606.23758#bib.bib17),[99](https://arxiv.org/html/2606.23758#bib.bib7),[32](https://arxiv.org/html/2606.23758#bib.bib103)\] is based on the assumption of close set recognition, *i\.e\.*, the classes of the source domains are consistent with those of the target domains\. However, in practice, the deployed model is often exposed to some new classes that have never been encountered during the training phase \[[74](https://arxiv.org/html/2606.23758#bib.bib18)\]\. For example, in medical imaging, some diseases are so rare \[[26](https://arxiv.org/html/2606.23758#bib.bib107)\] that obtaining their training samples is unrealistic\. In close set classification, objects are forced to be assigned to a known class, which introduces potential risks to the model’s robustness and security\. To mitigate this issue, it is essential to explore a more practical setting called open set domain generalization \(OSDG\), which aims to recognize unknown classes while maintaining the original classification accuracy of known classes\.

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/bias.jpg) Figure 1: An example for the variation of decision boundaries of a *one\-vs\-all* classifier in open set domain generalization\.

In open set domain generalization \[[77](https://arxiv.org/html/2606.23758#bib.bib4),[38](https://arxiv.org/html/2606.23758#bib.bib23)\], the key is to address both domain shift and category shift simultaneously\. However, traditional open set recognition models are not easily applicable for domain generalization tasks because they tend to generate biased decision boundaries, *i\.e\.*, only modeling the training data while neglecting the out\-of\-distribution samples \[[29](https://arxiv.org/html/2606.23758#bib.bib38),[82](https://arxiv.org/html/2606.23758#bib.bib36)\]\. For example, the multi\-binary classifier \[[73](https://arxiv.org/html/2606.23758#bib.bib1),[53](https://arxiv.org/html/2606.23758#bib.bib24)\] consists of multiple one\-vs\-all binary classifiers to define a decision boundary for each known class\. If a given sample is classified as negative by all sub\-classifiers, it is considered to have a high probability of belonging to unknown classes\. As shown in [Fig\.1](https://arxiv.org/html/2606.23758#S1.F1), the limited data distribution of the positive samples \(*i\.e\.*, from only one corresponding class\) and the more diverse distribution of the negative samples \(*i\.e\.*, from all other classes\) can increase the risk of predicting inputs as positive rather than negative\. This causes the decision boundary to be asymmetrically biased towards the positive samples, potentially rejecting all out\-of\-distribution samples as unknown and misclassifying known classes in the unseen target domain\.
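The rejection rule of a multi-binary classifier described above can be sketched in a few lines. This is only an illustration, not the inference procedure of the paper: the function name `predict_open_set`, the class names, and the 0.5 threshold are assumptions made for the example, and each score is taken to be the positive-class probability of one one-vs-all head.

```python
def predict_open_set(pos_probs, class_names, threshold=0.5):
    """Pick the most confident one-vs-all head, or reject as 'unknown'.

    pos_probs[k] is the (assumed) positive-class probability of the k-th
    binary head. If even the best head stays below the threshold, every
    head has voted "negative" and the sample is treated as unknown.
    """
    best = max(range(len(pos_probs)), key=lambda k: pos_probs[k])
    if pos_probs[best] < threshold:
        return "unknown"
    return class_names[best]


names = ["cat", "dog", "bird"]           # hypothetical known classes
known = predict_open_set([0.1, 0.8, 0.3], names)
rejected = predict_open_set([0.1, 0.2, 0.3], names)
```

With a boundary drawn too tightly around the positives, known-class samples from an unseen domain would fall into the second case, which is the over-rejection the paragraph describes.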

To establish a balanced decision boundary across domains and classes, our attention turns to meta\-learning \[[35](https://arxiv.org/html/2606.23758#bib.bib71)\], a simple yet effective approach for handling domain shift\. Prior work on meta\-learning\-based domain generalization \[[46](https://arxiv.org/html/2606.23758#bib.bib2),[76](https://arxiv.org/html/2606.23758#bib.bib51)\] seeks an optimal balance among domains by matching gradients across tasks sampled from them\. This domain\-wise meta\-learning can mitigate the risk of exhibiting excessive bias towards particular domains\. As shown in [Fig\.2](https://arxiv.org/html/2606.23758#S1.F2), the rationale is that if the angles between the gradients are small, which implies that optimizing one task does not interfere with other tasks, then it is possible to achieve a win\-win outcome by optimizing their combined gradient\. In contrast, large angles between gradients indicate conflicting objectives, where updating one task can adversely impact the optimization procedure of others\.
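The angle argument can be made concrete with a toy example. The sketch below assumes each task has a quadratic loss 0.5·‖θ − c‖² with a hypothetical center c (these losses and the helper names are illustrative, not taken from the paper) and compares the cosine between task gradients at a shared parameter vector.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def grad_quadratic(theta, center):
    """Gradient of the toy loss 0.5 * ||theta - center||^2."""
    return [t - c for t, c in zip(theta, center)]


theta = [0.0, 0.0]
g_a = grad_quadratic(theta, [1.0, 1.0])    # task A
g_b = grad_quadratic(theta, [1.0, 0.5])    # task B, optimum near A's
g_c = grad_quadratic(theta, [-1.0, -1.0])  # task C, optimum opposite to A's

aligned = cosine(g_a, g_b)    # small angle: a shared step helps both tasks
conflict = cosine(g_a, g_c)   # large angle: a step for A hurts C
```

Tasks A and B have nearly parallel gradients, so following their sum improves both, whereas A and C pull in opposite directions.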

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/gradient.jpg) Figure 2: Previous research \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] has demonstrated that a large angle between the gradients of two tasks introduces contradictions in optimization\.

We propose to learn from positive and negative samples in such a balanced way, placing the decision boundary in the middle zone between them and attaining a more rational separation between known and unknown classes in the target domain\. Concretely, we introduce a novel meta\-learning strategy called *dualistic MEta\-learning with joint DomaIn\-Class matching \(MEDIC\)*\. Instead of simply adding extra iterations for inter\-domain or inter\-class meta\-learning, we take a step further to achieve gradient matching between domains and classes simultaneously\. For tasks selected from different domains, we additionally split and recombine them at the category level to construct inter\-class pairs\. By matching the gradients of these recombined tasks, we expect the model to not only generalize well across domains, but also to grasp a more precise understanding of class\-wise relationships, which is beneficial for both close set generalization and open set recognition\. This article extends our original work \[[89](https://arxiv.org/html/2606.23758#bib.bib75)\] from an initial insight to a generalized framework with accompanying theory and experiments\.

- We investigate inter\-class gradient matching for open set domain generalization\. The method is extended from a special case \(*i\.e\.*, two steps per inner loop\) to a general one \(*i\.e\.*, multiple steps per inner loop\) with an integrated task scheduling strategy for hard class pairs\.
- We provide a more precise theoretical analysis of step\-wise gradient matching than the original proofs \[[61](https://arxiv.org/html/2606.23758#bib.bib76),[76](https://arxiv.org/html/2606.23758#bib.bib51)\] by eliminating their reliance on mathematical expectations\. Our strategy can achieve task\-wise gradient matching close to its maximum value with fewer steps\.
- Extensive experiments show that our method not only outperforms several state\-of\-the\-art methods in the open set scenario, but also maintains remarkable accuracy in the conventional domain generalization setting\.

## II Related Work

### II\-A Domain Generalization

Domain Generalization \(DG\) is intended to train a model on multiple source domains that can generalize to unseen target domains without an extra retraining process\. Existing methods mainly focus on three directions: \(i\) Feature Representation, learning domain\-invariant features through techniques such as domain adversarial learning \[[49](https://arxiv.org/html/2606.23758#bib.bib15),[23](https://arxiv.org/html/2606.23758#bib.bib61),[78](https://arxiv.org/html/2606.23758#bib.bib79),[15](https://arxiv.org/html/2606.23758#bib.bib122)\], invariant risk minimization \[[4](https://arxiv.org/html/2606.23758#bib.bib56),[2](https://arxiv.org/html/2606.23758#bib.bib77)\], or causality\-based feature disentangling \[[12](https://arxiv.org/html/2606.23758#bib.bib40),[56](https://arxiv.org/html/2606.23758#bib.bib69)\]\. \(ii\) Data Augmentation, enhancing training diversity via domain transfer, mixing or Fourier transforms \[[99](https://arxiv.org/html/2606.23758#bib.bib7),[91](https://arxiv.org/html/2606.23758#bib.bib58),[92](https://arxiv.org/html/2606.23758#bib.bib78),[28](https://arxiv.org/html/2606.23758#bib.bib74)\], adversarial generation \[[50](https://arxiv.org/html/2606.23758#bib.bib8),[98](https://arxiv.org/html/2606.23758#bib.bib45)\], or stochastic noise injection \[[51](https://arxiv.org/html/2606.23758#bib.bib10),[90](https://arxiv.org/html/2606.23758#bib.bib80)\]\. \(iii\) Learning Strategy, applying meta\-learning \[[96](https://arxiv.org/html/2606.23758#bib.bib49),[18](https://arxiv.org/html/2606.23758#bib.bib14),[6](https://arxiv.org/html/2606.23758#bib.bib46)\], ensemble learning \[[100](https://arxiv.org/html/2606.23758#bib.bib48),[10](https://arxiv.org/html/2606.23758#bib.bib47),[5](https://arxiv.org/html/2606.23758#bib.bib81),[84](https://arxiv.org/html/2606.23758#bib.bib123)\], or regularization \[[36](https://arxiv.org/html/2606.23758#bib.bib50),[86](https://arxiv.org/html/2606.23758#bib.bib83),[76](https://arxiv.org/html/2606.23758#bib.bib51),[57](https://arxiv.org/html/2606.23758#bib.bib52)\], some of which can be efficiently achieved via meta\-learning\.

TABLE I: Comparison of target domains under different settings\.

| Problem Setting | Distribution of Data | Label Space | Participation in Training |
| --- | --- | --- | --- |
| Domain Adaptation \[[88](https://arxiv.org/html/2606.23758#bib.bib27)\] | $\mathcal{Q}$ | $\mathcal{C}$ | ✓ |
| Domain Generalization \[[87](https://arxiv.org/html/2606.23758#bib.bib12)\] | $\mathcal{Q}$ | $\mathcal{C}$ | × |
| Open Set Recognition \[[25](https://arxiv.org/html/2606.23758#bib.bib19)\] | $\mathcal{P}$ | $\mathcal{C}\cup\mathcal{U}$ | × |
| Open Set Domain Generalization \[[77](https://arxiv.org/html/2606.23758#bib.bib4)\] | $\mathcal{Q}$ | $\mathcal{C}\cup\mathcal{U}$ | × |

- 1 $\mathcal{P}$ and $\mathcal{C}$ are the data distribution and label space of the source domains\.
- 2 $\mathcal{Q}$ is the unseen data distribution and $\mathcal{C}\cap\mathcal{U}=\varnothing$\.

### II\-B Open Set Recognition

Open Set Recognition \(OSR\) focuses on detecting novel classes not included in the training set\. Based on the use of additional data, existing methods can be classified into two categories\. \(i\) Artificial Classes\. Some methods \[[17](https://arxiv.org/html/2606.23758#bib.bib26),[33](https://arxiv.org/html/2606.23758#bib.bib28)\] augment the training data with auxiliary classes to improve distinction among known classes, but their effectiveness depends heavily on the quality of these samples\. Others \[[24](https://arxiv.org/html/2606.23758#bib.bib30),[59](https://arxiv.org/html/2606.23758#bib.bib31)\] propose to use generative models to guess unknown class samples, yet the resulting images are often low in quality and far from realistic, making them less effective on complex datasets \[[40](https://arxiv.org/html/2606.23758#bib.bib67)\]\. \(ii\) Discriminative Models\. OpenMax \[[8](https://arxiv.org/html/2606.23758#bib.bib32)\] replaces the softmax layer and estimates unknown probabilities using EVT \[[80](https://arxiv.org/html/2606.23758#bib.bib33)\]\. Self\-supervised methods \[[62](https://arxiv.org/html/2606.23758#bib.bib34),[94](https://arxiv.org/html/2606.23758#bib.bib35),[95](https://arxiv.org/html/2606.23758#bib.bib84)\] leverage reconstruction errors, as they believe that known\-class samples are usually reconstructed more accurately than unknown ones\. Metric learning \[[14](https://arxiv.org/html/2606.23758#bib.bib37),[29](https://arxiv.org/html/2606.23758#bib.bib38),[54](https://arxiv.org/html/2606.23758#bib.bib85)\] is also widely used to enhance feature discrimination\. However, these approaches often misclassify all out\-of\-distribution samples as unknown, limiting their direct application to domain generalization\.

### II\-C Meta\-Learning

Meta\-learning, also known as learning to learn \[[83](https://arxiv.org/html/2606.23758#bib.bib87),[3](https://arxiv.org/html/2606.23758#bib.bib89)\], aims to equip models with the ability to generalize across tasks by finding an initialization that can be quickly adapted with minimal updates\. The model\-agnostic meta\-learning \(MAML\) \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] and first\-order meta\-learning \(Reptile\) \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] divide the model learning process into inner and outer loops\. The inner loop is for task\-specific adaptation, while the outer loop seeks a globally optimal initialization for the task in the inner loop\. In domain generalization, meta\-learning has been applied to balance optimization across diverse domains \[[46](https://arxiv.org/html/2606.23758#bib.bib2),[6](https://arxiv.org/html/2606.23758#bib.bib46),[76](https://arxiv.org/html/2606.23758#bib.bib51)\]\. MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] simulates domain shifts via meta\-train and meta\-test splits\. Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] introduces a first\-order strategy to reduce computational cost\. Unlike these domain\-level strategies, our approach further samples tasks at the class level to prevent biased decision boundaries and better distinguish known from unknown classes in the target domain\.

### II\-D Open Set Domain Generalization

Open Set Domain Generalization \(OSDG\), summarized in [Table I](https://arxiv.org/html/2606.23758#S2.T1), aims to address both domain and class shifts\. Previous studies mainly focus on training highly discriminative models \[[77](https://arxiv.org/html/2606.23758#bib.bib4),[38](https://arxiv.org/html/2606.23758#bib.bib23),[93](https://arxiv.org/html/2606.23758#bib.bib54)\] or rejecting unknown classes at test time \[[13](https://arxiv.org/html/2606.23758#bib.bib86)\]\. A key limitation of prior methods is their separate treatment of the two shifts\. For example, DAML \[[77](https://arxiv.org/html/2606.23758#bib.bib4)\], based on domain augmentation and meta\-learning, primarily targets the data shift between source domains\. CrossMatch \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] adopts consistency regularization between the close set classifier and the multi\-binary classifier without considering the domain shift\. Our objective is to tackle both shifts within a unified framework\.

More recent methods \[[79](https://arxiv.org/html/2606.23758#bib.bib116),[9](https://arxiv.org/html/2606.23758#bib.bib117),[30](https://arxiv.org/html/2606.23758#bib.bib118)\] use stronger models to improve open set performance\. Some employ Stable Diffusion \[[70](https://arxiv.org/html/2606.23758#bib.bib120)\] to synthesize unknown classes, while others generate class descriptions using rules or GPT\-4o \[[1](https://arxiv.org/html/2606.23758#bib.bib121)\] and integrate them into CLIP \[[67](https://arxiv.org/html/2606.23758#bib.bib119)\]\. Concurrently, several meta\-learning methods have been developed based on MEDIC \[[89](https://arxiv.org/html/2606.23758#bib.bib75)\]\. L2OT \[[44](https://arxiv.org/html/2606.23758#bib.bib115)\] adds a regularization term to penalize similar distributions between different classes\. EBiL\-HaDS \[[65](https://arxiv.org/html/2606.23758#bib.bib108)\] incorporates noisy samples into training and proposes a scheduler to select challenging domain–class pairs to separate tasks\. HyProMeta \[[64](https://arxiv.org/html/2606.23758#bib.bib114)\] detects noisy samples using class prototypes and splits the meta\-train and meta\-test sets according to clean and noisy samples\. These studies show the adaptability of MEDIC, and the MEDIC\+\+ proposed in this paper is a more generalized baseline, of which MEDIC can be regarded as a special case\.

## III Method

[Section III\-A](https://arxiv.org/html/2606.23758#S3.SS1) discusses some definitions in open set domain generalization and the training process of meta\-learning\. [Section III\-B](https://arxiv.org/html/2606.23758#S3.SS2) and [Section III\-C](https://arxiv.org/html/2606.23758#S3.SS3) present our MEDIC\+\+ framework\. [Section III\-D](https://arxiv.org/html/2606.23758#S3.SS4) establishes the adaptive task sampling strategy\. Finally, [Section III\-E](https://arxiv.org/html/2606.23758#S3.SS5) and [Section III\-F](https://arxiv.org/html/2606.23758#S3.SS6) introduce the multi\-binary classifier and its inference methodology\.

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/model.jpg) Figure 3: Overview of MEDIC during one training iteration\. $M$ is the overall model, and the right figure represents its internal structure\. The numbers denote the sequence of data flow \(solid arrows\) and model updates \(dashed arrows\), respectively\.

### III\-A Preliminary

Problem definition\. For open set domain generalization, we are provided with $S$ source domains $\mathcal{S}=\{\mathcal{D}_{1},\mathcal{D}_{2},\dots,\mathcal{D}_{S}\}$ with a label space $\mathcal{C}$ and $T$ unseen target domains $\mathcal{T}=\{\mathcal{D}_{S+1},\mathcal{D}_{S+2},\dots,\mathcal{D}_{S+T}\}$ with an extended label space $\mathcal{C}\cup\mathcal{U}$, which satisfies $\mathcal{C}\cap\mathcal{U}=\varnothing$\. The $s$\-th domain, consisting of $N_{s}$ samples, is represented as $\mathcal{D}_{s}=\{(x^{s}_{i},y^{s}_{i})\}_{i=1}^{N_{s}}$, where $x^{s}_{i}$ denotes the $i$\-th sample and $y^{s}_{i}$ is the corresponding label, which takes values from $\mathcal{C}$ for $\mathcal{S}$ and from $\mathcal{C}\cup\mathcal{U}$ for $\mathcal{T}$\. Our objective is to utilize these source domains $\mathcal{S}$ to develop a model that can generalize to any unseen domain in $\mathcal{T}$\. Below are several key terms\.

Task splits\. A task split, also termed a task partition, of a dataset $\mathcal{S}$ is defined as a collection of non\-empty subsets $\{\mathcal{S}_{1},\mathcal{S}_{2},\dots,\mathcal{S}_{t}\}$ such that $\mathcal{S}_{i}\cap\mathcal{S}_{j}=\varnothing\ (i\neq j)$ and $\bigcup_{i=1}^{t}\mathcal{S}_{i}=\mathcal{S}$\. That is, the subsets are mutually disjoint and their union constitutes the entire dataset $\mathcal{S}$\. For example, consider a dataset with domain labels $\{a,b\}$ and class labels $\{1,2\}$: $\mathcal{S}=\{(a,1),(a,2),(b,1),(b,2)\}$\. Two valid partitions can be constructed in different ways\. Partitioning by domain yields $\big\{\{(a,1),(a,2)\},\{(b,1),(b,2)\}\big\}$, while partitioning by class yields $\big\{\{(a,1),(b,1)\},\{(a,2),(b,2)\}\big\}$\. These represent two different criteria for dividing the same dataset\. In this paper, a task refers to a mini\-batch of samples drawn from a specific subset of such a partition\.
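The two partitions of the example dataset can be reproduced with a small grouping helper. The sketch below is illustrative only; the names `partition` and `is_partition` are hypothetical and each sample is reduced to its (domain, class) pair.

```python
from collections import defaultdict


def partition(samples, key):
    """Group samples by key(sample); returns subsets in sorted key order."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    return [groups[k] for k in sorted(groups)]


def is_partition(subsets, dataset):
    """Check non-emptiness, mutual disjointness, and full coverage."""
    flat = [s for subset in subsets for s in subset]
    return all(subsets) and len(flat) == len(set(flat)) and set(flat) == set(dataset)


S = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]   # (domain label, class label)
by_domain = partition(S, key=lambda s: s[0])
by_class = partition(S, key=lambda s: s[1])
```

Both results satisfy `is_partition`, which mirrors the two conditions in the definition: pairwise disjoint subsets whose union is the whole dataset.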

MLDG\-like meta\-learning \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] requires splitting the source domains $\mathcal{S}$ into the meta\-train set $\mathcal{S}_{\mathcal{F}}$ and the meta\-test set $\mathcal{S}_{\mathcal{G}}$, ensuring that $\mathcal{S}_{\mathcal{F}}\cup\mathcal{S}_{\mathcal{G}}=\mathcal{S}$ and $\mathcal{S}_{\mathcal{F}}\cap\mathcal{S}_{\mathcal{G}}=\varnothing$\. The losses on these two sub\-datasets $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$ are defined as $\mathcal{F}(\Theta)$ and $\mathcal{G}(\Theta)$ respectively, with $\Theta$ representing the parameters of the training model\. The optimization is conducted in a bi\-level manner\. First, the parameters $\Theta$ are updated to $\Theta'$ with the loss $\mathcal{F}(\Theta)$\. Then the loss $\mathcal{G}(\Theta')$ is combined with $\mathcal{F}(\Theta)$ to update the model’s original parameters $\Theta$\.

Reptile\-like meta\-learning \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] segments the optimization process into an inner loop and an outer loop\. Each iteration can be summarized as follows: a task consists of a batch of data sampled from a particular data distribution, and a step aggregates several tasks to create a larger batch\. During the inner loop, the model is sequentially updated over the steps to reach parameters $\hat{\Theta}$\. Then in the outer loop, the original parameters $\Theta$ are updated in the direction of $\hat{\Theta}-\Theta$\. The optimization of MLDG\-like meta\-learning could also be approximated using the two\-step Reptile scheme, with one step for meta\-train and the other for meta\-test\. In this case, $\Theta-\hat{\Theta}=\alpha\mathcal{F}'(\Theta)+\beta\mathcal{G}'(\Theta')$, where $\alpha$ and $\beta$ are inner loop learning rates\.
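The two-step scheme can be traced on a toy problem. The sketch below is a minimal illustration under stated assumptions: each task loss is a quadratic 0.5·‖θ − c‖² with a hypothetical center c, every step contains a single task, and `reptile_iteration` and `outer_lr` are names introduced here. It checks that the displacement accumulated over the inner loop is the sum of the learning-rate-weighted gradients taken at each step.

```python
def grad(theta, center):
    """Gradient of the toy loss 0.5 * ||theta - center||^2."""
    return [t - c for t, c in zip(theta, center)]


def sgd_step(theta, g, lr):
    return [t - lr * gi for t, gi in zip(theta, g)]


def reptile_iteration(theta, centers, inner_lrs, outer_lr):
    """One Reptile-like iteration: sequential inner steps, then interpolation."""
    theta_hat = list(theta)
    for center, lr in zip(centers, inner_lrs):      # inner loop
        theta_hat = sgd_step(theta_hat, grad(theta_hat, center), lr)
    # Outer loop: move the original parameters towards theta_hat.
    new_theta = [t + outer_lr * (h - t) for t, h in zip(theta, theta_hat)]
    return new_theta, theta_hat


theta = [0.0, 0.0]
c_F, c_G = [1.0, 0.0], [0.0, 1.0]     # meta-train and meta-test toy tasks
alpha, beta = 0.1, 0.2                # inner-loop learning rates
new_theta, theta_hat = reptile_iteration(theta, [c_F, c_G], [alpha, beta], 0.5)

# Recompute the displacement from the two gradients directly.
theta_prime = sgd_step(theta, grad(theta, c_F), alpha)
expected = [alpha * gf + beta * gg
            for gf, gg in zip(grad(theta, c_F), grad(theta_prime, c_G))]
displacement = [t - h for t, h in zip(theta, theta_hat)]
```

Note that the second gradient is evaluated at the intermediate parameters, not at the starting point; this dependence is what later yields the gradient-matching term.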

Gradient matching refers to searching for parameters where the gradient directions induced by different tasks are well aligned\. The goal is to ensure that optimizing one task does not cause significant interference with others, thereby promoting consistent improvement\. Directly enforcing gradient matching as an explicit regularization term would require computing second\-order derivatives, which introduces substantial computational overhead\. In practice, meta\-learning is commonly adopted to achieve implicit gradient matching\. A detailed discussion will be provided in the following sections\.
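As a brief worked step on why an explicit penalty is second-order, consider two generic task losses $\mathcal{F}$ and $\mathcal{G}$. The symbols $H_{\mathcal{F}}$ and $H_{\mathcal{G}}$ for their Hessians are introduced here for illustration only. Differentiating the negative gradient inner product with respect to $\Theta$ gives

$$\nabla_{\Theta}\left[-\mathcal{F}'(\Theta)\cdot\mathcal{G}'(\Theta)\right]=-\left(H_{\mathcal{F}}(\Theta)\,\mathcal{G}'(\Theta)+H_{\mathcal{G}}(\Theta)\,\mathcal{F}'(\Theta)\right),$$

so every update of such a regularizer would need Hessian-vector products in addition to the usual first-order gradients.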

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/compare.jpg) Figure 4: Comparison of different learning strategies\. \(a\) A single step\. \(b\) \(c\) Multiple tasks per step\. \(d\) Maximum steps\. A greater number of steps implies more task\-wise gradient matching\. Please turn to [Fig\.5](https://arxiv.org/html/2606.23758#S3.F5)\(d\) for more details\.

### III\-B MEDIC as a Special Case

To clarify the mechanism of gradient matching across multiple attributes \(i\.e\., domain and class\), we begin by introducing a special case called MEDIC, where both of the domains and classes are separated into two parts, yielding four tasks in total, and each step selects two of them\.

Given two sub\-datasets $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$, along with their corresponding loss functions $\mathcal{F}(\Theta)$ and $\mathcal{G}(\Theta)$, our objective is to reach a consensus on their gradients $\mathcal{F}'(\Theta)$ and $\mathcal{G}'(\Theta)$ to ensure an unbiased optimization direction for both of them\. The underlying principle is that if the angle between the directions of $\mathcal{F}'(\Theta)$ and $\mathcal{G}'(\Theta)$ is small, which means optimizing one task does not adversely affect the other, then updating with their combined gradient \(*i\.e\.*, the sum of gradients in practice\) can yield enhanced performance\. Conversely, if the angle between their gradients is large, which indicates a conflict between these two tasks, then updating one task would lead to an inferior optimization process for the other\. The core idea of gradient matching is to find an area in the parameter space where the angle between the gradients of $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$ is minimized, which can be accomplished by maximizing the dot product of $\mathcal{F}'(\Theta)$ and $\mathcal{G}'(\Theta)$\. As the model moves towards this region, the gradient directions of $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$ converge, so that both tasks benefit from shared updates rather than competing with each other\.

Current gradient\-based domain generalization methods typically treat $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$ as separate domains in order to find an optimization direction only among domains \[[46](https://arxiv.org/html/2606.23758#bib.bib2),[76](https://arxiv.org/html/2606.23758#bib.bib51)\]\. However, these methods often overlook the inter\-class relationships, which are important in open set scenarios\. Instead of simply adding extra iterations to mitigate biased prediction between classes, we propose a novel meta\-learning strategy that performs gradient matching across both inter\-domain and inter\-class splits simultaneously, which aims to learn generalizable decision boundaries that maintain balance across all tasks\.

As illustrated in [Fig\.3](https://arxiv.org/html/2606.23758#S3.F3), for tasks $\mathcal{S}_{\mathcal{F}}$ and $\mathcal{S}_{\mathcal{G}}$ sampled from different domains, we further divide them into $\mathcal{S}_{\mathcal{F}_{1}},\mathcal{S}_{\mathcal{F}_{2}}$ and $\mathcal{S}_{\mathcal{G}_{1}},\mathcal{S}_{\mathcal{G}_{2}}$ by class and define their loss functions as $\mathcal{F}_{1},\mathcal{F}_{2}$ and $\mathcal{G}_{1},\mathcal{G}_{2}$\. The label spaces of $\mathcal{S}_{\mathcal{F}_{1}}$ and $\mathcal{S}_{\mathcal{F}_{2}}$, as well as those of $\mathcal{S}_{\mathcal{G}_{1}}$ and $\mathcal{S}_{\mathcal{G}_{2}}$, are disjoint\. Besides, we require $\mathcal{S}_{\mathcal{F}_{1}}$ and $\mathcal{S}_{\mathcal{G}_{1}}$ to share the same label space, and likewise for $\mathcal{S}_{\mathcal{F}_{2}}$ and $\mathcal{S}_{\mathcal{G}_{2}}$\. To simultaneously apply gradient matching between domains and classes, we utilize $(\mathcal{S}_{\mathcal{F}_{1}},\mathcal{S}_{\mathcal{G}_{2}})$ as the meta\-train set and $(\mathcal{S}_{\mathcal{F}_{2}},\mathcal{S}_{\mathcal{G}_{1}})$ as the meta\-test set\. The final meta\-objective function of MEDIC is defined as follows:

$$\mathop{\rm argmin}_{\Theta}\,[\mathcal{F}_{1}(\Theta)+\mathcal{G}_{2}(\Theta)]+\beta[\mathcal{F}_{2}(\hat{\Theta})+\mathcal{G}_{1}(\hat{\Theta})]. \tag{1}$$

This objective approximates the two\-step Reptile scheme, with one step for the meta\-train set $(\mathcal{S}_{\mathcal{F}_{1}},\mathcal{S}_{\mathcal{G}_{2}})$ and the other for the meta\-test set $(\mathcal{S}_{\mathcal{F}_{2}},\mathcal{S}_{\mathcal{G}_{1}})$, where $\beta$ controls the weight between the two meta sets and $\hat{\Theta}$ denotes the model parameters optimized on the meta\-train set with learning rate $\alpha$:

$$\hat{\Theta}=\Theta-\alpha(\mathcal{F}_{1}'(\Theta)+\mathcal{G}_{2}'(\Theta)). \tag{2}$$

To validate MEDIC’s capability to perform gradient matching between domains and classes at the same time, similar to the analysis in \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\], we conduct a first\-order Taylor expansion for the second term in [Eq\.1](https://arxiv.org/html/2606.23758#S3.E1):

$$\mathcal{F}_{2}(\hat{\Theta})=\mathcal{F}_{2}(\Theta)-\alpha(\mathcal{F}_{1}'(\Theta)+\mathcal{G}_{2}'(\Theta))\cdot\mathcal{F}_{2}'(\Theta), \tag{3}$$

$$\mathcal{G}_{1}(\hat{\Theta})=\mathcal{G}_{1}(\Theta)-\alpha(\mathcal{F}_{1}'(\Theta)+\mathcal{G}_{2}'(\Theta))\cdot\mathcal{G}_{1}'(\Theta), \tag{4}$$

and the objective function becomes:

$$\begin{split}&\mathop{\rm argmin}_{\Theta}\,[\mathcal{F}_{1}(\Theta)+\mathcal{G}_{2}(\Theta)+\beta(\mathcal{F}_{2}(\Theta)+\mathcal{G}_{1}(\Theta))]\\ &-\beta\alpha[(\mathcal{F}_{1}'(\Theta)+\mathcal{G}_{2}'(\Theta))\cdot(\mathcal{F}_{2}'(\Theta)+\mathcal{G}_{1}'(\Theta))].\end{split} \tag{5}$$

The first term of [Eq\.5](https://arxiv.org/html/2606.23758#S3.E5) involves optimizing the model with the expected losses of each task, while the second term is the product of gradient sums\. By expanding this part we derive the following regularization term:

ℒreg=−\(ℱ1′⋅ℱ2′\+ℱ1′⋅𝒢1′\+𝒢2′⋅ℱ2′\+𝒢2′⋅𝒢1′\)\.\\mathcal\{L\}\_\{\\rm reg\}=\-\(\\mathcal\{F\}\_\{1\}^\{\\prime\}\\cdot\\mathcal\{F\}\_\{2\}^\{\\prime\}\+\\mathcal\{F\}\_\{1\}^\{\\prime\}\\cdot\\mathcal\{G\}\_\{1\}^\{\\prime\}\+\\mathcal\{G\}\_\{2\}^\{\\prime\}\\cdot\\mathcal\{F\}\_\{2\}^\{\\prime\}\+\\mathcal\{G\}\_\{2\}^\{\\prime\}\\cdot\\mathcal\{G\}\_\{1\}^\{\\prime\}\)\.\(6\)Note that we omit parametersΘ\\Thetafor the sake of simplicity\. As previously discussed, maximizing the dot product of gradients can regularize the optimization process to match the updating directions of different tasks\. Taking task𝒮ℱ1\\mathcal\{S\}\_\{\\mathcal\{F\}\_\{1\}\}as an example, the multiplier𝒮𝒢1\\mathcal\{S\}\_\{\\mathcal\{G\}\_\{1\}\}contains the same classes but from different domains, whereas𝒮ℱ2\\mathcal\{S\}\_\{\\mathcal\{F\}\_\{2\}\}includes different classes from the same domain, the two factors in any of the gradient products are either from different domains or different classes to enable domain\-wise and class\-wise matching simultaneously\. In contrast to conventional methods primarily concerned with inter\-domain relationships, the dot product between𝒮ℱ1\\mathcal\{S\}\_\{\\mathcal\{F\}\_\{1\}\}and𝒮ℱ2\\mathcal\{S\}\_\{\\mathcal\{F\}\_\{2\}\}bridges the gap in class\-wise gradient matching inside each domain, which enables fine\-grained model optimization to learn more rational decision boundaries\.
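The first-order expansion in Eq. 3 and the four dot products of Eq. 6 can be checked numerically. The sketch below is only an illustration: it assumes random convex quadratics as hypothetical stand-ins for the task losses $\mathcal{F}_1,\mathcal{F}_2,\mathcal{G}_1,\mathcal{G}_2$, and the helper `make_quadratic` is not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

def make_quadratic():
    """Return (loss, grad) of a random convex quadratic 0.5*x^T A x + b^T x."""
    m = rng.normal(size=(dim, dim))
    a = m @ m.T / dim + np.eye(dim)  # symmetric positive definite
    b = rng.normal(size=dim)
    return (lambda x: 0.5 * x @ a @ x + b @ x), (lambda x: a @ x + b)

# Hypothetical stand-ins for the four task losses and their gradients.
(F1, dF1), (F2, dF2), (G1, dG1), (G2, dG2) = (make_quadratic() for _ in range(4))

theta = rng.normal(size=dim)
alpha = 1e-3

# Meta-train update (Eq. 2).
theta_hat = theta - alpha * (dF1(theta) + dG2(theta))

# First-order prediction of the meta-test loss (Eq. 3) versus its exact value.
predicted = F2(theta) - alpha * (dF1(theta) + dG2(theta)) @ dF2(theta)
taylor_error = abs(F2(theta_hat) - predicted)  # should be of order alpha^2

# The product of gradient sums in Eq. 5 expands into the four terms of Eq. 6.
product = (dF1(theta) + dG2(theta)) @ (dF2(theta) + dG1(theta))
expanded = (dF1(theta) @ dF2(theta) + dF1(theta) @ dG1(theta)
            + dG2(theta) @ dF2(theta) + dG2(theta) @ dG1(theta))
```

For quadratic losses the residual `taylor_error` is exactly the second-order term, so it shrinks quadratically as `alpha` decreases.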

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/norm.jpg)

Figure 5: (a), (b), (c) Standardized mean differences of class feature pairs. Warmer colors indicate more distinct features. It can be observed that the one-task-per-step strategy leads to a convergence of their feature representations, consequently disrupting the classification performance of the model. (d) As the number of steps increases, the rate of increase in task-wise gradient matching gradually decreases, eventually converging to 0.

### III-C The MEDIC++ Framework

MEDIC++ is a flexible meta-learning strategy that involves several tasks at each step. It can be extended into a general learning framework by creating additional tasks with a finer division between domains and classes, and the number of tasks per step can be customized as needed.

Existing meta-learning-based DG methods usually overlook the distinction between tasks and steps, preferring to sample only one task at each step. Fish [[76](https://arxiv.org/html/2606.23758#bib.bib51)] theoretically demonstrates the effectiveness of this approach in pairing tasks with one another, allowing for maximum gradient matching across all tasks. However, this strategy becomes cumbersome with a large number of tasks, as the number of steps grows linearly with the number of tasks. Besides, models with batch normalization only normalize with the mean and variance of each single batch during training. As shown in [Fig. 5](https://arxiv.org/html/2606.23758#S3.F5), updating with only one task per step leads to similar statistics across classes, which makes their features indistinguishable and may negatively affect classification performance. Conversely, if we take a single large step that encompasses all the tasks, the number of paired steps is reduced to zero, leading to no gradient matching between any tasks.

We argue that taking multiple steps is essential for task-wise gradient matching, but there is no need for a large number of them. To this end, we propose MEDIC++ as follows. During the inner loop, the dataset is partitioned into $t$ tasks $\{\mathcal{S}_{\mathcal{H}_{1}},\mathcal{S}_{\mathcal{H}_{2}},\dots,\mathcal{S}_{\mathcal{H}_{t}}\}$ at both the domain and the class levels, with their loss functions represented as $\{\mathcal{H}_{1},\mathcal{H}_{2},\dots,\mathcal{H}_{t}\}$. We define the number of steps as $n$ and the corresponding loss functions as $\{\mathcal{L}_{1},\mathcal{L}_{2},\dots,\mathcal{L}_{n}\}$. The number of tasks included in the $i$-th step is denoted as $m_{i}$, which satisfies $\sum_{i=1}^{n}m_{i}=t$. In each step, tasks are randomly sampled without repetition for the gradient update, with the condition that $n$ is less than $t$. We assume a uniform distribution of tasks per step, *i.e.*, some steps contain $\lfloor\frac{t}{n}\rfloor$ tasks while the others contain $\lceil\frac{t}{n}\rceil$. The loss function of the $i$-th step is expressed as:

$$\mathcal{L}_{i}=\sum_{k=1}^{m_{i}}\mathcal{H}_{a_{i}^{k}}, \tag{7}$$

where $a_{i}^{k}$ is the index of the $k$-th task chosen in the $i$-th step. Following the inner loop, the model's parameters are updated from $\Theta$ to $\hat{\Theta}$, and we adopt the final parameters as $\Theta\leftarrow\Theta+\epsilon(\hat{\Theta}-\Theta)$, where $\epsilon$ can be considered as the learning rate of the outer loop. This Reptile-like meta-learning [[61](https://arxiv.org/html/2606.23758#bib.bib76),[76](https://arxiv.org/html/2606.23758#bib.bib51)] conducts pairwise gradient matching between steps with scaling factor $\gamma$:

$$\mathop{\rm argmin}_{\Theta}\,\sum_{i=1}^{n}\mathcal{L}_{i}(\Theta)-\gamma\sum_{i,j\in\mathcal{N}}^{i\neq j}\mathcal{L}_{i}^{\prime}(\Theta)\cdot\mathcal{L}_{j}^{\prime}(\Theta), \tag{8}$$

where $\mathcal{N}$ is the index set $\{1,2,\dots,n\}$, so the number of step-wise gradient matchings is $\frac{n(n-1)}{2}$. A more precise proof of [Eq. 8](https://arxiv.org/html/2606.23758#S3.E8) is given in [Section IV-A](https://arxiv.org/html/2606.23758#S4.SS1). Substituting [Eq. 7](https://arxiv.org/html/2606.23758#S3.E7) and its derivative into [Eq. 8](https://arxiv.org/html/2606.23758#S3.E8), we obtain the objective:

$$\mathop{\rm argmin}_{\Theta}\,\sum_{i=1}^{t}\mathcal{H}_{i}(\Theta)-\gamma\sum_{i,j\in\mathcal{N}}^{i\neq j}\left\{\sum_{k=1}^{m_{i}}\mathcal{H}_{a_{i}^{k}}^{\prime}(\Theta)\cdot\sum_{k=1}^{m_{j}}\mathcal{H}_{a_{j}^{k}}^{\prime}(\Theta)\right\}, \tag{9}$$

implying that matching gradients between steps is equivalent to matching gradients across all inter-step tasks.

Denoting the number of task-wise gradient matchings as $t_{n}$, we show in [Section IV-C](https://arxiv.org/html/2606.23758#S4.SS3) that $t_{n}$ is positively correlated with the step count $n$. By using $\frac{t}{n}$ to approximate the number of tasks per step, we can derive an estimate for $t_{n}$ as:

$$f(n)=\frac{n(n-1)}{2}\cdot\left(\frac{t}{n}\right)^{2}=\frac{1}{2}t^{2}\cdot\left(1-\frac{1}{n}\right). \tag{10}$$

Note that if $t$ is divisible by $n$, then $t_{n}$ and $f(n)$ coincide. The derivative of $f(n)$ is:

$$f^{\prime}(n)=\frac{t^{2}}{2n^{2}}, \tag{11}$$

which indicates that when $n$ is small, there is a rapid increase in task-wise gradient matching. However, as $n$ becomes larger, the growth rate converges to zero. As shown in [Fig. 5](https://arxiv.org/html/2606.23758#S3.F5)(d), if we consider a total of $9$ tasks, $t_{n}$ reaches its maximum value of $36$ when $n$ equals $9$, while even with $n$ as small as $3$, $t_{n}$ still achieves a substantial value of $27$, revealing that the inner loop does not need to be excessively long. In practice, a relatively small number of steps is sufficient to obtain a gradient matching count $t_{n}$ close to its maximum value. Furthermore, since we perform additional task sampling at the class level, updating gradients with multiple tasks allows unified batch normalization across diverse classes, which enables the model to capture the specific spatial distribution of each class in the feature space.
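The counts quoted above can be reproduced by enumerating the even split of tasks over steps. The following is a minimal sketch; the function names `matching_count` and `estimate` are introduced here for illustration only.

```python
def matching_count(t: int, n: int) -> int:
    """Exact t_n: inter-step task pairs when t tasks are spread evenly over n steps."""
    base, extra = divmod(t, n)
    sizes = [base + 1] * extra + [base] * (n - extra)  # ceil(t/n) or floor(t/n) tasks per step
    # All ordered task pairs minus those inside the same step, halved to count unordered pairs.
    return (t * t - sum(m * m for m in sizes)) // 2

def estimate(t: int, n: int) -> float:
    """Continuous estimate f(n) from Eq. 10."""
    return 0.5 * t * t * (1 - 1 / n)

t = 9
counts = {n: matching_count(t, n) for n in range(1, t + 1)}
gains = [counts[n + 1] - counts[n] for n in range(1, t)]  # increment per extra step
```

With `t = 9`, three steps already give 27 of the 36 possible task pairs, and the increments in `gains` shrink as `n` grows, which matches the diminishing growth rate in Eq. 11.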

Remark 1. We believe that the rationale behind MEDIC++ can draw inspiration from the sampling strategies of modern gradient descent algorithms, as the sequence of steps mirrors such gradient update operations during each inner loop. Please be aware that this is merely an informal analogy and should be taken much less seriously than the previous analysis.

We begin by comparing the three task sampling strategies in [Fig. 4](https://arxiv.org/html/2606.23758#S3.F4) with gradient descent algorithms [[71](https://arxiv.org/html/2606.23758#bib.bib90)], assuming that the intra-task samples can be interpreted as a collective data point due to their greater similarity in gradient patterns compared to the inter-task samples. In this case, the single-step strategy is similar to batch gradient descent (BGD), which utilizes all data points for each gradient update. The single-task-per-step strategy resembles a narrower variant of stochastic gradient descent (SGD), involving a single data point per gradient step. Our MEDIC++, following a multiple-task-per-step strategy, is analogous to mini-batch gradient descent (MBGD), with each gradient computed using a small batch of randomly selected data points. Modern gradient descent algorithms often prefer MBGD, which seeks a balance between exploration and stability [[39](https://arxiv.org/html/2606.23758#bib.bib92),[34](https://arxiv.org/html/2606.23758#bib.bib91),[37](https://arxiv.org/html/2606.23758#bib.bib93)], so opting for a multiple-task-per-step approach seems to be a wise choice.

Remark 2. We argue that one essential contribution of MEDIC++ is to broaden the notion of a task. In domain generalization, it is natural to treat each domain as a separate task, since source domains are predefined. However, if we instead regard the entire training set as a single global distribution, it can be partitioned into tasks along multiple dimensions. We explore class-wise partitioning as an alternative and provide evidence of its effectiveness. In this sense, a task can be a design choice rather than an intrinsic property of the dataset, and it need not be limited to explicit domain or class splits. More broadly, our work suggests that any dataset capable of being decomposed into multiple sub-distributions may benefit from the learning paradigm of MEDIC++.

### III-D Adaptive Task Sampling

We introduce an adaptive task sampling strategy to increase the frequency of gradient matching between easily confusable classes. Inspired by the method in [[52](https://arxiv.org/html/2606.23758#bib.bib110)], we first construct an asymmetric class transition probability matrix for each domain, where $p_{ij}$ represents the average prediction probability of misclassifying class $i$ as class $j$:

$$p_{ij}=\frac{p(j|c=i)}{\sum_{k\neq i}p(k|c=i)}. \tag{12}$$

As shown in [Fig. 6](https://arxiv.org/html/2606.23758#S3.F6), we start by randomly selecting a class. Then, based on the transition probabilities from the selected class to the others, we sample the next class without replacement. This process is repeated, with the sampled classes cyclically assigned to different tasks. The rationale is that a higher transition probability from class $i$ to class $j$ suggests that the decision boundary between these classes is more likely to be biased, so we increase the frequency of gradient matching between them. Since gradient matching only occurs across different tasks, and adjacent classes in the sampled sequence are always placed in different tasks, it is reasonable to use the class transition probabilities to guide the sampling of the next class.
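The sampling procedure described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `mean_probs` is a hypothetical per-domain matrix of averaged softmax outputs, both function names are invented for this example, and the uniform fallback for all-zero rows is an added safeguard rather than part of the described method.

```python
import numpy as np

def transition_matrix(mean_probs: np.ndarray) -> np.ndarray:
    """Eq. 12: zero the diagonal and renormalise each row over the other classes.

    mean_probs[i, j] is the average predicted probability of class j
    over samples whose true class is i (assumed to be given).
    """
    p = mean_probs.astype(float).copy()
    np.fill_diagonal(p, 0.0)
    return p / p.sum(axis=1, keepdims=True)

def adaptive_class_split(p: np.ndarray, num_tasks: int, rng) -> list:
    """Sample a class sequence along transition probabilities, then assign cyclically."""
    num_classes = p.shape[0]
    current = int(rng.integers(num_classes))  # random starting class
    order = [current]
    remaining = [c for c in range(num_classes) if c != current]
    while remaining:
        weights = p[current, remaining]
        total = weights.sum()
        # Assumption: fall back to uniform sampling if every remaining weight is zero.
        probs = weights / total if total > 0 else np.full(len(remaining), 1 / len(remaining))
        current = int(rng.choice(remaining, p=probs))  # next class, without replacement
        order.append(current)
        remaining.remove(current)
    # Cyclic assignment puts consecutive classes of the sequence into different tasks.
    return [order[k::num_tasks] for k in range(num_tasks)]

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5)) + 3.0 * np.eye(5)  # toy confusion pattern
mean_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p = transition_matrix(mean_probs)
tasks = adaptive_class_split(p, num_tasks=2, rng=rng)
```

Because assignment is cyclic, two classes sampled one after the other always land in different tasks whenever `num_tasks` is at least 2, so the most confusable pairs are the ones whose gradients get matched.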

### III-E Open Set Loss Function

For open set recognition, we adopt a multi-binary classifier [[73](https://arxiv.org/html/2606.23758#bib.bib1)] to serve as a supplement to the close set classifier. As illustrated in [Fig. 3](https://arxiv.org/html/2606.23758#S3.F3), the proposed classifier consists of $|\mathcal{C}|$ one-vs-all classifiers, with each classifier trained to detect whether a given sample belongs to its corresponding class. Let $p(\hat{y}^{k}|x)$ denote the output probability that an instance $x$ is an inlier of the $k$-th sub-classifier. For a given sample $(x,y)$, its loss on the multi-binary classifier can be formulated as:

$$\mathcal{L}_{\rm ova}(x,y)=-\log\left(p(\hat{y}^{y}|x)\right)-\min_{j\neq y}\,\log\left(1-p(\hat{y}^{j}|x)\right). \tag{13}$$

The second term means that, when the sample serves as a negative, only the most challenging binary classifier is updated. We simply adopt this loss and train a close set classifier using the cross-entropy loss, denoted as $\mathcal{L}_{\rm ce}$. The overall open set loss function is then defined as:

$$\mathcal{L}_{\rm all}=\mathcal{L}_{\rm ce}+\mathcal{L}_{\rm ova}. \tag{14}$$

[Eq. 14](https://arxiv.org/html/2606.23758#S3.E14) is employed as the objective for each task in our meta-learning paradigm. Implementing inter-class gradient matching can stabilize the training process of both positive and negative samples, thus seeking a balance between close set generalization and open set recognition.
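A per-sample version of Eq. 13 and Eq. 14 can be written in a few lines of NumPy. The sketch below assumes the positive-channel probabilities of the one-vs-all classifiers are already available as a vector `pos_prob`; the function names and the small constant `eps` for numerical safety are assumptions of this example.

```python
import numpy as np

def ova_loss(pos_prob: np.ndarray, label: int, eps: float = 1e-12) -> float:
    """Eq. 13 for one sample; pos_prob[k] stands for p(y^k | x)."""
    positive = -np.log(pos_prob[label] + eps)
    # Log-probability of being an outlier for every classifier except the true class.
    neg_log = np.log(1.0 - np.delete(pos_prob, label) + eps)
    hardest_negative = -neg_log.min()  # only the most confident wrong classifier counts
    return float(positive + hardest_negative)

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Standard softmax cross-entropy for one sample, computed stably."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def open_set_loss(logits: np.ndarray, pos_prob: np.ndarray, label: int) -> float:
    """Eq. 14: sum of the close set and one-vs-all losses."""
    return cross_entropy(logits, label) + ova_loss(pos_prob, label)

pos_prob = np.array([0.9, 0.2, 0.6])
example = ova_loss(pos_prob, label=0)  # penalises class 0 (positive) and class 2 (hardest negative)
```

In the example, the classifier for class 1 contributes nothing because class 2 has the larger positive probability among the negatives.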

Remark. The close set classifier can also be considered as a general form of multi-binary classifier, where each channel serves as a single-output binary classifier. Two-channel binary classifiers are introduced for open set classification because the cross-entropy loss is not well suited to single-output binary classifiers: it optimizes only the positive class and does not address the negative classes.

Algorithm 1: Training process of MEDIC++

Input: Domains $\mathcal{S}$; classes $\mathcal{C}$; split counts $t_{d}$ and $t_{c}$; tasks per step $m$; model parametrized by $\Theta$; loss function $\mathcal{L}$; learning rates $\alpha$ and $\epsilon$.

1: while $\Theta$ not converged do
2:   Init $\hat{\Theta}\leftarrow\Theta$; $\mathcal{A}\leftarrow\varnothing$;
3:   Random split $\mathcal{S}_{1},\mathcal{S}_{2},\dots,\mathcal{S}_{t_{d}}\leftarrow\mathcal{S}$;
4:   for $i=1,2,\dots,t_{d}$ do
5:     Random or adaptively split $\mathcal{C}_{1},\mathcal{C}_{2},\dots,\mathcal{C}_{t_{c}}\leftarrow\mathcal{C}$;
6:     for $j=1,2,\dots,t_{c}$ do
7:       Sample $\mathcal{B}_{ij}$ from $(\mathcal{S}_{i},\mathcal{C}_{j})$;
8:       $\mathcal{A}\leftarrow\mathcal{A}\cup\{\mathcal{B}_{ij}\}$;
9:     end for
10:   end for
11:   while $\mathcal{A}\neq\varnothing$ do
12:     $\mathcal{B}\leftarrow$ random pop $m$ tasks from $\mathcal{A}$;
13:     $\hat{\Theta}\leftarrow\hat{\Theta}-\alpha\cdot\nabla_{\hat{\Theta}}\mathcal{L}(\mathcal{B};\hat{\Theta})$;
14:   end while
15:   $\Theta\leftarrow\Theta+\epsilon(\hat{\Theta}-\Theta)$;
16: end while
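The inner and outer loops of Algorithm 1 (lines 11 to 15) can be sketched on a toy problem. This is a simplified illustration, not the authors' implementation: the task batches are replaced by fixed quadratic losses $\mathcal{H}_k(\theta)=\frac{1}{2}\lVert\theta-c_k\rVert^2$, the domain and class splitting of lines 3 to 10 is skipped, and `medic_pp_outer_step` and `task_grads` are hypothetical names.

```python
import numpy as np

def medic_pp_outer_step(theta, task_grads, m, alpha, eps, rng):
    """One outer iteration on pre-built tasks.

    task_grads: list of callables, one per task batch, each mapping the
    parameters to the gradient of that task's loss (hypothetical interface).
    """
    theta_hat = theta.copy()
    pool = list(rng.permutation(len(task_grads)))  # shuffled task set A
    while pool:                                    # line 11
        chosen, pool = pool[:m], pool[m:]          # line 12: pop m tasks
        step_grad = sum(task_grads[k](theta_hat) for k in chosen)  # gradient of Eq. 7
        theta_hat = theta_hat - alpha * step_grad  # line 13
    return theta + eps * (theta_hat - theta)       # line 15

# Toy tasks: task k pulls the parameters towards its own centre c_k.
rng = np.random.default_rng(0)
t_d, t_c = 2, 3
centres = rng.normal(size=(t_d * t_c, 2)) + 3.0
task_grads = [lambda th, c=c: th - c for c in centres]

theta = np.zeros(2)
initial_gap = np.linalg.norm(theta - centres.mean(axis=0))
for _ in range(300):
    theta = medic_pp_outer_step(theta, task_grads, m=2, alpha=0.05, eps=0.5, rng=rng)
final_gap = np.linalg.norm(theta - centres.mean(axis=0))
```

With six tasks and `m=2`, each outer iteration takes three inner steps, so every task gradient is matched against the four tasks that fall in the other two steps.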

### III-F Inference

In the test phase, each target sample is first predicted by the close set classifier to obtain a probability distribution $p(\hat{y}|x)$ over known classes. The model either (i) chooses the value of its maximum likelihood:

$${\rm conf}_{\rm cls}(x)=\max_{i=1}^{|\mathcal{C}|}\left(p(\hat{y}|x)_{i}\right), \tag{15}$$

or (ii) then refers to the corresponding one-vs-all classifier and chooses the value on its positive output channel as the confidence score [[73](https://arxiv.org/html/2606.23758#bib.bib1)]:

$${\rm conf}_{\rm bcls}(x)=p\left(\hat{y}^{\,{\rm argmax}_{i=1}^{|\mathcal{C}|}(p(\hat{y}|x)_{i})}\,\middle|\,x\right). \tag{16}$$

If the score is greater than a preset threshold $\mu$, the sample is classified into the known classes; otherwise, it is judged as unknown. Experimental results for both inference modes are reported in [Section V](https://arxiv.org/html/2606.23758#S5).
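The two inference modes differ only in where the confidence score is read from. A minimal sketch, assuming the close set probabilities and the positive-channel outputs of the one-vs-all classifiers are given as vectors; the function name and the use of `-1` to mark unknown samples are conventions chosen for this example.

```python
import numpy as np

def predict_open_set(cls_probs, ova_pos_probs, mu, mode="bcls"):
    """Return (label, confidence); label -1 marks a sample judged as unknown."""
    pred = int(np.argmax(cls_probs))       # class chosen by the close set classifier
    if mode == "cls":
        conf = float(cls_probs[pred])      # Eq. 15: maximum softmax probability
    elif mode == "bcls":
        conf = float(ova_pos_probs[pred])  # Eq. 16: positive channel of the matching one-vs-all classifier
    else:
        raise ValueError(f"unknown mode: {mode}")
    return (pred if conf > mu else -1), conf

cls_probs = np.array([0.7, 0.2, 0.1])
ova_pos_probs = np.array([0.3, 0.9, 0.1])
by_cls = predict_open_set(cls_probs, ova_pos_probs, mu=0.5, mode="cls")
by_bcls = predict_open_set(cls_probs, ova_pos_probs, mu=0.5, mode="bcls")
```

In this toy input, the close set classifier is confident about class 0, but the one-vs-all classifier for class 0 is not, so the two modes disagree on whether the sample is known.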

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/transition.jpg)

Figure 6: An example of adaptive task sampling. Initially, class 2 is randomly selected and assigned to task 1. Then the next class 3 is selected based on the transition probabilities without replacement and assigned to task 2.

## IV Theoretical Analysis

[Section IV-A](https://arxiv.org/html/2606.23758#S4.SS1) proves that Reptile-like meta-learning conducts pairwise gradient matching between steps. [Section IV-B](https://arxiv.org/html/2606.23758#S4.SS2) explains how to scale the learning rate within the outer loop. [Section IV-C](https://arxiv.org/html/2606.23758#S4.SS3) demonstrates that task-wise gradient matching is positively correlated with the step count.

### IV-A Step-wise Gradient Matching

We prove [Eq. 8](https://arxiv.org/html/2606.23758#S3.E8) through mathematical induction. Let us start by revisiting the definitions of the $n$-step inner loop, during which the model's parameters transition from $\Theta$ to $\hat{\Theta}$. We represent the loss at each step as $\{\mathcal{L}_{1},\mathcal{L}_{2},\dots,\mathcal{L}_{n}\}$, and the parameter updating trajectory as $\{\theta_{1},\theta_{2},\dots,\theta_{n+1}\}$, with $\theta_{1}$ and $\theta_{n+1}$ corresponding to $\Theta$ and $\hat{\Theta}$ respectively. We use $\mathcal{L}_{i}(\theta_{j})$ to denote the loss of the $i$-th step on parameters $\theta_{j}$. During the inner loop, the update process is performed with a small learning rate $\alpha$:

$$\begin{split}\theta_{2}&=\theta_{1}-\alpha\mathcal{L}_{1}^{\prime}(\theta_{1})\\ \theta_{3}&=\theta_{2}-\alpha\mathcal{L}_{2}^{\prime}(\theta_{2})\\ &\vdots\\ \theta_{n+1}&=\theta_{n}-\alpha\mathcal{L}_{n}^{\prime}(\theta_{n}).\end{split} \tag{17}$$

In the outer loop, the model is updated from $\theta_{1}$ to the final parameters:

$$\Theta\leftarrow\theta_{1}-\epsilon(\theta_{1}-\theta_{n+1}). \tag{18}$$

We then sum the formulas from [Eq. 17](https://arxiv.org/html/2606.23758#S4.E17) and obtain:

$$\theta_{1}-\theta_{n+1}=\alpha\sum_{i=1}^{n}\mathcal{L}_{i}^{\prime}(\theta_{i}). \tag{19}$$

Plugging [Eq. 19](https://arxiv.org/html/2606.23758#S4.E19) into [Eq. 18](https://arxiv.org/html/2606.23758#S4.E18) yields the original representation for the Reptile-like meta-learning objective:

$$\mathop{\rm argmin}_{\theta_{1}}\,\alpha\sum_{i=1}^{n}\mathcal{L}_{i}(\theta_{i}). \tag{20}$$

Objective. To prove that the gradient of any step is matched with those of the other $n-1$ steps, it suffices to demonstrate that for any positive integer $k$, step $k$ is gradient-matched with the previous $k-1$ steps as:

$$\mathcal{L}_{k}(\theta_{k})=\mathcal{L}_{k}(\theta_{1})-\alpha\sum_{i=1}^{k-1}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}_{k}^{\prime}(\theta_{1})+\mathcal{O}(\alpha^{2}), \tag{21}$$

where gradient matching between two steps is expressed by their dot product at $\theta_{1}$. Proving [Eq. 21](https://arxiv.org/html/2606.23758#S4.E21) only requires that the following equation holds for any loss function $\mathcal{L}$:

$$\mathcal{L}(\theta_{k})=\mathcal{L}(\theta_{1})-\alpha\sum_{i=1}^{k-1}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}^{\prime}(\theta_{1})+\mathcal{O}(\alpha^{2}). \tag{22}$$

Base Case. When $k$ equals 1, it is evident that $\mathcal{L}(\theta_{k})=\mathcal{L}(\theta_{1})$, so [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) holds. When $k$ equals 2, we can substitute [Eq. 17](https://arxiv.org/html/2606.23758#S4.E17) into $\mathcal{L}(\theta_{2})$ and conduct a first-order Taylor expansion on it:

$$\mathcal{L}(\theta_{2})=\mathcal{L}(\theta_{1})-\alpha\mathcal{L}_{1}^{\prime}(\theta_{1})\cdot\mathcal{L}^{\prime}(\theta_{1})+\mathcal{O}(\alpha^{2}), \tag{23}$$

thus [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) also holds.

Inductive Step. Assuming that [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) holds for every index up to $k$, we prove its validity for index $k+1$. Plugging [Eq. 17](https://arxiv.org/html/2606.23758#S4.E17) and [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) into $\mathcal{L}(\theta_{k+1})$ yields:

$$\begin{split}\mathcal{L}(\theta_{k+1})=\ &\mathcal{L}(\theta_{k})-\alpha\mathcal{L}_{k}^{\prime}(\theta_{k})\cdot\mathcal{L}^{\prime}(\theta_{k})+\mathcal{O}(\alpha^{2})\\ =\ &\mathcal{L}(\theta_{1})-\alpha\sum_{i=1}^{k-1}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}^{\prime}(\theta_{1})+\mathcal{O}(\alpha^{2})\\ &-\alpha\left(\mathcal{L}_{k}^{\prime}(\theta_{1})+\mathcal{O}(\alpha)\right)\left(\mathcal{L}^{\prime}(\theta_{1})+\mathcal{O}(\alpha)\right)+\mathcal{O}(\alpha^{2})\\ =\ &\mathcal{L}(\theta_{1})-\alpha\sum_{i=1}^{k}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}^{\prime}(\theta_{1})+\mathcal{O}(\alpha^{2}).\end{split} \tag{24}$$

Note that we apply [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) to $\mathcal{L}_{k}^{\prime}(\theta_{k})$ to obtain:

$$\mathcal{L}_{k}^{\prime}(\theta_{k})=\mathcal{L}_{k}^{\prime}(\theta_{1})-\alpha\sum_{i=1}^{k-1}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}_{k}^{\prime\prime}(\theta_{1})+\mathcal{O}(\alpha^{2}), \tag{25}$$

which is simplified as $\mathcal{L}_{k}^{\prime}(\theta_{1})+\mathcal{O}(\alpha)$ within [Eq. 24](https://arxiv.org/html/2606.23758#S4.E24). The derivation of $\mathcal{L}^{\prime}(\theta_{k})$ follows the same process.

Conclusion. We have proved that [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) holds for all positive integers $k$ and any loss function $\mathcal{L}$. Plugging [Eq. 22](https://arxiv.org/html/2606.23758#S4.E22) into [Eq. 20](https://arxiv.org/html/2606.23758#S4.E20), the meta-objective is transformed to:

$$\mathop{\rm argmin}_{\theta_{1}}\,\sum_{i=1}^{n}\mathcal{L}_{i}(\theta_{1})-\alpha\sum_{i,j\in\mathcal{N}}^{i\neq j}\mathcal{L}_{i}^{\prime}(\theta_{1})\cdot\mathcal{L}_{j}^{\prime}(\theta_{1}). \tag{26}$$

By replacing the learning rate $\alpha$ with $\gamma$ and the initial parameters $\theta_{1}$ with $\Theta$, we ultimately obtain [Eq. 8](https://arxiv.org/html/2606.23758#S3.E8).
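The $\mathcal{O}(\alpha^{2})$ remainder in Eq. 22 can be probed numerically: if the first-order expression is correct, halving $\alpha$ should reduce the approximation error by roughly a factor of four. The sketch below assumes random convex quadratics as hypothetical step losses and an independent quadratic `probe_loss` playing the role of the arbitrary $\mathcal{L}$; none of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 4, 3

def make_quadratic():
    """Return (loss, grad) of a random convex quadratic."""
    m = rng.normal(size=(dim, dim))
    a = m @ m.T / dim + np.eye(dim)
    b = rng.normal(size=dim)
    return (lambda x: 0.5 * x @ a @ x + b @ x), (lambda x: a @ x + b)

steps = [make_quadratic() for _ in range(n)]  # stand-ins for L_1, ..., L_n
probe_loss, probe_grad = make_quadratic()     # stand-in for the arbitrary loss L
theta1 = rng.normal(size=dim)

def approximation_error(alpha: float) -> float:
    """Gap between L(theta_{n+1}) and the first-order expression of Eq. 22."""
    theta = theta1.copy()
    for _, grad in steps:
        theta = theta - alpha * grad(theta)   # inner updates of Eq. 17
    grad_sum = sum(grad(theta1) for _, grad in steps)
    predicted = probe_loss(theta1) - alpha * grad_sum @ probe_grad(theta1)
    return abs(probe_loss(theta) - predicted)

err_big = approximation_error(1e-3)
err_small = approximation_error(5e-4)
ratio = err_big / err_small                   # expected to be close to 4
```

A ratio near 4 is consistent with a remainder that scales as $\alpha^{2}$; this is a sanity check on toy losses, not a substitute for the induction above.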

Remark. Our analysis is originally motivated by MLDG [[46](https://arxiv.org/html/2606.23758#bib.bib2)], which adopts a two-step procedure and derives gradient matching between two domain task splits. This observation strongly inspired MEDIC [[89](https://arxiv.org/html/2606.23758#bib.bib75)] and led us to investigate whether such behavior could be extended to multi-step settings. This background explains why our derivation differs substantially from previous Reptile-based analyses [[61](https://arxiv.org/html/2606.23758#bib.bib76),[76](https://arxiv.org/html/2606.23758#bib.bib51),[43](https://arxiv.org/html/2606.23758#bib.bib68)]. Prior works formulate the objective in terms of mathematical expectation, specifically through a quantity referred to as AvgGradInner [[61](https://arxiv.org/html/2606.23758#bib.bib76)]. They first derive the gradient at step $i$ as:

$$\mathcal{L}^{\prime}_{i}(\theta_{i})=\mathcal{L}^{\prime}_{i}(\theta_{1})-\alpha\mathcal{L}^{\prime\prime}_{i}(\theta_{1})\sum_{j=1}^{i-1}\mathcal{L}^{\prime}_{j}(\theta_{1})+\mathcal{O}(\alpha^{2}), \tag{27}$$

and then isolate a single term from the second component and compute its expectation:

$$\begin{split}{\rm AvgGradInner}&=\mathbb{E}_{i,j}\left(\mathcal{L}^{\prime\prime}_{i}(\theta_{1})\mathcal{L}^{\prime}_{j}(\theta_{1})\right)\\ &=\mathbb{E}_{i,j}\left(\mathcal{L}^{\prime\prime}_{j}(\theta_{1})\mathcal{L}^{\prime}_{i}(\theta_{1})\right)\\ &=\frac{1}{2}\mathbb{E}_{i,j}\left(\mathcal{L}^{\prime\prime}_{i}(\theta_{1})\mathcal{L}^{\prime}_{j}(\theta_{1})+\mathcal{L}^{\prime\prime}_{j}(\theta_{1})\mathcal{L}^{\prime}_{i}(\theta_{1})\right)\\ &=\frac{1}{2}\mathbb{E}_{i,j}\left(\left(\mathcal{L}^{\prime}_{i}(\theta_{1})\mathcal{L}^{\prime}_{j}(\theta_{1})\right)^{\prime}\right).\end{split} \tag{28}$$

The result is therefore expressed in terms of an expectation $\mathbb{E}$, which cannot be eliminated from their formulation. In contrast, we establish [Eq. 21](https://arxiv.org/html/2606.23758#S4.E21), which directly yields:

$$\mathcal{L}^{\prime}_{i}(\theta_{i})=\mathcal{L}^{\prime}_{i}(\theta_{1})-\alpha\sum_{j=1}^{i-1}\left(\mathcal{L}_{j}^{\prime}(\theta_{1})\cdot\mathcal{L}_{i}^{\prime}(\theta_{1})\right)^{\prime}+\mathcal{O}(\alpha^{2}). \tag{29}$$

By removing the operator $\mathbb{E}$, we characterize the exact pairwise gradient matching among tasks, rather than its expectation. This distinction is central to our theoretical development, as it ensures that gradient matching holds on the current path instead of merely on average.

### IV-B Scaling the Learning Rate

We discuss how to set the learning rate $\epsilon$ for the outer loop. Plugging [Eq. 7](https://arxiv.org/html/2606.23758#S3.E7) into [Eq. 20](https://arxiv.org/html/2606.23758#S4.E20) leads to the loss function:

$$\mathcal{L}_{\rm outer}=\alpha\sum_{i=1}^{n}\sum_{k=1}^{m_{i}}\mathcal{H}_{a_{i}^{k}}(\theta_{i}), \tag{30}$$

while the standard loss of empirical risk minimization (ERM) [[58](https://arxiv.org/html/2606.23758#bib.bib57)] without meta-learning can be expressed as:

$$\begin{split}\mathcal{L}_{\rm erm}&=\frac{1}{t}\sum_{i=1}^{t}\mathcal{H}_{i}(\theta_{1})\\ &=\frac{1}{t}\sum_{i=1}^{n}\sum_{k=1}^{m_{i}}\mathcal{H}_{a_{i}^{k}}(\theta_{1}),\end{split} \tag{31}$$

which implies that the coefficient of the loss in the outer loop is $\alpha t$ times that of ERM. Thus $\epsilon$ needs to be scaled to $\frac{1}{\alpha t}$ of the default learning rate.

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/graph.jpg)

Figure 7: Supplementary visual aids for the positive correlation between task-wise gradient matching and step count in [Section IV-C](https://arxiv.org/html/2606.23758#S4.SS3).

### IV-C Relationship Between $t_{n}$ and $n$

In [Section III-C](https://arxiv.org/html/2606.23758#S3.SS3), the estimate showing that the gradient matching count $t_{n}$ increases with the step count $n$ is presented in continuous form. Here, we provide a discrete proof. As shown in [Fig. 7](https://arxiv.org/html/2606.23758#S4.F7), we construct a graph model where tasks are represented as nodes, task-wise gradient matchings as edges, and steps as node partitions $\{\mathcal{P}_{1},\mathcal{P}_{2},\dots,\mathcal{P}_{n}\}$. The number of tasks in the $i$-th step can also be written as the number of nodes $|\mathcal{P}_{i}|$. Assuming a uniform task count distribution per step, it holds for any $i$ and $j$ that:

$$|\mathcal{P}_{i}|\leq|\mathcal{P}_{j}|+1. \tag{32}$$

Matching gradients between steps $i$ and $j$ can be regarded as generating a complete bipartite graph $K(\mathcal{P}_{i},\mathcal{P}_{j})$ with a total of $|\mathcal{P}_{i}|\cdot|\mathcal{P}_{j}|$ edges. Reducing the number of steps from $n+1$ to $n$ is equivalent to uniformly dividing $\mathcal{P}_{n+1}$ into $n$ sub-partitions $\{\mathcal{P}_{1}^{\prime},\mathcal{P}_{2}^{\prime},\dots,\mathcal{P}_{n}^{\prime}\}$ and allocating them to the remaining partitions according to their respective indices. The change in the number of edges is expressed as:

$$\begin{split}\Delta_{n+1\rightarrow n}&=\sum_{i,j\in\mathcal{N}}^{i\neq j}|\mathcal{P}_{i}^{\prime}|\cdot|\mathcal{P}_{j}^{\prime}|-\sum_{i=1}^{n}|\mathcal{P}_{i}^{\prime}|\cdot|\mathcal{P}_{i}|\\ &=\frac{1}{2}\sum_{i=1}^{n}|\mathcal{P}_{i}^{\prime}|\sum_{j\in\mathcal{N}}^{j\neq i}|\mathcal{P}_{j}^{\prime}|-\sum_{i=1}^{n}|\mathcal{P}_{i}^{\prime}|\cdot|\mathcal{P}_{i}|\\ &=\frac{1}{2}\sum_{i=1}^{n}|\mathcal{P}_{i}^{\prime}|\left(\sum_{j\in\mathcal{N}}^{j\neq i}|\mathcal{P}_{j}^{\prime}|-2|\mathcal{P}_{i}|\right).\end{split} \tag{33}$$

The first term counts the edges that exist only with $n$ steps, while the second term counts the edges that exist only with $n+1$ steps. Each term of the summation in [Eq. 33](https://arxiv.org/html/2606.23758#S4.E33) equals $0$ when $|\mathcal{P}_{i}^{\prime}|=0$. For any $|\mathcal{P}_{i}^{\prime}|\geq 1$, it follows from [Eq. 32](https://arxiv.org/html/2606.23758#S4.E32) that:

$$\sum_{j\in\mathcal{N}}^{j\neq i}|\mathcal{P}_{j}^{\prime}|=|\mathcal{P}_{n+1}|-|\mathcal{P}_{i}^{\prime}|\leq|\mathcal{P}_{n+1}|-1\leq|\mathcal{P}_{i}|, \tag{34}$$

thus the corresponding terms are less than $0$. The value of [Eq. 33](https://arxiv.org/html/2606.23758#S4.E33) is therefore negative, so the number of edges decreases when going from $n+1$ steps to $n$ steps, which shows a positive correlation between the gradient matching count $t_{n}$ and the step count $n$.

## V Experiment

### V-A Datasets

We evaluate on seven standard DG datasets whose details are described as follows: (i) PACS [[45](https://arxiv.org/html/2606.23758#bib.bib3)] contains 4 domains (*photo*, *art-painting*, *cartoon*, *sketch*) with 7 classes and 9,991 images. (ii) Office-Home [[85](https://arxiv.org/html/2606.23758#bib.bib63)] comprises 4 domains (*art*, *clipart*, *product*, *real-world*) with 65 classes and 15,588 images. (iii) VLCS [[19](https://arxiv.org/html/2606.23758#bib.bib94)] consists of 4 domains (*pascal*, *labelme*, *caltech*, *sun*) with 5 classes and 10,729 images. (iv) TerraIncognita [[7](https://arxiv.org/html/2606.23758#bib.bib95)] is composed of 4 domains (*location38*, *location43*, *location46*, *location100*) with 10 classes and 24,788 images. (v) DomainNet [[66](https://arxiv.org/html/2606.23758#bib.bib96)] includes 6 domains (*clipart*, *infograph*, *painting*, *quickdraw*, *real*, *sketch*) with 345 classes and 586,575 images. (vi) Digits-DG [[97](https://arxiv.org/html/2606.23758#bib.bib5)] is an aggregation of 4 domains (*mnist* [[42](https://arxiv.org/html/2606.23758#bib.bib97)], *mnist-m* [[22](https://arxiv.org/html/2606.23758#bib.bib98)], *svhn* [[60](https://arxiv.org/html/2606.23758#bib.bib99)], *syn* [[22](https://arxiv.org/html/2606.23758#bib.bib98)]) with 10 classes and 48,000 images. (vii) CMNIST [[4](https://arxiv.org/html/2606.23758#bib.bib56)] consists of 3 domains, 2 classes, and 60,000 images.

TABLE II: Close set accuracy (%) on DomainBed Benchmark.

| Method | CMST | PACS | VLCS | Office | Terra | DomNet | Avg |
|---|---|---|---|---|---|---|---|
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 51.5 | 85.5 | 77.5 | 66.5 | 46.1 | 40.9 | 61.3 |
| RSC \[[36](https://arxiv.org/html/2606.23758#bib.bib50)\] | 51.7 | 85.2 | 77.1 | 65.5 | 46.6 | 38.9 | 60.8 |
| GroupDRO \[[72](https://arxiv.org/html/2606.23758#bib.bib104)\] | 52.1 | 84.4 | 76.7 | 66.0 | 43.2 | 33.3 | 59.3 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 51.5 | 84.9 | 77.2 | 66.8 | 47.8 | 41.2 | 61.6 |
| V-REx \[[41](https://arxiv.org/html/2606.23758#bib.bib100)\] | 51.8 | 84.9 | 78.3 | 66.4 | 46.4 | 33.6 | 60.2 |
| CORAL \[[81](https://arxiv.org/html/2606.23758#bib.bib59)\] | 51.5 | 86.2 | 78.8 | 68.7 | 47.7 | 41.5 | 62.4 |
| AND-mask \[[63](https://arxiv.org/html/2606.23758#bib.bib101)\] | 51.3 | 84.4 | 78.1 | 65.6 | 44.6 | 37.2 | 60.2 |
| SAND-mask \[[75](https://arxiv.org/html/2606.23758#bib.bib102)\] | 51.8 | 84.6 | 77.4 | 65.8 | 42.9 | 32.1 | 59.1 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 51.6 | 85.5 | 77.8 | 68.6 | 45.1 | 42.7 | 61.9 |
| Fishr \[[68](https://arxiv.org/html/2606.23758#bib.bib73)\] | 52.0 | 85.5 | 77.8 | 67.8 | 47.4 | 41.7 | 62.0 |
| HGP \[[32](https://arxiv.org/html/2606.23758#bib.bib103)\] | 51.8 | 84.7 | 77.6 | 68.2 | 43.6 | 41.1 | 61.2 |
| Hutchinson \[[32](https://arxiv.org/html/2606.23758#bib.bib103)\] | 52.3 | 83.9 | 76.8 | 68.2 | 46.6 | 41.6 | 61.6 |
| MEDIC++ ¹ | 52.2 | 87.3 | 78.5 | 69.6 | 49.3 | 40.7 | 62.9 |
| MEDIC++ ² | 52.4 | 89.4 | 79.0 | 71.6 | 50.2 | 46.7 | 64.9 |

- ¹ Using default model structure and hyperparameters.
- ² Using our own hyperparameters with multi-binary classifier.
- ³ The best and second-best results are bolded and underlined respectively.

TABLE III: Results (%) of PACS on ResNet50 (known : unknown = 6:1).

| Method | Photo Acc | Photo H-score | Photo OSCR | Art Acc | Art H-score | Art OSCR | Cartoon Acc | Cartoon H-score | Cartoon OSCR | Sketch Acc | Sketch H-score | Sketch OSCR | Avg Acc | Avg H-score | Avg OSCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenMax \[[8](https://arxiv.org/html/2606.23758#bib.bib32)\] | 97.58 | 93.09 | - | 88.37 | 73.91 | - | 84.38 | 68.23 | - | 80.07 | 68.06 | - | 87.60 | 75.82 | - |
| ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\] | 97.09 | 96.81 | 96.86 | 88.24 | 77.48 | 80.32 | 82.68 | 67.19 | 68.31 | 78.08 | 70.04 | 69.47 | 86.52 | 77.88 | 78.74 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 96.77 | 95.85 | 96.33 | 87.99 | 77.16 | 79.93 | 83.45 | 68.74 | 71.32 | 82.25 | 73.16 | 72.27 | 87.61 | 78.73 | 79.96 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 97.09 | 96.58 | 96.68 | 89.99 | 76.05 | 82.44 | 85.10 | 65.79 | 70.59 | 80.31 | 70.29 | 70.16 | 88.12 | 77.18 | 79.97 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 97.01 | 95.27 | 96.37 | 88.31 | 76.85 | 79.19 | 84.59 | 64.90 | 72.01 | 83.76 | 72.10 | 73.10 | 88.42 | 77.28 | 80.17 |
| CIRL \[[56](https://arxiv.org/html/2606.23758#bib.bib69)\] | 96.53 | 87.75 | 95.40 | 92.06 | 70.75 | 77.44 | 85.71 | 68.82 | 73.71 | 84.35 | 66.73 | 77.24 | 89.66 | 73.51 | 80.95 |
| MixStyle \[[99](https://arxiv.org/html/2606.23758#bib.bib7)\] | 96.53 | 93.57 | 95.30 | 90.87 | 79.15 | 83.27 | 86.80 | 68.08 | 74.68 | 84.88 | 71.57 | 73.41 | 89.77 | 78.09 | 81.66 |
| CrossMatch \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] | 96.53 | 96.34 | 96.12 | 91.37 | 75.67 | 82.32 | 83.92 | 67.02 | 74.55 | 81.61 | 72.03 | 73.99 | 88.37 | 77.76 | 81.75 |
| SWAD \[[10](https://arxiv.org/html/2606.23758#bib.bib47)\] | 96.37 | 84.56 | 93.24 | 93.75 | 68.41 | 85.00 | 85.57 | 58.57 | 75.90 | 81.90 | 74.66 | 74.65 | 89.40 | 71.55 | 82.20 |
| MVDG \[[96](https://arxiv.org/html/2606.23758#bib.bib49)\] | 97.17 | 95.02 | 96.63 | 92.50 | 79.47 | 85.02 | 86.02 | 71.05 | 76.03 | 83.44 | 75.24 | 75.18 | 89.78 | 80.20 | 83.21 |
| MEDIC | 96.37 | 94.75 | 95.79 | 91.62 | 81.61 | 85.81 | 86.65 | 77.39 | 78.30 | 84.61 | 78.35 | 79.50 | 89.81 | 83.03 | 84.85 |
| MEDIC++ | 97.58 | 96.56 | 96.99 | 93.25 | 82.70 | 85.75 | 87.58 | 76.57 | 78.43 | 85.98 | 78.36 | 79.63 | 91.10 | 83.55 | 85.20 |

TABLE IV: Results (%) of Digits-DG on ConvNet (known : unknown = 6:4).

| Method | MNIST Acc | MNIST H-score | MNIST OSCR | MNIST-M Acc | MNIST-M H-score | MNIST-M OSCR | SVHN Acc | SVHN H-score | SVHN OSCR | SYN Acc | SYN H-score | SYN OSCR | Avg Acc | Avg H-score | Avg OSCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenMax \[[8](https://arxiv.org/html/2606.23758#bib.bib32)\] | 97.33 | 52.03 | - | 71.03 | 57.26 | - | 72.00 | 49.46 | - | 84.83 | 54.78 | - | 81.30 | 53.38 | - |
| MixStyle \[[99](https://arxiv.org/html/2606.23758#bib.bib7)\] | 97.86 | 73.25 | 89.36 | 74.50 | 59.30 | 56.95 | 69.28 | 53.24 | 48.43 | 85.06 | 60.22 | 65.44 | 81.68 | 61.50 | 65.05 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 97.47 | 80.90 | 92.60 | 71.03 | 53.92 | 54.04 | 71.08 | 54.37 | 49.86 | 85.67 | 51.57 | 67.63 | 81.31 | 60.19 | 66.03 |
| ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\] | 97.75 | 85.74 | 91.86 | 69.78 | 58.08 | 54.21 | 71.78 | 56.98 | 53.63 | 85.31 | 64.04 | 65.89 | 81.16 | 66.21 | 66.40 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 97.83 | 80.36 | 94.28 | 71.11 | 46.84 | 55.17 | 73.64 | 53.54 | 53.64 | 86.08 | 63.56 | 70.34 | 82.16 | 61.08 | 68.36 |
| SWAD \[[10](https://arxiv.org/html/2606.23758#bib.bib47)\] | 97.71 | 84.44 | 92.65 | 73.09 | 53.35 | 55.94 | 76.08 | 59.18 | 56.25 | 87.95 | 51.27 | 69.03 | 83.71 | 62.06 | 68.47 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 97.83 | 74.69 | 95.61 | 71.81 | 47.31 | 56.05 | 74.42 | 49.94 | 52.89 | 86.14 | 66.57 | 73.88 | 82.55 | 59.63 | 69.61 |
| CIRL \[[56](https://arxiv.org/html/2606.23758#bib.bib69)\] | 97.92 | 81.14 | 93.50 | 73.78 | 59.88 | 58.25 | 80.06 | 58.73 | 56.88 | 87.86 | 64.91 | 69.95 | 84.91 | 66.16 | 69.64 |
| MEDIC | 97.89 | 83.20 | 95.81 | 71.14 | 60.98 | 58.28 | 76.00 | 58.77 | 57.60 | 88.11 | 62.24 | 72.91 | 83.28 | 66.30 | 71.15 |
| MEDIC++ | 98.44 | 80.58 | 96.88 | 73.14 | 60.51 | 61.62 | 77.31 | 61.55 | 59.90 | 87.43 | 67.67 | 76.65 | 84.08 | 67.58 | 73.76 |

TABLE V: Results (%) of PACS on GFNet-H-Ti (known : unknown = 6:1).

| Method | Photo Acc | Photo H-score | Photo OSCR | Art Acc | Art H-score | Art OSCR | Cartoon Acc | Cartoon H-score | Cartoon OSCR | Sketch Acc | Sketch H-score | Sketch OSCR | Avg Acc | Avg H-score | Avg OSCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OneRing \[[93](https://arxiv.org/html/2606.23758#bib.bib54)\] | 96.20 | 84.95 | - | 89.24 | 71.54 | - | 85.36 | 64.53 | - | 82.28 | 63.97 | - | 88.27 | 71.25 | - |
| CrossMatch \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] | 96.93 | 81.83 | 88.86 | 91.56 | 74.25 | 80.69 | 85.15 | 69.48 | 70.00 | 82.49 | 68.66 | 72.39 | 89.03 | 73.56 | 77.98 |
| ALOFT \[[28](https://arxiv.org/html/2606.23758#bib.bib74)\] | 97.90 | 86.45 | 89.98 | 93.75 | 74.12 | 79.89 | 85.98 | 68.80 | 71.15 | 83.58 | 67.30 | 72.41 | 90.30 | 74.17 | 78.36 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 97.66 | 87.17 | 88.95 | 91.24 | 76.75 | 81.29 | 85.62 | 71.50 | 74.89 | 84.48 | 72.45 | 74.18 | 89.75 | 76.97 | 79.83 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 97.58 | 91.23 | 95.87 | 91.06 | 74.69 | 82.64 | 85.00 | 62.78 | 70.07 | 81.88 | 73.23 | 73.05 | 88.88 | 75.48 | 80.41 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 96.93 | 84.58 | 93.40 | 91.68 | 78.67 | 81.45 | 85.10 | 71.76 | 74.93 | 82.22 | 76.62 | 73.54 | 88.98 | 77.91 | 80.83 |
| ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\] | 97.98 | 95.31 | 96.44 | 93.37 | 81.05 | 84.83 | 85.52 | 67.66 | 72.63 | 82.59 | 77.12 | 73.19 | 89.87 | 80.28 | 81.77 |
| SWAD \[[10](https://arxiv.org/html/2606.23758#bib.bib47)\] | 97.42 | 87.53 | 94.58 | 92.56 | 78.43 | 85.02 | 87.22 | 69.29 | 74.68 | 83.63 | 74.57 | 75.85 | 90.21 | 77.45 | 82.53 |
| MEDIC | 98.06 | 93.88 | 95.75 | 92.81 | 80.58 | 85.29 | 86.08 | 71.16 | 75.22 | 84.77 | 76.35 | 76.41 | 90.43 | 80.49 | 83.17 |
| MEDIC++ | 98.06 | 92.87 | 95.59 | 93.50 | 82.27 | 86.69 | 87.11 | 72.73 | 74.65 | 86.60 | 77.97 | 78.81 | 91.32 | 81.46 | 83.94 |

### V-B Implementation Details

Basic details. Each training set is randomly segmented into 3 parts by both domain and class to obtain a total of 9 tasks. The inner loop comprises 3 steps, each of which contains 3 tasks, with a fixed learning rate of 0.01. For datasets other than Digits-DG, we employ ResNet50 \[[31](https://arxiv.org/html/2606.23758#bib.bib64)\] and GFNet \[[69](https://arxiv.org/html/2606.23758#bib.bib109)\] pretrained on ImageNet \[[16](https://arxiv.org/html/2606.23758#bib.bib66)\] as our backbone networks. For Digits-DG, we adopt a lightweight convolutional architecture called ConvNet from \[[97](https://arxiv.org/html/2606.23758#bib.bib5)\]. The leave-one-domain-out evaluation protocol is applied to all benchmarks, *i.e.*, picking one target domain for testing and using the rest for training and validation. We set aside 20% of the samples from each source domain for validation and choose the model that maximizes the accuracy on the overall validation set, which is the same as the *training-domain validation set* recommended in \[[27](https://arxiv.org/html/2606.23758#bib.bib72)\].
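The 3 × 3 task layout described above can be sketched as follows. This is only an illustration: the helper names `split_into_parts` and `latin_square_steps` are hypothetical, the domain and class names are example values, and the rule used to assign the 9 tasks to the 3 inner-loop steps (a Latin-square pattern, so that each step sees every domain part and every class part exactly once) is an assumption rather than something stated in this section.

```python
import random

def split_into_parts(items, parts=3, seed=0):
    """Randomly split a list into `parts` disjoint, roughly equal groups."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]

def latin_square_steps(parts=3):
    """Assumed assignment: in step s, domain part i is paired with class part (i + s) % parts."""
    return [[(i, (i + s) % parts) for i in range(parts)] for s in range(parts)]

# Example values: three source domains and six known classes.
domain_parts = split_into_parts(["art", "cartoon", "sketch"])
class_parts = split_into_parts(["dog", "elephant", "giraffe", "guitar", "horse", "house"])
steps = latin_square_steps()

for s, tasks in enumerate(steps):
    for d, c in tasks:
        print(f"step {s}: domains={domain_parts[d]} classes={class_parts[c]}")
```

Under this assumed assignment, the 9 (domain part, class part) tasks are each visited once per inner loop, and no step repeats a domain part or a class part.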

DomainBed benchmark. We first follow the close set protocol proposed in \[[27](https://arxiv.org/html/2606.23758#bib.bib72)\], including the hyperparameter search space, model structure and Adam optimizer. For datasets like PACS, which contain 3 training domains with a default batch size of $(32,96)$, we configure our batch size to $(12,108)$. For DomainNet, which includes 5 training domains with a default batch size of $(32,160)$, we select our batch size as $(18,162)$. For CMNIST, with a default batch size of $(64,128)$, we set our batch size as $(32,128)$. The first value in the parentheses is the batch size per domain or per task, while the second value is their combined sum. The default learning rates are $5\times 10^{-5}$ for non-digits and $1\times 10^{-3}$ for digits, with our rates set to $6\times 10^{-4}\times 0.09$ and $2\times 10^{-2}\times 0.04$. As used in \[[10](https://arxiv.org/html/2606.23758#bib.bib47)\], we triple the number of iterations for DomainNet from 5,000 to 15,000 since 5,000 iterations is inadequate for convergence. Notably, the multi-binary classifier is excluded in this benchmark.
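As a quick sanity check of the batch-size pairs and learning-rate products quoted above, the arithmetic can be reproduced in a few lines. The dictionary keys are our own labels, and reading the two-factor learning rates as a single effective rate is our interpretation, not a statement from the text.

```python
# (per-domain or per-task batch size, combined batch size) pairs quoted in the text.
configs = {
    "PACS (default)": (32, 96),
    "PACS (ours)": (12, 108),
    "DomainNet (default)": (32, 160),
    "DomainNet (ours)": (18, 162),
    "CMNIST (default)": (64, 128),
    "CMNIST (ours)": (32, 128),
}

# Number of domains (default) or tasks (ours) implied by each pair.
groups = {name: total // per_group for name, (per_group, total) in configs.items()}
for name, n in groups.items():
    print(f"{name}: {n} groups per iteration")

# Products of the two learning-rate factors (interpreted here as effective rates).
lr_non_digits = 6e-4 * 0.09
lr_digits = 2e-2 * 0.04
print(f"non-digits: {lr_non_digits:.2e}, digits: {lr_digits:.2e}")
```

The implied counts are 3 and 5 domains for the PACS and DomainNet defaults and 9 tasks for our PACS and DomainNet settings, which agrees with the 9-task split described in the basic details.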

Additional configurations. To further improve the model's performance, we fine-tune hyperparameters and incorporate the multi-binary classifier into training. The number of iterations is doubled to 30,000 for DomainNet and 10,000 for TerraIncognita, while the batch size per task is uniformly set to 16 across all datasets. Using an SGD optimizer, the initial learning rates of the outer loop are configured as follows: 0.02 for PACS and Office-Home, 0.01 for VLCS, TerraIncognita and DomainNet, and 0.1 for Digits-DG and CMNIST. Each learning rate is then reduced to 10% of its initial value in the last 20% of iterations. For open set experiments, the classes are organized in alphabetical order, with the former part as known classes and the latter as unknown. The split rates for known and unknown classes on each dataset are detailed in the corresponding table caption. We use close set validation accuracy for model selection.

For ablation studies, all methods use the same number of samples per iteration and scaled learning rate\. Except for the specific ablation involving multi\-binary classifiers, all methods are implemented with them to ensure a fair comparison\.
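The outer-loop learning-rate schedule and the alphabetical known/unknown split described above can be sketched as follows. The function names `outer_lr` and `split_known_unknown` are hypothetical, and the exact iteration at which the rate drops (the first iteration of the last 20%) is an assumption about an otherwise unspecified boundary.

```python
def outer_lr(base_lr, iteration, total_iterations):
    """Constant rate, reduced to 10% of its initial value in the last 20% of iterations."""
    if iteration >= int(0.8 * total_iterations):
        return base_lr * 0.1
    return base_lr

def split_known_unknown(class_names, num_known):
    """Sort classes alphabetically; the first `num_known` are known, the rest unknown."""
    ordered = sorted(class_names)
    return ordered[:num_known], ordered[num_known:]

# PACS with a 6:1 split, assuming lowercase class names.
pacs_classes = ["person", "house", "dog", "guitar", "horse", "elephant", "giraffe"]
known, unknown = split_known_unknown(pacs_classes, num_known=6)
print(known, unknown)
print(outer_lr(0.02, 0, 10000), outer_lr(0.02, 9000, 10000))
```

With this split, the six alphabetically first PACS classes are treated as known and *person* is held out as the unknown class.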

TABLE VI: Results (%) of DomainNet on ResNet50 (known : unknown = 100:245).

| Method | clip Acc | clip OSCR | info Acc | info OSCR | paint Acc | paint OSCR | quick Acc | quick OSCR | real Acc | real OSCR | sketch Acc | sketch OSCR | Avg Acc | Avg OSCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MIRO \[[11](https://arxiv.org/html/2606.23758#bib.bib70)\] | 71.55 | 61.45 | 31.38 | 25.03 | 59.66 | 50.78 | 19.06 | 13.27 | 75.95 | 66.07 | 65.14 | 55.07 | 53.79 | 45.28 |
| MixStyle \[[99](https://arxiv.org/html/2606.23758#bib.bib7)\] | 71.10 | 61.42 | 30.46 | 25.14 | 60.96 | 51.47 | 21.68 | 15.85 | 73.13 | 62.81 | 67.04 | 58.04 | 54.06 | 45.79 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 71.62 | 61.67 | 31.62 | 26.06 | 61.17 | 51.32 | 21.06 | 15.31 | 75.13 | 65.21 | 67.40 | 57.72 | 54.67 | 46.21 |
| ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\] | 73.25 | 63.52 | 30.05 | 23.92 | 62.65 | 52.12 | 22.10 | 15.95 | 73.70 | 63.39 | 68.23 | 58.95 | 55.00 | 46.31 |
| CrossMatch \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] | 72.15 | 62.09 | 32.38 | 26.19 | 61.40 | 51.61 | 21.47 | 15.75 | 74.73 | 65.06 | 67.79 | 58.19 | 54.99 | 46.48 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 71.63 | 62.04 | 32.56 | 26.85 | 61.55 | 51.20 | 21.74 | 15.86 | 75.34 | 65.19 | 68.12 | 58.68 | 55.16 | 46.64 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 73.39 | 63.71 | 31.79 | 26.42 | 62.29 | 52.41 | 21.11 | 15.21 | 74.49 | 64.40 | 68.01 | 58.33 | 55.18 | 46.75 |
| SWAD \[[10](https://arxiv.org/html/2606.23758#bib.bib47)\] | 72.83 | 62.98 | 31.87 | 26.63 | 63.40 | 53.89 | 23.40 | 17.30 | 75.14 | 65.08 | 68.39 | 59.25 | 55.84 | 47.52 |
| MEDIC | 73.31 | 64.05 | 31.32 | 26.13 | 63.15 | 53.96 | 22.45 | 16.64 | 75.87 | 65.89 | 68.50 | 59.64 | 55.77 | 47.72 |
| MEDIC++ | 73.38 | 64.07 | 32.93 | 27.37 | 63.07 | 54.12 | 22.48 | 16.67 | 75.65 | 66.18 | 69.29 | 59.91 | 56.13 | 48.05 |

### V-C Evaluation Metrics

We choose three evaluation metrics that take both known and unknown class accuracy into account: (i) Acc represents the typical close set accuracy. (ii) H-score \[[21](https://arxiv.org/html/2606.23758#bib.bib65)\] quantifies the harmonic mean of known class accuracy ${\rm acc}_{\rm k}$ and unknown class accuracy ${\rm acc}_{\rm u}$ as follows:

$$\text{H-score}=2\cdot\frac{{\rm acc}_{\rm k}\cdot{\rm acc}_{\rm u}}{{\rm acc}_{\rm k}+{\rm acc}_{\rm u}}.\tag{35}$$

When ${\rm acc}_{\rm k}+{\rm acc}_{\rm u}$ remains constant, the closer ${\rm acc}_{\rm k}$ and ${\rm acc}_{\rm u}$ are, the larger the H-score will be. Compared with the weighted average, H-score puts more emphasis on the balance between close set classification and open set recognition. Nevertheless, the manually designed threshold to reject unknown classes is not applicable for a random target domain. We therefore adopt a threshold-independent metric, (iii) open set classification rate (OSCR) \[[17](https://arxiv.org/html/2606.23758#bib.bib26)\], which plots the true positive rate against the false positive rate using an ever-moving threshold. Different from the area under the receiver operating characteristic (AUROC) \[[59](https://arxiv.org/html/2606.23758#bib.bib31)\], which neglects known class accuracy, OSCR considers only correctly labeled samples as true positive ones.
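The two open set metrics above can be sketched in plain Python. This is a minimal illustration, not the evaluation code used in the paper: the function names `h_score` and `oscr` are hypothetical, the confidence score is assumed to be a single scalar per sample (higher means "more likely known"), and the area under the curve is approximated with the trapezoidal rule, which is one of several possible conventions.

```python
def h_score(acc_k, acc_u):
    """Harmonic mean of known-class and unknown-class accuracy (Eq. 35)."""
    if acc_k + acc_u == 0:
        return 0.0
    return 2 * acc_k * acc_u / (acc_k + acc_u)

def oscr(known_scores, known_correct, unknown_scores):
    """Area under the curve of correct classification rate vs. false positive rate.

    known_scores:   confidence of each known-class test sample
    known_correct:  whether each known-class sample received the right label
    unknown_scores: confidence of each unknown-class test sample
    """
    thresholds = sorted(set(known_scores) | set(unknown_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Only correctly labeled known samples above the threshold count as true positives.
        ccr = sum(1 for s, ok in zip(known_scores, known_correct) if ok and s >= t) / len(known_scores)
        # Unknown samples above the threshold are wrongly accepted as known.
        fpr = sum(1 for s in unknown_scores if s >= t) / len(unknown_scores)
        points.append((fpr, ccr))
    # Trapezoidal integration over the moving threshold.
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

print(h_score(0.8, 0.8), h_score(0.9, 0.7))
print(oscr([0.9, 0.8], [True, True], [0.1, 0.2]))
```

In the first line of output, both pairs have the same sum of accuracies, and the balanced pair yields the larger H-score, which is the property noted after Eq. 35.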

### V-D Close Set Results

We compare our strategy with closely related meta-learning methods such as MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] and Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\], as well as gradient-based methods like RSC \[[36](https://arxiv.org/html/2606.23758#bib.bib50)\], AND-mask \[[63](https://arxiv.org/html/2606.23758#bib.bib101)\] and Fishr \[[68](https://arxiv.org/html/2606.23758#bib.bib73)\]. First, we use the same model architecture and default hyperparameters. As [Table II](https://arxiv.org/html/2606.23758#S5.T2) illustrates, our method achieves the highest average performance, outperforming other methods on three datasets, namely PACS, Office-Home and TerraIncognita, where it surpasses the second-best method by 1.1%, 0.9% and 1.5% respectively. Second, by involving the multi-binary classifier, the model's average performance is further boosted by 2.0%, suggesting that the binary classifiers which separate inter-class samples also play a role in improving close set accuracy.

### V-E Open Set Results

We conduct open set experiments, and the results on PACS with ResNet50, Digits-DG, PACS with GFNet, and DomainNet are shown in [Table III](https://arxiv.org/html/2606.23758#S5.T3), [Table IV](https://arxiv.org/html/2606.23758#S5.T4), [Table V](https://arxiv.org/html/2606.23758#S5.T5), and [Table VI](https://arxiv.org/html/2606.23758#S5.T6), respectively. Our strategy outperforms other DG and OSR methods in both close set and open set scenarios. MEDIC++ shows superior performance compared to MEDIC, achieving an improvement of 1.29% in average close set accuracy in [Table III](https://arxiv.org/html/2606.23758#S5.T3). These results indicate that our method is capable of producing generalizable and discriminative representations, benefiting both DG and OSR tasks.

We also compare with several open set recognition methods such as OpenMax \[[8](https://arxiv.org/html/2606.23758#bib.bib32)\] and ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\]. Note that we exclude the calculation of OSCR for OpenMax \[[8](https://arxiv.org/html/2606.23758#bib.bib32)\] and OneRing \[[93](https://arxiv.org/html/2606.23758#bib.bib54)\] due to their threshold-independent property, *i.e.*, the classifier is configured with $|\mathcal{C}|+1$ output channels, one of which is dedicated to the probability of unknown classes. However, their H-scores are still below average, further highlighting that the hard inference derived from source domains is not suitable for the unseen target domains. It can be observed that ARPL \[[14](https://arxiv.org/html/2606.23758#bib.bib37)\], one of the state-of-the-art approaches for open set recognition, fails to perform well compared to the standard DG methods. This may indicate that deep learning models can exhibit a natural inclination to recognize unknown classes, so close set classification under distribution shift remains crucial in open set domain generalization.

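TABLE VII: Ablation study (%) of tasks per step on PACS / ResNet50.

| Metric | Tasks / Step | Photo | Art | Cartoon | Sketch | Avg |
|---|---|---|---|---|---|---|
| Acc | 9 | 97.42 | 90.56 | 84.54 | 81.88 | 88.60 |
| Acc | 5 4 | 97.09 | 92.68 | 86.44 | 85.01 | 90.31 |
| Acc | 3 3 3 | 97.58 | 93.25 | 87.58 | 85.98 | 91.10 |
| Acc | 3 2 2 2 | 97.01 | 93.06 | 86.70 | 84.37 | 90.28 |
| Acc | 2 2 2 2 1 | 97.25 | 92.43 | 87.11 | 84.13 | 90.23 |
| Acc | 2 2 1 1 1 1 1 | 95.96 | 91.24 | 86.19 | 82.28 | 88.92 |
| Acc | 1 1 1 1 1 1 1 1 1 | 95.80 | 86.30 | 82.42 | 81.72 | 86.56 |
| OSCR | 9 | 97.16 | 82.95 | 72.31 | 74.23 | 81.66 |
| OSCR | 5 4 | 96.73 | 85.13 | 76.77 | 79.20 | 84.46 |
| OSCR | 3 3 3 | 96.99 | 85.75 | 78.43 | 79.63 | 85.20 |
| OSCR | 3 2 2 2 | 96.44 | 85.50 | 77.84 | 78.18 | 84.49 |
| OSCR | 2 2 2 2 1 | 96.71 | 84.62 | 77.77 | 78.76 | 84.46 |
| OSCR | 2 2 1 1 1 1 1 | 95.32 | 79.56 | 75.74 | 75.17 | 81.45 |
| OSCR | 1 1 1 1 1 1 1 1 1 | 91.42 | 75.80 | 71.48 | 70.39 | 77.27 |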

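TABLE VIII: Ablation studies (%) of parameter sharing on PACS / ResNet50.

| Metric | Method | *share* | Photo | Art | Cartoon | Sketch | Avg |
|---|---|---|---|---|---|---|---|
| Acc | MEDIC | - | 96.37 | 91.62 | 86.65 | 84.61 | 89.81 |
| Acc | MEDIC | ✓ | 97.01 | 92.18 | 86.70 | 85.27 | 90.29 |
| Acc | MEDIC++ | - | 97.58 | 93.25 | 87.58 | 85.98 | 91.10 |
| Acc | MEDIC++ | ✓ | 97.17 | 92.43 | 87.22 | 85.91 | 90.68 |
| OSCR | MEDIC | - | 95.79 | 85.81 | 78.30 | 79.50 | 84.85 |
| OSCR | MEDIC | ✓ | 96.24 | 85.21 | 77.57 | 79.35 | 84.59 |
| OSCR | MEDIC++ | - | 96.99 | 85.75 | 78.43 | 79.63 | 85.20 |
| OSCR | MEDIC++ | ✓ | 96.72 | 84.94 | 78.56 | 78.88 | 84.78 |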

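TABLE IX: Ablation studies (%) of classifiers on PACS / ResNet50. All values are OSCR.

| Method | Tr-b ¹ | Inf-c ² | Inf-b ³ | P | A | C | S | Avg |
|---|---|---|---|---|---|---|---|---|
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | ✓ | - | 96.7 | 82.4 | 70.6 | 70.2 | 80.0 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | ✓ | ✓ | - | 97.3 | 83.9 | 70.7 | 70.9 | 80.7 |
| ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | ✓ | - | ✓ | 97.1 | 83.8 | 71.1 | 72.0 | 81.0 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | - | ✓ | - | 96.3 | 79.9 | 71.3 | 72.3 | 80.0 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | ✓ | ✓ | - | 96.7 | 83.1 | 74.6 | 73.5 | 82.0 |
| MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | ✓ | - | ✓ | 96.8 | 83.3 | 75.4 | 74.3 | 82.5 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | - | ✓ | - | 96.4 | 79.2 | 72.0 | 73.1 | 80.2 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | ✓ | ✓ | - | 96.2 | 81.1 | 75.6 | 72.0 | 81.2 |
| Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | ✓ | - | ✓ | 96.1 | 81.2 | 76.2 | 73.1 | 81.7 |
| MEDIC | - | ✓ | - | 95.1 | 83.7 | 73.7 | 75.5 | 82.0 |
| MEDIC | ✓ | ✓ | - | 95.4 | 84.7 | 77.5 | 76.8 | 83.6 |
| MEDIC | ✓ | - | ✓ | 95.8 | 85.8 | 78.3 | 79.5 | 84.9 |
| MEDIC++ | - | ✓ | - | 96.7 | 84.8 | 74.2 | 73.7 | 82.4 |
| MEDIC++ | ✓ | ✓ | - | 96.7 | 84.9 | 77.2 | 77.0 | 84.0 |
| MEDIC++ | ✓ | - | ✓ | 97.0 | 85.8 | 78.4 | 79.6 | 85.2 |

- ¹ Training with multi-binary classifier.
- ² Inference with close set classifier.
- ³ Inference with multi-binary classifier.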

### V-F Ablation Study

Varying the number of steps. As shown in [Table VII](https://arxiv.org/html/2606.23758#S5.T7), we investigate the influence of the number of tasks per step on model performance. We observe that both accuracy and OSCR initially increase, but then decline as the number of steps rises. This indicates that when the step count is relatively small, the notable expansion of gradient-matched tasks contributes to a rapid improvement. However, in the later stages, the differences between classes across steps lead to high similarity in the normalized features of different classes, making it challenging for the model to distinguish between them. Note that conventional meta-learning-based domain generalization methods do not experience this phenomenon because tasks from different domains are normalized separately, which actually aids in the extraction of domain-invariant features. The second to the fourth rows are all variants of MEDIC++, which builds on the core idea that multiple steps matter, but not too many. Their similar results reflect the robustness of our method. In contrast, existing methods typically fall into one of two extremes: either using a single step or restricting each step to a single task. These approaches either completely lack gradient matching or are constrained by batch normalization and cost. As illustrated in the first and last rows of the table, both of these strategies significantly underperform MEDIC++.

The effect of different domain generalization paradigms. We compare our method with the baseline ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\], as well as the meta-learning paradigms MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] and Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\]. Both MLDG and Fish share the same concept of simulating virtual target domains, but ignore the relationship among classes. When using the same loss function and model, the strategy becomes the only variable between our method and the others. As shown in [Table IX](https://arxiv.org/html/2606.23758#S5.T9), both MEDIC and MEDIC++ outperform the above methods regardless of which training and inference option is applied uniformly, demonstrating the critical role played by dualistic gradient matching in open set recognition. Moreover, after transitioning from the cross-entropy loss (*i.e.*, training with the close set classifier only) to the open set loss function (*i.e.*, training with the two classifiers), our method achieves the largest gain in average OSCR, by $2.9\%$ and $2.8\%$ for MEDIC and MEDIC++ respectively, indicating that our strategy has better compatibility with the multi-binary classifier to learn a more generalizable boundary for each known class.

Varying the proportion of known to unknown. We conduct experiments on the Office-Home \[[85](https://arxiv.org/html/2606.23758#bib.bib63)\] dataset using the multi-binary classifier across all strategies. The OSCR results are presented in [Fig. 8](https://arxiv.org/html/2606.23758#S5.F8). Increasing the number of known classes makes the classification task more challenging and lowers accuracy. Notably, MEDIC++ achieves the best performance across most split rates, which highlights the robustness of our method in diverse scenarios. We further construct class-wise meta-learning variants for each method. Interestingly, they consistently outperform traditional domain-wise meta-learning strategies. In some cases, such as when the number of known classes is 30, they even surpass MEDIC++. This emphasizes the importance of finding an optimal balance between classes, making it a crucial consideration in the design of meta-learning strategies.

On the effect of different meta-learning paradigms. Ablation studies are also conducted to evaluate different optimization, task partitioning, and sampling techniques, as detailed in [Table XII](https://arxiv.org/html/2606.23758#S5.T12). The baseline is to split source data without considering domains or classes, gradually extending to domain-wise and class-wise partitions with added sampling strategies. We find that both task partition methods benefit most optimization strategies except for MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\]. This may be due to its dependence on only the final gradient during the inner loop, which is more random and thus requires more careful optimization. The similarity between \[[52](https://arxiv.org/html/2606.23758#bib.bib110)\] and MEDIC++ is that both select confusable class pairs, while we further assign them to different steps rather than training them together. Compared to the others, our strategy (*i.e.*, the last row), which separates easily confusable class pairs across different steps, achieves the best open set performance. This suggests that enhancing gradient matching between these pairs improves the generalizability of unbiased decision boundaries.

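TABLE X: Results (%) of partial classes on PACS / ResNet50.

| Metric | Method | Photo | Art | Cartoon | Sketch | Avg |
|---|---|---|---|---|---|---|
| Acc | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 93.9 | 83.6 | 67.8 | 72.5 | 79.5 |
| Acc | MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 94.6 | 89.4 | 70.9 | 78.4 | 83.3 |
| Acc | Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 96.0 | 85.8 | 71.5 | 80.3 | 83.4 |
| Acc | MEDIC | 95.7 | 89.4 | 72.0 | 81.2 | 84.6 |
| Acc | MEDIC++ | 96.3 | 89.8 | 73.3 | 82.4 | 85.4 |
| OSCR | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 91.3 | 73.6 | 59.6 | 56.1 | 70.1 |
| OSCR | MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\] | 93.3 | 78.2 | 62.8 | 71.2 | 76.4 |
| OSCR | Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] | 95.3 | 74.5 | 63.3 | 71.6 | 76.2 |
| OSCR | MEDIC | 94.7 | 79.5 | 64.5 | 71.2 | 77.5 |
| OSCR | MEDIC++ | 95.1 | 80.7 | 66.0 | 71.8 | 78.4 |

TABLE XI: Results (%) of single domain on PACS / ResNet50.

| Metric | Method | Photo | Art | Cartoon | Sketch | Avg |
|---|---|---|---|---|---|---|
| Acc | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 45.6 | 68.5 | 73.4 | 48.7 | 59.0 |
| Acc | CM \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] | 47.0 | 67.5 | 75.1 | 58.7 | 62.1 |
| Acc | MEDIC | 54.3 | 74.7 | 81.1 | 69.6 | 69.9 |
| Acc | MEDIC++ | 60.6 | 75.4 | 82.0 | 70.8 | 72.2 |
| OSCR | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | 37.9 | 61.2 | 65.5 | 36.1 | 50.2 |
| OSCR | CM \[[101](https://arxiv.org/html/2606.23758#bib.bib53)\] | 39.3 | 61.9 | 69.0 | 45.8 | 54.0 |
| OSCR | MEDIC | 45.5 | 69.5 | 76.4 | 57.2 | 62.1 |
| OSCR | MEDIC++ | 48.4 | 71.0 | 76.7 | 60.7 | 64.2 |

TABLE XII: Ablation studies (%) of learning paradigms and task sampling strategies on PACS / ResNet50.

| Metric | Baseline | *dw* ¹ | *cw* ² | *opt* ³ | P | A | C | S | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Acc | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (i) | 97.5 | 87.6 | 83.9 | 80.7 | 87.4 |
| Acc | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (ii) | 97.4 | 91.7 | 84.0 | 84.9 | 89.5 |
| Acc | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (iii) | 97.6 | 91.2 | 80.9 | 78.4 | 87.0 |
| Acc | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | - | - | (i) | 95.6 | 90.1 | 87.0 | 84.5 | 89.3 |
| Acc | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | ✓ | - | (i) | 95.7 | 86.8 | 86.0 | 84.4 | 88.2 |
| Acc | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | - | ✓ | (i) | 93.7 | 85.6 | 82.5 | 78.2 | 85.0 |
| Acc | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | ✓ | ✓ | (i) | 96.4 | 92.2 | 84.1 | 85.1 | 89.5 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | - | - | (i) | 97.2 | 91.3 | 83.7 | 81.1 | 88.3 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | - | (i) | 96.5 | 90.6 | 86.1 | 80.7 | 88.5 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | - | ✓ | (i) | 98.1 | 89.5 | 85.8 | 82.9 | 89.1 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (i) | 97.2 | 93.2 | 87.9 | 86.2 | 91.1 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (iv) | 96.3 | 92.0 | 85.6 | 84.8 | 89.7 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (v) | 97.3 | 93.0 | 87.1 | 85.1 | 90.6 |
| Acc | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (vi) | 97.6 | 93.3 | 87.6 | 86.0 | 91.1 |
| OSCR | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (i) | 97.1 | 83.8 | 71.1 | 72.0 | 81.0 |
| OSCR | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (ii) | 96.7 | 84.3 | 72.3 | 75.3 | 82.2 |
| OSCR | ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\] | - | - | (iii) | 97.3 | 84.2 | 71.2 | 69.6 | 80.6 |
| OSCR | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | - | - | (i) | 94.9 | 81.4 | 72.7 | 77.8 | 81.7 |
| OSCR | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | ✓ | - | (i) | 94.0 | 76.9 | 75.7 | 74.7 | 80.3 |
| OSCR | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | - | ✓ | (i) | 89.3 | 75.1 | 68.3 | 69.4 | 75.5 |
| OSCR | MAML \[[20](https://arxiv.org/html/2606.23758#bib.bib88)\] | ✓ | ✓ | (i) | 93.4 | 82.9 | 76.2 | 77.1 | 82.4 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | - | - | (i) | 96.9 | 84.2 | 72.4 | 73.7 | 81.8 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | - | (i) | 96.1 | 81.2 | 76.2 | 73.1 | 81.7 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | - | ✓ | (i) | 94.8 | 84.1 | 75.8 | 76.5 | 82.8 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (i) | 96.9 | 85.3 | 77.2 | 79.2 | 84.7 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (iv) | 95.8 | 85.0 | 76.1 | 77.9 | 83.7 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (v) | 97.1 | 85.4 | 76.2 | 78.9 | 84.4 |
| OSCR | Reptile \[[61](https://arxiv.org/html/2606.23758#bib.bib76)\] | ✓ | ✓ | (vi) | 97.0 | 85.8 | 78.4 | 79.6 | 85.2 |

- ¹ Whether the tasks are divided based on different domains or not.
- ² Whether the tasks are divided based on different classes or not.
- ³ (i) Random split. (ii) Equal sampling quantity between domains and classes. (iii) Equal output activation regulation. (iv) Task sampling from \[[55](https://arxiv.org/html/2606.23758#bib.bib111)\]. (v) Adaptive task sampling from \[[52](https://arxiv.org/html/2606.23758#bib.bib110)\], where confusable class pairs are selected but not assigned to different steps. (vi) Our adaptive task sampling strategy, where confusable class pairs are selected and assigned to different steps.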

### V-G Analysis & Discussion

Time and Memory Costs. The primary cost across different steps is a trade-off between time and memory. Model training mainly involves four operations: *forward*, *backward*, *step*, and *zero\_grad*. Most computation occurs in *forward* and *backward*, whereas *step* and *zero\_grad* are lightweight. In MEDIC++, since the total data processed in the *forward* and *backward* passes is equivalent to that of ERM, the overall computational cost remains comparable. [Fig. 9](https://arxiv.org/html/2606.23758#S5.F9) illustrates how the training time and memory utilization change with the number of steps, measured over 5000 iterations on a single Nvidia RTX 2080Ti GPU with a per-task batch size of 8. In practice, training time increases as the number of steps grows, because smaller batches can reduce GPU utilization and incur higher kernel launch overhead. However, since each batch's computational graph is released after the *backward* pass, peak memory usage is reduced, leading to improved memory efficiency.

Verify unbiased decision boundaries. We adopt the confidence score to reflect the model's decision tendency. As illustrated in [Table XIII](https://arxiv.org/html/2606.23758#S5.T13), $\mathrm{conf}_p$ and $\mathrm{conf}_n$ denote the average activation of the multi-binary classifier on the positive and negative output channels. It can be observed that ERM \[[58](https://arxiv.org/html/2606.23758#bib.bib57)\], MLDG \[[46](https://arxiv.org/html/2606.23758#bib.bib2)\], and Fish \[[76](https://arxiv.org/html/2606.23758#bib.bib51)\] exhibit lower confidence for positive samples than for negative samples, which explains the pattern in the left panel of [Fig. 1](https://arxiv.org/html/2606.23758#S1.F1). In contrast, MEDIC and MEDIC++ produce comparable activation magnitudes for both classes, corresponding to the right panel. These results suggest that inter-class gradient matching effectively promotes unbiased predictions. We also provide t-SNE feature results in [Fig. 10](https://arxiv.org/html/2606.23758#S5.F10). It can be seen that the unknown classes are generally clustered around the central region. For both MEDIC and MEDIC++, the overlap between known and unknown classes seems smaller, with more space allocated for potential unknown classes. Compared to MEDIC, MEDIC++ has clearer boundaries across known classes, which helps to explain its superior close set performance.

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/office_range.jpg)

Figure 8: The values of OSCR (%) with the varying known-unknown splits on Office-Home using ResNet50, where the option *cw* denotes the class-wise meta-learning version of the corresponding strategy.

Discuss unbiased decision boundaries. We then discuss why a classifier that exhibits balanced output may improve open set behavior, as it induces maximal uncertainty in regions between known classes under mild linear assumptions. To simplify the problem, consider binary classification with two classes $\mathcal{C}_{0}$ and $\mathcal{C}_{1}$ and a model $f:\mathbb{R}^{d}\to[0,1]$, which assigns label 0 to $\mathcal{C}_{0}$ and label 1 to $\mathcal{C}_{1}$. Maximum predictive uncertainty occurs when $f(x)=0.5$. We define balanced output as the condition that the sum of the average predictions for both classes equals 1:

$$\frac{1}{|\mathcal{C}_{0}|}\sum_{x^{c_{0}}\in\mathcal{C}_{0}}f(x^{c_{0}})+\frac{1}{|\mathcal{C}_{1}|}\sum_{x^{c_{1}}\in\mathcal{C}_{1}}f(x^{c_{1}})=1.\tag{36}$$

Consider two correctly classified samples $x^{c_{0}}_{i}\in\mathcal{C}_{0}$ and $x^{c_{1}}_{j}\in\mathcal{C}_{1}$. We examine the midpoint as:

$$x_{ij}=\frac{x^{c_{0}}_{i}+x^{c_{1}}_{j}}{2}.\tag{37}$$

Such points lie between known classes and are thus plausible candidates for unknown class regions. We define the average midpoint prediction as:

$$\lambda(\mathcal{C}_{0},\mathcal{C}_{1},f)=\frac{1}{|\mathcal{C}_{0}||\mathcal{C}_{1}|}\sum_{x^{c_{0}}\in\mathcal{C}_{0}}\sum_{x^{c_{1}}\in\mathcal{C}_{1}}f\!\left(\frac{x^{c_{0}}+x^{c_{1}}}{2}\right).\tag{38}$$

Smaller $|\lambda-0.5|$ indicates higher average uncertainty in inter-class regions. Assume that $f$ is locally linear such that, along the segment connecting two samples, it admits that:

$$f\!\left(\frac{x^{c_{0}}+x^{c_{1}}}{2}\right)\approx\frac{1}{2}\left(f(x^{c_{0}})+f(x^{c_{1}})\right).\tag{39}$$

Under this approximation,

$$\begin{split}\lambda&\approx\frac{1}{2|\mathcal{C}_{0}||\mathcal{C}_{1}|}\sum_{x^{c_{0}}\in\mathcal{C}_{0}}\sum_{x^{c_{1}}\in\mathcal{C}_{1}}\left(f(x^{c_{0}})+f(x^{c_{1}})\right)\\ &=\frac{1}{2}\left(\frac{1}{|\mathcal{C}_{0}|}\sum_{x^{c_{0}}\in\mathcal{C}_{0}}f(x^{c_{0}})+\frac{1}{|\mathcal{C}_{1}|}\sum_{x^{c_{1}}\in\mathcal{C}_{1}}f(x^{c_{1}})\right).\end{split}\tag{40}$$

Therefore, $|\lambda-0.5|$ is minimized when [Eq. 36](https://arxiv.org/html/2606.23758#S5.E36) holds. This suggests that balanced output across known classes serves as a sufficient condition for inducing maximal uncertainty in inter-class regions under local linear assumption. We emphasize that this argument provides intuition rather than a strict guarantee, as the output of deep networks is highly likely nonlinear.
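The relation between balanced output and the average midpoint prediction can be checked numerically on a toy case. The sketch below assumes scalar features and an exactly linear $f$, so the local linear approximation holds with equality; the helper names `balance_sum` and `midpoint_lambda` and the sample values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def balance_sum(f, c0, c1):
    """Sum of the average predictions over both classes (balanced output means this equals 1)."""
    return mean([f(x) for x in c0]) + mean([f(x) for x in c1])

def midpoint_lambda(f, c0, c1):
    """Average prediction at the midpoints of all cross-class sample pairs."""
    return mean([f((a + b) / 2) for a in c0 for b in c1])

# Toy one-dimensional samples for the two classes.
c0 = [0.0, 0.2]
c1 = [0.8, 1.0]

balanced = lambda x: x        # average predictions 0.1 and 0.9, summing to 1
biased = lambda x: 0.5 * x    # average predictions 0.05 and 0.45, summing to 0.5

for name, f in [("balanced", balanced), ("biased", biased)]:
    print(name, balance_sum(f, c0, c1), midpoint_lambda(f, c0, c1))
```

For the balanced model the average midpoint prediction sits at the point of maximal uncertainty, while the biased model pulls it towards 0, so midpoint samples would be confidently assigned to one side.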

Sharing parameters between classifiers. Each known class is linked to a unique one-vs-all classifier and a single output channel in the close set classifier. The function of this channel is similar to that of the positive output channel in the corresponding binary classifier, which is activated by samples from the same class. This raises the question of whether parameter sharing between these channels is feasible. As demonstrated in [Table VIII](https://arxiv.org/html/2606.23758#S5.T8), sharing parameters yields performance comparable to the original architecture, while reducing the total number of output channels from $3|\mathcal{C}|$ to $2|\mathcal{C}|$.
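The channel count can be illustrated with a small sketch. This is one plausible reading of the sharing scheme, assuming the close set logit of each class simply reuses the positive channel of that class's binary classifier; the text does not specify the implementation, and the sizes, weight layout, and variable names below are hypothetical.

```python
import numpy as np

num_classes, feat_dim = 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
features = rng.normal(size=(2, feat_dim))  # batch of 2 hypothetical feature vectors

# Separate heads: |C| close set channels plus 2|C| one-vs-all channels.
w_close = rng.normal(size=(num_classes, feat_dim))
# Assumed layout: index 0 is the negative channel, index 1 the positive one.
w_binary = rng.normal(size=(num_classes, 2, feat_dim))
separate_channels = w_close.shape[0] + w_binary.shape[0] * w_binary.shape[1]

# Shared variant: drop w_close and read the close set logits
# from the positive channels of the binary classifiers.
binary_logits = np.einsum("bd,ckd->bck", features, w_binary)  # (batch, |C|, 2)
close_logits = binary_logits[:, :, 1]                         # (batch, |C|)
shared_channels = w_binary.shape[0] * w_binary.shape[1]

print(separate_channels, shared_channels)
```

Under this assumption the shared head keeps only the $2|\mathcal{C}|$ binary channels, and the close set prediction is obtained by indexing rather than by a separate linear layer.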

Partial domain generalization. We consider the presence of domain-specific classes, where splitting domain and class over the entire set is not recommended, as this may create invalid pairs. In [Algorithm 1](https://arxiv.org/html/2606.23758#alg1), we thus first split by domain in the third line, and then partition within each domain-specific class set in the fifth line. As illustrated in [Table X](https://arxiv.org/html/2606.23758#S5.T10), each source domain consists of four known classes, with any two domains sharing two classes and differing in the other two (i.e., $\{1,2,3,4\}$ for domain $1$, $\{1,2,5,6\}$ for domain $2$, and $\{3,4,5,6\}$ for domain $3$). MEDIC++ continues to achieve strong performance, indicating the potential of this finer-grained task-level balance approach for other problem settings.
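The notion of an invalid pair can be made concrete with the class sets from the example above. The sketch below is illustrative only and is not Algorithm 1: the helper names and the random two-way class split are assumptions. It contrasts a class split taken over the union of all classes with a split taken inside each domain's own class set, and counts the (domain, class) combinations that have no data.

```python
import random

# Class sets from the partial domain generalization example in the text.
domain_classes = {1: {1, 2, 3, 4}, 2: {1, 2, 5, 6}, 3: {3, 4, 5, 6}}

def halves(items, rng):
    """Randomly partition a collection into two equal-sized sets."""
    items = sorted(items)
    rng.shuffle(items)
    mid = len(items) // 2
    return set(items[:mid]), set(items[mid:])

def global_split(domain_classes, rng):
    """Split classes once over the union, then pair each part with every domain."""
    all_classes = set().union(*domain_classes.values())
    parts = halves(all_classes, rng)
    return [(d, part) for d in sorted(domain_classes) for part in parts]

def per_domain_split(domain_classes, rng):
    """Split by domain first, then partition within that domain's class set."""
    return [(d, part)
            for d in sorted(domain_classes)
            for part in halves(domain_classes[d], rng)]

def count_invalid(tasks, domain_classes):
    """Number of (domain, class) combinations for which the domain lacks the class."""
    return sum(len(part - domain_classes[d]) for d, part in tasks)

rng = random.Random(0)
invalid_global = count_invalid(global_split(domain_classes, rng), domain_classes)
per_domain_tasks = per_domain_split(domain_classes, rng)
invalid_per_domain = count_invalid(per_domain_tasks, domain_classes)
print(invalid_global, invalid_per_domain)
```

Because each domain lacks two of the six classes, the global split always yields six empty combinations here, whereas the per-domain split yields none by construction.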

Single domain generalization. In this scenario, the model is trained on a single domain and evaluated on all other domains. Our method then reduces to pure class-wise gradient matching, while MLDG and Fish degenerate to standard ERM. We thus compare with CrossMatch [[101](https://arxiv.org/html/2606.23758#bib.bib53)], which also uses multi-binary classifiers for open set single domain generalization. As shown in [Table XI](https://arxiv.org/html/2606.23758#S5.T11), MEDIC++ still performs well in this setting. Since no inter-domain task is involved, the gain comes directly from class-wise gradient matching.

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/time_memory.jpg)

Figure 9: Training time (*sec*) and memory cost (*MB*) with respect to the number of steps during 5000 iterations.

TABLE XIII: Confidence scores (%) on PACS / sketch.

| Method | ERM | MLDG | Fish | MEDIC | MEDIC++ |
|---|---|---|---|---|---|
| $\text{conf}_p$ | 75.59 | 77.30 | 75.76 | 82.23 | 85.08 |
| $\text{conf}_n$ | 84.29 | 84.04 | 85.12 | 82.12 | 85.15 |

![Refer to caption](https://arxiv.org/html/2606.23758v1/pictures/tsne.jpg)

Figure 10: t-SNE results of feature representations in the target domain, where pink and green correspond to known and unknown classes, respectively.

## VI Conclusion

In this paper, we introduce the problem setting of open set domain generalization, which aims to tackle both challenges of domain shift and category shift in the unseen target domain\. We propose a simple yet powerful meta\-learning\-based framework, which incorporates domain\-wise and class\-wise gradient matching simultaneously, accompanied by a multi\-binary classifier to learn a balanced decision boundary for each known class\. We conduct experiments on multiple benchmarks to demonstrate the superior performance of our approach in both close set and open set scenarios\.

## References

- \[1\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[2\]\(2021\)Invariance principle meets information bottleneck for out\-of\-distribution generalization\.Advances in Neural Information Processing Systems34,pp\. 3438–3450\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[3\]M\. Al\-Shedivat, L\. Li, E\. Xing, and A\. Talwalkar\(2021\)On data efficiency of meta\-learning\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1369–1377\.Cited by:[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1)\.
- \[4\]M\. Arjovsky, L\. Bottou, I\. Gulrajani, and D\. Lopez\-Paz\(2020\)Invariant risk minimization\.Stat1050,pp\. 27\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[5\]D\. Arpit, H\. Wang, Y\. Zhou, and C\. Xiong\(2022\)Ensemble of averages: improving model selection and boosting performance in domain generalization\.Advances in Neural Information Processing Systems35,pp\. 8265–8277\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[6\]Y\. Balaji, S\. Sankaranarayanan, and R\. Chellappa\(2018\)Metareg: towards domain generalization using meta\-regularization\.Advances in Neural Information Processing Systems31\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1)\.
- \[7\]S\. Beery, G\. Van Horn, and P\. Perona\(2018\)Recognition in terra incognita\.InProceedings of the European Conference on Computer Vision \(ECCV\),pp\. 456–473\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[8\]A\. Bendale and T\. E\. Boult\(2016\)Towards open set deep networks\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 1563–1572\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1),[§V\-E](https://arxiv.org/html/2606.23758#S5.SS5.p2.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.3.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.3.1)\.
- \[9\]S\. Bose, M\. Singha, A\. Jha, S\. Mukhopadhyay, and B\. Banerjee\(2025\)Meta\-learning to teach semantic prompts for open domain generalization in vision\-language models\.Transactions on Machine Learning Research\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[10\]J\. Cha, S\. Chun, K\. Lee, H\. Cho, S\. Park, Y\. Lee, and S\. Park\(2021\)Swad: domain generalization by seeking flat minima\.Advances in Neural Information Processing Systems34,pp\. 22405–22418\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p2.10),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.11.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.8.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.10.1),[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.10.1)\.
- \[11\]J\. Cha, K\. Lee, S\. Park, and S\. Chun\(2022\)Domain generalization by mutual\-information regularization with pre\-trained models\.InEuropean Conference on Computer Vision,pp\. 440–457\.Cited by:[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.3.1)\.
- \[12\]P\. Chattopadhyay, Y\. Balaji, and J\. Hoffman\(2020\)Learning to balance specificity and invariance for in and out of domain generalization\.InEuropean Conference on Computer Vision,pp\. 301–318\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[13\]C\. Chen, L\. Tang, L\. Tao, H\. Zhou, Y\. Huang, X\. Han, and Y\. Yu\(2023\)Activate and reject: towards safe domain generalization under category shift\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 11552–11563\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p1.1)\.
- \[14\]G\. Chen, P\. Peng, X\. Wang, and Y\. Tian\(2021\)Adversarial reciprocal points learning for open set recognition\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1),[§V\-E](https://arxiv.org/html/2606.23758#S5.SS5.p2.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.4.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.6.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.9.1),[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.6.1)\.
- \[15\]J\. Chen, L\. Ding, Y\. Yang, Z\. Di, and Y\. Xiang\(2024\)Domain adversarial active learning for domain generalization classification\.IEEE Transactions on Knowledge and Data Engineering37\(1\),pp\. 226–238\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[16\]J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei\(2009\)Imagenet: a large\-scale hierarchical image database\.In2009 IEEE Conference on Computer Vision and Pattern Recognition,pp\. 248–255\.Cited by:[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p1.1)\.
- \[17\]A\. R\. Dhamija, M\. Günther, and T\. Boult\(2018\)Reducing network agnostophobia\.Advances in Neural Information Processing Systems31\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1),[§V\-C](https://arxiv.org/html/2606.23758#S5.SS3.p1.5)\.
- \[18\]Q\. Dou, D\. Coelho de Castro, K\. Kamnitsas, and B\. Glocker\(2019\)Domain generalization via model\-agnostic learning of semantic features\.Advances in Neural Information Processing Systems32\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[19\]C\. Fang, Y\. Xu, and D\. N\. Rockmore\(2013\)Unbiased metric learning: on the utilization of multiple datasets and web images for softening bias\.InProceedings of the IEEE International Conference on Computer Vision,pp\. 1657–1664\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[20\]C\. Finn, P\. Abbeel, and S\. Levine\(2017\)Model\-agnostic meta\-learning for fast adaptation of deep networks\.InInternational Conference on Machine Learning,pp\. 1126–1135\.Cited by:[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1),[§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p4.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.23.1.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.8.1.1)\.
- \[21\]B\. Fu, Z\. Cao, M\. Long, and J\. Wang\(2020\)Learning to detect open classes for universal domain adaptation\.InEuropean Conference on Computer Vision,pp\. 567–583\.Cited by:[§V\-C](https://arxiv.org/html/2606.23758#S5.SS3.p1.2)\.
- \[22\]Y\. Ganin and V\. Lempitsky\(2015\)Unsupervised domain adaptation by backpropagation\.InInternational Conference on Machine Learning,pp\. 1180–1189\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[23\]Y\. Ganin, E\. Ustinova, H\. Ajakan, P\. Germain, H\. Larochelle, F\. Laviolette, M\. Marchand, and V\. Lempitsky\(2016\)Domain\-adversarial training of neural networks\.The Journal of Machine Learning Research17\(1\),pp\. 2096–2030\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[24\]Z\. Ge, S\. Demyanov, Z\. Chen, and R\. Garnavi\(2017\)Generative openmax for multi\-class open set classification\.InBritish Machine Vision Conference 2017,Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[25\]C\. Geng, S\. Huang, and S\. Chen\(2020\)Recent advances in open set recognition: a survey\.IEEE Transactions on Pattern Analysis and Machine Intelligence43\(10\),pp\. 3614–3631\.Cited by:[TABLE I](https://arxiv.org/html/2606.23758#S2.T1.8.8.8.4)\.
- \[26\]R\. C\. Griggs, M\. Batshaw, M\. Dunkle, R\. Gopal\-Srivastava, E\. Kaye, J\. Krischer, T\. Nguyen, K\. Paulus, P\. A\. Merkel,et al\.\(2009\)Clinical research for rare disease: opportunities, challenges, and solutions\.Molecular genetics and metabolism96\(1\),pp\. 20–26\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p2.1)\.
- \[27\]I\. Gulrajani and D\. Lopez\-Paz\(2020\)In search of lost domain generalization\.arXiv preprint arXiv:2007\.01434\.Cited by:[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p1.1),[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p2.10)\.
- \[28\]J\. Guo, N\. Wang, L\. Qi, and Y\. Shi\(2023\)ALOFT: a lightweight mlp\-like architecture with dynamic low\-frequency transform for domain generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 24132–24141\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.5.1)\.
- \[29\]Y\. Guo, G\. Camporese, W\. Yang, A\. Sperduti, and L\. Ballan\(2021\)Conditional variational capsule network for open set recognition\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 103–111\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p3.1),[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[30\]D\. Gupta, M\. Singha, S\. B\. Rongali, A\. Jha, M\. H\. Khan, B\. Banerjee,et al\.\(2025\)Osloprompt: bridging low\-supervision challenges and open\-set domain generalization in clip\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 10110–10120\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[31\]K\. He, X\. Zhang, S\. Ren, and J\. Sun\(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 770–778\.Cited by:[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p1.1)\.
- \[32\]S\. Hemati, G\. Zhang, A\. Estiri, and X\. Chen\(2023\)Understanding hessian alignment for domain generalization\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 19004–19014\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p2.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.12.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.13.1)\.
- \[33\]D\. Hendrycks, M\. Mazeika, and T\. Dietterich\(2018\)Deep anomaly detection with outlier exposure\.InInternational Conference on Learning Representations,Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[34\]G\. Hinton, N\. Srivastava, and K\. Swersky\(2012\)Neural networks for machine learning lecture 6a overview of mini\-batch gradient descent\.Cited on14\(8\),pp\. 2\.Cited by:[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p6.1)\.
- \[35\]T\. Hospedales, A\. Antoniou, P\. Micaelli, and A\. Storkey\(2021\)Meta\-learning in neural networks: a survey\.IEEE Transactions on Pattern Analysis and Machine Intelligence44\(9\),pp\. 5149–5169\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p4.1)\.
- \[36\]Z\. Huang, H\. Wang, E\. P\. Xing, and D\. Huang\(2020\)Self\-challenging improves cross\-domain generalization\.InEuropean Conference on Computer Vision,pp\. 124–140\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[§V\-D](https://arxiv.org/html/2606.23758#S5.SS4.p1.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.3.1)\.
- \[37\]S\. Jastrzębski, Z\. Kenton, D\. Arpit, N\. Ballas, A\. Fischer, Y\. Bengio, and A\. Storkey\(2017\)Three factors influencing minima in sgd\.arXiv preprint arXiv:1711\.04623\.Cited by:[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p6.1)\.
- \[38\]K\. Katsumata, I\. Kishida, A\. Amma, and H\. Nakayama\(2021\)Open\-set domain generalization via metric learning\.In2021 IEEE International Conference on Image Processing,pp\. 459–463\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p3.1),[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p1.1)\.
- \[39\]N\. S\. Keskar, D\. Mudigere, J\. Nocedal, M\. Smelyanskiy, and P\. T\. P\. Tang\(2016\)On large\-batch training for deep learning: generalization gap and sharp minima\.arXiv preprint arXiv:1609\.04836\.Cited by:[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p6.1)\.
- \[40\]S\. Kong and D\. Ramanan\(2021\)Opengan: open\-set recognition via open data generation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 813–822\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[41\]D\. Krueger, E\. Caballero, J\. Jacobsen, A\. Zhang, J\. Binas, D\. Zhang, R\. Le Priol, and A\. Courville\(2021\)Out\-of\-distribution generalization via risk extrapolation \(rex\)\.InInternational Conference on Machine Learning,pp\. 5815–5826\.Cited by:[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.6.1)\.
- \[42\]Y\. LeCun, L\. Bottou, Y\. Bengio, and P\. Haffner\(1998\)Gradient\-based learning applied to document recognition\.Proceedings of the IEEE86\(11\),pp\. 2278–2324\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[43\]S\. Lee, H\. B\. Lee, J\. Lee, and S\. J\. Hwang\(2022\)Sequential reptile: inter\-task gradient alignment for multilingual learning\.InTenth International Conference on Learning Representations,Cited by:[§IV\-A](https://arxiv.org/html/2606.23758#S4.SS1.p6.1)\.
- \[44\]C\. Li, S\. Wang, Y\. Long, and H\. Zhang\(2025\)Learning to transport for open set domain generalization\.Pattern Recognition,pp\. 112988\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[45\]D\. Li, Y\. Yang, Y\. Song, and T\. M\. Hospedales\(2017\)Deeper, broader and artier domain generalization\.InProceedings of the IEEE International Conference on Computer Vision,pp\. 5542–5550\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[46\]D\. Li, Y\. Yang, Y\. Song, and T\. Hospedales\(2018\)Learning to generalize: meta\-learning for domain generalization\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.32\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p2.1),[§I](https://arxiv.org/html/2606.23758#S1.p4.1),[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1),[§III\-A](https://arxiv.org/html/2606.23758#S3.SS1.p3.16),[§III\-B](https://arxiv.org/html/2606.23758#S3.SS2.p3.2),[§III\-B](https://arxiv.org/html/2606.23758#S3.SS2.p4.28),[§IV\-A](https://arxiv.org/html/2606.23758#S4.SS1.p6.1),[§V\-D](https://arxiv.org/html/2606.23758#S5.SS4.p1.1),[§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p2.2),[§V\-G](https://arxiv.org/html/2606.23758#S5.SS7.p2.2),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.10.1),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.4.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.5.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.5.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.7.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.8.1),[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.8.1),[TABLE IX](https://arxiv.org/html/2606.23758#S5.T9.1.1.7.1)\.
- \[47\]D\. Li, J\. Zhang, Y\. Yang, C\. Liu, Y\. Song, and T\. M\. Hospedales\(2019\)Episodic training for domain generalization\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 1446–1455\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p2.1)\.
- \[48\]H\. Li, J\. Li, X\. Guan, B\. Liang, Y\. Lai, and X\. Luo\(2019\)Research on overfitting of deep learning\.In2019 15th International Conference on Computational Intelligence and Security,pp\. 78–81\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p1.1)\.
- \[49\]H\. Li, S\. J\. Pan, S\. Wang, and A\. C\. Kot\(2018\)Domain generalization with adversarial feature learning\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 5400–5409\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[50\]L\. Li, K\. Gao, J\. Cao, Z\. Huang, Y\. Weng, X\. Mi, Z\. Yu, X\. Li, and B\. Xia\(2021\)Progressive domain expansion network for single domain generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 224–233\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[51\]P\. Li, D\. Li, W\. Li, S\. Gong, Y\. Fu, and T\. M\. Hospedales\(2021\)A simple feature augmentation for domain generalization\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 8886–8895\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[52\]C\. Liu, Z\. Wang, D\. Sahoo, Y\. Fang, K\. Zhang, and S\. C\. Hoi\(2020\)Adaptive task sampling for meta\-learning\.InProceedings of the European Conference on Computer Vision \(ECCV\),pp\. 752–769\.Cited by:[§III\-D](https://arxiv.org/html/2606.23758#S3.SS4.p1.3),[item 3](https://arxiv.org/html/2606.23758#S5.I3.ix3.p1.1),[§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p4.1)\.
- \[53\]H\. Liu, Z\. Cao, M\. Long, J\. Wang, and Q\. Yang\(2019\)Separate to adapt: open set domain adaptation via progressive separation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 2927–2936\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p3.1)\.
- \[54\]J\. Lu, Y\. Xu, H\. Li, Z\. Cheng, and Y\. Niu\(2022\)Pmal: open set recognition via robust prototype mining\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.36,pp\. 1872–1880\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[55\]R\. Luna Gutierrez and M\. Leonetti\(2020\)Information\-theoretic task selection for meta\-reinforcement learning\.Advances in Neural Information Processing Systems33,pp\. 20532–20542\.Cited by:[item 3](https://arxiv.org/html/2606.23758#S5.I3.ix3.p1.1)\.
- \[56\]F\. Lv, J\. Liang, S\. Li, B\. Zang, C\. H\. Liu, Z\. Wang, and D\. Liu\(2022\)Causality inspired representation learning for domain generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 8046–8056\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.8.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.10.1)\.
- \[57\]L\. Mansilla, R\. Echeveste, D\. H\. Milone, and E\. Ferrante\(2021\)Domain generalization via gradient surgery\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 6630–6638\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[58\]V\. N\. Vapnik\(1998\)Statistical learning theory\.Wiley New York\.Cited by:[§IV\-B](https://arxiv.org/html/2606.23758#S4.SS2.p1.5),[§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p2.2),[§V\-G](https://arxiv.org/html/2606.23758#S5.SS7.p2.2),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.3.1),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.9.1),[TABLE XI](https://arxiv.org/html/2606.23758#S5.T11.1.3.1),[TABLE XI](https://arxiv.org/html/2606.23758#S5.T11.1.8.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.20.1.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.5.1.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.2.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.6.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.5.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.7.1),[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.5.1),[TABLE IX](https://arxiv.org/html/2606.23758#S5.T9.1.1.4.1)\.
- \[59\]L\. Neal, M\. Olson, X\. Fern, W\. Wong, and F\. Li\(2018\)Open set learning with counterfactual images\.InProceedings of the European Conference on Computer Vision,pp\. 613–628\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1),[§V\-C](https://arxiv.org/html/2606.23758#S5.SS3.p1.5)\.
- \[60\]Y\. Netzer, T\. Wang, A\. Coates, A\. Bissacco, B\. Wu, and A\. Y\. Ng\(2011\)Reading digits in natural images with unsupervised feature learning\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[61\]A\. Nichol, J\. Achiam, and J\. Schulman\(2018\)On first\-order meta\-learning algorithms\.arXiv preprint arXiv:1803\.02999\.Cited by:[2nd item](https://arxiv.org/html/2606.23758#S1.I1.i2.p1.1),[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1),[§III\-A](https://arxiv.org/html/2606.23758#S3.SS1.p4.6),[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p3.21),[§IV\-A](https://arxiv.org/html/2606.23758#S4.SS1.p6.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.12.1.1),[TABLE XII](https://arxiv.org/html/2606.23758#S5.T12.3.3.27.1.1)\.
- \[62\]P\. Oza and V\. M\. Patel\(2019\)C2ae: class conditioned auto\-encoder for open\-set recognition\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 2307–2316\.Cited by:[§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[63\]G\. Parascandolo, A\. Neitz, A\. Orvieto, L\. Gresele, and B\. Schölkopf\(2020\)Learning explanations that are hard to vary\.InInternational Conference on Learning Representations,Cited by:[§V\-D](https://arxiv.org/html/2606.23758#S5.SS4.p1.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.8.1)\.
- \[64\]K\. Peng, D\. Wen, M\. S\. Sarfraz, Y\. Chen, J\. Zheng, D\. Schneider, K\. Yang, J\. Wu, A\. Roitberg, and R\. Stiefelhagen\(2026\)Mitigating label noise using prompt\-based hyperbolic meta\-learning in open\-set domain generalization\.International Journal of Computer Vision134\(3\),pp\. 99\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[65\]K\. Peng, D\. Wen, K\. Yang, A\. Luo, Y\. Chen, J\. Fu, M\. S\. Sarfraz, A\. Roitberg, and R\. Stiefelhagen\(2024\)Advancing open\-set domain generalization using evidential bi\-level hardest domain scheduler\.Advances in Neural Information Processing Systems37,pp\. 85412–85440\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[66\]X\. Peng, Q\. Bai, X\. Xia, Z\. Huang, K\. Saenko, and B\. Wang\(2019\)Moment matching for multi\-source domain adaptation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 1406–1415\.Cited by:[§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1)\.
- \[67\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[68\]A\. Rame, C\. Dancette, and M\. Cord\(2022\)Fishr: invariant gradient variances for out\-of\-distribution generalization\.InInternational Conference on Machine Learning,pp\. 18347–18377\.Cited by:[§V\-D](https://arxiv.org/html/2606.23758#S5.SS4.p1.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.11.1)\.
- \[69\]Y\. Rao, W\. Zhao, Z\. Zhu, J\. Lu, and J\. Zhou\(2021\)Global filter networks for image classification\.Advances in neural information processing systems34,pp\. 980–993\.Cited by:[§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p1.1)\.
- \[70\]R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer\(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[71\]S\. Ruder\(2016\)An overview of gradient descent optimization algorithms\.arXiv preprint arXiv:1609\.04747\.Cited by:[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p6.1)\.
- \[72\]S\. Sagawa, P\. W\. Koh, T\. B\. Hashimoto, and P\. Liang\(2019\)Distributionally robust neural networks\.InInternational Conference on Learning Representations,Cited by:[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.4.1)\.
- \[73\]K\. Saito and K\. Saenko\(2021\)Ovanet: one\-vs\-all network for universal domain adaptation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 9000–9009\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p3.1),[§III\-E](https://arxiv.org/html/2606.23758#S3.SS5.p1.5),[§III\-F](https://arxiv.org/html/2606.23758#S3.SS6.p1.3)\.
- \[74\]W\. J\. Scheirer, A\. de Rezende Rocha, A\. Sapkota, and T\. E\. Boult\(2012\)Toward open set recognition\.IEEE Transactions on Pattern Analysis and Machine Intelligence35\(7\),pp\. 1757–1772\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p2.1)\.
- \[75\]S\. Shahtalebi, J\. Gagnon\-Audet, T\. Laleh, M\. Faramarzi, K\. Ahuja, and I\. Rish\(2021\)Sand\-mask: an enhanced gradient masking strategy for the discovery of invariances in domain generalization\.arXiv preprint arXiv:2106\.02266\.Cited by:[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.9.1)\.
- \[76\]Y\. Shi, J\. Seely, P\. Torr, N\. Siddharth, A\. Hannun, N\. Usunier, and G\. Synnaeve\(2021\)Gradient matching for domain generalization\.InInternational Conference on Learning Representations,Cited by:[Figure 2](https://arxiv.org/html/2606.23758#S1.F2),[2nd item](https://arxiv.org/html/2606.23758#S1.I1.i2.p1.1),[§I](https://arxiv.org/html/2606.23758#S1.p4.1),[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1),[§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1),[§III\-B](https://arxiv.org/html/2606.23758#S3.SS2.p3.2),[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p2.1),[§III\-C](https://arxiv.org/html/2606.23758#S3.SS3.p3.21),[§IV\-A](https://arxiv.org/html/2606.23758#S4.SS1.p6.1),[§V\-D](https://arxiv.org/html/2606.23758#S5.SS4.p1.1),[§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p2.2),[§V\-G](https://arxiv.org/html/2606.23758#S5.SS7.p2.2),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.11.1),[TABLE X](https://arxiv.org/html/2606.23758#S5.T10.1.5.1),[TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.10.1),[TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.7.1),[TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.9.1),[TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.6.1),[TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.9.1),[TABLE IX](https://arxiv.org/html/2606.23758#S5.T9.1.1.10.1)\.
- \[77\]Y\. Shu, Z\. Cao, C\. Wang, J\. Wang, and M\. Long\(2021\)Open domain generalization with domain\-augmented meta\-learning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 9624–9633\.Cited by:[§I](https://arxiv.org/html/2606.23758#S1.p3.1),[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p1.1),[TABLE I](https://arxiv.org/html/2606.23758#S2.T1.11.11.11.4)\.
- \[78\]A\. Sicilia, X\. Zhao, and S\. J\. Hwang\(2023\)Domain adversarial neural networks for domain generalization: when it works and how to improve\.Machine Learning,pp\. 1–37\.Cited by:[§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[79\]M\. Singha, A\. Jha, S\. Bose, A\. Nair, M\. Abdar, and B\. Banerjee\(2024\)Unknown prompt the only lacuna: unveiling clip’s potential for open domain generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13309–13319\.Cited by:[§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1)\.
- \[80\] R\. L\. Smith \(1990\) Extreme value theory\. Handbook of Applicable Mathematics 7, pp\. 437–471\. Cited by: [§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[81\] B\. Sun and K\. Saenko \(2016\) Deep coral: correlation alignment for deep domain adaptation\. In European Conference on Computer Vision, pp\. 443–450\. Cited by: [TABLE II](https://arxiv.org/html/2606.23758#S5.T2.1.1.7.1)\.
- \[82\] X\. Sun, Z\. Yang, C\. Zhang, K\. Ling, and G\. Peng \(2020\) Conditional gaussian distribution learning for open set recognition\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 13480–13489\. Cited by: [§I](https://arxiv.org/html/2606.23758#S1.p3.1)\.
- \[83\] S\. Thrun and L\. Pratt \(2012\) Learning to learn\. Springer Science & Business Media\. Cited by: [§II\-C](https://arxiv.org/html/2606.23758#S2.SS3.p1.1)\.
- \[84\] C\. X\. Tian, H\. Li, Y\. Wang, and S\. Wang \(2023\) Privacy\-preserving constrained domain generalization via gradient alignment\. IEEE Transactions on Knowledge and Data Engineering 36\(5\), pp\. 2142–2150\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[85\] H\. Venkateswara, J\. Eusebio, S\. Chakraborty, and S\. Panchanathan \(2017\) Deep hashing network for unsupervised domain adaptation\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp\. 5018–5027\. Cited by: [§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1), [§V\-F](https://arxiv.org/html/2606.23758#S5.SS6.p3.1)\.
- \[86\] J\. Wang, L\. Chen, and R\. Wang \(2022\) Domain generalization model of deep convolutional networks based on sand\-mask\. Algorithms 15\(6\), pp\. 215\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[87\] J\. Wang, C\. Lan, C\. Liu, Y\. Ouyang, T\. Qin, W\. Lu, Y\. Chen, W\. Zeng, and P\. Yu \(2022\) Generalizing to unseen domains: a survey on domain generalization\. IEEE Transactions on Knowledge and Data Engineering\. Cited by: [§I](https://arxiv.org/html/2606.23758#S1.p1.1), [TABLE I](https://arxiv.org/html/2606.23758#S2.T1.5.5.5.4)\.
- \[88\] M\. Wang and W\. Deng \(2018\) Deep visual domain adaptation: a survey\. Neurocomputing 312, pp\. 135–153\. Cited by: [TABLE I](https://arxiv.org/html/2606.23758#S2.T1.2.2.2.3)\.
- \[89\] X\. Wang, J\. Zhang, L\. Qi, and Y\. Shi \(2023\) Generalizable decision boundaries: dualistic meta\-learning for open set domain generalization\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 11564–11573\. Cited by: [§I](https://arxiv.org/html/2606.23758#S1.p5.1), [§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p2.1), [§IV\-A](https://arxiv.org/html/2606.23758#S4.SS1.p6.1)\.
- \[90\] Y\. Wang, L\. Qi, Y\. Shi, and Y\. Gao \(2022\) Feature\-based style randomization for domain generalization\. IEEE Transactions on Circuits and Systems for Video Technology 32\(8\), pp\. 5495–5509\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[91\] Y\. Wang, H\. Li, and A\. C\. Kot \(2020\) Heterogeneous domain generalization via domain mixup\. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp\. 3622–3626\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[92\] Q\. Xu, R\. Zhang, Y\. Zhang, Y\. Wang, and Q\. Tian \(2021\) A fourier\-based framework for domain generalization\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 14383–14392\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[93\] S\. Yang, Y\. Wang, K\. Wang, S\. Jui, and J\. van de Weijer \(2022\) One ring to bring them all: towards open\-set recognition under domain shift\. arXiv preprint arXiv:2206\.03600\. Cited by: [§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p1.1), [§V\-E](https://arxiv.org/html/2606.23758#S5.SS5.p2.1), [TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.3.1)\.
- \[94\] R\. Yoshihashi, W\. Shao, R\. Kawakami, S\. You, M\. Iida, and T\. Naemura \(2019\) Classification\-reconstruction learning for open\-set recognition\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 4016–4025\. Cited by: [§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[95\] J\. Yue, L\. Fang, and M\. He \(2022\) Spectral\-spatial latent reconstruction for open\-set hyperspectral image classification\. IEEE Transactions on Image Processing 31, pp\. 5227–5241\. Cited by: [§II\-B](https://arxiv.org/html/2606.23758#S2.SS2.p1.1)\.
- \[96\] J\. Zhang, L\. Qi, Y\. Shi, and Y\. Gao \(2022\) MVDG: a unified multi\-view framework for domain generalization\. In European Conference on Computer Vision, pp\. 161–177\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1), [TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.12.1)\.
- \[97\] K\. Zhou, Y\. Yang, T\. Hospedales, and T\. Xiang \(2020\) Deep domain\-adversarial image generation for domain generalisation\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 13025–13032\. Cited by: [§V\-A](https://arxiv.org/html/2606.23758#S5.SS1.p1.1), [§V\-B](https://arxiv.org/html/2606.23758#S5.SS2.p1.1)\.
- \[98\] K\. Zhou, Y\. Yang, T\. Hospedales, and T\. Xiang \(2020\) Learning to generate novel domains for domain generalization\. In European Conference on Computer Vision, pp\. 561–578\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[99\] K\. Zhou, Y\. Yang, Y\. Qiao, and T\. Xiang \(2020\) Domain generalization with mixstyle\. In International Conference on Learning Representations\. Cited by: [§I](https://arxiv.org/html/2606.23758#S1.p2.1), [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1), [TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.9.1), [TABLE IV](https://arxiv.org/html/2606.23758#S5.T4.1.1.4.1), [TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.4.1)\.
- \[100\] K\. Zhou, Y\. Yang, Y\. Qiao, and T\. Xiang \(2021\) Domain adaptive ensemble learning\. IEEE Transactions on Image Processing 30, pp\. 8008–8018\. Cited by: [§II\-A](https://arxiv.org/html/2606.23758#S2.SS1.p1.1)\.
- \[101\] R\. Zhu and S\. Li \(2021\) CrossMatch: cross\-classifier consistency regularization for open\-set single domain generalization\. In International Conference on Learning Representations\. Cited by: [§II\-D](https://arxiv.org/html/2606.23758#S2.SS4.p1.1), [§V\-G](https://arxiv.org/html/2606.23758#S5.SS7.p6.1), [TABLE XI](https://arxiv.org/html/2606.23758#S5.T11.1.4.1), [TABLE XI](https://arxiv.org/html/2606.23758#S5.T11.1.9.1), [TABLE III](https://arxiv.org/html/2606.23758#S5.T3.1.1.10.1), [TABLE V](https://arxiv.org/html/2606.23758#S5.T5.1.1.4.1), [TABLE VI](https://arxiv.org/html/2606.23758#S5.T6.1.1.7.1)\.
