Modularity-Free Conflict-Averse Training for Generalized PINNs
Summary
This paper identifies a capacity-induced failure mode in physics-informed neural networks (PINNs) where overparameterized networks develop functional modularity that hinders convergence, and proposes Modular-Sparsity Synchronization (ModSync), a framework that penalizes task-exclusive connections to maintain cross-objective interaction and achieve state-of-the-art accuracy.
Cached at: 06/20/26, 02:35 PM
# Modularity-Free Conflict-Averse Training for Generalized PINNs
Source: [https://arxiv.org/html/2606.20156](https://arxiv.org/html/2606.20156)
###### Abstract
Physics-informed neural networks (PINNs) have become a powerful framework for solving PDEs by embedding physical laws into differentiable objectives. Despite these advances, training PINNs remains fragile: recent conflict-averse optimization schemes alleviate gradient interference between residual and boundary losses, but we show that their effectiveness deteriorates as model capacity increases. In this paper, we identify a capacity-induced failure mode in which overparameterized networks undergo functional modularity, self-partitioning into task-exclusive modules that suppress cross-objective interaction and hinder convergence toward Pareto-stationary points. To address this issue, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), which integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Extensive experiments across diverse PDE benchmarks demonstrate that ModSync consistently prevents capacity-driven failures, sustains robust cross-objective coupling, and achieves state-of-the-art accuracy. Code is available at [https://github.com/heejokong/ModSync](https://github.com/heejokong/ModSync).
Index Terms—Physics\-informed neural networks, multi\-task learning, conflict\-averse optimization, dynamic sparse training
## 1 Introduction
Physics-informed neural networks (PINNs)\[[19](https://arxiv.org/html/2606.20156#bib.bib1),[10](https://arxiv.org/html/2606.20156#bib.bib2)\] have emerged as a powerful approach for solving partial differential equations (PDEs) that underlie complex phenomena across science and engineering\[[3](https://arxiv.org/html/2606.20156#bib.bib3),[17](https://arxiv.org/html/2606.20156#bib.bib4),[12](https://arxiv.org/html/2606.20156#bib.bib30)\]. Compared with traditional numerical solvers\[[1](https://arxiv.org/html/2606.20156#bib.bib5),[6](https://arxiv.org/html/2606.20156#bib.bib6)\], PINNs embed physical laws directly into the training objective, using neural networks as differentiable ansatzes to approximate the underlying physics. This paradigm enables mesh-free training and scalable approximation, making PINNs particularly effective for high-dimensional systems where traditional methods struggle.
Despite their promise, training PINNs remains challenging due to the unclear nature of training pathologies\[[13](https://arxiv.org/html/2606.20156#bib.bib7),[20](https://arxiv.org/html/2606.20156#bib.bib8),[23](https://arxiv.org/html/2606.20156#bib.bib9),[25](https://arxiv.org/html/2606.20156#bib.bib10),[26](https://arxiv.org/html/2606.20156#bib.bib11)\]\. These challenges often lead to failures to learn correct solutions, even in relatively simple settings, thereby hindering their broader applicability\. Recent works\[[8](https://arxiv.org/html/2606.20156#bib.bib12),[21](https://arxiv.org/html/2606.20156#bib.bib13)\], inspired by multi\-task learning \(MTL\)\[[4](https://arxiv.org/html/2606.20156#bib.bib15),[28](https://arxiv.org/html/2606.20156#bib.bib16)\], have proposed conflict\-averse training to mitigate negative interference between the PDE residual and data loss via gradient manipulation, yielding notable gains across diverse physics scenarios\. However, these studies largely focus on gradient refinement while leaving other crucial aspects of PINN optimization underexplored\.
Building on the above limitations of gradient-centric remedies, we observe that as model capacity increases, existing conflict-averse methods can enter systematic failure modes. As shown in Fig. 1, beyond a capacity threshold, some even underperform vanilla PINNs trained with Adam\[[11](https://arxiv.org/html/2606.20156#bib.bib18)\], i.e., training that does not explicitly account for conflict between the objectives. Through the lens of functional modularity\[[2](https://arxiv.org/html/2606.20156#bib.bib19),[18](https://arxiv.org/html/2606.20156#bib.bib20),[29](https://arxiv.org/html/2606.20156#bib.bib21)\], we interpret this phenomenon as a capacity-induced shortcut whereby a single overparameterized network self-partitions into task-exclusive modules, allowing the PDE residual and data objectives to be optimized in isolation. Such modular segregation suppresses cross-objective interaction and impedes convergence to Pareto-stationary points\[[4](https://arxiv.org/html/2606.20156#bib.bib15),[7](https://arxiv.org/html/2606.20156#bib.bib22)\], thereby undermining the intended benefits of conflict-averse training.
Fig. 1: Effect of model capacity on PINN performance for the Burgers’ equation. With increasing width, existing conflict-averse methods (DCGD and ConFIG) suffer from systematic failures and can underperform vanilla PINNs (ADAM), while the proposed method alleviates these degradations.

To address this drawback, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), that incorporates structural optimization into conflict-averse training. Our key idea is to identify network connections that become task-exclusive with respect to PINN objectives and penalize them, while preserving interaction-promoting pathways. Specifically, our method dynamically optimizes both network parameters and structure by pruning regions where the two objectives evolve independently while preserving connections that foster interaction. This approach mitigates unintended effects of functional modularity, sustains robust cross-objective coupling, and promotes convergence toward Pareto-stationary points. Across diverse PDE benchmarks, we demonstrate that the proposed ModSync prevents, or substantially reduces, the failure modes exhibited by existing conflict-averse methods.
## 2 Background
### 2.1 Preliminaries
Physics-informed neural networks\[[19](https://arxiv.org/html/2606.20156#bib.bib1)\] are designed to solve partial differential equations (PDEs) by incorporating physical constraints directly into the training process. Let $\Omega \subset \mathbb{R}^{d}$ denote the domain and $\partial\Omega$ its boundary. We consider nonlinear PDEs of the form:
$$
\begin{aligned}
\mathcal{N}[u(x), x] &= f(x), \qquad x \in \Omega, \\
\mathcal{B}[u(x), x] &= g(x), \qquad x \in \partial\Omega,
\end{aligned}
\tag{1}
$$

where $\mathcal{N}$ and $\mathcal{B}$ denote a nonlinear differential operator and a boundary condition operator, respectively. PINNs approximate the solution $u(x)$ using a neural network $u(x;\theta)$ parameterized by $\theta$, trained to minimize two losses, the residual loss $\mathcal{L}_r(\theta)$ and the boundary loss $\mathcal{L}_b(\theta)$, defined as:
$$
\begin{aligned}
\mathcal{L}_r(\theta) &= \frac{1}{|N_r|} \sum_{x_r^i \in N_r} \left| \mathcal{N}[u(x_r^i;\theta), x_r^i] - f(x_r^i) \right|^2, \\
\mathcal{L}_b(\theta) &= \frac{1}{|N_b|} \sum_{x_b^i \in N_b} \left| \mathcal{B}[u(x_b^i;\theta), x_b^i] - g(x_b^i) \right|^2,
\end{aligned}
\tag{2}
$$

where $N_r$ and $N_b$ denote the sets of sampling points within $\Omega$ and on $\partial\Omega$, respectively. $\mathcal{L}_r(\theta)$ quantifies the extent to which the predicted solution violates the governing PDE, while $\mathcal{L}_b(\theta)$ measures the degree of violation of the boundary conditions. The overall objective is a weighted sum of these losses:
$$
\mathcal{L}(\theta) = \lambda_r \mathcal{L}_r(\theta) + \lambda_b \mathcal{L}_b(\theta), \tag{3}
$$

where $\lambda_r$ and $\lambda_b$ are the non-negative weights of each loss term. This training process naturally aligns with multi-task learning (MTL), as it seeks to minimize both losses simultaneously.
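To make Eqs. 2 and 3 concrete, the following is a minimal sketch on a toy 1D problem, $u''(x) = f(x)$ on $(0,1)$ with $u = g$ on the boundary. The toy problem, the finite-difference derivative (standing in for automatic differentiation), and the helper names `second_derivative` and `pinn_losses` are assumptions made for illustration; they are not taken from the paper.

```python
import math

def second_derivative(u, x, h=1e-3):
    """Central finite-difference estimate of u''(x); stands in for autodiff."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)

def pinn_losses(u, f, g, interior_pts, boundary_pts, lam_r=1.0, lam_b=1.0):
    """Return (L_r, L_b, L) for the toy problem u'' = f with u = g on the boundary."""
    # Residual loss of Eq. (2): mean squared PDE violation at interior points.
    loss_r = sum((second_derivative(u, x) - f(x)) ** 2 for x in interior_pts) / len(interior_pts)
    # Boundary loss of Eq. (2): mean squared mismatch at boundary points.
    loss_b = sum((u(x) - g(x)) ** 2 for x in boundary_pts) / len(boundary_pts)
    # Eq. (3): weighted sum of the two objectives.
    return loss_r, loss_b, lam_r * loss_r + lam_b * loss_b

# Hypothetical toy setup: the exact solution is u(x) = sin(pi * x).
f = lambda x: -math.pi ** 2 * math.sin(math.pi * x)
g = lambda x: 0.0
interior = [0.1 * k for k in range(1, 10)]
boundary = [0.0, 1.0]

exact = pinn_losses(lambda x: math.sin(math.pi * x), f, g, interior, boundary)
wrong = pinn_losses(lambda x: x * (1.0 - x), f, g, interior, boundary)
```

The second candidate, $x(1-x)$, satisfies the boundary condition exactly but violates the PDE, so its boundary loss vanishes while its residual loss stays large. This is the kind of situation in which the two objectives pull the parameters in different directions.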
Fig. 2: Cosine similarity of gradients versus training step for different network widths. Wide models converge near 0, indicating orthogonality of gradients and reduced cross-objective interaction.

Conflict-averse training\[[28](https://arxiv.org/html/2606.20156#bib.bib16),[15](https://arxiv.org/html/2606.20156#bib.bib17)\], a key technique in MTL, addresses gradient conflicts between tasks. A conflict arises when the inner product of two task gradients is negative, i.e., $g_i^{T} g_j < 0$ for a pair $i \neq j$, where $g_i = \nabla_\theta \mathcal{L}_i(\theta)$ is the gradient of the $i$-th objective. Such conflicts disrupt optimization by inducing destructive interference, ultimately degrading the performance of the trained model. Conflict-averse training mitigates this issue by adjusting gradients to align in cooperative directions, ensuring that tasks contribute constructively to the optimization process.
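The conflict criterion and the cosine similarity tracked in Fig. 2 can be sketched as follows. Gradients are represented as flat Python lists, and the function names are hypothetical; returning 0 for a vanishing gradient is a convention chosen here, not something the paper specifies.

```python
import math

def dot(a, b):
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two gradients (0.0 if either one vanishes)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for a zero gradient
    return dot(a, b) / (norm_a * norm_b)

def in_conflict(a, b):
    """Two objectives conflict when their gradients have a negative inner product."""
    return dot(a, b) < 0.0

g_r = [1.0, 0.0]
g_b_conflicting = [-1.0, 0.5]   # points partly against g_r
g_b_orthogonal = [0.0, 2.0]     # no interaction with g_r at all
```

Note that orthogonal gradients are not flagged as conflicting, even though they carry no cross-objective interaction. That distinction matters for the observations reported later in the paper.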
In the context of PINNs, gradient conflicts frequently occur between the residual loss $\mathcal{L}_r(\theta)$ and the boundary loss $\mathcal{L}_b(\theta)$. The most relevant works to ours are\[[8](https://arxiv.org/html/2606.20156#bib.bib12),[21](https://arxiv.org/html/2606.20156#bib.bib13)\], which adopt gradient manipulation strategies to alleviate these conflicts in PINNs. Following them, we define the generalized gradient manipulation strategy as:
$$
\hat{g} = \mathcal{A}\left(\nabla_\theta \mathcal{L}_r(\theta), \nabla_\theta \mathcal{L}_b(\theta)\right), \tag{4}
$$

where $\mathcal{A}$ represents an adjustment function tailored to reduce conflicts. For example, in\[[8](https://arxiv.org/html/2606.20156#bib.bib12)\], the adjustment function $\mathcal{A}$ can be defined as:
$$
\hat{g} = \mathcal{A}(g_r, g_b) = \frac{\left\langle g^{c}, \nabla_\theta \mathcal{L}(\theta) \right\rangle}{\left\| g^{c} \right\|^{2}}\, g^{c}, \qquad \text{where } g^{c} = \frac{\nabla_\theta \mathcal{L}_r(\theta)}{\left\| \nabla_\theta \mathcal{L}_r(\theta) \right\|} + \frac{\nabla_\theta \mathcal{L}_b(\theta)}{\left\| \nabla_\theta \mathcal{L}_b(\theta) \right\|}. \tag{5}
$$

This adjustment function operates by projecting the overall gradient $\nabla_\theta \mathcal{L}(\theta)$ onto a direction $g^{c}$, which combines the normalized gradients of $\mathcal{L}_r$ and $\mathcal{L}_b$. By aligning gradients, this manipulation effectively reduces conflicts, enabling more stable optimization convergence. The adjusted gradient $\hat{g}$ is then used for parameter updates in gradient descent as $\theta_{t+1} = \theta_t - \eta \hat{g}$, where $\eta$ is the learning rate.
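A small sketch of Eq. 5 as written above, followed by the gradient descent update, may help. It is not the reference DCGD implementation: the function names are hypothetical, and the fallbacks for a vanishing gradient or a vanishing bisector $g^{c}$ are assumptions added so the example is well defined.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def adjusted_gradient(g_r, g_b, lam_r=1.0, lam_b=1.0, tiny=1e-12):
    """Project the weighted total gradient onto g_c, the sum of the unit gradients (Eq. 5)."""
    g_total = [lam_r * x + lam_b * y for x, y in zip(g_r, g_b)]
    n_r, n_b = norm(g_r), norm(g_b)
    if n_r < tiny or n_b < tiny:
        return g_total  # assumed fallback: one gradient vanished, nothing to align
    g_c = [x / n_r + y / n_b for x, y in zip(g_r, g_b)]
    denom = dot(g_c, g_c)
    if denom < tiny:
        return [0.0] * len(g_r)  # assumed fallback: gradients exactly opposed
    scale = dot(g_c, g_total) / denom
    return [scale * x for x in g_c]

def gradient_step(theta, g_hat, lr):
    """theta_{t+1} = theta_t - lr * g_hat."""
    return [p - lr * g for p, g in zip(theta, g_hat)]

g_r = [1.0, 0.0]
g_b = [-1.0, 1.0]                 # conflicts with g_r (inner product is -1)
g_hat = adjusted_gradient(g_r, g_b)
theta_next = gradient_step([0.0, 0.0], g_hat, lr=0.1)
```

In this example the plain sum $g_r + g_b = (0, 1)$ is orthogonal to $g_r$ and makes no first-order progress on the residual loss, whereas the adjusted direction has a positive inner product with both gradients.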
Fig. 3: Sample fraction with absolute dot product exceeding $\varepsilon$ ($\varepsilon = 10^{-12}$), for (a) vanilla PINNs and (b) conflict-averse PINNs. The ratio decreases with capacity and is further reduced under conflict-averse training. Color encodes relative L2 error.
### 2.2 Empirical Observations
To investigate the side effects of conflict-averse training, we conduct experiments on the viscous Burgers’ equation, a widely used benchmark for evaluating PINN performance. Using a plain multi-layer perceptron (MLP) trained with the Adam optimizer as a baseline, we evaluate a conflict-averse method, adopting DCGD\[[8](https://arxiv.org/html/2606.20156#bib.bib12)\] as the representative approach.
Figure 2 reports the cosine similarity between the gradients of the residual and boundary losses, $\nabla_\theta \mathcal{L}_r$ and $\nabla_\theta \mathcal{L}_b$, as the network width increases while fixing depth to five layers. As training progresses, the magnitude of negative alignment (conflict) generally diminishes. Notably, for large width (512), the cosine similarity converges near zero, indicating an emergent orthogonality between the two gradients. To quantify this effect, Figure 3 reports, over the training set, the fraction of samples for which $|\nabla_\theta \mathcal{L}_r \cdot \nabla_\theta \mathcal{L}_b| > \varepsilon$ with $\varepsilon = 10^{-12}$. This ratio decreases as model capacity grows, and the decrease is amplified under conflict-averse training, implying that sample-wise gradients increasingly become orthogonal.
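The statistic plotted in Fig. 3 can be sketched as below. How the per-sample gradients of the two objectives are paired is not spelled out in the text, so this sketch simply assumes two equally long lists of flattened gradient vectors; the function name is hypothetical.

```python
def interacting_fraction(grads_r, grads_b, eps=1e-12):
    """Fraction of samples whose residual and boundary gradients satisfy |g_r . g_b| > eps."""
    if len(grads_r) != len(grads_b) or not grads_r:
        raise ValueError("expected two non-empty lists of equal length")
    count = 0
    for g_r, g_b in zip(grads_r, grads_b):
        inner = sum(x * y for x, y in zip(g_r, g_b))
        if abs(inner) > eps:
            count += 1
    return count / len(grads_r)

# Toy gradients: the first two pairs are orthogonal, the third interacts.
grads_r = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
grads_b = [[0.0, 1.0], [1.0, -1.0], [0.0, 3.0]]
fraction = interacting_fraction(grads_r, grads_b)
```

A fraction near zero means almost all sample-wise gradient pairs are orthogonal, which is the decoupling the paragraph above describes.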
These observations suggest that larger models adopt a capacity\-enabled shortcut: gradients decouple toward orthogonality, consistent with the emergence of objective\-exclusive functional modularity\. Moreover, conflict\-averse training accelerates this decoupling, reinforcing modular segregation rather than fostering cross\-objective interaction\.
## 3 Our Approach
As noted earlier, conflict\-averse training in high\-capacity PINNs can induce a capacity\-enabled shortcut—objective\-exclusive functional modularity—where the residual and boundary objectives decouple and gradients drift toward orthogonality\. To counter this effect, we propose Modular\-Sparsity Synchronization \(ModSync\), a model\-agnostic framework that augments conflict\-averse training with structural optimization\. ModSync has two key components: \(i\) identification of objective\-specific modules, and \(ii\) regularization of objective\-exclusive connections\. In the following sections, we describe the specific mechanisms of the two aspects, and then present the final optimization procedure that combines these strategies\.
### 3.1 Identifying Objective-Specific Modules
We treat the residual and boundary objectives as distinct tasks and aim to identify subnetworks that each objective predominantly develops on its own\. To this end, we adopt ideas from dynamic sparse training \(DST\)\[[16](https://arxiv.org/html/2606.20156#bib.bib23),[9](https://arxiv.org/html/2606.20156#bib.bib24),[5](https://arxiv.org/html/2606.20156#bib.bib25),[14](https://arxiv.org/html/2606.20156#bib.bib26)\]to learn binary connectivity masks jointly with the network weights\.
Let the parameters of the network consist of learnable weights $\theta$ and a binary mask $M$ that selects active connections. For each layer $i$ with the weight matrix $\theta^{i} \in \mathbb{R}^{d_o \times d_i}$, we associate a trainable threshold vector $t^{i} \in \mathbb{R}^{d_o}$ (one threshold per output unit). Using a step function $S(\cdot)$, the binary mask is constructed as:
$$
\begin{aligned}
q^{ij} &= |\theta^{ij}| - t^{i}, \\
M^{ij} &= S(q^{ij}), \qquad \text{where } 1 \leq i \leq d_o,\ 1 \leq j \leq d_i.
\end{aligned}
\tag{6}
$$

Within Eq. 6, the superscripts $i$ and $j$ index the output and input units of a single layer, so every incoming weight of output unit $i$ is compared against the same threshold $t^{i}$. With the dynamic mask $M$, an element $M^{ij}$ is set to zero if $\theta^{ij}$ is pruned, resulting in the sparse parameters $M \odot \theta$, where $\odot$ is the Hadamard product. The sparse parameters are updated through gradient-based optimization. Since $S(\cdot)$ is non-differentiable, we approximate its gradient with a smooth estimator, leveraging the long-tailed estimator\[[16](https://arxiv.org/html/2606.20156#bib.bib23)\].
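The forward mask construction of Eq. 6 can be sketched in a few lines. This covers only the forward pass: the long-tailed gradient estimator used in the backward pass is not reproduced here. The function names are hypothetical, and treating $S(0)$ as 0 (pruned) is an assumption, since the text does not fix the step function at zero.

```python
def build_mask(theta, thresholds):
    """Binary mask for one layer: M[i][j] = 1 if |theta[i][j]| - t[i] > 0, else 0."""
    if len(theta) != len(thresholds):
        raise ValueError("need one threshold per output unit (row of theta)")
    return [[1 if abs(w) - t > 0.0 else 0 for w in row]
            for row, t in zip(theta, thresholds)]

def apply_mask(theta, mask):
    """Sparse parameters: the Hadamard product of the mask and the weights."""
    return [[w if m else 0.0 for w, m in zip(row, mask_row)]
            for row, mask_row in zip(theta, mask)]

# Toy layer with d_o = 2 output units and d_i = 2 input units.
theta = [[0.5, -0.1],
         [0.2, -0.9]]
thresholds = [0.3, 0.5]           # one threshold per output unit
mask = build_mask(theta, thresholds)
sparse_theta = apply_mask(theta, mask)
```

Raising a row's threshold prunes more of that output unit's incoming weights, which is what lets a trainable threshold act as a per-unit sparsity control.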
To reveal objective-specific structure while keeping a single forward connectivity, we decompose the thresholds as $t = (t_r + t_b)/2$, where $t_r$ and $t_b$ are updated using gradients from the residual and boundary objectives, respectively. The forward propagation uses the shared mask built from the composite vector $t$, ensuring a common inference pathway, whereas the backward propagation applies separate updates to $t_r$ and $t_b$ with their own objective gradients (the weights continue to follow the conflict-averse update).
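The dual-threshold bookkeeping described above can be sketched as follows. The text does not say exactly how each objective's gradient is routed to its threshold (for instance, whether the factor 1/2 from the averaging is kept), so this sketch assumes the simplest reading: each threshold vector takes a plain gradient step using the gradient of its own objective with respect to the composite threshold. All names are hypothetical.

```python
def composite_threshold(t_r, t_b):
    """Shared forward threshold t = (t_r + t_b) / 2, element by element."""
    return [(a + b) / 2.0 for a, b in zip(t_r, t_b)]

def update_thresholds(t_r, t_b, grad_r_wrt_t, grad_b_wrt_t, lr):
    """Assumed routing: t_r follows the residual gradient, t_b the boundary gradient."""
    new_t_r = [a - lr * g for a, g in zip(t_r, grad_r_wrt_t)]
    new_t_b = [b - lr * g for b, g in zip(t_b, grad_b_wrt_t)]
    return new_t_r, new_t_b

t_r = [0.25, 0.5]
t_b = [0.75, 0.0]
t_shared = composite_threshold(t_r, t_b)   # used to build the single forward mask

# Hypothetical gradients of each loss with respect to the composite threshold.
new_t_r, new_t_b = update_thresholds(t_r, t_b, [1.0, -1.0], [0.0, 0.5], lr=0.5)
```

Because only the average enters the forward pass, the two vectors can drift apart without changing the inference pathway much, and the gap between them becomes a signal of how differently the two objectives want to use a unit.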
Combined with the sparsity regularizer introduced next, the threshold vectors $t_r$ and $t_b$ are optimized to suppress connections that are redundant for their respective objective. We therefore regard each threshold vector as an implicit objective-specific sparse subnetwork. Based on these, we quantify objective exclusivity by the dissimilarity between their supports, which will serve as the target signal for the sparsity-synchronization regularizer in the next subsection.
### 3.2 Regularizing Modular-Sparsity
Building on the objective-specific modules, we introduce a sparsity regularizer to suppress unexpected objective exclusivity while retaining interaction-promoting connections. The regularizer is designed to satisfy two requirements: (i) encourage each objective to prune redundant connections (promote sparsity), and (ii) de-emphasize the sparsity penalty on connections that are likely shared across objectives so as to preserve interaction. Specifically, the proposed sparsity term is defined as:
$$
\mathcal{L}_s(t_r, t_b) = \sum_{i=1}^{C} \mathcal{R}_w(t_r^i, t_b^i) \cdot \left( \mathcal{R}_s(t_r^i) + \mathcal{R}_s(t_b^i) \right), \tag{7}
$$

where $C$ represents the total number of layers in the training model.
To fulfill the first requirement, we adopt a monotonically decreasing sparsity-inducing function $\mathcal{R}_s(t) = \exp(-t)$, inspired by the original DST\[[16](https://arxiv.org/html/2606.20156#bib.bib23)\]. It penalizes small $t$ to prune redundant connections while avoiding over-sparsification pressure at large $t$, yielding binary masks that focus capacity on objective-relevant structure. Formally, the sparsity-inducing function is given as:
$$
\mathcal{R}_s(t_r^i) = \exp(-t_r^i), \quad \mathcal{R}_s(t_b^i) = \exp(-t_b^i). \tag{8}
$$
To satisfy the second requirement, we introduce a reweighting function $\mathcal{R}_w$, which dynamically adjusts the sparsity penalty based on the L2 distance between $t_r^i$ and $t_b^i$. This function assigns higher weights to elements with significant differences, emphasizing task-specific connections, while suppressing sparsity penalties for shared connections:
$$
\mathcal{R}_w(t_r^i, t_b^i) = \exp\left( |t_r^i - t_b^i|^2 / \beta \right), \tag{9}
$$

where the sharpness factor $\beta \in (0, 1]$ controls the sensitivity of the weighting function ($\beta$ must be strictly positive, since it appears in the denominator). To ensure training stability, we define this function in exponential form, while setting the scaling factor $\lambda_s$ in Eq. 10 to a relatively small value.
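The regularizer of Eqs. 7 to 9 can be sketched as below. The text leaves open how the per-layer threshold vectors are reduced to a scalar, so this sketch assumes an element-wise reading: each output unit contributes its own weight and sparsity term, and contributions are summed over units and layers. That reduction, and the function name, are assumptions rather than the authors' stated implementation.

```python
import math

def modsync_sparsity(t_r_layers, t_b_layers, beta=0.3):
    """Sum over layers and units of R_w(t_r, t_b) * (R_s(t_r) + R_s(t_b))."""
    if beta <= 0.0:
        raise ValueError("beta must be strictly positive")
    total = 0.0
    for t_r, t_b in zip(t_r_layers, t_b_layers):
        for a, b in zip(t_r, t_b):
            weight = math.exp((a - b) ** 2 / beta)   # Eq. (9): grows with disagreement
            sparsity = math.exp(-a) + math.exp(-b)   # Eq. (8): penalizes small thresholds
            total += weight * sparsity               # Eq. (7), element-wise (assumed)
    return total

# One layer with one unit: both objectives agree on the threshold ...
shared = modsync_sparsity([[0.5]], [[0.5]])
# ... versus one layer with one unit where the thresholds disagree.
exclusive = modsync_sparsity([[1.0]], [[0.0]])
```

With $\beta = 0.3$, the disagreeing unit is weighted by $\exp(1/0.3) \approx 28$, so its sparsity pressure dominates that of the unit on which both objectives agree (weight 1). This matches the stated intent of penalizing objective-exclusive connections more heavily than shared ones.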
### 3.3 Proposed Overview
Consequently, the overall objective function of the proposed ModSync can be defined as:
$$
\mathcal{L}(\theta, t_r, t_b) = \lambda_r \mathcal{L}_r(\tilde{\theta}) + \lambda_b \mathcal{L}_b(\tilde{\theta}) + \lambda_s \mathcal{L}_s(t_r, t_b), \tag{10}
$$

where $\tilde{\theta} = M \odot \theta$ denotes the sparse parameters obtained by applying the binary mask. For optimization with respect to $\theta$, we use the refined gradient obtained by applying the adjustment function defined in Eq. 4 to the gradients derived from $\mathcal{L}_r$ and $\mathcal{L}_b$. In contrast, $t_r$ and $t_b$ are optimized by applying independent gradients for the two objectives. The overall training procedure of the proposed method can be found in Algorithm 1.

Table 1: Comparison of PINN performance across multiple PDE benchmarks.

Fig. 4: Effect of model capacity on PINN performance across diverse PDE benchmarks, with panels (a) Helmholtz (2D) and (b) Klein-Gordon (2D). The proposed ModSync consistently prevents capacity-induced failure modes and achieves lower relative errors than existing baselines.
## 4 Experiments
We evaluate the proposed method on several challenging PDEs to assess its training stability and accuracy. Unless otherwise noted, models are trained for 50,000 iterations, and results are averaged over three seeds using the best validation checkpoint of each run. We use an identical set of hyperparameters for all ModSync runs: $\{\lambda_r = 1.0,\ \lambda_b = 1.0,\ \lambda_s = 0.0001,\ \eta_\theta = 0.001,\ \eta_t = 0.0001,\ \beta = 0.3\}$. Detailed PDE formulations and sampling strategies follow\[[8](https://arxiv.org/html/2606.20156#bib.bib12),[21](https://arxiv.org/html/2606.20156#bib.bib13),[27](https://arxiv.org/html/2606.20156#bib.bib27)\].
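For reference, the shared settings listed above can be collected in a small configuration object. The class and field names are hypothetical and do not come from the released code; only the numeric values are taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModSyncConfig:
    """Hyperparameters reported as shared across all ModSync runs (names are illustrative)."""
    lambda_r: float = 1.0        # weight of the residual loss
    lambda_b: float = 1.0        # weight of the boundary loss
    lambda_s: float = 1e-4       # weight of the sparsity regularizer
    lr_theta: float = 1e-3       # learning rate for the weights (eta_theta)
    lr_threshold: float = 1e-4   # learning rate for the thresholds (eta_t)
    beta: float = 0.3            # sharpness factor of the reweighting function
    iterations: int = 50_000     # training iterations per run
    num_seeds: int = 3           # seeds averaged in the reported results

config = ModSyncConfig()
```

The thresholds are trained with a learning rate ten times smaller than the weights, so the network structure changes more slowly than the parameters it masks.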
### 4.1 Comparison on Benchmark Equations
We consider three popular 2D benchmarks—Helmholtz, Klein–Gordon, and Burgers’—and additionally evaluate Helmholtz and Klein–Gordon in 3D\. Baselines include Vanilla PINNs \(Adam\) and PINN\-specific optimization methods \(LRA\[[22](https://arxiv.org/html/2606.20156#bib.bib28)\], NTK\[[24](https://arxiv.org/html/2606.20156#bib.bib29)\], MultiAdam\[[27](https://arxiv.org/html/2606.20156#bib.bib27)\]\), as well as MTL gradient\-manipulation methods \(PCGrad\[[28](https://arxiv.org/html/2606.20156#bib.bib16)\], CAGrad\[[15](https://arxiv.org/html/2606.20156#bib.bib17)\]\)\. To test compatibility, ModSync is integrated into two conflict\-averse PINN approaches, DCGD\[[8](https://arxiv.org/html/2606.20156#bib.bib12)\]and ConFIG\[[21](https://arxiv.org/html/2606.20156#bib.bib13)\]\. Table 1 reports absolute and relative errors for a 5\-layer MLP with width 512\. We observe that several conflict\-averse methods \(e\.g\., PCGrad, DCGD, ConFIG\) can enter capacity\-induced failure modes\. In contrast, integrating ModSync consistently mitigates these failures and achieves state\-of\-the\-art accuracy on most benchmarks\.
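The section reports relative errors without restating the formula. A commonly used definition of the relative L2 error, assumed here rather than confirmed by the text, is $\|u_{\text{pred}} - u_{\text{true}}\|_2 / \|u_{\text{true}}\|_2$ evaluated on a set of test points. The function name is hypothetical.

```python
import math

def relative_l2_error(u_pred, u_true):
    """||u_pred - u_true||_2 / ||u_true||_2 over paired evaluation points."""
    if len(u_pred) != len(u_true):
        raise ValueError("prediction and reference must have the same length")
    diff_sq = sum((p - t) ** 2 for p, t in zip(u_pred, u_true))
    ref_sq = sum(t ** 2 for t in u_true)
    if ref_sq == 0.0:
        raise ValueError("reference solution is identically zero")
    return math.sqrt(diff_sq / ref_sq)

perfect = relative_l2_error([1.0, 2.0], [1.0, 2.0])
trivial = relative_l2_error([0.0, 0.0], [3.0, 4.0])   # predicting zero everywhere
```

Under this definition, a model that predicts zero everywhere scores exactly 1, which gives a useful reference point when reading relative errors of failed runs.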
### 4.2 Verification of Scalability Effects
To directly probe ModSync’s suppression of functional modularity, we study model width scaling\. Guided by Table 1, we adopt strong baselines—Adam, LRA, NTK, DCGD, and ConFIG—and evaluate ModSync when combined with DCGD and ConFIG\. All models use a 5\-layer MLP; only width varies\. Figures 1 and 4 show that DCGD and ConFIG perform well at small widths but degrade as capacity increases, exhibiting the predicted failure modes\. When augmented with ModSync, both methods retain stable convergence and competitive accuracy even at larger widths\. These results substantiate that ModSync effectively counters the structural shortcut \(functional modularity\) responsible for capacity\-driven failures\.
### 4.3 Ablation Study
To assess the contribution of each component in ModSync, we conduct ablations on the Burgers’ equation using a 5-layer PINN (width 512). Table 2 summarizes the results. Compared to standard PINNs (Adam), the DCGD baseline fails to converge, yielding a large relative error. Introducing a plain DST variant with only the sparsity regularizer $\mathcal{R}_s$ (third row) restores convergence, and adopting objective-centric DST\[[16](https://arxiv.org/html/2606.20156#bib.bib23)\] with dual thresholds (fourth row) further improves accuracy. We attribute these gains to early-stage correction of redundant connections that otherwise foster functional modularity. Finally, the full ModSync, which combines dual thresholds, $\mathcal{R}_s$, and the reweighting $\mathcal{R}_w$, achieves the best performance, delivering meaningful improvements over both Adam-optimized PINNs and DCGD-optimized models.
Table 2: Ablation studies of the individual components.
## 5 Conclusion
We investigated the fragility of conflict\-averse training in PINNs and showed that growing model capacity induces functional modularity, where networks partition into task\-exclusive modules and suppress cross\-objective interaction\. To address this, we proposed ModSync, a structural optimization that integrates with conflict\-averse schemes by penalizing exclusive connections while preserving interaction pathways\. Experiments across diverse PDEs demonstrated that ModSync mitigates capacity\-induced failures, sustains stable convergence, and achieves state\-of\-the\-art accuracy\. These results underscore the role of structural regularization in enabling scalable, reliable PINN training\.
## References
- \[1\]\(2006\)Finite element procedures\.Klaus\-Jurgen Bathe\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[2\]R\. Csordás, S\. van Steenkiste, and J\. Schmidhuber\(2021\)Are neural nets modular? inspecting functional modularity through differentiable weight masks\.InProc\. International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.
- \[3\]S\. Cuomo, V\. S\. Di Cola, F\. Giampaolo, G\. Rozza, M\. Raissi, and F\. Piccialli\(2022\)Scientific machine learning through physics–informed neural networks: where we are and what’s next\.Journal of Scientific Computing,pp\. 88\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[4\]J\. Désidéri\(2012\)Multiple\-gradient descent algorithm \(mgda\) for multiobjective optimization\.Comptes Rendus Mathematique,pp\. 313–318\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1),[§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.
- \[5\]U\. Evci, T\. Gale, J\. Menick, P\. S\. Castro, and E\. Elsen\(2020\)Rigging the lottery: making all tickets winners\.InProc\. International Conference on Machine Learning \(ICML\),pp\. 2943–2952\.Cited by:[§3\.1](https://arxiv.org/html/2606.20156#S3.SS1.p1.1)\.
- \[6\]C\. Grossmann, H\. Roos, and M\. Stynes\(2007\)Numerical treatment of partial differential equations: translated and revised by martin stynes\.Springer\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[7\]H\. M\. Hochman and J\. D\. Rodgers\(1969\)Pareto optimal redistribution\.The American economic review,pp\. 542–557\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.
- \[8\]Y\. Hwang and D\. Lim\(2024\)Dual cone gradient descent for training physics\-informed neural networks\.Advances in Neural Information Processing Systems \(NeurIPS\),pp\. 98563–98595\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p3.2),[§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p3.4),[§2\.2](https://arxiv.org/html/2606.20156#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.20156#S4.p1.1)\.
- \[9\]J\. Ji, G\. Li, L\. Yin, M\. Qin, G\. Yuan, L\. Guo, S\. Liu, and X\. Ma\(2024\)Advancing dynamic sparse training by exploring optimization opportunities\.InProc\. International Conference on Machine Learning \(ICML\),pp\. 21606–21619\.Cited by:[§3\.1](https://arxiv.org/html/2606.20156#S3.SS1.p1.1)\.
- \[10\]G\. E\. Karniadakis, I\. G\. Kevrekidis, L\. Lu, P\. Perdikaris, S\. Wang, and L\. Yang\(2021\)Physics\-informed machine learning\.Nature Reviews Physics,pp\. 422–440\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[11\]D\. P\. Kingma and J\. Ba\(2015\)Adam: A method for stochastic optimization\.InProc\. International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.
- \[12\]M\. Koh, B\. Park, H\. Kong, and S\. Lee\(2025\)Integrating locality\-aware attention with transformers for general geometry pdes\.InProc\. International Joint Conference on Neural Networks \(IJCNN\),Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[13\]A\. Krishnapriyan, A\. Gholami, S\. Zhe, R\. Kirby, and M\. W\. Mahoney\(2021\)Characterizing possible failure modes in physics\-informed neural networks\.Advances in Neural Information Processing Systems \(NeurIPS\),pp\. 26548–26560\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1)\.
- \[14\]M\. Lasby, A\. Golubeva, U\. Evci, M\. Nica, and Y\. Ioannou\(2024\)Dynamic sparse training with structured sparsity\.InProc\. International Conference on Learning Representations \(ICLR\),Cited by:[§3\.1](https://arxiv.org/html/2606.20156#S3.SS1.p1.1)\.
- \[15\]B\. Liu, X\. Liu, X\. Jin, P\. Stone, and Q\. Liu\(2021\)Conflict\-averse gradient descent for multi\-task learning\.Advances in Neural Information Processing Systems \(NeurIPS\),pp\. 18878–18890\.Cited by:[§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p2.3),[§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1)\.
- \[16\]J\. LIU, Z\. XU, R\. SHI, R\. C\. C\. Cheung, and H\. K\.H\. So\(2020\)Dynamic sparse training: find efficient sparse network from scratch with trainable masked layers\.InProc\. International Conference on Learning Representations \(ICLR\),Cited by:[§3\.1](https://arxiv.org/html/2606.20156#S3.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.20156#S3.SS1.p2.12),[§3\.2](https://arxiv.org/html/2606.20156#S3.SS2.p2.3),[§4\.3](https://arxiv.org/html/2606.20156#S4.SS3.p1.3)\.
- \[17\]Y\. Park, P\. Gerstoft, S\. Yoon, and W\. Seong\(2025\)Physics\-informed neural networks for ocean acoustic field prediction with envelope smoothing\.InProc\. IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1)\.
- \[18\]J\. Pfeiffer, S\. Ruder, I\. Vulić, and E\. Ponti\(2023\)Modular deep learning\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.
- \[19\]M\. Raissi, P\. Perdikaris, and G\. E\. Karniadakis\(2019\)Physics\-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\.Journal of Computational physics,pp\. 686–707\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p1.2)\.
- \[20\]P\. Rathore, W\. Lei, Z\. Frangella, L\. Lu, and M\. Udell\(2024\)Challenges in training PINNs: a loss landscape perspective\.InProc\. International Conference on Machine Learning \(ICML\),pp\. 42159–42191\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1)\.
- \[21\]P\. Rathore, W\. Lei, Z\. Frangella, L\. Lu, and M\. Udell\(2025\)Config: towards conflict\-free training of physics informed neural networks\.InProc\. International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p3.2),[§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.20156#S4.p1.1)\.
- \[22\]S\. Wang, Y\. Teng, and P\. Perdikaris\(2021\)Understanding and mitigating gradient flow pathologies in physics\-informed neural networks\.SIAM Journal on Scientific Computing,pp\. A3055–A3081\.Cited by:[§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1)\.
- \[23\]S\. Wang, H\. Wang, and P\. Perdikaris\(2021\)On the eigenvector bias of fourier feature networks: from regression to solving multi\-scale pdes with physics\-informed neural networks\.Computer Methods in Applied Mechanics and Engineering,pp\. 113938\.Cited by:[§1](https://arxiv.org/html/2606.20156#S1.p2.1)\.
- \[24\] S\. Wang, X\. Yu, and P\. Perdikaris \(2022\) When and why PINNs fail to train: a neural tangent kernel perspective\. Journal of Computational Physics, pp\. 110768\. Cited by: [§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1)\.
- \[25\] C\. Wu, M\. Zhu, Q\. Tan, Y\. Kartha, and L\. Lu \(2023\) A comprehensive study of non\-adaptive and residual\-based adaptive sampling for physics\-informed neural networks\. Computer Methods in Applied Mechanics and Engineering, pp\. 115671\. Cited by: [§1](https://arxiv.org/html/2606.20156#S1.p2.1)\.
- \[26\] Z\. Xiang, W\. Peng, X\. Liu, and W\. Yao \(2022\) Self\-adaptive loss balanced physics\-informed neural networks\. Neurocomputing 496, pp\. 11–34\. Cited by: [§1](https://arxiv.org/html/2606.20156#S1.p2.1)\.
- \[27\] J\. Yao, C\. Su, Z\. Hao, S\. Liu, H\. Su, and J\. Zhu \(2023\) MultiAdam: parameter\-wise scale\-invariant optimizer for multiscale training of physics\-informed neural networks\. In Proc\. International Conference on Machine Learning \(ICML\), pp\. 39702–39721\. Cited by: [§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1), [§4](https://arxiv.org/html/2606.20156#S4.p1.1)\.
- \[28\] T\. Yu, S\. Kumar, A\. Gupta, S\. Levine, K\. Hausman, and C\. Finn \(2020\) Gradient surgery for multi\-task learning\. Advances in Neural Information Processing Systems \(NeurIPS\), pp\. 5824–5836\. Cited by: [§1](https://arxiv.org/html/2606.20156#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.20156#S2.SS1.p2.3), [§4\.1](https://arxiv.org/html/2606.20156#S4.SS1.p1.1)\.
- \[29\] Z\. Zhang, Z\. Zeng, Y\. Lin, C\. Xiao, X\. Wang, X\. Han, Z\. Liu, R\. Xie, M\. Sun, and J\. Zhou \(2023\) Emergent modularity in pre\-trained transformers\. In Proc\. Findings of the Association for Computational Linguistics \(ACL\), pp\. 4066–4083\. Cited by: [§1](https://arxiv.org/html/2606.20156#S1.p3.1)\.