Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization
Summary
This paper identifies a consistent three-regime structure in scientific machine learning models, showing that optimization effectiveness is regime-specific and can challenge conventional loss-landscape interpretations. It proposes a regime-aware diagnostic framework validated across PINNs, neural operators, and neural ODEs.
View Cached Full Text
Cached at: 05/29/26, 09:17 AM
# Distinct Failure Modes and Regime-specific Optimization
Source: [https://arxiv.org/html/2605.29153](https://arxiv.org/html/2605.29153)
## Unveiling Multi\-regime Patterns in SciML: Distinct Failure Modes and Regime\-specific Optimization
Yuanzhe HuXiaokun ZhongXiaopeng WangHaiquan LuTianyu PangMichael W\. MahoneyYujun YanPu RenYaoqing Yang
###### Abstract
Neural networks trained under different hyperparameter settings can fall into distinct training “regimes,” with consistent behavior within regimes and qualitative differences across regimes\. In this paper, we study such multi\-regime behavior in scientific machine learning \(SciML\) models through a*regime\-aware diagnostic framework*that jointly analyzes performance, training dynamics, and loss\-landscape geometry\. We identify three key findings: \(i\) a consistent three\-regime structure emerges across many standard SciML models, different constraint enforcements, and various optimizer designs; \(ii\) optimization effectiveness is regime\-specific, with no single method performing well across all regimes; and \(iii\) SciML models can exhibit fine\-grained failure modes that can challenge conventional interpretations of standard loss\-landscape metrics\. Our results provide an approach to establish a unified, task\-oblivious perspective on failure modes in SciML and to inform regime\-aware guidance for improving robustness\. We validate these findings across widely\-used SciML models, including physics\-informed neural networks, neural operators, and neural ordinary differential equations, on benchmarks spanning representative ordinary and partial differential equations\.
Scientific Machine Learning, Loss Landscape, Multi\-regime Analysis
## 1Introduction
Scientific machine learning \(SciML\) offers the opportunity to combine data\-driven ML models with domain\-driven physical models\. Popular SciML model families include: \(i\) physics\-informed neural networks \(PINNs\), which try to enforce physical laws through optimization penalties\(Raissiet al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib19); Karniadakiset al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib82)\); \(ii\) neural operators \(NOs\), which try to learn mappings between infinite\-dimensional function spaces\(Liet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib23); Luet al\.,[2021a](https://arxiv.org/html/2605.29153#bib.bib124)\); and \(iii\) neural ordinary differential equations \(NODEs\), which try to parameterize continuous\-time dynamics and train via differentiable ODE solvers\(Chenet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib121)\)\. These approaches have been applied in scientific discovery and engineering design in an effort to accelerate research and improve computational workflows\(Karniadakiset al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib82); Wanget al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib134); Azizzadenesheliet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib76)\)\.
Despite the promise of these methods, using them in realistic SciML workflows can be challenging, and recent studies have identified fundamental reasons for these challenges\. This includes optimization and training difficulties in PINNs\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\), resolution and aliasing issues in NOs\(Sakarvadiaet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib118)\), and discretization and continuity failures in NODEs\(Krishnapriyanet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib125)\)\. These issues point to a broader ill\-posedness: many SciML problems exhibit strong interactions between physical regimes \(e\.g\., stiffness and high\-frequency features\) and training regimes \(e\.g\., limited supervision and constraint enforcement\), leading to behavior that is highly sensitive and often poorly conditioned, and thus difficult to understand from task\-level performance alone\.
This motivates a diagnostic perspective\. Following prior work on taxonomizing local versus global structure in neural network loss landscapes\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9); Zhouet al\.,[2023b](https://arxiv.org/html/2605.29153#bib.bib10)\), we adopt a diagnostic approach to analyze and address these challenges\. We provide empirical evidence that, in many cases, the failure modes of SciML models exhibit structured and consistent patterns across commonly\-used SciML model families, alongside distinct model\-specific behaviors\. To support this analysis, we develop a*regime\-aware diagnostic framework*that systematically maps*when*and*why*these model families exhibit various properties across physical and training regimes\.
FollowingYanget al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib9)\); Zhouet al\.\([2023b](https://arxiv.org/html/2605.29153#bib.bib10)\), a*regime*here refers to a region in the configuration space where loss\-landscape properties, such as sharpness\(Foretet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib45); Keskaret al\.,[2016](https://arxiv.org/html/2605.29153#bib.bib52)\), connectivity\(Garipovet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib54); Draxleret al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib55)\), and representation similarity\(Kornblithet al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib56)\), remain qualitatively homogeneous and thus correspond to similar model quality\. Building on this notion, we aim to translate these findings into actionable guidance, including revealing when better\-conditioned model designs and optimization methods reduce sensitivity and deliver more stable implicit biases\. Our approach is motivated in particular by the observation that many SciML exhibits richer and more complex loss landscape structures than typical computer vision \(CV\) or natural language processing \(NLP\) models\(Liet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib39); Kwonet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib11); Bahriet al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib63); Hooglandet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib46); Chenet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib62)\), and thus more refined diagnostic tools are needed for diagnosing their properties and understanding them\. As an example, we show empirically, in Section[2\.5](https://arxiv.org/html/2605.29153#S2.SS5), that SciML loss landscapes can lack the well\-conditioned basins commonly observed in standard CV models, and instead display strong nonconvexity and sensitivity to problem setup\. This qualitatively\-new source of regime structure arises from the interaction among data representation, physical constraints, and optimizations, all of which call for novel diagnostic tools for SciML\.
To move beyond ad\-hoc hyperparameter tuning and to start to make the above observations operational, we adopt a regime\-aware empirical evaluation that systematically varies both*physical regimes*\(e\.g\., parameters and types of partial differential equations \(PDEs\)\) and*training regimes*\(e\.g\., optimizer configurations, constraint handling, and collocation design\)\. By jointly analyzing model performance, training dynamics, and loss\-landscape structure across these axes, we construct empirical regime maps of model behavior that reveal consistent patterns across SciML models\. By incorporating problem\-dependent structure from the underlying physics and discretization, our framework extends prior work on systematic diagnostic analysis in CV/NLP\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)to the SciML setting\. While existing SciML studies have explored related failure phenomena\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20); Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58); Krishnapriyanet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib125); Sakarvadiaet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib118)\), they are typically limited to a small number of settings and do not provide an operational view of*when*a method is reliable,*why*it fails, and*which interventions*\(optimization/training tricks vs\. better\-conditioned model designs\) most effectively improve robustness\. To summarize, our main contributions are as follows:
- •We introduce*a regime\-based evaluation taxonomy*that categorizes trained SciML models into three regimes, which are separated by clear boundaries on the training\- and test\-error plot: Well\-Trained, Under\-Trained, and Over\-Trained\. This three\-regime pattern consistently appears across various model classes, physical systems, and optimization/training methods\.
- •We develop*a unified evaluation framework*that connects optimizer behavior to loss landscape features and downstream performance\. Using this framework, we characterize*boundaries of effectiveness*for optimization methods for SciML models, and we show that different approaches are effective in different regimes\.
- •We identify and analyze previously\-uncharacterized and counterintuitive pathological phenomena in SciML models that can cause standard \(CV/NLP motivated\) performance metrics to misrepresent the underlying loss behavior\. For example, we observe*deceptive sharpness*, where the Hessian is ill\-conditioned such that its trace and leading eigenvalues continue to grow, even as the model becomes increasingly optimized, as indicated by a reduced loss close to zero \(Figures[7](https://arxiv.org/html/2605.29153#S2.F7)and[11](https://arxiv.org/html/2605.29153#A6.F11)\(a\)\)\. We also observe*deceptive flatness*, where the Hessian appears well\-behaved while the training loss remains elevated—a clear sign of undertraining \(Figures[11](https://arxiv.org/html/2605.29153#A6.F11)\(b\) and[13](https://arxiv.org/html/2605.29153#A6.F13)\)\. We attribute the breakdown of the positive Hessian\-loss correlation in SciML to the non\-trivial nature of its loss landscape, a fundamental departure from \(relatively\) well\-behaved CV landscapes\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)\. We discuss five loss landscape peculiarities that have been associated with training difficulties or failures in CV and NLP models; and we examine which of them do, or do not, explain the specific failures observed in SciML models\.
Overall, our work provides an empirical regime\-based perspective on SciML models that complements existing methodological advances; and, by making failure modes measurable, our framework offers practical guidance for improving robustness and stability in SciML applications\. For more related work, see Appendix[A](https://arxiv.org/html/2605.29153#A1); and subsequent appendices contain additional information\. Our code is available at[https://github\.com/leastima/sciml\_multi\_regime](https://github.com/leastima/sciml_multi_regime)\.
## 2Multi\-Regime Analysis Results
In this section, we present our multi\-regime empirical analysis along three complementary axes: model family; physical system; and optimization or training strategy\. We examine how these factors shape regime geometry, trainability, and generalization\. We then provide several case studies of loss landscapes observed in SciML tasks\.
### 2\.1Experimental Setup
##### Datasets and Models\.
We study representative regime structures across five major SciML models, including PINNs, Fourier NOs \(FNOs\), Physics\-informed NOs \(PINOs\), NODEs, and Physics\-informed NODEs \(PINODEs\)\. These models are widely\-used and conceptually distinct, covering physics\-constrained ML methods, operator learning, and temporal dynamics modeling\. For PINNs, we train on 1D convection, reaction, wave, and reaction\-diffusion equations, following the settings in previous work\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20); Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58)\)\. We use 2D Poisson and advection\-diffusion systems for FNOs\(Subramanianet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib26)\), and 2D Darcy flow for PINOs\(Liet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib93)\); and we use the nonlinear pendulum benchmark in prior work\(Krishnapriyanet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib125)\)for NODEs and PINODEs\.
##### Training and Optimization\.
We consider a range of optimization and training strategies, including standard first\- and second\-order methods \(Adam\(Kingma and Ba,[2015](https://arxiv.org/html/2605.29153#bib.bib136)\)and L\-BFGS\(Zhuet al\.,[1997](https://arxiv.org/html/2605.29153#bib.bib141)\)\), the advanced second\-order NysNewton\-CG \(NNCG\) method\(Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58)\), hard\-constrained optimization via the augmented Lagrangian method \(ALM\)\(Luet al\.,[2021b](https://arxiv.org/html/2605.29153#bib.bib41)\), curriculum learning \(CL\)\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\), and the RoPINN stabilization framework\(Wuet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib69)\)\. These methods play different roles in different algorithms: Adam, L\-BFGS, and NNCG modify the parameter\-update dynamics; ALM changes the constraint\-enforcement formulation; CL alters the training trajectory; and RoPINN aims to improve collocation conditioning through PINN\-specific stabilization\. Thus, we study optimization and training interventions broadly, rather than optimizers in the narrow sense\. The detailed experimental configurations for each SciML model and benchmark system are summarized in Table[1](https://arxiv.org/html/2605.29153#A3.T1)\. Unless otherwise stated, five random seeds are used for PINN and FNO experiments, and three random seeds are used for the remaining experiments\. The ResNet\-18 model\(Heet al\.,[2016](https://arxiv.org/html/2605.29153#bib.bib138)\)discussed in Section[2\.5](https://arxiv.org/html/2605.29153#S2.SS5), which we include for comparison, is trained using stochastic gradient descent \(SGD\)\.
##### Regime Plots and Boundaries\.
The regime plots we will present \(see Figure[1](https://arxiv.org/html/2605.29153#S2.F1)for an example\) summarize training and test performance across representative SciML models\. In each plot, thexx\-axis corresponds to a problem\-specific physical parameter, such as the PDE coefficientβ\\betain the 1D convection equation\. Varying this parameter changes the solution structure and target complexity induced by the governing equation, thereby altering training difficulty in a problem\-dependent manner\. Theyy\-axis denotes the amount of supervision, measured by the number of collocation points for PINNs and by the number of training samples for other models\. The implementation details of computing Hessian\-related quantities, defining regime boundaries, and generating regime plots are described in Appendix[D\.2](https://arxiv.org/html/2605.29153#A4.SS2),[D\.3](https://arxiv.org/html/2605.29153#A4.SS3), and[D\.4](https://arxiv.org/html/2605.29153#A4.SS4), respectively\. Specifically, regime boundaries are extracted automatically from the regime diagrams using training\- and test\-error thresholds\. Thegeneralization boundary\(dashed line\) is determined by the test\-error thresholdTtestT\_\{\\text\{test\}\}and separates regions with successful generalization from those where the model trains but fails to generalize well\. Thetraining boundary\(solid line\) is determined by the training\-loss thresholdTtrainT\_\{\\text\{train\}\}and separates trainable regions from those where optimization remains insufficient\. To assess robustness, we perturb the percentile thresholds by±20%\\pm 20\\%and recompute the regime boundaries\.
### 2\.2Regimes Across SciML Models
\(*a*\) PINN\(*b*\) FNO\(*c*\) PINO\(*d*\) NODE\(*e*\) PINODEFigure 1:Representative regime plots acrossvarying model families\. Lighter \(yellow\) and darker \(green\) colors denote lower and higher training loss/test error, respectively\. Across all models, the train\-test regime plots consistently separate into three regimes: Regime I \(Well\-Trained regime with low training and test error\); Regime II \(Under\-Trained regime with high training and test errors\); and Regime III \(Over\-Trained regime with low training loss and high test errors\)\. Colored curves indicate empirically identified regime boundaries under threshold perturbations, showing the robustness and sensitivity of the extracted regime structure\.Here, we analyze empirical regimes exhibited by SciML models across different physical and training configurations\. For each model, we use a benchmark system and its standard training configuration commonly adopted in prior work\. Specifically, we evaluate PINNs on the 1D convection equation, PINOs on the Darcy flow problems, and PINODEs on the nonlinear pendulum system, using the standard L\-BFGS optimizer; and we train FNOs on the 2D advection\-diffusion system and NODEs on the pendulum system using Adam\. We note that these evaluations are not intended as strictly controlled architecture ablations, but instead we use them to provide representative examples for analyzing how different SciML models exhibit distinct regime geometry and transition behavior under realistic training settings\.
Rather than assuming a predefined set of regimes, we identify empirical regimes from the joint behavior of training and test errors over this 2D configuration space\. Figure[1](https://arxiv.org/html/2605.29153#S2.F1)shows that a coarse three\-regime structure emerges broadly across diverse SciML models, despite substantial differences in model architectures, training supervisions, and physical systems\. These regime labels aredescriptive: they summarize the observed relationship between training performance, test errors, physical difficulty, and data availability\. The three empirical regimes are described as follows:
- •Regime I \(Well\-Trained\)\.This regime corresponds to configurations where the model achieves low training loss and low test error\. For the popular SciML models we consider, this regime typically occurs when the physical regime is relatively easy to learn and sufficient training information is available, e\.g\., with dense collocation points and/or “nice” physical parameter settings for PINNs, or with a large number of training samples for FNOs and NODEs\. In this region, the model can both optimize the training objective and also generalize well to unseen samples\. This regime corresponds to the yellow\-colored region marked with “I”\.
- •Regime II \(Under\-Trained\)\.This regime is characterized by simultaneously high training and test errors\. This regime often arises in physically more challenging configurations where the model struggles to optimize the objective, despite having relatively abundant training information\. In some sense, the dominant failure mode isoptimization difficultyrather than data scarcity: the model fails to adequately minimize the training objective, and consequently it generalizes poorly\. This region is labeled “II” in the regime figure\.
- •Regime III \(Over\-Trained\)\.This regime is marked by low training error but high test error\. This regime appears when training information is limited, such as when PINNs use sparse collocation points or have “hard” physical parameters, or when FNOs and NODEs are trained with only a small number of samples\. This behavior indicates a data\-limited failure mode \(degraded generalization\): the model fits the available supervision, but it does not learn a solution that generalizes across the full physical domain or distribution\. This region is labeled “III” in the regime figure\.
Although all classes of SciML models exhibit similar regimes and failure modes, the detailed geometry of these regions differs across models\. Physics\-constrained models, particularly PINNs, exhibit sharper and more localized transitions between regimes, indicating stronger sensitivity to physical difficulty and/or training conditions\. In contrast, NOs and NODEs display smoother and more diffuse transition structure, with less clearly\-separated trainability and generalization regions\. Interestingly, PINOs exhibit intermediate behavior between these extremes, suggesting that incorporating physical constraints into operator\-learning architectures partially reintroduces the sharper regime structures observed in PINNs\. These observations indicate that while coarse failure modes such as trainability and generalization structure appear to be a recurring phenomenon across SciML, the fine\-scale geometry of the regimes depends on the balance between data\-driven learning, physical constraints, and optimization and training methods\.
### 2\.3Regimes Under Varying Physics
Here, we consider different physical systems, and we observe that different systems exhibit a similar coarse three\-regime structure under a fixed optimization and model setup\. To illustrate this, we use PINNs as a controlled representative setting \(because their collocation\-based residual optimization produces sharper regime transitions than other SciML models\)\. We use the standard and stable L\-BFGS optimizer for training PINNs\. The selected PDEs span four representative PDE systems frequently studied in SciML\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\), including transport\-dominated dynamics \(convection\), reaction\-driven dynamics \(reaction\), oscillatory behavior \(wave\), and coupled transport\-reaction effects \(reaction\-diffusion\)\. Additional regime plots for other SciML models, training strategies, and physical settings are provided in Figures[8](https://arxiv.org/html/2605.29153#A5.F8),[9](https://arxiv.org/html/2605.29153#A5.F9), and[10](https://arxiv.org/html/2605.29153#A5.F10)of Appendix[E](https://arxiv.org/html/2605.29153#A5)\.
Figure[2](https://arxiv.org/html/2605.29153#S2.F2)shows that a consistent coarse three\-regime structure exists across four PDE systems under a fixed PINN and L\-BFGS training setup\. Despite differences in the underlying physics, all four PDEs exhibit recurring regions associated with stable training and generalization \(Regime I\), optimization difficulties \(Regime II\), and degraded generalization \(Regime III\)\. This consistency suggests that similar trainability and generalization regimes arise across PDE systems with different physical mechanisms, while PDE\-specific structure and parameter\-induced difficulty shape the geometry and location of the regime boundaries\. These results demonstrate that while coarse regime organization persists across PDE systems, the fine\-scale geometry of the regimes is shaped by the specific physical mechanisms and parameter\-induced properties of the underlying equations\.
\(*a*\) 1D Convection\(*b*\) 1D Reaction\(*c*\) 1D Wave\(*d*\) 1D Reaction\-DiffusionFigure 2:Regime plots acrossvarying physical systemsusing a fixed PINN architecture trained with the L\-BFGS optimizer\. The evaluated PDE systems include \(*a*\) the 1D convection equation with convection coefficientβ\\beta, \(*b*\) the 1D reaction equation with reaction coefficientρ\\rho, \(*c*\) the 1D wave equation with wave speedcc, and \(*d*\) the 1D reaction\-diffusion equation with reaction coefficientρ\\rho\.
### 2\.4Effect of Optimization and Training Strategies
Here, to analyze how optimization and training shape regime geometry, we evaluate three physics\-constrained SciML models under a variety of representative optimization and training strategies\. We consider PINNs on the 1D convection equation, PINOs on the Darcy flow problem, and PINODEs on the nonlinear pendulum system\. Figure[3](https://arxiv.org/html/2605.29153#S2.F3)shows that these interventions primarily reshape the geometry and location of the regime boundaries, rather than eliminating the underlying coarse regime structure\. Across all methods, the same coarse regions remain visible, but the detailed shape and extent of these regions vary across interventions\.
\(*a*\) PINN, RoPINN\(*b*\) PINN, L\-BFGS\(*c*\) PINN, ALM\(*d*\) PINN, NNCG\(*e*\) PINN, CL\(*f*\) PINO, Adam\(*g*\) PINO, L\-BFGS\(*h*\) PINO, ALM\(*i*\) PINO, NNCG\(*j*\) PINO, CL\(*k*\) PINODE, Adam\(*l*\) PINODE, L\-BFGS\(*m*\) PINODE, ALM\(*n*\) PINODE, NNCG\(*o*\) PINODE, CLFigure 3:Regime maps for physics\-constrained SciML models undervarying optimization and training strategies\. The first pair of rows presents PINNs on the 1D convection equation using \(*a*\) RoPINN, \(*b*\) L\-BFGS, \(*c*\) ALM, \(*d*\) NNCG, and \(*e*\) CL\. The second pair of rows presents PINOs on the Darcy flow problem using \(*f*\) Adam, \(*g*\) L\-BFGS, \(*h*\) ALM, \(*i*\) NNCG, and \(*j*\) CL\. The third pair of rows presents PINODEs on the nonlinear pendulum system using \(*k*\) Adam, \(*l*\) L\-BFGS, \(*m*\) ALM, \(*n*\) NNCG, and \(*o*\) CL\.##### Effect of Soft and Hard Constraints\.
As shown in Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(the second and third columns\), the comparison between the standard soft\-constrained PINN training and the hard\-constrained ALM training indicates that the optimization details of constraint handling can shift the trainability boundary in challenging physical regimes\. For example, the vertical regime transition line in Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(b\) on the training loss plot is atβ≈25\\beta\\approx 25, while the same transition line shifts toβ≈50\\beta\\approx 50in Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(c\)\. Similar behavior is also observed for PINOs and PINODEs\. That is, ALM expands portions of the trainable region \(Regime I\) by enforcing the PDE constraint through a hard\-constrained augmented Lagrangian formulation, rather than relying only on a soft\-constrained penalty method formulation\. This suggests that improved constraint handling can reduce sensitivity to loss\-balancing effects and partially mitigate optimization failure \(Regime II\)\. However, ALM does not remove Regime III; instead, it changes its quantitative boundaries\.
##### Effect of CL\.
As shown in Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(e,j,o\), CL improves physics\-constrained SciML training in both Regime II \(Under\-Trained\) and Regime III \(Over\-Trained\)\. In Regime II, CL shifts the Regime I/II boundary toward more challenging physical parameters and lower\-data settings, converting some previously under\-trained configurations into trainable ones\. In Regime III, CL reduces test error in parts of the low\-data region, but it does not fully remove generalization\-limited behavior\. For instance, compared with ALM \(Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(c\)\), CL maintains a comparable training loss boundary to theβ=50\\beta=50; while in Regime III, CL significantly reduces the likelihood of a high generalization gap, outperforming ALM\. While ALM requires a fivefold increase in collocation points \(Nf≥5000N\_\{f\}\\geq 5000\) to maintain fidelity asβ\\betarises to 30, CL consistently achieves superior accuracy with as few asNf=1000N\_\{f\}=1000\. Moreover, CL produces visibly smoother and less sharply separated transitions than the other strategies\. This behavior is consistent with the role of curricula in gradually exposing the model to harder configurations, which changes the optimization trajectory and reduces abrupt changes in trainability\. The resulting regime map shows that CL can reshape difficult\-regime behavior and modify portions of the boundary\.
##### Effect of RoPINN\.
RoPINN exhibits a distinct regime\-shaping effect as a stabilization method rather than a generic optimizer\. In Figure[3](https://arxiv.org/html/2605.29153#S2.F3)\(a\), it preserves the coarse three\-regime structure, but it moves the trainability boundary toward larger convection coefficientsβ\\beta, indicating improved robustness in more difficult physical regimes with limited collocation points\. Compared with CL, which smooths regime transitions by modifying the training trajectory, RoPINN produces a sharper boundary shift by improving PINN\-specific training stability\. However, the generalization\-limited regime \(Regime III\) remains visible, especially in low\-collocation or high\-β\\betasettings, showing that RoPINN improves trainability but does not fully eliminate generalization failure\.
##### Effect of NNCG\.
The NNCG optimizer, applied here as a post\-training fine\-tuning stage initialized with the converged L\-BFGS solution, demonstrates a distinct performance behavior\. While NNCG is designed to use curvature information via a Nyström approximation of the Hessian, our analysis reveals that its stability is sensitive to the physical regime\. Specifically, NNCG presents significant gains in Regime I, corresponding to the low\-to\-medium coefficient \(e\.g\.,β\\beta\) regions \(the left two\-thirds of the phase space\)\. However, in the high\-difficulty Regime II and Regime III, whereβ\\betais large, the performance of the optimizer is not stable\. While averaging over multiple random seeds yields a smoother 2D regime plot, individual runs frequently encounter numerical instability or divergence\.
\(*a*\) 1D Convection, Training Loss\(*b*\) 1D Convection, Test Error\(*c*\) 1D Reaction, Training Loss\(*d*\) 1D Reaction, Test ErrorFigure 4:Relative performance improvement of NNCG over L\-BFGS across different physical and data regimes\. The heatmaps show the percentile\-wise relative improvement of NNCG compared with L\-BFGS for training loss and test error, where positive values indicate better performance under NNCG\. Panels \(*a*,*b*\) correspond to PINNs trained on the 1D convection equation, while Panels \(*c*,*d*\) present the corresponding results for the 1D reaction equation\. In both cases, NNCG presents good fine\-tuning gains in Regime I, indicating improved optimization within already trainable regions\.This behavior is visualized in Figure[4](https://arxiv.org/html/2605.29153#S2.F4), which reports the average improvement over three random seeds\. In Regime I, NNCG reduces the test error by approximately 50% in data\-limited scenarios and improves training loss by roughly 60% in well\-trained scenarios\. Conversely, in the high\-β\\betaparameter space \(Regime II\), the averaged performance shows minimal improvement, reflecting the difficulty of maintaining a stable Nyström approximation in extremely ill\-conditioned landscapes\. The success in lower\-β\\betaregimes validates the hypothesis inRathoreet al\.\([2024](https://arxiv.org/html/2605.29153#bib.bib58)\)regarding “under\-optimization\.” They argue that L\-BFGS often terminates prematurely because it cannot find a step size satisfying the strong Wolfe conditions, due to the “fast spectral decay” of the Hessian, a claim which our diagnostic analysis validates\. NNCG circumvents this by using a Nyström preconditioner and a simpler Armijo line search, allowing it to extract further performance in amenable landscapes\.
PINN\(*a*\)Nf=100,β=5N\_\{f\}=100,\\ \\beta=5PINN\(*b*\)Nf=500,β=50N\_\{f\}=500,\\ \\beta=50FNO\(*c*\)Nf=256,K∈\[2\.5,5\]N\_\{f\}=256,\\ K\\in\[2\.5,5\]FNO\(*d*\)Nf=16384,K∈\[100,200\]N\_\{f\}=16384,\\ K\\in\[100,200\]ResNet\-18\(*e*\) CIFAR\-10Figure 5:Comparison of 3D loss landscapes across PINN, FNO, and ResNet models\. Panels \(*a*,*b*\) and \(*c*,*d*\) visualize the non\-convex landscapes of a PINN \(trained on 1D Convection with different numbers of collocation pointsNfN\_\{f\}and PDE coefficientsβ\\beta\) and an FNO \(trained on 2D Poisson with varying numbers of training samplesNfN\_\{f\}and frequency modesKK\), respectively\. They are characterized by sharp local minima and chaotic ridges\. In contrast, Panel \(*e*\) displays the smooth, convex basin of a standard ResNet\-18\.
### 2\.5Case Studies of SciML Landscapes
Here, we present case studies of loss landscapes in SciML\. We first examine their*topological characteristics*, showing that SciML landscapes often contain sharper, more disconnected, and more anisotropic structures than those observed in standard CV settings\. We then analyze*nonstandard landscape phenomena*, where popular and intuitive landscape metrics can give misleading signals about trainability and generalization\. Together, these case studies provide a finer\-scale characterization beyond Hessian\-based analysis\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\), and they reveal distinct optimization pathologies in commonly\-used SciML models\.
#### 2\.5\.1SciML Topological Characteristics
##### More Rugged Landscapes in SciML\.
Figure[5](https://arxiv.org/html/2605.29153#S2.F5)compares the 3D loss landscapes of SciML models \(PINNs and FNO\) with that of a well\-trained CV model \(ResNet\-18\)\. Consistent with prior analyses\(Liet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib39)\), the CV loss surface forms a wide, gently curved basin with relatively smooth curvature\. In contrast, the PINN landscapes exhibit distinct, regime\-dependent patterns\. When the PDE coefficient is small, indicating an easier\-to\-solve problem, the landscape features a large flat region of high loss surrounding a deep, narrow valley \(Figure[5](https://arxiv.org/html/2605.29153#S2.F5)\(a\)\)\. However, as the parameter increases, corresponding to a progressively more challenging regime, the landscape transitions into a highly irregular and much more rugged state\. While increasing the number of collocation points offers slight mitigation \(Figure[5](https://arxiv.org/html/2605.29153#S2.F5)\(b\)\), the topology remains more rugged than the well\-behaved CV counterparts\. This is consistent with observations in prior work\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\)\. Furthermore, this phenomenon extends to operator learning models; both PINN and FNO exhibit sharp minima, whereas ResNet converges to a smoother minimum, underscoring the intrinsic optimization challenges of these widely used SciML models\.
##### Lack of Connectivity and Existence of Large Hessian Eigenvalues\.
In Figure[6](https://arxiv.org/html/2605.29153#S2.F6)\(e\), we present the estimated Hessian spectral density of ResNet\-18 model\. Two characteristics are worth noting: the largest outlier eigenvalue is relatively small \(around 10\); and there is a density peak near zero\. This abundance of near\-zero flat directions has been observed in prior studies on the Hessian spectrum\(Sagunet al\.,[2017](https://arxiv.org/html/2605.29153#bib.bib111)\)\. Consequently, the zero\-loss manifold of the loss landscape in CV classification models is often conceptualized as being highly connected and is supported by*mode connectivity*\(Draxleret al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib55); Garipovet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib54)\)\. However, for the popular SciML models analyzed in this paper, the eigenvalues exhibit a markedly different distribution\. As shown in Figures[6](https://arxiv.org/html/2605.29153#S2.F6)\(a,b\), which present the estimated spectral densities of PINNs, we observe large outlier eigenvalues \(exceeding10310^\{3\}or even10410^\{4\}\) and the absence of a zero\-eigenvalue peak, a phenomenon that becomes even more significant after training\. This spectral density suggests that the optimization trajectories are often trapped in spurious minima that hinder further descent, preventing the model from recovering the correct physical behavior\. In such cases, the large outlier eigenvalues indicate that the neighborhood around the loss minimum is extremely sharp, implying that more effective optimization methods are required for SciML models to escape local minima and converge toward regions near the global minimum\.
##### Landscapes Decouple Hessian–loss Correlation\.
The distinct Hessian spectral densities observed in these models suggest that the conventional Hessian\-loss relationship, as popularly conceptualized within ML, does not necessarily port to SciML models\. In standard CV tasks, minimizing the loss \(a global property\) typically correlates with a reduction in curvature \(a local property\); as the model approaches the solution, the Hessian eigenvalues tend to diminish, and the distribution shifts towards zero \(Figure[6](https://arxiv.org/html/2605.29153#S2.F6)\(e\)\)\. However, for many SciML models, we observe a decoupling of this correlation\. Specifically, while the training loss decreases, the leading Hessian eigenvalues do not vanish, but instead they remain large or can even increase\. We hypothesize that this phenomenon is related to*increasing sharpening*\(Damianet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib81)\), where sharpness continues to grow as models approach the bottom of the basin\. This results in a loss landscape characterized by a thin, funnel\-shaped structure near the minimum, as opposed to the wide, flat basins of CV models\. Consequently, a low training loss in PINNs does not guarantee a robust solution, as the increasing PDE coefficients can significantly sharpen the landscape, causing Hessian\-based metrics to diverge from the loss trajectory\.
PINN\(*a*\)Nf=100,β=5N\_\{f\}=100,\\ \\beta=5PINN\(*b*\)Nf=500,β=50N\_\{f\}=500,\\ \\beta=50FNO\(*c*\)Nf=256,K∈\[2\.5,5\]N\_\{f\}=256,\\ K\\in\[2\.5,5\]FNO\(*d*\)Nf=16384,K∈\[100,200\]N\_\{f\}=16384,\\ K\\in\[100,200\]ResNet\-18\(*e*\) CIFAR\-10Figure 6:Comparison of Hessian eigenspectra across PINN, FNO, and ResNet models\. Each panel shows the empirical spectral density of the Hessian before and after training\. Panels \(*a*,*b*\) show PINNs trained on the 1D Convection under different physical and data regimes, varying the number of collocation pointsNfN\_\{f\}and the convection coefficientβ\\beta\. Panels \(*c*,*d*\) show FNOs trained on the 2D Poisson under different training\-sample sizesNfN\_\{f\}and coefficient rangesKK\. Panel \(*e*\) shows ResNet\-18 trained on CIFAR\-10 as a CV reference\.
#### 2\.5\.2Beyond Standard Landscape Intuitions
The Hessian\-loss relationship discussed in Section[2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1)shows that standard loss landscape intuitions from conventional deep learning can fail to characterize SciML model behavior\. In particular, a low training loss does not necessarily correspond to a low curvature, nor does a high loss imply high curvature\. We discuss two possible types of “pathological” loss landscapes arising from this: \(i\) the solution is hidden within sharp and narrow basins \(deceptive sharpness\); and \(ii\) flatness masks a lack of learning signal \(deceptive flatness\)\. We also examine three alternative explanations commonly studied in deep learning that could plausibly account for SciML optimization difficulty: landscape jumps between basins; landscape barriers; and landscape aging\. Our results indicate that these mechanisms do not fully explain the observed SciML behavior\.
##### Deceptive Sharpness\.
High Hessian eigenvalues do not necessarily indicate poor training\. Rather, they can indicate the entrance to more optimal and well\-trained regions, which are often hypothesized to be reached through narrow and sharp bottleneck paths\(Golatkaret al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib112)\)\. This geometry not only poses a significant challenge for optimizers, but it also explains the sudden loss drop observed once these bottlenecks are successfully traversed\(Cohenet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib108); Damianet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib81)\)\. We visualize this phenomenon in Figure[7](https://arxiv.org/html/2605.29153#S2.F7)\. The training dynamics exhibit a phase ofIncreasing Sharpening\(indicated by the green background\), while the maximum Hessian eigenvalueλmax\\lambda\_\{\\max\}undergoes a dramatic ascent\. This sharp increase in curvature does not hinder optimization\. On the contrary, it perfectly coincides with the most rapid decrease in training loss\. This suggests that the optimizer is successfully navigating into a sharp but highly accurate valley\. Following this, the model enters aStable Regime\(pink background\), where both the loss and the sharpness stabilize\. More examples can be found in Figure[12](https://arxiv.org/html/2605.29153#A6.F12)of Appendix[F\.1](https://arxiv.org/html/2605.29153#A6.SS1)\.
This counterintuitive observation that sharper minima can yield lower training errors is not limited to the temporal dynamics, but it also manifests across different problem settings\. As corroborated by our additional analysis on the 2D Poisson problem \(see Figure[11](https://arxiv.org/html/2605.29153#A6.F11)\(a\) of Appendix[F\.1](https://arxiv.org/html/2605.29153#A6.SS1)\), we observe an anomalous distribution where well\-trained regions \(Regime I\) are characterized by high Hessian eigenvalues\. In contrast, regions with lower Hessian values \(Regime II\) correspond to significantly higher training errors, indicating that those “flatter” areas are likely distinct failure modes or deceptive plateaus rather than generalizable minima\.
Figure 7:Increasing sharpening dynamics on the 1D convection problem \(Nf=15000,β=5\.0N\_\{f\}=15000,\\beta=5\.0\) trained with the L\-BFGS optimizer with five random seeds\. The plots illustrate the correlation between curvature and loss:λmax\\lambda\_\{\\max\}rises sharply during the initial phase, coinciding with the rapid decay in training loss\.
##### Deceptive Flatness and Vanishing Descent Directions\.
While flat minima are typically desired for their generalization properties, our analysis reveals a counter\-phenomenon:deceptive flatness\. Detailed analysis of the Hessian spectrum and phase plots is provided in Appendix[F\.2](https://arxiv.org/html/2605.29153#A6.SS2), demonstrating that flat regions in SciML often occur at high training\-loss values\. Unlike the wide, low\-loss basins associated with robust solutions, these high\-loss plateaus correspond to a lack of informative gradients\. Therefore, the optimizer receives negligible directional signals, causing it to stall\. This deceptive structure is particularly pronounced in regimes with larger, more difficult PDE coefficients \(e\.g\.,ρ≥15\\rho\\geq 15\)\. As detailed in Figure[11](https://arxiv.org/html/2605.29153#A6.F11)\(b\) of Appendix[F\.2](https://arxiv.org/html/2605.29153#A6.SS2), models in these settings frequently become trapped in regions characterized by small Hessian eigenvalues yet significantly elevated training losses \(Regime II/III in the phase plots\)\. Theoretically, this observation resonates with the sigmoidal learning dynamics analyzed bySaxeet al\.\([2014](https://arxiv.org/html/2605.29153#bib.bib80)\), where models must navigate long plateaus\. However, the persistence of the plateau in our experiments suggests a more fundamental issue rooted in the sharpness\-diversity tradeoff\(Luet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib100); Gonget al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib99)\)\. Specifically, the observed flatness does not stem from a wide basin of attraction, but rather fromreduced information contentin the gradients\. This lack of driving force creates a “blind” landscape where the optimizer, deprived of directional guidance, fails to trigger the effective descent phase\.
##### Loss Landscape Jumps Between Basins\.
We first test whether SciML optimization difficulty can be explained by repeated jumps between disconnected basins\. However, our observation suggests the opposite: PINNs, especially in harder physical regimes, exhibit fewer basin transitions than ResNet\-18\. Thus, SciML failures are less likely driven by excessive basin jumping and more likely caused by degraded, poorly conditioned landscape geometry\. See Appendix[F\.3](https://arxiv.org/html/2605.29153#A6.SS3)for details\.
##### Loss Landscape Barriers\.
We next test whether SciML optimization difficulty can be explained by the existence of high\-loss barriers separating coarse and fine solutions\. However, our analysis shows that the primary difficulty in PINNs is not barrier crossing, but the mismatch between low training loss and physically correct solutions \(Appendix[F\.4](https://arxiv.org/html/2605.29153#A6.SS4)\)\.
##### Loss Landscape Aging\.
Finally, we test whether SciML optimization difficulty can be explained by loss landscape aging\(Baity\-Jesiet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib87)\), where optimization slows after entering a wide and flat minimum\. Our results demonstrate that this mechanism does not match the observed SciML behavior: PINN landscapes retain many nonzero Hessian directions and remain relatively sharp even near convergence\. Thus, the difficulty is unlikely to be primarily caused by excessive flatness or diffusive late\-stage dynamics\. See Appendix[F\.5](https://arxiv.org/html/2605.29153#A6.SS5)for details\.
## 3Conclusion
We present a regime\-aware empirical and diagnostic analysis of SciML models across ML model families, PDE systems, and optimization and training strategies\. A recurring coarse three\-regime structure is observed associated withstable training\(Regime I\),optimization difficulty\(Regime II\), anddegraded generalization\(Regime III\)\. Our results show that training and optimization strategies mainly reshape regime boundaries rather than eliminating the underlying structure\. Loss landscape case studies further reveal SciML\-specific geometries that challenge standard interpretations of flatness, sharpness, and optimization stability\. More generally, these findings provide a foundation for regime\-aware diagnostics and more robust SciML training methods\.
## Acknowledgements
YY would like to acknowledge the support by the U\.S\. Department of Energy \(Grant number: DE\-SC0025584\) and the Defense Advanced Research Projects Agency \(Grant number: HR00112520011\)\. MWM would like to acknowledge DARPA, DOE, NSF, ONR, and the DOE SciGPT grant\. We would also like to acknowledge NERSC DOE Mission Science Allocation for ERCAP request ERCAP0035003\.
## Impact Statement
This paper presents research aimed at advancing the fields of Machine Learning and Scientific Computing, particularly in diagnosing and optimizing Scientific Machine Learning models through loss landscape and regime analysis\. While our work has various potential societal implications, we do not find it necessary to highlight any specific ones here\.
## References
- M\. Andriushchenko, F\. Croce, M\. Müller, M\. Hein, and N\. Flammarion \(2023\)A modern look at the relationship between sharpness and generalization\.InICML,Proceedings of Machine Learning Research,pp\. 840–902\.Cited by:[§D\.1](https://arxiv.org/html/2605.29153#A4.SS1.p1.1)\.
- K\. Azizzadenesheli, N\. Kovachki, Z\. Li, M\. Liu\-Schiaffini, J\. Kossaifi, and A\. Anandkumar \(2024\)Neural operators for accelerating scientific simulations and design\.Nature Reviews Physics6\(5\),pp\. 320–328\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- D\. Bahri, H\. Mobahi, and Y\. Tay \(2022\)Sharpness\-aware minimization improves language model generalization\.InACL \(1\),pp\. 7360–7371\.Cited by:[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- M\. Baity\-Jesi, L\. Sagun, M\. Geiger, S\. Spigler, G\. B\. Arous, C\. Cammarota, Y\. LeCun, M\. Wyart, and G\. Biroli \(2018\)Comparing dynamics: deep neural networks versus glassy systems\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 314–323\.Cited by:[§F\.5](https://arxiv.org/html/2605.29153#A6.SS5.p1.1),[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px5.p1.1)\.
- A\. J\. Ballard, R\. Das, S\. Martiniani, D\. Mehta, L\. Sagun, J\. D\. Stevenson, and D\. J\. Wales \(2017\)Energy landscapes for machine learning\.Physical Chemistry Chemical Physics19,pp\. 12585–12603\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- P\. L\. Bartlett, P\. M\. Long, G\. Lugosi, and A\. Tsigler \(2020\)Benign overfitting in linear regression\.Proceedings of the National Academy of Sciences117\(48\),pp\. 30063–30070\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- S\. Batzner, A\. Musaelian, L\. Sun, M\. Geiger, J\. P\. Mailoa, M\. Kornbluth, N\. Molinari, T\. E\. Smidt, and B\. Kozinsky \(2022\)E \(3\)\-equivariant graph neural networks for data\-efficient and accurate interatomic potentials\.Nature communications13\(1\),pp\. 2453\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- M\. Belkin, D\. Hsu, S\. Ma, and S\. Mandal \(2019\)Reconciling modern machine\-learning practice and the classical bias–variance trade\-off\.Proceedings of the National Academy of Sciences116\(32\),pp\. 15849–15854\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- J\. Benitez, K\. Hegazy, J\. Guo, I\. Dokmanić, M\. W\. Mahoney, and M\. V\. de Hoop \(2025\)Neural equilibria for long\-term prediction of nonlinear conservation laws\.arXiv preprint arXiv:2501\.06933\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- H\. Chen, Y\. Dong, Z\. Wei, Y\. Huang, Y\. Zhang, H\. Su, and J\. Zhu \(2025\)Unveiling the basin\-like loss landscape in large language models\.arXiv preprint arXiv:2505\.17646\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- R\. T\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. K\. Duvenaud \(2018\)Neural ordinary differential equations\.Advances in neural information processing systems31\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§B\.1](https://arxiv.org/html/2605.29153#A2.SS1.SSS0.Px3.p1.11),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- X\. Cheng and S\. Na \(2024\)Physics\-informed neural networks with trust\-region sequential quadratic programming\.arXiv preprint arXiv:2409\.10777\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- J\. Cohen, S\. Kaur, Y\. Li, J\. Z\. Kolter, and A\. Talwalkar \(2021\)Gradient descent on neural networks typically occurs at the edge of stability\.InICLR,Cited by:[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px1.p1.1)\.
- A\. Damian, E\. Nichani, and J\. D\. Lee \(2023\)Self\-stabilization: the implicit bias of gradient descent at the edge of stability\.InICLR,Cited by:[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px3.p1.1),[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px1.p1.1)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InICLR,Cited by:[§D\.2](https://arxiv.org/html/2605.29153#A4.SS2.p3.1)\.
- F\. Draxler, K\. Veschgini, M\. Salmhofer, and F\. Hamprecht \(2018\)Essentially no barriers in neural network energy landscape\.InInternational conference on machine learning,pp\. 1309–1318\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§F\.4](https://arxiv.org/html/2605.29153#A6.SS4.p1.4),[§1](https://arxiv.org/html/2605.29153#S1.p4.1),[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px2.p1.2)\.
- A\. Engel \(2001\)Statistical mechanics of learning\.Cambridge University Press\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- P\. Foret, A\. Kleiner, H\. Mobahi, and B\. Neyshabur \(2021\)Sharpness\-aware minimization for efficiently improving generalization\.InICLR,Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- J\. Galvis, E\. T\. Chung, Y\. Efendiev, and W\. T\. Leung \(2017\)On overlapping domain decomposition methods for high\-contrast multiscale problems\.InInternational Conference on Domain Decomposition Methods,pp\. 45–57\.Cited by:[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px7.p2.1)\.
- H\. Gao, L\. Sun, and J\. Wang \(2021\)PhyGeoNet: physics\-informed geometry\-adaptive convolutional neural networks for solving parameterized steady\-state pdes on irregular domain\.Journal of Computational Physics428,pp\. 110079\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- T\. Garipov, P\. Izmailov, D\. Podoprikhin, D\. P\. Vetrov, and A\. G\. Wilson \(2018\)Loss surfaces, mode connectivity, and fast ensembling of dnns\.Advances in neural information processing systems31\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§F\.4](https://arxiv.org/html/2605.29153#A6.SS4.p1.4),[§1](https://arxiv.org/html/2605.29153#S1.p4.1),[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px2.p1.2)\.
- A\. S\. Golatkar, A\. Achille, and S\. Soatto \(2019\)Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence\.Advances in Neural Information Processing Systems32\.Cited by:[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px1.p1.1)\.
- Z\. Gong, J\. Teng, and Y\. Liu \(2025\)What makes looped transformers perform better than non\-recursive ones \(provably\)\.arXiv preprint arXiv:2510\.10089\.Cited by:[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px2.p1.1)\.
- D\. Hansen, D\. C\. Maddix, S\. Alizadeh, G\. Gupta, and M\. W\. Mahoney \(2023\)Learning physical models that can respect conservation laws\.InInternational Conference on Machine Learning,pp\. 12469–12510\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- Z\. Hao, J\. Yao, C\. Su, H\. Su, Z\. Wang, F\. Lu, Z\. Xia, Y\. Zhang, S\. Liu, L\. Lu, and J\. Zhu \(2024\)PINNacle: A comprehensive benchmark of physics\-informed neural networks for solving pdes\.InNeurIPS,Cited by:[§B\.1](https://arxiv.org/html/2605.29153#A2.SS1.SSS0.Px1.p1.11)\.
- T\. Hastie, A\. Montanari, S\. Rosset, and R\. J\. Tibshirani \(2022\)Surprises in high\-dimensional ridgeless least squares interpolation\.Annals of statistics50\(2\),pp\. 949\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InCVPR,pp\. 770–778\.Cited by:[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1)\.
- L\. Hodgkinson, U\. Simsekli, R\. Khanna, and M\. Mahoney \(2022\)Generalization bounds using lower tail exponents in stochastic optimizers\.InInternational Conference on Machine Learning,pp\. 8774–8795\.Cited by:[§F\.3](https://arxiv.org/html/2605.29153#A6.SS3.p1.5)\.
- J\. Hoogland, G\. Wang, M\. Farrugia\-Roberts, L\. Carroll, S\. Wei, and D\. Murfet \(2024\)Loss landscape degeneracy and stagewise development in transformers\.Transactions on Machine Learning Research\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- Y\. Hu, K\. Goel, V\. Killiakov, and Y\. Yang \(2025\)Eigenspectrum analysis of neural networks without aspect ratio bias\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 24290–24313\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- G\. E\. Karniadakis, I\. G\. Kevrekidis, L\. Lu, P\. Perdikaris, S\. Wang, and L\. Yang \(2021\)Physics\-informed machine learning\.Nature Reviews Physics3\(6\),pp\. 422–440\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- N\. S\. Keskar, D\. Mudigere, J\. Nocedal, M\. Smelyanskiy, and P\. T\. P\. Tang \(2016\)On large\-batch training for deep learning: generalization gap and sharp minima\.arXiv preprint arXiv:1609\.04836\.Cited by:[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- P\. Kidger, J\. Morrill, J\. Foster, and T\. Lyons \(2020\)Neural controlled differential equations for irregular time series\.Advances in neural information processing systems33,pp\. 6696–6707\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- D\. P\. Kingma and J\. Ba \(2015\)Adam: A method for stochastic optimization\.InICLR,Cited by:[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\)Similarity of neural network representations revisited\.InInternational conference on machine learning,pp\. 3519–3529\.Cited by:[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- A\. Krishnapriyan, A\. Gholami, S\. Zhe, R\. Kirby, and M\. W\. Mahoney \(2021\)Characterizing possible failure modes in physics\-informed neural networks\.InAdvances in Neural Information Processing Systems,Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px1.p1.4),[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p2.1),[§1](https://arxiv.org/html/2605.29153#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px1.p1.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1),[§2\.3](https://arxiv.org/html/2605.29153#S2.SS3.p1.1),[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px1.p1.1),[§2\.5](https://arxiv.org/html/2605.29153#S2.SS5.p1.1)\.
- A\. S\. Krishnapriyan, A\. F\. Queiruga, N\. B\. Erichson, and M\. W\. Mahoney \(2023\)Learning continuous models for continuous physics\.Communications Physics6\(1\),pp\. 319\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p2.1),[§1](https://arxiv.org/html/2605.29153#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px1.p1.1)\.
- J\. Kwon, J\. Kim, H\. Park, and I\. K\. Choi \(2021\)Asam: adaptive sharpness\-aware minimization for scale\-invariant learning of deep neural networks\.InInternational Conference on Machine Learning,pp\. 5905–5914\.Cited by:[§D\.1](https://arxiv.org/html/2605.29153#A4.SS1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- H\. Li, Z\. Xu, G\. Taylor, C\. Studer, and T\. Goldstein \(2018\)Visualizing the loss landscape of neural nets\.Advances in neural information processing systems31\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§D\.4](https://arxiv.org/html/2605.29153#A4.SS4.SSS0.Px3.p1.11),[§1](https://arxiv.org/html/2605.29153#S1.p4.1),[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px1.p1.1)\.
- Z\. Li, N\. Kovachki, K\. Azizzadenesheli, B\. Liu, K\. Bhattacharya, A\. Stuart, and A\. Anandkumar \(2020\)Neural operator: graph kernel network for partial differential equations\.arXiv preprint arXiv:2003\.03485\.Cited by:[§B\.1](https://arxiv.org/html/2605.29153#A2.SS1.SSS0.Px2.p1.4)\.
- Z\. Li, N\. B\. Kovachki, K\. Azizzadenesheli, B\. Liu, K\. Bhattacharya, A\. M\. Stuart, and A\. Anandkumar \(2021\)Fourier neural operator for parametric partial differential equations\.InICLR,Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- Z\. Li, H\. Zheng, N\. Kovachki, D\. Jin, H\. Chen, B\. Liu, K\. Azizzadenesheli, and A\. Anandkumar \(2024\)Physics\-informed neural operator for learning partial differential equations\.ACM/IMS Journal of Data Science1\(3\),pp\. 1–27\.Cited by:[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px7.p1.1),[§C\.2](https://arxiv.org/html/2605.29153#A3.SS2.SSS0.Px3.p1.4),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px1.p1.1)\.
- Z\. Liu, Y\. Hu, T\. Pang, Y\. Zhou, P\. Ren, and Y\. Yang \(2024\)Model balancing helps low\-data training and fine\-tuning\.InEMNLP,pp\. 1311–1331\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- H\. Lu, X\. Liu, Y\. Zhou, Q\. Li, K\. Keutzer, M\. W\. Mahoney, Y\. Yan, H\. Yang, and Y\. Yang \(2024\)Sharpness\-diversity tradeoff: improving flat ensembles with sharpbalance\.Advances in Neural Information Processing Systems37,pp\. 136727–136762\.Cited by:[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px2.p1.1)\.
- L\. Lu, P\. Jin, G\. Pang, Z\. Zhang, and G\. E\. Karniadakis \(2021a\)Learning nonlinear operators via deeponet based on the universal approximation theorem of operators\.Nature machine intelligence3\(3\),pp\. 218–229\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§B\.1](https://arxiv.org/html/2605.29153#A2.SS1.SSS0.Px2.p1.4),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- L\. Lu, R\. Pestourie, W\. Yao, Z\. Wang, F\. Verdugo, and S\. G\. Johnson \(2021b\)Physics\-informed neural networks with hard constraints for inverse design\.SIAM Journal on Scientific Computing43\(6\),pp\. B1105–B1132\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1)\.
- C\. H\. Martin and M\. W\. Mahoney \(2017\)Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior\.Technical reportTechnical ReportPreprint: arXiv:1710\.09553\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- P\. Nakkiran, G\. Kaplun, Y\. Bansal, T\. Yang, B\. Barak, and I\. Sutskever \(2020\)Deep double descent: where bigger models and more data hurt\.InICLR,Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- G\. Négiar, M\. W\. Mahoney, and A\. S\. Krishnapriyan \(2023\)Learning differentiable solvers for systems with hard constraints\.InICLR,Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- M\. Raissi, P\. Perdikaris, and G\.E\. Karniadakis \(2019\)Physics\-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\.Journal of Computational Physics378,pp\. 686–707\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- C\. Rao, P\. Ren, Q\. Wang, O\. Buyukozturk, H\. Sun, and Y\. Liu \(2023\)Encoding physics to learn reaction–diffusion processes\.Nature Machine Intelligence5\(7\),pp\. 765–779\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- B\. Raonic, R\. Molinaro, T\. De Ryck, T\. Rohner, F\. Bartolucci, R\. Alaifari, S\. Mishra, and E\. de Bézenac \(2023\)Convolutional neural operators for robust and accurate learning of pdes\.Advances in Neural Information Processing Systems36,pp\. 77187–77200\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- P\. Rathore, W\. Lei, Z\. Frangella, L\. Lu, and M\. Udell \(2024\)Challenges in training pinns: A loss landscape perspective\.InICML,Proceedings of Machine Learning Research,pp\. 42159–42191\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[3rd item](https://arxiv.org/html/2605.29153#A4.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px1.p1.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1),[§2\.4](https://arxiv.org/html/2605.29153#S2.SS4.SSS0.Px4.p2.2)\.
- P\. Ren, C\. Rao, Y\. Liu, J\. Wang, and H\. Sun \(2022\)PhyCRNet: physics\-informed convolutional\-recurrent network for solving spatiotemporal pdes\.Computer Methods in Applied Mechanics and Engineering389,pp\. 114399\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- L\. Sagun, U\. Evci, V\. U\. Guney, Y\. Dauphin, and L\. Bottou \(2017\)Empirical analysis of the hessian of over\-parametrized neural networks\.arXiv preprint arXiv:1706\.04454\.Cited by:[§2\.5\.1](https://arxiv.org/html/2605.29153#S2.SS5.SSS1.Px2.p1.2)\.
- M\. Sakarvadia, K\. Hegazy, A\. Totounferoush, K\. Chard, Y\. Yang, I\. Foster, and M\. W\. Mahoney \(2025\)The false promise of zero\-shot super\-resolution in machine\-learned operators\.arXiv preprint arXiv:2510\.06646\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p2.1),[§1](https://arxiv.org/html/2605.29153#S1.p5.1)\.
- V\. G\. Satorras, E\. Hoogeboom, and M\. Welling \(2021\)E \(n\) equivariant graph neural networks\.InInternational conference on machine learning,pp\. 9323–9332\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- A\. Saxe, J\. McClelland, and S\. Ganguli \(2014\)Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\.InProceedings of the International Conference on Learning Represenatations 2014,Cited by:[§2\.5\.2](https://arxiv.org/html/2605.29153#S2.SS5.SSS2.Px2.p1.1)\.
- R\. Schaeffer, B\. Miranda, and S\. Koyejo \(2024\)Are emergent abilities of large language models a mirage?\.Advances in Neural Information Processing Systems36\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- S\. Subramanian, P\. Harrington, K\. Keutzer, W\. Bhimji, D\. Morozov, M\. W\. Mahoney, and A\. Gholami \(2023\)Towards foundation models for scientific machine learning: characterizing scaling and transfer behavior\.Advances in Neural Information Processing Systems36,pp\. 71242–71262\.Cited by:[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px5.p1.1),[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px6.p1.3),[§C\.2](https://arxiv.org/html/2605.29153#A3.SS2.SSS0.Px2.p1.3),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px1.p1.1)\.
- R\. Theisen, H\. Kim, Y\. Yang, L\. Hodgkinson, and M\. W\. Mahoney \(2023\)When are ensembles really effective?\.InNeurIPS,Cited by:[§D\.1](https://arxiv.org/html/2605.29153#A4.SS1.p1.1)\.
- H\. Wang, T\. Fu, Y\. Du, W\. Gao, K\. Huang, Z\. Liu, P\. Chandak, S\. Liu, P\. Van Katwyk, A\. Deac,et al\.\(2023\)Scientific discovery in the age of artificial intelligence\.Nature620\(7972\),pp\. 47–60\.Cited by:[§1](https://arxiv.org/html/2605.29153#S1.p1.1)\.
- S\. Wang, X\. Yu, and P\. Perdikaris \(2022\)When and why pinns fail to train: a neural tangent kernel perspective\.Journal of Computational Physics449,pp\. 110768\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
- J\. Wei, Y\. Tay, R\. Bommasani, C\. Raffel, B\. Zoph, S\. Borgeaud, D\. Yogatama, M\. Bosma, D\. Zhou, D\. Metzler, E\. H\. Chi, T\. Hashimoto, O\. Vinyals, P\. Liang, J\. Dean, and W\. Fedus \(2022\)Emergent abilities of large language models\.Trans\. Mach\. Learn\. Res\.2022\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- S\. Woo, S\. Debnath, R\. Hu, X\. Chen, Z\. Liu, I\. S\. Kweon, and S\. Xie \(2023\)ConvNeXt V2: co\-designing and scaling convnets with masked autoencoders\.InCVPR,pp\. 16133–16142\.Cited by:[§D\.2](https://arxiv.org/html/2605.29153#A4.SS2.p3.1)\.
- H\. Wu, H\. Luo, Y\. Ma, J\. Wang, and M\. Long \(2024\)Ropinn: region optimized physics\-informed neural networks\.Advances in Neural Information Processing Systems37,pp\. 110494–110532\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px2.p1.3),[§B\.2](https://arxiv.org/html/2605.29153#A2.SS2.SSS0.Px3.p1.1),[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1)\.
- T\. Xie, J\. Chen, Y\. Yang, C\. Geniesse, G\. Shi, A\. Chaudhari, J\. K\. Cava, M\. W\. Mahoney, T\. Perciano, G\. H\. Weber,et al\.\(2024\)LossLens: diagnostics for machine learning through loss landscape visual analytics\.IEEE Computer Graphics and Applications\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- Y\. Yang, L\. Hodgkinson, R\. Theisen, J\. Zou, J\. E\. Gonzalez, K\. Ramchandran, and M\. W\. Mahoney \(2021\)Taxonomizing local versus global structure in neural network loss landscapes\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 18722–18733\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1),[§D\.1](https://arxiv.org/html/2605.29153#A4.SS1.p1.1),[§D\.1](https://arxiv.org/html/2605.29153#A4.SS1.p2.2),[§D\.2](https://arxiv.org/html/2605.29153#A4.SS2.p1.1),[§D\.3](https://arxiv.org/html/2605.29153#A4.SS3.p1.1),[3rd item](https://arxiv.org/html/2605.29153#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.29153#S1.p3.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1),[§1](https://arxiv.org/html/2605.29153#S1.p5.1)\.
- Z\. Yao, A\. Gholami, D\. Arfeen, R\. Liaw, J\. Gonzalez, K\. Keutzer, and M\. Mahoney \(2018\)Large batch size training of neural networks with adversarial training and second\-order information\.arXiv preprint arXiv:1810\.01021\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- Z\. Yao, A\. Gholami, K\. Keutzer, and M\. W\. Mahoney \(2020\)Pyhessian: neural networks through the lens of the hessian\.In2020 IEEE international conference on big data \(Big data\),pp\. 581–590\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1),[§D\.2](https://arxiv.org/html/2605.29153#A4.SS2.p1.1)\.
- L\. Zdeborová and F\. Krzakala \(2016\)Statistical physics of inference: thresholds and algorithms\.Advances in Physics65\(5\),pp\. 453–552\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px3.p1.1)\.
- Y\. Zhou, T\. Pang, K\. Liu, M\. W\. Mahoney, Y\. Yang,et al\.\(2023a\)Temperature balancing, layer\-wise weight analysis, and neural network training\.Advances in Neural Information Processing Systems36,pp\. 63542–63572\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhou, Y\. Yang, A\. Chang, and M\. W\. Mahoney \(2023b\)A three\-regime model of network pruning\.InInternational Conference on Machine Learning,pp\. 42790–42809\.Cited by:[§1](https://arxiv.org/html/2605.29153#S1.p3.1),[§1](https://arxiv.org/html/2605.29153#S1.p4.1)\.
- C\. Zhu, R\. H\. Byrd, P\. Lu, and J\. Nocedal \(1997\)Algorithm 778: l\-bfgs\-b: fortran subroutines for large\-scale bound\-constrained optimization\.ACM Transactions on mathematical software \(TOMS\)23\(4\),pp\. 550–560\.Cited by:[§2\.1](https://arxiv.org/html/2605.29153#S2.SS1.SSS0.Px2.p1.1)\.
- Y\. Zhu, N\. Zabaras, P\. Koutsourelakis, and P\. Perdikaris \(2019\)Physics\-constrained deep learning for high\-dimensional surrogate modeling and uncertainty quantification without labeled data\.Journal of Computational Physics394,pp\. 56–81\.Cited by:[Appendix A](https://arxiv.org/html/2605.29153#A1.SS0.SSS0.Px1.p1.1)\.
## Appendix ARelated Work
##### Machine Learning Differential Equations\.
ML\-based approaches for “solving” differential equations have received attention in recent years\(Karniadakiset al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib82); Azizzadenesheliet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib76)\)\. One popular approach is based on PINNs\(Raissiet al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib19); Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20)\)and its variants\(Zhuet al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib88); Gaoet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib90); Renet al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib89)\), which exploit soft Lagrangian relaxation as a penalty method to incorporate PDE residuals as soft constraints\. Since such formulations can be ill\-conditioned and very sensitive to optimization hyperparameters, a range of optimization\(Wanget al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib68); Wuet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib69); Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58)\)and hard\-constraint methods\(Luet al\.,[2021b](https://arxiv.org/html/2605.29153#bib.bib41); Cheng and Na,[2024](https://arxiv.org/html/2605.29153#bib.bib86)\)have been proposed to help improve training stability\. Relatedly, to enable instance\-level generalization, NOs\(Liet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib23); Luet al\.,[2021a](https://arxiv.org/html/2605.29153#bib.bib124); Raonicet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib92); Sakarvadiaet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib118)\)aim to learn mappings between functional spaces for instance\-level generalization\. In a complementary direction, NODEs\(Chenet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib121); Kidgeret al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib128); Krishnapriyanet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib125)\)are motivated by continuous\-time modeling, by learning the time derivative with a neural network \(NN\) and integrating it with a differentiable ODE solver\. Despite their popularity, the training and generalization properties of these models can be extremely sensitive to design and optimization choices\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20),[2023](https://arxiv.org/html/2605.29153#bib.bib125); Sakarvadiaet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib118)\)\. This has motivated structure\-preserving approaches that encode physical structure as stronger inductive biases, often by incorporating numerical principles at macro\-scale\(Négiaret al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib114); Hansenet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib115); Raoet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib131)\), meso\-scale\(Benitezet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib119)\), or micro\-scale\(Satorraset al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib130); Batzneret al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib129)\)levels\.
##### Loss Landscape Analysis\.
Loss landscape analysis comes from chemical physics\(Ballardet al\.,[2017](https://arxiv.org/html/2605.29153#bib.bib143)\), and it has been widely used to understand various phenomena in NN models\. Within CV, it has been instrumental in “explaining” normalization and residual connections\(Liet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib39); Yaoet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib44)\), large\-batch training\(Yaoet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib43)\), sharpness\-aware optimization\(Foretet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib45)\), and large\-scale characterization of NN behavior\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)\. Similarly, within NLP, it has been used to study the stage\-wise spectral degeneration of transformers\(Hooglandet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib46)\)as well as pre\-training and fine\-tuning\(Chenet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib62)\)\. Loss landscape analysis has also been applied in SciML to characterize ill\-conditioning properties of PINNs\(Krishnapriyanet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib20); Xieet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib51); Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58); Wuet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib69)\)\. Methods for loss landscape analysis have advanced significantly in recent years\. Early work mainly focused on visualization\(Liet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib39)\), but subsequent studies introduced quantitative local measures by sharpness and Hessian\-based metrics\(Yaoet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib44)\)\. Motivated by going beyond purely local geometry, subsequent work also developed more global characterizations of the landscape, such as mode connectivity\(Garipovet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib54); Draxleret al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib55)\)\. These tools have been used to classify phase\-transitions\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)and to inform hyperparameter selection in training models\(Zhouet al\.,[2023a](https://arxiv.org/html/2605.29153#bib.bib113); Liuet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib127); Huet al\.,[2025](https://arxiv.org/html/2605.29153#bib.bib137)\)\. More recently, topological data analysis has been explored as an additional lens for probing loss\-landscape structure\(Xieet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib51)\)\.
##### Training Regimes in Optimization\.
NN training is known to exhibit distinct*phases*\. The characterization of these phases is rooted in statistical physics\(Engel,[2001](https://arxiv.org/html/2605.29153#bib.bib94); Zdeborová and Krzakala,[2016](https://arxiv.org/html/2605.29153#bib.bib96); Martin and Mahoney,[2017](https://arxiv.org/html/2605.29153#bib.bib144)\), where phase transitions often manifest as discontinuous changes in system properties, governed by control parameters such as optimization noise and the data\-model ratio\. In ML, such transitions are often associated with generalization, e\.g\., “double descent” behavior, observed as model size or epoch count increases\(Bartlettet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib145); Nakkiranet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib5); Hastieet al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib146); Belkinet al\.,[2019](https://arxiv.org/html/2605.29153#bib.bib147)\)\. These macroscopic regimes are intrinsically tied to the geometry of the loss landscape, where the optimizer’s trajectory is dictated by the connectivity of local basins\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)\. While recent studies debate whether certain transitions in large language models are indeed emergent or induced by specific metrics\(Weiet al\.,[2022](https://arxiv.org/html/2605.29153#bib.bib7); Schaefferet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib8)\), the phase\-based studies still provide a crucial practical way of diagnosing optimization failures\.
## Appendix BProblem Setup
In this paper, the focus is on SciML methods for learning differential equation systems\. This section outlines the basic methodology of standard SciML models and their optimization problems, including both data\-driven approaches \(e\.g\., NODEs and FNOs\) and physics\-constrained methods \(e\.g\., PINNs, PINOs, and PINODEs\)\.
### B\.1SciML Models
Here, we describe the basic SciML models we used in our analysis\.
##### PINNs\.
The main goal of PINNs is to solve PDEs\. Similar to prior work\(Haoet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib15)\), we consider the general setup of PDEs:
𝒟\[u\(x\),x\]=0,x∈Ω,\\displaystyle\\mathcal\{D\}\[u\(x\),x\]=0,\\quad x\\in\\Omega,\(1a\)ℬ\[u\(x\),x\]=0,x∈∂Ω,\\displaystyle\\mathcal\{B\}\[u\(x\),x\]=0,\\quad x\\in\\partial\\Omega,\(1b\)where𝒟\\mathcal\{D\}is a differential operator defining the PDE,ℬ\\mathcal\{B\}is an operator associated with the boundary and/or initial conditions, andΩ⊆ℝd\\Omega\\subseteq\\mathbb\{R\}^\{d\}\. PINNs modeluuas an NN and approximate the true solution by the network whose weights solve the following non\-linear least\-squares problem:
minimizew∈ℝpL\(w\)≔\\displaystyle\\underset\{w\\in\\mathbb\{R\}^\{p\}\}\{\\mbox\{minimize\}\}~L\(w\)\\coloneqq12nres∑i=1nres\(𝒟\[u\(xri;w\),xri\]\)2\+12nbc∑j=1nbc\(ℬ\[u\(xbj;w\),xbj\]\)2\.\\displaystyle\\frac\{1\}\{2n\_\{\\textup\{res\}\}\}\\sum\_\{i=1\}^\{n\_\{\\textup\{res\}\}\}\\left\(\\mathcal\{D\}\[u\(x\_\{r\}^\{i\};w\),x\_\{r\}^\{i\}\]\\right\)^\{2\}\+\\frac\{1\}\{2n\_\{\\textup\{bc\}\}\}\\sum^\{n\_\{\\textup\{bc\}\}\}\_\{j=1\}\\left\(\\mathcal\{B\}\[u\(x\_\{b\}^\{j\};w\),x\_\{b\}^\{j\}\]\\right\)^\{2\}\.\(2\)Here,nresn\_\{\\textup\{res\}\}andnbcn\_\{\\textup\{bc\}\}denote the numbers of residual and boundary/initial points,\{xri\}i=1nres\\\{x\_\{r\}^\{i\}\\\}^\{n\_\{\\textup\{res\}\}\}\_\{i=1\}are the residual points, and\{xbj\}j=1nbc\\\{x^\{j\}\_\{b\}\\\}^\{n\_\{\\textup\{bc\}\}\}\_\{j=1\}are the boundary/initial points\. The first loss term measures how muchu\(x;w\)u\(x;w\)fails to satisfy the PDE, while the second term measures how muchu\(x;w\)u\(x;w\)fails to satisfy the boundary/initial conditions\.
##### NOs and PINOs\.
NOs\(Liet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib21); Luet al\.,[2021a](https://arxiv.org/html/2605.29153#bib.bib124)\)extend standard NNs by learning mappings between function spaces\. Specifically, they approximate an operatorG:𝒜→𝒰G:\\mathcal\{A\}\\rightarrow\\mathcal\{U\}, where𝒜\\mathcal\{A\}denotes an input function space \(e\.g\., initial conditions, boundary conditions, or PDE coefficients\) and𝒰\\mathcal\{U\}denotes the corresponding solution space\. The model parameterizes a mappingGθG\_\{\\theta\}such that
u=Gθ\(a\),a∈𝒜,u=G\_\{\\theta\}\(a\),\\quad a\\in\\mathcal\{A\},\(3\)whereθ\\thetadenotes the trainable parameters\. In practice, functions are discretized on a grid, andai∈ℝdaa\_\{i\}\\in\\mathbb\{R\}^\{d\_\{a\}\}andui∈ℝduu\_\{i\}\\in\\mathbb\{R\}^\{d\_\{u\}\}denote the discretized input and solution fields for theii\-th sample\. The operator is trained by minimizing a data\-driven loss,
ℒdata\(θ\)=1N∑i=1N‖Gθ\(ai\)−ui‖2,\\mathcal\{L\}\_\{\\text\{data\}\}\(\\theta\)=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|G\_\{\\theta\}\(a\_\{i\}\)\-u\_\{i\}\\\|^\{2\},\(4\)where\{\(ai,ui\)\}i=1N\\\{\(a\_\{i\},u\_\{i\}\)\\\}\_\{i=1\}^\{N\}are input–output pairs, and∥⋅∥\\\|\\cdot\\\|denotes the Euclidean norm over the discretized field\.
For physics\-informed variants \(PINOs\), additional constraints are incorporated to enforce consistency with governing equations\. This is achieved by augmenting the objective with a residual loss defined on collocation points,
ℒphys\(θ\)=1M∑j=1M‖ℱ\(Gθ\(a\)\(xj\)\)‖2,\\mathcal\{L\}\_\{\\text\{phys\}\}\(\\theta\)=\\frac\{1\}\{M\}\\sum\_\{j=1\}^\{M\}\\\|\\mathcal\{F\}\(G\_\{\\theta\}\(a\)\(x\_\{j\}\)\)\\\|^\{2\},\(5\)whereℱ\\mathcal\{F\}denotes the differential operator associated with the governing PDE and\{xj\}j=1M\\\{x\_\{j\}\\\}\_\{j=1\}^\{M\}are collocation points in the spatial \(and possibly temporal\) domain\. The final loss function is given by,
ℒ\(θ\)=ℒdata\(θ\)\+λℒphys\(θ\),\\mathcal\{L\}\(\\theta\)=\\mathcal\{L\}\_\{\\text\{data\}\}\(\\theta\)\+\\lambda\\mathcal\{L\}\_\{\\text\{phys\}\}\(\\theta\),\(6\)withλ\\lambdais the weighting coefficient\.
##### NODEs and PINODEs\.
We consider NODEs\(Chenet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib121)\)for modeling temporal dynamics of the form
dxdt=fθ\(x,t\),\\frac\{dx\}\{dt\}=f\_\{\\theta\}\(x,t\),\(7\)wherefθf\_\{\\theta\}is parameterized by NNs\. Given an initial conditionx\(0\)x\(0\), predictions at timetit\_\{i\}are obtained via numerical integration,
x^\(ti\)=x\(0\)\+∫0tifθ\(x\(τ\),τ\)𝑑τ,\\hat\{x\}\(t\_\{i\}\)=x\(0\)\+\\int\_\{0\}^\{t\_\{i\}\}f\_\{\\theta\}\(x\(\\tau\),\\tau\)\\,d\\tau,\(8\)which is approximated using a fixed\-step solver \(e\.g\., forward Euler\) aligned with the temporal resolution of the training data\. The standard NODE is trained by minimizing the mean squared error \(MSE\) between predicted and observed trajectories,
ℒdata=1N∑i=1N‖x^\(ti;θ\)−x\(ti\)‖2,\\mathcal\{L\}\_\{\\text\{data\}\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|\\hat\{x\}\(t\_\{i\};\\theta\)\-x\(t\_\{i\}\)\\\|^\{2\},\(9\)where\{ti\}i=1N\\\{t\_\{i\}\\\}\_\{i=1\}^\{N\}denotes the set of observation time points,x\(ti\)∈ℝdx\(t\_\{i\}\)\\in\\mathbb\{R\}^\{d\}is the ground\-truth state at timetit\_\{i\}, andx^\(ti;θ\)\\hat\{x\}\(t\_\{i\};\\theta\)is the model prediction obtained by integrating the learned dynamicsfθf\_\{\\theta\}\. The norm∥⋅∥\\\|\\cdot\\\|denotes the Euclidean norm inℝd\\mathbb\{R\}^\{d\}\.
For the physics\-informed variant, additional loss terms are incorporated by enforcing consistency with known governing equations or physical constraints\. This is achieved by augmenting the objective with a residual loss,
ℒphys=1M∑j=1M‖∂tx^\(tj\)−ℱ\(x^\(tj\)\)‖2,\\mathcal\{L\}\_\{\\text\{phys\}\}=\\frac\{1\}\{M\}\\sum\_\{j=1\}^\{M\}\\\|\\partial\_\{t\}\\hat\{x\}\(t\_\{j\}\)\-\\mathcal\{F\}\(\\hat\{x\}\(t\_\{j\}\)\)\\\|^\{2\},\(10\)whereℱ\\mathcal\{F\}represents the known dynamics \(when available\)\. The overall objective combines data fidelity and physical consistency,
ℒ=ℒdata\+λℒphys,\\mathcal\{L\}=\\mathcal\{L\}\_\{\\text\{data\}\}\+\\lambda\\mathcal\{L\}\_\{\\text\{phys\}\},\(11\)withλ\\lambdacontrolling the trade\-off between the two terms\.
### B\.2Differential Equations
Here, we describe the differential equations considered in the experimental studies\. We use: \(1\) 1D convection, reaction, wave, and reaction\-diffusion equations for PINNs; \(2\) 2D Poisson and advection\-diffusion equations for NOs; \(3\) 2D Darcy flow for PINOs; and \(4\) the nonlinear Pendulum for NODEs and PINODEs\.
##### 1D Convection\.
The 1D convection problem is a hyperbolic PDE that can be used to model fluid flow, heat transfer, and biological processes\. We consider the setup of the convection equation inKrishnapriyanet al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib20)\), given by
∂u∂t\+β∂u∂x=0,\\displaystyle\\frac\{\\partial u\}\{\\partial t\}\+\\beta\\frac\{\\partial u\}\{\\partial x\}=0,x∈\[0,2π\),t∈\(0,1\),\\displaystyle\\quad x\\in\[0,2\\pi\),t\\in\(0,1\),u\(x,0\)=sin\(x\),\\displaystyle u\(x,0\)=\\sin\(x\),u\(0,t\)=u\(2π,t\)\.\\displaystyle\\quad u\(0,t\)=u\(2\\pi,t\)\.The analytical solution to this PDE isu\(x,t\)=sin\(x−βt\)u\(x,t\)=\\sin\(x\-\\beta t\)\. In this paper, the parameterβ\\betais varied over the range\[5,70\]\[5,70\]\.
##### 1D Reaction\.
The 1D reaction problem is governed by a nonlinear PDE commonly used to model chemical reaction dynamics\. The specific reaction PDE considered in this study is given by\(Wuet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib69)\):
∂u∂t−ρu\(1−u\)=0,\\displaystyle\\frac\{\\partial u\}\{\\partial t\}\-\\rho u\(1\-u\)=0,x∈\(0,2π\),t∈\(0,1\)\\displaystyle\\quad x\\in\(0,2\\pi\),t\\in\(0,1\)u\(x,0\)=exp\(−\(x−π\)22\(π/4\)2\),\\displaystyle u\(x,0\)=\\exp\\left\(\-\\frac\{\(x\-\\pi\)^\{2\}\}\{2\(\\pi/4\)^\{2\}\}\\right\),x∈\[0,2π\],\\displaystyle\\quad x\\in\[0,2\\pi\],u\(0,t\)=u\(2π,t\),\\displaystyle u\(0,t\)=u\(2\\pi,t\),t∈\[0,1\]\.\\displaystyle\\quad t\\in\[0,1\]\.The analytical solution to this ODE isu\(x,t\)=h\(x\)eρth\(x\)eρt\+1−h\(x\)u\(x,t\)=\\frac\{h\(x\)e^\{\\rho t\}\}\{h\(x\)e^\{\\rho t\}\+1\-h\(x\)\}, whereh\(x\)=exp\(−\(x−π\)22\(π/4\)2\)h\(x\)=\\exp\\left\(\-\\frac\{\(x\-\\pi\)^\{2\}\}\{2\(\\pi/4\)^\{2\}\}\\right\)\.
##### 1D Wave\.
The 1D wave equation is widely used to model wave propagation phenomena in acoustics, elasticity, and electromagnetics\. We follow the setup inWuet al\.\([2024](https://arxiv.org/html/2605.29153#bib.bib69)\):
∂2u∂t2−c2⋅∂2u∂x2=0,\\displaystyle\\frac\{\\partial^\{2\}u\}\{\\partial t^\{2\}\}\-c^\{2\}\\cdot\\frac\{\\partial^\{2\}u\}\{\\partial x^\{2\}\}=0,x∈\(0,1\),t∈\(0,1\),\\displaystyle\\quad x\\in\(0,1\),t\\in\(0,1\),u\(x,0\)=sin\(πx\)\+0\.5sin\(βπx\),\\displaystyle u\(x,0\)=\\text\{sin\}\(\\pi x\)\+0\.5\\text\{sin\}\(\\beta\\pi x\),x∈\[0,1\],\\displaystyle\\quad x\\in\[0,1\],∂u\(x,0\)∂t=0,\\displaystyle\\frac\{\\partial u\(x,0\)\}\{\\partial t\}=0,x∈\[0,1\],\\displaystyle\\quad x\\in\[0,1\],u\(0,t\)=u\(1,t\)=0,\\displaystyle u\(0,t\)=u\(1,t\)=0,t∈\[0,1\]\.\\displaystyle\\quad t\\in\[0,1\]\.
##### 1D Reaction\-Diffusion\.
The 1D reaction\-diffusion equation is a prototypical nonlinear PDE that models the interplay between diffusion\-driven transport and local reaction kinetics\. Following the setup inKrishnapriyanet al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib20)\), we formulate this equation as:
∂u∂t=ν∂2u∂x2−ρu\(1−u\),x∈\[0,2π\),t∈\(0,1\)\.\\frac\{\\partial u\}\{\\partial t\}=\\nu\\frac\{\\partial^\{2\}u\}\{\\partial x^\{2\}\}\-\\rho u\(1\-u\),\\quad x\\in\[0,2\\pi\),t\\in\(0,1\)\.
##### 2D Poisson\.
Here, we adopt the same PDE as SYS\-1 inSubramanianet al\.\([2023](https://arxiv.org/html/2605.29153#bib.bib26)\)\. Specifically, we consider an elliptic system with periodic boundary conditions on the domainΩ=\[0,1\]2\\Omega=\[0,1\]^\{2\}:
−div𝐊∇u=f,inΩ,\-\\operatorname\{div\}\\mathbf\{K\}\\nabla u=f,\\quad\\text\{in\}~\\Omega,\(12\)whereuudenotes the state variable,f\(𝐱\)f\(\\mathbf\{x\}\)is a source \(forcing\) term, and𝐊\\mathbf\{K\}is the diffusion coefficient tensor\. The tensor𝐊\\mathbf\{K\}encodes the underlying material or transport physics of the system\. The parameter𝐊\\mathbf\{K\}is selected within\[1,200\]\[1,200\]throughout this study\.
##### 2D Advection\-Diffusion\.
Following the SYS\-2 setup inSubramanianet al\.\([2023](https://arxiv.org/html/2605.29153#bib.bib26)\), we further evaluate a more complex 2D advection\-diffusion PDE, given by
−div𝐊∇𝐮\+𝝂⋅∇𝐮=f,inΩ,\-\\operatorname\{div\}\\mathbf\{K\}\\nabla\\mathbf\{u\}\+\\bm\{\\nu\}\\cdot\\nabla\\mathbf\{u\}=f,\\quad\\text\{in\}~\\Omega,\(13\)where𝝂\\bm\{\\nu\}denotes the velocity vector\. We use the ratio of advection to diffusion as a varying coefficient:𝚿=‖𝝂⋅∇𝐮‖/‖div𝐊∇𝐮‖\\bm\{\\Psi\}=\\\|\\bm\{\\nu\}\\cdot\\nabla\\mathbf\{u\}\\\|/\\\|\\operatorname\{div\}\\mathbf\{K\}\\nabla\\mathbf\{u\}\\\|\.
##### 2D Darcy Flow\.
We consider the steady\-state Darcy flow equation\(Liet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib93)\)on the unit square domainΩ=\(0,1\)2\\Omega=\(0,1\)^\{2\}with the homogeneous Dirichlet boundary condition\. The governing PDE is
−∇⋅\(a\(x\)∇u\(x\)\)\\displaystyle\-\\nabla\\cdot\\big\(a\(x\)\\nabla u\(x\)\\big\)=f\(x\),x∈Ω,\\displaystyle=f\(x\),\\quad x\\in\\Omega,\(14\)u\(x\)\\displaystyle u\(x\)=0,x∈∂Ω,\\displaystyle=0,\\quad x\\in\\partial\\Omega,\(15\)wherea\(x\)∈L∞\(Ω;ℝ\+\)a\(x\)\\in L^\{\\infty\}\(\\Omega;\\mathbb\{R\}\_\{\+\}\)denotes a piecewise\-constant diffusion coefficient andf\(x\)=1f\(x\)=1is a fixed forcing term\. The task is to learn the nonlinear operator that maps the diffusion fieldaato the corresponding solution fielduu\. For the physics\-informed training objective, we use the PDE residual
ℒpde\(u\)=∇⋅\(a∇u\)−f\.\\mathcal\{L\}\_\{\\text\{pde\}\}\(u\)=\\nabla\\cdot\\big\(a\\nabla u\\big\)\-f\.\(16\)
To control the physical difficulty of the problem, we vary the contrast ratio\(Galviset al\.,[2017](https://arxiv.org/html/2605.29153#bib.bib142)\),
r=amaxamin,r=\\frac\{a\_\{\\max\}\}\{a\_\{\\min\}\},\(17\)which measures the scale separation between the largest and smallest diffusion coefficients\. Larger contrast ratios correspond to stronger multiscale heterogeneity and higher physical stiffness, making the operator\-learning problem more challenging\.
##### Nonlinear Damped Pendulum\.
We study a nonlinear damped pendulum to probe training dynamics of time\-continuous models under varying physical complexity and data availability\. The system is governed by
d2θdt2\+bdθdt\+sin\(θ\)=0,\\frac\{d^\{2\}\\theta\}\{dt^\{2\}\}\+b\\frac\{d\\theta\}\{dt\}\+\\sin\(\\theta\)=0,\(18\)or equivalently, the first\-order form
θ˙=ω,ω˙=−bω−sin\(θ\),\\dot\{\\theta\}=\\omega,\\qquad\\dot\{\\omega\}=\-b\\,\\omega\-\\sin\(\\theta\),\(19\)whereθ\(t\)\\theta\(t\)is the angle andb∈\[0\.1,1\.0\]b\\in\[0\.1,1\.0\]is the damping coefficient\. Smallerbbyields longer\-lived, more oscillatory nonlinear dynamics, while largerbbproduces rapidly decaying trajectories\.
## Appendix CExtended Experiment Setups
### C\.1Summary of Experimental Configurations
Table[1](https://arxiv.org/html/2605.29153#A3.T1)summarizes the representative SciML models, benchmark physical systems, and optimization or training strategies evaluated in this work\. The selected models span several major SciML paradigms, including collocation\-based physics\-constrained learning \(PINNs\), operator learning \(FNOs\), physics\-informed operator learning \(PINOs\), and temporal dynamics models \(NODEs and PINODEs\)\. The benchmark systems cover representative PDE and ODE settings with varying physical difficulty, including transport\-dominated dynamics, elliptic systems, multiscale porous\-media flow, and nonlinear dynamical systems\. We evaluate a broad set of optimization and training interventions, including standard first\- and second\-order optimization methods, hard\-constrained formulations, curriculum learning strategies, and PINN\-specific stabilization methods\. These experiments form the basis of our regime\-aware empirical analysis\.
Table 1:Summary of SciML models, benchmark systems, and optimization/training strategies evaluated in this work\.SciML ModelRepresentative Physical SystemOptimization / Training StrategiesPINN1D convection, reaction, wave, reaction\-diffusion equationsL\-BFGS, ALM, NNCG, RoPINN, CLFNO2D Poisson, advection\-diffusion equationsAdamPINO2D Darcy flowAdam, L\-BFGS, ALM, NNCG, CLNODENonlinear pendulum systemAdamPINODENonlinear pendulum systemAdam, L\-BFGS, ALM, NNCG, CL
### C\.2Training Protocols
##### PINNs\.
For the PINN experiments, we sweep the governing physical parameter of each PDE system to control the underlying physical difficulty and construct regime plots\. For the 1D convection equation, the convection coefficient is varied overβ∈\[5,70\]\\beta\\in\[5,70\]\. For the reaction equation, we vary the reaction coefficient overρ∈\[1,12\]\\rho\\in\[1,12\]\. For the wave equation, the wave speed is swept overc∈\[0\.1,6\]c\\in\[0\.1,6\]\. Finally, for the reaction\-diffusion equation, we evaluate regimes acrossρ∈\[1,30\]\\rho\\in\[1,30\]\. In all cases, increasing the physical parameter generally produces sharper solution structures, stronger stiffness, or more challenging multiscale behavior, thereby increasing the optimization and generalization difficulty of the corresponding SciML problem\. The PINN architecture uses a Multi\-Layer Perceptron \(MLP\) with four hidden layers for all related PDE scenarios\. The hidden dimensions are\[50,50,50,50\]\[50,50,50,50\]\. We run PINN experiments using five random seeds\. The training configurations are provided in Table[2](https://arxiv.org/html/2605.29153#A3.T2)\.
Table 2:Training protocols and optimization configurations for PINN experiments\.StrategyLearning RateEpochsBatch SizeAdditional SettingsL\-BFGS1\.02×1032\\times 10^\{3\}Full batchhistory=100\\text\{history\}=100;tolerance\_grad=10−7\\text\{tolerance\\\_grad\}=10^\{\-7\},tolerance\_change=10−9\\text\{tolerance\\\_change\}=10^\{\-9\}NNCG0\.001 / 1\.05×1035\\times 10^\{3\}Full batchAdam:2×1032\\times 10^\{3\}epochs, L\-BFGS:1×1031\\times 10^\{3\}epochs, NNCG:2×1032\\times 10^\{3\}epochs; rank = 50ALM1\.050×2×10350\\times 2\\times 10^\{3\}Full batchpenalty coefficient increases from 2 toμ\\mu=1\.2501\.2^\{50\}; multiplier update every2×1032\\times 10^\{3\}epochsRoPINN1\.01×1031\\times 10^\{3\}Full batchinit\. region =10−510^\{\-5\}; samples = 1; window = 5CL1\.0140×2×103140\\times 2\\times 10^\{3\}Full batchprogressively increases PDE coefficient at speed of 0\.05 every2×1032\\times 10^\{3\}iterations
##### FNOs\.
For the experimental setup of FNOs, we follow the methodology described inSubramanianet al\.\([2023](https://arxiv.org/html/2605.29153#bib.bib26)\)\. We focus on solving the 2D Poisson \(SYS\-1\) and advection–diffusion \(SYS\-2\) systems under varying parameter configurations\. Specifically, we generate multiple sets of SYS\-1 by varying the governing parameters across the range:\(1\.0,2\.5\),\(2\.5,5\.0\),\(5\.0,10\),\(10,20\),\(20,30\),\(30,50\),\(50,100\),\(100,200\)\{\(1\.0,2\.5\),\(2\.5,5\.0\),\(5\.0,10\),\(10,20\),\(20,30\),\(30,50\),\(50,100\),\(100,200\)\}and SYS\-2 with the settings across the range:\(0\.2,0\.4\),\(0\.4,0\.6\),\(0\.6,0\.8\),\(0\.8,10\),\(1\.0,1\.3\),\(1\.3,1\.7\),\(1\.7,2\.0\),\(2\.0,2\.5\)\{\(0\.2,0\.4\),\(0\.4,0\.6\),\(0\.6,0\.8\),\(0\.8,10\),\(1\.0,1\.3\),\(1\.3,1\.7\),\(1\.7,2\.0\),\(2\.0,2\.5\)\}\. The FNO model used in our experiments contains approximately one million parameters\. Other hyperparameters, including batch size and learning rate, are aligned with those reported inSubramanianet al\.\([2023](https://arxiv.org/html/2605.29153#bib.bib26)\)to ensure consistency and comparability\. To balance convergence and overfitting, we adjust the number of training epochs based on the dataset size\. Our settings are as follows, where the first value denotes dataset size \(number of PDE samples\) and the second value denotes the corresponding number of training epochs:\{16k:75,8k:100,4k:150,2k:200,1k:300,512:500,256:750,128:1000\}\\\{16\\text\{k\}:75,8\\text\{k\}:100,4\\text\{k\}:150,2\\text\{k\}:200,1\\text\{k\}:300,512:500,256:750,128:1000\\\}\. We use the Adam optimizer for training FNOs across five random seeds\.
##### PINOs\.
We follow the procedure and the experimental setup of 2D Darcy Flow inLiet al\.\([2024](https://arxiv.org/html/2605.29153#bib.bib93)\)\. The PINO backbone is an FNO model with spectral modes 20 per spatial dimension, channel width 64, Fourier block layout\[64,64,64,64,64\]\[64,64,64,64,64\]\. The coefficient fielda\(x\)a\(x\)is generated from a Gaussian Random Field\-based pipeline, and we sweep the contrast ratio \(physical parameter\) in\{1,2,3,4,5,6,8,10\}\\\{1,2,3,4,5,6,8,10\\\}, training set size in\{5,10,20,50,100,200,500,1000\}\\\{5,10,20,50,100,200,500,1000\\\}, and use three random seeds\. The training details are provided in Table[3](https://arxiv.org/html/2605.29153#A3.T3)\.
Table 3:Training protocols and optimization configurations for PINO experiments\.StrategyLearning RateEpochsBatch SizeAdditional SettingsAdam0\.0011×1041\\times 10^\{4\}32\(β1,β2\)=\(0\.9,0\.999\)\(\\beta\_\{1\},\\beta\_\{2\}\)=\(0\.9,0\.999\)L\-BFGS1\.01×1041\\times 10^\{4\}Full batchhistory=100\\text\{history\}=100;tolerance\_grad=10−7\\text\{tolerance\\\_grad\}=10^\{\-7\},tolerance\_change=10−9\\text\{tolerance\\\_change\}=10^\{\-9\}NNCG0\.001 / 1\.01\.11×1041\.11\\times 10^\{4\}32 / Full batchAdam:1×1041\\times 10^\{4\}epochs, L\-BFGS:1×1031\\times 10^\{3\}epochs, NNCG: 100 epochs; rank = 4ALM0\.00013\.5×1043\.5\\times 10^\{4\}32penalty coefficient increase from 2 toμ\\mu=1\.05401\.05^\{40\}; multiplier update every5×1025\\times 10^\{2\}iterationsCL0\.001100×104100\\times 10^\{4\}32progressively increases PDE coefficient at speed of 0\.1 every1×1041\\times 10^\{4\}iterations
##### NODEs and PINODEs\.
The network and data setup are the same for both NODEs and PINODEs\. We use a shallow MLP with three hidden layers \(each with a width of 16\)\. Predictions are produced by numerical integration:
x^\(ti\)=x\(0\)\+∫0tifθ\(x\(τ\)\)𝑑τ\.\\hat\{x\}\(t\_\{i\}\)=x\(0\)\+\\int\_\{0\}^\{t\_\{i\}\}f\_\{\\theta\}\(x\(\\tau\)\)\\,d\\tau\.\(20\)During training, we integrate with the forward Euler method using the same step size as the training data \(Δt=0\.05\\Delta t=0\.05\)\. The model minimizes mean\-squared error \(MSE\) between predicted and ground\-truth trajectories over the training horizon\. Ground\-truth trajectories are generated withsolve\_ivpusing default tolerances\. We embed the phase\-space state\[θ,ω\]\[\\theta,\\omega\]intoℝ3\\mathbb\{R\}^\{3\}via a spherical map:
x\(t\)=\[sin\(θ\)cos\(ω\)sin\(θ\)sin\(ω\)−cos\(θ\)\],x\(t\)=\\begin\{bmatrix\}\\sin\(\\theta\)\\cos\(\\omega\)\\\\ \\sin\(\\theta\)\\sin\(\\omega\)\\\\ \-\\cos\(\\theta\)\\end\{bmatrix\},\(21\)which constrains dynamics to the unit sphere and introduces geometric complexity\. Training observations are sampled at a fixed temporal resolutionΔt=0\.05\\Delta t=0\.05\. We sweep two axes: \(1\) System Nonlinearity \(xx\-axis\): We vary dampingb∈\[0\.1,1\.0\]b\\in\[0\.1,1\.0\]and visualize thexx\-axis as1/b1/b\(rightward indicates lower damping and more oscillatory dynamics\); \(2\) Data Availability \(yy\-axis\): We vary training horizonTtrain∈\[1\.0,20\.0\]T\_\{\\text\{train\}\}\\in\[1\.0,20\.0\], with number of samplesNt=Ttrain/ΔtN\_\{t\}=T\_\{\\text\{train\}\}/\\Delta t\. WhileTtrainT\_\{\\text\{train\}\}varies, evaluation is done on a fixed long horizonTtest=20\.0T\_\{\\text\{test\}\}=20\.0s from a distinct initial condition \(θ0test=2\.8\\theta\_\{0\}^\{\\text\{test\}\}=2\.8v\.s\.θ0train=1\.7\\theta\_\{0\}^\{\\text\{train\}\}=1\.7\)\. The reported test error is the one\-step prediction error along this full test trajectory, probing generalization across initial conditions\. We also use Adam for training NODEs\. Table[4](https://arxiv.org/html/2605.29153#A3.T4)also presents the training details of PINODEs for different training and optimization methods\.
Table 4:Training protocols and optimization configurations for PINODE experiments\.StrategyLearning RateEpochsBatch SizeAdditional SettingsAdam10−310^\{\-3\}6×1026\\times 10^\{2\}32\(β1,β2\)=\(0\.9,0\.999\)\(\\beta\_\{1\},\\beta\_\{2\}\)=\(0\.9,0\.999\)L\-BFGS1\.01×1031\\times 10^\{3\}Full batchhistory=50\\text\{history\}=50;tolerance\_grad=10−7\\text\{tolerance\\\_grad\}=10^\{\-7\},tolerance\_change=10−9\\text\{tolerance\\\_change\}=10^\{\-9\}NNCG10−310^\{\-3\}1\.5×1031\.5\\times 10^\{3\}Full batchLBFGS warmup for1×1031\\times 10^\{3\}epochs; Nyström rank = 10ALM1\.050×50050\\times 500Full batchpenalty coefficientμ\\mu=1\.05501\.05^\{50\}; multiplier update every5×1025\\times 10^\{2\}epochsCL1\.060×50060\\times 500Full batchprogressively increases PDE coefficient at speed of 0\.25 every5×1025\\times 10^\{2\}epochs
## Appendix DImplementations of Visualization and Multi\-Regime Analysis
### D\.1Loss Landscape Metrics
Prior work\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9)\)categorizes phases in load\-temperature variations using four metrics\. The first metric is the training loss, which evaluates whether the training data is interpolated\. The other metrics describe the sharpness of the local minima, the similarity between models trained using different random seeds, and the connectivity between different local minima of the loss landscape\. It should be noted thatYanget al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib9)\)used a certain set of metrics to measure these loss landscape properties, but there are alternative metrics available\. For example, the sharpness of local minima can be measured using*adaptive sharpness metrics*\(Andriushchenkoet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib12); Kwonet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib11)\), while similarity can be measured using*disagreement*\(Theisenet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib13)\)\.
We define the loss landscape metrics followingYanget al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib9)\)\. Let𝜽∈ℝm\{\\bm\{\\theta\}\}\\in\\mathbb\{R\}^\{m\}denote the learnable weight parameter, and letℒ\\mathcal\{L\}be the loss function\. We compute metrics using the train set unless stated otherwise\.
Hessian\-based Metrics\.The Hessian matrix𝐇\\mathbf\{H\}at a given point𝜽\{\\bm\{\\theta\}\}can be represented as∇𝜽2ℒ\(𝜽\)∈ℝm×m\\nabla^\{2\}\_\{\\mathbf\{\{\\bm\{\\theta\}\}\}\}\\mathcal\{L\}\(\{\\bm\{\\theta\}\}\)\\in\\mathbb\{R\}^\{m\\times m\}\. The largest eigenvalueλmax\(𝐇\)\\lambda\_\{\\max\}\(\{\\mathbf\{H\}\}\)and traceTr\(𝐇\)\\text\{Tr\}\(\\mathbf\{H\}\)are used to summarize the local curvature properties in a single value\. Specifically, a larger value of the top eigenvalue or trace indicates greater sharpness\.
Mode Connectivity\.The mode connectivity assesses the presence of low\-loss paths between different local minima and reflects how well different solutions are connected in the parameter space, indicating smoother and more generalizable loss landscapes\. It is common to fit Bézier curves,\(γϕ\(t\)\(\\gamma\_\{\\phi\}\(t\), between two models𝜽\{\\bm\{\\theta\}\}and𝜽′\{\\bm\{\\theta\}\}^\{\\prime\}, and subsequently compute mode connectivitymcas
mc\(𝜽,𝜽′\)=12\(ℒ\(𝜽\)\+ℒ\(𝜽′\)\)−ℒ\(γϕ\(t∗\)\),\\texttt\{mc\}\(\{\\bm\{\\theta\}\},\{\\bm\{\\theta\}\}^\{\\prime\}\)=\\frac\{1\}\{2\}\(\\mathcal\{L\}\(\{\\bm\{\\theta\}\}\)\+\\mathcal\{L\}\(\{\\bm\{\\theta\}\}^\{\\prime\}\)\)\-\\mathcal\{L\}\(\\gamma\_\{\\phi\}\(t^\{\*\}\)\),wheret∗=argmin𝑡\|12\(ℒ\(𝜽\)\+ℒ\(𝜽′\)\)−ℒ\(γϕ\(t\)\)\|t^\{\*\}=\\underset\{t\}\{\\mathrm\{argmin\}\}\\left\|\\frac\{1\}\{2\}\(\\mathcal\{L\}\(\{\\bm\{\\theta\}\}\)\+\\mathcal\{L\}\(\{\\bm\{\\theta\}\}^\{\\prime\}\)\)\-\\mathcal\{L\}\(\\gamma\_\{\\phi\}\(t\)\)\\right\|\. Here,mc<0\\texttt\{mc\}<0indicates a loss barrier between the two models and hence poor connectivity;mc\>0\\texttt\{mc\}\>0reveals lower loss regions between the models, which indicates poor training; andmc≈0\\texttt\{mc\}\\approx 0indicates well\-connected models\.
### D\.2Hessian Metrics Computation
To analyze the second\-order geometry of the loss landscape, we use Hessian\-based metrics including top eigenvalues, trace, and spectrum\. Given the high dimensionality of the parameter space, explicit computation of the Hessian matrixHHis computationally prohibitive\. Therefore, we implement matrix\-free methods based on Hessian\-vector products \(HVP\), following the PyHessian framework\(Yanget al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib9); Yaoet al\.,[2020](https://arxiv.org/html/2605.29153#bib.bib44)\)\.
- •Top Eigenvalues:The dominant eigenvalues \(λmax\\lambda\_\{max\}\) are computed using the power iteration method\. This iterative process identifies the directions of maximum curvature in the loss landscape, which are often associated with model sensitivity and sharp minima\.
- •Hessian Trace:The trace of the Hessian is estimated using Hutchinson’s method\. By computing the expectation of quadratic formsvTHvv^\{T\}Hvwith Rademacher random vectorsv∈\{−1,1\}nv\\in\\\{\-1,1\\\}^\{n\}, we obtain an unbiased estimator of the average curvature\.
- •Hessian Spectrum:The Hessian Spectrum is estimated via the Stochastic Lanczos Quadrature to provide a global view of the curvature distribution\. We follow the methodology in the NNCG\(Rathoreet al\.,[2024](https://arxiv.org/html/2605.29153#bib.bib58)\)\. To capture the heavy\-tailed nature of the spectrum, the density is visualized using a symmetrical log\-scale, allowing for the clear observation of eigenvalues spanning multiple orders of magnitude\.
While standard Hessian analysis for architectures like ResNet, ViT\(Dosovitskiyet al\.,[2021](https://arxiv.org/html/2605.29153#bib.bib139)\), or ConvNeXt\(Wooet al\.,[2023](https://arxiv.org/html/2605.29153#bib.bib140)\)typically relies on a single mini\-batch for computational efficiency, our implementation for PINNs uses the entire set of residual collocation points\. This ensures that the second\-order information faithfully represents the full physics\-informed loss surface rather than a stochastic approximation\.
### D\.3Implementation Details of Boundaries
The specific methodology for regime classification that we use is adapted from the taxonomic framework proposed inYanget al\.\([2021](https://arxiv.org/html/2605.29153#bib.bib9)\)\. We report boundary segmentation thresholds for different PDEs in Table[5](https://arxiv.org/html/2605.29153#A4.T5)\. No smoothing or post\-processing is applied to the raw values shown in the heatmaps, ensuring that the regime plots faithfully reflect the original experimental measurements\.
Table 5:Detailed training and test\-error thresholds used for regime\-boundary extraction, grouped by model, PDE system, and optimization method\.ModelPhysical SystemTraining StrategyTraining Threshold\(𝑻train\\bm\{T\_\{\\text\{train\}\}\}\)Test Threshold\(𝑻test\\bm\{T\_\{\\text\{test\}\}\}\)PINN1D ConvectionL\-BFGS3\.80×10−33\.80\\times 10^\{\-3\}0\.4340\.434ALM3\.34×10−23\.34\\times 10^\{\-2\}0\.2780\.278RoPINN2\.97×10−32\.97\\times 10^\{\-3\}0\.3620\.362NNCG3\.55×10−33\.55\\times 10^\{\-3\}0\.4330\.433CL2\.29×10−32\.29\\times 10^\{\-3\}0\.2390\.2391D ReactionL\-BFGS2\.73×10−22\.73\\times 10^\{\-2\}0\.4690\.469ALM7\.30×10−57\.30\\times 10^\{\-5\}0\.4940\.494RoPINN1\.84×10−21\.84\\times 10^\{\-2\}0\.3610\.361NNCG2\.46×10−22\.46\\times 10^\{\-2\}0\.4320\.4321D WaveL\-BFGS9\.91×10−39\.91\\times 10^\{\-3\}0\.2970\.297ALM5\.77×10−35\.77\\times 10^\{\-3\}0\.3000\.300RoPINN7\.80×10−37\.80\\times 10^\{\-3\}0\.3060\.3061D Reaction–DiffusionL\-BFGS4\.46×10−24\.46\\times 10^\{\-2\}0\.4350\.435ALM2\.44×10−42\.44\\times 10^\{\-4\}0\.3840\.384RoPINN4\.26×10−24\.26\\times 10^\{\-2\}0\.4110\.411FNO2D PoissonAdam5\.96×10−25\.96\\times 10^\{\-2\}0\.1920\.1922D Advection\-DiffusionAdam6\.67×10−26\.67\\times 10^\{\-2\}0\.2580\.258PINO2D Darcy FlowAdam0\.2110\.2110\.1610\.161L\-BFGS0\.3230\.3230\.1630\.163ALM0\.1250\.1250\.1580\.158NNCG0\.1250\.1250\.1670\.167CL0\.1070\.1070\.09730\.0973NODENonlinear Damped PendulumAdam5\.21×10−55\.21\\times 10^\{\-5\}1\.981\.98PINODENonlinear Damped PendulumAdam1\.12×10−61\.12\\times 10^\{\-6\}1\.741\.74L\-BFGS3\.20×10−73\.20\\times 10^\{\-7\}3\.183\.18ALM6\.88×10−96\.88\\times 10^\{\-9\}1\.311\.31NNCG1\.37×10−71\.37\\times 10^\{\-7\}2\.192\.19CL2\.43×10−72\.43\\times 10^\{\-7\}3\.473\.47
### D\.4Color Coding Protocol
##### 2D Regime Plots\.
Each cell in the 2D grid corresponds to a unique \(physical parameter, data quantity\) configuration\. For each configuration, we train the model with multiple random seeds and report the seed\-averaged metric value\. The color of each cell is determined by mapping this averaged value to a truncated colormap \(using the\[0,0\.8\]\[0,0\.8\]portion of the full color range\) through a robust percentile normalization\. Specifically, for a metric valuevvat grid point\(i,j\)\(i,j\), the color is assigned as
ci,j=colormap\(vi,j−vminvmax−vmin\),c\_\{i,j\}\\;=\\;\\mathrm\{colormap\}\\\!\\left\(\\frac\{v\_\{i,j\}\-v\_\{\\min\}\}\{v\_\{\\max\}\-v\_\{\\min\}\}\\right\),\(22\)wherevminv\_\{\\min\}andvmaxv\_\{\\max\}denote the 5th and 95th percentiles of the metric values across all grid points within that subplot, respectively\. This percentile\-based normalization ensures that extreme outliers do not dominate the color scale\. Each subplot is independently normalized; different subplots may therefore have different color scales\. Training loss and test error are visualized in separate panels, each with its own independent color scale\. Values falling outside the\[vmin,vmax\]\[v\_\{\\min\},v\_\{\\max\}\]range are clamped to the corresponding endpoint colors of the colormap\.
##### Relative Improvement Heatmaps\.
Each cell shows the percentage change of the evaluated method relative to the baseline:
Δi,j=vi,jbase−vi,jmethodvi,jbase×100%\.\\Delta\_\{i,j\}\\;=\\;\\frac\{v^\{\\mathrm\{base\}\}\_\{i,j\}\-v^\{\\mathrm\{method\}\}\_\{i,j\}\}\{v^\{\\mathrm\{base\}\}\_\{i,j\}\}\\times 100\\%\.\(23\)A diverging colormap is used, centered at zero, so that positive values \(improvement over the baseline\) and negative values \(degradation\) are visually distinguishable\. All runs, including those exhibiting numerical instability or divergence, are included in the seed average without exclusion; hence, the displayed color at each cell faithfully reflects the raw averaged percentage change\.
##### 3D Loss Landscape Surfaces\.
To generate our plots, we use the filter\-wise normalization presented inLiet al\.\([2018](https://arxiv.org/html/2605.29153#bib.bib39)\)\. Given the parameterθ\\thetaof a trained model, the resulting plots are computed by the following function:f\(α,β\)=ℒ\(θ\+αd1\+βd2\)f\(\\alpha,\\beta\)=\\mathcal\{L\}\(\\theta\+\\alpha d\_\{1\}\+\\beta d\_\{2\}\), whered1d\_\{1\}andd2d\_\{2\}are the two directions with maximum eigenvalues ofHℒH\_\{\\mathcal\{L\}\}\.α\\alphaandβ\\betaare uniformly sampled from\[−r,r\]\[\-r,r\], whererrcontrols the perturbation radius in the parameter space\. Given that the value ofℒ\\mathcal\{L\}near the converged model is small, we use the log scale to make the trend more pronounced\. The surface color encodes the log\-scaled loss valuelogℒ\(θ\+αd1\+βd2\)\\log\\mathcal\{L\}\(\\theta\+\\alpha d\_\{1\}\+\\beta d\_\{2\}\)at each point in the two\-dimensional parameter slice, following the filter\-wise normalization\. A sequential colormap is used, with cooler colors corresponding to lower loss values and warmer colors to higher loss values\.
##### Hessian\-based Visualizations\.
Two types of Hessian visualizations are used in this work\.*Hessian phase plots*display the logarithm of the maximum Hessian eigenvalue,logλmax\(𝐇\)\\log\\lambda\_\{\\max\}\(\\mathbf\{H\}\), computed at the converged model for each configuration and averaged over random seeds\. The color coding follows the same per\-subplot robust percentile normalization as in Equation[22](https://arxiv.org/html/2605.29153#A4.E22)\.*Hessian spectral density plots*do not use a heatmap color coding; instead, thexx\-axis represents eigenvalue magnitude on a symmetric log scale, and theyy\-axis represents the estimated spectral density\. Blue and orange curves distinguish the spectra before and after training, respectively\. The spectral density is estimated via the Stochastic Lanczos Quadrature method, following the procedure described in Appendix[D\.2](https://arxiv.org/html/2605.29153#A4.SS2)\.
## Appendix EExtended Analysis of Regimes Across Varying Physics
In this section, we provide additional regime\-analysis results across varying physical systems\. In addition to the main\-text experiments, we evaluate PINNs with ALM and RoPINN on four representative PDE systems: the 1D convection, reaction, wave, and reaction\-diffusion equations\. Figures[8](https://arxiv.org/html/2605.29153#A5.F8)and[9](https://arxiv.org/html/2605.29153#A5.F9)further demonstrate that the coarse three\-regime structure persists across different physical mechanisms, while the detailed geometry and location of the regime boundaries vary with the underlying PDE\. As shown in Figure[10](https://arxiv.org/html/2605.29153#A5.F10), we also provide additional FNO experiments trained with Adam on the 2D Poisson and advection\-diffusion equations\. Compared with PINN\-based models, the resulting regime maps exhibit smoother and less sharply separated transitions, but they still display consistent trainability and generalization patterns across varying physical difficulty and training\-data regimes\. Together, these extended results support the robustness of the regime\-aware perspective across both physics\-constrained and data\-driven SciML paradigms\.
\(*a*\) 1D Convection\(*b*\) 1D Reaction\(*c*\) 1D Wave\(*d*\) 1D Reaction\-DiffusionFigure 8:Regime plots across varying physical systems usinga fixed PINN model trained with the ALM optimizer\. The evaluated PDE systems include \(*a*\) the 1D convection equation with convection coefficientβ\\beta, \(*b*\) the 1D reaction equation with reaction coefficientρ\\rho, \(*c*\) the 1D wave equation with wave speedcc, and \(*d*\) the 1D reaction\-diffusion equation with reaction coefficientρ\\rho\. Lighter \(yellow\) and darker \(green\) colors denote lower and higher training loss/test error, respectively\.\(*a*\) 1D Convection\(*b*\) 1D Reaction\(*c*\) 1D Wave\(*d*\) 1D Reaction\-DiffusionFigure 9:Regime plots across varying physical systems usinga fixed PINN model trained with the RoPINN method\. The evaluated PDE systems include \(*a*\) the 1D convection equation with convection coefficientβ\\beta, \(*b*\) the 1D reaction equation with reaction coefficientρ\\rho, \(*c*\) the 1D wave equation with wave speedcc, and \(*d*\) the 1D reaction\-diffusion equation with reaction coefficientρ\\rho\.\(*a*\) 2D Poisson\(*b*\) 2D Advection\-DiffusionFigure 10:Regime plots across varying physical systems usinga fixed FNO model with the Adam optimizer\. The evaluated PDE systems include \(*a*\) the 2D Poisson equation and \(*b*\) the 2D advection\-diffusion equation\.
## Appendix FExtended Analysis of Possible Failure Mechanisms
In this section, we provide additional evidence to support our analysis of possible failure mechanisms in SciML optimization\. Building on the loss\-landscape and Hessian eigenspectra observations in the main text, we further investigate two possible failure mechanisms and three additional hypotheses commonly associated with optimization instability \(basin jumping, loss barriers, and landscape aging\)\. The following empirical results suggest that many classical explanations from conventional deep learning do*not*fully account for the optimization difficulties observed in SciML, highlighting the distinctive geometric and physical characteristics of these models, as well as the need for improved model diagnostics\.
### F\.1Deceptive Sharpness
Here, we provide empirical evidence for deceptive sharpness\. As shown in Figure[11](https://arxiv.org/html/2605.29153#A6.F11)\(a\), we observe an anomalous distribution of Hessian metrics when training FNO models on 2D Poisson datasets: some higher Hessian metric values appear in well\-trained regions \(Regime I\), while lower values characterize poorly trained regions \(Regime II and Regime III\)\. Additional results on increasing sharpening are shown in Figure[12](https://arxiv.org/html/2605.29153#A6.F12)for PINNs trained on the 1D convection equation under varying collocation sizes and physical difficulty\. Across all settings, PINN training exhibits progressive sharpening, where Hessian curvature increases substantially even as the training loss decreases\. The sharpening effect becomes more pronounced in harder physical regimes with largerβ\\beta, indicating increasingly ill\-conditioned local geometry during optimization\.
\(*a*\) Deceptive Sharpness \(FNO, 2D Poisson\)\(*b*\) Deceptive Flatness \(PINN, 1D Reaction\-Diffusion\)Figure 11:Examples of deceptive sharpness and deceptive flatness in SciML regime plots, illustrated through comparisons between log Hessian eigenvalues and training loss\. \(*a*\) FNO trained on the 2D Poisson equation, showingdeceptive sharpness, where regions with relatively low training loss still exhibit consistently large Hessian eigenvalues\. \(*b*\) PINN trained on the 1D reaction\-diffusion equation, showingdeceptive flatness, where regions with smaller Hessian eigenvalues nevertheless correspond to poor optimization and large training loss\. These results illustrate that Hessian sharpness alone does not reliably characterize trainability quality in SciML\.\(a\)Nf=100,β=5N\_\{f\}=100,\\ \\beta=5
\(b\)Nf=100,β=50N\_\{f\}=100,\\ \\beta=50
\(c\)Nf=1000,β=5N\_\{f\}=1000,\\ \\beta=5
\(d\)Nf=1000,β=50N\_\{f\}=1000,\\ \\beta=50
Figure 12:Evolution of training loss and maximum Hessian eigenvalueλmax\\lambda\_\{\\max\}during PINN training on the 1D convection equation under different physical and data regimes\. The blue curve shows the training loss, while the red dashed curve shows the largest Hessian eigenvalue over training iterations\. Panels vary the number of collocation pointsNfN\_\{f\}and the convection coefficientβ\\beta\. The shaded regions separate the early optimization phase and the later convergence phase, divided by the dashed vertical line\.
### F\.2Deceptive Flatness
\(*a*\)ρ=5\\rho=5\(*b*\)ρ=15\\rho=15Figure 13:3D loss landscapes and Hessian eigenspectra of PINNs trained on the 1D reaction\-diffusion equation under different physical parameters\. We use 1000 collocation points for training both models\. Panel \(*a*\) corresponds toρ=5\\rho=5, while panel \(*b*\) showsρ=15\\rho=15\. Although theρ=5\\rho=5setting exhibits a sharper minimum and larger Hessian eigenvalues after training, it achieves better optimization performance than theρ=15\\rho=15case\. In contrast, the flatter landscape atρ=15\\rho=15is associated with broad high\-loss plateaus and weaker descent structure, indicating that low curvature alone does not necessarily imply better trainability in SciML\.Here, we provide empirical evidence for deceptive flatness\. We visualize the Hessian spectrum and training dynamics to characterize the pathological regions where optimization stalls\.
##### Hessian Spectrum Analysis\.
Figure[13](https://arxiv.org/html/2605.29153#A6.F13)illustrates the evolution of the Hessian spectrum before and after training\. At initialization \(blue curves\), the spectrum exhibits a relatively symmetric structure centered around zero, indicating a diverse set of descent directions \(negative eigenvalues\) and ascent directions \(positive eigenvalues\)\. However, as the optimizer interacts with the high\-coefficient PDE \(ρ=20\\rho=20\), the spectrum undergoes a pronounced shift\. The distribution becomes heavily biased toward positive values, while the magnitude of these eigenvalues remains small\.
Quantitatively, we observe a significant reduction in the availability of descent directions\. For theρ=20\\rho=20case, the proportion of negative eigenvalues drops to approximately 37%\. For context, this is substantially lower than standard deep learning benchmarks; for instance, a ResNet\-18 model typically retains a negative eigenvalue proportion close to 50%\. This spectral degeneracy suggests that the optimizer is not sitting in a saddle point with clear escape routes, but rather on a relatively flat, featureless plateau where gradient information is effectively vanished\.
##### Regime Plots\.
To further investigate the relationship between flatness and solution quality, we analyze the training dynamics using 2D phase plots, as shown in Figure[11](https://arxiv.org/html/2605.29153#A6.F11)\(b\)\. These plots summarize the maximum Log Hessian Eigenvalue across different collocation point settings and PDE parameters \(ρ\\rho\)\.
The plots reveal distinct training regimes\. Notably, the region corresponding to large PDE coefficients \(e\.g\.,ρ=15,20\\rho=15,20\) exhibits an anomalous behavior \(highlighted in yellow/lighter colors in the heatmap\)\. In this regime, the model exhibitslowHessian eigenvalues—normally a signature of convergence to a flat minimum—yet sustainshightraining loss\. This confirms our hypothesis of deceptive flatness: the L\-BFGS optimizer is trapped in a region of vanishing curvature\. Unlike the sharp minima found in successful runs \(Region I\) or the chaotic high\-curvature regions, this “silent” failure mode provides no directional feedback, causing the optimizer to terminate prematurely in a non\-optimal state\.
### F\.3Loss Landscape Jumps between Basins
We consider a potential hypothesis for the observed optimization difficulties: training fails because the optimizer unstably “jumps” between different basins of attraction, preventing convergence to a consistent minimum\. This could be a plausible hypothesis, given the lack of basin connectivity observed in previous sections\. The results are shown in Figure[14](https://arxiv.org/html/2605.29153#A6.F14)\. We follow the methodology ofHodgkinsonet al\.\([2022](https://arxiv.org/html/2605.29153#bib.bib101)\)to quantify basin jumping and apply it to three settings: ResNet\-18 \(purple stars\); a PINN for 1D convection withβ=5\\beta=5\(orange squares\); and a PINN for 1D convection withβ=20\\beta=20\(green dots\)\. The PINNs are optimized using L\-BFGS, while ResNet\-18 is optimized using SGD\. Once the loss falls below a chosen threshold, we continue optimization for an additional number of iterations and record the step magnitudes\|Wk\+1−Wk\|\|W\_\{k\+1\}\-W\_\{k\}\|, followingHodgkinsonet al\.\([2022](https://arxiv.org/html/2605.29153#bib.bib101)\)\. We then compute the sequence1/\|Wk\+1−Wk\|1/\|W\_\{k\+1\}\-W\_\{k\}\|for increasingkkand fit a power\-law \(PL\) distribution using thepowerlawtoolbox\.
Figure[14](https://arxiv.org/html/2605.29153#A6.F14)shows the fitted PL exponentsα\\alpha, where smaller values ofα\\alphaindicate a higher tendency for basin jumping\. We observe that both PINN settings generally exhibit largerα\\alphavalues than ResNet\-18, suggesting fewer jumps\. This is expected, as the better connectivity of the ResNet loss landscape may reduce the need for frequent basin transitions\. However, it is noteworthy that theβ=20\\beta=20case \(green dots\) exhibits fewer jumps than theβ=5\\beta=5case \(orange squares\)\. This is somewhat surprising and suggests that basin jumping may be effectively suppressed by the severely degraded loss landscape at largerβ\\beta\.
Moreover, we observe an abrupt drop in the PL exponentα\\alphafor ResNet\-18 as the learning rate increases to larger values, indicating more frequent transitions between basins\. In contrast, theα\\alphavalues for the PINNs remain relatively stable across settings, further suggesting that PINN optimization exhibits fewer basin jumps than one might initially expect\.
Figure 14:Comparison of lower\-tail exponentsα\\alphaacross learning ratesη\\etafor ResNet\-18 and PINNs\. Purple stars denote ResNet\-18 trained with SGD, while orange squares and green circles denote PINNs for the 1D convection equation withβ=5\\beta=5andβ=20\\beta=20, respectively, optimized using L\-BFGS\.
### F\.4Loss Landscape Barrier
Another possible loss landscape characteristic often investigated in CV is the loss landscape barriers; that is, the optimal solution and the current solution are separated by a high\-loss barrier\(Draxleret al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib55); Garipovet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib54)\)\. To further test whether the training difficulty of PINNs can be attributed to such a barrier within an otherwise correct loss basin, we conduct an additional evaluation that focuses on connecting a solution with coarse convergence to one with fine convergence\. Specifically, we select the convection setting and archive model checkpoints at two distinct loss levels: a coarse\-convergence level \(L≈10−2L\\approx 10^\{\-2\}\) in PINN with L\-BFGS; and a fine convergence level \(L≈10−4L\\approx 10^\{\-4\}\) in PINN with ALM\. The two checkpoints are att=0t=0andt=1t=1in Figure[15](https://arxiv.org/html/2605.29153#A6.F15)\. Although the coarse checkpoint already achieves a low training loss, it produces physically invalid solutions with large test error, failing to capture critical fine\-scale structures\. In contrast, further optimization to a much smaller loss is required to recover the correct physical behavior\. This result indicates that the optimization difficulty does not arise from the need to climb a loss barrier\. Instead, it reflects a more fundamental mismatch between loss minimization and solution fidelity in SciML, where low loss does not necessarily correspond to physically accurate solutions\. In other words, optimization failures are not only due to optimizer inefficiency, but also arise from misleading loss landscapes that decouple training loss from solution accuracy\.
\(a\)Training Loss
\(b\)Test Error
Figure 15:Training loss and test error of PINN models on Bezier curve between L\-BFGS \(t=0t=0\) and ALM \(t=1t=1\)\. We evaluate on 1D convection equation usingNf=10000N\_\{f\}=10000andβ=40\\beta=40\. As shown in \(*a*\), the training loss landscape suggests that the L\-BFGS and ALM solutions lie within the same basin\. However, the corresponding test errors in \(*b*\) differ significantly\. This observation indicates that sharing a similar low\-loss region does not necessarily imply comparable generalization performance; only sufficiently low training loss leads to physically accurate solutions\.
### F\.5Loss Landscape Aging
A third direction is “loss landscape aging,” a phenomenon where optimization dynamics slow down drastically during the final convergence phase\(Baity\-Jesiet al\.,[2018](https://arxiv.org/html/2605.29153#bib.bib87)\)\. This aging is characterized by the emergence of an increasing number of flat directions \(i\.e\., Hessian eigenvalues approaching zero\) after the optimizer has reached a wide, flat minimum\. The optimizers’ subsequent motion becomes diffusive and slow, hindering its ability to settle into the precise “bottom” of the basin\. We contend, however, that this late\-stage phenomenon does not explain the main optimization barrier observed in our SciML settings\. Figures[16](https://arxiv.org/html/2605.29153#A6.F16)\(a\-d\) show that the PINN loss landscapes have a large number of non\-zero eigenvalues and do not have a significantly higher density at zero\. However, as shown in Figure[16](https://arxiv.org/html/2605.29153#A6.F16)\(e\), the density of eigenvalues of the Hessian matrix of CNN will decrease sharply when it extends from 0 outward in both directions\. Since the “aging” hypothesis fundamentally presumes that the optimizer has already successfully navigated to a relatively low\-loss, near\-optimal region, we conclude that this hypothesis likely does not explain the hard training of the SciML settings we have considered\. We note, of course, that this phenomenon also relates to increasing sharpening, as the bottom of the loss minimum remains sharper rather than flatter\.
\(*a*\) PINN \(Well\-Trained\)Nf=100,β=5N\_\{f\}=100,\\ \\beta=5\(*b*\) PINN \(Over\-Trained\)Nf=100,β=50N\_\{f\}=100,\\ \\beta=50\(*c*\) PINN \(Well\-Trained\)Nf=20k,β=5N\_\{f\}=20\\text\{k\},\\ \\beta=5\(*d*\) PINN \(Under\-Trained\)Nf=20k,β=50N\_\{f\}=20\\text\{k\},\\ \\beta=50\(*e*\) ResNet\-18Figure 16:Hessian eigenspectrum density plots before and after training for PINNs under different training regimes and for ResNet\-18\. The PINN examples compare well\-trained, over\-trained, and under\-trained regimes across different collocation budgetsNfN\_\{f\}and PDE coefficient parametersβ\\beta, while ResNet\-18 serves as a representative CV model\.Similar Articles
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
This paper proposes a three-regime framework to resolve empirical contradictions in how LLMs handle conflict between training knowledge and new documents, validated across five major models. It distinguishes between parametric strength and uniqueness and demonstrates how task framing and evidence coherence significantly impact model behavior.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
This paper identifies mode collapse in on-policy RL methods like GRPO and proposes DMPO, which approximates forward KL minimization to maintain solution diversity. It achieves significant improvements on NP-hard combinatorial optimization and mathematical reasoning tasks.
@MaximeRivest: At first glance: > Structural Equation Modeling (SEM/Path Analysis) > Neural Ordinary Differential Equations (Neural OD…
The author compares Structural Equation Modeling, Neural ODEs, and AI Programs like DSPy as declarative frameworks for defining and optimizing computational graphs, arguing that structured flows are essential for trustworthy AI agents.
Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
This paper analyzes inference-time optimization techniques for AIMO 3, finding that model capability dominates over prompt engineering and diverse sampling strategies. The study reveals that high-temperature sampling already decorrelates errors maximally, leaving no room for prompt-based improvements, and identifies a 6-point selection loss gap between individual model pass@20 and majority voting consensus.
Open Multimodal Datasets and Open-Source Software for Data-Driven Modeling of Multiphase Transport and Thermal Systems
This paper presents open multimodal datasets and open-source software packages for reproducible AI-enabled thermal-fluid research, introducing a spatial-temporal dimensionality framework and tools like SeqReg for sequence regression.