Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks
Summary
Derives the closed-form gradient of the Wolkowicz-Styan upper bound on the loss Hessian eigenspectrum to guide neural network training toward flat minima, and introduces Hessian Spectral Range (HSR) Regularization. Numerical experiments show that HSR narrows the Hessian eigenvalue range, avoids sharp minima and saddle points, and achieves flat solutions comparable to Sharpness-Aware Minimization (SAM).
Cached at: 06/30/26, 05:28 AM
# Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks
Source: [https://arxiv.org/html/2606.28662](https://arxiv.org/html/2606.28662)
Kazuki Sakai (2), Yohei Kakimoto (1), Makoto Sasaki (1), Yusuke Sakai (3), Hirotaka Takahashi (3)

1: Nihon University (Japan); 2: National Institute of Technology, Nagaoka College (Japan); 3: Tokyo City University (Japan)
###### Abstract
One influential theory for explaining the generalization ability of neural networks \(NNs\) is the flatness hypothesis, which suggests that the flatness of the loss landscape is related to generalization performance\. In general, the flatness of the loss landscape is quantified by the eigenvalues of the Hessian matrix of the Taylor\-expanded loss function\. Several training algorithms have been proposed to reduce the eigenvalues of the loss Hessian\. However, most existing studies focus on the design of training algorithms and do not clarify how the training data distribution and the internal parameters of NNs contribute to directions leading to flatter minima\. One direct way to achieve this is to analytically characterize the directions in which the eigenvalues decrease; however, deriving such directions is generally difficult\. On the other hand, recent studies have reported the Wolkowicz\-Styan \(WS\) upper bound, a theorem that analytically describes an upper bound on the largest eigenvalue of the cross\-entropy \(CE\) loss Hessian in a three\-layer hierarchical NN\. However, that study was limited to deriving the upper bound and did not derive its gradient\. Therefore, in this study, we analytically derive the gradient of the WS upper bound and use its closed\-form expression to characterize directions leading to flatter minima\. To examine whether this direction facilitates convergence to flatter minima, we propose a regularization method that updates the network parameters along the steepest descent direction of the WS upper bound\. Results from several numerical experiments show that this regularization narrows the range of the Hessian eigenvalue spectrum, avoids both sharp minima and saddle points, and promotes convergence to flatter minima\. Therefore, we name this method Hessian Spectral Range \(HSR\) Regularization\. 
Comparisons with existing regularization methods show that HSR Regularization outperforms Hessian Regularization and achieves solutions that are as flat as those obtained by Sharpness\-Aware Minimization \(SAM\)\. The applicability of the proposed method is limited because it only works for the combination of the CE loss and a three\-layer hierarchical NN\. However, to the best of the authors’ knowledge, no previous study has reported a closed\-form gradient that promotes convergence to flatter minima without relying on numerical approximations\. Therefore, this study contributes to the theoretical development of NNs\.
## I Introduction
Neural networks \(NNs\) are widely utilized in a broad range of tasks and have achieved state\-of\-the\-art performance in numerous domains\[[5](https://arxiv.org/html/2606.28662#bib.bib18)\]\[[23](https://arxiv.org/html/2606.28662#bib.bib16)\]\[[2](https://arxiv.org/html/2606.28662#bib.bib17)\]\. On the other hand, the theoretical understanding of their generalization capabilities is still under development\. As a prominent theory concerning the generalization of NNs, the flatness hypothesis is widely recognized\[[11](https://arxiv.org/html/2606.28662#bib.bib14)\]\. According to the flatness hypothesis, if the loss function exhibits a sharp landscape in the vicinity of a solution obtained through training, the generalization error tends to be large; conversely, if the loss function has a flat shape, the generalization error is expected to be small\[[20](https://arxiv.org/html/2606.28662#bib.bib40)\]\[[3](https://arxiv.org/html/2606.28662#bib.bib39)\]\. To quantify the sharpness of the loss, the eigenspectrum of the Hessian matrix has been widely employed as a representative metric\. This is because when the loss function is Taylor\-expanded around a critical point, its local curvature is characterized by the Hessian matrix appearing in the quadratic term\[[20](https://arxiv.org/html/2606.28662#bib.bib40)\]\[[22](https://arxiv.org/html/2606.28662#bib.bib36)\]\. Previous studies have proposed several optimization algorithms aimed at reaching flat minima by mitigating sharpness\. Representative examples of such approaches include Hessian regularization\[[20](https://arxiv.org/html/2606.28662#bib.bib40)\]and sharpness\-aware minimization \(SAM\)\[[9](https://arxiv.org/html/2606.28662#bib.bib49)\], both of which have demonstrated improvements in test performance across various tasks\. However, these methods focus primarily on algorithmic design, and the factors that dictate the direction toward flat minima are not yet fully understood\. 
To theoretically comprehend the structural mechanisms that form flat minima, it is essential to analytically describe the direction leading toward them\.
Therefore, the objective of this study is to derive a closed\-form solution for the direction toward flat minima\. In this paper, we define this direction as the steepest descent direction of sharpness\. In other words, the research objective is to derive the parameter gradient of the maximum eigenvalue\. However, because it is generally difficult to analytically express the maximum eigenvalue, obtaining its gradient analytically is inherently challenging\. Nevertheless, a recent study\[[24](https://arxiv.org/html/2606.28662#bib.bib6)\]derived a closed\-form solution for the upper bound of the maximum eigenvalue under the condition of the cross\-entropy \(CE\) loss in three\-layer hierarchical NNs\. This theorem describes the upper bound of the maximum eigenvalue based on the traces of the Hessian and the squared Hessian, which is referred to as the “Wolkowicz\-Styan \(WS\) upper bound\.” Although this function is expected to possess an analytical derivative, the aforementioned study did not extend to its derivation\. Therefore, this study attempts to describe the steepest descent direction of sharpness as a closed\-form function by deriving the parameter gradient of the WS upper bound\. This enables us to investigate how the training data distribution and the internal network parameters influence the direction toward flat minima\. To the best of the authors’ knowledge, no prior study has reported the derivation of the direction toward flat minima in a closed form\. This work provides a new foundation for the analytical understanding of flat minima\.
In this paper, we name the optimization approach that moves in the steepest descent direction of the WS upper bound “Hessian Spectral Range \(HSR\) Regularization\.” HSR regularization has the effects of decreasing the maximum eigenvalue and increasing the minimum eigenvalue of the Hessian matrix\. That is, this method has the effect of narrowing the range of the eigenspectrum, which is expected to prevent the model from falling into both sharp minima and saddle points\. In this work, we verify whether HSR regularization can achieve flatness comparable to existing methods, such as Hessian regularization and SAM\. At present, HSR regularization can only be applied to three\-layer hierarchical NNs, which poses a significant limitation in practical applications\. However, because it can achieve flatness comparable to existing methods without relying on numerical approximations, the direction toward flat minima derived in this study can be regarded as a closed\-form function of considerable value\. The above constitutes the academic contribution of this study to the NN domain\.
## II Related works
### II-A Flat minima
In 1997, Hochreiter and Schmidhuber \[[11](https://arxiv.org/html/2606.28662#bib.bib14)\] argued that NNs with high generalization performance require not only a low error at the solution but also low errors in its vicinity, meaning that the loss landscape is flat. Some studies have questioned the validity of this hypothesis, for example because sharpness measures are not invariant under reparameterization \[[7](https://arxiv.org/html/2606.28662#bib.bib13)\]. On the other hand, numerous practical applications have reported that reaching flat minima improves generalization performance \[[6](https://arxiv.org/html/2606.28662#bib.bib45)\]\[[12](https://arxiv.org/html/2606.28662#bib.bib44)\]\[[19](https://arxiv.org/html/2606.28662#bib.bib24)\]. Furthermore, several empirical techniques considered effective for improving the generalization performance of NNs have been reported to be related to the reduction of sharpness. For instance, sharpness-reduction effects have been examined for batch normalization \[[15](https://arxiv.org/html/2606.28662#bib.bib1)\]\[[10](https://arxiv.org/html/2606.28662#bib.bib34)\]\[[22](https://arxiv.org/html/2606.28662#bib.bib36)\], stochastic gradient descent \[[34](https://arxiv.org/html/2606.28662#bib.bib31)\]\[[32](https://arxiv.org/html/2606.28662#bib.bib35)\], and skip connections \[[19](https://arxiv.org/html/2606.28662#bib.bib24)\]. As these examples show, the pursuit of flat minima is considered an important perspective for constructing deep learning models with high generalization capabilities.
### II-B Numerical Approaches to Eigenspectrum Analysis
Sharpness is evaluated through the quadratic term obtained when the loss function is Taylor-expanded around a critical point \[[20](https://arxiv.org/html/2606.28662#bib.bib40)\], because the eigenvalues of the Hessian matrix in that term represent the curvature \[[3](https://arxiv.org/html/2606.28662#bib.bib39)\]. While the eigenspectrum consists of multiple eigenvalues, the maximum eigenvalue in particular is used as a key indicator of the curvature of the loss landscape \[[22](https://arxiv.org/html/2606.28662#bib.bib36)\]. The eigenspectrum can be obtained by solving the characteristic equation of the Hessian. Letting $D$ denote the number of parameters of the NN, the Hessian is a matrix of size $D\times D$. When $D\geq 5$, the eigenvalues of the Hessian cannot, in general, be obtained analytically, because a characteristic equation of degree five or higher does not possess a general closed-form solution. Moreover, in modern deep learning, the number of parameters $D$ is exceedingly large. For instance, in the PyTorch implementations \[[31](https://arxiv.org/html/2606.28662#bib.bib4)\], $D\sim 1.38\times 10^{8}$ for VGG16 \[[28](https://arxiv.org/html/2606.28662#bib.bib3)\] and $D\sim 1.17\times 10^{7}$ for ResNet18 \[[14](https://arxiv.org/html/2606.28662#bib.bib2)\]. To determine the eigenspectrum of such a massive Hessian matrix, numerical approximations are employed. Hutchinson's method \[[13](https://arxiv.org/html/2606.28662#bib.bib15)\] and the Lanczos method \[[18](https://arxiv.org/html/2606.28662#bib.bib11)\] are well-known approaches for this purpose. Hutchinson's method estimates the Hessian trace, whereas the Lanczos method estimates the eigenspectrum; together, they make it possible to evaluate sharpness numerically. 
In fact, several studies have proposed methods to compute the eigenspectrum of deep learning models using the Lanczos method\[[10](https://arxiv.org/html/2606.28662#bib.bib34)\]\[[37](https://arxiv.org/html/2606.28662#bib.bib26)\], as well as techniques to calculate the Hessian trace via Hutchinson’s method\[[20](https://arxiv.org/html/2606.28662#bib.bib40)\]\[[8](https://arxiv.org/html/2606.28662#bib.bib38)\]\.
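As a rough illustration of the trace estimation mentioned above, the following Python sketch applies Hutchinson's estimator to a small explicit matrix. The 3x3 matrix, the helper names `matvec` and `hutchinson_trace`, the choice of Rademacher probes, and the probe count are assumptions made for this example and are not taken from the cited works. In a real network the product $\bm{A}\bm{z}$ would be a Hessian-vector product rather than a dense multiplication.

```python
import random


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def hutchinson_trace(apply_matrix, dim, num_probes, rng):
    """Estimate tr(A) as the average of z^T A z over Rademacher probes z.

    apply_matrix: function returning A z, so A itself never has to be formed.
    """
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        az = apply_matrix(z)
        total += sum(zi * azi for zi, azi in zip(z, az))
    return total / num_probes


# Hypothetical 3x3 symmetric matrix standing in for a Hessian; its trace is 9.
toy_hessian = [
    [2.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 1.0, 4.0],
]
rng = random.Random(0)
estimate = hutchinson_trace(lambda z: matvec(toy_hessian, z), 3, 4000, rng)
exact = sum(toy_hessian[i][i] for i in range(3))
print(f"estimate={estimate:.3f}, exact={exact:.1f}")
```

The estimate is stochastic: its variance depends on the off-diagonal entries of the matrix, which is one reason such numerical approaches do not directly reveal which structural factors make the landscape sharp.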
### II-C Analytical Approaches to Eigenspectrum Analysis
On the other hand, the numerical approximation approach suffers from an inherent limitation in that it cannot clarify what causes the loss landscape to become sharp\. To achieve this, it is necessary to express the eigenvalues analytically\. As a pioneering study on the analytical computation of the Hessian, Bishop\[[4](https://arxiv.org/html/2606.28662#bib.bib19)\]proposed an extended backpropagation algorithm that precisely calculates all components of the Hessian matrix for feedforward networks with arbitrary topologies\. However, this study provides a foundational method for computing the Hessian and does not delve into the analytical representation of the eigenvalues themselves\. To express eigenvalues analytically, it is necessary to impose certain structural constraints on the network, making it difficult to achieve this for general networks with arbitrary layer structures\. For this reason, analytical studies have been conducted using simplified network architectures\. For instance, Singh et al\.\[[29](https://arxiv.org/html/2606.28662#bib.bib30)\]derived a closed\-form expression for the rank of the Hessian in networks utilizing linear activations\. Additionally, Wu et al\.\[[35](https://arxiv.org/html/2606.28662#bib.bib29)\]proposed a separation conjecture that approximates the layer\-wise Hessian using Kronecker products, analytically explaining common structures such as the low\-rank properties of the Hessian and the overlapping of eigenspaces among different models\. Furthermore, Singh et al\.\[[30](https://arxiv.org/html/2606.28662#bib.bib23)\]obtained closed\-form representations of the eigenvalues of the loss Hessian in linear networks with identity or ReLU activations\. While obtaining closed\-form representations of eigenvalues is inherently difficult for non\-linear activations, it is possible to derive upper bounds for them\. 
For example, Omae et al\.\[[24](https://arxiv.org/html/2606.28662#bib.bib6)\]derived an upper bound for the maximum eigenvalue under the CE loss in a three\-layer hierarchical NN\. A key advantage of this approach is that it allows the activation function of the hidden layer to be chosen arbitrarily\.
### II-D Optimization Algorithms for Sharpness Reduction
Based on various reports indicating that a sharper critical point of the loss function leads to a larger generalization error, several methods aimed at mitigating sharpness have been devised\. For instance, Yue et al\.\[[38](https://arxiv.org/html/2606.28662#bib.bib12)\]proposed the Sharpness\-Aware Learning Rate Scheduler, which dynamically adjusts the learning rate according to the sharpness of the loss landscape to promote convergence to flat minima\. Liu et al\.\[[20](https://arxiv.org/html/2606.28662#bib.bib40)\]proposed Hessian regularization, which suppresses curvature by regularizing the trace of the Hessian matrix\. Sankar et al\.\[[27](https://arxiv.org/html/2606.28662#bib.bib42)\]proposed Layerwise Hessian Trace Regularization to reduce the Hessian trace of each layer\. Luo et al\.\[[21](https://arxiv.org/html/2606.28662#bib.bib21)\]proposed an eigenvalue regularization method that explicitly suppresses the maximum eigenvalue of the Hessian\. In addition, Sharpness\-Aware Minimization \(SAM\) is widely recognized as a representative approach for suppressing large eigenvalues of the Hessian\[[9](https://arxiv.org/html/2606.28662#bib.bib49)\]\. SAM is an optimization method that avoids sharp local minima by finding a perturbation that maximizes the loss in the vicinity of the parameters and subsequently minimizing this worst\-case loss\. As theoretical advancements of SAM, modified versions have been proposed, such as a variant that overcomes the vulnerability to weight parameter scaling\[[17](https://arxiv.org/html/2606.28662#bib.bib46)\]and another that functions effectively even with imbalanced data\[[40](https://arxiv.org/html/2606.28662#bib.bib48)\]\. Research dedicated to theoretically elucidating the effectiveness of SAM is also progressing\[[1](https://arxiv.org/html/2606.28662#bib.bib43)\]\. 
Beyond theoretical studies, it has been reported that both SAM and Hessian regularization have the effect of enhancing test performance in practical application tasks\[[6](https://arxiv.org/html/2606.28662#bib.bib45)\]\[[12](https://arxiv.org/html/2606.28662#bib.bib44)\]\[[39](https://arxiv.org/html/2606.28662#bib.bib33)\]\.
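The two-step structure of SAM described above (perturb toward higher loss, then descend using the gradient at the perturbed point) can be sketched on a toy problem. This is a minimal sketch, not the reference implementation: the quadratic loss, the names `sam_step` and `toy_grad`, and the values of `lr` and `rho` are assumptions chosen for illustration, and the perturbation uses the common first-order approximation $\rho\,\bm{g}/\|\bm{g}\|$.

```python
import math


def sam_step(params, grad_fn, lr, rho):
    """One SAM-style update: ascend to an approximate worst case, then descend."""
    grad = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(params)  # at a critical point there is no direction to perturb
    # First-order approximation of the loss-maximizing perturbation in an L2 ball.
    perturbed = [p + rho * g / norm for p, g in zip(params, grad)]
    # Descend from the original point using the gradient at the perturbed point.
    sharp_grad = grad_fn(perturbed)
    return [p - lr * g for p, g in zip(params, sharp_grad)]


# Hypothetical quadratic loss 0.5 * (10 * a^2 + 0.1 * b^2): sharp in a, flat in b.
curvatures = (10.0, 0.1)


def toy_grad(params):
    return [c * p for c, p in zip(curvatures, params)]


theta = [1.0, 1.0]
theta_next = sam_step(theta, toy_grad, lr=0.05, rho=0.1)
print(theta_next)
```

On this toy loss the update moves farther along the high-curvature coordinate than plain gradient descent with the same learning rate would, while the flat coordinate is almost unaffected.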
### II-E Originality of This Study
Most of the aforementioned methods adopt an approach that numerically searches for the direction that reduces the sharpness of the loss function and subsequently updates the parameters in that direction\. In other words, they can be regarded as techniques for estimating the direction toward flat minima through numerical computation\. These methods are applicable to large\-scale neural networks and are highly effective for the practical purpose of improving generalization performance\. On the other hand, it is not straightforward to understand how the numerically obtained direction relates to the distribution of training data or the internal parameter structure of the network\. Therefore, an inherent limitation exists in terms of clarifying the underlying mechanisms behind the direction toward flat minima\.
In this study, we adopt an analytical approach to address this issue\. Specifically, we analytically derive the direction toward flat minima based on the eigenvalue statistics of the Hessian matrix\. While analytical methods are generally applicable only to small\-scale models or under specific assumptions, and thus may be inferior to numerical methods in terms of applicability, they offer a distinct advantage\. Specifically, they allow the relationship between sharpness reduction and network structure to be described mathematically, which is expected to facilitate a theoretical understanding of neural networks\. Therefore, the numerical and analytical approaches are not in competition with each other, but rather serve different purposes\. The former is useful as a practical optimization method for large\-scale models, whereas the latter is valuable as a theoretical framework for understanding the behavior of neural networks\. The originality of this study lies in analytically expressing the direction toward flat minima, thereby enabling its theoretical interpretation\.
Figure 1: Three-layer hierarchical NN analyzed in this study.
## III Overview of the Previous Model
In this study, we adopt the denominator layout for the arrangement of derivatives within matrices and vectors\. For further details, please refer to Appendix[A\-A](https://arxiv.org/html/2606.28662#A1.SS1)\. Additionally, this work is a continuation of Omae et al\.\[[24](https://arxiv.org/html/2606.28662#bib.bib6)\], and the definitions of all variables and functions are identical to those in the previous study\. Therefore, we provide a brief overview here\.
### III-A Model Assumptions and Loss Function
The target model considered in this study is a three-layer hierarchical NN for binary classification. Given an input $\bm{x}$, the estimated probability $p$ is computed as

$$\bm{y}=\bm{W}\bm{h}(\bm{x}),\ \bm{r}=\bm{f}(\bm{y}),\ z=\bm{V}\bm{h}(\bm{r}),\ p=s(z),$$

where $s$ denotes the sigmoid function, which outputs the estimated probability. Specifically, an input is classified into class 1 if $p\geq 0.5$, and into class 0 if $p<0.5$. Here, $\bm{x}\in\mathbb{R}^{M}$, $\bm{y}\in\mathbb{R}^{N}$, and $\bm{r}\in\mathbb{R}^{N}$, where $M$ and $N$ denote the dimensionalities of the input and the hidden layer, respectively. Note that $\bm{f}(\bm{y})$ represents the activation function of the hidden layer; for further details, refer to Eq. ([64](https://arxiv.org/html/2606.28662#A1.E64)). We also note that various expressions for the activation functions used in this paper are summarized in Appendix [A-B](https://arxiv.org/html/2606.28662#A1.SS2).
The function $\bm{h}$ prepends a 1 to the 0-th dimension of the input vector. Specifically, $\bm{h}(\bm{x})=[1\quad\bm{x}^{\top}]^{\top}\in\mathbb{R}^{M+1}$ and $\bm{h}(\bm{r})=[1\quad\bm{r}^{\top}]^{\top}\in\mathbb{R}^{N+1}$. The matrices $\bm{W}$ and $\bm{V}$ represent the affine mapping parameters, defined as $\bm{W}=[\bm{b}\quad\widetilde{\bm{W}}]\in\mathbb{R}^{N\times(M+1)}$ and $\bm{V}=[c\quad\widetilde{\bm{V}}]\in\mathbb{R}^{1\times(N+1)}$. Here, $\widetilde{\bm{W}}\in\mathbb{R}^{N\times M}$ and $\widetilde{\bm{V}}\in\mathbb{R}^{1\times N}$ are the weight parameters, whereas $\bm{b}\in\mathbb{R}^{N}$ and $c\in\mathbb{R}$ correspond to the bias parameters. The network architecture using this notation is illustrated in Fig. [1](https://arxiv.org/html/2606.28662#S2.F1). As can be seen, the object of analysis in this study is a fundamental NN with a single hidden layer.
Letting $\bm{w}$ and $\bm{v}$ denote the vectors obtained by vertically concatenating the column vectors of $\bm{W}$ and $\bm{V}$, respectively, the vector $\bm{\theta}$ formed by vertically stacking these vectors constitutes the full parameter set of the NN. That is,

$$\bm{\theta}=\begin{bmatrix}\bm{w}\\ \bm{v}\end{bmatrix}\in\mathbb{R}^{D},\ \bm{w}\in\mathbb{R}^{(M+1)N},\ \bm{v}\in\mathbb{R}^{N+1}.$$

Note that in this paper, the set of real matrices with $D$ rows and 1 column, denoted as $\mathbb{R}^{D\times 1}$, is abbreviated as $\mathbb{R}^{D}$. In addition, the total number of dimensions is given by $D=MN+2N+1$. For the specific arrangement of the elements in $\bm{w}$ and $\bm{v}$, please refer to Eq. (3) in \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\].
Since this model is an NN designed for binary classification, the loss function is given by the binary cross-entropy. Specifically, letting $p_{i}$ denote the estimated probability that the $i$-th data sample belongs to class 1, and $q_{i}\in\{0,1\}$ denote its ground-truth label, the loss function is expressed as

$$L(\bm{\theta})=-\sum_{i=1}^{I}\big(q_{i}\log p_{i}(\bm{\theta})+(1-q_{i})\log(1-p_{i}(\bm{\theta}))\big),$$

where $I$ represents the total number of training data samples.
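The forward pass and loss defined above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the hidden activation is taken to be tanh, and the function names, the row layout of `W`, and the toy data are hypothetical choices made for illustration rather than the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, V, f=math.tanh):
    """Return p = s(z) for one input x.

    W: N rows of length M + 1, each [b_n, W~_n1, ..., W~_nM].
    V: list of length N + 1, [c, V~_1, ..., V~_N].
    f: hidden activation; tanh is an assumption made for this sketch.
    """
    hx = [1.0] + list(x)                                    # h(x) prepends a 1
    y = [sum(w * h for w, h in zip(row, hx)) for row in W]  # y = W h(x)
    r = [f(y_n) for y_n in y]                               # r = f(y)
    hr = [1.0] + r                                          # h(r) prepends a 1
    z = sum(v * h for v, h in zip(V, hr))                   # z = V h(r)
    return sigmoid(z)


def bce_loss(X, q, W, V):
    """Binary cross-entropy summed (not averaged) over the samples."""
    total = 0.0
    for x_i, q_i in zip(X, q):
        p_i = forward(x_i, W, V)
        total -= q_i * math.log(p_i) + (1 - q_i) * math.log(1.0 - p_i)
    return total


def num_params(M, N):
    """D = (M + 1) N + (N + 1) = MN + 2N + 1."""
    return (M + 1) * N + (N + 1)


# Hypothetical setup: M = 2 inputs, N = 3 hidden units, I = 2 samples.
X = [[0.5, -1.0], [1.5, 2.0]]
q = [1, 0]
W = [[0.0, 0.0, 0.0] for _ in range(3)]
V = [0.0, 0.0, 0.0, 0.0]
print(bce_loss(X, q, W, V), num_params(2, 3))
```

With all parameters set to zero, every sample receives $p=0.5$, so the summed loss equals $I\log 2$; this gives a simple sanity check on the sign and the summation convention.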
### III-B Wolkowicz-Styan Upper Bound
While the previous study \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\] denoted the Hessian of the loss $L$ with respect to $\bm{\theta}$ as $\bm{H}_{L}(\bm{\theta},\bm{\theta})$, this paper denotes it as $\bm{H}_{L}(\bm{\theta})$ to save space. Furthermore, let $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{D}$ represent the eigenvalues of $\bm{H}_{L}(\bm{\theta})$. That is, $\lambda_{1}$ and $\lambda_{D}$ correspond to the maximum and minimum eigenvalues, respectively. By applying Eq. (2.3) of Wolkowicz and Styan \[[33](https://arxiv.org/html/2606.28662#bib.bib51)\], the upper bound of the maximum eigenvalue $\lambda_{1}$ of $\bm{H}_{L}(\bm{\theta})$ can be expressed as

$$\lambda_{1}\leq\lambda_{\mathrm{sup}}(\bm{\theta})=\mu(\bm{\theta})+\sqrt{D-1}\,\sigma(\bm{\theta}),\tag{1}$$

$$\mu(\bm{\theta})=\frac{1}{D}\mathrm{tr}(\bm{H}_{L}(\bm{\theta})),\quad\sigma(\bm{\theta})^{2}=\frac{1}{D}\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})-\mu(\bm{\theta})^{2}.\tag{2}$$

Here, $\mu(\bm{\theta})$ and $\sigma(\bm{\theta})^{2}$ represent the mean and variance of the eigenspectrum, respectively. The term $\lambda_{\mathrm{sup}}(\bm{\theta})$ denotes the WS upper bound, which is a function comprising three arguments: $D$, $\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))$, and $\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})$. Omae et al. \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\] obtained the WS upper bound of the NN as a closed-form function by deriving $\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))$ and $\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})$ in closed form under the CE loss in a three-layer hierarchical NN. According to their work \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\], $\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))$ is given by

$$\begin{aligned}\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))&=\sum_{i=1}^{I}s^{\prime}(z_{i})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i})\big)\\&\quad+\sum_{i=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{i})\Big(s^{\prime}(z_{i})\|\bm{F}^{\prime}(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\|^{2}+\delta_{i}\widetilde{\bm{V}}\bm{f}^{\prime\prime}(\bm{y}_{i})\Big).\end{aligned}\tag{3}$$

The terms $s^{\prime}(z_{i})$ and $\delta_{i}$ correspond to $s^{\prime}(z)$ and $\delta$ for the $i$-th data sample, respectively, which are given by Eqs. (52) and (53) in \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\] as

$$s^{\prime}(z)=p(1-p),\ \delta=p-q.$$

Similarly, $\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})$ is given by
$$\begin{aligned}\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})&=\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\phi_{ij}\\&\quad+2\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})\psi_{ij}\\&\quad+\sum_{i=1}^{I}\sum_{j=1}^{I}\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)^{2}\omega_{ij}.\end{aligned}\tag{4}$$

This expression was derived by Omae et al. \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\]. Here, the component $\phi_{ij}$ is defined as

$$\phi_{ij}=\bm{o}_{i}^{\top}\bm{\Phi}_{ij}\bm{o}_{j}\in\mathbb{R},\ \bm{\Phi}_{ij}\in\mathbb{R}^{2\times 2},\tag{5}$$

$$(\bm{\Phi}_{ij})_{11}=\big(\widetilde{\bm{V}}\bm{F}^{\prime}(\bm{y}_{j})\bm{F}^{\prime}(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\big)^{2},\tag{6}$$

$$(\bm{\Phi}_{ij})_{12}=\widetilde{\bm{V}}\bm{F}^{\prime}(\bm{y}_{i})\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}^{\prime\prime}(\bm{y}_{j})\bm{F}^{\prime}(\bm{y}_{i})\widetilde{\bm{V}}^{\top},\tag{7}$$

$$(\bm{\Phi}_{ij})_{21}=\widetilde{\bm{V}}\bm{F}^{\prime}(\bm{y}_{j})\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}^{\prime\prime}(\bm{y}_{i})\bm{F}^{\prime}(\bm{y}_{j})\widetilde{\bm{V}}^{\top},\tag{8}$$

$$(\bm{\Phi}_{ij})_{22}=\widetilde{\bm{V}}\bm{F}^{\prime\prime}(\bm{y}_{i})\bm{F}^{\prime\prime}(\bm{y}_{j})\widetilde{\bm{V}}^{\top}.\tag{9}$$

For the detailed definitions of $\bm{F}^{\prime}(\bm{y})$ and $\bm{F}^{\prime\prime}(\bm{y})$, refer to Eq. ([66](https://arxiv.org/html/2606.28662#A1.E66)). Furthermore, $\bm{o}$ is defined as

$$\bm{o}=\begin{bmatrix}s^{\prime}(z)&\delta\end{bmatrix}^{\top},\tag{10}$$

where $\bm{o}_{i}$ and $\bm{o}_{j}$ correspond to $\bm{o}$ for the $i$-th and $j$-th data samples, respectively. $\psi_{ij}$ is defined as

$$\psi_{ij}=\bm{o}_{i}^{\top}\bm{\Psi}_{ij}\bm{o}_{j}\in\mathbb{R},\ \bm{\Psi}_{ij}\in\mathbb{R}^{2\times 2},\tag{11}$$

$$(\bm{\Psi}_{ij})_{11}=\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)\widetilde{\bm{V}}\bm{F}^{\prime}(\bm{y}_{j})\bm{F}^{\prime}(\bm{y}_{i})\widetilde{\bm{V}}^{\top},\tag{12}$$

$$(\bm{\Psi}_{ij})_{12}=\bm{f}(\bm{y}_{i})^{\top}\bm{F}^{\prime}(\bm{y}_{j})\bm{F}^{\prime}(\bm{y}_{i})\widetilde{\bm{V}}^{\top},\tag{13}$$

$$(\bm{\Psi}_{ij})_{21}=\widetilde{\bm{V}}\bm{F}^{\prime}(\bm{y}_{j})\bm{F}^{\prime}(\bm{y}_{i})\bm{f}(\bm{y}_{j}),\tag{14}$$

$$(\bm{\Psi}_{ij})_{22}=\bm{f}^{\prime}(\bm{y}_{i})^{\top}\bm{f}^{\prime}(\bm{y}_{j}).\tag{15}$$

For $\bm{f}^{\prime}(\bm{y})$, refer to Eq. ([65](https://arxiv.org/html/2606.28662#A1.E65)). $\omega_{ij}$ is defined as

$$\omega_{ij}=\bm{o}_{i}^{\top}\bm{\Omega}_{ij}\bm{o}_{j}\in(0,1/16],\ \bm{\Omega}_{ij}=\begin{bmatrix}1&0\\ 0&0\end{bmatrix}.\tag{16}$$
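To make the closed-form trace in Eq. (3) concrete, the following Python sketch evaluates it for a tiny network and compares it with a finite-difference estimate of the Hessian diagonal. Several choices here are assumptions for illustration only: the activation is tanh, $\bm{F}^{\prime}(\bm{y})$ is taken to be the diagonal matrix of elementwise first derivatives (its definition is in the appendix and is not reproduced here), the parameters are stored row by row rather than in the ordering of [24] (the trace does not depend on the ordering), and all function names and data are hypothetical.

```python
import math

M, N = 2, 2  # hypothetical sizes: two inputs, two hidden units


def unpack(theta):
    """Split a flat parameter list into rows [b_n, W~_n1..W~_nM] and V = [c, V~]."""
    W = [theta[n * (M + 1):(n + 1) * (M + 1)] for n in range(N)]
    V = theta[N * (M + 1):]
    return W, V


def prob(x, W, V):
    """Return (p, y) for one sample, with tanh as the assumed activation."""
    y = [row[0] + sum(w * xm for w, xm in zip(row[1:], x)) for row in W]
    z = V[0] + sum(v * math.tanh(yn) for v, yn in zip(V[1:], y))
    return 1.0 / (1.0 + math.exp(-z)), y


def loss(theta, X, q):
    W, V = unpack(theta)
    total = 0.0
    for x, qi in zip(X, q):
        p, _ = prob(x, W, V)
        total -= qi * math.log(p) + (1 - qi) * math.log(1.0 - p)
    return total


def trace_closed_form(theta, X, q):
    """Eq. (3) with f = tanh, f' = 1 - f^2, f'' = -2 f f'."""
    W, V = unpack(theta)
    v_tilde = V[1:]
    total = 0.0
    for x, qi in zip(X, q):
        p, y = prob(x, W, V)
        s_prime, delta = p * (1.0 - p), p - qi
        f = [math.tanh(yn) for yn in y]
        f1 = [1.0 - fn * fn for fn in f]
        f2 = [-2.0 * fn * d for fn, d in zip(f, f1)]
        # Output-layer block: s'(z) (1 + f^T f).
        total += s_prime * (1.0 + sum(fn * fn for fn in f))
        # Hidden-layer block: (1 + x^T x) (s'(z) ||F' V~^T||^2 + delta V~ f'').
        norm_sq = sum((d * v) ** 2 for d, v in zip(f1, v_tilde))
        curv = sum(v * d2 for v, d2 in zip(v_tilde, f2))
        total += (1.0 + sum(xm * xm for xm in x)) * (s_prime * norm_sq + delta * curv)
    return total


def trace_finite_difference(theta, X, q, eps=1e-3):
    """Sum of central second differences of the loss along each coordinate."""
    base = loss(theta, X, q)
    total = 0.0
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        total += (loss(up, X, q) - 2.0 * base + loss(down, X, q)) / eps ** 2
    return total


X = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]
q = [1, 0, 1]
theta = [0.1, -0.2, 0.3, 0.05, 0.4, -0.1, 0.2, -0.5, 0.7]
print(trace_closed_form(theta, X, q), trace_finite_difference(theta, X, q))
```

The two printed values should agree up to the finite-difference error, which offers a quick way to check an implementation of Eq. (3) before moving on to the more involved double sums in Eq. (4).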
## IV Steepest Descent Direction of the WS Upper Bound
### IV-A Main Theorem
In our previous study \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\], we derived the upper bound of the maximum eigenvalue, $\lambda_{\mathrm{sup}}(\bm{\theta})$, as a closed-form function, as shown in Eq. ([1](https://arxiv.org/html/2606.28662#S3.E1)). As a continuation of that work, this study derives the steepest descent direction of the WS upper bound as a closed-form function. Since this direction corresponds to the negative gradient of the WS upper bound, it is given as follows.
###### Theorem 1\.
$$-\frac{\partial\lambda_{\mathrm{sup}}(\bm{\theta})}{\partial\bm{\theta}}=-\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}-\sqrt{D-1}\,\frac{\partial\sigma(\bm{\theta})}{\partial\bm{\theta}},\tag{17}$$

$$\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}=\frac{1}{D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial\bm{\theta}},\tag{18}$$

$$\frac{\partial\sigma(\bm{\theta})}{\partial\bm{\theta}}=\frac{1}{2\sigma(\bm{\theta})D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{\theta}}-\frac{\mu(\bm{\theta})}{\sigma(\bm{\theta})}\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}.\tag{19}$$
###### Proof\.
See Appendix[B\-A](https://arxiv.org/html/2606.28662#A2.SS1)\. ∎
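As a brief consistency check on Eq. (19), the following sketch assumes that the standard deviation of the eigenspectrum has the usual Wolkowicz-Styan form $\sigma^{2} = \mathrm{tr}(\bm{H}_{L}^{2})/D - \mu^{2}$ and that $\sigma > 0$. The full proof is the one in the appendix; this is only the chain-rule step.

```latex
\begin{aligned}
\sigma(\bm{\theta})^{2} &= \frac{1}{D}\,\mathrm{tr}\big(\bm{H}_{L}(\bm{\theta})^{2}\big) - \mu(\bm{\theta})^{2}, \\
2\,\sigma(\bm{\theta})\,\frac{\partial \sigma(\bm{\theta})}{\partial \bm{\theta}}
  &= \frac{1}{D}\,\frac{\partial\,\mathrm{tr}\big(\bm{H}_{L}(\bm{\theta})^{2}\big)}{\partial \bm{\theta}}
   - 2\,\mu(\bm{\theta})\,\frac{\partial \mu(\bm{\theta})}{\partial \bm{\theta}}, \\
\frac{\partial \sigma(\bm{\theta})}{\partial \bm{\theta}}
  &= \frac{1}{2\,\sigma(\bm{\theta})\,D}\,\frac{\partial\,\mathrm{tr}\big(\bm{H}_{L}(\bm{\theta})^{2}\big)}{\partial \bm{\theta}}
   - \frac{\mu(\bm{\theta})}{\sigma(\bm{\theta})}\,\frac{\partial \mu(\bm{\theta})}{\partial \bm{\theta}}.
\end{aligned}
```

Dividing the second line by $2\sigma(\bm{\theta})$ gives the third, which matches Eq. (19) term by term.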
Once the gradients of the Hessian trace and the squared Hessian trace are known, the closed-form expression for the steepest descent direction of the WS upper bound follows. Furthermore, since $\partial/\partial\bm{\theta}$ consists of the components $\partial/\partial\bm{w}$ and $\partial/\partial\bm{v}$, it suffices to derive $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{w}$, $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{v}$, $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{w}$, and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{v}$.
The main theorem shows that the steepest descent direction of $\lambda_{\mathrm{sup}}(\bm{\theta})$ is composed of the steepest descent directions of the mean $\mu(\bm{\theta})$ and the standard deviation $\sigma(\bm{\theta})$ of the eigenspectrum. That is, moving along $-\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$ reduces the WS upper bound $\lambda_{\mathrm{sup}}(\bm{\theta})$ by shifting the entire eigenspectrum downward and suppressing its variance. The operation that contributes most to reducing both the mean and the standard deviation is bringing the maximum eigenvalue $\lambda_{1}$ closer to the mean. Moving in the steepest descent direction of the WS upper bound can therefore be expected to decrease the maximum eigenvalue $\lambda_{1}$. Consequently, $-\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$ can be regarded as a direction toward flat minima.
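The structure of Theorem 1 can be illustrated with a minimal sketch in plain Python. It assumes the WS upper bound has the form $\lambda_{\mathrm{sup}} = \mu + \sqrt{D-1}\,\sigma$ with $\mu = \mathrm{tr}(\bm{H})/D$ and $\sigma^{2} = \mathrm{tr}(\bm{H}^{2})/D - \mu^{2}$, which is the form consistent with Eqs. (17) to (19). The matrix `hessian(theta)` is a hypothetical 3x3 stand-in rather than the NN loss Hessian, and the two trace gradients are obtained here by finite differences instead of the closed forms derived in this paper.

```python
import math


def hessian(theta):
    """Hypothetical symmetric 3x3 matrix depending on two parameters."""
    a, b = theta
    return [
        [2.0 + a * a, a * b, 0.5],
        [a * b, 1.0 + b * b, a],
        [0.5, a, 3.0 + math.sin(b)],
    ]


def trace(m):
    return sum(m[k][k] for k in range(len(m)))


def trace_sq(m):
    # For a symmetric matrix, tr(H^2) is the sum of all squared entries.
    return sum(v * v for row in m for v in row)


def ws_stats(theta):
    """Return (D, mu, sigma, lambda_sup) under the assumed WS form."""
    h = hessian(theta)
    d = len(h)
    mu = trace(h) / d
    sigma = math.sqrt(trace_sq(h) / d - mu * mu)
    return d, mu, sigma, mu + math.sqrt(d - 1) * sigma


def central_diff(fn, theta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((fn(up) - fn(dn)) / (2.0 * eps))
    return grad


def ws_gradient(theta):
    """Assemble d(lambda_sup)/d(theta) from the two trace gradients."""
    d, mu, sigma, _ = ws_stats(theta)
    g_tr = central_diff(lambda t: trace(hessian(t)), theta)
    g_tr2 = central_diff(lambda t: trace_sq(hessian(t)), theta)
    g_mu = [g / d for g in g_tr]  # Eq. (18)
    g_sigma = [g2 / (2.0 * sigma * d) - (mu / sigma) * gm  # Eq. (19)
               for g2, gm in zip(g_tr2, g_mu)]
    # Gradient of lambda_sup; its negative is the direction in Eq. (17).
    return [gm + math.sqrt(d - 1) * gs for gm, gs in zip(g_mu, g_sigma)]


theta0 = [0.3, -0.7]
assembled = ws_gradient(theta0)
direct = central_diff(lambda t: ws_stats(t)[3], theta0)
```

Here `assembled` combines the trace gradients through Eqs. (17) to (19), while `direct` differentiates the bound itself; the two should agree up to finite-difference error.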
Figure 2: Computation time of $\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$. “Num.” denotes the numerical solution, and “Ana.” denotes the analytical solution. Left: Variation with respect to the dimensionality $D$, with the training data size fixed at $I=200$. Right: Variation with respect to the data size $I$, with the dimensionality fixed at $D=21$. All computations were executed via serial processing on an Apple M2 CPU (clock frequency: 3.49 GHz). The three-point finite difference method was used for numerical differentiation. The activation function of the hidden layer is sigmoid.

Although $\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$ can be computed by numerical differentiation, that approach does not reveal how the direction toward flat minima is characterized. This study therefore derives the analytical solution, which is established in Theorems [2](https://arxiv.org/html/2606.28662#Thmtheorem2), [3](https://arxiv.org/html/2606.28662#Thmtheorem3), [4](https://arxiv.org/html/2606.28662#Thmtheorem4), and [5](https://arxiv.org/html/2606.28662#Thmtheorem5). The analytical solution also offers an advantage in computation time. Fig. [2](https://arxiv.org/html/2606.28662#S4.F2) shows the computation times of the numerical and analytical solutions for $\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$. Fig. [2](https://arxiv.org/html/2606.28662#S4.F2) (left) shows the case where the parameter size $D$ of the NN is increased by increasing the hidden-layer dimensionality $N$ while fixing the input dimensionality at $M=2$. The computation time increases linearly with $D$. The WS upper bound is a function of the traces of the Hessian and the squared Hessian, and the computation time of a trace depends on the number of diagonal components $D$, which is the likely reason for the linear growth. Fig. [2](https://arxiv.org/html/2606.28662#S4.F2) (right) shows the case where the training data size $I$ is increased. As detailed later, $\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$ includes a double sum over $I$, so the computation time is expected to be proportional to $I^{2}$. Although the computation time increases with both $D$ and $I$, the analytical solution is faster than the numerical solution in both cases.
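For reference, the numerical baseline can be sketched as follows, assuming that the "three-point finite difference method" refers to the central stencil $(f(\theta+\epsilon) - f(\theta-\epsilon))/(2\epsilon)$. The function `toy_objective` is a hypothetical stand-in for $\lambda_{\mathrm{sup}}(\bm{\theta})$, and the step size is an illustrative choice. The sketch also counts how many evaluations of the objective one gradient requires.

```python
def three_point_gradient(fn, theta, eps=1e-5):
    """Central-difference gradient; returns (gradient, number of fn calls)."""
    calls = 0
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((fn(up) - fn(dn)) / (2.0 * eps))
        calls += 2  # one forward and one backward evaluation per coordinate
    return grad, calls


def toy_objective(theta):
    # Hypothetical smooth function standing in for lambda_sup(theta).
    return sum((k + 1) * t * t for k, t in enumerate(theta))


# D = 21 parameters, matching the dimensionality used in Fig. 2 (right).
grad, calls = three_point_gradient(toy_objective, [0.5] * 21)
```

Under this stencil, one numerical gradient needs $2D$ full evaluations of the bound, whereas a closed-form gradient is evaluated once; this is one plausible reading of the gap between the two curves, not a claim made by the measurements themselves.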
Figure 3: Comparison between numerical and analytical solutions. “Num.” denotes the numerical solution, and “Ana.” denotes the analytical solution. The three-point finite difference method was used for numerical differentiation. The activation function of the hidden layer is sigmoid.
### IV-B Gradients of the Hessian Trace
Here, we derive $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{w}$ and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{v}$, which are components of the steepest descent direction of the WS upper bound. The gradient of $\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))$ with respect to the affine parameters $\bm{w}$ from the input layer to the hidden layer is given as follows.
###### Theorem 2\.
$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial \bm{w}} = \sum_{i=1}^{I}\bigg(\bm{h}(\bm{x}_{i}) \otimes \sum_{a\in\mathfrak{A}} \bm{\mathcal{W}}_{i}^{a}\bigg). \tag{20}$$
###### Proof\.
See Appendix[B\-B](https://arxiv.org/html/2606.28662#A2.SS2)\. ∎
Here, $a \in \mathfrak{A} := \{\mathrm{I}, \mathrm{II}, \mathrm{III}\}$ and

$$\bm{\mathcal{W}}_{i}^{\mathrm{I}} = (1+\bm{x}_{i}^{\top}\bm{x}_{i})\Big(s''(z_{i})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{F}'(\bm{y}_{i}) + 2s'(z_{i})\,\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})\Big)\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}, \tag{21}$$

$$\bm{\mathcal{W}}_{i}^{\mathrm{II}} = (1+\bm{x}_{i}^{\top}\bm{x}_{i})\Big(s'(z_{i})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i}) + \delta_{i}\bm{F}'''(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\Big), \tag{22}$$

$$\bm{\mathcal{W}}_{i}^{\mathrm{III}} = s''(z_{i})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i})\big)\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top} + 2s'(z_{i})\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{i}). \tag{23}$$

For $\bm{f}''(\bm{y})$ and $\bm{F}'''(\bm{y})$, refer to Eqs. ([65](https://arxiv.org/html/2606.28662#A1.E65)) and ([66](https://arxiv.org/html/2606.28662#A1.E66)). The term $s''(z)$ is the second derivative of the sigmoid function, which is given by

$$s''(z) = (1-2p)(1-p)p.$$

The dimensions of these variables are given by

$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial \bm{w}} \in \mathbb{R}^{(M+1)N}, \quad \bm{\mathcal{W}}_{i}^{\mathrm{I}}, \bm{\mathcal{W}}_{i}^{\mathrm{II}}, \bm{\mathcal{W}}_{i}^{\mathrm{III}} \in \mathbb{R}^{N}, \quad \forall i \in \mathbb{N}_{\leq I}.$$
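The expression for $s''(z)$ above can be checked numerically with a short sketch. It assumes that $p$ denotes the sigmoid output $p = s(z)$ and uses the standard identity $s'(z) = p(1-p)$; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_d1(z):
    # First derivative s'(z) = p (1 - p), with p = s(z).
    p = sigmoid(z)
    return p * (1.0 - p)


def sigmoid_d2(z):
    # Second derivative s''(z) = (1 - 2p)(1 - p) p, as stated in the text.
    p = sigmoid(z)
    return (1.0 - 2.0 * p) * (1.0 - p) * p


def central_diff(fn, z, eps=1e-5):
    return (fn(z + eps) - fn(z - eps)) / (2.0 * eps)


# Each entry is (z, closed-form s''(z), finite difference of s'(z)).
checks = [(z, sigmoid_d2(z), central_diff(sigmoid_d1, z)) for z in (-2.0, 0.0, 1.5)]
```

The closed form changes sign at $z = 0$ (where $p = 1/2$), so the $s''$-weighted terms in Eqs. (21) and (23) flip sign depending on which side of the decision boundary a sample lies.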
The gradient of the Hessian trace with respect to the affine parameters $\bm{v}$ from the hidden layer to the output layer is given as follows.
###### Theorem 3\.
$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial \bm{v}} = \sum_{i=1}^{I}\sum_{a\in\mathfrak{A}} \bm{\mathcal{V}}_{i}^{a}. \tag{24}$$
###### Proof\.
See Appendix[B\-C](https://arxiv.org/html/2606.28662#A2.SS3)\. ∎
Here,

$$\bm{\mathcal{V}}_{i}^{\mathrm{I}} = (1+\bm{x}_{i}^{\top}\bm{x}_{i})\Big(s''(z_{i})\bm{h}(\bm{f}(\bm{y}_{i}))\widetilde{\bm{V}}\bm{F}'(\bm{y}_{i}) + 2s'(z_{i})\bm{F}'_{0}(\bm{y}_{i})\Big)\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}, \tag{25}$$

$$\bm{\mathcal{V}}_{i}^{\mathrm{II}} = (1+\bm{x}_{i}^{\top}\bm{x}_{i})\big(s'(z_{i})\bm{h}(\bm{f}(\bm{y}_{i}))\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i}) + \delta_{i}\bm{f}''_{0}(\bm{y}_{i})\big), \tag{26}$$

$$\bm{\mathcal{V}}_{i}^{\mathrm{III}} = s''(z_{i})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i})\big)\bm{h}(\bm{f}(\bm{y}_{i})). \tag{27}$$

For $\bm{F}'_{0}(\bm{y})$ and $\bm{f}''_{0}(\bm{y})$, refer to Eq. ([67](https://arxiv.org/html/2606.28662#A1.E67)). The dimensions of these variables are given by

$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial \bm{v}}, \bm{\mathcal{V}}_{i}^{\mathrm{I}}, \bm{\mathcal{V}}_{i}^{\mathrm{II}}, \bm{\mathcal{V}}_{i}^{\mathrm{III}} \in \mathbb{R}^{N+1}, \quad \forall i \in \mathbb{N}_{\leq I}.$$

The top two panels of Fig. [3](https://arxiv.org/html/2606.28662#S4.F3) compare the numerical and analytical solutions of $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{w}$ and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))/\partial\bm{v}$. The norms of the differences between the numerical and analytical solutions are both on the order of $10^{-9}$, indicating close agreement. The same trend holds for parameters generated with different random seeds. These results support the correctness of the analytical solutions derived in this study.
### IV-C Gradients of the Squared Hessian Trace
Here, we derive $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{w}$ and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{v}$, which are components of the steepest descent direction of the WS upper bound. The gradient of the squared Hessian trace with respect to the affine parameters $\bm{w}$ from the input layer to the hidden layer is given as follows.
###### Theorem 4\.
$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{w}} = 2\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i}) \otimes \sum_{a\in\mathfrak{B}} \bm{\mathcal{W}}_{ij}^{a}\bigg). \tag{28}$$
###### Proof\.
See Appendix[B\-D](https://arxiv.org/html/2606.28662#A2.SS4)\. ∎
Here,

$$\bm{\mathcal{W}}_{ij}^{a} = \bm{\mathcal{A}}_{ij}^{a} + \bm{\mathcal{B}}_{ij}^{a}, \quad a \in \mathfrak{B} := \{\Phi, \Psi, \Omega\},$$

$$\bm{\mathcal{A}}_{ij}^{\Phi} = (1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\,\bm{G}^{\Phi}_{ij}(\bm{o}_{i}\otimes\bm{o}_{j}), \tag{29}$$

$$\bm{\mathcal{A}}_{ij}^{\Psi} = 2(1+\bm{x}_{i}^{\top}\bm{x}_{j})\,\bm{G}^{\Psi}_{ij}(\bm{o}_{i}\otimes\bm{o}_{j}), \tag{30}$$

$$\bm{\mathcal{A}}_{ij}^{\Omega} = \big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)^{2}s''(z_{i})s'(z_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}, \tag{31}$$

$$\bm{\mathcal{B}}_{ij}^{\Phi} = (1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\bm{s}^{''/'}(z_{i})^{\top}\bm{\Phi}_{ij}\bm{o}_{j}, \tag{32}$$

$$\bm{\mathcal{B}}_{ij}^{\Psi} = 2(1+\bm{x}_{i}^{\top}\bm{x}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\bm{s}^{''/'}(z_{i})^{\top}\bm{\Psi}_{ij}\bm{o}_{j}, \tag{33}$$

$$\bm{\mathcal{B}}_{ij}^{\Omega} = 2s'(z_{i})s'(z_{j})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j}), \tag{34}$$

$$\bm{s}^{''/'}(z) = \begin{bmatrix} s''(z) & s'(z) \end{bmatrix}^{\top}. \tag{35}$$

Here, $\bm{G}^{\Phi}_{ij}$ is given by

$$\bm{G}^{\Phi}_{ij} = \begin{bmatrix} \bm{\mathfrak{a}}^{\Phi}_{ij} & \bm{\mathfrak{b}}^{\Phi}_{ij} & \bm{\mathfrak{c}}^{\Phi}_{ij} & \bm{\mathfrak{d}}^{\Phi}_{ij} \end{bmatrix} \in \mathbb{R}^{N\times 4}, \tag{36}$$

$$\bm{\mathfrak{a}}^{\Phi}_{ij} = 2(\bm{\Phi}_{ij})_{11}^{1/2}\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}, \tag{37}$$

$$\bm{\mathfrak{b}}^{\Phi}_{ij} = 2\bm{F}'(\bm{y}_{i})\bm{F}''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 3\top}, \tag{38}$$

$$\bm{\mathfrak{c}}^{\Phi}_{ij} = \bm{F}'''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})^{2}\widetilde{\bm{V}}^{\odot 3\top}, \tag{39}$$

$$\bm{\mathfrak{d}}^{\Phi}_{ij} = \bm{F}'''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}, \tag{40}$$

and $\bm{G}^{\Psi}_{ij}$ is given by

$$\bm{G}^{\Psi}_{ij} = \begin{bmatrix} \bm{\mathfrak{a}}^{\Psi}_{ij} & \bm{\mathfrak{b}}^{\Psi}_{ij} & \bm{\mathfrak{c}}^{\Psi}_{ij} & \bm{\mathfrak{d}}^{\Psi}_{ij} \end{bmatrix} \in \mathbb{R}^{N\times 4}, \tag{41}$$

$$\bm{\mathfrak{a}}^{\Psi}_{ij} = \Big(\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j})\widetilde{\bm{V}}\bm{F}'(\bm{y}_{i}) + \mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)\Big)\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}, \tag{42}$$

$$\bm{\mathfrak{b}}^{\Psi}_{ij} = \big(\bm{F}'(\bm{y}_{i})^{2} + \bm{F}''(\bm{y}_{i})\bm{F}(\bm{y}_{i})\big)\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}, \tag{43}$$

$$\bm{\mathfrak{c}}^{\Psi}_{ij} = \bm{F}''(\bm{y}_{i})\bm{F}(\bm{y}_{j})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}, \tag{44}$$

$$\bm{\mathfrak{d}}^{\Psi}_{ij} = \bm{F}''(\bm{y}_{i})\bm{f}'(\bm{y}_{j}). \tag{45}$$

The dimensions of these variables are given by

$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{w}} \in \mathbb{R}^{(M+1)N}, \quad \bm{\mathcal{A}}_{ij}^{a}, \bm{\mathcal{B}}_{ij}^{a} \in \mathbb{R}^{N}, \quad \forall a \in \mathfrak{B},$$

$$\bm{\mathfrak{a}}^{b}_{ij}, \bm{\mathfrak{b}}^{b}_{ij}, \bm{\mathfrak{c}}^{b}_{ij}, \bm{\mathfrak{d}}^{b}_{ij} \in \mathbb{R}^{N}, \quad \forall b \in \mathfrak{B}\setminus\{\Omega\}, \quad \forall i,j \in \mathbb{N}_{\leq I}.$$
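The accumulation pattern of Eq. (28) can be sketched as follows. The sketch assumes that $\bm{h}(\bm{x})$ prepends a constant 1 to $\bm{x}$, which is consistent with its dimension $M+1$ but is not restated in this section. The function `placeholder_pair_term` is a hypothetical stand-in for $\sum_{a\in\mathfrak{B}}\bm{\mathcal{W}}_{ij}^{a}$ and does not implement Eqs. (29) to (45).

```python
def kron(u, v):
    """Kronecker product of two vectors, returned as a flat list."""
    return [ui * vj for ui in u for vj in v]


def homogeneous(x):
    # Assumed form of h(x): prepend a constant 1 for the bias term.
    return [1.0] + list(x)


def double_sum_gradient(xs, pair_term):
    """Accumulate 2 * sum_i sum_j kron(h(x_i), pair_term(i, j))."""
    n_terms = 0
    total = None
    for i, x_i in enumerate(xs):
        h_i = homogeneous(x_i)
        for j in range(len(xs)):
            contrib = kron(h_i, pair_term(i, j))
            if total is None:
                total = contrib
            else:
                total = [t + c for t, c in zip(total, contrib)]
            n_terms += 1
    return [2.0 * t for t in total], n_terms


def placeholder_pair_term(i, j):
    # Hypothetical vector of length N = 3; not the formula from the paper.
    return [1.0, float(i), float(j)]


xs = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [1.0, 1.0]]  # I = 4 samples, M = 2
grad_w, n_terms = double_sum_gradient(xs, placeholder_pair_term)
```

With $I = 4$, $M = 2$, and $N = 3$, the loop visits $I^{2} = 16$ pairs and returns a vector of length $(M+1)N = 9$, which mirrors both the stated dimension and the expected $I^{2}$ scaling of the computation time.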
The gradient of the squared Hessian trace with respect to the affine parameters $\bm{v}$ from the hidden layer to the output layer is given as follows.
###### Theorem 5\.
$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{v}} = 2\sum_{i=1}^{I}\sum_{j=1}^{I}\sum_{a\in\mathfrak{B}} \bm{\mathcal{V}}_{ij}^{a}. \tag{46}$$
###### Proof\.
See Appendix[B\-E](https://arxiv.org/html/2606.28662#A2.SS5)\. ∎
Here,

$$\bm{\mathcal{V}}_{ij}^{a} = \bm{\mathcal{C}}_{ij}^{a} + \bm{\mathcal{D}}_{ij}^{a}, \quad a \in \mathfrak{B},$$

$$\bm{\mathcal{C}}_{ij}^{\Phi} = (1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\,\bm{K}_{ij}^{\Phi}(\bm{o}_{i}\otimes\bm{o}_{j}), \tag{47}$$

$$\bm{\mathcal{C}}_{ij}^{\Psi} = 2(1+\bm{x}_{i}^{\top}\bm{x}_{j})\,\bm{K}_{ij}^{\Psi}(\bm{o}_{i}\otimes\bm{o}_{j}), \tag{48}$$

$$\bm{\mathcal{C}}_{ij}^{\Omega} = s''(z_{i})s'(z_{j})\big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)^{2}\bm{h}(\bm{f}(\bm{y}_{i})), \tag{49}$$

$$\bm{\mathcal{D}}_{ij}^{\Phi} = (1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\big(\bm{h}(\bm{f}(\bm{y}_{i}))\,\bm{s}^{''/'}(z_{i})^{\top}\bm{\Phi}_{ij}\bm{o}_{j}\big), \tag{50}$$

$$\bm{\mathcal{D}}_{ij}^{\Psi} = 2(1+\bm{x}_{i}^{\top}\bm{x}_{j})\big(\bm{h}(\bm{f}(\bm{y}_{i}))\,\bm{s}^{''/'}(z_{i})^{\top}\bm{\Psi}_{ij}\bm{o}_{j}\big), \tag{51}$$

$$\bm{\mathcal{D}}_{ij}^{\Omega} = \bm{0}_{N+1}. \tag{52}$$

Here, $\bm{0}_{N+1}$ denotes an $(N+1)$-dimensional zero vector. $\bm{K}_{ij}^{\Phi}$ is given by

$$\bm{K}_{ij}^{\Phi} = \begin{bmatrix} \bm{\mathfrak{e}}^{\Phi}_{ij} & \bm{\mathfrak{f}}^{\Phi}_{ij} & \bm{0}_{N+1} & \bm{\mathfrak{g}}^{\Phi}_{ij} \end{bmatrix} \in \mathbb{R}^{(N+1)\times 4}, \tag{53}$$

$$\bm{\mathfrak{e}}^{\Phi}_{ij} = 2(\bm{\Phi}_{ij})_{11}^{1/2}\,\bm{V}^{\top}\odot\bm{f}'_{0}(\bm{y}_{j})\odot\bm{f}'_{0}(\bm{y}_{i}), \tag{54}$$

$$\bm{\mathfrak{f}}^{\Phi}_{ij} = 3\bm{V}^{\odot 2\top}\odot\bm{f}'_{0}(\bm{y}_{i})^{\odot 2}\odot\bm{f}''_{0}(\bm{y}_{j}), \tag{55}$$

$$\bm{\mathfrak{g}}^{\Phi}_{ij} = \bm{V}^{\top}\odot\bm{f}''_{0}(\bm{y}_{i})\odot\bm{f}''_{0}(\bm{y}_{j}), \tag{56}$$

and $\bm{K}_{ij}^{\Psi}$ is given by

$$\bm{K}_{ij}^{\Psi} = \begin{bmatrix} \bm{\mathfrak{e}}^{\Psi}_{ij} & \bm{\mathfrak{f}}^{\Psi}_{ij} & \bm{0}_{N+1} & \bm{0}_{N+1} \end{bmatrix} \in \mathbb{R}^{(N+1)\times 4}, \tag{57}$$

$$\bm{\mathfrak{e}}^{\Psi}_{ij} = \big(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\big)\bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}, \tag{58}$$

$$\bm{\mathfrak{f}}^{\Psi}_{ij} = \bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i}). \tag{59}$$

The dimensions of these variables are given by

$$\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{v}}, \bm{\mathcal{C}}_{ij}^{a}, \bm{\mathcal{D}}_{ij}^{a}, \bm{\mathfrak{e}}^{b}_{ij}, \bm{\mathfrak{f}}^{b}_{ij}, \bm{\mathfrak{g}}^{\Phi}_{ij} \in \mathbb{R}^{(N+1)\times 1}, \quad \forall a \in \mathfrak{B}, \quad \forall b \in \mathfrak{B}\setminus\{\Omega\}, \quad \forall i,j \in \mathbb{N}_{\leq I}.$$

The bottom two panels of Fig. [3](https://arxiv.org/html/2606.28662#S4.F3) compare the numerical and analytical solutions of $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{w}$ and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{v}$. The norms of the differences between the numerical and analytical solutions are both on the order of $10^{-9}$, indicating close agreement. The same trend holds for parameters generated with different random seeds. These results support the correctness of the analytical solutions derived in this study.
Figure 4: Relationship between $|\delta_{i}|$ and gradient norms. A single NN trained in the experiment in Section [VI](https://arxiv.org/html/2606.28662#S6) (described later) was used. Since $I=50$, there are 50 data points. The bottom two panels show the results with $j$ fixed. The activation function of the hidden layer is sigmoid.
## V Discussion
### V-A Influence of Individual Data Samples on the Gradient Norm
All four gradients derived in this study possess the following property\.
###### Proposition 1\.
$$\begin{aligned} p_{i} \rightarrow q_{i} \Rightarrow{} & \bigg(\bm{h}(\bm{x}_{i})\otimes\sum_{a\in\mathfrak{A}}\bm{\mathcal{W}}_{i}^{a} \rightarrow \bm{0}_{(M+1)N}\bigg) \land \bigg(\sum_{a\in\mathfrak{A}}\bm{\mathcal{V}}_{i}^{a} \rightarrow \bm{0}_{N+1}\bigg) \\ & \land \bigg(\bm{h}(\bm{x}_{i})\otimes\sum_{a\in\mathfrak{B}}\bm{\mathcal{W}}_{ij}^{a} \rightarrow \bm{0}_{(M+1)N}\bigg) \land \bigg(\sum_{a\in\mathfrak{B}}\bm{\mathcal{V}}_{ij}^{a} \rightarrow \bm{0}_{N+1}\bigg). \end{aligned} \tag{60}$$
###### Proof\.
See Appendix[B\-F](https://arxiv.org/html/2606.28662#A2.SS6)\. ∎
From Eq. (53) of the predecessor study\[[24](https://arxiv.org/html/2606.28662#bib.bib6)\], $|\delta|$ represents the degree of proximity between $p$ and $q$. This proposition therefore asserts that when the NN's estimate $p$ for a given training data sample is sufficiently close to the ground-truth label $q$, so that $|\delta|$ is sufficiently small, that sample does not affect the gradient of the WS upper bound. To verify whether this phenomenon actually occurs, we took a single NN trained in the experiments described later in Section [VI](https://arxiv.org/html/2606.28662#S6) and drew a scatter plot of the relationship between $|\delta|$ and the gradient norm of the WS upper bound. The result is shown in Fig. [4](https://arxiv.org/html/2606.28662#S4.F4). The closer $|\delta|$ is to 0, that is, the closer $p$ is to $q$, the smaller the gradient norm becomes. This observation indicates that data samples that are correctly estimated with high confidence contribute negligibly to the direction toward a flat minimum. As a practical advantage, training data samples whose $|\delta|$ is close to 0 could conceivably be removed when computing the direction toward a flat minimum.
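The pruning idea suggested above can be sketched as follows, assuming $\delta_{i} = p_{i} - q_{i}$, which is consistent with $|\delta|$ measuring the proximity of $p$ and $q$. The threshold `tol` and the example values are hypothetical; the paper does not prescribe a cutoff, and dropping samples makes the resulting direction an approximation.

```python
def active_indices(p, q, tol=1e-3):
    """Return indices whose |delta_i| = |p_i - q_i| exceeds tol."""
    return [i for i, (p_i, q_i) in enumerate(zip(p, q)) if abs(p_i - q_i) > tol]


p = [0.9995, 0.62, 0.0002, 0.35, 0.9999]  # hypothetical network outputs
q = [1, 1, 0, 0, 1]                       # ground-truth labels
kept = active_indices(p, q)
```

Only the samples in `kept` would then enter the sums over $i$ in Eqs. (20), (24), (28), and (46). Whether the index $j$ can be pruned in the same way is not covered by Proposition 1, so this sketch leaves that choice open.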
Figure 5: Relationship between the covariance of the bivariate Gaussian distribution generating the input data $\bm{x}$ and the gradient of the squared Hessian trace. The left panel represents the case of linear activation, and the right panel represents the case of sigmoid activation. The values are normalized to a range of 0 to 1 to clearly illustrate the scale of variation. Solid lines represent the median, and shaded areas represent the IQR.
### V-B Inner Product of Input Data and the Norm of the Steepest Descent Direction
Inspecting $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{w}$ and $\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{v}$ shows that the inner product between input data samples, $\bm{x}_{i}^{\top}\bm{x}_{j}$, appears in terms such as $\bm{\mathcal{A}}_{ij}^{\Phi}$ and $\bm{\mathcal{B}}_{ij}^{\Phi}$ in Eqs. ([29](https://arxiv.org/html/2606.28662#S4.E29)) and ([32](https://arxiv.org/html/2606.28662#S4.E32)), as well as $\bm{\mathcal{C}}_{ij}^{\Phi}$ and $\bm{\mathcal{D}}_{ij}^{\Phi}$ in Eqs. ([47](https://arxiv.org/html/2606.28662#S4.E47)) and ([50](https://arxiv.org/html/2606.28662#S4.E50)). The directional similarity among training data samples is therefore expected to affect the scaling factor of the gradient of the squared Hessian trace. To verify this, we observed how the gradient norm of the squared Hessian trace varies with the covariance of the Gaussian distribution used to generate the training data $\bm{x}$. The training data samples were centered at the origin of a two-dimensional space, where a sample was assigned to class 0 if $x_{1}<0$ and to class 1 if $x_{1}\geq 0$. The structure of the NN and the size of the training dataset were configured as $(M,N,I)=(2,3,100)$. The results are shown in Fig. [5](https://arxiv.org/html/2606.28662#S5.F5). The left panel represents the case with linear activation, while the right panel represents the case with sigmoid activation. Training data with a higher inner product lead to a larger gradient norm of the squared Hessian trace. In light of Eq. ([2](https://arxiv.org/html/2606.28662#S3.E2)), this implies that pairs of similar training data samples exert a strong influence in the direction that reduces the standard deviation of the eigenspectrum. However, the scaling effect due to the inner product is large for linear activation and small for sigmoid activation. For saturating activation functions, the influence exerted by the inner product of the training data appears to diminish.
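A minimal sketch of the data setup described above is given below. The covariance parameterization (unit variances with correlation `rho`) is an assumption, since the exact covariance sweep is not restated here, and all function names are hypothetical. The quantity computed is only the prefactor $(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}$ that multiplies the $\Phi$ terms in Eqs. (29), (32), (47), and (50), not the gradient norm itself.

```python
import math
import random


def sample_inputs(n, rho, seed=0):
    """Draw n points from a zero-mean bivariate Gaussian with unit variances
    and correlation rho (an assumed parameterization of the covariance)."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return xs


def labels_from_x1(xs):
    # Class 0 if x1 < 0 and class 1 otherwise, as described in the text.
    return [0 if x[0] < 0 else 1 for x in xs]


def mean_pair_factor(xs):
    """Average of (1 + x_i^T x_j)^2 over all ordered pairs (i, j)."""
    total = 0.0
    for xi in xs:
        for xj in xs:
            total += (1.0 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2
    return total / (len(xs) ** 2)


xs = sample_inputs(100, rho=0.8)
qs = labels_from_x1(xs)
factor = mean_pair_factor(xs)
```

Repeating the last three lines for several values of `rho` gives a rough picture of how the pairwise prefactor changes with the covariance, which can then be compared with the trend in Fig. 5.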
Figure 6: Comparison of gradient norms between the mean and standard deviation terms of $\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$. Calculated using 100 randomly generated NN patterns under the condition of $(I,M,N)=(50,2,3)$. Asterisks indicate p-values from the two-sided Wilcoxon signed-rank test (\*\*: 0.1% level, \*: 1% level). Boxes represent the IQR, and whiskers indicate the $\pm 1.5\times\mathrm{IQR}$ range. The activation function of the hidden layer is sigmoid.
### V-C Dominant Terms for the Direction Toward Flat Minima
From Eq. ([17](https://arxiv.org/html/2606.28662#S4.E17)), the direction toward flat minima, $-\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$, can be decomposed into terms that reduce the mean and the standard deviation of the eigenspectrum, which are given by

$$-\frac{\partial \mu(\bm{\theta})}{\partial \bm{\theta}}, \quad -\sqrt{D-1}\,\frac{\partial \sigma(\bm{\theta})}{\partial \bm{\theta}}.$$

From Eqs. ([20](https://arxiv.org/html/2606.28662#S4.E20)) and ([24](https://arxiv.org/html/2606.28662#S4.E24)), $\partial\mu(\bm{\theta})/\partial\bm{\theta}$ is expressed by a single summation over the training dataset size $I$, whereas from Eqs. ([28](https://arxiv.org/html/2606.28662#S4.E28)) and ([46](https://arxiv.org/html/2606.28662#S4.E46)), $\partial\sigma(\bm{\theta})/\partial\bm{\theta}$ is expressed by a double summation over $I$. Furthermore, in Eq. ([17](https://arxiv.org/html/2606.28662#S4.E17)) the term $\partial\sigma(\bm{\theta})/\partial\bm{\theta}$ is multiplied by $\sqrt{D-1}$, a factor that grows with the NN parameter size $D$. The direction toward flat minima, $-\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$, is therefore expected to be influenced more strongly by the direction that reduces the standard deviation of the eigenspectrum than by the direction that reduces its mean. To test this hypothesis, we calculated the norms of $\partial\mu(\bm{\theta})/\partial\bm{\theta}$ and $\sqrt{D-1}\,\partial\sigma(\bm{\theta})/\partial\bm{\theta}$ using multiple randomly generated initial parameters. The results are shown in Fig. [6](https://arxiv.org/html/2606.28662#S5.F6). As expected, the norm of $\sqrt{D-1}\,\partial\sigma(\bm{\theta})/\partial\bm{\theta}$ is larger than that of $\partial\mu(\bm{\theta})/\partial\bm{\theta}$.
Figure 7: Top: Training dynamics of the WS upper bound and top eigenvalues. Middle: Training dynamics of the WS upper and lower bounds, maximum eigenvalue, and minimum eigenvalue. Bottom: Details of the eigenspectrum during training. Here, $(\gamma_{1},\gamma_{2})=(0,0.01)$, which represents the case where only the WS upper bound is reduced without decreasing the loss. The activation function of the hidden layer is sigmoid.
### V-D HSR Regularization
Although moving the NN parameters in the direction of $-\partial\lambda_{\mathrm{sup}}(\bm{\theta})/\partial\bm{\theta}$ is expected to decrease the WS upper bound and consequently reduce the eigenvalues, it is practically necessary to simultaneously reduce the loss $L(\bm{\theta})$. Therefore, in this study, we refer to $\lambda_{\mathrm{sup}}(\bm{\theta})$ as the HSR regularization term, and we propose a method termed “HSR Regularization” that minimizes the weighted sum of $L(\bm{\theta})$ and $\lambda_{\mathrm{sup}}(\bm{\theta})$. That is, the update equation is given by

$$\bm{\theta}^{(t+1)}=\bm{\theta}^{(t)}-\bigg(\gamma_{1}\left.\frac{\partial L(\bm{\theta})}{\partial\bm{\theta}}\right|_{\bm{\theta}=\bm{\theta}^{(t)}}+\gamma_{2}\left.\frac{\partial\lambda_{\mathrm{sup}}(\bm{\theta})}{\partial\bm{\theta}}\right|_{\bm{\theta}=\bm{\theta}^{(t)}}\bigg),$$

where $\gamma_{1}$ is the weight for the loss reduction and $\gamma_{2}$ is the weight for the reduction of the WS upper bound.
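The update rule can be illustrated with a minimal sketch on a toy problem. Everything here is an assumption made for illustration: the toy loss, the matrix `A`, and the function names are hypothetical, and a central-difference gradient stands in for the closed-form gradient of the WS upper bound derived in the paper. The bound itself is evaluated as $\mu+\sqrt{D-1}\,\sigma$ from the trace of the Hessian and the trace of its square.

```python
import numpy as np

# Hypothetical toy loss: L(theta) = sum(theta**4)/4 + 0.5 * theta^T A theta.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.1]])

def grad_loss(theta):
    """Analytic gradient of the toy loss (A is symmetric)."""
    return theta**3 + A @ theta

def hessian(theta):
    """Analytic Hessian of the toy loss."""
    return np.diag(3.0 * theta**2) + A

def ws_upper(theta):
    """WS upper bound mu + sqrt(D-1)*sigma computed from Hessian traces."""
    H = hessian(theta)
    D = H.shape[0]
    mu = np.trace(H) / D
    sigma = np.sqrt(max(np.trace(H @ H) / D - mu**2, 0.0))
    return mu + np.sqrt(D - 1) * sigma

def numerical_grad(fn, theta, eps=1e-6):
    """Central-difference gradient; a stand-in for the closed-form gradient."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (fn(theta + step) - fn(theta - step)) / (2.0 * eps)
    return g

def hsr_step(theta, gamma1, gamma2):
    """One update: theta - (gamma1 * dL/dtheta + gamma2 * dlambda_sup/dtheta)."""
    return theta - (gamma1 * grad_loss(theta)
                    + gamma2 * numerical_grad(ws_upper, theta))

# Reduce only the WS upper bound, i.e. (gamma1, gamma2) = (0, 0.01).
theta = np.array([1.0, -0.5, 2.0])
bound_before = ws_upper(theta)
for _ in range(100):
    theta = hsr_step(theta, gamma1=0.0, gamma2=0.01)
bound_after = ws_upper(theta)
```

Setting `gamma1=0.0` isolates the effect of the regularization term; a nonzero `gamma1` adds the usual loss gradient to the same step.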
To verify whether moving in the steepest descent direction of the WS upper bound decreases both the WS upper bound and the eigenvalues, we ran gradient descent with $\gamma_{1}=0$ and $\gamma_{2}=0.01$. The results are shown in the upper panel of Fig. [7](https://arxiv.org/html/2606.28662#S5.F7). Both the WS upper bound and the eigenvalues decrease as training progresses. HSR regularization therefore lowers the second-order derivatives of the loss function, which is expected to facilitate the attainment of flat minima.
### V-E Effect of Elevating the Wolkowicz-Styan Lower Bound
Wolkowicz et al. \[[33](https://arxiv.org/html/2606.28662#bib.bib51)\] derived not only the upper bound for the maximum eigenvalue but also a closed-form expression for the lower bound of the minimum eigenvalue. By applying Eq. (2.2) in \[[33](https://arxiv.org/html/2606.28662#bib.bib51)\], the lower bound $\lambda_{\mathrm{inf}}(\bm{\theta})$ for the minimum eigenvalue $\lambda_{D}$ is given by

$$\lambda_{D}\geq\lambda_{\mathrm{inf}}(\bm{\theta})=\mu(\bm{\theta})-\sqrt{D-1}\,\sigma(\bm{\theta}).\tag{61}$$

Moreover, Fig. [6](https://arxiv.org/html/2606.28662#S5.F6) indicates that the direction toward flat minima is more easily determined by reducing $\sigma(\bm{\theta})$ rather than $\mu(\bm{\theta})$. As an extreme example of this, let us consider a case where $\mu(\bm{\theta})$ remains unchanged while only $\sigma(\bm{\theta})$ decreases. Under this condition, the following holds true.
###### Proposition 2.

$$\begin{aligned}&\Big(\sigma(\bm{\theta}^{(t+1)})<\sigma(\bm{\theta}^{(t)})\Big)\land\Big(\mu(\bm{\theta}^{(t+1)})=\mu(\bm{\theta}^{(t)})\Big)\Rightarrow\\&\Big(\lambda_{\mathrm{sup}}(\bm{\theta}^{(t+1)})<\lambda_{\mathrm{sup}}(\bm{\theta}^{(t)})\Big)\land\Big(\lambda_{\mathrm{inf}}(\bm{\theta}^{(t+1)})>\lambda_{\mathrm{inf}}(\bm{\theta}^{(t)})\Big).\end{aligned}\tag{62}$$

###### Proof.
See Appendix [B-G](https://arxiv.org/html/2606.28662#A2.SS7). ∎
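The implication can also be checked in two lines. This is only a sketch and not the argument of Appendix B-G: it assumes that the WS upper bound has the form $\lambda_{\mathrm{sup}}(\bm{\theta})=\mu(\bm{\theta})+\sqrt{D-1}\,\sigma(\bm{\theta})$, mirroring Eq. (61), and that $D\geq 2$ so that $\sqrt{D-1}>0$.

$$\begin{aligned}
\lambda_{\mathrm{sup}}(\bm{\theta}^{(t+1)})-\lambda_{\mathrm{sup}}(\bm{\theta}^{(t)})
&=\underbrace{\mu(\bm{\theta}^{(t+1)})-\mu(\bm{\theta}^{(t)})}_{=0}+\sqrt{D-1}\,\big(\sigma(\bm{\theta}^{(t+1)})-\sigma(\bm{\theta}^{(t)})\big)<0,\\
\lambda_{\mathrm{inf}}(\bm{\theta}^{(t+1)})-\lambda_{\mathrm{inf}}(\bm{\theta}^{(t)})
&=\underbrace{\mu(\bm{\theta}^{(t+1)})-\mu(\bm{\theta}^{(t)})}_{=0}-\sqrt{D-1}\,\big(\sigma(\bm{\theta}^{(t+1)})-\sigma(\bm{\theta}^{(t)})\big)>0.
\end{aligned}$$

Under these assumptions the two bounds move toward each other by the same amount, $\sqrt{D-1}\,\lvert\sigma(\bm{\theta}^{(t+1)})-\sigma(\bm{\theta}^{(t)})\rvert$.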
This proposition implies that training that decreases the WS upper bound by reducing $\sigma(\bm{\theta})$, with $\mu(\bm{\theta})$ unchanged, necessarily increases the WS lower bound. To verify whether this phenomenon actually occurs, we performed gradient descent with $\gamma_{1}=0$ and $\gamma_{2}=0.01$ to minimize only the WS upper bound. The results are shown in the middle panel of Fig. [7](https://arxiv.org/html/2606.28662#S5.F7). As training progresses, the WS upper bound decreases while the WS lower bound increases. Concurrently, the maximum eigenvalue approaches 0 from the positive direction, and the minimum eigenvalue approaches 0 from the negative direction. The lower panel of Fig. [7](https://arxiv.org/html/2606.28662#S5.F7) shows the details of the eigenspectrum, confirming that the range of the eigenvalues narrows as training progresses. Generally, an operation that reduces the maximum eigenvalue carries a potential risk of the eigenvalues becoming negative. However, because HSR regularization also increases the WS lower bound, this risk is mitigated. Since HSR regularization can be interpreted as narrowing the range of the eigenspectrum, it is considered to facilitate the attainment of flat minima. From the perspective of the flatness hypothesis, this is a highly desirable property.
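Both WS bounds need only the trace of a matrix and the trace of its square, which the following sketch illustrates on a random symmetric matrix standing in for a loss Hessian. The function name `ws_bounds`, the matrix size, and the seed are illustrative assumptions; the last lines shrink the spread of the spectrum around its mean to mimic the situation of Proposition 2.

```python
import numpy as np

def ws_bounds(H):
    """Return (lambda_inf, lambda_sup) for a real symmetric matrix H."""
    D = H.shape[0]
    mu = np.trace(H) / D                        # mean of the eigenvalues
    var = np.trace(H @ H) / D - mu**2           # variance of the eigenvalues
    radius = np.sqrt(D - 1) * np.sqrt(max(var, 0.0))
    return mu - radius, mu + radius

rng = np.random.default_rng(0)                  # arbitrary seed
B = rng.normal(size=(10, 10))
H = (B + B.T) / 2.0                             # symmetric stand-in for a Hessian
eigvals = np.linalg.eigvalsh(H)                 # ascending: eigvals[0] is lambda_D
lam_inf, lam_sup = ws_bounds(H)

# Keep the mean fixed and halve every deviation from it (sigma is halved).
mu = np.trace(H) / H.shape[0]
H_narrow = mu * np.eye(H.shape[0]) + 0.5 * (H - mu * np.eye(H.shape[0]))
lam_inf_narrow, lam_sup_narrow = ws_bounds(H_narrow)
```

With the mean held fixed, `lam_sup_narrow` falls below `lam_sup` and `lam_inf_narrow` rises above `lam_inf`, which is the narrowing of the spectral range described above.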
Figure 8: Left: Training data (50 samples). Right: Test data (1,000 samples). Both are balanced datasets.
## VI Experiments
### VI-A Determination of Critical Points for Analysis
Experiments were conducted with the objective of verifying whether the steepest descent direction of the WS upper bound derived in this study has the effect of improving generalization\. In this verification, a key factor is whether the attainment of sharp minima can be avoided\. Therefore, we searched for initial parameters that lead to sharp minima using a plain gradient descent method without any modifications\. The specific procedure for this process is described below\.
First, we constructed a three-layer feedforward NN with input layer dimension $M=2$ and hidden layer dimension $N=3$. The activation function of the hidden layer was set as the sigmoid function. Specifically, by using

$$\begin{aligned}
f(y)&=1/(1+\exp(-y)),\\
f^{\prime}(y)&=(1-f(y))f(y),\\
f^{\prime\prime}(y)&=(1-2f(y))(1-f(y))f(y),\\
f^{\prime\prime\prime}(y)&=6(f(y)-a^{+})(f(y)-a^{-})(1-f(y))f(y),\\
a^{+}&=(3+\sqrt{3})/6,\ a^{-}=(3-\sqrt{3})/6
\end{aligned}$$

to construct $\bm{f}(\bm{y})$, $\bm{f}^{(k)}(\bm{y})$, $\bm{F}^{(k)}(\bm{y})$, $\bm{f}^{(k)}_{0}(\bm{y})$, and $\bm{F}^{(k)}_{0}(\bm{y})$ in accordance with Appendix [A-B](https://arxiv.org/html/2606.28662#A1.SS2), both the WS upper bound and its steepest descent direction can be obtained as closed-form functions.
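The derivative formulas above can be checked numerically. The sketch below implements them as written and compares each one against a central difference of the previous derivative; the helper names and the step size `h` are illustrative choices, not part of the paper.

```python
import numpy as np

A_PLUS = (3.0 + np.sqrt(3.0)) / 6.0
A_MINUS = (3.0 - np.sqrt(3.0)) / 6.0

def f(y):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-y))

def f1(y):
    """First derivative: (1 - f) f."""
    s = f(y)
    return (1.0 - s) * s

def f2(y):
    """Second derivative: (1 - 2f)(1 - f) f."""
    s = f(y)
    return (1.0 - 2.0 * s) * (1.0 - s) * s

def f3(y):
    """Third derivative: 6 (f - a+)(f - a-)(1 - f) f."""
    s = f(y)
    return 6.0 * (s - A_PLUS) * (s - A_MINUS) * (1.0 - s) * s

def central_diff(g, y, h=1e-5):
    """Central-difference approximation of g'(y)."""
    return (g(y + h) - g(y - h)) / (2.0 * h)

y = np.linspace(-4.0, 4.0, 9)
err1 = np.max(np.abs(central_diff(f, y) - f1(y)))
err2 = np.max(np.abs(central_diff(f1, y) - f2(y)))
err3 = np.max(np.abs(central_diff(f2, y) - f3(y)))
```

The factor $6(f-a^{+})(f-a^{-})$ expands to $6f^{2}-6f+1$, so `f3` agrees with the more common form $f(1-f)(1-6f+6f^{2})$.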
The task assigned to the NN is a two-class classification problem aimed at estimating which of two Gaussian distributions a given data sample $\bm{x}$ in a two-dimensional space was generated from. Regarding the distribution parameters, Class 0 has a mean of $\begin{bmatrix}1&1\end{bmatrix}^{\top}$, and Class 1 has a mean of $\begin{bmatrix}-1&-1\end{bmatrix}^{\top}$. This specific classification problem has also been utilized in prior research investigating the eigenspectrum of the Hessian \[[26](https://arxiv.org/html/2606.28662#bib.bib28)\]. The variances along both the $x_{1}$ and $x_{2}$ axes were set to 2, and the covariance was set to 0. The training dataset size was configured as $I=50$, and the test dataset size was set to $10^{3}$, with an equal split between Class 0 and Class 1. For reference, a scatter plot generated under these conditions is shown in Fig. [8](https://arxiv.org/html/2606.28662#S5.F8).
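A dataset with these properties could be generated as in the following sketch. The function name and the seed are assumptions; the paper does not state which random number generator or seed was used for the data.

```python
import numpy as np

def make_gaussian_dataset(n_samples, rng, variance=2.0):
    """Balanced two-class data: Class 0 ~ N([1, 1], 2I), Class 1 ~ N([-1, -1], 2I)."""
    half = n_samples // 2
    std = np.sqrt(variance)                 # variance 2 on each axis, zero covariance
    x0 = rng.normal(loc=[1.0, 1.0], scale=std, size=(half, 2))
    x1 = rng.normal(loc=[-1.0, -1.0], scale=std, size=(half, 2))
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])
    return X, y

rng = np.random.default_rng(0)              # arbitrary seed
X_train, y_train = make_gaussian_dataset(50, rng)      # I = 50
X_test, y_test = make_gaussian_dataset(1000, rng)      # 10^3 test samples
```

With a per-axis standard deviation of $\sqrt{2}$ and class means only $2\sqrt{2}$ apart, the two classes overlap substantially, so a training set of 50 samples leaves room for overfitting.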
Figure 9: Left: Comparison between $\lambda_{1}$ and $\lambda_{\mathrm{sup}}(\bm{\theta}^{\sharp})$. Right: Histogram of $\lambda_{\mathrm{sup}}(\bm{\theta}^{\sharp})$. Results are shown for 1,124 unique critical points that satisfy the convergence criteria.

To construct the NNs, the initial parameters $\bm{\theta}=\bm{\theta}^{(0)}$ were generated using 3,000 different random seeds. For the random number generation, a uniform distribution over the open interval $(-10,10)^{D}$ was adopted. The reason for employing a wide range was to observe a diverse set of critical points. For the parameter update, we used a plain gradient descent method given by

$$\bm{\theta}^{(t+1)}=\bm{\theta}^{(t)}-\gamma\frac{\partial L(\bm{\theta})}{\partial\bm{\theta}}\bigg|_{\bm{\theta}=\bm{\theta}^{(t)}},\ t\in\{0,\cdots,T-1\},$$

where $T$ denotes the maximum number of training iterations; we used $T=10^{4}$. When the gradient norm of the loss $L$ fell below 0.1% of its initial gradient norm, it was determined that convergence to a critical point had been achieved, and the training was terminated. Furthermore, since it is inappropriate to include duplicate critical points in the analysis, only one of the closely located critical points was retained. Specifically, duplicate points were defined as those where the distance between critical points was less than $\mathrm{Ave.}-3\times\mathrm{Std.}$, where “$\mathrm{Ave.}$” and “$\mathrm{Std.}$” denote the mean and standard deviation of the Euclidean distances among all critical points. Critical points are denoted by $\bm{\theta}^{\sharp}$.
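The convergence test and the duplicate filter can be sketched as follows. Several details are assumptions not given in the text: the step size `gamma=0.01`, the greedy keep-the-first rule for choosing which duplicate to retain, and the synthetic points used in the demonstration.

```python
import numpy as np

def gradient_descent(grad, theta0, gamma=0.01, T=10_000, rel_tol=1e-3):
    """Plain gradient descent; stop once ||grad|| < 0.1% of the initial norm."""
    theta = np.asarray(theta0, dtype=float).copy()
    g0 = np.linalg.norm(grad(theta))
    for _ in range(T):
        g = grad(theta)
        if np.linalg.norm(g) < rel_tol * g0:
            return theta, True
        theta = theta - gamma * g
    return theta, bool(np.linalg.norm(grad(theta)) < rel_tol * g0)

def drop_duplicates(points):
    """Keep one point from each group closer than Ave. - 3 * Std."""
    P = np.asarray(points, dtype=float)
    n = len(P)
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    pairwise = dist[np.triu_indices(n, k=1)]     # each unordered pair once
    threshold = pairwise.mean() - 3.0 * pairwise.std()
    kept = []
    for i in range(n):                           # greedy: first occurrence wins
        if all(dist[i, j] >= threshold for j in kept):
            kept.append(i)
    return P[kept], threshold

# Toy check of the stopping rule on L(theta) = 0.5 * ||theta||^2.
theta_star, converged = gradient_descent(lambda th: th, np.array([1.0, 2.0]))

# Synthetic critical points: 20 well-separated points plus one near-copy.
base = 10.0 * np.eye(20)
near_copy = base[0].copy()
near_copy[1] += 1e-3
unique_points, thr = drop_duplicates(np.vstack([base, near_copy]))
```

Note that the threshold $\mathrm{Ave.}-3\times\mathrm{Std.}$ can be negative when the pairwise distances are widely spread, in which case this filter removes nothing.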
With this procedure, 2,799 of the 3,000 initial parameters satisfied the convergence condition to a critical point, and 1,124 non-duplicate critical points remained. A scatter plot of the maximum eigenvalues and their upper bounds at these critical points is shown in the left panel of Fig. [9](https://arxiv.org/html/2606.28662#S6.F9). As can be seen from the figure, the WS upper bound is tight with respect to the maximum eigenvalue. Therefore, the WS upper bound can be regarded as a function that evaluates the sharpness of the loss function. Additionally, a histogram of the WS upper bound at the 1,124 critical points is shown in the right panel of Fig. [9](https://arxiv.org/html/2606.28662#S6.F9). This plot confirms that the distribution of the WS upper bound is right-skewed with a long tail. Therefore, the 113 critical points whose WS upper bounds lie at or above the 90th percentile are termed “Sharp Minima” and serve as the primary subject of analysis in this study. For comparison, the 113 critical points whose WS upper bounds lie at or below the 10th percentile are termed “Flat Minima” and are also included in the analysis. The red and blue dashed lines in the right panel of Fig. [9](https://arxiv.org/html/2606.28662#S6.F9) represent these respective thresholds. That is, the analysis in this study targets a total of 226 independent critical points.
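The selection of the two groups can be sketched with percentile thresholds. This is an illustration only: the exact thresholding rule used in the paper is not stated, the function name is hypothetical, and the values below are synthetic placeholders rather than the measured WS upper bounds.

```python
import numpy as np

def split_sharp_flat(ws_upper_values, q=10.0):
    """Indices at or above the (100-q)th percentile and at or below the qth."""
    v = np.asarray(ws_upper_values, dtype=float)
    high = np.percentile(v, 100.0 - q)
    low = np.percentile(v, q)
    sharp_idx = np.flatnonzero(v >= high)
    flat_idx = np.flatnonzero(v <= low)
    return sharp_idx, flat_idx

# Synthetic, right-skewed placeholder values for 1,124 critical points.
values = np.exp(np.linspace(0.0, 3.0, 1124))
sharp_idx, flat_idx = split_sharp_flat(values)
```

For 1,124 distinct values, this rule selects 113 points on each side, which is consistent with the group sizes reported above.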
Figure 10: Comparison of training dynamics for each method. Panel A: $L(\bm{\theta})$ dynamics. Panel B: $\lambda_{1}$ dynamics. Results for all seeds belonging to Sharp Minima Seeds and Flat Minima Seeds are shown.
### VI-B Controllability of the Eigenspectrum
Here, we discuss the capability of HSR regularization to control the eigenspectrum. For convenience, the 113 random seeds that resulted in sharp minima via the plain gradient descent method are referred to as “Sharp Minima Seeds,” while the 113 random seeds that resulted in flat minima are referred to as “Flat Minima Seeds.” If initializing training with parameters $\bm{\theta}^{(0)}$ from the Sharp Minima Seeds can successfully guide the model toward flat minima, it can be interpreted that HSR regularization is effective. However, it is also necessary to verify whether executing training with initial parameters $\bm{\theta}^{(0)}$ from the Flat Minima Seeds induces any adverse effects. Therefore, we applied HSR regularization to both the Sharp Minima Seeds and the Flat Minima Seeds. Furthermore, to evaluate the relative effectiveness of HSR regularization, we also implemented existing methods designed to reduce eigenvalues, namely Hessian Regularization \[[20](https://arxiv.org/html/2606.28662#bib.bib40)\] and Adaptive Sharpness-Aware Minimization \[[17](https://arxiv.org/html/2606.28662#bib.bib46)\] (A-SAM). Although SAM \[[9](https://arxiv.org/html/2606.28662#bib.bib49)\] is available as a more basic approach, it should be noted that we employed its advanced variant, A-SAM \[[17](https://arxiv.org/html/2606.28662#bib.bib46)\], due to previously documented issues regarding parameter scaling in SAM. Hessian Regularization is a method that minimizes the Hessian trace, whereas A-SAM identifies and moves toward a flatter region by introducing perturbations to the parameters.
For all methods, the maximum number of training iterations was set to $T=10^{4}$. However, the operation to reduce the eigenvalues was executed only during the first 5,000 iterations, whereas the operation to minimize only the loss function $L(\bm{\theta})$ was performed during the remaining 5,000 iterations. The reason for this strategy is that we prioritized reducing the eigenvalues in the first half of training and focused on driving the loss function to reach a critical point in the second half. Specifically, during the first 5,000 training iterations, the weight for the loss gradient term was $\gamma_{1}=0.01$, and the weight for HSR regularization was $\gamma_{2}=0.01$. Similarly, the weight for the Hessian trace term in Hessian Regularization was set to 0.01. Additionally, the perturbation radius for A-SAM was configured to 0.1, as a larger perturbation radius failed to yield a noticeable reduction in the maximum eigenvalue. In the subsequent 5,000 training iterations, a plain gradient descent method with a loss gradient term weight of $\gamma_{1}=0.01$ was implemented for all methods. As before, when the gradient norm of the loss $L$ fell below 0.1% of its initial gradient norm, it was determined that convergence to a critical point had been achieved, and the training was terminated.
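The two-phase schedule for HSR regularization can be written as a small helper that returns the pair of weights for a given iteration. The function name and the zero-based iteration index are assumptions made for this sketch.

```python
def hsr_weights(t, switch=5_000, gamma1=0.01, gamma2=0.01):
    """Return (gamma1, gamma2) for zero-based iteration t.

    Iterations 0 .. switch-1 use both the loss gradient and the HSR term;
    from iteration `switch` onward only the loss gradient is used.
    """
    if t < switch:
        return gamma1, gamma2
    return gamma1, 0.0

first_phase = hsr_weights(0)        # both terms active
last_of_first = hsr_weights(4_999)  # still in the first phase
second_phase = hsr_weights(5_000)   # regularization switched off
```

Keeping `gamma1` unchanged across the switch means the second phase reduces to the same plain gradient descent used for the baseline.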
The training dynamics obtained through these procedures are shown in Fig\.[10](https://arxiv.org/html/2606.28662#S6.F10)\. Fig\.[10A](https://arxiv.org/html/2606.28662#S6.F10.sf1)represents the evolution of the loss, and Fig\.[10B](https://arxiv.org/html/2606.28662#S6.F10.sf2)illustrates the evolution of the maximum eigenvalue\. Here, “Baseline” refers to the plain gradient descent method that minimizes only the loss, “Hessian Reg\.” denotes Hessian Regularization, “A\-SAM” represents Adaptive SAM, and “HSR Reg\.” signifies HSR regularization\. Furthermore, the upper panel of each figure displays the results for the Flat Minima Seeds, which are the initial parameters where even the plain gradient descent method reaches flat minima\. Looking at the results for the Flat Minima Seeds, it can be confirmed that both the loss and the eigenvalues exhibit almost identical behavior across all methods\. The lower panel of each figure shows the results for the Sharp Minima Seeds, indicating that the loss decreases across all methods\. On the other hand, it can be observed that the maximum eigenvalue reaches a large value at the critical point under the plain gradient descent method\. In contrast, the three methods that control the eigenvalues finish training while maintaining the maximum eigenvalue at a low value\.
Figure 11: Loss landscapes at critical points for Sharp Minima Seeds (113 critical points). One-dimensional visualization along the direction of the eigenvector $\bm{u}_{1}$ corresponding to the maximum eigenvalue. For visual clarity, the curves are vertically shifted so that the loss at each critical point is aligned at 0.

Fig. [11](https://arxiv.org/html/2606.28662#S6.F11) visualizes the loss function at the converged critical points at the end of training for the Sharp Minima Seeds. Since the loss function is high-dimensional, it cannot be visualized directly. Therefore, we performed the visualization along the direction of the eigenvector $\bm{u}_{1}$ corresponding to the maximum eigenvalue $\lambda_{1}$. From this plot, it can be seen that each method controlling the eigenvalues prefers flatter minima compared to the plain gradient descent method. Among them, it can be confirmed that HSR regularization consistently converges to flat minima.
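A one-dimensional slice of this kind can be computed as in the sketch below, which evaluates the loss along the top eigenvector of the Hessian and shifts the curve so that it is 0 at the critical point. The quadratic toy loss, the function name, and the range of offsets are illustrative assumptions.

```python
import numpy as np

def landscape_along_top_eigvec(loss, hessian, theta, alphas):
    """Return (lambda_1, curve) with curve[i] = L(theta + alpha_i * u_1) - L(theta)."""
    eigvals, eigvecs = np.linalg.eigh(hessian(theta))   # ascending eigenvalues
    u1 = eigvecs[:, -1]                                 # eigenvector of lambda_1
    base = loss(theta)
    curve = np.array([loss(theta + a * u1) - base for a in alphas])
    return eigvals[-1], curve

# Toy loss L(theta) = 0.5 * theta^T A theta with critical point theta = 0.
A = np.diag([4.0, 1.0])
loss = lambda th: 0.5 * th @ A @ th
hessian = lambda th: A

alphas = np.linspace(-1.0, 1.0, 11)
lam1, curve = landscape_along_top_eigvec(loss, hessian, np.zeros(2), alphas)
```

For this quadratic toy loss the slice is $\tfrac{1}{2}\lambda_{1}\alpha^{2}$, so a larger maximum eigenvalue shows up directly as a narrower curve.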
Figure 12: Statistics of the Hessian eigenspectrum at critical points. Panel A: Maximum eigenvalue. Panel B: Average of eigenvalues. Panel C: Standard deviation of eigenvalues. Panel D: Minimum eigenvalue. Asterisks indicate p-values from the two-sided Wilcoxon signed-rank test (\*\*: 0.1% level, \*: 1% level). Boxes represent the IQR, and whiskers indicate the $\pm 1.5\times\mathrm{IQR}$ range.

To quantitatively compare the sharpness of the critical points, we examined several statistics regarding the eigenspectrum at the converged critical points. The results are shown in Fig. [12](https://arxiv.org/html/2606.28662#S6.F12). Fig. [12A](https://arxiv.org/html/2606.28662#S6.F12.sf1) presents a comparison of the maximum eigenvalues. In the case of the Sharp Minima Seeds, it can be seen that all methods successfully reduced the maximum eigenvalue compared to the plain gradient descent method without regularization. Among these, A-SAM and HSR regularization were particularly effective. The reason HSR regularization outperformed Hessian regularization is considered to be that while Hessian regularization relies solely on the Hessian trace, HSR regularization additionally incorporates the squared Hessian trace. In the case of the Flat Minima Seeds, all methods achieved low values. Therefore, no adverse effects of eigenvalue control on the maximum eigenvalue were observed.
Fig\.[12B](https://arxiv.org/html/2606.28662#S6.F12.sf2)and Fig\.[12C](https://arxiv.org/html/2606.28662#S6.F12.sf3)present the mean and standard deviation of the eigenspectrum, respectively\. For these statistics, almost the same trends as the maximum eigenvalue were observed\. In deep learning, it is widely known that the resulting eigenspectrum tends to have only a few top eigenvalues with non\-zero values, while the remaining eigenvalues are close to zero\[[35](https://arxiv.org/html/2606.28662#bib.bib29)\]\[[36](https://arxiv.org/html/2606.28662#bib.bib27)\]\[[25](https://arxiv.org/html/2606.28662#bib.bib25)\]\. In such a situation, the operation to decrease the maximum eigenvalue is equivalent to the operation to lower the mean of the eigenvalues\. Furthermore, lowering the top eigenvalues implies that they approach the mean, which leads to a reduction in the standard deviation of the eigenspectrum\. That is, a low maximum eigenvalue and low values for both the mean and standard deviation of the eigenspectrum are considered to be strongly related properties\.
Figure 13: Loss landscape along the eigenvector directions at the critical point. Here, $\bm{u}_{1}$ and $\bm{u}_{D}$ represent the eigenvectors corresponding to $\lambda_{1}$ and $\lambda_{D}$, respectively. The result is obtained using a random seed where the minimum eigenvalue is located around $-1.5\times\mathrm{IQR}$ in the distribution shown in Fig. [12D](https://arxiv.org/html/2606.28662#S6.F12.sf4). For better visual clarity, the loss function is vertically shifted such that its minimum value is 0 along the $\bm{u}_{1}$ direction, and its maximum value is 0 along the $\bm{u}_{D}$ direction.

Fig. [12D](https://arxiv.org/html/2606.28662#S6.F12.sf4) summarizes the minimum eigenvalues of the critical points reached by each method. From this plot, it can be seen that the minimum eigenvalues are close to 0 in most cases. On the other hand, for the case of A-SAM with the Flat Minima Seeds, the distribution of the minimum eigenvalues tended to be left-skewed with a long tail in the negative direction. Since this suggests the presence of saddle points, we performed a one-dimensional visualization of the loss at the critical point located at the tip of the whisker for A-SAM. For comparison, we also examined the loss shape at the critical point located at the tip of the whisker of the minimum eigenvalue for HSR regularization. The resulting plots are shown in Fig. [13](https://arxiv.org/html/2606.28662#S6.F13). The blue and red lines represent the loss shapes along the directions of the eigenvectors corresponding to the maximum and minimum eigenvalues, respectively. From these results, a clear saddle point was observed for A-SAM. In recent deep learning literature, this may not pose a significant issue because several methods exist to escape from saddle points. However, since the termination of training at a saddle point carries a potential risk of training failure, it cannot be considered a desirable state.
In contrast, no clear saddle point was observed for HSR regularization\. The reason for this is considered to be that HSR regularization controls not only the maximum eigenvalue but also the lower bound of the minimum eigenvalue\.
Figure 14: Macro F1-scores at the critical point. Asterisks indicate statistical significance based on the two-sided Wilcoxon signed-rank test (\*\*: 0.1% level, \*: 1% level). Boxes represent the IQR, and whiskers extend to $\pm 1.5\times\mathrm{IQR}$.
### VI-C Impact on Generalization Performance
In the previous section, it was confirmed that HSR regularization has the effect of reducing the maximum eigenvalue while avoiding convergence to saddle points\. Although this is a desirable property from the perspective of pursuing flat minima, it would be counterproductive if it did not lead to an improvement in generalization performance\. Therefore, in this section, we verify whether HSR regularization has the effect of improving generalization performance\.
Fig. [14](https://arxiv.org/html/2606.28662#S6.F14) summarizes the macro F1 scores at the critical points reached by each method, evaluated using the test dataset. For the Sharp Minima Seeds, the three methods controlling the eigenvalues improved the test performance compared to the plain gradient descent method without regularization. Among these, A-SAM and HSR regularization exhibited a higher improvement than Hessian regularization. These results suggest that HSR regularization can achieve a performance improvement comparable to that of A-SAM. Furthermore, the results for the Flat Minima Seeds suggest that HSR regularization does not induce any adverse effects on initial parameters that already lead to flat minima without any specific modifications.
To investigate the reasons behind the performance improvement of HSR regularization, we visualized the decision boundaries for both the unregularized method and HSR regularization in the case of the Sharp Minima Seeds. The results are shown in Fig. [15](https://arxiv.org/html/2606.28662#S6.F15). The upper panel represents the decision boundaries without regularization, while the lower panel shows those obtained with HSR regularization. Since the results in the same column are derived from the identical initial parameters $\bm{\theta}^{(0)}$, they have a one-to-one correspondence. From these plots, it can be seen that the application of HSR regularization has the effect of simplifying the decision boundaries. This implies the suppression of overfitting, which is a desirable property from the perspective of generalization.
Figure 15: Relationship between decision boundaries and the presence of HSR regularization for Sharp Minima Seeds. Top: Without regularization (baseline). Bottom: With HSR regularization.
## VII Limitations and Future Work
In prior studies, a closed-form expression for the direction toward flat minima under the CE loss in feedforward NNs had not been discovered. Therefore, in this study, we derived the steepest descent direction of the WS upper bound, which represents one such expression, in closed form. Using this direction, we confirmed that the eigenvalues can be controlled, and the proposed method outperformed the existing Hessian regularization in our experiments. Furthermore, it yielded a performance improvement comparable to that of A-SAM. These findings suggest that the closed-form expression of the steepest descent direction of the WS upper bound is valuable for clarifying the relationship between NNs and generalization. However, because the resulting expression is very large, its theoretical analysis remains insufficient at this stage. Accordingly, as one of our future tasks, we intend to conduct a detailed analysis of this closed-form function.
Furthermore, extending this approach toward practical applications is also crucial\. At present, the value of the WS upper bound demonstrated in this study remains limited to the theoretical aspect\. This limitation arises because the steepest descent direction of the WS upper bound can only be applied to a three\-layer feedforward NN designed for solving two\-class classification problems\. As a method intended for NNs capable of addressing a wide variety of tasks, such as multi\-class classification, regression, and segmentation, this current formulation is overly restrictive\. In addition, HSR regularization cannot be applied to NNs with four or more layers\. Therefore, for future work, we plan to devise a methodology that enables the implementation of HSR regularization for arbitrary layer architectures and diverse tasks\.
## Appendix A Preliminaries
### A-A Derivative Layout
In this study, we adopt the denominator layout for arranging derivatives in gradients and Jacobians. That is,

$$\bm{\varkappa}(\theta)\in\mathbb{R}^{D_{1}}\land\theta\in\mathbb{R}\Rightarrow\frac{\partial\bm{\varkappa}(\theta)}{\partial\theta}=\begin{bmatrix}\frac{\partial\varkappa_{1}}{\partial\theta}\cdots\frac{\partial\varkappa_{D_{1}}}{\partial\theta}\end{bmatrix},$$

$$\varkappa(\bm{\theta})\in\mathbb{R}\land\bm{\theta}\in\mathbb{R}^{D_{2}}\Rightarrow\frac{\partial\varkappa(\bm{\theta})}{\partial\bm{\theta}}=\begin{bmatrix}\frac{\partial\varkappa(\bm{\theta})}{\partial\theta_{1}}\cdots\frac{\partial\varkappa(\bm{\theta})}{\partial\theta_{D_{2}}}\end{bmatrix}^{\top}$$

holds. The Jacobian is defined as

$$\begin{aligned}&\bm{\varkappa}(\bm{\theta})\in\mathbb{R}^{\mathcal{D}_{1}}\land\bm{\theta}\in\mathbb{R}^{\mathcal{D}_{2}}\Rightarrow\\&\frac{\partial\bm{\varkappa}(\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial}{\partial\bm{\theta}}\bm{\varkappa}(\bm{\theta})^{\top}=\begin{bmatrix}\frac{\partial\varkappa_{1}}{\partial\theta_{1}}&\cdots&\frac{\partial\varkappa_{D_{1}}}{\partial\theta_{1}}\\ \vdots&\ddots&\vdots\\ \frac{\partial\varkappa_{1}}{\partial\theta_{D_{2}}}&\cdots&\frac{\partial\varkappa_{D_{1}}}{\partial\theta_{D_{2}}}\end{bmatrix}.\end{aligned}$$

In this case, according to Eq. (4) of \[[16](https://arxiv.org/html/2606.28662#bib.bib5)\], the Leibniz rule for the gradient of an inner product is given by

$$\begin{aligned}&\bm{\varkappa}(\bm{\theta}),\bm{\xi}(\bm{\theta})\in\mathbb{R}^{\mathcal{D}_{1}}\land\bm{\theta}\in\mathbb{R}^{\mathcal{D}_{2}}\Rightarrow\\&\frac{\partial\bm{\varkappa}(\bm{\theta})^{\top}\bm{\xi}(\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial\bm{\varkappa}(\bm{\theta})}{\partial\bm{\theta}}\bm{\xi}(\bm{\theta})+\frac{\partial\bm{\xi}(\bm{\theta})}{\partial\bm{\theta}}\bm{\varkappa}(\bm{\theta})\in\mathbb{R}^{\mathcal{D}_{2}}.\end{aligned}\tag{63}$$
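The layout convention and Eq. (63) can be checked numerically. In the sketch below, the Jacobian is stored with one row per parameter and one column per output component, matching the denominator layout above. The test functions `kappa` and `xi`, the evaluation point, and the finite-difference step are arbitrary illustrative choices.

```python
import numpy as np

def jacobian_denominator_layout(func, theta, eps=1e-6):
    """Central-difference Jacobian with J[i, j] = d func_j / d theta_i."""
    theta = np.asarray(theta, dtype=float)
    out_dim = func(theta).shape[0]
    J = np.zeros((theta.size, out_dim))          # shape (D2, D1)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        J[i] = (func(theta + step) - func(theta - step)) / (2.0 * eps)
    return J

def kappa(theta):
    return np.array([theta[0] * theta[1], np.sin(theta[1]), theta[0] ** 2])

def xi(theta):
    return np.array([np.exp(theta[0]), theta[1] ** 3, theta[0] + theta[1]])

theta = np.array([0.3, -0.7])

# Left-hand side: gradient of the scalar inner product kappa^T xi.
inner = lambda th: np.array([kappa(th) @ xi(th)])
lhs = jacobian_denominator_layout(inner, theta)[:, 0]

# Right-hand side of Eq. (63): (d kappa/d theta) xi + (d xi/d theta) kappa.
J_kappa = jacobian_denominator_layout(kappa, theta)
J_xi = jacobian_denominator_layout(xi, theta)
rhs = J_kappa @ xi(theta) + J_xi @ kappa(theta)
```

Because the Jacobian has shape $(D_{2}, D_{1})$ in this layout, the products in Eq. (63) need no transposes.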
### A-B Activation Functions
Let $f(y)\in\mathbb{R}$ be an activation function for a scalar value $y\in\mathbb{R}$. For an intermediate-layer data vector $\bm{y}\in\mathbb{R}^{N}$, we define the vector obtained by applying the activation function element-wise as

$$\bm{f}(\bm{y})=\begin{bmatrix}f(y_{1})\ \cdots\ f(y_{N})\end{bmatrix}^{\top}\in\mathbb{R}^{N}.\tag{64}$$

Let $f^{(k)}(y)$ denote the $k$-th derivative of $f(y)$. We define the vector composed of these derivatives as

$$\bm{f}^{(k)}(\bm{y})=\begin{bmatrix}f^{(k)}(y_{1})\ \cdots\ f^{(k)}(y_{N})\end{bmatrix}^{\top}\in\mathbb{R}^{N},\tag{65}$$

where $k\in\mathbb{N}$. We also define the Jacobian matrix with respect to the input $\bm{y}$ as

$$\bm{F}^{(k)}(\bm{y})=\frac{\partial\bm{f}^{(k-1)}(\bm{y})}{\partial\bm{y}}=\mathrm{diag}(\bm{f}^{(k)}(\bm{y})),\tag{66}$$

where $\bm{f}^{(0)}=\bm{f}$. We further define the vector and matrix obtained by prepending a zero element or a zero row to the gradient and Jacobian, respectively, as

$$\bm{f}^{(k)}_{0}(\bm{y})=\begin{bmatrix}0\\ \bm{f}^{(k)}(\bm{y})\end{bmatrix},\ \bm{F}^{(k)}_{0}(\bm{y})=\begin{bmatrix}\bm{0}_{N}^{\top}\\ \bm{F}^{(k)}(\bm{y})\end{bmatrix},\tag{67}$$

where $\bm{0}_{N}$ denotes the $N$-dimensional zero vector. In this paper, however, only the cases $k\in\{1,2,3\}$ are considered. Therefore, the value of $k$ is indicated using prime notation. Specifically, $f^{(1)}=f^{\prime}$, $f^{(2)}=f^{\prime\prime}$, and $f^{(3)}=f^{\prime\prime\prime}$. Omae et al. \[[24](https://arxiv.org/html/2606.28662#bib.bib6)\] summarize the first-, second-, and third-order derivatives of several activation functions, including the linear, sigmoid, tanh, SmoothReLU, and GELU activations. Readers may refer to that work for details when needed.
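The objects in Eqs. (64) to (67) map directly onto array operations. The sketch below builds them for the sigmoid with $k=1$ and $N=3$; the helper names and the input vector are illustrative assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_prime(y):
    """Element-wise first derivative f'(y) = (1 - f(y)) f(y)."""
    s = sigmoid(y)
    return (1.0 - s) * s

def zero_padded(fk):
    """Return (f_0, F_0): fk with a leading zero, and diag(fk) with a leading zero row."""
    f0 = np.concatenate(([0.0], fk))                         # shape (N + 1,)
    F0 = np.vstack((np.zeros((1, fk.size)), np.diag(fk)))    # shape (N + 1, N)
    return f0, F0

y = np.array([-1.0, 0.0, 2.0])      # intermediate-layer vector with N = 3
fk = sigmoid_prime(y)               # f^(1)(y), Eq. (65)
Fk = np.diag(fk)                    # F^(1)(y), Eq. (66)
f0, F0 = zero_padded(fk)            # Eq. (67)
```

The same `zero_padded` helper applies unchanged to the second and third derivatives, since only the input vector `fk` differs.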
### A-C Kronecker Product
The Kronecker product appears frequently throughout this paper\. In this subsection, we present several identities that will be used in subsequent derivations\.
###### Lemma 1\.
$$\bm{x}\otimes(\bm{A}\bm{B})=(\bm{x}\otimes\bm{A})\bm{B},\tag{68}$$
$$\bm{x}\otimes(\bm{A}+\bm{B})=\bm{x}\otimes\bm{A}+\bm{x}\otimes\bm{B},\tag{69}$$
$$\bm{x}\otimes\begin{bmatrix}\bm{A}&\bm{B}\end{bmatrix}=\begin{bmatrix}\bm{x}\otimes\bm{A}&\bm{x}\otimes\bm{B}\end{bmatrix}.\tag{70}$$
###### Proof\.
If $\bm{A}\bm{B}$ is well-defined,
$$(\bm{x}\otimes\bm{A})\bm{B}=\begin{bmatrix}x_{1}\bm{A}\\ \vdots\\ x_{M}\bm{A}\end{bmatrix}\bm{B}=\begin{bmatrix}x_{1}\bm{A}\bm{B}\\ \vdots\\ x_{M}\bm{A}\bm{B}\end{bmatrix}=\bm{x}\otimes(\bm{A}\bm{B}).$$
If $\bm{A}+\bm{B}$ is well-defined,
$$\begin{aligned}
\bm{x}\otimes\bm{A}+\bm{x}\otimes\bm{B}&=\begin{bmatrix}x_{1}\bm{A}\\ \vdots\\ x_{M}\bm{A}\end{bmatrix}+\begin{bmatrix}x_{1}\bm{B}\\ \vdots\\ x_{M}\bm{B}\end{bmatrix}=\begin{bmatrix}x_{1}(\bm{A}+\bm{B})\\ \vdots\\ x_{M}(\bm{A}+\bm{B})\end{bmatrix}\\
&=\bm{x}\otimes(\bm{A}+\bm{B}).
\end{aligned}$$
If $\begin{bmatrix}\bm{A}&\bm{B}\end{bmatrix}$ is well-defined,
$$\begin{aligned}
\begin{bmatrix}\bm{x}\otimes\bm{A}&\bm{x}\otimes\bm{B}\end{bmatrix}&=\begin{bmatrix}x_{1}\bm{A}&x_{1}\bm{B}\\ \vdots&\vdots\\ x_{M}\bm{A}&x_{M}\bm{B}\end{bmatrix}=\begin{bmatrix}x_{1}\begin{bmatrix}\bm{A}&\bm{B}\end{bmatrix}\\ \vdots\\ x_{M}\begin{bmatrix}\bm{A}&\bm{B}\end{bmatrix}\end{bmatrix}\\
&=\bm{x}\otimes\begin{bmatrix}\bm{A}&\bm{B}\end{bmatrix}
\end{aligned}$$
is obtained. ∎
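The three identities of Lemma 1 can be sanity-checked numerically. The sketch below uses NumPy's `np.kron` with random matrices; the dimensions, the random seed, and the variable names are arbitrary choices for illustration. Note that $\bm{x}$ is stored as a column of shape `(M, 1)` so that the Kronecker product stacks the blocks $x_{m}\bm{A}$ vertically, as in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, Q, R = 3, 2, 4, 5

x = rng.normal(size=(M, 1))        # column vector: x ⊗ A stacks x_m * A vertically
A = rng.normal(size=(P, Q))
B = rng.normal(size=(Q, R))        # shaped so that the product A B is well-defined
A_same = rng.normal(size=(P, Q))   # same shape as A, so A + A_same is well-defined
B_cols = rng.normal(size=(P, R))   # same row count as A, so [A  B_cols] is well-defined

# Eq. (68): x ⊗ (A B) = (x ⊗ A) B
lhs68, rhs68 = np.kron(x, A @ B), np.kron(x, A) @ B

# Eq. (69): x ⊗ (A + B) = x ⊗ A + x ⊗ B
lhs69, rhs69 = np.kron(x, A + A_same), np.kron(x, A) + np.kron(x, A_same)

# Eq. (70): x ⊗ [A  B] = [x ⊗ A   x ⊗ B]
lhs70 = np.kron(x, np.hstack((A, B_cols)))
rhs70 = np.hstack((np.kron(x, A), np.kron(x, B_cols)))

for name, lhs, rhs in [("(68)", lhs68, rhs68), ("(69)", lhs69, rhs69), ("(70)", lhs70, rhs70)]:
    print(name, np.allclose(lhs, rhs))
```

These identities rely on $\bm{x}$ being a column vector; for a general left factor, Eq. (68) would need the mixed-product property instead.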
## Appendix B Proofs
### B-A Proof for Eqs. ([17](https://arxiv.org/html/2606.28662#S4.E17)), ([18](https://arxiv.org/html/2606.28662#S4.E18)), and ([19](https://arxiv.org/html/2606.28662#S4.E19))
From Eq. ([1](https://arxiv.org/html/2606.28662#S3.E1)), Eq. ([17](https://arxiv.org/html/2606.28662#S4.E17)) follows directly. Likewise, Eq. ([18](https://arxiv.org/html/2606.28662#S4.E18)) follows from Eq. ([2](https://arxiv.org/html/2606.28662#S3.E2)). From Eq. ([2](https://arxiv.org/html/2606.28662#S3.E2)), the gradient $\partial\sigma(\bm{\theta})/\partial\bm{\theta}$ is given by
$$\begin{aligned}
\frac{\partial\sigma(\bm{\theta})}{\partial\bm{\theta}}
&=\frac{\partial(\sigma(\bm{\theta})^{2})^{1/2}}{\partial\sigma(\bm{\theta})^{2}}\frac{\partial\sigma(\bm{\theta})^{2}}{\partial\bm{\theta}}\\
&=\frac{1}{2\sigma(\bm{\theta})}\bigg(\frac{1}{D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{\theta}}-\frac{\partial\mu(\bm{\theta})^{2}}{\partial\bm{\theta}}\bigg)\\
&=\frac{1}{2\sigma(\bm{\theta})}\bigg(\frac{1}{D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{\theta}}-\frac{\partial\mu(\bm{\theta})^{2}}{\partial\mu(\bm{\theta})}\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}\bigg)\\
&=\frac{1}{2\sigma(\bm{\theta})}\bigg(\frac{1}{D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{\theta}}-2\mu(\bm{\theta})\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}\bigg)\\
&=\frac{1}{2\sigma(\bm{\theta})D}\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{\theta}}-\frac{\mu(\bm{\theta})}{\sigma(\bm{\theta})}\frac{\partial\mu(\bm{\theta})}{\partial\bm{\theta}}.
\end{aligned}$$
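The chain-rule result above can be checked against finite differences on a toy problem. The sketch below is not the network Hessian of the paper: it assumes a hypothetical symmetric matrix $\bm{H}(\bm{\theta})=\bm{A}_{0}+\theta_{1}\bm{A}_{1}+\theta_{2}\bm{A}_{2}$, and it assumes the usual Wolkowicz-Styan quantities $\mu=\mathrm{tr}(\bm{H})/D$ and $\sigma^{2}=\mathrm{tr}(\bm{H}^{2})/D-\mu^{2}$, which is the form of $\sigma^{2}$ that the derivation differentiates. All function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

def sym(a):
    """Symmetrize a square matrix."""
    return 0.5 * (a + a.T)

# Hypothetical stand-in for the Hessian: H(theta) = A[0] + theta_1 A[1] + theta_2 A[2].
A = [sym(rng.normal(size=(D, D))) for _ in range(3)]

def hessian(theta):
    return A[0] + theta[0] * A[1] + theta[1] * A[2]

def mu(theta):
    return np.trace(hessian(theta)) / D

def sigma(theta):
    H = hessian(theta)
    return np.sqrt(np.trace(H @ H) / D - mu(theta) ** 2)

def sigma_grad(theta):
    """d sigma/d theta = (1 / (2 sigma D)) d tr(H^2)/d theta - (mu / sigma) d mu/d theta."""
    H = hessian(theta)
    d_tr_h2 = np.array([2.0 * np.trace(H @ Ak) for Ak in A[1:]])  # d tr(H^2)/d theta_k
    d_mu = np.array([np.trace(Ak) / D for Ak in A[1:]])           # d mu/d theta_k
    s = sigma(theta)
    return d_tr_h2 / (2.0 * s * D) - (mu(theta) / s) * d_mu

theta = np.array([0.3, -0.7])
eps = 1e-6
numeric = np.array([
    (sigma(theta + eps * e) - sigma(theta - eps * e)) / (2 * eps)
    for e in np.eye(theta.size)
])
analytic = sigma_grad(theta)
print(np.max(np.abs(numeric - analytic)))
```

The formula divides by $\sigma(\bm{\theta})$, so it is only usable when the eigenvalues of $\bm{H}$ are not all equal, that is, when $\sigma(\bm{\theta})>0$.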
### B-B Proof for Eq. ([20](https://arxiv.org/html/2606.28662#S4.E20))
#### B-B1 Gradient Decomposition
From Eq. ([3](https://arxiv.org/html/2606.28662#S3.E3)), the gradient of the Hessian trace can be decomposed as
$$\begin{aligned}
\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial\bm{\theta}}&=\sum_{i=1}^{I}\frac{\partial s'(z_{i})(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i}))}{\partial\bm{\theta}}\\
&\quad+\sum_{i=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{i})\bigg(\frac{\partial s'(z_{i})\|\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{\theta}}+\frac{\partial\delta_{i}\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i})}{\partial\bm{\theta}}\bigg).
\end{aligned}\tag{71}$$
Applying Eq. ([63](https://arxiv.org/html/2606.28662#A1.E63)) to the first gradient term, we obtain
$$\begin{aligned}
\frac{\partial s'(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))}{\partial\bm{\theta}}
&=\frac{\partial s'(z)}{\partial\bm{\theta}}(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))+s'(z)\frac{\partial\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y})}{\partial\bm{\theta}}\\
&=\frac{\partial s'(z)}{\partial\bm{\theta}}(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))+2s'(z)\frac{\partial\bm{f}(\bm{y})}{\partial\bm{\theta}}\bm{f}(\bm{y}).
\end{aligned}\tag{72}$$
Applying Eq. ([63](https://arxiv.org/html/2606.28662#A1.E63)) to the second gradient term, we obtain
$$\begin{aligned}
\frac{\partial s'(z)\|\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{\theta}}
&=\frac{\partial s'(z)(\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top})^{\top}\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\\
&=\frac{\partial s'(z)}{\partial\bm{\theta}}\widetilde{\bm{V}}\bm{F}'(\bm{y})^{2}\widetilde{\bm{V}}^{\top}+s'(z)\frac{\partial(\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top})^{\top}\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\\
&=\bigg(\frac{\partial s'(z)}{\partial\bm{\theta}}\widetilde{\bm{V}}\bm{F}'(\bm{y})+2s'(z)\frac{\partial\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\bigg)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}.
\end{aligned}\tag{73}$$
Since $\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i})$ is the inner product of $\widetilde{\bm{V}}^{\top}$ and $\bm{f}''(\bm{y}_{i})$, Eq. ([63](https://arxiv.org/html/2606.28662#A1.E63)) gives
$$\begin{aligned}
\frac{\partial\delta\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{\theta}}
&=\frac{\partial\delta}{\partial\bm{\theta}}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\frac{\partial\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{\theta}}\\
&=\frac{\partial\delta}{\partial\bm{\theta}}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\frac{\partial\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\bm{f}''(\bm{y})+\delta\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{\theta}}\widetilde{\bm{V}}^{\top}
\end{aligned}\tag{74}$$
for the third gradient term.
#### B-B2 Individual Gradients
###### Lemma 2.
$$\begin{aligned}
\frac{\partial s'(z)\|\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{w}}=\bm{h}(\bm{x})\otimes\Big(&s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{F}'(\bm{y})\\
&+2s'(z)\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y})\Big)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\in\mathbb{R}^{(M+1)N}.
\end{aligned}\tag{75}$$
###### Proof\.
Using Eqs. (51) and (54) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)], we obtain
$$\frac{\partial z}{\partial\bm{W}_{:m}}=\frac{\partial\bm{y}}{\partial\bm{W}_{:m}}\frac{\partial\bm{r}}{\partial\bm{y}}\frac{\partial\bm{h}(\bm{r})}{\partial\bm{r}}\frac{\partial z}{\partial\bm{h}(\bm{r})}=x_{m}\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}.\tag{76}$$
Therefore,
$$\begin{aligned}
\frac{\partial s'(z)}{\partial\bm{W}_{:m}}&=\frac{\partial z}{\partial\bm{W}_{:m}}\frac{\partial s'(z)}{\partial z}=x_{m}s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top},\\
\frac{\partial s'(z)}{\partial\bm{w}}&=\bm{h}(\bm{x})\otimes(s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}).
\end{aligned}\tag{77}$$
From Eq. (51) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)], we have
$$\frac{\partial\bm{y}}{\partial\bm{w}}=\bm{h}(\bm{x})\otimes\bm{E}_{N}.\tag{78}$$
Therefore, by Eq. ([68](https://arxiv.org/html/2606.28662#A1.E68)),
$$\begin{aligned}
\frac{\partial\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{w}}
&=\frac{\partial\bm{f}'(\bm{y})\odot\widetilde{\bm{V}}^{\top}}{\partial\bm{w}}\\
&=\begin{bmatrix}v_{1}\frac{\partial f'(y_{1})}{\partial\bm{w}}&\cdots&v_{N}\frac{\partial f'(y_{N})}{\partial\bm{w}}\end{bmatrix}\\
&=\begin{bmatrix}v_{1}\frac{\partial f'(y_{1})}{\partial y_{1}}\frac{\partial y_{1}}{\partial\bm{w}}&\cdots&v_{N}\frac{\partial f'(y_{N})}{\partial y_{N}}\frac{\partial y_{N}}{\partial\bm{w}}\end{bmatrix}\\
&=\begin{bmatrix}v_{1}f''(y_{1})\frac{\partial y_{1}}{\partial\bm{w}}&\cdots&v_{N}f''(y_{N})\frac{\partial y_{N}}{\partial\bm{w}}\end{bmatrix}\\
&=\begin{bmatrix}\frac{\partial y_{1}}{\partial\bm{w}}&\cdots&\frac{\partial y_{N}}{\partial\bm{w}}\end{bmatrix}\begin{bmatrix}v_{1}f''(y_{1})&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&v_{N}f''(y_{N})\end{bmatrix}\\
&=\frac{\partial\bm{y}}{\partial\bm{w}}\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y})=\bm{h}(\bm{x})\otimes\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}).
\end{aligned}\tag{79}$$
Substituting Eqs. ([77](https://arxiv.org/html/2606.28662#A2.E77)) and ([79](https://arxiv.org/html/2606.28662#A2.E79)) into Eq. ([73](https://arxiv.org/html/2606.28662#A2.E73)), we obtain
$$\begin{aligned}
\frac{\partial s'(z)\|\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{w}}
&=\bigg(\frac{\partial s'(z)}{\partial\bm{w}}\widetilde{\bm{V}}\bm{F}'(\bm{y})+2s'(z)\frac{\partial\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{w}}\bigg)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\\
&=\bm{h}(\bm{x})\otimes\Big(s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{F}'(\bm{y})\\
&\qquad+2s'(z)\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y})\Big)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top},
\end{aligned}\tag{80}$$
where Eq. ([69](https://arxiv.org/html/2606.28662#A1.E69)) was used. ∎
###### Lemma 3\.
$$\frac{\partial\delta\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{w}}=\bm{h}(\bm{x})\otimes(s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\bm{F}'''(\bm{y})\widetilde{\bm{V}}^{\top})\in\mathbb{R}^{(M+1)N}.\tag{81}$$
###### Proof\.
Combining Eq. ([76](https://arxiv.org/html/2606.28662#A2.E76)) with Eq. (52) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)] yields
$$\begin{aligned}
\frac{\partial\delta}{\partial\bm{W}_{:m}}&=\frac{\partial p}{\partial\bm{W}_{:m}}=\frac{\partial z}{\partial\bm{W}_{:m}}\frac{\partial p}{\partial z}=x_{m}s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top},\\
\frac{\partial\delta}{\partial\bm{w}}&=\bm{h}(\bm{x})\otimes(s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}).
\end{aligned}\tag{82}$$
Since $\widetilde{\bm{V}}$ does not depend on $\bm{w}$,
$$\frac{\partial\widetilde{\bm{V}}^{\top}}{\partial\bm{w}}=\bm{0}_{(M+1)N\times N}.\tag{83}$$
From Eq. ([78](https://arxiv.org/html/2606.28662#A2.E78)) and Eq. (35) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)], we obtain
$$\frac{\partial\bm{f}(\bm{y})}{\partial\bm{w}}=\frac{\partial\bm{y}}{\partial\bm{w}}\frac{\partial\bm{f}(\bm{y})}{\partial\bm{y}}=\bm{h}(\bm{x})\otimes\bm{F}'(\bm{y}),\tag{84}$$
$$\frac{\partial\bm{f}'(\bm{y})}{\partial\bm{w}}=\frac{\partial\bm{y}}{\partial\bm{w}}\frac{\partial\bm{f}'(\bm{y})}{\partial\bm{y}}=\bm{h}(\bm{x})\otimes\bm{F}''(\bm{y}),\tag{85}$$
$$\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{w}}=\frac{\partial\bm{y}}{\partial\bm{w}}\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{y}}=\bm{h}(\bm{x})\otimes\bm{F}'''(\bm{y}).\tag{86}$$
By substituting Eqs. ([82](https://arxiv.org/html/2606.28662#A2.E82)), ([83](https://arxiv.org/html/2606.28662#A2.E83)), and ([86](https://arxiv.org/html/2606.28662#A2.E86)) into Eq. ([74](https://arxiv.org/html/2606.28662#A2.E74)), we obtain
$$\begin{aligned}
\frac{\partial\delta\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{w}}
&=\frac{\partial\delta}{\partial\bm{w}}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\frac{\partial\widetilde{\bm{V}}^{\top}}{\partial\bm{w}}\bm{f}''(\bm{y})+\delta\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{w}}\widetilde{\bm{V}}^{\top}\\
&=\bm{h}(\bm{x})\otimes(s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\bm{F}'''(\bm{y})\widetilde{\bm{V}}^{\top}),
\end{aligned}$$
where Eq. ([69](https://arxiv.org/html/2606.28662#A1.E69)) was used. ∎
###### Lemma 4\.
$$\frac{\partial s'(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))}{\partial\bm{w}}=\bm{h}(\bm{x})\otimes\Big(s''(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}+2s'(z)\bm{F}'(\bm{y})\bm{f}(\bm{y})\Big)\in\mathbb{R}^{(M+1)N}.\tag{87}$$
###### Proof\.
Substituting Eqs. ([77](https://arxiv.org/html/2606.28662#A2.E77)) and ([84](https://arxiv.org/html/2606.28662#A2.E84)) into Eq. ([72](https://arxiv.org/html/2606.28662#A2.E72)) yields
$$\begin{aligned}
\frac{\partial s'(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))}{\partial\bm{w}}
&=\frac{\partial s'(z)}{\partial\bm{w}}(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))+2s'(z)\frac{\partial\bm{f}(\bm{y})}{\partial\bm{w}}\bm{f}(\bm{y})\\
&=\bm{h}(\bm{x})\otimes\Big(s''(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}+2s'(z)\bm{F}'(\bm{y})\bm{f}(\bm{y})\Big),
\end{aligned}$$
where Eq. ([69](https://arxiv.org/html/2606.28662#A1.E69)) was used. ∎
#### B-B3 Completion of the Proof
From Eqs. ([21](https://arxiv.org/html/2606.28662#S4.E21)), ([22](https://arxiv.org/html/2606.28662#S4.E22)), ([23](https://arxiv.org/html/2606.28662#S4.E23)), ([75](https://arxiv.org/html/2606.28662#A2.E75)), ([81](https://arxiv.org/html/2606.28662#A2.E81)), and ([87](https://arxiv.org/html/2606.28662#A2.E87)),
$$\begin{aligned}
\bm{h}(\bm{x}_{i})\otimes\bm{\mathcal{W}}_{i}^{\mathrm{I}}&=(1+\bm{x}_{i}^{\top}\bm{x}_{i})\frac{\partial s'(z_{i})\|\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{w}},\\
\bm{h}(\bm{x}_{i})\otimes\bm{\mathcal{W}}_{i}^{\mathrm{II}}&=(1+\bm{x}_{i}^{\top}\bm{x}_{i})\frac{\partial\delta_{i}\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i})}{\partial\bm{w}},\\
\bm{h}(\bm{x}_{i})\otimes\bm{\mathcal{W}}_{i}^{\mathrm{III}}&=\frac{\partial s'(z_{i})(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i}))}{\partial\bm{w}}
\end{aligned}$$
hold. Therefore, from Eq. ([71](https://arxiv.org/html/2606.28662#A2.E71)), we obtain
$$\begin{aligned}
\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial\bm{w}}&=\sum_{i=1}^{I}\frac{\partial s'(z_{i})(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i}))}{\partial\bm{w}}\\
&\quad+\sum_{i=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{i})\bigg(\frac{\partial s'(z_{i})\|\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{w}}+\frac{\partial\delta_{i}\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i})}{\partial\bm{w}}\bigg)\\
&=\sum_{i=1}^{I}\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{W}}_{i}^{\mathrm{I}}+\bm{\mathcal{W}}_{i}^{\mathrm{II}}+\bm{\mathcal{W}}_{i}^{\mathrm{III}}).
\end{aligned}$$
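The assembled gradient with respect to $\bm{w}$ can be compared with central finite differences. The sketch below rests on several assumptions that are inferred from the derivation rather than stated in this appendix: $\bm{h}(\bm{x})=[1,\bm{x}^{\top}]^{\top}$, $\bm{y}=\bm{W}\bm{h}(\bm{x})$, $z=\bm{v}^{\top}\bm{h}(\bm{f}(\bm{y}))$, $s$ is the sigmoid, $\delta=p-t$ with $p=s(z)$, $\bm{w}$ is the column-major vectorization of $\bm{W}$, and the trace itself has the form whose gradient is decomposed in Eq. (71). The tanh activation, the random data, and all function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_derivs(y):
    t = np.tanh(y)
    d1 = 1.0 - t**2
    d2 = -2.0 * t * d1
    d3 = -2.0 * d1 * (1.0 - 3.0 * t**2)
    return t, d1, d2, d3

def forward(w, x, v, N):
    """Assumed model: y = W h(x), z = v^T h(f(y)), with w = vec(W) column-major."""
    h = np.concatenate(([1.0], x))
    y = w.reshape((N, h.size), order="F") @ h
    f, f1, f2, f3 = tanh_derivs(y)
    p = sigmoid(v[0] + v[1:] @ f)
    return h, f, f1, f2, f3, p

def trace_hessian(w, X, T, v, N):
    """Trace in the form differentiated in Eq. (71) (assumed, not restated here)."""
    total = 0.0
    for x, t in zip(X, T):
        _, f, f1, f2, _, p = forward(w, x, v, N)
        s1 = p * (1.0 - p)                    # s'(z)
        a = f1 * v[1:]                        # F'(y) V~^T
        total += s1 * (1.0 + f @ f)
        total += (1.0 + x @ x) * (s1 * (a @ a) + (p - t) * (v[1:] @ f2))
    return total

def trace_hessian_grad_w(w, X, T, v, N):
    """Sum over samples of h(x_i) ⊗ (W_I + W_II + W_III), using Lemmas 2 to 4."""
    grad = np.zeros_like(w)
    for x, t in zip(X, T):
        h, f, f1, f2, f3, p = forward(w, x, v, N)
        Vt = v[1:]
        s1 = p * (1.0 - p)                    # s'(z)
        s2 = s1 * (1.0 - 2.0 * p)             # s''(z)
        delta = p - t
        a = f1 * Vt
        c = 1.0 + x @ x
        W_I = c * (s2 * (a @ a) * a + 2.0 * s1 * Vt * f2 * a)   # Eq. (75), scaled
        W_II = c * (s1 * (Vt @ f2) * a + delta * f3 * Vt)       # Eq. (81), scaled
        W_III = s2 * (1.0 + f @ f) * a + 2.0 * s1 * f1 * f      # Eq. (87)
        grad += np.kron(h, W_I + W_II + W_III)
    return grad

rng = np.random.default_rng(3)
M, N, I = 2, 3, 5
X = rng.normal(size=(I, M))
T = rng.integers(0, 2, size=I).astype(float)
v = rng.normal(size=N + 1)
w = rng.normal(size=N * (M + 1))

eps = 1e-6
numeric = np.array([
    (trace_hessian(w + eps * e, X, T, v, N) - trace_hessian(w - eps * e, X, T, v, N)) / (2 * eps)
    for e in np.eye(w.size)
])
analytic = trace_hessian_grad_w(w, X, T, v, N)
print(np.max(np.abs(numeric - analytic)))
```

Because every matrix inside the Kronecker factor is diagonal or rank one, the sketch evaluates the bracketed terms elementwise instead of forming $N\times N$ matrices.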
### B-C Proof for Eq. ([24](https://arxiv.org/html/2606.28662#S4.E24))
#### B-C1 Individual Gradients
###### Lemma 5.
$$\begin{aligned}
\frac{\partial s'(z)\|\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{v}}=\Big(&s''(z)\bm{h}(\bm{f}(\bm{y}))\widetilde{\bm{V}}\bm{F}'(\bm{y})\\
&+2s'(z)\bm{F}'_{0}(\bm{y})\Big)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\in\mathbb{R}^{N+1}.
\end{aligned}\tag{88}$$
###### Proof\.
Using Eq. (55) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)], we obtain
$$\frac{\partial s'(z)}{\partial\bm{v}}=\frac{\partial s'(z)}{\partial z}\frac{\partial z}{\partial\bm{v}}=s''(z)\bm{h}(\bm{f}(\bm{y})).\tag{89}$$
We also have
$$\begin{aligned}
\frac{\partial\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{v}}
&=\frac{\partial\bm{f}'(\bm{y})\odot\widetilde{\bm{V}}^{\top}}{\partial\bm{v}}\\
&=\begin{bmatrix}f'(y_{1})\frac{\partial v_{1}}{\partial\bm{v}}&\cdots&f'(y_{N})\frac{\partial v_{N}}{\partial\bm{v}}\end{bmatrix}\\
&=\begin{bmatrix}\frac{\partial v_{1}}{\partial\bm{v}}&\cdots&\frac{\partial v_{N}}{\partial\bm{v}}\end{bmatrix}\bm{F}'(\bm{y})=\begin{bmatrix}\bm{0}_{N}^{\top}\\ \bm{E}_{N}\end{bmatrix}\bm{F}'(\bm{y})=\bm{F}'_{0}(\bm{y}).
\end{aligned}\tag{90}$$
Therefore, from Eq. ([73](https://arxiv.org/html/2606.28662#A2.E73)),
$$\begin{aligned}
\frac{\partial s'(z)\|\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial\bm{v}}
&=\bigg(\frac{\partial s'(z)}{\partial\bm{v}}\widetilde{\bm{V}}\bm{F}'(\bm{y})+2s'(z)\frac{\partial\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}}{\partial\bm{v}}\bigg)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\\
&=\Big(s''(z)\bm{h}(\bm{f}(\bm{y}))\widetilde{\bm{V}}\bm{F}'(\bm{y})+2s'(z)\bm{F}'_{0}(\bm{y})\Big)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}
\end{aligned}$$
is obtained. ∎
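Lemma 5 can be checked numerically by writing Eq. (88) with the same matrices that appear in the text. The sketch below assumes $z=\bm{v}^{\top}\bm{h}(\bm{f}(\bm{y}))$ with $\bm{h}(\bm{f}(\bm{y}))=[1,\bm{f}(\bm{y})^{\top}]^{\top}$, a sigmoid $s$, and a tanh activation; these choices, the random inputs, and the function names `g` and `g_grad_v` are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(v, y):
    """s'(z) * ||F'(y) V~^T||^2 as a function of the output weights v."""
    f = np.tanh(y)
    p = sigmoid(v[0] + v[1:] @ f)
    a = (1.0 - f**2) * v[1:]                    # F'(y) V~^T
    return p * (1.0 - p) * (a @ a)

def g_grad_v(v, y):
    """Closed form of Eq. (88), written with explicit matrices."""
    f = np.tanh(y)
    hf = np.concatenate(([1.0], f))             # h(f(y)), length N + 1
    F1 = np.diag(1.0 - f**2)                    # F'(y), shape (N, N)
    F1_0 = np.vstack((np.zeros((1, y.size)), F1))   # F'_0(y), shape (N + 1, N)
    Vt = v[1:]                                  # entries of V~ (bias weight excluded)
    p = sigmoid(v @ hf)
    s1 = p * (1.0 - p)                          # s'(z)
    s2 = s1 * (1.0 - 2.0 * p)                   # s''(z)
    bracket = s2 * np.outer(hf, Vt @ F1) + 2.0 * s1 * F1_0
    return bracket @ (F1 @ Vt)

rng = np.random.default_rng(2)
N = 4
y = rng.normal(size=N)
v = rng.normal(size=N + 1)

eps = 1e-6
numeric = np.array([
    (g(v + eps * e, y) - g(v - eps * e, y)) / (2 * eps)
    for e in np.eye(N + 1)
])
analytic = g_grad_v(v, y)
print(np.max(np.abs(numeric - analytic)))
```

The first row of $\bm{F}'_{0}(\bm{y})$ is zero because the bias weight $v_{0}$ does not enter $\widetilde{\bm{V}}$; it affects the result only through $s''(z)$.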
###### Lemma 6\.
$$\frac{\partial\delta\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{v}}=s'(z)\bm{h}(\bm{f}(\bm{y}))\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\bm{f}''_{0}(\bm{y})\in\mathbb{R}^{N+1}.\tag{91}$$
###### Proof\.
Since $\bm{f}''(\bm{y})$ does not depend on $\bm{v}$, and $\widetilde{\bm{V}}$ contains $\bm{v}$,
$$\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{v}}=\bm{0}_{(N+1)\times N},\quad \frac{\partial\widetilde{\bm{V}}^{\top}}{\partial\bm{v}}=\begin{bmatrix}\bm{0}_{N}^{\top}\\ \bm{E}_{N}\end{bmatrix}.$$
Furthermore, from Eq. (62) in [[24](https://arxiv.org/html/2606.28662#bib.bib6)], we obtain
$$\frac{\partial\delta}{\partial\bm{v}}=\frac{\partial p}{\partial\bm{v}}=s'(z)\bm{h}(\bm{f}(\bm{y})).\tag{92}$$
Substituting these results into Eq. ([74](https://arxiv.org/html/2606.28662#A2.E74)),
$$\begin{aligned}
\frac{\partial\delta\widetilde{\bm{V}}\bm{f}''(\bm{y})}{\partial\bm{v}}
&=\frac{\partial\delta}{\partial\bm{v}}\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\frac{\partial\widetilde{\bm{V}}^{\top}}{\partial\bm{v}}\bm{f}''(\bm{y})+\delta\frac{\partial\bm{f}''(\bm{y})}{\partial\bm{v}}\widetilde{\bm{V}}^{\top}\\
&=s'(z)\bm{h}(\bm{f}(\bm{y}))\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\begin{bmatrix}\bm{0}_{N}^{\top}\\ \bm{E}_{N}\end{bmatrix}\bm{f}''(\bm{y})\\
&=s'(z)\bm{h}(\bm{f}(\bm{y}))\widetilde{\bm{V}}\bm{f}''(\bm{y})+\delta\bm{f}''_{0}(\bm{y})
\end{aligned}$$
holds. ∎
###### Lemma 7\.
$$\frac{\partial s'(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))}{\partial\bm{v}}=s''(z)(1+\bm{f}(\bm{y})^{\top}\bm{f}(\bm{y}))\bm{h}(\bm{f}(\bm{y}))\in\mathbb{R}^{N+1}.\tag{93}$$
###### Proof\.
Since $\bm{f}(\bm{y})$ does not depend on $\bm{v}$,
$$\frac{\partial\bm{f}(\bm{y})}{\partial\bm{v}}=\bm{0}_{(N+1)\times N}.\tag{94}$$
Substituting Eqs. ([94](https://arxiv.org/html/2606.28662#A2.E94)) and ([89](https://arxiv.org/html/2606.28662#A2.E89)) into Eq. ([72](https://arxiv.org/html/2606.28662#A2.E72)), we obtain Eq. ([93](https://arxiv.org/html/2606.28662#A2.E93)). ∎
#### B\-C2 Completion of the Proof
Substituting Eqs\. \([88](https://arxiv.org/html/2606.28662#A2.E88)\), \([91](https://arxiv.org/html/2606.28662#A2.E91)\), and \([93](https://arxiv.org/html/2606.28662#A2.E93)\) into Eq\. \([71](https://arxiv.org/html/2606.28662#A2.E71)\), we obtain
$$
\begin{aligned}
\frac{\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta}))}{\partial \bm{v}}
&= \sum_{i=1}^{I}\frac{\partial\, s'(z_{i})\bigl(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{i})\bigr)}{\partial \bm{v}} \\
&\quad + \sum_{i=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{i})\biggl(\frac{\partial\, s'(z_{i})\|\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\|^{2}}{\partial \bm{v}} + \frac{\partial\, \delta\widetilde{\bm{V}}\bm{f}''(\bm{y}_{i})}{\partial \bm{v}}\biggr) \\
&= \sum_{i=1}^{I}\bigl(\bm{\mathcal{V}}_{i}^{\mathrm{I}} + \bm{\mathcal{V}}_{i}^{\mathrm{II}} + \bm{\mathcal{V}}_{i}^{\mathrm{III}}\bigr),
\end{aligned}
$$

where Eqs\. \([25](https://arxiv.org/html/2606.28662#S4.E25)\), \([26](https://arxiv.org/html/2606.28662#S4.E26)\), and \([27](https://arxiv.org/html/2606.28662#S4.E27)\) were used\.
### B\-D Proof for Eq\. \([28](https://arxiv.org/html/2606.28662#S4.E28)\)
#### B\-D1 Gradient Decomposition
From Eq\. \([4](https://arxiv.org/html/2606.28662#S3.E4)\),
$$
\begin{aligned}
\frac{\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{\theta}}
&= \sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\frac{\partial \phi_{ij}}{\partial \bm{\theta}}
+ 2\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})\frac{\partial \psi_{ij}}{\partial \bm{\theta}} \\
&\quad + \sum_{i=1}^{I}\sum_{j=1}^{I}\bigl(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\bigr)^{2}\frac{\partial \omega_{ij}}{\partial \bm{\theta}}
+ \sum_{i=1}^{I}\sum_{j=1}^{I}\frac{\partial \bigl(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\bigr)^{2}}{\partial \bm{\theta}}\omega_{ij}
\end{aligned}
\tag{95}
$$

holds\. We decompose it into three parts:

$$
\Biggl[\frac{\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{\theta}^{\dagger}}\Biggr]_{\Phi} = \sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\frac{\partial \phi_{ij}}{\partial \bm{\theta}^{\dagger}}, \tag{96}
$$

$$
\Biggl[\frac{\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{\theta}^{\dagger}}\Biggr]_{\Psi} = 2\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})\frac{\partial \psi_{ij}}{\partial \bm{\theta}^{\dagger}}, \tag{97}
$$

$$
\begin{aligned}
\Biggl[\frac{\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial \bm{\theta}^{\dagger}}\Biggr]_{\Omega}
&= \sum_{i=1}^{I}\sum_{j=1}^{I}\bigl(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\bigr)^{2}\frac{\partial \omega_{ij}}{\partial \bm{\theta}^{\dagger}} \\
&\quad + \sum_{i=1}^{I}\sum_{j=1}^{I}\frac{\partial \bigl(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})\bigr)^{2}}{\partial \bm{\theta}^{\dagger}}\omega_{ij}, \quad \bm{\theta}^{\dagger}\in\{\bm{w},\bm{v}\}.
\end{aligned}
\tag{98}
$$
###### Lemma 8\.
$$
\frac{\partial \phi_{ij}}{\partial \bm{\theta}} = \frac{\partial \bm{\phi}_{ij}}{\partial \bm{\theta}}(\bm{o}_{i}\otimes\bm{o}_{j}) + \frac{\partial \bm{o}_{i}}{\partial \bm{\theta}}\bm{\Phi}_{ij}\bm{o}_{j} + \frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\bm{\Phi}_{ji}\bm{o}_{i}, \tag{99}
$$

$$
\frac{\partial \psi_{ij}}{\partial \bm{\theta}} = \frac{\partial \bm{\psi}_{ij}}{\partial \bm{\theta}}(\bm{o}_{i}\otimes\bm{o}_{j}) + \frac{\partial \bm{o}_{i}}{\partial \bm{\theta}}\bm{\Psi}_{ij}\bm{o}_{j} + \frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\bm{\Psi}_{ji}\bm{o}_{i}, \tag{100}
$$

$$
\frac{\partial \phi_{ij}}{\partial \bm{\theta}},\ \frac{\partial \psi_{ij}}{\partial \bm{\theta}} \in \mathbb{R}^{D}.
$$
###### Proof\.
Since quadratic forms are equivalent to inner products, from Eq\. \([63](https://arxiv.org/html/2606.28662#A1.E63)\) we have

$$
\begin{aligned}
\frac{\partial \phi_{ij}}{\partial \bm{\theta}}
&= \frac{\partial\, \bm{o}_{i}^{\top}\bm{\Phi}_{ij}\bm{o}_{j}}{\partial \bm{\theta}}
= \biggl(\frac{\partial (\bm{o}_{i}^{\top}\bm{\Phi}_{ij})^{\top}}{\partial \bm{\theta}}\biggr)\bm{o}_{j} + \biggl(\frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\biggr)(\bm{o}_{i}^{\top}\bm{\Phi}_{ij})^{\top} \\
&= \biggl(\frac{\partial\, \bm{\Phi}_{ij}^{\top}\bm{o}_{i}}{\partial \bm{\theta}}\biggr)\bm{o}_{j} + \biggl(\frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\biggr)\bm{\Phi}_{ij}^{\top}\bm{o}_{i} \\
&= \frac{\partial \bm{\phi}_{ij}}{\partial \bm{\theta}}(\bm{o}_{i}\otimes\bm{o}_{j}) + \frac{\partial \bm{o}_{i}}{\partial \bm{\theta}}\bm{\Phi}_{ij}\bm{o}_{j} + \frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\bm{\Phi}_{ij}^{\top}\bm{o}_{i} \\
&= \frac{\partial \bm{\phi}_{ij}}{\partial \bm{\theta}}(\bm{o}_{i}\otimes\bm{o}_{j}) + \frac{\partial \bm{o}_{i}}{\partial \bm{\theta}}\bm{\Phi}_{ij}\bm{o}_{j} + \frac{\partial \bm{o}_{j}}{\partial \bm{\theta}}\bm{\Phi}_{ji}\bm{o}_{i}.
\end{aligned}
$$

Using Eqs\. \([6](https://arxiv.org/html/2606.28662#S3.E6)\)–\([9](https://arxiv.org/html/2606.28662#S3.E9)\), we obtain

$$
\bm{\Phi}_{ij}^{\top} = \bm{\Phi}_{ji}
$$

and we also use
$$
\begin{aligned}
\frac{\partial\, \bm{\Phi}_{ij}^{\top}\bm{o}_{i}}{\partial \bm{\theta}}\bm{o}_{j}
&= \frac{\partial \begin{bmatrix}(\bm{\Phi}_{ij}^{\top})_{1:}\bm{o}_{i} \\ (\bm{\Phi}_{ij}^{\top})_{2:}\bm{o}_{i}\end{bmatrix}}{\partial \bm{\theta}}\bm{o}_{j}
= \begin{bmatrix}\dfrac{\partial (\bm{\Phi}_{ij}^{\top})_{1:}\bm{o}_{i}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Phi}_{ij}^{\top})_{2:}\bm{o}_{i}}{\partial \bm{\theta}}\end{bmatrix}\begin{bmatrix}s'(z_{j}) \\ \delta_{j}\end{bmatrix} \\
&= s'(z_{j})\frac{\partial (\bm{\Phi}_{ij}^{\top})_{1:}\bm{o}_{i}}{\partial \bm{\theta}} + \delta_{j}\frac{\partial (\bm{\Phi}_{ij}^{\top})_{2:}\bm{o}_{i}}{\partial \bm{\theta}} \\
&= s'(z_{j})\biggl(\frac{\partial (\bm{\Phi}_{ij})_{11}s'(z_{i})}{\partial \bm{\theta}} + \frac{\partial (\bm{\Phi}_{ij})_{21}\delta_{i}}{\partial \bm{\theta}}\biggr)
+ \delta_{j}\biggl(\frac{\partial (\bm{\Phi}_{ij})_{12}s'(z_{i})}{\partial \bm{\theta}} + \frac{\partial (\bm{\Phi}_{ij})_{22}\delta_{i}}{\partial \bm{\theta}}\biggr) \\
&= s'(z_{j})s'(z_{i})\frac{\partial (\bm{\Phi}_{ij})_{11}}{\partial \bm{\theta}} + \delta_{j}s'(z_{i})\frac{\partial (\bm{\Phi}_{ij})_{12}}{\partial \bm{\theta}}
+ s'(z_{j})\delta_{i}\frac{\partial (\bm{\Phi}_{ij})_{21}}{\partial \bm{\theta}} + \delta_{j}\delta_{i}\frac{\partial (\bm{\Phi}_{ij})_{22}}{\partial \bm{\theta}} \\
&\quad + s'(z_{j})(\bm{\Phi}_{ij})_{11}\frac{\partial s'(z_{i})}{\partial \bm{\theta}} + s'(z_{j})(\bm{\Phi}_{ij})_{21}\frac{\partial \delta_{i}}{\partial \bm{\theta}}
+ \delta_{j}(\bm{\Phi}_{ij})_{12}\frac{\partial s'(z_{i})}{\partial \bm{\theta}} + \delta_{j}(\bm{\Phi}_{ij})_{22}\frac{\partial \delta_{i}}{\partial \bm{\theta}} \\
&= \frac{\partial \bm{\phi}_{ij}}{\partial \bm{\theta}}(\bm{o}_{i}\otimes\bm{o}_{j}) + \frac{\partial \bm{o}_{i}}{\partial \bm{\theta}}\bm{\Phi}_{ij}\bm{o}_{j}.
\end{aligned}
$$

From these results, $\partial \psi_{ij}/\partial \bm{\theta}$ can be proved in the same manner\.
Note that $\bm{\phi}_{ij}$ is the vector obtained by flattening $\bm{\Phi}_{ij}$, and $\bm{\psi}_{ij}$ is the vector obtained by flattening $\bm{\Psi}_{ij}$, i\.e\.,

$$
\begin{aligned}
\bm{\phi}_{ij} &= \begin{bmatrix}(\bm{\Phi}_{ij})_{11} & (\bm{\Phi}_{ij})_{12} & (\bm{\Phi}_{ij})_{21} & (\bm{\Phi}_{ij})_{22}\end{bmatrix}^{\top}, \\
\bm{\psi}_{ij} &= \begin{bmatrix}(\bm{\Psi}_{ij})_{11} & (\bm{\Psi}_{ij})_{12} & (\bm{\Psi}_{ij})_{21} & (\bm{\Psi}_{ij})_{22}\end{bmatrix}^{\top}.
\end{aligned}
$$

Also, the Jacobians are given by

$$
\frac{\partial \bm{\phi}_{ij}}{\partial \bm{\theta}} = \begin{bmatrix}\dfrac{\partial (\bm{\Phi}_{ij})_{11}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Phi}_{ij})_{12}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Phi}_{ij})_{21}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Phi}_{ij})_{22}}{\partial \bm{\theta}}\end{bmatrix}, \tag{101}
$$

$$
\frac{\partial \bm{\psi}_{ij}}{\partial \bm{\theta}} = \begin{bmatrix}\dfrac{\partial (\bm{\Psi}_{ij})_{11}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Psi}_{ij})_{12}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Psi}_{ij})_{21}}{\partial \bm{\theta}} & \dfrac{\partial (\bm{\Psi}_{ij})_{22}}{\partial \bm{\theta}}\end{bmatrix}. \tag{102}
$$

∎
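The three-term product rule of Lemma 8, including the row-major flattening order and the Kronecker factor $\bm{o}_{i}\otimes\bm{o}_{j}$, can be checked on a toy problem. The sketch below is illustrative only: the smooth functions chosen for $\bm{o}_{i}(\bm{\theta})$, $\bm{o}_{j}(\bm{\theta})$, and $\bm{\Phi}(\bm{\theta})$ are arbitrary stand-ins rather than the paper's quantities, and the general $\bm{\Phi}^{\top}$ is used where the lemma substitutes $\bm{\Phi}_{ji}$. Jacobians follow the denominator layout used in the text ($D\times 2$ and $D\times 4$).

```python
import numpy as np

rng = np.random.default_rng(1)
D = 5
A = rng.normal(size=(2, D))      # toy coefficients defining o_i(theta)
B = rng.normal(size=(2, D))      # toy coefficients defining o_j(theta)
C = rng.normal(size=(2, 2, D))   # toy coefficients defining Phi(theta)

def o_vec(P, theta):
    """Toy 2-vector o(theta) = sin(P theta)."""
    return np.sin(P @ theta)

def o_jac(P, theta):
    """Jacobian of o in denominator layout, shape (D, 2)."""
    return (np.cos(P @ theta)[:, None] * P).T

def phi_mat(theta):
    """Toy 2x2 matrix Phi(theta) with entries tanh(C_kl . theta)."""
    return np.tanh(C @ theta)

def phi_flat_jac(theta):
    """Jacobian of the flattened Phi, columns ordered 11, 12, 21, 22; shape (D, 4)."""
    dphi = (1.0 - np.tanh(C @ theta) ** 2)[..., None] * C   # shape (2, 2, D)
    return dphi.reshape(4, D).T

def scalar_phi(theta):
    """Scalar o_i^T Phi o_j."""
    return o_vec(A, theta) @ phi_mat(theta) @ o_vec(B, theta)

def lemma8_gradient(theta):
    """Three-term closed form in the style of Eq. (99)."""
    oi, oj, Phi = o_vec(A, theta), o_vec(B, theta), phi_mat(theta)
    return (phi_flat_jac(theta) @ np.kron(oi, oj)
            + o_jac(A, theta) @ (Phi @ oj)
            + o_jac(B, theta) @ (Phi.T @ oi))

theta = rng.normal(size=D)
eps = 1e-6
numeric = np.array([
    (scalar_phi(theta + eps * e) - scalar_phi(theta - eps * e)) / (2.0 * eps)
    for e in np.eye(D)
])
closed = lemma8_gradient(theta)
max_err = np.max(np.abs(closed - numeric))
print("max abs difference:", max_err)
```

The check relies on `np.kron(oi, oj)` producing the entries in the order $(1,1), (1,2), (2,1), (2,2)$, which matches the flattening of $\bm{\Phi}_{ij}$ defined above.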
###### Lemma 9\.
$$
\frac{\partial \omega_{ij}}{\partial \bm{\theta}} = \frac{\partial s'(z_{i})}{\partial \bm{\theta}}s'(z_{j}) + \frac{\partial s'(z_{j})}{\partial \bm{\theta}}s'(z_{i}) \in \mathbb{R}^{D\times 1}. \tag{103}
$$
###### Proof\.
From Eq\. \([16](https://arxiv.org/html/2606.28662#S3.E16)\),
$$
\begin{aligned}
\frac{\partial \omega_{ij}}{\partial \bm{\theta}}
&= \frac{\partial\, \bm{o}_{i}^{\top}\bm{\Omega}_{ij}\bm{o}_{j}}{\partial \bm{\theta}} = \frac{\partial\, s'(z_{i})s'(z_{j})}{\partial \bm{\theta}} \\
&= \frac{\partial s'(z_{i})}{\partial \bm{\theta}}s'(z_{j}) + \frac{\partial s'(z_{j})}{\partial \bm{\theta}}s'(z_{i})
\end{aligned}
$$

is obtained\. ∎
#### B\-D2 Gradient with Respect to Affine Parameters from Input to Hidden Layer
$\partial \phi_{ij}/\partial \bm{\theta}$ contains $\partial \bm{o}/\partial \bm{\theta}$\. Therefore, we show the Jacobian of $\bm{o}$ with respect to $\bm{w}$ here\.
###### Lemma 10\.
$$
\frac{\partial \bm{o}}{\partial \bm{w}} = \bm{h}(\bm{x})\otimes\bigl(\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z)^{\top}\bigr) \in \mathbb{R}^{(M+1)N\times 2}. \tag{104}
$$
###### Proof\.
From Eqs\. \([35](https://arxiv.org/html/2606.28662#S4.E35)\), \([68](https://arxiv.org/html/2606.28662#A1.E68)\), \([70](https://arxiv.org/html/2606.28662#A1.E70)\), \([77](https://arxiv.org/html/2606.28662#A2.E77)\), and \([82](https://arxiv.org/html/2606.28662#A2.E82)\),
$$
\begin{aligned}
\frac{\partial \bm{o}}{\partial \bm{w}}
&= \begin{bmatrix}\dfrac{\partial s'(z)}{\partial \bm{w}} & \dfrac{\partial \delta}{\partial \bm{w}}\end{bmatrix} \\
&= \begin{bmatrix}\bm{h}(\bm{x})\otimes\bigl(s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\bigr) & \bm{h}(\bm{x})\otimes\bigl(s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\bigr)\end{bmatrix} \\
&= \bm{h}(\bm{x})\otimes\begin{bmatrix}s''(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top} & s'(z)\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\end{bmatrix} \\
&= \bm{h}(\bm{x})\otimes\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\begin{bmatrix}s''(z) & s'(z)\end{bmatrix} \\
&= \bm{h}(\bm{x})\otimes\bigl(\bm{F}'(\bm{y})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z)^{\top}\bigr)
\end{aligned}
$$

is obtained\. ∎
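The Kronecker structure of Eq\. (104) and its $(M+1)N\times 2$ shape can be checked by finite differences. This is a hedged sketch under several assumptions not fixed by this section: $f=\tanh$, $s$ is the logistic sigmoid, $\delta = s(z)-t$ for a label $t$ (consistent with $\partial\delta/\partial z = s'(z)$), $\bm{h}(\bm{x})$ prepends a 1, and $\bm{w}$ stacks the columns of $[\bm{b}\ \bm{W}]$ (bias column first). All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w_flat, x, v, N):
    """Return (h(x), y, z) for the assumed three-layer network."""
    hx = np.concatenate(([1.0], x))
    W_ext = w_flat.reshape((N, hx.size), order="F")  # columns: bias, W[:, 0], W[:, 1], ...
    y = W_ext @ hx
    z = v[0] + v[1:] @ np.tanh(y)
    return hx, y, z

def o_vector(w_flat, x, v, t, N):
    """o = [s'(z), delta], with the assumed delta = s(z) - t."""
    _, _, z = forward(w_flat, x, v, N)
    s = sigmoid(z)
    return np.array([s * (1.0 - s), s - t])

def lemma10_jacobian(w_flat, x, v, N):
    """Closed form h(x) kron (F'(y) V~^T [s''(z), s'(z)]), shape ((M+1)N, 2)."""
    hx, y, z = forward(w_flat, x, v, N)
    s = sigmoid(z)
    s1 = s * (1.0 - s)
    s2 = s1 * (1.0 - 2.0 * s)
    g = (1.0 - np.tanh(y) ** 2) * v[1:]        # F'(y) V~^T as a length-N vector
    G = np.outer(g, np.array([s2, s1]))        # shape (N, 2)
    return np.kron(hx[:, None], G)

rng = np.random.default_rng(2)
M, N, t = 3, 4, 1.0
x = rng.normal(size=M)
v = rng.normal(size=N + 1)
w = rng.normal(size=(M + 1) * N)

eps = 1e-6
numeric = np.array([
    (o_vector(w + eps * e, x, v, t, N) - o_vector(w - eps * e, x, v, t, N)) / (2.0 * eps)
    for e in np.eye(w.size)
])
closed = lemma10_jacobian(w, x, v, N)
max_err = np.max(np.abs(closed - numeric))
print("shape:", closed.shape, "max abs difference:", max_err)
```

The column-major reshape is the design choice that makes row $mN+n$ of the Kronecker product correspond to the parameter multiplying $h(\bm{x})_{m}$ in hidden unit $n$; a different vectorization of $\bm{w}$ would permute the rows.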
#### B\-D3 Gradients of the Phi Term
Here, we clarify $[\partial\, \mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial \bm{w}]_{\Phi}$ shown in Eq\. \([96](https://arxiv.org/html/2606.28662#A2.E96)\)\.
###### Lemma 11\.
$$
\frac{\partial (\bm{\Phi}_{ij})_{11}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{a}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{a}}^{\Phi}_{ji} \in \mathbb{R}^{(M+1)N}. \tag{105}
$$
###### Proof\.
From $y_{n} = \sum_{m=1}^{M}w_{nm}x_{m} + b_{n}$, $w_{nm}$ appears only in $y_{n}$\. Also, from Eq\. \([6](https://arxiv.org/html/2606.28662#S3.E6)\),

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{11}^{1/2}}{\partial w_{nm}}
&= v_{n}^{2}\frac{\partial\, f'(y_{jn})f'(y_{in})}{\partial w_{nm}} \\
&= v_{n}^{2}\biggl(\frac{\partial f'(y_{jn})}{\partial w_{nm}}f'(y_{in}) + \frac{\partial f'(y_{in})}{\partial w_{nm}}f'(y_{jn})\biggr) \\
&= v_{n}^{2}\biggl(\frac{\partial f'(y_{jn})}{\partial y_{jn}}\frac{\partial y_{jn}}{\partial w_{nm}}f'(y_{in}) + \frac{\partial f'(y_{in})}{\partial y_{in}}\frac{\partial y_{in}}{\partial w_{nm}}f'(y_{jn})\biggr) \\
&= v_{n}^{2}\bigl(f''(y_{jn})x_{jm}f'(y_{in}) + f''(y_{in})x_{im}f'(y_{jn})\bigr)
\end{aligned}
$$

holds\. Therefore,

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{11}^{1/2}}{\partial \bm{W}_{:m}}
&= \widetilde{\bm{V}}^{\odot 2\top}\odot\bigl(x_{im}\bm{f}''(\bm{y}_{i})\odot\bm{f}'(\bm{y}_{j}) + x_{jm}\bm{f}''(\bm{y}_{j})\odot\bm{f}'(\bm{y}_{i})\bigr) \\
&= x_{im}\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top} + x_{jm}\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}
\end{aligned}
$$

is obtained, where $\widetilde{\bm{V}}^{\odot 2\top} = \widetilde{\bm{V}}^{\top}\odot\widetilde{\bm{V}}^{\top}$\. Using this, we obtain

$$
\frac{\partial (\bm{\Phi}_{ij})_{11}^{1/2}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bigl(\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}\bigr) + \bm{h}(\bm{x}_{j})\otimes\bigl(\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}\bigr).
$$

Therefore, using Eq\. \([37](https://arxiv.org/html/2606.28662#S4.E37)\), we obtain

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{11}}{\partial \bm{w}}
&= \frac{\partial (\bm{\Phi}_{ij})_{11}}{\partial (\bm{\Phi}_{ij})_{11}^{1/2}}\frac{\partial (\bm{\Phi}_{ij})_{11}^{1/2}}{\partial \bm{w}} \\
&= 2(\bm{\Phi}_{ij})_{11}^{1/2}\Bigl(\bm{h}(\bm{x}_{i})\otimes\bigl(\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}\bigr) + \bm{h}(\bm{x}_{j})\otimes\bigl(\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}\bigr)\Bigr) \\
&= \bm{h}(\bm{x}_{i})\otimes\bigl(2(\bm{\Phi}_{ij})_{11}^{1/2}\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}\bigr) + \bm{h}(\bm{x}_{j})\otimes\bigl(2(\bm{\Phi}_{ji})_{11}^{1/2}\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}\bigr) \\
&= \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{a}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{a}}^{\Phi}_{ji}.
\end{aligned}
$$

Note that we utilized $(\bm{\Phi}_{ij})_{11} = (\bm{\Phi}_{ji})_{11}$ from Eq\. \([6](https://arxiv.org/html/2606.28662#S3.E6)\)\. ∎
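A finite-difference check of Eq\. (105) is sketched below. Eq\. (6) is not reproduced in this appendix, so the form $(\bm{\Phi}_{ij})_{11}^{1/2} = \sum_{n} v_{n}^{2} f'(y_{in}) f'(y_{jn})$ is an assumption inferred from the first line of the proof; $f=\tanh$, the bias-first column stacking of $\bm{w}$, and all function names are likewise illustrative choices.

```python
import numpy as np

def f1(y):
    """f'(y) for the assumed activation f = tanh."""
    return 1.0 - np.tanh(y) ** 2

def f2(y):
    """f''(y) for f = tanh."""
    return -2.0 * np.tanh(y) * (1.0 - np.tanh(y) ** 2)

def h(x):
    return np.concatenate(([1.0], x))

def preact(w_flat, x, N):
    """Hidden pre-activation y = [b W] h(x), with w stacking columns bias-first."""
    return w_flat.reshape((N, x.size + 1), order="F") @ h(x)

def phi11(w_flat, xi, xj, v_tilde, N):
    """Assumed (Phi_ij)_11 = (sum_n v_n^2 f'(y_in) f'(y_jn))^2."""
    yi, yj = preact(w_flat, xi, N), preact(w_flat, xj, N)
    return np.sum(v_tilde ** 2 * f1(yi) * f1(yj)) ** 2

def lemma11_gradient(w_flat, xi, xj, v_tilde, N):
    """Closed form h(x_i) kron a_ij + h(x_j) kron a_ji."""
    yi, yj = preact(w_flat, xi, N), preact(w_flat, xj, N)
    root = np.sum(v_tilde ** 2 * f1(yi) * f1(yj))      # (Phi_ij)_11^{1/2}
    a_ij = 2.0 * root * f2(yi) * f1(yj) * v_tilde ** 2
    a_ji = 2.0 * root * f2(yj) * f1(yi) * v_tilde ** 2
    return np.kron(h(xi), a_ij) + np.kron(h(xj), a_ji)

rng = np.random.default_rng(3)
M, N = 3, 4
xi, xj = rng.normal(size=M), rng.normal(size=M)
v_tilde = rng.normal(size=N)                 # output weights without the bias
w = 0.5 * rng.normal(size=(M + 1) * N)

eps = 1e-6
numeric = np.array([
    (phi11(w + eps * e, xi, xj, v_tilde, N) - phi11(w - eps * e, xi, xj, v_tilde, N)) / (2.0 * eps)
    for e in np.eye(w.size)
])
closed = lemma11_gradient(w, xi, xj, v_tilde, N)
max_err = np.max(np.abs(closed - numeric))
print("max abs difference:", max_err)
```

Because the diagonal matrices $\bm{F}'$, $\bm{F}''$ act elementwise on $\widetilde{\bm{V}}^{\odot 2\top}$, the sketch stores them as vectors and uses elementwise products instead of building $N\times N$ matrices.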
###### Lemma 12\.
$$
\frac{\partial (\bm{\Phi}_{ij})_{12}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{b}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{c}}^{\Phi}_{ji} \in \mathbb{R}^{(M+1)N}, \tag{106}
$$

$$
\frac{\partial (\bm{\Phi}_{ij})_{21}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{c}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{b}}^{\Phi}_{ji} \in \mathbb{R}^{(M+1)N}. \tag{107}
$$
###### Proof\.
From $y_{n} = \sum_{m=1}^{M}w_{nm}x_{m} + b_{n}$, $w_{nm}$ appears only in $y_{n}$\. Also, from Eq\. \([7](https://arxiv.org/html/2606.28662#S3.E7)\),

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{12}}{\partial w_{nm}}
&= v_{n}^{3}\frac{\partial\, f'(y_{in})^{2}f''(y_{jn})}{\partial w_{nm}} \\
&= v_{n}^{3}\biggl(\frac{\partial f'(y_{in})^{2}}{\partial w_{nm}}f''(y_{jn}) + \frac{\partial f''(y_{jn})}{\partial w_{nm}}f'(y_{in})^{2}\biggr) \\
&= 2x_{im}v_{n}^{3}f'(y_{in})f''(y_{in})f''(y_{jn}) + x_{jm}v_{n}^{3}f'''(y_{jn})f'(y_{in})^{2},
\end{aligned}
$$

where

$$
\begin{aligned}
\frac{\partial f'(y_{in})^{2}}{\partial w_{nm}} &= \frac{\partial f'(y_{in})^{2}}{\partial f'(y_{in})}\frac{\partial f'(y_{in})}{\partial y_{in}}\frac{\partial y_{in}}{\partial w_{nm}}, \\
\frac{\partial f''(y_{jn})}{\partial w_{nm}} &= \frac{\partial f''(y_{jn})}{\partial y_{jn}}\frac{\partial y_{jn}}{\partial w_{nm}}
\end{aligned}
$$

are utilized\. Thus, we have

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{12}}{\partial \bm{W}_{:m}}
&= 2x_{im}\widetilde{\bm{V}}^{\odot 3\top}\odot\bm{f}'(\bm{y}_{i})\odot\bm{f}''(\bm{y}_{i})\odot\bm{f}''(\bm{y}_{j}) + x_{jm}\widetilde{\bm{V}}^{\odot 3\top}\odot\bm{f}'''(\bm{y}_{j})\odot\bm{f}'(\bm{y}_{i})^{\odot 2} \\
&= 2x_{im}\bm{F}'(\bm{y}_{i})\bm{F}''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 3\top} + x_{jm}\bm{F}'''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})^{2}\widetilde{\bm{V}}^{\odot 3\top}
\end{aligned}
$$

and using Eqs\. \([38](https://arxiv.org/html/2606.28662#S4.E38)\) and \([39](https://arxiv.org/html/2606.28662#S4.E39)\) yields

$$
\begin{aligned}
\frac{\partial (\bm{\Phi}_{ij})_{12}}{\partial \bm{w}}
&= \bm{h}(\bm{x}_{i})\otimes\bigl(2\bm{F}'(\bm{y}_{i})\bm{F}''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 3\top}\bigr) + \bm{h}(\bm{x}_{j})\otimes\bigl(\bm{F}'''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})^{2}\widetilde{\bm{V}}^{\odot 3\top}\bigr) \\
&= \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{b}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{c}}^{\Phi}_{ji}.
\end{aligned}
$$

Also, since $(\bm{\Phi}_{ij})_{21} = (\bm{\Phi}_{ji})_{12}$ from Eqs\. \([7](https://arxiv.org/html/2606.28662#S3.E7)\) and \([8](https://arxiv.org/html/2606.28662#S3.E8)\),

$$
\frac{\partial (\bm{\Phi}_{ij})_{21}}{\partial \bm{w}} = \frac{\partial (\bm{\Phi}_{ji})_{12}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{c}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{b}}^{\Phi}_{ji}
$$

is obtained\. ∎
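Eq\. (106) involves the third derivative of the activation, which is easy to get wrong by hand, so a numerical check is useful. The sketch below assumes $(\bm{\Phi}_{ij})_{12} = \sum_{n} v_{n}^{3} f'(y_{in})^{2} f''(y_{jn})$ (inferred from the first line of the proof, since Eq\. (7) is not shown here), $f=\tanh$, and a bias-first column stacking of $\bm{w}$. The names `b_ij` and `c_ji` mirror $\bm{\mathfrak{b}}^{\Phi}_{ij}$ and $\bm{\mathfrak{c}}^{\Phi}_{ji}$ as they appear in the last display of the proof.

```python
import numpy as np

def f1(y):
    """f'(y) for the assumed activation f = tanh."""
    return 1.0 - np.tanh(y) ** 2

def f2(y):
    """f''(y) = -2 tanh(y) f'(y)."""
    return -2.0 * np.tanh(y) * f1(y)

def f3(y):
    """f'''(y) = f'(y) (6 tanh(y)^2 - 2)."""
    return f1(y) * (6.0 * np.tanh(y) ** 2 - 2.0)

def h(x):
    return np.concatenate(([1.0], x))

def preact(w_flat, x, N):
    """Hidden pre-activation y = [b W] h(x), with w stacking columns bias-first."""
    return w_flat.reshape((N, x.size + 1), order="F") @ h(x)

def phi12(w_flat, xi, xj, v_tilde, N):
    """Assumed (Phi_ij)_12 = sum_n v_n^3 f'(y_in)^2 f''(y_jn)."""
    yi, yj = preact(w_flat, xi, N), preact(w_flat, xj, N)
    return np.sum(v_tilde ** 3 * f1(yi) ** 2 * f2(yj))

def lemma12_gradient(w_flat, xi, xj, v_tilde, N):
    """Closed form h(x_i) kron b_ij + h(x_j) kron c_ji."""
    yi, yj = preact(w_flat, xi, N), preact(w_flat, xj, N)
    b_ij = 2.0 * f1(yi) * f2(yi) * f2(yj) * v_tilde ** 3
    c_ji = f3(yj) * f1(yi) ** 2 * v_tilde ** 3
    return np.kron(h(xi), b_ij) + np.kron(h(xj), c_ji)

rng = np.random.default_rng(4)
M, N = 3, 4
xi, xj = rng.normal(size=M), rng.normal(size=M)
v_tilde = rng.normal(size=N)
w = 0.5 * rng.normal(size=(M + 1) * N)

eps = 1e-6
numeric = np.array([
    (phi12(w + eps * e, xi, xj, v_tilde, N) - phi12(w - eps * e, xi, xj, v_tilde, N)) / (2.0 * eps)
    for e in np.eye(w.size)
])
closed = lemma12_gradient(w, xi, xj, v_tilde, N)
max_err = np.max(np.abs(closed - numeric))
print("max abs difference:", max_err)
```

Swapping the roles of `xi` and `xj` in `phi12` gives the corresponding check for $(\bm{\Phi}_{ij})_{21}$ in Eq\. (107), since $(\bm{\Phi}_{ij})_{21} = (\bm{\Phi}_{ji})_{12}$.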
###### Lemma 13\.
$$
\frac{\partial (\bm{\Phi}_{ij})_{22}}{\partial \bm{w}} = \bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{d}}^{\Phi}_{ij} + \bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{d}}^{\Phi}_{ji} \in \mathbb{R}^{(M+1)N}. \tag{108}
$$
###### Proof\.
Fromyn=∑m=1Mwnmxm\+bny\_\{n\}=\\sum\_\{m=1\}^\{M\}w\_\{nm\}x\_\{m\}\+b\_\{n\},wnmw\_\{nm\}appears only inyny\_\{n\}\. Also, from Eq\. \([9](https://arxiv.org/html/2606.28662#S3.E9)\), since
∂\(𝚽ij\)22∂wnm=vn2∂f′′\(yin\)f′′\(yjn\)∂wnm\\displaystyle\\frac\{\\mathop\{\}\\\!\\partial\(\\bm\{\\Phi\}\_\{ij\}\)\_\{22\}\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}=v\_\{n\}^\{2\}\\frac\{\\mathop\{\}\\\!\\partial f^\{\\prime\\prime\}\(y\_\{in\}\)f^\{\\prime\\prime\}\(y\_\{jn\}\)\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}=vn2\(∂f′′\(yin\)∂wnmf′′\(yjn\)\+∂f′′\(yjn\)∂wnmf′′\(yin\)\)\\displaystyle=v\_\{n\}^\{2\}\\bigg\(\\frac\{\\mathop\{\}\\\!\\partial f^\{\\prime\\prime\}\(y\_\{in\}\)\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}f^\{\\prime\\prime\}\(y\_\{jn\}\)\+\\frac\{\\mathop\{\}\\\!\\partial f^\{\\prime\\prime\}\(y\_\{jn\}\)\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}f^\{\\prime\\prime\}\(y\_\{in\}\)\\bigg\)=vn2\(∂f′′\(yin\)∂yin∂yin∂wnmf′′\(yjn\)\+∂f′′\(yjn\)∂yjn∂yjn∂wnmf′′\(yin\)\)\\displaystyle=v\_\{n\}^\{2\}\\bigg\(\\frac\{\\mathop\{\}\\\!\\partial f^\{\\prime\\prime\}\(y\_\{in\}\)\}\{\\mathop\{\}\\\!\\partial y\_\{in\}\}\\frac\{\\mathop\{\}\\\!\\partial y\_\{in\}\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}f^\{\\prime\\prime\}\(y\_\{jn\}\)\+\\frac\{\\mathop\{\}\\\!\\partial f^\{\\prime\\prime\}\(y\_\{jn\}\)\}\{\\mathop\{\}\\\!\\partial y\_\{jn\}\}\\frac\{\\mathop\{\}\\\!\\partial y\_\{jn\}\}\{\\mathop\{\}\\\!\\partial w\_\{nm\}\}f^\{\\prime\\prime\}\(y\_\{in\}\)\\bigg\)=vn2\(ximf′′′\(yin\)f′′\(yjn\)\+xjmf′′′\(yjn\)f′′\(yin\)\),\\displaystyle=v\_\{n\}^\{2\}\\big\(x\_\{im\}f^\{\\prime\\prime\\prime\}\(y\_\{in\}\)f^\{\\prime\\prime\}\(y\_\{jn\}\)\+x\_\{jm\}f^\{\\prime\\prime\\prime\}\(y\_\{jn\}\)f^\{\\prime\\prime\}\(y\_\{in\}\)\\big\),we have
$$\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{W}_{:m}}=x_{im}\bm{F}'''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}+x_{jm}\bm{F}'''(\bm{y}_{j})\bm{F}''(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}$$
and using Eq\. \([40](https://arxiv.org/html/2606.28662#S4.E40)\),
$$\begin{aligned}
\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{w}}
&=\bm{h}(\bm{x}_{i})\otimes\big(\bm{F}'''(\bm{y}_{i})\bm{F}''(\bm{y}_{j})\widetilde{\bm{V}}^{\odot 2\top}\big)+\bm{h}(\bm{x}_{j})\otimes\big(\bm{F}'''(\bm{y}_{j})\bm{F}''(\bm{y}_{i})\widetilde{\bm{V}}^{\odot 2\top}\big)\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{d}}^{\Phi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{d}}^{\Phi}_{ji}
\end{aligned}$$
is obtained\. ∎
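The scalar step in the proof above can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it assumes $f=\tanh$ (the paper's activation may differ), uses hypothetical names (`phi22_term`, `analytic_grad`, `numeric_grad`), and compares the closed-form derivative $v_{n}^{2}\big(x_{im}f'''(y_{in})f''(y_{jn})+x_{jm}f'''(y_{jn})f''(y_{in})\big)$ with a central finite difference for a single hidden unit $n$.

```python
import math

# Assumption: f = tanh, chosen only for illustration.
def f2(y):
    """Second derivative of tanh: -2 t (1 - t^2) with t = tanh(y)."""
    t = math.tanh(y)
    return -2.0 * t * (1.0 - t * t)

def f3(y):
    """Third derivative of tanh: -2 (1 - t^2) (1 - 3 t^2)."""
    t = math.tanh(y)
    return -2.0 * (1.0 - t * t) * (1.0 - 3.0 * t * t)

def pre_activation(w_n, b_n, x):
    """y_n = sum_m w_nm x_m + b_n for one hidden unit n."""
    return sum(w * xm for w, xm in zip(w_n, x)) + b_n

def phi22_term(w_n, b_n, v_n, x_i, x_j):
    """Contribution of unit n to (Phi_ij)_22: v_n^2 f''(y_in) f''(y_jn)."""
    y_i = pre_activation(w_n, b_n, x_i)
    y_j = pre_activation(w_n, b_n, x_j)
    return v_n ** 2 * f2(y_i) * f2(y_j)

def analytic_grad(w_n, b_n, v_n, x_i, x_j, m):
    """Closed-form derivative with respect to w_nm from the proof."""
    y_i = pre_activation(w_n, b_n, x_i)
    y_j = pre_activation(w_n, b_n, x_j)
    return v_n ** 2 * (x_i[m] * f3(y_i) * f2(y_j) + x_j[m] * f3(y_j) * f2(y_i))

def numeric_grad(w_n, b_n, v_n, x_i, x_j, m, eps=1e-6):
    """Central finite difference in the single weight w_nm."""
    w_plus = list(w_n)
    w_minus = list(w_n)
    w_plus[m] += eps
    w_minus[m] -= eps
    hi = phi22_term(w_plus, b_n, v_n, x_i, x_j)
    lo = phi22_term(w_minus, b_n, v_n, x_i, x_j)
    return (hi - lo) / (2.0 * eps)

# Arbitrary illustrative values (M = 3 inputs).
w_n, b_n, v_n = [0.3, -0.7, 0.2], 0.1, 1.5
x_i, x_j = [0.5, -1.0, 0.25], [-0.4, 0.8, 1.1]

max_err = max(
    abs(analytic_grad(w_n, b_n, v_n, x_i, x_j, m)
        - numeric_grad(w_n, b_n, v_n, x_i, x_j, m))
    for m in range(len(w_n))
)
print("max |analytic - numeric| =", max_err)
```

Because $w_{nm}$ enters only through $y_{n}$, checking one hidden unit at a time is enough; the other units contribute zero to this partial derivative.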
###### Lemma 14\.
$$\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Phi}=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Phi}+\bm{\mathcal{B}}_{ij}^{\Phi})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Phi}+\bm{\mathcal{B}}_{ji}^{\Phi})\bigg).\tag{109}$$
###### Proof\.
Since the Kronecker product of $\bm{o}_{i}$ and $\bm{o}_{j}$ is given by
$$\bm{o}_{i}\otimes\bm{o}_{j}=\begin{bmatrix}s'(z_{i})s'(z_{j})&s'(z_{i})\delta_{j}&\delta_{i}s'(z_{j})&\delta_{i}\delta_{j}\end{bmatrix}^{\top},\tag{110}$$
we have
$$\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ji}&\bm{\mathfrak{c}}^{\Phi}_{ji}&\bm{\mathfrak{b}}^{\Phi}_{ji}&\bm{\mathfrak{d}}^{\Phi}_{ji}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})=\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ji}&\bm{\mathfrak{b}}^{\Phi}_{ji}&\bm{\mathfrak{c}}^{\Phi}_{ji}&\bm{\mathfrak{d}}^{\Phi}_{ji}\end{bmatrix}(\bm{o}_{j}\otimes\bm{o}_{i})\in\mathbb{R}^{N}.\tag{111}$$
From Eqs\. \([36](https://arxiv.org/html/2606.28662#S4.E36)\), \([101](https://arxiv.org/html/2606.28662#A2.E101)\), \([105](https://arxiv.org/html/2606.28662#A2.E105)\), \([106](https://arxiv.org/html/2606.28662#A2.E106)\), \([107](https://arxiv.org/html/2606.28662#A2.E107)\), \([108](https://arxiv.org/html/2606.28662#A2.E108)\), \([110](https://arxiv.org/html/2606.28662#A2.E110)\), and \([111](https://arxiv.org/html/2606.28662#A2.E111)\), the first term of Eq\. \([99](https://arxiv.org/html/2606.28662#A2.E99)\) becomes
$$\begin{aligned}
\frac{\partial\bm{\phi}_{ij}}{\partial\bm{w}}(\bm{o}_{i}\otimes\bm{o}_{j})
&=\begin{bmatrix}\frac{\partial(\bm{\Phi}_{ij})_{11}}{\partial\bm{w}}&\frac{\partial(\bm{\Phi}_{ij})_{12}}{\partial\bm{w}}&\frac{\partial(\bm{\Phi}_{ij})_{21}}{\partial\bm{w}}&\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{w}}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=\bm{h}(\bm{x}_{i})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ij}&\bm{\mathfrak{b}}^{\Phi}_{ij}&\bm{\mathfrak{c}}^{\Phi}_{ij}&\bm{\mathfrak{d}}^{\Phi}_{ij}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ji}&\bm{\mathfrak{c}}^{\Phi}_{ji}&\bm{\mathfrak{b}}^{\Phi}_{ji}&\bm{\mathfrak{d}}^{\Phi}_{ji}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=\bm{h}(\bm{x}_{i})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ij}&\bm{\mathfrak{b}}^{\Phi}_{ij}&\bm{\mathfrak{c}}^{\Phi}_{ij}&\bm{\mathfrak{d}}^{\Phi}_{ij}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Phi}_{ji}&\bm{\mathfrak{b}}^{\Phi}_{ji}&\bm{\mathfrak{c}}^{\Phi}_{ji}&\bm{\mathfrak{d}}^{\Phi}_{ji}\end{bmatrix}(\bm{o}_{j}\otimes\bm{o}_{i})\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{G}^{\Phi}_{ij}(\bm{o}_{i}\otimes\bm{o}_{j})+\bm{h}(\bm{x}_{j})\otimes\bm{G}^{\Phi}_{ji}(\bm{o}_{j}\otimes\bm{o}_{i}).
\end{aligned}\tag{112}$$
From Eqs\. \([68](https://arxiv.org/html/2606.28662#A1.E68)\) and \([104](https://arxiv.org/html/2606.28662#A2.E104)\), the second and third terms of Eq\. \([99](https://arxiv.org/html/2606.28662#A2.E99)\) become
$$\frac{\partial\bm{o}_{i}}{\partial\bm{w}}\bm{\Phi}_{ij}\bm{o}_{j}=\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z_{i})^{\top}\bm{\Phi}_{ij}\bm{o}_{j}),\tag{113}$$
$$\frac{\partial\bm{o}_{j}}{\partial\bm{w}}\bm{\Phi}_{ji}\bm{o}_{i}=\bm{h}(\bm{x}_{j})\otimes(\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z_{j})^{\top}\bm{\Phi}_{ji}\bm{o}_{i}).\tag{114}$$
From Eqs\. \([29](https://arxiv.org/html/2606.28662#S4.E29)\), \([32](https://arxiv.org/html/2606.28662#S4.E32)\), \([69](https://arxiv.org/html/2606.28662#A1.E69)\), \([96](https://arxiv.org/html/2606.28662#A2.E96)\), \([99](https://arxiv.org/html/2606.28662#A2.E99)\), \([112](https://arxiv.org/html/2606.28662#A2.E112)\), \([113](https://arxiv.org/html/2606.28662#A2.E113)\), and \([114](https://arxiv.org/html/2606.28662#A2.E114)\),
$$\begin{aligned}
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Phi}
&=\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\frac{\partial\phi_{ij}}{\partial\bm{w}}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Phi}+\bm{\mathcal{B}}_{ij}^{\Phi})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Phi}+\bm{\mathcal{B}}_{ji}^{\Phi})\bigg)
\end{aligned}$$
holds\. ∎
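The column swap in Eq\. \(111\) rests on the fact that $\bm{o}_{i}\otimes\bm{o}_{j}$ and $\bm{o}_{j}\otimes\bm{o}_{i}$ differ only in the order of their two middle entries. A minimal sketch that illustrates this with arbitrary numbers follows; the helper names `kron` and `combine` and all numerical values are hypothetical, and the vectors `a`, `b`, `c`, `d` merely stand in for $\bm{\mathfrak{a}}^{\Phi}_{ji},\bm{\mathfrak{b}}^{\Phi}_{ji},\bm{\mathfrak{c}}^{\Phi}_{ji},\bm{\mathfrak{d}}^{\Phi}_{ji}\in\mathbb{R}^{N}$.

```python
def kron(u, v):
    """Kronecker product of two vectors given as lists."""
    return [a * b for a in u for b in v]

def combine(columns, coeffs):
    """Multiply the matrix [col_0 col_1 ...] by the coefficient vector."""
    n_rows = len(columns[0])
    return [
        sum(col[r] * k for col, k in zip(columns, coeffs))
        for r in range(n_rows)
    ]

# o_i = [s'(z_i), delta_i]; the values are arbitrary stand-ins.
o_i = [0.21, -0.40]
o_j = [0.17, 0.35]

# Stand-ins for the four N-vectors (here N = 3).
a = [1.0, -2.0, 0.5]
b = [0.3, 0.7, -1.1]
c = [-0.6, 0.2, 0.9]
d = [2.0, 1.5, -0.4]

# Left side of Eq. (111): columns ordered a, c, b, d with o_i (x) o_j.
lhs = combine([a, c, b, d], kron(o_i, o_j))
# Right side of Eq. (111): columns ordered a, b, c, d with o_j (x) o_i.
rhs = combine([a, b, c, d], kron(o_j, o_i))

max_diff = max(abs(l - r) for l, r in zip(lhs, rhs))
print("max |lhs - rhs| =", max_diff)
```

The two sides agree up to floating-point rounding, since swapping the middle two columns of the matrix exactly compensates for swapping the middle two entries of the Kronecker product.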
#### B\-D4 Gradients of the Psi Term
Here, we derive $[\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})/\partial\bm{w}]_{\Psi}$, which appears in Eq\. \([97](https://arxiv.org/html/2606.28662#A2.E97)\)\.
###### Lemma 15\.
$$\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{a}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{a}}^{\Psi}_{ji}\in\mathbb{R}^{(M+1)N}.\tag{115}$$
###### Proof\.
From Eq\. \([12](https://arxiv.org/html/2606.28662#S3.E12)\),
$$\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{\theta}}
&=\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\\
&=\frac{\partial\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})}{\partial\bm{\theta}}\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&\quad+(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\frac{\partial\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\\
&=\bigg(\frac{\partial\bm{f}(\bm{y}_{i})}{\partial\bm{\theta}}\bm{f}(\bm{y}_{j})+\frac{\partial\bm{f}(\bm{y}_{j})}{\partial\bm{\theta}}\bm{f}(\bm{y}_{i})\bigg)\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&\quad+(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\\
&\quad\times\bigg(\frac{\partial\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\bm{F}'(\bm{y}_{i})+\frac{\partial\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\bm{F}'(\bm{y}_{j})\bigg)\widetilde{\bm{V}}^{\top}
\end{aligned}\tag{116}$$
holds\. Note that $\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}$ is an inner product because $\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})$ is a row vector and $\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}$ is a column vector\. Therefore, Eq\. \([63](https://arxiv.org/html/2606.28662#A1.E63)\) is applied\. From Eqs\. \([42](https://arxiv.org/html/2606.28662#S4.E42)\), \([68](https://arxiv.org/html/2606.28662#A1.E68)\), \([79](https://arxiv.org/html/2606.28662#A2.E79)\), \([84](https://arxiv.org/html/2606.28662#A2.E84)\), and \([116](https://arxiv.org/html/2606.28662#A2.E116)\),
$$\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{w}}
&=\Big(\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j}))+\bm{h}(\bm{x}_{j})\otimes(\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i}))\Big)\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&\quad+(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\Big(\bm{h}(\bm{x}_{j})\otimes(\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i}))\\
&\qquad+\bm{h}(\bm{x}_{i})\otimes(\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j}))\Big)\widetilde{\bm{V}}^{\top}\\
&=\bm{h}(\bm{x}_{i})\otimes\Big(\big(\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j})\widetilde{\bm{V}}\bm{F}'(\bm{y}_{i})+\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\big)\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}\Big)\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\Big(\big(\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})+\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{j})(1+\bm{f}(\bm{y}_{j})^{\top}\bm{f}(\bm{y}_{i}))\big)\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\Big)\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{a}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{a}}^{\Psi}_{ji}
\end{aligned}$$
is obtained\. ∎
###### Lemma 16\.
$$\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{b}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{c}}^{\Psi}_{ji}\in\mathbb{R}^{(M+1)N},\tag{117}$$
$$\frac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{c}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{b}}^{\Psi}_{ji}\in\mathbb{R}^{(M+1)N}.\tag{118}$$
###### Proof\.
From Eq\. \([13](https://arxiv.org/html/2606.28662#S3.E13)\),
$$\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{\theta}}
&=\frac{\partial\bm{f}(\bm{y}_{i})^{\top}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\\
&=\frac{\partial\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})}{\partial\bm{\theta}}\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}+\frac{\partial\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}}{\partial\bm{\theta}}\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})
\end{aligned}\tag{119}$$
holds\. Note that $\bm{f}(\bm{y}_{i})^{\top}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}$ is an inner product because $\bm{f}(\bm{y}_{i})^{\top}\bm{F}'(\bm{y}_{j})$ is a row vector and $\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}$ is a column vector\. Therefore, Eq\. \([63](https://arxiv.org/html/2606.28662#A1.E63)\) is applied\. Since $\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})=\begin{bmatrix}f'(y_{j1})f(y_{i1})&\cdots&f'(y_{jN})f(y_{iN})\end{bmatrix}^{\top}$, differentiating $f'(y_{jn})f(y_{in})$ with respect to $w_{nm}$ yields
$$\begin{aligned}
\frac{\partial f'(y_{jn})f(y_{in})}{\partial w_{nm}}
&=\frac{\partial f'(y_{jn})}{\partial w_{nm}}f(y_{in})+\frac{\partial f(y_{in})}{\partial w_{nm}}f'(y_{jn})\\
&=\frac{\partial f'(y_{jn})}{\partial y_{jn}}\frac{\partial y_{jn}}{\partial w_{nm}}f(y_{in})+\frac{\partial f(y_{in})}{\partial y_{in}}\frac{\partial y_{in}}{\partial w_{nm}}f'(y_{jn})\\
&=x_{jm}f''(y_{jn})f(y_{in})+x_{im}f'(y_{in})f'(y_{jn}).
\end{aligned}$$
Therefore,
$$\begin{aligned}
\frac{\partial\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})}{\partial w_{nm}}
&=\begin{bmatrix}\frac{\partial f'(y_{j1})f(y_{i1})}{\partial w_{nm}}&\cdots&\frac{\partial f'(y_{jn})f(y_{in})}{\partial w_{nm}}&\cdots&\frac{\partial f'(y_{jN})f(y_{iN})}{\partial w_{nm}}\end{bmatrix}\\
&=\begin{bmatrix}0&\cdots&x_{jm}f''(y_{jn})f(y_{in})+x_{im}f'(y_{in})f'(y_{jn})&\cdots&0\end{bmatrix},
\end{aligned}$$
and
$$\frac{\partial\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})}{\partial\bm{W}_{:m}}=x_{im}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})+x_{jm}\bm{F}''(\bm{y}_{j})\bm{F}(\bm{y}_{i}),$$
$$\frac{\partial\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i}))+\bm{h}(\bm{x}_{j})\otimes(\bm{F}''(\bm{y}_{j})\bm{F}(\bm{y}_{i}))\tag{120}$$
are obtained\. Finally, from Eqs\. \([43](https://arxiv.org/html/2606.28662#S4.E43)\), \([44](https://arxiv.org/html/2606.28662#S4.E44)\), \([68](https://arxiv.org/html/2606.28662#A1.E68)\), \([69](https://arxiv.org/html/2606.28662#A1.E69)\), \([79](https://arxiv.org/html/2606.28662#A2.E79)\), \([119](https://arxiv.org/html/2606.28662#A2.E119)\), and \([120](https://arxiv.org/html/2606.28662#A2.E120)\), we obtain
$$\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{w}}
&=\Big(\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i}))+\bm{h}(\bm{x}_{j})\otimes(\bm{F}''(\bm{y}_{j})\bm{F}(\bm{y}_{i}))\Big)\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&\quad+\bm{h}(\bm{x}_{i})\otimes(\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i}))\\
&=\bm{h}(\bm{x}_{i})\otimes((\bm{F}'(\bm{y}_{i})^{2}+\bm{F}''(\bm{y}_{i})\bm{F}(\bm{y}_{i}))\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top})\\
&\quad+\bm{h}(\bm{x}_{j})\otimes(\bm{F}''(\bm{y}_{j})\bm{F}(\bm{y}_{i})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top})\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{b}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{c}}^{\Psi}_{ji},
\end{aligned}\tag{121}$$
where
$$\mathrm{diag}(\widetilde{\bm{V}}^{\top})\bm{F}''(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})=\bm{F}''(\bm{y}_{i})\bm{F}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}$$
is used\. Also, from Eqs\. \([13](https://arxiv.org/html/2606.28662#S3.E13)\) and \([14](https://arxiv.org/html/2606.28662#S3.E14)\),
$$\begin{aligned}
(\bm{\Psi}_{ij})_{21}
&=\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j})\\
&=\bm{f}(\bm{y}_{j})^{\top}\bm{F}'(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}=(\bm{\Psi}_{ji})_{12}
\end{aligned}\tag{122}$$
holds\. Therefore, from Eq\. \([117](https://arxiv.org/html/2606.28662#A2.E117)\),
$$\frac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{w}}=\frac{\partial(\bm{\Psi}_{ji})_{12}}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{c}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{b}}^{\Psi}_{ji}$$
is obtained\. ∎
###### Lemma 17\.
$$\frac{\partial(\bm{\Psi}_{ij})_{22}}{\partial\bm{w}}=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{d}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{d}}^{\Psi}_{ji}\in\mathbb{R}^{(M+1)N}.\tag{123}$$
###### Proof\.
From Eqs\. \([45](https://arxiv.org/html/2606.28662#S4.E45)\), \([63](https://arxiv.org/html/2606.28662#A1.E63)\), and \([85](https://arxiv.org/html/2606.28662#A2.E85)\),
$$\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{22}}{\partial\bm{w}}
&=\frac{\partial\bm{f}'(\bm{y}_{i})^{\top}\bm{f}'(\bm{y}_{j})}{\partial\bm{w}}\\
&=\frac{\partial\bm{f}'(\bm{y}_{i})}{\partial\bm{w}}\bm{f}'(\bm{y}_{j})+\frac{\partial\bm{f}'(\bm{y}_{j})}{\partial\bm{w}}\bm{f}'(\bm{y}_{i})\\
&=\bm{h}(\bm{x}_{i})\otimes(\bm{F}''(\bm{y}_{i})\bm{f}'(\bm{y}_{j}))+\bm{h}(\bm{x}_{j})\otimes(\bm{F}''(\bm{y}_{j})\bm{f}'(\bm{y}_{i}))\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathfrak{d}}^{\Psi}_{ij}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathfrak{d}}^{\Psi}_{ji}
\end{aligned}$$
is obtained\. ∎
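The Kronecker structure of Eq\. \(123\) can be checked against finite differences. The sketch below is illustrative only and makes several assumptions that are not taken from the paper: $f=\tanh$, $\bm{h}(\bm{x})=[x_{1},\dots,x_{M},1]^{\top}$ (bias entry last), and $\bm{w}$ stacking the columns of $[\bm{W}\ \bm{b}]$ so that entry $mN+n$ holds $w_{nm}$. The paper's own definitions fix the actual ordering; all function names are hypothetical.

```python
import math

def f1(y):
    """First derivative of tanh (assumed activation)."""
    t = math.tanh(y)
    return 1.0 - t * t

def f2(y):
    """Second derivative of tanh."""
    t = math.tanh(y)
    return -2.0 * t * (1.0 - t * t)

def kron(u, v):
    """Kronecker product of two vectors given as lists."""
    return [a * b for a in u for b in v]

def h(x):
    """Assumed augmented input: the features followed by 1 for the bias."""
    return list(x) + [1.0]

def pre_activations(w, x, n_hidden):
    """y_n = sum_m w[m * N + n] * h(x)[m], with the assumed column stacking."""
    hx = h(x)
    return [
        sum(w[m * n_hidden + n] * hx[m] for m in range(len(hx)))
        for n in range(n_hidden)
    ]

def psi22(w, x_i, x_j, n_hidden):
    """(Psi_ij)_22 = f'(y_i)^T f'(y_j)."""
    y_i = pre_activations(w, x_i, n_hidden)
    y_j = pre_activations(w, x_j, n_hidden)
    return sum(f1(p) * f1(q) for p, q in zip(y_i, y_j))

def analytic_grad(w, x_i, x_j, n_hidden):
    """h(x_i) (x) (F''(y_i) f'(y_j)) + h(x_j) (x) (F''(y_j) f'(y_i))."""
    y_i = pre_activations(w, x_i, n_hidden)
    y_j = pre_activations(w, x_j, n_hidden)
    d_ij = [f2(p) * f1(q) for p, q in zip(y_i, y_j)]
    d_ji = [f2(q) * f1(p) for p, q in zip(y_i, y_j)]
    left = kron(h(x_i), d_ij)
    right = kron(h(x_j), d_ji)
    return [a + b for a, b in zip(left, right)]

def numeric_grad(w, x_i, x_j, n_hidden, eps=1e-6):
    """Central finite differences over every entry of w."""
    grad = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        diff = psi22(w_plus, x_i, x_j, n_hidden) - psi22(w_minus, x_i, x_j, n_hidden)
        grad.append(diff / (2.0 * eps))
    return grad

# Arbitrary illustrative sizes and values: M = 2 inputs, N = 3 hidden units.
N_HIDDEN = 3
w = [-0.2, 0.4, 0.1, 0.3, -0.5, 0.2, 0.05, -0.1, 0.15]
x_i, x_j = [0.5, -1.0], [-0.4, 0.8]

ga = analytic_grad(w, x_i, x_j, N_HIDDEN)
gn = numeric_grad(w, x_i, x_j, N_HIDDEN)
max_err = max(abs(a - b) for a, b in zip(ga, gn))
print("max |analytic - numeric| =", max_err)
```

Under these assumptions the gradient has $(M+1)N=9$ entries, matching the dimension stated in Eq\. \(123\), and the last $N$ entries correspond to the bias column, where $\bm{h}$ contributes the factor 1.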
###### Lemma 18\.
$$\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Psi}=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Psi}+\bm{\mathcal{B}}_{ij}^{\Psi})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Psi}+\bm{\mathcal{B}}_{ji}^{\Psi})\bigg).\tag{124}$$
###### Proof\.
From Eqs\. \([41](https://arxiv.org/html/2606.28662#S4.E41)\), \([102](https://arxiv.org/html/2606.28662#A2.E102)\), \([111](https://arxiv.org/html/2606.28662#A2.E111)\), \([115](https://arxiv.org/html/2606.28662#A2.E115)\), \([117](https://arxiv.org/html/2606.28662#A2.E117)\), \([118](https://arxiv.org/html/2606.28662#A2.E118)\), and \([123](https://arxiv.org/html/2606.28662#A2.E123)\), the first term of Eq\. \([100](https://arxiv.org/html/2606.28662#A2.E100)\) becomes
$$\begin{aligned}
\frac{\partial\bm{\psi}_{ij}}{\partial\bm{w}}(\bm{o}_{i}\otimes\bm{o}_{j})
&=\begin{bmatrix}\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{w}}&\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{w}}&\frac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{w}}&\frac{\partial(\bm{\Psi}_{ij})_{22}}{\partial\bm{w}}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=\bm{h}(\bm{x}_{i})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Psi}_{ij}&\bm{\mathfrak{b}}^{\Psi}_{ij}&\bm{\mathfrak{c}}^{\Psi}_{ij}&\bm{\mathfrak{d}}^{\Psi}_{ij}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Psi}_{ji}&\bm{\mathfrak{c}}^{\Psi}_{ji}&\bm{\mathfrak{b}}^{\Psi}_{ji}&\bm{\mathfrak{d}}^{\Psi}_{ji}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=\bm{h}(\bm{x}_{i})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Psi}_{ij}&\bm{\mathfrak{b}}^{\Psi}_{ij}&\bm{\mathfrak{c}}^{\Psi}_{ij}&\bm{\mathfrak{d}}^{\Psi}_{ij}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\begin{bmatrix}\bm{\mathfrak{a}}^{\Psi}_{ji}&\bm{\mathfrak{b}}^{\Psi}_{ji}&\bm{\mathfrak{c}}^{\Psi}_{ji}&\bm{\mathfrak{d}}^{\Psi}_{ji}\end{bmatrix}(\bm{o}_{j}\otimes\bm{o}_{i})\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{G}^{\Psi}_{ij}(\bm{o}_{i}\otimes\bm{o}_{j})+\bm{h}(\bm{x}_{j})\otimes\bm{G}^{\Psi}_{ji}(\bm{o}_{j}\otimes\bm{o}_{i}).
\end{aligned}\tag{125}$$
From Eqs\. \([68](https://arxiv.org/html/2606.28662#A1.E68)\) and \([104](https://arxiv.org/html/2606.28662#A2.E104)\), the second and third terms of Eq\. \([100](https://arxiv.org/html/2606.28662#A2.E100)\) become
$$\frac{\partial\bm{o}_{i}}{\partial\bm{w}}\bm{\Psi}_{ij}\bm{o}_{j}=\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z_{i})^{\top}\bm{\Psi}_{ij}\bm{o}_{j}),\tag{126}$$
$$\frac{\partial\bm{o}_{j}}{\partial\bm{w}}\bm{\Psi}_{ji}\bm{o}_{i}=\bm{h}(\bm{x}_{j})\otimes(\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}\bm{s}^{\prime\prime/\prime}(z_{j})^{\top}\bm{\Psi}_{ji}\bm{o}_{i}).\tag{127}$$
From Eqs\. \([30](https://arxiv.org/html/2606.28662#S4.E30)\), \([33](https://arxiv.org/html/2606.28662#S4.E33)\), \([69](https://arxiv.org/html/2606.28662#A1.E69)\), \([97](https://arxiv.org/html/2606.28662#A2.E97)\), \([100](https://arxiv.org/html/2606.28662#A2.E100)\), \([125](https://arxiv.org/html/2606.28662#A2.E125)\), \([126](https://arxiv.org/html/2606.28662#A2.E126)\), and \([127](https://arxiv.org/html/2606.28662#A2.E127)\),
$$\begin{aligned}
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Psi}
&=2\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})\frac{\partial\psi_{ij}}{\partial\bm{w}}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Psi}+\bm{\mathcal{B}}_{ij}^{\Psi})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Psi}+\bm{\mathcal{B}}_{ji}^{\Psi})\bigg)
\end{aligned}$$
holds\. ∎
#### B\-D5 Gradient of the Omega Term
With respect to $\omega_{ij}$, the following holds\.
###### Lemma 19\.
$$\omega_{ij}=\omega_{ji}.\tag{128}$$
###### Proof\.
From Eq\. \([16](https://arxiv.org/html/2606.28662#S3.E16)\), $\bm{\Omega}_{ij}=\bm{\Omega}_{ji}$ holds\. Therefore, we obtain
$$\omega_{ij}=\bm{o}_{i}^{\top}\bm{\Omega}_{ij}\bm{o}_{j}=\bm{o}_{j}^{\top}\bm{\Omega}_{ji}\bm{o}_{i}=\omega_{ji}.$$
Note that $\omega_{ij}=s'(z_{i})s'(z_{j})$ from Eq\. \([10](https://arxiv.org/html/2606.28662#S3.E10)\)\. ∎
###### Lemma 20\.
$$
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Omega}=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Omega}+\bm{\mathcal{B}}_{ij}^{\Omega})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Omega}+\bm{\mathcal{B}}_{ji}^{\Omega})\bigg). \tag{129}
$$
###### Proof\.
From Eqs\. \([16](https://arxiv.org/html/2606.28662#S3.E16)\) and \([77](https://arxiv.org/html/2606.28662#A2.E77)\),
$$
\begin{aligned}
\frac{\partial\omega_{ij}}{\partial\bm{w}}&=\frac{\partial\,\bm{o}_{i}^{\top}\bm{\Omega}_{ij}\bm{o}_{j}}{\partial\bm{w}}=\frac{\partial\,s'(z_{i})s'(z_{j})}{\partial\bm{w}}\\
&=\frac{\partial s'(z_{i})}{\partial\bm{w}}s'(z_{j})+\frac{\partial s'(z_{j})}{\partial\bm{w}}s'(z_{i})\\
&=s''(z_{i})s'(z_{j})\,\bm{h}(\bm{x}_{i})\otimes(\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top})\\
&\quad+s''(z_{j})s'(z_{i})\,\bm{h}(\bm{x}_{j})\otimes(\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top})
\end{aligned}
$$
holds\. From Eq\. \([31](https://arxiv.org/html/2606.28662#S4.E31)\) we obtain
$$
\begin{aligned}
(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}\frac{\partial\omega_{ij}}{\partial\bm{w}}&=\bm{h}(\bm{x}_{i})\otimes\big((1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}s''(z_{i})s'(z_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\big)\\
&\quad+\bm{h}(\bm{x}_{j})\otimes\big((1+\bm{f}(\bm{y}_{j})^{\top}\bm{f}(\bm{y}_{i}))^{2}s''(z_{j})s'(z_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}\big)\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathcal{A}}_{ij}^{\Omega}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathcal{A}}_{ji}^{\Omega}.
\end{aligned}
$$
From Eqs\. \([34](https://arxiv.org/html/2606.28662#S4.E34)\), \([63](https://arxiv.org/html/2606.28662#A1.E63)\), and \([84](https://arxiv.org/html/2606.28662#A2.E84)\),
$$
\begin{aligned}
\omega_{ij}\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}}{\partial\bm{w}}&=\omega_{ij}\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}}{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))}\frac{\partial\,\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j})}{\partial\bm{w}}\\
&=2\omega_{ij}(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\bigg(\frac{\partial\bm{f}(\bm{y}_{i})}{\partial\bm{w}}\bm{f}(\bm{y}_{j})+\frac{\partial\bm{f}(\bm{y}_{j})}{\partial\bm{w}}\bm{f}(\bm{y}_{i})\bigg)\\
&=2\omega_{ij}(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\\
&\quad\times\big(\bm{h}(\bm{x}_{i})\otimes\bm{F}'(\bm{y}_{i})\bm{f}(\bm{y}_{j})+\bm{h}(\bm{x}_{j})\otimes\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})\big)\\
&=\bm{h}(\bm{x}_{i})\otimes\bm{\mathcal{B}}_{ij}^{\Omega}+\bm{h}(\bm{x}_{j})\otimes\bm{\mathcal{B}}_{ji}^{\Omega}
\end{aligned}
$$
holds\. Therefore, from Eqs\. \([69](https://arxiv.org/html/2606.28662#A1.E69)\) and \([98](https://arxiv.org/html/2606.28662#A2.E98)\),
$$
\begin{aligned}
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{\Omega}&=\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}\frac{\partial\omega_{ij}}{\partial\bm{w}}\\
&\quad+\sum_{i=1}^{I}\sum_{j=1}^{I}\omega_{ij}\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}}{\partial\bm{w}}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\bigg(\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Omega}+\bm{\mathcal{B}}_{ij}^{\Omega})+\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Omega}+\bm{\mathcal{B}}_{ji}^{\Omega})\bigg)
\end{aligned}
$$
is obtained\. ∎
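The product-rule step for $\partial\omega_{ij}/\partial\bm{w}$ in the proof above can be sanity-checked numerically. The following is a minimal sketch under assumptions that are not stated in this section: $s$ is taken to be the logistic sigmoid, $f$ is taken to be $\tanh$, $\bm{h}(\cdot)$ prepends a bias entry of 1, and $\bm{w}$ stacks the columns of the input-to-hidden weight matrix so that the Kronecker ordering $\bm{h}(\bm{x})\otimes(\cdot)$ matches. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def s1(z):
    """s'(z) for the (assumed) logistic sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def s2(z):
    """s''(z) for the (assumed) logistic sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def forward(w, v, x, n_hidden):
    """Return h(x), y = W h(x), and z = v_0 + sum_n v_n f(y_n) with f = tanh."""
    hx = [1.0] + list(x)  # h(x): bias entry followed by the input
    # Assumed layout: w[m * n_hidden + n] = W[n][m] (columns of W stacked).
    y = [sum(w[m * n_hidden + n] * hx[m] for m in range(len(hx)))
         for n in range(n_hidden)]
    z = v[0] + sum(v[n + 1] * math.tanh(y[n]) for n in range(n_hidden))
    return hx, y, z

def kron_h_fv(hx, y, v):
    """h(x) (x) (F'(y) V~^T), where f'(y) = 1 - tanh(y)^2."""
    return [hm * (1.0 - math.tanh(yn) ** 2) * v[n + 1]
            for hm in hx for n, yn in enumerate(y)]

def omega(w, v, xi, xj, n_hidden):
    """omega_ij = s'(z_i) s'(z_j)."""
    _, _, zi = forward(w, v, xi, n_hidden)
    _, _, zj = forward(w, v, xj, n_hidden)
    return s1(zi) * s1(zj)

def omega_grad_w(w, v, xi, xj, n_hidden):
    """Closed form: s''(z_i)s'(z_j) h(x_i)(x)(F'(y_i)V~^T) + (i <-> j)."""
    hxi, yi, zi = forward(w, v, xi, n_hidden)
    hxj, yj, zj = forward(w, v, xj, n_hidden)
    ki = kron_h_fv(hxi, yi, v)
    kj = kron_h_fv(hxj, yj, v)
    return [s2(zi) * s1(zj) * a + s2(zj) * s1(zi) * b for a, b in zip(ki, kj)]

def finite_difference(fun, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for k in range(len(w)):
        wp, wm = list(w), list(w)
        wp[k] += eps
        wm[k] -= eps
        grad.append((fun(wp) - fun(wm)) / (2.0 * eps))
    return grad

n_hidden = 2
w = [0.1, -0.4, 0.7, 0.2, -0.3, 0.5]  # W is 2 x 3 (two inputs plus bias)
v = [0.3, -0.7, 1.1]                  # v[0] is the output bias
xi, xj = [0.5, -1.2], [-0.8, 0.4]
closed = omega_grad_w(w, v, xi, xj, n_hidden)
numeric = finite_difference(lambda u: omega(u, v, xi, xj, n_hidden), w)
max_err_w = max(abs(a - b) for a, b in zip(closed, numeric))
print(max_err_w)
```

If the assumed conventions differ from the paper's (for example a row-major vectorization of the weights), only the index layout inside `forward` and `kron_h_fv` would need to change; the product-rule structure stays the same.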
#### B\-D6 Completion of the Proof
From Eqs\. \([95](https://arxiv.org/html/2606.28662#A2.E95)\), \([96](https://arxiv.org/html/2606.28662#A2.E96)\), \([97](https://arxiv.org/html/2606.28662#A2.E97)\), \([98](https://arxiv.org/html/2606.28662#A2.E98)\), \([109](https://arxiv.org/html/2606.28662#A2.E109)\), \([124](https://arxiv.org/html/2606.28662#A2.E124)\), and \([129](https://arxiv.org/html/2606.28662#A2.E129)\), we have
$$
\begin{aligned}
\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}&=\sum_{a\in\mathfrak{B}}\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{w}}\Bigg]_{a}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{A}}_{ij}^{\Phi}+\bm{\mathcal{A}}_{ij}^{\Psi}+\bm{\mathcal{A}}_{ij}^{\Omega}+\bm{\mathcal{B}}_{ij}^{\Phi}+\bm{\mathcal{B}}_{ij}^{\Psi}+\bm{\mathcal{B}}_{ij}^{\Omega})\\
&\quad+\sum_{i=1}^{I}\sum_{j=1}^{I}\bm{h}(\bm{x}_{j})\otimes(\bm{\mathcal{A}}_{ji}^{\Phi}+\bm{\mathcal{A}}_{ji}^{\Psi}+\bm{\mathcal{A}}_{ji}^{\Omega}+\bm{\mathcal{B}}_{ji}^{\Phi}+\bm{\mathcal{B}}_{ji}^{\Psi}+\bm{\mathcal{B}}_{ji}^{\Omega})\\
&=2\sum_{i=1}^{I}\sum_{j=1}^{I}\bm{h}(\bm{x}_{i})\otimes(\bm{\mathcal{W}}_{ij}^{\Phi}+\bm{\mathcal{W}}_{ij}^{\Psi}+\bm{\mathcal{W}}_{ij}^{\Omega}).
\end{aligned}
$$
Therefore, Eq\. \([28](https://arxiv.org/html/2606.28662#S4.E28)\) holds\.
### B\-E Proof for Eq\. \([46](https://arxiv.org/html/2606.28662#S4.E46)\)
#### B\-E1 Gradient with Respect to Affine Parameters from Hidden to Output Layer
$\partial\phi_{ij}/\partial\bm{\theta}$ contains $\partial\bm{o}/\partial\bm{\theta}$\. Therefore, we present the Jacobian of $\bm{o}$ with respect to $\bm{v}$ here\.
###### Lemma 21\.
$$
\frac{\partial\bm{o}}{\partial\bm{v}}=\bm{h}(\bm{f}(\bm{y}))\,\bm{s}^{\prime\prime/\prime}(z)^{\top}\in\mathbb{R}^{(N+1)\times 2}. \tag{130}
$$
###### Proof\.
From Eqs\. \([35](https://arxiv.org/html/2606.28662#S4.E35)\), \([89](https://arxiv.org/html/2606.28662#A2.E89)\), and \([92](https://arxiv.org/html/2606.28662#A2.E92)\),
$$
\begin{aligned}
\frac{\partial\bm{o}}{\partial\bm{v}}&=\begin{bmatrix}\dfrac{\partial s'(z)}{\partial\bm{v}}&\dfrac{\partial\delta}{\partial\bm{v}}\end{bmatrix}=\bm{h}(\bm{f}(\bm{y}))\begin{bmatrix}s''(z)&s'(z)\end{bmatrix}\\
&=\bm{h}(\bm{f}(\bm{y}))\,\bm{s}^{\prime\prime/\prime}(z)^{\top}
\end{aligned}
$$
holds, which is the Jacobian of $\bm{o}$ with respect to $\bm{v}$\. ∎
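As a numerical illustration of Lemma 21, the sketch below compares the closed-form Jacobian $\bm{h}(\bm{f}(\bm{y}))\,\bm{s}^{\prime\prime/\prime}(z)^{\top}$ with central finite differences. It rests on assumptions that this section does not state explicitly: $s$ is the logistic sigmoid, $f=\tanh$, $\bm{h}(\cdot)$ prepends a bias entry of 1, $z=\bm{v}^{\top}\bm{h}(\bm{f}(\bm{y}))$, and $\delta=s(z)-t$ for a target $t$ (chosen because it reproduces $\partial\delta/\partial\bm{v}=s'(z)\bm{h}(\bm{f}(\bm{y}))$). The helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def s1(z):
    """s'(z) for the (assumed) logistic sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def s2(z):
    """s''(z) for the (assumed) logistic sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def hidden_features(y):
    """h(f(y)): a bias entry 1 followed by f(y_n) = tanh(y_n)."""
    return [1.0] + [math.tanh(yn) for yn in y]

def pre_activation(v, y):
    """z = v^T h(f(y))."""
    return sum(vn * hn for vn, hn in zip(v, hidden_features(y)))

def o_vec(v, y, t):
    """o = [s'(z), delta] with the assumed delta = s(z) - t."""
    z = pre_activation(v, y)
    return [s1(z), sigmoid(z) - t]

def jacobian_closed_form(v, y):
    """(N+1) x 2 matrix h(f(y)) [s''(z), s'(z)]."""
    z = pre_activation(v, y)
    return [[hn * s2(z), hn * s1(z)] for hn in hidden_features(y)]

def jacobian_fd(v, y, t, eps=1e-6):
    """Central finite differences, one row per component of v."""
    rows = []
    for n in range(len(v)):
        vp, vm = list(v), list(v)
        vp[n] += eps
        vm[n] -= eps
        op, om = o_vec(vp, y, t), o_vec(vm, y, t)
        rows.append([(a - b) / (2.0 * eps) for a, b in zip(op, om)])
    return rows

v = [0.3, -0.7, 1.1, 0.4]  # v[0] is the bias weight
y = [0.2, -1.0, 0.5]
J = jacobian_closed_form(v, y)
J_fd = jacobian_fd(v, y, t=1.0)
max_err_o = max(abs(a - b) for ra, rb in zip(J, J_fd) for a, b in zip(ra, rb))
print(max_err_o)
```

The target `t` drops out of the Jacobian, which is why the closed form does not take it as an argument.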
#### B\-E2 Gradient of the Phi Term
###### Lemma 22\.
$$
\frac{\partial(\bm{\Phi}_{ij})_{11}}{\partial\bm{v}}=\bm{\mathfrak{e}}^{\Phi}_{ij}+\bm{\mathfrak{e}}^{\Phi}_{ji}\in\mathbb{R}^{N+1}. \tag{131}
$$
###### Proof\.
From Eq\. \([6](https://arxiv.org/html/2606.28662#S3.E6)\), we have
$$
(\bm{\Phi}_{ij})_{11}^{1/2}=\widetilde{\bm{V}}\bm{F}'(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}=\sum_{n=1}^{N}v_{n}^{2}f'(y_{jn})f'(y_{in}).
$$
Therefore,
$$
\frac{\partial(\bm{\Phi}_{ij})_{11}^{1/2}}{\partial v_{n}}=\begin{cases}0,&n=0\\ 2v_{n}f'(y_{jn})f'(y_{in}),&n\in\mathbb{N}_{\leq N}\end{cases}
$$
holds\. Using Eq\. \([54](https://arxiv.org/html/2606.28662#S4.E54)\), we obtain
$$
\begin{aligned}
\frac{\partial(\bm{\Phi}_{ij})_{11}}{\partial\bm{v}}&=\frac{\partial(\bm{\Phi}_{ij})_{11}}{\partial(\bm{\Phi}_{ij})_{11}^{1/2}}\frac{\partial(\bm{\Phi}_{ij})_{11}^{1/2}}{\partial\bm{v}}\\
&=4(\bm{\Phi}_{ij})_{11}^{1/2}\bm{V}^{\top}\odot\bm{f}'_{0}(\bm{y}_{j})\odot\bm{f}'_{0}(\bm{y}_{i})\\
&=2(\bm{\Phi}_{ij})_{11}^{1/2}\bm{V}^{\top}\odot\bm{f}'_{0}(\bm{y}_{j})\odot\bm{f}'_{0}(\bm{y}_{i})\\
&\quad+2(\bm{\Phi}_{ji})_{11}^{1/2}\bm{V}^{\top}\odot\bm{f}'_{0}(\bm{y}_{i})\odot\bm{f}'_{0}(\bm{y}_{j})\\
&=\bm{\mathfrak{e}}^{\Phi}_{ij}+\bm{\mathfrak{e}}^{\Phi}_{ji},
\end{aligned}
$$
where $(\bm{\Phi}_{ij})_{11}=(\bm{\Phi}_{ji})_{11}$ from Eq\. \([6](https://arxiv.org/html/2606.28662#S3.E6)\) is utilized\. ∎
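The factor of 4 in the proof of Lemma 22 comes from two sources: a factor 2 from differentiating the square $(\bm{\Phi}_{ij})_{11}=\big((\bm{\Phi}_{ij})_{11}^{1/2}\big)^{2}$ and a factor 2 from differentiating $v_{n}^{2}$. A minimal numerical sketch of this, assuming $f=\tanh$ (an assumption made only for illustration) and treating `v[0]` as the bias weight that does not enter the sum, could look as follows. The function names are hypothetical.

```python
import math

def fprime(y):
    """f'(y) for the assumed activation f = tanh."""
    return 1.0 - math.tanh(y) ** 2

def phi11_root(v, yi, yj):
    """(Phi_ij)_11^{1/2} = sum_{n=1}^{N} v_n^2 f'(y_jn) f'(y_in)."""
    return sum(v[n + 1] ** 2 * fprime(yj[n]) * fprime(yi[n])
               for n in range(len(yi)))

def phi11(v, yi, yj):
    """(Phi_ij)_11 as the square of the sum above."""
    return phi11_root(v, yi, yj) ** 2

def phi11_grad(v, yi, yj):
    """Closed form: 4 (Phi_ij)_11^{1/2} v_n f'(y_jn) f'(y_in), zero for n = 0."""
    root = phi11_root(v, yi, yj)
    grad = [0.0]  # the bias component v_0 does not appear in (Phi_ij)_11
    for n in range(len(yi)):
        grad.append(4.0 * root * v[n + 1] * fprime(yj[n]) * fprime(yi[n]))
    return grad

def finite_difference(fun, v, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for k in range(len(v)):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        grad.append((fun(vp) - fun(vm)) / (2.0 * eps))
    return grad

v = [0.3, -0.7, 1.1, 0.4]
yi, yj = [0.2, -1.0, 0.5], [-0.6, 0.3, 0.9]
g_closed = phi11_grad(v, yi, yj)
g_numeric = finite_difference(lambda u: phi11(u, yi, yj), v)
max_err_phi11 = max(abs(a - b) for a, b in zip(g_closed, g_numeric))
sym_gap = abs(phi11(v, yi, yj) - phi11(v, yj, yi))
print(max_err_phi11, sym_gap)
```

The quantity `sym_gap` illustrates the symmetry $(\bm{\Phi}_{ij})_{11}=(\bm{\Phi}_{ji})_{11}$ that the proof uses to split the gradient into the two halves $\bm{\mathfrak{e}}^{\Phi}_{ij}$ and $\bm{\mathfrak{e}}^{\Phi}_{ji}$.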
###### Lemma 23\.
$$
\frac{\partial(\bm{\Phi}_{ij})_{12}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Phi}_{ij}\in\mathbb{R}^{N+1}, \tag{132}
$$
$$
\frac{\partial(\bm{\Phi}_{ij})_{21}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Phi}_{ji}\in\mathbb{R}^{N+1}. \tag{133}
$$
###### Proof\.
From Eq\. \([7](https://arxiv.org/html/2606.28662#S3.E7)\), since
$$
\begin{aligned}
(\bm{\Phi}_{ij})_{12}&=\widetilde{\bm{V}}\bm{F}'(\bm{y}_{i})\,\mathrm{diag}(\widetilde{\bm{V}}^{\top})\,\bm{F}''(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&=\sum_{n=1}^{N}v_{n}^{3}f'(y_{in})^{2}f''(y_{jn}),
\end{aligned}
$$
we obtain
$$
\frac{\partial(\bm{\Phi}_{ij})_{12}}{\partial v_{n}}=\begin{cases}0,&n=0\\ 3v_{n}^{2}f'(y_{in})^{2}f''(y_{jn}),&n\in\mathbb{N}_{\leq N}\end{cases}.
$$
Therefore, from Eq\. \([55](https://arxiv.org/html/2606.28662#S4.E55)\) we have
$$
\frac{\partial(\bm{\Phi}_{ij})_{12}}{\partial\bm{v}}=3\bm{V}^{\odot 2\top}\odot\bm{f}'_{0}(\bm{y}_{i})^{\odot 2}\odot\bm{f}''_{0}(\bm{y}_{j})=\bm{\mathfrak{f}}^{\Phi}_{ij}.
$$
Also, from Eqs\. \([7](https://arxiv.org/html/2606.28662#S3.E7)\) and \([8](https://arxiv.org/html/2606.28662#S3.E8)\),
$$
\frac{\partial(\bm{\Phi}_{ij})_{21}}{\partial\bm{v}}=\frac{\partial(\bm{\Phi}_{ji})_{12}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Phi}_{ji}
$$
holds\. ∎
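The cubic dependence of $(\bm{\Phi}_{ij})_{12}$ on $v_{n}$ and the resulting factor 3 can be checked in the same way. The sketch below again assumes $f=\tanh$ purely for illustration, so that $f''(y)=-2\tanh(y)\,(1-\tanh^{2}(y))$, and treats `v[0]` as the bias weight. The function names are hypothetical.

```python
import math

def fprime(y):
    """f'(y) for the assumed activation f = tanh."""
    return 1.0 - math.tanh(y) ** 2

def fsecond(y):
    """f''(y) for the assumed activation f = tanh."""
    return -2.0 * math.tanh(y) * (1.0 - math.tanh(y) ** 2)

def phi12(v, yi, yj):
    """(Phi_ij)_12 = sum_{n=1}^{N} v_n^3 f'(y_in)^2 f''(y_jn)."""
    return sum(v[n + 1] ** 3 * fprime(yi[n]) ** 2 * fsecond(yj[n])
               for n in range(len(yi)))

def phi12_grad(v, yi, yj):
    """Closed form: 3 v_n^2 f'(y_in)^2 f''(y_jn), zero for the bias n = 0."""
    return [0.0] + [3.0 * v[n + 1] ** 2 * fprime(yi[n]) ** 2 * fsecond(yj[n])
                    for n in range(len(yi))]

def finite_difference(fun, v, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for k in range(len(v)):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        grad.append((fun(vp) - fun(vm)) / (2.0 * eps))
    return grad

v = [0.3, -0.7, 1.1, 0.4]
yi, yj = [0.2, -1.0, 0.5], [-0.6, 0.3, 0.9]

# Gradient of (Phi_ij)_12, i.e. the vector written as f^Phi_ij in the text.
f_ij = phi12_grad(v, yi, yj)
f_ij_fd = finite_difference(lambda u: phi12(u, yi, yj), v)
max_err_phi12 = max(abs(a - b) for a, b in zip(f_ij, f_ij_fd))

# (Phi_ij)_21 = (Phi_ji)_12, so its gradient swaps the roles of y_i and y_j.
f_ji = phi12_grad(v, yj, yi)
f_ji_fd = finite_difference(lambda u: phi12(u, yj, yi), v)
max_err_phi21 = max(abs(a - b) for a, b in zip(f_ji, f_ji_fd))
print(max_err_phi12, max_err_phi21)
```

Unlike the $(1,1)$ block, the off-diagonal entries are not symmetric in $i$ and $j$, which is why the lemma states two separate gradients rather than a symmetric sum.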
###### Lemma 24\.
$$
\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{v}}=\bm{\mathfrak{g}}^{\Phi}_{ij}+\bm{\mathfrak{g}}^{\Phi}_{ji}\in\mathbb{R}^{N+1}. \tag{134}
$$
###### Proof\.
From Eq\. \([9](https://arxiv.org/html/2606.28662#S3.E9)\) we have
$$
\begin{aligned}
\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial v_{n}}&=\frac{\partial\sum_{n=1}^{N}v_{n}^{2}f''(y_{in})f''(y_{jn})}{\partial v_{n}}\\
&=\begin{cases}0,&n=0\\ 2v_{n}f''(y_{in})f''(y_{jn}),&n\in\mathbb{N}_{\leq N}\end{cases}.
\end{aligned}
$$
Then, from this and Eq\. \([56](https://arxiv.org/html/2606.28662#S4.E56)\),
$$
\begin{aligned}
\frac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{v}}&=2\bm{V}^{\top}\odot\bm{f}''_{0}(\bm{y}_{i})\odot\bm{f}''_{0}(\bm{y}_{j})\\
&=\bm{V}^{\top}\odot\bm{f}''_{0}(\bm{y}_{i})\odot\bm{f}''_{0}(\bm{y}_{j})+\bm{V}^{\top}\odot\bm{f}''_{0}(\bm{y}_{j})\odot\bm{f}''_{0}(\bm{y}_{i})\\
&=\bm{\mathfrak{g}}^{\Phi}_{ij}+\bm{\mathfrak{g}}^{\Phi}_{ji}
\end{aligned}
$$
holds\. ∎
###### Lemma 25\.
$$
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Phi}=\sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Phi}+\bm{\mathcal{D}}_{ij}^{\Phi}+\bm{\mathcal{C}}_{ji}^{\Phi}+\bm{\mathcal{D}}_{ji}^{\Phi}\Big). \tag{135}
$$
###### Proof\.
From Eqs\. \([53](https://arxiv.org/html/2606.28662#S4.E53)\), \([101](https://arxiv.org/html/2606.28662#A2.E101)\), \([110](https://arxiv.org/html/2606.28662#A2.E110)\), \([131](https://arxiv.org/html/2606.28662#A2.E131)\), \([132](https://arxiv.org/html/2606.28662#A2.E132)\), \([133](https://arxiv.org/html/2606.28662#A2.E133)\), and \([134](https://arxiv.org/html/2606.28662#A2.E134)\), the first term of Eq\. \([99](https://arxiv.org/html/2606.28662#A2.E99)\) becomes
$$
\begin{aligned}
\frac{\partial\bm{\phi}_{ij}}{\partial\bm{v}}(\bm{o}_{i}\otimes\bm{o}_{j})&=\begin{bmatrix}\dfrac{\partial(\bm{\Phi}_{ij})_{11}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Phi}_{ij})_{12}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Phi}_{ij})_{21}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Phi}_{ij})_{22}}{\partial\bm{v}}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=s'(z_{i})s'(z_{j})(\bm{\mathfrak{e}}^{\Phi}_{ij}+\bm{\mathfrak{e}}^{\Phi}_{ji})+s'(z_{i})\delta_{j}\bm{\mathfrak{f}}^{\Phi}_{ij}\\
&\quad+\delta_{i}s'(z_{j})\bm{\mathfrak{f}}^{\Phi}_{ji}+\delta_{i}\delta_{j}(\bm{\mathfrak{g}}^{\Phi}_{ij}+\bm{\mathfrak{g}}^{\Phi}_{ji})\\
&=\big(s'(z_{i})s'(z_{j})\bm{\mathfrak{e}}^{\Phi}_{ij}+s'(z_{i})\delta_{j}\bm{\mathfrak{f}}^{\Phi}_{ij}+\delta_{i}\delta_{j}\bm{\mathfrak{g}}^{\Phi}_{ij}\big)\\
&\quad+\big(s'(z_{j})s'(z_{i})\bm{\mathfrak{e}}^{\Phi}_{ji}+s'(z_{j})\delta_{i}\bm{\mathfrak{f}}^{\Phi}_{ji}+\delta_{j}\delta_{i}\bm{\mathfrak{g}}^{\Phi}_{ji}\big)\\
&=\begin{bmatrix}\bm{\mathfrak{e}}^{\Phi}_{ij}&\bm{\mathfrak{f}}^{\Phi}_{ij}&\bm{0}_{N+1}&\bm{\mathfrak{g}}^{\Phi}_{ij}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\begin{bmatrix}\bm{\mathfrak{e}}^{\Phi}_{ji}&\bm{\mathfrak{f}}^{\Phi}_{ji}&\bm{0}_{N+1}&\bm{\mathfrak{g}}^{\Phi}_{ji}\end{bmatrix}(\bm{o}_{j}\otimes\bm{o}_{i})\\
&=\bm{K}_{ij}^{\Phi}(\bm{o}_{i}\otimes\bm{o}_{j})+\bm{K}_{ji}^{\Phi}(\bm{o}_{j}\otimes\bm{o}_{i}).
\end{aligned}
$$
Also, from Eq\. \([130](https://arxiv.org/html/2606.28662#A2.E130)\), the second and third terms of Eq\. \([99](https://arxiv.org/html/2606.28662#A2.E99)\) become
$$
\frac{\partial\bm{o}_{i}}{\partial\bm{v}}\bm{\Phi}_{ij}\bm{o}_{j}+\frac{\partial\bm{o}_{j}}{\partial\bm{v}}\bm{\Phi}_{ji}\bm{o}_{i}=\bm{h}(\bm{f}(\bm{y}_{i}))\,\bm{s}^{\prime\prime/\prime}(z_{i})^{\top}\bm{\Phi}_{ij}\bm{o}_{j}+\bm{h}(\bm{f}(\bm{y}_{j}))\,\bm{s}^{\prime\prime/\prime}(z_{j})^{\top}\bm{\Phi}_{ji}\bm{o}_{i}.
$$
Finally, from Eqs\. \([47](https://arxiv.org/html/2606.28662#S4.E47)\), \([50](https://arxiv.org/html/2606.28662#S4.E50)\), \([96](https://arxiv.org/html/2606.28662#A2.E96)\), and \([99](https://arxiv.org/html/2606.28662#A2.E99)\),
$$
\begin{aligned}
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Phi}&=\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})^{2}\frac{\partial\phi_{ij}}{\partial\bm{v}}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Phi}+\bm{\mathcal{D}}_{ij}^{\Phi}+\bm{\mathcal{C}}_{ji}^{\Phi}+\bm{\mathcal{D}}_{ji}^{\Phi}\Big)
\end{aligned}
$$
is obtained\. ∎
#### B\-E3 Gradient of the Psi Term
###### Lemma 26\.
$$
\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{v}}=\bm{\mathfrak{e}}^{\Psi}_{ij}+\bm{\mathfrak{e}}^{\Psi}_{ji}\in\mathbb{R}^{N+1}. \tag{136}
$$
###### Proof\.
Substituting Eqs\. \([90](https://arxiv.org/html/2606.28662#A2.E90)\) and \([94](https://arxiv.org/html/2606.28662#A2.E94)\) into Eq\. \([116](https://arxiv.org/html/2606.28662#A2.E116)\) yields
$$
\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{v}}&=(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\\
&\quad\times\big(\bm{F}'_{0}(\bm{y}_{j})\bm{F}'(\bm{y}_{i})+\bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\big)\widetilde{\bm{V}}^{\top}\\
&=(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))\bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\widetilde{\bm{V}}^{\top}\\
&\quad+(1+\bm{f}(\bm{y}_{j})^{\top}\bm{f}(\bm{y}_{i}))\bm{F}'_{0}(\bm{y}_{j})\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}\\
&=\bm{\mathfrak{e}}^{\Psi}_{ij}+\bm{\mathfrak{e}}^{\Psi}_{ji},
\end{aligned}
$$
where Eq\. \([58](https://arxiv.org/html/2606.28662#S4.E58)\) is utilized\. ∎
###### Lemma 27\.
$$
\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Psi}_{ij}\in\mathbb{R}^{N+1}, \tag{137}
$$
$$
\frac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Psi}_{ji}\in\mathbb{R}^{N+1}. \tag{138}
$$
###### Proof\.
Since $\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})$ does not depend on $\bm{v}$, its Jacobian becomes
$$
\frac{\partial\,\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})}{\partial\bm{v}}=\bm{0}_{(N+1)\times N}.
$$
From Eqs\. \([59](https://arxiv.org/html/2606.28662#S4.E59)\), \([90](https://arxiv.org/html/2606.28662#A2.E90)\), and \([119](https://arxiv.org/html/2606.28662#A2.E119)\), we have
$$
\begin{aligned}
\frac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{v}}&=\bm{0}_{(N+1)\times N}\bm{F}'(\bm{y}_{i})\widetilde{\bm{V}}^{\top}+\bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})\\
&=\bm{F}'_{0}(\bm{y}_{i})\bm{F}'(\bm{y}_{j})\bm{f}(\bm{y}_{i})=\bm{\mathfrak{f}}^{\Psi}_{ij}.
\end{aligned}
$$
Using Eq\. \([122](https://arxiv.org/html/2606.28662#A2.E122)\),
$$
\frac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{v}}=\frac{\partial(\bm{\Psi}_{ji})_{12}}{\partial\bm{v}}=\bm{\mathfrak{f}}^{\Psi}_{ji}
$$
is obtained\. ∎
###### Lemma 28\.
$$
\frac{\partial(\bm{\Psi}_{ij})_{22}}{\partial\bm{v}}=\bm{0}_{N+1}. \tag{139}
$$
###### Proof\.
As can be seen from Eq\. \([15](https://arxiv.org/html/2606.28662#S3.E15)\), $(\bm{\Psi}_{ij})_{22}$ does not depend on $\bm{v}$\. ∎
###### Lemma 29\.
$$
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Psi}=\sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Psi}+\bm{\mathcal{D}}_{ij}^{\Psi}+\bm{\mathcal{C}}_{ji}^{\Psi}+\bm{\mathcal{D}}_{ji}^{\Psi}\Big). \tag{140}
$$
###### Proof\.
From Eqs\. \([57](https://arxiv.org/html/2606.28662#S4.E57)\), \([102](https://arxiv.org/html/2606.28662#A2.E102)\), \([110](https://arxiv.org/html/2606.28662#A2.E110)\), \([136](https://arxiv.org/html/2606.28662#A2.E136)\), \([137](https://arxiv.org/html/2606.28662#A2.E137)\), \([138](https://arxiv.org/html/2606.28662#A2.E138)\), and \([139](https://arxiv.org/html/2606.28662#A2.E139)\), the first term of Eq\. \([100](https://arxiv.org/html/2606.28662#A2.E100)\) becomes
$$
\begin{aligned}
\frac{\partial\bm{\psi}_{ij}}{\partial\bm{v}}(\bm{o}_{i}\otimes\bm{o}_{j})&=\begin{bmatrix}\dfrac{\partial(\bm{\Psi}_{ij})_{11}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Psi}_{ij})_{12}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Psi}_{ij})_{21}}{\partial\bm{v}}&\dfrac{\partial(\bm{\Psi}_{ij})_{22}}{\partial\bm{v}}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&=s'(z_{i})s'(z_{j})(\bm{\mathfrak{e}}^{\Psi}_{ij}+\bm{\mathfrak{e}}^{\Psi}_{ji})+s'(z_{i})\delta_{j}\bm{\mathfrak{f}}^{\Psi}_{ij}+\delta_{i}s'(z_{j})\bm{\mathfrak{f}}^{\Psi}_{ji}\\
&=\big(s'(z_{i})s'(z_{j})\bm{\mathfrak{e}}^{\Psi}_{ij}+s'(z_{i})\delta_{j}\bm{\mathfrak{f}}^{\Psi}_{ij}\big)\\
&\quad+\big(s'(z_{j})s'(z_{i})\bm{\mathfrak{e}}^{\Psi}_{ji}+s'(z_{j})\delta_{i}\bm{\mathfrak{f}}^{\Psi}_{ji}\big)\\
&=\begin{bmatrix}\bm{\mathfrak{e}}^{\Psi}_{ij}&\bm{\mathfrak{f}}^{\Psi}_{ij}&\bm{0}_{N+1}&\bm{0}_{N+1}\end{bmatrix}(\bm{o}_{i}\otimes\bm{o}_{j})\\
&\quad+\begin{bmatrix}\bm{\mathfrak{e}}^{\Psi}_{ji}&\bm{\mathfrak{f}}^{\Psi}_{ji}&\bm{0}_{N+1}&\bm{0}_{N+1}\end{bmatrix}(\bm{o}_{j}\otimes\bm{o}_{i})\\
&=\bm{K}_{ij}^{\Psi}(\bm{o}_{i}\otimes\bm{o}_{j})+\bm{K}_{ji}^{\Psi}(\bm{o}_{j}\otimes\bm{o}_{i}).
\end{aligned}
$$
Also, from Eq\. \([130](https://arxiv.org/html/2606.28662#A2.E130)\), the second and third terms of Eq\. \([100](https://arxiv.org/html/2606.28662#A2.E100)\) become
$$
\frac{\partial\bm{o}_{i}}{\partial\bm{v}}\bm{\Psi}_{ij}\bm{o}_{j}+\frac{\partial\bm{o}_{j}}{\partial\bm{v}}\bm{\Psi}_{ji}\bm{o}_{i}=\bm{h}(\bm{f}(\bm{y}_{i}))\,\bm{s}^{\prime\prime/\prime}(z_{i})^{\top}\bm{\Psi}_{ij}\bm{o}_{j}+\bm{h}(\bm{f}(\bm{y}_{j}))\,\bm{s}^{\prime\prime/\prime}(z_{j})^{\top}\bm{\Psi}_{ji}\bm{o}_{i}.
$$
From Eqs\. \([48](https://arxiv.org/html/2606.28662#S4.E48)\), \([51](https://arxiv.org/html/2606.28662#S4.E51)\), \([97](https://arxiv.org/html/2606.28662#A2.E97)\), and \([100](https://arxiv.org/html/2606.28662#A2.E100)\),
$$
\begin{aligned}
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Psi}&=2\sum_{i=1}^{I}\sum_{j=1}^{I}(1+\bm{x}_{i}^{\top}\bm{x}_{j})\frac{\partial\psi_{ij}}{\partial\bm{v}}\\
&=\sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Psi}+\bm{\mathcal{D}}_{ij}^{\Psi}+\bm{\mathcal{C}}_{ji}^{\Psi}+\bm{\mathcal{D}}_{ji}^{\Psi}\Big)
\end{aligned}
$$
is obtained\. ∎
#### B\-E4 Gradient of the Omega Term
###### Lemma 30\.
$$
\Bigg[\frac{\partial\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Omega}=\sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Omega}+\bm{\mathcal{D}}_{ij}^{\Omega}+\bm{\mathcal{C}}_{ji}^{\Omega}+\bm{\mathcal{D}}_{ji}^{\Omega}\Big). \tag{141}
$$
###### Proof\.
From Eqs\. \([16](https://arxiv.org/html/2606.28662#S3.E16)\) and \([89](https://arxiv.org/html/2606.28662#A2.E89)\), we have
∂ωij∂𝒗\\displaystyle\\frac\{\\mathop\{\}\\\!\\partial\\omega\_\{ij\}\}\{\\mathop\{\}\\\!\\partial\\bm\{v\}\}=∂𝒐i⊤𝛀ij𝒐j∂𝒗=∂s′\(zi\)s′\(zj\)∂𝒗\\displaystyle=\\frac\{\\mathop\{\}\\\!\\partial\\bm\{o\}\_\{i\}^\{\\top\}\\bm\{\\Omega\}\_\{ij\}\\bm\{o\}\_\{j\}\}\{\\mathop\{\}\\\!\\partial\\bm\{v\}\}=\\frac\{\\mathop\{\}\\\!\\partial s^\{\\prime\}\(z\_\{i\}\)s^\{\\prime\}\(z\_\{j\}\)\}\{\\mathop\{\}\\\!\\partial\\bm\{v\}\}=∂s′\(zi\)∂𝒗s′\(zj\)\+∂s′\(zj\)∂𝒗s′\(zi\)\\displaystyle=\\frac\{\\mathop\{\}\\\!\\partial s^\{\\prime\}\(z\_\{i\}\)\}\{\\mathop\{\}\\\!\\partial\\bm\{v\}\}s^\{\\prime\}\(z\_\{j\}\)\+\\frac\{\\mathop\{\}\\\!\\partial s^\{\\prime\}\(z\_\{j\}\)\}\{\\mathop\{\}\\\!\\partial\\bm\{v\}\}s^\{\\prime\}\(z\_\{i\}\)=s′′\(zi\)s′\(zj\)𝒉\(𝒇\(𝒚i\)\)\+s′′\(zj\)s′\(zi\)𝒉\(𝒇\(𝒚j\)\)\.\\displaystyle=s^\{\\prime\\prime\}\(z\_\{i\}\)s^\{\\prime\}\(z\_\{j\}\)\\bm\{h\}\(\\bm\{f\}\(\\bm\{y\}\_\{i\}\)\)\+s^\{\\prime\\prime\}\(z\_\{j\}\)s^\{\\prime\}\(z\_\{i\}\)\\bm\{h\}\(\\bm\{f\}\(\\bm\{y\}\_\{j\}\)\)\.From this and Eq\. \([49](https://arxiv.org/html/2606.28662#S4.E49)\), we obtain
$$
\begin{aligned}
(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}\frac{\partial\omega_{ij}}{\partial\bm{v}}
&= s^{\prime\prime}(z_{i})s^{\prime}(z_{j})(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}\bm{h}(\bm{f}(\bm{y}_{i})) \\
&\quad + s^{\prime\prime}(z_{j})s^{\prime}(z_{i})(1+\bm{f}(\bm{y}_{j})^{\top}\bm{f}(\bm{y}_{i}))^{2}\bm{h}(\bm{f}(\bm{y}_{j})) \\
&= \bm{\mathcal{C}}_{ij}^{\Omega}+\bm{\mathcal{C}}_{ji}^{\Omega}.
\end{aligned}
$$
Using the fact that $\bm{f}(\bm{y})$ does not depend on $\bm{v}$ and Eq\. \([52](https://arxiv.org/html/2606.28662#S4.E52)\),
$$
\begin{aligned}
\omega_{ij}\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}}{\partial\bm{v}}
&= \omega_{ij}\bm{0}_{N+1}+\omega_{ji}\bm{0}_{N+1} \\
&= \bm{\mathcal{D}}_{ij}^{\Omega}+\bm{\mathcal{D}}_{ji}^{\Omega}
\end{aligned}
$$
holds, where Eq\. \([128](https://arxiv.org/html/2606.28662#A2.E128)\) is utilized\. From these and Eq\. \([98](https://arxiv.org/html/2606.28662#A2.E98)\),
$$
\begin{aligned}
\Bigg[\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{\Omega}
&= \sum_{i=1}^{I}\sum_{j=1}^{I}\bigg((1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}\frac{\partial\omega_{ij}}{\partial\bm{v}}+\omega_{ij}\frac{\partial(1+\bm{f}(\bm{y}_{i})^{\top}\bm{f}(\bm{y}_{j}))^{2}}{\partial\bm{v}}\bigg) \\
&= \sum_{i=1}^{I}\sum_{j=1}^{I}\Big(\bm{\mathcal{C}}_{ij}^{\Omega}+\bm{\mathcal{D}}_{ij}^{\Omega}+\bm{\mathcal{C}}_{ji}^{\Omega}+\bm{\mathcal{D}}_{ji}^{\Omega}\Big)
\end{aligned}
$$
is obtained\. ∎
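The derivative of $\omega_{ij}$ used in this proof can be sanity-checked numerically. The following is a minimal sketch, not the authors' code, under these assumptions: $s$ is the logistic sigmoid, $z_i = \bm{v}^{\top}\bm{h}(\bm{f}(\bm{y}_i))$, and $\bm{h}(\cdot)$ appends a constant 1 to its argument. The names `h_i`, `h_j`, `grad_omega`, and `numerical_grad` are hypothetical stand-ins introduced only for this illustration.

```python
import numpy as np

def s(z):
    # Logistic sigmoid (assumed form of the activation s).
    return 1.0 / (1.0 + np.exp(-z))

def s1(z):
    # First derivative: s'(z) = s(z) (1 - s(z)).
    p = s(z)
    return p * (1.0 - p)

def s2(z):
    # Second derivative: s''(z) = s(z) (1 - s(z)) (1 - 2 s(z)).
    p = s(z)
    return p * (1.0 - p) * (1.0 - 2.0 * p)

def omega(v, h_i, h_j):
    # omega_ij = s'(z_i) s'(z_j), with z = v^T h (assumed linear in v).
    return s1(v @ h_i) * s1(v @ h_j)

def grad_omega(v, h_i, h_j):
    # Closed form from the proof:
    # s''(z_i) s'(z_j) h_i + s''(z_j) s'(z_i) h_j.
    z_i, z_j = v @ h_i, v @ h_j
    return s2(z_i) * s1(z_j) * h_i + s2(z_j) * s1(z_i) * h_j

def numerical_grad(fn, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(v)
    for k in range(v.size):
        e = np.zeros_like(v)
        e[k] = eps
        g[k] = (fn(v + e) - fn(v - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
N = 4                                      # hidden width (illustrative value)
v = rng.normal(size=N + 1)                 # output weights including bias
h_i = np.append(rng.normal(size=N), 1.0)   # stand-in for h(f(y_i))
h_j = np.append(rng.normal(size=N), 1.0)   # stand-in for h(f(y_j))

analytic = grad_omega(v, h_i, h_j)
numeric = numerical_grad(lambda u: omega(u, h_i, h_j), v)
max_abs_err = float(np.max(np.abs(analytic - numeric)))
print("max |analytic - numeric| =", max_abs_err)
```

Agreement between the two gradients up to finite-difference error supports the product-rule step only. It does not check the definitions of $\bm{\mathcal{C}}_{ij}^{\Omega}$ and $\bm{\mathcal{D}}_{ij}^{\Omega}$, which depend on equations outside this section.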
#### B\-E5 Completion of the Proof
From Eqs\. \([95](https://arxiv.org/html/2606.28662#A2.E95)\), \([96](https://arxiv.org/html/2606.28662#A2.E96)\), \([97](https://arxiv.org/html/2606.28662#A2.E97)\), \([98](https://arxiv.org/html/2606.28662#A2.E98)\), \([135](https://arxiv.org/html/2606.28662#A2.E135)\), \([140](https://arxiv.org/html/2606.28662#A2.E140)\), \([141](https://arxiv.org/html/2606.28662#A2.E141)\), we have
$$
\begin{aligned}
\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}
&= \sum_{a\in\mathfrak{B}}\Bigg[\frac{\partial\,\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})}{\partial\bm{v}}\Bigg]_{a} \\
&= \sum_{i=1}^{I}\sum_{j=1}^{I}(\bm{\mathcal{C}}_{ij}^{\Phi}+\bm{\mathcal{C}}_{ij}^{\Psi}+\bm{\mathcal{C}}_{ij}^{\Omega}+\bm{\mathcal{D}}_{ij}^{\Phi}+\bm{\mathcal{D}}_{ij}^{\Psi}+\bm{\mathcal{D}}_{ij}^{\Omega}) \\
&\quad + \sum_{i=1}^{I}\sum_{j=1}^{I}(\bm{\mathcal{C}}_{ji}^{\Phi}+\bm{\mathcal{C}}_{ji}^{\Psi}+\bm{\mathcal{C}}_{ji}^{\Omega}+\bm{\mathcal{D}}_{ji}^{\Phi}+\bm{\mathcal{D}}_{ji}^{\Psi}+\bm{\mathcal{D}}_{ji}^{\Omega}) \\
&= 2\sum_{i=1}^{I}\sum_{j=1}^{I}(\bm{\mathcal{V}}_{ij}^{\Phi}+\bm{\mathcal{V}}_{ij}^{\Psi}+\bm{\mathcal{V}}_{ij}^{\Omega}).
\end{aligned}
$$
Therefore, Eq\. \([46](https://arxiv.org/html/2606.28662#S4.E46)\) holds\.
### B\-F Proof for Eq\. \([60](https://arxiv.org/html/2606.28662#S5.E60)\)
Each term in the gradient of $\mathrm{tr}(\bm{H}_{L}(\bm{\theta}))$ becomes
$$
\begin{aligned}
p_{i}\rightarrow q_{i}
&\Rightarrow (s(z_{i})\rightarrow 0\lor 1)\land(\delta_{i}\rightarrow 0)
\Rightarrow s^{\prime}(z_{i})\rightarrow 0\land s^{\prime\prime}(z_{i})\rightarrow 0 \\
&\Rightarrow \bigg(\bm{h}(\bm{x}_{i})\otimes\sum_{a\in\mathfrak{A}}\bm{\mathcal{W}}_{i}^{a}\rightarrow\bm{0}_{(M+1)N}\bigg)\land\bigg(\sum_{a\in\mathfrak{A}}\bm{\mathcal{V}}_{i}^{a}\rightarrow\bm{0}_{N+1}\bigg).
\end{aligned}
$$
Similarly, each term in the gradient of $\mathrm{tr}(\bm{H}_{L}(\bm{\theta})^{2})$ becomes
$$
\begin{aligned}
p_{i}\rightarrow q_{i}
&\Rightarrow (s(z_{i})\rightarrow 0\lor 1)\land(\delta_{i}\rightarrow 0) \\
&\Rightarrow (s^{\prime}(z_{i})\rightarrow 0)\land(s^{\prime\prime}(z_{i})\rightarrow 0) \\
&\Rightarrow (\bm{o}_{i}\otimes\bm{o}_{j}\rightarrow\bm{0}_{4})\land(\bm{s}^{\prime\prime/\prime}(z)\rightarrow\bm{0}_{2}) \\
&\Rightarrow (\bm{\mathcal{A}}_{ij}^{a}\rightarrow\bm{0}_{N})\land(\bm{\mathcal{B}}_{ij}^{a}\rightarrow\bm{0}_{N})\land(\bm{\mathcal{C}}_{ij}^{a}\rightarrow\bm{0}_{N+1}) \\
&\qquad \land(\bm{\mathcal{D}}_{ij}^{b}\rightarrow\bm{0}_{N+1}),\quad \forall a\in\mathfrak{B},\ \forall b\in\mathfrak{B}\setminus\{\Omega\} \\
&\Rightarrow \bigg(\bm{h}(\bm{x}_{i})\otimes\sum_{a\in\mathfrak{B}}\bm{\mathcal{W}}_{ij}^{a}\rightarrow\bm{0}_{(M+1)N}\bigg) \\
&\qquad \land\bigg(\sum_{a\in\mathfrak{B}}\bm{\mathcal{V}}_{ij}^{a}\rightarrow\bm{0}_{N+1}\bigg).
\end{aligned}
$$
Therefore, Eq\. \([60](https://arxiv.org/html/2606.28662#S5.E60)\) holds\.
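The key step in both chains is that $s^{\prime}(z_i)$ and $s^{\prime\prime}(z_i)$ vanish once the output saturates. A minimal numerical illustration follows, assuming $s$ is the logistic sigmoid; the sample values in `zs` are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid (assumed form of s).
    return 1.0 / (1.0 + np.exp(-z))

def d1(z):
    # s'(z) = s(z) (1 - s(z))
    p = sigmoid(z)
    return p * (1.0 - p)

def d2(z):
    # s''(z) = s(z) (1 - s(z)) (1 - 2 s(z))
    p = sigmoid(z)
    return p * (1.0 - p) * (1.0 - 2.0 * p)

# Increasingly saturated pre-activations (s(z) -> 1).
zs = np.array([2.0, 5.0, 10.0, 20.0])
first = d1(zs)
second = d2(zs)

for z, a, b in zip(zs, first, second):
    print(f"z={z:5.1f}  s'(z)={a:.3e}  s''(z)={b:.3e}")
```

By the symmetry $s(-z) = 1 - s(z)$, both derivatives shrink in magnitude in the same way as $z \rightarrow -\infty$, which covers the case $s(z_i) \rightarrow 0$.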
### B\-G Proof for Eq\. \([62](https://arxiv.org/html/2606.28662#S5.E62)\)
First, we focus on the upper bound\. From Eq\. \([1](https://arxiv.org/html/2606.28662#S3.E1)\), since
$$
\begin{aligned}
&\Big(\sigma(\bm{\theta}^{(t+1)})<\sigma(\bm{\theta}^{(t)})\Big)\land\Big(\mu(\bm{\theta}^{(t+1)})=\mu(\bm{\theta}^{(t)})\Big)\Rightarrow \\
&\qquad \lambda_{\mathrm{sup}}(\bm{\theta}^{(t+1)})=\mu(\bm{\theta}^{(t+1)})+\sqrt{D-1}\,\sigma(\bm{\theta}^{(t+1)}) \\
&\qquad\qquad <\mu(\bm{\theta}^{(t+1)})+\sqrt{D-1}\,\sigma(\bm{\theta}^{(t)}) \\
&\qquad\qquad =\mu(\bm{\theta}^{(t)})+\sqrt{D-1}\,\sigma(\bm{\theta}^{(t)})=\lambda_{\mathrm{sup}}(\bm{\theta}^{(t)})
\end{aligned}
$$
holds, we obtain $\lambda_{\mathrm{sup}}(\bm{\theta}^{(t+1)})<\lambda_{\mathrm{sup}}(\bm{\theta}^{(t)})$\. Next, we focus on the lower bound\. From Eq\. \([61](https://arxiv.org/html/2606.28662#S5.E61)\), since
$$
\begin{aligned}
&\Big(\sigma(\bm{\theta}^{(t+1)})<\sigma(\bm{\theta}^{(t)})\Big)\land\Big(\mu(\bm{\theta}^{(t+1)})=\mu(\bm{\theta}^{(t)})\Big)\Rightarrow \\
&\qquad \lambda_{\mathrm{inf}}(\bm{\theta}^{(t+1)})=\mu(\bm{\theta}^{(t+1)})-\sqrt{D-1}\,\sigma(\bm{\theta}^{(t+1)}) \\
&\qquad\qquad >\mu(\bm{\theta}^{(t+1)})-\sqrt{D-1}\,\sigma(\bm{\theta}^{(t)}) \\
&\qquad\qquad =\mu(\bm{\theta}^{(t)})-\sqrt{D-1}\,\sigma(\bm{\theta}^{(t)})=\lambda_{\mathrm{inf}}(\bm{\theta}^{(t)})
\end{aligned}
$$
holds, we obtain $\lambda_{\mathrm{inf}}(\bm{\theta}^{(t+1)})>\lambda_{\mathrm{inf}}(\bm{\theta}^{(t)})$\.
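The two inequalities can be illustrated on a small symmetric matrix. The sketch below is not the authors' implementation. It assumes that $\mu$ and $\sigma$ are the trace-based mean and standard deviation of the eigenvalues, as in the Wolkowicz-Styan theorem: $\mu = \mathrm{tr}(\bm{H})/D$ and $\sigma^{2} = \mathrm{tr}(\bm{H}^{2})/D - \mu^{2}$. A random symmetric matrix stands in for the loss Hessian, and the helper `ws_bounds` and the matrix `H_shrunk` are hypothetical names introduced for this example.

```python
import numpy as np

def ws_bounds(H):
    """Return (lambda_inf, lambda_sup) computed from tr(H) and tr(H @ H)."""
    D = H.shape[0]
    mu = np.trace(H) / D
    var = np.trace(H @ H) / D - mu ** 2
    sigma = np.sqrt(max(var, 0.0))   # guard against tiny negative round-off
    radius = np.sqrt(D - 1) * sigma
    return mu - radius, mu + radius

rng = np.random.default_rng(0)
D = 6
A = rng.normal(size=(D, D))
H = (A + A.T) / 2.0                  # symmetric stand-in for a Hessian

# Halve the eigenvalue spread around the mean while keeping the trace fixed,
# so that sigma decreases and mu stays the same (the premise of the proof).
mu = np.trace(H) / D
H_shrunk = mu * np.eye(D) + 0.5 * (H - mu * np.eye(D))

lo, hi = ws_bounds(H)
lo_s, hi_s = ws_bounds(H_shrunk)
eigs = np.linalg.eigvalsh(H)
eigs_s = np.linalg.eigvalsh(H_shrunk)

print("original: bounds", (lo, hi), "eig range", (eigs.min(), eigs.max()))
print("shrunk:   bounds", (lo_s, hi_s), "eig range", (eigs_s.min(), eigs_s.max()))
```

With $\mu$ held fixed, the upper bound falls and the lower bound rises, so the interval that must contain every eigenvalue narrows from both sides. In training, $\mu$ generally changes between steps, so this sketch covers only the idealized premise stated in the proof.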
## References
- \[1\]M\. Andriushchenko and N\. Flammarion\(2022\)Towards Understanding Sharpness\-Aware Minimization\.Proceedings of the 39th International Conference on Machine Learning,pp\. 639–668\.Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[2\]E\. O\. Arkhangelskaya and S\. I\. Nikolenko\(2023\)Deep Learning for Natural Language Processing: A Survey\.Journal of Mathematical Sciences273\(4\),pp\. 533–582\.External Links:[Document](https://dx.doi.org/10.1007/s10958-023-06519-6)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1)\.
- \[3\]S\. Arora, Z\. Li, and A\. Panigrahi\(2022\)Understanding Gradient Descent on Edge of Stability in Deep Learning\.External Links:2205\.09745,[Document](https://dx.doi.org/10.48550/arXiv.2205.09745)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1),[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[4\]C\. Bishop\(1992\)Exact Calculation of the Hessian Matrix for the Multilayer Perceptron\.Neural Computation4\(4\),pp\. 494–501\.External Links:[Document](https://dx.doi.org/10.1162/neco.1992.4.4.494)Cited by:[§II\-C](https://arxiv.org/html/2606.28662#S2.SS3.p1.1)\.
- \[5\]J\. Chai, H\. Zeng, A\. Li, and E\. W\. T\. Ngai\(2021\)Deep learning in computer vision: A critical review of emerging techniques and application scenarios\.Machine Learning with Applications6,pp\. 100134\.External Links:[Document](https://dx.doi.org/10.1016/j.mlwa.2021.100134)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1)\.
- \[6\]X\. Chen, C\. Hsieh, and B\. Gong\(2022\)When Vision Transformers Outperform ResNets without Pre\-training or Strong Data Augmentations\.External Links:2106\.01548,[Document](https://dx.doi.org/10.48550/arXiv.2106.01548)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1),[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[7\]L\. Dinh, R\. Pascanu, S\. Bengio, and Y\. Bengio\(2017\)Sharp Minima Can Generalize For Deep Nets\.Proceedings of the 34th International Conference on Machine Learning\.External Links:[Document](https://dx.doi.org/arXiv%3A1703.04933)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[8\]Z\. Dong, Z\. Yao, D\. Arfeen, A\. Gholami, M\. W\. Mahoney, and K\. Keutzer\(2020\)HAWQ\-V2: Hessian Aware trace\-Weighted Quantization of Neural Networks\.Advances in Neural Information Processing Systems33,pp\. 18518–18529\.Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[9\]P\. Foret, A\. Kleiner, H\. Mobahi, and B\. Neyshabur\(2021\)Sharpness\-Aware Minimization for Efficiently Improving Generalization\.External Links:2010\.01412,[Document](https://dx.doi.org/10.48550/arXiv.2010.01412)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1),[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1),[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p1.2)\.
- \[10\]B\. Ghorbani, S\. Krishnan, and Y\. Xiao\(2019\)An Investigation into Neural Net Optimization via Hessian Eigenvalue Density\.Proceedings of the 36th International Conference on Machine Learning,pp\. 2232–2241\.Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1),[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[11\]S\. Hochreiter and J\. Schmidhuber\(1997\)Flat minima\.Neural Computation9\(1\),pp\. 1–42\.External Links:[Document](https://dx.doi.org/10.1162/neco.1997.9.1.1)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[12\]W\. Huang, X\. Liu, X\. Wang, J\. Yamagishi, and Y\. Qian\(2025\)From Sharpness to Better Generalization for Speech Deepfake Detection\.External Links:2506\.11532,[Document](https://dx.doi.org/10.48550/arXiv.2506.11532)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1),[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[13\]M\.F\. Hutchinson\(1989\)A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines\.Communications in Statistics \- Simulation and Computation18\(3\),pp\. 1059–1076\.External Links:[Document](https://dx.doi.org/10.1080/03610918908812806)Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[14\]K\. He, X\. Zhang, S\. Ren, and J\. Sun\(2016\)Deep Residual Learning for Image Recognition\.2016 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)1,pp\. 770–778\.External Links:[Document](https://dx.doi.org/10.1109/cvpr.2016.90)Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[15\]N\. S\. Keskar, D\. Mudigere, J\. Nocedal, M\. Smelyanskiy, and P\. T\. P\. Tang\(2017\)On Large\-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\.External Links:1609\.04836,[Document](https://dx.doi.org/10.48550/arXiv.1609.04836)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[16\]M\. A\. Khamis, H\. Q\. Ngo, X\. Nguyen, D\. Olteanu, and M\. Schleich\(2020\)Learning Models over Relational Data using Sparse Tensors and Functional Dependencies\.External Links:1703\.04780,[Document](https://dx.doi.org/10.48550/arXiv.1703.04780)Cited by:[§A\-A](https://arxiv.org/html/2606.28662#A1.SS1.p1.3)\.
- \[17\]J\. Kwon, J\. Kim, H\. Park, and I\. K\. Choi\(2021\)ASAM: Adaptive Sharpness\-Aware Minimization for Scale\-Invariant Learning of Deep Neural Networks\.Proceedings of the 38th International Conference on Machine Learning,pp\. 5905–5914\.Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1),[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p1.2)\.
- \[18\]C\. Lanczos\(1950\)An iteration method for the solution of the eigenvalue problem of linear differential and integral operators\.Journal of Research of the National Bureau of Standards45\(4\),pp\. 255–282\.Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[19\]H\. Li, Z\. Xu, G\. Taylor, C\. Studer, and T\. Goldstein\(2018\)Visualizing the Loss Landscape of Neural Nets\.Advances in Neural Information Processing Systems31\.Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[20\]Y\. Liu, S\. Yu, and T\. Lin\(2023\)Hessian regularization of deep neural networks: A novel approach based on stochastic estimators of Hessian trace\.Neurocomputing536,pp\. 13–20\.External Links:[Document](https://dx.doi.org/10.1016/j.neucom.2023.03.017)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1),[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6),[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1),[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p1.2)\.
- \[21\]H\. Luo, T\. Truong, T\. Pham, M\. Harandi, D\. Phung, and T\. Le\(2025\)Explicit Eigenvalue Regularization Improves Sharpness\-Aware Minimization\.External Links:2501\.12666,[Document](https://dx.doi.org/10.48550/arXiv.2501.12666)Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[22\]K\. Lyu, Z\. Li, and S\. Arora\(2023\)Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction\.External Links:2206\.07085,[Document](https://dx.doi.org/10.48550/arXiv.2206.07085)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1),[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[23\]A\. Mehrish, N\. Majumder, R\. Bharadwaj, R\. Mihalcea, and S\. Poria\(2023\)A review of deep learning techniques for speech processing\.Information Fusion99,pp\. 101869\.External Links:[Document](https://dx.doi.org/10.1016/j.inffus.2023.101869)Cited by:[§I](https://arxiv.org/html/2606.28662#S1.p1.1)\.
- \[24\]Y\. Omae, K\. Sakai, Y\. Kakimoto, M\. Sasaki, Y\. Sakai, and H\. Takahashi\(2026\)Wolkowicz\-Styan Upper Bound on the Hessian Eigenspectrum for Cross\-Entropy Loss in Nonlinear Smooth Neural Networks\.arXiv\.org\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2604.10202)Cited by:[§A\-B](https://arxiv.org/html/2606.28662#A1.SS2.p1.16),[§B\-B2](https://arxiv.org/html/2606.28662#A2.SS2.SSS2.1.p1.1),[§B\-B2](https://arxiv.org/html/2606.28662#A2.SS2.SSS2.1.p1.3),[§B\-B2](https://arxiv.org/html/2606.28662#A2.SS2.SSS2.2.p1.3),[§B\-B2](https://arxiv.org/html/2606.28662#A2.SS2.SSS2.2.p1.4),[§B\-C1](https://arxiv.org/html/2606.28662#A2.SS3.SSS1.1.p1.1),[§B\-C1](https://arxiv.org/html/2606.28662#A2.SS3.SSS1.2.p1.5),[§I](https://arxiv.org/html/2606.28662#S1.p2.1),[§II\-C](https://arxiv.org/html/2606.28662#S2.SS3.p1.1),[§III\-A](https://arxiv.org/html/2606.28662#S3.SS1.p3.11),[§III\-B](https://arxiv.org/html/2606.28662#S3.SS2.p1.10),[§III\-B](https://arxiv.org/html/2606.28662#S3.SS2.p1.19),[§III\-B](https://arxiv.org/html/2606.28662#S3.SS2.p1.24),[§III\-B](https://arxiv.org/html/2606.28662#S3.SS2.p1.26),[§III](https://arxiv.org/html/2606.28662#S3.p1.1),[§IV\-A](https://arxiv.org/html/2606.28662#S4.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.28662#S5.SS1.p2.11)\.
- \[25\]V\. Papyan\(2020\)Traces of Class/Cross\-Class Structure Pervade Deep Learning Spectra\.Journal of Machine Learning Research21\.Cited by:[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p6.1)\.
- \[26\]L\. Sagun, L\. Bottou, and Y\. LeCun\(2017\)Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond\.External Links:1611\.07476,[Document](https://dx.doi.org/10.48550/arXiv.1611.07476)Cited by:[§VI\-A](https://arxiv.org/html/2606.28662#S6.SS1.p3.7)\.
- \[27\]A\. R\. Sankar, Y\. Khasbage, R\. Vigneswaran, and V\. N Balasubramanian\(2021\)A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization\.Proceedings of the AAAI Conference on Artificial Intelligence35\(11\),pp\. 9481–9488\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v35i11.17142)Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[28\]K\. Simonyan and A\. Zisserman\(2014\)Very Deep Convolutional Networks for Large\-Scale Image Recognition\.CoRR\.Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[29\]S\. P\. Singh, G\. Bachmann, and T\. Hofmann\(2021\)Analytic Insights into Structure and Rank of Neural Network Hessian Maps\.External Links:2106\.16225,[Document](https://dx.doi.org/10.48550/arXiv.2106.16225)Cited by:[§II\-C](https://arxiv.org/html/2606.28662#S2.SS3.p1.1)\.
- \[30\]S\. P\. Singh, W\. Ormaniec, and T\. Hofmann\(2026\)Cracking the Hessian: Closed\-Form Hessian Spectra for Fundamental Neural Networks\.OpenReview in ICLR2026\.Cited by:[§II\-C](https://arxiv.org/html/2606.28662#S2.SS3.p1.1)\.
- \[31\]Torchvision — Torchvision 0\.27 documentation\.Note:https://docs\.pytorch\.org/vision/stable/index\.htmlCited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[32\]M\. Wei and D\. J\. Schwab\(2019\)How noise affects the Hessian spectrum in overparameterized neural networks\.External Links:1910\.00195,[Document](https://dx.doi.org/10.48550/arXiv.1910.00195)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[33\]H\. Wolkowicz and G\. P\. H\. Styan\(1980\)Bounds for eigenvalues using traces\.Linear Algebra and its Applications29,pp\. 471–506\.External Links:[Document](https://dx.doi.org/10.1016/0024-3795%2880%2990258-X)Cited by:[§III\-B](https://arxiv.org/html/2606.28662#S3.SS2.p1.10),[§V\-E](https://arxiv.org/html/2606.28662#S5.SS5.p1.2)\.
- \[34\]L\. Wu and W\. J\. Su\(2023\)The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent\.External Links:2305\.17490,[Document](https://dx.doi.org/10.48550/arXiv.2305.17490)Cited by:[§II\-A](https://arxiv.org/html/2606.28662#S2.SS1.p1.1)\.
- \[35\]Y\. Wu, X\. Zhu, C\. Wu, A\. Wang, and R\. Ge\(2022\)Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks\.External Links:2010\.04261,[Document](https://dx.doi.org/10.48550/arXiv.2010.04261)Cited by:[§II\-C](https://arxiv.org/html/2606.28662#S2.SS3.p1.1),[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p6.1)\.
- \[36\]Z\. Xie, Q\. Tang, Y\. Cai, M\. Sun, and P\. Li\(2022\)On the Power\-Law Hessian Spectrums in Deep Learning\.External Links:2201\.13011,[Document](https://dx.doi.org/10.48550/arXiv.2201.13011)Cited by:[§VI\-B](https://arxiv.org/html/2606.28662#S6.SS2.p6.1)\.
- \[37\]Z\. Yao, A\. Gholami, K\. Keutzer, and M\. W\. Mahoney\(2020\)PyHessian: Neural Networks Through the Lens of the Hessian\.2020 IEEE International Conference on Big Data \(Big Data\),pp\. 581–590\.External Links:[Document](https://dx.doi.org/10.1109/BigData50022.2020.9378171)Cited by:[§II\-B](https://arxiv.org/html/2606.28662#S2.SS2.p1.6)\.
- \[38\]X\. Yue, M\. Nouiehed, and R\. A\. Kontar\(2024\)SALR: Sharpness\-aware Learning Rate Scheduler for Improved Generalization\.IEEE Transactions on Neural Networks and Learning Systems35\(9\),pp\. 12518–12527\.External Links:2011\.05348,[Document](https://dx.doi.org/10.1109/TNNLS.2023.3263393)Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[39\]H\. R\. Zhang, D\. Li, and H\. Ju\(2024\)Noise Stability Optimization for Finding Flat Minima: A Hessian\-based Regularization Approach\.External Links:2306\.08553,[Document](https://dx.doi.org/10.48550/arXiv.2306.08553)Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.
- \[40\]Y\. Zhou, Y\. Qu, X\. Xu, and H\. Shen\(2023\)ImbSAM: A Closer Look at Sharpness\-Aware Minimization in Class\-Imbalanced Recognition\.2023 IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 11311–11321\.External Links:[Document](https://dx.doi.org/10.1109/ICCV51070.2023.01042)Cited by:[§II\-D](https://arxiv.org/html/2606.28662#S2.SS4.p1.1)\.