CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks


Summary

Introduces CAWI, a copula-based weight initialization method for randomized neural networks that models inter-feature dependence, improving predictive performance across 83 classification benchmarks.


Source: [https://arxiv.org/html/2605.12580](https://arxiv.org/html/2605.12580)

CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

Mushir Akhtar, M. Tanveer, Mohd. Arshad

Indian Institute of Technology Indore, Simrol, Indore, 453552, India

###### Abstract

Randomized neural networks (RdNNs) enable efficient, backpropagation-free training by freezing randomly initialized input-to-hidden weights, which permits a closed-form solution for the output layer. However, conventional random initialization is blind to inter-feature dependence, ignoring correlations, asymmetries, and tail dependence in the data, which degrades conditioning and predictive performance. To the best of our knowledge, this limitation remains unaddressed in the RdNN literature. To close this gap, we propose CAWI (Copula-Aligned Weight Initialization), a framework that draws input-to-hidden weights from a data-fitted copula that matches empirical dependence, ensuring the frozen projections respect inter-feature dependence without sacrificing the closed-form solution. CAWI (i) maps each feature to the unit interval using empirical CDFs, (ii) fits a multivariate copula that captures rank-based dependence among features, and (iii) samples each weight column $w_j$ from the fitted copula and applies a fixed inverse marginal transform to set scale. The objective, solver, and "freeze-once" paradigm remain unchanged; only the sampling law for $W$ becomes dependence-aware. For dependence modeling, we consider two copula families: elliptical (Gaussian, $t$) and Archimedean (Clayton, Frank, Gumbel). This enables CAWI to handle diverse dependence, including tail dependence. We evaluate CAWI across 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets, BreaKHis and the Schizophrenia dataset, using standard shallow and deep RdNN architectures. CAWI consistently delivers significant improvements in predictive performance over conventional random initialization. Code is available at [https://github.com/mtanveer1/CAWI](https://github.com/mtanveer1/CAWI).

## 1 Introduction

Deep learning (DL) models have achieved remarkable success across many domains, from image recognition Voulodimos et al. ([2018](https://arxiv.org/html/2605.12580#bib.bib14)); Chen et al. ([2019](https://arxiv.org/html/2605.12580#bib.bib37)); Liu et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib55)) to natural language processing Otter et al. ([2021](https://arxiv.org/html/2605.12580#bib.bib15)); Lauriola et al. ([2022](https://arxiv.org/html/2605.12580#bib.bib38)); Yin et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib56)), due to their ability to learn rich hierarchical representations of data LeCun et al. ([2015](https://arxiv.org/html/2605.12580#bib.bib17)); Luo et al. ([2023](https://arxiv.org/html/2605.12580#bib.bib74)). End-to-end training with backpropagation refines millions of parameters to minimize task-specific losses, yielding expressive features and strong generalization. However, state-of-the-art performance typically demands substantial computation, broad hyperparameter search, and long training cycles Goodfellow ([2016](https://arxiv.org/html/2605.12580#bib.bib39)); Tiwari et al. ([2023](https://arxiv.org/html/2605.12580#bib.bib75)). Furthermore, training deep architectures can be hindered by vanishing or exploding gradients Pascanu et al. ([2013](https://arxiv.org/html/2605.12580#bib.bib40)); Jaiswal et al. ([2022](https://arxiv.org/html/2605.12580#bib.bib57)); Ceni ([2025](https://arxiv.org/html/2605.12580#bib.bib76)).

The aforementioned limitations of DL models have prompted researchers to seek alternative neural architectures that couple competitive predictive performance with markedly lower training cost. A prominent family is randomized neural networks (RdNNs) Pao et al. ([1994](https://arxiv.org/html/2605.12580#bib.bib22)); Cao et al. ([2018](https://arxiv.org/html/2605.12580#bib.bib19)); Suganthan and Katuwal ([2021](https://arxiv.org/html/2605.12580#bib.bib58)); Hu et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib59)). In RdNNs, a substantial subset of parameters, typically the input-to-hidden weights, are sampled once from simple distributions and then held fixed, while the remaining (often output) weights are obtained by solving a single linear system, commonly with Tikhonov regularization, instead of iterative backpropagation Zhang and Suganthan ([2016](https://arxiv.org/html/2605.12580#bib.bib18)). This paradigm eliminates the need for iterative weight updates, substantially reducing training time and computational overhead, yet still allows RdNNs to learn complex input-output mappings Zhang and Suganthan ([2016](https://arxiv.org/html/2605.12580#bib.bib18)). Theoretically, RdNNs satisfy universal approximation on compact domains, approximating any continuous function arbitrarily well given sufficient width and appropriate activations Igelnik and Pao ([1995](https://arxiv.org/html/2605.12580#bib.bib25)); Needell et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib77)).

In recent years, RdNNs have witnessed significant advancements, particularly in methods aimed at improving their robustness and scalability. Ensemble-based approaches have been introduced to enhance generalization and mitigate variability Shi et al. ([2021](https://arxiv.org/html/2605.12580#bib.bib60)); Guo et al. ([2021](https://arxiv.org/html/2605.12580#bib.bib61)). Granular ball-based scalable methods have further extended the applicability of RdNNs to large-scale and high-dimensional datasets Sajid et al. ([2025a](https://arxiv.org/html/2605.12580#bib.bib62)). Additionally, the integration of fuzzy inference systems into RdNN frameworks has provided a way to incorporate uncertainty handling and interpretability, making them more effective for complex and imprecise decision-making tasks Sajid et al. ([2024a](https://arxiv.org/html/2605.12580#bib.bib63), [2025b](https://arxiv.org/html/2605.12580#bib.bib64)). Beyond these trends, a number of technical advances have broadened the RdNN toolkit Wong et al. ([2022](https://arxiv.org/html/2605.12580#bib.bib83)); Chen et al. ([2025](https://arxiv.org/html/2605.12580#bib.bib84)); Akhtar et al. ([2025](https://arxiv.org/html/2605.12580#bib.bib105)); Ren et al. ([2025](https://arxiv.org/html/2605.12580#bib.bib85)).

Despite several advancements, RdNNs face a fundamental limitation rooted in the very mechanism that makes them efficient: the input-to-hidden weights are randomly initialized once and then held fixed throughout training. With the matrix $W \in \mathbb{R}^{d \times h}$ frozen, the induced feature map $x \mapsto \phi(xW + b)$ is non-adaptive; the hidden representation $H = \phi(XW + \mathbf{1}_m b^\top) \in \mathbb{R}^{m \times h}$ is therefore determined by this random draw rather than learned from data. The fixed representation $H$ cannot adjust to dependencies among inputs (e.g., correlations, relative scales, higher-order interactions), so at practical hidden dimension $h$ it may underrepresent the joint patterns that drive prediction. The readout then computes $\hat{Y} = H\Theta$ with $\Theta \in \mathbb{R}^{h \times n}$; if relevant dependencies are absent from $H$, they cannot be recovered, leading to persistent approximation bias and lower performance at a given model size.

The preceding limitation naturally raises a question: can we make the frozen projections dependence-aware without abandoning closed-form training? In this work, we answer this question by introducing CAWI (Copula-Aligned Weight Initialization), a method that uses copula theory to generate *dependence-aware* input-to-hidden weights. Copulas model multivariate data by separating the univariate marginals from the joint dependence structure, allowing us to learn dependence without committing to specific marginal forms Nelsen ([2006](https://arxiv.org/html/2605.12580#bib.bib86)); Joe ([2014](https://arxiv.org/html/2605.12580#bib.bib87)). This separation is well suited to tabular learning, where feature scales and marginals vary widely but predictive signal often resides in cross-feature structure. Recent literature demonstrates the broad utility of copulas in AI: copulas for time-series forecasting Sun and Yu ([2022](https://arxiv.org/html/2605.12580#bib.bib94)); Ashok et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib95)), interpretable estimation of CNN deep-feature densities using copulas and the generalized characteristic function Chapman and Farvardin ([2024](https://arxiv.org/html/2605.12580#bib.bib88)), and survival analysis under dependent censoring with identifiability guarantees Zhang et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib90)). Further developments include copula-nested spectral kernel networks Tian et al. ([2024](https://arxiv.org/html/2605.12580#bib.bib92)), scalable mixed models with arbitrary marginals Simchoni and Rosset ([2025](https://arxiv.org/html/2605.12580#bib.bib89)), and neural copula density estimation Letizia et al. ([2025](https://arxiv.org/html/2605.12580#bib.bib93)), among others.

Several works in deep neural networks have explored structured or data-aware initialization strategies, such as orthogonal weight initialization Hu et al. ([2020](https://arxiv.org/html/2605.12580#bib.bib102)); Narkhede et al. ([2022](https://arxiv.org/html/2605.12580#bib.bib101)) and whitening-based transformations LeCun et al. ([2002](https://arxiv.org/html/2605.12580#bib.bib104)); Kessy et al. ([2018](https://arxiv.org/html/2605.12580#bib.bib103)). Orthogonal schemes stabilize signal propagation via algebraic constraints but remain distribution-agnostic, while whitening decorrelates features through covariance normalization, primarily addressing linear dependencies. More generally, data-dependent projection methods modify the feature space (e.g., via PCA or kernel mappings), often at additional computational cost.

In contrast, CAWI is designed for RdNNs, where hidden weights are sampled once and kept fixed. Rather than altering the feature representation or imposing structural constraints, CAWI modifies only the sampling distribution of the weight matrix using an empirical copula fitted to the training features. Specifically, we construct rank-based pseudo-observations, estimate elliptical (Gaussian, $t$) or Archimedean (Clayton, Frank, Gumbel) copulas, and draw each column of $W$ from the fitted dependence structure with fixed marginal scaling. The resulting frozen projections inherit empirical inter-feature dependence, while the RdNN architecture, activation functions, objective, and closed-form solver remain unchanged, incurring essentially no additional training cost.

##### Our Contributions

- We introduce CAWI (Copula-Aligned Weight Initialization), a plug-in scheme that makes randomized neural networks dependence-aware by sampling the columns of $W$ from a copula fitted to the training features, while preserving the closed-form solver and the freeze-once paradigm.
- We provide numerically stable constructions for elliptical (Gaussian, $t$) and Archimedean (Clayton, Frank, Gumbel) copulas via rank-based estimation and nearest-SPD projection.
- We evaluate CAWI on 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets (BreaKHis and the Schizophrenia dataset) using both shallow and deep RdNN architectures, and observe consistent improvements, with 4–5% relative gains on some datasets.

## 2 Preliminaries

### 2.1 Randomized neural network (RdNN) setting

Let $X \in \mathbb{R}^{m \times d}$ be the input matrix ($m$ samples, $d$ features) and $Y \in \mathbb{R}^{m \times n}$ the target matrix (e.g., one-hot labels). An RdNN draws a single hidden layer of $h$ units by sampling a *frozen* weight matrix $W \sim \mathcal{D}_W \subset \mathbb{R}^{d \times h}$ and bias $b \in \mathbb{R}^h$ from a prescribed base distribution (classically i.i.d. uniform or Gaussian, independent across coordinates), and then *never updates* these parameters. With an elementwise nonlinearity $\phi: \mathbb{R} \to \mathbb{R}$, the hidden representation is

$$H = \phi\big(XW + \mathbf{1}_m b^\top\big) \in \mathbb{R}^{m \times h}, \tag{1}$$
where $\mathbf{1}_m$ denotes an $m$-vector of ones broadcasting the bias. Training reduces to fitting only the output weights $\Theta \in \mathbb{R}^{h \times n}$ by ridge regression:

$$\Theta^\star \in \arg\min_{\Theta \in \mathbb{R}^{h \times n}} \|H\Theta - Y\|_F^2 + \frac{\lambda}{2}\|\Theta\|_F^2, \qquad \lambda \geq 0, \tag{2}$$
which admits the standard closed forms

$$\text{(primal)} \quad \Theta^\star = (H^\top H + \lambda I_h)^{-1} H^\top Y, \tag{3}$$
$$\text{(dual)} \quad \Theta^\star = H^\top (HH^\top + \lambda I_m)^{-1} Y. \tag{4}$$
Thus *no backpropagation is required*: once $W$ and $b$ are sampled, training is a single linear solve. Detailed formulation and architecture of standard RdNN models are discussed in Section [A](https://arxiv.org/html/2605.12580#A1) of the Appendix.
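For concreteness, below is a minimal NumPy sketch of this freeze-and-solve pipeline; the function name, the tanh activation, and the uniform sampling range are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_rdnn(X, Y, h=200, lam=1e-2, seed=0):
    """Sample-and-freeze W, b (Eq. 1), then solve the ridge readout (Eqs. 3-4)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W = rng.uniform(-1.0, 1.0, size=(d, h))  # frozen input-to-hidden weights
    b = rng.uniform(-1.0, 1.0, size=h)       # frozen biases
    H = np.tanh(X @ W + b)                   # hidden representation, Eq. (1)
    if m >= h:  # primal closed form, Eq. (3)
        Theta = np.linalg.solve(H.T @ H + lam * np.eye(h), H.T @ Y)
    else:       # dual closed form, Eq. (4), cheaper when m < h
        Theta = H.T @ np.linalg.solve(H @ H.T + lam * np.eye(m), Y)
    return W, b, Theta                       # predict via np.tanh(X @ W + b) @ Theta
```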

### 2.2 Copulas

Copulas are a classical device for isolating and modeling *dependence* separately from *marginals* Nelsen ([2006](https://arxiv.org/html/2605.12580#bib.bib86)); Joe ([2014](https://arxiv.org/html/2605.12580#bib.bib87)). Sklar's seminal result shows that any multivariate distribution can be written as a coupling of its univariate marginals through a single function on the unit hypercube, thereby reducing multivariate modeling to estimating marginals and a dependence map on $[0,1]^d$ Sklar ([1959](https://arxiv.org/html/2605.12580#bib.bib97)).

##### Copula:

Let $\mathbf{X} = (X_1, \ldots, X_d)$ be a random vector with marginal cumulative distribution functions (CDFs) $F_j(t) = \Pr[X_j \leq t]$ for $j = 1, \ldots, d$. The *copula associated with $\mathbf{X}$* is the joint CDF of the transformed vector $(F_1(X_1), \ldots, F_d(X_d)) \in [0,1]^d$, namely

$$C(u_1, \ldots, u_d) = \Pr\big\{F_1(X_1) \leq u_1, \ldots, F_d(X_d) \leq u_d\big\}, \qquad (u_1, \ldots, u_d) \in [0,1]^d. \tag{5}$$
Equivalently, a $d$-variate copula is a distribution function on $[0,1]^d$ whose univariate marginals are uniform on $(0,1)$. Figure [1](https://arxiv.org/html/2605.12580#S2.F1) provides an intuitive picture of the concept.

##### Invariance:

If $\psi_j$ are strictly increasing for each $j$, then $(\psi_1(X_1), \ldots, \psi_d(X_d))$ has the same copula as $(X_1, \ldots, X_d)$. Thus a copula depends only on cross-coordinate order, not on units or scales.

![Refer to caption](https://arxiv.org/html/2605.12580v1/x1.png)

Figure 1: Schematic of a copula. *Left:* empirical sample (with its joint CDF inset). *Middle:* univariate CDFs $F_1$ and $F_2$ (mapping $x_j \mapsto u_j = F_j(x_j)$). *Right:* the copula surface $C(u_1, u_2)$, i.e., the joint CDF on $[0,1]^2$ with uniform marginals. The equality sign indicates Sklar's factorization $F(x_1, x_2) = C(F_1(x_1), F_2(x_2))$; the $\oplus$ highlights that $C$ combines the marginal information into a dependence structure.
##### Sklar’s theorem:

Let $F: \mathbb{R}^d \to [0,1]$ be the joint CDF of $\mathbf{X} = (X_1, \ldots, X_d)$ with marginals $F_1, \ldots, F_d$. There exists a copula $C$ such that, for all $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$,

$$F(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big).$$
If each $F_j$ is continuous, the copula $C$ is unique. Conversely, for any copula $C$ and any marginals $F_1, \ldots, F_d$, the mapping above defines a valid $d$-variate distribution. If $X_1, \ldots, X_d$ are mutually independent, the copula is the *product copula* $C(u_1, \ldots, u_d) = \prod_{j=1}^d u_j$, representing absence of dependence among coordinates.
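The converse direction of Sklar's theorem is constructive: any copula can be paired with any marginals to build a joint law. A small sketch (our own naming, SciPy assumed) that couples an exponential and a heavy-tailed marginal through a Gaussian copula:

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(R, n, seed=0):
    """Draw n points whose copula is Gaussian with correlation matrix R."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, R.shape[0])) @ np.linalg.cholesky(R).T
    return stats.norm.cdf(Z)           # uniform marginals, Gaussian dependence

R = np.array([[1.0, 0.7], [0.7, 1.0]])
U = gaussian_copula_sample(R, 10_000)
X1 = stats.expon.ppf(U[:, 0])          # exponential marginal
X2 = stats.t(df=3).ppf(U[:, 1])        # heavy-tailed t marginal
# (X1, X2) has the chosen marginals but shares the Gaussian copula's dependence.
```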

## 3 Problem Statement

Randomized neural networks (RdNNs) *sample once and freeze* the input-to-hidden weight matrix $W \in \mathbb{R}^{d \times h}$ and bias $b \in \mathbb{R}^h$. With $H = \phi(XW + \mathbf{1}_m b^\top)$ and $\Theta^\star(W) = \arg\min_\Theta \{\|H\Theta - Y\|_F^2 + \lambda\|\Theta\|_F^2\}$, training reduces to a single ridge-regularized linear solve; no backpropagation is required.

##### Limitation.

The longstanding bottleneck is how $W$ is drawn: almost universally from an *i.i.d.*, factorized base law (uniform or Gaussian). Equivalently, this imposes the *product copula* $C_\perp(u) = \prod_{j=1}^d u_j$ across weight coordinates, irrespective of the empirical dependence in the inputs $X$ (correlations, asymmetries, tail dependence). The resulting mismatch can (i) reduce the diversity/effective rank of $H$, (ii) fail to excite discriminative co-movements, and (iii) worsen the conditioning of $H^\top H + \lambda I$.

##### Goal.

Replace the independence assumption on frozen weights *without* altering the RdNN pipeline. Given data $X \in \mathbb{R}^{m \times d}$, width $h$, activation $\phi$, and ridge $\lambda$, training remains the same closed form; only the *sampling law* for $W$ changes. We seek a *one-shot, data-aware* sampler $\mathcal{S}$ that maps $X$ to a distribution over $W$ such that each column $w \in \mathbb{R}^d$ has a joint law whose *copula* matches an estimator of input dependence, while its marginals are simple and well-behaved:

$$\operatorname{Copula}(w) \approx \widehat{\operatorname{Copula}}(X), \qquad w_j \sim G \ \text{for a fixed marginal } G \ \big(\text{e.g., } \mathcal{N}(0,1) \text{ or } \mathcal{U}[-1,1]\big), \quad j = 1, \ldots, d. \tag{6}$$
Biases $b$ may be drawn independently from a simple one-dimensional law.

##### Design objective.

Among samplers satisfying the dependence-matching and marginal constraints, choose $\mathcal{S}$ to improve downstream risk *without* any iterative tuning of $W$:

$$\min_{\mathcal{S}} \ \mathbb{E}_{W \sim \mathcal{S}(X)}\, \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[\ell\big(h(x; W, b)\,\Theta^\star(W),\, y\big)\big] \quad \text{s.t.} \quad \operatorname{Copula}(w) \approx \widehat{\operatorname{Copula}}(X), \quad w_j \sim G, \ j = 1, \ldots, d, \tag{7}$$
where $h(x; W, b) = \phi(xW + b) \in \mathbb{R}^h$ and $\mathcal{D}$ is the data distribution. In short, the problem is to replace i.i.d. frozen weights with *dependence-aware* frozen weights that align with the empirical copula of $X$, thereby addressing the core weakness of random initialization while preserving the hallmark efficiency of RdNNs.

## 4 Method

We introduce *Copula-Aligned Weight Initialization* (CAWI), a data-aware procedure that samples the frozen hidden weights in randomized neural networks using only the inputs $X$. CAWI preserves the classical RdNN pipeline (no backpropagation; closed-form output layer) while aligning the *dependence* among weight coordinates with that of the inputs. Concretely, CAWI (i) maps each feature to $[0,1]$ via its empirical cumulative distribution function to remove marginal scales, (ii) fits a multivariate copula on $[0,1]^d$ to summarize input dependence, and (iii) samples each weight column by drawing $u$ from the fitted copula and transforming coordinates through a fixed marginal law $G$ (e.g., $\mathcal{N}(0,1)$ or $\mathcal{U}[-1,1]$) via inverse CDFs. The training objective and closed-form solver remain unchanged; only the law used to sample (and then freeze) $W$ is replaced. The detailed CAWI procedure and its steps are presented next. Algorithm [1](https://arxiv.org/html/2605.12580#algorithm1) summarizes the procedure.

##### Step 1: Probability-Integral Transform of Features

Given $X \in \mathbb{R}^{m \times d}$ with columns $X_{:j}$, we construct pseudo-observations

$$U_{ij} = \widehat{F}_j(X_{ij}) \in (0,1), \qquad U \in (0,1)^{m \times d},$$
where $\widehat{F}_j$ is the empirical cumulative distribution function (ECDF) of feature $j$, computed across samples (i.e., with $j$ fixed and $i = 1, \ldots, m$). A common rank-based implementation sets
$$U_{ij} = \frac{\mathrm{rank}(X_{ij})}{m+1},$$
where ranks are assigned feature-wise across samples. This guarantees $U_{ij} \in (0,1)$ (never exactly $0$ or $1$) and thus avoids infinities when applying inverse CDFs in Step 3.

This transformation removes units and strictly monotone re-scalings, isolating cross-feature *dependence* in the joint distribution of $U$, consistent with standard empirical copula construction.
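A minimal sketch of this rank-based construction, assuming SciPy's `rankdata`; the function name is ours, not the authors' code.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(X):
    """Map each feature column to (0,1) via rank/(m+1); never exactly 0 or 1."""
    m = X.shape[0]
    return rankdata(X, axis=0) / (m + 1.0)   # feature-wise ranks across samples

# U = pseudo_observations(X)  # U in (0,1)^{m x d}: uniform-ish marginals, dependence kept
```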

##### Step 2: Fit a Multivariate Copula on $U$

Let $\mathcal{C} = \{C(\cdot; \eta) : \eta \in \Xi\}$ be a parametric family of $d$-variate copulas on $[0,1]^d$, with parameter $\eta$ in parameter space $\Xi$. Using the pseudo-observations $U$, we estimate $\widehat{\eta}$ to obtain a fitted copula $\widehat{C}(\cdot) = C(\cdot; \widehat{\eta})$, which summarizes the empirical dependence of the inputs and has uniform marginals by construction. Estimation is rank-based: one may maximize the copula pseudo-likelihood, solve method-of-moments equations (e.g., matching Kendall's $\tau$), or employ composite likelihoods over pairs when $d$ is large. This fit is performed *once* per training split and is agnostic to the downstream task. The procedure is copula-agnostic: any $d$-variate copula family for which an estimator and a sampler are available (elliptical or Archimedean) can be plugged into $\mathcal{C}$. Concrete instantiations used in our work are detailed in [4.1](https://arxiv.org/html/2605.12580#S4.SS1).

##### Step 3: Sample Dependence-Aligned Weight Columns

Choose a continuous one-dimensional marginal law $G$ with a strictly increasing quantile $G^{-1}$ for individual weight entries (e.g., $\mathcal{N}(0,1)$ or $\mathcal{U}[-1,1]$). For each hidden unit $t = 1, \ldots, h$:

1. draw $u^{(t)} \sim \widehat{C}$ in $[0,1]^d$;
2. set $w^{(t)} \leftarrow G^{-1}(u^{(t)})$ coordinatewise, i.e., $w_j^{(t)} \leftarrow G^{-1}(u_j^{(t)})$ for $j = 1, \ldots, d$;
3. assign the $t$-th column of $W$ as $W_{:t} \leftarrow w^{(t)}$ and draw a scalar bias $b_t$ independently from a simple one-dimensional law.

Since $G^{-1}$ is strictly increasing for continuous $G$, the *joint law of the coordinates of* $w^{(t)}$ has copula $\widehat{C}$ while its univariate marginals are $G$, thereby matching input dependence while retaining convenient weight marginals.
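As an illustration of Step 3, the sketch below instantiates $\widehat{C}$ as a Gaussian copula with estimated correlation `R_hat` and takes $G = \mathcal{N}(0,1)$; both choices, and all names, are our assumptions. For this particular pairing $G^{-1} \circ \Phi$ collapses, so the columns are simply $\mathcal{N}(0, \widehat{R})$ draws; the two-step form shows the general recipe for arbitrary $G$.

```python
import numpy as np
from scipy import stats

def sample_cawi_weights(R_hat, h, seed=0):
    """Columns of W get copula Gaussian(R_hat) and N(0,1) marginals; b is i.i.d. uniform."""
    rng = np.random.default_rng(seed)
    d = R_hat.shape[0]
    L = np.linalg.cholesky(R_hat)
    Z = rng.standard_normal((h, d)) @ L.T   # one latent Gaussian draw per hidden unit
    U = stats.norm.cdf(Z)                   # u^(t) ~ C_hat (Gaussian copula)
    W = stats.norm.ppf(U).T                 # G^{-1} coordinatewise; here it recovers Z
    b = rng.uniform(-1.0, 1.0, size=h)      # simple one-dimensional law for biases
    return W, b
```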

##### Step 4: Closed-Form Training (Unchanged)

With the dependence-aligned weights $W$, we form the hidden matrix $H = \phi\big(XW + \mathbf{1}_m b^\top\big)$, and obtain the output layer by the standard ridge-regularized least squares

$$\Theta^\star = \arg\min_{\Theta} \ \|H\Theta - Y\|_F^2 + \frac{\lambda}{2}\|\Theta\|_F^2.$$
No iterative updates of $W$ are introduced: CAWI modifies only the sampling law used to freeze the hidden weights; the training and inference pipeline remains identical to a classical RdNN.

Input: Data $X \in \mathbb{R}^{m \times d}$, width $h$, activation $\phi$, ridge $\lambda$, weight marginal $G$, copula family $\mathcal{C} = \{C(\cdot; \eta) : \eta \in \Xi\}$.

Output: Frozen weights $W \in \mathbb{R}^{d \times h}$, bias vector $b \in \mathbb{R}^h$, output weights $\Theta^\star$.

// Step 1: Probability-integral transform (pseudo-observations)
for $j \leftarrow 1$ to $d$ do: compute the ECDF $\widehat{F}_j$ of column $X_{:j}$; for $i \leftarrow 1$ to $m$ do: $U_{ij} \leftarrow \widehat{F}_j(X_{ij}) \in (0,1)$.

// Step 2: Fit a $d$-variate copula on $U$ (details in [4.1](https://arxiv.org/html/2605.12580#S4.SS1))
Estimate $\widehat{\eta}$ from $U$ (rank-based); set $\widehat{C}(\cdot) \leftarrow C(\cdot; \widehat{\eta})$.

// Step 3: Sample dependence-aligned weight columns
for $t \leftarrow 1$ to $h$ do: draw $u^{(t)} \sim \widehat{C}$ in $[0,1]^d$; set $w_j^{(t)} \leftarrow G^{-1}(u_j^{(t)})$ for $j = 1, \ldots, d$; set $W_{:t} \leftarrow w^{(t)}$.
Sample each bias $b_t$ independently from a simple one-dimensional law (e.g., uniform), for $t = 1, \ldots, h$.

// Step 4: Closed-form output layer (unchanged)
Form hidden features $H \leftarrow \phi\big(XW + \mathbf{1}_m b^\top\big)$; compute $\Theta^\star$.

Algorithm 1: CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks
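To show how the four steps compose, here is an end-to-end sketch of Algorithm 1 for the Gaussian-copula instantiation with $G = \mathcal{U}[-1,1]$. It is a compact illustration under our own naming (eigenvalue clipping stands in for the nearest-correlation projection), not the authors' released code.

```python
import numpy as np
from scipy import stats
from scipy.stats import rankdata, kendalltau

def cawi_rvfl_fit(X, Y, h=200, lam=1e-2, seed=0):
    """Steps 1-4 of Algorithm 1 with a Gaussian copula and G = U[-1,1]; Y is one-hot."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Step 1: rank-based pseudo-observations in (0,1)
    U = rankdata(X, axis=0) / (m + 1.0)
    # Step 2: R_ij = sin(pi/2 * tau_ij), then an eigenvalue-clipping projection
    R = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            tau, _ = kendalltau(U[:, i], U[:, j])
            R[i, j] = R[j, i] = np.sin(0.5 * np.pi * tau)
    evals, V = np.linalg.eigh(R)                     # simple stand-in for nearest-SPD
    R = V @ np.diag(np.clip(evals, 1e-6, None)) @ V.T
    s = np.sqrt(np.diag(R)); R = R / np.outer(s, s)  # restore the unit diagonal
    # Step 3: copula draws pushed through G^{-1} for G = U[-1,1]
    Z = rng.standard_normal((h, d)) @ np.linalg.cholesky(R).T
    W = stats.uniform(loc=-1, scale=2).ppf(stats.norm.cdf(Z)).T
    b = rng.uniform(-1.0, 1.0, size=h)
    # Step 4: closed-form ridge readout, unchanged from a classical RdNN
    H = np.tanh(X @ W + b)
    Theta = np.linalg.solve(H.T @ H + lam * np.eye(h), H.T @ Y)
    return W, b, Theta
```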
### 4.1 Copula Families: Parameterization and Estimation

We instantiate Step 2 with five widely used $d$-variate copula families that together capture symmetric vs. asymmetric dependence and both light- and heavy-tailed co-movement: *Gaussian* (elliptical, tail-neutral), *$t$* (elliptical, heavy-tailed), *Clayton* (Archimedean, lower-tail dependent), *Gumbel* (Archimedean, upper-tail dependent), and *Frank* (Archimedean, tail-neutral, non-elliptical). This selection balances expressive coverage with tractable rank-based estimation and efficient sampling, enabling CAWI to adapt to diverse empirical dependence while keeping the training pipeline simple and stable.

##### Elliptical families (Gaussian, $t$)

Both families are parameterized by a correlation matrix $R \in \mathbb{R}^{d \times d}$; the $t$ copula additionally has degrees of freedom $\nu > 2$. We estimate $R$ from pairwise Kendall's $\tau$ computed on $U$ via the elliptical identity

$$\tau_{ij} = \tfrac{2}{\pi}\arcsin(R_{ij}) \ \Rightarrow\ \widehat{R}_{ij} = \sin\big(\tfrac{\pi}{2}\,\widehat{\tau}_{ij}\big) \quad \text{(elementwise)},$$
followed by a nearest-correlation projection to enforce symmetry, unit diagonal, and positive (semi)definiteness (a vanishing ridge may be added if needed for numerical stability). For the $t$ copula, we then profile a rank-based pseudo-likelihood over $\nu$ with $R$ fixed to obtain $\widehat{\nu}$ (a one-dimensional search). The fitted copula $\widehat{C}$ yields draws $U \in (0,1)^{N \times d}$, which are mapped to weights through the inverse marginals in Step 3.
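The one-dimensional profile search over $\nu$ can be sketched as follows, assuming SciPy's `multivariate_t` (available in recent SciPy versions); the grid and function name are our illustrative choices.

```python
import numpy as np
from scipy import stats

def profile_t_copula_nu(U, R, nu_grid=np.arange(3, 31)):
    """Pick nu maximizing the t-copula pseudo-likelihood on pseudo-observations U."""
    best_nu, best_ll = None, -np.inf
    for nu in nu_grid:
        T = stats.t(df=nu).ppf(U)                       # map uniforms to t_nu marginals
        joint = stats.multivariate_t(shape=R, df=nu).logpdf(T).sum()
        marg = stats.t(df=nu).logpdf(T).sum()           # subtract the marginal densities
        if joint - marg > best_ll:
            best_nu, best_ll = nu, joint - marg
    return best_nu
```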

##### Archimedean families (Clayton, Gumbel, Frank)

These families are governed by a single scalar parameter $\theta$ that controls concordance and tail behavior. We employ robust, composite, rank-based estimation from pairwise Kendall's $\tau$ computed on $U$. Let $\widehat{\tau}_{ij}$ denote the sample Kendall's $\tau$ between coordinates $i$ and $j$, and let $\bar{\tau}$ be their average over $i < j$. With this setup, we estimate a single $\theta$ per family as follows.

- *Clayton* ($\theta > 0$, lower-tail dependent). The $\tau$-$\theta$ relation $\tau = \theta/(\theta + 2)$ yields $\widehat{\theta} = \frac{2\bar{\tau}}{1 - \bar{\tau}}$, or, equivalently, one may minimize a composite method-of-moments objective $\sum_{i<j}\big(\tfrac{\theta}{\theta+2} - \widehat{\tau}_{ij}\big)^2$.
- *Gumbel* ($\theta \geq 1$, upper-tail dependent). With $\tau = 1 - \tfrac{1}{\theta}$, $\widehat{\theta} = \frac{1}{1 - \bar{\tau}}$, or via the same composite fit $\sum_{i<j}\big(1 - \tfrac{1}{\theta} - \widehat{\tau}_{ij}\big)^2$.
- *Frank* ($\theta \in \mathbb{R} \setminus \{0\}$, tail-neutral, non-elliptical). Kendall's $\tau$ satisfies $\tau(\theta) = 1 - \frac{4}{\theta} + \frac{4}{\theta} D_1(\theta)$, where $D_1$ is the Debye function of order 1. We obtain $\widehat{\theta}$ by numerically inverting $\tau(\theta)$ at $\bar{\tau}$ (monotone one-dimensional root-finding), or by composite pseudo-likelihood.

For $d > 2$ we adopt the standard exchangeable extension, i.e., a single global parameter $\widehat{\theta}$ shared across dimensions. This delivers stable, scalable $d$-variate fits and admits efficient generator-based sampling of $U \in (0,1)^{N \times d}$. As in Step 3, the sampled $U$ is then mapped coordinatewise through the chosen marginal quantile $G^{-1}$ to produce dependence-aligned weight columns.
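As one concrete Archimedean instantiation, the sketch below estimates the exchangeable Clayton parameter from the average Kendall's $\bar{\tau}$ and samples $U$ via the standard Gamma-frailty (Marshall–Olkin) construction; it assumes positive average concordance ($\bar{\tau} > 0$) and uses our own naming.

```python
import numpy as np
from scipy.stats import kendalltau

def clayton_theta_from_tau(U):
    """Estimate theta from the average pairwise Kendall's tau (tau = theta/(theta+2))."""
    d = U.shape[1]
    taus = [kendalltau(U[:, i], U[:, j])[0]
            for i in range(d) for j in range(i + 1, d)]
    tau_bar = float(np.mean(taus))          # assumed > 0 for a valid Clayton fit
    return 2.0 * tau_bar / (1.0 - tau_bar)

def clayton_sample(theta, n, d, seed=0):
    """Exchangeable d-variate Clayton draws via the Gamma-frailty construction."""
    rng = np.random.default_rng(seed)
    V = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))  # shared frailty
    E = rng.exponential(size=(n, d))
    return (1.0 + E / V) ** (-1.0 / theta)  # psi(E/V) with psi(t) = (1+t)^(-1/theta)
```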

### 4.2 Computational Complexity of CAWI

CAWI adds a one-time preprocessing stage (Steps 1–3) on top of the RdNN pipeline. Let $m$ denote the number of samples, $d$ the input dimensionality, and $h$ the width (number of hidden units).

##### Step 1: Probability-integral transform.

Computing the empirical CDF or rank transform for each of the $d$ feature columns requires sorting $m$ values per column, leading to a total cost of $O(d\, m \log m)$.

##### Step 2: Copula fitting.

Using rank-based Kendall's $\tau$ for dependence estimation requires computing pairwise concordance statistics for all $d(d-1)/2 = O(d^2)$ feature pairs. A standard implementation of Kendall's $\tau$ costs $O(m^2)$ per pair, yielding an overall complexity of $O(d^2 m^2)$. For elliptical copulas (Gaussian or $t$), an additional nearest-correlation or spectral projection step contributes $O(d^3)$. The $t$-copula further requires a one-dimensional search over degrees of freedom, with per-iteration cost $O(md^2)$, which remains negligible relative to the dominant $O(d^2 m^2)$ term. For Archimedean copulas (Clayton, Gumbel, Frank), estimating a single scalar parameter from the averaged Kendall's $\bar{\tau}$ incurs negligible cost beyond the $O(d^2 m^2)$ computation of the pairwise $\tau$-values.

##### Step 3: Sampling dependence-aligned weight columns.

For elliptical copulas, sampling requires an initial factorization of the $d \times d$ correlation matrix, costing $O(d^3)$, followed by $h$ matrix-vector multiplications, each costing $O(d^2)$, for a total of $O(d^3 + hd^2)$. For Archimedean copulas, sampling each column requires only $O(d)$ operations, yielding a total cost of $O(hd)$. In both cases, mapping the copula samples through the marginal quantile $G^{-1}$ costs $O(hd)$, which is dominated by the previous terms.

##### Step 4: Closed-form RdNN training (unchanged).

With dependence-aligned frozen weights, hidden features are formed as $H = \phi(XW + \mathbf{1}_m b^\top)$. The cost of forming $XW$ is $O(mdh)$, and computing the ridge-regularized output layer using the normal equations costs $O(mh^2 + h^3)$. This is identical to classical RdNNs.

##### Summary.

CAWI introduces a one-time preprocessing cost of $O\big(d\, m \log m + d^2 m^2 + d^3 + hd^2\big)$, while the dominant runtime in practice remains the standard RdNN training step, $O(mdh + mh^2 + h^3)$. Thus, CAWI preserves the hallmark efficiency of RdNNs. The empirical training-time analysis reported in Section E of the Appendix further confirms this efficiency.

## 5 Empirical Results

We evaluate CAWI by integrating it into representative randomized architectures spanning *shallow*, *deep*, and *wide* settings: *(i) RVFL* (single hidden layer with direct links) Pao et al. ([1994](https://arxiv.org/html/2605.12580#bib.bib22)), *(ii) dRVFL* (deep RVFL) Shi et al. ([2021](https://arxiv.org/html/2605.12580#bib.bib60)), and *(iii) BLS* (Broad Learning System; a wide, block-structured random-feature architecture) Chen and Liu ([2017](https://arxiv.org/html/2605.12580#bib.bib66)). We compare the standard i.i.d. baseline against CAWI variants that differ only in the fitted copula used to sample columns of $W$:

- i.i.d. baseline: $W$ is sampled independently from a Uniform distribution ($\mathcal{U}[-1,1]$).
- CAWI (elliptical): $W$ is sampled using *Gaussian* and *$t$* copulas.
- CAWI (Archimedean): $W$ is sampled using *Clayton*, *Frank*, and *Gumbel* copulas.

The detailed experimental setup is provided in Section [B](https://arxiv.org/html/2605.12580#A2) of the Appendix.

Table 1: Results on the BreaKHis dataset for RVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 2: Results on the BreaKHis dataset for dRVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 3: Results on the Schizophrenia ROI datasets for RVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 4: Results on the Schizophrenia ROI datasets for dRVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

### 5.1 Dataset Description

The evaluation is conducted on 83 benchmark datasets downloaded from the UCI repository Dua and Graff ([2017](https://arxiv.org/html/2605.12580#bib.bib69)), comprising 30 binary and 53 multiclass classification tasks. The number of classes ranges from 2 to 26. Sample sizes span from 24 to 58,000 instances, and the number of features ranges from 3 to 263, covering both small- and large-scale datasets as well as low- and high-dimensional regimes. Detailed dataset statistics are provided in Table [6](https://arxiv.org/html/2605.12580#A2.T6) of the Appendix. To validate performance in real-world settings, we additionally evaluate CAWI on two biomedical datasets: the *BreaKHis* breast cancer dataset Spanhol et al. ([2015](https://arxiv.org/html/2605.12580#bib.bib100)) and the *Schizophrenia ROI* dataset from the COBRE repository ([http://fcon_1000.projects.nitrc.org/indi/retro/cobre.html](http://fcon_1000.projects.nitrc.org/indi/retro/cobre.html)). For BreaKHis, we use 1,240 histopathological image scans at 400× magnification, organized into benign and malignant classes. The benign subclasses are adenosis (AD, 106), fibroadenoma (FA, 237), phyllodes tumor (PT, 115), and tubular adenoma (TA, 130); the malignant subclasses are lobular carcinoma (LC, 137), papillary carcinoma (PC, 138), ductal carcinoma (DC, 208), and mucinous carcinoma (MC, 169). Feature extraction follows the procedure in Gautam et al. ([2020](https://arxiv.org/html/2605.12580#bib.bib99)). The Schizophrenia ROI dataset includes 72 schizophrenia subjects (ages 18–65; mean $38.1 \pm 13.9$ years) and 74 healthy controls (ages 18–65; mean $35.8 \pm 11.5$ years); ROI features are derived following Tanveer et al. ([2022](https://arxiv.org/html/2605.12580#bib.bib98)).

### 5.2 Performance Evaluation

In this subsection, we present empirical results on three groups of benchmarks: the BreaKHis dataset, the Schizophrenia ROI dataset, and a broader suite of 83 UCI benchmark datasets. For BreaKHis and Schizophrenia ROI, we evaluate RVFL, dRVFL, and BLS under the standard i.i.d. initialization and the proposed CAWI variants (Gaussian, $t$, Clayton, Frank, and Gumbel). Tables [1](https://arxiv.org/html/2605.12580#S5.T1) and [2](https://arxiv.org/html/2605.12580#S5.T2) report the BreaKHis results for RVFL and dRVFL, while Tables [3](https://arxiv.org/html/2605.12580#S5.T3) and [4](https://arxiv.org/html/2605.12580#S5.T4) present the corresponding results for the Schizophrenia ROI dataset. Results for BLS are provided in Tables [7](https://arxiv.org/html/2605.12580#A2.T7)–[8](https://arxiv.org/html/2605.12580#A2.T8) of the Appendix. For the broader benchmark evaluation, we report results on 83 UCI datasets (30 binary and 53 multiclass). The aggregate summary is presented in Table [5](https://arxiv.org/html/2605.12580#S5.T5), while the complete per-dataset results are available in Tables [9](https://arxiv.org/html/2605.12580#A2.T9)–[10](https://arxiv.org/html/2605.12580#A2.T10) of the Appendix. All these results are reported using a fixed seed (rng(42, 'twister')); additional experiments with two independent seeds (7 and 123) are provided in Section [C](https://arxiv.org/html/2605.12580#A3) of the Appendix to assess seed sensitivity.

#### 5.2.1 Results on the BreaKHis Dataset

Table [1](https://arxiv.org/html/2605.12580#S5.T1) shows that *every* pairwise task attains a higher test accuracy with a copula-aligned initialization than with the i.i.d. baseline (rightmost column $> 0$ for all 16 rows). Gains range from about $+1.07$ to $+5.60$ percentage points (e.g., TA vs. PC: $+5.60$, PT vs. LC: $+5.20$, PT vs. DC: $+4.64$), indicating that dependence-aware sampling of $W$ provides a meaningful lift even in a shallow RVFL. Not every copula family dominates on every task; rather, the *best* variant shifts across pairs, consistent with heterogeneous dependence patterns in the features. Across the 16 pairs, the winning family varies: the $t$-copula is best on 5 tasks (AD vs. LC/MC; FA vs. LC; TA vs. LC; plus a three-way tie in FA vs. DC), *Gaussian* on 3 (e.g., AD vs. DC; FA vs. PC; PT vs. LC), *Clayton* on 2 (PT vs. DC; TA vs. DC), *Frank* on 4 (FA vs. MC; PT vs. MC/PC), and *Gumbel* on 2 (AD vs. PC; TA vs. PC). A three-way tie (Gaussian/$t$/Clayton) in FA vs. DC suggests that distinct copula families can approximate the relevant dependence equally well in some cases. Overall, every BreaKHis pair benefits from copula-aligned initialization (all improvements $> 0$), with gains up to $+5.60$ percentage points, reinforcing the central claim: injecting empirical dependence into the frozen projections consistently improves RVFL performance, while different copula families provide complementary advantages across tasks.

Table [2](https://arxiv.org/html/2605.12580#S5.T2) shows that copula-aligned initialization again improves over the i.i.d. baseline on *every* pair (rightmost column $> 0$ for all 16 rows), now with a wider spread of gains: from $+0.27$ to $+12.97$ percentage points (e.g., PT vs. DC: $+12.97$, TA vs. PC: $+4.10$, PT vs. MC: $+3.87$). This indicates that dependence-aware sampling remains beneficial in deeper randomized stacks and can yield larger lifts when multiple frozen layers compound representational biases. The winning family varies by pair, with a clear pattern favoring Archimedean copulas: *Clayton* leads on 8 tasks, *Frank* on 4, *Gumbel* on 2, and the $t$-copula on 2; *Gaussian* does not take a win in this setting. These outcomes are consistent with heterogeneous, often asymmetric dependence in the extracted features: Clayton/Frank/Gumbel (which capture different tail/asymmetry profiles) dominate, while elliptical models win when dependence is more symmetric.

#### 5.2.2 Results on the Schizophrenia Dataset

Across the three ROI configurations (GM+WM, GM, WM), CAWI improves upon the i.i.d. baseline in both architectures. In the RVFL setting (see Table [3](https://arxiv.org/html/2605.12580#S5.T3)), the best copula varies by ROI: *Gumbel* leads on GM+WM ($+2.78$), *Gaussian* on GM ($+1.38$), and *Clayton* on WM ($+1.31$). For dRVFL (see Table [4](https://arxiv.org/html/2605.12580#S5.T4)), the pattern shifts: *Gaussian* tops GM+WM ($+3.49$ pp), *Frank* on GM ($+2.16$), and *Clayton* again wins on WM ($+3.45$). Overall, these results echo the BreaKHis findings without redundancy: dependence-aware sampling of $W$ consistently lifts accuracy, and the most effective copula family depends on the ROI subset and architecture, underscoring the value of modeling the *type* of dependence (symmetric vs. asymmetric; central vs. tail) rather than assuming independence.

### 5.3 Results on 83 UCI Datasets

On the broader benchmark suite, we evaluate RVFL on 83 UCI datasets. To provide a concise and transparent summary of the empirical findings, we report the aggregate performance of CAWI in Table [5](https://arxiv.org/html/2605.12580#S5.T5). For each dataset, we compare the conventional i.i.d. initialization with the best-performing CAWI configuration (i.e., the copula family yielding the highest test accuracy for that dataset). This reflects the intended usage of CAWI as a flexible, copula-agnostic framework capable of adapting to dataset-specific dependency structures. The best CAWI variant outperforms the i.i.d. baseline on every one of the 83 datasets. The average test accuracy of the i.i.d. initialization is 78.08%, whereas the average accuracy of the best CAWI configuration reaches 79.35%, corresponding to a mean improvement of $+1.27$ percentage points. These results demonstrate that CAWI yields consistent improvements across both binary and multiclass problems, rather than isolated gains on a small subset of datasets.

Table 5: Aggregate performance summary across 83 UCI datasets.

## 6 Conclusions

In this work, we address a foundational gap in randomized neural networks, namely that input-to-hidden weights are randomly initialized and then held fixed throughout training. We present CAWI, a plug-in initialization scheme that aligns the frozen weight columns with the empirical dependence structure of the inputs via a fitted copula, while leaving the classical RdNN pipeline (closed-form readout, no backpropagation) unchanged. By modifying only the sampling law for $W$, CAWI yields dependence-aware hidden projections with a computational footprint comparable to i.i.d. initialization. Building on this design, our empirical study shows consistent and often substantial gains across 83 UCI benchmarks and two biomedical datasets (BreaKHis and Schizophrenia ROI) using RVFL, dRVFL, and BLS architectures, all while preserving backpropagation-free training. Code is available at [https://github.com/mtanveer1/CAWI](https://github.com/mtanveer1/CAWI).

A natural direction for future research is the development of automatic copula selection strategies. Rather than manually evaluating multiple copula families, a data-driven mechanism could identify the most suitable dependency structure for a given dataset and initialize the model accordingly. Such an adaptive selection procedure would further streamline CAWI while preserving its computational efficiency and enhancing its ability to capture dataset-specific dependence patterns.

## Acknowledgement

Mushir Akhtar acknowledges the financial support received from the CSIR, New Delhi, under Fellowship Grant No. 09/1022(13849)/2022-EMR-I. Mohd. Arshad acknowledges the funding support provided by the ANRF, India, through the Core Research Grant (Grant No. CRG/2023/001230).

## References

- M. Akhtar et al. (2025) Towards robust and inversion-free randomized neural networks: the XG-RVFL framework. Pattern Recognition, pp. 112711.
- A. Ashok, É. Marcotte, V. Zantedeschi, N. Chapados, and A. Drouin (2024) TACTiS-2: better, faster, simpler attentional copulas for multivariate time series. In The Twelfth International Conference on Learning Representations.
- W. Cao, X. Wang, Z. Ming, and J. Gao (2018) A review on neural networks with random weights. Neurocomputing 275, pp. 278–287.
- A. Ceni (2025) Random orthogonal additive filters: a solution to the vanishing/exploding gradient of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems.
- D. Chapman and P. Farvardin (2024) Interpretable measurement of CNN deep feature density using copula and the generalized characteristic function. arXiv preprint arXiv:2411.05183.
- C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su (2019) This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32.
- C. P. Chen and Z. Liu (2017) Broad learning system: an effective and efficient incremental learning system without the need for deep architecture. IEEE Transactions on Neural Networks and Learning Systems 29(1), pp. 10–24.
- W. Chen, K. Yang, Z. Yu, F. Nie, and C. P. Chen (2025) Adaptive broad network with graph-fuzzy embedding for imbalanced noise data. IEEE Transactions on Fuzzy Systems.
- D. Dua and C. Graff (2017) UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
- C. Gautam, P. K. Mishra, A. Tiwari, B. Richhariya, H. M. Pandey, S. Wang, M. Tanveer, and for the Alzheimer's Disease Neuroimaging Initiative (2020) Minimum variance-embedded deep kernel regularized least squares method for one-class classification and its applications to biomedical data. Neural Networks 123, pp. 191–216.
- I. Goodfellow (2016) Deep learning. MIT Press.
- L. Guo, R. Li, and B. Jiang (2021) An ensemble broad learning scheme for semisupervised vehicle type classification. IEEE Transactions on Neural Networks and Learning Systems 32(12), pp. 5287–5297.
- M. Hu, R. Gao, and P. N. Suganthan (2024) Self-distillation for randomized neural networks. IEEE Transactions on Neural Networks and Learning Systems 35(11), pp. 16119–16128.
- W. Hu, L. Xiao, and J. Pennington (2020) Provable benefit of orthogonal initialization in optimizing deep linear networks. In The Eighth International Conference on Learning Representations.
- G. Huang, Q. Zhu, and C. Siew (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1-3), pp. 489–501.
- B. Igelnik and Y. Pao (1995) Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks 6(6), pp. 1320–1329.
- A. Jaiswal, P. Wang, T. Chen, J. Rousseau, Y. Ding, and Z. Wang (2022) Old can be gold: better gradient flow can make vanilla-GCNs great again. In Advances in Neural Information Processing Systems, Vol. 35, pp. 7561–7574.
- H. Joe (2014) Dependence modeling with copulas. CRC Press.
- A. Kessy, A. Lewin, and K. Strimmer (2018) Optimal whitening and decorrelation. The American Statistician 72(4), pp. 309–314.
- I. Lauriola, A. Lavelli, and F. Aiolli (2022) An introduction to deep learning in natural language processing: models, techniques, and tools. Neurocomputing 470, pp. 443–456.
- Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521(7553), pp. 436–444.
- Y. LeCun, L. Bottou, G. B. Orr, and K. Müller (2002) Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50.
- N. A. Letizia, N. Novello, and A. M. Tonello (2025) Copula density neural estimation. IEEE Transactions on Neural Networks and Learning Systems.
- Y. Liu, J. Wen, C. Liu, X. Fang, Z. Li, Y. Xu, and Z. Zhang (2024) Language-driven cross-modal classifier for zero-shot multi-label image recognition. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 32173–32183.
- B. Luo, J. Zhu, T. Yang, S. Zhao, C. Hu, X. Zhao, and Y. Gao (2023) Learning deep hierarchical features with spatial regularization for one-class facial expression recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 6065–6073.
- M. V. Narkhede, P. P. Bartakke, and M. S. Sutaone (2022) A review on weight initialization strategies for neural networks. Artificial Intelligence Review 55(1), pp. 291–322.
- D. Needell, A. A. Nelson, R. Saab, P. Salanevich, and O. Schavemaker (2024) Random vector functional link networks for function approximation on manifolds. Frontiers in Applied Mathematics and Statistics 10, pp. 1284706.
- R. B. Nelsen (2006) An introduction to copulas. Springer.
- D. W. Otter, J. R. Medina, and J. K. Kalita (2021) A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32(2), pp. 604–624.
- Y. Pao, G. Park, and D. J. Sobajic (1994) Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6(2), pp. 163–180.
- R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, PMLR Vol. 28, pp. 1310–1318.
- C. Ren, S. Cheng, and Z. Xuan (2025) FOATLBL: federated online active transfer learning via broad learning system. Neurocomputing, pp. 131363.
- M. Sajid, A. Quadir, M. Tanveer, and for the Alzheimer's Disease Neuroimaging Initiative (2025a) GB-RVFL: fusion of randomized neural network and granular ball computing. Pattern Recognition 159, pp. 111142.
- M. Sajid, A. K. Malik, M. Tanveer, and P. N. Suganthan (2024a) Neuro-fuzzy random vector functional link neural network for classification and regression problems. IEEE Transactions on Fuzzy Systems 32(5), pp. 2738–2749.
- M. Sajid, A. K. Malik, and M. Tanveer (2024b) Intuitionistic fuzzy broad learning system: enhancing robustness against noise and outliers. IEEE Transactions on Fuzzy Systems 32(8), pp. 4460–4469.
- M. Sajid, M. Tanveer, and P. N. Suganthan (2025b) Ensemble deep random vector functional link neural network based on fuzzy inference system. IEEE Transactions on Fuzzy Systems 33(1), pp. 479–490.
- Q. Shi, R. Katuwal, P. N. Suganthan, and M. Tanveer (2021) Random vector functional link neural network based ensemble deep learning. Pattern Recognition 117, pp. 107978.
- G. Simchoni and S. Rosset (2025) Flexible copula-based mixed models in deep learning: a scalable approach to arbitrary marginals. In International Conference on Artificial Intelligence and Statistics, pp. 91–99.
- M. Sklar (1959) Fonctions de répartition à n dimensions et leurs marges. In Annales de l'ISUP, Vol. 8, pp. 229–231.
- F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte (2015) A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering 63(7), pp. 1455–1462.
- P. N. Suganthan and R. Katuwal (2021) On the origins of randomization-based feedforward neural networks. Applied Soft Computing 105, pp. 107239.
- S. Sun and R. Yu (2022) Copula conformal prediction for multi-step time series forecasting. arXiv preprint arXiv:2212.03281.
- M. Tanveer, M. A. Ganaie, A. Bhattacharjee, and C. T. Lin (2022) Intuitionistic fuzzy weighted least squares twin SVMs. IEEE Transactions on Cybernetics 53(7), pp. 4400–4409.
- J. Tian, H. Xue, Y. Xue, and P. Fang (2024) Copula-nested spectral kernel network. In Forty-first International Conference on Machine Learning.
- R. Tiwari, A. Chavan, D. Gupta, G. Mago, A. Gupta, A. Gupta, S. Sharan, Y. Yang, S. Zhao, S. Wang, et al. (2023) RCV2023 challenges: benchmarking model training and inference for resource-constrained deep learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1534–1543.
- A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis (2018) Deep learning for computer vision: a brief review. Computational Intelligence and Neuroscience 2018(1), pp. 7068349.
- C. S. Y. Wong, G. Yang, A. Ambikapathi, and R. Savitha (2022) Online continual learning using enhanced random vector functional link networks. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1905–1909.
- F. Yin, J. Srinivasa, and K. Chang (2024) Characterizing truthfulness in large language model generations with local intrinsic dimension. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 57069–57084.
- L. Zhang and P. N. Suganthan (2016) A survey of randomized algorithms for training neural networks. Information Sciences 364, pp. 146–155.
- W. Zhang, C. K. Ling, and X. Zhang (2024) Deep copula-based survival analysis for dependent censoring with identifiability guarantees. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 20613–20621.

## Checklist

1. For all models and algorithms presented, check if you include:
   - (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
   - (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes]
   - (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes]
2. For any theoretical claim, check if you include:
   - (a) Statements of the full set of assumptions of all theoretical results. [Not Applicable]
   - (b) Complete proofs of all theoretical results. [Not Applicable]
   - (c) Clear explanations of any assumptions. [Not Applicable]
3. For all figures and tables that present empirical results, check if you include:
   - (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes]
   - (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes]
   - (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes]
   - (d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
   - (a) Citations of the creators, if your work uses existing assets. [Yes]
   - (b) The license information of the assets, if applicable. [Not Applicable]
   - (c) New assets either in the supplemental material or as a URL, if applicable. [Yes]
   - (d) Information about consent from data providers/curators. [Not Applicable]
   - (e) Discussion of sensitive content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]
5. If you used crowdsourcing or conducted research with human subjects, check if you include:
   - (a) The full text of instructions given to participants and screenshots. [Not Applicable]
   - (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
   - (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]


## Appendix A Architectural and Mathematical Formulation of Standard RdNN Models

### A.1 Formulation and Architecture of RVFL and ELM (Pao et al., [1994](https://arxiv.org/html/2605.12580#bib.bib22); Huang et al., [2006](https://arxiv.org/html/2605.12580#bib.bib20))

RVFL and ELM are single-layer feed-forward RdNNs that rely on hidden-layer transformations and closed-form solutions for the output weight computation. While their architectures share significant similarities, the primary distinction lies in the presence of direct input-to-output connections in RVFL, which are absent in ELM. Their common formulation is described as follows.

Hidden Layer: The hidden layer transforms the input data $X$ using a random weight matrix $W \in \mathbb{R}^{d \times h}$ and bias $B \in \mathbb{R}^{m \times h}$:

$$H = \phi(XW + B), \qquad (8)$$

where $\phi(\cdot)$ is the non-linear activation function and $h$ is the number of nodes in the hidden layer.

Output Layer: In RVFL, the final representation combines the hidden-layer outputs and the original inputs to form an augmented matrix:

$$A = [X \,\|\, H], \qquad (9)$$

whereas in ELM, the representation is derived directly from the hidden-layer outputs:

$$A = H. \qquad (10)$$

The output weight matrix $\Theta$ (of size $(d+h) \times n_{\text{class}}$ for RVFL and $h \times n_{\text{class}}$ for ELM) is computed in both architectures using the closed-form solution

$$\Theta = (A^{\top} A + \lambda I)^{-1} A^{\top} Y, \qquad (11)$$

where $\lambda$ is a regularization parameter.

Key Difference: RVFL incorporates direct input-to-output connections ($A = [X \,\|\, H]$), which preserve input information and enhance robustness. In contrast, ELM omits this feature ($A = H$), yielding a simpler architecture that trades some robustness for faster training.
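
For concreteness, the following MATLAB sketch assembles both models end-to-end on synthetic data. It is a minimal illustration of Eqs. (8)–(11), not the released implementation; the problem sizes, sigmoid activation, and regularization value are illustrative placeholders.

```matlab
% Minimal MATLAB sketch of RVFL/ELM training, following Eqs. (8)-(11).
% Problem sizes, activation, and lambda are illustrative placeholders.
rng(42, 'twister');
m = 200; d = 10; nclass = 2; h = 100; lambda = 1e-3;
X = randn(m, d);                                   % input matrix
labels = randi(nclass, m, 1);
Y = zeros(m, nclass);
Y(sub2ind(size(Y), (1:m)', labels)) = 1;           % one-hot targets

W = randn(d, h);                                   % frozen random input-to-hidden weights
B = repmat(randn(1, h), m, 1);                     % random bias, one per hidden node
H = 1 ./ (1 + exp(-(X*W + B)));                    % Eq. (8) with sigmoid phi

A_rvfl = [X, H];                                   % Eq. (9): RVFL augments with raw inputs
A_elm  = H;                                        % Eq. (10): ELM uses hidden outputs only

% Eq. (11): closed-form ridge solution for the output weights
Theta_rvfl = (A_rvfl'*A_rvfl + lambda*eye(d + h)) \ (A_rvfl'*Y);
Theta_elm  = (A_elm'*A_elm  + lambda*eye(h))      \ (A_elm'*Y);
```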

### A.2 Formulation and Architecture of BLS (Chen and Liu, [2017](https://arxiv.org/html/2605.12580#bib.bib66))

BLS represents a significant advancement in RdNNs by utilizing a flat architecture instead of the deep hierarchical structures seen in traditional deep learning. The architecture of BLS consists of three primary components: the feature layer, the enhancement layer, and the output layer. The feature layer extracts meaningful features from the input data by generating multiple groups of feature nodes through random projection and non-linear transformations. The enhancement layer, in turn, enriches the feature representations by applying additional non-linear transformations, thereby constructing enhancement nodes that provide diversity and robustness. The outputs from both layers are concatenated to form a final matrix, which is used to compute the output weights via a closed-form solution. Its architecture can be described as follows.

Feature Layer: For the input matrix $X$, the feature layer generates $q$ windows of feature nodes. Each window $f_i$ contains $p$ nodes, computed as:

$$Z_{f_i} = \phi(X W_{f_i} + B_{f_i}), \qquad (12)$$

where $W_{f_i} \in \mathbb{R}^{d \times p}$ is a random weight matrix, $B_{f_i} \in \mathbb{R}^{m \times p}$ is a bias matrix, $\phi(\cdot)$ is a non-linear function, and $Z_{f_i} \in \mathbb{R}^{m \times p}$ is the output of the $i$-th feature node window. The outputs of all feature windows are concatenated to form the overall feature layer output:

$$Z = [Z_{f_1}, Z_{f_2}, \ldots, Z_{f_q}], \qquad (13)$$

where $Z \in \mathbb{R}^{m \times pq}$. The feature layer output $Z$ serves as the input to the enhancement layer.

Enhancement Layer: The enhancement layer enriches the feature representation by applying additional random transformations and a non-linear function to the concatenated feature nodes $Z$. Each window $e_j$ of enhancement nodes (there are $s$ windows, each with $r$ nodes) is computed as:

$$E_{e_j} = \psi(Z W_{e_j} + B_{e_j}), \qquad (14)$$

where $Z \in \mathbb{R}^{m \times pq}$ is the output from the feature layer, $W_{e_j} \in \mathbb{R}^{pq \times r}$ is a random weight matrix, $B_{e_j} \in \mathbb{R}^{m \times r}$ is a bias matrix, $\psi(\cdot)$ is a non-linear function, and $E_{e_j} \in \mathbb{R}^{m \times r}$ is the output of the $j$-th enhancement node window. The outputs of all enhancement windows are concatenated to form the overall enhancement layer output:

$$E = [E_{e_1}, E_{e_2}, \ldots, E_{e_s}], \qquad (15)$$

where $E \in \mathbb{R}^{m \times rs}$.

Output Layer: The final representation matrix $A$, which combines the outputs of the feature and enhancement layers, is given by:

$$A = [Z \,\|\, E], \qquad (16)$$

where $A \in \mathbb{R}^{m \times (pq + rs)}$.

The output weights $\Theta \in \mathbb{R}^{(pq + rs) \times n_{\text{class}}}$ are then learned using the closed-form solution

$$\Theta = (A^{\top} A + \lambda I)^{-1} A^{\top} Y, \qquad (17)$$

where $\lambda$ is a regularization parameter.

Incremental Learning: One of BLS's distinctive features is its ability to incrementally update the network by adding new feature mapping nodes or enhancement nodes without retraining the entire model. This makes BLS computationally efficient and adaptable to streaming or evolving data.
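
The flat BLS pipeline of Eqs. (12)–(17) can be sketched in a few lines of MATLAB. The window counts, node counts, and activations below are illustrative assumptions, not the tuned settings of Appendix B.

```matlab
% Minimal MATLAB sketch of BLS, following Eqs. (12)-(17).
rng(42, 'twister');
m = 200; d = 10; nclass = 2;
X = randn(m, d);
labels = randi(nclass, m, 1);
Y = zeros(m, nclass); Y(sub2ind(size(Y), (1:m)', labels)) = 1;

q = 5; p = 8;                             % feature windows and nodes per window
s = 3; r = 20;                            % enhancement windows and nodes per window
lambda = 1e-3;
phi = @(T) 1 ./ (1 + exp(-T));            % feature-layer non-linearity
psi = @(T) tanh(T);                       % enhancement non-linearity (tansig)

Z = [];
for i = 1:q                               % Eqs. (12)-(13): feature layer
    Wf = randn(d, p); Bf = repmat(randn(1, p), m, 1);
    Z = [Z, phi(X*Wf + Bf)];
end
E = [];
for j = 1:s                               % Eqs. (14)-(15): enhancement layer
    We = randn(p*q, r); Be = repmat(randn(1, r), m, 1);
    E = [E, psi(Z*We + Be)];
end
A = [Z, E];                               % Eq. (16): final representation
Theta = (A'*A + lambda*eye(p*q + r*s)) \ (A'*Y);   % Eq. (17)
```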

### A.3 Formulation and Architecture of Deep RVFL (Shi et al., [2021](https://arxiv.org/html/2605.12580#bib.bib60))

The deep random vector functional link network \(dRVFL\) extends the standard RVFL architecture by incorporating multiple hidden layers, thereby enhancing its representational capacity while preserving the hallmark efficiency of closed\-form training\. Unlike conventional deep neural networks, dRVFL stacks several RVFL\-like layers where only the final layer’s output weights are trained via least squares, and all intermediate transformations are computed using fixed, randomly initialized weights\. The architecture can be described as follows:

Layer-wise Random Transformations: Let the input matrix be $X \in \mathbb{R}^{m \times d}$. The $l$-th hidden layer transformation is defined as:

$$H^{(l)} = \phi(H^{(l-1)} W^{(l)} + B^{(l)}), \quad l = 1, 2, \ldots, L, \qquad (18)$$

where $H^{(0)} = X$, $W^{(l)} \in \mathbb{R}^{h_{l-1} \times h_l}$ is the randomly initialized weight matrix for layer $l$, $B^{(l)} \in \mathbb{R}^{m \times h_l}$ is the corresponding bias matrix, and $\phi(\cdot)$ is a nonlinear activation function applied elementwise.

Augmented Feature Representation: The final feature representation $A$ is constructed by concatenating the original input and the outputs of all hidden layers:

$$A = [X \,\|\, H^{(1)} \,\|\, H^{(2)} \,\|\, \ldots \,\|\, H^{(L)}] \in \mathbb{R}^{m \times a}, \qquad (19)$$

where $a = d + \sum_{l=1}^{L} h_l$ is the total feature dimensionality. This dense aggregation of features enables dRVFL to capture increasingly abstract representations at deeper layers while retaining the original input.

Output Layer: The output weights $\Theta \in \mathbb{R}^{a \times n_{\text{class}}}$ are obtained using the same closed-form solution as in RVFL:

$$\Theta = (A^{\top} A + \lambda I)^{-1} A^{\top} Y, \qquad (20)$$

where $\lambda \geq 0$ is the regularization parameter.
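
The layer-wise recursion and dense aggregation translate directly into MATLAB. The sketch below is a minimal illustration of Eqs. (18)–(20); the depth, layer widths, and ReLU activation are illustrative assumptions.

```matlab
% Minimal MATLAB sketch of dRVFL, following Eqs. (18)-(20).
rng(42, 'twister');
m = 200; d = 10; nclass = 2;
X = randn(m, d);
labels = randi(nclass, m, 1);
Y = zeros(m, nclass); Y(sub2ind(size(Y), (1:m)', labels)) = 1;

L = 3; hl = [64, 64, 64]; lambda = 1e-3;  % depth and layer widths (placeholders)
relu = @(T) max(T, 0);

A = X; Hprev = X;
for l = 1:L                                % Eq. (18): frozen random layers
    Wl = randn(size(Hprev, 2), hl(l));
    Bl = repmat(randn(1, hl(l)), m, 1);
    Hl = relu(Hprev*Wl + Bl);
    A = [A, Hl];                           % Eq. (19): dense aggregation
    Hprev = Hl;
end
Theta = (A'*A + lambda*eye(size(A, 2))) \ (A'*Y);   % Eq. (20)
```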

## Appendix B Experimental Setup and Hyperparameter Setting

All experiments are implemented in MATLAB R2023a and executed on a Windows 10 PC equipped with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz (4 cores, 8 logical processors) and 16 GB RAM. Each dataset is preprocessed by normalizing the input features to zero mean and unit variance. The description of the 83 datasets (binary and multiclass) used in the experiments is provided in Table [6](https://arxiv.org/html/2605.12580#A2.T6). A 5-fold cross-validation procedure is employed to ensure reliable and unbiased evaluation. In each fold, the dataset is split into 80% training data and 20% testing data. For every combination of hyperparameters, the model is trained on the training data and evaluated on the testing data across all 5 folds; the testing accuracy is recorded for each fold. The final testing accuracy for each dataset is computed as the mean testing accuracy across the five folds, providing a robust estimate of the model's performance.

To eliminate any possibility of data leakage, all copula estimation steps are performed strictly within each training split\. For every fold, the pseudo\-observations and copula parameters are computed exclusively using the training features of that fold\. The fitted copula is then used to sample the hidden\-layer weights, which are subsequently evaluated on the disjoint test set\. At no stage is information from the test portion used during copula fitting or weight initialization\.
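
The sketch below shows how this per-fold, leakage-free copula fit can be organized, assuming a Gaussian copula, a standard-normal inverse marginal transform, and MATLAB's copulafit/copularnd from the Statistics and Machine Learning Toolbox; the released code at https://github.com/mtanveer1/CAWI is the authoritative implementation.

```matlab
% Per-fold, leakage-free copula fit: the copula sees training features only.
rng(42, 'twister');
m = 200; d = 10; h = 100;
X = randn(m, d);                                  % placeholder feature matrix

cv = cvpartition(m, 'KFold', 5);
for k = 1:cv.NumTestSets
    Xtr = X(training(cv, k), :);                  % training split of this fold
    Xte = X(test(cv, k), :);                      % disjoint evaluation split

    U = tiedrank(Xtr) ./ (size(Xtr, 1) + 1);      % pseudo-observations via empirical CDFs
    rhohat = copulafit('Gaussian', U);            % rank-based dependence of this fold

    Uw = copularnd('Gaussian', rhohat, h);        % h dependence-aware samples in (0,1)^d
    W = norminv(Uw)';                             % inverse marginal sets scale; W is d x h

    % ... train the RdNN with frozen W on Xtr and evaluate on Xte ...
end
```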

Hyperparameter tuning is performed using a grid search strategy to identify the optimal settings for each model. For each model, the regularization parameter ($\lambda$) is selected from $\{10^{i} \mid i = -6, -5, \ldots, 6\}$. For RVFL, the number of hidden nodes ($h$) varies over $[3:20:203]$ (i.e., from 3 to 203 in steps of 20), and seven activation functions (Sigmoid (1), Sine (2), Tribas (3), Radbas (4), Tansig (5), ReLU (6), and SELU (7)) are evaluated. For BLS, the number of feature windows ($q$), the number of feature nodes in each window ($p$), and the number of enhancement nodes ($r$) are set as per Sajid et al. ([2024b](https://arxiv.org/html/2605.12580#bib.bib68)), with Tansig as the activation function. For dRVFL, we adopt the same hyperparameter settings as provided in Shi et al. ([2021](https://arxiv.org/html/2605.12580#bib.bib60)) and evaluate three activation functions: Sigmoid (1), ReLU (2), and SELU (3).

Table 6: Description of binary and multiclass datasets used in the experiments. The table lists the dataset name, number of samples, number of features, and number of classes.

Table 7: Results on the BreaKHis dataset for BLS baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 8: Results on the Schizophrenia ROI datasets for BLS baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 9: Results on benchmark binary datasets for RVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.

Table 10: Results on benchmark multiclass datasets for RVFL baseline and copula-based initializations (test accuracy, %). Best per row is bold and highlighted; the last column reports the improvement of the best method over the i.i.d. baseline.
## Appendix C Multi-Seed Stability Analysis

To assess the robustness of CAWI with respect to randomness, we additionally evaluate the RVFL model under multiple independent random seeds. While all primary experiments were conducted using a fixed seed (rng(42, 'twister')), we re-run the experiments using two additional seeds (7 and 123) for both the i.i.d. baseline and all copula-based variants.
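
Schematically, the protocol amounts to re-running the identical pipeline under each seed with everything else held fixed; the skeleton below is an illustrative outline, not the experiment script.

```matlab
% Multi-seed protocol: reseed, re-initialize, retrain, and record accuracy.
seeds = [42, 7, 123];
acc = zeros(numel(seeds), 1);
for si = 1:numel(seeds)
    rng(seeds(si), 'twister');        % reseed the generator before each run
    % ... draw W (i.i.d. or copula-based), train, and evaluate on the test split ...
    % acc(si) = mean test accuracy under this seed;
end
```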

The detailed results for the BreaKHis dataset under Seed 7 and Seed 123 are reported in Tables [11](https://arxiv.org/html/2605.12580#A3.T11) and [12](https://arxiv.org/html/2605.12580#A3.T12), respectively. Similarly, the corresponding results for the Schizophrenia ROI dataset are presented in Tables [13](https://arxiv.org/html/2605.12580#A3.T13) and [14](https://arxiv.org/html/2605.12580#A3.T14).

Across both datasets and independent seeds, CAWI consistently outperforms the conventional i\.i\.d\. initialization\. Importantly, the relative improvements remain stable in magnitude and direction, indicating that the gains are not attributable to a particular random realization\.

Overall, the multi-seed evaluation confirms that CAWI provides stable and reproducible performance improvements across independent random initializations, reinforcing its robustness as a dependence-aware weight initialization strategy.

Table 11: BreaKHis dataset (Seed 7): Test accuracy (%). Best per row is bold and highlighted; the last column reports the improvement over the i.i.d. baseline.

Table 12: BreaKHis dataset (Seed 123): Test accuracy (%). Best per row is bold and highlighted; the last column reports the improvement over the i.i.d. baseline.

Table 13: Schizophrenia ROI dataset (Seed 7): Test accuracy (%). Best per row is bold and highlighted; the last column reports the improvement over the i.i.d. baseline.

Table 14: Schizophrenia ROI dataset (Seed 123): Test accuracy (%). Best per row is bold and highlighted; the last column reports the improvement over the i.i.d. baseline.
## Appendix D Statistical Significance Across Datasets

To assess whether the performance improvements of CAWI over the i\.i\.d\. initialization are statistically significant across datasets, we conduct a non\-parametric Wilcoxon signed\-rank test over the 83 UCI benchmark datasets\.

For each dataset $i \in \{1, \ldots, 83\}$, let $A_i^{\text{iid}}$ denote the test accuracy obtained using the conventional i.i.d. initialization, and let $A_i^{\text{CAWI}}$ denote the test accuracy achieved by the best-performing CAWI configuration on the same dataset. We use the best CAWI variant per dataset because CAWI is designed as a flexible, copula-agnostic framework: different datasets exhibit different dependence structures, so different copula families may be most appropriate for different tasks. This comparison reflects the intended usage of CAWI, where the copula family is selected to best capture the empirical dependence of each dataset.

We then construct the paired observations

$$\left(A_i^{\text{iid}},\, A_i^{\text{CAWI}}\right), \quad i = 1, \ldots, 83,$$

and apply the Wilcoxon signed-rank test to the differences

$$D_i = A_i^{\text{CAWI}} - A_i^{\text{iid}}.$$

The resulting Wilcoxon statistic is $W = 3403$, while the maximum possible statistic for 83 paired observations is $83 \cdot 84 / 2 = 3486$. The corresponding p-value satisfies $p < 10^{-6}$, indicating extremely strong statistical evidence against the null hypothesis of equal median performance.
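
Assuming the per-dataset accuracies are collected in two 83 x 1 vectors, the test can be reproduced with MATLAB's signrank from the Statistics and Machine Learning Toolbox; the accuracy vectors below are illustrative placeholders.

```matlab
% Paired Wilcoxon signed-rank test over the 83 datasets.
rng(42, 'twister');
A_iid  = 80 + 5*rand(83, 1);                  % placeholder baseline accuracies
A_cawi = A_iid + rand(83, 1);                 % placeholder best-CAWI accuracies
[p, ~, stats] = signrank(A_cawi, A_iid);      % two-sided paired test on D_i
fprintf('W = %g, p = %.2g\n', stats.signedrank, p);
```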

These results demonstrate that the observed improvements of CAWI over the i\.i\.d\. initialization are not attributable to random fluctuations on individual datasets, but instead reflect a consistent and statistically significant advantage across diverse benchmark problems\.

## Appendix E Empirical Training-Time Analysis

To complement the theoretical complexity discussion in Section [4.2](https://arxiv.org/html/2605.12580#S4.SS2) of the main paper, we report an empirical evaluation of the training time incurred by CAWI under different copula families. Specifically, we measure the average training time (in seconds) of the RVFL model initialized using five copula families (Gaussian, t, Clayton, Frank, and Gumbel), alongside the conventional i.i.d. initialization baseline.

The results are averaged over all 83 UCI benchmark datasets used in our experimental evaluation. Table [15](https://arxiv.org/html/2605.12580#A5.T15) summarizes the average training time across datasets.

Table 15: Average training time (in seconds) across 83 UCI datasets.

As observed, all copula-based initializations introduce only a marginal computational overhead relative to the i.i.d. baseline. The additional cost stems primarily from the estimation of the empirical copula parameters and the subsequent dependence-aligned sampling. However, since these operations scale linearly with the number of features and hidden nodes, the overall training complexity remains effectively unchanged in practice.

Importantly, the empirical overhead remains below 0.003 seconds on average, confirming that CAWI preserves the computational efficiency characteristic of randomized neural networks while improving stability and performance.
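
A simple way to measure this overhead is to time the two initialization paths directly. The tic/toc harness below is an illustrative sketch reusing the placeholder sizes from the earlier sketches, not the paper's benchmarking code.

```matlab
% Wall-clock comparison of i.i.d. vs. copula-based initialization.
rng(42, 'twister');
m = 200; d = 10; h = 100;
Xtr = randn(m, d);                            % placeholder training features

t0 = tic;
W_iid = randn(d, h);                          % conventional i.i.d. draw
t_iid = toc(t0);

t0 = tic;
U = tiedrank(Xtr) ./ (m + 1);                 % pseudo-observations
rhohat = copulafit('Gaussian', U);            % dependence estimation
W_cawi = norminv(copularnd('Gaussian', rhohat, h))';  % dependence-aligned sampling
t_cawi = toc(t0);

fprintf('i.i.d.: %.4f s, CAWI: %.4f s, overhead: %.4f s\n', ...
        t_iid, t_cawi, t_cawi - t_iid);
```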
