Low-power analogue neural networks with trainable nonlinear connections for continuous control

arXiv cs.LG Papers

Summary

This paper presents low-power analogue neural networks that place trainable nonlinear functions on connections, inspired by Kolmogorov-Arnold networks, enabling efficient continuous control tasks with far fewer nodes and connections than multilayer perceptrons, demonstrated on hardware with projected microWatt power.

arXiv:2606.23742v1 Announce Type: new Abstract: Physical neural networks promise low-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights. Inspired by Kolmogorov-Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element. Realising these functions as analogue band-pass filters on field-programmable analogue arrays, we find that the benefit is task-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets, including robotic kinematics, continuous control, and photovoltaic maximum-power-point tracking, with far fewer nodes and connections than multilayer perceptrons, but offer no parameter-efficiency advantage on classification-like decision boundaries. Trained networks transfer to hardware across approximately 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at approximately 30 microwatts. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, rather than from a particular device.

Cached at: 06/24/26, 07:48 AM

# Low-power analogue neural networks with trainable nonlinear connections for continuous control
Source: [https://arxiv.org/html/2606.23742](https://arxiv.org/html/2606.23742)

- Ian T\. Vidamour \(corresponding author: i\.vidamour@sheffield\.ac\.uk\), School of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom
- Thomas J\. Hayward, School of Chemical, Biological, and Materials Science Engineering, University of Sheffield, Sheffield, S1 3JD, United Kingdom
- Matthew O\. A\. Ellis, School of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom
- Charles Swindells, School of Chemical, Biological, and Materials Science Engineering, University of Sheffield, Sheffield, S1 3JD, United Kingdom
- Alexander McDonnell, School of Physics, Engineering, and Technology, University of York, York, YO10 5EZ, United Kingdom
- Martin Trefzer, School of Physics, Engineering, and Technology, University of York, York, YO10 5EZ, United Kingdom
- Finley Robins, School of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom
- Luca Manneschi, School of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom
- Susan Stepney, Department of Computer Science, University of York, York, YO10 5EZ, United Kingdom
- Tony Kenyon, Department of Electronic & Electrical Engineering, University College London, Roberts Building, Torrington Place, London, WC1E 7JE, United Kingdom
- Oliver J\. Sutton, King’s College London, London, WC2R 2LS, United Kingdom
- Jack C\. Gartside, Blackett Laboratory, Imperial College London, London, SW7 2AZ, United Kingdom
- Ivan Y\. Tyukin, King’s College London, London, WC2R 2LS, United Kingdom
- Adnan Mehonic, Department of Electronic & Electrical Engineering, University College London, Roberts Building, Torrington Place, London, WC1E 7JE, United Kingdom
- Eleni Vasilaki \(corresponding author: e\.vasilaki@sheffield\.ac\.uk\), School of Computer Science, University of Sheffield, Sheffield, S1 4DP, United Kingdom

###### Abstract

Physical neural networks promise low\-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights\. Inspired by Kolmogorov–Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element\. Realising these functions as analogue band\-pass filters on field\-programmable analogue arrays, we find the benefit is task\-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets \(robotic kinematics, continuous control and photovoltaic maximum\-power\-point tracking\) with far fewer nodes and connections than multilayer perceptrons, but offer no parameter\-efficiency advantage on classification\-like decision boundaries\. Trained networks transfer to hardware across 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at $\sim 30\,\mu\mathrm{W}$\. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, not from a particular device\.

Keywords: physical neural networks, Kolmogorov–Arnold networks, analogue computing, low\-power control, trainable nonlinear connections

## 1 Introduction

Physical neural networks implement inference directly in hardware, using the native response of analogue devices as computational primitives\[[21](https://arxiv.org/html/2606.23742#bib.bib363),[12](https://arxiv.org/html/2606.23742#bib.bib360),[24](https://arxiv.org/html/2606.23742#bib.bib53),[13](https://arxiv.org/html/2606.23742#bib.bib368)\]\. Implementations now span memristive\[[14](https://arxiv.org/html/2606.23742#bib.bib20),[29](https://arxiv.org/html/2606.23742#bib.bib362),[2](https://arxiv.org/html/2606.23742#bib.bib2)\], spintronic\[[3](https://arxiv.org/html/2606.23742#bib.bib391),[32](https://arxiv.org/html/2606.23742#bib.bib273)\], photonic\[[15](https://arxiv.org/html/2606.23742#bib.bib22)\], and electronic\[[17](https://arxiv.org/html/2606.23742#bib.bib29)\] substrates, offering routes to low\-power machine learning beyond conventional digital processors\. Yet most inherit an architectural assumption from software: that a connection is a scalar weight\. This is a poor fit for analogue hardware, because it multiplies the number of physical connections that must be fabricated, programmed and controlled, and forces devices with rich nonlinear responses \(a memristor’s current–voltage curve, a filter’s frequency response\) to act as a single programmable conductance, discarding the very physics that makes them efficient\.

Here we invert that assumption\. Each connection between two network nodes, which we call an edge, carries a trainable nonlinear function rather than a scalar weight\. The edge therefore computes with the native response of the device rather than merely scaling a signal\. A fixed sigmoidal step at each node keeps inter\-layer signals within the hardware’s operating range\. The architecture is related to the Kolmogorov–Arnold network\[[19](https://arxiv.org/html/2606.23742#bib.bib34)\], in which learnable univariate functions replace scalar weights, but is motivated here by a hardware principle: device nonlinearity should be used as the computational resource, not suppressed\. We refer to it as a Physical Kolmogorov–Arnold\-inspired Network \(PhyKAN\)\. Because each edge carries an expressive, smoothly tuneable function, we hypothesise that PhyKANs are especially suited to continuous control and regression — tasks whose targets are smooth maps over coupled real\-valued variables, and which can therefore be represented with few nodes and connections\.

Prior work on physical neural networks has largely asked how to train them when a device’s behaviour is only partially known: physics\-aware training\[[43](https://arxiv.org/html/2606.23742#bib.bib396)\] and noise\-aware dynamic optimisation\[[20](https://arxiv.org/html/2606.23742#bib.bib19)\] estimate gradients through a learned digital twin, forward–forward and local\-learning schemes\[[23](https://arxiv.org/html/2606.23742#bib.bib55)\] remove the need for global backpropagation, and sharpness\-aware training\[[44](https://arxiv.org/html/2606.23742#bib.bib8)\] improves robustness to model–reality mismatch\. These methods take the conventional weight\-and\-activation architecture as given\. The complementary question, what architecture a physical system should implement in the first place, has received far less attention\. Early trainable\-edge explorations were confined to simulation\[[26](https://arxiv.org/html/2606.23742#bib.bib35)\] or to small photonic modules of standard nonlinear components\[[8](https://arxiv.org/html/2606.23742#bib.bib7)\], and more recent demonstrations have shown that trainable edge nonlinearities can be built from compound memristor–transistor cells\[[41](https://arxiv.org/html/2606.23742#bib.bib6)\], silicon\-on\-insulator synaptic elements\[[36](https://arxiv.org/html/2606.23742#bib.bib9)\], and reconfigurable nonlinear\-processing units\[[11](https://arxiv.org/html/2606.23742#bib.bib4)\]\. What they leave open is the question we take up here: not whether such architectures can be built, but what physically realised nonlinear edges are actually good for\.

We answer it with analogue electronic filters as the physical edge\. Each filter is set by two corner frequencies and a gain, tuned by gradient descent through its analytical transfer function, and banks of filters sum to form a smooth univariate response within the filter bank’s range\. With sufficient filters, this basis can approximate any continuous one\-dimensional response; at finite size, it favours smooth variation over sharp transitions\. The central finding is that the benefit of this architecture is not generic: across six\-axis robotic kinematics, continuous\-action reinforcement learning and photovoltaic maximum\-power\-point tracking, PhyKANs match or exceed multilayer perceptrons at substantially fewer trainable parameters, and therefore far fewer nodes and connections, whereas on classification and binary\-action control they offer no such advantage\. The benefit appears precisely when the target is a smooth, continuously valued map, and disappears when the task reduces to a decision boundary\. Trained parameters transfer directly to hardware with quantified per\-connection error, and the networks are projected to run at $\sim 30\,\mu\mathrm{W}$ on a CMOS transconductor implementation\. Finally, a mathematical analysis shows why individual physical edges can carry substantial approximation power: in the idealised setting, a single filter\-based edge can approximate any continuous one\-dimensional response, so smooth variation can be represented inside individual connections, and networks built from such edges can approximate continuous multivariate maps \(Supplementary Note[S1](https://arxiv.org/html/2606.23742#A1)\)\. A distinct memristor\-based realisation reproduces the same task\-dependent pattern in simulation, indicating that the advantage stems from placing trainable nonlinearity on the connections rather than from any single device\.

## 2 Results

### 2\.1 Constructing trainable nonlinear connections from analogue filters

![Refer to caption](https://arxiv.org/html/2606.23742v1/KAN_fig1_v3.png)

Figure 1: Constructing trainable nonlinear connections from analogue filters\. \(a\) Schematic of network components \(PhyKAN\)\. Inputs are encoded as frequencies at network nodes, which pass signals to network edges\. Edges are univariate functions, and here edge basis functions resemble tuneable band\-pass filters\. These filters consist of high\-pass, low\-pass, and amplification stages\. Multiple basis functions \(band\-pass filters\) are combined to produce more flexible network edges\. \(b\) Network diagram showing how a \[2, 3, 2, 1\] PhyKAN approximates function I\.50\.26 from the dimensionless Feynman function approximation dataset\[[38](https://arxiv.org/html/2606.23742#bib.bib12)\], $Y=f(X_0,X_1)=\cos(X_0)+X_1\cos^{2}(X_0)$\. Red curves show the trained response of each edge in the network\. Inset \(far right\) shows how three summed sub\-units \(grey curves\) produce the nonlinear edge; of the original six units, the three pruned units are those enclosed in the dashed black box\. \(c\-d\) Comparison between hardware\-realised function approximation and ground truth data for dimensionless Feynman function approximation; \(c\) represents function I\.50\.26 shown in \(b\), while \(d\) shows function II\.30\.5, $Y=f(X_0,X_1)=\sin^{2}\!\left(\frac{X_0 X_1}{2}\right)/\sin^{2}\!\left(\frac{X_0}{2}\right)$\.

Fig\.[1](https://arxiv.org/html/2606.23742#S2.F1) shows that trainable nonlinear connections can be constructed from analogue electronic filters and transferred to hardware\. The architecture places a learnable univariate function on each connection: the same arrangement as a multilayer perceptron \(MLP\), but with the scalar weights replaced by trainable nonlinearities, as motivated by the Kolmogorov–Arnold representation theorem\[[19](https://arxiv.org/html/2606.23742#bib.bib34)\]\. Each edge is built from tuneable band\-pass filter units whose inputs are encoded as signal frequencies \(Fig\.[1](https://arxiv.org/html/2606.23742#S2.F1)a\), and a fixed sigmoidal step at each node keeps inter\-layer signals within the hardware operating range\. The filter corner frequencies and gains are the trainable parameters of the physical network \(hereafter PhyKAN\); the circuit, its analytical transfer\-function model and the training procedure are given in Section[4](https://arxiv.org/html/2606.23742#S4)\.
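As a rough illustration of the edge construction described above, the sketch below models each filter unit as a first-order high-pass and a first-order low-pass stage in series, scaled by a gain, and sums several units per edge before a sigmoid at the receiving node. The first-order magnitude response, the log-scale `to_frequency` encoding and all numerical values are assumptions made for illustration; the paper's own transfer-function model is given in its Section 4 and may differ.

```python
import math

def bandpass_gain(f, f_hp, f_lp, gain):
    """Gain of a first-order high-pass and low-pass stage in series, times an amplifier gain."""
    high_pass = (f / f_hp) / math.sqrt(1.0 + (f / f_hp) ** 2)
    low_pass = 1.0 / math.sqrt(1.0 + (f / f_lp) ** 2)
    return gain * high_pass * low_pass

def edge_response(f, units):
    """One edge: the sum of its band-pass units, each a (f_hp, f_lp, gain) tuple."""
    return sum(bandpass_gain(f, f_hp, f_lp, gain) for f_hp, f_lp, gain in units)

def to_frequency(s, f_min=100.0, f_max=10000.0):
    """Hypothetical encoding: map a node value in [0, 1] to a frequency on a log scale."""
    return f_min * (f_max / f_min) ** s

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(node_values, edges):
    """edges[j][i] lists the units on the edge from input node i to output node j."""
    freqs = [to_frequency(s) for s in node_values]
    return [sigmoid(sum(edge_response(f, edges[j][i]) for i, f in enumerate(freqs)))
            for j in range(len(edges))]

# Example: two inputs feeding one output node, with two units on each edge.
edges = [[[(200.0, 2000.0, 1.5), (3000.0, 9000.0, -0.8)],
          [(100.0, 500.0, 0.7), (1000.0, 8000.0, 1.2)]]]
print(layer_forward([0.2, 0.9], edges))
```

Each edge in this sketch carries three trainable numbers per unit (two corner frequencies and a gain), which is where the per-edge parameter count comes from.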

To make each edge more expressive, several filter units are summed in parallel and then sparsified during training, pruning units that contribute little \(Section[4\.7](https://arxiv.org/html/2606.23742#S4.SS7); the regularised\-MLP control is reported in Supplementary Note[S6](https://arxiv.org/html/2606.23742#A6)\)\. Fig\.[1](https://arxiv.org/html/2606.23742#S2.F1)b shows a \[2, 3, 2, 1\] PhyKAN trained to approximate a dimensionless Feynman function\[[38](https://arxiv.org/html/2606.23742#bib.bib12)\], with the learned edge responses in red; the inset shows how the three sub\-units that survive pruning sum to form one edge, with the three pruned units \(of the original six\) enclosed in the dashed box\.
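One simple way to picture the sparsify-then-prune step is an L1 penalty on the unit gains followed by a magnitude threshold. The sketch below is only a hypothetical stand-in: the penalty form, the `threshold` value and the example gains are assumptions, and the actual criterion is the one described in the paper's Section 4.7.

```python
def l1_gain_penalty(edge_units, strength=1e-3):
    """Sparsity term added to the task loss: strength * sum of |gain| over all units."""
    return strength * sum(abs(gain) for _, _, gain in edge_units)

def prune_edge(edge_units, threshold=0.05):
    """Drop units whose |gain| has shrunk below threshold (hypothetical criterion)."""
    return [unit for unit in edge_units if abs(unit[2]) >= threshold]

# Six (f_hp, f_lp, gain) units on one edge; three gains have been driven near zero.
edge = [(100.0, 400.0, 0.90), (300.0, 900.0, 0.01), (800.0, 2000.0, -0.60),
        (1500.0, 4000.0, 0.02), (3000.0, 7000.0, 0.45), (5000.0, 9000.0, -0.03)]
kept = prune_edge(edge)
print(len(edge), "->", len(kept))
```

With these illustrative gains, six units reduce to three, mirroring the six-to-three pruning shown in the Fig. 1b inset.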

The trained parameters transfer directly to programmable analogue hardware\. To keep this transfer robust, networks are regularised to favour solutions that are insensitive to small parameter changes, and the simulated parameters are then mapped to the nearest hardware\-achievable filter values through a look\-up table, with a straight\-through estimator maintaining a differentiable computational graph \(Section[4\.6](https://arxiv.org/html/2606.23742#S4.SS6)\)\. Fig\.[1](https://arxiv.org/html/2606.23742#S2.F1)c,d show the function\-fitting performance of the hardware\-realised networks; in compact networks, the hardware outputs closely match the ground\-truth functions\.
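The look-up-table mapping with a straight-through estimator can be sketched in a few lines. In the toy below, a single latent parameter is snapped to the nearest entry of a hypothetical table of achievable values in the forward pass, while the backward pass treats the snapping as the identity so that the latent value still receives a gradient. The table entries, the squared-error loss and the learning rate are illustrative assumptions, not the paper's settings.

```python
def nearest_lut_value(value, lut):
    """Forward mapping: snap a simulated parameter to the closest achievable value."""
    return min(lut, key=lambda entry: abs(entry - value))

def train_with_ste(target, lut, latent=1.0, lr=0.1, steps=50):
    """Fit a latent parameter so that its quantised value approaches target."""
    for _ in range(steps):
        quantised = nearest_lut_value(latent, lut)        # forward pass uses the hardware value
        grad_wrt_quantised = 2.0 * (quantised - target)   # derivative of (q - target)^2 w.r.t. q
        # Straight-through estimator: d(quantised)/d(latent) is taken to be 1,
        # although the true derivative of the rounding step is zero almost everywhere.
        latent -= lr * grad_wrt_quantised * 1.0
    return latent, nearest_lut_value(latent, lut)

lut = [0.5, 1.0, 1.5, 2.0, 2.5]   # hypothetical hardware-achievable values
latent, quantised = train_with_ste(target=2.0, lut=lut)
print(latent, quantised)
```

Without the straight-through step the gradient would vanish, because the rounding is piecewise constant. When the target lies between two table entries, the latent value in this toy tends to hover near a rounding boundary rather than settle.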

### 2\.2 Six\-axis robotic kinematics

![Refer to caption](https://arxiv.org/html/2606.23742v1/KAN_fig2_v2.png)

Figure 2: Six\-axis Robot Arm Kinematics with Analogue Electronic Networks\. \(a\) Schematic diagram showing the joint configuration of the six\-axis IRB120 commercial robotic manipulator modelled here\. Rotational axes are shown in grey, where $\varphi_i$ denotes the joint angle at each of the joints, and black arrows show the plane of rotation around each of the joints\. The payload \(end\-effector\) location is shown in black\. \(b\) Schematic diagram showing the learning procedure for the forward \(upper\) and inverse \(lower\) kinematics problems\. The forward kinematics take joint angles, which are uniquely mapped to payload locations, with the analogue networks approximating the relationship directly\. In the inverse kinematics problem, the network predicts joint angles, which are then passed through the forward kinematic equations to calculate true locations from predicted joint angles, and the loss is computed between desired and achieved payload locations\. \(c\-f\) Plots of mean\-squared error on the test dataset for the six\-axis robotic control forward \(c/d\) and inverse \(e/f\) kinematics problems for one hidden layer \(c/e\) and two hidden layers \(d/f\) for analogue networks in both simulation \(red dots\) and transferred to hardware \(red crosses\), compared to standard multilayer perceptrons \(black dots\)\. Shaded regions show standard deviation of performance across 10 independently trained models, while lines through the series represent the mean error\.

The ability of edge nonlinearities to smoothly approximate multivariate functions makes them well suited to modelling the kinematics of rigid bodies with high parameter efficiency\. Here, we model both the forward and inverse kinematic behaviours of a robotic manipulator using analogue filter networks\. Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2)a shows a schematic diagram of the IRB120 commercial robotic manipulator manufactured by ABB, which has been previously used for neural\-network\-based approximation of kinematics\[[30](https://arxiv.org/html/2606.23742#bib.bib33)\], and is used here to benchmark the analogue networks\.

The manipulator is controlled by six revolute joints that move the end\-effector in three\-dimensional space\. The joint angles and arm lengths uniquely define the end\-effector location, so we built a forward\-kinematics dataset by computing end\-effector locations for joint angles sampled randomly across their working ranges\. The Denavit\-Hartenberg parameters used to determine these end\-effector locations via a spatial kinematic chain are outlined in Section[4\.8](https://arxiv.org/html/2606.23742#S4.SS8)\.
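For readers unfamiliar with the Denavit-Hartenberg convention, the sketch below chains standard DH link transforms to obtain an end-effector position from joint angles. The link rows used in the example describe a hypothetical two-link planar arm, not the IRB120; the manipulator's actual parameters are those listed in the paper's Section 4.8.

```python
import math

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ]

def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def forward_kinematics(joint_angles, dh_rows):
    """dh_rows holds one (d, a, alpha) tuple per joint; returns end-effector (x, y, z)."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, (d, a, alpha) in zip(joint_angles, dh_rows):
        T = matmul4(T, dh_transform(theta, d, a, alpha))
    return T[0][3], T[1][3], T[2][3]

# Hypothetical two-link planar arm with unit link lengths (placeholder values).
planar_rows = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
print(forward_kinematics((0.0, 0.0), planar_rows))
print(forward_kinematics((math.pi / 2, 0.0), planar_rows))
```

A forward-kinematics dataset of the kind described above would pair randomly sampled joint angles with the positions returned by such a chain.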

For the inverse kinematics, multiple joint\-angle solutions exist for a given end\-effector location, so simply reversing the inputs and targets of the forward problem is ambiguous\. To constrain the model, the predicted joint angles from the model are passed through the forward kinematics equations to give the true location, and loss between true target location \(input\) and modelled location \(output passed through forward kinematics\) is minimised, as shown in Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2)b\.
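The reason for computing the loss in end-effector space can be seen on a toy example. The sketch below uses a hypothetical two-link planar arm (an assumption chosen for brevity, standing in for the six-axis chain) and shows that two different joint-angle solutions for the same target both receive zero loss, whereas a loss on the angles themselves would penalise one of them.

```python
import math

def planar_fk(angles, lengths=(1.0, 1.0)):
    """Forward kinematics of a planar serial arm: returns the tip position (x, y)."""
    x = y = 0.0
    total = 0.0
    for angle, length in zip(angles, lengths):
        total += angle
        x += length * math.cos(total)
        y += length * math.sin(total)
    return x, y

def inverse_kinematics_loss(predicted_angles, target_xy):
    """Mean-squared error between the target and the position the predicted angles reach."""
    reached = planar_fk(predicted_angles)
    return sum((r - t) ** 2 for r, t in zip(reached, target_xy)) / len(target_xy)

target = (1.0, 1.0)
elbow_a = (0.0, math.pi / 2)           # one valid solution
elbow_b = (math.pi / 2, -math.pi / 2)  # a second, equally valid solution
print(inverse_kinematics_loss(elbow_a, target), inverse_kinematics_loss(elbow_b, target))
```

Training against this loss lets the network settle on any valid branch of the inverse map instead of averaging between branches.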

Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2)c,d show the achieved mean\-squared error between model\-predicted end\-effector location and true location given by the kinematic equations as a function of trainable parameters for one and two hidden layers respectively\. Trainable\-parameter counts are computed from the unpruned architecture of as\-initialised networks before they undergo pruning during training\. Red dots show electronic network models, while red crosses show transferred performance in hardware\. As a control, standard multilayer perceptrons \(MLPs, black\) with ReLU activation functions serve as a baseline for conventional neural networks\. Despite having more trainable parameters per edge, the analogue networks need substantially fewer hidden\-layer nodes in simulation, giving greater parameter efficiency, especially with multiple hidden layers\. Transfer to hardware increases the error through device mismatch, but the hardware networks still compare favourably with the software MLPs\.

Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2)e,f compare the accuracies in modelling the inverse kinematics of the robotic arm between analogue filter networks and standard MLPs for networks with one and two hidden layers respectively\. For networks with one hidden layer, the MLPs slightly outperform the analogue networks across trainable parameters, with the experimental networks incurring a slight additional error due to transfer mismatch\. However, in the two\-hidden\-layer networks, beyond $\approx 10{,}000$ parameters \(20 nodes per layer, 18 parameters per edge\), the analogue networks solve the inverse kinematics more accurately than the MLPs, whose performance saturates to the same local minimum as the analogue networks with $<10{,}000$ parameters\. Similarly, experimental transfer causes error to increase, although in the largest networks tested, the experimental networks slightly outperform the MLPs on average\.
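The quoted figure of roughly 10,000 parameters can be checked with a short count. The sketch assumes three inputs (the target end-effector coordinates) and six outputs (the joint angles), and reads the stated 18 parameters per edge as six filter units with three parameters each; both the layer sizes and that breakdown are inferences from the text rather than values stated for this experiment.

```python
def edge_count(layers):
    """Number of connections in a fully connected feed-forward network."""
    return sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

def phykan_parameters(layers, filters_per_edge=6, params_per_filter=3):
    """Unpruned count: every edge carries filters_per_edge units of params_per_filter values."""
    return edge_count(layers) * filters_per_edge * params_per_filter

def mlp_parameters(layers):
    """Weights plus biases of a standard multilayer perceptron."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

inverse_net = [3, 20, 20, 6]   # assumed shape: 3 inputs, two hidden layers of 20, 6 outputs
print(edge_count(inverse_net), phykan_parameters(inverse_net), mlp_parameters(inverse_net))
```

Under these assumptions the network has 580 edges and 10,440 trainable parameters, in line with the quoted $\approx 10{,}000$, while an MLP of the same shape has 626. Matching parameter counts therefore means comparing a small PhyKAN with a much wider MLP.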

In the two\-hidden\-layer inverse\-kinematics task, the simulated PhyKAN error decreases with parameter count, reaches a minimum and then forms a shallow plateau at the largest model sizes \(Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2)f\)\. We analysed the hidden representations with the intrinsic\-dimensionality measure of Sutton et al\.\[[34](https://arxiv.org/html/2606.23742#bib.bib5)\], as a probe of how desired end\-effector positions are separated within the network \(Supplementary Fig\.[S4](https://arxiv.org/html/2606.23742#A5.F4)\)\. The first hidden layer becomes progressively richer with model size and then saturates close to the onset of the error plateau, consistent with the network having formed a stable encoding of the input distribution\.

By contrast, the intrinsic dimensionality of the second hidden layer continues to grow with model size\. At the largest model sizes, this coincides with a small increase in test error and with more sub\-optimal runs across the ten independent trainings \(Supplementary Fig\.[S5](https://arxiv.org/html/2606.23742#A5.F5)\), suggesting that the extra capacity mainly enlarges the optimisation problem rather than improving the representation\.

### 2\.3 Continuous\-action control

![Refer to caption](https://arxiv.org/html/2606.23742v1/KAN_fig3_v2.png)

Figure 3: Continuous\-Action CartPole Control with Analogue Electronic Networks\. \(a\) Schematic diagram of the CartPole task\. The actor provides a continuous output corresponding to a vector force applied to the cart, with the goal of keeping the cart displacement, $x$, and pole angle, $\theta$, close to zero\. \(b\) and \(c\) compare the average duration for which the agent maintains the pole upright between analogue electronic networks \(red\) and MLPs \(black\) as a function of trainable parameters for networks with two hidden layers \(b\) and three hidden layers \(c\)\. Performance is evaluated as the average of 100 different initial conditions for 10 independently trained models\. Shaded regions show the range of performance over 10 independently trained models, while lines show the mean across the 10 runs\. \(d\-g\) Typical example runs of the CartPole environment, showing force applied \(red\) and pole angle \(black\) across the first 100 timesteps\. \(d\) and \(e\) show typical runs for simulated and experimental analogue networks respectively, which solved the task\. \(f\) and \(g\) show characteristic solution and failure cases respectively for an MLP network that maintains the pole upright for $\approx 75\%$ of initial conditions\.

As well as approximating known input–output relationships, networks with analogue edges can model decision\-making policies learned from environmental feedback, in a reinforcement\-learning paradigm\. Here, we use an actor\-critic algorithm to learn policies which maximise a given reward in an environment with a continuous action space\. In this paradigm, a 'critic' network estimates total discounted future rewards from a given state\-action pair, while an 'actor' network learns a policy for action selection given a current state, maximising estimates of reward generated by the critic\. Full implementation details can be found in Section[4\.9](https://arxiv.org/html/2606.23742#S4.SS9)\.

As a demonstration, we perform a classical reinforcement learning control task, CartPole, modified to use continuous\-valued forces instead of a binary choice between $\pm 10$ N\. Here, the environment models a pole attached to a cart via a revolute joint, shown schematically in Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)a\. The actor network uses observations of the position and velocity of the cart \($x$ and $\dot{x}$\), as well as the angle and angular velocity of the pole \($\theta$ and $\dot{\theta}$\), to provide a continuous force in the positive \(right\) or negative \(left\) direction that maximises the expected total future reward estimated via the critic network\. In this task, the agent receives increased rewards for maintaining the pole as close to upright as possible, up to a maximum trial length of 500 timesteps\. Through parameter updates, the critic network adapts until its predictions align with the rewards actually received\. We consider the task solved when the agent consistently maintains the pole upright for 500 timesteps\.
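To make the environment concrete, the sketch below advances the four-dimensional state by one Euler step under a continuous force clipped to $\pm 10$ N. The equations and constants (masses, pole half-length, timestep) follow the widely used classic-control CartPole benchmark and are assumptions here; apart from the $\pm 10$ N force range, the text above does not state the values used in the paper.

```python
import math

# Constants of the classic CartPole benchmark (assumed, not taken from the paper).
GRAVITY, MASS_CART, MASS_POLE = 9.8, 1.0, 0.1
HALF_LENGTH, MAX_FORCE, DT = 0.5, 10.0, 0.02

def cartpole_step(state, force):
    """One explicit-Euler step; state = (x, x_dot, theta, theta_dot), force in newtons."""
    x, x_dot, theta, theta_dot = state
    force = max(-MAX_FORCE, min(MAX_FORCE, force))   # continuous action, clipped to the range
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

# Pushing the cart to the right from rest tips the pole towards negative angles.
print(cartpole_step((0.0, 0.0, 0.0, 0.0), 3.5))
```

In the continuous-action variant the actor's output is fed in directly as `force`, instead of being mapped to one of two fixed pushes.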

Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)b,c compare performance on the continuous\-action CartPole task between analogue networks \(red\) and conventional MLPs \(black\) as a function of trainable parameters for two and three hidden layer networks respectively\. Performance is measured by training the networks for 1000 episodes, and evaluating the trained agents over 100 randomly sampled initial conditions\. Plotted data points show mean performance over 10 independently trained models, while the shaded regions show the standard deviation across the 10 models\. In both cases, the analogue networks solve the task with fewer trainable parameters than MLPs, demonstrating the architecture’s suitability for continuous\-action reinforcement learning\.

Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)d–g compare the typical behaviours learned by both analogue and MLP networks with $\approx 2000$ trainable parameters and two hidden layers, highlighted by the annotated points in Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)b, where the analogue network has solved the task optimally, while the MLP has discovered a suboptimal solution in which the agent maintains the pole upright for the full 500 timesteps for around 75% of the sampled initial conditions\. Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)d shows that the simulated analogue network quickly establishes a stable pole angle from initial conditions, and maintains the pole upright with negligible deviation, reminiscent of a critically damped system\. When transferred to experimental hardware \(Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)e\), the speed at which the system is stabilised is reduced, though the agent still systematically keeps the pole upright\. This reduced speed is due to small mismatches in experimental transfer leading to suboptimal actions, though with feedback from the environment, the agent can correct for these actions and stabilise the pole\. On the other hand, the MLP \(Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)f,g\) exhibits more pronounced oscillatory behaviour reminiscent of an underdamped system, and depending upon initial conditions, either stabilises the pole \(Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)f\) or the oscillations grow to a point at which the trial is failed \(Fig\.[3](https://arxiv.org/html/2606.23742#S2.F3)g\)\.

![Refer to caption](https://arxiv.org/html/2606.23742v1/KAN_fig4.png)

Figure 4: Maximum Power Point Tracking \(MPPT\) for Photovoltaic Cells Under Partial Shading Conditions\. \(a\) Schematic diagram of the MPPT task\. Four arrays operate under variable irradiance\. A state vector of current, voltage, power, and change in power is passed to a network, which outputs a voltage change to a voltage\-booster circuit, and as a consequence changes the generated power and current across the arrays\. \(b\) A comparison of power generation ratio and trainable parameters between analogue network \(red\) and MLP \(black\) actor networks\. Power generation ratio is defined as the ratio between power output from the controlled system and the maximum available power for a given irradiance condition calculated via physical models\. Data points show mean performance sampled over 100 irradiance conditions across 10 independently trained models, while shaded regions show the standard deviation across the models\. \(c\) and \(d\) show two example responses of the analogue network controller to two different partial shading conditions\. The left panels show power \(black\), voltage \(red\) and maximum power points \(dashed line\) for the first 30 timesteps of controller operation\. The right panels show the modelled power versus control voltage response of the PV cell for the shading conditions on the left panels\.

We next consider photovoltaic maximum\-power\-point tracking, a power\-electronics control problem in which compact, low\-power inference is directly relevant\. The objective in this task is to change the control voltage held over an array of photovoltaic cells to maximise the output power of the array while remaining agnostic to the irradiance acting upon each cell\. Input states consist of the control voltage currently being applied, the current output of the array, the change in power compared to the previous timestep, and the power currently being generated\. Details of the simulation used for modelling associated PV\-curves can be found in Section[4\.11](https://arxiv.org/html/2606.23742#S4.SS11)\.

Fig\.[4](https://arxiv.org/html/2606.23742#S2.F4)b compares the generated power ratio between analogue networks \(red\) and MLPs \(black\) as a function of trainable parameters\. Here, power generation ratio is defined as the ratio between the power being generated, and the maximum available power for the current irradiance condition, determined via simulation\. The data presented reflect average performance across 100 randomly sampled irradiance conditions, each with randomly generated initial conditions and 50 timesteps during which the agent can change the control voltage\. The plotted data points show the mean performance across 10 independently trained models, and the shaded regions show the standard deviation across the 10 models\. Except for the smallest network, with only 2 hidden nodes, the analogue networks maintain a performance advantage over the MLP baselines across the tested parameter range, with the gap persisting at the largest network sizes evaluated\. This reflects a greater ability to find the maximum power point quickly and reliably from environmental feedback\.

Fig\.[4](https://arxiv.org/html/2606.23742#S2.F4)c,d show example trials for the analogue networks under two different partial shading conditions \(left panels\), as well as the associated power versus control voltage responses determined via the model \(right panels\)\. The curves contain local maxima in both cases that the agent avoids while finding the maximum power point, a known challenge for simple control methods such as perturb and observe algorithms\[[22](https://arxiv.org/html/2606.23742#bib.bib13)\]\.
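The difficulty that local maxima pose for perturb-and-observe can be reproduced on a toy curve. The sketch below uses a made-up two-peak power-voltage function (not the paper's photovoltaic model) and a textbook perturb-and-observe loop, then reports the power generation ratio as defined above: achieved power divided by the maximum available power.

```python
import math

def toy_pv_power(v):
    """Hypothetical two-peak power-voltage curve standing in for partial shading."""
    return (3.0 * math.exp(-((v - 2.0) / 0.5) ** 2)
            + 5.0 * math.exp(-((v - 6.0) / 0.7) ** 2))

def perturb_and_observe(power, v_start, step=0.1, n_steps=50):
    """Keep stepping in the same direction while power rises; reverse when it falls."""
    v, p, direction = v_start, power(v_start), 1.0
    for _ in range(n_steps):
        v_new = v + direction * step
        p_new = power(v_new)
        if p_new < p:
            direction = -direction
        v, p = v_new, p_new
    return v

def power_generation_ratio(power, v, v_grid):
    """Achieved power divided by the best power found on a voltage grid."""
    return power(v) / max(power(u) for u in v_grid)

grid = [i * 0.01 for i in range(801)]          # 0 V to 8 V
v_final = perturb_and_observe(toy_pv_power, v_start=1.5)
print(v_final, power_generation_ratio(toy_pv_power, v_final, grid))
```

Started near the smaller peak, the loop ends up oscillating around 2 V and never reaches the global maximum near 6 V, so its ratio stays well below one on this toy curve.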

### 2\.4 Task\-dependence of the advantage

The preceding tasks all require the network to represent smooth, continuously valued mappings\. To establish where the parameter\-efficiency advantage is lost, we applied the same networks to two tasks that instead reduce to a decision boundary: Fashion\-MNIST classification and the standard CartPole task with binary \($\pm 10$ N\) actions \(Supplementary Figs\.[S2](https://arxiv.org/html/2606.23742#A3.F2) and [S3](https://arxiv.org/html/2606.23742#A4.F3)\)\. On Fashion\-MNIST the per\-parameter advantage over MLPs is largely absent, surviving only at the smallest network sizes, and the number of filters per edge has little effect on accuracy; what advantage remains is in connectivity rather than parameters, as the analogue networks reach a given accuracy with fewer edges\. On binary CartPole the two architectures are closely matched: with one hidden layer they reach maximum performance at similar parameter counts \(the MLP leading only at the smallest sizes\), and with two hidden layers performance is almost identical\. Because greedy selection between two discrete actions is a winner\-take\-all decision on the predicted Q\-values, this task is functionally a binary classification\. The parameter\-efficiency advantage therefore appears specifically when the target is smooth and continuously valued, consistent with the finite, smoothly varying filter basis of each edge, and disappears when the task reduces to a decision boundary\.
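The equivalence between greedy two-action selection and binary classification is easy to state in code. In the sketch below (function names are illustrative), picking the larger of two Q-values gives exactly the same decision as thresholding their difference at zero, so only the location of that zero crossing matters and not the shape of the Q-function away from it.

```python
def greedy_action(q_left, q_right):
    """Winner-take-all over two Q-values: 0 = push left, 1 = push right."""
    return 1 if q_right > q_left else 0

def boundary_classifier(q_left, q_right):
    """The same decision written as the sign of a single score."""
    score = q_right - q_left
    return 1 if score > 0.0 else 0

pairs = [(0.2, 0.9), (1.5, 1.4), (-3.0, -2.5), (0.0, 0.0)]
print([greedy_action(a, b) for a, b in pairs])
```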

### 2\.5Memristor\-based PhyKANs

![Refer to caption](https://arxiv.org/html/2606.23742v1/MemKAN_Figure.png)

Figure 5: Memristor\-based PhyKANs\. \(a\) Circuit topology of the underlying nonlinear element, showing the input and reference voltages, operational amplifier, memristor \(MR\), and output load\. \(b\-d\) Performance comparison between MLPs \(black\) and MemKANs \(red\) for the \(b\) six\-axis forward kinematics task, \(c\) CartPole task, and \(d\) photovoltaic\-cell control task\. Markers show mean performance and shaded regions show the standard deviation over 10 independent experiments\.

To test whether the same trainable\-nonlinear\-edge principle extends beyond analogue filters, we replaced the analogue filters that provide the tuneable transfer functions with memristors \(MemKAN\)\. The same tasks performed with the analogue\-filter networks were then repeated with memristor\-based networks\.

The memristors studied here are two\-terminal devices whose current–voltage \(I/V\) relationship depends on an internal state variable, enabling a history\-dependent response to applied stimuli\[[9](https://arxiv.org/html/2606.23742#bib.bib407),[33](https://arxiv.org/html/2606.23742#bib.bib174)\]\. A defining feature of such devices is their ability to switch between distinct resistance states, typically referred to as the low\-resistance state \(LRS\) and the high\-resistance state \(HRS\), as shown in Supplementary Fig\.[9\(a\)](https://arxiv.org/html/2606.23742#A8.F9.sf1), each associated with a different I/V characteristic\. This behaviour arises from the coupling between electronic transport and the spatial distribution of defects within the active material, typically governed by nanoscale redox processes\[[40](https://arxiv.org/html/2606.23742#bib.bib403),[39](https://arxiv.org/html/2606.23742#bib.bib404)\]\.

In filamentary devices, this resistive switching is commonly associated with the formation and rupture of a conductive filament bridging the electrodes \(Supplementary Fig\.[9\(b\)](https://arxiv.org/html/2606.23742#A8.F9.sf2)\)\. The LRS corresponds to a continuous filament that provides a quasi\-metallic conduction path, whereas the HRS is characterised by a partially ruptured filament, where a nanoscale gap separates the filament tip from the electrode\. In this regime, conduction is dominated by barrier\-limited transport mechanisms across the gap, giving rise to a strongly nonlinear, often exponential, current–voltage relationship\[[35](https://arxiv.org/html/2606.23742#bib.bib402)\]\. By applying appropriate voltage signals, the device can be reversibly tuned between these configurations, effectively controlling both the filament continuity and the associated transport mechanism\. As a result, both the overall conductance and the degree of nonlinearity can be precisely adjusted through electrical programming, providing a versatile platform for analogue signal processing\.

For the memristor\-based PhyKAN, both inputs and outputs of each nonlinear synapse are encoded in pulse amplitude, not in the frequency domain\. To obtain transfer functions with negative differential resistance, and hence turning points, we use the programmable nonlinear current–voltage behaviour of a memristor operated in its high\-resistance state and embedded in a reconfigurable active network that can realise tuneable nonlinear functions\. This is achieved through a modified subtractor configuration in which an output\-sensing resistor is additionally driven by a current proportional to $(V_{\mathrm{in}}-V_{\mathrm{out}})/R$, introducing a feedback mechanism that selectively enhances intermediate input amplitudes while attenuating both low and high extremes\. Fig\.[5](https://arxiv.org/html/2606.23742#S2.F5)a shows the underlying components behind the memristor edges\.

The electrical transport properties of the circuit were determined by a dynamic memdiode model \(see Methods, Section[4\.12](https://arxiv.org/html/2606.23742#S4.SS12)\) to resolve the state of the memristive element, coupled with a circuit solver for the interplay between the memristor and the operational amplifier\. As the model requires an iterative nonlinear least\-squares optimisation step to find the I/V response, a surrogate model was used to provide a non\-recursive differentiable map of input voltages to output voltages with respect to control parameters to aid optimisation\. This model took the form of a standard MLP, with further details found in Section[4\.13](https://arxiv.org/html/2606.23742#S4.SS13)\.

Fig\.[5](https://arxiv.org/html/2606.23742#S2.F5)b shows performance on the forward kinematics task shown in Fig\.[2](https://arxiv.org/html/2606.23742#S2.F2), while panels c and d show the CartPole and PV\-control tasks respectively\. As with the filter\-based networks, the MemKANs show improved parameter efficiency over the MLP controls across all tasks, with performance comparable to the filter\-based PhyKANs\. This suggests that the improved efficiency observed here and in other studies\[[41](https://arxiv.org/html/2606.23742#bib.bib6),[11](https://arxiv.org/html/2606.23742#bib.bib4),[36](https://arxiv.org/html/2606.23742#bib.bib9)\] comes from placing trainable nonlinearities on network edges\.

## 3Discussion

Whether trainable nonlinear connections provide a parameter\-efficiency advantage depends on the target computation\. On the classification and binary\-action control tasks tested here \(Supplementary Figs\.[S2](https://arxiv.org/html/2606.23742#A3.F2) and [S3](https://arxiv.org/html/2606.23742#A4.F3)\), where the problem reduces to selecting a decision boundary rather than representing a continuous control law, the parameter\-efficiency advantage disappears: the architecture performs comparably to a multilayer perceptron\. This is consistent with the finite edge basis itself: each edge is a finite sum of smooth filter responses, so smooth variation can be represented compactly, whereas sharp transitions require additional units and erode this advantage\. The advantage emerges instead when the target computation is itself smooth and continuously valued\. In this regime, the architecture concentrates computation into expressive physical edges, reducing the number of nodes, active connections and trainable parameters needed to reach a given performance; when mapped to hardware, this means fewer analogue edge circuits to route, program and calibrate\. This behaviour holds across robotic kinematics, continuous\-action reinforcement learning and photovoltaic maximum\-power\-point tracking, with the strongest effect in the last of these, where the nonlinear\-edge networks outperform multilayer perceptrons across all tested network sizes\. The benefit therefore comes from the match between nonlinear physical edges and continuous regression or control problems\.

The hardware results indicate that this architectural advantage can survive transfer from model to device\. Training is performed through analytical transfer\-function models of the analogue filters, with hardware\-aware discretisation of trainable parameters before deployment\. Across approximately 35,000 experimentally realised connections, 90% match their simulated targets to within a mean\-squared error of $7.92\times 10^{-5}$\. The remaining error is mainly due to parasitic capacitance and resistance that are not captured by the analytical model\. For the networks studied here, this mismatch produces only modest increases in loss after transfer\. These results use direct transfer only: after mapping the simulated parameters to hardware\-achievable values, we did not carry out hardware\-in\-the\-loop hill climbing or reinforcement\-learning fine\-tuning\. Such post\-transfer calibration could further reduce residual mismatch, particularly for larger networks where transfer errors may compound\. Closed\-loop tasks provide an additional tolerance mechanism, because the controller can compensate for small deviations through subsequent interactions with the environment\. Scaling to larger networks will require more complete modelling of parasitics or post\-transfer fine\-tuning, but the present results show that direct model\-to\-hardware transfer is already accurate enough for demanding continuous\-control tasks\.

The same design principle extends beyond the analogue\-filter platform\. In simulated memristor\-based networks, the trainable nonlinearity is provided by the steady\-state current–voltage characteristic of the device, not by a frequency\-domain filter response\. Despite this change in physical substrate, the same performance pattern is recovered: parameter\-efficiency gains on regression and continuous control tasks, and a persistent advantage over multilayer perceptrons in photovoltaic tracking\. This supports the central architectural claim that the useful resource is not a particular transfer function, but the placement of trainable nonlinearity on network connections\. For memristive computing, this is a substantial shift in emphasis\. Memristors are commonly engineered to behave as programmable linear conductances in crossbar arrays, suppressing the nonlinear current–voltage response and often pushing operation into higher\-current regimes\. Here, that nonlinear response becomes the computational element itself, allowing operation in the natural low\-current analogue regime where power dissipation is reduced and endurance can be improved\.

The implementation of filter response in the differentiable model assumes steady\-state operation for ease of simulation and optimisation\. As a possible extension, processing can also operate on continuous\-time signals\. When operated in the time domain, the transience induced in the filters can be leveraged as an inherent source of short\-term memory\. However, this would involve adapting the learning process to be cognisant of dynamic dependencies and to assign credit over long timescales during optimisation\. Typical approaches such as backpropagation\-through\-time \(BPTT\)\[[42](https://arxiv.org/html/2606.23742#bib.bib73)\] solve this issue given appropriate adaptations to the network forward pass\. However, techniques such as eligibility propagation\[[5](https://arxiv.org/html/2606.23742#bib.bib190)\] may be better suited, as their memory requirements scale as $\mathcal{O}(1)$ with signal length, as opposed to $\mathcal{O}(T)$ for BPTT, where $T$ is the number of time increments in the optimisation process\. This difference is especially pertinent if small time increments are necessary to capture the transience of the filters\. Such techniques also remain mathematically exact due to the feedforward nature of the networks studied here\.

The experimental platform used here was chosen for reconfigurability and measurement access, not maximum efficiency\. On dedicated hardware, the reduced network size enabled by nonlinear physical edges is expected to translate into lower power consumption\. Because a nonlinear edge is physically more complex than a scalar weight, the hardware benefit cannot be inferred from parameter count alone\. We therefore estimate power from the number of required edge circuits and a component\-level CMOS design of those circuits \(Section [4\.15](https://arxiv.org/html/2606.23742#S4.SS15)\)\. Based on this CMOS\-compatible transconductor implementation, the photovoltaic control networks are projected to consume ${\sim}29.5\;\mathrm{\mu W}$, more than two orders of magnitude below the $10$–$150\;\mathrm{mW}$ envelope of microcontrollers used for comparable edge\-computing tasks\. This estimate combines the intrinsic efficiency of analogue operation with the architectural reduction in required components\.

Overall, the work points to a different route for physical neural networks\. Rather than forcing analogue devices to approximate software\-style scalar weights, their native nonlinear responses can be made trainable and used directly as the computational primitives of the network\. For continuous inference and control, this produces compact networks that transfer to analogue hardware and can be mapped onto distinct physical substrates, suggesting a general strategy for low\-power physical machine learning\.

## Acknowledgements

This work was primarily supported by the UK Neuromorphic Computing Hardware Semiconductor IKC \(Neuroware; UKRI2784\), funded by EPSRC and Innovate UK\. I\.T\.V\., T\.K\., A\.Me\. and E\.V\. acknowledge support from Neuroware\. Additional support was provided by the EPSRC MARCH project: I\.T\.V\., L\.M\., M\.O\.A\.E\., T\.J\.H\. and E\.V\. acknowledge support from EP/V006339/1, and M\.T\. and S\.S\. acknowledge support from EP/V006029/1\. E\.V\. and L\.M\. also acknowledge support from the CHIST\-ERA project Causal eXplanations in Reinforcement Learning \(CausalXRL; CHIST\-ERA\-19\-XAI\-002\), funded by EPSRC under grant EP/V055720/1\.

## Author Contributions

E\.V\. originated the physical\-KAN concept\. I\.T\.V\. proposed the analogue\-filter implementation\. I\.T\.V\. performed the simulations, experiments and task implementations with input from E\.V\. E\.V\. contributed to software development and independent validation of the computational results\. I\.T\.V\. and E\.V\. analysed the results\. E\.V\. proposed photovoltaic maximum\-power\-point tracking as an application, and I\.T\.V\. implemented the photovoltaic\-control task\. I\.T\.V\., T\.J\.H\., J\.C\.G\. and E\.V\. refined the study through discussion; J\.C\.G\. also encouraged the development of compelling application demonstrations\. I\.T\.V\., F\.A\., T\.K\., A\.Me\. and E\.V\. developed the memristor\-based realisation\. F\.A\. developed the memdiode model and circuit topology with input from A\.Me\. I\.T\.V\., C\.S\. and T\.J\.H\. designed and implemented the measurement infrastructure\. I\.T\.V\. and M\.O\.A\.E\. developed the FPAA programming code\. I\.T\.V\., M\.T\. and T\.J\.H\. developed the FPAA deployment\. I\.T\.V\. and E\.V\. conducted the intrinsic\-dimensionality analysis\. O\.J\.S\., L\.M\., F\.R\. and E\.V\. performed preliminary evaluations of the intrinsic\-dimensionality metric on other network architectures, with input from I\.Y\.T\. E\.V\. and I\.Y\.T\. developed the universality argument with input from O\.J\.S\. I\.T\.V\. and E\.V\. drafted the manuscript\. F\.A\. drafted the memristor sections\. T\.J\.H\., J\.C\.G\., M\.T\., O\.J\.S\., M\.O\.A\.E\., L\.M\. and S\.S\. provided feedback on the manuscript\. E\.V\. directed the research\.

## 4Methods

### 4\.1Definition of a Physical KAN\-type Edge

A PhyKAN is built from trainable nonlinear edges\. Consider an edge from node $i$ in layer $\ell$ to node $h$ in layer $\ell+1$\. This edge receives the scalar activation $a_i^{(\ell)}$ and returns a learned univariate response formed by summing the outputs of a bank of analogue band\-pass filters:

$$\Phi_{i,h}\bigl(a_i^{(\ell)}\bigr)=\sum_{k=1}^{K_{i,h}}G_{i,h,k}\,\rho\!\left(f_{\mathrm{enc}}\bigl(a_i^{(\ell)}\bigr);\vartheta_{i,h,k}\right),\tag{1}$$

where $K_{i,h}$ is the number of filters on the edge, $G_{i,h,k}$ is the learned gain of filter $k$, $\vartheta_{i,h,k}$ denotes its learned corner\-frequency parameters, and $\rho(\nu;\vartheta)\coloneqq|H_{\mathrm{BP}}(\nu;\vartheta)|$ is the corresponding band\-pass magnitude response\. The complete network is obtained by summing these edge responses at each receiving node and applying the fixed sigmoidal conditioning described below\.

### 4\.2PhyKAN: Implementation Details

In our implementation, each single\-filter response $\rho(\nu;\vartheta)$ is realised by a cascade of first\-order high\-pass and low\-pass stages in steady state\. The following subsections describe the constituent components used to implement a filter\-based PhyKAN\.

#### 4\.2\.1Input\-to\-Frequency Encoding

To ensure inputs remain within the operating range of hardware, raw inputs to the network, $\hat{x}$, are constrained via a sigmoid:

$$x=\frac{1}{1+e^{-\hat{x}}}.\tag{2}$$

For a constrained input $x\in[0,1]$, we define the encoding map

$$f_{\mathrm{enc}}(x)=10^{\alpha_{\text{in}}+\beta_{\text{in}}x},\tag{3}$$

where $\alpha_{\text{in}}=3.65$ is the log\-frequency offset and $\beta_{\text{in}}=1.5$ is the log\-frequency slope\. This maps roughly to $10^{\alpha_{\text{in}}}\approx 10^{3.65}\approx 4470\,\text{Hz}$ \(for $x=0$\) up to $10^{\alpha_{\text{in}}+\beta_{\text{in}}}\approx 10^{5.15}\approx 141\,\text{kHz}$ \(for $x=1$\)\.
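As a quick numerical check of Eqs. 2 and 3, the following minimal sketch evaluates the encoding at the ends and the middle of the input range. The function and variable names are illustrative choices, not names from the authors' code.

```python
import math

ALPHA_IN = 3.65  # log-frequency offset, value quoted in the text
BETA_IN = 1.5    # log-frequency slope, value quoted in the text


def constrain(x_raw: float) -> float:
    """Eq. 2: squash a raw input into the unit interval."""
    return 1.0 / (1.0 + math.exp(-x_raw))


def f_enc(x: float) -> float:
    """Eq. 3: map a constrained input to a drive frequency in Hz."""
    return 10.0 ** (ALPHA_IN + BETA_IN * x)


low_hz = f_enc(0.0)             # lower end of the encoding range
high_hz = f_enc(1.0)            # upper end of the encoding range
mid_hz = f_enc(constrain(0.0))  # a raw input of 0 maps to x = 0.5
```

Because the map is exponential in $x$, equal steps in the constrained input correspond to equal ratios in frequency, so the input range is spread uniformly on a logarithmic frequency axis.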

#### 4\.2\.2Learnable Filter Parameters

Each filter unit on edge $(i,h)$ with index $k$ has three trainable parameters stored as

$$\theta_{i,h,k}=\bigl(g_{i,h,k},\,p_{i,h,k}^{\text{low}},\,p_{i,h,k}^{\text{high}}\bigr).\tag{4}$$

Here $g_{i,h,k},p_{i,h,k}^{\text{low}},p_{i,h,k}^{\text{high}}\in\mathbb{R}$ are unconstrained \(raw\) parameters for the gain and the low\-pass and high\-pass corner frequencies of filter $k$ on edge $(i,h)$\. The corresponding physical gain $G_{i,h,k}$ and corner frequencies $\nu^{\text{low}}_{i,h,k}$ and $\nu^{\text{high}}_{i,h,k}$ are defined below\. These raw parameters are bounded to values achievable on hardware via a scaled sigmoid

$$\sigma_s(z)=\frac{1}{1+e^{-z/s_\sigma}},\tag{5}$$

where $s_\sigma=0.5$ sets the slope of the nonlinearity\.

##### Gain\.

The effective gain applied by filter $k$ on edge $(i,h)$ is

$$G_{i,h,k}=G_{\max}\left(\sigma_s(g_{i,h,k})-\tfrac{1}{2}\right),\tag{6}$$

where $G_{\max}=3$ fixes the overall gain scale, so that $G_{i,h,k}\in[-1.5,1.5]$ for the parameterisation used here\.

##### Corner frequencies\.

The low\-pass and high\-pass corner frequencies for filter $k$ on edge $(i,h)$, corresponding to the raw parameters $p_{i,h,k}^{\text{low}}$ and $p_{i,h,k}^{\text{high}}$ from $\theta_{i,h,k}$, are defined as

$$\nu^{\text{low}}_{i,h,k}=10^{\alpha_{\text{c}}+\beta_{\text{c}}\,\sigma_s(p_{i,h,k}^{\text{low}})},\qquad\nu^{\text{high}}_{i,h,k}=10^{\alpha_{\text{c}}+\beta_{\text{c}}\,\sigma_s(p_{i,h,k}^{\text{high}})},\tag{7}$$

where $\alpha_{\text{c}}=3.65$ and $\beta_{\text{c}}=1.9$ determine the range of available corner frequencies\. For these values, the corner frequencies lie in

$$10^{\alpha_{\text{c}}}\approx 4.5\,\text{kHz}\quad\text{to}\quad 10^{\alpha_{\text{c}}+\beta_{\text{c}}}\approx 355\,\text{kHz}.\tag{8}$$

These bounds slightly exceed the input encoding range, allowing filters to place their effective passbands both within and near the edges of the driven frequencies\.
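The mapping from raw parameters to physical values in Eqs. 5 to 7 can be sketched as follows. The constants are those quoted above; the function names are illustrative assumptions, not the authors' implementation.

```python
import math

S_SIGMA = 0.5   # slope parameter of the scaled sigmoid (Eq. 5)
G_MAX = 3.0     # overall gain scale (Eq. 6)
ALPHA_C = 3.65  # corner-frequency log offset (Eq. 7)
BETA_C = 1.9    # corner-frequency log slope (Eq. 7)


def scaled_sigmoid(z: float) -> float:
    """Eq. 5: sigmoid with slope set by S_SIGMA."""
    return 1.0 / (1.0 + math.exp(-z / S_SIGMA))


def physical_gain(g_raw: float) -> float:
    """Eq. 6: bounded gain in [-G_MAX / 2, G_MAX / 2]."""
    return G_MAX * (scaled_sigmoid(g_raw) - 0.5)


def corner_frequency(p_raw: float) -> float:
    """Eq. 7: bounded corner frequency in Hz."""
    return 10.0 ** (ALPHA_C + BETA_C * scaled_sigmoid(p_raw))


gain_at_zero = physical_gain(0.0)        # a raw gain of 0 gives zero gain
corner_at_zero = corner_frequency(0.0)   # mid-range corner frequency
```

A raw value of zero sits at the centre of each range, so an optimiser starting near zero can move gains to either sign and corner frequencies in either direction.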

#### 4\.2\.3Band\-pass Filter Transfer Function

The PhyKAN uses a cascade of first\-order high\-pass and low\-pass filters, described by the following transfer functions, whose magnitudes give the ratio of output to input amplitude at steady state as a function of the input frequency $\nu$\.

##### High\-pass filter\.

$$H_{\text{HP}}(\nu)=\frac{j\omega_a\tau_H}{1+j\omega_a\tau_H},\qquad\tau_H=RC_{\text{high}}=\frac{1}{2\pi\nu^{\text{high}}},\tag{9}$$

where $\omega_a=2\pi\nu$\.

##### Low\-pass filter\.

$$H_{\text{LP}}(\nu)=\frac{1}{1+j\omega_a\tau_L},\qquad\tau_L=RC_{\text{low}}=\frac{1}{2\pi\nu^{\text{low}}}.\tag{10}$$

##### Combined band\-pass response\.

The combined \(magnitude\) response for a single filter is

$$H(\nu;\theta_{i,h,k})=G_{i,h,k}\cdot\left|\frac{j\omega_a\tau_H}{1+j\omega_a\tau_H}\cdot\frac{1}{1+j\omega_a\tau_L}\right|.\tag{11}$$

Substituting $\tau_L=1/(2\pi\nu^{\text{low}}_{i,h,k})$ and $\tau_H=1/(2\pi\nu^{\text{high}}_{i,h,k})$ gives

$$H(\nu;\theta_{i,h,k})=G_{i,h,k}\cdot\left|\frac{j\,(\nu/\nu^{\text{high}}_{i,h,k})}{1+j\,(\nu/\nu^{\text{high}}_{i,h,k})}\cdot\frac{1}{1+j\,(\nu/\nu^{\text{low}}_{i,h,k})}\right|.\tag{12}$$

In particular, the band\-pass magnitude $H_{\mathrm{BP}}$ appearing in the definition of $\rho$ above is given by

$$H_{\mathrm{BP}}\!\left(\nu;\nu^{\text{low}}_{i,h,k},\nu^{\text{high}}_{i,h,k}\right)=\left|\frac{j\,(\nu/\nu^{\text{high}}_{i,h,k})}{1+j\,(\nu/\nu^{\text{high}}_{i,h,k})}\cdot\frac{1}{1+j\,(\nu/\nu^{\text{low}}_{i,h,k})}\right|.\tag{13}$$

In the network, the scalar gain factor is instantiated by the learned gains $G_{i,h,k}$ defined above\.
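The band-pass magnitude of Eq. 13 can be evaluated directly with complex arithmetic. The sketch below is illustrative only; the function name and the test frequencies are assumptions.

```python
def band_pass_magnitude(nu: float, nu_low: float, nu_high: float) -> float:
    """Eq. 13: |H_BP| for a first-order high-pass and low-pass cascade.

    nu_high is the corner of the high-pass stage (lower passband edge) and
    nu_low is the corner of the low-pass stage (upper passband edge).
    """
    high_pass = 1j * (nu / nu_high) / (1 + 1j * (nu / nu_high))
    low_pass = 1 / (1 + 1j * (nu / nu_low))
    return abs(high_pass * low_pass)


# With both corners at the drive frequency, each stage contributes 1/sqrt(2).
coincident = band_pass_magnitude(1.0e4, 1.0e4, 1.0e4)

# A wide passband (5 kHz to 200 kHz) passes a 30 kHz drive almost unchanged.
in_band = band_pass_magnitude(3.0e4, nu_low=2.0e5, nu_high=5.0e3)
```

Note the naming convention implied by Eqs. 9 and 10: a flat passband exists only when the high-pass corner lies below the low-pass corner, and when the corners approach or cross the response narrows and its peak drops below one, which gives each filter a tunable bump shape.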

### 4\.3Full Layer Computation

Consider a layer $\ell$ with $n_\ell$ input nodes and $n_{\ell+1}$ output nodes, and $K_{i,h}$ filters per edge $(i,h)$\.

##### Edge function output\.

For an activation $a_i^{(\ell)}$ at node $i$ in layer $\ell$, the edge function from node $i$ to node $h$ in layer $\ell+1$ is

$$\Phi_{i,h}\bigl(a_i^{(\ell)}\bigr)=\sum_{k=1}^{K_{i,h}}H\!\left(f_{\mathrm{enc}}\bigl(a_i^{(\ell)}\bigr);\theta_{i,h,k}\right).\tag{14}$$

##### Node aggregation\.

The pre\-activation at node $h$ in layer $\ell+1$ is

$$y_h^{(\ell+1)}=\sum_{i=1}^{n_\ell}M_{i,h}^{(\ell)}\,\Phi_{i,h}\!\left(a_i^{(\ell)}\right),\tag{15}$$

where $M_{i,h}^{(\ell)}\in\{0,1\}$ is an edge mask for layer $\ell$, used, for example, for pruning\. In the implementation used here, $M_{i,h}^{(\ell)}$ is obtained by post\-training hard pruning: for each edge $(i,h)$ in layer $\ell$, we evaluate the learned edge response at 1000 uniformly spaced points sampled from the interval $[0,1]$, and set $M_{i,h}^{(\ell)}=0$ if the mean absolute response over those samples falls below a fixed pruning threshold; otherwise $M_{i,h}^{(\ell)}=1$\.
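The pruning rule can be sketched as below. The threshold value and the two example edge functions are hypothetical, since the text specifies only that the threshold is fixed.

```python
def prune_mask(edge_fn, threshold: float, n_samples: int = 1000) -> int:
    """Return 0 if the mean |response| over [0, 1] is below threshold, else 1."""
    # Uniformly spaced activations covering the closed interval [0, 1].
    points = [k / (n_samples - 1) for k in range(n_samples)]
    mean_abs = sum(abs(edge_fn(a)) for a in points) / n_samples
    return 0 if mean_abs < threshold else 1


# Hypothetical learned edge responses, used for illustration only.
def weak_edge(a: float) -> float:
    return 1e-4 * a


def strong_edge(a: float) -> float:
    return a - 0.5


mask_weak = prune_mask(weak_edge, threshold=1e-2)      # pruned
mask_strong = prune_mask(strong_edge, threshold=1e-2)  # kept
```

Averaging the absolute response avoids pruning an edge whose output changes sign over the input range and would otherwise average to zero.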

##### Inter\-layer activation\.

For hidden layers, we apply the scaled sigmoid nonlinearity, constraining activations to values that encode to physically realisable frequencies:

$$a_h^{(\ell+1)}=\sigma_s\left(y_h^{(\ell+1)}\right)\qquad\text{for hidden layers.}\tag{16}$$

#### 4\.3\.1Complete Network Forward Pass

Given an input vector $\mathbf{x}=(x_1,\ldots,x_{n_0})$:

1. Input transformation\. The initial activations are
   $$a_i^{(0)}=\sigma_s(x_i),\tag{17}$$
   or, if thresholds $\tau_i$ are used,
   $$a_i^{(0)}=\sigma_s(x_i-\tau_i).\tag{18}$$
2. Layer propagation\. For $\ell=0,1,\ldots,L-1$,
   $$y_h^{(\ell+1)}=\sum_{i=1}^{n_\ell}M_{i,h}^{(\ell)}\sum_{k=1}^{K_{i,h}}G_{i,h,k}^{(\ell)}H_{\text{BP}}\!\left(f_{\mathrm{enc}}(a_i^{(\ell)});\nu^{\text{low}}_{i,h,k},\nu^{\text{high}}_{i,h,k}\right),\tag{19}$$
   where $H_{\text{BP}}$ denotes the band\-pass magnitude response\. The activations in the next layer are
   $$a_h^{(\ell+1)}=\begin{cases}\sigma_s\left(y_h^{(\ell+1)}\right),&\text{if }\ell<L-1,\\[4.0pt] y_h^{(\ell+1)},&\text{if }\ell=L-1.\end{cases}\tag{20}$$
3. Output\. The network output is
   $$\hat{y}=\mathbf{a}^{(L)}.\tag{21}$$
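The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the nested-list layout of `layers`, the use of `None` to represent a masked edge, and all raw parameter values in the example are hypothetical. Thresholds (Eq. 18) are omitted.

```python
import math

S_SIGMA, G_MAX = 0.5, 3.0
ALPHA_IN, BETA_IN = 3.65, 1.5
ALPHA_C, BETA_C = 3.65, 1.9


def sigma_s(z):
    """Scaled sigmoid (Eq. 5)."""
    return 1.0 / (1.0 + math.exp(-z / S_SIGMA))


def f_enc(a):
    """Activation-to-frequency encoding (Eq. 3)."""
    return 10.0 ** (ALPHA_IN + BETA_IN * a)


def h_bp(nu, nu_low, nu_high):
    """Closed-form magnitude of Eq. 13."""
    r_high, r_low = nu / nu_high, nu / nu_low
    return (r_high / math.sqrt(1.0 + r_high ** 2)) / math.sqrt(1.0 + r_low ** 2)


def edge_response(a, filters):
    """Eq. 14: `filters` is a list of raw (g, p_low, p_high) triples."""
    nu = f_enc(a)
    total = 0.0
    for g, p_low, p_high in filters:
        gain = G_MAX * (sigma_s(g) - 0.5)
        nu_low = 10.0 ** (ALPHA_C + BETA_C * sigma_s(p_low))
        nu_high = 10.0 ** (ALPHA_C + BETA_C * sigma_s(p_high))
        total += gain * h_bp(nu, nu_low, nu_high)
    return total


def forward(x, layers):
    """layers[l][i][h] holds the filter list of edge (i, h), or None if masked."""
    a = [sigma_s(x_i) for x_i in x]                      # Eq. 17
    for index, layer in enumerate(layers):
        n_out = len(layer[0])
        y = [
            sum(edge_response(a[i], layer[i][h])
                for i in range(len(a)) if layer[i][h] is not None)
            for h in range(n_out)
        ]                                                # Eq. 19
        is_last = index == len(layers) - 1
        a = y if is_last else [sigma_s(v) for v in y]    # Eq. 20
    return a                                             # Eq. 21


# Hypothetical 2-2-1 network with one filter per edge and one pruned edge.
layers = [
    [[[(1.0, 0.5, -0.5)], [(-1.0, 0.0, 0.0)]],
     [[(0.5, 1.0, -1.0)], None]],
    [[[(1.0, 0.0, 0.0)]],
     [[(-0.5, 0.3, -0.3)]]],
]
y_hat = forward([0.2, -0.4], layers)
```

Since each band-pass magnitude is at most one and each gain is bounded by 1.5 in absolute value, the output of this example is bounded by the number of incoming filters times 1.5.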

### 4\.4Band\-pass filters on programmable analogue hardware

Input signals are provided as sinusoidal voltage oscillations with an amplitude of $\frac{2}{\sqrt{2}}$, with input data magnitude encoded into the frequency of the oscillation, provided by an AIM\-TTI TG1010A DDS function generator\. Analogue filters were realised on an Anadigm AN231K04 development board operating on a 4 MHz clock frequency\. This enables up to six first\-order band\-pass filters to be realised with corner frequencies between 4 and 400 kHz\. The steady\-state amplitude of each filter is measured by using a full\-wave rectifier followed by a third\-order low\-pass filter with a corner frequency of 4 kHz\. To ensure adequate filtering of the rectified signal, the minimum input frequencies are applied at 10 kHz, with the voltage signal after the final low\-pass filter resembling a pseudo\-DC voltage at $\frac{\sqrt{2}}{2}$ of the rectified signal magnitude with a residual oscillation of $-24$ dB or 0\.398% of the unfiltered signal at 10 kHz\. These voltages are summed using operational amplifiers with controllable gains, with the summed voltage read out using an NI BNC\-2120 DAQ card by averaging measurements over a 1 ms window at a sampling frequency of 1 MHz, allowing a 0\.5 ms delay for the steady\-state levels of the filters to settle\.

As the hardware filters used here are based upon a switched\-capacitor architecture, they behave as discrete\-time filters\. To account for the effects of frequency warping, a bilinear transform is used to map the frequencies of continuous filters \($\omega_a$\) in the equations of Section [4\.2\.3](https://arxiv.org/html/2606.23742#S4.SS2.SSS3) to the frequencies at which discrete\-time hardware equivalents have identical phase and gain \($\omega_d$\) via Eq\.[22](https://arxiv.org/html/2606.23742#S4.E22):

$$\omega_d=\frac{2}{T}\cdot\arctan\!\left(\frac{\omega_a T}{2}\right),\tag{22}$$

with $T=2.5\times 10^{-7}\,\text{s}$ representing the underlying clock period of the FPAA\.
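To give a sense of the size of the warping correction in Eq. 22, the sketch below evaluates it at two frequencies near the ends of the input encoding range. The function name and the choice of test frequencies are illustrative assumptions.

```python
import math

T_CLOCK = 2.5e-7  # FPAA clock period in seconds, value quoted in the text


def warp_to_discrete_hz(nu_a_hz: float) -> float:
    """Eq. 22, expressed in Hz rather than rad/s."""
    omega_a = 2.0 * math.pi * nu_a_hz
    omega_d = (2.0 / T_CLOCK) * math.atan(omega_a * T_CLOCK / 2.0)
    return omega_d / (2.0 * math.pi)


low_end = warp_to_discrete_hz(4470.0)     # near the bottom of the input range
high_end = warp_to_discrete_hz(141000.0)  # near the top of the input range
```

With a 4 MHz clock the correction is negligible at a few kilohertz and grows to a fraction of a percent near the top of the encoded range, because the arctangent departs from its linear approximation only as $\omega_a T/2$ becomes appreciable.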

### 4\.5Hardware PhyKAN

To generate a network in hardware, we first use the model to optimise parameter values to solve the task, and transfer the gain, low\-pass corner frequency, and high\-pass corner frequency to programmable analogue filters\. As the development board used allows a maximum of six programmable filters per instance, a maximum of six sub\-edges are used in each network edge for experimental demonstrations\. To reconstruct an entire network from a single board, the response of the network is measured edge by edge, sampling the input–output relationship across the input range \(5\.6–140 kHz\) at 200 logarithmically spaced steps\. This gives a fine\-grained, experimentally measured response function for each edge in the network\. To simulate the response of a full network using experimentally measured data, linear interpolation is used to determine the predicted experimental response for task inputs from the fine\-grained edge data gathered in the previous step\.
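The edge-by-edge reconstruction can be sketched as a lookup over measured samples. Everything below is an illustrative assumption: the stand-in response replaces real measurements, the helper names are hypothetical, and interpolating linearly in frequency (rather than in log-frequency) is one possible reading of the procedure. Inputs outside the measured range are clamped to the end values.

```python
import bisect
import math


def log_spaced(lo: float, hi: float, n: int) -> list:
    """n logarithmically spaced points from lo to hi inclusive."""
    step = (math.log10(hi) - math.log10(lo)) / (n - 1)
    return [10.0 ** (math.log10(lo) + k * step) for k in range(n)]


def interp(x: float, xs: list, ys: list) -> float:
    """Piecewise-linear interpolation, clamped to the measured range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect.bisect_right(xs, x)
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + t * (ys[j] - ys[j - 1])


def stand_in_response(nu: float) -> float:
    """Placeholder for a measured edge response; not real data."""
    r_high, r_low = nu / 2.0e4, nu / 6.0e4
    return (r_high / math.sqrt(1.0 + r_high ** 2)) / math.sqrt(1.0 + r_low ** 2)


freqs = log_spaced(5.6e3, 140e3, 200)          # measurement grid from the text
measured = [stand_in_response(f) for f in freqs]
predicted = interp(3.0e4, freqs, measured)     # response for a task input
```

With 200 samples across the range, the piecewise-linear estimate of this smooth stand-in curve differs from the underlying function by well under one percent.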

### 4\.6Hardware\-Aware Training: Straight\-Through Estimator

Physical component values \(e\.g\. filter corner frequencies and gains\) take values only in a discrete set𝒟⊂ℝ\\mathcal\{D\}\\subset\\mathbb\{R\}\. To train such parameters, we use continuous updates while enforcing discrete values in the forward computation\. In other words, optimisation uses a continuous surrogate for gradients, but the model behaves discretely when evaluated, following the straight\-through estimator approach introduced by Bengio et al\.\[[6](https://arxiv.org/html/2606.23742#bib.bib412)\]\.

Let $\theta\in\mathbb{R}$ be a trainable parameter and let

$$T:\mathbb{R}\to\mathcal{D},\qquad\tilde{\theta}=T(\theta)$$

denote the quantisation map selecting the physically realisable value\. The map $T$ is piecewise constant, so

$$\frac{dT(\theta)}{d\theta}=0\quad\text{except at jump discontinuities,}$$

and at those jump points the derivative is undefined\. Directly differentiating $\tilde{\theta}=T(\theta)$ would therefore yield zero or undefined gradients and stall optimisation\.

To retain gradients while keeping the discrete forward pass, we define the surrogate expression

$$\theta^{\ast}\coloneqq\theta+\bigl(T(\theta)-\theta\bigr),\tag{23}$$

which satisfies $\theta^{\ast}=T(\theta)$ identically\. In the forward pass, the computation uses the discrete value $\theta^{\ast}$\. In the backward pass, we differentiate only the identity contribution,

$$\frac{d\theta^{\ast}}{d\theta}=1,$$

so that for any loss $\mathcal{L}$ depending on $\theta^{\ast}$,

$$\frac{\partial\mathcal{L}}{\partial\theta}=\frac{\partial\mathcal{L}}{\partial\theta^{\ast}}.\tag{24}$$

When $\theta$ is bounded by a sigmoid reparameterisation \(as in the gain and cut\-off definitions\), this gradient is implicitly clipped to the admissible physical range, and no further modification is required\.

After each continuous update of $\theta$, the next forward pass again uses the discretised value $T(\theta)$, so training explores a continuous parameter space while evaluation always reflects the discrete hardware constraints\.
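The update rule can be illustrated with a hand-written gradient on a toy problem. The discrete levels, the quadratic loss, the learning rate, and the nearest-level quantiser below are all hypothetical choices made for the sketch. In an automatic-differentiation framework the same effect is usually obtained by wrapping the term $T(\theta)-\theta$ of Eq. 23 in a stop-gradient operation.

```python
def quantise(theta: float, levels: list) -> float:
    """T(theta): nearest physically realisable value (assumed quantiser)."""
    return min(levels, key=lambda v: abs(v - theta))


def ste_step(theta: float, target: float, levels: list, lr: float = 0.1) -> float:
    """One gradient step on L = (theta_star - target)^2 using Eq. 24."""
    theta_star = quantise(theta, levels)     # forward pass uses the discrete value
    grad_star = 2.0 * (theta_star - target)  # dL/dtheta_star
    # Backward pass: d(theta_star)/d(theta) is taken to be 1, so the
    # gradient with respect to theta equals the gradient with respect to theta_star.
    return theta - lr * grad_star


levels = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical hardware-achievable values
theta = 0.1
for _ in range(50):
    theta = ste_step(theta, target=0.75, levels=levels)

deployed = quantise(theta, levels)    # value that would be programmed on hardware
```

In this toy run the continuous parameter stops moving once its quantised value reaches the target level, even though the continuous value itself differs from that level.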

### 4\.7Regularisation Terms

To encourage sparsity and stability, we add regularisation terms to the training objective\.

##### L1 \+ entropy penalty\.

A sparsity\-promoting term can be written as

$$\mathcal{L}_{\text{reg}}=\lambda\sum_{\ell}\left[\frac{1}{N_\mu}\sum_{\mu}^{N_\mu}\sum_{i}^{N_i}\sum_{h}^{N_h}\bigl|\Phi_{i,h}(a_{i,\mu}^{(\ell)})\bigr|-\sum_{i}^{N_i}\sum_{h}^{N_h}q_{i,h}^{(\ell)}\log q_{i,h}^{(\ell)}\right],\tag{25}$$

where $\lambda$ is a regularisation coefficient, $\mu$ represents a sample index in a batch of size $N_\mu$, and $a_{i,\mu}^{(\ell)}$ is the activation of node $i$ in layer $\ell$ for sample $\mu$\. We also define:

$$q_{i,h}^{(\ell)}\coloneqq\frac{1}{N_\mu}\sum_{\mu}^{N_\mu}\frac{\bigl|\Phi_{i,h}(a_{i,\mu}^{(\ell)})\bigr|}{\sum_{i}\sum_{h}\bigl|\Phi_{i,h}(a_{i,\mu}^{(\ell)})\bigr|}.\tag{26}$$
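For a single layer, Eqs. 25 and 26 can be evaluated from a batch of edge responses as in the sketch below. The data layout `phi[mu][i][h]`, the small constant `eps` guarding against division by zero and $\log 0$, and the example values are assumptions introduced for illustration.

```python
import math


def layer_regulariser(phi, lam: float = 1e-3, eps: float = 1e-12) -> float:
    """L1 plus entropy penalty for one layer; phi[mu][i][h] is an edge response."""
    n_batch = len(phi)
    n_in, n_out = len(phi[0]), len(phi[0][0])
    l1 = 0.0
    q = [[0.0] * n_out for _ in range(n_in)]
    for sample in phi:
        total = sum(abs(v) for row in sample for v in row)
        l1 += total / n_batch                       # first term of Eq. 25
        for i in range(n_in):
            for h in range(n_out):
                # Eq. 26: batch mean of each edge's share of the total response.
                q[i][h] += abs(sample[i][h]) / (total + eps) / n_batch
    entropy = -sum(v * math.log(v + eps) for row in q for v in row)
    return lam * (l1 + entropy)


# One sample, one input node, two output nodes.
concentrated = layer_regulariser([[[1.0, 0.0]]], lam=1.0)  # one active edge
spread = layer_regulariser([[[1.0, 1.0]]], lam=1.0)        # two equal edges
```

The entropy term is smallest when the response is concentrated on a few edges, so together with the L1 term it favours networks in which many edges can later be pruned.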

##### Stability penalty\.

To penalise large parameter gradients and promote parameter\-stable solutions, we may include

$$\mathcal{L}_{\text{stab}}=\lambda_{\text{stab}}\sum_{\ell}\left\|\nabla_{\theta^{(\ell)}}\mathcal{L}\right\|_{1},\tag{27}$$

where $\lambda_{\text{stab}}$ controls the strength of the stability penalty and $\mathcal{L}$ denotes the base training loss \(before adding regularisation terms\)\.

### 4\.8Data generation for Six\-axis Robot Kinematics

Robot kinematics are calculated for an IRB120 commercial robotic manipulator manufactured by ABB, which has been previously used for neural\-network\-based approximation of kinematics\[[30](https://arxiv.org/html/2606.23742#bib.bib33)\]\. Link lengths and rotational axes are defined according to the user manual, and are used to define Denavit\-Hartenberg parameters for determining end\-effector positions for known joint angles\. This formulation recursively translates relative coordinate systems between joints $k$ and $k-1$ via transformation matrices $T_{k-1}^{k}$:

$$T_{k-1}^{k}=\begin{bmatrix}\cos\theta_k&-\sin\theta_k\cos\alpha_k&\sin\theta_k\sin\alpha_k&r_k\cos\theta_k\\ \sin\theta_k&\cos\theta_k\cos\alpha_k&-\cos\theta_k\sin\alpha_k&r_k\sin\theta_k\\ 0&\sin\alpha_k&\cos\alpha_k&d_k\\ 0&0&0&1\end{bmatrix}$$

where:

- $z^{k}$ is normal to the rotational axis of the joint
- $x^{k}$ is normal to both $z^{k}$ and $z^{k-1}$ ($x^{k}=z^{k}\times z^{k-1}$)
- $y^{k}$ is normal to the $x^{k}z^{k}$ plane such that it generates a right-handed coordinate system
- $\theta_{k}$ is the angle of joint $k$, about axis $z^{k-1}$ from $x^{k-1}$ to $x^{k}$
- $\alpha_{k}$ is the angle about axis $x^{k-1}$ from $z^{k-1}$ to $z^{k}$
- $r_{k}$ is the radius about axis $z^{k-1}$
- $d_{k}$ is the translational offset along axis $z^{k-1}$ to the $x^{k}z^{k}$ plane

The robot consists of six revolute joints, meaning $\theta$ represents the only variable, here denoted by $\phi$, with $\alpha$, $r$, and $d$ remaining fixed for each $k$, defined by the robot geometry. The corresponding Denavit-Hartenberg parameters for each axis are defined in Table [1](https://arxiv.org/html/2606.23742#S4.T1). The end-effector location relative to the origin is found by calculating $T_{0}^{6}=T_{0}^{1}T_{1}^{2}T_{2}^{3}T_{3}^{4}T_{4}^{5}T_{5}^{6}$ for known joint angles $[\phi_{1},\ldots,\phi_{6}]$. A dataset of 20,000 data points was generated by sampling each of the six joint angles uniformly at random across its maximum range, and calculating the forward kinematics from the above equations.

| | $k=1$ | $k=2$ | $k=3$ | $k=4$ | $k=5$ | $k=6$ |
|---|---|---|---|---|---|---|
| $\theta$ | $\phi_{1}$ | $\phi_{2}$ | $\phi_{3}$ | $\phi_{4}$ | $\phi_{5}$ | $\phi_{6}$ |
| $\alpha$ | $-\frac{\pi}{2}$ | 0 | $-\frac{\pi}{2}$ | $\frac{\pi}{2}$ | $-\frac{\pi}{2}$ | 0 |
| $r$ | 0 | 0.270 | 0.134 | 0 | 0.072 | 0 |
| $d$ | 0.290 | 0 | 0.070 | 0.168 | 0 | 0 |

Table 1: Denavit-Hartenberg parameters defining the six-axis robot. All lengths ($r$ and $d$) are given in metres, and all angles ($\theta$ and $\alpha$) are given in radians.
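The forward-kinematics procedure described in this section can be sketched in a few lines of NumPy. This is an illustrative version rather than the authors' code: the function names are invented for the example, the `ALPHA`, `R`, and `D` values are transcribed from Table 1 and should be checked against the published table, and the joint limits passed to `sample_dataset` are placeholders, not the IRB120's real ranges.

```python
import numpy as np

def dh_transform(theta, alpha, r, d):
    """Denavit-Hartenberg transform between consecutive joint frames."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, r * ct],
        [st,  ct * ca, -ct * sa, r * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def chain_transform(phi, alpha, r, d):
    """Product of the per-joint transforms, i.e. the base-to-end-effector matrix."""
    T = np.eye(4)
    for th, al, rk, dk in zip(phi, alpha, r, d):
        T = T @ dh_transform(th, al, rk, dk)
    return T

# Transcribed from Table 1 (metres and radians); verify before use.
ALPHA = [-np.pi / 2, 0.0, -np.pi / 2, np.pi / 2, -np.pi / 2, 0.0]
R = [0.0, 0.270, 0.134, 0.0, 0.072, 0.0]
D = [0.290, 0.0, 0.070, 0.168, 0.0, 0.0]

def sample_dataset(n, joint_limits, seed=0):
    """Uniformly sample joint angles and return (angles, end-effector positions)."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(joint_limits[:, 0], joint_limits[:, 1], size=(n, 6))
    xyz = np.array([chain_transform(p, ALPHA, R, D)[:3, 3] for p in phi])
    return phi, xyz

# Placeholder limits of +/- pi for every joint (hypothetical, not the real ranges).
placeholder_limits = np.tile([-np.pi, np.pi], (6, 1))
angles, positions = sample_dataset(20_000, placeholder_limits)
```

The last column of the chained matrix holds the end-effector position, which is the regression target when the network learns the mapping from joint angles to Cartesian coordinates.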
### 4.9 Actor–Critic Training for Reinforcement Learning

The continuous-action experiments use a deterministic actor–critic update. The actor $\pi(s;\theta_{A})$ maps a state $s$ to a continuous action, and the critic $Q(s,a;\theta_{C})$ estimates the return, meaning the discounted future reward, associated with taking action $a$ in state $s$. During training, observed transitions $(s,a,r,s^{\prime})$, consisting of the current state, action, reward, and next state, are stored in a replay buffer $\mathcal{D}$ and later sampled in mini-batches for parameter updates. This update is related to DDPG-style continuous-control methods \[[31](https://arxiv.org/html/2606.23742#bib.bib25),[18](https://arxiv.org/html/2606.23742#bib.bib26)\], but uses a reduced target-network structure: full DDPG uses target copies of both the actor and the critic, whereas here only the critic has a target copy.

##### Critic loss \(TD learning\)\.

For a transition $(s,a,r,s^{\prime})\sim\mathcal{D}$, the scalar value that the critic is trained to match is

$$y(s,a,r,s^{\prime};\theta_{A},\theta_{C^{\prime}})=r+\gamma\,Q^{\prime}\bigl(s^{\prime},\pi(s^{\prime};\theta_{A});\theta_{C^{\prime}}\bigr),\tag{28}$$

where $\gamma$ is the discount factor and $Q^{\prime}$ is the slowly updated target critic with parameters $\theta_{C^{\prime}}$. The next action in this target is generated by the current actor, $\pi(s^{\prime};\theta_{A})$, and this action is then evaluated by the target critic. The critic loss is then

$$\mathcal{L}_{\text{critic}}(\theta_{C})=\mathbb{E}_{(s,a,r,s^{\prime})\sim\mathcal{D}}\left[\bigl(Q(s,a;\theta_{C})-y(s,a,r,s^{\prime};\theta_{A},\theta_{C^{\prime}})\bigr)^{2}\right].\tag{29}$$

During critic updates, $\theta_{A}$ and $\theta_{C^{\prime}}$ are held fixed while $\theta_{C}$ is optimised.
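The target of Eq. 28 and the loss of Eq. 29 can be written compactly for a mini-batch. The following is a minimal sketch under the assumption that the actor, critic, and target critic are plain Python callables acting on NumPy batches; these call signatures and the default `gamma` are illustrative and do not come from the paper. As in Eq. 28, no terminal-state flag is applied.

```python
import numpy as np

def td_target(r, s_next, actor, target_critic, gamma=0.99):
    """Eq. 28: reward plus discounted target-critic value of the actor's next action."""
    a_next = actor(s_next)  # next action comes from the current actor
    return r + gamma * target_critic(s_next, a_next)

def critic_loss(s, a, r, s_next, critic, actor, target_critic, gamma=0.99):
    """Eq. 29: mean squared error between the critic and the TD target."""
    y = td_target(r, s_next, actor, target_critic, gamma)  # treated as a constant
    return np.mean((critic(s, a) - y) ** 2)
```

In an automatic-differentiation framework the target `y` would be excluded from the gradient, which corresponds to holding the actor and target-critic parameters fixed during the critic update.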

##### Actor loss \(deterministic policy gradient\)\.

The actor parameters $\theta_{A}$ are optimised to select actions that maximise the critic’s estimate of long-term return. We define the actor objective

$$J_{\text{actor}}(\theta_{A})=\mathbb{E}_{s\sim\mathcal{D}_{s}}\left[Q\bigl(s,\pi(s;\theta_{A});\theta_{C}\bigr)\right],\tag{30}$$

where $\mathcal{D}_{s}$ is the marginal state distribution induced by the replay buffer $\mathcal{D}$, and the critic parameters $\theta_{C}$ are treated as fixed. In a loss-based formulation, the actor is trained to minimise

$$\mathcal{L}_{\text{actor}}(\theta_{A})=-J_{\text{actor}}(\theta_{A})=-\mathbb{E}_{s\sim\mathcal{D}_{s}}\left[Q\bigl(s,\pi(s;\theta_{A});\theta_{C}\bigr)\right].\tag{31}$$

##### Soft target update\.

To stabilise training, the target critic parameters are updated using a soft update rule

$$\theta_{C^{\prime}}\leftarrow(1-\tau_{\text{update}})\,\theta_{C^{\prime}}+\tau_{\text{update}}\,\theta_{C},\tag{32}$$

where $\tau_{\text{update}}\in(0,1)$ controls the update rate.
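The soft update of Eq. 32 is an exponential moving average of the online critic's parameters. A minimal sketch follows, assuming the parameters are stored as dictionaries of NumPy arrays; that storage layout and the default `tau` are assumptions for illustration only.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Eq. 32: move each target parameter a fraction tau towards the online value."""
    return {
        name: (1.0 - tau) * target_params[name] + tau * online_params[name]
        for name in target_params
    }

# Example: the target starts at zero and tracks an online value of one.
target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
for _ in range(10):
    target = soft_update(target, online, tau=0.1)
```

With the online parameters held constant, the gap between target and online values shrinks by a factor of $(1-\tau_{\text{update}})$ at every update, so smaller values give a slower-moving, more stable target.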

### 4.10 CartPole Environment

The CartPole control task environment follows the classic pole-balancing problem \[[4](https://arxiv.org/html/2606.23742#bib.bib24)\] and was adapted from the OpenAI Gymnasium library \[[7](https://arxiv.org/html/2606.23742#bib.bib23)\], with dynamic equations modified to accept a continuous force. The agent receives the state of the cart at every timestep $t$, $S_{t}$, defined by Eq. [33](https://arxiv.org/html/2606.23742#S4.E33):

$$S_{t}=[x_{t},\dot{x}_{t},\theta_{t},\dot{\theta}_{t}]\tag{33}$$

where $x_{t}$ denotes the position of the cart with respect to the centre of the environment at time $t$ in arbitrary units, $\dot{x}_{t}$ the velocity of the cart at time $t$ in arbitrary units per second, $\theta_{t}$ the angle of the pole with respect to the normal of the cart at time $t$ in radians, and $\dot{\theta}_{t}$ the angular velocity of the pole at time $t$ in radians per second. The agent provides a vector force along the $x$ axis as output (as defined in Fig. [3](https://arxiv.org/html/2606.23742#S2.F3)a), and is presented with a reward of +1 for every timestep for which it keeps the pole upright. Rewards were shaped according to state observations to improve learning. The reward shaping function is defined by Eq. [34](https://arxiv.org/html/2606.23742#S4.E34):

$$R^{s}_{t}=R_{t}-|x_{t}|-0.1\cdot|\dot{x}_{t}|-|\theta_{t}|-0.5\cdot|\dot{\theta}_{t}|\tag{34}$$

where $R^{s}_{t}$ defines the shaped reward at time $t$, and $R_{t}$ the reward given by the environment at time $t$. Networks were optimised over 1000 episodes with different random initial conditions, and trained according to the actor–critic paradigm outlined above.
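For concreteness, the shaping rule of Eq. 34 reduces to a one-line function of the environment reward and the four state variables. The function name and argument order below are assumptions made for this example.

```python
def shaped_reward(env_reward, state):
    """Eq. 34: penalise displacement, velocity, tilt, and angular velocity."""
    x, x_dot, theta, theta_dot = state
    return (env_reward
            - abs(x)
            - 0.1 * abs(x_dot)
            - abs(theta)
            - 0.5 * abs(theta_dot))

# A centred, stationary, upright pole keeps the full environment reward.
reward_at_rest = shaped_reward(1.0, (0.0, 0.0, 0.0, 0.0))
```

Because every penalty is an absolute value, the shaped reward never exceeds the environment reward and is largest when the cart is centred and the pole is upright and still.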

### 4.11 Data generation of Simulated Photovoltaic Array

Simulations of a photovoltaic array consisting of four individual modules were performed using a modified version of a Simulink model \[[37](https://arxiv.org/html/2606.23742#bib.bib21)\] of four connected SunPower SPR-X20-250-BLK modules operating at a fixed temperature of $25^{\circ}$C. Reference data were generated from 2000 random samples of irradiance, uniformly sampled in the range of 400–1000 $\mathrm{W/m^{2}}$. The control voltage was ramped from 0 to 200 V in steps of 0.1 V, and the simulated current and power generated by the array were recorded, giving the power versus voltage response of the array under partial shading conditions.

These simulations form the basis for the environment in the PV array control task, acting as ground truth data for determining the outcome of agent actions. At the start of an episode, a random irradiance condition is sampled, as well as an initial control voltage. From the associated simulation data for the sampled irradiance conditions, the initial control voltage is used to determine the current, and consequently the power, generated by the array. The state vector provided to the actor and critic networks at time $t=0$ is given by the initial control voltage, $V_{0}$, current $I_{0}$, and power $P_{0}$, as well as the change in power $\Delta P_{0}=0$ on initialisation. From this input state vector, the actor network produces a change in the control voltage, $\Delta V_{t}$, as its output, and the control voltage at the next timestep is given by $V_{t+1}=V_{t}+\Delta V_{t}$. The simulation data are consulted to give $I_{t+1}$ and $P_{t+1}$, while $\Delta P_{t+1}$ is given by $\Delta P_{t+1}=P_{t+1}-P_{t}$. These new state data are fed into the actor network, and the process occurs recursively until the fiftieth timestep, when the episode ends and a new irradiance condition is sampled. The agents used here were trained over 1000 episodes, using a similar reward shaping function as described in \[[27](https://arxiv.org/html/2606.23742#bib.bib11)\].
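The episode loop described above amounts to a lookup-table environment. The sketch below is one possible arrangement, not the authors' code: the class name, the linear interpolation between grid points, the clipping of the voltage to the simulated range, and the toy current-voltage curve are all assumptions. The reward is omitted because the shaping function is defined in the cited reference rather than in this section.

```python
import numpy as np

V_GRID = np.linspace(0.0, 200.0, 2001)  # 0-200 V in 0.1 V steps, as in the text

class PVLookupEnv:
    """Episode loop over precomputed current-voltage curves (one per irradiance sample)."""

    def __init__(self, current_curves, max_steps=50, seed=0):
        self.current_curves = current_curves
        self.max_steps = max_steps
        self.rng = np.random.default_rng(seed)

    def _observe(self, voltage, previous_power):
        current = float(np.interp(voltage, V_GRID, self.curve))
        power = voltage * current
        delta_p = 0.0 if previous_power is None else power - previous_power
        return np.array([voltage, current, power, delta_p])

    def reset(self):
        # Sample an irradiance condition and an initial control voltage.
        self.curve = self.current_curves[self.rng.integers(len(self.current_curves))]
        self.t = 0
        v0 = self.rng.uniform(V_GRID[0], V_GRID[-1])
        self.state = self._observe(v0, None)
        return self.state

    def step(self, delta_v):
        # Clipping to the simulated voltage range is an assumption of this sketch.
        voltage = float(np.clip(self.state[0] + delta_v, V_GRID[0], V_GRID[-1]))
        self.state = self._observe(voltage, self.state[2])
        self.t += 1
        return self.state, self.t >= self.max_steps

# Hypothetical smooth I-V curve standing in for the Simulink output.
toy_curve = 6.0 * np.clip(1.0 - (V_GRID / 150.0) ** 8, 0.0, None)
env = PVLookupEnv([toy_curve])
state = env.reset()
done = False
while not done:
    state, done = env.step(delta_v=1.0)  # a trained actor would supply delta_v
```

Each observation is the vector $[V_{t}, I_{t}, P_{t}, \Delta P_{t}]$, matching the state definition in the text, and the episode terminates after fifty steps.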

### 4.12 Analytical Modelling of Memristor Circuits

To model the memristor’s behaviour in the proposed active circuit, the Dynamic Memdiode Model (DMM) for memristors is used, in which the device is represented by two coupled equations: an electronic transport equation and a memory-state equation \[[1](https://arxiv.org/html/2606.23742#bib.bib3)\]. In this framework, the memristive device is modelled as a nonlinear conduction element whose characteristics depend on an internal state variable $\lambda$, which evolves dynamically under the applied electrical stimulus.

The electronic transport equation is expressed in terms of the voltage across the conductive constriction as

$$V_{C}=V-IR_{I},\tag{35}$$

where $V$ is the applied voltage, $I$ is the device current, and $R_{I}$ accounts for the series resistance associated with the filament structure. The current is then given by

$$I(V_{C})=I_{0}(\lambda)\sinh\left[\alpha\left(V_{C}-R_{S}(\lambda)I\right)\right],\tag{36}$$

where $\alpha(\lambda)$ is a fitting parameter that controls the voltage sensitivity of the conduction process (the higher $\alpha$, the more nonlinear the conduction is), $R_{S}(\lambda)$ is a state-dependent resistance term, and $I_{0}(\lambda)$ is a state-dependent current scaling factor. This expression captures the nonlinear current–voltage behaviour and its dependence on the internal state of the device.

The internal state variable $\lambda$ is normalised between 0 and 1, representing the high-resistance state (HRS) and low-resistance state (LRS), respectively. The dependence of the current amplitude on the state variable is described as

$$I_{0}(\lambda)=\left(I_{\mathrm{on}}-I_{\mathrm{off}}\right)\lambda+I_{\mathrm{off}},\tag{37}$$

where $I_{\mathrm{on}}$ and $I_{\mathrm{off}}$ define the current levels corresponding to the LRS and HRS limits. Similar dependencies exist for $\alpha(\lambda)$ and $R_{S}(\lambda)$, involving limiting parameters $\alpha_{min}$, $\alpha_{max}$, $R_{S_{min}}$, and $R_{S_{max}}$.

The evolution of the internal state is governed by

$$\frac{d\lambda}{dt}=\frac{1-\lambda}{\tau_{S}(\lambda,V_{C})}-\frac{\lambda}{\tau_{R}(\lambda,V_{C})},\tag{38}$$

where $\tau_{S}$ and $\tau_{R}$ are the characteristic SET and RESET times, respectively. These characteristic times depend on the voltage and are defined as

$$\tau_{S}(\lambda,V_{C})=\exp\left[-\eta_{S}\left(V_{C}-V_{S}(\lambda)\right)\right],\tag{39}$$

$$\tau_{R}(\lambda,V_{C})=\exp\left[\eta_{R}\lambda^{\gamma}\left(V_{C}-V_{R}\right)\right],\tag{40}$$

where $\eta_{S}$ and $\eta_{R}$ are fitting parameters that control the voltage dependence of the transition rates, $V_{S}(\lambda)$ and $V_{R}$ are the SET and RESET voltages, and $\gamma$ accounts for nonlinear effects during RESET.

To reproduce snapback behaviour during SET, the model allows the SET voltage to depend on the current as

$$V_{S}(I)=\begin{cases}V_{T},&I\geq I_{\mathrm{SB}},\\ V_{S},&I<I_{\mathrm{SB}},\end{cases}\tag{41}$$

where $V_{T}$ is the transition voltage and $I_{\mathrm{SB}}$ is the snapback current threshold.
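To make the state dynamics of Eqs. 38 to 41 concrete, the sketch below integrates the memory state with a forward-Euler scheme at a fixed constriction voltage. It is illustrative only: every value in `PARAMS` is a hypothetical placeholder rather than a fitted device parameter, the clipping of $\lambda$ to $[0,1]$ and the choice of integrator are assumptions, and time is in the model's own normalised units.

```python
import numpy as np

# Hypothetical parameter values chosen only to make the sketch run.
PARAMS = dict(eta_s=10.0, eta_r=10.0, gamma=1.0,
              v_s=0.5, v_r=-0.5, v_t=0.3, i_sb=1e-4)

def set_voltage(current, p):
    """Eq. 41: the SET voltage switches to V_T once the current reaches I_SB."""
    return p["v_t"] if current >= p["i_sb"] else p["v_s"]

def dlambda_dt(lam, v_c, current, p):
    """Eqs. 38-40: net rate of change of the memory state."""
    tau_s = np.exp(-p["eta_s"] * (v_c - set_voltage(current, p)))
    tau_r = np.exp(p["eta_r"] * lam ** p["gamma"] * (v_c - p["v_r"]))
    return (1.0 - lam) / tau_s - lam / tau_r

def integrate(lam0, v_c, current, p, dt=1e-6, steps=1000):
    """Forward-Euler integration; clipping lambda to [0, 1] is an assumption."""
    lam = lam0
    for _ in range(steps):
        lam = float(np.clip(lam + dt * dlambda_dt(lam, v_c, current, p), 0.0, 1.0))
    return lam

lam_after_set = integrate(0.0, v_c=1.5, current=0.0, p=PARAMS)     # drive towards LRS
lam_after_reset = integrate(1.0, v_c=-1.5, current=0.0, p=PARAMS)  # drive towards HRS
```

A constriction voltage well above the SET voltage makes $\tau_{S}$ small, so the first term of Eq. 38 dominates and $\lambda$ rises; a voltage well below the RESET voltage makes $\tau_{R}$ small and $\lambda$ falls.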

The coupling between the transport equation and the state equation enables the model to reproduce the nonlinear and history\-dependent behaviour characteristic of bipolar resistive switching devices\. This model can also be synthesised in an equivalent circuit model, as shown in Supplementary Fig\.[9\(c\)](https://arxiv.org/html/2606.23742#A8.F9.sf3), making it suitable for SPICE\-like simulation tools\.

Based on this model, the proposed active synapse is depicted in Fig. [5](https://arxiv.org/html/2606.23742#S2.F5)a. The transfer function for each case is given by Eq. [42](https://arxiv.org/html/2606.23742#S4.E42).

$$V_{out}(x)=R_{L}\left(\frac{V_{in}-V_{out}}{R}+I_{O}e^{\alpha\left[\left(V_{in}-V_{out}\right)-R_{S}I_{MR}\right]}\right)\tag{42}$$

For this transfer function, the fitting parameters used to tune the synapse response are $R_{L}$, $R$, $R_{S}$, $\alpha$, and $I_{O}$, and their impact on the transfer function is shown in Supplementary Figs. [8(a)](https://arxiv.org/html/2606.23742#A8.F8.sf1)–[8(e)](https://arxiv.org/html/2606.23742#A8.F8.sf5), respectively. The first two correspond to circuit parameters, and the latter three are fitting parameters of the memristor model. Although treated as three independent fitting parameters, these could also be represented as functions of $\lambda$, according to Eq. [37](https://arxiv.org/html/2606.23742#S4.E37), in the scenario of a memristor whose state is tuneable and can achieve different nonlinear HRS states as $\lambda$ goes from 0 to 1.

### 4.13 Multilayer Perceptron Models of Memristor Circuits

To ease the computational cost of backpropagation through the nonlinear-least-squares solver used to find $V_{out}$ in Eq. [42](https://arxiv.org/html/2606.23742#S4.E42) for the memristor circuit, a data-driven approach was taken. First, data were generated via the analytical model, consisting of 20,000 samples of the five control parameters, with 200 input voltages randomly sampled per set of control parameters to give 4 million data points.

An MLP of shape \[6, 125, 125, 1\] was trained to interpolate the memristor’s transfer function with respect to voltage input and the control parameters over a training set of 3 million data points, with a validation set of 500,000 data points used to determine hyperparameters such as network size and learning rate. The model converged with a mean-squared error of $5.17\times 10^{-7}$ evaluated on an unseen dataset consisting of the remaining 500,000 data points. Details of hyperparameter optimisation can be found in [S7](https://arxiv.org/html/2606.23742#A7).

### 4.14 Memristor-based PhyKANs

The memristor-based networks are constructed similarly to the analogue filter networks described above. Learnable parameters on the network edges are the five control parameters that dictate the memristor circuit response, $\theta_{i,h,k}$, plus an analogue gain $g_{i,h,k}$ applied after the circuit. No frequency-encoding function was used; however, the sigmoid activation that bounds input ranges and activations at nodes in hidden layers remained. The response of edge functions is given by a forward pass through the MLP, which takes activations and learnable parameters $\theta$ as inputs. The parameters of the MLP, $W$, remain fixed and identical for all nodes. Edge functions, node aggregation, and node activations are as follows:

$$\theta_{i,h,k}=(R_{L,i,h,k},R_{i,h,k},R_{S,i,h,k},\alpha_{i,h,k},I_{0,i,h,k}),\tag{43}$$

$$\Phi_{i,h}\bigl(a_{i}^{(\ell)}\bigr)=\sum_{k=1}^{K_{i,h}}g_{i,h,k}\cdot H_{\mathrm{MLP}}\!\left((a_{i}^{(\ell)},\theta_{i,h,k});W\right),\tag{44}$$

$$y_{h}^{(\ell+1)}=\sum_{i=1}^{n_{\ell}}M_{i,h}^{(\ell)}\sum_{k=1}^{K_{i,h}}g_{i,h,k}\cdot H_{\mathrm{MLP}}\!\left((a_{i}^{(\ell)},\theta_{i,h,k});W\right),\tag{45}$$

$$a_{h}^{(\ell+1)}=\begin{cases}\sigma_{s}\left(y_{h}^{(\ell+1)}\right),&\text{if }\ell<L-1,\\[4.0pt] y_{h}^{(\ell+1)},&\text{if }\ell=L-1.\end{cases}\tag{46}$$
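Eqs. 43 to 46 can be read as a single vectorised layer computation. The sketch below is a minimal NumPy version under several assumptions that are not taken from the paper: the surrogate MLP uses `tanh` hidden units and random weights in place of the trained network, every edge has the same number of circuits $K$, and the tensor layouts are chosen for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZES = [6, 125, 125, 1]  # surrogate MLP shape quoted in Section 4.13
# Random stand-in for the fixed, pre-trained surrogate weights W.
W = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
     for m, n in zip(SIZES[:-1], SIZES[1:])]

def h_mlp(x, weights=W):
    """Surrogate of the memristor transfer function; tanh is an assumed nonlinearity."""
    for idx, (w, b) in enumerate(weights):
        x = x @ w + b
        if idx < len(weights) - 1:
            x = np.tanh(x)
    return x[..., 0]

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def layer_forward(a, theta, g, mask, last=False):
    """One layer of Eqs. 44-46.

    a: (n_in,) activations; theta: (n_in, n_out, K, 5) circuit parameters;
    g: (n_in, n_out, K) gains; mask: (n_in, n_out) connectivity M.
    """
    n_in, n_out, K, _ = theta.shape
    a_b = np.broadcast_to(a[:, None, None, None], (n_in, n_out, K, 1))
    x = np.concatenate([a_b, theta], axis=-1)  # MLP input (a, theta), 6 features
    phi = (g * h_mlp(x)).sum(axis=-1)          # Eq. 44: edge functions
    y = (mask * phi).sum(axis=0)               # Eq. 45: masked node aggregation
    return y if last else sigmoid(y)           # Eq. 46: sigmoid on hidden layers

a = rng.uniform(size=3)
theta = rng.uniform(size=(3, 2, 4, 5))
g = rng.normal(size=(3, 2, 4))
hidden = layer_forward(a, theta, g, mask=np.ones((3, 2)))
```

Because the surrogate weights are shared and fixed, only `theta` and `g` would receive gradient updates during training, which mirrors the statement that $W$ remains fixed.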

### 4.15 Power Consumption Estimations

The circuitry used in the experiments was chosen for reconfigurability and ease of measurement. For low-power edge-computing applications, the switched-capacitor-based FPAA used for network edges and the DDS arbitrary function generator used on network nodes consume considerably more power than would be expected on a dedicated ASIC. We therefore estimate the potential power savings of a well-optimised PhyKAN platform using plausible functionally equivalent components from the literature. As these components are implemented in a variety of technologies and operate at different speeds, our approach is to extrapolate performance figures for a realistic estimate.

As some calculations here are based on theoretical constraints, the numbers should be treated as lower\-bound estimates for a manufactured platform, and hence serve to motivate approximate power costs, not a specific design value\. The proposed architecture is designed around a 1 V supply voltage\. Network edges used in our demonstration require a nested set of six band\-pass filters, each with its own peak detection method and amplifier\.

To realise the filters, we propose sub-threshold transconductance amplifiers and 1 pF capacitors. By lowering the operating frequency closer to the audio frequency range (100 Hz–50 kHz), the operating power of the band-pass filter stage can be further reduced. To achieve 50 kHz corner frequencies, transconductance values of 314 nS are required for $V_{DD}$ = 1 V. For an overdrive voltage of 200 mV, this gives a required bias current of 31.4 nA ($I_{\mathrm{bias}}=G_{m}\cdot V_{\mathrm{ov}}/2$). Employing a folded cascode configuration to reduce noise on the amplifiers, we can assume a doubling of the required bias currents, giving a total of 62.8 nW power consumption per band-pass filter, or 376.8 nW for the filter bank. In existing works, programmable band-pass filters based upon operational transconductance amplifiers have been realised in 350 nm CMOS with power consumptions of 1.31 $\mu$W and frequency tuning ratios greater than 10,000 \[[25](https://arxiv.org/html/2606.23742#bib.bib17)\]. For fixed filters based upon 180 nm CMOS, powers as low as 41 nW have been realised \[[16](https://arxiv.org/html/2606.23742#bib.bib18)\].

To extract the amplitude of the filtered signal and provide analogue gains to the outputs, additional components are required. To measure the amplitude of the signal, we propose a source follower stage to track the envelope of the oscillations. For a maximum tolerable droop rate of 10% at 100 Hz, a slew rate of 5 V/s is required. To capture 50 kHz oscillations using a comparator-based rectifier, a 100 kHz gain-bandwidth product is required. For a PMOS comparator with a gate capacitance of 50 fF, the required transconductance is given by $G_{m}=2\pi\cdot GBW\cdot C_{\mathrm{gate}}$ = 31 nS. Assuming weak inversion (sub-threshold regime), this leads to a bias current of $I_{\mathrm{bias}}=G_{m}\cdot nV_{T}=1.2$ nA or 1.2 nW of power at $V_{DD}=1$ V. To amplify the envelope signal, a low gain-bandwidth amplifier operating at $\approx$100 Hz can be used, requiring bias currents of $\sim$1 nA and powers of 1 nW. In total, the peak detection and amplification stages require 2.2 nW per filter, or 13.2 nW for the filter bank. This brings the total power for each edge to 390 nW.

Network nodes are required to convert the amplified edge responses into frequency-encoded oscillations for the subsequent layer of edges. Using a relaxation oscillator and a differential pair shaper to approximate a sine wave, the 10 Hz to 50 kHz range can be covered. Assuming the highest input frequency and 1 V peak-to-peak amplitude, the oscillator requires a slew rate of 157 kV/s ($0.5\cdot 2\pi\cdot 50{,}000$). Assuming a load capacitance of 12 pF per connected filter bank, this gives a power consumption of 1.88 $\mu$W for signal generation, or a total of 2.27 $\mu$W per edge in the network. In existing works, oscillators are reported with per-cycle power efficiencies ranging from 1.9 nW/kHz in 350 nm CMOS \[[10](https://arxiv.org/html/2606.23742#bib.bib14)\] to 0.68 nW/kHz in 40 nm CMOS \[[28](https://arxiv.org/html/2606.23742#bib.bib15)\].

For the PV control task as shown in Fig. [4](https://arxiv.org/html/2606.23742#S2.F4), the smallest networks (after pruning) to achieve power generation ratios above 0.9 required 13 edges in the network, leading to a total estimated power consumption of 29.5 $\mu$W. Compared to a typical microcontroller capable of running neural networks with enough parameters to match the performance of the PhyKANs, this is a significant reduction in power. For example, an STM32 edge computing microcontroller designed for neural networks typically consumes between 10 mW and 150 mW when active, which is roughly 340 to 5,000 times the PhyKAN estimate. The component-level assumptions behind this estimate are summarised in Table [2](https://arxiv.org/html/2606.23742#S4.T2).

| Stage | Theoretical design | Literature examples | CMOS process | Comments |
|---|---|---|---|---|
| Band-pass filter | 62.8 nW | 1.31 $\mu$W \[[25](https://arxiv.org/html/2606.23742#bib.bib17)\] | 350 nm | 20 kHz |
| | | 41 nW \[[16](https://arxiv.org/html/2606.23742#bib.bib18)\] | 180 nm | 250 Hz |
| Signal generation | 6.33 nW/kHz | 1.9 nW/kHz \[[10](https://arxiv.org/html/2606.23742#bib.bib14)\] | 350 nm | Assuming 1 pF load |
| | | 0.68 nW/kHz \[[28](https://arxiv.org/html/2606.23742#bib.bib15)\] | 40 nm | |
| Envelope tracking | 1.2 nW | - | - | Max 50 kHz |
| DC amplification | 1.0 nW | - | - | DC level changes $\approx 100$ Hz |

Table 2: Theoretical design power consumption and literature examples for each stage required to create low-power PhyKAN edges. CMOS processes refer to the literature citations.
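The per-edge and total figures quoted in this section follow from a short chain of arithmetic, reproduced below as a check. This is a sketch under stated assumptions: the variable names are invented for the example, the 1.2 nW and 1.0 nW stage powers are taken directly from the text rather than re-derived, and the oscillator power is computed as $C\cdot\text{slew rate}\cdot V_{DD}$, which is the relation that reproduces the quoted 1.88 $\mu$W.

```python
import math

def power_budget(v_dd=1.0, n_filters=6, n_edges=13):
    """Recompute the Section 4.15 estimates from the quoted design values."""
    # Band-pass filters: G_m = 2*pi*f*C for a 50 kHz corner and 1 pF capacitor.
    g_m_filter = 2 * math.pi * 50e3 * 1e-12      # about 314 nS
    i_bias = g_m_filter * 0.2 / 2                # 200 mV overdrive, about 31.4 nA
    p_filter = 2 * i_bias * v_dd                 # folded cascode doubles the current
    p_filter_bank = n_filters * p_filter         # about 376.8 nW

    # Envelope detector transconductance: 100 kHz GBW into 50 fF.
    g_m_detector = 2 * math.pi * 100e3 * 50e-15  # about 31 nS

    # Peak detection (1.2 nW) and DC amplification (1.0 nW), quoted per filter.
    p_detect_bank = n_filters * (1.2e-9 + 1.0e-9)  # 13.2 nW
    p_edge_filters = p_filter_bank + p_detect_bank  # about 390 nW

    # Oscillator: slew rate for a 1 V peak-to-peak, 50 kHz sine into 12 pF.
    slew = 0.5 * 2 * math.pi * 50e3              # about 157 kV/s
    p_oscillator = 12e-12 * slew * v_dd          # about 1.88 uW

    p_edge = p_edge_filters + p_oscillator       # about 2.27 uW
    return {
        "g_m_filter": g_m_filter,
        "g_m_detector": g_m_detector,
        "p_filter": p_filter,
        "p_edge_filters": p_edge_filters,
        "p_oscillator": p_oscillator,
        "p_edge": p_edge,
        "p_total": n_edges * p_edge,             # 13-edge PV controller
    }

budget = power_budget()
```

The oscillator term accounts for most of the per-edge power in this budget, so the assumed load capacitance and maximum frequency are the quantities to which the total is most sensitive.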

## References

- \[1\]F\. L\. Aguirre, J\. Suñé, and E\. Miranda\(2022\)SPICE implementation of the dynamic memdiode model for bipolar resistive switching devices\.Micromachines13\(2\),pp\. 330\.External Links:[Document](https://dx.doi.org/10.3390/mi13020330)Cited by:[§4\.12](https://arxiv.org/html/2606.23742#S4.SS12.p1.1)\.
- \[2\]F\. Aguirre, A\. Sebastian, M\. Le Gallo, W\. Song, T\. Wang, J\. J\. Yang, W\. Lu, M\. Chang, D\. Ielmini, Y\. Yang, A\. Mehonic, A\. Kenyon,et al\.\(2024\)Hardware implementation of memristor\-based artificial neural networks\.Nature Communications15,pp\. 1974\.External Links:[Document](https://dx.doi.org/10.1038/s41467-024-45670-9)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[3\]D\. A\. Allwood, M\. O\. Ellis, D\. Griffin, T\. J\. Hayward, L\. Manneschi, M\. F\. Musameh, S\. O’Keefe, S\. Stepney, C\. Swindells, M\. A\. Trefzer,et al\.\(2023\)A perspective on physical reservoir computing with nanomagnetic devices\.Applied Physics Letters122\(4\),pp\. 040501\.External Links:[Document](https://dx.doi.org/10.1063/5.0119040)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[4\]A\. G\. Barto, R\. S\. Sutton, and C\. W\. Anderson\(1983\)Neuronlike adaptive elements that can solve difficult learning control problems\.IEEE Transactions on Systems, Man, and CyberneticsSMC\-13\(5\),pp\. 834–846\.External Links:[Document](https://dx.doi.org/10.1109/TSMC.1983.6313077)Cited by:[§4\.10](https://arxiv.org/html/2606.23742#S4.SS10.p1.2)\.
- \[5\]G\. Bellec, F\. Scherr, A\. Subramoney, E\. Hajek, D\. Salaj, R\. Legenstein, and W\. Maass\(2020\-07\)A solution to the learning dilemma for recurrent networks of spiking neurons\.Nature Communications11\(1\),pp\. 3625\(en\)\.Note:Number: 1 Publisher: Nature Publishing GroupExternal Links:ISSN 2041\-1723,[Link](https://www.nature.com/articles/s41467-020-17236-y),[Document](https://dx.doi.org/10.1038/s41467-020-17236-y)Cited by:[§3](https://arxiv.org/html/2606.23742#S3.p4.3)\.
- \[6\]Y\. Bengio, N\. Léonard, and A\. Courville\(2013\)Estimating or propagating gradients through stochastic neurons for conditional computation\.arXiv preprint arXiv:1308\.3432\.Cited by:[§4\.6](https://arxiv.org/html/2606.23742#S4.SS6.p1.1)\.
- \[7\]G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba\(2016\)OpenAI gym\.External Links:1606\.01540Cited by:[§4\.10](https://arxiv.org/html/2606.23742#S4.SS10.p1.2)\.
- \[8\]L\. N\. Calçado, S\. K\. Turitsyn, and E\. Manuylovich\(2026\)Small\-scale photonic kolmogorov\-arnold networks using standard telecom nonlinear modules\.External Links:2604\.08432Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p3.1)\.
- \[9\]L\. O\. Chua\(1971\)Memristor—the missing circuit element\.IEEE Transactions on Circuit Theory18\(5\),pp\. 507–519\.External Links:[Document](https://dx.doi.org/10.1109/TCT.1971.1083337)Cited by:[§2\.5](https://arxiv.org/html/2606.23742#S2.SS5.p2.1)\.
- \[10\]U\. Denier\(2010\)Analysis and design of an ultralow\-power cmos relaxation oscillator\.IEEE Transactions on Circuits and Systems I: Regular Papers57\(8\),pp\. 1973–1982\.External Links:[Document](https://dx.doi.org/10.1109/TCSI.2010.2041504)Cited by:[§4\.15](https://arxiv.org/html/2606.23742#S4.SS15.p5.3),[Table 2](https://arxiv.org/html/2606.23742#S4.T2.9.9.2.1.1)\.
- \[11\]M\. Escudero, M\. Zolfagharinejad, S\. v\. d\. Belt, N\. Alachiotis, and W\. G\. van der Wiel\(2026\)Physical analog kolmogorov\-arnold networks based on reconfigurable nonlinear\-processing units\.External Links:2602\.07518Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p3.1),[§2\.5](https://arxiv.org/html/2606.23742#S2.SS5.p6.1)\.
- \[12\]M\. Hermans, M\. Burm, T\. Van Vaerenbergh, J\. Dambre, and P\. Bienstman\(2015\-03\)Trainable hardware for dynamical computing using error backpropagation through physical media\.Nature Communications6\(1\),pp\. 6729\.External Links:[Document](https://dx.doi.org/10.1038/ncomms7729)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[13\]H\. Jaeger, B\. Noheda, and W\. G\. Van Der Wiel\(2023\)Toward a formal theory for computing machines made out of whatever physics offers\.Nature Communications14\(1\),pp\. 4911\.External Links:[Document](https://dx.doi.org/10.1038/s41467-023-40533-1)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[14\]D\. S\. Jeong, K\. M\. Kim, S\. Kim, B\. J\. Choi, and C\. S\. Hwang\(2016\)Memristors for energy\-efficient new computing paradigms\.Advanced Electronic Materials2\(9\),pp\. 1600090\.External Links:[Document](https://dx.doi.org/10.1002/aelm.201600090)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[15\]N\. L\. Kazanskiy, M\. A\. Butt, and S\. N\. Khonina\(2022\)Optical computing: status and perspectives\.Nanomaterials12\(13\),pp\. 2171\.External Links:[Document](https://dx.doi.org/10.3390/nano12132171)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.
- \[16\]S\. Lee, C\. Wang, and Y\. Chu\(2019\)Low\-voltage ota–c filter with an area\-and power\-efficient ota for biosignal sensor applications\.IEEE Transactions on Biomedical Circuits and Systems13\(1\),pp\. 56–67\.External Links:[Document](https://dx.doi.org/10.1109/TBCAS.2018.2882521)Cited by:[§4\.15](https://arxiv.org/html/2606.23742#S4.SS15.p3.3),[Table 2](https://arxiv.org/html/2606.23742#S4.T2.3.3.3.2.2)\.
- \[17\]X\. Liang, J\. Tang, Y\. Zhong, B\. Gao, H\. Qian, and H\. Wu\(2024\)Physical reservoir computing with emerging electronics\.Nature Electronics7\(3\),pp\. 193–206\.External Links:[Document](https://dx.doi.org/10.1038/s41928-024-01133-z)Cited by:[§1](https://arxiv.org/html/2606.23742#S1.p1.1)\.

## Appendix S1 Universality Argument for Idealised PhyKANs

This note proves that idealised PhyKAN networks are universal approximators on compact domains\. The proof has five parts\. First, we show that an equal\-corner band\-pass filter gives a translated bump, and that changing the corner frequencies moves the centre of the bump\. A PhyKAN edge is a finite signed sum of such translated bumps, because each filter in the bank has its own gain\. Second, we show that finite sums of translated bumps can approximate any sine or cosine on a compact interval\. Third, Stone–Weierstrass is applied to the algebra generated by the constant function, sines, and cosines\. Fourth, a single idealised PhyKAN edge is a one\-dimensional universal approximator\. Finally, we show that a PhyKAN network is a universal function approximator via the standard sigmoid universal approximation theorem: the weighted sums in a classical ANN can be replaced by one\-dimensional PhyKAN edge approximators\.

Throughout, expressions of the form

$$\sup_{x\in K}|p(x)-q(x)|$$

use $\sup$ for the supremum, which is the largest approximation error between $p$ and $q$ over the compact set $K$. This quantity is the worst-case error on $K$. Thus a bound such as

$$\sup_{x\in K}|p(x)-q(x)|<\delta$$

means that $|p(x)-q(x)|<\delta$ for every $x\in K$.

### S1.1 Equal-corner filters give translated bumps

An input value $x$ is not applied to the filter directly; it is encoded as the driving frequency

$$\begin{aligned}
f_{\rm enc}(x) &= 10^{\alpha_{\text{in}}+\beta_{\text{in}}x} = e^{(\alpha_{\text{in}}+\beta_{\text{in}}x)\ln 10} = e^{a+\kappa x},\\
a &:= \alpha_{\text{in}}\ln 10,\quad \kappa := \beta_{\text{in}}\ln 10 > 0.
\end{aligned}$$

Thus we are using the same log-frequency encoding as before, only written in base $e$.

A single band-pass filter is the cascade of a first-order high-pass stage and a first-order low-pass stage. Let $\nu_{\rm HP}$ denote the high-pass corner frequency and $\nu_{\rm LP}$ the low-pass corner frequency. Both are tuneable filter parameters. To obtain a translated bump centred at the input location $c$, tune both corners to the frequency produced when the input is $c$:

$$\nu_{\rm HP}(c)=\nu_{\rm LP}(c)=:\nu_{c}=f_{\rm enc}(c).$$

Here $c$ is the chosen centre of this filter response in the input coordinate. Changing $c$ retunes the two corner frequencies and translates the bump along the input axis.

For a first-order high-pass stage and a first-order low-pass stage with the same corner $\nu_{c}$, the magnitude responses are

$$|H_{\rm HP}(\nu;\nu_{c})|=\frac{\nu/\nu_{c}}{\sqrt{1+(\nu/\nu_{c})^{2}}},\qquad |H_{\rm LP}(\nu;\nu_{c})|=\frac{1}{\sqrt{1+(\nu/\nu_{c})^{2}}}.$$

These are the standard first-order filter magnitude responses, and cascaded linear time-invariant transfer functions multiply \[[1](https://arxiv.org/html/2606.23742#biba.bib1)\]. The product rule assumes that the two stages are buffered, or equivalently that inter-stage loading is negligible. Their cascade has magnitude equal to the product, hence

$$|H_{\rm HP}(\nu;\nu_{c})|\,|H_{\rm LP}(\nu;\nu_{c})|=\frac{\nu/\nu_{c}}{1+(\nu/\nu_{c})^{2}}.$$

Driving the filter at $\nu=f_{\rm enc}(x)$ and setting $\nu_{c}=f_{\rm enc}(c)$, the ratio is

$$\frac{\nu}{\nu_{c}}=\frac{e^{a+\kappa x}}{e^{a+\kappa c}}=e^{\kappa(x-c)}.$$

The unnormalised equal-corner response is therefore the cascade magnitude evaluated at these encoded frequencies:

$$|H_{\rm HP}(f_{\rm enc}(x);f_{\rm enc}(c))|\,|H_{\rm LP}(f_{\rm enc}(x);f_{\rm enc}(c))|=\frac{e^{\kappa(x-c)}}{1+e^{2\kappa(x-c)}}.$$

Its value at the centre $x=c$ is $1/2$, so after normalising the peak to one we obtain

$$B(x-c):=\frac{2e^{\kappa(x-c)}}{1+e^{2\kappa(x-c)}}=\frac{2}{e^{\kappa(x-c)}+e^{-\kappa(x-c)}}=\operatorname{sech}\big(\kappa(x-c)\big).$$

Here $\operatorname{sech}$ denotes the hyperbolic secant,

$$\operatorname{sech}z=\frac{1}{\cosh z}=\frac{2}{e^{z}+e^{-z}}.$$

A PhyKAN edge is a bank of such filters whose outputs are added with trainable gains. Consequently, define the idealised edge family on the real line by

$$\mathcal{F}_{\rm edge}(\mathbb{R}):=\left\{x\mapsto\sum_{k=1}^{K}G_{k}B(x-c_{k}):\ K<\infty,\ G_{k},c_{k}\in\mathbb{R}\right\}.$$

These sums are defined for every real input $x$. When the approximation problem is posed on a compact interval $I\subset\mathbb{R}$, we use the same sums but evaluate them only for $x\in I$. We denote this interval-restricted use by $\mathcal{F}_{\rm edge}(I)$.
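The identity between the equal-corner cascade and the hyperbolic secant can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the values of `alpha_in`, `beta_in` and `c` are illustrative assumptions, and the filter is the idealised buffered first-order cascade described above.

```python
import math

def f_enc(x, alpha_in, beta_in):
    """Log-frequency encoding f_enc(x) = 10**(alpha_in + beta_in * x)."""
    return 10.0 ** (alpha_in + beta_in * x)

def bandpass_magnitude(nu, nu_c):
    """Magnitude of an idealised buffered high-pass/low-pass cascade with equal corner nu_c."""
    r = nu / nu_c
    high_pass = r / math.sqrt(1.0 + r * r)
    low_pass = 1.0 / math.sqrt(1.0 + r * r)
    return high_pass * low_pass

def bump(x, c, kappa):
    """Normalised bump B(x - c) = sech(kappa * (x - c))."""
    return 1.0 / math.cosh(kappa * (x - c))

# Illustrative (assumed) encoding constants and bump centre.
alpha_in, beta_in, c = 2.0, 1.5, 0.3
kappa = beta_in * math.log(10.0)

xs = [-1.0 + 0.1 * i for i in range(21)]
# Twice the cascade magnitude (peak normalised to one) should match the sech bump.
max_gap = max(
    abs(2.0 * bandpass_magnitude(f_enc(x, alpha_in, beta_in),
                                 f_enc(c, alpha_in, beta_in))
        - bump(x, c, kappa))
    for x in xs
)
```

Under these assumptions `max_gap` should sit at the level of floating-point round-off, which reflects that the normalised cascade and $\operatorname{sech}(\kappa(x-c))$ are the same function of $x$.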

### S1.2 Scaled and translated bumps approximate trigonometric polynomials

From the previous subsection, an idealised PhyKAN edge has the form

$$\sum_{k=1}^{K}G_{k}B(x-c_{k}).$$

We first consider the continuous analogue

$$S_{G}(x)=\int_{\mathbb{R}}G(c)B(x-c)\,dc,$$

where $G(c)$ is the gain assigned to the bump centred at $c$. To build a cosine with frequency $\omega$, choose the centre-dependent gain $G(c)=\cos(\omega c)$. Thus

$$S_{\omega}^{\cos}(x):=\int_{\mathbb{R}}\cos(\omega c)B(x-c)\,dc.$$

Changing variables $t=x-c$, so $c=x-t$, gives

$$S_{\omega}^{\cos}(x)=\int_{\mathbb{R}}\cos(\omega(x-t))B(t)\,dt.$$

Here $x$ is fixed while integrating over $c$, so $dt=-dc$; the minus sign is absorbed by reversing the integration limits. Using

$$\cos(\omega(x-t))=\cos(\omega x)\cos(\omega t)+\sin(\omega x)\sin(\omega t),$$

we obtain

$$S_{\omega}^{\cos}(x)=\cos(\omega x)\int_{\mathbb{R}}\cos(\omega t)B(t)\,dt+\sin(\omega x)\int_{\mathbb{R}}\sin(\omega t)B(t)\,dt.$$

The second integral is zero because $B(t)=\operatorname{sech}(\kappa t)$ is even and $\sin(\omega t)$ is odd, so the integrand is odd and is integrated over the symmetric domain $\mathbb{R}$. Therefore

$$S_{\omega}^{\cos}(x)=C_{\omega}\cos(\omega x),\qquad C_{\omega}:=\int_{\mathbb{R}}\cos(\omega t)B(t)\,dt.$$

For the hyperbolic secant, the standard Fourier-transform identity gives the cosine-transform formula \[[2](https://arxiv.org/html/2606.23742#biba.bib2), Sec. 3.8\]

$$\int_{\mathbb{R}}\operatorname{sech}(t)\cos(qt)\,dt=\pi\operatorname{sech}\!\left(\frac{\pi q}{2}\right)\qquad(q\in\mathbb{R}).$$

Applying this with $q=\omega/\kappa$ after the change of variable $t'=\kappa t$, we obtain

$$C_{\omega}=\frac{\pi}{\kappa}\operatorname{sech}\!\left(\frac{\pi\omega}{2\kappa}\right)>0.$$

Since $C_{\omega}>0$, we can divide by $C_{\omega}$:

$$\cos(\omega x)=\frac{1}{C_{\omega}}\int_{\mathbb{R}}\cos(\omega c)B(x-c)\,dc.$$

So a cosine is an exact continuous superposition of translated bumps.
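The closed form for $C_{\omega}$ can be sanity-checked against direct numerical integration. The sketch below is an illustration under assumed values of $\kappa$ and $\omega$ (they are not taken from the paper); it uses a plain trapezoid rule on a truncated interval, which is adequate here because $\operatorname{sech}(\kappa t)$ decays exponentially.

```python
import math

def sech(z):
    return 1.0 / math.cosh(z)

def c_omega_closed_form(omega, kappa):
    """C_omega = (pi / kappa) * sech(pi * omega / (2 * kappa))."""
    return (math.pi / kappa) * sech(math.pi * omega / (2.0 * kappa))

def c_omega_numeric(omega, kappa, half_width=20.0, n=40000):
    """Trapezoid-rule estimate of the integral of cos(omega*t) * sech(kappa*t)
    over [-half_width, half_width]; the neglected tails are exponentially small."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        t = -half_width + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.cos(omega * t) * sech(kappa * t)
    return h * total

kappa, omega = 2.0, 3.0  # illustrative values only
closed = c_omega_closed_form(omega, kappa)
numeric = c_omega_numeric(omega, kappa)
```

With these settings `closed` and `numeric` should agree to many digits, and setting `omega = 0.0` recovers $C_{0}=\pi/\kappa$.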

The sine case is similar: choosing $G(c)=\sin(\omega c)$ gives

$$\sin(\omega x)=\frac{1}{C_{\omega}}\int_{\mathbb{R}}\sin(\omega c)B(x-c)\,dc.$$

We move back to the discrete case of the filters by approximating the continuous superposition over centres with a finite Riemann sum. Write $I=[x_{\min},x_{\max}]$. For $\rho>0$, first keep only centres in the finite interval $[x_{\min}-\rho,x_{\max}+\rho]$, so that

$$\begin{aligned}
\int_{\mathbb{R}}\frac{\cos(\omega c)}{C_{\omega}}B(x-c)\,dc
&=\int_{x_{\min}-\rho}^{x_{\max}+\rho}\frac{\cos(\omega c)}{C_{\omega}}B(x-c)\,dc\\
&\quad+\int_{\mathbb{R}\setminus[x_{\min}-\rho,x_{\max}+\rho]}\frac{\cos(\omega c)}{C_{\omega}}B(x-c)\,dc.
\end{aligned}$$

If $x\in I$ and $c\notin[x_{\min}-\rho,x_{\max}+\rho]$, then $|x-c|\geq\rho$. Since $B$ decays exponentially and $|\cos(\omega c)/C_{\omega}|\leq 1/C_{\omega}$, the second integral tends to zero as $\rho\to\infty$, uniformly for $x\in I$. On the remaining finite interval of centres, the integrand is continuous on a compact set, so the first integral can be approximated uniformly in $x\in I$ by a Riemann sum, giving

$$\sum_{k=1}^{K}\frac{\Delta c\,\cos(\omega c_{k})}{C_{\omega}}B(x-c_{k}),$$

where $\Delta c$ is the spacing between adjacent centres $c_{k}$, and similarly for sine. Therefore finite signed sums of translated bumps can approximate any sine or cosine on a compact interval. A trigonometric polynomial is defined as:

$$T(x)=a_{0}+\sum_{m=1}^{M}\left(a_{m}\cos(\omega_{m}x)+b_{m}\sin(\omega_{m}x)\right).$$

By linearity, a single idealised PhyKAN edge with sufficiently many filters can approximate any trigonometric polynomial. To see this, approximate each sine and cosine term by its own finite filter sum, then collect all those filters into one larger finite filter sum. This is still one element of $\mathcal{F}_{\rm edge}(I)$. Constants are included by the same argument with $\omega=0$. In that case $C_{0}=\pi/\kappa>0$ and $\cos(0x)=1$, so the cosine construction approximates the constant function $1$; scaling the gains gives any constant $a_{0}$.
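The truncation and Riemann-sum construction can be tried out directly. The sketch below builds a finite bank of bumps with gains $\Delta c\,\cos(\omega c_{k})/C_{\omega}$ and measures the worst-case error against $\cos(\omega x)$ on an interval. All numerical values ($\kappa$, $\omega$, $\rho$, $\Delta c$, the interval) and the helper names are assumptions chosen for illustration, not parameters from the paper.

```python
import math

def sech(z):
    return 1.0 / math.cosh(z)

def cosine_edge(omega, kappa, x_min, x_max, rho, dc):
    """Centres and gains of a finite bump sum approximating cos(omega * x) on [x_min, x_max].

    Centres are equally spaced by dc on [x_min - rho, x_max + rho];
    gains follow the Riemann-sum weights dc * cos(omega * c_k) / C_omega.
    """
    c_omega = (math.pi / kappa) * sech(math.pi * omega / (2.0 * kappa))
    n = int(round((x_max - x_min + 2.0 * rho) / dc)) + 1
    centres = [x_min - rho + k * dc for k in range(n)]
    gains = [dc * math.cos(omega * c) / c_omega for c in centres]
    return centres, gains

def edge(x, centres, gains, kappa):
    """Idealised edge response: a signed sum of translated sech bumps."""
    return sum(g * sech(kappa * (x - c)) for g, c in zip(gains, centres))

# Illustrative settings only.
kappa, omega = 2.0, 3.0
x_min, x_max, rho, dc = -1.0, 1.0, 8.0, 0.25

centres, gains = cosine_edge(omega, kappa, x_min, x_max, rho, dc)
xs = [x_min + (x_max - x_min) * i / 40 for i in range(41)]
max_err = max(abs(edge(x, centres, gains, kappa) - math.cos(omega * x)) for x in xs)
```

In this setting a bank of a few dozen bumps already reproduces the cosine closely on the test grid. Shrinking `rho` or enlarging `dc` should degrade `max_err`, mirroring the two error sources (tail truncation and Riemann discretisation) in the argument.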

### S1.3 Trigonometric polynomials can approximate 1d functions

We use the Stone–Weierstrass theorem in the form stated by Pinkus\[[3](https://arxiv.org/html/2606.23742#biba.bib3)\]\. In words, on a compact set, any family of continuous real\-valued functions that contains the constants, is closed under addition, multiplication and scalar multiplication, and separates points can uniformly approximate every continuous real\-valued function on that set\.

In our case the compact set is the interval $I\subset\mathbb{R}$. The approximating family is the set of trigonometric polynomials on $I$, namely finite sums of the form

$$a_{0}+\sum_{m=1}^{M}\left(a_{m}\cos(\omega_{m}x)+b_{m}\sin(\omega_{m}x)\right).$$

These functions are continuous and contain the constants. They are also closed under addition, scalar multiplication and multiplication; for multiplication, products of sine and cosine terms reduce to finite sums of sine and cosine terms by the standard product-to-sum identities.

It remains only to check that they separate points. If $x\neq y$, choose

$$\omega=\frac{\pi}{2(y-x)}.$$

Then

$$t\mapsto\sin(\omega(t-x))$$

is a trigonometric polynomial, since it is a linear combination of $\sin(\omega t)$ and $\cos(\omega t)$. Moreover it takes the value $0$ at $t=x$ and the value $1$ at $t=y$. Thus the trigonometric polynomials separate points.

Therefore all hypotheses of Stone–Weierstrass are satisfied, and trigonometric polynomials uniformly approximate every continuous real\-valued function onII\.

### S1.4 A single idealised PhyKAN edge is a one-dimensional universal approximator

We now put together the two approximation results. Let $I\subset\mathbb{R}$ be a compact interval and let $f:I\to\mathbb{R}$ be continuous. Let $\varepsilon>0$. From the Stone–Weierstrass argument in the previous subsection, there is a trigonometric polynomial $T$ such that

$$\sup_{x\in I}|f(x)-T(x)|<\frac{\varepsilon}{2}.$$

By the translated-bump approximation of trigonometric polynomials, finite signed sums of translated bumps can approximate $T$ on $I$. Therefore the same $T$ can be approximated by a single idealised PhyKAN edge, say

$$\mathcal{E}(x)=\sum_{\ell=1}^{L}G_{\ell}B(x-c_{\ell}),$$

with

$$\sup_{x\in I}|T(x)-\mathcal{E}(x)|<\frac{\varepsilon}{2}.$$

Here the centres $c_{\ell}$ are set by the filter corner frequencies, and the coefficients $G_{\ell}$ are the filter gains.

The triangle inequality then gives

$$\sup_{x\in I}|f(x)-\mathcal{E}(x)|\leq\sup_{x\in I}|f(x)-T(x)|+\sup_{x\in I}|T(x)-\mathcal{E}(x)|<\varepsilon.$$

Thus a single idealised PhyKAN edge can approximate any continuous one-dimensional function on a compact interval.

### S1.5 A PhyKAN network is a universal function approximator

Let $Q=I_{1}\times\cdots\times I_{d}\subset\mathbb{R}^{d}$ be compact and let $F:Q\to\mathbb{R}$ be continuous. Let $\varepsilon>0$. By the standard sigmoid universal approximation theorem of Cybenko \[[4](https://arxiv.org/html/2606.23742#biba.bib4)\], there is a finite one-hidden-layer sigmoid network

$$N(x)=\sum_{j=1}^{n_{\mathrm{h}}}\gamma_{j}\,\sigma_{s}(y_{j}(x)),\qquad y_{j}(x)=\sum_{i=1}^{d}r_{ij}x_{i}+b_{j},$$

such that

$$\sup_{x\in Q}|F(x)-N(x)|<\frac{\varepsilon}{2}.$$

Here $\gamma_{j}$ are the usual output weights of the classical ANN, and

$$\sigma_{s}(z)=\frac{1}{1+e^{-z/s_{\sigma}}}$$

is the sigmoid used in the classical approximating network, with fixed scale parameter $s_{\sigma}>0$.

We now replace this classical ANN by a PhyKAN network\. The PhyKAN architecture already contains the same fixed, non\-trainable sigmoid at hidden nodes\. We replace the scalar weighted sums entering the sigmoid by sums of one\-dimensional PhyKAN edges, and then replace the output weights by output PhyKAN edges\.

For each hidden unit, split the affine preactivation $y_{j}$ into one-dimensional functions,

$$g_{1j}(x_{1})=r_{1j}x_{1}+b_{j},\qquad g_{ij}(x_{i})=r_{ij}x_{i}\quad(i=2,\ldots,d),$$

so that $y_{j}(x)=\sum_{i}g_{ij}(x_{i})$. The bias is assigned to the first coordinate only to include it once; equivalently, it could be distributed among the one-dimensional terms. By the one-dimensional universal approximation result for idealised PhyKAN edges, each scalar function $g_{ij}:I_{i}\to\mathbb{R}$ can be uniformly approximated on $I_{i}$. Fix an error tolerance $\eta>0$, to be chosen later, and for each pair $(i,j)$ choose an idealised PhyKAN edge $\mathcal{E}_{ij}:I_{i}\to\mathbb{R}$ such that

$$\sup_{x_{i}\in I_{i}}|g_{ij}(x_{i})-\mathcal{E}_{ij}(x_{i})|<\eta.$$

The classical preactivation $y_{j}(x)=\sum_{i}g_{ij}(x_{i})$ is then replaced by the sum of these one-dimensional edge approximations:

$$\widetilde{y}_{j}(x)=\sum_{i=1}^{d}\mathcal{E}_{ij}(x_{i}).$$

Then, uniformly for $x\in Q$,

$$|y_{j}(x)-\widetilde{y}_{j}(x)|\leq\sum_{i=1}^{d}|g_{ij}(x_{i})-\mathcal{E}_{ij}(x_{i})|<d\eta.$$

In the classical network, hidden unit $j$ sends the scalar activation $\sigma_{s}(y_{j}(x))$ to the output weight $\gamma_{j}$, which returns $\gamma_{j}\sigma_{s}(y_{j}(x))$. In the PhyKAN replacement, the corresponding hidden activation is $\sigma_{s}(\widetilde{y}_{j}(x))$, and the output weight $\gamma_{j}$ is replaced by a PhyKAN edge $\mathcal{O}_{j}$.

Both $\sigma_{s}(y_{j}(x))$ and $\sigma_{s}(\widetilde{y}_{j}(x))$ lie in $[0,1]$ because they are sigmoid outputs, so $\mathcal{O}_{j}$ needs only to approximate multiplication by $\gamma_{j}$ on the interval $[0,1]$. We therefore choose $\mathcal{O}_{j}$ such that

$$\sup_{u\in[0,1]}|\gamma_{j}u-\mathcal{O}_{j}(u)|<\eta,$$

where $u$ is a dummy variable ranging over possible scalar activation values. Here $\mathcal{O}_{j}$ is the output edge from hidden unit $j$ to the scalar output node. The output node sums these edge responses and does not apply a hidden-node sigmoid.

The resulting PhyKAN network is

$$F_{\rm P}(x)=\sum_{j=1}^{n_{\mathrm{h}}}\mathcal{O}_{j}\!\left(\sigma_{s}\!\left(\widetilde{y}_{j}(x)\right)\right).$$

It remains to show that this replacement is close to $N$. The only nonlinear point is that the sigmoid sees $\widetilde{y}_{j}(x)$ instead of $y_{j}(x)$. This is controlled by the mean value theorem. Indeed,

$$\sigma_{s}'(z)=\frac{1}{s_{\sigma}}\sigma_{s}(z)(1-\sigma_{s}(z)).$$

Since $0\leq\sigma_{s}(z)\leq 1$, we have $\sigma_{s}(z)(1-\sigma_{s}(z))\leq 1/4$. Hence

$$|\sigma_{s}'(z)|\leq\frac{1}{4s_{\sigma}}=:L_{\sigma}.$$

The factor $1/s_{\sigma}$ appears because the exponent in $\sigma_{s}$ is $-z/s_{\sigma}$. The mean value theorem therefore gives

$$|\sigma_{s}(u)-\sigma_{s}(v)|\leq L_{\sigma}|u-v|.$$

We now compare one hidden-unit contribution in the classical ANN with its PhyKAN replacement. The classical term is

$$\gamma_{j}\sigma_{s}(y_{j}(x)),$$

whereas the PhyKAN term is

$$\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x))).$$

There are two errors: first, the sigmoid is fed the slightly wrong input $\widetilde{y}_{j}(x)$ instead of $y_{j}(x)$; second, the output linear map $u\mapsto\gamma_{j}u$ is replaced by the output PhyKAN edge $\mathcal{O}_{j}$. To separate these two effects, insert the intermediate term $\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))$. For each $j$,

$$\begin{aligned}
\left|\gamma_{j}\sigma_{s}(y_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|
&\leq\left|\gamma_{j}\sigma_{s}(y_{j}(x))-\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))\right|\\
&\quad+\left|\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|.
\end{aligned}$$

To bound the first term, factor out $\gamma_{j}$:

$$\begin{aligned}
\left|\gamma_{j}\sigma_{s}(y_{j}(x))-\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))\right|
&=|\gamma_{j}|\,\left|\sigma_{s}(y_{j}(x))-\sigma_{s}(\widetilde{y}_{j}(x))\right|\\
&\leq|\gamma_{j}|L_{\sigma}|y_{j}(x)-\widetilde{y}_{j}(x)|\\
&<|\gamma_{j}|L_{\sigma}d\eta.
\end{aligned}$$

Here the last inequality uses the preactivation bound $|y_{j}(x)-\widetilde{y}_{j}(x)|<d\eta$ proved above.

It remains to bound the second term

$$\left|\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|.$$

This term compares the two scalar maps $u\mapsto\gamma_{j}u$ and $\mathcal{O}_{j}$ when both are evaluated at the same scalar input. That common scalar input is

$$\sigma_{s}(\widetilde{y}_{j}(x)),$$

the sigmoid output of the replacement preactivation. Because the sigmoid takes values in $[0,1]$, this input lies in $[0,1]$.

By construction, $\mathcal{O}_{j}$ satisfies the uniform bound

$$\sup_{u\in[0,1]}|\gamma_{j}u-\mathcal{O}_{j}(u)|<\eta,$$

that is, $|\gamma_{j}u-\mathcal{O}_{j}(u)|<\eta$ holds for every $u\in[0,1]$. Since $\sigma_{s}(\widetilde{y}_{j}(x))\in[0,1]$, the bound applies at this particular point and gives

$$\left|\gamma_{j}\sigma_{s}(\widetilde{y}_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|<\eta.$$

Therefore, for each $j$,

$$\left|\gamma_{j}\sigma_{s}(y_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|<|\gamma_{j}|L_{\sigma}d\eta+\eta.$$

Now sum the hidden-unit errors. For each $x\in Q$,

$$N(x)-F_{\rm P}(x)=\sum_{j=1}^{n_{\mathrm{h}}}\left[\gamma_{j}\sigma_{s}(y_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right],$$

so the triangle inequality gives

$$\begin{aligned}
|N(x)-F_{\rm P}(x)|
&=\left|\sum_{j=1}^{n_{\mathrm{h}}}\left[\gamma_{j}\sigma_{s}(y_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right]\right|\\
&\leq\sum_{j=1}^{n_{\mathrm{h}}}\left|\gamma_{j}\sigma_{s}(y_{j}(x))-\mathcal{O}_{j}(\sigma_{s}(\widetilde{y}_{j}(x)))\right|\\
&<\sum_{j=1}^{n_{\mathrm{h}}}\left(|\gamma_{j}|L_{\sigma}d\eta+\eta\right)\\
&=\left(n_{\mathrm{h}}+dL_{\sigma}\sum_{j=1}^{n_{\mathrm{h}}}|\gamma_{j}|\right)\eta.
\end{aligned}$$

The inequality above holds for every input vector $x=(x_{1},\ldots,x_{d})\in Q=I_{1}\times\cdots\times I_{d}$. Therefore the same right-hand side also bounds the supremum of the error over the whole input domain:

$$\sup_{x\in Q}|N(x)-F_{\rm P}(x)|\leq\left(n_{\mathrm{h}}+dL_{\sigma}\sum_{j=1}^{n_{\mathrm{h}}}|\gamma_{j}|\right)\eta.$$

Because the factor multiplying $\eta$ is finite, choose $\eta$ small enough that

$$\sup_{x\in Q}|N(x)-F_{\rm P}(x)|<\frac{\varepsilon}{2}.$$

Finally, the triangle inequality gives

$$\sup_{x\in Q}|F(x)-F_{\rm P}(x)|\leq\sup_{x\in Q}|F(x)-N(x)|+\sup_{x\in Q}|N(x)-F_{\rm P}(x)|<\varepsilon.$$

Thus PhyKAN networks are universal approximators on compact domains.
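The error budget $\left(n_{\mathrm{h}}+dL_{\sigma}\sum_{j}|\gamma_{j}|\right)\eta$ can be illustrated with a small numerical experiment. The sketch below does not simulate PhyKAN filter banks: as a stand-in, each edge is modelled as the exact target map plus a hypothetical bounded perturbation `delta` with magnitude strictly below $\eta$, which is all the proof assumes about the edges. The network size, weights, input box $[-1,1]^{d}$ and perturbation shape are arbitrary assumptions.

```python
import math
import random

def sigmoid(z, s):
    """Sigmoid with scale parameter s: 1 / (1 + exp(-z / s))."""
    return 1.0 / (1.0 + math.exp(-z / s))

def bound_demo(seed=0, d=3, n_h=4, s=0.5, eta=1e-3, n_samples=2000):
    """Return (worst observed |N - F_P|, theoretical bound) for a random small network."""
    rng = random.Random(seed)
    r = [[rng.uniform(-1.0, 1.0) for _ in range(n_h)] for _ in range(d)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_h)]
    gamma = [rng.uniform(-2.0, 2.0) for _ in range(n_h)]

    def delta(t, phase):
        # Hypothetical edge approximation error, always strictly below eta in magnitude.
        return 0.999 * eta * math.sin(7.0 * t + phase)

    lipschitz = 1.0 / (4.0 * s)  # L_sigma from the mean value theorem
    bound = (n_h + d * lipschitz * sum(abs(g) for g in gamma)) * eta

    worst = 0.0
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        classical, replaced = 0.0, 0.0
        for j in range(n_h):
            y = sum(r[i][j] * x[i] for i in range(d)) + b[j]
            # Each input edge deviates from its one-dimensional target by delta.
            y_tilde = y + sum(delta(x[i], i + j) for i in range(d))
            classical += gamma[j] * sigmoid(y, s)
            u = sigmoid(y_tilde, s)
            # The output edge deviates from u -> gamma_j * u by delta.
            replaced += gamma[j] * u + delta(u, j)
        worst = max(worst, abs(classical - replaced))
    return worst, bound

worst, bound = bound_demo()
```

In this setting `worst` should never exceed `bound`, and both scale linearly with `eta`, which is the property used to choose $\eta$ so that the total error falls below $\varepsilon/2$.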

## Appendix S2 Benchmarking Experimental Transfer

To quantify model\-to\-hardware transfer quality in a task\-agnostic manner, we evaluate the mean\-squared error between simulated and experimental edges across transferred networks in the regression and reinforcement learning tasks\.

To reduce the need for additional experimentation, the trained networks used in Figs. [2](https://arxiv.org/html/2606.23742#S2.F2), [3](https://arxiv.org/html/2606.23742#S2.F3), and [4](https://arxiv.org/html/2606.23742#S2.F4) were used as representative samples of the range of edges likely to be realised in practice. Evaluation was performed by linearly spacing 200 inputs over the maximum input range (logarithmically spaced in frequency) for each edge, calculating the simulated response via the analytical equations introduced in Section [4.2.3](https://arxiv.org/html/2606.23742#S4.SS2.SSS3), and comparing the results to the experimentally gathered data. Mean-squared errors were then calculated for each of the edges, resulting in $\approx 35000$ samples of edge transfer functions. Supplementary Fig. [S1](https://arxiv.org/html/2606.23742#A2.F1)a shows a histogram of the mean-squared errors between simulation results and experimental data, while panel b shows the cumulative distribution function for the same data. The plots show that half of the edges transfer with an MSE below $1.58\times 10^{-6}$, and 90% of edges transfer with MSEs below $7.92\times 10^{-5}$.
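The per-edge statistic described above can be sketched as follows. This is an illustrative sketch only: the array layout (one row per edge, one column per input point) and the function names `edge_transfer_mse` and `summarise_transfer` are assumptions, and the demonstration data are synthetic rather than the measured edge responses.

```python
import numpy as np


def edge_transfer_mse(simulated, measured):
    """Mean-squared error per edge.

    Both inputs are assumed to have shape (n_edges, n_inputs), e.g.
    n_inputs = 200 evaluation points per edge.
    """
    simulated = np.asarray(simulated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return np.mean((simulated - measured) ** 2, axis=1)


def summarise_transfer(mse):
    """Median and 90th percentile of the per-edge MSE distribution."""
    return {
        "median": float(np.quantile(mse, 0.5)),
        "p90": float(np.quantile(mse, 0.9)),
    }


# Synthetic stand-in data (NOT the experimental measurements).
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 200))
meas = sim + rng.normal(scale=1e-3, size=sim.shape)
mse = edge_transfer_mse(sim, meas)
stats = summarise_transfer(mse)
```

The two quantiles returned by `summarise_transfer` correspond to the $P(x<X)=0.5$ and $P(x<X)=0.9$ levels marked on the cumulative distribution.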

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_figure_transfer.png)

Figure S1: Transfer Error Statistics. (a) Histogram of mean-squared errors between simulated edge response and experimentally realised edge response. (b) Cumulative distribution function of the data shown in (a). The black dashed line shows the error at which $P(x<X)=0.9$, at an MSE of $7.92\times 10^{-5}$, while the grey dashed line shows the error for $P(x<X)=0.5$, at an MSE of $1.58\times 10^{-6}$.
## Appendix S3 Benchmarking on Classification Problems

As well as the regression\-based problems performed in the main text, the analogue networks were also benchmarked on the standard classification benchmark of Fashion\-MNIST\. This task resembles a more difficult version of the classic MNIST digit recognition problem, and requires classifiers to assign input images to one of 10 clothing classes\. Models here were trained for 100,000 iterations with a batch size of 50\.

Supplementary Fig. [S2](https://arxiv.org/html/2606.23742#A3.F2) shows the final performance of simulated analogue network models on Fashion-MNIST as a function of the number of filters per edge, with an MLP used to provide baseline performance. Panels (a) and (c) plot the performance of analogue networks with 2, 4, and 6 nested filters per edge with respect to the total number of trainable parameters for networks with one and two hidden layers respectively. For this task, the number of filters per edge, which serves as a proxy for edge complexity, has limited effect on the overall performance of the network: only smaller networks with fewer filters per edge show improved per-parameter performance compared to MLP controls. However, when edge complexity is increased, the number of trainable parameters increases but the number of edges required for a given performance decreases, because edges can be pruned more effectively from networks with more complex edge functions, as shown in panels (b) and (d). The analogue networks also show a marked increase in performance compared to MLPs when normalised for total number of edges. In manufacturing contexts this is likely to be highly beneficial, since it permits simpler connectivity topologies.

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_figure_FMNIST.png)

Figure S2: Fashion-MNIST Performance. Panels (a) and (c) show classification accuracy versus trainable parameters for networks with one and two hidden layers respectively, while panels (b) and (d) show the same accuracy data as a function of total network edges, again for one and two hidden layer networks. In all panels, data markers represent mean performance, while the shaded regions show the standard deviation of performance across 10 repetitions with different initialisations of networks.
## Appendix S4 Benchmarking on Discrete Control

The discrete CartPole control task environment follows the classic pole-balancing problem \[[6](https://arxiv.org/html/2606.23742#biba.bib6)\] and was taken directly from the OpenAI Gymnasium library \[[5](https://arxiv.org/html/2606.23742#biba.bib5)\]. The agent receives the state of the cart at every timestep $t$, $S_{t}$, defined by Eq. [S1](https://arxiv.org/html/2606.23742#A4.E1):

$$S_{t} = [x_{t}, \dot{x}_{t}, \theta_{t}, \dot{\theta}_{t}] \tag{S1}$$

where $x_{t}$ denotes the position of the cart with respect to the centre of the environment at time $t$ in arbitrary units, $\dot{x}_{t}$ the velocity of the cart at time $t$ in arbitrary units per second, $\theta_{t}$ the angle of the pole with respect to the normal of the cart at time $t$ in radians, and $\dot{\theta}_{t}$ the angular velocity of the pole at time $t$ in radians per second. The agent chooses one of two discrete actions, moving the cart left or right, and is presented with a reward of +1 for every timestep for which it keeps the pole upright. Rewards were shaped according to the same function as the continuous task. Training of the agent was performed using a double deep Q-learning approach \[[7](https://arxiv.org/html/2606.23742#biba.bib7), [8](https://arxiv.org/html/2606.23742#biba.bib8)\]. This approach features two networks (typically MLPs, though here also analogue networks), conditioned on two sets of parameters $\mathbf{\theta}^{0}$ and $\mathbf{\theta}^{\prime}$, which are responsible for predicting the total future returns of the current state-action pair, $Q(S_{t}, A_{t}; \mathbf{\theta}^{0})$, and of the state-action pair at the next state, $Q(S_{t+1}, A_{t+1}; \mathbf{\theta}^{\prime})$, respectively. Parameters for the current network, $\mathbf{\theta}^{0}$, are updated via gradient descent, by minimising the objective function defined in Eq. [S2](https://arxiv.org/html/2606.23742#A4.E2):

$$\mathcal{L} = \left[R_{t+1}^{s} + \gamma \max_{a} Q(S_{t+1}, a; \mathbf{\theta}^{\prime}) - Q(S_{t}, A_{t}; \mathbf{\theta}^{0})\right]^{2} \tag{S2}$$

where $\gamma$ represents the discount factor used to discount future rewards, here selected as 0.9. The parameters of the network used for estimating total future returns of the next timestep, $\mathbf{\theta}^{\prime}$, received soft updates such that parameters slowly converge to those of the current prediction network, $\mathbf{\theta}^{0}$, via the update rule shown in Eq. [S3](https://arxiv.org/html/2606.23742#A4.E3):

$$\mathbf{\theta}^{\prime} \leftarrow \tau \mathbf{\theta}^{0} + (1 - \tau)\mathbf{\theta}^{\prime} \tag{S3}$$

where $\tau$ denotes the rate at which original parameters are overwritten, here 0.001.
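Equations S2 and S3 can be illustrated with a short NumPy sketch. This is an assumption-laden illustration rather than the training code used in the paper: the function names `td_loss` and `soft_update` are hypothetical, parameters are represented as a plain list of arrays, the loss is shown for a single transition, and the handling of terminal states is omitted because the text does not specify it.

```python
import math

import numpy as np

GAMMA = 0.9   # discount factor stated in the text
TAU = 0.001   # soft-update rate stated in the text


def td_loss(q_current, action, shaped_reward, q_next_target, gamma=GAMMA):
    """Squared temporal-difference error of Eq. S2 for one transition.

    q_current:      Q(S_t, . ; theta^0), one value per action
    q_next_target:  Q(S_{t+1}, . ; theta'), one value per action
    shaped_reward:  the shaped reward R^s_{t+1}
    """
    target = shaped_reward + gamma * np.max(q_next_target)
    return float((target - q_current[action]) ** 2)


def soft_update(theta_target, theta_current, tau=TAU):
    """Eq. S3: theta' <- tau * theta^0 + (1 - tau) * theta'."""
    return [tau * w0 + (1.0 - tau) * wt
            for wt, w0 in zip(theta_target, theta_current)]


# Single-transition example with two actions.
loss = td_loss(np.array([1.0, 2.0]), action=0, shaped_reward=1.0,
               q_next_target=np.array([0.0, 10.0]))
new_target = soft_update([np.array([0.0])], [np.array([1.0])])
```

With $\tau = 0.001$, each update moves the target parameters only 0.1% of the way towards the current network, which keeps the bootstrapped target in Eq. S2 nearly stationary between gradient steps.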

Networks were optimised over 1000 episodes, with the actions selected according to a softmax policy with a temperature of 0\.1 to encourage exploration\. Performance was evaluated after each episode by repeating the run with a greedy policy, and no parameter updates were made during this evaluation\.
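The two action-selection modes described above (softmax exploration during training, greedy selection during evaluation) can be sketched as follows. The function names are hypothetical, and the assumption that the softmax is taken directly over the predicted Q-values is ours; the text states only the temperature of 0.1.

```python
import numpy as np


def softmax_probs(q_values, temperature=0.1):
    """Boltzmann distribution over actions: p(a) proportional to exp(Q(a) / T)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z = z - z.max()          # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()


def select_action(q_values, rng, temperature=0.1, greedy=False):
    """Sample from the softmax policy, or take the argmax when greedy."""
    if greedy:
        return int(np.argmax(q_values))
    probs = softmax_probs(q_values, temperature)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
train_action = select_action([0.2, 0.3], rng)               # exploratory
eval_action = select_action([0.2, 0.3], rng, greedy=True)   # evaluation
```

At a temperature of 0.1, a Q-value difference of 0.1 already makes the better action roughly $e \approx 2.7$ times more likely than the other, so the policy explores mainly when the two Q-values are close.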

Supplementary Fig. [S3](https://arxiv.org/html/2606.23742#A4.F3) shows a performance comparison between simulated analogue network models and MLPs.

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_figure_binary_cartpole.png)

Figure S3: CartPole Performance for Binary Action Selection. Comparison of the peak performance of both model classes trained under the double-DQN approach on the standard CartPole task, for deep Q-learning networks with a single hidden layer (a) and two hidden layers (b). Marked data points show mean performance over 10 independent initialisations, while the shaded regions show the standard deviation.

With a single hidden layer, both architectures reach maximum performance at around the same number of trainable parameters, though the MLP outperforms the analogue network for the smallest network sizes\. In the two hidden layer experiments, performance between the two architectures is almost identical\.

In the classical implementation of the CartPole benchmark for reinforcement learning, the agent receives the same continuous input data, but during action selection, the output is binarised into applying forces of $\pm 10$ N. With greedy action selection, this approach resembles a winner-takes-all algorithm operating on Q-values predicted by the network, making the task functionally similar to binary classification. We hypothesise that this explains the similarity to the Fashion-MNIST results and the reduction in parameter-efficiency gains compared to the regression-based tasks in the main manuscript.

## Appendix S5 Intrinsic Dimensionality Analysis of Learned Representations

To help elucidate the different representations learned by the PhyKANs, we quantify the intrinsic dimensionality of the layer representations in the network using the metric of Sutton et al. \[[9](https://arxiv.org/html/2606.23742#biba.bib9)\]. According to this metric, data sampled from a distribution $\mathcal{D}$ on $\mathbb{R}^{d}$ have an intrinsic dimensionality $n(\mathcal{D}) \in \mathbb{R}$ with respect to a centre $c \in \mathbb{R}^{d}$ if:

$$P\left(x, y \sim \mathcal{D} : \langle x - y, y - c \rangle \geq 0\right) = \frac{1}{2^{n(\mathcal{D})+1}} \tag{S4}$$

Supplementary Fig. [S4](https://arxiv.org/html/2606.23742#A5.F4) shows a comparison between the intrinsic dimensionality of PhyKANs and MLPs as a function of trainable parameters for the six-axis inverse kinematics problem, for the trained models as used in Fig. [2](https://arxiv.org/html/2606.23742#S2.F2).
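Equation S4 can be inverted to give $n(\mathcal{D}) = -\log_2 P - 1$, where $P$ is the probability on the left-hand side. The sketch below estimates $P$ by Monte Carlo sampling of pairs from a finite set of layer activations. It is an illustration under stated assumptions, not the authors' procedure: the function name, the pair-sampling scheme, and the default choice of the sample mean as the centre $c$ are all hypothetical.

```python
import numpy as np


def intrinsic_dimensionality(samples, centre=None, n_pairs=100_000, seed=0):
    """Monte Carlo estimate of n(D) from Eq. S4.

    samples: array of shape (n_samples, d), e.g. hidden-layer activations.
    centre:  the centre c; defaults to the sample mean (an assumption).
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim != 2 or len(samples) < 2:
        raise ValueError("samples must have shape (n_samples >= 2, d)")
    if centre is None:
        centre = samples.mean(axis=0)
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(samples), size=n_pairs)
    j = rng.integers(0, len(samples), size=n_pairs)
    keep = i != j                      # discard degenerate pairs with x == y
    x, y = samples[i[keep]], samples[j[keep]]
    inner = np.einsum("nd,nd->n", x - y, y - centre)
    p = np.mean(inner >= 0.0)
    if p == 0.0:
        return float("inf")            # no qualifying pair was observed
    return float(-np.log2(p) - 1.0)


# Sanity check on synthetic Gaussian data (not the network activations).
rng = np.random.default_rng(1)
dim_1d = intrinsic_dimensionality(rng.normal(size=(5000, 1)), centre=np.zeros(1))
dim_5d = intrinsic_dimensionality(rng.normal(size=(5000, 5)), centre=np.zeros(5))
```

For a one-dimensional standard Gaussian centred at the origin, the event $\langle x-y, y-c\rangle \geq 0$ requires $x$ and $y$ to share a sign with $|x| \geq |y|$, which has probability $1/4$, so the estimate should be close to 1.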

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_figure_dimensionality.png)

Figure S4: Intrinsic Dimensionality Analysis of PhyKAN and MLP for Six-Axis Inverse Kinematics. Plots comparing the mean-squared error for the six-axis inverse kinematics problem (solid lines) for (black) MLPs and (red) PhyKANs, as well as the intrinsic dimensionality (dotted lines) of representations provided by converged models for (a) the hidden layer of networks with a single hidden layer, (b) the first hidden layer of networks with two hidden layers, and (c) the second hidden layer of networks with two hidden layers.

For networks with a single hidden layer (Supplementary Fig. [S4](https://arxiv.org/html/2606.23742#A5.F4)a), the two architectures have both similar dimensionalities and similar performance. In the two-layer networks, the first hidden layers of the PhyKANs (Supplementary Fig. [S4](https://arxiv.org/html/2606.23742#A5.F4)b) saturate near the point where the PhyKANs achieve a smaller mean-squared error than the MLPs, whose performance appears to remain in a local minimum. For the second hidden layer (Supplementary Fig. [S4](https://arxiv.org/html/2606.23742#A5.F4)c), the PhyKANs uniformly have a lower dimensionality than the MLPs. We hypothesise that this is due to the PhyKANs finding a specific intermediate representation that is well suited to solving this task.

The increase in average error for the two largest network sizes is caused by outliers among the 10 sampled models that learn suboptimal solutions. Supplementary Fig. [S5](https://arxiv.org/html/2606.23742#A5.F5)a shows a scatter plot of trainable parameters versus mean-squared error for the MLPs and simulated PhyKANs. For PhyKAN networks with more than 200 parameters, some trained models discover representations with MSEs below 0.001. However, for both smaller networks ($\approx 200$ parameters) and larger networks ($>10^{4}$ parameters), suboptimal solutions are more commonly found, increasing the average error. This supports the hypothesis that maintaining the intrinsic dimensionality shown in Supplementary Fig. [S4](https://arxiv.org/html/2606.23742#A5.F4)b,c increases the ability to discover good solutions during learning. When outliers are excluded in the largest networks, shown in Supplementary Fig. [S5](https://arxiv.org/html/2606.23742#A5.F5)b, performance is maintained compared to the optimal zone of $\approx 10000$ parameters.

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_invkin_outliers.png)

Figure S5: Statistical Variation of Solutions Learned for the Inverse Kinematics Task. (a) A scatter plot of mean-squared error versus trainable parameters across 10 independently trained models for each size, comparing MLP performance (black circles) to the simulated PhyKAN networks (red circles). Greater variation overall is observed in the PhyKANs, with outliers in the general trends as network size increases. Datapoints excluded in panel (b) are highlighted by the dashed circles. (b) A repeat of the plot shown in Fig. [2](https://arxiv.org/html/2606.23742#S2.F2)(f), but with points with anomalously low performance excluded from the average.
## Appendix S6 L1 + Shannon Entropy Regularisation on MLPs

For a fair comparison with the regularised PhyKAN models, we applied analogous L1 and entropy penalties to the MLP baselines. The hardware-aware quantisation/stability penalty was not applied to MLPs, since these models were not transferred to physical hardware. Standard L1 regularisation was performed on the weights of the MLP. The entropy penalty was computed on node activations in MLPs, not on the edge-wise activations described in Section [4.7](https://arxiv.org/html/2606.23742#S4.SS7):

$$\mathcal{L}_{\mathrm{reg}} = \sum_{\ell} \left(\|\mathbf{\theta}^{(\ell)}\| - \sum_{h} q_{h}^{(\ell)} \log q_{h}^{(\ell)}\right) \tag{S5}$$

$$q_{h} = \frac{|a_{h}|}{\sum_{h} |a_{h}|}. \tag{S6}$$

where $\theta$ describes the model's weights, and $a_{h}$ is the activation on node $h$ in a given layer.
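A minimal NumPy sketch of Eqs. S5 and S6 is given below. Several details are assumptions rather than statements from the paper: the norm in Eq. S5 is taken to be the L1 norm (following the surrounding text), activations are supplied as one vector per layer for a single input, and a small constant `eps` (hypothetical) guards against division by zero and $\log 0$. The function name `reg_penalty` is also hypothetical.

```python
import numpy as np


def reg_penalty(weights, activations, eps=1e-12):
    """L1 + Shannon entropy penalty summed over layers (Eqs. S5 and S6).

    weights:     list of per-layer weight arrays theta^(l)
    activations: list of per-layer node activation vectors a^(l)
    """
    total = 0.0
    for theta, a in zip(weights, activations):
        l1 = np.sum(np.abs(theta))                    # ||theta^(l)||, taken as L1
        abs_a = np.abs(np.asarray(a, dtype=float))
        q = abs_a / (np.sum(abs_a) + eps)             # Eq. S6
        entropy = -np.sum(q * np.log(q + eps))        # -sum_h q_h log q_h
        total += l1 + entropy
    return float(total)


# One layer, weights with L1 norm 3, two equally active nodes (entropy = ln 2).
penalty = reg_penalty([np.array([[1.0, -2.0]])], [np.array([1.0, 1.0])])
```

The entropy term is smallest when a single node carries all of the activation in a layer and largest when activation is spread evenly, so minimising it encourages sparse node usage alongside the sparse weights favoured by the L1 term.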

Supplementary Fig. [S6](https://arxiv.org/html/2606.23742#A6.F6) shows a comparison of the mean-squared errors achieved on the inverse kinematics task for MLPs without penalty terms (black), with the same $1\times 10^{-5}$ magnitude of penalty as the PhyKANs (red), and with an optimised penalty of $1\times 10^{-7}$ (crimson). Both penalty magnitudes boost performance for the smallest MLPs compared to the control, but only the optimised penalty avoids degrading performance in larger networks, where the larger penalty reduces accuracy. This suggests that in MLPs, the penalty terms are useful for finding better solutions in constrained networks, but can form additional constraints that reduce accuracy if the penalty term is too strong.

![Refer to caption](https://arxiv.org/html/2606.23742v1/Supplementary_figure_penalties.png)

Figure S6: Effect of penalty terms on MLPs. Comparison between mean-squared errors achieved on multilayer perceptrons with no regularisation penalty (black), a small but not optimised penalty magnitude (red), and an optimised penalty magnitude (crimson) for the six-axis inverse kinematics problem. Panels (a) and (b) show errors achieved in networks with a single hidden layer, and two hidden layers respectively. Markers show the average performance across ten independent runs, while the shaded region reflects the standard deviation across the ten runs.

![Refer to caption](https://arxiv.org/html/2606.23742v1/MemKANMLP.png)

Figure S7: Hyperparameter optimisation of memristor models. (a) Surface plot of test mean-squared error of the MLP model of the memristor circuit as a function of number of nodes in the hidden layers of the network as well as the learning rate used to train the network. Panels (b) and (c) show plots of the lowest achieved mean-squared errors as a function of learning rate and hidden layer size respectively.
## Appendix S7 Hyperparameter optimisation of MLP models of Memristor circuits

The learning rate and the number of nodes in the two hidden layers of the MLPs used to model the memristor circuits were selected according to the lowest test mean-squared error, averaged across 10 independent runs on different shuffles of training/validation data splits of size $3\times 10^{6}$ and $5\times 10^{5}$ respectively. Each model was trained for one million iterations with a fixed minibatch size of 10,000. Supplementary Fig. [S7](https://arxiv.org/html/2606.23742#A6.F7)(a) shows a 3D plot of mean-squared error as a function of learning rate and number of neurons in each hidden layer, showing a region where models fit well when the number of nodes $> 100$ and the learning rate $< 3\times 10^{-3}$. The best-performing combination was 125 nodes and a learning rate of $1\times 10^{-4}$. Panels (b) and (c) show the results of a single slice down the 3D plot around both axes for mean-squared error versus learning rate and hidden layer size respectively.

## Appendix S8 Supporting Figures for Memristor-based PhyKANs

![Refer to caption](https://arxiv.org/html/2606.23742v1/R_BP_impact.png) (a)
![Refer to caption](https://arxiv.org/html/2606.23742v1/RS_BP_impact.png) (b)
![Refer to caption](https://arxiv.org/html/2606.23742v1/RL_BP_impact.png) (c)
![Refer to caption](https://arxiv.org/html/2606.23742v1/alpha_BP_impact.png) (d)
![Refer to caption](https://arxiv.org/html/2606.23742v1/Io_BP_impact.png) (e)

Figure S8: The specific transfer function of the memristor circuit is governed by five fitting parameters, whose impact in the band-pass case is indicated in panels (a)–(e).

![Refer to caption](https://arxiv.org/html/2606.23742v1/memristor_IV.png) (a)
![Refer to caption](https://arxiv.org/html/2606.23742v1/devices_physics.png) (b)
![Refer to caption](https://arxiv.org/html/2606.23742v1/model.png) (c)

Figure S9: (a) Sketch of the typical I/V loop in memristor devices. Nonlinear HRS state conduction is indicated in blue and linear LRS state conduction is indicated in red. Set and reset transitions are indicated to illustrate the transition between the two states. (b) Illustration of the memristor filament's morphology in the HRS and LRS states and the transition between them. Note that in most memristive devices a so-called electroforming step is necessary to allow the device to switch between HRS and LRS states. (c) Equivalent circuit of the dynamic memdiode model employed in the reconfigurable active synapse.
## Appendix S9 References

- \[1\] Oppenheim, A. V., Willsky, A. S. & Nawab, S. H. *Signals and Systems*, 2nd edn. (Prentice Hall, 1997).
- \[2\] Kärtner, F. X. *Ultrafast Optics*. MIT OpenCourseWare (2005).
- \[3\] Pinkus, A. Weierstrass and approximation theory. *J. Approx. Theory* 107, 1–66 (2000).
- \[4\] Cybenko, G. Approximation by superpositions of a sigmoidal function. *Math. Control Signals Syst.* 2, 303–314 (1989).
- \[5\] Brockman, G. *et al.* OpenAI Gym. Preprint at arXiv:1606.01540 (2016).
- \[6\] Barto, A. G., Sutton, R. S. & Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. *IEEE Trans. Syst. Man Cybern.* SMC-13, 834–846 (1983).
- \[7\] Mnih, V. *et al.* Human-level control through deep reinforcement learning. *Nature* 518, 529–533 (2015).
- \[8\] Van Hasselt, H., Guez, A. & Silver, D. Deep reinforcement learning with double Q-learning. *Proc. AAAI Conf. Artif. Intell.* 30, 2094–2100 (2016).
- \[9\] Sutton, O. J., Zhou, Q., Gorban, A. N. & Tyukin, I. Y. Relative intrinsic dimensionality is intrinsic to learning. In Iliadis, L., Papaleonidas, A., Angelov, P. & Jayne, C. (eds.) *Artificial Neural Networks and Machine Learning – ICANN 2023*, 516–529 (Springer Nature Switzerland, 2023).
