Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

arXiv cs.LG Papers

Summary

This paper proposes a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities (e.g., Softmax, SiLU, normalization) via population computation with LIF neurons and lightweight bit-shift scaling, achieving less than 1% accuracy drop on LLMs without fine-tuning.

arXiv:2605.20289v1 Announce Type: new Abstract: ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives -- division, exponentiation, and $\ell_2$ norms -- and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLMs Transformers show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.

Cached at: 05/21/26, 06:21 AM

# Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers
Source: [https://arxiv.org/html/2605.20289](https://arxiv.org/html/2605.20289)
###### Abstract

ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire (LIF) dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives (division, exponentiation, and $\ell_2$ norms) and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of Transformer LLMs show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.

Machine Learning, ICML

## 1 Introduction

Recently, spiking neural networks (SNNs) have gained increasing attention for energy-efficient, event-driven computation on neuromorphic hardware (Roy et al., [2019](https://arxiv.org/html/2605.20289#bib.bib10); Davies et al., [2018](https://arxiv.org/html/2605.20289#bib.bib35); Akopyan et al., [2015](https://arxiv.org/html/2605.20289#bib.bib36)). These models have shown promising results in both computer vision and natural language processing (Cao et al., [2015](https://arxiv.org/html/2605.20289#bib.bib28); Zhou et al., [2023](https://arxiv.org/html/2605.20289#bib.bib31); Lv et al., [2023](https://arxiv.org/html/2605.20289#bib.bib33); Zhu et al., [2023](https://arxiv.org/html/2605.20289#bib.bib34)). Meanwhile, Transformer-based foundation models, particularly large language models (LLMs), have become the dominant inference workloads. Their costs are largely driven by dense linear algebra and associated memory traffic (Horowitz, [2014](https://arxiv.org/html/2605.20289#bib.bib29)).

Motivated by this, recent spiking Transformers and SNN-LLM efforts have increasingly adopted ANN-to-SNN conversion to reduce inference cost without expensive retraining (Rueckauer et al., [2017](https://arxiv.org/html/2605.20289#bib.bib30); Chen et al., [2025a](https://arxiv.org/html/2605.20289#bib.bib25), [b](https://arxiv.org/html/2605.20289#bib.bib1); You et al., [2024](https://arxiv.org/html/2605.20289#bib.bib43)). These methods primarily convert Transformer linear computations, such as attention matrix multiplications and feed-forward projections, into spike-driven implementations that leverage event sparsity for improved energy efficiency (Rueckauer et al., [2017](https://arxiv.org/html/2605.20289#bib.bib30); Yan et al., [2024](https://arxiv.org/html/2605.20289#bib.bib21)).

However, this spike-centric research paradigm remains incomplete for Transformer architectures, as existing works rarely provide spike-based realizations of key nonlinear operators such as activation functions, normalization layers, and Softmax (Zhou et al., [2023](https://arxiv.org/html/2605.20289#bib.bib31); Shi et al., [2024](https://arxiv.org/html/2605.20289#bib.bib32)). Although these components are often considered secondary from an energy-accounting perspective (Horowitz, [2014](https://arxiv.org/html/2605.20289#bib.bib29)), they become decisive under strict spike-only deployment. They are difficult to represent with standard leaky integrate-and-fire (LIF) neurons, since they commonly require division or norm computations that do not naturally arise from LIF dynamics. More critically, on neuromorphic hardware platforms that support only spike-based primitives, continuous-valued states are unavailable, making conventional implementations a fundamental obstacle to deployment (Davies et al., [2018](https://arxiv.org/html/2605.20289#bib.bib35), [2021](https://arxiv.org/html/2605.20289#bib.bib11)). Consequently, existing ANN-to-SNN approaches often fall short of truly end-to-end spiking Transformers under strict spike-only constraints, *including their nonlinearities* (Zhu et al., [2023](https://arxiv.org/html/2605.20289#bib.bib34); Lv et al., [2023](https://arxiv.org/html/2605.20289#bib.bib33)).

To address this limitation, a few works approximate Transformer nonlinearities with spiking computation. However, they typically require additional training, limiting their compatibility with standard ANN-to-SNN conversion pipelines (Tang et al., [2025](https://arxiv.org/html/2605.20289#bib.bib45)). Motivated by these observations, we ask the following natural and important question: *Can we design plug-and-play spike-based conversion for nonlinear operators in ANN-to-SNN pipelines?*

In this work, we answer this question affirmatively by developing *training-free* spike-based replacements that are compatible with standard LIF dynamics. We identify three primitive nonlinear computations that repeatedly arise in Transformer inference: division, exponentiation, and $\ell_2$ norms, which form the computational core of Softmax, SiLU, and RMSNorm. Our spike-friendly realizations are built from population computation using LIF neuron groups, together with simple shift-based scaling that avoids floating-point arithmetic. By constructing approximations for these primitives and composing them as modular operator blocks, we obtain fully spiking realizations of the above Transformer nonlinearities under strict spike-only constraints, without any additional fine-tuning.

These operator\-level modules are composable and can be plugged into existing ANN\-to\-SNN conversion pipelines to enable end\-to\-end spiking Transformer blocks, while respecting the spike\-only primitives and lightweight digital operations natively supported by neuromorphic hardware\. Our main contributions are summarized as follows:

- We propose a spike-friendly approach to approximate Transformer nonlinear operators. Our method operates at the operator level and can be seamlessly integrated into existing ANN-to-SNN conversion pipelines without requiring any modifications to model weights.
- We provide theoretical guarantees on the conversion error. We show that our spike-based approximations of nonlinear operators admit provably bounded conversion errors under mild conditions. Additionally, we identify specific configurations that can achieve high approximation accuracy.
- We empirically demonstrate the applicability of the proposed method across different models. We test our method on two widely used ANN-to-SNN conversion frameworks. Additionally, to verify the broad applicability of our method, we apply it to well-known models that have not previously been converted to SNNs, such as Qwen3, by selectively replacing their nonlinear operators while keeping the other computational parts unchanged.

## 2 Related Work

Early ANN-to-SNN conversion methods approximate continuous ANN activations using firing-rate coding and scale alignment. Diehl et al. introduce weight and threshold balancing to enable fast and accurate inference in deep spiking networks (Diehl et al., [2015](https://arxiv.org/html/2605.20289#bib.bib41)). Sengupta et al. further extend this conversion paradigm to deeper architectures, demonstrating that VGG and residual networks can be converted to SNNs without retraining while maintaining competitive performance (Sengupta et al., [2019](https://arxiv.org/html/2605.20289#bib.bib42)). As Transformers became the dominant model class, You et al. propose SpikeZIP-TF, which systematically applies ANN-to-SNN conversion to Transformer architectures by spiking the linear projections in attention and feed-forward layers (You et al., [2024](https://arxiv.org/html/2605.20289#bib.bib43)). At the scale of large language models, Xing et al. introduce SpikeLLM, which constructs large spiking language models using saliency-driven spike allocation and reports efficiency–accuracy trade-offs compared with standard low-bit inference pipelines (Xing et al., [2025](https://arxiv.org/html/2605.20289#bib.bib44)).

Distinct from the above works that primarily focus on dense linear algebra, Sorbet constructs spike-based realizations of nonlinear functions using shift-based discrete operations, and applies knowledge distillation and fine-tuning to align the spiking model with the original BERT behavior (Tang et al., [2025](https://arxiv.org/html/2605.20289#bib.bib45)). Although it identifies the issue of nonlinear operators, its method requires training and is incompatible with other pipelines.

## 3 Preliminary

##### Spiking Neurons and Temporal Accumulation

SNNs process information over discrete timesteps $t=1,\dots,T$. At each timestep, neurons emit binary spikes and communicate through temporally accumulated activity. As a result, real-valued quantities can be represented implicitly by spike counts over a finite temporal window.

In practice, ANN-to-SNN conversion methods typically rely on rate-based representations, where the accumulated spike activity approximates the activation of an artificial neuron:

$$a_{\mathrm{SNN}} \approx \sum_{t=1}^{T} s^{t}\cdot\theta, \tag{1}$$

where $s^{t}\in\{0,1\}$ denotes the spike at timestep $t$, and $\theta$ is the firing threshold. Under appropriate scaling, the expected spike count matches the quantized ANN activation, enabling a direct correspondence between ANN activations and SNN spike statistics (Rueckauer et al., [2017](https://arxiv.org/html/2605.20289#bib.bib30)).

##### Leaky Integrate\-and\-Fire Neuron

The LIF neuron is the most commonly used neuron model in SNNs. In discrete time, its dynamics are given by

$$
\begin{aligned}
v(t) &= \lambda v(t-1) + I(t), && (2)\\
s(t) &= \mathbb{I}\left[v(t)\geq\theta\right], && (3)\\
v(t) &\leftarrow v(t) - s(t)\,\theta, && (4)
\end{aligned}
$$

where $v(t)$ denotes the membrane potential, $I(t)$ the input current, $\lambda\in(0,1]$ the leak factor, $\theta$ the firing threshold, and $\mathbb{I}[\cdot]$ the indicator function.

This accumulate–threshold–reset mechanism allows LIF neurons to approximate linear transformations through temporal spike accumulation, forming the basis of most existing ANN\-to\-SNN conversion methods\.
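The accumulate, threshold, and reset steps of Eqs. (2)-(4) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `lif_run` and the integer input values are assumptions chosen for the example. With $\lambda=1$, the rate-coded estimate from Eq. (1) plus the residual membrane potential recovers the total input.

```python
def lif_run(currents, theta, leak=1):
    """Discrete-time LIF neuron following Eqs. (2)-(4).

    Returns the spike train and the residual membrane potential.
    """
    v = 0
    spikes = []
    for i_t in currents:
        v = leak * v + i_t          # Eq. (2): leaky integration
        s = 1 if v >= theta else 0  # Eq. (3): threshold comparison
        v -= s * theta              # Eq. (4): subtractive reset
        spikes.append(s)
    return spikes, v


# Hypothetical example: constant integer input over T = 16 timesteps.
T, theta = 16, 10
currents = [3] * T
spikes, residual = lif_run(currents, theta)
decoded = sum(spikes) * theta  # rate-coded estimate from Eq. (1)
print(decoded, residual, sum(currents))
```

The gap between `decoded` and the true input sum is the residual potential, which stays below one threshold unit in this example.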

## 4 Method

LLMs rely heavily on nonlinear operations that involve exponentials, divisions, and square roots. We focus on three key functions, each of which can be written as the ratio of a numerator and a denominator:

$$
\begin{aligned}
\phi_{\mathrm{Softmax}}(x_i) &= \frac{e^{x_i}}{\sum_{j} e^{x_j}},\\
\phi_{\mathrm{SiLU}}(x) &= \frac{x}{1+e^{-x}},\\
\phi_{\mathrm{RMSNorm}}(x) &= \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^{2}+\epsilon}}.
\end{aligned}
\tag{5}
$$

We specifically focus on RMSNorm because it involves the most challenging part of normalization: the computation of the $\ell_2$-norm (via squaring, summation, and square root operations) followed by a division. The approximation of RMSNorm can also be easily extended to LayerNorm by adding mean subtraction, which can be implemented through addition operations.

![Refer to caption](https://arxiv.org/html/2605.20289v1/x1.png)
Figure 1: Overview of the division neuron group.

![Refer to caption](https://arxiv.org/html/2605.20289v1/x2.png)
Figure 2: Overview of NLSpiking. Each spike-unfriendly function (SiLU, Softmax, RMSNorm) is approximated using modular spiking blocks: the Piecewise Linear Exponential (PWL-Exp) Unit, the PolarNorm Unit, and the Division Neuron.

### 4.1 Core Building Blocks

#### 4.1.1 Division Neuron Group

We start with the core division neuron group. Exact division is difficult to compute with neurons, but integer division serves as a promising alternative: as long as its error is kept within a reasonable range, it can be regarded as an approximate division operation. Following this rationale, we implement a spike-native division approximation using a population of $L$ standard LIF neurons with ordered thresholds and $\lambda=1$.

As shown in Figure [1](https://arxiv.org/html/2605.20289#S4.F1), in typical usage, both $I_A$ and $I_B$ are spike-coded signals. The division operation is carried out in two stages. In the first temporal window, the denominator input $I_B$ is temporally integrated to estimate a normalization scale. This scale is then held fixed and applied as population thresholds during a second temporal window, in which the numerator input $I_A$ drives the division neuron group. This separation reflects the fact that division depends on the aggregate magnitude of the denominator rather than its precise spike timing.

##### Threshold construction\.

Let $I_B(t)$ denote the spike-coded denominator input over the first temporal window of length $T$. We define the accumulated denominator

$$I_B \triangleq \sum_{t=1}^{T} I_B(t), \tag{6}$$

which yields a scalar quantity through temporal integration. From this accumulated value, we derive a base threshold via a right shift

$$\theta \triangleq I_B \gg n \;=\; \left\lfloor \frac{I_B}{2^{n}} \right\rfloor, \tag{7}$$

and assign the $i$-th neuron in the division population the threshold

$$\theta_i = i\,\theta, \qquad i=1,\dots,L. \tag{8}$$

We choose both the temporal length $T$ and the population size $L$ as powers of two, and set

$$n = \log_2(TL), \tag{9}$$

so that the effective scale of $\theta$ matches the dynamic range of temporal accumulation. Once constructed, the thresholds $\{\theta_i\}$ remain fixed throughout the subsequent computation window.

##### Population decoding as integer division\.

During the second temporal window of length $T$, the spike-coded numerator input $I_A(t)$ is applied to the division neuron group. Neuron $i$ fires if and only if $I_A(t) \geq \theta_i = i\,\theta$. We decode the quotient by counting the number of active neurons,

$$q \triangleq \sum_{i=1}^{L} s_i,\ s_i\in\{0,1\}; \qquad \hat{q} = q \gg n, \tag{10}$$

which yields

$$\hat{q} = \sum_{t=1}^{T} \max\{i \mid v(t) \geq i\theta\} = \left\lfloor \frac{\sum_{t=1}^{T} v(t)}{\theta} \right\rfloor. \tag{11}$$

Noticing that $\left\lfloor \frac{\sum_{t=1}^{T} v(t)}{\theta} \right\rfloor = \left\lfloor \frac{\sum_{t=1}^{T} I_A(t)}{\theta} \right\rfloor$, substituting $\theta = I_B \gg n$ from ([7](https://arxiv.org/html/2605.20289#S4.E7)) completes a spike-native discretization of division, in which the denominator is derived from a temporally integrated spike signal using only a shift operation.
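The two-stage procedure can be illustrated with an integer-only sketch. Several choices here are assumptions rather than details confirmed by the text: the function name `division_neuron_group`, the carry-over of residual membrane potential between timesteps with a subtractive reset (as in Eqs. (2)-(4)), and the reading of the spike count $q$ as a fixed-point value at scale $2^{n}$, so that $I_A/I_B \approx q/2^{n}$. The input values are hypothetical.

```python
def division_neuron_group(i_a, i_b, L):
    """Approximate sum(i_a) / sum(i_b) with L ordered-threshold neurons.

    Returns (q, n); the ratio estimate is q / 2**n.
    """
    T = len(i_b)
    assert len(i_a) == T, "both windows are assumed to have length T"
    n = (T * L).bit_length() - 1          # n = log2(T * L), Eq. (9)
    assert 1 << n == T * L, "T * L must be a power of two"

    # Stage 1: integrate the denominator and derive the base threshold.
    acc_b = sum(i_b)                      # Eq. (6)
    theta = acc_b >> n                    # Eq. (7)
    if theta == 0:
        raise ValueError("denominator too small for this (T, L)")

    # Stage 2: drive the population with the numerator.
    v, q = 0, 0
    for a_t in i_a:
        v += a_t
        k = min(L, v // theta)            # neurons with i * theta <= v fire
        q += k                            # population spike count
        v -= k * theta                    # assumed subtractive reset
    return q, n


# Hypothetical inputs with T = 8 timesteps and L = 8 neurons.
q, n = division_neuron_group([300] * 8, [1024] * 8, L=8)
print(q, q / 2**n, (300 * 8) / (1024 * 8))
```

Because each timestep can emit at most $L$ spikes, the count saturates at $TL = 2^{n}$, which is why this sketch is only meaningful when the numerator does not exceed the denominator by much.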

#### 4.1.2 PolarNorm Unit (PN Unit)

Having approximated the division, the second challenge arises from the norm, as square and square root operations are equally difficult to perform within spiking computations. To solve this problem, we introduce the *PolarNorm Unit (PN Unit)*. Specifically, we consider the expression $\sqrt{\sum_{i=1}^{d} x_i^{2} + \epsilon d}$ and aim to approximate it using only additions, subtractions, comparisons, and bit-shifts, without general-purpose multiplications or square roots. From a geometric perspective, this term corresponds to the Euclidean norm of a vector, suggesting that the problem can be reformulated as a sequence of length-preserving vector reductions rather than explicit arithmetic operations.

Inspired by this, let $\mathbf{v} = [x_1, x_2, \dots, x_d, \sqrt{\epsilon d}]$ be the augmented input vector. Our goal is to estimate $\|\mathbf{v}\|_2$ via a recursive reduction process using the CORDIC-Hypot algorithm (Volder, [1959](https://arxiv.org/html/2605.20289#bib.bib2)), which approximates $\sqrt{x^{2}+y^{2}}$ with only shift-add logic.

The approximation is performed using a binary-tree structure: adjacent elements of $\mathbf{v}$ are recursively merged via CORDIC-Hypot operations. Each CORDIC step updates $(x_k, y_k)$ as

$$x_{k+1} = x_k + d_k\cdot\frac{y_k}{2^{k}}, \quad y_{k+1} = y_k - d_k\cdot\frac{x_k}{2^{k}}, \tag{12}$$

where $d_k = \mathrm{sign}(y_k)$. After $n$ iterations, the output $x_n$ approximates $\sqrt{x^{2}+y^{2}}$ up to a constant gain. Applying this recursively over the tree yields a scalar norm estimate. The final result is rescaled by a fixed inverse gain $1/K_n$ to correct for CORDIC accumulation. This constant factor is an integer power of $2$, and therefore does not violate spike-friendly constraints: it can be absorbed into the scale $\theta$ or approximated using fixed-point scaling.
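A floating-point sketch of the tree reduction may help make Eq. (12) concrete. It is an illustration under stated assumptions, not the paper's implementation: the names `cordic_hypot`, `cordic_gain`, and `polar_norm` are hypothetical, divisions by $2^{k}$ stand in for right shifts, and the gain is computed with the standard CORDIC product $K_n=\prod_{k=0}^{n-1}\sqrt{1+2^{-2k}}$ and removed after every merge (the text only states that the final result is rescaled).

```python
import math


def cordic_hypot(x, y, n_iter=16):
    """Vectoring-mode CORDIC: returns K_n * sqrt(x^2 + y^2), per Eq. (12)."""
    x, y = abs(x), abs(y)
    for k in range(n_iter):
        d = 1 if y >= 0 else -1
        # Both updates use the values from the previous iteration.
        x, y = x + d * y / 2**k, y - d * x / 2**k
    return x


def cordic_gain(n_iter=16):
    """Accumulated CORDIC gain K_n for n_iter iterations starting at k = 0."""
    gain = 1.0
    for k in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * k))
    return gain


def polar_norm(vec, n_iter=16):
    """Estimate the l2 norm by merging adjacent pairs in a binary tree."""
    gain = cordic_gain(n_iter)
    level = [abs(v) for v in vec]
    while len(level) > 1:
        merged = [
            cordic_hypot(level[j], level[j + 1], n_iter) / gain
            for j in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:            # an odd element passes through unchanged
            merged.append(level[-1])
        level = merged
    return level[0]


print(polar_norm([3.0, 4.0]), polar_norm([1.0, 2.0, 2.0, 4.0]))
```

For the standard iteration starting at $k=0$, `cordic_gain` evaluates to roughly 1.6468, so a fixed-point approximation of $1/K_n$ is the form a shift-only deployment of this sketch would need.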

#### 4.1.3 Piecewise Linear Exponential Unit (PWL-Exp Unit)

Finally, computing $\exp(x)$ directly remains infeasible in spiking computation. To address this, we introduce the PWL-Exp Unit.

We restrict our approximation to the interval $[-L, L]$, which covers the effective input range for most normalized activations. This range is divided into $K$ uniform segments. Let $\gamma = 2L/K$; we apply piecewise linear interpolation on each subinterval:

$$e^{x} \approx ax + b = \frac{e^{x_{i+1}} - e^{x_i}}{x_{i+1} - x_i}(x - x_i) + e^{x_i}, \quad x_i = -L + \gamma i, \quad i = 0, 1, \dots, K-1.$$

The coefficients $a$ and $b$ are precomputed and stored in lookup tables to avoid using the multiplication operator. To minimize memory usage, the coefficient $a$ is quantized to 8-bit fixed-point precision. This design eliminates the need for online exponential computation and enables high-throughput deployment in SNNs.

The PWL\-Exp Unit introduces negligible overhead and operates with a cost comparable to a lightweight linear mapping\. Its structure is particularly amenable to LUT\-based implementations in neuromorphic hardware\.
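The lookup-table construction can be sketched as follows. This is a simplified illustration: the names `build_pwl_exp_table` and `pwl_exp` are hypothetical, the coefficients are kept in floating point (the 8-bit quantization of $a$ is omitted), and returning zero below $-L$ and clamping above $L$ are assumptions made for the example.

```python
import math


def build_pwl_exp_table(L=5.0, K=64):
    """Precompute (a, b) for each of the K uniform segments on [-L, L]."""
    gamma = 2 * L / K
    table = []
    for i in range(K):
        x0 = -L + gamma * i
        x1 = x0 + gamma
        a = (math.exp(x1) - math.exp(x0)) / gamma   # segment slope
        b = math.exp(x0) - a * x0                   # segment intercept
        table.append((a, b))
    return table, gamma


def pwl_exp(x, table, gamma, L=5.0):
    """Evaluate the piecewise-linear approximation a * x + b."""
    if x < -L:
        return 0.0                                  # assumed: negligible tail
    x = min(x, L)                                   # assumed: clamp above L
    i = min(int((x + L) / gamma), len(table) - 1)   # segment index
    a, b = table[i]
    return a * x + b


table, gamma = build_pwl_exp_table(L=5.0, K=64)
worst = max(
    abs(pwl_exp(x, table, gamma) - math.exp(x)) / math.exp(x)
    for x in [-5 + 0.01 * j for j in range(1001)]
)
print(worst)
```

With these settings the worst relative error on the grid stays below $\frac{L^{2}}{2K^{2}}e^{2L/K}$, the quantity used as $\varepsilon_{\exp}$ in Section 5.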

### 4.2 Nonlinear Spiking Function (NLSpiking)

We now build NLSpiking functions\. The key observation behind NLSpiking is that a wide class of nonlinear functions used in LLMs shares a common fractional structure: an input\-dependent numerator modulated by a positive normalization term\. Rather than approximating each nonlinear function independently, we factorize them into a numerator path and a denominator path, each implemented using spike\-friendly primitives, and recombine them through the Division Neuron\.

This decomposition provides a unified abstraction for seemingly different nonlinearities\. Softmax, SiLU, and RMSNorm differ mainly in how the numerator and denominator are constructed, while the overall computational skeleton remains identical\.

##### Softmax\.

Softmax is characterized by a competitive normalization across multiple inputs, where the relative scale between exponentiated activations determines the output distribution. We first stabilize the input using the shift-invariance property of Softmax:

$$z_i \leftarrow z_i - \max_{j} z_j + H,$$

ensuring $z_i \in (-\infty, H]$. To suppress negligible contributions, the PWL-Exp Unit outputs zero for inputs below $-H$.

Numerator: $\exp(z_i)$ is approximated using the PWL-Exp Unit within the interval $[-H, H]$.

Denominator: The summation $\sum_{j}\exp(z_j)$ is computed through temporal accumulation over $T$ timesteps.

Output: The final normalized output is obtained via the Division Neuron.

This formulation avoids runtime exponentiation and division, enabling low\-latency spike\-based inference with bounded output\.
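The Softmax pipeline can be emulated end to end with integers after the exponential stage. The sketch below is an approximation of the described flow, not the authors' code: `nls_softmax` and `pwl_exp` are hypothetical names, the segment coefficients are computed on the fly as a stand-in for the lookup table, `scale_bits` is an assumed fixed-point scale for the numerators, and the division is collapsed into a single floor division by the shifted denominator.

```python
import math


def pwl_exp(x, H=5.0, K=64):
    """Piecewise-linear exp on [-H, H]; zero below -H (stand-in for the LUT)."""
    if x < -H:
        return 0.0
    gamma = 2 * H / K
    i = min(int((x + H) / gamma), K - 1)
    x0 = -H + gamma * i
    slope = (math.exp(x0 + gamma) - math.exp(x0)) / gamma
    return slope * (x - x0) + math.exp(x0)


def nls_softmax(z, H=5.0, K=64, n=8, scale_bits=8):
    """Shift, PWL-exp, accumulate, then divide with a shift-derived threshold."""
    z_max = max(z)
    shifted = [zi - z_max + H for zi in z]        # z_i <- z_i - max_j z_j + H
    num = [int(pwl_exp(s, H, K) * (1 << scale_bits)) for s in shifted]
    den = sum(num)                                # accumulated denominator
    theta = den >> n                              # base threshold (right shift)
    levels = 1 << n                               # population saturates here
    return [min(levels, a // theta) / levels for a in num]


probs = nls_softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

Because every quotient is floored, the outputs sum to slightly less than one; the gap shrinks as `n` grows.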

##### SiLU\.

Unlike Softmax, SiLU is a self-modulated nonlinearity, where the input simultaneously contributes to both the numerator and the normalization term. For SiLU, we constrain the domain to $[-H, H]$ and extend the output linearly beyond this range: $\mathrm{SiLU}(x) \leftarrow x$ for $x > H$, and $\mathrm{SiLU}(x) \leftarrow 0$ for $x < -H$.

Numerator: The input $x$ is directly encoded as a spike train.

Denominator: The term $1 + \exp(-x)$ is approximated by applying the PWL-Exp Unit to $-x$ and adding 1.

Output: The final normalized value is computed via the Division Neuron.

This construction preserves the smooth nonlinearity of SiLU while maintaining spike compatibility\.
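A compact emulation of the SiLU construction is sketched below. The fixed-point layout is an assumption made for this example: numerator and denominator are scaled by $2^{16}$ (`scale_bits`), the threshold is obtained with an 8-bit right shift (`n`), and the sign of $x$ is handled outside the division. The names `nls_silu` and `pwl_exp` are hypothetical, and the segment coefficients are computed on the fly instead of being read from a lookup table.

```python
import math


def pwl_exp(x, H=5.0, K=64):
    """Piecewise-linear exp on [-H, H]; zero below -H (stand-in for the LUT)."""
    if x < -H:
        return 0.0
    gamma = 2 * H / K
    i = min(int((x + H) / gamma), K - 1)
    x0 = -H + gamma * i
    slope = (math.exp(x0 + gamma) - math.exp(x0)) / gamma
    return slope * (x - x0) + math.exp(x0)


def nls_silu(x, H=5.0, K=64, n=8, scale_bits=16):
    """SiLU as |x| / (1 + exp(-x)) with a shift-derived division threshold."""
    if x > H:
        return x                                   # linear extension
    if x < -H:
        return 0.0                                 # saturated tail
    scale = 1 << scale_bits
    num = int(abs(x) * scale)                      # numerator path
    den = int((1.0 + pwl_exp(-x, H, K)) * scale)   # denominator path
    theta = den >> n                               # base threshold
    q = num // theta                               # integer quotient
    sign = -1.0 if x < 0 else 1.0                  # assumed sign handling
    return sign * q / (1 << n)


for x in (-2.0, 0.0, 1.0, 5.0):
    print(x, nls_silu(x), x / (1.0 + math.exp(-x)))
```

The larger numerator scale keeps the threshold well above one even when the denominator is close to 1, which is the regime where a coarse threshold would distort the quotient most.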

##### RMSNorm\.

RMSNorm differs fundamentally from Softmax and SiLU in that its denominator encodes a vector magnitude rather than an activation-dependent gating term. To approximate RMSNorm, we follow the transformed formulation:

Numerator: Each $x_i$ is scaled by the constant $\sqrt{d}$ and encoded as a spike train.

Denominator: The expression $\sqrt{\sum x_i^{2} + \epsilon d}$ is approximated using the PolarNorm Unit.

Output: The final normalized value is computed via the Division Neuron.
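The transformed formulation is not written out explicitly; one reading consistent with the numerator and denominator listed above is obtained by multiplying the top and bottom of the RMSNorm expression in Eq. (5) by $\sqrt{d}$:

$$
\phi_{\mathrm{RMSNorm}}(x)_i
= \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}
= \frac{\sqrt{d}\,x_i}{\sqrt{\sum_{j=1}^{d} x_j^{2} + \epsilon d}}.
$$

Under this reading, the denominator is exactly the $\ell_2$ norm of the augmented vector $[x_1, \dots, x_d, \sqrt{\epsilon d}]$ handled by the PolarNorm Unit, and the factor $\sqrt{d}$ is a constant that depends only on the hidden dimension.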

All nonlinear spiking functions in NLSpiking follow the same modular construction paradigm, relying exclusively on primitive, spike\-friendly operations\. By explicitly factorizing each nonlinearity into a numerator path and a normalization path, approximation errors are isolated within individual modules rather than entangled across the computation\. This modularity enables systematic error analysis and exposes clear control knobs over accuracy, latency, and resource usage, which is particularly amenable to hardware co\-design\.

Beyond the specific instances of Softmax, SiLU, and RMSNorm, this construction naturally extends to related normalization\-based variants\. For example, LayerNorm can be viewed as an RMSNorm augmented with a mean subtraction term, which can be implemented using only additions and accumulations without altering the overall division\-based structure\. This highlights the generality of NLSpiking as a unifying framework for spike\-compatible nonlinear normalization\.

A detailed discussion of approximation accuracy, the influence of constants such as $\varepsilon$, and the selection of hyperparameters, including the range bound $H$, is provided in Section [5](https://arxiv.org/html/2605.20289#S5), where we rigorously derive error bounds and offer principled guidelines for configuring NLSpiking functions.

## 5 Performance Analysis

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/paper/softmax_error1.png)
(a) Operator-level errors for Softmax approximations under 8-bit quantization.

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/paper/rms_error1.png)
(b) Operator-level errors for RMSNorm approximations under 8-bit quantization.

Figure 3: Operator-level errors under 8-bit quantization. Error bars indicate the gap between mean and maximum absolute error. NLS-Softmax achieves the lowest mean error across dimensions while keeping bounded maximum error under integer-only implementation, and NLS-RMS yields lower mean errors than blockwise and Sorbet baselines with stable performance across dimensions.

To validate the effectiveness of the NLSpiking framework, we analyze its performance from two complementary perspectives: *accuracy* and *memory footprint*, aiming to provide a holistic view of how spike-based approximations balance functional fidelity with hardware constraints. Our analysis is guided by a central question: whether the modular decomposition in NLSpiking leads to controlled and predictable approximation errors, or whether errors introduced by individual spike-based operators accumulate uncontrollably. To this end, we derive explicit per-function error bounds that decompose the total approximation error into contributions from exponentiation, division, and norm estimation. Throughout this section we denote by

$$\varepsilon_{\exp} = \frac{L^{2}}{2K^{2}}\,e^{2L/K}, \quad \Delta = \frac{1}{n}, \quad \varepsilon_{\text{pol}} = \bigl\lceil \log_2 d \bigr\rceil\, 2^{-2n-1},$$

the relative error of PWL-Exp, the quantisation step of the $(T,L)$-Division neuron, and the relative $\ell_2$-norm error of an $n$-step CORDIC tree in the PolarNorm Unit, respectively.

###### Theorem 5.1 (Error Bounds for NLSpiking).

For each NLSpiking function $\tilde{\phi}$, assume the standard spike-based approximations (PWL-Exp, Division Neuron, and PolarNorm) are employed. Then the following per-output error bounds hold:

1. (i) Softmax. For every class $i$, $\dfrac{|\tilde{\phi}_i - \phi_i|}{\phi_i} \;\leq\; \dfrac{2}{1-\varepsilon_{\exp}}\,\bigl(\varepsilon_{\exp} + \Delta\bigr).$
2. (ii) SiLU. For every $x \in [-H, H]$, $|\tilde{\phi}(x) - \phi(x)| \;\leq\; |x|\,\dfrac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}} \;+\; |x|\,\Delta.$
3. (iii) RMSNorm. For each coordinate $i$, $\dfrac{|\tilde{\phi}_i - \phi_i|}{\phi_i} \;\leq\; \dfrac{\varepsilon_{\text{pol}} + \Delta}{1-\varepsilon_{\text{pol}}} \;+\; \sqrt{d}\,\Delta.$

A key observation from Theorem [5.1](https://arxiv.org/html/2605.20289#S5.Thmtheorem1) is that, for all three nonlinearities, the total approximation error decomposes additively into a small number of well-isolated terms, each tied to a single source: lookup-based exponentiation, spike-based division, or recursive norm estimation. Every term can therefore be tightened independently through its own parameter ($K$, $\Delta$, or $n$). We empirically validate these guarantees in the next section. The proof is available in Appendix [A](https://arxiv.org/html/2605.20289#A1).
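To see how the three bounds behave for a concrete configuration, they can be evaluated directly. The helper below is a hedged sketch: `nls_error_bounds` is a hypothetical name, $\Delta$ is passed in as a parameter rather than derived from the division neuron settings, and the example values ($L=5$, $K=64$, 16 CORDIC steps, $d=4096$, $\Delta=1/256$) are illustrative assumptions, not settings reported in the paper.

```python
import math


def nls_error_bounds(L, K, n_cordic, d, delta, x_abs):
    """Evaluate the three bounds of Theorem 5.1 for given parameters."""
    eps_exp = L**2 / (2 * K**2) * math.exp(2 * L / K)
    # (d - 1).bit_length() equals ceil(log2(d)) for integer d >= 1.
    eps_pol = (d - 1).bit_length() * 2.0 ** (-2 * n_cordic - 1)
    return {
        "eps_exp": eps_exp,
        "eps_pol": eps_pol,
        # (i) relative error per Softmax output
        "softmax": 2 / (1 - eps_exp) * (eps_exp + delta),
        # (ii) absolute error of SiLU at |x| = x_abs
        "silu": x_abs * 2 * eps_exp / (1 - eps_exp) + x_abs * delta,
        # (iii) relative error per RMSNorm coordinate
        "rmsnorm": (eps_pol + delta) / (1 - eps_pol) + math.sqrt(d) * delta,
    }


bounds = nls_error_bounds(L=5.0, K=64, n_cordic=16, d=4096,
                          delta=1 / 256, x_abs=5.0)
for name, value in bounds.items():
    print(f"{name}: {value:.6g}")
```

In this example the $\sqrt{d}\,\Delta$ term dominates the RMSNorm bound, which suggests that the division step, rather than the CORDIC depth, is the first knob to adjust for wide hidden dimensions.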

##### Memory Efficiency\.

Theorem [5.1](https://arxiv.org/html/2605.20289#S5.Thmtheorem1) directly exposes the trade-off between approximation accuracy and memory footprint. In particular, for a fixed range $L$, the PWL-Exp error $\varepsilon_{\exp}$ depends only on the number of segments $K$, which determines the size of the lookup table. Compared to traditional floating-point computation, NLSpiking only requires storing a compact set of lookup entries: $K$ values of 8-bit and 16-bit precision in total. This is significantly smaller than typical table-based methods, which often rely on large floating-point tables. Such a reduction is critical for deployment on spiking neuromorphic chips, which usually operate under strict on-chip memory constraints.

## 6 Experiments

In this section, we present a comprehensive evaluation of the proposed method from both the operator-level and the model-level perspectives, together with a sensitivity analysis of key hyperparameters. More detailed experiments are deferred to Appendix [B](https://arxiv.org/html/2605.20289#A2). Furthermore, while nonlinear operators generally incur lower computational cost than linear layers in practice, we also report the counts of MAC, AC, and shift operations in Appendix [B.4](https://arxiv.org/html/2605.20289#A2.SS4) to provide a reference for energy analysis.

### 6.1 Function-Level Evaluation

##### Baseline\.

Our evaluation is conducted strictly at the operator level. For each target function, we compare NLS-based operators against representative approximation implementations commonly adopted for the same operator in quantized or efficient inference settings. Most of these operator instantiations are adapted from prior work on surrogate/nonlinear approximations and extreme quantization, including Sorbet (Tang et al., [2025](https://arxiv.org/html/2605.20289#bib.bib45)), XNOR-Net (Rastegari et al., [2016](https://arxiv.org/html/2605.20289#bib.bib39)), and DoReFa-Net (Zhou et al., [2016](https://arxiv.org/html/2605.20289#bib.bib40)), and we additionally include classical numerical approximations for completeness.

For SiLU, we include XNOR-style binary thresholding, DoReFa-style low-bit uniform quantization (DoReFa-4b), surrogate nonlinearities (ReLU and hard-swish), and piecewise-linear (PWL) sigmoid-based approximations with 16 and 64 segments. These baselines span the spectrum from extreme low-cost discretization to higher-precision numerical approximations.

For Softmax, we compare against hardmax, surrogate-based approximations (Sorbet), a Padé [2/2] rational approximation, and a 16-segment PWL exponential baseline, which approximate the exponential and normalization steps in efficient attention.

For RMSNorm, we include Sorbet-style surrogates and blockwise RMS approximations with block sizes of 32 and 64, which are commonly used to reduce reduction cost in practice. All methods are treated as isolated operator approximations and evaluated.

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/appendix/silu_error.png) (a) SiLU approximation error (8-bit).
![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/appendix/silu_sensitivity_line.png) (b) NLS-SiLU sensitivity to $2L$.
![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/appendix/softmax_sensitivity_line.png) (c) NLS-Softmax sensitivity ($d=64$).

Figure 4: Left: SiLU approximation errors across baselines. Middle–Right: sensitivity of NLS-SiLU and NLS-Softmax to the clipping interval length $2L$. Excessively large intervals increase SiLU maximum error, while overly small intervals lead to significant Softmax deviations. We recommend $L=5$ as the default setting.

Table 1: Model-level accuracy before and after operator replacement. $\Delta$ denotes the change from the original operator. NLSpike results are highlighted for clarity.

| Source | Model | Operator | WinoGrande | HellaSwag | ArcC | ArcE | PIQA | Avg. Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ANN | LLaMA-3-8B (Dubey et al., [2024](https://arxiv.org/html/2605.20289#bib.bib17)) | Original | 0.736 | 0.792 | 0.542 | 0.776 | 0.807 | 0.730 |
| | | **NLSpike ($\Delta$)** | -0.008 | -0.000 | +0.001 | +0.000 | -0.004 | -0.003 |
| | LLaMA-2-7B (Touvron et al., [2023](https://arxiv.org/html/2605.20289#bib.bib16)) | Original | 0.693 | 0.762 | 0.451 | 0.737 | 0.788 | 0.686 |
| | | **NLSpike ($\Delta$)** | -0.006 | -0.001 | +0.001 | -0.003 | +0.001 | -0.002 |
| | Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2605.20289#bib.bib6)) | Original | 0.736 | 0.844 | 0.544 | 0.717 | 0.779 | 0.724 |
| | | **NLSpike ($\Delta$)** | +0.001 | -0.001 | -0.003 | +0.000 | +0.004 | +0.000 |
| | Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2605.20289#bib.bib5)) | Original | 0.720 | 0.786 | 0.565 | 0.800 | 0.797 | 0.734 |
| | | **NLSpike ($\Delta$)** | -0.005 | +0.002 | +0.039 | +0.033 | -0.001 | +0.014 |
| SpikeLLM ($T=2$, W2A16) | LLaMA-2-7B (Touvron et al., [2023](https://arxiv.org/html/2605.20289#bib.bib16)) | Original | 0.528 | 0.499 | 0.284 | 0.419 | 0.657 | 0.477 |
| | | **NLSpike ($\Delta$)** | -0.004 | -0.002 | +0.002 | -0.001 | +0.003 | -0.000 |
| | LLaMA-2-13B (Touvron et al., [2023](https://arxiv.org/html/2605.20289#bib.bib16)) | Original | 0.572 | 0.580 | 0.304 | 0.445 | 0.677 | 0.516 |
| | | **NLSpike ($\Delta$)** | -0.003 | -0.001 | +0.001 | +0.002 | -0.002 | -0.001 |

##### Operator\-level Approximation Results\.

Figures [3(a)](https://arxiv.org/html/2605.20289#S5.F3.sf1), [3(b)](https://arxiv.org/html/2605.20289#S5.F3.sf2), and [4(a)](https://arxiv.org/html/2605.20289#S6.F4.sf1) summarize operator-level approximation errors under 8-bit quantization.

For SiLU (Fig. [4(a)](https://arxiv.org/html/2605.20289#S6.F4.sf1), $x\in[-5,5]$), NLS-SiLU achieves the lowest mean error among training-free baselines, while keeping maximum error comparable to a strong 64-segment PWL-sigmoid; Sorbet and hard-swish deviate more, and DoReFa-4b/XNOR exhibit both high mean and peak errors.

For Softmax (Fig. [3(a)](https://arxiv.org/html/2605.20289#S5.F3.sf1), varying input dimension $d$), NLS-Softmax consistently yields the lowest mean error across all $d$, with maximum error effectively bounded by the 8-bit grid. Padé/PWL may approach similar mean accuracy but suffer larger maximum error and variance at higher dimensions, and Sorbet/hardmax remain substantially worse.

For RMSNorm (Fig. [3(b)](https://arxiv.org/html/2605.20289#S5.F3.sf2)), blockwise RMS (block size 32/64) is accurate only when $d$ aligns with the block partition and becomes unstable otherwise; in contrast, NLS-RMS maintains consistently low mean error across all tested dimensions.

Overall, NLS\-based operators provide stable, quantization\-robust approximations, while enabling an efficient deployment that shares a single lookup table across Softmax, SiLU, and RMSNorm\.

### 6.2 Ablation and Function Sensitivity Analysis

Figure [4(b)](https://arxiv.org/html/2605.20289#S6.F4.sf2) shows the sensitivity of NLSpike-SiLU when the clipping bound $H$ is varied from $3$ to $10$. Both mean and maximum approximation errors remain extremely small within moderate ranges (e.g., $H=3,4,5$). However, when $H$ grows larger, the maximum error increases rapidly, reflecting the growing difficulty of approximating the extreme tails of the activation. This suggests that unnecessarily large intervals harm robustness without improving the error in the practically relevant region. In contrast, Figure [4(c)](https://arxiv.org/html/2605.20289#S6.F4.sf3) reports the sensitivity of NLSpike-Softmax (fixed $d=64$). Here, enlarging $H$ consistently reduces both mean and maximum errors, because clipping less aggressively preserves the exponential scaling inside the softmax. Conversely, small $H$ values ($\leq 4$) induce non-negligible errors that may accumulate at the layer level.
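The Softmax trend can be reproduced qualitatively with a small sketch. It assumes that the logits are shifted by their maximum and then clipped to $[-H, H]$ before an exact softmax is applied; this preprocessing, the example logits, and the function names are illustrative assumptions, and the spiking approximation itself is not modelled.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def clipped_softmax(z, H):
    # Assumed preprocessing: subtract the max, then clip to [-H, H].
    m = max(z)
    clipped = [min(max(v - m, -H), H) for v in z]
    return softmax(clipped)

def max_abs_error(z, H):
    # Largest per-class deviation caused purely by clipping.
    exact, approx = softmax(z), clipped_softmax(z, H)
    return max(abs(a - b) for a, b in zip(approx, exact))

logits = [2.0, 1.0, 0.0, -3.0, -6.0]   # hypothetical attention scores
for H in (3, 4, 5, 8):
    print(H, f"{max_abs_error(logits, H):.2e}")
```

For this input the clipping error shrinks as $H$ grows and vanishes once every shifted logit lies inside the interval, which matches the direction of the Softmax curve described above.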

### 6.3 Model-Level Evaluation

Table 2: Performance of a BERT model converted via the SpikeZIP ANN2SNN pipeline ($T=64$) before and after applying NLSpike.

| Operator type | MR | SST-2 | Subj | SST-5 | Avg. Acc. |
| --- | --- | --- | --- | --- | --- |
| Original | 0.881 | 0.904 | 0.943 | 0.500 | 0.807 |
| NLSpike ($\Delta$) | -0.001 | +0.006 | -0.003 | +0.009 | +0.003 |

We evaluate on two categories of large language models\. The first category comprises SNN models produced by ANN\-to\-SNN conversion pipelines, including:

- SpikeLLM (Xing et al., [2025](https://arxiv.org/html/2605.20289#bib.bib44)): the first spiking large language model obtained via ANN-to-SNN conversion.
- SpikeZIP (You et al., [2024](https://arxiv.org/html/2605.20289#bib.bib43)): a novel ANN-to-SNN conversion method used in BERT and ViT.

We use their conversion pipelines as released and perform operator replacement *only after* the conversion stage, without modifying any pipeline components. We use a LayerNorm-based NLSpike variant for the BERT model in our experiments, with activations implemented via a piecewise linear approximation.

The second category comprises standard ANN\-based LLMs that are not explicitly covered by existing ANN\-to\-SNN conversion pipelines, including LLaMA\-3\-8B, LLaMA\-2\-7B, Mistral\-7B, and Qwen3\-8B\. We collapse the temporal dimension of the division neuron into a single\-step representation and interface it directly with the ANN computation graph, while keeping the architecture, all linear layers, pretrained parameters, and inference hyperparameters unchanged, and without any retraining\. This controlled protocol isolates the impact of our operator implementations at the model level and highlights their potential as building blocks for future SNN\-based LLM designs\.

As shown in Tables [1](https://arxiv.org/html/2605.20289#S6.T1) and [2](https://arxiv.org/html/2605.20289#S6.T2), NLSpike meets its core design objective on ANN-to-SNN converted models. For SNNs produced by the SpikeLLM and SpikeZIP pipelines, we leave the conversion procedure unmodified and replace the nonlinear operators only after the conversion stage. Performance changes across all tasks are negligible, with average accuracy shifting by at most 0.003 in absolute value, demonstrating that NLSpike can be loaded as an independent operator into existing ANN2SNN pipelines while maintaining compatibility with established temporal dynamics and spike-based inference. We further evaluate NLSpike on standard LLMs without conversion, where it similarly introduces no noticeable performance degradation and even yields gains on certain reasoning tasks (e.g., +0.039 on ArcC and +0.033 on ArcE for Qwen3-8B), suggesting strong robustness.

Table 3: Latency comparison under practical neuromorphic execution models.

| Method | Nonlinear / Special-Function Calls | Data Movement | Time-Step Latency |
| --- | --- | --- | --- |
| SpikeZIP (You et al., [2024](https://arxiv.org/html/2605.20289#bib.bib43)) | $T\times$ nonlinear evaluations | $\mathcal{O}(T)$ cross-domain | 0 |
| SpikeLLM (Xing et al., [2025](https://arxiv.org/html/2605.20289#bib.bib44)) | $1\times$ nonlinear evaluation | $\mathcal{O}(1)$ cross-domain | $T$ |
| NLSpike (Ours) | $\boldsymbol{n\times}$ shift-add / LUT calls | 0 (in-core) | $T$ |

### 6.4 Latency Analysis

We provide a hardware-aware latency analysis under practical neuromorphic execution models. Digital neuromorphic hardware typically consists of neuromorphic cores and embedded processors (Davies et al., [2018](https://arxiv.org/html/2605.20289#bib.bib35); Ma et al., [2024](https://arxiv.org/html/2605.20289#bib.bib7)). Neuromorphic cores mainly support lightweight arithmetic datapaths such as LUT and shift-add operations, and are generally not optimized for floating-point nonlinear computation (Ma et al., [2024](https://arxiv.org/html/2605.20289#bib.bib7)). Consequently, nonlinear operators are often executed on external processors, introducing substantial cross-domain data movement overhead. We analyze latency from the perspective of temporal execution structures in ANN-to-SNN nonlinear operator pipelines.

SpikeLLM-style pipelines (Xing et al., [2025](https://arxiv.org/html/2605.20289#bib.bib44)). Nonlinear operators are evaluated after spike accumulation over a temporal window, followed by spike emission. This introduces a two-window execution structure for each operator. Our method follows the same temporal structure and therefore does not introduce additional time-step latency compared to this class of methods.

SpikeZIP-style pipelines (You et al., [2024](https://arxiv.org/html/2605.20289#bib.bib43)). Nonlinear operators are evaluated incrementally at every time step: $O(t)=o(t)-o(t-1)$. This requires repeated nonlinear evaluations across $T$ time steps, resulting in $\mathcal{O}(T)$ nonlinear computations.
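The incremental scheme can be illustrated with a short sketch. It assumes that $o(t)$ is the nonlinear operator applied to the input accumulated up to step $t$ and that $o(0)=0$; both assumptions, as well as the function names and the example inputs, are hypothetical and do not come from the SpikeZIP code.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def incremental_outputs(step_inputs, f):
    """Return per-step outputs O(t) = o(t) - o(t-1) and the number of f calls."""
    acc = 0.0    # accumulated input up to the current step
    prev = 0.0   # o(0), assumed to be zero in this sketch
    outputs, calls = [], 0
    for u in step_inputs:
        acc += u
        cur = f(acc)              # one nonlinear evaluation per time step
        calls += 1
        outputs.append(cur - prev)
        prev = cur
    return outputs, calls

T = 8
step_inputs = [0.25] * T          # hypothetical per-step input increments
outs, calls = incremental_outputs(step_inputs, silu)
print(calls, sum(outs), silu(sum(step_inputs)))
```

The per-step outputs telescope to the operator applied to the full accumulated input, so no extra time steps are needed, but the nonlinear function is called once per step, which is the $\mathcal{O}(T)$ evaluation cost noted above.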

As discussed by Li et al. ([2020](https://arxiv.org/html/2605.20289#bib.bib8)), practical neuromorphic deployment cost is often dominated by cross-domain data movement rather than arithmetic computation itself. In Table [3](https://arxiv.org/html/2605.20289#S6.T3), SpikeZIP-style pipelines achieve zero time-step latency, but require repeated nonlinear evaluations together with $\mathcal{O}(T)$ cross-domain communication overhead. SpikeLLM reduces nonlinear evaluations to one, but still relies on cross-domain nonlinear execution. In contrast, our method executes nonlinear operators entirely within neuromorphic cores using shift-add and LUT operations, thereby eliminating cross-domain data movement while maintaining the same temporal execution structure as SpikeLLM.

## 7 Conclusion and Limitations

This paper presents a plug\-and\-play framework that approximates key Transformer nonlinearities with LIF neuron populations, using only shift\-and\-scale operations and largely avoiding floating\-point computation, making it compatible with existing ANN\-to\-SNN pipelines\. A current limitation is the lack of end\-to\-end LLM deployment on real spiking hardware due to operator, memory, and accuracy–latency constraints\. Future work will focus on end\-to\-end deployment on real spiking hardware\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.

## References

- F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, et al. (2015). TrueNorth: design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. In Proceedings of the 2015 ACM/IEEE International Symposium on Computer Architecture (ISCA), pp. 262–273.
- Y. Cao, Y. Chen, and D. Khosla (2015). Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113(1), pp. 54–66.
- L. Chen, X. Song, A. Song, B. Chen, J. Lv, and Y. Sun (2025a). FAS: fast ann–snn conversion for spiking large language models. arXiv preprint arXiv:2502.04405. [Link](https://arxiv.org/abs/2502.04405).
- L. Chen, X. Song, and Y. Sun (2025b). LAS: loss-less ann-snn conversion for fully spike-driven large language models. arXiv:2505.09659. [Link](https://arxiv.org/abs/2505.09659).
- M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. (2018). Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), pp. 82–99.
- M. Davies, A. Wild, G. Orchard, Y. Sandamirskaya, G. A. F. Guerra, P. Joshi, P. Plank, and S. R. Risbud (2021). Advancing neuromorphic computing with loihi: a survey of results and outlook. Proceedings of the IEEE 109(5), pp. 911–934.
- P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer (2015). Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024). The llama 3 herd of models. arXiv e-prints, pp. arXiv–2407.
- M. Horowitz (2014). 1.1 computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed (2023). Mistral 7b. arXiv preprint arXiv:2310.06825. [Document](https://dx.doi.org/10.48550/arXiv.2310.06825), [Link](https://arxiv.org/abs/2310.06825).
- S. Li, S. Guo, L. Zhang, Z. Kang, S. Wang, W. Shi, L. Wang, and W. Xu (2020). SNEAP: a fast and efficient toolchain for mapping large-scale spiking neural network onto noc-based neuromorphic platform. In Proceedings of the 2020 on Great Lakes Symposium on VLSI, pp. 9–14.
- C. Lv, T. Li, J. Xu, C. Gu, Z. Ling, C. Zhang, X. Zheng, and X. Huang (2023). SpikeBERT: a language spikformer learned from bert with knowledge distillation. arXiv preprint arXiv:2308.15122. [Link](https://arxiv.org/abs/2308.15122).
- D. Ma, X. Jin, S. Sun, Y. Li, X. Wu, Y. Hu, F. Yang, H. Tang, X. Zhu, P. Lin, et al. (2024). Darwin3: a large-scale neuromorphic chip with a novel isa and on-chip learning. National Science Review 11(5), pp. nwae102.
- M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016). Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
- K. Roy, A. Jaiswal, and P. Panda (2019). Towards spike-based machine intelligence with neuromorphic computing. Nature 575(7784), pp. 607–617.
- B. Rueckauer, I. Lungu, Y. Hu, M. Pfeiffer, and S. Liu (2017). Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience 11, pp. 682.
- A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy (2019). Going deeper in spiking neural networks: vgg and residual architectures. Frontiers in Neuroscience 13, pp. 95.
- X. Shi, Z. Hao, and Z. Yu (2024). SpikingResformer: bridging resnet and vision transformer in spiking neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2403.14302. [Link](https://arxiv.org/abs/2403.14302).
- K. Tang, Z. Yan, and W. Wong (2025). Sorbet: a neuromorphic hardware-compatible transformer-based spiking language model. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=5dFJukfj4y).
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- J. E. Volder (1959). The cordic trigonometric computing technique. IRE Transactions on Electronic Computers EC-8(3), pp. 330–334. [Document](https://dx.doi.org/10.1109/TEC.1959.5222693).
- X. Xing, B. Gao, Z. Liu, D. A. Clifton, S. Xiao, W. Zhang, L. Du, Z. Zhang, G. Li, and J. Zhang (2025). SpikeLLM: scaling up spiking neural network to large language models via saliency-based spiking. [Link](https://openreview.net/forum?id=ZadnlOHsHv).
- Z. Yan, Z. Bai, and W. Wong (2024). Reconsidering the energy efficiency of spiking neural networks. arXiv preprint arXiv:2409.08290.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- K. You, Z. Xu, C. Nie, Z. Deng, X. Wang, Q. Guo, and Z. He (2024). SpikeZIP-tf: conversion is all you need for transformer-based snn. In Forty-first International Conference on Machine Learning (ICML).
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016). Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
- Z. Zhou, Y. Zhu, C. He, Y. Wang, S. Yan, Y. Tian, and L. Yuan (2023). Spikformer: when spiking neural network meets transformer. In Proceedings of the International Conference on Learning Representations (ICLR). arXiv:2209.15425. [Link](https://arxiv.org/abs/2209.15425).
- R. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian (2023). SpikeGPT: generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939. [Link](https://arxiv.org/abs/2302.13939).

## Appendix A Proof of Theorem 1

Based on the method introduced in the previous section, we have derived a set of spike\-compatible NLS\-functions intended for application in the forward propagation of the spike\-LLM\. We now estimate the approximation errors of the three NLS\-functions used in this work: NLS\-Softmax, NLS\-SiLU, and NLS\-RMSNorm\. First, we need some necessary lemmas\.

###### Lemma A.1 (PWL-Exp Unit Relative Error Bound).

Let $\tilde{e}^{x}$ be the piecewise-linear approximation to $e^{x}$ on $[-H,H]$ obtained by dividing the interval into $K$ equal subintervals of width

$$h=\frac{2H}{K}$$

and interpolating linearly on each. Then for every $x\in[-H,H]$,

$$\left|\frac{\tilde{e}^{x}-e^{x}}{e^{x}}\right|\;\leq\;\frac{h^{2}}{8}\,e^{h}\;=\;\frac{\bigl(\tfrac{2H}{K}\bigr)^{2}}{8}\,e^{2H/K}.$$

In particular, when $H=5$ and $K=64$, we have

$$\left|\frac{\tilde{e}^{x}-e^{x}}{e^{x}}\right|\;\leq\;\frac{h^{2}}{8}\,e^{h}\;=\;\frac{\bigl(\tfrac{5}{32}\bigr)^{2}}{8}\,e^{5/32}\;\approx\;3.63\times 10^{-3}.$$
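The lemma can be checked numerically with a short sketch that builds the interpolant and compares the observed relative error on a dense grid against $\tfrac{h^{2}}{8}e^{h}$. The function names and the grid resolution are illustrative choices, and the sketch uses floating point rather than the fixed-point arithmetic of the actual PWL-Exp unit.

```python
import math

def pwl_exp(x, H=5.0, K=64):
    """Piecewise-linear interpolant of exp on [-H, H] with K equal segments."""
    h = 2.0 * H / K
    k = min(max(int((x + H) / h), 0), K - 1)   # index of the segment containing x
    x0 = -H + k * h
    y0, y1 = math.exp(x0), math.exp(x0 + h)
    return y0 + (y1 - y0) * (x - x0) / h

def rel_error_bound(H=5.0, K=64):
    """Right-hand side of the lemma: h^2 / 8 * e^h with h = 2H / K."""
    h = 2.0 * H / K
    return h * h / 8.0 * math.exp(h)

def max_rel_error(H=5.0, K=64, n=20001):
    """Largest observed relative error on a uniform grid over [-H, H]."""
    worst = 0.0
    for i in range(n):
        x = -H + 2.0 * H * i / (n - 1)
        worst = max(worst, abs(pwl_exp(x, H, K) - math.exp(x)) / math.exp(x))
    return worst

print(f"observed = {max_rel_error():.3e}, bound = {rel_error_bound():.3e}")
```

With the defaults $H=5$ and $K=64$, evaluating the closed-form expression directly gives about $3.57\times 10^{-3}$, marginally below the $3.63\times 10^{-3}$ quoted in the lemma, so the quoted value still holds as a slightly looser upper bound; the error observed on the grid is smaller again.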

###### Lemma A.2 (Pairwise CORDIC Relative Error Bound; Volder, [1959](https://arxiv.org/html/2605.20289#bib.bib2)).

Let $a,b\in\mathbb{R}$ and $r=\sqrt{a^{2}+b^{2}}$. Perform $n$ steps of CORDIC in vectoring mode with angles $\phi_{i}=\arctan(2^{-i})$ for $i=0,1,\dots,n-1$, and apply the exact scale correction

$$K_{n}^{-1}\;=\;\prod_{i=0}^{n-1}\frac{1}{\sqrt{1+2^{-2i}}}.$$

If $\tilde{r}$ denotes the CORDIC output after correction, then

$$\frac{\bigl|\tilde{r}-r\bigr|}{r}\;\leq\;2^{-2n-1}.$$
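A minimal floating-point sketch of vectoring-mode CORDIC shows how the radius is obtained from shift-and-add style updates followed by the scale correction $K_{n}^{-1}$. Multiplication by $2^{-i}$ stands in for a right shift in fixed point, inputs are folded into the first quadrant (an assumption of this sketch), and `cordic_norm` is a hypothetical name rather than part of the paper's implementation.

```python
import math

def cordic_norm(a, b, n=12):
    """Approximate sqrt(a^2 + b^2) with n vectoring-mode CORDIC steps."""
    x, y = abs(a), abs(b)          # fold into the first quadrant; the norm is unchanged
    for i in range(n):
        shift = 2.0 ** -i          # stands in for a right shift by i bits
        if y >= 0:                 # rotate clockwise to drive y toward zero
            x, y = x + y * shift, y - x * shift
        else:                      # rotate counter-clockwise
            x, y = x - y * shift, y + x * shift
    # Exact scale correction K_n^{-1} = prod_i 1 / sqrt(1 + 2^{-2i})
    gain_inv = 1.0
    for i in range(n):
        gain_inv /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x * gain_inv

for a, b in [(3, 4), (1, 0), (0, 2), (-7, 24)]:
    exact = math.hypot(a, b)
    approx = cordic_norm(a, b)
    print(a, b, f"rel_err={abs(approx - exact) / exact:.2e}")
```

Each rotation stretches the vector by $\sqrt{1+2^{-2i}}$, which is why the accumulated gain has to be divided out at the end. The accompanying check uses a deliberately loose tolerance rather than the constant stated in the lemma.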

###### Lemma A.3 (Binary-Tree CORDIC Accumulation Error).

To compute

$$R=\sqrt{\sum_{j=1}^{D}v_{j}^{2}+\varepsilon\,D},$$

group the $D$ values into $\lceil D/2\rceil$ pairs, apply Lemma [A.2](https://arxiv.org/html/2605.20289#A1.Thmtheorem2) to each pair (using the same $n$-step CORDIC), and then recursively merge the resulting radii in a balanced binary tree of height $\ell=\lceil\log_{2}D\rceil$. Denote the final CORDIC result by $\tilde{R}$. Then

$$\frac{\bigl|\tilde{R}-R\bigr|}{R}\;\leq\;\ell\cdot 2^{-2n-1}.$$

After discussing the error bounds of PWL\-Exp Unit and PolarNorm Unit, now we can focus on the errors of NLS\-Softmax, NLS\-SiLU, and NLS\-RMSNorm\.

###### Proof of Lemma [A.3](https://arxiv.org/html/2605.20289#A1.Thmtheorem3).

Let the pairwise CORDIC relative error bound be $\delta:=2^{-2n-1}$. We analyze the error propagation layer by layer through the CORDIC binary tree.

Base layer (Layer 1): Group the $D$ inputs into $\lceil D/2\rceil$ pairs. Apply Lemma [A.2](https://arxiv.org/html/2605.20289#A1.Thmtheorem2) to each pair, yielding approximate radii $\tilde{r}^{(1)}_{k}$ with relative error $\leq\delta$.

Inductive step (Layer $t$): Assume inputs to layer $t$ have maximum relative error $\leq(t-1)\delta$. Merging two such values gives

$$\tilde{r}^{(t)}=\sqrt{(\tilde{r}_{1}^{(t-1)})^{2}+(\tilde{r}_{2}^{(t-1)})^{2}}\cdot(1+e_{\text{CORDIC}}),$$

where $|e_{\text{CORDIC}}|\leq\delta$ (by Lemma [A.2](https://arxiv.org/html/2605.20289#A1.Thmtheorem2)).

Each $\tilde{r}^{(t-1)}$ differs from its true value $r^{(t-1)}$ by at most $(t-1)\delta$, so the total error in computing $\tilde{r}^{(t)}$ is at most

$$|e^{(t)}|\leq(t-1)\delta+\delta=t\delta.$$

Final result. The CORDIC reduction forms a complete binary tree whose leaves are the $D$ original vectors. At every layer the number of active nodes is at most halved (merging each pair into one). After $t$ layers the node count is therefore at most $D/2^{\,t}$. We need enough layers so that only one node remains:

$$\frac{D}{2^{\,\ell}}\;\leq\;1\quad\Longrightarrow\quad\ell\;\geq\;\log_{2}D.$$

Taking the smallest integer that meets the inequality yields

$$\ell=\lceil\log_{2}D\rceil.$$

Consequently, after $\ell$ layers the relative error of the final radius satisfies

$$\frac{|\tilde{R}-R|}{R}\;\leq\;\ell\,\delta\;=\;\ell\,2^{-2n-1}.\qquad\blacksquare$$
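The accumulation argument can be mimicked with a small sketch in which every pairwise merge is an exact `math.hypot` scaled by $(1+\delta)$, a stand-in for the worst-case CORDIC error of a single unit. The odd-element carry rule, the omission of the $\varepsilon D$ term, and the name `tree_norm` are assumptions of this illustration; the input list is assumed non-empty.

```python
import math

def tree_norm(values, delta=0.0):
    """Balanced pairwise reduction; each merge is scaled by (1 + delta)."""
    level = [abs(v) for v in values]
    height = 0
    while len(level) > 1:
        nxt = [math.hypot(level[i], level[i + 1]) * (1.0 + delta)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
        height += 1
    return level[0], height

values = [((i * 37) % 11) - 5 for i in range(100)]   # hypothetical inputs, D = 100
exact = math.sqrt(sum(v * v for v in values))
delta = 2.0 ** -25                                    # 2^{-2n-1} for n = 12
approx, height = tree_norm(values, delta)
print(height, (approx - exact) / exact, height * delta)
```

Because every leaf passes through at most $\ell$ merges, the simulated result is at most $R(1+\delta)^{\ell}$, which agrees with the $\ell\,\delta$ bound of the lemma up to terms of order $\delta^{2}$.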

###### Theorem A.4 (Error Bound of NLS-Softmax).

Let $\mathbf{z}=(z_{1},\dots,z_{d})$ be the shifted and clipped logits,

$$-H\;\leq\;z_{i}\;\leq\;H,\qquad i=1,\dots,d,$$

and write

$$p_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{d}e^{z_{j}}}\quad\text{for the exact softmax, and}\quad\tilde{p}_{i}=\frac{\tilde{e}^{\,z_{i}}}{\sum_{j=1}^{d}\tilde{e}^{\,z_{j}}}+\delta_{i}$$

for the NLS-Softmax output, where

- $\tilde{e}^{\,\cdot}$ is the PWL-Exp approximation of Lemma A.1, whose relative error satisfies $\lvert\tilde{e}^{x}-e^{x}\rvert\leq\varepsilon_{\exp}e^{x}$ with $\varepsilon_{\exp}=\frac{\bigl(\tfrac{2H}{K}\bigr)^{2}}{8}\,e^{2H/K}$;
- the final reciprocal is produced by a $(T,L)$-Division neuron whose quantisation step is $\Delta=\frac{1}{TL}$, so that $\lvert\delta_{i}\rvert\leq\Delta p_{i}$ for every $i$.

Then, for every class $i$,

$$\boxed{\;\frac{\lvert\tilde{p}_{i}-p_{i}\rvert}{p_{i}}\;\leq\;\frac{2}{1-\varepsilon_{\exp}}\bigl(\varepsilon_{\exp}+\Delta\bigr).\;}$$

###### Proof.

Step 1: bound the exponential approximations. By Lemma A.1,

$$(1-\varepsilon_{\exp})e^{z_{i}}\;\leq\;\tilde{e}^{\,z_{i}}\;\leq\;(1+\varepsilon_{\exp})e^{z_{i}},\qquad i=1,\dots,d.\tag{13}$$

Step 2: bound the denominator. Summing ([13](https://arxiv.org/html/2605.20289#A1.E13)) yields

$$(1-\varepsilon_{\exp})S\;\leq\;\tilde{S}:=\sum_{j}\tilde{e}^{\,z_{j}}\;\leq\;(1+\varepsilon_{\exp})S,\qquad S=\sum_{j}e^{z_{j}}.$$

Step 3: ratio perturbation. Define the "ideal" NLS-Softmax without quantisation, $\hat{p}_{i}=\tilde{e}^{\,z_{i}}/\tilde{S}$. We use the standard perturbation inequality: for positive $a,b$ and errors $\lvert\delta a\rvert\leq\varepsilon a$, $\lvert\delta b\rvert\leq\varepsilon b$, one has $\bigl|(a+\delta a)/(b+\delta b)-a/b\bigr|\leq 2\varepsilon\,(a/b)\,(1-\varepsilon)^{-1}$. Together with ([13](https://arxiv.org/html/2605.20289#A1.E13)), this gives

$$\frac{\lvert\hat{p}_{i}-p_{i}\rvert}{p_{i}}\;\leq\;\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}.$$

Step 4: add the division-neuron quantisation. The Division neuron introduces an extra additive error $\lvert\delta_{i}\rvert\leq\Delta\,p_{i}$ (as assumed in the theorem statement), so

$$\lvert\tilde{p}_{i}-p_{i}\rvert\;\leq\;\lvert\tilde{p}_{i}-\hat{p}_{i}\rvert+\lvert\hat{p}_{i}-p_{i}\rvert\;\leq\;\Bigl(\Delta+\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}\Bigr)\,p_{i}.$$

Dividing both sides by $p_{i}$ and noting that $\Delta\leq 2\Delta/(1-\varepsilon_{\exp})$ proves the claimed bound. ∎
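Step 3 of the argument can be checked empirically with a sketch that replaces the exponential by a piecewise-linear interpolant and compares the resulting softmax against the exact one. Division-neuron quantisation is deliberately left out, so only the ratio-perturbation term $2\varepsilon_{\exp}/(1-\varepsilon_{\exp})$ is tested; the logits and helper names are hypothetical.

```python
import math

def pwl_exp(x, H=5.0, K=64):
    # Piecewise-linear interpolant of exp on [-H, H] with K equal segments.
    h = 2.0 * H / K
    k = min(max(int((x + H) / h), 0), K - 1)
    x0 = -H + k * h
    y0, y1 = math.exp(x0), math.exp(x0 + h)
    return y0 + (y1 - y0) * (x - x0) / h

def softmax_with(exp_fn, z):
    # Softmax built on top of an arbitrary exponential routine.
    e = [exp_fn(v) for v in z]
    s = sum(e)
    return [v / s for v in e]

def worst_relative_gap(z):
    # max_i |p_hat_i - p_i| / p_i for the quantisation-free NLS-style softmax.
    p = softmax_with(math.exp, z)
    p_hat = softmax_with(pwl_exp, z)
    return max(abs(a - b) / b for a, b in zip(p_hat, p))

H, K = 5.0, 64
eps_exp = (2 * H / K) ** 2 / 8 * math.exp(2 * H / K)
step3_bound = 2 * eps_exp / (1 - eps_exp)

z = [-5.0, -3.2, -1.1, 0.0, 0.7, 2.4, 5.0]   # hypothetical clipped logits
print(worst_relative_gap(z), step3_bound)
```

The observed gap stays below the Step 3 bound because every numerator and the denominator are perturbed by at most a factor $(1\pm\varepsilon_{\exp})$, which is exactly the situation covered by the perturbation inequality.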

###### Theorem A.5 (Error Bound of NLS-SiLU).

Assume the input is clipped to $x\in[-5,5]$ before NLS-SiLU is applied. Let

$$f(x)=\operatorname{SiLU}(x)=\frac{x}{1+e^{-x}},\qquad\tilde{f}(x)=x\;\tilde{\sigma}(x)+\delta_{\text{mul}},$$

where

- $\tilde{\sigma}(x)=\dfrac{1}{1+\tilde{e}^{-x}}+\delta_{\text{div}}$ is obtained from the PWL-Exp approximation $\tilde{e}^{\,\cdot}$ of Lemma A.1 (relative error $\varepsilon_{\exp}=3.63\times 10^{-3}$) followed by a $(T,L)$-Division neuron whose quantisation step is $\Delta=1/(TL)$;
- $\delta_{\text{div}}$ and $\delta_{\text{mul}}$ are, respectively, the additive errors introduced by the Division neuron and by the spike-time multiplication, both satisfying $\lvert\delta_{\text{div}}\rvert,\lvert\delta_{\text{mul}}\rvert\leq\Delta$.

Then, for every $x\in[-5,5]$,

$$\boxed{\;\bigl|\tilde{f}(x)-f(x)\bigr|\;\leq\;|x|\,\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}\;+\;|x|\,\Delta\;}$$

and in particular, with $(T,L)=(16,256)$ so that $\Delta=2^{-12}\approx 2.44\times 10^{-4}$,

$$\bigl|\tilde{f}(x)-f(x)\bigr|\;\leq\;0.038\qquad\text{for all }x\in[-5,5].$$

###### Proof.

1\. Bounding the sigmoid approximation. Because $\sigma(x)=1/(1+e^{-x})$ is Lipschitz-continuous on $[-5,5]$ with Lipschitz constant at most $1/4$, the PWL-Exp error of Lemma A.1 implies

$$\bigl|\hat{\sigma}(x)-\sigma(x)\bigr|=\left|\frac{1}{1+\tilde{e}^{-x}}-\frac{1}{1+e^{-x}}\right|\leq\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}\;\sigma(x),$$

where $\hat{\sigma}(x)=1/(1+\tilde{e}^{-x})$ is the *ideal* reciprocal without quantisation.

2\. Adding division-neuron quantisation. The Division neuron produces $\tilde{\sigma}(x)=\hat{\sigma}(x)+\delta_{\text{div}}$ with $|\delta_{\text{div}}|\leq\Delta$, hence

$$\bigl|\tilde{\sigma}(x)-\sigma(x)\bigr|\leq\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}\;\sigma(x)+\Delta.\tag{14}$$

Therefore, using ([14](https://arxiv.org/html/2605.20289#A1.E14)) and $\sigma(x)\leq 1$,

$$\bigl|\tilde{f}(x)-f(x)\bigr|\leq|x|\,\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}+|x|\,\Delta,$$

which is the claimed bound. For $(T,L)=(16,256)$ we substitute $\varepsilon_{\exp}=3.63\times 10^{-3}$ and $\Delta\approx 2.44\times 10^{-4}$ to obtain the numerical value. ∎
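As a worked evaluation of the stated formula (not an additional result), the numerical constant can be verified at the endpoint $|x|=5$, where the bound is largest, using the quoted values $\varepsilon_{\exp}=3.63\times 10^{-3}$ and $\Delta=2^{-12}\approx 2.44\times 10^{-4}$:

$$
\frac{2\varepsilon_{\exp}}{1-\varepsilon_{\exp}}=\frac{7.26\times 10^{-3}}{0.99637}\approx 7.29\times 10^{-3},
\qquad
5\,\bigl(7.29\times 10^{-3}+2.44\times 10^{-4}\bigr)\approx 5\times 7.53\times 10^{-3}\approx 0.0377\;\leq\;0.038 .
$$

Since the bound is linear in $|x|$, every other point of $[-5,5]$ gives a smaller value.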

###### Theorem A\.6\(Error Bound of NLS‐RMSNorm\)\.

Let $\mathbf{x}=(x_{1},\dots,x_{d})\in\mathbb{R}^{d}$ and define

$$
R=\sqrt{\frac{1}{d}\sum_{j=1}^{d}x_{j}^{2}+\varepsilon},\qquad y_{i}=\frac{x_{i}}{R},\;\;i=1,\dots,d.
$$

Denote by $\tilde{R}$ the output of the PolarNorm unit that uses an $n$-step CORDIC in a balanced binary tree of height $\ell=\lceil\log_{2}d\rceil$, and let

$$
\varepsilon_{\text{pol}}:=\ell\,2^{-2n-1}\quad\text{so that}\quad\Bigl|\tfrac{\tilde{R}-R}{R}\Bigr|\leq\varepsilon_{\text{pol}}
$$

(Lemma [A.3](https://arxiv.org/html/2605.20289#A1.Thmtheorem3)). Next, let the $(T,L)$-Division neuron produce

$$
\tilde{q}=\frac{1}{\tilde{R}}+\delta_{\text{div}}
$$

with quantisation step

$$
\Delta:=\frac{1}{TL}\quad(\text{or}\;|\delta_{\text{div}}|_{\max}),
$$

and finally obtain the NLS-RMSNorm output

$$
\tilde{y}_{i}=x_{i}\tilde{q}+\delta_{\text{mul}},\qquad|\delta_{\text{mul}}|\leq\Delta.
$$

Then, for each coordinate $i=1,\dots,d$, we have

$$
\boxed{\;\frac{\bigl|\tilde{y}_{i}-y_{i}\bigr|}{|y_{i}|}\;\leq\;\frac{\varepsilon_{\text{pol}}+\Delta}{1-\varepsilon_{\text{pol}}}\;+\;\frac{\Delta}{|x_{i}|}R\;}
$$

and, whenever $|x_{i}|\geq\sqrt{\varepsilon}$ (the usual case in practice), the last term satisfies

$$
\frac{\Delta}{|x_{i}|}R\leq\sqrt{d}\cdot\Delta.
$$
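To get a feel for the magnitudes in the theorem, the following sketch evaluates $\varepsilon_{\text{pol}}$, $\Delta$, the first term of the boxed bound, and the stated cap $\sqrt{d}\cdot\Delta$ on the last term. The helper name `rmsnorm_bound_terms` and the values $d=4096$ and $n=8$ are hypothetical choices for illustration; $(T,L)=(256,16)$ is the setting quoted earlier in this appendix.

```python
import math


def rmsnorm_bound_terms(d, n, T, L):
    """Evaluate the constants in the NLS-RMSNorm relative-error bound."""
    ell = (d - 1).bit_length()            # equals ceil(log2(d)) for d >= 1
    eps_pol = ell * 2.0 ** (-2 * n - 1)   # CORDIC tree error from the theorem
    delta = 1.0 / (T * L)                 # Division-neuron quantisation step
    first_term = (eps_pol + delta) / (1.0 - eps_pol)
    last_term_cap = math.sqrt(d) * delta  # cap stated for |x_i| >= sqrt(eps)
    return eps_pol, delta, first_term, last_term_cap


# Hypothetical setting: d and n are illustrative, not taken from the paper.
eps_pol, delta, first_term, last_term_cap = rmsnorm_bound_terms(
    d=4096, n=8, T=256, L=16
)
print(f"eps_pol = {eps_pol:.3e}, delta = {delta:.3e}")
print(f"first term = {first_term:.3e}, last-term cap = {last_term_cap:.3e}")
```

With these illustrative values the $\sqrt{d}\cdot\Delta$ cap is the larger contribution, which suggests that the Division-neuron resolution, rather than the CORDIC depth, would set the error budget at large $d$.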

###### Proof\.

We decompose the total error as

$$
\tilde{y}_{i}-y_{i}=x_{i}\bigl(\tilde{q}-\frac{1}{R}\bigr)=x_{i}\Bigl(\frac{1}{\tilde{R}}-\frac{1}{R}\Bigr)+x_{i}\delta_{\text{div}}.
$$

(i) Reciprocal of $\tilde{R}$: Using the approximation

$$
\tilde{R}=R(1+\eta),\quad|\eta|\leq\varepsilon_{\text{pol}},
$$

we have

$$
\left|\frac{1}{\tilde{R}}-\frac{1}{R}\right|=\frac{|\eta|}{1+\eta}\cdot\frac{1}{R}\leq\frac{\varepsilon_{\text{pol}}}{1-\varepsilon_{\text{pol}}}\cdot\frac{1}{R}.
$$

The inequality holds because $1+\eta\geq 1-\varepsilon_{\text{pol}}>0$, so the error is controlled by the CORDIC approximation precision.

(ii) Division-neuron quantisation: Since the quantisation error is bounded by

$$
|x_{i}\delta_{\text{div}}|\leq|x_{i}|\Delta,
$$

we can combine these terms to get:

$$
|\tilde{y}_{i}-y_{i}|\leq|x_{i}|\cdot\frac{\varepsilon_{\text{pol}}}{1-\varepsilon_{\text{pol}}}\cdot\frac{1}{R}+|x_{i}|\Delta.
$$

Dividing both sides by $|y_{i}|=\frac{|x_{i}|}{R}$ gives:

$$
\frac{|\tilde{y}_{i}-y_{i}|}{|y_{i}|}\leq\frac{\varepsilon_{\text{pol}}}{1-\varepsilon_{\text{pol}}}+\frac{\Delta}{|x_{i}|}R.
$$
Thus, we obtain the final error bound:

$$
\frac{|\tilde{y}_{i}-y_{i}|}{|y_{i}|}\leq\frac{\varepsilon_{\text{pol}}+\Delta}{1-\varepsilon_{\text{pol}}}+\frac{\Delta}{|x_{i}|}R.
$$

Finally, when $|x_{i}|$ is sufficiently large (i.e., $|x_{i}|\geq\sqrt{\varepsilon}$), we can bound the term:

$$
\frac{\Delta}{|x_{i}|}R\leq\sqrt{d}\cdot\Delta,
$$

because $R$ is bounded by a factor of $\sqrt{d}$, as derived from the sum over all $x_{j}$. ∎
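Step (i) of the proof can be sanity-checked numerically. The sketch below draws random values of $R$ and of the relative perturbation $\eta$ with $|\eta|\leq\varepsilon_{\text{pol}}$ and counts how often the reciprocal gap exceeds $\frac{\varepsilon_{\text{pol}}}{1-\varepsilon_{\text{pol}}}\cdot\frac{1}{R}$. The sampling ranges, the value of $\varepsilon_{\text{pol}}$, and the helper names are assumptions made only for this illustration; it checks step (i) alone, not the full theorem.

```python
import random


def reciprocal_gap(R, eta):
    """|1/R_tilde - 1/R| for the perturbed norm R_tilde = R * (1 + eta)."""
    R_tilde = R * (1.0 + eta)
    return abs(1.0 / R_tilde - 1.0 / R)


def reciprocal_bound(R, eps_pol):
    """Right-hand side of step (i): eps_pol / (1 - eps_pol) * 1/R."""
    return eps_pol / (1.0 - eps_pol) / R


random.seed(0)
eps_pol = 12 * 2.0 ** -17   # illustrative value (ell = 12, n = 8)
violations = 0
for _ in range(10_000):
    R = random.uniform(1e-3, 1e3)
    eta = random.uniform(-eps_pol, eps_pol)
    # A tiny relative slack absorbs floating-point rounding in the subtraction.
    if reciprocal_gap(R, eta) > reciprocal_bound(R, eps_pol) * (1.0 + 1e-9):
        violations += 1

print("violations:", violations)
```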

## Appendix B More Experiments

### B.1 Function-Level Evaluation

In addition to the errors reported in the main text, Figure [5](https://arxiv.org/html/2605.20289#A2.F5) presents the errors for $d=8, 16, 32, 64, 128,$ and $256$. The results show that NLSpike maintains its advantage over the baselines at every dimension tested.

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/appendix/softmax_error.png)
(a) Operator-level errors for Softmax approximations under 8-bit quantization.

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/appendix/rms_error.png)
(b) Operator-level errors for RMSNorm approximations under 8-bit quantization.

Figure 5: Operator-level errors under 8-bit quantization. Error bars indicate the gap between mean and maximum absolute error. ES-Softmax achieves the lowest mean error across dimensions while keeping bounded maximum error under integer-only implementation, and ES-RMS yields lower mean errors than blockwise and Sorbet baselines with stable performance across dimensions.

### B.2 Module-Level Evaluation

We take LLaMA3\-QCFS \(with 8\-bit weights and activations, 8B scale\) as the base model and evaluate three different conversion strategies: \(1\) a direct ANN\-to\-SNN conversion without approximation, \(2\) conversion using the NLS framework in time\-independent form \(NLS\), and \(3\) conversion using the time\-dependent form \(NLS\-TDF\)\. For evaluation, we randomly sample 1,000 tasks from the MMLU dataset and run inference using each of the three converted models to assess their module\-level performance under different design configurations\.

![Refer to caption](https://arxiv.org/html/2605.20289v1/figure/paper/moudle_error.png)
Figure 6: Module-level approximation error on the MMLU dataset.

Figure [6](https://arxiv.org/html/2605.20289#A2.F6) presents the mean $L_{2}$ error on 1,000 MMLU samples across different time steps $T\in\{2,4,8\}$. The results show that both NLS and NLS-TDF maintain approximation errors comparable to the direct ANN-to-SNN conversion baseline. Notably, as $T$ increases, the gap among all three configurations narrows, suggesting that spike accumulation compensates for early approximation gaps. These results validate the functional fidelity of our NLS-based modules under typical SNN execution settings.
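The text does not spell out how the mean $L_{2}$ error is computed. One plausible reading, shown below purely as an assumption, is the Euclidean distance between a module's reference output and its approximated output, averaged over samples. The function name `mean_l2_error` and the toy vectors are hypothetical.

```python
import math


def mean_l2_error(reference, approx):
    """Average over samples of ||approx_i - reference_i||_2 (assumed metric)."""
    assert len(reference) == len(approx), "need one approximation per sample"
    total = 0.0
    for ref_vec, app_vec in zip(reference, approx):
        total += math.sqrt(sum((a - r) ** 2 for r, a in zip(ref_vec, app_vec)))
    return total / len(reference)


# Toy data: the first sample matches exactly, the second is off by (3, 4, 0).
reference = [[1.0, 2.0, 2.0], [0.0, 0.0, 0.0]]
approx = [[1.0, 2.0, 2.0], [3.0, 4.0, 0.0]]
print(mean_l2_error(reference, approx))  # (0 + 5) / 2 = 2.5
```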

### B.3 Model-Level Evaluation

To assess the scalability and effectiveness of our framework at the full model level, we conduct evaluations on LLaMA2-7B, LLaMA3-8B, and LLaMA3-70B. Each model is first converted into spiking form via ANN-to-SNN conversion under two quantization settings: W6A6 and W8A8. For each setting, we compare two conversion variants: a baseline SNN without approximation and an SNN with NLSpike. All evaluations are performed at time steps $T\in\{1,2,4\}$ to capture both low-latency and high-accuracy behaviors.

Table [4](https://arxiv.org/html/2605.20289#A2.T4) summarizes the end-to-end performance of our ES-converted models on five representative language understanding tasks: WinoGrande, HellaSwag, ARC-Challenge, ARC-Easy, and PIQA. We evaluate both standard SNN baselines and ES-based conversions on LLaMA2-7B and LLaMA3-8B under W6A6 and W8A8 quantization settings, across time steps $T\in\{1,2,4\}$, and on LLaMA3-70B under W8A8 across time steps $T\in\{1,2\}$.

As shown, NLSpike models achieve accuracy comparable to direct SNN conversions across most benchmarks. At low time steps ($T=1$), NLSpike retains competitive accuracy while remaining fully compatible with integer-only execution. As $T$ increases, the performance gap among methods narrows, confirming the stability of ES approximations under deeper spike integration.

Table 4: Performance on LLaMA Models. We report `acc` for WinoGrande and `acc_norm` for HellaSwag, ArcE, and PIQA.

| Model | Method | T | Precision | WinoGrande | HellaSwag | ArcC | ArcE | PIQA | Avg. Acc. |
|---|---|---|---|---|---|---|---|---|---|
| Llama 2 7B | PrefixQ | - | W6A6/W8A8 | 70.09 / 70.24 | 74.06 / 76.44 | 44.80 / 45.99 | 73.11 / 73.02 | 77.15 / 78.35 | 67.84 / 68.81 |
| | DuQ | - | W6A6/W8A8 | 67.88 / 66.69 | 72.64 / 72.81 | 40.53 / 40.36 | 53.07 / 53.37 | 77.15 / 77.20 | 62.25 / 62.09 |
| | SNN | 1 | W6A6/W8A8 | 70.09 / 70.24 | 74.06 / 76.44 | 44.80 / 45.99 | 73.11 / 73.02 | 77.15 / 78.35 | 67.84 / 68.81 |
| | NLS | 1 | W6A6/W8A8 | 67.48 / 68.35 | 73.87 / 76.23 | 44.71 / 46.50 | 73.32 / 73.86 | 76.44 / 78.62 | 67.16 / 68.71 |
| | SNN | 2 | W6A6/W8A8 | 69.06 / 69.93 | 74.23 / 76.45 | 44.88 / 45.90 | 72.98 / 72.90 | 75.68 / 78.45 | 67.37 / 68.73 |
| | NLS | 2 | W6A6/W8A8 | 69.53 / 69.61 | 74.17 / 76.46 | 44.45 / 46.33 | 73.06 / 73.11 | 76.77 / 78.35 | 67.60 / 68.77 |
| | SNN | 4 | W6A6/W8A8 | 69.46 / 69.85 | 74.11 / 76.59 | 45.14 / 46.16 | 72.98 / 73.40 | 76.82 / 78.24 | 67.70 / 68.85 |
| | NLS | 4 | W6A6/W8A8 | 68.27 / 70.01 | 73.39 / 76.34 | 43.34 / 46.33 | 72.77 / 73.95 | 76.66 / 78.51 | 66.89 / 69.03 |
| Llama 3 8B | PrefixQ | - | W6A6/W8A8 | 71.82 / 73.09 | 77.61 / 78.96 | 50.94 / 53.75 | 75.59 / 77.99 | 77.69 / 80.47 | 70.73 / 72.85 |
| | DuQ | - | W6A6/W8A8 | 67.88 / 73.56 | 72.64 / 79.07 | 40.53 / 53.24 | 53.07 / 77.95 | 77.15 / 80.25 | 62.25 / 72.81 |
| | SNN | 1 | W6A6/W8A8 | 71.82 / 73.09 | 77.61 / 78.96 | 50.94 / 53.75 | 75.59 / 77.99 | 77.69 / 80.47 | 70.73 / 72.85 |
| | NLS | 1 | W6A6/W8A8 | 74.11 / 73.72 | 77.40 / 79.04 | 49.40 / 53.41 | 75.59 / 77.65 | 77.75 / 80.36 | 70.85 / 72.84 |
| | SNN | 2 | W6A6/W8A8 | 71.82 / 73.16 | 77.75 / 79.01 | 47.27 / 53.75 | 75.21 / 77.86 | 75.84 / 79.98 | 69.58 / 72.75 |
| | NLS | 2 | W6A6/W8A8 | 72.22 / 73.24 | 77.58 / 78.83 | 48.89 / 53.50 | 75.63 / 77.57 | 77.86 / 80.03 | 70.44 / 72.63 |
| | SNN | 4 | W6A6/W8A8 | 70.40 / 73.32 | 77.65 / 78.91 | 48.98 / 53.58 | 74.33 / 80.43 | 75.90 / 79.98 | 69.45 / 73.24 |
| | NLS | 4 | W6A6/W8A8 | 71.19 / 73.09 | 77.34 / 78.81 | 48.81 / 53.75 | 74.37 / 77.90 | 76.77 / 80.30 | 69.69 / 72.77 |
| Llama 3 70B | PrefixQ | - | W8A8 | 79.32 | 85.65 | 62.37 | 82.79 | 84.11 | 78.85 |
| | DuQ | - | W8A8 | 80.82 | 84.83 | 63.48 | 85.73 | 84.39 | 79.85 |
| | SNN | 1 | W8A8 | 79.32 | 85.65 | 62.37 | 82.79 | 84.11 | 78.85 |
| | NLS | 1 | W8A8 | 78.85 | 85.71 | 62.54 | 82.20 | 83.90 | 78.64 |
| | SNN | 2 | W8A8 | 79.48 | 85.70 | 62.88 | 82.87 | 83.90 | 78.97 |
| | NLS | 2 | W8A8 | 79.08 | 85.60 | 62.88 | 82.62 | 83.90 | 78.82 |
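To make the comparison in Table 4 concrete, the sketch below copies the Avg. Acc. entries for the SNN baseline and for NLS, then reports the largest drop in average accuracy (in percentage points) caused by switching to NLS. The dictionary layout and variable names are illustrative; only the numbers come from the table.

```python
# (model, precision, T) -> (SNN avg. acc., NLS avg. acc.), copied from Table 4.
avg_acc = {
    ("Llama 2 7B", "W6A6", 1): (67.84, 67.16),
    ("Llama 2 7B", "W8A8", 1): (68.81, 68.71),
    ("Llama 2 7B", "W6A6", 2): (67.37, 67.60),
    ("Llama 2 7B", "W8A8", 2): (68.73, 68.77),
    ("Llama 2 7B", "W6A6", 4): (67.70, 66.89),
    ("Llama 2 7B", "W8A8", 4): (68.85, 69.03),
    ("Llama 3 8B", "W6A6", 1): (70.73, 70.85),
    ("Llama 3 8B", "W8A8", 1): (72.85, 72.84),
    ("Llama 3 8B", "W6A6", 2): (69.58, 70.44),
    ("Llama 3 8B", "W8A8", 2): (72.75, 72.63),
    ("Llama 3 8B", "W6A6", 4): (69.45, 69.69),
    ("Llama 3 8B", "W8A8", 4): (73.24, 72.77),
    ("Llama 3 70B", "W8A8", 1): (78.85, 78.64),
    ("Llama 3 70B", "W8A8", 2): (78.97, 78.82),
}

# Positive drop means NLS is below the SNN baseline.
drops = {key: snn - nls for key, (snn, nls) in avg_acc.items()}
worst_key = max(drops, key=drops.get)
worst_drop = drops[worst_key]

print(f"largest drop: {worst_drop:.2f} points at {worst_key}")
# largest drop: 0.81 points at ('Llama 2 7B', 'W6A6', 4)
```

On these averages every configuration stays within one percentage point of the SNN baseline, and in several settings NLS is slightly higher.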

### B.4 Additional Operation Counts Comparisons

Table 5: QNN vs. SNN Function-Level Operation Counts Comparison (G = $10^{9}$)

| Function | Operator | QNN Count (G) | SNN (T=1) Count (G) | SNN (T=2) Count (G) | SNN (T=4) Count (G) |
|---|---|---|---|---|---|
| RMSNorm | MACs | 0.1051 | 0.0000 | 0.0000 | 0.0000 |
| | ACs | 0.0524 | 0.0000 | 0.1049 | 0.3146 |
| | Shifts | 0.0000 | 0.3145 | 0.3145 | 0.3145 |
| SiLU | MACs | 2.3953 | 0.0000 | 0.0000 | 0.0000 |
| | ACs | 0.1409 | 0.1409 | 0.2818 | 0.5636 |
| | Shifts | 0.0000 | 1.4090 | 2.8180 | 5.6361 |
| Softmax | MACs | 0.6554 | 0.0000 | 0.0000 | 0.0000 |
| | ACs | 0.0406 | 0.4092 | 0.5321 | 0.7778 |
| | Shifts | 0.0000 | 0.4096 | 0.4096 | 0.4096 |

To complement our analysis, Table [5](https://arxiv.org/html/2605.20289#A2.T5) reports function-level operation counts. Our NLS operators systematically replace multiply–accumulate (MAC) operations with accumulate (AC) and bit-shift operations, resulting in a more spike-friendly computation profile for neuromorphic hardware. The scaling behavior with time steps $T$ further reveals two implementation patterns. RMSNorm and Softmax perform their core shift-based computation only once, leading to constant shift counts while ACs grow linearly with $T$. In contrast, SiLU executes both shifts and ACs at every time step, causing the total operation count to scale linearly with $T$.
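The two scaling patterns can be checked directly against the SNN columns of Table 5. The sketch below tests whether each count is constant in $T$, proportional to $T$, or affine in $T$ (a fixed cost plus a per-step cost). The helper names and the tolerance of $10^{-3}$ G are assumptions chosen to absorb the table's four-decimal rounding.

```python
# SNN operation counts in G (1e9) at T = 1, 2, 4, copied from Table 5.
T_VALUES = [1, 2, 4]
counts = {
    ("RMSNorm", "ACs"): [0.0000, 0.1049, 0.3146],
    ("RMSNorm", "Shifts"): [0.3145, 0.3145, 0.3145],
    ("SiLU", "ACs"): [0.1409, 0.2818, 0.5636],
    ("SiLU", "Shifts"): [1.4090, 2.8180, 5.6361],
    ("Softmax", "ACs"): [0.4092, 0.5321, 0.7778],
    ("Softmax", "Shifts"): [0.4096, 0.4096, 0.4096],
}
TOL = 1e-3  # assumed tolerance for rounding in the table


def is_constant(vals):
    return max(vals) - min(vals) <= TOL


def is_proportional(ts, vals):
    """True if vals ~ c * T, i.e. the same cost is paid at every time step."""
    per_step = vals[0] / ts[0]
    return all(abs(v - per_step * t) <= TOL for t, v in zip(ts, vals))


def is_affine(ts, vals):
    """True if vals ~ a + b * T, i.e. a one-off cost plus a per-step cost."""
    slope = (vals[-1] - vals[0]) / (ts[-1] - ts[0])
    return all(abs(v - (vals[0] + slope * (t - ts[0]))) <= TOL
               for t, v in zip(ts, vals))


for key, vals in counts.items():
    if is_constant(vals):
        kind = "constant"
    elif is_proportional(T_VALUES, vals):
        kind = "proportional to T"
    elif is_affine(T_VALUES, vals):
        kind = "affine in T"
    else:
        kind = "other"
    print(key, "->", kind)
```

Under this reading, the RMSNorm and Softmax shift counts are constant and their AC counts are affine in $T$, while both SiLU counts are proportional to $T$, which agrees with the two patterns described above.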
