Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Summary
Proposes Selective Alignment Knowledge Distillation (SeAl-KD) for Spiking Neural Networks, which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity, achieving consistent improvements over existing distillation methods on static and neuromorphic datasets.
# Selective Alignment Knowledge Distillation for Spiking Neural Networks
Source: [https://arxiv.org/html/2605.14252](https://arxiv.org/html/2605.14252)
## Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Peibo Duan¹\*, Yongsheng Huang², Guowei Zhang¹, Benjamin Smith¹, Nanxu Gong³ & Levin Kuhlmann¹

¹ Faculty of Information Technology, Monash University, Australia
² School of Software, Northeastern University, China
³ Department of Medicine, National University of Singapore, Singapore

{kai.sun1, peibo.duan, levin.kuhlmann}@monash.edu

\* Corresponding author.
###### Abstract
Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but should instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at [https://github.com/KaiSUN1/SeAl](https://github.com/KaiSUN1/SeAl).
## 1 Introduction
Spiking Neural Networks (SNNs), often regarded as the third generation of neural networks, are biologically inspired models that communicate through discrete spikes rather than continuous-valued activations (Maass, [1997](https://arxiv.org/html/2605.14252#bib.bib41)). By processing information in an event-driven manner with sparse spikes over time, SNNs offer the potential for high energy efficiency, especially when deployed on neuromorphic hardware (Guo et al., [2023](https://arxiv.org/html/2605.14252#bib.bib12)). Nevertheless, training SNNs to achieve accuracy comparable to Artificial Neural Networks (ANNs) remains challenging, because useful evidence is accumulated progressively through spikes, making different timesteps unevenly informative during learning (Bellec et al., [2018](https://arxiv.org/html/2605.14252#bib.bib57); Neftci et al., [2019](https://arxiv.org/html/2605.14252#bib.bib59)).
To narrow the performance gap between ANNs and SNNs, knowledge distillation (KD) has become a widely adopted strategy (Xu et al., [2023](https://arxiv.org/html/2605.14252#bib.bib3); Qiu et al., [2024](https://arxiv.org/html/2605.14252#bib.bib7)). Early distillation methods typically supervise only the final aggregated output by matching averaged logits, which may mix inconsistent temporal information during optimization (Zhao et al., [2025](https://arxiv.org/html/2605.14252#bib.bib26); Deng et al., [2022](https://arxiv.org/html/2605.14252#bib.bib28)). More recent approaches move toward timestep-wise distillation by injecting teacher supervision at each timestep and aligning logits throughout the temporal dimension (Yu et al., [2025b](https://arxiv.org/html/2605.14252#bib.bib15), [a](https://arxiv.org/html/2605.14252#bib.bib16)). In parallel, several methods further exploit the temporal structure of SNNs through self-distillation (Ding et al., [2025b](https://arxiv.org/html/2605.14252#bib.bib2)) or temporal consistency regularization (Zhao et al., [2025](https://arxiv.org/html/2605.14252#bib.bib26); Ding et al., [2025a](https://arxiv.org/html/2605.14252#bib.bib27)), encouraging predictions or features from different timesteps to be consistent in order to stabilize optimization and improve performance (Ding et al., [2025b](https://arxiv.org/html/2605.14252#bib.bib2); Yu et al., [2025b](https://arxiv.org/html/2605.14252#bib.bib15)).
This consistency assumption is partially misaligned with the intrinsic properties of SNNs and their prediction mechanism. Due to membrane potential integration and reset, spike firing may induce abrupt changes in membrane states, making it difficult for SNNs to maintain identical predictions across all timesteps, as detailed in Appendix [A](https://arxiv.org/html/2605.14252#A1). Meanwhile, since the final decision is determined by temporal accumulation rather than any single timestep, an incorrect intermediate prediction does not necessarily imply an incorrect final outcome. As illustrated in the toy case in Figure [1](https://arxiv.org/html/2605.14252#S1.F1)(a), some intermediate timesteps are misclassified while the final temporally aggregated prediction remains correct. Figure [1](https://arxiv.org/html/2605.14252#S1.F1)(b) further shows that per-timestep accuracy is consistently lower than temporally aggregated accuracy. Moreover, Figure [1](https://arxiv.org/html/2605.14252#S1.F1)(c) reveals that more than 12% of correctly classified samples are misclassified at some intermediate timesteps, and some samples are never correctly classified at any individual timestep but still become correct after temporal aggregation. These observations suggest that a timestep should not be judged solely by whether it is already correct, but by whether it contributes useful evidence to the final temporal accumulation.
Figure 1: Mismatch between intermediate and final predictions in SNNs. (a) A toy example where intermediate predictions are wrong, but the final aggregated prediction is correct. (b) Per-timestep accuracy and final aggregated accuracy on CIFAR-100. (c) Distribution of CIFAR-100 samples with correct final predictions by the number of correctly predicted timesteps out of $T$.

However, existing timestep-wise distillation strategies do not explicitly account for this distinction. By encouraging each timestep to align with the teacher signal, they may overlook what corrective evidence is actually needed for improving temporal accumulation and which temporal sources can provide reliable support. This perspective leads to two key questions for timestep-wise distillation: what corrective evidence should be injected into a currently erroneous timestep, and from which temporal sources should it be drawn? For a currently erroneous timestep dominated by a confusing wrong class, directly transferring teacher preference over the full class distribution may dilute the needed correction and interfere with the adjustment of the most critical class relation, namely that between the ground-truth class and the wrongly favored class. Moreover, in temporal self-distillation, not all source timesteps are equally reliable: some provide confident and compatible evidence, while others may introduce noisy or misleading temporal signals. To address these issues, we propose SeAl-KD, a selective distillation framework with two components. Error-aware Logit Alignment (ELA) refines the class evidence received by erroneous timesteps, while Selective Temporal Alignment (STA) emphasizes reliable and compatible source timesteps during alignment.
Figure 2: SeAl-KD framework. SNNs learn from the same copied ANN output across timesteps. ELA equalizes the true and predicted-false logits at erroneous timesteps before teacher–student KL. STA reweights temporal KL using confidence and inter-timestep similarity. $C$ denotes the ground-truth class. The bottom-left illustration shows that our method aims to follow temporal diversity to obtain correct predictions, even when some intermediate predictions are incorrect.

Our main contributions are summarized as follows: (1) We reveal that erroneous timesteps are prevalent in SNNs and show that intermediate misclassifications do not necessarily impair the final prediction under temporal aggregation, exposing the mismatch between consistency and temporal evidence accumulation. (2) We propose SeAl-KD, a selective KD framework that improves corrective supervision for erroneous timesteps through ELA and STA. We further provide theoretical analysis to justify the proposed selective alignment. (3) We conduct extensive experiments on both static and neuromorphic image datasets. The results show that SeAl-KD preserves richer temporal distributions across timesteps and consistently improves performance.
## 2 Related Works
### 2.1 Knowledge Distillation for SNNs
KD improves SNN training by transferring final outputs, logits, or features from ANNs to narrow the gap with real-valued ANNs (Xu et al., [2023](https://arxiv.org/html/2605.14252#bib.bib3); Hong et al., [2025](https://arxiv.org/html/2605.14252#bib.bib4)). Along this line, distillation strategies have evolved from global supervision on temporally aggregated outputs (Zhang et al., [2025](https://arxiv.org/html/2605.14252#bib.bib6)) to timestep-wise supervision that aligns logits or representations at each timestep (Yu et al., [2025b](https://arxiv.org/html/2605.14252#bib.bib15), [a](https://arxiv.org/html/2605.14252#bib.bib16)). In addition, self-distillation and temporal consistency regularization further exploit the student's temporal structure to align outputs across timesteps (Qiu et al., [2024](https://arxiv.org/html/2605.14252#bib.bib7); Zuo et al., [2024](https://arxiv.org/html/2605.14252#bib.bib13)). Despite these advances, most existing methods impose supervision on different timesteps in a largely uniform manner, without distinguishing whether a timestep is currently erroneous, what corrective information it actually needs, or whether the temporal source used for supervision is reliable.
### 2.2 Temporal Discrepancy in SNNs
Temporal discrepancy is intrinsic to SNNs, as membrane integration and spike-triggered resets cause neuronal states and predictions to evolve across timesteps (Bellec et al., [2018](https://arxiv.org/html/2605.14252#bib.bib57)). Prior work has improved temporal modeling and training stability through learnable membrane dynamics (Fang et al., [2021](https://arxiv.org/html/2605.14252#bib.bib9)), temporal normalization (Zheng et al., [2021](https://arxiv.org/html/2605.14252#bib.bib29); Duan et al., [2022](https://arxiv.org/html/2605.14252#bib.bib8)), and surrogate-gradient design (Li et al., [2021b](https://arxiv.org/html/2605.14252#bib.bib58); Wang et al., [2023](https://arxiv.org/html/2605.14252#bib.bib10)). Some studies further exploit temporal structure via timestep-dependent weighting (Deng et al., [2022](https://arxiv.org/html/2605.14252#bib.bib28)) or temporal consistency regularization across timesteps (Zhao et al., [2025](https://arxiv.org/html/2605.14252#bib.bib26); Ding et al., [2025a](https://arxiv.org/html/2605.14252#bib.bib27)). However, these methods mostly regulate temporal discrepancy uniformly, without explicitly considering the role of each timestep in the final temporally accumulated prediction. Consequently, they do not distinguish what supervision a timestep needs or which temporal sources are reliable for providing it.
## 3 Preliminaries
##### Timestep\-wise distillation\.
Consider an SNN unrolled over $T$ timesteps. At timestep $t$, the student SNN model, denoted by the superscript $S$, produces a logit vector $\mathbf{z}_{t}^{S}\in\mathbb{R}^{C}$, where $C$ denotes the number of classes. The final prediction is typically obtained by temporally averaging the logits over all timesteps, while each $\mathbf{z}_{t}^{S}$ can also be treated as an intermediate prediction. For the same input, the teacher ANN model, denoted by the superscript $A$, produces a time-invariant logit vector $\mathbf{z}^{A}$. The corresponding categorical distributions are given by the temperature-scaled softmax functions $\mathbf{p}_{t}^{S}=\mathrm{softmax}(\mathbf{z}_{t}^{S}/\tau)$ and $\mathbf{p}^{A}=\mathrm{softmax}(\mathbf{z}^{A}/\tau)$, where $\tau>0$ is the temperature. The probability of class $i$ under $\mathbf{p}_{t}^{S}$ and $\mathbf{p}^{A}$ is denoted by $p_{t,i}^{S}$ and $p_{i}^{A}$, respectively.
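The gap between intermediate and temporally averaged predictions can be made concrete with a small numeric sketch. The logits below are made-up toy values (not taken from the paper's experiments), and the helper names `argmax` and `temporal_mean` are hypothetical; the point is only that averaging logits over timesteps can recover the correct class even when most individual timesteps are wrong.

```python
# Toy logits for T = 3 timesteps and C = 3 classes (made-up numbers).
logits_per_timestep = [
    [2.0, 3.0, 0.0],  # t = 1: argmax is class 1 (wrong)
    [2.0, 0.0, 3.0],  # t = 2: argmax is class 2 (wrong)
    [2.0, 1.0, 1.5],  # t = 3: argmax is class 0 (correct)
]
true_class = 0


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def temporal_mean(logits):
    """Average the logit vectors over the timestep axis."""
    num_steps = len(logits)
    num_classes = len(logits[0])
    return [sum(step[c] for step in logits) / num_steps for c in range(num_classes)]


per_step_predictions = [argmax(step) for step in logits_per_timestep]
final_prediction = argmax(temporal_mean(logits_per_timestep))

print(per_step_predictions)  # [1, 2, 0]
print(final_prediction)      # 0
```

Here the ground-truth class is never the top class at the first two timesteps, yet it carries steady evidence at every timestep while the wrong classes alternate, so it wins after averaging.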
For each timestep $t$, the student is optimized with two objectives. The first objective $\mathcal{L}_{\mathrm{CLS}}$ applies cross-entropy (CE) loss for classification supervision at each timestep:

$$\mathcal{L}_{\mathrm{CLS}}=\frac{1}{T}\sum_{t=1}^{T}\ell_{\mathrm{CE}}(\mathbf{z}_{t}^{S},\mathbf{y})=\frac{1}{T}\sum_{t=1}^{T}\Big(-\sum_{i=1}^{C}\mathbf{y}_{i}\log p_{t,i}^{S}\Big), \tag{1}$$

where $\mathbf{y}\in\{0,1\}^{C}$ is the one-hot ground-truth label.

The second objective $\mathcal{L}_{\mathrm{KD}}$ aligns each temporal output with the same teacher distribution using the Kullback–Leibler (KL) divergence:

$$\mathcal{L}_{\mathrm{KD}}=\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\big(\mathbf{p}^{A}\,\|\,\mathbf{p}_{t}^{S}\big)=\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{C}p_{i}^{A}\log\frac{p_{i}^{A}}{p_{t,i}^{S}}. \tag{2}$$

The overall training objective $\mathcal{L}$ is defined as

$$\mathcal{L}=\mathcal{L}_{\mathrm{CLS}}+\lambda\mathcal{L}_{\mathrm{KD}}, \tag{3}$$

where $\lambda$ controls the contribution of the distillation term.
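As a rough illustration of Eqs. (1)–(3), the following sketch evaluates the baseline timestep-wise objective for a single sample using plain Python lists. It is not the authors' implementation: the function names and toy logits are hypothetical, and it assumes the same temperature $\tau$ is used in both the CE and KL terms, following the notation above.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def timestepwise_kd(student_logits, teacher_logits, label, tau=1.0, lam=1.0):
    """Return (L_CLS, L_KD, L) following Eqs. (1)-(3) for one sample."""
    num_steps = len(student_logits)
    p_teacher = softmax(teacher_logits, tau)
    cls_loss = 0.0
    kd_loss = 0.0
    for z_t in student_logits:
        p_t = softmax(z_t, tau)
        cls_loss += -math.log(p_t[label])         # CE with a one-hot label
        kd_loss += kl_divergence(p_teacher, p_t)  # same teacher at every t
    cls_loss /= num_steps
    kd_loss /= num_steps
    return cls_loss, kd_loss, cls_loss + lam * kd_loss


# Toy values: T = 2 timesteps, C = 3 classes.
student = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2]]
teacher = [3.0, 0.5, 0.1]
cls_loss, kd_loss, total = timestepwise_kd(student, teacher, label=0, lam=0.5)
```

Every timestep is pulled toward the same teacher distribution with equal weight, which is the uniform treatment that the selective scheme in the next section relaxes.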
## 4 Methodology
This section presents SeAl-KD, which consists of ELA and STA. As illustrated in Figure [2](https://arxiv.org/html/2605.14252#S1.F2), ELA refines class-level correction at erroneous timesteps, while STA selects confident and compatible temporal sources for alignment.
### 4.1 Error-aware Logits Alignment Distillation
ELA performs timestep-wise distillation while accounting for prediction errors across timesteps in SNNs. Let $y^{\ast}\in\{1,\dots,C\}$ denote the ground-truth class index corresponding to the one-hot label $\mathbf{y}$, and let $z_{t,c}^{S}$ and $z_{t,c}^{A}$ denote the student and teacher logits for class $c$ at timestep $t$, respectively. The student prediction at timestep $t$ is $c_{t}^{\mathrm{pred}}=\arg\max_{c}z_{t,c}^{S}$. When an intermediate timestep is erroneous, ELA relaxes distillation on its dominant confusion instead of directly enforcing the correct class ordering, avoiding misleading correction on the confusing class pair while preserving supervision on the remaining classes.
##### Logit modification\.
If $c_{t}^{\mathrm{pred}}\neq y^{\ast}$, we focus on the class pair formed by the ground-truth class $y^{\ast}$ and the predicted false class $c_{t}^{\mathrm{false}}=c_{t}^{\mathrm{pred}}$. We equalize only the logits of this pair and keep the remaining classes unchanged. Specifically, the logits of $y^{\ast}$ and $c_{t}^{\mathrm{false}}$ are both set to the minimum of their original values, which avoids introducing unnecessary absolute shifts. The modified student and teacher logits, denoted by $\tilde{z}_{t,c}^{S}$ and $\tilde{z}_{t,c}^{A}$, are defined as

$$\tilde{z}_{t,c}^{S}=\begin{cases}\min\left(z_{t,y^{\ast}}^{S},\,z_{t,c_{t}^{\mathrm{false}}}^{S}\right),&c\in\{y^{\ast},\,c_{t}^{\mathrm{false}}\},\\ z_{t,c}^{S},&\text{otherwise}.\end{cases} \tag{4}$$

$$\tilde{z}_{t,c}^{A}=\begin{cases}\min\left(z_{t,y^{\ast}}^{A},\,z_{t,c_{t}^{\mathrm{false}}}^{A}\right),&c\in\{y^{\ast},\,c_{t}^{\mathrm{false}}\},\\ z_{t,c}^{A},&\text{otherwise}.\end{cases} \tag{5}$$
##### ELA objective\.
The ELA loss is then defined as

$$\mathcal{L}_{\mathrm{ELA}}=\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\big(p(\tilde{\mathbf{z}}_{t}^{A})\,\|\,p(\tilde{\mathbf{z}}_{t}^{S})\big). \tag{6}$$

By removing the forced preference between the ground-truth class and the dominant false class at erroneous timesteps, ELA prevents distillation from reinforcing misleading intermediate ordering while retaining teacher guidance on the remaining classes.
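The logit modification and the ELA loss in Eqs. (4)–(6) can be sketched for a single sample as follows. This is a minimal illustration rather than the released code: the function names are hypothetical, $p(\cdot)$ is taken to be a temperature-scaled softmax, the teacher logits are time-invariant, and correct timesteps are assumed to use the unmodified logits.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def equalize_pair(logits, true_class, false_class):
    """Set both logits of the confusing pair to their minimum (Eqs. 4-5)."""
    floor = min(logits[true_class], logits[false_class])
    modified = list(logits)
    modified[true_class] = floor
    modified[false_class] = floor
    return modified


def ela_loss(student_logits, teacher_logits, true_class, tau=1.0):
    """Average KL between modified teacher and student distributions (Eq. 6)."""
    total = 0.0
    for z_s in student_logits:
        predicted = max(range(len(z_s)), key=lambda c: z_s[c])
        z_a = teacher_logits
        if predicted != true_class:  # erroneous timestep: equalize the pair
            z_s = equalize_pair(z_s, true_class, predicted)
            z_a = equalize_pair(z_a, true_class, predicted)
        total += kl_divergence(softmax(z_a, tau), softmax(z_s, tau))
    return total / len(student_logits)


# Toy values: the first timestep predicts class 1 (wrong), the second class 0.
student = [[1.0, 3.0, 0.5], [2.5, 1.0, 0.2]]
teacher = [4.0, 1.0, -1.0]
loss = ela_loss(student, teacher, true_class=0)
```

At the erroneous first timestep, both distributions assign equal mass to the ground-truth and predicted-false classes after modification, so the KL term there only compares how the pair relates to the remaining classes.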
### 4.2 Similarity-aware Temporal Alignment Distillation
Based on uniform temporal alignment (UTA) (Zhao et al., [2025](https://arxiv.org/html/2605.14252#bib.bib26)), which enforces unweighted pairwise alignment across timesteps, STA adopts a weighted temporal distillation scheme that allows each timestep $t$ to selectively learn from confident and compatible timesteps. This guides weak timesteps toward better temporal states that support effective temporal evidence accumulation.
##### Timestep confidence\.
The reliability of timestep $t$ is quantified by an entropy-based confidence score. The entropy $H_{t}$ measures the uncertainty of the class distribution and is normalized by the maximum entropy $\log C$:

$$\mathrm{Conf}_{t}=1-\frac{H_{t}}{\log C},\qquad H_{t}=-\sum_{c=1}^{C}p(\mathbf{z}_{t}^{S})_{c}\log p(\mathbf{z}_{t}^{S})_{c}. \tag{7}$$

A lower entropy leads to a larger $\mathrm{Conf}_{t}$, indicating that the timestep provides a more reliable source.
##### Timestep compatibility\.
The compatibility between a target timestep $t$ and a source timestep $t^{\prime}$ is measured by cosine similarity in the student logit space. This similarity reflects the consistency of their class-level preferences:

$$\mathrm{Sim}(t,t^{\prime})=\frac{\mathbf{z}_{t}^{S}\cdot\mathbf{z}_{t^{\prime}}^{S}}{\|\mathbf{z}_{t}^{S}\|\,\|\mathbf{z}_{t^{\prime}}^{S}\|}. \tag{8}$$

A larger $\mathrm{Sim}(t,t^{\prime})$ indicates that the two timesteps exhibit more compatible class preferences.
##### Source weighting\.
For each target timestep $t$, an unnormalized source score $s_{t,t^{\prime}}$ is computed for source timestep $t^{\prime}$ by combining its confidence with its compatibility to the target:

$$s_{t,t^{\prime}}=\mathrm{Conf}_{t^{\prime}}\cdot\mathrm{Sim}(t,t^{\prime}). \tag{9}$$

The final weights are obtained by applying a softmax over all source timesteps satisfying $t^{\prime}\neq t$, with $w_{t,t}=0$:

$$w_{t,t^{\prime}}=\frac{\exp(s_{t,t^{\prime}})}{\sum_{j\neq t}\exp(s_{t,j})},\quad t^{\prime}\neq t. \tag{10}$$

Thus, each target timestep primarily learns from confident and compatible source timesteps.
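The source weighting in Eqs. (7)–(10) can be traced with a short sketch. The helper names and toy logits are hypothetical, the confidence is computed from an untempered softmax (an assumption, since the temperature used inside $p(\cdot)$ is not specified here), and the code assumes at least two timesteps with non-zero logit vectors.

```python
import math


def softmax(values):
    """Softmax shifted by the max for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Conf_t = 1 - H_t / log C (Eq. 7)."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(logits))


def cosine_similarity(u, v):
    """Sim(t, t') in logit space (Eq. 8); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sta_weights(student_logits):
    """Row t holds w_{t,t'} over sources t' != t (Eqs. 9-10); w_{t,t} = 0."""
    num_steps = len(student_logits)
    conf = [confidence(z) for z in student_logits]
    weights = [[0.0] * num_steps for _ in range(num_steps)]
    for t in range(num_steps):
        sources = [j for j in range(num_steps) if j != t]
        scores = [conf[j] * cosine_similarity(student_logits[t], student_logits[j])
                  for j in sources]
        for j, w in zip(sources, softmax(scores)):
            weights[t][j] = w
    return weights


# Toy values: timestep 1 is confident, timestep 2 is close to uniform.
student = [[1.0, 0.8, 0.2], [4.0, 0.5, 0.1], [0.3, 0.2, 0.25]]
weights = sta_weights(student)
```

For target timestep 0 in this toy case, both sources point in a similar direction in logit space, but the near-uniform source has almost zero confidence, so `weights[0][1]` ends up larger than `weights[0][2]`.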
##### STA objective\.
Based on the obtained weights, we encourage each target timestep to align with other source timesteps through a weighted KL divergence. The STA loss is defined as

$$\mathcal{L}_{\mathrm{STA}}=\frac{1}{T}\sum_{t=1}^{T}\sum_{t^{\prime}\neq t}w_{t,t^{\prime}}\,\mathrm{KL}\big(p(\mathbf{z}_{t^{\prime}}^{S})\,\|\,p(\mathbf{z}_{t}^{S})\big). \tag{11}$$

This objective lets each timestep absorb selectively weighted temporal guidance, instead of being uniformly aligned to all other timesteps, thereby providing more effective support for weak timesteps and reducing interference on already reliable ones.
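A minimal sketch of Eq. (11) is given below, taking the weight matrix as an input so that the same function covers both weighted and unweighted alignment. The names are hypothetical, and the uniform weights $1/(T-1)$ used in the example are an assumed normalisation for an unweighted pairwise baseline, not a value stated in the text.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sta_loss(student_logits, weights, tau=1.0):
    """Weighted KL from each source t' to each target t, averaged over t (Eq. 11)."""
    num_steps = len(student_logits)
    probs = [softmax(z, tau) for z in student_logits]
    total = 0.0
    for t in range(num_steps):
        for src in range(num_steps):
            if src != t:
                total += weights[t][src] * kl_divergence(probs[src], probs[t])
    return total / num_steps


# Toy values: T = 3 timesteps, C = 3 classes.
student = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2], [1.8, 0.9, 0.3]]
T = len(student)
uniform = [[0.0 if s == t else 1.0 / (T - 1) for s in range(T)] for t in range(T)]
loss_uniform = sta_loss(student, uniform)
```

Passing weights computed from confidence and similarity instead of `uniform` shifts the loss toward pairs whose source timestep is reliable and compatible with the target.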
Figure 3: Layer-wise statistics over all timesteps for the three propositions: (a) the fraction of the ELA update assigned to the ground-truth class and the dominant false class at erroneous timesteps; (b) the cosine similarity between the STA update and the direction that reduces the gap to reliability-weighted temporal references at weak timesteps; (c) the ratio between the distillation-gradient norm and the classification-gradient norm at already-correct timesteps. Statistics are computed from five randomly selected samples and reported as mean ± std.

| Method | Architecture | CIFAR-10, T=4 | CIFAR-10, T=6 | CIFAR-100, T=4 | CIFAR-100, T=6 |
|---|---|---|---|---|---|
| w/o KD | | | | | |
| Dspike (Li et al., [2021a](https://arxiv.org/html/2605.14252#bib.bib30)) | ResNet-18 | 93.66 | 94.25 | 73.35 | 74.24 |
| GLIF (Yao et al., [2022](https://arxiv.org/html/2605.14252#bib.bib31)) | ResNet-18 | 94.67 | 94.88 | 76.42 | 77.28 |
| RateBP (Yu et al., [2024](https://arxiv.org/html/2605.14252#bib.bib33)) | ResNet-18 | 95.61 | 95.90 | 78.26 | 79.02 |
| STBP-tdBN (Zheng et al., [2021](https://arxiv.org/html/2605.14252#bib.bib29)) | ResNet-19 | 92.92 | 93.16 | – | – |
| TET (Deng et al., [2022](https://arxiv.org/html/2605.14252#bib.bib28)) | ResNet-19 | 94.44 | 94.50 | 74.47 | 74.72 |
| GLIF (Yao et al., [2022](https://arxiv.org/html/2605.14252#bib.bib31)) | ResNet-19 | 94.85 | 95.03 | 77.05 | 77.35 |
| LSG (Lian et al., [2023](https://arxiv.org/html/2605.14252#bib.bib53)) | ResNet-19 | 95.17 | 95.52 | 76.85 | 77.13 |
| RateBP (Yu et al., [2024](https://arxiv.org/html/2605.14252#bib.bib33)) | ResNet-19 | 96.26 | 96.36 | 80.71 | 80.83 |
| w/ KD | | | | | |
| KDSNN (Xu et al., [2023](https://arxiv.org/html/2605.14252#bib.bib3)) | ResNet-18 | 93.41 | – | – | – |
| Joint A-SNN (Guo et al., [2023](https://arxiv.org/html/2605.14252#bib.bib12)) | ResNet-18 | 95.45 | – | 77.39 | – |
| Rate-based KD (Yang et al., [2025](https://arxiv.org/html/2605.14252#bib.bib50)) | ResNet-18 | 95.92 | 96.14 | 78.85 | 79.40 |
| Logit-SNN (Yu et al., [2025a](https://arxiv.org/html/2605.14252#bib.bib16)) | ResNet-18 | 95.57 | 95.96 | 79.10 | 79.80 |
| SeAl-KD (Ours) | ResNet-18 | 95.88±0.12 | 96.18±0.06 | 79.88±0.13 | 80.25±0.12 |
| SAKD (Qiu et al., [2024](https://arxiv.org/html/2605.14252#bib.bib7)) | ResNet-19 | 96.06 | – | 80.10 | – |
| HTA-KL (Zhang et al., [2025](https://arxiv.org/html/2605.14252#bib.bib6)) | ResNet-19 | 96.76 | – | 81.03 | – |
| Logit-SNN (Yu et al., [2025a](https://arxiv.org/html/2605.14252#bib.bib16)) | ResNet-19 | 96.97 | 97.00 | 82.47 | 82.56 |
| SeAl-KD (Ours) | ResNet-19 | 97.14±0.06 | 97.23±0.05 | 83.04±0.05 | 83.37±0.08 |

Table 1: Comparison of different direct-training and distillation methods on CIFAR-10 and CIFAR-100.
### 4.3 Selective Alignment Distillation
SeAl-KD combines ELA in Eq. ([6](https://arxiv.org/html/2605.14252#S4.E6)) and STA in Eq. ([11](https://arxiv.org/html/2605.14252#S4.E11)) to realize selective temporal supervision. Specifically, ELA regulates what class-level knowledge is injected at erroneous timesteps, while STA guides weak timesteps toward better temporal states through weighted alignment, so that temporal correction better supports the final evidence accumulation process. Together with the classification loss, the overall training objective is

$$\mathcal{L}_{\mathrm{SeAl\text{-}KD}}=\mathcal{L}_{\mathrm{CLS}}+\alpha\,\mathcal{L}_{\mathrm{ELA}}+\beta\,\mathcal{L}_{\mathrm{STA}}, \tag{12}$$

where $\alpha$ and $\beta$ control the contributions of ELA and STA, respectively.
### 4.4 Theoretical and Statistical Analysis
We analyze SeAl-KD from the perspective of the additional update introduced by distillation beyond the standard cross-entropy objective. The analysis focuses on three representative timestep conditions: erroneous timesteps, weak timesteps, and already-correct timesteps. To save space, the main paper reports one representative timestep for each condition. The full results over additional sampled timesteps are provided in Appendix [B](https://arxiv.org/html/2605.14252#A2). These diagnostics show that SeAl-KD intervenes selectively: it localizes correction when the prediction is wrong, transfers temporal guidance from more reliable temporal states, and remains restrained when the prediction is already reliable.
###### Proposition 1 (Localized correction).
At an erroneous timestep, ELA encourages the distillation update to concentrate on the error\-relevant class subspace, rather than spreading the correction uniformly over all classes\.
Consider an erroneous timestep $t$ with ground-truth class index $y^{\ast}$ and currently predicted wrong class $c_{t}^{\mathrm{false}}$. Let $\bar{z}_{t,\mathrm{rest}}^{S}$ denote the mean student logit over the remaining classes outside $\{y^{\ast},c_{t}^{\mathrm{false}}\}$, and let $\nabla_{\theta_{l}}$ denote the gradient with respect to the parameters of layer $l$. To examine whether the ELA update is concentrated on the dominant confusion pair, we define three directional effects:

$$\begin{aligned}D_{\mathrm{true}}^{l}&=\left|\left\langle\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{ELA}},\nabla_{\theta_{l}}z_{t,y^{\ast}}^{S}\right\rangle\right|,\\ D_{\mathrm{false}}^{l}&=\left|\left\langle\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{ELA}},\nabla_{\theta_{l}}z_{t,c_{t}^{\mathrm{false}}}^{S}\right\rangle\right|,\\ D_{\mathrm{rest}}^{l}&=\left|\left\langle\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{ELA}},\nabla_{\theta_{l}}\bar{z}_{t,\mathrm{rest}}^{S}\right\rangle\right|.\end{aligned} \tag{13}$$

The pair-concentration score is then defined as

$$\mathrm{PairShare}_{t}^{l}=\frac{D_{\mathrm{true}}^{l}+D_{\mathrm{false}}^{l}}{D_{\mathrm{true}}^{l}+D_{\mathrm{false}}^{l}+D_{\mathrm{rest}}^{l}}. \tag{14}$$

The inner products measure how strongly an update along the ELA direction acts on each logit. A larger $\mathrm{PairShare}_{t}^{l}$ means that the update is more concentrated on the confusing pair $\{y^{\ast},c_{t}^{\mathrm{false}}\}$ than on the remaining classes. As shown in Figure [3](https://arxiv.org/html/2605.14252#S4.F3)(a), SeAl-KD gives a higher pair-share score than timestep-wise KD, suggesting that ELA yields more localized correction at erroneous timesteps.
###### Proposition 2 (Reference-guided correction).
For a weak timestep, STA encourages the update to reduce its margin discrepancy to reliability\-weighted temporal references, thereby providing guidance from confident and compatible states\.
Let $m_{t}=z_{t,y^{\ast}}^{S}-z_{t,c_{t}^{\mathrm{false}}}^{S}$ denote the student margin at timestep $t$, where the margin measures the logit gap between the ground-truth class $y^{\ast}$ and the dominant false class $c_{t}^{\mathrm{false}}$. A larger margin indicates that the student assigns stronger relative preference to the ground-truth class over the confusing false class. Let $m_{\mathrm{ref},t}$ be the reference margin for timestep $t$, obtained by aggregating source-timestep margins according to the STA weights. We measure whether the STA update is aligned with the direction that reduces the discrepancy between $m_{t}$ and $m_{\mathrm{ref},t}$:

$$\mathrm{RefAlign}_{t}^{l}=\cos\Big(\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{STA}},\nabla_{\theta_{l}}(m_{t}-m_{\mathrm{ref},t})^{2}\Big). \tag{15}$$

A larger $\mathrm{RefAlign}_{t}^{l}$ indicates stronger alignment with the discrepancy-reduction direction, suggesting that STA guides weak timesteps toward reliability-weighted temporal references. Rather than forcing each intermediate timestep to be independently correct, STA promotes more consistent evidence accumulation across timesteps. As shown in Figure [3](https://arxiv.org/html/2605.14252#S4.F3)(b), SeAl-KD produces positive alignment, whereas timestep-wise KD shows weaker or negative alignment.
###### Proposition 3 (Restrained correction).
When a timestep is already correctly classified, the distillation update remains small relative to the task gradient, reducing unnecessary interference with reliable predictions\.
For already\-correct timesteps, a large distillation update may perturb a reliable prediction\. We therefore measure the relative strength of the distillation gradient against the task gradient:
$$\mathrm{KDRatio}_{t}^{l}=\frac{\|\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{KD}}\|}{\|\nabla_{\theta_{l}}\mathcal{L}_{\mathrm{CLS}}\|}. \tag{16}$$

A smaller $\mathrm{KDRatio}_{t}^{l}$ indicates weaker interference from the distillation objective. As shown in Figure [3](https://arxiv.org/html/2605.14252#S4.F3)(c), SeAl-KD yields a much smaller KD-to-CLS gradient ratio than timestep-wise KD at the representative correct timestep, indicating that it remains restrained when the prediction is already reliable.
## 5 Experiments
### 5.1 Implementation Details
##### Datasets\.
We evaluate the proposed method on four benchmark datasets: three static image datasets (CIFAR-10, CIFAR-100, and ImageNet) and one neuromorphic dataset (DVS-CIFAR10). This selection allows us to assess the effectiveness of the proposed approach on both frame-based vision tasks and event-based neuromorphic data. Dataset details are provided in Appendix [C](https://arxiv.org/html/2605.14252#A3).
##### Training Settings\.
All experiments are implemented in PyTorch and trained on NVIDIA A100 GPUs. The SNNs are built with leaky integrate-and-fire neurons and trained using surrogate gradient learning (Neftci et al., [2019](https://arxiv.org/html/2605.14252#bib.bib59)). All experiments except ImageNet are run three times, and the average results are reported. For neuromorphic datasets, ANN training uses the average value of the event data as input. Training details are provided in Appendix [D](https://arxiv.org/html/2605.14252#A4).
### 5.2 Performance Comparison
The results on CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10 are summarized in Tables [1](https://arxiv.org/html/2605.14252#S4.T1), [2](https://arxiv.org/html/2605.14252#S5.T2), and [3](https://arxiv.org/html/2605.14252#S5.T3). Across all evaluated benchmarks, SeAl-KD consistently outperforms directly trained SNNs and achieves superior or competitive performance compared with representative logit-based SNN distillation methods and other distillation baselines under comparable architectures and inference timesteps. The improvements are observed on both static frame-based datasets and event-based neuromorphic datasets, indicating that the proposed method generalizes well across different data modalities. In addition, SeAl-KD only introduces lightweight training-time objectives, and a detailed energy analysis is provided in Appendix [E](https://arxiv.org/html/2605.14252#A5).
Table 2: Comparison of different direct-training and distillation methods on ImageNet with ResNet-34.

| Category | Method | Architecture | T | Acc (%) |
| --- | --- | --- | --- | --- |
| w/o KD | STBP-tdBN Zheng et al. ([2021](https://arxiv.org/html/2605.14252#bib.bib29)) | ResNet-19 | 10 | 67.80 |
| | Dspike Li et al. ([2021a](https://arxiv.org/html/2605.14252#bib.bib30)) | ResNet-18 | 10 | 75.40 |
| | TET Deng et al. ([2022](https://arxiv.org/html/2605.14252#bib.bib28)) | VGG-11 | 10 | 83.17 |
| | SLTT Meng et al. ([2023](https://arxiv.org/html/2605.14252#bib.bib52)) | VGG-11 | 10 | 82.20 |
| | Spikformer Zhou et al. ([2023](https://arxiv.org/html/2605.14252#bib.bib34)) | | 16 | 80.90 |
| | Spike-driven Transformer Yao et al. ([2023](https://arxiv.org/html/2605.14252#bib.bib36)) | | 16 | 80.00 |
| w/ KD | SAKD Qiu et al. ([2024](https://arxiv.org/html/2605.14252#bib.bib7)) | VGG-11 | 4 | 81.50 |
| | | ResNet-19 | 4 | 80.30 |
| | TSSD Zuo et al. ([2024](https://arxiv.org/html/2605.14252#bib.bib13)) | ResNet-18 | 8 | 72.90 |
| | | | 16 | 81.60 |
| | LogitSNN Yu et al. ([2025a](https://arxiv.org/html/2605.14252#bib.bib16)) | ResNet-18 | 4 | 83.50 |
| | | | 10 | 86.40 |
| | SeAl-KD (Ours) | ResNet-18 | 4 | 84.00±0.20 |
| | | | 10 | 86.70±0.20 |

Table 3: Comparison of different direct-training and distillation methods on DVS-CIFAR10.

Table 4: Component ablation of SeAl-KD using ResNet-18 across datasets under $T=4$.

Table 5: Ablation study of ELA variants using ResNet-18 across different datasets under $T=4$.

Table 6: Ablation study of STA variants using ResNet-18 across different datasets under $T=4$.
### 5.3 Ablation Study
#### 5.3.1 Component Ablation of SeAl-KD
We first evaluate the individual and joint contributions of ELA and STA in Table [4](https://arxiv.org/html/2605.14252#S5.T4). Removing either module degrades performance, and removing both causes a larger drop. This indicates that SeAl-KD benefits from making alignment selective in both class and temporal dimensions: ELA reduces misleading supervision on erroneous classes, while STA provides temporal guidance from more reliable states.
#### 5.3.2 Ablation Study of ELA Variants
We analyze several ELA variants, with results reported in Table [5](https://arxiv.org/html/2605.14252#S5.T5). ELA-S applies error-aware modification only to the student logits, ELA-A only to the teacher logits, ELA-AS introduces error awareness on the ANN teacher side and aligns the student accordingly, and ELA-Both extends the alignment to three classes when the teacher prediction is also incorrect. All variants outperform the baseline without ELA, showing that class-level selective correction is beneficial once alignment is no longer imposed uniformly. Our ELA achieves the best overall performance, and student-driven variants are consistently more effective than teacher-driven ones, because student errors directly identify the timesteps where naive alignment is most misleading. This supports that ELA should be selective, student-driven, and focused on correcting the dominant confusing class relation rather than enforcing broader per-timestep matching.
Figure 5:t\-SNE visualization of learned feature representations on DVS\-CIFAR10 under different direct\-training and distillation methods\.
#### 5.3.3 Ablation Study of STA Variants
We analyze several STA variants, with results reported in Table [6](https://arxiv.org/html/2605.14252#S5.T6). STA-NoConf removes confidence weighting, STA-NoSim removes similarity filtering, STA-Dist replaces cosine similarity with cosine distance, and UTA enforces unweighted pairwise alignment across all timesteps. All variants outperform the baseline without STA, confirming the benefit of temporal guidance, while the full STA performs best overall. This suggests that temporal correction should not be propagated uniformly across timesteps, but should instead guide erroneous timesteps using temporal states that are both reliable and compatible. Confidence weighting suppresses noisy sources, similarity filtering avoids mismatched guidance, and their combination yields the most effective selective temporal correction.
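As a rough illustration of what "reliable and compatible" guidance sources could mean, the sketch below weights each correctly predicted timestep by its confidence on the true class and by its cosine similarity to the erroneous timestep. The function name `guidance_weights`, the product form of the weight, and the clipping of negative similarities at zero are assumptions made for illustration only. This is not the authors' exact STA formulation, which is defined in the method section.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def guidance_weights(logits_per_step, label):
    """For each erroneous timestep, weight every correct timestep as a source.

    Hypothetical rule (an assumption, not the paper's equation):
    weight = source confidence on the true class
             * non-negative cosine similarity between the two logit vectors.
    """
    preds = [max(range(len(z)), key=z.__getitem__) for z in logits_per_step]
    wrong = [t for t, p in enumerate(preds) if p != label]
    right = [t for t, p in enumerate(preds) if p == label]
    weights = {}
    for e in wrong:
        raw = {
            c: softmax(logits_per_step[c])[label]
            * max(0.0, cosine(logits_per_step[e], logits_per_step[c]))
            for c in right
        }
        total = sum(raw.values())
        # Normalise so the weights over sources sum to 1 when any source survives.
        weights[e] = {c: (w / total if total > 0 else 0.0) for c, w in raw.items()}
    return weights


# Four timesteps, three classes, true class 0; only timestep 1 is wrong.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [3.0, 0.2, 0.1], [1.0, 0.9, 0.0]]
print(guidance_weights(logits, label=0))
```

Under this reading, dropping the confidence factor would correspond loosely to STA-NoConf, dropping the similarity factor to STA-NoSim, and setting every weight to a constant over all timestep pairs to UTA.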
### 5.4 Visualization
#### 5.4.1 Temporal Discrepancy
To better understand how selective alignment affects temporal prediction dynamics, we visualize per-timestep class logits on DVS-CIFAR10 in Figure [6](https://arxiv.org/html/2605.14252#S5.F6). Compared with LogitSNN, which exhibits a largely time-invariant pattern where a subset of classes remains dominant across most timesteps, our method produces more structured temporal trajectories. The dominant class can shift over time, and certain classes show larger logit variation across timesteps. For example, given that the ground-truth label is $C_3$, the logit of $C_0$ transitions from negative to strongly positive. Overall, the visualization suggests that SeAl-KD redistributes evidence over time and encourages temporally differentiated class responses, rather than maintaining a fixed class preference across timesteps.
#### 5.4.2 t-SNE
To better understand the effect of selective temporal alignment, we visualize the learned representations using t-SNE, as shown in Figure [5](https://arxiv.org/html/2605.14252#S5.F5). Compared with direct training and previous distillation methods, our approach produces more compact class clusters with clearer separation. Notably, the circled regions exhibit less class overlap and fewer scattered samples. Overall, the visualization indicates that selective alignment leads to more discriminative and well-structured representations over time.
Figure 6: Heatmap of per-timestep class logits on DVS-CIFAR10 for LogitSNN and SeAl-KD.
## 6 Conclusion
This work examines the limitations of uniform timestep\-wise KD in SNNs and shows that treating all temporal states equally can impose overly rigid supervision, even though intermediate timesteps need not all be individually correct\. Instead, erroneous timesteps should receive guidance that moves them toward the correct final outcome\. We propose SeAl\-KD, which selectively aligns class\-level and temporal knowledge via ELA and STA\. Our analysis shows that ELA and STA reduce the influence of erroneous, low\-confidence, or incompatible timesteps during alignment, enabling more effective correction where it is needed\. Experiments on both static and neuromorphic datasets further confirm that this selective alignment strategy consistently improves accuracy with only minimal computational overhead\.
## Ethical Statement
There are no ethical issues\.
## Acknowledgments
This work was supported by start-up funds (No. MSRI8001004 and No. MSRI9002005) and by Monash eResearch capabilities, including M3.
## References
- G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass (2018) Long short-term memory and learning-to-learn in networks of spiking neurons. Advances in Neural Information Processing Systems 31.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- S. Deng, Y. Li, S. Zhang, and S. Gu (2022) Temporal efficient training of spiking neural network via gradient re-weighting. In International Conference on Learning Representations.
- Y. Ding, L. Zuo, M. Jing, P. He, and H. Deng (2025a) Rethinking spiking neural networks from an ensemble learning perspective. International Conference on Learning Representations.
- Y. Ding, L. Zuo, M. Jing, K. Yang, P. He, and T. Xie (2025b) Synergy between the strong and the weak: spiking neural networks are inherently self-distillers. Advances in Neural Information Processing Systems.
- C. Duan, J. Ding, S. Chen, Z. Yu, and T. Huang (2022) Temporal effective batch normalization in spiking neural networks. Advances in Neural Information Processing Systems 35, pp. 34377–34390.
- W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian (2021) Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2661–2671.
- Y. Guo, W. Peng, Y. Chen, L. Zhang, X. Liu, X. Huang, and Z. Ma (2023) Joint A-SNN: joint training of artificial and spiking neural networks via self-distillation and weight factorization. Pattern Recognition 142, pp. 109639.
- Y. Guo, X. Tong, Y. Chen, L. Zhang, X. Liu, Z. Ma, and X. Huang (2022) RecDis-SNN: rectifying membrane potential distribution for directly training spiking neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 326–335.
- S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28.
- D. Hong, Y. Qi, and Y. Wang (2025) LaSNN: layer-wise ANN-to-SNN distillation for effective and efficient training in deep spiking neural networks. Neurocomputing, pp. 131351.
- A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
- Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, and S. Gu (2021a) Differentiable spike: rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems 34, pp. 23426–23439.
- Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, and S. Gu (2021b) Differentiable spike: rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems 34, pp. 23426–23439.
- S. Lian, J. Shen, Q. Liu, Z. Wang, R. Yan, and H. Tang (2023) Learnable surrogate gradient for direct training spiking neural networks. In IJCAI, pp. 3002–3010.
- W. Maass (1997) Networks of spiking neurons: the third generation of neural network models. Neural Networks 10(9), pp. 1659–1671.
- Q. Meng, M. Xiao, S. Yan, Y. Wang, Z. Lin, and Z. Luo (2023) Towards memory- and time-efficient backpropagation for training spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6166–6176.
- E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine 36(6), pp. 51–63.
- G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor (2015) Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9, pp. 437.
- H. Qiu, M. Ning, Z. Song, W. Fang, Y. Chen, T. Sun, Z. Ma, L. Yuan, and Y. Tian (2024) Self-architectural knowledge distillation for spiking neural networks. Neural Networks 178, pp. 106475.
- Z. Wang, R. Jiang, S. Lian, R. Yan, and H. Tang (2023) Adaptive smoothing gradient learning for spiking neural networks. In International Conference on Machine Learning, pp. 35798–35816.
- Q. Xu, Y. Li, J. Shen, J. K. Liu, H. Tang, and G. Pan (2023) Constructing deep spiking neural networks from artificial neural networks with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7886–7895.
- Z. Xu, K. You, Q. Guo, X. Wang, and Z. He (2024) BKDSNN: enhancing the performance of learning-based spiking neural networks training with blurred knowledge distillation. In European Conference on Computer Vision, pp. 106–123.
- S. Yang, C. Yu, L. Liu, H. Ma, A. Wang, and E. Li (2025) Efficient ANN-guided distillation: aligning rate-based features of spiking neural networks through hybrid block-wise replacement. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10025–10035.
- M. Yao, J. Hu, Z. Zhou, L. Yuan, Y. Tian, B. Xu, and G. Li (2023) Spike-driven transformer. Advances in Neural Information Processing Systems 36, pp. 64043–64058.
- X. Yao, F. Li, Z. Mo, and J. Cheng (2022) GLIF: a unified gated leaky integrate-and-fire neuron for spiking neural networks. Advances in Neural Information Processing Systems 35, pp. 32160–32171.
- C. Yu, L. Liu, G. Wang, E. Li, and A. Wang (2024) Advancing training efficiency of deep spiking neural networks through rate-based backpropagation. Advances in Neural Information Processing Systems 37, pp. 115786–115815.
- C. Yu, X. Zhao, L. Liu, S. Yang, G. Wang, E. Li, and A. Wang (2025a) Efficient logit-based knowledge distillation of deep spiking neural networks for full-range timestep deployment. In Proceedings of the 42nd International Conference on Machine Learning.
- K. Yu, C. Yu, T. Zhang, X. Zhao, S. Yang, H. Wang, Q. Zhang, and Q. Xu (2025b) Temporal separation with entropy regularization for knowledge distillation in spiking neural networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8806–8816.
- T. Zhang, Z. Zhu, K. Yu, and H. Wang (2025) Head-tail-aware KL divergence in knowledge distillation for spiking neural networks. arXiv preprint arXiv:2504.20445.
- D. Zhao, G. Shen, Y. Dong, Y. Li, and Y. Zeng (2025) Improving stability and performance of spiking neural networks through enhancing temporal consistency. Pattern Recognition 159, pp. 111094.
- H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li (2021) Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 11062–11070.
- Z. Zhou, Y. Zhu, C. He, Y. Wang, S. Yan, Y. Tian, and L. Yuan (2023) Spikformer: when spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations.
- L. Zuo, Y. Ding, M. Jing, K. Yang, and Y. Yu (2024) Self-distillation learning based on temporal-spatial consistency for spiking neural networks. arXiv preprint arXiv:2406.07862.
## Appendix A Structural Temporal Fluctuations in LIF Dynamics
The leak–integrate–reset mechanism of LIF neurons inherently produces temporal fluctuations in membrane potentials, even under constant input. Consider a single LIF neuron receiving a fixed current $I$. Its membrane potential evolves as

$$u_{t+1} = \alpha u_{t} + I - V_{\mathrm{th}}\, s_{t}, \qquad s_{t} \in \{0,1\}, \tag{17}$$

where $0<\alpha<1$ is the leak factor and $V_{\mathrm{th}}>0$ is the firing threshold. In any non-trivial firing regime (neither always silent nor always firing), the neuron exhibits both spike and non-spike timesteps. Consequently, the update increments take two distinct forms:

$$u_{t+1} - u_{t} = (\alpha-1)\,u_{t} + I \qquad \text{(no spike)}, \tag{18}$$

$$u_{t+1} - u_{t} = (\alpha-1)\,u_{t} + I - V_{\mathrm{th}} \qquad \text{(spike)}. \tag{19}$$

These two increments differ by exactly $V_{\mathrm{th}}>0$, so the sequence $\{u_{t}\}$ cannot remain constant over time and must visit at least two distinct membrane-potential levels with non-zero frequency. This implies a strictly positive temporal variance, showing that temporal fluctuations arise naturally from the LIF update rule.
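The argument can be checked numerically with a minimal sketch that iterates Eq. (17) under a constant input. The firing rule (spike when the integrated potential reaches the threshold) and the parameter values are assumptions chosen for illustration, since the appendix does not fix them.

```python
def simulate_lif(alpha=0.9, current=0.3, v_th=1.0, steps=50):
    """Iterate u_{t+1} = alpha*u_t + I - V_th*s_t for a constant input I."""
    u = 0.0
    potentials, spikes = [], []
    for _ in range(steps):
        # Assumed firing rule: spike when the integrated potential reaches V_th.
        s = 1 if alpha * u + current >= v_th else 0
        u = alpha * u + current - v_th * s  # Eq. (17), reset by subtraction
        potentials.append(u)
        spikes.append(s)
    return potentials, spikes


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


potentials, spikes = simulate_lif()
print("spike count:", sum(spikes), "of", len(spikes))
print("temporal variance of u_t:", variance(potentials))
```

With these values the no-spike fixed point $I/(1-\alpha)=3$ lies above the threshold, so the neuron is in a non-trivial regime: it alternates between spike and non-spike steps, and the membrane potential has strictly positive variance even though the input never changes.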
| Dataset | Batch Size | Epochs | Learning Rate | Weight Decay | Student Architecture | Teacher Architecture | Teacher Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 128 | 300 | 0.1 | 5e-4 | ResNet-18 | ResNet-34 | 97.06 |
| | | | | | ResNet-19 | ResNet-19 | 97.20 |
| CIFAR-100 | 128 | 300 | 0.1 | 5e-4 | ResNet-18 | ResNet-34 | 81.31 |
| | | | | | ResNet-19 | ResNet-19 | 82.57 |
| DVS-CIFAR10 | 32 | 300 | 0.1 | 5e-4 | ResNet-18 | ResNet-19-T4 | 83.80 |
| | | | | | | ResNet-19-T10 | 83.60 |
| ImageNet | 512 | 100 | 0.2 | 2e-5 | ResNet-34 | ResNet-34 | 71.24 |

Table 7: Training settings across different datasets.
## Appendix B Theoretical Analysis Across Timesteps
This section extends the theoretical and statistical analysis in the main text by reporting the corresponding layer-wise statistics over all timesteps. Together with Figure [3](https://arxiv.org/html/2605.14252#S4.F3), these results show that the trends supporting the three propositions remain consistent across timesteps.
## Appendix C Dataset Details
### C.1 Static Image Datasets
CIFAR-10 and CIFAR-100. CIFAR-10 and CIFAR-100 Krizhevsky et al. [[2009](https://arxiv.org/html/2605.14252#bib.bib54)] are image classification datasets, each consisting of 60,000 color images with a spatial resolution of $32\times 32$. Each dataset contains 50,000 training images and 10,000 test images. CIFAR-10 includes 10 object classes, while CIFAR-100 contains 100 classes with finer granularity. Both datasets share the same image format and data split. Standard preprocessing and data augmentation are applied during training.
ImageNet. ImageNet Deng et al. [[2009](https://arxiv.org/html/2605.14252#bib.bib55)] is an image classification dataset containing approximately 1.28 million training images and 50,000 validation images spanning 1,000 object categories. The images have varying spatial resolutions and are resized and cropped during preprocessing to match the input size required by the models.
### C.2 Neuromorphic Dataset
DVS-CIFAR10. DVS-CIFAR10 Orchard et al. [[2015](https://arxiv.org/html/2605.14252#bib.bib56)] is a neuromorphic dataset derived from CIFAR-10 and recorded using a Dynamic Vision Sensor. Instead of frame-based images, the dataset represents visual information as asynchronous streams of events, where each event is described by its spatial location, timestamp, and the polarity of the brightness change. It contains 10 object classes corresponding to those of CIFAR-10. For training and evaluation, the event streams are converted into discrete-time representations by accumulating events within fixed temporal windows.
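The conversion from an event stream to discrete-time frames can be sketched as follows. The event layout `(x, y, t, polarity)`, the two-channel output (one channel per polarity), and the equal-width windows are assumptions for illustration. The function `events_to_frames` is hypothetical and is not taken from the released code.

```python
def events_to_frames(events, num_steps, height, width):
    """Accumulate (x, y, t, polarity) events into num_steps two-channel frames."""
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    window = (t_max - t_min) / num_steps or 1.0  # guard against zero duration
    # frames[step][channel][row][col]; channel 0 = negative, 1 = positive polarity
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_steps)]
    for x, y, t, p in events:
        # The latest timestamp sits on the upper edge, so clamp it into the last bin.
        step = min(int((t - t_min) / window), num_steps - 1)
        frames[step][1 if p > 0 else 0][y][x] += 1
    return frames


# Four toy events on a 2x2 sensor, split into two temporal windows.
events = [(0, 0, 0.0, 1), (1, 0, 0.4, 0), (1, 1, 0.6, 1), (0, 1, 1.0, 1)]
frames = events_to_frames(events, num_steps=2, height=2, width=2)
print(frames[0])  # events with t in the first half
print(frames[1])  # events with t in the second half
```

Each resulting frame plays the role of the input at one SNN timestep. Averaging the frames over the time axis would give a single static input of the kind used for ANN teacher training.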
## Appendix D Training Details
All experiments are optimized using stochastic gradient descent with momentum set to 0\.9\. The learning rate is scheduled by a cosine decay strategy throughout training\. All implementations are based on the PyTorch framework\. For static image benchmarks, including CIFAR and ImageNet, different training configurations are adopted according to dataset scale\. Models on CIFAR datasets are trained using a single NVIDIA A100 GPU\. For ImageNet, distributed data parallelism is employed across eight A100 GPUs\. For event\-based vision tasks, experiments on the DVS\-CIFAR10 dataset follow the same optimization strategy and are conducted on a single A100 GPU\. Input event streams are processed into frame\-based representations before being fed into the network\. Detailed network configurations and all hyperparameter choices for different datasets and model variants are provided in Table[7](https://arxiv.org/html/2605.14252#A1.T7)\.
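For reference, the optimizer settings described above can be written out as a small framework-free sketch. The base learning rate 0.1, weight decay 5e-4, and 300 epochs follow the CIFAR rows of Table 7. The decay endpoint of zero, the absence of warm-up, and the per-epoch (rather than per-iteration) schedule are assumptions, as the text does not specify them.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs):
    """Cosine decay from base_lr at epoch 0 towards 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=5e-4):
    """One SGD-with-momentum update on a scalar weight.

    Weight decay is added to the gradient as an L2 term, then the velocity
    is updated as v = momentum * v + grad and the weight as w = w - lr * v.
    """
    grad = grad + weight_decay * weight
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity


for epoch in (0, 150, 300):
    print(epoch, cosine_lr(0.1, epoch, 300))
```

In PyTorch, the same configuration would typically be expressed with `torch.optim.SGD` and `torch.optim.lr_scheduler.CosineAnnealingLR`.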
Figure 8: Sensitivity analysis of hyperparameters on three datasets under $T=4$: (a) CIFAR-10, (b) CIFAR-100, (c) DVS-CIFAR10.
## Appendix E Energy Consumption Analysis
To quantify the computational energy cost of SNNs, we follow a commonly adopted evaluation protocol in neuromorphic computing, which characterizes energy consumption in terms of synaptic operations Zhou et al. [[2023](https://arxiv.org/html/2605.14252#bib.bib34)]. Specifically, the overall synaptic operation power (SOP) is modeled as the weighted sum of accumulation and multiply-accumulate operations:

$$\text{SOP}_{s} = E_{\mathrm{AC}} \cdot AC_{s} + E_{\mathrm{MAC}} \cdot MAC_{s}, \tag{20}$$

where $AC_{s}$ and $MAC_{s}$ denote the total numbers of accumulation (AC) and multiply-accumulate (MAC) operations, respectively. The coefficients $E_{\mathrm{AC}}$ and $E_{\mathrm{MAC}}$ correspond to the energy cost of a single operation of each type. Following the hardware energy model introduced in Han et al. [[2015](https://arxiv.org/html/2605.14252#bib.bib37)], a 32-bit floating-point addition is assumed to consume $0.9$ picojoules (pJ), while a 32-bit MAC operation requires $4.6$ pJ.

In SNNs, information is conveyed through discrete spike events. Let $s_{i}^{l}[t] \in \{0,1\}$ indicate whether neuron $i$ in layer $l$ emits a spike at timestep $t$. Whenever a spike is generated, all outgoing synapses of that neuron are activated, and each activated synapse contributes one accumulation operation. If neuron $i$ in layer $l$ has $f_{i}^{l}$ outgoing connections, the total number of AC operations accumulated over the entire network and all timesteps can be written as:

$$AC_{s} = \sum_{t=1}^{T} \sum_{l=1}^{L-1} \sum_{i=1}^{N^{l}} f_{i}^{l}\, s_{i}^{l}[t], \tag{21}$$

where $T$ denotes the simulation length in timesteps, $L$ is the total number of layers, and $N^{l}$ is the number of neurons in the $l$-th layer.
By contrast, artificial neural networks \(ANNs\) do not exhibit temporal spiking behavior\. Each neuron performs a single feedforward computation, and every synaptic connection contributes exactly one MAC operation\. Therefore, the total number of MAC operations in an ANN is solely determined by the network connectivity:
$$MAC_{s} = \sum_{l=1}^{L-1} \sum_{i=1}^{N^{l}} f_{i}^{l}. \tag{22}$$
Using the above formulations, the SOP of both SNNs and ANNs can be consistently estimated by combining the corresponding operation counts with their associated per\-operation energy costs\.
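Eqs. (20) to (22) can be turned into a short calculation, sketched below. The nested-list layout for fan-out and spikes and the toy network are hypothetical. The per-operation energies are the 0.9 pJ and 4.6 pJ values stated above, and operation counts expressed in millions multiplied by picojoules give microjoules directly.

```python
E_AC_PJ = 0.9   # energy per 32-bit accumulation, in pJ
E_MAC_PJ = 4.6  # energy per 32-bit multiply-accumulate, in pJ


def count_acs(fan_out, spikes):
    """Eq. (21): sum f_i^l * s_i^l[t] over timesteps, layers, and neurons.

    fan_out[l][i] is the number of outgoing connections of neuron i in layer l;
    spikes[t][l][i] is 1 if that neuron fires at timestep t, else 0.
    """
    return sum(
        fan_out[l][i] * spikes_t[l][i]
        for spikes_t in spikes
        for l in range(len(fan_out))
        for i in range(len(fan_out[l]))
    )


def count_macs(fan_out):
    """Eq. (22): one MAC per synaptic connection."""
    return sum(sum(layer) for layer in fan_out)


def energy_uj(acs_millions, macs_millions):
    """Eq. (20): counts in millions times pJ per operation gives microjoules."""
    return E_AC_PJ * acs_millions + E_MAC_PJ * macs_millions


# Toy network (hypothetical): two spiking layers, two timesteps.
fan_out = [[3, 3], [2]]
spikes = [[[1, 0], [1]], [[0, 0], [0]]]
print(count_acs(fan_out, spikes), count_macs(fan_out))

# Operation counts of the ANN ResNet-18 entry on CIFAR-10 in Table 8.
print(round(energy_uj(0.56, 549.13), 2))
```

Plugging in the reported ANN operation counts (0.56 M ACs and 549.13 M MACs) gives roughly 2526.50 µJ, which agrees with the corresponding energy entry in Table 8.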
Table [8](https://arxiv.org/html/2605.14252#A6.T8) indicates that SNN energy is primarily driven by ACs, while MACs remain essentially constant under the same architecture. For each dataset, we report results using the maximum number of timesteps $T$ adopted in this paper. Compared with ANNs, SNNs can therefore achieve substantially lower energy consumption because most computations shift away from MAC-dominated processing and become event-driven. Under this setting, the energy gap across different SNN methods is largely explained by how many spikes they produce, since firing activity directly determines the number of ACs. Our method does not increase the firing rate or energy consumption. It maintains comparable energy and can even reduce it for certain architecture–dataset combinations by lowering the firing rate relative to LogitSNN, thereby reducing ACs without sacrificing accuracy. In addition, the training-time overhead relative to LogitSNN is negligible (for example, 5.07 vs. 5.06 GPU hours for ResNet-18 on CIFAR-10), suggesting that the method introduces only lightweight operations during optimization.
## Appendix F Hyperparameter Sensitivity Analysis
We analyze the sensitivity of the proposed method to the weighting coefficients $\alpha$ for ELA and $\beta$ for STA. We first set $\beta=0$ and vary $\alpha$ to evaluate ELA alone, and then fix $\alpha=0.6$ and vary $\beta$ to evaluate STA on top of ELA. Figure [8](https://arxiv.org/html/2605.14252#A4.F8) reports the results on CIFAR-10, CIFAR-100, and DVS-CIFAR10.
The performance changes are small across a reasonable range of coefficients, and the improvements over the corresponding baseline remain consistent. These results indicate that the proposed method is relatively insensitive to the choice of $\alpha$ and $\beta$ and does not require careful tuning. Based on this analysis, we use $\alpha=0.6$ and $\beta=0.15$ in all experiments.
| Data | Architecture | Method | GPU h (h) | Fire Rate (%) | ACs (M) | MACs (M) | Energy (µJ) | Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | ResNet-18 | ANN | 0.52 | - | 0.56 | 549.13 | 2526.50 | 97.06 |
| | | VanillaSNN | 4.66 | 14.33 | 73.12 | 3.34 | 80.62 | 95.66 |
| | | KDSNN | 4.90 | 14.83 | 75.85 | 3.34 | 86.15 | 95.78 |
| | | LogitSNN | 5.06 | 15.74 | 82.26 | 3.34 | 90.90 | 96.12 |
| | | SeAl (Ours) | 5.07 | 14.89 | 81.56 | 3.34 | 88.78 | 96.18 |
| CIFAR-100 | ResNet-18 | ANN | 0.52 | - | 0.56 | 549.18 | 2526.72 | 81.31 |
| | | VanillaSNN | 4.64 | 17.45 | 92.92 | 3.34 | 99.82 | 78.33 |
| | | KDSNN | 4.82 | 17.99 | 93.65 | 3.34 | 101.93 | 79.31 |
| | | LogitSNN | 5.08 | 18.55 | 99.49 | 3.34 | 108.27 | 80.07 |
| | | SeAl (Ours) | 5.08 | 17.48 | 98.33 | 3.34 | 103.87 | 80.25 |
| CIFAR-10 | ResNet-19 | ANN | 1.17 | - | 1.44 | 2268.60 | 10436.84 | 97.20 |
| | | VanillaSNN | 11.66 | 13.11 | 286.39 | 8.65 | 296.99 | 96.73 |
| | | KDSNN | 11.92 | 12.24 | 271.88 | 8.65 | 285.06 | 96.98 |
| | | LogitSNN | 12.12 | 15.42 | 317.46 | 8.65 | 319.52 | 97.04 |
| | | SeAl (Ours) | 12.14 | 15.14 | 310.54 | 8.65 | 319.77 | 97.23 |
| CIFAR-100 | ResNet-19 | ANN | 1.17 | - | 1.44 | 2268.62 | 10436.95 | 82.57 |
| | | VanillaSNN | 11.79 | 16.15 | 350.70 | 8.65 | 355.43 | 81.11 |
| | | KDSNN | 12.08 | 16.26 | 359.65 | 8.65 | 363.49 | 82.12 |
| | | LogitSNN | 12.45 | 17.41 | 371.74 | 8.65 | 372.57 | 83.12 |
| | | SeAl (Ours) | 12.46 | 17.05 | 366.02 | 8.65 | 369.22 | 83.37 |
| DVS-CIFAR10 | ResNet-18 | ANN | 1.04 | - | 0.56 | 549.13 | 2526.50 | 83 |
| | | VanillaSNN | 4.16 | 12.68 | 807.9 | 5.57 | 752.73 | 84 |
| | | KDSNN | 4.28 | 17.21 | 1090.37 | 5.57 | 1006.96 | 84 |
| | | LogitSNN | 4.41 | 15.65 | 1030.17 | 5.57 | 952.78 | 85.6 |
| | | SeAl (Ours) | 4.45 | 14.84 | 969.08 | 5.57 | 897.8 | 86.7 |

Table 8: Comparison of energy consumption, accumulation (ACs) and multiply–accumulate (MACs) operations, training cost (GPU hours), and accuracy on CIFAR-10 and CIFAR-100 with ResNet-18/ResNet-19 at $T=6$, and on DVS-CIFAR10 with ResNet-18 at $T=10$.
## Appendix G Use of Large Language Models
Large language models were used solely for language editing (e.g., grammar and clarity). All ideas, experiments, and analyses are entirely the authors' own.