Hamiltonian-Inspired Attention Mechanism for Scalable RF Transmitter Fingerprinting

arXiv cs.AI 06/01/26, 04:00 AM Papers
rf-fingerprinting transformer physics-informed attention-mechanism wireless-security iq-signals deep-learning
Summary
Proposes a Hamiltonian Transformer, a physics-informed attention mechanism that enforces norm-preserving value dynamics for RF transmitter fingerprinting, achieving 99.12% accuracy in same-day conditions and 61.64% with 150 transmitters, outperforming CNN and Transformer baselines.
arXiv:2605.30364v1 Announce Type: cross
Cached at: 06/01/26, 09:29 AM
# Hamiltonian-Inspired Attention Mechanism for Scalable RF Transmitter Fingerprinting
Source: [https://arxiv.org/html/2605.30364](https://arxiv.org/html/2605.30364)
###### Abstract

Radio-frequency (RF) fingerprinting identifies wireless transmitters using hardware-induced imperfections present in baseband I/Q signals. However, deep learning models often degrade under receiver and channel distribution shifts, particularly as transmitter populations grow. This work proposes the *Hamiltonian Transformer*, a physics-informed attention architecture that enforces norm-preserving value dynamics within each attention head using a learned skew-symmetric generator and a Störmer–Verlet leapfrog integration step. An additional phase-increment embedding exposes oscillator dynamics at the input layer. All experiments use non-equalized raw I/Q signals from the WiSig dataset under four protocols: same-day classification, cross-receiver generalisation, cross-day generalisation, and transmitter scaling up to 150 devices. The Hamiltonian Transformer achieves $99.12\%$ accuracy under same-day conditions and $61.64\%$ at 150 transmitters, consistently outperforming CNN and Transformer baselines across all scale points. A controlled ablation study identifies norm-preservation in the value update as the primary inductive bias driving the scaling advantage, with the phase-increment embedding providing the single largest per-component improvement. These results indicate that embedding physics-informed structural priors into attention mechanisms is an effective approach to large-scale transmitter identification on raw wireless signals.

## I Introduction

The proliferation of low\-cost wireless devices has made physical\-layer security an increasingly important concern\. Consumer\-grade WiFi chipsets, IoT sensors, and industrial transceivers are deployed at scale in environments ranging from hospital networks to critical infrastructure\. Despite this growth, device identity is typically established through software\-layer credentials such as MAC addresses or protocol\-level handshakes, both of which can be cloned or spoofed without physical access to the hardware\. A rogue transmitter that successfully impersonates a legitimate node can inject traffic, exhaust network resources, or silently exfiltrate data while remaining indistinguishable at the network layer\.

Radio\-frequency \(RF\) fingerprinting addresses this problem by exploiting the unintentional hardware imperfections introduced during semiconductor manufacturing\. Even transmitters from identical production runs exhibit measurable variations in carrier frequency offset \(CFO\), phase noise spectral density, and I/Q imbalance\. These deviations are stable over time and detectable in the baseband waveform, making them a practical basis for hardware\-level authentication\[[16](https://arxiv.org/html/2605.30364#bib.bib10)\]\.

Deep convolutional neural networks (CNNs) significantly advanced RF fingerprinting performance. Architectures such as ORACLE \[[18](https://arxiv.org/html/2605.30364#bib.bib3)\] and the large-scale study of Jian *et al.* \[[11](https://arxiv.org/html/2605.30364#bib.bib5)\] demonstrated that a CNN trained end-to-end on raw I/Q samples can exceed $95\%$ classification accuracy for a controlled transmitter set under a fixed channel. However, both training and evaluation in those studies were performed using the same receiver within a single recording session. When either the receiver or the capture day changed, accuracy dropped sharply. Al-Shawabka *et al.* \[[1](https://arxiv.org/html/2605.30364#bib.bib9)\] systematically characterised this fragility, showing that a model trained on one day’s captures can degrade to near-random performance when evaluated on the following day. In such cases the model has effectively learned channel characteristics rather than the transmitter fingerprint.

The release of the WiSig dataset \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\] enabled systematic investigation of this problem at scale. WiSig provides approximately ten million WiFi preamble captures from 174 commercial transmitters across 41 USRP receivers and four recording days on the ORBIT testbed \[[15](https://arxiv.org/html/2605.30364#bib.bib2)\]. Its pre-packaged subsets isolate distinct sources of variability: receiver identity, capture day, and transmitter population size. This structure has made WiSig a standard benchmark for receiver- and channel-agnostic RF fingerprinting, motivating subsequent work on open-set authorisation \[[8](https://arxiv.org/html/2605.30364#bib.bib8)\], data augmentation for channel resilience \[[20](https://arxiv.org/html/2605.30364#bib.bib7)\], and receiver-agnostic fingerprinting via generative adversarial networks \[[24](https://arxiv.org/html/2605.30364#bib.bib6)\].

Despite these advances, a structural limitation persists across existing architectures. CNNs capture short-range amplitude and phase patterns through local convolutional filters, while standard Transformers \[[22](https://arxiv.org/html/2605.30364#bib.bib20)\] capture long-range dependencies through unconstrained attention weights. Neither architecture incorporates prior knowledge about the physical process generating the I/Q signal. The baseband waveform of a WiFi preamble evolves approximately as a rotation in the complex plane, driven by oscillator CFO and phase noise; these dynamics are energy-conserving by construction. Because attention mechanisms are unconstrained, models can freely attend to channel-dependent amplitude variations that do not generalise across receivers or capture days. This is the fundamental source of distributional failure under shift.

To address this limitation, this paper introduces the *Hamiltonian Transformer*, a neural architecture that embeds an energy-conservation prior directly within the attention mechanism. Inspired by Hamiltonian neural networks \[[6](https://arxiv.org/html/2605.30364#bib.bib14)\] and symplectic integration theory \[[3](https://arxiv.org/html/2605.30364#bib.bib15),[7](https://arxiv.org/html/2605.30364#bib.bib17)\], the proposed model partitions value vectors within each attention head into position and momentum components and evolves them via a Störmer–Verlet leapfrog step governed by a learned skew-symmetric generator. Because skew-symmetric matrices generate orthogonal flows, the resulting value updates are norm-preserving by construction, aligning the model’s internal dynamics with the rotational behaviour of oscillator phase evolution. An explicit phase-increment embedding additionally exposes oscillator dynamics at the input representation level.

The model is evaluated on the WiSig dataset across four experimental protocols using *non-equalized raw I/Q signals* throughout; no channel preprocessing is applied at any stage. Under same-day conditions all architectures achieve high and comparable accuracy. The Hamiltonian Transformer achieves $99.12\%$ on same-day classification and $52.94\%$ on cross-receiver generalisation, the highest among all models in both settings. In the transmitter scaling experiment, the Hamiltonian Transformer maintains stable performance across all scale points and reaches $61.64\%$ at 150 transmitters, consistently outperforming both baselines. A controlled ablation study further demonstrates that norm-preserving value dynamics, whether implemented via Hamiltonian leapfrog or Cayley orthogonal parameterisation \[[10](https://arxiv.org/html/2605.30364#bib.bib18)\], consistently outperform unconstrained attention at large transmitter counts, identifying norm-preservation as the key inductive bias for large-scale transmitter identification on raw signals.

To the best of our knowledge, this is the first work to embed norm-preserving value dynamics into a Transformer attention mechanism for RF fingerprinting, and the first to systematically evaluate such constraints on the WiSig ManyTx subset using non-equalized signals. In this setting, our best-performing variant reaches $79.72\%$ at 150 transmitters, representing the highest published accuracy on non-equalized WiSig ManyTx in the literature.

The contributions of this work are:

- We introduce the *Hamiltonian Transformer*, a physics-informed attention architecture that constrains value vector dynamics to be norm-preserving through a Störmer–Verlet leapfrog update governed by a learned skew-symmetric generator, motivated by the rotational behaviour of oscillator-driven I/Q signals.
- We propose an *I/Q phase-increment embedding* that exposes transmitter oscillator dynamics directly at the input representation level, providing the single largest per-component accuracy improvement in our ablation study.

## II Related Work

Early efforts in RF fingerprinting relied on hand-crafted features derived from transient signals or spectral estimates. The landscape shifted when convolutional neural networks demonstrated that raw I/Q waveforms contain sufficient discriminative information to identify transmitters without explicit feature engineering. O’Shea *et al.* \[[14](https://arxiv.org/html/2605.30364#bib.bib11)\] showed that convolutional architectures could classify modulation schemes directly from baseband samples, establishing a paradigm later adopted by fingerprinting systems. Riyaz *et al.* \[[16](https://arxiv.org/html/2605.30364#bib.bib10)\] extended this approach to transmitter identification, demonstrating that a multi-layer CNN trained end-to-end on raw I/Q data achieves high accuracy on controlled datasets while offering practical advantages over classical feature-based pipelines.

Subsequent work refined CNN-based RF fingerprinting architectures. ORACLE \[[18](https://arxiv.org/html/2605.30364#bib.bib3)\] performed extensive ablation studies over network depth, filter width, and input representations. Sankhe *et al.* \[[17](https://arxiv.org/html/2605.30364#bib.bib4)\] further showed that hardware-level impairments, including carrier frequency offset and I/Q imbalance, could be exploited directly from raw signals without radio-specific pre-processing. A comprehensive multi-dataset evaluation by Jian *et al.* \[[11](https://arxiv.org/html/2605.30364#bib.bib5)\] confirmed that CNN-based fingerprinters generalise across waveform standards under controlled training conditions. Narrowband transmitter identification has also been explored. Zhang *et al.* \[[23](https://arxiv.org/html/2605.30364#bib.bib12)\] developed modelling techniques for narrowband emitters, while Shen *et al.* \[[19](https://arxiv.org/html/2605.30364#bib.bib13)\] addressed scalability and channel robustness in LoRa-based deployments.

While high accuracy under controlled conditions is well established, maintaining performance under changing receivers or propagation channels remains challenging. Al-Shawabka *et al.* \[[1](https://arxiv.org/html/2605.30364#bib.bib9)\] provide one of the most systematic analyses of this issue, showing that models trained under a particular channel condition can degrade to near-chance accuracy when evaluated under a different channel. This occurs because convolutional filters may inadvertently encode channel-specific characteristics alongside transmitter hardware impairments.

Data augmentation has been proposed as a mitigation strategy. Soltani *et al.* \[[20](https://arxiv.org/html/2605.30364#bib.bib7)\] showed that training with synthetically augmented channel realisations improves cross-channel generalisation. Hanna *et al.* \[[8](https://arxiv.org/html/2605.30364#bib.bib8)\] studied open-set transmitter authorisation, where the classifier must reject devices not observed during training. Zhao *et al.* \[[24](https://arxiv.org/html/2605.30364#bib.bib6)\] addressed receiver-agnostic fingerprinting using a GAN-based framework that attempts to remove receiver-induced distortions while preserving transmitter-specific features.

The WiSig dataset \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\], collected on the ORBIT testbed \[[15](https://arxiv.org/html/2605.30364#bib.bib2)\], was designed specifically to support systematic evaluation of receiver variation, channel changes, and transmitter population size within a unified experimental framework. It provides the benchmark dataset used in this work.

The Transformer architecture \[[22](https://arxiv.org/html/2605.30364#bib.bib20)\], originally proposed for natural language processing, has become widely used for structured time-series data. Its self-attention mechanism allows the model to capture dependencies across the full sequence length without the locality constraints of convolutional filters. Layer normalisation \[[2](https://arxiv.org/html/2605.30364#bib.bib21)\] stabilises training for sequences with heterogeneous statistics, an important property for I/Q signals whose amplitude envelope varies across the WiFi preamble.

Extensions of attention-based models have also been proposed for unordered or structured signal sets. Lee *et al.* \[[12](https://arxiv.org/html/2605.30364#bib.bib22)\] introduced the Set Transformer, which extends attention mechanisms to permutation-invariant settings through inducing-point pooling. Despite these advances, standard Transformer architectures impose no structural constraints on how value vectors evolve across attention layers. For RF fingerprinting tasks under distribution shift, this flexibility can allow the model to focus on channel-dependent features that fail to generalise across receivers or capture days.

Recent work has explored incorporating physical constraints directly into neural network architectures. Greydanus *et al.* \[[6](https://arxiv.org/html/2605.30364#bib.bib14)\] introduced Hamiltonian Neural Networks, which parameterise the Hamiltonian function and derive system dynamics through Hamilton’s equations, enforcing energy conservation as a hard inductive bias. Cranmer *et al.* \[[5](https://arxiv.org/html/2605.30364#bib.bib16)\] proposed Lagrangian Neural Networks, which instead parameterise the Lagrangian and derive equations of motion through the Euler–Lagrange formalism.

Chen *et al.* \[[3](https://arxiv.org/html/2605.30364#bib.bib15)\] demonstrated that recurrent architectures equipped with symplectic integration steps better preserve phase-space structure over long rollouts. The mathematical foundation for these approaches lies in geometric numerical integration theory \[[7](https://arxiv.org/html/2605.30364#bib.bib17)\], which shows that symplectic integrators preserve a modified Hamiltonian over long time horizons.

The Hamiltonian-based Transformer proposed in this work adapts this principle to attention mechanisms. Value vectors are partitioned into position and momentum components and evolved through a leapfrog integration step governed by a learned skew-symmetric generator, so that each update approximates a norm-preserving rotation.

## III Methodology

This section describes the input representation, the three architectures evaluated on the WiSig dataset, and the training procedure applied uniformly across all models\. The CNN and standard Transformer serve as discriminative baselines representing established approaches to RF fingerprinting\. The Hamiltonian Transformer is the proposed model and constitutes the primary contribution of this work\. All models receive the same normalised input tensor and are trained under identical optimisation conditions to ensure fair comparison\.

### III-A Motivation for the Hamiltonian Prior

A WiFi preamble is produced by a local oscillator whose carrier frequency offset (CFO) and phase noise are stable hardware imperfections unique to each transmitter. In the complex baseband, these impairments manifest as a near-constant rotation of the signal vector $z(t)=I(t)+jQ(t)$. Pure rotation preserves $\lVert z(t)\rVert$ and therefore conserves signal energy, a property that holds by construction for oscillator-driven dynamics. Unconstrained architectures impose no such constraint, leaving them free to learn channel-induced amplitude variations that are receiver-specific and temporally unstable. When training and test conditions diverge, these spurious features become the primary source of distributional failure.
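
As a worked idealisation of this argument, consider a transmitter with constant amplitude $A$, constant frequency offset $\Delta f$, and initial phase $\theta$ (these three symbols are illustrative assumptions, not notation from the paper), observed without any channel distortion:

$$z(t) = A\,e^{\,j(2\pi \Delta f\, t + \theta)} \quad\Longrightarrow\quad \lvert z(t)\rvert = A, \qquad \angle z(t) = 2\pi \Delta f\, t + \theta .$$

The magnitude is constant and the phase grows linearly in time. Multiplying $z(t)$ by a positive time-varying gain $g(t)$ changes the magnitude to $g(t)\,A$ but leaves the phase untouched, which is the sense in which amplitude variation and rotation can be separated in this simplified picture.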

To validate this physical intuition empirically, two metrics are computed across all four WiSig subsets using 256-sample snapshots. The *Circularity Index* (CI) quantifies how closely the I/Q trajectory approximates a circle in the complex plane:

$$\operatorname{CI} = 1 - \frac{\sigma(\lvert z\rvert)}{\mu(\lvert z\rvert)}, \tag{1}$$

where $\operatorname{CI}=1$ denotes a perfect circle with constant signal energy. The *Phase Linearity* (PL) measures the coefficient of determination $R^{2}$ of a linear fit to the unwrapped phase $\phi(t)=\angle z(t)$, with $\text{PL}=1$ indicating a constant rotation rate consistent with a pure oscillator.
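
The two metrics can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the population standard deviation is assumed for $\sigma$, and a snapshot is assumed to be a list of Python `complex` samples.

```python
import cmath
import math


def circularity_index(z):
    """CI = 1 - std(|z|) / mean(|z|) over one snapshot (Eq. 1)."""
    mags = [abs(s) for s in z]
    mean = sum(mags) / len(mags)
    # Assumption: population (not sample) standard deviation.
    std = math.sqrt(sum((m - mean) ** 2 for m in mags) / len(mags))
    return 1.0 - std / mean


def unwrap(phases):
    """Remove 2*pi jumps so consecutive phases differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        step = (p - out[-1] + math.pi) % (2 * math.pi) - math.pi
        out.append(out[-1] + step)
    return out


def phase_linearity(z):
    """PL = R^2 of a least-squares line fitted to the unwrapped phase."""
    phi = unwrap([cmath.phase(s) for s in z])
    n = len(phi)
    t_mean = (n - 1) / 2
    phi_mean = sum(phi) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_tp = sum((t - t_mean) * (p - phi_mean) for t, p in enumerate(phi))
    slope = s_tp / s_tt
    ss_res = sum(
        (p - phi_mean - slope * (t - t_mean)) ** 2 for t, p in enumerate(phi)
    )
    ss_tot = sum((p - phi_mean) ** 2 for p in phi)
    # Assumption: a perfectly constant phase counts as perfectly linear.
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
```

For a synthetic pure tone such as `[cmath.exp(1j * 0.3 * t) for t in range(256)]`, both functions return values that are numerically indistinguishable from 1, while adding amplitude modulation lowers CI and leaves PL unchanged.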

As shown in Fig. [1](https://arxiv.org/html/2605.30364#S3.F1), normalised I/Q trajectories across all four subsets trace arcs that closely approximate the unit circle. Fig. [2](https://arxiv.org/html/2605.30364#S3.F2) quantifies this: between $76\%$ and $91\%$ of snapshots satisfy $\operatorname{CI}>0.5$, and between $60\%$ and $92\%$ satisfy $\text{PL}>0.7$ across all subsets. These proportions confirm that the dominant signal structure is rotational and oscillator-driven. Deviations from perfect circularity correspond to channel-induced amplitude perturbations. The Hamiltonian prior constrains the model’s internal dynamics to the energy-conserving subspace associated with transmitter oscillator behaviour, thereby discouraging the model from encoding channel-dependent features.

![Refer to caption](https://arxiv.org/html/2605.30364v1/x1.png)

Figure 1: Normalised I/Q trajectories in the complex plane for all four WiSig subsets (three overlaid snapshots per subset). Trajectories approximate the unit circle, consistent with oscillator-driven rotational dynamics.

![Refer to caption](https://arxiv.org/html/2605.30364v1/x2.png)

Figure 2: Circularity Index (CI) and Phase Linearity (PL) distributions across all WiSig subsets. Percentages indicate the fraction of snapshots with $\operatorname{CI}>0.5$ and $\text{PL}>0.7$ respectively. The consistently high proportions motivate the Hamiltonian energy-conservation prior.

### III-B Input Representation

Each WiSig snapshot consists of 256 complex baseband samples captured at 25 Msps, stored as interleaved in-phase and quadrature components $\mathbf{x}\in\mathbb{R}^{256\times 2}$. Prior to training, each snapshot is normalised channel-wise to zero mean and unit standard deviation. This removes large power variations across receiver–transmitter pairs while preserving the relative I/Q structure that encodes hardware-level fingerprint information. All models receive the same normalised tensor as input. Crucially, no channel equalisation is applied; all experiments operate on raw non-equalized signals throughout.
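
The normalisation step can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the statistics are computed per snapshot and separately for the I and Q columns, and the function name and the small epsilon are hypothetical additions.

```python
import math


def normalise_snapshot(snapshot):
    """Standardise the I and Q columns of one snapshot independently.

    `snapshot` is a list of [I, Q] pairs (256 rows in WiSig).
    """
    n = len(snapshot)
    out = [[0.0, 0.0] for _ in range(n)]
    for ch in range(2):
        col = [row[ch] for row in snapshot]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        # Assumption: a small epsilon guards against a constant channel.
        for t in range(n):
            out[t][ch] = (col[t] - mean) / (std + 1e-12)
    return out
```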

### III-C CNN Baseline

The CNN baseline replicates the architecture used in the original WiSig evaluation \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\]. The input tensor is treated as a single-channel $2\times 256$ image and processed through five convolutional layers with filter counts $(8,16,32,16,16)$ and kernel sizes $(2\times 3)$, $(1\times 3)$, $(1\times 3)$, $(1\times 3)$, and $(1\times 3)$ respectively. Max-pooling with stride 2 is applied after the first three convolutional layers to progressively downsample the temporal dimension. The resulting feature map is flattened and passed through three fully connected layers of width 100, 80, and $N$, where $N$ is the number of transmitter classes. ReLU activations are used throughout and the output layer produces logits optimised with cross-entropy loss.

The CNN processes I and Q jointly as two spatial rows, enabling convolutional filters to learn cross\-channel activation patterns\. Its inductive bias of local receptive fields and shared weights along the temporal axis suits the short\-range amplitude and phase distortions produced by transmitter hardware imperfections\.

### III-D Standard Transformer Baseline

The Transformer baseline treats each of the 256 time steps as a token with feature dimension 2 (I and Q). A linear projection maps each token to an embedding of dimension $D=128$, and sinusoidal positional encodings are added to encode temporal order. Four pre-normalisation encoder layers are stacked, each with $H=4$ attention heads and a feed-forward hidden dimension of 256. Global average pooling over the temporal dimension yields a fixed-length representation passed to a linear classification head.
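
For concreteness, the positional encoding can be sketched as below. The text only states that the encodings are sinusoidal; the sketch assumes the standard interleaved sine/cosine form with base 10000 from the original Transformer paper \[[22](https://arxiv.org/html/2605.30364#bib.bib20)\], and the function name is hypothetical.

```python
import math


def sinusoidal_positions(length=256, dim=128):
    """Return a length x dim table of sinusoidal positional encodings."""
    table = []
    for pos in range(length):
        row = []
        for i in range(0, dim, 2):
            # Assumption: base 10000, sine on even and cosine on odd indices.
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table
```

Each row would be added element-wise to the projected token at the same time step before the first encoder layer.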

Standard scaled dot\-product attention imposes no structural constraint on how value vectors evolve across layers\. For RF fingerprinting under distribution shift, this flexibility permits the model to attend to channel\-dependent features that do not generalise, which is the fundamental limitation the Hamiltonian prior is designed to address\.

### III-E Hamiltonian Transformer

The Hamiltonian Transformer embeds a physics-informed constraint directly within the attention mechanism. Rather than allowing value vectors to be updated arbitrarily, they are evolved through a Hamiltonian dynamical system implemented via Störmer–Verlet leapfrog integration. Because the generator of this system is constrained to be skew-symmetric, the resulting update is norm-preserving by construction, a property that mirrors the energy-conserving rotational behaviour of oscillator-driven I/Q signals. The full architecture is illustrated in Fig. [3](https://arxiv.org/html/2605.30364#S3.F3).

![Refer to caption](https://arxiv.org/html/2605.30364v1/x3.png)

Figure 3: Hamiltonian Transformer architecture. Value vectors in each attention head are split into position and momentum components, evolved via Störmer–Verlet leapfrog integration with a learned skew-symmetric generator, and recombined before aggregation.

#### III-E1 I/Q Phase Embedding

To expose oscillator dynamics directly at the input level, each time step is augmented with the instantaneous phase increment

$$\delta\phi_{t} = \phi_{t} - \phi_{t-1}, \qquad \phi_{t} = \arctan\!\left(\frac{Q_{t}}{I_{t}}\right). \tag{2}$$

This scalar encodes the per-sample rotation rate of the I/Q vector, which is directly determined by the transmitter’s carrier frequency offset and is stable across channel realisations. The augmented feature vector $[I_{t}, Q_{t}, \delta\phi_{t}]$ is projected to embedding dimension $D=128$ via a learned linear layer.
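
A minimal sketch of the feature augmentation is given below. Several details are assumptions that the text does not specify: the four-quadrant `atan2` is used in place of a plain arctangent, increments are wrapped to $[-\pi,\pi)$ to avoid $2\pi$ jumps, and the first sample (which has no predecessor) receives an increment of zero. The function name is hypothetical.

```python
import math


def phase_increment_features(snapshot):
    """Append the wrapped phase increment to each [I, Q] sample."""
    feats = []
    prev_phi = None
    for i_t, q_t in snapshot:
        # Assumption: atan2 so the phase is defined in all four quadrants.
        phi = math.atan2(q_t, i_t)
        if prev_phi is None:
            dphi = 0.0  # Assumption: the first sample has no predecessor.
        else:
            # Assumption: wrap the difference to [-pi, pi).
            dphi = (phi - prev_phi + math.pi) % (2 * math.pi) - math.pi
        feats.append([i_t, q_t, dphi])
        prev_phi = phi
    return feats
```

For a pure tone rotating by 0.3 rad per sample, every increment after the first equals 0.3 up to floating-point error, which is the constant-rotation signature the embedding is meant to expose.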

#### III-E2 Hamiltonian Multi-Head Attention

Each attention head operates on feature dimension $d_{h}=D/H$. Query and key projections follow the standard scaled dot-product formulation \[[22](https://arxiv.org/html/2605.30364#bib.bib20)\]. The value vectors are where the Hamiltonian constraint is applied. Each value vector $\mathbf{v}\in\mathbb{R}^{d_{h}}$ is partitioned into equal-length position and momentum coordinates:

$$\mathbf{V} \;\rightarrow\; \bigl(\mathbf{V}^{1},\,\mathbf{V}^{2}\bigr), \qquad \mathbf{V}^{1},\mathbf{V}^{2}\in\mathbb{R}^{d_{h}/2}. \tag{3}$$

A skew-symmetric generator is constructed from a learned parameter matrix $\mathbf{A}_{h}$:

$$\mathbf{C}_{h} = \mathbf{A}_{h} - \mathbf{A}_{h}^{\top}. \tag{4}$$

Because skew-symmetric matrices generate orthogonal flows, $\mathbf{C}_{h}$ defines a norm-preserving rotation in the $(d_{h}/2)$-dimensional feature space. The position and momentum coordinates are then co-evolved using the Störmer–Verlet leapfrog scheme \[[7](https://arxiv.org/html/2605.30364#bib.bib17)\], which is a second-order symplectic integrator:

$$\begin{align}
\mathbf{V}^{2} &\leftarrow \mathbf{V}^{2} + \tfrac{\Delta t}{2}\,\mathbf{V}^{1}\mathbf{C}_{h}, \tag{5}\\
\mathbf{V}^{1} &\leftarrow \mathbf{V}^{1} + \Delta t\,\mathbf{V}^{2}\mathbf{C}_{h}, \tag{6}\\
\mathbf{V}^{2} &\leftarrow \mathbf{V}^{2} + \tfrac{\Delta t}{2}\,\mathbf{V}^{1}\mathbf{C}_{h}, \tag{7}
\end{align}$$

where $\Delta t=0.05$ is the integration step size. The leapfrog scheme is chosen over a simple Euler step because it is time-reversible and volume-preserving, properties that align with the conservative dynamics of oscillator phase evolution. The updated coordinates are concatenated and $\ell_{2}$-normalised before attention aggregation. The full computation is summarised in Algorithm [1](https://arxiv.org/html/2605.30364#algorithm1) and the leapfrog update is illustrated in Fig. [4](https://arxiv.org/html/2605.30364#S3.F4).
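
The value update of Eqs. (4)–(7) can be sketched for a single value vector in plain Python. This is an illustration under stated assumptions, not the authors' implementation: nested lists stand in for tensors, the learned matrix $\mathbf{A}_{h}$ is passed in as a fixed argument, and all function names are hypothetical.

```python
import math


def skew_generator(a):
    """C = A - A^T, which is skew-symmetric for any square A (Eq. 4)."""
    n = len(a)
    return [[a[i][j] - a[j][i] for j in range(n)] for i in range(n)]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def leapfrog(v1, v2, c, dt=0.05):
    """One Stormer-Verlet step: half, full, half (Eqs. 5-7)."""
    v2 = [b + 0.5 * dt * g for b, g in zip(v2, vec_mat(v1, c))]
    v1 = [a + dt * g for a, g in zip(v1, vec_mat(v2, c))]
    v2 = [b + 0.5 * dt * g for b, g in zip(v2, vec_mat(v1, c))]
    return v1, v2


def hamiltonian_value_update(v, a, dt=0.05):
    """Split v into halves, evolve them, recombine, and l2-normalise."""
    half = len(v) // 2
    v1, v2 = leapfrog(v[:half], v[half:], skew_generator(a), dt)
    out = v1 + v2
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out]
```

One observation from running the sketch: with a finite step the leapfrog update conserves the combined norm of $(\mathbf{V}^{1},\mathbf{V}^{2})$ only approximately (the drift shrinks as $\Delta t$ decreases), and the subsequent $\ell_{2}$ normalisation rescales the result to exactly unit norm.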

![Refer to caption](https://arxiv.org/html/2605.30364v1/x4.png)

Figure 4: Störmer–Verlet leapfrog update in Hamiltonian attention. Position ($\mathbf{V}^{1}$) and momentum ($\mathbf{V}^{2}$) are co-evolved in three sub-steps (half, full, half), producing a norm-preserving rotation of the value representation.

Algorithm 1: Hamiltonian Multi-Head Attention

Input: Query $\mathbf{Q}$, Key $\mathbf{K}$, Value $\mathbf{V}$

Output: Context representation $\mathbf{O}$

1. Project inputs via learned matrices $W_{Q}$, $W_{K}$, $W_{V}$.
2. Compute standard scaled dot-product attention scores \[[22](https://arxiv.org/html/2605.30364#bib.bib20)\]: $\mathbf{A}=\mathrm{softmax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d_{h}}\right)$.
3. Partition value vectors: $\mathbf{V}\rightarrow(\mathbf{V}^{1},\mathbf{V}^{2})$, with $\mathbf{V}^{1},\mathbf{V}^{2}\in\mathbb{R}^{d_{h}/2}$.
4. Construct skew-symmetric generator: $\mathbf{C}_{h}=\mathbf{A}_{h}-\mathbf{A}_{h}^{\top}$.
5. Apply Störmer–Verlet leapfrog (Eqs. [5](https://arxiv.org/html/2605.30364#S3.E5)–[7](https://arxiv.org/html/2605.30364#S3.E7)).
6. Recombine and normalise: $\mathbf{V}=[\mathbf{V}^{1}\,\|\,\mathbf{V}^{2}]$, then $\mathbf{V}\leftarrow\mathbf{V}/\|\mathbf{V}\|_{2}$.
7. Aggregate context: $\mathbf{O}=\mathbf{A}\mathbf{V}$.
8. Return $\mathbf{O}$.

The Hamiltonian attention layer is integrated within a pre-normalisation encoder block following standard practice \[[2](https://arxiv.org/html/2605.30364#bib.bib21)\]:

$$\begin{align}
\mathbf{E} &\leftarrow \mathrm{LN}\bigl(\mathbf{E}+\mathrm{HamAttn}(\mathbf{E})\bigr), \tag{8}\\
\mathbf{E} &\leftarrow \mathrm{LN}\bigl(\mathbf{E}+\mathrm{FFN}(\mathbf{E})\bigr). \tag{9}
\end{align}$$

Four such layers are stacked. Global average pooling over the temporal dimension produces the final embedding $\mathbf{z}\in\mathbb{R}^{128}$, which is passed to a linear classification head.

#### III-E3 Training Objective

All models are trained using AdamW \[[13](https://arxiv.org/html/2605.30364#bib.bib24)\] for 20 epochs with a learning rate of $10^{-3}$, weight decay $10^{-4}$, a two-epoch linear warmup followed by cosine annealing, and gradient clipping at $1.0$. This unified protocol applies to all architectures including the baselines, ensuring that no model benefits from a more favourable optimisation regime.
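
The learning-rate schedule can be sketched as a function of the epoch index. The text fixes the peak rate, the warmup length, and the total number of epochs; the remaining details below are assumptions made for illustration: the rate is updated once per epoch, the warmup ramps linearly up to the peak on the last warmup epoch, and the cosine phase decays towards zero. The function name is hypothetical.

```python
import math


def learning_rate(epoch, total_epochs=20, warmup_epochs=2, peak_lr=1e-3):
    """Learning rate for a 0-indexed epoch under warmup plus cosine decay."""
    if epoch < warmup_epochs:
        # Assumption: linear ramp that reaches peak_lr on the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Assumption: cosine annealing from peak_lr towards zero.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate rises over the first two epochs, holds its peak at the start of the cosine phase, and then decreases monotonically for the remaining epochs.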

The full Hamiltonian Transformer additionally optimises a composite objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{c}\,\mathcal{L}_{\mathrm{CTR}} + \lambda_{r}\,\mathcal{L}_{\mathrm{REG}}, \tag{10}$$

with $\lambda_{c}=0.5$ and $\lambda_{r}=0.01$. The three terms serve distinct purposes as described below. The ablation study isolates the contribution of each component and shows that the phase embedding, not the composite loss, provides the dominant per-component improvement.

##### Cross\-entropy loss\.

The standard supervised classification term $\mathcal{L}_{\mathrm{CE}}$ optimises transmitter identity directly from the embedding $\mathbf{z}$ via a linear classifier head \[[4](https://arxiv.org/html/2605.30364#bib.bib25)\]. It is the sole training objective for the CNN, Transformer, and all ablation variants except Variant D.

##### Batch contrastive loss.

$\mathcal{L}_{\mathrm{CTR}}$ encourages intra-class compactness and inter-class separation in the embedding space using a margin-based contrastive formulation \[[4](https://arxiv.org/html/2605.30364#bib.bib25)\] applied over all pairs $(i,j)$ in each mini-batch:

$$\mathcal{L}_{\mathrm{CTR}} = \frac{1}{|\mathcal{P}|}\sum_{(i,j)}\Bigl[\mathbf{1}_{y_{i}=y_{j}}\,d_{ij}^{2} + \mathbf{1}_{y_{i}\neq y_{j}}\,\max(0,\,m-d_{ij})^{2}\Bigr], \tag{11}$$

where $d_{ij}=\|\hat{\mathbf{z}}_{i}-\hat{\mathbf{z}}_{j}\|_{2}$ is the distance between $\ell_{2}$-normalised embeddings and $m=1.0$ is the margin. The ablation study shows that this term does not provide consistent benefit at large transmitter counts and may introduce instability; it is retained in the full model for completeness but is not recommended for deployment at scale.
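
Eq. (11) can be sketched directly in Python. The sketch assumes that the pair set $\mathcal{P}$ contains every unordered pair $i<j$ in the mini-batch, which the text does not state explicitly; the function names are hypothetical and plain lists stand in for embedding tensors.

```python
import math


def l2_normalise(z):
    """Scale an embedding to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in z))
    return [x / norm for x in z]


def batch_contrastive_loss(embeddings, labels, margin=1.0):
    """Margin-based contrastive loss averaged over all pairs i < j (Eq. 11)."""
    z_hat = [l2_normalise(z) for z in embeddings]
    total, pairs = 0.0, 0
    for i in range(len(z_hat)):
        for j in range(i + 1, len(z_hat)):
            d = math.dist(z_hat[i], z_hat[j])
            if labels[i] == labels[j]:
                total += d ** 2  # pull same-class pairs together
            else:
                total += max(0.0, margin - d) ** 2  # push others past the margin
            pairs += 1
    # Assumption: a batch with fewer than two samples contributes no loss.
    return total / pairs if pairs else 0.0
```

Because the embeddings are unit-normalised, $d_{ij}$ lies in $[0, 2]$, so with $m=1.0$ any different-class pair separated by more than 60 degrees contributes nothing to the loss.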

##### Sphere regularisation\.

$\mathcal{L}_{\mathrm{REG}}$ penalises deviation of each embedding from the unit hypersphere:

$$\mathcal{L}_{\mathrm{REG}}=\frac{1}{N}\sum_{i=1}^{N}\bigl(\|\mathbf{z}_{i}\|_{2}-1\bigr)^{2}.\tag{12}$$
This term constrains the absolute scale of embeddings while $\mathcal{L}_{\mathrm{CTR}}$ controls their relative geometry, together producing a structured hyperspherical representation. In practice, the $\ell_{2}$ normalisation applied within the Hamiltonian attention layer already provides implicit scale control, which may explain why the additional regularisation offers diminishing returns at scale.
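As a small worked illustration, the sketch below evaluates Eq. (12) on toy embeddings and combines three scalar terms as in Eq. (10). The cross-entropy and contrastive values passed in are made-up placeholders, and the function names are hypothetical; only the weights $\lambda_{c}=0.5$ and $\lambda_{r}=0.01$ come from the text.

```python
import math

LAMBDA_C = 0.5    # contrastive weight from the text
LAMBDA_R = 0.01   # regulariser weight from the text


def sphere_regulariser(embeddings):
    """Eq. (12): mean squared deviation of embedding norms from 1."""
    deviations = [
        (math.sqrt(sum(x * x for x in z)) - 1.0) ** 2 for z in embeddings
    ]
    return sum(deviations) / len(deviations)


def composite_loss(ce, ctr, reg, lambda_c=LAMBDA_C, lambda_r=LAMBDA_R):
    """Eq. (10): weighted sum of the three scalar loss terms."""
    return ce + lambda_c * ctr + lambda_r * reg


# Toy unnormalised embeddings with norms 1, 2 and 0.5.
z = [[1.0, 0.0], [0.0, 2.0], [0.3, 0.4]]
reg = sphere_regulariser(z)                       # (0 + 1 + 0.25) / 3
total = composite_loss(ce=2.0, ctr=0.4, reg=reg)  # ce and ctr are placeholders
```

With $\lambda_{r}=0.01$ the regulariser contributes two orders of magnitude less than an equally sized cross-entropy term, so it acts as a gentle pull towards unit norm rather than a hard constraint.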

### III-F Experimental Setup

Table [I](https://arxiv.org/html/2605.30364#S3.T1) summarises the hyperparameters used across all models. The WiSig ManyTx subset contains signals from up to 150 transmitters across 18 receivers and four capture days. For the scaling experiment, a balanced per-class snapshot cap is applied within each transmitter bucket so that all models receive identical training data at every scale point. The train/val/test split is 70%/15%/15%, stratified by class, for all experiments. For the cross-receiver experiment, the split is by receiver index rather than snapshot, ensuring test receivers are entirely disjoint from training receivers. For the cross-day experiment, the last capture day is reserved for testing and the preceding three days form the training set.

TABLE I: Hyperparameters used across all models. Parameters marked *All* apply equally to CNN, Transformer, and Hamiltonian Transformer.

| Parameter | Value | Applies To |
| --- | --- | --- |
| Input length | 256 | All |
| Normalisation | Zero mean, unit std | All |
| Optimiser | AdamW | All |
| Learning rate | $10^{-3}$ | All |
| Weight decay | $10^{-4}$ | All |
| Warmup epochs | 2 | All |
| Schedule | Cosine decay | All |
| Gradient clip | 1.0 | All |
| Epochs | 20 | All |
| Batch size | 32 | All |
| Train/val/test split | 70/15/15 | All |
| Embedding dim $D$ | 128 | Transformer / Hamiltonian |
| Encoder layers | 4 | Transformer / Hamiltonian |
| Attention heads $H$ | 4 | Transformer / Hamiltonian |
| FFN hidden dim | 256 | Transformer / Hamiltonian |
| Leapfrog step $\Delta t$ | 0.05 | Hamiltonian |
| Contrastive weight $\lambda_{c}$ | 0.5 | Hamiltonian (full model) |
| Regulariser weight $\lambda_{r}$ | 0.01 | Hamiltonian (full model) |
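The data protocols described in this subsection (a balanced per-class cap, a receiver-disjoint split, and a held-out capture day) can be sketched on synthetic metadata. The record fields `tx`, `rx` and `day`, the helper names, and the choice of the minimum class count as the cap are assumptions for illustration; this is not the WiSig loading code.

```python
import random
from collections import defaultdict


def cap_per_class(records, seed=0):
    """Keep the same number of snapshots per transmitter (the minimum class count)."""
    by_tx = defaultdict(list)
    for rec in records:
        by_tx[rec["tx"]].append(rec)
    cap = min(len(v) for v in by_tx.values())
    rng = random.Random(seed)
    capped = []
    for tx in sorted(by_tx):
        capped.extend(rng.sample(by_tx[tx], cap))
    return capped


def split_by_receiver(records, test_receivers):
    """Cross-receiver protocol: test receivers never appear in training."""
    held_out = set(test_receivers)
    train = [r for r in records if r["rx"] not in held_out]
    test = [r for r in records if r["rx"] in held_out]
    return train, test


def split_by_day(records, test_day):
    """Cross-day protocol: one capture day is held out for testing."""
    train = [r for r in records if r["day"] != test_day]
    test = [r for r in records if r["day"] == test_day]
    return train, test


# Synthetic metadata only: 3 transmitters, 4 receivers, 4 capture days.
records = [
    {"tx": tx, "rx": rx, "day": day}
    for tx in range(3) for rx in range(4) for day in range(4)
]
records += [{"tx": 0, "rx": 0, "day": 0}] * 5   # over-represent transmitter 0

balanced = cap_per_class(records)
rx_train, rx_test = split_by_receiver(balanced, test_receivers=[3])
day_train, day_test = split_by_day(balanced, test_day=3)
```

Applying the cap before splitting, as done here, is one possible ordering; the text does not state where in the pipeline the cap is applied.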

Table [II](https://arxiv.org/html/2605.30364#S3.T2) reports the parameter count, computational cost, and per-epoch training time for all evaluated models and ablation variants at 150 transmitters. All Transformer-based variants share a comparable parameter budget and FLOPs footprint, confirming that the performance differences observed reflect architectural inductive biases rather than differences in model capacity.

TABLE II: Model complexity at 150 transmitters. FLOPs measured for a single forward pass on a $(1\times 256\times 2)$ input. Training time per epoch on GPU.

## IV Results

We evaluate the proposed Hamiltonian-based Transformer against two established baselines: a convolutional neural network (CNN) and a standard Transformer encoder. All models are trained for 20 epochs using AdamW with a learning rate of $10^{-3}$, weight decay $10^{-4}$, a two-epoch linear warmup followed by cosine annealing, and gradient clipping at $1.0$. This unified training protocol ensures that observed performance differences reflect architectural properties rather than optimisation advantages. All experiments use *non-equalized* raw I/Q signals consistent with the original WiSig evaluation protocol \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\]; no channel preprocessing is applied before the model. Input snapshots of length 256 are normalised to zero mean and unit standard deviation per channel. For the transmitter scaling experiment, snapshots are capped at the minimum per-class count within each transmitter bucket so that all architectures receive identical training data at every scale point.
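The per-channel normalisation mentioned above might look as follows. This sketch assumes the statistics are computed independently for each snapshot and that the population standard deviation is used; the text only says "per channel", so both choices, along with the helper name, are assumptions.

```python
import math


def normalise_snapshot(snapshot):
    """Standardise the I and Q channels of one snapshot independently.

    `snapshot` is a list of (I, Q) pairs. Per-snapshot statistics and the
    population standard deviation are assumed here.
    """
    channels = list(zip(*snapshot))              # [(I0, I1, ...), (Q0, Q1, ...)]
    normalised = []
    for ch in channels:
        mean = sum(ch) / len(ch)
        var = sum((x - mean) ** 2 for x in ch) / len(ch)
        std = math.sqrt(var) or 1.0              # guard against a constant channel
        normalised.append([(x - mean) / std for x in ch])
    return list(zip(*normalised))                # back to (I, Q) pairs


# Synthetic 256-sample snapshot with a DC offset and unequal I/Q gain.
snapshot = [
    (0.5 + 2.0 * math.cos(0.1 * n), -0.2 + 0.5 * math.sin(0.1 * n))
    for n in range(256)
]
out = normalise_snapshot(snapshot)
```

After this step both channels have zero mean and unit variance regardless of the offsets and gains in the synthetic input.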

The four experimental settings are: same-day transmitter classification (Ex-1), cross-receiver generalisation (Ex-2), cross-day generalisation (Ex-3), and large-scale transmitter scaling (Ex-4). Table [III](https://arxiv.org/html/2605.30364#S4.T3) summarises test accuracy across the four scenarios. Table [IV](https://arxiv.org/html/2605.30364#S4.T4) reports Ex-4 performance across all scale points together with the full ablation study. Fig. [5](https://arxiv.org/html/2605.30364#S4.F5) and Fig. [6](https://arxiv.org/html/2605.30364#S4.F6) visualise the scaling trajectories for the main models and ablation variants respectively.

### IV-A Same-Day Transmitter Classification

In the same\-day setting, models are trained and evaluated on the SingleDay subset of WiSig, which contains 28 transmitters and 448,000 I/Q snapshots across 10 receivers and a single capture day\. Because training and test signals originate from the same capture session, the wireless channel and receiver characteristics remain consistent across splits\.

Under these matched conditions all three architectures converge to high accuracy. The CNN baseline achieves 98.07%, confirming the effectiveness of convolutional inductive biases for short RF sequences. The standard Transformer reaches 98.19%, marginally above the CNN. The proposed Hamiltonian-based Transformer achieves the highest accuracy at 99.12%, improving upon the CNN and Transformer baselines by 1.05 and 0.93 percentage points respectively. These results confirm that the Hamiltonian structural constraint does not impair discriminative capacity and provides a mild regularisation benefit under matched train–test conditions.

### IV-B Cross-Receiver Generalisation

The cross\-receiver experiment evaluates whether a classifier trained on signals from a subset of receivers can generalise to entirely unseen receivers\. Signals are drawn from the ManyRx subset of WiSig, which contains recordings from 10 transmitters across 32 receivers and four capture days\. The split is strictly by receiver index: 22 receivers are used for training and 10 entirely disjoint receivers are used for testing exclusively\. This protocol reflects realistic deployment scenarios in which an authentication system must operate across receiver hardware not available at training time\.

The CNN baseline achieves 35.06% accuracy and the standard Transformer achieves 47.90%. The Hamiltonian-based Transformer achieves 52.94%, the highest among all three models. All models exhibit a substantial drop relative to the same-day experiment, confirming that receiver hardware variation remains a significant confounding factor in transmitter identification. The stronger performance of attention-based models suggests that global sequence dependencies carry more receiver-agnostic discriminative information than local convolutional features.

### IV-C Cross-Day Generalisation

Cross-day generalisation evaluates temporal robustness by training on signals from three capture days and evaluating on a fourth day recorded approximately one week later. The ManySig subset is used, containing 6 transmitters, 12 receivers, and 576,000 snapshots across four days. The last capture day (23 March 2021) is reserved for testing; the preceding three days form the training set. Unlike the single-receiver protocol of the original WiSig evaluation \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\], all 12 receivers are included in both training and test to reflect a realistic multi-receiver deployment.

The CNN baseline achieves 88.41% and the standard Transformer achieves 92.29%. The Hamiltonian-based Transformer achieves 93.73%, comparable to the Transformer baseline. The consistently high accuracy across all models under this multi-receiver protocol indicates that receiver diversity in training substantially mitigates temporal channel variation, consistent with the findings of Hanna *et al.* \[[9](https://arxiv.org/html/2605.30364#bib.bib1)\]. The marginal inter-model differences in this setting suggest that temporal channel variation is better addressed through training data diversity than through architectural constraints alone.

### IV-D Transmitter Scaling

The transmitter scaling experiment evaluates model behaviour as the number of transmitter identities increases from 10 to 150 using the ManyTx subset of WiSig. Fig. [5](https://arxiv.org/html/2605.30364#S4.F5) shows the scaling trajectories for all three models.

The CNN baseline degrades gradually from 66.32% at 10 transmitters to 41.77% at 150 transmitters. This decline reflects the limited per-class snapshot budget introduced by the balanced cap: the CNN’s local convolutional filters require more examples per class to stabilise discriminative responses than attention-based models under the same data constraint.

The standard Transformer exhibits non-monotonic behaviour across scale points, collapsing to 30.34% at 100 transmitters before partially recovering to 59.63% at 150 transmitters. This instability arises from the interaction between the snapshot cap and class count: at 30 transmitters the cap drops sharply, simultaneously halving the per-class data budget while tripling the number of identities, causing a disproportionate accuracy drop. The Transformer is more sensitive to this joint pressure than the other architectures due to its lack of structural constraints on the value update.

The Hamiltonian-based Transformer demonstrates the most stable performance across all scale points, outperforming both baselines at every transmitter count. This behaviour indicates that the physics-informed structural prior provides a meaningful advantage in the limited per-class data regime associated with large transmitter populations.

### IV-E Ablation Study

To isolate the contribution of each architectural component, a controlled ablation study is conducted on the Ex\-4 scaling experiment\. Five variants are evaluated under training conditions identical to the main models, differing only in the value update mechanism and input representation\. All variants use standard scaled dot\-product attention for query and key projections; only the value update differs\. The five variants are:

- A (Ham only, CE): Störmer–Verlet leapfrog with skew-symmetric generator $\mathbf{C}_{h}=\mathbf{A}_{h}-\mathbf{A}_{h}^{\top}$, plain I/Q input, cross-entropy loss only. Isolates the symplectic structure.
- B (Cayley, CE): Cayley-parameterised orthogonal value update \[[10](https://arxiv.org/html/2605.30364#bib.bib18),[21](https://arxiv.org/html/2605.30364#bib.bib19)\] $\mathbf{W}=(\mathbf{I}-\mathbf{A})(\mathbf{I}+\mathbf{A})^{-1}$, plain I/Q input, CE only. Norm-preserving via algebra rather than integration: no position–momentum split, no time step.
- C (Linear mix, CE): Unconstrained learned linear layer on values, plain I/Q input, CE only. Null hypothesis: no norm-preservation and no physical structure.
- D (Ham + CTR): Leapfrog value update plus contrastive loss, plain I/Q input, no phase features. Isolates the contrastive loss contribution.
- E (Ham + phase, CE): Leapfrog value update plus phase-increment embedding at input, CE only. Isolates the phase embedding contribution.
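To make the distinction between the value-update families above concrete, the toy sketch below compares three maps on a small vector: a Cayley transform of a skew-symmetric generator (the algebraic route of Variant B, assuming $\mathbf{A}$ is skew-symmetric so that $\mathbf{W}$ is orthogonal), an unconstrained linear map (in the spirit of Variant C), and a textbook Störmer–Verlet step on a single harmonic oscillator. The leapfrog shown is a generic integrator example with $\Delta t=0.05$, not the paper's attention update, and all matrices, vectors and function names are arbitrary illustrations.

```python
import math


def skew(a):
    """Skew-symmetric generator C = A - A^T built from any square matrix A."""
    n = len(a)
    return [[a[i][j] - a[j][i] for j in range(n)] for i in range(n)]


def solve(m, b):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley(c):
    """W = (I - C)(I + C)^{-1}. The factors commute, so solve (I + C) W = I - C."""
    n = len(c)
    plus = [[(i == j) + c[i][j] for j in range(n)] for i in range(n)]
    minus = [[(i == j) - c[i][j] for j in range(n)] for i in range(n)]
    return solve(plus, minus)


def matvec(w, v):
    """Matrix-vector product for nested lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in w]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


def leapfrog_step(q, p, omega, dt):
    """One Stormer-Verlet step for dq/dt = omega * p, dp/dt = -omega * q."""
    p_half = p - 0.5 * dt * omega * q
    q_new = q + dt * omega * p_half
    p_new = p_half - 0.5 * dt * omega * q_new
    return q_new, p_new


a = [[0.2, -0.5, 0.1], [0.7, 0.0, -0.3], [-0.4, 0.6, 0.9]]  # arbitrary weights
v = [1.0, -2.0, 0.5]                                         # arbitrary value vector

cayley_out = matvec(cayley(skew(a)), v)   # orthogonal map
linear_out = matvec(a, v)                 # unconstrained map
q1, p1 = leapfrog_step(1.0, 0.0, omega=1.0, dt=0.05)

cayley_drift = abs(norm(cayley_out) - norm(v))
linear_drift = abs(norm(linear_out) - norm(v))
leapfrog_drift = abs(norm([q1, p1]) - 1.0)
```

In this toy setting the Cayley map preserves the norm up to floating-point rounding, the single leapfrog step leaves a small but nonzero drift, and the unconstrained map changes the norm substantially.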

Fig. [6](https://arxiv.org/html/2605.30364#S4.F6) shows the scaling curves for all variants. Table [IV](https://arxiv.org/html/2605.30364#S4.T4) reports numerical results.

##### Symplectic structure versus plain mixing\.

Variant A consistently and substantially outperforms Variant C at every scale point. At 100 transmitters the gap reaches 54.9 percentage points (71.73% versus 16.83%), providing strong evidence that the leapfrog value update does real discriminative work beyond the effect of $\ell_{2}$ normalisation alone. This directly addresses the hypothesis that normalisation rather than symplectic structure may be responsible for observed gains.

##### Norm\-preservation as the primary inductive bias\.

Variant B (Cayley) matches or exceeds Variant A at all scale points from 30 transmitters onwards, reaching 79.72% at 150 transmitters compared to 67.01% for Variant A. Since Variant B is norm-preserving but uses no symplectic integration, position–momentum splitting, or physical time-step parameter, this result indicates that *norm-preservation in the value update* is the primary architectural property driving the scaling advantage, rather than the specific symplectic structure of the leapfrog. Both Variants A and B substantially outperform Variant C and the vanilla Transformer at large scale, confirming that the norm-preservation constraint itself, not its physical implementation, is the essential inductive bias. This is consistent with the physical motivation: norm-preserving updates prevent the model from encoding channel-induced amplitude variations that do not generalise across transmitter populations.

##### Phase embedding contribution\.

Variant E consistently outperforms Variant A at every scale point, reaching 78.61% at 150 transmitters versus 67.01% for Variant A. The phase-increment embedding therefore provides a meaningful and consistent improvement on top of the Hamiltonian attention mechanism, confirming that oscillator phase dynamics carry transmitter-specific hardware information that benefits from explicit representation at the input level.
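One common way to expose phase increments from raw I/Q samples is sketched below. The definition $\Delta\phi[n]=\arg\bigl(x[n]\,\overline{x[n-1]}\bigr)$ is an assumption chosen for illustration; the exact form of the embedding used in the model is not reproduced here, and the helper name and synthetic tone are hypothetical.

```python
import cmath
import math


def phase_increments(iq):
    """Per-sample phase increment of a complex baseband sequence.

    Assumed definition: delta_phi[n] = arg(x[n] * conj(x[n-1])), which lies
    in [-pi, pi] and needs no explicit phase unwrapping.
    """
    x = [complex(i, q) for i, q in iq]
    return [cmath.phase(x[n] * x[n - 1].conjugate()) for n in range(1, len(x))]


# Synthetic tone with a constant offset of 0.02 rad/sample, standing in
# for a transmitter-specific oscillator frequency offset.
offset = 0.02
iq = [(math.cos(offset * n), math.sin(offset * n)) for n in range(256)]
dphi = phase_increments(iq)
```

Under this definition a positive gain change or a constant phase rotation of the whole snapshot leaves the increments unchanged, which is one reason such a feature could be less sensitive to channel-dependent amplitude and phase offsets.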

##### Contrastive loss contribution\.

Variant D underperforms Variant A at all scale points except 100 transmitters, indicating that the contrastive loss without phase features provides no consistent benefit and introduces training instability at scale. The full model F, which combines all components, peaks at 10 transmitters (75.20%) but degrades to 61.64% at 150 transmitters, below both Variant B and Variant E. These findings indicate that the phase-increment embedding is the single most effective enhancement to the base Hamiltonian attention mechanism, while the contrastive loss is not recommended for large-scale transmitter identification tasks.

TABLE III: Test accuracy (%) across four WiSig experimental scenarios. Bold indicates the best result per scenario. All models use non-equalized raw I/Q signals.

TABLE IV: Test accuracy (%) on WiSig ManyTx scaling. Upper block: main model comparison. Lower block: ablation variants. Bold indicates best per column.

![Refer to caption](https://arxiv.org/html/2605.30364v1/x5.png)

Figure 5: Test accuracy as a function of transmitter count for CNN, Transformer, and Hamiltonian-based Transformer on the WiSig ManyTx subset. All models trained on non-equalized raw I/Q signals with a balanced per-class snapshot cap.

![Refer to caption](https://arxiv.org/html/2605.30364v1/x6.png)

Figure 6: Ablation study scaling curves on WiSig ManyTx. Variants A–E isolate individual architectural components under identical training conditions. Variants A and B (both norm-preserving) consistently outperform Variant C (unconstrained linear), confirming norm-preservation as the primary inductive bias for large-scale transmitter identification.

## V Conclusion

This paper presented the Hamiltonian-based Transformer, a physics-informed attention architecture for RF fingerprinting under receiver and channel distribution shifts. The proposed model incorporates an energy-conserving inductive bias directly within the attention mechanism by evolving value representations through a Störmer–Verlet leapfrog update governed by a learned skew-symmetric generator. Because skew-symmetric operators produce norm-preserving rotations, the resulting attention dynamics remain consistent with the rotational behaviour of oscillator phase evolution in complex baseband signals. An additional I/Q phase-increment embedding exposes oscillator dynamics at the input representation level, allowing the model to focus on transmitter-specific hardware behaviour rather than channel-dependent variations.

Experiments on the WiSig dataset demonstrate that embedding physics\-inspired structural constraints into neural architectures can improve robustness for RF fingerprinting tasks, particularly in scenarios involving channel variability and large transmitter populations\. These findings suggest that incorporating domain knowledge from signal physics can serve as a useful inductive bias for security\-oriented wireless machine learning systems\. Future work will focus on addressing receiver\-induced hardware variability, exploring adaptive integration dynamics within the Hamiltonian attention layer, and extending the approach to additional wireless waveform standards such as Bluetooth, LoRa, and LTE\.

## References

- \[1\] A. Al-Shawabka, F. Restuccia, S. D’Oro, T. Jian, B. Costa Rendon, N. Soltani, J. Dy, S. Ioannidis, K. R. Chowdhury, and T. Melodia (2020). Exposing the fingerprint: dissecting the impact of the wireless channel on radio fingerprinting. In Proc. IEEE Conference on Computer Communications (INFOCOM), pp. 646–655. External Links: [Document](https://dx.doi.org/10.1109/INFOCOM41043.2020.9155259).
- \[2\] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016). Layer normalization. In Proc. NeurIPS Workshop on Deep Learning Symposium. External Links: [Link](https://arxiv.org/abs/1607.06450).
- \[3\] Z. Chen, J. Zhang, M. Arjovsky, and L. Bottou (2020). Symplectic recurrent neural networks. In Proc. International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=BkgYPREtPr).
- \[4\] S. Chopra, R. Hadsell, and Y. LeCun (2005). Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 539–546. External Links: [Document](https://dx.doi.org/10.1109/CVPR.2005.202).
- \[5\] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho (2020). Lagrangian neural networks. arXiv preprint arXiv:2003.04630. External Links: [Link](https://arxiv.org/abs/2003.04630).
- \[6\] S. Greydanus, M. Dzamba, and J. Yosinski (2019). Hamiltonian neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 15353–15363. External Links: [Link](https://proceedings.neurips.cc/paper/2019/hash/26cd8ecadce0d4efd6cc8a8725cbd1f8-Abstract.html).
- \[7\] E. Hairer, C. Lubich, and G. Wanner (2006). Geometric numerical integration: structure-preserving algorithms for ordinary differential equations. 2nd edition, Springer Series in Computational Mathematics, Vol. 31, Springer. External Links: [Document](https://dx.doi.org/10.1007/3-540-30666-8).
- \[8\] S. Hanna, S. Karunaratne, and D. Cabric (2021). Open set wireless transmitter authorization: deep learning approaches and dataset considerations. IEEE Transactions on Cognitive Communications and Networking 7(1), pp. 59–72. External Links: [Document](https://dx.doi.org/10.1109/TCCN.2020.3043332).
- \[9\] S. Hanna, S. Karunaratne, and D. Cabric (2022). WiSig: a large-scale WiFi signal dataset for receiver and channel agnostic RF fingerprinting. IEEE Access 10, pp. 22808–22818. External Links: ISSN 2169-3536, [Document](https://dx.doi.org/10.1109/ACCESS.2022.3154790).
- \[10\] K. E. Helfrich, D. Willmott, and Q. Ye (2018). Orthogonal recurrent neural networks with scaled Cayley transform. In Proc. International Conference on Machine Learning (ICML), pp. 1970–1978. External Links: [Link](https://arxiv.org/abs/1707.09520).
- \[11\] T. Jian, B. Costa Rendon, E. Ojuba, N. Soltani, Z. Wang, K. Sankhe, A. Gritsenko, J. Dy, K. R. Chowdhury, and S. Ioannidis (2020). Deep learning for RF fingerprinting: a massive experimental study. IEEE Internet of Things Magazine 3(1), pp. 50–57. External Links: [Document](https://dx.doi.org/10.1109/IOTM.0001.1900065).
- \[12\] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2019). Set transformer: a framework for attention-based permutation-invariant neural networks. In Proc. International Conference on Machine Learning (ICML), pp. 3744–3753. External Links: [Link](https://arxiv.org/abs/1810.00825).
- \[13\] I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/1711.05101).
- \[14\] T. J. O’Shea, J. Corgan, and T. C. Clancy (2016). Convolutional radio modulation recognition networks. In Proc. International Conference on Engineering Applications of Neural Networks (EANN), pp. 213–226. External Links: [Document](https://dx.doi.org/10.1007/978-3-319-44188-7%5F16).
- \[15\] D. Raychaudhuri, I. Seskar, M. Ott, S. Ganu, K. Ramachandran, H. Kremo, R. Siracusa, H. Liu, and M. Singh (2005). Overview of the ORBIT radio grid testbed for evaluation of next-generation wireless network protocols. In Proc. IEEE Wireless Communications and Networking Conference (WCNC), Vol. 3, pp. 1664–1669. External Links: [Document](https://dx.doi.org/10.1109/WCNC.2005.1424763).
- \[16\] S. Riyaz, K. Sankhe, S. Ioannidis, and K. R. Chowdhury (2018). Deep learning convolutional neural networks for radio identification. IEEE Communications Magazine 56(9), pp. 146–152. External Links: [Document](https://dx.doi.org/10.1109/MCOM.2018.1800153).
- \[17\] K. Sankhe, M. Belgiovine, F. Zhou, L. Angioloni, F. Restuccia, S. D’Oro, T. Melodia, S. Ioannidis, and K. R. Chowdhury (2020). No radio left behind: radio fingerprinting through deep learning of physical-layer hardware impairments. IEEE Transactions on Cognitive Communications and Networking 6(1), pp. 165–178. External Links: [Document](https://dx.doi.org/10.1109/TCCN.2019.2949308).
- \[18\] K. Sankhe, M. Belgiovine, F. Zhou, S. Riyaz, S. Ioannidis, and K. Chowdhury (2019). ORACLE: optimized radio clAssification through Convolutional neuraL nEtworks. In Proc. IEEE Conference on Computer Communications (INFOCOM), pp. 370–378. External Links: [Document](https://dx.doi.org/10.1109/INFOCOM.2019.8737463).
- \[19\] G. Shen, J. Zhang, A. Marshall, and J. R. Cavallaro (2022). Towards scalable and channel-robust radio frequency fingerprint identification for LoRa. IEEE Transactions on Information Forensics and Security 17, pp. 774–787. External Links: [Document](https://dx.doi.org/10.1109/TIFS.2022.3152404).
- \[20\] N. Soltani, K. Sankhe, J. G. Dy, S. Ioannidis, and K. R. Chowdhury (2020). More is better: data augmentation for channel-resilient RF fingerprinting. IEEE Communications Magazine 58(10), pp. 66–72. External Links: [Document](https://dx.doi.org/10.1109/MCOM.001.2000180).
- \[21\] A. Trockman and J. Z. Kolter (2021). Orthogonalizing convolutional layers with the Cayley transform. In Proc. International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2104.07167).
- \[22\] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008. External Links: [Link](https://arxiv.org/abs/1706.03762).
- \[23\] J. Zhang, R. Woods, M. Sandell, M. Valkama, A. Marshall, and J. Cavallaro (2021). Radio frequency fingerprint identification for narrowband systems: modelling and classification. IEEE Transactions on Information Forensics and Security 16, pp. 3974–3987. External Links: [Document](https://dx.doi.org/10.1109/TIFS.2021.3088008).
- \[24\] T. Zhao, S. Sarkar, E. Krijestorac, and D. Cabric (2024). GAN-RXA: a practical scalable solution to receiver-agnostic transmitter fingerprinting. IEEE Transactions on Cognitive Communications and Networking 10(2), pp. 523–537. External Links: [Document](https://dx.doi.org/10.1109/TCCN.2023.3329012).