Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks

arXiv cs.LG Papers

Summary

This paper proposes a cost-effective UAV detection and classification system using sound signals processed by rational Gaussian wavelet neural networks, achieving interpretable and robust performance for single and multiple UAVs including swarms, outperforming traditional methods.

arXiv:2605.26310v1 Announce Type: new Abstract: The detection of unmanned aerial vehicles (UAVs) is important for the protection of civilian and military infrastructure. In this paper we propose a cost effective UAV detection system using sound signals obtained from microphones. The recorded signals are passed through a signal processing pipeline which employs interpretable adaptive feature extractors using so-called rational Gaussian wavelets. These adaptive wavelet transformations are embedded into and trained together with an underlying small neural network which detects and classifies UAVs based on the obtained features. This leads to a physically interpretable machine learning algorithm that in addition to classifying UAVs is also capable of detecting UAV swarms. We demonstrate our results using data collected in indoor studio and noisy outdoor environments. We conclude that the proposed method outperforms traditional machine learning approaches for detecting and classifying single UAVs as well as drone swarms, while retaining a high degree of interpretability. Our implementation of the proposed methods is made publicly available for reproducibility.

Cached at: 05/27/26, 09:07 AM

# Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks
Source: [https://arxiv.org/html/2605.26310](https://arxiv.org/html/2605.26310)
Péter Kovács\[1\]

\[1\] Department of Numerical Analysis, Eötvös Loránd University, Pázmány Péter sétány, Budapest, 1117, Hungary

\[2\] System and Control Laboratory, HUN\-REN Institute for Computer Science and Control, Kende utca, Budapest, 1111, Hungary

\[3\] Szent György Hang\- és Filmművészeti Technikum, Lenhossék utca, Budapest, 1096, Hungary

\[4\] Department of Mathematics and Computer Science, UniDistanceSuisse, Schinerstrasse, Brig\-Glis, 3900, Switzerland

\[5\] Siemens Mobility Kft\., Gábor Dénes utca, Budapest, 1117, Hungary

\[6\] HUN\-REN Centre for Energy Research, Konkoly\-Thege Miklós út, Budapest, 1121, Hungary

###### Abstract

The detection of unmanned aerial vehicles \(UAVs\) is important for the protection of civilian and military infrastructure\. In this paper we propose a cost effective UAV detection system using sound signals obtained from microphones\. The recorded signals are passed through a signal processing pipeline which employs interpretable adaptive feature extractors using so\-called rational Gaussian wavelets\. These adaptive wavelet transformations are embedded into and trained together with an underlying small neural network which detects and classifies UAVs based on the obtained features\. This leads to a physically interpretable machine learning algorithm that in addition to classifying UAVs is also capable of detecting UAV swarms\. We demonstrate our results using data collected in indoor studio and noisy outdoor environments\. We conclude that the proposed method outperforms traditional machine learning approaches for detecting and classifying single UAVs as well as drone swarms, while retaining a high degree of interpretability\. Our implementation of the proposed methods is made publicly available for reproducibility\.

###### keywords:

Drones, wavelets, machine learning, explainability, neural networks

## 1 Introduction

The detection and classification of unmanned aerial vehicles \(UAVs\) has become an important problem in recent years\. UAVs, also commonly known as drones, have been successfully applied in the delivery of goods, agriculture and various other civilian industries \[seidaliyeva2023advances\]\. On the other hand, exploiting their ability to carry weaponry and surveillance equipment, UAVs have found a large array of military applications and play an ever increasing role on the modern battlefield \[DronesBattle,swinney2022review,wang2021counter\]\. In addition, drones provide a cost efficient method for criminal groups to carry out illegal activities such as smuggling \[samaras2019deep,DedroneWorldwideDroneIncidents\] and disrupting industry and air traffic \[seidaliyeva2023advances,DedroneWorldwideDroneIncidents\]\. We recommend \[seidaliyeva2023advances\] and \[droneRev2\] for a deeper discussion on the different threats posed by malicious UAV activity\.

In order to counter the threat posed by drones, the development of trustworthy and affordable technology to detect and classify UAVs is necessary\. Accordingly, several recent works consider the problem of drone detection using a variety of technologies\. The detection and classification of drones is a difficult measurement and signal processing task for a number of reasons\. Depending on the type of the UAV, its sound, radar or visual signatures can vary widely\[mohsan2023unmanned\]\. In addition, the measurement apparatus usually needs to operate in an outside environment subject to sudden changes in weather conditions and significant background noise\[mohsan2023unmanned\]\. The appearance of so\-called drone swarms, where multiple UAVs perform a coordinated task, further complicates potential defense strategies\[seidaliyeva2023advances\]\.

Current state\-of\-the\-art technologies \[seidaliyeva2023advances,droneRev2\] consider signals obtained from different types of sensors accompanied by a diverse array of signal processing methods to detect drone activity\. Due to the above highlighted difficulties, it is generally accepted that a high performance UAV detection system should incorporate a variety of signals obtained from different sensors\. The design of such a system can also depend on other factors, such as the type of infrastructure it is intended to protect, the overall cost of the system and specifics related to the environment \(such as predictable weather conditions, noise pollution levels, etc\.\)\. Nevertheless, drone detection subsystems that depend on a single mode of measurement are an important field of study, as these provide the components of an overall drone detection system\.

Current UAV detection and classification technologies can be divided into four classes depending on the type of measurements they rely on\. Radar based methods \(see e\.g\.,\[radar,tanveer2025from\]\) use radio waves to detect objects\. Advantages of this technology include robustness against weather conditions, long range and being able to detect the speed and direction of UAVs\[seidaliyeva2023advances\]\. Unfortunately, the detection of smaller drones is a difficult task for radar systems due to small radar cross sections and low flying altitudes\. In addition, radar based systems are costly due to the complexity of the involved instrumentation\.

Vision based detection systems usually employ image processing to analyze recordings from cameras\[camera\]\. These are very cost effective and can provide visual confirmation of UAVs, however their performance can be hindered by visibility and weather constraints\. In addition, a visual detection system is only able to detect UAVs in its line of sight\.

Another interesting family of detection methods attempts to recognize drone activity by capturing and analyzing wireless signals\. Methods belonging to this family include \[RF1,RF2\]\. Benefits of radio frequency analysis include a long operating range and the ability to classify different types of UAVs\. On the other hand, these methods cannot identify autonomous drones \[seidaliyeva2023advances\] or drones controlled by non\-wireless means such as carbon\-optic fibers \[carbon\]\.

Finally, we mention UAV detection systems that operate by recording sound in the surrounding area \[sound1,sound2\]\. These systems are easy to deploy and are cost effective\. Additional advantages of sound based detection schemes include the ability to classify recognized drones, estimate the position of UAVs \[seidaliyeva2023advances\] and the ability to operate without a line of sight to the target\. Difficulties associated with acoustic drone detection systems include background noise filtering and vulnerability to wind conditions\.

In this study, we consider a novel sound based drone detection system\. We show that through the use of appropriate signal processing methods, the proposed system is able to

1. Mitigate the effect of wind and other background noise, by learning frequency signatures characteristic of UAVs of interest,
2. Classify UAV types based on the learned signatures,
3. Recognize drone swarms, that is, distinguish between intrusions by a single UAV and groups of drones\.

In this study, we assume that the drones to be detected are small, electric vehicles that are traditionally difficult to recognize \[seidaliyeva2023advances,droneRev2\]\. To reduce the cost of the system, we employ microphones based on micro\-electro\-mechanical systems \(MEMS\) technology and a mathematically justified, so\-called model driven machine learning \(ML\) paradigm\.

The use of the proposed novel signal processing scheme clearly distinguishes our approach from previous sound based UAV detection technologies\. Although previous drone detectors employing artificial intelligence methods on sound signals have been introduced in \[sound1\] and \[sound2\], these rely on traditional machine learning architectures\. While deep learning based approaches achieve high performance \[soundDeep\], model parameters do not carry physical meaning\. Thus, the output of such models is not explainable\. This poses concerns for safety\-critical applications such as drone detection\. In addition, the size of deep neural models often requires specialized hardware for real\-time use, which can lead to increased deployment and maintenance costs\.

In contrast, certain previous methods \[sound1\] apply static feature extraction steps to obtain meaningful information from recorded sound signals\. The extracted features are then passed to different ML models, which are usually smaller than their deep learning counterparts \[soundDeep\]\. Features are often extracted using time\-frequency transformations, and thus the obtained information is interpretable\. On the other hand, previous methods only considered static feature extractors\. That is, these methods apply a fixed transformation \(e\.g\., short time Fourier transform \[sound1\]\) to each sound segment\. This leads to sub\-optimal signal representation, because the feature extractors cannot adapt to different UAV types, maneuvers and changing environmental conditions\.

In contrast, our proposed approach relies on adaptive feature extraction transformations embedded into small neural networks\. The proposed model is suitable for real time applications on limited hardware and extracts physically meaningful information from the sound signals\. Furthermore it provides feature transformations that can adapt to different sound signatures and changing environmental factors\.

The proposed signal processing pipeline falls into the category of so\-called model driven machine learning approaches\. These mathematically justified ML models are fully, or partially interpretable and can be used in safety\-critical applications\. Importantly, such methods retain the generalization abilities of classical deep learning approaches, at least in the context of the application to which they are deployed\. A number of model driven ML methods have been recently introduced, including variable projection based neural networks\[kovacs2022VPNet\]and kernel methods\[vpsvm\]\. Other recent examples of model driven ML methods include wavelet convolution neural networks\[wang2021automatic,li2021waveletkernelnet\]\.

In the current study, we propose a new wavelet convolution layer \(and a corresponding model driven neural network\) that uses so\-called rational Gaussian wavelet \(RGW\) kernels \[RGW\]\. RGW is a recently introduced class of admissible wavelets, whose morphology can be greatly influenced through a number of parameters\. The proposed layer can learn the shape of the optimal mother wavelet, as well as a finite number of scales \(corresponding to pseudo frequencies \[daubechies1992ten\]\) that can be used to detect and classify UAVs\.

Below, we summarize the most important novelties of our study:

1. We propose a novel, RGW convolution based model driven neural network\. The proposed signal processing model obtains physically meaningful wavelet coefficients from the recorded sound signals\. These adaptively extracted features are then processed by a small neural network to detect UAV presence and classify UAV types\.
2. We demonstrate that our proposed wavelet kernel convolution based machine learning model can be used to detect the presence of groups of UAVs\.
3. We compare the proposed signal processing approach to several baseline machine learning methods\. We show that in addition to the inherent interpretability of the proposed model, it significantly outperforms traditional ML methods for drone detection\. All of our experiments are completely reproducible using our publicly available Python implementation of the proposed methods \(see data and code statement at the end of the article\)\.
4. We propose a cost\-effective, sound based system for UAV detection and classification with a lightweight, interpretable ML based signal processing unit\.

The rest of this paper is organized as follows\. In section [2](https://arxiv.org/html/2605.26310#S2), we describe the hardware used to collect our dataset and the different measurement scenarios\. Section [3](https://arxiv.org/html/2605.26310#S3) provides an overview of the most important properties of the collected sound signals\. These properties provide justification for the proposed signal processing pipeline\. In section [4](https://arxiv.org/html/2605.26310#S4), we review some important properties of rational Gaussian wavelets \[RGW\] and introduce interpretable RGW convolution layers\. These are then used to construct a small, partially interpretable neural network for drone detection\. Section [5](https://arxiv.org/html/2605.26310#S5) details our experiments and results\. Finally, in section [6](https://arxiv.org/html/2605.26310#S6) we summarize our findings and discuss future research directions\.

## 2 Data acquisition

In the studio environment, an Audio/Video Recorder \(Video Devices PIX 270i\) received the HD\-SDI signal from a wide\-angle camera and simultaneously generated a 48 kHz word clock signal that was phase\-locked to the video reference\. Audio recordings were captured at a sampling rate of 192 kHz with a 24\-bit resolution in signed Linear Pulse Code Modulation \(LPCM\) format\. The word clock from the PIX 270i was distributed to a Master Clock Generator \(Apogee Big Ben\), which operated as a frequency multiplier to derive a 192 kHz reference\. The Big Ben provided this synchronized 192 kHz word clock to an Audio Interface System \(Focusrite RedNet\) and the associated Digital Audio Workstation \(Pro Tools\), ensuring a common time base across all microphone channels\. The PIX 270i also recorded the video feed along with an analog reference from one of the microphones, allowing precise post\-synchronization between the audio and video domains\. The microphones used in the studio were Brüel & Kjær 4006 and Schoeps MK2 with CMC5 preamplifiers, all of which are pressure\-type electroacoustic transducers, ensuring the most phase\-coherent data acquisition possible\.

To simulate real environmental conditions relevant to the intended system application, a microphone array equipped with commercial MEMS transducers was deployed outdoors\. The recordings were carried out at a sampling rate of 8 kHz with a 16\-bit resolution in signed LPCM format\. The acoustic environment included intense background noise sources such as helicopter overflights, nearby road traffic, human speech, and strong wind\. As the four microphones were mounted on a common PCB and shared identical signal\-conditioning and acquisition circuitry, their outputs were recorded in temporal synchronization relative to each other\.

The experiments were conducted using several commercially available multirotor UAVs, representing different size classes, acoustic signatures, and propulsion configurations\. The DJI Mavic Pro and Mavic Pro 2 are foldable quadcopters with a takeoff weight of approximately 734 g and 907 g, respectively\. Both platforms use four two\-blade rotors with typical hover\-speed ranges between 5000–7800 RPM\.

The DJI Mavic Mini is an ultralight \(249 g\) quadcopter equipped with four two\-blade rotors, operating at higher fundamental rotor frequencies due to its smaller propeller diameter and reduced inertia\.

For outdoor measurements, additional UAV types were included, such as the DJI Mavic 3 Pro \(958 g, 4 rotors\), the DJI Avata 2 \(cinewhoop\-style ducted quadcopter, 377 g, high\-RPM 3\-blade rotors\), and the DJI Matrice 30T, a larger industrial platform weighing approximately 3\.7 kg and equipped with four large\-diameter two\-blade rotors rotating at significantly lower RPM ranges \(2500–4200 RPM\)\.

These platforms cover a broad acoustic spectrum: lightweight high\-RPM systems produce higher\-frequency harmonic structures, while heavier drones exhibit stronger low\-frequency components\. This diversity enables robust evaluation of the proposed signal processing and classification methodology\.
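The link between rotor speed and acoustic content can be made concrete with a short calculation. The sketch below uses the standard blade-passing relation (rotations per second times the number of blades), which is an assumption added here and not stated in the text; the RPM ranges and blade counts are the ones quoted above. Real recordings also contain harmonics of this fundamental and motor noise, so the numbers only indicate where the lowest tonal component is expected.

```python
def blade_pass_frequency_hz(rpm: float, blades: int) -> float:
    """Fundamental blade-passing frequency of a single rotor, in Hz."""
    return rpm / 60.0 * blades


# RPM ranges and blade counts as quoted in the text above.
platforms = [
    ("Mavic Pro / Mavic Pro 2", (5000, 7800), 2),
    ("Matrice 30T", (2500, 4200), 2),
]

for name, (rpm_lo, rpm_hi), blades in platforms:
    lo = blade_pass_frequency_hz(rpm_lo, blades)
    hi = blade_pass_frequency_hz(rpm_hi, blades)
    print(f"{name}: {lo:.1f} Hz to {hi:.1f} Hz")
```

Under this relation the heavier Matrice 30T has its fundamental roughly an octave below that of the Mavic Pro platforms, which agrees with the qualitative statement above.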

Table 1: Technical specifications of the UAV platforms used in the study

## 3 Signal description

We review some important properties of the sound signals obtained from the front\-facing Schoeps MK2 microphones described in section [2](https://arxiv.org/html/2605.26310#S2)\. Audio recorded from a single such microphone was used to generate the results presented in Section [5](https://arxiv.org/html/2605.26310#S5) relating to the indoor measurements, thus the inputs to the proposed methods are 1D signals \(see Fig\. [1](https://arxiv.org/html/2605.26310#S3.F1)\)\. We note that even though the sound signals obtained from our noisy outdoor measurement scenario are of lower quality, the characteristics discussed here remain true for them, which is also reflected by our experimental results \(see section [5](https://arxiv.org/html/2605.26310#S5)\)\. The signals considered here have been recorded with a sampling rate of 192 kHz\. The audio data is split into 100 ms long, non\-overlapping segments\. Thus input samples consisted of arrays containing 19200 amplitude values each\. The only other preprocessing step applied to the recorded sound signals was normalization, so that the values in every segment had a mean of 0 and a standard deviation of 1\.
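The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the function name `segment_and_normalize`, the choice to drop a trailing incomplete segment, and the guard for constant segments are assumptions made here.

```python
import numpy as np


def segment_and_normalize(signal, fs=192_000, segment_ms=100):
    """Split a 1D recording into non-overlapping segments and standardize each one.

    Returns an array of shape (n_segments, samples_per_segment) in which every
    row has zero mean and unit standard deviation.
    """
    signal = np.asarray(signal, dtype=np.float64)
    seg_len = fs * segment_ms // 1000            # 19200 samples at 192 kHz
    n_seg = signal.size // seg_len               # assumption: drop the incomplete tail
    segments = signal[: n_seg * seg_len].reshape(n_seg, seg_len)
    mean = segments.mean(axis=1, keepdims=True)
    std = segments.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                          # assumption: leave constant segments at zero
    return (segments - mean) / std


# One second of synthetic noise stands in for a real recording.
rng = np.random.default_rng(0)
recording = rng.normal(size=192_000)
segments = segment_and_normalize(recording)
print(segments.shape)  # (10, 19200)
```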

![Refer to caption](https://arxiv.org/html/2605.26310v1/x1.png)

Figure 1: Audio signals of a Mavic drone \(left\) and of a Mavic and a Mavic 2 drone \(right\), flying at a height of 1 meter\.

![Refer to caption](https://arxiv.org/html/2605.26310v1/x2.png)

Figure 2: Time\-scale representation of audio signals of the Mavic Pro and Mavic Pro 2 drones taken indoors, with very little noise present\. The top left signal is the Mavic Pro flying at a height of 1 meter and the top right is the Mavic Pro flying at 4 meters\. The bottom is both the Mavic and Mavic 2 flying at 1 meter\. The time\-scale representations have been obtained via continuous wavelet transform, using Morlet wavelets\.

As noted in \[seidaliyeva2023advances,sound1,sound2,soundDeep\], the frequency profiles of the audio signals carry noticeable differences, characteristic of each drone model\. This can also be seen in Fig\. [2](https://arxiv.org/html/2605.26310#S3.F2), as certain features appear consistently in the spectrogram, or the time\-frequency representation obtained from the Morlet transform of signals, only when audio from a Mavic Pro type drone is present\.

The characteristics of these features, however, can be subject to change depending on factors other than the model, such as the position, or the movement of the drone\. Furthermore, the measured signal segments do not exhibit periodic behavior\. If the transformations used during the classification pipeline are not translation invariant, the position \(in time\) of the identifying features may be subject to change as well\.

These qualities of the audio signals mean that drone detection is a nonlinear classification problem of nonstationary and nonperiodic signals\. In section[4](https://arxiv.org/html/2605.26310#S4)we introduce an ML model based on adaptive rational Gaussian wavelet convolution operators, that can successfully process these types of signals\. The experimental results of section[5](https://arxiv.org/html/2605.26310#S5)demonstrate that the scheme provides an apt solution to the identification and classification problem set in this paper\.

## 4 Rational Gaussian Wavelet convolution networks

Given a signal $f$ and a function $\psi$, the continuous wavelet transform of the signal is defined as

$$W_{\psi}(\lambda,\tau)=\int_{-\infty}^{+\infty}f(t)\,\overline{\psi}_{\lambda,\tau}(t)\,dt, \tag{1}$$

where

$$\psi_{\lambda,\tau}(t):=\frac{1}{\sqrt{\lambda}}\psi\left(\frac{1}{\lambda}(t-\tau)\right)\quad(t,\lambda,\tau\in\mathbb{R},\ \lambda\neq 0).$$

We refer to the function $\psi$ as the mother wavelet, while $\lambda$ and $\tau$ denote so\-called dilation and translation parameters, respectively\. Given fixed parameters $\lambda$ and $\tau$, the number $W_{\psi}(\lambda,\tau)$ is called a wavelet coefficient and it describes the similarity between $f$ and $\psi_{\lambda,\tau}$\. We shall assume $\psi,f\in L_{2}(\mathbb{R})$ henceforth, which is a sufficient condition for \([1](https://arxiv.org/html/2605.26310#S4.E1)\) to exist\. Furthermore, if the mother wavelet $\psi$ satisfies the so\-called admissibility property \[daubechies1992ten\], then the transformation \([1](https://arxiv.org/html/2605.26310#S4.E1)\) is invertible \(in the $L_{2}(\mathbb{R})$\-sense\)\. The $W_{\psi}$ coefficients describe the so\-called time\-scale representation of $f$, which is closely related to time\-frequency representations \[daubechies1992ten\]\.

In \[RGW\] the authors introduce a family of wavelet functions, called rational Gaussian wavelets \(RGW\)\. RGWs are defined with a parameter vector $\boldsymbol{\eta}:=\left[t_{1},t_{2},\ldots,t_{p},z_{1},z_{2},\ldots,z_{n}\right]$ as

$$\psi^{\boldsymbol{\eta}}(t)=C(\boldsymbol{\eta})P^{\boldsymbol{\eta}}(t)v^{\boldsymbol{\eta}}(t)e^{\frac{-t^{2}}{2}}\quad(t\in\mathbb{R},\ \boldsymbol{\eta}\in\mathbb{C}^{p+n}), \tag{2}$$

where $C$ is a constant that depends only on $\boldsymbol{\eta}$\. The rational term $P^{\boldsymbol{\eta}}(t)v^{\boldsymbol{\eta}}(t)$ is defined as

$$P^{\boldsymbol{\eta}}(t)=t\prod^{p}_{k=1}(t-t_{k})(t+t_{k}),\quad(t_{k}\in\mathbb{C}\backslash\{0\},\ p\in\mathbb{N}) \tag{3}$$

and $v^{\boldsymbol{\eta}}\in\mathcal{V}$, where

$$\mathcal{V}:=\left\{v(t)=\frac{1}{\prod^{n-1}_{k=0}(t-z_{k})(t+z_{k})(t-\tilde{z}_{k})(t+\tilde{z}_{k})},\ n\in\mathbb{N}\right\} \tag{4}$$

and

$$\tilde{z}:=-\Re(z)+i\Im(z).$$

For simplicity, in this work we assume $\{t_{k}\}_{k=1}^{p}\subset\mathbb{R}$\. In \[RGW\] the admissibility of $\psi^{\boldsymbol{\eta}}$ is proven, thus \(in theory\) $f$ can be reconstructed from its RGW wavelet coefficients\. Fig\. [3](https://arxiv.org/html/2605.26310#S4.F3) illustrates some RGW mother wavelets with different $\boldsymbol{\eta}$ parameter choices\. Fig\. [3](https://arxiv.org/html/2605.26310#S4.F3) also demonstrates how the large number of parameters offers a high degree of flexibility for influencing the morphology of the mother wavelet\. This property of RGW wavelets is also well exploited in section [5](https://arxiv.org/html/2605.26310#S5) of this study\.
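To make Eqs. (2) to (4) tangible, the following sketch evaluates an RGW mother wavelet on a uniform grid for real zeros and complex poles. It is an illustration under stated assumptions and not the authors' implementation: the function name `rgw_mother_wavelet` is hypothetical, the constant $C(\boldsymbol{\eta})$ is replaced by a numerical normalization to unit $L_2$ norm, and the poles are assumed to have nonzero imaginary part so that the denominator does not vanish on the real line.

```python
import numpy as np


def rgw_mother_wavelet(t, zeros, poles):
    """Evaluate psi^eta(t) = C * P(t) * v(t) * exp(-t^2 / 2) on a uniform grid t."""
    t = np.asarray(t, dtype=np.float64)

    # Polynomial term of Eq. (3): P(t) = t * prod_k (t - t_k)(t + t_k).
    P = t.copy()
    for tk in zeros:
        P = P * (t - tk) * (t + tk)

    # Rational term of Eq. (4), with z_tilde = -Re(z) + i Im(z).
    denom = np.ones(t.shape, dtype=np.complex128)
    for z in poles:
        z_tilde = -z.real + 1j * z.imag
        denom = denom * (t - z) * (t + z) * (t - z_tilde) * (t + z_tilde)

    # For real t the denominator equals |t - z|^2 |t + z|^2, so the ratio is
    # real up to rounding; only the real part is kept.
    psi = (P / denom).real * np.exp(-t**2 / 2.0)

    # Assumption: numerical unit L2 norm stands in for the constant C(eta).
    dt = t[1] - t[0]
    return psi / np.sqrt(np.sum(psi**2) * dt)


t = np.linspace(-8.0, 8.0, 2001)
psi = rgw_mother_wavelet(t, zeros=[1.0], poles=[0.5 + 1.0j])
print(abs(np.sum(psi) * (t[1] - t[0])))  # close to zero: the wavelet is odd
```

With real zeros the wavelet is an odd function of $t$, so it integrates to zero, which is the property one expects from an admissible wavelet. Changing `zeros` and `poles` reshapes the oscillations without affecting this symmetry.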

![Refer to caption](https://arxiv.org/html/2605.26310v1/x3.png)

Figure 3: Rational Gaussian functions of Eq\. \([2](https://arxiv.org/html/2605.26310#S4.E2)\), with parameters $p$ and $z$\.

In \[li2021waveletkernelnet\] a wavelet based convolution kernel has been introduced, using a fixed mother wavelet, with the translation and dilation pairs as learnable parameters\. The approach in \[li2021waveletkernelnet\] differs from the one introduced in this article in two regards\. First, our use of RGWs allows us to optimize the morphology of the mother wavelet\. Secondly, our proposed RGW convolution layer only considers the scales $\lambda_{k}\ (k=1,\ldots,N,\ N\in\mathbb{N})$ as free parameters\. The reason for this is that learning exact translation parameters would increase the size of the convolution layer without affecting the model’s accuracy in any meaningful capacity\.

An important novelty of the current study is the construction of so\-called RGW convolution operators\. Define the function $\psi_{\lambda}$ as

$$\psi_{\lambda}(t):=\frac{1}{\sqrt{\lambda}}\psi\left(\frac{t}{\lambda}\right)\quad(t\in\mathbb{R},\ \lambda>0). \tag{5}$$

The wavelet coefficients $W_{\psi}(\lambda,\tau)$ from Eq\. \([1](https://arxiv.org/html/2605.26310#S4.E1)\), where $\tau\in\mathbb{R}$, can then be written as

$$W_{\psi}(\lambda,\tau)=(f\ast\psi_{\lambda})(\tau)=\int_{-\infty}^{+\infty}f(t)\,\overline{\psi_{\lambda}}(t-\tau)\,dt.$$

In practical cases the signals that have to be processed are usually only available in a discretely sampled form\. Let

$$f_{k}=f(t_{k})\quad(t_{k}\in\mathbb{R}_{+},\ k=1,\ldots,N),$$

and consider the notation $\boldsymbol{f}:=(f_{1},f_{2},\ldots,f_{N})\in\mathbb{R}^{N}$\. Let furthermore

$$\boldsymbol{\psi}^{\boldsymbol{\eta}}_{\lambda,k}:=\psi^{\boldsymbol{\eta}}_{\lambda}(t_{k})\quad(k=1,\ldots,M)$$

denote the discrete sampling of the wavelet $\psi_{\lambda}$ \(see Eq\. \([5](https://arxiv.org/html/2605.26310#S4.E5)\)\)\.

Then, the wavelet coefficients of $f$ can be approximated with the discrete convolution

$$\left[\boldsymbol{f}\ast\boldsymbol{\psi}_{\lambda}^{\boldsymbol{\eta}}\right]_{k}:=\sum_{j=1}^{k}\boldsymbol{f}_{j}\boldsymbol{\psi}^{\boldsymbol{\eta}}_{\lambda,k-j+1}. \tag{6}$$

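The discrete convolution of Eq. (6) can be sketched as follows. Several choices here are assumptions made for illustration: terms with out-of-range indices are treated as zero, which makes the result coincide with NumPy's `mode="full"` convolution of length $N+M-1$; the scale is expressed in samples; and the mother wavelet is the stand-in $t\,e^{-t^{2}/2}$ (the Gaussian and the linear factor of Eq. (3), without further zeros or poles) and not a trained RGW kernel. The explicit double loop is included only to show that the library call matches the sum in Eq. (6).

```python
import numpy as np


def dilated_kernel(mother, lam, M):
    """Sample psi_lambda(t) = psi(t / lambda) / sqrt(lambda) at M points centred on 0."""
    t = np.arange(M) - (M - 1) / 2.0      # time axis in samples (assumption)
    return mother(t / lam) / np.sqrt(lam)


def wavelet_conv(f, kernel):
    """Eq. (6) with zero padding: full discrete convolution of length N + M - 1."""
    return np.convolve(f, kernel, mode="full")


def conv_direct(f, kernel):
    """Literal evaluation of the sum in Eq. (6), using 0-based indices."""
    N, M = len(f), len(kernel)
    out = np.zeros(N + M - 1)
    for k in range(N + M - 1):
        for j in range(max(0, k - M + 1), min(k, N - 1) + 1):
            out[k] += f[j] * kernel[k - j]
    return out


def mother(t):
    """Stand-in mother wavelet t * exp(-t^2 / 2), used here for illustration only."""
    return t * np.exp(-t**2 / 2.0)


N, M = 200, 65
rng = np.random.default_rng(0)
f = rng.normal(size=N)
kernel = dilated_kernel(mother, lam=8.0, M=M)
coeffs = wavelet_conv(f, kernel)
print(coeffs.shape)  # (264,)
```

Stacking `wavelet_conv(f, dilated_kernel(mother, lam, M))` for several scales `lam` gives one row of approximate wavelet coefficients per scale.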
Previous applications of RGW wavelets considered approximating wavelet coefficients using variable projection operators\[golub1973differentiation,golub2003separable\]\. This approach however heavily depends on the periodic, or quasi\-periodic property of the signal\. As shown in Fig\.[1](https://arxiv.org/html/2605.26310#S3.F1)and in Fig\.[2](https://arxiv.org/html/2605.26310#S3.F2), the audio signals considered in this paper are not periodic, thus variable projection based computation of RGW coefficients will not capture meaningful features\. To overcome this issue, we propose discrete convolution layers with RGW kernels, that are

1. capable of extracting meaningful features from non\-periodic signals using discrete convolutions,
2. able to adapt the mother wavelet’s morphology to a high degree, due to the nature of RGW\.

The key component of the proposed model driven neural network is the RGW convolution layer\. This layer applies discrete convolutions \(see Eq\. \([6](https://arxiv.org/html/2605.26310#S4.E6)\)\) to the 1D input signals, where the kernel $\boldsymbol{\psi}^{\boldsymbol{\eta}}_{\lambda}$ is an RGW wavelet characterized by a dilation parameter $\lambda>0$ and the parameters in $\boldsymbol{\eta}\in\mathbb{R}^{p+n}$\. This vector contains the zeros $t_{k}\ (k=1,\ldots,p)$ of the polynomial term from Eq\. \([3](https://arxiv.org/html/2605.26310#S4.E3)\) and the poles $z_{j}\ (j=1,\ldots,n)$ from Eq\. \([4](https://arxiv.org/html/2605.26310#S4.E4)\)\. Together they greatly influence the shape of the RGW mother wavelet and can be collected to a single vector by

$$\boldsymbol{\eta}=(t_{1},t_{2},\ldots,t_{p},z_{1},z_{2},\ldots,z_{n})\in\mathbb{C}^{n+p}\quad(n,p\in\mathbb{N}).$$

The transformation characterizing the proposed RGW layer is defined by

$$\boldsymbol{f}\rightarrow T_{\boldsymbol{\eta}}\boldsymbol{f}=\begin{bmatrix}\boldsymbol{f}\ast\boldsymbol{\psi}_{\lambda_{1}}^{\boldsymbol{\eta}}\\ \boldsymbol{f}\ast\boldsymbol{\psi}_{\lambda_{2}}^{\boldsymbol{\eta}}\\ \vdots\\ \boldsymbol{f}\ast\boldsymbol{\psi}_{\lambda_{m}}^{\boldsymbol{\eta}}\end{bmatrix}=:\begin{bmatrix}W_{\psi^{\eta}}(\lambda_{1})\\ W_{\psi^{\eta}}(\lambda_{2})\\ \vdots\\ W_{\psi^{\eta}}(\lambda_{m})\end{bmatrix}\in\mathbb{R}^{m\times(N+M-1)}, \tag{7}$$

where $\boldsymbol{f}\ast\boldsymbol{\psi}_{\lambda_{k}}^{\boldsymbol{\eta}}\ (k=1,\ldots,m)$ is the discrete convolution defined in \([6](https://arxiv.org/html/2605.26310#S4.E6)\)\. The Jacobian matrix of \([7](https://arxiv.org/html/2605.26310#S4.E7)\), with respect to $\boldsymbol{\eta}$, can be easily computed:

$$\left(\frac{\partial T_{\boldsymbol{\eta}}\boldsymbol{f}}{\partial\boldsymbol{\eta}}\right)_{\lambda_{k}}=\boldsymbol{f}\ast\frac{\partial\psi^{\boldsymbol{\eta}}_{\lambda_{k}}}{\partial\boldsymbol{\eta}}\quad(k=1,\ldots,m). \tag{8}$$

In \[RGW\], the authors offer a formula for computing partial derivatives of $\psi^{\boldsymbol{\eta}}_{\lambda_{k}}$ with respect to $t_{k}$ and $z_{k}$\. We rely on this formulation to implement the proposed RGW convolution layer\.
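Eq. (8) rests on the linearity of convolution: differentiating the layer output with respect to a kernel parameter amounts to convolving the input with the derivative of the kernel. The sketch below checks this numerically on a deliberately simplified kernel, which is an assumption made for illustration: a single real zero $t_{1}$, no poles, no normalization constant and unit scale, so that $\psi(t)=t(t-t_{1})(t+t_{1})e^{-t^{2}/2}$ and $\partial\psi/\partial t_{1}=-2t_{1}\,t\,e^{-t^{2}/2}$. The derivative formulas of \[RGW\] for the full wavelet are not reproduced here.

```python
import numpy as np


def psi(t, t1):
    """Simplified, unnormalized kernel with one real zero pair at +/- t1."""
    return t * (t - t1) * (t + t1) * np.exp(-t**2 / 2.0)


def dpsi_dt1(t, t1):
    """Analytic derivative of psi with respect to t1."""
    return -2.0 * t1 * t * np.exp(-t**2 / 2.0)


rng = np.random.default_rng(0)
f = rng.normal(size=256)              # synthetic input segment
t = np.linspace(-5.0, 5.0, 41)        # kernel sampling grid
t1, eps = 0.8, 1e-6

# Right-hand side of Eq. (8): convolve the input with the kernel derivative.
analytic = np.convolve(f, dpsi_dt1(t, t1), mode="full")

# Central finite difference of the layer output itself.
numeric = (
    np.convolve(f, psi(t, t1 + eps), mode="full")
    - np.convolve(f, psi(t, t1 - eps), mode="full")
) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))  # small, limited by floating point rounding
```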

The convolution layer is followed by a pooling layer, and a fully connected layer\. The pooling layer uses a scheme in which, for every scale $\lambda_{k}\ (k=1,\ldots,m)$, only the $Q$ wavelet coefficients of largest absolute value are retained:

$$MP(T_{\boldsymbol{\eta}}f)=\begin{bmatrix}\operatorname{topQ}(|W_{\psi^{\eta}}(\lambda_{1})|)\\ \operatorname{topQ}(|W_{\psi^{\eta}}(\lambda_{2})|)\\ \vdots\\ \operatorname{topQ}(|W_{\psi^{\eta}}(\lambda_{m})|)\end{bmatrix}\in\mathbb{R}^{m\times Q}, \tag{9}$$

where the operator $\operatorname{topQ}$ selects the $Q\in\mathbb{N}$ largest elements of the argument vector\. This allows the proposed neural network architecture to retain physically meaningful information in the following sense\. Notice that the $k$\-th component of $MP(T_{\boldsymbol{\eta}}f)$ encodes the $Q$ maximum similarity scores achieved between $\boldsymbol{f}$ and $\boldsymbol{\psi}_{\lambda_{k}}^{\boldsymbol{\eta}}\ (k=1,\ldots,m)$\. Since for a fixed dilation $\lambda_{k}$ the vector of wavelet coefficients $W_{\psi^{\eta}}(\lambda_{k})$ \(see Eq\. \([7](https://arxiv.org/html/2605.26310#S4.E7)\)\) corresponds to a pseudo frequency, the output of $MP(T_{\boldsymbol{\eta}}f)$ can be interpreted as the maximal amplitudes present in the input signal $\boldsymbol{f}$ at the frequency band defined by $\lambda_{k}$\.
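The pooling step of Eq. (9) can be sketched with a row-wise sort. The function name `top_q_pool` is hypothetical, and returning the retained values in descending order is an assumption, since Eq. (9) does not fix an ordering.

```python
import numpy as np


def top_q_pool(coeffs, Q):
    """Keep the Q largest absolute wavelet coefficients for every scale.

    coeffs has shape (m, L), one row per scale; the result has shape (m, Q),
    with each row sorted in descending order.
    """
    mags = np.abs(np.asarray(coeffs, dtype=np.float64))
    # Sort each row in ascending order, keep the last Q entries, then reverse.
    return np.sort(mags, axis=1)[:, -Q:][:, ::-1]


coeffs = np.array([
    [0.1, -2.0, 0.5, 1.5],   # coefficients at scale lambda_1
    [3.0, -0.2, -4.0, 0.0],  # coefficients at scale lambda_2
])
pooled = top_q_pool(coeffs, Q=2)
print(pooled)  # rows: [2.0, 1.5] and [4.0, 3.0]
```

Because only magnitudes are kept and their positions in time are discarded, the pooled features do not change when a signature shifts within the segment, which fits the translation concerns raised in section 3.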

## 5 Experiments

We conducted a number of experiments to demonstrate the effectiveness of RGW convolution networks for audio-based UAV detection. As described in section [2](https://arxiv.org/html/2605.26310#S2), our dataset includes indoor and outdoor measurement scenarios along with several different types of drones and drone swarms. Accordingly, different experimental scenarios were considered for the detection and classification of a single UAV or multiple UAVs. The considered scenarios are described as follows.

1. Differentiating between the presence of a single UAV and multiple UAVs. The recordings used for this experiment were taken in a studio environment. The three categories are no drones being present, the presence of a single UAV, and the presence of multiple UAVs (up to three). The UAV models used for this experiment were DJI Mavic Pro, DJI Mavic Pro 2, and DJI Mavic Mini.
2. The classification of a single UAV in a studio environment. In this case, the model receives an audio segment with a single UAV present and has to differentiate between three possible models (Mavic Pro, Mavic Pro 2 and Mavic Mini).
3. Detecting the presence of a single UAV outdoors, in a noisy environment. The categories are the same as in Scenario 1. The UAV models used for this scenario are Avata 2, Matrice 30T, Mavic Mini and Mavic Pro 3.

Table 2: Considered experimental scenarios. The number of UAVs refers to the maximum number of individual UAVs present in a single segment, while the number of UAV types refers to the total number used in that scenario.

Table [2](https://arxiv.org/html/2605.26310#S5.T2) gives a summary of the considered scenarios. The outdoor experiment (as described in section [2](https://arxiv.org/html/2605.26310#S2)) can be regarded as extremely noisy, while the indoor experiments can be regarded as noise free.

Table 3: Structure of the datasets used in the individual experiments.

In each experiment, the recorded sound samples were subjected to an identical preprocessing regime described in section [3](https://arxiv.org/html/2605.26310#S3). The proposed ML models were very similar for each measurement scenario as well. The most significant difference between the models used for the indoor and outdoor scenarios is the size of the wavelet convolution kernel ([7](https://arxiv.org/html/2605.26310#S4.E7)). If the size of the kernel does not equal $N$, the size of the convolution output will not be $m\times N$; instead, it depends on the size of the kernel and the input, and is different in each scenario. The considered neural network is composed of an RGW convolution layer, followed by two fully connected layers, each preceded by a ReLU activation function. The output of the RGW layer is first normalized:

$$\frac{W_{\psi^{\eta}}(\lambda_{k})-\mathbf{E}(W_{\psi^{\eta}}(\lambda_{k}))}{\mathbf{Var}(W_{\psi^{\eta}}(\lambda_{k}))},$$

where $\mathbf{E}$ is the mean and $\mathbf{Var}$ is the variance, then downsampled with a pooling layer. The pooling layer is unique in that it samples the dominant coefficients for each dilation parameter $\lambda_{k}\ (k=1,\ldots,m)$, essentially functioning as a 1-dimensional filter, as described in Eq. ([9](https://arxiv.org/html/2605.26310#S4.E9)). The activation function after the final layer is either sigmoid, in the binary classification case, or softmax, in the case of multiple categories.
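The per-scale normalization can be sketched as follows. The sketch follows the formula as printed, which divides by the variance rather than the standard deviation. Using the population variance (rather than the sample variance) is an assumption, and the function name and toy row are hypothetical.

```python
from statistics import fmean, pvariance

def normalize_row(row):
    """Center one row of wavelet coefficients and divide by its variance."""
    mu = fmean(row)
    var = pvariance(row, mu)  # population variance (assumption)
    if var == 0:
        raise ValueError("constant row: variance is zero")
    return [(w - mu) / var for w in row]

# Toy coefficients for a single scale lambda_k.
row = [1.0, 2.0, 3.0, 4.0]
normalized = normalize_row(row)
print(normalized)
```

For the toy row the mean is 2.5 and the population variance is 1.25, so the normalized values are approximately -1.2, -0.4, 0.4 and 1.2. The normalization is applied to each scale independently before the pooling step.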

The numbers $p$ and $n$ of learnable parameters used to determine the morphology of the RGW mother wavelet were chosen as $p=1$ and $n=10$. In these experiments, we considered 10 wavelet filters (the number of scales $m=10$).

The classifier block used after the RGW layer consists of a single fully connected layer with 200 neurons. The model is trained for 300 epochs with a batch size of 64.

Each individual experiment was evaluated via 5-fold cross-validation. The training consisted of randomly splitting the total data 80-to-20 into a training set, which was used to fit the model parameters, and a test set, on which the accuracy of the model was evaluated. This process was repeated a total of five times for each experiment, and the following tables report the mean, minimum and maximum accuracy.
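The evaluation protocol described above (a random 80-to-20 split, repeated five times, reporting mean, minimum and maximum accuracy) can be sketched as follows. This is only an illustration of the procedure as worded in the text. The helper names, the fixed seed, and the trivial majority-class scorer standing in for model training are all hypothetical.

```python
import random

def repeated_holdout(samples, labels, train_and_score,
                     repeats=5, test_fraction=0.2, seed=0):
    """Repeat a random train/test split and summarize the test accuracies."""
    rng = random.Random(seed)
    n = len(samples)
    n_test = int(round(n * test_fraction))
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx],
        ))
    return {"mean": sum(scores) / len(scores),
            "min": min(scores), "max": max(scores)}

def majority_baseline(x_train, y_train, x_test, y_test):
    """Placeholder 'model': always predicts the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return sum(y == majority for y in y_test) / len(y_test)

# Toy data: 100 segments, 60 of class 0 and 40 of class 1.
samples = list(range(100))
labels = [0] * 60 + [1] * 40
summary = repeated_holdout(samples, labels, majority_baseline)
print(summary)
```

In a real experiment `train_and_score` would fit the network on the training part and return its accuracy on the held-out part.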

Model performance was compared to several baseline ML methods. For this purpose, we considered the Random Forest (RF) classifier \[pal2005random\], linear and radial basis function kernel SVM (SVM-L and SVM-RBF) classifiers \[chapelle2007training\], as well as the naive Bayes (NB) classifier \[NB\]. In addition, we conducted experiments on these scenarios using fully connected (FCNN) as well as convolutional (CNN) neural networks. In each scenario, the structure of the FCNN model and of the fully connected layers in the CNN model was identical to the structure of the fully connected layers of the RGW-kernel ([7](https://arxiv.org/html/2605.26310#S4.E7)) model used in that scenario. Furthermore, the number of channels and the size of the convolution layer in the CNN model were identical to those of the RGW convolution layer. The results are shown in Tables [4](https://arxiv.org/html/2605.26310#S5.T4), [5](https://arxiv.org/html/2605.26310#S5.T5) and [6](https://arxiv.org/html/2605.26310#S5.T6).

The accuracy of the trained models is calculated by

$$Acc=\frac{N_{tp}+N_{tn}}{N_{total}},\tag{10}$$

where $N_{tp}$ and $N_{tn}$ are the numbers of true positive and true negative predictions, respectively, and $N_{total}$ is the total number of segments in the test set.
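In the binary case, $N_{tp}+N_{tn}$ is simply the number of correctly classified segments, so Eq. (10) can be computed as the fraction of matching labels. The sketch below uses that reading, which also extends naturally to the three-class scenarios. Whether the authors compute multi-class accuracy exactly this way is an assumption, and the function name and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of test segments whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy three-class example: 6 of 8 predictions are correct.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0, 2, 0, 0]
acc = accuracy(y_true, y_pred)
print(acc)  # 0.75
```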

In the following subsections we describe each experimental scenario in detail\. We note that all of the hyperparameters were determined using a large grid search of the parameter space\. To ensure full reproducibility of our results, we direct the reader to the data and code availability statement at the end of the article\.

![Refer to caption](https://arxiv.org/html/2605.26310v1/x4.png)

Figure 4: Schematic model of the neural network used for our experiments. The samples are first fed into an RGW convolution layer for feature extraction, then into a fully connected block used for classification.

### 5.1 Detecting swarms in a studio environment

The summary of our experimental scenarios can be found in Table [2](https://arxiv.org/html/2605.26310#S5.T2). In this section we consider the indoor experimental scenario labeled with id 1. Our dataset contains three labels: a category containing samples with only background noise, a category with sounds from a single drone, and a category which contains samples with sound recordings from multiple drones. That is, the classification task is to differentiate between a lone drone, background noise, and the presence of multiple drones.

In this particular scenario, the length of each wavelet kernel was set to 32, which results in the size of the convolution output being $10\times 19169$.
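The dependence of the output size on the kernel length can be sketched as follows, assuming a "valid" (unpadded) convolution with unit stride; the text does not state the padding or stride. Under that assumption an input of $N$ samples and a kernel of length $K$ yield $N-K+1$ outputs per scale. The reported width 19169 with $K=32$ would then correspond to 19200-sample segments, and the width 4347 reported for $K=64$ in section 5.3 to 4410-sample segments. Both segment lengths are inferred here rather than taken from the text, and the function name is hypothetical.

```python
def valid_conv_output_length(input_length: int, kernel_length: int,
                             stride: int = 1) -> int:
    """Number of outputs of a 'valid' (unpadded) 1-D convolution."""
    if kernel_length > input_length:
        raise ValueError("kernel must not be longer than the input")
    return (input_length - kernel_length) // stride + 1

# Segment lengths are inferred from the reported output sizes (assumption).
indoor_width = valid_conv_output_length(19200, 32)   # kernel length 32
outdoor_width = valid_conv_output_length(4410, 64)   # kernel length 64
print(indoor_width, outdoor_width)
```

With $m=10$ scales, these widths give the $10\times 19169$ and $10\times 4347$ output shapes quoted in the text.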

The performance of the model was compared to the baseline ML algorithms introduced at the beginning of this section. The results can be seen in Table [4](https://arxiv.org/html/2605.26310#S5.T4). The classical methods, as well as the fully connected neural network, fail to achieve a high level of accuracy, while the RGW-kernel based network and the convolutional neural network are able to accurately identify the UAV types. Furthermore, the RGW-kernel based approach outperforms the convolutional neural network. It is important to note that the RGW-kernel based model achieves this accuracy using far fewer parameters, and that the RGW-kernel layer produces an output whose parameters carry physical meaning about the input signals.

Table 4: Results of detecting the presence of a swarm in a studio environment (scenario 1 in Table [2](https://arxiv.org/html/2605.26310#S5.T2)).

### 5.2 Classifying drones in a studio environment

Next, we consider scenario 2 from Table [2](https://arxiv.org/html/2605.26310#S5.T2). For the task of identifying drone model types, the dataset is divided into three categories. Each category contains samples from a single drone, and the categories are characterized by the type of UAV that can be heard in the sample.

The parameters of the trained model were the same as those of the model used in section [5.1](https://arxiv.org/html/2605.26310#S5.SS1), with the exception of the final activation function, which was Softmax instead of Sigmoid.

The results can be seen in Table [5](https://arxiv.org/html/2605.26310#S5.T5). Similarly to the previous scenario, the CNN and the RGW-kernel based model are able to achieve a much higher accuracy than the other approaches.

Table 5: Results of classifying the UAV model types present in a studio recording (scenario 2 in Table [2](https://arxiv.org/html/2605.26310#S5.T2)).

### 5.3 Detecting drones in a noisy environment

Finally, we consider experimental scenario 3 from Table [2](https://arxiv.org/html/2605.26310#S5.T2). For this experiment, our objective is to detect the presence of a single UAV; however, the dataset was collected in a much less controlled environment and is subject to a large number of noise factors. The details of the noisy dataset are described in section [2](https://arxiv.org/html/2605.26310#S2).

As in the previous experiments, the number of learned dilation parameters was 10. The numbers $p$ and $n$ of RGW mother wavelet parameters remained $p=1$ and $n=10$; however, the size of the convolution kernel was raised to 64, and the size of the convolution output changed to $10\times 4347$. As the task is a binary classification problem, Sigmoid was used as the final activation.

Table [6](https://arxiv.org/html/2605.26310#S5.T6) contains the accuracy of the RGW-based model, as well as the accuracy of the baseline ML methods also used in the previous experiments. As can be seen, the RGW-kernel based model outperforms every other neural network, even the CNN model. It is important to note that the task in this scenario was considerably more difficult than in the previous ones, as the audio signal was subject to heavy background noise. In this case, the interpretability of the RGW-kernel model is especially critical, as it allows us to verify the validity of the trained model.

Table 6: Results of detecting drone presence in a noisy environment (scenario 3 in Table [2](https://arxiv.org/html/2605.26310#S5.T2)).

## 6 Conclusion

In this study, we introduced a novel model-based neural network incorporating a rational Gaussian wavelet transformation. We have demonstrated the effectiveness of this neural network by solving a series of tasks based around acoustic detection of UAVs in various environments. By using a generic model, we have shown that our scheme can achieve accurate results and, for some tasks, can even surpass the state-of-the-art CNN approach while using fewer parameters and small datasets.

In the future, we plan to use more specialized models in order to achieve higher accuracy on specific tasks, as well as to use these trained models for real-time detection and classification.

## Acknowledgment

This research was funded by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021 funding scheme, grant number TKP2021\-NVA\-03\. TD received funding from the Swiss Government Excellence Scholarship No\. 2025\.0057\. This work was supported by the University Excellence Fund of Eötvös Loránd University, Budapest, Hungary \(ELTE\)\. Project no\. K146721 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the K\_23 ”OTKA” funding scheme\. This work was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Science\. Supported by the EKÖP\-KDP\-24 university excellence scholarship program cooperative doctoral program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund\.

## References

## Code and data availability

The Python implementation of the proposed methods and experiments can be downloaded from

[https://gitlab\.com/aele02/drone\-classification\-and\-detection/\-/tree/main](https://gitlab.com/aele02/drone-classification-and-detection/-/tree/main)\.

The data that support the findings of this study are available from the HUN-REN Centre for Energy Research, but restrictions apply to the availability of the data, which were used under license for the current study. The data are not publicly available.

## CRediT Author statement

Gergő Ungvári: Formal Analysis, Software, Writing - Original Draft. Ferenc Braun: Data Curation, Writing - Original Draft. Attila Ámon: Software, Writing - Review & Editing. Péter Kackstädter: Resources, Writing - Original Draft. János Volk: Supervision, Funding Acquisition, Writing - Review & Editing. Péter Kovács: Project Administration, Writing - Review & Editing. Tamás Dózsa: Methodology, Software, Writing - Original Draft.
