SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

arXiv cs.AI 06/04/26, 04:00 AM Papers
few-shot-learning audio-classification shortcut-learning benchmark spurious-correlations audio-ai evaluation
Summary
SpurAudio is a new benchmark designed to evaluate shortcut learning and spurious correlations in few-shot audio classification, revealing that state-of-the-art methods—including large pretrained audio foundation models—suffer significant performance degradation when background correlations are disrupted.
arXiv:2605.13672v1 Announce Type: cross Abstract: Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.
Cached at: 06/05/26, 02:10 AM
# SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
Source: [https://arxiv.org/html/2605.13672](https://arxiv.org/html/2605.13672)
Giries Abu Ayoub \(Department of Computer Science, University of Haifa; jerryabuayob@gmail\.com\), Morad Tukan \(Independent Researcher; muradtuk@gmail\.com\), Loay Mualem \(University of Stuttgart, Germany; IMPRS\-IS, Germany; loaymua@gmail\.com\)\. Footnotes: Equal contribution; Corresponding Author; IMPRS\-IS: International Max Planck Research School for Intelligent Systems\.

###### Abstract

Few\-shot classification \(FSC\) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues\. In real\-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals\. While such effects have been studied in few\-shot image classification, their role in few\-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure\. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi\-level evaluation of contextual shifts across support and query sets\. Using this benchmark, we show that many state\-of\-the\-art few\-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols\. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation\. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time\. These findings provide new insight into the behavior of few\-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models\. Code: [https://github\.com/Jerryaa98/SpurAudio](https://github.com/Jerryaa98/SpurAudio)

## 1 Introduction

Few\-shot classification \(*FSC*\) aims to recognize novel classes from only a handful of labeled examples Vinyals et al\. \([2016](https://arxiv.org/html/2605.13672#bib.bib21)\); Snell et al\. \([2017](https://arxiv.org/html/2605.13672#bib.bib32)\); Finn et al\. \([2017a](https://arxiv.org/html/2605.13672#bib.bib20)\); Wang et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib8)\)\. While recent advances in representation learning have substantially improved data efficiency, *FSC* remains particularly challenging in real\-world *audio* applications\. Sound events are often rare, costly to annotate, and highly variable in their acoustic realization, making large\-scale supervised training impractical\. As a result, few\-shot audio classification is critical for many high\-impact domains, including bioacoustic monitoring Ghani et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib3)\); Nolasco et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib47)\); You et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib45)\); Moummad et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib42)\); Liu et al\. \([2024a](https://arxiv.org/html/2605.13672#bib.bib43)\); Ijaz et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib44)\); McEwen et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib46)\); Jana et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib48)\), industrial fault diagnosis Siraj et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib52)\); Liang et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib50)\); Saleem et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib49)\); Zabin et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib51)\), and healthcare audio analysis Disha Sendhil Kumar et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib53)\); Florea et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib54)\), where failures can carry significant ecological, economic, or safety consequences\.

A defining challenge in audio, however, lies in its *additive* nature\. Unlike images, where objects are often spatially separable from their surroundings, audio foreground events are superimposed on background sounds in the time–frequency domain Wichern et al\. \([2019](https://arxiv.org/html/2605.13672#bib.bib17)\); Maciejewski et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib18)\)\. In practice, target sounds rarely occur in isolation and are embedded in rich and often predictive acoustic contexts\. This makes few\-shot audio models vulnerable to exploiting *spurious correlations* between class labels and background cues: models can achieve strong accuracy for the *wrong reason* by keying on context rather than semantic foreground content\. Such non\-causal shortcuts can artificially inflate performance under matched training and testing conditions but lead to abrupt failures when background contexts shift\.

Prior work has demonstrated that audio representations are sensitive to background interference, degradations, and polyphony Salamon and Bello \([2017](https://arxiv.org/html/2605.13672#bib.bib56)\); Turpault et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib57)\); Abeßer et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib55)\)\. More recently, robustness\-oriented approaches such as *RobustCLAP* Selvakumar et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib59)\) and related methods aim to improve invariance to noise, corruption, or background variation at the *representation learning* level\. However, these studies are predominantly conducted in supervised or zero\-shot settings and do not examine *episodic few\-shot generalization*, where models must rapidly adapt from only a handful of labeled examples\. As a result, it remains unclear how much current few\-shot audio methods rely on contextual shortcuts and how fragile their performance is under controlled background shifts\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures/illustration_iid_vs_ood.png)

Figure 1: Visualization of SpurAudio’s episodic structure\. \(a\) A 1\-shot 2\-way episode: two foreground classes \(Coughing and Pig\) are mixed with different backgrounds \(e\.g\., Vacuum Cleaner, Thunderstorm, Church Bells, Siren\) and projected by $\phi$ into a feature space\. Samples from the same foreground class point in the same direction, while background noise changes the length of mixed sample vectors \(dark hues\) relative to clean ones \(light hues\)\. \(b\) An OOD episode where support \(s\) and query \(q\) contain disjoint backgrounds \(e\.g\., Church Bells vs\. Siren\)\. Despite sharing the same foreground class, queries are displaced from their correct support clusters in the metric space, leading to misclassification\.

In a typical *FSC* episode, models generalize from a small labeled *support set* to an unlabeled *query set* Vinyals et al\. \([2016](https://arxiv.org/html/2605.13672#bib.bib21)\); Snell et al\. \([2017](https://arxiv.org/html/2605.13672#bib.bib32)\)\. Although modern methods achieve strong performance under controlled evaluation protocols, they often degrade substantially under realistic out\-of\-distribution \(*OOD*\) conditions\. Following prior work in vision Zhang and et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib9)\); [Sagawa and Koh](https://arxiv.org/html/2605.13672#bib.bib10), OOD failures in FSC can arise from two sources: \(i\) cross\-domain *FSC*, where source and target domains differ \(e\.g\., speech $\rightarrow$ music\), and \(ii\) spurious\-correlation *FSC* \(*SC\-FSC*\), where semantic classes remain unchanged but contextual cues vary\.

This work focuses on *SC\-FSC*, a failure mode that is particularly pernicious in audio\. Because foreground and background signals are inseparably mixed, consistent co\-occurrence patterns during training can encourage models to rely on contextual shortcuts rather than semantic foreground content\. This phenomenon is closely related to *shortcut learning* Geirhos et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib61)\) and the “Clever Hans” effect Liu et al\. \([2024b](https://arxiv.org/html/2605.13672#bib.bib58)\)\. When such correlations are broken at test time, for example, when a machine fault occurs in a different factory environment, performance can degrade catastrophically, undermining real\-world deployment\.

### 1\.1 Positioning and Gap in Existing Work

Existing audio robustness studies primarily investigate domain shifts across datasets, recording conditions, or acoustic scenes Heggan et al\. \([2022](https://arxiv.org/html/2605.13672#bib.bib11)\)\. Even large\-scale datasets such as *FSD50K* Fonseca et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib60)\) are typically used for supervised or zero\-shot training without controlling or disentangling foreground–background correlations\. While robustness\-focused methods \(e\.g\., *RobustCLAP* Selvakumar et al\. \([2025](https://arxiv.org/html/2605.13672#bib.bib59)\)\) aim to improve invariance to noise or corruption at the representation level, they do not evaluate *episodic few\-shot generalization* under controlled background manipulations\. Consequently, current benchmarks predominantly evaluate matched conditions or broad domain shifts Heggan et al\. \([2022](https://arxiv.org/html/2605.13672#bib.bib11)\), often conflating semantic foreground content with background context\. They cannot isolate the effect of non\-causal background correlations on episodic generalization and may overestimate robustness by rewarding reliance on contextual shortcuts\. As a result, the effect of spurious contextual cues on few\-shot audio learning remains largely uncharacterized, leaving a critical blind spot in understanding few\-shot audio generalization\.

### 1\.2 Our Work

To address this gap, we introduce SpurAudio, a benchmark systematically designed to isolate and evaluate spurious correlations in few\-shot audio classification; see Figure [1](https://arxiv.org/html/2605.13672#S1.F1) for an illustration of SpurAudio\. Our data is obtained by mixing foreground events from five real\-world datasets with semantically unrelated background textures\. This controlled mixing induces strong correlations in the support set \(e\.g\., Class A with Background X\) while varying background conditions in the query set \(e\.g\., Class A with Background Y\), enabling clean disentanglement between causal foreground learning and shortcut reliance\. Crucially, SpurAudio serves as a diagnostic *dataset* for analyzing failure modes across different families of FSC methods\.

Beyond providing a controlled benchmark, SpurAudio enables in\-depth analysis of few\-shot audio methods under contextual shifts\. Using this benchmark, we show that many state\-of\-the\-art *FSC* approaches suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols\. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations because few\-shot performance is shaped by how well feature representations align with classifier heads\. Importantly, this vulnerability is not confined to small backbones: it persists across large pretrained audio foundation models, indicating that spurious background reliance is a fundamental property of few\-shot audio inference rather than a limitation of representational capacity\. These observations reveal systematic strengths and vulnerabilities of current *FSC* algorithms and highlight the importance of benchmarks that explicitly probe context dependence when evaluating few\-shot audio models\.

More importantly, our work highlights subtle patterns that not only open new research directions within few\-shot learning, but also point to broader implications for a wide range of audio tasks\. Hence, our contributions are three\-fold:

- We introduce SpurAudio, a controlled benchmark enabling manipulation of foreground–background correlations across multiple audio domains\.
- We characterize spurious\-correlation *OOD* failures in few\-shot audio classification, showing that state\-of\-the\-art methods collapse when background contexts shift\.
- We provide an extensive benchmark of metric\-based, meta\-learning, contrastive, transductive, and fine\-tuning approaches, spanning both standard backbones and large pretrained audio foundation models, highlighting systematic context reliance and motivating future work on context\-robust few\-shot audio learning\.

## 2 SpurAudio Dataset

This section presents SpurAudio, a dataset specifically designed to study the impact of spurious correlations in few\-shot audio classification\. SpurAudio is assembled by aggregating audio samples from five publicly available datasets spanning diverse acoustic domains: \(i\) ESC\-50 Piczak \([2015](https://arxiv.org/html/2605.13672#bib.bib12)\): A benchmark dataset of 50 environmental sound classes, including animal vocalizations, natural phenomena, and human activities\. \(ii\) UrbanSound8K Salamon et al\. \([2014](https://arxiv.org/html/2605.13672#bib.bib13)\): A collection of 8,732 urban audio clips across 10 categories, such as sirens, dog barks, and drilling\. \(iii\) VocalSound Gong et al\. \([2022](https://arxiv.org/html/2605.13672#bib.bib14)\): A dataset of human\-produced vocal imitations and sound effects\. \(iv\) WILD DESED Xiao and Das \([2024](https://arxiv.org/html/2605.13672#bib.bib15)\): Weakly labeled recordings captured in a variety of outdoor acoustic environments\. \(v\) USM Abeßer \([2022](https://arxiv.org/html/2605.13672#bib.bib16)\): A large\-scale dataset of sound events embedded within complex acoustic scenes\.

### 2\.1 Sound Event Generation

To synthesize realistic sound events that occur in the “wild,” we define two complementary concepts that partition the audio collection: \(I\) Foreground \(FG\): the target event class to be recognized within a few\-shot learning episode; and \(II\) Background \(BG\): an audio clip sampled from a semantically unrelated class, introduced as a confounding context\. Foreground and background classes are paired to be semantically independent while still reflecting combinations that plausibly co\-occur in real\-world acoustic environments; for example, a “dog barking” foreground is combined with “park noise” in the background\. To ensure diversity and avoid over\-representation, we further limit the repeated use of the same background class across multiple foreground classes\.

Data Generation Flow\. Given a pair of sound clips, a foreground $x_{\mathrm{fg}}(t)$ and a background $x_{\mathrm{bg}}(t)$, our objective is to ensure that the resulting mixture represents a scenario that humans would perceive as a plausible real\-world co\-occurrence\. To this end, two annotators conducted a three\-stage curation process: \(i\) partitioning the complete collection of sound clips into foreground and background sets while maintaining maximal connectivity between them; \(ii\) associating each foreground class with four distinct background classes; and \(iii\) manually curating the resulting combined sound events\.

Mixing Process\. To generate mixtures that resemble naturally occurring sound events, we adopt the mixing procedure proposed in Wichern et al\. \([2019](https://arxiv.org/html/2605.13672#bib.bib17)\); Maciejewski et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib18)\)\. Given $x_{\mathrm{fg}}(t)$ and $x_{\mathrm{bg}}(t)$, both signals are first resampled to 16 kHz and trimmed or padded to a fixed duration of $T=5$ seconds, yielding $\hat{x}_{\mathrm{fg}}(t)$ and $\hat{x}_{\mathrm{bg}}(t)$\. We then compute the integrated loudness of each signal using the EBU R128 \(LUFS\) standard and scale the background to a fixed perceptual margin of 8 dB below the foreground prior to mixing\. The final mixture is peak\-normalized to prevent clipping\. Formally, the mixed signal is defined as

$$x_{\mathrm{mix}}(t)=\hat{x}_{\mathrm{fg}}(t)+\alpha\,10^{(L_{fg}-L_{bg}-\gamma)/20}\,\hat{x}_{\mathrm{bg}}(t),\tag{1}$$

where $L_{fg}$ and $L_{bg}$ are the integrated loudness values of the foreground and background, $\alpha$ is the mixing coefficient, and $\gamma=8$\. Details of the foreground–background class pairings are provided in Appendix [B](https://arxiv.org/html/2605.13672#A2)\.
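The mixing rule in Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the inputs are already resampled to 16 kHz, it replaces the EBU R128 integrated-loudness measurement with a plain mean-square level in dB (a real implementation would use a proper loudness meter), and the function names `level_db`, `fit_length`, `background_gain`, and `mix` are hypothetical.

```python
import math

def level_db(x):
    """Mean-square level in dB; ASSUMPTION: a simple stand-in for EBU R128 loudness."""
    mean_sq = sum(s * s for s in x) / len(x)
    return 10.0 * math.log10(mean_sq + 1e-12)  # epsilon avoids log(0) on silence

def fit_length(x, n):
    """Trim or zero-pad a list of samples to exactly n samples."""
    return list(x[:n]) + [0.0] * max(0, n - len(x))

def background_gain(fg, bg, gamma=8.0, alpha=1.0):
    """Linear gain applied to the background: alpha * 10^((L_fg - L_bg - gamma) / 20)."""
    return alpha * 10.0 ** ((level_db(fg) - level_db(bg) - gamma) / 20.0)

def mix(fg, bg, sr=16000, seconds=5, gamma=8.0, alpha=1.0):
    """Mix following Eq. (1), then peak-normalize so the largest sample has magnitude 1."""
    n = sr * seconds
    fg, bg = fit_length(fg, n), fit_length(bg, n)
    gain = background_gain(fg, bg, gamma, alpha)
    mixed = [f + gain * b for f, b in zip(fg, bg)]
    peak = max(abs(s) for s in mixed)
    return [s / peak for s in mixed] if peak > 0 else mixed
```

With `alpha=1`, the scaled background sits `gamma` dB below the foreground under the level measure used; larger `alpha` raises the background relative to that margin.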

Choice of $\gamma$\. Empirical analyses of everyday acoustic environments indicate that naturally occurring foreground–background signal\-to\-noise ratios span a broad positive range\. Large\-scale in\-situ measurements report that most SNRs lie between approximately 2 dB and 14 dB, with mean values around 7–8 dB in noisy daily settings Wu et al\. \([2018](https://arxiv.org/html/2605.13672#bib.bib19)\)\. These statistics characterize the acoustic structure of real environments rather than listener\-dependent perception\. Based on these observations, we set the perceptual loudness margin to $\gamma=8$ dB, yielding mixtures in which background interference remains audible without overwhelming the foreground signal\.
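As a quick check of what an 8 dB margin means in amplitude terms, consider the special case (an assumption made only for this illustration) where the two clips have equal measured loudness, $L_{fg}=L_{bg}$, and $\alpha=1$. The background scale factor in Eq. (1) then reduces to

$$10^{-\gamma/20}=10^{-8/20}=10^{-0.4}\approx 0.398,$$

so the background is attenuated to roughly 40% of its original amplitude, which corresponds to about $10^{-0.8}\approx 0.158$ of its original power.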

Manual Quality Control\. The automated mixing process may introduce unintended semantic overlap, for example, a “traffic noise” background clip containing a siren, which could confound analysis of spurious correlations\. To mitigate this, annotators evaluated each mixed sound event according to the following criteria: \(i\) degree of acoustic similarity between the background and the foreground; \(ii\) whether the background overwhelms the foreground; \(iii\) whether the background is inaudible; and \(iv\) presence of additional unintended sound events beyond foreground and background labels\. Each criterion is scored on a scale from 1 to 5, with higher scores indicating clearer perceptual separation\. Mixed events with an average score below 4 are discarded\. In total, 50,116 mixtures were generated, from which SpurAudio comprises a curated subset of 16,378 sound events; see Appendix [B](https://arxiv.org/html/2605.13672#A2) for the complete mapping between foreground classes and background contexts\.
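The discard rule above amounts to a simple threshold on the mean criterion score. The sketch below assumes one score per criterion per mixture (how the two annotators' scores are combined is not specified in the text), and the criterion keys are hypothetical names chosen for illustration.

```python
from statistics import mean

# Hypothetical keys for the four criteria (i)-(iv); each is scored from 1 to 5.
CRITERIA = ("similarity", "bg_overwhelms", "bg_inaudible", "extra_events")

def keep_mixture(scores, threshold=4.0):
    """Keep a mixture unless its average criterion score falls below the threshold."""
    return mean(scores[c] for c in CRITERIA) >= threshold

def curate(mixtures, threshold=4.0):
    """Filter a list of (mixture_id, scores) pairs down to the retained ids."""
    return [mid for mid, scores in mixtures if keep_mixture(scores, threshold)]
```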

## 3 Families of Few\-Shot Learning Methodologies

Few\-shot classification \(*FSC*\) is cast as an episodic learning problem\. Each episode $\mathcal{T}$ corresponds to an $N$\-way $K$\-shot classification task, composed of a *support set* $\mathcal{S}\equiv\{(x_i,y_i)\}_{i=1}^{N\cdot K}$ and a *query set* $\mathcal{Q}\equiv\{(x_j,y_j)\}_{j=1}^{M}$\. The labels $y_i,y_j\in\mathcal{C}$, where $\mathcal{C}$ is the set of $N$ classes sampled for that episode\. The objective is to infer the labels of the query examples in $\mathcal{Q}$ using only the labeled support examples in $\mathcal{S}$\. In the standard setup, during training, *FSC* approaches learn a model over a collection of base classes, denoted $\mathcal{Y}_{\text{train}}$, and are then evaluated on a disjoint set of novel classes $\mathcal{Y}_{\text{test}}$\.
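The episodic setup can be sketched as a small sampler. This is an illustrative sketch, not the benchmark's loader: `sample_episode` and the `data_by_class` mapping (class label to a list of clip identifiers) are hypothetical, and it assumes a fixed number of query samples per class so that the query set has `n_way * n_query` items.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode as (support, query) lists of (clip, label) pairs."""
    classes = rng.sample(sorted(data_by_class), n_way)  # the N classes of this episode
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never share a clip.
        items = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query
```

Passing an explicit `random.Random(seed)` instance keeps episode sampling reproducible across runs.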

In our scenario, each input signal $x$ can contain both a *foreground* event \(which determines the class label\) and a *background* acoustic environment\. This compositional nature inherently gives rise to spurious correlations between the class labels and the background components of the signal\. To this end, we examine five principal categories of *FSC* methods: \(i\) metric\-based, \(ii\) meta\-learning\-based, \(iii\) fine\-tuning\-based, \(iv\) transductive, and \(v\) contrastive\-based\.

Metric\-Based Methods\. Metric\-based *FSC* learns an embedding function $\theta:\mathcal{X}\rightarrow\mathbb{R}^{d}$ that brings samples from the same class close together while separating different classes\. Support examples for each class are embedded and summarized by a prototype, typically the mean of their embeddings\. Let $\{\theta(x^{j}_{i})\}_{i=1}^{K}$ be the embeddings of $K$ support samples for class $j\in[N]$, and define the prototype as $p^{j}=\frac{1}{K}\sum_{i=1}^{K}\theta(x^{j}_{i})$\. For a query $q$, the class posterior given the set of prototypes $\mathcal{P}=\{p^{l}\}_{l\in[N]}$ is computed via a softmax over negative distances:

$$\mathrm{Pr}\left(q\text{ belongs to class }j\mid\mathcal{P}\right)=\frac{e^{-\mathrm{Dist}\left(\theta(q),p^{j}\right)}}{\sum_{l\in[N]}e^{-\mathrm{Dist}\left(\theta(q),p^{l}\right)}}.$$
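The prototype and posterior computations above can be sketched directly on precomputed embeddings. The sketch assumes squared Euclidean distance for `Dist` (the text leaves the distance generic) and treats embeddings as plain lists of floats; the function names are hypothetical.

```python
import math

def prototype(embeddings):
    """Class prototype: the element-wise mean of the K support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def sq_euclidean(a, b):
    """ASSUMPTION: squared Euclidean distance stands in for the generic Dist."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_posterior(query_emb, prototypes, dist=sq_euclidean):
    """Softmax over negative distances from the query embedding to each prototype."""
    logits = [-dist(query_emb, p) for p in prototypes]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the decision depends only on distances to prototypes, any background-induced displacement of a query embedding translates directly into a shift in the posterior.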

Meta\-Learning Based Methods\. Meta\-learning trains models over a distribution of tasks to enable rapid adaptation from limited data\. Tasks $\mathcal{T}\sim\mathcal{S}\times\mathcal{Q}$ comprise support $\mathcal{S}_{i}$ and query $\mathcal{Q}_{i}$ sets\. Starting from meta\-parameters $\theta$, task\-specific parameters are obtained via an adaptation operator $\mathcal{A}$, i\.e\., $\theta_{i}=\mathcal{A}(\theta,\mathcal{S}_{i})$, and the meta\-objective minimizes expected query loss: $\min_{\theta}\mathbb{E}_{\mathcal{T}_{i}}\left[\mathcal{L}_{\mathcal{T}_{i}}(\theta_{i};\mathcal{Q}_{i})\right]$\.

A canonical example is MAML Finn et al\. \([2017a](https://arxiv.org/html/2605.13672#bib.bib20)\), which learns an initialization $\theta$ such that a few gradient steps on the support set yield good task performance: $\theta_{i}^{(k+1)}=\theta_{i}^{(k)}-\alpha\nabla_{\theta_{i}^{(k)}}\mathcal{L}(\theta_{i}^{(k)};\mathcal{S}_{i}),\quad\theta_{i}^{(0)}=\theta$, with meta\-objective $\min\limits_{\theta}\sum_{i}\mathcal{L}(\theta_{i}^{(M)};\mathcal{Q}_{i})$, where $M$ is the number of inner\-loop updates\.
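The inner-loop update can be illustrated on a toy problem. The sketch below assumes a one-parameter linear model `y = w * x` with mean squared error, so the gradient is available in closed form; it evaluates the meta-objective for a given initialization but omits the outer optimization, which would require differentiating through the inner loop. All names are hypothetical, and the step size `alpha` here is the inner-loop learning rate, unrelated to the mixing coefficient in Eq. (1).

```python
def loss(w, data):
    """Mean squared error of the toy model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Analytic gradient of the loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def adapt(theta, support, alpha=0.1, steps=3):
    """Inner loop: take a few gradient steps on the support set, starting from theta."""
    w = theta
    for _ in range(steps):
        w = w - alpha * grad(w, support)
    return w

def meta_objective(theta, tasks, alpha=0.1, steps=3):
    """Sum of query losses after adapting theta to each task's support set."""
    return sum(loss(adapt(theta, s, alpha, steps), q) for s, q in tasks)
```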

Fine\-Tuning\-Based Methods\. Fine\-tuning\-based FSC first trains a backbone on base classes to learn transferable representations, then adapts to novel classes by replacing and fine\-tuning the classifier, optionally updating parts of the backbone, using the support set\. While this enables greater adaptation than fixed\-embedding methods, it relies on very limited data, making it sensitive to optimization choices, prone to overfitting, and susceptible to reinforcing spurious correlations learned during pretraining, often resulting in high performance variability across tasks\.

Contrastive\-Based Methods\. Contrastive learning aims to learn representations in which semantically related audio samples are mapped close together, while unrelated samples are pushed apart\. In few\-shot audio classification, such objectives are commonly used to improve representation quality by leveraging data augmentations or weak supervision, and have been shown to enhance overall robustness\. A typical formulation employs the InfoNCE loss, which encourages similarity between positive pairs while contrasting them against negatives\.
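For concreteness, a single-anchor InfoNCE loss can be written as a cross-entropy over similarity scores, with the positive in the first slot. The sketch assumes cosine similarity and a temperature of 0.1, neither of which is specified in the text, and uses hypothetical function names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / t) / sum_k exp(s_k / t) ) over the positive and all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is small when the anchor is far more similar to its positive than to any negative, and grows when a negative is the closer match.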

Transductive\-Based Methods\. Transductive methods leverage the statistical distribution of the unlabeled query set during inference\. By processing query samples collectively rather than in isolation, they mitigate support\-set bias and better align representations across novel tasks\.
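One simple way to use the query set collectively is to refine class prototypes with soft assignments of the unlabeled queries, in the style of soft k-means. This is offered only as an illustrative instance of transductive inference, not as any specific method evaluated in the benchmark; the squared Euclidean distance, the function names, and the number of refinement steps are assumptions.

```python
import math

def posterior(q, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = [-sum((a - b) ** 2 for a, b in zip(q, p)) for p in protos]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def refine_prototypes(support_protos, k_shot, queries, steps=2):
    """Blend each support prototype (weight K) with softly assigned query embeddings."""
    protos = support_protos
    for _ in range(steps):
        resp = [posterior(q, protos) for q in queries]  # soft labels for all queries
        protos = [
            [
                (k_shot * sp[i] + sum(r[j] * q[i] for r, q in zip(resp, queries)))
                / (k_shot + sum(r[j] for r in resp))
                for i in range(len(sp))
            ]
            for j, sp in enumerate(support_protos)
        ]
    return protos
```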

## 4 Experiments

In this section, we study the effect of spurious correlations in our dataset, *SpurAudio*, under few\-shot learning settings\. Specifically, we evaluate 1\-shot and 5\-shot classification tasks, with 10 query samples per class, across different encoders and *FSC* algorithms\. For each configuration, performance is averaged over three random seeds, and we report both the mean classification accuracy and standard deviation; we refer the reader to Section [A](https://arxiv.org/html/2605.13672#A1) in the appendix for the full experimental setup\.

Roadmap\. First, we introduce *FSC* evaluation tasks that control foreground–background relationships in audio episodes, enabling systematic manipulation of background context across episodes \(Section [4\.1](https://arxiv.org/html/2605.13672#S4.SS1)\)\. Second, we quantify the *IID* vs\. *OOD* performance gap across multiple *FSC* model families and backbone architectures \(Tables [1](https://arxiv.org/html/2605.13672#S4.T1), [2](https://arxiv.org/html/2605.13672#S4.T2), and [6](https://arxiv.org/html/2605.13672#A6.T6)\), and stress\-test the proposed tasks by progressively increasing *OOD* difficulty, revealing a consistent amplification of the *IID*–*OOD* gap \(Section [4\.2\.1](https://arxiv.org/html/2605.13672#S4.SS2.SSS1)\)\. Third, we conduct a diagnostic analysis by visualizing embedding spaces of support and query samples and studying the effect of the mixing coefficient $\alpha$ \(Eq\. \([1](https://arxiv.org/html/2605.13672#S2.E1)\)\) on the magnitude of the gap \(Sections [4\.2\.2](https://arxiv.org/html/2605.13672#S4.SS2.SSS2) and [C](https://arxiv.org/html/2605.13672#A3)\); as an additional validation, we demonstrate that the generated audio mixtures resemble real\-world recordings by showing close proximity between CLAP embeddings of SpurAudio and those of FSD50K Fonseca et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib60)\) \(Sections [4\.2\.3](https://arxiv.org/html/2605.13672#S4.SS2.SSS3) and [J\.2](https://arxiv.org/html/2605.13672#A10.SS2)\)\. Fourth, we isolate foreground signals and re\-evaluate *FSC* methods to verify that background content is the primary source of the shortcut learning \(see Figure [3](https://arxiv.org/html/2605.13672#S4.F3)\), then use head–backbone replacement studies to disentangle representation learning from decision rules, revealing how different *FSC* algorithms handle spurious correlations \(Section [G\.6](https://arxiv.org/html/2605.13672#A7.SS6)\)\. Finally, to rule out the possibility that the observed *IID*–*OOD* gap is merely an artifact of limited CNN capacity, we extend our analysis to large pretrained audio models used as frozen encoders in combination with a wide range of non\-backbone\-dependent few\-shot heads \(Section [4\.3](https://arxiv.org/html/2605.13672#S4.SS3)\)\.

Training & Evaluation\. We adapt the LibFewShot Li et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib22)\) framework to work with audio inputs, while adopting its training parameters and implementing the *FSC* algorithms described above\. All methods are evaluated using episodic classification accuracy under both *IID* and *OOD* task sampling\. Results are averaged over multiple random seeds and evaluation episodes\. To quantify sensitivity to spurious correlations and distribution shifts, we report the accuracy gap $\Delta=\mathrm{Acc}_{\mathrm{IID}}-\mathrm{Acc}_{\mathrm{OOD}}$, where larger values indicate greater performance degradation under *OOD* conditions\.
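The reporting described above reduces to a mean, a standard deviation, and a difference of means. The sketch below assumes per-seed accuracies are already available and uses the sample standard deviation (the text does not say whether sample or population deviation is reported); the function names are hypothetical.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

def accuracy_gap(iid_accuracies, ood_accuracies):
    """Delta = Acc_IID - Acc_OOD, computed on the seed-averaged accuracies."""
    return mean(iid_accuracies) - mean(ood_accuracies)
```

A larger value of `accuracy_gap` indicates a method that loses more accuracy when the support-time background pairing no longer holds at query time.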

### 4\.1 Highlighting the Effect of Spurious Background Shifts

The *SpurAudio* dataset is split into three main subsets: \(i\) training, \(ii\) validation, and \(iii\) test\. Training and validation primarily consist of *IID* \(in\-distribution\) tasks\. The test set is further divided into two subsets: one composed of *IID* tasks, and the other of *OOD* \(out\-of\-distribution\) tasks\. Both subsets share the same set of foreground sound events but differ in their background pairings, enabling us to assess the effect of spurious correlations introduced by background context\. Specifically, we focus on two types of tasks: \(I\) *IID* tasks – support and query examples share the same background pairings for each of the $N$ classes, and \(II\) *OOD* tasks – the background pairings are altered to break spurious foreground–background correlations, challenging the model to rely on the foreground content rather than contextual shortcuts\.

*OOD* task generation\. For each *OOD* task, each class’s foreground sound event is paired with background sounds that may also appear with other classes’ foregrounds\. This creates overlap in background usage across classes, explicitly breaking the spurious association while keeping $N$, $K$, and the number of query samples unchanged\. In Section [4](https://arxiv.org/html/2605.13672#S4), we demonstrate the resulting performance gap between *IID* and *OOD* tasks\.
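The difference between the two task types can be sketched as a choice of which backgrounds the query samples draw from. This is one possible reading of the protocol, not the benchmark's actual sampler: it assumes that in *OOD* tasks a class's query backgrounds come from the backgrounds paired with the other classes in the episode, and `assign_backgrounds` and `paired_bgs` are hypothetical names.

```python
import random

def assign_backgrounds(classes, paired_bgs, k_shot, n_query, ood, rng):
    """Return {class: (support_bgs, query_bgs)} for one episode.

    paired_bgs maps each foreground class to its list of paired background classes.
    """
    plan = {}
    for c in classes:
        own = paired_bgs[c]
        support_bgs = [rng.choice(own) for _ in range(k_shot)]
        if ood:
            # ASSUMPTION: query backgrounds are those paired with the other episode
            # classes, excluding any background this class was itself paired with.
            pool = sorted({b for o in classes if o != c for b in paired_bgs[o]} - set(own))
        else:
            pool = own  # IID: queries reuse the class's own background pairing
        query_bgs = [rng.choice(pool) for _ in range(n_query)]
        plan[c] = (support_bgs, query_bgs)
    return plan
```

The number of classes, shots, and queries is identical in both modes, so any accuracy difference between them is attributable to the background assignment alone.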

### 4\.2 In\-Depth Analysis

#### 4\.2\.1 Effect of Spurious Correlation Strength

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures/alpha_graphs/alpha_graph_proto_Hybrid_backbone.png)

Figure 2: IID–OOD gap versus spurious correlation strength\. We plot $\Delta(\alpha)$ for the 1\-shot and 5\-shot settings\. The gap grows with $\alpha$, showing that strong foreground–background correlations increase *OOD* degradation\.

In the standard *OOD* evaluation, support and query backgrounds are sampled such that some background patterns seen in the support set do not appear in the query set of other classes\. To more directly probe the effect of spurious correlations, we construct a *stronger correlation* setting in which every background present in the support set of each class in the test split also appears in the query sets of all other test classes; see Figure [7](https://arxiv.org/html/2605.13672#A4.F7) in Section [D](https://arxiv.org/html/2605.13672#A4) of the appendix\. This maximizes background overlap across test tasks, removing background\-specific cues for classification\. Across all methods, stronger correlations induce larger accuracy drops than the standard setting, indicating that few\-shot methods rely more on background cues as their predictive power grows\. Thus, degradation scales with spurious\-correlation strength rather than distribution shift alone\.

Table 1: Conv64 results on 1\-shot and 5\-shot tasks\. $\ast$ refers to an additional attention module on top of the Conv64 model\.

#### 4\.2\.2Embedding Geometry and Spurious\-Correlation Effects

To understand how spurious backgrounds affect few\-shot audio classification, we analyze embedding geometry under *IID* and *OOD* settings\. Figures [4](https://arxiv.org/html/2605.13672#A3.F4)–[6](https://arxiv.org/html/2605.13672#A3.F6) show t\-SNE projections of support and query embeddings \(see Section [C](https://arxiv.org/html/2605.13672#A3)\), and Figure [2](https://arxiv.org/html/2605.13672#S4.F2) plots $\Delta$ against the spurious\-correlation strength $\alpha$ \(Eq\. \([1](https://arxiv.org/html/2605.13672#S2.E1)\)\)\.

Embedding structure under *IID* vs\. *OOD*\. Across methods, *IID* episodes show tighter class clusters and better alignment between support and query samples\. Under *OOD* background shifts, query embeddings drift away from their corresponding support clusters, increasing overlap between classes and making nearest\-prototype decisions less reliable\. This geometric mismatch provides an interpretable explanation for the observed accuracy drop under spurious\-correlation shifts\.

Effect of spurious\-correlation strength\. Figure [2](https://arxiv.org/html/2605.13672#S4.F2) shows that the performance gap $\Delta(\alpha)=\text{Acc}_{\text{IID}}-\text{Acc}_{\text{OOD}}$ grows with $\alpha$ in both the $1$\-shot and $5$\-shot regimes\. Interestingly, the gap can be larger in the higher\-shot setting, suggesting that additional support examples may reinforce background\-dependent representations rather than improve robustness when correlations are strong, as depicted in Figure [15](https://arxiv.org/html/2605.13672#A12.F15); see Section [L](https://arxiv.org/html/2605.13672#A12) of the appendix\.
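For a fixed $\alpha$, the gap is simply the difference between two accuracies\. A minimal sketch, assuming predictions and labels are available as plain lists \(the helper names `accuracy` and `iid_ood_gap` are hypothetical, and the toy values are not results from the paper\):

```python
def accuracy(preds, labels):
    """Fraction of query predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def iid_ood_gap(iid_preds, iid_labels, ood_preds, ood_labels):
    """Delta = Acc_IID - Acc_OOD for one method at one correlation strength."""
    return accuracy(iid_preds, iid_labels) - accuracy(ood_preds, ood_labels)


# Toy episode with four queries: all correct under IID, two correct under OOD.
labels = [0, 1, 2, 3]
gap = iid_ood_gap([0, 1, 2, 3], labels, [0, 1, 0, 0], labels)
```

A positive value means the method loses accuracy once the background pairings are broken; in practice the two accuracies would be averaged over many episodes before taking the difference\.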

#### 4\.2\.3On The Distribution of SpurAudio

We show that the distribution underlying SpurAudio forms only a sub\-distribution of the broader pool of sound events encountered in the “wild”, and that these events are situated within diverse acoustic contexts\. Our dataset incorporates standard audio mixing at realistic signal\-to\-noise ratios, without adding synthetic artifacts or adversarial perturbations\. In Section [J](https://arxiv.org/html/2605.13672#A10), we empirically show that the resulting sound events are close to those found in the FSD50K dataset Fonseca et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib60)\)\. In particular, we find that embeddings of SpurAudio sound events preserve the structure induced by embeddings of real, “in\-the\-wild” sound events that are semantically similar \(in terms of class labels\), while also remaining in close proximity to those real events in the embedding space; see Figure [14](https://arxiv.org/html/2605.13672#A10.F14)\.

#### 4\.2\.4Why Spurious Correlation Harms Few\-Shot Audio Classification

Spurious background correlations harm few\-shot audio classification by biasing embedding\-based similarity, even when foreground semantics remain unchanged\. Through controlled IID–OOD evaluations, head–backbone replacement, and background perturbations, we identify a recurring failure mechanism across architectures and algorithm families \(Figures [9](https://arxiv.org/html/2605.13672#A7.F9)–[12](https://arxiv.org/html/2605.13672#A7.F12), Tables [1](https://arxiv.org/html/2605.13672#S4.T1)–[6](https://arxiv.org/html/2605.13672#A6.T6), Appendix [G](https://arxiv.org/html/2605.13672#A7)\)\.

Table 2: ResNet12 results on 1\-shot and 5\-shot tasks\. $\ast$ refers to an additional attention module on top of the ResNet12 model\.

Background effects on representation geometry\. Background variation primarily affects *feature magnitude* rather than semantic direction\. Figure [13](https://arxiv.org/html/2605.13672#A8.F13) shows that background perturbations consistently contract embedding norms while preserving class\-aligned structure\. In addition, this contraction is consistent across backbones and training paradigms, indicating that magnitude sensitivity is a fundamental property of current audio embeddings rather than an artifact of a specific model\. Moreover, this magnitude sensitivity is not localized to the final embedding layer: stage\-wise hooking experiments on Conv64F and ResNet12 \(Appendix [I](https://arxiv.org/html/2605.13672#A9), Tables [9](https://arxiv.org/html/2605.13672#A9.T9)–[10](https://arxiv.org/html/2605.13672#A9.T10)\) reveal that the *IID*–*OOD* gap is present from the first convolutional or residual block, indicating that spurious background correlations are encoded throughout the feature hierarchy rather than emerging only at the final embedding\.

Failure of global, unnormalized similarity\. Many few\-shot methods implicitly treat representation strength as a proxy for semantic similarity\. Approaches based on dot\-product similarity, Euclidean prototypes, or global pooling directly entangle background\-dependent activation magnitude with class identity\. While this bias is largely invisible under IID conditions, background shifts cause systematic overestimation of dissimilarity, leading to large $\Delta$ \(e\.g\., Baseline, ProtoNet, ATL\-NET in Figure [3](https://arxiv.org/html/2605.13672#S4.F3)\)\.

Limits of feature normalization\. Cosine\-based classifiers \(Baseline\+\+, Meta\-Baseline\) mitigate magnitude sensitivity by normalizing features prior to comparison\. This normalization explains their improved stability on shallow backbones where background variation is primarily radial\.
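The contrast between magnitude\-sensitive and normalized heads can be seen in a two\-dimensional toy example\. The sketch below is not taken from the paper: the prototypes, the query, and the contraction factor of 0\.2 are made\-up values chosen so that a purely radial shrinkage of the query flips a Euclidean nearest\-prototype decision while leaving the cosine decision unchanged\.

```python
import math


def euclidean_predict(query, prototypes):
    """Nearest prototype under Euclidean distance (magnitude-sensitive)."""
    return min(prototypes, key=lambda c: math.dist(query, prototypes[c]))


def cosine_predict(query, prototypes):
    """Most similar prototype under cosine similarity (magnitude-invariant)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max(prototypes, key=lambda c: cos(query, prototypes[c]))


# Hypothetical class prototypes and a query that belongs to class "A".
prototypes = {"A": (4.0, 0.0), "B": (0.5, 1.0)}
query = (4.0, 0.2)

# Simulated background shift: same direction, contracted norm.
contracted = tuple(0.2 * x for x in query)

results = {
    "euclidean_original": euclidean_predict(query, prototypes),
    "euclidean_contracted": euclidean_predict(contracted, prototypes),
    "cosine_original": cosine_predict(query, prototypes),
    "cosine_contracted": cosine_predict(contracted, prototypes),
}
```

Here the Euclidean head assigns the original query to `"A"` but the contracted one to `"B"`, whereas the cosine head returns `"A"` in both cases\. The example covers only a purely radial shift; it says nothing about cases where the background also changes the direction of the embedding\.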

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/graph_candle_comparison.png)

Figure 3: Impact of Spurious Background Correlations on Query Evaluation\. We compare few\-shot accuracy under three query settings: Green \(IID\): matched support\-query backgrounds; Blue \(Clean\): IID support with background\-free queries; and Red \(OOD\): mismatched spurious backgrounds\. Clean queries improve over OOD but remain below IID, indicating that models rely on background shortcuts\. Results are averaged over 3 seeds\.

ResNet backbones and background leakage\. For deeper ResNets \(e\.g\., ResNet12\), background information becomes directionally entangled with foreground features, leaving substantial $\Delta$ even after normalization \(Table [6](https://arxiv.org/html/2605.13672#A6.T6)\)\. This shows that normalization alone is insufficient and motivates inference mechanisms that either localize matching \(e\.g\., DN4\) or adapt representations at test time\.

Robust inference mechanisms\. Up to this point, we have found that methods that more strongly suppress global similarity exhibit consistently greater robustness\. Local\-descriptor approaches \(DN4\) bypass global aggregation entirely, while adaptive meta\-learning methods \(e\.g\., MAML, BOIL\) use the support set to reshape the representations and downweight background\-dependent features\. By contrast, frozen\-backbone approaches \(e\.g\., ANIL, R2D2\) preserve these background shortcuts, resulting in larger and more persistent OOD performance gaps \(Appendix [G\.1\.1](https://arxiv.org/html/2605.13672#A7.SS1.SSS1), [G\.2\.2](https://arxiv.org/html/2605.13672#A7.SS2.SSS2)\)\.

### 4\.3Large Audio Models

A natural objection is that spurious background reliance may reflect the limited capacity of the CNN backbones used so far\. We address this by stress\-testing our findings on five state\-of\-the\-art large audio encoders spanning contrastive \(CLAP Wu\* et al\. \([2023](https://arxiv.org/html/2605.13672#bib.bib62)\)\), supervised transformer \(AST Gong et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib63)\)\), masked autoencoding \(AudioMAE\-AS20K Huang et al\. \([2022](https://arxiv.org/html/2605.13672#bib.bib64)\)\), and instruction\-tuned multimodal \(Qwen2\-Audio\-7B Chu et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib65)\)\) paradigms, each used as a frozen encoder paired with eleven non\-backbone\-dependent heads ranging from classical baselines Snell et al\. \([2017](https://arxiv.org/html/2605.13672#bib.bib32)\); Chen et al\. \([2019](https://arxiv.org/html/2605.13672#bib.bib23)\); Li et al\. \([2019](https://arxiv.org/html/2605.13672#bib.bib34)\) to modern transductive and label\-propagation methods Zhu and Koniusz \([2023](https://arxiv.org/html/2605.13672#bib.bib69)\); Shalam \([2024](https://arxiv.org/html/2605.13672#bib.bib68)\); Lee et al\. \([2024](https://arxiv.org/html/2605.13672#bib.bib67)\); Ziko et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib70)\); Liu et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib71)\); Martin et al\. \([2022](https://arxiv.org/html/2605.13672#bib.bib72)\); Guo et al\. \([2026](https://arxiv.org/html/2605.13672#bib.bib73)\)\. For additional results on $1$\-shot *FSC* with large audio models and transformer\-based models, we refer the reader to Section [K](https://arxiv.org/html/2605.13672#A11)\.

Takeaways: From a 4\-layer CNN to a 7B\-parameter audio LLM \(Tables [1](https://arxiv.org/html/2605.13672#S4.T1), [2](https://arxiv.org/html/2605.13672#S4.T2), [3](https://arxiv.org/html/2605.13672#S4.T3), [6](https://arxiv.org/html/2605.13672#A6.T6)\), three findings emerge:

\(i\) Every *FSC* method suffers an *IID*–*OOD* gap\. The gap appears in every method family, at every backbone size, and at every shot count, and grows as $\alpha$ increases \(Figure [2](https://arxiv.org/html/2605.13672#S4.F2)\)\. It is therefore not caused by any single architecture\.

\(ii\) Larger pre\-trained encoders are not robust\. While such encoders can reach $\approx 96\%$ *IID* accuracy on corpora that cover SpurAudio’s classes, they still suffer gaps under standard heads\. This is consistent with background shifts perturbing embedding magnitudes while leaving angular structure largely preserved, with standard heads reading magnitude changes as class differences\.

\(iii\) The inference head matters more than the encoder\. Since background shifts perturb magnitude but not direction, heads that ignore magnitude \(cosine similarity, query\-set graphs, neighborhood propagation\) shrink the *IID*–*OOD* gap by an order of magnitude, while heads built on absolute distance inherit the shift\. Results confirm this: transductive heads \(Proto\-LP Zhu and Koniusz \([2023](https://arxiv.org/html/2605.13672#bib.bib69)\), BD\-CSPN Liu et al\. \([2020](https://arxiv.org/html/2605.13672#bib.bib71)\), ECPE Guo et al\. \([2026](https://arxiv.org/html/2605.13672#bib.bib73)\)\) shrink $\Delta$ by an order of magnitude \(Proto\-LP: $1.5$–$2.6\%$ on CLAP, AST, AudioMAE, Qwen2\-Audio\), while magnitude\-sensitive heads \(Hela\-VFA, BPA\) hit double\-digit gaps even with the strongest encoders\. While the same transductive heads result in small $\Delta$ on Conv64 and ResNet12, their absolute accuracy on these backbones lags the best metric\-based methods\. This is because transductive inference relies on the cluster assumption Chapelle et al\. \([2009](https://arxiv.org/html/2605.13672#bib.bib79)\) and on a well\-clustered query manifold Li et al\. \([2020a](https://arxiv.org/html/2605.13672#bib.bib78)\); Ziko et al\. \([2021](https://arxiv.org/html/2605.13672#bib.bib77)\), a property that emerges in large pretrained encoders\. The large IID gap between Conv64 and CLAP under a fixed ProtoNet head \(Tables [1](https://arxiv.org/html/2605.13672#S4.T1) and [3](https://arxiv.org/html/2605.13672#S4.T3)\) emphasizes that large pretrained encoders produce sharper class clusters that transductive heads can exploit\.
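To make the idea of a magnitude\-insensitive, query\-aware head concrete, the sketch below refines class prototypes with the unlabeled query set after L2\-normalizing all embeddings\. It is a deliberately simplified stand\-in and not an implementation of Proto\-LP, BD\-CSPN, or ECPE; the function names, the hard\-assignment update, and the fixed number of iterations are assumptions made for illustration, and all embeddings are assumed to be non\-zero\.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return tuple(sum(xs) / len(vectors) for xs in zip(*vectors))


def assign(x, prototypes):
    """Class whose normalized prototype has the highest cosine similarity."""
    def score(c):
        return sum(a * b for a, b in zip(x, normalize(prototypes[c])))
    return max(prototypes, key=score)


def transductive_predict(support, queries, n_iter=3):
    """support: dict class -> list of embeddings; queries: list of embeddings."""
    sup = {c: [normalize(v) for v in vs] for c, vs in support.items()}
    qs = [normalize(q) for q in queries]
    prototypes = {c: mean(vs) for c, vs in sup.items()}
    for _ in range(n_iter):
        labels = [assign(q, prototypes) for q in qs]
        # Pull each prototype toward the queries currently assigned to it.
        prototypes = {
            c: mean(sup[c] + [q for q, y in zip(qs, labels) if y == c])
            for c in sup
        }
    return [assign(q, prototypes) for q in qs]


support = {"A": [(1.0, 0.1)], "B": [(0.1, 1.0)]}
queries = [(0.3, 0.05), (0.02, 0.4), (5.0, 1.0)]  # very different norms
predictions = transductive_predict(support, queries)
```

The three toy queries differ in norm by more than an order of magnitude, yet each is assigned according to its direction alone\. Such a head can only help when the query embeddings are already well clustered by class, which is the cluster assumption discussed above\.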

Table 3: 5\-shot IID and OOD accuracy \(%\) and gaps across few\-shot methods on large audio models\.

## 5Conclusions and Future Work

We introduced SpurAudio, a benchmark for studying spurious foreground–background correlations in few\-shot audio classification, and showed that the resulting *IID*–*OOD* gap is consistent across model families, backbone scales, and large pretrained encoders\. Our geometric analysis traces this failure to inference\-time interactions with background cues: background variation primarily perturbs embedding *magnitudes* while leaving angular structure intact, explaining why magnitude\-sensitive heads degrade sharply while transductive, relational ones remain robust\. These findings open a new avenue for few\-shot research centered on embedding geometry, including magnitude\-aware objectives, adaptive similarity metrics, and multimodal extensions of SpurAudio\.

## Acknowledgments and Disclosure of Funding

This research was funded by the Ministry of Science, Research and the Arts Baden\-Wuerttemberg in the Artificial Intelligence Software Academy \(AISA\)\. L\. Mualem also acknowledges the support of the Stuttgart Center for Simulation Science \(SimTech\) and thanks the International Max Planck Research School for Intelligent Systems \(IMPRS\-IS\) for support\. L\. Mualem was supported by a postdoctoral scholarship from the Planning and Budgeting Committee \(PBC\) of the Council for Higher Education in Israel\. L\. Mualem gratefully acknowledges the computing time provided on the high\-performance computer HoreKa by the National High\-Performance Computing Center at KIT \(NHR@KIT\)\. This center is jointly supported by the Federal Ministry of Education and Research and the Ministry of Science, Research and the Arts of Baden\-Württemberg, as part of the National High\-Performance Computing \(NHR\) joint funding program \(https://www\.nhr\-verein\.de/en/our\-partners\)\. HoreKa is partly funded by the German Research Foundation \(DFG\)\.

## References

- \[1\] \(2023\) How robust are audio embeddings for polyphonic sound event tagging?\. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, pp\. 2658–2667\.
- \[2\] J\. Abeßer \(2022\) Classifying sounds in polyphonic urban sound scenes\. AES E\-Library\. Online resource\.
- \[3\] A\. Baevski, Y\. Zhou, A\. Mohamed, and M\. Auli \(2020\) Wav2vec 2\.0: a framework for self\-supervised learning of speech representations\. Advances in neural information processing systems 33, pp\. 12449–12460\.
- \[4\] S\. Baik, J\. Choi, H\. Kim, D\. Cho, J\. Min, and K\. M\. Lee \(2021\) Meta\-learning with task\-adaptive loss function for few\-shot learning\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 9465–9474\.
- \[5\] L\. Bertinetto, J\. F\. Henriques, P\. H\. Torr, and A\. Vedaldi \(2018\) Meta\-learning with differentiable closed\-form solvers\. arXiv preprint arXiv:1805\.08136\.
- \[6\] O\. Chapelle, B\. Scholkopf, and A\. Zien \(2009\) Semi\-supervised learning \(chapelle, o\. et al\., eds\.; 2006\) \[book reviews\]\. IEEE Transactions on Neural Networks 20 \(3\), pp\. 542–542\.
- \[7\] S\. Chen, Y\. Wu, C\. Wang, S\. Liu, D\. Tompkins, Z\. Chen, and F\. Wei \(2022\) BEATs: audio pre\-training with acoustic tokenizers\. arXiv preprint arXiv:2212\.09058\.
- \[8\] T\. Chen, S\. Kornblith, M\. Norouzi, and G\. Hinton \(2020\) A simple framework for contrastive learning of visual representations\. In International conference on machine learning, pp\. 1597–1607\.
- \[9\] W\. Chen, Y\. Liu, Z\. Kira, Y\. F\. Wang, and J\. Huang \(2019\) A closer look at few\-shot classification\. arXiv preprint arXiv:1904\.04232\.
- \[10\] Y\. Chen, Z\. Liu, H\. Xu, T\. Darrell, and X\. Wang \(2021\) Meta\-baseline: exploring simple meta\-learning for few\-shot learning\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 9062–9071\.
- \[11\] Y\. Chu, J\. Xu, Q\. Yang, H\. Wei, X\. Wei, Z\. Guo, Y\. Leng, Y\. Lv, J\. He, J\. Lin, C\. Zhou, and J\. Zhou \(2024\) Qwen2\-Audio technical report\. arXiv preprint arXiv:2407\.10759\.
- \[12\] Y\. Disha Sendhil Kumar, M\. V\. Shetty, and S\. Vhaduri \(2025\) Cough classification using few\-shot learning\. arXiv e\-prints, pp\. arXiv–2509\.
- \[13\] C\. Dong, W\. Li, J\. Huo, Z\. Gu, and Y\. Gao \(2021\) Learning task\-aware local representations for few\-shot learning\. In Proceedings of the twenty\-ninth international conference on international joint conferences on artificial intelligence, pp\. 716–722\.
- \[14\] B\. Elizalde, S\. Deshmukh, M\. Al Ismail, and H\. Wang \(2023\) CLAP learning audio concepts from natural language supervision\. In IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 1–5\.
- \[15\] C\. Finn, P\. Abbeel, and S\. Levine \(2017\) Model\-agnostic meta\-learning for fast adaptation of deep networks\. In International conference on machine learning, pp\. 1126–1135\.
- \[16\] C\. Finn, P\. Abbeel, and S\. Levine \(2017\) Model\-agnostic meta\-learning for fast adaptation of deep networks\. In International conference on machine learning, pp\. 1126–1135\.
- \[17\] A\. Florea, X\. Jiang, N\. Mesgarani, and X\. Jiang \(2025\) Exploring finetuned audio\-LLM on heart murmur features\. Smart Health, pp\. 100557\.
- \[18\] E\. Fonseca, X\. Favory, J\. Pons, F\. Font, and X\. Serra \(2021\) FSD50K: an open dataset of human\-labeled sound events\. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, pp\. 829–852\.
- \[19\] R\. Geirhos, J\. Jacobsen, C\. Michaelis, R\. Zemel, W\. Brendel, M\. Bethge, and F\. A\. Wichmann \(2020\) Shortcut learning in deep neural networks\. Nature Machine Intelligence 2 \(11\), pp\. 665–673\.
- \[20\] B\. Ghani, T\. Denton, and S\. Kahl \(2024\) Deep learning for bioacoustics: a survey of recent advances in few\-shot recognition\. Journal of Applied Ecology \(Example Citation\) 12 \(4\), pp\. 112–125\.
- \[21\] Y\. Gong, Y\. Chung, and J\. Glass \(2021\) AST: audio spectrogram transformer\. arXiv preprint arXiv:2104\.01778\.
- \[22\] Y\. Gong, J\. Yu, and J\. Glass \(2022\) VocalSound: a dataset for improving human vocal sounds recognition\. In ICASSP 2022 \- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 151–155\. External Links: [Document](https://dx.doi.org/10.1109/ICASSP43922.2022.9746828)\.
- \[23\] M\. Guo, J\. Wang, Q\. Xu, B\. Jiang, and B\. Luo \(2026\) Entropy calibrated prototype embedding for transductive few\-shot learning\. Pattern Recognition Letters\.
- \[24\] C\. Heggan, S\. Budgett, T\. Hospedales, and M\. Yaghoobi \(2022\) MetaAudio: a few\-shot audio classification benchmark\. In International Conference on Artificial Neural Networks, pp\. 219–230\.
- \[25\] P\. Huang, H\. Xu, J\. Li, A\. Baevski, M\. Auli, W\. Galuba, F\. Metze, and C\. Feichtenhofer \(2022\) Masked autoencoders that listen\. Advances in neural information processing systems 35, pp\. 28708–28720\.
- \[26\] N\. Ijaz, F\. Banoori, and I\. Koo \(2024\) Reshaping bioacoustics event detection: leveraging few\-shot learning \(FSL\) with transductive inference and data augmentation\. Bioengineering 11 \(7\), pp\. 685\.
- \[27\] A\. Jana, M\. Uili, J\. Atherton, M\. O’Brien, J\. Wood, and L\. Brickson \(2025\) An automated pipeline for few\-shot bird call classification: a case study with the tooth\-billed pigeon\. arXiv preprint arXiv:2504\.16276\.
- \[28\] K\. Koutini, J\. Schlüter, H\. Eghbal\-Zadeh, and G\. Widmer \(2021\) Efficient training of audio transformers with patchout\. arXiv preprint arXiv:2110\.05069\.
- \[29\] G\. Y\. Lee, T\. Dam, D\. P\. Poenar, V\. N\. Duong, and M\. M\. Ferdaus \(2024\) HELA\-VFA: a hellinger distance\-attention\-based feature aggregation network for few\-shot classification\. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp\. 2173–2183\.
- \[30\] S\. Li, D\. Chen, Y\. Chen, L\. Yuan, L\. Zhang, Q\. Chu, and N\. Yu \(2020\) Are fewer labels possible for few\-shot learning?\. arXiv preprint arXiv:2012\.05899\.
- \[31\] W\. Li, L\. Wang, J\. Huo, Y\. Shi, Y\. Gao, and J\. Luo \(2020\) Asymmetric distribution measure for few\-shot learning\. arXiv preprint arXiv:2002\.00153\.
- \[32\] W\. Li, L\. Wang, J\. Xu, J\. Huo, Y\. Gao, and J\. Luo \(2019\) Revisiting local descriptor based image\-to\-class measure for few\-shot learning\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 7260–7268\.
- \[33\] W\. Li, Z\. Wang, X\. Yang, C\. Dong, P\. Tian, T\. Qin, J\. Huo, Y\. Shi, L\. Wang, Y\. Gao, et al\. \(2023\) LibFewShot: a comprehensive library for few\-shot learning\. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 \(12\), pp\. 14938–14955\.
- \[34\] Y\. Liang, P\. Zhao, and Y\. Wang \(2023\) Federated few\-shot learning\-based machinery fault diagnosis in the industrial internet of things\. Applied Sciences 13 \(18\), pp\. 10458\.
- \[35\] J\. Liu, L\. Song, and Y\. Qin \(2020\) Prototype rectification for few\-shot learning\. In European conference on computer vision, pp\. 741–756\.
- \[36\] W\. Liu, H\. Liu, F\. Lin, H\. Liu, T\. Gao, X\. Fang, J\. Liu, X\. Deng, Y\. Sun, K\. Xu, et al\. \(2024\) Few\-shot bioacoustic event detection at the DCASE 2024 challenge\. Recall 1, pp\. F1\_PB\.
- \[37\] Y\. Liu, W\. Zhang, C\. Xiang, T\. Zheng, D\. Cai, and X\. He \(2022\) Learning to affiliate: mutual centralized learning for few\-shot classification\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 14411–14420\.
- \[38\] Y\. Liu, R\. Feng, J\. Yuan, and Z\. Ling \(2024\) Clever Hans effect found in automatic detection of Alzheimer’s disease through speech\. arXiv preprint arXiv:2406\.07410\.
- \[39\] M\. Maciejewski, G\. Wichern, E\. McQuinn, and J\. Le Roux \(2020\) WHAMR\!: noisy and reverberant single\-channel speech separation\. In ICASSP 2020\-2020 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 696–700\.
- \[40\] S\. Martin, M\. Boudiaf, E\. Chouzenoux, J\. Pesquet, and I\. Ayed \(2022\) Towards practical few\-shot query sets: transductive minimum description length inference\. Advances in Neural Information Processing Systems 35, pp\. 34677–34688\.
- \[41\] B\. McEwen, K\. Soltero, S\. Gutschmidt, A\. Bainbridge\-Smith, J\. Atlas, and R\. Green \(2024\) Active few\-shot learning for rare bioacoustic feature annotation\. Ecological Informatics 82, pp\. 102734\.
- \[42\] I\. Moummad, R\. Serizel, and N\. Farrugia \(2023\) Pretraining representations for bioacoustic few\-shot detection using supervised contrastive learning\. arXiv preprint arXiv:2309\.00878\.
- \[43\] I\. Nolasco, S\. Singh, V\. Morfi, V\. Lostanlen, A\. Strandburg\-Peshkin, E\. Vidaña\-Vila, L\. Gill, H\. Pamuła, H\. Whitehead, I\. Kiskin, et al\. \(2023\) Learning to detect an animal sound from five examples\. Ecological informatics 77, pp\. 102258\.
- \[44\] J\. Oh, H\. Yoo, C\. Kim, and S\. Yun \(2020\) BOIL: towards representation change for few\-shot learning\. arXiv preprint arXiv:2008\.08882\.
- \[45\] K\. J\. Piczak \(2015\-10\-13\) ESC: Dataset for Environmental Sound Classification\. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp\. 1015–1018\. External Links: [Link](http://dl.acm.org/citation.cfm?doid=2733373.2806390), [Document](https://dx.doi.org/10.1145/2733373.2806390), ISBN 978\-1\-4503\-3459\-4\.
- \[46\] A\. Raghu, M\. Raghu, S\. Bengio, and O\. Vinyals \(2019\) Rapid learning or feature reuse? towards understanding the effectiveness of MAML\. arXiv preprint arXiv:1909\.09157\.
- \[47\]A\. A\. Rusu, D\. Rao, J\. Sygnowski, O\. Vinyals, R\. Pascanu, S\. Osindero, and R\. Hadsell\(2018\)Meta\-learning with latent embedding optimization\.arXiv preprint arXiv:1807\.05960\.Cited by:[2nd item](https://arxiv.org/html/2605.13672#A1.I2.i2.p1.1),[Table 6](https://arxiv.org/html/2605.13672#A6.T6.18.18.18.18.18.18.18.18.6),[Table 7](https://arxiv.org/html/2605.13672#A8.T7.29.29.4),[Table 8](https://arxiv.org/html/2605.13672#A8.T8.36.36.5),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.29.27.27.27.27.27.27.27.5),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.20.18.18.18.18.18.18.18.6)\.
- \[48\]S\. Sagawa and P\. W\. Koh\. Distributionally robust neural networks for group shifts: on the importance of regularization for worst\-case generalization\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p4.1)\.
- \[49\]J\. Salamon, C\. Jacoby, and J\. P\. Bello\(2014\-Nov\.\)A dataset and taxonomy for urban sound research\.In22nd ACM International Conference on Multimedia \(ACM\-MM’14\),Orlando, FL, USA,pp\. 1041–1044\.Cited by:[item \(ii\)](https://arxiv.org/html/2605.13672#S2.I1.i2.1)\.
- \[50\]J\. Salamon and J\. P\. Bello\(2017\)Deep convolutional neural networks and data augmentation for environmental sound classification\.IEEE Signal processing letters24\(3\),pp\. 279–283\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p3.1)\.
- \[51\]F\. Saleem, M\. Umar, and J\. Kim\(2025\)An optimized few\-shot learning framework for fault diagnosis in milling machines\.Machines13\(11\),pp\. 1010\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p1.1)\.
- \[52\]S\. Schneider, A\. Baevski, R\. Collobert, and M\. Auli\(2019\)Wav2vec: unsupervised pre\-training for speech recognition\.arXiv preprint arXiv:1904\.05862\.Cited by:[Appendix K](https://arxiv.org/html/2605.13672#A11.p1.1)\.
- \[53\]R\. Selvakumar, S\. Kumar, H\. K\. Giri, N\. Anand, A\. Seth, S\. Ghosh, and D\. Manocha\(2025\)Do audio\-language models understand linguistic variations?\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\),pp\. 899–913\.Cited by:[§1\.1](https://arxiv.org/html/2605.13672#S1.SS1.p1.1),[§1](https://arxiv.org/html/2605.13672#S1.p3.1)\.
- \[54\]C\. Sgouropoulos, C\. Nikou, S\. Vlachos, V\. Theiou, C\. Foukanelis, and T\. Giannakopoulos\(2025\)Prototypical contrastive learning for improved few shot audio classification\.In2025 IEEE 35th International Workshop on Machine Learning for Signal Processing \(MLSP\),pp\. 1–6\.Cited by:[Table 14](https://arxiv.org/html/2605.13672#A13.T14.4.4.4.5),[Table 14](https://arxiv.org/html/2605.13672#A13.T14.8.8.8.5),[Appendix M](https://arxiv.org/html/2605.13672#A13.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.85.83.83.83.83.83.83.83.7),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.86.84.84.84.84.84.84.84.1)\.
- \[55\]D\. Shalam\(2024\)The balanced\-pairwise\-affinities feature transform\.University of Haifa \(Israel\)\.Cited by:[Table 11](https://arxiv.org/html/2605.13672#A11.T11.56.54.54.9),[Table 12](https://arxiv.org/html/2605.13672#A11.T12.84.84.84.13),[Table 13](https://arxiv.org/html/2605.13672#A11.T13.26.26.26.5),[§4\.3](https://arxiv.org/html/2605.13672#S4.SS3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.118.116.116.116.116.116.116.116.5),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.78.76.76.76.76.76.76.76.5),[Table 3](https://arxiv.org/html/2605.13672#S4.T3.80.80.80.9)\.
- \[56\]F\. M\. Siraj, S\. T\. K\. Ayon, and J\. Uddin\(2023\)A few\-shot learning based fault diagnosis model using sensors data from industrial machineries\.Vibration6\(4\),pp\. 1004–1029\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p1.1)\.
- \[57\]J\. Snell, K\. Swersky, and R\. Zemel\(2017\)Prototypical networks for few\-shot learning\.Advances in neural information processing systems30\.Cited by:[3rd item](https://arxiv.org/html/2605.13672#A1.I2.i3.p1.1),[Table 11](https://arxiv.org/html/2605.13672#A11.T11.14.12.12.9),[Table 12](https://arxiv.org/html/2605.13672#A11.T12.19.19.19.14),[Table 13](https://arxiv.org/html/2605.13672#A11.T13.6.6.6.5),[Table 6](https://arxiv.org/html/2605.13672#A6.T6.30.30.30.30.30.30.30.30.6),[Table 7](https://arxiv.org/html/2605.13672#A8.T7.5.5.4),[Table 8](https://arxiv.org/html/2605.13672#A8.T8.4.4.5),[Table 10](https://arxiv.org/html/2605.13672#A9.T10.32.32.32.8),[Table 10](https://arxiv.org/html/2605.13672#A9.T10.56.56.56.8),[Table 10](https://arxiv.org/html/2605.13672#A9.T10.8.8.8.8),[Table 9](https://arxiv.org/html/2605.13672#A9.T9.32.32.32.8),[Table 9](https://arxiv.org/html/2605.13672#A9.T9.56.56.56.8),[Table 9](https://arxiv.org/html/2605.13672#A9.T9.8.8.8.8),[§1](https://arxiv.org/html/2605.13672#S1.p1.1),[§1](https://arxiv.org/html/2605.13672#S1.p4.1),[§4\.3](https://arxiv.org/html/2605.13672#S4.SS3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.49.47.47.47.47.47.47.47.6),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.32.30.30.30.30.30.30.30.6),[Table 3](https://arxiv.org/html/2605.13672#S4.T3.12.12.12.9)\.
- \[58\]F\. Sung, Y\. Yang, L\. Zhang, T\. Xiang, P\. H\. Torr, and T\. M\. Hospedales\(2018\)Learning to compare: relation network for few\-shot learning\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 1199–1208\.Cited by:[3rd item](https://arxiv.org/html/2605.13672#A1.I2.i3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.53.51.51.51.51.51.51.51.5)\.
- \[59\]N\. Turpault, R\. Serizel, S\. Wisdom, H\. Erdogan, J\. R\. Hershey, E\. Fonseca, P\. Seetharaman, and J\. Salamon\(2021\)Sound event detection and separation: a benchmark on desed synthetic soundscapes\.InICASSP 2021\-2021 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 840–844\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p3.1)\.
- \[60\]O\. Vinyals, C\. Blundell, T\. Lillicrap, D\. Wierstra,et al\.\(2016\)Matching networks for one shot learning\.Advances in neural information processing systems29\.Cited by:[Appendix A](https://arxiv.org/html/2605.13672#A1.p2.8),[§1](https://arxiv.org/html/2605.13672#S1.p1.1),[§1](https://arxiv.org/html/2605.13672#S1.p4.1)\.
- \[61\]Y\. Wang, Q\. Yao, J\. T\. Kwok, and L\. M\. Ni\(2020\)Generalizing from a few examples: a survey on few\-shot learning\.ACM computing surveys \(CSUR\)53\(3\),pp\. 1–34\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p1.1)\.
- \[62\]D\. Wertheimer, L\. Tang, and B\. Hariharan\(2021\)Few\-shot classification with feature map reconstruction networks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 8012–8021\.Cited by:[3rd item](https://arxiv.org/html/2605.13672#A1.I2.i3.p1.1),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.48.46.46.46.46.46.46.46.5)\.
- \[63\]G\. Wichern, J\. Antognini, M\. Flynn, L\. R\. Zhu, E\. McQuinn, D\. Crow, E\. Manilow, and J\. L\. Roux\(2019\)Wham\!: extending speech separation to noisy environments\.arXiv preprint arXiv:1907\.01160\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.13672#S2.SS1.p3.7)\.
- \[64\]Y\. Wu, E\. Stangl, O\. Chipara, S\. S\. Hasan, A\. Welhaven, and J\. Oleson\(2018\)Characteristics of real\-world signal to noise ratios and speech listening situations of older adults with mild to moderate hearing loss\.Ear and hearing39\(2\),pp\. 293–304\.Cited by:[§2\.1](https://arxiv.org/html/2605.13672#S2.SS1.p4.6)\.
- \[65\]Y\. Wu\*, K\. Chen\*, T\. Zhang\*, Y\. Hui\*, T\. Berg\-Kirkpatrick, and S\. Dubnov\(2023\)Large\-scale contrastive language\-audio pretraining with feature fusion and keyword\-to\-caption augmentation\.InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP,Cited by:[Appendix K](https://arxiv.org/html/2605.13672#A11.p1.1),[§4\.3](https://arxiv.org/html/2605.13672#S4.SS3.p1.1)\.
- \[66\]Y\. Xiao and R\. K\. Das\(2024\)WildDESED: an llm\-powered dataset for wild domestic environment sound event detection system\.arXiv preprint arXiv:2407\.03656\.Cited by:[item \(iv\)](https://arxiv.org/html/2605.13672#S2.I1.i4.1)\.
- \[67\]J\. Xie, F\. Long, J\. Lv, Q\. Wang, and P\. Li\(2022\)Joint distribution matters: deep brownian distance covariance for few\-shot classification\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 7972–7981\.Cited by:[3rd item](https://arxiv.org/html/2605.13672#A1.I2.i3.p1.1),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.52.50.50.50.50.50.50.50.5)\.
- \[68\]L\. You, E\. P\. Coyotl, S\. Gunturu, and M\. Van Segbroeck\(2023\)Transformer\-based bioacoustic sound event detection on few\-shot learning tasks\.InICASSP 2023\-2023 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 1–5\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p1.1)\.
- \[69\]M\. Zabin, S\. T\. K\. Ayon, F\. M\. Siraj, M\. H\. Shuvo, H\. Choi, and J\. Uddin\(2025\)Few\-shot learning\-based machine fault diagnosis using emd\-gammatone spectrogram with limited labeled audio dataset\.In2025 IEEE International Conference on Big Data and Smart Computing \(BigComp\),pp\. 183–190\.Cited by:[§1](https://arxiv.org/html/2605.13672#S1.p1.1)\.
- \[70\]Y\. Zhang et al\.\(2024\)MetaCoCo: a benchmark for spurious correlation in few\-shot learning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\.Cited by:[Appendix A](https://arxiv.org/html/2605.13672#A1.p4.1),[§1](https://arxiv.org/html/2605.13672#S1.p4.1)\.
- \[71\]K\. Zheng, H\. Zhang, and W\. Huang\(2023\)DiffKendall: a novel approach for few\-shot learning with differentiable kendall’s rank correlation\.Advances in Neural Information Processing Systems36,pp\. 49403–49415\.Cited by:[3rd item](https://arxiv.org/html/2605.13672#A1.I2.i3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.20.18.18.18.18.18.18.18.5)\.
- \[72\]H\. Zhu and P\. Koniusz\(2023\)Transductive few\-shot learning with prototype\-based label propagation by iterative graph refinement\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 23996–24006\.Cited by:[Table 11](https://arxiv.org/html/2605.13672#A11.T11.48.46.46.11),[Table 12](https://arxiv.org/html/2605.13672#A11.T12.72.72.72.17),[Table 13](https://arxiv.org/html/2605.13672#A11.T13.22.22.22.5),[Table 6](https://arxiv.org/html/2605.13672#A6.T6.60.60.60.60.60.60.60.60.7),[item \(iii\)](https://arxiv.org/html/2605.13672#S4.I3.i3.p1.4),[§4\.3](https://arxiv.org/html/2605.13672#S4.SS3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.114.112.112.112.112.112.112.112.7),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.74.72.72.72.72.72.72.72.6),[Table 3](https://arxiv.org/html/2605.13672#S4.T3.72.72.72.13)\.
- \[73\]I\. Ziko, J\. Dolz, E\. Granger, and I\. B\. Ayed\(2020\)Laplacian regularized few\-shot learning\.InInternational conference on machine learning,pp\. 11660–11670\.Cited by:[Table 11](https://arxiv.org/html/2605.13672#A11.T11.72.70.70.9),[Table 12](https://arxiv.org/html/2605.13672#A11.T12.108.108.108.13),[Table 13](https://arxiv.org/html/2605.13672#A11.T13.36.36.36.6),[Table 6](https://arxiv.org/html/2605.13672#A6.T6.46.46.46.46.46.46.46.46.6),[§4\.3](https://arxiv.org/html/2605.13672#S4.SS3.p1.1),[Table 1](https://arxiv.org/html/2605.13672#S4.T1.98.96.96.96.96.96.96.96.8),[Table 2](https://arxiv.org/html/2605.13672#S4.T2.61.59.59.59.59.59.59.59.6),[Table 3](https://arxiv.org/html/2605.13672#S4.T3.44.44.44.9)\.
- \[74\]I\. M\. Ziko, M\. Boudiaf, J\. Dolz, E\. Granger, and I\. B\. Ayed\(2021\)Transductive few\-shot learning: clustering is all you need?\.arXiv preprint arXiv:2106\.09516\.Cited by:[item \(iii\)](https://arxiv.org/html/2605.13672#S4.I3.i3.p1.4)\.

Supplementary Material Contents

Appendix Contents

1. A Experimental Setup
2. B Foreground–Background Mapping
3. C Embeddings of different *FSL* Methods
4. D Effect of Spurious Correlation Strength
5. E Visualization of the Results in Table [1](https://arxiv.org/html/2605.13672#S4.T1)
6. F *FSL* with ResNet18 as Embedding Backbone
7. G Analysis of Shortcut Reliance via Representation Geometry and Background Perturbations
   1. G\.1 Fine\-Tuning Based *FSC*
      1. G\.1\.1 Head–Backbone Replacement Analysis
      2. G\.1\.2 IID–OOD Background Perturbation Analysis
      3. G\.1\.3 Linking Representation Geometry to Shortcut Exploitation
   2. G\.2 Meta\-Learning\-Based *FSC*
      1. G\.2\.1 Head–Backbone Replacement Analysis \(IID Baseline\)
      2. G\.2\.2 IID–OOD Background Perturbation Analysis
      3. G\.2\.3 Linking Representation Geometry to Shortcut Exploitation
      4. G\.2\.4 OOD Heatmap Comparison: Shifts in Robustness
   3. G\.3 Metric\-Based *FSC*
      1. G\.3\.1 Head–Backbone Replacement Analysis \(IID Baseline\)
      2. G\.3\.2 IID–OOD Background Perturbation Analysis
      3. G\.3\.3 Linking Representation Geometry to Shortcut Exploitation
      4. G\.3\.4 OOD Heatmap Comparison: Shifts in Robustness
   4. G\.4 Cross\-Family Analysis of Few\-Shot Learning Algorithms \(IID Setting\)
   5. G\.5 Cross\-Family Analysis of Few\-Shot Learning Algorithms \(OOD Setting\)
   6. G\.6 Cross\-Family Analysis of Few\-Shot Learning Algorithms in Deeper Architectures
8. H Geometric Disentanglement of Background Correlations
   1. H\.1 Methodology: Radial\-Angular Decomposition
   2. H\.2 Empirical Observation: The "Magnitude Contraction" Phenomenon
   3. H\.3 Implication for Few\-Shot Metric Learning
9. I Where in the Network Are Spurious Correlations Encoded?
10. J On the Distribution of SpurAudio
    1. J\.1 Methodology and Metrics
    2. J\.2 Analysis of Results
       1. J\.2\.1 Semantic Alignment
       2. J\.2\.2 Distributional and Structural Integrity
11. K More Results on Large\-Audio Models
12. L Classification Accuracy as a Function of Number Of Shots
13. M Additional Contrastive Learning Experiments
14. N Limitations

## Appendix A Experimental Setup

To study spurious correlations in few\-shot settings, we consider four main factors that inform the design of our benchmark: \(i\) episodic sampling, \(ii\) audio preprocessing, \(iii\) backbone architectures, and \(iv\) *FSL* methodologies\.

Episode sampling\. We follow the standard $N$\-way $K$\-shot protocol \[[60](https://arxiv.org/html/2605.13672#bib.bib21)\] with $N=5$, $K\in\{1,5\}$, and 10 query samples per class, sampling between 600 and 2000 training tasks and 1000 test tasks depending on the algorithm setting\.
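The episodic protocol above can be sketched as follows\. This is a minimal illustration, not the benchmark's sampler: the function name `sample_episode`, the `class_to_items` layout, and the toy clip identifiers are assumptions made for this example\.

```python
import random


def sample_episode(class_to_items, n_way=5, k_shot=5, n_query=10, rng=None):
    """Sample one N-way K-shot episode.

    class_to_items: dict mapping class name -> list of item ids
                    (hypothetical layout chosen for this sketch).
    Returns (support, query): lists of (item, episode_label) pairs.
    """
    rng = rng or random.Random()
    # Pick N distinct classes for this episode.
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + Q distinct items so support and query never overlap.
        items = rng.sample(class_to_items[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


# Toy pool: 8 classes with 40 clips each (placeholder ids, not real data).
pool = {f"class_{c}": [f"class_{c}/clip_{i}" for i in range(40)] for c in range(8)}
support, query = sample_episode(pool, n_way=5, k_shot=5, n_query=10,
                                rng=random.Random(0))
print(len(support), len(query))  # 25 50
```

With $N=5$ and 10 queries per class, each episode holds 50 query items and either 5 or 25 support items, depending on $K$\.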

Audio preprocessing\. We adopt the preprocessing strategy of \[[24](https://arxiv.org/html/2605.13672#bib.bib11)\]\. All sound events are resampled to 16 kHz and trimmed or repeated to a fixed duration of $T=5$ s\. Each waveform is converted into a mel spectrogram using a 1024\-point short\-time Fourier transform with a hop size of 512 samples and 128 mel frequency bands\. Power spectrograms are computed and transformed to the logarithmic \(decibel\) scale to compress the dynamic range\.
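The arithmetic behind these settings can be worked through in a few lines\. The sketch below covers only the trim\-or\-repeat step, the resulting frame count, and the decibel conversion; it does not compute a mel spectrogram\. The helper names are hypothetical, and the centered \(edge\-padded\) framing convention is an assumption, since the padding mode is not stated in the text\.

```python
import math

SR = 16_000      # sampling rate in Hz (from the text)
DURATION_S = 5   # fixed clip length T in seconds (from the text)
N_FFT = 1024     # STFT size (from the text)
HOP = 512        # hop size in samples (from the text)
N_MELS = 128     # number of mel bands (from the text)


def fix_length(wave, target_len):
    """Trim a waveform, or repeat it end to end, to exactly target_len samples."""
    if not wave:
        raise ValueError("empty waveform")
    reps = math.ceil(target_len / len(wave))
    return (list(wave) * reps)[:target_len]


def num_frames(n_samples, hop=HOP):
    """Frame count for a centered STFT (assumed convention, see lead-in)."""
    return 1 + n_samples // hop


def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert a power value to decibels, flooring at amin to avoid log(0)."""
    return 10.0 * math.log10(max(power, amin) / ref)


target = SR * DURATION_S                       # samples per fixed-length clip
clip = fix_length([0.1, -0.2, 0.3], target)    # a 3-sample toy "recording"
print(len(clip), num_frames(target), N_MELS)   # 80000 157 128
print(power_to_db(1.0), power_to_db(100.0))    # 0.0 20.0
```

Under these assumptions each clip yields a 128 × 157 log\-mel input; a different padding convention would change the frame count slightly\.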

Embedding architectures\. Few\-shot learning methods require a backbone that maps the mel spectrogram into feature vectors or descriptor sets\. We employ architectures used in *MetaCoCo* \[[70](https://arxiv.org/html/2605.13672#bib.bib9)\] and LibFewShot \[[33](https://arxiv.org/html/2605.13672#bib.bib22)\], specifically Conv\-64F, ResNet\-12, and ResNet\-18, covering both shallow and deep designs\.

Few\-shot learning methodologies\. We evaluate representative approaches across multiple methodological families:

- Fine\-tuning\-based: Baseline and Baseline\+\+ \[[9](https://arxiv.org/html/2605.13672#bib.bib23)\], and Meta\-Baseline \[[10](https://arxiv.org/html/2605.13672#bib.bib24)\], combining standard supervised pretraining with episodic adaptation\.
- Gradient\-based meta\-learning: MAML \[[15](https://arxiv.org/html/2605.13672#bib.bib20)\] and variants including ANIL \[[46](https://arxiv.org/html/2605.13672#bib.bib26)\], BOIL \[[44](https://arxiv.org/html/2605.13672#bib.bib27)\], R2D2 \[[5](https://arxiv.org/html/2605.13672#bib.bib31)\], METAL \[[4](https://arxiv.org/html/2605.13672#bib.bib30)\], and LEO \[[47](https://arxiv.org/html/2605.13672#bib.bib28)\]\.
- Metric\-based and structure\-aware methods: Prototypical Networks \[[57](https://arxiv.org/html/2605.13672#bib.bib32)\], Relation Networks \[[58](https://arxiv.org/html/2605.13672#bib.bib33)\], DN4 \[[32](https://arxiv.org/html/2605.13672#bib.bib34)\], MCL \[[37](https://arxiv.org/html/2605.13672#bib.bib35)\], alongside higher\-order statistic models ADM \[[31](https://arxiv.org/html/2605.13672#bib.bib36)\], ATL\-Net \[[13](https://arxiv.org/html/2605.13672#bib.bib37)\], FRN \[[62](https://arxiv.org/html/2605.13672#bib.bib38)\], DeepBDC \[[67](https://arxiv.org/html/2605.13672#bib.bib39)\], and DiffKendall \[[71](https://arxiv.org/html/2605.13672#bib.bib25)\]\.
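As one concrete instance of the metric\-based family, the inference\-time head of Prototypical Networks can be sketched as below: each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype under squared Euclidean distance\. The embeddings here are toy 2\-D vectors invented for illustration; in the benchmark they would come from one of the backbones listed above\.

```python
def prototypes(support):
    """support: list of (embedding, label). Returns {label: mean embedding}."""
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(query_vec, protos):
    """Return the label of the prototype closest to query_vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query_vec, protos[label]))


# Toy 2-shot, 2-way support set (made-up embeddings).
support = [([0.0, 0.0], "a"), ([0.0, 2.0], "a"),
           ([4.0, 4.0], "b"), ([6.0, 4.0], "b")]
protos = prototypes(support)
print(protos)                        # {'a': [0.0, 1.0], 'b': [5.0, 4.0]}
print(classify([1.0, 1.0], protos))  # a
```

Because the prototype is a plain average, any embedding dimension that tracks the background in the support set is carried into the prototype unchanged\. This is one reason such heads can be sensitive to context shifts between support and query sets\.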

##### Software/Hardware\.

All the experiments detailed in this paper were conducted on a PC equipped with one NVIDIA A100\-SXM4\-40GB GPU and AMD EPYC 7742 64\-Core CPUs\.

## Appendix B Foreground–Background Mapping

The source\-level separation in SpurAudio is enforced at the foreground class level: each foreground class is assigned exclusively to one split \(train, validation, or test\), with no overlap across splits\. Following the meta\-audio \[[24](https://arxiv.org/html/2605.13672#bib.bib11)\] library procedure, the class pool is split into approximately 70% train, 10% validation, and 20% test\. Specifically, the 38 foreground classes are partitioned as follows:

- •Test: crackling fire, crow, chainsaw, coughing, sneezing, blender, phone, pig\.
- •Validation: page turn, keys drop, door slam, clearing throat, drawer\.
- •Train: all remaining classes, as listed in the Foreground column of Table [4](https://arxiv.org/html/2605.13672#A2.T4)\.
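A class\-level split of this kind can be sketched as follows. This is only an illustration: the helper name `split_classes`, the seed, and the placeholder class names are assumptions, and a rounded 70/10/20 split need not reproduce the exact partition listed above.

```python
import random

def split_classes(classes, fractions=(0.7, 0.1, 0.2), seed=0):
    """Partition class names into disjoint train/val/test pools (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to test
    return train, val, test

# Placeholder names standing in for the 38 foreground classes.
classes = [f"class_{i}" for i in range(38)]
train, val, test = split_classes(classes)
```

Because the split is made over class names rather than over clips, no foreground class can appear in more than one pool.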

Note that background sounds are deliberately shared across all splits\. This reflects a realistic scenario where background contexts \(e\.g\., street noise, indoor ambience\) are not class\-specific and can naturally co\-occur with both seen and unseen foreground classes, which is precisely what enables the spurious correlation effect\. The model may learn to rely on background cues during training, and since those same backgrounds appear at test time paired with different foreground classes, the shortcut is exposed and measured\.

In what follows, we present the final foreground–background matching from which our merged sound events are composed; see Table [4](https://arxiv.org/html/2605.13672#A2.T4)\.

In addition, for our experiments assessing spurious\-correlation strength, we generate a higher\-correlation configuration that makes *FSL* more challenging and lowers classification accuracy; see Table [5](https://arxiv.org/html/2605.13672#A2.T5)\.

Table 4: Complete foreground\-to\-background mapping used for constructing spurious correlations in the audio dataset\.

Table 5: Foreground\-to\-background mapping for the *Hard OOD* spurious correlation configuration\.

## Appendix C Embeddings of different *FSL* Methods

In this section, we present embeddings of different *FSL* approaches, with the goal of visualizing separability\. In the following figures, circles denote support points while crosses denote query points\. This highlights one factor behind higher/lower average *IID* accuracy across seeds\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/x1.png)\(a\) Meta\-Baseline \(*IID*, 10\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x2.png)\(b\) Meta\-Baseline \(*OOD*, 10\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x3.png)\(c\) R2D2 \(*IID*, 10\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x4.png)\(d\) R2D2 \(*OOD*, 10\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x5.png)\(e\) DeepBDC \(*IID*, 10\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x6.png)\(f\) DeepBDC \(*OOD*, 10\-shot\)

Figure 4: t\-SNE visualization of embedding space under *IID* vs\. *OOD* episodes\. Across methods, embeddings form cleaner, more separable clusters in *IID* settings, while *OOD* background shifts induce query\-support misalignment and increased inter\-class overlap\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/x7.png)\(a\) Meta\-Baseline \(*IID*, 5\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x8.png)\(b\) Meta\-Baseline \(*OOD*, 5\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x9.png)\(c\) R2D2 \(*IID*, 5\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x10.png)\(d\) R2D2 \(*OOD*, 5\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x11.png)\(e\) DeepBDC \(*IID*, 5\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x12.png)\(f\) DeepBDC \(*OOD*, 5\-shot\)

Figure 5: t\-SNE visualization of embedding space under *IID* vs\. *OOD* episodes\. Across methods, embeddings form cleaner, more separable clusters in *IID* settings, while *OOD* background shifts induce query\-support misalignment and increased inter\-class overlap\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/x13.png)\(a\) Meta\-Baseline \(*IID*, 1\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x14.png)\(b\) Meta\-Baseline \(*OOD*, 1\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x15.png)\(c\) R2D2 \(*IID*, 1\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x16.png)\(d\) R2D2 \(*OOD*, 1\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x17.png)\(e\) DeepBDC \(*IID*, 1\-shot\)
![Refer to caption](https://arxiv.org/html/2605.13672v1/x18.png)\(f\) DeepBDC \(*OOD*, 1\-shot\)

Figure 6: t\-SNE visualization of embedding space under *IID* vs\. *OOD* episodes\. Across methods, embeddings form cleaner, more separable clusters in *IID* settings, while *OOD* background shifts induce query\-support misalignment and increased inter\-class overlap\.

## Appendix D Effect of Spurious Correlation Strength

In this section, we examine the effect of spurious\-correlation strength by making the *OOD* tasks harder: the backgrounds of the queries of one class are correlated with the backgrounds of support instances from different classes more often than in our main experiments\. This amplifies the *IID*–*OOD* gap\. To obtain such tasks, we rely on Table [5](https://arxiv.org/html/2605.13672#A2.T5)\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/1shot_comparison.png)\(a\) 1\-shot accuracy
![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/5shot_comparison.png)\(b\) 5\-shot accuracy

Figure 7: *IID* vs *OOD* vs *Hard OOD* accuracy for few\-shot tasks\. Each candle represents the classification accuracy of a method under a given evaluation setting\. *IID* corresponds to standard in\-distribution support\-query sampling, *OOD* represents typical out\-of\-distribution support\-query pairs, and *Hard OOD* enforces maximal background overlap across classes\. Accuracy decreases progressively from *IID* to *OOD* to *Hard OOD*, for both 1\-shot \(a\) and 5\-shot \(b\), highlighting that few\-shot methods increasingly rely on background cues as spurious correlations become stronger\.

## Appendix E Visualization of the Results in Table [1](https://arxiv.org/html/2605.13672#S4.T1)

In what follows, we visualize the accuracy results reported in Table [1](https://arxiv.org/html/2605.13672#S4.T1) across our three families of *FSC* methods\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/x19.png)\(a\) Fine\-tuning
![Refer to caption](https://arxiv.org/html/2605.13672v1/x20.png)\(b\) Meta\-learning
![Refer to caption](https://arxiv.org/html/2605.13672v1/x21.png)\(c\) Metric\-based

Figure 8: *IID* vs *OOD* accuracy for few\-shot families\. Each candle shows the classification accuracy of an algorithm across multiple seeds\. *IID* accuracy remains higher, while *OOD* accuracy exhibits a lower mean, indicating that few\-shot methods partially rely on background cues\. This highlights that performance degrades when support and query distributions differ, even without enforcing extreme correlations\.

## Appendix F *FSL* with ResNet18 as Embedding Backbone

In this section, we also present the *IID*–*OOD* gap when using ResNet18 as the embedding model \(backbone\) of different *FSC* models\.

Table 6: ResNet18 results on 1\-shot and 5\-shot tasks\.

## Appendix G Analysis of Shortcut Reliance via Representation Geometry and Background Perturbations

To better understand whether shortcut reliance is primarily encoded in the backbone representations or induced by the inference head, we perform a systematic head–backbone replacement study\. In particular, we examine whether embeddings learned under different training objectives and classifier heads transfer consistently when paired with alternative inference heads at test time\. We first analyze intra\-family backbone–head swaps, where backbones and heads originate from the same methodological family, and then extend this analysis to cross\-family swaps, in which components trained under different learning paradigms are combined\. This progression allows us to isolate compatibility effects within families and to assess the robustness of learned representations under more disruptive transfer settings\.

### G\.1 Fine\-Tuning Based *FSC*

#### G\.1\.1 Head–Backbone Replacement Analysis

We first analyze the interaction between embedding backbones and inference heads by systematically replacing the classifier head used at test time\. In this experiment, rows correspond to the inference head \(algorithm\), while columns correspond to the backbone used to extract embeddings\. This controlled intervention allows us to probe how different training objectives encode discriminative and spurious information in the learned representations; see Figures [9](https://arxiv.org/html/2605.13672#A7.F9) and [10](https://arxiv.org/html/2605.13672#A7.F10)\.
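The replacement protocol amounts to filling an accuracy grid with one row per head and one column per backbone. The sketch below illustrates that bookkeeping on synthetic two\-class data. The "backbones" are stand\-in fixed maps and the two centroid heads are simplified placeholders, none of which correspond to the trained models evaluated in the paper.

```python
import numpy as np

def centroids(emb, labels, n_classes):
    """Mean support embedding per class."""
    return np.stack([emb[labels == k].mean(axis=0) for k in range(n_classes)])

def dot_head(s_emb, s_lab, q_emb, n_classes):
    """Unnormalized dot-product similarity to class centroids."""
    return (q_emb @ centroids(s_emb, s_lab, n_classes).T).argmax(axis=1)

def cosine_head(s_emb, s_lab, q_emb, n_classes):
    """Cosine similarity to class centroids."""
    c = centroids(s_emb, s_lab, n_classes)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)

# Stand-in "backbones" (assumptions): fixed maps from inputs to embeddings.
backbones = {
    "raw": lambda x: x,
    "unit_norm": lambda x: x / np.linalg.norm(x, axis=1, keepdims=True),
}
heads = {"dot": dot_head, "cosine": cosine_head}

# Synthetic episode: 5 support and 20 query points per class.
rng = np.random.default_rng(0)
means = np.array([[2.0, 0.0], [0.0, 2.0]])
s_lab = np.repeat([0, 1], 5)
q_lab = np.repeat([0, 1], 20)
s_x = means[s_lab] + rng.normal(scale=0.5, size=(10, 2))
q_x = means[q_lab] + rng.normal(scale=0.5, size=(40, 2))

# Rows: inference heads. Columns: backbones.
grid = np.zeros((len(heads), len(backbones)))
for i, head in enumerate(heads.values()):
    for j, backbone in enumerate(backbones.values()):
        pred = head(backbone(s_x), s_lab, backbone(q_x), n_classes=2)
        grid[i, j] = (pred == q_lab).mean()
```

Each off\-diagonal cell then measures how a head behaves on embeddings it was not trained with, which is the quantity discussed in the paragraphs that follow.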

##### Meta\-Baseline head → Baseline backbone\.

When the Meta\-Baseline head is applied to a Baseline\-trained backbone, inference still relies on unnormalized dot\-product similarities, while the backbone was optimized in a non\-episodic manner\. This mild mismatch slightly reduces reliance on magnitude\-based shortcuts learned during Baseline training, resulting in a small performance improvement\.

##### Meta\-Baseline head → Baseline\+\+ backbone\.

Applying the Meta\-Baseline head to a Baseline\+\+ backbone introduces a strong geometric mismatch\. The Baseline\+\+ backbone is trained to produce normalized, directionally discriminative embeddings, whereas the Meta\-Baseline head assumes unnormalized features\. Consequently, angular information is not optimally exploited, leading to degraded performance\.

##### Baseline head → Meta\-Baseline backbone\.

Using a standard Baseline head with a Meta\-Baseline backbone causes inference to overemphasize feature magnitude\. In Meta\-Baseline embeddings, feature norms often correlate with background and dataset\-specific artifacts\. This increases shortcut reliance and results in a slight performance drop\.

##### Baseline head → Baseline\+\+ backbone\.

Combining a Baseline head with a Baseline\+\+ backbone leads to severe performance degradation\. Baseline\+\+ training enforces feature normalization and encodes class separability primarily in angular space, while the Baseline head relies on raw dot products\. This incompatibility prevents effective utilization of the learned representations\.

##### Baseline\+\+ head → Meta\-Baseline backbone\.

Applying a cosine\-based Baseline\+\+ head to a Meta\-Baseline backbone yields the largest performance improvement\. The Meta\-Baseline backbone produces episodically discriminative embeddings whose feature magnitudes often encode spurious background cues\. Feature normalization suppresses these magnitude\-based shortcuts and forces decisions to rely on angular similarity, significantly improving few\-shot generalization\.

##### Baseline\+\+ head → Baseline backbone\.

When applied to a Baseline\-trained backbone, the Baseline\+\+ head removes magnitude\-based confidence cues that are frequently correlated with background conditions\. Although the backbone was not explicitly trained for angular discrimination, the cosine classifier acts as a regularizer that reduces shortcut exploitation, resulting in a substantial performance gain\.

##### Pairwise backbone replacement analysis \(*IID* vs\. *OOD*\)\.

Thus far, under *IID* evaluation, we observed that replacing the backbone of simpler baselines \(e\.g\., Baseline and Meta\-Baseline\) with stronger representations \(e\.g\., Baseline\+\+\) consistently leads to accuracy gains, whereas the reverse replacement typically results in noticeable degradation\. This asymmetric behavior suggests that stronger backbones encode more discriminative and transferable features that benefit multiple downstream classifiers, while weaker backbones fail to support more advanced inference mechanisms\. We repeat the same analysis for *OOD* tasks, where the query samples contain foreground events that are distributionally shifted from the support set, often due to mismatched background or contextual acoustic conditions\. Notably, the qualitative trends observed under *IID* evaluation persist under *OOD* settings: backbones trained with stronger inductive biases continue to generalize better across algorithms, while weaker backbones exacerbate performance drops\. In several cases, the performance gap widens under *OOD* conditions, indicating that models relying more heavily on spurious background correlations are less robust to distribution shifts\.

Overall, these results suggest that backbone quality plays a central role in both *IID* and *OOD* generalization\. In particular, representations that are less sensitive to shortcut features and background\-specific cues yield more stable performance when transferred across learning algorithms and evaluation regimes\.

#### G\.1\.2 *IID*–*OOD* Background Perturbation Analysis

To quantify shortcut reliance during few\-shot inference, we evaluate all methods under *IID* and *OOD* episodic settings; see Table [1](https://arxiv.org/html/2605.13672#S4.T1)\. In the *OOD* setting, the foreground sound event of the query remains unchanged, while background sound events are systematically mismatched between the support and query sets\. Therefore, any performance degradation from *IID* to *OOD* cannot be attributed to semantic class ambiguity and directly reflects sensitivity to background–class correlations\.
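One simple way to realize such a mismatch is to give every query class the background of another class in the same episode. The sketch below does this with a cyclic shift and is only an assumption about how an episode could be built, not the benchmark's sampling code. The function name `assign_backgrounds` and the background labels `bg_A`, `bg_B`, `bg_C` are hypothetical, and the classes in an episode are assumed to have distinct backgrounds.

```python
import random

def assign_backgrounds(episode_classes, fg_to_bg, ood, seed=0):
    """Return (support_bg, query_bg): dicts mapping each class to a background."""
    support_bg = {c: fg_to_bg[c] for c in episode_classes}
    if not ood:
        # IID: queries share the background of their own class's support set.
        return support_bg, dict(support_bg)
    rng = random.Random(seed)
    order = list(episode_classes)
    rng.shuffle(order)
    # OOD: cyclic shift, so each query class borrows the next class's background.
    query_bg = {
        c: fg_to_bg[order[(i + 1) % len(order)]] for i, c in enumerate(order)
    }
    return support_bg, query_bg

# Foreground names come from the test split; background labels are placeholders.
fg_to_bg = {"crow": "bg_A", "chainsaw": "bg_B", "coughing": "bg_C"}
classes = list(fg_to_bg)
s_iid, q_iid = assign_backgrounds(classes, fg_to_bg, ood=False)
s_ood, q_ood = assign_backgrounds(classes, fg_to_bg, ood=True)
```

In the OOD case every query background also appears in the support set, but attached to a different class, so a model that matches on background is pulled toward the wrong label.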

Recall that the performance gap is defined as

$$\Delta = \text{Avg. \emph{IID}} - \text{Avg. \emph{OOD}},$$

which serves as a measure of background shortcut reliance\.
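As a concrete reading of this definition, the gap is the difference between two per\-seed averages. The helper name `shortcut_gap` and the accuracy values below are placeholders chosen for illustration and are not results from the paper.

```python
def shortcut_gap(iid_acc, ood_acc):
    """Delta = mean IID accuracy minus mean OOD accuracy."""
    def mean(xs):
        return sum(xs) / len(xs)
    return mean(iid_acc) - mean(ood_acc)

# Placeholder accuracies (percent) over three seeds; not taken from the paper.
iid = [80.0, 82.0, 81.0]
ood = [70.0, 71.0, 72.0]
delta = shortcut_gap(iid, ood)  # 81.0 - 71.0 = 10.0
```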

We observe that fine\-tuning\-based methods exhibit substantially different $\Delta$ values in the 5\-shot setting\. Baseline shows the largest gap, indicating strong reliance on background consistency\. DiffKendall exhibits a moderate gap, suggesting partial mitigation of shortcut effects\. Meta\-Baseline further reduces $\Delta$, reflecting improved robustness due to episodic training\. Baseline\+\+ achieves the smallest gap, consistent with its reliance on cosine similarity and feature normalization\.

Notably, the magnitude of $\Delta$ increases with the number of shots, indicating that additional support examples amplify shortcut exploitation when background statistics are consistent within a class\.

#### G\.1\.3 Linking Representation Geometry to Shortcut Exploitation

The head–backbone replacement experiment and the *IID*–*OOD* evaluation provide complementary perspectives on shortcut learning\. The replacement analysis reveals where spurious correlations are encoded in the representation, while the *IID*–*OOD* gap quantifies how strongly these correlations are exploited during few\-shot inference\.

Taken together, the results indicate that background\-induced shortcuts are predominantly encoded in feature magnitudes, whereas foreground semantic information is encoded in feature direction\. Inference heads that rely on unnormalized dot\-product similarities are therefore more susceptible to exploiting background consistency, resulting in larger *IID*–*OOD* performance gaps\. Conversely, cosine\-based classifiers suppress magnitude\-based cues and yield improved robustness under background perturbations\.
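The difference between the two decision rules can be seen on a hand\-picked toy example. The vectors below are assumptions chosen to make the effect visible: they show only that a large\-norm prototype can dominate dot\-product scores while cosine scores follow direction, not that real audio embeddings behave this way.

```python
import numpy as np

def dot_scores(query, prototypes):
    """Unnormalized dot-product similarity to each class prototype."""
    return prototypes @ query

def cosine_scores(query, prototypes, eps=1e-12):
    """Cosine similarity: both sides are L2-normalized first."""
    q = query / (np.linalg.norm(query) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return p @ q

prototypes = np.array([
    [1.0, 0.0],  # class 0: small norm
    [3.0, 3.0],  # class 1: large norm
])
query = np.array([1.0, 0.2])  # direction is close to class 0

dot_pred = int(np.argmax(dot_scores(query, prototypes)))     # 1: norm wins
cos_pred = int(np.argmax(cosine_scores(query, prototypes)))  # 0: direction wins
```

Here the dot\-product scores are 1.0 and 3.6, so the large\-norm prototype is chosen, whereas normalization leaves only the angle and selects class 0.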

The strong agreement between these two independent analyses provides converging evidence that magnitude\-based decision rules are a primary driver of shortcut reliance in few\-shot audio classification under dataset heterogeneity\.

### G\.2 Meta\-Learning\-Based *FSC*

#### G\.2\.1 Head–Backbone Replacement Analysis \(*IID* Baseline\)

This section analyzes the compatibility between the classification heads \(rows\) and feature backbones \(columns\) of different meta\-learning algorithms; see, e\.g\., Figure [9](https://arxiv.org/html/2605.13672#A7.F9)\. Our aim here is to link the performance impact \($\Delta$\) to the compatibility between the head’s inference mechanism and the backbone’s learned geometry\.

##### MAML Head \(Gradient\-Based Adaptation\)

The MAML head initializes a linear classifier $w$ and performs a small number of gradient descent steps \($w' \leftarrow w - \alpha \nabla_{w} \mathcal{L}_{CE}$\)\. Rather than requiring features to be linearly separable *a priori*, MAML is trained to produce representations that become linearly separable after a few gradient updates\.

- •vs\. METAL Backbone: METAL optimizes a distance\-based objective that encourages compact class representations with reduced intra\-class variance\. Such geometrically regularized features can yield better\-conditioned gradients for cross\-entropy optimization, which may facilitate faster or more stable adaptation of the MAML head compared to less structured feature spaces\.
- •vs\. ANIL & BOIL: Performance remains largely neutral, with a slight degradation for BOIL\. As close variants of MAML, ANIL and BOIL induce feature spaces with similar local geometry, allowing the MAML head to transfer without significant difficulty, albeit without the additional geometric regularization benefits provided by METAL\.
- •vs\. R2D2: R2D2 optimizes a ridge regression objective \($\lVert Xw - y \rVert^{2} + \lambda \lVert w \rVert^{2}$\), which permits substantial variation in feature scale and anisotropic covariance structure\. When paired with such features, MAML’s cross\-entropy objective can exhibit poor conditioning, resulting in less stable or slower gradient\-based adaptation\.
- •vs\. LEO: LEO backbones are trained to support latent\-space weight generation via a variational encoder \($z \sim \mathcal{N}(\mu, \sigma)$\), rather than direct linear separability in feature space\. This induces representations whose geometry is aligned with latent inference, limiting the effectiveness of MAML’s direct gradient\-based linear adaptation\.
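The head update described above can be sketched in a few lines of NumPy. This is a minimal illustration that adapts only a linear head on fixed support embeddings; the random embeddings, step size, and number of steps are assumptions, and no backbone adaptation or outer\-loop meta\-training is shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(w, x, y):
    """Mean cross-entropy of the linear head w on embeddings x with labels y."""
    p = softmax(x @ w)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def adapt_head(w, x, y, alpha=0.1, steps=5):
    """Repeat w' <- w - alpha * grad_w L_CE for a few steps."""
    onehot = np.eye(w.shape[1])[y]
    for _ in range(steps):
        grad = x.T @ (softmax(x @ w) - onehot) / len(y)
        w = w - alpha * grad
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))   # 10 support embeddings, 4 dimensions (toy data)
y = np.array([0, 1] * 5)       # 2-way labels
w0 = np.zeros((4, 2))          # initial head
w1 = adapt_head(w0, x, y)

loss_before = cross_entropy(w0, x, y)  # log(2) for the zero head
loss_after = cross_entropy(w1, x, y)
```

How quickly the loss falls depends on the conditioning of the embeddings, which is the property the bullets above attribute to each backbone.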

##### R2D2 Head \(Ridge Regression\)

The R2D2 head computes a closed\-form ridge regression solution $W = (X^{T}X + \lambda I)^{-1} X^{T} Y$, which requires the feature covariance matrix $X^{T}X$ to be reasonably well\-conditioned\.

- •vs\. BOIL: Severe degradation\. BOIL freezes the head and adapts only the backbone, effectively forcing the feature extractor to fit a fixed, randomly initialized classifier\. This can induce highly distorted or collapsed feature representations, leading to poorly conditioned or near\-rank\-deficient covariance matrices and unstable regression solutions\.
- •vs\. METAL & MAML: METAL and MAML do not explicitly optimize features for linear regression consistency\. METAL emphasizes distance\-based clustering, while MAML emphasizes rapid adaptability under cross\-entropy loss\. The resulting feature representations may exhibit anisotropic variance or weak linear correlations with labels, which can degrade the performance of a ridge regression solver\.
- •vs\. LEO: LEO backbones are trained to support latent\-space weight generation under a Gaussian prior rather than direct linear prediction in feature space\. As a result, the induced representations are not well aligned with the assumptions underlying linear regression, limiting the stability of the R2D2 solution\.
- •vs\. ANIL: ANIL produces generic, static feature representations optimized for linear classification rather than regression\. While these features may not be optimally conditioned for ridge regression, their stability is typically sufficient to avoid numerical failure\.
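The closed\-form solution and its dependence on conditioning can be illustrated as follows. This is a sketch on random toy embeddings, with `ridge_head` a hypothetical helper name and the rank\-1 "collapsed" matrix an artificial stand\-in for degenerate features.

```python
import numpy as np

def ridge_head(x, y_onehot, lam=1.0):
    """Solve (X^T X + lam I) W = X^T Y, i.e. W = (X^T X + lam I)^{-1} X^T Y."""
    d = x.shape[1]
    gram = x.T @ x + lam * np.eye(d)
    return np.linalg.solve(gram, x.T @ y_onehot)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))            # 10 support embeddings (toy data)
y = np.eye(2)[np.array([0, 1] * 5)]     # one-hot labels, 2-way
w = ridge_head(x, y)

# Conditioning of X^T X for well-spread vs. collapsed (rank-1) features.
cond_ok = np.linalg.cond(x.T @ x)
collapsed = np.outer(rng.normal(size=10), rng.normal(size=4))
cond_bad = np.linalg.cond(collapsed.T @ collapsed)
```

Using a linear solve rather than an explicit inverse is a numerical design choice. The regularizer $\lambda I$ keeps the system solvable even for collapsed features, although the resulting weights then rest largely on the regularizer.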

##### METAL Head \(Task\-Dependent Metric Scaling\)

The METAL head computes class probabilities using a scaled Euclidean distance, $P(y \mid x) \propto \exp\!\left(-\alpha\, d(x, c_{k})\right)$, where $\alpha$ is a learnable task\-dependent scaling parameter\.

- •vs\. BOIL: METAL is relatively robust to BOIL\-induced feature irregularities, as the learnable scaling parameter $\alpha$ can partially compensate for distorted or unevenly scaled feature dimensions\.
- •vs\. MAML & ANIL: MAML and ANIL backbones are trained to optimize dot\-product\-based logits under cross\-entropy loss\. The resulting angularly structured representations are not optimally aligned with METAL’s distance\-based inference, leading to a degradation due to geometric mismatch\.
- •vs\. R2D2 & LEO: R2D2 \(regression\-oriented\) and LEO \(latent\-variable\-based\) backbones induce feature geometries that are not explicitly optimized for Euclidean distance comparisons, resulting in reduced effectiveness of METAL’s metric\-based classification\.
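The scaled\-distance rule stated above can be written as a softmax over negative distances. In this sketch the scale `alpha` is passed in as a fixed number rather than learned, the centroids are toy values, and the distance is taken to be the plain (not squared) Euclidean norm, which is an assumption about the exact form of $d$.

```python
import numpy as np

def scaled_distance_probs(query, centroids, alpha=1.0):
    """P(y = k | x) proportional to exp(-alpha * d(x, c_k)), d = Euclidean distance."""
    d = np.linalg.norm(centroids - query, axis=1)
    logits = -alpha * d
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

centroids = np.array([[0.0, 0.0], [2.0, 0.0]])  # toy class centroids
query = np.array([0.5, 0.0])                    # closer to class 0

p_soft = scaled_distance_probs(query, centroids, alpha=1.0)
p_sharp = scaled_distance_probs(query, centroids, alpha=5.0)
```

A larger `alpha` sharpens the distribution around the nearest centroid without changing the ranking of classes.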

##### ANIL Head \(Feature Reuse / Linear Classifier\)

ANIL freezes the backbone $f_{\theta}$ and trains a linear classifier $w$ on top\. It serves as a proxy for evaluating the quality of static features\.

- •vs\. METAL Backbone: ANIL shows strong transfer\. Since ANIL does not adapt the backbone, it benefits most from features that are already well\-separated; METAL provides tightly clustered, ready\-to\-use features\.
- •vs\. MAML: MAML’s backbone produces features of sufficient quality that even without adaptation, ANIL’s linear classifier can achieve competitive performance\.
- •vs\. BOIL: BOIL backbones are trained with body\-only adaptation, yielding expressive features that can support ANIL’s frozen head without additional adaptation\.
- •vs\. R2D2: R2D2 features are optimized for regression objectives, emphasizing scale consistency rather than purely linear separability, which may slightly reduce ANIL’s effectiveness\.
- •vs\. LEO: LEO features are latent\-space optimized; their probabilistic structure is not directly aligned with ANIL’s linear classifier\.

##### LEO Head \(Latent Encoding Optimization\)

LEO generates parameters for a linear classifier via a latent space with Gaussian prior $p(z) = \mathcal{N}(0, I)$\.

- •vs\. ANIL: LEO can benefit from stationary, frozen features such as ANIL’s, as they provide a consistent input distribution for latent encoding\.
- •vs\. MAML: High\-variance adaptive features from MAML may be misaligned with LEO’s latent encoding, potentially reducing transfer effectiveness\.
- •vs\. R2D2: Regression\-optimized features from R2D2 are not specifically aligned with LEO’s Gaussian latent prior, which may limit performance\.
- •vs\. METAL & BOIL: Features strongly constrained by geometric clustering or backbone\-specific adaptation may not be ideally aligned with LEO’s probabilistic latent representation\.

##### BOIL Head \(Body\-Only Adaptation\)

BOIL freezes the head and updates backbone parameters via gradient descent\.

- •vs\. METAL: Starting adaptation from METAL’s clustered features can provide a favorable initialization for gradient updates compared to random initialization\.
- •vs\. R2D2: Features with low intra\-class variance, such as R2D2’s, may be relatively stable and support body\-only adaptation without significant feature distortion\.
- •vs\. MAML & ANIL: These backbones may require more extensive adaptation than is feasible in a few gradient steps under a frozen head, limiting BOIL’s performance\.
- •vs\. LEO: When using a LEO backbone, BOIL exhibits severe performance degradation, as LEO relies on task\-specific latent\-space optimization and decoder\-based parameter generation, making body\-only gradient updates under a frozen head poorly aligned with its intended adaptation mechanism\.

#### G\.2\.2 *IID*–*OOD* Background Perturbation Analysis

This section examines the 5\-shot performance gap \($\Delta$\) between *IID* \(matched backgrounds\) and *OOD* \(mismatched or spurious backgrounds\)\.

- •MAML vs\. ANIL: MAML generally exhibits a smaller drop in performance compared to ANIL\. This aligns with the principle that full network adaptation can mitigate reliance on spurious correlations\. By updating both backbone and head on the support set, MAML can adjust feature representations that are correlated with background noise, whereas ANIL freezes the backbone, retaining any pre\-trained shortcuts\.
- •R2D2 Sensitivity: R2D2 often achieves high *IID* accuracy but experiences a notable drop under *OOD* conditions\. Since R2D2 computes a closed\-form classifier on fixed embeddings, it cannot dynamically reweight features during few\-shot updates, making it as sensitive to background shifts as ANIL\.
- •METAL Robustness: METAL typically shows high accuracy with minimal degradation under *OOD*\. Its task\-dependent metric scaling encourages class prototypes to form compact clusters, reducing the influence of background variability and improving robustness to distribution shifts\.

#### G\.2\.3 Linking Representation Geometry to Shortcut Exploitation

- •Feature Reuse \(ANIL/R2D2\) vs\. Feature Adaptation \(MAML\): Algorithms that rely on feature reuse are more susceptible to spurious background correlations because static backbones encode such correlations permanently\. Feature adaptation, as in MAML, allows the backbone to adjust the feature manifold according to the support set, mitigating the effect of spurious correlations when foreground and background are mismatched\.
- •Gradient\-Based “Unlearning”: MAML’s gradient updates on the backbone can suppress features associated with spurious correlations\. BOIL shares this potential in principle, but its entangled head\-backbone optimization limits effective adaptation in practice\.

#### G\.2\.4 *OOD* Heatmap Comparison: Shifts in Robustness

Comparing *IID* and *OOD* heatmaps highlights differences in backbone robustness\.

- •R2D2 Backbone with MAML Head: While this pairing shows reduced performance in *IID* settings, the gap diminishes under *OOD* conditions\.
  - –Mechanistic Interpretation: R2D2’s regression objective produces regularized, low\-variance feature embeddings\. In *IID*, this regularization can restrict adaptation under MAML\. In *OOD*, the same regularization stabilizes the model by limiting overfitting to background noise\.
- •METAL Backbone: METAL consistently demonstrates strong performance under *OOD*\.
  - –Mechanistic Interpretation: METAL enforces geometric compactness of class prototypes\. This tight clustering reduces susceptibility to spurious background correlations, providing robustness in both *IID* and *OOD* settings\.
- •Persistent Architectural Mismatches: Combinations such as LEO Head with MAML Backbone or R2D2 Head with BOIL Backbone show consistently low performance across *IID* and *OOD* conditions\. These failures are attributable to fundamental mathematical incompatibilities between head and backbone, rather than dataset\-specific effects\.

### G\.3 Metric\-Based *FSC*

This section analyzes the compatibility between the classification heads \(rows\) and feature backbones \(columns\) of Metric\-based algorithms in the In\-Distribution \(*IID*\) setting\.

#### G\.3\.1 Head–Backbone Replacement Analysis \(*IID* Baseline\)

##### ProtoNet Head \(Euclidean Centroid\)

The ProtoNet head computes class centroids via Global Average Pooling \(GAP\) and classifies using Euclidean distance\.

- •vs\. DN4: The ProtoNet head benefits from this backbone\. DN4 preserves rich local descriptors, which, after pooling, create a denser and more discriminative feature space than ProtoNet’s native training\.
- •vs\. ATL\-NET: ATL\-NET employs episodic attention, resulting in feature embeddings that highlight discriminative regions, which slightly worsens ProtoNet’s centroid classification\.
- •vs\. MCL: Performance drops significantly\. MCL optimizes a ranking\-based objective that alters the Euclidean structure of the feature space\. ProtoNet assumes isotropic Gaussian clusters; this assumption is violated when features exhibit non\-Euclidean geometry induced by MCL\.
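The ProtoNet head described above reduces to global average pooling followed by nearest\-centroid classification. The sketch below uses synthetic feature maps of shape `(N, C, H, W)` as stand\-ins for backbone outputs; the shapes, noise level, and helper names are assumptions.

```python
import numpy as np

def gap(feature_maps):
    """Global average pooling: (N, C, H, W) -> (N, C)."""
    return feature_maps.mean(axis=(2, 3))

def protonet_predict(support_maps, support_labels, query_maps, n_classes):
    """Nearest class centroid under squared Euclidean distance."""
    s = gap(support_maps)
    q = gap(query_maps)
    centroids = np.stack(
        [s[support_labels == k].mean(axis=0) for k in range(n_classes)]
    )
    dists = ((q[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Toy feature maps: class 0 activates channel 0, class 1 activates channel 1.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
support = rng.normal(scale=0.1, size=(4, 2, 3, 3))
support[labels == 0, 0] += 1.0
support[labels == 1, 1] += 1.0
query = rng.normal(scale=0.1, size=(2, 2, 3, 3))
query[0, 0] += 1.0
query[1, 1] += 1.0

pred = protonet_predict(support, labels, query, n_classes=2)
```

Because pooling averages over every spatial position, any background content present in the feature map contributes to the centroid as well as the foreground.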

##### ADM Head \(Adaptive Distance Metric\)

The ADM head learns a task\-specific distance metric, such as a Mahalanobis\-like matrix, for similarity measurement\.

- •vs\. All Foreign Backbones: ADM is highly sensitive to backbone changes\. Its learned metric relies on specific feature covariance structures\. Backbones not trained with ADM \(e\.g\., ProtoNet, DN4\) do not provide the expected correlations, resulting in performance degradation\. The effect is most pronounced with MCL, indicating a geometric incompatibility between ADM’s learned metric and MCL feature space\.

##### DN4 Head \(Image\-to\-Class / Local Descriptors\)

DN4 operates on unpooled feature maps using a local neighbor matching procedure rather than global feature vectors\.

- •vs\. ProtoNet & ATL\-NET: Accuracy decreases when paired with these backbones\. ProtoNet and ATL\-NET optimize global pooled representations, which do not necessarily preserve the spatial locality and fine\-grained information required by DN4’s image\-to\-class matching\. Consequently, the DN4 head receives embeddings that are less compatible with its matching mechanism, resulting in lower performance relative to its native backbone\.
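The local matching rule can be sketched as follows: each query descriptor is compared with all support descriptors of a class, and its top\-k cosine similarities are summed into a class score. The one\-hot descriptors below are hand\-built toy values (two "foreground" descriptors and one "background" descriptor per item), and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def image_to_class_score(query_desc, class_desc, k=1):
    """Sum, over query descriptors, of their top-k cosine similarities
    to the pool of support descriptors of one class."""
    sims = l2_normalize(query_desc) @ l2_normalize(class_desc).T  # (Mq, Mc)
    topk = np.sort(sims, axis=1)[:, -k:]
    return float(topk.sum())

# Toy descriptors: dims 0-1 are foreground patterns, dims 2-3 are backgrounds.
class_desc = [
    np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0], [0, 0, 1.0, 0]]),  # class 0
    np.array([[0, 1.0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]]),  # class 1
]
# Query: class-0 foreground paired with the background seen with class 1.
query_desc = np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0], [0, 0, 0, 1.0]])

scores = [image_to_class_score(query_desc, d, k=1) for d in class_desc]
pred = int(np.argmax(scores))
```

In this toy case the two foreground descriptors each find a match in class 0 and the single background descriptor finds one in class 1, giving scores of about 2 and 1, so the query is assigned to class 0.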

##### MCL Head

MCL is trained to optimize a ranking\-based or curriculum\-based objective in the feature space\.

- •vs\. All Foreign Backbones: MCL exhibits performance degradation with any non\-native backbone\. This indicates that MCL’s learned feature manifold has a specialized geometry, which is incompatible with the Euclidean or local\-matching assumptions used by standard metric\-based heads such as ProtoNet or DN4\.

#### G\.3\.2 *IID*–*OOD* Background Perturbation Analysis

This section examines the performance gap between *IID* \(matched backgrounds\) and *OOD* \(mismatched/spurious backgrounds\)\.

- •Robustness of DN4: DN4 maintains high *OOD* accuracy and one of the smallest performance drops\. This supports the Local Feature Hypothesis\. By comparing sets of local descriptors rather than a single global vector, DN4 can effectively focus on foreground patches that match the query while ignoring background patches that do not, providing intrinsic robustness to background shifts\.
- •ProtoNet Sensitivity: ProtoNet experiences a substantial drop in *OOD* performance\. Its reliance on Global Average Pooling \(GAP\) aggregates both foreground and background into a single vector\. Background changes in the *OOD* setting shift this vector away from the class centroid, leading to misclassification\. ProtoNet lacks a mechanism to spatially filter out background features\.
- •ATL\-NET and Attention: ATL\-NET also shows a large *OOD* performance drop despite using an attention mechanism\. In *IID* training, attention can leverage background correlations, which act as a shortcut\. In *OOD* conditions, these correlations are misleading, reducing accuracy because the attention mechanism cannot fully suppress background contributions\.

#### G\.3\.3 Linking Representation Geometry to Shortcut Exploitation

- •Global vs\. Local Geometry: Heads relying on Global Pooling \(ProtoNet, ATL\-NET\) are sensitive to spurious correlations because signal and noise are combined into a single representation\. Heads using Local Descriptors \(DN4\) preserve spatial separation of signal and noise, allowing the model to selectively match foreground features while ignoring irrelevant background\.
- •The “Attention” Trap: Large drops from *IID* to *OOD* indicate that attention mechanisms may overfit to background correlations present during training, rather than learning invariance to them\.

#### G\.3\.4 *OOD* Heatmap Comparison: Shifts in Robustness

Comparing the *IID* heatmap to the *OOD* heatmap highlights shifts in backbone\-head transferability under distributional changes\.

- •The “ATL\-NET Flip” \(ProtoNet Head\): In *IID*, the ProtoNet head benefits from ATL\-NET features, which encode both object and context\. Under *OOD* conditions, the contextual correlation becomes misleading\. ProtoNet, relying on pooled global representations, cannot disentangle object from background, resulting in reduced transfer performance\.
- •DN4 Consistency: DN4 consistently provides positive transfer to ProtoNet across *IID* and *OOD*\. DN4 features are trained for local matching, creating a representation in which object and background are disentangled\. Even when aggregated by a global pooling head, these features preserve robustness to background shifts\.
- •MCL Isolation Persists: MCL continues to exhibit negative transfer\. The geometric incompatibility of its feature manifold with standard metric heads remains unchanged by the distribution shift, indicating a structural rather than data\-dependent failure\.

### G\.4 Cross\-Family Analysis of Few\-Shot Learning Algorithms \(*IID* Setting\)

This section analyzes the interactions between the three major families: Finetuning\-based \(Baseline, Baseline\+\+, Meta\-Baseline\), Meta\-learning\-based \(MAML, R2D2, METAL, ANIL, LEO, BOIL\), and Metric\-based \(ProtoNet, ADM, DN4, ATL\-NET, MCL, ADM\_KL\)\. The analysis focuses on the geometric compatibility between the backbone’s feature space and the head’s inference mechanism in the *IID* setting\.

- •Finetuning heads on Metric backbones: Finetuning heads \(typically linear classifiers such as Softmax or Cosine classifiers\) rely on the backbone producing features that are linearly separable\. Metric backbones \(like ProtoNet, MCL, ADM\) are trained to optimize local clustering structure \(minimizing intra\-class distance, maximizing inter\-class distance\), often using Euclidean or KL\-divergence metrics\. They do not necessarily enforce the global linear separability required by a standard finetuning head\. Consequently, the linear head cannot find a valid decision boundary in the "clustered" but potentially non\-linear embedding space provided by metric backbones\. Slight improvements were observed with the cosine classifier, and the strong feature representations of the DN4 and ATL\-NET backbones improved all finetuning heads\.
- •Finetuning backbones on Meta\-learning heads: Applying Finetuning backbones \(Baseline, Baseline\+\+, Meta\-Baseline\) to Meta\-learning heads reveals a sharp performance dichotomy driven by the compatibility between feature geometry and adaptation mechanisms\. Heads that decouple feature extraction from task\-specific adaptation, particularly ANIL, achieve their highest accuracy gains with standard backbones \(with Baseline\+\+\), suggesting that robust, globally separable features provide a superior, "honest" foundation for flexible parameter generators compared to meta\-learned features potentially corrupted by episode\-specific shortcuts\. Conversely, this positive transfer collapses when geometric assumptions clash\. The cosine\-constrained Baseline\+\+ backbone is catastrophic for regression and optimization\-based heads, causing massive drops for R2D2 and MAML due to the incompatibility between angular embedding spaces and Euclidean or covariance\-based update rules\. Body\-updating methods like BOIL consistently degrade because converged pre\-trained backbones lack the malleability required for rapid inner\-loop adaptation\.
- •Meta\-Learning Backbones on Metric Heads: Replacing the native backbones of Metric heads \(ProtoNet, MCL, ADM\) with Meta\-Learning backbones \(MAML, BOIL, METAL\) results in universal and severe degradation, highlighting a fundamental conflict between "adaptation\-ready" and "inference\-ready" geometries\. Metric heads rely on static, tightly clustered embeddings to calculate distances \(e\.g\., Euclidean or Cosine\) without further training; however, meta\-learning backbones are trained to produce "malleable" initializations that only become discriminative after gradient updates \(inner\-loop adaptation\)\. Consequently, when a static Metric head is applied to a "raw" meta\-learning backbone like BOIL or METAL, it encounters feature spaces that are not yet linearly or spherically separated, leading to catastrophic failures and indicating that meta\-learning optimization objectives do not inherently align with the distance\-minimization manifolds required by metric classifiers\.
- •Meta\-Learning Backbones on Finetuning Heads: The application of Finetuning heads to Meta\-Learning backbones is largely unsuccessful, with a notable exception for the cosine\-based Baseline\+\+ head, revealing that only specific meta\-learning strategies produce globally separable features\. Most meta\-learning backbones \(especially BOIL, MAML and METAL\) cause significant drops for standard linear heads\. However, Baseline\+\+ achieves positive transfer with ANIL and only a very slight performance drop with R2D2\. This success occurs because ANIL explicitly learns a fixed feature extractor \(compatible with fixed heads\), and R2D2 optimizes a ridge\-regression objective that aligns geometrically with the angular/cosine margins of Baseline\+\+\. This indicates that for a meta\-learned backbone to be transferable to a standard classifier without further adaptation, it must be trained with constraints \(like ANIL’s frozen body or R2D2’s regression\) that enforce a stable, normalized embedding space compatible with cosine similarity\.
- •Metric backbones on Meta\-learning heads: The replacement of meta\-learning backbones with Metric backbones \(ProtoNet, ADM, MCL\) results in almost universal failure, driven by the incompatibility between the "static clustering" objective of metric learning and the "dynamic adaptation" requirement of meta\-learning\.
- •Metric backbones on finetuning heads: Applying Finetuning heads to Metric backbones yields a mixed but generally stable outcome, contrasting sharply with the failures seen in meta\-learning heads, because the linear separability enforced by metric objectives is often sufficient for static linear classifiers\. Standard Baseline heads show slight\-to\-moderate losses or slight gains, while the cosine\-based Baseline\+\+ achieves significant boosts, suggesting that these specific backbones produce well\-normalized, globally separable manifolds that align well with fixed cosine classifiers\. ATL\-NET produces positive results on all finetuning methods\.

### G\.5 Cross\-Family Analysis of Few\-Shot Learning Algorithms \(*OOD* Setting\)

This section analyzes the interactions between the three major families: Finetuning\-based \(Baseline, Baseline\+\+, Meta\-Baseline\), Meta\-learning\-based \(MAML, R2D2, METAL, ANIL, LEO, BOIL\), and Metric\-based \(ProtoNet, ADM, DN4, ATL\-NET, MCL, ADM\_KL\)\. The analysis focuses on the geometric compatibility between the backbone’s feature space and the head’s inference mechanism in the *OOD* setting\.

- •Metric backbones on finetuning heads: In the *OOD* setting, replacing the native heads of Metric backbones with standard Finetuning heads proves to be surprisingly robust\. Unlike other cross\-family transfers that suffer from severe degradation, standard linear classifiers \(Baseline\) and cosine classifiers \(Baseline\+\+\) often maintain or even improve performance when applied to backbones like ProtoNet, DN4, and ATL\-NET\. This suggests that the embedding geometry learned by most metric algorithms is not "twisted" or overly specialized, but rather creates clean, globally separable clusters that even a simple linear boundary can respect, effectively filtering out perturbed background noise\.
- •Metric backbones on Meta\-Learning heads: The replacement of meta\-learning backbones with metric backbones results in universal degradation across the board\. Meta\-learning algorithms typically require the feature space to be "malleable" so that a few gradient steps can align the support and query sets\. Metric backbones instead provide a fixed, rigid geometry, and in the *OOD* scenario this rigidity is fatal: the backbone has effectively "memorized" the specific background\-to\-class relationships of the training data\. Because the meta\-learning head cannot reshape this rigid geometry sufficiently during the inner loop, it is forced to operate on features where the background acts as a misleading distraction, leading to consistent performance drops\.
- •Finetuning backbones on Metric heads: A significant shift occurs here when moving from *IID* to *OOD*\. The varying backgrounds introduce large distance penalties, effectively "pushing" the query samples away from their correct class prototypes\. Unlike linear heads, which can learn to ignore certain dimensions, distance\-based heads aggregate error from all dimensions, including the perturbed background noise\.
- •Finetuning backbones on Meta\-Learning heads: This interaction remains the most robust, particularly for methods that decouple feature extraction from adaptation\. When a standard pre\-trained backbone is used, it tends to learn robust, global representations of the foreground objects \(as it must distinguish all classes simultaneously\)\.
- •Meta\-learning backbones on Finetuning heads: The application of Finetuning heads to Meta\-Learning backbones reveals a sharp divide based on how the backbone was trained\. Backbones from algorithms that update the entire network during the inner loop \(like BOIL or METAL\) perform poorly with standard heads because their pre\-adaptation features are not yet linearly separable; they rely on the adaptation step to align the geometry\. However, backbones from algorithms that explicitly freeze feature extraction \(like ANIL\) or use regression constraints \(like R2D2\) transfer remarkably well to Cosine\-based heads \(Baseline\+\+\)\. This indicates that meta\-learning strategies which enforce a stable, normalized embedding space can produce features that are robust to background perturbations\.
- •Meta\-learning backbones on Metric heads: Applying Metric heads to Meta\-Learning backbones results in consistent performance degradation across the board\. This failure stems from a conflict in optimization objectives\. In the *OOD* context, where background noise changes, the un\-adapted meta\-learning backbone likely has not yet separated the foreground from the background, causing the metric head to calculate distances based on spurious noise rather than semantic class identity, leading to widespread failure\.

### G\.6 Cross\-Family Analysis of Few\-Shot Learning Algorithms in Deeper Architectures

The transition from Conv64 to ResNet12 reveals a fundamental shift towards higher feature robustness, characterized by a marked reduction in the severe geometric incompatibilities \(deep red blocks in the figure\) seen in Conv64 and by the emergence of FRN as a "universal" head\. However, this deeper architecture exposes a specific vulnerability in ANIL, which flips from a generally compatible head in Conv64 to suffering catastrophic failures on ResNet12 backbones, indicating that deeper, more complex feature hierarchies require the full body adaptation that ANIL forbids\. Crucially, ResNet12 neutralizes the spurious correlation gap that plagued Conv64 more effectively: while Conv64 showed degradation under *OOD* background perturbation, ResNet12 displays remarkable stability, and even improvement, when switching backbones\. This suggests that the deeper residual representations disentangle foreground semantics from background noise more successfully, making the model geometrically more invariant to the *OOD* shifts that broke the shallower Conv64 architectures\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/conv64_iid_confusion_matrix.png) Figure 9: *IID* configuration, multiple Conv64F backbones on multiple algorithms\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/conv64_ood_confusion_matrix.png) Figure 10: *OOD* configuration, multiple Conv64 backbones on multiple algorithms\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/resnet12_iid_confusion_matrix.png) Figure 11: *IID* configuration, multiple ResNet12 backbones on multiple algorithms\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/resnet12_ood_confusion_matrix.png) Figure 12: *OOD* configuration, multiple ResNet12 backbones on multiple algorithms\.

Key Takeaway:

- •Cosine\-based classifiers are the most robust head family in Conv64, because they suppress the magnitude\-encoded background shortcut that magnitude\-sensitive heads \(Euclidean ProtoNet, dot\-product Baseline\) exploit\.
- •Swapping a cosine head onto a magnitude\-sensitive backbone \(Meta\-Baseline, Baseline\) yields the largest single robustness gain observed in the head–backbone replacement study; the reverse swap yields the largest drop\.
- •This robustness holds across backbone depths \(Conv64, ResNet12, ResNet18\): scaling the backbone alone does not close the gap, confirming that magnitude\-invariant inference, rather than deeper representations, is the decisive factor\.

## Appendix H Geometric Disentanglement of Background Correlations

To investigate how feature extractors encode spurious background correlations, we perform a geometric decomposition of the feature space\. We hypothesize that CNN backbones naturally disentangle semantic foreground information from additive background noise by encoding the former in the angular direction and the latter in the feature magnitude\.

### H\.1 Methodology: Radial\-Angular Decomposition

Let $f_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{d}$ denote the backbone feature extractor\. For a given input $x$, we decompose the resulting feature embedding $\mathbf{z}=f_{\theta}(x)$ into two components:

1. 1\.Magnitude \(Radial Component\): $r=\|\mathbf{z}\|_{2}$, representing the activation intensity or "signal energy\."
2. 2\.Direction \(Angular Component\): $\hat{\mathbf{z}}=\frac{\mathbf{z}}{\|\mathbf{z}\|_{2}}$, representing the semantic identity on the hypersphere $\mathbb{S}^{d-1}$\.

To quantify the semantic alignment, we compute the Cosine Similarity between a sample’s direction $\hat{\mathbf{z}}$ and its corresponding clean \(without backgrounds\) class prototype $\mathbf{p}_{c}$\. The prototype is defined as the mean direction of the clean, foreground\-only samples for class $c$:

$$\mathbf{p}_{c}=\text{Normalize}\left(\frac{1}{|\mathcal{S}_{clean}|}\sum_{x_{i}\in\mathcal{S}_{clean}}f_{\theta}(x_{i})\right) \tag{2}$$
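The decomposition and Eq. (2) translate directly into code. The following is a minimal sketch using plain Python lists; the function names (`decompose`, `clean_prototype`, `alignment`) and the two-dimensional toy embeddings are illustrative assumptions, and the code presumes non-zero embeddings so that the normalization is defined.

```python
import math

def decompose(z):
    """Split an embedding z into its magnitude r and unit direction z_hat."""
    r = math.sqrt(sum(v * v for v in z))
    return r, [v / r for v in z]

def clean_prototype(clean_embeddings):
    """Eq. (2): L2-normalized mean of the clean embeddings of one class."""
    n = len(clean_embeddings)
    mean = [sum(col) / n for col in zip(*clean_embeddings)]
    return decompose(mean)[1]

def alignment(z, prototype):
    """Cosine similarity between the direction of z and a unit-norm prototype."""
    _, z_hat = decompose(z)
    return sum(a * b for a, b in zip(z_hat, prototype))

# Hypothetical embeddings: the "mixed" vector is a clean one scaled by 0.5.
clean = [[3.0, 4.0], [4.0, 3.0]]
mixed = [1.5, 2.0]

p_c = clean_prototype(clean)
r_clean, _ = decompose(clean[0])
r_mixed, _ = decompose(mixed)
print("magnitudes:", r_clean, r_mixed)
print("alignments:", alignment(clean[0], p_c), alignment(mixed, p_c))
```

Scaling an embedding changes `r` but leaves `alignment` untouched, which is the property the analysis in this appendix relies on when it reads magnitude and direction as separate axes.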

### H\.2 Empirical Observation: The "Magnitude Contraction" Phenomenon

Figure [13](https://arxiv.org/html/2605.13672#A8.F13) plots the cosine alignment of each feature with the clean prototype \(y\-axis\) against its magnitude \(x\-axis\)\. Across all evaluated backbones, spanning metric learning \(e\.g\., ProtoNet\), optimization\-based meta\-learning \(e\.g\., MAML\), and standard fine\-tuning \(e\.g\., Baseline\), we observe a consistent geometric shift characterized by two distinct properties:

- •Angular Stability \(y\-axis invariance\): The angular alignment of mixed samples remains statistically comparable to that of clean samples\. Specifically, $\cos(\hat{\mathbf{z}}_{mixed},\mathbf{p}_{c})\approx\cos(\hat{\mathbf{z}}_{clean},\mathbf{p}_{c})$\. This indicates that the semantic identity of the foreground object is preserved; the model does not "hallucinate" the background class, nor does the background vector rotationally displace the embedding into an orthogonal subspace\.
- •Magnitude Contraction \(x\-axis shift\): We observe a systematic leftward shift in the feature distribution for mixed samples\. As shown in the centroids of Figure [13](https://arxiv.org/html/2605.13672#A8.F13), the mean magnitude of clean samples is consistently larger than that of mixed samples:

  $$\mathbb{E}[\|\mathbf{z}_{clean}\|]>\mathbb{E}[\|\mathbf{z}_{mixed}\|] \tag{3}$$

### H\.3 Implication for Few\-Shot Metric Learning

This geometric disentanglement elucidates the performance disparity observed between Euclidean\-based and Cosine\-based few\-shot algorithms:

1. 1\.Failure of Euclidean Metrics \(e\.g\., ProtoNet\): Euclidean distance is sensitive to magnitude differences\. For a query $q$ and prototype $p$:

   $$\|q-p\|^{2}=\underbrace{\|q\|^{2}+\|p\|^{2}}_{\text{Magnitude Term}}-\underbrace{2\|q\|\|p\|\cos\theta}_{\text{Interaction Term}} \tag{4}$$

   The "Magnitude Contraction" observed in mixed samples \($\|q\|\downarrow$\) reduces the interaction term and alters the magnitude term, creating a large Euclidean distance even if the angle $\theta$ is perfect\. Indeed, for $\theta=0$ Eq\. \(4\) reduces to $(\|q\|-\|p\|)^{2}$, which grows as $\|q\|$ contracts away from $\|p\|$\. The model interprets the drop in signal energy as dissimilarity, leading to misclassification\.
2. 2\.Robustness of Cosine Metrics \(e\.g\., Baseline\+\+, Meta\-Baseline\): Cosine\-based heads explicitly normalize feature vectors during inference:

   $$\text{Score}(q,p)=\frac{q^{T}p}{\|q\|\|p\|}=\cos\theta \tag{5}$$

   By projecting all embeddings onto the unit hypersphere, these algorithms mathematically nullify the magnitude axis\. Since the background information is sequestered primarily in the magnitude \(as shown by our analysis\), Cosine classifiers are inherently less sensitive to this specific type of distribution shift\.
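A small numerical check makes Eqs. (4) and (5) concrete. The sketch below is a toy example under stated assumptions: the two prototypes, the contraction factor of 0.5, and the helper names are all hypothetical and were picked by hand so that the Euclidean decision flips while the cosine decision does not.

```python
import math

def sq_euclidean(q, p):
    """Squared Euclidean distance, the left-hand side of Eq. (4)."""
    return sum((a - b) ** 2 for a, b in zip(q, p))

def neg_sq_euclidean(q, p):
    """Euclidean head score: a larger value means a closer prototype."""
    return -sq_euclidean(q, p)

def cosine(q, p):
    """Cosine head score, Eq. (5)."""
    dot = sum(a * b for a, b in zip(q, p))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_p = math.sqrt(sum(b * b for b in p))
    return dot / (norm_q * norm_p)

def predict(query, prototypes, score):
    """Return the name of the prototype with the highest score."""
    return max(prototypes, key=lambda name: score(query, prototypes[name]))

# Hypothetical prototypes: "other" points in a different direction but has a small norm.
prototypes = {"correct": [3.0, 4.0], "other": [2.5, 0.0]}
clean_query = [3.0, 4.0]
mixed_query = [0.5 * v for v in clean_query]  # same direction, contracted magnitude

for label, query in (("clean", clean_query), ("mixed", mixed_query)):
    print(label,
          "| Euclidean:", predict(query, prototypes, neg_sq_euclidean),
          "| cosine:", predict(query, prototypes, cosine))
```

Here the mixed query keeps a perfect angle to the correct prototype, yet its squared distance to it equals $(2.5-5)^2=6.25$, which exceeds its distance to the shorter competing prototype. The cosine score ignores the contraction entirely.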

We conclude that spurious background correlations in audio classification are not learned as semantic features \(which would alter direction\) but are encoded as signal dampeners \(which alter magnitude\)\. Consequently, robustness to background noise in few\-shot learning is strictly dependent on the choice of a metric that ignores feature magnitude\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/magnitude_analysis_5x3.png) Figure 13: Geometric Decomposition of Feature Embeddings across 15 Backbones\. Each point plots a sample’s feature magnitude \(x\-axis\) against its cosine similarity to the clean class prototype \(y\-axis\)\. Green circles represent clean \(foreground\-only\) samples and red triangles represent mixed \(foreground \+ background\) samples\. The large circle and triangle markers denote the mean of each category\. Across all backbones, we observe a universal phenomenon: background noise causes a Magnitude Contraction \(shift left\) while maintaining Angular Stability \(stable y\-axis\)\. This confirms that background correlations are encoded in the feature magnitude, rendering Euclidean\-based methods sensitive to noise while preserving validity for Cosine\-based methods\.

### H\.4 Rigorous Quantitative Validation

For each of the 15 few\-shot algorithms where the backbone model is Conv64F, we ran Mann\-Whitney U tests and computed $95\%$ confidence intervals\.

##### Magnitude Contraction \(Table [7](https://arxiv.org/html/2605.13672#A8.T7)\)\.

is confirmed with statistical significance across all 15 FS\-Algs\. The table reports the mean feature magnitude \($L_{2}$ norm\) for clean and mixed samples with $95\%$ confidence intervals, along with the Mann\-Whitney U test p\-value, testing whether the two magnitude distributions are significantly different:

Table 7: Few\-shot algorithm performance on clean vs\. mixed magnitude conditions with statistical significance\.

Across all few\-shot algorithms, clean samples consistently show larger magnitudes than mixed samples, with the difference being highly significant \(all p\-values $<10^{-35}$\)\.

##### Angular Stability \(Table [8](https://arxiv.org/html/2605.13672#A8.T8)\)\.

is also confirmed\. The table reports the mean cosine similarity to the clean class prototype for clean and mixed samples, the absolute difference between them \(Diff\), and the Mann\-Whitney U test p\-value\. While differences are statistically detectable, the Diff column shows that the absolute effect size is negligibly small \($<0.025$ across all backbones\), confirming that background noise does not meaningfully alter the semantic direction of feature embeddings:

Table 8: Cosine similarity of few\-shot algorithm embeddings under clean vs\. mixed conditions with difference and statistical significance\.

Taken together, background noise induces a statistically significant magnitude contraction \($p<10^{-35}$ across all backbones\) while feature direction remains largely intact \(cosine similarity drop $<0.025$ in all cases\)\.
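For readers who want to reproduce this kind of check on their own embeddings, the sketch below shows one way to compute the two reported quantities with the Python standard library only. It is a stand-in under stated assumptions, not the authors' analysis script: the confidence interval uses a normal approximation, the Mann-Whitney p-value uses the large-sample normal approximation without tie or continuity correction, and the sample values are hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math
from statistics import mean, stdev

def mean_ci95(xs):
    """Mean with a normal-approximation 95% confidence interval (needs len(xs) >= 2)."""
    m = mean(xs)
    half_width = 1.96 * stdev(xs) / math.sqrt(len(xs))
    return m, m - half_width, m + half_width

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test: returns (U statistic of xs, approximate p-value)."""
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values and give them their average rank.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(xs), len(ys)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mu) / sigma
    return u_x, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical feature magnitudes for clean and mixed samples.
clean_mag = [5.1, 5.6, 6.0, 6.3, 5.8, 6.1]
mixed_mag = [3.9, 4.2, 4.0, 4.6, 4.4, 4.1]

print("clean mean, CI:", mean_ci95(clean_mag))
print("mixed mean, CI:", mean_ci95(mixed_mag))
print("U, p:", mann_whitney_u(clean_mag, mixed_mag))
```

The same two functions apply unchanged to the cosine-similarity columns, where the quantity of interest is the absolute difference in means rather than the p-value alone.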

## Appendix I Where in the Network Are Spurious Correlations Encoded?

To investigate where in the network spurious correlations are encoded, we performed a stage\-wise analysis by hooking into the intermediate stages of both Conv64F and ResNet12, evaluating IID\-OOD performance using 4 inference heads, specifically, Proto, Baseline, Baseline\+\+, and DN4\.

An important note on interpreting the 1\-shot results: as shown in Figure 15 of the paper, higher shot counts amplify shortcut reliance, meaning the IID\-OOD gap is naturally larger in the 5\-shot setting\. The 1\-shot gaps at intermediate stages are therefore expected to be smaller, and should be interpreted in this context rather than as evidence against the effect\.

Table 9: IID and OOD accuracy with generalization gap across convolution stages and few\-shot heads for 5\-shot and 1\-shot settings\.

For Conv64F, the IID\-OOD gap is consistently strong across both 1\-shot and 5\-shot settings at every layer, ranging from 5–10%\. The gap is already clearly present at Layer 1, confirming that spurious background correlations are encoded from the very first convolutional stage\.

Table 10: IID and OOD accuracy with generalization gap across ResNet12 residual blocks and few\-shot heads for 5\-shot and 1\-shot settings\.

For ResNet12, the 5\-shot gaps are consistently present from residual block 1 \(5–8%\) and grow in deeper layers\. The 1\-shot gaps at intermediate layers are smaller \(0\.07%–3\.18%\), consistent with the shot\-count amplification effect noted above\. Notably, the intermediate layer gaps are smaller than the final embedding gap \($\approx 10\%$\), which is expected: intermediate representations are less task\-discriminative overall, and the spurious correlation effect amplifies progressively as representations become more semantically structured through the network\.

Taken together, across both architectures, both shot settings, and all inference heads, the IID\-OOD gap is present from the earliest feature extraction stage\. This confirms that spurious background correlations are encoded throughout the feature hierarchy rather than emerging only at the final embedding layer\.

## Appendix J On the Distribution of SpurAudio

In this section, we conduct a comprehensive distributional analysis of SpurAudio against the FSD50K benchmark \[[18](https://arxiv.org/html/2605.13672#bib.bib60)\]\. As FSD50K is widely regarded as a gold standard for “in\-the\-wild” multi\-label audio event classification, aligning with its semantic topology is critical for establishing the utility of our dataset\. We use the Contrastive Language\-Audio Pretraining \(CLAP\) model \[[14](https://arxiv.org/html/2605.13672#bib.bib2)\] to extract rich, semantically aware embeddings from both datasets, enabling a direct comparison in a shared latent space\.

We employed two complementary metrics to evaluate the semantic and structural alignment of our data: Maximum Mean Discrepancy \(MMD\) and centroid cosine similarity\.

### J\.1 Methodology and Metrics

- •Maximum Mean Discrepancy \(MMD\): This metric measures the distance between the feature distributions of our synthesized data and the real\-world FSD50K data using a kernel\-based approach \(RBF kernel\)\. Lower values indicate closer distributional matching\. MMD is particularly effective at detecting non\-linear discrepancies between the manifolds of the two domains\.
- •Centroid Cosine Similarity: This metric quantifies the semantic alignment of class prototypes\. We compute the cosine similarity between the centroid of a class in SpurAudio and the centroid of the corresponding class in FSD50K\. High values \(approaching 1\.0\) indicate that the “average” representation of a class \(e\.g\., “Cat”\) in our dataset is semantically close to its real\-world counterpart\.
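Both metrics are short to express in code. The sketch below is a minimal stdlib-only illustration that operates on lists of embedding vectors. The estimator choice (the biased squared-MMD estimate), the kernel bandwidth `gamma=1.0`, and the toy two-dimensional embeddings are assumptions, since the text above does not specify them.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2_rbf(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of embeddings."""
    k_xx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy

def centroid(xs):
    """Coordinate-wise mean of a set of embeddings."""
    return [sum(col) / len(xs) for col in zip(*xs)]

def centroid_cosine(xs, ys):
    """Cosine similarity between the centroids of two embedding sets."""
    cx, cy = centroid(xs), centroid(ys)
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / (math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy)))

# Hypothetical per-class embeddings from the two datasets.
ours = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
reference = [[0.9, 0.2], [0.7, 0.3], [1.0, 0.1]]

print("MMD^2:", mmd2_rbf(ours, reference))
print("centroid cosine:", centroid_cosine(ours, reference))
```

Because the RBF bandwidth strongly affects the scale of MMD values, scores computed with a different `gamma` are not directly comparable to the figures reported in this appendix.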

### J\.2 Analysis of Results

Figure [14](https://arxiv.org/html/2605.13672#A10.F14) presents the per\-class metrics alongside the global structural alignment score\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures_of_data_analysis/data_analysis/distribution_analysis_results.png) Figure 14: Distributional Analysis Metrics per Class\. Comparison of SpurAudio against FSD50K using CLAP embeddings\. Green bars denote Centroid Cosine Similarity \(higher is better\), indicating strong semantic alignment for acoustic events\. Orange bars represent MMD \(lower is better\)\.

#### J\.2\.1 Semantic Alignment

The Centroid Cosine Similarity results reveal a distinct “Event vs\. Paralinguistic” split\. Distinct acoustic events such as Crow \(Cosine: 0\.92\) and Frog \(Cosine: 0\.88\) exhibit strong semantic alignment with FSD50K\. This indicates that our synthesis pipeline captures the core spectral and temporal characteristics of these classes, positioning them close to their FSD50K counterparts in the CLAP latent space\. Conversely, highly variable human paralinguistic classes like Laughter and Sigh show lower cosine similarity \($\approx 0.56$\)\. This is an expected consequence of the “in\-the\-wild” variance; FSD50K contains a vast diversity of laughter types \(e\.g\., giggles, guffaws\) mixed with background noise, whereas our source data represents a more canonical subset\.

#### J\.2\.2 Distributional and Structural Integrity

The MMD scores \(Avg: 0\.19\) quantify the distributional gap between our synthesized foregrounds and the FSD50K recordings\. These non\-zero values indicate a detectable domain shift, attributable to the cleaner, high\-SNR nature of SpurAudio compared to the noisy FSD50K environments, but the structural metrics confirm that this shift is uniform and non\-destructive\.

## Appendix K More Results On Large\-Audio Models And Transformers

In this section, we report the remaining results on the large foundation models CLAP \[[65](https://arxiv.org/html/2605.13672#bib.bib62)\], AST \[[21](https://arxiv.org/html/2605.13672#bib.bib63)\], AudioMAE\-AS20K \[[25](https://arxiv.org/html/2605.13672#bib.bib64)\], and Qwen2\-Audio\-7B \[[11](https://arxiv.org/html/2605.13672#bib.bib65)\] for 1\-shot classification \(Table [11](https://arxiv.org/html/2605.13672#A11.T11)\), as well as the additional variants HTSAT\-CLAP \[[65](https://arxiv.org/html/2605.13672#bib.bib62)\], the self\-supervised BEATS \[[7](https://arxiv.org/html/2605.13672#bib.bib66)\], and PaSST \[[28](https://arxiv.org/html/2605.13672#bib.bib74)\] for 1\-shot and 5\-shot classification \(Table [12](https://arxiv.org/html/2605.13672#A11.T12)\)\. We also report the performance of the pre\-trained transformer\-based encoder Wav2Vec 2\.0 \[[52](https://arxiv.org/html/2605.13672#bib.bib76),[3](https://arxiv.org/html/2605.13672#bib.bib75)\] in Table [13](https://arxiv.org/html/2605.13672#A11.T13)\.

Table 11: IID and OOD accuracy \(%\) across few\-shot methods \(rows\) on pretrained large audio models for 1\-shot classification\.

Table 12: 1\-shot and 5\-shot IID and OOD accuracy \(%\) on three additional pretrained encoders: PaSST \(AudioSet\-pretrained\), HTSAT\-CLAP \(HTSAT encoder fused with CLAP\), and BEATS\.

Table 13: 1\-shot and 5\-shot IID and OOD accuracy \(%\) on Wav2Vec 2\.0\.

## Appendix L Classification Accuracy as a Function of Number Of Shots

As the number of shots increases, we observe a consistent widening of the performance gap between the *IID* and *OOD* settings\. While additional shots improve absolute accuracy in both regimes, the gains are substantially larger under *IID* conditions, indicating that the models increasingly exploit dataset\-specific correlations\. Notably, this gap does not grow indefinitely: beyond a certain number of shots, performance in both settings saturates and the *IID*–*OOD* gap converges, suggesting that the learned representations reach a stable regime where additional supervision yields diminishing returns\. We refer the reader to Figure [15](https://arxiv.org/html/2605.13672#A12.F15) for a visual illustration of this phenomenon\.

![Refer to caption](https://arxiv.org/html/2605.13672v1/figures/more_shots/Meta-Baseline_shots_candles.png) \(a\) Meta\-Baseline: 1\-shot, 3\-shot, 5\-shot, 8\-shot, 10\-shot, *IID* vs\. *OOD*
![Refer to caption](https://arxiv.org/html/2605.13672v1/figures/more_shots/R2D2_shots_candles.png) \(b\) R2D2: 1\-shot, 3\-shot, 5\-shot, 8\-shot, 10\-shot, *IID* vs\. *OOD*
![Refer to caption](https://arxiv.org/html/2605.13672v1/figures/more_shots/DeepBDC_shots_candles.png) \(c\) DeepBDC: 1\-shot, 3\-shot, 5\-shot, 8\-shot, 10\-shot, *IID* vs\. *OOD*

Figure 15: Impact of support set size \($K$\) on *IID* versus *OOD* performance\. The widening gap illustrates a “Simple Bias,” where the model increasingly relies on spurious background correlations as $K$ grows\. Notably, this generalization gap eventually plateaus, indicating that the model’s convergence on the spurious feature saturates at higher shot counts\.

## Appendix M Additional Contrastive Learning Experiments

Table [14](https://arxiv.org/html/2605.13672#A13.T14) shows that adding contrastive objectives to prototypical learning does not consistently improve robustness under OOD background shifts\. Across both 1\-shot and 5\-shot settings, contrastive variants achieve strong IID performance but retain substantial IID–OOD gaps, with only limited and inconsistent gains from augmentation or attention\. We evaluated SimCLR \[[8](https://arxiv.org/html/2605.13672#bib.bib40)\] and contrastive proto \[[54](https://arxiv.org/html/2605.13672#bib.bib41)\]\. Specifically, for the contrastive framework, we employed SpecAugment techniques such as Time Masking and Frequency Masking to randomly mask blocks of time and frequency on the log\-mel spectrograms to generate the positive pairs for the contrastive loss\.
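The positive-pair construction described above can be sketched as follows. This is an illustrative stdlib-only version, not the training code used for Table 14: the spectrogram is a plain list of mel-bin rows, one time block and one frequency block are masked per view, masked cells are set to `0.0`, and the maximum mask widths are hypothetical defaults that are assumed not to exceed the spectrogram dimensions.

```python
import random

def mask_spectrogram(spec, max_time=2, max_freq=2, rng=random):
    """Return a copy of spec (mel bins x time frames) with one time block and one frequency block zeroed."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Time masking: zero a random run of consecutive frames across all mel bins.
    t_width = rng.randint(0, max_time)
    t_start = rng.randint(0, n_time - t_width)
    for f in range(n_freq):
        for t in range(t_start, t_start + t_width):
            out[f][t] = 0.0

    # Frequency masking: zero a random run of consecutive mel bins across all frames.
    f_width = rng.randint(0, max_freq)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        for t in range(n_time):
            out[f][t] = 0.0
    return out

def positive_pair(spec, rng=random):
    """Two independently masked views of the same log-mel spectrogram."""
    return mask_spectrogram(spec, rng=rng), mask_spectrogram(spec, rng=rng)

# Hypothetical 4-bin x 6-frame spectrogram filled with ones.
toy_spec = [[1.0] * 6 for _ in range(4)]
view_a, view_b = positive_pair(toy_spec, rng=random.Random(0))
for row in view_a:
    print(row)
```

Note that both views keep the same foreground and background content, so this augmentation by itself gives the contrastive loss no signal for separating the two, which is consistent with the limited robustness gains reported here.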

Table 14: Contrastive few\-shot accuracy \(%\) under IID and OOD settings\.

Moreover, SimCLR\-pretrained backbones exhibit degraded few\-shot accuracy as model capacity increases, highlighting a mismatch between contrastive pretraining and distance\-based few\-shot inference in the presence of background shifts\.

## Appendix N Limitations

SpurAudio is designed as a controlled benchmark for studying spurious foreground–background correlations in few\-shot audio classification\. Its mixtures are synthetic, combining foregrounds with semantically paired backgrounds\. This design enables precise control over the spurious association between the foreground and background signals\. Our distributional analysis \(Appendix [J](https://arxiv.org/html/2605.13672#A10)\) shows close alignment with real audio in CLAP embedding space\. At the same time, it abstracts away some factors present in fully in\-situ recordings, such as room effects, recording variability, and spatial cues\. Finally, SpurAudio inherits the licenses, biases, and consent regimes of its five constituent datasets, so any biases in those corpora propagate into our benchmark and any downstream analysis conducted with it\.