CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

arXiv cs.LG 06/03/26, 04:00 AM Papers
Summary
This paper introduces CoughSense, a system that classifies cough recordings into five respiratory disease categories using a fine-tuned Whisper encoder with active-frame pooling, achieving 82.3% balanced accuracy and deployed as a real-time mobile app.
arXiv:2606.02998v1 Announce Type: new Abstract: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a consumer smartphone. We present CoughSense, a system that sorts cough recordings into five classes. These are healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia. We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and used the OpenAI Whisper encoder as a pretrained backbone for cough disease classification. The main contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder tokens. This avoids the silence-dilution problem that arises because a 3-second cough fills only 150 tokens of Whisper's 30-second input window. Other training parts handle the 19 to 1 class imbalance and the four-dataset domain shift. These include WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. A dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model through cross-attention. CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3 percent balanced accuracy on five-fold cross-validation (macro-F1 of 0.817, AUC of 0.941). It beat an ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT trained from scratch by 29.6 points. All five classes passed 74 percent recall and four of five passed 80 percent. The dual-encoder model reached 85.4 percent balanced accuracy. Active-frame pooling is the largest single contributor across all ablation components at 5.1 points, which should help any short-audio task using Whisper as a backbone.
# 1. Introduction
Source: [https://arxiv.org/html/2606.02998](https://arxiv.org/html/2606.02998)

Nikhil Vincent1

1Independent Researcher, Bothell, Washington, USA

Corresponding Author: Nikhil Vincent Email: nikhil\.vincent\.v@gmail\.com ORCID: 0009\-0007\-1995\-2529

Code and Data Availability: Training code, model checkpoints, and benchmark data splits are available at: [https://github.com/nikhilvincentv/Cough-Mobile-App](https://github.com/nikhilvincentv/Cough-Mobile-App)

Abstract

Background: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical screening tool must distinguish between multiple respiratory conditions from a single cough recording captured on a consumer smartphone.

Objective: This paper describes CoughSense, a system that classifies cough recordings into five categories (healthy, COVID-19, asthma/respiratory condition, bronchitis, and pneumonia) and deploys as a real-time mobile application on iOS and Android.

Methods: We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and applied the OpenAI Whisper encoder \[[2](https://arxiv.org/html/2606.02998#bib.bib2)\] as a pretrained backbone for cough disease classification for the first time. The central technical contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder output tokens, avoiding the silence-dilution problem that arises because a 3-second cough occupies only 150 tokens of Whisper’s 30-second input window. Additional training components address the 19:1 class imbalance and four-dataset domain shift: WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing \[[10](https://arxiv.org/html/2606.02998#bib.bib10)\], supervised contrastive auxiliary loss \[[8](https://arxiv.org/html/2606.02998#bib.bib8)\], FiLM symptom conditioning \[[12](https://arxiv.org/html/2606.02998#bib.bib12)\], and gradient-reversal domain adaptation \[[13](https://arxiv.org/html/2606.02998#bib.bib13)\]. A complementary dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model \[[7](https://arxiv.org/html/2606.02998#bib.bib7)\] via cross-attention.

Results: CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3% balanced accuracy on five-fold cross-validation (macro-$F_1 = 0.817$, AUC $= 0.941$), outperforming an ImageNet-pretrained EfficientNet-B2 by 11.1 percentage points and a ViT trained from scratch by 29.6 points. All five classes exceeded 74% recall; four of five exceeded 80%. The dual-encoder model reached 85.4% balanced accuracy. Server-side inference latency was approximately 180 ms per recording.

Conclusions: Active-frame pooling is the single largest contributor across all ablation components (+5.1 points), a finding that applies to any short-audio task using Whisper as a backbone. CoughSense is deployed as a real-time mobile screening tool for iOS and Android. All benchmark data splits, training code, and model checkpoints are released to support reproducibility.

Keywords: cough sound classification; respiratory disease; Whisper encoder; transfer learning; contrastive learning; domain adaptation; audio foundation models; COVID-19; bronchitis; pneumonia; multi-class imbalanced classification; mobile health

Cough is the most common reason for primary care visits globally\[[1](https://arxiv.org/html/2606.02998#bib.bib1)\], and it spans conditions from mild upper respiratory infections to pneumonia and COVID\-19\. Cough audio is readily captured on any smartphone, yet almost every published classifier stops at the binary question of whether a person has COVID\-19\. A screening tool needs to do more\. If a child’s wet, rattling cough sounds like pneumonia rather than COVID\-19, the model should say so\.

A five\-class cough classifier is harder to build than expected, for reasons that go beyond the usual machine\-learning challenges\.

The biggest obstacle is acoustic ambiguity\. COVID\-19 in the acute phase produces a dry, non\-productive cough close to a healthy person clearing their throat\. Bronchitis and pneumonia both cause wet, rattling coughs; they differ in the depth of the airway affected, which produces small acoustic differences that are hard to learn from a few hundred recordings\.

Class imbalance is severe\. Across all four source datasets, healthy subjects outnumber pneumonia cases by 19:1\. Early training runs with plain loss functions collapsed: the model predicted “healthy” for every input and still hit 68% accuracy\. Pushing all five classes past 74% recall required careful interaction between the sampler, the loss function, and data augmentation\.

The datasets also come from different environments: adult crowdsourced recordings from India and Latin America, and pediatric clinical recordings from a hospital in Chengdu, China\. A model that memorises recording\-environment cues rather than disease cues will fail on unseen sources\.

### 1\.1 Why Whisper

The core idea is to use OpenAI’s Whisper speech recognition encoder \[[2](https://arxiv.org/html/2606.02998#bib.bib2)\] as a pretrained backbone, which (as far as we can determine) has not been done for cough disease classification. Whisper was trained on 680,000 hours of speech, learning to represent glottal excitation, vocal-tract resonances, and fine temporal structure at 10 ms resolution. Cough shares this anatomy: it is an explosive forced exhalation through the same laryngeal apparatus as speech. We therefore expected these pretrained representations to transfer to respiratory pathology, much as ImageNet features transfer to medical imaging despite the domain gap.

The intuition holds up. With the Whisper encoder frozen, we observed AUC $= 0.784$ on the first validation epoch, above EfficientNet-B2’s trajectory at the same point. After full fine-tuning, the model reaches 82.3% five-class balanced accuracy.

One non-obvious implementation issue matters here. Whisper expects 30-second inputs, but cough recordings are 1–4 seconds long. After the encoder’s convolutional stem, a 3-second clip occupies around 150 of 1500 output tokens; the rest correspond to zero-padded silence. Pooling over all 1500 tokens dilutes the disease signal. Restricting pooling to the first 200 tokens (a safe upper bound on cough duration) and using learned query-based attention over that range accounts for +5.1 percentage points of balanced accuracy on its own.

### 1\.2 Summary of Contributions

1. To our knowledge, the first application of the Whisper encoder to multi-class cough disease classification, with a 5-fold cross-validation benchmark of 18,301 recordings across five classes.
2. Active-frame QKV attention pooling: restrict to the first $K = 200$ encoder tokens, then apply learned multi-head attention, avoiding the silence-dilution problem. This alone yields +5.1 points over naive mean pooling across all 1500 tokens.
3. A training recipe combining WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing \[[10](https://arxiv.org/html/2606.02998#bib.bib10)\], supervised contrastive loss \[[8](https://arxiv.org/html/2606.02998#bib.bib8)\], FiLM symptom conditioning \[[12](https://arxiv.org/html/2606.02998#bib.bib12)\], and gradient-reversal domain adaptation \[[13](https://arxiv.org/html/2606.02998#bib.bib13)\], reaching 82.3% balanced accuracy at 8.6M parameters.
4. A dual-encoder model fusing Whisper with OPERA-CT \[[7](https://arxiv.org/html/2606.02998#bib.bib7)\] via cross-attention, reaching 85.4% balanced accuracy.
5. A real-time mobile inference pipeline and a curated benchmark with structured augmentation of the under-represented West China Hospital bronchitis ($91 \to 728$) and pneumonia ($82 \to 656$) classes.

## 2\. Methods

### 2\.1 Study Design

This study describes the development and offline validation of a multi\-class cough classification system using publicly available, previously collected audio datasets\. No new participant data were collected\. All source datasets are used in accordance with their respective data use agreements \(detailed in the Ethics section\)\.

### 2\.2 Dataset Construction

#### 2\.2\.1 Source Collections

We aggregated cough recordings from four publicly available datasets, spanning three continents and three acquisition modalities\.

Coswara \[[3](https://arxiv.org/html/2606.02998#bib.bib3)\]: 16,780 recordings from participants across India, collected via a web portal. Audio includes nine modalities at 44.1 kHz stereo. Labels include healthy, COVID-19 (self-reported), and respiratory conditions (asthma, COPD, other), with a seven-dimensional binary symptom vector (fever, cold, cough, diarrhoea, loss of smell, fatigue, sore throat). Only the heavy-cough clips are used.

CoughVID \[[4](https://arxiv.org/html/2606.02998#bib.bib4)\]: A crowdsourced collection of 27,000 cough recordings from participants globally. We use the expert-reviewed subset with confirmed quality score $\geq 1$ and reported health status labels. PCR-confirmed COVID-19 positive recordings ($n = 1{,}107$ expert-reviewed) are used; healthy controls are selected to match the demographic distribution.

Virufy \[[5](https://arxiv.org/html/2606.02998#bib.bib5)\]: 103 clinically validated cough recordings from PCR-tested participants (48 positive, 55 negative) collected in Latin American clinical settings. Both the original files (16 recordings, MP3) and the segmented clips (87 recordings, MP3) are used, with labels derived from filename prefix conventions (`pos-*` vs. `neg-*`).

West China Hospital Pediatric Cough Dataset\[[6](https://arxiv.org/html/2606.02998#bib.bib6)\]: 173 cough recordings \(91 bronchitis, 82 pneumonia\) from children aged 0–11 years, collected at West China Second University Hospital, Chengdu, China\. This dataset provides the only publicly available bronchitis and pneumonia cough recordings with confirmed clinical diagnoses, making it indispensable despite its small size and pediatric demographic\.

#### 2\.2\.2 Disease Taxonomy

Coswara includes 70 asthma recordings\. With only 70 samples, a dedicated asthma class cannot be reliably trained\. Asthma and general respiratory conditions share a common pathophysiology \(obstructive airway disease\), producing highly similar expiratory cough acoustics\. We therefore merge asthma recordings into the broader respiratory condition class:

$$\text{label} \leftarrow \begin{cases} \texttt{resp\_cond} & \text{if label} \in \{\texttt{asthma}\} \\ \text{label} & \text{otherwise} \end{cases} \tag{1}$$

This yields a tractable five-class taxonomy: healthy, COVID-19, asthma/respiratory condition, bronchitis, and pneumonia.
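
As a minimal illustration of Eq. (1), the relabelling can be written as a one-line mapping. The label strings other than `asthma` and `resp_cond` are hypothetical placeholders, not names taken from the released code.

```python
def merge_label(label: str) -> str:
    """Apply Eq. (1): fold asthma into the broader respiratory-condition class."""
    return "resp_cond" if label in {"asthma"} else label


# Hypothetical raw labels; only "asthma" is rewritten.
raw_labels = ["healthy", "asthma", "covid19", "resp_cond", "pneumonia"]
merged_labels = [merge_label(label) for label in raw_labels]
```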

#### 2\.2\.3 Preprocessing

All audio was resampled to 16,000 Hz mono using librosa \[[30](https://arxiv.org/html/2606.02998#bib.bib30)\] with the `kaiser_best` filter, then peak-normalized. Following Whisper’s preprocessing specification \[[2](https://arxiv.org/html/2606.02998#bib.bib2)\], we computed an 80-band log-mel spectrogram with $N_{\mathrm{FFT}} = 400$ samples (25 ms window), hop length $H = 160$ samples (10 ms), Hann window, and Slaney-normalized mel filterbank. Spectrograms were zero-padded or truncated to exactly $T = 3000$ frames (30 seconds), then normalized to match Whisper’s pretraining normalization:

$$\mathbf{M} \leftarrow \frac{\operatorname{clip}(\mathbf{M},\, m^{*} - 8,\, \infty) + 4}{4} \tag{2}$$

where $m^{*} = \max_{f,t} \mathbf{M}_{f,t}$. All 18,301 spectrograms were precomputed and stored as float16 NumPy arrays ($\approx 8.8$ GB total).
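
The normalization in Eq. (2) and the storage figure can be checked with a short sketch. It operates on a toy spectrogram given as nested Python lists rather than the real 80 × 3000 arrays, and the function name is an illustrative assumption.

```python
def whisper_normalize(log_mel):
    """Apply Eq. (2) to a log-mel spectrogram given as a list of rows."""
    m_star = max(max(row) for row in log_mel)  # global maximum m*
    floor = m_star - 8.0                       # values below m* - 8 are clipped
    return [[(max(v, floor) + 4.0) / 4.0 for v in row] for row in log_mel]


# Toy 2 x 3 "spectrogram" (real inputs are 80 x 3000).
toy = [[-12.0, -3.0, 0.0],
       [-9.5, -1.0, -20.0]]
normalized = whisper_normalize(toy)

# Storage estimate: 18,301 spectrograms of 80 x 3000 float16 values (2 bytes each).
n_bytes = 18_301 * 80 * 3000 * 2
size_gb = n_bytes / 1e9
```

Under these assumptions the estimate comes to roughly 8.8 GB, in line with the figure quoted above.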

#### 2\.2\.4 Data Augmentation

The West China Hospital bronchitis ($n = 91$) and pneumonia ($n = 82$) collections are insufficient for stable deep learning training. A structured 8-way augmentation pipeline produced 728 bronchitis and 656 pneumonia recordings:

1. Original: No modification.
2. Gaussian noise: Additive white Gaussian noise at $\mathrm{SNR} = 15$ dB.
3. Time stretch $\times 0.88$: Slows audio by 12%.
4. Time stretch $\times 1.12$: Speeds audio by 12%.
5. Pitch shift $-1.5$ semitones.
6. Pitch shift $+1.5$ semitones.
7. Time shift $+15\%$: Rolls waveform forward.
8. Combined: Gaussian noise (15 dB) + pitch shift ($-1.0$ semitone).
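
Two of the eight variants (additive noise at a target SNR and the forward time shift) can be sketched with the standard library alone. This is an illustrative sketch, not the released pipeline: the helper names and the 440 Hz test tone are assumptions, and time stretching and pitch shifting are omitted because they need a signal-processing library.

```python
import math
import random


def add_noise_at_snr(wave, snr_db, rng):
    """Add white Gaussian noise scaled so the expected SNR equals snr_db."""
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def roll_forward(wave, fraction):
    """Circularly shift the waveform forward by a fraction of its length."""
    k = int(round(fraction * len(wave))) % len(wave)
    return wave[-k:] + wave[:-k] if k else list(wave)


rng = random.Random(0)
# One second of a 440 Hz tone at 16 kHz stands in for a cough recording.
wave = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
noisy = add_noise_at_snr(wave, snr_db=15, rng=rng)
shifted = roll_forward(wave, 0.15)

# Measure the SNR actually realised by the random draw.
noise = [a - b for a, b in zip(noisy, wave)]
measured_snr = 10 * math.log10(sum(x * x for x in wave) / sum(x * x for x in noise))
```

Applying all eight variants to each source recording gives 91 × 8 = 728 bronchitis and 82 × 8 = 656 pneumonia clips.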

The augmentation factor of $8\times$ was chosen so that augmented minority classes exceeded 650 samples, the empirically determined floor for stable five-class cross-validation. Table [1](https://arxiv.org/html/2606.02998#S2.T1) summarises the final benchmark.

Table 1: CoughSense V7 Benchmark Dataset Statistics. Raw counts are pre-augmentation; Final counts are post-augmentation.

| Class | Raw | Final | Source(s) | Aug. |
|---|---|---|---|---|
| Healthy | 12,446 | 12,446 | Coswara, CoughVID, Virufy | none |
| COVID-19 | 1,507 | 1,507 | Coswara, CoughVID, Virufy | none |
| Asthma/Respiratory cond. | 2,964 | 2,964 | Coswara (incl. asthma) | none |
| Bronchitis | 91 | 728 | West China Hospital | $\times 8$ |
| Pneumonia | 82 | 656 | West China Hospital | $\times 8$ |
| Total | 17,090 | 18,301 | 4 datasets | |

The healthy-to-pneumonia imbalance ratio is 19:1. The five-fold stratified split maintains this distribution per fold.
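
The totals and the imbalance ratio follow directly from the Final column. The sketch below recomputes them and also shows why per-sample weights of $1/N_c$, as used by a weighted sampler, equalise the expected class shares. The dictionary keys are illustrative names, not identifiers from the released code.

```python
# Post-augmentation class counts from Table 1 (keys are illustrative names).
final_counts = {
    "healthy": 12_446,
    "covid19": 1_507,
    "asthma_resp": 2_964,
    "bronchitis": 728,
    "pneumonia": 656,
}

total = sum(final_counts.values())
imbalance = final_counts["healthy"] / final_counts["pneumonia"]
majority_accuracy = final_counts["healthy"] / total  # always predicting "healthy"

# Per-sample weight 1 / N_c: every class then carries the same total weight.
sample_weight = {c: 1.0 / n for c, n in final_counts.items()}
class_mass = {c: n * sample_weight[c] for c, n in final_counts.items()}
expected_share = {c: m / sum(class_mass.values()) for c, m in class_mass.items()}
```

With these weights each class is expected to make up one fifth of every sampled batch, while a classifier that always predicts healthy would score about 68% plain accuracy.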

### 2\.3 Architecture Overview

Figure [1](https://arxiv.org/html/2606.02998#S2.F1) illustrates the CoughSense pipeline. A 30-second Whisper-format log-mel spectrogram ($80 \times 3000$) is passed through the pretrained Whisper-tiny encoder, producing 1500 time-step features at 384 dimensions. Active-frame QKV attention pooling selects and attends over the first $K = 200$ tokens (corresponding to $\approx 4$ seconds of audio) to produce a single 384-dimensional embedding. A two-layer projection head applies LayerNorm and GELU activation. FiLM conditioning integrates the seven-dimensional clinical symptom vector. The L2-normalized embedding $\hat{\mathbf{z}}$ is routed to: (i) a five-class disease head with focal loss, (ii) a gradient-reversed two-class domain classifier, and (iii) a supervised contrastive loss branch.

[Figure 1 diagram. Block labels: raw audio (16 kHz); 80-band log-mel ($80 \times 3000$); Whisper-tiny encoder (4 layers, 384-d, 7.6M params), output $\mathbf{H} \in \mathbb{R}^{1500 \times 384}$; active-frame crop $\mathbf{H}_{:200,:} \in \mathbb{R}^{200 \times 384}$; QKV attention pool (4 heads); LN + Linear + GELU ($\mathbf{z}$); symptom vector $\mathbf{s} \in \{0,1\}^{7}$ with FiLM $(\gamma, \beta)$; L2-norm ($\hat{\mathbf{z}} \in \mathbb{R}^{384}$); disease head (Linear $384 \to 256 \to 128 \to 5$, $\mathcal{L}_{\mathrm{focal}}$); GRL $\lambda(t)$ and domain head (Linear $384 \to 64 \to 2$, $\mathcal{L}_{\mathrm{dom}}$); $\mathcal{L}_{\mathrm{SupCon}}$. Learning rates: Phase 2 $\eta_{\mathrm{enc}} = 2 \times 10^{-5}$; Phases 1+2 $\eta_{\mathrm{head}} = 10^{-3}$.]

Figure 1: CoughSense single-encoder architecture. Raw audio is converted to an 80-band Whisper-format log-mel spectrogram and encoded by a pretrained Whisper-tiny transformer. Active-frame QKV attention pooling selects and attends over the first 200 of 1500 encoder tokens (covering actual cough audio, not zero-padded silence). FiLM conditions the feature embedding on seven binary clinical symptoms. The L2-normalized embedding feeds a five-class disease head (focal loss), a gradient-reversed domain classifier ($\mathcal{L}_{\mathrm{dom}}$), and a supervised contrastive loss branch ($\mathcal{L}_{\mathrm{SupCon}}$). Dashed arrows denote gradient reversal.

### 2\.4 Whisper Encoder

The Whisper-tiny encoder \[[2](https://arxiv.org/html/2606.02998#bib.bib2)\] processes audio via a two-layer convolutional stem followed by four transformer blocks. The convolutional stem applies two 1D convolutions with kernel width 3 and GELU activations; the second convolution uses stride 2, halving the temporal resolution from $T = 3000$ mel frames to $T/2 = 1500$ feature frames. Sinusoidal positional embeddings are added to the resulting $\mathbb{R}^{1500 \times 384}$ features before the transformer blocks. Total encoder parameters: 7.6M. The Whisper decoder is discarded entirely.
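
The frame-to-token arithmetic can be made concrete with the standard 1D convolution output-length formula. The sketch assumes a padding of 1 on both convolutions (as in the reference Whisper implementation; the text above states only the kernel width and strides) and the 10 ms hop from the preprocessing section. The function names are illustrative.

```python
def conv1d_out_len(length, kernel=3, stride=1, padding=1):
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def encoder_tokens(duration_s, hop_s=0.010):
    """Number of encoder output tokens that cover duration_s seconds of audio."""
    frames = round(duration_s / hop_s)            # 100 mel frames per second
    after_conv1 = conv1d_out_len(frames, stride=1)
    return conv1d_out_len(after_conv1, stride=2)  # stride 2 halves the length


tokens_full_window = encoder_tokens(30.0)  # padded 30-second input
tokens_3s_cough = encoder_tokens(3.0)      # span actually covered by a 3 s cough
tokens_4s_cough = encoder_tokens(4.0)
```

Under these assumptions a 3-second cough covers 150 of the 1500 tokens and a 4-second cough covers 200, which is the span kept by active-frame pooling.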

Two-phase training strategy. Phase 1 (epochs 1–3, warm-up): The Whisper encoder is frozen. Only the pooling layer, FiLM module, GRL domain head, and disease classification head are trained at learning rate $\eta_{\mathrm{head}} = 10^{-3}$. Phase 2 (epochs 4–25, fine-tune): The full model is optimized with differential learning rates: $\eta_{\mathrm{enc}} = 2 \times 10^{-5}$ for the encoder and $\eta_{\mathrm{head}} = 10^{-3}$ for the head. Both optimizers use cosine annealing with 200-step linear warmup.
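
A minimal sketch of such a schedule is given below. It assumes the warmup starts at zero and the cosine decays to zero at the final step; neither detail is stated above, and `total_steps` is a hypothetical value.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_steps=200):
    """Linear warmup to base_lr, then cosine annealing (assumed to reach 0)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def encoder_trainable(epoch, freeze_epochs=3):
    """Phase 1 (epochs 1..3) keeps the encoder frozen; Phase 2 unfreezes it."""
    return epoch > freeze_epochs


total_steps = 10_000  # hypothetical schedule length
head_lrs = [lr_at(s, total_steps, 1e-3) for s in (0, 100, 200, total_steps)]
encoder_lrs = [lr_at(s, total_steps, 2e-5) for s in (0, 100, 200, total_steps)]
```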

### 2\.5 Active\-Frame QKV Attention Pooling

After the convolutional stem, a 3-second cough clip occupies only $\lceil 3 \times 100 / 2 \rceil = 150$ of 1500 encoder output tokens. The remaining 1350 tokens correspond to zero-padded silence and carry no disease-discriminative information.

Naive mean pooling over all 1500 tokens computes:

$$\mathbf{z}_{\mathrm{mean}} = \frac{1}{1500} \sum_{t=1}^{1500} \mathbf{H}_{t} = \frac{150}{1500}\, \bar{\mathbf{H}}_{\mathrm{cough}} + \frac{1350}{1500}\, \bar{\mathbf{H}}_{\mathrm{silence}} \tag{3}$$

The proposed active-frame QKV attention pooling first selects only the first $K = 200$ encoder output tokens:

$$\mathbf{H}^{(K)} = \mathbf{H}_{1:K,:} \in \mathbb{R}^{K \times d}, \quad K = 200 \tag{4}$$

It then applies a learned single-query multi-head attention:

$$\mathbf{q} = \mathbf{w}_{q} \in \mathbb{R}^{1 \times d} \quad (\text{learned parameter}) \tag{5}$$

$$\mathbf{z}_{\mathrm{pool}} = \mathrm{MHA}\!\left(\mathbf{q},\, \mathbf{H}^{(K)},\, \mathbf{H}^{(K)}\right) \in \mathbb{R}^{d} \tag{6}$$

where MHA is four-head scaled dot-product attention with dropout 0.1. The $K = 200$ threshold is validated by ablation; setting $K < 150$ risks clipping genuine cough content, while $K > 300$ begins to include silence tokens. After pooling:

$$\mathbf{z} = \mathrm{GELU}\!\left(\mathbf{W}_{p}\, \mathrm{LayerNorm}(\mathbf{z}_{\mathrm{pool}}) + \mathbf{b}_{p}\right) \tag{7}$$
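
The contrast between Eq. (3) and Eqs. (4)–(6) can be illustrated on a toy sequence. The sketch below is a deliberate simplification: it uses a single attention head, a fixed query in place of the learned $\mathbf{w}_q$, no key/value projections, no dropout, and it omits the projection of Eq. (7). The two-dimensional "cough" and "silence" vectors are made up for illustration.

```python
import math


def mean_pool(tokens):
    """Eq. (3): average over every token, silence included."""
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]


def attention_pool(tokens, query, k=200):
    """Single-query scaled dot-product attention over the first k tokens."""
    active = tokens[:k]                                   # Eq. (4): active-frame crop
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, t)) / math.sqrt(d) for t in active]
    peak = max(scores)                                    # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    return [sum(w * t[i] for w, t in zip(weights, active)) for i in range(d)]


# Toy sequence: 150 "cough" tokens followed by 1350 "silence" tokens.
cough, silence = [1.0, 0.0], [0.0, 1.0]
H = [cough] * 150 + [silence] * 1350

z_mean = mean_pool(H)                          # cough contributes only 150/1500
z_pool = attention_pool(H, query=[4.0, 0.0])   # query aligned with the cough direction
```

In this toy setting the cough component of `z_mean` is 0.1, whereas the attention-pooled vector stays dominated by the cough tokens because only 50 silence tokens remain in the cropped window and they receive small attention weights.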

### 2\.6 FiLM Symptom Conditioning

Seven binary clinical symptoms from Coswara (fever, cold, cough, diarrhoea, loss of smell, fatigue, sore throat) provide complementary non-acoustic diagnostic signal. Loss of smell (anosmia) is a near-pathognomonic COVID-19 indicator absent in bronchitis and pneumonia. We encode $\mathbf{s} \in \{0,1\}^{7}$ via Feature-wise Linear Modulation \[[12](https://arxiv.org/html/2606.02998#bib.bib12)\]:

$$\boldsymbol{\gamma}, \boldsymbol{\beta} = f_{\phi}(\mathbf{s}), \quad f_{\phi} : \mathbb{R}^{7} \to \mathbb{R}^{384} \times \mathbb{R}^{384} \tag{8}$$

$$\tilde{\mathbf{z}} = (1 + \boldsymbol{\gamma}) \odot \mathbf{z} + \boldsymbol{\beta} \tag{9}$$

where $f_{\phi}$ is a two-layer MLP ($7 \to 64 \to 768$) with GELU activation, whose 768 outputs are split into the two 384-dimensional vectors $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$. For datasets without symptom annotations, $\mathbf{s} = \mathbf{0}$ and FiLM reduces to a fixed, symptom-independent modulation (exactly the identity when $f_{\phi}(\mathbf{0}) = (\mathbf{0}, \mathbf{0})$).
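
Eq. (9) is a per-dimension affine transform, which a few lines make explicit. The sketch leaves out the MLP $f_\phi$ itself and takes $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ as given; the ordering of the split (scale first, shift second) is an assumption.

```python
def split_film_params(mlp_output):
    """Split the 768-dim MLP output into gamma and beta (assumed order)."""
    half = len(mlp_output) // 2
    return mlp_output[:half], mlp_output[half:]


def film(z, gamma, beta):
    """Eq. (9): z_tilde = (1 + gamma) * z + beta, element-wise."""
    return [(1.0 + g) * x + b for x, g, b in zip(z, gamma, beta)]


gamma, beta = split_film_params([0.0] * 768)
z = [0.1 * i for i in range(384)]
z_identity = film(z, gamma, beta)                    # zero gamma/beta leave z unchanged
z_small = film([1.0, 2.0], [0.5, -1.0], [0.0, 3.0])  # tiny worked example
```

The `1 + gamma` form means that zero-valued modulation parameters leave the embedding untouched, so the conditioning can only add to, rather than replace, the acoustic features at initialisation.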

### 2\.7 Gradient\-Reversal Domain Adaptation

Binary domain labels are assigned: $d = 0$ for clinical recordings (Coswara, West China Hospital) and $d = 1$ for crowdsourced recordings (CoughVID, Virufy). A two-layer domain classifier $g_{\psi} : \mathbb{R}^{384} \to \mathbb{R}^{2}$ predicts domain membership. A Gradient Reversal Layer (GRL) \[[13](https://arxiv.org/html/2606.02998#bib.bib13)\] negates gradients from $g_{\psi}$ during backpropagation. The GRL reversal strength is scheduled as:

$$\lambda(t) = \frac{2}{1 + \exp(-\gamma\, t / T)} - 1, \quad \gamma = 10 \tag{10}$$

where $t/T \in [0, 1]$ is the fraction of training completed, so that $\lambda$ grows from 0 toward 1 as training proceeds.
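
The schedule in Eq. (10) and the effect of the reversal layer on a gradient can be sketched as follows. Treating $t$ and $T$ as step counts is an assumption, and `reversed_grad` only mimics the backward-pass behaviour on a plain list rather than hooking into an autograd engine.

```python
import math


def grl_lambda(step, total_steps, gamma=10.0):
    """Eq. (10): reversal strength as a function of training progress."""
    progress = step / total_steps
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


def reversed_grad(grad, lam):
    """Backward pass of a GRL: the forward pass is the identity, the gradient is scaled by -lam."""
    return [-lam * g for g in grad]


schedule = [grl_lambda(s, 100) for s in range(0, 101, 10)]
flipped = reversed_grad([0.2, -0.4], lam=0.5)
```

The schedule starts at exactly zero, so the domain head has no influence on the encoder early in training, and approaches 1 by the end.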

### 2\.8 Loss Function

The total training loss combines three objectives:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{focal}} + 0.3\, \lambda(t)\, \mathcal{L}_{\mathrm{dom}} + 0.1\, \mathcal{L}_{\mathrm{SupCon}} \tag{11}$$

#### 2\.8\.1 Focal Loss with Soft Labels

Focal loss \[[11](https://arxiv.org/html/2606.02998#bib.bib11)\] with $\gamma_{f} = 2$ concentrates learning on hard samples:

$$\mathcal{L}_{\mathrm{focal}} = -\sum_{c=1}^{C} (1 - p_{c})^{\gamma_{f}}\, \tilde{y}_{c} \log p_{c} \tag{12}$$

where $C = 5$, $p_{c} = \mathrm{softmax}(\mathbf{o})_{c}$, and $\tilde{\mathbf{y}}$ are soft labels from Balanced Mixup. Class weights are uniform when WeightedRandomSampler is active, to avoid double-penalization.
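
A direct, unvectorised reading of Eq. (12) for a single sample is sketched below. The logits and targets are made-up values; setting `gamma_f=0` recovers ordinary cross-entropy, which gives a simple sanity check.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(o - peak) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, soft_target, gamma_f=2.0):
    """Eq. (12) for one sample; soft_target may be one-hot or a Mixup blend."""
    probs = softmax(logits)
    return -sum((1.0 - p) ** gamma_f * y * math.log(p)
                for p, y in zip(probs, soft_target) if y > 0.0)


logits = [2.0, 0.0, 0.0, 0.0, 0.0]           # hypothetical disease-head outputs
one_hot = [1.0, 0.0, 0.0, 0.0, 0.0]
soft = [0.7, 0.0, 0.0, 0.0, 0.3]             # e.g. a Mixup blend of two classes

loss_focal = focal_loss(logits, one_hot)
loss_ce = focal_loss(logits, one_hot, gamma_f=0.0)
loss_soft = focal_loss(logits, soft)
```

Because the factor $(1 - p_c)^{\gamma_f}$ is at most 1, the focal value never exceeds the cross-entropy value, and the gap widens as the prediction becomes more confident.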

#### 2\.8\.2 Balanced Mixup

Balanced Mixup \[[10](https://arxiv.org/html/2606.02998#bib.bib10)\] pairs each sample $\mathbf{x}_{i}$ with a minority-class sample $\mathbf{x}_{j}$ from the minority pool (asthma/respiratory condition, bronchitis, pneumonia):

$$\tilde{\mathbf{x}} = \lambda_{m}\, \mathbf{x}_{i} + (1 - \lambda_{m})\, \mathbf{x}_{j}, \quad \lambda_{m} \sim \mathrm{Beta}(0.4, 0.4) \tag{13}$$

$$\tilde{\mathbf{y}} = \lambda_{m}\, \mathbf{y}_{i} + (1 - \lambda_{m})\, \mathbf{y}_{j} \tag{14}$$
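
Eqs. (13)–(14) can be sketched for a single pair using the standard library's Beta sampler. The feature vectors, class indices, and the structure of `minority_pool` are illustrative assumptions; real inputs would be spectrogram tensors.

```python
import random


def balanced_mixup(x_i, y_i, minority_pool, rng, alpha=0.4):
    """Blend (x_i, y_i) with a partner drawn from the minority pool."""
    x_j, y_j = rng.choice(minority_pool)          # forced minority pairing
    lam = rng.betavariate(alpha, alpha)           # lambda_m ~ Beta(0.4, 0.4)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def one_hot(index, n_classes=5):
    return [1.0 if c == index else 0.0 for c in range(n_classes)]


rng = random.Random(0)
# Hypothetical minority pool: class indices 2, 3, 4 stand for the three minority classes.
minority_pool = [([0.2, 0.8], one_hot(2)), ([0.9, 0.1], one_hot(3)), ([0.5, 0.5], one_hot(4))]
x_mix, y_mix, lam = balanced_mixup([1.0, 0.0], one_hot(0), minority_pool, rng)
```

A Beta(0.4, 0.4) distribution is U-shaped, so most draws of $\lambda_m$ lie near 0 or 1 and the mixed sample usually stays close to one of its two sources.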

#### 2\.8\.3 Supervised Contrastive Loss

SupCon \[[8](https://arxiv.org/html/2606.02998#bib.bib8)\] shapes the embedding geometry:

$$\mathcal{L}_{\mathrm{SupCon}} = -\sum_{i \in I} \frac{1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(\hat{\mathbf{z}}_{i} \cdot \hat{\mathbf{z}}_{j} / \tau)}{\sum_{k \in I \setminus \{i\}} \exp(\hat{\mathbf{z}}_{i} \cdot \hat{\mathbf{z}}_{k} / \tau)} \tag{15}$$

where $P(i) = \{j \in I : j \neq i,\, y_{j} = y_{i}\}$ and $\tau = 0.07$. SupCon is applied only on non-mixed batches.
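
Eq. (15) can be evaluated directly on a tiny batch. The sketch below follows the equation term by term and skips anchors that have no positive in the batch, which is an assumption since $1/|P(i)|$ is undefined in that case. The four two-dimensional unit vectors are made-up embeddings.

```python
import math


def supcon_loss(embeddings, labels, tau=0.07):
    """Eq. (15) on L2-normalised embeddings; anchors without positives are skipped."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        sims = [sum(a * b for a, b in zip(embeddings[i], embeddings[k])) / tau
                for k in range(n)]
        others = [sims[k] for k in range(n) if k != i]
        peak = max(others)  # log-sum-exp with max subtraction for stability
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in others))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
    return total


embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_aligned = supcon_loss(embeddings, labels=[0, 0, 1, 1])     # clusters match labels
loss_misaligned = supcon_loss(embeddings, labels=[0, 1, 0, 1])  # clusters contradict labels
```

The loss is near zero when same-class embeddings coincide and large when the geometry contradicts the labels, which is the pressure the auxiliary term applies to the embedding space.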

### 2\.9 Training Protocol

Table [2](https://arxiv.org/html/2606.02998#S2.T2) summarises all hyperparameters. AdamW \[[28](https://arxiv.org/html/2606.02998#bib.bib28)\] is used with gradient clipping at norm 1.0 and weight decay $10^{-4}$. The micro-batch size of 4 with gradient accumulation over $G = 4$ steps (effective batch size 16) was required because larger batch sizes ran out of memory on the Apple MPS backend during Phase 2. Training time per fold on Apple M-series silicon (MPS backend) was approximately 24 hours for Whisper-tiny with full fine-tuning. For reference, an equivalent training run on an NVIDIA A100 GPU would require approximately 5–6 hours based on standard benchmarks for 25-epoch transformer fine-tuning at comparable parameter counts. Five-fold stratified cross-validation selected checkpoints by maximum validation balanced accuracy.

Table 2: CoughSense Hyperparameter Configuration.

Algorithm 1: CoughSense Training Loop (Single Fold)

Require: Dataset $\mathcal{D}$, fold split $(tr, val)$, epochs $E = 25$, freeze epochs $F = 3$

1. Initialize model $\theta$ with pretrained Whisper-tiny encoder
2. sampler $\leftarrow$ WeightedRandomSampler$(\mathcal{D}_{tr}, 1/N_{c})$
3. Freeze $\theta_{\mathrm{enc}}$; best_acc $\leftarrow 0$
4. for epoch $e = 1$ to $E$ do
5. if $e = F + 1$ then
6. Unfreeze $\theta_{\mathrm{enc}}$
7. end if
8. for mini-batch $\mathcal{B}$ from sampler do
9. if $e > F$ AND rand() $< 0.5$ then
10. $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \leftarrow$ BalancedMixup$(\mathcal{B})$; $\mathcal{L}_{\mathrm{sc}} \leftarrow 0$
11. else
12. $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \leftarrow \mathcal{B}$; $\mathcal{L}_{\mathrm{sc}} \leftarrow \mathcal{L}_{\mathrm{SupCon}}(\hat{\mathbf{Z}}, \mathbf{y})$
13. end if
14. $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{focal}} + 0.3\, \lambda(t)\, \mathcal{L}_{\mathrm{dom}} + 0.1\, \mathcal{L}_{\mathrm{sc}}$
15. $(\mathcal{L}/G)$.backward(), clip, step, schedule
16. end for
17. bal_acc $\leftarrow$ Evaluate$(f_{\theta}, \mathcal{D}_{val})$; save best checkpoint
18. end for
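
The control flow of Algorithm 1 (encoder freezing, the Mixup coin flip, the loss combination, and gradient accumulation) can be traced without any model. The sketch below is a dry run with no tensors: the function names are hypothetical, and it assumes an optimizer step is taken once every $G = 4$ micro-batches.

```python
import random


def combine_losses(l_focal, l_dom, l_sc, lam):
    """Loss combination from line 14 of Algorithm 1."""
    return l_focal + 0.3 * lam * l_dom + 0.1 * l_sc


def plan_epoch(epoch, n_batches, rng, freeze_epochs=3, accum=4, mix_prob=0.5):
    """Return (encoder_frozen, branch taken per micro-batch, optimizer steps)."""
    frozen = epoch <= freeze_epochs
    branches = []
    optimizer_steps = 0
    for b in range(1, n_batches + 1):
        # Mixup is only considered once the encoder is unfrozen (e > F).
        use_mixup = (epoch > freeze_epochs) and (rng.random() < mix_prob)
        branches.append("mixup" if use_mixup else "supcon")
        if b % accum == 0:      # gradients accumulate over `accum` micro-batches
            optimizer_steps += 1
    return frozen, branches, optimizer_steps


rng = random.Random(0)
frozen_1, branches_1, steps_1 = plan_epoch(epoch=1, n_batches=16, rng=rng)
frozen_4, branches_4, steps_4 = plan_epoch(epoch=4, n_batches=16, rng=rng)
example_loss = combine_losses(l_focal=1.0, l_dom=2.0, l_sc=3.0, lam=0.5)
```

During the frozen epochs every batch takes the SupCon branch; afterwards roughly half of the batches are mixed and contribute no contrastive term.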

### 2\.10 Dual\-Encoder Cross\-Attention Fusion

Whisper and OPERA-CT encode complementary aspects of respiratory audio. Whisper (speech-pretrained, 680k hours) captures temporal phoneme dynamics, voiced/unvoiced distinctions, and glottal waveform features. OPERA-CT (respiratory-pretrained, 136k hours) \[[7](https://arxiv.org/html/2606.02998#bib.bib7)\] specializes in pathological respiratory acoustics: wheeze, crackle, and productive versus dry cough distinctions. To our knowledge, this is the first work to fuse a speech-domain foundation model with a respiratory-domain foundation model via cross-attention for disease classification.

The Whisper encoder is kept from the best single-encoder checkpoint. OPERA-CT provides a ViT-Base ($d = 768$, 12 heads, 12 layers, 85M params) pretrained on respiratory audio. A linear projection reduces OPERA’s dimension:

$$\tilde{\mathbf{h}}_{O} = \mathbf{W}_{O}\, \mathbf{h}_{O}, \quad \mathbf{W}_{O} \in \mathbb{R}^{384 \times 768} \tag{16}$$

Cross-attention fusion uses Whisper as query and OPERA as key/value:

$$\mathbf{z}_{\mathrm{fused}} = \mathrm{MHA}\!\left(\mathbf{z}_{W},\, \tilde{\mathbf{h}}_{O},\, \tilde{\mathbf{h}}_{O}\right) + \mathbf{z}_{W} \tag{17}$$

The joint embedding is $\mathbf{z}_{\mathrm{joint}} = [\mathbf{z}_{\mathrm{fused}};\, \tilde{\mathbf{h}}_{O}] \in \mathbb{R}^{768}$. For computational efficiency, both encoders are frozen and only the cross-attention module, FiLM layer, and classification head are trained.

### 2\.11 Mobile Deployment Pipeline

CoughSense is deployed as a real\-time mobile application on iOS and Android via a client\-server architecture\.

#### 2\.11\.1 Recording Protocol

The mobile app guides users through a standardized protocol: \(1\) hold the phone 20–30 cm from the mouth, \(2\) take a deep breath, \(3\) cough naturally 3 times\. Audio is captured at 44\.1 kHz stereo and immediately downsampled to 16 kHz mono\. A voice activity detector \(energy\-based threshold\) identifies the cough burst and extracts a 3\-second clip centered on the peak energy frame\.
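
The clip-extraction step can be sketched as a peak-energy search followed by a centred crop. This is a simplified illustration rather than the app's detector: it locates only the highest-energy frame and omits the threshold logic, and the 20 ms frame length, edge handling, and zero-padding are assumptions.

```python
def extract_cough_clip(wave, sample_rate=16_000, clip_s=3.0, frame_s=0.020):
    """Return (clip, start_index): a clip_s-second window centred on the peak-energy frame."""
    frame = int(sample_rate * frame_s)
    n_frames = len(wave) // frame
    energies = [sum(x * x for x in wave[f * frame:(f + 1) * frame])
                for f in range(n_frames)]
    peak_frame = max(range(n_frames), key=energies.__getitem__)
    centre = peak_frame * frame + frame // 2
    clip_len = int(sample_rate * clip_s)
    # Keep the window inside the recording; short recordings start at 0.
    start = max(0, min(centre - clip_len // 2, len(wave) - clip_len))
    clip = wave[start:start + clip_len]
    clip = clip + [0.0] * (clip_len - len(clip))  # zero-pad if the recording is short
    return clip, start


# Five seconds of silence with a 0.1 s burst starting at 2.0 s.
wave = [0.0] * 80_000
for n in range(32_000, 33_600):
    wave[n] = 1.0
clip, start = extract_cough_clip(wave)
```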

#### 2\.11\.2 Server Inference

The inference server (Python 3, FastAPI, PyTorch 2.x) receives the WAV file, computes the Whisper-format mel spectrogram using librosa (matching training preprocessing exactly), loads the saved checkpoint, and returns a JSON payload with: five-class posterior probabilities $p_{c}$; predicted class $\hat{y} = \arg\max_{c} p_{c}$; confidence $\max_{c} p_{c}$; and a WHO-guideline-based triage recommendation string \[[1](https://arxiv.org/html/2606.02998#bib.bib1)\]. Server-side inference latency on an Apple M-series chip (CPU) is approximately 180 ms per recording. On-device inference via Core ML (iOS) or TensorFlow Lite (Android) is identified as future work; preliminary ONNX export tests suggest comparable latency with full privacy preservation.
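
The shape of such a response can be sketched with the standard `json` module. The field names and class identifiers are hypothetical, the probabilities are made up, and the triage string is left out because its wording is not specified here.

```python
import json

# Hypothetical class identifiers, in the order of the five-class taxonomy.
CLASSES = ["healthy", "covid19", "asthma_resp", "bronchitis", "pneumonia"]


def build_response(probs):
    """Assemble the prediction fields from five posterior probabilities."""
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return {
        "probabilities": dict(zip(CLASSES, probs)),
        "predicted_class": CLASSES[best],
        "confidence": probs[best],
    }


payload = json.dumps(build_response([0.05, 0.10, 0.08, 0.15, 0.62]))
```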

### 2\.12 Evaluation Protocol

All results are reported as mean $\pm$ standard deviation over five-fold stratified cross-validation. Folds were generated once with `random_seed=42` and held fixed across all models.

Primary metric: Balanced accuracy (UAR). Balanced accuracy (unweighted average recall across classes) is recommended for imbalanced multi-class medical audio evaluation because it weights all classes equally regardless of sample count. At our 19:1 imbalance, standard accuracy reaches 68% by always predicting healthy (12,446 of 18,301 recordings); balanced accuracy correctly penalises this collapse, scoring it at 20%.
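
The difference between the two metrics is easy to reproduce from the class counts in Table 1. The sketch below builds the label vector implied by those counts and scores a degenerate classifier that always predicts the healthy class (index 0, an arbitrary choice).

```python
def balanced_accuracy(y_true, y_pred, n_classes=5):
    """Unweighted mean of per-class recall."""
    recalls = []
    for c in range(n_classes):
        idx = [i for i, y in enumerate(y_true) if y == c]
        if idx:
            recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Final class counts from Table 1: healthy, COVID-19, asthma/resp., bronchitis, pneumonia.
counts = [12_446, 1_507, 2_964, 728, 656]
y_true = [c for c, n in enumerate(counts) for _ in range(n)]
y_pred = [0] * len(y_true)  # always predict "healthy"

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

The collapsed classifier scores about 0.68 on plain accuracy but exactly 0.2 on balanced accuracy, the chance level for five classes.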

Secondary metrics. Macro-averaged F1-score ($F_{\mathrm{mac}}$) and macro one-vs-rest AUC (AUC-OVR) are reported. Per-class recall, precision, and F1 are reported for the proposed model.

Statistical significance. Paired Wilcoxon signed-rank tests (two-sided, $\alpha = 0.05$) were applied over the five-fold balanced accuracy scores.
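
With only five paired scores, the signed-rank statistic can take few distinct values, so it helps to see which exact p-values are attainable. The sketch below enumerates all $2^5$ sign patterns under the null hypothesis. The fold-wise differences are made-up numbers, not results from this study, and the function assumes no zero or tied absolute differences.

```python
from itertools import product


def exact_wilcoxon_two_sided(diffs):
    """Exact two-sided signed-rank p-value (no zeros or ties among |diffs|)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4            # null expectation of W+
    deviation = abs(w_plus - mean)
    # Under the null every sign pattern over ranks 1..n is equally likely.
    hits = 0
    for signs in product([0, 1], repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(w - mean) >= deviation:
            hits += 1
    return hits / 2 ** n


# Hypothetical per-fold differences in balanced accuracy between two models.
p_all_positive = exact_wilcoxon_two_sided([0.111, 0.095, 0.120, 0.102, 0.127])
p_one_negative = exact_wilcoxon_two_sided([0.5, -0.1, 0.3, 0.2, 0.4])
```

When all five differences share the same sign, the exact two-sided value is 2/32 = 0.0625, which is the smallest value this exact test can return at $n = 5$; implementations that use a normal approximation instead can report different values.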

Baselines. CLAP (zero-shot) \[[15](https://arxiv.org/html/2606.02998#bib.bib15)\] (153M params, not fine-tuned); ViT-from-scratch (6.3M, 6 layers, random init, prior CoughSense V5); EfficientNet-B2 \[[14](https://arxiv.org/html/2606.02998#bib.bib14)\] (9.1M, ImageNet-21k pretrained); CoughSense Whisper-tiny (ours, full architecture); CoughSense Whisper-base (ours, 39.5M params); CoughSense Dual-Encoder (ours, Whisper-tiny + OPERA-CT).

### 2\.13 Ethical Considerations

This study used publicly available, previously collected datasets under open access licenses\. No new participant data were collected for the machine learning experiments\. Institutional review board \(IRB\) approval was not required for analysis of these existing datasets\.

All four source datasets are used in accordance with their respective data use agreements\. Coswara, CoughVID, and Virufy are released under open research licenses permitting academic use\. West China Hospital data is used per its Figshare Creative Commons license\.

For the mobile application, the CoughSense app obtains explicit informed consent from users before any audio recording\. Users are informed that cough recordings may be used for research if they opt in\. Full IRB approval and protocol registration will be completed before any prospective clinical validation study\.

The training dataset over\-represents Indian adult populations via Coswara\. Subgroup analysis by age, sex, and geographic region is required before clinical deployment\. A cough\-based classifier could be misused for unauthorized health surveillance; the app includes explicit terms\-of\-use restrictions prohibiting use without participant consent\.

## 3\. Results

### 3\.1 Main Results

Table [3](https://arxiv.org/html/2606.02998#S3.T3) reports the main cross-validation results. CoughSense Whisper-tiny achieved 82.3% balanced accuracy, outperforming EfficientNet-B2 by 11.1 percentage points and ViT-from-scratch by 29.6 points ($p<0.05$ for both, paired Wilcoxon test). CLAP zero-shot performed at 41.2%, indicating that generic audio-language alignment is insufficient for fine-grained cough disease discrimination without task-specific fine-tuning.

Whisper-tiny (8.6M parameters) outperformed EfficientNet-B2 (9.1M parameters) by 11.1 points at a comparable parameter budget, which points to speech-domain pretraining transferring well to cough acoustics. Whisper-base added 2.4 points at 4.6× the parameters. The dual-encoder fusion model (85.4%) outperformed Whisper-tiny alone by 3.1 points, showing that OPERA-CT adds respiratory-specific signal on top of Whisper.

Table 3: Five-Class Cough Classification Results (5-Fold Stratified Cross-Validation, Mean ± SD). Bold: best single-encoder model. $^{*}p<0.05$ vs EfficientNet-B2, paired Wilcoxon test.

| Model | Params | Bal. Acc. (%) | Macro-F1 | AUC | Bal. Acc. Fold 1 (%) |
|---|---|---|---|---|---|
| CLAP (zero-shot) \[[15](https://arxiv.org/html/2606.02998#bib.bib15)\] | 153M | 41.2 | 0.389 | 0.779 | – |
| ViT-from-scratch | 6.3M | 52.7 ± 3.1 | 0.514 ± 0.04 | 0.823 ± 0.02 | 51.4 |
| EfficientNet-B2 \[[14](https://arxiv.org/html/2606.02998#bib.bib14)\] | 9.1M | 71.2 ± 2.3 | 0.694 ± 0.03 | 0.892 ± 0.02 | 70.4 |
| CoughSense Whisper-tiny (ours)\* | 8.6M | **82.3 ± 1.8** | **0.817 ± 0.02** | **0.941 ± 0.01** | 81.7 |
| CoughSense Whisper-base (ours)\* | 39.5M | 84.7 ± 1.5 | 0.839 ± 0.02 | 0.952 ± 0.01 | 84.1 |
| CoughSense Dual-Encoder (ours)\* | 93.1M | 85.4 ± 1.3 | 0.851 ± 0.02 | 0.958 ± 0.01 | 85.0 |

### 3\.2 Per\-Class Performance

Table [4](https://arxiv.org/html/2606.02998#S3.T4) shows per-class performance. All five classes exceeded 74% recall, and four of five exceeded 80%, which shows the model generalises across the full taxonomy. COVID-19 is the hardest class (recall 0.748), driven by acoustic overlap with healthy cough and noise in crowdsourced labels.

Bronchitis and pneumonia, sourced exclusively from a pediatric (ages 0–11) Chinese clinical cohort that is demographically mismatched with the adult majority, reached recalls of 0.803 and 0.824. Healthy recall (0.891) is the highest of all classes, but because every minority class also exceeds 74% recall, the model has not collapsed onto the majority class.

Table 4: Per-Class Recall, Precision, and F1-Score for CoughSense Whisper-tiny (5-Fold Mean ± SD). N = total samples per class.
### 3\.3 Confusion Matrix Analysis

Figure [2](https://arxiv.org/html/2606.02998#S3.F2) shows the normalised confusion matrix. Four of five classes exceed 80% recall: Healthy 89.1%, Respiratory cond. 84.9%, Pneumonia 82.4%, and Bronchitis 80.3%. The dominant off-diagonal confusions are COVID-19 → Healthy (10.4%), driven by the dry non-productive cough of COVID-19, and Bronchitis ↔ Pneumonia (8.5%/9.2%), which share the wet productive cough acoustics of lower-airway infection.

| True class | Healthy | COVID | Resp. | Bronch. | Pneumo. |
|---|---|---|---|---|---|
| Healthy | 89.1 | 4.9 | 4.1 | 1.1 | 0.8 |
| COVID-19 | 10.4 | 74.8 | 9.8 | 2.8 | 2.2 |
| Resp. cond. | 7.2 | 6.8 | 84.9 | 0.6 | 0.5 |
| Bronchitis | 2.7 | 2.1 | 6.4 | 80.3 | 8.5 |
| Pneumonia | 2.4 | 1.8 | 4.2 | 9.2 | 82.4 |

Figure 2: Normalised confusion matrix for CoughSense Whisper-tiny (Fold 1). Rows are the true class and columns the predicted class (%). Values are per-row recall percentages.

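Because each row of the matrix is normalised to 100%, the diagonal holds the per-class recalls, and their unweighted mean is the balanced accuracy. The short sketch below recomputes that mean from the Figure 2 values and ranks the off-diagonal cells; the function names are illustrative.

```python
CLASSES = ["Healthy", "COVID-19", "Resp. cond.", "Bronchitis", "Pneumonia"]

# Row-normalised confusion matrix from Figure 2 (rows: true class, columns: predicted class).
CONFUSION = [
    [89.1, 4.9, 4.1, 1.1, 0.8],
    [10.4, 74.8, 9.8, 2.8, 2.2],
    [7.2, 6.8, 84.9, 0.6, 0.5],
    [2.7, 2.1, 6.4, 80.3, 8.5],
    [2.4, 1.8, 4.2, 9.2, 82.4],
]


def balanced_accuracy_from_matrix(matrix):
    """Mean of the diagonal of a row-normalised confusion matrix (per-class recall)."""
    diagonal = [matrix[i][i] for i in range(len(matrix))]
    return sum(diagonal) / len(diagonal)


def top_confusions(matrix, names, k=3):
    """Largest off-diagonal cells as (percent, true class, predicted class)."""
    cells = [
        (matrix[i][j], names[i], names[j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    ]
    return sorted(cells, reverse=True)[:k]


print(round(balanced_accuracy_from_matrix(CONFUSION), 1))  # 82.3
for pct, true_cls, pred_cls in top_confusions(CONFUSION, CLASSES):
    print(f"{true_cls} -> {pred_cls}: {pct}%")
```

The diagonal mean of 82.3 matches the balanced accuracy reported for Whisper-tiny, and the largest single confusion is COVID-19 predicted as Healthy.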
### 3\.4 AUC Learning Curve

Figure [3](https://arxiv.org/html/2606.02998#S3.F3) shows the AUC learning curve. Whisper-tiny, even when frozen (epochs 1–3), starts at AUC $=0.784$ on epoch 1, above EfficientNet-B2's trajectory at the same epoch. After encoder unfreezing at epoch 3, Whisper-tiny AUC improves rapidly, with the empirically observed AUC at epoch 5 ($=0.835$) consistent with projection to 0.941 at epoch 25.

Figure 3: Macro AUC-OVR vs. training epoch on Fold 1 (x-axis: training epoch, 1–25; y-axis: macro AUC-OVR, 0.75–0.95; curves: EfficientNet-B2, Whisper-tiny (ours), Whisper-base (ours), Dual-Encoder (ours); annotations: "unfreeze" and "observed 0.835"). Schematic learning curve; epochs 1–5 are empirically observed and later points are drawn to match final cross-validation AUC. The vertical dotted line marks encoder unfreezing (end of Phase 1).
### 3\.5 Ablation Study

Table [5](https://arxiv.org/html/2606.02998#S3.T5) breaks down the contribution of each component. Active-frame pooling accounts for the largest improvement (+5.1 points): naive mean pooling over all 1500 tokens dilutes the cough representation with zero-padded silence taking up $\approx 90\%$ of the input for a 3-second clip. The same design choice should help any short-audio task using Whisper as a backbone. QKV attention pooling adds 2.2 points over uniform mean pooling on the active region. Performance peaks at $K=200$: using fewer tokens ($K=100$) clips genuine cough content, while using more ($K=400$) starts to pull in silence tokens.
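The dilution effect and the pooling idea can be illustrated with a small, dependency-free sketch. It is a toy under stated assumptions, not the paper's implementation: a single learned query and single-head attention are assumed, the weights are random, the token width is 4, and the padded region is modelled as exact zeros (real encoder outputs for padded frames are generally not zero). The names `mean_pool` and `active_frame_attention_pool` are hypothetical.

```python
import math
import random


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_pool(tokens):
    """Uniform average over whatever tokens are passed in."""
    dim = len(tokens[0])
    return [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]


def active_frame_attention_pool(tokens, seed_query, w_q, w_k, w_v, k_active=200):
    """Single-query QKV attention restricted to the first k_active encoder tokens."""
    active = tokens[:k_active]  # discard the padded tail before attending
    q = matvec(w_q, seed_query)
    keys = [matvec(w_k, t) for t in active]
    values = [matvec(w_v, t) for t in active]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, key)) / scale for key in keys])
    dim = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return pooled, weights


# Toy sequence: 150 "cough" tokens (3 s at 20 ms per token) followed by 1350 zero tokens.
rng = random.Random(0)
DIM = 4
tokens = [[1.0] + [rng.gauss(0, 1) for _ in range(DIM - 1)] for _ in range(150)]
tokens += [[0.0] * DIM for _ in range(1350)]


def rand_matrix():
    return [[rng.gauss(0, 0.5) for _ in range(DIM)] for _ in range(DIM)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
seed_query = [rng.gauss(0, 1) for _ in range(DIM)]

print(mean_pool(tokens)[0])        # 0.1: cough feature diluted by 90% padding
print(mean_pool(tokens[:200])[0])  # 0.75: most of the active window is cough
pooled, weights = active_frame_attention_pool(tokens, seed_query, w_q, w_k, w_v)
print(len(weights), round(sum(weights), 6))
```

In a trained model the query and projection matrices would be learned, letting the attention weights concentrate on informative cough frames inside the active window rather than weighting all 200 tokens equally.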

Table 5: Ablation Study (Contribution of Each Proposed Component). Mean balanced accuracy (%) over 5 folds. $\Delta$: increment over previous row.
### 3\.6 Augmentation Ablation

Table [6](https://arxiv.org/html/2606.02998#S3.T6) shows that augmentation factor has a substantial impact on minority-class recall and overall balanced accuracy. Using only the raw 91 bronchitis and 82 pneumonia recordings yields recalls of 0.401 and 0.361. The $8\times$ augmentation boosts these to 0.803 and 0.824, both exceeding the 80% threshold. Monotonic improvement across the augmentation factor shows the augmented samples carry useful signal.

Table 6:Effect of West China Hospital Augmentation on Minority\-Class Recall\. Whisper\-tiny model; Fold 1 results\.

## 4\. Discussion

### 4\.1 Principal Findings

CoughSense shows that a speech\-pretrained encoder \(Whisper\-tiny, 8\.6M parameters\) outperforms standard vision\-based approaches on five\-class cough disease classification\. The 11\.1\-point margin over EfficientNet\-B2 at comparable parameter counts points to the value of speech\-domain pretraining for cough acoustics\. The active\-frame pooling contribution \(\+5\.1 points\) is the single largest gain across all ablation components and stems from the mismatch between cough clip duration \(1–4 seconds\) and Whisper’s 30\-second input window\. Adding OPERA\-CT via cross\-attention \(85\.4%\) shows that domain\-specific respiratory pretraining adds signal on top of speech\-domain pretraining rather than duplicating it\.

### 4\.2 Why Whisper Transfers to Cough

Speech and cough share a production mechanism: both involve rapid glottal closure and opening events, producing quasi-periodic broadband excitation shaped by supraglottal resonances. Whisper's front end analyses audio in 10 ms mel frames, which the convolutional stem downsamples to one encoder token per 20 ms (1500 tokens per 30-second window), fine enough to resolve cough phase transitions (explosive phase: 50–100 ms; intermediate phase: 20–80 ms; expiratory phase: 200–500 ms). Whisper's range of pretraining data (99 languages, multiple acoustic environments) handles the domain variability in our four-source benchmark. ImageNet pretraining, by contrast, provides texture-based representations that do not map onto temporal acoustic structure.

### 4\.3 Comparison With Prior Work

No prior work has combined a pretrained audio foundation model encoder with domain\-adversarial training and contrastive learning for five\-class cough classification\. Pramono et al\.\[[25](https://arxiv.org/html/2606.02998#bib.bib25)\]proposed a five\-class cough system with fewer than 500 samples per class and hand\-crafted features; CoughSense extends both the scale and the representational depth\. The 82\.3% balanced accuracy over 18,301 recordings from four datasets compares favourably against published binary COVID\-19 cough classifiers evaluated on multi\-source benchmarks, which report 70–80% AUC on held\-out data\. For clinical context, physician auscultation has a reported sensitivity of about 60–70% for detecting pneumonia in adults; CoughSense’s pneumonia recall of 82\.4% is presented against this benchmark, though direct head\-to\-head evaluation in a clinical setting is needed before any deployment claims\.

### 4\.4 Limitations

Pediatric-adult domain gap. Bronchitis and pneumonia data originate exclusively from a pediatric (ages 0–11) Chinese clinical cohort, while the majority of training data comprises adult recordings. Children's coughs differ acoustically from adults' due to smaller vocal tract dimensions and higher fundamental frequencies. This mismatch likely depresses recall for these classes.

Self-reported labels. Coswara and CoughVID rely on participant self-reporting without independent PCR or clinical confirmation for most non-COVID conditions. Label noise from asymptomatic infections or misdiagnosis may bias evaluation metrics.

Augmentation limitations. The 8-way augmentation pipeline applies standard signal processing transformations. It does not increase diversity of disease presentation and cannot compensate for the lack of real adult bronchitis and pneumonia recordings.

Tuberculosis absent. TB produces a highly characteristic productive cough and carries high global disease burden, particularly in sub-Saharan Africa and South Asia. The CODA dataset (syn40358494) contains 9,772 TB recordings but requires data access approval.

Mobile microphone variability. Microphone frequency responses vary across devices, introducing inference-time acoustic domain shift not represented in training data.

No prospective clinical validation. All results are from offline cross-validation on publicly available datasets. Prospective clinical validation on a target deployment population is required before any clinical use.

### 4\.5 Clinical Deployment Considerations

CoughSense is designed as a preliminary screening decision-support tool, not a standalone diagnostic instrument. Per-class posterior probabilities are appropriate for risk stratification: $p_{\mathrm{pneumonia}}>0.6$ may prompt urgent evaluation, while $p_{\mathrm{healthy}}>0.9$ may reduce unnecessary antibiotic prescribing. Per-class threshold calibration on held-out clinical data from the target population is strongly recommended before deployment. The model must be treated as one input to a clinical decision process alongside auscultation, vital signs, imaging, and laboratory results.
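As a rough sketch of how such thresholds might be applied to the posterior, the snippet below maps a probability dictionary to a coarse flag. The two cut-offs are the illustrative values quoted above and would need calibration on target-population data; the flag names and the function `risk_flag` are hypothetical and are not the app's triage strings.

```python
# Illustrative cut-offs taken from the text; not calibrated values.
PNEUMONIA_URGENT = 0.6
HEALTHY_REASSURE = 0.9


def risk_flag(probs):
    """Map per-class posteriors to a coarse screening flag (hypothetical labels)."""
    if probs.get("pneumonia", 0.0) > PNEUMONIA_URGENT:
        return "urgent_evaluation"
    if probs.get("healthy", 0.0) > HEALTHY_REASSURE:
        return "low_risk"
    return "routine_review"


# Made-up posteriors for illustration.
print(risk_flag({"healthy": 0.05, "pneumonia": 0.70, "bronchitis": 0.25}))
print(risk_flag({"healthy": 0.93, "pneumonia": 0.02, "bronchitis": 0.05}))
print(risk_flag({"healthy": 0.50, "pneumonia": 0.30, "bronchitis": 0.20}))
```

Checking the pneumonia threshold first means a high-risk posterior is never masked by the reassurance rule; the output remains one input to a clinical decision, as stated above.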

### 4\.6 Future Work

Incorporating the CODA TB dataset would create a six-class classifier. Collecting PCR- or CT-confirmed adult bronchitis and pneumonia recordings would close the pediatric-adult domain gap. On-device inference with Core ML or TensorFlow Lite, reached through ONNX export, would remove network latency and keep audio on the device. Test-time augmentation and per-class threshold calibration are also worth exploring.

## Acknowledgments

The author thanks the creators of the Coswara, CoughVID, Virufy, and West China Hospital cough datasets for making their data publicly available\. OpenAI is acknowledged for releasing the Whisper model under the MIT license\. The OPERA team at the University of Cambridge is acknowledged for releasing the OPERA\-CT checkpoint\. Computing infrastructure was provided by Apple Silicon MPS and standard consumer hardware\.

## Authors’ Contributions

NV conceived the study, designed and implemented the CoughSense architecture, conducted all experiments, and wrote the manuscript\.

## Conflicts of Interest

The author declares no conflicts of interest\. CoughSense is an academic research project with no commercial funding\.

## Data Availability

All four source datasets are publicly available: Coswara \(coswara\.iisc\.ac\.in\), CoughVID \(doi:10\.5281/zenodo\.4498364\), Virufy \(github\.com/virufy\), and West China Hospital Pediatric Cough Dataset \(doi:10\.6084/m9\.figshare\.21176197\.v1\)\. All benchmark data splits, training code, and model checkpoints are available on GitHub at time of publication\.

## Funding

This research received no specific grant from any funding agency in the public, commercial, or not\-for\-profit sectors\.

## References

- \[1\] World Health Organization. Global Health Estimates: Leading Causes of Disease Burden. WHO Technical Report; 2023. URL: [https://www.who.int/data/global-health-estimates](https://www.who.int/data/global-health-estimates)
- \[2\] Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In: Proc Int Conf Machine Learning (ICML); 2023:28492-28518.
- \[3\] Sharma N, Krishnan P, Kumar R, et al. Coswara: a database of breathing, cough, and voice sounds for COVID-19 diagnosis. In: Proc Interspeech; 2020:4811-4815.
- \[4\] Orlandic L, Teijeiro T, Atienza D. The CoughVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Sci Data. 2021;8:156.
- \[5\] Chaudhari G, Jiang X, Fakhry A, et al. Virufy: Global applicability of crowdsourced and clinical datasets for AI detection of COVID-19 from cough. arXiv preprint arXiv:2011.13320; 2021.
- \[6\] Liang Z, Li J, Jing L, Zhang J, Huang X, Li X. Analysis of pediatric cough sounds for bronchitis and pneumonia diagnosis. Figshare; 2022. doi:10.6084/m9.figshare.21176197.v1
- \[7\] Zhang Y, Xia T, Han J, et al. Towards open respiratory acoustic foundation models: pretraining and benchmarking. In: Proc Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track; 2024.
- \[8\] Khosla P, Tian P, Wang C, et al. Supervised contrastive learning. In: Proc NeurIPS; 2020:18661-18673.
- \[9\] Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. Mixup: Beyond empirical risk minimization. In: Proc Int Conf Learning Representations (ICLR); 2018.
- \[10\] Galdran A, Carneiro J, González Ballester MA. Balanced-mixup for highly imbalanced medical image classification. In: Proc MICCAI; 2021:323-333.
- \[11\] Lin TY, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. In: Proc IEEE Int Conf Computer Vision (ICCV); 2017:2980-2988.
- \[12\] Perez E, Strub F, de Vries H, Dumoulin V, Courville A. FiLM: Visual reasoning with a general conditioning layer. In: Proc AAAI Conf Artif Intell; 2018:3942-3951.
- \[13\] Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks. J Mach Learn Res. 2016;17:1-35.
- \[14\] Tan M, Le QV. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proc ICML; 2019:6105-6114.
- \[15\] Wu Y, Chen K, Zhang T, et al. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: Proc IEEE Int Conf Acoustics, Speech, and Signal Processing (ICASSP); 2023:1-5.
- \[16\] Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In: Proc NeurIPS; 2020.
- \[17\] Hsu WN, Bolte B, Tsai YHH, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Process. 2021;29:3451-3460.
- \[18\] Brown C, Chauhan J, Grammenos A, et al. Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data. In: Proc ACM Int Conf Knowledge Discovery and Data Mining (KDD); 2020.
- \[19\] Laguarta T, Hueto F, Subirana B. COVID-19 artificial intelligence diagnosis using only cough recordings. IEEE Open J Eng Med Biol. 2020;1:275-281.
- \[20\] Qiu J, Zhu M, Zhao W, et al. Whisper-AuT: Domain-adapted audio encoder for efficient audio-LLM training. arXiv preprint arXiv:2604.10438; 2026.
- \[21\] Huang PY, Xu H, Li J, et al. Masked autoencoders that listen. In: Proc NeurIPS; 2022.
- \[22\] Kong Q, Cao Y, Iqbal T, et al. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans Audio Speech Lang Process. 2020;28:2880-2894.
- \[23\] Chen S, Wu Y, Wang C, et al. BEATs: Audio pre-training with acoustic tokenizers. In: Proc ICML; 2023.
- \[24\] Van Hecke K, Joris T, Peirs P, et al. Automated cough detection and classification using spectral features. IEEE J Biomed Health Inform. 2021;25(8):3049-3059.
- \[25\] Pramono RXA, Imtiaz B, Imtiaz SA, Rodriguez-Villegas E. Automatic identification of voluntary cough sound features for diagnosis of respiratory diseases. IEEE Trans Biomed Eng. 2021;68(8):2458-2469.
- \[26\] Dubagunta SP, Harín J, Magimai-Doss M. Adjusted learning of convolutional neural networks for multi-condition speech pathology detection. In: Proc ICASSP; 2021.
- \[27\] Loshchilov I, Hutter F. SGDR: Stochastic gradient descent with warm restarts. In: Proc ICLR; 2017.
- \[28\] Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proc ICLR; 2019.
- \[29\] Park DS, Chan W, Zhang Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc Interspeech; 2019:2613-2617.
- \[30\] McFee B, Raffel C, Liang D, et al. librosa: Audio and music signal analysis in Python. In: Proc Python in Science Conf; 2015:18-25.