InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization
Summary
InfoShield introduces a privacy-preserving method for speech representations in mental health screening using information-theoretic optimization, reducing sensitive attribute inference while maintaining diagnostic accuracy. A novel TimeAwareMINE estimator addresses temporal-static misalignment in sequential speech.
# InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization
Source: [https://arxiv.org/html/2606.05561](https://arxiv.org/html/2606.05561)
Wu Liu Yang Ling
###### Abstract
Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3% with limited utility loss (6% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723 [alsarrani2022thin].
###### keywords:
mental health screening, speech privacy, healthcare applications, mutual information, information bottleneck, attribute inference
## 1 Introduction
Depression affects approximately 4.4% of the global population [who2022mental]. Speech-based screening offers scalable, non-invasive depression detection, as acoustic features encode clinically relevant biomarkers [cummins2018speech,low2020voice]. However, clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Studies show privacy worries deter adoption, as employers, insurers, or governments might infer protected attributes like gender, age, or socioeconomic status from voice recordings [tao2023androids]. The patients who could benefit most from screening may be least willing to share speech data.
Problem Statement. Speech signals inherently encode both diagnostic biomarkers and sensitive demographic traits. This implies that standard acoustic representations inevitably leak private information alongside depression cues (see Section [4](https://arxiv.org/html/2606.05561#S4)). Balancing diagnostic precision with privacy is therefore challenging. Current solutions are ill-suited for this task: adversarial training offers protection only against anticipated attacks [fan2019adv], whereas Differential Privacy (DP) [dwork2006differential] applies noise uniformly, often degrading the fine-grained spectral patterns required for screening. Crucially, neither method explicitly measures or targets the removal of sensitive attributes from the latent space.
Our Approach. We propose InfoShield to secure speech representations in mental health tools. The framework integrates Variational Information Bottleneck (VIB) compression for better generalization with targeted mutual information (MI) minimization for privacy. By optimizing these objectives simultaneously, balancing $\text{KL}[q_{\phi}(Z|X)\,\|\,p(Z)]$ against $\hat{I}(Z;s)$, the model learns to retain diagnostic markers while suppressing sensitive attributes.
Key Limitation Identified. Standard MINE estimators struggle with sequential speech due to temporal-static misalignment between time-varying acoustics and static attribute labels, leading to unreliable MI estimates. We introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute representations.
This paper makes three contributions:
- TimeAwareMINE: Standard MINE pools variable-length acoustic sequences but receives a single static attribute label, causing temporal-static misalignment. Our cross-modal attention solution achieves better utility (F1: 0.782 vs. 0.714) and age privacy (39.7% vs. 43.5%) compared to StandardMINE.
- Unified Framework: Integrating VIB compression with TimeAwareMINE achieves better privacy-utility balance than individual components (Gender: 55.5%, Age: 30.3%, F1: 0.784).
- Evaluation on Androids Corpus: Ablation studies show InfoShield reduces gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3% with 6% utility loss, outperforming DP baselines and prior SOTA [alsarrani2022thin] (F1: 0.784 vs. 0.723).
## 2 Related Work
Speech-Based Mental Health Screening. Speech analysis demonstrates significant potential for non-invasive depression detection [cummins2018speech,low2020voice]. Recent deep learning approaches achieve competitive performance [alsarrani2022thin,tao2023androids], yet clinical deployment remains limited due to privacy concerns about demographic leakage [de2024probing]. Our work addresses this gap with privacy-preserving representations specifically designed for mental health applications.
Existing speech privacy approaches fall into three categories, each with limitations for demographic attribute protection:
Traditional Privacy Methods focus on speaker anonymization [snyder2017deep,fang2019speaker] or adversarial training against specific attackers. These methods lack generalizability: they protect against known adversaries but fail against novel attack architectures. In contrast, our information-theoretic approach is designed to be attacker-agnostic and does not require knowledge of specific attack models.
Differential Privacy provides formal guarantees through global noise injection [abadi2016deep,pelikan2023federated], but this affects all learned features indiscriminately, including diagnostically relevant acoustic patterns. Our targeted MI minimization selectively removes sensitive information while preserving task-relevant signals through principled VIB compression.
Information-Theoretic Approaches offer principled representation learning through the Information Bottleneck [tishby2000information] and MINE [belghazi2018mine]. Recent advances include CLUB [cheng2020club] for tighter MI bounds and MI-based speaker learning [ravanelli2019learning]. However, standard MINE suffers from high variance on sequential data [poole2019variational]. Recent work targets different threats: USC [vecino2025universal] protects speaker identity, and SafeEar [safeear2024] protects speech content during deepfake detection. None addresses comprehensive privacy protection in clinical speech analysis. Our framework integrates VIB compression with TimeAwareMINE for joint optimization of utility, privacy, and generalization.
## 3 Methodology
### 3.1 InfoShield Architecture
[Figure 1 diagram. Speech branch: log-mel spectrogram $X$, speech encoder (Transformer), stochastic latent $Z\sim q_{\phi}(Z|X)$ with prior $p(Z)$ ($\mathcal{L}_{\text{VIB}}$), depression classifier ($\mathcal{L}_{\text{utility}}$). Text branch: transcript, text encoder, TimeAwareMINE with cross-modal attention ($\mathcal{L}_{\text{MI}}$). Total loss: $\mathcal{L}=\mathcal{L}_{\text{utility}}+\beta\mathcal{L}_{\text{VIB}}+\gamma\mathcal{L}_{\text{MI}}$.]
Figure 1: InfoShield architecture. Left: speech encoder produces latent representations $Z$ for depression classification. Right: TimeAwareMINE quantifies privacy leakage via cross-modal attention with transcript embeddings. Three loss terms jointly optimize the framework. Red dashed arrows indicate gradient backpropagation.
Figure [1](https://arxiv.org/html/2606.05561#S3.F1) illustrates the overall InfoShield framework. The input log-mel spectrogram $X$ is processed by a Transformer encoder to produce stochastic latent representations $Z$. Three loss terms jointly optimize the network: (1) utility loss $\mathcal{L}_{\text{utility}}$ for depression prediction, (2) VIB compression loss $\mathcal{L}_{\text{VIB}}$ for regularization, and (3) privacy loss $\mathcal{L}_{\text{MI}}$ via TimeAwareMINE for minimizing information leakage about sensitive attributes.
### 3.2 Notation and Problem Formulation
Notation. We define the following notation used throughout this paper:
- $X\in\mathbb{R}^{T\times D}$: input log-mel spectrogram sequence with $T$ frames
- $Y\in\{0,1\}$: binary depression label (target task)
- $Z\in\mathbb{R}^{T\times d}$: learned latent representation
- $s$: transcript sentence (text input), encoded by BERT for cross-modal attention
- $\mathbf{e}_{s}\in\mathbb{R}^{d'}$: BERT embedding of transcript sentence $s$
- $s'$: negative transcript samples drawn from the marginal distribution for contrastive MI estimation
### 3.3 Information-Theoretic Privacy Framework
Problem Formulation. We formulate privacy-preserving speech representation learning as a constrained optimization problem that directly minimizes mutual information between learned representations and sensitive attributes while preserving diagnostic utility.
Objective Function. Our encoder $q_{\phi}(Z|X)$ learns stochastic representations $Z$ by optimizing:
$$\mathcal{L}=-\mathbb{E}[\log p_{\theta}(Y|Z)]+\beta\,\text{KL}[q_{\phi}(Z|X)\,\|\,p(Z)]+\gamma\,\hat{I}(Z;s) \tag{1}$$
where the VIB compression term $\text{KL}[q_{\phi}(Z|X)\,\|\,p(Z)]$ provides generalization regularization, and $\hat{I}(Z;s)$ penalizes the estimated mutual information between speech representations $Z$ and transcript sentences $s$ containing demographic cues, thereby protecting sensitive attributes from inference attacks.
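To make Eq. (1) concrete, the sketch below evaluates its three terms for a single example in plain Python. It assumes a diagonal-Gaussian encoder and a standard-normal prior (the text does not state the form of $p(Z)$), for which the KL term has a closed form. The names `gaussian_kl` and `total_loss` and all numeric inputs are illustrative, not the authors' code.

```python
import math

def gaussian_kl(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in nats, summed over dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def total_loss(p_true_label, mu, sigma, mi_estimate, beta=1e-3, gamma=0.01):
    """Eq. (1) for one example: utility + beta * VIB + gamma * MI.

    p_true_label: classifier probability assigned to the correct label.
    mi_estimate:  output of the MI estimator (a hypothetical number here).
    """
    utility = -math.log(p_true_label)   # negative log-likelihood term
    vib = gaussian_kl(mu, sigma)        # compression term (assumed N(0, I) prior)
    return utility + beta * vib + gamma * mi_estimate

loss = total_loss(p_true_label=0.8, mu=[0.5, -0.2], sigma=[1.0, 0.7],
                  mi_estimate=0.3)
print(round(loss, 4))
```

With the reported weights ($\beta=10^{-3}$, $\gamma=0.01$), the utility term dominates the sum in this toy case, which matches the paper's framing of the other two terms as regularizers.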
Theoretical Foundation. The privacy-utility relationship follows from data processing inequalities [cover1999elements]. For any representation $Z$ derived from input $X$, where $(X,Y,s)$ follows a joint distribution:
$$I(Z;Y)\leq I(X;Y)+I(X;s|Y)-I(Z;s) \tag{2}$$
This bound couples the two quantities: the utility $I(Z;Y)$ and the leakage $I(Z;s)$ together cannot exceed $I(X;Y)+I(X;s|Y)$, which is the trade-off the framework has to navigate, assuming the Markov condition $Z\leftarrow X\rightarrow(Y,s)$ holds for our encoder.
### 3.4 TimeAwareMINE for Sequential Speech
Mechanism of Protection. Speech inherently links acoustic features (e.g., pitch) with linguistic patterns. By minimizing the mutual information $I(Z;s)$ with the transcript, InfoShield forces the model to discard not just linguistic content, but also the statistically correlated acoustic cues. This effectively strips away gender and age information implicit in the signal, blocking attribute inference.
Problem Identification. Standard MINE estimators suffer from estimation challenges on sequential speech: temporal pooling destroys critical time-dependency information, while random temporal-static pairing introduces noise that leads to loose, unreliable MI bounds. This estimation inaccuracy directly undermines privacy optimization effectiveness.
Cross-Modal Temporal Alignment. Given acoustic sequence $Z=(z_{1},\ldots,z_{T})$ and transcript embedding $\mathbf{e}_{s}$ from BERT encoding, we compute frame-wise alignment through cross-modal attention:
$$\alpha_{t}=\text{softmax}\left(z_{t}^{\top}\mathbf{e}_{s}/\sqrt{d}\right) \tag{3}$$
$$c_{t}=\sum_{j=1}^{T}\alpha_{tj}z_{j} \tag{4}$$
This cross-modal attention mechanism dynamically aligns acoustic frames with transcript representations, addressing the temporal-static misalignment that standard MINE cannot handle when pairing sequential speech with static attribute labels.
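A minimal sketch of the attention step follows, under two assumptions that the text leaves open. First, the transcript embedding is taken to be already projected to the frame dimension $d$ (the paper reports a 64-dimensional latent and a 768-dimensional BERT embedding, so some projection is presumably involved). Second, because Eq. (3) indexes the weights by one frame index and Eq. (4) by two, the sketch uses the simplest reading: one weight per frame, normalized over the $T$ frames, and a single attention-weighted sum of frames. `attend_frames` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_frames(frames, e_s):
    """Transcript-conditioned attention over acoustic frames.

    frames: list of T frame vectors z_t, each of length d.
    e_s:    transcript embedding, assumed already projected to length d.
    Returns (alpha, context): one weight per frame and the weighted frame sum.
    """
    d = len(e_s)
    # Scaled dot product between each frame and the transcript embedding.
    scores = [sum(z_i * e_i for z_i, e_i in zip(z, e_s)) / math.sqrt(d)
              for z in frames]
    alpha = softmax(scores)  # normalized across the T frames
    context = [sum(a * z[k] for a, z in zip(alpha, frames)) for k in range(d)]
    return alpha, context

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy Z with T=3, d=2
alpha, context = attend_frames(frames, e_s=[2.0, 0.0])
print(alpha, context)
```

In this toy input the first and third frames have the same dot product with the transcript embedding, so they receive equal weight, and the second frame receives less.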
MI Estimation. The statistics network $T_{\psi}$ processes aligned features to estimate:
$$\hat{I}(Z;s)=\frac{1}{T}\sum_{t=1}^{T}\left[T_{\psi}(c_{t},\mathbf{e}_{s})-\log\mathbb{E}_{s'}e^{T_{\psi}(c_{t},\mathbf{e}_{s'})}\right] \tag{5}$$
This temporal-aware design produces significantly tighter MI bounds by ensuring meaningful frame-transcript correspondence, leading to improved privacy protection through more accurate information leakage quantification. Minimizing $\hat{I}(Z;s)$ yields representations that reduce mutual information between speech $Z$ and transcripts $s$ containing demographic cues, thereby protecting sensitive attributes (gender, age) from inference attacks.
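The estimator in Eq. (5) has the Donsker-Varadhan form used by MINE, averaged over frames. The sketch below computes it from critic outputs that are supplied directly as numbers, so it illustrates only the arithmetic of the bound, not the trained statistics network. The function names and the example scores are hypothetical.

```python
import math

def log_mean_exp(values):
    """log(mean(exp(v))) computed stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def dv_mi_estimate(joint_scores, marginal_scores):
    """Frame-averaged Donsker-Varadhan estimate in the form of Eq. (5).

    joint_scores[t]:    critic output T(c_t, e_s) for the true transcript.
    marginal_scores[t]: critic outputs T(c_t, e_s') for negative transcripts.
    """
    per_frame = [j - log_mean_exp(neg)
                 for j, neg in zip(joint_scores, marginal_scores)]
    return sum(per_frame) / len(per_frame)

# Hypothetical critic outputs for T=2 frames and 3 negatives per frame.
estimate = dv_mi_estimate([1.2, 0.8], [[0.1, -0.3, 0.2], [0.0, 0.4, -0.1]])
print(estimate)
```

A critic that cannot tell true pairs from negatives (all scores equal) gives an estimate of zero, which is the state the encoder is pushed toward when this term is minimized.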
Connection to Attribute Privacy. Let $A$ denote the sensitive attribute (gender/age) and $S$ denote the transcript. From the chain rule of mutual information and the data processing inequality:
$$I(Z;A)\leq I(Z;S)+I(A;S|Z) \tag{6}$$
When $I(Z;S)\to 0$ and $I(A;S|Z)$ is small (weak conditional dependence), the bound drives $I(Z;A)\to 0$. Our empirical results are consistent with this mechanism: reducing transcript-speech MI correlates with reduced attribute inference accuracy.
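For reference, a bound of this shape can be obtained from the chain rule alone. The derivation below is a sketch in the notation above and may differ from the authors' own argument. In particular, its residual term is $I(Z;A\mid S)$, which vanishes when $Z$ and $A$ are conditionally independent given the transcript $S$; that independence is an assumption, not something the text states.

```latex
\begin{align*}
I(Z;A) &\le I(Z;A,S)              && \text{(adding a variable cannot reduce MI)}\\
       &= I(Z;S) + I(Z;A\mid S)   && \text{(chain rule)}
\end{align*}
```

Under this reading, driving $I(Z;S)$ to zero bounds $I(Z;A)$ by whatever attribute information $Z$ carries beyond the transcript.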
Why Cross-Modal Attention Matters. Standard MINE applies global pooling over the temporal dimension, computing a single MI estimate for the entire utterance. This is problematic for speech because different frames contain varying amounts of privacy-relevant information: vowel-heavy segments encode more gender cues due to pitch, while consonants are less revealing. Global pooling dilutes these frame-wise differences. Our cross-modal attention enables frame-wise privacy quantification, where the attention weight $\alpha_{t}$ indicates how much demographic information each frame contains, allowing targeted privacy removal rather than uniform blurring.
Theoretical Guarantees. Under standard regularity conditions (finite VC-dimension for $T_{\psi}$, $\beta$-mixing temporal dependence), our estimator converges almost surely: $\lim_{n\to\infty}\hat{I}_{TA,n}=I(Z;s)$. We ensure stable training via spectral normalization on the statistics network and an exponential moving average of the MI estimate ($\tau=0.99$).
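The exponential moving average mentioned above can be sketched as a one-line update. The text gives only the decay $\tau=0.99$; where exactly the smoothed value enters training is not specified, so the snippet shows just the smoothing step on hypothetical per-batch estimates.

```python
def ema_update(previous, new_value, tau=0.99):
    """Exponential moving average: keep tau of the old value, add (1 - tau) of the new."""
    return tau * previous + (1.0 - tau) * new_value

smoothed = 0.0
for raw in [0.50, 0.90, 0.10, 0.70]:   # hypothetical noisy per-batch MI estimates
    smoothed = ema_update(smoothed, raw)
print(smoothed)
```

With $\tau=0.99$ each batch moves the running value by only 1% of the gap, so a zero-initialized average starts out biased toward zero and needs on the order of a hundred updates to track the raw estimates.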
## 4 Experiments
### 4.1 Experimental Setup
Clinical Dataset: The Androids Corpus [tao2023androids] contains 228 recordings from 118 native Italian speakers (64 clinically diagnosed with depression, 54 healthy controls). Similar to prior work [alsarrani2022thin], we use interview speech for ecological validity. For privacy evaluation, we extract gender and age groups (Young $\leq$30, Middle 31–45, Senior $\geq$46) from clinical metadata, simulating privacy-sensitive healthcare deployment.
Baselines and Ablations: We compare against: (1) Normal (base Transformer without privacy, utility oracle), (2) DP ($\varepsilon$=1, 8) (Differential Privacy), (3) VIB-only, (4) StandardMINE/TimeAwareMINE, and (5) InfoShield. This systematic ablation demonstrates each component's contribution.
Evaluation Framework: Given the modest dataset size (118 speakers), our evaluation focuses on feasibility demonstration. All experiments use 5-fold cross-validation with participant-level splits. We assess: (1) Privacy-Utility Feasibility: Can information-theoretic approaches achieve privacy without destroying utility? (2) Method Comparison: Targeted MI minimization vs. differential privacy. (3) Component Analysis: TimeAwareMINE's temporal adaptations vs. standard MINE.
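Participant-level splitting matters here because the corpus has more recordings (228) than speakers (118), so a recording-level split could place the same voice in both training and test sets. The sketch below shows one simple way to build such folds; the round-robin assignment, the helper name `participant_folds`, and the toy speaker IDs are assumptions, and the authors' exact procedure (for example, whether folds are stratified by diagnosis) is not described.

```python
import random
from collections import defaultdict

def participant_folds(recording_speakers, n_folds=5, seed=0):
    """Assign recordings to folds so each speaker appears in exactly one fold.

    recording_speakers: list where entry i is the speaker ID of recording i.
    Returns a list of n_folds lists of recording indices (the test sets).
    """
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(recording_speakers):
        by_speaker[speaker].append(idx)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)   # reproducible speaker order
    folds = [[] for _ in range(n_folds)]
    for position, speaker in enumerate(speakers):
        folds[position % n_folds].extend(by_speaker[speaker])
    return folds

# Toy corpus: 10 hypothetical speakers with two recordings each.
speakers = [f"spk{i // 2}" for i in range(20)]
folds = participant_folds(speakers)
print([len(f) for f in folds])
```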
Hyperparameter Configuration: Preliminary experiments show stable performance with utility:privacy:compression loss ratios of 2:1:1. We set $\beta$=0.001, $\gamma$=0.01.
Architecture and Training Details: Our Transformer encoder processes log-mel features through multi-head self-attention, outputting Gaussian parameters for stochastic representations. TimeAwareMINE uses three fully-connected layers with ReLU activations. We employ curriculum learning, with the privacy weight increasing linearly from 0.001 to the target value over the first 25% of epochs and the MI estimator updating twice per encoder update.
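The curriculum and update ratio described above can be sketched as follows. The schedule is expressed in optimizer steps rather than epochs, and the estimator updates are placed before each encoder update; both choices are assumptions for illustration, as are the function names.

```python
def privacy_weight(step, total_steps, target=0.01, start=0.001, warmup_frac=0.25):
    """Linear warm-up of the privacy weight, then constant at the target."""
    warmup_steps = warmup_frac * total_steps
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    return start + (target - start) * step / warmup_steps

MI_UPDATES_PER_ENCODER_UPDATE = 2   # ratio stated in the text

def update_plan(n_encoder_steps):
    """Order of updates: two estimator steps, then one encoder step (assumed order)."""
    plan = []
    for _ in range(n_encoder_steps):
        plan.extend(["mi_estimator"] * MI_UPDATES_PER_ENCODER_UPDATE)
        plan.append("encoder")
    return plan

print([round(privacy_weight(s, 100), 4) for s in (0, 10, 25, 50)])
print(update_plan(2))
```

Updating the estimator more often than the encoder is a common way to keep the MI estimate from lagging behind a representation that is changing under it.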
### 4.2 Architecture and Implementation Details
Data Preprocessing. Raw audio was resampled to 16 kHz. Log-mel spectrograms were computed with 80 mel bins, a 25 ms window, and a 10 ms stride. No data augmentation was applied due to the small dataset size. Transcripts were encoded using sentence-BERT [sentencebert].
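As a quick check on what these settings imply for the sequence length $T$, the snippet below converts the window and stride to samples and counts frames for a clip. It assumes no padding at the clip edges, which the text does not specify, so toolkits that center-pad will report a slightly different count.

```python
SAMPLE_RATE = 16_000
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms window in samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms stride in samples
N_MELS = 80

def num_frames(n_samples, win=WIN, hop=HOP):
    """Number of analysis frames without edge padding (an assumption)."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

clip_frames = num_frames(10 * SAMPLE_RATE)   # a hypothetical 10-second clip
print(WIN, HOP, (clip_frames, N_MELS))
```

A ten-second clip therefore yields roughly a thousand 80-dimensional frames, which is the length the encoder and the frame-wise MI estimate operate over.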
Encoder. A 4-layer Transformer (8 heads, $d_{\text{model}}$=256, dropout=0.3) outputs Gaussian parameters $\{\mu,\sigma\}$ for a 64-dimensional latent $Z$. Training uses 30 MC samples; inference uses the mean $\mu$. The 4-layer design balances capacity with overfitting risk on 118 speakers.
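The train/inference asymmetry described above corresponds to the usual reparameterized sampling, sketched here for a single latent vector. How the 30 samples are combined during training (for instance, averaging the loss over them) is not stated, so the sketch only draws them; `sample_latent` and `latent_for` are hypothetical names.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized draw: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def latent_for(mu, sigma, training, n_samples=30, seed=0):
    """Return Monte Carlo samples in training, or just the mean at inference."""
    if not training:
        return [list(mu)]
    rng = random.Random(seed)
    return [sample_latent(mu, sigma, rng) for _ in range(n_samples)]

mu, sigma = [0.1, -0.2], [0.5, 0.5]   # toy 2-dimensional latent
print(len(latent_for(mu, sigma, training=True)),
      latent_for(mu, sigma, training=False))
```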
TimeAwareMINE. A 3-layer FC network (256$\to$128$\to$64) with spectral normalization. Cross-modal attention aligns acoustic frames with BERT attribute embeddings (768-dim) to address temporal-static misalignment.
Optimization. AdamW (lr=$10^{-4}$, weight decay=$10^{-5}$), batch size 32, 5 epochs. We set $\beta=10^{-3}$ (VIB) and $\gamma=0.01$ (privacy) based on grid search: lower values were insufficient for privacy, while higher values caused utility collapse.
Differential Privacy. Opacus DP-SGD with gradient clipping 1.2, $\delta=10^{-5}$. We report $\varepsilon\in\{1.0,8.0\}$, showing the privacy-utility spectrum.
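To illustrate why this baseline perturbs every feature alike, the following plain-Python sketch shows the core DP-SGD step: clip each per-sample gradient to the stated norm of 1.2, sum, add Gaussian noise, and average. It is an illustration of the mechanism and not Opacus code. The noise multiplier that yields $\varepsilon=1$ or $8$ depends on the training length and sampling rate and is not given in the text, so the value used here is a placeholder.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_sample_grads, max_norm=1.2, noise_multiplier=1.0, seed=0):
    """Clip per-sample gradients, sum, add Gaussian noise, then average.

    noise_multiplier is a placeholder; the paper does not report it.
    """
    rng = random.Random(seed)
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    # The same noise scale is applied to every coordinate of the gradient.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]   # toy per-sample gradients
print(dp_sgd_gradient(grads))
```

The noise has the same scale in every coordinate regardless of whether that coordinate carries demographic or diagnostic signal, which is the contrast with targeted MI minimization that the paper draws.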
Attack Model. A 3-layer Transformer attacker (8 heads, $d$=80) trained on frozen representations with class-weighted cross-entropy for age groups: Young ($\leq$30), Middle (31–45), Senior ($\geq$46).
### 4.3 Feasibility Validation and Ablation Study
Tables [1](https://arxiv.org/html/2606.05561#S4.T1) and [2](https://arxiv.org/html/2606.05561#S4.T2) present 5-fold cross-validation results for the unified framework. The Normal baseline (F1=0.834) establishes the utility upper bound but is highly vulnerable to attribute inference: 92.6% gender and 55.7% age inference accuracy. InfoShield achieves the best privacy-utility balance among the compared methods: gender inference drops to 55.5% and age inference to 30.3% (below the 33.3% random chance for 3-class classification), while maintaining competitive depression classification (F1=0.784), a 6% relative utility cost compared to the oracle. InfoShield outperforms the prior SOTA [alsarrani2022thin] (F1: 0.784 vs. 0.723) while providing strong privacy protection, and TimeAwareMINE improves on StandardMINE in utility and age privacy.
Table 1: Diagnostic Performance for Depression Classification on Androids Corpus (5-fold CV)
Table 2: Privacy Protection Against Attribute Inference Attacks (5-fold CV)
Statistical Significance for Clinical Deployment. Paired t-tests confirm that InfoShield significantly outperforms all baselines except the Normal oracle in diagnostic accuracy for depression classification ($p<0.05$), supporting clinical feasibility. For privacy protection, InfoShield achieves significantly lower gender and age inference compared to Normal, VIB-only, and DP baselines ($p<0.05$), indicating that the privacy gains are statistically significant on this corpus.
Framework Synergy Analysis. InfoShield demonstrates synergistic effects. TimeAwareMINE alone achieves 62.2% gender and 39.7% age inference, while InfoShield lowers these to 55.5% and 30.3%, improvements of 6.7pp and 9.4pp, respectively. StandardMINE provides better gender privacy (54.3%) than TimeAwareMINE (62.2%), but TimeAwareMINE offers superior age privacy (39.7% vs. 43.5%) and better utility (F1: 0.782 vs. 0.714), supporting the benefits of temporal-aware design for sequential speech.
Component Ablation Analysis. Tables [1](https://arxiv.org/html/2606.05561#S4.T1) and [2](https://arxiv.org/html/2606.05561#S4.T2) reveal how each component contributes: (1) VIB-only: Compression provides modest privacy gains (Gender: 61.3% vs. 77.8% Normal) through implicit information filtering. (2) StandardMINE: Direct privacy optimization achieves strong gender privacy (54.3%) but suffers utility degradation (F1: 0.714) and weaker age protection (43.5%). (3) TimeAwareMINE: Temporal-aware design improves utility (F1: 0.782) and age privacy (39.7%) vs. StandardMINE. (4) InfoShield: The complete framework achieves the best balance (Gender: 55.5%, Age: 30.3%, F1: 0.784), demonstrating synergistic benefits.
Tradeoff and Failure Analysis. Varying the privacy weight $\gamma$ shows that attacker accuracy falls monotonically as $\gamma$ increases, for both gender inference (62.2%$\to$55.5%$\to$48.3%) and age inference (39.7%$\to$30.3%$\to$27.1%). Depression F1 is essentially unchanged at first and then degrades (0.782$\to$0.784$\to$0.721). The selected $\gamma=0.01$ achieves age inference near random chance (33.3%) with 6% utility loss. However, we observe high variance across folds ($\pm$15-20% for some methods), likely due to the small dataset size. Age protection proves more challenging than gender: the three-class age task has a lower random baseline (33.3% versus 50% for binary gender), and adjacent age groups (Young vs. Middle) are frequently confused by the attacker.
### 4.4 Information-Theoretic vs. Differential Privacy
InfoShield outperforms the DP baselines on both privacy and utility. Compared to strong DP ($\varepsilon$=1), we achieve better privacy (Gender: 55.5% vs. 59.4%, Age: 30.3% vs. 42.0%) with higher utility (F1: 0.784 vs. 0.568). Even against weaker DP ($\varepsilon$=8), we provide superior privacy with higher utility (F1: 0.784 vs. 0.707).
The key difference: DP's global noise injection affects all learned features indiscriminately, including diagnostically relevant patterns. Our targeted MI minimization selectively removes sensitive information through principled optimization, guided by TimeAwareMINE's accurate quantification. This is consistent with our 38% relative F1 improvement over strong DP (0.784 vs. 0.568) while maintaining superior privacy.
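The percentages quoted in this section and in the abstract are relative changes in F1, which can be verified from the reported scores. The snippet below is a simple arithmetic check using only numbers stated in the text; `relative_change` is a helper defined here for the purpose.

```python
def relative_change(new, old):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old

f1 = {"Normal": 0.834, "InfoShield": 0.784, "DP eps=1": 0.568}

utility_loss = -relative_change(f1["InfoShield"], f1["Normal"])
gain_over_dp = relative_change(f1["InfoShield"], f1["DP eps=1"])
print(f"{utility_loss:.1%}", f"{gain_over_dp:.1%}")   # prints: 6.0% 38.0%
```

In absolute terms these correspond to 0.050 F1 lost relative to the non-private oracle and 0.216 F1 gained over DP at $\varepsilon=1$.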
### 4.5 Multi-Attribute Privacy Analysis
Our multi-attribute evaluation examines protection across both demographic attributes. Against Transformer-based attacks, InfoShield achieves substantial privacy gains: gender inference drops to 55.5% (22.3pp improvement from Normal) and age inference to 30.3% (13.6pp improvement, below 33% random chance).
These improvements demonstrate attribute-specific effectiveness: gender inference approaches the 50% chance level, while age inference falls below the 33.3% chance level, suggesting TimeAwareMINE's temporal-aware design provides superior protection for temporal-dependent attributes. Consistent performance across binary (gender) and multi-class (age) tasks supports generalizability. InfoShield outperforms VIB-only on privacy (gender: 55.5% vs. 61.3%; age: 30.3% vs. 45.6%) while achieving superior diagnostic performance, confirming the benefit of explicit MI minimization over implicit compression.
## 5 Conclusion
This paper presents InfoShield, a unified information-theoretic framework addressing privacy concerns in speech-based mental health technologies. By integrating VIB compression with TimeAwareMINE, we achieve utility-privacy balance for healthcare applications.
Summary of Contributions. Our work delivers three key advancements: (1) TimeAwareMINE: We address the temporal-static misalignment in sequential speech by introducing cross-modal attention. This mechanism yields superior outcomes over StandardMINE, improving utility (F1: 0.782 vs. 0.714) while tightening age privacy (39.7% vs. 43.5%). (2) Unified InfoShield Framework: By combining VIB compression with our targeted MI minimization, we achieve a robust privacy-utility trade-off (Gender: 55.5%, Age: 30.3%, F1: 0.784) that surpasses individual components. (3) Empirical Validation: On the Androids Corpus, InfoShield suppresses attribute leakage, reducing gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3%, with only a 6% utility cost, thereby outperforming differential privacy baselines and the prior SOTA [alsarrani2022thin] (F1: 0.784 vs. 0.723).
Limitations and Future Work. The statistical power of our findings is constrained by the small sample size (118 speakers) and single language (Italian). While this study demonstrates feasibility, future work must validate robustness on larger, multilingual cohorts and extend protection to other sensitive attributes under stronger threat models.
## 6 Disclosure of Generative AI Use
Generative AI tools were used solely for: \(1\) text polishing and language refinement to improve clarity; \(2\) code efficiency optimization suggestions; and \(3\) partial debugging assistance\. All core research contributions were completed by the authors, including idea conception, initial drafting, related work analysis, code implementation, and data preprocessing\. The authors take full responsibility for the content of this paper\.