A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Hugging Face Daily Papers

Summary

A comprehensive survey reviewing the trustworthiness challenges of Large Audio Language Models (LALMs), including vulnerabilities like cross-modal jailbreaking and acoustic backdoors, and proposing a defense-in-depth roadmap.

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Cached at: 05/21/26, 10:10 AM

Source: https://huggingface.co/papers/2605.20266

Abstract

Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs’ capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for “Defense-in-Depth” architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
