A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
May 18, 2026
Authors: Kaiwen Luo, Zhenhong Zhou, Leo Wang, Liang Lin, Yang Xiao, Tianyu Shao, Yuanhe Zhang, Yuxuan Li, Miao Yu, Kailin Lyu, Jiaming Zhang, Dongrui Liu, Li Sun, Yueming Wu, Kai Li, Ting Dang, Xiaojun Jia, Rohan Kumar Das, Xinfeng Li, Siyuan Liang, Qiufeng Wang, Xingjun Ma, Jing Chen, Kun Wang, Junhao Dong, Deqing Zou, Yu Cheng, Xia Hu, Zhigang Zeng, Sen Su, Yang Liu, Yu-Gang Jiang, Philip S. Yu, Yew-Soon Ong
cs.AI
Abstract
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systematic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.