A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
May 18, 2026
Authors: Kaiwen Luo, Zhenhong Zhou, Leo Wang, Liang Lin, Yang Xiao, Tianyu Shao, Yuanhe Zhang, Yuxuan Li, Miao Yu, Kailin Lyu, Jiaming Zhang, Dongrui Liu, Li Sun, Yueming Wu, Kai Li, Ting Dang, Xiaojun Jia, Rohan Kumar Das, Xinfeng Li, Siyuan Liang, Qiufeng Wang, Xingjun Ma, Jing Chen, Kun Wang, Junhao Dong, Deqing Zou, Yu Cheng, Xia Hu, Zhigang Zeng, Sen Su, Yang Liu, Yu-Gang Jiang, Philip S. Yu, Yew-Soon Ong
cs.AI
Abstract
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the growth of LALMs' capabilities has significantly outpaced the development of systematic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further confirms the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.