A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Abstract

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), among which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the growth of LALM capabilities has significantly outpaced the development of systematic frameworks for ensuring their trustworthiness. This survey provides a comprehensive investigation of the internal mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between mature attack methods and underdeveloped defenses underscores the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs