A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Abstract

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), among which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the capabilities of LALMs have grown far faster than the systematic frameworks needed to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that enable emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses underscores the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. The project is available on GitHub at https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.