A Comprehensive Survey for the Holistic Evaluation of Large Audio-Language Models

Abstract

Large audio-language models (LALMs) extend large language models (LLMs) with auditory capabilities, and as they advance they are expected to demonstrate universal proficiency across a wide range of auditory tasks. Numerous benchmarks have emerged to assess LALM performance, but they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide a detailed overview of each category, highlight open challenges in the field, and offer insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluation of LALMs, providing clear guidelines for the community. We will release the collection of surveyed papers and actively maintain it to support ongoing advancements in the field.