Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Abstract

Large audio-language models (LALMs) extend large language models (LLMs) with auditory capabilities, and as they advance they are expected to demonstrate universal proficiency across a wide range of auditory tasks. Numerous benchmarks have emerged to assess LALM performance, but they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluation, organizing existing work into four dimensions according to evaluation objective: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide a detailed overview of each category, highlight open challenges in the field, and offer insights into promising future directions. To the best of our knowledge, this is the first survey focused specifically on the evaluation of LALMs, and it provides clear guidelines for the community. We will release the collection of surveyed papers and actively maintain it to support ongoing progress in the field.