Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Abstract

Large audio-language models (LALMs) extend large language models (LLMs) with auditory capabilities, and as they advance they are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide a detailed overview of each category and highlight open challenges in the field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluation of LALMs, and it provides clear guidelines for the community. We will release the collection of surveyed papers and actively maintain it to support ongoing advancements in the field.