AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
September 9, 2025
Authors: Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan
cs.AI
Abstract
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Through optimized batch processing and parallel execution, our system achieves a speedup of up to 127% over existing toolkits, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding, and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
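The abstract attributes AU-Harness's efficiency to optimized batch processing and parallel execution, and its reproducibility to standardized prompting protocols. The sketch below illustrates that general pattern only; all names (EvalConfig, run_model_on_batch, evaluate) are hypothetical placeholders, not AU-Harness's actual API, and the inference call is stubbed out.

```python
# Minimal, hypothetical sketch of batched, parallel evaluation with a fixed
# prompt template. Not AU-Harness code; names and defaults are illustrative.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class EvalConfig:
    """Evaluation settings applied identically to every model under test."""
    model_name: str
    batch_size: int = 16
    num_workers: int = 8
    # A single shared instruction template keeps the prompt modality fixed
    # across models; the abstract reports up-to-9.5-point swings when it is not.
    prompt_template: str = "Listen to the audio and answer: {question}"


def run_model_on_batch(cfg: EvalConfig, batch: list[dict]) -> list[str]:
    """Placeholder for a real inference call (e.g. a hosted model endpoint)."""
    return [f"[{cfg.model_name}] " + cfg.prompt_template.format(**ex) for ex in batch]


def evaluate(cfg: EvalConfig, examples: list[dict]) -> list[str]:
    """Split the dataset into batches and run them on parallel worker threads."""
    batches = [examples[i:i + cfg.batch_size]
               for i in range(0, len(examples), cfg.batch_size)]
    with ThreadPoolExecutor(max_workers=cfg.num_workers) as pool:
        results = pool.map(lambda b: run_model_on_batch(cfg, b), batches)
    return [pred for batch_preds in results for pred in batch_preds]


if __name__ == "__main__":
    cfg = EvalConfig(model_name="example-lalm")
    data = [{"question": f"What is said in clip {i}?"} for i in range(64)]
    print(len(evaluate(cfg, data)), "predictions")
```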