AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Abstract

Large Audio Language Models (LALMs) are advancing rapidly, but evaluating them remains challenging because inefficient toolkits limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
