TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Abstract

Large audio-language models (LALMs) are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds and overlook culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
