TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
September 30, 2025
Authors: Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) are advancing rapidly, yet most evaluations
emphasize speech or globally sourced sounds, overlooking culturally distinctive
cues. This gap raises a critical question: can current models generalize to
localized, non-semantic audio that communities instantly recognize but
outsiders do not? To address this, we present TAU (Taiwan Audio Understanding),
a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline
combining curated sources, human editing, and LLM-assisted question generation,
producing 702 clips and 1,794 multiple-choice items that cannot be solved by
transcripts alone. Experiments show that state-of-the-art LALMs, including
Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates
the need for localized benchmarks to reveal cultural blind spots, guide more
equitable multimodal evaluation, and ensure models serve communities beyond the
global mainstream.