TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
September 30, 2025
Authors: Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) are advancing rapidly, yet most evaluations
emphasize speech or globally sourced sounds, overlooking culturally distinctive
cues. This gap raises a critical question: can current models generalize to
localized, non-semantic audio that communities instantly recognize but
outsiders do not? To address this, we present TAU (Taiwan Audio Understanding),
a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline
combining curated sources, human editing, and LLM-assisted question generation,
producing 702 clips and 1,794 multiple-choice items that cannot be solved by
transcripts alone. Experiments show that state-of-the-art LALMs, including
Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates
the need for localized benchmarks to reveal cultural blind spots, guide more
equitable multimodal evaluation, and ensure models serve communities beyond the
global mainstream.