TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
September 30, 2025
Authors: Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) are advancing rapidly, yet most evaluations
emphasize speech or globally sourced sounds, overlooking culturally distinctive
cues. This gap raises a critical question: can current models generalize to
localized, non-semantic audio that communities instantly recognize but
outsiders do not? To address this, we present TAU (Taiwan Audio Understanding),
a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline
combining curated sources, human editing, and LLM-assisted question generation,
producing 702 clips and 1,794 multiple-choice items that cannot be solved by
transcripts alone. Experiments show that state-of-the-art LALMs, including
Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates
the need for localized benchmarks to reveal cultural blind spots, guide more
equitable multimodal evaluation, and ensure models serve communities beyond the
global mainstream.