TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Abstract

Large audio-language models (LALMs) are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds and overlook culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that community members recognize instantly but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline that combines curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved from transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure that models serve communities beyond the global mainstream.