TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Abstract

Large audio-language models (LALMs) are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds and overlook culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline that combines curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved from transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure that models serve communities beyond the global mainstream.