AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?
September 22, 2025
作者: Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee
cs.AI
Abstract
Even without directly hearing sounds, humans can effortlessly reason about
auditory properties, such as pitch, loudness, or sound-source associations,
drawing on auditory commonsense. In contrast, language models often lack this
capability, limiting their effectiveness in multimodal interactions. As an
initial step to address this gap, we present AuditoryBench++, a comprehensive
benchmark for evaluating auditory knowledge and reasoning in text-only
settings. The benchmark encompasses tasks that range from basic auditory
comparisons to contextually grounded reasoning, enabling fine-grained analysis
of how models process and integrate auditory concepts. In addition, we
introduce AIR-CoT, a novel auditory imagination reasoning method that generates
and integrates auditory information during inference through span detection
with special tokens and knowledge injection. Extensive experiments with recent
LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both
the off-the-shelf models and those augmented with auditory knowledge. The
project page is available at https://auditorybenchpp.github.io.
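The abstract only sketches how AIR-CoT works. The following minimal Python sketch illustrates one plausible reading of the described mechanism: the model flags a span that needs auditory knowledge using special tokens, an external module supplies that knowledge, and generation resumes with the injected information. The token names and all functions here (`retrieve_auditory_knowledge`, `air_cot_generate`, the `model.generate` interface) are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of an AIR-CoT-style inference loop (assumed interface, not the
# authors' code). Special tokens <aud> ... </aud> mark a span that requires
# auditory knowledge; an external module fills it in before generation resumes.

AUD_OPEN, AUD_CLOSE = "<aud>", "</aud>"  # hypothetical special tokens


def retrieve_auditory_knowledge(span: str) -> str:
    """Placeholder: map a detected span (e.g. 'a piccolo vs. a tuba') to a
    short textual auditory fact (e.g. 'a piccolo is much higher-pitched')."""
    return f"[auditory fact about: {span}]"


def air_cot_generate(model, prompt: str, max_rounds: int = 4) -> str:
    """Generate text, pausing whenever the model closes an auditory span so
    that retrieved auditory knowledge can be injected into the context."""
    context = prompt
    for _ in range(max_rounds):
        # Continue generation, stopping once the model emits the closing token.
        continuation = model.generate(context, stop=[AUD_CLOSE])
        context += continuation
        if AUD_OPEN not in continuation:
            break  # no auditory span was flagged; the answer is complete
        # Extract the flagged span and inject knowledge for it, then resume.
        span = continuation.rsplit(AUD_OPEN, 1)[-1]
        context += AUD_CLOSE + " " + retrieve_auditory_knowledge(span) + " "
    return context
```

In this reading, the special tokens act as a pause-and-fill signal: span detection decides *where* auditory imagination is needed, and knowledge injection supplies *what* is imagined before reasoning continues.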