AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

September 22, 2025
Authors: Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee
cs.AI

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
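
The abstract describes AIR-CoT only at a high level: spans that need auditory knowledge are marked with special tokens during inference, and generated auditory information is injected before reasoning continues. The following minimal Python sketch illustrates that loop under stated assumptions; the [AUD]/[/AUD] token names, the imagine_auditory_knowledge helper, and the example lookup are all hypothetical placeholders, not details taken from the paper.

```python
# Illustrative sketch of a span-detection-plus-knowledge-injection loop,
# assuming hypothetical special tokens [AUD]/[/AUD]. The real AIR-CoT
# implementation may differ substantially.
import re

AUD_OPEN, AUD_CLOSE = "[AUD]", "[/AUD]"


def imagine_auditory_knowledge(span: str) -> str:
    """Placeholder for the 'auditory imagination' step.

    A real system would query a model here to generate auditory
    knowledge for the span; this stub returns a canned fact.
    """
    lookup = {
        "a mosquito's buzz vs. a cello's open C string":
            "a mosquito's buzz (~600 Hz) is higher-pitched than a cello's open C (~65 Hz)",
    }
    return lookup.get(span, f"auditory knowledge about: {span}")


def air_cot_step(partial_generation: str) -> str:
    """Detect special-token spans and replace them with injected knowledge."""
    pattern = re.compile(re.escape(AUD_OPEN) + r"(.*?)" + re.escape(AUD_CLOSE))
    return pattern.sub(lambda m: imagine_auditory_knowledge(m.group(1)),
                       partial_generation)


draft = ("To compare pitch, consider " + AUD_OPEN +
         "a mosquito's buzz vs. a cello's open C string" + AUD_CLOSE +
         "; hence the mosquito sounds higher.")
print(air_cot_step(draft))
```

The design point the sketch captures is that knowledge injection happens mid-inference, at the detected span, rather than as a one-time augmentation of the prompt, which is what the abstract contrasts AIR-CoT against.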