AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
