AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
