AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
