AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties such as pitch, loudness, and sound-source associations by drawing on auditory commonsense. Language models, in contrast, often lack this capability, which limits their effectiveness in multimodal interactions. As an initial step toward addressing this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs (Large Language Models) and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
