SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Abstract

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and other modalities. While their performance on speech and audio processing tasks has been studied extensively, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark that assesses LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation of LALMs and offer insights and resources for future research.
