SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Abstract

Large audio-language models (LALMs) extend large language models with multimodal understanding of inputs such as speech and audio. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
