

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

May 19, 2025
Authors: Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
cs.AI

Abstract

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and other signals. While their performance on speech- and audio-processing tasks has been studied extensively, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech- and audio-processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation of LALMs, offering insights and resources for future research.
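To make the evaluated contrast concrete, below is a minimal, hypothetical Python sketch of a SAKURA-style evaluation loop: it compares a model's accuracy on single-hop questions (extracting an attribute from the audio) against paired multi-hop questions (combining that attribute with a known fact). The `model.answer(...)` interface, the field names, and the exact-match scoring are illustrative assumptions, not the benchmark's actual harness.

```python
# Hypothetical sketch of a SAKURA-style multi-hop evaluation loop.
# The `model.answer(audio_path, question)` interface is an assumed
# placeholder, not the benchmark's actual harness or any specific LALM API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    audio_path: str    # speech/audio clip the questions are grounded in
    single_hop_q: str  # probes the audio attribute directly, e.g. "What animal is heard?"
    single_hop_a: str
    multi_hop_q: str   # requires combining the extracted attribute with a known fact
    multi_hop_a: str

def evaluate(model, examples: List[Example]) -> Tuple[float, float]:
    """Return (single-hop accuracy, multi-hop accuracy) under exact-match scoring."""
    single_correct = multi_correct = 0
    for ex in examples:
        # Hop 1: can the model extract the relevant attribute from the audio?
        if model.answer(ex.audio_path, ex.single_hop_q).strip().lower() == ex.single_hop_a.lower():
            single_correct += 1
        # Hops 1+2: can it also integrate that attribute with stored knowledge?
        if model.answer(ex.audio_path, ex.multi_hop_q).strip().lower() == ex.multi_hop_a.lower():
            multi_correct += 1
    n = len(examples)
    return single_correct / n, multi_correct / n
```

A large gap between the two scores (high single-hop, low multi-hop accuracy) corresponds to the failure mode the abstract describes: correct extraction without successful integration.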

