SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

May 19, 2025
Authors: Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
cs.AI

Abstract

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and related signals. While their performance on speech and audio processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation of LALMs, offering insights and resources for future research.
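
To make the notion of multi-hop reasoning over audio concrete, the sketch below shows what an evaluation item of this kind might look like: a first hop that extracts a fact directly from the audio, and a second hop that combines that fact with world knowledge. The field names, the example question, and the `evaluate_item` helper are illustrative assumptions for exposition, not the actual SAKURA data format or evaluation protocol.

```python
# Illustrative sketch of a multi-hop evaluation item for an audio-language model.
# The schema, example, and model interface below are hypothetical, not SAKURA's actual format.
from dataclasses import dataclass


@dataclass
class MultiHopItem:
    audio_path: str           # audio clip the model must listen to
    single_hop_question: str  # hop 1: extract a fact directly from the audio
    multi_hop_question: str   # hop 2: combine that fact with world knowledge
    single_hop_answer: str
    multi_hop_answer: str


item = MultiHopItem(
    audio_path="dog_bark.wav",
    single_hop_question="What animal is making the sound?",
    multi_hop_question="Is the animal making the sound commonly kept as a pet?",
    single_hop_answer="dog",
    multi_hop_answer="yes",
)


def evaluate_item(model, item: MultiHopItem) -> dict:
    """Query a (hypothetical) LALM on both hops and score each answer by substring match."""
    pred_single = model.answer(item.audio_path, item.single_hop_question)
    pred_multi = model.answer(item.audio_path, item.multi_hop_question)
    return {
        "single_hop_correct": item.single_hop_answer.lower() in pred_single.lower(),
        "multi_hop_correct": item.multi_hop_answer.lower() in pred_multi.lower(),
    }
```

Comparing accuracy on the two hops in this way is one possible view of the gap the abstract describes: a model may answer the single-hop question correctly yet still fail the multi-hop question that depends on that same extracted fact.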
