GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
June 17, 2024
Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
cs.AI
Abstract
Perceiving and understanding non-speech sounds and non-verbal speech is
essential to making decisions that help us interact with our surroundings. In
this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model
(LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We
build GAMA by integrating an LLM with multiple types of audio representations,
including features from a custom Audio Q-Former and a multi-layer aggregator that
aggregates features from multiple layers of an audio encoder. We fine-tune GAMA
on a large-scale audio-language dataset, which augments it with audio
understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for
Complex Audio Reasoning), a synthetically generated instruction-tuning (IT)
dataset with instructions that require the model to perform complex reasoning
on the input audio. We instruction-tune GAMA with CompA-R to endow it with
complex reasoning abilities, where we further add a soft prompt as input with
high-level semantic evidence by leveraging event tags of the input audio.
Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for
evaluating the capabilities of LALMs on open-ended audio question-answering
that requires complex reasoning. Through automated and expert human
evaluations, we show that GAMA outperforms all other LALMs in the literature on
diverse audio understanding tasks by margins of 1%-84%. Further, GAMA
instruction-tuned on CompA-R proves to be superior in its complex reasoning and
instruction-following capabilities.