GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Abstract

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and from a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, and we further add a soft prompt as input that carries high-level semantic evidence derived from event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1% to 84%. Further, GAMA instruction-tuned on CompA-R proves to be superior in its complex reasoning and instruction-following capabilities.
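
The abstract mentions aggregating features from multiple layers of an audio encoder but gives no architectural detail. As a rough illustration of the general idea only, the sketch below fuses per-layer feature vectors with a softmax-weighted average. This is an assumption made for illustration, not GAMA's actual aggregator: the function names (softmax, aggregate_layers), the per-layer scores, and the toy feature values are all hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features: List[List[float]],
                     layer_scores: List[float]) -> List[float]:
    """Fuse one feature vector per encoder layer into a single vector.

    Hypothetical scheme: a softmax-weighted average across layers.
    In a trained model the scores would be learned parameters.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    dim = len(layer_features[0])
    if any(len(f) != dim for f in layer_features):
        raise ValueError("all layers must share the same feature size")
    weights = softmax(layer_scores)
    return [
        sum(w * f[i] for w, f in zip(weights, layer_features))
        for i in range(dim)
    ]


# Toy example: three "layers", each producing a 2-dimensional feature.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = aggregate_layers(features, [0.0, 0.0, 0.0])  # equal scores
```

With equal scores every layer receives the same weight, so the result is the plain mean of the layer features. Unequal scores would let some layers contribute more than others to the fused representation.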
