Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

March 6, 2025
Authors: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
cs.AI

Abstract

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
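The abstract names a multi-stage curriculum learning strategy as one of AF2's three ingredients but does not spell out the training procedure. The snippet below is a minimal, hypothetical sketch of what such a curriculum can look like in general: training proceeds through ordered stages that progressively admit longer audio clips. The `Stage` and `run_curriculum` names, the length caps, and the stage hyperparameters are illustrative assumptions, not AF2's actual training code.

```python
# Minimal sketch of multi-stage curriculum training: the model first sees
# short clips, then progressively longer ones. All names and values here
# are illustrative assumptions, not the AF2 implementation.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Stage:
    name: str
    max_audio_seconds: float  # cap on clip length admitted in this stage
    epochs: int
    learning_rate: float


def run_curriculum(
    stages: Sequence[Stage],
    dataset: Sequence[dict],  # each item: {"audio_len": float, ...}
    train_one_epoch: Callable[[List[dict], float], None],
) -> None:
    """Train through the stages in order, widening the audio-length cap."""
    for stage in stages:
        subset = [ex for ex in dataset if ex["audio_len"] <= stage.max_audio_seconds]
        for _ in range(stage.epochs):
            train_one_epoch(subset, stage.learning_rate)


if __name__ == "__main__":
    # Toy dataset and a stand-in training step, just to show the control flow.
    data = [{"audio_len": t, "caption": f"clip {i}"}
            for i, t in enumerate([5, 12, 40, 180, 290])]

    def train_one_epoch(batch: List[dict], lr: float) -> None:
        print(f"training on {len(batch)} clips at lr={lr}")

    run_curriculum(
        stages=[
            Stage("short-audio pretraining", max_audio_seconds=30, epochs=1, learning_rate=1e-4),
            Stage("long-audio fine-tuning", max_audio_seconds=300, epochs=1, learning_rate=2e-5),
        ],
        dataset=data,
        train_one_epoch=train_one_epoch,
    )
```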