Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

July 10, 2025
Authors: Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
cs.AI

Abstract

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) for inputs up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.