

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

April 13, 2026
作者: Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
cs.AI

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds, and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training, and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to releasing all data, code, and methods, we open-source three variants of AF-Next: AF-Next-Instruct, AF-Next-Think, and AF-Next-Captioner.
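The Temporal Audio Chain-of-Thought paradigm described above grounds each intermediate reasoning step to a time span in the input audio. As a minimal illustrative sketch, the snippet below parses a hypothetical timestamp-grounded reasoning trace into (start, end, rationale) tuples; the bracketed `[MM:SS-MM:SS]` format and the example trace are assumptions for illustration, not the paper's actual output format.

```python
import re

# Hypothetical example of a timestamp-grounded reasoning trace.
# The concrete output format of AF-Next-Think is not specified in
# this abstract; this layout is an illustrative assumption.
trace = """\
[00:12-00:45] A door creaks open, suggesting someone enters the room.
[01:30-02:10] Footsteps grow louder, so the person approaches the mic.
[14:05-14:20] A phone rings, resolving the question about the alarm."""

# One reasoning step per line: a [MM:SS-MM:SS] span, then the rationale.
STEP = re.compile(r"\[(\d{2}):(\d{2})-(\d{2}):(\d{2})\]\s*(.+)")

def parse_trace(text):
    """Parse each reasoning step into (start_sec, end_sec, rationale)."""
    steps = []
    for line in text.splitlines():
        m = STEP.match(line)
        if m:
            m1, s1, m2, s2, why = m.groups()
            steps.append((int(m1) * 60 + int(s1),
                          int(m2) * 60 + int(s2),
                          why))
    return steps

steps = parse_trace(trace)
print(len(steps))      # 3 steps parsed
print(steps[2][:2])    # (845, 860) -- the 14:05-14:20 span in seconds
```

Such a grounded trace lets the time span of every claim be checked against the audio directly, which is the fine-grained temporal alignment and interpretability benefit the abstract describes.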
April 15, 2026