

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

April 13, 2026
Authors: Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesha Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
cs.AI

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds, and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training, and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. We open-source all data, code, and methods, along with three variants of AF-Next: AF-Next-Instruct, AF-Next-Think, and AF-Next-Captioner.
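To make the Temporal Audio Chain-of-Thought idea concrete, the sketch below shows one plausible shape such a reasoning trace could take: each intermediate step carries the start and end timestamps of the audio span it is grounded to, so steps can be checked against specific regions of a long recording. This is a minimal illustration under assumed conventions; the `TimedStep` record, its field names, and the `steps_within` helper are hypothetical and are not the paper's actual schema or API.

```python
# Hypothetical sketch of a timestamp-grounded reasoning trace.
# Field names and helper are illustrative assumptions, not AF-Next's schema.
from dataclasses import dataclass

@dataclass
class TimedStep:
    start_s: float   # start of the grounded audio span, in seconds
    end_s: float     # end of the grounded audio span, in seconds
    thought: str     # the intermediate reasoning step

def steps_within(steps, window_start_s, window_end_s):
    """Return the steps whose grounded spans overlap the given time window."""
    return [s for s in steps
            if s.start_s < window_end_s and s.end_s > window_start_s]

# A toy trace over a long recording (values invented for illustration).
trace = [
    TimedStep(12.0, 18.5, "A siren rises in pitch: a vehicle is approaching."),
    TimedStep(95.0, 102.0, "The siren recurs, now fading: the vehicle has passed."),
    TimedStep(640.0, 655.0, "Crowd noise replaces traffic: the scene has changed."),
]

# Only the first step is grounded within the opening minute of audio.
print(len(steps_within(trace, 0.0, 60.0)))  # prints 1
```

Grounding each step to a time span in this way is what makes the reasoning auditable: a reader can replay exactly the cited seconds of audio and verify whether the step's claim holds there.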