

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

April 13, 2026
Authors: Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesha Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
cs.AI

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds, and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training, and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. We open-source all data, code, and methods, along with three variants of AF-Next: AF-Next-Instruct, AF-Next-Think, and AF-Next-Captioner.
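To make the Temporal Audio Chain-of-Thought idea concrete, the sketch below shows one plausible shape such a reasoning trace could take: each intermediate step carries the start and end timestamps of the audio span it is grounded to, so steps can be checked against specific regions of a long recording. This is a minimal illustration under assumed conventions; the `TimedStep` record, its field names, and the `steps_within` helper are hypothetical and are not the paper's actual schema or API.

```python
# Hypothetical sketch of a timestamp-grounded reasoning trace.
# Field names and helper are illustrative assumptions, not AF-Next's schema.
from dataclasses import dataclass

@dataclass
class TimedStep:
    start_s: float   # start of the grounded audio span, in seconds
    end_s: float     # end of the grounded audio span, in seconds
    thought: str     # the intermediate reasoning step

def steps_within(steps, window_start_s, window_end_s):
    """Return the steps whose grounded spans overlap the given time window."""
    return [s for s in steps
            if s.start_s < window_end_s and s.end_s > window_start_s]

# A toy trace over a long recording (values invented for illustration).
trace = [
    TimedStep(12.0, 18.5, "A siren rises in pitch: a vehicle is approaching."),
    TimedStep(95.0, 102.0, "The siren recurs, now fading: the vehicle has passed."),
    TimedStep(640.0, 655.0, "Crowd noise replaces traffic: the scene has changed."),
]

# Only the first step is grounded within the opening minute of audio.
print(len(steps_within(trace, 0.0, 60.0)))  # prints 1
```

Grounding each step to a time span in this way is what makes the reasoning auditable: a reader can replay exactly the cited seconds of audio and verify whether the step's claim holds there.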