Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
July 10, 2025
Authors: Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
cs.AI
Abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large
audio-language model that advances reasoning and understanding across speech,
sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder
trained using a novel strategy for joint representation learning across all 3
modalities of speech, sound, and music; (ii) flexible, on-demand thinking,
allowing the model to do chain-of-thought-type reasoning before answering;
(iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning
(including speech) up to 10 minutes; and (v) voice-to-voice interaction. To
enable these capabilities, we propose several large-scale training datasets
curated using novel strategies, including AudioSkills-XL, LongAudio-XL,
AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based
training strategy. Trained on only open-source audio data, AF3 achieves new
SOTA results on over 20 (long) audio understanding and reasoning benchmarks,
surpassing both open-weight and closed-source models trained on much larger
datasets.