Music Flamingo: Scaling Music Understanding in Audio Language Models
November 13, 2025
Authors: Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
cs.AI
Abstract
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.