Music Flamingo: Scaling Music Understanding in Audio Language Models
November 13, 2025
作者: Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
cs.AI
Abstract
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
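To make the post-training recipe concrete, the sketch below illustrates the group-relative advantage computation at the core of GRPO. The paper does not specify its custom rewards or output format, so the reward here is a hypothetical stand-in (an answer-match term plus a bonus for an explicit chain-of-thought format with <think>/<answer> tags); it is a minimal illustration, not the authors' implementation.

import re
import statistics

def reward(completion: str, reference_answer: str) -> float:
    """Hypothetical reward: correctness of the final answer plus a small
    bonus for following an explicit chain-of-thought answer format."""
    fmt_bonus = 0.2 if re.search(r"<think>.*</think>", completion, re.S) else 0.0
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    answer = m.group(1).strip().lower() if m else completion.strip().lower()
    correct = 1.0 if reference_answer.strip().lower() in answer else 0.0
    return correct + fmt_bonus

def grpo_advantages(completions, reference_answer):
    """GRPO's key idea: score a group of sampled completions for the same
    prompt, then normalize each reward against the group mean and std."""
    rewards = [reward(c, reference_answer) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Usage: sample several answers to one music-theory question and score them.
group = [
    "<think>The loop moves I-vi-IV-V in C major.</think><answer>I-vi-IV-V</answer>",
    "<answer>ii-V-I</answer>",
]
print(grpo_advantages(group, "I-vi-IV-V"))

In an actual GRPO update these per-completion advantages would weight the policy-gradient loss; the group-relative normalization removes the need for a separate value model.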