CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
February 28, 2026
Authors: Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos
cs.AI
Abstract
While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across three dimensions: musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient family of reward models capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment, using CMI-Pref as well as existing datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. All training data, benchmarks, and reward models are publicly available.
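To make the inference-time scaling idea concrete, the sketch below shows one common form of top-k reward filtering: sample several candidates for a single instruction, score each with a reward model, and keep only the highest-scoring ones. This is a minimal illustration under stated assumptions; the `generate` and `score` callables are hypothetical placeholders standing in for a music generator and a CMI-RM, not the paper's released API.

```python
# Minimal sketch of inference-time scaling via top-k reward filtering.
# `generate` and `score` are hypothetical placeholders, not the paper's API.
from typing import Callable, List, Tuple


def top_k_filter(
    instruction: dict,
    generate: Callable[[dict], object],      # generator: instruction -> audio sample
    score: Callable[[dict, object], float],  # reward model: (instruction, audio) -> scalar
    n_candidates: int = 16,
    k: int = 4,
) -> List[Tuple[float, object]]:
    """Sample n candidates, score each with the reward model,
    and return the k highest-scoring (score, sample) pairs.
    With k == 1 this reduces to best-of-n selection."""
    candidates = [generate(instruction) for _ in range(n_candidates)]
    scored = sorted(
        ((score(instruction, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return scored[:k]
```

The key design point is that the reward model acts purely as a post-hoc filter: the generator is unchanged, and quality is traded for extra sampling compute, which only pays off when the reward model's scores correlate well with human judgments.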