CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
February 28, 2026
Authors: Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos
cs.AI
Abstract
While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned jointly on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across three dimensions: musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop the CMI reward models (CMI-RMs), a parameter-efficient family of reward models capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores of musicality and alignment on CMI-Pref as well as on existing datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. The training data, benchmark, and reward models are publicly available.
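To make the inference-time scaling idea concrete: a reward model can rank a batch of sampled generations against the same prompt and keep only the highest-scoring candidates. Below is a minimal best-of-N/top-k sketch under assumed interfaces; `generate_candidates` and `score_fn` are hypothetical placeholders, not the released CMI-RM API.

```python
# Minimal sketch of inference-time scaling via top-k filtering.
# `generate_candidates` and `score_fn` are assumed interfaces,
# not the paper's actual implementation.
from typing import Callable, List, Tuple

def top_k_filter(
    prompt: dict,                                # e.g. {"text": ..., "lyrics": ..., "audio": ...}
    generate_candidates: Callable[[dict, int], List[bytes]],
    score_fn: Callable[[dict, bytes], float],    # reward model: higher = better
    n_samples: int = 16,
    k: int = 4,
) -> List[Tuple[float, bytes]]:
    """Sample n_samples generations for the prompt, score each with the
    reward model, and keep the k highest-scoring candidates."""
    candidates = generate_candidates(prompt, n_samples)
    scored = sorted(
        ((score_fn(prompt, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return scored[:k]
```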