TuneJury: An Open Metric for Improving Preference Alignment in Music Generation

Abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
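
To make the roles of the score margin, the threshold filter, and best-of-N selection concrete, the following is a minimal sketch rather than the released implementation. It assumes the reward is a scalar per (prompt, clip) pair, that the margin maps to a preference probability through the standard Bradley-Terry logistic link, and that filtering thresholds the absolute margin. The names `preference_probability`, `filter_pairs_by_margin`, `best_of_n`, and `score_fn` are hypothetical and are not taken from the TuneJury repository.

```python
import math
from typing import Callable, List, Sequence, Tuple


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B.

    Assumption: the preference probability is the logistic function of the
    score margin. Written in a numerically stable form for large margins.
    """
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def filter_pairs_by_margin(
    pairs: Sequence[Tuple[float, float]], threshold: float
) -> List[Tuple[float, float]]:
    """Keep only score pairs whose absolute margin is at least `threshold`.

    Assumption: "score threshold" is read here as a threshold on the margin.
    """
    return [(a, b) for a, b in pairs if abs(a - b) >= threshold]


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward under `score_fn`.

    `score_fn` is a hypothetical stand-in for a frozen reward model that has
    already been bound to a text prompt.
    """
    return max(candidates, key=score_fn)
```

Under these assumptions, a well-calibrated margin means that among pairs with predicted probability near p, roughly a fraction p of human votes should favor clip A, which is what would make a fixed margin threshold a reasonable filtering rule.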