Improving Text-to-Music Generation with Human Preference Rewards
June 19, 2026
Authors: Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Chris Donahue
cs.AI
Abstract
We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the FAD-CLAP and CLAP scores specified by the challenge protocol, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained on open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time classifier-free guidance (CFG) axis, (ii) a sweep over five score-conditioning architectures, where training and inference use different variants, (iii) expert iteration on the top decile of the data, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. A per-stage decomposition on 100 Song Describer prompts shows that training-time reward conditioning acts as a functional conditioning axis, that expert iteration is the dominant contributor, that the preference-tuning pass adds only noise-level gain, and that the inference-time score scalar is already saturated by the end of the chain.
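As a rough illustration of two of the steps named above, the following minimal Python sketch shows (a) top-decile selection by a reward score, as used in expert iteration, and (b) one common way to combine a text guidance axis with a reward guidance axis. The abstract does not give the exact formulas, so everything here is an assumption: the function names `top_decile_indices` and `joint_cfg`, the guidance weights `w_text` and `w_reward`, and the particular two-axis composition are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Sequence


def top_decile_indices(rewards: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Return the indices of the highest-reward samples.

    Keeps floor(fraction * n) samples, but never fewer than one.
    The 10% default mirrors the "top decile" wording; the rounding rule
    is an assumption of this sketch.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not rewards:
        return []
    keep = max(1, int(len(rewards) * fraction))
    # Sort indices by reward, highest first, then keep the best `keep`.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return sorted(order[:keep])


def joint_cfg(
    v_uncond: Sequence[float],
    v_text: Sequence[float],
    v_text_reward: Sequence[float],
    w_text: float,
    w_reward: float,
) -> List[float]:
    """Combine two guidance axes on a model prediction (hypothetical form).

    v_uncond:       prediction with no conditioning
    v_text:         prediction conditioned on the text prompt only
    v_text_reward:  prediction conditioned on text and a high reward value

    Assumed composition:
        v = v_uncond + w_text * (v_text - v_uncond)
                     + w_reward * (v_text_reward - v_text)
    """
    return [
        u + w_text * (t - u) + w_reward * (r - t)
        for u, t, r in zip(v_uncond, v_text, v_text_reward)
    ]


if __name__ == "__main__":
    # Toy rewards standing in for preference-model scores of 20 samples.
    toy_rewards = [float(i) for i in range(20)]
    print(top_decile_indices(toy_rewards))
    print(joint_cfg([0.0, 1.0], [1.0, 1.0], [2.0, 0.0], w_text=3.0, w_reward=2.0))
```

In this sketch, setting `w_reward` to zero reduces `joint_cfg` to ordinary single-axis CFG on the text prompt, which makes it easy to check whether the reward axis changes the output at all.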