RewardDance: Reward Scaling in Visual Generation
September 10, 2025
Authors: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang
cs.AI
Abstract
Reward Models (RMs) are critical for improving generation models via
Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation
remains largely unexplored. This is primarily due to fundamental limitations in
existing approaches: CLIP-based RMs suffer from architectural and input
modality constraints, while prevalent Bradley-Terry losses are fundamentally
misaligned with the next-token prediction mechanism of Vision-Language Models
(VLMs), hindering effective scaling. More critically, the RLHF optimization
process is plagued by the reward hacking issue, where models exploit flaws in the
reward signal without improving true quality. To address these challenges, we
introduce RewardDance, a scalable reward modeling framework that overcomes
these barriers through a novel generative reward paradigm. By reformulating the
reward score as the model's probability of predicting a "yes" token, indicating
that the generated image outperforms a reference image according to specific
criteria, RewardDance intrinsically aligns reward objectives with VLM
architectures. This alignment unlocks scaling across two dimensions: (1) Model
Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context
Scaling: Integration of task-specific instructions, reference examples, and
chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that
RewardDance significantly surpasses state-of-the-art methods in text-to-image,
text-to-video, and image-to-video generation. Crucially, we resolve the
persistent challenge of "reward hacking": Our large-scale RMs exhibit and
maintain high reward variance during RL fine-tuning, proving their resistance
to hacking and their ability to produce diverse, high-quality outputs. This greatly
alleviates the mode collapse problem that plagues smaller models.
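To make the generative reward paradigm concrete, the following is a minimal sketch (not the authors' released code) of how a reward can be read off a VLM's next-token logits as the probability of a "yes" verdict, renormalized over a {"yes", "no"} pair. The token ids and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a generative reward head: the reward is the VLM's
# probability of emitting "yes" (candidate beats reference), renormalized
# over the {"yes", "no"} verdict tokens. All names/ids are illustrative.
import torch
import torch.nn.functional as F


def generative_reward(logits: torch.Tensor, yes_token_id: int, no_token_id: int) -> torch.Tensor:
    """Convert next-token logits into a scalar reward in [0, 1].

    logits: (batch, vocab_size) next-token logits produced by the VLM after it
            has seen the task instruction, reference example(s), the candidate
            image, and (optionally) its chain-of-thought reasoning.
    Returns P("yes") over the {"yes", "no"} pair, read as the probability that
    the candidate outperforms the reference under the given criteria.
    """
    pair = logits[:, [yes_token_id, no_token_id]]   # keep only the two verdict tokens
    probs = F.softmax(pair, dim=-1)                 # renormalize over {yes, no}
    return probs[:, 0]                              # reward = P("yes")


if __name__ == "__main__":
    # Toy usage with random logits standing in for real VLM output.
    vocab_size, yes_id, no_id = 32000, 9454, 2360   # hypothetical token ids
    fake_logits = torch.randn(4, vocab_size)
    print(generative_reward(fake_logits, yes_id, no_id))  # four rewards in [0, 1]
```

Because this score is just a token probability under the VLM's own prediction head, it scales with the backbone (model scaling) and with whatever instructions, references, or reasoning traces are placed in the prompt (context scaling), without a separate Bradley-Terry objective.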