RewardDance: Reward Scaling in Visual Generation
September 10, 2025
Authors: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang
cs.AI
Abstract
Reward Models (RMs) are critical for improving generation models via
Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation
remains largely unexplored. This is primarily due to fundamental limitations in
existing approaches: CLIP-based RMs suffer from architectural and input
modality constraints, while prevalent Bradley-Terry losses are fundamentally
misaligned with the next-token prediction mechanism of Vision-Language Models
(VLMs), hindering effective scaling. More critically, the RLHF optimization
process is plagued by the reward hacking issue, where models exploit flaws in the
reward signal without improving true quality. To address these challenges, we
introduce RewardDance, a scalable reward modeling framework that overcomes
these barriers through a novel generative reward paradigm. By reformulating the
reward score as the model's probability of predicting a "yes" token, indicating
that the generated image outperforms a reference image according to specific
criteria, RewardDance intrinsically aligns reward objectives with VLM
architectures. This alignment unlocks scaling across two dimensions: (1) Model
Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context
Scaling: Integration of task-specific instructions, reference examples, and
chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that
RewardDance significantly surpasses state-of-the-art methods in text-to-image,
text-to-video, and image-to-video generation. Crucially, we resolve the
persistent challenge of "reward hacking": Our large-scale RMs exhibit and
maintain high reward variance during RL fine-tuning, proving their resistance
to hacking and their ability to produce diverse, high-quality outputs. This
greatly alleviates the mode collapse problem that plagues smaller models.
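To make the generative reward formulation concrete, below is a minimal sketch of how a reward can be read off a VLM as the probability of a "yes" token, as described in the abstract. The names `vlm`, `build_comparison_prompt`, `YES_TOKEN_ID`, and `NO_TOKEN_ID` are hypothetical placeholders for whatever backbone and prompt template are actually used; this is an illustration under those assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F


def generative_reward(yes_logit: torch.Tensor, no_logit: torch.Tensor) -> torch.Tensor:
    """Reward as the probability of the "yes" token.

    `yes_logit` and `no_logit` are the VLM's next-token logits for the "yes"
    and "no" answer tokens, obtained after feeding a comparison prompt of the
    form (task-specific instruction, reference image, generated image,
    optional chain-of-thought). Works for scalar or batched logits.
    """
    # Softmax restricted to the two answer tokens; the reward is P("yes").
    probs = F.softmax(torch.stack([yes_logit, no_logit], dim=-1), dim=-1)
    return probs[..., 0]


# Hypothetical usage (placeholders, not an actual API):
# logits = vlm(build_comparison_prompt(instruction, ref_image, gen_image))
# reward = generative_reward(logits[YES_TOKEN_ID], logits[NO_TOKEN_ID])
```

Because the score is a next-token probability rather than a Bradley-Terry head, this formulation stays native to the VLM's architecture, which is what the abstract argues enables scaling the RM in both model size and context.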