RewardDance: Reward Scaling in Visual Generation
September 10, 2025
Authors: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang
cs.AI
Abstract
Reward Models (RMs) are critical for improving generation models via
Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation
remains largely unexplored. This is primarily due to fundamental limitations in
existing approaches: CLIP-based RMs suffer from architectural and input
modality constraints, while prevalent Bradley-Terry losses are fundamentally
misaligned with the next-token prediction mechanism of Vision-Language Models
(VLMs), hindering effective scaling. More critically, the RLHF optimization
process is plagued by the reward hacking issue, where models exploit flaws in the
reward signal without improving true quality. To address these challenges, we
introduce RewardDance, a scalable reward modeling framework that overcomes
these barriers through a novel generative reward paradigm. By reformulating the
reward score as the model's probability of predicting a "yes" token, indicating
that the generated image outperforms a reference image according to specific
criteria, RewardDance intrinsically aligns reward objectives with VLM
architectures. This alignment unlocks scaling across two dimensions: (1) Model
Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context
Scaling: Integration of task-specific instructions, reference examples, and
chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that
RewardDance significantly surpasses state-of-the-art methods in text-to-image,
text-to-video, and image-to-video generation. Crucially, we resolve the
persistent challenge of "reward hacking": Our large-scale RMs exhibit and
maintain high reward variance during RL fine-tuning, proving their resistance
to hacking and their ability to produce diverse, high-quality outputs. This
greatly alleviates the mode collapse problem that plagues smaller models.
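To make the generative reward formulation concrete, below is a minimal sketch of how a reward can be read off a VLM as the probability of a "yes" token, as described in the abstract. The names `vlm`, `build_comparison_prompt`, `YES_TOKEN_ID`, and `NO_TOKEN_ID` are hypothetical placeholders for whatever backbone and prompt template are actually used; this is an illustration under those assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F


def generative_reward(yes_logit: torch.Tensor, no_logit: torch.Tensor) -> torch.Tensor:
    """Reward as the probability of the "yes" token.

    `yes_logit` and `no_logit` are the VLM's next-token logits for the "yes"
    and "no" answer tokens, obtained after feeding a comparison prompt of the
    form (task-specific instruction, reference image, generated image,
    optional chain-of-thought). Works for scalar or batched logits.
    """
    # Softmax restricted to the two answer tokens; the reward is P("yes").
    probs = F.softmax(torch.stack([yes_logit, no_logit], dim=-1), dim=-1)
    return probs[..., 0]


# Hypothetical usage (placeholders, not an actual API):
# logits = vlm(build_comparison_prompt(instruction, ref_image, gen_image))
# reward = generative_reward(logits[YES_TOKEN_ID], logits[NO_TOKEN_ID])
```

Because the score is a next-token probability rather than a Bradley-Terry head, this formulation stays native to the VLM's architecture, which is what the abstract argues enables scaling the RM in both model size and context.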