RewardDance: Reward Scaling in Visual Generation

Abstract

Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while the prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking problem, in which models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of reward hacking: our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, demonstrating their resistance to hacking and their ability to produce diverse, high-quality outputs. This greatly alleviates the mode collapse problem that plagues smaller models.
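To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reward model exposes next-token logits over its vocabulary and that a single token id stands for "yes". The toy vocabulary, the logit values, and the function names `generative_reward` and `bradley_terry_loss` are hypothetical and chosen only for illustration. The generative reward is read directly from the next-token distribution, whereas a Bradley-Terry objective compares two scalar scores that would come from a separate regression head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generative_reward(next_token_logits, yes_token_id):
    """Reward = probability that the next token is "yes".

    Assumption: the prompt asked whether the generated image is better
    than the reference image, so P("yes") acts as the reward score.
    """
    return softmax(next_token_logits)[yes_token_id]


def bradley_terry_loss(score_chosen, score_rejected):
    """Standard Bradley-Terry pairwise loss: -log(sigmoid(s_c - s_r)).

    Shown only for contrast; it operates on scalar scores rather than
    on the model's token distribution.
    """
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


# Hypothetical toy vocabulary and logits, for illustration only.
vocab = {"yes": 0, "no": 1, "maybe": 2}
logits = [2.0, 0.5, -1.0]

reward = generative_reward(logits, vocab["yes"])
print(f"generative reward P(yes) = {reward:.3f}")
print(f"Bradley-Terry loss for scores (1.0, 0.0) = {bradley_terry_loss(1.0, 0.0):.3f}")
```

Because the generative reward is an ordinary next-token probability, it can be trained with the same objective a VLM already uses, which is the alignment the abstract refers to.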

