AutoRubric-T2I: A Robust Rule-Based Reward Model for Text-Image Alignment

Abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, which makes them costly to train, difficult to adapt, and opaque in their evaluation criteria. Vision-Language Model (VLM) judges, by contrast, can provide more fine-grained assessments through textual rubrics, but manually designed or heuristically generated scoring rules may fail to reflect human preferences reliably. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an ℓ₁-regularized logistic regression refiner, which selects the Top-N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward in the Flow-GRPO pipeline on diffusion models, where it improves generation quality over scalar reward models on downstream T2I benchmarks including TIIF and UniGenBench++.
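The rubric selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_l1_logistic`, `select_top_n`, `make_synthetic_pairs`), the hyperparameters, and the choice of proximal gradient descent as the solver are all assumptions made for illustration, and synthetic random numbers stand in for the rubric scores a VLM judge would produce. Each row holds the per-rubric score differences for one preference pair, and the label records which image was preferred.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(diffs, labels, lam=0.05, lr=0.5, steps=300):
    """L1-regularized logistic regression (no intercept) via proximal gradient descent.

    diffs[i][j]: rubric j's score for image A minus its score for image B in pair i.
    labels[i]:   1 if image A was preferred in pair i, else 0.
    lam, lr, steps are illustrative values, not taken from the paper.
    """
    n, d = len(diffs), len(diffs[0])
    w = [0.0] * d
    for _ in range(steps):
        # Gradient of the mean logistic loss.
        grad = [0.0] * d
        for x, y in zip(diffs, labels):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        # Gradient step followed by soft-thresholding (the L1 proximal operator).
        for j in range(d):
            v = w[j] - lr * grad[j]
            w[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return w


def select_top_n(weights, n):
    """Indices of the n largest-magnitude nonzero weights, most important first."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [j for j in ranked[:n] if weights[j] != 0.0]


def make_synthetic_pairs(num_pairs=200, num_rubrics=5, seed=0):
    """Synthetic stand-in for VLM judge output: only rubrics 0 and 2 drive the label."""
    rng = random.Random(seed)
    diffs = [[rng.uniform(-1.0, 1.0) for _ in range(num_rubrics)]
             for _ in range(num_pairs)]
    labels = [1 if 2.0 * x[0] + 1.5 * x[2] > 0 else 0 for x in diffs]
    return diffs, labels


if __name__ == "__main__":
    diffs, labels = make_synthetic_pairs()
    weights = fit_l1_logistic(diffs, labels)
    print("weights:", [round(w, 3) for w in weights])
    print("selected rubrics:", select_top_n(weights, n=2))
```

In this toy setup the ℓ₁ penalty shrinks the weights of uninformative rubrics toward zero, so ranking by absolute weight recovers the rubrics that actually explain the preferences. A library solver with an ℓ₁ penalty would serve the same purpose in practice.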