AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

May 20, 2026
Authors: Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh
cs.AI

Abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an ℓ₁-Regularized Logistic Regression Refiner, which selects the Top-N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
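
The rubric-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the solver, the hyperparameters, or the exact feature construction, so everything below is an assumption made for illustration. The sketch assumes each preference pair is represented by a vector of rubric-score differences (score of image A minus score of image B under each rubric), fits an ℓ₁-regularized logistic regression by proximal gradient descent, and keeps the N rubrics with the largest absolute weights. The function names `fit_l1_logistic` and `select_top_n`, the regularization strength, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_l1_logistic(X, y, lam=0.05, lr=0.1, n_iters=2000):
    """Fit L1-regularized logistic regression by proximal gradient descent.

    X: (n_pairs, n_rubrics) rubric-score differences, score(A) - score(B).
    y: (n_pairs,) labels, 1.0 if image A is preferred, else 0.0.
    No intercept is used (an assumption of this sketch): swapping A and B
    flips the sign of every difference and the label, so the model is
    symmetric around zero.
    """
    n_pairs, n_rubrics = X.shape
    w = np.zeros(n_rubrics)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(A preferred)
        grad = X.T @ (p - y) / n_pairs          # gradient of mean log-loss
        w = w - lr * grad                       # gradient step
        # Soft-thresholding: the proximal operator of the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w


def select_top_n(w, n):
    """Return indices of the n rubrics with largest nonzero |weight|."""
    order = np.argsort(-np.abs(w))
    return [int(i) for i in order[:n] if w[i] != 0.0]


# Synthetic illustration only (not data from the paper): 6 candidate
# rubrics, of which only rubrics 0 and 3 carry preference signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
logits = 2.0 * X[:, 0] + 1.5 * X[:, 3] + 0.5 * rng.normal(size=200)
y = (logits > 0).astype(float)

weights = fit_l1_logistic(X, y)
top = select_top_n(weights, 2)
print("weights:", np.round(weights, 3))
print("selected rubrics:", sorted(top))
```

The ℓ₁ penalty drives the weights of uninformative or redundant rubrics toward exactly zero, which is why it suits selection rather than only shrinkage. In this sketch the surviving weights could then serve as coefficients for combining per-rubric VLM scores into a single reward, though how the paper aggregates the selected rubrics is not stated in the abstract.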