AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

May 20, 2026
Authors: Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh
cs.AI

Abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, which makes them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score the paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an ℓ1-regularized logistic regression refiner, which selects the Top-N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models when used in the Flow-GRPO pipeline on diffusion models.
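The rubric-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no code, so the function names (`fit_l1_logistic`, `select_top_n`, `make_toy_data`), the hyperparameters, the choice of proximal gradient descent as the solver, and the synthetic data standing in for VLM-judge scores are all assumptions made for illustration. Each row of the input holds the per-rubric score differences for one preference pair (score of image A minus score of image B), and the label is 1 when annotators preferred image A.

```python
import numpy as np


def fit_l1_logistic(diffs, prefs, lam=0.05, lr=0.1, n_iter=2000):
    """Fit an L1-regularized logistic regression (no intercept) with
    proximal gradient descent.

    diffs: (n_pairs, n_rubrics) rubric-score differences, score(A) - score(B)
    prefs: (n_pairs,) 1.0 if image A was preferred, else 0.0
    """
    n_pairs, n_rubrics = diffs.shape
    w = np.zeros(n_rubrics)
    for _ in range(n_iter):
        # Predicted probability that image A is preferred.
        p = 1.0 / (1.0 + np.exp(-(diffs @ w)))
        # Gradient step on the mean logistic loss.
        grad = diffs.T @ (p - prefs) / n_pairs
        w = w - lr * grad
        # Soft-thresholding: the proximal step for the L1 penalty,
        # which drives weights of uninformative rubrics to exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w


def select_top_n(w, n):
    """Return indices of the n rubrics with the largest nonzero |weight|."""
    order = np.argsort(-np.abs(w))
    return [int(i) for i in order[:n] if w[i] != 0.0]


def make_toy_data(n_pairs=400, n_rubrics=8, seed=0):
    """Synthetic stand-in for judge output (an assumption, not real data):
    only rubrics 0 and 3 actually drive the simulated preferences."""
    rng = np.random.default_rng(seed)
    diffs = rng.integers(-4, 5, size=(n_pairs, n_rubrics)).astype(float)
    true_w = np.zeros(n_rubrics)
    true_w[0], true_w[3] = 1.0, 0.8
    p = 1.0 / (1.0 + np.exp(-(diffs @ true_w)))
    prefs = (rng.random(n_pairs) < p).astype(float)
    return diffs, prefs


diffs, prefs = make_toy_data()
weights = fit_l1_logistic(diffs, prefs)
selected = select_top_n(weights, n=2)
print("rubric weights:", np.round(weights, 3))
print("selected rubrics:", selected)
```

Omitting the intercept keeps the model antisymmetric in the pair order, which matches a Bradley-Terry-style reading of the score differences; whether the paper makes the same choice is not stated in the abstract.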