AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Abstract

Aligning text-to-image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images by prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, which makes them costly to train, difficult to adapt, and opaque in their evaluation criteria. Vision-language model (VLM) judges can provide finer-grained assessments through textual rubrics, but manually designed or heuristically generated scoring rules may not reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score both images of each pair under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an ℓ₁-regularized logistic regression refiner, which selects the top-N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where, used in the Flow-GRPO pipeline on diffusion models, it improves generation quality over scalar reward models.
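
The refinement step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the solver (proximal gradient descent, also known as ISTA), the hyperparameters `lam`, `lr`, and `steps`, the omission of an intercept, and the toy score differences are all assumptions made for illustration, and the function names are hypothetical. Each row of `diffs` holds the judge's score for image A minus its score for image B under each candidate rubric, so that logistic regression on these differences acts as a Bradley-Terry-style preference model with a linear reward over rubric scores.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrinks v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(diffs, labels, lam=0.1, lr=0.5, steps=500):
    """L1-regularized logistic regression (no intercept) via proximal gradient.

    diffs[i][k]: judge score of image A minus image B under rubric k, pair i.
    labels[i]:   1 if annotators preferred image A, else 0.
    """
    n, k = len(diffs), len(diffs[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for x, y in zip(diffs, labels):
            # Residual of the predicted preference probability.
            r = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(k):
                grad[j] += r * x[j] / n
        # Gradient step on the logistic loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

def select_top_n(weights, n):
    # Rank rubrics by |weight| and keep at most n with nonzero weight.
    ranked = sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
    return [j for j in ranked[:n] if weights[j] != 0.0]

# Toy data (assumed): rubric 0 tracks the human label, rubric 1 does not.
diffs = [[2.0, 0.1], [-2.0, 0.1], [2.0, -0.1], [-2.0, -0.1]]
labels = [1, 0, 1, 0]

weights = fit_l1_logistic(diffs, labels)
selected = select_top_n(weights, n=2)
print(weights, selected)
```

In this toy setup the ℓ₁ penalty drives the weight of the uninformative rubric to exactly zero, so it is dropped even though two rubrics were requested. A reward for a single image could then be formed as the weighted sum of its scores under the selected rubrics, though that aggregation is likewise an assumption here.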