RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
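
The test-time Generate-Critique-Refine loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Critique` structure, the dimension names, the averaging of per-dimension scores, the stopping threshold, and the toy generator and critic that stand in for a real image generator and for the reward model. The only point it shows is the control flow, in which a critique is turned into a revised prompt and no parameters are updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical critique record: per-dimension scores, a rationale, a revised prompt."""
    scores: Dict[str, float]   # dimension names are illustrative assumptions
    rationale: str
    revised_prompt: str

    @property
    def overall(self) -> float:
        # Assumption: the overall score is the plain mean of the dimension scores.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Run generate -> critique -> refine rounds and return the best output seen."""
    best_output, best_score = "", float("-inf")
    current_prompt = prompt
    for _ in range(max_rounds):
        output = generate(current_prompt)
        # Always judge against the original request, not the revised prompt.
        review = critique(prompt, output)
        if review.overall > best_score:
            best_output, best_score = output, review.overall
        if review.overall >= target:
            break
        # The critique drives the next attempt; the generator itself is unchanged.
        current_prompt = review.revised_prompt
    return best_output, best_score


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_generate(p: str) -> str:
    return f"<image of: {p}>"


def toy_critique(request: str, output: str) -> Critique:
    detailed = "highly detailed" in output
    scores = {"prompt_fidelity": 1.0, "visual_quality": 1.0 if detailed else 0.5}
    rationale = "meets the request" if detailed else "lacks fine detail"
    revised = request if detailed else request + ", highly detailed"
    return Critique(scores, rationale, revised)


if __name__ == "__main__":
    print(generate_critique_refine("a red cube", toy_generate, toy_critique))
```

In this toy run the first attempt scores 0.75, the critic appends a detail cue to the prompt, and the second attempt reaches the threshold, so the loop stops after two rounds. Keeping the best output seen so far, rather than the last one, is a design choice of this sketch that guards against a revision making things worse.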
