The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided Reinforcement Learning

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
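
The claim that a discriminator logit estimates the log-likelihood ratio, and that this ratio is the reward whose KL-regularized optimum is the data distribution, can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses one-dimensional Gaussians in place of images, raw inputs in place of a pretrained representation space, a linear logistic discriminator, and importance reweighting of base samples in place of RL fine-tuning. All names (`fit_logistic`, `tilted_mean`) are hypothetical. With data drawn from N(1, 1) and base samples from N(0, 1), the true log ratio is x - 0.5, so a well-trained logit should have slope near 1 and intercept near -0.5. The KL-regularized optimum with unit regularization weight is proportional to the base density times exp(reward), so reweighting base samples by exp(logit) should recover the data mean.

```python
import numpy as np

def fit_logistic(x, y, steps=25):
    """Fit logit(x) = w*x + b with Newton's method on the logistic loss."""
    X = np.stack([x, np.ones_like(x)], axis=1)  # features: [x, 1]
    theta = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))    # predicted P(label = data)
        grad = X.T @ (p - y) / len(y)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
        theta -= np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(0)
n = 20000
data = rng.normal(1.0, 1.0, n)  # stand-in for real data (assumption)
base = rng.normal(0.0, 1.0, n)  # stand-in for base-model samples (assumption)

# Train the discriminator: label 1 for data, 0 for base-model samples.
x = np.concatenate([data, base])
y = np.concatenate([np.ones(n), np.zeros(n)])
w, b = fit_logistic(x, y)

# Use the logit as the reward on base samples. The KL-regularized optimum
# (regularization weight 1) is proportional to p_base(x) * exp(reward(x)),
# so self-normalized weights exp(reward) tilt base samples toward the data.
reward = w * base + b
weights = np.exp(reward - reward.max())  # subtract max for numerical stability
weights /= weights.sum()
tilted_mean = float(np.sum(weights * base))

print(f"logit slope={w:.3f}, intercept={b:.3f}, tilted mean={tilted_mean:.3f}")
```

With equal numbers of data and base samples the logit needs no prior correction. With unequal class sizes it would be offset by the log of the class ratio, which shifts the reward by a constant and leaves the tilted distribution unchanged.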