The Reward Has Been in Your Data All Along: Correcting Flow Matching with Discriminator-Guided Reinforcement Learning

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure ℓ₂ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
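
The claim that the discriminator logit estimates the log-likelihood ratio, and that this ratio is the right reward for KL-regularized RL, can be illustrated with a one-dimensional toy. The sketch below is not the authors' implementation. It assumes a stand-in "data" distribution N(1, 1) and a stand-in "base model" N(0, 1), and it replaces the pretrained-feature discriminator with a linear logistic classifier on the raw scalar. All function names are hypothetical. With balanced classes the fitted logit should approach the true log ratio x - 0.5. The well-known closed-form optimum of KL-regularized reward maximization, p*(x) ∝ p_base(x) exp(r(x)/β), then stands in for the RL step: with β = 1 and r equal to the logit, the tilted distribution should have a mean close to the data mean of 1.

```python
import math
import random


def sample_gaussian(rng, mean, n):
    """Draw n samples from a unit-variance Gaussian with the given mean."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


def fit_logistic(real, fake, steps=400, lr=0.5):
    """Fit logit(x) = w * x + b with full-batch gradient descent.

    Real samples are labeled 1 and base-model samples 0. With equal class
    sizes, the optimal logit equals log p_data(x) - log p_base(x).
    """
    xs = real + fake
    ys = [1.0] * len(real) + [0.0] * len(fake)
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def tilted_mean(w, b, beta=1.0):
    """Mean of p*(x) ∝ p_base(x) * exp(r(x) / beta) on a grid.

    p_base is N(0, 1) and r(x) = w * x + b is the discriminator logit.
    This closed form replaces an actual RL fine-tuning loop (assumption).
    """
    grid = [i * 0.01 for i in range(-800, 801)]
    weights = [math.exp(-0.5 * x * x + (w * x + b) / beta) for x in grid]
    total = sum(weights)
    return sum(x * wt for x, wt in zip(grid, weights)) / total


rng = random.Random(0)
real = sample_gaussian(rng, 1.0, 1500)  # stand-in for data samples
fake = sample_gaussian(rng, 0.0, 1500)  # stand-in for base-model samples

w, b = fit_logistic(real, fake)
print(f"learned logit: {w:.3f} * x + {b:.3f} (true log ratio: 1 * x - 0.5)")
print(f"mean after reward tilting: {tilted_mean(w, b):.3f} (data mean: 1.0)")
```

In this toy, any β other than 1 would over- or under-correct the base distribution. In practice the reward is only an estimate and the policy is updated by an RL algorithm instead of a closed-form tilt.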