The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided Reinforcement Learning

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning (RL) for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is supposed to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure the ℓ₂ regression error of the velocity or score field under training-time marginals, a proxy that is poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive to collect and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., from 9.38 to 2.62 on SiT) and semantic-space FD (e.g., from 88.2 to 19.3 with DINOv3 on SiT), with consistent gains across all backbones, and it improves human-preference rewards without training on them. Under subsequent preference-based post-training, DRL also yields a better Pareto frontier between preference reward and image fidelity, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
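
The claim that the discriminator logit is the optimal reward for targeting the data distribution can be checked on a toy discrete example. The sketch below is not the paper's implementation: it assumes an ideal (Bayes-optimal) discriminator with balanced classes, a four-outcome sample space, and a KL weight `beta` that the abstract does not specify. The function names `optimal_logit` and `kl_regularized_optimum` are hypothetical. It relies on two standard facts: the optimal discriminator's logit equals log p_data − log p_base, and the maximizer of E_p[r] − beta · KL(p ‖ p_base) is proportional to p_base · exp(r / beta).

```python
import numpy as np


def optimal_logit(p_data, p_base):
    """Logit of the Bayes-optimal discriminator (balanced classes assumed).

    D*(x) = p_data(x) / (p_data(x) + p_base(x)), so
    logit D*(x) = log p_data(x) - log p_base(x).
    """
    d_star = p_data / (p_data + p_base)
    return np.log(d_star) - np.log1p(-d_star)


def kl_regularized_optimum(p_base, reward, beta):
    """Closed-form maximizer of E_p[reward] - beta * KL(p || p_base).

    The solution is p*(x) proportional to p_base(x) * exp(reward(x) / beta).
    """
    log_w = np.log(p_base) + reward / beta
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()


# Toy distributions (illustrative values, not from the paper).
p_data = np.array([0.50, 0.30, 0.15, 0.05])
p_base = np.array([0.25, 0.25, 0.25, 0.25])

reward = optimal_logit(p_data, p_base)

# With beta = 1 the KL-regularized optimum coincides with the data distribution.
p_star = kl_regularized_optimum(p_base, reward, beta=1.0)

# A larger beta keeps the policy closer to the base model (geometric interpolation).
p_soft = kl_regularized_optimum(p_base, reward, beta=4.0)

print("reward :", np.round(reward, 4))
print("p_star :", np.round(p_star, 4))
print("p_soft :", np.round(p_soft, 4))
```

In this idealized setting, `p_star` matches `p_data` up to floating-point error, while `p_soft` lies between the base model and the data. A learned discriminator on pretrained features only approximates this logit, so the exact recovery shown here should be read as the limiting case rather than as expected practical behavior.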