The Reward Was in the Data All Along: Fixing Flow Matching with Discriminator-Guided Reinforcement Learning

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure ℓ2 regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose **Discriminator-Guided RL** (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
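The claim that the discriminator logit estimates the data-to-model log-likelihood ratio, and that this ratio is the right reward under KL regularization, can be checked on a toy problem. The sketch below is not the paper's implementation and rests on several assumptions: one-dimensional Gaussian samples stand in for features from a frozen pretrained encoder, a logistic-regression classifier stands in for the discriminator, the KL weight `beta` is set to 1, and exponential reweighting of base samples stands in for actual RL fine-tuning. It uses the standard result that the KL-regularized optimum has density proportional to the base density times exp(reward / beta). All names (`fit_discriminator`, `reward`, `tilted_mean`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for features from a frozen pretrained encoder (assumption):
# "data" features ~ N(1, 1), "base model" features ~ N(0, 1).
data_feats = rng.normal(1.0, 1.0, size=(5000, 1))
model_feats = rng.normal(0.0, 1.0, size=(5000, 1))


def add_bias(feats):
    """Append a constant column so the classifier can learn an offset."""
    return np.hstack([feats, np.ones((len(feats), 1))])


def fit_discriminator(real, fake, steps=2000, lr=0.5):
    """Logistic regression (label 1 = data, 0 = model) by gradient descent."""
    X = add_bias(np.vstack([real, fake]))
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted P(data | feature)
        w -= lr * X.T @ (p - y) / len(y)     # gradient of the mean log-loss
    return w


def reward(feats, w):
    """Discriminator logit, used as the reward."""
    return add_bias(feats) @ w


w = fit_discriminator(data_feats, model_feats)

# For these two Gaussians the true log-likelihood ratio is x - 0.5,
# so the fitted (slope, bias) should land close to (1, -0.5).
print("fitted slope and bias:", w)

# KL-regularized optimum: p*(x) is proportional to p_base(x) * exp(r(x) / beta).
# Reweighting base samples by exp(r / beta) approximates sampling from p*.
beta = 1.0  # assumed KL weight
r = reward(model_feats, w)
weights = np.exp((r - r.max()) / beta)  # subtract max for numerical stability
weights /= weights.sum()
tilted_mean = float(weights @ model_feats[:, 0])

# With beta = 1 the tilted distribution should match the data, whose mean is 1.
print("mean under the reward-tilted base distribution:", tilted_mean)
```

With balanced classes and a well-fit classifier, the reweighted base samples move toward the data distribution, which is the sense in which the log-likelihood ratio targets the data. In a realistic setting the features would come from a pretrained encoder and the reweighting would be replaced by a KL-regularized policy update on the generator.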