Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient, but it is hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it propagates back to early step indices. Connector-based methods such as LeapAlign address these issues by replacing the full backward trajectory with a short pinned path, which highlights a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four design choices (the reward-model input, the active set, the integration weights, and the bridge coupling) and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base, evaluated on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.
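
To make the cached-rollout and surrogate idea concrete, the following is a minimal toy sketch, not the FlowBP implementation. It assumes a scalar linear velocity field v(x) = theta * x, a linear reward r(x) = reward_grad * x, uniform Euler steps with integration weight h, and no bridge coupling. All function names (`rollout`, `surrogate_endpoint`, `surrogate_grad`, `full_grad`) are hypothetical and chosen only for illustration. Under these assumptions, the surrogate re-forwards the velocity only at active steps, evaluated at frozen cached states, so each active step contributes a single velocity derivative and no chained state Jacobians. This is one possible reading of the "at most one Jacobian factor" property.

```python
def rollout(theta, x0, n_steps):
    """Gradient-free Euler rollout of dx/dt = theta * x on [0, 1].

    Caches every state and velocity so the backward pass can reuse them.
    """
    h = 1.0 / n_steps
    xs, vs = [x0], []
    for _ in range(n_steps):
        v = theta * xs[-1]
        vs.append(v)
        xs.append(xs[-1] + h * v)
    return xs, vs, h


def surrogate_endpoint(theta, xs, vs, h, active):
    """Rebuild the endpoint from velocities (toy assumption: weight h per step).

    Active steps re-forward the velocity at the frozen cached state xs[k];
    all other steps reuse the cached velocity as a constant.
    """
    x_hat = xs[0]
    for k, v_cached in enumerate(vs):
        v = theta * xs[k] if k in active else v_cached
        x_hat += h * v
    return x_hat


def surrogate_grad(reward_grad, xs, h, active):
    """d r(x_hat) / d theta: one term per active step, no chained Jacobians."""
    return sum(h * reward_grad * xs[k] for k in active)


def full_grad(theta, reward_grad, xs, h):
    """Full backpropagation through the rollout, for comparison.

    The adjoint lam is multiplied by the step Jacobian (1 + h * theta)
    at every step, which is the chained product that inflates the gradient.
    """
    lam, g = reward_grad, 0.0
    for k in reversed(range(len(xs) - 1)):
        g += h * lam * xs[k]
        lam *= 1.0 + h * theta
    return g


if __name__ == "__main__":
    theta, x0, c = 2.0, 1.0, 1.0
    xs, vs, h = rollout(theta, x0, n_steps=8)
    active = {1, 4, 7}  # hypothetical sparse active set
    print("surrogate gradient:", surrogate_grad(c, xs, h, active))
    print("full gradient:     ", full_grad(theta, c, xs, h))
```

In this toy setting the surrogate reproduces the cached endpoint exactly, so the reward is evaluated on the same sample the sampler produced. Only the steps in `active` need to be re-forwarded with gradients enabled, which is why memory in such a scheme would scale with the active-set size rather than the trajectory length.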