LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Abstract

This paper focuses on aligning flow matching models with human preferences. A promising approach is to fine-tune by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories incurs prohibitive memory costs and causes gradient explosion. As a result, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from the reward to early generation steps. Specifically, we shorten the long trajectory to only two steps by designing two consecutive leaps, each of which skips multiple ODE sampling steps and predicts a future latent in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign enables efficient and stable model updates at any generation step. To make better use of these shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude instead of removing them entirely, as previous works do. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
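
The two-leap idea can be illustrated with a toy sketch. The abstract does not give the exact leap formula, so everything below is an assumption made for illustration: the leap is modeled as a single linear (Euler-style) extrapolation along the predicted velocity, time runs from t = 1 (noise) to t = 0 (image), the second leap always ends at the final timestep, and the consistency weight is a Gaussian kernel on the gap between the leaped endpoint and the endpoint of the full trajectory. The names `euler_sample`, `leap`, `two_leap_trajectory`, and `consistency_weight` are hypothetical and do not come from the paper, and scalar latents stand in for image latents.

```python
import math
import random


def euler_sample(velocity, x, timesteps):
    """Run the full multi-step ODE sampler and keep the latent at every timestep."""
    latents = [x]
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
        latents.append(x)
    return latents


def leap(velocity, x, t_start, t_end):
    """Jump from t_start to t_end with one velocity evaluation.

    Assumption: a plain linear extrapolation; the paper may use a different predictor.
    """
    return x + (t_end - t_start) * velocity(x, t_start)


def two_leap_trajectory(velocity, latents, timesteps, rng):
    """Build a two-step trajectory from a randomly chosen start and middle timestep."""
    last = len(timesteps) - 1
    i = rng.randrange(0, last - 1)   # start index, leaves room for a middle index
    j = rng.randrange(i + 1, last)   # middle index, strictly between i and last
    x_mid = leap(velocity, latents[i], timesteps[i], timesteps[j])
    x_end = leap(velocity, x_mid, timesteps[j], timesteps[last])
    return i, j, x_mid, x_end


def consistency_weight(x_leap, x_full, temperature=1.0):
    """Hypothetical weight: closer agreement with the long path gives a weight nearer 1."""
    return math.exp(-((x_leap - x_full) ** 2) / temperature)


# Toy usage with a scalar latent and a curved velocity field v(x, t) = x.
timesteps = [1.0 - k / 10 for k in range(11)]
curved = lambda x, t: x
full = euler_sample(curved, 1.0, timesteps)
i, j, x_mid, x_end = two_leap_trajectory(curved, full, timesteps, random.Random(0))
weight = consistency_weight(x_end, full[-1])
```

In an actual fine-tuning loop the reward would be evaluated on the leaped endpoint and backpropagated through only the two velocity evaluations, which is what keeps memory bounded regardless of how early the start timestep is. The sketch omits that step, as well as the down-weighting of large gradient terms.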
