Flow-OPD: On-Policy Distillation for Flow Matching Models

Abstract

Existing Flow Matching (FM) text-to-image models face two critical bottlenecks in multi-task alignment: reward sparsity induced by scalar-valued rewards, and gradient interference arising from jointly optimizing heterogeneous objectives. Together these produce a "seesaw effect" among competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: (1) it cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; (2) it establishes a robust initial policy through a Flow-based Cold-Start scheme and then consolidates the heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which uses a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94, an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent "teacher-surpassing" effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
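The abstract does not state the training objective, so the following is only a toy sketch of how the second stage might fit together. It assumes that "dense trajectory-level supervision" means matching the routed teacher's velocity at every state the student itself visits, and that MAR adds a weighted velocity-matching term toward a task-agnostic anchor teacher. Every name here (`TEACHER_TARGETS`, `ANCHOR_TARGET`, `rollout`, `train_student`, the weight `lam`) is hypothetical, and the "models" are 1-D velocity fields of the form v(x) = target - x rather than real Flow Matching networks.

```python
import random

# Hypothetical stand-ins: each "model" is a 1-D velocity field v(x) = target - x.
TEACHER_TARGETS = {"geneval": 2.0, "ocr": -1.0}  # one specialist teacher per task (assumed)
ANCHOR_TARGET = 0.0                              # task-agnostic anchor teacher (assumed)


def velocity(target, x):
    """Toy velocity field that pulls x toward `target`."""
    return target - x


def rollout(target, x0, steps=8):
    """Euler-integrate dx/dt = target - x from t=0 to t=1; return the visited states."""
    x, dt, states = x0, 1.0 / steps, []
    for _ in range(steps):
        states.append(x)
        x += dt * velocity(target, x)
    return states


def train_student(prompts, lam=0.25, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    student = {task: 0.0 for task in TEACHER_TARGETS}  # one scalar parameter per task
    for _ in range(epochs):
        for prompt in prompts:
            task = prompt["task"]  # task routing, assumed here to be a plain label lookup
            # On-policy sampling: states come from the student's own trajectory
            # and are treated as constants when differentiating the loss.
            states = rollout(student[task], rng.gauss(0.0, 1.0))
            grad = 0.0
            for x in states:  # dense supervision at every visited step
                v_s = velocity(student[task], x)
                v_t = velocity(TEACHER_TARGETS[task], x)
                v_a = velocity(ANCHOR_TARGET, x)
                # Gradient of (v_s - v_t)^2 + lam * (v_s - v_a)^2 w.r.t. the
                # student parameter, using dv_s/dtheta = 1 for this toy field.
                grad += 2.0 * (v_s - v_t) + 2.0 * lam * (v_s - v_a)
            student[task] -= lr * grad / len(states)
    return student


if __name__ == "__main__":
    print(train_student([{"task": "geneval"}, {"task": "ocr"}]))
```

In this toy setting each student parameter settles at (T + lam * A) / (1 + lam), where T is the routed teacher's target and A is the anchor's target, so the anchor term pulls the student part of the way from the specialist toward the anchor. Whether the actual MAR term behaves this way depends on details the abstract does not give.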

Flow-OPD: On-Policy Distillation for Flow Matching Models
