DanceOPD: On-Policy Generative Field Distillation

Abstract

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I) generation, local editing, and global editing. However, these capabilities rarely align naturally and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, composing these capabilities effectively has become a central challenge in training image generation models. To address this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns to compose expert capabilities from fields queried on its own rollout states. This formulation also absorbs operator-defined fields such as classifier-free guidance (CFG). Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
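The training step described above (route a sample to one field, query one student-induced state, regress velocities with MSE) can be illustrated with a small NumPy sketch. This is not the paper's implementation. The names `cfg_field`, `student_rollout`, and `opd_loss` are hypothetical, as are the toy fields. The time convention (t = 1 is noise, t = 0 is data), the Euler rollout, and the step count are likewise assumptions made only for illustration.

```python
import numpy as np

def cfg_field(v_cond, v_uncond, scale):
    """Operator-defined field: classifier-free guidance over two velocity fields.
    (Standard CFG combination; how the paper parameterizes it is not stated here.)"""
    def field(x, t):
        vu = v_uncond(x, t)
        return vu + scale * (v_cond(x, t) - vu)
    return field

def student_rollout(student, x_noise, t_query, num_steps=8):
    """Euler-integrate the student's own velocity from t=1 (noise) down to t_query.
    Assumed convention: dx/dt = v(x, t), with t=1 noise and t=0 data."""
    x = x_noise.copy()
    ts = np.linspace(1.0, t_query, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * student(x, t_cur)
    return x

def opd_loss(student, fields, route, x_noise, t_query):
    """Velocity MSE between the student and the routed field at one
    student-induced state. In a real trainer the rollout state would be
    treated as a constant (no gradient through the rollout); that is an assumption."""
    x_t = student_rollout(student, x_noise, t_query)
    target = fields[route](x_t, t_query)   # query the routed capability field
    pred = student(x_t, t_query)           # student velocity at the same state
    return float(np.mean((pred - target) ** 2))

# Toy stand-ins for capability fields (hypothetical, for illustration only).
t2i = lambda x, t: x - 1.0
edit = lambda x, t: x + 2.0
uncond = lambda x, t: x
fields = {"t2i": t2i, "edit": edit, "cfg": cfg_field(t2i, uncond, scale=3.0)}

student = lambda x, t: x - 1.0             # a student that already matches the T2I field
rng = np.random.default_rng(0)
x_noise = rng.standard_normal((4, 2))

# Each sample is routed to exactly one field; t_query=0.1 is a low-noise state.
losses = {name: opd_loss(student, fields, name, x_noise, t_query=0.1) for name in fields}
print(losses)
```

In this toy setting the loss vanishes for the field the student already matches and is positive for the others, which is the signal a trainer would minimize across routed samples.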