Bootstrap Your Generator: Unsupervised Visual Editing via Flow Matching

Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing, where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models that leverages the base model's own knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen base model with cycle consistency for structure preservation. To make this tractable, we propose routing gradients from downstream losses through clean predictions back to the noisy training states. We demonstrate state-of-the-art results in challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method generalizes effectively to unseen domains and outperforms supervised baselines trained on millions of samples. Our analysis reveals that this gradient routing bridges the train-inference gap, and that semantic cues extracted from the base model provide a robust training signal, obviating the need for external reward models.
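To make the gradient-routing idea more concrete, the following is a minimal sketch of how a clean prediction can be formed from a noisy flow matching state, and how a gradient defined on that clean prediction maps back to the velocity output. It assumes the common linear interpolation path (t = 0 is clean, t = 1 is pure noise) and a one-step clean estimate. The function names are hypothetical, and nothing here is taken from the ByG implementation itself.

```python
# Illustrative sketch only. Assumes the linear (rectified-flow style) path
#   x_t = (1 - t) * x_clean + t * noise,  with target velocity v = noise - x_clean.
# All function names are hypothetical and not from the paper.

def noisy_state(x_clean, noise, t):
    """Build the noisy training state x_t on the assumed linear path."""
    return [(1 - t) * c + t * n for c, n in zip(x_clean, noise)]

def clean_prediction(x_t, v_pred, t):
    """One-step clean estimate: x0_hat = x_t - t * v_pred."""
    return [x - t * v for x, v in zip(x_t, v_pred)]

def grad_wrt_velocity(grad_clean, t):
    """Chain rule through x0_hat = x_t - t * v: d(x0_hat)/dv = -t.

    A gradient of any downstream loss with respect to the clean
    prediction therefore reaches the velocity output scaled by -t.
    """
    return [-t * g for g in grad_clean]

# Usage: with the exact velocity, the clean prediction recovers x_clean.
x_clean = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.7]
t = 0.4
x_t = noisy_state(x_clean, noise, t)
v_true = [n - c for c, n in zip(x_clean, noise)]
x0_hat = clean_prediction(x_t, v_true, t)
```

Under these assumptions, a loss evaluated on `x0_hat` (for example an instruction-following or cycle-consistency term) can supply gradients to a model that only ever sees noisy states, which is one plausible reading of how the train-inference gap is bridged.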