Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing, where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose routing gradients from downstream losses through clean predictions back to the noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method generalizes effectively to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and that extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.
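
To make the gradient-routing idea concrete, the following is a minimal scalar sketch and not the authors' implementation. It assumes the common rectified-flow convention, x_t = (1 - t) x_0 + t ε with velocity v = ε - x_0, so that a one-step clean prediction is x_0 ≈ x_t - t v; the abstract does not state which parameterization ByG uses. The affine `toy_velocity` model and the `target` value are hypothetical stand-ins for the editing network and for whatever a downstream loss (for example, a cycle-consistency term) compares against.

```python
def noisy_state(x0, eps, t):
    """Interpolate between a clean sample x0 and noise eps (assumed convention)."""
    return (1.0 - t) * x0 + t * eps


def clean_prediction(x_t, v, t):
    """One-step estimate of the clean sample from a noisy state and a velocity."""
    return x_t - t * v


def toy_velocity(params, x_t):
    """Hypothetical stand-in for the network: an affine map of the noisy state."""
    a, b = params
    return a * x_t + b


def loss_and_grads(params, x0, eps, t, target):
    """Squared error on the clean prediction, with gradients w.r.t. (a, b).

    The loss is defined on the clean estimate, but its gradient reaches the
    parameters through the velocity evaluated at the noisy state x_t.
    """
    x_t = noisy_state(x0, eps, t)
    v = toy_velocity(params, x_t)
    x0_hat = clean_prediction(x_t, v, t)
    residual = x0_hat - target
    loss = residual ** 2
    # Chain rule: d(loss)/dv = 2 * residual * d(x0_hat)/dv = 2 * residual * (-t)
    dloss_dv = 2.0 * residual * (-t)
    # dv/da = x_t and dv/db = 1
    return loss, (dloss_dv * x_t, dloss_dv)


if __name__ == "__main__":
    loss, grads = loss_and_grads((0.3, -0.1), x0=0.5, eps=-1.2, t=0.7, target=0.4)
    print("loss:", loss, "grads:", grads)
```

In this toy setting the factor of t in the gradient shows how a loss on the clean estimate is scaled when it is propagated back to a model evaluated at noise level t. A real system would rely on automatic differentiation rather than a hand-written chain rule.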