Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing, where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow-matching editing models that leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle consistency for structure preservation. To make this tractable, we route gradients from downstream losses computed on clean predictions back to the noisy training states. We demonstrate state-of-the-art results in challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method generalizes effectively to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and that extracting semantic cues from the base model provides a robust training signal, obviating the need for external reward models.
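
To make the idea of routing gradients from a clean prediction to a noisy training state concrete, here is a minimal scalar sketch. It is not the authors' implementation. It assumes the common linear flow-matching path x_t = (1 - t) * x0 + t * x1 (velocity x1 - x0), under which a one-step clean estimate is x_t + (1 - t) * v. The one-weight "model", the squared-error loss, and all function names are hypothetical stand-ins chosen for illustration.

```python
# Toy scalar illustration; every name here is hypothetical, not from the paper.
# Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.

def velocity(w, x_t, t):
    """Stand-in 'model': a single learnable weight w (t is unused in this toy)."""
    return w * x_t

def clean_prediction(w, x_t, t):
    """One-step estimate of the clean sample from a noisy state."""
    return x_t + (1.0 - t) * velocity(w, x_t, t)

def downstream_loss(w, x_t, t, target):
    """Any loss defined on the clean estimate; squared error is used here."""
    return (clean_prediction(w, x_t, t) - target) ** 2

def loss_grad_w(w, x_t, t, target):
    """Chain rule: dL/dw = dL/dx1_hat * dx1_hat/dv * dv/dw."""
    residual = clean_prediction(w, x_t, t) - target
    return 2.0 * residual * (1.0 - t) * x_t

x0, x1, t = -1.0, 2.0, 0.25        # noise sample, clean sample, time
x_t = (1.0 - t) * x0 + t * x1      # noisy training state
w, eps = 0.5, 1e-6

# The analytic gradient should agree with a central finite difference.
numeric = (downstream_loss(w + eps, x_t, t, x1)
           - downstream_loss(w - eps, x_t, t, x1)) / (2 * eps)
analytic = loss_grad_w(w, x_t, t, x1)
```

In this sketch the loss is evaluated on the clean estimate, yet its gradient reaches the weight through the velocity evaluated at the noisy state `x_t`. In a real model an automatic-differentiation framework would carry out this chain rule.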