Bootstrap Your Generator: Flow Matching for Unpaired Visual Editing

Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing, where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. ByG draws only on the base model's own knowledge and uses no external signal. Our approach combines instruction-following cues extracted from the frozen model with cycle consistency for structure preservation. To make this tractable, we propose routing gradients from downstream losses, computed on clean predictions, back to the noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method generalizes effectively to unseen domains and outperforms supervised baselines trained on millions of samples. Our analysis shows that gradient routing bridges the train-inference gap, and that semantic cues extracted from the base model provide a robust training signal that obviates the need for external reward models.
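
The gradient routing idea can be illustrated with a toy scalar example. The sketch below is not the authors' implementation. It assumes a linear interpolation path x_t = (1 - t) * noise + t * clean, a one-step clean estimate x_hat = x_t + (1 - t) * v, and a hypothetical one-parameter velocity model v = theta * x_t, none of which are specified in the abstract. It shows only that a loss placed on the clean prediction still produces a gradient for the model evaluated at the noisy state, and checks the chain-rule result against a finite difference.

```python
# Toy sketch under stated assumptions; all names here are hypothetical.
# Assumed path: x_t = (1 - t) * noise + t * clean (t = 1 is the clean sample).
# Assumed one-step clean estimate: x_hat = x_t + (1 - t) * v_theta(x_t, t).


def noisy_state(clean, noise, t):
    """Interpolate between noise (t = 0) and the clean sample (t = 1)."""
    return (1.0 - t) * noise + t * clean


def clean_prediction(theta, x_t, t):
    """One-step clean estimate from a scalar velocity model v = theta * x_t."""
    velocity = theta * x_t
    return x_t + (1.0 - t) * velocity


def downstream_loss(theta, x_t, t, target):
    """Stand-in for any loss that is evaluated on the clean prediction."""
    return (clean_prediction(theta, x_t, t) - target) ** 2


def routed_gradient(theta, x_t, t, target):
    """Chain rule: dL/dtheta = dL/dx_hat * dx_hat/dv * dv/dtheta."""
    x_hat = clean_prediction(theta, x_t, t)
    dloss_dxhat = 2.0 * (x_hat - target)
    dxhat_dv = 1.0 - t
    dv_dtheta = x_t
    return dloss_dxhat * dxhat_dv * dv_dtheta


def finite_difference(theta, x_t, t, target, eps=1e-6):
    """Central difference, used only to sanity-check the analytic gradient."""
    up = downstream_loss(theta + eps, x_t, t, target)
    down = downstream_loss(theta - eps, x_t, t, target)
    return (up - down) / (2.0 * eps)


x_t = noisy_state(clean=2.0, noise=-1.0, t=0.25)
analytic = routed_gradient(0.5, x_t, 0.25, target=2.0)
numeric = finite_difference(0.5, x_t, 0.25, target=2.0)
```

In a real model, automatic differentiation would replace the hand-written chain rule, and the downstream loss would be whatever objective is defined on the clean prediction (for example, a cycle-consistency term).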