FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
January 14, 2025
Authors: Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo
cs.AI
Abstract
Interactive image editing allows users to modify images through visual
interaction operations such as drawing, clicking, and dragging. Existing
methods construct such supervision signals from videos, as they capture how
objects change with various physical interactions. However, these models are
usually built upon text-to-image diffusion models and therefore require (i) massive
training samples and (ii) an additional reference encoder to learn real-world
dynamics and visual consistency. In this paper, we reformulate this task as an
image-to-video generation problem so as to inherit powerful video diffusion
priors, which reduces training costs and ensures temporal consistency. Specifically,
we introduce FramePainter as an efficient instantiation of this formulation.
Initialized with Stable Video Diffusion, it only uses a lightweight sparse
control encoder to inject editing signals. Considering the limitations of
temporal attention in handling large motion between two frames, we further
propose matching attention to enlarge the receptive field while encouraging
dense correspondence between edited and source image tokens. We highlight the
effectiveness and efficiency of FramePainter across various editing signals:
it dominantly outperforms previous state-of-the-art methods with far less
training data, achieving highly seamless and coherent editing of images, e.g.,
automatically adjusting the reflection of the cup. Moreover, FramePainter also
exhibits exceptional generalization in scenarios not present in real-world
videos, e.g., transforming the clownfish into a shark-like shape. Our code will be
available at https://github.com/YBYBZhang/FramePainter.