FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Abstract

Interactive image editing allows users to modify images through visual interactions such as drawing, clicking, and dragging. Existing methods construct the supervision signals for this task from videos, because videos capture how objects change under various physical interactions. However, these models are usually built on text-to-image diffusion models and therefore require (i) massive numbers of training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate the task as an image-to-video generation problem so as to inherit powerful video diffusion priors, which reduces training costs and ensures temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it uses only a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention, which enlarges the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various editing signals: it significantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent image edits, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter exhibits exceptional generalization in scenarios not present in real-world videos, e.g., transforming a clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.
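To make the contrast between temporal attention and matching attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention with no learned query/key/value projections, and the function names `temporal_attention` and `matching_attention` are hypothetical. It only illustrates the receptive-field difference described above: temporal attention lets a token see the same spatial location in the other frame, whereas a matching-style attention lets every edited-image token attend to all source-image tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(frames):
    """Attention across frames at each spatial location independently.

    frames: array of shape (T, N, D) with T frames, N spatial tokens, D channels.
    A token at location i only sees location i in the other frames, so content
    that moved to a different location is out of reach.
    """
    x = frames.transpose(1, 0, 2)                             # (N, T, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (N, T, T)
    out = softmax(scores) @ x                                 # (N, T, D)
    return out.transpose(1, 0, 2)                             # (T, N, D)


def matching_attention(edited, source):
    """Every edited-image token attends to all source-image tokens.

    edited, source: arrays of shape (N, D).
    Returns the attended features (N, D) and the attention weights (N, N).
    Illustrative assumption: no learned projections, single head.
    """
    scores = edited @ source.T / np.sqrt(edited.shape[-1])    # (N, N)
    weights = softmax(scores)
    return weights @ source, weights


# Toy "large motion": the edited frame is a spatial shuffle of the source tokens.
rng = np.random.default_rng(0)
source = rng.normal(size=(16, 64))
perm = rng.permutation(16)
edited = source[perm]

_, weights = matching_attention(edited, source)
print("correspondence recovered:", bool((weights.argmax(axis=1) == perm).all()))
```

In this toy setting the attention weights act as a soft correspondence map between edited and source tokens, which is the intuition behind "dense correspondence" in the abstract; the actual layer in FramePainter may differ in its projections, normalization, and training objective.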

