Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
March 19, 2024
Authors: Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi
cs.AI
Abstract
We propose a generative model that, given a coarsely edited image,
synthesizes a photorealistic output that follows the prescribed layout. Our
method transfers fine details from the original image and preserves the
identity of its parts, yet adapts them to the lighting and context defined by
the new layout. Our key insight is that videos are a powerful source of
the new layout. Our key insight is that videos are a powerful source of
supervision for this task: objects and camera motions provide many observations
of how the world changes with viewpoint, lighting, and physical interactions.
We construct an image dataset in which each sample is a pair of source and
target frames extracted from the same video at randomly chosen time intervals.
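To make the pair-construction step concrete, the sketch below samples a (source, target) frame pair from one video at a randomly chosen temporal offset. The OpenCV reader and the `max_gap_frames` bound are illustrative assumptions, not details taken from the paper.

```python
import random

import cv2  # illustrative choice; any video frame reader works


def sample_frame_pair(video_path, max_gap_frames=60):
    """Sample a (source, target) frame pair from the same video,
    separated by a randomly chosen time interval (illustrative)."""
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    src_idx = random.randrange(0, num_frames - 1)
    gap = random.randint(1, max_gap_frames)
    tgt_idx = min(src_idx + gap, num_frames - 1)

    def read_frame(idx):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            raise IOError(f"could not read frame {idx} of {video_path}")
        return frame

    source, target = read_frame(src_idx), read_frame(tgt_idx)
    cap.release()
    return source, target
```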
We warp the source frame toward the target using two motion models that mimic
the expected test-time user edits. We supervise our model to translate the
warped image into the ground truth, starting from a pretrained diffusion model.
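The abstract does not specify the two motion models; as one plausible stand-in for a dense-motion model, this sketch warps the source frame toward the target with Farneback optical flow, backward-sampling source pixels at flow-displaced coordinates. All parameter values are illustrative.

```python
import cv2
import numpy as np


def warp_source_to_target(src_bgr, tgt_bgr):
    """Warp the source frame toward the target with dense optical flow
    (an illustrative stand-in for one of the paper's motion models)."""
    src_gray = cv2.cvtColor(src_bgr, cv2.COLOR_BGR2GRAY)
    tgt_gray = cv2.cvtColor(tgt_bgr, cv2.COLOR_BGR2GRAY)
    # Flow from target to source, so each target pixel points back to
    # the source location it should be sampled from. Positional args:
    # pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(tgt_gray, src_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = tgt_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Backward warp: sample the source image at the displaced coordinates.
    return cv2.remap(src_bgr, map_x, map_y, cv2.INTER_LINEAR)
```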
Our model design explicitly enables fine detail transfer from the source frame
to the generated image, while closely following the user-specified layout. We
show that by using simple segmentations and coarse 2D manipulations, we can
synthesize a photorealistic edit faithful to the user's input while addressing
second-order effects like harmonizing the lighting and physical interactions
between edited objects.
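Starting from a pretrained diffusion model, this supervision reduces to a standard denoising objective conditioned on the warped input. Below is a minimal sketch of one training step, assuming a diffusers-style `unet` (e.g. `UNet2DModel`) and noise `scheduler`, with conditioning by channel concatenation; these choices are illustrative, not the paper's confirmed architecture.

```python
import torch
import torch.nn.functional as F


def training_step(unet, scheduler, warped_latents, target_latents):
    """One denoising step: predict the noise added to the ground-truth
    target, conditioned on the coarsely warped source (illustrative)."""
    noise = torch.randn_like(target_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (target_latents.shape[0],),
                      device=target_latents.device)
    noisy = scheduler.add_noise(target_latents, noise, t)
    # Condition on the warped input by concatenating it along channels;
    # the UNet's in_channels must be sized accordingly.
    model_in = torch.cat([noisy, warped_latents], dim=1)
    pred = unet(model_in, t).sample
    return F.mse_loss(pred, noise)
```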