FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
December 29, 2023
Authors: Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu
cs.AI
Abstract
Diffusion models have transformed image-to-image (I2I) synthesis and are
now permeating into videos. However, the advancement of video-to-video (V2V)
synthesis has been hampered by the challenge of maintaining temporal
consistency across video frames. This paper proposes a consistent V2V synthesis
framework by jointly leveraging spatial conditions and temporal optical flow
clues within the source video. Contrary to prior methods that strictly adhere
to optical flow, our approach harnesses its benefits while handling the
imperfections in flow estimation. We encode the optical flow by warping from
the first frame and use it as a supplementary reference in the diffusion
model. This enables our model to synthesize videos by editing the first frame
with any prevalent I2I model and then propagating the edits to successive frames.
Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility:
FlowVid works seamlessly with existing I2I models, facilitating various
modifications, including stylization, object swaps, and local edits. (2)
Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution
takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF,
Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our
FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender
(10.2%), and TokenFlow (40.4%).
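The core operation the abstract describes, warping the (edited) first frame along a dense optical-flow field so it can serve as a reference for later frames, can be illustrated with a minimal backward-warping sketch. This is not the authors' implementation; the function name and nearest-neighbor sampling are simplifying assumptions for illustration:

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp `frame` (H, W, C) with a dense flow field (H, W, 2).

    flow[y, x] = (dx, dy) gives, for each target pixel, the offset into
    the source frame to sample from (nearest-neighbor for simplicity;
    real pipelines typically use bilinear sampling and occlusion masks).
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

# Toy example: a constant flow of (+1, 0) samples one pixel to the right.
frame = np.arange(12, dtype=float).reshape(3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
warped = warp_with_flow(frame, flow)
```

Because flow estimates are imperfect (occlusions, motion boundaries), FlowVid treats such a warped frame only as a supplementary condition for the diffusion model rather than a hard constraint.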