

Hierarchical Masked 3D Diffusion Model for Video Outpainting

September 5, 2023
Authors: Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan
cs.AI

Abstract

Video outpainting aims to complete missing areas at the edges of video frames. Compared to image outpainting, it presents an additional challenge: the model must maintain the temporal consistency of the filled area. In this paper, we introduce a masked 3D diffusion model for video outpainting. We use the technique of mask modeling to train the 3D diffusion model. This allows us to use multiple guide frames to connect the results of multiple video-clip inferences, ensuring temporal consistency and reducing jitter between adjacent frames. Meanwhile, we extract global frames of the video as prompts and use cross-attention to guide the model with information beyond the current video clip. We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact-accumulation problem. The existing coarse-to-fine pipeline uses only the infilling strategy, which degrades quality because the time interval between sparse frames is too large. Our pipeline benefits from the bidirectional learning of mask modeling and can therefore employ a hybrid strategy of infilling and interpolation when generating sparse frames. Experiments show that our method achieves state-of-the-art results on video outpainting tasks. More results are provided on our project page: https://fanfanda.github.io/M3DDM/.
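The mask-modeling training described above can be sketched as a per-frame mask sampler: some frames of a clip are kept as clean guide frames while the rest must be generated. The specific mix of patterns below (all-masked, first-frame, endpoints, random subset) is an assumption for illustration, not the paper's exact schedule; it shows how one sampler can expose the model to both infilling (guides in the past only) and interpolation (guides at both ends), which is what the hybrid coarse-to-fine pipeline relies on.

```python
import random
import numpy as np

def sample_guide_mask(num_frames: int) -> np.ndarray:
    """Sample a per-frame guide mask for mask-modeling training.

    Returns a boolean array where True marks a frame kept as a clean
    guide frame and False marks a frame the diffusion model must fill.
    The pattern mix is a hypothetical illustration of the idea.
    """
    mask = np.zeros(num_frames, dtype=bool)
    pattern = random.choice(["none", "first", "endpoints", "random"])
    if pattern == "first":        # condition on the first frame -> infilling
        mask[0] = True
    elif pattern == "endpoints":  # condition on both ends -> interpolation
        mask[0] = mask[-1] = True
    elif pattern == "random":     # arbitrary guides -> bidirectional learning
        k = random.randint(1, num_frames - 1)
        mask[random.sample(range(num_frames), k)] = True
    # "none": all frames masked, i.e. unconditional generation
    return mask
```

At inference time, the same conditioning interface lets sparse frames be produced by infilling (first-frame pattern) or interpolation (endpoints pattern), and lets generated frames serve as guide frames for neighboring clips.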