Hierarchical Masked 3D Diffusion Model for Video Outpainting
September 5, 2023
Authors: Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan
cs.AI
Abstract
Video outpainting aims to adequately complete missing areas at the edges of
video frames. Compared to image outpainting, it presents an additional
challenge as the model should maintain the temporal consistency of the filled
area. In this paper, we introduce a masked 3D diffusion model for video
outpainting. We use the technique of mask modeling to train the 3D diffusion
model. This allows us to use multiple guide frames to connect the results of
multiple video clip inferences, thus ensuring temporal consistency and reducing
jitter between adjacent frames. Meanwhile, we extract global frames of the
video as prompts and guide the model to obtain information beyond the
current video clip via cross-attention. We also introduce a hybrid
coarse-to-fine inference pipeline to alleviate the artifact accumulation
problem. The existing coarse-to-fine pipeline only uses the infilling strategy,
which brings degradation because the time interval of the sparse frames is too
large. Our pipeline benefits from bidirectional learning of the mask modeling
and thus can employ a hybrid strategy of infilling and interpolation when
generating sparse frames. Experiments show that our method achieves
state-of-the-art results in video outpainting tasks. More results are available
at our project website: https://fanfanda.github.io/M3DDM/.
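
The core training idea in the abstract, mask modeling over the frames of a 3D diffusion model, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch-style training step: the names `model` and `scheduler`, and the exact conditioning layout (noisy latents concatenated with masked guide frames and a mask channel, global frames injected via cross-attention), are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumed PyTorch) of the mask-modeling training step the
# abstract describes. `model`, `scheduler`, and the conditioning layout are
# hypothetical, not the authors' released code.
import torch
import torch.nn.functional as F

def training_step(model, scheduler, x0, global_frames, p_unconditional=0.1):
    """One denoising step with randomly masked guide frames.

    x0:            (B, C, T, H, W) clean video clip
    global_frames: (B, C, K, H, W) frames sampled uniformly from the whole
                   video, fed to the model's cross-attention as prompts
    """
    B, C, T, H, W = x0.shape

    # Randomly keep some frames visible as "guide frames"; occasionally mask
    # everything so the model also learns the unconditional case (useful for
    # classifier-free guidance at inference).
    frame_mask = (torch.rand(B, 1, T, 1, 1, device=x0.device) < 0.5).float()
    drop_all = (torch.rand(B, 1, 1, 1, 1, device=x0.device) < p_unconditional).float()
    frame_mask = frame_mask * (1.0 - drop_all)
    masked_video = x0 * frame_mask

    # Standard forward diffusion: noise the clip at a random timestep.
    # (num_train_timesteps is an assumed attribute of the scheduler.)
    t = torch.randint(0, scheduler.num_train_timesteps, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)

    # Condition on the masked video plus an explicit mask channel; global
    # frames enter via cross-attention inside the 3D UNet.
    mask_channel = frame_mask.expand(-1, -1, -1, H, W)
    cond = torch.cat([x_t, masked_video, mask_channel], dim=1)
    pred_noise = model(cond, t, context=global_frames)
    return F.mse_loss(pred_noise, noise)
```

Under this assumed interface, frames already generated for neighboring clips can be passed back in as guide frames at inference time, which is how the hierarchical pipeline described above would keep adjacent clips temporally consistent and support both infilling and interpolation when generating sparse frames.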