Hierarchical Masked 3D Diffusion Model for Video Outpainting
September 5, 2023
Authors: Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan
cs.AI
Abstract
Video outpainting aims to adequately complete missing areas at the edges of
video frames. Compared to image outpainting, it presents an additional
challenge as the model should maintain the temporal consistency of the filled
area. In this paper, we introduce a masked 3D diffusion model for video
outpainting. We use the technique of mask modeling to train the 3D diffusion
model. This allows us to use multiple guide frames to connect the results of
multiple video clip inferences, thus ensuring temporal consistency and reducing
jitter between adjacent frames. Meanwhile, we extract global frames of the
video as prompts and guide the model to obtain information beyond the
current video clip via cross-attention. We also introduce a hybrid
coarse-to-fine inference pipeline to alleviate the artifact accumulation
problem. The existing coarse-to-fine pipeline only uses the infilling strategy,
which brings degradation because the time interval of the sparse frames is too
large. Our pipeline benefits from bidirectional learning of the mask modeling
and thus can employ a hybrid strategy of infilling and interpolation when
generating sparse frames. Experiments show that our method achieves
state-of-the-art results in video outpainting tasks. More results are available
at our project website: https://fanfanda.github.io/M3DDM/.
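
The core training idea in the abstract, mask modeling over the frames of a 3D diffusion model, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch-style training step: the names `model` and `scheduler`, and the exact conditioning layout (noisy latents concatenated with masked guide frames and a mask channel, global frames injected via cross-attention), are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumed PyTorch) of the mask-modeling training step the
# abstract describes. `model`, `scheduler`, and the conditioning layout are
# hypothetical, not the authors' released code.
import torch
import torch.nn.functional as F

def training_step(model, scheduler, x0, global_frames, p_unconditional=0.1):
    """One denoising step with randomly masked guide frames.

    x0:            (B, C, T, H, W) clean video clip
    global_frames: (B, C, K, H, W) frames sampled uniformly from the whole
                   video, fed to the model's cross-attention as prompts
    """
    B, C, T, H, W = x0.shape

    # Randomly keep some frames visible as "guide frames"; occasionally mask
    # everything so the model also learns the unconditional case (useful for
    # classifier-free guidance at inference).
    frame_mask = (torch.rand(B, 1, T, 1, 1, device=x0.device) < 0.5).float()
    drop_all = (torch.rand(B, 1, 1, 1, 1, device=x0.device) < p_unconditional).float()
    frame_mask = frame_mask * (1.0 - drop_all)
    masked_video = x0 * frame_mask

    # Standard forward diffusion: noise the clip at a random timestep.
    # (num_train_timesteps is an assumed attribute of the scheduler.)
    t = torch.randint(0, scheduler.num_train_timesteps, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)

    # Condition on the masked video plus an explicit mask channel; global
    # frames enter via cross-attention inside the 3D UNet.
    mask_channel = frame_mask.expand(-1, -1, -1, H, W)
    cond = torch.cat([x_t, masked_video, mask_channel], dim=1)
    pred_noise = model(cond, t, context=global_frames)
    return F.mse_loss(pred_noise, noise)
```

Under this assumed interface, frames already generated for neighboring clips can be passed back in as guide frames at inference time, which is how the hierarchical pipeline described above would keep adjacent clips temporally consistent and support both infilling and interpolation when generating sparse frames.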