VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

October 9, 2025
作者: Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
cs.AI

Abstract

We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
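The Temporal RoPE Interpolation idea above — assigning each pixel-frame condition a continuous, fractional position in the latent sequence — can be sketched in a few lines. This is a minimal illustration under assumed details, not the paper's exact implementation: we assume a causal VAE that encodes the first pixel frame on its own and then compresses every 4 subsequent frames into one latent (a common temporal stride in causal video VAEs), and the helper names (`pixel_to_latent_pos`, `rope_angles`) are hypothetical.

```python
import math

def pixel_to_latent_pos(frame_idx: int, temporal_stride: int = 4) -> float:
    """Map a pixel-frame index to a (possibly fractional) latent position.

    Assumed causal-VAE layout: frame 0 -> latent 0; every `temporal_stride`
    later frames share one latent. A frame that falls between latent slots
    gets a fractional position instead of being snapped to an integer one,
    which is what disambiguates frames inside the same latent.
    """
    if frame_idx == 0:
        return 0.0
    return 1.0 + (frame_idx - 1) / temporal_stride

def rope_angles(pos: float, dim: int = 8, base: float = 10000.0) -> list[float]:
    """Standard RoPE rotation angles; note `pos` may be non-integer,
    so fractional latent positions need no special handling."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

def apply_rope(pair: tuple[float, float], angle: float) -> tuple[float, float]:
    """Rotate one 2-D feature pair by the RoPE angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (pair[0] * c - pair[1] * s, pair[0] * s + pair[1] * c)
```

For example, with stride 4, pixel frame 3 lands at latent position 1.5 — halfway between the first and second latents — so the frozen backbone can distinguish it from frames 1 and 5 without any new parameters, since RoPE is a continuous function of position.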