VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

October 9, 2025
作者: Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
cs.AI

Abstract

We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
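The Temporal RoPE Interpolation idea above — assigning each pixel-frame condition a continuous, fractional position in the latent sequence — can be sketched in a few lines. This is a minimal illustration under assumed details, not the paper's exact implementation: we assume a causal VAE that encodes the first pixel frame on its own and then compresses every 4 subsequent frames into one latent (a common temporal stride in causal video VAEs), and the helper names (`pixel_to_latent_pos`, `rope_angles`) are hypothetical.

```python
import math

def pixel_to_latent_pos(frame_idx: int, temporal_stride: int = 4) -> float:
    """Map a pixel-frame index to a (possibly fractional) latent position.

    Assumed causal-VAE layout: frame 0 -> latent 0; every `temporal_stride`
    later frames share one latent. A frame that falls between latent slots
    gets a fractional position instead of being snapped to an integer one,
    which is what disambiguates frames inside the same latent.
    """
    if frame_idx == 0:
        return 0.0
    return 1.0 + (frame_idx - 1) / temporal_stride

def rope_angles(pos: float, dim: int = 8, base: float = 10000.0) -> list[float]:
    """Standard RoPE rotation angles; note `pos` may be non-integer,
    so fractional latent positions need no special handling."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

def apply_rope(pair: tuple[float, float], angle: float) -> tuple[float, float]:
    """Rotate one 2-D feature pair by the RoPE angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (pair[0] * c - pair[1] * s, pair[0] * s + pair[1] * c)
```

For example, with stride 4, pixel frame 3 lands at latent position 1.5 — halfway between the first and second latents — so the frozen backbone can distinguish it from frames 1 and 5 without any new parameters, since RoPE is a continuous function of position.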