VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Abstract

We introduce the task of arbitrary spatio-temporal video completion, in which a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks (including first-frame image-to-video, inpainting, extension, and interpolation) under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, which compress multiple pixel frames into a single latent representation and thereby make precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task without adding any new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
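
As a rough illustration of the hybrid conditioning strategy, the sketch below shows the two ideas in isolation. It is not the authors' implementation. The function names, the temporal stride of 4, and the mapping from pixel frame t to latent position t / stride are assumptions made for illustration; they presume a causal VAE that encodes the first frame alone and then packs each following group of `stride` frames into one latent.

```python
from typing import List

Frame = List[List[float]]  # a single-channel frame as rows of pixel values


def zero_pad_patch(patch: Frame, top: int, left: int,
                   height: int, width: int) -> Frame:
    """Place `patch` on an all-zero canvas of size height x width.

    Hypothetical helper: spatial placement is expressed purely by where
    the non-zero pixels sit on the canvas, with no extra parameters.
    """
    patch_h = len(patch)
    patch_w = len(patch[0]) if patch else 0
    if top < 0 or left < 0 or top + patch_h > height or left + patch_w > width:
        raise ValueError("patch does not fit inside the canvas")

    canvas = [[0.0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            canvas[top + r][left + c] = value
    return canvas


def fractional_latent_position(pixel_frame: int, stride: int = 4) -> float:
    """Map a pixel-frame index to a continuous position on the latent time axis.

    Assumed layout: frame 0 -> latent 0, frames 1..stride -> latent 1, and so
    on, so latent k ends at pixel frame k * stride. A frame that falls between
    two latent boundaries receives a fractional position instead of being
    rounded to the nearest latent slot.
    """
    if pixel_frame < 0:
        raise ValueError("pixel_frame must be non-negative")
    if stride <= 0:
        raise ValueError("stride must be positive")
    return pixel_frame / stride


if __name__ == "__main__":
    # A 1x2 patch placed at row 1, column 1 of a 3x4 canvas.
    print(zero_pad_patch([[1.0, 2.0]], top=1, left=1, height=3, width=4))
    # Pixel frame 41 does not coincide with a latent boundary under stride 4.
    print(fractional_latent_position(41))
```

Under these assumptions, a condition at pixel frame 41 would be given temporal position 10.25 rather than being snapped to latent 10 or 11, which is the kind of frame-level distinction that an integer latent index cannot express.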
