VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Abstract

We introduce the task of arbitrary spatio-temporal video completion, in which a video is generated from user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation unifies many existing controllable video generation tasks (including first-frame image-to-video, inpainting, extension, and interpolation) under a single paradigm. Realizing this vision, however, runs into a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, which compress multiple pixel frames into a single latent representation and thereby make precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
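The hybrid conditioning idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the exact position mapping, so the rule "pixel frame t sits at latent position t / temporal_stride", the stride of 4, and the function names `zero_pad_patch`, `fractional_latent_position`, and `rope_rotate` are all assumptions made for illustration. The rotary embedding itself is the standard 1-D formulation, evaluated at a real-valued position instead of an integer one.

```python
import numpy as np


def zero_pad_patch(patch, top, left, frame_h, frame_w):
    """Spatial placement: put an (h, w, c) patch on an all-zero full-frame canvas."""
    h, w, c = patch.shape
    if top < 0 or left < 0 or top + h > frame_h or left + w > frame_w:
        raise ValueError("patch does not fit inside the frame")
    canvas = np.zeros((frame_h, frame_w, c), dtype=patch.dtype)
    canvas[top:top + h, left:left + w] = patch
    return canvas


def fractional_latent_position(frame_idx, temporal_stride=4):
    """Temporal alignment (ASSUMED mapping, not taken from the paper):
    pixel frame t is given the continuous latent position t / temporal_stride."""
    return frame_idx / temporal_stride


def rope_rotate(x, position, base=10000.0):
    """Standard 1-D rotary embedding applied to the last axis of x
    at a real-valued (possibly fractional) position."""
    d = x.shape[-1]
    if d % 2 != 0:
        raise ValueError("last dimension must be even")
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


if __name__ == "__main__":
    # A 2x3 RGB patch placed at row 1, column 2 of a 4x8 frame.
    canvas = zero_pad_patch(np.ones((2, 3, 3), dtype=np.float32), 1, 2, 4, 8)
    print(canvas.shape)

    # Frames 1..3 would share latent index 0 under integer division,
    # but receive distinct fractional positions here.
    for t in (0, 1, 2, 3, 4):
        print(t, t // 4, fractional_latent_position(t))

    token = np.arange(8, dtype=np.float64)
    print(rope_rotate(token, fractional_latent_position(3)))
```

Under this assumed mapping, conditions at pixel frames 1, 2, and 3 would collapse onto the same latent slot with integer indexing, whereas fractional positions keep them distinct without adding any parameters, which is the property the abstract attributes to Temporal RoPE Interpolation.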
