HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video and plays a key role in adapting videos to diverse display formats. Supporting such use cases requires large spatial extrapolation over long sequences. However, most existing methods address only one of these two challenges, or lack an explicit mechanism for ensuring global spatio-temporal consistency, and therefore show notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a two-stage coarse-to-fine strategy. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures the global structure and dominant motion across the whole video. Unlike naive downsampling, GCG is built with a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information between them during sampling. This lets GCG encode both long-term structural consistency and short-term temporal dynamics in a single unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable and coherent generation under large spatial expansion and over long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
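The abstract gives no implementation details for the global-local frame swapping mechanism, so the following is only a toy sketch of the index bookkeeping such a scheme might involve. Everything in it is an assumption made for illustration: the function names (`keyframe_indices`, `local_windows`, `swap_frames`), the evenly spaced keyframe policy, the overlapping-window layout, and the rule that each keyframe is exchanged with the first window that contains it. Latents are stood in for by plain floats, and no diffusion model is involved.

```python
from typing import Dict, List


def keyframe_indices(num_frames: int, num_keyframes: int) -> List[int]:
    """Evenly spaced sparse keyframes, including first and last frame (assumed policy)."""
    if num_keyframes < 2 or num_frames < num_keyframes:
        raise ValueError("need 2 <= num_keyframes <= num_frames")
    return [(i * (num_frames - 1)) // (num_keyframes - 1) for i in range(num_keyframes)]


def local_windows(num_frames: int, window: int, stride: int) -> List[List[int]]:
    """Overlapping local temporal windows that together cover every frame (assumed layout)."""
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # extra window so the tail is covered
    return [list(range(s, min(s + window, num_frames))) for s in starts]


def swap_frames(global_latents: Dict[int, float],
                window_latents: List[Dict[int, float]]) -> None:
    """Exchange keyframe latents between the global branch and the local windows.

    Hypothetical rule: the global branch receives the latent held by the first
    window containing the keyframe, and every window containing the keyframe
    receives the global branch's pre-swap latent.
    """
    old_global = dict(global_latents)  # snapshot so the result is order-independent
    for k, g_value in old_global.items():
        owners = [w for w in window_latents if k in w]
        if not owners:
            continue
        global_latents[k] = owners[0][k]  # local -> global
        for w in owners:
            w[k] = g_value                # global -> local


# Toy usage: 17 frames, 5 keyframes, windows of 8 frames with stride 4.
keys = keyframe_indices(17, 5)
windows = local_windows(17, 8, 4)
g = {k: 100.0 + k for k in keys}                    # stand-in global latents
w = [{f: float(f) for f in win} for win in windows]  # stand-in local latents
swap_frames(g, w)
```

In an actual sampler, an exchange like this would presumably run between denoising steps, so that the sparse keyframe branch and the dense local branches keep each other consistent; the abstract does not specify the schedule or the exact exchange rule.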