HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long Videos

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video and plays a key role in adapting videos to diverse display formats. Supporting such use cases requires large spatial extrapolation sustained over long sequences. However, most existing methods address only one of these two challenges, large spatial expansion or long duration, and lack explicit mechanisms for ensuring global spatio-temporal consistency, which leads to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures the global structure and dominant motion of the entire video. Rather than being obtained by naive downsampling, GCG is built with a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information between them during sampling. GCG thereby encodes both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation under large spatial expansion and over long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
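
To make the idea of coupling sparse global keyframes with local temporal windows more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how keyframes are chosen, how windows are laid out, or what is exchanged, so everything here is an assumption: keyframes are evenly spaced, windows are contiguous and non-overlapping, and the "swap" simply exchanges the per-frame latent of each keyframe between a global branch and the local window that contains it. The function names `keyframe_indices`, `window_ranges`, and `swap_step` are hypothetical.

```python
import numpy as np


def keyframe_indices(num_frames: int, num_keyframes: int) -> np.ndarray:
    """Evenly spaced frame indices spanning the whole video (assumed layout)."""
    return np.linspace(0, num_frames - 1, num_keyframes).round().astype(int)


def window_ranges(num_frames: int, window: int) -> list:
    """Contiguous, non-overlapping [start, stop) windows (assumed layout)."""
    return [(s, min(s + window, num_frames)) for s in range(0, num_frames, window)]


def swap_step(global_lat, key_idx, local_lat, windows) -> None:
    """Exchange latents of frames shared by the global and local branches.

    global_lat: array of shape (K, D), one row per keyframe.
    local_lat:  list of arrays of shape (w_i, D), one per window.
    Both are modified in place. This is a toy stand-in for whatever
    information exchange happens during sampling in the actual method.
    """
    for g, frame in enumerate(key_idx):
        for w, (start, stop) in enumerate(windows):
            if start <= frame < stop:
                j = frame - start
                tmp = global_lat[g].copy()       # keep the global copy
                global_lat[g] = local_lat[w][j]  # global branch takes the local latent
                local_lat[w][j] = tmp            # local branch takes the global latent
                break


# Toy usage: 16 frames, 4 keyframes, windows of 8 frames, 2-dim latents.
keys = keyframe_indices(16, 4)
wins = window_ranges(16, 8)
g_lat = np.zeros((len(keys), 2))
l_lat = [np.ones((stop - start, 2)) for start, stop in wins]
swap_step(g_lat, keys, l_lat, wins)
```

In a real sampler such an exchange would presumably be interleaved with denoising steps, so that the global branch propagates long-range structure into each window while the windows feed short-range motion back to the keyframes. That interleaving is omitted here because the abstract does not describe it.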