HL-OutPaint: A Coarse-to-Fine Video Outpainting Method for High-Resolution, Long-Range Videos

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video and plays a key role in adapting videos to diverse display formats. To support such use cases, a method must handle large spatial extrapolation over long sequences. However, most existing methods address only one of these two challenges, or lack an explicit mechanism for ensuring global spatio-temporal consistency, which leads to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a two-stage, coarse-to-fine strategy. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures the global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information between them during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
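To make the idea of coupling sparse global keyframes with local temporal windows more concrete, the following is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not specify how keyframes are selected, how windows are laid out, or how information is exchanged. The function names (`global_keyframes`, `local_windows`, `swap_step`), the fixed stride, the non-overlapping windows, and the plain pairwise swap are all assumptions made for illustration only.

```python
# Toy sketch only: every name and design choice here is hypothetical.

def global_keyframes(num_frames, stride):
    """Indices of sparse keyframes spread over the whole video (assumed fixed stride)."""
    return list(range(0, num_frames, stride))


def local_windows(num_frames, window):
    """Contiguous, non-overlapping windows of frame indices (assumed layout)."""
    return [
        list(range(start, min(start + window, num_frames)))
        for start in range(0, num_frames, window)
    ]


def swap_step(global_latents, local_latents, keyframes):
    """Exchange the latents at keyframe positions between the two branches.

    global_latents: dict mapping keyframe index -> latent (global branch)
    local_latents:  list with one latent per frame (local branch)
    Both containers are modified in place and also returned.
    """
    for k in keyframes:
        global_latents[k], local_latents[k] = local_latents[k], global_latents[k]
    return global_latents, local_latents


# Strings stand in for latent tensors so the exchange is easy to follow.
keys = global_keyframes(num_frames=10, stride=4)
windows = local_windows(num_frames=10, window=4)
g = {k: f"g{k}" for k in keys}
l = [f"l{i}" for i in range(10)]
g, l = swap_step(g, l, keys)
```

In this sketch, one call to `swap_step` would correspond to a single exchange during sampling: the global branch receives the locally refined keyframes, and each local window receives a keyframe that has been processed alongside distant frames. How often such an exchange happens, and whether it is a hard swap or a softer blend, is not stated in the abstract.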