HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution, Long-Duration Videos

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video and plays a key role in adapting videos to diverse display formats. To support such use cases, a method must handle large spatial extrapolation over long sequences. However, most existing methods address only one of these two challenges, or lack explicit mechanisms for ensuring global spatio-temporal consistency, which leads to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy implemented as a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures the global structure and dominant motion of the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information between them during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation under large spatial expansion and over long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
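The abstract does not give implementation details for the global-local frame swapping, so the following is only an illustrative sketch of the index bookkeeping such a scheme might involve, not the authors' implementation. Every name here (`keyframe_indices`, `local_windows`, `toy_denoise`, `sample`), the `stride` and `window` parameters, and the non-overlapping window layout are assumptions made for illustration. In particular, `toy_denoise` is a stand-in that merely pulls frames toward their temporal mean; a real system would call a video diffusion model there.

```python
import numpy as np

def keyframe_indices(num_frames, stride):
    """Sparse, evenly strided keyframes; the last frame is always included."""
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)
    return idx

def local_windows(num_frames, window):
    """Non-overlapping local temporal windows as (start, end) pairs."""
    return [(s, min(s + window, num_frames)) for s in range(0, num_frames, window)]

def toy_denoise(x, strength=0.5):
    """Hypothetical placeholder for one model step: blend frames toward their mean."""
    return (1 - strength) * x + strength * x.mean(axis=0, keepdims=True)

def sample(latents, stride=4, window=4, steps=3):
    """Alternate a global pass over keyframes with local passes over windows."""
    latents = latents.copy()  # leave the caller's array untouched
    key_idx = keyframe_indices(len(latents), stride)
    for _ in range(steps):
        # Global pass: read the keyframes out of the sequence and process them jointly.
        global_latents = toy_denoise(latents[key_idx])
        # Swap: write the updated keyframes back into the full sequence.
        latents[key_idx] = global_latents
        # Local pass: each window sees the globally updated keyframes it contains.
        for start, end in local_windows(len(latents), window):
            latents[start:end] = toy_denoise(latents[start:end])
        # The keyframes now carry local information into the next global pass.
    return latents

if __name__ == "__main__":
    frames = np.random.default_rng(0).normal(size=(10, 2, 2))  # (time, height, width)
    print(keyframe_indices(10, 4))
    print(local_windows(10, 4))
    print(sample(frames).shape)
```

The point of the sketch is the alternation: keyframes are processed together so that distant frames can agree on structure, then written back so that each local window is conditioned on them, and the locally refined keyframes feed the next global pass.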