HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video and plays a key role in adapting videos to diverse display formats. To support such use cases, a method must handle large spatial extrapolation over long sequences. However, most existing methods address only one of these two challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information between them during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
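
To make the idea of coupling sparse global keyframes with local temporal windows more concrete, the following is a minimal sketch of the index bookkeeping such a mechanism might involve. It is not the HL-OutPaint implementation: the function names (`global_keyframes`, `local_windows`, `swap_frames`), the even keyframe stride, the non-overlapping windows, and the use of plain floats in place of latent tensors are all assumptions made for illustration only.

```python
from typing import Dict, List


def global_keyframes(num_frames: int, stride: int) -> List[int]:
    """Hypothetical helper: sparse, evenly spaced frame indices over the video."""
    return list(range(0, num_frames, stride))


def local_windows(num_frames: int, window: int) -> List[List[int]]:
    """Hypothetical helper: contiguous, non-overlapping windows of frame indices."""
    return [
        list(range(start, min(start + window, num_frames)))
        for start in range(0, num_frames, window)
    ]


def swap_frames(
    global_latents: Dict[int, float],
    window_latents: List[Dict[int, float]],
) -> None:
    """Exchange, in place, the latents of frames shared by the global set and a window.

    Assumes each keyframe index appears in at most one window, which holds
    for the non-overlapping windows built by local_windows.
    """
    for win in window_latents:
        for idx in win:
            if idx in global_latents:
                global_latents[idx], win[idx] = win[idx], global_latents[idx]


# Toy setup: 12 frames, a keyframe every 4 frames, windows of 4 frames.
keys = global_keyframes(12, 4)
windows = local_windows(12, 4)

# Floats stand in for latent tensors; the +100 offset marks the "global" pass.
g_lat = {i: 100.0 + i for i in keys}
w_lat = [{i: float(i) for i in w} for w in windows]

# In a real sampler this exchange would presumably happen between denoising steps.
swap_frames(g_lat, w_lat)
print(g_lat)
print(w_lat[0])
```

In this toy setup, each local window receives the globally processed value for its keyframe, while the global set receives the value computed inside the window, so long-range and short-range information can mix across steps. How often the exchange occurs and how the latents are combined are design choices the abstract does not specify.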