PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
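The three-partition bookkeeping described above can be illustrated with a minimal Python sketch. It is not the PackForcing implementation: the `Block` type, the per-block `score` used for top-k selection, the partition sizes, and the simple contiguous re-indexing that stands in for the Temporal RoPE Adjustment are all assumptions made for illustration. The actual method compresses mid tokens with a learned dual-branch network, which is only represented here by a `compressed` flag.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    frame_index: int          # original temporal index of the cached block
    score: float              # hypothetical relevance score for top-k selection
    compressed: bool = False  # True once the block is treated as a mid token


def build_context(
    history: List[Block], num_sink: int, num_recent: int, top_k: int
) -> Tuple[List[Block], List[int]]:
    """Split history into sink / mid / recent and keep only the top-k mid blocks.

    Returns the kept blocks in temporal order together with contiguous
    positions, so that dropped mid blocks leave no gaps in the position index.
    """
    n = len(history)
    if num_sink + num_recent >= n:
        # Short history: nothing falls into the mid partition yet.
        kept = list(history)
    else:
        sink = history[:num_sink]
        mid = history[num_sink:n - num_recent]
        recent = history[n - num_recent:]
        # Keep the k highest-scoring mid blocks, then restore temporal order.
        selected = sorted(mid, key=lambda b: b.score, reverse=True)[:top_k]
        selected.sort(key=lambda b: b.frame_index)
        selected = [Block(b.frame_index, b.score, compressed=True) for b in selected]
        kept = sink + selected + recent
    # Assumed stand-in for the position adjustment: re-index contiguously.
    positions = list(range(len(kept)))
    return kept, positions


# Ten cached blocks; the scores are made up for the example.
scores = [0.0, 0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.0, 0.0]
history = [Block(i, s) for i, s in enumerate(scores)]
kept, positions = build_context(history, num_sink=2, num_recent=2, top_k=3)
kept_frames = [b.frame_index for b in kept]
```

Because the kept context never exceeds `num_sink + top_k + num_recent` blocks, its size stays constant however long the history grows, which is the sense in which the cache is bounded. The quoted figures are consistent with each other: 120 s at 16 FPS is 1920 frames, and 120 s / 5 s gives the 24x extrapolation factor.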
