PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Abstract

Autoregressive video diffusion models have made remarkable progress, yet during long-video generation they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve massive spatiotemporal compression (32x token reduction) via a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, which are kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns the position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5 s to 120 s), operating effectively either zero-shot or when trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality long-video synthesis. https://github.com/ShandaAI/PackForcing
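
To make the cache layout concrete, here is a minimal Python sketch of the three-partition idea described in the abstract. It is an illustration under stated assumptions, not the authors' implementation. It operates on frame indices rather than real KV tensors, and the function names `partition_history` and `contiguous_positions`, the per-frame `scores`, and the default partition sizes are all hypothetical. Reindexing retained frames to gap-free positions is a simplified stand-in for the Temporal RoPE Adjustment, and the 32x compression of mid tokens is omitted entirely.

```python
def partition_history(scores, num_sink=2, num_recent=4, top_k=3):
    """Split frame indices 0..len(scores)-1 into sink, kept-mid, and recent groups.

    `scores` holds one relevance score per frame. This is an assumed stand-in
    for whatever signal ranks mid tokens; the abstract does not specify it.
    """
    n = len(scores)
    # Sink: the earliest frames, always retained.
    sink = list(range(min(num_sink, n)))
    # Recent: the latest frames, always retained (never overlapping the sink).
    recent_start = max(len(sink), n - num_recent)
    recent = list(range(recent_start, n))
    # Mid: everything in between, subject to top-k selection.
    mid = list(range(len(sink), recent_start))
    best = sorted(mid, key=lambda i: scores[i], reverse=True)[:top_k]
    kept_mid = sorted(best)  # restore temporal order after ranking
    return sink, kept_mid, recent


def contiguous_positions(sink, kept_mid, recent):
    """Map each retained frame index to a gap-free temporal position."""
    retained = sink + kept_mid + recent
    return {frame: pos for pos, frame in enumerate(retained)}


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.9, 0.3, 0.8, 0.05, 0.7, 0.4, 0.5, 0.6, 0.0, 0.15]
    sink, kept_mid, recent = partition_history(scores)
    print(sink, kept_mid, recent)  # [0, 1] [2, 4, 6] [8, 9, 10, 11]
    print(contiguous_positions(sink, kept_mid, recent))
```

In this toy setting the number of retained frames never exceeds `num_sink + top_k + num_recent`, however long the history grows, which mirrors the bounded-cache property the abstract claims.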
