PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
March 26, 2026
Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
cs.AI
Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. Code: https://github.com/ShandaAI/PackForcing
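The abstract's cache policy — full-resolution sink and recent tokens, a top-k selection over the compressed mid tokens, and contiguous position re-indexing after drops — can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`pack_kv_cache`, `realign_positions`), the parameter values, and the dot-product relevance score are all illustrative assumptions standing in for the learned compression and attention-based selection described above.

```python
import numpy as np

def pack_kv_cache(tokens, sink_len=2, recent_len=4, top_k=3, query=None):
    """Three-partition cache sketch (illustrative, not the paper's API).

    Keeps the `sink_len` earliest and `recent_len` latest token indices,
    plus the `top_k` mid tokens scoring highest against `query`.
    Returns the sorted list of retained indices, so total cache size is
    bounded by sink_len + top_k + recent_len regardless of video length.
    """
    n = len(tokens)
    if n <= sink_len + recent_len:
        return list(range(n))  # nothing to evict yet
    sink = list(range(sink_len))
    recent = list(range(n - recent_len, n))
    mid = list(range(sink_len, n - recent_len))
    if query is None:
        kept_mid = mid[-top_k:]  # fallback: most recent mid tokens
    else:
        # Stand-in relevance score: dot product with the current query.
        scores = [float(np.dot(tokens[i], query)) for i in mid]
        order = np.argsort(scores)[::-1][:top_k]
        kept_mid = sorted(mid[i] for i in order)
    return sink + kept_mid + recent

def realign_positions(kept_indices):
    """Continuous temporal-RoPE re-adjustment sketch: re-index survivors
    contiguously so rotary positions see no gaps from dropped tokens."""
    return {orig: new for new, orig in enumerate(kept_indices)}
```

For example, with 12 cached tokens whose relevance grows with their index, the policy keeps indices 0-1 (sink), the three highest-scoring mid tokens 5-7, and 8-11 (recent), then renumbers them 0..8 so rotary phases stay continuous. In the paper this selection bounds the cache at 4 GB while the RoPE adjustment hides the positional gaps at negligible cost.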