PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
March 26, 2026
Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
cs.AI
Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. Code: https://github.com/ShandaAI/PackForcing
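The abstract's cache policy — full-resolution sink and recent tokens, a top-k selection over the compressed mid tokens, and contiguous position re-indexing after drops — can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`pack_kv_cache`, `realign_positions`), the parameter values, and the dot-product relevance score are all illustrative assumptions standing in for the learned compression and attention-based selection described above.

```python
import numpy as np

def pack_kv_cache(tokens, sink_len=2, recent_len=4, top_k=3, query=None):
    """Three-partition cache sketch (illustrative, not the paper's API).

    Keeps the `sink_len` earliest and `recent_len` latest token indices,
    plus the `top_k` mid tokens scoring highest against `query`.
    Returns the sorted list of retained indices, so total cache size is
    bounded by sink_len + top_k + recent_len regardless of video length.
    """
    n = len(tokens)
    if n <= sink_len + recent_len:
        return list(range(n))  # nothing to evict yet
    sink = list(range(sink_len))
    recent = list(range(n - recent_len, n))
    mid = list(range(sink_len, n - recent_len))
    if query is None:
        kept_mid = mid[-top_k:]  # fallback: most recent mid tokens
    else:
        # Stand-in relevance score: dot product with the current query.
        scores = [float(np.dot(tokens[i], query)) for i in mid]
        order = np.argsort(scores)[::-1][:top_k]
        kept_mid = sorted(mid[i] for i in order)
    return sink + kept_mid + recent

def realign_positions(kept_indices):
    """Continuous temporal-RoPE re-adjustment sketch: re-index survivors
    contiguously so rotary positions see no gaps from dropped tokens."""
    return {orig: new for new, orig in enumerate(kept_indices)}
```

For example, with 12 cached tokens whose relevance grows with their index, the policy keeps indices 0-1 (sink), the three highest-scoring mid tokens 5-7, and 8-11 (recent), then renumbers them 0..8 so rotary phases stay continuous. In the paper this selection bounds the cache at 4 GB while the RoPE adjustment hides the positional gaps at negligible cost.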