

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

March 26, 2026
Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
cs.AI

Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
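The three-partition cache described above can be sketched in a few lines. This is a minimal illustration based only on the abstract: the partition lengths (`SINK_LEN`, `RECENT_LEN`, `TOP_K`), the cosine-similarity scoring for top-k selection, and all function names are assumptions, and the paper's dual-branch 3D-conv/low-res-VAE compressor is replaced here by simple average pooling that reproduces only the reported 32x token reduction.

```python
import torch

SINK_LEN = 4       # anchor tokens kept at full resolution (assumed length)
RECENT_LEN = 8     # most recent tokens kept at full resolution (assumed)
MID_COMPRESS = 32  # abstract reports a 32x token reduction for mid tokens
TOP_K = 2          # number of compressed mid chunks retained (assumed)

def partition_cache(kv: torch.Tensor):
    """Split a [T, D] token cache into sink / mid / recent partitions."""
    sink = kv[:SINK_LEN]
    mid = kv[SINK_LEN : kv.shape[0] - RECENT_LEN]
    recent = kv[kv.shape[0] - RECENT_LEN :]
    return sink, mid, recent

def compress_mid(mid: torch.Tensor) -> torch.Tensor:
    """Stand-in for the dual-branch compressor: average-pool groups of
    MID_COMPRESS tokens into one token each (the 32x reduction)."""
    T, D = mid.shape
    usable = (T // MID_COMPRESS) * MID_COMPRESS
    return mid[:usable].reshape(-1, MID_COMPRESS, D).mean(dim=1)

def select_mid_topk(mid_c: torch.Tensor, query: torch.Tensor, k: int = TOP_K):
    """Dynamic top-k context selection: keep the k compressed chunks most
    relevant to the current query (cosine similarity is an assumption)."""
    scores = torch.nn.functional.cosine_similarity(mid_c, query[None, :], dim=-1)
    k = min(k, mid_c.shape[0])
    idx = scores.topk(k).indices.sort().values  # preserve temporal order
    return mid_c[idx], idx

def reassign_positions(n_sink: int, n_mid: int, n_recent: int) -> torch.Tensor:
    """Continuous temporal RoPE adjustment: after dropping tokens, renumber
    the surviving cache with contiguous positions so RoPE sees no gaps."""
    return torch.arange(n_sink + n_mid + n_recent)

# Toy usage: 140 tokens = 4 sink + 4 chunks of 32 mid + 8 recent.
kv = torch.randn(4 + 32 * 4 + 8, 64)
sink, mid, recent = partition_cache(kv)
mid_c = compress_mid(mid)                       # 128 mid tokens -> 4 tokens
kept, _ = select_mid_topk(mid_c, query=recent[-1])
pos = reassign_positions(len(sink), len(kept), len(recent))
print(len(sink) + len(kept) + len(recent), pos[-1].item())  # 14 13
```

The bounded-memory claim follows directly from this structure: whatever the video length, the attended cache never exceeds `SINK_LEN + TOP_K + RECENT_LEN` tokens, and the contiguous position renumbering is what lets a model trained on 5-second clips extrapolate without seeing out-of-range RoPE positions.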