

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

March 26, 2026
Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
cs.AI

Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
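The three-partition cache described above can be sketched in a few lines. This is a minimal illustration based only on the abstract: the partition lengths (`SINK_LEN`, `RECENT_LEN`, `TOP_K`), the cosine-similarity scoring for top-k selection, and all function names are assumptions, and the paper's dual-branch 3D-conv/low-res-VAE compressor is replaced here by simple average pooling that reproduces only the reported 32x token reduction.

```python
import torch

SINK_LEN = 4       # anchor tokens kept at full resolution (assumed length)
RECENT_LEN = 8     # most recent tokens kept at full resolution (assumed)
MID_COMPRESS = 32  # abstract reports a 32x token reduction for mid tokens
TOP_K = 2          # number of compressed mid chunks retained (assumed)

def partition_cache(kv: torch.Tensor):
    """Split a [T, D] token cache into sink / mid / recent partitions."""
    sink = kv[:SINK_LEN]
    mid = kv[SINK_LEN : kv.shape[0] - RECENT_LEN]
    recent = kv[kv.shape[0] - RECENT_LEN :]
    return sink, mid, recent

def compress_mid(mid: torch.Tensor) -> torch.Tensor:
    """Stand-in for the dual-branch compressor: average-pool groups of
    MID_COMPRESS tokens into one token each (the 32x reduction)."""
    T, D = mid.shape
    usable = (T // MID_COMPRESS) * MID_COMPRESS
    return mid[:usable].reshape(-1, MID_COMPRESS, D).mean(dim=1)

def select_mid_topk(mid_c: torch.Tensor, query: torch.Tensor, k: int = TOP_K):
    """Dynamic top-k context selection: keep the k compressed chunks most
    relevant to the current query (cosine similarity is an assumption)."""
    scores = torch.nn.functional.cosine_similarity(mid_c, query[None, :], dim=-1)
    k = min(k, mid_c.shape[0])
    idx = scores.topk(k).indices.sort().values  # preserve temporal order
    return mid_c[idx], idx

def reassign_positions(n_sink: int, n_mid: int, n_recent: int) -> torch.Tensor:
    """Continuous temporal RoPE adjustment: after dropping tokens, renumber
    the surviving cache with contiguous positions so RoPE sees no gaps."""
    return torch.arange(n_sink + n_mid + n_recent)

# Toy usage: 140 tokens = 4 sink + 4 chunks of 32 mid + 8 recent.
kv = torch.randn(4 + 32 * 4 + 8, 64)
sink, mid, recent = partition_cache(kv)
mid_c = compress_mid(mid)                       # 128 mid tokens -> 4 tokens
kept, _ = select_mid_topk(mid_c, query=recent[-1])
pos = reassign_positions(len(sink), len(kept), len(recent))
print(len(sink) + len(kept) + len(recent), pos[-1].item())  # 14 13
```

The bounded-memory claim follows directly from this structure: whatever the video length, the attended cache never exceeds `SINK_LEN + TOP_K + RECENT_LEN` tokens, and the contiguous position renumbering is what lets a model trained on 5-second clips extrapolate without seeing out-of-range RoPE positions.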