Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
December 4, 2025
Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
cs.AI
Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV-cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x length extrapolation (e.g., from a 5s training horizon to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive, streaming long-video generation.
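To make the two mechanisms concrete, below is a minimal sketch of how Deep Sink's window split with RoPE re-alignment and Participative Compression's attention-based pruning could look, based only on the description above. All function names, the `sink_fraction` and `keep_ratio` parameters, and the assumed (tokens, dim) cache layout are illustrative assumptions, not the paper's actual implementation or API.

```python
# Illustrative sketch only (not the authors' code): a minimal rendering of
# the two cache-management ideas in the abstract, assuming KV tensors of
# shape (num_tokens, dim) and access to recent attention weights.
import torch


def deep_sink_select(keys, values, window_size, sink_fraction=0.5):
    """Deep Sink (sketch): reserve a fraction of the sliding window for the
    earliest ("sink") tokens and fill the remainder with the newest tokens."""
    n_sink = int(window_size * sink_fraction)
    n_recent = window_size - n_sink
    if keys.shape[0] <= window_size:
        return keys, values
    k = torch.cat([keys[:n_sink], keys[-n_recent:]], dim=0)
    v = torch.cat([values[:n_sink], values[-n_recent:]], dim=0)
    return k, v


def realigned_positions(n_sink, n_recent, current_end):
    """Temporal RoPE re-alignment (sketch): rather than keeping the sink
    tokens' stale original timestamps, assign them positions immediately
    before the recent window, so their RoPE phase matches the current
    timeline when these positions are fed to the rotary embedding."""
    start = current_end - (n_sink + n_recent)
    sink_pos = torch.arange(start, start + n_sink)
    recent_pos = torch.arange(current_end - n_recent, current_end)
    return torch.cat([sink_pos, recent_pos])


def participative_compression(keys, values, recent_attn, keep_ratio=0.5):
    """Participative Compression (sketch): score each cached token by the
    attention mass it received from recent queries, keep only the most
    active tokens, and preserve their original temporal order."""
    importance = recent_attn.sum(dim=0)                 # (num_tokens,)
    n_keep = max(1, int(importance.numel() * keep_ratio))
    idx = importance.topk(n_keep).indices.sort().values
    return keys[idx], values[idx]


# Toy usage: prune a 2048-token cache to its 1024 most-attended tokens.
keys, values = torch.randn(2048, 64), torch.randn(2048, 64)
recent_attn = torch.rand(16, 2048).softmax(dim=-1)      # stand-in weights
k, v = participative_compression(keys, values, recent_attn)
print(k.shape)  # torch.Size([1024, 64])
```

Keeping the surviving tokens in temporal order (the `sort()` above) keeps their positions monotone for RoPE; how the actual method schedules pruning against the sink window is not specified by the abstract.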