

LoL: Longer than Longer, Scaling Video Generation to Hour

January 23, 2026
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
cs.AI

Abstract

Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
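To make the proposed mechanism concrete, here is a minimal NumPy sketch of per-head RoPE jitter. It is an illustrative reconstruction from the abstract alone, not the authors' implementation: standard RoPE rotation angles are computed per position, and each attention head receives a small random phase offset (the `jitter_scale` knob and `apply_rope_with_head_jitter` name are hypothetical) so that attention patterns no longer align identically across heads.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: theta_i = base^(-2i/dim), angle = position * theta_i."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive (even, odd) channel pairs of x by `angles`."""
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def apply_rope_with_head_jitter(x, positions, base=10000.0,
                                jitter_scale=0.01, seed=0):
    """Per-head RoPE jitter (a sketch of the idea in the abstract):
    each head gets its own small random phase offset, breaking the
    inter-head attention homogenization that drives sink-collapse.
    x: (heads, seq, dim); jitter_scale is a hypothetical hyperparameter."""
    heads, _, dim = x.shape
    rng = np.random.default_rng(seed)
    angles = rope_angles(positions, dim, base)            # shared base angles
    out = np.empty_like(x)
    for h in range(heads):
        phase = jitter_scale * rng.standard_normal(dim // 2)  # per-head offset
        out[h] = apply_rope(x[h], angles + phase)
    return out
```

Because the jitter is a pure rotation offset, it is training-free: it changes relative phase between heads without altering vector norms or requiring any fine-tuning, consistent with the lightweight approach the abstract describes.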