LoL: Longer than Longer, Scaling Video Generation to Hour
January 23, 2026
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
cs.AI
Abstract
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address this conflict, we propose a lightweight, training-free approach: a multi-head RoPE jitter that breaks inter-head attention homogenization, effectively suppressing sink-collapse and mitigating long-horizon degradation. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work is the first demonstration of real-time, streaming, infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours long, among the longest publicly demonstrated results in streaming video generation.
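The abstract does not spell out the exact form of the jitter, but the idea of a per-head RoPE perturbation can be sketched directly. Below is a minimal, hypothetical PyTorch illustration: the function names, the multiplicative-frequency scheme, and the `jitter_scale` parameter are all assumptions for exposition, not the authors' implementation. Each attention head rotates its queries and keys with a slightly rescaled frequency schedule, so the rotary phases, and hence their periods, no longer coincide across heads.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE schedule: theta_i(p) = p * base^(-2i/dim) per channel pair i.
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions.float()[:, None] * inv_freq[None, :]   # (seq, dim/2)

def apply_rope(x, angles):
    # Rotate each (even, odd) channel pair of x by its angle.
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_with_head_jitter(q, k, positions, jitter_scale=0.02, seed=0):
    """Hypothetical per-head RoPE jitter; q, k: (batch, heads, seq, head_dim)."""
    b, h, s, d = q.shape
    gen = torch.Generator().manual_seed(seed)   # fixed perturbation: training-free
    # One multiplicative frequency factor per head, close to 1.0.
    factors = 1.0 + jitter_scale * (2.0 * torch.rand(h, generator=gen) - 1.0)
    base_angles = rope_angles(positions, d)     # shared (seq, d/2) schedule
    q_out, k_out = torch.empty_like(q), torch.empty_like(k)
    for i in range(h):
        ang = base_angles * factors[i]          # head-specific phase schedule
        # Same angles for q and k preserve relative positions within a head.
        q_out[:, i] = apply_rope(q[:, i], ang)
        k_out[:, i] = apply_rope(k[:, i], ang)
    return q_out, k_out

# Usage sketch: 1 sequence, 8 heads, 16 positions, head_dim 64.
q, k = torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64)
q_r, k_r = rope_with_head_jitter(q, k, torch.arange(16))
```

Because the factors are fixed and applied identically to queries and keys within a head, each head's relative-position behavior is preserved; only the cross-head alignment of RoPE's periodic structure is disturbed, which is the property the abstract identifies as the trigger of sink-collapse.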