Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
June 3, 2026
Authors: Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu
cs.AI
Abstract
We present Echo-Infinity, an autoregressive (AR) framework for real-time infinite video generation that uses a learnable evolving memory to dynamically filter, abstract, and compress history of any length at constant cost. Existing methods mainly manage memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. Because their cache windows are limited and they ignore the noise introduced by autoregressive generation, these designs inevitably lose historical information and amplify compounding errors. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated through attention and a gating mechanism whenever past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios at a constant computational cost independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce the Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame's id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. On both long and short video generation, Echo-Infinity achieves state-of-the-art performance and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
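To make the RoPE id rule concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. The function name `relative_rope_ids` and the parameters `num_sink`, `window`, and `max_id` are hypothetical. The sketch assumes that local-window frames receive consecutive ids ending at the newest frame's id, and it does not model how ids are assigned to the memory queries, which the abstract does not specify.

```python
def relative_rope_ids(num_frames, num_sink, window, max_id):
    """Illustrative temporal RoPE ids for the frames currently attended to.

    Assumptions (not taken from the paper): the first `num_sink` frames are
    sink frames, the last `window` non-sink frames form the local window,
    and window ids are consecutive, ending at the newest frame's id.
    """
    if num_frames < 1 or num_sink < 1 or window < 1:
        raise ValueError("num_frames, num_sink and window must be positive")
    if max_id < num_sink + window - 1:
        raise ValueError("max_id is too small to hold sink and window ids")

    # Sink frames are anchored at the start: ids 0, 1, ..., num_sink - 1.
    sink_count = min(num_sink, num_frames)
    sink_ids = list(range(sink_count))

    # Number of non-sink frames that are still inside the local window.
    local_count = min(window, num_frames - sink_count)

    # The newest frame's id follows the absolute frame index until it
    # reaches max_id, then stays there for the rest of the rollout.
    newest_id = min(num_frames - 1, max_id)
    window_ids = list(range(newest_id - local_count + 1, newest_id + 1))
    return sink_ids, window_ids


if __name__ == "__main__":
    # Toy values only; real models use far larger windows and id ranges.
    for n in (1, 5, 20, 1_300_000):
        print(n, relative_rope_ids(n, num_sink=2, window=3, max_id=10))
```

Under these assumptions the ids never exceed `max_id` however long the rollout runs, so the model would see the same id range at inference as during training, which is the property the abstract attributes to the recipe.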