Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

June 3, 2026
Authors: Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu
cs.AI

Abstract

We present Echo-Infinity, an autoregressive (AR) framework for real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress history of any length at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. Because their cache window is limited and they ignore the noise introduced by autoregressive generation, these designs inevitably lose historical information and amplify compounding errors. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism whenever past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformer (DiT), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE recipe, which anchors the sink frames at id 0 and, throughout training and inference, lets the newest frame id grow no further than the DiT's pretrained maximum temporal RoPE id. This frees the model from the finite RoPE constraint and closes the train-test RoPE extrapolation gap. Echo-Infinity achieves state-of-the-art performance in both long and short video generation and, to our knowledge, demonstrates promising 24-hour (more than 1.3 million frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
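To make the two mechanisms in the abstract more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`update_memory`, `relative_rope_ids`), the tensor shapes, the single-head attention, the sigmoid gate form, and the parameter `gate_w` are all assumptions chosen for illustration. The first function folds evicted frame tokens into a fixed-size memory via attention plus a gate, so the memory size does not grow with video length. The second shows one possible reading of the RoPE recipe, with sink ids starting at 0 and the newest frame id capped at a maximum.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_memory(memory, evicted, gate_w):
    """Fold evicted tokens into a fixed-size memory (hypothetical form).

    memory:  M vectors of dimension D (the memory query states)
    evicted: N vectors of dimension D (tokens leaving the local window)
    gate_w:  D rows of length 2 * D (assumed gate projection)
    """
    dim = len(memory[0])
    new_memory = []
    for query in memory:
        # Attention: the query reads a weighted mix of the evicted tokens.
        weights = softmax([dot(query, tok) / math.sqrt(dim) for tok in evicted])
        read = [sum(w * tok[i] for w, tok in zip(weights, evicted))
                for i in range(dim)]
        # Sigmoid gate: per channel, how much of the old state to overwrite.
        gate_in = query + read  # list concatenation, length 2 * D
        gate = [1.0 / (1.0 + math.exp(-dot(row, gate_in))) for row in gate_w]
        new_memory.append([(1.0 - g) * q + g * r
                           for g, q, r in zip(gate, query, read)])
    return new_memory


def relative_rope_ids(step, num_sink, window, max_id):
    """Temporal ids for sink frames and the local window (assumed scheme).

    step is the absolute index of the newest frame. Sink ids start at 0;
    the newest frame id grows with step but never exceeds max_id.
    """
    newest = min(step, max_id)
    sink = list(range(min(num_sink, step + 1)))
    recent = list(range(max(num_sink, newest - window + 1), newest + 1))
    return sink, recent


random.seed(0)
memory = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
gate_w = [[random.gauss(0, 0.1) for _ in range(16)] for _ in range(8)]
for _ in range(100):  # 100 eviction events; memory stays 4 x 8 throughout
    evicted = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
    memory = update_memory(memory, evicted, gate_w)
print(len(memory), len(memory[0]))           # 4 8
print(relative_rope_ids(1_000_000, 1, 3, 20))  # ([0], [18, 19, 20])
```

In this sketch the cost of each update depends only on the memory size and the number of evicted tokens, not on how many frames have been generated so far, which is the constant-cost property the abstract describes. As a side calculation, 1.3 million frames over 24 hours (86,400 seconds) corresponds to roughly 15 frames per second.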