Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
June 3, 2026
Authors: Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu
cs.AI
Abstract
We present Echo-Infinity, an autoregressive (AR) framework for real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress history of any length at constant cost. Existing methods mainly manage memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. Because their cache window is limited and they ignore the noise introduced by autoregressive generation, these designs inevitably lose historical information and amplify compounding errors. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory management with learnable Memory Queries, which are updated through attention and a gating mechanism whenever past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios at a constant computational cost independent of video length. The queries also act as a generalizable generation prior, improving quality even when only their optimized initial state is used. We further introduce the Unified Relative RoPE Recipe, which anchors the sink frames to start at id 0 and lets the id of the newest frame grow at most to the DiTs' pretrained maximum temporal RoPE id, during both training and inference. This frees the model from the finite RoPE constraint and closes the train-test RoPE extrapolation gap. Echo-Infinity achieves state-of-the-art performance in both long and short video generation and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
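The position-id rule described for the Unified Relative RoPE Recipe can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `relative_rope_ids`, the parameters `num_sink`, `window`, and `max_id`, and the choice to give the local-window frames contiguous ids ending at the newest frame are all assumptions made for illustration. The abstract only states that sink frames start at id 0 and that the newest frame id never exceeds the pretrained maximum temporal id. Where the Memory Queries sit in the position space is not specified, so they are not modeled here.

```python
def relative_rope_ids(num_generated, num_sink, window, max_id):
    """Assign temporal RoPE ids to the frames currently visible to the model.

    Hypothetical sketch: all names and the window layout are assumptions.
    - Sink frames (the first `num_sink` frames) always keep ids 0..num_sink-1.
    - The newest frame gets its absolute index, capped at `max_id`.
    - The remaining local-window frames get contiguous ids ending at the
      newest frame's id (an assumed layout, not stated in the abstract).
    """
    if num_sink + window > max_id + 1:
        raise ValueError("sink and window ids must fit within [0, max_id]")

    # Sink frames are anchored at the start of the id range.
    sink_ids = list(range(min(num_sink, num_generated)))

    # Frames generated after the sink that are still inside the local window.
    num_local = min(window, max(0, num_generated - num_sink))

    # The newest id grows with the rollout but never passes the pretrained max.
    newest = min(num_generated - 1, max_id)
    window_ids = list(range(newest - num_local + 1, newest + 1))
    return sink_ids, window_ids


if __name__ == "__main__":
    # Ids stop growing once the rollout passes the assumed pretrained maximum.
    for n in (5, 10, 1_300_000):
        print(n, relative_rope_ids(n, num_sink=3, window=4, max_id=20))
```

Under these assumptions, the ids seen at frame 1,300,000 are the same as those seen once the rollout first reaches `max_id`, so inference never requests a temporal id outside the range used in pretraining, however long the rollout runs.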