RELIC: Interactive Video World Model with Long-Horizon Memory
December 3, 2025
Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
cs.AI
Abstract
A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
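To make the camera-aware memory idea concrete, the sketch below shows one plausible way such a mechanism could be structured: compressed per-frame latent tokens are written into a rolling KV cache together with an embedding of the absolute camera pose, and new frames attend over that cache to retrieve previously seen content. This is an illustrative, heavily simplified PyTorch sketch, not RELIC's actual implementation; all module names, dimensions, the pose encoding, and the cache-eviction rule are assumptions made for the example.

```python
# Minimal sketch of a compact, camera-aware KV memory (assumed design, not RELIC's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CameraAwareKVCache(nn.Module):
    def __init__(self, dim=1024, n_heads=16, max_tokens=4096, pose_dim=12):
        super().__init__()
        self.dim, self.n_heads, self.max_tokens = dim, n_heads, max_tokens
        # Hypothetical pose encoder: flattened 3x4 camera extrinsics -> feature added to K/V.
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.to_q = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.register_buffer("k_cache", torch.zeros(0, dim))
        self.register_buffer("v_cache", torch.zeros(0, dim))

    def write(self, latent_tokens, cam_pose):
        """Store pose-conditioned K/V for a new frame's compressed latent tokens."""
        pose_feat = self.pose_mlp(cam_pose.flatten()[None, :])           # (1, dim)
        k, v = self.to_kv(latent_tokens + pose_feat).chunk(2, dim=-1)    # (T, dim) each
        # Rolling cache: keep only the most recent max_tokens entries.
        self.k_cache = torch.cat([self.k_cache, k], dim=0)[-self.max_tokens:]
        self.v_cache = torch.cat([self.v_cache, v], dim=0)[-self.max_tokens:]

    def read(self, query_tokens):
        """Retrieve memory for the current frame via multi-head attention over the cache."""
        if self.k_cache.shape[0] == 0:
            return query_tokens
        q = self.to_q(query_tokens)

        def split(x):  # (T, dim) -> (1, heads, T, head_dim)
            return x.view(1, x.shape[0], self.n_heads, self.dim // self.n_heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(self.k_cache), split(self.v_cache))
        out = out.transpose(1, 2).reshape(q.shape[0], self.dim)
        return query_tokens + self.proj(out)


# Usage: stream frames, reading 3D-consistent context before appending the new frame.
memory = CameraAwareKVCache()
for step in range(3):
    frame_latents = torch.randn(64, 1024)   # e.g. 64 compressed latent tokens per frame
    cam_pose = torch.eye(3, 4)              # placeholder absolute camera extrinsics
    attended = memory.read(frame_latents)   # retrieve content seen from nearby viewpoints
    memory.write(frame_latents, cam_pose)   # append this frame to the rolling cache
```

Because the cache holds only compressed tokens tagged with pose information, retrieval cost stays roughly constant as the rollout grows, which is the property the abstract attributes to the memory design; how RELIC compresses tokens and bounds the cache is not specified here.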