RELIC: Interactive Video World Model with Long-Horizon Memory
December 3, 2025
作者: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
cs.AI
Abstract
A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles all three challenges jointly. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
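To make the memory mechanism described in the abstract more concrete, the sketch below shows one way a compact, camera-aware KV-cache memory could be structured in PyTorch: history tokens are compressed before caching, tagged with a camera-pose embedding, and later retrieved by cross-attention from the current chunk. This is a minimal illustration under assumptions, not the authors' implementation; the class name `CameraAwareKVCache`, the pooling-based compression, and all dimensions are hypothetical.

```python
# Minimal illustrative sketch (not the RELIC code) of a compact,
# camera-aware KV-cache memory; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CameraAwareKVCache(nn.Module):
    """Caches heavily downsampled history tokens, each tagged with a
    camera-pose embedding, so new frames can attend over long horizons
    at small cost."""

    def __init__(self, dim=1024, pose_dim=12, compress=4):
        super().__init__()
        self.compress = compress                    # pooling factor applied before caching
        self.pose_embed = nn.Linear(pose_dim, dim)  # flattened pose/action -> embedding
        self.to_kv = nn.Linear(dim, dim * 2)
        self.to_q = nn.Linear(dim, dim)
        self.keys, self.values = [], []             # the growing long-horizon memory

    @torch.no_grad()
    def write(self, latent_tokens, pose):
        # latent_tokens: (B, N, dim) tokens of a generated chunk
        # pose:          (B, pose_dim) flattened camera pose / action for that chunk
        # Compress the chunk before caching so memory grows sub-linearly in N.
        compact = F.avg_pool1d(latent_tokens.transpose(1, 2), self.compress).transpose(1, 2)
        compact = compact + self.pose_embed(pose).unsqueeze(1)  # inject camera information
        k, v = self.to_kv(compact).chunk(2, dim=-1)
        self.keys.append(k)
        self.values.append(v)

    def read(self, current_tokens, pose):
        # Cross-attend from the pose-conditioned current chunk into the cached
        # history, acting as an implicit retrieval over previously seen content.
        q = self.to_q(current_tokens + self.pose_embed(pose).unsqueeze(1))
        k = torch.cat(self.keys, dim=1)
        v = torch.cat(self.values, dim=1)
        return F.scaled_dot_product_attention(q, k, v)


# Usage: stream chunks, writing compressed memory and reading it back.
cache = CameraAwareKVCache()
for step in range(3):
    tokens = torch.randn(1, 256, 1024)    # hypothetical chunk of latent tokens
    pose = torch.randn(1, 12)             # hypothetical flattened camera pose
    cache.write(tokens, pose)
    retrieved = cache.read(tokens, pose)  # (1, 256, 1024)
```

The key design point the sketch tries to capture is that each cached entry is small (compressed) yet carries explicit camera information, so long-horizon retrieval adds little compute while remaining spatially grounded.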