WorldKV: Efficient World Memory with World Retrieval and Compression

May 21, 2026
Authors: Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim
cs.AI

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding-window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
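
As a rough illustration of the two components described in the abstract, the sketch below shows one way they might look in plain Python. It is not the authors' implementation. The function names (`compress_chunk`, `retrieve_chunks`), the choice of cosine similarity, the max-over-anchor-keys redundancy score, and the Euclidean camera-position distance are all assumptions made for illustration. The abstract only states that pruning uses key-key similarity to an anchor frame and that retrieval uses camera/action correspondence.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Hypothetical pruning rule: drop the tokens whose keys are most
    similar to the anchor frame's keys, keeping `keep_ratio` of them."""
    # Assumed redundancy score: highest similarity to any anchor key.
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    n_keep = max(1, int(len(keys) * keep_ratio))
    # Least redundant tokens first; restore original token order afterwards.
    order = sorted(range(len(keys)), key=lambda i: redundancy[i])
    kept = sorted(order[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept]


def retrieve_chunks(store, query_pose, top_k=2):
    """Hypothetical retrieval rule: `store` is a list of (camera_position, chunk)
    pairs; return the `top_k` chunks whose positions are nearest the query."""
    ranked = sorted(store, key=lambda entry: math.dist(entry[0], query_pose))
    return [chunk for _, chunk in ranked[:top_k]]
```

With `keep_ratio=0.5`, each stored chunk holds half as many tokens, so a fixed memory budget fits twice as many chunks, which is the trade-off the abstract describes. A real system would operate on batched tensors per attention layer and head rather than Python lists.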