WorldKV: Efficient World Memory with World Retrieval and Compression

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, in which revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding-window inference restores throughput but sacrifices long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them directly back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage so that twice as much history fits under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
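
To make the World Compression idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a token's redundancy is measured as its highest cosine similarity to any key of the anchor frame, and that the least redundant half of the tokens is kept. The function names (`cosine`, `prune_chunk`), the `keep_ratio` parameter, and the scoring rule are all hypothetical choices made for illustration; the abstract only states that pruning uses key-key similarity to an anchor frame and halves per-chunk storage.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Drop the tokens of a chunk whose keys are most similar to the anchor frame.

    Assumption (not from the paper): redundancy of a token is its maximum
    cosine similarity to any anchor key; the least redundant tokens are kept.
    Returns (kept_indices, kept_keys, kept_values) in original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not anchor_keys:
        raise ValueError("anchor_keys must not be empty")

    # Redundancy score per token: closest match among the anchor frame's keys.
    scores = [max(cosine(k, a) for a in anchor_keys) for k in keys]

    # keep_ratio=0.5 halves per-chunk storage.
    n_keep = max(1, math.ceil(len(keys) * keep_ratio))

    # Least redundant tokens first; ties are broken by token index.
    order = sorted(range(len(keys)), key=lambda i: (scores[i], i))
    kept = sorted(order[:n_keep])
    return kept, [keys[i] for i in kept], [values[i] for i in kept]


# Toy chunk: four 2-D keys, one anchor key pointing along the x-axis.
anchor_keys = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]
kept, kept_keys, kept_values = prune_chunk(keys, values, anchor_keys)
print(kept, kept_values)
```

In this toy chunk, tokens 0 and 2 point in nearly the same direction as the anchor key and are pruned, while tokens 1 and 3 are retained. A real system would operate on per-head key tensors on the GPU and might score tokens differently; the sketch only shows the shape of the selection step.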