WorldKV: Efficient World Memory with World Retrieval and Compression
May 21, 2026
Authors: Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim
cs.AI
Abstract
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
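The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `prune_chunk` and `retrieve_chunks`, the use of cosine similarity, the rule of dropping the tokens most similar to the anchor, and the use of Euclidean distance between camera positions as the "correspondence" score are all assumptions made here for illustration. The abstract specifies only that pruning uses key-key similarity to an anchor frame and that retrieval uses camera/action correspondence.

```python
import math


def cosine(a, b):
    """Cosine similarity between two key vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_chunk(keys, anchor_keys, keep_ratio=0.5):
    """Hypothetical compression step: return sorted indices of tokens to keep.

    Assumption: a token is "redundant" when its key closely matches some key
    in the anchor frame, so the tokens with the lowest maximum similarity
    are the ones retained.
    """
    n_keep = max(1, int(len(keys) * keep_ratio))
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    order = sorted(range(len(keys)), key=lambda i: redundancy[i])
    return sorted(order[:n_keep])


def retrieve_chunks(stored_poses, query_pose, top_k=2):
    """Hypothetical retrieval step: indices of the top_k stored chunks whose
    camera position is closest to the current one, nearest first.

    Assumption: plain Euclidean distance stands in for the paper's
    camera/action correspondence measure.
    """
    dists = [math.dist(p, query_pose) for p in stored_poses]
    order = sorted(range(len(stored_poses)), key=lambda i: dists[i])
    return order[:top_k]
```

With the default `keep_ratio=0.5`, each chunk keeps half of its tokens, mirroring the halved per-chunk storage mentioned in the abstract. In a real pipeline the kept indices would select rows of the chunk's key and value tensors, and the retrieved chunks would be placed back into the attention window; neither step is shown here.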