WorldKV: Efficient World Memory via World Retrieval and Compression

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints, because memory footprint and attention cost grow linearly with rollout length. Sliding-window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
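To make the two components concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that "key-key similarity to an anchor frame" can be approximated by the maximum cosine similarity between a token's key and any anchor key, and that "camera correspondence" can be approximated by Euclidean distance between 2D camera positions. The function names (`compress_chunk`, `retrieve_chunks`), the `keep_ratio` and `top_k` parameters, and the chunk dictionary layout are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Drop the tokens whose keys are most similar to the anchor keys.

    Assumption: redundancy of a token = max cosine similarity between its
    key and any anchor-frame key. keep_ratio=0.5 halves per-chunk storage.
    Returns (kept_keys, kept_values, kept_indices).
    """
    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    scores = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    # Least redundant tokens first; keep n_keep of them in original order.
    order = sorted(range(len(keys)), key=lambda i: scores[i])
    kept = sorted(order[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


def retrieve_chunks(store, query_position, top_k=2):
    """Return the top_k stored chunks whose camera position is closest.

    Assumption: each chunk is a dict with a "position" tuple; nearest
    position stands in for the camera/action correspondence in the text.
    """
    ranked = sorted(store, key=lambda c: math.dist(c["position"], query_position))
    return ranked[:top_k]


# Toy usage: four 2D keys, two anchor keys, halve the chunk.
anchor_keys = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.5]]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values, kept_idx = compress_chunk(keys, values, anchor_keys)

# Toy usage: three evicted chunks, retrieve the two nearest to the camera.
store = [
    {"id": 0, "position": (0.0, 0.0)},
    {"id": 1, "position": (5.0, 0.0)},
    {"id": 2, "position": (1.0, 0.0)},
]
nearest = retrieve_chunks(store, (0.9, 0.0), top_k=2)
```

In this toy run the two keys identical to an anchor key are pruned, and the two chunks recorded closest to the query position are returned. In a real model the retrieved keys and values would be concatenated into the attention window rather than re-encoded, which this sketch does not cover.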