WorldKV: Efficient World Memory via World Retrieval and Compression

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, in which revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints, because memory footprint and attention cost grow linearly with rollout length. Sliding-window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage so that twice as much history fits under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
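To make the World Compression idea concrete, the following is a minimal sketch of pruning a chunk by key-key similarity to an anchor frame. It is an illustration under stated assumptions, not the method as implemented in WorldKV: the abstract does not specify the scoring rule, so the sketch assumes that a token's redundancy is its maximum cosine similarity to any anchor key, and that the most redundant half is dropped. The names `cosine`, `compress_chunk`, and `keep_ratio` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Keep the tokens whose keys are least similar to the anchor frame's keys.

    Assumption (not from the source): redundancy of a token is its maximum
    cosine similarity to any anchor key. keep_ratio=0.5 mirrors the
    "halving per-chunk storage" figure in the abstract.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not anchor_keys:
        raise ValueError("anchor_keys must not be empty")

    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]

    # Rank token indices from least to most redundant; ties keep original order.
    ranked = sorted(range(len(keys)), key=lambda i: redundancy[i])
    kept = sorted(ranked[:n_keep])  # restore the original token order

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy example: 4 tokens with 2-dimensional keys, 2 anchor keys.
anchor_keys = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]

kept_keys, kept_values, kept_idx = compress_chunk(keys, values, anchor_keys)
print(kept_idx)     # [1, 3]
print(kept_values)  # ['v1', 'v3']
```

In the toy example, tokens 0 and 2 point in exactly the same direction as an anchor key and are dropped, so the chunk shrinks from four tokens to two. A real KV cache would hold per-head tensors rather than Python lists, and the scoring would be batched on the GPU.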