WorldCache: Content-Aware Caching for Accelerated Video World Models

Abstract

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., they reuse cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose **WorldCache**, a Perception-Constrained Dynamical Caching framework that improves both **when** and **how** to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
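To make the "when to reuse" decision more concrete, the following is a minimal sketch of a drift-gated cache check. It is not the authors' implementation: the function names, the relative L1 drift measure, the `1 / (1 + motion)` scaling, and the direction of the step schedule (stricter early, looser late) are all assumptions chosen only for illustration. It shows how a saliency-weighted drift with a motion- and phase-dependent threshold can reach a different decision than a uniform drift compared against a fixed threshold.

```python
from typing import Sequence


def relative_drift(current: Sequence[float], cached: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Weighted relative L1 drift between current and cached features.

    Uniform weights give a global drift measure; non-uniform weights
    (hypothetical saliency scores) emphasise perceptually important regions.
    """
    num = sum(w * abs(c - k) for w, c, k in zip(weights, current, cached))
    den = sum(w * abs(k) for w, k in zip(weights, cached)) + 1e-8
    return num / den


def adaptive_threshold(base: float, motion: float,
                       step: int, total_steps: int) -> float:
    """Scale a base threshold by motion and denoising phase.

    Assumption: more motion -> stricter threshold; later steps -> looser.
    """
    motion_factor = 1.0 / (1.0 + motion)
    phase_factor = 0.5 + 0.5 * (step / max(total_steps - 1, 1))
    return base * motion_factor * phase_factor


def should_reuse(current: Sequence[float], cached: Sequence[float],
                 weights: Sequence[float], base: float, motion: float,
                 step: int, total_steps: int) -> bool:
    """Reuse the cached features only if the drift is below the threshold."""
    drift = relative_drift(current, cached, weights)
    return drift < adaptive_threshold(base, motion, step, total_steps)


if __name__ == "__main__":
    cached = [1.0, 1.0, 1.0, 1.0]
    current = [1.0, 1.0, 1.0, 1.4]       # only the last region changed
    uniform = [1.0, 1.0, 1.0, 1.0]
    saliency = [0.1, 0.1, 0.1, 1.0]      # the changed region is the salient one

    # Global drift is small, so a uniform check reuses the cache.
    print(should_reuse(current, cached, uniform, 0.2, 0.0, 9, 10))   # True
    # Weighting by saliency exposes the change, so the block is recomputed.
    print(should_reuse(current, cached, saliency, 0.2, 0.0, 9, 10))  # False
```

The sketch covers only the reuse decision; the blending and warping used to approximate features when reuse is chosen are not modelled here.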
