WorldCache: Content-Aware Caching for Accelerated Video World Models

Abstract

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps. Existing methods, however, largely rely on a Zero-Order Hold assumption, i.e., they reuse cached features as static snapshots whenever global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how features are reused. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Together, these components enable adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B, evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code is available at https://umair1221.github.io/World-Cache/.
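To make the "when to reuse" idea concrete, the following is a minimal toy sketch of a cache-reuse decision that combines a saliency-weighted drift estimate with a threshold that adapts to motion and to the denoising phase. It is not the WorldCache implementation: the abstract gives no formulas, so the function names (`weighted_drift`, `adaptive_threshold`, `should_reuse`), the parameters (`motion_gain`, `early_fraction`, `early_scale`), and the specific drift and threshold formulas are all assumptions chosen for illustration, including the guess that early denoising steps use a stricter threshold.

```python
# Toy illustration only. All names and formulas below are assumptions,
# not taken from the WorldCache paper or code.

def weighted_drift(current, cached, saliency):
    """Saliency-weighted relative L1 difference between two feature vectors."""
    num = sum(w * abs(c - p) for c, p, w in zip(current, cached, saliency))
    den = sum(w * abs(p) for p, w in zip(cached, saliency)) + 1e-8
    return num / den


def adaptive_threshold(base, motion, step, total_steps,
                       motion_gain=1.0, early_fraction=0.3, early_scale=0.5):
    """Shrink the reuse threshold for high motion and (assumed) early steps."""
    phase_scale = early_scale if step < early_fraction * total_steps else 1.0
    return base * phase_scale / (1.0 + motion_gain * motion)


def should_reuse(current, cached, saliency, base, motion, step, total_steps):
    """Reuse cached features only if weighted drift is below the threshold."""
    drift = weighted_drift(current, cached, saliency)
    return drift < adaptive_threshold(base, motion, step, total_steps)


# Hypothetical features: only the last element changed since caching.
cached = [1.0, 1.0, 1.0, 1.0]
current = [1.0, 1.0, 1.0, 1.4]
uniform = [1.0, 1.0, 1.0, 1.0]   # every position equally important
salient = [0.1, 0.1, 0.1, 1.0]   # the changed position is the salient one

print(should_reuse(current, cached, uniform, 0.2, 0.0, 40, 50))
print(should_reuse(current, cached, salient, 0.2, 0.0, 40, 50))
print(should_reuse(current, cached, uniform, 0.2, 3.0, 40, 50))
```

In this toy setup, a plain global drift check would treat the change as small, whereas weighting by saliency or accounting for motion makes the same change large enough to trigger recomputation. That is the kind of behavior the abstract contrasts with a Zero-Order Hold policy.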
