
WorldCache: Content-Aware Caching for Accelerated Video World Models

March 23, 2026
作者: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan
cs.AI

Abstract

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code is available at https://umair1221.github.io/World-Cache/.
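To make the Zero-Order Hold baseline concrete, here is a minimal, hypothetical sketch of the decision rule the abstract critiques: a cached activation is reused as a static snapshot whenever the relative feature drift stays below a fixed threshold, and recomputed otherwise. This is an illustrative simulation of the reuse decision only (it computes every feature so it can measure drift); it is not the paper's method, and all names (`zoh_cache_run`, `threshold`) are assumptions for illustration.

```python
import numpy as np

def zoh_cache_run(features_per_step, threshold=0.05):
    """Simulate Zero-Order Hold feature caching across denoising steps.

    features_per_step: list of feature arrays, one per denoising step
    (standing in for the expensive DiT activations).
    Reuses the cached array whenever the relative drift from the cached
    step is below `threshold`; otherwise refreshes the cache.
    Returns (outputs, n_recomputed).
    """
    cache = None
    outputs, recomputed = [], 0
    for feat in features_per_step:
        if cache is None:
            cache, recomputed = feat, recomputed + 1  # first step: must compute
        else:
            # Relative L2 drift of the current features from the cached snapshot.
            drift = np.linalg.norm(feat - cache) / (np.linalg.norm(cache) + 1e-8)
            if drift > threshold:
                cache, recomputed = feat, recomputed + 1  # refresh the cache
            # else: reuse the stale snapshot (the ZOH assumption)
        outputs.append(cache)
    return outputs, recomputed
```

A fixed global threshold like this is exactly what WorldCache replaces with motion-adaptive, saliency-weighted, and phase-aware scheduling, since small global drift can still hide large local motion.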