Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling these models to long interactive trajectories is prohibitively expensive because of growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves speedups of up to 2.59× without model retraining while maintaining competitive visual quality.
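
The three trajectory-dependent decisions named in the abstract can be illustrated with a minimal Python sketch. It is not the Light Interaction implementation: the function `plan_chunk`, the `ChunkPlan` fields, the revisit radius, the window bounds, and the rule that larger latent change maps to a longer temporal window are all assumptions made here for illustration. The abstract states only that such decisions depend on the trajectory.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    use_spatial_memory: bool  # keep retrieved spatial memory in the context?
    temporal_window: int      # number of past chunks kept as temporal context
    reuse_early_steps: bool   # reuse cached early-step denoising outputs?


def nearest_distance(position, visited):
    """Distance from the current camera position to the closest visited one."""
    if not visited:
        return math.inf
    return min(math.dist(position, p) for p in visited)


def plan_chunk(position, visited, latent_change,
               revisit_radius=1.0, min_window=2, max_window=8,
               dynamics_scale=1.0):
    """Pick per-chunk compute settings (hypothetical heuristic).

    position       -- current camera position, e.g. an (x, y, z) tuple
    visited        -- camera positions of previously generated chunks
    latent_change  -- non-negative scalar summarizing local latent dynamics
    """
    # Assumption: "revisiting" means being within a fixed radius of any
    # previously visited camera position.
    revisiting = nearest_distance(position, visited) <= revisit_radius

    # Assumption: larger latent change gets a longer temporal window.
    # The source does not state the direction of this adjustment.
    ratio = min(1.0, max(0.0, latent_change / dynamics_scale))
    window = min_window + round((max_window - min_window) * ratio)

    return ChunkPlan(
        use_spatial_memory=revisiting,  # dropped during novel exploration
        temporal_window=window,
        reuse_early_steps=revisiting,   # cache reuse only when revisiting
    )
```

In this sketch a chunk generated in unexplored territory skips both the retrieved spatial memory and the denoising cache, while a chunk near an earlier camera position keeps the memory and reuses cached early-step outputs. A real system would likely use a richer revisit test, for example one that also compares camera orientation.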