Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling these models to long interactive trajectories is prohibitively expensive because of growing context memory, the quadratic complexity of attention, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded while the camera explores novel regions, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to a 2.59× speedup without model retraining while maintaining competitive visual quality.
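
The trajectory-dependent decisions described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `plan_chunk`, the `ChunkPlan` fields, the distance-based revisit test, every threshold, and the rule that faster latent change shortens the temporal window are assumptions introduced here for clarity.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    use_spatial_memory: bool  # keep retrieved spatial memory in the context?
    temporal_window: int      # number of past chunks kept as temporal context
    reuse_early_steps: bool   # reuse cached early-step denoising outputs?


def plan_chunk(position, visited, latent_change,
               revisit_radius=1.0, min_window=2, max_window=8,
               change_scale=1.0):
    """Hypothetical per-chunk policy; all names and thresholds are illustrative."""
    # Assumption: a region counts as familiar when the camera is within
    # revisit_radius of any previously visited position.
    revisiting = any(math.dist(position, p) <= revisit_radius for p in visited)

    # Assumption: faster local latent change means older chunks matter less,
    # so the temporal window shrinks linearly from max_window to min_window.
    ratio = min(max(latent_change / change_scale, 0.0), 1.0)
    window = round(max_window - ratio * (max_window - min_window))

    return ChunkPlan(
        use_spatial_memory=revisiting,  # dropped during novel exploration
        temporal_window=window,
        reuse_early_steps=revisiting,   # cache reuse only on revisits
    )


if __name__ == "__main__":
    visited = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    print(plan_chunk((20.0, 0.0, 0.0), visited, latent_change=1.0))
    print(plan_chunk((0.5, 0.0, 0.0), visited, latent_change=0.0))
```

In this sketch a single revisit signal gates both the spatial memory and the denoising cache. A real system would likely use separate criteria for each, and would measure latent dynamics from the model's own latents rather than taking a scalar as input.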