Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive because of growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Building on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block-sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves a speedup of up to 2.59x without model retraining while maintaining competitive visual quality.
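
The trajectory-dependent decisions described above can be illustrated with a small sketch. This is not the authors' implementation: the function `plan_chunk`, the `ChunkPlan` fields, the distance-based revisit test, all thresholds and window sizes, and the direction in which the temporal window changes with latent motion are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import dist
from typing import Sequence


@dataclass
class ChunkPlan:
    use_spatial_memory: bool  # keep retrieved spatial memory in the context?
    temporal_window: int      # number of recent chunks kept as temporal context
    reuse_early_steps: bool   # reuse cached early-step model outputs?


def plan_chunk(
    camera_pos: Sequence[float],
    visited: Sequence[Sequence[float]],
    latent_motion: float,
    revisit_radius: float = 1.0,    # hypothetical threshold
    motion_threshold: float = 0.5,  # hypothetical threshold
    short_window: int = 2,          # hypothetical window sizes
    long_window: int = 8,
) -> ChunkPlan:
    """Choose per-chunk compute settings from the camera trajectory (illustrative only)."""
    # Assumption: a chunk counts as a revisit if the camera is within
    # revisit_radius of any earlier camera position.
    revisiting = any(dist(camera_pos, p) <= revisit_radius for p in visited)

    # Assumption: fast-changing latents get a short temporal window,
    # slow-changing latents get a long one.
    window = short_window if latent_motion > motion_threshold else long_window

    return ChunkPlan(
        use_spatial_memory=revisiting,  # discarded during novel exploration
        temporal_window=window,
        reuse_early_steps=revisiting,   # reuse only in familiar regions
    )


if __name__ == "__main__":
    history = [(0.0, 0.0, 0.0)]
    print(plan_chunk((10.0, 0.0, 0.0), history, latent_motion=0.9))
    print(plan_chunk((0.2, 0.0, 0.0), history, latent_motion=0.1))
```

In this sketch a single revisit test drives both the spatial-memory decision and the early-step reuse decision. A real system would likely use richer signals, such as camera orientation or view-frustum overlap, rather than position distance alone.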