SCOPE: Simulating Cross-Game Operations in Playable Environments for FPS World Models

Abstract

Interactive world models for first-person shooter (FPS) games must resolve high-frequency, overlapping control signals at every frame without disturbing regions those signals do not affect. Existing methods inject actions globally and train on a single title, so they break down under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern the stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. The module reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals and is curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
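The per-pixel reshaping described above can be illustrated with a small NumPy sketch. This is not the SCOPE implementation: the abstract specifies only that features are reshaped into per-pixel temporal sequences and that each position derives its action response from local visual content. The use of single-head cross-attention, the residual connection, the function name `per_pixel_action_conditioning`, the weight matrices `w_q`, `w_k`, `w_v`, and all tensor sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_pixel_action_conditioning(feats, actions, w_q, w_k, w_v):
    """Hypothetical conditioning step (assumed design, not the paper's).

    feats:   (B, T, H, W, C) video features from a transformer block
    actions: (B, T, A) frame-aligned action vectors (e.g. A = 10 DoF)
    """
    b, t, h, w, c = feats.shape
    # (B, T, H, W, C) -> (B, H, W, T, C) -> (B*H*W, T, C):
    # one temporal sequence per spatial position.
    seq = feats.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
    # Every pixel of a batch item sees that item's action sequence.
    act = np.repeat(actions, h * w, axis=0)            # (B*H*W, T, A)

    q = seq @ w_q                                      # queries from local visual content
    k = act @ w_k                                      # keys from actions
    v = act @ w_v                                      # values from actions
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    response = softmax(scores) @ v                     # (B*H*W, T, C)

    out = seq + response                               # residual update (assumption)
    # Restore the original (B, T, H, W, C) layout.
    return out.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

# Toy usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
B, T, H, W, C, A, D = 1, 4, 2, 3, 8, 10, 8
feats = rng.normal(size=(B, T, H, W, C))
actions = rng.normal(size=(B, T, A))
w_q = rng.normal(size=(C, D))
w_k = rng.normal(size=(A, D))
w_v = rng.normal(size=(A, C))
out = per_pixel_action_conditioning(feats, actions, w_q, w_k, w_v)
print(out.shape)  # (1, 4, 2, 3, 8)
```

Because the queries come from each pixel's own features, two positions that receive the same action sequence can respond differently. In this sketch, that is what would allow a weapon region to react to a fire event while the background largely ignores it, without any segmentation labels.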