Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the action interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. Incantation surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%) and on out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-versus-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
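
As an illustration of what frame-local text cross-attention could look like, the following minimal sketch builds a boolean attention mask in which the video tokens of each latent frame may attend only to the text tokens of the prompt attached to that same frame. This is an assumption about the mechanism, not the authors' implementation: the function name `frame_local_mask`, its parameters, and the token counts are hypothetical and chosen only for readability.

```python
def frame_local_mask(video_tokens_per_frame, text_tokens_per_frame):
    """Return mask[q][k], True if video token q may attend to text token k.

    Hypothetical helper: assumes every latent frame has the same number of
    video tokens and its own prompt with text_tokens_per_frame[f] tokens.
    """
    num_frames = len(text_tokens_per_frame)
    # Frame index of each video query token, in frame-major order.
    query_frame = [f for f in range(num_frames)
                   for _ in range(video_tokens_per_frame)]
    # Frame index of each text key token, in frame-major order.
    key_frame = [f for f, n in enumerate(text_tokens_per_frame)
                 for _ in range(n)]
    # A query sees a key only when both belong to the same latent frame.
    return [[qf == kf for kf in key_frame] for qf in query_frame]


# Two latent frames, 2 video tokens each; the prompts have 3 and 1 text tokens.
mask = frame_local_mask(video_tokens_per_frame=2, text_tokens_per_frame=[3, 1])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Under this assumed layout the mask is block-diagonal over frames, so changing the prompt of one latent frame affects only that frame's cross-attention inputs, which is one way to obtain conditioning at the granularity of a single latent frame.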