Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface, unlocking expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters (KOF), changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
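As a rough illustration of what frame-local text cross-attention could look like, the NumPy sketch below builds a block mask so that the video tokens of each latent frame attend only to the text tokens of the prompt assigned to that same frame. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the fixed number of text tokens per frame, and the single-head unprojected attention are all hypothetical choices made for clarity.

```python
import numpy as np


def frame_local_mask(num_frames, video_tokens_per_frame, text_tokens_per_frame):
    """Boolean mask of shape (num_frames * V, num_frames * T); True means "may attend".

    Assumption: every latent frame has the same number of video tokens (V)
    and its own prompt padded to the same number of text tokens (T).
    """
    video_frame = np.repeat(np.arange(num_frames), video_tokens_per_frame)
    text_frame = np.repeat(np.arange(num_frames), text_tokens_per_frame)
    # A video token may attend to a text token only if both belong to the same frame.
    return video_frame[:, None] == text_frame[None, :]


def masked_cross_attention(q, k, v, mask):
    """Single-head scaled dot-product cross-attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block text from other frames
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Hypothetical sizes: 3 latent frames, 2 video tokens and 4 text tokens per frame.
mask = frame_local_mask(num_frames=3, video_tokens_per_frame=2, text_tokens_per_frame=4)
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))    # video queries
k = rng.normal(size=(12, 8))   # text keys
v = rng.normal(size=(12, 8))   # text values
out = masked_cross_attention(q, k, v, mask)
```

With such a mask, changing the prompt of one latent frame alters only that frame's cross-attention output, which is the property a per-latent-frame text interface relies on. How prompts for multiple entities are packed into a single frame's text slot is not specified in the abstract and is left out of this sketch.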