Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
May 18, 2026
Authors: Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng
cs.AI
Abstract
Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters (KOF), changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
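To make the phrase "frame-local text cross-attention" concrete, here is a minimal sketch of one possible reading: each latent frame (0.25 s) carries its own prompt, and the video tokens of that frame may attend only to that prompt's text tokens. The abstract does not specify the mechanism at this level, so the token layout, the function names `frame_local_mask` and `frame_index`, and the masking scheme itself are all assumptions for illustration, not the authors' implementation.

```python
from typing import List


def frame_local_mask(tokens_per_frame: int, prompt_lengths: List[int]) -> List[List[bool]]:
    """Build mask[q][k]: True where video token q may attend to text token k.

    Assumed (hypothetical) layout: video tokens are ordered frame by frame,
    and the text tokens of all per-frame prompts are concatenated in frame
    order. prompt_lengths[i] is the number of text tokens for latent frame i.
    """
    total_text = sum(prompt_lengths)
    mask: List[List[bool]] = []
    text_start = 0
    for length in prompt_lengths:
        # Text positions that belong to this latent frame's prompt.
        row = [text_start <= k < text_start + length for k in range(total_text)]
        # Every video token in this latent frame gets a copy of the same row.
        mask.extend(list(row) for _ in range(tokens_per_frame))
        text_start += length
    return mask


def frame_index(t_seconds: float, frame_duration: float = 0.25) -> int:
    """Map a timestamp to its latent-frame index, assuming 0.25 s per latent frame."""
    return int(t_seconds // frame_duration)


if __name__ == "__main__":
    # Two latent frames, 2 video tokens each; prompts of 3 and 1 text tokens.
    for row in frame_local_mask(2, [3, 1]):
        print(row)
```

Under this reading, changing the prompt for one latent frame affects only that frame's cross-attention rows, which is one way per-frame language control could coexist with a shared video backbone.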