Incantation: Natural Language as an Action Interface for Multi-Entity Video World Models

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
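
To make the phrase "frame-local text cross-attention" concrete, the following is a minimal sketch of one possible reading: video tokens belonging to latent frame t attend only to the text tokens of the prompt given for frame t. The function names, the flat token layout, and the use of a boolean mask are assumptions made for illustration; the abstract does not specify how the model implements this.

```python
import math

def frame_local_mask(num_frames, tokens_per_frame, text_len):
    """Hypothetical helper: mask[q][k] is True when video token q may
    attend to text token k, i.e. when both belong to the same latent frame."""
    q_frame = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    k_frame = [f for f in range(num_frames) for _ in range(text_len)]
    return [[qf == kf for kf in k_frame] for qf in q_frame]

def masked_cross_attention(queries, keys, values, mask):
    """Plain softmax cross-attention restricted to the allowed text tokens.
    Illustrative only; a real model would use batched tensor operations."""
    dim = len(queries[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Keep only the text tokens of this query's own frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k, ok in zip(keys, allowed) if ok]
        vals = [v for v, ok in zip(values, allowed) if ok]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vals)) / total
                        for i in range(len(vals[0]))])
    return outputs

# Two latent frames, two video tokens and two text tokens per frame.
mask = frame_local_mask(num_frames=2, tokens_per_frame=2, text_len=2)
queries = [[1.0, 0.0]] * 4
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
outputs = masked_cross_attention(queries, keys, values, mask)
```

Under this reading, changing the prompt for one latent frame changes the conditioning of that frame's video tokens only, which is one way a per-frame (0.25 s) language command could be kept from leaking into neighboring frames.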