Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained control over multiple entities and do not generalize across entities or worlds. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface, unlocking expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning. It supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and we enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. Incantation surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%) and on out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters (KOF), changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
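
To make the idea of frame-local text cross-attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes that every latent frame contributes a fixed number of video tokens and carries its own fixed-length prompt, and it builds the boolean mask under which a video token may attend only to the text tokens of its own latent frame. The function names `frame_local_mask` and `latent_frame_index`, and the fixed-length assumption, are hypothetical; only the 0.25 s latent-frame duration comes from the abstract.

```python
from typing import List

# Duration of one latent frame in seconds (stated in the abstract).
LATENT_FRAME_SECONDS = 0.25


def latent_frame_index(t_seconds: float) -> int:
    """Map a timestamp to the index of the latent frame that contains it."""
    return int(t_seconds / LATENT_FRAME_SECONDS)


def frame_local_mask(
    num_frames: int, tokens_per_frame: int, text_len: int
) -> List[List[bool]]:
    """Build a frame-local cross-attention mask (hypothetical layout).

    Rows index video tokens, ordered frame by frame.
    Columns index text tokens, ordered frame by frame, with `text_len`
    tokens per latent-frame prompt (an assumption of this sketch).
    mask[q][k] is True when video token q may attend to text token k,
    i.e. when both belong to the same latent frame.
    """
    mask = []
    for q in range(num_frames * tokens_per_frame):
        q_frame = q // tokens_per_frame
        row = [(k // text_len) == q_frame for k in range(num_frames * text_len)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two latent frames, three video tokens each, two text tokens per prompt.
    for row in frame_local_mask(2, 3, 2):
        print("".join("x" if allowed else "." for allowed in row))
    # A 2-hour rollout spans this many latent frames at 0.25 s each.
    print(latent_frame_index(2 * 60 * 60))
```

Under this layout the mask is block-diagonal, so changing the prompt for one 0.25 s latent frame leaves the text seen by all other frames untouched, which is one way to read the per-latent-frame conditioning described above.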