Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and generalize poorly across entities and across worlds. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface, unlocking expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning. It supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. Incantation surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%) and on out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters (KOF), changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
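
To make the idea of frame-local text cross-attention concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each latent frame carries its own short text prompt and that video tokens of a frame may attend only to the text tokens of that same frame. The function names, the token layout, and the parameters `tokens_per_frame` and `text_len` are hypothetical; only the 0.25 s latent-frame duration comes from the abstract.

```python
# Illustrative sketch only: names and token layout are assumptions,
# not taken from the Incantation code base.

LATENT_FRAME_SECONDS = 0.25  # per-latent-frame duration stated in the abstract


def frame_local_mask(num_frames, tokens_per_frame, text_len):
    """Build a boolean cross-attention mask.

    mask[q][k] is True when video token q may attend to text token k.
    Video tokens are laid out frame by frame, and each frame owns a
    contiguous block of `text_len` text tokens (assumed layout).
    """
    num_queries = num_frames * tokens_per_frame
    num_keys = num_frames * text_len
    mask = [[False] * num_keys for _ in range(num_queries)]
    for q in range(num_queries):
        frame = q // tokens_per_frame  # latent frame that owns this video token
        for k in range(frame * text_len, (frame + 1) * text_len):
            mask[q][k] = True  # attend only to this frame's own prompt tokens
    return mask


def prompt_slot(t_seconds):
    """Index of the latent frame (and hence prompt slot) active at time t."""
    return int(t_seconds // LATENT_FRAME_SECONDS)


def latent_frames_in(duration_seconds):
    """Number of whole latent frames covered by a rollout of this length."""
    return int(duration_seconds // LATENT_FRAME_SECONDS)
```

Under these assumptions, a prompt issued at 0.6 s falls into slot 2, and a 2-hour rollout spans 28,800 latent frames, each of which can carry its own instruction. A mask of this shape could be passed to a standard attention routine so that a prompt change at one latent frame does not leak into the conditioning of its neighbours.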