MagicWorld: Interactive Geometry-driven Video World Exploration

Abstract

Recent interactive video world models generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which leads to structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, uses user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose the History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model use past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
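To make the two components more concrete, the following is a minimal sketch of the general ideas only, not the implementation of AG3D or HCR. It assumes a pinhole camera with known depth, a camera action that is a pure translation, and retrieval by cosine similarity over per-frame feature vectors. None of these choices is stated in the abstract, and all names (`unproject`, `project`, `move_camera`, `HistoryCache`) are hypothetical.

```python
import math
from dataclasses import dataclass, field


def unproject(u, v, depth, f, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D camera-frame point (pinhole model)."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def project(point, f, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def move_camera(point, translation):
    """Express a point in the frame of a camera translated by `translation`.

    Assumption: the user action is a pure translation with no rotation.
    """
    return tuple(p - t for p, t in zip(point, translation))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class HistoryCache:
    """Stores (frame_id, feature) pairs for previously generated frames."""

    entries: list = field(default_factory=list)

    def add(self, frame_id, feature):
        self.entries.append((frame_id, feature))

    def retrieve(self, query, k=2):
        """Return the ids of the k cached frames most similar to `query`.

        Assumption: similarity is cosine similarity of feature vectors.
        """
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return [frame_id for frame_id, _ in ranked[:k]]


# Geometry: a point seen at the image center shifts left when the camera moves right.
point = unproject(320, 240, 2.0, f=500, cx=320, cy=240)
new_pixel = project(move_camera(point, (0.5, 0.0, 0.0)), f=500, cx=320, cy=240)

# Retrieval: the two cached frames closest to the query feature are selected.
cache = HistoryCache()
cache.add("f0", [1.0, 0.0])
cache.add("f1", [0.0, 1.0])
cache.add("f2", [0.9, 0.1])
selected = cache.retrieve([1.0, 0.0], k=2)
```

In this sketch the reprojected pixel acts as a geometric constraint on where existing content should appear after the viewpoint change, and the retrieved frame ids stand in for the historical frames that would be injected as conditioning signals.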
