The Dawn of World-Action Interactive Models

Abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.
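The recursive coupling described above can be illustrated with a minimal sketch. It is not DAWN's implementation: `predict_world`, `denoise_action`, and `interactive_inference` are hypothetical names, the linear update rules are toy stand-ins for learned networks, and the horizon, rollout length, and iteration count are arbitrary assumptions. The sketch only shows the control flow in which a short latent rollout conditions action denoising and the refined action is fed back into the next world prediction.

```python
import random
from typing import List, Tuple

Vector = List[float]


def predict_world(latent: Vector, action: Vector, rollout_steps: int) -> List[Vector]:
    """Toy stand-in for a World Predictor (hypothetical dynamics, not a learned model)."""
    rollout: List[Vector] = []
    state = list(latent)
    mean_action = sum(action) / len(action)
    for _ in range(rollout_steps):
        # Assumed dynamics: decay the latent state and nudge it by the action.
        state = [0.9 * s + 0.1 * mean_action for s in state]
        rollout.append(state)
    return rollout


def denoise_action(action: Vector, rollout: List[Vector], strength: float) -> Vector:
    """Toy stand-in for a World-Conditioned Action Denoiser."""
    # Summarise the last predicted latent as a scalar target (an assumption).
    final_state = rollout[-1]
    target = sum(final_state) / len(final_state)
    # Pull every waypoint of the action hypothesis toward that target.
    return [a + strength * (target - a) for a in action]


def interactive_inference(
    latent: Vector,
    horizon: int = 8,
    rollout_steps: int = 2,
    num_iters: int = 10,
    seed: int = 0,
) -> Tuple[Vector, List[Vector]]:
    """Alternate world prediction and action denoising, refining both hypotheses."""
    rng = random.Random(seed)
    # Start from a pure-noise action hypothesis over the full planning horizon.
    action = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    # The rollout is deliberately shorter than the action horizon.
    rollout = predict_world(latent, action, rollout_steps)
    for _ in range(num_iters):
        # World hypothesis conditions the action denoising step...
        action = denoise_action(action, rollout, strength=0.5)
        # ...and the denoised action is fed back to update the world hypothesis.
        rollout = predict_world(latent, action, rollout_steps)
    return action, rollout


if __name__ == "__main__":
    final_action, final_rollout = interactive_inference([1.0, -0.5, 0.25])
    print(len(final_action), len(final_rollout))
```

In this toy setting the action hypothesis spans eight steps while the world rollout covers only two, mirroring the idea that a short explicit latent rollout can support a longer trajectory. A real system would replace both functions with trained networks operating on learned semantic latents.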