BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.
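
The abstract does not give the attack's loss or optimizer, so the following is only a toy illustration of the two ideas it names, not the BadWorld implementation. It assumes a hypothetical stand-in "velocity" predictor `tanh(W x + U a)`, a label-free objective that measures how far a perturbation shifts the predicted velocity, and a sign-gradient update inside an L-infinity ball. "Mining hard controls" is read here as picking, at each step, the candidate action on which the current perturbation is least effective; that reading, and every name and parameter below, is an assumption.

```python
import numpy as np

def velocity(W, U, x, a):
    """Hypothetical stand-in for a denoiser's velocity prediction."""
    return np.tanh(W @ x + U @ a)

def attack_loss_and_grad(W, U, x, delta, a):
    """Label-free objective: squared shift of the predicted velocity.

    Returns the loss and its gradient with respect to delta.
    No ground-truth future frames are needed, only the clean prediction.
    """
    v_clean = velocity(W, U, x, a)
    v_adv = velocity(W, U, x + delta, a)
    diff = v_adv - v_clean
    loss = 0.5 * float(diff @ diff)
    grad = W.T @ ((1.0 - v_adv ** 2) * diff)  # chain rule through tanh
    return loss, grad

def worst_case_loss(W, U, x, delta, actions):
    """Attack strength on the action where the perturbation works worst."""
    return min(attack_loss_and_grad(W, U, x, delta, a)[0] for a in actions)

def craft_perturbation(W, U, x, actions, eps=0.03, step=0.005, iters=50, seed=0):
    """Alternate between mining the hardest action and a sign-gradient step."""
    rng = np.random.default_rng(seed)
    # Random start: the gradient of this objective is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    best_delta = delta.copy()
    best_score = worst_case_loss(W, U, x, delta, actions)
    for _ in range(iters):
        # Inner step: find the action with the smallest attack loss.
        results = [attack_loss_and_grad(W, U, x, delta, a) for a in actions]
        hardest = min(range(len(actions)), key=lambda i: results[i][0])
        # Outer step: ascend the loss on that action, then project to the ball.
        delta = np.clip(delta + step * np.sign(results[hardest][1]), -eps, eps)
        score = worst_case_loss(W, U, x, delta, actions)
        if score > best_score:  # keep the best worst-case perturbation seen
            best_delta, best_score = delta.copy(), score
    return best_delta, best_score

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(8, 8)) / np.sqrt(8)
    U = rng.normal(size=(8, 2))
    x = rng.normal(size=8)
    actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.5])]
    delta, score = craft_perturbation(W, U, x, actions)
    print("max |delta|:", np.max(np.abs(delta)), "worst-case loss:", score)
```

Optimizing against the weakest action rather than the average is what makes the toy perturbation approximately control-agnostic: the returned score is a lower bound on the attack loss over every candidate action in the list. A real VWM would replace the analytic gradient with automatic differentiation through the denoiser and would draw control sequences rather than single actions.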