BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.
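The two ingredients named above, a label-free velocity objective and an outer loop that mines hard controls, can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the world model is replaced by one linear map per action, so the velocity shift caused by a perturbation delta is simply W delta and the clean image cancels out. The "hard" action is taken to be the one where the current perturbation is least effective. The names attack, worst_case, eps, and step are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def deviation(W, delta):
    """Squared velocity shift ||v(x + delta) - v(x)||^2 for the toy model v(x) = W x."""
    return sum(r * r for r in matvec(W, delta))

def grad_deviation(W, delta):
    """Gradient of ||W delta||^2 with respect to delta, i.e. 2 W^T (W delta)."""
    r = matvec(W, delta)
    return [2.0 * sum(W[i][j] * r[i] for i in range(len(W)))
            for j in range(len(delta))]

def worst_case(models, delta):
    """Smallest deviation over all actions: the case where the attack is weakest."""
    return min(deviation(W, delta) for W in models.values())

def attack(models, dim, eps=0.03, step=0.005, iters=100, seed=0):
    """Max-min sketch: maximize over delta the minimum velocity shift over actions.

    models maps an action name to its toy velocity matrix (an assumption,
    standing in for an action-conditioned denoiser). No future frames or
    labels are used; the only reference is the model's own clean prediction.
    """
    rng = random.Random(seed)
    # Random start inside the L-infinity ball; delta = 0 would give a zero gradient.
    delta = [rng.uniform(-eps, eps) for _ in range(dim)]
    best, best_score = list(delta), worst_case(models, delta)
    for _ in range(iters):
        # Inner step: mine the action on which the perturbation is currently weakest.
        hard = min(models, key=lambda a: deviation(models[a], delta))
        grad = grad_deviation(models[hard], delta)
        # Outer step: signed gradient ascent, then clip back into the eps-ball.
        delta = [max(-eps, min(eps, d + step * ((g > 0) - (g < 0))))
                 for d, g in zip(delta, grad)]
        score = worst_case(models, delta)
        if score > best_score:  # keep the best control-agnostic perturbation seen
            best, best_score = list(delta), score
    return best

# Usage with two hypothetical actions acting on a 3-dimensional "image".
toy_models = {
    "left": [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]],
    "right": [[0.0, 1.0, -0.5], [1.0, 0.0, 1.0]],
}
toy_delta = attack(toy_models, dim=3)
```

In a real VWM the velocity would come from the denoiser at an early noise level and the gradient from automatic differentiation. The sketch keeps only the structure: a perturbation bounded by eps, a self-referential objective, and an inner search over controls.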