BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, how robust these models are to adversarial perturbations remains an open question. Standard adversarial attacks cannot assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks in deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.
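The two ideas named in the abstract, a label-free velocity objective and an attack that holds up across controls, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: `velocity` is a hypothetical stand-in for a VWM's predicted denoising velocity (a small tanh layer), "hard" controls are read as the candidate controls for which the current perturbation is least effective, and the outer step is a generic signed-gradient update under an L-infinity budget. Clipping to a valid pixel range is omitted.

```python
import numpy as np

def velocity(W, U, x, a):
    """Hypothetical toy stand-in for a VWM's predicted denoising velocity."""
    return np.tanh(W @ x + U @ a)

def deviation_and_grad(W, U, x, delta, a):
    """Label-free loss ||v(x + delta, a) - v(x, a)||^2 and its gradient w.r.t. delta."""
    clean = velocity(W, U, x, a)
    adv = velocity(W, U, x + delta, a)
    diff = adv - clean
    loss = float(diff @ diff)
    # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    grad = W.T @ (2.0 * diff * (1.0 - adv ** 2))
    return loss, grad

def control_agnostic_attack(W, U, x, controls, eps=0.05, step=0.01, iters=50, seed=0):
    """Max-min sketch: ascend on the control where the perturbation is currently weakest."""
    rng = np.random.default_rng(seed)
    # Random start, because the gradient is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    best_delta, best_worst = delta.copy(), -np.inf
    history = []  # best worst-case loss seen so far, one entry per evaluation
    for _ in range(iters + 1):
        results = [deviation_and_grad(W, U, x, delta, a) for a in controls]
        losses = [loss for loss, _ in results]
        hardest = int(np.argmin(losses))  # inner step: mine the hardest control
        if losses[hardest] > best_worst:
            best_worst, best_delta = losses[hardest], delta.copy()
        history.append(best_worst)
        # Outer step: signed-gradient ascent, projected back onto the L-infinity ball.
        delta = np.clip(delta + step * np.sign(results[hardest][1]), -eps, eps)
    return best_delta, history

# Usage on random toy data (shapes and values are arbitrary).
rng = np.random.default_rng(1)
W, U = rng.normal(size=(8, 16)), rng.normal(size=(8, 4))
x = rng.normal(size=16)
controls = [rng.normal(size=4) for _ in range(5)]
delta, history = control_agnostic_attack(W, U, x, controls)
```

The sketch needs no future frames because the loss only compares the model's own clean and perturbed predictions. It returns the perturbation with the highest worst-case deviation seen during the run, so `history` is non-decreasing by construction; that says nothing about how an attack of this kind behaves on a real VWM.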