BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. How robust these models are to adversarial perturbations, however, remains an open question. Standard adversarial attacks cannot assess this vulnerability because the attacker has no ground-truth future video and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored to autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the model's early denoising dynamics. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks in deploying VWMs in safety-critical systems and also point to a practical mechanism for privacy protection.
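
To make the two components named in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical stand-in "velocity" model `tanh(W x + U a)` in place of a real VWM denoiser, a label-free objective that measures how far the perturbed velocity drifts from the clean one, and one possible reading of "mining hard controls": at each step, pick the control under which the current perturbation is weakest, then take a signed-gradient step on that control inside an L-infinity ball. All names (`velocity`, `attack_loss`, `loss_grad`, `craft_perturbation`) and all numeric values are illustrative assumptions.

```python
import math
import random


def velocity(W, U, x, a):
    """Toy stand-in for an early-step velocity prediction: tanh(W x + U a)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(W_row, x))
                  + sum(u * ai for u, ai in zip(U_row, a)))
        for W_row, U_row in zip(W, U)
    ]


def attack_loss(W, U, x, delta, a):
    """Label-free objective: squared drift of the perturbed velocity from the clean one."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    v_clean = velocity(W, U, x, a)
    v_adv = velocity(W, U, x_adv, a)
    return sum((p - q) ** 2 for p, q in zip(v_adv, v_clean))


def loss_grad(W, U, x, delta, a):
    """Analytic gradient of attack_loss with respect to delta (chain rule through tanh)."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    v_clean = velocity(W, U, x, a)
    v_adv = velocity(W, U, x_adv, a)
    coeff = [2.0 * (p - q) * (1.0 - p * p) for p, q in zip(v_adv, v_clean)]
    return [sum(coeff[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]


def craft_perturbation(W, U, x, controls, eps=0.05, step=0.01, iters=50, seed=0):
    """Alternate between mining a hard control and ascending the loss on it."""
    rng = random.Random(seed)
    # Random start: the gradient vanishes at delta = 0 because the drift is zero there.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        # Inner problem: the control under which the current perturbation is weakest.
        hard = min(controls, key=lambda a: attack_loss(W, U, x, delta, a))
        # Outer step: signed-gradient ascent on that control, clipped to the eps ball.
        g = loss_grad(W, U, x, delta, hard)
        delta = [
            max(-eps, min(eps, d + step * ((gi > 0) - (gi < 0))))
            for d, gi in zip(delta, g)
        ]
    return delta


# Hypothetical toy data: a 3-dimensional "image" and four discrete controls.
W = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.4]]
U = [[0.6, -0.2], [-0.5, 0.9]]
x = [0.2, 0.5, 0.9]
controls = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

delta = craft_perturbation(W, U, x, controls)
worst_case = min(attack_loss(W, U, x, delta, a) for a in controls)
print("perturbation:", delta)
print("worst-case drift over controls:", worst_case)
```

With a real model, the hand-written gradient would typically be replaced by automatic differentiation through the denoiser, and the finite control set by sampled or optimized control trajectories; the alternating structure is the only part this sketch is meant to illustrate.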