BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.
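
To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a tiny differentiable stand-in for the model's velocity prediction (`velocity`, a hypothetical `tanh` layer), a small set of candidate controls (`actions`), and an L-infinity budget of 8/255; none of these come from the abstract. The loss needs no future frames because it only compares the perturbed velocity with the clean one, and each iteration first selects the control under which the attack is currently weakest (one reading of "mining hard control sequences") before taking a projected signed-gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_actions = 16, 4, 5                  # toy sizes (hypothetical)
W = rng.normal(size=(d, d)) / np.sqrt(d)    # toy model weights (hypothetical)
U = rng.normal(size=(d, k))
actions = rng.normal(size=(n_actions, k))   # candidate control inputs
x = rng.uniform(0.0, 1.0, size=d)           # stand-in for the context image


def velocity(x, a):
    """Toy stand-in for the model's predicted denoising velocity."""
    return np.tanh(W @ x + U @ a)


def loss_and_grad(delta, a):
    """Squared deviation of the perturbed velocity from the clean one.

    No future frames are needed: the clean prediction is the only reference.
    """
    v_clean = velocity(x, a)
    v_adv = velocity(x + delta, a)
    diff = v_adv - v_clean
    # Chain rule through tanh: d v_adv / d delta = diag(1 - v_adv**2) @ W
    grad = 2.0 * W.T @ ((1.0 - v_adv**2) * diff)
    return float(diff @ diff), grad


def attack(eps=8 / 255, step=1 / 255, iters=50):
    # Random start: the gradient of this loss is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=d)
    for _ in range(iters):
        # Inner step: find the control under which the attack is weakest.
        results = [loss_and_grad(delta, a) for a in actions]
        hardest = min(range(n_actions), key=lambda i: results[i][0])
        # Outer step: signed gradient ascent on that control, then project
        # back onto the L-infinity ball of radius eps.
        delta = np.clip(delta + step * np.sign(results[hardest][1]), -eps, eps)
    return delta


eps = 8 / 255
delta = attack(eps=eps)
worst_case = min(loss_and_grad(delta, a)[0] for a in actions)
print(f"max |delta| = {np.abs(delta).max():.4f}, worst-case loss = {worst_case:.6f}")
```

The sketch optimizes the minimum loss over controls so that the perturbation does not rely on any single action; a real attack would differentiate through the actual world model, sample whole control sequences rather than single actions, and keep the perturbed image within the valid pixel range.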