BadWorld: Adversarial Attacks on World Models

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, how robust these models are to adversarial perturbations remains an open question. Standard adversarial attacks cannot assess this vulnerability because the attacker has no ground-truth future video and cannot predict the user's subsequent controls. We introduce BadWorld, a label-free adversarial framework tailored to autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the model's early denoising dynamics. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to produce control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks in deploying VWMs in safety-critical systems while also highlighting a practical mechanism for privacy protection.
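
To make the two ingredients above concrete, the following is a toy sketch under stated assumptions, not the BadWorld implementation. The abstract gives no formulas, so everything here is hypothetical: the "velocity model" is a small `tanh` map with made-up weights `W` and `U`, the label-free objective is assumed to be the squared deviation of the perturbed velocity from the clean one, and "mining hard controls" is assumed to mean a max-min problem in which each step targets the candidate control where the attack is currently weakest. The perturbation is kept inside an L-infinity ball of radius `eps`; clipping to a valid pixel range is omitted.

```python
import numpy as np


def velocity(W, U, x, a):
    """Hypothetical stand-in for a VWM velocity prediction given image x and control a."""
    return np.tanh(W @ x + U @ a)


def loss_and_grad(W, U, x, delta, a):
    """Label-free loss: squared deviation of the perturbed velocity from the clean one.

    Returns the loss and its analytic gradient with respect to delta.
    """
    v_clean = velocity(W, U, x, a)
    v_adv = velocity(W, U, x + delta, a)
    diff = v_adv - v_clean
    loss = float(diff @ diff)
    # Chain rule through tanh: d tanh(h)/dh = 1 - tanh(h)^2.
    grad = W.T @ (2.0 * diff * (1.0 - v_adv ** 2))
    return loss, grad


def attack(W, U, x, controls, eps=0.05, step=0.01, iters=50, seed=0):
    """Assumed max-min scheme: ascend on the control with the smallest current loss."""
    rng = np.random.default_rng(seed)
    # Random start: the gradient is exactly zero at delta = 0 for this loss.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        # Inner step: evaluate every candidate control and pick the hardest one.
        results = [loss_and_grad(W, U, x, delta, a) for a in controls]
        hardest = min(range(len(controls)), key=lambda i: results[i][0])
        _, grad = results[hardest]
        # Outer step: signed gradient ascent, then projection onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return delta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = 0.5 * rng.standard_normal((8, 16))   # toy image-to-velocity weights
    U = 0.5 * rng.standard_normal((8, 4))    # toy control-to-velocity weights
    x = rng.standard_normal(16)              # toy "context image"
    controls = [rng.standard_normal(4) for _ in range(5)]
    delta = attack(W, U, x, controls)
    worst = min(loss_and_grad(W, U, x, delta, a)[0] for a in controls)
    print("max |delta|:", np.max(np.abs(delta)))
    print("worst-case velocity deviation:", worst)
```

Targeting the weakest control at each step is one simple way to push a single perturbation toward being effective regardless of which control the user later issues; a real attack on a VWM would replace the toy map with the model's denoising network and obtain gradients by automatic differentiation.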