ImageWAM: Do World Action Models Really Need Video Generation, or Is Simple Image Editing Enough?

Abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do World Action Models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, it focuses on action-relevant visual differences between the current and target frames, and it grounds task instructions in localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. Without additional policy pretraining, ImageWAM outperforms standard VLA baselines and matches competitive WAMs across simulated and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs. Attention analysis further shows that the editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
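
The conditioning step described above, in which an action expert reads cached keys and values instead of a decoded target frame, can be illustrated with a toy sketch. This is not the ImageWAM implementation: the abstract does not specify the architecture, so every name here (`attend`, `toy_velocity`, `sample_action`) is hypothetical, the KV cache is a pair of small hand-written lists rather than the output of an image-editing model, and the velocity field is a hand-crafted stand-in for a trained network. It only shows the general shape of flow-matching sampling: start from Gaussian noise and Euler-integrate a velocity field that attends to a fixed cache.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def toy_velocity(action, t, query, kv_cache):
    """Hypothetical stand-in for a learned action expert's velocity field.

    It reads the cache through attention and points from the current sample
    toward the attended context. A real expert would be a trained network.
    """
    keys, values = kv_cache
    context = attend(query, keys, values)
    return [(c - a) / (1.0 - t) for c, a in zip(context, action)]


def sample_action(kv_cache, query, action_dim, steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise (t=0) toward t=1."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so the division in toy_velocity is safe
        v = toy_velocity(action, t, query, kv_cache)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


# Toy cache: three "tokens" with 2-d keys and 2-d values (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.5, -0.5], [0.2, 0.1], [0.0, 0.3]]
query = [0.3, -0.2]  # hypothetical fixed query vector

action = sample_action((keys, values), query, action_dim=2)
print(action)
```

Because the toy query is fixed, the attended context is constant and the sampler ends at that context regardless of the initial noise. In a trained system the velocity network would depend on the noisy action and the timestep in a learned way, and the cache would be computed once by the editing model and reused across all integration steps, which is where the inference savings would come from.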