ImageWAM: Do World Action Models Really Need Video Generation, or Is Image Editing Enough?

Abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. Compared with video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant visual differences between the current and target frames, and grounds task instructions in localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. Across simulator and real-world experiments, ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs. Attention analysis further shows that the editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
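
The conditioning step described above, a flow-matching action sampler that reads cached keys and values instead of a decoded frame, can be illustrated with a toy sketch. This is not the ImageWAM implementation: the function names, the single-head attention, the fixed query vector, and the hand-written straight-path velocity field are all assumptions made for illustration, standing in for a trained action expert. The sketch only shows the control flow: the cache is computed once, and an action is integrated from noise with Euler steps while the velocity field attends to that cache.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over a cache."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]


def toy_velocity(action, t, query, keys, values):
    """Hypothetical stand-in for a learned action expert.

    Reads the cache through cross-attention, then returns the straight-path
    flow-matching velocity (target - x) / (1 - t) toward the attended vector.
    """
    target = cross_attend(query, keys, values)
    return [(g - a) / (1.0 - t) for g, a in zip(target, action)]


def sample_action(query, keys, values, steps=10, seed=0):
    """Integrate an action from Gaussian noise toward t = 1 with Euler steps."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in values[0]]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(action, t, query, keys, values)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


# Hypothetical 3-token cache with 2-d keys and 2-d values (action dimension 2).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.5, -0.5], [0.2, 0.8], [-0.3, 0.1]]
query = [0.7, 0.1]
action = sample_action(query, keys, values)
```

With this toy velocity field the sample lands on the attended context vector whatever the initial noise, which makes the data flow easy to check. A trained expert would replace `toy_velocity` with a network, and the keys and values would come from the editing model's denoising pass rather than from hand-written lists.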