ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends model capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination can introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image-editing models for robot action prediction. Compared with video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, it focuses on the action-relevant visual differences between the current frame and the target frame, and it grounds task instructions in localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. Across simulated and real-world experiments, and without additional policy pretraining, ImageWAM outperforms standard VLA baselines and matches competitive WAMs. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs. Attention analysis further shows that the editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
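The conditioning step described above (an action expert that reads cached keys and values instead of a decoded target frame) can be illustrated with a toy sketch. This is not the ImageWAM implementation: the names `attend`, `velocity`, and `sample_action` are hypothetical, the "cache" is a hand-written list of vectors rather than the output of an image-editing model, and the learned action expert is replaced by a single attention read-out. It only shows the general shape of flow-matching sampling with Euler steps under cross-attention to a fixed KV cache.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached K/V."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def velocity(action, t, kv_cache):
    """Toy stand-in for a learned action expert (assumption, not the paper's model).

    The attended context is treated as the predicted clean action x1. For the
    linear path x_t = (1 - t) * x0 + t * x1, the flow-matching velocity is
    x1 - x0 = (x1 - x_t) / (1 - t).
    """
    keys, values = kv_cache
    target = attend(action, keys, values)
    return [(x1 - xt) / (1.0 - t) for x1, xt in zip(target, action)]


def sample_action(kv_cache, dim, steps=10, seed=0):
    """Integrate the velocity field from noise (t = 0) towards t = 1 with Euler steps."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = velocity(action, t, kv_cache)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


if __name__ == "__main__":
    # Hypothetical cache with three tokens; keys and values are 2-dimensional.
    keys = [[0.3, -0.1], [1.2, 0.4], [-0.5, 0.9]]
    values = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
    print(sample_action((keys, values), dim=2))
```

The point of the sketch is that the cache is computed once and reused at every integration step, so the sampling loop never touches pixels. In a real system the keys and values would come from the denoising pass of the editing model and the velocity would come from a trained network.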