ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
June 17, 2026
Authors: Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin
cs.AI
Abstract
World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image-editing models for robot action prediction. Compared with video generation, image editing provides a better-matched prior: it only needs to model a transformation to the target frame, focuses on the action-relevant visual differences between the current and target frames, and grounds task instructions in localized visual changes through editing pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. Across simulated and real-world experiments, ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs. Attention analysis further shows that the editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
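As a rough illustration of what conditioning a flow-matching action expert on cached keys and values could look like, the toy sketch below integrates a velocity field from Gaussian noise (t = 0) to an action (t = 1) with Euler steps, reading from a fixed KV cache through single-head attention. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `attend`, `velocity`, and `sample_action`, the hand-written velocity that stands in for a learned action expert, the single-head attention, the Euler integrator, and the requirement that keys, values, and actions share one dimension.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def velocity(action, t, kv_cache):
    """Hypothetical stand-in for a learned action expert.

    It reads a context vector from the cache and returns the velocity of the
    straight-line path from the current action to that context,
    (target - x_t) / (1 - t). A real model would predict this with a network.
    """
    keys, values = kv_cache
    context = attend(action, keys, values)
    return [(c - a) / (1.0 - t) for c, a in zip(context, action)]


def sample_action(kv_cache, action_dim, num_steps=10, seed=0):
    """Euler integration of the velocity field from noise (t=0) toward an action (t=1)."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(action, t, kv_cache)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


# Toy cache with two entries; in this sketch keys and values have the action's dimension.
toy_cache = ([[1.0, 0.0], [0.0, 1.0]], [[0.5, -0.25], [0.1, 0.3]])
toy_action = sample_action(toy_cache, action_dim=2, num_steps=8)
```

The point of the sketch is only that the cache is computed once and then read at every integration step, so no target frame is ever decoded; it makes no claim about the architecture, the number of steps, or the training objective actually used by ImageWAM.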