iMaC: Turning Actions into Motion and Contact Images for Embodied World Models

Abstract

Embodied world models have become a key paradigm for visual robotic decision-making and interactive environment simulation. Conventional embodied frameworks, however, rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses). Such vectors have limited expressive capacity, generalize poorly across embodiments, and model the dynamics of complex physical interactions unnaturally. To address these limitations, we propose iMac (Image as Action Control), a unified control paradigm that treats raw visual images as the native action representation for embodied world models. Instead of explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently carry spatial motion intent, the geometric constraints of the interaction, and fine-grained physical dynamics. We build a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses goal-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on these image actions, enabling high-fidelity future-state prediction and closed-loop embodied control. Extensive experiments on public embodied manipulation benchmarks and in real-world robotic scenarios show that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization. Moreover, the image-action design removes the dependence on manually defined action spaces, allowing flexible, general-purpose control of heterogeneous embodied agents. This work offers a visual-action perspective on embodied world models and a simple yet effective paradigm for scalable robotic perception and manipulation.
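The dual-branch data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `ImageActionEncoder` and `WorldPredictor`, the single random linear layers, and all dimensions are hypothetical placeholders, chosen only to show how an action image might be compressed into an embedding that conditions a latent state transition.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
IMG_H, IMG_W = 8, 8   # resolution of one action image
STATE_DIM = 16        # size of the latent environment state
ACTION_DIM = 4        # size of the compact action embedding


def random_matrix(rows, cols, scale=0.1):
    """Untrained Gaussian weights standing in for learned parameters."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ImageActionEncoder:
    """Assumed encoder branch: flattens an action image into an embedding."""

    def __init__(self):
        self.weights = random_matrix(ACTION_DIM, IMG_H * IMG_W)

    def __call__(self, action_image):
        pixels = [p for row in action_image for p in row]
        return [math.tanh(v) for v in matvec(self.weights, pixels)]


class WorldPredictor:
    """Assumed predictor branch: next state from state and action embedding."""

    def __init__(self):
        self.w_state = random_matrix(STATE_DIM, STATE_DIM)
        self.w_action = random_matrix(STATE_DIM, ACTION_DIM)

    def __call__(self, state, action_embedding):
        from_state = matvec(self.w_state, state)
        from_action = matvec(self.w_action, action_embedding)
        return [math.tanh(s + a) for s, a in zip(from_state, from_action)]


def rollout(encoder, predictor, state, action_images):
    """Predict one future latent state per action image, feeding each back in."""
    trajectory = []
    for image in action_images:
        state = predictor(state, encoder(image))
        trajectory.append(state)
    return trajectory


encoder = ImageActionEncoder()
predictor = WorldPredictor()
initial_state = [0.0] * STATE_DIM
action_images = [
    [[random.random() for _ in range(IMG_W)] for _ in range(IMG_H)]
    for _ in range(5)
]
trajectory = rollout(encoder, predictor, initial_state, action_images)
print(len(trajectory), len(trajectory[0]))  # 5 16
```

In this sketch the predictor never sees joint angles or end-effector poses; the only action input is the image embedding, which is the property the abstract emphasizes. A real system would replace the random linear maps with trained networks and operate on image observations rather than a toy latent vector.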