iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Abstract

Embodied world models have become a central paradigm for vision-based robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which have limited expressive capacity, generalize poorly across diverse embodiments, and model the dynamics of complex physical interactions unnaturally. To address these limitations, this paper proposes iMac (Image as Action Control), a unified control paradigm that treats raw visual images as the native action representation for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encode spatial motion intent, geometric constraints on interaction, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses goal-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on these image actions, enabling high-fidelity future-state prediction and closed-loop embodied control. We conduct extensive experiments on public embodied manipulation benchmarks and in real-world robotic scenarios. The results show that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization. Moreover, the image-action design removes the reliance on manually defined action spaces, enabling flexible, general-purpose control of heterogeneous embodied agents. This work offers a visual-action perspective on embodied world models and a simple yet effective paradigm for scalable robotic perception and manipulation.
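
To make the dual-branch data flow concrete, the following is a minimal sketch under stated assumptions, not the implementation evaluated in this work. The class names `ImageActionEncoder` and `WorldPredictor`, the function `rollout`, all dimensions, and the use of untrained random linear maps in place of learned networks are hypothetical and chosen only for illustration. The sketch shows an action image being compressed into an embedding, and a predictor advancing the state conditioned on that embedding.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix; a stand-in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ImageActionEncoder:
    """Hypothetical encoder: flattens an action image and projects it linearly."""

    def __init__(self, height, width, embed_dim, rng):
        self.height, self.width = height, width
        self.weights = make_matrix(embed_dim, height * width, rng)

    def encode(self, action_image):
        pixels = [p for row in action_image for p in row]
        if len(pixels) != self.height * self.width:
            raise ValueError("unexpected action image size")
        return matvec(self.weights, pixels)


class WorldPredictor:
    """Hypothetical predictor: residual linear transition on [state, embedding]."""

    def __init__(self, state_dim, embed_dim, rng):
        self.weights = make_matrix(state_dim, state_dim + embed_dim, rng)

    def predict(self, state, action_embedding):
        delta = matvec(self.weights, state + action_embedding)
        return [s + d for s, d in zip(state, delta)]


def rollout(encoder, predictor, state, action_images):
    """Predict a state trajectory from a sequence of action images."""
    states = [state]
    for image in action_images:
        state = predictor.predict(state, encoder.encode(image))
        states.append(state)
    return states


rng = random.Random(0)
encoder = ImageActionEncoder(height=4, width=4, embed_dim=3, rng=rng)
predictor = WorldPredictor(state_dim=5, embed_dim=3, rng=rng)
initial_state = [0.0] * 5
action_images = [
    [[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(3)
]
trajectory = rollout(encoder, predictor, initial_state, action_images)
```

In this sketch the action enters the predictor only through the image embedding, so no joint-angle or end-effector vector is defined anywhere; a real system would replace both linear maps with trained networks operating on full-resolution images.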