iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Abstract

Embodied world models have emerged as a pivotal paradigm for vision-based robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamics modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, geometric constraints on interaction, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses goal-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. We conduct extensive experiments on public embodied manipulation benchmarks and in real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization. Moreover, the image-action design eliminates the reliance on manually defined action spaces, enabling flexible and universal control of heterogeneous embodied agents. This work provides a visual-action perspective on embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.
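To make the dual-branch data flow concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy setting: an action image is a small 2D grid of floats, the encoder is plain average pooling (a stand-in for a learned image-action encoder), and the predictor is a randomly initialized linear residual map (a stand-in for a learned world predictor). The names `encode_action_image`, `WorldPredictor`, and `rollout` are hypothetical and do not come from the paper.

```python
import random


def encode_action_image(image, grid=2):
    """Compress an H x W action image into a flat grid*grid embedding.

    Assumption: average pooling stands in for a learned encoder.
    H and W are assumed to be divisible by `grid`.
    """
    h, w = len(image), len(image[0])
    bh, bw = h // grid, w // grid
    embedding = []
    for gi in range(grid):
        for gj in range(grid):
            block = [image[r][c]
                     for r in range(gi * bh, (gi + 1) * bh)
                     for c in range(gj * bw, (gj + 1) * bw)]
            embedding.append(sum(block) / len(block))
    return embedding


class WorldPredictor:
    """Hypothetical transition model: s' = s + W [s; a].

    Weights are random (seeded), not trained; only the interface
    "state + image-action embedding -> next state" is illustrated.
    """

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(state_dim + action_dim)]
            for _ in range(state_dim)
        ]

    def predict(self, state, action_embedding):
        # Concatenate the state with the action embedding, then apply
        # the linear map as a residual update to the state.
        x = list(state) + list(action_embedding)
        return [s + sum(w * v for w, v in zip(row, x))
                for s, row in zip(state, self.weights)]


def rollout(predictor, state, action_images, grid=2):
    """Predict a state trajectory conditioned on a sequence of action images."""
    states = [list(state)]
    for image in action_images:
        embedding = encode_action_image(image, grid)
        states.append(predictor.predict(states[-1], embedding))
    return states
```

In this sketch the predictor never sees joint angles or end-effector poses; its only action input is the embedding derived from the image. The rollout shown here is open-loop prediction; a closed-loop controller would additionally choose each action image based on the latest observed state, which this sketch does not attempt.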