iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Abstract

Embodied world models have emerged as a central paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which offer limited expressive capacity, generalize poorly across diverse embodiments, and model the dynamics of complex physical interactions unnaturally. To address these limitations, this paper proposes iMac (Image as Action Control), a unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encode spatial motion intent, the geometric constraints of the interaction, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on these image actions, enabling high-fidelity future state prediction and closed-loop embodied control. We conduct extensive experiments on public embodied manipulation benchmarks and in real-world robotic scenarios. The results show that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization. Moreover, the image-action design removes the reliance on manually defined action spaces, enabling flexible and universal control of heterogeneous embodied agents. This work offers a visual-action perspective on embodied world models and a simple yet effective paradigm for scalable robotic perception and manipulation.
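To make the dual-branch structure concrete, the following is a minimal toy sketch, not the architecture used in this work. It assumes that an action image is a small 2D grid of intensities, that the encoder can be stood in for by average pooling, and that the predictor is a linear transition s' = A s + B a with A fixed to the identity. The names encode_action_image, ToyWorldPredictor, and rollout are hypothetical and chosen only for illustration.

```python
import random


def encode_action_image(image, pool=2):
    """Hypothetical encoder: average-pool an H x W action image into a flat embedding."""
    h, w = len(image), len(image[0])
    embedding = []
    for r in range(0, h, pool):
        for c in range(0, w, pool):
            block = [image[i][j]
                     for i in range(r, min(r + pool, h))
                     for j in range(c, min(c + pool, w))]
            embedding.append(sum(block) / len(block))
    return embedding


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class ToyWorldPredictor:
    """Hypothetical predictor: linear transition s' = A s + B a (an assumption)."""

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # A is the identity, so a zero action embedding leaves the state unchanged.
        self.A = [[1.0 if i == j else 0.0 for j in range(state_dim)]
                  for i in range(state_dim)]
        # B is small and random; a real model would learn it from data.
        self.B = [[rng.uniform(-0.1, 0.1) for _ in range(action_dim)]
                  for _ in range(state_dim)]

    def predict(self, state, action_embedding):
        drift = matvec(self.A, state)
        effect = matvec(self.B, action_embedding)
        return [d + e for d, e in zip(drift, effect)]


def rollout(predictor, state, action_images):
    """Predict a state trajectory conditioned on a sequence of action images."""
    states = [state]
    for image in action_images:
        state = predictor.predict(state, encode_action_image(image))
        states.append(state)
    return states


# A 4 x 4 action image with activity only in the top-right quadrant.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
predictor = ToyWorldPredictor(state_dim=3, action_dim=4)
trajectory = rollout(predictor, [0.0, 0.0, 0.0], [image, image])
```

The sketch only illustrates the data flow from action image to embedding to predicted state. It rolls out a fixed sequence of action images and does not implement closed-loop control, learned tokens, or image-valued states.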