3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

Abstract

Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on a mug rack. A key reason is the lack of a large, unified dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in differing action spaces within simple scenes. This makes it hard to learn a unified and robust action representation that transfers across different robots and diverse scenes. Observing how humans understand a manipulation task, we find that knowing how the objects should move in 3D space is a critical cue for guiding actions. This cue is embodiment-agnostic and applies to both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. The model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through an automatic moving-object detection pipeline. A video diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description. This equips the robot with a closed-loop planning ability. Finally, we treat the predicted 3D optical flow as constraints for an optimization policy that determines a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
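
The closed-loop planning step described in the abstract (predict a flow, check its final state against the task, then act on it) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `plan_with_verification`, `predict_flow`, `verify_flow`, and `mean_displacement` are hypothetical, the toy stubs stand in for the flow world model and for the render-plus-GPT-4o check, and resampling after a rejected prediction is an assumption about how the loop is closed.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]
Flow = List[List[Point]]  # flow[t][k]: 3D position of keypoint k at step t


def plan_with_verification(
    instruction: str,
    predict_flow: Callable[[str, int], Flow],         # hypothetical stand-in for the flow world model
    verify_flow: Callable[[str, List[Point]], bool],  # hypothetical stand-in for render + VLM check
    max_attempts: int = 3,
) -> Optional[Flow]:
    """Return the first predicted flow whose final state passes verification, else None."""
    for attempt in range(max_attempts):
        # Assumption: a rejected prediction is simply resampled with a new seed.
        flow = predict_flow(instruction, attempt)
        if verify_flow(instruction, flow[-1]):
            return flow
    return None


def mean_displacement(flow: Flow) -> Point:
    """Average keypoint displacement between the first and last step.

    This is the least-squares translation of the object when rotation is ignored;
    a real optimization policy would also solve for rotation and robot constraints.
    """
    start, end = flow[0], flow[-1]
    n = len(start)
    return tuple(sum(e[i] - s[i] for s, e in zip(start, end)) / n for i in range(3))


# Toy stubs so the sketch runs end to end (purely illustrative).
def toy_predict(instruction: str, seed: int) -> Flow:
    lift = 0.05 * seed  # later samples lift the object higher
    start = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    end = [(x, y, z + lift) for x, y, z in start]
    return [start, end]


def toy_verify(instruction: str, final_state: List[Point]) -> bool:
    # Accept only if every keypoint ends at least 9 cm above the table.
    return all(z >= 0.09 for _, _, z in final_state)


flow = plan_with_verification("lift the cup", toy_predict, toy_verify)
target_translation = mean_displacement(flow) if flow is not None else None
```

In this toy run the first two samples are rejected and the third is accepted. The accepted flow then yields a target object translation that a downstream controller could track.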

