3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
June 6, 2025
Authors: Hongyan Zhi, Peihao Chen, Siyuan Zhou, Yubo Dong, Quanxi Wu, Lei Han, Mingkui Tan
cs.AI
Abstract
Manipulation has long been a challenging task for robots, while humans can
effortlessly perform complex interactions with objects, such as hanging a cup
on a mug rack. A key reason is the lack of a large, unified dataset for
teaching robots manipulation skills. Current robot datasets often record robot
actions in different action spaces within simple scenes. This hinders robots
from learning a unified and robust action representation for different robots in
diverse scenes. Observing how humans understand a manipulation task, we find
that understanding how objects should move in 3D space is a critical
cue for guiding actions. This cue is embodiment-agnostic and applies to
both humans and different robots. Motivated by this, we aim to learn a 3D flow
world model from both human and robot manipulation data. This model predicts
the future movement of the interacting objects in 3D space, guiding action
planning for manipulation. Specifically, we synthesize a large-scale 3D optical
flow dataset, named ManiFlow-110k, through a moving-object auto-detection
pipeline. A video diffusion-based world model then learns manipulation physics
from these data, generating 3D optical flow trajectories conditioned on
language instructions. With the generated 3D object optical flow, we propose a
flow-guided rendering mechanism, which renders the predicted final state and
leverages GPT-4o to assess whether the predicted flow aligns with the task
description. This equips the robot with a closed-loop planning ability.
Finally, we treat the predicted 3D optical flow as constraints for an
optimization policy that determines a chunk of robot actions for manipulation.
Extensive experiments demonstrate strong generalization across diverse robotic
manipulation tasks and reliable cross-embodiment adaptation without
hardware-specific training.
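To make the final step of the pipeline concrete, the snippet below is a minimal sketch, not the paper's implementation, of how a predicted 3D object flow could constrain a chunk of end-effector actions: for each predicted time step, a rigid transform is fit from the current object points to their predicted positions and applied to the gripper pose, assuming the object is rigidly grasped. All names here (fit_rigid_transform, actions_from_flow, predicted_flow, ee_pose) are hypothetical.

    import numpy as np

    def fit_rigid_transform(src, dst):
        """Kabsch/SVD: least-squares rigid transform (R, t) mapping src -> dst, both (N, 3)."""
        src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
        H = (src - src_c).T @ (dst - dst_c)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = dst_c - R @ src_c
        return R, t

    def actions_from_flow(object_points, predicted_flow, ee_pose):
        """Convert a predicted 3D flow trajectory into a chunk of end-effector poses.

        object_points: (N, 3) current 3D points on the grasped object.
        predicted_flow: (T, N, 3) predicted future positions of those points.
        ee_pose: (4, 4) current end-effector pose; the object is assumed rigidly grasped.
        Returns a list of T target end-effector poses, each (4, 4).
        """
        actions = []
        pts, pose = object_points.copy(), ee_pose.copy()
        for target in predicted_flow:
            R, t = fit_rigid_transform(pts, target)
            T_step = np.eye(4)
            T_step[:3, :3], T_step[:3, 3] = R, t
            pose = T_step @ pose            # move the gripper with the object
            actions.append(pose.copy())
            pts = (R @ pts.T).T + t         # propagate object points to the next step
        return actions

The closed-form rigid fit is used here only because it is simple and exact for rigidly grasped objects; any trajectory optimizer that penalizes deviation from the predicted flow could be substituted.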