
3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

June 6, 2025
Authors: Hongyan Zhi, Peihao Chen, Siyuan Zhou, Yubo Dong, Quanxi Wu, Lei Han, Mingkui Tan
cs.AI

Abstract

Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on a mug rack. A key reason is the lack of a large, uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in different action spaces within simple scenes, which hinders robots from learning a unified and robust action representation that transfers across embodiments and diverse scenes. Observing how humans understand a manipulation task, we find that understanding how the objects should move in 3D space is a critical cue for guiding actions. This cue is embodiment-agnostic and applies to both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. The model predicts the future motion of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through a moving-object auto-detection pipeline. A video-diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object flow, we propose a flow-guided rendering mechanism that renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description, equipping the robot with closed-loop planning. Finally, we treat the predicted 3D optical flow as constraints for an optimization policy that determines a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
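
To make the final step concrete, below is a minimal sketch of how predicted 3D object flow could constrain an action chunk: assuming the gripper holds the object rigidly, each flow step induces a rigid object transform that a Kabsch/Procrustes fit can recover and compose with the grasp pose to obtain end-effector waypoints. The function names and toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fit_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) such that dst ≈ R @ src + t (Kabsch/Procrustes)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance of centered point sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def actions_from_flow(flow: np.ndarray, grasp_pose: np.ndarray):
    """
    flow:       (T, N, 3) predicted 3D positions of N object points over T steps.
    grasp_pose: (4, 4) initial end-effector pose, assumed rigidly attached to the object.
    Returns one (4, 4) end-effector waypoint per subsequent flow step.
    """
    waypoints = []
    for step in range(1, flow.shape[0]):
        R, t = fit_rigid_transform(flow[0], flow[step])   # object motion since the first frame
        T_obj = np.eye(4)
        T_obj[:3, :3], T_obj[:3, 3] = R, t
        waypoints.append(T_obj @ grasp_pose)              # carry the gripper along with the object
    return waypoints

# Toy usage: eight object points translated by 0.1 m along z at every step.
pts = np.random.rand(8, 3)
flow = np.stack([pts + np.array([0.0, 0.0, 0.1 * k]) for k in range(5)])
waypoints = actions_from_flow(flow, grasp_pose=np.eye(4))
print(waypoints[-1][:3, 3])   # ≈ [0. 0. 0.4]
```

In the full pipeline described above, such waypoints would only be executed after the flow-guided rendering check, in which GPT-4o judges whether the rendered final state matches the language instruction; otherwise the flow is regenerated.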