Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
December 22, 2025
Authors: Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong
cs.AI
Abstract
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
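To make the editing step concrete, the sketch below shows one plausible reading of the pipeline's core idea: a reconstructed point cloud is edited with a rigid SE(3) transform (here, translating a target object to a new position), and the edited scene is re-projected through pinhole intrinsics into a depth map that could serve as the conditioning signal for the video generation model. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (apply_se3, render_depth), the intrinsics K, and the synthetic point clouds are hypothetical, and the paper's geometric correction of robot poses is only hinted at in a comment.

```python
import numpy as np

def apply_se3(points, T):
    """Apply a 4x4 rigid transform T to an (N, 3) array of 3D points."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4) homogeneous coords
    return (homo @ T.T)[:, :3]

def render_depth(points_cam, K, height, width):
    """Z-buffer a point cloud (already in camera coordinates) into a depth map.

    points_cam: (N, 3) points in the camera frame, z > 0 in front of the camera.
    K: 3x3 pinhole intrinsics. Returns an (H, W) depth map with 0 where empty.
    """
    z = points_cam[:, 2]
    pts = points_cam[z > 1e-6]
    uvz = pts @ K.T                       # rows are [u*z, v*z, z]
    u = np.round(uvz[:, 0] / uvz[:, 2]).astype(int)
    v = np.round(uvz[:, 1] / uvz[:, 2]).astype(int)
    z = pts[:, 2]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.zeros((height, width))
    order = np.argsort(-z)                # far-to-near, so nearer points overwrite farther ones
    depth[v[order], u[order]] = z[order]
    return depth

# Hypothetical scene: object point cloud plus static background, camera at the origin.
rng = np.random.default_rng(0)
background = rng.uniform([-0.5, -0.3, 0.8], [0.5, 0.3, 1.2], size=(5000, 3))
obj = rng.uniform([-0.05, -0.05, 0.9], [0.05, 0.05, 1.0], size=(2000, 3))

# Spatial edit: translate the object 10 cm along +x to synthesize a new layout.
# In the full framework, the same transform would also be applied to the
# end-effector waypoints that interact with the object, so that trajectory
# and geometry stay physically consistent.
T_edit = np.eye(4)
T_edit[:3, 3] = [0.10, 0.0, 0.0]
obj_edited = apply_se3(obj, T_edit)

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])           # assumed intrinsics for a 640x480 view
depth_condition = render_depth(np.vstack([background, obj_edited]), K, 480, 640)
print(depth_condition.shape, float(depth_condition.max()))
```

In this reading, the re-rendered depth map is what makes the edit "depth-reliable": because it is derived from edited 3D geometry rather than warped 2D pixels, it stays physically consistent and can anchor the multi-conditional video generator alongside the action, edge, and ray-map signals.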