Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
December 22, 2025
Authors: Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong
cs.AI
Abstract
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
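To make the described pipeline concrete, the following minimal Python sketch illustrates the kind of 3D editing the abstract outlines: rigidly transforming an object's point cloud together with the manipulation waypoints, then re-rendering a depth map from a calibrated camera to serve as a generation condition. All function names, intrinsics, and scene values here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of point-cloud editing plus depth
# re-rendering. `edit_scene`, `render_depth`, and the pinhole parameters
# below are hypothetical, chosen only to illustrate the idea.
import numpy as np


def edit_scene(points, waypoints, R, t):
    """Rigidly move an object point cloud and its trajectory by (R, t)."""
    # points: (N, 3) object points; waypoints: (M, 3) end-effector positions
    moved_points = points @ R.T + t
    moved_waypoints = waypoints @ R.T + t
    return moved_points, moved_waypoints


def render_depth(points, K, T_cam_world, h, w):
    """Project world-frame points into a depth image (nearest-point z-buffer)."""
    # Transform points into the camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]  # keep points in front of camera
    # Pinhole projection.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]
    depth = np.full((h, w), np.inf)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)  # keep the nearest surface
    depth[np.isinf(depth)] = 0.0
    return depth


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy object 1 m in front of the camera, plus an approach trajectory.
    obj = rng.uniform(-0.05, 0.05, size=(500, 3)) + np.array([0.0, 0.0, 1.0])
    traj = np.linspace([0.0, -0.2, 1.0], [0.0, 0.0, 1.0], num=20)
    # Example edit: shift the object (and its trajectory) 10 cm along +x.
    R, t = np.eye(3), np.array([0.10, 0.0, 0.0])
    obj2, traj2 = edit_scene(obj, traj, R, t)
    K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
    T = np.eye(4)  # camera at the world origin looking down +z (illustrative)
    d = render_depth(obj2, K, T, h=480, w=640)
    print("non-zero depth pixels:", int((d > 0).sum()))
```

In the actual framework this depth, rendered from the edited, metric-scale reconstruction, would then condition the multi-view video generation model alongside action, edge, and ray maps.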