Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots
September 2, 2025
Authors: Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, Xinghang Li, Yong Yu, Weinan Zhang, Tao Kong, Bingyi Kang
cs.AI
Abstract
Modern robotic manipulation primarily relies on visual observations in a 2D
color space for skill learning but suffers from poor generalization. In
contrast, humans, living in a 3D world, depend more on physical properties,
such as distance, size, and shape, than on texture when interacting with objects.
Since such 3D geometric information can be acquired from widely available depth
cameras, it appears feasible to endow robots with similar perceptual
capabilities. Our pilot study found that using depth cameras for manipulation
is challenging, primarily due to their limited accuracy and susceptibility to
various types of noise. In this work, we propose Camera Depth Models (CDMs) as
a simple plugin on daily-use depth cameras, which take RGB images and raw depth
signals as input and output denoised, accurate metric depth. To achieve this,
we develop a neural data engine that generates high-quality paired data from
simulation by modeling a depth camera's noise pattern. Our results show that
CDMs achieve nearly simulation-level accuracy in depth prediction, effectively
bridging the sim-to-real gap for manipulation tasks. Notably, our experiments
demonstrate, for the first time, that a policy trained on raw simulated depth,
without the need for adding noise or real-world fine-tuning, generalizes
seamlessly to real-world robots on two challenging long-horizon tasks involving
articulated, reflective, and slender objects, with little to no performance
degradation. We hope our findings will inspire future research in utilizing
simulation data and 3D information in general robot policies.
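The data engine described above generates paired training data by corrupting clean simulated depth with a model of a real camera's noise pattern. A minimal sketch of that idea follows; the function names and the specific noise model (multiplicative Gaussian jitter plus random dropout holes) are illustrative assumptions standing in for the paper's learned noise modeling, not the authors' implementation.

```python
import numpy as np

def simulate_camera_noise(clean_depth, rng, sigma=0.01, dropout_p=0.05):
    """Corrupt clean simulated depth with a toy camera-noise model.

    Stand-in for a calibrated/learned depth-camera noise pattern:
    - multiplicative Gaussian jitter (error grows with distance)
    - random dropout holes (missing returns, reported as 0)
    """
    noisy = clean_depth * (1.0 + sigma * rng.standard_normal(clean_depth.shape))
    holes = rng.random(clean_depth.shape) < dropout_p
    noisy[holes] = 0.0  # depth cameras commonly report 0 at invalid pixels
    return noisy

def make_paired_dataset(clean_depths, seed=0):
    """Yield (noisy, clean) pairs for training a depth-denoising model."""
    rng = np.random.default_rng(seed)
    return [(simulate_camera_noise(d, rng), d) for d in clean_depths]

# Example: a flat surface 2 m from the camera, rendered noise-free in sim.
clean = [np.full((4, 4), 2.0)]
pairs = make_paired_dataset(clean)
```

A model trained on such pairs learns the inverse mapping, from noisy raw depth (plus RGB, in the CDM formulation) back to clean metric depth, which is what lets a policy trained on raw simulated depth transfer without added noise.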