Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

Abstract

Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning, but it suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without added noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research on utilizing simulation data and 3D information in general robot policies.

