Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

Abstract

Modern robotic manipulation relies primarily on visual observations in 2D color space for skill learning, and it generalizes poorly. Humans, by contrast, live in a 3D world and depend more on physical properties such as distance, size, and shape than on texture when interacting with objects. Because such 3D geometric information can be acquired from widely available depth cameras, endowing robots with similar perceptual capabilities appears feasible. Our pilot study found, however, that using depth cameras for manipulation is challenging, primarily because of their limited accuracy and their susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs), a simple plugin for daily-use depth cameras that takes RGB images and raw depth signals as input and outputs denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, with no added noise and no real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research on using simulation data and 3D information in general robot policies.
