Human Universal Grasping

Abstract

Humans grasp objects effortlessly, whereas multi-fingered robots remain far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object from a single RGB-D image captured by a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hours) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations and outputs a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, comprising 90 unseen objects with metric-scale 3D meshes, drawn from five geometric categories and spanning a range of sizes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are available on our website: https://grasping.io/
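
As a rough illustration of how a flow-matching model can produce a grasp, the sketch below integrates a velocity field from Gaussian noise at t=0 to a grasp vector at t=1 using Euler steps. It is a toy under stated assumptions and not the HUG implementation: the 3 + 3 + 45 vector layout (wrist translation, axis-angle wrist rotation, MANO finger pose), the names `toy_velocity`, `sample_grasp`, and `split_grasp`, and the closed-form velocity field standing in for a learned, RGB-D-conditioned network are all hypothetical.

```python
import numpy as np

# Assumed layout of a flat grasp vector (hypothetical, not taken from the paper):
# 3 wrist translation + 3 axis-angle wrist rotation + 45 MANO finger pose values.
TRANSLATION_DIM, ROTATION_DIM, POSE_DIM = 3, 3, 45
GRASP_DIM = TRANSLATION_DIM + ROTATION_DIM + POSE_DIM


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1 with a single
    target x_1, the exact velocity field is (x_1 - x_t) / (1 - t).
    A real model would instead predict the velocity from x, t, and image features.
    """
    return (target - x) / (1.0 - t)


def sample_grasp(velocity_fn, target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(GRASP_DIM)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        x = x + dt * velocity_fn(x, t, target)
    return x


def split_grasp(x):
    """Split a flat grasp vector into its three assumed components."""
    translation = x[:TRANSLATION_DIM]
    rotation = x[TRANSLATION_DIM:TRANSLATION_DIM + ROTATION_DIM]
    hand_pose = x[TRANSLATION_DIM + ROTATION_DIM:]
    return translation, rotation, hand_pose


target = np.linspace(-1.0, 1.0, GRASP_DIM)  # made-up target grasp for the toy field
grasp = sample_grasp(toy_velocity, target)
translation, rotation, hand_pose = split_grasp(grasp)
```

Because the toy field points at a single target, every noise sample ends at the same grasp. A learned field trained on many human grasps would instead carry different noise samples to different plausible grasps, which is how such a sampler can yield diverse outputs for one object.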