Human Universal Grasping

Abstract

Humans grasp objects effortlessly, whereas multi-fingered robots remain far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured by a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hours) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, comprising 90 unseen objects of various sizes from five geometric categories, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are available on our website: https://grasping.io/
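To make the grasp parameterization and the flow-matching sampling step concrete, the following is a minimal sketch and not the released HUG implementation. It assumes a flat grasp vector made of a 3D wrist translation, a 6D wrist rotation, and a 45-dimensional MANO articulation; these dimensions, the function names, and the closed-form `toy_velocity` field are all illustrative assumptions. In a real model the velocity field would be a learned network conditioned on the RGB-D observation.

```python
import numpy as np

# Assumed layout of one grasp vector (illustrative, not taken from the paper).
TRANSLATION_DIM = 3   # wrist translation (x, y, z)
ROTATION_DIM = 6      # wrist rotation, assumed 6D continuous representation
HAND_POSE_DIM = 45    # MANO articulation, assumed 15 joints x 3 axis-angle values
GRASP_DIM = TRANSLATION_DIM + ROTATION_DIM + HAND_POSE_DIM


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned, image-conditioned velocity network.

    On the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 by time 1 is (x_1 - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_grasps(velocity_fn, num_samples, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = rng.standard_normal((num_samples, GRASP_DIM))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by 0
        x = x + dt * velocity_fn(x, t)
    return x


def split_grasp(x):
    """Split flat grasp vectors into (translation, rotation, hand_pose)."""
    t_end = TRANSLATION_DIM
    r_end = t_end + ROTATION_DIM
    return x[..., :t_end], x[..., t_end:r_end], x[..., r_end:]


rng = np.random.default_rng(0)
target = np.linspace(-1.0, 1.0, GRASP_DIM)  # hypothetical reference grasp
grasps = sample_grasps(lambda x, t: toy_velocity(x, t, target), 4, 10, rng)
translation, rotation, hand_pose = split_grasp(grasps)
```

Because the toy field pulls every noise sample toward a single reference grasp, all four samples end at the same point. A learned velocity field would instead map different noise draws to different plausible grasps, which is where the diversity described in the abstract would come from.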