Learning Generalizable Feature Fields for Mobile Manipulation

Abstract

An open problem in mobile manipulation is how to represent objects and scenes in a unified manner, so that robots can use the same representation both for navigating the environment and for manipulating objects. Manipulation requires capturing intricate geometry while understanding fine-grained semantics, whereas navigation requires capturing the complexity inherent to an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that runs in real time and serves as a unified representation for both navigation and manipulation. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We evaluate GeFF's ability to generalize to open-set objects, as well as its running time, when performing open-vocabulary mobile manipulation in dynamic scenes.
