Physically Grounded Vision-Language Models for Robotic Manipulation

Abstract

Recent advances in vision-language models (VLMs) have improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well positioned to reason about the physical world, particularly in domains such as robotic manipulation. However, current VLMs have a limited understanding of the physical concepts (e.g., material, fragility) of common objects, which restricts their usefulness for robotic manipulation tasks that involve interaction with, and physical reasoning about, such objects. To address this limitation, we propose PhysObjects, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical concept annotations of common household objects. We demonstrate that fine-tuning a VLM on PhysObjects improves its understanding of physical object concepts by capturing human priors of these concepts from visual appearance. We incorporate this physically grounded VLM into an interactive framework with a large language model-based robotic planner and show improved planning performance on tasks that require reasoning about physical object concepts, compared to baselines that do not leverage physically grounded VLMs. We additionally illustrate the benefits of our physically grounded VLM on a real robot, where it improves task success rates. We release our dataset and provide further details and visualizations of our results at https://iliad.stanford.edu/pg-vlm/.
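
The interactive framework described in the abstract can be pictured as a planner that asks a VLM about physical concepts before it commits to a plan. The following is a minimal sketch under stated assumptions, not the authors' implementation: `query_physical_concept`, `plan_grasp_order`, the object names, the scores, and the lookup table standing in for a fine-tuned VLM are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a physically grounded VLM. It maps an
# (object, concept) pair to a score in [0, 1]. A real system would instead
# query a fine-tuned model with an image of the object.
FAKE_VLM_SCORES: Dict[Tuple[str, str], float] = {
    ("glass cup", "fragility"): 0.9,
    ("metal spoon", "fragility"): 0.1,
    ("ceramic plate", "fragility"): 0.8,
}


def query_physical_concept(obj: str, concept: str) -> float:
    """Return the stub's score for a concept; unknown pairs default to 0.5."""
    return FAKE_VLM_SCORES.get((obj, concept), 0.5)


def plan_grasp_order(
    objects: List[str],
    concept: str,
    query: Callable[[str, str], float] = query_physical_concept,
) -> List[str]:
    """Order objects from lowest to highest concept score.

    With concept="fragility", sturdy objects are handled first and the most
    fragile object last. This ordering rule is an illustrative assumption.
    """
    return sorted(objects, key=lambda o: query(o, concept))


if __name__ == "__main__":
    scene = ["glass cup", "metal spoon", "ceramic plate"]
    print(plan_grasp_order(scene, "fragility"))
```

Passing the query function as a parameter keeps the planner independent of any particular model, so the lookup table could be swapped for a real VLM call without changing the planning logic.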
