KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Abstract

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, which limits their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is both interpretable by VLMs and efficiently translatable into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
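To make the step from keypoint targets to a planning cost more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a target specification can be written as a mapping from keypoint index to a desired 2D position, uses a sum of squared distances as the cost, and substitutes a trivial translation model for the learned neural dynamics model. The abstract does not name the optimizer, so plain random shooting is used here as a stand-in. All names (`keypoint_cost`, `toy_dynamics`, `plan_random_shooting`) are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def keypoint_cost(keypoints: List[Point], targets: Dict[int, Point]) -> float:
    """Sum of squared distances between selected keypoints and their targets.

    `targets` maps a keypoint index to a desired position. In KUDA such a
    specification would come from the VLM; here it is supplied by hand.
    """
    return sum(
        (keypoints[i][0] - tx) ** 2 + (keypoints[i][1] - ty) ** 2
        for i, (tx, ty) in targets.items()
    )


def toy_dynamics(keypoints: List[Point], action: Point) -> List[Point]:
    """Assumed stand-in for a learned dynamics model.

    Every keypoint simply translates by the action vector.
    """
    dx, dy = action
    return [(x + dx, y + dy) for x, y in keypoints]


def plan_random_shooting(
    keypoints: List[Point],
    targets: Dict[int, Point],
    horizon: int = 5,
    samples: int = 200,
    step: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Point], float]:
    """Sample random action sequences and keep the one with the lowest final cost."""
    rng = random.Random(seed)
    best_cost = float("inf")
    best_actions: List[Point] = []
    for _ in range(samples):
        actions = [
            (rng.uniform(-step, step), rng.uniform(-step, step))
            for _ in range(horizon)
        ]
        state = keypoints
        for action in actions:
            state = toy_dynamics(state, action)  # roll the model forward
        cost = keypoint_cost(state, targets)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


if __name__ == "__main__":
    start = [(0.0, 0.0), (1.0, 0.0)]
    goal = {0: (0.5, 0.5)}  # "move keypoint 0 to (0.5, 0.5)"
    actions, cost = plan_random_shooting(start, goal)
    print("initial cost:", keypoint_cost(start, goal))
    print("planned cost:", cost)
```

The point of the sketch is only that a keypoint-level goal is a quantity a planner can minimize directly, given any model that predicts how keypoints move under an action. A real system would replace `toy_dynamics` with the learned model and would likely use a stronger optimizer.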
