KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
March 13, 2025
Authors: Zixian Liu, Mingtong Zhang, Yunzhu Li
cs.AI
Abstract
With the rapid advancement of large language models (LLMs) and
vision-language models (VLMs), significant progress has been made in developing
open-vocabulary robotic manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their applicability to
more complex, dynamic tasks. In this work, we introduce KUDA, an
open-vocabulary manipulation system that integrates dynamics learning and
visual prompting through keypoints, leveraging both VLMs and learning-based
neural dynamics models. Our key insight is that a keypoint-based target
specification is both interpretable by VLMs and efficiently translatable into
cost functions for model-based planning. Given language
instructions and visual observations, KUDA first assigns keypoints to the RGB
image and queries the VLM to generate target specifications. These abstract
keypoint-based representations are then converted into cost functions, which
are optimized using a learned dynamics model to produce robotic trajectories.
We evaluate KUDA on a range of manipulation tasks, including free-form language
instructions across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness of our
framework. The project page is available at http://kuda-dynamics.github.io.
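To make the pipeline in the abstract concrete, the following is a minimal Python sketch of the planning step: a keypoint-based target specification (of the kind a VLM might return) is converted into a cost function and minimized by rolling out a dynamics model. The function names (`keypoint_cost`, `learned_dynamics`, `plan_action`), the random-shooting optimizer, and the toy translate-only dynamics are illustrative assumptions for exposition, not KUDA's actual implementation.

```python
# Minimal sketch (not the authors' code): turning a keypoint-based target
# specification into a cost function and planning against a dynamics model
# via random-shooting optimization. The VLM query, the trained dynamics
# network, and the robot interface are all stubbed out here.
import numpy as np


def keypoint_cost(predicted_keypoints, target_spec):
    """Sum of squared distances between predicted keypoints and their targets.

    target_spec maps a keypoint index to a desired position, e.g. the kind of
    output a VLM might produce: {0: [0.50, 0.20], 1: [0.55, 0.25]}.
    """
    return sum(
        np.sum((predicted_keypoints[i] - np.asarray(goal)) ** 2)
        for i, goal in target_spec.items()
    )


def learned_dynamics(keypoints, action):
    """Placeholder for a learned neural dynamics model.

    A real model would predict how the tracked keypoints move under a robot
    action; this toy version simply translates every keypoint by the action.
    """
    return keypoints + action


def plan_action(keypoints, target_spec, horizon=5, n_samples=256, seed=0):
    """Random-shooting planner: sample action sequences, roll them out with
    the dynamics model, and return the first action of the cheapest sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_action = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-0.05, 0.05, size=(horizon, keypoints.shape[1]))
        state = keypoints.copy()
        for a in actions:
            state = learned_dynamics(state, a)
        cost = keypoint_cost(state, target_spec)
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action, best_cost


if __name__ == "__main__":
    # Two keypoints tracked on an object in the scene (2D for simplicity).
    keypoints = np.array([[0.30, 0.20], [0.35, 0.25]])
    # Hypothetical VLM target specification: shift both keypoints to the right.
    target_spec = {0: [0.50, 0.20], 1: [0.55, 0.25]}
    action, cost = plan_action(keypoints, target_spec)
    print("first planned action:", action, "expected cost:", cost)
```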