KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
March 13, 2025
Authors: Zixian Liu, Mingtong Zhang, Yunzhu Li
cs.AI
Abstract
With the rapid advancement of large language models (LLMs) and
vision-language models (VLMs), significant progress has been made in developing
open-vocabulary robotic manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their applicability to
more complex, dynamic tasks. In this work, we introduce KUDA, an
open-vocabulary manipulation system that integrates dynamics learning and
visual prompting through keypoints, leveraging both VLMs and learning-based
neural dynamics models. Our key insight is that a keypoint-based target
specification is both interpretable by VLMs and efficiently translatable into
cost functions for model-based planning. Given language
instructions and visual observations, KUDA first assigns keypoints to the RGB
image and queries the VLM to generate target specifications. These abstract
keypoint-based representations are then converted into cost functions, which
are optimized using a learned dynamics model to produce robotic trajectories.
We evaluate KUDA on a range of manipulation tasks, including free-form language
instructions across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness of our
framework. The project page is available at http://kuda-dynamics.github.io.
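To make the pipeline in the abstract concrete, the following is a minimal Python sketch of the planning step: a keypoint-based target specification (of the kind a VLM might return) is converted into a cost function and minimized by rolling out a dynamics model. The function names (`keypoint_cost`, `learned_dynamics`, `plan_action`), the random-shooting optimizer, and the toy translate-only dynamics are illustrative assumptions for exposition, not KUDA's actual implementation.

```python
# Minimal sketch (not the authors' code): turning a keypoint-based target
# specification into a cost function and planning against a dynamics model
# via random-shooting optimization. The VLM query, the trained dynamics
# network, and the robot interface are all stubbed out here.
import numpy as np


def keypoint_cost(predicted_keypoints, target_spec):
    """Sum of squared distances between predicted keypoints and their targets.

    target_spec maps a keypoint index to a desired position, e.g. the kind of
    output a VLM might produce: {0: [0.50, 0.20], 1: [0.55, 0.25]}.
    """
    return sum(
        np.sum((predicted_keypoints[i] - np.asarray(goal)) ** 2)
        for i, goal in target_spec.items()
    )


def learned_dynamics(keypoints, action):
    """Placeholder for a learned neural dynamics model.

    A real model would predict how the tracked keypoints move under a robot
    action; this toy version simply translates every keypoint by the action.
    """
    return keypoints + action


def plan_action(keypoints, target_spec, horizon=5, n_samples=256, seed=0):
    """Random-shooting planner: sample action sequences, roll them out with
    the dynamics model, and return the first action of the cheapest sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_action = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-0.05, 0.05, size=(horizon, keypoints.shape[1]))
        state = keypoints.copy()
        for a in actions:
            state = learned_dynamics(state, a)
        cost = keypoint_cost(state, target_spec)
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action, best_cost


if __name__ == "__main__":
    # Two keypoints tracked on an object in the scene (2D for simplicity).
    keypoints = np.array([[0.30, 0.20], [0.35, 0.25]])
    # Hypothetical VLM target specification: shift both keypoints to the right.
    target_spec = {0: [0.50, 0.20], 1: [0.55, 0.25]}
    action, cost = plan_action(keypoints, target_spec)
    print("first planned action:", action, "expected cost:", cost)
```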