KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
March 13, 2025
Authors: Zixian Liu, Mingtong Zhang, Yunzhu Li
cs.AI
Abstract
With the rapid advancement of large language models (LLMs) and
vision-language models (VLMs), significant progress has been made in developing
open-vocabulary robotic manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their applicability to
more complex, dynamic tasks. In this work, we introduce KUDA, an
open-vocabulary manipulation system that integrates dynamics learning and
visual prompting through keypoints, leveraging both VLMs and learning-based
neural dynamics models. Our key insight is that a keypoint-based target
specification is both interpretable by VLMs and efficiently translatable into
cost functions for model-based planning. Given language
instructions and visual observations, KUDA first assigns keypoints to the RGB
image and queries the VLM to generate target specifications. These abstract
keypoint-based representations are then converted into cost functions, which
are optimized using a learned dynamics model to produce robotic trajectories.
We evaluate KUDA on a range of manipulation tasks, including free-form language
instructions across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness of our
framework. The project page is available at http://kuda-dynamics.github.io.
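
To make the described pipeline concrete, below is a minimal sketch of how a keypoint-based target specification could be turned into a cost function and optimized against a learned dynamics model with a simple random-shooting planner. The target_spec format (a list of (keypoint_index, target_position) pairs), the keypoint_cost and plan_actions helpers, and the dynamics_model callable are illustrative assumptions for this example, not the paper's actual interface.

# Illustrative sketch (not the paper's code): turn a keypoint-based target
# specification into a cost function and plan actions with a learned
# dynamics model via random shooting. All names below are assumptions.

import numpy as np


def keypoint_cost(keypoints, target_spec):
    # Hypothetical format: target_spec is a list of (keypoint_index, target_xyz)
    # pairs, e.g. parsed from the VLM's answer. The cost is the summed squared
    # distance between each referenced keypoint and its target position.
    cost = 0.0
    for idx, target in target_spec:
        cost += float(np.sum((keypoints[idx] - np.asarray(target)) ** 2))
    return cost


def plan_actions(dynamics_model, init_keypoints, target_spec,
                 horizon=5, num_samples=256, action_dim=4, seed=0):
    # Random-shooting planner: sample action sequences, roll each one out
    # through the dynamics model, and keep the sequence with the lowest
    # keypoint cost. dynamics_model(keypoints, action) -> next_keypoints
    # stands in for the learned neural dynamics model.
    rng = np.random.default_rng(seed)
    best_cost, best_actions = np.inf, None
    for _ in range(num_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        keypoints = init_keypoints
        for action in actions:
            keypoints = dynamics_model(keypoints, action)
        cost = keypoint_cost(keypoints, target_spec)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


if __name__ == "__main__":
    # Toy dynamics: each action's first three components translate all keypoints.
    def toy_dynamics(keypoints, action):
        return keypoints + 0.1 * action[:3]

    init_keypoints = np.zeros((4, 3))       # four keypoints at the origin
    target_spec = [(0, [0.3, 0.0, 0.1])]    # "move keypoint 0 to (0.3, 0, 0.1)"
    actions, cost = plan_actions(toy_dynamics, init_keypoints, target_spec)
    print("best cost:", cost)

In KUDA's terms, the VLM's answer would supply the target specification and the learned neural dynamics model would take the place of toy_dynamics; in practice, a stronger optimizer (e.g., sampling-based MPC or gradient-based planning) could replace the random-shooting loop.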