KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Abstract

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is both interpretable by VLMs and efficiently translatable into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
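
To make the step from keypoint targets to a planning cost concrete, the following is a minimal sketch and not KUDA's actual implementation. It assumes a target specification can be written as a mapping from keypoint index to a desired 2D position, uses a sum of squared distances as the cost, and substitutes a toy translation model for the learned neural dynamics model. The planner is plain random shooting, chosen only for brevity. All names (keypoint_cost, toy_dynamics, plan_random_shooting) are hypothetical.

```python
import random


def keypoint_cost(keypoints, targets):
    """Sum of squared distances between selected keypoints and their targets.

    keypoints: list of (x, y) positions.
    targets: dict mapping keypoint index -> desired (x, y) position
             (an assumed encoding of a VLM-produced target specification).
    """
    return sum(
        (keypoints[i][0] - tx) ** 2 + (keypoints[i][1] - ty) ** 2
        for i, (tx, ty) in targets.items()
    )


def toy_dynamics(keypoints, action):
    """Placeholder for a learned dynamics model (assumption):
    each action rigidly translates every keypoint by (dx, dy)."""
    dx, dy = action
    return [(x + dx, y + dy) for x, y in keypoints]


def plan_random_shooting(keypoints, targets, dynamics,
                         horizon=5, samples=200, seed=0):
    """Sample action sequences, roll each out through the dynamics model,
    and return the sequence with the lowest final keypoint cost."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), None
    for _ in range(samples):
        actions = [(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
                   for _ in range(horizon)]
        state = keypoints
        for action in actions:
            state = dynamics(state, action)
        cost = keypoint_cost(state, targets)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


# Two keypoints on an object; the (hypothetical) target specification asks
# for both to move 0.2 to the right and 0.1 up.
keypoints = [(0.0, 0.0), (0.1, 0.0)]
targets = {0: (0.2, 0.1), 1: (0.3, 0.1)}

initial_cost = keypoint_cost(keypoints, targets)
actions, final_cost = plan_random_shooting(keypoints, targets, toy_dynamics)
```

In a real system the dynamics function would be a trained neural model and the planner would likely be a more sample-efficient optimizer. The point of the sketch is only that a keypoint-to-target mapping yields a cost that any model-based planner can minimize.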
