SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning

Abstract

Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant challenge for robotics. We introduce SayPlan, a scalable approach to LLM-based, large-scale task planning for robotics using 3D scene graph (3DSG) representations. To ensure the scalability of our approach, we: (1) exploit the hierarchical nature of 3DSGs to allow LLMs to conduct a semantic search for task-relevant subgraphs from a smaller, collapsed representation of the full graph; (2) reduce the planning horizon for the LLM by integrating a classical path planner; and (3) introduce an iterative replanning pipeline that refines the initial plan using feedback from a scene graph simulator, correcting infeasible actions and avoiding planning failures. We evaluate our approach on two large-scale environments spanning up to 3 floors, 36 rooms and 140 objects, and show that it is capable of grounding large-scale, long-horizon task plans from abstract, natural language instructions for a mobile manipulator robot to execute.
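The first two ideas in the abstract (searching a collapsed graph, and delegating navigation to a classical path planner) can be illustrated with a toy sketch. This is not the authors' implementation: the hierarchy, the room names, and the functions `collapsed_view` and `shortest_path` are hypothetical and chosen only for illustration, and breadth-first search stands in for whichever path planner is actually used.

```python
from collections import deque

# Hypothetical toy hierarchy: floor -> rooms -> objects (not from the paper).
HIERARCHY = {
    "floor_1": {"kitchen": ["fridge", "mug"], "office": ["desk", "stapler"]},
    "floor_2": {"lab": ["robot_arm"], "lounge": ["sofa"]},
}

# Hypothetical room connectivity consumed by the path planner.
EDGES = {
    "kitchen": ["office"],
    "office": ["kitchen", "lab"],
    "lab": ["office", "lounge"],
    "lounge": ["lab"],
}


def collapsed_view(expanded):
    """Return the graph as a planner would see it.

    Only nodes named in `expanded` reveal their children; everything
    else is shown as a single "collapsed" placeholder.
    """
    view = {}
    for floor, rooms in HIERARCHY.items():
        if floor not in expanded:
            view[floor] = "collapsed"
            continue
        view[floor] = {
            room: (objs if room in expanded else "collapsed")
            for room, objs in rooms.items()
        }
    return view


def shortest_path(start, goal):
    """Breadth-first search over rooms (assumed stand-in for the path planner)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in EDGES[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal not reachable from start
```

In this sketch, a search would start from `collapsed_view(set())`, which exposes only the floors, and then expand one node at a time, for example `collapsed_view({"floor_1", "kitchen"})`, until the task-relevant objects are visible. Room-to-room navigation is then left to `shortest_path`, so the language model only has to plan the higher-level steps.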
