ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

Abstract

For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well to larger environments, and they do not capture the semantic spatial relationships between entities in the environment that are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output into 3D through multi-view association. The resulting representations generalize to novel semantic classes without the need to collect large 3D datasets or fine-tune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc)
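The abstract does not spell out how per-view outputs are associated into objects, so the following is only a toy sketch of the general idea: detections from successive views are merged into object nodes, and relations between objects are stored as labeled edges. Everything in it is an assumption made for illustration, not the authors' implementation: the names `ObjectNode`, `SceneGraph`, and `associate`, the thresholds `max_dist` and `min_sim`, and the matching rule (centroid distance gate plus cosine similarity of feature vectors, with running-average fusion).

```python
from dataclasses import dataclass, field
import math


@dataclass
class ObjectNode:
    """One mapped object (hypothetical structure): 3D centroid, semantic feature, view count."""
    centroid: tuple
    feature: list
    n_views: int = 1


@dataclass
class SceneGraph:
    """Nodes are objects; edges map an (i, j) index pair to a relation label."""
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(graph, centroid, feature, max_dist=0.5, min_sim=0.8):
    """Merge a detection into the best-matching node, or add a new node.

    A node is a candidate only if its centroid lies within max_dist (assumed
    threshold) and its feature similarity is at least min_sim (assumed
    threshold). Returns the index of the node the detection ended up in.
    """
    best, best_sim = None, min_sim
    for i, node in enumerate(graph.nodes):
        if math.dist(node.centroid, centroid) > max_dist:
            continue
        sim = cosine(node.feature, feature)
        if sim >= best_sim:
            best, best_sim = i, sim
    if best is None:
        graph.nodes.append(ObjectNode(tuple(centroid), list(feature)))
        return len(graph.nodes) - 1
    # Fuse the new view into the matched node with a running average.
    node = graph.nodes[best]
    n = node.n_views
    node.centroid = tuple((c * n + d) / (n + 1) for c, d in zip(node.centroid, centroid))
    node.feature = [(f * n + g) / (n + 1) for f, g in zip(node.feature, feature)]
    node.n_views = n + 1
    return best


# Usage: two views of one object, then a distant, dissimilar object.
graph = SceneGraph()
a = associate(graph, (0.0, 0.0, 0.0), [1.0, 0.0])
b = associate(graph, (0.1, 0.0, 0.0), [0.9, 0.1])
c = associate(graph, (3.0, 0.0, 0.0), [0.0, 1.0])
graph.edges[(a, c)] = "near"  # hypothetical relation label
```

Storing one fused feature per object rather than one per point is what keeps such a graph compact, and the labeled edges are where spatial relations between entities would be recorded for a planner to query.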
