Compositional Foundation Models for Hierarchical Planning

Abstract

To make effective decisions in novel environments with long-horizon goals, it is crucial to reason hierarchically across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that composes multiple expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks. We use a large language model to construct symbolic plans, which are grounded in the environment through a large video diffusion model. The generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach on three different long-horizon table-top manipulation tasks.
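
The three-level structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (hierarchical_plan, toy_subgoals, toy_video, toy_inverse_dynamics), the one-dimensional "frames", and the comma-separated goal format are all hypothetical stand-ins for the language model, the video diffusion model, and the inverse dynamics model. The iterative refinement that enforces consistency between levels is omitted here.

```python
from typing import Callable, List, Sequence

# Hypothetical placeholder types; real systems would use images and motor commands.
Frame = List[float]
Action = List[float]


def hierarchical_plan(
    goal: str,
    observation: Frame,
    propose_subgoals: Callable[[str, Frame], Sequence[str]],
    generate_video: Callable[[str, Frame], Sequence[Frame]],
    infer_action: Callable[[Frame, Frame], Action],
) -> List[Action]:
    """Chain task planning, visual planning, and action inference."""
    actions: List[Action] = []
    current = observation
    # Level 1: a planner (an LLM in the paper) breaks the goal into subgoals.
    for subgoal in propose_subgoals(goal, current):
        # Level 2: a video model imagines frames that achieve the subgoal.
        frames = generate_video(subgoal, current)
        previous = current
        # Level 3: an inverse dynamics model recovers the action between frames.
        for frame in frames:
            actions.append(infer_action(previous, frame))
            previous = frame
        # The next subgoal is planned from the last imagined frame.
        current = previous
    return actions


# Toy stand-ins (assumptions for illustration only).
def toy_subgoals(goal: str, obs: Frame) -> List[str]:
    # Assumes the goal is a comma-separated list such as "move to 2, move to 6".
    return [part.strip() for part in goal.split(",")]


def toy_video(subgoal: str, obs: Frame, steps: int = 2) -> List[Frame]:
    # Linearly interpolates a 1-D state toward the number ending the subgoal.
    target = float(subgoal.split()[-1])
    return [[obs[0] + (target - obs[0]) * (i + 1) / steps] for i in range(steps)]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    # The "action" is simply the displacement between consecutive frames.
    return [nxt[0] - prev[0]]


plan = hierarchical_plan(
    "move to 2, move to 6", [0.0], toy_subgoals, toy_video, toy_inverse_dynamics
)
print(plan)
```

Passing the three models in as callables keeps each level independently replaceable, which mirrors the compositional idea of training the experts separately.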