Compositional Foundation Models for Hierarchical Planning
September 15, 2023
Authors: Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, Pulkit Agrawal
cs.AI
Abstract
To make effective decisions in novel environments with long-horizon goals, it
is crucial to engage in hierarchical reasoning across spatial and temporal
scales. This entails planning abstract subgoal sequences, visually reasoning
about the underlying plans, and executing actions in accordance with the
devised plan through visual-motor control. We propose Compositional Foundation
Models for Hierarchical Planning (HiP), a foundation model that leverages
multiple expert foundation models, trained individually on language, vision,
and action data, jointly to solve long-horizon tasks. We use a large
language model to construct symbolic plans that are grounded in the environment
through a large video diffusion model. The generated video plans are then
grounded in visual-motor control through an inverse dynamics model that infers
actions from the generated videos. To enable effective reasoning within this
hierarchy, we
enforce consistency between the models via iterative refinement. We illustrate
the efficacy and adaptability of our approach in three different long-horizon
table-top manipulation tasks.
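
The abstract describes a three-level pipeline: a language model proposes symbolic subgoals, a video diffusion model grounds each subgoal as a visual plan, and an inverse dynamics model recovers actions from consecutive generated frames, with consistency between levels enforced by iterative refinement. The sketch below illustrates that composition under stated assumptions; TaskPlannerLLM, VideoDiffusionModel, InverseDynamicsModel, and all of their methods are hypothetical stand-ins, not the paper's actual API, and the refinement step is simplified to best-of-N selection scored by the video model.

```python
# Minimal sketch of the HiP hierarchy described in the abstract.
# All component classes and method names are hypothetical stand-ins.

from dataclasses import dataclass


@dataclass
class HiP:
    llm: "TaskPlannerLLM"               # proposes symbolic subgoal sequences
    video_model: "VideoDiffusionModel"  # renders each subgoal as a video plan
    inv_dyn: "InverseDynamicsModel"     # recovers actions from frame pairs

    def plan(self, goal: str, observation, n_candidates: int = 5):
        # Consistency between the language and visual levels, sketched here
        # as best-of-N: sample several subgoal sequences from the LLM and
        # keep the one the video model scores as most plausible.
        candidates = [self.llm.propose_subgoals(goal) for _ in range(n_candidates)]
        subgoals = max(
            candidates,
            key=lambda sg: self.video_model.likelihood(sg, observation),
        )

        actions = []
        for subgoal in subgoals:
            # Ground the symbolic subgoal as a generated video plan.
            frames = self.video_model.generate(subgoal, observation)
            # Ground the video plan in control: infer an action between
            # each pair of consecutive generated frames.
            for frame, next_frame in zip(frames, frames[1:]):
                actions.append(self.inv_dyn.infer_action(frame, next_frame))
            # Chain subgoals by treating the final frame as the new state.
            observation = frames[-1]
        return actions
```

The paper's actual iterative-refinement procedure may differ from the one-shot candidate selection shown here; the sketch only conveys how the three separately trained models compose into a single hierarchical planner.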