Compositional Foundation Models for Hierarchical Planning

Abstract

To make effective decisions in novel environments with long-horizon goals, it is crucial to reason hierarchically across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that jointly leverages multiple expert foundation models, each trained individually on language, vision, and action data, to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. The generated video plans are then grounded to visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach on three different long-horizon table-top manipulation tasks.
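The three-level hierarchy described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class `HierarchicalPlanner`, its four callables, and the 1-D toy models are hypothetical stand-ins, not the HiP implementation. In particular, a simple best-scoring-candidate selection stands in for the iterative refinement that enforces consistency between levels; the abstract does not specify that mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[float]   # toy stand-in for an image observation
Action = List[float]  # toy stand-in for a motor command


@dataclass
class HierarchicalPlanner:
    """Hypothetical composition of three expert models (illustrative only)."""
    propose_subgoals: Callable[[str, Frame], Sequence[str]]  # language level
    score_subgoal: Callable[[str, Frame], float]             # feedback from lower levels
    generate_video: Callable[[str, Frame], List[Frame]]      # video level
    infer_action: Callable[[Frame, Frame], Action]           # inverse dynamics level

    def plan_step(self, goal: str, obs: Frame) -> Tuple[str, List[Frame], List[Action]]:
        # 1. Language level: propose candidate subgoals and keep the one that
        #    scores best under the feedback function (a simplification of
        #    iterative refinement, assumed here for illustration).
        candidates = self.propose_subgoals(goal, obs)
        subgoal = max(candidates, key=lambda s: self.score_subgoal(s, obs))
        # 2. Video level: turn the chosen subgoal into a sequence of future frames.
        frames = self.generate_video(subgoal, obs)
        # 3. Action level: infer one action per pair of consecutive frames.
        trajectory = [obs] + frames
        actions = [self.infer_action(a, b) for a, b in zip(trajectory, trajectory[1:])]
        return subgoal, frames, actions


# --- Toy 1-D stand-ins so the sketch runs end to end ---

def _target(subgoal: str) -> float:
    # Subgoals in this toy world look like "reach 2.0".
    return float(subgoal.split()[-1])

def toy_propose(goal: str, obs: Frame) -> List[str]:
    # Offers a nearby intermediate target and the final goal itself.
    return ["reach 2.0", goal]

def toy_score(subgoal: str, obs: Frame) -> float:
    # Prefers the subgoal closest to the current position.
    return -abs(_target(subgoal) - obs[0])

def toy_video(subgoal: str, obs: Frame, steps: int = 4) -> List[Frame]:
    # Linear interpolation from the current position to the target.
    x0, x1 = obs[0], _target(subgoal)
    return [[x0 + (x1 - x0) * k / steps] for k in range(1, steps + 1)]

def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    # In this toy world the action is the displacement between frames.
    return [nxt[0] - prev[0]]


planner = HierarchicalPlanner(toy_propose, toy_score, toy_video, toy_inverse_dynamics)
subgoal, frames, actions = planner.plan_step("reach 8.0", [0.0])
```

Keeping each level behind a plain callable mirrors the compositional idea: any of the three stand-ins could be swapped for a different model without changing `plan_step`.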