Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Abstract

Large language models (LLMs) exhibit a wide range of capabilities, including writing robot code from language commands. This lets non-experts direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities are driven by in-context learning and are therefore limited to short-term interactions: a user's feedback stays relevant only as long as it fits within the LLM's context size, and it can be forgotten over longer interactions. In this work, we investigate fine-tuning robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs, measured by the average number of corrections before the user considers the task successful. Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations and robot code outputs are actions), training an LLM to complete previous interactions can be viewed as training a transition dynamics model. Such a model can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments, improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
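The MPC idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical `sample_rollout` callable that stands in for the fine-tuned LLM acting as a dynamics model: given the interaction so far, it predicts one possible continuation and whether that continuation ends in success. The names `Rollout`, `toy_sample_rollout`, and `mpc_next_action` are illustrative only, and the toy sampler is random rather than a real model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    """One predicted continuation of a human-robot interaction (hypothetical type)."""
    actions: List[str]  # predicted robot-code responses, one per remaining turn
    success: bool       # whether the predicted interaction ends in success


def toy_sample_rollout(history: List[str], rng: random.Random,
                       max_turns: int = 5) -> Rollout:
    """Random stand-in for an LLM that completes the interaction (assumption)."""
    n_turns = rng.randint(1, max_turns)
    actions = [f"code_after_{len(history)}_msgs_turn_{t}" for t in range(n_turns)]
    return Rollout(actions=actions, success=rng.random() < 0.7)


def mpc_next_action(history: List[str],
                    sample_rollout: Callable[[List[str], random.Random], Rollout],
                    num_samples: int = 16,
                    seed: int = 0) -> Optional[str]:
    """Sample rollouts, keep the shortest successful one, return its first action.

    Only the first action is executed; after the next human message arrives,
    planning would be repeated from the updated history (receding horizon).
    Returns None if no sampled rollout is predicted to succeed.
    """
    rng = random.Random(seed)
    rollouts = [sample_rollout(history, rng) for _ in range(num_samples)]
    successful = [r for r in rollouts if r.success]
    if not successful:
        return None
    best = min(successful, key=lambda r: len(r.actions))
    return best.actions[0]


if __name__ == "__main__":
    print(mpc_next_action(["stack the blocks"], toy_sample_rollout))
```

Choosing the shortest predicted successful rollout is one simple way to express the goal of fewer corrections; a real system would replace the toy sampler with model calls and would need its own rule for the case where no rollout succeeds.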

