Learning to Learn Faster from Human Feedback with Language Model Predictive Control
February 18, 2024
Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, Ken Oslund, Dushyant Rao, Allen Ren, Baruch Tabanpour, Quan Vuong, Ayzaan Wahid, Ted Xiao, Ying Xu, Vincent Zhuang, Peng Xu, Erik Frey, Ken Caluwaerts, Tingnan Zhang, Brian Ichter, Jonathan Tompson, Leila Takayama, Vincent Vanhoucke, Izhak Shafran, Maja Mataric, Dorsa Sadigh, Nicolas Heess, Kanishka Rao, Nik Stewart, Jie Tan, Carolina Parada
cs.AI
Abstract
Large language models (LLMs) have been shown to exhibit a wide range of
capabilities, such as writing robot code from language commands -- enabling
non-experts to direct robot behaviors, modify them based on feedback, or
compose them to perform new tasks. However, these capabilities (driven by
in-context learning) are limited to short-term interactions, where users'
feedback remains relevant for only as long as it fits within the context size
of the LLM, and can be forgotten over longer interactions. In this work, we
investigate fine-tuning the robot code-writing LLMs, to remember their
in-context interactions and improve their teachability, i.e., how efficiently
they adapt to human inputs (measured by average number of corrections before
the user considers the task successful). Our key observation is that when
human-robot interactions are formulated as a partially observable Markov
decision process (in which human language inputs are observations, and robot
code outputs are actions), then training an LLM to complete previous
interactions can be viewed as training a transition dynamics model -- that can
be combined with classic robotics techniques such as model predictive control
(MPC) to discover shorter paths to success. This gives rise to Language Model
Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its
teachability on 78 tasks across 5 robot embodiments -- improving non-expert
teaching success rates of unseen tasks by 26.9% while reducing the average
number of human corrections from 2.4 to 1.9. Experiments show that LMPC also
produces strong meta-learners, improving the success rate of in-context
learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos,
code, and demos at: https://robot-teaching.github.io/.
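
The abstract's core idea, treating the fine-tuned LLM as a transition dynamics model over chat turns and decoding with MPC-style lookahead, can be illustrated with a short sketch. This is a minimal illustration of the rollout-and-re-plan loop under stated assumptions, not the paper's implementation: the `Rollout` structure and the `llm_sample` callable are hypothetical stand-ins for the fine-tuned PaLM 2 interface.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One imagined continuation of the human-robot chat (a POMDP trajectory):
    user messages are observations, robot code snippets are actions."""
    robot_actions: list[str]   # predicted robot code outputs (actions)
    user_turns: list[str]      # predicted human feedback (observations)
    predicts_success: bool     # model predicts the user declares success


def lmpc_choose_action(llm_sample, chat_history: str, n_rollouts: int = 8) -> str:
    """Pick the next robot-code action via MPC-style lookahead.

    `llm_sample(chat_history) -> Rollout` is a hypothetical interface to a
    model fine-tuned to *complete* past interactions, i.e., the learned
    transition dynamics model described in the abstract.
    """
    rollouts = [llm_sample(chat_history) for _ in range(n_rollouts)]

    def cost(r: Rollout) -> float:
        # MPC objective: prefer imagined futures that reach user-declared
        # success in the fewest human corrections; failures cost infinity.
        return len(r.user_turns) if r.predicts_success else float("inf")

    best = min(rollouts, key=cost)
    # Receding horizon: commit only to the first action of the best imagined
    # rollout, then re-plan once the user's real feedback arrives.
    return best.robot_actions[0]
```

The receding-horizon step mirrors classic MPC: plan over sampled imagined futures, execute only the first action, and re-plan after each real human correction, which is how lookahead can "discover shorter paths to success" in the sense the abstract describes.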