Learning to Learn Faster from Human Feedback with Language Model Predictive Control
February 18, 2024
Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, Ken Oslund, Dushyant Rao, Allen Ren, Baruch Tabanpour, Quan Vuong, Ayzaan Wahid, Ted Xiao, Ying Xu, Vincent Zhuang, Peng Xu, Erik Frey, Ken Caluwaerts, Tingnan Zhang, Brian Ichter, Jonathan Tompson, Leila Takayama, Vincent Vanhoucke, Izhak Shafran, Maja Mataric, Dorsa Sadigh, Nicolas Heess, Kanishka Rao, Nik Stewart, Jie Tan, Carolina Parada
cs.AI
Abstract
Large language models (LLMs) have been shown to exhibit a wide range of
capabilities, such as writing robot code from language commands -- enabling
non-experts to direct robot behaviors, modify them based on feedback, or
compose them to perform new tasks. However, these capabilities (driven by
in-context learning) are limited to short-term interactions, where users'
feedback remains relevant for only as long as it fits within the context size
of the LLM, and can be forgotten over longer interactions. In this work, we
investigate fine-tuning the robot code-writing LLMs, to remember their
in-context interactions and improve their teachability, i.e., how efficiently
they adapt to human inputs (measured by average number of corrections before
the user considers the task successful). Our key observation is that when
human-robot interactions are formulated as a partially observable Markov
decision process (in which human language inputs are observations, and robot
code outputs are actions), then training an LLM to complete previous
interactions can be viewed as training a transition dynamics model -- that can
be combined with classic robotics techniques such as model predictive control
(MPC) to discover shorter paths to success. This gives rise to Language Model
Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its
teachability on 78 tasks across 5 robot embodiments -- improving non-expert
teaching success rates of unseen tasks by 26.9% while reducing the average
number of human corrections from 2.4 to 1.9. Experiments show that LMPC also
produces strong meta-learners, improving the success rate of in-context
learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos,
code, and demos at: https://robot-teaching.github.io/.
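
The abstract's core idea, treating the fine-tuned LLM as a transition dynamics model over chat turns and decoding with MPC-style lookahead, can be illustrated with a short sketch. This is a minimal illustration of the rollout-and-re-plan loop under stated assumptions, not the paper's implementation: the `Rollout` structure and the `llm_sample` callable are hypothetical stand-ins for the fine-tuned PaLM 2 interface.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One imagined continuation of the human-robot chat (a POMDP trajectory):
    user messages are observations, robot code snippets are actions."""
    robot_actions: list[str]   # predicted robot code outputs (actions)
    user_turns: list[str]      # predicted human feedback (observations)
    predicts_success: bool     # model predicts the user declares success


def lmpc_choose_action(llm_sample, chat_history: str, n_rollouts: int = 8) -> str:
    """Pick the next robot-code action via MPC-style lookahead.

    `llm_sample(chat_history) -> Rollout` is a hypothetical interface to a
    model fine-tuned to *complete* past interactions, i.e., the learned
    transition dynamics model described in the abstract.
    """
    rollouts = [llm_sample(chat_history) for _ in range(n_rollouts)]

    def cost(r: Rollout) -> float:
        # MPC objective: prefer imagined futures that reach user-declared
        # success in the fewest human corrections; failures cost infinity.
        return len(r.user_turns) if r.predicts_success else float("inf")

    best = min(rollouts, key=cost)
    # Receding horizon: commit only to the first action of the best imagined
    # rollout, then re-plan once the user's real feedback arrives.
    return best.robot_actions[0]
```

The receding-horizon step mirrors classic MPC: plan over sampled imagined futures, execute only the first action, and re-plan after each real human correction, which is how lookahead can "discover shorter paths to success" in the sense the abstract describes.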