
Learning to Learn Faster from Human Feedback with Language Model Predictive Control

February 18, 2024
Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, Ken Oslund, Dushyant Rao, Allen Ren, Baruch Tabanpour, Quan Vuong, Ayzaan Wahid, Ted Xiao, Ying Xu, Vincent Zhuang, Peng Xu, Erik Frey, Ken Caluwaerts, Tingnan Zhang, Brian Ichter, Jonathan Tompson, Leila Takayama, Vincent Vanhoucke, Izhak Shafran, Maja Mataric, Dorsa Sadigh, Nicolas Heess, Kanishka Rao, Nik Stewart, Jie Tan, Carolina Parada
cs.AI

Abstract

Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions can be viewed as training a transition dynamics model -- which can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
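The MPC-over-an-LLM idea in the abstract can be illustrated with a minimal toy sketch. Here a hand-written `predict_feedback` function stands in for the fine-tuned dynamics model (the paper uses PaLM 2 trained to complete interaction histories; this stub, its integer "distance-to-intent" state, and the two-action space are all hypothetical). The planner enumerates candidate action sequences up to a horizon and returns the first action of the shortest rollout that the simulated user marks successful -- the "shorter paths to success" the abstract describes.

```python
import itertools

# Hypothetical stand-in for the fine-tuned LLM transition dynamics model.
# State: integer distance between the robot's current code and the user's
# intent. Action: a candidate code edit. Returns the predicted next state
# and the predicted human feedback (the POMDP observation).
def predict_feedback(distance, action):
    step = {"small_fix": 1, "big_fix": 2}[action]
    new_distance = max(0, distance - step)
    feedback = "success" if new_distance == 0 else "correction"
    return new_distance, feedback

def lmpc_plan(distance, horizon=3):
    """MPC-style search: roll out every action sequence up to `horizon`
    through the dynamics model, and return (first action, rollout length)
    of the shortest rollout predicted to end in user-declared success."""
    for h in range(1, horizon + 1):          # shortest horizons first
        for seq in itertools.product(["small_fix", "big_fix"], repeat=h):
            d = distance
            for i, action in enumerate(seq):
                d, feedback = predict_feedback(d, action)
                if feedback == "success":
                    return seq[0], i + 1     # commit to the first action
    return None                              # no successful rollout found

print(lmpc_plan(1))  # one small fix suffices
print(lmpc_plan(3))  # needs two corrections under this toy model
```

As in standard MPC, only the first action of the winning rollout would be executed; the real human's feedback then replaces the model's prediction and planning repeats, which is what drives the reduction in average corrections.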