Policy Improvement using Language Feedback Models
February 12, 2024
Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté
cs.AI
Abstract
We introduce Language Feedback Models (LFMs) that identify desirable
behaviour - actions that help achieve tasks specified in the instruction - for
imitation learning in instruction following. To train LFMs, we obtain feedback
from Large Language Models (LLMs) on visual trajectories verbalized to language
descriptions. First, by using LFMs to identify desirable behaviour to imitate,
we improve task-completion rate over strong behavioural cloning baselines on
three distinct language grounding environments (Touchdown, ScienceWorld, and
ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict
actions, when controlling for the number of LLM output tokens. Third, LFMs
generalize to unseen environments, improving task-completion rate by 3.5-12.0%
through one round of adaptation. Finally, LFMs can be modified to provide
human-interpretable feedback without performance loss, allowing human
verification of desirable behaviour for imitation learning.
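To make the described pipeline concrete, the sketch below illustrates one plausible reading of it: step-level LLM feedback on verbalized trajectories is distilled into a small feedback model, which then filters the steps used for behavioural cloning. This is a minimal illustration only; the names (Step, query_llm_feedback, TinyLFM, filtered_behaviour_cloning, llm_judge) are assumptions for the example and not the authors' implementation.

```python
# Minimal sketch of an LFM-style pipeline, as summarized in the abstract:
# (1) verbalize trajectories, (2) query an LLM for step-level feedback,
# (3) distill that feedback into a small feedback model (the "LFM"),
# (4) imitate only the actions the LFM marks as desirable.
# All names below are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    instruction: str   # the task instruction being followed
    observation: str   # visual observation verbalized to a text description
    action: str        # action the base policy took at this step


def query_llm_feedback(steps: List[Step],
                       llm_judge: Callable[[str], bool]) -> List[bool]:
    """Ask an LLM whether each action makes progress on the instruction.
    `llm_judge` stands in for a prompted LLM call returning True/False."""
    labels = []
    for s in steps:
        prompt = (f"Instruction: {s.instruction}\n"
                  f"Observation: {s.observation}\n"
                  f"Action: {s.action}\n"
                  "Does this action help complete the instruction? yes/no")
        labels.append(llm_judge(prompt))
    return labels


class TinyLFM:
    """Toy feedback model: memorizes LLM judgements per (observation, action)
    pair. In practice this would be a learned classifier over text."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[str, str], bool] = {}

    def fit(self, steps: List[Step], labels: List[bool]) -> "TinyLFM":
        for s, y in zip(steps, labels):
            self.table[(s.observation, s.action)] = y
        return self

    def is_desirable(self, step: Step) -> bool:
        return self.table.get((step.observation, step.action), False)


def filtered_behaviour_cloning(steps: List[Step], lfm: TinyLFM) -> List[Step]:
    """Keep only steps the LFM marks desirable; these (observation, action)
    pairs form the dataset for the imitation-learning update."""
    return [s for s in steps if lfm.is_desirable(s)]


if __name__ == "__main__":
    demo = [
        Step("put the mug in the sink", "you see a mug on the table", "pick up mug"),
        Step("put the mug in the sink", "you are holding the mug", "look at window"),
    ]
    # Stand-in judge: pretend the LLM approves actions that mention the mug.
    judge = lambda prompt: "mug" in prompt.split("Action:")[1].lower()
    lfm = TinyLFM().fit(demo, query_llm_feedback(demo, judge))
    print(filtered_behaviour_cloning(demo, lfm))
```

The point mirrored here is the cost structure the abstract implies: the LLM is queried once to label batches of verbalized trajectories, after which the much cheaper feedback model supplies desirability labels for imitation, and its yes/no outputs remain inspectable by a human.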