Policy Improvement using Language Feedback Models
February 12, 2024
Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté
cs.AI
Abstract
We introduce Language Feedback Models (LFMs) that identify desirable
behaviour - actions that help achieve tasks specified in the instruction - for
imitation learning in instruction following. To train LFMs, we obtain feedback
from Large Language Models (LLMs) on visual trajectories verbalized to language
descriptions. First, by using LFMs to identify desirable behaviour to imitate,
we improve in task-completion rate over strong behavioural cloning baselines on
three distinct language grounding environments (Touchdown, ScienceWorld, and
ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict
actions, when controlling for the number of LLM output tokens. Third, LFMs
generalize to unseen environments, improving task-completion rate by 3.5-12.0%
through one round of adaptation. Finally, LFMs can be modified to provide
human-interpretable feedback without performance loss, allowing human
verification of desirable behaviour for imitation learning.
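
The abstract outlines a concrete pipeline: verbalize visual trajectories into text, ask an LLM which steps were desirable, distill those judgements into a small Language Feedback Model, and then imitate only the LFM-approved steps. The sketch below is a minimal illustration of that loop under assumed interfaces; every name in it (verbalize, llm_feedback, FeedbackModel, build_imitation_data, query_llm) is a hypothetical placeholder, not code released with the paper.

```python
# Minimal sketch of the LFM pipeline described in the abstract.
# All names here are hypothetical placeholders, not the authors' code
# or any specific library API.

from typing import Callable, List, Tuple

Step = Tuple[object, str]       # (observation, action)
Trajectory = List[Step]

def verbalize(observation: object, action: str) -> str:
    """Turn one (visual) observation/action pair into a language description."""
    # Placeholder: a real system would render the observation to text.
    return f"obs: {observation} | action: {action}"

def llm_feedback(instruction: str, steps: List[str],
                 query_llm: Callable[[str], str]) -> List[bool]:
    """Ask an LLM which verbalized steps helped achieve the instruction."""
    prompt = (f"Instruction: {instruction}\n"
              + "\n".join(f"{i}: {s}" for i, s in enumerate(steps))
              + "\nList the indices of steps that make progress on the instruction.")
    reply = query_llm(prompt)
    desirable = {int(tok) for tok in reply.split() if tok.isdigit()}
    return [i in desirable for i in range(len(steps))]

class FeedbackModel:
    """Small learned model (the LFM) that imitates the LLM's per-step labels."""
    def fit(self, texts: List[str], labels: List[bool]) -> None:
        ...  # train a classifier on (verbalized step, desirable?) pairs
    def predict(self, text: str) -> bool:
        ...  # return True if the step is judged desirable

def build_imitation_data(lfm: FeedbackModel, instruction: str,
                         trajectories: List[Trajectory]) -> List[Step]:
    """Keep only the steps the LFM marks desirable; behaviour-clone on these."""
    kept: List[Step] = []
    for traj in trajectories:
        for obs, act in traj:
            if lfm.predict(f"{instruction} || {verbalize(obs, act)}"):
                kept.append((obs, act))
    return kept
```

In this reading, the LFM replaces repeated LLM queries at imitation time: the LLM is consulted once to label steps, the cheap feedback model generalizes those labels, and the policy is then cloned only on steps the LFM judges desirable.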