Policy Improvement using Language Feedback Models
February 12, 2024
Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté
cs.AI
Abstract
We introduce Language Feedback Models (LFMs) that identify desirable
behaviour - actions that help achieve tasks specified in the instruction - for
imitation learning in instruction following. To train LFMs, we obtain feedback
from Large Language Models (LLMs) on visual trajectories verbalized to language
descriptions. First, by using LFMs to identify desirable behaviour to imitate,
we improve in task-completion rate over strong behavioural cloning baselines on
three distinct language grounding environments (Touchdown, ScienceWorld, and
ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict
actions, when controlling for the number of LLM output tokens. Third, LFMs
generalize to unseen environments, improving task-completion rate by 3.5-12.0%
through one round of adaptation. Finally, LFMs can be modified to provide
human-interpretable feedback without performance loss, allowing human
verification of desirable behaviour for imitation learning.
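
The abstract outlines a concrete pipeline: verbalize visual trajectories into text, ask an LLM which steps were desirable, distill those judgements into a small Language Feedback Model, and then imitate only the LFM-approved steps. The sketch below is a minimal illustration of that loop under assumed interfaces; every name in it (verbalize, llm_feedback, FeedbackModel, build_imitation_data, query_llm) is a hypothetical placeholder, not code released with the paper.

```python
# Minimal sketch of the LFM pipeline described in the abstract.
# All names here are hypothetical placeholders, not the authors' code
# or any specific library API.

from typing import Callable, List, Tuple

Step = Tuple[object, str]       # (observation, action)
Trajectory = List[Step]

def verbalize(observation: object, action: str) -> str:
    """Turn one (visual) observation/action pair into a language description."""
    # Placeholder: a real system would render the observation to text.
    return f"obs: {observation} | action: {action}"

def llm_feedback(instruction: str, steps: List[str],
                 query_llm: Callable[[str], str]) -> List[bool]:
    """Ask an LLM which verbalized steps helped achieve the instruction."""
    prompt = (f"Instruction: {instruction}\n"
              + "\n".join(f"{i}: {s}" for i, s in enumerate(steps))
              + "\nList the indices of steps that make progress on the instruction.")
    reply = query_llm(prompt)
    desirable = {int(tok) for tok in reply.split() if tok.isdigit()}
    return [i in desirable for i in range(len(steps))]

class FeedbackModel:
    """Small learned model (the LFM) that imitates the LLM's per-step labels."""
    def fit(self, texts: List[str], labels: List[bool]) -> None:
        ...  # train a classifier on (verbalized step, desirable?) pairs
    def predict(self, text: str) -> bool:
        ...  # return True if the step is judged desirable

def build_imitation_data(lfm: FeedbackModel, instruction: str,
                         trajectories: List[Trajectory]) -> List[Step]:
    """Keep only the steps the LFM marks desirable; behaviour-clone on these."""
    kept: List[Step] = []
    for traj in trajectories:
        for obs, act in traj:
            if lfm.predict(f"{instruction} || {verbalize(obs, act)}"):
                kept.append((obs, act))
    return kept
```

In this reading, the LFM replaces repeated LLM queries at imitation time: the LLM is consulted once to label steps, the cheap feedback model generalizes those labels, and the policy is then cloned only on steps the LFM judges desirable.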