Policy Improvement using Language Feedback Models
February 12, 2024
Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté
cs.AI
Abstract
We introduce Language Feedback Models (LFMs) that identify desirable
behaviour - actions that help achieve tasks specified in the instruction - for
imitation learning in instruction following. To train LFMs, we obtain feedback
from Large Language Models (LLMs) on visual trajectories verbalized to language
descriptions. First, by using LFMs to identify desirable behaviour to imitate,
we improve task-completion rate over strong behavioural cloning baselines on
three distinct language grounding environments (Touchdown, ScienceWorld, and
ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict
actions, when controlling for the number of LLM output tokens. Third, LFMs
generalize to unseen environments, improving task-completion rate by 3.5-12.0%
through one round of adaptation. Finally, LFMs can be modified to provide
human-interpretable feedback without performance loss, allowing human
verification of desirable behaviour for imitation learning.
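To make the described pipeline concrete, the sketch below illustrates one plausible reading of it: step-level LLM feedback on verbalized trajectories is distilled into a small feedback model, which then filters the steps used for behavioural cloning. This is a minimal illustration only; the names (Step, query_llm_feedback, TinyLFM, filtered_behaviour_cloning, llm_judge) are assumptions for the example and not the authors' implementation.

```python
# Minimal sketch of an LFM-style pipeline, as summarized in the abstract:
# (1) verbalize trajectories, (2) query an LLM for step-level feedback,
# (3) distill that feedback into a small feedback model (the "LFM"),
# (4) imitate only the actions the LFM marks as desirable.
# All names below are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    instruction: str   # the task instruction being followed
    observation: str   # visual observation verbalized to a text description
    action: str        # action the base policy took at this step


def query_llm_feedback(steps: List[Step],
                       llm_judge: Callable[[str], bool]) -> List[bool]:
    """Ask an LLM whether each action makes progress on the instruction.
    `llm_judge` stands in for a prompted LLM call returning True/False."""
    labels = []
    for s in steps:
        prompt = (f"Instruction: {s.instruction}\n"
                  f"Observation: {s.observation}\n"
                  f"Action: {s.action}\n"
                  "Does this action help complete the instruction? yes/no")
        labels.append(llm_judge(prompt))
    return labels


class TinyLFM:
    """Toy feedback model: memorizes LLM judgements per (observation, action)
    pair. In practice this would be a learned classifier over text."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[str, str], bool] = {}

    def fit(self, steps: List[Step], labels: List[bool]) -> "TinyLFM":
        for s, y in zip(steps, labels):
            self.table[(s.observation, s.action)] = y
        return self

    def is_desirable(self, step: Step) -> bool:
        return self.table.get((step.observation, step.action), False)


def filtered_behaviour_cloning(steps: List[Step], lfm: TinyLFM) -> List[Step]:
    """Keep only steps the LFM marks desirable; these (observation, action)
    pairs form the dataset for the imitation-learning update."""
    return [s for s in steps if lfm.is_desirable(s)]


if __name__ == "__main__":
    demo = [
        Step("put the mug in the sink", "you see a mug on the table", "pick up mug"),
        Step("put the mug in the sink", "you are holding the mug", "look at window"),
    ]
    # Stand-in judge: pretend the LLM approves actions that mention the mug.
    judge = lambda prompt: "mug" in prompt.split("Action:")[1].lower()
    lfm = TinyLFM().fit(demo, query_llm_feedback(demo, judge))
    print(filtered_behaviour_cloning(demo, lfm))
```

The point mirrored here is the cost structure the abstract implies: the LLM is queried once to label batches of verbalized trajectories, after which the much cheaper feedback model supplies desirability labels for imitation, and its yes/no outputs remain inspectable by a human.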