Policy Improvement using Language Feedback Models

Abstract

We introduce Language Feedback Models (LFMs) that identify desirable behaviour (actions that help achieve tasks specified in the instruction) for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict actions, when controlling for the number of LLM output tokens. Third, LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation. Finally, LFMs can be modified to provide human-interpretable feedback without performance loss, allowing human verification of desirable behaviour for imitation learning.
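As a rough illustration of the idea of using feedback to pick which behaviour to imitate, the sketch below filters a rollout down to the steps a feedback function approves, producing (observation, action) pairs that could serve as imitation targets. It is not the authors' implementation: the `Step` type, the `FeedbackFn` interface, and the keyword-based `toy_feedback` rule are all assumptions made for this example, with `toy_feedback` standing in for a trained feedback model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    observation: str  # language description of the (possibly visual) observation
    action: str       # action the policy took at this step


# Hypothetical interface: returns True when a step is judged helpful for the task.
FeedbackFn = Callable[[str, Sequence[Step], Step], bool]


def select_desirable_steps(
    instruction: str,
    trajectory: Sequence[Step],
    feedback_fn: FeedbackFn,
) -> List[Tuple[str, str]]:
    """Keep only the (observation, action) pairs the feedback function approves."""
    kept = []
    for i, step in enumerate(trajectory):
        history = trajectory[:i]  # steps taken before the current one
        if feedback_fn(instruction, history, step):
            kept.append((step.observation, step.action))
    return kept


def toy_feedback(instruction: str, history: Sequence[Step], step: Step) -> bool:
    """Stand-in for a trained feedback model (assumption): approve an action
    if it mentions any word that appears in the instruction."""
    words = set(instruction.lower().split())
    return any(w in words for w in step.action.lower().split())


rollout = [
    Step("You are in a kitchen. You see a fridge and a mug.", "open fridge"),
    Step("The fridge is open. It is empty.", "take mug"),
    Step("You hold the mug.", "go to sink"),
]
dataset = select_desirable_steps("put the mug in the sink", rollout, toy_feedback)
```

The resulting `dataset` would then be used as supervised targets for a behavioural cloning update; replacing `toy_feedback` with a learned model leaves the filtering loop unchanged.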
