Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

Abstract

Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged the ability of LLMs to generate abstract plans in order to simplify challenging control tasks, either by action scoring or by action modeling (fine-tuning). However, the transformer architecture carries several constraints that make it difficult for an LLM to serve directly as the agent, e.g. limited input lengths, fine-tuning inefficiency, bias from pre-training, and incompatibility with non-text environments. To maintain compatibility with a low-level trainable actor, we propose to use the knowledge in LLMs to simplify the control problem rather than to solve it. We propose the Plan, Eliminate, and Track (PET) framework. The Plan module translates a task description into a list of high-level sub-tasks. The Eliminate module masks out objects and receptacles in the observation that are irrelevant to the current sub-task. Finally, the Track module determines whether the agent has accomplished each sub-task. On the AlfWorld instruction-following benchmark, the PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
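The division of labor among the three modules can be illustrated with a minimal sketch. It is not the authors' implementation: every function name here is hypothetical, and the string-splitting planner, keyword-matching relevance score, and `is_done` callback are toy stand-ins for the LLM queries the abstract describes. The low-level trainable actor is omitted entirely.

```python
from typing import Callable, List


def plan(task: str) -> List[str]:
    """Plan: turn a task description into a list of sub-tasks.

    Toy stand-in (assumption): splits on ", then " instead of querying an LLM.
    """
    return [step.strip() for step in task.split(", then ") if step.strip()]


def keyword_relevance(obj: str, sub_task: str) -> float:
    """Toy relevance score (assumption): 1.0 if the object is named in the sub-task."""
    return 1.0 if obj.lower() in sub_task.lower() else 0.0


def eliminate(
    objects: List[str],
    sub_task: str,
    relevance: Callable[[str, str], float] = keyword_relevance,
    threshold: float = 0.5,
) -> List[str]:
    """Eliminate: keep only objects scored as relevant to the current sub-task."""
    return [obj for obj in objects if relevance(obj, sub_task) >= threshold]


def track(sub_tasks: List[str], current: int, is_done: Callable[[str], bool]) -> int:
    """Track: advance to the next sub-task once the current one is judged complete."""
    if current < len(sub_tasks) and is_done(sub_tasks[current]):
        return current + 1
    return current


if __name__ == "__main__":
    sub_tasks = plan("take the apple, then heat the apple in the microwave")
    visible = ["apple", "microwave", "sofa", "lamp"]
    print(sub_tasks)
    print(eliminate(visible, sub_tasks[0]))
    print(eliminate(visible, sub_tasks[1]))
```

In this sketch the actor would only ever see the filtered object list for the sub-task that `track` currently points to, which is the sense in which the language model simplifies the control problem rather than solving it.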
