Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Abstract

The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, current research predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of the Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common-sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments through text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving an improvement of +47.5% in success rate on the "ObtainDiamond" task and demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM requires no GPU for training; a single CPU node with 32 CPU cores is sufficient. This research shows the potential of LLMs in developing capable agents that handle long-horizon, complex tasks and adapt to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
