Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Abstract

The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, current research predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of the Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common-sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments through text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving an improvement of +47.5% in success rate on the "ObtainDiamond" task and demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM does not need any GPU for training; a single CPU node with 32 CPU cores is enough. This research shows the potential of LLMs in developing capable agents that handle long-horizon, complex tasks and adapt to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.
