SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning
May 24, 2023
Authors: Yue Wu, So Yeon Min, Shrimai Prabhumoye, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, Yuanzhi Li
cs.AI
Abstract
Open-world survival games pose significant challenges for AI algorithms due
to their multi-tasking, deep exploration, and goal prioritization requirements.
Despite reinforcement learning (RL) being popular for solving games, its high
sample complexity limits its effectiveness in complex open-world games like
Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's
original academic paper and use the knowledge learned to reason and play the
game through a large language model (LLM). Prompted with the LaTeX source as
game context and a description of the agent's current observation, our SPRING
framework employs a directed acyclic graph (DAG) with game-related questions as
nodes and dependencies as edges. We identify the optimal action to take in the
environment by traversing the DAG and calculating LLM responses for each node
in topological order, with the LLM's answer to the final node directly translating
to environment actions. In our experiments, we study the quality of in-context
"reasoning" induced by different forms of prompts under the setting of the
Crafter open-world environment. Our experiments suggest that LLMs, when
prompted with consistent chain-of-thought, have great potential in completing
sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4
outperforms all state-of-the-art RL baselines trained for 1M steps, while itself
requiring no training. Finally, we show the potential of games as a test bed for LLMs.
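
The procedure described here, traversing a DAG of game-related questions in topological order, prompting the LLM at each node with the game context, the current observation, and the answers to its parent questions, and mapping the final node's answer to an environment action, can be sketched roughly as follows. This is a minimal illustration, assuming a toy three-question DAG and a hypothetical `query_llm` helper; the actual question set, prompt format, and action mapping are not specified in the abstract.

```python
# A minimal sketch of the question-DAG traversal described above, assuming a
# toy three-question DAG and a placeholder query_llm() helper; none of these
# names or prompts come from the paper itself.
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Maps each question to the set of questions it depends on (its parents).
QUESTION_DAG = {
    "What resources and threats are currently visible?": set(),
    "Which subgoal should the agent pursue next?": {
        "What resources and threats are currently visible?"
    },
    "What single action should the agent take now?": {
        "Which subgoal should the agent pursue next?"
    },
}

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM such as GPT-4 (API details assumed)."""
    raise NotImplementedError

def choose_action(game_context: str, observation: str) -> str:
    """Answer each question in topological order; the last answer is the action."""
    answers: dict[str, str] = {}
    last_answer = ""
    # static_order() yields parents before children, so every question can see
    # the answers of the questions it depends on.
    for question in TopologicalSorter(QUESTION_DAG).static_order():
        parent_qa = "\n".join(
            f"Q: {q}\nA: {answers[q]}" for q in QUESTION_DAG[question]
        )
        prompt = (
            f"{game_context}\n\nObservation: {observation}\n\n"
            f"{parent_qa}\n\nQ: {question}\nA:"
        )
        last_answer = query_llm(prompt)
        answers[question] = last_answer
    return last_answer  # answer to the final node, mapped to an environment action
```

In this sketch the game context would be the LaTeX source of the game's paper and the observation a textual description of the agent's current state; the single returned string stands in for whatever parsing the framework uses to turn the final answer into a concrete environment action.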