Evaluating Cognitive Maps and Planning in Large Language Models with CogEval
September 25, 2023
Authors: Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson
cs.AI
Abstract
Recently, an influx of studies has claimed emergent cognitive abilities in large language models (LLMs). Yet most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in large language models. The CogEval protocol can be followed for the evaluation of various abilities. Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer established construct validity for evaluating planning and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This may be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail to unroll goal-directed trajectories based on the underlying structure. Implications for applications and future directions are discussed.
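
To make the abstract's failure taxonomy concrete, below is a minimal Python sketch of how a CogEval-style check might flag the two reported failure modes: hallucinated (invalid) trajectories and getting trapped in loops, scored against a task's underlying graph. The toy GRAPH, the query_llm stub, and the outcome labels are hypothetical illustrations, not the authors' actual harness, prompts, or tasks.

    # Minimal sketch of a CogEval-style planning check. GRAPH, query_llm,
    # and the outcome labels are hypothetical, not the paper's harness.
    from collections import Counter

    # A toy "cognitive map": the latent relational structure of one task,
    # given as an adjacency list.
    GRAPH = {
        "A": ["B", "C"],
        "B": ["D"],
        "C": ["D"],
        "D": ["E"],
        "E": [],
    }

    def is_valid_trajectory(path, graph):
        # Every consecutive pair of rooms must be an existing edge.
        return all(b in graph.get(a, []) for a, b in zip(path, path[1:]))

    def has_loop(path):
        # Revisiting a room signals the model is trapped in a cycle.
        return len(set(path)) < len(path)

    def classify(path, graph, goal):
        if not is_valid_trajectory(path, graph):
            return "hallucinated_edge"  # step absent from the latent structure
        if has_loop(path):
            return "loop"
        return "success" if path and path[-1] == goal else "wrong_goal"

    def query_llm(prompt):
        # Placeholder for a real model call; returns the room sequence the
        # model proposes. Hypothetical stub for illustration only.
        return ["A", "C", "D", "E"]

    def evaluate(n_repeats=30):
        # Repeated queries per task are what support the protocol's
        # statistical robustness tests across iterations.
        prompt = "You start in room A and must reach room E. List the rooms you pass through."
        return Counter(classify(query_llm(prompt), GRAPH, goal="E")
                       for _ in range(n_repeats))

    if __name__ == "__main__":
        print(evaluate())  # e.g. Counter({'success': 30})

Running each prompt many times and tallying outcome categories, as in evaluate(), is one simple way to turn single anecdotal successes into the kind of multi-iteration, statistically testable evidence the protocol calls for.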