Evaluating Cognitive Maps and Planning in Large Language Models with CogEval
September 25, 2023
Authors: Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson
cs.AI
Abstract
Recently, an influx of studies has claimed emergent cognitive abilities in large
language models (LLMs). Yet most rely on anecdotes, overlook contamination of
training sets, or lack systematic evaluation involving multiple tasks, control
conditions, multiple iterations, and statistical robustness tests. Here we make
two major contributions. First, we propose CogEval, a cognitive
science-inspired protocol for the systematic evaluation of cognitive capacities
in Large Language Models. The CogEval protocol can be followed for the
evaluation of various abilities. Second, here we follow CogEval to
systematically evaluate cognitive maps and planning ability across eight LLMs
(OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard,
Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base
our task prompts on human experiments, which both offer established construct
validity for evaluating planning and are absent from LLM training sets. We
find that, while LLMs show apparent competence in a few planning tasks with
simpler structures, systematic evaluation reveals striking failure modes in
planning tasks, including hallucinations of invalid trajectories and getting
trapped in loops. These findings do not support the idea of emergent
out-of-the-box planning ability in LLMs. This could be because LLMs do not
understand the latent relational structures underlying planning problems, known
as cognitive maps, and fail at unrolling goal-directed trajectories based on
the underlying structure. Implications for applications and future directions
are discussed.
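
The failure modes reported above (hallucinated, invalid trajectories and getting trapped in loops) suggest a simple mechanical scoring rule. Below is a minimal Python sketch of such a check, assuming the task's latent structure is a directed graph; the graph, the evaluate_trajectory helper, and the sample runs are illustrative assumptions for exposition, not code from the paper.

```python
# Minimal sketch (not the authors' code): score an LLM-proposed trajectory
# against the task's latent graph (its "cognitive map").
from collections import Counter

def evaluate_trajectory(graph, trajectory, start, goal):
    """Label one model answer: 'success', 'invalid_edge' (hallucinated
    transition or wrong start), 'loop' (revisited state), or 'wrong_goal'."""
    if not trajectory or trajectory[0] != start:
        return "invalid_edge"
    for a, b in zip(trajectory, trajectory[1:]):
        if b not in graph.get(a, ()):
            return "invalid_edge"  # step uses an edge the map does not contain
    if len(set(trajectory)) < len(trajectory):
        return "loop"              # some state is visited more than once
    return "success" if trajectory[-1] == goal else "wrong_goal"

# Toy six-state map (adjacency lists) and three hypothetical model outputs,
# tallied the way a CogEval-style protocol aggregates repeated runs.
graph = {1: [2, 3], 2: [4], 3: [5], 4: [2, 6], 5: [6], 6: []}
runs = [
    [1, 3, 5, 6],        # valid shortest path       -> success
    [1, 2, 5, 6],        # edge 2->5 does not exist  -> invalid_edge
    [1, 2, 4, 2, 4, 6],  # legal edges but revisits  -> loop
]
print(Counter(evaluate_trajectory(graph, t, start=1, goal=6) for t in runs))
# Counter({'success': 1, 'invalid_edge': 1, 'loop': 1})
```

Aggregating such labels over many iterations of the same task, across control conditions, is what lets a CogEval-style protocol distinguish robust planning competence from one-off lucky completions.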