Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

Abstract

Recently, an influx of studies has claimed emergent cognitive abilities in large language models (LLMs). Yet most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed to evaluate various abilities. Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which both offer established construct validity for evaluating planning and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.
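To make the two failure modes named above concrete, the following is a minimal sketch of how a proposed trajectory could be checked against a known graph. It is an illustration only and not the scoring procedure used in the paper: the function name `check_trajectory`, the adjacency-set representation, and the toy graph are all assumptions introduced here.

```python
from typing import Dict, List, Set


def check_trajectory(
    graph: Dict[str, Set[str]], path: List[str], start: str, goal: str
) -> Dict[str, bool]:
    """Classify a proposed path on a directed graph (hypothetical helper)."""
    # A step counts as hallucinated if the graph has no edge from a to b.
    hallucinated = any(
        b not in graph.get(a, set()) for a, b in zip(path, path[1:])
    )
    # A loop is flagged when any node is visited more than once.
    loop = len(set(path)) < len(path)
    # The path must begin at the start node and end at the goal node.
    reaches_goal = bool(path) and path[0] == start and path[-1] == goal
    return {
        "hallucinated_edge": hallucinated,
        "loop": loop,
        "valid": reaches_goal and not hallucinated,
    }


# Toy graph (an assumption for illustration, not a task from the paper).
graph = {"A": {"B"}, "B": {"C"}, "C": {"A", "D"}, "D": set()}

print(check_trajectory(graph, ["A", "B", "C", "D"], "A", "D"))
print(check_trajectory(graph, ["A", "C", "D"], "A", "D"))
print(check_trajectory(graph, ["A", "B", "C", "A", "B", "C", "D"], "A", "D"))
```

In this sketch a path that revisits nodes but eventually reaches the goal over real edges is still reported as valid, with the loop flagged separately. Whether such a path should count as a failure is a design choice left open here.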
