LLMs achieve adult human performance on higher-order theory of mind tasks
May 29, 2024
Authors: Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar
cs.AI
Abstract
This paper examines the extent to which large language models (LLMs) have
developed higher-order theory of mind (ToM); the human ability to reason about
multiple mental and emotional states in a recursive manner (e.g. I think that
you believe that she knows). This paper builds on prior work by introducing a
handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to
compare the performance of five LLMs to a newly gathered adult human benchmark.
We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level
performance on ToM tasks overall, and that GPT-4 exceeds adult performance on
6th order inferences. Our results suggest that there is an interplay between
model size and finetuning for the realisation of ToM abilities, and that the
best-performing LLMs have developed a generalised capacity for ToM. Given the
role that higher-order ToM plays in a wide range of cooperative and competitive
human behaviours, these findings have significant implications for user-facing
LLM applications.
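The recursive nesting described above ("I think that you believe that she knows") can be illustrated with a minimal sketch that builds nth-order mental-state statements. The agents, verbs, and base proposition here are illustrative assumptions, not items from the paper's Multi-Order Theory of Mind Q&A test suite:

```python
# Illustrative sketch: composing nth-order recursive mental-state
# statements of the kind used in higher-order theory-of-mind tasks.
# Agent/verb choices and the proposition are hypothetical examples.

AGENTS = ["I", "you", "she", "he", "they"]
VERBS = {"I": "think", "you": "believe", "she": "knows",
         "he": "suspects", "they": "hope"}

def nth_order_statement(order: int, proposition: str) -> str:
    """Wrap `proposition` in `order` nested mental-state clauses."""
    clause = proposition
    # Build from the innermost clause outward, so depth 0 is outermost.
    for depth in reversed(range(order)):
        agent = AGENTS[depth % len(AGENTS)]
        clause = f"{agent} {VERBS[agent]} that {clause}"
    return clause

# A 3rd-order statement nests three mental-state attributions:
print(nth_order_statement(3, "the party is on Friday"))
# → "I think that you believe that she knows that the party is on Friday"
```

Each increment of `order` adds one more embedded attribution, which is what makes 6th-order inferences (where GPT-4 exceeded the adult benchmark) substantially harder than 2nd-order ones.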