LLMs achieve adult human performance on higher-order theory of mind tasks
May 29, 2024
Authors: Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar
cs.AI
Abstract
This paper examines the extent to which large language models (LLMs) have
developed higher-order theory of mind (ToM); the human ability to reason about
multiple mental and emotional states in a recursive manner (e.g. I think that
you believe that she knows). This paper builds on prior work by introducing a
handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to
compare the performance of five LLMs to a newly gathered adult human benchmark.
We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level
performance on ToM tasks overall, and that GPT-4 exceeds adult performance on
6th order inferences. Our results suggest that there is an interplay between
model size and finetuning for the realisation of ToM abilities, and that the
best-performing LLMs have developed a generalised capacity for ToM. Given the
role that higher-order ToM plays in a wide range of cooperative and competitive
human behaviours, these findings have significant implications for user-facing
LLM applications.