LLMs achieve adult human performance on higher-order theory of mind tasks
May 29, 2024
Authors: Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar
cs.AI
Abstract
This paper examines the extent to which large language models (LLMs) have
developed higher-order theory of mind (ToM); the human ability to reason about
multiple mental and emotional states in a recursive manner (e.g. I think that
you believe that she knows). This paper builds on prior work by introducing a
handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to
compare the performance of five LLMs to a newly gathered adult human benchmark.
We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level
performance on ToM tasks overall, and that GPT-4 exceeds adult performance on
6th order inferences. Our results suggest that there is an interplay between
model size and finetuning for the realisation of ToM abilities, and that the
best-performing LLMs have developed a generalised capacity for ToM. Given the
role that higher-order ToM plays in a wide range of cooperative and competitive
human behaviours, these findings have significant implications for user-facing
LLM applications.