LLMs achieve adult human performance on higher-order theory of mind tasks

May 29, 2024
作者: Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar
cs.AI

Abstract

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. "I think that you believe that she knows"). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that there is an interplay between model size and finetuning in the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.