LLMs achieve adult human performance on higher-order theory of mind tasks

Abstract

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). It builds on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
