Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Abstract

As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that they underperform humans by 44.7% on average, with performance degrading significantly when tracking and reasoning about shifts in mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.
