Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
May 23, 2025
作者: Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu
cs.AI
Abstract
As Large Language Models (LLMs) increasingly participate in human-AI
interactions, evaluating their Theory of Mind (ToM) capabilities - particularly
their ability to track dynamic mental states - becomes crucial. While existing
benchmarks assess basic ToM abilities, they predominantly focus on static
snapshots of mental states, overlooking the temporal evolution that
characterizes real-world social interactions. We present DynToM, a
novel benchmark specifically designed to evaluate LLMs' ability to understand
and track the temporal progression of mental states across interconnected
scenarios. Through a systematic four-step framework, we generate 1,100 social
contexts encompassing 5,500 scenarios and 78,100 questions, each validated for
realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs
reveals that they underperform humans by 44.7% on average, with performance
degrading significantly when tracking and reasoning about shifts in mental
states. This performance gap highlights fundamental limitations in current
LLMs' ability to model the dynamic nature of human mental states.
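The reported dataset sizes imply a fixed structure of five interconnected scenarios per social context. The per-unit breakdown below is derived arithmetic from the figures in the abstract, not statistics stated by the authors:

```python
# Dataset composition implied by the reported DynToM statistics.
# Only the three totals come from the abstract; the per-context and
# per-scenario ratios are derived here as a sanity check.
contexts = 1_100
scenarios = 5_500
questions = 78_100

scenarios_per_context = scenarios // contexts    # 5 scenarios per social context
questions_per_context = questions // contexts    # 71 questions per social context
questions_per_scenario = questions / scenarios   # ~14.2 questions per scenario

print(scenarios_per_context, questions_per_context, questions_per_scenario)
```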