Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
May 23, 2025
作者: Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu
cs.AI
Abstract
As Large Language Models (LLMs) increasingly participate in human-AI
interactions, evaluating their Theory of Mind (ToM) capabilities - particularly
their ability to track dynamic mental states - becomes crucial. While existing
benchmarks assess basic ToM abilities, they predominantly focus on static
snapshots of mental states, overlooking the temporal evolution that
characterizes real-world social interactions. We present DynToM, a
novel benchmark specifically designed to evaluate LLMs' ability to understand
and track the temporal progression of mental states across interconnected
scenarios. Through a systematic four-step framework, we generate 1,100 social
contexts encompassing 5,500 scenarios and 78,100 questions, each validated for
realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs
reveals that they underperform humans by 44.7% on average, with performance
degrading significantly when tracking and reasoning about shifts in mental
states. This performance gap highlights fundamental limitations in current
LLMs' ability to model the dynamic nature of human mental states.
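The reported dataset sizes imply a fixed structure of five interconnected scenarios per social context. The per-unit breakdown below is derived arithmetic from the figures in the abstract, not statistics stated by the authors:

```python
# Dataset composition implied by the reported DynToM statistics.
# Only the three totals come from the abstract; the per-context and
# per-scenario ratios are derived here as a sanity check.
contexts = 1_100
scenarios = 5_500
questions = 78_100

scenarios_per_context = scenarios // contexts    # 5 scenarios per social context
questions_per_context = questions // contexts    # 71 questions per social context
questions_per_scenario = questions / scenarios   # ~14.2 questions per scenario

print(scenarios_per_context, questions_per_context, questions_per_scenario)
```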