
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

May 23, 2025
Authors: Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu
cs.AI

Abstract

As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that they underperform humans by 44.7% on average, with performance degrading significantly when tracking and reasoning about shifts in mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.
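
For readers who want to sanity-check the figures quoted in the abstract, the short Python sketch below works through the dataset-scale arithmetic (contexts, scenarios, questions) and one plausible reading of the human-model gap as a relative difference. All per-model accuracy values and the human baseline are hypothetical placeholders, not numbers from the paper, and the abstract does not state whether the 44.7% figure is an absolute or relative gap.

```python
# Minimal sketch (not the authors' code): checks the dataset-scale arithmetic
# from the abstract and illustrates one way a human-model gap could be computed.

NUM_CONTEXTS = 1_100    # social contexts reported in the abstract
NUM_SCENARIOS = 5_500   # interconnected scenarios
NUM_QUESTIONS = 78_100  # validated questions

scenarios_per_context = NUM_SCENARIOS / NUM_CONTEXTS    # -> 5.0
questions_per_scenario = NUM_QUESTIONS / NUM_SCENARIOS  # -> ~14.2

# Hypothetical accuracies for a human baseline and a set of evaluated models.
human_accuracy = 0.90
model_accuracies = [0.55, 0.48, 0.52, 0.45, 0.50]  # placeholder values only

avg_model_accuracy = sum(model_accuracies) / len(model_accuracies)

# Relative shortfall of the average model versus the human baseline,
# analogous in spirit to the "underperform humans by 44.7%" statement.
relative_gap = (human_accuracy - avg_model_accuracy) / human_accuracy

print(f"scenarios per context:  {scenarios_per_context:.1f}")
print(f"questions per scenario: {questions_per_scenario:.1f}")
print(f"average model accuracy: {avg_model_accuracy:.3f}")
print(f"relative gap to humans: {relative_gap:.1%}")
```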
