Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

Abstract

Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating its psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models. It uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps that conventional leaderboards (e.g., Arena) do not reflect. SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
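The per-turn reasoning described in the abstract can be illustrated with a minimal sketch. It is not the SAGE implementation: the class `SentientAgentSketch`, the 0-100 emotion scale, the starting value of 50, and the keyword-based `toy_appraise` function are all assumptions made for illustration. In the framework itself an LLM performs the appraisal, whereas here a trivial rule stands in so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical appraisal signature (an assumption, not from the paper):
# (current emotion, tested model's reply) -> (emotion change, inner thought)
Appraise = Callable[[float, str], Tuple[float, str]]


@dataclass
class SentientAgentSketch:
    appraise: Appraise
    emotion: float = 50.0  # assumed starting point on an assumed 0-100 scale
    trajectory: List[float] = field(default_factory=list)
    thoughts: List[str] = field(default_factory=list)

    def step(self, model_reply: str) -> str:
        # (i) reason about how the emotion changes
        delta, thought = self.appraise(self.emotion, model_reply)
        # (ii) update how the agent feels, clamped to the assumed scale
        self.emotion = max(0.0, min(100.0, self.emotion + delta))
        self.trajectory.append(self.emotion)
        self.thoughts.append(thought)
        # (iii) decide how to reply; a placeholder string in this sketch
        return f"[feeling {self.emotion:.0f}/100] {thought}"


def toy_appraise(emotion: float, reply: str) -> Tuple[float, str]:
    # Stand-in for the LLM-based appraisal: a crude keyword rule.
    if "understand" in reply.lower():
        return 10.0, "They seem to get how I feel."
    return -5.0, "That did not really address my situation."


agent = SentientAgentSketch(appraise=toy_appraise)
for reply in ["I understand why that hurt.", "Have you tried making a list?"]:
    agent.step(reply)

# The last value of the trajectory plays the role of the final emotion score.
final_score = agent.trajectory[-1]
```

Under these assumptions, the recorded `trajectory` corresponds to the numerical emotion trajectory, `thoughts` to the interpretable inner thoughts, and `final_score` to the quantity that the abstract reports as correlating with BLRI ratings.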