
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

May 1, 2025
Authors: Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI

Abstract

Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
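The per-turn loop described in the abstract can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the names `TurnRecord`, `toy_sentient_step`, and `run_dialogue` are hypothetical, the 0-100 emotion scale and the starting value of 50 are assumed, and a keyword rule stands in for the LLM-driven reasoning that the actual Sentient Agent performs. It only shows how a numerical emotion trajectory and inner thoughts could be recorded across turns.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TurnRecord:
    """One turn of the simulated user's reasoning (illustrative fields only)."""
    emotion: float       # emotion score after this turn (assumed 0-100 scale)
    inner_thought: str   # interpretable explanation of how the agent feels
    reply: str           # what the simulated user says next


def toy_sentient_step(emotion: float, model_reply: str) -> TurnRecord:
    """Hypothetical stand-in for the agent: a keyword rule, not LLM reasoning."""
    cues = ("understand", "sounds hard", "here for you")
    supportive = any(cue in model_reply.lower() for cue in cues)
    # (i) how the emotion changes, clamped to the assumed 0-100 range
    delta = 10.0 if supportive else -10.0
    new_emotion = max(0.0, min(100.0, emotion + delta))
    # (ii) how the agent feels, kept as a readable inner thought
    thought = "I feel heard." if supportive else "That did not address how I feel."
    # (iii) how the agent replies
    reply = "Thanks, that helps." if supportive else "I don't think you get it."
    return TurnRecord(new_emotion, thought, reply)


def run_dialogue(
    tested_model: Callable[[str], str],
    opening: str,
    start_emotion: float = 50.0,
    max_turns: int = 3,
) -> List[TurnRecord]:
    """Alternate between the tested model and the toy agent, logging each turn."""
    records: List[TurnRecord] = []
    emotion, user_msg = start_emotion, opening
    for _ in range(max_turns):
        model_reply = tested_model(user_msg)
        record = toy_sentient_step(emotion, model_reply)
        records.append(record)
        emotion, user_msg = record.emotion, record.reply
    return records
```

Under these assumptions, the emotion of the last record plays the role of the final score, and the list of `emotion` values is the trajectory; in the framework itself, both would come from the agent's own reasoning rather than from a fixed rule.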
