Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
May 1, 2025
Authors: Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
cs.AI
Abstract
Assessing how well a large language model (LLM) understands humans, rather
than merely text, remains an open challenge. To bridge the gap, we introduce
Sentient Agent as a Judge (SAGE), an automated evaluation framework that
measures an LLM's higher-order social cognition. SAGE instantiates a Sentient
Agent that simulates human-like emotional changes and inner thoughts during
interaction, providing a more realistic evaluation of the tested model in
multi-turn conversations. At every turn, the agent reasons about (i) how its
emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a
numerical emotion trajectory and interpretable inner thoughts. Experiments on
100 supportive-dialogue scenarios show that the final Sentient emotion score
correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings
and utterance-level empathy metrics, validating psychological fidelity. We also
build a public Sentient Leaderboard covering 18 commercial and open-source
models that uncovers substantial gaps (up to 4x) between frontier systems
(GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in
conventional leaderboards (e.g., Arena). SAGE thus provides a principled,
scalable and interpretable tool for tracking progress toward genuinely
empathetic and socially adept language agents.
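The per-turn loop described in the abstract (reason about how the emotion changes, how the agent feels, and how it should reply) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `SentientAgentSketch`, the 0 to 100 emotion scale, the starting value of 50, and the keyword-based `toy_appraise` and `toy_respond` functions are all assumptions made for illustration. In a real setup the two callables would presumably be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TurnRecord:
    emotion: float       # emotion score after this turn
    inner_thought: str   # interpretable inner thought for this turn
    reply: str           # what the simulated user says next


@dataclass
class SentientAgentSketch:
    # (tested model's utterance, current emotion) -> (emotion delta, inner thought)
    appraise: Callable[[str, float], Tuple[float, str]]
    # (inner thought, updated emotion) -> reply text
    respond: Callable[[str, float], str]
    emotion: float = 50.0  # assumed starting point on an assumed 0-100 scale
    history: List[TurnRecord] = field(default_factory=list)

    def step(self, model_utterance: str) -> TurnRecord:
        # (i) how the emotion changes and (ii) how the agent feels
        delta, thought = self.appraise(model_utterance, self.emotion)
        self.emotion = max(0.0, min(100.0, self.emotion + delta))
        # (iii) how the agent should reply
        reply = self.respond(thought, self.emotion)
        record = TurnRecord(self.emotion, thought, reply)
        self.history.append(record)
        return record

    def trajectory(self) -> List[float]:
        # Numerical emotion trajectory; the last entry is the final score.
        return [r.emotion for r in self.history]


def toy_appraise(utterance: str, emotion: float) -> Tuple[float, str]:
    # Hypothetical stand-in for an LLM-based appraisal.
    if "understand" in utterance.lower():
        return 10.0, "They seem to get how I feel."
    return -5.0, "I do not feel heard."


def toy_respond(thought: str, emotion: float) -> str:
    # Hypothetical stand-in for an LLM-generated reply.
    return "Thanks, that helps." if emotion >= 50.0 else "I guess so."


agent = SentientAgentSketch(appraise=toy_appraise, respond=toy_respond)
agent.step("I understand, that sounds really hard.")
agent.step("You should just get over it.")
print(agent.trajectory())
```

The final entry of `trajectory()` plays the role of the final emotion score in this sketch, while `history` keeps the inner thoughts that make each score change interpretable.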