Ψ-Bench: Evaluating Persona-Sensitive Influence in Persuasive Dialogue

Abstract

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, which limits their ability to engage users proactively and offer suggestions or guidance. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ-Bench, a benchmark for assessing the ability of LLMs to influence realistic users through conversation. Ψ-Bench comprises three real-world interaction scenarios that involve persuasion, and it endows simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ-Bench and find that, while most models produce coherent and reasonable arguments, even state-of-the-art models leave considerable room for improvement in persuasion. We also find that giving models access to client profiles yields an average performance gain of 18.24%, underscoring the importance of user-specific information for effective persuasion. Overall, our work identifies persona-sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Code is available at: https://github.com/Hanpx20/Psi-Bench.