Ψ-Bench: Evaluating Persona-Aware Influence in Persuasive Dialogue

Abstract

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, which limits their ability to engage users proactively and to offer suggestions or guidance. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ-Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. Ψ-Bench comprises three real-world interaction scenarios that involve persuasion, and it endows simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ-Bench and find that, while most models produce coherent and reasonable arguments, even state-of-the-art models leave considerable room for improvement in persuasion. We also find that giving models access to client profiles yields an average performance gain of 18.24%, highlighting the importance of user-specific information for effective persuasion. Overall, our work identifies persona-sensitive influence as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Code is available at: https://github.com/Hanpx20/Psi-Bench.