Ψ-Bench: Evaluating Persona-Sensitive Influence in Persuasive Dialogue

Abstract

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, which limits their ability to engage users and proactively offer suggestions or guidance. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ-Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. In Ψ-Bench, we design three real-world interaction scenarios that involve persuasion, and we endow simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ-Bench and find that, while most models can produce coherent and reasonable arguments, even state-of-the-art models leave considerable room for improvement in persuasion. We also find that providing access to client profiles yields an average performance gain of 18.24%, underscoring the importance of user-specific information for effective persuasion. Overall, our work identifies persona-sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Code is available at: https://github.com/Hanpx20/Psi-Bench.