PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach uses language models themselves both to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model that assumes a specific character role, an interrogator model that simulates user behavior, and a judge model that evaluates conversation quality. To validate the approach, we conducted experiments comparing the automated evaluations with human annotations and found strong correlations across multiple criteria. This work provides a foundation for robust and dynamic evaluation of model capabilities in interactive scenarios.
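The three-role setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, prompt wording, turn count, and the stub models are all assumptions made for illustration, and a real run would replace each stub with a call to an actual language model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "model" is any callable that maps a prompt string to a reply string.
# Wiring this to a real LLM API is deliberately left out of the sketch.
Model = Callable[[str], str]


@dataclass
class Turn:
    speaker: str  # "user" (interrogator) or "character" (player)
    text: str


def format_history(history: List[Turn]) -> str:
    """Render the dialogue so far as plain text for inclusion in a prompt."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in history)


def run_dialogue(player: Model, interrogator: Model, character: str,
                 situation: str, num_turns: int = 3) -> List[Turn]:
    """Alternate interrogator and player messages for num_turns rounds."""
    history: List[Turn] = []
    for _ in range(num_turns):
        # The interrogator plays the user and writes the next user message.
        user_prompt = (f"Situation: {situation}\n"
                       f"Dialogue so far:\n{format_history(history)}\n"
                       "Write the next user message.")
        history.append(Turn("user", interrogator(user_prompt)))
        # The player answers in character.
        player_prompt = (f"Character: {character}\n"
                         f"Dialogue so far:\n{format_history(history)}\n"
                         "Reply in character.")
        history.append(Turn("character", player(player_prompt)))
    return history


def judge_dialogue(judge: Model, character: str, history: List[Turn],
                   criteria: List[str]) -> str:
    """Ask the judge model to assess the finished dialogue on the given criteria."""
    prompt = (f"Character: {character}\n"
              f"Dialogue:\n{format_history(history)}\n"
              f"Rate the character's replies on: {', '.join(criteria)}.")
    return judge(prompt)


# Hypothetical stub models so the sketch runs without any external service.
def stub_interrogator(prompt: str) -> str:
    return "Tell me about yourself."


def stub_player(prompt: str) -> str:
    return "I am a ship captain, at your service."


def stub_judge(prompt: str) -> str:
    return "placeholder verdict"


dialogue = run_dialogue(stub_player, stub_interrogator,
                        character="a ship captain",
                        situation="a passenger asks questions on deck",
                        num_turns=2)
# The criteria names below are placeholders, not the benchmark's actual rubric.
verdict = judge_dialogue(stub_judge, "a ship captain", dialogue,
                         criteria=["criterion A", "criterion B"])
```

Keeping the three roles as interchangeable callables means any model can be swapped into any role, which is one plausible way to support the multi-model evaluation the title refers to.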