PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach uses language models themselves both to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model that assumes a specific character role, an interrogator model that simulates user behavior, and a judge model that evaluates conversation quality. To validate the approach, we conducted experiments comparing the automated evaluations with human annotations and found strong correlations across multiple criteria. This work provides a foundation for robust and dynamic evaluation of model capabilities in interactive scenarios.
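The three-component loop described above can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the `Model` and `Judge` interfaces, the function names, and the stub models are all hypothetical, and the single score returned by the stub judge is a placeholder rather than one of the benchmark's criteria. In practice each callable would wrap a call to a language model.

```python
from typing import Callable, Dict, List, Tuple

# A dialogue is a list of (speaker, text) pairs.
Dialogue = List[Tuple[str, str]]
# Hypothetical interface: a model maps the dialogue so far to its next utterance.
Model = Callable[[Dialogue], str]
# Hypothetical interface: a judge maps a finished dialogue to per-criterion scores.
Judge = Callable[[Dialogue], Dict[str, float]]


def run_dialogue(player: Model, interrogator: Model, num_turns: int) -> Dialogue:
    """Alternate interrogator (simulated user) and player (character) turns."""
    dialogue: Dialogue = []
    for _ in range(num_turns):
        # The interrogator speaks first in each turn, emulating a user.
        dialogue.append(("user", interrogator(dialogue)))
        # The player answers in character, seeing the full history.
        dialogue.append(("character", player(dialogue)))
    return dialogue


def evaluate(player: Model, interrogator: Model, judge: Judge,
             num_turns: int = 3) -> Dict[str, float]:
    """Generate one dialogue and let the judge score it."""
    return judge(run_dialogue(player, interrogator, num_turns))


# Stub models standing in for real language-model calls (assumptions).
def stub_interrogator(dialogue: Dialogue) -> str:
    return f"Question {len(dialogue) // 2 + 1}?"


def stub_player(dialogue: Dialogue) -> str:
    return f"In character, I answer: {dialogue[-1][1]}"


def stub_judge(dialogue: Dialogue) -> Dict[str, float]:
    # Placeholder metric only; a real judge would rate conversation quality.
    character_turns = [text for speaker, text in dialogue if speaker == "character"]
    return {"num_character_turns": float(len(character_turns))}


if __name__ == "__main__":
    print(evaluate(stub_player, stub_interrogator, stub_judge, num_turns=3))
```

Keeping the three roles as plain callables means any of them can be swapped for a different model without changing the loop, which is one way to support the multi-model evaluation the abstract refers to.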