PingPong:一个具有用户仿真和多模型评估功能的角色扮演语言模型基准测试。
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
September 10, 2024
作者: Ilya Gusev
cs.AI
摘要
我们引入了一个新颖的基准测试,用于评估语言模型的角色扮演能力。我们的方法利用语言模型本身来模拟用户在动态的多轮对话中的表现,并评估生成的对话。该框架包括三个主要组件:扮演特定角色的玩家模型、模拟用户行为的询问者模型,以及评估对话质量的评判者模型。我们进行了实验,将自动化评估与人类注释进行比较,以验证我们的方法,结果显示在多个标准上存在很强的相关性。这项工作为在互动场景中对模型能力进行稳健而动态的评估奠定了基础。
English
We introduce a novel benchmark for evaluating the role-playing capabilities
of language models. Our approach leverages language models themselves to
emulate users in dynamic, multi-turn conversations and to assess the resulting
dialogues. The framework consists of three main components: a player model
assuming a specific character role, an interrogator model simulating user
behavior, and a judge model evaluating conversation quality. We conducted
experiments comparing automated evaluations with human annotations to validate
our approach, demonstrating strong correlations across multiple criteria. This
work provides a foundation for a robust and dynamic evaluation of model
capabilities in interactive scenarios.Summary
AI-Generated Summary