PingPong: 사용자 에뮬레이션 및 다중 모델 평가를 통한 역할-플레이 언어 모델을 위한 벤치마크

초록

언어 모델의 역할 수행 능력을 평가하기 위한 새로운 벤치마크를 소개합니다. 저희의 방법론은 언어 모델 자체를 활용하여 동적이고 다중 턴 대화에서 사용자를 흉내내고 그 결과 대화를 평가합니다. 이 프레임워크는 특정 캐릭터 역할을 가정하는 플레이어 모델, 사용자 행동을 모방하는 심문자 모델, 대화 품질을 평가하는 심사자 모델로 구성됩니다. 우리는 자동 평가와 인간 주석을 비교하는 실험을 실시하여 우리의 방법을 검증하였으며, 다양한 기준에 걸쳐 강한 상관 관계를 보여주었습니다. 이 연구는 상호작용 시나리오에서 모델 능력을 견고하고 동적으로 평가하기 위한 기초를 제공합니다.

English

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.

PingPong: 사용자 에뮬레이션 및 다중 모델 평가를 통한 역할-플레이 언어 모델을 위한 벤치마크

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

초록

Support