Role-Playing Evaluation for Large Language Models

Abstract

Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval.
