CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
February 13, 2025
Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
cs.AI
Abstract
Role-playing language agents (RPLAs) have emerged as promising applications
of large language models (LLMs). However, simulating established characters
presents a challenging task for RPLAs, due to the lack of authentic character
datasets and nuanced evaluation methods using such data. In this paper, we
present CoSER, a collection of a high-quality dataset, open models, and an
evaluation protocol towards effective RPLAs of established characters. The
CoSER dataset covers 17,966 characters from 771 renowned books. It provides
authentic dialogues with real-world intricacies, as well as diverse data types
such as conversation setups, character experiences and internal thoughts.
Drawing from acting methodology, we introduce given-circumstance acting for
training and evaluating role-playing LLMs, where LLMs sequentially portray
multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
Extensive experiments demonstrate the value of the CoSER dataset for RPLA
training, evaluation and retrieval. Moreover, CoSER 70B exhibits
state-of-the-art performance surpassing or matching GPT-4o on our evaluation
and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
the InCharacter and LifeChoice benchmarks respectively.