
**Too Good to be Bad: On the Failure of LLMs to Role-Play Villains**

November 7, 2025
Authors: Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus
cs.AI

Abstract

Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.