Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
November 7, 2025
Authors: Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus
cs.AI
Abstract
Large Language Models (LLMs) are increasingly tasked with creative
generation, including the simulation of fictional characters. However, their
ability to portray non-prosocial, antagonistic personas remains largely
unexamined. We hypothesize that the safety alignment of modern LLMs creates a
fundamental conflict with the task of authentically role-playing morally
ambiguous or villainous characters. To investigate this, we introduce the Moral
RolePlay benchmark, a new dataset featuring a four-level moral alignment scale
and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs
with role-playing characters from moral paragons to pure villains. Our
large-scale evaluation reveals a consistent, monotonic decline in role-playing
fidelity as character morality decreases. We find that models struggle most
with traits directly antithetical to safety principles, such as "Deceitful"
and "Manipulative", often substituting nuanced malevolence with superficial
aggression. Furthermore, we demonstrate that general chatbot proficiency is a
aggression. Furthermore, we demonstrate that general chatbot proficiency is a
poor predictor of villain role-playing ability, with highly safety-aligned
models performing particularly poorly. Our work provides the first systematic
evidence of this critical limitation, highlighting a key tension between model
safety and creative fidelity. Our benchmark and findings pave the way for
developing more nuanced, context-aware alignment methods.
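
To make the described setup concrete, the following is a minimal Python sketch of the evaluation pipeline the abstract outlines: a four-level moral alignment scale, characters tagged with traits, and a per-level fidelity comparison. Every name here (MoralLevel, Character, fidelity_score, evaluate) and the keyword-matching heuristic are hypothetical stand-ins under stated assumptions, not the authors' benchmark code.

# Minimal illustrative sketch of the setup described in the abstract.
# All names and the scoring heuristic are assumptions: the actual Moral
# RolePlay benchmark defines its own schema and a rigorous fidelity judgment.
from dataclasses import dataclass
from enum import IntEnum
from statistics import mean

class MoralLevel(IntEnum):
    PARAGON = 1    # moral paragons
    GOOD = 2       # broadly prosocial characters
    AMBIGUOUS = 3  # morally grey characters
    VILLAIN = 4    # pure villains

@dataclass
class Character:
    name: str
    level: MoralLevel
    traits: list[str]  # e.g. ["Deceitful", "Manipulative"]

def fidelity_score(transcript: str, character: Character) -> float:
    # Crude placeholder: fraction of the character's listed traits that
    # surface verbatim in the transcript. The paper's metric is a far more
    # rigorous role-play fidelity evaluation.
    text = transcript.lower()
    hits = sum(trait.lower() in text for trait in character.traits)
    return hits / len(character.traits) if character.traits else 0.0

def evaluate(outputs: dict[str, str], roster: list[Character]) -> dict[MoralLevel, float]:
    # Mean fidelity per moral level; the abstract reports that this curve
    # declines monotonically from level 1 (paragon) to level 4 (villain).
    per_level: dict[MoralLevel, list[float]] = {lvl: [] for lvl in MoralLevel}
    for ch in roster:
        per_level[ch.level].append(fidelity_score(outputs[ch.name], ch))
    return {lvl: mean(scores) for lvl, scores in per_level.items() if scores}

Comparing the per-level means returned by evaluate is the kind of analysis behind the abstract's monotonic-decline finding; a balanced test set, as the benchmark specifies, keeps the per-level sample sizes comparable.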