

Persona Prompting as a Lens on LLM Social Reasoning

January 28, 2026
作者: Jing Yang, Moritz Hechtbauer, Elisabeth Khalilov, Evelyn Luise Brinkmann, Vera Schmitt, Nils Feldhus
cs.AI

Abstract

For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While persona prompting (PP) is increasingly used as a way to steer models toward user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality; (2) simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows that models are resistant to significant steering; (3) models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.
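The abstract describes conditioning models on simulated demographic personas and scoring their word-level rationales against human annotations. The sketch below illustrates one plausible version of that setup; the persona wording, prompt format, and token-level F1 metric are assumptions for illustration, not the paper's actual protocol.

```python
# Minimal sketch (not the authors' pipeline): build a persona-conditioned prompt
# for a hate-speech judgment with a word-level rationale, then score the model's
# rationale words against one annotator group's word-level labels.

from typing import Set


def build_persona_prompt(persona: str, text: str) -> str:
    """Prepend a simulated demographic persona to the task instruction."""
    return (
        f"You are {persona}.\n"
        "Decide whether the following post is hateful (yes/no) and list the "
        "exact words that justify your decision.\n\n"
        f"Post: {text}"
    )


def rationale_token_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Token-level F1 between model rationale words and human-annotated words."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical persona and post, for illustration only.
    prompt = build_persona_prompt(
        "a 30-year-old woman from an urban background",
        "Example post under analysis.",
    )
    print(prompt)

    # Hypothetical model rationale vs. one annotator group's word-level labels.
    model_rationale = {"example", "post"}
    human_rationale = {"post", "analysis"}
    print(f"Rationale F1: {rationale_token_f1(model_rationale, human_rationale):.2f}")
```

Comparing such agreement scores across personas and annotator groups is one way to quantify the alignment and steering effects the abstract reports.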