大型语言模型的道德基础

摘要

道德基础理论（MFT）是一种心理评估工具，将人类道德推理分解为包括关怀/伤害、自由/压迫和神圣/堕落在内的五个因素（Graham等，2009）。人们在做出道德决策时在这些维度上的权重会有所不同，部分原因是由于他们的文化背景和政治意识形态。由于大型语言模型（LLMs）是在从互联网收集的数据集上进行训练的，它们可能会反映出这些语料库中存在的偏见。本文以MFT作为分析视角，研究了流行的LLMs是否已经对特定一组道德价值观产生了偏见。我们分析了已知的LLMs，并发现它们表现出特定的道德基础，并展示了这些基础如何与人类的道德基础和政治立场相关。我们还衡量了这些偏见的一致性，或者它们是否会根据模型被提示的上下文而强烈变化。最后，我们展示了我们可以对抗性地选择提示，鼓励模型展示特定一组道德基础，并且这可能会影响模型在下游任务上的行为。这些发现有助于说明LLMs假定特定道德立场可能存在的潜在风险和意想不到的后果。

English

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. As large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases that are present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find they exhibit particular moral foundations, and show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, or whether they vary strongly depending on the context of how the model is prompted. Finally, we show that we can adversarially select prompts that encourage the moral to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.

大型语言模型的道德基础

Moral Foundations of Large Language Models

摘要

Support