大型語言模型的道德基礎

摘要

道德基礎理論（MFT）是一種心理評估工具，將人類的道德推理分解為五個因素，包括關懷/傷害、自由/壓迫和神聖/墮落（Graham等人，2009年）。人們在做出道德決定時，會根據文化背景和政治意識形態的不同，對這些維度賦予不同的重要性。由於大型語言模型（LLMs）是在從互聯網收集的數據集上進行訓練的，因此它們可能反映了這些文集中存在的偏見。本文使用MFT作為一個透鏡，分析瞭流行的LLMs是否對特定一組道德價值觀產生了偏見。我們分析已知的LLMs，發現它們展現了特定的道德基礎，並展示了這些基礎如何與人類的道德基礎和政治立場相關。我們還測量這些偏見的一致性，或者它們是否在模型被提示的上下文中強烈變化。最後，我們展示了我們可以對抗地選擇提示，鼓勵模型展現特定一組道德基礎，並且這可能影響模型在下游任務中的行為。這些發現有助於說明LLMs假定特定道德立場可能帶來的潛在風險和意外後果。

English

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. As large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases that are present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find they exhibit particular moral foundations, and show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, or whether they vary strongly depending on the context of how the model is prompted. Finally, we show that we can adversarially select prompts that encourage the moral to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.

大型語言模型的道德基礎

Moral Foundations of Large Language Models

摘要

Support