Moral Foundations of Large Language Models

Abstract

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. Because large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find that they exhibit particular moral foundations, and we show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, that is, whether they vary strongly depending on the context in which the model is prompted. Finally, we show that we can adversarially select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.
