Moral Foundations of Large Language Models

Abstract

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. Because large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find that they exhibit particular moral foundations, and we show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, that is, whether they vary strongly depending on the context in which the model is prompted. Finally, we show that prompts can be adversarially selected to encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.
