

Towards Measuring the Representation of Subjective Global Opinions in Language Models

June 28, 2023
Authors: Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, Deep Ganguli
cs.AI

Abstract

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
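To make the abstract's metric concrete: a similarity score between an LLM's distribution over a survey question's answer options and a given country's human response distribution can be instantiated as 1 minus the Jensen-Shannon divergence. This is a hedged sketch, not necessarily the paper's exact formulation; the function name `js_similarity` and the example distributions are illustrative assumptions.

```python
import math

def js_similarity(p, q):
    """Similarity between two discrete answer distributions over the same
    options, defined here as 1 minus the Jensen-Shannon divergence (base 2),
    so identical distributions score 1.0 and disjoint ones score 0.0."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - jsd

# Hypothetical example: model's distribution over four answer options
# versus one country's aggregated survey responses.
model_dist = [0.70, 0.20, 0.05, 0.05]
country_dist = [0.30, 0.40, 0.20, 0.10]
score = js_similarity(model_dist, country_dist)
```

Comparing `score` across countries for the same question identifies whose opinions the model's responses are most similar to, which is the comparison the framework runs per country.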