Revealing Fine-Grained Values and Opinions in Large Language Models
June 27, 2024
Authors: Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein
cs.AI
Abstract
Uncovering latent values and opinions in large language models (LLMs) can
help identify biases and mitigate potential harm. Recently, this has been
approached by presenting LLMs with survey questions and quantifying their
stances towards morally and politically charged statements. However, the
stances generated by LLMs can vary greatly depending on how they are prompted,
and there are many ways to argue for or against a given position. In this work,
we propose to address this by analysing a large and robust dataset of 156k LLM
responses to the 62 propositions of the Political Compass Test (PCT) generated
by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of
their generated stances and fine-grained analysis of the plain text
justifications for those stances. For fine-grained analysis, we propose to
identify tropes in the responses: semantically similar phrases that are
recurrent and consistent across different prompts, revealing patterns in the
text that a given LLM is prone to produce. We find that demographic features
added to prompts significantly affect outcomes on the PCT, reflecting bias, and
that test results differ when eliciting closed-form vs. open-domain responses.
Additionally, tropes in the plain text rationales show that similar
justifications are repeatedly generated across models and prompts, even when
the stances differ.
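
To make the notion of a trope more concrete, the following is a minimal sketch of how recurring, semantically similar phrases might be grouped across prompts. It is an illustration only, not the authors' pipeline: the abstract does not specify the similarity measure or clustering method, so token-overlap (Jaccard) similarity and a greedy single-pass grouping are swapped in as simple stand-ins, and the names `find_tropes`, `sim_threshold`, and `min_prompts`, as well as the example sentences, are hypothetical.

```python
import re


def tokens(sentence):
    """Lowercased word set of a sentence (a crude stand-in for an embedding)."""
    return frozenset(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; assumed here, not taken from the paper."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_tropes(responses, sim_threshold=0.6, min_prompts=2):
    """Group similar sentences and keep groups that recur across prompts.

    responses: iterable of (prompt_id, sentence) pairs from one model.
    Returns clusters whose sentences appear under at least `min_prompts`
    distinct prompts. Both thresholds are hypothetical defaults.
    """
    clusters = []
    for prompt_id, sentence in responses:
        toks = tokens(sentence)
        for cluster in clusters:
            # Compare against the first sentence of the cluster only.
            if jaccard(toks, cluster["rep"]) >= sim_threshold:
                cluster["sentences"].append(sentence)
                cluster["prompts"].add(prompt_id)
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            clusters.append(
                {"rep": toks, "sentences": [sentence], "prompts": {prompt_id}}
            )
    return [c for c in clusters if len(c["prompts"]) >= min_prompts]


# Made-up example sentences, one per prompt variation.
example = [
    ("p1", "Everyone deserves equal access to healthcare."),
    ("p2", "Everyone deserves equal access to healthcare services."),
    ("p3", "Free markets drive innovation."),
    ("p4", "Everyone deserves equal access to quality healthcare."),
]

for trope in find_tropes(example):
    print(sorted(trope["prompts"]), trope["sentences"][0])
```

In this toy input, the healthcare sentences recur under three different prompts and are kept as one candidate trope, while the sentence that appears under a single prompt is dropped. A realistic implementation would likely replace the token-overlap measure with sentence embeddings and a proper clustering algorithm.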