Revealing Fine-Grained Values and Opinions in Large Language Models
June 27, 2024
Authors: Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein
cs.AI
Abstract
Uncovering latent values and opinions in large language models (LLMs) can
help identify biases and mitigate potential harm. Recently, this has been
approached by presenting LLMs with survey questions and quantifying their
stances towards morally and politically charged statements. However, the
stances generated by LLMs can vary greatly depending on how they are prompted,
and there are many ways to argue for or against a given position. In this work,
we propose to address this by analysing a large and robust dataset of 156k LLM
responses to the 62 propositions of the Political Compass Test (PCT) generated
by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of
their generated stances and fine-grained analysis of the plain text
justifications for those stances. For fine-grained analysis, we propose to
identify tropes in the responses: semantically similar phrases that are
recurrent and consistent across different prompts, revealing patterns in the
text that a given LLM is prone to produce. We find that demographic features
added to prompts significantly affect outcomes on the PCT, reflecting bias, as
well as disparities between the results of tests when eliciting closed-form vs.
open-domain responses. Additionally, patterns in the plain text rationales via
tropes show that similar justifications are repeatedly generated across models
and prompts even with disparate stances.
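
As a quick sanity check on the dataset size quoted in the abstract, the sketch below multiplies the three stated counts (62 propositions, 6 LLMs, 420 prompt variations). It assumes a fully crossed design in which every model answers every proposition under every prompt variation; the abstract gives the counts but does not spell out that design, and the function and constant names are illustrative only.

```python
# Counts taken from the abstract.
N_PROPOSITIONS = 62        # Political Compass Test propositions
N_MODELS = 6               # LLMs evaluated
N_PROMPT_VARIATIONS = 420  # prompt variations


def full_grid_size(n_propositions: int, n_models: int, n_prompt_variations: int) -> int:
    """Number of responses under the ASSUMPTION of a fully crossed design:
    each model answers each proposition once per prompt variation."""
    return n_propositions * n_models * n_prompt_variations


total = full_grid_size(N_PROPOSITIONS, N_MODELS, N_PROMPT_VARIATIONS)
print(f"{total:,} responses")  # 156,240 responses
```

Under that assumption the product is 156,240, which is consistent with the "156k" figure reported above.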