Revealing Fine-Grained Values and Opinions in Large Language Models

Abstract

Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT), generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of the generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, and that test results differ depending on whether closed-form or open-domain responses are elicited. Additionally, the tropes found in the plain text rationales show that similar justifications are repeatedly generated across models and prompts, even when the stances differ.
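As a rough consistency check on the reported dataset size, the following minimal sketch enumerates the response grid. It assumes exactly one response per (model, prompt variation, proposition) combination, which the abstract does not state explicitly; the identifiers `model_0`, `prompt_0`, and so on are hypothetical placeholders, not the actual models or prompts.

```python
from itertools import product

# Counts taken from the abstract.
N_MODELS = 6
N_PROMPT_VARIATIONS = 420
N_PROPOSITIONS = 62

# Placeholder identifiers (hypothetical; the real names are not given here).
models = [f"model_{i}" for i in range(N_MODELS)]
prompt_variations = [f"prompt_{i}" for i in range(N_PROMPT_VARIATIONS)]
propositions = [f"pct_{i}" for i in range(N_PROPOSITIONS)]


def response_grid():
    """Yield every (model, prompt variation, proposition) combination.

    Assumption: one response is collected per combination.
    """
    return product(models, prompt_variations, propositions)


# Count the combinations without materialising the whole grid.
total = sum(1 for _ in response_grid())
print(total)
```

Under that assumption the grid contains 6 × 420 × 62 = 156,240 combinations, which is consistent with the figure of 156k responses.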
