B-score: Detecting biases in large language models using response history

Abstract

Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs can output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases in answers to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.
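The abstract does not state how B-score is computed. As a minimal illustration, the sketch below assumes that the B-score of an answer is the difference between how often the model gives that answer across independent single-turn queries and how often it gives it across the turns of a multi-turn conversation in which it can see its earlier answers. The function names and the sample answers are hypothetical and are not taken from the released code.

```python
from collections import Counter


def answer_frequencies(answers):
    """Return the empirical probability of each distinct answer in a list."""
    total = len(answers)
    if total == 0:
        return {}
    counts = Counter(answers)
    return {answer: count / total for answer, count in counts.items()}


def b_score(single_turn_answers, multi_turn_answers, answer):
    """Assumed definition: P_single(answer) - P_multi(answer).

    A large positive value means the answer is chosen far more often
    without response history than with it, which this sketch treats as
    a sign of bias toward that answer.
    """
    p_single = answer_frequencies(single_turn_answers).get(answer, 0.0)
    p_multi = answer_frequencies(multi_turn_answers).get(answer, 0.0)
    return p_single - p_multi


# Made-up data for a Random question ("pick a digit from 0 to 9"):
# ten independent single-turn queries, heavily skewed toward "7" ...
single = ["7"] * 7 + ["3", "4", "8"]
# ... and ten turns of one conversation where each digit appears once.
multi = [str(digit) for digit in range(10)]

score_for_7 = b_score(single, multi, "7")
score_for_3 = b_score(single, multi, "3")
```

Under this assumed definition, an answer that dominates single-turn sampling but is spread out once the model sees its own history receives a high score, while an answer chosen equally often in both settings scores near zero.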
