
B-score: Detecting Biases in Large Language Models Using Response History

May 24, 2025
Authors: An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen
cs.AI

Abstract

Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.
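
To make the metric concrete, below is a minimal Python sketch of a B-score-style computation. The exact formula is an assumption inferred from the abstract (the paper's definition may differ): we take an answer's B-score to be its single-turn frequency minus its multi-turn frequency, so a large positive value flags an answer the model favors only when it cannot see its own response history. The "pick a random digit" prompt and the sampled answers are hypothetical illustrations, not data from the paper.

```python
from collections import Counter

def answer_frequency(answers: list[str], target: str) -> float:
    """Fraction of responses equal to `target`."""
    if not answers:
        return 0.0
    return Counter(answers)[target] / len(answers)

def b_score(single_turn: list[str], multi_turn: list[str], answer: str) -> float:
    """Assumed form of B-score: single-turn frequency minus multi-turn
    frequency. A large positive value suggests the model favors `answer`
    only when it cannot observe its own prior answers, i.e., a bias."""
    return answer_frequency(single_turn, answer) - answer_frequency(multi_turn, answer)

# Hypothetical example: a model asked "Pick a random digit 0-9" ten times,
# first in independent single-turn queries, then across turns of one
# multi-turn conversation where it sees its previous answers.
single = ["7"] * 9 + ["3"]                                    # biased toward 7
multi  = ["7", "3", "0", "9", "2", "5", "8", "1", "4", "6"]   # self-debiased
print(b_score(single, multi, "7"))  # 0.9 - 0.1 = 0.8 -> strong bias toward 7
```

Under this assumed definition, an unbiased answer would score near zero, which is what makes thresholding on B-score usable for accepting or rejecting an LLM's answers.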
