B-score: Detecting biases in large language models using response history

Abstract

Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs can output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases in responses to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.
