Beyond the Surface: Measuring Self-Preference in LLM Judgments

Abstract

Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges: they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias as the difference between the scores a judge model assigns to its own responses and the scores it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, because higher-quality responses from the judge model can also produce positive score differences even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores the judge model assigns to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
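
The contrast between the existing score-difference measure and the DBG score can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the numeric score scale, and the use of a simple mean to aggregate over responses are all assumptions made for illustration, since the abstract only states that each measure is a difference between scores.

```python
from statistics import mean


def naive_self_preference(own_scores, other_scores):
    """Existing-style measure (assumed aggregation: difference of means).

    own_scores:   judge's scores for its own responses
    other_scores: judge's scores for other models' responses
    """
    return mean(own_scores) - mean(other_scores)


def dbg_score(judge_scores_on_own, gold_scores_on_own):
    """DBG-style measure (assumed aggregation: mean per-response gap).

    Compares the judge's score for each of its own responses with the
    gold judgment for that same response.
    """
    if len(judge_scores_on_own) != len(gold_scores_on_own):
        raise ValueError("each judge score needs a matching gold judgment")
    gaps = [j - g for j, g in zip(judge_scores_on_own, gold_scores_on_own)]
    return mean(gaps)


# Hypothetical scores: the judge's own responses really are better,
# and the judge scores them exactly as the gold judgments do.
judge_on_own = [8, 9, 7]
gold_on_own = [8, 9, 7]
judge_on_others = [6, 6, 6]

print(naive_self_preference(judge_on_own, judge_on_others))  # positive gap
print(dbg_score(judge_on_own, gold_on_own))                  # no gap
```

With these made-up numbers the score-difference measure is positive even though the judge agrees with the gold judgments on every one of its own responses, while the DBG-style measure is zero. This is the confound the abstract describes.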
