Beyond the Surface: Measuring Self-Preference in LLM Judgments
June 3, 2025
Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
cs.AI
Abstract
Recent studies show that large language models (LLMs) exhibit self-preference
bias when serving as judges, meaning they tend to favor their own responses
over those generated by other models. Existing methods typically measure this
bias by calculating the difference between the scores a judge model assigns to
its own responses and those it assigns to responses from other models. However,
this approach conflates self-preference bias with response quality, as
higher-quality responses from the judge model may also lead to positive score
differences, even in the absence of bias. To address this issue, we introduce
gold judgments as proxies for the actual quality of responses and propose the
DBG score, which measures self-preference bias as the difference between the
scores assigned by the judge model to its own responses and the corresponding
gold judgments. Since gold judgments reflect true response quality, the DBG
score mitigates the confounding effect of response quality on bias measurement.
Using the DBG score, we conduct comprehensive experiments to assess
self-preference bias across LLMs of varying versions, sizes, and reasoning
abilities. Additionally, we investigate two factors that influence and help
alleviate self-preference bias: response text style and the post-training data
of judge models. Finally, we explore potential underlying mechanisms of
self-preference bias from an attention-based perspective. Our code and data are
available at https://github.com/zhiyuanc2001/self-preference.
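As a minimal sketch of the metric described above: the DBG score is the difference between the scores a judge model assigns to its own responses and the corresponding gold-judgment scores, so a positive mean difference indicates inflation beyond true quality. The function name and the flat lists of per-response scores are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
def dbg_score(judge_scores, gold_scores):
    """Mean difference between the judge model's scores for its own
    responses and the gold-judgment scores for the same responses.

    A positive value suggests the judge rates its own responses above
    their true quality, i.e. self-preference bias; values near zero
    suggest little bias. (Illustrative sketch, not the paper's code.)
    """
    if len(judge_scores) != len(gold_scores):
        raise ValueError("score lists must be aligned per response")
    diffs = [j - g for j, g in zip(judge_scores, gold_scores)]
    return sum(diffs) / len(diffs)


# Hypothetical example: the judge scores its own three responses
# (9, 8, 7) while the gold judgments are (7, 8, 6).
print(dbg_score([9, 8, 7], [7, 8, 6]))  # → 1.0
```

Because the gold judgment anchors each comparison to true response quality, a high-quality self-response no longer produces a spurious positive difference, which is the confound the abstract identifies in score-gap measurements.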