Beyond the Surface: Measuring Self-Preference in LLM Judgments
June 3, 2025
Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
cs.AI
Abstract
Recent studies show that large language models (LLMs) exhibit self-preference
bias when serving as judges, meaning they tend to favor their own responses
over those generated by other models. Existing methods typically measure this
bias by calculating the difference between the scores a judge model assigns to
its own responses and those it assigns to responses from other models. However,
this approach conflates self-preference bias with response quality, as
higher-quality responses from the judge model may also lead to positive score
differences, even in the absence of bias. To address this issue, we introduce
gold judgments as proxies for the actual quality of responses and propose the
DBG score, which measures self-preference bias as the difference between the
scores assigned by the judge model to its own responses and the corresponding
gold judgments. Since gold judgments reflect true response quality, the DBG
score mitigates the confounding effect of response quality on bias measurement.
Using the DBG score, we conduct comprehensive experiments to assess
self-preference bias across LLMs of varying versions, sizes, and reasoning
abilities. Additionally, we investigate two factors that influence and help
alleviate self-preference bias: response text style and the post-training data
of judge models. Finally, we explore potential underlying mechanisms of
self-preference bias from an attention-based perspective. Our code and data are
available at https://github.com/zhiyuanc2001/self-preference.
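As a minimal sketch of the metric described above: the DBG score is the difference between the scores a judge model assigns to its own responses and the corresponding gold-judgment scores, so a positive mean difference indicates inflation beyond true quality. The function name and the flat lists of per-response scores are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
def dbg_score(judge_scores, gold_scores):
    """Mean difference between the judge model's scores for its own
    responses and the gold-judgment scores for the same responses.

    A positive value suggests the judge rates its own responses above
    their true quality, i.e. self-preference bias; values near zero
    suggest little bias. (Illustrative sketch, not the paper's code.)
    """
    if len(judge_scores) != len(gold_scores):
        raise ValueError("score lists must be aligned per response")
    diffs = [j - g for j, g in zip(judge_scores, gold_scores)]
    return sum(diffs) / len(diffs)


# Hypothetical example: the judge scores its own three responses
# (9, 8, 7) while the gold judgments are (7, 8, 6).
print(dbg_score([9, 8, 7], [7, 8, 6]))  # → 1.0
```

Because the gold judgment anchors each comparison to true response quality, a high-quality self-response no longer produces a spurious positive difference, which is the confound the abstract identifies in score-gap measurements.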