Beyond the Surface: Measuring Self-Preference in LLM Judgments
June 3, 2025
Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
cs.AI
Abstract
Recent studies show that large language models (LLMs) exhibit self-preference
bias when serving as judges, meaning they tend to favor their own responses
over those generated by other models. Existing methods typically measure this
bias by calculating the difference between the scores a judge model assigns to
its own responses and those it assigns to responses from other models. However,
this approach conflates self-preference bias with response quality, as
higher-quality responses from the judge model may also lead to positive score
differences, even in the absence of bias. To address this issue, we introduce
gold judgments as proxies for the actual quality of responses and propose the
DBG score, which measures self-preference bias as the difference between the
scores assigned by the judge model to its own responses and the corresponding
gold judgments. Since gold judgments reflect true response quality, the DBG
score mitigates the confounding effect of response quality on bias measurement.
Using the DBG score, we conduct comprehensive experiments to assess
self-preference bias across LLMs of varying versions, sizes, and reasoning
abilities. Additionally, we investigate two factors that influence and help
alleviate self-preference bias: response text style and the post-training data
of judge models. Finally, we explore potential underlying mechanisms of
self-preference bias from an attention-based perspective. Our code and data are
available at https://github.com/zhiyuanc2001/self-preference.
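The DBG score described above can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract, not the paper's implementation: the function name, the mean aggregation, and the score inputs are all hypothetical, and the paper's exact formulation may differ.

```python
def dbg_score(self_scores, gold_scores):
    """Sketch of a Difference-from-Gold (DBG-style) bias score.

    self_scores: scores the judge model assigns to its own responses.
    gold_scores: gold judgments serving as proxies for true response quality.
    A positive value suggests the judge rates its own responses above
    their gold-judged quality, i.e. self-preference bias; averaging over
    examples is an assumption made for this illustration.
    """
    if len(self_scores) != len(gold_scores):
        raise ValueError("score lists must be aligned per response")
    diffs = [s - g for s, g in zip(self_scores, gold_scores)]
    return sum(diffs) / len(diffs)
```

Under this reading, a naive self-vs-other score gap would conflate bias with quality, whereas comparing against gold judgments anchors each response to its actual quality before measuring the judge's deviation.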