Beyond the Surface: Measuring Self-Preference in LLM Judgments
June 3, 2025
Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
cs.AI
Abstract
Recent studies show that large language models (LLMs) exhibit self-preference
bias when serving as judges, meaning they tend to favor their own responses
over those generated by other models. Existing methods typically measure this
bias by calculating the difference between the scores a judge model assigns to
its own responses and those it assigns to responses from other models. However,
this approach conflates self-preference bias with response quality, as
higher-quality responses from the judge model may also lead to positive score
differences, even in the absence of bias. To address this issue, we introduce
gold judgments as proxies for the actual quality of responses and propose the
DBG score, which measures self-preference bias as the difference between the
scores assigned by the judge model to its own responses and the corresponding
gold judgments. Since gold judgments reflect true response quality, the DBG
score mitigates the confounding effect of response quality on bias measurement.
Using the DBG score, we conduct comprehensive experiments to assess
self-preference bias across LLMs of varying versions, sizes, and reasoning
abilities. Additionally, we investigate two factors that influence and help
alleviate self-preference bias: response text style and the post-training data
of judge models. Finally, we explore potential underlying mechanisms of
self-preference bias from an attention-based perspective. Our code and data are
available at https://github.com/zhiyuanc2001/self-preference.
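The DBG score described above can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract, not the paper's implementation: the function name, the mean aggregation, and the score inputs are all hypothetical, and the paper's exact formulation may differ.

```python
def dbg_score(self_scores, gold_scores):
    """Sketch of a Difference-from-Gold (DBG-style) bias score.

    self_scores: scores the judge model assigns to its own responses.
    gold_scores: gold judgments serving as proxies for true response quality.
    A positive value suggests the judge rates its own responses above
    their gold-judged quality, i.e. self-preference bias; averaging over
    examples is an assumption made for this illustration.
    """
    if len(self_scores) != len(gold_scores):
        raise ValueError("score lists must be aligned per response")
    diffs = [s - g for s, g in zip(self_scores, gold_scores)]
    return sum(diffs) / len(diffs)
```

Under this reading, a naive self-vs-other score gap would conflate bias with quality, whereas comparing against gold judgments anchors each response to its actual quality before measuring the judge's deviation.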