

Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

February 7, 2026
作者: Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi
cs.AI

Abstract

Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges exhibit length and order biases, among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity (as measured by ROUGE and BLEU) between the judged summaries decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
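To make the analysis concrete, the sketch below shows one way to measure candidate-reference overlap with ROUGE-L and BLEU and tally how often a judge prefers the LLM summary within each overlap bin. This is a minimal illustration, not the authors' code: the judge callback, the averaged overlap score, the bin edges, and the data format are all assumptions made here for clarity.

from collections import defaultdict
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_smooth = SmoothingFunction().method1

def overlap(human_summary: str, llm_summary: str) -> float:
    """Simple overlap score: average of ROUGE-L F1 and BLEU (an assumption,
    not necessarily the paper's exact aggregation)."""
    rouge_l = _rouge.score(human_summary, llm_summary)["rougeL"].fmeasure
    bleu = sentence_bleu([human_summary.split()], llm_summary.split(),
                         smoothing_function=_smooth)
    return (rouge_l + bleu) / 2

def preference_by_overlap(pairs, judge, n_bins: int = 5):
    """pairs: iterable of (human_summary, llm_summary) tuples.
    judge(human, llm) -> True if the LLM judge prefers the LLM summary
    (a hypothetical callback wrapping whatever judge prompt is used).
    Returns {bin_index: fraction of LLM wins in that overlap bin}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for human, llm in pairs:
        b = min(int(overlap(human, llm) * n_bins), n_bins - 1)
        totals[b] += 1
        if judge(human, llm):
            wins[b] += 1
    return {b: wins[b] / totals[b] for b in sorted(totals)}

Plotting the returned win rates against the bin index would reproduce the kind of trend the paper reports: under its findings, the LLM-win fraction should rise as the overlap bin decreases.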