

Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

February 7, 2026
作者: Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi
cs.AI

Abstract

Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit biases related to length and order, among others, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a granular level in relation to a well-defined overlap metric. In this work, we analyze LLM judge bias as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity (measured by ROUGE and BLEU) between the judged summaries decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.
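The overlap metrics the abstract refers to (ROUGE and BLEU) are n-gram matching scores between two texts. As a minimal illustration of the kind of lexical-overlap signal involved (not the paper's implementation, which presumably uses standard ROUGE/BLEU tooling), the sketch below computes ROUGE-N recall between two summaries; low scores correspond to the low-overlap regime where the paper reports judge bias growing.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    def ngrams(text: str, n: int) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    # Clipped match count: each reference n-gram matches at most as many
    # times as it appears in the candidate.
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    return overlap / sum(ref.values())

# Two hypothetical summaries of the same article with low lexical overlap:
human = "the senate passed the bill after a lengthy debate"
llm = "lawmakers approved the legislation following extended discussion"
print(rouge_n(human, llm))  # near zero: paraphrased content, little overlap
```

Identical texts score 1.0, while paraphrases like the pair above score near zero even though they convey the same content, which is exactly why purely lexical metrics are paired with (or replaced by) LLM judges in the first place.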