Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
August 25, 2025
Authors: Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
cs.AI
Abstract
Evaluating natural language generation (NLG) systems remains a core challenge
of natural language processing (NLP), further complicated by the rise of large
language models (LLMs) that aim to be general-purpose. Recently, large
language models as judges (LLJs) have emerged as a promising alternative to
traditional metrics, but their validity remains underexplored. This position
paper argues that the current enthusiasm around LLJs may be premature, as their
adoption has outpaced rigorous scrutiny of their reliability and validity as
evaluators. Drawing on measurement theory from the social sciences, we identify
and critically assess four core assumptions underlying the use of LLJs: their
ability to act as proxies for human judgment, their capabilities as evaluators,
their scalability, and their cost-effectiveness. We examine how each of these
assumptions may be challenged by the inherent limitations of LLMs, LLJs, or
current practices in NLG evaluation. To ground our analysis, we explore three
applications of LLJs: text summarization, data annotation, and safety
alignment. Finally, we highlight the need for more responsible practices in
the evaluation of LLJs, to ensure that their growing role in the field
supports, rather than undermines, progress in NLG.