Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
August 25, 2025
Authors: Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
cs.AI
Abstract
Evaluating natural language generation (NLG) systems remains a core challenge
of natural language processing (NLP), further complicated by the rise of large
language models (LLMs) that aim to be general-purpose. Recently, large
language models as judges (LLJs) have emerged as a promising alternative to
traditional metrics, but their validity remains underexplored. This position
paper argues that the current enthusiasm around LLJs may be premature, as their
adoption has outpaced rigorous scrutiny of their reliability and validity as
evaluators. Drawing on measurement theory from the social sciences, we identify
and critically assess four core assumptions underlying the use of LLJs: their
ability to act as proxies for human judgment, their capabilities as evaluators,
their scalability, and their cost-effectiveness. We examine how each of these
assumptions may be challenged by the inherent limitations of LLMs, LLJs, or
current practices in NLG evaluation. To ground our analysis, we explore three
applications of LLJs: text summarization, data annotation, and safety
alignment. Finally, we highlight the need for more responsible practices in
LLJ evaluation, to ensure that their growing role in the field supports,
rather than undermines, progress in NLG.