Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Abstract

Evaluating natural language generation (NLG) systems remains a core challenge in natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ-based evaluation, to ensure that the growing role of LLJs in the field supports, rather than undermines, progress in NLG.