
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

September 14, 2023
Authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators, which can rank or score the output of other models (usually LLMs), has become increasingly popular due to the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, the cost of evaluation, and limited access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks, and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores, should be used with caution, and should always be calibrated with a dataset of native-speaker judgments, particularly in low-resource and non-Latin script languages.
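For illustration, here is a minimal sketch (not the authors' released code) of the kind of per-language calibration check the abstract describes: comparing an LLM evaluator's scores on a metric against native-speaker judgments of the same outputs, looking both at a systematic score gap (the upward bias the paper reports) and at rank agreement. The function name, score scale, and sample data below are hypothetical.

```python
# Minimal sketch, assuming scores are collected on a shared 1-5 scale for
# the same set of model outputs. Not the paper's actual pipeline.
from statistics import mean
from scipy.stats import spearmanr  # rank correlation between two score lists

def calibration_report(llm_scores, human_scores):
    """Compare LLM-evaluator scores with native-speaker judgments.

    Returns the mean score gap (positive => the LLM scores higher than humans,
    i.e. the bias toward higher scores) and the Spearman rank correlation
    (whether the two rank the outputs similarly).
    """
    assert len(llm_scores) == len(human_scores)
    gap = mean(llm_scores) - mean(human_scores)
    rho, p_value = spearmanr(llm_scores, human_scores)
    return {"mean_score_gap": gap, "spearman_rho": rho, "p_value": p_value}

# Hypothetical usage for one language and one metric:
llm_scores = [5, 4, 5, 5, 3, 4]     # scores assigned by the LLM evaluator
human_scores = [4, 4, 3, 5, 2, 3]   # scores from native-speaker annotators
print(calibration_report(llm_scores, human_scores))
```

Running such a check separately for each language and metric is one way to decide, before trusting an LLM-based evaluator at scale, whether its scores track native-speaker judgments or merely inflate them.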