

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

September 14, 2023
Authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, the cost of evaluation, and limited access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks, and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit a bias towards higher scores and should be used with caution, always calibrated against a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
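To make the calibration setup concrete, the sketch below (not from the paper; the prompt wording, score scale, and function names are illustrative assumptions) shows one way an LLM could be asked to score a generated output on a metric such as fluency, and how its scores might then be compared against native-speaker judgments using Spearman correlation, which is one common choice for this kind of agreement check.

```python
# Illustrative sketch only: the rubric prompt, 1-5 scale, and call_llm_evaluator
# are assumptions for demonstration, not the paper's actual protocol.
from scipy.stats import spearmanr


def build_eval_prompt(source: str, output: str, metric: str) -> str:
    """Assemble a simple rubric-style prompt asking the LLM to rate one output."""
    return (
        f"Rate the {metric} of the following output on a scale of 1 to 5.\n"
        f"Input: {source}\n"
        f"Output: {output}\n"
        f"Respond with a single integer."
    )


def call_llm_evaluator(prompt: str) -> int:
    """Placeholder: send `prompt` to an LLM of your choice and parse the 1-5 score."""
    raise NotImplementedError("Plug in your LLM API call here.")


def calibrate(examples, human_scores, metric="fluency"):
    """Score each (source, output) pair with the LLM evaluator and compare the
    resulting scores with human judgments for the same metric via Spearman's rho."""
    llm_scores = [
        call_llm_evaluator(build_eval_prompt(src, out, metric))
        for src, out in examples
    ]
    rho, p_value = spearmanr(llm_scores, human_scores)
    return llm_scores, rho, p_value
```

A low correlation, or systematically higher LLM scores than human scores, would be the kind of signal the paper's findings warn about when relying on LLM-based evaluators without calibration.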