Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks such as Question Answering, Summarization, and Classification. Using LLMs as evaluators that rank or score the output of other models (usually LLMs) has become increasingly popular because of the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, cost, and limited access to human annotators. While LLMs can handle approximately 100 languages, most languages beyond the top 20 lack systematic evaluation across tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation so that LLM performance across diverse languages is understood accurately. LLM-based evaluators seem like the perfect solution to this problem: they do not require human annotators, human-created references, or benchmarks, and can in theory evaluate any language the LLM covers. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit a bias towards higher scores. They should be used with caution and should always be calibrated against a dataset of native-speaker judgments, particularly in low-resource and non-Latin-script languages.
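As a rough illustration of what calibrating an LLM-based evaluator against human judgments can involve, the sketch below compares LLM scores with human scores per language. It is a minimal example under stated assumptions, not the procedure used in the paper: the function names, the 0 to 2 scoring scale, the language labels, and all scores are hypothetical and invented for illustration.

```python
def percentage_agreement(llm_scores, human_scores):
    """Fraction of items on which the LLM score equals the human score."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(l == h for l, h in zip(llm_scores, human_scores))
    return matches / len(llm_scores)


def mean_score_gap(llm_scores, human_scores):
    """Mean of (LLM score - human score); a positive value means the
    LLM evaluator scores higher than the human annotators on average."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(l - h for l, h in zip(llm_scores, human_scores)) / len(llm_scores)


# Invented toy scores on a hypothetical 0-2 scale; NOT data from the paper.
toy_judgments = {
    "language_a": {"llm": [2, 2, 1, 2], "human": [2, 2, 1, 2]},
    "language_b": {"llm": [2, 2, 2, 2], "human": [1, 2, 0, 2]},
}

for language, scores in toy_judgments.items():
    agreement = percentage_agreement(scores["llm"], scores["human"])
    gap = mean_score_gap(scores["llm"], scores["human"])
    print(f"{language}: agreement={agreement:.2f}, mean gap={gap:+.2f}")
```

In this toy setup, a language with low agreement and a positive mean gap would be one where the evaluator appears to inflate scores, which is the kind of pattern a native-speaker judgment set is meant to expose.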
