Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, leaving a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities, along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments show that Hercule aligns more closely with human judgments than proprietary models do, demonstrating the effectiveness of such cross-lingual evaluation in low-resource scenarios. It is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
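To make the idea of meta-evaluation concrete, the following is a minimal sketch of how agreement between an evaluator LLM's scores and human judgment scores might be measured. The abstract does not name the agreement metric, so Pearson correlation is an assumption made here for illustration. The function name `pearson`, the 1-to-5 score scale, and the toy score lists are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    # Co-variation of the two score lists around their means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: scores given to the same five responses
# by human annotators and by an evaluator LLM (assumed 1-5 scale).
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1, 3, 3, 4, 5]

agreement = pearson(human_scores, evaluator_scores)
print(f"agreement with human judgments: {agreement:.3f}")
```

A value closer to 1 would indicate that the evaluator ranks and scores responses much as the human annotators do. For ordinal scores, a rank-based or chance-corrected agreement statistic may be a more appropriate choice than Pearson correlation.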
