Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

October 17, 2024
Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
cs.AI

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a substantial gap in multilingual evaluation frameworks. We introduce the Cross-Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities, along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments than proprietary models, showing the effectiveness of such cross-lingual evaluation in low-resource scenarios. It is also effective for zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
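
The abstract describes reference-based cross-lingual evaluation: a judge model scores a response written in the target language against a reference answer that is available only in English. The sketch below illustrates one way such an evaluation call might be wired up; the prompt wording, 1-5 rubric, and helper names are assumptions made for illustration and are not the actual CIA/Hercule interface.

```python
# Minimal sketch of cross-lingual, reference-based LLM evaluation.
# Assumptions: a 1-5 scoring rubric and a user-supplied `judge` callable
# (any text-in/text-out LLM); these are illustrative, not the paper's API.
import re
from typing import Callable, Optional

RUBRIC = (
    "Score the response from 1 (poor) to 5 (excellent) for how well it "
    "answers the instruction, using the English reference answer as the "
    "ground truth even though the response is in another language."
)

def build_eval_prompt(instruction: str, response: str, english_reference: str) -> str:
    """Assemble a single evaluation prompt for the judge LLM."""
    return (
        f"{RUBRIC}\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response (target language):\n{response}\n\n"
        f"Reference answer (English):\n{english_reference}\n\n"
        "Feedback and final score (format: 'Score: N'):"
    )

def parse_score(judge_output: str) -> Optional[int]:
    """Extract the integer score (1-5) from the judge's free-form output."""
    match = re.search(r"Score:\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None

def evaluate(judge: Callable[[str], str], instruction: str,
             response: str, english_reference: str) -> Optional[int]:
    """Run one cross-lingual evaluation with an English reference answer."""
    prompt = build_eval_prompt(instruction, response, english_reference)
    return parse_score(judge(prompt))
```

In this setup, only the reference answer must exist in English; the instruction and the evaluated response can be in any target language, which mirrors the low-resource scenario the paper targets.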
