MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
May 20, 2025
Authors: Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
cs.AI
Abstract
Evaluating the text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high-, mid-, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
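As a rough illustration of the proxy idea described in the abstract (not the authors' released code), the sketch below computes a per-language task success rate and checks its Pearson correlation against an existing benchmark. All names, languages, and scores here are hypothetical placeholders.

```python
import statistics

def task_success_rate(outcomes: list[bool]) -> float:
    """Proxy score: fraction of conversational tasks completed successfully."""
    return sum(outcomes) / len(outcomes)

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between proxy scores and an established benchmark."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-language outcomes: each list marks which conversational
# tasks a model solved in that language (True = task completed successfully).
outcomes_by_language = {
    "ko": [True, True, False, True],
    "sw": [True, False, False, True],
    "am": [False, False, True, False],
}
mug_eval_scores = [task_success_rate(v) for v in outcomes_by_language.values()]

# Hypothetical scores from an existing benchmark for the same languages,
# used to check how closely the proxy tracks established measures.
benchmark_scores = [0.81, 0.57, 0.24]
print(f"r = {pearson_r(mug_eval_scores, benchmark_scores):.2f}")
```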