MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

May 20, 2025
Authors: Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh
cs.AI

Abstract

Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy of successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
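The abstract's core idea, using task success on target-language conversational tasks as a proxy for generation quality, can be sketched as a simple scoring loop. The sketch below is a minimal illustration only: the function `task_success_rate`, the `run_conversation` callback, and the task format are hypothetical and are not taken from the MUG-Eval paper or any released code.

```python
# Minimal sketch (assumed, not from the paper): score a model on a set of
# conversational tasks in one target language and report the success rate,
# which the paper treats as a proxy for generation capability.
from typing import Callable, Iterable


def task_success_rate(
    tasks: Iterable[dict],
    run_conversation: Callable[[dict, str], bool],  # hypothetical runner: True if the task succeeds
    language: str,
) -> float:
    """Fraction of conversational tasks completed successfully in `language`."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    successes = sum(run_conversation(task, language) for task in tasks)
    return successes / len(tasks)


# Hypothetical usage: repeat over (model, language) pairs on the same tasks
# to obtain standardized, comparable scores across languages and models.
# score = task_success_rate(benchmark_tasks, make_runner(model), "swahili")
```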
