

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

February 11, 2025
Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
cs.AI

Abstract

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long-context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address this disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotated each sample across all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
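The multi-way parallel design described above is what makes cross-language comparison fair: every sample exists in all 17 languages, so per-language score differences reflect language ability rather than differences in task mix or difficulty. The sketch below illustrates that evaluation loop in Python; it is a minimal illustration only, and all names in it (SAMPLES, run_model, evaluate_sample) are hypothetical stand-ins, not the official BenchMAX code or data format.

```python
# Illustrative sketch of multi-way parallel evaluation (not the BenchMAX API).
# Each record carries the same sample rendered in every language, so the
# per-language accuracies computed at the end are directly comparable.
from collections import defaultdict

LANGUAGES = ["en", "zh", "es", "fr", "de", "ru", "ja", "ar"]  # subset of the 17

# Hypothetical multi-way parallel data: one record per sample, with the same
# task rendered in every language.
SAMPLES = [
    {
        "task": "instruction_following",
        "prompts": {lang: f"<prompt in {lang}>" for lang in LANGUAGES},
        "reference": "<expected output>",
    },
]

def run_model(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real inference client."""
    return "<model output>"

def evaluate_sample(model_output: str, reference: str) -> bool:
    """Hypothetical task-specific checker (e.g., constraint verification)."""
    return model_output == reference

scores = defaultdict(list)
for sample in SAMPLES:
    for lang, prompt in sample["prompts"].items():
        output = run_model(prompt)
        scores[lang].append(evaluate_sample(output, sample["reference"]))

# Because the samples are parallel, gaps between these numbers reflect
# language ability rather than task difficulty.
for lang in LANGUAGES:
    accuracy = sum(scores[lang]) / len(scores[lang])
    print(f"{lang}: {accuracy:.2%}")
```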

