BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Abstract

Previous multilingual benchmarks focus primarily on simple understanding tasks, whereas for large language models (LLMs) we emphasize proficiency in instruction following, reasoning, long-context understanding, code generation, and so on. However, measuring these advanced capabilities across languages remains underexplored. To address this gap, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample in every task after the data has been machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal that the effectiveness of core capabilities varies across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
