A Survey on Large Language Model Benchmarks

Abstract

As large language models have rapidly grown in both the depth and breadth of their capabilities, a correspondingly wide range of evaluation benchmarks has emerged. As quantitative tools for assessing model performance, benchmarks are not only the core means of measuring model capabilities but also a key factor in guiding model development and driving technological innovation. This paper presents the first systematic review of the current state and development of large language model benchmarks, classifying 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as the natural sciences, the humanities and social sciences, and engineering and technology; target-specific benchmarks address risk, reliability, and agents, among other concerns. We identify several problems with current benchmarks, including inflated scores caused by data contamination, unfair evaluation arising from cultural and linguistic bias, and a lack of evaluation of process credibility and of performance in dynamic environments. We also offer a design paradigm that future benchmark work can draw on.

