U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Abstract

Current evaluation of mathematical skills in LLMs is limited: existing benchmarks are relatively small, focus primarily on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, and 20% of the problems are multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release mu-MATH, a dataset for evaluating LLMs' capabilities in judging solutions. Our evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs: the best LLM judge achieves an F1-score of 80% on mu-MATH.
