A Careful Examination of Large Language Model Performance on Grade School Arithmetic
May 1, 2024
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue
cs.AI
Abstract
Large language models (LLMs) have achieved impressive success on many
benchmarks for mathematical reasoning. However, there is growing concern that
some of this performance actually reflects dataset contamination, where data
closely resembling benchmark questions leaks into the training data, instead of
true reasoning ability. To investigate this claim rigorously, we commission
Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and
complexity of the established GSM8k benchmark, the gold standard for measuring
elementary mathematical reasoning. We ensure that the two benchmarks are
comparable across important metrics such as human solve rates, number of steps
in solution, answer magnitude, and more. When evaluating leading open- and
closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with
several families of models (e.g., Phi and Mistral) showing evidence of
systematic overfitting across almost all model sizes. At the same time, many
models, especially those on the frontier (e.g., Gemini/GPT/Claude), show
minimal signs of overfitting. Further analysis suggests a positive relationship
(Spearman's r^2=0.32) between a model's probability of generating an example
from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that
many models may have partially memorized GSM8k.
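The reported Spearman's r^2 = 0.32 comes from a rank correlation between a per-model memorization proxy (the model's probability of generating GSM8k examples) and that model's GSM8k-to-GSM1k accuracy gap. The sketch below illustrates the computation with made-up numbers; the data values, and the exact choice of memorization proxy, are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the paper's code): Spearman rank correlation
# between a memorization proxy and the GSM8k - GSM1k accuracy gap.
# All numeric values below are hypothetical.

def ranks(values):
    # Assign 1-based ranks by sorted order (no tie handling needed here).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model values:
#   gen_logprob: mean log-probability of generating GSM8k test examples
#                (higher = more likely memorized)
#   gap: GSM8k accuracy minus GSM1k accuracy (higher = more overfit)
gen_logprob = [-2.1, -1.8, -1.5, -1.2, -0.9, -0.7]
gap = [0.01, 0.03, 0.02, 0.08, 0.06, 0.11]

rho = spearman(gen_logprob, gap)
print(f"rho = {rho:.2f}, rho^2 = {rho * rho:.2f}")
```

With this toy data, models that assign higher probability to GSM8k examples tend to show larger accuracy gaps, yielding a strongly positive rho; the paper's reported r^2 = 0.32 corresponds to a weaker but still positive relationship across the real model set.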