A Careful Examination of Large Language Model Performance on Grade School Arithmetic
May 1, 2024
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue
cs.AI
Abstract
Large language models (LLMs) have achieved impressive success on many
benchmarks for mathematical reasoning. However, there is growing concern that
some of this performance actually reflects dataset contamination, where data
closely resembling benchmark questions leaks into the training data, instead of
true reasoning ability. To investigate this claim rigorously, we commission
Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and
complexity of the established GSM8k benchmark, the gold standard for measuring
elementary mathematical reasoning. We ensure that the two benchmarks are
comparable across important metrics such as human solve rates, number of steps
in solution, answer magnitude, and more. When evaluating leading open- and
closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with
several families of models (e.g., Phi and Mistral) showing evidence of
systematic overfitting across almost all model sizes. At the same time, many
models, especially those on the frontier (e.g., Gemini/GPT/Claude), show
minimal signs of overfitting. Further analysis suggests a positive relationship
(Spearman's r^2=0.32) between a model's probability of generating an example
from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that
many models may have partially memorized GSM8k.Summary
AI-Generated Summary
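
As a rough illustration of the memorization analysis mentioned in the abstract (not the authors' code), the sketch below computes Spearman's rank correlation between a proxy for each model's probability of generating GSM8k text and its GSM8k-to-GSM1k accuracy gap. The model names and all numbers are hypothetical placeholders, and scipy is assumed to be available.

```python
# A minimal sketch of the abstract's final analysis: correlating each model's
# probability of reproducing GSM8k examples with its GSM8k-to-GSM1k accuracy
# gap. All values below are illustrative placeholders, not figures from the paper.
from scipy.stats import spearmanr

# Hypothetical per-model measurements:
#   gen_logprob: mean per-token log-probability the model assigns to GSM8k
#                problem/solution text (a proxy for memorization)
#   gap:         accuracy(GSM8k) - accuracy(GSM1k), in percentage points
models = {
    "model_a": {"gen_logprob": -1.20, "gap": 1.5},
    "model_b": {"gen_logprob": -0.85, "gap": 6.2},
    "model_c": {"gen_logprob": -0.60, "gap": 10.8},
    "model_d": {"gen_logprob": -1.45, "gap": -0.4},
}

logprobs = [m["gen_logprob"] for m in models.values()]
gaps = [m["gap"] for m in models.values()]

# Spearman's rank correlation; the paper reports its square, r^2 = 0.32.
r, p_value = spearmanr(logprobs, gaps)
print(f"Spearman r = {r:.2f}, r^2 = {r**2:.2f}, p = {p_value:.3f}")
```

Spearman's correlation is rank-based, so it captures the monotone relationship described in the abstract without assuming the relationship is linear.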