StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
October 10, 2025
作者: Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable advances in
mathematical and logical reasoning, yet statistics, as a distinct and
integrative discipline, remains underexplored in benchmarking efforts. To
address this gap, we introduce StatEval, the first comprehensive
benchmark dedicated to statistics, spanning both breadth and depth across
difficulty levels. StatEval consists of 13,817 foundational problems covering
undergraduate and graduate curricula, together with 2,374 research-level proof
tasks extracted from leading journals. To construct the benchmark, we design a
scalable multi-agent pipeline with human-in-the-loop validation that automates
large-scale problem extraction, rewriting, and quality control, while ensuring
academic rigor. We further propose a robust evaluation framework tailored to
both computational and proof-based tasks, enabling fine-grained assessment of
reasoning ability. Experimental results reveal that even closed-source models
such as GPT5-mini score below 57% on research-level problems, while
open-source models perform significantly lower. These findings highlight the
unique challenges of statistical reasoning and the limitations of current LLMs.
We expect StatEval to serve as a rigorous benchmark for advancing statistical
intelligence in large language models. All data and code are available on our
web platform: https://stateval.github.io/.
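
The abstract describes an evaluation framework tailored to both computational and proof-based tasks. Purely as an illustrative sketch (not the authors' implementation, which is available at the link above), the snippet below shows one way such a two-track grader could be organized: numeric-tolerance matching for computational answers and a placeholder rubric-based judge for proofs. All names here (Problem, grade_computational, grade_proof) are hypothetical.

```python
# Hypothetical two-track grader in the spirit of StatEval's evaluation
# framework; every identifier is illustrative, not from the paper's code.
from dataclasses import dataclass
import math
import re


@dataclass
class Problem:
    task_type: str     # "computational" or "proof"
    reference: str     # gold answer or reference proof
    model_output: str  # raw model response


def extract_final_number(text: str) -> float | None:
    """Pull the last numeric literal from a free-form response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", text)
    return float(matches[-1]) if matches else None


def grade_computational(problem: Problem, rel_tol: float = 1e-4) -> bool:
    """Tolerance match on the final numeric answer."""
    pred = extract_final_number(problem.model_output)
    gold = extract_final_number(problem.reference)
    if pred is None or gold is None:
        return False
    return math.isclose(pred, gold, rel_tol=rel_tol)


def grade_proof(problem: Problem) -> float:
    """Placeholder for a rubric-based judge of proof correctness.

    A real pipeline would prompt a judge model with the reference proof
    and a step-level rubric and return a score in [0, 1]; here we only
    return a dummy value to keep the sketch self-contained.
    """
    return 0.0


def grade(problem: Problem) -> float:
    if problem.task_type == "computational":
        return float(grade_computational(problem))
    return grade_proof(problem)


if __name__ == "__main__":
    p = Problem(
        task_type="computational",
        reference="Answer: 0.3085",
        model_output="The probability is approximately 0.3085.",
    )
    print(grade(p))  # 1.0
```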