SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
September 24, 2025
Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
cs.AI
Abstract
Large language models (LLMs) now perform strongly on many public math suites,
yet frontier separation within mathematics increasingly suffers from ceiling
effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a
100-item, structure-aware diagnostic set with per-item metadata on length,
numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item
contest-style suite spanning four stages, from high school to doctoral, under a
seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a
single setup and analyze subject × model and grade × model performance. On the
contest suite, the strongest model reaches 44% accuracy while the runner-up reaches 37%;
accuracy declines from high school to doctoral, and top systems exhibit a
doctoral-to-high-school retention rate near 79%. On the reasoning set, the best
model attains 81% overall accuracy, and hardest-slice results reveal clear robustness
gaps between leaders and the mid-tier. In summary, we release
SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH;
together, SKYLENAGE provides a difficult, reasoning-centered, broad-coverage
math benchmark with calibrated difficulty and rich metadata, serving as a
reference standard for future evaluations of mathematical reasoning.
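
As an illustration of the per-item metadata described in the abstract (length, numeric density, and symbolic complexity), below is a minimal sketch of what a SKYLENAGE-ReasoningMATH record could look like. The field names and types (item_id, length_tokens, numeric_density, symbolic_complexity) are hypothetical placeholders, not the report's actual schema:

    from dataclasses import dataclass

    @dataclass
    class ReasoningMathItem:
        # Hypothetical record layout; the actual SKYLENAGE-ReasoningMATH
        # field names and types are not specified in this abstract.
        item_id: str               # unique identifier of the problem
        problem: str               # problem statement
        answer: str                # reference answer used for grading
        length_tokens: int         # problem length, e.g. in tokens
        numeric_density: float     # fraction of numeric tokens in the statement
        symbolic_complexity: float # score for symbolic/structural complexity

    # Example usage with made-up values:
    item = ReasoningMathItem(
        item_id="rm-001",
        problem="Let f(x) = x^2 + 1. ...",
        answer="42",
        length_tokens=180,
        numeric_density=0.12,
        symbolic_complexity=0.85,
    )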
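For readers who want the retention figure pinned down: a minimal formalization, assuming (as the abstract's wording suggests, though the report may define it differently) that retention is the ratio of doctoral-stage to high-school-stage accuracy:

    \[
    \mathrm{retention} \;=\; \frac{\mathrm{acc}_{\text{doctoral}}}{\mathrm{acc}_{\text{high school}}}
    \]

With purely illustrative numbers, stage accuracies of 0.63 (doctoral) and 0.80 (high school) would give 0.63 / 0.80 ≈ 0.79, matching the reported retention rate; the actual stage-level accuracies are not stated in the abstract.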