SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
September 24, 2025
Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
cs.AI
Abstract
Large language models (LLMs) now perform strongly on many public math suites,
yet frontier separation within mathematics increasingly suffers from ceiling
effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a
100-item, structure-aware diagnostic set with per-item metadata on length,
numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item
contest-style suite spanning four stages, from high school to doctoral, under a
seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a
single setup and analyze subject × model and grade × model performance. On the
contest suite, the strongest model reaches 44% accuracy while the runner-up reaches 37%;
accuracy declines from high school to doctoral, and top systems exhibit a
doctoral-to-high-school retention rate near 79%. On the reasoning set, the best
model attains 81% overall accuracy, and hardest-slice results reveal clear robustness
gaps between leaders and the mid-tier. In summary, we release
SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH;
together, SKYLENAGE provides a difficult, reasoning-centered, broad-coverage
math benchmark with calibrated difficulty and rich metadata, serving as a
reference standard for future evaluations of mathematical reasoning.
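
As an illustration of the per-item metadata described in the abstract (length, numeric density, and symbolic complexity), below is a minimal sketch of what a SKYLENAGE-ReasoningMATH record could look like. The field names and types (item_id, length_tokens, numeric_density, symbolic_complexity) are hypothetical placeholders, not the report's actual schema:

    from dataclasses import dataclass

    @dataclass
    class ReasoningMathItem:
        # Hypothetical record layout; the actual SKYLENAGE-ReasoningMATH
        # field names and types are not specified in this abstract.
        item_id: str               # unique identifier of the problem
        problem: str               # problem statement
        answer: str                # reference answer used for grading
        length_tokens: int         # problem length, e.g. in tokens
        numeric_density: float     # fraction of numeric tokens in the statement
        symbolic_complexity: float # score for symbolic/structural complexity

    # Example usage with made-up values:
    item = ReasoningMathItem(
        item_id="rm-001",
        problem="Let f(x) = x^2 + 1. ...",
        answer="42",
        length_tokens=180,
        numeric_density=0.12,
        symbolic_complexity=0.85,
    )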
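For readers who want the retention figure pinned down: a minimal formalization, assuming (as the abstract's wording suggests, though the report may define it differently) that retention is the ratio of doctoral-stage to high-school-stage accuracy:

    \[
    \mathrm{retention} \;=\; \frac{\mathrm{acc}_{\text{doctoral}}}{\mathrm{acc}_{\text{high school}}}
    \]

With purely illustrative numbers, stage accuracies of 0.63 (doctoral) and 0.80 (high school) would give 0.63 / 0.80 ≈ 0.79, matching the reported retention rate; the actual stage-level accuracies are not stated in the abstract.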