SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
September 24, 2025
Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
cs.AI
Abstract
Large language models (LLMs) now perform strongly on many public math suites,
yet frontier separation within mathematics increasingly suffers from ceiling
effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a
100-item, structure-aware diagnostic set with per-item metadata on length,
numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item
contest-style suite spanning four stages from high school to doctoral under a
seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a
single setup and analyze subject × model and grade × model performance. On the
contest suite, the strongest model reaches 44% while the runner-up reaches 37%;
accuracy declines from high school to doctoral, and top systems exhibit a
doctoral-to-high-school retention near 79%. On the reasoning set, the best
model attains 81% overall, and hardest-slice results reveal clear robustness
gaps between leaders and the mid-tier. In summary, we release
SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH;
together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage
math benchmark with calibrated difficulty and rich metadata, serving as a
reference benchmark for future evaluations of mathematical reasoning.
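Note: the abstract does not define the retention figure precisely; a natural reading, stated here as an assumption rather than the authors' definition, is the ratio of a model's doctoral-stage accuracy to its high-school-stage accuracy on SKYLENAGE-MATH:

\[
\text{retention} = \frac{\text{Acc}_{\text{doctoral}}}{\text{Acc}_{\text{high school}}} \approx 0.79 \quad \text{for the top systems.}
\]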