SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
September 24, 2025
Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
cs.AI
Abstract
Large language models (LLMs) now perform strongly on many public math suites,
yet frontier separation within mathematics increasingly suffers from ceiling
effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a
100-item, structure-aware diagnostic set with per-item metadata on length,
numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item
contest-style suite spanning four stages from high school to doctoral under a
seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a
single setup and analyze subject × model and grade × model performance. On the
contest suite, the strongest model reaches 44% while the runner-up reaches 37%;
accuracy declines from high school to doctoral, and top systems exhibit a
doctoral-to-high-school retention near 79%. On the reasoning set, the best
model attains 81% overall, and hardest-slice results reveal clear robustness
gaps between leaders and the mid-tier. In summary, we release
SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH;
together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage
math benchmark with calibrated difficulty and rich metadata, serving as a
reference benchmark for future evaluations of mathematical reasoning.
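Note: the abstract does not define the retention figure precisely; a natural reading, stated here as an assumption rather than the authors' definition, is the ratio of a model's doctoral-stage accuracy to its high-school-stage accuracy on SKYLENAGE-MATH:

\[
\text{retention} = \frac{\text{Acc}_{\text{doctoral}}}{\text{Acc}_{\text{high school}}} \approx 0.79 \quad \text{for the top systems.}
\]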