SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Abstract

Large language models (LLMs) now perform strongly on many public math suites, yet ceiling effects increasingly make it hard to separate frontier models within mathematics. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% and the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between the leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH. Together, the two SKYLENAGE benchmarks provide a hard, reasoning-centered, broad-coverage math benchmark with calibrated difficulty and rich metadata, serving as a reference for future evaluations of mathematical reasoning.
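The abstract does not define the doctoral-to-high-school retention figure. One plausible reading, assumed here and not confirmed by the source, is the ratio of a model's doctoral-stage accuracy to its high-school-stage accuracy. The sketch below illustrates that arithmetic; the function name `retention` and the accuracy values are hypothetical and are not results from the report.

```python
def retention(acc_hardest: float, acc_easiest: float) -> float:
    """Return hardest-stage accuracy as a fraction of easiest-stage accuracy.

    Assumed definition: retention = acc_hardest / acc_easiest.
    """
    if acc_easiest <= 0:
        raise ValueError("easiest-stage accuracy must be positive")
    return acc_hardest / acc_easiest


# Hypothetical accuracies, chosen only to illustrate the calculation.
high_school_acc = 0.50
doctoral_acc = 0.40

print(f"retention: {retention(doctoral_acc, high_school_acc):.0%}")
```

Under this reading, a retention near 79% would mean that a top system keeps roughly four fifths of its high-school accuracy at the doctoral stage.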
