ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches an overall average (Avg.) of 65.4% and an overall Best@4 of 75.3%. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently the hardest across representative frontier models.
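
The abstract reports both an Avg. and a Best@4 score without defining them. The sketch below shows one common reading, stated here as an assumption rather than the paper's definition: each problem is attempted four times, Avg. is the mean per-sample score averaged over problems, and Best@4 is the best of the four scores averaged over problems. The function names and the toy scores are hypothetical and are not taken from ComBench.

```python
from statistics import mean


def avg_score(scores):
    """Mean over problems of the per-problem mean across samples (assumed definition)."""
    return mean(mean(samples) for samples in scores)


def best_at_k(scores, k=4):
    """Mean over problems of the best score among the first k samples (assumed definition)."""
    return mean(max(samples[:k]) for samples in scores)


# Hypothetical toy scores in [0, 1], NOT results from the paper:
# 3 problems, 4 sampled responses each.
toy_scores = [
    [1.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]

if __name__ == "__main__":
    print(f"Avg.:   {avg_score(toy_scores):.3f}")
    print(f"Best@4: {best_at_k(toy_scores):.3f}")
```

Under this reading Best@4 can never fall below Avg., so the gap between the two indicates how much a model's performance depends on sampling several attempts.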