ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

June 9, 2026
Authors: Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng
cs.AI

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches an overall average score (Avg.) of 65.4% and an overall Best@4 of 75.3%. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently the hardest across representative frontier models.
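To make the evaluation terms above concrete, here is a minimal Python sketch. It is not ComBench's actual code. It assumes that Avg. is the per-problem mean over sampled responses, averaged across problems, and that Best@4 is the per-problem maximum over four samples, averaged across problems; the abstract does not spell out either definition. The function names (is_valid_queens, avg_score, best_at_k) and the non-attacking-queens task are hypothetical stand-ins, chosen only to show what a deterministic construction check can look like.

```python
from itertools import combinations
from statistics import mean


def is_valid_queens(placement: list[int]) -> bool:
    """Toy deterministic verifier (hypothetical, not from ComBench).

    placement[r] is the column of the queen in row r on an n x n board.
    Returns True only if no two queens attack each other.
    """
    n = len(placement)
    # Exactly one queen per column (rows are distinct by construction).
    if sorted(placement) != list(range(n)):
        return False
    # No two queens on a shared diagonal.
    return all(
        abs(placement[r1] - placement[r2]) != r2 - r1
        for r1, r2 in combinations(range(n), 2)
    )


def avg_score(scores: list[list[float]]) -> float:
    """Assumed 'Avg.': mean over problems of the mean score across samples."""
    return mean(mean(samples) for samples in scores)


def best_at_k(scores: list[list[float]], k: int) -> float:
    """Assumed 'Best@k': mean over problems of the best of the first k samples."""
    return mean(max(samples[:k]) for samples in scores)


if __name__ == "__main__":
    # Two problems, four sampled responses each (made-up scores in [0, 1]).
    scores = [[1.0, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(avg_score(scores), best_at_k(scores, 4))
    print(is_valid_queens([1, 3, 0, 2]), is_valid_queens([0, 1, 2, 3]))
```

Because the verifier is a pure function of the submitted object, a response with a well-argued proof can still fail it, which is the kind of divergence between proof quality and construction validity that the abstract describes.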