ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.