ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches an overall Avg. of 65.4% and an overall Best@4 of 75.3%. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently the hardest across representative frontier models.