ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models perform unevenly on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions together with a justification of their correctness. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on open- and closed-source frontier models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently the hardest across representative frontier models.
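To make the two reported aggregates concrete, the following is a minimal sketch of how figures such as Avg. and Best@4 could be computed from per-attempt scores. The abstract does not define either metric, so the definitions used here are assumptions: each problem is attempted four times, each attempt receives a score in [0, 1], Avg. is the mean over attempts averaged over problems, and Best@4 is the maximum over the four attempts averaged over problems. The function name `avg_and_best_at_k` and the toy data are hypothetical and do not come from ComBench.

```python
from statistics import mean


def avg_and_best_at_k(scores, k=4):
    """Return (Avg., Best@k) as percentages.

    `scores` maps a problem id to a list of per-attempt scores in [0, 1].
    Assumed definitions (not confirmed by the source):
      Avg.   = mean over problems of the mean score of the first k attempts
      Best@k = mean over problems of the max score of the first k attempts
    """
    per_problem_avg = []
    per_problem_best = []
    for problem_id, attempts in scores.items():
        if len(attempts) < k:
            raise ValueError(f"{problem_id} has fewer than {k} attempts")
        first_k = attempts[:k]
        per_problem_avg.append(mean(first_k))
        per_problem_best.append(max(first_k))
    return 100 * mean(per_problem_avg), 100 * mean(per_problem_best)


# Toy data for illustration only; these are not ComBench results.
toy_scores = {
    "p1": [1.0, 0.5, 0.0, 0.5],
    "p2": [0.0, 0.0, 1.0, 0.0],
}
avg, best = avg_and_best_at_k(toy_scores)
print(f"Avg. = {avg:.1f}%, Best@4 = {best:.1f}%")
```

Under these assumed definitions, Best@k can never fall below Avg. for the same set of attempts, which is consistent with the 65.4% and 75.3% figures quoted for the strongest model.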