FormalMATH: Benchmarking van formeel wiskundig redeneren van grote taalmodelle

Samenvatting

Formeel wiskundig redeneren blijft een cruciale uitdaging voor kunstmatige intelligentie, gehinderd door de beperkingen van bestaande benchmarks in omvang en schaal. Om dit aan te pakken, presenteren we FormalMATH, een grootschalige Lean4-benchmark bestaande uit 5.560 formeel geverifieerde problemen, variërend van middelbare school Olympiade-uitdagingen tot bachelor-niveau stellingen in diverse domeinen (bijv. algebra, toegepaste wiskunde, calculus, getaltheorie en discrete wiskunde). Om de inefficiëntie van handmatige formalisering te verminderen, introduceren we een innovatieve human-in-the-loop autoformalisering-pipeline die integreert: (1) gespecialiseerde grote taalmodellen (LLM's) voor het autoformaliseren van stellingen, (2) multi-LLM semantische verificatie, en (3) negatie-gebaseerde weerleggingsfilterstrategieën met behulp van kant-en-klare LLM-gebaseerde bewijzers. Deze aanpak verlaagt de kosten van expertannotatie door 72,09% van de stellingen te behouden vóór handmatige verificatie, terwijl de trouw aan de oorspronkelijke natuurlijke-taakproblemen wordt gewaarborgd. Onze evaluatie van state-of-the-art LLM-gebaseerde stellingbewijzers onthult aanzienlijke beperkingen: zelfs de sterkste modellen behalen slechts een slagingspercentage van 16,46% binnen praktische steekproefbudgetten, met een uitgesproken domeinvooroordeel (bijv. uitblinken in algebra maar falen in calculus) en een overmatige afhankelijkheid van vereenvoudigde automatiseringsstrategieën. Opmerkelijk is dat we een contra-intuïtieve omgekeerde relatie identificeren tussen natuurlijke-taakoplossingsbegeleiding en bewijssucces in keten-van-redenering-scenario's, wat suggereert dat door mensen geschreven informeel redeneren ruis introduceert in plaats van duidelijkheid in formele redeneersettings. Wij geloven dat FormalMATH een robuuste benchmark biedt voor het beoordelen van formeel wiskundig redeneren.

English

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.

FormalMATH: Benchmarking van formeel wiskundig redeneren van grote taalmodelle

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Samenvatting

Support