s2n-bignum-bench: Een praktische benchmark voor het evalueren van laag-niveau codebegrip van LLM's

Samenvatting

Neurosymbolische benaderingen die Large Language Models (LLMs) combineren met formele methoden hebben onlangs sterke resultaten behaald op wiskundig georiënteerde theorem-proving benchmarks. Succes op competitieachtige wiskunde toont echter op zichzelf niet het vermogen aan om bewijzen te construeren over real-world implementaties. Wij voorzien in deze leemte met een benchmark afgeleid van een industriële cryptografische bibliotheek waarvan de assembleroutines reeds zijn geverifieerd in HOL Light. s2n-bignum is een bibliotheek die bij AWS wordt gebruikt voor het leveren van snelle assembleroutines voor cryptografie, en de correctheid ervan is vastgesteld door formele verificatie. De taak om deze bibliotheek formeel te verifiëren was een belangrijke prestatie voor de Automated Reasoning Group. Het omvatte twee taken: (1) het precies specificeren van het correcte gedrag van een programma als een wiskundige propositie, en (2) het bewijzen dat de propositie correct is. In het geval van s2n-bignum werden beide taken uitgevoerd door menselijke experts. In s2n-bignum-bench verstrekken wij de formele specificatie en vragen wij de LLM om een proof-script te genereren dat wordt geaccepteerd door HOL Light binnen een vaste time-out voor proof-checking. Voor zover wij weten, is s2n-bignum-bench de eerste openbare benchmark die zich richt op machine-controleerbare proof-synthese voor industriële low-level cryptografische assembleroutines in HOL Light. Deze benchmark biedt een uitdagende en praktisch relevante testomgeving voor het evalueren van op LLM gebaseerd theorem proving, voorbij competitiewiskunde. De code om de benchmark op te zetten en te gebruiken is hier beschikbaar: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

s2n-bignum-bench: Een praktische benchmark voor het evalueren van laag-niveau codebegrip van LLM's

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Samenvatting

Support