s2n-bignum-bench: Ein praktischer Benchmark zur Bewertung des Low-Level-Code-Verständnisses von LLMs

Zusammenfassung

Neurosymbolische Ansätze, die große Sprachmodelle (LLMs) mit formalen Methoden kombinieren, haben kürzlich starke Ergebnisse bei mathematisch orientierten Theorem-Beweis-Benchmarks erzielt. Erfolge bei wettbewerbsartiger Mathematik demonstrieren jedoch nicht automatisch die Fähigkeit, Beweise über Implementierungen aus der realen Welt zu konstruieren. Wir schließen diese Lücke mit einem Benchmark, der aus einer industriellen Kryptographie-Bibliothek abgeleitet ist, deren Assembler-Routinen bereits in HOL Light verifiziert sind. s2n-bignum ist eine bei AWS eingesetzte Bibliothek, die schnelle Assembler-Routinen für Kryptographie bereitstellt, und ihre Korrektheit wird durch formale Verifikation sichergestellt. Die Aufgabe, diese Bibliothek formal zu verifizieren, war eine bedeutende Errungenschaft für die Automated Reasoning Group. Sie umfasste zwei Aufgaben: (1) die präzise Spezifikation des korrekten Verhaltens eines Programms als mathematische Aussage und (2) den Beweis, dass diese Aussage korrekt ist. Im Fall von s2n-bignum wurden beide Aufgaben von menschlichen Experten durchgeführt. In s2n-bignum-bench stellen wir die formale Spezifikation bereit und fordern das LLM auf, ein Proof-Skript zu generieren, das von HOL Light innerhalb eines festgelegten Proof-Checking-Timeouts akzeptiert wird. Nach unserem Wissen ist s2n-bignum-bench der erste öffentliche Benchmark, der sich auf maschinenprüfbare Beweissynthese für industrielle Low-Level-Kryptographie-Assembler-Routinen in HOL Light konzentriert. Dieser Benchmark bietet eine anspruchsvolle und praktisch relevante Testumgebung zur Bewertung LLM-basierten Theorembeweisens über Wettbewerbsmathematik hinaus. Der Code zur Einrichtung und Nutzung des Benchmarks ist hier verfügbar: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

s2n-bignum-bench: Ein praktischer Benchmark zur Bewertung des Low-Level-Code-Verständnisses von LLMs

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Zusammenfassung

Support