s2n-bignum-bench: Un benchmark pratico per la valutazione del ragionamento sul codice di basso livello dei LLM

Abstract

Gli approcci neurosimbolici che sfruttano i Large Language Model (LLM) insieme a metodi formali hanno recentemente ottenuto risultati significativi su benchmark di dimostrazione di teoremi orientati alla matematica. Tuttavia, il successo in matematica di tipo competitivo non dimostra di per sé la capacità di costruire dimostrazioni su implementazioni del mondo reale. Colmiamo questa lacuna con un benchmark derivato da una libreria crittografica industriale le cui routine in assembly sono già verificate in HOL Light. s2n-bignum è una libreria utilizzata in AWS per fornire routine assembly veloci per la crittografia, e la sua correttezza è stabilita dalla verifica formale. Il compito di verificare formalmente questa libreria è stato un risultato significativo per l'Automated Reasoning Group. Esso ha coinvolto due compiti: (1) specificare precisamente il comportamento corretto di un programma come una proposizione matematica, e (2) dimostrare che la proposizione è corretta. Nel caso di s2n-bignum, entrambi i compiti sono stati eseguiti da esperti umani. In s2n-bignum-bench, forniamo la specifica formale e chiediamo all'LLM di generare uno script di dimostrazione che sia accettato da HOL Light entro un timeout fisso per il controllo delle dimostrazioni. A nostra conoscenza, s2n-bignum-bench è il primo benchmark pubblico focalizzato sulla sintesi di dimostrazioni verificabili automaticamente per routine crittografiche industriali di basso livello in assembly in HOL Light. Questo benchmark fornisce un banco di prova impegnativo e praticamente rilevante per valutare la dimostrazione automatica di teoremi basata su LLM al di là della matematica competitiva. Il codice per configurare e utilizzare il benchmark è disponibile qui: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

s2n-bignum-bench: Un benchmark pratico per la valutazione del ragionamento sul codice di basso livello dei LLM

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Abstract

Support