s2n-bignum-bench: LLM의 저수준 코드 추론 능력 평가를 위한 실용 벤치마크

초록

대규모 언어 모델(LLM)과 형식적 방법을 결합한 신경-기호 접근법이 최근 수학 중심 정리 증명 벤치마크에서 강력한 성과를 보여주고 있습니다. 그러나 경쟁형 수학 문제에서의 성공만으로는 실제 구현에 대한 증명을 구성할 능력이 있다고 입증하기는 어렵습니다. 우리는 이러한 격차를 해소하기 위해 HOL Light에서 이미 검증된 어셈블리 루틴을 보유한 산업용 암호화 라이브러리에서 유래한 벤치마크를 제시합니다. s2n-bignum은 AWS에서 암호화를 위한 고속 어셈블리 루틴을 제공하는 데 사용되는 라이브러리이며, 그 정확성은 형식적 검증을 통해 입증되었습니다. 이 라이브러리의 형식적 검증 작업은 자동 추론 그룹에게 중요한 성과였습니다. 여기에는 두 가지 작업이 포함되었습니다: (1) 프로그램의 올바른 동작을 수학적 명제로 정밀하게 규격화하는 것, (2) 해당 명제가 옳다는 것을 증명하는 것. s2n-bignum의 경우, 두 작업 모두 인간 전문가에 의해 수행되었습니다. s2n-bignum-bench에서는 형식적 규격을 제공하고 LLM이 고정된 증명 검사 타임아웃 내에 HOL Light에서 승인되는 증명 스크립트를 생성하도록 요구합니다. 우리가 알기로, s2n-bignum-bench는 HOL Light에서 산업용 저수준 암호화 어셈블리 루틴에 대해 기계 검증 가능한 증명 합성에 초점을 맞춘 최초의 공개 벤치마크입니다. 이 벤치마크는 경쟁 수학을 넘어선 LLM 기반 정리 증명 평가를 위한 도전적이고 실질적으로 관련성 높은 테스트베드를 제공합니다. 벤치마크 설정 및 사용을 위한 코드는 다음에서 확인할 수 있습니다: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

s2n-bignum-bench: LLM의 저수준 코드 추론 능력 평가를 위한 실용 벤치마크

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

초록

Support