s2n-bignum-bench：面向大语言模型底层代码推理能力的实用基准测试框架

摘要

近期，结合形式化方法与大型语言模型的神经符号化方法在数学导向的定理证明基准测试中取得了显著成果。然而，竞赛型数学任务的成功本身并不能证明其具备对现实世界实现进行形式化验证的能力。为此，我们推出一个基于工业密码学库的全新基准测试——该库的汇编例程已在HOL Light中完成验证。s2n-bignum是AWS使用的密码学快速汇编例程库，其正确性已通过形式化验证得以确立。对该库进行形式化验证是自动推理研究组的重要成果，包含两项核心任务：(1) 将程序正确行为精确定义为数学命题；(2) 证明该命题的正确性。在s2n-bignum项目中，这两项任务均由人类专家完成。而在s2n-bignum-bench基准测试中，我们提供形式化规范，要求LLM在限定验证时长内生成能被HOL Light接受的证明脚本。据我们所知，这是首个专注于HOL Light环境下工业级底层密码汇编例程的机器可验证证明合成公共基准。该基准为评估LLM在超越竞赛数学的定理证明能力提供了兼具挑战性与实践意义的测试平台。基准测试的配置与使用代码已开源：https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}。

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.