s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
March 15, 2026
Authors: Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi
cs.AI
Abstract
Neurosymbolic approaches that combine Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library of fast assembly routines for cryptography used at AWS, and its correctness is established by formal verification. Formally verifying this library was a significant achievement of the Automated Reasoning Group and involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition holds. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. The benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at: https://github.com/kings-crown/s2n-bignum-bench
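The evaluation protocol described above (a candidate proof script counts as a success only if the checker accepts it within a fixed wall-clock timeout) can be sketched as a minimal harness. This is an illustrative sketch, not the benchmark's actual code: the function name, arguments, and the stand-in command are all hypothetical, and the real invocation of HOL Light will differ.

```python
import subprocess

def check_proof(command, timeout_s):
    """Run a proof-checking command under a wall-clock timeout.

    Hypothetical harness: `command` stands in for whatever invocation
    loads the specification and the LLM-generated proof script into the
    checker. Returns (accepted, output), where a timeout or a nonzero
    exit status both count as rejection.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return result.returncode == 0, result.stdout

# Usage with POSIX stand-in commands (not a real HOL Light invocation):
ok, _ = check_proof(["true"], timeout_s=5)    # exit 0 -> accepted
bad, _ = check_proof(["false"], timeout_s=5)  # exit 1 -> rejected
```

Treating a timeout identically to an outright rejection keeps the scoring binary and reproducible, which matches the fixed proof-check timeout described in the abstract.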