s2n-bignum-bench: LLMの低レベルコード推論能力を評価する実践的ベンチマーク

要旨

大規模言語モデル（LLM）と形式手法を活用する神経記号的アプローチは、数学志向の定理証明ベンチマークにおいて近年強力な成果を上げている。しかし、競技数学的な問題での成功自体は、実世界の実装に関する証明を構築する能力を実証するものではない。我々はこの隔たりを、産業界で使用される暗号ライブラリから派生したベンチマークによって埋める。このライブラリのアセンブリルーチンは既にHOL Lightで検証済みである。s2n-bignumはAWSで使用されている、暗号処理のための高速なアセンブリルーチンを提供するライブラリであり、その正当性は形式検証によって確立されている。このライブラリの形式検証は、自動推論グループにとって重要な成果であった。この作業には二つの課題が含まれていた：(1) プログラムの正しい動作を数学的な命題として精密に特定すること、(2) その命題が正しいことを証明すること。s2n-bignumの場合、両方の課題は人間の専門家によって実行された。s2n-bignum-benchでは、我々は形式的仕様を提供し、LLMが固定の証明チェックタイムアウト内でHOL Lightに受理される証明スクリプトを生成することを求める。我々の知る限り、s2n-bignum-benchは、HOL Lightにおける産業用低レベル暗号アセンブリルーチンのための、機械検証可能な証明合成に焦点を当てた最初の公開ベンチマークである。このベンチマークは、競技数学を超えたLLMベースの定理証明を評価するための、挑戦的かつ実用的に関連性の高いテストベッドを提供する。ベンチマークのセットアップと使用のためのコードは以下で利用可能である：https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}

English

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: https://github.com/kings-crown/s2n-bignum-bench{s2n-bignum-bench}.

s2n-bignum-bench: LLMの低レベルコード推論能力を評価する実践的ベンチマーク

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

要旨

Support