FVSpec: 실전 속성 기반 테스트를 Lean 도전 과제로

초록

우리는 실제 소프트웨어 형식 검증 과제에서 AI 모델과 에이전트를 평가하기 위한 벤치마크를 제시한다. 먼저 실제 파이썬 저장소에서 11,039개의 속성 기반 테스트(PBT)를 수집한 후, 이 중 2,772개(25%)를 자동으로 9,415개의 Lean 4 명세로 변환하며, 이때 sorry 플레이스홀더를 포함시킨다(PBT당 약 3개의 형식화; 품질 지표에서 우위를 점하는 방식이 없을 경우 여러 시도를 유지한다). PBT를 Lean 명세로 변환하는 것은 까다롭다. Lean에서 파이썬 의미론을 모델링하고, 명령형 PBT에 인코딩된 논리적 속성을 추론하며, 드물게 사용되는 언어에서 의존 타입 프로그래밍의 고유한 어려움을 처리해야 하기 때문이다. 우리는 PBT를 Lean 명세로 트랜스파일링하기 위한 세 에이전트 LLM 파이프라인을 설명하고, 커버리지와 품질 지표를 평가하며, 여러 자동화 및 모델 기반 접근법을 사용한 증명 생성을 위한 기준선을 제공한다. 모든 코드(스크레이퍼 및 에이전트)와 데이터(PBT 및 Lean 명세)는 오픈 소스로 공개된다. 우리의 벤치마크는 점점 더 많은 코드를 AI가 생성하는 세상에서 중요성이 커지고 있는, 실세계 소프트웨어의 AI 보조 형식 검증이라는 충분히 탐구되지 않은 문제의 발전을 촉진하는 것을 목표로 한다.

English

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.