FVSpec: 现实世界中的基于属性测试作为Lean挑战

摘要

我们提出了一项基准，用于评估AI模型和智能体在真实世界形式化软件验证任务上的表现。首先，我们从真实世界的Python代码库中抓取了11,039个属性基测试（PBT），然后自动将其中2,772个（25%）转化为9,415条包含`sorry`占位符的Lean 4规范（平均每个PBT对应约3个形式化版本；当没有单一版本在质量指标上占优时，我们会保留多次尝试的结果）。将PBT转换为Lean规范颇具挑战性：需要在Lean中模拟Python语义，推断命令式PBT中编码的逻辑属性，并处理一个鲜少使用的语言中依赖类型编程的固有难点。我们描述了一个三智能体LLM流水线，用于将PBT转译为Lean规范，评估了覆盖率和质量指标，并提供了基于自动化方法和模型驱动方法的多种证明生成基线。所有代码（抓取器和智能体）与数据（PBT和Lean规范）均已开源。我们的基准旨在推动AI辅助形式化验证真实世界软件这一尚未充分探索问题的进展，而随着AI生成越来越多全球代码，该问题的重要性与日俱增。

English

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.