FVSpec：現實世界中基於屬性的測試作為 Lean 挑戰

摘要

我們提出了一個基準測試，用於評估AI模型與智能體在真實世界形式化軟體驗證任務上的表現。首先，我們從真實的Python程式庫中蒐集了11,039個基於屬性的測試（PBT），接著自動將其中的2,772個（佔25%）轉換成9,415個帶有sorry佔位符的Lean 4規格（每個PBT約有3個形式化版本；當沒有任何一個版本在品質指標上明顯佔優時，我們保留多個嘗試）。將PBT轉譯為Lean規格極具挑戰性：這需要在Lean中建模Python語義、推斷命令式PBT所編碼的邏輯屬性，並處理在鮮少使用的語言中進行依賴型別程式設計的固有困難。我們描述了一個由三個智能體組成的LLM管線，用於將PBT轉譯為Lean規格，評估其覆蓋率與品質指標，並為使用多種自動化與基於模型的方法進行證明生成提供基準。所有程式碼（蒐集器與智能體）與資料（PBT與Lean規格）均為開放原始碼。我們的基準旨在推動AI輔助形式化驗證真實世界軟體這一尚未充分探索問題的進展，而隨著AI生成越來越多全球程式碼，此問題正日益受到關注。

English

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.