Verus-SpecGym：一個用於評估規格自動形式化的智能體環境

摘要

AI編碼代理（AI coding agents）正日益被用於撰寫真實世界的軟體，但確保其輸出正確性仍是基本挑戰。形式驗證提供了一條有前景的路徑：代理生成程式碼同時附帶機器驗證的證明，從而保證程式碼滿足形式規範。然而，形式規範本身是否符合使用者意圖並無保證。本研究探討規範自動形式化（specification autoformalization）這一課題：LLM代理是否能將非正式的程式設計問題轉化為忠實的形式規範。我們推出了Verus-SpecBench，這是一個包含581項規範撰寫任務的基準測試，任務源自Codeforces問題並針對Verus（Rust語言的驗證器）；同時推出Verus-SpecGym，這是一個代理環境，模型可在其中與Verus、bash及檔案系統互動以開發這些規範。核心挑戰在於評估：專家撰寫的參考規範造價高昂，而LLM評判可能遺漏細微錯誤。為解決此問題，我們採取了以下措施：(a) 擴展Verus的exec_spec機制，使生成的規範能作為Rust程式碼執行；(b) 將這些規範與官方Codeforces測試案例以及從Codeforces「hack」中提取的對抗性案例進行測試——後者是由參賽者編寫的邊緣案例，旨在破解不正確的解決方案。在Verus-SpecBench上，最強模型Gemini 3.1 Pro解決了77.8%的任務，其他前沿模型解決了51.1%–57.8%，而開源模型僅達到21.5%–25.5%。我們對失敗模式的分析顯示，模型生成的規範可能遺漏重要的輸入假設，接受不正確的輸出，以及拒絕有效的輸出。我們還發現，LLM作為評判的評估方式漏掉了我們評估器所抓獲的26%的失敗案例。總體而言，我們的研究結果表明，規範自動形式化對前沿代理而言已觸手可及，但在代理已能生成正確程式碼的問題上，該方法仍顯脆弱。程式碼、數據與日誌可於 https://github.com/formal-verif-is-cool/verus-spec-gym 取得。

English

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym