Verus-SpecGym：一个用于评估规范自动形式化的智能体环境

摘要

AI编码代理正越来越多地被用于编写真实世界的软件，但确保其输出正确性仍是一项根本性挑战。形式化验证提供了一条有前景的路径：代理生成代码的同时附带机器可验证的证明，从而保证代码满足形式化规范。然而，形式化规范本身是否与用户意图一致却无法得到保证。在本工作中，我们研究规范自动形式化问题：LLM代理能否将非形式化的编程问题转化为可靠的形式化规范。我们引入了Verus-SpecBench基准，包含从Codeforces问题派生出的581项规范编写任务，目标语言为Rust验证工具Verus；同时构建了Verus-SpecGym代理环境，使模型能够通过与Verus、bash及文件系统交互来开发这些规范。核心挑战在于评估：专家编写的参考规范成本高昂，而LLM评判者可能遗漏细微错误。为此，我们通过两种方式应对：(a) 扩展Verus的exec_spec机制，使生成的规范能够作为Rust代码执行；(b) 将这些规范与官方Codeforces测试用例及从Codeforces“hacks”中提取的对抗用例进行测试——后者是参赛者编写的、用于攻破错误解决方案的边界案例。在Verus-SpecBench上，最强模型Gemini 3.1 Pro解决了77.8%的任务，其他前沿模型解决了51.1–57.8%，而开源模型仅达到21.5–25.5%。对失败模式的分析表明，模型生成的规范可能遗漏重要输入假设、接受错误输出或拒绝正确输出。我们还发现，基于LLM评判者的评估方式会遗漏我们评估工具所捕获的26%的失败案例。总体而言，我们的结果表明，规范自动形式化对于前沿代理已触手可及，但在那些它们已能生成正确代码的问题上仍显脆弱。相关代码、数据及日志可在https://github.com/formal-verif-is-cool/verus-spec-gym获取。

English

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym