Verus-SpecGym: 仕様自動形式化評価のためのエージェント環境

要旨

AIコーディングエージェントは実際のソフトウェアの記述にますます利用されているが、その出力が正しいことを保証することは依然として基本的な課題である。形式検証は有望な道を提供する。すなわち、エージェントがコードと機械検証済みの証明を生成し、そのコードが形式仕様を満たすことを保証する。しかし、形式仕様自体がユーザの意図と一致するという保証はない。本研究では、仕様の自動形式化、すなわちLLMエージェントが非形式的なプログラミング問題を忠実な形式仕様に翻訳できるかどうかを研究する。我々は、Rust用検証器Verusを対象とし、Codeforcesの問題に由来する581個の仕様記述タスクからなるベンチマークVerus-SpecBenchと、モデルがVerus、bash、ファイルシステムと相互作用してこれらの仕様を開発するエージェント環境Verus-SpecGymを導入する。中心的な課題は評価である。専門家が作成した参照仕様は作成コストが高く、LLM判定者は微妙な誤りを見逃しうる。そこで我々は、(a) Verusのexec_spec機構を拡張し、生成された仕様をRustコードとして実行可能にするとともに、(b) それらを公式のCodeforcesテストと、誤った解法を破るために競技者によって書かれたエッジケースであるCodeforcesの「ハック」から抽出した敵対的ケースに対してテストすることで対処する。Verus-SpecBenchにおいて、最強モデルであるGemini 3.1 Proはタスクの77.8%を解決し、他のフロンティアモデルは51.1～57.8%、OSSモデルは21.5～25.5%にとどまる。失敗モードの分析から、モデル生成の仕様は重要な入力前提を省略したり、不正な出力を受け入れたり、正当な出力を拒否したりすることが示された。また、LLM判定者による評価は、我々の評価者が捕捉する失敗の26%を見逃すことも判明した。全体として、これらの結果は、仕様の自動形式化がフロンティアエージェントにとって手の届く範囲にあるものの、彼らがすでに正しいコードを生成できる問題においてさえ脆弱であることを示唆している。コード、データ、ログは https://github.com/formal-verif-is-cool/verus-spec-gym で入手可能である。

English

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym