Verus-SpecGym: 명세 자동형식화 평가를 위한 에이전트 환경

초록

AI 코딩 에이전트는 실제 소프트웨어를 작성하는 데 점점 더 많이 사용되고 있지만, 그 출력의 정확성을 보장하는 것은 여전히 근본적인 과제로 남아 있다. 형식 검증은 유망한 해결책을 제시한다. 즉, 에이전트가 코드와 함께 기계가 검증한 증명을 생성하여 코드가 형식 명세를 만족함을 보장하는 것이다. 그러나 형식 명세 자체가 사용자의 의도와 일치한다는 보장은 없다. 본 연구에서는 명세 자동 형식화(specification autoformalization), 즉 LLM 에이전트가 비형식적 프로그래밍 문제를 신뢰할 수 있는 형식 명세로 변환할 수 있는지 여부를 연구한다. 우리는 Rust 검증기인 Verus를 대상으로 Codeforces 문제에서 파생된 581개의 명세 작성 과제로 구성된 벤치마크인 Verus-SpecBench와, 모델이 Verus, bash, 파일시스템과 상호작용하며 명세를 개발할 수 있는 에이전트 환경인 Verus-SpecGym을 소개한다. 핵심 과제는 평가(evaluation)이다. 전문가가 작성한 참조 명세는 작성 비용이 많이 들고, LLM 판정자는 미묘한 오류를 놓칠 수 있다. 우리는 (a) Verus의 exec_spec 메커니즘을 확장하여 생성된 명세가 Rust 코드로 실행될 수 있도록 하고, (b) 이를 공식 Codeforces 테스트와 Codeforces "해킹(hacks)"에서 추출한 적대적 사례(즉, 경쟁자가 잘못된 해결책을 깨기 위해 작성한 극단적 사례)에 대해 테스트함으로써 이 문제를 해결한다. Verus-SpecBench에서 가장 강력한 모델인 Gemini 3.1 Pro는 과제의 77.8%를 해결했으며, 다른 최첨단 모델은 51.1–57.8%를, 오픈소스 모델은 21.5–25.5%만을 해결했다. 실패 모드 분석 결과, 모델이 생성한 명세는 중요한 입력 가정을 생략하거나, 잘못된 출력을 수용하거나, 유효한 출력을 거부할 수 있음을 보여준다. 또한 LLM-as-a-judge 평가는 우리 평가자가 포착한 실패의 26%를 놓친다는 사실을 발견했다. 전반적으로, 본 연구 결과는 명세 자동 형식화가 최첨단 에이전트에게는 가능하지만, 이미 올바른 코드를 생성할 수 있는 문제에서도 여전히 취약함을 시사한다. 코드, 데이터, 로그는 https://github.com/formal-verif-is-cool/verus-spec-gym에서 확인할 수 있다.

English

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym