지역적 성공은 조합되지 않는다: 조합적 형식 검증을 위한 대규모 언어 모델 벤치마킹

초록

우리는 Dafny에서의 조합적 명세 생성(compositional specification generation)을 평가하기 위한 벤치마크인 DafnyCOMP를 소개한다. 기존의 단일 함수 작업에 초점을 맞춘 벤치마크와 달리, DafnyCOMP는 데이터 의존성을 가진 여러 상호작용 함수로 구성된 프로그램을 대상으로 하며, 구성 요소 간 경계를 넘어 추론을 요구한다. 이 벤치마크는 자동으로 합성된 300개의 다중 함수 프로그램으로 구성되어 있다. 우리는 여러 최신 대규모 언어 모델(LLM) 패밀리를 평가했으며, 이들이 단일 함수 검증에서는 잘 수행되지만 조합적 작업에서는 성능이 급격히 떨어지는 것을 발견했다. 분석 결과, 취약한 명세, 구현과 증명 간의 불일치, 불안정한 추론 등 교차 기능적 추론에서의 체계적인 실패가 드러났다. 따라서 DafnyCOMP는 LLM을 사용한 신뢰할 수 있고 검증 가능하며 조합적인 코드 생성을 향한 진척도를 측정하기 위한 진단 도구로 활용될 수 있다.

English

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.

지역적 성공은 조합되지 않는다: 조합적 형식 검증을 위한 대규모 언어 모델 벤치마킹

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

초록

Support