Lokaal succes componeert niet: Benchmarking van grote taalmodellen voor compositionele formele verificatie

Samenvatting

We introduceren DafnyCOMP, een benchmark voor het evalueren van grote taalmodellen (LLM's) op het gebied van compositionele specificatiegeneratie in Dafny. In tegenstelling tot eerdere benchmarks die zich richten op taken met één functie, richt DafnyCOMP zich op programma's die bestaan uit meerdere interactieve functies met data-afhankelijkheden, wat redenering over componentgrenzen vereist. De benchmark bestaat uit 300 automatisch gesynthetiseerde multifunctionele programma's. We evalueren verschillende state-of-the-art LLM-families en constateren dat, hoewel ze goed presteren op verificatie van één functie, hun prestaties sterk dalen bij compositionele taken. Analyse onthult systematische fouten in cross-functioneel redeneren, waaronder fragiele specificaties, misalignering tussen implementaties en bewijzen, en instabiel redeneren. DafnyCOMP biedt zo een diagnostisch hulpmiddel voor het meten van voortgang richting betrouwbare, verifieerbare en compositionele codegeneratie met LLM's.

English

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.

Lokaal succes componeert niet: Benchmarking van grote taalmodellen voor compositionele formele verificatie

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Samenvatting

Support