
Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

September 27, 2025
作者: Xu Xu, Xin Li, Xingwei Qu, Jie Fu, Binhang Yuan
cs.AI

Abstract

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-function reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
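To make the task concrete, here is a minimal Dafny sketch (our own illustration, not a program drawn from DafnyCOMP) of why verification does not compose for free: Dafny checks each call against the callee's declared contract rather than its body, so a caller's proof succeeds or fails depending on what the callee's specification exposes.

```dafny
// Callee: the ensures clauses are the only facts a caller may use.
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 { y := -x; } else { y := x; }
}

// Caller: its proof composes Abs's contract, not Abs's implementation.
// Weakening Abs's spec to just `ensures y >= 0` leaves Abs verifiable
// in isolation, yet the second postcondition of Dist then fails:
// a locally correct specification that does not compose.
method Dist(a: int, b: int) returns (d: int)
  ensures d >= 0
  ensures d == a - b || d == b - a
{
  d := Abs(a - b);
}
```

Generating specifications that are simultaneously provable for each function and strong enough for every call site is precisely the cross-function burden that DafnyCOMP isolates.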