Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

September 27, 2025
Authors: Xu Xu, Xin Li, Xingwei Qu, Jie Fu, Binhang Yuan
cs.AI

Abstract

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
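To make the task concrete, the following is a minimal Dafny sketch of the kind of compositional verification problem the abstract describes. It is an illustrative example, not drawn from the DafnyCOMP dataset; the method names Square and SumOfSquares are hypothetical. The point is that Dafny verifies a caller against the callee's contract alone, never its body, so a specification that is locally correct but too weak for its callers breaks the composed program.

```dafny
// Hypothetical two-function program (illustrative, not from DafnyCOMP).
// SumOfSquares is verified using only Square's ensures clauses, so
// those clauses must carry every fact the caller needs.

method Square(x: int) returns (y: int)
  ensures y == x * x
  ensures y >= 0  // weakening this contract breaks SumOfSquares below
{
  y := x * x;
}

method SumOfSquares(a: int, b: int) returns (s: int)
  ensures s == a * a + b * b
  ensures s >= 0  // follows from Square's y >= 0, not from its body
{
  var sa := Square(a);
  var sb := Square(b);
  s := sa + sb;
}
```

The failure modes the abstract reports map onto getting such boundary contracts wrong: specifications that verify in isolation but are too weak for their callers, or proofs that silently assume facts the callee never guarantees.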