Ranking LLM-Generated Loop Invariants for Program Verification
October 13, 2023
Authors: Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy
cs.AI
Abstract
Synthesizing inductive loop invariants is fundamental to automating program
verification. In this work, we observe that Large Language Models (such as
gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of
programs in a 0-shot setting, yet require several samples to generate the
correct invariants. This can lead to a large number of calls to a program
verifier to establish an invariant. To address this issue, we propose a
re-ranking approach for the generated results of LLMs. We have designed a
ranker that can distinguish between correct inductive invariants and incorrect
attempts based on the problem definition. The ranker is optimized as a
contrastive ranker. Experimental results demonstrate that this re-ranking
mechanism significantly improves the ranking of correct invariants among the
generated candidates, leading to a notable reduction in the number of calls to
a verifier.
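
To make the described workflow concrete, below is a minimal, illustrative sketch (in Python) of the generate, re-rank, then verify loop the abstract outlines. The names `score_candidate` (standing in for the trained contrastive ranker) and `verifier_accepts` (standing in for a call to an off-the-shelf program verifier) are hypothetical placeholders for this sketch, not APIs from the paper; the contrastive training of the ranker itself is not shown.

```python
# Illustrative sketch only: try LLM-generated invariant candidates in the
# order given by a learned ranker, so that a correct invariant is (ideally)
# found after far fewer verifier calls than with the raw sampling order.
# `score_candidate` and `verifier_accepts` are hypothetical placeholders.

from typing import Callable, List, Optional


def rank_and_verify(
    problem: str,
    candidates: List[str],
    score_candidate: Callable[[str, str], float],
    verifier_accepts: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the first candidate invariant accepted by the verifier,
    trying candidates in descending order of the ranker's score."""
    ordered = sorted(
        candidates,
        key=lambda invariant: score_candidate(problem, invariant),
        reverse=True,
    )
    for invariant in ordered:
        if verifier_accepts(problem, invariant):  # the expensive check
            return invariant
    return None  # no candidate was an inductive invariant
```

The point of the re-ranking step is visible in the sorting: verifier calls are made in ranked order, so the better the ranker separates correct inductive invariants from incorrect attempts, the earlier the loop terminates.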