プログラム検証のためのLLM生成ループ不変式のランキング

要旨

帰納的ループ不変式の合成は、プログラム検証の自動化において基本的な課題である。本研究では、大規模言語モデル（gpt-3.5やgpt-4など）が0-shot設定で特定のクラスのプログラムに対するループ不変式を合成できるものの、正しい不変式を生成するためには複数のサンプルを必要とすることを観察した。これにより、不変式を確立するためにプログラム検証器への呼び出しが多数発生する可能性がある。この問題に対処するため、我々はLLMが生成した結果に対する{\it 再ランキング}手法を提案する。問題定義に基づいて、正しい帰納的不変式と誤った試行を区別できるランカーを設計した。このランカーは対照的ランカーとして最適化されている。実験結果は、この再ランキング機構が生成された候補の中から正しい不変式のランキングを大幅に改善し、検証器への呼び出し回数を著しく削減することを示している。

English

Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.

プログラム検証のためのLLM生成ループ不変式のランキング

Ranking LLM-Generated Loop Invariants for Program Verification

要旨

Support