프로그램 검증을 위한 LLM 생성 루프 불변식 순위 매기기

초록

귀납적 루프 불변식(loop invariant)의 합성은 프로그램 검증 자동화의 핵심 요소입니다. 본 연구에서는 대규모 언어 모델(예: GPT-3.5 또는 GPT-4)이 0-shot 설정에서 특정 프로그램 클래스에 대한 루프 불변식을 합성할 수 있지만, 올바른 불변식을 생성하기 위해 여러 샘플이 필요하다는 점을 관찰했습니다. 이는 불변식을 확립하기 위해 프로그램 검증기에 대한 다수의 호출을 유발할 수 있습니다. 이 문제를 해결하기 위해, 우리는 LLM(대규모 언어 모델)이 생성한 결과에 대한 {\it 재순위화(re-ranking)} 접근 방식을 제안합니다. 문제 정의를 기반으로 올바른 귀납적 불변식과 잘못된 시도를 구별할 수 있는 순위 결정기(ranker)를 설계했습니다. 이 순위 결정기는 대조적 순위 결정기(contrastive ranker)로 최적화되었습니다. 실험 결과는 이 재순위화 메커니즘이 생성된 후보들 중 올바른 불변식의 순위를 크게 개선하고, 검증기에 대한 호출 횟수를 현저히 줄이는 데 기여함을 보여줍니다.

English

Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.

프로그램 검증을 위한 LLM 생성 루프 불변식 순위 매기기

Ranking LLM-Generated Loop Invariants for Program Verification

초록

Support