Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
July 22, 2025
Authors: Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
cs.AI
Abstract
Existing informal language-based (e.g., human language) Large Language Models
(LLMs) trained with Reinforcement Learning (RL) face a significant challenge:
their verification processes, which provide crucial training signals, are
neither reliable nor scalable. Indeed, even prevailing large proprietary models
can hardly generate verifiable programs. A promising yet largely uncharted
alternative is formal language-based reasoning. Grounding LLMs in rigorous
formal systems where generative models operate in formal language spaces (e.g.,
Dafny) enables the automatic and mathematically provable verification of their
reasoning processes and outcomes. This capability is pivotal for achieving
large-scale, reliable formal software verification. It is a common practice to
employ human-annotated chain-of-thought and other human priors to induce the
reasoning and coding capabilities of LLMs. Unfortunately, it becomes
prohibitively resource-intensive to provide such priors for supervising complex
programming tasks. In this work, we systematically explore ways to reduce human
priors with the formal language, Dafny, as the main environment for our pilot
study. Our approach relies mainly on an automatic and scalable data
curation pipeline and careful RL designs integrated with feedback from the
formal language verifier. We introduce DafnyComp, a benchmark of compositional
formal programs with auto-formalized specifications for specification
reasoning. Our supervised fine-tuning (SFT) stage enables even small models
(e.g., 0.5B) to generate syntactically valid and verifiable Dafny code,
surpassing proprietary models. RL with regularization further improves
performance, achieving stronger generalization to out-of-domain tasks and
outperforming all strong baselines on the challenging DafnyComp benchmark.
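For readers unfamiliar with Dafny, the following is a minimal illustrative sketch (not drawn from the paper) of the kind of specification-annotated code the abstract refers to: the `ensures` clauses state a formal postcondition, and the Dafny verifier checks automatically, at compile time, that the method body satisfies it.

```dafny
// Hypothetical example: a method whose correctness the Dafny
// verifier proves mechanically against its specification.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b   // result is an upper bound of both inputs
  ensures m == a || m == b   // result is one of the inputs
{
  if a >= b { m := a; } else { m := b; }
}
```

A model operating in this formal language space thus receives a reliable, scalable training signal: the verifier either proves the generated program meets its specification or reports a concrete failure, with no human annotation required.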