Lean4Agent：智能體工作流與軌跡的形式化建模與驗證

摘要

使大型语言模型（LLMs）能夠執行可靠的多步驟工作流程已成為人工智慧領域的核心挑戰。儘管近期LLMs在代理能力方面取得進展，多數代理系統仍缺乏用於規範、驗證及除錯其工作流程與執行軌跡的形式化方法。此挑戰與數學中長期存在的問題相似——自然語言的歧義性促使了形式語言的發展。受此典範啟發，我們提出**Lean4Agent**，據我們所知，這是首個利用依賴類型形式語言Lean4來建模與驗證代理行為的框架。**Lean4Agent**推出**FormalAgentLib**——一個可擴展的Lean4函式庫，用於在明確假設下形式化建模與驗證代理工作流程的語義一致性，並能從軌跡中定位執行階段的錯誤。基於**FormalAgentLib**，我們進一步開發**LeanEvolve**，該工具應用**FormalAgentLib**的結果來修訂工作流程以提升其能力。在SWE-Bench-Verified的困難子集與ELAIP-Bench子集上，使用五個頂尖LLM進行的廣泛實驗表明：通過驗證的工作流程平均比未通過者高出**11.94%**，而**LeanEvolve**進一步將SWE效能平均提升**7.47%**。此外，**Lean4Agent**為使用具表達力的依賴類型形式語言來形式化建模與驗證代理行為開闢了新領域。

English

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL to model and verify agent behavior. **Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows to enhance its capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.