Lean4Agent:智能体工作流与轨迹的形式化建模与验证
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
June 2, 2026
作者: Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang
cs.AI
摘要
使大型语言模型(LLM)能够执行可靠的多步工作流已成为人工智能领域的核心挑战。尽管近期在LLM的智能体能力方面取得进展,但大多数智能体系统仍缺乏形式化方法用于指定、验证和调试其工作流及执行轨迹。这一挑战与数学中一个长期存在的问题相呼应——自然语言(NL)的模糊性推动着形式语言(FL)的发展。受此范式启发,我们提出了**Lean4Agent**——据我们所知,这是首个利用依赖类型形式语言Lean4来建模和验证智能体行为的框架。**Lean4Agent**推出了**FormalAgentLib**,一个可扩展的Lean4库,用于在显式假设下形式化建模和验证智能体工作流的语义一致性,并支持通过轨迹定位执行时故障。基于**FormalAgentLib**,我们进一步开发了**LeanEvolve**,它应用**FormalAgentLib**中的结果来修订工作流以提升其能力。在SWE-Bench-Verified的困难子集及ELAIP-Bench子集上,针对5个主流LLM进行的大量实验表明:通过验证的工作流相较于未通过的方案平均性能提升**11.94%**,而**LeanEvolve**进一步将SWE性能平均提升**7.47%**。此外,**Lean4Agent**为使用表达能力丰富的依赖类型形式语言形式化建模与验证智能体行为这一新领域奠定了基础。
English
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL to model and verify agent behavior. **Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows to enhance its capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.