Lean4Agent: エージェントワークフローと軌跡の形式的モデリングと検証

要旨

大規模言語モデル（LLM）に信頼性の高い多段階ワークフローを実行させることは、人工知能における中心的課題となっている。近年、LLMのエージェント機能は進歩しているが、ほとんどのエージェントシステムは、そのワークフローや実行軌跡を仕様化・検証・デバッグするための形式的手法を依然として欠いている。この課題は数学における長年の問題、すなわち自然言語（NL）の曖昧さが形式言語（FL）の開発を動機づけてきた状況を反映している。このパラダイムに着想を得て、我々は **Lean4Agent** を提案する。これは、我々の知る限り、依存型形式言語であるLean4を用いてエージェントの振る舞いをモデル化・検証する初めてのフレームワークである。**Lean4Agent** は **FormalAgentLib** を導入する。これは拡張可能なLean4ライブラリであり、エージェントワークフローの意味的一貫性を明示的な仮定の下で形式的にモデル化・検証し、軌跡によって明らかになる実行時障害の局所化を可能にする。**FormalAgentLib** を基盤として、さらに **LeanEvolve** を開発した。これは **FormalAgentLib** の結果を適用してワークフローを修正し、その能力を向上させる。SWE-Bench-Verifiedの困難な問題サブセットとELAIP-Benchのサブセットを用いた、5つの主要LLMにわたる広範な実験により、検証に合格したワークフローは不合格のものよりも平均 **11.94%** 優れており、**LeanEvolve** はSWE性能を平均 **7.47%** さらに向上させることが示された。さらに **Lean4Agent** は、表現力豊かな依存型形式言語を用いてエージェントの振る舞いを形式的にモデル化・検証する新たな研究分野の基盤を確立する。

English

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL to model and verify agent behavior. **Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows to enhance its capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.