Lean4Agent: 에이전트 워크플로우와 궤적에 대한 형식적 모델링 및 검증

초록

대규모 언어 모델(LLM)이 신뢰할 수 있는 다단계 워크플로우를 실행할 수 있도록 하는 것은 인공지능 분야의 핵심 과제가 되었다. 최근 LLM의 에이전트 능력이 발전했음에도 불구하고, 대부분의 에이전트 시스템은 여전히 워크플로우와 실행 궤적을 명세, 검증, 디버깅하기 위한 형식적 방법이 부족하다. 이러한 어려움은 자연어의 모호성이 형식 언어의 개발을 촉진한 수학 분야의 오랜 문제를 반영한다. 이 패러다임에서 영감을 얻어, 우리는 **Lean4Agent**를 제안한다. 이는 저자의 지식 범위 내에서 의존 타입 형식 언어인 Lean4를 사용하여 에이전트 동작을 모델링하고 검증하는 최초의 프레임워크이다. **Lean4Agent**는 **FormalAgentLib**를 도입한다. 이는 명시적 가정 하에 에이전트 워크플로우의 의미적 일관성을 형식적으로 모델링하고 검증하며, 궤적을 통해 드러난 실행 시간 실패의 위치를 파악할 수 있게 해주는 확장 가능한 Lean4 라이브러리이다. **FormalAgentLib**를 기반으로, 우리는 **LeanEvolve**를 추가로 개발하였다. 이는 **FormalAgentLib**의 결과를 워크플로우 개정에 적용하여 그 능력을 향상시킨다. SWE-Bench-Verified의 어려운 문제 부분 집합과 ELAIP-Bench의 일부 부분 집합에 대해 5개의 선도적 LLM을 대상으로 한 광범위한 실험 결과, 검증을 통과한 워크플로우가 실패한 워크플로우보다 평균 **11.94%** 더 우수한 성능을 보였으며, **LeanEvolve**는 SWE 성능을 평균 **7.47%** 추가로 향상시켰다. 더 나아가 **Lean4Agent**는 표현력이 풍부한 의존 타입 형식 언어를 사용하여 에이전트 동작을 형식적으로 모델링하고 검증하는 새로운 연구 분야를 위한 기초를 마련한다.

English

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL to model and verify agent behavior. **Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows to enhance its capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.