Benchmarking Agentic Workflow Generation

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advances in reasoning and planning, where decomposing a complex problem into an executable workflow is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents. Through comprehensive evaluations across different types of LLMs, we find distinct gaps between the sequence planning and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less time during inference. Code and dataset will be available at https://github.com/zjunlp/WorFBench.
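To make the idea of subsequence matching concrete, the following is a minimal sketch of how a predicted linear workflow could be scored against a gold one. It is an illustration only and not the WorFEval implementation: the function names `lcs_length` and `chain_f1`, the use of a longest common subsequence, the exact string comparison between steps, and the F1-style aggregation are all assumptions made for this example.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best alignment that ends just before x and y.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip either x or y, whichever keeps the longer alignment.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_f1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical order-aware F1 between two step sequences.

    Assumption: steps are compared by exact string equality. A real
    protocol would likely need a softer notion of node matching.
    """
    if not predicted or not gold:
        return 0.0
    matched = lcs_length(predicted, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of predicted steps in order
    recall = matched / len(gold)          # share of gold steps recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["find recipe", "buy ingredients", "cook", "serve"]
    predicted = ["buy ingredients", "find recipe", "cook", "serve"]
    # Two steps are swapped, so only three steps can be aligned in order.
    print(chain_f1(predicted, gold))
```

A score of this kind rewards steps that appear in the right relative order, which is why a workflow containing all the correct steps in a shuffled order still loses credit. Graph-structured workflows would need a subgraph-level comparison instead, which this sketch does not attempt.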