Benchmarking Agentic Workflow Generation
October 10, 2024
Authors: Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
cs.AI
Abstract
Large Language Models (LLMs), with their exceptional ability to handle a wide
range of tasks, have driven significant advancements in reasoning and
planning tasks, where decomposing complex problems into executable workflows
is a crucial step. Existing workflow evaluation frameworks
either focus solely on holistic performance or suffer from limitations such as
restricted scenario coverage, simplistic workflow structures, and lax
evaluation standards. To this end, we introduce WorFBench, a unified workflow
generation benchmark with multi-faceted scenarios and intricate graph workflow
structures. Additionally, we present WorFEval, a systematic evaluation protocol
that uses subsequence and subgraph matching algorithms to accurately quantify
the workflow generation capabilities of LLM agents. Through comprehensive
evaluations across different types of LLMs, we discover distinct gaps between
the sequence planning capabilities and graph planning capabilities of LLM
agents, with even GPT-4 exhibiting a gap of around 15%. We also train two
open-source models and evaluate their generalization abilities on held-out
tasks. Furthermore, we observe that the generated workflows can enhance
downstream tasks, enabling them to achieve superior performance in less time
during inference. Code and dataset will be available at
https://github.com/zjunlp/WorFBench.
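
The abstract mentions subsequence matching as one ingredient of WorFEval but gives no details. The sketch below is only an illustration of the general idea, not the authors' implementation: it scores a predicted list of workflow steps against a gold list using the longest common subsequence (LCS). The function names `lcs_length` and `chain_f1`, the exact-string comparison of steps, and the F1-style aggregation are all assumptions made for this example.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Length of the longest common subsequence of two step lists.

    Steps are compared by exact string equality (an assumption; a real
    evaluator would likely need a softer notion of step equivalence).
    """
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def chain_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical order-aware F1 between predicted and gold step chains."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # share of predicted steps kept in order
    recall = matched / len(gold)     # share of gold steps recovered in order
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["find recipe", "buy ingredients", "preheat oven", "bake"]
    pred = ["find recipe", "preheat oven", "buy ingredients", "bake"]
    print(lcs_length(pred, gold), chain_f1(pred, gold))
```

Because the LCS respects order, swapping two steps lowers the score even when every step is present, which is the property that makes subsequence matching stricter than a bag-of-steps comparison.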