Benchmarking Agentic Workflow Generation

Abstract

Large Language Models (LLMs), with their ability to handle a wide range of tasks, have driven significant advances in reasoning and planning, where decomposing a complex problem into an executable workflow is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To address this, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents. Through comprehensive evaluations across different types of LLMs, we find distinct gaps between the sequence planning and graph planning capabilities of LLM agents; even GPT-4 exhibits a gap of around 15%. We also train two open-source models and evaluate their generalization on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less inference time. Code and dataset will be available at https://github.com/zjunlp/WorFBench.
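To give a rough sense of what subsequence matching can mean for a linear workflow, the sketch below scores a predicted chain of steps against a gold chain using the longest common subsequence. This is an illustration only, not the WorFEval implementation: the function names `lcs_length` and `chain_f1`, the exact-string comparison of steps, and the F1 aggregation are all assumptions made for this example.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best match that ends before both elements.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_f1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical order-aware score: F1 over steps matched in the same order."""
    if not predicted or not gold:
        return 0.0
    matched = lcs_length(predicted, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["find recipe", "buy ingredients", "cook", "serve"]
    predicted = ["buy ingredients", "find recipe", "cook", "serve"]
    # Two steps are swapped, so only three steps can be matched in order.
    print(chain_f1(predicted, gold))
```

Because the score depends on order, a prediction that contains every gold step but swaps two of them is penalized. A graph-structured workflow would need a subgraph comparison instead, which this sketch does not attempt.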
