Benchmarking Agentic Workflow Generation

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advances in reasoning and planning, where decomposing complex problems into executable workflows is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents. Through comprehensive evaluations across different types of LLMs, we find distinct gaps between the sequence planning and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less time during inference. Code and dataset will be available at https://github.com/zjunlp/WorFBench.
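To make the idea of subsequence matching concrete, the following is a minimal sketch of how a predicted linear workflow could be scored against a reference one. It is not the WorFEval implementation: the function names `lcs_length` and `chain_f1` are hypothetical, the longest common subsequence is used here as one plausible matching criterion, and steps are compared by exact string equality, which is a simplifying assumption.

```python
from typing import List, Tuple


def lcs_length(pred: List[str], gold: List[str]) -> int:
    """Length of the longest common subsequence of two step lists.

    Assumption: two steps match only if their strings are identical.
    """
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def chain_f1(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted step sequence (illustrative only)."""
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    matched = lcs_length(pred, gold)
    precision = matched / len(pred)  # share of predicted steps in correct order
    recall = matched / len(gold)     # share of reference steps recovered in order
    if matched == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction swaps two middle steps.
gold_steps = ["search", "filter", "book", "confirm"]
pred_steps = ["search", "book", "filter", "confirm"]
print(chain_f1(pred_steps, gold_steps))  # (0.75, 0.75, 0.75)
```

An order-sensitive score of this kind penalizes a prediction that contains the right steps in the wrong order. A graph-level score would additionally need to compare dependency edges, for instance through a common-subgraph computation, which this sketch does not attempt.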
