Benchmarking Agentic Workflow Generation

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advances in reasoning and planning, where decomposing complex problems into executable workflows is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less time during inference. Code and dataset are available at https://github.com/zjunlp/WorFBench.
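
To make the idea of subsequence matching concrete, the following is a minimal sketch of how a predicted chain of workflow steps could be scored against a reference chain. It is an illustration only, not the WorFEval implementation: the function names `lcs_length` and `chain_f1`, the use of exact string equality to match steps, and the F1-style aggregation are all assumptions made for this example.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Length of the longest common subsequence of two step lists.

    Steps match only on exact string equality (an assumption of this sketch).
    """
    # prev[j] is the LCS length of the pred prefix processed so far and gold[:j].
    prev = [0] * (len(gold) + 1)
    for p in pred:
        curr = [0]
        for j, g in enumerate(gold, start=1):
            if p == g:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical F1-style score from the order-preserving overlap."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["find recipe", "buy ingredients", "cook", "serve"]
    pred = ["buy ingredients", "find recipe", "cook", "serve"]
    # Two steps are swapped, so only three steps can be matched in order.
    print(chain_f1(pred, gold))
```

A score of this kind rewards steps that appear in the correct relative order, which is why swapping two steps lowers it even though every step is present. Scoring graph-structured workflows would additionally require comparing edges, for example through subgraph matching, which this sketch does not attempt.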