Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Abstract

As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.
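To make the filtering idea concrete, the following is a minimal sketch of a three-stage filter over graph-derived tasks, not the Graph2Eval implementation. Everything in it is an assumption made for illustration: the toy graph, the node and relation names, the single navigation template, the thresholds, the use of token-level Jaccard similarity as the similarity measure, and `stub_score`, which stands in for an LLM scoring call that the abstract does not specify.

```python
from collections import deque

# Toy knowledge graph: node -> list of (relation, neighbor).
# All node and relation names are hypothetical.
GRAPH = {
    "page:home": [("links_to", "page:search")],
    "page:search": [("contains", "form:query")],
    "form:query": [("submits_to", "page:results")],
    "page:results": [],
    "page:orphan": [],
}


def reachable(graph, start, goal):
    """Breadth-first search: True if goal can be reached from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for _, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def jaccard(a, b):
    """Token-level Jaccard similarity between two task prompts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def stub_score(task):
    """Placeholder for an LLM quality score (assumption: every task passes)."""
    return 1.0


def make_task(start, goal):
    """Fill a fixed, hypothetical template with a start and a goal node."""
    return {
        "start": start,
        "goal": goal,
        "prompt": f"Starting at {start}, navigate to {goal}.",
    }


def filter_tasks(graph, tasks, min_score=0.5, max_sim=0.8):
    """Keep tasks that pass reachability, scoring, and near-duplicate checks."""
    kept = []
    for task in tasks:
        # Stage 1: the goal node must be reachable from the start node.
        if not reachable(graph, task["start"], task["goal"]):
            continue
        # Stage 2: the (stubbed) quality score must clear the threshold.
        if stub_score(task) < min_score:
            continue
        # Stage 3: drop tasks too similar to one that was already kept.
        if any(jaccard(task["prompt"], k["prompt"]) > max_sim for k in kept):
            continue
        kept.append(task)
    return kept


CANDIDATES = [
    make_task("page:home", "page:results"),
    make_task("page:home", "page:results"),    # exact duplicate
    make_task("page:home", "page:orphan"),     # goal is unreachable
    make_task("page:search", "page:results"),
]
KEPT = filter_tasks(GRAPH, CANDIDATES)
```

In this toy run the duplicate and the unreachable task are dropped, so `KEPT` holds two tasks. Running the cheap structural check first means the more expensive scoring stage only sees tasks that are executable at all; whether Graph2Eval orders its stages this way is not stated in the abstract.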
