

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

October 1, 2025
作者: Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang
cs.AI

Abstract

As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.
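
The abstract describes a generate-then-filter pipeline: sample a subgraph from the knowledge graph, instantiate a task from its semantic relations via templates, then apply staged filters (node reachability, LLM scoring, similarity analysis). Below is a minimal, hypothetical Python sketch of that loop using networkx as a stand-in knowledge graph. All function names, thresholds, and the toy graph are illustrative assumptions; the paper's actual templates, meta-paths, and LLM judge are not reproduced here.

```python
# Hypothetical sketch of a Graph2Eval-style generate-then-filter loop.
# Names, thresholds, and the toy graph are illustrative, not the authors' API.
import random
import networkx as nx

def sample_subgraph(kg: nx.DiGraph, radius: int = 1) -> nx.DiGraph:
    """Sample a candidate task subgraph: an ego network around a random seed node."""
    seed = random.choice(list(kg.nodes))
    return nx.ego_graph(kg, seed, radius=radius)

def instantiate_task(sub: nx.DiGraph) -> dict:
    """Turn one semantic relation in the subgraph into a templated QA-style task."""
    u, v, data = next(iter(sub.edges(data=True)))
    return {
        "question": f"What is the {data.get('relation', 'relation')} between {u} and {v}?",
        "answer": data.get("relation", "unknown"),
        "support_nodes": list(sub.nodes),
    }

def is_reachable(sub: nx.DiGraph) -> bool:
    """Stage 1: keep tasks whose supporting nodes form a connected evidence chain."""
    return nx.is_weakly_connected(sub)

def score_with_llm(task: dict) -> float:
    """Stage 2: placeholder for an LLM judge scoring clarity and answerability."""
    return 0.9  # stub; a real pipeline would call an LLM here

def is_novel(task: dict, kept: list, max_overlap: float = 0.8) -> bool:
    """Stage 3: crude similarity filter via Jaccard overlap of support nodes."""
    nodes = set(task["support_nodes"])
    for prev in kept:
        other = set(prev["support_nodes"])
        if len(nodes & other) / max(1, len(nodes | other)) > max_overlap:
            return False
    return True

def generate_tasks(kg: nx.DiGraph, n: int = 10, min_score: float = 0.7) -> list:
    """Sample, instantiate, and filter until n tasks survive all stages."""
    kept = []
    while len(kept) < n:
        sub = sample_subgraph(kg)
        if sub.number_of_edges() == 0 or not is_reachable(sub):
            continue
        task = instantiate_task(sub)
        if score_with_llm(task) >= min_score and is_novel(task, kept):
            kept.append(task)
    return kept

# Toy graph standing in for the multi-source knowledge graph described above.
kg = nx.DiGraph()
kg.add_edge("Graph2Eval", "knowledge graph", relation="built on")
kg.add_edge("knowledge graph", "task space", relation="serves as")
kg.add_edge("task space", "multimodal tasks", relation="yields")

for t in generate_tasks(kg, n=2):
    print(t["question"])
```

In this sketch each filtering stage can reject a candidate independently, which mirrors the multi-stage design the abstract describes; the actual framework additionally uses meta-paths to select which relations become tasks.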