Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

October 1, 2025
作者: Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang
cs.AI

Abstract

As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.
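The abstract describes translating knowledge-graph relations into tasks via subgraph sampling, task templates, and meta-paths. Below is a minimal illustrative sketch of that idea in Python, assuming a networkx graph whose edges carry a semantic "relation" label; the `Task` fields, template strings, and function names are hypothetical stand-ins, not the paper's actual implementation.

```python
# A minimal sketch of KG-driven task generation: sample a small connected
# subgraph, then verbalize its edges through relation-keyed templates
# (a one-edge "meta-path"). All names here are illustrative assumptions.
import random
from dataclasses import dataclass

import networkx as nx


@dataclass
class Task:
    question: str        # natural-language prompt shown to the agent
    answer_nodes: list   # graph nodes a correct answer should ground to


# Hypothetical templates keyed by relation type.
TEMPLATES = {
    "describes": "According to the documents, what does '{src}' describe?",
    "links_to":  "Starting from the page '{src}', which page does it link to?",
}


def sample_subgraph(kg: nx.DiGraph, k: int = 3) -> nx.DiGraph:
    """Sample a small connected subgraph to serve as one task's context."""
    seed = random.choice(list(kg.nodes))
    nodes = list(nx.bfs_tree(kg, seed, depth_limit=k).nodes)[: k + 1]
    return kg.subgraph(nodes)


def generate_tasks(kg: nx.DiGraph, n: int = 10, max_tries: int = 1000) -> list:
    """Repeatedly sample subgraphs and instantiate templates on their edges."""
    tasks = []
    for _ in range(max_tries):
        if len(tasks) >= n:
            break
        sub = sample_subgraph(kg)
        for src, dst, data in sub.edges(data=True):
            template = TEMPLATES.get(data.get("relation"))
            if template:  # only relations we know how to verbalize
                tasks.append(Task(template.format(src=src), [dst]))
    return tasks[:n]
```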
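The multi-stage filtering pipeline (node reachability, LLM scoring, similarity analysis) could likewise be sketched as a sequential filter, as below. Here `llm_score` and `embed` are hypothetical stand-ins for the judge model and text encoder, the thresholds are arbitrary, and the reachability check is simplified to node membership in the graph.

```python
# A hedged sketch of the multi-stage filter: (1) keep tasks whose answer
# nodes are grounded in the graph, (2) score them with an LLM judge,
# (3) drop near-duplicates by embedding cosine similarity.
import numpy as np


def filter_tasks(tasks, kg, llm_score, embed, min_score=0.7, max_sim=0.9):
    kept, vecs = [], []
    for task in tasks:
        # Stage 1: reachability, simplified here to "answer nodes exist in the KG".
        if not all(kg.has_node(node) for node in task.answer_nodes):
            continue
        # Stage 2: LLM scoring of clarity/answerability on a 0-1 scale (assumed).
        if llm_score(task.question) < min_score:
            continue
        # Stage 3: similarity analysis; reject near-duplicates of kept tasks.
        v = embed(task.question)
        v = v / np.linalg.norm(v)  # normalize so dot product = cosine similarity
        if any(float(v @ u) > max_sim for u in vecs):
            continue
        kept.append(task)
        vecs.append(v)
    return kept
```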