Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
April 3, 2026
Authors: Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents that solve problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and judge primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for multimodal agentic capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels, designed to evaluate capability synergy and featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis (step axis) and the V-axis (verification axis). To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy but drops to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
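The abstract describes process-level scoring against stepwise checkpoints and an overthinking metric relative to human trajectories, but gives no formulas. A minimal sketch of one plausible reading, assuming checkpoints are pass/fail items tagged with an S- or V-axis and overthinking is measured as excess steps over the human reference trajectory (the `Checkpoint` type, `process_score`, and `overthinking` functions are illustrative, not the paper's actual implementation):

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    axis: str     # "S" (step axis) or "V" (verification axis)
    passed: bool  # did the agent's trajectory satisfy this checkpoint?


def process_score(checkpoints: list[Checkpoint]) -> float:
    """Fraction of stepwise checkpoints satisfied, i.e. auditing
    intermediate states rather than only the final answer."""
    if not checkpoints:
        return 0.0
    return sum(c.passed for c in checkpoints) / len(checkpoints)


def overthinking(agent_steps: int, human_steps: int) -> float:
    """Hypothetical efficiency metric: steps taken beyond the human
    reference trajectory, normalized by the human step count."""
    return max(agent_steps - human_steps, 0) / human_steps
```

Under this reading, a trajectory that clears two of four checkpoints scores 0.5, and an agent taking 15 steps where the human reference needed 10 has an overthinking value of 0.5.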