

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

April 3, 2026
Authors: Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along a dual axis: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy but drops to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
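
The abstract does not specify how checkpoint auditing or the overthinking metric are computed. As a rough, hypothetical sketch only, the snippet below illustrates one plausible way such process-level scoring could work: a per-task checkpoint pass rate over dual-axis (S/V) checkpoints, and an overthinking ratio defined as the agent's step count relative to the annotated human reference trajectory. All names, fields, and example values here are illustrative assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One stepwise checkpoint on a human reference trajectory (hypothetical schema).

    axis: 'S' or 'V', following the dual-axis annotation mentioned in the abstract.
    """
    axis: str          # 'S' or 'V'
    description: str   # the intermediate state the agent must reach
    passed: bool = False

def process_score(checkpoints: list[Checkpoint]) -> float:
    """Fraction of intermediate checkpoints satisfied by the agent's trajectory."""
    if not checkpoints:
        return 0.0
    return sum(cp.passed for cp in checkpoints) / len(checkpoints)

def overthinking_ratio(agent_steps: int, human_steps: int) -> float:
    """Efficiency relative to the human reference: >1.0 means the agent
    used more steps (tool calls / reasoning turns) than the human trajectory."""
    return agent_steps / max(human_steps, 1)

# Illustrative example: an agent passes 3 of 4 checkpoints and takes
# 12 steps where the human reference trajectory needed 8.
cps = [
    Checkpoint("V", "cropped the relevant image region", passed=True),
    Checkpoint("V", "read the correct value from the chart", passed=True),
    Checkpoint("S", "retrieved the source document via web search", passed=True),
    Checkpoint("S", "cross-checked the figure against the source", passed=False),
]
print(f"process score:      {process_score(cps):.2f}")         # 0.75
print(f"overthinking ratio: {overthinking_ratio(12, 8):.2f}")   # 1.50
```

Under this reading, final-answer accuracy, checkpoint pass rate, and the overthinking ratio would be reported as complementary metrics; the paper itself should be consulted for the exact definitions.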