Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents that solve problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily on final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints, with manual annotation averaging 10+ person-hours per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
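To make the idea of process-level verification concrete, the following is a minimal sketch of how per-axis checkpoint scores and an efficiency measure relative to a human trajectory could be computed. The abstract does not give the benchmark's actual formulas, so everything here is an assumption for illustration: the `Checkpoint` type, the `process_score` and `overthinking` functions, and in particular the definition of overthinking as excess steps over the human reference are hypothetical, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    """One stepwise checkpoint (hypothetical representation)."""
    axis: str      # "S" or "V", the two axes named in the abstract
    passed: bool   # whether the model's trajectory satisfied this checkpoint


def process_score(checkpoints: List[Checkpoint], axis: str) -> Optional[float]:
    """Fraction of checkpoints on one axis that passed (assumed definition)."""
    on_axis = [c for c in checkpoints if c.axis == axis]
    if not on_axis:
        return None  # no checkpoints on this axis for this task
    return sum(c.passed for c in on_axis) / len(on_axis)


def overthinking(model_steps: int, human_steps: int) -> float:
    """Excess steps relative to the human reference, floored at zero.

    Assumed definition for illustration only; the benchmark's real metric
    may differ.
    """
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return max(0, model_steps - human_steps) / human_steps


# Usage with made-up values (not results from the benchmark).
checkpoints = [
    Checkpoint("S", True),
    Checkpoint("S", False),
    Checkpoint("V", True),
]
s_score = process_score(checkpoints, "S")   # 0.5
v_score = process_score(checkpoints, "V")   # 1.0
excess = overthinking(model_steps=9, human_steps=6)  # 0.5
```

Scoring each axis separately keeps a failure in one kind of step from being hidden by success in the other, which is the point of auditing intermediate states rather than only the final answer.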
