AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
February 26, 2026
Authors: Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
cs.AI
Abstract
Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
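The long-horizon hybrid tool use described above can be sketched as a simple agent loop. The sketch below is purely illustrative: the tool names (`web_search`, `image_search`, `run_code`), the scripted plan, and the 25-turn cap are hypothetical stand-ins, not AgentVista's actual harness or interface.

```python
# Hypothetical sketch of a long-horizon tool-calling loop of the kind
# AgentVista tasks require. All tool implementations are stubs; a real
# agent would call search APIs and a sandboxed code executor.

def web_search(query):
    # Stub: stands in for a real web-search API call.
    return f"results for {query!r}"

def image_search(image_id):
    # Stub: stands in for a reverse image search.
    return f"matches for image {image_id!r}"

def run_code(source):
    # Stub: stands in for sandboxed code execution (e.g. image processing).
    return f"executed {len(source)} chars"

TOOLS = {"web_search": web_search, "image_search": image_search, "run_code": run_code}

def run_episode(plan, max_turns=25):
    """Execute a scripted plan of (tool, argument) steps, one tool call per turn."""
    transcript = []
    for turn, (tool, arg) in enumerate(plan, start=1):
        if turn > max_turns:
            break  # hard instances can exceed the turn budget
        transcript.append((tool, TOOLS[tool](arg)))
    return transcript

# A toy plan echoing the troubleshooting example: ground a photo,
# find the schematic, process the image, validate against documentation.
plan = [
    ("image_search", "wiring_photo_001"),
    ("web_search", "device schematic model X"),
    ("run_code", "crop_and_enhance(region)"),
    ("web_search", "repair documentation model X"),
]
history = run_episode(plan)
print(len(history))  # number of tool-calling turns used
```

In the benchmark itself the plan is chosen by the model under evaluation rather than scripted, which is exactly what makes hard instances require 25+ turns.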