AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
February 26, 2026
Authors: Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
cs.AI
Abstract
Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.