

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

August 12, 2024
作者: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB provides a trajectory training set constructed through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, which promotes substantial performance improvements in LMMs through behavior cloning. Our work not only benchmarks existing models but also provides a solid foundation for future development of visual foundation agents. Code, training and test data, and some of the fine-tuned open LMMs are available at https://github.com/THUDM/VisualAgentBench.
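
The training recipe named in the abstract is behavior cloning: supervised learning that maximizes the likelihood of demonstrated actions given their observations, over trajectories collected from Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations. The sketch below illustrates that loop with a toy softmax policy over a small discrete action set; `ToyPolicy`, `Step`, and `Trajectory` are hypothetical names used for illustration only, not the VisualAgentBench API (see the repository for the actual training code).

```python
# Minimal behavior-cloning sketch on (observation, action) trajectories.
# All names here are hypothetical placeholders, not VisualAgentBench code.
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    observation: str  # e.g., serialized screenshot features plus task context
    action: str       # demonstrated action from a solver, a bootstrapped
                      # LMM agent, or a human demonstration

@dataclass
class Trajectory:
    steps: List[Step]

class ToyPolicy:
    """Stand-in for an LMM policy: maps observations to action probabilities."""
    def __init__(self, actions: List[str]):
        self.actions = actions
        self.logits: Dict[str, Dict[str, float]] = {}

    def probs(self, obs: str) -> Dict[str, float]:
        # Softmax over per-observation logits (initialized uniform).
        logits = self.logits.setdefault(obs, {a: 0.0 for a in self.actions})
        z = sum(math.exp(v) for v in logits.values())
        return {a: math.exp(v) / z for a, v in logits.items()}

    def update(self, obs: str, action: str, lr: float = 0.5) -> None:
        # One gradient step on the cross-entropy (behavior cloning) loss:
        # d(loss)/d(logit_a) = p(a) - 1[a == demonstrated action].
        probs = self.probs(obs)
        for a in self.actions:
            grad = probs[a] - (1.0 if a == action else 0.0)
            self.logits[obs][a] -= lr * grad

def behavior_cloning(policy: ToyPolicy, data: List[Trajectory], epochs: int = 20) -> None:
    # Supervised imitation: fit each demonstrated action given its observation.
    for _ in range(epochs):
        for traj in data:
            for step in traj.steps:
                policy.update(step.observation, step.action)

if __name__ == "__main__":
    demos = [Trajectory([Step("login page", "click #submit"),
                         Step("dashboard", "type query")])]
    policy = ToyPolicy(["click #submit", "type query", "scroll down"])
    behavior_cloning(policy, demos)
    print(policy.probs("login page"))  # mass shifts toward the demonstrated action
```

In VAB the policy is an LMM fine-tuned on multimodal observations rather than a lookup table, but the objective is the same cross-entropy over demonstrated actions.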
