

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

August 12, 2024
Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB provides a trajectory training set constructed through hybrid methods including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements in LMMs through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train & test data, and some of the fine-tuned open LMMs are available at https://github.com/THUDM/VisualAgentBench.
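To make the benchmark setup concrete, the sketch below illustrates the kind of agent-environment interaction loop that a VAB-style evaluation implies: the agent receives a task instruction and a visual observation, emits an action, and the recorded trajectory can later serve as behavior-cloning data. This is a minimal, hypothetical sketch; the class and method names (Step, Trajectory, rollout, agent.act, env.step) are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Step:
    observation: Any   # e.g. a screenshot, GUI state, or embodied-view image
    action: str        # action string emitted by the LMM agent
    reward: float      # per-step reward or task-progress signal


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step] = field(default_factory=list)


def rollout(agent, env, instruction: str, max_steps: int = 30) -> Trajectory:
    """Run one episode and record the trajectory for later behavior cloning.

    `agent` and `env` are placeholders for any LMM agent and environment
    exposing act/reset/step methods with the signatures assumed below.
    """
    traj = Trajectory(instruction=instruction)
    obs = env.reset(instruction)
    for _ in range(max_steps):
        action = agent.act(obs, instruction)   # LMM proposes the next action
        obs, reward, done = env.step(action)   # environment executes it
        traj.steps.append(Step(obs, action, reward))
        if done:
            break
    return traj
```

Trajectories gathered this way, whether from program-based solvers, bootstrapped LMM agents, or human demonstrations, can be flattened into (observation, instruction) → action pairs for supervised fine-tuning, which is the essence of the behavior-cloning step the abstract describes.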
