VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Abstract

Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging language and vision capabilities to form highly capable visual foundation agents. These agents are postulated to excel across a wide range of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB provides a trajectory training set built through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, which enables substantial performance improvements in LMMs through behavior cloning. Our work not only benchmarks existing models but also aims to provide a solid foundation for the future development of visual foundation agents. Code, training and test data, and some of the fine-tuned open LMMs are available at https://github.com/THUDM/VisualAgentBench.
