VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
June 3, 2025
Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
cs.AI
Abstract
Recent advancements in Vision Language Models (VLMs) have expanded their
capabilities to interactive agent tasks, yet existing benchmarks remain limited
to single-agent or text-only environments. In contrast, real-world scenarios
often involve multiple agents interacting within rich visual and linguistic
contexts, posing challenges with both multimodal observations and strategic
interactions. To bridge this gap, we introduce Visual Strategic Bench
(VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning
and decision-making in multi-agent environments. VS-Bench comprises eight
vision-grounded environments spanning cooperative, competitive, and
mixed-motive interactions, designed to assess agents' ability to predict
others' future moves and optimize for long-term objectives. We consider two
complementary evaluation dimensions: offline evaluation of strategic
reasoning by next-action prediction accuracy, and online evaluation of
decision-making by normalized episode return. Extensive experiments on fourteen
leading VLMs reveal a significant gap between current models and optimal
performance, with the best models attaining 47.8% prediction accuracy and 24.3%
normalized return. We further conduct in-depth analyses on multimodal
observations, test-time scaling, social behaviors, and failure cases of VLM
agents. By standardizing the evaluation and highlighting the limitations of
existing models, we envision VS-Bench as a foundation for future research on
strategic multimodal agents. Code and data are available at
https://vs-bench.github.io.
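As a rough illustration of the two evaluation dimensions described in the abstract, the sketch below shows one plausible way to compute them. The exact matching of predicted actions and the normalization bounds (`min_return`, `max_return`, e.g. random-play and optimal returns) are assumptions for illustration, not the paper's reference implementation.

```python
def prediction_accuracy(predicted_actions, actual_actions):
    """Offline strategic-reasoning score: fraction of steps where the VLM's
    prediction of the other agents' next action matches their actual action.
    (Assumed definition; the paper reports this as next-action prediction accuracy.)"""
    assert len(predicted_actions) == len(actual_actions) > 0
    correct = sum(p == a for p, a in zip(predicted_actions, actual_actions))
    return correct / len(actual_actions)


def normalized_return(episode_return, min_return, max_return):
    """Online decision-making score: episode return rescaled to [0, 1] between
    a lower reference and an upper (optimal) reference return.
    (Assumed normalization scheme, chosen for illustration.)"""
    return (episode_return - min_return) / (max_return - min_return)


# Example usage with hypothetical values:
# prediction_accuracy(["cooperate", "defect"], ["cooperate", "cooperate"]) -> 0.5
# normalized_return(episode_return=3.0, min_return=0.0, max_return=10.0) -> 0.3
```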