VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
June 3, 2025
Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
cs.AI
Abstract
Recent advancements in Vision Language Models (VLMs) have expanded their
capabilities to interactive agent tasks, yet existing benchmarks remain limited
to single-agent or text-only environments. In contrast, real-world scenarios
often involve multiple agents interacting within rich visual and linguistic
contexts, posing challenges with both multimodal observations and strategic
interactions. To bridge this gap, we introduce Visual Strategic Bench
(VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning
and decision-making in multi-agent environments. VS-Bench comprises eight
vision-grounded environments spanning cooperative, competitive, and
mixed-motive interactions, designed to assess agents' ability to predict
others' future moves and optimize for long-term objectives. We consider two
complementary evaluation dimensions: offline evaluation of strategic
reasoning by next-action prediction accuracy, and online evaluation of
decision-making by normalized episode return. Extensive experiments on fourteen
leading VLMs reveal a significant gap between current models and optimal
performance, with the best models attaining 47.8% prediction accuracy and 24.3%
normalized return. We further conduct in-depth analyses on multimodal
observations, test-time scaling, social behaviors, and failure cases of VLM
agents. By standardizing the evaluation and highlighting the limitations of
existing models, we envision VS-Bench as a foundation for future research on
strategic multimodal agents. Code and data are available at
https://vs-bench.github.io.
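As a rough illustration of the two evaluation dimensions described in the abstract, the sketch below shows one plausible way to compute them. The exact matching of predicted actions and the normalization bounds (`min_return`, `max_return`, e.g. random-play and optimal returns) are assumptions for illustration, not the paper's reference implementation.

```python
def prediction_accuracy(predicted_actions, actual_actions):
    """Offline strategic-reasoning score: fraction of steps where the VLM's
    prediction of the other agents' next action matches their actual action.
    (Assumed definition; the paper reports this as next-action prediction accuracy.)"""
    assert len(predicted_actions) == len(actual_actions) > 0
    correct = sum(p == a for p, a in zip(predicted_actions, actual_actions))
    return correct / len(actual_actions)


def normalized_return(episode_return, min_return, max_return):
    """Online decision-making score: episode return rescaled to [0, 1] between
    a lower reference and an upper (optimal) reference return.
    (Assumed normalization scheme, chosen for illustration.)"""
    return (episode_return - min_return) / (max_return - min_return)


# Example usage with hypothetical values:
# prediction_accuracy(["cooperate", "defect"], ["cooperate", "cooperate"]) -> 0.5
# normalized_return(episode_return=3.0, min_return=0.0, max_return=10.0) -> 0.3
```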