VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges in both multimodal observation and strategic interaction. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions: offline evaluation of strategic reasoning by next-action prediction accuracy, and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining only 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses of multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
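The two evaluation dimensions named in the abstract can be illustrated with a minimal sketch. The abstract does not state how episode returns are normalized, so the linear rescaling below (a baseline agent maps to 0, an optimal agent maps to 1) is an assumption, and the function names `prediction_accuracy` and `normalized_return` are hypothetical rather than taken from the VS-Bench code.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of samples where the predicted next action equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def normalized_return(agent_return: float,
                      baseline_return: float,
                      optimal_return: float) -> float:
    """Linearly rescale an episode return (assumed scheme, not from the source).

    The baseline return maps to 0.0 and the optimal return maps to 1.0.
    """
    span = optimal_return - baseline_return
    if span == 0:
        raise ValueError("optimal_return and baseline_return must differ")
    return (agent_return - baseline_return) / span


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    acc = prediction_accuracy(["up", "left", "stay", "down"],
                              ["up", "left", "down", "up"])
    ret = normalized_return(agent_return=3.0, baseline_return=1.0, optimal_return=9.0)
    print(f"accuracy={acc:.1%}, normalized return={ret:.1%}")
```

Under this assumed scheme, a normalized return near zero would mean the agent performs about as well as the baseline, which is one way to read the reported gap to optimal performance.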
