V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluation shifts from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate: they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess the visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning such as long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
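
The abstract ranks models by Elo rating comparisons but does not spell out the procedure. As a rough illustration only, the following is a minimal sketch of the standard pairwise Elo update. The function names, the K-factor of 32, and the 400-point scale are assumptions chosen for illustration and are not taken from V-MAGE.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A outscores B under the standard Elo logistic model.

    The 400-point scale is the conventional choice, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical example: two equally rated models, the first one wins.
    print(update_elo(1500.0, 1500.0, 1.0))
```

In a setup like this, each pairwise comparison of two models on the same game level would count as one "match", and the ratings after many such comparisons give the ranking. Whether V-MAGE uses this exact update rule is not stated in the abstract.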
