V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluation shifts from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate: they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess the visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning such as long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
