VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Abstract

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, so they fail to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These varied question types allow the visual reasoning capabilities of MLLMs to be assessed from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.
