VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Abstract

Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM^2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.
