On Large Multimodal Models as Open-World Image Classifiers

Abstract

Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly in natural language (e.g., by answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies of LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining several metrics that assess the alignment between predicted and ground-truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, and demonstrate the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlight challenges related to granularity and fine-grained capabilities, and show how tailored prompting and reasoning can alleviate them.
