On Large Multimodal Models as Open-World Image Classifiers
March 27, 2025
Authors: Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
cs.AI
Abstract
Traditional image classification requires a predefined list of semantic
categories. In contrast, Large Multimodal Models (LMMs) can sidestep this
requirement by classifying images directly using natural language (e.g.,
answering the prompt "What is the main object in the image?"). Despite this
remarkable capability, most existing studies on LMM classification performance
are surprisingly limited in scope, often assuming a closed-world setting with a
predefined set of categories. In this work, we address this gap by thoroughly
evaluating LMM classification performance in a truly open-world setting. We
first formalize the task and introduce an evaluation protocol, defining various
metrics to assess the alignment between predicted and ground truth classes. We
then evaluate 13 models across 10 benchmarks, encompassing prototypical,
non-prototypical, fine-grained, and very fine-grained classes, demonstrating
the challenges LMMs face in this task. Further analyses based on the proposed
metrics reveal the types of errors LMMs make, highlighting challenges related
to granularity and fine-grained capabilities, and showing how tailored prompting
and reasoning can alleviate them.
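
As a rough illustration of the open-world setting described above, the sketch below shows how one might prompt an LMM with an open-ended question and score its free-form answer against the ground-truth class name. The `query_lmm` helper and the string-based alignment scores are hypothetical stand-ins added for illustration; they are not the paper's actual protocol or metrics.

```python
# Minimal sketch of open-world LMM classification and a toy alignment check.
# Assumptions (not from the paper): `query_lmm` is a hypothetical wrapper around
# any LMM API, and the string-based scores below are illustrative stand-ins for
# the paper's proposed metrics.

from difflib import SequenceMatcher

# Open-ended prompt: no predefined list of categories is provided to the model.
PROMPT = "What is the main object in the image?"


def query_lmm(image_path: str, prompt: str) -> str:
    """Hypothetical call to a Large Multimodal Model; replace with a real client."""
    raise NotImplementedError("Plug in an actual LMM API call here.")


def normalize(label: str) -> str:
    """Lowercase and strip punctuation so free-form answers compare more fairly."""
    return label.strip().lower().rstrip(".")


def text_alignment(predicted: str, ground_truth: str) -> dict:
    """Toy alignment scores between a free-form prediction and the true class."""
    p, g = normalize(predicted), normalize(ground_truth)
    return {
        "exact": float(p == g),                                   # strict string match
        "contains": float(g in p or p in g),                      # looser containment match
        "char_similarity": SequenceMatcher(None, p, g).ratio(),   # fuzzy character overlap
    }


if __name__ == "__main__":
    # Example with a made-up prediction, since no LMM is wired in here.
    scores = text_alignment("a golden retriever dog", "golden retriever")
    print(scores)
```

In practice, open-world evaluation replaces the string heuristics above with semantic measures (e.g., embedding similarity between predicted and ground-truth class names), since a free-form answer can be correct without matching the label verbatim.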