

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

July 2, 2025
Authors: Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
cs.AI

Abstract

Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants). The main challenges in performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining, creating a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models on any task. However, 2) they are respectable generalists; this is remarkable, as they are presumably trained primarily on image-text tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g., o3, show improvements in geometric tasks. 7) A preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
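To make the prompt-chaining idea concrete, here is a minimal sketch of how a standard vision task (1000-way ImageNet classification) might be recast as a sequence of text-promptable, API-compatible queries. This is an illustration under stated assumptions, not the authors' code: `query_model(image_bytes, prompt)` is a hypothetical helper standing in for whichever vision-language API is benchmarked, and the chunk size and prompt wording are illustrative.

```python
# Sketch: prompt chaining for image classification through a text-only API.
# `query_model` is a hypothetical placeholder for the actual API call
# (e.g., an OpenAI/Gemini/Claude client); it is NOT from the paper.

def query_model(image_bytes: bytes, prompt: str) -> str:
    """Placeholder: send the image plus a text prompt to the model,
    return its text reply."""
    raise NotImplementedError

def classify_by_chaining(image_bytes: bytes,
                         class_names: list[str],
                         chunk: int = 50) -> str:
    """Narrow a large label set to one class via successive
    multiple-choice prompts, each small enough to fit in one query."""
    candidates = class_names
    while len(candidates) > chunk:
        survivors = []
        for i in range(0, len(candidates), chunk):
            group = candidates[i:i + chunk]
            prompt = ("Which one of these labels best describes the main "
                      "object in the image? Answer with the label only.\n"
                      + "\n".join(group))
            # A robust harness would verify the reply is actually in `group`.
            survivors.append(query_model(image_bytes, prompt).strip())
        candidates = survivors
    final_prompt = ("Which one of these labels best describes the main "
                    "object in the image? Answer with the label only.\n"
                    + "\n".join(candidates))
    return query_model(image_bytes, final_prompt).strip()
```

Decomposing the label set into successive multiple-choice rounds keeps every prompt within what a text-only, API-level interface can handle, which is one way the paper's translation of vision tasks into text-promptable tasks could work in practice.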