How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
July 2, 2025
Authors: Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
cs.AI
Abstract
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc.).
The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework.
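To make the general idea concrete, the sketch below shows one way a standard vision task (image classification) can be recast as a text-promptable, API-compatible query. It is a minimal illustration under stated assumptions (the OpenAI Python SDK's chat-completions vision interface, a fixed candidate label list, and a simple out-of-vocabulary fallback), not the paper's exact prompt-chaining protocol.

```python
# Minimal sketch: recasting image classification as a text-promptable API task.
# Illustrative only -- this is not the paper's exact prompt-chaining protocol.
import base64

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_image(image_path: str, candidate_labels: list[str]) -> str:
    """Ask a multimodal model to pick one label from a fixed candidate set."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Classify the main object in this image. "
        f"Answer with exactly one label from this list: {', '.join(candidate_labels)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    answer = response.choices[0].message.content.strip()
    # Fall back to the first candidate if the model answers out of vocabulary.
    return answer if answer in candidate_labels else candidate_labels[0]


# Example usage (hypothetical paths and labels):
# label = classify_image("dog.jpg", ["dog", "cat", "bird"])
```

Denser tasks such as segmentation or depth would require additional chained prompts (e.g., querying regions or pixel pairs) rather than a single question, which is where the prompt-chaining aspect of the framework comes in.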
We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained primarily on image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g., o3, show improvements in geometric tasks. 7) A preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks such as hallucinations and spatial misalignments.