How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Abstract

Multimodal foundation models such as GPT-4o have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants). The main challenges in doing so are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining, creating a standardized benchmarking framework. We observe that: 1) the models are not close to state-of-the-art specialist models at any task; 2) they are nonetheless respectable generalists, which is remarkable given that they are presumably trained primarily on image-text-based tasks; 3) they perform semantic tasks notably better than geometric ones; 4) while the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations; 5) GPT-4o performs best among the non-reasoning models, securing the top position in 4 out of 6 tasks; 6) reasoning models, e.g., o3, show improvements in geometric tasks; and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows that they exhibit quirks such as hallucinations and spatial misalignments.
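To make the idea of a text-promptable, API-compatible vision task concrete, the following is a minimal sketch of how object localization could be reduced to a chain of yes/no questions about image crops. It is an illustration under stated assumptions, not the benchmark's actual algorithm: the grid size, the prompt wording, and the names `make_stub_model` and `localize` are all hypothetical, and the model call is replaced by a stub that answers from a known ground-truth box so the example runs offline.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def make_stub_model(target: Box) -> Callable[[str, Box], str]:
    """Hypothetical stand-in for an API call to a multimodal model.

    A real call would send the prompt plus the cropped image; this stub
    simply answers "yes" when the crop overlaps the known target box.
    """
    def ask(prompt: str, crop: Box) -> str:
        x0, y0, x1, y1 = crop
        tx0, ty0, tx1, ty1 = target
        overlaps = x0 < tx1 and tx0 < x1 and y0 < ty1 and ty0 < y1
        return "yes" if overlaps else "no"
    return ask


def localize(ask: Callable[[str, Box], str], label: str, region: Box,
             grid: int = 4, steps: int = 4) -> Box:
    """Coarse-to-fine localization built only from text answers.

    Each step splits the current region into grid x grid cells, asks a
    yes/no question per cell, and shrinks the region to the bounding box
    of the cells answered "yes". The answers of one step determine the
    crops queried in the next, which is the chaining part.
    """
    for _ in range(steps):
        x0, y0, x1, y1 = region
        w = (x1 - x0) / grid
        h = (y1 - y0) / grid
        hits: List[Box] = []
        for i in range(grid):
            for j in range(grid):
                cell = (x0 + i * w, y0 + j * h,
                        x0 + (i + 1) * w, y0 + (j + 1) * h)
                prompt = (f"Does this crop contain any part of a {label}? "
                          "Answer yes or no.")
                if ask(prompt, cell).strip().lower().startswith("yes"):
                    hits.append(cell)
        if not hits:
            break  # the model sees nothing; keep the last region
        region = (min(c[0] for c in hits), min(c[1] for c in hits),
                  max(c[2] for c in hits), max(c[3] for c in hits))
    return region


if __name__ == "__main__":
    target = (20.0, 30.0, 45.0, 60.0)
    ask = make_stub_model(target)
    print(localize(ask, "dog", (0.0, 0.0, 100.0, 100.0)))
```

With a perfect oracle the returned box always contains the target and tightens only as far as the grid allows. A real model's answers are noisy, so any such chain trades accuracy against the number of API calls, which is consistent with the abstract's observation that prompt-chaining choices affect performance.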
