A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Abstract

The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, marks a significant trend in both academia and industry. These models endow Large Language Models (LLMs) with powerful visual understanding capabilities, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM, built from the ground up for multi-modality. In light of its superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, comprehensively covering four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, and with the latest open-sourced MLLM, Sphinx, to reveal the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini show different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to elaborate with detailed explanations and intermediate steps, whereas Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates Gemini's potential to be a strong challenger to GPT-4V. Our early investigation of Gemini also reveals some issues common to MLLMs, indicating that considerable distance remains on the path towards artificial general intelligence. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
