Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

Abstract

The rapidly evolving field of Multi-modal Large Language Models (MLLMs) is at the forefront of integrating linguistic and visual processing in artificial intelligence. This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). We evaluate both models along several key dimensions: Vision-Language Capability, Interaction with Humans, Temporal Understanding, and assessments of both Intelligence Quotient and Emotional Quotient. The core of our analysis concerns the differences in each model's visual comprehension abilities. We conducted a series of structured experiments to evaluate their performance in various industrial application scenarios, offering a comprehensive perspective on their practical utility. Beyond direct performance comparisons, we also adjusted prompts and scenarios to ensure a balanced and fair analysis. Our findings illuminate the distinct strengths and niches of the two models. GPT-4V distinguishes itself with the precision and succinctness of its responses, while Gemini excels at providing detailed, expansive answers accompanied by relevant imagery and links. These insights not only shed light on the comparative merits of Gemini and GPT-4V but also underscore the evolving landscape of multimodal foundation models, paving the way for future advancements in this area. After the comparison, we attempted to achieve better results by combining the two models. Finally, we would like to express our profound gratitude to the teams behind GPT-4V and Gemini for their pioneering contributions to the field. We also acknowledge the comprehensive qualitative analysis presented in 'Dawn' by Yang et al. That work, with its extensive collection of image samples, prompts, and GPT-4V-related results, provided the foundation for our analysis.
