Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
December 22, 2023
Authors: Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, Hengshuang Zhao
cs.AI
Abstract
The rapidly evolving sector of Multi-modal Large Language Models (MLLMs) is
at the forefront of integrating linguistic and visual processing in artificial
intelligence. This paper presents an in-depth comparative study of two
pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). Our study
involves a multi-faceted evaluation of both models across key dimensions such
as Vision-Language Capability, Interaction with Humans, Temporal Understanding,
and assessments in both Intelligence and Emotional Quotients. The core of our
analysis delves into the distinct visual comprehension abilities of each model.
We conducted a series of structured experiments to evaluate their performance
in various industrial application scenarios, offering a comprehensive
perspective on their practical utility. We not only involve direct performance
comparisons but also include adjustments in prompts and scenarios to ensure a
balanced and fair analysis. Our findings illuminate the unique strengths and
niches of both models. GPT-4V distinguishes itself with its precision and
succinctness in responses, while Gemini excels in providing detailed, expansive
answers accompanied by relevant imagery and links. These insights not
only shed light on the comparative merits of Gemini and GPT-4V but also
underscore the evolving landscape of multimodal foundation models, paving the
way for future advancements in this area. After the comparison, we attempted to
achieve better results by combining the two models. Finally, we would like to
express our profound gratitude to the teams behind GPT-4V and Gemini for their
pioneering contributions to the field. Our acknowledgments are also extended to
the comprehensive qualitative analysis presented in 'Dawn' by Yang et al. This
work, with its extensive collection of image samples, prompts, and
GPT-4V-related results, provided a foundational basis for our analysis.