
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

December 22, 2023
作者: Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, Hengshuang Zhao
cs.AI

Abstract

The rapidly evolving sector of Multi-modal Large Language Models (MLLMs) is at the forefront of integrating linguistic and visual processing in artificial intelligence. This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). Our study involves a multi-faceted evaluation of both models across key dimensions such as Vision-Language Capability, Interaction with Humans, Temporal Understanding, and assessments of both Intelligence and Emotional Quotients. The core of our analysis delves into the distinct visual comprehension abilities of each model. We conducted a series of structured experiments to evaluate their performance in various industrial application scenarios, offering a comprehensive perspective on their practical utility. We not only make direct performance comparisons but also adjust prompts and scenarios to ensure a balanced and fair analysis. Our findings illuminate the unique strengths and niches of both models. GPT-4V distinguishes itself with its precision and succinctness in responses, while Gemini excels at providing detailed, expansive answers accompanied by relevant imagery and links. These insights not only shed light on the comparative merits of Gemini and GPT-4V but also underscore the evolving landscape of multimodal foundation models, paving the way for future advancements in this area. After the comparison, we attempted to achieve better results by combining the two models. Finally, we would like to express our profound gratitude to the teams behind GPT-4V and Gemini for their pioneering contributions to the field. Our acknowledgments also extend to the comprehensive qualitative analysis presented in 'Dawn' by Yang et al.; that work, with its extensive collection of image samples, prompts, and GPT-4V-related results, provided a foundational basis for our analysis.
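The abstract mentions combining the two models but does not describe the procedure. The sketch below is one hypothetical way such a combination could be wired up: query GPT-4V and Gemini on the same image-question pair, then ask a text-only GPT-4 call to reconcile the two answers. It uses the OpenAI and Google Generative AI Python SDKs as they existed around the paper's publication; the model names, prompts, and merge strategy are assumptions for illustration, not the authors' method.

```python
# Hypothetical combination of GPT-4V and Gemini (not the paper's procedure):
# ask both models the same visual question, then merge their answers.
import os

import google.generativeai as genai
import PIL.Image
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


def ask_gpt4v(question: str, image_url: str) -> str:
    """Send a text + image prompt to GPT-4V and return its answer."""
    resp = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=512,
    )
    return resp.choices[0].message.content


def ask_gemini(question: str, image_path: str) -> str:
    """Send the same text + image prompt to Gemini Pro Vision."""
    model = genai.GenerativeModel("gemini-pro-vision")
    resp = model.generate_content([question, PIL.Image.open(image_path)])
    return resp.text


def combined_answer(question: str, image_url: str, image_path: str) -> str:
    """Naive ensemble: collect both answers, then ask GPT-4 to merge them."""
    gpt_answer = ask_gpt4v(question, image_url)
    gemini_answer = ask_gemini(question, image_path)
    merge_prompt = (
        f"Question: {question}\n\n"
        f"Answer A (concise): {gpt_answer}\n\n"
        f"Answer B (detailed): {gemini_answer}\n\n"
        "Write a single answer that keeps A's precision and B's detail."
    )
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": merge_prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```

The merge step here simply trades on the complementary behaviors the abstract reports (GPT-4V's succinct precision, Gemini's expansive detail); other ensembling choices, such as voting or letting each model critique the other's answer, would be equally plausible.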