A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
December 19, 2023
作者: Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Xiawu Zheng, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Xing Sun, Rongrong Ji
cs.AI
Abstract
The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of its
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to provide
detailed explanations and intermediate steps, whereas Gemini prefers to output
direct and concise answers. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini further reveals some common
issues of MLLMs, indicating that there still remains a considerable distance
towards artificial general intelligence. Our project for tracking the progress
of MLLMs is released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
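As an aside on the quantitative comparison mentioned above: the MME benchmark pairs each image with two yes/no questions and combines a per-answer accuracy ("acc") with a stricter per-image accuracy ("acc+", both questions correct) into a single subtask score. The following is a minimal illustrative sketch of this scoring scheme, not the paper's or benchmark's official evaluation code; the function name and input format are assumptions for illustration.

```python
# Hypothetical sketch of MME-style scoring (NOT the official evaluation code).
# Assumption: each image pairs two yes/no questions; "acc" is the fraction of
# correct answers, "acc+" is the fraction of images with BOTH answers correct,
# and a subtask's score is acc + acc+ (each expressed as a percentage).

def mme_subtask_score(results):
    """results: list of (q1_correct, q2_correct) booleans, one pair per image."""
    if not results:
        return 0.0
    n_images = len(results)
    n_correct = sum(int(a) + int(b) for a, b in results)  # correct answers, 0..2 per image
    n_both = sum(1 for a, b in results if a and b)        # images with both answers correct
    acc = 100.0 * n_correct / (2 * n_images)
    acc_plus = 100.0 * n_both / n_images
    return acc + acc_plus

# Example: 3 images -- both correct, one correct, none correct
# acc = 3/6 = 50.0, acc+ = 1/3 ~= 33.33, score ~= 83.33
score = mme_subtask_score([(True, True), (True, False), (False, False)])
print(round(score, 2))
```

Under this scheme a model's total MME score is the sum of such subtask scores, so each perception or cognition subtask contributes up to 200 points.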