A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
December 19, 2023
作者: Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Xiawu Zheng, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Xing Sun, Rongrong Ji
cs.AI
Abstract
The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of its
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to provide
detailed explanations and intermediate steps, whereas Gemini prefers to output
direct and concise answers. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini further reveals some common
issues of MLLMs, indicating that there still remains a considerable distance
towards artificial general intelligence. Our project for tracking the progress
of MLLMs is released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
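As an aside on the quantitative comparison mentioned above: the MME benchmark pairs each image with two yes/no questions and combines a per-answer accuracy ("acc") with a stricter per-image accuracy ("acc+", both questions correct) into a single subtask score. The following is a minimal illustrative sketch of this scoring scheme, not the paper's or benchmark's official evaluation code; the function name and input format are assumptions for illustration.

```python
# Hypothetical sketch of MME-style scoring (NOT the official evaluation code).
# Assumption: each image pairs two yes/no questions; "acc" is the fraction of
# correct answers, "acc+" is the fraction of images with BOTH answers correct,
# and a subtask's score is acc + acc+ (each expressed as a percentage).

def mme_subtask_score(results):
    """results: list of (q1_correct, q2_correct) booleans, one pair per image."""
    if not results:
        return 0.0
    n_images = len(results)
    n_correct = sum(int(a) + int(b) for a, b in results)  # correct answers, 0..2 per image
    n_both = sum(1 for a, b in results if a and b)        # images with both answers correct
    acc = 100.0 * n_correct / (2 * n_images)
    acc_plus = 100.0 * n_both / n_images
    return acc + acc_plus

# Example: 3 images -- both correct, one correct, none correct
# acc = 3/6 = 50.0, acc+ = 1/3 ~= 33.33, score ~= 83.33
score = mme_subtask_score([(True, True), (True, False), (False, False)])
print(round(score, 2))
```

Under this scheme a model's total MME score is the sum of such subtask scores, so each perception or cognition subtask contributes up to 200 points.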