A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
December 19, 2023
Authors: Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Xiawu Zheng, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Xing Sun, Rongrong Ji
cs.AI
Abstract
The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of its
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to provide
detailed explanations and intermediate steps, whereas Gemini prefers to output
a direct and concise answer. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini also surfaces some issues common
to MLLMs, indicating that a considerable distance remains on the road to
artificial general intelligence. Our project for tracking the progress of MLLMs
is released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.