A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
December 19, 2023
Authors: Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Xiawu Zheng, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Xing Sun, Rongrong Ji
cs.AI
Abstract
The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of its
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to provide
detailed explanations and intermediate steps, whereas Gemini prefers to output
a direct and concise answer. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini also surfaces some issues common
to MLLMs, indicating that a considerable distance remains on the road to
artificial general intelligence. Our project for tracking the progress of MLLMs
is released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.