A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Abstract

The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. These models endow Large Language Models (LLMs) with powerful visual understanding capabilities, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM, built from the ground up for multi-modality. In light of its superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, whereas Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also observes some common issues of MLLMs, indicating that a considerable distance remains before artificial general intelligence is reached. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

