Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

Abstract

The rapidly evolving field of Multi-modal Large Language Models (MLLMs) is at the forefront of integrating linguistic and visual processing in artificial intelligence. This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). Our study evaluates both models across key dimensions such as Vision-Language Capability, Interaction with Humans, Temporal Understanding, and assessments of both Intelligence and Emotional Quotients. The core of our analysis examines the distinct visual comprehension abilities of each model. We conducted a series of structured experiments to evaluate their performance in various industrial application scenarios, offering a comprehensive perspective on their practical utility. Beyond direct performance comparisons, we also adjust prompts and scenarios to ensure a balanced and fair analysis. Our findings illuminate the unique strengths and niches of both models. GPT-4V distinguishes itself with the precision and succinctness of its responses, while Gemini excels at providing detailed, expansive answers accompanied by relevant imagery and links. These findings not only shed light on the comparative merits of Gemini and GPT-4V but also underscore the evolving landscape of multimodal foundation models, paving the way for future advancements in this area. After the comparison, we attempted to achieve better results by combining the two models. Finally, we would like to express our profound gratitude to the teams behind GPT-4V and Gemini for their pioneering contributions to the field. We also acknowledge the comprehensive qualitative analysis presented in 'Dawn' by Yang et al. That work, with its extensive collection of image samples, prompts, and GPT-4V-related results, provided the foundation for our analysis.
