Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Abstract

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly influenced both academia and industry. These models extend Large Language Models (LLMs) with advanced visual understanding, which facilitates their application to a wide range of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advances, preliminary benchmarks indicate that Gemini lags behind GPT models on commonsense reasoning tasks. However, this assessment rests on a limited dataset (i.e., HellaSWAG) and does not fully capture Gemini's true commonsense reasoning potential. To address this gap, our study thoroughly evaluates Gemini's performance on complex reasoning tasks that require integrating commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks: 11 focus solely on language, and one incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. We also identify common challenges that current LLMs and MLLMs face in solving commonsense problems, underscoring the need for further advances in the commonsense reasoning abilities of these models.
