Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Abstract

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academic and industrial realms. These models enhance Large Language Models (LLMs) with advanced visual understanding capabilities, facilitating their application in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment, based on a limited dataset (i.e., HellaSWAG), does not fully capture Gemini's authentic commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini's performance in complex reasoning tasks that necessitate the integration of commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks. This includes 11 datasets focused solely on language, as well as one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advancements in enhancing the commonsense reasoning abilities of these models.
