Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
December 29, 2023
Authors: Yuqing Wang, Yun Zhao
cs.AI
Abstract
The burgeoning interest in Multimodal Large Language Models (MLLMs), such as
OpenAI's GPT-4V(ision), has significantly impacted both academic and industrial
realms. These models enhance Large Language Models (LLMs) with advanced visual
understanding capabilities, facilitating their application in a variety of
multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM
designed specifically for multimodal integration. Despite its advancements,
preliminary benchmarks indicate that Gemini lags behind GPT models in
commonsense reasoning tasks. However, this assessment, based on a limited
dataset (i.e., HellaSwag), does not fully capture Gemini's true
commonsense reasoning potential. To address this gap, our study undertakes a
thorough evaluation of Gemini's performance in complex reasoning tasks that
necessitate the integration of commonsense knowledge across modalities. We
carry out a comprehensive analysis of 12 commonsense reasoning datasets,
ranging from general to domain-specific tasks. This includes 11 datasets
focused solely on language, as well as one that incorporates multimodal
elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's
competitive commonsense reasoning capabilities. Additionally, we identify
common challenges faced by current LLMs and MLLMs in addressing commonsense
problems, underscoring the need for further advances in the
commonsense reasoning abilities of these models.