Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Abstract

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academia and industry. These models enhance Large Language Models (LLMs) with advanced visual understanding capabilities, facilitating their application in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment, based on a limited dataset (i.e., HellaSWAG), does not fully capture Gemini's actual commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini's performance on complex reasoning tasks that require integrating commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks. This includes 11 datasets focused solely on language and one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advances in the commonsense reasoning abilities of these models.

