Visual Question Decomposition on Multimodal Large Language Models

Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, existing methods focus primarily on unimodal language models, and the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. This paper therefore explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework, comprising a dataset and several evaluation criteria, to assess the quality of the decomposed sub-questions; it reveals that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a dedicated finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. To enable models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline that consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and in the policy of selective question decomposition. The finetuned models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.
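The idea of selective decomposition can be illustrated with a minimal sketch. This is not the paper's implementation: the function `answer_with_selective_decomposition`, the `DecompositionResult` type, and the three callables (`should_decompose`, `generate_sub_questions`, `answer_question`) are hypothetical names introduced here only to show the control flow, under the assumption that a model first decides whether a question needs decomposition and, if so, answers the sub-questions before answering the original question.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A (sub-question, answer) pair passed along as context.
QAPair = Tuple[str, str]


@dataclass
class DecompositionResult:
    decomposed: bool                      # whether the question was decomposed
    sub_qa: List[QAPair] = field(default_factory=list)
    answer: str = ""


def answer_with_selective_decomposition(
    image: object,
    question: str,
    should_decompose: Callable[[object, str], bool],
    generate_sub_questions: Callable[[object, str], List[str]],
    answer_question: Callable[[object, str, List[QAPair]], str],
) -> DecompositionResult:
    """Hypothetical control flow; all three callables stand in for MLLM calls."""
    # Step 1: the (assumed) policy decides whether decomposition is needed.
    if not should_decompose(image, question):
        # Simple question: answer directly, with no sub-question context.
        return DecompositionResult(False, [], answer_question(image, question, []))

    # Step 2: generate sub-questions and answer each one on its own.
    sub_qa: List[QAPair] = []
    for sub_question in generate_sub_questions(image, question):
        sub_qa.append((sub_question, answer_question(image, sub_question, [])))

    # Step 3: answer the original question using the sub-answers as context.
    final_answer = answer_question(image, question, sub_qa)
    return DecompositionResult(True, sub_qa, final_answer)
```

In practice each callable would wrap a prompt to the same finetuned model; they are kept as separate parameters here so that the decision step, the decomposition step, and the answering step are visible individually.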
