MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Abstract

Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, most of these models are restricted to low-resolution images, which limits their effectiveness in perception tasks that require detailed visual information. In this study, we present MG-LLaVA, an MLLM that enhances visual processing by incorporating a multi-granularity vision flow comprising low-resolution, high-resolution, and object-centric features. We propose integrating an additional high-resolution visual encoder to capture fine-grained details, which are then fused with the base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B parameters, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks show that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, demonstrating its efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
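
To make the idea of gating high-resolution detail into the base visual features more concrete, here is a minimal sketch of a gated fusion step for a single pair of aligned tokens. It is not the Conv-Gate network itself: the abstract does not specify that network's layers, so the scalar sigmoid gate, the residual form `low + gate * high`, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are all illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(low, high, gate_weights, gate_bias=0.0):
    """Fuse one low-resolution token with its aligned high-resolution token.

    low, high: lists of floats with the same channel dimension.
    gate_weights: hypothetical learned weights, one per element of the
        concatenation [low; high]. This is an assumption for illustration,
        not the parameterization used by MG-LLaVA.
    """
    if len(low) != len(high):
        raise ValueError("low and high must share the channel dimension")
    concat = list(low) + list(high)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must have length 2 * channels")

    # A scalar gate in (0, 1) computed from both feature vectors.
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)

    # Base features are kept intact; high-resolution detail is added
    # in proportion to the gate.
    return [l + gate * h for l, h in zip(low, high)]


if __name__ == "__main__":
    low_token = [1.0, 2.0]
    high_token = [2.0, 4.0]
    # With all-zero weights and zero bias the gate is exactly 0.5.
    print(gated_fusion(low_token, high_token, [0.0] * 4))
```

In this sketch a strongly negative gate score leaves the base features almost unchanged, while a strongly positive one adds nearly the full high-resolution signal. A real implementation would operate on feature maps and learn the gate parameters during training.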
