HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Abstract

Recent advances indicate that scaling up Multimodal Large Language Models (MLLMs) effectively improves performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the ability to comprehend visual information through visual instruction tuning. Although promising, this static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
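
To make the idea of a hypernetwork-generated parameter shift concrete, the following is a minimal sketch in pure Python. It is not the HyperLLaVA implementation: the class name `DynamicProjector`, the rank-1 form of the shift, the linear hypernetwork, and the two-dimensional guidance vector are all assumptions chosen for illustration. It shows only the general pattern of adding an input-conditioned shift to a static projector weight before projecting a visual feature.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng, scale=0.1):
    """Create a rows x cols matrix with small random entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicProjector:
    """Static linear projector plus a guidance-conditioned parameter shift.

    Hypothetical illustration: the shift is a rank-1 update u v^T whose
    factors come from a tiny linear hypernetwork applied to a guidance vector.
    """

    def __init__(self, in_dim, out_dim, guide_dim, seed=0):
        rng = random.Random(seed)
        # Static weight, shared by every input (the "static mapper").
        self.static_weight = make_matrix(out_dim, in_dim, rng)
        # Hypernetwork parameters that map guidance to the shift factors.
        self.hyper_u = make_matrix(out_dim, guide_dim, rng)
        self.hyper_v = make_matrix(in_dim, guide_dim, rng)

    def parameter_shift(self, guidance):
        """Return an out_dim x in_dim shift generated from the guidance."""
        u = matvec(self.hyper_u, guidance)  # length out_dim
        v = matvec(self.hyper_v, guidance)  # length in_dim
        return [[ui * vj for vj in v] for ui in u]

    def project(self, visual_feature, guidance):
        """Project a feature with the static weight plus the dynamic shift."""
        shift = self.parameter_shift(guidance)
        weight = [
            [w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.static_weight, shift)
        ]
        return matvec(weight, visual_feature)


projector = DynamicProjector(in_dim=4, out_dim=3, guide_dim=2)
feature = [1.0, 0.5, -0.5, 2.0]

# The same feature is projected differently under different guidance.
token_a = projector.project(feature, guidance=[1.0, 0.0])
token_b = projector.project(feature, guidance=[0.0, 1.0])
# Zero guidance yields a zero shift, recovering the static projector.
token_static = projector.project(feature, guidance=[0.0, 0.0])
```

In this sketch, a zero guidance vector reduces the projector to its static weight, while non-zero guidance changes the effective weight per input. A low-rank shift is used here only to keep the hypernetwork output small; the abstract does not specify how the shift is parameterized.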
