u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Abstract

Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into large language models (LLMs), yielding inspiring results and giving rise to a new generation of multi-modal LLMs (MLLMs). Nevertheless, these methods struggle with hallucinations and with mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach that adapts to downstream tasks by using the LLM as a bridge connecting multiple expert models. First, we incorporate a modality alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild public datasets of multiple types to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to the different modules to solve downstream tasks. The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks. We also make our model, the generated data, and the code base publicly available.
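
The flow described in the abstract (align modalities, run the LLM, then pass task-specific information to expert modules) can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`project`, `extract_task_info`, `dispatch`), the `<SEG>` task token, and the toy expert are all hypothetical, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def project(visual_feature: Sequence[float],
            weights: Sequence[Sequence[float]]) -> Vector:
    """Linear map standing in for modality alignment (hypothetical):
    moves a visual feature into the LLM embedding space."""
    return [sum(w * x for w, x in zip(row, visual_feature)) for row in weights]


@dataclass
class LLMOutput:
    """Toy stand-in for an LLM forward pass: one hidden state per token."""
    tokens: List[str]
    hidden_states: List[Vector]


def extract_task_info(output: LLMOutput,
                      task_tokens: Sequence[str]) -> Dict[str, Vector]:
    """Collect the hidden state at each (hypothetical) task token.
    If a task token occurs more than once, the last occurrence wins."""
    return {tok: h
            for tok, h in zip(output.tokens, output.hidden_states)
            if tok in task_tokens}


def dispatch(task_info: Dict[str, Vector],
             experts: Dict[str, Callable[[Vector], str]]) -> Dict[str, str]:
    """Hand each extracted vector to the expert registered for its token."""
    return {tok: experts[tok](vec)
            for tok, vec in task_info.items()
            if tok in experts}


def run_demo() -> Dict[str, str]:
    # Assumed toy data: three tokens, one of which is a task token.
    output = LLMOutput(tokens=["The", "cat", "<SEG>"],
                       hidden_states=[[0.1], [0.2], [0.9]])
    info = extract_task_info(output, task_tokens=("<SEG>",))
    # The "expert" here is a placeholder for a real segmentation model.
    experts = {"<SEG>": lambda v: f"mask from {len(v)}-dim prompt"}
    return dispatch(info, experts)


if __name__ == "__main__":
    print(project([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
    print(run_demo())
```

The point of the sketch is the separation of roles: the LLM never produces the task output itself, it only supplies a vector that a specialized module consumes. Registering another expert under a new task token would extend the toy pipeline without touching the rest.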
