TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Abstract

In the era of advanced multimodal learning, multimodal large language models (MLLMs) such as GPT-4V have made remarkable strides towards bridging language and visual elements. However, their closed-source nature and considerable computational demands pose notable challenges for widespread use and modification. This is where open-source MLLMs such as LLaVA and MiniGPT-4 come in, presenting groundbreaking achievements across tasks. Despite these accomplishments, computational efficiency remains an unresolved issue, as models such as LLaVA-v1.5-13B still require substantial resources. To address these issues, we introduce TinyGPT-V, a new-wave model that combines impressive performance with commonplace computational capacity. It stands out by requiring only a 24G GPU for training and an 8G GPU or a CPU for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, making the model suitable for local deployment and inference on a variety of devices with 8G of memory. Our work fosters further developments in designing cost-effective, efficient, and high-performing MLLMs, expanding their applicability to a broad array of real-world scenarios. Furthermore, this paper proposes a new paradigm for multimodal large language models built on small backbones. Our code and training weights are available at https://github.com/DLYuanGod/TinyGPT-V and https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
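
As a rough sanity check on the 8G inference figure, the sketch below estimates the memory needed just to store 2.8B parameters at several numeric precisions. The bytes-per-parameter values are the generic sizes of each format, not figures from the paper. The estimate also ignores activations, caches, and framework overhead, and it does not model the paper's own quantisation process, which is not detailed in the abstract.

```python
# Rough weight-memory estimate for a 2.8B-parameter model.
# Assumption: generic storage sizes per numeric format, not values from the paper.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    num_params = 2.8e9  # parameter count stated in the abstract
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB")
```

Under these assumptions, 16-bit weights take about 5.6 GB and 8-bit weights about 2.8 GB, both below 8 GB, whereas 32-bit weights (about 11.2 GB) would not fit.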
