TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
December 28, 2023
Authors: Zhengqing Yuan, Zhaoxu Li, Lichao Sun
cs.AI
Abstract
In the era of advanced multimodal learning, multimodal large language models
(MLLMs) such as GPT-4V have made remarkable strides towards bridging language
and visual elements. However, their closed-source nature and considerable
computational demands present notable challenges for universal usage and
modification. This is where open-source MLLMs like LLaVA and MiniGPT-4 come
in, presenting groundbreaking achievements across tasks. Despite these
accomplishments, computational efficiency remains an unresolved issue: models
such as LLaVA-v1.5-13B still require substantial resources. To address these
issues, we introduce TinyGPT-V, a new model that marries impressive
performance with commonplace computational capacity. It stands out by requiring
merely a 24GB GPU for training and an 8GB GPU or CPU for inference. Built upon
Phi-2, TinyGPT-V couples an efficient language backbone with pre-trained vision
modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique
quantisation process, making it suitable for local deployment and inference
on a variety of 8GB devices. Our work fosters further development of
cost-effective, efficient, and high-performing MLLMs, expanding their
applicability in a broad array of real-world scenarios. Furthermore, this paper
proposes a new paradigm for multimodal large language models built on small backbones.
Our code and training weights are available at
https://github.com/DLYuanGod/TinyGPT-V and
https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
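
To make the described coupling concrete, below is a minimal sketch (not the authors' implementation) of how a frozen pre-trained vision module can feed a small language backbone such as Phi-2 via a projection layer, using the Hugging Face transformers API. The checkpoints, the single linear projection, and the module names here are illustrative assumptions; TinyGPT-V's actual design is in the official repository.

# Minimal architecture sketch, assuming a CLIP vision encoder, a linear
# projection, and Phi-2 as the language backbone. Not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class TinyVLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained vision module, kept frozen as in BLIP-2/CLIP-style pipelines.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        for p in self.vision.parameters():
            p.requires_grad = False
        # Small language backbone (~2.7B parameters).
        self.lm = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        )
        # Hypothetical projection from the vision hidden size to the LM hidden size.
        self.proj = nn.Linear(self.vision.config.hidden_size, self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        # Encode image patches and map them into the LM's token-embedding space.
        img_feats = self.vision(pixel_values=pixel_values).last_hidden_state
        img_embeds = self.proj(img_feats).to(self.lm.dtype)
        # Prepend image embeddings to the text embeddings and run the LM.
        txt_embeds = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)

Only the projection (and, in training, parts of the backbone) would be updated, which is what keeps the training footprint within a single 24GB GPU.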
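The claim that 2.8B parameters fit on 8GB devices after quantisation can also be illustrated generically. The paper's own "unique quantisation process" is not reproduced here; the sketch below assumes the standard 4-bit bitsandbytes path in transformers instead.

# Minimal sketch of quantised loading for an 8GB device, assuming the
# generic 4-bit bitsandbytes route rather than the paper's own process.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights: ~2.8e9 params * 0.5 bytes ≈ 1.4 GB, well under 8 GB,
# leaving headroom for activations and the KV cache during inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

lm = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",               # the small backbone TinyGPT-V builds on
    quantization_config=bnb_config,
    device_map="auto",               # place layers on the available GPU/CPU
)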