TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
December 28, 2023
Authors: Zhengqing Yuan, Zhaoxu Li, Lichao Sun
cs.AI
Abstract
In the era of advanced multimodal learning, multimodal large language models
(MLLMs) such as GPT-4V have made remarkable strides towards bridging language
and visual elements. However, their closed-source nature and considerable
computational demands present notable challenges for universal usage and
modification. This is where open-source MLLMs like LLaVA and MiniGPT-4 come
in, presenting groundbreaking achievements across tasks. Despite these
accomplishments, computational efficiency remains an unresolved issue: models
such as LLaVA-v1.5-13B still require substantial resources. To address these
issues, we introduce TinyGPT-V, a new model that marries impressive
performance with commonplace computational capacity. It stands out by requiring
merely a 24GB GPU for training and an 8GB GPU or CPU for inference. Built upon
Phi-2, TinyGPT-V couples an efficient language backbone with pre-trained vision
modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique
quantisation process, making it suitable for local deployment and inference
on a variety of 8GB devices. Our work fosters further development of
cost-effective, efficient, and high-performing MLLMs, expanding their
applicability in a broad array of real-world scenarios. Furthermore, this paper
proposes a new paradigm for multimodal large language models built on small backbones.
Our code and training weights are available at
https://github.com/DLYuanGod/TinyGPT-V and
https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
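
To make the described coupling concrete, below is a minimal sketch (not the authors' implementation) of how a frozen pre-trained vision module can feed a small language backbone such as Phi-2 via a projection layer, using the Hugging Face transformers API. The checkpoints, the single linear projection, and the module names here are illustrative assumptions; TinyGPT-V's actual design is in the official repository.

# Minimal architecture sketch, assuming a CLIP vision encoder, a linear
# projection, and Phi-2 as the language backbone. Not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class TinyVLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained vision module, kept frozen as in BLIP-2/CLIP-style pipelines.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        for p in self.vision.parameters():
            p.requires_grad = False
        # Small language backbone (~2.7B parameters).
        self.lm = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        )
        # Hypothetical projection from the vision hidden size to the LM hidden size.
        self.proj = nn.Linear(self.vision.config.hidden_size, self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        # Encode image patches and map them into the LM's token-embedding space.
        img_feats = self.vision(pixel_values=pixel_values).last_hidden_state
        img_embeds = self.proj(img_feats).to(self.lm.dtype)
        # Prepend image embeddings to the text embeddings and run the LM.
        txt_embeds = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)

Only the projection (and, in training, parts of the backbone) would be updated, which is what keeps the training footprint within a single 24GB GPU.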
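The claim that 2.8B parameters fit on 8GB devices after quantisation can also be illustrated generically. The paper's own "unique quantisation process" is not reproduced here; the sketch below assumes the standard 4-bit bitsandbytes path in transformers instead.

# Minimal sketch of quantised loading for an 8GB device, assuming the
# generic 4-bit bitsandbytes route rather than the paper's own process.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights: ~2.8e9 params * 0.5 bytes ≈ 1.4 GB, well under 8 GB,
# leaving headroom for activations and the KV cache during inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

lm = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",               # the small backbone TinyGPT-V builds on
    quantization_config=bnb_config,
    device_map="auto",               # place layers on the available GPU/CPU
)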