TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

December 28, 2023
Authors: Zhengqing Yuan, Zhaoxu Li, Lichao Sun
cs.AI

Abstract

In the era of advanced multimodal learning, multimodal large language models (MLLMs) such as GPT-4V have made remarkable strides towards bridging language and visual elements. However, their closed-source nature and considerable computational demands present notable challenges for universal usage and modification. This is where open-source MLLMs like LLaVA and MiniGPT-4 come in, presenting groundbreaking achievements across tasks. Despite these accomplishments, computational efficiency remains an unresolved issue, as models like LLaVA-v1.5-13B require substantial resources. Addressing these issues, we introduce TinyGPT-V, a new-wave model marrying impressive performance with commonplace computational capacity. It stands out by requiring merely a 24G GPU for training and an 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, suitable for local deployment and inference tasks on various 8G devices. Our work fosters further developments in designing cost-effective, efficient, and high-performing MLLMs, expanding their applicability in a broad array of real-world scenarios. Furthermore, this paper proposes a new paradigm for multimodal large language models built on small backbones. Our code and training weights are available at https://github.com/DLYuanGod/TinyGPT-V and https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
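The abstract's claim that a 2.8B-parameter model fits on 8G devices is plausible on simple arithmetic: at 8-bit precision the weights alone occupy roughly 2.8 GB, leaving headroom for activations and the vision modules. As a rough illustration of that kind of quantised local inference (not the project's own loading code, which lives in the linked GitHub repository), the sketch below loads the Phi-2 language backbone named in the abstract through the Hugging Face transformers API with 8-bit bitsandbytes quantisation; the model id `microsoft/phi-2` and the generic transformers loading path are assumptions made here for illustration, since TinyGPT-V ships its vision modules and projection layers separately.

```python
# Illustrative sketch only: 8-bit loading of the Phi-2 language backbone,
# the kind of quantised inference the abstract describes for 8G devices.
# TinyGPT-V's actual loading code is in https://github.com/DLYuanGod/TinyGPT-V.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # language backbone named in the abstract (assumed HF id)

# 8-bit weight quantisation via bitsandbytes (requires a CUDA GPU).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU, spill to CPU if needed
)

# Plain-text generation; in the full MLLM, projected image features would be
# injected into the prompt embedding sequence before generation.
prompt = "Describe what makes small language backbones attractive for MLLMs:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

This is a sketch of the memory-saving idea only; reproducing TinyGPT-V's results requires the released training weights and the repository's own inference pipeline.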