TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Abstract

In the era of advanced multimodal learning, multimodal large language models (MLLMs) such as GPT-4V have made remarkable strides towards bridging language and visual elements. However, their closed-source nature and considerable computational demands pose notable challenges for widespread use and modification. Open-source MLLMs such as LLaVA and MiniGPT-4 address the first problem and have achieved groundbreaking results across a range of tasks. Computational efficiency nevertheless remains unresolved, because models such as LLaVA-v1.5-13B still require substantial resources. To address this, we introduce TinyGPT-V, a model that combines impressive performance with modest computational requirements: it needs only a GPU with 24 GB of memory for training and a GPU with 8 GB of memory, or a CPU, for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, making the model suitable for local deployment and inference on a variety of devices with 8 GB of memory. Our work fosters further development of cost-effective, efficient, and high-performing MLLMs, expanding their applicability to a broad array of real-world scenarios. Furthermore, this paper proposes a new paradigm of multimodal large language models built on small backbones. Our code and trained weights are available at https://github.com/DLYuanGod/TinyGPT-V and https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
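
As a rough illustration of why a 2.8B-parameter model can plausibly fit on an 8 GB device, the sketch below estimates the memory taken by the weights alone at several numeric precisions. The abstract does not state which precision the quantisation process uses, so the bit widths listed here are assumptions, and the function name `weight_memory_gb` is hypothetical. Activations, the vision module's runtime buffers, and any key-value cache are not counted, so real usage would be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count as stated in the abstract.
PARAMS = 2.8e9

# Bit widths are illustrative assumptions, not values taken from the paper.
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights come to roughly 5.6 GB and 8-bit weights to roughly 2.8 GB, which is consistent with the claim that inference is feasible within 8 GB once some form of reduced precision is applied.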
