MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning

Abstract

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction more easily and also improve the model's learning efficiency for each task. After three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/
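The idea of task-specific identifiers can be illustrated with a minimal sketch. The abstract does not list the actual identifier tokens or the prompt layout, so the token strings, the `<image>` placeholder, and the function name `build_instruction` below are all hypothetical and chosen only for illustration.

```python
# Hypothetical task identifiers; the abstract does not specify the real tokens.
TASK_IDENTIFIERS = {
    "image_description": "[caption]",
    "visual_question_answering": "[vqa]",
    "visual_grounding": "[grounding]",
}

# Assumed stand-in for wherever the image features would be placed.
IMAGE_PLACEHOLDER = "<image>"


def build_instruction(task: str, instruction: str) -> str:
    """Prefix a text instruction with the identifier of its task."""
    if task not in TASK_IDENTIFIERS:
        raise ValueError(f"unknown task: {task!r}")
    identifier = TASK_IDENTIFIERS[task]
    return f"{IMAGE_PLACEHOLDER} {identifier} {instruction.strip()}"


if __name__ == "__main__":
    print(build_instruction("visual_question_answering", "What color is the car?"))
    print(build_instruction("visual_grounding", "Locate the person on the left."))
```

Because every instruction for a given task begins with the same identifier, two prompts with similar wording but different tasks remain distinguishable from their prefix alone, which is the property the abstract attributes to the identifiers.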
