MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
October 14, 2023
Authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
cs.AI
Abstract
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to easily distinguish each task instruction and also improve its learning efficiency on each task. After three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/.
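
To make the task-identifier idea concrete, below is a minimal Python sketch of how such multi-modal instructions could be assembled. The identifier tokens and the [INST] ... [/INST] template are illustrative assumptions based on the abstract's description, not necessarily the authors' exact prompt format.

```python
# A minimal sketch of task-identifier prompting as described in the abstract.
# NOTE: the identifier tokens and the instruction template below are
# assumptions for illustration, not the confirmed MiniGPT-v2 format.

TASK_IDENTIFIERS = {
    "visual_question_answering": "[vqa]",
    "image_captioning": "[caption]",
    "visual_grounding": "[grounding]",
}

def build_instruction(task: str, user_prompt: str) -> str:
    """Prefix the user prompt with the task's unique identifier token,
    so a single model can distinguish which task it is being asked to do."""
    identifier = TASK_IDENTIFIERS[task]
    # <Img>...</Img> marks where encoded image features would be spliced
    # into the language-model input; "ImageHere" is a placeholder for them.
    return f"[INST] <Img><ImageHere></Img> {identifier} {user_prompt} [/INST]"

print(build_instruction("visual_question_answering", "What color is the car?"))
# [INST] <Img><ImageHere></Img> [vqa] What color is the car? [/INST]
```

Because the identifier is a fixed token at a fixed position, the model can route otherwise similar instructions (e.g. a question about an image versus a request to localize an object in it) to different output behaviors, which is the learning-efficiency benefit the abstract claims.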