

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

October 14, 2023
作者: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
cs.AI

Abstract

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to perform diverse vision-language tasks effectively with a single model using simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction effortlessly and improve its learning efficiency on each task. After three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/
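
The task-identifier idea is straightforward to sketch. Below is a minimal, hypothetical Python illustration (not the authors' code) of how instructions for different tasks could be prefixed with identifier tokens, following the instruction template and identifiers described in the paper; the helper name `build_prompt` and the `<ImageHere>` placeholder are assumptions for illustration.

```python
# Minimal sketch of multi-task prompt construction with task identifiers,
# loosely following the MiniGPT-v2 template:
#   [INST] <Img><ImageFeature></Img> [task identifier] instruction [/INST]
# The identifier strings come from the paper; build_prompt is a hypothetical helper.

TASK_IDENTIFIERS = {
    "vqa": "[vqa]",              # visual question answering
    "caption": "[caption]",      # image captioning
    "grounding": "[grounding]",  # grounded image description
    "refer": "[refer]",          # referring expression comprehension
    "detection": "[detection]",  # object detection / localization
    "identify": "[identify]",    # referring expression generation
}

def build_prompt(task: str, instruction: str,
                 image_placeholder: str = "<ImageHere>") -> str:
    """Prepend the task identifier so the model can disambiguate instructions."""
    identifier = TASK_IDENTIFIERS[task]
    return f"[INST] <Img>{image_placeholder}</Img> {identifier} {instruction} [/INST]"

# The same image, two different tasks, disambiguated only by the identifier token:
print(build_prompt("vqa", "What color is the car?"))
print(build_prompt("grounding", "Describe the image in detail."))
```

The design point is that a single shared token per task gives the model an unambiguous signal of which output format is expected, rather than forcing it to infer the task from the free-form instruction alone.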