

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

December 21, 2023
Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
cs.AI

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the large language model, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and it can be linked with LLMs to create multi-modal dialogue systems. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
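
The zero-shot classification and retrieval setting mentioned in the abstract follows the usual contrastive recipe: embed the image and one text prompt per class with the aligned vision and text encoders, then rank classes by cosine similarity. Below is a minimal PyTorch sketch of that scoring step only, assuming generic encoder outputs; the random tensors are placeholders, not InternVL's real loading or inference API (see the repository above for that).

```python
# Minimal sketch of CLIP-style zero-shot classification from precomputed
# image/text embeddings. Encoder outputs are simulated with random tensors.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features: torch.Tensor,
                       text_features: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """Return (N, C) class probabilities.

    image_features: (N, D) embeddings from the vision encoder.
    text_features:  (C, D) embeddings of one prompt per class,
                    e.g. "a photo of a {class}", from the text encoder.
    """
    # Cosine similarity between each image and each class prompt.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.T / temperature
    return logits.softmax(dim=-1)

# Placeholder embeddings standing in for encoder outputs (2 images, 5 classes).
probs = zero_shot_classify(torch.randn(2, 768), torch.randn(5, 768))
print(probs.argmax(dim=-1))  # predicted class index per image
```

Zero-shot image-text retrieval uses the same similarity matrix, ranking candidate texts per image (or images per text) instead of applying a softmax over a fixed label set.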