

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

December 21, 2023
作者: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
cs.AI

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the large language model, using web-scale image-text data from various sources. The model can be broadly applied to, and achieves state-of-the-art performance on, visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and it can be linked with LLMs to create multi-modal dialogue systems. We hope that our research contributes to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
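To make the zero-shot capabilities mentioned above concrete, the sketch below shows the standard CLIP-style zero-shot classification procedure that a contrastively aligned vision-language model such as InternVL supports: encode the image and one text prompt per class, then rank classes by cosine similarity. This is a minimal sketch of the general technique, not the paper's API; `load_model`, `encode_image`, and `encode_text` are hypothetical placeholders, and the actual loading/encoding interface is defined in the OpenGVLab/InternVL repository.

```python
# Minimal sketch of CLIP-style zero-shot image classification with a
# contrastively aligned vision-language model (e.g. InternVL).
# The encoders used in the commented usage example are hypothetical.
import torch
import torch.nn.functional as F


def zero_shot_classify(image_features: torch.Tensor,
                       text_features: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """Return class probabilities from pre-computed embeddings.

    image_features: (1, D) embedding of the query image.
    text_features:  (C, D) embeddings of one prompt per class,
                    e.g. "a photo of a {class name}".
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Similarity of the image to every class prompt, sharpened by temperature.
    logits = image_features @ text_features.t() / temperature
    return logits.softmax(dim=-1)


# Hypothetical usage (loader and encoders are assumptions, not the paper's API):
# model = load_model("internvl")                      # placeholder loader
# img = encode_image(model, "cat.jpg")                # (1, D) image embedding
# txt = encode_text(model, ["a photo of a cat",
#                           "a photo of a dog"])      # (C, D) text embeddings
# probs = zero_shot_classify(img, txt)                # e.g. tensor([[0.98, 0.02]])
```

Zero-shot image-text retrieval follows the same pattern with the roles reversed: embed a gallery of images and a text query, then rank images by the same cosine-similarity score.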