

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

April 25, 2024
Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao
cs.AI

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
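Taking improvement (2) as an example, the dynamic high-resolution strategy is at its core a tiling algorithm: pick a tile grid that matches the image's aspect ratio, resize, and cut into fixed-size patches. The minimal Python sketch below illustrates one way such a scheme could work within the constraints stated in the abstract (1 to 40 tiles of 448×448 pixels); the grid-selection heuristic (`candidate_grids`, `pick_grid`, and the resolution-based tile budget) is an illustrative assumption, not the released implementation, which lives in the linked repository.

```python
import math
from PIL import Image

TILE = 448        # tile side length in pixels, per the abstract
MAX_TILES = 40    # upper bound on tile count, per the abstract

def candidate_grids(budget):
    """All (cols, rows) grids whose tile count fits within the budget."""
    return [(c, r)
            for c in range(1, budget + 1)
            for r in range(1, budget + 1)
            if c * r <= budget]

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Pick the grid whose aspect ratio best matches the input image.

    The budget heuristic (never use more tiles than the native resolution
    could fill) is an assumption for this sketch.
    """
    budget = min(max_tiles, math.ceil(width / TILE) * math.ceil(height / TILE))
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols, rows in candidate_grids(budget):
        err = abs(cols / rows - target)
        # Tie-break toward more tiles so larger images keep more detail.
        if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
            best, best_err = (cols, rows), err
    return best

def tile_image(img: Image.Image):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]

# Example: a 4K frame (3840x2160) maps to a 7x4 grid (28 tiles),
# staying within the 40-tile budget.
```

Resizing to the chosen grid before cropping trades a small aspect-ratio distortion for uniform 448×448 tiles that a vision encoder can process in a single batch.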