

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

April 25, 2024
Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao
cs.AI

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
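Taking improvement (2) as an example, the dynamic high-resolution strategy is at its core a tiling algorithm: pick a tile grid that matches the image's aspect ratio, resize, and cut into fixed-size patches. The minimal Python sketch below illustrates one way such a scheme could work within the constraints stated in the abstract (1 to 40 tiles of 448×448 pixels); the grid-selection heuristic (`candidate_grids`, `pick_grid`, and the resolution-based tile budget) is an illustrative assumption, not the released implementation, which lives in the linked repository.

```python
import math
from PIL import Image

TILE = 448        # tile side length in pixels, per the abstract
MAX_TILES = 40    # upper bound on tile count, per the abstract

def candidate_grids(budget):
    """All (cols, rows) grids whose tile count fits within the budget."""
    return [(c, r)
            for c in range(1, budget + 1)
            for r in range(1, budget + 1)
            if c * r <= budget]

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Pick the grid whose aspect ratio best matches the input image.

    The budget heuristic (never use more tiles than the native resolution
    could fill) is an assumption for this sketch.
    """
    budget = min(max_tiles, math.ceil(width / TILE) * math.ceil(height / TILE))
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols, rows in candidate_grids(budget):
        err = abs(cols / rows - target)
        # Tie-break toward more tiles so larger images keep more detail.
        if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
            best, best_err = (cols, rows), err
    return best

def tile_image(img: Image.Image):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]

# Example: a 4K frame (3840x2160) maps to a 7x4 grid (28 tiles),
# staying within the 40-tile budget.
```

Resizing to the chosen grid before cropping trades a small aspect-ratio distortion for uniform 448×448 tiles that a vision encoder can process in a single batch.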