
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

April 25, 2024
作者: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao
cs.AI

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic High-Resolution: we divide input images into 1 to 40 tiles of 448×448 pixels according to their aspect ratio and resolution, supporting inputs of up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
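The dynamic high-resolution scheme in improvement (2) can be illustrated with a minimal sketch. The code below is an assumption-laden simplification, not the paper's implementation: it picks a tile grid of at most 40 tiles of 448×448 pixels solely by matching the grid's aspect ratio to the image's (the actual method also factors in the input resolution when choosing the tile count). The function names `pick_grid` and `tiled_resolution` are hypothetical.

```python
from itertools import product

TILE = 448       # tile side length used by InternVL 1.5
MAX_TILES = 40   # upper bound on tiles per image

def pick_grid(width: int, height: int, max_tiles: int = MAX_TILES):
    """Pick a (cols, rows) tile grid whose aspect ratio best matches the image.

    Simplified sketch: enumerate every grid with at most `max_tiles` tiles
    and keep the one whose cols/rows ratio is closest to width/height.
    """
    target = width / height
    candidates = [
        (c, r)
        for c, r in product(range(1, max_tiles + 1), repeat=2)
        if c * r <= max_tiles
    ]
    return min(candidates, key=lambda g: abs(g[0] / g[1] - target))

def tiled_resolution(width: int, height: int):
    """Resolution the image would be resized to before cutting into tiles."""
    cols, rows = pick_grid(width, height)
    return cols * TILE, rows * TILE
```

For example, a 896×448 input maps to a 2×1 grid (two tiles side by side), while a square input stays a single 448×448 tile; every chosen grid respects the 1-to-40 tile budget.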

