How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that aims to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explore a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide each input image into 1 to 40 tiles of 448×448 pixels according to its aspect ratio and resolution, which supports inputs up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, and annotated it with English and Chinese question-answer pairs, significantly enhancing performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
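The dynamic high-resolution step can be illustrated with a minimal sketch. The abstract gives only the tile size (448×448) and the tile-count range (1 to 40); the selection rule below (pick the grid whose aspect ratio is closest to the image's, breaking ties by closest pixel area) and the names `choose_grid` and `tile_boxes` are assumptions made for illustration, not the released implementation.

```python
from typing import List, Tuple

TILE = 448       # tile side in pixels (from the abstract)
MAX_TILES = 40   # upper bound on the number of tiles (from the abstract)


def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> Tuple[int, int]:
    """Return (cols, rows) with 1 <= cols * rows <= max_tiles.

    Assumed rule: minimise the aspect-ratio mismatch first, then the
    difference between the grid's pixel area and the image's pixel area.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = width / height
    best_grid = (1, 1)
    best_key = None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            ratio_err = abs(cols / rows - target)
            area_err = abs(cols * rows * TILE * TILE - width * height)
            key = (ratio_err, area_err)
            if best_key is None or key < best_key:
                best_key = key
                best_grid = (cols, rows)
    return best_grid


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """List (left, top, right, bottom) crop boxes in row-major order.

    The boxes refer to the image after it has been resized to
    (cols * TILE) x (rows * TILE) pixels.
    """
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under this assumed rule, a 3840×2160 input maps to a 7×4 grid (28 tiles), and a 448×448 input stays a single tile. As a sanity check on the "up to 4K" claim, 40 tiles of 448×448 cover 8,028,160 pixels, which is close to the 8,294,400 pixels of a 3840×2160 frame, assuming that is what "4K" refers to here.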
