How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that aims to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide each image into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input image, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, and annotated it with English and Chinese question-answer pairs, significantly enhancing performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
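
The dynamic high-resolution step in (2) can be illustrated with a small sketch. The abstract states only the tile size (448×448), the tile count range (1 to 40), and that the split depends on aspect ratio and resolution; it does not say how the grid is chosen. The selection rule below (closest aspect ratio, with ties broken by closeness to the image area) is an assumption made for illustration, and the names `choose_grid` and `tile_boxes` are hypothetical, not taken from the InternVL code base.

```python
from typing import List, Tuple

TILE = 448  # tile side in pixels, as stated in the abstract


def choose_grid(width: int, height: int, max_tiles: int = 40) -> Tuple[int, int]:
    """Pick a (cols, rows) tile grid for an image of the given size.

    Assumed heuristic (not confirmed by the abstract): among all grids with
    1..max_tiles tiles, take the one whose aspect ratio is closest to the
    image's; break ties by how close the tile count is to the image area
    measured in 448x448 tiles.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target_ratio = width / height
    area_in_tiles = (width * height) / (TILE * TILE)
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(cols / rows - target_ratio),
                   abs(cols * rows - area_in_tiles))
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes, row by row, for an image
    that has already been resized to (cols * TILE, rows * TILE)."""
    return [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)  # a 4K (16:9) input
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under this assumed rule, the image would be resized to `cols * 448` by `rows * 448` pixels before cropping, so every tile is exactly 448×448. The boxes use the (left, upper, right, lower) ordering so that they could be passed to an image library's crop function.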
