InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources. The model is broadly applicable and achieves state-of-the-art performance on visual perception tasks such as image-level or pixel-level recognition and on vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. We hope that our research contributes to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

