InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model that excels at free-form text-image composition and comprehension. The model goes beyond conventional vision-language understanding: it crafts interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens. This preserves the integrity of the pre-trained language knowledge and strikes a balance between precise vision understanding and text composition with literary talent. Experimental results show that InternLM-XComposer2, built on InternLM2-7B, produces high-quality long-text multi-modal content and achieves exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
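The abstract describes Partial LoRA only at a high level: the extra low-rank parameters act on image tokens, while text tokens pass through the pre-trained weights unchanged. The following is a minimal sketch of that idea, not the authors' implementation. The class name `PartialLoRALinear`, the helper functions, the dimensions, the omission of a bias term, and the random non-zero initialisation of both low-rank matrices (chosen so the effect is visible) are all assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Build a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class PartialLoRALinear:
    """Hypothetical linear layer whose low-rank update touches image tokens only."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.w0 = random_matrix(d_out, d_in, rng)   # stands in for the frozen pre-trained weight
        self.w_a = random_matrix(rank, d_in, rng)   # low-rank down-projection (assumed init)
        self.w_b = random_matrix(d_out, rank, rng)  # low-rank up-projection (assumed init)

    def forward(self, tokens, is_image):
        """Apply w0 to every token and add the low-rank term only where is_image is True."""
        outputs = []
        for x, image_token in zip(tokens, is_image):
            y = matvec(self.w0, x)
            if image_token:
                delta = matvec(self.w_b, matvec(self.w_a, x))
                y = [base + extra for base, extra in zip(y, delta)]
            outputs.append(y)
        return outputs


# Usage: one text token followed by one image token.
layer = PartialLoRALinear(d_in=4, d_out=3, rank=2)
tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]
outputs = layer.forward(tokens, is_image=[False, True])
```

In this sketch the text token's output equals what the base weight alone would produce, which is one way to read the abstract's claim that pre-trained language knowledge is preserved; only the image token's output is shifted by the low-rank term.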
