InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Abstract

We introduce InternLM-XComposer2, a vision-language model that excels at free-form text-image composition and comprehension. The model goes beyond conventional vision-language understanding: it composes interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens. This preserves the integrity of the pre-trained language knowledge and strikes a balance between precise vision understanding and text composition with literary talent. Experimental results show that InternLM-XComposer2, built on InternLM2-7B, is superior at producing high-quality long-text multi-modal content and delivers exceptional vision-language understanding across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
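To make the Partial LoRA idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard LoRA formulation (a frozen weight W0 plus a low-rank update B·A) and a per-token boolean flag marking image tokens. The function name `partial_lora_linear`, the helper `matvec`, and the toy weights are all hypothetical and chosen only for illustration. Text tokens pass through the frozen layer unchanged, while image tokens additionally receive the low-rank correction.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def partial_lora_linear(
    tokens: List[Vector],
    is_image: List[bool],
    w0: Matrix,      # frozen pre-trained weight, shape (d_out, d_in)
    lora_a: Matrix,  # low-rank down-projection, shape (r, d_in)
    lora_b: Matrix,  # low-rank up-projection, shape (d_out, r)
) -> List[Vector]:
    """Apply W0 to every token; add the LoRA branch only for image tokens."""
    if len(tokens) != len(is_image):
        raise ValueError("tokens and is_image must have the same length")
    outputs = []
    for x, image_flag in zip(tokens, is_image):
        y = matvec(w0, x)  # shared path: frozen language-model weights
        if image_flag:
            # Image tokens only: add the low-rank update B (A x).
            delta = matvec(lora_b, matvec(lora_a, x))
            y = [base + d for base, d in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example (assumed values): d_in = d_out = 2, rank r = 1.
w0 = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]
lora_b = [[0.5], [-0.5]]
tokens = [[1.0, 2.0], [1.0, 2.0]]  # identical inputs, different modality flags
is_image = [True, False]
out = partial_lora_linear(tokens, is_image, w0, lora_a, lora_b)
print(out)
```

In this toy setup the two tokens carry the same values, yet only the one flagged as an image token is changed by the LoRA branch. The text token's output depends solely on the frozen weight, which is how such a scheme can leave pre-trained language behavior untouched.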
