InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
January 29, 2024
Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
We introduce InternLM-XComposer2, a cutting-edge vision-language model
excelling in free-form text-image composition and comprehension. This model
goes beyond conventional vision-language understanding, adeptly crafting
interleaved text-image content from diverse inputs like outlines, detailed
textual specifications, and reference images, enabling highly customizable
content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach
that applies additional LoRA parameters exclusively to image tokens to preserve
the integrity of pre-trained language knowledge, striking a balance between
precise vision understanding and text composition with literary talent.
Experimental results demonstrate the superiority of InternLM-XComposer2 based
on InternLM2-7B in producing high-quality long-text multi-modal content and its
exceptional vision-language understanding performance across various
benchmarks, where it not only significantly outperforms existing multimodal
models but also matches or even surpasses GPT-4V and Gemini Pro in certain
assessments. This highlights its remarkable proficiency in the realm of
multimodal understanding. The InternLM-XComposer2 model series with 7B
parameters are publicly available at
https://github.com/InternLM/InternLM-XComposer.
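
To make the Partial LoRA (PLoRA) idea described in the abstract concrete, here is a minimal PyTorch sketch of a linear layer where the frozen pre-trained projection handles every token, while a trainable low-rank update is added only to image tokens. The class name `PartialLoRALinear`, the `image_mask` argument, and the rank value are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class PartialLoRALinear(nn.Module):
    """Sketch of Partial LoRA (PLoRA): the frozen pre-trained weight processes
    all tokens; a low-rank update is applied exclusively to image tokens,
    leaving the language pathway untouched."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen pre-trained projection shared by text and image tokens.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters, used only for image tokens.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start from a zero (identity-preserving) update

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        # image_mask: (batch, seq_len) bool, True where the token is an image token.
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))
        return out + lora_out * image_mask.unsqueeze(-1).to(out.dtype)


# Usage: only the first two (image) tokens receive the extra LoRA update.
layer = PartialLoRALinear(in_features=4096, out_features=4096, rank=8)
tokens = torch.randn(1, 6, 4096)
mask = torch.tensor([[True, True, False, False, False, False]])
print(layer(tokens, mask).shape)  # torch.Size([1, 6, 4096])
```

In this reading of the abstract, text tokens see exactly the frozen pre-trained weights, which is how the approach preserves the language model's existing knowledge while still adapting the visual pathway.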