InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

January 29, 2024
Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model that excels at free-form text-image composition and comprehension. The model goes beyond conventional vision-language understanding: it composes interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens, preserving the integrity of pre-trained language knowledge and striking a balance between precise vision understanding and text composition with literary talent. Experimental results show that InternLM-XComposer2, built on InternLM2-7B, produces high-quality long-text multi-modal content and delivers strong vision-language understanding across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. These results highlight its proficiency in multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
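
The Partial LoRA idea described in the abstract (extra low-rank parameters that act only on image tokens) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `partial_lora_forward`, the shapes, the boolean `image_mask`, and the random initialization are all assumptions made for illustration, and the abstract does not say which layers the method is applied to.

```python
import numpy as np

def partial_lora_forward(x, image_mask, W0, A, B):
    """Hypothetical Partial LoRA linear layer (illustrative only).

    x:          (seq_len, d_in) token features, text and image mixed
    image_mask: (seq_len,) bool, True where the token is an image token
    W0:         (d_in, d_out) frozen pre-trained weight, used by all tokens
    A, B:       (d_in, r) and (r, d_out) low-rank LoRA factors
    """
    base = x @ W0                 # shared pre-trained projection
    delta = (x @ A) @ B           # low-rank update, computed for every row
    # Keep the update only for image tokens; text rows see W0 alone.
    return base + delta * image_mask[:, None]

# Toy example: 2 image tokens followed by 3 text tokens.
rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2
W0 = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, rank))
B = rng.normal(size=(rank, d_out))   # random here so the effect is visible
x = rng.normal(size=(5, d_in))
image_mask = np.array([True, True, False, False, False])

out = partial_lora_forward(x, image_mask, W0, A, B)
```

In this sketch the text-token rows of `out` equal `x @ W0` exactly, so the pre-trained language pathway is untouched, while the image-token rows receive the additional low-rank term.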