
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

January 29, 2024
Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multi-modal content, as well as its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
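
The Partial LoRA idea described in the abstract can be illustrated with a small sketch. The abstract states only that the extra LoRA parameters are applied exclusively to image tokens; everything else here is an assumption made for illustration, including the function name `plora_linear`, the single linear layer, the boolean `is_image` mask, and the toy weights. It should not be read as the authors' implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def plora_linear(tokens, is_image, W0, A, B):
    """Linear layer with a Partial-LoRA-style update (illustrative only).

    tokens:   list of input vectors, each of length d_in
    is_image: list of bools, True where the token is an image token
    W0:       frozen base weight, shape (d_out, d_in)
    A:        low-rank down-projection, shape (r, d_in)
    B:        low-rank up-projection, shape (d_out, r)
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(W0, x)                    # base path, used for every token
        if img:
            delta = matvec(B, matvec(A, x))  # low-rank path, image tokens only
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy values (hypothetical): identity base weight, rank-1 adapter.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.5]]

# The same vector is fed once as a text token and once as an image token.
out = plora_linear([[1.0, 2.0], [1.0, 2.0]], [False, True], W0, A, B)
print(out)  # [[1.0, 2.0], [2.5, 3.5]]
```

In this sketch the text token passes through the base weight unchanged, so whatever the base layer learned for text is untouched, while the image token receives the additional low-rank correction.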