InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
January 29, 2024
Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
We introduce InternLM-XComposer2, a cutting-edge vision-language model
excelling in free-form text-image composition and comprehension. This model
goes beyond conventional vision-language understanding, adeptly crafting
interleaved text-image content from diverse inputs like outlines, detailed
textual specifications, and reference images, enabling highly customizable
content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach
that applies additional LoRA parameters exclusively to image tokens to preserve
the integrity of pre-trained language knowledge, striking a balance between
precise vision understanding and text composition with literary talent.
Experimental results demonstrate the superiority of InternLM-XComposer2 based
on InternLM2-7B in producing high-quality long-text multi-modal content and its
exceptional vision-language understanding performance across various
benchmarks, where it not only significantly outperforms existing multimodal
models but also matches or even surpasses GPT-4V and Gemini Pro in certain
assessments. This highlights its remarkable proficiency in the realm of
multimodal understanding. The InternLM-XComposer2 model series with 7B
parameters are publicly available at
https://github.com/InternLM/InternLM-XComposer.
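
To make the Partial LoRA (PLoRA) idea described in the abstract concrete, here is a minimal PyTorch sketch of a linear layer where the frozen pre-trained projection handles every token, while a trainable low-rank update is added only to image tokens. The class name `PartialLoRALinear`, the `image_mask` argument, and the rank value are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class PartialLoRALinear(nn.Module):
    """Sketch of Partial LoRA (PLoRA): the frozen pre-trained weight processes
    all tokens; a low-rank update is applied exclusively to image tokens,
    leaving the language pathway untouched."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen pre-trained projection shared by text and image tokens.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters, used only for image tokens.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start from a zero (identity-preserving) update

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        # image_mask: (batch, seq_len) bool, True where the token is an image token.
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))
        return out + lora_out * image_mask.unsqueeze(-1).to(out.dtype)


# Usage: only the first two (image) tokens receive the extra LoRA update.
layer = PartialLoRALinear(in_features=4096, out_features=4096, rank=8)
tokens = torch.randn(1, 6, 4096)
mask = torch.tensor([[True, True, False, False, False, False]])
print(layer(tokens, mask).shape)  # torch.Size([1, 6, 4096])
```

In this reading of the abstract, text tokens see exactly the frozen pre-trained weights, which is how the approach preserves the language model's existing knowledge while still adapting the visual pathway.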