InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
July 3, 2024
Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large
vision-language model that supports long-contextual input and output. IXC-2.5 excels
in various text-image comprehension and composition applications, achieving
GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K
interleaved image-text contexts, it can seamlessly extend to 96K long contexts
via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in
tasks requiring extensive input and output contexts. Compared to its previous
2.0 version, InternLM-XComposer-2.5 features three major upgrades in
vision-language comprehension: (1) Ultra-High Resolution Understanding, (2)
Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In
addition to comprehension, IXC-2.5 extends to two compelling applications using
extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2)
Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28
benchmarks, outperforming existing open-source state-of-the-art models on 16
benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on
16 key tasks. The InternLM-XComposer-2.5 is publicly available at
https://github.com/InternLM/InternLM-XComposer.
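The abstract states that a model trained on 24K-token interleaved contexts extends to 96K via RoPE extrapolation. As a rough illustration of the idea, the sketch below uses linear position interpolation (positions beyond the trained window are compressed back into the trained angle range); this is one common extrapolation scheme and is an assumption here, not necessarily the exact method used by IXC-2.5, and the 24K/96K token counts are taken as 24576/98304.

```python
# Hypothetical sketch of RoPE position extrapolation via linear
# position interpolation. Assumption: not necessarily IXC-2.5's scheme.
def rope_angles(position, dim, base=10000.0,
                train_len=24576, target_len=98304):
    """Rotation angles for one token position in rotary embeddings.

    Positions beyond the trained window are scaled down by
    train_len / target_len so every angle stays within the range
    seen during 24K-context training.
    """
    scale = train_len / target_len  # 24576 / 98304 = 0.25
    pos = position * scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# A position near the end of the 96K window maps to the same angles
# as a 4x-smaller position seen during 24K training.
angles_far = rope_angles(98300, dim=64)
angles_equiv = rope_angles(24575, dim=64, train_len=24576,
                           target_len=24576)
```

The key property is that inference-time positions never produce rotation frequencies outside the trained distribution, which is what allows the context window to grow without retraining.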
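The composition applications (webpages, articles) are attached through extra LoRA parameters. As a generic sketch of how LoRA adds a task-specific skill without modifying the frozen base weights, the example below applies a low-rank update `B @ A` alongside a frozen weight `W`; the shapes and rank are illustrative assumptions, not IXC-2.5's actual configuration.

```python
# Minimal generic LoRA sketch (illustrative shapes, not IXC-2.5's).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection

def forward(x):
    # Base path plus the low-rank adapter path; only A and B would be
    # trained for the new composition task.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter contributes nothing at first,
# so the adapted model starts out identical to the base model.
assert np.allclose(forward(x), W @ x)
```

Initializing `B` to zero is the standard LoRA choice: training begins from the unmodified base model, and the adapter gradually learns the task-specific delta.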