NExT-GPT: Any-to-Any Multimodal LLM
September 11, 2023
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
cs.AI
Abstract
While Multimodal Large Language Models (MM-LLMs) have recently made exciting
strides, they are mostly limited to input-side multimodal understanding and
lack the ability to produce content in multiple modalities. As we humans
always perceive the world and communicate with people through various
modalities, developing any-to-any MM-LLMs capable of accepting and
delivering content in any modality is essential to human-level AI. To fill
the gap, we present an end-to-end general-purpose any-to-any MM-LLM system,
NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion
decoders, enabling NExT-GPT to perceive inputs and generate outputs in
arbitrary combinations of text, images, videos, and audio. By leveraging the
existing well-trained, high-performing encoders and decoders, NExT-GPT is
tuned with only a small number of parameters (1%, in certain projection layers),
which not only enables low-cost training but also facilitates convenient
expansion to more potential modalities. Moreover, we introduce a
modality-switching instruction tuning (MosIT) strategy and manually curate a
high-quality dataset for MosIT, based on which NExT-GPT is empowered with
complex cross-modal semantic understanding and content generation. Overall, our
research showcases the promising possibility of building an AI agent capable of
modeling universal modalities, paving the way for more human-like AI research
in the community.
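
To make the parameter-efficiency idea concrete, below is a minimal PyTorch sketch of the layout the abstract describes: frozen pretrained modality encoders, a frozen LLM, frozen diffusion decoders, and small trainable projection layers bridging them. This is not the authors' implementation; the class names, stand-in modules, and dimensions are illustrative assumptions only.

```python
# Sketch of the parameter layout described in the abstract: only the
# projection layers between frozen components are trained.
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients for a pretrained, frozen component."""
    for p in module.parameters():
        p.requires_grad = False
    return module


class AnyToAnyMMLLM(nn.Module):
    """Toy any-to-any pipeline: encoder -> in_proj -> LLM -> out_proj -> decoder."""

    def __init__(self, enc_dim=256, llm_dim=512, dec_cond_dim=128):
        super().__init__()
        # Frozen stand-ins for pretrained checkpoints (real systems would load
        # an actual multimodal encoder, LLM, and diffusion decoder here).
        self.image_encoder = freeze(nn.Linear(enc_dim, enc_dim))
        self.llm = freeze(nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2))
        self.image_decoder = freeze(nn.Linear(dec_cond_dim, dec_cond_dim))

        # The only trainable parts: projection layers into and out of the LLM.
        self.in_proj = nn.Linear(enc_dim, llm_dim)
        self.out_proj = nn.Linear(llm_dim, dec_cond_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        tokens = self.in_proj(self.image_encoder(image_feats))  # encoder -> LLM space
        hidden = self.llm(tokens)                                # frozen LLM reasoning
        cond = self.out_proj(hidden)                             # LLM -> decoder condition
        return self.image_decoder(cond)                          # frozen decoder stand-in


model = AnyToAnyMMLLM()
x = torch.randn(2, 8, 256)                 # a batch of 8 feature tokens per sample
out = model(x)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(out.shape, f"trainable parameters: {trainable / total:.2%} of total")
```

With realistically sized frozen components (billions of LLM and diffusion parameters), the trainable projection layers would account for roughly the 1% fraction the abstract reports; the toy dimensions above only illustrate the wiring, not that ratio.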