AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
February 19, 2024
Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
cs.AI
Abstract
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a text-centric multimodal dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown at https://junzhan2000.github.io/AnyGPT.github.io/
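The core idea the abstract describes, treating discretized modalities like tokens of a new language on top of an unchanged LLM, can be illustrated with a minimal sketch. Everything below (the base model, token names, toy codebook sizes, and boundary markers) is an assumption for illustration only, not the authors' released tokenizers or vocabulary.

```python
# Minimal sketch of "new modalities as new languages": non-text modalities are
# pre-tokenized into discrete codes by external tokenizers, the codes are
# rendered as special tokens, and the LLM is trained with the ordinary
# next-token objective. Names, sizes, and markers are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in base LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical discrete vocabularies (toy codebook sizes).
modality_tokens = (
    [f"<img_{i}>" for i in range(512)]        # e.g. image VQ codes
    + [f"<speech_{i}>" for i in range(256)]   # e.g. speech codec codes
    + [f"<music_{i}>" for i in range(256)]    # e.g. music codec codes
    + ["<soi>", "<eoi>", "<sos>", "<eos>"]    # assumed boundary markers
)
tokenizer.add_tokens(modality_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows for new tokens

# A multimodal training example then becomes one flat token sequence,
# e.g. a text instruction followed by discrete image codes:
image_codes = [17, 403, 92]                    # toy output of an image tokenizer
example = ("Draw a cat: <soi>"
           + "".join(f"<img_{c}>" for c in image_codes)
           + "<eoi>")
input_ids = tokenizer(example, return_tensors="pt").input_ids
# Standard causal-LM training and generation on `input_ids` then cover all
# modalities; generated modality tokens would be decoded back by the
# corresponding de-tokenizers.
```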