AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
February 19, 2024
Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
cs.AI
Abstract
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a text-centric multimodal dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown at https://junzhan2000.github.io/AnyGPT.github.io/
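The core idea the abstract describes, treating discretized modalities like tokens of a new language on top of an unchanged LLM, can be illustrated with a minimal sketch. Everything below (the base model, token names, toy codebook sizes, and boundary markers) is an assumption for illustration only, not the authors' released tokenizers or vocabulary.

```python
# Minimal sketch of "new modalities as new languages": non-text modalities are
# pre-tokenized into discrete codes by external tokenizers, the codes are
# rendered as special tokens, and the LLM is trained with the ordinary
# next-token objective. Names, sizes, and markers are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in base LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical discrete vocabularies (toy codebook sizes).
modality_tokens = (
    [f"<img_{i}>" for i in range(512)]        # e.g. image VQ codes
    + [f"<speech_{i}>" for i in range(256)]   # e.g. speech codec codes
    + [f"<music_{i}>" for i in range(256)]    # e.g. music codec codes
    + ["<soi>", "<eoi>", "<sos>", "<eos>"]    # assumed boundary markers
)
tokenizer.add_tokens(modality_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows for new tokens

# A multimodal training example then becomes one flat token sequence,
# e.g. a text instruction followed by discrete image codes:
image_codes = [17, 403, 92]                    # toy output of an image tokenizer
example = ("Draw a cat: <soi>"
           + "".join(f"<img_{c}>" for c in image_codes)
           + "<eoi>")
input_ids = tokenizer(example, return_tensors="pt").input_ids
# Standard causal-LM training and generation on `input_ids` then cover all
# modalities; generated modality tokens would be decoded back by the
# corresponding de-tokenizers.
```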