
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

February 19, 2024
Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
cs.AI

Abstract

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/
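To make the "data-level preprocessing" idea concrete, the sketch below illustrates how non-text modalities can be turned into discrete tokens and spliced into a single flat sequence for a standard LLM. It is a minimal illustration only: the toy quantizer, codebook sizes, vocabulary offsets, and boundary tokens (quantize_image, IMAGE_CODEBOOK_SIZE, BOS_IMG, etc.) are assumptions for exposition, not the tokenizers or token layout actually used by AnyGPT.

```python
# Minimal sketch of discrete-sequence multimodal preprocessing.
# Assumptions: the quantizer, codebook sizes, offsets, and special tokens
# below are illustrative placeholders, not the paper's exact scheme.
import numpy as np

TEXT_VOCAB_SIZE = 32_000       # base LLM text vocabulary size (assumed)
IMAGE_CODEBOOK_SIZE = 8_192    # hypothetical image tokenizer codebook
SPEECH_CODEBOOK_SIZE = 1_024   # hypothetical speech tokenizer codebook

# Non-text codes are shifted into dedicated id ranges appended to the text
# vocabulary and wrapped in boundary tokens -- analogous to adding a new
# "language" without touching the LLM architecture.
IMG_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = IMG_OFFSET + IMAGE_CODEBOOK_SIZE
BOS_IMG = SPEECH_OFFSET + SPEECH_CODEBOOK_SIZE
EOS_IMG = BOS_IMG + 1


def quantize_image(image: np.ndarray, n_codes: int = 256) -> np.ndarray:
    """Toy stand-in for a neural image tokenizer: returns discrete codebook ids."""
    rng = np.random.default_rng(0)  # placeholder codes; a real tokenizer would encode `image`
    return rng.integers(0, IMAGE_CODEBOOK_SIZE, size=n_codes)


def image_to_tokens(image: np.ndarray) -> list[int]:
    """Map an image to LLM token ids: shift codebook ids and add boundary tokens."""
    codes = quantize_image(image)
    return [BOS_IMG] + [int(c) + IMG_OFFSET for c in codes] + [EOS_IMG]


def build_sequence(text_ids: list[int], image: np.ndarray) -> list[int]:
    """Interleave text tokens and image tokens into one flat discrete sequence."""
    return text_ids + image_to_tokens(image)


if __name__ == "__main__":
    fake_text_ids = [12, 345, 6789]                      # pretend text-tokenizer output
    fake_image = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy image
    seq = build_sequence(fake_text_ids, fake_image)
    print(len(seq), seq[:8])  # one discrete sequence the unmodified LLM can model
```

On the output side, the same mapping would be inverted: generated ids falling in a modality's range are de-shifted and passed to that modality's decoder, which is what allows arbitrary combinations of multimodal inputs and outputs from a single token stream.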
