
NExT-GPT: Any-to-Any Multimodal LLM

September 11, 2023
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
cs.AI

Abstract

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained, highly performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%, in certain projection layers), which not only benefits low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
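
The core architectural idea in the abstract, keeping pretrained modality encoders, the LLM, and diffusion decoders frozen while training only small projection layers that bridge them, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the class names, dimensions, and placeholder backbone below are illustrative assumptions only.

```python
# Minimal sketch (not the official NExT-GPT code) of "tune only the projection
# layers": a frozen modality encoder feeds a trainable input projection into the
# (frozen) LLM space, and a trainable output projection produces a conditioning
# signal for a frozen diffusion decoder (not shown). All sizes are placeholders.

import torch
import torch.nn as nn


class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained modality encoder (e.g., an image encoder)."""

    def __init__(self, out_dim: int = 768):
        super().__init__()
        self.backbone = nn.Linear(1024, out_dim)  # placeholder for a real backbone
        for p in self.parameters():
            p.requires_grad = False  # frozen: contributes no trainable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)


class AnyToAnyStub(nn.Module):
    """Frozen encoder -> trainable input projection -> (frozen LLM) ->
    trainable output projection -> frozen diffusion decoder."""

    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096, dec_cond_dim: int = 1024):
        super().__init__()
        self.encoder = FrozenEncoder(enc_dim)
        # Only these two projections receive gradients. In the real system the
        # frozen encoders, LLM, and decoders dwarf them, which is how the paper
        # arrives at its ~1% trainable-parameter figure; with the tiny
        # placeholders here the printed ratio is of course much larger.
        self.input_proj = nn.Linear(enc_dim, llm_dim)
        self.output_proj = nn.Linear(llm_dim, dec_cond_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.input_proj(self.encoder(x))  # align modality features to LLM space
        hidden = tokens                            # placeholder for the frozen LLM forward pass
        return self.output_proj(hidden)            # conditioning signal for a frozen decoder


if __name__ == "__main__":
    model = AnyToAnyStub()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable / total parameters: {trainable} / {total} "
          f"({100 * trainable / total:.1f}%)")
```

Training in this setup would optimize only `input_proj` and `output_proj`, which is what makes expanding to an additional modality comparatively cheap: a new frozen encoder/decoder pair only requires a new pair of small projections.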