
NExT-GPT: Any-to-Any Multimodal LLM

September 11, 2023
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
cs.AI

Abstract

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained, highly performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%, in certain projection layers), which not only benefits low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
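
The core architectural idea in the abstract, keeping pretrained modality encoders, the LLM, and diffusion decoders frozen while training only small projection layers that bridge them, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the class names, dimensions, and placeholder backbone below are illustrative assumptions only.

```python
# Minimal sketch (not the official NExT-GPT code) of "tune only the projection
# layers": a frozen modality encoder feeds a trainable input projection into the
# (frozen) LLM space, and a trainable output projection produces a conditioning
# signal for a frozen diffusion decoder (not shown). All sizes are placeholders.

import torch
import torch.nn as nn


class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained modality encoder (e.g., an image encoder)."""

    def __init__(self, out_dim: int = 768):
        super().__init__()
        self.backbone = nn.Linear(1024, out_dim)  # placeholder for a real backbone
        for p in self.parameters():
            p.requires_grad = False  # frozen: contributes no trainable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)


class AnyToAnyStub(nn.Module):
    """Frozen encoder -> trainable input projection -> (frozen LLM) ->
    trainable output projection -> frozen diffusion decoder."""

    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096, dec_cond_dim: int = 1024):
        super().__init__()
        self.encoder = FrozenEncoder(enc_dim)
        # Only these two projections receive gradients. In the real system the
        # frozen encoders, LLM, and decoders dwarf them, which is how the paper
        # arrives at its ~1% trainable-parameter figure; with the tiny
        # placeholders here the printed ratio is of course much larger.
        self.input_proj = nn.Linear(enc_dim, llm_dim)
        self.output_proj = nn.Linear(llm_dim, dec_cond_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.input_proj(self.encoder(x))  # align modality features to LLM space
        hidden = tokens                            # placeholder for the frozen LLM forward pass
        return self.output_proj(hidden)            # conditioning signal for a frozen decoder


if __name__ == "__main__":
    model = AnyToAnyStub()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable / total parameters: {trainable} / {total} "
          f"({100 * trainable / total:.1f}%)")
```

Training in this setup would optimize only `input_proj` and `output_proj`, which is what makes expanding to an additional modality comparatively cheap: a new frozen encoder/decoder pair only requires a new pair of small projections.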