X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
May 7, 2023
Authors: Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable language abilities.
GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities
beyond previous visual language models. We attribute this to the use of more
advanced LLMs compared with previous multimodal models. Unfortunately, the
model architecture and training strategies of GPT-4 are unknown. To endow LLMs
with multimodal capabilities, we propose X-LLM, which converts Multi-modalities
(images, speech, videos) into foreign languages using X2L interfaces and inputs
them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple
frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X"
denotes multi-modalities such as images, speech, and video, and "L" denotes
languages. X-LLM's training consists of three stages: (1) Converting Multimodal
Information: The first stage trains each X2L interface to align with its
respective single-modal encoder separately to convert multimodal information
into languages. (2) Aligning X2L representations with the LLM: single-modal
encoders are aligned with the LLM through X2L interfaces independently. (3)
Integrating multiple modalities: all single-modal encoders are aligned with the
LLM through X2L interfaces to integrate multimodal capabilities into the LLM.
Our experiments show that X-LLM demonstrates impressive multimodal chat
abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen
images/instructions, and yields an 84.5% relative score compared with GPT-4 on
a synthetic multimodal instruction-following dataset. We also conduct
quantitative tests on using LLM for ASR and multimodal ASR, hoping to promote
the era of LLM-based speech recognition.
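
To make the X2L idea concrete, the following is a minimal sketch (not the authors' code, whose details are described in the paper itself): a small trainable adapter turns the output of a frozen single-modal encoder into a short sequence of pseudo-token embeddings that a frozen LLM could consume. The class name X2LInterface, the query-based cross-attention design, and dimensions such as d_enc, d_llm, and num_query are illustrative assumptions, not the published architecture or hyperparameters.

```python
# Illustrative sketch of an X2L-style alignment module (assumed design, not the
# authors' implementation): only the interface is trained; the single-modal
# encoder and the LLM stay frozen.
import torch
import torch.nn as nn

class X2LInterface(nn.Module):
    """Trainable bridge: modality features -> pseudo-token embeddings
    in the (frozen) LLM's input embedding space."""
    def __init__(self, d_enc: int, d_llm: int, num_query: int = 32, n_heads: int = 8):
        super().__init__()
        # Learnable query vectors that attend over the encoder output.
        self.query = nn.Parameter(torch.randn(num_query, d_enc) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_enc, n_heads, batch_first=True)
        self.proj = nn.Linear(d_enc, d_llm)  # map into the LLM embedding space

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, seq_enc, d_enc) from a frozen image/speech/video encoder.
        q = self.query.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, enc_feats, enc_feats)
        return self.proj(attended)  # (batch, num_query, d_llm)

# Hypothetical usage with stand-in modules; X-LLM itself uses pretrained
# single-modal encoders and ChatGLM, both kept frozen during alignment.
frozen_encoder = nn.Linear(512, 768)   # stand-in for a frozen image encoder
for p in frozen_encoder.parameters():
    p.requires_grad = False

interface = X2LInterface(d_enc=768, d_llm=4096)          # 4096 is an assumed LLM width
image_feats = frozen_encoder(torch.randn(2, 196, 512))   # (batch, patches, d_enc)
llm_inputs = interface(image_feats)                      # (2, 32, 4096) pseudo-tokens
print(llm_inputs.shape)
```

Under this reading, the three training stages of the abstract correspond to (1) training each such interface against its own encoder, (2) aligning each interface's outputs with the frozen LLM independently, and (3) training with all interfaces together so the LLM receives multiple modalities at once.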