X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
May 7, 2023
Authors: Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as images, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting multimodal information: the first stage trains each X2L interface separately to align with its respective single-modal encoder, converting multimodal information into languages. (2) Aligning X2L representations with the LLM: the single-modal encoders are aligned with the LLM through the X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through the X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images and instructions, and achieves an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using LLMs for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.
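To make the "X2L interface" idea concrete, below is a minimal sketch of how a trainable adapter could map features from a frozen single-modal encoder into a short sequence of pseudo-"language" tokens in a frozen LLM's embedding space. This is an illustrative assumption, not the paper's actual implementation: the class name `X2LInterface`, the query/cross-attention design, and all dimensions are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class X2LInterface(nn.Module):
    """Hypothetical adapter: compresses frozen single-modal features
    (e.g. image patch embeddings) into a fixed number of tokens
    projected into the LLM's embedding space. Only this module would
    be trained; the encoder and the LLM stay frozen."""

    def __init__(self, modal_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that attend over the modal features.
        self.queries = nn.Parameter(torch.randn(num_queries, modal_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(modal_dim, num_heads, batch_first=True)
        # Projection into the LLM's token-embedding dimension.
        self.proj = nn.Linear(modal_dim, llm_dim)

    def forward(self, modal_feats):
        # modal_feats: (batch, seq_len, modal_dim) from a frozen encoder.
        q = self.queries.unsqueeze(0).expand(modal_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, modal_feats, modal_feats)
        return self.proj(attended)  # (batch, num_queries, llm_dim)


# Usage sketch: the output would be prepended to the LLM's text
# embeddings as if it were a sentence in a "foreign language".
image_feats = torch.randn(2, 257, 1024)   # stand-in for frozen ViT output
prefix_tokens = X2LInterface()(image_feats)
```

In this sketch, the three training stages described in the abstract would differ only in what is optimized and against which objective (interface-to-encoder alignment, interface-to-LLM alignment, then joint multimodal integration), with the encoders and the LLM frozen throughout.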