X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Abstract

Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond those of previous visual language models. We attribute this to its use of more advanced LLMs than previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multiple modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes a modality such as image, speech, or video, and "L" denotes language. Training X-LLM consists of three stages: (1) Converting multimodal information: each X2L interface is trained separately to align with its respective single-modal encoder, converting multimodal information into language. (2) Aligning X2L representations with the LLM: each single-modal encoder is aligned with the LLM independently through its X2L interface. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through the X2L interfaces, integrating multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images and instructions, and achieves an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using an LLM for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.
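
The three-stage schedule described in the abstract can be laid out as a small Python sketch. This only illustrates the ordering of the stages, not the authors' implementation: the names `TrainingRun` and `training_plan` are hypothetical, and the sketch assumes that only the X2L interfaces are updated in every stage while the single-modal encoders and the LLM stay frozen, which is how the abstract reads.

```python
from dataclasses import dataclass

# Modalities named in the abstract.
MODALITIES = ("image", "speech", "video")


@dataclass(frozen=True)
class TrainingRun:
    """One training run in the schedule (hypothetical structure)."""
    stage: int
    # X2L interfaces updated in this run. Assumption: the single-modal
    # encoders and the LLM are frozen throughout.
    trainable_interfaces: tuple
    # What the interfaces are aligned with in this run.
    aligned_with: str


def training_plan(modalities=MODALITIES):
    """Return the runs implied by the three stages, in order."""
    plan = []
    # Stage 1: each X2L interface is trained separately against its own encoder.
    for m in modalities:
        plan.append(TrainingRun(1, (m,), f"{m} encoder"))
    # Stage 2: each interface is aligned with the LLM, one modality at a time.
    for m in modalities:
        plan.append(TrainingRun(2, (m,), "LLM"))
    # Stage 3: all interfaces are aligned with the LLM together.
    plan.append(TrainingRun(3, tuple(modalities), "LLM"))
    return plan


if __name__ == "__main__":
    for run in training_plan():
        print(run.stage, run.trainable_interfaces, run.aligned_with)
```

Under these assumptions the plan has one run per modality in each of the first two stages, followed by a single joint run in the third.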
