TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

Abstract

Although Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they still struggle to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl), an approach that treats the input from any modality as a token sequence and learns a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with an off-the-shelf tokenizer and then embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs then only need to predict the multi-modal tokens autoregressively, as textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality from the predicted token sequence. With the joint embedding space, TEAL enables frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. The textual LLM can thus work simply as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding and implements a simple scheme for multi-modal generation.
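
The pipeline described in the abstract (tokenize each modality, embed all tokens through one learnable matrix, de-tokenize the predictions) can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary sizes, the character-level text tokenizer, the uniform image quantizer, and every function name below are hypothetical stand-ins chosen only to make the data flow concrete. The autoregressive LLM step is omitted.

```python
import random

# Hypothetical sizes; the abstract does not specify any of these numbers.
TEXT_VOCAB = 100   # ids 0..99 are reserved for text tokens
IMAGE_VOCAB = 50   # ids 100..149 are reserved for image tokens
EMBED_DIM = 8


def tokenize_text(text):
    """Stand-in text tokenizer: map each character to an id in [0, TEXT_VOCAB)."""
    return [ord(ch) % TEXT_VOCAB for ch in text]


def tokenize_image(pixels):
    """Stand-in image tokenizer: quantize values in [0, 1] into discrete codes,
    then shift them past the text ids so both modalities share one id space."""
    codes = [min(int(p * IMAGE_VOCAB), IMAGE_VOCAB - 1) for p in pixels]
    return [TEXT_VOCAB + c for c in codes]


def detokenize_image(token_ids):
    """Stand-in de-tokenizer: map image token ids back to the bin centers."""
    return [((t - TEXT_VOCAB) + 0.5) / IMAGE_VOCAB for t in token_ids]


def make_embedding_matrix(vocab_size, dim, seed=0):
    """One matrix covering all modalities; in training this would be learnable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, matrix):
    """Look up one embedding row per token id."""
    return [matrix[t] for t in token_ids]


joint_embeddings = make_embedding_matrix(TEXT_VOCAB + IMAGE_VOCAB, EMBED_DIM)

text_ids = tokenize_text("a cat")
image_ids = tokenize_image([0.0, 0.25, 0.5, 0.99])

# A single interleaved token sequence is what an (omitted) LLM would consume.
sequence = text_ids + image_ids
vectors = embed(sequence, joint_embeddings)

# Predicted image tokens would be passed to the de-tokenizer; here we simply
# round-trip the input tokens to show the inverse mapping.
reconstructed = detokenize_image(image_ids)
```

Giving each modality its own id range within one shared table is one simple way to realize a joint embedding space; whether TEAL lays out its vocabulary this way is an assumption of this sketch.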
