TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
November 8, 2023
Authors: Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
cs.AI
Abstract
Although Multi-modal Large Language Models (MM-LLMs) have made exciting
strides recently, they still struggle to efficiently model the
interactions among multi-modal inputs and generation in non-textual
modalities. In this work, we propose TEAL (Tokenize and Embed ALL), an
approach that treats the input from any modality as a token sequence and learns a
joint embedding space for all modalities. Specifically, for the input from any
modality, TEAL first discretizes it into a token sequence with an
off-the-shelf tokenizer and embeds the token sequence into a joint embedding
space with a learnable embedding matrix. MM-LLMs just need to predict the
multi-modal tokens autoregressively as the textual LLMs do. Finally, the
corresponding de-tokenizer is applied to generate the output in each modality
based on the predicted token sequence. With the joint embedding space, TEAL
enables the frozen LLMs to perform both understanding and generation tasks
involving non-textual modalities, such as image and audio. Thus, the textual
LLM can just work as an interface and maintain its high performance in textual
understanding and generation. Experiments show that TEAL achieves substantial
improvements in multi-modal understanding, and implements a simple scheme for
multi-modal generation.
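The pipeline the abstract describes can be illustrated with a minimal sketch: each modality's tokens are offset into a disjoint range of one joint vocabulary, a single shared embedding matrix maps ids into the joint space, and the resulting sequence is what a frozen LLM would consume before a per-modality de-tokenizer decodes its predictions. The vocabulary sizes, modality names, and helper functions below are illustrative stand-ins, not TEAL's actual configuration:

```python
import random

# Hypothetical vocabulary sizes for text tokens plus discrete image/audio codes.
# (TEAL's real tokenizers and vocabulary sizes are not specified here.)
TEXT_VOCAB, IMAGE_VOCAB, AUDIO_VOCAB = 1000, 512, 256
JOINT_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB
EMBED_DIM = 8

random.seed(0)
# One learnable embedding matrix shared by all modalities: the joint space.
joint_embedding = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
                   for _ in range(JOINT_VOCAB)]

def to_joint_ids(token_ids, modality):
    """Offset each modality's token ids into a disjoint joint-vocab range."""
    offset = {"text": 0,
              "image": TEXT_VOCAB,
              "audio": TEXT_VOCAB + IMAGE_VOCAB}[modality]
    return [t + offset for t in token_ids]

def embed(joint_ids):
    """Look up each token's vector in the shared embedding matrix."""
    return [joint_embedding[i] for i in joint_ids]

# Stub outputs of off-the-shelf tokenizers (e.g. BPE for text, VQ codes
# for images); real discretizers would produce these from raw inputs.
text_tokens = [5, 42, 7]
image_tokens = [17, 300, 9]

# Concatenate the modalities into one sequence. The frozen LLM would embed
# this, predict further joint-vocabulary tokens autoregressively, and a
# per-modality de-tokenizer would decode the predicted ids back into text,
# pixels, or audio.
sequence = to_joint_ids(text_tokens, "text") + to_joint_ids(image_tokens, "image")
embedded = embed(sequence)
print(len(embedded), len(embedded[0]))  # 6 8
```

Keeping each modality in a disjoint id range means a single output softmax over the joint vocabulary suffices: the predicted id itself identifies which de-tokenizer should decode it.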