TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
November 8, 2023
Authors: Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
cs.AI
Abstract
Although Multi-modal Large Language Models (MM-LLMs) have made exciting
strides recently, they still struggle to efficiently model the
interactions among multi-modal inputs and generation in non-textual
modalities. In this work, we propose TEAL (Tokenize and Embed ALL), an
approach that treats the input from any modality as a token sequence and learns a
joint embedding space for all modalities. Specifically, for the input from any
modality, TEAL first discretizes it into a token sequence with an
off-the-shelf tokenizer and embeds the token sequence into a joint embedding
space with a learnable embedding matrix. MM-LLMs just need to predict the
multi-modal tokens autoregressively as the textual LLMs do. Finally, the
corresponding de-tokenizer is applied to generate the output in each modality
based on the predicted token sequence. With the joint embedding space, TEAL
enables the frozen LLMs to perform both understanding and generation tasks
involving non-textual modalities, such as image and audio. Thus, the textual
LLM can just work as an interface and maintain its high performance in textual
understanding and generation. Experiments show that TEAL achieves substantial
improvements in multi-modal understanding, and implements a simple scheme for
multi-modal generation.
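The pipeline the abstract describes can be illustrated with a minimal sketch: each modality's tokens are offset into a disjoint range of one joint vocabulary, a single shared embedding matrix maps ids into the joint space, and the resulting sequence is what a frozen LLM would consume before a per-modality de-tokenizer decodes its predictions. The vocabulary sizes, modality names, and helper functions below are illustrative stand-ins, not TEAL's actual configuration:

```python
import random

# Hypothetical vocabulary sizes for text tokens plus discrete image/audio codes.
# (TEAL's real tokenizers and vocabulary sizes are not specified here.)
TEXT_VOCAB, IMAGE_VOCAB, AUDIO_VOCAB = 1000, 512, 256
JOINT_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB
EMBED_DIM = 8

random.seed(0)
# One learnable embedding matrix shared by all modalities: the joint space.
joint_embedding = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
                   for _ in range(JOINT_VOCAB)]

def to_joint_ids(token_ids, modality):
    """Offset each modality's token ids into a disjoint joint-vocab range."""
    offset = {"text": 0,
              "image": TEXT_VOCAB,
              "audio": TEXT_VOCAB + IMAGE_VOCAB}[modality]
    return [t + offset for t in token_ids]

def embed(joint_ids):
    """Look up each token's vector in the shared embedding matrix."""
    return [joint_embedding[i] for i in joint_ids]

# Stub outputs of off-the-shelf tokenizers (e.g. BPE for text, VQ codes
# for images); real discretizers would produce these from raw inputs.
text_tokens = [5, 42, 7]
image_tokens = [17, 300, 9]

# Concatenate the modalities into one sequence. The frozen LLM would embed
# this, predict further joint-vocabulary tokens autoregressively, and a
# per-modality de-tokenizer would decode the predicted ids back into text,
# pixels, or audio.
sequence = to_joint_ids(text_tokens, "text") + to_joint_ids(image_tokens, "image")
embedded = embed(sequence)
print(len(embedded), len(embedded[0]))  # 6 8
```

Keeping each modality in a disjoint id range means a single output softmax over the joint vocabulary suffices: the predicted id itself identifies which de-tokenizer should decode it.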