

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

August 20, 2024
Authors: Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy
cs.AI

Abstract

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
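To make the recipe concrete, below is a minimal, illustrative PyTorch sketch of the core idea the abstract describes: a single transformer trained with a next-token cross-entropy loss on discrete text positions and a noise-prediction (diffusion-style) loss on continuous image-patch positions in the same mixed-modality sequence. This is not the authors' implementation; the class name `TransfusionToy`, the tensor shapes, the single-step noising, and the loss weight `lambda_diffusion` are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransfusionToy(nn.Module):
    """Illustrative sketch: one transformer, two losses (not the official implementation)."""

    def __init__(self, vocab_size=32000, d_model=512, patch_dim=256, n_layers=4, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # discrete text tokens
        self.patch_in = nn.Linear(patch_dim, d_model)         # continuous image patches (assumed pre-encoded)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)          # next-token prediction head
        self.noise_head = nn.Linear(d_model, patch_dim)        # diffusion noise-prediction head

    def forward(self, text_ids, image_patches, noise_scale=0.5):
        # Add Gaussian noise to the image patches (a stand-in for a proper diffusion schedule).
        noise = torch.randn_like(image_patches)
        noisy_patches = image_patches + noise_scale * noise

        # Build one mixed-modality sequence: [text tokens | noisy image patches].
        seq = torch.cat([self.token_emb(text_ids), self.patch_in(noisy_patches)], dim=1)
        hidden = self.backbone(seq)

        n_text = text_ids.size(1)
        text_hidden, image_hidden = hidden[:, :n_text], hidden[:, n_text:]

        # Language-modeling loss: predict token t+1 from position t.
        logits = self.lm_head(text_hidden[:, :-1])
        lm_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1)
        )

        # Diffusion-style loss: predict the injected noise at image positions.
        diff_loss = F.mse_loss(self.noise_head(image_hidden), noise)

        lambda_diffusion = 5.0  # assumed weighting; the paper's balancing coefficient may differ
        return lm_loss + lambda_diffusion * diff_loss
```

The actual recipe also involves elements omitted from this sketch, such as causal attention over text combined with bidirectional attention within each image, a full diffusion noise schedule with timestep conditioning, and the modality-specific patch encoding/decoding layers mentioned in the abstract.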
