

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

August 20, 2024
Authors: Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy
cs.AI

Abstract

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
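The core of the recipe described above is a single transformer trained with two losses at once: next-token cross-entropy on discrete text positions and a diffusion (noise-prediction) loss on continuous image patches. The following is a minimal sketch of such a combined objective; the function names, shapes, and the `lambda_img` balancing weight are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Next-token prediction loss over discrete text tokens.

    logits: (N, V) unnormalized scores, targets: (N,) integer token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def transfusion_loss(text_logits, text_targets, noise_pred, noise, lambda_img=1.0):
    """Combined objective: LM loss on text + diffusion MSE on image patches.

    noise_pred / noise: (P, D) predicted vs. true noise on continuous patches.
    lambda_img is an assumed weighting hyperparameter between the two terms.
    """
    lm_loss = softmax_cross_entropy(text_logits, text_targets)
    diff_loss = np.mean((noise_pred - noise) ** 2)
    return lm_loss + lambda_img * diff_loss
```

In a real training loop both terms would come from the same transformer forward pass over one mixed-modality sequence, with each loss applied only at the positions of its modality.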

