Emu3: Next-Token Prediction is All You Need
September 27, 2024
Authors: Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
cs.AI
Abstract
While next-token prediction is considered a promising path towards artificial
general intelligence, it has struggled to excel in multimodal tasks, which are
still dominated by diffusion models (e.g., Stable Diffusion) and compositional
approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a
new suite of state-of-the-art multimodal models trained solely with next-token
prediction. By tokenizing images, text, and videos into a discrete space, we
train a single transformer from scratch on a mixture of multimodal sequences.
Emu3 outperforms several well-established task-specific models in both
generation and perception tasks, surpassing flagship models such as SDXL and
LLaVA-1.6, while eliminating the need for diffusion or compositional
architectures. Emu3 is also capable of generating high-fidelity video via
predicting the next token in a video sequence. We simplify complex multimodal
model designs by converging on a singular focus: tokens, unlocking great
potential for scaling both during training and inference. Our results
demonstrate that next-token prediction is a promising path towards building
general multimodal intelligence beyond language. We open-source key techniques
and models to support further research in this direction.
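To make the core idea concrete, below is a minimal sketch of the training objective the abstract describes: text, image, and video content are all mapped to ids in one shared discrete vocabulary, and a single decoder-only transformer is trained with a plain next-token (cross-entropy) loss over the mixed sequence. The vocabulary sizes, model dimensions, and random token ids are illustrative assumptions, not Emu3's actual tokenizer or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed, illustrative sizes: Emu3's real tokenizer vocabularies and
# model dimensions are not specified in the abstract.
TEXT_VOCAB = 32000
VISION_VOCAB = 32768
VOCAB_SIZE = TEXT_VOCAB + VISION_VOCAB  # one shared discrete space


class TinyNextTokenModel(nn.Module):
    """A minimal decoder-only transformer trained purely with next-token prediction."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: each position may attend only to earlier positions.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        return self.lm_head(self.blocks(x, mask=causal))


# A mixed multimodal sequence: text-token ids followed by vision-token ids,
# all drawn from the same vocabulary (the ids here are random placeholders).
tokens = torch.randint(0, VOCAB_SIZE, (2, 128))

model = TinyNextTokenModel(VOCAB_SIZE)
logits = model(tokens[:, :-1])  # predict token t+1 from tokens 0..t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

Under this framing, image or video generation is simply autoregressive sampling of vision-token ids from the same model, which are then decoded back to pixels by the (separate) visual tokenizer.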