Emerging Properties in Unified Multimodal Pretraining
May 20, 2025
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
cs.AI
Abstract
Unifying multimodal understanding and generation has shown impressive
capabilities in cutting-edge proprietary systems. In this work, we introduce
BAGEL, an open-source foundational model that natively supports multimodal
understanding and generation. BAGEL is a unified, decoder-only model pretrained
on trillions of tokens curated from large-scale interleaved text, image, video,
and web data. When scaled with such diverse multimodal interleaved data, BAGEL
exhibits emerging capabilities in complex multimodal reasoning. As a result, it
significantly outperforms open-source unified models in both multimodal
generation and understanding across standard benchmarks, while exhibiting
advanced multimodal reasoning abilities such as free-form image manipulation,
future frame prediction, 3D manipulation, and world navigation. In the hope of
facilitating further opportunities for multimodal research, we share the key
findings, pretraining details, data creation protocol, and release our code and
checkpoints to the community. The project page is at https://bagel-ai.org/.