

Emerging Properties in Unified Multimodal Pretraining

May 20, 2025
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
cs.AI

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/.
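The abstract describes BAGEL as a unified, decoder-only model pretrained on interleaved text, image, and video tokens. As a rough illustration of that general design, the PyTorch sketch below embeds text tokens and projected image-patch features into one shared sequence and runs causal self-attention over it. This is a minimal sketch of the idea only: every class name, dimension, and the patch-projection scheme are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class UnifiedDecoderSketch(nn.Module):
    """Toy decoder-only model over one interleaved stream of text tokens
    and image-patch embeddings (illustrative only, not BAGEL's design)."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=4, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project visual patch features into the same token space as text,
        # so both modalities flow through the same decoder stack.
        self.image_proj = nn.Linear(patch_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_patches):
        # Concatenate one text segment and one image segment into a single
        # sequence; real interleaved data alternates many such segments.
        seq = torch.cat([self.text_embed(text_ids),
                         self.image_proj(image_patches)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.blocks(seq, mask=causal)   # causal (decoder-only) attention
        return self.lm_head(h)              # next-token logits over text vocab

# Usage: 16 text tokens followed by 9 ViT-style patch features.
model = UnifiedDecoderSketch()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 9, 768))
print(logits.shape)  # torch.Size([1, 25, 32000])
```

The design point the abstract emphasizes is that a single causal transformer stack handles both modalities for understanding and generation, rather than routing them through separate towers; the sketch above only gestures at that shared-sequence formulation.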
