

Emu3.5: Native Multimodal Models are World Learners

October 30, 2025
Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
cs.AI

Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
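The claimed ~20x per-image speedup from DiDA comes from replacing one forward pass per generated token with a small, fixed number of bidirectional passes that each predict many masked tokens at once. The toy sketch below is not the authors' implementation; it only counts model forward passes under hypothetical numbers (tokens per image, refinement steps) to show where the acceleration originates. Actual wall-clock speedup is lower than the raw pass-count ratio, since each bidirectional pass is more expensive than a single-token decoding step.

```python
# Toy accounting sketch for sequential vs. parallel decoding.
# All constants are illustrative assumptions, not Emu3.5 internals.

def autoregressive_passes(num_image_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_image_tokens

def parallel_passes(num_refinement_steps: int) -> int:
    """Discrete-diffusion-style decoding: each bidirectional pass
    predicts many masked tokens at once, so the pass count depends
    only on the number of refinement steps, not the token count."""
    return num_refinement_steps

tokens_per_image = 4096   # hypothetical discrete image tokens
refinement_steps = 16     # hypothetical parallel refinement steps

ratio = autoregressive_passes(tokens_per_image) / parallel_passes(refinement_steps)
print(ratio)  # 256.0 — forward-pass reduction in this toy setting
```

Because a bidirectional pass attends over the whole sequence, its per-step cost exceeds a cached single-token step, which is why the reported end-to-end speedup (~20x) is far below the pass-count ratio in a sketch like this.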