Emu3.5: Native Multimodal Models are World Learners
October 30, 2025
Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
cs.AI
Abstract
We introduce Emu3.5, a large-scale multimodal world model that natively
predicts the next state across vision and language. Emu3.5 is pre-trained
end-to-end with a unified next-token prediction objective on a corpus of
vision-language interleaved data containing over 10 trillion tokens, primarily
derived from sequential frames and transcripts of internet videos. The model
naturally accepts interleaved vision-language inputs and generates interleaved
vision-language outputs. Emu3.5 is further post-trained with large-scale
reinforcement learning to enhance multimodal reasoning and generation. To
improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA),
which converts token-by-token decoding into bidirectional parallel prediction,
accelerating per-image inference by about 20x without sacrificing performance.
Emu3.5 exhibits strong native multimodal capabilities, including long-horizon
vision-language generation, any-to-image (X2I) generation, and complex
text-rich image generation. It also demonstrates generalizable world-modeling
abilities, enabling spatiotemporally consistent world exploration and
open-world embodied manipulation across diverse scenarios and tasks. In
evaluations, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image
(Nano Banana) on image generation and editing tasks and achieves superior
results on a suite of interleaved generation tasks. We open-source Emu3.5 at
https://github.com/baaivision/Emu3.5 to support community research.
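The source of DiDA's speedup can be illustrated with a toy sketch: autoregressive decoding needs one forward pass per image token, whereas parallel bidirectional prediction needs one forward pass per refinement step, regardless of how many tokens the image has. All names below (`toy_predict`, the token counts) are hypothetical stand-ins, not Emu3.5's actual implementation; this is a minimal sketch of the decoding-regime contrast only.

```python
# Toy contrast between token-by-token decoding and bidirectional
# parallel prediction. The model call is a deterministic stand-in;
# real image sequences use thousands of tokens, not 8.

VOCAB = 16          # toy vocabulary size (hypothetical)
IMAGE_TOKENS = 8    # toy "image" length (hypothetical)

def toy_predict(context, position):
    """Stand-in for a model forward pass at one position."""
    return (sum(context) + position) % VOCAB

def autoregressive_decode(prompt):
    """Token-by-token decoding: one model pass per generated token."""
    tokens = list(prompt)
    for pos in range(IMAGE_TOKENS):
        tokens.append(toy_predict(tokens, pos))
    # Returns the generated tokens and the number of forward passes.
    return tokens[len(prompt):], IMAGE_TOKENS

def parallel_refine(prompt, steps=2):
    """Bidirectional parallel prediction: every position is re-predicted
    from the full (bidirectional) context on each step, so the pass
    count is `steps`, independent of IMAGE_TOKENS."""
    tokens = [0] * IMAGE_TOKENS  # initial "masked" canvas
    for _ in range(steps):
        context = list(prompt) + tokens
        tokens = [toy_predict(context, pos) for pos in range(IMAGE_TOKENS)]
    return tokens, steps
```

With thousands of image tokens and a small number of refinement steps, the ratio of forward passes explains how a roughly 20x per-image speedup is possible; the quality-preservation claim depends on the adaptation training described in the paper, not on this counting argument.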