Beyond Language Modeling: An Exploration of Multimodal Pretraining

March 3, 2026
Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
cs.AI

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
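To make the described setup concrete, here is a minimal, illustrative sketch of a Transfusion-style training step: a single shared backbone processes both modalities, with next-token cross-entropy applied to text tokens and a diffusion-style noise-prediction loss applied to visual latents. This is not the authors' implementation; the module names, shapes, and the simplified linear-interpolation noise schedule below are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's code): a Transfusion-style training step that
# combines next-token prediction for text with a diffusion (noise-prediction) loss
# for visual latents, both computed by one shared Transformer trunk.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyUnifiedBackbone(nn.Module):
    """Stand-in for the shared Transformer that both modalities pass through."""

    def __init__(self, vocab_size=32000, d_model=256, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token logits for text
        self.noise_head = nn.Linear(d_model, latent_dim)  # predicted noise for vision

    def forward(self, text_ids, noisy_latents):
        # Concatenate text embeddings and (noised) visual latents into one sequence.
        h = torch.cat([self.text_embed(text_ids), self.latent_proj(noisy_latents)], dim=1)
        h = self.trunk(h)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.noise_head(h[:, n_text:])


def transfusion_style_loss(model, text_ids, clean_latents, lambda_diff=1.0):
    """Sum of next-token cross-entropy (language) and noise-prediction MSE (vision)."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)          # per-sample noise level in [0, 1)
    noisy_latents = (1 - t) * clean_latents + t * noise   # simplified interpolation schedule
    logits, pred_noise = model(text_ids[:, :-1], noisy_latents)
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
    diff_loss = F.mse_loss(pred_noise, noise)
    return lm_loss + lambda_diff * diff_loss


if __name__ == "__main__":
    model = TinyUnifiedBackbone()
    text = torch.randint(0, 32000, (2, 17))               # toy batch: 2 sequences of 17 tokens
    latents = torch.randn(2, 64, 16)                      # toy batch: 64 visual latent tokens each
    loss = transfusion_style_loss(model, text, latents)
    loss.backward()
    print(float(loss))
```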