Beyond Language Modeling: An Exploration of Multimodal Pretraining

March 3, 2026
Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
cs.AI

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
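To make the described setup concrete, here is a minimal, illustrative sketch of a Transfusion-style training step: a single shared backbone processes both modalities, with next-token cross-entropy applied to text tokens and a diffusion-style noise-prediction loss applied to visual latents. This is not the authors' implementation; the module names, shapes, and the simplified linear-interpolation noise schedule below are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's code): a Transfusion-style training step that
# combines next-token prediction for text with a diffusion (noise-prediction) loss
# for visual latents, both computed by one shared Transformer trunk.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyUnifiedBackbone(nn.Module):
    """Stand-in for the shared Transformer that both modalities pass through."""

    def __init__(self, vocab_size=32000, d_model=256, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token logits for text
        self.noise_head = nn.Linear(d_model, latent_dim)  # predicted noise for vision

    def forward(self, text_ids, noisy_latents):
        # Concatenate text embeddings and (noised) visual latents into one sequence.
        h = torch.cat([self.text_embed(text_ids), self.latent_proj(noisy_latents)], dim=1)
        h = self.trunk(h)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.noise_head(h[:, n_text:])


def transfusion_style_loss(model, text_ids, clean_latents, lambda_diff=1.0):
    """Sum of next-token cross-entropy (language) and noise-prediction MSE (vision)."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)          # per-sample noise level in [0, 1)
    noisy_latents = (1 - t) * clean_latents + t * noise   # simplified interpolation schedule
    logits, pred_noise = model(text_ids[:, :-1], noisy_latents)
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
    diff_loss = F.mse_loss(pred_noise, noise)
    return lm_loss + lambda_diff * diff_loss


if __name__ == "__main__":
    model = TinyUnifiedBackbone()
    text = torch.randint(0, 32000, (2, 17))               # toy batch: 2 sequences of 17 tokens
    latents = torch.randn(2, 64, 16)                      # toy batch: 64 visual latent tokens each
    loss = transfusion_style_loss(model, text, latents)
    loss.backward()
    print(float(loss))
```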