Beyond Language Modeling: An Exploration of Multimodal Pretraining
March 3, 2026
Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
cs.AI
Abstract
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) a Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) a Mixture-of-Experts (MoE) architecture enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
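The combined training setup described above (next-token prediction for text, diffusion for vision) can be sketched as a single objective. The formulation below is a minimal illustration in the style of Transfusion, not the paper's exact loss: the balancing coefficient λ and the epsilon-prediction parameterization of the diffusion term are assumptions.

```latex
% Sketch of a Transfusion-style combined objective (lambda and the epsilon
% parameterization are illustrative assumptions, not the paper's exact choice).
\mathcal{L}
  = \underbrace{\mathbb{E}_{x}\Big[-\textstyle\sum_{t} \log p_\theta\big(x_t \mid x_{<t}\big)\Big]}_{\text{next-token prediction on text tokens}}
  \;+\; \lambda\,
    \underbrace{\mathbb{E}_{z_0,\,\epsilon,\,\tau}\Big[\big\|\epsilon - \epsilon_\theta\big(z_\tau, \tau, c\big)\big\|^2\Big]}_{\text{denoising diffusion on visual latents}}
```

Here \(z_0\) denotes the visual latents (e.g., RAE encodings), \(z_\tau\) their noised version at diffusion timestep \(\tau\), and \(c\) the surrounding text/context conditioning.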
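The IsoFLOP analysis mentioned in the abstract can likewise be illustrated with a small sketch: for a fixed compute budget, sweep model size, derive the token budget from the budget, and fit a parabola in log model size to the measured losses to read off the compute-optimal point. The C ≈ 6ND approximation, the budgets, and the loss values below are placeholders for illustration and are not taken from the paper.

```python
# Illustrative IsoFLOP sweep (not the paper's code): for a fixed compute budget
# C ~= 6 * N * D, vary model size N, derive the token budget D = C / (6N), and
# fit a parabola in log N to the measured final losses to locate the
# compute-optimal model size N*.
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit loss = a*(log N)^2 + b*log N + c and return the argmin N*."""
    log_n = np.log(np.asarray(model_sizes, dtype=float))
    a, b, c = np.polyfit(log_n, np.asarray(losses, dtype=float), deg=2)
    return float(np.exp(-b / (2.0 * a)))  # vertex of the fitted parabola

compute_budget = 1e20                                  # total FLOPs on this curve
model_sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])     # parameter counts N
token_budgets = compute_budget / (6.0 * model_sizes)   # tokens D per run

# Placeholder end-of-run losses (hypothetical numbers, not measured results).
losses = np.array([3.10, 2.85, 2.72, 2.78, 2.95])

n_star = isoflop_optimum(model_sizes, losses)
print(f"compute-optimal model size ~ {n_star:.3e} parameters")
print(f"corresponding token budget  ~ {compute_budget / (6.0 * n_star):.3e} tokens")
```

Running the same fit separately on language-only and vision-only (or vision-heavy) training mixes is one way a scaling asymmetry such as the one reported would surface, as different compute-optimal parameter/token ratios for the two modalities.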